Chip Errors Are Becoming More Common and Harder to Track Down – The New York Times
The outages have had several causes, like programming mistakes and congestion on the networks. But there is growing anxiety that as cloud-computing networks have become larger and more complex, they are still dependent, at the most basic level, on computer chips that are now less reliable and, in some cases, less predictable.
In the past year, researchers at both Facebook and Google have published studies describing computer hardware failures whose causes have not been easy to identify. The problem, they argued, was not in the software — it was somewhere in the computer hardware made by various companies. Google declined to comment on its study, while Facebook did not return requests for comment on its study.
This again supports my argument: run critical workloads on dependable, on-premises hardware and servers. If your critical business is too dependent on the cloud, outages will eventually cause problems.
“They’re seeing these silent errors, essentially coming from the underlying hardware,” said Subhasish Mitra, a Stanford University electrical engineer who specializes in testing computer hardware. Increasingly, Dr. Mitra said, people believe that manufacturing defects are tied to these so-called silent errors that cannot be easily caught.

Just in case you want to read the report from AMD:

The way out? Possibly a new layer of software to monitor the hardware:
Computer engineers are divided over how to respond to the challenge. One widespread response is demand for new kinds of software that proactively watch for hardware errors and make it possible for system operators to remove hardware when it begins to degrade. That has created an opportunity for new start-ups offering software that monitors the health of the underlying chips in data centers.
Why not improve the chip design rather than add another layer of complexity? I guess the NYT just did a product placement, so I haven’t bothered to link to the startup mentioned in the article. Nevertheless, I have linked the article here to understand and highlight the folly of relying too much on the cloud.