Chip errors responsible for outages

Chip Errors Are Becoming More Common and Harder to Track Down – The New York Times

The outages have had several causes, like programming mistakes and congestion on the networks. But there is growing anxiety that as cloud-computing networks have become larger and more complex, they are still dependent, at the most basic level, on computer chips that are now less reliable and, in some cases, less predictable.

In the past year, researchers at both Facebook and Google have published studies describing computer hardware failures whose causes have not been easy to identify. The problem, they argued, was not in the software; it was somewhere in the computer hardware made by various companies. Google declined to comment on its study, while Facebook did not return requests for comment on its study.

This again goes in favour of my argument: run critical workloads on dependable, on-premises hardware and servers. If your critical business is too dependent on the cloud, outages will hurt you.

“They’re seeing these silent errors, essentially coming from the underlying hardware,” said Subhasish Mitra, a Stanford University electrical engineer who specializes in testing computer hardware. Increasingly, Dr. Mitra said, people believe that manufacturing defects are tied to these so-called silent errors that cannot be easily caught.

Facebook’s data center in Prineville, Ore. Large data centers have experienced outages that may be partly the result of chip errors.
Find the tiny error that’s caused the million-dollar outage here.

Just in case you want to read the report from AMD:

Processors may have a shorter lifespan than previously thought, experts say, which could be one factor contributing to calculation errors.
The fault lies not in the stars but the chips 🙂

The way out? Possibly a new layer of software to monitor the hardware:

Computer engineers are divided over how to respond to the challenge. One widespread response is demand for new kinds of software that proactively watch for hardware errors and make it possible for system operators to remove hardware when it begins to degrade. That has created an opportunity for new start-ups offering software that monitors the health of the underlying chips in data centers.
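To make the idea concrete, here is a minimal sketch of what such a check might look like; it is not how any vendor's product works. It assumes each machine can run the same deterministic workload, so any machine reporting a different result from the majority is suspect. The names `workload`, `is_self_consistent` and `flag_outliers` are made up for illustration.

```python
import hashlib
import struct
from collections import Counter


def workload(seed: int, rounds: int = 10_000) -> bytes:
    """Deterministic integer arithmetic; healthy hardware always gives the same digest."""
    x = seed
    for _ in range(rounds):
        # 64-bit linear congruential step (constants chosen only for illustration)
        x = (x * 6364136223846793005 + 1442695040888963407) % (1 << 64)
    return hashlib.sha256(struct.pack("<Q", x)).digest()


def is_self_consistent(seed: int, repeats: int = 3) -> bool:
    """Run the workload several times on this machine and check the results agree."""
    return len({workload(seed) for _ in range(repeats)}) == 1


def flag_outliers(digests_by_host: dict) -> list:
    """Return the hosts whose digest differs from the most common one."""
    if not digests_by_host:
        return []
    majority, _ = Counter(digests_by_host.values()).most_common(1)[0]
    return sorted(h for h, d in digests_by_host.items() if d != majority)
```

A real monitor would have to exercise many more instruction types and run continuously, since silent errors may show up only rarely and only on particular cores.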

Why not improve the design rather than add another layer of complexity? I guess the NYT just did a product placement; I haven’t bothered to link to the start-up mentioned in the article. Nevertheless, I have linked the article here to highlight the folly of relying too much on the cloud.
