July 1st, 2008

Why computers fail

Posted by Robin Harris @ 1:25 am

Categories: Infrastructure, Clusters

Tags: Checkpoint, Multiprocessor, Failure, Computer, Desktop Computer, LANL, Desktops, Processors, Hardware, Semiconductors

Good failure data for PCs is hard to find: who knows how many times PC users are told to reinstall Windows? But in a recent paper, Bianca Schroeder and Garth Gibson of CMU found some surprising results in 10 years of failure data from large-scale cluster systems at Los Alamos National Laboratory (LANL).

Among the surprises: new hardware isn’t any more reliable than the old stuff. And even wicked smart LANL physicists can’t figure out the cause for every failure.

Special problems of petascale computing
Despite the incredible performance of Roadrunner, LANL’s new petaflop computer, the jobs it runs often take months to complete. With 3,000 nodes, failures are inevitable.

What to do?
LANL’s strategy is to stop the job and checkpoint. When a node fails they can roll the job back to the last checkpoint and restart, preserving the work already done - but losing the work done after the checkpoint.

Even with massively parallel high-performance storage, the checkpoints take time away from getting the answer. The paper, “Understanding Failures in Petascale Computers,” uses LANL’s data to better manage the tradeoffs and to suggest new strategies.
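
To make the tradeoff concrete, here is a rough sketch in Python. It is not taken from the paper: it uses Young's classic first-order approximation for the checkpoint interval, and the checkpoint cost and mean time between failures (MTBF) below are made-up numbers chosen only for illustration. Restart time is ignored.

```python
import math


def young_interval(checkpoint_cost_h, mtbf_h):
    """Young's first-order estimate of compute hours between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_h * mtbf_h)


def overhead_fraction(interval_h, checkpoint_cost_h, mtbf_h):
    """Approximate overhead relative to useful compute time.

    Assumes failures arrive at a constant rate and that each failure
    costs, on average, half an interval of redone work.
    """
    checkpoint_share = checkpoint_cost_h / interval_h  # time spent writing checkpoints
    rework_share = (interval_h / 2.0) / mtbf_h         # expected work redone after failures
    return checkpoint_share + rework_share


if __name__ == "__main__":
    cost_h = 0.1  # hypothetical: a checkpoint takes 6 minutes
    mtbf_h = 8.0  # hypothetical: the machine fails about every 8 hours
    best = young_interval(cost_h, mtbf_h)
    for interval in (0.5, best, 3.0):
        share = overhead_fraction(interval, cost_h, mtbf_h)
        print(f"checkpoint every {interval:.2f} h: about {share:.1%} overhead")
```

Checkpoint too often and the job spends its time writing checkpoints; too rarely and every failure throws away hours of work. Under these assumptions the estimate sits between the two extremes.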

But it’s the failure data itself - and what it suggests about our own computers - that I found most interesting.

Failure etiology
Hardware accounts for over 50% of all LANL failures - with software about 20%. Given all the PhDs at LANL you’d hope human error would be low on the list - and it is.

Here’s the graph:

[Figure: Root cause analysis of system failures]

Is reliability improving?
Nope. LANL hasn’t seen any improvement over the years - new systems fail about as often as hardware from a decade ago.

[Figure: Failures per year per processor]

The key metric
The research showed that

. . . the failure rate of a system grows proportional to the number of processor chips in the system.

Which is a big problem for massive multi-processor systems.
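
A back-of-the-envelope sketch of what that proportionality means, in Python. The per-chip failure rate here is a made-up placeholder, not a figure from the LANL data, and the function names are mine.

```python
HOURS_PER_YEAR = 24 * 365


def system_failures_per_year(chips, failures_per_chip_per_year):
    """Failure rate that scales linearly with the number of processor chips."""
    return chips * failures_per_chip_per_year


def mean_hours_between_failures(chips, failures_per_chip_per_year):
    """Average hours between failures for the whole system."""
    return HOURS_PER_YEAR / system_failures_per_year(chips, failures_per_chip_per_year)


if __name__ == "__main__":
    per_chip = 0.25  # hypothetical: one failure per chip every four years
    for chips in (2, 256, 4096):
        hours = mean_hours_between_failures(chips, per_chip)
        print(f"{chips:>5} chips: one failure roughly every {hours:,.1f} hours")
```

With the same per-chip rate, a two-chip desktop goes years between hardware failures while a machine with thousands of chips fails several times a day.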

The Storage Bits take
Extrapolating these results to our desktop systems is straightforward - with one big caveat: most desktop system crashes are software, not hardware.

Otherwise the Blue Screen of Death would be the No Screen of Death.

The biggest finding is that we shouldn’t expect our system hardware to get more reliable. Improvements get balanced out by increased complexity.

Those of us with multi-processor systems can expect to see less reliability - though with just a few systems you won’t see any trends. It’s a classic “glass half full” situation: our systems won’t get better, but at least they won’t get worse.

Comments welcome, of course.

Robin Harris has been selling and marketing data storage for over 20 years in companies large and small.
