| Publisher | Carnegie Mellon University | | |
|---|---|---|---|
| Format | 568.0 KB PDF | Date added | 01 Dec 2005 |
| Topics | High Performance Computing | | |
| Downloads | 2 | | |
Designing highly dependable systems requires a good understanding of failure characteristics. Unfortunately, little raw data on failures in large IT installations is publicly available because such data is confidential. This paper analyzes soon-to-be-public failure data covering systems at a large high-performance computing site. The data was collected over the past 9 years at Los Alamos National Laboratory and includes 23,000 failures recorded on more than 20 different systems, mostly large clusters of SMP and NUMA nodes. The authors study the statistics of the data, including the root cause of failures, the mean time between failures, and the mean time to repair.
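As a rough illustration of the two timing statistics named in the abstract, the sketch below computes a mean time to repair and a mean time between failures from a handful of failure records. The records, the function names, and the choice to measure time between failures as the gap between successive failure start times are all assumptions made for this example; they are not taken from the paper or from the Los Alamos data.

```python
from datetime import datetime
from statistics import mean

# Hypothetical failure records for a single system:
# (time the failure started, time the repair was completed).
records = [
    (datetime(2004, 1, 1, 0, 0), datetime(2004, 1, 1, 2, 0)),
    (datetime(2004, 1, 3, 0, 0), datetime(2004, 1, 3, 6, 0)),
    (datetime(2004, 1, 4, 12, 0), datetime(2004, 1, 4, 13, 0)),
]


def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600


def mean_time_to_repair_hours(records):
    """Average duration from failure start to repair completion."""
    return mean([hours(end - start) for start, end in records])


def mean_time_between_failures_hours(records):
    """Average gap between successive failure start times (an assumed definition)."""
    starts = sorted(start for start, _ in records)
    gaps = [hours(later - earlier) for earlier, later in zip(starts, starts[1:])]
    return mean(gaps)


mttr = mean_time_to_repair_hours(records)
mtbf = mean_time_between_failures_hours(records)
print(f"MTTR: {mttr:.1f} h, MTBF: {mtbf:.1f} h")
```

Other definitions of time between failures exist, for example measuring from the end of one repair to the start of the next failure, so the numbers depend on which convention a given study adopts.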




