Vitus Leung

Machine Learning Helps Diagnose Supercomputer Problems

Dec. 5, 2017
Engineers are leveraging machine learning to both uncover problems with supercomputers and fix them, all without human intervention.

Computer scientists and engineers from Sandia National Laboratories and Boston University recently earned the Gauss Award at the International Supercomputing conference. They were honored for their work on using machine learning to automatically diagnose, and potentially fix, problems in supercomputers.

It turns out that supercomputers, which are relied on for everything from forecasting the weather to cancer research to ensuring U.S. nuclear weapons are safe and reliable, can have bad days. They contain a complex collection of interconnected parts and processes that can go wrong. For example, parts can break, previous programs can leave “zombie processes” running that gum up the works, network traffic can cause bottlenecks, or a computer code revision can instigate problems. These problems often result in programs not running to completion and wasting valuable supercomputer time.

The team compiled a list of issues they had encountered while working with supercomputers, then wrote code to re-create those problems, or anomalies. They ran a variety of programs with and without the anomaly codes on two systems: a supercomputer at Sandia and a public cloud system operated by Boston University.

While the programs were running, researchers collected data on the process, monitoring how much energy, processor power, and memory were used by each node. Monitoring more than 700 criteria used less than 0.005% of the supercomputer’s processing power. Making sense of that much data is where machine learning comes in.

Machine learning is a broad collection of computer algorithms that find patterns without being explicitly told which features are important. The team wrote several machine learning algorithms that detect anomalies by comparing data from normal program runs with data from runs containing anomalies. They then tested the algorithms to see which was best at correctly diagnosing the anomalies. One technique, called Random Forest, was particularly adept at analyzing vast quantities of the monitored data, deciding which metrics were important, and then determining whether the supercomputer was being affected by an anomaly.
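To make the idea concrete, here is a much-simplified sketch of the principle behind Random Forest: many small decision rules, each trained on a random resample of the runs and a random subset of the metrics, vote on a diagnosis. It is an illustration only, not the team's code. The three metrics per run, their values, and the labels "healthy" and "anomaly" are all invented for the example, and each "tree" is reduced to a single threshold test.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def best_stump(rows, labels, features):
    """Find the single (feature, threshold) test with the fewest training errors."""
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= thr]
            right = [y for r, y in zip(rows, labels) if r[f] > thr]
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, thr, l_lab, r_lab)
    if best is None:  # no chosen feature varies in this sample
        m = majority(labels)
        return (0, float("inf"), m, m)
    return best[1:]

def train_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample runs with replacement
        feats = rng.sample(range(len(rows[0])), n_features)
        forest.append(best_stump([rows[i] for i in idx],
                                 [labels[i] for i in idx], feats))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter(l if row[f] <= thr else r for f, thr, l, r in forest)
    return votes.most_common(1)[0][0]

# Hypothetical per-run metrics: [mean memory use, peak CPU load, memory trend]
ROWS = [
    [1.0, 10.0, 0.10], [2.0, 12.0, 0.20], [3.0, 11.0, 0.30], [1.5, 13.0, 0.15],
    [7.0, 30.0, 0.80], [8.0, 32.0, 0.90], [9.0, 31.0, 0.70], [7.5, 33.0, 0.85],
]
LABELS = ["healthy"] * 4 + ["anomaly"] * 4

forest = train_forest(ROWS, LABELS)
print(predict(forest, [2.0, 11.0, 0.2]))
print(predict(forest, [8.0, 31.0, 0.8]))
```

A production system would more likely rely on an established library implementation with full decision trees; the sketch only shows why random resampling plus voting tolerates a few poorly trained rules.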

To accelerate the analysis, the team calculated various statistics for each metric. Simple statistical values (such as the average and the fifth and 95th percentiles), as well as more complex values (such as noisiness, trends over time, and symmetry), pointed to abnormal behavior and thus served as potential warning signs. Calculating these values takes little computing power, and it streamlined the rest of the analysis.
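As a rough illustration of this step, the sketch below reduces one metric's time series to the kinds of statistics mentioned above. The article does not give the team's exact definitions, so the choices here are assumptions: trend is a least-squares slope, symmetry is skewness, and noisiness is the mean absolute change between consecutive samples. The sample readings are invented.

```python
import statistics

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def summarize(samples):
    """Reduce one metric's time series to a handful of summary statistics."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    ordered = sorted(samples)
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)
    # Trend (assumed definition): least-squares slope against the sample index.
    t_mean = (n - 1) / 2
    num = sum((t - t_mean) * (x - mean) for t, x in enumerate(samples))
    den = sum((t - t_mean) ** 2 for t in range(n))
    # Symmetry (assumed definition): skewness, 0 for a symmetric distribution.
    skew = 0.0 if std == 0 else sum((x - mean) ** 3 for x in samples) / (n * std ** 3)
    # Noisiness (assumed definition): mean absolute step between neighbors.
    noise = sum(abs(b - a) for a, b in zip(samples, samples[1:])) / (n - 1)
    return {
        "mean": mean,
        "p5": percentile(ordered, 5),
        "p95": percentile(ordered, 95),
        "trend": num / den,
        "skew": skew,
        "noise": noise,
    }

# Invented memory readings for one node: a steady run and a steadily growing one.
steady = [50, 52, 49, 51, 50, 48, 51, 50]
growing = [50, 53, 57, 60, 66, 70, 75, 81]
print(summarize(steady))
print(summarize(growing))
```

Each series collapses to six numbers regardless of its length, which is why features like these keep the later classification step cheap.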

The team is now working with more artificial anomalies and more useful algorithms. A major future task is to validate diagnostic techniques on real anomalies discovered during normal runs.

Because the machine learning algorithms are relatively cheap to run, the diagnostics could be used in real time, although that capability still needs to be tested. The hope is that diagnostics will eventually be able to inform users and operations staff of anomalies as they occur, or even autonomously take action to fix or work around them.
