Detailed record

Predictive Reliability and Fault Management in Exascale Systems

State of the Art and Perspectives

Article written by: Canal, Ramon; Atienza, David; Abella, Jaume; Hernandez, Carles; Tornero, Rafa; Cilardo, Alessandro; Massari, Giuseppe; Reghenzani, Federico; Fornaciari, William; Zapater, Marina; Oleksiak, Ariel; Piatek, Wojciech

Abstract: Performance and power constraints come together with Complementary Metal Oxide Semiconductor (CMOS) technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, due to the exponential increase in the number of devices per chip, leads to higher system fault rates. Consequently, High-Performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from the electronics level to the system level. We analyze their strengths and limitations. Finally, we identify promising paths toward meeting the reliability levels required of Exascale systems.


Language: English