Title: Current Efforts in Global Scheduling and Fault Tolerance for HPC Systems
Speaker: Laércio LIMA PILLA
Abstract: Performance, energy efficiency, and reliability have been important objectives and challenges in current and future computing systems. In this context, our approach has been based on understanding the details of the computing system architecture and the behavior of applications, in order to combine this information, identify issues and propose new solutions. In this presentation, I will discuss our experience with the development of new architecture-aware global scheduling algorithms for multiprocessor and multicomputer systems, and with fault tolerance mechanisms for radiation-induced errors in parallel accelerators. I will also present some future global scheduling plans to handle the inclusion of non-volatile random-access memories (NVRAMs) in computing systems.