Title: Current Efforts in Global Scheduling and Fault Tolerance for HPC Systems
Speaker: Laércio LIMA PILLA
Abstract: Performance, energy efficiency, and reliability are important objectives and challenges for current and future computing systems. In this context, our approach is based on understanding the details of the computing system architecture and the behavior of applications, and on combining this information to identify issues and propose new solutions. In this presentation, I will discuss our experience with the development of new architecture-aware global scheduling algorithms for multiprocessor and multicomputer systems, and with fault tolerance mechanisms for radiation-induced errors in parallel accelerators. I will also present our plans for future global scheduling work that accounts for the inclusion of non-volatile random-access memories (NVRAMs) in computing systems.
Title: Programming Multi-BSP Algorithms in ML
Speaker: Victor Allombert
Abstract: From personal computers using an increasing number of cores to supercomputers with millions of computing units, parallel architectures are the current standard. High-performance architectures are usually referred to as hierarchical, as they are composed of clusters of multiprocessors, which are themselves made of multi-cores. Programming such architectures is notoriously difficult: writing parallel programs is, most of the time, hard in both the algorithmic and the implementation phases. To address these concerns, many structured models and languages have been proposed in order to increase both expressiveness and efficiency. Among these models, Multi-BSP is a bridging model dedicated to hierarchical architectures that ensures efficiency, execution safety, scalability and cost prediction. It is an extension of the well-known BSP model, which handles flat architectures. We introduce the Multi-ML language, which allows programming Multi-BSP algorithms “à la ML” and thus guarantees the properties of the Multi-BSP model as well as execution safety, thanks to an ML type system. To deal with the multi-level execution model of Multi-ML, we define a formal semantics that describes the valid evaluation of an expression. To ensure the execution safety of Multi-ML programs, we also propose a type system that preserves replicated coherence. An abstract machine is defined to formally describe the evaluation of a Multi-ML program on a Multi-BSP architecture. An implementation of the language is available as a compilation toolchain. It is thus possible to generate efficient parallel code from a program written in Multi-ML and execute it on any hierarchical machine.