Exa-SofT : HPC software and tools

A NumPEx PEPR project

Though significant efforts have been devoted to implementing and optimizing several crucial parts of a typical HPC software stack, most HPC experts agree that exascale supercomputers will raise new challenges, mostly because the trend in exascale compute-node hardware is toward heterogeneity and scalability: compute nodes of future systems will combine regular CPUs and accelerators (typically GPUs), with a diversity of GPU architectures.

Meeting the needs of complex parallel applications and the requirements of exascale architectures raises numerous challenges, many of which remain unaddressed.
As a result, several parts of the software stack must evolve to better support these architectures. More importantly, the links between these parts must be strengthened to form a coherent, tightly integrated software suite.

Our project aims at consolidating the exascale software ecosystem by providing a coherent, exascale-ready software stack featuring breakthrough research advances enabled by multidisciplinary collaborations between researchers.

The main scientific challenges we intend to address are:

  • productivity,
  • performance portability,
  • heterogeneity,
  • scalability and resilience,
  • performance and energy efficiency.

AVALON coordinates WP1 and participates in WP1 and WP2.

Project Information

  • URL: Not available yet
  • Starting date: 2023
  • End date: 2028

Taranis : Model, Deploy, Orchestrate, and Optimize Cloud

A PEPR Cloud project

New infrastructures, such as Edge Computing or the Cloud-Edge-IoT computing continuum, make cloud issues more complex as they add new challenges related to resource diversity and heterogeneity (from small sensor to data center/HPC, from low power network to core networks), geographical distribution, as well as increased dynamicity and security needs, all under energy consumption and regulatory constraints.

In order to efficiently exploit these new infrastructures, we propose a strategy based on a significant abstraction of the application structure description, so as to further automate application and infrastructure management. It then becomes possible to globally optimize the resources used with respect to multi-criteria objectives (price, deadline, performance, energy, etc.) on both the user side (applications) and the provider side (infrastructures). This abstraction also covers application reconfiguration, so that resource usage can be adapted automatically.

The Taranis project addresses these issues through four scientific work packages, each focusing on a phase of the application lifecycle: application and infrastructure description models, deployment and reconfiguration, orchestration, and optimization.

The first work package “Modeling” addresses the complexity of cloud-edge application and infrastructure models: formal verification and optimization of these models, multi-layer variability, the relationship between model expressiveness and efficient solution computation, lock-ins of proprietary models, and heterogeneity of cloud application and infrastructure modeling languages.

The second work package “Deployment and Reconfiguration” studies deployment- and reconfiguration-related issues so as to reduce management complexity and increase support for provisioning and configuration languages, while improving the certification of operations and increasing their concurrency. The work package also aims to reduce the complexity of the bootstrapping problem on geo-distributed and heterogeneous resources.

The third work package “Orchestration of services and resources” aims to extend orchestrators to the Cloud-Edge-IoT continuum while making them more autonomous with respect to dynamic, functional, and/or non-functional needs, in particular regarding the network-partitioning problem specific to Cloud-Edge-IoT infrastructures.

Finally, the fourth work package “Optimization” aims to revisit the optimization problems associated with the use of Cloud-Edge-IoT infrastructures and the execution of an application when a large number of decision variables need to be considered jointly. It also aims to make optimization techniques aware of the Cloud-Edge-IoT continuum, the heterogeneous distributed platforms and the wide range of application configurations involved.
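To make the flavor of these optimization problems concrete, here is a minimal, hypothetical sketch (not Taranis code): candidate placements of an application on the continuum are scored against several normalized objectives at once, with weights standing in for user-side and provider-side preferences. All names, metrics, and weights are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Placement:
    """A candidate mapping of an application onto continuum resources."""
    name: str
    price: float     # euros per hour
    makespan: float  # seconds until completion
    energy: float    # joules consumed

def normalized(values):
    """Scale metric values into [0, 1], lower being better."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def best_placement(candidates, weights=(0.4, 0.3, 0.3)):
    """Pick the candidate minimizing a weighted sum of normalized objectives."""
    prices = normalized([c.price for c in candidates])
    times = normalized([c.makespan for c in candidates])
    energies = normalized([c.energy for c in candidates])
    scores = [weights[0] * p + weights[1] * t + weights[2] * e
              for p, t, e in zip(prices, times, energies)]
    return min(zip(scores, candidates), key=lambda sc: sc[0])[1]

candidates = [
    Placement("all-cloud", price=2.0, makespan=300, energy=5e5),
    Placement("edge-heavy", price=0.5, makespan=900, energy=2e5),
    Placement("hybrid", price=1.2, makespan=450, energy=3e5),
]
print(best_placement(candidates).name)  # "edge-heavy" under these weights
```

A weighted sum is only the simplest stand-in; the work package targets much harder settings, where many decision variables must be optimized jointly across the continuum.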

AVALON coordinates the project and participates in the first two work packages.

Project Information

  • URL: Not available yet
  • Starting date: 2023, September 1st
  • End date: 2030, August 31st

PHC Aurora with University of Tromsø (Norway) (2023-2024) : Extreme Edge-Fog-Cloud continuum

This collaboration concerns “Exploring Energy Monitoring and Leveraging Energy Efficiency on End-to-End Worst Edge Fog Cloud Continuum for Extreme Climate Environments Observatories”.

This project is led by Laurent Lefevre and Issam Rais (UiT, Tromsø, Norway).

The Arctic tundra is one of the ecosystems most sensitive to climate change. It is a large area with, at present, too few large-scale observation sites. Scientific observatories for extreme climate environments (based on ICT facilities and sensor infrastructure) provide on-field data to researchers (e.g., ecologists, biologists) so that they can observe and model complex environments with rapidly changing conditions. Gathering, processing, and reporting observations are often limited by the availability of sufficient energy. Reporting is also limited by the availability of a communication network with sufficient bandwidth and latency. The opportunities provided by the data are thus limited by the availability of two critical resources: energy and communication networks.

Collected data must be processed, transported, and stored on relevant ICT infrastructures. Such resources can be deployed in various geographic locations depending on the proximity of sensors and actuators. For extreme scientific observatories, we target ICT infrastructure based on a continuum of resources from sensors and actuators to fog nodes (with limited capabilities) and up to cloud infrastructures.

Extreme climate conditions imply highly heterogeneous systems. Two extremes are represented here: (i) Clouds, highly monitored and maintained, and (ii) edge devices in extreme conditions, which need close monitoring and maintenance but cannot have them (or only to a very limited extent), as they would cost too much (e.g., in human resources, devices, energy). In between, fog and edge devices can be in either situation, depending on their context. Such high heterogeneity creates hierarchies and cliques of nodes that have very different access to resources, monitoring, and even availability.

The challenges we want to address in this Aurora project are how to (i) provide the mechanisms needed to reproduce the characteristics of extreme environments within an in-lab testbed, (ii) provide end-to-end energy monitoring of the considered worst-case continuum infrastructure, (iii) discover the most impactful energy leverages to sustain observations and monitoring, and (iv) deploy a proof of concept (simulated and actually implemented) that validates the abstraction, architecture, design, and implementation choices.
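As a taste of what end-to-end energy monitoring involves on the cloud side of such a continuum, here is a minimal Python sketch assuming a Linux node that exposes Intel RAPL counters under /sys/class/powercap. This is an assumption: many nodes, and extreme-edge devices in particular, lack such counters and require external power meters.

```python
import time

# Package-level energy counter in microjoules (Intel RAPL on Linux).
# Assumed path: present on many Intel machines, absent on most edge devices.
RAPL_FILE = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj():
    """Read the cumulative energy counter, in microjoules."""
    with open(RAPL_FILE) as f:
        return int(f.read())

def average_power_watts(interval_s=1.0):
    """Sample the counter twice and derive the mean power over the interval."""
    e0 = read_energy_uj()
    time.sleep(interval_s)
    e1 = read_energy_uj()
    # The hardware counter wraps around; this sketch ignores that case.
    return (e1 - e0) / 1e6 / interval_s

if __name__ == "__main__":
    print(f"Mean package power: {average_power_watts():.1f} W")
```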

SkyData

ANR-22-CE25-0008-01

Logo SkyData

Summary

A fundamental characteristic of our era is the deluge of data. Since the days of Grid environments, data management in distributed environments has been common practice, and Cloud systems provide many solutions to store data. A data manager can be defined through its functionalities, which can be understood as services: security, replication strategy, green data transfer, synchronization, data migration, etc. There are many ways to design those services, and each data manager composes them according to its own point of view. Usually, this point of view is centered on applications rather than data, even when an autonomic solution is provided.

The SkyData project aims to break with the way data management is organized today. We propose a new paradigm without any centralized control or middleware. Imagine a world where data are controlled by themselves! Providing data with autonomic behavior is a real challenge. In this project, we will endow data with autonomous behaviors and thus create a new kind of entity, so-called Self-Managed Data (or SkyData). We plan to develop a distributed and autonomous environment where the data are regulated by themselves. This change of paradigm represents a huge and truly innovative challenge that can be split into three key challenges: the first consists in a strong theoretical study of autonomic computing; the second aims at developing the algorithmic support for the concepts; and the third is the genesis of a prototype with this new generation of data and a significant use case. The latter will show how to use a SkyData environment to create an autonomous data-management system with a reduced carbon footprint.
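To give a concrete, if deliberately naive, intuition of the paradigm, the sketch below (our illustration, not the project's design) shows a datum that carries its own replication policy and acts on it, instead of a middleware deciding on its behalf. Every class, threshold, and policy here is hypothetical.

```python
class Node:
    """A possible host for a copy of the datum."""
    def __init__(self, name, energy_cost):
        self.name = name
        self.energy_cost = energy_cost  # abstract energy price of hosting a copy

class SelfManagedDatum:
    """A datum that embeds and enforces its own management policy."""
    def __init__(self, payload, min_replicas=2, energy_budget=1.0):
        self.payload = payload
        self.replicas = {"node-a"}        # names of nodes holding a copy
        self.min_replicas = min_replicas  # availability target the datum enforces
        self.energy_budget = energy_budget

    def autonomic_step(self, nodes):
        """Monitor-analyze-plan-execute loop run by the datum itself."""
        if len(self.replicas) >= self.min_replicas:
            return  # availability target met, nothing to do
        # Plan: cheapest new host that fits the datum's own energy budget.
        candidates = [n for n in nodes
                      if n.name not in self.replicas
                      and n.energy_cost <= self.energy_budget]
        if candidates:
            target = min(candidates, key=lambda n: n.energy_cost)
            self.replicas.add(target.name)  # execute: replicate itself

nodes = [Node("node-b", 0.4), Node("node-c", 2.0)]
datum = SelfManagedDatum("sensor reading")
datum.autonomic_step(nodes)
print(datum.replicas)  # {'node-a', 'node-b'}: the datum replicated itself
```

Choosing the cheapest admissible host is only a stand-in for the energy-aware policies the project targets; the point is that the decision loop belongs to the datum.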

Project Information

  • URL: SkyData Web Site
  • Starting date: 12/01/2023
  • End date: 12/01/2027

SLICES-PP – Preparatory Phase

The digital infrastructures research community continues to face numerous new challenges in the design of the Next Generation Internet. This is an extremely complex ecosystem encompassing communication, networking, data-management, and data-intelligence issues, supported by established and emerging technologies such as IoT, 5/6G, and cloud-to-edge computing. Coupled with the enormous amount of data generated and exchanged over the network, this calls for incremental as well as radically new design paradigms. Experimentally driven research is becoming a de facto standard worldwide, and it has to be supported by large-scale research infrastructures to make results trusted, repeatable, and accessible to the research communities.


SLICES-RI (Research Infrastructure), which was recently included in the 2021 ESFRI roadmap, aims to address these problems by building the large infrastructure needed for experimental research on various aspects of distributed computing, networking, IoT, and 5/6G networks. It will provide the resources needed to continuously design, experiment, operate, and automate the full lifecycle management of digital infrastructures, data, applications, and services.


Based on the two preceding projects within SLICES-RI, SLICES-DS (Design Study) and SLICES-SC (Starting Community), the SLICES-PP (Preparatory Phase) project will validate the requirements to engage into the implementation phase of the RI lifecycle. It will set the policies and decision processes for the governance of SLICES-RI: i.e., the legal and financial frameworks, the business model, the required human resource capacities and training programme. It will also settle the final technical architecture design for implementation. It will engage member states and stakeholders to secure commitment and funding needed for the platform to operate. It will position SLICES as an impactful instrument to support European advanced research, industrial competitiveness and societal impact in the digital era.

Action Exploratoire INRIA ExODE

Coordinator: Jonathan Rouzaud-Cornabas (INRIA Beagle, LIRIS)

Participants: Samuel Bernard (INRIA Dracula, Institut Camille Jordan), Thierry Gautier (Avalon)

Date: 2019-2022

In biology, the vast majority of systems can be modeled as ordinary differential equations (ODEs). Modeling biological objects more finely increases the number of equations, and simulating ever larger systems increases it further; as a consequence, the size of the ODE systems to be solved is exploding. A major bottleneck is that ODE numerical resolution software (ODE solvers) is limited to a few thousand equations by prohibitive computation times. The AEx ExODE tackles this bottleneck via 1) the introduction of new numerical methods that take advantage of mixed precision, which combines several floating-point precisions within a numerical scheme, and 2) the adaptation of these new methods to next-generation machines, which are highly hierarchical and heterogeneous and composed of a large number of CPUs and GPUs. In the past year, a new approach to Deep Learning has proposed replacing Recurrent Neural Networks (RNNs) with ODE systems. The numerical and parallel methods of ExODE will be evaluated and adapted in this framework in order to improve the performance and accuracy of these new approaches.
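As an illustration of what mixed precision can mean here, the following NumPy sketch (ours, not ExODE's actual methods) performs the expensive right-hand-side evaluations of a classical Runge-Kutta step in float32 while accumulating the solution state in float64. The system, sizes, and step count are toy choices.

```python
import numpy as np

# Toy system: 1000 coupled linear ODEs, y' = A y, with A stored in float32.
rng = np.random.default_rng(0)
A = (rng.standard_normal((1000, 1000)) / 1000).astype(np.float32)

def f(t, y):
    """Right-hand side, evaluated in float32 (the costly part)."""
    return A @ y

def rk4_step_mixed(t, y, h):
    """One RK4 step: stages in float32, state accumulated in float64."""
    y32 = y.astype(np.float32)  # downcast once for the stage evaluations
    k1 = f(t, y32)
    k2 = f(t + h / 2, y32 + (h / 2) * k1)
    k3 = f(t + h / 2, y32 + (h / 2) * k2)
    k4 = f(t + h, y32 + h * k3)
    incr = (k1 + 2 * k2 + 2 * k3 + k4).astype(np.float64)
    return y + (h / 6.0) * incr  # accumulate in float64

y = np.ones(1000, dtype=np.float64)
t, h = 0.0, 1e-2
for _ in range(100):
    y = rk4_step_mixed(t, y, h)
    t += h
print(y[:3])
```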

Labex MILYON

Laboratory of excellence in mathematics and fundamental computer science.

MILYON federates the mathematics and computer science communities of Lyon around three axes: excellence in research, notably in domains at the interface between the two disciplines or with other sciences; education, by supporting innovative, research-oriented curricula; and society, through outreach bringing scientific culture to the general public and technology transfer to industry.

It gathers more than 350 researchers and three joint research units of the Université de Lyon: the Institut Camille Jordan, the Laboratoire de l’Informatique du Parallélisme, and the Unité de Mathématiques Pures et Appliquées.

More information on the MILYON website.

Start Date:

Duration: Until 2024

Avalon Members:

Inria-Illinois-ANL-BSC-JSC-Riken/AICS Joint Laboratory on Extreme Scale Computing

In June 2014, the University of Illinois at Urbana-Champaign; Inria, the French national computer science institute; Argonne National Laboratory; Barcelona Supercomputing Center; Jülich Supercomputing Centre; and the RIKEN Advanced Institute for Computational Science formed the Joint Laboratory for Extreme Scale Computing, a follow-up to the Inria-Illinois Joint Laboratory for Petascale Computing.

Research areas include:

  • Scientific applications (big compute and big data), which drive the research in the other topics of the joint laboratory.
  • Modeling and optimizing numerical libraries, which are at the heart of many scientific applications.
  • Novel programming models and runtime systems, which allow scientific applications to be updated or reimagined to take full advantage of extreme-scale supercomputers.
  • Resilience and Fault-tolerance research, which reduces the negative impact when processors, disk drives, or memory fail in supercomputers that have tens or hundreds of thousands of those components.
  • I/O and visualization, which are an important part of parallel execution for numerical simulations and data analytics.
  • HPC Clouds, which may execute a portion of the HPC workload in the near future.

More on the lab website

Start Date: 2014

End date: 2022 (extended for 4 years in 2019)

Avalon Members: T. Gautier, L. Lefevre, C. Perez