Title: Detecting Silent Data Corruption Using an Auxiliary Method and External Observer
Speaker: Hadrien Croubois
Abstract: HPC platforms and applications are becoming increasingly complex. Consequently, protecting results against all forms of corruption and ensuring their trustworthiness are becoming more important. While previous work focuses on application-specific detectors, our current work on the dataflow manager in the Decaf project aims to provide an efficient, generic mechanism. We address these issues using new replication patterns that rely on an auxiliary method and an external learning observer. In this talk, we present both the theoretical validation mechanisms and different use cases where our mechanism can be applied to detect silent data corruption.