A Non-Invasive Approach for Realizing Resilience in MPI
| Content Provider | ACM Digital Library |
|---|---|
| Author | Feng, Wu; Kalim, Umar; Gardner, Mark K. |
| Abstract | As the computational capabilities of supercomputers transition from petaflops to exaflops, more compute processes work concurrently to accomplish tasks, requiring more communication. This results in the use of an increasing number of software and hardware components, which in turn increases the probability of abnormal events and failures. We present a solution that improves resilience against transient events in network communication. We observe that the coupling of session and transport semantics in implementations inhibits recovery from transient failures. Our proposal, a session-layer intermediary (SLIM), serves as a shim layer on top of the interconnect's interface and enables separation of session and transport semantics. We use Open MPI as a case study, where SLIM exposes an interface to the Byte Transfer Layer framework. This approach manages transient faults in the underlying transport by trapping and resolving them, so that they do not cascade into failed MPI primitives. Preliminary results show that the introduction of SLIM delivers resilience without incurring any performance impact in either latency or throughput. In the future, we plan to include other interconnects, such as OpenIB, and enable tolerance for transient network failures. |
| Starting Page | 1 |
| Ending Page | 8 |
| Page Count | 8 |
| File Format | |
| ISBN | 9781450350013 |
| DOI | 10.1145/3086157.3086166 |
| Language | English |
| Publisher | Association for Computing Machinery (ACM) |
| Publisher Date | 2017-06-26 |
| Publisher Place | New York |
| Access Restriction | Subscribed |
| Subject Keyword | Fault tolerance; Extensions; Message passing interface (MPI); Byte transfer layer (BTL); Open MPI; Resilience |
| Content Type | Text |
| Resource Type | Article |
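
The abstract describes SLIM as a shim that sits between the session and the transport, trapping transient transport faults so they do not surface as failed MPI primitives. The general idea can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not use any Open MPI or BTL API: `FlakyTransport`, `SessionShim`, `TransientTransportError`, and the retry-then-reconnect policy are all hypothetical names and assumptions chosen only for illustration.

```python
class TransientTransportError(Exception):
    """Hypothetical error type for a recoverable transport fault."""


class FlakyTransport:
    """Hypothetical stand-in for a transport that fails a fixed number of times."""

    def __init__(self, failures: int) -> None:
        self.failures = failures      # remaining sends that will fail
        self.reconnects = 0           # how many times the link was re-established
        self.delivered: list[bytes] = []

    def send(self, payload: bytes) -> int:
        if self.failures > 0:
            self.failures -= 1
            raise TransientTransportError("link reset")
        self.delivered.append(payload)
        return len(payload)

    def reconnect(self) -> None:
        self.reconnects += 1


class SessionShim:
    """Assumed session-layer wrapper: traps transient faults and retries.

    The caller sees one logical send; transport-level recovery stays hidden
    unless the retry budget is exhausted.
    """

    def __init__(self, transport: FlakyTransport, max_retries: int = 3) -> None:
        self.transport = transport
        self.max_retries = max_retries

    def send(self, payload: bytes) -> int:
        for attempt in range(self.max_retries + 1):
            try:
                return self.transport.send(payload)
            except TransientTransportError:
                if attempt == self.max_retries:
                    raise  # no longer treated as transient: surface the failure
                self.transport.reconnect()
        raise AssertionError("unreachable")


if __name__ == "__main__":
    transport = FlakyTransport(failures=2)
    shim = SessionShim(transport)
    sent = shim.send(b"hello")
    print(sent, transport.reconnects, transport.delivered)
```

In this sketch the session state (the pending payload) outlives the failed transport attempts, which is the separation of session and transport semantics the abstract refers to. How SLIM actually detects and resolves faults is not specified in this record.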