Implementation and evaluation of application-level checkpoint-recovery scheme for MPI programs (2006)
| Content Provider | CiteSeerX |
|---|---|
| Author | Bronevetsky, Greg; Pingali, Keshav; Stodghill, Paul |
| Description | It is becoming important for long-running scientific applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the computation’s state is saved periodically to disk, and upon failure the computation is restarted from the last saved state. The common CPR mechanism, called System-level Checkpointing (SLC), requires modifying the operating system and the communication libraries to enable them to save the state of the entire parallel application. This approach is not portable, since a checkpointer for one system rarely works on another. Application-level Checkpointing (ALC) is a portable alternative in which the programmer manually modifies the program to enable CPR, a very labor-intensive task. We are investigating the use of compiler technology to instrument codes so that the ability to tolerate faults is embedded in the applications themselves, making them self-checkpointing and self-restarting on any platform. In [9] we described a general approach for checkpointing shared memory APIs at the application level. Since [9] applied only to a toy feature set common to most shared memory APIs, this paper shows the practicality of this approach by extending it to a specific popular shared memory API: OpenMP. We describe the challenges involved in providing automated ALC for OpenMP applications and experimentally validate this approach by showing detailed performance results for our implementation of this technique. Our experiments with the NAS OpenMP benchmarks [1] and the EPCC microbenchmarks [21] show generally low overhead on three different architectures (Linux/IA64, Tru64/Alpha and Solaris/Sparc) and highlight important lessons about the performance characteristics of this approach. 1. In 2004 Conference on Super Computing (SC2004 |
| File Format | |
| Language | English |
| Publisher | ACM Press |
| Publisher Date | 2006-01-01 |
| Access Restriction | Open |
| Subject Keyword | Low Overhead; Compiler Technology; Computation State; Different Architecture; Linux Ia64; Application Level; Automated Alc; Toy Feature; Labor-intensive Task; Solaris Sparc; Entire Parallel Application; Tru64 Alpha; Last Saved State; Operating System; Detailed Performance Result; Mpi Program; Na Openmp; Hardware Fault; Application-level Checkpointing; Portable Alternative; Common Cpr Mechanism; Shared Memory Apis; Important Lesson; Used Approach; Memory Api; Epcc Microbenchmarks; General Approach; Openmp Application; Memory Apis; Application-level Checkpoint-recovery Scheme; Performance Characteristic; System-level Checkpointing; Long-running Scientific Application |
| Content Type | Text |
| Resource Type | Article |
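
The checkpoint-and-restart cycle summarized in the description (save state periodically, restart from the last saved state after a failure) can be illustrated with a minimal, hand-written application-level sketch. This is not the compiler-based OpenMP scheme the record describes; the function names `save_checkpoint`, `load_checkpoint`, and `run`, the pickle file format, and the `fail_at` parameter used to simulate a fault are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it over path.

    The rename means a crash during the write cannot corrupt the
    previous checkpoint.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` (hypothetical, for demonstration only) raises an error
    just before the given step to mimic a hardware fault.
    """
    # Restart from the last saved state if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "total": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same `path` after a failure resumes from the most recent checkpoint rather than from step 0. The work done between that checkpoint and the failure is repeated, which is the usual trade-off when choosing the checkpoint interval.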