Parallel Checkpoint / Restart for MPI Applications
| Content Provider | Semantic Scholar |
|---|---|
| Author | Sankaran, Sriram; Squyres, Jeffrey M.; Barrett, Brian W.; Lumsdaine, Andrew; Duell, Jason; Hargrove, Paul |
| Copyright Year | 2003 |
| Abstract | As high-performance clusters continue to grow in size and popularity, issues of fault tolerance and reliability are becoming limiting factors on application scalability. To address these issues, we present the design and implementation of a system for providing coordinated checkpointing and rollback recovery for MPI-based parallel applications. Our approach integrates the Berkeley Lab BLCR kernel-level process checkpoint system with the LAM implementation of MPI through a defined checkpoint/restart interface. Checkpointing is transparent to the application, allowing the system to be used for cluster maintenance and scheduling reasons as well as for fault tolerance. Experimental results show negligible performance impact due to the incorporation of the checkpoint support capabilities into LAM/MPI. |
| File Format | PDF HTM / HTML |
| Alternate Webpage(s) | http://crd-legacy.lbl.gov/~jcduell/papers/cr.pdf |
| Alternate Webpage(s) | http://www.researchgate.net/profile/Jeffrey_Squyres/publication/228720272_Parallel_CheckpointRestart_for_MPI_Applications/links/0fcfd505b37a839b9d000000.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |