NDLI: Abstract: Comparing GPU and Increment-Based Checkpoint Compression

Content Provider	IEEE Xplore Digital Library
Author	Ibtesham, D.; Arnold, D.; Ferreira, K.B.; Brightwell, R.
Copyright Year	2012
Abstract	The increasing size and complexity of high-performance computing systems have led to major concerns over fault frequencies and the mechanisms necessary to tolerate these faults. Based on expected increases in core counts (to at least on the order of millions) and expected increases in component complexity, it has been projected that system MTBF for future extreme-scale systems will fall below 10 minutes. Previous studies have shown that state-of-the-field checkpoint/restart, the mechanism commonly employed for application fault tolerance in HPC, will not scale sufficiently for these systems. Checkpoint/restart protocols periodically record the state and address space of all application processes to stable storage during normal operation. On a failure, a new incarnation of the failed process is recovered from the failed process's most recent checkpoint, thereby reducing the amount of lost computation. Because of the overhead associated with checkpoint/restart, researchers have tried to optimize it with strategies such as hiding or reducing commit latencies, or reducing checkpoint sizes, for example with increment-based checkpoints. Instead of saving the whole address space, increment-based checkpoints save only the changes made to the application's address space since the last checkpoint was taken, thus reducing the size of the checkpoints. In our previous study, we explored the feasibility of checkpoint data compression to reduce checkpoint commit latency and storage overheads. We also demonstrated that checkpoint data compression can improve an application's makespan significantly. In this work, we compare checkpoint compression against increment-based checkpoints and also examine GPU-based compression algorithms for faster compression performance. GPUs are known for their extreme parallelism and can be very fast.
In this study, we compare a CPU-based compression algorithm with GPU-based checkpoint compression and try to leverage the faster parallel implementation of the GPU-based compression algorithm. We demonstrate that although GPU-based compression algorithms can accelerate compression and decompression significantly, their poor compression performance limits their usefulness. We compare checkpoint compression against increment-based checkpoint optimization and demonstrate that checkpoint compression can exceed the performance of an optimal incremental checkpointing scheme. We show that the greatest checkpoint performance can be realized when CPU-based compression is used in conjunction with incremental checkpointing. Lastly, we motivate future GPU-based compression development by exploring various hypothetical scenarios.
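The combination the abstract describes, increment-based checkpoints plus CPU-based compression, can be illustrated with a minimal sketch. This is not the authors' implementation: the 4096-byte `PAGE_SIZE`, the use of SHA-256 digests to detect changed pages, the choice of `zlib` as a stand-in CPU compressor, the fixed-size address space, and all function names are assumptions made for illustration only.

```python
import hashlib
import zlib

PAGE_SIZE = 4096  # assumed page granularity; not taken from the paper


def split_pages(memory, page_size=PAGE_SIZE):
    """Split a bytes-like address-space image into fixed-size pages."""
    return [bytes(memory[i:i + page_size])
            for i in range(0, len(memory), page_size)]


def incremental_checkpoint(memory, prev_hashes):
    """Return (delta, hashes).

    delta maps page index -> page bytes for every page whose digest differs
    from the previous checkpoint; with an empty prev_hashes, every page is
    saved, which is equivalent to a full checkpoint.
    """
    delta, hashes = {}, []
    for idx, page in enumerate(split_pages(memory)):
        digest = hashlib.sha256(page).digest()
        hashes.append(digest)
        if idx >= len(prev_hashes) or prev_hashes[idx] != digest:
            delta[idx] = page
    return delta, hashes


def compress_delta(delta):
    """Compress each saved page with zlib (stand-in for a CPU compressor)."""
    return {idx: zlib.compress(page) for idx, page in delta.items()}


def restore(base, compressed_deltas):
    """Rebuild the image by applying compressed deltas, oldest first.

    Assumes the address space keeps the same size as the base image.
    """
    pages = split_pages(base)
    for delta in compressed_deltas:
        for idx, blob in delta.items():
            pages[idx] = zlib.decompress(blob)
    return b"".join(pages)


if __name__ == "__main__":
    memory = bytearray(16 * PAGE_SIZE)      # toy "address space"
    base = bytes(memory)                    # full checkpoint
    _, hashes = incremental_checkpoint(base, [])

    memory[5 * PAGE_SIZE + 10] = 1          # application touches two pages
    memory[9 * PAGE_SIZE] = 2

    delta, hashes = incremental_checkpoint(memory, hashes)
    compressed = compress_delta(delta)
    print("changed pages:", sorted(delta))  # changed pages: [5, 9]
    assert restore(base, [compressed]) == bytes(memory)
```

Only the two modified pages are written in the second checkpoint, and each is compressed before it would be committed to storage. A production system would more likely track dirty pages through operating-system support than by hashing, so the detection step here is purely illustrative.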
Starting Page	1505
Ending Page	1506
File Size	265539
Page Count	2
File Format	PDF
ISBN	9781467362184
e-ISBN	9780769549569
DOI	10.1109/SC.Companion.2012.290
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2012-11-10
Publisher Place	USA
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Checkpoint Compression Checkpoint Restart Optimization Fault Tolerance
Content Type	Text
Resource Type	Article

Similar Documents

Poster: Comparing GPU and Increment-Based Checkpoint Compression

On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-Based Fault Tolerance

The design and implementation of checkpoint/restart process fault tolerance for Open MPI (2007)

The design and implementation of checkpoint/restart process tolerance for Open MPI (2007)

On the viability of compression for reducing the overheads of checkpoint/restart-based fault tolerance.

Affinity-aware checkpoint restart

Coarse-Grained Energy Modeling of Rollback/Recovery Mechanisms

CRFS: A Lightweight User-Level Filesystem for Generic Checkpoint/Restart

A checkpoint compression study for high-performance computing systems

Abstract: Comparing GPU and Increment-Based Checkpoint Compression