Lookahead Read Cache: Improving Read Performance of Deduplication Storage for Backup Applications
| Content Provider | Semantic Scholar |
|---|---|
| Author | Park, Dongchul; Nam, Young Jin; Du, David H. C. |
| Copyright Year | 2012 |
| Abstract | Data deduplication (dedupe, for short) is a special data compression technique that has been widely adopted, especially in backup storage systems, with the primary aims of saving backup time as well as storage space. Consequently, most traditional dedupe research has focused on improving write performance during the dedupe process, while very little effort has been made on read performance. However, read performance in dedupe backup storage is also a crucial issue when it comes to storage recovery from a system crash. In this paper, we design a new read cache in dedupe storage for a backup application that improves read performance by taking advantage of a special characteristic of such workloads: the read sequence is the same as the write sequence. Thus, for better cache utilization, we can evict from the cache the data containers with the smallest future reference counts by looking ahead at their future references in a moving window. Moreover, to achieve better read cache performance, our design maintains a small log buffer to judiciously retain data chunks that will be accessed in the future. Our experiments with real-world workloads demonstrate that our proposed read cache scheme contributes significantly to read performance improvement. Keywords: deduplication, dedupe, cache, read performance. |

I. INTRODUCTION AND MOTIVATIONS

The digital data explosion has put data deduplication (dedupe, for short) in the spotlight: over 80% of companies are turning their attention to dedupe technologies [1]. Data dedupe is a specialized technique that eliminates duplicated data, retaining only one unique copy on storage and replacing redundant data with a pointer to that unique copy. These days, dedupe technologies have been widely deployed, particularly in secondary storage systems for data backup or archive, due to considerable cost (i.e., time as well as space) savings. Thus, major concerns have mostly centered on write performance improvement by efficiently detecting and removing as many duplicates as possible with the help of efficient data chunking, index optimization/caching, compression, and data container design [2], [3], [4], [5], [6]. On the other hand, read performance has not attracted much attention from researchers because read operations are rarely invoked in such dedupe storage systems. However, when it comes to system recovery from a crash, the story is significantly different. Long-term digital preservation (LTDP) communities have recently emphasized the importance of read performance in dedupe storage [7], [8]. Moreover, some primary storage systems have started to adopt dedupe technologies [9]. Although read performance, like write performance, is a crucial factor in dedupe storage, very little effort has been devoted to this issue.

A typical data chunk (generally, a few KB) read process in secondary dedupe storage is as follows: first, the dedupe storage system identifies the ID of the data container retaining the data chunks to be read. Then, it looks up the container (generally, 2 or 4 MB) in the read cache. On a cache hit, it reads the chunks from the cache. Otherwise, it fetches the whole container from the underlying storage and then reads the corresponding data chunks from the container. However, this read process results in low cache utilization because, even though spatial locality exists within a data container, mostly only a fraction of the data chunks in each container is accessed [1].
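To make the container-granularity read path concrete, the following Python sketch walks through the three steps just described. It is a minimal illustration under assumed names (`chunk_index`, `storage.fetch_container`, `Container.chunks`), not the paper's actual implementation:

```python
# Minimal sketch of the container-granularity read path described above.
# The index contents, fetch_container(), and the Container layout are
# illustrative assumptions, not the paper's implementation.

class ContainerReadPath:
    def __init__(self, chunk_index, storage):
        self.chunk_index = chunk_index  # chunk fingerprint -> container ID
        self.cache = {}                 # container ID -> cached container
        self.storage = storage          # underlying container store

    def read_chunk(self, fingerprint):
        # Step 1: identify the container holding the requested chunk.
        container_id = self.chunk_index[fingerprint]
        # Step 2: look the container up in the read cache.
        container = self.cache.get(container_id)
        if container is None:
            # Step 3: on a miss, fetch the whole container (generally 2 or
            # 4 MB), even if only a few KB of it will actually be read.
            container = self.storage.fetch_container(container_id)
            self.cache[container_id] = container
        return container.chunks[fingerprint]
```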
Furthermore, the higher the dedupe rate, the higher the data fragmentation rate. This lowers spatial locality and thus further worsens cache utilization. Our key idea in this paper lies in exploiting future access patterns of data chunks. In general, read sequences are identical to write sequences in dedupe storage for backup. Inspired by this special characteristic inherent in such an application, our read cache design can take advantage of future read access patterns captured during dedupe processes, whereas general cache algorithms such as LRU do not consider this special feature of dedupe mechanisms. Based on these observations, we propose a lookahead read cache design in dedupe storage for a backup application. In this paper, we make the following main contributions:

- **Exploiting Future Accesses:** We maintain access information for future read references during dedupe (i.e., write) processes. Thus, our proposed design evicts from the read cache a victim with the smallest future reference count.
- **Design Extension with a Log Buffer:** We assign a portion of the read cache space to a log buffer, which can effectively retain future access chunks on the basis of our hot data identification scheme.
- **Extensive Dataset Analysis:** Our proposed design is fundamentally inspired by our analyses of diverse real datasets.

Since our proposed design is a read cache scheme, unlike a selective duplication/deduplication approach that allows partial data duplication to improve read performance at the cost of write performance, our design not only does not hurt write performance at all, but can also be applied to other dedupe systems.

The remainder of this paper is organized as follows. Section II explains the design and operations of our proposed cache scheme. Section III provides a variety of our experimental results and analyses. Section IV discusses related work, especially work addressing the dedupe read performance issue. Finally, Section V discusses our future work.

II. LOOKAHEAD READ CACHE DESIGN

A. Rationale and Assumptions

In general, as more duplicates are eliminated from an incoming data stream, the read performance stands in marked contrast to the good write performance due to the higher likelihood of shared data fragmentation [1]. This is the fundamental challenge of the tradeoff between read performance and write performance in dedupe storage. To address this read performance issue, we propose a novel read cache design leveraging future access information. In dedupe storage for backup or archive, the read access sequence is highly likely to be identical to its write sequence. Based on this key observation, our proposed scheme records write access metadata during each dedupe process, which enables our lookahead cache to exploit future read references.

Table I: Dedupe gain ratio (DGR) in successive versions of each backup dataset (unit: %). DGR represents the ratio of the data saving size to the original data size.

| Dataset | ver-1 | ver-2 | ver-3 | ver-4 | ver-5 | avg. DGR |
|---|---|---|---|---|---|---|
| ds-1 | 99.9 | 3.5 | 6.9 | 5.6 | 31.2 | 29 |
| ds-2 | 100 | 28 | 24.7 | 14.9 | 20.6 | 37 |
| ds-3 | 99.6 | 95.2 | 97.7 | 97.3 | 96.6 | 97 |
| ds-4 | 90.5 | 55.4 | 63.6 | 20.8 | 20.6 | 50 |
| ds-5 | 84.1 | 3.3 | 2.5 | 11.9 | 2.6 | 20 |
| ds-6 | 54.4 | 22.4 | – | – | – | 38 |

Figure 1 (plot omitted): Distributions of the number of accessed containers for the six real backup datasets (ds-1 through ds-6); the x-axis represents the percentage of accessed chunks in a container, the y-axis the number of containers.
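The eviction policy sketched below illustrates the core of the lookahead idea under stated assumptions: the recorded write sequence of container IDs doubles as the future read sequence, and the victim is the cached container with the fewest references in a moving window of upcoming accesses. The class and parameter names are illustrative, and the log-buffer extension is omitted:

```python
from collections import defaultdict, deque

class LookaheadCache:
    """Illustrative read cache that evicts the container with the
    fewest references in a moving window of future accesses."""

    def __init__(self, capacity, future_container_ids, window=1000):
        self.capacity = capacity
        self.cache = {}                            # container ID -> container
        self.future = deque(future_container_ids)  # recorded write sequence
        self.window = window                       # lookahead window size

    def _future_counts(self):
        # Count how often each container appears in the next `window` accesses.
        counts = defaultdict(int)
        for cid in list(self.future)[:self.window]:
            counts[cid] += 1
        return counts

    def access(self, container_id, fetch):
        # The caller replays the read sequence, so the head of `future`
        # corresponds to the current access; consume it first.
        self.future.popleft()
        if container_id in self.cache:
            return self.cache[container_id]        # cache hit
        if len(self.cache) >= self.capacity:
            counts = self._future_counts()
            # Victim: the cached container with the smallest future
            # reference count inside the moving window.
            victim = min(self.cache, key=lambda cid: counts.get(cid, 0))
            del self.cache[victim]
        container = fetch(container_id)            # fetch whole container
        self.cache[container_id] = container
        return container
```

Because the moving window only approximates Belady's optimal policy, a larger window trades lookahead accuracy against the cost of scanning future references.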
We assume that each data chunk size is variable and that a data container retaining many (generally, 200-300) data chunks is the basic unit for reads.
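Under these assumptions, the container layout can be pictured with the following illustrative data structures (the names are hypothetical, not from the paper):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    fingerprint: bytes   # content hash identifying the chunk
    data: bytes          # variable-size payload, typically a few KB

@dataclass
class Container:
    container_id: int
    chunks: dict = field(default_factory=dict)  # fingerprint -> Chunk

    def add(self, chunk: Chunk) -> None:
        self.chunks[chunk.fingerprint] = chunk

    @property
    def size(self) -> int:
        # A container (generally 2 or 4 MB, holding roughly 200-300
        # chunks) is fetched from storage as one unit.
        return sum(len(c.data) for c in self.chunks.values())
```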
| File Format | PDF, HTML |
| Alternate Webpage(s) | http://www-users.cselabs.umn.edu/classes/Spring-2017/csci5980/files/Dedupe/readperformance-dedupe.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |