NDLI: ReStore: reusing results of MapReduce jobs in pig

Edgar F. Codd Innovations Award Talk

Calvin: fast distributed transactions for partitioned database systems

Parallel main-memory indexing for moving-object query and update workloads

Sample-driven schema mapping

Interactive regret minimization

Managing large dynamic graphs efficiently

Skimmer: rapid scrolling of relational query results

bLSM: a general purpose log structured merge tree

High-performance complex event processing over XML streams

MaskIt: privately releasing user context streams for personalized mobile applications

Towards a unified architecture for in-RDBMS analytics

CrowdScreen: algorithms for filtering data with humans

Processing a large number of continuous preference top-k queries

Temporal alignment

Aggregate suppression for enterprise search engines

A model-based approach to attributed graph clustering

Locality-sensitive hashing scheme based on dynamic collision counting

Analytic database technologies for a new kind of user: the data enthusiast

Mob data sourcing

Automatic web-scale information extraction

Sindbad: a location-based social networking system

Shark: fast data analysis using coarse-grained distributed memory

Amazon dynamoDB: a seamlessly scalable non-relational database service

The value of social media data in enterprise applications

Query optimization in microsoft SQL server PDW

TAO: how facebook serves the social graph

Dynamic workload driven data integration in tableau

CloudRAMSort: fast and efficient large-scale distributed RAM sort on shared-nothing cluster

Declarative web application development: encapsulating dynamic JavaScript widgets (abstract only)

SIGMOD Contributions Award Talk

Advanced partitioning techniques for massively distributed computation

Divergent physical design tuning for replicated databases

Can we beat the prefix filtering?: an adaptive framework for similarity join and search

MCJoin: a memory-constrained join for column-store main-memory databases

Query preserving graph compression

Efficient spatial sampling of large geographical tables

Skeleton automata for FPGAs: reconfiguring without reconstructing

Prediction-based geometric monitoring over distributed data streams

Authenticating location-based services without compromising location privacy

Tiresias: the database oracle for how-to queries

Local structure and determinism in probabilistic databases

Optimal top-k generation of attribute combinations based on ranked lists

A highway-centric labeling approach for answering distance queries on large sparse graphs

Probase: a probabilistic taxonomy for text understanding

Towards effective partition management for large graphs

Efficient external-memory bisimulation on DAGs

Symbiosis in scale out networking and data management

Managing and mining large graphs: patterns and algorithms

Just-in-time information extraction using extraction views

MAQSA: a system for social analytics on news

Exploiting MapReduce-based similarity joins

Efficient transaction processing in SAP HANA database: the end of a column store myth

Anatomy of a gift recommendation engine powered by social media

F1: the fault-tolerant distributed RDBMS supporting google's ad business

Large-scale machine learning at twitter

Finding related tables

Adaptive optimizations of recursive queries in teradata

Towards scalable summarization and visualization of large text corpora (abstract only)

Test Of Time Award Talk: Executing SQL over Encrypted Data in the Database-Service-Provider Model

SkewTune: mitigating skew in mapreduce applications

Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems

InfoGather: entity augmentation and attribute discovery by holistic matching with web tables

Holistic optimization by prefetching query results

SCARAB: scaling reachability computation on large graphs

Declarative error management for robust data-intensive applications

NoDB: efficient query execution on raw data files

Online windowed subsequence matching over probabilistic sequences

Effective caching of shortest paths for location-based services

GUPT: privacy preserving data analysis made easy

So who won?: dynamic max discovery with the crowd

Top-k bounded diversification

Efficient processing of distance queries in large graphs: a vertex cover approach

Optimizing index for taxonomy keyword search

TreeSpan: efficiently computing similarity all-matching

Materialized view selection for XQuery workloads

Managing and mining large graphs: systems and implementations

ColumbuScout: towards building local search engines over large databases

Surfacing time-critical insights from social media

GLADE: big data analytics made easy

Walnut: a unified cloud object store

Designing a scalable crowdsourcing platform

Oracle in-database hadoop: when mapreduce meets RDBMS

Recurring job optimization in scope

Optimizing analytic data flows for multiple execution engines

From x100 to vectorwise: opportunities, challenges and things most researchers do not think about

Reducing cache misses in hash join probing phase by pre-sorting strategy (abstract only)

SIGMOD Jim Gray Doctoral Dissertation Award Talk

Computational reproducibility: state-of-the-art, challenges, and database research opportunities

SOFIA SEARCH: a tool for automating related-work search

Taagle: efficient, personalized search in collaborative tagging networks

ReStore: reusing results of MapReduce jobs in pig

DP-tree: indexing multi-dimensional data under differential privacy (abstract only)

Database techniques for linked data management

RACE: real-time applications over cloud-edge

PrefDB: bringing preferences closer to the DBMS

Clydesdale: structured data processing on hadoop

Temporal provenance discovery in micro-blog message streams (abstract only)

Differential privacy in data publication and analysis

Partiqle: an elastic SQL engine over key-value stores

Auto-completion learning for XML

Tiresias: a demonstration of how-to queries

SigSpot: mining significant anomalous regions from time-evolving networks (abstract only)

JustMyFriends: full SQL, full transactional amenities, and access privacy

Logos: a system for translating queries into narratives

AstroShelf: understanding the universe through scalable navigation of a galaxy of annotations

VRRC: web based tool for visualization and recommendation on co-authorship network (abstract only)

Dynamic optimization of generalized SQL queries with horizontal aggregations

PAnG: finding patterns in annotation graphs

OPAvion: mining and visualization in large graphs

Fast sampling word correlations of high dimensional text data (abstract only)

ConsAD: a real-time consistency anomalies detector

VizDeck: self-organizing dashboards for visual analytics

CloudAlloc: a monitoring and reservation system for compute clusters

Interactive performance monitoring of a composite OLTP and OLAP workload

Kaizen: a semi-automatic index advisor

TIRAMOLA: elastic nosql provisioning through a cloud management platform

ReStore: reusing results of MapReduce jobs in pig

Content Provider	ACM Digital Library
Author	Aboulnaga, Ashraf; Elghandour, Iman
Abstract	Analyzing large-scale data has become an important activity for many organizations, and is now facilitated by the MapReduce programming and execution model and its implementations, most notably Hadoop. Query languages such as Pig Latin, Hive, and Jaql make it simpler for users to express complex analysis tasks, and the compilers of these languages translate these tasks into workflows of MapReduce jobs. Each job in a workflow reads its input from the distributed file system used by the MapReduce system (e.g., HDFS in the case of Hadoop) and writes its output back to that file system, where the next job in the workflow reads it as input. The current practice is to delete these intermediate results from the distributed file system once the workflow has finished executing. It would be more useful if these intermediate results could be stored and reused in future workflows. We demonstrate ReStore, an extension to Pig that manages the storage and reuse of intermediate results of the MapReduce workflows executed in the Pig data analysis system. ReStore matches input workflows of MapReduce jobs against previously executed jobs and rewrites these workflows to reuse the stored results of the matched jobs. ReStore also creates additional reuse opportunities by materializing and storing the output of query execution operators that are executed within a MapReduce job. In this demonstration we showcase the MapReduce jobs and sub-jobs that ReStore recommends for a given Pig query, the rewriting of input queries to reuse stored intermediate results, and a what-if analysis of the effectiveness of reusing stored outputs of previously executed jobs.
Starting Page	701
Ending Page	704
Page Count	4
File Format	PDF
ISBN	9781450312479
DOI	10.1145/2213836.2213937
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2012-05-20
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Data reuse; Pig Latin; MapReduce
Content Type	Text
Resource Type	Article
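
The matching and rewriting step described in the abstract can be illustrated with a minimal sketch. This is not ReStore's actual algorithm or Pig's API: the tuple-based plan representation, the `repository` dictionary, the `load` and `rewrite` helpers, and all paths are hypothetical names chosen for illustration. The sketch assumes a sub-plan can be reused only when it is structurally identical to a previously executed one.

```python
from typing import Dict, Tuple

# A plan node is (operator, argument, child_plans). This representation is
# an assumption made for the sketch, not Pig's internal plan format.
Plan = Tuple[str, str, tuple]


def load(path: str) -> Plan:
    """Build a leaf node that reads a data set from the file system."""
    return ("LOAD", path, ())


def rewrite(plan: Plan, repository: Dict[Plan, str]) -> Plan:
    """Replace the largest sub-plans whose results are already stored."""
    if plan in repository:
        # This exact sub-plan ran before: read its stored output instead.
        return load(repository[plan])
    op, arg, children = plan
    # Otherwise keep the operator and try to rewrite each input.
    return (op, arg, tuple(rewrite(child, repository) for child in children))


# A previously executed job (hypothetical): filter a page-view log.
filtered = ("FILTER", "status == 200", (load("/logs/views"),))
repository = {filtered: "/restore/job_0001"}

# A new query that groups the same filtered data.
query = ("GROUP", "url", (filtered,))
rewritten = rewrite(query, repository)
print(rewritten)
```

In the rewritten plan the `FILTER` sub-plan is replaced by a `LOAD` of the stored output, so only the `GROUP` step would need to run. Exact structural equality is the simplest possible matching rule; a real system would need a more tolerant notion of plan equivalence and a policy for deciding which outputs are worth keeping.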