NDLI: Stateful bulk processing for incremental analytics

Evolution and future directions of large-scale storage and computation systems at Google

An operating system for multicore and clouds: mechanisms and implementation

Lithium: virtual machine storage for the cloud

Stateful bulk processing for incremental analytics

Building facebook: performance at massive scale

Hermes: clustering users in large-scale e-mail services

Fluxo: a system for internet service programming by non-expert developers

Benchmarking cloud serving systems with YCSB

The internal design of salesforce.com's multi-tenant architecture

G-Store: a scalable data store for transactional multi key access in the cloud

Making cloud intermediate data fault-tolerant

Robust and flexible power-proportional storage

Differential virtual time (DVT): rethinking I/O service differentiation for virtual machines

Comet: batched stream processing for data intensive distributed computing

Defining future platform requirements for e-Science clouds

Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Automated software testing as a service

Google fusion tables: data management, integration and collaboration in the cloud

Characterizing cloud computing hardware reliability

RACS: a case for cloud storage diversity

Virtual machine power metering and provisioning

Skew-resistant parallel processing of feature-extracting scientific user-defined functions

The case for PIQL: a performance insightful query language

A self-organized, fault-tolerant and scalable replication scheme for cloud storage

Characterizing, modeling, and generating workload spikes for stateful services

Towards automatic optimization of MapReduce programs

Stateful bulk processing for incremental analytics

Content Provider	ACM Digital Library
Author	Webb, Kevin C.; Yocum, Ken; Logothetis, Dionysios; Olston, Christopher; Reed, Benjamin
Abstract	This work addresses the need for stateful dataflow programs that can rapidly sift through huge, evolving data sets. These data-intensive applications perform complex multi-step computations over successive generations of data inflows, such as weekly web crawls, daily image/video uploads, log files, and growing social networks. While programmers may simply re-run the entire dataflow when new data arrives, this is grossly inefficient, increasing result latency and squandering hardware resources and energy. Alternatively, programmers may use prior results to incrementally incorporate the changes. However, current large-scale data processing tools, such as Map-Reduce or Dryad, limit how programmers incorporate and use state in data-parallel programs. Straightforward approaches to incorporating state can result in custom, fragile code and disappointing performance. This work presents a generalized architecture for continuous bulk processing (CBP) that raises the level of abstraction for building incremental applications. At its core is a flexible, groupwise processing operator that takes state as an explicit input. Unifying stateful programming with a data-parallel operator affords several fundamental opportunities for minimizing the movement of data in the underlying processing system. As case studies, we show how one can use a small set of flexible dataflow primitives to perform web analytics and mine large-scale, evolving graphs in an incremental fashion. Experiments with our prototype using real-world data indicate significant data movement and running time reductions relative to current practice. For example, incrementally computing PageRank using CBP can reduce data movement by 46% and cut running time in half.
Starting Page	51
Ending Page	62
Page Count	12
File Format	PDF
ISBN	9781450300360
DOI	10.1145/1807128.1807138
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2010-06-10
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Cloud computing; Incremental; Parallel data processing; MapReduce
Content Type	Text
Resource Type	Article
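
The abstract describes a groupwise processing operator that takes state as an explicit input, so that new data can be folded into prior results instead of re-running the whole dataflow. The following is a minimal single-process Python sketch of that idea only. It is not the CBP prototype or its API: the names `groupwise_step` and `count_update`, the record layout, and the sample URLs are all hypothetical and chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional, Tuple

# Hypothetical types for this sketch: state maps a key to an integer summary.
State = Dict[str, int]
Record = Tuple[str, int]
UpdateFn = Callable[[str, Optional[int], List[int]], int]


def groupwise_step(state: State, new_records: Iterable[Record], update: UpdateFn) -> State:
    """Run one incremental step.

    New records are grouped by key, and the user-supplied `update` function
    receives the prior state for that key as an explicit input.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in new_records:
        groups[key].append(value)

    # Keys with no new input are carried over without being reprocessed.
    next_state = dict(state)
    for key, values in groups.items():
        next_state[key] = update(key, state.get(key), values)
    return next_state


def count_update(key: str, old_count: Optional[int], values: List[int]) -> int:
    """Example update function: add the number of new records to the old count."""
    return (old_count or 0) + len(values)


# Two successive batches of (assumed) page-visit records.
state: State = {}
state = groupwise_step(state, [("a.com", 1), ("b.com", 1), ("a.com", 1)], count_update)
state = groupwise_step(state, [("a.com", 1), ("c.com", 1)], count_update)
print(state)  # {'a.com': 3, 'b.com': 1, 'c.com': 1}
```

In the second step only `a.com` and `c.com` are passed to the update function; `b.com` keeps its earlier count untouched. This is a toy analogue of avoiding recomputation, not a measurement of the data-movement savings reported in the abstract.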


Similar Documents

Stateful bulk processing for incremental analytics.

Comparison of the Efficiency of MapReduce and Bulk Synchronous Parallel Approaches to Large Network Processing

Stateful bulk processing for incremental analytics

Data intensive applications on clouds

A platform for scalable one-pass analytics using MapReduce

CAST: Tiering Storage for Data Analytics in the Cloud

Improving Encryption Performance Using MapReduce

Multiple Two-Phase Data Processing with MapReduce

Optimizing Cloud MapReduce for Processing Stream Data Using Pipelining

Stateful bulk processing for incremental analytics