NDLI: Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications

From Data to Insights @ Bare Metal Speed

Distributed Outlier Detection using Compressive Sensing

sPCA: Scalable Principal Component Analysis for Big Data on Distributed Platforms

SEMROD: Secure and Efficient MapReduce Over HybriD Clouds

TencentRec: Real-time Stream Recommendation in Practice

Overview of Data Exploration Techniques

Machine Learning and Databases: The Sound of Things to Come or a Cacophony of Hype?

Cost-based Fault-tolerance for Parallel Data Processing

Diversity-Aware Top-k Publish/Subscribe for Text Stream

Minimum Spanning Trees in Temporal Graphs

COMMIT: A Scalable Approach to Mining Communication Motifs from Dynamic Networks

Supporting Data Uncertainty in Array Databases

Telco Churn Prediction with Big Data

Three Favorite Results

The Power Behind the Throne: Information Integration in the Age of Data-Driven Discovery

On the Design and Scalability of Distributed Shared-Data Databases

Private Release of Graph Statistics using Ladder Functions

Persistent Data Sketching

CE-Storm: Confidential Elastic Processing of Data Streams

Mining and Forecasting of Big Time-series Data

Optimal Spatial Dominance: An Effective Search of Nearest Neighbor Candidates

The Importance of Being Expert: Efficient Max-Finding in Crowdsourcing

Thrifty: Offering Parallel Database as a Service using the Shared-Process Approach

Cache-Efficient Aggregation: Hashing Is Sorting

Query-Oriented Data Cleaning with Oracles

Minimizing Commit Latency of Transactions in Geo-Replicated Data Stores

REEF: Retainable Evaluator Execution Framework

Graft: A Debugging Tool For Apache Giraph

Rack-Scale In-Memory Join Processing using RDMA

GetReal: Towards Realistic Selection of Influence Maximization Strategies in Competitive Networks

From Group Recommendations to Group Formation

Large-scale Predictive Analytics in Vertica: Fast Data Transfer, Distributed Model Creation, and In-database Prediction

Data Management in Non-Volatile Memory

TEGRA: Table Extraction by Global Record Alignment

Graph-Aware, Workload-Adaptive SPARQL Query Caching

k-Shape: Efficient and Accurate Clustering of Time Series

Amazon Redshift and the Case for Simpler Data Warehouses

An Incremental Anytime Algorithm for Multi-Objective Query Optimization

S4: Top-k Spreadsheet-Style Search for Query Discovery

Knowledge Curation and Knowledge Fusion: Challenges, Models and Applications

Smooth Task Migration in Apache Storm

Locality-aware Partitioning in Parallel Database Systems

Exploiting Matrix Dependency for Efficient Distributed Matrix Computation

Authenticated Online Data Integration Services

Twitter Heron: Stream Processing at Scale

Squall: Fine-Grained Live Reconfiguration for Partitioned Main Memory Databases

Diverse and Proportional Size-l Object Summaries for Keyword Search

Efficient Enumeration of Maximal k-Plexes

LASH: Large-Scale Sequence Mining with Hierarchies

Identifying the Extent of Completeness of Query Answers over Partially Complete Databases

The LDBC Social Network Benchmark: Interactive Workload

Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems

Bayesian Differential Privacy on Correlated Data

Scalable Distributed Stream Join Processing

A SQL Debugger Built from Spare Parts: Turning a SQL:1999 Database System into Its Own Debugger

THERMAL-JOIN: A Scalable Spatial Join for Dynamic Workloads

Minimizing Efforts in Validating Crowd Answers

BenchPress: Dynamic Workload Control in the OLTP-Bench Testbed

Efficient Similarity Join and Search on Multi-Attribute Data

BigDansing: A System for Big Data Cleansing

Optimizing Optimistic Concurrency Control for Tree-Structured, Log-Structured Databases

Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications

Even Metadata is Getting Big: Annotation Summarization using InsightNotes

Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation

Influence Maximization in Near-Linear Time: A Martingale Approach

Real-Time Multi-Criteria Social Graph Partitioning: A Game Theoretic Approach

Oracle Workload Intelligence

Mining Quality Phrases from Massive Text Corpora

Left Bit Right: For SPARQL Join Queries with OPTIONAL Patterns (Left-outer-joins)

SMiLer: A Semi-Lazy Time Series Prediction System for Sensors

ShareInsights: An Unified Approach to Full-stack Data Processing

Output-sensitive Evaluation of Prioritized Skyline Queries

Proactive Annotation Management in Relational Databases

JAFAR: Near-Data Processing for Databases

ByteSlice: Pushing the Envelop of Main Memory Data Processing with a New Storage Layout

LEMP: Fast Retrieval of Large Entries in a Matrix Product

ENKI: Access Control for Encrypted Query Processing

Analytics in Motion: High Performance Event-Processing AND Real-Time Analytics in the Same Database

Madeus: Database Live Migration Middleware under Heavy Workloads for Cloud Environment

Local Filtering: Improving the Performance of Approximate Queries on String Collections

Divide & Conquer: I/O Efficient Depth-First Search

Twister Tries: Approximate Hierarchical Agglomerative Clustering for Average Distance in Linear Time

k-Hit Query: Top-k Query with Probabilistic Utility Function

Rethinking Data-Intensive Science Using Scalable Analytics Systems

FOEDUS: OLTP Engine for a Thousand Cores and NVRAM

Modular Order-Preserving Encryption, Revisited

SCREEN: Stream Data Cleaning under Speed Constraints

Exploratory Keyword Search with Interactive Input

Indexing Metric Uncertain Data for Range Queries

iCrowd: An Adaptive Crowdsourcing Framework

Demonstrating "Data Near Here": Scientific Data Search

Holistic Indexing in Main-memory Column-stores

Data X-Ray: A Diagnostic Tool for Data Errors

The Homeostasis Protocol: Avoiding Transaction Coordination Through Program Analysis

Design and Implementation of the LogicBlox System

StoryPivot: Comparing and Contrasting Story Evolution

Rethinking SIMD Vectorization for In-Memory Databases

Community Level Diffusion Extraction

Utility-Aware Social Event-Participant Planning

Purity: Building Fast, Highly-Available Enterprise Flash Storage from Commodity Components

Mining Subjective Properties on the Web

How to Build Templates for RDF Question/Answering: An Uncertain Graph Similarity Join Approach

SQLGraph: An Efficient Relational-Based Property Graph Store

Learning Generalized Linear Models Over Normalized Data

Weighted Coverage based Reviewer Assignment

Job Scheduling with Minimizing Data Communication Costs

Implicit Parallelism through Deep Language Embedding

Skew-Aware Join Optimization for Array Databases

Collaborative Access Control in WebdamLog

Why Big Data Industrial Systems Need Rules and What We Can Do About It

Lineage-driven Fault Injection

Exact Top-k Nearest Keyword Search in Large Networks

Index-based Optimal Algorithms for Computing Steiner Components with Maximum Connectivity

DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation

Linking Temporal Records for Profiling Entities

QMapper for Smart Grid: Migrating SQL-based Application to Hive

Let's Talk About Storage & Recovery Methods for Non-Volatile Memory Database Systems

Chiaroscuro: Transparency and Privacy for Massive Personal Time-Series Clustering

Location-Aware Pub/Sub System: When Continuous Moving Queries Meet Dynamic Event Streams

QE3D: Interactive Visualization and Exploration of Complex, Distributed Query Plans

Efficient Route Planning on Public Transportation Networks: A Labelling Approach

QASCA: A Quality-Aware Task Assignment System for Crowdsourcing Applications

Slider: An Efficient Incremental Reasoner

CliffGuard: A Principled Framework for Finding Robust Database Designs

KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing

Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity

Spark SQL: Relational Data Processing in Spark

The Flatter, the Better: Query Compilation Based on the Flattening Transformation

A Padded Encoding Scheme to Accelerate Scans by Leveraging Skew

BEAR: Block Elimination Approach for Random Walk with Restart on Large Graphs

Online Video Recommendation in Sharing Community

On Improving User Response Times in Tableau

Microblog Entity Linking with Social Temporal Context

RBench: Application-Specific RDF Benchmarking

Updating Graph Indices with a One-Pass Algorithm

Utilizing IDs to Accelerate Incremental View Maintenance

Distributed Online Tracking

One Loop Does Not Fit All

From Theory to Practice: Efficient Join Query Evaluation in a Parallel Database System

Resource Elasticity for Large-Scale Machine Learning

Automatic Enforcement of Data Use Policies with DataLawyer

Efficient Algorithms for Answering the m-Closest Keywords Query

The TagAdvisor: Luring the Lurkers to Review Web Items

DataXFormer: An Interactive Data Transformation Tool

tDP: An Optimal-Latency Budget Allocation Strategy for Crowdsourced MAXIMUM Operations

WANalytics: Geo-Distributed Analytics for a Data Intensive World

Exploiting Correlations for Expensive Predicate Evaluation

Crowd-Based Deduplication: An Adaptive Approach

D2WORM: A Management Infrastructure for Distributed Data-centric Workflows

The Minimum Wiener Connector Problem

ALEX: Automatic Link Exploration in Linked Data

DunceCap: Compiling Worst-Case Optimal Query Plans

Quality-Driven Continuous Query Execution over Out-of-Order Data Streams

FTT: A System for Finding and Tracking Tourists in Public Transport Services

$NL_{2}CM$: A Natural Language Interface to Crowd Mining

DunceCap: Query Plans Using Generalized Hypertree Decompositions

MoDisSENSE: A Distributed Spatio-Temporal and Textual Processing Platform for Social Networking Services

SharkDB: An In-Memory Storage System for Massive Trajectory Data

Optimistic Recovery for Iterative Dataflows in Action

DocRicher: An Automatic Annotation System for Text Documents Using Social Media

Ringo: Interactive Graph Analytics on Big-Memory Machines

A Secure Search Engine for the Personal Cloud

A Demonstration of Rubato DB: A Highly Scalable NewSQL Database System for OLTP and Big Data Applications

STORM: Spatio-Temporal Online Reasoning and Management of Large Spatio-Temporal Data

IReS: Intelligent, Multi-Engine Resource Scheduler for Big Data Analytics Workflows

G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data

PAXQuery: Parallel Analytical XML Processing

Just can't get enough: Synthesizing Big Data

Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications

Content Provider	ACM Digital Library
Author	Shah, Hitesh; Murthy, Arun; Curino, Carlo; Saha, Bikas; Seth, Siddharth; Vijayaraghavan, Gopal
Abstract	The broad success of Hadoop has led to a fast-evolving and diverse ecosystem of application engines built on the YARN resource management layer. The open-source implementation of MapReduce is slowly being replaced by a collection of engines dedicated to specific verticals. This has led to growing fragmentation and repeated effort, with each new vertical engine re-implementing fundamental features (e.g., fault tolerance, security, straggler mitigation) from scratch. In this paper, we introduce Apache Tez, an open-source framework designed to build data-flow driven processing runtimes. Tez provides scaffolding and library components that can be used to quickly build scalable and efficient data-flow centric engines. Central to our design is fostering component re-use without hindering customizability of the performance-critical data plane. This is the key differentiator with respect to the previous generation of systems (e.g., Dryad, MapReduce) and even emerging ones (e.g., Spark), which provided and mandated a fixed data plane implementation. Furthermore, Tez provides native support for building runtime optimizations, such as dynamic partition pruning for Hive. Tez is deployed at Yahoo!, Microsoft Azure, LinkedIn, and numerous Hortonworks customer sites, and a growing number of engines are being integrated with it. This confirms our intuition that most of the popular vertical engines can leverage a core set of building blocks. We complement qualitative accounts of real-world adoption with quantitative experimental evidence that Tez-based implementations of Hive, Pig, Spark, and Cascading on YARN outperform their original YARN implementations on popular benchmarks (TPC-DS, TPC-H) and production workloads.
Starting Page	1357
Ending Page	1369
Page Count	13
File Format	PDF
ISBN	9781450327589
DOI	10.1145/2723372.2742790
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2015-05-27
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Distributed data processing; Apache Hadoop; Big data; Open source
Content Type	Text
Resource Type	Article
