NDLI: Introduction to Spark 2.0 for Database Researchers

Building Machine Learning Systems that Understand

Learning Linear Regression Models over Factorized Joins

Publishing Attributed Social Graphs with Formal Privacy Guarantees

The Snowflake Elastic Data Warehouse

Data Blocks: Hybrid OLTP and OLAP on Compressed Storage using both Vectorization and Compilation

iBFS: Concurrent Breadth-First Search on GPUs

Scalable Pattern Sharing on Event Streams

Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads

Stop-and-Stare: Optimal Sampling Algorithms for Viral Marketing in Billion-scale Networks

Goods: Organizing Google's Datasets

Constraint-Variance Tolerant Data Repairing

Topic Exploration in Spatio-Temporal Document Collections

Realtime Data Processing at Facebook

Diversified Top-k Subgraph Querying in a Large Graph

Fast Multi-Column Sorting in Main-Memory Column-Stores

FluxQuery: An Execution Framework for Highly Interactive Query Workloads

Time Adaptive Sketches (Ada-Sketches) for Summarizing Data Streams

A Hybrid B+-tree as Solution for In-Memory Indexing on CPU-GPU Heterogeneous Computing Platforms

TARDiS: A Branch-and-Merge Approach To Weak Consistency

Enabling Incremental Query Re-Optimization

Generating Preview Tables for Entity Graphs

Robust Query Processing in Co-Processor-accelerated Databases

Top-k Relevant Semantic Place Retrieval on Spatial RDF Data

Rheem: Enabling Multi-Platform Task Execution

Introduction to Spark 2.0 for Database Researchers

Constructing Join Histograms from Histograms with q-error Guarantees

To Join or Not to Join?: Thinking Twice about Joins before Feature Selection

Publishing Graph Degree Distribution with Node Differential Privacy

Closing the functional and Performance Gap between SQL and NoSQL

GeckoFTL: Scalable Flash Translation Techniques For Very Large Flash Devices

Tornado: A System For Real-Time Iterative Analysis Over Evolving Data

How to Win a Hot Dog Eating Contest: Distributed Incremental View Maintenance with Batch Updates

An Effective Syntax for Bounded Relational Queries

Spheres of Influence for More Effective Viral Marketing

Multi-Source Uncertain Entity Resolution at Yad Vashem: Transforming Holocaust Victim Reports into People

Interactive and Deterministic Data Cleaning

ParTime: Parallel Temporal Aggregation

SparkR: Scaling R Programs with Spark

Graph Indexing for Shortest-Path Finding over Dynamic Sub-Graphs

Elastic Pipelining in an In-Memory Database Cluster

iOLAP: Managing Uncertainty for Efficient Incremental OLAP

Streaming Algorithms for Robust Distinct Elements

Low-Overhead Asynchronous Checkpointing in Main-Memory Database Systems

TicToc: Time Traveling Optimistic Concurrency Control

Sampling-Based Query Re-Optimization

Speedup Graph Processing by Graph Ordering

How to Architect a Query Compiler

Local Similarity Search for Unstructured Text

Emma in Action: Declarative Dataflows for Scalable Data Analysis

Design Tradeoffs of Data Access Methods

Graph Summarization for Geo-correlated Trends Detection in Social Networks

Real-time Video Recommendation Exploration

Principled Evaluation of Differentially Private Algorithms using DPBench

Have Your Data and Query It Too: From Key-Value Caching to Big Data Management

SHARE Interface in Flash Storage for Relational and NoSQL Databases

EmptyHeaded: A Relational Engine for Graph Processing

Sharing-Aware Outlier Analytics over High-Volume Data Streams

Wander Join: Online Aggregation via Random Walks

Continuous Influence Maximization: What Discounts Should We Offer to Social Network Users?

A Hybrid Approach to Functional Dependency Discovery

Sequential Data Cleaning: A Statistical Approach

Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets

VectorH: Taking SQL-on-Hadoop to the Next Level

Efficient Subgraph Matching by Postponing Cartesian Products

Page As You Go: Piecewise Columnar Access In SAP HANA

Dynamic Prefetching of Data Tiles for Interactive Visualization

Augmented Sketch: Faster and More Accurate Stream Processing

T-Part: Partitioning of Transactions for Forward-Pushing in Deterministic Database Systems

Scaling Multicore Databases via Constrained Parallel Execution

A Fast Randomized Algorithm for Multi-Objective Query Optimization

ROLL: Fast In-Memory Generation of Gigantic Scale-free Networks

Automated Demand-driven Resource Scaling in Relational Database-as-a-Service

Similarity Join over Array Data

Wildfire: Concurrent Blazing Data Ingest and Analytics

Data Cleaning: Overview and Emerging Challenges

M3: Scaling Up Machine Learning via Memory Mapping

Towards Globally Optimal Crowdsourcing Quality Management: The Uniform Worker Setting

PrivTree: A Differentially Private Algorithm for Hierarchical Decompositions

Ambry: LinkedIn's Scalable Geo-Distributed Object Store

Accelerating Relational Databases by Leveraging Remote Memory and RDMA

GTS: A Fast and Scalable Graph Processing Method based on Streaming Topology to GPUs

THEMIS: Fairness in Federated Stream Processing under Overload

Quickr: Lazily Approximating Complex AdHoc Queries in BigData Clusters

Holistic Influence Maximization: Combining Scalability and Efficiency with Opinion-Aware Models

Ontological Pathfinding

Learning-Based Cleansing for Indoor RFID Data

Distributed Evaluation of Top-k Temporal Joins

Adaptive Logging: Optimizing Logging and Recovery Costs in Distributed In-memory Databases

Adding Counting Quantifiers to Graph Patterns

Hybrid Garbage Collection for Multi-Version Concurrency Control in SAP HANA

Expressive Query Construction through Direct Manipulation of Nested Relational Results

Matrix Sketching Over Sliding Windows

Reducing the Storage Overhead of Main-Memory OLTP Databases with Hybrid Indexes

Towards a Non-2PC Transaction Management in Distributed Database Systems

Operator and Query Progress Estimation in Microsoft SQL Server Live Query Statistics

Functional Dependencies for Graphs

GPL: A GPU-based Pipelined Query Processing Engine

LazyLSH: Approximate Nearest Neighbor Search for Multiple Distance Functions with a Single Index

Efficient Query Processing on Many-core Architectures: A Case Study with Intel Xeon Phi Processor

Querying Geo-Textual Data: Spatial Keyword Queries and Beyond

K-means Split Revisited: Well-grounded Approach and Experimental Evaluation

Building the Enterprise Fabric for Big Data with Vertica and Spark Integration

Adaptive Indexing over Encrypted Numeric Data

SQL Schema Design: Foundations, Normal Forms, and Normalization

FPTree: A Hybrid SCM-DRAM Persistent and Concurrent B-Tree for Storage Class Memory

Graph Analytics Through Fine-Grained Parallelism

SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures

A Study of Sorting Algorithms on Approximate Memory

Potential and Pitfalls of Domain-Specific Information Extraction at Web Scale

Extracting Databases from Dark Data with DeepDive

PrivateClean: Data Cleaning and Differential Privacy

AT-GIS: Highly Parallel Spatial Query Processing with Associative Transducers

Big Data Analytics with Datalog Queries on Spark

DUALSIM: Parallel Subgraph Enumeration in a Massive Graph on a Single Machine

UpBit: Scalable In-Memory Updatable Bitmap Indexing

Shasta: Interactive Reporting At Scale

Graph Stream Summarization: From Big Bang to Big Crunch

Design Principles for Scaling Multi-core OLTP Under High Contention

ERMIA: Fast Memory-Optimized Database System for Heterogeneous Workloads

Optimization of Nested Queries using the NF2 Algebra

SLING: A Near-Optimal Index Structure for SimRank

Towards a Hybrid Design for Fast Query Processing in DB2 with BLU Acceleration Using Graphical Processing Units: A Technology Demonstration

Set-based Similarity Search for Time Series

ReproZip: Computational Reproducibility With Ease

Provenance: On and Behind the Screens

Main Memory Adaptive Denormalization

Truss Decomposition of Probabilistic Graphs: Semantics and Algorithms

Practical Private Range Search Revisited

SQLShare: Results from a Multi-Year SQL-as-a-Service Experiment

Micro-architectural Analysis of In-memory OLTP

Hybrid Pulling/Pushing for I/O-Efficient Distributed and Iterative Graph Computing

Range Thresholding on Streams

Distributed Wavelet Thresholding for Maximum Error Metrics

Robust and Noise Resistant Wrapper Induction

Estimating the Impact of Unknown Unknowns on Aggregate Query Results

RDFind: Scalable Conditional Inclusion Dependency Discovery in RDF Datasets

Towards Best Region Search for Data Exploration

An Efficient MapReduce Cube Algorithm for Varied Data Distributions

Distributed Set Reachability

Datometry Hyper-Q: Bridging the Gap Between Real-Time and Historical Analytics

Scalable Approximate Query Tracking over Highly Distributed Data Streams

DBSherlock: A Performance Diagnostic Tool for Transactional Databases

Transaction Healing: Scaling Optimistic Concurrency Control on Multicores

Extracting Equivalent SQL from Imperative Code in Database Applications

Query Planning for Evaluating SPARQL Property Paths

An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory

Range-based Obstructed Nearest Neighbor Queries

CLAMS: Bringing Quality to Data Lakes

Microblogs Data Management Systems: Querying, Analysis, and Visualization

Adaptive Data Skipping in Main-Memory Systems

Efficient and Progressive Group Steiner Tree Search

Privacy Preserving Subgraph Matching on Large Graphs in Cloud

Automatic Generation of Normalized Relational Schemas from Nested Key-Value Data

Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee

Cost-Effective Crowdsourced Entity Resolution: A Partial-Order Approach

Simba: Efficient In-Memory Spatial Analytics

FERARI: A Prototype for Complex Event Processing over Streaming Multi-cloud Platforms

The Challenges of Global-scale Data Management

Searching Web Data using MinHash LSH

Constance: An Intelligent Data Lake System

Semistructured Models, Queries and Algebras in the Big Data Era: Tutorial Summary

Research Contribution as a Measure of Influence

Exploring Privacy-Accuracy Tradeoffs using DPComp

Automatic Entity Recognition and Typing in Massive Text Data

Vectorizing an In Situ Query Engine

Interactive Search and Exploration of Waveform Data with Searchlight

Big Graph Analytics Systems

Exploring Visualization of Data Transforms

Ontology-Based Integration of Streaming and Static Relational Data with Optique

Minimizing Average Regret Ratio in Database

The CloudMdsQL Multistore System

ActiveClean: An Interactive Data Cleaning Framework For Modern Machine Learning

Wander Join: Online Aggregation for Joins

PerNav: A Route Summarization Framework for Personalized Navigation

Making the Case for Query-by-Voice with EchoQuery

QUEPA: QUerying and Exploring a Polystore by Augmentation

REACT: Context-Sensitive Recommendations for Data Analysis

PerfEnforce Demonstration: Data Analytics with Performance Guarantees

High-Performance Geospatial Analytics in HyPerSpace

What Makes a Good Physical plan?: Experiencing Hardware-Conscious Query Optimization with Candomblé

SnappyData: A Hybrid Transactional Analytical Store Built On Spark

SourceSight: Enabling Effective Source Selection

BART in Action: Error Generation and Empirical Evaluations of Data-Cleaning Systems

RxSpatial: Reactive Spatial Library for Real-Time Location Tracking and Processing

Web-based Benchmarks for Forecasting Systems: The ECAST Platform

Energy Elasticity on Heterogeneous Hardware using Adaptive Resource Reconfiguration LIVE

QFix: Demonstrating Error Diagnosis in Query Histories

CoDAR: Revealing the Generalized Procedure & Recommending Algorithms of Community Detection

DB-Risk: The Game of Global Database Placement

Quegel: A General-Purpose System for Querying Big Graphs

Introduction to Spark 2.0 for Database Researchers

Content Provider	ACM Digital Library
Author	Zaharia, Matei; Bateman, Doug; Xin, Reynold; Armbrust, Michael
Abstract	Originally started as an academic research project at UC Berkeley, Apache Spark is one of the most popular open source projects for big data analytics. Over 1000 volunteers have contributed code to the project; it is supported by virtually every commercial vendor; many universities are now offering courses on Spark. Spark has evolved significantly since the 2010 research paper: its foundational APIs are becoming more relational and structural with the introduction of the Catalyst relational optimizer, and its execution engine is developing quickly to adopt the latest research advances in database systems such as whole-stage code generation. This tutorial is designed for database researchers (graduate students, faculty members, and industrial researchers) interested in a brief hands-on overview of Spark. This tutorial covers the core APIs for using Spark 2.0, including DataFrames, Datasets, SQL, streaming and machine learning pipelines. Each topic includes slide and lecture content along with hands-on use of a Spark cluster through a web-based notebook environment. In addition, we will dive into the engine internals to discuss architectural design choices and their implications in practice. We will guide the audience to "hack" Spark by extending its query optimizer to speed up distributed join execution.
Starting Page	2193
Ending Page	2194
Page Count	2
File Format	PDF
ISBN	9781450335317
DOI	10.1145/2882903.2912565
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2016-06-26
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Streaming; Hadoop; Machine learning; Big data; Spark; SQL
Content Type	Text
Resource Type	Article
