NDLI: Revisiting aggregation techniques for big data

CineCubes: cubes as movie stars with little effort

Extended dimensions for cleaning and querying inconsistent data warehouses

Data warehousing and OLAP over big data: current challenges and future research directions

Clustering cubes with binary dimensions in one pass

Meta-stars: multidimensional modeling for social business intelligence

Slowly changing measures

Optimizing OLAP cube processing on solid state drives

Social microblogging cube

Using REO on ETL conceptual modelling: a first approach

Can we analyze big data inside a DBMS?

CXT-cube: contextual text cube model and aggregation operator for text OLAP

ProtOLAP: rapid OLAP prototyping with on-demand data supply

INDREX: in-database distributional relation extraction

Lazy data structure maintenance for main-memory analytics over sliding windows

Revisiting aggregation techniques for big data

Content Provider	ACM Digital Library
Author	Tsotras, Vassilis J.
Abstract	In this talk we first present an introduction to AsterixDB [1], a parallel, semistructured platform to ingest, store, index, query, analyze, and publish "big data" (http://asterixdb.ics.uci.edu), and the various challenges we addressed while building it. AsterixDB combines ideas from semistructured data management, parallel database systems, and first-generation data-intensive computing platforms (MapReduce and Hadoop). The full AsterixDB software stack provides support for big data applications from the storage and processing engine (Hyracks [2], available at http://hyracks.googlecode.com), to the flexible query optimization layer (Algebricks), to the interfaces for user-level interaction (AQL, HiveQL, Pregelix, etc.). Hyracks is a partitioned-parallel engine for data-intensive computing jobs in the form of DAGs. Algebricks is a model-agnostic, algebraic layer for compiling and optimizing parallel queries to be processed by Hyracks. Queries for AsterixDB can be expressed either in popular higher-level data analysis languages like Pig, Hive, or Jaql, or in its native query language (AQL) and data model (ADM), with support for semistructured information and fuzzy data. Fundamental data processing operations, like joins and aggregations, are natively supported in AsterixDB. The second part of the talk focuses on our experiences while designing efficient local (per node) aggregation algorithms for AsterixDB. In particular, there are two challenges for local aggregations in a big data system: first, if the aggregation is group-based (like the "group-by" in SQL), the aggregation result may not fit in main memory; second, in order to allow multiple operations to be processed simultaneously, an aggregation operation should work within a strict memory budget provided by the platform.
Despite their importance and challenges, the design and evaluation of local aggregation algorithms have not received the same level of attention in the literature as those of other basic operators, such as joins. Facing a lack of "off the shelf" local aggregation algorithms for big data, we present low-level implementation details for engineering the aggregation operator, utilizing (i) sort-based, (ii) hash-based, and (iii) sort-hash-hybrid approaches. We present six algorithms, all of which work within a strictly bounded memory budget and can easily adapt between in-memory and external processing. Among them, two are novel and four are based on extending existing join algorithms. We deployed all algorithms as operators in the Hyracks platform and evaluated their performance through extensive experimentation. Our experiments cover many different performance factors, including input cardinality, memory, data distribution, and hash table structure. Our study guided our selection of the local aggregation algorithms supported in the recent release of AsterixDB, namely: the hybrid-hash Pre-Partitioning algorithm for its tolerance of errors in estimating the input grouping key cardinality, the Hash-Sort algorithm for its good performance when aggregating skewed data, and the Sort-Based algorithm when the input data is already sorted. This local aggregation work is the first part of a two-part big data aggregation study, as it addresses the "map" phase. Our findings provide the foundation for the global aggregation strategy we are currently investigating for the "reduce" phase. We hope our experience can help developers of other big data platforms to build a solid local aggregation operator.
Starting Page	1
Ending Page	2
Page Count	2
File Format	PDF
ISBN	9781450324120
DOI	10.1145/2513190.2517827
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2013-10-28
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Aggregation; Big data management system
Content Type	Text
Resource Type	Article
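
The abstract names two constraints on local aggregation: the grouped result may not fit in main memory, and the operator must stay within a fixed memory budget. The following is a minimal Python sketch of one generic way to respect both, a hash-based group-by that spills overflow records to temporary partition files and aggregates them recursively. It is not one of the six algorithms from the talk and not AsterixDB or Hyracks code; the names `bounded_group_sum`, `max_groups`, and `num_partitions` are hypothetical, and the budget is modeled simply as a cap on the number of groups held in memory.

```python
import pickle
import tempfile


def _read_spill(spill_file):
    """Yield the (key, value) records previously pickled into a spill file."""
    while True:
        try:
            yield pickle.load(spill_file)
        except EOFError:
            return


def bounded_group_sum(records, max_groups, num_partitions=4, level=0):
    """Yield (key, sum) pairs while holding at most `max_groups` keys in memory.

    Illustrative sketch only: `max_groups` stands in for a memory budget.
    """
    if max_groups < 1:
        raise ValueError("max_groups must be at least 1")

    table = {}                        # in-memory hash table of partial sums
    spills = [None] * num_partitions  # lazily created temporary spill files

    for key, value in records:
        if key in table:
            table[key] += value       # key is resident: aggregate in place
        elif len(table) < max_groups:
            table[key] = value        # budget allows a new resident group
        else:
            # Budget exhausted: route the record to a partition by hash so
            # that all records of the same key land in the same spill file.
            p = hash((level, key)) % num_partitions
            if spills[p] is None:
                spills[p] = tempfile.TemporaryFile()
            pickle.dump((key, value), spills[p])

    # Resident groups are complete: a key never spills once it is resident,
    # and a spilled key can never become resident because the table is full.
    yield from table.items()
    table.clear()                     # release memory before recursing

    # Each spill holds fewer distinct keys than the input, so this terminates.
    for spill_file in spills:
        if spill_file is not None:
            spill_file.seek(0)
            yield from bounded_group_sum(
                _read_spill(spill_file), max_groups, num_partitions, level + 1
            )
            spill_file.close()


records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("d", 6)]
totals = dict(bounded_group_sum(records, max_groups=2))
```

With `max_groups=2`, the keys `a` and `b` are aggregated in memory, while the records for `c` and `d` are spilled and aggregated in a second pass. A production operator would budget in bytes or frames rather than in group counts.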