NDLI: A Hadoop based distributed loading approach to parallel data warehouses


LazyFTL: a page-level flash translation layer optimized for NAND flash memory

Query optimization techniques for partitioned tables

Scalable query rewriting: a graph-based approach

Apples and oranges: a comparison of RDF benchmarks and real RDF datasets

No free lunch in data privacy

A latency and fault-tolerance optimizer for online parallel query plans

Schedule optimization for data processing flows on the cloud

Reverse spatial and textual k nearest neighbor search

Neighborhood-privacy protected shortest distance computing in cloud

Interaction between record matching and data repairing

Hybrid in-database inference for declarative information extraction

Keyword search over relational databases: a metadata approach

Changing flights in mid-air: a model for safely modifying continuous queries

More efficient datalog queries: subsumptive tabling beats magic sets

Exact indexing for support vector machines

Context-sensitive ranking for document retrieval

Ranking with uncertain scoring functions: semantics and sensitivity measures

Graph cube: on warehousing and OLAP multidimensional networks

Neighborhood based fast graph search in large networks

Processing theta-joins using MapReduce

ATLAS: a probabilistic algorithm for high dimensional similarity search

Managing scientific data: lessons, challenges, and opportunities

LCI: a social channel analysis platform for live customer intelligence

Learning statistical models from relational data

One-pass data mining algorithms in a DBMS with UDFs

SkylineSearch: semantic ranking and result visualization for pubmed

Pay-as-you-go mapping selection in dataspaces

CONFLuEnCE: CONtinuous workFLow ExeCution Engine

Operation-aware buffer management in flash-based systems

CrowdDB: answering queries with crowdsourcing

Automatic discovery of attributes in relational databases

Efficient query answering in probabilistic RDF graphs

TrustedDB: a trusted hardware based database with privacy and data confidentiality

ArrayStore: a storage manager for complex parallel array processing

Zephyr: live migration in shared nothing databases for elastic cloud platforms

Location-aware type ahead search on spatial databases: semantics and efficiency

On k-skip shortest paths

We challenge you to certify your updates

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

Sharing work in keyword search over databases

How soccer players would do stream joins

Entangled queries: enabling declarative data-driven coordination

Local graph sparsification for scalable clustering

Score-consistent algebraic optimization of full-text search queries with GRAFT

Querying uncertain data with aggregate constraints

MaSM: efficient online updates in data warehouses

A memory efficient reachability data structure through bit vector compression

Llama: leveraging columnar storage for scalable join processing in the MapReduce framework

Flexible aggregate similarity search

Internet scale storage

Bistro data feed management system

Web data management

Inspector gadget: a framework for custom monitoring and debugging of distributed dataflows

A cross-service travel engine for trip planning

Exelixis: evolving ontology-based data integration system

Demonstration of Qurk: a query processor for human operators

SkimpyStash: RAM space skimpy key-value store on flash-based storage

Skyline query processing over joins

Leveraging query logs for schema mapping generation in U-MAP

Facet discovery for structured web search: a query-log mining approach

Differentially private data cubes: optimizing noise sources and consistency

Fast checkpoint recovery algorithms for frequently consistent applications

Workload-aware database monitoring and consolidation

Collective spatial keyword querying

Finding shortest path on land surface

Labeling recursive workflow executions on-the-fly

Joint unsupervised structure discovery and information extraction

Nearest keyword search in XML documents

BE-tree: an index structure to efficiently match boolean expressions over high-dimensional discrete space

Data generation using declarative constraints

Advancing data clustering via projective clustering ensembles

Efficient diversity-aware search

Jigsaw: efficient optimization over uncertain enterprise data

Latent OLAP: data cubes over latent variables

Incremental graph pattern matching

Fast personalized PageRank on MapReduce

Effective data co-reduction for multimedia similarity search

Apache hadoop goes realtime at Facebook

Privacy-aware data management in information networks

RAFT at work: speeding-up mapreduce applications under task and node failures

WINACS: construction and analysis of web-based computer science information networks

U-MAP: a system for usage-based schema matching and mapping

Automatic example queries for ad hoc databases

Design and evaluation of main memory hash join algorithms for multi-core CPUs

Efficient parallel skyline processing using hyperplane projections

Designing and refining schema mappings via data examples

Schema-as-you-go: on probabilistic tagging and querying of wide tables

iReduct: differential privacy with reduced relative errors

Warding off the dangers of data corruption with amulet

Predicting cost amortization for query services

Finding semantics in time series

WHAM: a high-throughput sequence alignment method

Tracing data errors with view-conditioned causality

Attribute domain discovery for hidden web databases

Efficient and generic evaluation of ranked queries

TI: an efficient indexing mechanism for real-time search on tweets

Efficient auditing for complex SQL queries

Sampling based algorithms for quantile computation in sensor networks

Mining a search engine's corpus: efficient yet unbiased sampling and aggregate estimation

Sensitivity analysis and explanations for robust query evaluation in probabilistic databases

E-Cube: multi-dimensional event sequence analysis using hierarchical pattern query sharing

Assessing and ranking structural correlations in graphs

A platform for scalable one-pass analytics using MapReduce

Efficient exact edit similarity query processing with the asymmetric signature scheme

Nova: continuous Pig/Hadoop workflows

Large-scale copy detection

WattDB: an energy-proportional cluster of wimpy nodes

Tweets as data: demonstration of TweeQL and Twitinfo

The SystemT IDE: an integrated development environment for information extraction rules

NetTrails: a declarative platform for maintaining and querying provenance in distributed systems

Performance prediction for concurrent database workloads

Querying contract databases based on temporal behavior

A new approach for processing ranked subsequence matching based on ranked union

A Hadoop based distributed loading approach to parallel data warehouses

Data management over flash memory

BRRL: a recovery library for main-memory applications in the cloud

MOBIES: mobile-interface enhancement service for hidden web database

ProApproX: a lightweight approximation query processor over probabilistic trees

GBLENDER: visual subgraph query formulation meets query processing

A batch of PNUTS: experiences connecting cloud batch and serving systems

Datalog and emerging applications: an interactive tutorial

A data-oriented transaction execution engine and supporting tools

Search computing: multi-domain search on ranked data

SPROUT$^{2}$: a squared query engine for uncertain web data

Coordination through querying in the youtopia system

Turbocharging DBMS buffer pool using SSDs

iGraph in action: performance analysis of disk-based graph indexing techniques

EnBlogue: emergent topic detection in web 2.0 streams

Fuzzy prophet: parameter exploration in uncertain enterprise scenarios

DBWiki: a structured wiki for curated data and collaborative data management

Online reorganization in read optimized MMDBS

StreamRec: a real-time recommender system

NOAM: news outlets analysis and monitoring system

LinkDB: a probabilistic linkage database system

Rapid development of web-based query interfaces for XML datasets with QURSED

Automated partitioning design in parallel database systems

Oracle database filesystem

Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse

Efficient processing of data warehousing queries in a split execution environment

SQL server column store indexes

An analytic data engine for visualization in tableau

A Hadoop based distributed loading approach to parallel data warehouses

Content Provider	ACM Digital Library
Author	Zhao, Kevin Keliang; Xu, Yu; Qi, Yan; Kostamaa, Pekka; Wen, Jian
Abstract	One critical part of building and running a data warehouse is the ETL (Extraction, Transformation, Loading) process. In fact, the growing ETL tool market is already a multi-billion-dollar market. Getting data into data warehouses has been a hindering factor to wider potential database applications such as scientific computing, as discussed in recent panels at various database conferences. One particular problem with the current load approaches to data warehouses is that, while data are partitioned and replicated across all nodes in data warehouses powered by a parallel DBMS (PDBMS), load utilities typically reside on a single node, which faces the issues of i) data loss or unavailability if the node or its hard drives crash; ii) the file size limit on a single node; iii) load performance. All of these issues are mostly handled manually or only helped to some degree by tools. We observe that Hadoop and Teradata Enterprise Data Warehouse (EDW) have one thing in common: data in both systems are partitioned across multiple nodes for parallel computing, which creates parallel loading opportunities not possible for DBMSs running on a single node. In this paper we describe our approach of using Hadoop as a distributed load strategy to Teradata EDW. We use Hadoop as the intermediate load server to store data to be loaded to Teradata EDW. We gain all the benefits of HDFS (Hadoop Distributed File System): i) significantly increased disk space for the file to be loaded; ii) once the data is written to HDFS, the data sources need not keep the data, even before the file is loaded to Teradata EDW; iii) MapReduce programs can be used to transform and add structure to unstructured or semi-structured data; iv) more importantly, since a file is distributed in HDFS, it can be loaded more quickly in parallel to Teradata EDW, which is the main focus of this paper.
When both Hadoop and Teradata EDW coexist on the same hardware platform, as customers increasingly require because of reduced hardware and system administration costs, we have another optimization opportunity: directly loading HDFS data blocks to Teradata parallel units on the same nodes. However, due to the inherent non-uniform data distribution in HDFS, we can rarely avoid transferring HDFS blocks to remote Teradata nodes. We designed a polynomial time optimal algorithm and a polynomial time approximate algorithm to assign HDFS blocks to Teradata parallel units evenly and minimize network traffic. We performed experiments on synthetic and real data sets to compare the performance of the algorithms.
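The block assignment problem mentioned in the abstract can be illustrated with a small sketch. This is not the paper's optimal or approximate algorithm, which the abstract does not specify; it is a simple greedy heuristic under assumed inputs. The function name `assign_blocks`, the `replicas` mapping (block id to the set of nodes holding a copy), and the per-node cap of ceil(blocks / nodes) are all hypothetical choices made for illustration.

```python
from math import ceil


def assign_blocks(replicas, nodes):
    """Greedy sketch (an assumption, not the paper's method).

    replicas: dict mapping block id -> set of nodes that hold a copy.
    nodes:    list of node names, each hosting parallel units.
    Returns (assignment, remote_count), where assignment maps each block
    to a node and remote_count is the number of blocks sent over the network.
    """
    cap = ceil(len(replicas) / len(nodes))  # even-load cap per node
    load = {n: 0 for n in nodes}
    assignment = {}
    remote = 0
    # Handle the most constrained blocks (fewest replicas) first.
    for block in sorted(replicas, key=lambda b: (len(replicas[b]), b)):
        # Nodes that hold the block locally and still have capacity.
        local = [n for n in nodes if n in replicas[block] and load[n] < cap]
        if local:
            target = min(local, key=lambda n: load[n])
        else:
            # No local node has room: ship the block to the least-loaded node.
            target = min(nodes, key=lambda n: load[n])
            remote += 1
        assignment[block] = target
        load[target] += 1
    return assignment, remote


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3"]
    replicas = {
        "b1": {"n1", "n2"},
        "b2": {"n1"},
        "b3": {"n1"},
        "b4": {"n1", "n3"},
        "b5": {"n2"},
        "b6": {"n1"},
    }
    assignment, remote = assign_blocks(replicas, nodes)
    print(assignment, remote)
```

The heuristic keeps every node at or below the cap, so the load stays even, and it counts a remote transfer only when every node holding a copy is already full. It gives no optimality guarantee on network traffic.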
Starting Page	1091
Ending Page	1100
Page Count	10
File Format	PDF
ISBN	9781450306614
DOI	10.1145/1989323.1989440
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2011-06-12
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Parallel DBMS; Data load; Hadoop
Content Type	Text
Resource Type	Article
