NDLI: Fusing data with correlations

Edgar F. Codd Innovations Award Talk

How I learned to stop worrying and love compilers

PLANET: making progress with commit processing in unpredictable environments

HYDRA: large-scale social identity linkage via heterogeneous behavior modeling

Density-based place clustering in geo-social networks

How to stop under-utilization and love multicores

AutoPlait: automatic mining of co-evolving time sequences

Towards indexing functions: answering scalar product queries

TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing

Orca: a modular query optimizer architecture for big data

Parallel data analysis directly on scientific file formats

Fusing data with correlations

Knowing when you're wrong: building fast and reliable approximate query processing systems

Durable write cache in flash memory SSD for relational and NoSQL databases

Fun with hardware transactional memory

CrowdFill: collecting structured data from the crowd

Efficient cohesive subgraphs detection in parallel

Knowledge expansion over probabilistic knowledge bases

Versatile optimization of UDF-heavy data flows with sofa

Cloud-based RDF data management

Patience is a virtue: revisiting merge and sort on modern processors

Which concepts are worth extracting?

Scalable big graph processing in MapReduce

Demonstration of the Myria big data management service

Should we all be teaching "intro to data science" instead of "intro to databases"?

Characterizing and selecting fresh data sources

The pursuit of a good possible world: extracting representative instances of uncertain graphs

Complete yet practical search for minimal query reformulations under constraints

iCheck: computationally combating "lies, d--ned lies, and statistics"

H2O: a hands-free adaptive store

Fast and unified local search for random walk based k-nearest-neighbor query in large graphs

Modeling entity evolution for temporal record matching

HAWQ: a massively parallel processing SQL engine in hadoop

Querying encrypted data

Towards unified ad-hoc data processing

Querying k-truss community in large and dynamic graphs

Online optimization and fair costing for dynamic data sharing in a cloud data market

Are we experiencing a big data bubble?

Mining latent entity structures from massive unstructured and interconnected data

Explainable security for relational databases

Overlap interval partition join

Tracking set correlations at large scale

Indexing for interactive exploration of big data series

Efficient top-K SimRank-based similarity join

SIGMOD Jim Gray Doctoral Dissertation Award Talk

Lazy evaluation of transactions in database systems

In search of influential event organizers in online social networks

Hypersphere dominance: an optimal approach

Druid: a real-time analytical data store

Resource-oriented approximation for frequent itemset mining from bursty data streams

LINVIEW: incremental view maintenance for complex analytical queries

Querying big graphs within bounded resources

Parallel I/O aware query optimization

The PH-tree: a space-efficient storage structure and multi-dimensional index

Descriptive and prescriptive data cleaning

Discovering queries based on example tuples

Fast database restarts at facebook

OASSIS: query driven crowd mining

Parallel subgraph listing in a large-scale graph

InsightNotes: summary-based annotation management in relational databases

ERIS live: a NUMA-aware in-memory storage engine for tera-scale multiprocessor systems

Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age

Querying virtual hierarchies using virtual prefix-based numbers

Anti-combining for MapReduce

DataSift: a crowd-powered search toolkit

Sloth: being lazy is a virtue (when issuing database queries)

Navigating the maze of graph analytics frameworks using massive graph datasets

Query shredding: efficient relational evaluation of queries over nested multisets

ABS: a system for scalable approximate queries with accuracy guarantees

Fine-grained partitioning for aggressive data skipping

Global immutable region computation

Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation

Major technical advancements in apache hive

Partial results in database systems

Reachability queries on large dynamic graphs: a total order approach

A comparison of platforms for implementing and running very large scale machine learning algorithms

PrivBayes: private data release via bayesian networks

Similarity joins for uncertain strings

Aggregate estimation over a microblog platform

Histograms as a side effect of data movement for big data

Multi-dimensional data statistics for columnar in-memory databases

SIGMOD Jim Gray Doctoral Dissertation Award Talk

Scalable atomic visibility with RAMP transactions

Influence maximization: near-optimal time complexity meets practical efficiency

Efficient algorithms for optimal location queries in road networks

The next generation operational data historian for IoT based on informix

On complexity and optimization of expensive queries in complex event processing

Materialization optimizations for feature selection workloads

Natural language question answering over RDF: a graph data driven approach

Exploiting ordered dictionaries to efficiently construct histograms with q-error guarantees in SAP HANA

Incremental elasticity for array databases

Towards dependable data repairing with fixing rules

Interactive data exploration using semantic windows

SpongeFiles: mitigating data skew in mapreduce using distributed memory

Corleone: hands-off crowdsourcing for entity matching

OPT: a new framework for overlapped and parallel triangulation in large-scale graphs

A pivotal prefix based filtering algorithm for string similarity search

Demonstrating efficient query processing in heterogeneous environments

A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort

NLyze: interactive programming by natural language for spreadsheet data analysis and manipulation

Opportunistic physical design for big data analytics

Reactive and proactive sharing across concurrent analytical queries

Dynamically optimizing queries over large scale data platforms

Local search of communities in large graphs

Plan bouquets: query processing without selectivity estimation

NADEEF/ER: generic and interactive entity resolution

DSH: data sensitive hashing for high-dimensional k-NN search

Answering top-k representative queries on graph databases

A probabilistic model for linking named entities in web text with heterogeneous information networks

JSON data management: supporting schema-less development in RDBMS

Parallel in-situ data processing with speculative loading

EAGr: supporting continuous ego-centric aggregate queries over large dynamic graphs

Re-evaluating designs for multi-tenant OLTP workloads on SSD-based I/O subsystems

PriView: practical differentially private release of marginal contingency tables

Track join: distributed joins with minimal network traffic

Tripartite graph clustering for dynamic sentiment analysis on social media

A formal approach to finding explanations for database queries

A user interaction based community detection algorithm for online social networks

JECB: a join-extension, code-based approach to OLTP data partitioning

Efficient location-aware influence maximization

Robust set reconciliation

GenBase: a complex analytics genomics benchmark

Complex event analytics: online aggregation of stream sequence patterns

The analytical bootstrap: a new method for fast error estimation in approximate query processing

Scalable similarity search for SimRank

Optimizing queries over partitioned tables in MPP systems

Efficient summarization framework for multi-attribute uncertain data

A sample-and-clean framework for fast and accurate query processing on dirty data

Explore-by-example: an automatic query steering framework for interactive data exploration

Leveraging compression in the tableau data engine

One DBMS for all: the brawny few and the wimpy crowd

An application-specific instruction set for accelerating set-oriented database primitives

Sinew: a SQL system for multi-structured data

Stratified-sampling over social networks using mapreduce

SLQ: a user-friendly graph querying system

A software-defined networking based approach for performance management of analytical queries on distributed data stores

Mining statistically significant connected subgraphs in vertex labeled graphs

Schema-free SQL

SerpentTI: flexible analytics of users, boards and domains for pinterest

Matching heterogeneous event data

Approximation schemes for many-objective query optimization

Localizing anomalous changes in time-evolving graphs

Secure query processing with data interoperability in a cloud database environment

Blowfish privacy: tuning privacy-utility trade-offs using policies

On-the-fly token similarity joins in relational databases

A temporal context-aware model for user behavior modeling in social media systems

MISO: souping up big data query processing with a multistore system

EDS: a segment-based distance measure for sub-trajectory similarity search

VQA: vertica query analyzer

TAREEG: a MapReduce-based web service for extracting spatial data from OpenStreetMap

Interactive redescription mining

Spatio-temporal visual analysis for event-specific tweets

Palette: enabling scalable analytics for big-memory, multicore machines

Searching with XQ: the exemplar query search engine

ONTOCUBO: cube-based ontology construction and exploration

PackageBuilder: querying for packages of tuples

NaLIR: an interactive natural language interface for querying relational databases

MeanKS: meaningful keyword search in relational databases with complex schema

An extendable framework for managing uncertain spatio-temporal data

Privacy preserving social graphs for high precision community detection

BabbleFlow: a translator for analytic data flow programs

H2RDF+: an efficient data management system for big RDF graphs

NewsNetExplorer: automatic construction and exploration of news information networks

Indexing on modern hardware: hekaton and beyond

DoomDB: kill the query

IQR: an interactive query relaxation system for the empty-answer problem

CrowdMatcher: crowd-assisted schema matching

OceanRT: real-time analytics over large temporal data

Fusing data with correlations

Content Provider	ACM Digital Library
Author	Pochampally, Ravali; Das Sarma, Anish; Meliou, Alexandra; Srivastava, Divesh; Dong, Xin Luna
Abstract	Many applications rely on Web data and extraction systems to accomplish knowledge-driven tasks. Web information is not curated, so many sources provide inaccurate or conflicting information. Moreover, extraction systems introduce additional noise to the data. We wish to automatically distinguish correct data from erroneous data to create a cleaner set of integrated data. Previous work has shown that a naive voting strategy that trusts data provided by the majority, or by at least a certain number of sources, may not work well in the presence of copying between the sources. However, correlation between sources can be much broader than copying: sources may provide data from complementary domains (negative correlation), extractors may focus on different types of information (negative correlation), and extractors may apply common rules in extraction (positive correlation, without copying). In this paper we present novel techniques for modeling correlations between sources and applying them in truth finding. We provide a comprehensive evaluation of our approach on three real-world datasets with different characteristics, as well as on synthetic data, showing that our algorithms outperform the existing state-of-the-art techniques.
Starting Page	433
Ending Page	444
Page Count	12
File Format	PDF
ISBN	9781450323765
DOI	10.1145/2588555.2593674
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2014-06-18
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Integration; Correlated sources; Data fusion
Content Type	Text
Resource Type	Article
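
Illustration (not from the paper): the abstract mentions a naive voting strategy that trusts values provided by a majority, or by at least a certain number, of sources, and notes that it can fail when sources copy from one another. The minimal Python sketch below shows only that baseline and its failure mode. It is not the authors' correlation-aware method. The function name `majority_vote`, the parameter `min_sources`, the source names `s1` to `s5`, and the toy triples are all hypothetical and chosen for illustration.

```python
from collections import Counter


def majority_vote(claims, min_sources=None):
    """Naive voting baseline (hypothetical helper, not the paper's algorithm).

    claims: dict mapping a source name to the set of triples it provides.
    A triple is accepted if at least `min_sources` sources provide it;
    by default the threshold is a strict majority of all sources.
    """
    # Count how many distinct sources provide each triple.
    counts = Counter(t for triples in claims.values() for t in set(triples))
    threshold = min_sources if min_sources is not None else len(claims) // 2 + 1
    return {t for t, n in counts.items() if n >= threshold}


# Toy data (assumed): s1 and s2 independently provide the correct value;
# s3 provides a wrong value, and s4 and s5 are assumed to copy s3.
correct = ("France", "capital", "Paris")
wrong = ("France", "capital", "Lyon")
claims = {
    "s1": {correct},
    "s2": {correct},
    "s3": {wrong},
    "s4": {wrong},  # copier of s3
    "s5": {wrong},  # copier of s3
}

accepted_majority = majority_vote(claims)
accepted_two = majority_vote(claims, min_sources=2)

# The copied wrong value wins the strict majority vote.
print(accepted_majority)
# With a fixed threshold of two sources, both values are accepted.
print(sorted(accepted_two))
```

In this toy setting the three votes for the wrong value come from a single original source, so counting them as independent evidence misleads the vote. This is the kind of positive correlation the abstract says a fusion method needs to account for.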
