NDLI: InfoGather: entity augmentation and attribute discovery by holistic matching with web tables

InfoGather: entity augmentation and attribute discovery by holistic matching with web tables

Content Provider	ACM Digital Library
Author	Yakout, Mohamed; Chaudhuri, Surajit; Chakrabarti, Kaushik; Ganjam, Kris
Abstract	The Web contains a vast corpus of HTML tables, specifically entity-attribute tables. We present three core operations, namely entity augmentation by attribute name, entity augmentation by example, and attribute discovery, that are useful for "information gathering" tasks (e.g., researching products or stocks). We propose to use a web table corpus to perform them automatically. We require the operations to have high precision and coverage, have fast (ideally interactive) response times, and be applicable to any arbitrary domain of entities. The naive approach that attempts to directly match the user input with the web tables suffers from poor precision and coverage. Our key insight is that we can achieve much higher precision and coverage by considering indirectly matching tables in addition to the directly matching ones. The challenge is to be robust to spuriously matched tables: we address it by developing a holistic matching framework based on topic-sensitive PageRank and an augmentation framework that aggregates predictions from multiple matched tables. We propose a novel architecture that leverages preprocessing in MapReduce to achieve extremely fast response times at query time. Our experiments on real-life datasets and 573M web tables show that our approach has (i) significantly higher precision and coverage and (ii) four orders of magnitude faster response times compared with the state-of-the-art approach.
Starting Page	97
Ending Page	108
Page Count	12
File Format	PDF
ISBN	9781450312479
DOI	10.1145/2213836.2213848
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2012-05-20
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Augmentation; Page rank; Data integration; Web application; Web tables
Content Type	Text
Resource Type	Article
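
The abstract names two ideas, topic-sensitive PageRank over matched tables and aggregation of predictions from several tables, but this record gives no algorithmic detail. The following is a minimal sketch of generic personalized PageRank followed by a score-weighted vote, not the paper's implementation. The table names (`t1` to `t5`), the edge weights, the seed choice, the attribute values, and the helper names `topic_sensitive_pagerank` and `aggregate_predictions` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def topic_sensitive_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Personalized PageRank by power iteration.

    edges: dict node -> dict of neighbor -> non-negative weight
    seeds: dict node -> non-negative preference weight (not all zero)
    """
    nodes = set(edges) | set(seeds)
    for neighbors in edges.values():
        nodes.update(neighbors)

    # Teleport distribution: the walk restarts only at seed nodes.
    total_seed = sum(seeds.values())
    teleport = {n: seeds.get(n, 0.0) / total_seed for n in nodes}

    scores = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            out = edges.get(n, {})
            out_total = sum(out.values())
            if out_total == 0:
                # Dangling node: return its mass to the seed distribution.
                for m in nodes:
                    nxt[m] += damping * scores[n] * teleport[m]
                continue
            for m, w in out.items():
                nxt[m] += damping * scores[n] * w / out_total
        scores = nxt
    return scores


def aggregate_predictions(table_values, scores):
    """Return the value whose supporting tables have the highest total score."""
    votes = defaultdict(float)
    for table, value in table_values.items():
        votes[value] += scores.get(table, 0.0)
    return max(votes, key=votes.get)


# Hypothetical table-similarity graph: t1 matches the query directly (seed),
# t2 and t3 match it only indirectly, t4 and t5 are unrelated to the seed.
edges = {
    "t1": {"t2": 0.9, "t3": 0.8},
    "t2": {"t1": 0.9, "t3": 0.7},
    "t3": {"t1": 0.8, "t2": 0.7},
    "t4": {"t5": 0.5},
    "t5": {"t4": 0.5},
}
seeds = {"t1": 1.0}
scores = topic_sensitive_pagerank(edges, seeds)

# Hypothetical attribute value each table proposes for one entity.
table_values = {"t1": "Nikon", "t2": "Canon", "t3": "Canon", "t4": "Sony"}
best = aggregate_predictions(table_values, scores)
```

In this toy graph the tables unreachable from the seed keep a score of zero, so their proposed value contributes nothing to the vote, while the two indirectly matching tables together can outweigh the single directly matching one. That is the general behaviour the abstract attributes to combining direct and indirect matches, shown here only under the assumptions stated above.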
