NDLI: Joint unsupervised structure discovery and information extraction

LazyFTL: a page-level flash translation layer optimized for NAND flash memory

Query optimization techniques for partitioned tables

Scalable query rewriting: a graph-based approach

Apples and oranges: a comparison of RDF benchmarks and real RDF datasets

No free lunch in data privacy

A latency and fault-tolerance optimizer for online parallel query plans

Schedule optimization for data processing flows on the cloud

Reverse spatial and textual k nearest neighbor search

Neighborhood-privacy protected shortest distance computing in cloud

Interaction between record matching and data repairing

Hybrid in-database inference for declarative information extraction

Keyword search over relational databases: a metadata approach

Changing flights in mid-air: a model for safely modifying continuous queries

More efficient datalog queries: subsumptive tabling beats magic sets

Exact indexing for support vector machines

Context-sensitive ranking for document retrieval

Ranking with uncertain scoring functions: semantics and sensitivity measures

Graph cube: on warehousing and OLAP multidimensional networks

Neighborhood based fast graph search in large networks

Processing theta-joins using MapReduce

ATLAS: a probabilistic algorithm for high dimensional similarity search

Managing scientific data: lessons, challenges, and opportunities

LCI: a social channel analysis platform for live customer intelligence

Learning statistical models from relational data

One-pass data mining algorithms in a DBMS with UDFs

SkylineSearch: semantic ranking and result visualization for PubMed

Pay-as-you-go mapping selection in dataspaces

CONFLuEnCE: CONtinuous workFLow ExeCution Engine

Operation-aware buffer management in flash-based systems

CrowdDB: answering queries with crowdsourcing

Automatic discovery of attributes in relational databases

Efficient query answering in probabilistic RDF graphs

TrustedDB: a trusted hardware based database with privacy and data confidentiality

ArrayStore: a storage manager for complex parallel array processing

Zephyr: live migration in shared nothing databases for elastic cloud platforms

Location-aware type ahead search on spatial databases: semantics and efficiency

On k-skip shortest paths

We challenge you to certify your updates

Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction

Sharing work in keyword search over databases

How soccer players would do stream joins

Entangled queries: enabling declarative data-driven coordination

Local graph sparsification for scalable clustering

Score-consistent algebraic optimization of full-text search queries with GRAFT

Querying uncertain data with aggregate constraints

MaSM: efficient online updates in data warehouses

A memory efficient reachability data structure through bit vector compression

Llama: leveraging columnar storage for scalable join processing in the MapReduce framework

Flexible aggregate similarity search

Internet scale storage

Bistro data feed management system

Web data management

Inspector gadget: a framework for custom monitoring and debugging of distributed dataflows

A cross-service travel engine for trip planning

Exelixis: evolving ontology-based data integration system

Demonstration of Qurk: a query processor for human operators

SkimpyStash: RAM space skimpy key-value store on flash-based storage

Skyline query processing over joins

Leveraging query logs for schema mapping generation in U-MAP

Facet discovery for structured web search: a query-log mining approach

Differentially private data cubes: optimizing noise sources and consistency

Fast checkpoint recovery algorithms for frequently consistent applications

Workload-aware database monitoring and consolidation

Collective spatial keyword querying

Finding shortest path on land surface

Labeling recursive workflow executions on-the-fly

Joint unsupervised structure discovery and information extraction

Nearest keyword search in XML documents

BE-tree: an index structure to efficiently match boolean expressions over high-dimensional discrete space

Data generation using declarative constraints

Advancing data clustering via projective clustering ensembles

Efficient diversity-aware search

Jigsaw: efficient optimization over uncertain enterprise data

Latent OLAP: data cubes over latent variables

Incremental graph pattern matching

Fast personalized PageRank on MapReduce

Effective data co-reduction for multimedia similarity search

Apache Hadoop goes realtime at Facebook

Privacy-aware data management in information networks

RAFT at work: speeding-up MapReduce applications under task and node failures

WINACS: construction and analysis of web-based computer science information networks

U-MAP: a system for usage-based schema matching and mapping

Automatic example queries for ad hoc databases

Design and evaluation of main memory hash join algorithms for multi-core CPUs

Efficient parallel skyline processing using hyperplane projections

Designing and refining schema mappings via data examples

Schema-as-you-go: on probabilistic tagging and querying of wide tables

iReduct: differential privacy with reduced relative errors

Warding off the dangers of data corruption with amulet

Predicting cost amortization for query services

Finding semantics in time series

WHAM: a high-throughput sequence alignment method

Tracing data errors with view-conditioned causality

Attribute domain discovery for hidden web databases

Efficient and generic evaluation of ranked queries

TI: an efficient indexing mechanism for real-time search on tweets

Efficient auditing for complex SQL queries

Sampling based algorithms for quantile computation in sensor networks

Mining a search engine's corpus: efficient yet unbiased sampling and aggregate estimation

Sensitivity analysis and explanations for robust query evaluation in probabilistic databases

E-Cube: multi-dimensional event sequence analysis using hierarchical pattern query sharing

Assessing and ranking structural correlations in graphs

A platform for scalable one-pass analytics using MapReduce

Efficient exact edit similarity query processing with the asymmetric signature scheme

Nova: continuous Pig/Hadoop workflows

Large-scale copy detection

WattDB: an energy-proportional cluster of wimpy nodes

Tweets as data: demonstration of TweeQL and Twitinfo

The SystemT IDE: an integrated development environment for information extraction rules

NetTrails: a declarative platform for maintaining and querying provenance in distributed systems

Performance prediction for concurrent database workloads

Querying contract databases based on temporal behavior

A new approach for processing ranked subsequence matching based on ranked union

A Hadoop based distributed loading approach to parallel data warehouses

Data management over flash memory

BRRL: a recovery library for main-memory applications in the cloud

MOBIES: mobile-interface enhancement service for hidden web database

ProApproX: a lightweight approximation query processor over probabilistic trees

GBLENDER: visual subgraph query formulation meets query processing

A batch of PNUTS: experiences connecting cloud batch and serving systems

Datalog and emerging applications: an interactive tutorial

A data-oriented transaction execution engine and supporting tools

Search computing: multi-domain search on ranked data

$SPROUT^{2}$: a squared query engine for uncertain web data

Coordination through querying in the youtopia system

Turbocharging DBMS buffer pool using SSDs

iGraph in action: performance analysis of disk-based graph indexing techniques

EnBlogue: emergent topic detection in web 2.0 streams

Fuzzy prophet: parameter exploration in uncertain enterprise scenarios

DBWiki: a structured wiki for curated data and collaborative data management

Online reorganization in read optimized MMDBS

StreamRec: a real-time recommender system

NOAM: news outlets analysis and monitoring system

LinkDB: a probabilistic linkage database system

Rapid development of web-based query interfaces for XML datasets with QURSED

Automated partitioning design in parallel database systems

Oracle database filesystem

Emerging trends in the enterprise data analytics: connecting Hadoop and DB2 warehouse

Efficient processing of data warehousing queries in a split execution environment

SQL Server column store indexes

An analytic data engine for visualization in Tableau

Joint unsupervised structure discovery and information extraction

Content Provider	ACM Digital Library
Author	da Silva, Altigran S.; Laender, Alberto H.F.; de Moura, Edleno S.; Oliveira, Daniel; Cortez, Eli
Abstract	In this paper we present JUDIE (Joint Unsupervised Structure Discovery and Information Extraction), a new method for automatically extracting semi-structured data records that appear as continuous text (e.g., bibliographic citations, postal addresses, classified ads, etc.) with no explicit delimiters between them. While in state-of-the-art Information Extraction methods the structure of the data records is manually supplied by the user as a training step, JUDIE is capable of detecting the structure of each individual record being extracted without any user assistance. This is accomplished by a novel Structure Discovery algorithm that, given a sequence of labels representing attributes assigned to potential values, groups these labels into individual records by looking for frequent patterns of label repetitions in the given sequence. We also show how to integrate this algorithm into the information extraction process by means of successive refinement steps that alternate information extraction and structure discovery. Through an extensive experimental evaluation with different datasets in distinct domains, we compare JUDIE with state-of-the-art information extraction methods and conclude that, even without any user intervention, it is able to achieve high quality results on the tasks of discovering the structure of the records and extracting information from them.
Starting Page	541
Ending Page	552
Page Count	12
File Format	PDF
ISBN	9781450306614
DOI	10.1145/1989323.1989380
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2011-06-12
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Data management; Information extraction; Text segmentation
Content Type	Text
Resource Type	Article
