NDLI: Using the structure of Web sites for automatic segmentation of tables

The next database revolution

Adaptive stream resource management using Kalman Filters

BLAS: an efficient XPath processing system

FleXPath: flexible structure and full-text querying for XML

Identifying similarities, periodicities and bursts for online search queries

Optimization of query streams using semantic prefetching

Lazy query evaluation for Active XML

A bi-level Bernoulli scheme for database sampling

Transaction support for indexed summary views

Constraint-based XML query rewriting for data integration

Adaptive ordering of pipelined stream filters

Clustering objects on a spatial network

Implementing a scalable XML publish/subscribe system using relational database systems

The price of validity in dynamic networks

Extending query rewriting techniques for fine-grained access control

Indexing spatio-temporal trajectories with Chebyshev polynomials

CORDS: automatic discovery of correlations and soft functional dependencies

Joining interval data in relational databases

TOSS: an extension of TAX with ontologies and similarity queries

Efficient set joins on similarity predicates

When one sample is not enough: improving text database selection using shrinkage

Toward a progress indicator for database queries

Relaxed currency and consistency: how to say "good enough" in SQL

Query sampling in DB2 Universal Database

Data densification in a relational database system

Models for Web Services transactions

SoundCompass: a practical query-by-humming system; normalization of scalable and shiftable time-series data and effective subsequence generation

Requirements and policy challenges in highly secure environments

XML in the middle: XQuery in the WebLogic Platform

Declarative specification of Web applications exploiting Web services and workflows

Knocking the door to the deep Web: integrating Web query interfaces

MAIDS: mining alarming incidents from data streams

PIPES: a public infrastructure for processing and exploring streams

P2P-DIET: an extensible P2P service that unifies ad-hoc and continuous querying in super-peer networks

XSeq: an indexing infrastructure for tree pattern queries

"Share your data, keep your secrets."

LexEQUAL: multilexical matching operator in SQL

Rethinking the conference reviewing process

Tools for design of composite Web services

Security of shared data in large systems: state of the art and research directions

Fast algorithms for time series with applications to finance, physics, music, biology, and other suspects

Indexing and mining streams

The role of cryptography in database security

Online event-driven subsequence matching over financial data streams

Efficient processing of XML twig queries with OR-predicates

An interactive clustering-based approach to integrating source query interfaces on the deep Web

FARMER: finding interesting rule groups in microarray datasets

Buffering database operations for enhanced instruction cache performance

Data stream management for historical XML data

Effective use of block-level sampling in statistics estimation

Graph indexing: a frequent structure-based approach

iMAP: discovering complex semantic matches between database schemas

Static optimization of conjunctive queries with sliding windows over infinite streams

Computing Clusters of Correlation Connected objects

Incremental maintenance of XML structural indexes

Compressing historical information in sensor networks

Order preserving encryption for numeric data

Prediction and indexing of moving objects with unknown motion patterns

Robust query processing through progressive optimization

Approximation techniques for spatial data

Information-theoretic tools for mining database structure from large data sets

Automatic categorization of query results

On the integration of structure indexes and inverted lists

Estimating progress of execution for SQL queries

Highly available, fault-tolerant, parallel dataflows

Query processing for SQL updates

Hosting the .NET Runtime in Microsoft SQL server

Enabling sovereign information sharing using Web Services

Model-driven business UI based on maps

Information assurance technical challenges

ORDPATHs: insert-friendly XML node labels

Yoo-Hoo!: building a presence service with XQuery and WSDL

Efficient development of data migration transformations

FAÇADE: a fast and effective approach to the discovery of dense clusters in noisy spatial data

Web-CAM: monitoring the dynamic Web to respond to continual queries

Querying at Internet scale

A TeXQuery-based XML full-text search engine

Managing healthcare data hippocratically

ITQS: an integrated transport query system

Holistic UDAFs at streaming speeds

Tree logical classes for efficient evaluation of XQuery

Understanding Web query interfaces: best-effort parsing with hidden syntax

Diamond in the rough: finding Hierarchical Heavy Hitters in multi-dimensional data

Rank-aware query optimization

Colorful XML: one hierarchy isn't enough

Online maintenance of very large random samples

The Priority R-tree: a practically efficient and worst-case optimal R-tree

Adapting to source properties in processing data integration queries

Dynamic plan migration for continuous queries over data streams

Incremental and effective data summarization for dynamic hierarchical clustering

Incremental evaluation of schema-directed XML publishing

Efficient query reformulation in peer data management systems

A formal analysis of information disclosure in data exchange

SINA: scalable incremental processing of continuous queries in spatio-temporal databases

Canonical abstraction for outerjoin optimization

Spatially-decaying aggregation over a network: model and algorithms

Parallel SQL execution in Oracle 10g

Vertical and horizontal percentage aggregations

Building dynamic application networks with Web Services

dbSwitch™: towards a database utility

Service-oriented BI: towards tight integration of business intelligence into operational applications

Liquid data for WebLogic: integrating enterprise data and services

Load management and high availability in the Medusa distributed stream processing system

Support for relaxed currency and consistency constraints in MTCache

BODHI: a database habitat for bio-diversity information

Using the structure of Web sites for automatic segmentation of tables

Cost-based labeling of groups of mass spectra

Fast computation of database operations using graphics processors

Approximate XML query answers

Conditional selectivity for statistics on query expressions

Integrating vertical and horizontal partitioning into automated physical database design

Secure XML querying with security views

STRIPES: an efficient index for predicted trajectories

Secure, reliable, transacted: innovation in Web Services architecture

StreaMon: an adaptive engine for stream query processing

An indexing framework for peer-to-peer systems

CAMAS: a citizen awareness system for crisis mitigation

Using the structure of Web sites for automatic segmentation of tables

Content Provider	ACM Digital Library
Author	Minton, Steven; Knoblock, Craig; Lerman, Kristina; Getoor, Lise
Abstract	Many Web sites, especially those that dynamically generate HTML pages to display the results of a user's query, present information in the form of lists or tables. Current tools that allow applications to programmatically extract this information rely heavily on user input, often in the form of labeled extracted records. The sheer size and rate of growth of the Web make any solution that relies primarily on user input infeasible in the long term. Fortunately, many Web sites contain much explicit and implicit structure, both in layout and content, that we can exploit for the purpose of information extraction. This paper describes an approach to automatic extraction and segmentation of records from Web tables. Automatic methods do not require any user input, but rely solely on the layout and content of the Web source. Our approach relies on the common structure of many Web sites, which present information as a list or a table, with a link in each entry leading to a detail page containing additional information about that item. We describe two algorithms that use redundancies in the content of table and detail pages to aid in information extraction. The first algorithm encodes additional information provided by detail pages as constraints and finds the segmentation by solving a constraint satisfaction problem. The second algorithm uses probabilistic inference to find the record segmentation. We show how each approach can exploit the Web site structure in a general, domain-independent manner, and we demonstrate the effectiveness of each algorithm on a set of twelve Web sites.
Starting Page	119
Ending Page	130
Page Count	12
File Format	PDF
ISBN	1581138598
DOI	10.1145/1007568.1007584
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2004-06-13
Publisher Place	New York
Access Restriction	Subscribed
Content Type	Text
Resource Type	Article
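
The constraint-based idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `segment_table`, the plain substring test, and the two constraints used here (an extract may only belong to a record whose detail page contains it, and record indices never decrease along the table) are assumptions chosen for illustration, and the sample data is invented.

```python
from typing import List, Optional


def segment_table(extracts: List[str], detail_pages: List[str]) -> Optional[List[int]]:
    """Assign each table extract to a record index, or return None.

    Constraints assumed for this sketch (not taken from the paper's text):
      1. an extract may only be assigned to a record whose detail page
         contains it as a substring;
      2. record indices never decrease along the table, so every record
         is a contiguous run of extracts.
    """
    # candidates[i] lists the records whose detail page mentions extract i.
    candidates = [
        [r for r, page in enumerate(detail_pages) if text in page]
        for text in extracts
    ]

    assignment: List[int] = []

    def backtrack(i: int, min_record: int) -> bool:
        # All extracts placed: a consistent segmentation was found.
        if i == len(extracts):
            return True
        for r in candidates[i]:
            if r < min_record:  # would break the ordering constraint
                continue
            assignment.append(r)
            if backtrack(i + 1, r):
                return True
            assignment.pop()  # undo and try the next candidate record
        return False

    return assignment if backtrack(0, 0) else None


# Hypothetical data: "New York" appears on both detail pages, so only the
# ordering constraint decides which record each occurrence belongs to.
extracts = ["John Smith", "New York", "555-1234",
            "Jane Doe", "New York", "555-9876"]
detail_pages = [
    "Name: John Smith City: New York Phone: 555-1234",
    "Name: Jane Doe City: New York Phone: 555-9876",
]
segments = segment_table(extracts, detail_pages)
print(segments)  # [0, 0, 0, 1, 1, 1]
```

The ambiguous extract shows why content redundancy alone is not enough: it is the combination with the table's ordering that pins down a segmentation. The probabilistic algorithm mentioned in the abstract is not sketched here.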
