NDLI: Query Selection Techniques for Efficient Crawling of Structured Web Sources

Content Provider	IEEE Xplore Digital Library
Author	Ping Wu; Ji-Rong Wen; Huan Liu; Wei-Ying Ma
Copyright Year	2006
Description	Author affiliation: University of California, Santa Barbara (Ping Wu)
Abstract	High-quality structured data from structured Web sources is invaluable for many applications. Hidden Web databases are not directly crawlable by Web search engines and are accessible only through Web query forms or Web service interfaces. Recent research has focused on understanding these Web query forms. A critical but still largely unresolved question is how to efficiently acquire the structured information inside Web databases by iteratively issuing meaningful queries. In this paper we focus on the central issue of enabling efficient Web database crawling through query selection, i.e., how to select good queries to rapidly harvest data records from Web databases. We model each structured Web database as a distinct attribute-value graph. Under this theoretical framework, the database crawling problem is transformed into a graph traversal problem that follows "relational" links. We show that finding an optimal query selection plan is equivalent to finding a Minimum Weighted Dominating Set of the corresponding database graph, a well-known NP-complete problem. We propose a suite of query selection techniques aimed at optimizing the query harvest rate. Extensive experimental evaluations over real Web sources and simulations over controlled database servers validate the effectiveness of our techniques and provide insights for future efforts in this
Starting Page	47
Ending Page	47
File Size	317025
Page Count	1
File Format	PDF
ISBN	0769525709
DOI	10.1109/ICDE.2006.124
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2006-04-03
Publisher Place	USA
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Web services; Relational databases; Data acquisition; Asia; Web search; Search engines; Crawlers; NP-complete problem; Abstracts; Probes
Content Type	Text
Resource Type	Article
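
The abstract's framing (a database modelled as an attribute-value graph, with query selection reduced to a Minimum Weighted Dominating Set) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy records, the function names, the use of the per-value record count as the query cost, and the generic greedy ratio heuristic, which is a standard approximation for dominating sets and not necessarily one of the techniques the paper proposes.

```python
from collections import defaultdict
from itertools import combinations


def build_attribute_value_graph(records):
    """Build an attribute-value graph from a list of dict records.

    Vertices are (attribute, value) pairs. Two vertices are linked when they
    co-occur in at least one record. weight[v] counts the records containing
    v and serves here as an ASSUMED proxy for the cost of querying v.
    """
    adjacency = defaultdict(set)
    weight = defaultdict(int)
    for record in records:
        vertices = list(record.items())
        for v in vertices:
            adjacency[v]  # touch the key so isolated vertices are kept
            weight[v] += 1
        for u, v in combinations(vertices, 2):
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency), dict(weight)


def greedy_query_selection(adjacency, weight):
    """Greedy heuristic for a weighted dominating set (illustrative only).

    Repeatedly picks the vertex that dominates the most not-yet-dominated
    vertices per unit of weight, until every vertex is dominated.
    """
    candidates = sorted(adjacency)  # fixed order makes tie-breaking stable
    undominated = set(adjacency)
    chosen = []
    while undominated:
        def gain_per_cost(v):
            newly_covered = ({v} | adjacency[v]) & undominated
            return len(newly_covered) / weight[v]

        best = max(candidates, key=gain_per_cost)
        chosen.append(best)
        undominated -= {best} | adjacency[best]
    return chosen


# Hypothetical toy database; the attribute names and values are made up.
toy_records = [
    {"author": "Wu", "venue": "ICDE", "year": "2006"},
    {"author": "Wen", "venue": "ICDE", "year": "2006"},
    {"author": "Liu", "venue": "KDD", "year": "2005"},
]
graph, cost = build_attribute_value_graph(toy_records)
queries = greedy_query_selection(graph, cost)
print(queries)
```

Each selected vertex stands for one query value to submit. Because the selected set dominates the graph, every attribute value in the toy database is either queried directly or co-occurs with a queried value in some returned record.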


Similar Documents

Using a search engine to query a relational database

A New Framework for Domain-Specific Hidden Web Crawling Based on Data Extraction Techniques

Web search engines. Part 1

An ontology-based integration of Web query interfaces for house search

Performance Optimization of Focused Web Crawling Using Content Block Segmentation

A Memory Efficient Approach for Crawling Language Specific Web: The Arabic Web as a Case Study

AKSHR: A novel framework for a Domain-specific Hidden Web Crawler

PyBot: An Algorithm for Web Crawling

Query Selection Techniques for Efficient Crawling of Structured Web Sources