NDLI: QA-Pagelet: data preparation techniques for large-scale data analysis of the deep Web

Content Provider	IEEE Xplore Digital Library
Author	Caverlee, J.; Liu, L.
Copyright Year	2005
Abstract	This paper presents the QA-Pagelet as a fundamental data preparation technique for large-scale data analysis of the deep Web. To support QA-Pagelet extraction, we present the Thor framework for sampling, locating, and partitioning the QA-Pagelets from the deep Web. Two unique features of the Thor framework are 1) the novel page clustering for grouping pages from a deep Web source into distinct clusters of control-flow dependent pages and 2) the novel subtree filtering algorithm that exploits the structural and content similarity at subtree level to identify the QA-Pagelets within highly ranked page clusters. We evaluate the effectiveness of the Thor framework through experiments using both simulation and real data sets. We show that Thor performs well over millions of deep Web pages and over a wide range of sources, including e-commerce sites, general and specialized search engines, corporate Web sites, medical and legal resources, and several others. Our experiments also show that the proposed page clustering algorithm achieves low-entropy clusters, and the subtree filtering algorithm identifies QA-Pagelets with excellent precision and recall.
Sponsorship	IEEE; IEEE Comput. Soc. Tech. Committee on Data Eng; IEEE Computer Society
Starting Page	1247
Ending Page	1262
Page Count	16
File Size	1484632
File Format	PDF
ISSN	10414347
Volume Number	17
Issue Number	9
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2005-09-01
Publisher Place	U.S.A.
Access Restriction	One Nation One Subscription (ONOS)
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Large-scale systems; Data analysis; Filtering algorithms; Data mining; Sampling methods; Medical simulation; Web pages; Search engines; Law; Legal factors; clustering. Index Terms: Deep Web; data preparation; data extraction; pagelets
Content Type	Text
Resource Type	Article
Subject	Information Systems; Computational Theory and Mathematics; Computer Science Applications

Similar Documents

QA-Pagelet: Data Preparation Techniques for Large-Scale Data Analysis of the Deep Web (2005)

Probe, cluster, and discover: focused extraction of QA-Pagelets from the deep Web

Automatic Data Records Extraction from List Page in Deep Web Sources

Ranking Web Search Results from Personalized Perspective

Use link-based clustering to improve Web search results

Automatic identification of informative sections of Web pages

An Effective Schema Extraction Algorithm on the Deep Web

Personalized web search by generating and mapping two user profiles

QA-Pagelet: data preparation techniques for large-scale data analysis of the deep Web