NDLI: Algorithms of mining intact record from isomorphic Web page

Content Provider	IEEE Xplore Digital Library
Author	Yong Qiu; Yong-Jie Lan
Copyright Year	2005
Description	Author affiliation: Sch. of Inf. & Electron. Eng., Shanghai Inst. of Bus. & Technol., China (Yong Qiu; Yong-Jie Lan)
Abstract	The huge amount of information available on the Web has attracted many research efforts into developing tools to extract data from Web pages. Many Web pages are generated automatically from an underlying database; therefore, the HTML structure of these pages is fairly specific and regular. Some existing algorithms, such as OMINI and MDR, can extract information from multi-record Web pages; their main idea is to identify the repetitive record structure automatically. However, Web pages that contain multiple records are actually directory pages, and the information in a directory page is not intact; the intact information exists in a lower-level Web page, called a detailed page. A detailed page holds the information of only one record, so it cannot be extracted using a duplicated-record-finding algorithm. To solve this problem of extracting intact information from the Web, the concept of an isomorphic Web page is introduced, and two algorithms are proposed: one for finding directories that have isomorphic Web pages, the other for mining record information from isomorphic Web pages.
Sponsorship	IEEE Syst., Man and Cybernetics Tech. Comm. on Cybernetics; Hong Kong Polytechnic Univ.; Hebei Univ.; South China Univ.; Chongqing Univ.; Sun Yat-sen Univ.; Harbin Inst. of Technol.; and Int. Univ. in Germany
Starting Page	2373
Ending Page	2378
File Size	913323
Page Count	6
File Format	PDF
ISBN	0780390911
DOI	10.1109/ICMLC.2005.1527341
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2005-08-18
Publisher Place	China
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Data mining; Web pages; Databases; Machine learning; Local area networks; Data engineering; Electronic mail; HTML; Web mining; Software systems; isomorphic webpage; WEB mining; Information Extracting; webpage; WEB
Content Type	Text
Resource Type	Article
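
Illustrative sketch (not from the paper): the abstract does not define "isomorphic Web page" or give either algorithm, so the following is only a rough illustration of the general idea under a stated assumption. Here two detailed pages are treated as isomorphic when their HTML tag sequences are identical, and the record of a page is taken to be the text chunks that differ from a sibling page generated from the same template. The names `TagSkeleton`, `is_isomorphic`, and `extract_record` are hypothetical.

```python
from html.parser import HTMLParser


class TagSkeleton(HTMLParser):
    """Collects the tag sequence and the non-empty text chunks of a page."""

    def __init__(self):
        super().__init__()
        self.tags = []   # structural skeleton, e.g. ["<html>", "<h1>", "</h1>"]
        self.texts = []  # non-empty text chunks in document order

    def handle_starttag(self, tag, attrs):
        self.tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data.strip())


def parse(html):
    """Parse an HTML string and return the filled-in TagSkeleton."""
    parser = TagSkeleton()
    parser.feed(html)
    parser.close()
    return parser


def is_isomorphic(page_a, page_b):
    """Assumed definition: identical tag sequences mean isomorphic pages."""
    return parse(page_a).tags == parse(page_b).tags


def extract_record(page, sibling_page):
    """Return the text chunks of `page` that differ from `sibling_page`.

    Text shared by both pages (labels such as "Title:") is treated as
    template text; text that differs is treated as record data.
    """
    a, b = parse(page), parse(sibling_page)
    if a.tags != b.tags or len(a.texts) != len(b.texts):
        raise ValueError("pages are not isomorphic under this sketch's definition")
    return [x for x, y in zip(a.texts, b.texts) if x != y]
```

For two detailed pages that share the labels "Title:" and "Price:" but differ in the values, `extract_record` returns only the values of the first page. Real pages would need a more tolerant comparison (optional fields, attributes, nested differences), which this sketch deliberately ignores.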

Similar Documents

Extracting Content from Web Pages Using the Sliding Window

Designing and Implementing of the Webpage Information Extracting Model Based on Tags

A new web information extracting method based on multi-coordinate

Extracting Objects from the Web

A Semantic DOM Approach for Webpage Information Extraction

Clustering for Web information hierarchy mining

Mining Collective Pair Data from the Web

Mining Web pages for data records

Using Visual Clues Concept for Extracting Main Data from Deep Web Pages

Algorithms of mining intact record from isomorphic Web page