NDLI: Web Information Extraction by HTML Tree Edit Distance Matching

Content Provider	IEEE Xplore Digital Library
Author	Yeonjung Kim; Jeahyun Park; Taehwan Kim; Joongmin Choi
Copyright Year	2007
Description	Author affiliation: Hanyang Univ., Seoul (Yeonjung Kim; Jeahyun Park; Taehwan Kim; Joongmin Choi)
Abstract	The main issue in effective Web information extraction is how to recognize similar patterns in a Web page. Pattern matching on the HTML DOM tree has been shown to be more efficient than simple string matching. Nonetheless, previous tree-based pattern matching methods assume that all HTML tags are equally important and assign the same weight to every node in the HTML tree. This paper proposes an enhanced tree matching algorithm that improves the tree edit distance method by taking the characteristics of HTML features into account. Different HTML tree nodes are assigned different values according to their weight in displaying the corresponding data objects in the browser. HTML patterns are matched by computing the maximum mapping value of two HTML trees built with weighted node values from HTML data objects. Experiments on several Web commerce sites evaluate the effectiveness of the proposed HTML tree matching algorithm.
Starting Page	2455
Ending Page	2460
File Size	430349
Page Count	6
File Format	PDF
ISBN	0769530389
DOI	10.1109/ICCIT.2007.19
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2007-11-21
Publisher Place	South Korea
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Computer science; Web pages; Vegetation mapping; HTML; Pattern recognition; Dynamic programming; Data mining; Information technology; Pattern matching; Business
Content Type	Text
Resource Type	Article
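
The abstract describes matching two HTML trees by a maximum mapping value computed over weighted nodes. The record does not give the authors' weighting scheme or recurrence, so the following is only a minimal sketch of the general idea: a weighted variant of the well-known Simple Tree Matching dynamic program, not the paper's algorithm. The `Node` class, the `TAG_WEIGHTS` values, and the `similarity` normalization are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical tag weights (NOT taken from the paper): tags that shape how a
# data record is laid out count more than inline formatting tags.
TAG_WEIGHTS = {"table": 3.0, "tr": 2.0, "td": 2.0, "ul": 2.0, "li": 2.0,
               "div": 1.5, "a": 1.0, "img": 1.0, "b": 0.5, "i": 0.5}
DEFAULT_WEIGHT = 1.0


@dataclass
class Node:
    """A bare-bones ordered tree node holding only an HTML tag name."""
    tag: str
    children: list = field(default_factory=list)


def weight(node):
    return TAG_WEIGHTS.get(node.tag, DEFAULT_WEIGHT)


def weighted_match(a, b):
    """Maximum weighted mapping value of two ordered trees.

    Returns 0.0 when the root tags differ, as in Simple Tree Matching.
    """
    if a.tag != b.tag:
        return 0.0
    m, n = len(a.children), len(b.children)
    # table[i][j]: best value mapping the first i children of a
    # onto the first j children of b, preserving sibling order.
    table = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1]
                + weighted_match(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + weight(a)


def tree_weight(node):
    """Total weight of all nodes in the tree."""
    return weight(node) + sum(tree_weight(c) for c in node.children)


def similarity(a, b):
    """Mapping value normalized by the average total weight of both trees."""
    return 2.0 * weighted_match(a, b) / (tree_weight(a) + tree_weight(b))


# Two table rows that differ only in an inline formatting tag.
rec1 = Node("tr", [Node("td", [Node("a")]), Node("td", [Node("b")])])
rec2 = Node("tr", [Node("td", [Node("a")]), Node("td", [Node("i")])])
print(weighted_match(rec1, rec2), similarity(rec1, rec2))
```

Under these assumed weights, the two rows still score as highly similar because the mismatch falls on a low-weight inline tag; a mismatch on a structural tag such as `td` would cost more.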


Similar Documents

A web table extraction algorithm based on tree edit distance

Layered and Weighted Tree Matching Algorithm for Automatic Web Data Records Recognition

Web Data Extraction Based on Simple Tree Matching

Extraction of Web News from Web Pages Using a Ternary Tree Approach

Web Data Extraction Based on Label Library

The Research of Automatic Extraction Dynamic Web Data

A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree

Development of a translation model from HTML to WML using component based information extraction technique

The Dynamic Web Pages Information Extraction Algorithm Based on Sequence Alignment

Web Information Extraction by HTML Tree Edit Distance Matching