NDLI: Extracting news content with visual unit of web pages

Content Provider	IEEE Xplore Digital Library
Author	Wenhao Zhu, Song Dai, Yang Song, Zhiguo Lu
Copyright Year	2015
Description	Author affiliation: Libr. of Shanghai Univ., Shanghai, China (Zhiguo Lu) || Sch. of Comput. Eng. & Sci., Shanghai Univ., Shanghai, China (Wenhao Zhu; Song Dai; Yang Song)
Abstract	The Document Object Model (DOM) provides a tree structure, the DOM tree, for representing the objects in an HTML document. Many researchers have used the leaf nodes of the DOM tree as the basic objects when extracting information from web pages. However, a web page is better described as a set of information blocks, each with a consistent visual format, than as a set of individual DOM tree nodes, and those blocks do not necessarily map directly to DOM tree nodes. In this paper, we propose a visually oriented extraction method that extracts news content by visual unit (vu, for short). Visual units are identified top-down using visual features and text features. Page content is then extracted according to domain characteristics. In experiments, the proposed approach achieves 94.86% accuracy on 700 news web pages from 7 different news sites. The results suggest that combining visual units with domain characteristics is a promising approach to news content extraction.
Sponsorship	IEEE Comput.Soc.
Starting Page	1
Ending Page	5
File Size	417493
Page Count	5
File Format	PDF
ISBN	9781479986767
DOI	10.1109/SNPD.2015.7176203
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2015-06-01
Publisher Place	Japan
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Visualization; Accuracy; information extraction; DOM; Web pages; visual unit; Feature extraction; HTML; Data mining
Content Type	Text
Resource Type	Article
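
The abstract describes identifying visual units top-down, so that one unit is a block with a consistent visual format rather than a single DOM node. The sketch below illustrates that general idea only and is not the authors' algorithm: the record does not say which visual or text features the paper uses. The Node class, the choice of (font family, font size) as the only visual feature, and the function names leaf_styles, collect_text and visual_units are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Style = Tuple[str, int]  # assumed visual feature: (font family, font size)


@dataclass
class Node:
    """Hypothetical stand-in for a rendered DOM node; only leaves carry text."""
    tag: str
    text: str = ""
    style: Optional[Style] = None  # None means "inherit from parent"
    children: List["Node"] = field(default_factory=list)


def leaf_styles(node, inherited=None):
    """Return the set of effective styles used by non-empty leaves under node."""
    style = node.style or inherited
    if not node.children:
        return {style} if node.text.strip() else set()
    styles = set()
    for child in node.children:
        styles |= leaf_styles(child, style)
    return styles


def collect_text(node):
    """Concatenate the leaf text under node in document order."""
    if not node.children:
        return node.text.strip()
    parts = (collect_text(child) for child in node.children)
    return " ".join(p for p in parts if p)


def visual_units(node, inherited=None):
    """Top-down split: keep a subtree whole if it is visually uniform,
    otherwise descend. Adjacent units with the same style are merged, so a
    unit may span several sibling DOM nodes."""
    style = node.style or inherited
    styles = leaf_styles(node, inherited)
    if not styles:
        return []
    if len(styles) == 1:
        return [(next(iter(styles)), collect_text(node))]
    units = []
    for child in node.children:
        for unit_style, unit_text in visual_units(child, style):
            if units and units[-1][0] == unit_style:
                units[-1] = (unit_style, units[-1][1] + " " + unit_text)
            else:
                units.append((unit_style, unit_text))
    return units


# Made-up page: a navigation bar, a headline and two body paragraphs.
page = Node("body", children=[
    Node("div", children=[
        Node("a", "Home", ("Arial", 12)),
        Node("a", "World", ("Arial", 12)),
    ]),
    Node("h1", "Headline", ("Georgia", 28)),
    Node("p", "First paragraph.", ("Georgia", 16)),
    Node("p", "Second paragraph.", ("Georgia", 16)),
])

units = visual_units(page)
for unit_style, unit_text in units:
    print(unit_style, "|", unit_text)
```

In this toy page the two paragraph nodes end up in one unit because they share a format, which is the sense in which a visual unit need not correspond to a single DOM node. The later step named in the abstract, choosing content by domain characteristics, is not attempted here.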

Similar Documents

LBDA: A novel framework for extracting content from web pages

Using Visual Clues Concept for Extracting Main Data from Deep Web Pages

A Lightweight Algorithm for Automated Forum Information Processing

An automatic approach to extracting review link from Chinese news pages

Extracting Academic Information from Conference Web Pages

VEDD- a visual wrapper for extraction of data using DOM tree

Extracting the semantic content of web pages via repeated structures

An approach for text extraction from web news page

Content Extraction from Chinese Web Pages Based on Punctuations Distribution

Extracting news content with visual unit of web pages