NDLI: News Web Text Extraction Based on the Maximum Subsequence Segmentation

Content Provider	IEEE Xplore Digital Library
Author	Jianzhuo Yan; Hexin Duan; Liying Fang; Wang Ying
Copyright Year	2013
Description	Author affiliation: Coll. of Electron. Inf. & Control Eng., Beijing Univ. of Technol., Beijing, China (Jianzhuo Yan; Hexin Duan; Liying Fang; Wang Ying)
Abstract	Many people use the web as their main source of information in daily life. However, most web pages contain non-informational components, such as site bars, footers, and ads, which complicate extracting text from the original HTML documents. Although web text extraction techniques have been developed, they still require substantial human intervention and produce low-quality results, so their adoption and efficiency remain open problems. In this paper, we propose a maximum subsequence segmentation (MSS) algorithm and discuss its application to news websites. Unlike tree-structure analysis and VIPS, the algorithm divides a web page into text segments and label segments. Experiments show that the MSS algorithm achieves 93.73% accuracy on 2000 news pages from 5 different news sites and runs much faster than DOM-based extraction on the same dataset.
Sponsorship	Hubei Univ. Automot. Technol.
Starting Page	619
Ending Page	622
File Size	767239
Page Count	4
File Format	PDF
ISBN	9780769550046
DOI	10.1109/ICCIS.2013.170
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2013-06-21
Publisher Place	China
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Algorithm design and analysis; Accuracy; Navigation; Noise; Maximum subsequence segmentation; Web pages; HTML; Data mining; Web text extraction
Content Type	Text
Resource Type	Article
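
The abstract names maximum subsequence segmentation but this record does not give the algorithm's details. As a rough illustration only, the general idea can be sketched as follows: split the page into tag tokens and word tokens, give words a positive score and tags a negative one, and keep the contiguous run of tokens with the largest total score. The tokenizer, the score values (+1 per word, -3 per tag), and all function names below are assumptions made for this sketch, not the authors' method.

```python
import re

# Assumption: a token is either a whole tag or a run of non-space, non-'<' characters.
TOKEN_RE = re.compile(r"<[^>]+>|[^<\s]+")


def tokenize(html):
    """Split an HTML string into tag tokens and word tokens, in document order."""
    return TOKEN_RE.findall(html)


def score(token, tag_penalty=-3.0, word_reward=1.0):
    """Hypothetical scoring: tags are penalized, words are rewarded."""
    return tag_penalty if token.startswith("<") else word_reward


def max_subsequence(scores):
    """Return (start, end) of the maximum-sum contiguous run, end exclusive.

    This is the standard linear-time maximum subarray scan (Kadane's algorithm).
    An empty range (0, 0) is returned when no score is positive.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, s in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the run here.
            cur_sum, cur_start = s, i
        else:
            cur_sum += s
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end


def extract_main_text(html):
    """Keep only the words inside the highest-scoring token run."""
    tokens = tokenize(html)
    start, end = max_subsequence([score(t) for t in tokens])
    return " ".join(t for t in tokens[start:end] if not t.startswith("<"))
```

With these assumed scores, a tag-dense navigation bar or footer lowers the running sum, while a long paragraph of plain text raises it, so the selected run tends to cover the article body. A real system would need tuned or learned scores and a proper HTML parser.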

Similar Documents

A Web Text Extraction Method Based on Regular Expressions and Text Density

Automatic Web News Content Extraction Based on Similar Pages

Content Extraction from Chinese Web Pages Based on Punctuations Distribution

Efficient Web Page Main Text Extraction towards Online News Analysis

Web information extraction based on news domain ontology theory

Web Information Extraction Algorithm Based on Ontology and DOM Tree

The Noise Reduction Method of Web Pages Based on Image Features

A novel approach for Web data extraction based on XML encoding

HisTrace: A system for mining on news-related articles instead of web pages

News Web Text Extraction Based on the Maximum Subsequence Segmentation