NDLI: Extracting informative textual parts from web pages containing user-generated content

Collaborative modelling of reflection to inform the development and evaluation of work-based learning technologies

Extracting informative textual parts from web pages containing user-generated content

Multi-faceted context-dependent knowledge organisation with TACKO

Ranking resources in folksonomies by exploiting semantic information

The video summary GWAP: summarization of videos based on a social game

Crowdsourcing research opportunities: lessons from natural language processing

Guided discovery of interesting relationships between time series clusters and metadata properties

Dynamic and stabilizing forces in knowledge organization systems for business ecosystems

Personalized activity based eLearning

Algorithms for the verification of the semantic relation between a compound and a given lexeme

Knowcations: the quest for a personal knowledge management solution

Ontology-based standardization on knowledge exchange in social knowledge management environments

Patent images - a glass-encased tool: opening the case

Documenting and sharing scientific research over the semantic web

An embeddable dashboard for widget-based visual analytics on scientific communities

Personal knowledge management beyond versioning

Rethinking lessons learned capturing: using storytelling, root cause analysis, and collaboration engineering to capture lessons learned about project management

Benchmarking T-ANNE: text annotation system

Analysing user motivation in an art folksonomy

Addressing the long tail in empirical research data management

Interactive visual analysis of families of curves using data aggregation and derivation

I-know my users: user-centric profiling based on the perceptual preference questionnaire (PPQ)

Evaluation of similarity measures for knowledge profiles from an expert directory: a field study

Exploring the differences and similarities between hierarchical decentralized search and human navigation in information networks

Developing a system of qualitative metrics and recommendation for experimental science

Visualizations in exploratory search: a user study with stock market information

SketchViz: a sketching interface for domain comprehension tasks illustrated by an industrial network use case

Concept for improving industrial goods via contextual knowledge provision

Metadata visualization of scholarly search results: supporting exploration and discovery

Dual analysis of DNA microarrays

A knowledge-extraction approach to identify and present verbatim quotes in free text

Thought Bubbles: a conceptual prototype for a Twitter based recommender system for research 2.0

Challenges in creating multimedia instructions for support systems and dynamic problem-solving

Robust image retrieval using bag of visual words with fuzzy codebooks and fuzzy assignment

Extracting informative textual parts from web pages containing user-generated content

Content Provider	ACM Digital Library
Author	Katsimpras, Georgios; Pappas, Nikolaos; Stamatatos, Efstathios
Abstract	The vast amount of user-generated content on the Web has increased the need for automatic processing of web page content. The segmentation of web pages and the removal of noise (non-informative segments) are important pre-processing steps in a variety of applications such as sentiment analysis, text summarization and information retrieval. Currently, these two tasks tend to be handled separately, or are handled together without accounting for the diversity of web corpora or for web page type detection. We present a unified approach that provides robust identification of informative textual parts in web pages along with accurate type detection. The proposed algorithm takes into account visual and non-visual characteristics of a web page and is able to remove noisy parts from three major categories of pages which contain user-generated content (News, Blogs, Discussions). Based on a human-annotated corpus consisting of diverse topics, domains and templates, we demonstrate the learning abilities of our algorithm and examine both its effectiveness in extracting the informative textual parts and its use as a rule-based classifier for web page type detection in a realistic web setting.
Starting Page	1
Ending Page	8
Page Count	8
File Format	PDF
ISBN	9781450312424
DOI	10.1145/2362456.2362462
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2012-09-05
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Noise removal; Web page segmentation; Web page type detection; Information extraction
Content Type	Text
Resource Type	Article
