NDLI: Clustering web documents with tables for information extraction

Evaluation of a temporal-abstraction knowledge acquisition tool in the network security domain

Rapid knowledge capture using subgroup discovery with incremental refinement

Searching ontologies based on content: experiments in the biomedical domain

Maintaining constraint-based applications

Strategies for lifelong knowledge extraction from the web

Indexing ontologies with semantics-enhanced keywords

The X-COSIM integration framework for a seamless semantic desktop

Automated story capture from internet weblogs

Machine reading of web text

Applying problem solving methods for process knowledge acquisition, representation, and reasoning

An ontology for supporting communities of practice

Capturing and answering questions posed to a knowledge-based system

Extracting constraints for process modeling

Interactive thesaurus assessment for automatic document annotation

A methodology for asynchronous multi-user editing of semantic web ontologies

Interactive knowledge externalization and combination for SECI model

Clustering web documents with tables for information extraction

Human computation

KBS development through ontology mapping and ontology driven acquisition

Capturing knowledge about philosophy

Capturing a taxonomy of failures during automatic interpretation of questions posed in natural language

Information acquisition using multiple classifications

A framework for evaluating semantic metadata

Enabling experts to build knowledge bases from science textbooks

Criteria-based partitioning of large ontologies

Disambiguating for the web: a test of two methods

Enhancing enterprise knowledge processes via cross-media extraction

Extracting procedures from text

Extracting the meaning of medical concept correlations

Fostering knowledge sharing by inverse search

Garp3: a new workbench for qualitative reasoning and modelling

KnoFuss: a comprehensive architecture for knowledge fusion

Knowledge management using semantic web technologies: an application in software development

KnowWE: community-based knowledge capture with knowledge wikis

Muf: tool for knowledge extraction and knowledge base building

Observing knowledge clustering for educational resources using a course ontology

Ontology-based content model for scalable content reuse

Revelator's challenge

Thesaurus and metadata alignment for a semantic e-culture application

Towards web information extraction using extraction ontologies and (indirectly) domain ontologies

Clustering web documents with tables for information extraction

Content Provider	ACM Digital Library
Author	Friedrich, Gerhard; Shchekotykhin, Kostyantyn; Jannach, Dietmar
Abstract	One of the common approaches to extracting high-quality knowledge from Web sources is to exploit the redundancy of the published information. Therefore, a Web Mining System not only has to search for relevant Web pages but also has to somehow determine whether two pages describe the same entity in order to extract as much knowledge as possible about it. It has been shown that statistical clustering techniques are in general a suitable means to achieve this task by grouping documents that are supposed to contain similar information. However, when data is given in tabular form - which is for instance a typical way of describing items in online shops - existing document clustering algorithms show limited performance as documents containing tabular descriptions typically share a very common set of tokens although they describe different entities. In this paper we therefore propose a new document clustering approach that exploits hyperlinks and document metadata to extract candidates for entity names. These candidate names are subsequently used to cluster the documents and further improve these names, which are finally used to determine whether two documents describe the same entity. The detailed evaluation of our approach in two popular example domains showed its high accuracy in terms of precision and recall (F-Measure > 0.9).
Starting Page	169
Ending Page	170
Page Count	2
File Format	PDF
ISBN	9781595936431
DOI	10.1145/1298406.1298438
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2007-10-28
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	X-means algorithm
Content Type	Text
Resource Type	Article
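
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm (the record only names the X-means algorithm as a keyword, and this sketch does not implement it). It is a toy under stated assumptions: the document fields (`title`, `anchor`, `body`), the helper names, and the sample shop pages are all invented for illustration. It shows that tabular bodies of different products can share most of their tokens, while entity-name candidates taken from document metadata and link anchor text still separate them.

```python
import re
from collections import defaultdict


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def name_candidate(doc):
    """Hypothetical heuristic: keep the tokens that appear both in the
    page title (metadata) and in the anchor text of a link to the page."""
    return frozenset(tokens(doc["title"]) & tokens(doc["anchor"]))


def cluster_by_name(docs):
    """Group document ids whose name candidates are identical."""
    groups = defaultdict(list)
    for doc_id, doc in docs.items():
        groups[name_candidate(doc)].append(doc_id)
    return sorted(sorted(ids) for ids in groups.values())


# Toy data, invented for illustration: three shop pages with tabular bodies.
docs = {
    "a": {"title": "Canon A570 - ShopOne", "anchor": "Canon A570",
          "body": "resolution 7 megapixel zoom 4x weight 175 g price 199"},
    "b": {"title": "Canon A570 | ShopTwo", "anchor": "canon a570 camera",
          "body": "resolution 7 megapixel zoom 4x weight 175 g price 189"},
    "c": {"title": "Nikon L12 - ShopOne", "anchor": "Nikon L12",
          "body": "resolution 7 megapixel zoom 3x weight 125 g price 149"},
}

# Body-token overlap between pages "a" and "c" (different products).
body_overlap = jaccard(tokens(docs["a"]["body"]), tokens(docs["c"]["body"]))

# Clusters obtained from the name candidates alone.
clusters = cluster_by_name(docs)
```

In this toy data, pages "a" and "c" describe different products yet share more than half of their body tokens, whereas grouping by name candidate puts "a" and "b" together and keeps "c" apart. A real system would need fuzzy matching and iterative refinement of the names, as the abstract indicates.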
