NDLI: Title extraction from bodies of HTML documents and its application to web page retrieval

Orthogonal locality preserving indexing

A study of factors affecting the utility of implicit relevance feedback

Improving collection selection with overlap awareness in P2P search engines

Robustness of adaptive filtering methods in a cross-benchmark evaluation

OCFS: optimal orthogonal centroid feature selection for text categorization

Combining eye movements and collaborative filtering for proactive information retrieval

Detecting phrase-level duplication on the world wide web

Web-page summarization using clickthrough data

Optimization strategies for complex queries

Title extraction from bodies of HTML documents and its application to web page retrieval

Relevance information: a loss of entropy but a gain for IDF?

Controlling overlap in content-oriented XML retrieval

Web-based acquisition of Japanese katakana variants

Automatic music video summarization based on audio-visual-text analysis and alignment

Generic soft pattern models for definitional question answering

A study of relevance propagation for web search

When will information retrieval be "good enough"?

A study of the dirichlet priors for term frequency normalisation

Impedance coupling in content-targeted advertising

Iterative translation disambiguation for cross-language information retrieval

Hidden Markov models for automatic annotation and content-based retrieval of images and video

Analysis of factoid questions for effective relation extraction

An industrial-strength content-based music recommendation system

The Portinari project: IR helps art and culture

Why spectral retrieval works

Context-sensitive information retrieval using implicit feedback

Server selection methods in hybrid portal search

A probabilistic model for retrospective news event detection

SimFusion: measuring similarity using unified relationship matrix

Accurately interpreting clickthrough data as implicit feedback

Using ODP metadata to personalize search

Topic themes for multi-document summarization

Simplified similarity scoring using term ranks

Multi-label informed latent semantic indexing

Linear discriminant model for information retrieval

Publish/subscribe functionality in IR environments using structured overlay networks

On the collective classification of email "speech acts"

A phonotactic-semantic paradigm for automatic spoken document classification

Evaluation of resources for question answering evaluation

Relevance weighting for query independent evidence

Modeling task-genre relationships for IR in the workplace

A Markov random field model for term dependencies

Improving web search results using affinity graph

Bootstrapping dictionaries for cross-language information retrieval

Exploiting ontologies for automatic image annotation

A testbed for people searching strategies in the WWW

SPIN: searching personal information networks

The future of media, blogs and innovation: new IR challenges?

Better than the real thing?: iterative pseudo-query processing using cluster-based language models

User term feedback in interactive text-based image retrieval

Modeling search engine effectiveness for federated search

Scalable collaborative filtering using cluster-based smoothing

An application of text categorization methods to gene ontology annotation

Information retrieval system evaluation: effort, sensitivity, and reliability

Exploiting the hierarchical structure for link analysis

Do summaries help?

Efficiently decodable and searchable natural language adaptive compression

Text classification with kernels on the multinomial manifold

Integrating word relationships into language models

Learning to extract information from semi-structured text using a discriminative context free grammar

Using term informativeness for named entity detection

Boosted decision trees for word recognition in handwritten document retrieval

Question answering passage retrieval using dependency relations

Detecting dominant locations from search queries

Personalizing search via automated analysis of interests and activities

An exploration of axiomatic approaches to information retrieval

Learning to estimate query difficulty: including applications to missing content detection and distributed information retrieval

A maximum coherence model for dictionary-based cross-language information retrieval

A database centric view of semantic image annotation and retrieval

Measure-based metasearch

A CLIR interface to a web search engine

Challenges in running a commercial search engine

The maximum entropy method for analyzing retrieval measures

Active feedback in ad hoc information retrieval

A utility theoretic approach to determining optimal wait times in distributed information retrieval

Efficient and self-tuning incremental query expansion for top-k query processing

Multi-labelled classification using maximum entropy method

PageRank without hyperlinks: structural re-ranking using links induced by language models

The loquacious user: a document-independent source of terms for query expansion

Gravitation-based model for information retrieval

A geometric interpretation of r-precision and its correlation with average precision

Music-to-knowledge (M2K): a prototyping and evaluation environment for music information retrieval research

SIGIR 2005 Doctoral Consortium

Probabilistic hyperspace analogue to language

A wireless natural language search engine

Basic issues on the processing of web queries

The recap system for identifying information flow

An interface to search human movements based on geographic and chronological metadata

Hierarchical text summarization for WAP-enabled mobile devices

Automatic web query classification using labeled and unlabeled training data

Manjal: a text mining system for MEDLINE

Surrogate scoring for improved metasearch precision

UCAIR: a personalized search toolbar

Detecting action-items in e-mail

A web mining research platform

Characterization of a simple case of the reassignment of document identifiers as a pattern sequencing problem

Multi-faceted information retrieval system for large scale email archives

Testing algorithms is like testing students

Evaluating the impact of selection noise in community-based web search

Expectation of f-measures: tractable exact computation and some empirical observations of its properties

Search engines and how students think they work

On evaluation of adaptive topic tracking systems

Top subset retrieval on large collections using sorted indices

Relation between PLSA and NMF and implications

The impact of evaluation on multilingual text retrieval

Using Oracle® for natural language document retrieval an automatic query reformulation approach

Customizing information access according to domain and task knowledge: the ontoExplo system

Evaluating semantic indexing techniques through cross-language fingerprinting

Live visual relevance feedback for query formulation

A dual index model for contextual information retrieval

Predicting query difficulty on the web by learning visual clues

Finding semantically similar questions based on their answers

Study of cross lingual information retrieval using on-line translation systems

3D viewpoint-based photo search and information browsing

Examination and enhancement of a ring-structured graphical search interface based on usability testing

Short comings of latent models in supervised settings

Major topic detection and its application to opinion summarization

Using query term order for result summarisation

Profile-based event tracking

Analysis of recursive feature elimination methods

Assessing the term independence assumption in blind relevance feedback

Revisiting the effect of topic set size on retrieval error

Information sharing through rational links and viewpoint retrieval

Mining multimedia salient concepts for incremental information extraction

Translating pieces of words

Cross-language text classification

A temporally adaptive content-based relevance ranking algorithm

Automated evaluation of search engine performance via implicit user feedback

Dependency relation matching for answer selection

Using dragpushing to refine centroid text classifiers

Scalable hierarchical topic detection: exploring a sample based approach

Noun sense induction using web search results

Self-organizing distributed collaborative filtering

Dirichlet PageRank

A retrospective study of probabilistic context-based retrieval

Indexing emails and email threads for retrieval

Intelligent fusion of structural and citation-based evidence for text classification

Mining translations of OOV terms from the web through cross-lingual query expansion

On redundancy of training corpus for text categorization: a perspective of geometry

Title extraction from bodies of HTML documents and its application to web page retrieval

Content Provider	ACM Digital Library
Author	Cao, Yunbo; Li, Hang; Xin, Guomao; Song, Ruihua; Hu, Yunhua; Shi, Shuming; Hu, Guoping
Abstract	This paper is concerned with automatic extraction of titles from the bodies of HTML documents. Titles of HTML documents should be correctly defined in the title fields; in reality, however, HTML titles are often bogus. It is therefore desirable to extract titles automatically from the bodies of HTML documents, an issue that does not seem to have been investigated previously. In this paper, we take a supervised machine learning approach to the problem. We propose a specification for HTML titles and use format information such as font size, position, and font weight as features in title extraction. Our method significantly outperforms the baseline method of using the lines in the largest font size as the title (20.9%-32.6% improvement in F1 score). As an application, we consider web page retrieval, using the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that using both extracted titles and title fields is almost always better than using title fields alone; extracted titles are particularly helpful in the task of named page finding (23.1%-29.0% improvement).
Starting Page	250
Ending Page	257
Page Count	8
File Format	PDF
ISBN	1595930345
DOI	10.1145/1076034.1076079
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2005-08-15
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Metadata extraction; Information retrieval; HTML document
Content Type	Text
Resource Type	Article
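
The abstract compares the proposed method against a baseline that takes the lines in the largest font size as the title. The following is a minimal sketch of that baseline idea only, not the authors' supervised method or their implementation. It assumes font size can be approximated from heading tags alone; the `TAG_SIZES` mapping, `DEFAULT_SIZE`, and the names `FontSizeCollector` and `largest_font_title` are hypothetical and do not come from the paper.

```python
from html.parser import HTMLParser

# Hypothetical nominal font sizes per tag (assumption for illustration);
# a real system would have to resolve CSS and <font> attributes as well.
TAG_SIZES = {"h1": 24, "h2": 18, "h3": 14, "h4": 12, "h5": 10, "h6": 8}
DEFAULT_SIZE = 12


class FontSizeCollector(HTMLParser):
    """Collects (font_size, text) pairs for text inside <body>."""

    def __init__(self):
        super().__init__()
        self.size_stack = [DEFAULT_SIZE]  # size in effect for the current text
        self.tag_stack = []               # open tags that changed the size
        self.in_body = False
        self.units = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        if tag in TAG_SIZES:
            self.tag_stack.append(tag)
            self.size_stack.append(TAG_SIZES[tag])

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        if self.tag_stack and self.tag_stack[-1] == tag:
            self.tag_stack.pop()
            self.size_stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if self.in_body and text:
            self.units.append((self.size_stack[-1], text))


def largest_font_title(html):
    """Baseline: join all body text units that share the largest font size."""
    collector = FontSizeCollector()
    collector.feed(html)
    collector.close()
    if not collector.units:
        return ""
    largest = max(size for size, _ in collector.units)
    return " ".join(text for size, text in collector.units if size == largest)


page = (
    "<html><head><title>Untitled</title></head><body>"
    "<h1>Annual Report 2004</h1><p>Some body text.</p><h2>Overview</h2>"
    "</body></html>"
)
print(largest_font_title(page))
```

In this example the title field holds the uninformative string "Untitled", while the body heading carries the real title, which is the situation the abstract describes. The supervised approach in the paper additionally uses position and font weight as features, which this sketch does not model.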
