NDLI: Extracting data records from the web using tag path clustering

A dynamic bayesian network click model for web search ranking

Fast dynamic reranking in large graphs

Detecting the origin of text segments efficiently

Latent space domain transfer between high dimensional overlapping distributions

Learning consensus opinion: mining data from a labeling game

Exploiting web search to generate synonyms for entities

Incorporating site-level knowledge to extract structured data from web forums

Hybrid keyword search auctions

Adaptive bidding for display advertising

Efficient application placement in a dynamic hosting platform

Less talk, more rock: automated organization of community-contributed collections of concert videos

Visual diversification of image search results

Efficient interactive fuzzy keyword search

Inverted index compression and query processing with optimized document ordering

Improved techniques for result caching in web search engines

Unsupervised query categorization using automatically-built concept graphs

A search-based method for forecasting ad impression in contextual advertising

Collective privacy management in social networks

All your contacts are belong to us: automated identity theft attacks on social networks

Rapid prototyping of semantic mash-ups through semantic web pipes

Large scale integration of senses for the semantic web

Evaluating similarity measures for emergent semantics of social tagging

Tagommenders: connecting users to items through tags

Social search in "Small-World" experiments

Network analysis of collaboration structure in Wikipedia

Mapping the world's photos

Mining interesting locations and travel sequences from GPS trajectories

Web 2.0: blind to an accessible new world

Rapid development of spreadsheet-based web mashups

Combining global optimization with local selection for efficient QoS-aware service composition

Why is the web loosely coupled?: a multi-faceted metric for service design

Co-browsing dynamic web pages

Extracting article text from the web with maximum subsequence segmentation

Performing grouping and aggregate functions in XML queries

Analyzing seller practices in a Brazilian marketplace

Competitive analysis from click-through log

WPBench: a benchmark for evaluating the client-side performance of web 2.0 applications

Mining cultural differences from a large number of geotagged photos

Click chain model in web search

Estimating the impressionrank of web pages

Enhancing diversity, coverage and balance for summarization through structure learning

StatSnowball: a statistical approach to extracting entity relationships

Rated aspect summarization of short comments

Smart Miner: a new framework for mining large scale web usage data

Towards context-aware search by learning a very large variable length hidden markov model from search logs

Bid optimization for broad match ad auctions

How much can behavioral targeting help online advertising?

Network-aware forward caching

A generalised cross-modal clustering method applied to multimedia news semantic indexing and retrieval

An axiomatic approach for result diversification

RuralCafe: web search in the rural developing world

Nearest-neighbor caching for content-match applications

Understanding user's query intent with wikipedia

Exploiting web search engines to search structured databases

To join or not to join: the illusion of privacy in social networks with mixed public and private user profiles

Using static analysis for Ajax intrusion detection

idMesh: graph-based disambiguation of linked data

Triplify: light-weight linked data publication from relational databases

Measuring the similarity between implicit semantic relations from the web

Collaborative filtering for orkut communities: discovery of user latent behavior

Behavioral profiles for advanced email features

The slashdot zoo: mining a social network with negative edges

Ranking and classifying attractiveness of photos in folksonomies

Computers and iphones and mobile phones, oh my!: a logs-based comparison of search users on different devices

Scrolling behaviour with single- and multi-column layout

Mashroom: end-user mashup programming using nested tables

A trust management framework for service-oriented environments

Highly scalable web applications with zero-copy data transfer

HTML templates that fly: a template engine approach to automated offloading from server to client

Extracting data records from the web using tag path clustering

XQuery in the browser

A geographical analysis of knowledge production in computer science

Predicting click through rate for job listings

MASTH proxy: an extensible platform for web overload control

An experimental study of large-scale mobile social network

Spatio-temporal models for estimating click-through rate

Learning to recognize reliable users and content in social media with coupled mutual reinforcement

Efficient overlap and content reuse detection in blogs and online news articles

Matchbox: large scale online bayesian recommendations

How opinions are received by online communities: a case study on amazon.com helpfulness votes

Releasing search queries and clicks privately

A class-feature-centroid classifier for text categorization

General auction mechanism for search advertising

Web service derivatives

Anycast-aware transport for content delivery networks

What makes conversations interesting?: themes, participants and consequences of conversations in online social media

Learning to tag

Quicklink selection for navigational query results

Using graphics processors for high performance IR query processing

Compressed web indexes

Discovering users' specific geo intention in web search

Online expansion of rare queries for sponsored search

Privacy diffusion on the web: a longitudinal perspective

A hybrid phish detection approach by identity discovery and keywords retrieval

OpenRuleBench: an analysis of the performance of rule engines

SOFIE: a self-organizing framework for information extraction

Extracting key terms from noisy and multitheme documents

Personalized recommendation on dynamic content using predictive bilinear models

A measurement-driven analysis of information propagation in the flickr social network

Community gravity: measuring bidirectional effects by trust and rating on online social networks

Constructing folksonomies from user-specified relations on flickr

A game based approach to assign geographical relevance to web images

What's up CAPTCHA?: a CAPTCHA based on image orientation

Automated construction of web accessibility models from transaction click-streams

Test case prioritization for regression testing of service-oriented business applications

REST-based management of loosely coupled services

Characterizing insecure javascript practices on the web

Sitemaps: above and beyond the crawl of duty

Answering approximate queries over autonomous web databases

Query clustering using click-through graph

A messaging API for inter-widgets communication

A P2P based distributed services network for next generation mobile internet communications

Large scale multi-label classification via metalabeler

An effective semantic search technique using ontology

Why are moved web pages difficult to find?: the WISH approach

Securely implementing open geospatial consortium web service interface standards in oracle spatial

Relationalizing RDF stores for tools reusability

Detecting soft errors by redirection classification

Visualization of Geo-annotated pictures in mobile phones

C-SPARQL: SPARQL for continuous querying

Automatic web service composition with abstraction and refinement

Deducing trip related information from flickr

Interactive search in XML data

Where to adapt dynamic service compositions

Link based small sample learning for web spam detection

Is there anything worth finding on the semantic web?

SGPS: a semantic scheme for web service similarity

Detecting image spam using local invariant features and pyramid match kernel

Instance-based probabilistic reasoning in the semantic web

Automated synthesis of composite services with correctness guarantee

Web image retrieval reranking with multi-view clustering

A flight meta-search engine with metamorph

User-centric content freshness metrics for search engines

Characterizing web-based video sharing workloads

Thumbs-up: a game for playing to rank search results

Reliability analysis using weighted combinational models for web-based software

Deriving music theme annotations from user tags

Search shortcuts: driving users towards their goals

sMash: semantic-based mashup navigation for data API network

Tag-oriented document summarization

A probabilistic model based approach for blended search

Semantic wiki aided business process specification

Search result re-ranking based on gap between search queries and social tags

Bucefalo: a tool for intelligent search and filtering for web-based personal health records

Raise semantics at the user level for dynamic and interactive SOA-based portals

Signaling emotion in tagclouds

Dataplorer: a scalable search engine for the data web

Towards lightweight and efficient DDOS attacks detection for web server

Two birds with one stone: a graph-based framework for disambiguating and tagging people names in web search

Threshold selection for web-page classification with highly skewed class distribution

A general framework for adaptive and online detection of web attacks

The value of socially tagged urls for a search engine

Web-scale classification with naive bayes

PAKE-based mutual HTTP authentication for preventing phishing attacks

The recurrence dynamics of social tagging

News article extraction with template-independent wrapper

Inferring private information using social network data

Playful tagging: folksonomy generation using online games

Graffiti: node labeling in heterogeneous networks

Privacy preserving frequency capping in internet banner advertising

Identifying vertical search intention of query through social tagging propagation

Graph based crawler seed selection

Crosslanguage blog mining and trend visualisation

Social search and discovery using a unified approach

Building term suggestion relational graphs from collective intelligence

Crawling English-Japanese person-name transliterations from the web

Extracting community structure through relational hypergraphs

Towards intent-driven bidterm suggestion

Near real time information mining in multilingual news

Ranking user-created contents by search user's inclination in online communities

Advertising keyword generation using active learning

Mining multilingual topics from wikipedia

Retaining personal expression for social search

Towards language-independent web genre detection

Discovering the staring people from social networks

Rare item detection in e-commerce site

The web of nations

Analysis of community structure in Wikipedia

A declarative framework for semantic link discovery over relational data

Cascading style sheets: a novel approach towards productive styling with today's standards

Content hole search in community-type content

Modeling semantics and structure of discussion threads

Automatically filling form-based web interfaces with free text inputs

Searching for events in the blogosphere

Combining anchor text categorization and graph analysis for paid link detection

A densitometric analysis of web template content

Ranking community answers via analogical reasoning

Rethinking email message and people search

A flexible dialogue system for enhancing web usability

Probabilistic question recommendation for question answering communities

Purely URL-based topic classification

Estimating web site readability using content extraction

Buzz-based recommender system

Web content accessibility guidelines: from 1.0 to 2.0

Discovering user profiles

Bootstrapped extraction of class attributes

Extracting data records from the web using tag path clustering

Content Provider	ACM Digital Library
Author	Tatemura, Junichi; Sawires, Arsany; Hsiung, Wang-Pin; Moser, Louise E.; Miao, Gengxin
Abstract	Fully automatic methods that extract lists of objects from the Web have been studied extensively. Record extraction, the first step of this object extraction process, identifies a set of Web page segments, each of which represents an individual object (e.g., a product). State-of-the-art methods suffice for simple search, but they often fail to handle more complicated or noisy Web page structures due to a key limitation -- their greedy manner of identifying a list of records through pairwise comparison (i.e., similarity match) of consecutive segments. This paper introduces a new method for record extraction that captures a list of objects in a more robust way based on a holistic analysis of a Web page. The method focuses on how a distinct tag path appears repeatedly in the DOM tree of the Web document. Instead of comparing a pair of individual segments, it compares a pair of tag path occurrence patterns (called visual signals) to estimate how likely these two tag paths represent the same list of objects. The paper introduces a similarity measure that captures how closely the visual signals appear and interleave. Clustering of tag paths is then performed based on this similarity measure, and sets of tag paths that form the structure of data records are extracted. Experiments show that this method achieves higher accuracy than previous methods.
Starting Page	981
Ending Page	990
Page Count	10
File Format	PDF
ISBN	9781605584874
DOI	10.1145/1526709.1526841
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2009-04-20
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Data record extraction; Information extraction; Clustering
Content Type	Text
Resource Type	Article
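
The abstract's pipeline (collect tag path occurrence patterns, compare pairs of patterns, cluster the tag paths) can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not define the similarity measure or the clustering algorithm, so `interleave_similarity` is a toy stand-in that only rewards alternating occurrences, and `cluster_tag_paths` is a simple threshold-based grouping. All function names, the `threshold` and `min_count` parameters, and the `VOID_TAGS` set are assumptions made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Assumption: a small set of tags that never take a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Records, for each distinct root-to-node tag path, the positions
    (in document order of start tags) at which that path occurs."""

    def __init__(self):
        super().__init__()
        self.stack = []                    # currently open tags
        self.position = 0                  # running index over start tags
        self.signals = defaultdict(list)   # tag path -> occurrence positions

    def handle_starttag(self, tag, attrs):
        path = "/".join(self.stack + [tag])
        self.signals[path].append(self.position)
        self.position += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the nearest matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


def extract_signals(html):
    """Return a dict mapping each tag path to its list of positions."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.signals)


def interleave_similarity(a, b):
    """Toy stand-in for the paper's measure: the fraction of adjacent pairs
    in the merged position sequence that switch between the two paths.
    Strict alternation (a, b, a, b, ...) scores 1.0."""
    if not a or not b:
        return 0.0
    merged = sorted([(p, 0) for p in a] + [(p, 1) for p in b])
    switches = sum(1 for x, y in zip(merged, merged[1:]) if x[1] != y[1])
    return switches / (len(merged) - 1)


def cluster_tag_paths(signals, threshold=0.8, min_count=2):
    """Group repeated tag paths whose pairwise similarity reaches the
    threshold (single-link grouping via union-find; an assumption, not
    the clustering method used in the paper)."""
    paths = [p for p, pos in signals.items() if len(pos) >= min_count]
    parent = {p: p for p in paths}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for i, p in enumerate(paths):
        for q in paths[i + 1:]:
            if interleave_similarity(signals[p], signals[q]) >= threshold:
                parent[find(p)] = find(q)

    groups = defaultdict(set)
    for p in paths:
        groups[find(p)].add(p)
    return [g for g in groups.values() if len(g) > 1]
```

Calling `cluster_tag_paths(extract_signals(page_html))` on a page with a repeated list returns sets of tag paths that recur together, which is the kind of structure the abstract describes as forming the data records. Paths that occur only once are dropped by `min_count` because the toy similarity is not meaningful for them.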
