NDLI: Locality sensitive hashing for scalable structural classification and clustering of web documents

Scholarly big data: information extraction and data mining

One size does not fit all: multi-granularity search of web forums

Graph-of-word and TW-IDF: new approach to ad hoc IR

Penguins in sweaters, or serendipitous entity search on user-generated content

Predicting user activity level in social networks

Discovering coherent topics using general knowledge

Mining frequent neighborhood patterns in a large labeled graph

Local correlation detection with linearity enhancement in streaming data

Locality sensitive hashing for scalable structural classification and clustering of web documents

Building a large-scale corpus for evaluating event detection on twitter

Location prediction in social media based on tie strength

StaticGreedy: solving the scalability-accuracy dilemma in influence maximization

The logical diversity of explanations in OWL ontologies

On mining mobile apps usage behavior for predicting apps usage in smartphones

Users versus models: what observation tells us about effectiveness metrics

The water filling model and the cube test: multi-dimensional evaluation for professional search

Efficient filtering and ranking schemes for finding inclusion dependencies on the web

Towards metric fusion on multi-view data: a cross-view based graph random walk approach

Fast parameterless density-based clustering via random projections

Linear-time enumeration of maximal K-edge-connected subgraphs in large networks by random contraction

Community question topic categorization via hierarchical kernelized classification

Accurate and scalable nearest neighbors in large networks based on effective importance

Exploring weakly supervised latent sentiment explanations for aspect-level review analysis

Robust question answering over the web of linked data

MRPacker: an SQL to mapreduce optimizer

Challenges in commerce search

Is top-k sufficient for ranking?

Feedback-driven multiclass active learning for data streams

Effective measures for inter-document similarity

Interactive collaborative filtering

Cross-domain sparse coding

Computational advertising: the linkedin way

Graph similarity search with edit distance constraint in large graph databases

Social recommendation incorporating topic mining and social trust analysis

Compact explanatory opinion summarization

Directing exploratory search with interactive intent modeling

Semi-supervised discriminative preference elicitation for cold-start recommendation

The online revolution: education for everyone

"All roads lead to Rome": optimistic recovery for distributed iterative data processing

Personalization of web-search using short-term browsing context

Toward advice mining: conditional random fields for extracting advice-revealing text units

Overlapping community detection using seed set expansion

Nonparametric bayesian multitask collaborative filtering

Beyond data: from user information to business value through personalized recommendations and consumer science

Wondering why data are missing from query results?: ask conseil why-not

GAPfm: optimal top-n recommendations for graded relevance domains

Transferring knowledge with source selection to learn IR functions on unlabeled collections

Re-ranking for joint named-entity recognition and linking

pEDM: online-forecasting for smart energy analytics

Consumer-centric SLA manager for cloud-hosted databases

GeCo: an online personal data generator and corruptor

PredictionIO: a distributed machine learning server for practical software development

Human computing games for knowledge acquisition

Channeling the deluge: research challenges for big data and information systems

AKBC 2013: third workshop on automated knowledge base construction

Applying theory to practice

Spatial search for K diverse-near neighbors

Map search via a factor graph model

Entity-centric document filtering: boosting feature mapping through meta-features

On popularity prediction of videos shared in online social networks

Spatio-temporal and events based analysis of topic popularity in twitter

A two-phase algorithm for mining sequential patterns with differential privacy

Efficient processing of streaming graphs for evolution-aware clustering

An index for efficient semantic full-text search

On sparsity and drift for effective real-time filtering in microblogs

To stay or not to stay: modeling engagement dynamics in social graphs

Online multitasking and user engagement

Ontology authoring with FORZA

Ranking fraud detection for mobile apps: a holistic view

Evaluating aggregated search using interleaving

Disinformation techniques for entity resolution

A generic front-stage for semi-stream processing

Discovering latent blockmodels in sparse and noisy graphs using non-negative matrix factorisation

Mining entity attribute synonyms via compact clustering

External memory K-bisimulation reduction of big graphs

Building structures from classifiers for passage reranking

Spatial-temporal query homogeneity for KNN object search on road networks

Using micro-reviews to select an efficient set of reviews

Expertise retrieval in bibliographic network: a topic dominance learning approach

A hybrid approach for privacy-preserving processing of knn queries in mobile database systems

Clustering: probably approximately useless?

How fresh do you want your search results?

Discriminative feature selection for multi-view cross-domain learning

Efficient hierarchical clustering of large high dimensional datasets

Building optimal information systems automatically: configuration space exploration for biomedical information systems

Motif discovery in spatial trajectories using grammar inference

Automatic ad format selection via contextual bandits

Fast and scalable reachability queries on graphs by pruned labeling with landmarks and paths

Originator or propagator?: incorporating social role theory into topic models for twitter content analysis

Towards an enhanced and adaptable ontology by distilling and assembling online encyclopedias

FRec: a novel framework of recommending users and communities in social media

Exploiting query term correlation for list caching in web search engines

Online learning from streaming data

Optimizing plurality for human intelligence tasks

Factors affecting aggregated search coherence and search behavior

Information extraction as a filtering task

TODMIS: mining communities from trajectories

Local-to-global semi-supervised feature selection

Beyond data: from user information to business value through personalized recommendations and consumer science

Fast evaluation of iceberg pattern-based aggregate queries

URL tree: efficient unsupervised content extraction from streams of web documents

Understanding how people interact with web search results that change in real-time using implicit feedback

Identifying salient entities in web pages

An efficient probabilistic framework for multi-dimensional classification

TerraFly GeoCloud: online spatial data analysis system

DeExcelerator: a framework for extracting relational data from partially structured documents

Exploring XML data is as easy as using maps

A tool for assisting provenance search in social media

DOLAP 2013 workshop summary

Usability in machine learning at scale with graphlab

Mining a search engine's corpus without a query pool

A phased ranking model for question answering

Structured positional entity language model for enterprise entity retrieval

Inferring anchor links across multiple heterogeneous social networks

Domain-dependent/independent topic switching model for online reviews with numerical ratings

Mining diabetes complication and treatment patterns for clinical decision support

Searching similar segments over textual event sequences

Load-sensitive selective pruning for distributed search

Probabilistic solutions of influence propagation on social networks

UNIK: unsupervised social network spam detection

PATRIC: a parallel algorithm for counting triangles in massive networks

Aligning freebase with the YAGO ontology

AnchorMF: towards effective event context identification

Using historical click data to increase interleaving sensitivity

Location recommendation for out-of-town users in location-based social networks

Scalable diversification of multiple search results

Understanding the roles of sub-graph features for graph classification: an empirical study perspective

Modeling interaction features for debate side clustering

Querying graphs with preferences

Uncovering collusive spammers in Chinese review websites

Discovering influential authors in heterogeneous academic networks by a co-ranking method

Automatic construction of domain and aspect specific sentiment lexicons for customer review mining

Instant foodie: predicting expert ratings from grassroots

Flexible and extensible generation and corruption of personal data

TellMyRelevance!: predicting the relevance of web search results from cursor interactions

Functional dirichlet process

Flexible and adaptive subspace search for outlier analysis

Learning to handle negated language in medical records search

LCMKL: latent-community and multi-kernel learning based image annotation

Graph hashing and factorization for fast graph stream classification

An effective latent networks fusion based model for event recommendation in offline ephemeral social networks

Assessing sparse information extraction using semantic contexts

Permutation indexing: fast approximate retrieval from large corpora

Speller performance prediction for query autocorrection

From big data to big knowledge

Entropy-based histograms for selectivity estimation

Improving passage ranking with user behavior information

Web news extraction via path ratios

Archiving the relaxed consistency web

Intelligently querying incomplete instances for improving classification performance

Leveraging data to change industry paradigms

Top-down keyword query processing on XML data

Estimating document focus time

Facet selection algorithms for web product search

Recommending tags with a model of human categorization

OMS-TL: a framework of online multiple source transfer learning

MetKB: enriching RDF knowledge bases with web entity-attribute tables

Demonstrating intelligent crawling and archiving of web applications

Inside the world's playlist

SPHINX: rich insights into evidence-hypotheses relationships via parameter space-based exploration

Sixth workshop on exploiting semantic annotations in information retrieval (ESAIR'13)

Structured data in web search

G-tree: an efficient index for KNN search on road networks

CRF framework for supervised preference aggregation

Learning relatedness measures for entity linking

Community-based user recommendation in uni-directional social networks

A partially supervised cross-collection topic model for cross-domain text classification

Mining-based compression approach of propositional formulae

RWS-Diff: flexible and efficient change detection in hierarchical data

Rank-energy selective query forwarding for distributed search systems

Improving pseudo-relevance feedback via tweet selection

Modeling dynamics of meta-populations with a probabilistic approach: global diffusion in social media

An efficient MapReduce algorithm for counting triangles in a very large graph

PIDGIN: ontology alignment using web text as interlingua

How the live web feels about events

On the reliability and intuitiveness of aggregated search metrics

Short text classification by detecting information path

Parallel triangle counting in massive streaming graphs

PAGE: a partition aware graph computation engine

Dynamic multi-faceted topic discovery in twitter

Network-aware search in social tagging applications: instance optimality versus efficiency

Towards minimizing the annotation cost of certified text classification

Entity disambiguation in anonymized graphs using graph kernels

Wikification via link co-occurrence

On segmentation of eCommerce queries

An efficient and robust privacy protection technique for massive streaming choice-based information

Selection fusion in semi-structured retrieval

Spatio-temporal meme prediction: learning what hashtags will be popular where

Query matching for report recommendation

A pattern-based selective recrawling approach for object-level vertical search

Random walk-based graphical sampling in unbalanced heterogeneous bipartite social graphs

Efficiently anonymizing social networks with reachability preservation

Predicting trends in social networks via dynamic activeness model

Studying from electronic textbooks

Clustering-based transduction for learning a ranking model with limited human labels

Predicting the impact of expansion terms using semantic and user interaction features

Efficient two-party private blocking based on sorted nearest neighborhood clustering

Personalized models of search satisfaction

Lead-lag analysis via sparse co-projection in correlated text streams

Programming with personalized pagerank: a locally groundable first-order probabilistic logic

A probabilistic mixture model for mining and analyzing product search log

Large-scale deep learning at Baidu

Efficient pruning algorithm for top-K ranking on dataset with value uncertainty

Faceted models of blog feeds

Learning deep structured semantic models for web search using clickthrough data

Automatically generating descriptions for resources by tag modeling

Discovering and managing quantitative association rules

READFAST: high-relevance search-engine for big text

iNewsBox: modeling and exploiting implicit feedback for building personalized news radio

Detecting and exploring clusters in attributed graphs: a plugin for the gephi platform

Search excavator: the knowledge discovery tool

2013 international workshop on computational scientometrics: theory and applications

Efficient parsing-based search over structured data

CQArank: jointly model topics and expertise in community question answering

Gem-based entity-knowledge maintenance

Personalized influence maximization on social networks

Content coverage maximization on word networks for hierarchical topic summarization

Correlating medical-dependent query features with image retrieval models using association rules

Causality and responsibility: probabilistic queries revisited in uncertain databases

Augmenting web search surrogates with images

Supporting exploratory people search: a study of factor transparency and user control

Diffusion of innovations revisited: from social network to innovation network

Parallel motif extraction from very long sequences

Mapping adaptation actions for the automatic reconciliation of dynamic ontologies

Boolean satisfiability for sequence mining

User intent and assessor disagreement in web search evaluation

Personalized point-of-interest recommendation by mining users' preference transition

Cache refreshing for online social news feeds

Active exploration: simultaneous sampling and labeling for large graphs

Mining causal topics in text data: iterative topic modeling with time series feedback

A comparison of two physical data designs for interactive social networking actions

A heterogenous automatic feedback semi-supervised method for image reranking

Estimating the relative utility of networks for predicting user activities

Manipulation among the arbiters of collective intelligence: how wikipedia administrators mold public opinion

Scientific articles recommendation

RCached-tree: an index structure for efficiently answering popular queries

Incorporating user preferences into click models

Cost-sensitive learning for large-scale hierarchical classification

Computing term similarity by large probabilistic isA knowledge

Robust models of mouse movement on dynamic web search results pages

Modeling information diffusion over social networks for temporal dynamic prediction

ImG-complex: graph data model for topology of unstructured meshes

Dyadic event attribution in social networks with mixtures of hawkes processes

Generating informative snippet to maximize item visibility

Exploiting ranking factorization machines for microblog retrieval

QBEES: query by entity examples

Context-aware top-K processing using views

Beyond clicks: query reformulation as a predictor of search satisfaction

Adaptive co-training SVM for sentiment classification on tweets

Towards faster and better retrieval models for question search

Eigenvalues perturbation of integral operator for kernel selection

Query execution timing: taming real-time anytime queries on multicore processors

SRbench--a benchmark for soundtrack recommendation systems

Learning open-domain comparable entity graphs from user search queries

Mining characteristic multi-scale motifs in sensor-based time series

Combining one-class classifiers via meta learning

FusionDB: conflict management system for small-science databases

SportSense: using motion queries to find scenes in sports videos

Cloud Armor: a platform for credibility-based trust management of cloud services

ESTHETE: a news browsing system to visualize the context and evolution of news stories

Workshop summary for the 2013 international workshop on mining unstructured big data using natural language processing

Proximity²-aware ranking for textual, temporal, and geographic queries

A new operator for efficient stream-relation join processing in data streaming engines

Local clustering in provenance graphs

Navigating the topical structure of academic search results via the Wikipedia category network

Label constrained shortest path estimation

Predicting retweet count using visual cues

ROU: advanced keyword search on graph

Modeling temporal effects of human mobile behavior on location-based social networks

Assessing quality score of Wikipedia article using mutual evaluation of editors and texts

Learning compact hashing codes for efficient tag completion and prediction

Learning to selectively rank patients' medical history

Locality sensitive hashing revisited: filling the gap between theory and algorithm analysis

Unsupervised identification of synonymous query intent templates for attribute intents

On handling textual errors in latent document modeling

Merged aggregate nearest neighbor query processing in road networks

CV-PCR: a context-guided value-driven framework for patent citation recommendation

RAProp: ranking tweets by exploiting the tweet/user/web ecosystem and inter-tweet agreement

Efficient forecasting for hierarchical time series

Scalable bootstrapping for python

WordSeer: a knowledge synthesis environment for textual data

CloudDB 2013: fifth international workshop on cloud data management

Timely crawling of high-quality ephemeral new content

SCISSOR: scalable and efficient reachability query processing in time-evolving hierarchies

Content-centric flow mining for influence analysis in social streams

A multimodal framework for unsupervised feature fusion

Feature-based models for improving the quality of noisy training data for relation extraction

Identifying multilingual Wikipedia articles based on cross language similarity and activity

Hotness-aware buffer management for flash-based hybrid storage systems

Social media news communities: gatekeeping, coverage, and statement bias

Concept-based analysis of scientific literature

How do users grow up along with search engines?: a study of long-term users' behavior

A belief propagation approach for detecting shilling attacks in collaborative filtering

SkyView: a user evaluation of the skyline operator

Modeling behavioral factors in interactive information retrieval

Incorporating the surfing behavior of web users into pagerank

Extraction and integration of web data by end-users

FIRE: interactive visual support for parameter space-driven rule mining

DUBMOD13: international workshop on data-driven user behavioral modelling and mining from social media

LearNext: learning to predict tourists movements

Labels or attributes?: rethinking the neighbors for collective classification in sparsely-labeled networks

Probabilistic semantic similarity measurements for noisy short texts using Wikipedia entities

Weighted hashing for fast large scale similarity search

An efficient algorithm for approximate betweenness centrality computation

Expedited rating of data stores using agile data loading techniques

Discovering health-related knowledge in social media using ensembles of heterogeneous features

On sampling the wisdom of crowds: random vs. expert sampling of the twitter stream

LR-PPR: locality-sensitive, re-use promoting, approximate personalized pagerank computation

Automated snippet generation for online advertising

UMicS: from anonymized data to usable microdata

Intent models for contextualising and diversifying query suggestions

Question routing to user communities

PLEAD 2013: politics, elections and data

Where shall we go today?: planning touristic tours with tripbuilder

Term associations in query expansion: a structural linguistic perspective

Exploiting collaborative filtering techniques for automatic assessment of student free-text responses

Seeking provenance of information using social media

Can back-of-the-book indexes be automatically created?

Multimedia summarization for trending topics in microblogs

Detecting controversy on the web

Building user profiles from topic models for personalised search

Learning to rank for question routing in community question answering

DTMBIO 2013: international workshop on data and text mining in biomedical informatics

Predicting event-relatedness of popular queries

Automated probabilistic modeling for relational data

Mining user interest from search tasks and annotations

CIKM 2013 workshop on living labs for information retrieval evaluation

Modeling latent topic interactions using quantum interference for information retrieval

Semantic discovery from web comparison queries

Generating comparative summaries from reviews

The first workshop on user engagement optimization

Generalizing diversity detection in blog feed retrieval

Joint learning on sentiment and emotion classification

Zero-shot video retrieval using content and concepts

PIKM 2013: the 6th ACM workshop for ph.d. students in information and knowledge management

Dynamic query intent mining from a search log stream

A unified graph model for personalized query-oriented reference paper recommendation

Diversified query expansion using conceptnet

Web-KR 2013: the 4th international workshop on web-scale knowledge representation, retrieval and reasoning

Latency-aware strategy for static list caching in flash-based web search engines

Probabilistic latent class models for predicting student performance

An empirical study of top-n recommendation for venture finance

Data management & analytics for healthcare (DARE 2013)

Bootstrapping active name disambiguation with crowdsourcing

Timeline adaptation for text classification

Interest mining from user tweets

Modeling clicks beyond the first result page

Recommendation via user's personality and social contextual

An analysis of crowd workers mistakes for specific and complex relevance assessment task

Maintaining discriminatory power in quantized indexes

A fast convergence clustering algorithm merging MCMC and EM methods

Combining prestige and relevance ranking for personalized recommendation

Retrieving opinions from discussion forums

Discrimination aware classification for imbalanced datasets

Strategies for setting time-to-live values in result caches

Retrieval of trending keywords in a peer-to-peer micro-blogging OSN

Incremental shared nearest neighbor density-based clustering

Learning to detect task boundaries of query session

Trustable aggregation of online ratings

The essence of knowledge (bases) through entity rankings

Early prediction on imbalanced multivariate time series

Exploiting proximity feature in statistical translation models for information retrieval

Chinese syntactic parsing based on linguistic entity-relationship model

Exploiting trustors as well as trustees in trust-based recommendation

Position-based contextualization for passage retrieval

Clustering-based anomaly detection in multi-view data

Through-the-looking glass: utilizing rich post-search trail statistics for web search

High throughput filtering using FPGA-acceleration

Discovering relations using matrix factorization methods

Topical authority propagation on microblogs

On challenges with mobile e-health: lessons from a game-theoretic perspective

On exploiting content and citations together to compute similarity of scientific papers

The importance of being socially-savvy: quantifying the influence of social networks on microblog retrieval

Improving entity search over linked data by modeling latent semantics

Taxonomy-based regression model for cross-domain sentiment classification

Flexible and dynamic compromises for effective recommendations

Reconciliation of categorical opinions from multiple sources

An unsupervised transfer learning approach to discover topics for online reputation management

Discovering facts with boolean tensor tucker decomposition

Intelligent SSD: a turbo for big data mining

Software plagiarism detection: a graph-based approach

Objectionable content filtering by click-through data

Locality sensitive hashing for scalable structural classification and clustering of web documents

Content Provider	ACM Digital Library
Author	Hachenberg, Christian; Gottron, Thomas
Abstract	Web content management systems and web front ends to databases usually generate and populate HTML documents from homogeneous templates, whether the documents contain structured, semi-structured or plain text data. Wrapper-based information extraction techniques rely on such templates as a cornerstone of their functionality, but they depend heavily on the availability of proper training documents based on the specific template. Thus, structural classification and structural clustering of web documents are important contributing factors to the success of those methods. We introduce a novel technique to support these two tasks: template fingerprints. Template fingerprints are locality sensitive hash values in the form of short sequences of characters which effectively represent the underlying template of a web document. Small changes in the document structure, as they may occur in template-based documents, lead to no or only minor variations in the corresponding fingerprint. Based on the fingerprints, we introduce a scalable index structure and algorithm for large collections of web documents, which can retrieve structurally similar documents efficiently. The effectiveness of our approach is empirically validated in a classification task on a data set of 13,237 documents based on 50 templates from different domains. The general efficiency and scalability are evaluated in a clustering task on a data set retrieved from the Open Directory Project comprising more than 3.6 million web documents. For both tasks, our template fingerprint approach provides results of high quality and demonstrates a linear runtime of O(n) with respect to the number of documents.
Starting Page	359
Ending Page	368
Page Count	10
File Format	PDF
ISBN	9781450322638
DOI	10.1145/2505515.2505673
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2013-10-27
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Locality sensitive hashing; Template fingerprints; Template detection
Content Type	Text
Resource Type	Article
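
The abstract describes template fingerprints only at a high level, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes a plain SimHash over k-grams of HTML tag names as a stand-in locality sensitive hash, and an exact-match dictionary as a stand-in index. All names (TagCollector, tag_shingles, simhash_fingerprint, hamming, group_by_fingerprint) and the parameters bits and k are hypothetical and do not come from the paper.

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects opening tag names in document order; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_shingles(html, k=3):
    """Return the set of k-grams over the sequence of opening tags."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    tags = parser.tags
    if len(tags) < k:
        return {" ".join(tags)} if tags else set()
    return {" ".join(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def simhash_fingerprint(html, bits=32, k=3):
    """SimHash of the tag shingles as a short hex string (assumption: bits is a multiple of 8)."""
    counts = [0] * bits
    for shingle in tag_shingles(html, k):
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for b in range(bits):
            counts[b] += 1 if (value >> b) & 1 else -1
    number = sum(1 << b for b in range(bits) if counts[b] > 0)
    return format(number, "0{}x".format(bits // 4))


def hamming(fp_a, fp_b):
    """Number of differing bits between two hex fingerprints."""
    return bin(int(fp_a, 16) ^ int(fp_b, 16)).count("1")


def group_by_fingerprint(docs):
    """Bucket document ids by identical fingerprint in one pass over the collection."""
    buckets = defaultdict(list)
    for doc_id, html in docs.items():
        buckets[simhash_fingerprint(html)].append(doc_id)
    return dict(buckets)


# Two pages from the same hypothetical template: same tags, different text.
page_a = "<html><body><div><h1>Title A</h1><p>Text one</p><p>More</p></div></body></html>"
page_b = "<html><body><div><h1>Other</h1><p>Different words</p><p>Here</p></div></body></html>"
groups = group_by_fingerprint({"a": page_a, "b": page_b})
```

Because only tag names feed the hash, the two example pages receive the same fingerprint and land in the same bucket. Pages whose tag sequences differ slightly would usually differ in only a few bits, which hamming can measure. Grouping by exact fingerprint takes a single pass over the documents, which illustrates how a linear runtime could arise, though the index structure in the paper may work quite differently.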
