NDLI: CasJoin: a cascade chain for text similarity joins


Search engine support for software applications

Components for information extraction: ontology-based information extractors and generic platforms

What can quantum theory bring to information retrieval

On the selectivity of multidimensional routing indices

Wisdom of the ages: toward delivering the children's web with the link-based agerank algorithm

Mining topic-level influence in heterogeneous networks

Improved latent concept expansion using hierarchical markov random fields

Automatic schema merging using mapping constraints among incomplete sources

Mr.KNN: soft relevance for multi-label classification

Pricing guaranteed contracts in online display advertising

Learning click models via probit bayesian inference

Set cover algorithms for very large datasets

Factors affecting click-through behavior in aggregated search interfaces

Learning a user-thread alignment manifold for thread recommendation in online forum

Who should I cite: learning literature search models from citation behavior

PROSPECT: a system for screening candidates for recruitment

Index structures for efficiently searching natural language text

Network growth and the spectral evolution model

Two-tier similarity model for story link detection

XML schema computations: schema compatibility testing and subschema extraction

Constructing classification features using minimal predictive patterns

Evaluating, combining and generalizing recommendations with prerequisites

Personalized search by tag-based user profile and resource profile in collaborative tagging systems

Probabilistic first pass retrieval for search advertising: from theory to practice

Reverted indexing for feedback and expansion

Semantic tags generation and retrieval for online advertising

Clickthrough-based translation models for web search: from word models to phrase models

Building re-usable dictionary repositories for real-world text mining

VSEncoding: efficient coding and fast decoding of integer lists via dynamic programming

Search-log anonymization and advertisement: are they mutually exclusive?

Exploiting site-level information to improve web search

A topical link model for community discovery in textual interaction graph

BagBoo: a scalable hybrid bagging-the-boosting model

EntityEngine: answering entity-relationship queries using shallow semantics

LiquidXML: adaptive XML content redistribution

Summary of the 4th workshop on analytics for noisy unstructured text data (AND)

Schema extraction

Automatically acquiring a semantic network of related concepts

Entity ranking using Wikipedia as a pivot

Path-hop: efficiently indexing large graphs for reachability queries

A cross-lingual framework for monolingual biomedical information retrieval

Mining interesting link formation rules in social networks

Term necessity prediction

Preserving location and absence privacy in geo-social networks

Collaborative Dual-PLSA: mining distinction and commonality across multiple domains for text classification

Maximum normalized spacing for efficient visual clustering

Document allocation policies for selective searching of distributed indexes

The gist of everything new: personalized top-k processing over web 2.0 streams

Web search solved?: all result rankings the same?

FacetCube: a framework of incorporating prior knowledge into non-negative tensor factorization

A structured approach to query recommendation with social annotation data

Active caching for similarity queries based on shared-neighbor information

Efficient temporal keyword search over versioned text

Automatic detection of craters in planetary images: an embedded framework using feature selection and boosting

Selected new training documents to update user profile

Visual cube and on-line analytical processing of images

RankSVR: can preference data help regression?

Automatically suggesting topics for augmenting text documents

A comparison of user and system query performance predictions

Faceted search and browsing of audio content on spoken web

Improving verbose queries using subset distribution

MENTA: inducing multilingual taxonomies from wikipedia

Temporal query log profiling to improve web search ranking

OpinionIt: a text mining system for cross-lingual opinion analysis

Efficient term proximity search with term-pair indexes

Automated interaction in social networks with datalog

Exploration-exploitation tradeoff in interactive relevance feedback

Taxonomic clustering of web service for efficient discovery

Exploiting sequential relationships for familial classification

Facetedpedia: enabling query-dependent faceted search for wikipedia

Brown dwarf: a P2P data-warehousing system

3rd BooksOnline workshop: research advances in large digital book repositories and complementary media

Use of semantics in real life applications

Online annotation of text streams with structured entities

Ranking under temporal constraints

On wavelet decomposition of uncertain time series data sets

Multi-modal multi-correlation person-centric news retrieval

SHRINK: a structural clustering algorithm for detecting hierarchical communities in networks

Decomposing background topics from keywords by principal component pursuit

Preference query evaluation over expensive attributes

Inferring gender of movie reviewers: exploiting writing style, content and metadata

Multilevel manifold learning with application to spectral clustering

Rank learning for factoid question answering with linguistic and semantic constraints

Fast and accurate estimation of shortest paths in large graphs

Assessor error in stratified evaluation

Travel route recommendation using geotags in photo sharing sites

Using the past to score the present: extending term weighting models through revision history analysis

Searching consumer image collections using web-based concept expansion

Result-size estimation for information-retrieval subqueries

You are where you tweet: a content-based approach to geo-locating twitter users

Collaborative filtering in social tagging systems based on joint item-tag recommendations

Pattern discovery for large mixed-mode database

Estimating accuracy for text classification tasks on large unlabeled data

Detecting product review spammers using rating behaviors

The anatomy of a click: modeling user behavior on web information systems

Predicting product adoption in large-scale social networks

A unified optimization framework for robust pseudo-relevance feedback algorithms

Ontology emergence from folksonomies

Improving web search relevance and freshness with content previews

Hierarchical service analytics for improving productivity in an enterprise service center

Improved index compression techniques for versioned document collections

Towards a provenance framework for sub-image processing for astronomical data

Ranking social bookmarks using topic models

Active learning in parallel universes

Massive structured data management solution

Discovering, ranking and annotating cross-document relationships between concepts

RDFViewS: a storage tuning wizard for RDF applications

Report on the second international workshop on cloud data management (CloudDB 2010)

Temporal dynamics and information retrieval

Automatic extraction of web data records containing user-generated content

Examining the information retrieval process from an inductive perspective

Efficient set-correlation operator inside databases

Bringing order to your photos: event-driven classification of flickr images based on social knowledge

Outcome aware ranking in interaction networks

Document update summarization using incremental hierarchical clustering

Energy-efficient top-k query processing in wireless sensor networks

A robust semi-supervised classification method for transfer learning

Accelerating probabilistic frequent itemset mining: a model-based approach

Learning to rank relevant and novel documents through user feedback

Fast top-k simple shortest paths discovery in graphs

CiteData: a new multi-faceted dataset for evaluating personalized search performance

Boosting social network connectivity with link revival

Language pyramid and multi-scale text analysis

FACeTOR: cost-driven exploration of faceted query results

Partial drift detection using a rule induction framework

Collaborative future event recommendation

A probabilistic topic-connection model for automatic image annotation

Classical music for rock fans?: novel recommendations for expanding user interests

Exploring online social activities for adaptive search personalization

Ranking related entities: components and analyses

Multi-document topic segmentation

Organizing query completions for web search

Building efficient multi-threaded search nodes

Open user schema guided evaluation of streaming RDF queries

A fine-grained taxonomy of tables on the web

TAGME: on-the-fly annotation of short text fragments (by wikipedia entities)

Ranking of evolving stories through meta-aggregation

WikiPop: personalized event detection system based on Wikipedia page view statistics

WS-GraphMatching: a web service tool for graph matching

DTMBIO workshop summary

An efficient algorithm for mining time interval-based patterns in large database

On identifying representative relevant documents

Online update of b-trees

Expansion and search in networks

How about utilizing ordinal information from the distribution of unlabeled data

StableBuffer: optimizing write performance for DBMS applications on flash devices

Multi-view clustering with constraint propagation for learning with an incomplete mapping between views

Power in unity: forming teams in large-scale community systems

Latent interest-topic model: finding the causal relationships behind dyadic data

A framework for evaluating database keyword search strategies

A method for discovering components of human rituals from streams of sensor data

Hybrid tag recommendation for social annotation systems

Orientation distance-based discriminative feature extraction for multi-class classification

Improving one-class collaborative filtering by incorporating rich user information

Predicting short-term interests using activity-based search context

Meta-metadata: a metadata semantics language for collection representation applications

Selectively diversifying web search results

Real-time memory efficient data redundancy removal algorithm

SUMMA: subgraph matching in massive graphs

Challenges in personalized authority flow based ranking of social media

Adaptive outlierness for subspace outlier ranking

Injecting domain knowledge into a granular database engine: a position paper

XReal: an interactive XML keyword searching

FALCON: seamless access to meeting data from the inbox and calendar

DOLAP 2010 workshop summary

Automatically weighting tags in XML collection

A new mathematics retrieval system

Understanding retweeting behaviors in social networks

Experiences with using SVM-based learning for multi-objective ranking

EUI: an embedded engine for understanding user intents from mobile devices

SPac: a distributed, peer-to-peer, secure and privacy-aware social space

Third workshop on exploiting semantic annotations in information retrieval (ESAIR): CIKM 2010 workshop

Skyline query processing for uncertain data

Explore click models for search ranking

Mapping web pages to database records via link paths

Learning to blend rankings: a monotonic transformation to blend rankings from heterogeneous domains

TC-DCA: a system for text classification based on document's content allocation

SEQUEL: query completion via pattern mining on multi-column structural data

3rd international workshop on patent information retrieval (PaIR'10)

FD-buffer: a buffer manager for databases on flash disks

Generating advertising keywords from video content

Personalized recommender system based on item taxonomy and folksonomy

Crawling the web for structured documents

MI-WDIS: web data integration system for market intelligence

PIKM 2010: ACM workshop for ph.d. students in information and knowledge management

Efficiently querying archived data using Hadoop

Web page classification on child suitability

Communication motifs: a tool to characterize social communications

Connecting the local and the online in information management

i-SEE: integrated stream execution environment over on-line data streams

Overview of the 2nd international workshop on search and mining user-generated contents

Evaluation of top-k queries in peer-to-peer networks using threshold algorithms

Rough sets based reasoning and pattern mining for a two-stage information filtering system

Improving taxonomies for large-scale hierarchical classifiers of web documents

Summarizing biological literature with BioSumm

A metamodel approach to flexible semantic web service discovery

A late fusion approach to cross-lingual document re-ranking

PTM: probabilistic topic mapping model for mining parallel document collections

Exploring and visualizing academic social networks

On top-k social web search

Learning to generate summary as structured output

Hierarchical auto-tagging: organizing Q&A knowledge for everyone

Quantifying uncertainty in multi-dimensional cardinality estimations

Group ranking with application to image retrieval

Extracting structured information from Wikipedia articles to populate infoboxes

Approximate membership localization (AML) for web-based join

Unifying explicit and implicit feedback for collaborative filtering

Automatic metadata extraction from multilingual enterprise content

An efficient data-centric storage scheme considering storage and query hot-spots in sensor networks

Directly optimizing evaluation measures in learning to rank based on the clonal selection algorithm

Intelligent sales forecasting engine using genetic algorithms

Formal approach and automated tool for constructing ontology from object-oriented database model

Query model refinement using word graphs

Identifying new categories in community question answering archives: a topic modeling approach

(k,P)-anonymity: towards pattern-preserving anonymity of time-series data

Building recommendation systems using peer-to-peer shared content

An effective approach for mining mobile user habits

Concurrent atomic protocols for making and changing decisions in social networks

Threshold behavior of incentives in social networks

Mining networks with shared items

Extending dictionary-based entity extraction to tolerate errors

Image retrieval at memory's edge: known image search based on user-drawn sketches

Learning sentiment classification model from labeled features

Yet another write-optimized DBMS layer for flash-based solid state storage

Utilizing re-finding for personalized information retrieval

Embedding tolerance relations in formal concept analysis: an application in information fusion

Print: a provenance model to support integration processes

User behavior driven ranking without editorial judgments

Online learning for multi-task feature selection

OLAP-based query recommendation

Alignment of short length parallel corpora with an application to web search

Exploiting user interests for collaborative filtering: interests expansion via personalized ranking

Selective data acquisition for probabilistic K-NN query

A feature-word-topic model for image annotation

K-farthest-neighbors-based concept boundary determination for support vector data description

Support elements in graph structured schema reintegration

Weighting common syntactic structures for natural language based information retrieval

Relational feature engineering of natural language processing

BP-tree: an efficient index for similarity search in high-dimensional metric spaces

Ranking with auxiliary data

Transfer incremental learning for pattern classification

Query optimization for ontology-based information integration

Using various term dependencies according to their utilities

Learning ontology resolution for document representation and its applications in text mining

Data aspects in a relational database

Modeling reformulation using passage analysis

Supervised identification and linking of concept mentions to a domain-specific ontology

A hierarchical approach to reachability query answering in very large graph databases

Online learning for recency search ranking using real-time user feedback

Relevance-index size tradeoff in contextual advertising

Computing the top-k maximal answers in a join of ranked lists

Expert identification in community question answering: exploring question selection bias

CasJoin: a cascade chain for text similarity joins

PruSM: a prudent schema matching approach for web forms

On the relationship between novelty and popularity of user-generated content

Learning naïve bayes transfer classifier through class-wise test distribution estimation

Anonymizing data with quasi-sensitive attribute values

Focused crawling using navigational rank

Top-Eye: top-k evolving trajectory outlier detection

TAER: time-aware entity retrieval-exploiting the past to find relevant entities in news articles

Multi task learning on multiple related networks

Demographic information flows

Building a semantic representation for personal information

Detecting periodic changes in search intentions in a search engine

Affinity-driven prediction and ranking of products in online product review sites

Recommendation based on object typicality

Topic-driven web search result organization by leveraging wikipedia semantic knowledge

Selecting keywords for content based recommendation

Fast dimension reduction for document classification based on imprecise spectrum analysis

Structural annotation of search queries using pseudo-relevance feedback

Manifold ranking with sink points for update summarization

Search as if you were in your home town: geographic search by regional context and dynamic feature-space selection

Construction of a sentimental word dictionary

Topic aspect analysis for multi-document summarization

Exploiting novelty, coverage and balance for topic-focused multi-document summarization

Finding unusual review patterns using unexpected rules

Context modeling for ranking and tagging bursty features in text streams

Visual-semantic graphs: using queries to reduce the semantic gap in web image retrieval

Comparison of six aggregation strategies to compute users' trustworthiness

Novel local features with hybrid sampling technique for image retrieval

Visualization and clustering of crowd video content in MPCA subspace

Expected browsing utility for web search evaluation

ANITA: a narrative interpretation of taxonomies for their adaptation to text collections

Community-based topic modeling for social tagging

Yes we can: simplex volume maximization for descriptive web-scale matrix factorization

Discriminative factored prior models for personalized content-based recommendation

Incorporating terminology evolution for query translation in text retrieval with association rules

Fast query expansion using approximations of relevance models

Exploiting co-occurrence and information quality metrics to recommend tags in web 2.0 applications

Mining rules to explain activities in videos

Elusive vandalism detection in wikipedia: a text stability-based approach

Online stratified sampling: evaluating classifiers at web-scale

Feature subspace transformations for enhancing k-means clustering

Routing questions to appropriate answerers in community question answering services

On bootstrapping recommender systems

Learning to rank with groups

Using Wikipedia categories for compact representations of chemical documents

Optimizing unified loss for web ranking specialization

Efficient wikipedia-based semantic interpreter by exploiting top-k processing

Hypergraph-based multilevel matrix approximation for text information retrieval

A study of rumor control strategies on social networks

A peer-selection algorithm for information retrieval

Domain-independent entity coreference in RDF graphs

Exploring domain-specific term weight in archived question search

Opinion digger: an unsupervised opinion miner from unstructured product reviews

Multi-information fusion for uncertain semantic representations of videos

Combining link and content for collective active learning

Classifying sentiment in microblogs: is brevity an advantage?

Identifying hotspots on the real-time web

Discovery of numerous specific topics via term co-occurrence analysis

Digging for knowledge with information extraction: a case study on human gene-disease associations

Towards query log based personalization using topic models

Choosing your own adventure: automatic taxonomy generation to permit many paths

Robust prediction from multiple heterogeneous data sources with partial information

Collaboration analytics: mining work patterns from collaboration activities

Adapting cost-sensitive learning for reject option

SKIF: a data imputation framework for concept drifting data streams

Detecting controversial events from twitter

Topic detection and organization of mobile text messages

Unsupervised public health event detection for epidemic intelligence

Pattern based keyword extraction for contextual advertising

Mixture model label propagation

Regularization and feature selection for networked features

CasJoin: a cascade chain for text similarity joins

Content Provider	ACM Digital Library
Author	Guo, Zhili; Zhu, Huijia; Guo, Honglei; Su, Zhong; Zhang, Xiaoxun
Abstract	We address the problem of similarity joins over text data, where the task is to find all pairs of documents whose similarity exceeds a given threshold. This problem is an indispensable step in many web applications. A crucial issue is to prune as many unnecessary candidate pairs as possible before the expensive similarity evaluation. In this paper, we propose adopting a cascade structure in text joins to obtain a large speedup: a later stage can exclude a considerable number of invalid pairs that survived earlier stages. We refer to the proposed algorithm as CasJoin. We further use a prefix filter to build the stages of CasJoin by introducing a novel view of the dynamic generation of document vectors. Specifically, a vector is partitioned into a chain of multiple prefixes that are appended one by one for cascade joining. We evaluate CasJoin on a typical web corpus, ODP. Experiments indicate that, compared to state-of-the-art prefix algorithms, CasJoin reduces the number of candidates by as much as 98.15% and speeds up joining by up to 13.34x.
Starting Page	1725
Ending Page	1728
Page Count	4
File Format	PDF
ISBN	9781450300995
DOI	10.1145/1871437.1871714
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2010-10-26
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Duplicate detection; Cascade filtering; Similarity joins; Prefix filtering; Similarity search
Content Type	Text
Resource Type	Article
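
The cascade idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the record gives no algorithmic detail beyond the abstract, so the code below assumes Jaccard similarity over token sets and uses a generic extended-prefix filter as the cascade stages. The names `cascade_join`, `prefix`, and the `stages` parameter are hypothetical. Stage 1 builds candidates from an inverted index over short prefixes; each later stage appends one more token to every prefix and demands one more shared token; exact similarity is computed only for the survivors.

```python
from collections import Counter, defaultdict
from math import ceil


def jaccard(a, b):
    """Exact Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def prefix(tokens, need, stage):
    """Prefix of a globally ordered token list used at a given stage.

    `need` is the minimum overlap the document requires; each stage
    appends one more token to the previous stage's prefix.
    """
    return tokens[: min(len(tokens), len(tokens) - need + stage)]


def cascade_join(docs, threshold, stages=2):
    """Return (matching index pairs, candidate count after each stage)."""
    # Assumption: order tokens from rarest to most frequent so that
    # prefixes are as selective as possible.
    freq = Counter(tok for doc in docs for tok in set(doc))
    ordered = [sorted(set(doc), key=lambda t: (freq[t], t)) for doc in docs]
    # Jaccard >= threshold implies overlap >= threshold * len(doc).
    # The small epsilon guards against floating-point round-up in ceil.
    need = [ceil(threshold * len(toks) - 1e-9) for toks in ordered]

    # Stage 1: pairs whose first prefixes share at least one token.
    index = defaultdict(list)
    candidates = set()
    for i, toks in enumerate(ordered):
        for tok in prefix(toks, need[i], 1):
            for j in index[tok]:
                candidates.add((j, i))
            index[tok].append(i)
    counts = [len(candidates)]

    # Later stages: longer prefixes must share more tokens, which can
    # discard pairs that survived the earlier stages.
    for stage in range(2, stages + 1):
        survivors = set()
        for i, j in candidates:
            required = min(stage, max(need[i], need[j]))
            shared = set(prefix(ordered[i], need[i], stage)) & set(
                prefix(ordered[j], need[j], stage)
            )
            if len(shared) >= required:
                survivors.add((i, j))
        candidates = survivors
        counts.append(len(candidates))

    # Final step: expensive exact verification on the remaining pairs.
    pairs = sorted(
        (i, j)
        for i, j in candidates
        if jaccard(set(ordered[i]), set(ordered[j])) >= threshold
    )
    return pairs, counts


docs = [
    "a b c d e".split(),
    "a b c d f".split(),
    "a b x y z".split(),
    "p q r s t".split(),
]
pairs, counts = cascade_join(docs, threshold=0.6, stages=3)
print(pairs, counts)
```

The second return value records how many candidate pairs remain after each stage, which makes the pruning effect of the cascade easy to inspect on a real corpus. Each stage is a necessary condition for the threshold, so under these assumptions the filter never drops a true match.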
