NDLI: Pair-Wise entity resolution: overview and challenges

Pair-Wise entity resolution: overview and challenges

Efficient processing of complex similarity queries in RDBMS through query rewriting

Movie review mining and summarization

On GMAP: and other transformations

Window join approximation over data streams with importance semantics

Validating associations in biological databases

Discovering and exploiting keyword and attribute-value co-occurrences to improve P2P routing indices

3DString: a feature string kernel for 3D object classification on voxelized data

Improving novelty detection for general topics using sentence level information patterns

Capturing community search expertise for personalized web search using snippet-indexes

KDDCS: a load-balanced in-network data-centric storage scheme for sensor networks

In search of meaning for time series subsequence clustering: matching algorithms based on a new distance measure

Ranking web objects from multiple communities

SaLSa: computing the skyline without scanning the whole sky

On the structural properties of massive telecom call graphs: findings and implications

Automatic computation of semantic proximity using taxonomic knowledge

Secure search in enterprise webs: tradeoffs in efficient implementation for document level security

Efficient model selection for regularized linear discriminant analysis

Ranking robustness: a novel framework to predict query performance

Concept frequency distribution in biomedical text summarization

Annotation propagation revisited for key preserving views

Performance thresholding in practical text classification

Document re-ranking using cluster validation and label propagation

Cache-oblivious nested-loop joins

An integer programming approach for frequent itemset hiding

A comparative study on classifying the functions of web page blocks

How I learned to stop worrying and love the imminent internet singularity

Distributed spatio-temporal similarity search

Utility scoring of product reviews

Investigating the exhaustivity dimension in content-oriented XML element retrieval evaluation

Adaptive non-linear clustering in data streams

Finding highly correlated pairs efficiently with powerful pruning

A document-centric approach to static index pruning in text retrieval systems

Effective and efficient classification on a search-engine model

Topic evolution and social interactions: how authors effect research

Summarizing local context to personalize global web search

Efficient range-constrained similarity search on wavelet synopses over multiple streams

Incremental hierarchical clustering of text documents

Voting for candidates: adapting data fusion techniques for an expert search task

Constrained subspace skyline computation

Heuristic containment check of partial tree-pattern queries in the presence of index graphs

Exploiting asymmetry in hierarchical topic extraction

Vector and matrix operations programmed with UDFs in a relational DBMS

Concept-based document readability in domain specific information retrieval

Query result ranking over e-commerce web databases

Describing differences between databases

Query optimization using restructured views

Text classification improved through multigram models

Tracking dragon-hunters with language models

A combination of trie-trees and inverted files for the indexing of set-valued attributes

Privacy preserving sequential pattern mining in distributed databases

A neighborhood-based approach for clustering of linked document collections

The real-time nature and value of homeland security information

Structure-based querying of proteins using wavelets

Mining blog stories using community-based and temporal clustering

Evaluation by comparing result sets in context

Classification spanning correlated data streams

Mining compressed commodity workflows from massive RFID data sets

Pruning strategies for mixed-mode querying

Multi-evidence, multi-criteria, lazy associative document classification

A fast and robust method for web page template detection and removal

A study on the effects of personalization and task information on implicit feedback performance

A data stream language and system designed for power and extensibility

Efficiently clustering transactional data with weighted coverage density

Bayesian adaptive user profiling with explicit & implicit feedback

Processing relaxed skylines in PDMS using distributed data summaries

TRIPS and TIDES: new algorithms for tree mining

CP/CV: concept similarity mining without frequency information from domain describing taxonomies

POLESTAR: collaborative knowledge management and sensemaking tools for intelligence analysts

A probabilistic relevance propagation model for hypertext retrieval

Optimisation methods for ranking functions with multiple parameters

A system for query-specific document summarization

Improving query I/O performance by permuting and refining block request sequences

Coupling feature selection and machine learning methods for navigational query identification

Designing semantics-preserving cluster representatives for scientific input conditions

Efficient join processing over uncertain data

A dictionary for approximate string search and longest prefix search

A structure-oriented relevance feedback method for XML retrieval

An approximate multi-word matching algorithm for robust document retrieval

Eigen-trend: trend analysis in the blogosphere based on singular value decompositions

Estimating average precision with incomplete and imperfect judgments

Knowing a web page by the company it keeps

Out-of-context noun phrase semantic interpretation with cross-linguistic evidence

Incorporating query difference for learning retrieval functions in world wide web search

Task-based process know-how reuse and proactive information delivery in TaskNavigator

Term context models for information retrieval

Estimating corpus size via queries

Adapting association patterns for text categorization: weaknesses and enhancements

Amnesic online synopses for moving objects

An efficient one-phase holistic twig join algorithm for XML data

Approximate reverse k-nearest neighbor queries in general metric spaces

Best-k queries on database systems

Boosting relevance model performance with query term dependence

Collaborative filtering in dynamic usage environments

Automatically constructing collections of online database directories

Combining feature selectors for text classification

Constructing better document and query models with markov chains

Continuous keyword search on multiple text streams

On subspace clustering with density consciousness

Direct comparison of commercial and academic retrieval system: an initial study

Effective and efficient similarity search in time series

Efficient mining of max frequent patterns in a generalized environment

Estimation, sensitivity, and generalization in parameterized retrieval models

Filtering or adapting: two strategies to exploit noisy parallel corpora for cross-language information retrieval

HUX: a schemacentric approach for updating XML views

Improving query translation with confidence estimation for cross language information retrieval

Information retrieval from relational databases using semantic queries

Integrated RFID data modeling: an approach for querying physical objects in pervasive computing

Integration of cluster ensemble and EM based text mining for microarray gene cluster identification and annotation

Introduction to a new Farsi stemmer

IR principles for content-based indexing and retrieval of functional brain images

Matching directories and OWL ontologies with AROMA

Matching and evaluation of disjunctive predicates for data stream sharing

Maximizing the sustained throughput of distributed continuous queries

Measuring the meaning in time series clustering of text search queries

Mining coherent patterns from heterogeneous microarray data

k nearest neighbor classification across multiple private databases

Modeling performance-driven workload characterization of web search systems

Multi-query optimization of sliding window aggregates by schedule synchronization

Multi-task text segmentation and alignment based on weighted mutual information

PEPX: a query-friendly probabilistic XML database

On progressive sequential pattern mining

Practical private data matching deterrent to spoofing attacks

Probabilistic document-context based relevance feedback with limited relevance judgments

Processing information intent via weak labeling

Pseudo-anchor text extraction for searching vertical objects

Re-ranking search results using query logs

Query taxonomy generation for web search

Rank synopses for efficient time travel on the web graph

Ranking in context using vector spaces

Representing documents with named entities for story link detection (SLD)

Resource-aware kernel density estimators over streaming data

Retrieval evaluation with incomplete relevance data: a comparative study of three measures

Robust periodicity detection algorithms

Search result summarization and disambiguation via contextual dimensions

Semi-automatic annotation and MPEG-7 authoring of dance videos

The query-vector document model

The visual funding navigator: analysis of the NSF funding information

Towards interactive indexing for large Chinese calligraphic character databases

Query-specific clustering of search results based on document-context similarity scores

Pair-Wise entity resolution: overview and challenges

Content Provider	ACM Digital Library
Author	Garcia-Molina, Hector
Abstract	Information integration is one of the oldest and most important computer science problems: Information from diverse sources must be combined, so that users can access and manipulate the information in a unified way. One of the central problems in information integration is that of Entity Resolution (ER) (sometimes referred to as deduplication). ER is the process of identifying and merging incoming records judged to represent the same real-world entity.
For example, consider a company that has different customer databases (e.g., one for each subsidiary), and would like to integrate them. Identifying matching records is challenging because there are no unique identifiers across the different sources or databases. A given customer may appear in different ways in each database, and there is a fair amount of guesswork in determining which customers match. Deciding if records match is often computationally expensive, e.g., it may involve finding maximal common subsequences in two strings. How to combine matching records is often also application dependent. For example, say different phone numbers appear in two records to be merged. In some cases we may wish to keep both of them, while in others we may want to pick just one as the "consolidated" number.
Another source of complexity is that newly merged records may match with other records. For instance, when we combine records $r_{1}$ and $r_{2}$ we may obtain a record $r_{12}$ that now matches $r_{3}$. The original records, $r_{1}$ and $r_{2}$, may not match with $r_{3}$, but because $r_{12}$ contains more information about the same real-world entity that $r_{1}$ and $r_{2}$ represent, the "connection" to $r_{3}$ may now be apparent. Such "chained" matches imply that new merged records must be recursively compared to all records.
There are many ways to perform ER, but in this talk I will explore only one general approach, where the decision of which records represent the same real-world entity is made in a pair-wise fashion. Furthermore, we assume that the matching is done by a "black-box" function, which makes our approach generic and applicable to many domains. Thus, given two records, $r_{1}$ and $r_{2}$, the match function $M(r_{1}, r_{2})$ returns true if there is enough evidence in the two records that they both refer to the same real-world entity. We also assume a black-box merge function that combines a pair of matching records.
In this talk I will discuss the advantages and disadvantages of such a generic, pair-wise approach to ER. Even though the approach is relatively simple, there are still many interesting challenges. For instance, how can one minimize the number of invocations of the match and merge black-boxes? Are there any properties of the functions that can significantly reduce the number of calls? If multiple processors are available, how can one distribute the computational load? If records have confidences associated with them, how does the problem complexity change, and how can we efficiently find the confidence of the resolved records? In the talk I will address these challenges, and report on some preliminary work we have done at Stanford. (This Stanford work is joint with Omar Benjelloun, Tyson Condie, Johnson (Heng) Gong, Jeff Jonas, Hideki Kawai, Tait E. Larson, David Menestrina, Nicolas Pombourcq, Qi Su, Steven Whang, and Jennifer Widom.) For additional information on ER and our Stanford SERF Project, please visit http://www-db.stanford.edu/serf/.
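The pair-wise scheme described in the abstract (a black-box match function, a black-box merge function, and merged records fed back for further comparison) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the talk: the record representation (a frozen set of attribute-value pairs), the toy `example_match` rule (at least two shared pairs), the toy `example_merge` rule (set union), and the simple worklist strategy in `resolve`.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

# Hypothetical record representation: a set of (attribute, value) pairs.
Record = FrozenSet[Tuple[str, str]]


def resolve(
    records: Iterable[Record],
    match: Callable[[Record, Record], bool],
    merge: Callable[[Record, Record], Record],
) -> List[Record]:
    """Resolve records using only black-box match and merge calls."""
    pending = list(records)        # records still to be compared
    resolved: List[Record] = []    # records with no match among themselves
    while pending:
        r = pending.pop()
        # Find the first already-resolved record that matches r, if any.
        partner = next((s for s in resolved if match(r, s)), None)
        if partner is None:
            resolved.append(r)
        else:
            # The merged record may match records that neither input
            # matched, so it goes back on the worklist for re-comparison.
            resolved.remove(partner)
            pending.append(merge(r, partner))
    return resolved


def example_match(a: Record, b: Record) -> bool:
    # Toy rule (an assumption): two shared attribute-value pairs suffice.
    return len(a & b) >= 2


def example_merge(a: Record, b: Record) -> Record:
    # Toy rule (an assumption): keep every value from both records.
    return a | b


r1 = frozenset({("name", "J. Doe"), ("phone", "555-0100"), ("email", "jd@example.com")})
r2 = frozenset({("name", "J. Doe"), ("phone", "555-0100"), ("zip", "94305")})
r3 = frozenset({("name", "John Doe"), ("email", "jd@example.com"), ("zip", "94305")})

result = resolve([r1, r2, r3], example_match, example_merge)
```

With this toy data, $r_{3}$ matches neither $r_{1}$ nor $r_{2}$ on its own, but it does match their merge, so all three inputs end up in a single resolved record. The loop always terminates because each merge reduces the total number of records by one.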
Starting Page	1
Ending Page	1
Page Count	1
File Format	PDF
ISBN	1595934332
DOI	10.1145/1183614.1183616
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2006-11-06
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Data cleaning, Entity resolution
Content Type	Text
Resource Type	Article
