Content Provider | ACM Digital Library |
---|---|
Author | Garcia-Molina, Hector |
Abstract | Information integration is one of the oldest and most important computer science problems: information from diverse sources must be combined so that users can access and manipulate it in a unified way. One of the central problems in information integration is that of Entity Resolution (ER), sometimes referred to as deduplication. ER is the process of identifying and merging incoming records judged to represent the same real-world entity. For example, consider a company that has different customer databases (e.g., one for each subsidiary) and would like to integrate them. Identifying matching records is challenging because there are no unique identifiers across the different sources or databases. A given customer may appear in different ways in each database, and there is a fair amount of guesswork in determining which customers match. Deciding whether records match is often computationally expensive, e.g., it may involve finding maximal common subsequences in two strings. How to combine matching records is often also application dependent. For example, say different phone numbers appear in two records to be merged. In some cases we may wish to keep both of them, while in others we may want to pick just one as the "consolidated" number. Another source of complexity is that newly merged records may match with other records. For instance, when we combine records $r_{1}$ and $r_{2}$ we may obtain a record $r_{12}$ that now matches $r_{3}$. The original records, $r_{1}$ and $r_{2}$, may not match with $r_{3}$, but because $r_{12}$ contains more information about the same real-world entity that $r_{1}$ and $r_{2}$ represent, the "connection" to $r_{3}$ may now be apparent. Such "chained" matches imply that new merged records must be recursively compared to all records. There are many ways to perform ER, but in this talk I will explore only one general approach, where the decision of which records represent the same real-world entity is made in a pair-wise fashion. Furthermore, we assume that the matching is done by a "black-box" function, which makes our approach generic and applicable to many domains. Thus, given two records, $r_{1}$ and $r_{2}$, the match function $M(r_{1}, r_{2})$ returns true if there is enough evidence in the two records that they both refer to the same real-world entity. We also assume a black-box merge function that combines a pair of matching records. In this talk I will discuss the advantages and disadvantages of such a generic, pair-wise approach to ER. Even though the approach is relatively simple, there are still many interesting challenges. For instance, how can one minimize the number of invocations of the match and merge black boxes? Are there any properties of the functions that can significantly reduce the number of calls? If multiple processors are available, how can one distribute the computational load? If records have confidences associated with them, how does the problem complexity change, and how can we efficiently find the confidence of the resolved records? In the talk I will address these challenges and report on some preliminary work we have done at Stanford. (This Stanford work is joint with Omar Benjelloun, Tyson Condie, Johnson (Heng) Gong, Jeff Jonas, Hideki Kawai, Tait E. Larson, David Menestrina, Nicolas Pombourcq, Qi Su, Steven Whang, and Jennifer Widom.) For additional information on ER and our Stanford SERF Project, please visit http://www-db.stanford.edu/serf/. |
Starting Page | 1 |
Ending Page | 1 |
Page Count | 1 |
File Format | |
ISBN | 1595934332 |
DOI | 10.1145/1183614.1183616 |
Language | English |
Publisher | Association for Computing Machinery (ACM) |
Publisher Date | 2006-11-06 |
Publisher Place | New York |
Access Restriction | Subscribed |
Subject Keyword | Data cleaning; Entity resolution |
Content Type | Text |
Resource Type | Article |
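
The generic pair-wise approach described in the abstract can be illustrated with a small worklist loop. This is a minimal sketch under stated assumptions, not the algorithm presented in the talk: the record layout (attribute name mapped to a set of values), the rules inside `match` and `merge`, the function name `resolve`, and the sample records are all hypothetical and chosen only to reproduce the "chained" match between $r_{12}$ and $r_{3}$.

```python
from typing import Callable, Dict, FrozenSet, List

# Assumed record layout: attribute name -> set of observed values.
Record = Dict[str, FrozenSet[str]]


def match(r1: Record, r2: Record) -> bool:
    """Hypothetical black-box M(r1, r2): same name, or same phone AND email."""
    def shares(attr: str) -> bool:
        return bool(r1.get(attr, frozenset()) & r2.get(attr, frozenset()))
    return shares("name") or (shares("phone") and shares("email"))


def merge(r1: Record, r2: Record) -> Record:
    """Hypothetical black-box merge: keep every value seen for each attribute."""
    return {a: r1.get(a, frozenset()) | r2.get(a, frozenset())
            for a in r1.keys() | r2.keys()}


def resolve(records: List[Record],
            match_fn: Callable[[Record, Record], bool],
            merge_fn: Callable[[Record, Record], Record]) -> List[Record]:
    """Compare each pending record against the resolved set; a merged
    record goes back on the pending list so it is compared again."""
    pending = list(records)
    resolved: List[Record] = []
    while pending:
        current = pending.pop()
        for i, other in enumerate(resolved):
            if match_fn(current, other):
                del resolved[i]                      # partner is superseded
                pending.append(merge_fn(current, other))
                break
        else:
            resolved.append(current)                 # matched nothing so far
    return resolved


r1 = {"name": frozenset({"J. Doe"}), "phone": frozenset({"555-1234"})}
r2 = {"name": frozenset({"J. Doe"}), "email": frozenset({"jdoe@example.com"})}
r3 = {"name": frozenset({"John Doe"}), "phone": frozenset({"555-1234"}),
      "email": frozenset({"jdoe@example.com"})}
r4 = {"name": frozenset({"A. Smith"}), "phone": frozenset({"555-9999"})}

entities = resolve([r1, r2, r3, r4], match, merge)
```

With these sample rules, neither `r1` nor `r2` matches `r3` on its own, but their merge does, so `entities` ends up holding two records: one for A. Smith and one combining `r1`, `r2`, and `r3`. Because `match_fn` and `merge_fn` are passed in as opaque callables, the loop itself is domain independent; it makes no attempt to minimize the number of calls, which is one of the challenges the abstract raises.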