NDLI: Automatic complex schema matching across Web query interfaces: A correlation mining approach

Content Provider	ACM Digital Library
Author	He, Bin; Chang, Kevin Chen-Chuan
Copyright Year	2006
Abstract	To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. While complex matchings are common, because of their far more complex search space, most existing techniques focus on simple 1:1 matchings. To tackle this challenge, this article takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this “deep Web,” query interfaces generally form complex matchings between attribute groups (e.g., {author} corresponds to {first name, last name} in the Books domain). We observe that the co-occurrence patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., {first name, last name}) tend to be co-present in query interfaces and thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preprocessing, dual mining of positive and negative correlations, and finally matching construction. We evaluate the DCM framework on manually extracted interfaces and the results show good accuracy for discovering complex matchings. Further, to automate the entire matching process, we incorporate automatic techniques for interface extraction. Executing the DCM framework on automatically extracted interfaces, we find that the inevitable errors in automatic interface extraction may significantly affect the matching result. To make the DCM framework robust against such “noisy” schemas, we integrate it with a novel “ensemble” approach, which creates an ensemble of DCM matchers by randomizing the schema data into many trials and aggregating their ranked results by majority voting. As a principled basis, we provide analytic justification of the robustness of the ensemble approach. Empirically, our experiments show that the “ensemblization” indeed significantly boosts the matching accuracy over automatically extracted and thus noisy schema data. By employing the DCM framework with the ensemble approach, we thus complete an automatic process of matching Web query interfaces.
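The two ideas in the abstract (mining positive and negative attribute correlations, then voting across randomized trials) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the article's method: the toy `INTERFACES` data is invented, the phi coefficient stands in for whatever correlation measure the DCM framework actually uses, each trial is assumed to draw a random subset of interfaces, and the vote is taken over pair membership instead of ranked results. The names `phi`, `mine`, and `ensemble_mine` are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt

# Hypothetical toy data: each query interface is the set of attribute labels it shows.
INTERFACES = [
    {"author", "title"},
    {"first name", "last name", "title"},
    {"author", "title", "isbn"},
    {"first name", "last name", "title", "isbn"},
    {"author", "isbn"},
    {"first name", "last name", "title"},
]


def phi(a, b, schemas):
    """Phi coefficient between the presence of attributes a and b (stand-in measure)."""
    n = len(schemas)
    n_ab = sum(1 for s in schemas if a in s and b in s)
    n_a = sum(1 for s in schemas if a in s)
    n_b = sum(1 for s in schemas if b in s)
    denom = sqrt(n_a * (n - n_a) * n_b * (n - n_b))
    if denom == 0:
        return 0.0  # an attribute present in all or none of the schemas carries no signal
    return (n * n_ab - n_a * n_b) / denom


def mine(schemas, threshold=0.5):
    """Return (group candidates, synonym candidates) as sets of attribute pairs."""
    attrs = sorted(set().union(*schemas))
    groups, synonyms = set(), set()
    for a, b in combinations(attrs, 2):
        score = phi(a, b, schemas)
        if score >= threshold:
            groups.add((a, b))      # co-present: likely grouping attributes
        elif score <= -threshold:
            synonyms.add((a, b))    # rarely co-occur: likely synonyms
    return groups, synonyms


def ensemble_mine(schemas, trials=25, sample_size=4, seed=0):
    """Run mine() on random subsets and keep pairs found in a majority of trials."""
    rng = random.Random(seed)
    group_votes, synonym_votes = {}, {}
    for _ in range(trials):
        sample = rng.sample(schemas, sample_size)
        groups, synonyms = mine(sample)
        for pair in groups:
            group_votes[pair] = group_votes.get(pair, 0) + 1
        for pair in synonyms:
            synonym_votes[pair] = synonym_votes.get(pair, 0) + 1

    def majority(votes):
        return {pair for pair, count in votes.items() if count > trials / 2}

    return majority(group_votes), majority(synonym_votes)
```

On this toy data, `mine(INTERFACES)` pairs `first name` with `last name` as a group candidate and marks `author` as a synonym candidate of each, which mirrors the {author} to {first name, last name} example in the abstract. `ensemble_mine(INTERFACES)` retains those pairs because they appear in every sampled trial.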
Starting Page	346
Ending Page	395
Page Count	50
File Format	PDF
ISSN	0362-5915
e-ISSN	1557-4644
DOI	10.1145/1132863.1132872
Volume Number	31
Issue Number	1
Journal	ACM Transactions on Database Systems (TODS)
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2006-03-01
Publisher Place	New York
Access Restriction	One Nation One Subscription (ONOS)
Subject Keyword	Data integration; Bagging predictors; Correlation mining; Deep Web; Ensemble; Schema matching
Content Type	Text
Resource Type	Article
Subject	Information Systems

Similar Documents

Discovering complex matchings across web query interfaces: a correlation mining approach

Automatic complex schema matching across web query interfaces: A correlation mining approach (2003)

Making holistic schema matching robust: an ensemble approach

Discovering complex matchings across web query interfaces: a correlation mining approach (2004).

Research track paper discovering complex matchings across web query interfaces: a correlation mining approach ∗.

Mining Complex Matchings across Web Query Interfaces

Deep web data integration approach based on schema and attributes extraction of query interfaces.

Making holistic schema matching robust: an ensemble approach (2005).

Automatic complex schema matching across Web query interfaces: A correlation mining approach