Using Semantics & Statistics to Turn Data into Knowledge
| Field | Value |
|---|---|
| Content Provider | Semantic Scholar |
| Author | Pujara, Jay; Miao, Hui; Getoor, Lise |
| Copyright Year | 2014 |
| Abstract | Many information extraction and knowledge base construction systems are addressing the challenge of deriving knowledge from text. A key problem in constructing these knowledge bases from sources like the web is overcoming the erroneous and incomplete information found in millions of candidate extractions. To solve this problem, we turn to semantics – using ontological constraints between candidate facts to eliminate errors. In this article, we represent the desired knowledge base as a knowledge graph and introduce the problem of knowledge graph identification, collectively resolving the entities, labels, and relations present in the knowledge graph. Knowledge graph identification requires reasoning jointly over millions of extractions simultaneously, posing a scalability challenge to many approaches. We use probabilistic soft logic (PSL), a recently introduced statistical relational learning framework, to implement an efficient solution to knowledge graph identification and present state-of-the-art results for knowledge graph construction while performing an order of magnitude faster than competing methods. A growing body of research focuses on extracting knowledge from text such as news reports, encyclopedic articles, and scholarly research in specialized domains. Much of this data is freely available on the World Wide Web, and harnessing the knowledge contained in millions of web documents remains a problem of particular interest. The scale and diversity of this content pose a formidable challenge for systems designed to extract this knowledge. Many well-known broad-domain and open information extraction systems seek to build knowledge bases from text, including the Never-Ending Language Learning (NELL) project (Carlson et al., 2010), OpenIE (Etzioni et al., 2008), DeepDive (Niu et al., 2012), and efforts at Google (Pasca et al., 2006). Ultimately, these information extraction systems produce a collection of candidate facts that include a set of entities, attributes of these entities, and the relations between these entities. Information extraction systems use a sophisticated collection of strategies to generate candidate facts from web documents, spanning the syntactic, lexical, and structural features of text (Weikum and Theobald, 2010; Wimalasuriya and Dou, 2010). While these systems are capable of extracting many candidate facts from the web, their output is often hampered by noise. Documents contain inaccurate, outdated, incomplete, or hypothetical information, and the informal and creative language used in web documents is often difficult to interpret. As a result, the candidates produced by information extraction systems often miss key facts and include spurious outputs, compromising the usefulness of the extractions. In an effort to combat such noise, information extraction systems capture a vast array of features and statistics, ranging from the characteristics of the webpages used to generate extractions to the reliability of the particular patterns or techniques used to extract information. Using this host of features and a modest amount of training data, many information extraction systems employ heuristics or learned prediction functions to assign a confidence score to each candidate fact. These confidence scores capture the inherent uncertainty in the text from which the facts were extracted, and can ideally be used to improve the quality of the knowledge base. While many information extraction systems use features derived from text to measure the quality of candidate facts, few take advantage of the many semantic dependencies between these facts. For example, many categories, such as “male” and “female”, may be mutually exclusive or restricted to a subset of entities, such as living organisms. Recently, the Semantic Web movement has developed standards and tools to express these dependencies through ontologies designed to capture the diverse information present on the Internet. The problem of building domain-specific ontologies for expert users with Semantic Web tools is challenging and well researched, with high-quality ontologies for domains including bioinformatics, media such as music and books, and governmental data. More general ontologies have been developed for broad collections such as the online encyclopedia Wikipedia. These semantic constraints are valuable for improving the quality of knowledge bases, but incorporating these dependencies into existing information extraction systems is not straightforward. The constraints imposed by an ontology are generally constraints between facts. For example, candidate facts assigning a particular entity to the categories “male”, “female”, and “living organism” are interrelated (a minimal sketch of such a constraint appears after the table). Hence, leveraging the dependencies between facts in a knowledge base requires reasoning jointly about the extracted candidates. Due to the large scale at which information extraction systems … |
| File Format | PDF; HTM / HTML |
| Alternate Webpage(s) | http://www.cs.cmu.edu/~./wcohen/postscript/aimag-2014.pdf |
| Alternate Webpage(s) | http://www.cs.cmu.edu/~wcohen/postscript/aimag-2014.pdf |
| Alternate Webpage(s) | https://courses.soe.ucsc.edu/courses/cmps290c/Spring14/02/pages/attached-files/attachments/25564 |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
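The mutual-exclusion example in the abstract can be made concrete with a small sketch. The snippet below is only illustrative and uses assumed data: the entity, labels, and confidence values are hypothetical, and the article itself resolves candidates with joint probabilistic inference in probabilistic soft logic (PSL) rather than the greedy hard-decision filter shown here.

```python
from collections import defaultdict

# Hypothetical candidate facts as (entity, label, extractor confidence);
# the entity and labels are illustrative, not taken from the paper's data.
candidates = [
    ("kyrgyzstan", "country", 0.9),
    ("kyrgyzstan", "bird", 0.4),      # a spurious extraction
    ("kyrgyzstan", "location", 0.8),
]

# Ontological constraint: pairs of mutually exclusive labels.
mutually_exclusive = {frozenset({"country", "bird"})}

def resolve(candidates, mutually_exclusive):
    """Per entity, keep the highest-confidence labels that do not violate
    any mutual-exclusion constraint (a hard-decision stand-in for the
    joint probabilistic reasoning the article describes)."""
    by_entity = defaultdict(list)
    for entity, label, conf in candidates:
        by_entity[entity].append((conf, label))

    resolved = []
    for entity, scored in by_entity.items():
        kept = []
        for conf, label in sorted(scored, reverse=True):
            clashes = any(frozenset({label, other}) in mutually_exclusive
                          for _, other in kept)
            if not clashes:
                kept.append((conf, label))
        resolved.extend((entity, label, conf) for conf, label in kept)
    return resolved

print(resolve(candidates, mutually_exclusive))
# -> [('kyrgyzstan', 'country', 0.9), ('kyrgyzstan', 'location', 0.8)]
```

In the approach the abstract describes, constraints like this are instead combined with extractor confidences and resolved jointly over all candidate facts, so conflicting evidence is weighed probabilistically rather than discarded outright.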