Using Semantics & Statistics to Turn Data into Knowledge
| Field | Value |
|---|---|
| Content Provider | Semantic Scholar |
| Author | Pujara, Jay; Miao, Hui; Getoor, Lise |
| Copyright Year | 2014 |
| Abstract | Many information extraction and knowledge base construction systems are addressing the challenge of deriving knowledge from text. A key problem in constructing these knowledge bases from sources like the web is overcoming the erroneous and incomplete information found in millions of candidate extractions. To solve this problem, we turn to semantics – using ontological constraints between candidate facts to eliminate errors. In this article, we represent the desired knowledge base as a knowledge graph and introduce the problem of knowledge graph identification, collectively resolving the entities, labels, and relations present in the knowledge graph. Knowledge graph identification requires reasoning jointly over millions of extractions simultaneously, posing a scalability challenge to many approaches. We use probabilistic soft logic (PSL), a recently introduced statistical relational learning framework, to implement an efficient solution to knowledge graph identification and present state-of-the-art results for knowledge graph construction while performing an order of magnitude faster than competing methods. A growing body of research focuses on extracting knowledge from text such as news reports, encyclopedic articles, and scholarly research in specialized domains. Much of this data is freely available on the World Wide Web, and harnessing the knowledge contained in millions of web documents remains a problem of particular interest. The scale and diversity of this content pose a formidable challenge for systems designed to extract this knowledge. Many well-known broad-domain and open information extraction systems seek to build knowledge bases from text, including the Never-Ending Language Learning (NELL) project (Carlson et al., 2010), OpenIE (Etzioni et al., 2008), DeepDive (Niu et al., 2012), and efforts at Google (Pasca et al., 2006). Ultimately, these information extraction systems produce a collection of candidate facts that include a set of entities, attributes of these entities, and the relations between these entities. Information extraction systems use a sophisticated collection of strategies to generate candidate facts from web documents, spanning the syntactic, lexical, and structural features of text (Weikum and Theobald, 2010; Wimalasuriya and Dou, 2010). While these systems are capable of extracting many candidate facts from the web, their output is often hampered by noise. Documents contain inaccurate, outdated, incomplete, or hypothetical information, and the informal and creative language used in web documents is often difficult to interpret. As a result, the candidates produced by information extraction systems often miss key facts and include spurious outputs, compromising the usefulness of the extractions. In an effort to combat such noise, information extraction systems capture a vast array of features and statistics, ranging from the characteristics of the webpages used to generate extractions to the reliability of the particular patterns or techniques used to extract information. Using this host of features and a modest amount of training data, many information extraction systems employ heuristics or learned prediction functions to assign a confidence score to each candidate fact. These confidence scores capture the inherent uncertainty in the text from which the facts were extracted, and can ideally be used to improve the quality of the knowledge base. While many information extraction systems use features derived from text to measure the quality of candidate facts, few take advantage of the many semantic dependencies between these facts. For example, many categories, such as “male” and “female”, may be mutually exclusive or restricted to a subset of entities, such as living organisms. Recently, the Semantic Web movement has developed standards and tools to express these dependencies through ontologies designed to capture the diverse information present on the Internet. The problem of building domain-specific ontologies for expert users with Semantic Web tools is challenging and well researched, with high-quality ontologies for domains including bioinformatics, media such as music and books, and governmental data. More general ontologies have been developed for broad collections such as the online encyclopedia Wikipedia. These semantic constraints are valuable for improving the quality of knowledge bases, but incorporating these dependencies into existing information extraction systems is not straightforward. The constraints imposed by an ontology are generally constraints between facts. For example, candidate facts assigning a particular entity to the categories “male”, “female”, and “living organism” are interrelated (a minimal sketch of such a constraint appears after the table). Hence, leveraging the dependencies between facts in a knowledge base requires reasoning jointly about the extracted candidates. Due to the large scale at which information extraction systems … |
| File Format | PDF; HTM / HTML |
| Alternate Webpage(s) | http://www.cs.cmu.edu/~./wcohen/postscript/aimag-2014.pdf |
| Alternate Webpage(s) | http://www.cs.cmu.edu/~wcohen/postscript/aimag-2014.pdf |
| Alternate Webpage(s) | https://courses.soe.ucsc.edu/courses/cmps290c/Spring14/02/pages/attached-files/attachments/25564 |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
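The mutual-exclusion example in the abstract can be made concrete with a small sketch. The snippet below is only illustrative and uses assumed data: the entity, labels, and confidence values are hypothetical, and the article itself resolves candidates with joint probabilistic inference in probabilistic soft logic (PSL) rather than the greedy hard-decision filter shown here.

```python
from collections import defaultdict

# Hypothetical candidate facts as (entity, label, extractor confidence);
# the entity and labels are illustrative, not taken from the paper's data.
candidates = [
    ("kyrgyzstan", "country", 0.9),
    ("kyrgyzstan", "bird", 0.4),      # a spurious extraction
    ("kyrgyzstan", "location", 0.8),
]

# Ontological constraint: pairs of mutually exclusive labels.
mutually_exclusive = {frozenset({"country", "bird"})}

def resolve(candidates, mutually_exclusive):
    """Per entity, keep the highest-confidence labels that do not violate
    any mutual-exclusion constraint (a hard-decision stand-in for the
    joint probabilistic reasoning the article describes)."""
    by_entity = defaultdict(list)
    for entity, label, conf in candidates:
        by_entity[entity].append((conf, label))

    resolved = []
    for entity, scored in by_entity.items():
        kept = []
        for conf, label in sorted(scored, reverse=True):
            clashes = any(frozenset({label, other}) in mutually_exclusive
                          for _, other in kept)
            if not clashes:
                kept.append((conf, label))
        resolved.extend((entity, label, conf) for conf, label in kept)
    return resolved

print(resolve(candidates, mutually_exclusive))
# -> [('kyrgyzstan', 'country', 0.9), ('kyrgyzstan', 'location', 0.8)]
```

In the approach the abstract describes, constraints like this are instead combined with extractor confidences and resolved jointly over all candidate facts, so conflicting evidence is weighed probabilistically rather than discarded outright.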