Populating ontologies by semi-automatically inducing information extraction wrappers for lists in OCRed documents (2012).
| Content Provider | CiteSeerX |
|---|---|
| Author | Packer, Thomas L. |
| Abstract | A flexible, accurate, and efficient method of extracting facts from lists in OCRed documents and inserting them into an ontology would help make those facts machine queryable, linkable, and editable. To work well, however, such a process must be adaptable to variations in list format, tolerant of OCR errors, and careful in its selection of human guidance. We propose a wrapper-induction solution for information extraction that is specialized for lists in OCRed documents. In this approach, we first induce a grammar or model that can infer list structure and field labels in sequences of words in text. Second, we decrease the cost and improve the accuracy of this induction process using semi-supervised machine learning and active learning, allowing induction of a wrapper from a single hand-labeled instance per field per list. We then use the wrappers and data learned from the semi-supervised process to bootstrap an automatic (weakly supervised) wrapper induction process for additional lists in the same domain. In both induction scenarios, we automatically map labeled text to ontologically structured facts. Our implementation induces two kinds of wrappers, namely regular expressions and hidden Markov models. We evaluate our implementation in terms of annotation cost and extraction quality for lists in multiple types of historical documents. |
| File Format | |
| Publisher Date | 2012-01-01 |
| Access Restriction | Open |
| Subject Keyword | OCRed Document Semi-automatically Inducing Information Extraction Wrapper Induction Process Semi-supervised Machine Learning Annotation Cost Structured Fact Information Extraction Regular Expression List Format Additional List Field Label Semi-supervised Process Efficient Method Extraction Quality Implementation Induces Active Learning Wrapper-induction Solution Hidden Markov Model Single Hand-labeled Instance Multiple Type Human Guidance Historical Document Induction Scenario List Structure OCR Error Fact Machine |
| Content Type | Text |