Scalable Recognition, Extraction, and Structuring of Data from Lists in OCRed Text using Unsupervised Active Wrapper Induction
| Content Provider | CiteSeerX |
|---|---|
| Author | Packer, Thomas L.; Embley, David W. |
| Abstract | A process for accurately and automatically extracting asserted facts from lists in OCRed documents and inserting them into an ontology would contribute to making a variety of historical documents machine searchable, queryable, and linkable. To work well, such a process should be adaptable to variations in document and list format, tolerant of OCR errors, and careful in its selection of human guidance. We propose an unsupervised active wrapper induction solution for finding and extracting information from lists in OCRed text. ListReader discovers lists in the text of an OCRed document and induces a grammar for the internal structure of list records without document-specific feature engineering or supervision. ListReader then applies the knowledge in this grammar to actively request a limited and targeted set of labels from a user to complete its list wrapper. Lastly, ListReader applies the completed wrapper, encoded as a regular expression, to extract information with high precision from the entire document and automatically maps the labeled text it produces to a rich variety of ontologically structured predicates. We evaluate our implementation on a family history book in terms of F-measure and annotation cost, showing with statistical significance that ListReader learns to extract high-quality data with less cost than a state-of-the-art statistical sequence labeler. |
| File Format | |
| Access Restriction | Open |
| Subject Keyword | OCRed Text; Scalable Recognition; Active Wrapper Induction; OCRed Document; Annotation Cost; Regular Expression; List Format; State-of-the-art Statistical Sequence Labeler; Internal Structure; Statistical Significance; Induction Solution; Rich Variety; Labeled Text; Entire Document; Historical Document Machine; Document-specific Feature Engineering; Family History Book; High Precision; High-quality Data; Human Guidance; Targeted Set; OCR Error; List Record; List Wrapper |
| Content Type | Text |
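
The abstract describes a completed list wrapper "encoded as a regular expression" whose labeled output is mapped to structured predicates. As a rough illustration only, the sketch below applies a hand-written regular expression to a few list records. The record layout, the field names (`child_number`, `given_name`, `birth_year`), the OCR digit substitutions, and the `extract_records` helper are all assumptions made for this example; none of them comes from ListReader, which induces its wrapper rather than relying on a hand-written pattern.

```python
import re

# Hypothetical record layout (an assumption, not taken from the paper):
#   "<child number>. <given name>, b. <birth year>"
# Named groups stand in for the field labels a user would supply.
RECORD = re.compile(
    r"^\s*(?P<child_number>\d+)[.,]\s+"                        # "1." or OCR-damaged "1,"
    r"(?P<given_name>[A-Z][A-Za-z]+(?:\s[A-Z][A-Za-z]+)*),\s+"  # one or more capitalized words
    r"b[.,]\s+(?P<birth_year>[0-9IlO]{4})\s*$"                 # allow letters often confused with digits
)

# Assumed OCR confusions: I and l read for 1, O read for 0.
OCR_DIGITS = str.maketrans({"I": "1", "l": "1", "O": "0"})


def extract_records(text):
    """Return one dict of labeled fields per line that matches the wrapper."""
    records = []
    for line in text.splitlines():
        match = RECORD.match(line)
        if match is None:
            continue  # not a list record under this wrapper
        fields = match.groupdict()
        fields["birth_year"] = fields["birth_year"].translate(OCR_DIGITS)
        records.append(fields)
    return records


sample = """Children of the above:
1. Mary Ann, b. 1821
2. John, b. l824
3, Sarah, b. 1830
"""

for record in extract_records(sample):
    print(record)
```

Each resulting dictionary could then be mapped to ontology predicates; that mapping step is not sketched here because the record does not describe it in enough detail.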