NDLI: Cost-Effective Information Extraction from Lists in OCRed Historical Documents

Space Displacement Localization Neural Networks to locate origin points of handwritten text lines in historical documents

Learning Texture Features for Enhancement and Segmentation of Historical Document Images

Template generation from postmarks using cascaded unsupervised learning

A Framework for Efficient Transcription of Historical Documents Using Keyword Spotting

Selecting Autoencoder Features for Layout Analysis of Historical Documents

Publication Date Estimation for Printed Historical Documents using Convolutional Neural Networks

Handwritten Text Recognition Results on the Bentham Collection with Improved Classical N-Gram-HMM methods

Layout Analysis Algorithm Based on Probabilistic Graphical Model for Dunhuang Historical Documents

Large scale style based dating of medieval manuscripts

Cost-Effective Information Extraction from Lists in OCRed Historical Documents

Combining Learned Script Points and Combinatorial Optimization for Text Line Extraction

Homogenization of 2D & 3D Document Formats for Cuneiform Script Analysis

Historical Typewritten Document Recognition Using Minimal User Interaction

Document Image Binarization using LSTM: A Sequence Learning Approach

A Character Style Library for Syriac Manuscripts

Europeana Newspapers OCR Workflow Evaluation

Clustering Historical Documents Based on the Reconstruction Error of Autoencoders

Retrieving Cuneiform Structures in a Segmentation-free Word Spotting Framework

Cost-Effective Information Extraction from Lists in OCRed Historical Documents

Content Provider	ACM Digital Library
Author	Packer, Thomas L.; Embley, David W.
Abstract	To work well, machine-learning-based approaches to information extraction and ontology population often require a large number of manually selected and annotated examples. In this paper, we propose ListReader, which provides a way to train the structure and parameters of a Hidden Markov Model (HMM) without requiring any labeled training data. The induced HMM is a wrapper: a function that hides within it the complexities of low-level processing, which in ListReader's case are the complexities of information extraction from OCRed historical documents. The HMM wrapper is capable of recognizing lists of records in text documents and associating subsets of identical fields across related record templates. The algorithmic training method we employ is based on a novel unsupervised active grammar-induction framework. The training produces an HMM wrapper and uses an efficient active sampling process to complete the mapping from wrapper to ontology by requesting annotations from a user for automatically selected examples. We measure performance of the final HMM in terms of F-measure of extracted information and manual annotation cost, and show that ListReader learns faster and better than both a state-of-the-art baseline and an alternate version of ListReader that induces a regular-expression wrapper.
Starting Page	23
Ending Page	30
Page Count	8
File Format	PDF
ISBN	9781450336024
DOI	10.1145/2809544.2809547
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2015-08-22
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Ontology population; Active learning; HMM; Information extraction; List; Hidden Markov model; Grammar induction; Unsupervised learning; OCR; Wrapper induction
Content Type	Text
Resource Type	Article
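
The abstract describes an HMM wrapper that labels the fields of list records. As a rough illustration of how an HMM assigns field labels to the tokens of one record, the following is a minimal sketch of Viterbi decoding. It is not ListReader's method: ListReader induces the HMM's structure and parameters without labeled data, whereas here the states (NAME, DATE, SEP), the token classes, and every probability are hypothetical values set by hand for the example.

```python
import math

# Hypothetical field labels; ListReader induces its own states.
STATES = ["NAME", "DATE", "SEP"]

# Hand-set (assumed) start and transition probabilities.
START = {"NAME": 0.8, "DATE": 0.1, "SEP": 0.1}
TRANS = {
    "NAME": {"NAME": 0.5, "DATE": 0.1, "SEP": 0.4},
    "DATE": {"NAME": 0.1, "DATE": 0.3, "SEP": 0.6},
    "SEP":  {"NAME": 0.4, "DATE": 0.5, "SEP": 0.1},
}

# Hand-set (assumed) emission probabilities over coarse token classes.
EMIT = {
    "NAME": {"word": 0.9, "digits": 0.05, "punct": 0.05},
    "DATE": {"word": 0.05, "digits": 0.9, "punct": 0.05},
    "SEP":  {"word": 0.05, "digits": 0.05, "punct": 0.9},
}


def token_class(token):
    """Map a token to a coarse class: digits, word, or punct."""
    if token.isdigit():
        return "digits"
    if token.isalpha():
        return "word"
    return "punct"


def viterbi(tokens):
    """Return the most probable state sequence for the tokens (log-space)."""
    if not tokens:
        return []
    first = token_class(tokens[0])
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][first]) for s in STATES}]
    backpointers = []
    for token in tokens[1:]:
        kind = token_class(token)
        prev = scores[-1]
        cur, ptr = {}, {}
        for s in STATES:
            # Best predecessor state for s at this position.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][kind])
            ptr[s] = best
        scores.append(cur)
        backpointers.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi(["John", "Smith", ",", "1823", "."]))
```

With these assumed parameters, the record tokens `John Smith , 1823 .` decode to NAME, NAME, SEP, DATE, SEP. A learned wrapper would replace the hand-set tables with induced ones; the decoding step itself would stay the same.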
