NDLI: Performing information extraction to improve OCR error detection in semi-structured historical documents

Style-based retrieval for ancient Syriac manuscripts

Linking the past: discovering historical social networks from documents and linking to a genealogical database

Rule based document understanding of historical books using a hybrid fuzzy classification system

Development of Nom character segmentation for collecting patterns from historical document pages

IMPACT: centre of competence in text digitisation

TSV-LR: topological signature vector-based lexicon reduction for fast recognition of pre-modern Arabic subwords

Automatic indexing of French handwritten census registers for probate genealogy

Combining statistical and geometrical classifiers for text extraction in multispectral document images

Character segmentation from ancient palm leaf manuscripts in Thailand

An experimental workflow development platform for historical document digitisation and analysis

Searching historical manuscripts for near-duplicate figures

Enabling search for facts and implied facts in historical documents

Grid-based modelling and correction of arbitrarily warped historical document images for large-scale digitisation

Thanatos: automatically retrieving information from death certificates in Brazil

HistDoc v. 2.0: enhancing a platform to process historical documents

User-assisted alignment of Arabic historical manuscripts

Performing information extraction to improve OCR error detection in semi-structured historical documents

Towards a faithful visualization of historical books on e-book readers

A design of a preprocessing framework for large database of historical documents

Transcription alignment of Latin manuscripts using hidden Markov models

Data mining medieval documents by word spotting

Text line segmentation for gray scale historical document images

The CADAL calligraphic database

A keyword spotting approach using blurred shape model-based descriptors

Image processing for historical newspaper archives

Performing information extraction to improve OCR error detection in semi-structured historical documents

Content Provider	ACM Digital Library
Author	Packer, Thomas L.
Abstract	Optical character recognition (OCR) produces transcriptions of document images. These transcriptions often contain incorrectly recognized characters, which we must avoid or correct downstream. An ability to both identify OCR errors and extract information from OCR output would allow us to extract and index only correct information and to post-process specific parts of the OCR output with targeted resources (e.g., re-OCR using specialized dictionaries). We present a general approach to OCR error detection that uses a hidden Markov model trained to simultaneously detect OCR errors and extract information. We evaluate this approach in two information extraction settings and on semi-structured text from two machine-printed family history documents. We show this joint approach to OCR error detection to be an improvement over two alternative approaches, one based on dictionary matching and the other using a hidden Markov model trained only to detect OCR errors. In particular, we report an average 8% increase in macro-averaged F-measure between the dictionary approach and our best HMM. Our contribution is to show how an OCR error detection approach based on a word model can be improved by joining this task with an information extraction task, and that an improvement in OCR error detection is achieved regardless of the information extraction task.
Starting Page	67
Ending Page	74
Page Count	8
File Format	PDF
ISBN	9781450309165
DOI	10.1145/2037342.2037354
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2011-09-16
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Optical character recognition; Error detection; Information extraction; Hidden Markov model; Semi-structured text; OCR
Content Type	Text
Resource Type	Article
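
The abstract compares a dictionary-matching baseline against HMM-based detectors using macro-averaged F-measure. The following is a minimal sketch of those two ideas only, not the paper's implementation: the function names, the toy tokens, the lexicon, and the choice to average F1 over the two classes "error" and "not error" are all assumptions made for illustration.

```python
def dictionary_flags(tokens, lexicon):
    """Flag a token as a suspected OCR error when it is absent from the lexicon.

    Hypothetical baseline: the paper's actual dictionary matching may differ.
    """
    return [tok.lower() not in lexicon for tok in tokens]


def macro_f_measure(gold, predicted):
    """Unweighted mean of per-class F1 over every class seen in gold or predicted."""
    classes = sorted(set(gold) | set(predicted))
    scores = []
    for cls in classes:
        pairs = list(zip(gold, predicted))
        tp = sum(g == cls and p == cls for g, p in pairs)
        fp = sum(g != cls and p == cls for g, p in pairs)
        fn = sum(g == cls and p != cls for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (assumed): "Srnith" and "1B42" are OCR errors; "Packer" is a
# correct surname that the small lexicon happens not to contain.
tokens = ["John", "Srnith", "Packer", "born", "1B42"]
lexicon = {"john", "smith", "born"}
gold = [False, True, False, False, True]

predicted = dictionary_flags(tokens, lexicon)
score = macro_f_measure(gold, predicted)
print(predicted, round(score, 3))
```

In this toy case the dictionary baseline wrongly flags the out-of-lexicon name "Packer", which is the kind of false positive a model with field context (for example, knowing the token sits in a surname position) could plausibly avoid.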
