Robust named entity detection from optical character recognition output

Content Provider	Springer Nature Link
Author	Subramanian, Krishna; Prasad, Rohit; Natarajan, Prem
Copyright Year	2011
Abstract	In this paper, we focus on information extraction from optical character recognition (OCR) output. Since the content from OCR inherently has many errors, we present robust algorithms for information extraction from OCR lattices instead of merely looking them up in the top-choice (1-best) OCR output. Specifically, we address the challenge of named entity detection in noisy OCR output and show that searching for named entities in the recognition lattice significantly improves detection accuracy over 1-best search. While lattice-based named entity (NE) detection improves NE recall from OCR output, there are two problems with this approach: (1) the number of false alarms can be prohibitive for certain applications and (2) lattice-based search is computationally more expensive than 1-best NE lookup. To mitigate the above challenges, we present techniques for reducing false alarms using confidence measures and for reducing the amount of computation involved in performing the NE search. Furthermore, to demonstrate that our techniques are applicable across multiple domains and languages, we experiment with optical character recognition systems for videotext in English and scanned handwritten text in Arabic.
Starting Page	189
Ending Page	200
Page Count	12
File Format	PDF
ISSN	1433-2833
Journal	International Journal of Document Analysis and Recognition (IJDAR)
Volume Number	14
Issue Number	2
e-ISSN	1433-2825
Language	English
Publisher	Springer-Verlag
Publisher Date	2011-04-13
Publisher Place	Berlin, Heidelberg
Access Restriction	One Nation One Subscription (ONOS)
Subject Keyword	Optical character recognition; Hidden Markov Model; Information extraction; Named entity detection; Image Processing and Computer Vision; Pattern Recognition
Content Type	Text
Resource Type	Article
Subject	Computer Science Applications; Computer Vision and Pattern Recognition; Software

Similar Documents

Robust named entity detection in videotext using character lattices

A robust probabilistic Braille recognition system

Off-line handwritten character recognition using Hidden Markov Model

A survey of methods and strategies in character segmentation

Domain-specific entity extraction from noisy, unstructured data using ontology-guided search

Optical character recognition errors and their effects on natural language processing

Maximization of mutual information for offline Thai handwriting recognition

The BBN Byblos Japanese OCR system

Effective technique for the recognition of offline Arabic handwritten words using hidden Markov models

Robust named entity detection from optical character recognition output