NDLI: Optical character recognition errors and their effects on natural language processing

International Journal of Document Analysis and Recognition (IJDAR) : Volume 12, Issue 3, September 2009

Special issue on noisy text analytics

Optical character recognition errors and their effects on natural language processing

Using topic models for OCR correction

Successfully detecting and correcting false friends using channel profiles

Language independent unsupervised learning of short message service dialect

An effective coherence measure to determine topical consistency in user-generated content

Opinion mining from noisy text data

Optical character recognition errors and their effects on natural language processing

Content Provider	Springer Nature Link
Author	Lopresti, Daniel
Copyright Year	2009
Abstract	Errors are unavoidable in advanced computer vision applications such as optical character recognition, and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. In this paper, we apply a new paradigm we have proposed for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Our methodology formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach. Errors and their cascading effects are isolated and analyzed as they travel through the pipeline. We present experimental results based on a large collection of scanned pages to study the varying impact depending on the nature of the error and the character(s) involved. This dataset has also been made available online to encourage future investigations.
Starting Page	141
Ending Page	151
Page Count	11
File Format	PDF
ISSN	1433-2833
Journal	International Journal of Document Analysis and Recognition (IJDAR)
Volume Number	12
Issue Number	3
e-ISSN	1433-2825
Language	English
Publisher	Springer-Verlag
Publisher Date	2009-09-25
Publisher Place	Berlin, Heidelberg
Access Restriction	One Nation One Subscription (ONOS)
Subject Keyword	Performance evaluation; Optical character recognition; Sentence boundary detection; Tokenization; Part-of-speech tagging; Pattern Recognition; Image Processing and Computer Vision
Content Type	Text
Resource Type	Article
Subject	Computer Science Applications; Computer Vision and Pattern Recognition; Software
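
The abstract describes error classification as an optimization problem solved with hierarchical dynamic programming. This record gives no further detail, so the following is only a minimal sketch of the underlying idea: a plain, single-level, unit-cost edit-distance alignment between ground-truth tokens and OCR-derived tokens. It is not the paper's hierarchical method, and the function name `align` and the operation labels are hypothetical choices made for illustration.

```python
from typing import List, Tuple

Op = Tuple[str, str, str]  # (operation, truth_item, ocr_item)


def align(truth: List[str], ocr: List[str]) -> List[Op]:
    """Align two sequences with unit-cost edit distance (illustrative only).

    Each returned triple is labeled "match", "substitute", "delete"
    (item present in the ground truth but missing from the OCR output),
    or "insert" (spurious item in the OCR output).
    """
    n, m = len(truth), len(ocr)

    # cost[i][j] = minimum number of edits turning truth[:i] into ocr[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if truth[i - 1] == ocr[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub,  # match or substitution
                cost[i - 1][j] + 1,        # deletion
                cost[i][j - 1] + 1,        # insertion
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    ops: List[Op] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if truth[i - 1] == ocr[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                label = "match" if sub == 0 else "substitute"
                ops.append((label, truth[i - 1], ocr[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("delete", truth[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", ocr[j - 1]))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    # Made-up example: one misrecognized word and one lost sentence-final period.
    for op in align(["The", "cat", "sat", "."], ["The", "eat", "sat"]):
        print(op)
```

The same function can be applied at the character level by passing `list(truth_string)` and `list(ocr_string)`. Running an alignment of this kind at each pipeline stage (sentences, tokens, tags) is one way errors could be located and compared across stages, under the assumptions stated above.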

Similar Documents

Optical character recognition errors and their effects on natural language processing

Performance evaluation for text processing of noisy inputs

Impact of imperfect OCR on part-of-speech tagging

Sentence boundary detection in conversational speech transcripts using noisily labeled examples

Morphological tagging approach in document analysis of invoices

Bidirectional HMM-based Arabic POS tagging

Robust named entity detection from optical character recognition output

Toward enhanced Arabic speech recognition using part of speech tagging

Integrating natural language processing with image document analysis: what we learned from two real-world applications