NDLI: Retrieving poorly degraded OCR documents

Content Provider	Springer Nature Link
Author	Fataicha, Y.; Cheriet, M.; Nie, J. Y.; Suen, C. Y.
Copyright Year	2005
Abstract	A significant portion of currently available documents exist in the form of images, for instance, as scanned documents. Electronic documents produced by scanning and OCR software contain recognition errors. This paper uses an automatic approach to examine the selection and the effectiveness of searching techniques for possible erroneous terms for query expansion. The proposed method consists of two basic steps. In the first step, confused characters in erroneous words are located and editing operations are applied to create a collection of erroneous error-grams in the basic unit of the model. The second step uses query terms and error-grams to generate additional query terms, identify appropriate matching terms, and determine the degree of relevance of retrieved document images to the user's query, based on a vector space IR model. The proposed approach has been trained on 979 document images to construct about 2,822 error-grams and tested on 100 scanned Web pages, 200 advertisements and manuals, and 700 degraded images. The performance of our method is evaluated experimentally by determining retrieval effectiveness with respect to recall and precision. The results obtained show its effectiveness and indicate an improvement over standard methods such as vectorial systems without expanded query and 3-gram overlapping.
Starting Page	1
File Format	PDF
ISSN	1433-2833
Journal	International Journal of Document Analysis and Recognition (IJDAR)
Volume Number	8
Issue Number	1
e-ISSN	1433-2825
Language	English
Publisher	Springer-Verlag
Publisher Date	2005-10-13
Publisher Place	Berlin, Heidelberg
Access Restriction	One Nation One Subscription (ONOS)
Content Type	Text
Resource Type	Article
Subject	Computer Science Applications; Computer Vision and Pattern Recognition; Software
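
The first step summarized in the abstract (applying editing operations at confusable character positions to generate extra query terms) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `ERROR_GRAMS` pairs and the function names `expand_term` and `expand_query` are hypothetical, and the paper's roughly 2,822 trained error-grams and its vector space ranking are not reproduced here.

```python
# Hypothetical error-grams: (correct substring, OCR-confused substring).
# These pairs are illustrative only; they are NOT the paper's trained set.
ERROR_GRAMS = [("rn", "m"), ("l", "1"), ("o", "0")]


def expand_term(term, error_grams=ERROR_GRAMS):
    """Return the term plus every variant produced by applying
    one error-gram substitution at one position."""
    variants = {term}
    for correct, confused in error_grams:
        start = term.find(correct)
        while start != -1:
            # Replace this single occurrence with its confused form.
            variants.add(term[:start] + confused + term[start + len(correct):])
            start = term.find(correct, start + 1)
    return variants


def expand_query(terms, error_grams=ERROR_GRAMS):
    """Map each query term to its set of candidate OCR variants."""
    return {t: expand_term(t, error_grams) for t in terms}
```

Under these assumed pairs, a query term such as "modern" would also be searched as "modem" and "m0dern", so a document whose OCR text contains the misrecognized form can still be matched.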


Similar Documents

Adaptive binarization of severely degraded and non-uniformly illuminated documents

Analysis and recognition of highly degraded printed characters

Using topic models for OCR correction

A survey on Arabic character segmentation

Learning on the fly: a font-free approach toward multilingual OCR

A blackboard approach towards integrated Farsi OCR system

User-configurable OCR enhancement for online natural history archives

Quantitative analysis of mathematical documents
