NDLI: Document cleanup using page frame detection

Content Provider	Springer Nature Link
Author	Shafait, Faisal; Beusekom, Joost; Keysers, Daniel; Breuel, Thomas M.
Copyright Year	2008
Abstract	When a page of a book is scanned or photocopied, textual noise (extraneous symbols from the neighboring page) and/or non-textual noise (black borders, speckles, ...) appear along the border of the document. Existing document analysis methods can handle non-textual noise reasonably well, whereas textual noise still presents a major issue for document analysis systems. Textual noise may result in undesired text in optical character recognition (OCR) output that needs to be removed afterwards. Existing document cleanup methods try to explicitly detect and remove marginal noise. This paper presents a new perspective for document image cleanup by detecting the page frame of the document. The goal of page frame detection is to find the actual page contents area, ignoring marginal noise along the page border. We use a geometric matching algorithm to find the optimal page frame of structured documents (journal articles, books, magazines) by exploiting their text alignment property. We evaluate the algorithm on the UW-III database. The results show that the error rates are below 4% for each of the performance measures used. Further tests were run on a dataset of magazine pages and on a set of camera-captured document images. To demonstrate the benefits of using page frame detection in practical applications, we choose OCR and layout-based document image retrieval as sample applications. Experiments using a commercial OCR system show that by removing characters outside the computed page frame, the OCR error rate is reduced from 4.3% to 1.7% on the UW-III dataset. The use of page frame detection in a layout-based document image retrieval application decreases the retrieval error rates by 30%.
Starting Page	81
Ending Page	96
Page Count	16
File Format	PDF
ISSN	14332833
Journal	International Journal of Document Analysis and Recognition (IJDAR)
Volume Number	11
Issue Number	2
e-ISSN	14332825
Language	English
Publisher	Springer-Verlag
Publisher Date	2008-09-30
Publisher Place	Berlin, Heidelberg
Access Restriction	One Nation One Subscription (ONOS)
Subject Keyword	Document analysis; Marginal noise removal; Document pre-processing; Pattern Recognition; Image Processing and Computer Vision
Content Type	Text
Resource Type	Article
Subject	Computer Science Applications; Computer Vision and Pattern Recognition; Software

Similar Documents

Clutter noise removal in binary document images

Multimodal page classification in administrative document image streams

Document image binarization using background estimation and stroke edges

Combined orientation and skew detection using geometric text-line modeling

Historical document enhancement using LUT classification

Page segmentation using minimum homogeneity algorithm and adaptive mathematical morphology

Conservative preprocessing of document images

Sparsity-based edge noise removal from bilevel graphical document images

Junction-based table detection in camera-captured document images

Document cleanup using page frame detection