NDLI: Efficient and flexible text extraction from document pages

Content Provider	Springer Nature Link
Author	Parodi, Pietro; Fontana, Roberto
Copyright Year	1999
Abstract	This paper describes a novel method for extracting text from document pages of mixed content. The method works by detecting pieces of text lines in small overlapping columns of width $w'$, shifted with respect to each other by $\epsilon < w'$ image elements (good default values are $\epsilon = 1\%$ of the image width and $w' = 2\epsilon$), and by merging these pieces in a bottom-up fashion to form complete text lines and blocks of text lines. The algorithm requires about 1.3 s for a 300 dpi image on a PC with a 300 MHz Pentium II CPU and an Intel 440LX motherboard. The algorithm is largely independent of the layout of the document, the shape of the text regions, and the font size and style. The main assumptions are that the background is uniform and that the text sits approximately horizontally. For a skew of up to about 10 degrees, no skew correction mechanism is necessary. The algorithm has been tested on the UW English Document Database I of the University of Washington, and its performance has been evaluated by a suitable measure of segmentation accuracy. A detailed analysis of the segmentation accuracy achieved by the algorithm as a function of noise and skew has also been carried out.
Starting Page	67
Ending Page	79
Page Count	13
File Format	PDF
ISSN	1433-2833
Journal	International Journal of Document Analysis and Recognition (IJDAR)
Volume Number	2
Issue Number	2-3
Language	English
Publisher	Springer-Verlag
Publisher Date	1999-12-01
Publisher Place	Berlin Heidelberg
Access Restriction	One Nation One Subscription (ONOS)
Content Type	Text
Resource Type	Article
Subject	Computer Science Applications; Computer Vision and Pattern Recognition; Software
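
The column-and-merge idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how pieces of text lines are detected inside a column or what merging criterion is used, so the row-ink test in `line_pieces` and the greedy vertical-overlap rule in `merge_pieces` are assumptions made for illustration, as are all function and parameter names. Only the defaults $\epsilon = 1\%$ of the image width and $w' = 2\epsilon$ come from the abstract. The image is assumed to be a binarized page given as a list of rows of 0/1 values (1 = ink).

```python
def strip_starts(image_width, eps_frac=0.01):
    """Return (eps, w, starts) for overlapping column strips.

    eps is the shift between strips (1% of the width by default, as in the
    abstract) and w = 2 * eps is the strip width.
    """
    eps = max(1, round(image_width * eps_frac))
    w = 2 * eps
    starts = list(range(0, max(1, image_width - w + 1), eps))
    return eps, w, starts


def line_pieces(image, x0, w, min_ink=1):
    """Find vertical runs of rows that contain ink inside columns [x0, x0 + w).

    Assumption: a row belongs to a text-line piece if it has at least
    min_ink ink pixels within the strip.
    """
    pieces = []
    run_start = None
    for y, row in enumerate(image):
        has_ink = sum(row[x0:x0 + w]) >= min_ink
        if has_ink and run_start is None:
            run_start = y
        elif not has_ink and run_start is not None:
            pieces.append((run_start, y - 1))
            run_start = None
    if run_start is not None:
        pieces.append((run_start, len(image) - 1))
    return pieces


def merge_pieces(image, eps_frac=0.01):
    """Merge pieces from successive strips, left to right, into line boxes.

    Assumption: a piece joins an existing line if the line already reaches
    the piece's strip horizontally and the two overlap vertically.
    Returns boxes as [x_left, x_right, y_top, y_bottom], with x values at
    strip resolution (a box can extend slightly past the last ink pixel).
    """
    width = len(image[0])
    _, w, starts = strip_starts(width, eps_frac)
    lines = []
    for x0 in starts:
        for top, bottom in line_pieces(image, x0, w):
            for line in lines:
                reaches_strip = line[1] >= x0
                overlaps = top <= line[3] and bottom >= line[2]
                if reaches_strip and overlaps:
                    line[1] = max(line[1], x0 + w - 1)
                    line[2] = min(line[2], top)
                    line[3] = max(line[3], bottom)
                    break
            else:
                lines.append([x0, x0 + w - 1, top, bottom])
    return lines
```

Because neighbouring strips overlap by half their width, a piece in one strip always shares columns with the piece of the same line in the next strip, which is what lets this simple overlap rule chain pieces together. The sketch stops at text lines; grouping lines into blocks, which the abstract also mentions, is not attempted here.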

Similar Documents

Classification of document pages using structure-based features

Automatic name extraction from degraded document images

Towards historical document indexing: extraction of drop cap letters

Biblio: automatic meta-data extraction

Efficient multiscale Sauvola’s binarization

Automated analysis of images in documents for intelligent document search

Problem-adaptable document analysis and understanding for high-volume applications

Character pattern extraction from documents with complex backgrounds

A categorization system for handwritten documents