NDLI: Google Newspaper Search – Image Processing and Analysis Pipeline

Content Provider	IEEE Xplore Digital Library
Author	Chaudhury, K.; Jain, A.; Thirthala, S.; Sahasranaman, V.; Saxena, S.; Mahalingam, S.
Copyright Year	2009
Abstract	The Google Newspaper Search program was launched on September 8, 2008. In this paper, we outline the technology underlying this large and complex project. We have created a production pipeline that takes newspaper microfilms as input and emits individual news articles as output. These articles are then indexed and added to the content base so that they turn up in response to Google searches. Thus, in response to the Google query “Hitler death”, we are able to show newspaper articles from the very day the event was reported, authentic and unbiased by the passage of time. Non-uniform illumination, significant noise, and tears and scratches in the microfilm images all pose special challenges for this project. The significant variation in layouts across newspapers and eras, and the variation in font sizes within a single page (which confuses the OCR engine), compound the difficulties. The project is ongoing; the initial launch included about 15 million news articles.
Starting Page	621
Ending Page	625
File Size	2164795
Page Count	5
File Format	PDF
ISBN	9781424445004
ISSN	15205363
DOI	10.1109/ICDAR.2009.272
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2009-07-26
Publisher Place	Spain
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Image processing; Image analysis; Pipelines; Image segmentation; White spaces; Image recognition; Lighting; Optical character recognition software; Indexing; Text analysis; page orientation detection; document image processing; newspaper page layout detection; image binarization; newspaper front page identification
Content Type	Text
Resource Type	Article

Similar Documents

Recognition Driven Page Orientation Detection

Document Image Binarization Based on NFCM

Hybrid Page Layout Analysis via Tab-Stop Detection

Flexible page segmentation using the background

Improved Hybrid Binarization based on Kmeans for Heterogeneous document processing

Document layout analysis for Indian newspapers using contour based symbiotic approach

2009 10th International Conference on Document Analysis and Recognition: Google Newspaper Search – Image Processing and Analysis Pipeline

Binarization and its evaluation for Urdu Nastalique document images

Google Newspaper Search – Image Processing and Analysis Pipeline