NDLI: Automatic article extraction in old newspapers digitized collections

Automatic article extraction in old newspapers digitized collections

Automated assignment of topics to OCRed historical texts

Correcting noisy OCR: context beats confusion

Construction of a text digitization system for Nom historical documents

Wittgenstein's Nachlass: WiTTFind and Wittgenstein advanced search tools (WAST)

Handwritten text recognition for historical documents in the transcriptorium project

Document representation refinement for precise region description

An approach to unsupervised historical text normalisation

User-driven correction of OCR errors: combining crowdsourcing and information retrieval technology

OCR of historical printings of Latin texts: problems, prospects, progress

Estimating and rating the quality of optically character recognised text

OCR correction of documents generated during Argentina's national reorganization process

Recognition of degraded ancient characters based on dense SIFT

Reflections on cultural heritage and digital humanities: modelling in practice and theory

PoCoTo - an open source system for efficient interactive postcorrection of OCRed historical texts

Digital editions and diplomatic diagrams

A bimodal crowdsourcing platform for demographic historical manuscripts

Computer-assisted transcription of a historical botanical specimen book: organization and process overview

Cataloging for a billion word library of Greek and Latin

H-DocPro: a document image processing platform for historical documents

Semantics in storytelling in Swedish fiction

Creation of custom recognition profiles for historical documents

An adaptive zoning technique for efficient word retrieval using dynamic time warping

Highly interactive and natural user interfaces: enabling visual analysis in historical lexicography

On OCR ground truths and OCR post-correction gold standards, tools and formats

Automated page layout simplification of Patrologia Graeca

PIVAJ: displaying and augmenting digitized newspapers on the web experimental feedback from the "Journal de Rouen" collection

An open-source OCR evaluation tool

Logical structure recognition for heterogeneous periodical collections

Data processing and lemmatization in digitized 19th-century Czech texts

Using ancestral layout models for document digitization

Automatic article extraction in old newspapers digitized collections

Content Provider	ACM Digital Library
Author	Palfray, Thomas; Tranouez, Pierrick; Hebert, David; Nicolas, Stephane; Paquet, Thierry
Abstract	We present a complete method for article segmentation in old newspapers that handles complex layout analysis of degraded documents. The workflow can process large volumes of documents and generates digital objects in METS/ALTO format to facilitate the indexing and browsing of information in digital libraries. The document image is analyzed in a two-stage scheme. In the first stage, pixels are labeled with a Conditional Random Field model to identify the areas of interest at a low logical level. In the second stage, this first logical representation of the document content is analyzed to obtain a higher-level logical representation that includes article segmentation and reading order. This top-level structural analysis relies on an article separation grid that is generated and applied recursively on the document image, which allows any type of Manhattan page layout to be analyzed, even complex structures with multiple columns and overlapping entities. The method, which benefits from both a local analysis using a probabilistic model trained with machine learning procedures and a more global structural analysis using recursive rules, is evaluated on a dataset of daily local press document images covering several time periods and different page layouts to demonstrate its effectiveness.
Starting Page	3
Ending Page	8
Page Count	6
File Format	PDF
ISBN	9781450325882
DOI	10.1145/2595188.2595195
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2014-05-19
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Conditional random field; Logical structure; Document image labelling; Articles extraction in newspapers; Information extraction from document images; Page layout analysis; Structural analysis
Content Type	Text
Resource Type	Article
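
The second stage described in the abstract, a separation grid applied recursively to a Manhattan layout, can be illustrated with a small sketch. This is not the authors' implementation: it is a simplified recursive X-Y cut over a grid of labels, standing in for the first-stage pixel labeling. The `SEP` label, the `split_blocks` function, and the toy page are all hypothetical names chosen for illustration.

```python
from typing import List, Optional, Tuple

# (top, left, bottom, right); bottom and right are exclusive.
Box = Tuple[int, int, int, int]

SEP = "#"  # hypothetical label for "separator" cells from a first labeling stage


def _runs(is_sep: List[bool]) -> List[Tuple[int, int]]:
    """Return [start, end) index ranges of consecutive non-separator entries."""
    runs, start = [], None
    for i, sep in enumerate(is_sep):
        if not sep and start is None:
            start = i
        elif sep and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(is_sep)))
    return runs


def split_blocks(labels: List[str], box: Optional[Box] = None) -> List[Box]:
    """Recursively cut a label grid along full-width or full-height separators.

    Blocks are returned top-to-bottom, then left-to-right, which is a naive
    stand-in for a reading order.
    """
    if box is None:
        box = (0, 0, len(labels), len(labels[0]))
    top, left, bottom, right = box

    # Try horizontal cuts: rows that are separators across the whole region.
    row_sep = [all(labels[r][c] == SEP for c in range(left, right))
               for r in range(top, bottom)]
    bands = _runs(row_sep)
    if bands and bands != [(0, bottom - top)]:
        return [b for s, e in bands
                for b in split_blocks(labels, (top + s, left, top + e, right))]

    # Try vertical cuts: columns that are separators across the whole region.
    col_sep = [all(labels[r][c] == SEP for r in range(top, bottom))
               for c in range(left, right)]
    cols = _runs(col_sep)
    if cols and cols != [(0, right - left)]:
        return [b for s, e in cols
                for b in split_blocks(labels, (top, left + s, bottom, left + e))]

    # No cut possible: a leaf block, or nothing if the region is all separator.
    return [box] if bands else []


page = [
    "AAAA#BBB",
    "AAAA#BBB",
    "########",
    "CCCCCCCC",
]
blocks = split_blocks(page)
```

On the toy page, the horizontal rule is cut first, and the upper band is then split at the column gutter, so the two upper blocks come before the lower one. A real system would need tolerance for broken or skewed separators, which this sketch ignores.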

Similar Documents

Logical segmentation for article extraction in digitized old newspapers

PIVAJ: displaying and augmenting digitized newspapers on the web experimental feedback from the "Journal de Rouen" collection

A New Hierarchical Handwritten Document Layout Extraction Based on Conditional Random Field Modeling

An integrated approach for automatic semantic structure extraction in document images (2004)

Unconstrained Handwritten Document Layout Extraction Using 2D Conditional Random Fields

Newspaper Headlines Extraction from Microfilm Images

Logical structure recognition for heterogeneous periodical collections

Model-guided segmentation and layout labelling of document images using a hierarchical conditional random field

Comic2CEBX: A system for automatic comic content adaptation

Automatic article extraction in old newspapers digitized collections