NDLI: Accuracy of Optical Character Recognition Software Google Tesseract

Content Provider	Semantic Scholar
Author	Suitter, Joshua A.
Copyright Year	2015
Abstract	Tesseract is an open-source OCR (Optical Character Recognition) engine originally developed by HP between 1985 and 1995; it is now sponsored by Google (Google Tesseract). While Tesseract is known as one of the most accurate free OCR engines available today, it has numerous limitations that dramatically affect its performance, that is, its ability to correctly recognize characters in a scan or image. During my research I found that certain fonts are recognized more reliably than others, and that font size, spacing, and image quality all play a role in how Tesseract performs. In this project, I also look into Wolfram Mathematica’s built-in Tesseract code, TextRecognize. This project shows how different fonts, font sizes, image quality, and tilting of an image affect Tesseract’s recognition accuracy. In the first part of the project, I tested fonts and font sizes using Tesseract and calculated errors by eye, looking for words that came back incorrectly in the output text file. I used Mathematica’s version so that I could automate the error calculation and obtain a more accurate result. I found that both the original Tesseract program and Mathematica’s built-in version are very accurate, especially on higher-quality images.
Overview: For this project, I took the first few sections of the Constitution of the United States of America. I used Microsoft Word to vary the fonts and font sizes to see what effect they had on the document’s recognition by Tesseract. Once these files were created in PDF format, they were converted to images using free online software. Tesseract takes image files (e.g., .tiff, .jpg) and extracts the words from them as accurately as possible. Tesseract runs from the command prompt. I also used Mathematica’s TextRecognize function to automate my data collection and get a more accurate measure of how well the OCR works. With other built-in functions such as SmithWatermanSimilarity and SequenceAlignment, I can see how accurately the program is running as well as where the errors occur.
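The two error measures mentioned in the abstract, counting incorrectly recognized words and scoring similarity with a Smith-Waterman alignment, can be illustrated with a minimal Python sketch. This is not the author's Mathematica code. The function names `word_errors` and `smith_waterman_score` are hypothetical, and the scoring values (match +1, mismatch -1, gap -1) are assumptions that may differ from the defaults of Mathematica's built-in function.

```python
def word_errors(reference: str, recognized: str) -> int:
    """Word-level edit distance between the ground-truth text and the OCR output.

    Counts the minimum number of word substitutions, insertions, and
    deletions needed to turn the reference into the recognized text.
    """
    ref = reference.split()
    hyp = recognized.split()
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_word in enumerate(ref, 1):
        curr = [i]  # distance to an empty hypothesis
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from OCR output
                            curr[j - 1] + 1,      # extra word in OCR output
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def smith_waterman_score(a: str, b: str,
                         match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best local-alignment score between two strings (Smith-Waterman).

    The scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never go below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    truth = "We the People of the United States"
    ocr = "We the Pcople of the Unitcd States"  # made-up OCR output for illustration
    print(word_errors(truth, ocr))
    print(smith_waterman_score(truth, ocr))
```

Dividing `word_errors` by the number of words in the reference gives a word error rate, which makes results comparable across fonts and font sizes when the test passages differ in length.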
File Format	PDF HTM / HTML
Alternate Webpage(s)	https://digitalcommons.usm.maine.edu/cgi/viewcontent.cgi?article=1042&context=thinking_matters&httpsredir=1&referer=
Language	English
Access Restriction	Open
Content Type	Text
Resource Type	Article