NDLI: Semantic annotation of tabular data in PDF documents via crowdsourcing

Content Provider	Semantic Scholar
Author	Islam, Saiful; Auer, Sören
Copyright Year	2015
Abstract	I, A Q M Saiful Islam, declare that this thesis titled, 'Semantic annotation of tabular data in PDF documents via crowdsourcing' and the work presented in it are my own. I confirm that: This work was done wholly or mainly while in candidature for a research degree at this University. Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated. Where I have consulted the published work of others, this is always clearly attributed. Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work. I have acknowledged all main sources of help. Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.
"Computers are incredibly fast, accurate, and stupid. Human beings are incredibly slow, inaccurate, and brilliant. Together they are powerful beyond imagination."
A lot of valuable tabular information is available only in PDF documents. The semantics needed to query this information in a uniform way and to interlink it with related information are missing. This thesis describes the current problems of semantically annotating tabular data inside PDF documents and provides a solution for annotating and querying the data in those tables. As part of the solution, a prototype has been developed. The architecture of the solution focuses on publishing annotated data in a crowdsourced manner. Earlier research in this field does not sufficiently address the annotation of tabular data in PDF documents. The prototype was built as a multi-user web application. To extract data from PDFs, this thesis proposes an algorithm focusing on a relatively simple structure of table cells.
The proposed solution annotates the tabular data using available ontologies and resources from DBpedia, the linked open dataset derived from Wikipedia, and publishes the annotated table in a uniform structure using the RDF Data Cube Vocabulary. The annotations are stored in a triple store and can be queried via a SPARQL endpoint. Apart from annotating tabular data, the prototype provides functionality to export the selected tabular data as CSV for further custom use. To obtain feedback about …
File Format	PDF HTM / HTML
Alternate Webpage(s)	http://eis-bonn.github.io/Theses/2015/AQM_Saiful_Islam/thesis.pdf
Language	English
Access Restriction	Open
Content Type	Text
Resource Type	Article
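
The abstract describes publishing an annotated table with the RDF Data Cube Vocabulary and exporting it as CSV. The following is a minimal sketch of that idea, not the thesis's actual implementation: the `http://example.org/` namespace, the function names, the column properties, and the sample row are all hypothetical and chosen only for illustration. Each table row becomes one `qb:Observation` written as N-Triples, and cells that already hold a URI (for example a DBpedia resource) are emitted as links rather than literals.

```python
import csv
import io

QB = "http://purl.org/linked-data/cube#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace, not taken from the thesis


def table_to_ntriples(header, rows, column_properties, dataset_id="dataset1"):
    """Serialize each table row as one qb:Observation in N-Triples."""
    dataset = f"<{EX}{dataset_id}>"
    lines = [f"{dataset} <{RDF_TYPE}> <{QB}DataSet> ."]
    for i, row in enumerate(rows, start=1):
        obs = f"<{EX}{dataset_id}/obs{i}>"
        lines.append(f"{obs} <{RDF_TYPE}> <{QB}Observation> .")
        lines.append(f"{obs} <{QB}dataSet> {dataset} .")
        for column, cell in zip(header, row):
            prop = column_properties[column]
            if cell.startswith(("http://", "https://")):
                # Cell was annotated with a resource URI, so link to it.
                obj = f"<{cell}>"
            else:
                # Plain cell value: emit an escaped string literal.
                escaped = cell.replace("\\", "\\\\").replace('"', '\\"')
                obj = f'"{escaped}"'
            lines.append(f"{obs} <{prop}> {obj} .")
    return lines


def table_to_csv(header, rows):
    """Export the same table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative data only; the value 42 is a placeholder, not a real figure.
header = ["country", "value"]
rows = [["http://dbpedia.org/resource/Germany", "42"]]
properties = {"country": EX + "country", "value": EX + "value"}

for triple in table_to_ntriples(header, rows, properties):
    print(triple)
print(table_to_csv(header, rows))
```

Triples of this shape could be loaded into a triple store and queried over SPARQL; a real data cube would also declare dimensions and measures in a `qb:DataStructureDefinition`, which this sketch omits for brevity.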
