NDLI: A Web Service for Scholarly Big Data Information Extraction

Content Provider	IEEE Xplore Digital Library
Author	Williams, K.; Lichi Li; Khabsa, M.; Jian Wu; Shih, P.C.; Giles, C.L.
Copyright Year	2014
Description	Author affiliation: Inf. Sci. & Technol., Comput. Sci. & Eng, Pennsylvania State Univ., University Park, PA, USA (Williams, K.; Lichi Li; Khabsa, M.; Jian Wu; Shih, P.C.; Giles, C.L.)
Abstract	The automatic extraction of metadata and other information from scholarly documents is a common task in academic digital libraries, search engines, and document management systems to allow for the management and categorization of documents and for search to take place. A Web-accessible API can simplify this extraction by providing a single point of operation for extraction that can be incorporated into multiple document workflows without the need for each workflow to implement and support its own extraction functionality. In this paper, we describe CiteSeerExtractor, a RESTful API for scholarly information extraction that exploits the fact that there is duplication in scholarly big data and makes use of a near duplicate matching backend. The backend stores previously extracted metadata and avoids extracting metadata from a document if it has already been extracted before. We describe the design, implementation, and functionality of CiteSeerExtractor and show how the duplicate document matching results in a difference of 8.46% in the time required to extract header and citation information from approximately 3.5 million documents compared to a baseline.
Starting Page	105
Ending Page	112
File Size	362147
Page Count	8
File Format	PDF
e-ISBN	9781479950546
DOI	10.1109/ICWS.2014.27
Language	English
Publisher	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher Date	2014-06-27
Publisher Place	USA
Access Restriction	Subscribed
Rights Holder	Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subject Keyword	Data mining; Big data; Web servers; Information retrieval; Databases; Hamming distance; CiteSeerExtractor; Web service; information extraction; scholarly big data
Content Type	Text
Resource Type	Article

Similar Documents

Towards building a scholarly big data platform: Challenges, lessons and opportunities

A web service for scholarly big data information extraction.

Scholarly big data: information extraction and data mining

A Web Service for Scholarly Big Data Information Extraction

Web user log mining for Web retrieval

ADAM - A Database and Information Retrieval System for Big Multimedia Collections

Distributed index updating method for intranet information retrieval

Opinion Extraction & Classification of Reviews from Web Documents

Scholarly big data information extraction and integration in the $CiteSeer^{χ}$ digital library

A Web Service for Scholarly Big Data Information Extraction