NDLI: Web crawler middleware for search engine digital libraries: a case study for citeseerX

Search beyond the web: data from social networks and native apps

Modeling topic trends on the social web using temporal signatures

Managing analysis context

Web crawler middleware for search engine digital libraries: a case study for citeseerX

XPath satisfiability with downward and sibling axes is tractable under most of real-world DTDs

Using social tags to infer context in hybrid music recommendation

TitleFinder: extracting the headline of news web pages based on cosine similarity and overlap scoring similarity

A multi-layer data representation of trajectories in social networks based on points of interest

SNOPS: a smart environment for cultural heritage applications

M3D: a tool for the model driven development of web applications

A distributed index for efficient parallel top-k keyword search on massive graphs

Web crawler middleware for search engine digital libraries: a case study for citeseerX

Content Provider	ACM Digital Library
Author	Wu, Jian; San Pedro Wandelmer, Jose; Carman, Stephen; Lu, Xin; Giles, C. Lee; Teregowda, Pradeep; Khabsa, Madian; Jordan, Douglas; Mitra, Prasenjit
Abstract	Middleware is an important part of many search engine web crawling processes. We developed a middleware, the Crawl Document Importer (CDI), which selectively imports documents and the associated metadata to the digital library CiteSeerX crawl repository and database. This middleware is designed to be extensible as it provides a universal interface to the crawl database. It is designed to support input from multiple open source crawlers and archival formats, e.g., ARC, WARC. It can also import files downloaded via FTP. To use this middleware for another crawler, the user only needs to write a new log parser which returns a resource object with the standard metadata attributes and tells the middleware how to access downloaded files. When importing documents, users can specify document MIME types and obtain text extracted from PDF/PostScript documents. The middleware can adaptively identify academic research papers based on document context features. We developed a web user interface where the user can submit import jobs. The middleware package can also work on supplemental jobs related to the crawl database and repository. Though designed for the CiteSeerX search engine, we feel this design would be appropriate for many search engine web crawling systems.
Starting Page	57
Ending Page	64
Page Count	8
File Format	PDF
ISBN	9781450317207
DOI	10.1145/2389936.2389949
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2012-11-02
Publisher Place	New York
Access Restriction	Subscribed
Subject Keyword	Web crawling; Search engine; Information retrieval; Middleware; Ingestion
Content Type	Text
Resource Type	Article
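
The abstract describes a per-crawler log parser that returns a resource object with standard metadata attributes and tells the middleware where the downloaded file is, plus selection by MIME type at import time. The following is only a rough sketch of that idea, not the CDI implementation. The class CrawlResource, the functions parse_log_line and select_resources, the field names, and the tab-separated log layout are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Set


@dataclass(frozen=True)
class CrawlResource:
    """Hypothetical resource record; the field names are illustrative only."""
    url: str
    mime_type: str
    crawl_time: str
    local_path: str  # where the downloaded file can be read from


def parse_log_line(line: str) -> Optional[CrawlResource]:
    """Parse one line of an assumed tab-separated crawl log.

    Assumed column order: timestamp, URL, MIME type, path of the saved file.
    Returns None for lines that do not have exactly four fields.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return None
    crawl_time, url, mime_type, local_path = fields
    return CrawlResource(url, mime_type.lower(), crawl_time, local_path)


def select_resources(lines: Iterable[str], wanted: Set[str]) -> Iterator[CrawlResource]:
    """Yield only well-formed resources whose MIME type is in `wanted`."""
    for line in lines:
        resource = parse_log_line(line)
        if resource is not None and resource.mime_type in wanted:
            yield resource


# Made-up log lines in the assumed format.
SAMPLE_LOG = [
    "2012-11-02T10:00:00Z\thttp://example.org/paper.pdf\tapplication/pdf\t/crawl/0001.pdf",
    "2012-11-02T10:00:01Z\thttp://example.org/index.html\ttext/html\t/crawl/0002.html",
    "malformed line",
]

selected = list(select_resources(SAMPLE_LOG, {"application/pdf", "application/postscript"}))
for resource in selected:
    print(resource.url, resource.local_path)
```

Under these assumptions, supporting a different crawler would mean swapping in another parse_log_line for that crawler's log format while the selection step stays unchanged.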
