NDLI: Do not crawl in the DUST: Different URLs with similar text

Content Provider	ACM Digital Library
Author	Keidar, Idit; Schonfeld, Uri; Bar-Yossef, Ziv
Copyright Year	2009
Abstract	We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in Web sites, as Web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or Web server logs, without examining page contents. Verifying these rules via sampling requires fetching few actual Web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.
Starting Page	1
Ending Page	31
Page Count	31
File Format	PDF
ISSN	1559-1131
e-ISSN	1559-114X
DOI	10.1145/1462148.1462151
Volume Number	3
Issue Number	1
Journal	ACM Transactions on the Web (TWEB)
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2009-01-17
Publisher Place	New York
Access Restriction	One Nation One Subscription (ONOS)
Subject Keyword	Search engines; URL normalization; Antialiasing; Crawling; Duplicate detection
Content Type	Text
Resource Type	Article
Subject	Computer Networks and Communications
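
The abstract describes rules that transform a given URL into others likely to have similar content. As a rough illustration of how a crawler might use such rules once they are known, the following sketch applies them to reduce a URL to one representative form. It is not the DustBuster algorithm itself: the substring-replacement rule form, the `Rule`, `apply_rule`, and `canonicalize` names, the example rules, and the `example.com` URLs are all assumptions made for illustration and do not come from this record.

```python
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    """Hypothetical rule form: replace the substring `old` with `new`."""
    old: str
    new: str


def apply_rule(url: str, rule: Rule) -> Optional[str]:
    """Rewrite the first occurrence of rule.old, or return None if absent."""
    if rule.old not in url:
        return None
    return url.replace(rule.old, rule.new, 1)


def canonicalize(url: str, rules: List[Rule], max_steps: int = 10) -> str:
    """Apply rules repeatedly until none changes the URL.

    max_steps bounds the loop in case a rule set never settles.
    """
    for _ in range(max_steps):
        for rule in rules:
            rewritten = apply_rule(url, rule)
            if rewritten is not None and rewritten != url:
                url = rewritten
                break  # restart from the first rule on the new URL
        else:
            return url  # no rule changed the URL
    return url


# Illustrative rules only; real rules would be mined from logs.
EXAMPLE_RULES = [
    Rule("/index.html", "/"),
    Rule("http://www.", "http://"),
]

if __name__ == "__main__":
    print(canonicalize("http://www.example.com/news/index.html", EXAMPLE_RULES))
```

With rules of this kind, a crawler could skip fetching a URL whose canonical form it has already seen. Whether a rule is actually valid for a given site would still need checking, for example by sampling a few pages as the abstract suggests.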
