NDLI: Vries, “Uncovering the unarchived web”

Content Provider	CiteSeerX
Author	Samar, Thaer; Huurdeman, Hugo C.; Ben-David, Anat; Kamps, Jaap; Vries, Arjen De
Description	Many national and international heritage institutes realize the importance of archiving the web for future cultural heritage. Web archiving is currently performed either by harvesting a national domain, or by crawling a pre-defined list of websites selected by the archiving institution. In either method, crawling results in more information being harvested than just the websites intended for preservation. This extra information could be used to reconstruct impressions of pages that existed on the live web at the crawl date, but would otherwise have been lost forever. We present a method to create representations of what we will refer to as a web collection’s aura: the web documents that were not included in the archived collection, but are known to have existed because they are mentioned on pages that were included in the archived web collection. To create representations of these unarchived pages, we exploit the information about the unarchived URLs that can be derived from the crawls by combining crawl date distribution, anchor text and link structure. We illustrate empirically that the size of the aura can be substantial: in 2012, the Dutch Web archive contained 12.3M unique pages, while we uncover references to 11.9M additional (unarchived) pages.
File Format	PDF
Language	English
Publisher Institution	SIGIR. ACM
Access Restriction	Open
Subject Keyword	Archived Web Collection; Live Web; Archived Collection; Crawl Date Distribution; Web Collection Aura; Dutch Web Archive; Future Culture Heritage; Web Archiving; Anchor Text; Unique Page; Link Structure; Archiving Institution; Pre-defined List; National Domain; Unarchived Page; International Heritage Institute; Web Document; Unarchived URL; Unarchived Web; Crawl Date
Content Type	Text
Resource Type	Article
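
The description above outlines how unarchived pages can be represented from evidence found in the crawl: links on archived pages, their anchor text, and the dates on which those links were crawled. The following is a minimal sketch of that idea, not the authors' implementation. The record layout (source URL, target URL, anchor text, crawl date), the `AuraEntry` class, and the `build_aura` function are all hypothetical names introduced for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class AuraEntry:
    """Evidence gathered about one unarchived URL (hypothetical structure)."""
    anchor_texts: list = field(default_factory=list)
    source_pages: set = field(default_factory=set)
    crawl_dates: set = field(default_factory=set)


def build_aura(links, archived_urls):
    """Collect link evidence for targets that are missing from the archive.

    links: iterable of (source_url, target_url, anchor_text, crawl_date)
    archived_urls: set of URLs present in the archived collection
    """
    aura = defaultdict(AuraEntry)
    for source, target, anchor, crawl_date in links:
        if source not in archived_urls:
            continue  # only links found on archived pages count as evidence
        if target in archived_urls:
            continue  # the target itself was archived, so it is not in the aura
        entry = aura[target]
        if anchor:
            entry.anchor_texts.append(anchor)   # anchor text
        entry.source_pages.add(source)          # link structure
        entry.crawl_dates.add(crawl_date)       # crawl date distribution
    return dict(aura)


# Toy data (invented for this example, not taken from the Dutch Web archive).
archived = {"http://a.example/", "http://b.example/"}
links = [
    ("http://a.example/", "http://b.example/", "B home", "2012-03-01"),
    ("http://a.example/", "http://c.example/news", "latest news", "2012-03-01"),
    ("http://b.example/", "http://c.example/news", "news page", "2012-05-10"),
]

for url, entry in build_aura(links, archived).items():
    print(url, len(entry.source_pages), sorted(entry.crawl_dates), entry.anchor_texts)
```

In this toy run only `http://c.example/news` ends up in the aura: it is linked from two archived pages but is not archived itself. Its collected anchor texts and crawl dates form the kind of representation the description refers to.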
