NDLI: Sampling Search-Engine Results

Content Provider	Springer Nature Link
Author	Anagnostopoulos, Aris; Broder, Andrei Z.; Carmel, David
Copyright Year	2006
Abstract	We consider the problem of efficiently sampling Web search engine query results. In turn, using a small random sample instead of the full set of results leads to efficient approximate algorithms for several applications, such as: Determining the set of categories in a given taxonomy spanned by the search results; Finding the range of metadata values associated with the result set in order to enable “multi-faceted search”; Estimating the size of the result set; Data mining associations to the query terms. We present and analyze efficient algorithms for obtaining uniform random samples applicable to any search engine that is based on posting lists and document-at-a-time evaluation. (To our knowledge, all popular Web search engines, for example, Google, Yahoo Search, MSN Search, Ask, belong to this class.) Furthermore, our algorithm can be modified to follow the modern object-oriented approach whereby posting lists are viewed as streams equipped with a next method, and the next method for Boolean and other complex queries is built from the next method for primitive terms. In our case we show how to construct a basic sample-next(p) method that samples term posting lists with probability p, and show how to construct sample-next(p) methods for Boolean operators (AND, OR, WAND) from primitive methods. Finally, we test the efficiency and quality of our approach on both synthetic and real-world data.
Starting Page	397
Ending Page	429
Page Count	33
File Format	PDF
ISSN	1386145X
Journal	World Wide Web
Volume Number	9
Issue Number	4
e-ISSN	15731413
Language	English
Publisher	Kluwer Academic Publishers
Publisher Date	2007-01-16
Publisher Place	Boston
Access Restriction	One Nation One Subscription (ONOS)
Subject Keyword	Operating Systems; Database Management; Information Systems Applications (incl. Internet)
Content Type	Text
Resource Type	Article
Subject	Computer Networks and Communications; Software; Hardware and Architecture
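
The sample-next(p) idea described in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: the class `PostingList`, the function `sample_and`, and the geometric-skip approach are assumptions made for illustration. It treats a posting list as a sorted stream of document ids, keeps each posting independently with probability p, and builds a sampled AND by sampling the shortest list and checking membership in the others with an ordinary next call.

```python
import math
import random
from bisect import bisect_left


class PostingList:
    """Sorted doc-id stream (hypothetical helper, not from the paper)."""

    def __init__(self, doc_ids, rng=None):
        self.doc_ids = sorted(set(doc_ids))
        self.pos = 0  # cursor; only ever moves forward
        self.rng = rng or random.Random()

    def next(self, target):
        """Move to the first doc id >= target; return it, or None at the end."""
        self.pos = bisect_left(self.doc_ids, target, self.pos)
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

    def sample_next(self, p):
        """Return the next sampled doc id, or None when the list is exhausted.

        Each remaining posting is kept independently with probability p.
        Rather than flipping a coin per posting, the number of postings to
        skip is drawn from a geometric distribution (an assumed technique).
        """
        if not 0.0 < p <= 1.0:
            raise ValueError("p must be in (0, 1]")
        if p < 1.0:
            u = 1.0 - self.rng.random()  # uniform in (0, 1]
            self.pos += int(math.log(u) / math.log(1.0 - p))
        if self.pos >= len(self.doc_ids):
            return None
        doc = self.doc_ids[self.pos]
        self.pos += 1  # consume the sampled posting
        return doc


def sample_and(lists, p):
    """Yield docs of the intersection, each kept independently with prob. p."""
    # Sample the shortest list; the other lists only confirm membership.
    driver, *others = sorted(lists, key=lambda pl: len(pl.doc_ids))
    while (doc := driver.sample_next(p)) is not None:
        if all(other.next(doc) == doc for other in others):
            yield doc


if __name__ == "__main__":
    evens = PostingList(range(0, 1000, 2), random.Random(1))
    threes = PostingList(range(0, 1000, 3), random.Random(2))
    print(list(sample_and([evens, threes], 0.1)))  # a sample of multiples of 6
```

Because every document in the intersection appears in the driver list and is drawn from it with probability p, the output is a uniform sample of the AND result under these assumptions. OR and WAND need more care and are not covered by this sketch.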

Similar Documents

Query-Free News Search

Wrapper verification

AMORE: A World Wide Web image retrieval engine

Three-Level Caching for Efficient Query Processing in Large Web Search Engines

Multi-channel Adaptive Information Systems

Web++ architecture, design and performance

Clustering Web video search results based on integration of multiple features

Processing keyword search on XML: a survey

Summary of WWW characterizations

Sampling Search-Engine Results