NDLI: Searching by corpus with fingerprints

Clydesdale: structured data processing on MapReduce

Differentially private search log sanitization with optimal output utility

RecStore: an extensible and adaptive framework for online recommender queries inside the database engine

On optimizing relational self-joins

User oriented trajectory search for trip recommendation

An adaptive algorithm for online time series segmentation with error bound guarantee

Transactional stream processing

Efficient approximation of the maximal preference scores by lightweight cubic views

Optimizing index deployment order for evolving OLAP

Subscription indexes for web syndication systems

Searching by corpus with fingerprints

CRSI: a compact randomized similarity index for set-valued features

Adaptive MapReduce using situation-aware mappers

Finding top-k similar graphs in graph databases

SIMP: accurate and efficient near neighbor search in high dimensional spaces

Introducing MapLan to map banking survey data into a time series database

Towards an ecosystem of structured data on the web

An optimization framework for map-reduce queries

Integrating historical noisy answers for improving data utility under differential privacy

Supporting top-K item exchange recommendations in large online communities

Transitive closure and recursive Datalog implemented on clusters

Top-k spatial keyword queries on road networks

Dynamic diversification of continuous data

Skyline-sensitive joins with LR-pruning

Distance histogram computation based on spatiotemporal uniformity in scientific data

Heuristics-based query optimisation for SPARQL

Aggregate queries on probabilistic record linkages

VAST-Tree: a vector-advanced and compressed structure for massive data tree traversal

"Cut me some slack": latency-aware live migration for databases

I/O cost minimization: reachability queries processing over massive graphs

Effectively indexing the multi-dimensional uncertain objects for range searching

Extending a general-purpose streaming system for XML

Inside "Big Data management": ogres, onions, or parfaits?

Efficient parallel kNN joins for large data in MapReduce

Mining probabilistically frequent sequential patterns in uncertain databases

Limiting link disclosure in social network analysis through subgraph-wise perturbation

Shortest-path queries for complex networks: exploiting low tree-width outside the core

Relevance search in heterogeneous networks

Towards "intelligent compression" in streams: a biased reservoir sampling based Bloom filter approach

Top-k interesting phrase mining in ad-hoc collections using sequence pattern indexing

A generic data model and query language for spatiotemporal OLAP cube analysis

See what's enBlogue: real-time emergent topic identification in social media

Efficient distributed query processing for autonomous RDF databases

Repair-oriented relational schemas for multidimensional databases

Peak power plays in database engines

Finding maximal k-edge-connected subgraphs from a large graph

SFA: a symbolic fourier approximation and index for similarity search in high dimensional datasets

Mining search behavior and user-generated content: presentation at the industrial session - EDBT/ICDT 2012

Similarity in (spatial, temporal and) spatio-temporal datasets

Data management with SAPs in-memory computing engine

Distributed skyline processing: a trend in database research still going strong

Tailoring entity resolution for matching product offers

Indexing and mining topological patterns for drug discovery

Towards principled design support for scalable OLTP workloads

Adaptive indexing in modern database kernels

A probabilistic convex hull query tool

ColisTrack: testbed for a pervasive environment management system

The mainframe strikes back: elastic multi-tenancy using main memory database systems on a many-core server

SOS (save our systems): a uniform programming interface for non-relational systems

PeerTrack: a platform for tracking and tracing objects in large-scale traceability networks

Fault-tolerant complex event processing using customizable state machine-based operators

Knowledge-based processing of complex stock market events

Private-HERMES: a benchmark framework for privacy-preserving mobility data querying and mining methods

Evaluating hybrid queries through service coordination in HYPATIA

A desktop interface over distributed document repositories

SPARQL-RW: transparent query access over mapped RDF data sources

Intention insider: discovering people's intentions in the social channel

QUASAR: querying annotation, structure, and reasoning

Realtime healthcare services via nested complex event processing technology

Distributed data management for large-scale wireless sensor networks simulations

Knowing: a generic data analysis application

Searching by corpus with fingerprints

Content Provider	ACM Digital Library
Author	Aggarwal, Charu C.; Yu, Philip S.; Lin, Wangqun
Abstract	The growing size of text repositories on the World Wide Web has created a need for efficient indexing and retrieval methods for text collections. Almost all text retrieval and indexing methods have been designed for simple keyword search, in which a few keywords are specified and text is retrieved on the basis of matches to these keywords. However, many applications need greater specificity during the search, such as the use of phrases, sentences, text fragments, or even whole documents as the query. An even more general case is one in which a collection of documents is available as the query. In such cases, it is desirable to return the sets of all pairwise similar documents. Such queries are referred to as corpus-to-corpus queries, and they are computationally intensive because of the very large number of document pairs that need to be compared. They cannot be processed efficiently by the available indexing and searching methods, most of which can index text based on only a small number of keywords or representative phrases. In this paper, we design a compressed fingerprint index which supports the following more general queries: (a) document-to-corpus search, which the method processes very efficiently because of its bit-wise operations; (b) corpus-to-corpus queries, in which it is desirable to determine the most similar pairs of documents in two collections. We design an efficient search technique which is able to reduce the search time for large collections. The key technique used to enable this is an efficient fingerprint representation, which can be used effectively for the search process. To the best of our knowledge, this is the first work on corpus-based search in massive document collections.
Starting Page	348
Ending Page	359
Page Count	12
File Format	PDF
ISBN	9781450307901
DOI	10.1145/2247596.2247638
Language	English
Publisher	Association for Computing Machinery (ACM)
Publisher Date	2012-03-27
Publisher Place	New York
Access Restriction	Subscribed
Content Type	Text
Resource Type	Article
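
The abstract describes fingerprints that are compared with bit-wise operations for document-to-corpus and corpus-to-corpus search. The sketch below illustrates that general idea only; it is not the paper's compressed fingerprint index. The 256-bit width, the word-level hashing with BLAKE2b, the Jaccard-style score, and the function names (`fingerprint`, `similarity`, `document_to_corpus`, `corpus_to_corpus`) are all assumptions made for illustration, and the corpus-to-corpus function is a brute-force baseline rather than the reduced-search technique the paper proposes.

```python
import hashlib
from itertools import product

FINGERPRINT_BITS = 256  # assumed width; the paper's actual size is not given here


def fingerprint(text, bits=FINGERPRINT_BITS):
    """Hash each distinct lowercase word to one bit and OR the bits together."""
    fp = 0
    for word in set(text.lower().split()):
        digest = hashlib.blake2b(word.encode("utf-8"), digest_size=8).digest()
        fp |= 1 << (int.from_bytes(digest, "big") % bits)
    return fp


def similarity(fp_a, fp_b):
    """Jaccard-style overlap of set bits, computed with bit-wise AND and OR."""
    union = bin(fp_a | fp_b).count("1")
    if union == 0:
        return 0.0  # two empty fingerprints: define similarity as 0
    return bin(fp_a & fp_b).count("1") / union


def document_to_corpus(query, corpus, k=3):
    """Return up to k (score, index) pairs for the documents closest to the query."""
    q = fingerprint(query)
    scored = [(similarity(q, fingerprint(doc)), i) for i, doc in enumerate(corpus)]
    return sorted(scored, key=lambda s: (-s[0], s[1]))[:k]


def corpus_to_corpus(corpus_a, corpus_b, k=3):
    """Brute-force baseline: score every cross pair, return the k most similar."""
    fps_a = [fingerprint(d) for d in corpus_a]
    fps_b = [fingerprint(d) for d in corpus_b]
    pairs = [
        (similarity(fa, fb), i, j)
        for (i, fa), (j, fb) in product(enumerate(fps_a), enumerate(fps_b))
    ]
    return sorted(pairs, key=lambda p: (-p[0], p[1], p[2]))[:k]
```

Precomputing the fingerprints once per collection, as `corpus_to_corpus` does, keeps each pairwise comparison down to a few integer operations; the number of pairs still grows with the product of the two collection sizes, which is the cost the abstract says the paper's search technique aims to reduce.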

