Count-Min Sketch and Apriori Algorithm (and Bloom Filters)
| Field | Value |
|---|---|
| Content Provider | Semantic Scholar |
| Abstract | Many streaming algorithms use random hash functions to compress data. They essentially map some data items on top of each other at random. This leads to some error, but if one is careful, the large important items show through. The unimportant items get distributed randomly and essentially add a layer of noise. Because it is random, this noise can be bounded. An added benefit of the use of hash functions is their obliviousness: the algorithm is completely independent of the order or content of the data. This means it is easy to change items later if there was a mistake, or to sketch different parts of the data separately and then combine them later. And if data is reordered due to race conditions, it has no effect on the output. Next we also consider a variant of the frequent items problem: the frequent itemset problem, where we seek to find items that commonly occur together. Again, we will see instances where important and common itemsets show through above the noise. This fits into the classic problem of association rule mining. The basic problem is posed as follows: we have a large set of m tuples {T_1, ..., T_m}, where each tuple contains a small number (not all the same k) of items from a domain [n]. Think of the items as products in a grocery store and the tuples as the things bought in the same purchase. Then the goal is to find items which frequently co-occur. Think of which items people frequently buy together: a famous example, found in real data, is "diapers" and "beer". Again, as with finding frequent items (heavy hitters), we assume m and n are very large (maybe not quite so large as before), but we don't want to check all (n choose 2) ≈ n²/2 pairs of items against all m tuples, since that would take roughly n²m/2 time and be far too long. Moreover, the m tuples do not fit in memory, so we only want to scan over them in one pass (or a small number of passes). In contrast to the Misra-Gries algorithm covered last lecture, we describe a completely different way to solve the HEAVY-HITTER problem. It is called the Count-Min Sketch [Cormode and Muthukrishnan 2005] [5] (an illustrative implementation sketch appears after the table below). Start with t independent (random) hash functions {h_1, ..., h_t}, where each h_j : [n] → [k]. Now we store a … |
| File Format | PDF, HTM / HTML |
| Alternate Webpage(s) | http://www.cs.utah.edu/~jeffp/teaching/cs5140-S16/cs5140/L12-Count-Min+Apriori.pdf |
| Alternate Webpage(s) | http://www.cs.utah.edu/~jeffp/teaching/cs5140/L12-Count-Min+Apriori.pdf |
| Alternate Webpage(s) | http://www2.cs.utah.edu/~jeffp/teaching/cs5140-S16/cs5140/L12-Count-Min+Apriori.pdf |
| Alternate Webpage(s) | http://www2.cs.utah.edu/~jeffp/teaching/cs5140/L12-Count-Min+Apriori.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |
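
Since the abstract above is only partially excerpted ("Now we store a …"), the following is a minimal illustrative sketch, in Python, of the counting structure it begins to describe: a t × k array of counters updated through t independent hash functions, and queried by taking the minimum counter over the rows. The class name, the hash construction (a per-row salted blake2b digest), and the default parameters are assumptions made for illustration; they are not taken from the cited lecture notes.

```python
# A minimal Count-Min Sketch sketch (illustrative; parameters and hashing are assumed).
import hashlib


class CountMinSketch:
    def __init__(self, t=5, k=1024):
        self.t = t                                  # number of hash functions / rows
        self.k = k                                  # number of counters per row
        self.counts = [[0] * k for _ in range(t)]   # t x k counter array

    def _hash(self, item, row):
        # One salted hash per row stands in for the t independent hash functions h_j: [n] -> [k].
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.k

    def update(self, item, count=1):
        # Add `count` to the counter at position h_j(item) in every row j.
        for j in range(self.t):
            self.counts[j][self._hash(item, j)] += count

    def query(self, item):
        # The smallest counter over all rows is an overestimate of the true frequency;
        # hash collisions can only inflate counters, never deflate them.
        return min(self.counts[j][self._hash(item, j)] for j in range(self.t))


if __name__ == "__main__":
    stream = ["beer", "diapers", "beer", "milk", "beer", "diapers", "bread"]
    cms = CountMinSketch(t=5, k=64)
    for item in stream:
        cms.update(item)
    print(cms.query("beer"))     # at least 3 (exact here, since collisions are unlikely)
    print(cms.query("cheese"))   # 0 plus any noise from collisions
```

Because updates are oblivious to the order and content of the stream, two sketches built with the same hash functions over different parts of the data can be combined by element-wise addition of their counter arrays, matching the "sketch different parts of the data separately, then combine" property mentioned in the abstract.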