Loading...
Please wait, while we are loading the content...
Similar Documents
Workshop on Massive Geometric Data Sets ( Massive 2005 )
| Content Provider | Semantic Scholar |
|---|---|
| Author | Arge, Lars Berg, Mark De Vahrenhold, Jan |
| Copyright Year | 2005 |
| Abstract | Heavy hitters, which are items occurring with frequency above a given threshold, are an important aggregation and summary tool when processing data streams or data warehouses. Hierarchical heavy hitters (HHHs) have been introduced as a natural generalization for hierarchical and multi-dimensional data domains. An important application, for instance, involves inferring patterns in the stream of IP packets in the Internet, for the purpose of traffic engineering or network security. Each IP packet is mapped to a multi-dimensional point using its header fields, and the goal is to extract classification rules that account for a large fraction of the traffic. In geometric terms, the problem involves identifying rectangular boxes that contain a significant fraction of a stream of points, excluding those points counted in a smaller heavy hitter box. Formally, consider a stream S of d-dimensional points and a family B of axis-aligned d-dimensional boxes. Both |S| and |B| are large—the points of the stream arrive online, and they are too numerous to be stored in memory; the boxes also are too numerous to be maintained explicitly and are defined implicitly by certain rules. The boxes in B are partially ordered by the containment relation: for two boxes B and B′, we write B′ ≺ B if B′ ⊂ B. We define the frequency (or, population) of a box B to be the number of points in the stream that lie in B, namely, |B ∩ S|; we use the notation S to also denote the part of the stream seen so far. We are interested in identifying those boxes with frequency greater than φ|S|, for a given parameter 0 < φ < 1. In order to avoid redundancy, however, hierarchical heavy hitters are defined using the discounted frequency of a box. (Otherwise, all boxes containing a heavy box may be flagged as heavy even if they do not contain many additional points.) Discounted frequencies and φ-hierarchical heavy hitters (φ-HHHs) are defined recursively: the discounted frequency of B counts only those points that lie in B but not in another φ-HHH B′ where B′ ≺ B. A box B is a φ-HHH if its discounted frequency exceeds φ|S|. Hierarchical heavy hitters are natural and powerful constructs, but reliably estimating the discounted frequency of boxes has proved elusive. None of the known space-efficient data stream algorithms offer a worst-case guarantee on the approximation quality of the boxes they flag as φ-HHH . In this talk, we formalize the difficulty of computing true hierarchical heavy hitters and prove lower bounds on the space complexity of algorithms that compute them. For streams of 1-dimensional data, we give an Ω(1/φ) space lower bound for any algorithm, using an information-theoretic argument. To prove lower bounds for streams of multi-dimensional data and to establish stronger space bounds, we limit our discussion to a simple model of deterministic algorithms, which we call the box frequency model. In this model, an algorithm with space bound s is allowed s distinct counters, and each counter maintains the frequency of a box. We show that any single-pass deterministic scheme that computes φ-HHHs for d-dimensional data in the box frequency model with any bounded approximation guarantee must use Ω(1/φ) space. This bound is asymptotically tight as we can show a deterministic data stream algorithm (in the box frequency model) that computes φ-HHHs with constant approximation error, using O(1/φ) memory. ∗Mentor Graphics Corp., 8005 SW Boeckman Road, Wilsonville, OR 97070, USA, john_hershberger@mentor.com and (by courtesy) Department of Computer Science, University of California at Santa Barbara. †Department of Computer Science, University of California at Santa Barbara, Santa Barbara, CA 93106, USA, {nisheeth, suri}@cs.ucsb.edu. ‡Department of Mathematics, Room 2-336, MIT, Cambridge, MA 02139, USA, toth@math.mit.edu. |
| File Format | PDF HTM / HTML |
| Alternate Webpage(s) | http://www.cs.umd.edu/~hjs/pubs/danovarowmgds05.pdf |
| Language | English |
| Access Restriction | Open |
| Content Type | Text |
| Resource Type | Article |