Bloom Filters and FM Sketches

# Bloom Filters and FM Sketches ## Flajolet-Martin Sketches Probabilistic algorithm for **counting distinct elements** in a set. **Given:** a bit array `B[]` of length `l`, a hash function `h()` mapping to `[0, 2^l)`, and a function `r()` returning the position of the least-significant 1-bit. **Insertion of element x:** 1. `pn := h(x)` — hash to get a pseudo-random number 2. Set `B[r(pn)] = 1` Since the output of `h()` is pseudo-random, bit `i` is set ~`n/(2^(i+1))` times. **Estimating count:** find the smallest position `p` where `B[p] = 0`. Then estimate `n ≈ 2^p / 0.77351`. The basic estimator has high variance; accuracy is improved by using multiple bitmaps and averaging (stochastic averaging). **Uses:** data mining, P2P/distributed systems, document frequency estimation. > Flajolet, P. & Martin, G. (1985). *Probabilistic counting algorithms for data base applications.* JCSS 31(2). ## Bloom Filters Probabilistic structure for **testing set membership**. False positives possible, false negatives impossible. **Given:** a bit array `B[]` of length `m`, and `k` hash functions `h_k()` each mapping to `[0, m-1]`. **Insertion:** for each hash function `h_k`, set `B[h_k(x)] = 1`. **Lookup:** for each `h_k`, check `B[h_k(y)]`: - If any position is 0 → element is **definitely not** in the set - If all positions are 1 → element **might** be in the set **False positive rate:** approximately `(1 - e^(-kn/m))^k` **Optimal k:** `k = (m/n) * ln(2)` — more hash functions reduce false positives but slow queries. **Uses:** database query filtering, Google BigTable, network IP lookups, spell checkers. > Bloom, H. (1970). *Space/time trade-offs in hash coding with allowable errors.* CACM 13(7).