# Bloom Filters and FM Sketches
## Flajolet-Martin Sketches
Probabilistic algorithm for **counting distinct elements** in a set.
**Given:** a bit array `B[]` of length `l`, a hash function `h()` mapping to `[0, 2^l)`, and a function `r()` returning the position of the least-significant 1-bit.
**Insertion of element x:**
1. `pn := h(x)` — hash to get a pseudo-random number
2. Set `B[r(pn)] = 1`
Since the output of `h()` is pseudo-random, bit `i` is set ~`n/(2^(i+1))` times.
**Estimating count:** find the smallest position `p` where `B[p] = 0`. Then estimate `n ≈ 2^p / 0.77351`. The basic estimator has high variance; accuracy is improved by using multiple bitmaps and averaging (stochastic averaging).
**Uses:** data mining, P2P/distributed systems, document frequency estimation.
> Flajolet, P. & Martin, G. (1985). *Probabilistic counting algorithms for data base applications.* JCSS 31(2).
## Bloom Filters
Probabilistic structure for **testing set membership**. False positives possible, false negatives impossible.
**Given:** a bit array `B[]` of length `m`, and `k` hash functions `h_k()` each mapping to `[0, m-1]`.
**Insertion:** for each hash function `h_k`, set `B[h_k(x)] = 1`.
**Lookup:** for each `h_k`, check `B[h_k(y)]`:
- If any position is 0 → element is **definitely not** in the set
- If all positions are 1 → element **might** be in the set
**False positive rate:** approximately `(1 - e^(-kn/m))^k`
**Optimal k:** `k = (m/n) * ln(2)` — more hash functions reduce false positives but slow queries.
**Uses:** database query filtering, Google BigTable, network IP lookups, spell checkers.
> Bloom, H. (1970). *Space/time trade-offs in hash coding with allowable errors.* CACM 13(7).