Bloom Filter

#CS/DataStructure

布隆过滤器（Bloom Filter，BF） 是用于判断集合内某个元素是否存在（也叫 membership queries）的一个简单、高效且内存占用低的数据结构。类似哈希表，但不存储实际哈希值、也不处理冲突，而是将多个哈希函数的结果存储于同一个集合，判断存在性时也同样哈希后判断是否所有结果都存在与集合。也就是说，布隆过滤器 允许第二类错误，但显著降低了内存占用。

使用 布隆过滤器，判断某个 key 是否存在时，返回值可能有两种情况：

key 不存在（真阴性）
key 可能存在（真阳性或第二类错误）

历史

布隆过滤器 与 1970 年被 Burton Bloom 创建，在查询集合成员存在与否时一般考虑时间成本或空间成本，论文提出第三个优化方向：Allowable Fraction of Errors，也就是说允许一定的误判，来大幅降低空间占用。^[1] 提出后 布隆过滤器 就被广泛运用于数据库领域。

再之后，布隆过滤器 又在各种领域大放异彩 ^[2]：

端对端连接
网络应用
缓存

原理

布隆过滤器 包含两部份：

m 位的位图
k 个互相独立的哈希函数 $h_{1}, h_{2}, . . ., h_{k}$

k 个哈希函数的替代

可以用 $h_{i} (x) = M D 5 (x + i)$ 或者 $h_{i} (x) = M D 5 (x | i)$ 实现 k 个无关 hash 函数

为了直观描述，假设有一 10 位的位图：

0	1	2	3	4	5	6	7	8	9
0	0	0	0	0	0	0	0	0	0

以及两个哈希函数：Fnv、Murmur。

插入

需要插入 key 时，对插入的 key 执行 k 个哈希函数，得到结果 $h_{1} (k e y), h_{2} (k e y), . . ., h_{k} (k e y)$ ，需要保证结果不大于 m 位位图（比如取余数），然后将位图对应位置 1。

举例，现在需要插入字符串 hello，计算哈希得到 Fnv=6、Murmur=0，对应位置 1，得到位图：

0	1	2	3	4	5	6	7	8	9
1	0	0	0	0	0	1	0	0	0

继续插入字符串 world，计算哈希得到 Fnv=7、Murmur=6，对应位置 1，得到位图：

0	1	2	3	4	5	6	7	8	9
1	0	0	0	0	0	1	1	0	0

查询

需要查询 key 是否存在时，同样对查询的 key 执行 k 个哈希函数，得到结果 $h_{1} (k e y), h_{2} (k e y), . . ., h_{k} (k e y)$ ，也同样需要保证结果不大于 m 位位图（比如取余数），然后查位图对应的位是否都为 1。

若都为 1：key 可能存在
若存在不为 1：key 不存在

举例：

判断 charmer 是否存在
- 计算哈希得到 Fnv=7、Murmur=0，对应位都为 1，可能存在（第二类错误）
判断 hello 是否存在
- 计算哈希得到 Fnv=6、Murmur=0，可能存在（真阳性）
判断 byebye 是否存在
- 计算哈希得到 Fnv=0、Murmur=9，可以看到第 9 位不为 1，不存在（真阴性）。

应用

哈希函数数量（k）为多少时，错误率（ $ε$ ）最低，此时为多少？

由于 hash 函数“选择”bitmap 的位置是随机的，因此对于长度为 m 的 bitmap 而言，插入一个值时某个位被置 1、0 的概率分别是 $P (1) = \frac{1}{m}$ 、 $P (0) = 1 - \frac{1}{m}$

当 k 个 hash 函数置位后，对于任意一个位，仍然是 0 的概率是

P^{'} (0) = P (0)^{k} = (1 - \frac{1}{m})^{k}

对于 n 个 key 插入后，对于任意一个位，仍然是 0 的概率是

P^{″} (0) = P^{'} (0)^{n} = (1 - \frac{1}{m})^{k n}

由于有自然对数 e 的计算公式

lim_{x \to \infty} (1 - \frac{1}{x})^{- x} = e

我们可以近似计算 $P^{″} (0)$ 得到

P^{″} (0) = ((1 - \frac{1}{m})^{- m})^{- \frac{k n}{m}} \approx e^{- \frac{k n}{m}}

因此，对于任意一个位，其被置 1 的概率是

P^{″} (1) = 1 - P^{″} (0) \approx 1 - e^{- \frac{k n}{m}}

当 alse positive 发生时，是指它的 k 个 hash 函数得到的位置都为 1。由于 k 个 hash 函数相互独立，我们可以计算出 alse positive 的概率为

ε = (1 - (1 - \frac{1}{m})^{k n})^{k} \approx (1 - e^{- \frac{k n}{m}})^{k}

由于 m、n 是用户给定的值，唯一变量就是 k，我们需要指定 k 的值去最小化 alse positive 概率。

令 $p = e^{- \frac{k n}{m}}$ 以及自然对数公式，我们有

ε = (1 - p)^{k} = e^{k l n (1 - p)}

此时变更为最小的 $g = k l n (1 - p)$

无中生有下得到

g = k l n (1 - p) = - \frac{m}{n} l n (e^{- \frac{k n}{m}}) l n (1 - p) = - \frac{m}{n} l n (p) l n (1 - p)

根据对称性原理得知当 $p = \frac{1}{2}$ 时 g 取得最小，此时有

k = \frac{m}{n} l n (2)

插入回

ε = (1 - p)^{k}

得到 alse positive 的最小值为

ε_{m i n} = (\frac{1}{2})^{k} \approx (0.6185)^{\frac{m}{n}}

实际应用中如何使用布隆过滤器？

首先获取/估算插入个数（ $n$ ），并固定可容忍的错误率（ $ε$ ），然后通过公示计算：

根据公式 $m = ⌈ - \frac{n l n (ε)}{(l n (2))^{2}} ⌉$ 计算 m 的值
根据公式 $k = ⌈ - l o g_{2} (ε) ⌉$ 计算 k 的值

缺陷

基础 布隆过滤器 存在较多局限，比如需要预先知道 key 个数（ $n$ ）、不能删除元素，因此后续提出了一些变体，为特殊场景作出不同的定制：

可扩展布隆过滤器：解决需要预先知道 key 个数的问题
记数布隆过滤器：解决不能删除元素问题
Count-Min Sketch：利用记数布隆过滤器的相同变体实现基数统计的能力

A bloom filter is a probabilistic data structure that is based on hashing. It is extremely space efficient and is typically used to add elements to a set and test if an element is in a set. Though, the elements themselves are not added to a set. Instead a hash of the elements is added to the set.

When testing if an element is in the bloom filter, false positives are possible. It will either say that an element is definitely not in the set or that it is possible the element is in the set.

A bloom filter is very much like a hash table in that it will use a hash function to map a key to a bucket. However, it will not store that key in that bucket, it will simply mark it as filled. So, many keys might map to same filled bucket, creating false positives.

brilliant

A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set.

The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set.

Bloom Filters by Example

The computational factors considered are the size of the hash area (space), the time required to identify a message as a nonmember of the given set (reject time), and an allowable error frequency.

The new methods are intended to reduce the amount of space required to contain the hash-coded information from that associated with conventional methods. The reduction in space is accomplished by exploiting the possibility that a small fraction of errors of commission may be tolerable in some applications, in particular, applications in which a large amount of data is involved and a core resident hash area is consequently not feasible using conventional methods.

Space/Time Trade-offs in Hash Coding with Allowable Errors

We then consider four types of network-related applications of Bloom ﬁlters:

Collaborating in overlay and peer-to-peer networks: Bloom ﬁlters can be used for summarizing content to aid collaborations in overlay and peer-topeer networks.
Resource routing: Bloom ﬁlters allow probabilistic algorithms for locating resources.
Packet routing: Bloom ﬁlters provide a means to speed up or simplify packet routing protocols.
Measurement: Bloom ﬁlters provide a useful tool for measurement in- frastructures

Network Applications of Bloom Filters: A Survey

Bloom, Burton H. “Space/Time Trade-Offs in Hash Coding with Allowable Errors.” Communications of the ACM 13, no. 7 (July 1, 1970): 422–26. https://doi.org/10.1145/362686.362692. ↩︎ ↩︎
Broder, Andrei, and Michael Mitzenmacher. “Network Applications of Bloom Filters: A Survey.” Internet Mathematics, n.d., 25. ↩︎ ↩︎
https://brilliant.org/wiki/bloom-filter/ ↩︎
https://llimllib.github.io/bloomfilter-tutorial/ ↩︎

历史

原理

插入

查询

应用

哈希函数数量（k）为多少时，错误率（ε）最低，此时为多少？

实际应用中如何使用 布隆过滤器 ？

缺陷

哈希函数数量（k）为多少时，错误率（ $ε$ ）最低，此时为多少？

实际应用中如何使用布隆过滤器？