
Log-Structured Merge-Tree

The LSM tree is a file-organization scheme optimized for write performance: by avoiding dispersed, update-in-place operations, it achieves higher write throughput than a B+ tree.

Background

Origins and Development

  • In 1996, the paper The Log-Structured Merge-Tree (LSM-Tree) first proposed the concept of the LSM tree [1]. The paper is rather opaque, however, and differs considerably from today's mainstream architecture.
  • In 2008, Google published Bigtable: A Distributed Storage System for Structured Data, one of its "big three" papers, and the modern LSM tree architecture took shape [2].

An ACM Queue report shows that on disk, sequential access can be several orders of magnitude faster than random access [3].

Therefore, for write-heavy systems, simply appending to a data file (such a file is called a log, journal, or heap file) can come close to the disk's maximum sequential write speed.
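As an illustration, an append-only log can be sketched in a few lines. The record framing (a 4-byte length prefix) and the class name are arbitrary choices for this example, not part of any particular system:

```python
import os
import tempfile

class AppendOnlyLog:
    """Append-only data file ("log" / "journal" / "heap file"): every write
    is sequential, so throughput approaches the disk's sequential ceiling."""

    def __init__(self, path):
        self.f = open(path, "a+b")

    def append(self, record: bytes) -> int:
        """Append one length-prefixed record; return its offset for later reads."""
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(len(record).to_bytes(4, "big") + record)
        self.f.flush()
        return offset

    def read_at(self, offset: int) -> bytes:
        """Read by offset -- the only efficient lookup a bare log supports."""
        self.f.seek(offset)
        size = int.from_bytes(self.f.read(4), "big")
        return self.f.read(size)

log = AppendOnlyLog(tempfile.mktemp())
off = log.append(b"hello")
assert log.read_at(off) == b"hello"
```

Note that reads require knowing the offset in advance; there is no way to look up a key without scanning, which is exactly the limitation discussed next.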

A log file delivers excellent insert performance, but it suffers from read amplification: without any structure, looking up a single key means scanning the entire file.

As a result, a pure log file is only usable in simple scenarios: either the data is always read in full, as with a database write-ahead log, or it is accessed by offset, as with Kafka's message queues.

Preliminary Ideas

To let a log file support more complex query patterns, such as point queries or range scans, there are four optimization approaches:

  1. Search Sorted File: sort the log file by key
  2. Hash: partition the data into buckets
  3. B+ tree: use an indexed file structure such as a B+ tree or ISAM
  4. External file: create a separate hash or index file
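The first approach can be sketched as a key-sorted file searched in place on disk. The fixed-width layout (8-byte integer key and value) is an assumption made for this example so that binary search can compute record offsets directly:

```python
import struct
import tempfile

# Approach 1: fixed-width records (8-byte int key, 8-byte int value),
# sorted by key, so a point query costs O(log n) seeks instead of a scan.
REC = struct.Struct(">qq")

def write_sorted(path, pairs):
    """Write (key, value) pairs as a key-sorted file of fixed-width records."""
    with open(path, "wb") as f:
        for k, v in sorted(pairs):
            f.write(REC.pack(k, v))

def lookup(path, key):
    """Binary-search the file on disk; each probe is one seek plus one read."""
    with open(path, "rb") as f:
        f.seek(0, 2)                      # move to end to find file size ...
        lo, hi = 0, f.tell() // REC.size  # ... hence the record count
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * REC.size)
            k, v = REC.unpack(f.read(REC.size))
            if k == key:
                return v
            if k < key:
                lo = mid + 1
            else:
                hi = mid
    return None

path = tempfile.mktemp()
write_sorted(path, [(3, 30), (1, 10), (2, 20)])
assert lookup(path, 2) == 20
assert lookup(path, 5) is None
```

The cost falls on the write path: keeping the file sorted means rewriting or merging it whenever new keys arrive, which is precisely the drawback discussed below.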

All of these approaches reduce write performance, because they reintroduce random I/O: sorted files and B+ trees require dispersed, update-in-place writes, while a hash or external index is an extra structure that must be updated on every write.

A common remedy is the fourth approach, External file, but with the index kept in memory. This scheme has scalability limits: when individual entries are small, the index can grow larger than the data itself. Even so, it has been deployed successfully, for example in Riak and Oracle Coherence.
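A minimal sketch of this in-memory-index-over-a-log design, in the spirit of Bitcask (the storage engine Riak uses). The record framing and class name here are invented for the example:

```python
import os
import tempfile

class HashIndexedLog:
    """Approach (4) with the index held in memory: the data file stays
    append-only, and a dict maps each key to the offset of its latest
    value, so a read costs a single seek."""

    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}  # key -> offset of the latest value in the log

    def put(self, key: bytes, value: bytes):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        # Record layout (example only): [klen][key][vlen][value]
        self.f.write(len(key).to_bytes(4, "big") + key
                     + len(value).to_bytes(4, "big") + value)
        self.f.flush()
        self.index[key] = offset  # later puts simply shadow older offsets

    def get(self, key: bytes):
        if key not in self.index:
            return None
        self.f.seek(self.index[key])
        klen = int.from_bytes(self.f.read(4), "big")
        self.f.seek(klen, os.SEEK_CUR)  # skip the stored key
        vlen = int.from_bytes(self.f.read(4), "big")
        return self.f.read(vlen)

db = HashIndexedLog(tempfile.mktemp())
db.put(b"k", b"v1")
db.put(b"k", b"v2")       # appended, not updated in place
assert db.get(b"k") == b"v2"
```

The scalability limit mentioned above is visible here: the dict must hold every key, so many small entries can make the index outweigh the data.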

The LSM Tree

The LSM tree takes a completely different strategy: it preserves the write performance of a log file while substantially speeding up queries (though reads remain slower than with a B+ tree).

In essence, the LSM tree converts random I/O at write time into random I/O at read time, a sensible trade for write-intensive workloads.
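That trade can be sketched as a toy model (the memtable size, run layout, and names below are illustrative, not any real engine's API): writes land in an in-memory table; when it fills, it is flushed as an immutable sorted run, a purely sequential write; reads then probe the memtable and the runs newest-first, which is where the random I/O moved to.

```python
import bisect

class TinyLSM:
    """Toy LSM sketch: sequential flushes on the write path,
    newest-first probing on the read path."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []  # each run is an immutable sorted list of (key, value)
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value          # random access, but in memory
        if len(self.memtable) >= self.limit:
            # Flush: one sorted, sequential write of the whole memtable.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):     # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]            # one binary search per run
        return None

db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)   # memtable full -> flushed as a sorted run
db.put("a", 3)   # newer value shadows the flushed one
assert db.get("a") == 3
assert db.get("b") == 2
```

Real engines bound the read-side cost by merging (compacting) runs in the background; this sketch omits compaction to keep the trade-off itself visible.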


Ben Stopford

In a nutshell LSM trees are designed to provide better write throughput than traditional B+ tree or ISAM approaches. They do this by removing the need to perform dispersed, update-in-place operations.[4]

The Pathologies of Big Data

On typical server hardware today, completely random memory access on a range much larger than cache size can be an order of magnitude or more slower than purely sequential access, but completely random disk access can be five orders of magnitude slower than sequential access. Even state-of-the-art solid-state (flash) disks, although they have much lower seek latency than magnetic disks, can differ in speed by roughly four orders of magnitude between random and sequential access patterns.[5]


Ben Stopford

So we need more than just a journal to efficiently perform more complex read workloads like key based access or a range search. Broadly speaking there are four approaches that can help us here: binary search, hash, B+ or external.

  1. Search Sorted File: save data to a file, sorted by key. If data has defined widths use Binary search. If not use a page index + scan.
  2. Hash: split the data into buckets using a hash function, which can later be used to direct reads.
  3. B+: use navigable file organisation such as a B+ tree, ISAM etc.
  4. External file: leave the data as a log/heap and create a separate hash or tree index into it.

One common solution is to use approach (4) A index into a journal – but keep the index in memory. So, for example, a Hash Table can be used to map keys to the position (offset) of the latest value in a journal file (log). This approach actually works pretty well as it compartmentalises random IO to something relatively small: the key-to-offset mapping, held in memory. Looking up a value is then only a single IO.[4:1]


  1. O’Neil, Patrick, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. “The Log-Structured Merge-Tree (LSM-Tree).” Acta Informatica 33, no. 4 (June 1996): 351–85. https://doi.org/10.1007/s002360050048. ↩︎

  2. Chang, Fay, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. “Bigtable: A Distributed Storage System for Structured Data.” ACM Transactions on Computer Systems 26, no. 2 (June 2008): 1–26. https://doi.org/10.1145/1365815.1365816. ↩︎

  3. Jacobs, Adam. “The Pathologies of Big Data.” ACM Queue. https://queue.acm.org/detail.cfm?id=1563874. ↩︎

  4. Stopford, Ben. “Log Structured Merge Trees.” February 14, 2015. http://www.benstopford.com/2015/02/14/log-structured-merge-trees/. ↩︎ ↩︎

  5. Jacobs, Adam. “The Pathologies of Big Data.” ACM Queue. https://queue.acm.org/detail.cfm?id=1563874. ↩︎