Column-Oriented Data Layout

列式（column-oriented, or columnar） 数据布局是指以列的方式存储数据，同一列的值连续地存储在磁盘上。

将列单独存储赋予了列式数据库高效查询地能力，因为可以只读取所需的列，因此列式数据库常用于 OLAP 场合，大量数据聚合、分析。

但是当需要重新将元组取出时，就需要从不同列中找到对应的数据。一些数据库使用显式 ID 去记录每一行，但这造成数据重复。有些数据库使用隐式 ID，比如文件偏移来获取对应数据。

What is column-oriented database?

Column-oriented database management systems partition data vertically (i.e., by col‐ umn) instead of storing it in rows. Here, values for the same column are stored contiguously on disk (as opposed to storing rows contiguously as in the previous example).^[1]

Advantages of column-oriented data layout

Storing values for different columns in separate files or file segments allows efficient queries by column, since they can be read in one pass rather than consuming entire rows and discarding data for columns that weren’t queried.^[1:1]

The suitable scenarios for column-oriented data layout

Column-oriented stores are a good fit for analytical workloads that compute aggregates, such as finding trends, computing average values, etc. Processing complex aggregates can be used in cases when logical records have multiple fields, but some of them (in this case, price quotes) have different importance and are often consumed together.^[1:2]

Reconstruct data tuple require additional works in columnar data layout

To reconstruct data tuples, which might be useful for joins, filtering, and multirow aggregates, we need to preserve some metadata on the column level to identify which data points from other columns it is associated with. If you do this explicitly, each value will have to hold a key, which introduces duplication and increases the amount of stored data. Some column stores use implicit identifiers (virtual IDs) instead and use the position of the value (in other words, its offset) to map it back to the related values

Petrov, Alex. Database Internals. 1st ed., 2019. ↩︎ ↩︎ ↩︎