Column-Oriented Data Layout
列式(column-oriented, or columnar) 数据布局是指以 列 的方式存储数据,同一列的值连续地存储在磁盘上。
将 列 单独存储赋予了 列式 数据库高效查询地能力,因为可以只读取所需的列,因此 列式 数据库常用于 OLAP 场合,大量数据聚合、分析。
但是当需要重新将 元组 取出时,就需要从不同列中找到对应的数据。一些数据库使用显式 ID 去记录每一 行,但这造成数据重复。有些数据库使用隐式 ID,比如文件偏移来获取对应数据。
Column-oriented database management systems partition data vertically (i.e., by col‐ umn) instead of storing it in rows. Here, values for the same column are stored contiguously on disk (as opposed to storing rows contiguously as in the previous example).[1]
Storing values for different columns in separate files or file segments allows efficient queries by column, since they can be read in one pass rather than consuming entire rows and discarding data for columns that weren’t queried.[1:1]
Column-oriented stores are a good fit for analytical workloads that compute aggregates, such as finding trends, computing average values, etc. Processing complex aggregates can be used in cases when logical records have multiple fields, but some of them (in this case, price quotes) have different importance and are often consumed together.[1:2]
To reconstruct data tuples, which might be useful for joins, filtering, and multirow aggregates, we need to preserve some metadata on the column level to identify which data points from other columns it is associated with. If you do this explicitly, each value will have to hold a key, which introduces duplication and increases the amount of stored data. Some column stores use implicit identifiers (virtual IDs) instead and use the position of the value (in other words, its offset) to map it back to the related values