Index Files

索引（index） 是用于加速查询存储在磁盘上的中记录的数据结构，索引存储在 index files 中，不同的 data files 有不同 index files 内容：

heap files：key 到 data files 中对应记录位置的映射表
index-organized tables：key 到主键位置的映射表

在主键上构建的索引称为 主键索引（primary index），而其他索引称为 二级索引（secondary index）。二级索引 可以直接指向记录，也可以指向记录的主键。

主键索引和二级索引区别

通常主键索引每个 key 和记录是一对一关系，而二级索引每个 key 和记录可以是多对多关系。

主键索引作为中转

上文提到 二级索引 可以直接指向 data files 上的记录，也可以指向记录的主键，这两种方式各有优劣。

	优点	缺点
直接引用数据	查询时减少磁盘寻址次数	更新数据时需要更新所有索引
主键中转	更新维护时减少成本	查询时开销更大

这两种方式往往根据数据库查询场景不同而选择，比如主要用于读取、很少更新的 OLAP 场景，那直接引用数据的方式更好。

聚集索引和非聚集索引

如果存储在 data files 上的记录按索引相同顺序存储，这个索引称为 聚集索引（clustered，也叫 clustering）。当记录存储在单独的文件中，且不按索引顺序存储时，称这个索引为 非聚集索引（nonclustered，也叫 unclustered）。

提示

IOTs 根据定义是聚集索引，因为它将数据存储在索引中
通常情况下主键索引是聚集索引
二级索引根据定义是非聚集索引，因为它们是用于主键索引外的数据访问

What is index files?

An index is a structure that organizes data records on disk in a way that facilitates efficient retrieval operations. Index files are organized as specialized structures that map keys to locations in data files where the records identified by these keys (in the case of heap files) or primary keys (in the case of index-organized tables) are stored.^[1]

Primary index and secondary index

An index on a primary (data) file is called the primary index. However, in most cases we can also assume that the primary index is built over a primary key or a set of keys identified as primary. All other indexes are called secondary.

Secondary indexes can point directly to the data record, or simply store its primary key. A pointer to a data record can hold an offset to a heap file or an index-organized table. Multiple secondary indexes can point to the same record, allowing a single data record to be identified by different fields and located through different indexes. While primary index files hold a unique entry per search key, secondary indexes may hold several entries per search key.^[1:1]

What is clustering?

If the order of data records follows the search key order, this index is called clustered (also known as clustering). Data records in the clustered case are usually stored in the same file or in a clustered file, where the key order is preserved. If the data is stored in a separate file, and its order does not follow the key order, the index is called nonclustered (sometimes called unclustered).^[1:2]

Tips on clustering of IOTs, primary indexes and secondary indexes

Index-organized tables store information in index order and are clustered by definition. Primary indexes are most often clustered. Secondary indexes are nonclustered by definition, since they’re used to facilitate access by keys other than the primary one. Clustered indexes can be both index-organized or have separate index and data files.^[1:3]

Using primary index as an indirection

Both approaches have their pros and cons and are better discussed in the scope of a complete implementation. By referencing data directly, we can reduce the number of disk seeks, but have to pay a cost of updating the pointers whenever the record is updated or relocated during a maintenance process. Using indirection in the form of a primary index allows us to reduce the cost of pointer updates, but has a higher cost on a read path.^[1:4]

Petrov, Alex. Database Internals. 1st ed., 2019. ↩︎ ↩︎ ↩︎ ↩︎ ↩︎

主键索引 作为中转

聚集索引 和 非聚集索引

主键索引作为中转

聚集索引和非聚集索引