Index Files
索引(index) 是用于加速查询存储在磁盘上的中 记录 的数据结构,索引 存储在 index files 中,不同的 data files 有不同 index files 内容:
- heap files:key 到 data files 中对应 记录 位置的映射表
- index-organized tables:key 到 主键 位置的映射表
在 主键 上构建的索引称为 主键索引(primary index),而其他索引称为 二级索引(secondary index)。二级索引 可以直接指向 记录,也可以指向 记录 的 主键。
主键索引 作为中转
上文提到 二级索引 可以直接指向 data files 上的 记录,也可以指向 记录 的 主键,这两种方式各有优劣。
优点 | 缺点 | |
---|---|---|
直接引用数据 | 查询时减少磁盘寻址次数 | 更新数据时需要更新所有索引 |
主键中转 | 更新维护时减少成本 | 查询时开销更大 |
这两种方式往往根据数据库查询场景不同而选择,比如主要用于读取、很少更新的 OLAP 场景,那直接引用数据的方式更好。
聚集索引 和 非聚集索引
如果存储在 data files 上的 记录 按 索引 相同顺序存储,这个 索引 称为 聚集索引(clustered,也叫 clustering)。当 记录 存储在单独的文件中,且不按 索引 顺序存储时,称这个索引为 非聚集索引(nonclustered,也叫 unclustered)。
- IOTs 根据定义是聚集索引,因为它将数据存储在索引中
- 通常情况下主键索引是聚集索引
- 二级索引根据定义是非聚集索引,因为它们是用于主键索引外的数据访问
An index is a structure that organizes data records on disk in a way that facilitates efficient retrieval operations. Index files are organized as specialized structures that map keys to locations in data files where the records identified by these keys (in the case of heap files) or primary keys (in the case of index-organized tables) are stored.[1]
An index on a primary (data) file is called the primary index. However, in most cases we can also assume that the primary index is built over a primary key or a set of keys identified as primary. All other indexes are called secondary.
Secondary indexes can point directly to the data record, or simply store its primary key. A pointer to a data record can hold an offset to a heap file or an index-organized table. Multiple secondary indexes can point to the same record, allowing a single data record to be identified by different fields and located through different indexes. While primary index files hold a unique entry per search key, secondary indexes may hold several entries per search key.[1:1]
If the order of data records follows the search key order, this index is called clustered (also known as clustering). Data records in the clustered case are usually stored in the same file or in a clustered file, where the key order is preserved. If the data is stored in a separate file, and its order does not follow the key order, the index is called nonclustered (sometimes called unclustered).[1:2]
Index-organized tables store information in index order and are clustered by definition. Primary indexes are most often clustered. Secondary indexes are nonclustered by definition, since they’re used to facilitate access by keys other than the primary one. Clustered indexes can be both index-organized or have separate index and data files.[1:3]
Both approaches have their pros and cons and are better discussed in the scope of a complete implementation. By referencing data directly, we can reduce the number of disk seeks, but have to pay a cost of updating the pointers whenever the record is updated or relocated during a maintenance process. Using indirection in the form of a primary index allows us to reduce the cost of pointer updates, but has a higher cost on a read path.[1:4]