
schema-on-read

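Schema-on-read is a data analysis workflow used in big data scenarios: data is stored in its raw, unstructured form in a data lake, object store, or similar location, and a schema is applied to it on the fly at query time.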

Compared with the traditional schema-on-write approach, this pattern has the following four advantages:

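However, this pattern also has serious drawbacks. Because there is no unified schema or ETL process, the data is likely to be of low quality, contain duplicates, and so on, which leads to inaccurate or incomplete query results.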

This pattern is common in databases built on a semi-structured data model.


What is Schema-on-Read?

Schema-on-Read is a more flexible approach where data is stored in its raw, unstructured form and the schema is applied at the time of data retrieval or analysis. This means that data can be ingested quickly without the need for upfront schema design. The schema is applied on-the-fly during data query or analysis, allowing for more dynamic and agile data exploration and analysis.[1]

- dremio

How does Schema-on-Read work?

With Schema-on-Read, data is stored in its raw, unstructured format, often in a data lake or object store. When data needs to be analyzed, a schema is applied dynamically during the query or analysis phase. This allows for more flexibility as the schema can be adjusted or modified based on the specific requirements of the analysis, without the need to transform the underlying data.[1]

- dremio
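
As a minimal sketch of the idea quoted above (not taken from the cited article), the Python snippet below keeps records as raw newline-delimited JSON and applies a schema only when the data is read. The sample data, the two schemas, and the `read_with_schema` helper are all hypothetical names introduced for illustration.

```python
import json

# Raw records as they might sit in a data lake: newline-delimited JSON,
# written without any upfront schema (hypothetical sample data).
RAW_EVENTS = """\
{"user": "alice", "amount": "12.50", "ts": "2024-01-03"}
{"user": "bob", "amount": "7", "ts": "2024-01-04", "coupon": "NEW10"}
"""

# Hypothetical read-time schemas: field name -> converter function.
BILLING_SCHEMA = {"user": str, "amount": float}
MARKETING_SCHEMA = {"user": str, "coupon": str}


def read_with_schema(raw_text, schema):
    """Parse raw lines and project each record onto `schema` at read time."""
    rows = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        row = {}
        for field, convert in schema.items():
            value = record.get(field)
            # Fields absent from the raw record become None instead of failing.
            row[field] = convert(value) if value is not None else None
        rows.append(row)
    return rows


# The same stored bytes serve two analyses with two different schemas.
billing = read_with_schema(RAW_EVENTS, BILLING_SCHEMA)
marketing = read_with_schema(RAW_EVENTS, MARKETING_SCHEMA)
```

Here the stored data is never rewritten: changing the analysis only means passing a different schema to the reader, which is the flexibility the quote describes.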

Disadvantages of Schema-on-Read

But since the data does not go through strict ETLs and transformation into strict data storage schemas, there can be a lot of missing or invalid data, duplicates, and many other problems that can lead to inaccurate or incomplete query results.[2]
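
To make the risk concrete, the following is a small illustrative sketch in Python, assuming hypothetical order records stored as raw JSON lines with no validation on write. `naive_total` and `audited_total` are made-up helper names: the first silently produces a skewed result, while the second shows the kind of read-time checks that schema-on-read pushes onto every query.

```python
import json

# Hypothetical raw data that was never validated on write: it contains
# a duplicate record, a missing field, and an unparseable value.
RAW_ORDERS = """\
{"order_id": 1, "amount": "19.99"}
{"order_id": 1, "amount": "19.99"}
{"order_id": 2}
{"order_id": 3, "amount": "n/a"}
{"order_id": 4, "amount": "5.00"}
"""


def naive_total(raw_text):
    """Sum every amount that parses, silently ignoring everything else."""
    total = 0.0
    for line in raw_text.splitlines():
        record = json.loads(line)
        try:
            total += float(record["amount"])
        except (KeyError, ValueError):
            pass  # missing or invalid amounts vanish without a trace
    return total


def audited_total(raw_text):
    """Sum amounts while counting the data-quality problems encountered."""
    seen_ids = set()
    total = 0.0
    issues = {"duplicate": 0, "missing": 0, "invalid": 0}
    for line in raw_text.splitlines():
        record = json.loads(line)
        order_id = record.get("order_id")
        if order_id in seen_ids:
            issues["duplicate"] += 1
            continue
        seen_ids.add(order_id)
        if "amount" not in record:
            issues["missing"] += 1
            continue
        try:
            total += float(record["amount"])
        except ValueError:
            issues["invalid"] += 1
    return total, issues
```

With this sample, the naive query double-counts order 1 and quietly drops orders 2 and 3, whereas the audited version reports one duplicate, one missing amount, and one invalid amount alongside its total.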


  1. https://www.dremio.com/wiki/schema-on-read-vs-schema-on-write/

  2. https://luminousmen.com/post/schema-on-read-vs-schema-on-write