schema-on-read
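Schema-on-read is a data analysis approach used in big data settings: data is stored in its raw, unstructured form in a data lake, object store, or similar location, and a schema is applied to it on the fly at query time.
Compared with the traditional schema-on-write approach, it has the following four advantages: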
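- Flexibility: no schema has to be defined up front, which suits fast-moving businesses
- Resource savings: no resource-intensive ETL process is needed
- Scalability: data from different sources and with different structures can be added freely, without designing a compatible schema
- Real-time analysis: with no upfront ETL step, data can be processed and analyzed in real time or near real time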
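A minimal sketch of the idea, assuming raw events kept as JSON lines. The sample records, the field names, and the helper `read_with_schema` are hypothetical and only illustrate the pattern: nothing is validated on write, and names and types are unified only when the data is read.

```python
import json

# Raw events stored exactly as they arrived (hypothetical sample data).
# The two sources use different field names; no schema was enforced on write.
RAW_LINES = [
    '{"user_id": 1, "amount": "19.90", "ts": "2024-01-05"}',
    '{"uid": 2, "total": 5, "note": "from a second source"}',
]


def read_with_schema(line):
    """Apply a schema at read time: pick fields, unify names, cast types."""
    record = json.loads(line)
    return {
        "user_id": int(record.get("user_id", record.get("uid"))),
        "amount": float(record.get("amount", record.get("total"))),
    }


# The "query": the schema is applied while scanning the raw lines.
rows = [read_with_schema(line) for line in RAW_LINES]
total = sum(row["amount"] for row in rows)
```

Adding a third source with yet another layout would only require extending `read_with_schema`; the data already stored stays untouched.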
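However, this approach also has a serious drawback: without a unified schema and an ETL process, the data is likely to be of low quality, contain duplicates, and so on, which leads to incorrect or incomplete query results.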
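The drawback can be made concrete with a small sketch. The records and the helper `clean_read` below are hypothetical; the point is that duplicates, missing fields, and invalid values only surface at read time, so every query has to deal with them itself.

```python
import json

# Hypothetical raw records that were never validated on write.
RAW_LINES = [
    '{"order_id": 1, "amount": 10.0}',
    '{"order_id": 1, "amount": 10.0}',   # duplicate
    '{"order_id": 2}',                   # missing amount
    '{"order_id": 3, "amount": "n/a"}',  # invalid amount
    '{"order_id": 4, "amount": 7.5}',
]


def clean_read(lines):
    """Return (valid_rows, rejected_count), validating while reading."""
    seen, valid, rejected = set(), [], 0
    for line in lines:
        record = json.loads(line)
        try:
            row = (int(record["order_id"]), float(record["amount"]))
        except (KeyError, TypeError, ValueError):
            rejected += 1  # missing or malformed field
            continue
        if row[0] in seen:
            rejected += 1  # duplicate order_id
            continue
        seen.add(row[0])
        valid.append(row)
    return valid, rejected


naive_count = len(RAW_LINES)  # what a query sees without any checks
valid_rows, rejected = clean_read(RAW_LINES)
```

Here a naive count reports five orders, while only two records survive validation; a query that skips these checks would return an inflated or failing result.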
This approach is common in databases built on semi-structured data models.
Schema-on-Read is a more flexible approach where data is stored in its raw, unstructured form and the schema is applied at the time of data retrieval or analysis. This means that data can be ingested quickly without the need for upfront schema design. The schema is applied on-the-fly during data query or analysis, allowing for more dynamic and agile data exploration and analysis.[1]
- dremio
with Schema-on-Read, data is stored in its raw, unstructured format, often in a data lake or object store. When data needs to be analyzed, a schema is applied dynamically during the query or analysis phase. This allows for more flexibility as the schema can be adjusted or modified based on the specific requirements of the analysis, without the need to transform the underlying data.[1:1]
- dremio
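As a rough illustration of adjusting the schema per analysis, the sketch below applies two different schemas to the same raw lines without rewriting them. The schema layout (output column mapped to a raw field and a cast function), the sample data, and `apply_schema` are assumptions made for this example, not any particular engine's API.

```python
import json

# Hypothetical raw data, stored once and never transformed.
RAW_LINES = [
    '{"user": "ann", "country": "DE", "amount": "12.5", "device": "ios"}',
    '{"user": "bob", "country": "FR", "amount": "7.5", "device": "web"}',
    '{"user": "ann", "country": "DE", "amount": "3", "device": "web"}',
]

# Each schema maps an output column to (raw field, cast function).
REVENUE_SCHEMA = {"country": ("country", str), "amount": ("amount", float)}
DEVICE_SCHEMA = {"user": ("user", str), "device": ("device", str)}


def apply_schema(lines, schema):
    """Yield one typed row per raw line, shaped by the given schema."""
    for line in lines:
        record = json.loads(line)
        yield {col: cast(record[field]) for col, (field, cast) in schema.items()}


# Analysis 1: revenue per country.
revenue_by_country = {}
for row in apply_schema(RAW_LINES, REVENUE_SCHEMA):
    key = row["country"]
    revenue_by_country[key] = revenue_by_country.get(key, 0.0) + row["amount"]

# Analysis 2: devices used per user, over the same raw lines.
devices_by_user = {}
for row in apply_schema(RAW_LINES, DEVICE_SCHEMA):
    devices_by_user.setdefault(row["user"], set()).add(row["device"])
```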
But since the data does not go through strict ETLs and transformation into strict data storage schemas, there can be a lot of missing or invalid data, duplicates, and many other problems that can lead to inaccurate or incomplete query results.[2]