
Fault

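A fault is a single component of the system deviating from its spec, while the system as a whole remains at least partially available. Note the difference from a failure.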


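System resilience is usually achieved by tolerating faults. Sometimes, though, prevention matters more than tolerance. Security is one example: once a data breach has happened, its impact cannot be undone.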

- Designing Data-Intensive Applications

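Faults fall into three categories: hardware faults, software faults, and human errors.

Hardware faults

Hardware faults are usually addressed through redundancy.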
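The redundancy idea can be sketched in a few lines. This is a minimal illustration, not something taken from the book: `read_with_failover`, `AllReplicasFailed`, and the two simulated disks are hypothetical names, and treating `OSError` as the sign of a component fault is an assumption made for the example.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant component has faulted (a system failure)."""


def read_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Return the first successful read, tolerating faults in individual replicas."""
    errors = []
    for read in replicas:
        try:
            return read()
        except OSError as exc:  # assumption: an I/O error means this component faulted
            errors.append(exc)
    # Only when all redundant components have faulted does the read itself fail.
    raise AllReplicasFailed(f"{len(errors)} replica(s) faulted")


def broken_disk() -> str:
    """Simulated hardware fault (hypothetical component)."""
    raise OSError("disk unreadable")


def healthy_disk() -> str:
    """Simulated healthy component (hypothetical)."""
    return "payload"
```

Calling `read_with_failover([broken_disk, healthy_disk])` returns the healthy replica's data, so the fault in one component does not turn into a failure of the read. If every replica faults, the function raises `AllReplicasFailed`.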

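Software faults

Systematic errors are hard to anticipate, can lie dormant for a long time, and often cause system failure.

Errors of this kind usually arise because the system makes some assumption about its environment, and under unusual circumstances that assumption stops holding. Mitigations include: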
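As one illustration of guarding against a hidden assumption, the sketch below states an assumption explicitly and checks it while the system runs, so that a violation is reported instead of lying dormant. The queue invariant, the function name `check_queue_invariant`, and the logger name are all hypothetical and chosen only for the example.

```python
import logging

logger = logging.getLogger("selfcheck")  # hypothetical logger name


def check_queue_invariant(enqueued: int, dequeued: int, pending: int) -> bool:
    """Return True if the assumed invariant holds: enqueued == dequeued + pending.

    The invariant is an illustrative assumption about a hypothetical message queue.
    """
    ok = enqueued == dequeued + pending
    if not ok:
        # Report the broken assumption as soon as it is observed.
        logger.error(
            "queue invariant violated: enqueued=%d dequeued=%d pending=%d",
            enqueued, dequeued, pending,
        )
    return ok
```

A periodic job could call `check_queue_invariant` with live counters and raise an alert whenever it returns `False`.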

Human errors

Human errors are unavoidable. Approaches for dealing with them include:


Designing Data-Intensive Applications

A fault is usually defined as one component of the system deviating from its spec.[1:1]

Designing Data-Intensive Applications

Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system.[1:2]

Designing Data-Intensive Applications

Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults.

The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment—and while that assumption is usually true, it eventually stops being true for some reason[2]


  1. Martin Kleppmann, Designing Data-Intensive Applications, n.d., p. 7.

  2. Martin Kleppmann, Designing Data-Intensive Applications, n.d., p. 8.