Fault
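A fault is one component of the system deviating from its spec, while the system as a whole remains at least partly available. Do not confuse it with a failure, where the system as a whole stops providing the required service.
System resilience is usually achieved by tolerating faults. Sometimes, however, prevention matters more: with security issues, for example, once a data leak has occurred its impact cannot be undone.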
- Designing Data-Intensive Applications
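Faults fall into three categories: hardware faults, software faults, and human errors.
Hardware faults
Hardware faults are usually handled by adding redundancy.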
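As a rough illustration of why redundancy lowers the failure rate, the sketch below assumes that each of `n` redundant components fails independently with probability `p` over some interval, so the group is lost only when all of them fail. The function name and the numbers are hypothetical, and the independence assumption is exactly what correlated faults violate.

```python
def group_failure_probability(p: float, n: int) -> float:
    """Probability that all n replicas fail, assuming independent failures.

    p: probability that a single component fails in the interval (assumed).
    n: number of redundant components.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return p ** n


# Hypothetical example: a 1% per-component failure probability.
for replicas in (1, 2, 3):
    print(replicas, group_failure_probability(0.01, replicas))
```

Each extra replica multiplies the group failure probability by `p`, which is why redundancy works well for uncorrelated hardware faults and much less well when one cause takes out every replica at once.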
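Software faults
Systematic errors are hard to anticipate, can lie dormant for a long time, and often cause system failures.
Such errors usually arise because the software makes some assumption about its environment, and that assumption stops holding under unusual circumstances. Possible mitigations:
- Thorough testing
- Process isolation
- Allowing processes to crash and restart
- Measuring and monitoring the system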
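The "crash and restart" idea from the list above can be sketched as a tiny supervisor loop. This is a minimal in-process sketch, not real process isolation: `run_with_restarts`, `max_restarts`, and the `flaky` task are hypothetical names, and a production setup would more likely restart a separate OS process and export the crash count to a monitoring system.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("supervisor")


def run_with_restarts(task: Callable[[], T], max_restarts: int = 3) -> T:
    """Run task; if it raises, log the crash and start it again.

    Re-raises the last exception once max_restarts restarts are used up.
    """
    restarts = 0
    while True:
        try:
            return task()
        except Exception:
            if restarts >= max_restarts:
                log.exception("task crashed; giving up after %d restarts", restarts)
                raise
            restarts += 1
            # Logging each crash is the "measure and monitor" part.
            log.exception("task crashed; restart %d of %d", restarts, max_restarts)


# Hypothetical task that fails twice before its assumption holds again.
attempts = {"count": 0}


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("assumption about the environment did not hold")
    return "ok"


result = run_with_restarts(flaky)
print(result, attempts["count"])
```

Capping the number of restarts keeps a persistent bug from turning into an endless crash loop; the cap itself is a design choice, not something the source prescribes.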
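Human errors
Human errors are unavoidable. The following approaches help:
- Design systems in a way that minimizes opportunities for error
- Decouple the places where people make mistakes from the places where mistakes can cause failures (for example, test environments)
- Test repeatedly at every level
- Telemetry
- Good management practices and adequate training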
A fault is usually defined as one component of the system deviating from its spec.[1:1]
Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system.[1:2]
Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults.
The bugs that cause these kinds of software faults often lie dormant for a long time until they are triggered by an unusual set of circumstances. In those circumstances, it is revealed that the software is making some kind of assumption about its environment—and while that assumption is usually true, it eventually stops being true for some reason[2]