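Distributed Data Management Systems
PART I Distributed System Concepts and Models
1. Introduction to Distributed Systems
Topics:
- What a distributed system is
- Why distributed systems are needed
- Their problems and challenges
Reading:
- Books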
Distributed Systems, Concepts and Design (Chapter 1)
Distributed Systems An Algorithmic Approach (Chapter 1)
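2. System Models (Abstract)
Topics:
- Message passing
  - Synchronous
  - Asynchronous
- Shared memory
  - Synchronous
  - Asynchronous
- Failures
- Abstract system model: state machines + graphs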
Reading:
- Papers
I/O automata model: An Introduction to Input/Output Automata
Key point: the I/O automata concept, and how to use it to describe algorithms and systems
- Books
Notes on Theory of Distributed Systems (Chapter 2, Chapter 16, Appendix J)
Distributed Systems, Concepts and Design (Chapters 2-7)
Distributed Systems An Algorithmic Approach (Chapters 2-5)
Distributed Algorithms (Chapter 2, Chapter 8)
Specifying Systems (Chapters 1-8)
- Videos
Leslie Lamport’s The TLA+ Video Course
3. System Models (Implementation)
- Physical system model: threads, processes, events, network communication, client/server, messages/RPC
- Go programming
  - [A tour of the Go programming language](https://go.dev/tour/welcome/1)
  - [The Go Programming Language and Environment](https://dl.acm.org/doi/pdf/10.1145/3488716)
  - Patterns and Hints for Concurrency in Go (video)
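Reading:
- Talks
Event vs. Thread: Why Threads Are A Bad Idea (for most purposes)
Key point: understand event-based processing; concurrency is not the same as multithreading (or multiprocessing)
- Papers
Message Passing vs. RPC: A Note on Distributed Computing
Key point: weigh the strengths and weaknesses of both, and make your own choice when designing a system
- Books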
Unix Network Programming
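
The point that concurrency is not the same as multithreading can be illustrated with a minimal sketch of an event loop in Go. The names `Event` and `RunLoop` are hypothetical and not taken from any of the readings; the assumption is that one goroutine owns all state and handles queued events one at a time, so requests from many logical clients interleave without locks.

```go
package main

import "fmt"

// Event is a hypothetical unit of work: add Delta to the counter named Key.
type Event struct {
	Key   string
	Delta int
}

// RunLoop drains the event queue on the calling goroutine.
// All state is owned by the loop, so handlers never race with each other.
func RunLoop(events <-chan Event) map[string]int {
	state := make(map[string]int)
	for ev := range events {
		state[ev.Key] += ev.Delta
	}
	return state
}

func main() {
	events := make(chan Event, 8)
	// Requests from several logical clients interleave on one queue.
	for i := 0; i < 3; i++ {
		events <- Event{Key: "a", Delta: 1}
		events <- Event{Key: "b", Delta: 2}
	}
	close(events)
	fmt.Println(RunLoop(events))
}
```

The trade-off is the one the talk discusses: a single-threaded loop avoids synchronization bugs, but a slow handler blocks every event behind it.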
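PART II Distributed System Techniques
4. Partitioning and Replication
Topics:
- The problems that partitioning and replication solve
- Partitioning strategies: key-range partitioning and hash partitioning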
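
The two partitioning strategies listed above can be sketched in a few lines of Go. This is an illustration under stated assumptions, not a production scheme: `HashPartition` and `RangePartition` are hypothetical names, the hash is FNV-1a from the standard library, and the range boundaries are assumed to be sorted.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// HashPartition maps a key to one of n partitions using the FNV-1a hash.
func HashPartition(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

// RangePartition returns the index of the key range that contains key.
// bounds must be sorted; partition i holds keys k with
// bounds[i-1] <= k < bounds[i], and the last partition is unbounded above.
func RangePartition(key string, bounds []string) int {
	return sort.Search(len(bounds), func(i int) bool { return key < bounds[i] })
}

func main() {
	bounds := []string{"h", "p"} // three ranges: below "h", "h" up to "p", "p" and above
	for _, k := range []string{"apple", "kiwi", "zebra"} {
		fmt.Println(k, "range:", RangePartition(k, bounds), "hash:", HashPartition(k, 3))
	}
}
```

Range partitioning keeps adjacent keys together, which helps range scans but can create hot spots; hash partitioning spreads load more evenly at the cost of scattering adjacent keys.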
Reading:
- Papers
Time, Clocks, and the Ordering of Events in a Distributed System
Key point: understand state machines
Extension: HLC, Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases
Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial
Key point: replicated state machines
Harvest, Yield, and Scalable Tolerant Systems
Key point: understand CAP
- Books
Distributed Systems, Concepts and Design(Chapter18)
Distributed Systems An Algorithmic Approach (Chapter12~13)
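
As a companion to the logical-clock reading in this section, here is a minimal sketch of a Lamport clock in Go. The type and method names are hypothetical, and the sketch assumes a single goroutine per process (it is not safe for concurrent use).

```go
package main

import "fmt"

// LamportClock is a logical clock for one process.
// Assumption: it is used from a single goroutine.
type LamportClock struct {
	time uint64
}

// Tick advances the clock for a local event or a message send
// and returns the new timestamp.
func (c *LamportClock) Tick() uint64 {
	c.time++
	return c.time
}

// Receive merges the timestamp carried by an incoming message:
// the clock jumps to max(local, remote) and then advances by one.
func (c *LamportClock) Receive(remote uint64) uint64 {
	if remote > c.time {
		c.time = remote
	}
	c.time++
	return c.time
}

func main() {
	var p, q LamportClock
	p.Tick()                 // local event on p
	sent := p.Tick()         // p sends a message stamped with this value
	recv := q.Receive(sent)  // q receives it
	fmt.Println(sent < recv) // a send always precedes its receive
}
```

The timestamps give an order consistent with causality: if one event happened before another, its timestamp is smaller. The converse does not hold, which is part of what motivates the HLC extension.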
5. Consensus Algorithms
Topics:
- The problem that consensus solves
- Paxos / Raft
Raft
Reading:
- Papers
Key point: single-value Paxos
Key point: multi-value Paxos
Paxos vs Raft: Have we reached consensus on distributed consensus?
Key point: understand the differences between Paxos and Raft
In Search of an Understandable Consensus Algorithm
Key point: understand Raft: replication + leader election
Consensus: Bridging Theory and Practice (thesis, optional)
Key point: more Raft details
Key point: understand the correctness of Raft
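
To make single-value Paxos concrete, the following is a rough Go sketch of the acceptor rules and one proposer round. It is a teaching aid under strong assumptions: everything runs in one process, calls are synchronous, nothing fails or is persisted, and proposal numbers are positive integers chosen by the caller. The names `Acceptor` and `Propose` are hypothetical.

```go
package main

import "fmt"

// Acceptor holds the state one acceptor keeps for a single value.
type Acceptor struct {
	promised  int    // highest proposal number promised so far
	acceptedN int    // number of the accepted proposal, 0 if none
	acceptedV string // value of the accepted proposal
}

// Prepare handles phase 1: promise to ignore proposals numbered below n
// and report any proposal already accepted.
func (a *Acceptor) Prepare(n int) (ok bool, accN int, accV string) {
	if n <= a.promised {
		return false, 0, ""
	}
	a.promised = n
	return true, a.acceptedN, a.acceptedV
}

// Accept handles phase 2: accept unless a higher number was promised.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promised {
		return false
	}
	a.promised = n
	a.acceptedN, a.acceptedV = n, v
	return true
}

// Propose runs one round with proposal number n and preferred value v.
// It returns the chosen value and whether a majority accepted it.
func Propose(acceptors []*Acceptor, n int, v string) (string, bool) {
	promises, highest := 0, 0
	for _, a := range acceptors {
		if ok, accN, accV := a.Prepare(n); ok {
			promises++
			if accN > highest {
				// Must adopt the value of the highest-numbered accepted proposal.
				highest, v = accN, accV
			}
		}
	}
	if promises <= len(acceptors)/2 {
		return "", false
	}
	accepts := 0
	for _, a := range acceptors {
		if a.Accept(n, v) {
			accepts++
		}
	}
	if accepts <= len(acceptors)/2 {
		return "", false
	}
	return v, true
}

func main() {
	accs := []*Acceptor{{}, {}, {}}
	fmt.Println(Propose(accs, 1, "x")) // first proposer gets its value chosen
	fmt.Println(Propose(accs, 2, "y")) // later proposer must adopt the chosen value
}
```

The second call shows the safety property at the heart of the protocol: once a value is chosen, a later proposer with a different preference still ends up proposing the chosen value.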
Case studies:
6. Distributed Transaction Processing
Topics:
- Concurrency control
- Recovery
- Distributed commit
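
The distributed commit topic is commonly introduced through two-phase commit, so a minimal Go sketch of a coordinator follows. It is an illustration only: the `Participant` interface and `memParticipant` type are hypothetical, and the sketch assumes no crashes, no timeouts, and no logging, which are exactly the parts the recovery topic covers.

```go
package main

import "fmt"

// Participant is a hypothetical resource manager taking part in a transaction.
type Participant interface {
	Prepare() bool // vote: true means "ready to commit"
	Commit()
	Abort()
}

// TwoPhaseCommit asks every participant to prepare, then commits only if
// all voted yes; otherwise it aborts everyone.
func TwoPhaseCommit(ps []Participant) bool {
	for _, p := range ps {
		if !p.Prepare() {
			for _, q := range ps {
				q.Abort()
			}
			return false
		}
	}
	for _, p := range ps {
		p.Commit()
	}
	return true
}

// memParticipant is an in-memory participant with a fixed vote.
type memParticipant struct {
	vote  bool
	state string
}

func (m *memParticipant) Prepare() bool {
	if m.vote {
		m.state = "prepared"
	}
	return m.vote
}
func (m *memParticipant) Commit() { m.state = "committed" }
func (m *memParticipant) Abort()  { m.state = "aborted" }

func main() {
	a, b := &memParticipant{vote: true}, &memParticipant{vote: false}
	fmt.Println(TwoPhaseCommit([]Participant{a, b}), a.state, b.state)
}
```

A single "no" vote aborts the whole transaction, which is what gives atomicity across participants. What the sketch leaves out is the hard part: a participant that has voted yes is blocked if the coordinator fails before announcing the outcome.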
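PART III System Quality Assurance
7. System Software Quality Assurance (Traditional Methods)
Topics:
- Rule-based
- Example-based
- Randomized methods
Reading:
- Papers
Key point: the main concerns in testing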
Random:
Why is random testing effective for partition tolerance bugs?
Key point: understand randomized methods
Fault injection:
Lineage-driven Fault Injection
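
A small Go sketch can show the flavor of randomized testing. It is not taken from the papers above: the example checks a hypothetical grow-only counter (`Replica`) whose states are merged by element-wise maximum, and uses a seeded random generator to deliver states in random order with random duplicates, asserting that the merged value is always the expected total.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Replica is a hypothetical grow-only counter: one slot per replica,
// merged by taking the element-wise maximum.
type Replica []int

// Merge folds another replica's state into r.
func (r Replica) Merge(other Replica) {
	for i, v := range other {
		if v > r[i] {
			r[i] = v
		}
	}
}

// Value is the counter total across all slots.
func (r Replica) Value() int {
	sum := 0
	for _, v := range r {
		sum += v
	}
	return sum
}

// CheckConvergence runs the given number of random trials and reports
// whether merging always produced the expected total, regardless of
// delivery order and duplicate deliveries.
func CheckConvergence(seed int64, rounds int) bool {
	rng := rand.New(rand.NewSource(seed))
	const n = 4
	for round := 0; round < rounds; round++ {
		states := make([]Replica, n)
		want := 0
		for i := range states {
			states[i] = make(Replica, n)
			states[i][i] = rng.Intn(100)
			want += states[i][i]
		}
		target := make(Replica, n)
		for _, i := range rng.Perm(n) { // random delivery order
			target.Merge(states[i])
			if rng.Intn(2) == 0 { // random duplicate delivery
				target.Merge(states[i])
			}
		}
		if target.Value() != want {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(CheckConvergence(1, 1000))
}
```

Fixing the seed keeps a failing trial reproducible, which is the main practical requirement for randomized tests.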
8. System Software Quality Assurance (Modern Methods)
Topics:
- Deterministic simulation
- Formal methods
Deterministic Simulation:
FoundationDB: A Distributed Key Value Store
Key point: focus on understanding simulation and determinism
Formal method:
Verdi: A Framework for Implementing and Formally Verifying Distributed Systems
How formal methods helped AWS to design amazing services
Model Checking Guided Testing for Distributed Systems
Key point: understand formal methods and model checking
- Other resources:
Testing a Single-Node, Single Threaded, Distributed System Written in 1985 By Will Wilson
Colin Scott, Fuzzing Raft for Fun and Publication
“Simulation Testing” by Michael Nygard
Golang Fuzz https://go.dev/doc/security/fuzz/
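
The core idea of deterministic simulation can be sketched as a toy in Go: every source of nondeterminism (here, message ordering and drops) is driven by one seeded random generator, so the same seed replays the same execution. This is a hypothetical illustration, not how FoundationDB or any listed tool implements it; `Simulate` is an invented name and the 1-in-10 drop rate is arbitrary.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Simulate runs a toy network in which a seeded generator decides the
// delivery order and which messages are dropped. It returns the trace of
// delivered message IDs.
func Simulate(seed int64, messages int) []int {
	rng := rand.New(rand.NewSource(seed))
	pending := make([]int, messages)
	for i := range pending {
		pending[i] = i
	}
	var trace []int
	for len(pending) > 0 {
		i := rng.Intn(len(pending)) // pick the next message to handle
		msg := pending[i]
		pending = append(pending[:i], pending[i+1:]...)
		if rng.Intn(10) == 0 {
			continue // simulated message loss
		}
		trace = append(trace, msg)
	}
	return trace
}

func main() {
	a, b := Simulate(42, 10), Simulate(42, 10)
	fmt.Println(fmt.Sprint(a) == fmt.Sprint(b)) // same seed, same trace: prints true
}
```

Because a run is a pure function of its seed, a failure found with one seed can be replayed exactly while debugging, which is the property the simulation readings emphasize.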
Reference books
Distributed Algorithms An Intuitive Approach
Distributed Systems Concepts and Design
Distributed Systems, An Algorithmic Approach