Distributed Data Management Systems

PART I Distributed System Concepts and Models

1. Introduction to Distributed Systems

Contents:

  1. What a distributed system is
  2. Why we need one
  3. Its problems and challenges

Reading:

Distributed Systems, Concepts and Design (Chapter 1)

Distributed Systems An Algorithmic Approach (Chapter 1)

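2. System Models (Abstract)

Contents:

  1. Message passing
    Synchronous
    Asynchronous

  2. Shared memory
    Synchronous
    Asynchronous

  3. Failures

  4. Abstract system model:
    state machines + a graph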
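
To make the "state machines + a graph" view concrete, here is a minimal Go sketch, not taken from any of the readings. Each node is a state machine whose only state is the largest value it has seen, the edges of the graph are its neighbor lists, and a FIFO queue stands in for the network. The names `Node`, `Message`, `Step`, and `Run` are hypothetical, and failures are not modeled.

```go
package main

import "fmt"

// Message is a value in transit to the node with index To.
type Message struct {
	To    int
	Value int
}

// Node is a state machine: its state is the largest value seen so far.
// Neighbors lists the indices of adjacent nodes in the graph.
type Node struct {
	Max       int
	Neighbors []int
}

// Step is the transition function. It consumes one input value, updates
// the state, and returns the messages to send. A value that does not
// raise Max causes no state change and no output.
func (n *Node) Step(v int) []Message {
	if v <= n.Max {
		return nil
	}
	n.Max = v
	out := make([]Message, 0, len(n.Neighbors))
	for _, nb := range n.Neighbors {
		out = append(out, Message{To: nb, Value: v})
	}
	return out
}

// Run delivers queued messages in FIFO order until none remain.
func Run(nodes []*Node, queue []Message) {
	for len(queue) > 0 {
		m := queue[0]
		queue = queue[1:]
		queue = append(queue, nodes[m.To].Step(m.Value)...)
	}
}

func main() {
	// A line graph: 0 - 1 - 2.
	nodes := []*Node{
		{Max: 3, Neighbors: []int{1}},
		{Max: 7, Neighbors: []int{0, 2}},
		{Max: 5, Neighbors: []int{1}},
	}
	// Initially every node announces its own value to its neighbors.
	var queue []Message
	for _, n := range nodes {
		for _, nb := range n.Neighbors {
			queue = append(queue, Message{To: nb, Value: n.Max})
		}
	}
	Run(nodes, queue)
	for i, n := range nodes {
		fmt.Println(i, n.Max) // every node ends with Max == 7
	}
}
```

The run terminates because a node only forwards a value that strictly increases its `Max`. Replacing the FIFO queue with a different delivery order is one way to explore asynchronous behavior in the same model.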

Reading:

I/O automata model: An Introduction to Input/Output Automata

Key point: the concept of I/O automata, and how to use them to describe algorithms and systems

Notes on Theory of Distributed Systems (Chapter 2, Chapter 16, Appendix J)

Distributed Systems, Concepts and Design (Chapters 2-7)

Distributed Systems An Algorithmic Approach (Chapters 2-5)

Distributed Algorithms (Chapter 2, Chapter 8)

Specifying Systems (Chapters 1-8)

Leslie Lamport’s The TLA+ Video Course

3. System Models (Implementation)

  1. Physical system model:
    threads, processes, events, network communication, client/server, messages/RPC

  2. Go programming

    [A tour of the Go programming language](https://go.dev/tour/welcome/1)
    
    [The Go Programming Language and Environment](https://dl.acm.org/doi/pdf/10.1145/3488716)
    

Patterns and Hints for Concurrency in Go (video)

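Reading:

Events vs. threads: Why Threads Are A Bad Idea (for most purposes)

Key point: understand event-based processing; concurrency is not the same thing as multithreading (or multiple processes)

Message passing vs. RPC: A Note on Distributed Computing

Key point: weigh the pros and cons of each, and make your own choice when designing a system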

Unix Network Programming

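PART II Distributed System Techniques

4. Partitioning and Replication

Contents:

  1. The problems that partitioning and replication solve
  2. Partitioning strategies: key-range partitioning and hash partitioning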
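
The two partitioning strategies named above can be sketched in a few lines of Go. This is an illustrative sketch only: the function names `hashPartition` and `rangePartition`, the choice of FNV-1a as the hash, and the example split points are all assumptions, not part of the course material.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashPartition maps a key to one of n partitions by hashing it with
// FNV-1a and taking the result modulo n. Nearby keys are scattered, which
// spreads load evenly but makes range scans touch every partition.
func hashPartition(key string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(n))
}

// rangePartition maps a key to a partition by key range. splits must be
// sorted. Partition 0 holds keys below splits[0], partition i holds keys
// in [splits[i-1], splits[i]), and the last partition holds the rest.
func rangePartition(key string, splits []string) int {
	return sort.Search(len(splits), func(i int) bool {
		return splits[i] > key
	})
}

func main() {
	splits := []string{"g", "n", "t"} // four partitions: <g, g..n, n..t, >=t
	for _, k := range []string{"apple", "monkey", "zebra"} {
		fmt.Println(k, rangePartition(k, splits), hashPartition(k, 4))
	}
}
```

With these split points, `rangePartition` places "apple" in partition 0, "monkey" in partition 1, and "zebra" in partition 3, so adjacent keys stay together. The hash variant gives up that locality in exchange for a more even spread.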

Reading:

Time, Clocks, and the Ordering of Events in a Distributed System

Key point: understand state machines
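
As a companion to the paper, the following Go sketch shows a logical clock in the style it describes: increment on every local event or send, and on receive jump past the timestamp carried by the message. The type and method names (`LamportClock`, `Tick`, `Observe`) are hypothetical, and the sketch is not safe for concurrent use.

```go
package main

import "fmt"

// LamportClock is a logical clock owned by a single process.
type LamportClock struct {
	time uint64
}

// Tick advances the clock for a local event or a send and returns the
// new timestamp, which a sender would attach to its message.
func (c *LamportClock) Tick() uint64 {
	c.time++
	return c.time
}

// Observe is called on receive. It moves the clock to
// max(local, received) + 1, so the receive event is ordered after the send.
func (c *LamportClock) Observe(received uint64) uint64 {
	if received > c.time {
		c.time = received
	}
	c.time++
	return c.time
}

func main() {
	var a, b LamportClock
	a.Tick()             // local event on a: clock 1
	ts := a.Tick()       // a sends a message stamped 2
	b.Tick()             // local event on b: clock 1
	got := b.Observe(ts) // b receives: max(1, 2) + 1 = 3
	fmt.Println(ts, got) // prints "2 3"
}
```

The timestamps give an ordering consistent with causality: if one event can influence another, its timestamp is smaller. Ties between unrelated events are typically broken by process ID to obtain a total order.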

Extension: HLC, Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases

Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial

Key point: replicated state machines

Harvest, Yield, and Scalable Tolerant Systems

Key point: understand CAP

Distributed Systems, Concepts and Design (Chapter 18)

Distributed Systems An Algorithmic Approach (Chapters 12-13)

5. Consensus Algorithms

Contents:

  1. The problem that consensus solves
  2. Paxos / Raft

Raft

Reading:

Paxos Made Simple

Key point: single-value Paxos

Paxos Made Moderately Complex

Key point: multi-value Paxos

Paxos vs Raft: Have we reached consensus on distributed consensus?

Key point: understand the differences between Paxos and Raft

Distributed consensus revised

In Search of an Understandable Consensus Algorithm

Key point: understand Raft: replication + election

Consensus: Bridging Theory and Practice (thesis, optional)

Key point: more Raft details

Analysis of Raft Consensus

Key point: understand Raft's correctness

https://raft.github.io/

Case studies:

Etcd Raft

Tencent PaxosStore

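6. Distributed Transaction Processing

Contents:

  1. Concurrency control
  2. Recovery
  3. Distributed commit

PART III System Quality Assurance

7. System Software Quality Assurance (Traditional Methods)

Contents:

  1. Rule-based
  2. Example-based
  3. Randomized methods

Reading: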

Testing:

Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems

Key point: what testing should focus on most

Random testing:

Why is random testing effective for partition tolerance bugs?

Key point: understand randomized methods

Fault injection:

Lineage-driven Fault Injection

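8. System Software Quality Assurance (Modern Methods)

Contents:

  1. Deterministic simulation
  2. Formal methods

Deterministic simulation:

FoundationDB: A Distributed Key Value Store

Key point: focus on understanding simulation and determinism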
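
The core idea of determinism in simulation can be shown with a toy example: if every source of nondeterminism (here, message reordering and drops) is drawn from one seeded random generator, the same seed always replays the same run. The sketch below is a simplified illustration under that assumption and does not reflect how FoundationDB implements its simulator. `Event` and `Simulate` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Event records what the simulated network did with one message.
type Event struct {
	Msg     int
	Dropped bool
}

// Simulate delivers msgs messages in a random order and drops each one
// with probability dropRate. All randomness comes from a generator seeded
// with seed, so a given seed always produces the same trace.
func Simulate(seed int64, msgs int, dropRate float64) []Event {
	r := rand.New(rand.NewSource(seed))
	trace := make([]Event, 0, msgs)
	for _, m := range r.Perm(msgs) {
		trace = append(trace, Event{Msg: m, Dropped: r.Float64() < dropRate})
	}
	return trace
}

func main() {
	first := Simulate(42, 5, 0.3)
	second := Simulate(42, 5, 0.3)
	same := len(first) == len(second)
	for i := range first {
		same = same && first[i] == second[i]
	}
	fmt.Println(first)
	fmt.Println("replay identical:", same) // prints "replay identical: true"
}
```

In practice this means a failing run can be reproduced from its seed alone, which is what makes randomized fault schedules debuggable. The code under test must also avoid hidden nondeterminism such as wall-clock time or unsynchronized goroutines.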

Formal methods:

Verdi: A Framework for Implementing and Formally Verifying Distributed Systems

How formal methods helped AWS to design amazing services

eXtreme Modelling in Practice

Model Checking Guided Testing for Distributed Systems

Key point: understand formal methods and model checking

Testing a Single-Node, Single Threaded, Distributed System Written in 1985 By Will Wilson

Colin Scott, Fuzzing Raft for Fun and Publication

Jepsen

“Simulation Testing” by Michael Nygard

Golang Fuzz https://go.dev/doc/security/fuzz/

Reference books

Distributed Algorithms An Intuitive Approach

Distributed Systems Concepts and Design

Distributed Systems, An Algorithmic Approach

Specifying Systems

Distributed Algorithms

Notes on Theory of Distributed Systems

Transactional Information Systems