Why design a distributed system

Three-tier: front end, app, DB.

  • easy and flexible
  • state of the app is the weak spot



  • build individual, complete chunks of the whole system
  • pretend you are not building a distributed system
  • a partition could cause two masters
  • client isolation is easy (GDPR)
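A minimal sketch of the client-isolation point, assuming each client (tenant) is pinned to exactly one complete chunk (shard) of the system. `ShardRouter` and the in-memory dicts standing in for databases are hypothetical names for illustration, not from any library.

```python
class ShardRouter:
    """Pins each tenant to one shard; plain dicts stand in for real databases."""

    def __init__(self, shard_count: int) -> None:
        self.shards: list[dict[str, dict[str, str]]] = [{} for _ in range(shard_count)]
        self.tenant_to_shard: dict[str, int] = {}

    def shard_for(self, tenant: str) -> int:
        # New tenants go to the shard with the fewest tenants; the mapping then stays fixed.
        if tenant not in self.tenant_to_shard:
            loads = [0] * len(self.shards)
            for idx in self.tenant_to_shard.values():
                loads[idx] += 1
            self.tenant_to_shard[tenant] = loads.index(min(loads))
        return self.tenant_to_shard[tenant]

    def put(self, tenant: str, key: str, value: str) -> None:
        shard = self.shards[self.shard_for(tenant)]
        shard.setdefault(tenant, {})[key] = value

    def get(self, tenant: str, key: str):
        idx = self.tenant_to_shard.get(tenant)
        if idx is None:
            return None
        return self.shards[idx].get(tenant, {}).get(key)

    def erase_tenant(self, tenant: str) -> None:
        # A GDPR-style erasure request touches exactly one shard.
        idx = self.tenant_to_shard.pop(tenant, None)
        if idx is not None:
            self.shards[idx].pop(tenant, None)


router = ShardRouter(shard_count=2)
router.put("acme", "email", "a@example.com")
router.put("globex", "email", "g@example.com")
router.erase_tenant("acme")
```

Because a tenant never spans shards here, erasing one tenant cannot touch another tenant's data. The flip side is the same as in the notes: one very large tenant makes one very large shard.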


  • complexity
  • no comprehensive view of the data (needs ETL), monitoring, or logging. You pretend you are not building a distributed system, but you are
  • oversized shards (bad things happen if a shard is too big)

Lambda: streaming or batch? Bounded vs unbounded data.

Assume: unbounded, immutable data.

  • long-term storage, batch, high latency
  • temp queuing for unbounded, low latency
  • write it to a scalable DB: base frame + delta frame
  • messaging: Kafka. No global ordering, only ordering within one partition.
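The ordering point can be shown with a small in-memory simulation. This is a sketch, not Kafka client code: `partition_for` uses CRC32 purely for illustration (an assumption, not Kafka's actual partitioner), and the per-partition lists stand in for partition logs.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash of the key, so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(events, num_partitions):
    """Append (key, value) events to per-partition logs in arrival order."""
    logs = defaultdict(list)
    for key, value in events:
        logs[partition_for(key, num_partitions)].append((key, value))
    return logs


def events_for(key, logs, num_partitions):
    """Read back one key's events from the single partition that holds them."""
    return [v for k, v in logs[partition_for(key, num_partitions)] if k == key]


events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "add-to-cart"),
    ("user-2", "logout"),
    ("user-1", "checkout"),
]
logs = produce(events, num_partitions=3)
user1_events = events_for("user-1", logs, 3)
user2_events = events_for("user-2", logs, 3)
```

Events for one key keep their relative order because they share a partition. Nothing relates the order of events across different partitions, so anything that needs a total order has to share a key (or a single partition).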

Cons: you write the thing twice, and it is only for analysis.
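A rough sketch of the batch-plus-speed idea, reading "base frame + delta frame" as a batch view plus a real-time delta merged at query time (that reading is an assumption). The event shape and the `count_by_page` and `query` names are made up for illustration.

```python
from collections import Counter


def count_by_page(events):
    """The aggregation itself. Here it is shared; in practice the batch and
    streaming frameworks differ, so this logic tends to be written twice."""
    return Counter(e["page"] for e in events)


# Batch layer: recomputed periodically over the full, immutable history.
history = [{"page": "/home"}, {"page": "/home"}, {"page": "/pricing"}]
batch_view = count_by_page(history)  # base frame

# Speed layer: only the events that arrived since the last batch run.
recent = [{"page": "/home"}, {"page": "/docs"}]
realtime_view = count_by_page(recent)  # delta frame


def query(page: str) -> int:
    # Serving layer: merge base + delta at read time.
    return batch_view[page] + realtime_view[page]
```

When the next batch run finishes, the delta it now covers is discarded and the speed layer starts over from that point.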

Streaming: event first vs data first

  • integration is a first-class concern
  • life is dynamic, DBs are static

  • storing data in messages? Mind the retention policy

SLA: how to measure the health of the services: availability, accuracy, TPS, latency.
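Two of these measures are simple ratios over a time window. A minimal sketch, assuming request-based availability (successful requests over total requests); time-based definitions are also common. `WindowStats` and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    total_requests: int
    failed_requests: int
    window_seconds: float


def availability(stats: WindowStats) -> float:
    # Fraction of requests that succeeded; an empty window counts as fully available.
    if stats.total_requests == 0:
        return 1.0
    return 1 - stats.failed_requests / stats.total_requests


def tps(stats: WindowStats) -> float:
    # Average throughput over the window, in requests per second.
    return stats.total_requests / stats.window_seconds


stats = WindowStats(total_requests=120_000, failed_requests=60, window_seconds=60.0)
window_availability = availability(stats)  # roughly 0.9995, i.e. 99.95%
window_tps = tps(stats)                    # 2000 requests per second
```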

Consistency: strong consistency, weak consistency and eventual consistency.

Data Durability: Some form of replication is usually used to increase durability

Message Persistence and Durability

Idempotency works for both sides.

Designing idempotent distributed systems requires some sort of distributed locking strategy.

It can be done either with distributed locks or with database constraints: https://www.bennadel.com/blog/3390-considering-strategies-for-idempotency-without-distributed-locking-with-ben-darfler.htm?ref=blog.pragmaticengineer.com
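A small sketch of the database-constraint route, using SQLite from the Python standard library. The `payments` table, the `idempotency_key` column, and the `charge` function are illustrative assumptions, not taken from the linked post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    " idempotency_key TEXT PRIMARY KEY,"
    " amount_cents INTEGER NOT NULL)"
)


def charge(conn: sqlite3.Connection, idempotency_key: str, amount_cents: int) -> bool:
    """Return True if the charge was applied, False if it was a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payments (idempotency_key, amount_cents) VALUES (?, ?)",
                (idempotency_key, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary-key constraint rejected a second insert with the same key.
        return False


first = charge(conn, "order-42", 1999)
retry = charge(conn, "order-42", 1999)  # e.g. a client retry after a timeout
row = conn.execute("SELECT COUNT(*), SUM(amount_cents) FROM payments").fetchone()
```

The database serializes the two inserts, so no separate lock service is needed for this case. The trade-off is that the caller has to supply a stable key for each logical operation.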

Sharding and quorum: sharding is an interesting area to learn more about, especially around resharding.
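One common way to make resharding cheaper is consistent hashing. This is a sketch of that technique under simple assumptions (SHA-256 as the hash, 100 virtual nodes per shard, made-up shard names), compared against plain modulo placement when a fifth shard is added.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes: int = 100) -> None:
        # Each node owns many points on the ring to even out the key distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


keys = [f"key-{i}" for i in range(1000)]
old_nodes = ["shard-0", "shard-1", "shard-2", "shard-3"]
new_nodes = old_nodes + ["shard-4"]

old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
moved_ring = sum(old_ring.node_for(k) != new_ring.node_for(k) for k in keys)
moved_modulo = sum(_hash(k) % 4 != _hash(k) % 5 for k in keys)
```

With the ring, every key that moves goes to the new shard and the rest stay put. With modulo placement most keys change shard, which is what makes naive resharding expensive.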

The Actor Model vs CSP.
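A rough side-by-side sketch using threads and queues from the standard library. In the actor half, messages are addressed to a named actor that owns its state and mailbox. In the CSP half, anonymous processes are wired together through channels. `queue.Queue` is only an approximation of a CSP channel (real CSP channels are typically synchronous), and all names here are illustrative.

```python
import queue
import threading


class CounterActor:
    """Actor: private state, one mailbox, messages handled one at a time."""

    def __init__(self) -> None:
        self._mailbox: queue.Queue = queue.Queue()
        self._count = 0
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, message) -> None:
        self._mailbox.put(message)

    def _run(self) -> None:
        while True:
            message = self._mailbox.get()
            if message == "stop":
                break
            if message == "inc":
                self._count += 1
            elif isinstance(message, tuple) and message[0] == "get":
                message[1].put(self._count)  # reply on the sender's queue

    def join(self) -> None:
        self._thread.join()


def producer(out_ch: queue.Queue, n: int) -> None:
    # CSP process: knows only its output channel, not who reads it.
    for i in range(n):
        out_ch.put(i)
    out_ch.put(None)  # sentinel marks the end of the stream


def doubler(in_ch: queue.Queue, out_ch: queue.Queue) -> None:
    # CSP process: reads one channel, writes another.
    while (item := in_ch.get()) is not None:
        out_ch.put(item * 2)
    out_ch.put(None)


# Actor usage: address the actor directly.
actor = CounterActor()
for _ in range(3):
    actor.send("inc")
reply: queue.Queue = queue.Queue()
actor.send(("get", reply))
actor_count = reply.get(timeout=5)
actor.send("stop")
actor.join()

# CSP usage: compose processes by sharing channels.
numbers: queue.Queue = queue.Queue(maxsize=1)
doubled: queue.Queue = queue.Queue()
workers = [
    threading.Thread(target=producer, args=(numbers, 3)),
    threading.Thread(target=doubler, args=(numbers, doubled)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()
csp_result = []
while (item := doubled.get()) is not None:
    csp_result.append(item)
```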

Operating a Large, Distributed System in a Reliable Way


Infra-level monitoring: is this backend service healthy? Traffic, errors, latency (p99). Business metrics monitoring.
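For the p99 latency mentioned above, a minimal sketch using the nearest-rank definition of a percentile (one of several definitions in use; monitoring systems often interpolate or use histograms instead). The sample latencies are made up.

```python
import math


def percentile(samples, p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it. `p` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 98 fast requests and two slow ones.
latencies_ms = [12.0] * 98 + [250.0, 900.0]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
```

Here the median and the mean both look healthy while p99 exposes the slow tail, which is the usual reason to alert on a high percentile rather than on the average.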

Oncall, Anomaly Detection & Alerting

Outages & Incident Management Processes

Runbooks. Communicating outages across the organization. Mitigation now, investigation tomorrow.

Postmortems, Incident Reviews & a Culture of Ongoing Improvements: postmortems, incident reviews.

Failover Drills, Capacity Planning & Blackbox Testing: data center failovers are drills.

SLOs, SLAs & Reporting on Them
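Reporting on an SLO usually comes down to error-budget arithmetic. A short sketch, assuming a time-based availability SLO over a rolling window; the function names and the 99.9% / 30-day figures are example values only.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes


def budget_remaining(slo: float, window_days: int, downtime_minutes: float) -> float:
    # Negative means the SLO has been missed for this window.
    return error_budget_minutes(slo, window_days) - downtime_minutes


budget = error_budget_minutes(0.999, 30)       # about 43.2 minutes
remaining = budget_remaining(0.999, 30, 30.0)  # about 13.2 minutes left
```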

SRE as an Independent Team

Reliability as an Ongoing Investment

Written on August 26, 2023