Identify the use cases, requirements, constraints, scope
- ask a lot of questions
- current requirements and requirements in the near future
- talk about near-future requirements and design for them
- who is the target audience? how will it be used?
- which features would be Pri-0 versus Pri-1?
- what are key and high traffic operations?
- how accurate does it need to be?
- max requests/day, QPS, storage size, latency, availability
- high level design and architecture
- main use cases/specific APIs for each component
Try multiple approaches
- high level: API, application layer, data layer then drill down
- sketch up the major components and their relationships
- draw a simple diagram
Identify bottlenecks/limitations and how to tackle them
Trade-offs
Performance, Consistency, Availability, Safety
Compare with other approaches
- performance characteristics of proposed architecture
Relax Temporal Constraints
- Alleviate temporal constraints in your system whenever possible
Keys to improve performance
- cache, partition, replicate, index
Differentiated service classes
- Different services/data need different guarantees/SLA
Separation of concerns/things
- separate user-facing components from non user-facing components
- critical vs non-critical services
- different db/apps into different servers
- differentiation of different components
Separate frontend and backend
- always writable in frontend
- backend can write to queue, and asynchronously do business logic
Functional partition services
- understand who are priority customers
Services
- coarse granularity of services
- establish clear relationships between the service
- isolate problems, scale independently
- solve problems independently
Restful Service
- cache friendly
- good for public api
- put: idempotent
Use standard timestamps - UTC only
Use downcased, dash-separated path names; downcased, underscore-separated attribute names
- actions/:action
Avoid auto-increment ids; use UUIDs
Use Plural Noun
put the response entity into a namespace (data), so it is easy to extend/add counts, links
flexible - client can choose which fields/sub resources to return
pagination - Define a Maximum
Don’t use integers as timestamps
Etag + If-None-Match conditional GET
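The ETag + If-None-Match flow above can be sketched as a toy handler (not tied to any framework; names are illustrative):

```python
import hashlib

def handle_get(body, if_none_match=None):
    """Handle a (conditional) GET: return (status, headers, payload)."""
    etag = '"%s"' % hashlib.md5(body).hexdigest()  # strong ETag from content hash
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""            # client copy is still fresh
    return 200, {"ETag": etag}, body

# First request pays full cost; revalidation returns an empty 304.
status1, headers, payload1 = handle_get(b'{"id": 1}')
status2, _, payload2 = handle_get(b'{"id": 1}', headers["ETag"])
```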
RPC
- Protobuf, Thrift, Apache Avro
- Efficiency
- But hard to debug and not flexible
Services features
- read-heavy (improve latency), write-heavy (save space, async, queue), or both
- user facing or not?
- delay acceptable, (write feeds)
- QPS
- SLA, performance requirement
Partitioning
- each node may be the leader for some partitions, and a follower for other partitions.
partitioning by hash of key
- lose the ability to do efficient range queries: a range query has to be sent to all partitions
- (user_id, update_timestamp); (sensor, update_timestamp)
- can perform an efficient range query if specify fixed value for the first key
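The compound-key idea above in a minimal in-memory sketch (illustrative only): hashing only the first key component keeps one user's rows on one partition, so a range query with a fixed first key touches a single node.

```python
import hashlib
from collections import defaultdict

N_PARTITIONS = 8
partitions = defaultdict(list)   # partition id -> rows of (user_id, ts, value)

def partition_for(first_key):
    # Hash only the first key component; timestamps stay sortable per partition.
    return int(hashlib.md5(first_key.encode()).hexdigest(), 16) % N_PARTITIONS

def write(user_id, ts, value):
    partitions[partition_for(user_id)].append((user_id, ts, value))

def range_query(user_id, ts_from, ts_to):
    # Efficient: the fixed first key routes us to exactly one partition.
    rows = partitions[partition_for(user_id)]
    return sorted(r for r in rows if r[0] == user_id and ts_from <= r[1] <= ts_to)

write("alice", 10, "a"); write("alice", 20, "b"); write("bob", 15, "c")
```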
CAP theorem
Consistency - all of the nodes see the same data at the same time.
Availability - any available node can serve client requests even when other nodes fail.
Partition tolerance
- system can operate even in the face of network failures where communication between nodes is impossible.
- happens when some nodes in the system cannot reach other nodes, but all can be reached by clients.
CP - HBase, Bigtable, MongoDB, Redis
AP - DynamoDB, Cassandra, Voldemort, Riak, SimpleDB
CA - RDBMSs, ZooKeeper
Consistency - link
Eventual consistency
Causal consistency
Monotonic read consistency - desired
- if a process has seen a particular value for the object, any subsequent accesses will never return any previous values.
- always read to same replica
Monotonic write consistency
- guarantee to serialize the writes by the same process
RYW (Read-Your-Writes) consistency - desired
- read it from the leader
- read user own data from leader, read others data from replicate
- or always read/write to same replica
- or always read/write to same replica
Session consistency
- guarantee read-your-writes consistency in same session
- vector clock + RWN
Use write version - to implement optimistic lock, or consistency
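The write-version idea as an optimistic lock, in toy in-memory form (compare-and-swap on the version; names are illustrative):

```python
class VersionedStore:
    """Toy optimistic locking: each key carries a write version, and an
    update succeeds only if the caller read the latest version."""
    def __init__(self):
        self._rows = {}                 # key -> (version, value)

    def read(self, key):
        return self._rows.get(key, (0, None))

    def write(self, key, value, expected_version):
        current, _ = self._rows.get(key, (0, None))
        if current != expected_version:
            return False                # lost the race: re-read and retry
        self._rows[key] = (current + 1, value)
        return True

store = VersionedStore()
version, _ = store.read("cart:42")
first_ok = store.write("cart:42", ["book"], version)   # first writer wins
stale_ok = store.write("cart:42", ["pen"], version)    # same stale version loses
```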
Stickiness of clients
R+W>N
- N: number of replicas
- W: number of replicas that need to acknowledge update
- R: number of replicas that are read
- Normally: N=3, W=2, R=2
- Optimize for read: R=1, W=N
- Optimize for write: W=1, R=N
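The R+W>N rule in a toy simulation (not any real store's API): with N=3, W=2, R=2, every read quorum overlaps every write quorum, so a read always sees the latest acknowledged write.

```python
import random

def write(replicas, w, key, version, value):
    # Any W replicas acknowledge the write; the rest may lag behind.
    for i in random.sample(range(len(replicas)), w):
        replicas[i][key] = (version, value)

def read(replicas, r, key):
    # Query any R replicas and keep the highest-versioned answer.
    answers = [rep.get(key, (0, None)) for rep in random.sample(replicas, r)]
    return max(answers)[1]

replicas = [{} for _ in range(3)]              # N = 3
write(replicas, 2, "k", 1, "v1")               # W = 2
value = read(replicas, 2, "k")                 # R = 2; R + W > N -> sees "v1"
```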
BASE(Basically Available, Soft State, and Eventually Consistent)
Replication
- Scale reads not writes
- Statement-based
- Write-ahead log (WAL)
- Logical log - MySQL binlog
Master-Master - lose single-source-of-truth
Single Master/Multiple Slaves
Fine-Grained Dynamic Partitioning
Speeds Failure Recovery
- Many machines each recover one or a few partitions
Selective Replication
- find heavily used items and make more replicas
- can be static or dynamic
Sharding
Redundancy
- Kick off the same computation more than once and return the one that finishes fastest.
Design for Fault Tolerance and Graceful Failure
Graceful Degradation - reduced functionality
- degradable service
- better to give users limited functionality than an error page
- still do something reasonable even if not all is right
- read only from local cache, only from distributed cache, only return fixed data
- rate limit, throttles
Fail as Early/fast as Possible - circuit breaker
- Use multiple vendors for same service
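The fail-fast idea above can be sketched as a minimal circuit breaker (thresholds are illustrative):

```python
import time

class CircuitBreaker:
    """Fail fast: after `threshold` straight failures, reject calls for
    `cooldown` seconds instead of queueing up on a dying dependency."""
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # cooldown over -> half-open, try one call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # any success closes the circuit again
        return result
```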
Throttling
- rate-limiting, remove slow servers
- Always put a limit on everything
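"Always put a limit on everything" in its simplest form: a token-bucket rate limiter (parameters are illustrative).

```python
import time

class TokenBucket:
    """Allow at most `rate` requests/sec with bursts up to `capacity`;
    requests beyond the limit are rejected rather than queued."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(12)]  # burst of 12: only 10 pass
```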
How to reduce latency - link
Differentiated service classes
- prioritized request queues in servers
Manage expensive background activities
- defer until load is lower
remove capacity under load to improve latency
Use right image/video types
Decrease processing time
- Collocation(locality)
- Caching, Pooling
- Parallelization
- Partitioning
Cross request adaptation - balance load across set of servers
Within request adaptation
- Canary Requests
- Backup Requests (w/ Cross-Server Cancellation)
– send request to first replica
– wait 2 ms, and send to second replica
– servers cancel request on other replica when starting read
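The backup-request steps above, sketched with threads (simulated replicas and latencies; a real client would abort the in-flight RPC, not just check a flag):

```python
import concurrent.futures as cf
import threading
import time

def fetch(replica, cancel):
    # Simulated replica read; a real server would stop work when `cancel`
    # fires (cross-server cancellation).
    time.sleep(replica["latency"])
    if cancel.is_set():
        return None                   # another replica already answered
    cancel.set()                      # tell the peer replica to stop
    return replica["data"]

def hedged_read(primary, backup, hedge_after=0.02):
    cancel = threading.Event()
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fetch, primary, cancel)]
        time.sleep(hedge_after)       # give the primary a short head start
        if not cancel.is_set():       # still no answer -> hedge to the backup
            futures.append(pool.submit(fetch, backup, cancel))
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return next(f.result() for f in done if f.result() is not None)

slow_primary = {"latency": 0.20, "data": "from-primary"}
fast_backup = {"latency": 0.01, "data": "from-backup"}
```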
Tainted Partial Results
- mark such results as tainted in caches
Use Timeouts pattern
- by default Jersey uses HttpURLConnection, whose socket timeout is 0 ms, meaning the timeout is infinite
- tryLock, lock with timeout
Use Connection pooling
How to handle failures
Avoid single point of failure
- Strive for active/active
Avoid hotspot
- know your limit, and do throttle
- reject new request
- clear request queue
Lazy purge vs. active purge
Approximate correctness
- do a transaction every once in awhile instead of every time
Laziness
- Don't do work unless it's really needed
No computation at all
Pre-computation and cache
DNS
Route 53 - latency-based routing
geoDNS - based on location
Browser
- Use multiple domains
Load Balancer
- SSL termination/offloading (AWS ELB)
- SSL farm
- scale LB using DNS or layered LBs
Cache aggressively
- static files, not-change-often API response
Layered Cache
Application Cache
- In-Process Caching (Ehcache)
- Distributed Caching(redis)
- Vector/row/fragment/page cache
- spring cache sync=true
Reverse proxy: Nginx
Browser cache: cache header (filter, or httpd config)
- Prevent cache stampede (avalanche)
- separate global and unique to the person
Cache Warming - gearman
- proactively
- TTL(timeout-based) Caching
Write-through cache - link
- directs write I/O onto cache and through to underlying permanent storage
- write is slower, good for apps that writes and then re-reads data frequently
Write-around cache
- write I/O is written directly to permanent storage, bypassing the cache
Write-back cache
- write I/O is directed to cache, backed up to permanent storage only at specified intervals or under certain conditions
- good for write-intensive, non-critical app, may lose data if cache dies
- time shifting, rate limiting, conflation
Cache Aside
- read from cache, if missed, read from db then put it into cache
- update to db, and invalidate cache
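The cache-aside steps above in a few lines (dicts stand in for Redis and the db; TTL is illustrative):

```python
import time

db = {"user:1": {"name": "Ada"}}      # stand-in for the database
cache = {}                            # stand-in for redis: key -> (expires_at, value)
TTL = 60.0

def get_user(key):
    entry = cache.get(key)
    if entry and entry[0] > time.monotonic():
        return entry[1]                              # cache hit
    value = db.get(key)                              # miss -> read from db
    cache[key] = (time.monotonic() + TTL, value)     # then populate the cache
    return value

def update_user(key, value):
    db[key] = value
    cache.pop(key, None)   # invalidate rather than update: fewer stale races
```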
Queue - kafka, redis, MQ
Publish/Subscribe - Kafka, MQ
Asynchrony - asynchronous processing
- change "order placed successfully" to "order received successfully"
- queue, fire and forget calls
- avoid blocking calls, reduce contention
Retry
Feature flags
- Queue-based Load Leveling Pattern
- Concurrent request converted to sequential request
- Evening out traffic spikes, overdraft protection
- Isolating failures and self-healing
- Enabling asynchronous processing
- Decoupling
Partial message ordering
exactly-once
- one approach: at-least-once + idempotent
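The at-least-once + idempotent approach in sketch form: deduplicate on a message id before applying the side effect (in production the seen-set must be persistent, e.g. a db table keyed by message id):

```python
processed_ids = set()   # in production: persistent store keyed by message id

def handle(message):
    """At-least-once delivery redelivers messages; skipping ids we have
    already applied makes the pipeline behave like exactly-once."""
    if message["id"] in processed_ids:
        return "duplicate-ignored"
    processed_ids.add(message["id"])
    # ... the real side effect goes here (charge a card, write a row, ...)
    return "processed"

first = handle({"id": "msg-7", "body": "charge $5"})
second = handle({"id": "msg-7", "body": "charge $5"})   # broker redelivery
```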
Multiple Queue
- different priorities, put same host into same queue - to be polite
Priority Queue pattern
- Use redis
-- or multiple queue with different priorities
Batch operations
Statelessness
- Share Nothing
Make API Retryable, Idempotent and Killable
- selective retry - only for transient errors that may succeed on a subsequent attempt
- don't retry permanent(unrecoverable) errors
- limit the number of retries and time spent
- use exponential backoff with random jitter
- handle overload-related errors differently from other errors
RetryTemplate, FixedBackOffPolicy, setMaxAttempts, HttpClient
Only retry in outermost service
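The retry rules above in one sketch: selective retry of transient errors, a capped attempt budget, and exponential backoff with full jitter (the error classes and parameters are illustrative):

```python
import random
import time

TRANSIENT = (TimeoutError, ConnectionError)   # retry only errors that may pass later

def call_with_retry(fn, max_attempts=4, base=0.05, cap=1.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except TRANSIENT:
            if attempt == max_attempts - 1:
                raise                         # retry budget exhausted
            # full jitter: random sleep up to an exponentially growing bound
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
        # anything else (permanent/unrecoverable errors) propagates immediately

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("transient glitch")
    return "ok"
```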
Parallelization
Accurate vs approximate
Availability and Reliability
- how things can fail
Learn and work on an actual system
- focus on the architecture and the tradeoffs behind each decision
- do back-of-the-envelope calculations for something you’re building
Feature flags
- control flag to upstream(in nginx)
Design to Clone Things (X Axis)
- clone services + load balancers
Design to Split Different Things (Y Axis)
- Separate sets of data, can scale them differently.
- Separation of distinct and different functions and data
Design to Split Similar Things (Z Axis): sharding and podding
Design for Rollback
Design to Be Disabled
Design to Be Monitored
- Nagios, Ganglia, Graphite, statsd
Session
Memory-based session state with Load balancer affinity(AWS ELB)
- Cookie learning vs Cookie injection
Persist session state to a DB
Data compression - gzip header
Polyglot MicroServices
- Single-purpose
- Modular and independent
- Don't share a persistence layer, each service has its own database
- Example: Cassandra, Solr
Categorize users
- less frequency ops for inactive users
- return cached content(ads) for non-logged in user
Explore locality - partition
- temporal, location
Data Store
- Choose the right database for the right service
- Know what queries are needed currently or in near-future
Read to write ratio
- read/write heavy or both
- how frequently we write/read
- use different db for read/write
How much data
- it may be too large/costly to keep all in memory
- is it time series?
- row size
- how long to keep the data, policy to delete/archive old data
Hot/Cold Data
- Hot data to memory, cold data to ssd or disk(Glacier)
- Store them in different system
Separate index and content
- index for search, no need to store everything (solr).
- store content in KV db
Separate read/write
- Write to master, read from replica
- First write to queue, then sync to master
- Use different data stores
Store raw data to S3 or hdfs
Store Data that matters to datastore
Index
- index everything want to query on, only query on indexes
Denormalize the data
Split up large tables
SQL vs NOSQL
Key-Value - Riak, DynamoDB
Document-oriented - Couchbase, MongoDB
Column oriented
- store data by columns
- query by key or by key range
- BigTable, HBase and Cassandra
MySQL
- more read than writes
master/slave
- writes go to master
- Related or not
- Need ACID or not
Key-value
Column-oriented
Document-oriented
Use multiple data centers
Break incompatible schema change into multiple database and application deployments.
Optimistic vs Pessimistic lock
- separate big fields into different table: post/tweet body
- read by blocks, smaller rows => read more rows
- avoid joins, full table scan
PoPs servers - link
- maintain a set of ready-to-use, opened TCP connections to the main cluster service
- serve nearby user requests and forward them to the main cluster service using the ready-to-use, established connections
GeoHash - link
- divide the world into a grid of 32 cells
- each cell can be further divided into another 32 cells, which can be divided into another 32 cells, and so on.
- the longer, the more accurate
- the longer the shared prefix, the closer they are
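The subdivision scheme above in a compact encoder (5 bits = one base32 character = one 32-cell subdivision; bits alternate longitude/latitude, longitude first):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, length=6):
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    code, ch, even = "", 0, True              # even bit -> longitude
    for i in range(1, length * 5 + 1):
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:                        # 1-bit -> keep upper half-cell
            ch, rng[0] = ch * 2 + 1, mid
        else:                                 # 0-bit -> keep lower half-cell
            ch, rng[1] = ch * 2, mid
        even = not even
        if i % 5 == 0:                        # every 5 bits -> one of 32 cells
            code += BASE32[ch]
            ch = 0
    return code

# Longer hash = smaller cell; nearby points share a prefix.
a = geohash(57.64911, 10.40744)
b = geohash(57.64900, 10.40700)
```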
Prefer sequential writes over random writes
Gated Launch
Jitter - Add some randomness
- avoid a thundering herd
- Add Entropy Back into Your System
Security - link
fail securely
least privilege
never trust any input!
Deployments/Zero downtime
- rolling deployments
Autoscale
- not instantaneous
- can scale up or down
Valet Key Pattern - supported by S3
Command and Query Responsibility Segregation (CQRS) Pattern
Materialized View Pattern
Lambda Architecture - link
- all data is dispatched to both the batch layer and the speed layer
- batch layer: managing the master dataset (an immutable, append-only set of raw data), to pre-compute the batch views.
- serving layer indexes the batch views
- merging results from batch views and real-time views.
Option 1. Hadoop + Spark Streaming + Memory Storage + Columnar Storage
Ecommerce
SPU - Standard Product Unit
SKU - Stock Keeping Unit
MVP - minimum viable product
Robustness Principle
Be conservative in what you send, be liberal in what you accept
Reactive programming
- data flows and the propagation of change
- when x changes in one location, things depend on the value of x are recalculated and updated in various other locations
CPU Bound Application
- caching precomputed values
- performing the computation in a separate background job
Most apps are IO Bound - db/web service calls
Metrics, Numbers, Estimation
Apache Httpd Server
- one connection, one process
- max concurrent connections: ~4000
- yaws
HAProxy, nginx
- 20K concurrent connections
Web application server
- few thousand concurrent requests
Redis
- 6GB memory - 50K or 140K RPS (different bandwidth)
- 50GB memory - 150K or 250K RPS with Redis clustering; throughput increases linearly as you increase the number of shards
Cassandra
- mean read latency: 13 ms
- write latency: 25 ms
Kafka
- 200k messages/sec for writes and 3M messages/sec for reads
- Each broker can handle terabytes of messages
Send 1MB over 1Gbps network: 10ms
Read 1MB sequentially from memory: 0.25ms
Round trip within data center: 0.5ms
Disk seek: 8~10ms
Read 1MB sequentially from disk: 20~25ms
Back-of-the-envelope Calculation
2^10 ~~ 1 KB
2^20 ~~ 1 MB
2^30 bytes ~~ 1 GB
2^40 ~~ 1 TB
million -> 10^6
billion -> 10^9 ~ 2^30 bytes ~ 1GB
trillion -> 10^12 bytes ~ 1 TB
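A worked back-of-the-envelope example using the conversions above (every input is a made-up assumption):

```python
# Rough sizing for a hypothetical service: all inputs below are assumptions.
requests_per_day = 100_000_000                 # 100M requests/day
avg_qps = requests_per_day / 86_400            # seconds per day -> ~1.2K QPS
peak_qps = avg_qps * 3                         # assume peak = 3x average

write_ratio = 0.1                              # assume 10% of requests write
row_bytes = 500                                # assumed average record size
storage_per_year = requests_per_day * write_ratio * row_bytes * 365

print(f"avg {avg_qps:,.0f} QPS, peak {peak_qps:,.0f} QPS")
print(f"~{storage_per_year / 2**40:.1f} TB of new data per year")
```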
Terms
QPS - Queries per second
SLA - Service-level agreement
Resources