Identify the use cases, requirements, constraints, scope
- ask a lot of questions
- current requirements and requirements in the near future
- talk about near-future requirements and design for them
- who is the target audience? how will it be used?
- which features would be Pri-0 versus Pri-1?
- what are key and high traffic operations?
- how accurate does it need to be?
- max requests/day, QPS, storage size, latency, availability
- high level design and architecture
- main use cases/specific APIs for each component
Try multiple approaches
- high level: API, application layer, data layer then drill down
- sketch up the major components and their relationships
- draw a simple diagram
Identify bottlenecks/limitations and how to tackle them
Trade-offs
Performance, Consistency, Availability, Safety
Compare with other approaches
- performance characteristics of proposed architecture
Relax Temporal Constraints
- Alleviate temporal constraints in your system whenever possible
Keys to improve performance
- cache, partition, replicate, index
Differentiated service classes
- Different services/data need different guarantees/SLA
Separation of concerns/things
- separate user-facing components from non user-facing components
- critical vs non-critical services
- different db/apps into different servers
- differentiation of different components
Separate frontend and backend
- always writable in frontend
- backend can write to queue, and asynchronously do business logic
Functional partition services
- understand who are priority customers
Services
- coarse granularity of services
- establish clear relationships between the service
- isolate problems, scale independently
- solve problems independently
Restful Service
- cache friendly
- good for public api
- put: idempotent
Use standard timestamps - UTC only
Use downcased, dash-separated path names; downcased, underscore-separated attribute names
- actions/:action
Avoid auto-increment ids; use UUIDs
Use Plural Noun
put the response entity into a namespace (data), so it is easy to extend/add counts, links
flexible - client can choose which fields/sub resources to return
pagination - Define a Maximum
Don’t use integers as timestamps
Etag + If-None-Match conditional GET
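The ETag + If-None-Match flow above can be sketched as a toy handler (not tied to any framework; names are illustrative):

```python
import hashlib

def handle_get(body, if_none_match=None):
    """Handle a (conditional) GET: return (status, headers, payload)."""
    etag = '"%s"' % hashlib.md5(body).hexdigest()  # strong ETag from content hash
    if if_none_match == etag:
        return 304, {"ETag": etag}, b""            # client copy is still fresh
    return 200, {"ETag": etag}, body

# First request pays full cost; revalidation returns an empty 304.
status1, headers, payload1 = handle_get(b'{"id": 1}')
status2, _, payload2 = handle_get(b'{"id": 1}', headers["ETag"])
```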
RPC
- Protobuf, Thrift, Apache Avro
- Efficiency
- But hard to debug and not flexible
Services features
- read-heavy (improve latency), write-heavy (save space, async, queue), or both
- user facing or not?
- delay acceptable, (write feeds)
- QPS
- SLA, performance requirement
Partitioning
- each node may be the leader for some partitions, and a follower for other partitions.
partitioning by hash of key
- lose the ability to do efficient range queries: a range query has to be sent to all partitions
- (user_id, update_timestamp); (sensor, update_timestamp)
- can perform an efficient range query if specify fixed value for the first key
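The compound-key idea above in a minimal in-memory sketch (illustrative only): hashing only the first key component keeps one user's rows on one partition, so a range query with a fixed first key touches a single node.

```python
import hashlib
from collections import defaultdict

N_PARTITIONS = 8
partitions = defaultdict(list)   # partition id -> rows of (user_id, ts, value)

def partition_for(first_key):
    # Hash only the first key component; timestamps stay sortable per partition.
    return int(hashlib.md5(first_key.encode()).hexdigest(), 16) % N_PARTITIONS

def write(user_id, ts, value):
    partitions[partition_for(user_id)].append((user_id, ts, value))

def range_query(user_id, ts_from, ts_to):
    # Efficient: the fixed first key routes us to exactly one partition.
    rows = partitions[partition_for(user_id)]
    return sorted(r for r in rows if r[0] == user_id and ts_from <= r[1] <= ts_to)

write("alice", 10, "a"); write("alice", 20, "b"); write("bob", 15, "c")
```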
CAP theorem
Consistency - all of the nodes see the same data at the same time.
Availability - any available node can serve client requests even when other nodes fail.
Partition tolerance
- system can operate even in the face of network failures where communication between nodes is impossible.
- happens when some nodes in the system cannot reach other nodes, but all can be reached by clients.
CP - HBase, Bigtable, MongoDB, Redis
AP - DynamoDB, Cassandra, Voldemort, Riak, SimpleDB
CA - RDBMSs, ZooKeeper
Consistency - link
Eventual consistency
Causal consistency
Monotonic read consistency - desired
- if a process has seen a particular value for the object, any subsequent accesses will never return any previous values.
- always read to same replica
Monotonic write consistency
- guarantee to serialize the writes by the same process
RYW (Read-Your-Writes) consistency - desired
- read it from the leader
- read user own data from leader, read others data from replicate
- or always read/write to same replica
- or always read/write to same replica
Session consistency
- guarantee read-your-writes consistency in same session
- vector clock + RWN
Use write version - to implement optimistic lock, or consistency
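The write-version idea as an optimistic lock, in toy in-memory form (compare-and-swap on the version; names are illustrative):

```python
class VersionedStore:
    """Toy optimistic locking: each key carries a write version, and an
    update succeeds only if the caller read the latest version."""
    def __init__(self):
        self._rows = {}                 # key -> (version, value)

    def read(self, key):
        return self._rows.get(key, (0, None))

    def write(self, key, value, expected_version):
        current, _ = self._rows.get(key, (0, None))
        if current != expected_version:
            return False                # lost the race: re-read and retry
        self._rows[key] = (current + 1, value)
        return True

store = VersionedStore()
version, _ = store.read("cart:42")
first_ok = store.write("cart:42", ["book"], version)   # first writer wins
stale_ok = store.write("cart:42", ["pen"], version)    # same stale version loses
```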
Stickiness of clients
R+W>N
- N: number of replicas
- W: number of replicas that need to acknowledge update
- R: number of replicas that are read
- Normally: N=3, W=2, R=2
- Optimize for read: R=1, W=N
- Optimize for write: W=1, R=N
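The R+W>N rule in a toy simulation (not any real store's API): with N=3, W=2, R=2, every read quorum overlaps every write quorum, so a read always sees the latest acknowledged write.

```python
import random

def write(replicas, w, key, version, value):
    # Any W replicas acknowledge the write; the rest may lag behind.
    for i in random.sample(range(len(replicas)), w):
        replicas[i][key] = (version, value)

def read(replicas, r, key):
    # Query any R replicas and keep the highest-versioned answer.
    answers = [rep.get(key, (0, None)) for rep in random.sample(replicas, r)]
    return max(answers)[1]

replicas = [{} for _ in range(3)]              # N = 3
write(replicas, 2, "k", 1, "v1")               # W = 2
value = read(replicas, 2, "k")                 # R = 2; R + W > N -> sees "v1"
```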
BASE(Basically Available, Soft State, and Eventually Consistent)
Replication
- Scale reads not writes
- Statement-based
- Write-ahead log (WAL)
- Logical log - MySQL binlog
Master-Master - lose single-source-of-truth
Single Master/Multiple Slaves
Fine-Grained Dynamic Partitioning
Speeds Failure Recovery
- Many machines each recover one or a few partitions
Selective Replication
- find heavily used items and make more replicas
- can be static or dynamic
Sharding
Redundancy
- Kick off the same computation more than once and return the one that finishes fastest.
Design for Fault Tolerance and Graceful Failure
Graceful Degradation - reduced functionality
- degradable service
- better to give users limited functionality than an error page
- still do something reasonable even if not all is right
- read only from local cache, only from distributed cache, only return fixed data
- rate limit, throttles
Fail as Early/fast as Possible - circuit breaker
- Use multiple vendors for same service
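The fail-fast idea above can be sketched as a minimal circuit breaker (thresholds are illustrative):

```python
import time

class CircuitBreaker:
    """Fail fast: after `threshold` straight failures, reject calls for
    `cooldown` seconds instead of queueing up on a dying dependency."""
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None      # cooldown over -> half-open, try one call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0              # any success closes the circuit again
        return result
```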
Throttling
- rate-limiting, remove slow servers
- Always put a limit on everything
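"Always put a limit on everything" in its simplest form: a token-bucket rate limiter (parameters are illustrative).

```python
import time

class TokenBucket:
    """Allow at most `rate` requests/sec with bursts up to `capacity`;
    requests beyond the limit are rejected rather than queued."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(12)]  # burst of 12: only 10 pass
```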
How to reduce latency - link
Differentiated service classes
- prioritized request queues in servers
Manage expensive background activities
- defer until load is lower
remove capacity under load to improve latency
Use right image/video types
Decrease processing time
- Collocation(locality)
- Caching, Pooling
- Parallelization
- Partitioning
Cross request adaptation - balance load across set of servers
Within request adaptation
- Canary Requests
- Backup Requests (w/ Cross-Server Cancellation)
– send request to first replica
– wait 2 ms, and send to second replica
– servers cancel request on other replica when starting read
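The backup-request steps above, sketched with threads (simulated replicas and latencies; a real client would abort the in-flight RPC, not just check a flag):

```python
import concurrent.futures as cf
import threading
import time

def fetch(replica, cancel):
    # Simulated replica read; a real server would stop work when `cancel`
    # fires (cross-server cancellation).
    time.sleep(replica["latency"])
    if cancel.is_set():
        return None                   # another replica already answered
    cancel.set()                      # tell the peer replica to stop
    return replica["data"]

def hedged_read(primary, backup, hedge_after=0.02):
    cancel = threading.Event()
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(fetch, primary, cancel)]
        time.sleep(hedge_after)       # give the primary a short head start
        if not cancel.is_set():       # still no answer -> hedge to the backup
            futures.append(pool.submit(fetch, backup, cancel))
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return next(f.result() for f in done if f.result() is not None)

slow_primary = {"latency": 0.20, "data": "from-primary"}
fast_backup = {"latency": 0.01, "data": "from-backup"}
```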
Tainted Partial Results
- mark such results as tainted in caches
Use Timeouts pattern
- by default Jersey uses HttpURLConnection, whose socket timeout is 0 ms, meaning the timeout is infinite
- tryLock, lock with timeout
Use Connection pooling
How to handle failures
Avoid single point of failure
- Strive for active/active
Avoid hotspot
- know your limit, and do throttle
- reject new request
- clear request queue
Lazy purge vs. active purge
Approximate correctness
- do a transaction every once in awhile instead of every time
Laziness
- Don't do work unless it's really needed
No computation at all
Pre-computation and cache
DNS
Route 53 - latency-based routing
geoDNS - based on location
Browser
- Use multiple domains
Load Balancer
- SSL termination/offloading (AWS ELB)
- SSL farm
- scale LB using DNS or layered LBs
Cache aggressively
- static files, not-change-often API response
Layered Cache
Application Cache
- In-Process Caching (Ehcache)
- Distributed Caching(redis)
- Vector/row/fragment/page cache
- spring cache sync=true
Reverse proxy: Nginx
Browser cache: cache header (filter, or httpd config)
- Prevent cache stampede (avalanche)
- separate global and unique to the person
Cache Warming - gearman
- proactively
- TTL(timeout-based) Caching
Write-through cache - link
- directs write I/O onto cache and through to underlying permanent storage
- write is slower, good for apps that writes and then re-reads data frequently
Write-around cache
- write I/O is written directly to permanent storage, bypassing the cache
Write-back cache
- write I/O is directed to cache, backed up to permanent storage only at specified intervals or under certain conditions
- good for write-intensive, non-critical app, may lose data if cache dies
- time shifting, rate limiting, conflation
Cache Aside
- read from cache, if missed, read from db then put it into cache
- update to db, and invalidate cache
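The cache-aside steps above in a few lines (dicts stand in for Redis and the db; TTL is illustrative):

```python
import time

db = {"user:1": {"name": "Ada"}}      # stand-in for the database
cache = {}                            # stand-in for redis: key -> (expires_at, value)
TTL = 60.0

def get_user(key):
    entry = cache.get(key)
    if entry and entry[0] > time.monotonic():
        return entry[1]                              # cache hit
    value = db.get(key)                              # miss -> read from db
    cache[key] = (time.monotonic() + TTL, value)     # then populate the cache
    return value

def update_user(key, value):
    db[key] = value
    cache.pop(key, None)   # invalidate rather than update: fewer stale races
```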
Queue - kafka, redis, MQ
Publish/Subscribe - Kafka, MQ
Asynchrony - asynchronous processing
- change "order placed successfully" to "order received successfully"
- queue, fire and forget calls
- avoid blocking calls, reduce contention
Retry
Feature flags
- Queue-based Load Leveling Pattern
- Concurrent request converted to sequential request
- Evening out traffic spikes, overdraft protection
- Isolating failures and self-healing
- Enabling asynchronous processing
- Decoupling
Partial message ordering
exactly-once
- one approach: at-least-once + idempotent
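The at-least-once + idempotent approach in sketch form: deduplicate on a message id before applying the side effect (in production the seen-set must be persistent, e.g. a db table keyed by message id):

```python
processed_ids = set()   # in production: persistent store keyed by message id

def handle(message):
    """At-least-once delivery redelivers messages; skipping ids we have
    already applied makes the pipeline behave like exactly-once."""
    if message["id"] in processed_ids:
        return "duplicate-ignored"
    processed_ids.add(message["id"])
    # ... the real side effect goes here (charge a card, write a row, ...)
    return "processed"

first = handle({"id": "msg-7", "body": "charge $5"})
second = handle({"id": "msg-7", "body": "charge $5"})   # broker redelivery
```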
Multiple Queue
- different priorities, put same host into same queue - to be polite
Priority Queue pattern
- Use redis
-- or multiple queue with different priorities
Batch operations
Statelessness
- Share Nothing
Make API Retryable, Idempotent and Killable
- selective retry - only for transient errors that may succeed on a subsequent attempt
- don't retry permanent(unrecoverable) errors
- limit the number of retries and time spent
- use exponential backoff with random jitter
- handle overload-related errors differently from other errors
RetryTemplate, FixedBackOffPolicy, setMaxAttempts, HttpClient
Only retry in outermost service
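The retry rules above in one sketch: selective retry of transient errors, a capped attempt budget, and exponential backoff with full jitter (the error classes and parameters are illustrative):

```python
import random
import time

TRANSIENT = (TimeoutError, ConnectionError)   # retry only errors that may pass later

def call_with_retry(fn, max_attempts=4, base=0.05, cap=1.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except TRANSIENT:
            if attempt == max_attempts - 1:
                raise                         # retry budget exhausted
            # full jitter: random sleep up to an exponentially growing bound
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
        # anything else (permanent/unrecoverable errors) propagates immediately

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("transient glitch")
    return "ok"
```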
Parallelization
Accurate vs approximate
Availability and Reliability
- how things can fail
Learn and work on an actual system
- focus on the architecture and the tradeoffs behind each decision
- do back-of-the-envelope calculations for something you’re building
Feature flags
- control flag to upstream(in nginx)
Design to Clone Things (X Axis)
- clone services + load balancers
Design to Split Different Things (Y Axis)
- Separate sets of data, can scale them differently.
- Separation of distinct and different functions and data
Design to Split Similar Things (Z Axis): sharding and podding
Design for Rollback
Design to Be Disabled
Design to Be Monitored
- Nagios, Ganglia, Graphite, statsd
Session
Memory-based session state with Load balancer affinity(AWS ELB)
- Cookie learning vs Cookie injection
Persist session state to a DB
Data compression - gzip header
Polyglot MicroServices
- Single-purpose
- Modular and independent
- Don't share a persistence layer, each service has its own database
- Example: Cassandra, Solr
Categorize users
- less frequency ops for inactive users
- return cached content(ads) for non-logged in user
Explore locality - partition
- temporal, location
Data Store
- Choose the right database for the right service
- Know what queries are needed currently or in near-future
Read to write ratio
- read/write heavy or both
- how frequently we write/read
- use different db for read/write
How much data
- it may be too large/costly to keep all in memory
- is it time series?
- row size
- how long to keep the data, policy to delete/archive old data
Hot/Cold Data
- Hot data to memory, cold data to ssd or disk(Glacier)
- Store them in different system
Separate index and content
- index for search, no need to store everything (solr).
- store content in KV db
Separate read/write
- Write to master, read from replica
- First write to queue, then sync to master
- Use different data stores
Store raw data to S3 or hdfs
Store Data that matters to datastore
Index
- index everything want to query on, only query on indexes
Denormalize the data
Split up large tables
SQL vs NOSQL
Key-Value - Riak, DynamoDB
Document-oriented - Couchbase, MongoDB
Column oriented
- store data by columns
- query by key or by key range
- BigTable, HBase and Cassandra
MySQL
- more read than writes
master/slave
- writes go to master
- Related or not
- Need ACID or not
Key-value
Column-oriented
Document-oriented
Use multiple data centers
Break incompatible schema change into multiple database and application deployments.
Optimistic vs Pessimistic lock
- separate big fields into different table: post/tweet body
- read by blocks, smaller rows => read more rows
- avoid joins, full table scan
PoPs servers - link
- maintain a set of ready-to-use, opened TCP connections to the main cluster service
- serve nearby user requests and forward them to the main cluster service using the ready-to-use, established connections
GeoHash - link
- divide the world into a grid of 32 cells
- each cell can be further divided into another 32 cells, which can be divided into another 32 cells, and so on.
- the longer, the more accurate
- the longer the shared prefix, the closer they are
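The subdivision scheme above in a compact encoder (5 bits = one base32 character = one 32-cell subdivision; bits alternate longitude/latitude, longitude first):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"   # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, length=6):
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    code, ch, even = "", 0, True              # even bit -> longitude
    for i in range(1, length * 5 + 1):
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:                        # 1-bit -> keep upper half-cell
            ch, rng[0] = ch * 2 + 1, mid
        else:                                 # 0-bit -> keep lower half-cell
            ch, rng[1] = ch * 2, mid
        even = not even
        if i % 5 == 0:                        # every 5 bits -> one of 32 cells
            code += BASE32[ch]
            ch = 0
    return code

# Longer hash = smaller cell; nearby points share a prefix.
a = geohash(57.64911, 10.40744)
b = geohash(57.64900, 10.40700)
```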
Prefer sequential writes over random writes
Gated Launch
Jitter - Add some randomness
- avoid a thundering herd
- Add Entropy Back into Your System
Security - link
fail securely
least privilege
never trust any input!
Deployments/Zero downtime
- rolling deployments
Autoscale
- not instantaneous
- can scale up or down
Valet Key Pattern - supported by S3
Command and Query Responsibility Segregation (CQRS) Pattern
Materialized View Pattern
Lambda Architecture - link
- all data is dispatched to both the batch layer and the speed layer
- batch layer: managing the master dataset (an immutable, append-only set of raw data), to pre-compute the batch views.
- serving layer indexes the batch views
- merging results from batch views and real-time views.
Option 1. Hadoop + Spark Streaming + Memory Storage + Columnar Storage
Ecommerce
SPU - Standard Product Unit
SKU - Stock Keeping Unit
MVP - minimum viable product
Robustness Principle
Be conservative in what you send, be liberal in what you accept
Reactive programming
- data flows and the propagation of change
- when x changes in one location, things depend on the value of x are recalculated and updated in various other locations
CPU Bound Application
- caching precomputed values
- performing the computation in a separate background job
Most apps are IO Bound - db/web service calls
Metrics, Numbers, Estimation
Apache Httpd Server
- one connection, one process
- max concurrent connections: ~4000
- yaws
HAProxy, nginx
- 20K concurrent connections
Web application server
- few thousand concurrent requests
Redis
- 6GB memory - 50K or 140K RPS (different bandwidth)
- 50GB memory - 150K or 250K RPS with Redis clustering; throughput increases linearly as you increase the number of shards
Cassandra
- mean read latency: 13 ms
- write latency: 25 ms
Kafka
- 200k messages/sec for writes and 3M messages/sec for reads
- Each broker can handle terabytes of messages
Send 1MB over 1Gbps network: 10ms
Read 1MB sequentially from memory: 0.25ms
Round trip within data center: 0.5ms
Disk seek: 8~10ms
Read 1MB sequentially from disk: 20~25ms
Back-of-the-envelope Calculation
2^10 ~~ 1 KB
2^20 ~~ 1 MB
2^30 bytes ~~ 1 GB
2^40 ~~ 1 TB
million -> 10^6
billion -> 10^9 ~ 2^30 bytes ~ 1GB
trillion -> 10^12 bytes ~ 1 TB
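A worked back-of-the-envelope example using the conversions above (every input is a made-up assumption):

```python
# Rough sizing for a hypothetical service: all inputs below are assumptions.
requests_per_day = 100_000_000                 # 100M requests/day
avg_qps = requests_per_day / 86_400            # seconds per day -> ~1.2K QPS
peak_qps = avg_qps * 3                         # assume peak = 3x average

write_ratio = 0.1                              # assume 10% of requests write
row_bytes = 500                                # assumed average record size
storage_per_year = requests_per_day * write_ratio * row_bytes * 365

print(f"avg {avg_qps:,.0f} QPS, peak {peak_qps:,.0f} QPS")
print(f"~{storage_per_year / 2**40:.1f} TB of new data per year")
```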
Terms
QPS - Queries per second
SLA - Service-level agreement
Resources