Programmer: Lifelong Learning: October 2016

System Design Practices

Design home page stream
Design feed
Mixed approaches - Push + Pull from
-- selectively disable push for celebrities, pull them instead.
https://engineering.pinterest.com/blog/building-scalable-and-available-home-feed
Design Decisions For Scaling Your High Traffic Feeds

Cassandra + Denormalize

Redis + Normalized

Selective Fanout Based On Producer/Consumer

URL shortener
https://www.youtube.com/watch?v=fMZMm_0ZhK4 - Range based
- read heavy
- separate read/write to different apps
- able to disable write functions
- id (auto increment), actual url, and shorten url

- base 62 encode the id to string(shorten url)
- generate auto-generate id
- can use sql or nosql as long as find a way to generate unique long
- rate limit, prevent abuse
Insert Long Url to db, get the primary key: long
Convert base10 to Base62[0-9a-zA-Z], other characters such as -,_+, convert long to short url
Decode short url to long number and query db to get long url.

How to make it scalable
How to analyze short url click event
301 or 302?
- custom url

Extension - popular urls in last hour/day

Generate unique ID numbers
Twitter Snowflake
- 41bit timestamp
- 10bit workerid
- 12bit sequence
Flickr Ticket Servers
TicketServer1:
auto-increment-increment = 2
auto-increment-offset = 1

TicketServer2:
auto-increment-increment = 2
auto-increment-offset = 2
Map many logical shards to fewer physical shards
How to assign shards when we can add/remove nodes?

User Overseer(leader) to change concurrent into serialized problem

Crawler
- politeness: rate limit
- put same host pages into same queue
- different priority for different sites - multiple queues

- normalization/canonicalization
- directed graph
- each URL: last visited time, score, fingerprint
- Use Bloom filters to remove duplicates
Re-crawling
- predict which websites are most likely to have frequent update, put those into the fetchers’ priority queue.
How to handle realtime news?
- each url: next visited time based on its score and how frequently it changes
- if page(links, content) is (mostly) not changed, increase its next visit time(cut+2t) otherwise decrease its next visit time
- begin with a seed set
- use a simple fingerprint such as a checksum
- the frontier
- normalized URL
DNS resolver
Locality
- Put data centers in different regions - machines in asian to crawl asian web sites

To crawl price histories of products
signals:
- previous price history changes
- how many reviews they have

- To detect new products: provide api so client can add or manually crawl new added pages.

Design an in-memory search engine
How to index and store in memory
How to support free text queries, phrase queries
Map<String, List<Document>>
List<Document> is sorted
Document: docId, List<Long> positionIds
List<Long> is sorted
How to save to files
How to merge small files into big files
-- When save to file, make it sorted by word
-- Use merge sort to merge multiple fields
How to make it scalable - how solr cloud works

Auto completion in search
- single node
- Trie, Topk

Design Hit Counter - how to count number of requests in last second, minute and hour - link1,

Generate Game Map
- brute force, randomly generate one, check whether it can be solved in limit time by machine

- different difficulty level

Design score/rank system for social game.
Need update score realtime?
- yes, for this user's operation, when score is small
- no, for others' operation, when score is big
Rank - realtime?
- yes, when user is ranked in X(1000),
- no, if user ranks is below Y(1M)
Average score
- rank: use sorted tree - sorted by score, node splitted by score range
- table: userid, score
- sharding

- append, not update existing data
- separate score change to different types: real time vs batchable
- batch score change

Search nearby places
- GeoHash

Design Slack
Design Chat app like messenger, what's app
https://www.youtube.com/watch?v=zKPNUMkwOJE&t=2s
Websockets
ServerA forward message to ServerB which forwards to clientB
How to manage/detect friends status - presence
- reduce message, avoid O(n^2)
- Each user subscribes to its subscription list in server
- When one user state's changed, it uses pub/sub to update all related users's subscription list in server side
- Server pushes friend status to each user in some fixed interval - if there is change (batch)

How to store messages
Typing indicators
- Only send it to users who is currently interacting with this user/group

Message status: delivered, seen

De-duplication

- url dedup: use bloom filter
Detect duplicate documents in billions of urls
- back-of-the-envelope calculations
- (30 bytes for url + int/long(4 or 8 bytes)) * 10^9(~1g) ~ 34g
- A1: Hash and split into multiple files - slow (even with multiple threads) as a lot of IOs
- A2: Parallelize in multiple machines - like map-reduce

Design log collection and analysis system

- Auxiliary service, should not affect the existing service (Monitor system memory and cpu, adjust the push rate accordingly)

Design shopping cart
addToCart,removeFromCart,changeQuantity
- monitor
- abandoned cart
guest cart
- cookie
- merged if user login later
- check items availability at some point

- call other services - stock service(price, availability), metadata service, images(cdn)

ecommerce - amazon
- nosql: wide column for product table

- concurrency for order: optimistic, pessimistic
- analytics
- abandoned cart, user behavior(click,view,add/remove from cart)
products
checkout
Payment
- two phrases: Authorization, Capture
shipping

customer service

Design rate limiter
- Separate from application
- warn/notify user before user reaches 90% limit
Algorithms:

Token bucket, Leaky bucket, rolling window, counter
Guava RateLimiter

Design Miao Sha
- Use cases: different phrases of the activity: before, available, sold out
- cache boolean available instead of the count

- Principle: fair(fifo) - but no need to 100% guarantee
Two types of queues
- one queue for each on-sale hot product, only keep x (count *1.x) requests, reject directly if more than it (ask them to retry after x mins)
- handled by one single server which reads from queue
- avoid distributed problems
- asynchronous processing
Cache activity page
- protect order page by adding a server side generated random number
- server side will generate and update a non-cached a js including activity info and the random number for the order page, the js is not cached in client side

How to update stocks - concurrency
How to send email notification, handle delivery
- asynchronous, put them into a queue

Design train ticket booking - Chunyun

Protect against from hackers
- CAPTCHA
- filter based on ipAddress, acountId
- Prevent abuse: cheating

Design email

Counter
- 1. Use count table
- 2. Use Redis

Hierarchical Data
- id, parentid

Design the data structure for large social network
Design system to store/read images

Expire mp3 in cdn and generate new one after x min
pre-load

Playlist Shuffle
- Fisher-Yates shuffle
- a good, uniformly distributed mix of all the tracks, no burst

Design key-value Database(lookup service)
Design APIs for add/update/query
CA or CP?
- List main components first (meltable, sstable, index, bloom filter, WAL log)
- membership and failure detection(gossip)
- Partition (Virtual nodes, consistent hashing)
- Replication(Versioning)
- handling temporary failures
- read repair and anti-entropy using Merkle tree
- Sparse index file and loaded into memory
- compress block
Write ahead (redo) log
- Global(Local) secondary index
- Conflict resolution

- Request coordinator
- Smart client can send requests to right replica
- Cache: row cache, key cache
Merge and compaction

Advantages over B-tress
- turn random writes into sequential writes

Downside of LSM tree
- compaction will interfere with the performance (not consistency)
- more storage
- key may appear in multiple sstable,
- more storage during compaction

Design column database
How to update some fields

Design Delayed Job Scheduler
- single server or distributed
- generic or for specific tasks
- Java Delay Queue
- wheelTimer (Netty)
- qurtz

Top K Frequent Elements in Recent X mins
- data structure: hashmap + linkedlist
system design:
- approximate
- batch, update every x seconds
- 分阶段统计, bucket
- sample

- different intervals for different cache(hour, daily)

Design Ticket Master - TODO
- Analyze the requirements: QPS, SLA
- Autoscale, predicative scale based on estimated popularity of concert

- queuing and prioritizing

How to Design a Trending Algorithm - TODO
- offline pipelines
- cache, no need realtime
- candidates: hashtag

- signal: influencer, location, personalization

Make single-server datastore to peer-to-peer, linearly scalable, clustered systems
Dynomite

Design distributed priority queue service + Delayed queue
Distributed delay queues based on Dynomite + Redis
A Sorted Set containing queued elements by score.
A Hash set that contains message payload, with key as message ID.

A Sorted Set containing messages consumed by client but yet to be acknowledged. Un-ack set.

Design distributed file system
- master for meta data: where the chunk stores
- get file chunk location from master, read directly form slave

- replication, checksum

Image conversion service

- queue, asynchronously

Java Concurrency Tips

Executors
executor.execute vs submit

execute can only accept Runnable, execute returns nothing

- void execute(Runnable task)

- used for fire and forget

submit can accept Runnable and Callable, submit returns Future

Future submit(Callable task)

Future submit(Runnable task)

- used when need get the result

Exception handling - if task throws unhanded exception

- execute: the UncaughtExceptionHandler for the Thread running the task to be invoked. By default it prints exception to System.err. We can register global UncaughtExceptionHandler: Thread.setDefaultUncaughtExceptionHandler();

- submit: Any exception thrown will be bound in Future. Call future.get will throw an ExecutionException with the original Throwable as its cause.

RejectedExecutionHandler: ThreadPoolExecutor.CallerRunsPolicy(), NewThreadRunsPolicy

Threadpool
-Isolation: Allocate a separate thread pool for slow web requests within the same application

- corePoolSize 30, maximumPoolSize 50, ArrayBlockingQueue(100): what it means

Deadlock prevention
- Same Lock Ordering
- Lock Timeout

Mutual Exclusion
Hold and Wait
No Preemption

Circular Wait

Lock
Intrinsic Lock - synchronized
- every object and class is logically associated with a monitor
- wait, notify, notifyAll
ReentrantLock
- acquire lock interruptibly and with timeout - tryLock
- lockInterruptibly: used to interrupt thread when it is waiting for lock
- fair lock
- better performance: uses atomic variable and faster CAS operation
- get list of waiting thread for lock
disadvantages
- need put it into try-finally block, code unreadable
- easier to make mistake: forget to unlock in finally block
associate a counter with each thread

Condition
- wait for/notify a particular condition to become true
- can have multiple conditions on same lock
- enables the effect of having multiple wait-sets per object
- factors out the Object monitor methods (wait, notify and notifyAll) into distinct objects to give the effect of having multiple wait-sets per object
- Where a Lock replaces the use of synchronized methods and statements, a Condition replaces the use of the Object monitor methods.

- await releases the lock, suspend the thread, signal, signalAll
- BoundedBuffer
Condition notFull = lock.newCondition();
Condition notEmpty = lock.newCondition();

Semaphore
- can be released by other threads, useful for deadlock recovery
- P(acquire), V(release)

- a counter (integer) that allows a thread to get into a critical region if the value of the counter is greater than 0
- implementation: no lock, CAS
- acquire/(permits), tryAcquire, release/(permits)
CountDownLatch
CountDownLatch countDownLatch = new CountDownLatch(TOTAL_EVENTS);
countDownLatch.countDown();
countDownLatch.await();
Barrier
CyclicBarrier
- We can reuse CyclicBarrier even if Barrier is broken but can't reuse CountDownLatch

Programming Concurrency on the JVM
Read-write lock - ReentrantReadWriteLock
Thread local storage
Prefer local thread variables over static and shared when possible
Avoiding deadlocks by ordering the locks
Using atomic variables instead of synchronization
Holding locks for as short a time as possible
Avoid executing inside the critical section the code you don't control.

Use lock instead of synchronize
Use CyclicBarrier and CountdownLatch instead of wait/notify

volatile
- visibility: any thread that reads a field will see the most recently written value
- operations on it not be reordered

- not cached in registers or in caches; always read from and write to main memory - thus a little slower

Ensure Visibility
- cross the memory barrier on getters, use volatile, atomicXXX, readlock etc
Ensure Atomicity

Don't start threads in constructors
- use static factory methods, thread pool
Use tryLock - instead of lock
Cancel a task - future.cancel, executor.cancel

Initialization-on-demand holder idiom
Double-checked locking

AtomicLongArray
The long elements in the AtomicLongArray can be updated atomically
- compareAndSet: compare the value of a given element with a specified value, and if the two values are equal, set a new value for that element.
- getAndAdd(index, incrment), getAndIncrement(index), incrementAndGet

AtomicReferenceArray
AtomicReference

Concurrent data structures
ConcurrentHashMap
- guarantees O(log(n)) operations
- sorted, ceilingEntry/Key, floorEntry/Key

ConcurrentSkipListMap
ConcurrentSkipListSet
- thread-safe implementation of the SortedSet
- concurrent crud
- slower than TreeSet, don't use it unless in highly ...
- size is slow, O(n)
- head, nil are in every skip list
- when add, a random number generator is used to determine the levels of the new node
ConcurrentLinkedDeque
CopyOnWriteArrayList
CompletionService
extend thread class vs implement runnable interface
- use threadpool to share threads if implement runnable

Missed Signals - link
boolean wasSignalled = false;
synchronized(myMonitorObject){
while(!wasSignalled){
try{
myMonitorObject.wait();
} catch(InterruptedException e){...}
}
wasSignalled = false;
}

Don't call wait() on constant String's or global objects
LongAdder
- use more memory
- prefer AtomicLong when contention is low

unsafe

unsafe.getAndAddLong(this, valueOffset, 1L)
Practices
Implement AtomicInteger
synchronized
int currentValue;
int previousValue;

Implement ReadWriteLock and ReentrantReadWriteLock - link
private Map readingThreads =
new HashMap();

private int writeAccesses = 0;
private int writeRequests = 0;
private Thread writingThread = null;
Implement Lock and ReEntrantLock
int lockHoldCount;
long IdOfThreadCurrentlyHoldingLock;
IdOfThreadCurrentlyHoldingLock=Thread.currentThread().getId();
synchronized

Implement Semaphore
int permits; synchronized
while(this.permits == 0) wait();

Implement CyclicBarrier
2 or more threads wait for each other to reach a common barrier point. When all threads have reached common barrier point (i.e. when all threads have called await() method), all waiting threads are released.
CyclicBarrier(int parties)

int initialParties; //total parties
int partiesAwait; //parties yet to arrive
synchronized await
-
partiesAwait--;
if(partiesAwait>0){
this.wait();
}

Implementing (Thread Safe) Stack - link
- AtomicReference
- CAS
- if(head.compareAndSet(oldHead, newHead))
Design threadsafe real time counter
private AtomicLongArray counter = new AtomicLongArray(GRANULARITY);
private volatile int pos = 0;
- Another thread that update pos every second
void incrementPosition(){
counter.set((pos + 1)%GRANULARITY, 0);
pos = (pos + 1)%GRANULARITY;

}

Person A holds A, PB holds B, PC holds C, how can we guarantee always print ABC

- Use Semaphore

Print even and odd numbers using threads
Dining Philosophers Problem

System Design: Learning from Existing Products

Pinterest Home Feed
- best-first rather than newest-first
- separation of available, but unseen, content and content that’s already been presented

- fan out write pins to followers
- batch write
standby HBase cluster
- speculative execution
- make another call to the standby cluster if the first call to primary cluster fails or doesn’t return within certain time
- randomly forward x percent of calls to warm up it

smart feed worker
- score/store pins: quality and meaning to users
- repins by followers, related pins, interest Pins
- 3 priority queues(hbase tables) for each user, use key-based sorting of HBase: userid, score, pin
Smart feed content generator
- pins selected are removed from pool(hbase table)
Smart feed service
materialized feed

discover interests - Pinterest
- n-grams
- verify with search queries, public domain ontologies like Wikipedia titles
- blacklist, stop words
- implicit and explicit signals from users

related interests
- co-occurrence relationship for interests

search query
- user's demographic stats
- store and use previous queries
- adjacent queries in the same search session

- clicks afert submit query

Twitter
- users,tweets,be-followed,following
- persistent connection between client and server
Add a twitter
- Write it to the queue, return to client
- in background, save to db, push to (active) follower's queue
Cache aggressively
Vector Cache
- ids(user, tweet)
- user's followers(id)

Row Cache
- tweet

Fragment Cache
Page Cache

- separated and put into a different machine

Home Timeline - link
user's timeline vs home timeline

- precomputed and cached in redis for active users
Mixed Pull+Push
- local decisions about each producer/consumer pair
- how often producer adds a event, how ofter consumer views the event
- celebrity twitter - Pull

- normal user twitter - Push
Flock maintains the follower and followings lists.

Search

- tweets are stored in memory Lucene

- rerank based on social proof, the number of retweets, favorites, and replies.

- so every time, a tweet is retweeted or etc, no need update

Lucene

Facebook
Separate data
Recent Activity - hot
- changes frequently, cache is frequently invalidated
- small, fast to regenerate them

Non-Recent Activity - cold
- a lot of data, slow to generate
- but don't change, cache will not be invalidated

Photo storage
- use cdn popular images
- use Haystack for other images - long tail
NFS photo infrastructure - old version
- huge metadata, most are not used(permission)
- use cdn to store images
- one(or more) to translate the filename into inode number, another to read the inode from disk, and a final one to read the file itself.
- need keep all inodes in memeory
- CDN good for hot data, but not so for cold data
long tail data
- requests for less popular (often older) content

Haystack - link
Haystack Object Store
- a log structured (append-only) object store containing needles representing the stored objects
- eliminates the metadata overhead by aggregating hundreds of thousands of images in a single haystack store file
- at most one disk operation per read, keep all metadata in main memory
- haystack store file
- index - needle's location, not critical, can be rebuilt, written asynchronously

needle is identified by Offset, Key, Alternate Key, Cookie
- Don't allow overwrite existing needle, instead: add a new version and its index
- needles with tuple: duplicate Key, Alternate Key, Cookie, the largest offset is the most recent one
Read - Pass needle offset, key, alternate key, cookie and the data size
Delete - set a deleted bit in the flags

- Fault-tolerant: replicates

Photo Store Server - http layer
- keeps a minimal in-memory index
- Google sparse hash data structure

Compaction
- reclaim the space
- copy needles while skipping any duplicate or deleted entries

Haystack Cache
- insulation if upstream CDN nodes fail and

Timeline - your story
News Feed - Multifeed
- multi-fetch and aggregate stories at read time
- aggregator(staeless) and leafs
- aggregator servers sends request to all leaf servers and ranks the results.
- all data in leafs is in memory (memcached)
- fetch recent activity from all your friends
- gather and Group into stories
- rank stories by relevance
- more flexible, manageable
- fan-out reads can be bounded. writes, often cannot

- SSD
- others's story that related to you
- rank edges(stories) by importance(relevance) - link
- content score: ∑e = ue we de, wherein ∑e is the sum of the edge's rank, ue is the affinity score with the user who created the edge, we is the weight for the content type, and de is a time decay factor

- a friend's affinity score: ∑i = li ni wi, wherein ∑i is the sum of the interactions with that friend, li is the time since your last interaction (this would need to be weighted so that 1 day > 30 days), ni is the number of interacts, and wi is the weight of those interactions

- Activity(id, user_id, source_id, activity_type, edge_rank, parent_id, parent_type, data, time)
- timeline - user's story
- newsfeed
connections and activities

"Push" Model, or Fan-out-on-write
- a lot of write, only push to active users for non-celebrity users
- use write-optimized data store such as Cassandra, HBase, or BigTable
"Pull" Model, or Fan-out-on-load
- less data, less computation
- materialized what user already viewed into a table, only pull new data
- have a fallback mechanism if fails to pull - show old materialized data

Combination of push and pull
- make decision locally on per consumer/producer basis
- a frequent producer of events may overshadow events from less frequent producers
Selective Fanout Based On Consumer
- only fan-out to active/valuable users
Selective Fanout Based On Producer

- don't fan-out for high profile users
Pull for new users
- because nothing pushed to the user
- when there is no enough data, pull from following

Make schema changes is expensive in traditional db
- such as add/remove indexes
- Users don't know what they don't know.

- All content is not created equal.

Reddit
- Treat nonlogged user as second class citizens
- Return cached content to them
- Put a limit on everything
- Provide timeout

- Expire data. Lock out old threads and create a fully rendered page and cache it.

Red Envelop
- store count, balance into redis
- use redis
- decr count, CAS to decrease balance
- ttl

Uber
Driver - Long connection
- matches drivers and riders, the supply and demand
Maps/ETA (estimated time of arrival)

Nutch2
- plugins, extensible
- fetching pages, filtering and normalizing URLs and parsing responses

Etsy - link
$choose_connection = mt_rand() < $affinity;
Pruning Same/similiar message
combine similar stories into rollups
- group similar storeis

Cache
- seperate global and unique to the person
Cache Warming - gearman
- proactively
- TTL Caching
Lazy Below the Fold

JD
数据闭环,异构,聚合

EFK - Log collection/visualization
logstash/fleuent -> kafka -> Elasticsearch -> Kibana

System Design - Summary

Identify the user cases, requirements, constraints, scope
- ask a lot of questions
- current requirements and requirements in near future
- talk about near-future requirements and design for it.
- who is the target audience? how it will be used?
- which features would be Pri-0 versus Pri-1?
- what are key and high traffic operations?
- how accurate it requires?
- max requests/day, QPS, storage size, latency, availability
- high level design and architecture
- main user cases/specific APIs for each component
Try multiple approaches
- high level: API, application layer, data layer then drill down
- sketch up the major components and their relationships
- draw a simple diagram

Identify bottlenecks/limitations and how to tackle them

Trade-offs
Performance, Consistency, Availability, Safety

Compare with other approaches

- performance characteristics of proposed architecture

Relax Temporal Constraints
- Alleviate temporal constraints in your system whenever possible

Keys to improve performance
- cache, partition, replicate, index

Differentiated service classes
- Different services/data need different guarantees/SLA

Separation of concerns/things

- separate user-facing components from non user-facing components
- critical vs non-critical services
- different db/apps into different servers
- differentiation of different components
Separate frontend and backend
- always writable in frontend
- backend can write to queue, and asynchronously do business logic
Functional partition services
- understand who are priority customers

Services
- coarse granularity of services
- establish clear relationships between the service
- isolate problems, scale independently
- solve problems independently

Restful Service

- cache friendly

- good for public api
- put: idempotent

Use standard timestamps - UTC only

Use downcased and dash-separated path names, downcased, underscore separators for attribute names

- actions/:action
Avoid auto-I=increment id; use UUID
Use Plural Noun
put response entity into a namespace(data), able to extend/add counts, links
flexible - client can choose which fields/sub resources to return
pagination - Define a Maximum
Don’t use integers as timestamps

Etag + If-None-Match conditional GET

RPC

- Protobuf, Thrift, Apache Avro

- Efficency

- But hard to debug and not flexible

Services features
- read heavy (improve latency) write heavy(save space, async,queue) or both
- user facing or not?
- delay acceptable, (write feeds)
- QPS
- SLA, performance requirement

Partitioning
- each node may be the leader for some partitions, and a follower for other partitions.
partitioning by hash of key
- lose the ability to do efficient range queries: rang query has to be sent to all partitions
- (user_id, update_timestamp); (sensor, update_timestamp)
- can perform an efficient range query if specify fixed value for the first key

CAP theorem

Consistency - all of the nodes see the same data at the same time.

Availability - any available node can serve client requests even when other nodes fail.

Partition tolerance

- system can operate even in the face of network failures where communication between nodes is impossible.

- happens when some nodes in the system cannot reach other nodes, but all can be reached by clients.

CP - HBase, Bigtable, MongoDB, Redis
AP - Dynamdb, Cassandra, Voldemort, Riak, SimpleDB

CA - RMDBMs, Zookeeper

Consistency - link

Eventual consistency
Causal consistency

RYW (Read-Your-Writes) consistency - desired

- read it from the leader

- read user own data from leader, read others data from replicate
- or always read/write to same replica

Session consistency

- guarantee read-your-writes consistency in same session
Monotonic read consistency - desired
- if a process has seen a particular value for the object, any subsequent accesses will never return any previous values.
- always read to same replica
Monotonic write consistency
- guarantee to serialize the writes by the same process

- vector clock + RWN

Use write version - to implement optimistic lock, or consistency
Stickiness of clients

R+W>N
- N: number of replicas
- W: number of replicas that need to acknowledge update
- R: number of replicas that are read
- Normally: N=3, W=2, R=2
- Optimize for read: R=1, W=N
- Optimize for write: W=1, R=N

BASE(Basically Available, Soft State, and Eventually Consistent)

Replication

- Scale reads not writes

- Statement-based

- Write-ahead log (WAL)

- Logical log - MySQL binlog
Master-Master - lose single-source-of-truth
Single Master/Multiple Slaves

Fine-Grained Dynamic Partitioning
Speeds Failure Recovery
- Many machines each recover one or a few partition
Selective Replication
- find heavily used items and make more replicas
- can be static or dynamic

Sharding

Redundancy
- Kick off the same computation more than once and return the one that finishes fastest.

- Use multiple vendors for same service

Throttling
- rate-limiting, remove slow servers
- Always put a limit on everything

How to reduce latency - link
Differentiated service classes
- prioritized request queues in servers
Manage expensive background activities
- defer until load is lower
remove capacity under load to improve latency
Use right image/video types
Decrease processing time
- Collocation(locality)
- Caching, Pooling
- Parallelization

- Partitioning

Cross request adaptation - balance load across set of servers
Within request adaptation
- Canary Requests
- Backup Requests (w/ Cross-Server Cancellation)
– send request to first replica
– wait 2 ms, and send to second replica
– servers cancel request on other replica when starting read
Tainted Partial Results

- mark such results as tainted in caches

Use Timeouts pattern

- default Jersey uses HttpURLConnection, its socket timeout of 0 millis, meaning that the timeout is infinite.
- tryLock, lock with timeout

Use Connection pooling

How to handle failures
Design for Fault Tolerance and Graceful Failure
Graceful Degradation - reduced functionality
- degradable service

- better to give users limited functionality than an error page
- still do something reasonable even if not all is right
- read only from local cache, only from distributed cache, only return fixed data
- rate limit, throttles
Fail as Early/fast as Possible - circuit breaker

Avoid single point of failure
- Strive for active/active
Avoid hotspot

How to handle overload in each node
- know your limit, and do throttle
- reject new request
- clear request queue

Lazy purge - actively purge

Approximate correctness
- do a transaction every once in awhile instead of every time

Laziness

- Don't do work unless it's really needed

No computation at all
Pre-computation and cache

DNS

Route 53 - latency-based routing

geoDNS - based on location

Browser
- Use multiple domains

Load Balancer

- SSL termination/offloading (AWS ELB)

- SSL farm
- scale lb using dns otr layered lbs

Cache aggressively

CDN - CloudFront
- static files, not-change-often API response
Layered Cache
Application Cache
- In-Process Caching(ehcahce)
- Distributed Caching(redis)
- Vector/row/fragment/page cache

- spring cache sync=true

Reverse proxy: Nginx
Browser cache: cache header(filter, or httpd config)
- Prevent 雪崩

- separate global and unique to the person

Cache Warming - gearman

- proactively

- TTL(timeout-based) Caching
Write-through cache - link
- directs write I/O onto cache and through to underlying permanent storage
- write is slower, good for apps that writes and then re-reads data frequently
Write-around cache
- write I/O is written directly to permanent storage, bypassing the cache
Write-back cache
- write I/O is directed to cache, backed up to permanent storage only at specified intervals or under certain conditions
- good for write-intensive, non-critical app, may lose data if cache dies
- time shifting, rate limiting, conflation
Cache Aside
- read from cache, if missed, read from db then put it into cache

- update to db, and invalidate cache

Queue - kafka, redis, MQ

- Queue-based Load Leveling Pattern
- Concurrent request converted to sequential request
- Evening out traffic spikes, overdraft protection
- Isolating failures and self-healing
- Enabling asynchronous processing
- Decoupling
Partial message ordering
exactly-once

- one approach: at-least-once + idempotent
Multiple Queue
- different priorities, put same host into same queue - to be polite

Priority Queue pattern

- Use redis

-- or multiple queue with different priorities

Publish/Subscribe - Kafka, MQ
Asynchrony - asynchronous processing
- order placed successfully TO order received successfully
- queue, fire and forget calls

- avoid blocking calls, reduce contention

Batch operations

Statelessness

- Share Nothing

Make API Retryable, Idempotent and Killable

Retry

- selective retry - only for transient errors that may succeed on a subsequent attempt
- don't retry permanent(unrecoverable) errors
- limit the number of retries and time spent
- use exponential backoff with random
- handle overload-related errors differently from other errors
RetryTemplate, FixedBackOffPolicy, setMaxAttempts, HttpClient

Only retry in outermost service

Parallelization

Accurate vs approximately
Availability and Reliability
- how things can fail
Learn and work on an actual system
- focus on the architecture and the tradeoffs behind each decision

- do back-of-the-envelope calculations for something you’re building

Feature flags

- control flag to upstream(in nginx)

Design to Clone Things (X Axis)
- clone services + load balancers
Design to Split Different Things (Y Axis)
- Separate sets of data, can scale them differently.
- Separation of distinct and different functions and data

Design to Split Similar Things (Z Axis): sharding and podding

Design for Rollback
Design to Be Disabled
Design to Be Monitored
- Nagios, Ganglia, Graphite, statsd

Session

Memory-based session state with Load balancer affinity(AWS ELB)
- Cookie learning vs Cookie injection

Persist session state to a DB

Data compression - gzip header

Polyglot MicroServices
- Single-purpose
- Modular and independent
- Don't share a persistence layer, each service has its own database

Categorize users

- less frequency ops for inactive users

- return cached content(ads) for non-logged in user

Explore locality - partition
- temporal, location

Data Store

- Choose the right database for the right service
- Know what queries are needed currently or in near-future
Read to write ratio

- read/write heavy or both
- how frequently we write/read

- use different db for read/write

- How much data
- it may be too large/cost to keep all in memory
- is it time series?
- row size
- how long to keep the data, policy to delete/archive old data
Hot/Cold Data
- Hot data to memory, cold data to ssd or disk(Glacier)
- Store them in different system
Separate index and content
- index for search, no need to store everything (solr).

- store content in KV db

Separate read/write
- Write to master, read from replica
- First write to queue, then sync to master
- Use different data stores

Store raw data to S3 or hdfs
Store Data that matters to datastore

Index
- index everything want to query on, only query on indexes

Denormalize the data
Split up large tables

SQL vs NOSQL

Key-Value - Riak, Dynamodb
Document-oriented - Couchbase, MongoDB
Column oriented
- store data by columns
- query by key or by key range

- BigTable, HBase and Cassandra

MySQL

- more read than writes

master/slave

- writes goes to master

- Related or not

- Need ACID or not

Key-value
Column-oriented
Document-oriented

Use multiple data centers

Break incompatible schema change into multiple database and application deployments.

Optimistic vs Pessimistic lock

- separate big fields into different table: post/tweet body
- read by blocks, smaller rows => read more rows

- avoid joins, full table scan

PoPs servers - link
- maintain a set of ready-to-use, opened TCP connections to the main cluster service
- serve nearby user requests and forward them to the main cluster service using the ready-to-use, established connections

GeoHash - link
- divide the world into a grid of 32 cells
- each cell can be further divided into another 32 cells, which can be divided into another 32 cells, and so on.
- the longer, the more accurate

- the longer the shared prefix, the closer they are

Prefer sequential write vs random write

- Example: Cassandra, Solr

Gated Launch
Jitter - Add some randomness
- avoid a thundering herd

- Add Entropy Back into Your System

Security - link
fail securely
least privilege

never trust any input!

Deployments/Zero downtime
- rolling deployments

Cloud Design Patterns: Prescriptive Architecture Guidance for Cloud Applications

Autoscale

- not instantaneous

- can scale up or down

Valet Key Pattern - supported by S3

Command and Query Responsibility Segregation (CQRS) Pattern

Materialized View Pattern

Ecommerce

SPU - Standard Product Unit

SKU - Stock Keeping Unit

MVP - minimum viable product

Robustness Principle

Be conservative in what you send, be liberal in what you accept

Be conservative in what you do, be liberal in what you accept from others

Reactive programming

- data flows and the propagation of change

- when x changes in one location, things depend on the value of x are recalculated and updated in various other locations

CPU Bound Application

Caching precomputing values

Performing the computation in separate background job.

Most apps are IO Bound Application - db/web service calls

Lambda Architecture - link
- all data is dispatched to both the batch layer and the speed layer
- batch layer: managing the master dataset (an immutable, append-only set of raw data), to pre-compute the batch views.
- serving layer indexes the batch views
- merging results from batch views and real-time views.

Option 1. Hadoop + Spark Streaming + Memory Storage + Columnar Storage

Metrics, Numbers, Estimation

Apache Httpd Server

- one connection, one process

- max cocurrent connections: ~4000

- yaws

HAProxy, nginx

- 20K concurrent connections

Web application server

- few thousand concurrent requests

Redis performance

6gb memory - 50K or 140K RPS (different bandwidth)

50gb memory - 150K or 250K with Redis clustering, throughput increases linearly as you increase the number of shards

Cassandra

- mean read latency: 13 ms
- write latency: 25 ms

Kafka
- 200k messages/sec for writes and 3M messages/sec for reads
- Each broker can handle terabytes of messages

Send 1M bytes over 1Gbps network 10ms

Read 1M sequentially from memory 0.25ms

Round trip within data center 0.5ms

Disk seek 8~10ms

Read 1MB sequentially from disk 20~25ms

Back-of-the-envelope Calculation

2^30 bytes ~~ 1GB

2^40 ~~ 1 TB

2^10 ~~ 1 KB

2^20 ~~ 1 MB

million -> 10^6

billion -> 10^9 ~ 2^30 bytes ~ 1GB

trillion -> 10^12 bytes~ 1 TB

Terms

QPS - Queries per second SLA - Service-level agreement

LRU
LIRS
- Inter-Reference Recency
- Low IRR, Hight IRR

Resources

Designing Data-Intensive Web Applications

Web Scalability for Startup Engineers

Cloud Architecture Patterns: Using Microsoft Azure

The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise

Scalability Rules: Principles for Scaling Web Sites

System Design Practices

Java Concurrency Tips

System Design: Learning from Existing Products

System Design - Summary

Labels