System Design Practices

Design home page stream
Design feed
Mixed approaches - Push + Pull from 
-- selectively disable push for celebrities, pull them instead.
https://engineering.pinterest.com/blog/building-scalable-and-available-home-feed
Design Decisions For Scaling Your High Traffic Feeds

Cassandra + Denormalize
Redis + Normalized

Selective Fanout Based On Producer/Consumer
URL shortener
- read heavy
- separate read/write to different apps
- able to disable write functions
- id (auto increment), actual url, and shorten url

- base 62 encode the id to string(shorten url)
  - generate auto-generate id
  - can use sql or nosql as long as find a way to generate unique long
- rate limit, prevent abuse
Insert Long Url to db, get the primary key: long
Convert base10 to Base62[0-9a-zA-Z], other characters such as -,_+, convert long to short url
Decode short url to long number and query db to get long url.

How to make it scalable
How to analyze short url click event

Crawler
- politeness: rate limit
- put same host pages into same queue
- different priority for different sites - multiple queues
- normalization/canonicalization
- directed graph
- each URL: last visited time, score, fingerprint
- Use Bloom filters to remove duplicates
Re-crawling
- predict which websites are most likely to have frequent update, put those into the fetchers’ priority queue.
How to handle realtime news?
- each url: next visited time based on its score and how frequently it changes
- if page(links, content) is (mostly) not changed, increase its next visit time(cut+2t) otherwise decrease its next visit time
- begin with a seed set
- use a simple fingerprint such as a checksum
- the frontier
- normalized URL
DNS resolver
Locality
- Put data centers in different regions - machines in asian to crawl asian web sites

To crawl price histories of products
signals:
- previous price history changes
- how many reviews they have

- To detect new products: provide api so client can add or manually crawl new added pages.

Design an in-memory search engine
How to index and store in memory
How to support free text queries, phrase queries
Map<String, List<Document>>
List<Document> is sorted
Document: docId, List<Long> positionIds
List<Long> is sorted
How to save to files
How to merge small files into big files
-- When save to file, make it sorted by word
-- Use merge sort to merge multiple fields
How to make it scalable - how solr cloud works 

Auto completion in search
- single node
- Trie, Topk


Design Hit Counter - how to count number of requests in last second, minute and hour - link1,

Generate Game Map
- brute force, randomly generate one, check whether it can be solved in limit time by machine

- different difficulty level

Design score/rank system for social game.
Need update score realtime?
- yes, for this user's operation, when score is small
- no, for others' operation, when score is big
Rank - realtime?
- yes, when user is ranked in X(1000),
- no, if user ranks is below Y(1M)
Average score
- rank: use sorted tree - sorted by score, node splitted by score range
- table: userid, score
- sharding

- append, not update existing data
- separate score change to different types: real time vs batchable
- batch score change

Search nearby places
- GeoHash

Design Chat app like messenger, what's app
How to manage/detect friends status
How to store messages
How to keep connections
- persistent connection?

De-duplication

- url dedup: use bloom filter
Detect duplicate documents in billions of urls
- back-of-the-envelope calculations
- (30 bytes for url + int/long(4 or 8 bytes)) * 10^9(~1g) ~ 34g
- A1: Hash and split into multiple files  - slow (even with multiple threads) as a lot of IOs
- A2: Parallelize in multiple machines - like map-reduce

Design Slack

Design logging collection and analysis system
Design shopping cart
addToCart,removeFromCart,changeQuantity
- monitor
- abandoned cart
guest cart
- cookie
- merged if user login later
- check items availability at some point

- call other services - stock service(price, availability), metadata service, images(cdn)

ecommerce - amazon
- analytics
- abandoned cart, user behavior(click,view,add/remove from cart)
products
checkout + payment
shipping

customer service

Design rate limiter
Design Miao Sha
- Principle: fair(fifo) - but no need to 100% guarantee
Two types of queues
- each server has its own queue, only keep x requests, reject directly if more than it
- one queue in one single server which gets all requests from all server, and process it synchronously
- avoid distributed problems
- asynchronous processing
Protect against from hackers
- CAPTCHA

- filter based on ipAddress, acountId
- Prevent abuse: cheating

Design email

Counter
- 1. Use count table
- 2. Use Redis

Hierarchical Data
- id, parentid

Generate unique ID numbers
Twitter Snowflake
- 41bit timestamp
- 10bit workerid

- 12bit sequence
Ticket Servers
TicketServer1:
auto-increment-increment = 2
auto-increment-offset = 1

TicketServer2:
auto-increment-increment = 2
auto-increment-offset = 2

Design the data structure for large social network
Design system to store/read images

Expire mp3 in cdn and generate new one after x min
pre-load

Playlist Shuffle
- Fisher-Yates shuffle
- a good, uniformly distributed mix of all the tracks, no burst

Design key-value Database(lookup service)
Design APIs for add/update/query
CA or CP?
- membership and failure detection(gossip)
- Partition (Virtual nodes, consistent hashing)
- Replication(Versioning)
- handling temporary failures
- read repair and anti-entropy using Merkle tree
- Global(Local) secondary index
- Conflict resolution

- Request coordinator
- Cache: row cache, key cache

Design column database
Post a Comment

Labels

Java (159) Lucene-Solr (112) Interview (61) All (58) J2SE (53) Algorithm (45) Soft Skills (38) Eclipse (33) Code Example (31) Linux (25) JavaScript (23) Spring (22) Windows (22) Web Development (20) Tools (19) Nutch2 (18) Bugs (17) Debug (16) Defects (14) Text Mining (14) J2EE (13) Network (13) Troubleshooting (13) PowerShell (11) Chrome (9) Design (9) How to (9) Learning code (9) Performance (9) Problem Solving (9) UIMA (9) html (9) Http Client (8) Maven (8) Security (8) bat (8) blogger (8) Big Data (7) Continuous Integration (7) Google (7) Guava (7) JSON (7) Shell (7) ANT (6) Coding Skills (6) Database (6) Lesson Learned (6) Programmer Skills (6) Scala (6) Tips (6) css (6) Algorithm Series (5) Cache (5) Dynamic Languages (5) IDE (5) System Design (5) adsense (5) xml (5) AIX (4) Code Quality (4) GAE (4) Git (4) Good Programming Practices (4) Jackson (4) Memory Usage (4) Miscs (4) OpenNLP (4) Project Managment (4) Spark (4) Testing (4) ads (4) regular-expression (4) Android (3) Apache Spark (3) Become a Better You (3) Concurrency (3) Eclipse RCP (3) English (3) Happy Hacking (3) IBM (3) J2SE Knowledge Series (3) JAX-RS (3) Jetty (3) Restful Web Service (3) Script (3) regex (3) seo (3) .Net (2) Android Studio (2) Apache (2) Apache Procrun (2) Architecture (2) Batch (2) Bit Operation (2) Build (2) Building Scalable Web Sites (2) C# (2) C/C++ (2) CSV (2) Career (2) Cassandra (2) Distributed (2) Fiddler (2) Firefox (2) Google Drive (2) Gson (2) How to Interview (2) Html Parser (2) Http (2) Image Tools (2) JQuery (2) Jersey (2) LDAP (2) Life (2) Logging (2) Python (2) Software Issues (2) Storage (2) Text Search (2) xml parser (2) AOP (1) Application Design (1) AspectJ (1) Chrome DevTools (1) Cloud (1) Codility (1) Data Mining (1) Data Structure (1) ExceptionUtils (1) Exif (1) Feature Request (1) FindBugs (1) Greasemonkey (1) HTML5 (1) Httpd (1) I18N (1) IBM Java Thread Dump Analyzer (1) JDK Source Code (1) JDK8 (1) JMX (1) Lazy Developer (1) Mac (1) Machine Learning (1) Mobile (1) My Plan for 2010 (1) Netbeans (1) Notes (1) Operating System (1) Perl (1) Problems (1) Product Architecture (1) Programming Life (1) Quality (1) Redhat (1) Redis (1) Review (1) RxJava (1) Solutions logs (1) Team Management (1) Thread Dump Analyzer (1) Visualization (1) boilerpipe (1) htm (1) ongoing (1) procrun (1) rss (1)

Popular Posts