Spark Optimization Tips


Batch Operations
If we need to call an external resource (a database, Solr), send the requests in batches rather than making one call per record.
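A minimal sketch of the batching idea in plain Java, with no Spark dependency. The class name BatchSender, the method sendInBatches, and the sink callback are hypothetical; the sink stands in for whatever database or Solr client call you actually make, and the iterator stands in for the records of one partition.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: groups records into fixed-size batches before handing
// them to an external client (database, Solr, ...), represented here by `sink`.
final class BatchSender {
    private BatchSender() {}

    // Sends all records in batches of at most batchSize.
    // Returns the number of calls made to the sink.
    static <T> int sendInBatches(Iterator<T> records, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> buffer = new ArrayList<>(batchSize);
        int calls = 0;
        while (records.hasNext()) {
            buffer.add(records.next());
            if (buffer.size() == batchSize) {
                sink.accept(new ArrayList<>(buffer)); // one round trip per full batch
                buffer.clear();
                calls++;
            }
        }
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer)); // flush the final, smaller batch
            calls++;
        }
        return calls;
    }
}
```

Inside foreachPartition you would pass the partition's iterator as records, so five records with a batch size of two cost three external calls instead of five.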

Memory: Lazily Load & Split Data in Partition
For operations such as mapPartitions and foreachPartition, if each partition is big, don't call Guava's Lists.newArrayList(Iterator<String> t), as this loads the whole partition into memory. This is especially costly if you later load more data.
Instead, call Guava's UnmodifiableIterator<List<String>> partitions = Iterators.partition(t, PARTITION_SIZE) to split the partition into smaller lists, which are built lazily as you iterate.
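To show what lazy chunking buys you, here is a hand-written stand-in in plain Java. It is not Guava's implementation; the class name LazyChunks is hypothetical and the sketch only mirrors the behavior described above: each chunk is read from the source iterator at the moment it is requested.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Illustration only: a simplified stand-in for lazy chunking of an iterator.
// At most chunkSize elements are pulled from the source per call to next().
final class LazyChunks<T> implements Iterator<List<T>> {
    private final Iterator<T> source;
    private final int chunkSize;

    LazyChunks(Iterator<T> source, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.source = source;
        this.chunkSize = chunkSize;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public List<T> next() {
        if (!source.hasNext()) {
            throw new NoSuchElementException();
        }
        List<T> chunk = new ArrayList<>(chunkSize);
        // Read only enough elements to fill this chunk; the rest stay unread.
        while (chunk.size() < chunkSize && source.hasNext()) {
            chunk.add(source.next());
        }
        return chunk;
    }
}
```

With this shape, memory holds one chunk at a time rather than the whole partition, as long as the caller does not keep references to earlier chunks.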

Call Async and Run Different RDD Operations in Parallel?
1. On unrelated RDDs: if your Spark cluster can run the operations on these RDDs in parallel (the RDDs are relatively small, so no single RDD uses all the nodes), then calling the async actions (countAsync, foreachAsync, etc.) helps; otherwise it may not.
2. On related RDDs:
If these RDDs have not been materialized, don't do it.
For example:
a = sc.textFile()...other transformations;
a.cache()
b = a.union(cRdd)...other transformations;
a.countAsync()
b.countAsync()
This causes the file to be read twice and the transformations on a to be executed twice, because cache() is lazy and a has not yet been computed when both jobs start.

Combine Operations
Reduce Operations
Where possible, don't run an operation at all.
For example, if you need to call foreachPartition or mapPartitions and also need to know the size of the RDD:
Don't call rdd.count(); instead use an accumulator and call accumulator.add(list.size()) in each partition.
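The idea can be sketched without Spark. In this plain-Java stand-in, a LongAdder plays the role of the accumulator and a list of iterators plays the role of the partitions; the class name SinglePassCount and the method processAndCount are hypothetical. The point is only that the count falls out of the pass you were already making.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

// Plain-Java stand-in, not Spark code: counts records during the same pass
// that processes them, instead of running a second pass just to count.
final class SinglePassCount {
    private SinglePassCount() {}

    static long processAndCount(List<Iterator<String>> partitions) {
        LongAdder seen = new LongAdder(); // plays the role of the accumulator
        partitions.parallelStream().forEach(partition -> {
            long local = 0;
            while (partition.hasNext()) {
                partition.next(); // real per-record work would go here
                local++;
            }
            seen.add(local); // one update per partition, not per record
        });
        return seen.sum();
    }
}
```

One caveat on the Spark side: accumulator updates made inside a transformation such as mapPartitions may be applied more than once if a task or stage is re-executed, so the count is only guaranteed exact when the updates happen inside an action such as foreachPartition.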

Coalesce if needed, for example after filtering out a lot of data, since the remaining partitions may be nearly empty.

Filter Data Early
Don't Call rdd.collect() Unnecessarily
Keep the data in RDDs; don't store big data on the driver node.

Use the Spark UI to Monitor
Shuffle writes, number of partitions, memory usage, when actions start and stop, and whether actions run in parallel.

Use Event History to Compare Performance after a Change
