Spark Optimization Tips

Batch Operations
If you need to call an external resource (a database, Solr), make the calls in batches rather than once per record.
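The batching tip above can be sketched in plain Java with no Spark dependency. This is a minimal illustration, not a definitive implementation: `BatchSender`, `sendInBatches`, and the `sink` callback are hypothetical names, and the sink stands in for whatever bulk call your database or Solr client offers.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

/** Buffers records and hands them to an external sink one batch at a time. */
final class BatchSender {

    private BatchSender() {}

    /**
     * Sends every record to the sink in batches of at most batchSize and
     * returns the number of sink calls made. The sink is a hypothetical
     * stand-in for a bulk DB insert or a bulk Solr add.
     */
    static <T> int sendInBatches(Iterator<T> records, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int calls = 0;
        List<T> buffer = new ArrayList<>(batchSize);
        while (records.hasNext()) {
            buffer.add(records.next());
            if (buffer.size() == batchSize) {
                sink.accept(new ArrayList<>(buffer)); // one round trip per full batch
                buffer.clear();
                calls++;
            }
        }
        if (!buffer.isEmpty()) {
            sink.accept(new ArrayList<>(buffer)); // flush the final partial batch
            calls++;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<Integer> docs = List.of(1, 2, 3, 4, 5, 6, 7);
        int calls = sendInBatches(docs.iterator(), 3,
                batch -> System.out.println("sending " + batch));
        System.out.println("external calls: " + calls);
    }
}
```

With seven records and a batch size of three, the sink is called three times instead of seven. Inside a Spark job, the same loop would typically run once per partition, for example inside foreachPartition.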

Memory: Lazily Load & Split Data in Partition
For operations like mapPartitions and foreachPartition, if each partition is big, don't call Guava's Lists.newArrayList(Iterator<String> t): it loads the whole partition into memory. This hurts most if you then load more data on top of it.
Instead, call Guava's UnmodifiableIterator<List<String>> partitions = Iterators.partition(t, PARTITION_SIZE) to split the partition into smaller lists, which are built lazily, one at a time, as you iterate.
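To show what "lazily" means here, the following is a minimal plain-JDK sketch of the same idea. It is a stand-in written for illustration, not Guava's implementation: `LazyChunks` is a hypothetical class, and in real code Iterators.partition does this job for you. The point is that only one chunk is held in memory at a time, and nothing is read from the source until a chunk is requested.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Plain-JDK stand-in for the lazy chunking that Guava's Iterators.partition provides. */
final class LazyChunks<T> implements Iterator<List<T>> {

    private final Iterator<T> source;
    private final int chunkSize;

    LazyChunks(Iterator<T> source, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.source = source;
        this.chunkSize = chunkSize;
    }

    @Override
    public boolean hasNext() {
        return source.hasNext();
    }

    @Override
    public List<T> next() {
        if (!source.hasNext()) {
            throw new NoSuchElementException();
        }
        // Pull at most chunkSize elements; the rest of the source stays unread.
        List<T> chunk = new ArrayList<>(chunkSize);
        while (chunk.size() < chunkSize && source.hasNext()) {
            chunk.add(source.next());
        }
        return chunk;
    }

    public static void main(String[] args) {
        // Hypothetical partition contents; a real iterator would come from
        // mapPartitions or foreachPartition.
        Iterator<String> partition = List.of("a", "b", "c", "d", "e").iterator();
        Iterator<List<String>> chunks = new LazyChunks<>(partition, 2);
        while (chunks.hasNext()) {
            System.out.println(chunks.next());
        }
    }
}
```

Each chunk becomes garbage as soon as the loop moves on, so peak memory depends on the chunk size rather than the partition size.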

Call Async to Run Operations on Different RDDs in Parallel?
1. On unrelated RDDs: if your Spark cluster can actually run the operations in parallel (the RDDs are relatively small and no single RDD uses all the nodes), then calling the async actions (countAsync, foreachAsync, etc.) helps. Otherwise it may not.
2. On related RDDs:
If these RDDs have not been materialized, don't do it.
For example:
a = sc.textFile()...other transformations;
a.cache()
b = a.union(cRdd)...other transformations;
a.countAsync()
b.countAsync()
Both jobs can start before the cache for a is populated, so the file may be read twice and the transformations on a executed twice. Materialize a first (for example with a blocking a.count()), then run the async actions.

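Combine or Reduce Operations
Better yet, don't run them at all.
For example, if you need to call foreachPartition or mapPartitions and also need to know the size of the RDD:
Don't call rdd.count(), which runs another job over the data. Instead, use an accumulator and call accumulator.add(list.size()) in each partition. Note that accumulator updates are only guaranteed to be applied exactly once when made inside an action such as foreachPartition; inside a transformation such as mapPartitions they can be applied again if a task is re-run.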
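The counting idea can be sketched in plain Java without a Spark dependency. This is only an illustration under stated assumptions: `CountWhileProcessing` and its methods are hypothetical names, a `LongAdder` stands in for a Spark accumulator, and a list of lists stands in for an RDD's partitions. What carries over is the pattern: count the records during the pass you already make, and add the total once per partition.

```java
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Consumer;

/** Counts records during the processing pass instead of in a second pass. */
final class CountWhileProcessing {

    private CountWhileProcessing() {}

    /** Runs the work on each record and adds this partition's size to the counter. */
    static void processPartition(Iterator<String> partition, Consumer<String> work,
                                 LongAdder counter) {
        long seen = 0;
        while (partition.hasNext()) {
            work.accept(partition.next());
            seen++;
        }
        counter.add(seen); // one update per partition, not one per record
    }

    /** Processes all partitions in parallel and returns the total record count. */
    static long processAll(List<List<String>> partitions, Consumer<String> work) {
        LongAdder counter = new LongAdder(); // stand-in for a Spark accumulator
        partitions.parallelStream()
                  .forEach(p -> processPartition(p.iterator(), work, counter));
        return counter.sum();
    }

    public static void main(String[] args) {
        List<List<String>> partitions =
                List.of(List.of("a", "b"), List.of("c"), List.<String>of());
        long total = processAll(partitions, record -> { /* placeholder work */ });
        System.out.println("records processed: " + total);
    }
}
```

The total is available as soon as processing finishes, with no separate counting pass over the data.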

Coalesce if needed, for example after a filter has removed a lot of data and left many small or empty partitions.

Filter Data Early

Don't Call collect() Unnecessarily
Keep the data in RDDs; don't store big data on the driver node.

Use the Spark UI to Monitor
Check shuffle writes, the number of partitions, memory usage, when actions start and stop, and whether actions run in parallel.

Use Event History to Compare Performance after a Change