Batch Operations
If we need to call an external resource (a database, Solr, etc.), make the calls in batches rather than one per record.
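As a minimal sketch of the idea, using Guava's Lists.partition and a hypothetical externalStore client (get/multiGet and BATCH_SIZE are illustrative names, not a real API):

import com.google.common.collect.Lists;

// Anti-pattern: one network round trip per key
for (String key : keys) {
    results.add(externalStore.get(key));
}

// Batched: one round trip per BATCH_SIZE keys
for (List<String> batch : Lists.partition(keys, BATCH_SIZE)) {
    results.addAll(externalStore.multiGet(batch));
}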
Memory: Lazily Load and Split Data in a Partition
For operations like mapPartitions and foreachPartition, if each partition is big, don't call Guava's Lists.newArrayList(Iterator<String> t), as this loads the whole partition into memory at once, which hurts especially if you load more data for each record later.
Instead, call Guava's Iterators.partition(t, PARTITION_SIZE), which returns an UnmodifiableIterator<List<String>> and builds each smaller list lazily, one batch at a time.
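For instance, inside foreachPartition (a sketch; PARTITION_SIZE and sendBatch are placeholders):

import com.google.common.collect.Iterators;
import com.google.common.collect.UnmodifiableIterator;
import java.util.List;

rdd.foreachPartition(t -> {
    // partition(...) wraps the iterator; each List is only built when next() is called
    UnmodifiableIterator<List<String>> batches = Iterators.partition(t, PARTITION_SIZE);
    while (batches.hasNext()) {
        List<String> batch = batches.next(); // at most PARTITION_SIZE records in memory
        sendBatch(batch);                    // hypothetical external call
    }
});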
Call Async and Run Different RDD Operations in Parallel?
1. On different, unrelated RDDs: if your Spark cluster can actually run multiple jobs in parallel (the RDDs are relatively small, so no single RDD's job occupies every node), then async actions (countAsync, foreachAsync, etc.) help; otherwise they may not.
2. On related RDDs: if these RDDs have not been materialized yet, don't do it.
For example:
JavaRDD<String> a = sc.textFile(...).map(...);   // plus other transformations
a.cache();
JavaRDD<String> b = a.union(cRdd).map(...);      // plus other transformations
a.countAsync();
b.countAsync();
This causes the file to be read twice and the transformations on a to be executed twice: a.cache() only marks a for caching, and neither async action finds a populated cache because both start before either one has materialized a.
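A sketch of the safer ordering with the Java API (the path and transformations are placeholders): run one blocking action to populate the cache first, then fire the async actions.

JavaRDD<String> a = sc.textFile("hdfs://.../input").map(String::trim); // placeholder transformations
a.cache();
a.count();                                  // blocking action: reads the file and fills the cache once
JavaRDD<String> b = a.union(cRdd);
JavaFutureAction<Long> fa = a.countAsync(); // served from the cache
JavaFutureAction<Long> fb = b.countAsync(); // only cRdd's lineage is computed from scratch
long countA = fa.get();                     // get() may throw; handle or declare as needed
long countB = fb.get();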
Combine Operations
Reduce Operations
Better yet, don't do them at all:
For example, if you already call foreachPartition or mapPartitions and also need to know the size of the RDD:
Don't call rdd.count(), which is an extra action; instead, use an accumulator and add each partition's record count to it inside the pass you are already making.
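A sketch with Spark 2.x's LongAccumulator (process is a placeholder for the real per-record work):

LongAccumulator size = sc.sc().longAccumulator("rddSize");
rdd.foreachPartition(it -> {
    long n = 0;
    while (it.hasNext()) {
        process(it.next()); // the work you were doing anyway
        n++;
    }
    size.add(n);            // one accumulator update per partition
});
long total = size.value();  // reliable once the action above has completed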
Coalesce if Needed: After Filtering Out a Lot of Data
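After a selective filter, many partitions are nearly empty, so later tasks are mostly overhead. For example (the predicate and target partition count are illustrative):

JavaRDD<String> filtered = rdd.filter(line -> line.contains("ERROR")); // keeps only a small fraction
// Same data, far fewer (and fuller) partitions; coalesce avoids a shuffle when shrinking
JavaRDD<String> compacted = filtered.coalesce(16);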
Filter Data Early
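The point is to drop data before any shuffle or join, not after. A sketch (events, users, and isRecent are hypothetical):

// Bad: the join shuffles rows that the filter would have dropped anyway
events.join(users).filter(kv -> isRecent(kv._2()._1()));

// Good: filter first, so far less data crosses the network
events.filter(kv -> isRecent(kv._2())).join(users);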
Don't Call collect() Unnecessarily
Keep data in RDDs; don't pull big datasets onto the driver node.
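Some alternatives (the output path is illustrative):

// Bad: pulls the entire RDD into driver memory
List<String> everything = rdd.collect();

// Better: bring back only a small sample to inspect on the driver...
List<String> sample = rdd.take(100);

// ...and let the executors write the full result directly to storage
rdd.saveAsTextFile("hdfs://.../output");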
Use Spark UI to Monitor
Watch shuffle writes, the number of partitions, memory usage, when actions start and stop, and whether actions actually run in parallel.
Use the Event History to Compare Performance After a Change