RDD - Resilient Distributed Datasets
immutable, partitioned, lazily evaluated, fault-tolerant
transformation vs action
Dataset - statically typed
DataFrame = Dataset[Row]
df.as(Encoders.bean(classOf[ModelClass]))
PairRDDFunctions, OrderedRDDFunctions and GroupedRDDFunctions
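A minimal sketch of the lazy-evaluation model, assuming an existing SparkSession named spark:

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))
val doubled = rdd.map(_ * 2)         // transformation: nothing runs yet
val filtered = doubled.filter(_ > 4) // still lazy
val result = filtered.collect()      // action: the whole DAG executes here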
val spark = SparkSession.builder().master("local[*]").getOrCreate()
local uses 1 thread.
local[N] uses N threads.
local[*] uses as many threads as there are cores.
local[N, M] and local[*, M] additionally set M as the maximum number of allowed task failures (spark.task.maxFailures).
DAGScheduler, TaskScheduler
Driver, Worker
- Understand what code is executing on the driver vs. on the workers.
Datasets, DataFrames
- type-safe, object-oriented programming interface
Read Compressed File
Spark can read a compressed file as long as it ends with .gz (created by gzip), but not .tar.gz - link
When reading a .gz file, Spark will produce an RDD with a single partition, because a gzipped file is not splittable. You may need to repartition the RDD.
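A minimal sketch, assuming an existing SparkSession named spark (the path and partition count are placeholders):

val lines = spark.sparkContext.textFile("data/input.csv.gz") // gzipped: loads as 1 partition
val repartitioned = lines.repartition(8)                     // shuffle into 8 partitions for parallelism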
Understand what's executed on the Driver vs. Workers
Don't collect huge data into the driver
Serialization Errors
- Create the non-serializable object in the worker
- Use foreachPartition to create only one non-serializable object per partition.
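A minimal sketch of the foreachPartition pattern; HttpClient here stands in for a hypothetical non-serializable resource:

rdd.foreachPartition { records =>
  val client = new HttpClient()        // hypothetical; created on the worker, once per partition
  records.foreach(r => client.send(r)) // reused for every record in the partition
  client.close()
}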
-- groupByKey doesn't do a map-side combine on each node; prefer reduceByKey or aggregateByKey where possible, as in the sketch below.
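A minimal sketch of the difference, assuming an existing SparkSession named spark:

val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _)             // combines on each node before the shuffle
val grouped = pairs.groupByKey().mapValues(_.sum) // ships every value across the network first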
Logging in spark executor code
import org.slf4j.LoggerFactory
@transient lazy val logger = LoggerFactory.getLogger(getClass)
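The @transient lazy val pattern keeps the non-serializable logger out of the serialized closure and re-creates it on each executor. A minimal sketch; Processor is a hypothetical class whose instances get shipped to executors:

class Processor extends Serializable {
  // @transient: excluded from serialization; lazy: re-created on first use on each executor
  @transient lazy val logger = LoggerFactory.getLogger(getClass)

  def process(x: Int): Int = {
    logger.info(s"processing $x") // appears in the executor logs, not the driver's
    x * 2
  }
}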
Passing Functions to Spark
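A minimal sketch of the two styles recommended by the Spark programming guide, assuming an existing RDD named rdd:

object MyFunctions {
  def addOne(x: Int): Int = x + 1
}

val viaAnonymous = rdd.map(x => x + 1)         // anonymous function syntax
val viaSingleton = rdd.map(MyFunctions.addOne) // method in a global singleton object

Pitfall: referencing a method or field of an enclosing class (e.g. rdd.map(x => x + this.offset)) serializes the whole enclosing object; copy the field into a local val first.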
cache vs persist
cache = persist(MEMORY_ONLY)
use StorageLevel.MEMORY_AND_DISK if there is not enough memory
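A minimal sketch, assuming an existing RDD named rdd; expensiveTransform is a placeholder for costly per-record work:

import org.apache.spark.storage.StorageLevel

val processed = rdd.map(expensiveTransform) // expensiveTransform: hypothetical
processed.persist(StorageLevel.MEMORY_AND_DISK)
processed.count()  // first action materializes the cache
processed.take(10) // served from cached partitions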
Spark-CSV
Check CSVOptions.scala for all options supported by spark-csv.
To deal with double quotes in CSV, change the "escape" option from the default backslash (\) to a double quote (")
val csvInput = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("escape", "\"")
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .csv(csvFile)
WebUI
http://localhost:4040/jobs/
Using Either to handle good/bad data
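A minimal sketch, assuming lines is an RDD[String] of numeric records:

val parsed = lines.map { line =>
  try Right(line.trim.toInt)
  catch { case e: NumberFormatException => Left((line, e.getMessage)) }
}
parsed.cache() // avoid re-parsing when reading it twice below
val good = parsed.collect { case Right(v) => v }    // RDD of parsed values
val bad  = parsed.collect { case Left(err) => err } // RDD of (line, error) pairs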
Concepts
Driver, Executors, Cluster Managers
Narrow Dependencies
map, filter, mapPartitions and flatMap, coalesce
Wide Dependencies
sort, reduceByKey, groupByKey, join, and anything that calls for repartition/shuffle
A job is defined by an action; wide transformations break jobs into stages.
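A minimal sketch, assuming rdd is an RDD[String] of words: one action defines one job, and the wide reduceByKey splits it into two stages.

val counts = rdd
  .map(word => (word, 1)) // narrow: stays in stage 1
  .reduceByKey(_ + _)     // shuffle boundary: stage 2 starts here
counts.collect()          // the action that defines the job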