RDD - Resilient Distributed Datasets
immutable, partitioned, lazily evaluated, fault-tolerant
transformation vs action
Dataset - statically typed
DataFrame = Dataset[Row]
df.as(Encoders.bean(classOf[ModelClass]))
PairRDDFunctions, OrderedRDDFunctions and GroupedRDDFunctions
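A minimal sketch of the lazy-evaluation model, assuming an existing SparkSession named spark:

val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))
val doubled = rdd.map(_ * 2)         // transformation: nothing runs yet
val filtered = doubled.filter(_ > 4) // still lazy
val result = filtered.collect()      // action: the whole DAG executes here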
val spark = SparkSession.builder().master("local[*]").getOrCreate()
local uses 1 thread.
local[N] uses N threads.
local[*] uses as many threads as there are cores.
local[N, M] and local[*, M] additionally set M as the maximum number of allowed task failures (spark.task.maxFailures).
DAGScheduler, TaskScheduler
Driver, Worker
- Understand what code is executing on the driver vs. on the workers.
Datasets, DataFrames
- type-safe, object-oriented programming interface
Read Compressed File
Spark can read a compressed file as long as it ends with .gz (created by gzip), but not .tar.gz - link
When reading a .gz file, Spark will produce an RDD with a single partition, because a gzipped file is not splittable. You may need to repartition the RDD.
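A minimal sketch, assuming an existing SparkSession named spark (the path and partition count are placeholders):

val lines = spark.sparkContext.textFile("data/input.csv.gz") // gzipped: loads as 1 partition
val repartitioned = lines.repartition(8)                     // shuffle into 8 partitions for parallelism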
Understand what's executed on the Driver vs. Workers
Don't collect huge data into the driver
Serialization Errors
- Create the non-serializable object in the worker
- Use foreachPartition to create only one non-serializable object per partition.
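A minimal sketch of the foreachPartition pattern; HttpClient here stands in for a hypothetical non-serializable resource:

rdd.foreachPartition { records =>
  val client = new HttpClient()        // hypothetical; created on the worker, once per partition
  records.foreach(r => client.send(r)) // reused for every record in the partition
  client.close()
}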
-- groupByKey doesn't do a map-side combine on each node; prefer reduceByKey or aggregateByKey where possible, as in the sketch below.
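A minimal sketch of the difference, assuming an existing SparkSession named spark:

val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val summed = pairs.reduceByKey(_ + _)             // combines on each node before the shuffle
val grouped = pairs.groupByKey().mapValues(_.sum) // ships every value across the network first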
Logging in spark executor code
import org.slf4j.LoggerFactory
@transient lazy val logger = LoggerFactory.getLogger(getClass)
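The @transient lazy val pattern keeps the non-serializable logger out of the serialized closure and re-creates it on each executor. A minimal sketch; Processor is a hypothetical class whose instances get shipped to executors:

class Processor extends Serializable {
  // @transient: excluded from serialization; lazy: re-created on first use on each executor
  @transient lazy val logger = LoggerFactory.getLogger(getClass)

  def process(x: Int): Int = {
    logger.info(s"processing $x") // appears in the executor logs, not the driver's
    x * 2
  }
}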
Passing Functions to Spark
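A minimal sketch of the two styles recommended by the Spark programming guide, assuming an existing RDD named rdd:

object MyFunctions {
  def addOne(x: Int): Int = x + 1
}

val viaAnonymous = rdd.map(x => x + 1)         // anonymous function syntax
val viaSingleton = rdd.map(MyFunctions.addOne) // method in a global singleton object

Pitfall: referencing a method or field of an enclosing class (e.g. rdd.map(x => x + this.offset)) serializes the whole enclosing object; copy the field into a local val first.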
cache vs persist
cache = persist(MEMORY_ONLY)
use StorageLevel.MEMORY_AND_DISK if there is not enough memory
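A minimal sketch, assuming an existing RDD named rdd; expensiveTransform is a placeholder for costly per-record work:

import org.apache.spark.storage.StorageLevel

val processed = rdd.map(expensiveTransform) // expensiveTransform: hypothetical
processed.persist(StorageLevel.MEMORY_AND_DISK)
processed.count()  // first action materializes the cache
processed.take(10) // served from cached partitions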
Spark-CSV
Check CSVOptions.scala for all options supported by spark-csv.
To deal with double quotes in CSV, change the "escape" option from the default backslash (\) to a double quote (")
val csvInput = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("escape", "\"")
  .option("ignoreLeadingWhiteSpace", "true")
  .option("ignoreTrailingWhiteSpace", "true")
  .csv(csvFile)
WebUI
http://localhost:4040/jobs/
Using Either to handle good/bad data
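A minimal sketch, assuming lines is an RDD[String] of numeric records:

val parsed = lines.map { line =>
  try Right(line.trim.toInt)
  catch { case e: NumberFormatException => Left((line, e.getMessage)) }
}
parsed.cache() // avoid re-parsing when reading it twice below
val good = parsed.collect { case Right(v) => v }    // RDD of parsed values
val bad  = parsed.collect { case Left(err) => err } // RDD of (line, error) pairs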
Concepts
Driver, Executors, Cluster Managers
Narrow Dependencies
map, filter, mapPartitions and flatMap, coalesce
Wide Dependencies
sort, reduceByKey, groupByKey, join, and anything that calls for repartition/shuffle
A job is defined by an action; wide transformations break jobs into stages.
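A minimal sketch, assuming rdd is an RDD[String] of words: one action defines one job, and the wide reduceByKey splits it into two stages.

val counts = rdd
  .map(word => (word, 1)) // narrow: stays in stage 1
  .reduceByKey(_ + _)     // shuffle boundary: stage 2 starts here
counts.collect()          // the action that defines the job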