My Journey in Spark, Part 1: Basic


RDD - Resilient Distributed Datasets

immutable, partitioned, lazily evaluated, fault-tolerant

transformation vs action
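
A minimal sketch of the difference, assuming a SparkSession named spark: transformations like filter are lazy and only build the lineage, while actions like count actually run the job.

val nums = spark.sparkContext.parallelize(1 to 10)
val evens = nums.filter(_ % 2 == 0)   // transformation: nothing executes yet
val n = evens.count()                 // action: triggers the actual computation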

Dataset - statically typed
DataFrame = Dataset[Row]
df.as(Encoders.bean(classOf[ModelClass]))
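
A minimal sketch of going from a DataFrame to a typed Dataset. Encoders.bean is for Java beans; for a Scala case class the implicit encoders from spark.implicits._ do the same job. Person and the sample data here are just illustrations.

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

case class Person(name: String, age: Long)

val spark = SparkSession.builder().master("local[*]").appName("ds-demo").getOrCreate()
import spark.implicits._

val df: DataFrame = Seq(("Ann", 30L), ("Bob", 25L)).toDF("name", "age")
val ds: Dataset[Person] = df.as[Person]   // DataFrame (Dataset[Row]) -> typed Dataset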

PairRDDFunctions, OrderedRDDFunctions and GroupedRDDFunctions

SparkSession.builder().master("local[*]")
local uses 1 thread.
local[N] uses N threads.
local[*] uses as many threads as there are cores.
local[N, M] and local[*, M] additionally set the maximum number of task failures to M.
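
A small sketch of building a local-mode session (the app name is arbitrary):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[4,2]")    // 4 worker threads, up to 2 task failures
  .appName("local-demo")
  .getOrCreate()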

DAGScheduler, TaskScheduler
Driver, Worker
- Understand what code executes on the driver vs. on the workers.
Datasets, DataFrames
- type-safe, object-oriented programming interface

Read Compressed File
Spark can read a compressed file as long as it ends with .gz (created by gzip), not .tar.gz.


When reading a .gz file, Spark gives the RDD only one partition, because a gzipped file is not splittable. You may need to repartition the RDD.
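
A sketch, with a hypothetical path and partition count:

val lines = spark.sparkContext.textFile("logs/big.log.gz")   // single partition: gzip is not splittable
println(lines.getNumPartitions)                              // 1
val spread = lines.repartition(16)                           // shuffle the data across 16 partitions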

Understand what's executed on the Driver vs. Workers
Don't collect huge datasets on the driver.
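
For example, prefer a bounded take (or show for DataFrames) over collect when the result may be large; the input path here is hypothetical.

val df = spark.read.parquet("events.parquet")
df.take(20).foreach(println)   // bounded sample pulled to the driver
// df.collect()                // risky: materializes the whole dataset in driver memory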

Serialization Errors
- Create the non-serializable object on the worker, not on the driver.
- Use foreachPartition to create only one non-serializable object per partition (see the sketch below).
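
A sketch of the foreachPartition pattern; rdd is an assumed RDD and ExpensiveClient is a stand-in for any non-serializable resource (HTTP client, DB connection):

rdd.foreachPartition { records =>
  val client = new ExpensiveClient()     // hypothetical resource, built on the executor once per partition
  records.foreach(r => client.send(r))   // reused for every record in the partition
  client.close()
}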

-- groupByKey doesn't do a map-side combine on each node (unlike reduceByKey).
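
A quick comparison, assuming an RDD of (word, 1) pairs named pairs:

val counts  = pairs.reduceByKey(_ + _)   // map-side combine: partial sums before the shuffle
val grouped = pairs.groupByKey()         // every value crosses the network, then gets grouped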

Logging in Spark executor code
import org.slf4j.LoggerFactory
// @transient + lazy: the logger is not serialized with the closure; each executor builds its own
@transient lazy val logger = LoggerFactory.getLogger(getClass)

Passing Functions to Spark
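
A minimal sketch of the two usual ways, per the Spark programming guide: an anonymous function, or a function defined in a singleton object so no enclosing instance has to be serialized. lines is an assumed RDD[String].

object TextUtils {
  def clean(s: String): String = s.trim.toLowerCase   // only this function is shipped to executors
}

val cleaned = lines.map(TextUtils.clean)   // named function from a singleton object
val lengths = lines.map(_.length)          // anonymous function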

cache vs persist
cache = persist(MEMORY_ONLY)

Use StorageLevel.MEMORY_AND_DISK if there is not enough memory.
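
A sketch, with a hypothetical input path:

import org.apache.spark.storage.StorageLevel

val events = spark.read.parquet("events.parquet")
events.persist(StorageLevel.MEMORY_AND_DISK)   // spill partitions to disk instead of recomputing them
events.count()                                 // the first action materializes the cache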

Spark-CSV
Check CSVOptions.scala for all options supported by spark-csv.
To deal with double quotes in CSV, change "escape" from the default \ to "

val csvInput = spark.read
  .option("header", "true")
  .option("inferschema", true)
  .option("escape", "\"")
  .option("ignoreLeadingWhiteSpace", true)
  .option("ignoreTrailingWhiteSpace ", true)
  .csv(csvFile)

WebUI
http://localhost:4040/jobs/
Using Either to handle good/bad data
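
A sketch, assuming an RDD[String] named lines where each line should be an integer:

val parsed = lines.map { line =>
  try Right(line.trim.toInt)
  catch { case _: NumberFormatException => Left(s"bad record: $line") }
}
val good = parsed.collect { case Right(v)  => v }     // RDD of parsed values
val bad  = parsed.collect { case Left(msg) => msg }   // RDD of error messages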

Concepts
Driver, Executors, Cluster Managers

Narrow Dependencies
map, filter, mapPartitions, flatMap, coalesce
Wide Dependencies
sort, reduceByKey, groupByKey, join, and anything that calls for a repartition/shuffle

A job is triggered by an action; wide transformations break a job into stages.
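
For example, in a classic word count the count() action defines one job, and the reduceByKey shuffle splits it into two stages (the input path is hypothetical):

val counts = spark.sparkContext.textFile("input.txt")   // stage 1: narrow transformations
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)                                    // wide dependency -> stage boundary
counts.count()                                           // the action triggers the job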
