My Journey in Spark, Part 1: Basic

RDD - Resilient Distributed Datasets

immutable, partitioned, lazily evaluated, fault-tolerant

transformation vs action
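
A minimal sketch of the difference, assuming a local SparkSession (the object name, app name and accumulator name are made up for illustration). The map is a transformation and only records lineage; nothing runs until the reduce action is called.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvalSketch {
  // Returns how many elements map() had processed before and after the action.
  def run(): (Long, Long) = {
    val spark = SparkSession.builder().master("local[2]").appName("lazy-eval-sketch").getOrCreate()
    val sc = spark.sparkContext
    val touched = sc.longAccumulator("touched") // counts elements seen by map()

    // Transformation: builds the lineage, does not execute anything yet.
    val doubled = sc.parallelize(1 to 10).map { x => touched.add(1); x * 2 }
    val before = touched.sum

    // Action: triggers the job, so map() runs now.
    doubled.reduce(_ + _)
    val after = touched.sum

    spark.stop()
    (before, after)
  }
}
```

Accumulator updates made inside a transformation can be applied more than once if a task is retried, so treat the counter as a demo aid only.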

Dataset - statically typed
DataFrame = Dataset[Row]
df.as(Encoders.bean(classOf[ModelClass]))
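
For Scala case classes the encoder can come from spark.implicits instead of Encoders.bean. A small sketch, assuming a hypothetical Person model and made-up sample rows:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical model class, for illustration only.
case class Person(name: String, age: Int)

object TypedDatasetSketch {
  def run(): Seq[String] = {
    val spark = SparkSession.builder().master("local[2]").appName("typed-dataset-sketch").getOrCreate()
    import spark.implicits._ // provides Encoder[Person] and toDF

    val df = Seq(("Ann", 31), ("Bob", 17)).toDF("name", "age") // untyped: Dataset[Row]
    val people: Dataset[Person] = df.as[Person]                  // typed view of the same data

    // Field access is checked at compile time.
    val adults = people.filter(_.age >= 18).map(_.name).collect().toSeq
    spark.stop()
    adults
  }
}
```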

PairRDDFunctions, OrderedRDDFunctions and GroupedRDDFunctions

SparkSession.builder().master("local[*]")
local uses 1 thread.
local[N] uses N threads.
local[*] uses as many threads as there are cores.
local[N, M] and local[*, M]: same as above, with M as the maximum number of task failures.

DAGScheduler, TaskScheduler
Driver, Worker
- Understand what code is executing in the driver or in a worker.
Datasets, DataFrames
- type-safe, object-oriented programming interface

Read Compressed File
Spark can read a compressed file as long as its name ends with .gz (created by gzip), but not .tar.gz.

When reading a .gz file, Spark gives you an RDD with only one partition, because a gzipped file is not splittable. You may need to repartition the RDD.
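
A sketch of the single-partition behavior, assuming a local SparkSession. It writes its own small temp .gz file (made-up sample data), so the path and the target of 4 partitions are illustrative choices.

```scala
import java.io.FileOutputStream
import java.nio.file.Files
import java.util.zip.GZIPOutputStream
import org.apache.spark.sql.SparkSession

object GzipReadSketch {
  // Returns (partitions after read, partitions after repartition, line count).
  def run(): (Int, Int, Long) = {
    // Create a small gzip file with 1000 lines of sample data.
    val path = Files.createTempFile("sample", ".gz")
    val out = new GZIPOutputStream(new FileOutputStream(path.toFile))
    try out.write((1 to 1000).mkString("\n").getBytes("UTF-8")) finally out.close()

    val spark = SparkSession.builder().master("local[2]").appName("gzip-read-sketch").getOrCreate()
    val lines = spark.sparkContext.textFile(path.toString)
    val before = lines.getNumPartitions // gzip is not splittable

    // repartition shuffles the data so later stages can run in parallel.
    val spread = lines.repartition(4)
    val result = (before, spread.getNumPartitions, spread.count())
    spark.stop()
    result
  }
}
```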

Understand what's executed on the Driver vs. Workers
Don't collect huge datasets to the driver

Serialization Errors
- Create the Non-Serializable object in the worker
- Use foreachPartition, create only one Non-Serializable object per partition.
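
A sketch of the foreachPartition pattern. FakeClient is a hypothetical stand-in for a non-serializable resource such as a database client; the accumulators exist only to make the effect visible.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical non-serializable resource (stands in for a DB or HTTP client).
class FakeClient {
  def send(record: String): Unit = () // no-op in this sketch
  def close(): Unit = ()
}

object ForeachPartitionSketch {
  // Returns (clients created, records sent).
  def run(): (Long, Long) = {
    val spark = SparkSession.builder().master("local[2]").appName("foreach-partition-sketch").getOrCreate()
    val sc = spark.sparkContext
    val clientsCreated = sc.longAccumulator("clientsCreated")
    val recordsSent = sc.longAccumulator("recordsSent")

    sc.parallelize(1 to 100, numSlices = 4).map(_.toString).foreachPartition { records =>
      // Created inside the task, so it is never serialized from the driver.
      val client = new FakeClient
      clientsCreated.add(1)
      try records.foreach { r => client.send(r); recordsSent.add(1) }
      finally client.close()
    }

    val result = (clientsCreated.sum, recordsSent.sum)
    spark.stop()
    result
  }
}
```

With 4 partitions this creates 4 clients for 100 records, instead of one client per record.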

- groupByKey doesn't combine values on each node before the shuffle (reduceByKey does)
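
A small sketch comparing the two on made-up pairs, assuming a local SparkSession. Both give the same sums; the difference is how much data crosses the shuffle.

```scala
import org.apache.spark.sql.SparkSession

object ReduceByKeySketch {
  def run(): (Map[String, Int], Map[String, Int]) = {
    val spark = SparkSession.builder().master("local[2]").appName("reduce-by-key-sketch").getOrCreate()
    val sc = spark.sparkContext
    val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3, "b" -> 4, "a" -> 5), 2)

    // groupByKey: every (key, value) pair is shuffled, then summed.
    val viaGroup = pairs.groupByKey().mapValues(_.sum).collect().toMap

    // reduceByKey: values are summed per partition first, then partial sums are shuffled.
    val viaReduce = pairs.reduceByKey(_ + _).collect().toMap

    spark.stop()
    (viaGroup, viaReduce)
  }
}
```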

Logging in spark executor code
import org.slf4j.LoggerFactory
@transient lazy val logger = LoggerFactory.getLogger(getClass)
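
A sketch of how that logger declaration might sit in a class whose instances are shipped to executors. Parser is a hypothetical helper; the point is that @transient keeps the logger out of serialization and lazy re-creates it on first use in each JVM.

```scala
import org.apache.spark.sql.SparkSession
import org.slf4j.{Logger, LoggerFactory}

// Hypothetical helper; instances are serialized into tasks.
class Parser extends Serializable {
  // Not serialized with the instance; initialized again on first use after deserialization.
  @transient private lazy val logger: Logger = LoggerFactory.getLogger(getClass)

  def parse(raw: String): Int = {
    logger.debug(s"parsing $raw")
    raw.trim.toInt
  }
}

object ExecutorLoggingSketch {
  def run(): Int = {
    val spark = SparkSession.builder().master("local[2]").appName("executor-logging-sketch").getOrCreate()
    val parser = new Parser
    // The closure captures parser, so it is serialized and sent to the tasks.
    val total = spark.sparkContext.parallelize(Seq("1", " 2", "3 ")).map(parser.parse).collect().sum
    spark.stop()
    total
  }
}
```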

Passing Functions to Spark

cache vs persist
cache = persist(MEMORY_ONLY) (for RDDs)

Use StorageLevel.MEMORY_AND_DISK if there is not enough memory
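
A sketch that checks the storage levels on two throwaway RDDs, assuming a local SparkSession. Two RDDs are used because the level of an RDD cannot be changed once it is set.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Returns (level after cache(), level after persist(MEMORY_AND_DISK)).
  def run(): (StorageLevel, StorageLevel) = {
    val spark = SparkSession.builder().master("local[2]").appName("persist-sketch").getOrCreate()
    val sc = spark.sparkContext

    val cached = sc.parallelize(1 to 1000).map(_ * 2).cache()
    val spilled = sc.parallelize(1 to 1000).map(_ * 3).persist(StorageLevel.MEMORY_AND_DISK)

    // Persistence is lazy too: data is stored when the first action runs.
    cached.count()
    spilled.count()

    val result = (cached.getStorageLevel, spilled.getStorageLevel)
    cached.unpersist()
    spilled.unpersist()
    spark.stop()
    result
  }
}
```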

Spark-CSV
Check CSVOptions.scala for all options supported by spark-csv.
To handle double quotes in CSV: change "escape" from the default \ to "

// spark is an existing SparkSession; csvFile is the path to the input CSV
val csvInput = spark.read
  .option("header", "true")
  .option("inferSchema", true)
  .option("escape", "\"")
  .option("ignoreLeadingWhiteSpace", true)
  .option("ignoreTrailingWhiteSpace", true)
  .csv(csvFile)

WebUI
http://localhost:4040/jobs/

Using Either to handle good/bad data
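
One way this might look, as a sketch: each record becomes Right(parsed value) or Left(raw input), and the two sides are split afterwards. The parse function and sample data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object EitherSketch {
  // Hypothetical parser: Right for good records, Left keeps the bad raw input.
  def parse(raw: String): Either[String, Int] =
    try Right(raw.trim.toInt)
    catch { case _: NumberFormatException => Left(raw) }

  def run(): (Seq[Int], Seq[String]) = {
    val spark = SparkSession.builder().master("local[2]").appName("either-sketch").getOrCreate()
    val parsed = spark.sparkContext.parallelize(Seq("1", "2", "oops", "4")).map(parse).cache()

    // RDD.collect(partialFunction) filters and maps in one step.
    val good = parsed.collect { case Right(n) => n }.collect().toSeq
    val bad = parsed.collect { case Left(s) => s }.collect().toSeq

    spark.stop()
    (good, bad)
  }
}
```

Caching the parsed RDD avoids parsing the input twice, since it is traversed once for each side.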

Concepts
Driver, Executors, Cluster Managers

Narrow Dependencies
map, filter, mapPartitions, flatMap, coalesce
Wide Dependencies
sort, reduceByKey, groupByKey, join, and anything that calls for a repartition/shuffle

A job is defined by an action; wide transformations break jobs into stages.
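
The dependency type of an RDD can be inspected directly. A sketch, assuming a local SparkSession and made-up data: map yields a narrow dependency, while reduceByKey yields a shuffle (wide) dependency, which is where a stage boundary falls.

```scala
import org.apache.spark.{NarrowDependency, ShuffleDependency}
import org.apache.spark.sql.SparkSession

object DependencySketch {
  // Returns (map is narrow, reduceByKey is wide).
  def run(): (Boolean, Boolean) = {
    val spark = SparkSession.builder().master("local[2]").appName("dependency-sketch").getOrCreate()
    val words = spark.sparkContext.parallelize(Seq("a", "b", "a"), 2)

    val mapped = words.map(w => (w, 1))     // each output partition reads one parent partition
    val reduced = mapped.reduceByKey(_ + _) // needs data from all parent partitions

    val mapIsNarrow = mapped.dependencies.forall(_.isInstanceOf[NarrowDependency[_]])
    val reduceIsWide = reduced.dependencies.exists(_.isInstanceOf[ShuffleDependency[_, _, _]])

    spark.stop()
    (mapIsNarrow, reduceIsWide)
  }
}
```

Calling reduced.toDebugString prints the same lineage, with indentation marking the shuffle boundary.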