Spark Basic Statistics - Using Scala

Summary statistics
colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
Test data:
1 2 3
10 20 30
100 200 300

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
  
val data = sc.textFile("E:/jeffery/src/ML/data/statistics.txt").cache()
// Parse each space-separated line into a dense vector of doubles
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
val summary: MultivariateStatisticalSummary = Statistics.colStats(parsedData)
println(summary.count)
println(summary.min)
println(summary.max)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
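
To sanity-check what colStats() reports, the same column-wise values can be worked out by hand for the three test rows. This is a plain-Scala sketch that needs no Spark; the names (rows, columns, colMean and so on) are made up for illustration, and it assumes the variance is the unbiased sample variance (divide by n - 1), which is worth verifying against your Spark version.

```scala
// Plain-Scala sketch (no Spark needed); `rows` mirrors the test data above.
val rows: Seq[Array[Double]] = Seq(
  Array(1.0, 2.0, 3.0),
  Array(10.0, 20.0, 30.0),
  Array(100.0, 200.0, 300.0)
)

// Turn rows into columns so each statistic is computed per column.
val columns: Seq[Seq[Double]] = rows.map(_.toSeq).transpose

val count: Long = rows.size.toLong
val colMin: Seq[Double] = columns.map(_.min)
val colMax: Seq[Double] = columns.map(_.max)
val colMean: Seq[Double] = columns.map(c => c.sum / c.size)

// Assumption: unbiased sample variance, i.e. squared deviations divided by n - 1.
val colVariance: Seq[Double] = columns.zip(colMean).map { case (c, m) =>
  c.map(x => (x - m) * (x - m)).sum / (c.size - 1)
}

// Number of entries in each column that are not zero.
val colNumNonzeros: Seq[Int] = columns.map(_.count(_ != 0.0))
```

For this data the column means come out as 37.0, 74.0 and 111.0, and under the n - 1 assumption the variances are 2997.0, 11988.0 and 26973.0.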


Stratified sampling

The stratified sampling methods, sampleByKey and sampleByKeyExact, can be performed on RDDs of key-value pairs.

The sampleByKey method flips a coin for each observation to decide whether it is sampled, so it needs only one pass over the data and provides an expected sample size. sampleByKeyExact requires significantly more resources than the per-stratum simple random sampling used in sampleByKey, but provides the exact sample size with 99.99% confidence.
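
The coin flip described above can be sketched in plain Scala without Spark. coinFlipSampleByKey is a hypothetical helper written for this note, not Spark's implementation: it runs one Bernoulli trial per record using that record's per-key fraction, which is why the sample size is only an expectation and not a guarantee. Dropping keys that have no fraction is a choice of this sketch.

```scala
import scala.util.Random

// Hypothetical helper, not part of Spark: one Bernoulli trial per record.
def coinFlipSampleByKey[K, V](
    records: Seq[(K, V)],
    fractions: Map[K, Double],
    seed: Long
): Seq[(K, V)] = {
  val rng = new Random(seed)
  records.filter { case (key, _) =>
    // Keep the record with probability fractions(key).
    // Keys without a fraction are dropped (a choice made for this sketch).
    rng.nextDouble() < fractions.getOrElse(key, 0.0)
  }
}

val people = Seq("man" -> 6, "woman" -> 14, "woman" -> 19, "child" -> 6,
  "baby" -> 1, "child" -> 3, "woman" -> 26)
val sampleFractions = Map("man" -> 0.5, "woman" -> 0.5, "child" -> 0.5, "baby" -> 0.3)

// The size of `sampled` varies with the seed; only its expectation is fixed.
val sampled = coinFlipSampleByKey(people, sampleFractions, seed = 42L)
```

With three "woman" records and a fraction of 0.5, the expected number sampled is 1.5, but any run may return between 0 and 3 of them. An exact per-stratum size is what sampleByKeyExact is for.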


Test data:
man 6
woman 14
woman 19
child 6
baby 1
child 3
woman 26
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.PairRDDFunctions
val data = sc.textFile("E:/jeffery/src/ML/data/sampling.txt").cache()
// Parse each space-separated line into a (key, value) pair
val parsedData = data.map { line =>
  val sp = line.split(' ')
  (sp(0), sp(1).toInt)
}.cache()

parsedData.foreach(println)
// Sampling fraction for each key (stratum)
val fractions = Map("man" -> 0.5, "woman" -> 0.5, "child" -> 0.5, "baby" -> 0.3)
// First argument is withReplacement; false means sampling without replacement
val approxSample = parsedData.sampleByKey(false, fractions).collect()
val exactSample = parsedData.sampleByKeyExact(false, fractions).collect()
println(approxSample.mkString(" "))
println(exactSample.mkString(" "))

Random data generation
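import org.apache.spark.mllib.random.RandomRDDs._
// 100 values drawn from the standard normal distribution N(0, 1), in 2 partitions
val u = normalRDD(sc, 100L, 2)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
// Print the values; printing the array directly would only show its reference
println(u.collect().mkString(" "))
println(v.collect().mkString(" "))

// 100 values drawn from a Poisson distribution with mean 10
// (re-defining u and v like this works in spark-shell)
val u = poissonRDD(sc, 10, 100L)
val v = u.map(x => 1.0 + 2.0 * x).collect()

// 100 values drawn from the uniform distribution U(0, 1)
val u = uniformRDD(sc, 100L)
val v = u.map(x => 1.0 + 2.0 * x).collect()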
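
Why 1.0 + 2.0 * x gives N(1, 4) for the normal case: adding 1 shifts the mean from 0 to 1, and multiplying by 2 scales the variance by 2 squared, so it goes from 1 to 4. A quick plain-Scala check, using scala.util.Random in place of normalRDD (an assumption of this sketch; the seed and sample size are arbitrary):

```scala
import scala.util.Random

val gaussianRng = new Random(7L) // fixed seed so reruns give the same draws
val sampleSize = 100000

// Draws from N(0, 1), then the same transform as in the Spark snippet.
val standard: Seq[Double] = Seq.fill(sampleSize)(gaussianRng.nextGaussian())
val shifted: Seq[Double] = standard.map(x => 1.0 + 2.0 * x)

def mean(xs: Seq[Double]): Double = xs.sum / xs.size

// Unbiased sample variance (divide by n - 1).
def variance(xs: Seq[Double]): Double = {
  val m = mean(xs)
  xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1)
}

val shiftedMean = mean(shifted)         // should be close to 1.0
val shiftedVariance = variance(shifted) // should be close to 4.0
```

The same transform applied to the Poisson and uniform samples also shifts and scales them, but the result is no longer a Poisson or a U(0, 1) distribution.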

Histogram
val ints = sc.parallelize(1 to 100)
ints.histogram(5) // 5 evenly spaced buckets
res92: (Array[Double], Array[Long]) = (Array(1.0, 20.8, 40.6, 60.4, 80.2, 100.0),Array(20, 20, 20, 20, 20))
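
How the evenly spaced buckets come about can be sketched in plain Scala. evenHistogram is a hypothetical helper written for this note, not Spark's implementation; it assumes the maximum is strictly greater than the minimum, and it treats every bucket as half-open except the last one, which also includes the maximum.

```scala
// Hypothetical helper, not Spark's implementation.
// Assumes values is non-empty and values.max > values.min.
def evenHistogram(values: Seq[Double], bucketCount: Int): (Array[Double], Array[Long]) = {
  val lo = values.min
  val hi = values.max
  val width = (hi - lo) / bucketCount

  // bucketCount + 1 boundaries; the last one is pinned to the maximum.
  val edges = Array.tabulate(bucketCount + 1) { i =>
    if (i == bucketCount) hi else lo + i * width
  }

  val counts = new Array[Long](bucketCount)
  values.foreach { v =>
    // Buckets are [start, end) except the last, which also includes `hi`.
    val index = math.min(((v - lo) / width).toInt, bucketCount - 1)
    counts(index) += 1
  }
  (edges, counts)
}

val histogram = evenHistogram((1 to 100).map(_.toDouble), 5)
```

For 1 to 100 and 5 buckets the width is (100 - 1) / 5 = 19.8, which gives the boundaries 1.0, 20.8, 40.6, 60.4, 80.2, 100.0 (up to floating-point rounding) and 20 values in each bucket.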


MLlib - Basic Statistics
Spark 1.1.0 Basic Statistics(上)