Spark 1.1.0 Basic Statistics (Part 1)

Summary statistics
colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.
Test data:
1 2 3
10 20 30
100 200 300
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

val data = sc.textFile("E:/jeffery/src/ML/data/statistics.txt").cache()
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
val summary = Statistics.colStats(parsedData)
println(summary.count)
println(summary.min)
println(summary.max)
println(summary.mean)        // a dense vector containing the mean value for each column
println(summary.variance)    // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
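For the three rows above, the run should print approximately the following (a worked example, assuming colStats reports the unbiased n-1 sample variance, which is what the values below use):

3
[1.0,2.0,3.0]            (min)
[100.0,200.0,300.0]      (max)
[37.0,74.0,111.0]        (mean)
[2997.0,11988.0,26973.0] (variance)
[3.0,3.0,3.0]            (numNonzeros)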
Stratified sampling
Stratified sampling methods, sampleByKey and sampleByKeyExact, can be performed on RDDs of key-value pairs.
The sampleByKey method flips a coin for each observation to decide whether it is sampled, so it requires only one pass over the data and produces a sample whose size is only expected, not guaranteed. sampleByKeyExact requires significantly more resources than the per-stratum simple random sampling used in sampleByKey, but returns a sample of exactly the requested size with 99.99% confidence.
Test data:
man 6
woman 14
woman 19
child 6
baby 1
child 3
woman 26
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.PairRDDFunctions

val data = sc.textFile("E:/jeffery/src/ML/data/sampling.txt").cache()
// Parse each line into a (key, value) pair, e.g. ("woman", 14).
val parsedData = data.map { line =>
  val sp = line.split(' ')
  (sp(0), sp(1).toInt)
}.cache()
parsedData.foreach(println)

// Per-key sampling fractions.
val fractions = Map("man" -> 0.5, "woman" -> 0.5, "child" -> 0.5, "baby" -> 0.3)
val approxSample = parsedData.sampleByKey(false, fractions).collect()
val exactSample = parsedData.sampleByKeyExact(false, fractions).collect()
println(approxSample.mkString(" "))
println(exactSample.mkString(" "))
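As a quick check on the fractions above (an illustrative sketch reusing parsedData and fractions from the block above, not part of the original post): the test data holds 1 man, 3 woman, 2 child, and 1 baby records, and sampleByKeyExact is documented to return about ⌈fraction × count⌉ records per key, so the exact sample should contain roughly 1 man, 2 woman, 1 child, and 1 baby pairs, while sampleByKey only matches those sizes in expectation.

// Print the per-key record counts and the sample size sampleByKeyExact is expected
// to return for each key (assumed here to be ceil(fraction * count) per the MLlib guide).
val counts = parsedData.countByKey()
counts.foreach { case (k, n) =>
  println(s"$k: $n records, ~${math.ceil(fractions(k) * n).toLong} expected in the exact sample")
}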
Random data generation
import org.apache.spark.mllib.random.RandomRDDs._

// 100 i.i.d. samples from the standard normal N(0, 1), spread over 2 partitions.
val u = normalRDD(sc, 100L, 2)
// Apply a transform to get a random double RDD following N(1, 4).
val v = u.map(x => 1.0 + 2.0 * x)
println(u.collect().mkString(" "))
println(v.collect().mkString(" "))

// 100 i.i.d. samples from a Poisson distribution with mean 10.
val p = poissonRDD(sc, 10, 100L)
val pv = p.map(x => 1.0 + 2.0 * x).collect()

// 100 i.i.d. samples from the uniform distribution on [0, 1].
val w = uniformRDD(sc, 100L)
val wv = w.map(x => 1.0 + 2.0 * x).collect()
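As a rough sanity check (an illustrative addition, not in the original post): v was built as 1 + 2·x with x drawn from N(0, 1), so its sample mean and standard deviation should land near 1 and 2, up to noise from only 100 samples.

// mean() and stdev() are standard actions on an RDD[Double].
println(s"mean ~= ${v.mean()}, stdev ~= ${v.stdev()}")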
Histogram
val ints = sc.parallelize(1 to 100)
ints.histogram(5)   // 5 evenly spaced buckets
res92: (Array[Double], Array[Long]) = (Array(1.0, 20.8, 40.6, 60.4, 80.2, 100.0),Array(20, 20, 20, 20, 20))
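histogram can also take explicit bucket boundaries instead of a bucket count. The following variant (an illustrative addition, not from the original post) splits the same RDD into two custom buckets; the expected counts in the comment follow from the values 1..100, with the last bucket closed on the right.

// Two custom buckets: [0, 50) and [50, 100]; with explicit boundaries,
// histogram returns only the per-bucket counts.
ints.histogram(Array(0.0, 50.0, 100.0))
// Expected result: Array[Long] = Array(49, 51)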
Correlations