PowerShell Tips: Get a Random Sample from a CSV File


The Problem

I am trying to write and test an R script against some customer data. But the data set is too big: loading it and running the script takes a long time. So it would be better to extract a small random sample from the original data and work with that.

The Solution
First, extract the first line (typically the header row) from the original CSV file and write it to the destination file.
Get-Content big.csv -TotalCount 1 | Out-File -Encoding utf8 sample.txt

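Note that the Out-File cmdlet and the redirection operators > and >> do not necessarily write UTF-8 by default. In Windows PowerShell they write UTF-16LE ("Unicode"), while PowerShell 6 and later default to UTF-8 without a BOM. Many applications expect a specific encoding when reading data, so we pass -Encoding utf8 explicitly here.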

Then we read all lines except the first line: Get-Content big.csv | where {$_.readcount -gt 1 }
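
As a small illustration of the filter above, the sketch below builds a throwaway four-line CSV (the file name and its contents are made up for the demo) and skips the header in two ways: with the ReadCount property that Get-Content attaches to each line, and with Select-Object -Skip 1, which is an alternative not used in the original command.

```powershell
# Build a tiny throwaway CSV so the example is self-contained (hypothetical data).
$demo = Join-Path ([System.IO.Path]::GetTempPath()) 'readcount-demo.csv'
Set-Content -Path $demo -Value 'id,name', '1,a', '2,b', '3,c' -Encoding utf8

# ReadCount is a 1-based line number that Get-Content attaches to each line.
$viaReadCount = @(Get-Content $demo | Where-Object { $_.ReadCount -gt 1 })

# Alternative: Select-Object -Skip 1 drops the first pipeline object (the header).
$viaSkip = @(Get-Content $demo | Select-Object -Skip 1)

Remove-Item $demo
```

Both variables should end up holding the same three data lines; which form to use is mostly a matter of taste.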

Then randomly select 100 lines and append them to the destination file.
Get-Content big.csv | where {$_.readcount -gt 1 } | Get-Random -Count 100 | Out-File -Encoding utf8 -Append sample.txt

The Complete Script
Get-Content big.csv -TotalCount 1 | Out-File -Encoding utf8 sample.txt; Get-Content big.csv | where {$_.readcount -gt 1 } | Get-Random -Count 100 | Out-File -Encoding utf8 -Append sample.txt
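
If the sample is needed more than once, the same two steps can be wrapped in a function. This is only a sketch: the function name New-CsvSample and its parameter names are made up for illustration, and it uses Select-Object -Skip 1 in place of the ReadCount filter to skip the header.

```powershell
# Hypothetical helper: copy the header plus $Count random rows to a new file.
function New-CsvSample {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $Destination,
        [int] $Count = 100
    )

    # Copy the first line (the header) so the sample is still a usable CSV.
    Get-Content -Path $Path -TotalCount 1 |
        Out-File -Encoding utf8 -FilePath $Destination

    # Skip the header, pick $Count random rows, and append them.
    # If the file has fewer than $Count rows, Get-Random returns all of them.
    Get-Content -Path $Path |
        Select-Object -Skip 1 |
        Get-Random -Count $Count |
        Out-File -Encoding utf8 -Append -FilePath $Destination
}
```

A call would then look like: New-CsvSample -Path big.csv -Destination sample.txt -Count 100. Note that Get-Random returns the rows in random order, so the sample does not preserve the original row order.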

Related Script: Get default system encoding
[System.Text.Encoding]::Default
[System.Text.Encoding]::Default.EncodingName
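
To see what actually landed in a file, one option is to look at its first bytes. The sketch below writes a one-line demo file (the file name is made up) and checks for the UTF-8 byte order mark EF BB BF. Whether -Encoding utf8 emits a BOM depends on the PowerShell version, so no particular result is assumed here.

```powershell
# Write a throwaway file the same way the script writes the header line.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'bom-demo.txt'
'id,name' | Out-File -Encoding utf8 -FilePath $path

# Read the raw bytes; a full path is used because .NET methods do not
# follow PowerShell's current location.
$bytes = [System.IO.File]::ReadAllBytes($path)

# A UTF-8 BOM is the three-byte prefix EF BB BF.
$hasUtf8Bom = $bytes.Length -ge 3 -and
              $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF

"UTF-8 BOM present: $hasUtf8Bom"
Remove-Item $path
```

This matters when the consuming program treats a leading BOM as part of the first column name.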

