PowerShell Tips: Get a Random Sample from CSV File

The Problem

I am trying to write and test R script against some data from customer. But the data is too big, it would take a lot of time to load the data and run the script. So it would be to extract a small fraction from the original data.

The Solution
First extract the first line from the original csv file, write to destination file.
Get-Content big.csv -TotalCount 1 | Out-File -Encoding utf8 sample.txt

Notice that by default Out-File cmdlet or redirection command >> uses system default encoding when write to a file. Most application by default uses utf-8 or utf-16 to read data. Hence we use -Encoding utf8 here.

Then we read all lines except the first line: Get-Content big.csv | where {$_.readcount -gt 1 }

Then randomly select 1000 lines and append them to the destination file.
Get-Content big.csv | where {$_.readcount -gt 1 } | Get-Random -Count 100 | Out-File -Encoding utf8 -Append sample.txt

The Complete Script
Get-Content big.csv -TotalCount 1 | Out-File -Encoding utf8 sample.txt; Get-Content big.csv | where {$_.readcount -gt 1 } | Get-Random -Count 100 | Out-File -Encoding utf8 -Append sample.txt

Related Script: Get default system encoding
[System.Text.Encoding]::Default
[System.Text.Encoding]::Default.EncodingName

Resource
PSTip: Get-Random
Get-Random Cmdlet
Post a Comment

Labels

Java (159) Lucene-Solr (110) All (58) Interview (58) J2SE (53) Algorithm (41) Soft Skills (36) Eclipse (34) Code Example (31) Linux (25) JavaScript (23) Spring (22) Windows (22) Web Development (20) Nutch2 (18) Tools (18) Bugs (17) Debug (15) Defects (14) Text Mining (14) J2EE (13) Network (13) PowerShell (11) Chrome (9) Design (9) How to (9) Learning code (9) Performance (9) UIMA (9) html (9) Continuous Integration (8) Dynamic Languages (8) Http Client (8) Maven (8) Security (8) Trouble Shooting (8) bat (8) blogger (8) Big Data (7) Google (7) Guava (7) JSON (7) Problem Solving (7) ANT (6) Coding Skills (6) Database (6) Scala (6) Shell (6) css (6) Algorithm Series (5) Cache (5) IDE (5) Lesson Learned (5) Programmer Skills (5) System Design (5) Tips (5) adsense (5) xml (5) AIX (4) Code Quality (4) GAE (4) Git (4) Good Programming Practices (4) Jackson (4) Memory Usage (4) Miscs (4) OpenNLP (4) Project Managment (4) Python (4) Spark (4) Testing (4) ads (4) regular-expression (4) Android (3) Apache Spark (3) Become a Better You (3) Concurrency (3) Eclipse RCP (3) English (3) Happy Hacking (3) IBM (3) J2SE Knowledge Series (3) JAX-RS (3) Jetty (3) Restful Web Service (3) Script (3) regex (3) seo (3) .Net (2) Android Studio (2) Apache (2) Apache Procrun (2) Architecture (2) Batch (2) Bit Operation (2) Build (2) Building Scalable Web Sites (2) C# (2) C/C++ (2) CSV (2) Career (2) Cassandra (2) Distributed (2) Fiddler (2) Firefox (2) Google Drive (2) Gson (2) Html Parser (2) Http (2) Image Tools (2) JQuery (2) Jersey (2) LDAP (2) Life (2) Logging (2) Software Issues (2) Storage (2) Text Search (2) xml parser (2) AOP (1) Application Design (1) AspectJ (1) Chrome DevTools (1) Cloud (1) Codility (1) Data Mining (1) Data Structure (1) ExceptionUtils (1) Exif (1) Feature Request (1) FindBugs (1) Greasemonkey (1) HTML5 (1) Httpd (1) I18N (1) IBM Java Thread Dump Analyzer (1) JDK Source Code (1) JDK8 (1) JMX (1) Lazy Developer (1) Mac (1) Machine Learning (1) Mobile (1) My Plan for 2010 (1) Netbeans (1) Notes (1) Operating System (1) Perl (1) Problems (1) Product Architecture (1) Programming Life (1) Quality (1) Redhat (1) Redis (1) Review (1) RxJava (1) Solutions logs (1) Team Management (1) Thread Dump Analyzer (1) Troubleshooting (1) Visualization (1) boilerpipe (1) htm (1) ongoing (1) procrun (1) rss (1)

Popular Posts