PowerShell: Working with CSV Files

Background
When import csv file to solr, it may fail because the csv is in correct formatted: mostly related with double quotes in column value, or maybe there is no enough columns.

When this happens, we may have to dig into csv files. Powershell is a great tool in this case.
Task: Get Line Number of the CSV Record
When solr fails to import csv: it may report the following error:
SEVERE: Import csv1.csv failed: org.apache.solr.common.SolrException: CSVLoader: input=file:/C:/csv1.csv, line=134370,expected 19 values but got 17
                values={field_values_in_this_row}
Solr shows the error happens at 134370 line, but if we use Get-Content csv1.csv | Select-Object -index 134370, we may find content of 134370 line is totally different. This is because if there are multiline records in the csv file, the line number would be not correct.
  /**
   * ATTENTION: in case your csv has multiline-values the returned
   *            number does not correspond to the record-number
   * 
   * @return  current line number
   */
  public int org.apache.solr.internal.csv.CSVParser.getLineNumber() {
    return in.getLineNumber();  
  }

To Get correct line of the csv record, use the following PowerShell command:
select-string -pattern 'field_values_in_this_row' csv1.csv | select Line,LineNumber
Line                                                                                              LineNumber
----                                                                                               ----------
field_values_in_this_row                                                                134378
Task: Get Record Number of CSV File
Users want to know whether all records are imported to csv. To do this, we need get number of all not-empty records in the csv file. Line number of the csv file is not useful, as ther may be empty lines , or multiple-lines records in the csv file.

We can use the following Powershell command: the Where-Object excludes empty records.
(Import-Csv csv1.csv | Where-Object { ($_.PSObject.Properties | ForEach-Object {$_.Value}) -ne $null} | Measure-Object).count

The previous command is slow, if we are sure there is no empty records(lines) in the csv file: we can use following command:
(Import-Csv .\csv1.csv | Measure-Object).count

Other CSV related PfowerShell Commands
Select fields from CSV file:
Import-Csv csv1.csv | select f1,f2 | Export-Csv -Path csv2.csv –NoTypeInformation
Add new fields into CSV file:
Import-CSV csv1.csv | Select @{Name="Surname";Expression={$_."Last Name"}}, @{Name="GivenName";Expression={$_."First Name"}} | Export-Csv -Path csv2.csv –NoTypeInformation
Import-Csv .\1.txt | select-object id | sort id –Unique | Measure-Object
Rescources
Import CSV that Contains Double-Quotes into Solr
Improve Solr CSVParser to Log Invalid Characters
Post a Comment

Labels

Java (159) Lucene-Solr (110) All (58) Interview (58) J2SE (53) Algorithm (43) Soft Skills (36) Eclipse (34) Code Example (31) Linux (24) JavaScript (23) Spring (22) Windows (22) Web Development (20) Nutch2 (18) Tools (18) Bugs (17) Debug (15) Defects (14) Text Mining (14) J2EE (13) Network (13) PowerShell (11) Chrome (9) Design (9) How to (9) Learning code (9) Performance (9) UIMA (9) html (9) Dynamic Languages (8) Http Client (8) Maven (8) Security (8) Trouble Shooting (8) bat (8) blogger (8) Big Data (7) Continuous Integration (7) Google (7) Guava (7) JSON (7) Problem Solving (7) ANT (6) Coding Skills (6) Database (6) Scala (6) Shell (6) css (6) Algorithm Series (5) Cache (5) IDE (5) Lesson Learned (5) Programmer Skills (5) System Design (5) Tips (5) adsense (5) xml (5) AIX (4) Code Quality (4) GAE (4) Git (4) Good Programming Practices (4) Jackson (4) Memory Usage (4) Miscs (4) OpenNLP (4) Project Managment (4) Python (4) Spark (4) Testing (4) ads (4) regular-expression (4) Android (3) Apache Spark (3) Become a Better You (3) Concurrency (3) Eclipse RCP (3) English (3) Happy Hacking (3) IBM (3) J2SE Knowledge Series (3) JAX-RS (3) Jetty (3) Restful Web Service (3) Script (3) regex (3) seo (3) .Net (2) Android Studio (2) Apache (2) Apache Procrun (2) Architecture (2) Batch (2) Bit Operation (2) Build (2) Building Scalable Web Sites (2) C# (2) C/C++ (2) CSV (2) Career (2) Cassandra (2) Distributed (2) Fiddler (2) Firefox (2) Google Drive (2) Gson (2) Html Parser (2) Http (2) Image Tools (2) JQuery (2) Jersey (2) LDAP (2) Life (2) Logging (2) Software Issues (2) Storage (2) Text Search (2) xml parser (2) AOP (1) Application Design (1) AspectJ (1) Chrome DevTools (1) Cloud (1) Codility (1) Data Mining (1) Data Structure (1) ExceptionUtils (1) Exif (1) Feature Request (1) FindBugs (1) Greasemonkey (1) HTML5 (1) Httpd (1) I18N (1) IBM Java Thread Dump Analyzer (1) JDK Source Code (1) JDK8 (1) JMX (1) Lazy Developer (1) Mac (1) Machine Learning (1) Mobile (1) My Plan for 2010 (1) Netbeans (1) Notes (1) Operating System (1) Perl (1) Problems (1) Product Architecture (1) Programming Life (1) Quality (1) Redhat (1) Redis (1) Review (1) RxJava (1) Solutions logs (1) Team Management (1) Thread Dump Analyzer (1) Troubleshooting (1) Visualization (1) boilerpipe (1) htm (1) ongoing (1) procrun (1) rss (1)

Popular Posts