Powershell-Working with CSV: Delete Rows Without Enough Columns


The Problem
We import csv files into Solr server, which is very sensitive with the number of columns. If there is no enough columns, Solr will fail with exception:
org.apache.solr.common.SolrException: CSVLoader: input=file:/C:/1.csv, line=9158554,expected 19 values but got 17

So we would like to have a script to clean the csv: to remove the rows which have no enough data: the number of columns is not 19.

Don't know how to get the number of columns of current record, but it's easier to check whether the value of last field is null: this means exactly no enough columns.

The Solution: Using Powershell
Powershell command would be like below(- the last field is to):

Import-Csv .\1.csv | Where-Object { $_.to -ne $null} | Export-Csv .\rst1.csv -NoTypeInformation


To output which line has no enough columns:
Import-Csv .\1.csv| Foreach-Object {$line = 0} {
   if($_.bcc -eq $null) {
      echo "ignore line: $line, no enough fields"; 
   } else {
     convertto-csv -inputobject $_ -NoTypeInformation | select -Skip 1 | out-file -filepath .\r1.csv -Append 
   }
   $line++
}
The complete script: 
cleanCsv.ps1 Usage: .\cleanCsv.ps1 -filePath .\1.csv -destFilePath .\r1.csv
[CmdletBinding()]
Param(
  [Parameter(Mandatory=$True)]
  [string]$filePath,

  [Parameter(Mandatory=$True)]
  [string]$destFilePath,
  
  [Parameter(Mandatory=$False)]
  [string]$lastField="bcc"
)
# $ignoreLine = 2323533;

Get-Date -format "yyyy-MM-dd HH:mm:ss"
$sw = [Diagnostics.Stopwatch]::StartNew()

If (Test-Path $destFilePath ){
  echo "remove old $destFilePath"
 Remove-Item $destFilePath 
}

gc $filePath -TotalCount 1 | out-file -filepath $destFilePath
Import-Csv $filePath | Foreach-Object {$line = 0} {
  if($_.$lastField -eq $null) {
    echo "ignore line: $line, no enough fields"; 
  } else {
   convertto-csv -inputobject $_ -NoTypeInformation | select -Skip 1 | out-file -filepath $destFilePath -Append 
  }
  $line++
}

$sw.Stop()
Get-Date -format "yyyy-MM-dd HH:mm:ss"
echo "took " $sw.Elapsed

Labels

adsense (5) Algorithm (69) Algorithm Series (35) Android (7) ANT (6) bat (8) Big Data (7) Blogger (14) Bugs (6) Cache (5) Chrome (19) Code Example (29) Code Quality (7) Coding Skills (5) Database (7) Debug (16) Design (5) Dev Tips (63) Eclipse (32) Git (5) Google (33) Guava (7) How to (9) Http Client (8) IDE (7) Interview (88) J2EE (13) J2SE (49) Java (186) JavaScript (27) JSON (7) Learning code (9) Lesson Learned (6) Linux (26) Lucene-Solr (112) Mac (10) Maven (8) Network (9) Nutch2 (18) Performance (9) PowerShell (11) Problem Solving (11) Programmer Skills (6) regex (5) Scala (6) Security (9) Soft Skills (38) Spring (22) System Design (11) Testing (7) Text Mining (14) Tips (17) Tools (24) Troubleshooting (29) UIMA (9) Web Development (19) Windows (21) xml (5)