The Problem
We import csv files into Solr server, which is very sensitive with the number of columns. If there is no enough columns, Solr will fail with exception:
org.apache.solr.common.SolrException: CSVLoader: input=file:/C:/1.csv, line=9158554,expected 19 values but got 17
So we would like to have a script to clean the csv: to remove the rows which have no enough data: the number of columns is not 19.
Don't know how to get the number of columns of current record, but it's easier to check whether the value of last field is null: this means exactly no enough columns.
The Solution: Using Powershell
Powershell command would be like below(- the last field is to):
Import-Csv .\1.csv | Where-Object { $_.to -ne $null} | Export-Csv .\rst1.csv -NoTypeInformation
To output which line has no enough columns:
Import-Csv .\1.csv| Foreach-Object {$line = 0} { if($_.bcc -eq $null) { echo "ignore line: $line, no enough fields"; } else { convertto-csv -inputobject $_ -NoTypeInformation | select -Skip 1 | out-file -filepath .\r1.csv -Append } $line++ }The complete script:
cleanCsv.ps1 Usage: .\cleanCsv.ps1 -filePath .\1.csv -destFilePath .\r1.csv
[CmdletBinding()] Param( [Parameter(Mandatory=$True)] [string]$filePath, [Parameter(Mandatory=$True)] [string]$destFilePath, [Parameter(Mandatory=$False)] [string]$lastField="bcc" ) # $ignoreLine = 2323533; Get-Date -format "yyyy-MM-dd HH:mm:ss" $sw = [Diagnostics.Stopwatch]::StartNew() If (Test-Path $destFilePath ){ echo "remove old $destFilePath" Remove-Item $destFilePath } gc $filePath -TotalCount 1 | out-file -filepath $destFilePath Import-Csv $filePath | Foreach-Object {$line = 0} { if($_.$lastField -eq $null) { echo "ignore line: $line, no enough fields"; } else { convertto-csv -inputobject $_ -NoTypeInformation | select -Skip 1 | out-file -filepath $destFilePath -Append } $line++ } $sw.Stop() Get-Date -format "yyyy-MM-dd HH:mm:ss" echo "took " $sw.Elapsed