Powershell-Working with CSV: Delete Rows Without Enough Columns

The Problem
We import csv files into Solr server, which is very sensitive with the number of columns. If there is no enough columns, Solr will fail with exception:
org.apache.solr.common.SolrException: CSVLoader: input=file:/C:/1.csv, line=9158554,expected 19 values but got 17

So we would like to have a script to clean the csv: to remove the rows which have no enough data: the number of columns is not 19.

Don't know how to get the number of columns of current record, but it's easier to check whether the value of last field is null: this means exactly no enough columns.

The Solution: Using Powershell
Powershell command would be like below(- the last field is to):

Import-Csv .\1.csv | Where-Object { $_.to -ne $null} | Export-Csv .\rst1.csv -NoTypeInformation


To output which line has no enough columns:
Import-Csv .\1.csv| Foreach-Object {$line = 0} {
   if($_.bcc -eq $null) {
      echo "ignore line: $line, no enough fields"; 
   } else {
     convertto-csv -inputobject $_ -NoTypeInformation | select -Skip 1 | out-file -filepath .\r1.csv -Append 
   }
   $line++
}
The complete script: 
cleanCsv.ps1 Usage: .\cleanCsv.ps1 -filePath .\1.csv -destFilePath .\r1.csv
[CmdletBinding()]
Param(
  [Parameter(Mandatory=$True)]
  [string]$filePath,

  [Parameter(Mandatory=$True)]
  [string]$destFilePath,
  
  [Parameter(Mandatory=$False)]
  [string]$lastField="bcc"
)
# $ignoreLine = 2323533;

Get-Date -format "yyyy-MM-dd HH:mm:ss"
$sw = [Diagnostics.Stopwatch]::StartNew()

If (Test-Path $destFilePath ){
  echo "remove old $destFilePath"
 Remove-Item $destFilePath 
}

gc $filePath -TotalCount 1 | out-file -filepath $destFilePath
Import-Csv $filePath | Foreach-Object {$line = 0} {
  if($_.$lastField -eq $null) {
    echo "ignore line: $line, no enough fields"; 
  } else {
   convertto-csv -inputobject $_ -NoTypeInformation | select -Skip 1 | out-file -filepath $destFilePath -Append 
  }
  $line++
}

$sw.Stop()
Get-Date -format "yyyy-MM-dd HH:mm:ss"
echo "took " $sw.Elapsed
Post a Comment

Labels

Java (159) Lucene-Solr (112) Interview (61) All (58) J2SE (53) Algorithm (45) Soft Skills (38) Eclipse (33) Code Example (31) Linux (25) JavaScript (23) Spring (22) Windows (22) Web Development (20) Tools (19) Nutch2 (18) Bugs (17) Debug (16) Defects (14) Text Mining (14) J2EE (13) Network (13) Troubleshooting (13) PowerShell (11) Chrome (9) Design (9) How to (9) Learning code (9) Performance (9) Problem Solving (9) UIMA (9) html (9) Http Client (8) Maven (8) Security (8) bat (8) blogger (8) Big Data (7) Continuous Integration (7) Google (7) Guava (7) JSON (7) Shell (7) ANT (6) Coding Skills (6) Database (6) Lesson Learned (6) Programmer Skills (6) Scala (6) Tips (6) css (6) Algorithm Series (5) Cache (5) Dynamic Languages (5) IDE (5) System Design (5) adsense (5) xml (5) AIX (4) Code Quality (4) GAE (4) Git (4) Good Programming Practices (4) Jackson (4) Memory Usage (4) Miscs (4) OpenNLP (4) Project Managment (4) Spark (4) Testing (4) ads (4) regular-expression (4) Android (3) Apache Spark (3) Become a Better You (3) Concurrency (3) Eclipse RCP (3) English (3) Happy Hacking (3) IBM (3) J2SE Knowledge Series (3) JAX-RS (3) Jetty (3) Restful Web Service (3) Script (3) regex (3) seo (3) .Net (2) Android Studio (2) Apache (2) Apache Procrun (2) Architecture (2) Batch (2) Bit Operation (2) Build (2) Building Scalable Web Sites (2) C# (2) C/C++ (2) CSV (2) Career (2) Cassandra (2) Distributed (2) Fiddler (2) Firefox (2) Google Drive (2) Gson (2) How to Interview (2) Html Parser (2) Http (2) Image Tools (2) JQuery (2) Jersey (2) LDAP (2) Life (2) Logging (2) Python (2) Software Issues (2) Storage (2) Text Search (2) xml parser (2) AOP (1) Application Design (1) AspectJ (1) Chrome DevTools (1) Cloud (1) Codility (1) Data Mining (1) Data Structure (1) ExceptionUtils (1) Exif (1) Feature Request (1) FindBugs (1) Greasemonkey (1) HTML5 (1) Httpd (1) I18N (1) IBM Java Thread Dump Analyzer (1) JDK Source Code (1) JDK8 (1) JMX (1) Lazy Developer (1) Mac (1) Machine Learning (1) Mobile (1) My Plan for 2010 (1) Netbeans (1) Notes (1) Operating System (1) Perl (1) Problems (1) Product Architecture (1) Programming Life (1) Quality (1) Redhat (1) Redis (1) Review (1) RxJava (1) Solutions logs (1) Team Management (1) Thread Dump Analyzer (1) Visualization (1) boilerpipe (1) htm (1) ongoing (1) procrun (1) rss (1)

Popular Posts