PowerShell in Action: Analyze Log and Interact with Solr

The Problem
We need a program that analyzes Solr logs to find out why some items that the local Solr server fetches from a remote Solr server are missing.
We suspect the deduplication configuration: items that have the same values in the signature fields are marked as duplicates and removed by Solr. To confirm this, we need to analyze the logs and find all of these items.
Why Use PowerShell?
1. PowerShell is preinstalled on Windows 7, Windows Server 2008 R2, and later Windows releases.
2. It's powerful: we can even call .NET from a PowerShell script.
3. It's an interpreted language, so we can change the script and run it right away. There is no need to compile and package it as with Java or .NET.
4. I have worked as a Java programmer for more than 6 years, and writing this program in Java would be kind of boring. So why not try a new tool and learn something new :)
Analyze Log
On Linux, we can use awk and grep to search logs and extract fields from them.
In PowerShell, we use Get-Content and ForEach-Object. In ForEach-Object, we test whether the current log line contains "Got id". If so, we split it on whitespace, take the element at index 3 (the fourth token), strip its trailing character, and write the result to a temporary file.

Get-Content $logs | Foreach-Object{ if($_.Contains("Got id")) {$a=$_.Split()[3]; $a.Substring(0,$a.Length-1); } } | out-file ".\ids.txt"
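To make the extraction step easier to follow, here is a minimal sketch that runs the same logic on a few in-memory lines. The sample log lines and the function name Get-IdFromLine are made up for illustration; the real Solr log format is not shown in this post.

```powershell
# Hypothetical log lines; only the position of the id token matters here.
$sampleLines = @(
  '12:00:01 Got id abc123, from remote',
  '12:00:02 Unrelated message',
  '12:00:03 Got id xyz789, from remote'
)

function Get-IdFromLine([string]$line) {
  # Skip lines that do not mention "Got id".
  if (-not $line.Contains('Got id')) { return }
  # Split() with no arguments splits on whitespace; index 3 is the fourth token.
  $token = $line.Split()[3]
  # Drop the trailing character (a comma in the sample lines).
  $token.Substring(0, $token.Length - 1)
}

$extracted = @($sampleLines | ForEach-Object { Get-IdFromLine $_ })
$extracted
```

If the id sits at a different position in your log, only the index and the trimming need to change.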
Interact with Solr
We then read the ids from the temporary file 100 at a time, construct a URL for each batch, send an HTTP request with Net.HttpWebRequest, and read the HTTP response with Net.HttpWebResponse and IO.StreamReader.

In PowerShell 3.0 and newer, we can use Invoke-WebRequest to send the HTTP request and read the response.
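As a rough sketch of the Invoke-WebRequest variant, assuming PowerShell 3.0 or newer: the function names New-SolrIdQueryUrl and Get-SolrResponseText and the localhost URL are hypothetical, and the query is URL-escaped here, which the script in this post does not do.

```powershell
# Build the select URL for a batch of ids; the query string is URL-escaped.
function New-SolrIdQueryUrl([string]$solrServer, [string[]]$ids) {
  $clauses = $ids | ForEach-Object { "contentid:$_" }
  $query = $clauses -join ' OR '
  $solrServer + '/select?fl=contentid&omitHeader=true&q=' + [uri]::EscapeDataString($query)
}

# Fetch the response body as text. It is not called in this sketch
# because it needs a reachable Solr server.
function Get-SolrResponseText([string]$url) {
  $response = Invoke-WebRequest -Uri $url -Method Get -TimeoutSec 600 -UseBasicParsing
  $response.Content
}

# Hypothetical server address, used only to show the resulting URL.
$exampleUrl = New-SolrIdQueryUrl 'http://localhost:8983/solr' @('a1', 'b2')
$exampleUrl
```

Keeping the URL construction separate from the request makes it possible to inspect the query before sending it.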

We then look for each id in the response. If an id does not appear in the response, it is missing from Solr, and we save it to the result file.
$count=100
$ids=@()
gc .\ids.txt  | foreach  {$i=0;} {
  $ids+=$_
  $i++
  if($i -eq $count) { checkSolr $ids; $ids=@(); $i=0;}
}
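The loop above flushes a batch every 100 ids and leaves the final partial batch to be checked afterwards. The same batching idea can be sketched as a small standalone function; Split-IntoBatches is a hypothetical name and is not part of the script in this post.

```powershell
# Split a list into batches of at most $size items, including the final partial batch.
function Split-IntoBatches([object[]]$items, [int]$size) {
  for ($start = 0; $start -lt $items.Count; $start += $size) {
    $end = [Math]::Min($start + $size, $items.Count) - 1
    # The leading comma emits each batch as one array instead of unrolling it.
    , @($items[$start..$end])
  }
}

$batches = @(Split-IntoBatches (1..250) 100)
$batches.Count       # number of batches
$batches[2].Count    # size of the last, partial batch
```

With 250 items and a batch size of 100, this yields two full batches and one batch of 50.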
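Function checkSolr ($ids)
{
  # Nothing to query when the batch is empty.
  if($ids.Count -eq 0) { return }
  # rows: Solr returns only 10 documents by default unless the request handler overrides it.
  $url=$solrServer+"/select?fl=contentid&omitHeader=true&rows="+$ids.Count+"&q="
  foreach ($id in $ids) {$url+="contentid:$id OR "}
  # Drop the trailing " OR " (4 characters).
  $url=$url.Substring(0, $url.Length-4)
  [Net.HttpWebRequest] $req = [Net.WebRequest]::Create($url)
  $req.Method = "GET"
  $req.Timeout = 600000 # = 10 minutes
  [Net.HttpWebResponse] $result = $req.GetResponse()
  [IO.Stream] $stream = $result.GetResponseStream()
  [IO.StreamReader] $reader = New-Object IO.StreamReader($stream)
  [string] $output = $reader.ReadToEnd()
  # Closing the reader also closes the underlying response stream.
  $reader.Close()
  $result.Close()
  # A foreach statement cannot be piped directly, so results are written to the streams.
  foreach ($id in $ids) {
    # Substring match: an id that is a prefix of another returned id is reported as present.
    $idx = $output.IndexOf($id)
    if($idx -eq -1)  {
      $notExistStream.WriteLine("$id not in solr");
    }
    else {
      if("$existFile" -ne "" ){ $existStream.WriteLine("$id exist in solr") }
    }
  }
}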
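Because IndexOf does a substring search, an id such as "12" is found whenever "123" is in the response. One way around this, sketched below, is to parse the response and compare ids exactly. This assumes the request adds wt=json so that Solr answers in JSON, which the script above does not do; the function name Get-MissingIds and the sample response body are made up for illustration.

```powershell
# Return the ids that are absent from a Solr JSON response (assumes wt=json was requested).
function Get-MissingIds([string[]]$ids, [string]$jsonResponse) {
  $parsed = $jsonResponse | ConvertFrom-Json
  $found = New-Object 'System.Collections.Generic.HashSet[string]'
  foreach ($doc in @($parsed.response.docs)) {
    [void]$found.Add([string]$doc.contentid)
  }
  # Exact set membership, so "12" is not mistaken for "123".
  @($ids | Where-Object { -not $found.Contains($_) })
}

# Hypothetical response body containing a single document.
$sampleJson = '{"response":{"numFound":1,"start":0,"docs":[{"contentid":"123"}]}}'
$missing = @(Get-MissingIds @('12', '123', '99') $sampleJson)
$missing
```

Here "12" and "99" are reported as missing, while a plain substring search would have treated "12" as present.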
Complete Code
[CmdletBinding()]
Param(
   [Parameter(Mandatory=$True,Position=1)]
   [String]$solrServer,
   
   [Parameter(Mandatory=$True,Position=2)]
   [String[]]$logs,
 
   [Parameter(Mandatory=$True)]
   [string]$notExistFile,
   
   [Parameter(Mandatory=$False)]
   [string]$existFile
)
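Function checkSolr ($ids)
{
  # Nothing to query when the batch is empty.
  if($ids.Count -eq 0) { return }
  # rows: Solr returns only 10 documents by default unless the request handler overrides it.
  $url=$solrServer+"/select?fl=contentid&omitHeader=true&rows="+$ids.Count+"&q="
  foreach ($id in $ids) {$url+="contentid:$id OR "}
  # Drop the trailing " OR " (4 characters).
  $url=$url.Substring(0, $url.Length-4)
  [Net.HttpWebRequest] $req = [Net.WebRequest]::Create($url)
  $req.Method = "GET"
  $req.Timeout = 600000 # = 10 minutes
  [Net.HttpWebResponse] $result = $req.GetResponse()
  [IO.Stream] $stream = $result.GetResponseStream()
  [IO.StreamReader] $reader = New-Object IO.StreamReader($stream)
  [string] $output = $reader.ReadToEnd()
  # Closing the reader also closes the underlying response stream.
  $reader.Close()
  $result.Close()
  # A foreach statement cannot be piped directly, so results are written to the streams.
  foreach ($id in $ids) {
    # Substring match: an id that is a prefix of another returned id is reported as present.
    $idx = $output.IndexOf($id)
    if($idx -eq -1)  {
      $notExistStream.WriteLine("$id not in solr");
    }
    else {
      if("$existFile" -ne "" ){ $existStream.WriteLine("$id exist in solr") }
    }
  }
}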
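function createNewFile($file)
{
  if(Test-Path -Path $file) { Remove-Item $file }
  # Discard New-Item's output so the function returns only the resolved path.
  New-Item $file -ItemType file | Out-Null
  # Return the full path: .NET resolves relative paths against the process
  # directory, which can differ from the current PowerShell location.
  (Resolve-Path $file).ProviderPath
}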

Write-Host (Get-Date).tostring(), script started -BackgroundColor "Red" -ForegroundColor "Black"

$elapsed = [System.Diagnostics.Stopwatch]::StartNew()

Get-Content $logs | %{ if($_.Contains("Got id")) {$a=$_.Split()[3]; $a.Substring(0,$a.Length-1); } } | out-file ".\ids.txt"
Write-Host (Get-Date).tostring(), created ids.txt -BackgroundColor "Red" -ForegroundColor "Black"

# Open the output files before the first batch is checked; checkSolr writes to these streams.
$notExistFile=createNewFile $notExistFile
$notExistStream = [System.IO.StreamWriter] "$notExistFile"
if("$existFile" -ne "") { $existFile=createNewFile $existFile; $existStream = [System.IO.StreamWriter] "$existFile"; }

$count=100
$ids=@()
gc .\ids.txt  | foreach  {$i=0;} {
  $ids+=$_
  $i++
  if($i -eq $count) { checkSolr $ids; $ids=@(); $i=0;}
}

# check for remaining ids
checkSolr $ids;


$notExistStream.close()
if($existStream) {$existStream.close()}

Write-Host (Get-Date).tostring(), script finished -BackgroundColor "Red" -ForegroundColor "Black"
write-host "Total Elapsed Time: $($elapsed.Elapsed.TotalSeconds )" -BackgroundColor "Red" -ForegroundColor "Black"
PowerShell GUI
PowerGUI