Using Mergeindexes and PowerShell to Automate Deployment of Solr Index to Remote Production Machines

The Problem
We periodically update our documentation site, then use Nutch to crawl it and save the index to a Solr server on the local build machine, where we test it.

On the release date, we deploy the new index to the production machines. We want to minimize downtime, so we can't restart the Solr server on the production machines.
The Solution
Luckily, Solr provides a mergeindexes tool. It can't merge remote indexes, but we can use PowerShell to copy the new index to each production machine and then run mergeindexes locally.

We chose Windows PowerShell because it can work directly with UNC paths such as \\serverA\labelB\pathc, including cd into them, whereas cmd.exe (Windows batch) does not support a UNC path as the current directory.
Steps and Script
1. Crawl the vendorA docs into core core_vendorA on the build machine.
PowerShell Script
This step is optional, as the site should already be crawled and tested before deployment. We include the script here for completeness.
We created a ServerResource on the Nutch side that exposes an HTTP API to start, stop, edit, or delete a crawl task and to monitor crawl status. See Nutch2: Extend Nutch2 to Crawl via Http API.

$data = '{\"solrURL\":\"http://solrServerInbuildMachine/solr/vendorA/\",\"crawlID\":\"crawl_vendorA_ID1\",
\"taskName\":\"taskl_vendorA_ID1\",\"crawlDepth\":2,\"urls\":[\"http://docsite:port/rootpath/\"],
\"includePaths\":[\"+^(?i)http://docsite:port/rootpath/\"],\"subCollections\":[{\"name\":\"vendorA\",
\"id\":\"vendorA\",\"whiteList\":[\"http\",\"cifs\",\"file\",\"ftp\"]},{\"name\":\"vendorA\",
\"id\":\"vendorA\",\"whiteList\":[\"http\",\"cifs\",\"file\",\"ftp\"]}],\"solrindexParams\":\"update.chain=webCrawlerChain\",
\"delOldDataQuery\":\"subcollection:vendorA\",\"sync\":true,\"deleteIfExist\":true,\"updateDirectly\":false,
\"tmpCoreName\":\"core-tmp1\",\"cleanData\":true,\"startTask\":true,\"reuseIfExist\":false,
\"fileToFileMappings\":{\"conf/nutch-site.xml\":\"conf/predefinedTasks/nutch-site-templateA.xml\"}} '

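# Call curl.exe explicitly: in Windows PowerShell 3.0 and later, plain "curl" is an alias for Invoke-WebRequest
&curl.exe -X PUT -H "Content-Type: application/json" -d $data http://nutchServer:port/nutch/cvcrawler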
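Hand-escaping the JSON is easy to get wrong, so here is a minimal sketch of an alternative that builds the payload from a hashtable and serializes it with ConvertTo-Json. It assumes PowerShell 3.0 or later. The function name New-CrawlTaskJson and its parameters are hypothetical, and only a subset of the fields from the payload above is included.

```powershell
# Hypothetical helper (not part of Nutch or Solr): builds the crawl-task JSON.
# Field names are copied from the payload above; only a subset is shown.
function New-CrawlTaskJson {
    param(
        [string]$SolrUrl,    # Solr core URL on the build machine
        [string]$Vendor,     # e.g. vendorA
        [string]$StartUrl    # root URL of the documentation site
    )
    $task = [ordered]@{
        solrURL         = $SolrUrl
        crawlID         = "crawl_${Vendor}_ID1"
        taskName        = "taskl_${Vendor}_ID1"
        crawlDepth      = 2
        urls            = @($StartUrl)
        includePaths    = @("+^(?i)$StartUrl")
        subCollections  = @(@{
            name      = $Vendor
            id        = $Vendor
            whiteList = @('http', 'cifs', 'file', 'ftp')
        })
        delOldDataQuery = "subcollection:$Vendor"
        startTask       = $true
    }
    # -Depth must cover the nested subCollections/whiteList levels
    ConvertTo-Json $task -Depth 5 -Compress
}

# Possible usage (commented out; host names and ports are the placeholders used above):
# $json = New-CrawlTaskJson -SolrUrl 'http://solrServerInbuildMachine/solr/vendorA/' `
#             -Vendor 'vendorA' -StartUrl 'http://docsite:port/rootpath/'
# Invoke-RestMethod -Method Put -ContentType 'application/json' -Body $json `
#     -Uri 'http://nutchServer:port/nutch/cvcrawler'
```

Sending the body with Invoke-RestMethod avoids the \" escaping that passing JSON to a native executable requires.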
2. After the crawl finishes, zip the index folder solr\data\core_vendorA\index, copy the archive to %PREFIX%\new-index\core_vendorA\index on each production machine, and unzip it there.
PowerShell Script
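# Fail early if 7-Zip is not installed
if (-not (Test-Path "$env:ProgramFiles\7-Zip\7z.exe")) {throw "$env:ProgramFiles\7-Zip\7z.exe needed"}
Set-Alias sz "$env:ProgramFiles\7-Zip\7z.exe"
# %LOCALBUILPATH% is a placeholder for the local build path; substitute the real path
cd %LOCALBUILPATH%solr\data\core_vendorA\index
# Zip the index files in the current folder (-mx3 = fast compression)
&sz a -tzip index.zip  -mx3

# Copy the archive to the production machine over a UNC path and extract it there
Copy-Item index.zip \\prodMachineA\new-index\core_vendorA\index
cd \\prodMachineA\new-index\core_vendorA\index
&sz x index.zip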
Repeat the following steps on each production machine:
3. Delete the old index in core core_vendorA by sending a Solr request:
curl "http://localhost:8080/solr/core_vendorA/update?commit=true&stream.body=<delete><query>*:*</query></delete>"

4. Merge the index in new-index\core_vendorA\index into core core_vendorA by sending a Solr request:
curl "http://localhost:8080/solr/admin/cores?action=mergeindexes&core=core_vendorA&indexDir=%PREFIX%\new-index\core_vendorA\index"

5. Commit the merged index by sending a commit request to Solr:
curl "http://localhost:8080/solr/core_vendorA/update?commit=true"

Steps 3, 4, and 5 are fast; together they usually take less than a minute.
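Steps 3 to 5 can also be driven from one script. The following is only a sketch: Get-SolrSwapUrls is a hypothetical helper that builds the three request URLs shown above and URL-encodes the delete query and the index path. The commented usage additionally assumes that Solr on each production machine is reachable from the build machine on port 8080, and the host names and the D:\ path are made up.

```powershell
# Hypothetical helper: returns the request URLs for steps 3, 4 and 5, in order.
function Get-SolrSwapUrls {
    param(
        [string]$SolrBase,   # e.g. http://localhost:8080/solr
        [string]$Core,       # e.g. core_vendorA
        [string]$IndexDir    # index folder as seen by the Solr server itself
    )
    # URL-encode the values that contain reserved characters
    $delete = [uri]::EscapeDataString('<delete><query>*:*</query></delete>')
    $dir    = [uri]::EscapeDataString($IndexDir)
    @(
        "$SolrBase/$Core/update?commit=true&stream.body=$delete"               # step 3: delete old index
        "$SolrBase/admin/cores?action=mergeindexes&core=$Core&indexDir=$dir"   # step 4: merge new index
        "$SolrBase/$Core/update?commit=true"                                   # step 5: commit
    )
}

# Possible usage (commented out; host names and the index path are assumptions):
# $ErrorActionPreference = 'Stop'   # stop at the first failed request
# foreach ($machine in 'prodMachineA', 'prodMachineB') {
#     $urls = Get-SolrSwapUrls "http://${machine}:8080/solr" 'core_vendorA' 'D:\new-index\core_vendorA\index'
#     foreach ($url in $urls) { Invoke-WebRequest -Uri $url -UseBasicParsing | Out-Null }
# }
```

Keeping URL construction separate from the requests makes it possible to print and review the URLs before anything touches production.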

Resources
Nutch2: Extend Nutch2 to Crawl via Http API