The Problem
We change our documentation site periodically. We use Nutch to crawl it, save the index to a Solr server on a local build machine, and test it there.
On the release date, we deploy the new index to the production machines. We want to minimize downtime, so we can't restart the Solr server on the production machines.
The Solution
Luckily, Solr provides the mergeindexes action. It can't merge remote indexes, but we can easily use PowerShell to copy the new index to the production machines and then run mergeindexes locally.
We chose Windows PowerShell because it supports UNC paths such as \\serverA\labelB\pathc, which Windows batch files can't use as a working directory.
Steps and Script
1. Crawl the vendorA docs into core core_vendorA on the build machine.
This step is optional, since the site should already be crawled and tested before deployment. We include the script here for completeness.
We created a ServerResource on the Nutch side that exposes an HTTP API to start, stop, edit, or delete a crawl task and to monitor crawl status. For details, see Nutch2: Extend Nutch2 to Crawl via Http API.
PowerShell Script
# The JSON body is split across lines with + only for readability; quotes stay escaped for the native curl call
$data = '{\"solrURL\":\"http://solrServerInbuildMachine/solr/vendorA/\",' +
  '\"crawlID\":\"crawl_vendorA_ID1\",\"taskName\":\"taskl_vendorA_ID1\",\"crawlDepth\":2,' +
  '\"urls\":[\"http://docsite:port/rootpath/\"],' +
  '\"includePaths\":[\"+^(?i)http://docsite:port/rootpath/\"],' +
  '\"subCollections\":[' +
  '{\"name\":\"vendorA\",\"id\":\"vendorA\",\"whiteList\":[\"http\",\"cifs\",\"file\",\"ftp\"]},' +
  '{\"name\":\"vendorA\",\"id\":\"vendorA\",\"whiteList\":[\"http\",\"cifs\",\"file\",\"ftp\"]}],' +
  '\"solrindexParams\":\"update.chain=webCrawlerChain\",' +
  '\"delOldDataQuery\":\"subcollection:vendorA\",' +
  '\"sync\":true,\"deleteIfExist\":true,\"updateDirectly\":false,' +
  '\"tmpCoreName\":\"core-tmp1\",\"cleanData\":true,\"startTask\":true,\"reuseIfExist\":false,' +
  '\"fileToFileMappings\":{\"conf/nutch-site.xml\":\"conf/predefinedTasks/nutch-site-templateA.xml\"}}'
# curl.exe (not curl) avoids the Invoke-WebRequest alias that newer PowerShell versions define
& curl.exe -X PUT -H "Content-Type: application/json" -d $data http://nutchServer:port/nutch/cvcrawler
2. After the crawl is finished, zip the index folder solr\data\core_vendorA\index, copy the archive to each production machine (folder %PREFIX%\new-index\core_vendorA\index), and unzip it there.
PowerShell Script
if (-not (Test-Path "$env:ProgramFiles\7-Zip\7z.exe")) { throw "$env:ProgramFiles\7-Zip\7z.exe needed" }
Set-Alias sz "$env:ProgramFiles\7-Zip\7z.exe"
# Zip the index on the build machine (%LOCALBUILPATH% is a placeholder for the local build path)
cd %LOCALBUILPATH%solr\data\core_vendorA\index
& sz a -tzip index.zip -mx3
# Copy the archive to the production machine over its UNC path and unzip it there
Copy-Item index.zip \\prodMachineA\new-index\core_vendorA\index
cd \\prodMachineA\new-index\core_vendorA\index
& sz x index.zip
Do the following steps on each production machine:
3. Delete the old index in core core_vendorA by sending a Solr request:
curl "http://localhost:8080/solr/core_vendorA/update?commit=true&stream.body=<delete><query>*:*</query></delete>"
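The delete request above puts raw XML into the query string. As a minimal sketch, assuming we would rather percent-encode that XML before sending it, the URL could be built as follows. The function name Get-SolrDeleteAllUrl and its BaseUrl default are hypothetical and not part of our deployment scripts.

```powershell
# Hypothetical helper (not from the original scripts): builds the delete-all URL for a core.
function Get-SolrDeleteAllUrl {
    param(
        [string]$Core,
        [string]$BaseUrl = 'http://localhost:8080/solr'   # assumed default, matching the curl examples
    )
    $body = '<delete><query>*:*</query></delete>'
    # EscapeDataString percent-encodes characters such as < > / and : for use in a query parameter
    $encoded = [System.Uri]::EscapeDataString($body)
    return "${BaseUrl}/${Core}/update?commit=true&stream.body=${encoded}"
}

# Print the URL; it can then be passed to curl.exe or Invoke-WebRequest
Get-SolrDeleteAllUrl -Core 'core_vendorA'
```

Building the URL separately from sending it makes it easy to inspect the request before it touches a production core.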
4. Merge the index from new-index\core_vendorA\index into core core_vendorA by sending a Solr request:
curl "http://localhost:8080/solr/admin/cores?action=mergeindexes&core=core_vendorA&indexDir=%PREFIX%\new-index\core_vendorA\index"
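The mergeindexes request accepts one or more indexDir parameters. The sketch below shows one way the URL might be assembled with the Windows path percent-encoded. Get-SolrMergeUrl is a hypothetical name, and D:\new-index\core_vendorA\index is only an illustrative stand-in for %PREFIX%\new-index\core_vendorA\index.

```powershell
# Hypothetical helper (not from the original scripts): builds a CoreAdmin mergeindexes URL.
function Get-SolrMergeUrl {
    param(
        [string]$Core,
        [string[]]$IndexDir,                              # one or more index directories to merge in
        [string]$BaseUrl = 'http://localhost:8080/solr'   # assumed default, matching the curl examples
    )
    # Encode each directory so backslashes and the drive colon are safe inside the query string
    $dirs = $IndexDir | ForEach-Object { 'indexDir=' + [System.Uri]::EscapeDataString($_) }
    return "${BaseUrl}/admin/cores?action=mergeindexes&core=${Core}&" + ($dirs -join '&')
}

# Illustrative path only; substitute the real %PREFIX% location
Get-SolrMergeUrl -Core 'core_vendorA' -IndexDir 'D:\new-index\core_vendorA\index'
```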
5. Commit the merged index by sending a commit request to Solr:
curl "http://localhost:8080/solr/core_vendorA/update?commit=true"
Steps 3, 4, and 5 are pretty fast; together they usually take less than 1 minute.
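Because steps 3 to 5 must run in order, and a failed delete or merge should stop the rest, they could be wrapped in a small runner. This is only a sketch under assumptions: Invoke-SolrSteps and its Send parameter are hypothetical names, and the default sender assumes Invoke-WebRequest is available (PowerShell 3.0 or later).

```powershell
# Hypothetical runner (not from the original scripts): sends each URL in order, stops on first failure.
function Invoke-SolrSteps {
    param(
        [string[]]$Urls,
        # The sender is injectable so the sequence can be dry-run; the default issues a real HTTP GET
        [scriptblock]$Send = { param($u) (Invoke-WebRequest -Uri $u -UseBasicParsing).StatusCode }
    )
    foreach ($url in $Urls) {
        $status = & $Send $url
        if ($status -ne 200) { throw "Request failed ($status): $url" }
        "OK  $url"
    }
}

# Dry run with a fake sender that always reports 200, using the three requests from the steps above
$steps = @(
    'http://localhost:8080/solr/core_vendorA/update?commit=true&stream.body=<delete><query>*:*</query></delete>',
    'http://localhost:8080/solr/admin/cores?action=mergeindexes&core=core_vendorA&indexDir=%PREFIX%\new-index\core_vendorA\index',
    'http://localhost:8080/solr/core_vendorA/update?commit=true'
)
Invoke-SolrSteps -Urls $steps -Send { param($u) 200 }
```

Dropping the -Send argument would send the requests for real, so a dry run like this is a cheap check before release day.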
Resources
Nutch2: Extend Nutch2 to Crawl via Http API