Monday, March 31, 2014

How to set up Apache SolrCloud




Definitions


SolrCloud scales Apache Solr out across multiple machines. A collection (a single search index, logically grouped) is split into shards (logical sections of the collection), and each shard serves part of the requests. Physically, the index is split into multiple cores (physical indexes) that reside on multiple nodes, which together form a cluster. If the request rate increases, we can place additional copies of each core, called replicas, on the nodes; the core that coordinates indexing for a shard is called the leader. Note that coordination is handled by Apache ZooKeeper, a separate coordination service, rather than by an internal communication protocol such as the gossip protocol in Apache Cassandra.
So scaling out in SolrCloud is done by sharding and replication, i.e. adding more nodes that host cores of the collection, including replicas.


Shard:
A logical section of a single collection. Sometimes people use "shard" in a physical sense (a manifestation of a logical shard).

Replica:
A physical manifestation of a logical Shard, implemented as a single Lucene index on a SolrCore

Leader:
One Replica of every Shard will be designated as a Leader to coordinate indexing for that Shard

SolrCore:
Encapsulates a single physical index. One or more make up logical shards (or slices) which make up a collection.

Node:
A single instance of Solr. A single Solr instance can have multiple SolrCores that can be part of any number of collections.

Cluster:
All of the nodes you are using to host SolrCores.



Script to run SolrCloud:


Create configuration folders.


A Solr instance is made up of:
a conf directory that contains the configuration of each collection to be indexed,
i.e. solr->conf->collection1.
Each collection directory contains a few simple configuration files in its own conf subdirectory:
collection1->conf
A good way to start is to copy the solr-xx/example/solr/collection1 directory.

Then change the name of the collection inside this directory, in core.properties.

A Solr instance also has a Solr home directory (solrhome) that contains solr.xml and zoo.cfg. The Solr home represents a node and holds the index data for that node. Each node can be on a different machine. Ours will be solr1.
solr.xml and zoo.cfg can be copied from the original solr-xx directory, under example/solr.
These files contain parameters that may need to be changed, such as the hostname and port numbers.
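
The folder setup above can be sketched as two small shell functions. This is only a sketch under assumptions: the function names make_collection_conf and make_node_home are made up for illustration, the paths in the commented example are placeholders, and core.properties is assumed to contain a name= line as in the example collection.

```shell
# Sketch: lay out a collection config and a node home from the Solr example directory.
# Function names and paths are illustrative, not part of Solr.

# make_collection_conf EXAMPLE_SOLR_DIR CONF_ROOT NAME
# Copies EXAMPLE_SOLR_DIR/collection1 to CONF_ROOT/NAME and renames the core.
make_collection_conf() {
    mkdir -p "$2" || return 1
    cp -R "$1/collection1" "$2/$3" || return 1
    # core.properties carries the core name; rewrite only the name= line.
    sed "s/^name=.*/name=$3/" "$2/$3/core.properties" > "$2/$3/core.properties.tmp" &&
        mv "$2/$3/core.properties.tmp" "$2/$3/core.properties"
}

# make_node_home EXAMPLE_SOLR_DIR NODE_HOME
# Copies solr.xml and zoo.cfg into a fresh node home (e.g. solr1).
make_node_home() {
    mkdir -p "$2" || return 1
    cp "$1/solr.xml" "$1/zoo.cfg" "$2/"
}

# Example (commented out so nothing is copied by accident):
# make_collection_conf solr-xx/example/solr "$HOME/app/solr/conf" testcollection3
# make_node_home solr-xx/example/solr "$HOME/app/solr/solr1"
```

The target collection directory should not exist yet, otherwise cp -R nests the copy inside it.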


Install Zookeeper


Download and install ZooKeeper.
Create a data directory and point dataDir at it in zookeeper-xx/conf/zoo.cfg.
Change the client port there too if needed (clientPort, default: 2181).

Start the server with ./zkServer.sh start
and check that it is running with ./zkServer.sh status.
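
For a standalone ZooKeeper, a minimal zoo.cfg only needs a tick time, the data directory and the client port. The helper below is a sketch that writes such a file; the name write_zoo_cfg is made up for illustration, and tickTime=2000 is simply the value used in the ZooKeeper sample configuration.

```shell
# write_zoo_cfg CFG_PATH DATA_DIR [CLIENT_PORT]
# Creates DATA_DIR and writes a minimal standalone zoo.cfg to CFG_PATH.
write_zoo_cfg() {
    mkdir -p "$2" || return 1
    cat > "$1" <<EOF
tickTime=2000
dataDir=$2
clientPort=${3:-2181}
EOF
}

# Example (commented out so no file is overwritten by accident):
# write_zoo_cfg zookeeper-xx/conf/zoo.cfg /path/to/zookeeper/data 2181
```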


Upload configuration into zookeeper


First, run Solr by itself; this is required to bootstrap properly.
java -jar start.jar

The zkcli.sh script uploads a collection's configuration into ZooKeeper. You pass it the ZooKeeper address, the collection's conf directory and the name to store the configuration under, e.g.:

cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /Users/mlieber/app/solr/conf/testcollection3/conf -confname testcollection3

or, against the ZooKeeper embedded in Solr (which listens on the Solr port + 1000, so 9983 by default):

cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:9983 -confdir /Users/mlieber/app/solr/conf/collection1/conf -confname testcollection


This is a one-time task.
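
If there are several collections, the upload can be scripted. The sketch below only prints the upconfig commands (a dry run) so they can be reviewed first. It assumes one directory per collection under a common root, each with a conf subfolder, and uses the directory name as the configuration name; upconfig_all is a made-up helper name.

```shell
# upconfig_all ZKHOST CONF_ROOT
# Prints one zkcli.sh upconfig command per CONF_ROOT/<collection>/conf directory.
upconfig_all() {
    for dir in "$2"/*/conf; do
        [ -d "$dir" ] || continue             # skip when the glob matches nothing
        name=$(basename "$(dirname "$dir")")  # directory name doubles as the config name
        printf 'cloud-scripts/zkcli.sh -cmd upconfig -zkhost %s -confdir %s -confname %s\n' \
            "$1" "$dir" "$name"
    done
}

# Example: upconfig_all localhost:2181 /path/to/solr/conf
```

Once the printed commands look right, they can be run from the Solr example directory. Paths containing spaces would need extra quoting.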


Run the Solr node


This is done via the start.jar program found in solr-xx/example. We either start the ZooKeeper embedded in Solr:

java -DzkRun -Dsolr.solr.home=/Users/mlieber/app/solr/solrhome/ -jar start.jar

or, in production, point to our own ZooKeeper instance:

java -Dsolr.solr.home=/Users/mlieber/app/solr/solrhome1/ -DzkHost=localhost:2181 -jar start.jar

solr.solr.home is the Solr home directory that was created for that node.

If you are testing this on a single machine with multiple nodes, you need to give the second node a different Jetty port and reflect this in the command:

java -Dsolr.solr.home=/Users/mlieber/app/solr/solrhome1/ -DzkHost=localhost:2181 -Djetty.port=8984 -jar start.jar
The jetty.port can also optionally be changed in the node configuration folder, at solrhome1/solr.xml.
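
When starting several nodes, it helps to generate the command lines rather than retype them. The sketch below prints the start command for a node and computes the port of the embedded ZooKeeper, which with -DzkRun listens on the Jetty port + 1000. Both function names are made up for illustration.

```shell
# embedded_zk_port JETTY_PORT
# With -DzkRun, the embedded ZooKeeper listens on the Jetty port + 1000.
embedded_zk_port() {
    echo $(( $1 + 1000 ))
}

# node_cmd SOLR_HOME ZKHOST [JETTY_PORT]
# Prints the start command for one node; Jetty's default port is 8983.
node_cmd() {
    printf 'java -Dsolr.solr.home=%s -DzkHost=%s -Djetty.port=%s -jar start.jar\n' \
        "$1" "$2" "${3:-8983}"
}

# Example: print the command for a second node on the same machine.
node_cmd /path/to/solr/solrhome1/ localhost:2181 8984
```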

Use the Collections API to create the collection


Next, we can create our collection with an HTTP call to the Solr Collections API, e.g.:

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=testcollection3&numShards=2&maxShardsPerNode=3&replicationFactor=3'

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=testcollection&numShards=2&maxShardsPerNode=2&replicationFactor=1'

We need to pass the name of the collection being created, the number of shards (numShards), the replication factor (replicationFactor) and the maximum number of shards per node (maxShardsPerNode). You'll get a useful error if the request cannot be satisfied. E.g. 2 shards with a replication factor of 2 on a single node means 2 x 2 = 4 cores on that node, so maxShardsPerNode must be at least 4.
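
The arithmetic behind maxShardsPerNode is: total cores = numShards x replicationFactor, spread over the live nodes and rounded up. The sketch below computes that minimum and builds the CREATE URL from it. The function names are made up for illustration, and the node count is something you supply yourself; the script does not query the cluster.

```shell
# min_shards_per_node NUM_SHARDS REPLICATION_FACTOR NODES
# Smallest maxShardsPerNode that fits all cores: ceil(shards * rf / nodes).
min_shards_per_node() {
    echo $(( ($1 * $2 + $3 - 1) / $3 ))
}

# create_collection_url HOST_PORT NAME NUM_SHARDS REPLICATION_FACTOR NODES
# Prints the Collections API CREATE URL with the minimum maxShardsPerNode.
create_collection_url() {
    printf 'http://%s/solr/admin/collections?action=CREATE&name=%s&numShards=%s&maxShardsPerNode=%s&replicationFactor=%s\n' \
        "$1" "$2" "$3" "$(min_shards_per_node "$3" "$4" "$5")" "$4"
}

# Example: 2 shards, replication factor 2, on a single node.
create_collection_url localhost:8983 testcollection 2 2 1
```

The printed URL can be passed to curl in single quotes, as in the commands above.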

Example:
Create testcollection with 2 shards and replication factor 2, running on 2 JVMs (4 cores in total, 2 per node):

curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=testcollection&numShards=2&maxShardsPerNode=2&replicationFactor=2'

You can then add a document to this collection via:

java -Durl=http://localhost:8983/solr/testcollection/update -jar ./example/exampledocs/post.jar ./example/exampledocs/monitor.xml

Add a replica to an existing shard


You can add a replica to each shard after the initial creation. The syntax is simply to create a core with a new name on the node that should host it, passing the collection and shard it belongs to. E.g.:

curl 'http://localhost:8984/solr/admin/cores?action=CREATE&collection=testcollection3&shard=shard1&name=testcollection3_shard1_replica4'
curl 'http://localhost:8983/solr/admin/cores?action=CREATE&collection=testcollection3&shard=shard1&name=testcollection3_shard1_replica5'
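
The core names above follow the pattern collection_shard_replicaN. A small helper can build the URL from its parts so that the name stays consistent. This is a sketch; replica_url is a made-up name, and the naming pattern is just the convention used in the two commands above.

```shell
# replica_url HOST_PORT COLLECTION SHARD REPLICA_NUMBER
# Prints the CoreAdmin CREATE URL for a new replica core on the given node.
replica_url() {
    printf 'http://%s/solr/admin/cores?action=CREATE&collection=%s&shard=%s&name=%s_%s_replica%s\n' \
        "$1" "$2" "$3" "$2" "$3" "$4"
}

# Example: a fourth replica of shard1, hosted by the node on port 8984.
replica_url localhost:8984 testcollection3 shard1 4
```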


Administration


You can inspect the resulting cluster layout from the admin UI, under Cloud/Tree, in clusterstate.json.


Set up Solr on Tomcat


By default Solr is bundled with Jetty as the web server. Tomcat is often considered more robust as a Servlet container, so it is sometimes preferable to switch Solr over to Tomcat.

- Copy Solr's solr.war (usually in $SOLR_HOME/example/webapps/solr.war) to $TOMCAT_HOME/webapps to make Tomcat aware of Solr.
- Add the following to Tomcat, in the file 'conf/Catalina/localhost/solr.xml'. It refers to the location of the solr.war you copied, as well as your SolrCloud node location.

<Context path="/solr" docBase="/app/apache-tomcat-7.0.29/webapps/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/app/solrnode1" override="true"/>
</Context>

- I was told to also add this as a precaution, in conf/server.xml. URIEncoding="UTF-8" makes Tomcat decode request parameters as UTF-8, which matters for non-ASCII queries:

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8" />

- cp $SOLR_HOME/example/lib/ext/* $TOMCAT_HOME/lib/
- cp $SOLR_HOME/example/resources/log4j.properties $TOMCAT_HOME/lib/

- Edit catalina.sh and add these so that Solr knows its own host, port and context, and where ZooKeeper is:
   SOLR_OPTS="-Dhost=localhost -DhostPort=8080 -DhostContext=solr -DzkClientTimeout=20000 -DzkHost=localhost:2181"
   JAVA_OPTS="$JAVA_OPTS $SOLR_OPTS"

- Change 'jetty.port' to 'hostPort' in solr.xml
- Start Tomcat
./catalina.sh start


Take a look at the Tomcat logs to make sure everything is OK, in catalina.out in $TOMCAT_HOME/logs.
To view your Solr cores, go to http://{your-ip-address}:8080/solr
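
When setting up more than one Tomcat node, the context file and the SOLR_OPTS line are the two pieces that change per node. The sketch below generates both from parameters. The function names write_solr_context and solr_opts are made up for illustration, and the paths in the commented example are the ones used earlier in this post.

```shell
# write_solr_context FILE WAR_PATH SOLR_HOME
# Writes the Tomcat context descriptor that points at solr.war and the node's Solr home.
write_solr_context() {
    mkdir -p "$(dirname "$1")" || return 1
    cat > "$1" <<EOF
<Context path="/solr" docBase="$2" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="$3" override="true"/>
</Context>
EOF
}

# solr_opts HOST PORT ZKHOST
# Prints the value to assign to SOLR_OPTS in catalina.sh.
solr_opts() {
    printf '%s\n' "-Dhost=$1 -DhostPort=$2 -DhostContext=solr -DzkClientTimeout=20000 -DzkHost=$3"
}

# Example (commented out so no Tomcat file is overwritten by accident):
# write_solr_context "$TOMCAT_HOME/conf/Catalina/localhost/solr.xml" \
#     /app/apache-tomcat-7.0.29/webapps/solr.war /app/solrnode1
# solr_opts localhost 8080 localhost:2181
```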