How to set up SolrCloud
Definitions
SolrCloud scales Apache Solr out across multiple machines. A collection
(a single search index, logically grouped) is split into multiple shards
(each a logical section of the collection), and each shard serves requests
for its part of the index. Physically, the index is split into multiple
cores (physical indexes) residing on multiple nodes, which together form
a cluster. If the request rate increases, we can keep multiple copies of
each core across the nodes; these copies are called replicas, and one
replica of each shard acts as the leader. Note that coordination is handled
by ZooKeeper, a separate coordination service, rather than by an internal
protocol such as the gossip protocol used in Apache Cassandra.
Scaling out in SolrCloud is therefore done by sharding, i.e. adding
more nodes that host cores of the collection, including replicas.
Shard:
A logical section of a single collection. Sometimes people
will talk about a "shard" in a physical sense (a manifestation of a
logical shard).
Replica:
A physical manifestation of a logical shard, implemented as
a single Lucene index on a SolrCore.
Leader:
One replica of every shard is designated as the leader to
coordinate indexing for that shard.
SolrCore:
Encapsulates a single physical index. One or more make up
logical shards (or slices) which make up a collection.
Node:
A single instance of Solr. A single Solr instance can have
multiple SolrCores that can be part of any number of collections.
Cluster:
All of the nodes you are using to host SolrCores.
Script to run SolrCloud:
Create configuration folders.
A Solr instance includes a conf directory that holds the configuration
of each collection to be indexed,
i.e. solr->conf->collection1.
Each collection directory contains its own conf folder with a few
configuration files:
collection1->conf
A good way to start is to copy the
solr-xx/example/solr/collection1 directory,
then change the name of the collection in the core.properties file
inside the copy.
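The copy-and-rename step above can be scripted. This is a minimal sketch, assuming the example collection's core.properties carries a name= line; the function name make_collection_conf and its arguments are hypothetical helpers, not part of Solr.

```shell
# Copy an example collection directory and rename the collection
# in core.properties.
# Usage: make_collection_conf <example-collection-dir> <target-dir> <new-name>
make_collection_conf() {
    src=$1; dest=$2; name=$3
    [ -d "$src/conf" ] || { echo "no conf directory in $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    # Copy the directory contents (conf folder, core.properties, ...).
    cp -R "$src/." "$dest/" || return 1
    # Assumption: core.properties names the collection with name=<collection>.
    if [ -f "$dest/core.properties" ] && grep -q '^name=' "$dest/core.properties"; then
        sed "s/^name=.*/name=$name/" "$dest/core.properties" \
            > "$dest/core.properties.tmp" &&
            mv "$dest/core.properties.tmp" "$dest/core.properties"
    else
        echo "name=$name" >> "$dest/core.properties"
    fi
}
```

For example, make_collection_conf solr-xx/example/solr/collection1 solr/conf/testcollection3 testcollection3 would produce a testcollection3 directory with its own conf folder.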
A Solr instance also contains a Solr home directory, which holds
solr.xml and zoo.cfg. A Solr home represents a node and contains the
index data for that node. Nodes can live on different machines. Ours will
be solr1.
solr.xml and zoo.cfg can be copied from the original solr-xx
directory, under example/solr.
These files contain parameters that may need to be changed,
such as the hostname and port numbers.
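Seeding a node's Solr home can be sketched the same way. The function name make_solr_home is hypothetical; the only assumption is that solr.xml and zoo.cfg sit directly under the example/solr directory, as described above.

```shell
# Create a node's Solr home and seed it with solr.xml and zoo.cfg.
# Usage: make_solr_home <solr-xx/example/solr> <solr-home-dir>
make_solr_home() {
    example=$1; home=$2
    mkdir -p "$home" || return 1
    for f in solr.xml zoo.cfg; do
        [ -f "$example/$f" ] || { echo "missing $example/$f" >&2; return 1; }
        cp "$example/$f" "$home/$f" || return 1
    done
}
```

After the copy, review the hostname and port settings in the new solr.xml before starting the node.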
Install ZooKeeper
Download and install ZooKeeper.
Create a data directory and point the dataDir setting in
zookeeper-xx/conf/zoo.cfg at it.
Also change the client port there if needed (default: 2181).
Start the server, then check that it is running:
./zkServer.sh start
./zkServer.sh status
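For a single standalone ZooKeeper, the configuration file only needs a handful of settings. The sketch below writes such a file; tickTime, dataDir and clientPort are standard zoo.cfg keys, while the function name write_zoo_cfg and the paths passed to it are hypothetical.

```shell
# Write a minimal standalone zoo.cfg and create its data directory.
# Usage: write_zoo_cfg <cfg-file> <data-dir> [client-port]
write_zoo_cfg() {
    cfg=$1; data_dir=$2; port=${3:-2181}
    mkdir -p "$data_dir" || return 1
    cat > "$cfg" <<EOF
tickTime=2000
dataDir=$data_dir
clientPort=$port
EOF
}
```

Calling write_zoo_cfg zookeeper-xx/conf/zoo.cfg /some/data/dir keeps the default port 2181; pass a third argument to use another port.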
Upload the configuration into ZooKeeper
First, run Solr by itself; this is required to bootstrap
properly.
java -jar start.jar
The zkcli.sh script uploads a collection's configuration into ZooKeeper.
Pass it the ZooKeeper address, the collection's conf directory and the
name to store the configuration under, e.g.:
cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir /Users/mlieber/app/solr/conf/testcollection3/conf -confname testcollection3
or, if ZooKeeper listens on port 9983 (the default for Solr's embedded
ZooKeeper when Solr itself runs on port 8983):
cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:9983 -confdir /Users/mlieber/app/solr/conf/collection1/conf -confname testcollection
This is a one-time task.
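When several collections need uploading, a thin wrapper around the upconfig call avoids retyping the flags. This is only a sketch: the function name upconfig and the ZKCLI and DRY_RUN variables are conventions invented here, not Solr settings. By default it prints the command instead of running it.

```shell
# Print (or run) the zkcli.sh upconfig command for one collection.
# Usage: upconfig <zkhost> <confdir> <confname>
# ZKCLI   path to zkcli.sh (default: cloud-scripts/zkcli.sh)
# DRY_RUN 1 prints the command (default), 0 executes it
upconfig() {
    zkhost=$1; confdir=$2; confname=$3
    [ -d "$confdir" ] || { echo "confdir not found: $confdir" >&2; return 1; }
    # Rebuild the positional parameters as the full command line.
    set -- "${ZKCLI:-cloud-scripts/zkcli.sh}" -cmd upconfig \
        -zkhost "$zkhost" -confdir "$confdir" -confname "$confname"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Run it with DRY_RUN=0 once the printed command looks right.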
Run the Solr node
This is done via the start.jar program found in
solr-xx/example. We can use either the ZooKeeper embedded in Solr:
java -DzkRun -Dsolr.solr.home=/Users/mlieber/app/solr/solrhome/ -jar start.jar
or, in production, our own ZooKeeper instance:
java -Dsolr.solr.home=/Users/mlieber/app/solr/solrhome1/ -DzkHost=localhost:2181 -jar start.jar
solr.solr.home is the Solr home directory that was created for
that node.
When testing on a single machine with multiple nodes, each additional
node needs its own Jetty port, passed on the command line:
java -Dsolr.solr.home=/Users/mlieber/app/solr/solrhome1/ -DzkHost=localhost:2181 -Djetty.port=8984 -jar start.jar
The Jetty port can alternatively be changed in the node's
configuration folder, in solrhome1/solr.xml.
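For a multi-node test on one machine, the start commands differ only in the Solr home and the Jetty port, so they can be generated. The sketch below only prints the commands; the function name node_cmd and the /path/to/solrhomeN directories are placeholders, and the printed commands are meant to be run from solr-xx/example.

```shell
# Print the start command for one node.
# Usage: node_cmd <solr-home> <zkhost> [jetty-port]
node_cmd() {
    cmd="java -Dsolr.solr.home=$1 -DzkHost=$2"
    # Only add -Djetty.port when a port was given.
    [ -n "${3:-}" ] && cmd="$cmd -Djetty.port=$3"
    echo "$cmd -jar start.jar"
}

# Two nodes on one machine, on consecutive ports starting at 8983.
port=8983
for n in 1 2; do
    node_cmd "/path/to/solrhome$n/" localhost:2181 "$port"
    port=$((port + 1))
done
```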
Create the collection with the Collections API
Next, we create our collection through Solr's Collections API,
via an HTTP call, e.g.:
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=testcollection3&numShards=2&maxShardsPerNode=3&replicationFactor=3'
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=testcollection&numShards=2&maxShardsPerNode=2&replicationFactor=1'
We need to pass the name of the collection being created, the number of
shards (numShards), the replication factor (replicationFactor) and the
maximum number of shards per node (maxShardsPerNode). You'll get a useful
error if the request cannot be satisfied. E.g. 2 shards with a replication
factor of 2 make 4 cores, so on a single node maxShardsPerNode must be
at least 4.
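The arithmetic behind that example can be written down explicitly: the cluster must be able to host numShards x replicationFactor cores, spread over the available nodes. A small sketch, where the function name and the third argument (the number of nodes) are introduced here for illustration:

```shell
# Smallest maxShardsPerNode that fits every core:
# ceil(numShards * replicationFactor / nodes)
# Usage: min_max_shards_per_node <numShards> <replicationFactor> <nodes>
min_max_shards_per_node() {
    cores=$(( $1 * $2 ))
    # Integer ceiling division.
    echo $(( (cores + $3 - 1) / $3 ))
}

min_max_shards_per_node 2 2 1   # 2 shards, RF=2, 1 node  -> prints 4
min_max_shards_per_node 2 2 2   # 2 shards, RF=2, 2 nodes -> prints 2
```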
Example:
Create testcollection with 2 shards and a replication factor of 2,
running on 2 JVMs:
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=testcollection&numShards=2&maxShardsPerNode=2&replicationFactor=2'
You can then add a document to this collection via:
java -Durl=http://localhost:8983/solr/testcollection/update -jar ./example/exampledocs/post.jar ./example/exampledocs/monitor.xml
Add a replica on an existing node
You can add replicas to each shard after the initial creation. Call the
CoreAdmin API on the node that should host the replica, passing the
collection, the shard and a name for the new core. E.g.:
curl 'http://localhost:8984/solr/admin/cores?action=CREATE&collection=testcollection3&shard=shard1&name=testcollection3_shard1_replica4'
curl 'http://localhost:8983/solr/admin/cores?action=CREATE&collection=testcollection3&shard=shard1&name=testcollection3_shard1_replica5'
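The two calls above differ only in the target node and the replica number, so the URL can be built from those parts. This sketch assumes the core naming pattern seen in the examples (collection, shard, then replicaN); the function name add_replica_url is hypothetical.

```shell
# Build the CoreAdmin CREATE URL that adds one replica to a shard.
# Usage: add_replica_url <host:port> <collection> <shard> <replica-number>
add_replica_url() {
    # Core name follows <collection>_<shard>_replica<N>, as in the examples.
    core="${2}_${3}_replica${4}"
    printf 'http://%s/solr/admin/cores?action=CREATE&collection=%s&shard=%s&name=%s\n' \
        "$1" "$2" "$3" "$core"
}
```

It would be used as curl "$(add_replica_url localhost:8984 testcollection3 shard1 4)", with the host and port of the node that should hold the new replica.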
Administration
You can view and check the cluster configuration from the admin UI,
under Cloud/Tree, in clusterstate.json.
Set up Solr on Tomcat
By default Solr is bundled with Jetty as the web server.
Tomcat is sometimes considered more robust as a servlet container, so it
can be preferable to switch Solr over to Tomcat.
- Copy Solr’s solr.war (usually in $SOLR_HOME/example/webapps/solr.war)
to $TOMCAT_HOME/webapps
to make Tomcat aware of Solr.
- Add the following to Tomcat, in the file
conf/Catalina/localhost/solr.xml, pointing docBase at the solr.war you
copied and solr/home at your SolrCloud node's directory:
<Context path="/solr"
docBase="/app/apache-tomcat-7.0.29/webapps/solr.war"
debug="0" crossContext="true">
<Environment name="solr/home"
type="java.lang.String" value="/app/solrnode1"
override="true"/>
</Context>
- As a precaution, also set URIEncoding to UTF-8 on the HTTP connector
in conf/server.xml:
<Connector port="8080"
protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443"
URIEncoding="UTF-8" />
- cp $SOLR_HOME/example/lib/ext/* $TOMCAT_HOME/lib/
- cp $SOLR_HOME/example/resources/log4j.properties $TOMCAT_HOME/lib/
- Edit catalina.sh and add these lines so that Solr knows its host,
port, context and ZooKeeper address:
SOLR_OPTS="-Dhost=localhost -DhostPort=8080 -DhostContext=solr -DzkClientTimeout=20000 -DzkHost=localhost:2181"
JAVA_OPTS="$JAVA_OPTS $SOLR_OPTS"
- In solr.xml, change 'jetty.port' to 'hostPort'
- Start Tomcat:
./catalina.sh start
Check the Tomcat logs (catalina.out in $TOMCAT_HOME/logs) to make sure
everything is OK.
To view your Solr cores, go to
http://{your-ip-address}:8080/solr
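As an alternative to editing catalina.sh directly, the same options can be kept in $TOMCAT_HOME/bin/setenv.sh, which catalina.sh sources when the file exists. The sketch below shows possible contents for that file; SOLR_HOST, SOLR_PORT and ZK_HOST are variable names made up for this example, and the values mirror the ones used earlier.

```shell
# Candidate contents for $TOMCAT_HOME/bin/setenv.sh.
# SOLR_HOST, SOLR_PORT and ZK_HOST are names local to this sketch;
# override them in the environment if the defaults do not fit.
SOLR_HOST=${SOLR_HOST:-localhost}
SOLR_PORT=${SOLR_PORT:-8080}
ZK_HOST=${ZK_HOST:-localhost:2181}

SOLR_OPTS="-Dhost=$SOLR_HOST -DhostPort=$SOLR_PORT -DhostContext=solr"
SOLR_OPTS="$SOLR_OPTS -DzkClientTimeout=20000 -DzkHost=$ZK_HOST"

# Append to whatever JAVA_OPTS already holds.
JAVA_OPTS="${JAVA_OPTS:-} $SOLR_OPTS"
export JAVA_OPTS
```

Keeping the settings in setenv.sh means a Tomcat upgrade that replaces catalina.sh does not lose them.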