How To Cluster Rabbit-MQ

Foreword

This explanation of clustering Rabbit-MQ assumes that you’ve had some experience with Rabbit-MQ, at least to the point of being able to get it up and running and processing messages. For this explanation I will be using CentOS Linux; other Linux distributions may require slight modifications to the setup process. You will need at least two machines or virtual instances up and running, with Rabbit-MQ installed on each.

Overview

Clustering Rabbit-MQ is actually very simple once you understand what’s going on and how it works. There is no need for a load balancer or any other hardware/software component, and the idea is simple: send all messages to the master queue and let the master distribute the messages down to the slaves.

Create Short Names

First, we need to change the host name and host entries of our machines to something short. Rabbit-MQ has trouble clustering queues with fully qualified DNS names; we’ll need a single short-word host name and route for each machine. For now, let’s use the name “master” for the master head, then “slave1”, “slave2” … “slaveN” for the rest.

Set the master host name to “master”

echo "master" > /proc/sys/kernel/hostname

Next we need to set the entries in the /etc/hosts file to allow the short names to be aliased to machine or instance IPs.  Open the /etc/hosts file in your favorite editor and add the following lines:

cat /etc/hosts
127.0.0.1   localhost   localhost.localdomain

192.168.0.100   master   master.localdomain
192.168.0.101   slave1   slave1.localdomain
192.168.0.102   slave2   slave2.localdomain

Please note: Your particular /etc/hosts file will look different from the above. You’ll need to substitute your actual IP addresses and domain suffix for each entry.

Make sure each slave you plan to add has an entry in the /etc/hosts file of the master. To verify your settings for each of the entries you provide, try pinging them by their short name.

ping master
PING master (192.168.0.100) 56(84) bytes of data.
64 bytes from master (192.168.0.100): icmp_seq=1 ttl=61 time=0.499 ms
64 bytes from master (192.168.0.100): icmp_seq=2 ttl=61 time=0.620 ms
64 bytes from master (192.168.0.100): icmp_seq=3 ttl=61 time=0.590 ms
64 bytes from master (192.168.0.100): icmp_seq=4 ttl=61 time=0.494 ms

If you get something like the above, you’re good to go. If not, take a good look at your settings and adjust them until you do.

Once your short names are set up in the master’s /etc/hosts file, copy the /etc/hosts file to every slave so that all machines have the same hosts file entries, or more specifically, so that each machine has the master and slave routes. If you’re familiar with routing, feel free to just add the missing routes.
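
For example, assuming you have root SSH access to each slave (the host names below are the short names defined above), you could copy the file from the master like this:

scp /etc/hosts root@slave1:/etc/hosts
scp /etc/hosts root@slave2:/etc/hosts
scp /etc/hosts root@slave3:/etc/hosts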

Then update the host name on each slave, running the matching command on the matching machine.

echo "slave1" > /proc/sys/kernel/hostname
echo "slave2" > /proc/sys/kernel/hostname
echo "slave3" > /proc/sys/kernel/hostname

Synchronize the Erlang Cookie

Next we need to synchronize our Erlang cookie. Rabbit-MQ needs this to be the same on all machines for them to communicate properly. The file we need is located on the master at /var/lib/rabbitmq/.erlang.cookie; we’ll cat this value and then update the cookies on each slave.

cat /var/lib/rabbitmq/.erlang.cookie
DQRRLCTUGOBCRFNPIABC

Copy the value displayed by the cat.

Please notice that the file itself stores the value without a trailing carriage return or line feed. This value needs to go into the slaves the same way. Do so by executing the following command on each slave, making sure you use the “-n” flag.

First, let’s make sure we stop the rabbitmq-server on each slave before updating the Erlang cookie.

service rabbitmq-server stop
service rabbitmq-server stop
service rabbitmq-server stop

Next, let’s update the cookie and start the service back up on each slave.

echo -n "DQRRLCTUGOBCRFNPIABC" > /var/lib/rabbitmq/.erlang.cookie
service rabbitmq-server start
echo -n "DQRRLCTUGOBCRFNPIABC" > /var/lib/rabbitmq/.erlang.cookie
service rabbitmq-server start
echo -n "DQRRLCTUGOBCRFNPIABC" > /var/lib/rabbitmq/.erlang.cookie
service rabbitmq-server start

Once again, substitute the “DQRRLCTUGOBCRFNPIABC” value with your actual Erlang cookie value.
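
Alternatively, you can copy the cookie file itself from the master, which sidesteps the newline issue entirely. This is a sketch assuming root SSH access; either way, make sure the file ends up owned by the rabbitmq user with restrictive permissions, or the broker may refuse to start:

scp /var/lib/rabbitmq/.erlang.cookie root@slave1:/var/lib/rabbitmq/.erlang.cookie
ssh root@slave1 "chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie && chmod 400 /var/lib/rabbitmq/.erlang.cookie"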

Create The Cluster

Now we cluster the queues together. Starting with the master, issue the following commands:

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app

Next we cluster the slaves to the master. For each slave execute the following commands:

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl cluster rabbit@master
rabbitmqctl start_app
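
Please note: the cluster sub-command shown above matches the older Rabbit-MQ releases used in this article. On Rabbit-MQ 3.0 and later it was renamed, so on a newer broker the slave-side sequence would look roughly like this:

rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@master
rabbitmqctl start_app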

The stop_app, reset and cluster commands are what actually join each slave to the master. To verify that everything is in working order, issue the following command on the master or on any slave:

rabbitmqctl status
Status of node rabbit@master ...
[{running_applications,[{rabbit,"RabbitMQ","1.7.2"},
                        {mnesia,"MNESIA  CXC 138 12","4.4.3"},
                        {os_mon,"CPO  CXC 138 46","2.1.6"},
                        {sasl,"SASL  CXC 138 11","2.1.5.3"},
                        {stdlib,"ERTS  CXC 138 10","1.15.3"},
                        {kernel,"ERTS  CXC 138 10","2.12.3"}]},
 {nodes,[rabbit@slave1,rabbit@slave2,rabbit@slave3,rabbit@master]},
 {running_nodes,[rabbit@slave1,rabbit@slave2,rabbit@slave3,rabbit@master]}]
...done.

Notice the lines containing nodes and running_nodes. They should list all of the master and slave entries. If they do not, you may have done something wrong; go back and try executing all the steps again. Otherwise, you’re good to go. Start sending messages to the master and watch as they are distributed to each of the slave nodes.

You can always dynamically add more slave nodes. To do this, update the /etc/hosts file of all the machines with the new entry, copy the master’s Erlang cookie value to the new slave, then execute the commands to cluster the slave to the master and verify, as sketched below.
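
As a concrete sketch, adding a hypothetical new node named “slave4” would look like this (run on slave4, after its hostname and /etc/hosts entries are set up and the master’s cookie has been copied over):

service rabbitmq-server start    # if it is not already running
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl cluster rabbit@master    # join_cluster on Rabbit-MQ 3.0+
rabbitmqctl start_app
rabbitmqctl status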

Troubleshooting

If you accidentally update the cookie value before you’ve stopped the service, you could get strange errors the next time you start the rabbitmq-server. If this happens, just issue the following command:

rm -f /var/lib/rabbitmq/mnesia/rabbit/*

This removes the mnesia storage and allows you to restart the rabbitmq-server without errors.
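
A minimal sketch of the full recovery sequence on an affected node (stop the broker first so nothing is holding the mnesia files open):

service rabbitmq-server stop
rm -f /var/lib/rabbitmq/mnesia/rabbit/*
service rabbitmq-server start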

13 Comments

  1. Great doc, thanks for publishing. Do you have a procedure to recover a cluster if your master goes down?

    There will be many messed-up queues and channels, after which the publishers and consumers generate tons of errors. I’d rather not stop the entire cluster and restart it. Any suggestions would be appreciated.

    tks

    1. Hi Rocke,

      I believe as long as mnesia is enabled you should be good. Basically when the queue goes down, it serializes everything to a binlog. When you bring the queue back up the binlog or mnesia file should be read from automatically and reloaded into the queue.

      1. I understand what you’re saying; makes sense. Openstack in this case is the publisher/consumer. I guess the part that worries me is that after the master is back up and in the cluster, I get tons of handshake_timeout, handshake and frame_header errors in the log files. The corresponding errors in the openstack logs show connections reset by peers. So the openstack logs show many message queue errors, but everything manages to keep working. It looks like the services are happy otherwise. I’m running 3.3.5 btw.

        Do you just fire up the rabbitmq server via service/systemctl and do a start_app? Or do you start it in detached mode and then start_app, etc.?

        1. I usually started it via systemctl. I don’t remember specifically which modes were used, this was quite a while back.

  2. Don’t forget to change the ownership and permissions of the .erlang.cookie when copying it to another server:
    chmod 400 /var/lib/rabbitmq/.erlang.cookie && chown rabbitmq: /var/lib/rabbitmq/.erlang.cookie

  3. When using high availability, how can my client direct messages to one of the cluster machines if they all have different IPs? Unless I point the same DNS at all servers in the cluster? Is that right?

    1. Hi Jason,

      Usually when using a high availability architecture, one would designate a DNS entry as the master head for the cluster. For example: master-queue.my-dns.com, then stick all of the other (slave) queues behind a load balancer with a DNS name like slave-queue.my-dns.com . That way you have a single master write head and multiple slave read heads. Hope this helps.

  4. I have followed all the above steps and I have created a cluster with the master and one slave node successfully, without any issue.

    But whenever I try to log in on both systems using the RabbitMQ page, the login fails on both systems.

    Please let me know the possible reasons for this type of symptom, if anyone knows.

    1. Hi Ritesh, I’m not sure I follow you on this. So you have a master and a slave which appear to be working correctly except when you perform what action exactly?

  5. All your master and slaves have the same hostname (user@computer) in commands and it’s confusing. Also you’re probably using root, not user.

    Also, another article mentioned you have to delete the mnesia database on the slaves after setting the new erlang cookie.
    http://agiletesting.blogspot.com/2010/05/rabbitmq-clustering-in-ubuntu.html

    I don’t know if you have to delete the DB, I never have and always managed to get the cluster up, but maybe it would make the process cleaner or it’s part of “best practices” who knows.
