RabbitMQ – How to join a node to a cluster when you get the error: incompatible_feature_flags

If you are reading this post, it is because you have received the dreaded incompatible_feature_flags error when trying to join an upgraded node or a newly created node to an existing RabbitMQ cluster.

I will describe the scenario that got me into this situation and the solution I used to resolve it.

Scenario

I have a three-node RabbitMQ cluster running on CentOS. The existing three nodes are running version 3.8.3, but I wanted to upgrade to version 3.8.14. Now, these two RabbitMQ versions require different versions of Erlang, so Erlang must be upgraded from version 22 to version 23.

When BOTH versions of RabbitMQ use the SAME version of Erlang, the RabbitMQ installer will just do a normal upgrade, carrying forward settings such as feature flags and cluster configuration from the previous install.

However, when the version of Erlang needs to be upgraded, you cannot just upgrade RabbitMQ; instead you must:

  • Uninstall RabbitMQ
  • Upgrade Erlang
  • Reinstall RabbitMQ

A new install of RabbitMQ enables ALL feature flags by default.

The documentation does provide a config option called forced_feature_flags_on_init to override the list of enabled feature flags; however, this option only works if it is set BEFORE RabbitMQ is started on the node for the very first time. Once RabbitMQ has already been started on the node, it has no effect.

This behavior is described in the RabbitMQ documentation on how to disable feature flags.

Also if you try to uninstall and reinstall RabbitMQ thinking that will give you a clean slate, it will not. The uninstall does not remove the local RabbitMQ database which is where the enabled feature flag settings are stored.

So basically I am stuck in a scenario where my existing nodes have only a subset of the feature flags enabled, but my freshly reinstalled node has ALL feature flags enabled, including flags the existing nodes do not support.

So when I try to join my upgraded node to the cluster I get the error:

[root@node333 ~]# rabbitmqctl stop_app
Stopping rabbit application on node node333@node333 ...
[root@node333 ~]# rabbitmqctl join_cluster node111@node111
Clustering node node333@node333 with node111@node111
Error:
incompatible_feature_flags

But do not worry, I have a solution that worked!

Solution

The following is the step-by-step solution that worked for me to resolve the incompatible_feature_flags issue.

For this example let’s say we have three nodes:

  • node111
  • node222
  • node333 – This is the node being upgraded

On an Existing Node in the Cluster

Ensure the upgraded node has been removed from the cluster

rabbitmqctl forget_cluster_node <ID of upgraded node>
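For example, using the node names from this scenario, you would run the following from node111 (or node222):

rabbitmqctl forget_cluster_node node333@node333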

Get the list of enabled feature flags

rabbitmqctl list_feature_flags

The output should look like:

[root@node111 ~]# rabbitmqctl list_feature_flags
Listing feature flags ...
name                        state
drop_unroutable_metric      disabled
empty_basic_get_metric      disabled
implicit_default_bindings   enabled
quorum_queue                enabled
virtual_host_metadata       enabled

So out of this list the only feature flags enabled are:

  • implicit_default_bindings
  • quorum_queue
  • virtual_host_metadata

Note: We will need this list later when we configure the set of feature flags to enable on the node being upgraded.

On The Node That Was Upgraded

Uninstall RabbitMQ

yum remove rabbitmq-server-*

Remove the RabbitMQ lib directory

rm -rf /var/lib/rabbitmq

Remove the RabbitMQ config directory

rm -rf /etc/rabbitmq

Reinstall RabbitMQ

yum install rabbitmq-server <Version To Install>
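For example, to install version 3.8.14 (adjust this to match the exact package name/version string used by your yum repository):

yum install rabbitmq-server-3.8.14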

Important: Before we start the new node, we need to update the initial set of feature flags to enable on startup.

Edit the rabbitmq.config file

vi /etc/rabbitmq/rabbitmq.config

Add our list of enabled feature flags as a forced_feature_flags_on_init entry in the config, which should look like this:

{forced_feature_flags_on_init, [quorum_queue, implicit_default_bindings, virtual_host_metadata]}

So when you are done, your rabbitmq.config file should look something like this:

[
  {rabbit, [
    {default_user, <<"guest">>},
    {default_pass, <<"r@bb1t">>},
    {collect_statistics_interval, 10000},
    {forced_feature_flags_on_init, [quorum_queue, implicit_default_bindings, virtual_host_metadata]}
  ]}
].

Start RabbitMQ

systemctl start rabbitmq-server

Verify we have the correct set of feature flags enabled

rabbitmqctl list_feature_flags

The output should look something like:

[root@node333 ~]# rabbitmqctl list_feature_flags
Listing feature flags ...
name                        state
drop_unroutable_metric      disabled
empty_basic_get_metric      disabled
implicit_default_bindings   enabled
maintenance_mode_status     disabled
quorum_queue                enabled
user_limits                 disabled
virtual_host_metadata       enabled

Notice that only the three feature flags we wanted enabled are enabled, so we should now be fine to join our node to the cluster again.

Also, if you check the RabbitMQ management console on the new node, you should see the same set of enabled feature flags.

Join the upgraded node to the cluster

rabbitmqctl stop_app
rabbitmqctl join_cluster <ID of existing node>
rabbitmqctl start_app

If everything is successful, the output should look something like:

[root@node333 rabbitmq]# rabbitmqctl stop_app
Stopping rabbit application on node node333@node333 ...
[root@node333 rabbitmq]# rabbitmqctl join_cluster node111@node111
Clustering node node333@node333 with node111@node111
[root@node333 rabbitmq]# rabbitmqctl start_app
Starting node node333@node333 ...
[root@node333 rabbitmq]#

We have now successfully joined our new node running version 3.8.14 to our cluster of nodes running version 3.8.3.

So when you upgrade the rest of the nodes, make sure to set the forced_feature_flags_on_init entry in rabbitmq.config on each node AFTER upgrading but BEFORE starting it for the first time, and save yourself all this trouble!

I hope that helps!

Cassandra – How to Add a New Node to an Existing Cluster

This article explains the steps for adding a new node to an existing Cassandra cluster using open source Apache Cassandra. If you are using the DataStax Enterprise Edition of Cassandra you can add new nodes using OpsCenter.

Adding a new node to an existing cluster in Apache Cassandra version 3 and higher is fairly easy. When a new node is added to the cluster, Cassandra will automatically adjust the token ranges each node is responsible for, resulting in each node in the cluster storing a smaller subset of the data.

Prepare the New Node

Once you have installed Cassandra on your new node (or better yet, applied your Cassandra Puppet template), there are a few critical items you need to configure before you start the new node.

Verification

Verify Version

Verify that the version of Cassandra you want has been correctly installed. In this case we were using version 3.11.4.

yum list cassandra

You should see the cassandra package listed at the correct version. The output should include a line similar to the following (the repository column will vary with your setup):
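cassandra.noarch                    3.11.4-1                    cassandra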

Verify Cassandra Config Settings

Verify Cluster Name

Important! Ensure that the cluster_name in the cassandra.yaml file is set to EXACTLY the same value as the one used on all of the other nodes in the cluster.

You can verify the cluster name by looking in the cassandra.yaml file:

cat /etc/cassandra/conf/cassandra.yaml | grep cluster_name:

For example, you should see something like this (the cluster name shown is a placeholder for your actual cluster name):
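cluster_name: 'MyCassandraCluster'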

Verify Seed

Important! Ensure the seeds value in the cassandra.yaml file is set to EXACTLY the same value as the one used on all of the other nodes in the cluster.

You can verify the seed node by looking in the cassandra.yaml file:

cat /etc/cassandra/conf/cassandra.yaml | grep seeds:

For example, you should see something like this (the IP addresses shown are placeholders for your actual seed nodes):
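      - seeds: "10.0.0.11,10.0.0.12"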

Disable Incremental Backups

Important! It is recommended to disable incremental backups initially, then enable them once your node has successfully joined the cluster.

If you currently have a reasonably large data set that will be synced to the new node, it is highly recommended to temporarily disable incremental backups.

During the syncing process there will be a lot of file compaction running and having incremental backups enabled will prevent many files from getting compacted until the backups are cleared.

In the cassandra.yaml file, ensure the incremental_backups field is set to false:

  • incremental_backups: false

Start Cassandra on the new Node

Now that Cassandra's configuration file has been updated with the name of the cluster it will join and a seed node in the existing cluster to connect to, the new node will automatically connect to the cluster and start syncing data when you start it.

Verify Cluster Status

Before starting the new node, let’s check the status of all the existing nodes in the cluster by running the nodetool status command from one of the existing nodes:

nodetool status

All of the existing nodes should be in status UN (Up – Normal). The output should look something like this (the addresses, loads, and host IDs shown are made-up placeholders):
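Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address     Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.0.0.11   1.21 GiB   256     33.5%             11111111-1111-1111-1111-111111111111  rack1
UN  10.0.0.12   1.19 GiB   256     33.3%             22222222-2222-2222-2222-222222222222  rack1
UN  10.0.0.13   1.18 GiB   256     33.2%             33333333-3333-3333-3333-333333333333  rack1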

Start Cassandra

Status

Check the status of the Cassandra service on the new node.

systemctl status cassandra

The service should currently be disabled and not running.

Enable the service

systemctl enable cassandra

Start service

systemctl start cassandra

Watch for Errors

Watch to ensure there are no exceptions or errors being reported.

journalctl -f -u cassandra

Monitoring Node Syncing

Status

You can view the status of all nodes by running the following command from any node:

nodetool status

While the node is joining, its status will be:

  • UJ – Up – Joining

Syncing

You can use nodetool to monitor progress of the data syncing:

nodetool netstats | grep Already
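While the node is bootstrapping, the output should contain lines similar to the following (the file counts and byte totals here are illustrative):

Receiving 120 files, 3921487654 bytes total. Already received 45 files, 1234567890 bytes total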

The bytes total should continue to increase.

  • If it stops increasing, the syncing process is stuck.
  • Stop the node, clean out the data, and start it again.

Ready

When the node has finished syncing, its status will be:

  • UN – (Up – Normal)

Note:

Only add ONE new node to the cluster at a time. If one node is joining and you try to add another, the second node will throw an error and not join.

Also, it is important to wait for the cluster to stabilize before adding another node. Check system.log for activity to ensure the cluster has stabilized.

Compactions

Even though the new node is up and receiving traffic, depending on the size of the dataset, it will be busy for a while doing compactions. Allow the server to work through all major pending compactions before doing the first snapshot backup. A snapshot creates hard links to the existing data files, which keeps the disk space of those files from being reclaimed when Cassandra compacts them and removes the originals. So allow the pending compaction queue to empty out before making the first backup and proceeding to the cleanup stage.

You can monitor the pending compactions queue with the nodetool compactionstats command:

nodetool compactionstats

This will show output something like the following (the id, keyspace, table, and numbers here are illustrative):
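pending tasks: 4

id                                     compaction type   keyspace      table      completed    total         unit    progress
c0ffee00-1111-2222-3333-444444444444   Compaction        my_keyspace   my_table   123456789    9876543210    bytes   1.25%
Active compaction remaining time :   2h15m32s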

Notice the “Active compaction remaining time”. When the compactions show a remaining time in hours, they are major compactions. Allow these tasks to complete before making any other changes, such as running backups.

Finally, when all the long-running major compactions are complete, the pending compactions queue should be nearly empty.

It is alright if there are some compactions running or in the queue, but they should mostly all have completion times in minutes, not hours.

Once the compactions have completed, you can move forward with setting backups.

Backups

Now that the node is up and running, you will want to setup two backup steps: snapshots and incremental backups.

Snapshot

A snapshot in Cassandra is a full backup of all tables in the keyspace. You can create a snapshot with the following command:

nodetool snapshot
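For example, you can optionally tag the snapshot and limit it to a single keyspace (the tag and keyspace name here are placeholders):

nodetool snapshot -t before_cleanup my_keyspace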

A folder called "snapshots" is created under each table's data folder, with a subfolder for each snapshot taken. It is up to you to set up a job to copy each of these folders to a backup drive.

Important! If you do not regularly remove snapshots once they have been copied, over time they will create problems for Cassandra, because the disk space used by the older files referenced by those snapshots can never be reclaimed.

Once you have copied your snapshot to another drive, you can run the following command to clear all snapshot folders for all tables:

nodetool clearsnapshot

Incremental Backups

When incremental backups are enabled, Cassandra creates a backup copy (a hard link) of every new SSTable that is flushed to disk. You can use these incremental backups to capture changes between full snapshots.

To enable incremental backups, set the following in your cassandra.yaml file:

  • incremental_backups: true

Incremental backups will be put in a folder called "backups" under each table's folder.

Important! You are responsible for copying these backup files onto a backup drive and removing them from the Cassandra table folders. If you do not periodically remove these files, the disk space used by the SSTables they reference can never be reclaimed, and the files will keep accumulating.

Backup Summary

The snapshot and incremental backup folders are stored under the data directory of each table in a keyspace. For example, assuming the default data directory of /var/lib/cassandra/data (the keyspace name, table name, and table ID below are placeholders), the folders look like this:
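/var/lib/cassandra/data/my_keyspace/my_table-1a2b3c4d5e6f11eab00b000000000000/snapshots/
/var/lib/cassandra/data/my_keyspace/my_table-1a2b3c4d5e6f11eab00b000000000000/backups/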

Repair

We want to make sure that all data for keyspaces with a replication factor greater than one has been replicated properly. So once the new node is up and the cluster is stable, run repair on each node.

Run Repair (One node at a time)

To make sure all data has been replicated properly to each node, run nodetool repair before cleanup:

nodetool repair

Monitor Repair

The first phase of the repair process runs validations, comparing rows on the current node against the other nodes that should hold the same rows.

You can view the progress of the validations being done as part of the repair task by looking for "Validation" compaction types in the queue when using nodetool compactionstats:

nodetool compactionstats

The second phase of the repair process syncs data between nodes where a difference has been found. You can view this activity using the nodetool netstats command:

nodetool netstats

Cleanup

Important! Do not start the cleanup step until you have confirmed that:

  • the new node has finished syncing
  • the new node has joined the cluster
  • all pending major compactions on the new node have completed
  • the first backup has been run
  • repair has been run on every node

When a new node is added to a cluster, Cassandra adjusts the token ranges for which each node is responsible. Once data from existing nodes is shifted to the new node, the other nodes will no longer be responsible for those keys and will no longer serve them in requests.

However, for safety reasons Cassandra does not automatically remove the data from the nodes that were previously responsible for those keys. It leaves this to the administrator to do manually, which makes sense: if anything goes wrong with the new node, no data is lost. The cleanup task will remove any data on a node belonging to token ranges for which the node is no longer responsible.

Run Cleanup (One node at a time)

Once the repair task has completed and there are no more validation tasks running, you can run the nodetool cleanup task to remove all data for the token ranges the node is no longer responsible for:

nodetool cleanup

Monitor Cleanup

You can monitor the progress of cleanup using nodetool compactionstats:

nodetool compactionstats

References

The following is the best system administration book I have read on Apache Cassandra:

Mastering Apache Cassandra 3 (by Aaron Ploetz)

I hope that helps!