Cassandra Cluster and Replication

Rows in a Column Family store distributively across a cluster. Which node is used to store a row is determined by mapping its key to a token value determined by a partitioner

  • Developer can pick different partitioner for each column family for different mapping

Cassandar use a Distributed Hash Table (DHT) algorithm in determining where data with certain token value is stored in which node. Each Cassandra server are configured with a token with its initial_token value set to

i*(2**127)/number_of_nodes with i starts from 0 to (number_of_nodes - 1)
  • The 2 to power of 127 represents the 128bit DHT namespace used by Cassandar

Server token values are sorted and wrap around to form a ring. Each server is responsible for storing rows with token value range from the previous server token value to the server own token value. For example, the second server on the ring is responsible for token value from 0 to (2**127)/number_of_nodes

For example a 4 node cluster

Key Map to Server 0 init_token Range responsible
    0 0 2**127/4 * 3 .. 0
"somekey" 45678845 1 2**127/4 0 .. 2**127/4
    2 2**127/4 * 2 2**127/4 .. 2**127/4 * 2
    3 2**127/4 * 3 ...

Cassandar Partitioner

  • RandomPartitioner: Takes the MD5 hash value of a key and uses this as the token value. It evenly distributes rows around the entire ring
  • OrderPreservingPartitioner: Use the key (a byte array) as the token value. Good for queries on a range of values. However, server tokens may need constant adjustment to spread data hotspots across multiple nodes. This partitioner only supports keys with UTF-8 content
  • ByteOrderedPartitioner: Similar to OrderPreservingPartitioner but the key does not need to contain UTF-8 content
  • CollatingOrderPreservingPartitioner: Similar to OrderPreservingPartitioner, but compares keys using EN,US rules

For range query, use Order-preserving partitioning as the partitioner for the Column Family

Cassandar Replication

Cassandra Replication is controlled by the replication_factor. The default value of 1 means that data is written to only one node in the ring (no duplicated copy)

  • To define the replication factor
    create keyspace Keyspace1
        with replication_factor = 1
        and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';
  • Replication factor reflects the total number of replicas
  • A value of one means storing data in one node only (No duplicate copy)
  • A value higher than one provides data redundancy for failover
  • A value of 3 means data are committed to 3 separate nodes
  • A value of 3 also means each node now acts as a replica for 3 separate token ranges

Change the Cassandar Replication Factor

To Increase the ReplicationFactor for a cluster

  • For each server in the cluster:
    • Edit the configuration to contain the desired replication_factor
    • Restart Cassandra
    • Run nodetool repair against each keyspace individually

      nodetool can be very intensive. Should wait until a node is done before starting on another node

Cassandar Placement Strategies

The replica placement strategy determines which other nodes are picked as replicas besides the one chosen by the token

Replica Placement Strategies

  • SimpleStrategy (Default): Returns the nodes that are next to each other on the ring
  • NetworkTopologyStrategy: Configure the number of replicas per data center as specified in the strategy_options
    • Placement strategy for cluster running in multiple data centers
    • Update the replication factor used for two different data center DC1 and DC2
      • If the second data center is for load balancing based on geographical location
        update keyspace demo with strategy_options=[{DC1:3,  DC2:3}];
        • 3 replicas in DC1 and 3 replicas in DC2
      • If the second data center is only for disaster recover
        update keyspace demo with strategy_options=[{DC1:3,  DC2:3}];
  • OldNetworkTopologyStrategy: Places one replica in one data center while the rest on different racks in the current data center

To define the placement strategy

create keyspace Keyspace1
    with replication_factor = 1
    and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy';

Cassandar Snitches

Snitches give Cassandra hints on how to route inter-node communication more effectively

  • SimpleSnitch (Default)
    • Full class name: org.apache.cassandra.locator.SimpleSnitch)
    • Used this if all nodes located in a single data center
  • RackInferringSnitch
    • Full class name: org.apache.cassandra.locator.RackInferringSnitch
    • Automatic determine the network topolology by analyzing the IP addresses
    • Assumes the second octet identifies the data center and the third octet identifies the rack
  • PropertyFileSnitch
    • Full class name: org.apache.cassandra.locator.PropertyFileSnitch
    • Store the network description in a property file
      # Cassandra Node IP=Data Center:Rack
      # default for unknown nodes

Use RackInferringSnitch or PropertyFileSnitch for Cassandar cluster deployment on multiple data centers. It provides hints to Cassandar on where to replicate data and reduce inter-node communication latency

To configure the snitch for Cassandra

  • vi conf/cassandra.yaml
  • Change endpont_snitch
    endpoint_snitch: org.apache.cassandra.locator.RackInferringSnitch

With NetworkTopologyStrategy or OldNetworkTopologyStrategy placement strategy, one of the rack-aware snitches (RackInferringSnitch or PropertyFileSnitch) must be used

By default, snitches are wrapped in a dynamic snitch that monitors read latency. When needed, Cassendra routes requests away from poorly-performing nodes

Add new nodes to Cassandra Cluster (Cassandra Bootstrap)

  1. Install and configure every new nodes according to previous chapter on installation
  2. Enable auto bootstrap for all non-seed nodes
    auto_bootstrap: true
  3. Re-calculate and change all the initial_token values for all new nodes
  4. Start the new nodes sequentially. Use nodetool to monitor the startup is completed before starting the next one
    nodetool netstats
  5. For each existing node, run nodetool to change the token
    nodetool move <new_server_token>
  6. For each existing node, run nodetool to remove data that is migrated to other nodes
    nodetool cleanup

This operation in partiucalar the cleanup is very intense. Arrange the operation to be done in off hours. It may even takes hours to complete

Wait long enough for all the nodes in your cluster to become aware of the bootstrapping node before starting another one. The new node will log "Bootstrapping" when this is safe

Replacing Dead Nodes in a Cassandra Cluster

Replace a dead instance with a new instance using the same IP

  1. Bring down the dead node server
  2. Build a new server using the same IP
  3. Start the new server
  4. Read from the new server will resume when auto bootstrap completes replicating data to the new server

Replace a dead instance with a new instance with a new IP

  1. Build a new server using a different IP but the same initial_token
  2. Start the new server
  3. Remove the dead server token
    nodetool -h removetoken <token_value>

    Node information can be retrieved from

    nodetool -h ring
  4. Bring the bad server down
  5. Repair every keyspace for the node next to the bad server
    nodetool -h repair Keyspace1