Cassandra Data Maintenance, Backup and System Recovery

Cassandra Read Repair

When Cassandra reads a key, it performs read repair automatically to update inconsistent or stale values of that key. If the consistency level is higher than ONE, read repair is performed before a value is returned. Otherwise, read repair runs asynchronously in the background while a potentially stale or inconsistent value is returned.

Cassandra nodetool repair

When to run Cassandra nodetool repair

  • When data loss or a failure is suspected, use nodetool repair to repair a node's data from its replicas
  • Periodically on every node in the cluster, at least once within each GCGraceSeconds period (default 10 days), so that deletions reach all replicas before the deleted rows are garbage collected
  • The operation is CPU and disk intensive. Run it sequentially, one node at a time
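
The one-node-at-a-time rule can be sketched as a small shell function. This is only an illustration: the function name and the NODETOOL override variable are hypothetical, and the node addresses are placeholders. It assumes nodetool is on the PATH and that the repair command does not return until the repair on that node has finished.

```shell
#!/bin/sh
# Hypothetical helper: repair each node in turn, stopping at the first failure.
# NODETOOL may be overridden (for example, for a dry run); it defaults to
# the nodetool found on PATH.
repair_nodes_sequentially() {
    for node in "$@"; do
        echo "Repairing $node"
        # The loop does not move on to the next node until this command returns.
        "${NODETOOL:-nodetool}" -h "$node" repair || {
            echo "Repair failed on $node" >&2
            return 1
        }
    done
}

# Example invocation (addresses are placeholders):
# repair_nodes_sequentially 10.10.10.1 10.10.10.2 10.10.10.3
```

Scheduling this (for example from cron) more often than once per GCGraceSeconds period leaves room for a failed run to be retried.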

Cassandra Data Backup & Recovery

Create and Delete Cassandra Snapshots

Create a Cassandra snapshot for a single node

nodetool -h 10.10.10.1 snapshot thissnapshotname

Create a cluster wide Cassandra snapshot

clustertool -h 10.10.10.1 global_snapshot thissnapshotname
  • The snapshot data is stored in
    /var/lib/cassandra/data/mykeyspace/snapshots/timestamp-thissnapshotname/*.db

To delete all Cassandra snapshots of a node

nodetool -h 10.10.10.1 clearsnapshot

To delete all Cassandra snapshots in a cluster

clustertool -h 10.10.10.1 clear_global_snapshot

Incremental Cassandra Backup

Enable incremental Cassandra backup

conf/cassandra.yaml
incremental_backups: true

When incremental backup is enabled (it is off by default), Cassandra keeps a hard link to each flushed SSTable in a backups directory under

/var/lib/cassandra/data/mykeyspace/backups/

Old incremental backup files must be removed manually; Cassandra does not delete them. Consider removing them after taking a snapshot.

Used together with a snapshot, these incremental backup files let an administrator restore the data on a node when data corruption occurs.
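
Cleaning up old incremental backups after a snapshot could look like the sketch below. The function name is hypothetical, and using the modification time of a reference path (such as the directory of the snapshot just taken) as the cut-off is an assumption of this example, not something Cassandra prescribes.

```shell
#!/bin/sh
# Hypothetical helper: delete incremental backup files that are not newer than
# a reference path, such as the directory of a snapshot that was just taken.
prune_backups_before() {
    backups_dir=$1   # e.g. /var/lib/cassandra/data/mykeyspace/backups
    reference=$2     # e.g. the latest snapshot directory
    [ -d "$backups_dir" ] || { echo "No such directory: $backups_dir" >&2; return 1; }
    [ -e "$reference" ] || { echo "No such reference: $reference" >&2; return 1; }
    # "! -newer" matches files whose modification time is not later than
    # that of the reference; each match is printed and then removed.
    find "$backups_dir" -maxdepth 1 -type f ! -newer "$reference" -print -exec rm -f {} \;
}

# Example invocation (paths are placeholders):
# prune_backups_before /var/lib/cassandra/data/mykeyspace/backups \
#     /var/lib/cassandra/data/mykeyspace/snapshots/1304617358646-mylatestsnapshot
```

Running the find command without the -exec part first shows which files would be deleted.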

Restore Cassandra from Backups

  1. Shut down the node to be restored
  2. Clear the commit log by removing the files under its directory
    rm /var/lib/cassandra/commitlog/*
  3. For every keyspace, remove the db files
    rm /var/lib/cassandra/data/mykeyspace/*.db

    Do not remove the snapshots directory inside it

  4. Locate the latest snapshot directory
    /var/lib/cassandra/data/mykeyspace/snapshots/timestamp-thissnapshotname
  5. Copy the snapshot to the data directory
    cp -p /var/lib/cassandra/data/mykeyspace/snapshots/1304617358646-mylatestsnapshot/* /var/lib/cassandra/data/mykeyspace
  6. Copy the incremental backups to the data directory
    cp -p /var/lib/cassandra/data/mykeyspace/backups/* /var/lib/cassandra/data/mykeyspace
  7. Repeat the above steps for other keyspaces
  8. Restart the node

    The restart can be CPU and I/O intensive because data is compacted during the restoration
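
Steps 3 to 6 above can be sketched for a single keyspace directory as follows. The function name is hypothetical. The sketch assumes the node is already shut down, the commit log has been cleared, and snapshot directories are named timestamp-name as shown above, so that a numeric sort finds the latest one.

```shell
#!/bin/sh
# Hypothetical helper covering steps 3 to 6 for one keyspace directory.
# Assumes the node is shut down and the commit log has already been cleared.
restore_keyspace() {
    ks_dir=$1   # e.g. /var/lib/cassandra/data/mykeyspace
    [ -d "$ks_dir/snapshots" ] || { echo "No snapshots in $ks_dir" >&2; return 1; }

    # Step 4: snapshot directories are named <timestamp>-<name>; a numeric sort
    # on the leading timestamp puts the most recent one last.
    latest=$(ls "$ks_dir/snapshots" | sort -n | tail -n 1)
    if [ -z "$latest" ] || [ -z "$(ls -A "$ks_dir/snapshots/$latest")" ]; then
        echo "No usable snapshot in $ks_dir/snapshots" >&2
        return 1
    fi

    # Step 3: remove the live SSTable files only; subdirectories are untouched.
    rm -f "$ks_dir"/*.db

    # Step 5: copy the snapshot files, preserving timestamps and permissions.
    cp -p "$ks_dir/snapshots/$latest"/* "$ks_dir"/ || return 1

    # Step 6: copy the incremental backups, if there are any.
    if [ -d "$ks_dir/backups" ] && [ -n "$(ls -A "$ks_dir/backups")" ]; then
        cp -p "$ks_dir/backups"/* "$ks_dir"/ || return 1
    fi
    echo "Restored $ks_dir from snapshot $latest"
}

# Example invocation (path is a placeholder):
# restore_keyspace /var/lib/cassandra/data/mykeyspace
```

The snapshot is checked before any live file is removed, so a keyspace without a usable snapshot is left as it was.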

Cassandra Import / Export: sstable2json & json2sstable

sstable2json converts the on-disk SSTable representation of a column family into a JSON formatted document

bin/sstable2json [-f output.json] /var/lib/cassandra/data/my_keyspace/users-f-1-Data.db

json2sstable converts a JSON representation of a column family into a Cassandra SSTable

bin/json2sstable -K my_keyspace -c my_column_family /path/to/output.json /var/lib/cassandra/data/my_keyspace/users-f-1-Data.db
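
Exporting every SSTable of a keyspace could be scripted along these lines. The function name and the SSTABLE2JSON override variable are hypothetical. The sketch assumes sstable2json writes its JSON to standard output when -f is not given, which is what the optional [-f output.json] above suggests.

```shell
#!/bin/sh
# Hypothetical helper: dump every *-Data.db file in a keyspace directory to JSON.
# SSTABLE2JSON may be overridden; it defaults to bin/sstable2json as used above.
export_keyspace_to_json() {
    ks_dir=$1    # e.g. /var/lib/cassandra/data/my_keyspace
    out_dir=$2   # destination directory for the .json files
    mkdir -p "$out_dir" || return 1
    for sstable in "$ks_dir"/*-Data.db; do
        # Skip the unexpanded pattern when the directory holds no SSTables.
        [ -e "$sstable" ] || continue
        name=$(basename "$sstable" .db)
        # The tool's standard output is redirected to one JSON file per SSTable.
        "${SSTABLE2JSON:-bin/sstable2json}" "$sstable" > "$out_dir/$name.json" || return 1
    done
}

# Example invocation (paths are placeholders):
# export_keyspace_to_json /var/lib/cassandra/data/my_keyspace /tmp/my_keyspace-json
```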

sstablekeys

sstablekeys is shorthand for sstable2json with the -e option

  • Dumps only the keys
    bin/sstablekeys /var/lib/cassandra/data/my_keyspace/users-f-1-Data.db