How Cassandra Reads and Persists Data and Maintains Consistency

How Cassandra Persists and Writes Data to Files

  • Cassandra writes changed data (insert, update or delete columns) to a commitlog
    • Cassandra appends changed data to the commitlog
    • The commitlog acts as a crash recovery log for data
    • A write operation is never considered successful until the changed data has been appended to the commitlog
  • Cassandra periodically syncs the write-behind cache for the commitlog to disk every CommitLogSyncPeriodInMS (Default: 1000ms)
    • The sequential write is fast since there is no disk seek time
    • Put commitlog in a separate drive to reduce I/O contention with SSTable reads/writes
      • Not easy to achieve in today's cloud computing offerings
    • Data will not be lost once the commitlog has been flushed out to file
    • Cassandra replays the commitlog after a restart to recover data that might otherwise be lost; at most the changes made within the last sync period (1 second by default) before the crash are at risk
  • Cassandra also writes the data to an in-memory structure called the Memtable
    • The Memtable is an in-memory cache whose content is stored as key/column pairs
    • Memtable data is sorted by key
    • Each ColumnFamily has a separate Memtable, and column data is retrieved by key
  • Flushing: Once Memtable is full, the sorted data is written out sequentially to disk as SSTables (Sorted String Table)
    • Flushed SSTable files are immutable and cannot be changed
    • Numerous SSTables will be created on disk for a column family
    • Later changes to the same key after flushing will be written to a different SSTable
    • A row read therefore requires reading all existing SSTables for a Column Family to locate the latest value
    • SSTables are merged once their number reaches a threshold, to reduce read overhead
  • Each SSTable is composed of 3 files
    • Bloom Filter: For read optimization, it determines whether this SSTable contains the requested key
    • Index: Index the data location by the key
    • Data: The column data

Unlike in a relational database, data changes are not written back to the original data files, so a write involves no data read and no random disk access. Both the commitlog and the SSTables are written to disk sequentially, and each flushed SSTable is a new file. Hence Cassandra writes data very fast.
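
The write path described above can be sketched as a toy model. This is only an illustration, not Cassandra's implementation: the class ToyNode, the JSON file layout, and the memtable_limit parameter (a "full" threshold counted in keys) are all hypothetical names chosen for the example.

```python
import json
import os
import tempfile


class ToyNode:
    """Toy model of the write path: commitlog append, Memtable update, flush."""

    def __init__(self, data_dir, memtable_limit=3):
        self.data_dir = data_dir
        self.memtable_limit = memtable_limit  # hypothetical "full" threshold, in keys
        self.memtable = {}                    # key -> {column: value}
        self.sstables = []                    # paths of flushed, immutable files
        # Opened in append mode; writes are buffered, loosely like a write-behind cache.
        self.commitlog = open(os.path.join(data_dir, "commitlog.log"), "a")

    def write(self, key, column, value):
        # 1. Append the change to the commitlog (sequential write, no seek).
        self.commitlog.write(json.dumps([key, column, value]) + "\n")
        # 2. Apply the change to the in-memory Memtable.
        self.memtable.setdefault(key, {})[column] = value
        # 3. Flush when the Memtable is "full".
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write rows sorted by key to a brand-new file; older files are never modified.
        path = os.path.join(self.data_dir, f"sstable-{len(self.sstables)}.json")
        with open(path, "w") as f:
            for key in sorted(self.memtable):
                f.write(json.dumps([key, self.memtable[key]]) + "\n")
        self.sstables.append(path)
        self.memtable = {}

    def close(self):
        self.commitlog.close()


def demo():
    with tempfile.TemporaryDirectory() as d:
        node = ToyNode(d, memtable_limit=2)
        node.write("bob", "age", 30)
        node.write("alice", "age", 25)   # second key fills the Memtable and triggers a flush
        node.write("bob", "age", 31)     # later change lands in the new Memtable
        with open(node.sstables[0]) as f:
            first_sstable = [json.loads(line) for line in f]
        result = (first_sstable, dict(node.memtable), len(node.sstables))
        node.close()
        return result
```

Calling demo() shows that the first SSTable holds the rows sorted by key, while the later update to "bob" sits in the Memtable and would end up in a different SSTable on the next flush.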

Cassandra Compaction

Compaction merges SSTables

  • Limit the number of SSTables to be read by a row read request
  • Merge changes for the same key on different SSTables: merges keys, combines columns
  • Reclaim space for delete row requests (Discard tombstones)
  • Create a new index for the merged SSTable file
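
A minimal sketch of the merge step, assuming each SSTable is modelled as a dictionary of the form {key: {column: (value, timestamp)}} with timestamps in seconds. The TOMBSTONE marker and the compact function are hypothetical names for illustration; the 864000-second default corresponds to a 10-day grace period.

```python
TOMBSTONE = object()  # hypothetical marker standing in for a deleted column


def compact(sstables, now, gc_grace_seconds=864000):
    """Merge SSTables into one, keeping the newest version of every column.

    Each SSTable is modelled as {key: {column: (value, timestamp)}}.
    """
    merged = {}
    for table in sstables:
        for key, columns in table.items():
            row = merged.setdefault(key, {})
            for column, (value, ts) in columns.items():
                # Keep whichever version carries the most recent timestamp.
                if column not in row or ts > row[column][1]:
                    row[column] = (value, ts)

    result = {}
    for key in sorted(merged):  # the merged table stays sorted by key
        live = {
            column: (value, ts)
            for column, (value, ts) in merged[key].items()
            # Discard tombstones whose grace period has expired.
            if not (value is TOMBSTONE and now - ts > gc_grace_seconds)
        }
        if live:
            result[key] = live
    return result
```

With an old table containing a value and a newer table containing a tombstone for the same column, the column disappears from the merged output only once the grace period has passed.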

Minor Compaction

  • Triggered when at least N SSTables (default: 4) have been flushed to disk
  • Four similar-sized SSTables are merged into a single one
  • Performed regularly and automatically for each column family
    • Compaction starts somewhere between the min and max thresholds on the number of accumulated SSTable files
      • min_compaction_threshold (default is 4)
      • max_compaction_threshold (default is 32)
  • Obsolete SSTables are deleted asynchronously when the JVM performs a GC
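
The threshold rule can be illustrated with a small selection function. This is a sketch under stated assumptions: the function name pick_minor_compaction is hypothetical, and the rule that "similar-sized" means within 50% of a group's average size is an assumption made for the example, not something stated above.

```python
def pick_minor_compaction(sizes, min_threshold=4, max_threshold=32):
    """Pick one group of similar-sized SSTables to merge, or [] if none qualifies."""
    buckets = []  # each bucket holds sizes that are close to one another
    for size in sorted(sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            # Assumption: "similar" means within 50% of the bucket's average size.
            if 0.5 * average <= size <= 1.5 * average:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    for bucket in buckets:
        # Compact only once at least min_threshold files have accumulated,
        # and never merge more than max_threshold files in one pass.
        if len(bucket) >= min_threshold:
            return bucket[:max_threshold]
    return []
```

For example, with sizes 10, 11, 9, 10 and 500, the four small files form one group that reaches the minimum threshold, while the large file is left alone.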

Major Compaction

  • Done manually through nodetool compact for each keyspace
  • Merges all SSTables in a given ColumnFamily

How Cassandra Reads Data

  • To process a key/column read request, Cassandra checks whether the in-memory Memtable cache still contains the data
    • The Memtable is an in-memory read/write cache for each Column Family
  • If not found, Cassandra will read all the SSTables for that Column Family
  • For read optimization,
    • Cassandra uses the Bloom Filter of each SSTable to determine whether that SSTable contains the key
    • Cassandra uses the index in the SSTable to locate the data quickly
    • Cassandra compaction merges SSTables when the number of SSTables reaches a certain threshold. This limits the total number of SSTables for each Column Family
    • Cassandra reads are slower than writes but still very fast
  • Cassandra depends on the OS to cache SSTable files
    • Do not configure Cassandra to use up most of the physical memory
    • Some deployments configure Cassandra to use 50% of the physical memory so the rest can be used for file cache
    • However, memory configuration is sensitive to data access pattern and volume
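
The read path above can be sketched with a tiny Bloom filter. Everything here is a simplified illustration: BloomFilter, ToySSTable and read are hypothetical names, each key maps to a single value, and the list of SSTables is assumed to be ordered from oldest to newest so that the newest hit wins.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=1024, hashes=3):
        self.size = size
        self.hashes = hashes
        self.bits = [False] * size

    def _positions(self, key):
        # Derive several bit positions per key by salting the hash input.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


class ToySSTable:
    def __init__(self, rows):
        self.index = dict(rows)   # stands in for the Index and Data files
        self.bloom = BloomFilter()
        for key in self.index:
            self.bloom.add(key)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None           # definitely absent: skip this SSTable entirely
        return self.index.get(key)


def read(key, memtable, sstables):
    """Check the Memtable first, then the SSTables from newest to oldest."""
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):
        value = table.get(key)
        if value is not None:
            return value
    return None
```

The Bloom filter only saves work: a negative answer lets the read skip an SSTable, while a positive answer still has to be confirmed through the index.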

Cassandra Read Repair

When a query is made against a given key,

  • Cassandra performs a Read Repair
  • Read Repair performs a digest query on all replicas for that key
    • A digest query asks a replica to return a hash digest value and the timestamp for the key's data
    • A digest query verifies whether the replicas possess the same data without sending the data over the network
  • Cassandra pushes the most recent data to any out-of-date replicas to make the queried data consistent again
    • The next query will therefore return consistent data
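
The digest comparison can be sketched as follows, assuming each replica is modelled as a dictionary mapping a key to a (value, timestamp) pair. The function names digest, digest_query and read_repair are hypothetical, and SHA-256 is used only as a convenient hash for the example.

```python
import hashlib


def digest(value):
    """Hash of the stored value, so replicas can be compared without shipping data."""
    return hashlib.sha256(repr(value).encode()).hexdigest()


def digest_query(replica, key):
    """Return (hash digest, timestamp) for the key, or (None, -1) if it is missing."""
    if key not in replica:
        return None, -1
    value, ts = replica[key]
    return digest(value), ts


def read_repair(replicas, key):
    """Bring out-of-date replicas up to date; return how many were repaired."""
    answers = [digest_query(replica, key) for replica in replicas]
    if len(set(answers)) == 1:
        return 0  # every replica reports the same digest and timestamp
    # The replica with the highest timestamp holds the most recent data.
    newest = max(range(len(replicas)), key=lambda i: answers[i][1])
    fresh = replicas[newest][key]
    repaired = 0
    for i, replica in enumerate(replicas):
        if answers[i] != answers[newest]:
            replica[key] = fresh  # push the most recent data to the stale replica
            repaired += 1
    return repaired
```

After one repair pass, a second call returns 0 because all replicas now answer the digest query identically.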

How Cassandra Deletes Data (Cassandra Tombstone)

Since SSTables are immutable, Cassandra cannot simply remove a row from them. Also, if a replica is down when we delete a row, we need a mechanism to identify that its local copy is out of date when it comes back online.

When a user requests a row to be deleted, Cassandra updates the column value with a special value called a Tombstone. A read query treats a tombstoned column as deleted. Cassandra keeps the column for at least gc_grace_seconds (Default: 10 days). After that, it is eligible for removal during a compaction. If a replica is down for longer than gc_grace_seconds, we need to do a manual data sync before bringing it back online.
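
A minimal sketch of why a tombstone is written instead of removing the data, assuming each replica maps a key to a (value, timestamp) pair with timestamps in seconds. The names TOMBSTONE, delete, read, reconcile and purgeable are hypothetical and exist only for this example.

```python
TOMBSTONE = object()                  # hypothetical marker for a deleted column
GC_GRACE_SECONDS = 10 * 24 * 60 * 60  # 10 days, the default mentioned above


def delete(replica, key, now):
    # A delete is just another write: it stores a tombstone with a timestamp.
    replica[key] = (TOMBSTONE, now)


def read(replica, key):
    value, _ = replica.get(key, (TOMBSTONE, 0))
    return None if value is TOMBSTONE else value


def reconcile(a, b):
    """Newest timestamp wins, so a tombstone beats an older live value."""
    return a if a[1] >= b[1] else b


def purgeable(entry, now):
    """A tombstone may be dropped by compaction only after the grace period."""
    value, ts = entry
    return value is TOMBSTONE and now - ts > GC_GRACE_SECONDS
```

If one replica missed the delete, comparing its live value with the tombstone on another replica still resolves to "deleted", which is why the tombstone has to outlive any replica outage.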

Cassandra Consistency

Data is replicated across a cluster for availability. Nevertheless, for a very short interval, some replicas may have the latest data while other replicas are still saving/synchronizing the new value. The consistency gap widens when a dead node is brought back online and its data is not in sync. Cassandra lets each read or write operation request a different level of consistency, at the cost of latency.

A low-consistency request may return an older copy of the data (stale data) but promises a faster response. A higher-consistency request reduces the chance of stale data but has longer latency and is more vulnerable to replica outages. The consistency setting allows developers to decide how many replicas in the cluster must acknowledge a write operation or respond to a read operation before the operation is considered successful. Cassandra also runs a read repair for the queried key to bring the key/column back to consistency. For a low consistency level, read repair runs in the background. Otherwise, it is done before the data is returned.

Cassandra therefore provides eventual consistency rather than the much stricter consistency of a relational database.

Consistency For Cassandra Read Operations

With a replication factor of 3, data is stored on 3 replicas. With the consistency level set to ONE, Cassandra returns the result to the requester after receiving a response from 1 replica. In contrast, QUORUM requires a majority of replicas (2 of 3) to respond before the result is returned to the client. Cassandra uses the timestamp in the column to determine which copy is the latest and returns only that one to the client. Hence, ONE has the fastest response time, but with the possibility that the data is not the most recent.

  • ONE: Returns the response from the first replica that responds. It triggers a read repair in the background to make sure all replicas have the latest data for that key
  • QUORUM: Returns the record with the most recent timestamp once a quorum (more than half) of the replicas has responded. It also triggers a read repair
  • LOCAL_QUORUM: Returns the record with the most recent timestamp once a quorum of replicas in the datacenter local to the coordinator has responded
  • EACH_QUORUM: Returns the record with the most recent timestamp once a quorum of replicas in each datacenter has responded
  • ALL: Queries all replicas and returns the value with the most recent timestamp. If one node does not respond, the whole operation is blocked/failed

LOCAL_QUORUM and EACH_QUORUM require a rack-aware replica placement strategy such as NetworkTopologyStrategy.

ALL provides the highest consistency but is usually not practical since it is vulnerable to a single node outage.
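
The effect of the read consistency level can be illustrated with a short sketch for a single datacenter. The function names are hypothetical, responses are modelled as (value, timestamp) pairs in arrival order, and a quorum is taken to be more than half of the replicas.

```python
def quorum(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def required_responses(level, replication_factor):
    return {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }[level]


def read_at(level, responses, replication_factor):
    """responses: list of (value, timestamp) pairs in the order replicas answered."""
    need = required_responses(level, replication_factor)
    if len(responses) < need:
        raise TimeoutError(f"{level} needs {need} responses, got {len(responses)}")
    # Only the first `need` responses are waited for; the newest timestamp wins.
    considered = responses[:need]
    return max(considered, key=lambda response: response[1])[0]
```

With a replication factor of 3 and a stale replica answering first, a read at ONE returns the stale value, while a read at QUORUM also sees a second response and returns the newer one.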

Consistency For Cassandra Write operations

  • ANY: Guarantees the write operation is successful on at least one node (counting those with Hinted Handoff)
  • ONE: Guarantees the write operation is successful on at least one node, including its commit log and memtable
  • QUORUM: Data has been written to a quorum of nodes
  • LOCAL_QUORUM: A QUORUM of nodes in the data center local to the coordinator has written the data successfully
  • EACH_QUORUM: A QUORUM of nodes in each data center has written the data
  • ALL: Every replica node must successfully write the data. If one replica does not respond, the write operation is blocked/failed
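
The write levels above can be expressed as a single check over the acknowledgements received. This is a sketch with hypothetical names: acks maps each data center to the number of replicas that acknowledged the write, hints is the number of hinted writes stored, and replicas_per_dc gives the replica count in each data center.

```python
def quorum(n):
    """Smallest number of replicas that forms a majority of n."""
    return n // 2 + 1


def write_succeeds(level, acks, hints, replicas_per_dc, local_dc):
    """Decide whether a write meets the requested consistency level."""
    total_replicas = sum(replicas_per_dc.values())
    total_acks = sum(acks.values())
    if level == "ANY":
        return total_acks + hints >= 1      # a stored hint is enough
    if level == "ONE":
        return total_acks >= 1              # hints do not count
    if level == "QUORUM":
        return total_acks >= quorum(total_replicas)
    if level == "LOCAL_QUORUM":
        return acks.get(local_dc, 0) >= quorum(replicas_per_dc[local_dc])
    if level == "EACH_QUORUM":
        return all(acks.get(dc, 0) >= quorum(n) for dc, n in replicas_per_dc.items())
    if level == "ALL":
        return total_acks == total_replicas
    raise ValueError(f"unknown consistency level: {level}")
```

With three replicas in each of two data centers, two local acknowledgements satisfy LOCAL_QUORUM, but QUORUM needs four acknowledgements across the whole cluster.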

HintedHandoff

Cassandra Hinted Handoff is an optimization technique for writing data to replicas.

When a write is made and a replica node for the key is down

  • Cassandra writes a hint to a live replica node
  • That replica node will remind the downed node of the changes once it is back online
    • HintedHandoff reduces write latency when a replica is temporarily down
    • HintedHandoff provides high write availability at the cost of consistency
    • A hinted write does NOT count towards ConsistencyLevel requirements for ONE, QUORUM, or ALL
  • If no replica nodes are alive for this key and ConsistencyLevel.ANY was specified, the coordinating node will write the hint locally
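
The hint mechanism can be sketched with a toy model. The Replica class and the write and deliver_hints functions are hypothetical names for illustration; choosing the first live replica as the hint holder is an assumption of this sketch.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (target replica, key, value) kept on behalf of downed peers


def write(coordinator, replicas, key, value, level="ONE"):
    """Write to live replicas and store hints for downed ones.

    Returns the number of real acknowledgements; hinted writes are not counted.
    """
    live = [replica for replica in replicas if replica.up]
    for replica in live:
        replica.data[key] = value
    if live:
        holder = live[0]          # assumption: the first live replica keeps the hints
    elif level == "ANY":
        holder = coordinator      # no replica is alive: the coordinator keeps the hint
    else:
        holder = None
    if holder is not None:
        for replica in replicas:
            if not replica.up:
                holder.hints.append((replica, key, value))
    return len(live)


def deliver_hints(nodes):
    """Replay stored hints to any target that is back online."""
    for holder in nodes:
        remaining = []
        for target, key, value in holder.hints:
            if target.up:
                target.data[key] = value
            else:
                remaining.append((target, key, value))
        holder.hints = remaining
```

A write with one replica down returns two acknowledgements and leaves one hint behind; once the replica is back and deliver_hints runs, the missed write is applied and the hint is cleared.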