Performance Tuning for Cassandra Write Operations
Optimize Cassandra for Write Operations
- Cassandra write path is very simple and require little tunning
- The biggest performance gain for write is to put commit log in a separate disk drive
- commit log uses sequential write and most hard drive will meet the throughput requirement
- However, if SSTables share the same drive with commit log
- I/O contention between commit log & SSTables may deteriorate commit log writes and SSTable reads
- Unfortunately, current cloud computing service does not provide a real standalone drive
Performance Tuning for Cassandra Read Operations
Tuning Cassandra Concurrent Reads & Concurrent Writes
concurrent_reads (Defaults are 8)
A good rule of thumb is 4 concurrent_reads per processor core. May increase the value for systems with fast I/O storage
concurrent_write (Defaults are 32)
- May not need tuning since write is usually fast
- If needed, increase the value for system with many cores
Contention From Cassandra Compaction
Cassandra Compaction increases I/O contention on SSTable data read
- Writing Cassandra data is usually fast and the write degradation may not be noticeable
- Reading Cassandra data (SSTables) continues to slow down when I/O contention increases
- Can be a major cause of read performance degradation
Monitoring Cassandra Compaction Contention
Monitoring Cassandra Compaction Statistics to Identify problem
- Use nodetool cfstats to monitor the number of SSTables
- If the count is continually growing, there may be I/O contention between reading the SSTable and compacting them
- Read will slow down because data is fragmented across many SSTables and compaction is continually running trying to reduce them
- If pending tasks and bytes total in progress are un-reasonable high, compaction may have I/O contention with the read operation (SSTables)
bin/nodetool -h 10.208.115.203 compactionstats
compaction type: n/a
column family: n/a
bytes compacted: n/a
bytes total in progress: n/a
pending tasks: 0
Remedy for Cassandra Compaction Contention
- Top priority: Reduce or merge application updates/insert requests
- Reduce the frequency of memtable flush
- By increasing the memtable size or preventing too pre-mature flushing
- Less frequent memtable flush results in fewer SSTables files and less compaction
- Fewer compaction reduces SSTables I/O contention, and therefore improves read operations
- Bigger memtables absorb more overwrites for updates to the same keys,
- and therefore accommodating more read/write operations between each flushes
| memtable_flush_after_mins |
The max time to leave a dirty memtable unflushed. Need to be large enough to avoid flushing memtable too prematurely. For production, a larger value such as 1440 is recommended |
| memtable_operations_in_millions |
The max number of columns to store in memory per ColumnFamily before flushing to disk |
| memtable_throughput_in_mb |
The maximum amount of data to store in memory per ColumnFamily before flushing to disk |
- Defaults are:
60 minutes, HeapSize/2**29 * 0.3 millions, and HeapSize/2**23 mb
- If the problem is not frequent or serve, lower the compaction thread priority may lower the I/O contention
- The lower priority may slow down the compaction write but may not help if there are plenty of idle CPU
- Add the following to JVM_OPTS in conf/cassandra-env.sh to lower the compaction priority
JVM_OPTS="$JVM_OPTS -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Dcassandra.compaction.priority=1"
- May need to add nodes or increase drives I/O bandwidth when it is a real I/O capacity problems
However, most problems may be originated from poorly implemented frequent data updates or inserts application codes
Cassandra Memory Cache Tunning
Basic principles
- Do no increase Cassandra cache size unless there is enough physical memory. Avoid memory swapping at any cost
- The available memory for Cassandra is less than most people think
- Cassandra heavily depend on OS file cache to optimize data read
- The real physical memory available for Casandra is
physical memory size - OS required memory - target memory size to cache data file by OS
- other memory used by other applications
Cassandra uses 3 level of caching strategy
| Cache type |
Memory Consumption |
Effectiveness |
If physical memory is scare |
| Row Cache |
High |
Lowest. Much harder to tune and may have opposite effect if the data work set changed |
Lowest return |
| Key Cache |
Low |
Most effective with its smaller memory footprint |
Highest return |
| File OS Cache |
High |
More effective than row cache in handling changes in data work set |
High return |
In the worst case:
- Cassandra read the index file for the row location in the data file
- Cassandra read the data file to load the column data
- The objective of tuning the Cassandra cache is to reduce the total number of disk reads under a limited amount of physical memory
Cassandra read path:
- Check if data can be found in in-memory row cache (Lookup the Column Data by the row key)
- If found, return the row cache data to the client
- If row columns are large, the cached data may take up a lot of memory and the row cache becomes less effective since it can host less keys
- Use row cache for column family that has hot data spot
- However, avoid row cache if data is frequent swapped out from the cache. That will trigger too frequent JVM GC and kill performance
- Use nodetool to monitor row cache hit ratio
- Increase row cache if hit ratio increase significantly with relatively small memory footprint
bin/nodetool setcachecapacity keyspace mycolumn_family key_cache_capacity row_cache_capacity
OR
create column family Standard1
with comparator = BytesType
and keys_cached = 10000
and rows_cached = 1000
- By default rows_cached is set to 0
- Check if key can be found in in-memory key cache (Key caches store the physical location of data for the cached key)
- Read the data location from the in memory key cache if found, otherwise read the information from the index file
- Save one File I/O seek if the key is found in the cache
- Key cache stores the key and data location, and therefore have the least memory footprint
- By default, the key cache is 200000
- With limited memory, key caches provides the biggest performance gain comparing with other cache using the same amount of memory
- Read the column data: if the requested data can be found in the OS file cache, data will be returned from the cache
- OS file cache can adjust to different access pattern better than the row caches
- Setting Cassandra row cache and key cache too high will deplete the OS File cache, and degrade the performance
- Cassandra performance assumes an effective OS file cache and some deployment reserve 50% of physical memory for the OS file cache
- Linux can use up as much remaining physical memory for file cache and release part of it whenever low in memory for other applications
- If the requested data is not in the OS File cache, Cassandra read the data from the file
Cassandra Row Cache Tuning
The row cache holds the entire content of a row in memory. It provides data caching instead of reading data from the disk
Row Cache
- Small row cache may result in low hit ratio
- Large row cache may deplete the physical memory and cause adverse impact on performance
- Use nodetool sfstats to find the average row size
- good if column's data is small so the cache is big enough to hold most of the hotspot data
- bad if column's data is too large so the cache is not big enough to hold most of the hotspot data
- bad for high write/read ratios
- By default, it is off
- If hit ratio is below 30%, row cache should be disabled
To enable row cache
update column family my_column_family with rows_cached=5000;
Use absolute number for rows_cached instead of percentage
To monitor the key cache hit ratio
bin/nodetool -h 10.10.10.1 cfstats
Cassandra Key Cache Tuning
The key cache holds the location of data in memory for each column family:
- Effective if there are hot data spot & cannot use row cache effectively because of the large column size
- Reduce 1 of the 2 disk seek
To monitor the key cache hit ratio
bin/nodetool -h 10.10.10.1 cfstats
Column Family: products
...
Pending Tasks: 0
Key cache capacity: 200000
Key cache size: 3000
Key cache hit rate: NaN
- By default, Cassandra caches 200000 keys per column family
- Use absolute number for keys_cached instead of percentage
To change the key cache
update column family my_column_family with keys_cached=25000;
- keys_cached defines how many key locations will be kept in memory per column family
Cassandra JVM Parameters Tuning
Java Heap Tunable Configuration
| JVM Parameters & Default Values |
Description |
| -Xms${MAX_HEAP_SIZE} |
Min Java Heap Size: Default to half of available physical memory |
| -Xmx${MAX_HEAP_SIZE} |
Max Java Heap Size: Default to half of available physical memory |
| -Xmn${HEAP_NEWSIZE} |
Size of young generation heap (1/4 of Java Heap) |
| -Xss128k |
Max native stack size of a thread |
| -XX:SurvivorRatio=8 |
Young Heap survivor ratio |
- Tune the Heap Size and the young generation heap according to the memory needs and the data access pattern
- Do NOT increase the size without confirming there are enough available physical memory- Always reserves memory for OS FIle cache
Java GC Tunable Configuration
| JVM Parameters & Default Values |
Description |
| -XX:+UseParNewGC |
Use parallel garbage collection in the young generation |
| -XX:+UseConcMarkSweepGC |
Use concurrent garbage collection in the old generation |
| -XX:+CMSParallelRemarkEnabled |
Decrease remark pauses |
| -XX:MaxTenuringThreshold=1 |
How fast an object will move to the old generation |
| -XX:CMSInitiatingOccupancyFraction=75 |
Execute GC when old generation is N% full |
| -XX:+UseCMSInitiatingOccupancyOnly |
Use of the anticipated promotions to start a concurrent collection set |
Java thread Tunable Configuration
| JVM Parameters & Default Values |
Description |
| -XX:+UseThreadPriorities |
Use native thread priority |
| -XX:ThreadPriorityPolicy=42 |
A work around to change thread priority |
Misc Parameters
- Cassandra dynamic snitches (enable by default) avoid reading from slow hosts
casssandra.yaml:
| dynamic_snitch_update_interval_in_ms |
The interval to calculate read latency. Defaults- 100 |
| dynamic_snitch_reset_interval_in_ms |
The interval to reset scores and allow a bad node to recover. Defaults- 60000 |
| dynamic_snitch_badness_threshold |
A performance threshold for dynamically routing requests away from a node. Defaults - 0.0. A value of 0.2 means Cassandra would continue the static snitch until the host was 20% worse than the fastest |
Cassandra Performance Monitoring
Locate the Cassandra bottleneck by examine all the internal pools
- Use nodetool tpstats to dump the length of the pending requests
bin/nodetool -h 10.208.115.203 tpstats
Pool Name Active Pending Completed
ReadStage 0 0 1
RequestResponseStage 0 0 0
MutationStage 0 0 3
ReadRepairStage 0 0 0
GossipStage 0 0 0
AntiEntropyStage 0 0 0
MigrationStage 0 0 0
MemtablePostFlusher 0 0 2
StreamStage 0 0 0
FlushWriter 0 0 2
MiscStage 0 0 0
FlushSorter 0 0 0
InternalResponseStage 0 0 0
HintedHandoff 0 0 0
- Trace down which pool is backlogging requests
- Focus on the pool with high pending count to narrow down the possible area of contentions
Monitoring CPU utilization, Swap activities and Memory usage with vmstat
vmstat - Report memory usage, swap space, I/O activities, interrupts and CPU utilization
% vmstat -S M 5 200
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 61116 113628 6569796 0 0 0 5 10 7 0 0 100 0 0
0 0 0 61108 113628 6569788 0 0 0 7 44 26 0 0 100 0 0
0 0 0 61108 113628 6569796 0 0 0 11 226 138 2 0 97 0 1
- Use Megabyes as unit
- Run every 5 seconds
- For 200 times
How to read/interpret vmstat output
| Column |
Description |
| r |
Number of processes are runnable |
| b |
Number of processes are block |
| swpd |
Virtual memory used |
| free |
Free memory |
| buff |
Memory buffer |
| cache |
Cache buffer |
| si |
Amount of swap in memory from disk (/s) |
| so |
Amount of swap out memory to disk (/s) |
| bi |
block received form a block device like disk (block/s) |
| bo |
block sent to a block device like disk (block/s) |
| in |
Interrupts per second |
| cs |
context switches per second |
| us |
User time |
| sy |
Kernek time |
| id |
Idle time |
| wa |
Time waiting for I/O |
| st |
Time stolen form other virtual machine |
- High activity in the swap columns implies
- Possible miss-configuration of the Java Heap or
- Miss-configuration of the Cassandra caches
- Reduce memory usage or add more physical memory
- High CPU utilization implies
- Poorly designed application code that makes too many requests to the Cassandra
- The system is running out of capacity
- Low CPU utilization with high number of blocking process implies other contentions that may be include
- I/O contention
- Internal Cassandra contention
Monitoring memory utilization with free
Monitoring the memory utilization to identify memory mis-configuration
- Make sure there is enough physical memory for the configured Java Heap and Cassandra caches
- A healthy system should have very low swapping activities
free
total used free shared buffers cached
Mem: 611212 508196 103016 0 90992 236640
-/+ buffers/cache: 180564 430648
Swap: 0 0 0
-/+ buffers/cache row shows the used and free memory respectively if all cached memory is reclaimed
-/+ buffers/cache: 885 6587
- Cassandra highly depends on file cache. Make sure there is enough memory for buffers and cache.
- Some applications may require 40-60% of physical memory allocated for file cache for optimal performance (But this is highly subjective to the data access pattern)
Monitoring I/O activities
Identify high utilization of the disk holding the data files (SSTables)
- High %util on SSTable's drive indicates I/O contentions
% iostat 5 100
Linux (ip-10-10-10-10) 11/06/2010 _x86_64_ (1 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.01 0.23 0.03 99.71
Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn
xdap1 0.06 1.63 0.88 2890754 1558192
xda 0.67 4.89 2.76 8670738 4889288
Possible causes:
- Too many data reads and writes from the application: Application code need refactoring
- Frequent Compaction causing I/O contention: Improper memtable configuration
- I/O capacity problem
How to read/interpret iostat output
| Column |
Description |
| tps |
Number of transfers (I/O request) per second for the device. A transfer is an I/O request |
| Blk_read/s |
Block read per second (Usually 512 bytes per block) |
| Blk_wrtn/s |
Block written per second |
| Blk_read |
Total block read |
| Blk_wrtn |
Total block written |
iostat -m # Display information in Mbytes
Print more detail in Mbytes
% iostat -mx 5 100
Linux (ip-10-10-10-10) 11/06/2010 _x86_64_ (1 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0.02 0.00 0.01 0.23 0.03 99.71
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
xap1 0.00 0.09 0.04 0.02 0.00 0.00 39.35 0.00 31.44 3.12 0.02
xdh 0.00 0.28 0.61 0.07 0.00 0.00 11.34 0.01 8.21 1.74 0.12
| Column |
Description |
| rrqm/s |
The number of merged read request per second. (Logical write can sometime merged and handle together) |
| wrqm/s |
The number of merged write request per second |
| avgqu-sz |
Average queue length of the requests to the device |
| await |
Average wait time in milliseconds for I/O requests |
| svctm |
Average service time (in milliseconds) for I/O requests |
| %util |
CPU utilization during I/O request |
|