Cassandra Primary Key
Cassandra Primary key is the row key and is always index
UserProfile = {
name: "user profile",
Dave Jones: {
email: {name: "email", value: "dave@email.com", timestamp: 125555555},
userName: {name: "userName", value: "Dave", timestamp: 125555555}
},
Paul Simon: {
email: {name: "email", value: "paul@email.com", timestamp: 125555555},
phone: {name: "phone", value: "4155551212", timestamp: 125555555},
userName: {name: "userName", value: "Paul", timestamp: 125555555}
}
}
Row keys
UserProfile = {
Dave Jones: {
...
},
Paul Simon: {
...
}
}
To retrieve data using the row key
Create or Update a Cassandra 0.7 Secondary Index
Create a new index in a Cassandra column
create column family users with comparator = UTF8Type and
column_metadata = [{column_name: userName, validation_class:UTF8Type},
{column_name: email, validation_class:UTF8Type},
{column_name: city, validation_class:UTF8Type},
{column_name: age, validation_class: LongType, index_type: KEYS}];
Update a index in a Cassandra column
update column family users with comparator = UTF8Type and
column_metadata = [{column_name: userName, validation_class:UTF8Type},
{column_name: email, validation_class:UTF8Type},
{column_name: city, validation_class:UTF8Type},
{column_name: age, validation_class: LongType, index_type: KEYS}];
To retrieve data by a secondary indexed column in Cassandra 0.7
get users where age = 30;
Using multiple conditions
get users where age > 30 and email="j@email.com";
At least one index must exist for the queried column(s)
Use OrderPreservingPartitioner paritioner for more effective range query
Cassandra Secondary Index
- Create a secondary index on a Cassandra column (age)
- Available since Cassandra 0.7
- It is called secondary indexes to distinguish them from column families' row keys
- Secondary index are built in the background without blocking other operation
- Cassandra can retrieve data by the indexed column (equal comparison or range search)
Cassandra secondary index is only recommended for columns with low cardinality, i.e. columns that have few unique values
Build a Cassandra Index
A common strategy for custom indexing is to use one column family to store data and one or more column families to serve as indexes for the data
Do not put the index information in the data column family. This make performance tuning for cache extremely difficult
Store the data in a Column Family UserProfile
UserProfile = {
name: "user profile",
dave: {
email: {name: "email", value: "dave@email.com", timestamp: 125555555},
city: {name: "city", value: "San Francisco", timestamp: 125555555},
state: {name: "state", value: "CA", timestamp: 125555555},
screenName: {name: "screenName", value: "dave_dude", timestamp: 125555555},
userName: {name: "userName", value: "Dave", timestamp: 125555555}
},
paul: {
email: {name: "email", value: "paul@email.com", timestamp: 125555555},
city: {name: "city", value: "Chicago", timestamp: 125555555},
phone: {name: "phone", value: "4155551212", timestamp: 125555555},
state: {name: "state", value: "IL", timestamp: 125555555},
screenName: {name: "screenName", value: "paul_mighty", timestamp: 125555555},
userName: {name: "userName", value: "Paul", timestamp: 125555555}
}
}
Create an Column Family holding the index information for the column "city"
UserCityIndex = {
name: "user city index",
San Francisco: {
dave: {name: "dave", value: "Dave Jones", timestamp: 125555555},
tim {name: "tim", value: "Tim Simons", timestamp: 125555555},
},
Chicago: {
...
}
}
Create an index for the column "screenName" using a thin row
UserScreenNameIndex = {
name: "user screen name index",
dave_dude: {dave_dude: {name: "dave_dude", value: "dave", timestamp: 125555555}},
paul_mighty: {paul_mighty: {name: "paul_mighty", value: "paul", timestamp: 125555555},
....
}
- High concurrency in reading the index since the index spread among many nodes
- Client needs to know the index key ahead of the time
- Fast in looking up a single key
Create an index for the column "screenName" using a fat row
UserScreenNameIndex = {
name: "user screen name index",
screen_index: {
dave: {name: "dave", value: "Dave Jones", timestamp: 125555555},
tim {name: "tim", value: "Tim Simons", timestamp: 125555555},
....
}
}
- Less concurrency since the single row is always located in the same node
- Less overhead in accessing many keys
Create a composite index (state, city) using Super Column Family
UserStateCityIndex = {
name: "user state city index",
CA: {
San Francisco: {
name: "San Francisco",
value: {
dave: {name: "dave", value: "Dave Jones", timestamp: 125555555},
tim {name: "tim", value: "Tim Simons", timestamp: 125555555},
}
},
},
IL: {
...
}
}
- Data within the same row are stored in the same node. If Super Column Family is used for indexing data and the Super Column is large, there may be hotspot in the data rather than randomly distributed on the cluster
- Reading a single column in a super column require de-serialize the whole sub-column and may carry too much overhead
If it is problematic, replace the Super Column and use part of the index value as the key to balance the data better
UserStateCityIndex = {
name: "user state city index",
CA-San Francisco-UserC-D: {
dave: {name: "dave", value: "Dave Jones", timestamp: 125555555},
cal: {name: "cal", value: "Cavin Smith", timestamp: 125555555}
}
CA-San Francisco-UserE-G: {
...
}
}
OR use the first index Column Family to point to a separate index Column Family based on the index value
UserStateCityIndex = {
...
San Francisco: {sub_index_name: {name: "sub_index_name", value: "sfUserIndex", timestamp: 125555555}},
....
}
sfUserIndex = {
name: "user state city index",
dave: {dave: {name: "dave", value: "dave", timestamp: 125555555}},
paul: {paul: {name: "paul", value: "paul", timestamp: 125555555}},
...
}
|