Cassandra Index

Cassandra Primary Key

Cassandra Primary key is the row key and is always index

UserProfile = {
    name: "user profile",
    Dave Jones: {
       email: {name: "email", value: "dave@email.com", timestamp: 125555555},
       userName: {name: "userName", value: "Dave", timestamp: 125555555}
    },
    Paul Simon: {
       email: {name: "email", value: "paul@email.com", timestamp: 125555555},
       phone: {name: "phone", value: "4155551212", timestamp: 125555555},
       userName: {name: "userName", value: "Paul", timestamp: 125555555}
    }
}

Row keys

UserProfile = {
    Dave Jones: {
        ...
    },
    Paul Simon: {
        ...
    }
}

To retrieve data using the row key

get users['Dave Jones'];

Create or Update a Cassandra 0.7 Secondary Index

Create a new index in a Cassandra column

create column family users with comparator = UTF8Type and
   column_metadata = [{column_name: userName, validation_class:UTF8Type},
                      {column_name: email, validation_class:UTF8Type},
                      {column_name: city, validation_class:UTF8Type},
                      {column_name: age, validation_class: LongType, index_type: KEYS}];

Update a index in a Cassandra column

update column family users with comparator = UTF8Type and
   column_metadata = [{column_name: userName, validation_class:UTF8Type},
                      {column_name: email, validation_class:UTF8Type},
                      {column_name: city, validation_class:UTF8Type},
                      {column_name: age, validation_class: LongType, index_type: KEYS}];

To retrieve data by a secondary indexed column in Cassandra 0.7

get users where age = 30;

Using multiple conditions

get users where age > 30 and email="j@email.com";

At least one index must exist for the queried column(s)

Use OrderPreservingPartitioner paritioner for more effective range query

Cassandra Secondary Index

  • Create a secondary index on a Cassandra column (age)
  • Available since Cassandra 0.7
  • It is called secondary indexes to distinguish them from column families' row keys
  • Secondary index are built in the background without blocking other operation
  • Cassandra can retrieve data by the indexed column (equal comparison or range search)
    age = 30
    age > 30
  • Cassandra secondary index is only recommended for columns with low cardinality, i.e. columns that have few unique values

Build a Cassandra Index

A common strategy for custom indexing is to use one column family to store data and one or more column families to serve as indexes for the data

Do not put the index information in the data column family. This make performance tuning for cache extremely difficult

Store the data in a Column Family UserProfile

UserProfile = {
    name: "user profile",
    dave: {
       email: {name: "email", value: "dave@email.com", timestamp: 125555555},
       city: {name: "city", value: "San Francisco", timestamp: 125555555},
       state: {name: "state", value: "CA", timestamp: 125555555},
       screenName: {name: "screenName", value: "dave_dude", timestamp: 125555555},
       userName: {name: "userName", value: "Dave", timestamp: 125555555}
    },
    paul: {
       email: {name: "email", value: "paul@email.com", timestamp: 125555555},
       city: {name: "city", value: "Chicago", timestamp: 125555555},
       phone: {name: "phone", value: "4155551212", timestamp: 125555555},
       state: {name: "state", value: "IL", timestamp: 125555555},
       screenName: {name: "screenName", value: "paul_mighty", timestamp: 125555555},
       userName: {name: "userName", value: "Paul", timestamp: 125555555}
    }
}

Create an Column Family holding the index information for the column "city"

UserCityIndex = {
    name: "user city index",
    San Francisco: {
       dave: {name: "dave", value: "Dave Jones", timestamp: 125555555},
       tim {name: "tim", value: "Tim Simons", timestamp: 125555555},
    },
    Chicago: {
       ...
    }
}

Create an index for the column "screenName" using a thin row

UserScreenNameIndex = {
    name: "user screen name index",
    dave_dude: {dave_dude: {name: "dave_dude", value: "dave", timestamp: 125555555}},
    paul_mighty: {paul_mighty: {name: "paul_mighty", value: "paul", timestamp: 125555555},
       ....
}
  • High concurrency in reading the index since the index spread among many nodes
  • Client needs to know the index key ahead of the time
  • Fast in looking up a single key

Create an index for the column "screenName" using a fat row

UserScreenNameIndex = {
    name: "user screen name index",
    screen_index: {
       dave: {name: "dave", value: "Dave Jones", timestamp: 125555555},
       tim {name: "tim", value: "Tim Simons", timestamp: 125555555},
       ....
    }
}
  • Less concurrency since the single row is always located in the same node
  • Less overhead in accessing many keys

Create a composite index (state, city) using Super Column Family

UserStateCityIndex = {
    name: "user state city index",
    CA: {
       San Francisco: {
           name: "San Francisco",
           value: {
               dave: {name: "dave", value: "Dave Jones", timestamp: 125555555},
               tim {name: "tim", value: "Tim Simons", timestamp: 125555555},
           }
       },
    },
    IL: {
       ...
    }
}
  • Data within the same row are stored in the same node. If Super Column Family is used for indexing data and the Super Column is large, there may be hotspot in the data rather than randomly distributed on the cluster
  • Reading a single column in a super column require de-serialize the whole sub-column and may carry too much overhead

If it is problematic, replace the Super Column and use part of the index value as the key to balance the data better

UserStateCityIndex = {
    name: "user state city index",
    CA-San Francisco-UserC-D: {
               dave: {name: "dave", value: "Dave Jones", timestamp: 125555555},
               cal:  {name: "cal", value: "Cavin Smith", timestamp: 125555555}
           }
    CA-San Francisco-UserE-G: {
              ...
           }
}

OR use the first index Column Family to point to a separate index Column Family based on the index value

UserStateCityIndex = {
        ...
    San Francisco: {sub_index_name: {name: "sub_index_name", value: "sfUserIndex", timestamp: 125555555}},
       ....
}

sfUserIndex = {
    name: "user state city index",
    dave: {dave: {name: "dave", value: "dave", timestamp: 125555555}},
    paul: {paul: {name: "paul", value: "paul", timestamp: 125555555}},
    ...
}