Cassandra Data Model with Example

Cassandra stores key/value data in either a
  • Cassandra column or
  • Cassandra super column

Cassandra Column

JSON Representation of Cassandra Column

{
    name: "userName",
    value: "Dave Jones",
    timestamp: 125555555
}

A Cassandra column is composed of a name/value/timestamp triplet

  • the name, which acts as the key (userName)
  • the value (Dave Jones)
  • the last-updated timestamp
  • name and value are stored as binary data of variable length
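
The triplet above can be sketched as a small Java class. This is only an illustration: the class name ColumnSketch and its methods are hypothetical and not part of any Cassandra client API. The newer() helper reflects the fact that Cassandra uses the timestamp to decide which of two writes to the same column wins.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of any Cassandra client API.
final class ColumnSketch {
    final byte[] name;     // column name, kept as raw bytes
    final byte[] value;    // column value, kept as raw bytes
    final long timestamp;  // last-updated timestamp supplied by the writer

    ColumnSketch(String name, String value, long timestamp) {
        this.name = name.getBytes(StandardCharsets.UTF_8);
        this.value = value.getBytes(StandardCharsets.UTF_8);
        this.timestamp = timestamp;
    }

    // Of two versions of the same column, keep the one written last.
    static ColumnSketch newer(ColumnSketch a, ColumnSketch b) {
        return a.timestamp >= b.timestamp ? a : b;
    }

    String valueAsString() {
        return new String(value, StandardCharsets.UTF_8);
    }
}
```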

Cassandra SuperColumn

JSON Representation of Cassandra SuperColumn

{
    name: "address",
    value: {
        city: {name: "city", value: "San Francisco", timestamp: 125555555},
        street: {name: "street", value: "555 Union Street", timestamp: 125555555},
        zip: {name: "zip", value: "94105", timestamp: 125555555}
    }
}

A SuperColumn is composed of

  • a name (address)
  • a value composed of a map of Cassandra columns
  • A SuperColumn is more like a container for columns (a column that contains a map of columns)
    value: {
        city: {name: "city", value: "San Francisco", timestamp: 125555555},
        street: {...},
        zip: {...}
    }
    • Cassandra reuses the name of each contained column as its key in the map
      city: {name: "city", ...
  • Both the name and the keys of the map are stored as binary data
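
A minimal Java sketch of the same structure, assuming strings instead of raw bytes for readability. SuperColumnSketch and Col are hypothetical names chosen for this example, not Cassandra classes.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration classes; not part of any Cassandra client API.
final class SuperColumnSketch {
    // Minimal stand-in for a column (strings instead of bytes, for readability).
    static final class Col {
        final String name;
        final String value;
        final long timestamp;

        Col(String name, String value, long timestamp) {
            this.name = name;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    final String name;
    // The value of a super column: a map of columns, kept sorted by column name.
    final SortedMap<String, Col> columns = new TreeMap<>();

    SuperColumnSketch(String name) {
        this.name = name;
    }

    // The map key is taken from the column itself, mirroring city: {name: "city", ...}.
    void put(Col column) {
        columns.put(column.name, column);
    }
}
```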

Cassandra Standard ColumnFamily

JSON representation of ColumnFamily

UserProfile = {
    name: "user profile",
    Dave Jones: {
       email: {name: "email", value: "dave@email.com", timestamp: 125555555},
       userName: {name: "userName", value: "Dave", timestamp: 125555555}
    },
    Paul Simon: {
       email: {name: "email", value: "paul@email.com", timestamp: 125555555},
       phone: {name: "phone", value: "4155551212", timestamp: 125555555},
       userName: {name: "userName", value: "Paul", timestamp: 125555555}
    }
}

A standard ColumnFamily

  • is similar to a table in a relational database
  • has a name, stored as binary data
  • contains many rows, each identified by a key (similar to the primary key of a database record)
    UserProfile = {
        Dave Jones: {
           ...
        },
        Paul Simon: {
           ...
        }
    }
  • Each row is composed of one or more columns
    Paul Simon: {
        email: {name: "email", value: "paul@email.com", timestamp: 125555555},
        phone: {name: "phone", value: "4155551212", timestamp: 125555555},
        userName: {name: "userName", value: "Paul", timestamp: 125555555}
    }
  • The key used in the columns' map is the same as the name in the column
    email: {name: "email", ...},
  • Different rows do not have to share the same set of columns
  • Columns may be added to any row at any time
  • Unlike in a relational database, the columns of a ColumnFamily do not need to be declared up front
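
The properties listed above can be sketched in Java as a map of rows, where each row is its own sorted map of columns. ColumnFamilySketch is a hypothetical name, and the sketch uses strings and drops timestamps to stay short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration class; not part of any Cassandra client API.
final class ColumnFamilySketch {
    final String name;
    // row key -> (column name -> value); columns inside a row stay sorted by name
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    ColumnFamilySketch(String name) {
        this.name = name;
    }

    // No schema check: any column can be added to any row at any time.
    void insert(String rowKey, String columnName, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnName, value);
    }

    // Returns the columns of one row, or an empty map if the row does not exist.
    SortedMap<String, String> row(String rowKey) {
        SortedMap<String, String> row = rows.get(rowKey);
        return row == null ? new TreeMap<String, String>() : row;
    }
}
```

Inserting email and userName for "Dave Jones" and email, phone and userName for "Paul Simon" gives two rows with different column sets, as in the UserProfile example.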

Cassandra Super ColumnFamily

JSON representation of Super ColumnFamily

UserContact = {
    name: "user contact",
    Dave Jones: {
       johnAddress: {
           name: "johnAddress",
           value: {
                city: {name: "city", value: "San Francisco", timestamp: 125555555},
                street: {name: "street", value: "555 Union Street", timestamp: 125555555},
                zip: {name: "zip", value: "94105", timestamp: 125555555}
           }
       },
       paulAddress: {
           name: "paulAddress",
           value: {
                city: ... ,
                street: ...,
                zip: ...
           }
       }
    },
    Pete Samsome: {
       ...
    }
}
  • A Super ColumnFamily is similar to a ColumnFamily, except that the inner map is composed of SuperColumns instead of columns
  • SuperColumns allow data to be denormalized. Data that would otherwise be stored separately as a row in another column family can be stored locally, saving one read from that column family
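
As a rough Java sketch, a Super ColumnFamily adds one more level of nesting than a standard ColumnFamily. SuperColumnFamilySketch is a hypothetical name; strings replace raw bytes and timestamps are omitted.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration class; not part of any Cassandra client API.
final class SuperColumnFamilySketch {
    // row key -> (super column name -> (column name -> value))
    private final Map<String, SortedMap<String, SortedMap<String, String>>> rows = new HashMap<>();

    void insert(String rowKey, String superColumn, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(superColumn, k -> new TreeMap<>())
            .put(column, value);
    }

    // Reads one sub-column; returns null if any level of the path is missing.
    String get(String rowKey, String superColumn, String column) {
        SortedMap<String, SortedMap<String, String>> row = rows.get(rowKey);
        if (row == null) {
            return null;
        }
        SortedMap<String, String> columns = row.get(superColumn);
        return columns == null ? null : columns.get(column);
    }
}
```

Here the addresses live in the same row as the contact, so reading "Dave Jones" returns them without a lookup in a separate address column family.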

Super ColumnFamily Limitations

  • Any access to a column inside a super column deserializes all of the columns inside that super column at once
  • The penalty is high if a super column contains many columns and only a small portion of them is accessed
    value: {
        city: {name: "city", value: "San Francisco", timestamp: 125555555},
        street: {name: "street", value: "555 Union Street", timestamp: 125555555},
        zip: {name: "zip", value: "94105", timestamp: 125555555}
    }
  • Not suitable for holding a large number of sub-columns when used as a custom-built index (discussed in the index section)

Cassandra Keyspace

A keyspace is similar to a database in a relational DBMS. A keyspace is composed of the ColumnFamilies used by your application. Typically, there is one keyspace per application.

show keyspaces;
Keyspace: store:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
    Replication Factor: 1
  Column Families:
    ColumnFamily: products
      default_validation_class: org.apache.cassandra.db.marshal.BytesType
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      ...
      Column Metadata:
        Column Name: inventory
          Validation Class: org.apache.cassandra.db.marshal.LongType
        Column Name: skid
          Validation Class: org.apache.cassandra.db.marshal.UTF8Type
    ColumnFamily: users
      default_validation_class: org.apache.cassandra.db.marshal.BytesType
      Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
      Column Metadata:
        Column Name: email
          Validation Class: org.apache.cassandra.db.marshal.UTF8Type
        Column Name: age
          Validation Class: org.apache.cassandra.db.marshal.LongType
          Index Type: KEYS
        Column Name: userName
          Validation Class: org.apache.cassandra.db.marshal.UTF8Type
...

Cassandra Sorting & Validation

A sample Cassandra column family declaration

update column family users with comparator = UTF8Type and
   column_metadata = [{column_name: userName, validation_class:UTF8Type},
                      {column_name: email, validation_class:UTF8Type},
                      {column_name: age, validation_class: LongType, index_type: KEYS}];

Unlike relational databases, Cassandra has very limited querying capability. In Cassandra, denormalization is the norm. A common way of accessing Cassandra data is to create one column family for each expected query.
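
The "one column family per expected query" pattern can be sketched in Java as follows. The names QueryTablesSketch, usersByName and usersByEmail are hypothetical and chosen only for this illustration; plain maps stand in for column families.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration class; plain maps stand in for column families.
final class QueryTablesSketch {
    // One "column family" per expected query, each keyed by what that query knows.
    final Map<String, SortedMap<String, String>> usersByName = new HashMap<>();
    final Map<String, SortedMap<String, String>> usersByEmail = new HashMap<>();

    // Every write goes to both column families; the duplication is deliberate.
    void saveUser(String userName, String email, long age) {
        SortedMap<String, String> columns = new TreeMap<>();
        columns.put("userName", userName);
        columns.put("email", email);
        columns.put("age", Long.toString(age));
        usersByName.put(userName, columns);
        usersByEmail.put(email, new TreeMap<>(columns)); // independent copy
    }
}
```

The trade-off is extra work and storage on write in exchange for a single key lookup on each read path.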

Cassandra data is sorted when it is written to the cluster and remains sorted

  • Sorting boosts read performance when the data model is designed around how the client application retrieves information
  • Columns in a row are sorted by column name, using the column family's comparator
  • SuperColumns and the columns inside a SuperColumn are also sorted by their names
  • Cassandra provides the following comparators for comparing column names
    • BytesType
    • UTF8Type
    • LexicalUUIDType
    • TimeUUIDType
    • AsciiType
    • LongType
  • Use the comparator attribute to set the comparator
    create column family users with comparator = UTF8Type and
       column_metadata = [{column_name: userName, validation_class:UTF8Type},
                          {column_name: email, validation_class:UTF8Type}];
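
To see why the choice of comparator matters, the Java sketch below sorts the same column names under two orderings. It only imitates the idea with java.util.Comparator: AS_TEXT and AS_LONG are hypothetical stand-ins for a text comparator and a numeric comparator, not Cassandra's own classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration class; not part of any Cassandra client API.
final class ComparatorOrderSketch {
    // Stand-in for a text comparator (same as byte order for ASCII names).
    static final Comparator<String> AS_TEXT = Comparator.naturalOrder();
    // Stand-in for a numeric comparator: names are compared as long values.
    static final Comparator<String> AS_LONG =
        Comparator.comparingLong((String s) -> Long.parseLong(s));

    // Inserts the names into a sorted row and returns them in stored order.
    static List<String> sortedNames(Comparator<String> comparator, String... names) {
        TreeMap<String, String> row = new TreeMap<>(comparator);
        for (String name : names) {
            row.put(name, "");
        }
        return new ArrayList<>(row.keySet());
    }
}
```

For the column names "10", "9" and "100", AS_TEXT yields 10, 100, 9, while AS_LONG yields 9, 10, 100, so a range read over the columns returns different results depending on the comparator.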

Before Sorting

UserContact = {
    name: "user contact",
    Pete Samsome: {
       ...
    },
    Dave Jones: {
       paulAddress: {
           name: "paulAddress",
           value: {
                street: ...,
                city: ... ,
                zip: ...
           }
       },
       johnAddress: {
           name: "johnAddress",
           value: {
                street: {name: "street", value: "555 Union Street", timestamp: 125555555},
                city: {name: "city", value: "San Francisco", timestamp: 125555555},
                zip: {name: "zip", value: "94105", timestamp: 125555555}
           }
       }
    }
}

After sorting in Cassandra

UserContact = {
    name: "user contact",
    Dave Jones: {
       johnAddress: {
           name: "johnAddress",
           value: {
                city: {name: "city", value: "San Francisco", timestamp: 125555555},
                street: {name: "street", value: "555 Union Street", timestamp: 125555555},
                zip: {name: "zip", value: "94105", timestamp: 125555555}
           }
       },
       paulAddress: {
           name: "paulAddress",
           value: {
                city: ... ,
                street: ...,
                zip: ...
           }
       }
    },
    Pete Samsome: {
       ...
    }
}

Custom Cassandra Comparator

Users can build a custom Cassandra comparator. A good sample comparator is

org/apache/cassandra/db/marshal/LongType.java
package org.apache.cassandra.db.marshal;
/*
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *   http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing,
 * software distributed under the License is distributed on an
 * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 * KIND, either express or implied.  See the License for the
 * specific language governing permissions and limitations
 * under the License.
 *
 */


import java.nio.ByteBuffer;

import org.apache.cassandra.utils.ByteBufferUtil;

public class LongType extends AbstractType<Long>
{
    public static final LongType instance = new LongType();

    LongType() {} // singleton

    public Long compose(ByteBuffer bytes)
    {
        return ByteBufferUtil.toLong(bytes);
    }

    public ByteBuffer decompose(Long value)
    {
        return ByteBufferUtil.bytes(value);
    }

    public int compare(ByteBuffer o1, ByteBuffer o2)
    {
        if (o1.remaining() == 0)
        {
            return o2.remaining() == 0 ? 0 : -1;
        }
        if (o2.remaining() == 0)
        {
            return 1;
        }

        int diff = o1.get(o1.position()) - o2.get(o2.position());
        if (diff != 0)
            return diff;


        return ByteBufferUtil.compareUnsigned(o1, o2);
    }

    public String getString(ByteBuffer bytes)
    {
        if (bytes.remaining() == 0)
        {
            return "";
        }
        if (bytes.remaining() != 8)
        {
            throw new MarshalException("A long is exactly 8 bytes: "+bytes.remaining());
        }

        return String.valueOf(bytes.getLong(bytes.position()));
    }

    public String toString(Long l)
    {
        return l.toString();
    }

    public ByteBuffer fromString(String source) throws MarshalException
    {
        // Return an empty ByteBuffer for an empty string.
        if (source.isEmpty())
            return ByteBufferUtil.EMPTY_BYTE_BUFFER;

        long longType;

        try
        {
            longType = Long.parseLong(source);
        }
        catch (Exception e)
        {
            throw new MarshalException(String.format("unable to make long from '%s'", source), e);
        }

        return decompose(longType);
    }

    public void validate(ByteBuffer bytes) throws MarshalException
    {
        if (bytes.remaining() != 8 && bytes.remaining() != 0)
            throw new MarshalException(String.format("Expected 8 or 0 byte long (%d)", bytes.remaining()));
    }

    public Class<Long> getType()
    {
        return Long.class;
    }
}
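
The comparison logic of a custom comparator can be tried out in isolation before wiring it into Cassandra. The sketch below is a standalone java.util.Comparator over ByteBuffer that orders 8-byte longs in descending order. ReversedLongComparatorSketch is a hypothetical name; a real comparator would extend AbstractType as LongType does above and implement its other methods as well.

```java
import java.nio.ByteBuffer;
import java.util.Comparator;

// Hypothetical standalone sketch; a real comparator would extend AbstractType.
final class ReversedLongComparatorSketch implements Comparator<ByteBuffer> {
    @Override
    public int compare(ByteBuffer o1, ByteBuffer o2) {
        // Empty buffers sort first, following the convention in LongType.compare.
        if (o1.remaining() == 0) {
            return o2.remaining() == 0 ? 0 : -1;
        }
        if (o2.remaining() == 0) {
            return 1;
        }
        // Absolute reads: the buffers' positions are left untouched.
        // Assumes both buffers hold exactly 8 bytes (already validated).
        long a = o1.getLong(o1.position());
        long b = o2.getLong(o2.position());
        return Long.compare(b, a); // reversed: larger values sort first
    }

    // Helper for experiments: encodes a long as an 8-byte buffer.
    static ByteBuffer bytes(long value) {
        ByteBuffer buffer = ByteBuffer.allocate(8);
        buffer.putLong(0, value); // absolute put keeps position at 0
        return buffer;
    }
}
```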

Cassandra Validation

A validation class can be set for

  • each column family or
  • each column
  • A column-level validator takes precedence over the column family's default validator
  • Custom validation classes are supported
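
The precedence rule can be sketched as a lookup with a fallback. ValidatorLookupSketch is a hypothetical class written for this illustration; it only models which validator name applies to a column and does not perform any validation itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; models validator selection only.
final class ValidatorLookupSketch {
    private final String defaultValidationClass;                        // column family level
    private final Map<String, String> columnValidators = new HashMap<>(); // column level

    ValidatorLookupSketch(String defaultValidationClass) {
        this.defaultValidationClass = defaultValidationClass;
    }

    void setColumnValidator(String columnName, String validationClass) {
        columnValidators.put(columnName, validationClass);
    }

    // The column-level validator wins; otherwise fall back to the default.
    String validatorFor(String columnName) {
        String validator = columnValidators.get(columnName);
        return validator != null ? validator : defaultValidationClass;
    }
}
```

With a default of BytesType and a column-level LongType on age, as in the users column family shown earlier, age resolves to LongType while an undeclared column resolves to BytesType.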