Monday, February 24, 2014

Cassandra modeling with CQL for a simple time series use case

Simple Cassandra modeling example
           



1       Introduction


The document explores different designs for a simple real-world time series use case on Apache Cassandra. Initially this data resided in a different database (SQL Server, Rdbms), but given the volume of the data, we need a more scalable solution. Cassandra was chosen for its scalability, no SPOF feature, and ease of use. We will review our options regarding the data model, using CQL to construct the table.

1.1     Scope and requirements


Devices need to communicate over the Web (devices are wifi-enabled) to their backend cloud infrastructure (IaaS and PaaS) to send status and retrieve updates. There will be about 100k devices connected, and it is important that the architecture be scalable when more devices are added. The rate fluctuates with spikes. Data format is XML.
Need to find the optimal schema for this data, accounting for fast writes and fast retrieval, minimizing the amount of space utilized. 
Cassandra is a good choice for storing this data, but how to proceed with the data schema?

1.2      Data schema & queries

The data that we want to store in Cassandra are of the format below.

1.    Sample data format

Column name
Sample value
Comment
Date
02/05/2013
Day's worth of data
Device_id
As-333-fd-45
unique id for device
timestamp
02:44:45
Time stamp for this date
Event_type
notification
Other data
Message_id
454


Message
M455






2.   Queries

As we know in NoSQL, the design of the tables is driven by the queries of the data rather than by its entities & relationships; i.e., optimizing for reads at write time. Here are the types of queries that we want out of the data:

- Q1 -> Get all records for a particular date
- Q2 -> Get a record for a date and a device_id
- Q3 -> Get a record for a device_id 
- Q4 -> Get all records for a date and a timestamp between x and y

 Lets review the options that we have to construct the table for this data. We will use CQL for this. The main issue has to do with finding the proper Primary Key for our objects.

1.2.1    Option 1: 


The date object, will be the partitioning key; device_id will be the clustering key. The date object will be the same throughout the day's worth of data.
The clustering key (device_id) on the other hand should not be unique. It is used for ordering when querying objects.
This seems like the wrong design altogether.

1.2.2    Option 2: 

Only utilize this as the partitioning key. It is preferable as this is a unique value. The partitioning key is used to distribute the data across nodes, so we want this as random as possible.
Secondary index on date ; required since date is not part of the key.
Secondary index on timestamp

1.2.3    Option 3:

This removes the need for the secondary indexes, and thus is a gain of space. However, it adds constraints as one must using filtering (i.e., scan in memory through a B-tree) for queries on the PK fields.

1.2.4    Option 4:

PRIMARY KEY ((device_id, date), timestamp);
device_id / date will be the partitioning key. timestamp is the clustering key.

 This lets us avoid the 'allow filtering' clause, which slows down the query. However it impedes us from querying on some of the PK fields..

1.3     Summary


Options
Q1
Q2
Q3
Q4
Comments / Pros & Cons
1
 Yes
Yes
Yes with filtering
No
1.     Wide-row if for a partition, there are lot of records
2.     The node will be very hot as everything is going in one row for one day. Lots of contention. Should be avoided
2
Yes
Yes
Yes
Yes with filtering
Q4 will be served with Allow filtering
1.     More space consumption due to secondary indices
3
Yes with allow filtering
Yes
Yes
Yes with filtering
No index. But not very effective to use allow filtering (~ B-tree in memory) , more than secondary index on partitionKey.
4
No, needs rowkey
yes
No, needs partition key
No, needs row key
No need for filtering, but then too restrictive.



The best choice seems to be option 2.

2       Appendix

Definitions

partitionkey = date
rowkey = device_id
range = timestamp.

2.1     Option 1


CREATE TABLE tablestorage6 (
  partitionkey text,
  rowkey text,
  policyid text,
  policyname text,
  range timestamp,
  PRIMARY KEY (partitionkey, rowkey);

CREATE INDEX i1 ON tablestorage6 (range);

cqlsh:demodb> select * from tablestorage6 ;

 partitionkey | rowkey | policyid | policyname | range
--------------+--------+----------+------------+--------------------------
           p1 |     r1 |      pl1 |     plicy1 | 2007-01-02 00:00:00+0000
           p1 |     r2 |      pl2 |     plicy2 | 2007-01-03 00:00:00+0000
           p2 |     r3 |      pl3 |     plicy3 | 2008-01-03 00:00:00+0000

(3 rows)

cqlsh:demodb> select * from tablestorage6 where rowkey = 'r2' allow filtering;

 partitionkey | rowkey | policyid | policyname | range
--------------+--------+----------+------------+--------------------------
           p1 |     r2 |      pl2 |     plicy2 | 2007-01-03 00:00:00+0000

(1 rows)

2.1.1  Option 2


CREATE TABLE tablestorage (
  rowkey text,
  clientipaddress text,
  deviceid text,
  eventtype text,
  grouppolicyid text,
  grouppolicyname text,
  loglevel text,
  macaddress text,
  messageid text,
  model text,
  msgparam text,
  packageid text,
  partitionkey text,
  policyid text,
  policyname text,
  serialnumber text,
  "timestamp" timestamp,
  username text,
  PRIMARY KEY (rowkey)

create index a on tablestorage(partitionkey);

select * from tablestorage where rowkey='726716f7-2e54-4715-9f00-91dcbea999448' and partitionkey = '2' ;

 rowkey                                | clientipaddress | deviceid | eventtype    | grouppolicyid | grouppolicyname | loglevel | macaddress | messageid | model  | msgparam | packageid | partitionkey | policyid | policyname | serialnumber | timestamp                    | username
---------------------------------------+-----------------+----------+--------------+---------------+-----------------+----------+------------+-----------+--------+----------+-----------+--------------+----------+------------+--------------+------------------------------+----------
 726716f7-2e54-4715-9f00-91dcbea999448 |           Host2 |      id2 | CreateSchema |         gpid2 |             gp2 |    debug |       1234 |         5 | model2 |     sgp2 |      pkg2 |            2 |     pol2 |     plnme2 |       seril2 | 20133114-07-02 03:19:39+0000 |    matt8

(1 rows)

cqlsh:demodb> select * from tablestorage where partitionkey = '2' and timestamp > '20131212' allow filtering;

 rowkey                                | clientipaddress | deviceid | eventtype    | grouppolicyid | grouppolicyname | loglevel | macaddress | messageid | model  | msgparam | packageid | partitionkey | policyid | policyname | serialnumber | timestamp                    | username
---------------------------------------+-----------------+----------+--------------+---------------+-----------------+----------+------------+-----------+--------+----------+-----------+--------------+----------+------------+--------------+------------------------------+----------
 726716f7-2e54-4715-9f00-91dcbea999442 |           Host2 |      id2 | CreateSchema |         gpid2 |             gp2 |    debug |       1234 |         5 | model2 |     sgp2 |      pkg2 |            2 |     pol2 |     plnme2 |       seril2 | 20133114-07-02 03:14:27+0000 |    matt2

(1 rows)

2.1.2  Option 3

CREATE TABLE tablestorage10 (
  rowkey text,
  partitionkey text,
  range timestamp,
  policyid text,
  policyname text,
  PRIMARY KEY (rowkey, partitionkey, range)

cqlsh:demodb> select * from tablestorage10;

 rowkey | partitionkey | range                    | policyid | policyname
--------+--------------+--------------------------+----------+------------
     r3 |           p2 | 2008-01-03 00:00:00+0000 |      pl3 |     plicy3
     r2 |           p1 | 2007-01-03 00:00:00+0000 |      pl2 |     plicy2

(2 rows)

Q4: cqlsh:demodb> select * from tablestorage10 where partitionkey= 'p1' and range > '2006-01-01' allow filtering;

 rowkey | partitionkey | range                    | policyid | policyname
--------+--------------+--------------------------+----------+------------
     r2 |           p1 | 2007-01-03 00:00:00+0000 |      pl2 |     plicy2

(1 rows)

cqlsh:demodb>

but
cqlsh:demodb> select * from tablestorage10 where range > '2006-01-01' allow filtering;
Bad Request: PRIMARY KEY part range cannot be restricted (preceding part partitionkey is either not restricted or by a non-EQ relation)
cqlsh:demodb>

cqlsh:demodb> select * from tablestorage10 where partitionkey='p1' and  range > '2006-01-01' allow filtering;

 rowkey | partitionkey | range                    | policyid | policyname
--------+--------------+--------------------------+----------+------------
     r2 |           p1 | 2007-01-03 00:00:00+0000 |      pl2 |     plicy2

(1 rows)

cqlsh:demodb>


cqlsh:demodb> select * from tablestorage10 where rowkey = 'r2' and  range > '2006-01-01' allow filtering;
Bad Request: PRIMARY KEY part range cannot be restricted (preceding part partitionkey is either not restricted or by a non-EQ relation)
cqlsh:demodb>

cqlsh:demodb> select * from tablestorage10 where rowkey = 'r2';

 rowkey | partitionkey | range                    | policyid | policyname
--------+--------------+--------------------------+----------+------------
     r2 |           p1 | 2007-01-03 00:00:00+0000 |      pl2 |     plicy2

(1 rows)

cqlsh:demodb> select * from tablestorage10 where rowkey = 'r2' and partitionkey = 'p1';

 rowkey | partitionkey | range                    | policyid | policyname
--------+--------------+--------------------------+----------+------------
     r2 |           p1 | 2007-01-03 00:00:00+0000 |      pl2 |     plicy2

(1 rows)

cqlsh:demodb> select * from tablestorage10 where partitionkey = 'p1';
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
cqlsh:demodb> select * from tablestorage10 where partitionkey = 'p1' allow filtering;

 rowkey | partitionkey | range                    | policyid | policyname
--------+--------------+--------------------------+----------+------------
     r2 |           p1 | 2007-01-03 00:00:00+0000 |      pl2 |     plicy2

(1 rows)

cqlsh:demodb>

2.1.3  Option 4

CREATE TABLE tablestorage8 (
  partitionkey text,
  rowkey text,
  range timestamp,
  policyid text,
  policyname text,
  PRIMARY KEY ((partitionkey, rowkey), range)

demodb> select * from tablestorage8  where partitionkey='p2';
Bad Request: Partition key part rowkey must be restricted since preceding part is

select * from tablestorage8 where rowkey='r3' and range > '20131212';
Bad Request: partition key part rowkey cannot be restricted (preceding part partitionkey is either not restricted or by a non-EQ relation)
cqlsh:demodb>



0 comments:

Post a Comment

Note: Only a member of this blog may post a comment.