Monday, February 24, 2014

Cassandra modeling with CQL for a simple time series use case

Simple Cassandra modeling example
           



1       Introduction


The document explores different designs for a simple real-world time series use case on Apache Cassandra. Initially this data resided in a different database (SQL Server, Rdbms), but given the volume of the data, we need a more scalable solution. Cassandra was chosen for its scalability, no SPOF feature, and ease of use. We will review our options regarding the data model, using CQL to construct the table.

1.1     Scope and requirements


Devices need to communicate over the Web (devices are wifi-enabled) to their backend cloud infrastructure (IaaS and PaaS) to send status and retrieve updates. There will be about 100k devices connected, and it is important that the architecture be scalable when more devices are added. The rate fluctuates with spikes. Data format is XML.
Need to find the optimal schema for this data, accounting for fast writes and fast retrieval, minimizing the amount of space utilized. 
Cassandra is a good choice for storing this data, but how to proceed with the data schema?

1.2      Data schema & queries

The data that we want to store in Cassandra are of the format below.

1.    Sample data format

Column name
Sample value
Comment
Date
02/05/2013
Day's worth of data
Device_id
As-333-fd-45
unique id for device
timestamp
02:44:45
Time stamp for this date
Event_type
notification
Other data
Message_id
454


Message
M455






2.   Queries

As we know in NoSQL, the design of the tables is driven by the queries of the data rather than by its entities & relationships; i.e., optimizing for reads at write time. Here are the types of queries that we want out of the data:

- Q1 -> Get all records for a particular date
- Q2 -> Get a record for a date and a device_id
- Q3 -> Get a record for a device_id 
- Q4 -> Get all records for a date and a timestamp between x and y

 Lets review the options that we have to construct the table for this data. We will use CQL for this. The main issue has to do with finding the proper Primary Key for our objects.

1.2.1    Option 1: 


The date object, will be the partitioning key; device_id will be the clustering key. The date object will be the same throughout the day's worth of data.
The clustering key (device_id) on the other hand should not be unique. It is used for ordering when querying objects.
This seems like the wrong design altogether.

1.2.2    Option 2: 

Only utilize this as the partitioning key. It is preferable as this is a unique value. The partitioning key is used to distribute the data across nodes, so we want this as random as possible.
Secondary index on date ; required since date is not part of the key.
Secondary index on timestamp

1.2.3    Option 3:

This removes the need for the secondary indexes, and thus is a gain of space. However, it adds constraints as one must using filtering (i.e., scan in memory through a B-tree) for queries on the PK fields.

1.2.4    Option 4:

PRIMARY KEY ((device_id, date), timestamp);
device_id / date will be the partitioning key. timestamp is the clustering key.

 This lets us avoid the 'allow filtering' clause, which slows down the query. However it impedes us from querying on some of the PK fields..

1.3     Summary


Options
Q1
Q2
Q3
Q4
Comments / Pros & Cons
1
 Yes
Yes
Yes with filtering
No
1.     Wide-row if for a partition, there are lot of records
2.     The node will be very hot as everything is going in one row for one day. Lots of contention. Should be avoided
2
Yes
Yes
Yes
Yes with filtering
Q4 will be served with Allow filtering
1.     More space consumption due to secondary indices
3
Yes with allow filtering
Yes
Yes
Yes with filtering
No index. But not very effective to use allow filtering (~ B-tree in memory) , more than secondary index on partitionKey.
4
No, needs rowkey
yes
No, needs partition key
No, needs row key
No need for filtering, but then too restrictive.



The best choice seems to be option 2.

2       Appendix

Definitions

partitionkey = date
rowkey = device_id
range = timestamp.

2.1     Option 1


CREATE TABLE tablestorage6 (
  partitionkey text,
  rowkey text,
  policyid text,
  policyname text,
  range timestamp,
  PRIMARY KEY (partitionkey, rowkey);

CREATE INDEX i1 ON tablestorage6 (range);

cqlsh:demodb> select * from tablestorage6 ;

 partitionkey | rowkey | policyid | policyname | range
--------------+--------+----------+------------+--------------------------
           p1 |     r1 |      pl1 |     plicy1 | 2007-01-02 00:00:00+0000
           p1 |     r2 |      pl2 |     plicy2 | 2007-01-03 00:00:00+0000
           p2 |     r3 |      pl3 |     plicy3 | 2008-01-03 00:00:00+0000

(3 rows)

cqlsh:demodb> select * from tablestorage6 where rowkey = 'r2' allow filtering;

 partitionkey | rowkey | policyid | policyname | range
--------------+--------+----------+------------+--------------------------
           p1 |     r2 |      pl2 |     plicy2 | 2007-01-03 00:00:00+0000

(1 rows)

2.1.1  Option 2


CREATE TABLE tablestorage (
  rowkey text,
  clientipaddress text,
  deviceid text,
  eventtype text,
  grouppolicyid text,
  grouppolicyname text,
  loglevel text,
  macaddress text,
  messageid text,
  model text,
  msgparam text,
  packageid text,
  partitionkey text,
  policyid text,
  policyname text,
  serialnumber text,
  "timestamp" timestamp,
  username text,
  PRIMARY KEY (rowkey)

create index a on tablestorage(partitionkey);

select * from tablestorage where rowkey='726716f7-2e54-4715-9f00-91dcbea999448' and partitionkey = '2' ;

 rowkey                                | clientipaddress | deviceid | eventtype    | grouppolicyid | grouppolicyname | loglevel | macaddress | messageid | model  | msgparam | packageid | partitionkey | policyid | policyname | serialnumber | timestamp                    | username
---------------------------------------+-----------------+----------+--------------+---------------+-----------------+----------+------------+-----------+--------+----------+-----------+--------------+----------+------------+--------------+------------------------------+----------
 726716f7-2e54-4715-9f00-91dcbea999448 |           Host2 |      id2 | CreateSchema |         gpid2 |             gp2 |    debug |       1234 |         5 | model2 |     sgp2 |      pkg2 |            2 |     pol2 |     plnme2 |       seril2 | 20133114-07-02 03:19:39+0000 |    matt8

(1 rows)

cqlsh:demodb> select * from tablestorage where partitionkey = '2' and timestamp > '20131212' allow filtering;

 rowkey                                | clientipaddress | deviceid | eventtype    | grouppolicyid | grouppolicyname | loglevel | macaddress | messageid | model  | msgparam | packageid | partitionkey | policyid | policyname | serialnumber | timestamp                    | username
---------------------------------------+-----------------+----------+--------------+---------------+-----------------+----------+------------+-----------+--------+----------+-----------+--------------+----------+------------+--------------+------------------------------+----------
 726716f7-2e54-4715-9f00-91dcbea999442 |           Host2 |      id2 | CreateSchema |         gpid2 |             gp2 |    debug |       1234 |         5 | model2 |     sgp2 |      pkg2 |            2 |     pol2 |     plnme2 |       seril2 | 20133114-07-02 03:14:27+0000 |    matt2

(1 rows)

2.1.2  Option 3

CREATE TABLE tablestorage10 (
  rowkey text,
  partitionkey text,
  range timestamp,
  policyid text,
  policyname text,
  PRIMARY KEY (rowkey, partitionkey, range)

cqlsh:demodb> select * from tablestorage10;

 rowkey | partitionkey | range                    | policyid | policyname
--------+--------------+--------------------------+----------+------------
     r3 |           p2 | 2008-01-03 00:00:00+0000 |      pl3 |     plicy3
     r2 |           p1 | 2007-01-03 00:00:00+0000 |      pl2 |     plicy2

(2 rows)

Q4: cqlsh:demodb> select * from tablestorage10 where partitionkey= 'p1' and range > '2006-01-01' allow filtering;

 rowkey | partitionkey | range                    | policyid | policyname
--------+--------------+--------------------------+----------+------------
     r2 |           p1 | 2007-01-03 00:00:00+0000 |      pl2 |     plicy2

(1 rows)

cqlsh:demodb>

but
cqlsh:demodb> select * from tablestorage10 where range > '2006-01-01' allow filtering;
Bad Request: PRIMARY KEY part range cannot be restricted (preceding part partitionkey is either not restricted or by a non-EQ relation)
cqlsh:demodb>

cqlsh:demodb> select * from tablestorage10 where partitionkey='p1' and  range > '2006-01-01' allow filtering;

 rowkey | partitionkey | range                    | policyid | policyname
--------+--------------+--------------------------+----------+------------
     r2 |           p1 | 2007-01-03 00:00:00+0000 |      pl2 |     plicy2

(1 rows)

cqlsh:demodb>


cqlsh:demodb> select * from tablestorage10 where rowkey = 'r2' and  range > '2006-01-01' allow filtering;
Bad Request: PRIMARY KEY part range cannot be restricted (preceding part partitionkey is either not restricted or by a non-EQ relation)
cqlsh:demodb>

cqlsh:demodb> select * from tablestorage10 where rowkey = 'r2';

 rowkey | partitionkey | range                    | policyid | policyname
--------+--------------+--------------------------+----------+------------
     r2 |           p1 | 2007-01-03 00:00:00+0000 |      pl2 |     plicy2

(1 rows)

cqlsh:demodb> select * from tablestorage10 where rowkey = 'r2' and partitionkey = 'p1';

 rowkey | partitionkey | range                    | policyid | policyname
--------+--------------+--------------------------+----------+------------
     r2 |           p1 | 2007-01-03 00:00:00+0000 |      pl2 |     plicy2

(1 rows)

cqlsh:demodb> select * from tablestorage10 where partitionkey = 'p1';
Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING
cqlsh:demodb> select * from tablestorage10 where partitionkey = 'p1' allow filtering;

 rowkey | partitionkey | range                    | policyid | policyname
--------+--------------+--------------------------+----------+------------
     r2 |           p1 | 2007-01-03 00:00:00+0000 |      pl2 |     plicy2

(1 rows)

cqlsh:demodb>

2.1.3  Option 4

CREATE TABLE tablestorage8 (
  partitionkey text,
  rowkey text,
  range timestamp,
  policyid text,
  policyname text,
  PRIMARY KEY ((partitionkey, rowkey), range)

demodb> select * from tablestorage8  where partitionkey='p2';
Bad Request: Partition key part rowkey must be restricted since preceding part is

select * from tablestorage8 where rowkey='r3' and range > '20131212';
Bad Request: partition key part rowkey cannot be restricted (preceding part partitionkey is either not restricted or by a non-EQ relation)
cqlsh:demodb>



1 comment:

  1. thuốc bổ xương khớp của mỹ 1500mg nếu như bạn đã biết đến sản phẩm này thì ngày hôm nay với một công thức mới đột phá trong một sản phẩm mới bổ sung Glucosamine lên đến 2000mggiúp xương của bạn trở nên chắc khỏe hơn bao giờ hết.
    thuốc uống làm trắng da ivory caps pillsIvory Caps được chiết xuất từ các hợp chất hoàn toàn tự nhiên với công thức tăng gấp ba lần Glutathione Complex 1500mg, cung cấp các dưỡng chất cần thiết sẽ làm trắng da từ bên trong
    bo xuong khop4 loại thuốc bổ xương khớp Glucosamine Chondroitin Msm của Mỹ lại được nhiều người tìm mua đến như vậy?
    glucosamine cua myBạn có biết rằng bệnh xương khớp đang là một trong những bệnh đang khiến thế giới phải lo lắng vì những biến chứng của nó vô cùng nguy hiểm có thể gây nên tàn phế, suy giảm sức lao động của con người
    thuoc bo khopVì thế mà việc tìm ra một loại thuốc chống lại bệnh xương khớp thôi thúc các nhà nghiên cứu y học nhằm giúp bệnh nhân khớp phục hồi lại được sức khỏe.
    thuoc bo khop cua myTrong nhiều thập niên qua, việc điều trị viêm khớp chủ yếu là dùng các thuốc kháng viêm giảm đau nhằm mục đích giảm các triệu chứng đau
    thuoc bo khop glucosamin cua myTuy nhiên những thuốc này có rất nhiều tác dụng phụ và thường không cải thiện được tình trạng bệnh lý của sụn khớp bị hư hỏng
    thuoc bo xuongVì vậy tác dụng của nhóm thuốc này chủ yếu chỉ nhằm giảm bớt hiện tượng viêm, cắt cơn đau mà thôi.
    thuoc bo xuong cua myÐến vài năm gần đây, người ta đã tìm ra một số loại thuốc tương tác lên bệnh lý của sụn khớp một cách hiệu quả, trong đó có Glucosamine
    thuoc bo xuong glucosamineThực tế lâm sàng cho thấy nó mang lại nhiều ưu điểm trong điều trị hơn hẳn NSAID. Ưu điểm lớn nhất ghi nhận được đến nay là có rất ít tác dụng phụ khi sử dụng Glucosamine sulfate.

    ReplyDelete

Note: Only a member of this blog may post a comment.