1 Introduction
The document explores different
designs for a simple real-world time series use case on Apache Cassandra. Initially this data resided in a different database (SQL Server, Rdbms), but given the volume of the data, we need a more scalable solution. Cassandra was chosen for its scalability, no SPOF feature, and ease of use. We will review our options regarding the data model, using CQL to construct the table.
1.1
Scope and requirements
Devices need to communicate over the
Web (devices are wifi-enabled) to their backend cloud infrastructure (IaaS and
PaaS) to send status and retrieve updates. There will be about 100k devices
connected, and it is important that the architecture be scalable when more
devices are added. The rate fluctuates with spikes. Data format is XML.
Need to find the optimal schema for
this data, accounting for fast writes and fast retrieval, minimizing the amount
of space utilized.
Cassandra is a good choice for storing this data, but how to proceed with the data schema?
Cassandra is a good choice for storing this data, but how to proceed with the data schema?
1.2 Data schema & queries
The data that we want to store in Cassandra are of the format below.
1.
Sample data
format
Column
name
|
Sample
value
|
Comment
|
Date
|
02/05/2013
|
Day's worth of data
|
Device_id
|
As-333-fd-45
|
unique id for device
|
timestamp
|
02:44:45
|
Time stamp for this date
|
Event_type
|
notification
|
Other data
|
Message_id
|
454
|
|
Message
|
M455
|
|
2. Queries
As we know in NoSQL, the design of the tables is driven by the queries of the data rather than by its entities & relationships; i.e., optimizing for reads at write time. Here are the types of queries that we want out of the data:
- Q1 -> Get all records for a particular date
- Q2 -> Get a record for a date and a device_id
- Q3 -> Get a record for a device_id
- Q4 -> Get all records for a date and a timestamp between x and y
- Q4 -> Get all records for a date and a timestamp between x and y
Lets review the options that we have to construct the table for this data. We will use CQL for this. The main issue has to do with finding the proper Primary Key for our objects.
1.2.1
Option 1:
The date object, will be the partitioning key; device_id will be the clustering key. The date object will be the same throughout the day's worth of data.
The clustering key (device_id) on the other hand should not be unique. It is used for ordering when querying objects.
This seems like the wrong design altogether.
The clustering key (device_id) on the other hand should not be unique. It is used for ordering when querying objects.
This seems like the wrong design altogether.
1.2.2 Option 2:
Only utilize this as the partitioning key. It is preferable as this is a unique value. The partitioning key is used to distribute the data across nodes, so we want this as random as possible.
Secondary index on date ; required since date is not part of the key.
Secondary index on timestamp
1.2.3 Option 3:
This removes the need for the secondary indexes, and thus is a gain of space. However, it adds constraints as one must using filtering (i.e., scan in memory through a B-tree) for queries on the PK fields.
1.2.4 Option 4:
PRIMARY KEY ((device_id, date), timestamp);
device_id / date will be the partitioning key. timestamp is the
clustering key.
This lets us avoid the 'allow filtering' clause, which slows down the query. However it impedes us from querying on some of the PK fields..
1.3 Summary
Options
|
Q1
|
Q2
|
Q3
|
Q4
|
Comments / Pros & Cons
|
1
|
Yes
|
Yes
|
Yes with
filtering
|
No
|
1.
Wide-row if for a partition, there are lot of records
2. The node will be
very hot as everything is going in one row for one day. Lots of contention.
Should be avoided
|
2
|
Yes
|
Yes
|
Yes
|
Yes with
filtering
|
Q4 will be served with Allow filtering
1. More space
consumption due to secondary indices
|
3
|
Yes with
allow filtering
|
Yes
|
Yes
|
Yes with
filtering
|
No index. But not very effective to use allow filtering (~ B-tree in memory) , more than
secondary index on partitionKey.
|
4
|
No, needs
rowkey
|
yes
|
No, needs
partition key
|
No, needs
row key
|
No need for filtering, but then too
restrictive.
|
The best choice seems to be option 2.
2
Appendix
Definitions
partitionkey = date
rowkey = device_id
range = timestamp.
2.1 Option
1
CREATE TABLE tablestorage6
(
partitionkey text,
rowkey text,
policyid text,
policyname text,
range timestamp,
PRIMARY KEY (partitionkey, rowkey);
CREATE INDEX i1 ON
tablestorage6 (range);
cqlsh:demodb> select *
from tablestorage6 ;
partitionkey | rowkey | policyid | policyname | range
--------------+--------+----------+------------+--------------------------
p1 | r1 | pl1 | plicy1 | 2007-01-02 00:00:00+0000
p1 | r2 | pl2 | plicy2 | 2007-01-03 00:00:00+0000
p2 | r3 | pl3 | plicy3 | 2008-01-03 00:00:00+0000
(3 rows)
cqlsh:demodb> select *
from tablestorage6 where rowkey = 'r2' allow filtering;
partitionkey | rowkey | policyid | policyname | range
--------------+--------+----------+------------+--------------------------
p1 | r2 | pl2 | plicy2 | 2007-01-03 00:00:00+0000
(1 rows)
2.1.1 Option 2
CREATE TABLE tablestorage
(
rowkey text,
clientipaddress text,
deviceid text,
eventtype text,
grouppolicyid text,
grouppolicyname text,
loglevel text,
macaddress text,
messageid text,
model text,
msgparam text,
packageid text,
partitionkey text,
policyid text,
policyname text,
serialnumber text,
"timestamp" timestamp,
username text,
PRIMARY KEY (rowkey)
create index a on
tablestorage(partitionkey);
select * from tablestorage
where rowkey='726716f7-2e54-4715-9f00-91dcbea999448' and partitionkey = '2' ;
rowkey | clientipaddress | deviceid | eventtype | grouppolicyid | grouppolicyname | loglevel | macaddress | messageid | model | msgparam | packageid | partitionkey | policyid | policyname | serialnumber | timestamp | username
---------------------------------------+-----------------+----------+--------------+---------------+-----------------+----------+------------+-----------+--------+----------+-----------+--------------+----------+------------+--------------+------------------------------+----------
726716f7-2e54-4715-9f00-91dcbea999448 | Host2 | id2 | CreateSchema | gpid2 | gp2 | debug | 1234 | 5 | model2 | sgp2 | pkg2 | 2 | pol2 | plnme2 | seril2 | 20133114-07-02 03:19:39+0000 | matt8
(1 rows)
cqlsh:demodb> select *
from tablestorage where partitionkey = '2' and timestamp > '20131212' allow
filtering;
rowkey | clientipaddress | deviceid | eventtype | grouppolicyid | grouppolicyname | loglevel | macaddress | messageid | model | msgparam | packageid | partitionkey | policyid | policyname | serialnumber | timestamp | username
---------------------------------------+-----------------+----------+--------------+---------------+-----------------+----------+------------+-----------+--------+----------+-----------+--------------+----------+------------+--------------+------------------------------+----------
726716f7-2e54-4715-9f00-91dcbea999442 | Host2 | id2 | CreateSchema | gpid2 | gp2 | debug | 1234 | 5 | model2 | sgp2 | pkg2 | 2 | pol2 | plnme2 | seril2 | 20133114-07-02 03:14:27+0000 | matt2
(1 rows)
2.1.2 Option 3
CREATE TABLE
tablestorage10 (
rowkey text,
partitionkey text,
range timestamp,
policyid text,
policyname text,
PRIMARY KEY (rowkey,
partitionkey, range)
cqlsh:demodb> select *
from tablestorage10;
rowkey | partitionkey | range | policyid | policyname
--------+--------------+--------------------------+----------+------------
r3 | p2 | 2008-01-03 00:00:00+0000 | pl3 | plicy3
r2 | p1 | 2007-01-03 00:00:00+0000 | pl2 | plicy2
(2 rows)
Q4: cqlsh:demodb>
select * from tablestorage10 where partitionkey= 'p1' and range >
'2006-01-01' allow filtering;
rowkey | partitionkey | range | policyid | policyname
--------+--------------+--------------------------+----------+------------
r2 | p1 | 2007-01-03 00:00:00+0000 | pl2 | plicy2
(1 rows)
cqlsh:demodb>
but
cqlsh:demodb> select *
from tablestorage10 where range > '2006-01-01' allow filtering;
Bad Request: PRIMARY KEY
part range cannot be restricted (preceding part partitionkey is either not
restricted or by a non-EQ relation)
cqlsh:demodb>
cqlsh:demodb> select *
from tablestorage10 where partitionkey='p1' and
range > '2006-01-01' allow filtering;
rowkey | partitionkey | range | policyid | policyname
--------+--------------+--------------------------+----------+------------
r2 | p1 | 2007-01-03 00:00:00+0000 | pl2 | plicy2
(1 rows)
cqlsh:demodb>
cqlsh:demodb> select *
from tablestorage10 where rowkey = 'r2' and
range > '2006-01-01' allow filtering;
Bad Request: PRIMARY KEY
part range cannot be restricted (preceding part partitionkey is either not
restricted or by a non-EQ relation)
cqlsh:demodb>
cqlsh:demodb> select *
from tablestorage10 where rowkey = 'r2';
rowkey | partitionkey | range | policyid | policyname
--------+--------------+--------------------------+----------+------------
r2 | p1 | 2007-01-03 00:00:00+0000 | pl2 | plicy2
(1 rows)
cqlsh:demodb> select *
from tablestorage10 where rowkey = 'r2' and partitionkey = 'p1';
rowkey | partitionkey | range | policyid | policyname
--------+--------------+--------------------------+----------+------------
r2 | p1 | 2007-01-03 00:00:00+0000 | pl2 | plicy2
(1 rows)
cqlsh:demodb> select *
from tablestorage10 where partitionkey = 'p1';
Bad Request: Cannot
execute this query as it might involve data filtering and thus may have
unpredictable performance. If you want to execute this query despite the
performance unpredictability, use ALLOW FILTERING
cqlsh:demodb> select *
from tablestorage10 where partitionkey = 'p1' allow filtering;
rowkey | partitionkey | range | policyid | policyname
--------+--------------+--------------------------+----------+------------
r2 | p1 | 2007-01-03 00:00:00+0000 | pl2 | plicy2
(1 rows)
cqlsh:demodb>
2.1.3 Option 4
CREATE TABLE
tablestorage8 (
partitionkey text,
rowkey text,
range timestamp,
policyid text,
policyname text,
PRIMARY KEY ((partitionkey, rowkey), range)
demodb>
select * from tablestorage8 where
partitionkey='p2';
Bad Request: Partition key part rowkey
must be restricted since preceding part is
select * from
tablestorage8 where rowkey='r3' and range > '20131212';
Bad Request:
partition key part rowkey cannot be restricted (preceding part partitionkey is
either not restricted or by a non-EQ relation)
cqlsh:demodb>
0 comments:
Post a Comment
Note: Only a member of this blog may post a comment.