Simple Cassandra modeling example

1 Introduction

The document explores different designs for a simple real-world time series use case on Apache Cassandra. Initially this data resided in a different database (SQL Server, Rdbms), but given the volume of the data, we need a more scalable solution. Cassandra was chosen for its scalability, no SPOF feature, and ease of use. We will review our options regarding the data model, using CQL to construct the table.

1.1 Scope and requirements

Devices need to communicate over the Web (devices are wifi-enabled) to their backend cloud infrastructure (IaaS and PaaS) to send status and retrieve updates. There will be about 100k devices connected, and it is important that the architecture be scalable when more devices are added. The rate fluctuates with spikes. Data format is XML.

Need to find the optimal schema for this data, accounting for fast writes and fast retrieval, minimizing the amount of space utilized.
Cassandra is a good choice for storing this data, but how to proceed with the data schema?

1.2 Data schema & queries

The data that we want to store in Cassandra are of the format below.

1. Sample data format

Column name	Sample value	Comment
Date	02/05/2013	Day's worth of data
Device_id	As-333-fd-45	unique id for device
timestamp	02:44:45	Time stamp for this date
Event_type	notification	Other data
Message_id	454
Message	M455

2. Queries

As we know in NoSQL, the design of the tables is driven by the queries of the data rather than by its entities & relationships; i.e., optimizing for reads at write time. Here are the types of queries that we want out of the data:

Types of Queries:

- Q1 -> Get all records for a particular date

- Q2 -> Get a record for a date and a device_id

- Q3 -> Get a record for a device_id
- Q4 -> Get all records for a date and a timestamp between x and y

Lets review the options that we have to construct the table for this data. We will use CQL for this. The main issue has to do with finding the proper Primary Key for our objects.

1.2.1 Option 1:

PRIMARY KEY = 'date, device_id'

The date object, will be the partitioning key; device_id will be the clustering key. The date object will be the same throughout the day's worth of data.
The clustering key (device_id) on the other hand should not be unique. It is used for ordering when querying objects.
This seems like the wrong design altogether.

1.2.2 Option 2:

- PRIMARY KEY = 'device_id'

Only utilize this as the partitioning key. It is preferable as this is a unique value. The partitioning key is used to distribute the data across nodes, so we want this as random as possible.

Secondary index on date ; required since date is not part of the key.

Secondary index on timestamp

1.2.3 Option 3:

PRIMARY KEY (device_id, date, timestamp)

This removes the need for the secondary indexes, and thus is a gain of space. However, it adds constraints as one must using filtering (i.e., scan in memory through a B-tree) for queries on the PK fields.

1.2.4 Option 4:

PRIMARY KEY ((device_id, date), timestamp);

device_id / date will be the partitioning key. timestamp is the clustering key.

This lets us avoid the 'allow filtering' clause, which slows down the query. However it impedes us from querying on some of the PK fields..

1.3 Summary

Options	Q1	Q2	Q3	Q4	Comments / Pros & Cons
1	Yes	Yes	Yes with filtering	No	1. Wide-row if for a partition, there are lot of records 2. The node will be very hot as everything is going in one row for one day. Lots of contention. Should be avoided
2	Yes	Yes	Yes	Yes with filtering	Q4 will be served with Allow filtering 1. More space consumption due to secondary indices
3	Yes with allow filtering	Yes	Yes	Yes with filtering	No index. But not very effective to use allow filtering (~ B-tree in memory) , more than secondary index on partitionKey.
4	No, needs rowkey	yes	No, needs partition key	No, needs row key	No need for filtering, but then too restrictive.

The best choice seems to be option 2.

2 Appendix

Definitions

partitionkey = date

rowkey = device_id

range = timestamp.

2.1 Option 1

CREATE TABLE tablestorage6 (

partitionkey text,

rowkey text,

policyid text,

policyname text,

range timestamp,

PRIMARY KEY (partitionkey, rowkey);

CREATE INDEX i1 ON tablestorage6 (range);

cqlsh:demodb> select * from tablestorage6 ;

partitionkey | rowkey | policyid | policyname | range

--------------+--------+----------+------------+--------------------------

p1 | r1 | pl1 | plicy1 | 2007-01-02 00:00:00+0000

p1 | r2 | pl2 | plicy2 | 2007-01-03 00:00:00+0000

p2 | r3 | pl3 | plicy3 | 2008-01-03 00:00:00+0000

(3 rows)

cqlsh:demodb> select * from tablestorage6 where rowkey = 'r2' allow filtering;

partitionkey | rowkey | policyid | policyname | range

--------------+--------+----------+------------+--------------------------

p1 | r2 | pl2 | plicy2 | 2007-01-03 00:00:00+0000

(1 rows)

2.1.1 Option 2

CREATE TABLE tablestorage (

rowkey text,

clientipaddress text,

deviceid text,

eventtype text,

grouppolicyid text,

grouppolicyname text,

loglevel text,

macaddress text,

messageid text,

model text,

msgparam text,

packageid text,

partitionkey text,

policyid text,

policyname text,

serialnumber text,

"timestamp" timestamp,

username text,

PRIMARY KEY (rowkey)

create index a on tablestorage(partitionkey);

select * from tablestorage where rowkey='726716f7-2e54-4715-9f00-91dcbea999448' and partitionkey = '2' ;

---------------------------------------+-----------------+----------+--------------+---------------+-----------------+----------+------------+-----------+--------+----------+-----------+--------------+----------+------------+--------------+------------------------------+----------

726716f7-2e54-4715-9f00-91dcbea999448 | Host2 | id2 | CreateSchema | gpid2 | gp2 | debug | 1234 | 5 | model2 | sgp2 | pkg2 | 2 | pol2 | plnme2 | seril2 | 20133114-07-02 03:19:39+0000 | matt8

(1 rows)

cqlsh:demodb> select * from tablestorage where partitionkey = '2' and timestamp > '20131212' allow filtering;

726716f7-2e54-4715-9f00-91dcbea999442 | Host2 | id2 | CreateSchema | gpid2 | gp2 | debug | 1234 | 5 | model2 | sgp2 | pkg2 | 2 | pol2 | plnme2 | seril2 | 20133114-07-02 03:14:27+0000 | matt2

(1 rows)

2.1.2 Option 3

CREATE TABLE tablestorage10 (

rowkey text,

partitionkey text,

range timestamp,

policyid text,

policyname text,

PRIMARY KEY (rowkey, partitionkey, range)

cqlsh:demodb> select * from tablestorage10;

rowkey | partitionkey | range | policyid | policyname

--------+--------------+--------------------------+----------+------------

r3 | p2 | 2008-01-03 00:00:00+0000 | pl3 | plicy3

r2 | p1 | 2007-01-03 00:00:00+0000 | pl2 | plicy2

(2 rows)

Q4: cqlsh:demodb> select * from tablestorage10 where partitionkey= 'p1' and range > '2006-01-01' allow filtering;

rowkey | partitionkey | range | policyid | policyname

--------+--------------+--------------------------+----------+------------

r2 | p1 | 2007-01-03 00:00:00+0000 | pl2 | plicy2

(1 rows)

cqlsh:demodb>

but

cqlsh:demodb> select * from tablestorage10 where range > '2006-01-01' allow filtering;

Bad Request: PRIMARY KEY part range cannot be restricted (preceding part partitionkey is either not restricted or by a non-EQ relation)

cqlsh:demodb>

cqlsh:demodb> select * from tablestorage10 where partitionkey='p1' and range > '2006-01-01' allow filtering;

rowkey | partitionkey | range | policyid | policyname

--------+--------------+--------------------------+----------+------------

r2 | p1 | 2007-01-03 00:00:00+0000 | pl2 | plicy2

(1 rows)

cqlsh:demodb>

cqlsh:demodb> select * from tablestorage10 where rowkey = 'r2' and range > '2006-01-01' allow filtering;

Bad Request: PRIMARY KEY part range cannot be restricted (preceding part partitionkey is either not restricted or by a non-EQ relation)

cqlsh:demodb>

cqlsh:demodb> select * from tablestorage10 where rowkey = 'r2';

rowkey | partitionkey | range | policyid | policyname

--------+--------------+--------------------------+----------+------------

r2 | p1 | 2007-01-03 00:00:00+0000 | pl2 | plicy2

(1 rows)

cqlsh:demodb> select * from tablestorage10 where rowkey = 'r2' and partitionkey = 'p1';

rowkey | partitionkey | range | policyid | policyname

--------+--------------+--------------------------+----------+------------

r2 | p1 | 2007-01-03 00:00:00+0000 | pl2 | plicy2

(1 rows)

cqlsh:demodb> select * from tablestorage10 where partitionkey = 'p1';

Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING

cqlsh:demodb> select * from tablestorage10 where partitionkey = 'p1' allow filtering;

rowkey | partitionkey | range | policyid | policyname

--------+--------------+--------------------------+----------+------------

r2 | p1 | 2007-01-03 00:00:00+0000 | pl2 | plicy2

(1 rows)

cqlsh:demodb>

2.1.3 Option 4

CREATE TABLE tablestorage8 (

partitionkey text,

rowkey text,

range timestamp,

policyid text,

policyname text,

PRIMARY KEY ((partitionkey, rowkey), range)

demodb> select * from tablestorage8 where partitionkey='p2';

Bad Request: Partition key part rowkey must be restricted since preceding part is

select * from tablestorage8 where rowkey='r3' and range > '20131212';

Bad Request: partition key part rowkey cannot be restricted (preceding part partitionkey is either not restricted or by a non-EQ relation)

cqlsh:demodb>

Big data - tidbits of knowledge

Contact

Monday, February 24, 2014

Cassandra modeling with CQL for a simple time series use case

1 Introduction

1.1 Scope and requirements

1.2 Data schema & queries

Lets review the options that we have to construct the table for this data. We will use CQL for this. The main issue has to do with finding the proper Primary Key for our objects.

1.2.1 Option 1:

1.2.2 Option 2:

1.2.3 Option 3:

1.2.4 Option 4:

This lets us avoid the 'allow filtering' clause, which slows down the query. However it impedes us from querying on some of the PK fields..

1.3 Summary

2 Appendix

Definitions

2.1 Option 1

2.1.1 Option 2

2.1.2 Option 3

2.1.3 Option 4

0 comments:

Post a Comment

Popular Posts

Recent Posts

Categories

Definition List

Text Widget

Blog Archive