As described on their website, Prediction.io (PIO) integrates all the pieces needed to form a machine learning engine platform:
- A machine learning engine, built on top of the Spark MLlib library, that trains and evaluates predictive models;
- A query engine to serve the results;
- A data collection layer, called the Event server.
Together this forms a deployable, production-ready machine learning platform.
In this post we will focus on the Event server component. This piece is essential to the framework: the Event server is where data collection takes place and on which the analytics layer is built. In addition, it is highly scalable to accommodate Big Data use cases.
We will first review what the Event server is good for, take a look at its architecture and intrinsic data structure, and then dive into an exploratory analytics example.
What is the function of the Event Server?
The Event server stores the data that is later fed into the machine learning engine. It essentially acts as the data repository of the PIO platform, and as such it is where all of your data is unified.
Following a separation-of-concerns architecture, the Event server is decoupled from the other PIO components, which is convenient because it acts as its own independent tier and can be used as such.
Architecture overview
By default the Event server is built on top of Apache HBase (although PIO can be deployed on other NoSQL stores as well if needed). This allows for horizontal scaling and near-real-time storage and retrieval of the event data.
The PIO engine expects the events to follow a certain data structure. Conveniently, as a data scientist/developer working with the PIO framework, you are not expected to interact with HBase directly, but rather with the PIO API, either through HTTP requests or through one of the PIO SDKs, in order to store events in the PIO data structure; this is fully documented here. Let's review this in more detail.
How is the data stored?
Essentially, the PIO event data structure is designed to capture any type of data interaction. The structure consists of:
- The name of the event. This includes a set of reserved operations: $set, to register the entity or set its properties; $unset, to unset properties of the entity; $delete, to delete the entity.
- The "type" of entity being used, i.e. what entity is represented: a user, an order, an object, etc.
- The entity id: a unique id for this entity.
- An optional target entity (another entity type and id): another entity that has a relationship with the entity above (e.g. user-items).
- The properties associated with the entity or the event: a set of key-value pairs. The properties can be associated with the entity (for example, a user's details such as name, gender, etc.) or with the event (for example, information about a rate event, e.g. "{rating : 4}").
- An optional event time.
All subsequent changes to the properties of an entity are stored over time (according to event time), a behavior characteristic of a NoSQL data store.
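To make this concrete, here is a minimal sketch (as a plain Python dictionary) of what a single event could look like, assuming a hypothetical "rate" event in which a user rates an item; the field names follow the event structure described above, and the values are purely illustrative:

import json

# A hypothetical "rate" event: user 2 rates item 98 with a rating of 2.0.
# The field names follow the PIO event structure described above.
rate_event = {
    "event": "rate",                          # name of the event
    "entityType": "user",                     # type of the primary entity
    "entityId": "2",                          # unique id of the primary entity
    "targetEntityType": "item",               # optional target entity type
    "targetEntityId": "98",                   # optional target entity id
    "properties": {"rating": 2.0},            # key-value pairs attached to the event
    "eventTime": "2015-03-02T05:41:59.484Z",  # optional event time
}

print(json.dumps(rate_event, indent=2))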
As mentioned before, the Event server is the data store of the PIO framework. Not only can you easily import data into it via a REST API, you can also plug in any analytics tool to visualize, interrogate, and model that data for exploratory analytics purposes.
This is done via the export command, which lets you hand the data to a business analytics tool of your choice. Let's review a complete example of this.
An example
For our example we will play around with mock data for a customer lifetime value application: essentially, customer and purchase data from an online e-commerce website, which we want to analyze to measure the value derived from these customers over their lifetime engagement with our business.
Data model setup
We will mock up the data in the following form:
Orders
- Order: the customer's order
- Spend: $ amount for this order
- City: name of the city where the purchase was made
- State: state where the purchase was made
- Store: store where the purchase was made
- Customer: customer who made that purchase (uniqueness enforced through the customer id). This allows for a one-to-many relationship between a customer and his/her purchases.
Customers
- Customer: customer id
- Channel: marketing channel by which the customer signed up
- DOB: the customer's date of birth
- Customer name: name of the customer
Remember that this must translate into the PIO event data structure that we discussed above. So it will look like:
Orders
{
  "event" : "$set",
  "entityType" : "order",
  "entityId" : "<unique id>",
  "properties" : {
    "spend" : "<val>",
    "city" : "<string>",
    "state" : "<string>",
    "store" : "<string>",
    "customer" : "<val>"
  }
}
Customers
{
  "event" : "$set",
  "entityType" : "customer",
  "entityId" : "<unique id>",
  "properties" : {
    "DOB" : "<string>",
    "channel" : "<string>",
    "name" : "<string>"
  }
}
But first, as described in the quickstart guide, let's start our instance of the PIO Event server. An easy way to have this done automatically for you is to use one of the pre-loaded images available on Terminal.com.
Let's first create a new app in which we will store our data points:
$ pio app new ordersApp
[WARN] [NativeCodeLoader] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[INFO] [HBLEvents] The table predictionio_eventdata:events_8 doesn't exist yet. Creating now...
[INFO] [App$] Initialized Event Store for this app ID: 8.
[INFO] [App$] Created new app:
[INFO] [App$]       Name: ordersApp
[INFO] [App$]         ID: 8
[INFO] [App$] Access Key: nE9KITDzprLR6utwUJ9a4qDhscsKsjKFlXMcMsxVEdbkQjqYRm8pFcHHDdrM6Cid
vagrant@vagrant-ubuntu-trusty-64:~/ $
Let's insert a few data points for our example via the HTTP REST API, using the access key that was given to us:
$ curl -i -X POST http://localhost:7070/events.json?accessKey=nE9KITDzprLR6utwUJ9a4qDhscsKsjKFlXMcMsxVEdbkQjqYRm8pFcHHDdrM6Cid -H "Content-Type: application/json" -d '{
>   "event" : "$set",
>   "entityType" : "order",
>   "entityId" : "3",
>   "properties" : {
>     "spend" : "4.01",
>     "city" : "san francisco",
>     "state" : "CA",
>     "store" : "Men Apparel",
>     "customer" : "1"
>   }
> }'
HTTP/1.1 201 Created
$ curl -i -X POST http://localhost:7070/events.json?accessKey=nE9KITDzprLR6utwUJ9a4qDhscsKsjKFlXMcMsxVEdbkQjqYRm8pFcHHDdrM6Cid -H "Content-Type: application/json" -d '{
>   "event" : "$set",
>   "entityType" : "customer",
>   "entityId" : "1",
>   "properties" : {
>     "channel" : "email",
>     "DOB" : "1/12/1970",
>     "name" : "sam dolittle"
>   }
> }'
HTTP/1.1 201 Created
Server: spray-can/1.3.2
Date: Thu, 12 Mar 2015 22:31:13 GMT
Content-Type: application/json; charset=UTF-8
Content-Length: 57
We add a few more data points in a similar way (not shown). In a real-life scenario, we would probably ingest our already existing data in batch; here is how to do this with the PIO API, with a sketch of the idea shown below.
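As an illustration, here is a minimal sketch of such a batch ingestion in Python: it simply loops over pre-existing records and POSTs each one to the same events.json endpoint used above. The requests library, the sample records, and the variable names are assumptions made for this sketch; the official PIO SDKs and import tooling offer more convenient paths.

import requests

# Assumed for this sketch: the Event server runs locally and this is our app's access key.
EVENTS_URL = "http://localhost:7070/events.json"
ACCESS_KEY = "nE9KITDzprLR6utwUJ9a4qDhscsKsjKFlXMcMsxVEdbkQjqYRm8pFcHHDdrM6Cid"

# Hypothetical pre-existing order records to ingest in batch.
existing_orders = [
    {"id": "4", "spend": "25.50", "city": "san jose", "state": "CA",
     "store": "Women Apparel", "customer": "2"},
    {"id": "5", "spend": "11.99", "city": "seattle", "state": "WA",
     "store": "Men Apparel", "customer": "1"},
]

for order in existing_orders:
    event = {
        "event": "$set",
        "entityType": "order",
        "entityId": order["id"],
        "properties": {k: v for k, v in order.items() if k != "id"},
    }
    resp = requests.post(EVENTS_URL, params={"accessKey": ACCESS_KEY}, json=event)
    resp.raise_for_status()  # each successful insert returns HTTP 201 Created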
The data is now stored in the Event server!
Data export
First, let's export the data from our app.
$ bin/pio export --appid 8 --output /home/exportFinal3 --format parquet
Let's verify our export by running a few parquet-tools commands (the parquet-tools utility can simply be downloaded to run this) on the generated result:
$ hadoop parquet.tools.Main cat part-r-1.parquet
entityId = 2
entityType = user
event = $set
eventId = 5iuLzCHXzzehq_R3hjsL1AAAAUvZAVo8gzcZ76dDAEs
eventTime = 2015-03-02T05:41:59.484Z
properties:
.rating = 2.0
targetEntityId = 98
targetEntityType = item

creationTime = 2015-03-04T05:29:51.278Z
entityId = 1
entityType = order
event = $set
eventId = 78r6v2QT5GgWWrt_bD_q7wAAAUvjQvWunS1ONY2GyoM
eventTime = 2015-03-04T05:29:51.278Z
properties:
.city = san jose
.spend = 11.99
.state = CA
.store = Women Apparel
.customer = 1
$ hadoop parquet.tools.Main schema part-r-1.parquet
message root {
  optional binary creationTime (UTF8);
  optional binary entityId (UTF8);
  optional binary entityType (UTF8);
  optional binary event (UTF8);
  optional binary eventId (UTF8);
  optional binary eventTime (UTF8);
  optional group properties {
    optional binary city (UTF8);
    optional double rating;
    optional binary spend (UTF8);
    optional binary state (UTF8);
    optional binary store (UTF8);
    optional binary customer (UTF8);
  }
  optional binary targetEntityId (UTF8);
  optional binary targetEntityType (UTF8);
}
We end up with two sets of event data that we can now freely explore in any BI tool. We will demonstrate this in the next section, and even join these two datasets.
Data exploration
Business Intelligence tool: iPython
Let's follow the guide in the documentation to start exploring this data. For our needs, we will use iPython, although we could use many other tools as well. A good guide about using iPython is here.
For your convenience, here is a Terminal.com-powered image that comes with a Spark-enabled iPython setup.
Set-up of pySpark
We will use SQL code to interact with our data, in a DataFrame environment. So first let's ensure that our Python code can talk to Spark/Spark SQL. The initial setup to talk to Spark looks like the sketch below.
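Here is a minimal sketch of that setup in an iPython notebook cell, assuming a Spark 1.x installation with pyspark importable from Python (the application name and the local master are arbitrary choices for this example):

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Create a local Spark context for this notebook session
# (assumes pyspark is on the Python path, e.g. via SPARK_HOME/python).
conf = SparkConf().setAppName("pio-event-exploration").setMaster("local[2]")
sc = SparkContext(conf=conf)

# SQLContext gives us DataFrames and SQL over the exported Parquet data.
sqlContext = SQLContext(sc)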
We will first perform some data explorations via Spark SQL, run in Python (pyspark) mode directly in our iPython notebook.
SQL queries on our data
Order table
Let's first explore our order table. For this we will query our exported events Parquet data for the entityType 'order' (conversely, look for 'customer' within the events for customer data), as in the sketch below.
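A minimal sketch of this query, assuming the sqlContext created above and the export directory used earlier (/home/exportFinal3); sqlContext.parquetFile is the Spark 1.3-era API (newer versions use sqlContext.read.parquet):

# Load the exported events and register them as a temporary table.
events = sqlContext.parquetFile("/home/exportFinal3")
events.registerTempTable("events")

# Keep only the 'order' entities and pull their properties into columns.
orders = sqlContext.sql("""
    SELECT entityId,
           properties.spend    AS spend,
           properties.city     AS city,
           properties.state    AS state,
           properties.store    AS store,
           properties.customer AS customer
    FROM events
    WHERE entityType = 'order'
""")
orders.show()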
Customer table
Let's explore our customer table in a slightly different way, and create a Hive SQL table from our data this time, using the same filtering clause on the events as earlier (see the sketch below).
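A possible sketch of this step, assuming Spark was built with Hive support (the customers table name and the column aliases are choices made for this example):

from pyspark.sql import HiveContext

# HiveContext lets us persist tables in the Hive metastore.
hiveContext = HiveContext(sc)
hive_events = hiveContext.parquetFile("/home/exportFinal3")
hive_events.registerTempTable("events")

# Materialize the customer entities as a Hive table.
hiveContext.sql("""
    CREATE TABLE customers AS
    SELECT entityId            AS customer,
           properties.channel  AS channel,
           properties.DOB      AS dob,
           properties.name     AS name
    FROM events
    WHERE entityType = 'customer'
""")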
We also created an order table in the same way.
Customer lifetime value query
We can now join the two sets of event data, order and customer, to get an overall picture of our customer lifetime value, and see, for example, which marketing channel is more prevalent (a sketch of such a query follows):
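A minimal sketch of such a query, assuming customers and orders Hive tables with the columns used above, and treating total spend per channel as a simple lifetime value proxy:

# Total spend per marketing channel, as a simple customer lifetime value proxy.
clv_by_channel = hiveContext.sql("""
    SELECT c.channel,
           COUNT(DISTINCT c.customer)   AS customers,
           SUM(CAST(o.spend AS DOUBLE)) AS total_spend
    FROM customers c
    JOIN orders o ON o.customer = c.customer
    GROUP BY c.channel
    ORDER BY total_spend DESC
""")
clv_by_channel.show()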
It seems like the answer is 'email'!
Summary
In this post, we have seen how the PIO Event server can be a data repository of choice for event data.
We then exported that data and explored it further by way of a simple customer lifetime value example, using open-source tools like Spark, Python, and iPython notebooks.
Hope it was fun!