Monday, July 13, 2015

Calculating Customer Lifetime Value in Prediction.IO via a Python notebook & Hive

As described on their website, Prediction.io (PIO) integrates all the pieces needed to form a machine learning engine platform:
-       A machine learning engine, built on top of the Spark MLlib library, that trains and evaluates predictive models;
-       A query engine to serve the results;
-       A data collection layer, called the Event server.

Together these form a deployable, production-ready machine learning platform.

In this post we will focus on the Event server component. This piece is essential to the framework: the Event server is where data collection takes place and on which the analytics layer is built. In addition, it is highly scalable to accommodate Big Data use cases.
We will first review what the Event server is good for, take a look at its architecture and intrinsic data structure, and then dive into an exploratory analytics example.

What is the function of the Event Server?


            The Event server stores the data that will later be fed into the machine learning engine. It essentially acts as the data repository of the PIO platform, and as such it is where all of your data is unified.
Following a separation-of-concerns architecture, the Event server is decoupled from the other PIO components, which is convenient because it acts as its own independent tier and can be used as such.



Architecture overview



[Architecture overview diagram, from the PredictionIO website]

By default the Event server is built on top of Apache HBase (although PIO can be deployed on other NoSQL stores as well if needed). This allows for horizontal scaling and near-real-time storage and retrieval of event data.

The PIO engine expects the events in a certain data structure. Conveniently, as a data scientist/developer working with the PIO framework, you are not expected to interact with HBase directly: you store events in the PIO data structure through the PIO API, either via HTTP requests or via one of the PIO SDKs; this is documented fully here. Let's review this in more detail.

How is the data stored?


Essentially, the PIO event data structure is designed to capture any type of data interaction. It is comprised of:

- The name of the event. PIO reserves a set of special events for manipulating an entity and its properties:
  - $set, to create the entity or set its properties;
  - $unset, to remove properties from the entity;
  - $delete, to delete the entity.
- The type of entity being used, i.e. what the entity represents: a user, an order, an item, etc.
- The entity id, a unique id for this entity.
- Optionally, a second entity type and id, called the target entity: another entity that has a relationship with the first one (e.g. user-item).
- The properties associated with the entity or the event, as a set of key-value pairs. For example: set the user's information details like name, gender, etc., or set information about a rate event, e.g. "{rating : 4}".
- An optional event time.
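For instance, a hypothetical "rate" event linking a user to an item, with a target entity, would be sent as something like the following (illustrative values only, following the event fields documented by PIO):

{
  "event" : "rate",
  "entityType" : "user",
  "entityId" : "1",
  "targetEntityType" : "item",
  "targetEntityId" : "98",
  "properties" : {
    "rating" : 4
  },
  "eventTime" : "2015-03-02T05:41:59.484Z"
}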




All subsequent changes to the properties of an entity are stored over time (keyed by event time), which is characteristic of how a NoSQL data store behaves.


As mentioned before, the Event server is the data store of the PIO framework. Not only can you easily import data into it via a REST API, you can also plug in any analytics tool to visualize, interrogate and model that data for exploratory analytics purposes.
This is done via the export command, which dumps the data for the business analytics tool of your choice. Let's review a complete example of this.

An example


For our example we will play with a mock-up of customer lifetime value data: essentially customer and purchase records from an online e-commerce website, which we want to analyze to measure the value derived from these customers over their lifetime engagement with our business.

Data model setup


We will mock-up the data to be of the form:


Orders
-       Order: a customer's order, identified by a unique order id
-       Spend: $ amount for this order
-       City: city where the purchase was made
-       State: state where the purchase was made
-       Store: store/department where the purchase was made
-       Customer: customer who made that purchase (uniqueness enforced through the customer id). This allows for a one-to-many relationship between a customer and his/her purchases.


Customers
-       Customer: customer id
-       Channel: marketing channel by which the customer signed up
-       Customer name: name of the customer
-       DOB: the customer's date of birth

Remember that this must translate into the PIO Event data structure that we talked about above. So this will look like:


Orders
{
  "event" : "$set",
  "entityType" : "order",
  "entityId" : "<unique id>",
  "properties" : {
    "spend" : "<val>",
    "city" : "<string>",
    "state" : "<string>",
    "store" : "<string>",
    "customer" : "<val>"
  }
}


Customers
{
  "event" : "$set",
  "entityType" : "customer",
  "entityId" : "<unique id>",
  "properties" : {
    "DOB" : "<string>",
    "channel" : "<string>",
    "name" : "<string>"
  }
}

But first, as described in the quickstart guide, let's start our instance of the PIO Event server (with the pio eventserver command). An easy way to have this done for you automatically is to use one of the pre-loaded images available on Terminal.com.

Let’s first create a new app in which we will store our data points:

$ pio app new ordersApp
[WARN] [NativeCodeLoader] Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[INFO] [HBLEvents] The table predictionio_eventdata:events_8 doesn't exist yet. Creating now...
[INFO] [App$] Initialized Event Store for this app ID: 8.
[INFO] [App$] Created new app:
[INFO] [App$]       Name: ordersApp
[INFO] [App$]         ID: 8
[INFO] [App$] Access Key: nE9KITDzprLR6utwUJ9a4qDhscsKsjKFlXMcMsxVEdbkQjqYRm8pFcHHDdrM6Cid
vagrant@vagrant-ubuntu-trusty-64:~/ $


Let’s insert a few data points for our example, via the HTTP REST API, using the Access key that was passed to us:
$ curl -i -X POST http://localhost:7070/events.json?accessKey=nE9KITDzprLR6utwUJ9a4qDhscsKsjKFlXMcMsxVEdbkQjqYRm8pFcHHDdrM6Cid -H "Content-Type: application/json" -d '{
>        "event" : "$set",
>        "entityType" : "order",
>        "entityId" : "3",
>        "properties" : {
>          "spend" : "4.01",
>          "city" : "san francisco",
>          "state" : "CA",
>          "store" : "Men Apparel",
>          "customer" : "1"
>        }
>      }'
HTTP/1.1 201 Created


$ curl -i -X POST http://localhost:7070/events.json?accessKey=nE9KITDzprLR6utwUJ9a4qDhscsKsjKFlXMcMsxVEdbkQjqYRm8pFcHHDdrM6Cid -H "Content-Type: application/json" -d '{
>        "event" : "$set",
>        "entityType" : "customer",
>        "entityId" : "1",
>        "properties" : {
>          "channel" : "email",
>          "DOB" : "1/12/1970",
>          "name" : "sam dolittle"
>        }
>      }'
HTTP/1.1 201 Created
Server: spray-can/1.3.2
Date: Thu, 12 Mar 2015 22:31:13 GMT
Content-Type: application/json; charset=UTF-8
Content-Length: 57


We add a few more data points in a similar way (not shown). In a real-life example, we would probably ingest our already existing data in batch; here is how this could be done with the PIO API.
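As an illustration (not the exact code used here), a minimal sketch using the PIO Python SDK could look like this, reusing the access key from above; the sample orders below are made up:

import predictionio

# Connect to the Event server of our ordersApp (access key from 'pio app new' above)
client = predictionio.EventClient(
    access_key="nE9KITDzprLR6utwUJ9a4qDhscsKsjKFlXMcMsxVEdbkQjqYRm8pFcHHDdrM6Cid",
    url="http://localhost:7070")

# A made-up batch of pre-existing orders to ingest
orders = [
    {"id": "4", "spend": "23.50", "city": "san jose", "state": "CA",
     "store": "Women Apparel", "customer": "2"},
    {"id": "5", "spend": "7.99", "city": "palo alto", "state": "CA",
     "store": "Men Apparel", "customer": "1"},
]

for o in orders:
    client.create_event(
        event="$set",
        entity_type="order",
        entity_id=o["id"],
        properties={k: v for k, v in o.items() if k != "id"})

client.close()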

The data is now stored in the Event server!

Data export


First, let’s export the data from our apps.

$ bin/pio export --appid 8 --output /home/exportFinal3 --format parquet

Let's verify our export by firing a few parquet-tools commands on the generated result (parquet-tools can simply be downloaded to run this):
$ hadoop parquet.tools.Main cat part-r-1.parquet 

entityId = 2
entityType = user
event = $set
eventId = 5iuLzCHXzzehq_R3hjsL1AAAAUvZAVo8gzcZ76dDAEs
eventTime = 2015-03-02T05:41:59.484Z
properties:
.rating = 2.0
targetEntityId = 98
targetEntityType = item

creationTime = 2015-03-04T05:29:51.278Z
entityId = 1
entityType = order
event = $set
eventId = 78r6v2QT5GgWWrt_bD_q7wAAAUvjQvWunS1ONY2GyoM
eventTime = 2015-03-04T05:29:51.278Z
properties:
.city = san jose
.spend = 11.99
.state = CA
.store = Women Apparel
.customer = 1
$ hadoop parquet.tools.Main schema part-r-1.parquet
message root {
  optional binary creationTime (UTF8);
  optional binary entityId (UTF8);
  optional binary entityType (UTF8);
  optional binary event (UTF8);
  optional binary eventId (UTF8);
  optional binary eventTime (UTF8);
  optional group properties {
    optional binary city (UTF8);
    optional double rating;
    optional binary spend (UTF8);
    optional binary state (UTF8);
    optional binary store (UTF8);
    optional binary customer (UTF8);
  }
  optional binary targetEntityId (UTF8);
  optional binary targetEntityType (UTF8);
}

We end up with two sets of event data, which we can now freely explore in any BI tool. We will demonstrate this in the next section, where we will even join these two datasets.

Data exploration


Business Intelligence tool: iPython


            Let's follow the guide in the documentation to start exploring this data. For our needs we will use iPython, although many other tools could be used as well. A good guide to using iPython is here.
For your convenience, here is a Terminal.com-powered image that comes with a Spark-enabled iPython setup.

Set-up of pySpark


            We will use SQL code to interact with our data in a DataFrame environment, so first let's make sure that our Python code can talk to Spark and Spark SQL. The initial setup to talk to Spark is sketched below:
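A minimal sketch of that setup, assuming a local Spark 1.x installation pointed to by SPARK_HOME (the py4j zip name varies with the Spark version):

import os
import sys

# Point the notebook at the local Spark installation (adjust SPARK_HOME for your setup)
spark_home = os.environ.get('SPARK_HOME', '/usr/local/spark')
sys.path.append(os.path.join(spark_home, 'python'))
sys.path.append(os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

from pyspark import SparkContext
from pyspark.sql import HiveContext

# One SparkContext per notebook; a HiveContext gives us Spark SQL plus Hive table support
sc = SparkContext('local[*]', 'clv-exploration')
sqlContext = HiveContext(sc)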




We will first perform some data exploration via Spark SQL, running pyspark directly from our iPython notebook.



SQL queries on our data


Order table
            Let's first explore our order table. For this we will query our exported events Parquet data for the entityType 'order' (similarly, look for 'customer' within the events for customer data), as sketched below:
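A sketch of that query, assuming the export directory used above and the sqlContext created during setup:

# Load the exported events and register them as a temp table for Spark SQL
events = sqlContext.parquetFile('/home/exportFinal3')
events.registerTempTable('events')

# Keep only the 'order' entities and flatten their properties into columns
orders = sqlContext.sql("""
    SELECT entityId                         AS order_id,
           properties.customer              AS customer_id,
           CAST(properties.spend AS DOUBLE) AS spend,
           properties.city                  AS city,
           properties.state                 AS state,
           properties.store                 AS store
    FROM events
    WHERE entityType = 'order'
""")
orders.show()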



Customer table
            Let's explore our customer table in a slightly different way, and create a Hive SQL table from our data this time, using the same filtering mechanism on the events as earlier; a sketch follows:
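A sketch of that step, with an illustrative table name ('customers'):

# Extract the 'customer' entities and persist them as a Hive table
customers = sqlContext.sql("""
    SELECT entityId           AS customer_id,
           properties.name    AS name,
           properties.channel AS channel,
           properties.DOB     AS dob
    FROM events
    WHERE entityType = 'customer'
""")
customers.saveAsTable('customers')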


We also created an order table in the same way.

Customer lifetime value query


            We can now join the two sets of event data, orders and customers, to get an overall picture of our customer lifetime value and see, for example, which marketing channel is most prevalent; a sketch of such a query is below:
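A sketch of such a query against the two Hive tables created above (column names follow the earlier sketches):

# Total and average spend per marketing channel, joining orders to customers
clv_by_channel = sqlContext.sql("""
    SELECT c.channel,
           COUNT(DISTINCT c.customer_id) AS customers,
           SUM(o.spend)                  AS total_spend,
           SUM(o.spend) / COUNT(DISTINCT c.customer_id) AS avg_lifetime_value
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.channel
    ORDER BY total_spend DESC
""")
clv_by_channel.show()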


It seems like the answer is ‘email’!

Summary


In this post, we have seen how the PIO Event server can be a data repository of choice for event data.
We then exported that data and explored it further, by way of a simple customer lifetime value example dataset, using open source tools like Spark, Python and iPython notebooks.
Hope it was fun!