What is Prediction.IO in a nutshell?
Building
a machine learning application from scratch is hard: you need to work with your
own data and train your algorithms on it, build a
layer to serve the prediction results, manage the different algorithms you are
running and their evaluations, deploy your application to production, manage the
dependencies with your other tools, and so on.
Prediction.io is an open source Machine Learning server that
addresses these concerns. It aims to be the “LAMP stack” for data analytics.
Current state of Machine Learning frameworks
Let's first
review some of the tools currently popular in the Machine Learning
(ML) community. Some widely used tools are: Mahout in the Hadoop ecosystem,
MLlib in the Spark community, H2O, and Deeplearning4j.
These libraries generally work great and provide implementations
of the main ML algorithms. However, what is missing
in order to use them in a production environment?
- An integration layer to bring in your data sources
- A framework to roll a prototype into production
- A simple API to query the results
Example
Let’s take
a classic recommender as an example: predictive modeling is usually based on
users’ behavior to predict product recommendations.
We will convert the data (in JSON) into the binary Avro format.
// Read training data
val trainingData =
sc.textFile("trainingData.txt").map(_.split(',') match {..})
which yields something like:
user1 purchases product1, product2
user2 purchases product2
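The elided match block above can be sketched in plain Scala. Here the Rating case class stands in for MLlib's org.apache.spark.mllib.recommendation.Rating, and the "user,product,rating" line format is an assumption for illustration:

```scala
// Stand-in for org.apache.spark.mllib.recommendation.Rating
case class Rating(user: Int, product: Int, rating: Double)

// Parse one "user,product,rating" line; a real pipeline would add
// a case for malformed lines.
def parseLine(line: String): Rating =
  line.split(',') match {
    case Array(user, product, rating) =>
      Rating(user.trim.toInt, product.trim.toInt, rating.trim.toDouble)
  }

// In Spark, the same function would be mapped over sc.textFile(...)
val ratings = Seq("1,100,5.0", "2,101,3.0").map(parseLine)
```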
Then build a predictive model with an algorithm:
// Collaborative filtering with ALS: rank 10, 20 iterations, regularization 0.01
val model = ALS.train(trainingData, 10, 20, 0.01)
Then start using the model:
// Recommend 5 products to each user
allUsers.foreach { user => model.recommendProducts(user, 5) }
This recommends 5 products for each user.
This code will work in a development environment, but it wouldn't
work in production. Why?
- How do you integrate with your existing data?
- How do you unify the data from multiple sources?
- How do you deploy a scalable service that responds to dynamic prediction queries?
- How do you persist the predictive model in a distributed environment?
- How do you make your storage layer, Spark, and the algorithms talk to each other?
- How do you prepare the data for model training?
- How do you update the model with new data, without downtime?
- Where does the business logic get added?
- How do you make the code configurable, reusable, and manageable?
- How do you build all this with separation of concerns (SoC), as on the web development side of things?
- How do you make things work in a real-time environment?
- How do you customize the recommender on a per-location basis? How do you discard products that are out of inventory?
- How about performing different tests on the algorithms you selected?
Prediction.io to the rescue!
Let’s address the above questions.
Prediction.io provides an event server for storage that
collects data (say, from a mobile app, the web, etc.) in a unified way, from multiple channels.
You can plug multiple
engines into Prediction.io; each engine represents a type of prediction
problem. Why is that important?
In a production system, you will typically use multiple engines.
Take the archetypal example of Amazon: if you bought this, recommend that. But
you may also run a different algorithm on the front page for article discovery,
and yet another one for email campaigns, retargeting users based on what they
browsed.
Prediction.io does that very well.
How to deploy a predictive model as a service? A typical mobile
app will send user actions as behavior data. Your prediction model will
be trained on these, and the Prediction.io engine will be deployed as a web service. So now your mobile app
can communicate with the engine via a REST API interface. If this is not
sufficient, there are SDKs
available in different languages. The engine will return a list of results in
JSON format.
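As a sketch, such a query is just a small JSON document POSTed to the deployed engine's HTTP endpoint. The field names ("user", "num") and the port below are assumptions based on the recommendation use case, not the exact API:

```scala
// Build the JSON body of a recommendation query.
// Field names are illustrative; match them to your engine's query class.
def queryJson(user: String, num: Int): String =
  s"""{"user":"$user","num":$num}"""

// The app would POST this body to the deployed engine, e.g.:
//   POST http://localhost:8000/queries.json   (port is an assumption)
//   Content-Type: application/json
val body = queryJson("user1", 5)
```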
Prediction.io manages the
dependencies between Spark, HBase, and the algorithms automatically. You can
launch it with a one-line command.
The framework doesn't act as a black box: Prediction.io is one of
the most popular ML products on GitHub (5000+ contributors).
The framework is open source and written in Scala, taking advantage of the JVM's
support for distributed computing; R, in comparison, is not
so easy to scale. Prediction.io also uses Spark,
currently one of the best distributed computing frameworks, proven to
scale in production. Algorithms are implemented via MLlib. Lastly, events are
stored in Apache HBase as the NoSQL storage layer.
Preparing the data
for model training is a matter of running the Event Server (launched via
'pio eventserver') and interacting with it, by defining the action (e.g. change
the product price), the product (e.g. give a rating A to product x), the product name,
the attribute name, and so on, all in free format.
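As a sketch, each such interaction can be sent to the Event Server as a small JSON event. The field names below follow PredictionIO's entity/target-entity event convention, but treat the exact schema, the endpoint path, and the access key as assumptions:

```scala
// Build a JSON event describing "user rates product", to be POSTed
// to the Event Server (e.g. /events.json?accessKey=...).
def rateEvent(userId: String, productId: String, rating: Double): String =
  s"""{"event":"rate",""" +
  s""""entityType":"user","entityId":"$userId",""" +
  s""""targetEntityType":"item","targetEntityId":"$productId",""" +
  s""""properties":{"rating":$rating}}"""

val payload = rateEvent("user1", "productX", 4.0)
```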
Building the engine is made easy because Prediction.io
offers templates for recommendation and classification. The engine is built on
an MVC architecture, and has the following components:
- Data source:
data comes from any source and is preprocessed automatically into the
desired format. Data is prepared and cleansed according to what the engine
expects. This follows the separation of concerns principle.
- Algorithms: ML algorithms
at your disposal to do what you need; ability to combine multiple algorithms.
- Serving layer:
ability to serve results based on predictions, and add custom business logic to
them.
- Evaluator layer:
ability to evaluate the performance of the prediction to compare algorithms.
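The separation of concerns above can be sketched as plain Scala traits. This is a simplified, hypothetical rendering of the data source / algorithm / serving split, not Prediction.io's actual interfaces:

```scala
// TD = training data, M = model, Q = query, P = prediction
trait DataSource[TD] { def readTraining(): TD }

trait Algorithm[TD, M, Q, P] {
  def train(data: TD): M
  def predict(model: M, query: Q): P
}

trait Serving[Q, P] {
  // Combine predictions from several algorithms and apply business logic
  def serve(query: Q, predictions: Seq[P]): P
}

case class Query(user: String, num: Int)
case class Prediction(products: Seq[String])

// Toy algorithm: recommend the globally most purchased products
object Popularity extends Algorithm[Seq[String], Map[String, Int], Query, Prediction] {
  def train(purchases: Seq[String]): Map[String, Int] =
    purchases.groupBy(identity).map { case (p, xs) => (p, xs.size) }
  def predict(model: Map[String, Int], q: Query): Prediction =
    Prediction(model.toSeq.sortBy { case (_, n) => -n }.map(_._1).take(q.num))
}

val model = Popularity.train(Seq("p1", "p2", "p2"))
val top = Popularity.predict(model, Query("user1", 1))
```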
Of note, MLlib has lately made some improvements to its API to address some of these concerns (e.g. the creation of an ML pipeline).
In summary, Prediction.io believes the functions of an engine should be to:
- Train deployable predictive model(s)
- Respond to dynamic queries
- Evaluate the algorithm being used
How to get started?
The best way to get started is to grab one of the templates:
everything you need will already be laid out and set up that way, and the
template can be modified according to your needs.
The whole stack can be installed with a single command. You
can then start and deploy the event server, and update the engine model with
new data.