Thursday, February 26, 2015

Hadoop world / Strata 2015 overview

A few notes about Strata 2015

I have been going to Strata for a few years now, so I am pretty familiar with the Hadoop vendors and offerings on show. Here are a few general thoughts about the event and what I noted.


There are a lot more companies/players in the Big Data space in general. Of note, in addition to the "regulars", there are a lot more niche players, and a few behemoths (HP, Intel, Microsoft) trying to capitalize on Hadoop.

Trending this year

Big data in the cloud

A few companies now offer Hadoop-as-a-service (as well as other frameworks) in the cloud, in addition to IT or application-level features: Altiscale, Qubole, Datameer, etc. Apparently most of them are doing well, and there is enough room in the market to accommodate everyone. I heard that Qubole in particular is doing well.

Separation of concerns/Specialization of Hadoop tools

It seems like vendors offer either a one-stop shop for Hadoop, like Business Intelligence/Analytics tools (Platfora, Pentaho, etc.), with the standard advantages and shortcomings that an off-the-shelf product may imply, or very specialized tools, like data discovery (Tamr), data cleansing (Paxata) or visualization (Zoomdata). Pick your weapon!
Of note: why was Google not there?

Stream processing

More interestingly, batch analytics is becoming commoditized, with a number of tools available to perform this kind of processing. A newer type of application on offer is near-real-time (NRT) stream processing. DataTorrent, RapidMiner, and especially Interana are among these companies. This is to counteract the fact that open source tools like Storm and Spark Streaming are not for the faint of heart to implement.

Data discovery

This is a new offering among startups: the ability to auto-discover your sources of data and manage them automatically. This is essentially what used to be called master data management (MDM) and change data capture (CDC) in the "old" data warehouse world, and it is partially solved by tools like Apache Falcon in the downstream ecosystem of tools. See my post on this.
Instead, these companies (Tamr, Alation, Attivio) offer the ability to expose your data and the relationships within it, through a combination of automation and machine learning tools.

Data Science/Machine Learning

I was stunned by the proliferation of startups around data science: H2O, Dato, Skytree, Dataiku, etc. It seems like there is a lot of redundancy in the space. One company seemingly ahead of the pack: DataRobot, which apparently won some Kaggle competition.

Of note, but you knew that already: Spark is omnipresent.

My personal Awards

Best T-shirt: Datameer, Databricks
Best toys: DataRobot
Biggest booth for the smallest funding in a company: Tamr


  1. Awesome post! Biggest booth for smallest funding in a company should always be a clear sign of a winner! ;-) I saw Tamr at Hadoop world in NYC and the level of interest around their booth was extremely high.

  2. Indeed, Tamr looks good.. Check out also Alation, in the same space - great UI!


