Tuesday, February 4, 2014

IBM Stampede program notes

The IBM Stampede training is the second part of a training on IBM products; one of its main attractions was that it was taught by an actual IBM solutions architect, who gave real-world examples from his past projects. These are some of the notes taken during the training.


It is important to have some kind of data stewardship when dealing with large amounts of data; a lot of companies essentially deal with this in an ad hoc way instead.
There are essentially 3 components at play:
- Hadoop ; data at rest, landed
- Stream ; data in motion
- Data warehouse

IBM offers these 3 components as part of their Big Insights offering.
Also, IBM offers Accelerators that are essentially frameworks for working with specific use cases, for data that is not harmonized together.

In addition, IBM offers Watson, which is able to perform NLP (natural language processing) within a given context.


Hadoop in the context of the Data warehouse (DW)

From a data perspective, Hadoop’s sweet spot is the queryable archive: the cold/unused data offloaded from the expensive DW.
Hadoop is seen as performing DW Augmentation. The DW (typically Netezza) stays; it is only complemented by Hadoop.
IBM talks about the Analytics landing zone as the logical data storage area.
In comparison, this is similar to the Enterprise hub from Cloudera, or the Data lake from Hortonworks.
The landing zone is essentially for raw data (which ends up being stored over time) in addition to modeled data.

Obviously, from a cost perspective, Hadoop is cheaper. It is also used for “untrusted sources”, versus trusted sources in the DW. Hence the data is segmented.
Hadoop is mainly used today to offload cold data that is typically unused.
This data now becomes a queryable archive instead of being stored on tape.
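As a rough illustration of the offloading idea, here is a minimal Python sketch that splits DW partitions into "hot" (stay in the DW) and "cold" (candidates for the Hadoop archive) based on when they were last queried. The `Partition` type, the 365-day threshold, and the sample data are all assumptions made for this example; they were not part of the training.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Partition:
    """Hypothetical description of one DW table partition."""
    name: str
    last_queried: date
    size_gb: float


def split_hot_cold(partitions, today, cold_after_days=365):
    """Return (hot, cold) lists; cold = not queried within the threshold."""
    cutoff = today - timedelta(days=cold_after_days)
    hot = [p for p in partitions if p.last_queried >= cutoff]
    cold = [p for p in partitions if p.last_queried < cutoff]
    return hot, cold


# Made-up sample data.
partitions = [
    Partition("sales_2011", date(2012, 1, 15), 800.0),
    Partition("sales_2012", date(2012, 11, 1), 950.0),
    Partition("sales_2013", date(2014, 1, 30), 1100.0),
]

hot, cold = split_hot_cold(partitions, today=date(2014, 2, 4))
offload_gb = sum(p.size_gb for p in cold)
print("offload to archive:", [p.name for p in cold], offload_gb, "GB")
```

In a real setup the threshold would come from the data steward, and the cold partitions would be copied to the landing zone rather than just listed.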

Delta files also end up being stored in the landing zone. They stay in the low cost platform, for recovery purposes.

Hadoop has a “Wild West” mentality today, the same as what the Data warehouse used to have 10 years ago! For example, in the DW there used to be no systems management and no recovery. This is now the case in Hadoop: e.g. the relatively poor POSIX support of HDFS, the non-existent audit trail, etc.

Data Warehouse

A DW is defined as a central repository used for reporting & data analysis.
Its challenges are:
- It stores structured data.
- It mainly uses batch data.
- It has limited history, due to data volume constraints: thus, it mainly stores aggregated views.

A data warehouse usually follows a set of processes, involving the following:
-       A Data owner
-       A Data steward
-       Data governance
-       Measurements via KPIs
-       Data Lineage to trace the data.

With Hadoop we now talk about DW Augmentation, to leverage all data and get timely insights in a cost-optimized way. Some kind of data federation between the DW and Hadoop is also starting to appear, via tools like Cirro; but since the data is not collocated, you typically want to avoid the latency and data movement this implies, depending on the cardinality of the data.

Use Cases

IBM sees Big Data exploration as 90% of the use cases. A lot of use cases have to do with finding new expected traits, via exhaust data.

Other use cases vary depending on the vertical; an example is Fraud detection with Hadoop at a major Credit card company. In that use case, detailed transactions are aggregated in a fraud model, utilizing a large volume of structured data, with a small set of users; this is NOT like the standard social data analytics use case that creates all the buzz today.

IBM-specific Big Data tools


BigSQL is IBM’s “secret sauce” around querying data in Hadoop, and is a SQL-like high-level language and wrapper to give access to the data in Big Insights.
Compared to Cloudera’s Impala, IBM says it is fully SQL-92 compliant, and is more accurate. That said, BigSQL is still a revision 0 product. It is very effective at what it does because some of the IBM research results went into its design; e.g. the DB2 cost-based optimizers are in there.
It also has a “local table” approach to querying the data, effectively bypassing HDFS by using local storage; this is one of its value-adds. BigSQL has access to data stored in HBase, Hive, or DB2 for that matter.

Also, BigSQL has advantages over Hive: for example, Hive does not support subqueries in the WHERE clause today (only in the FROM clause).
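To make the subquery point concrete, here is a small sketch of a query written with a WHERE-clause subquery, next to the join rewrite one would need on an engine without subquery support. SQLite (through Python's built-in `sqlite3` module) is used only as a stand-in SQL engine; this is not BigSQL or Hive syntax specifically, and the `customers` and `orders` tables are hypothetical.

```python
import sqlite3

# In-memory database with two made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'US'), (3, 'EU');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 2, 75.0), (12, 3, 20.0);
""")

# Version 1: subquery in the WHERE clause.
with_subquery = conn.execute("""
    SELECT id, amount
    FROM orders
    WHERE customer_id IN (SELECT id FROM customers WHERE region = 'EU')
    ORDER BY id
""").fetchall()

# Version 2: the same result expressed as a join, with no subquery.
with_join = conn.execute("""
    SELECT o.id, o.amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE c.region = 'EU'
    ORDER BY o.id
""").fetchall()

print(with_subquery)
print(with_join)
```

Both queries return the same rows here; the subquery form is often closer to how an analyst phrases the question, which is why its absence is felt.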

Data transformations in Big Insights

IBM’s Big Insights integrates Apache’s Pig and Hive, which are the most common tools used by the community to perform transformations.
For text analytics, the Annotation Query Language (AQL) is widely used, via context extractors.
And BigSheets, the Excel-like Big Data tool, generates Pig under the hood; it will also integrate with BigSQL soon.

IBM also attempts to enhance data science tools like R: today R doesn’t work in parallel for the Reduce side of M/R and goes back to a single node. With IBM Big R, the execution of these functions is parallelized; in addition it removes R’s memory limitation.
For Data mining, SPSS and SAS are still the most widely used tools for predictive analytics. Of note, the FDA used to certify drug model predictions only with SAS, but recently opened it up to R for statistical models.

Data explorer

IBM’s Data Explorer is a search tool that provides core indexing, discovery, navigation and search capabilities. The user typically visualizes the results on a portal.
However, the data is on its own server, and usually the size of the index is 2.5-3 times larger than the data.
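For capacity planning, the index rule of thumb can be turned into a quick back-of-the-envelope calculation. The sketch below reads "2.5-3 times larger" as "index size = 2.5 to 3 times the data size", which is an assumption about what was meant; the function name and the 400 GB figure are made up for the example.

```python
def index_size_range_gb(data_gb, low_factor=2.5, high_factor=3.0):
    """Estimate the (low, high) index size in GB for a given data size.

    Assumes index size = factor * data size, with the 2.5-3x factors
    quoted in the training as defaults.
    """
    return data_gb * low_factor, data_gb * high_factor


# Hypothetical corpus of 400 GB to be indexed.
low, high = index_size_range_gb(400)
print(f"plan for {low:.0f} to {high:.0f} GB of index storage")
```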


There are essentially 3 levels of analytics reporting:
-       Descriptive:  essentially what is known as Operational reports, summarizing the data, usually through a certain type of metric (e.g. number of followers). This is akin to looking through your rear-view mirror, without knowing where you are going.
-       Predictive: in-depth analysis of the data via data mining tools, to try and make predictions about the future from the data that you have, based on a set of assumptions.
-       Prescriptive: in addition to giving predictions about the data, a prescriptive analysis recommends courses of action based on actionable data and a feedback system.
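The three levels can be sketched on a toy series of weekly follower counts. This is only an illustration: the data is made up, the "predictive" step is a plain least-squares trend line rather than a real data mining model, and the "prescriptive" step is a single hypothetical rule with an assumed target.

```python
def describe(values):
    """Descriptive: summarize what already happened."""
    return {"total_weeks": len(values), "mean": sum(values) / len(values)}


def predict_next(values):
    """Predictive: extrapolate one step ahead with a least-squares line."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept + slope * n


def recommend(predicted, target):
    """Prescriptive: turn the prediction into a suggested action."""
    if predicted < target:
        return "increase campaign spend"
    return "maintain current plan"


followers = [100, 110, 120, 130]      # made-up weekly follower counts
summary = describe(followers)
forecast = predict_next(followers)
action = recommend(forecast, target=150)  # assumed business target
print(summary, forecast, action)
```

A real prescriptive system would also feed the outcome of each action back into the model, which this sketch leaves out.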

Use case

In the retail vertical, a typical use case has been around how to better induce a sale. Retailers are essentially looking for the shopper’s trigger point.
To deduce this, retailers capture the MAC addresses of mobile devices in range of their routers (no need to be on the Wi-Fi, mobiles broadcast their MAC address!) to find patterns in the routes shoppers take through the aisles.
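A minimal sketch of how such aisle routes could be derived from the captured data, assuming each sighting is a `(mac, timestamp, aisle)` tuple. That record layout, the sample MAC addresses and the aisle names are all hypothetical; the training did not describe the actual data format or processing.

```python
from collections import Counter, defaultdict


def aisle_routes(sightings):
    """Count how many devices followed each ordered sequence of aisles."""
    by_device = defaultdict(list)
    for mac, timestamp, aisle in sightings:
        by_device[mac].append((timestamp, aisle))

    routes = Counter()
    for visits in by_device.values():
        ordered = [aisle for _, aisle in sorted(visits)]
        # Collapse consecutive sightings in the same aisle into one stop.
        route = tuple(a for i, a in enumerate(ordered)
                      if i == 0 or a != ordered[i - 1])
        routes[route] += 1
    return routes


# Made-up sightings: (device MAC, timestamp in seconds, aisle).
sightings = [
    ("aa:bb:cc:00:00:01", 1, "dairy"),
    ("aa:bb:cc:00:00:01", 2, "dairy"),
    ("aa:bb:cc:00:00:01", 3, "bakery"),
    ("aa:bb:cc:00:00:02", 1, "dairy"),
    ("aa:bb:cc:00:00:02", 5, "bakery"),
    ("aa:bb:cc:00:00:03", 2, "produce"),
]

routes = aisle_routes(sightings)
print(routes.most_common(1))
```

The most frequent routes are the candidates for where a shopper's trigger point might sit.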

In the car vertical, IBM has done work with major car manufacturers to pull data in motion from car sensors. Aggregation of all of the data can give some very interesting insights on the typical usage of the cars.

They mentioned that a new car like a Ford Fusion, which has a lot of sensors, will yield about 2 TB of data!

