The IBM Stampede training is the second part of a training series about IBM's Big Data products; one of its main attractions was that it was taught by an actual IBM solutions architect, who gave real-world examples from his past projects. These are some of the notes taken during the training.
Components
It is important to have some kind of data stewardship when dealing with large amounts of data; many companies instead essentially handle this in an ad-hoc way.
There are essentially 3 components at play:
- Hadoop: data at rest, landed
- Streams: data in motion
- Data warehouse
IBM offers these 3 components as part of its BigInsights offering.
IBM also offers Accelerators, which are essentially frameworks for working with specific use cases, for data that has not been harmonized together.
In addition, IBM offers Watson, which is able to perform NLP (natural language processing) within a given context.
Hadoop
Hadoop in the context of the Data warehouse (DW)
Hadoop's sweet spot from a data perspective is the queryable archive: the cold, rarely used data offloaded from the expensive DW.
Hadoop is seen as performing DW Augmentation. The DW (typically Netezza) stays; it is simply complemented by Hadoop.
IBM talks about the Analytics landing zone as the logical data storage area. In comparison, this is similar to Cloudera's Enterprise Data Hub, or Hortonworks' Data Lake.
The landing zone is essentially for raw data (which ends up being stored over time) in addition to modeled data.
From a cost perspective, Hadoop is obviously cheaper. It is also for "untrusted" sources, versus the trusted sources in the DW; hence the data is segmented (a possible layout is sketched below).
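As a minimal sketch of what this segmentation could look like on disk (the zone names, paths, and partitioning scheme below are illustrative assumptions, not an IBM-prescribed convention):

```python
# Hypothetical landing-zone layout: zone names and paths are assumptions for
# illustration, not an IBM-prescribed convention.
LANDING_ZONE = "/data/landing"

ZONES = {
    "raw_untrusted": f"{LANDING_ZONE}/raw",      # as-received data from untrusted sources
    "modeled":       f"{LANDING_ZONE}/modeled",  # harmonized/modeled data
    "archive":       f"{LANDING_ZONE}/archive",  # cold data offloaded from the DW
}

def target_path(zone: str, source: str, ingest_date: str) -> str:
    """Build a partitioned HDFS-style path for an incoming dataset."""
    return f"{ZONES[zone]}/{source}/dt={ingest_date}"

print(target_path("raw_untrusted", "clickstream", "2014-03-01"))
# -> /data/landing/raw/clickstream/dt=2014-03-01
```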
Hadoop is mainly used today to offload cold data that is
typically unused.
This data becomes a queryable archive, instead of being sent to tape.
Delta files also end up being stored in the landing zone. They stay on the low-cost platform for recovery purposes.
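A rough sketch of such an offload policy is below; the retention window, the export_partition() helper and the paths are assumptions for illustration, not the actual BigInsights tooling:

```python
from datetime import date

# Sketch of an offload policy (assumed, not IBM's actual tooling): DW partitions
# older than the retention window go to the Hadoop landing zone instead of tape.
RETENTION_DAYS = 2 * 365

def is_cold(partition_date: date, today: date) -> bool:
    """A partition is 'cold' once it falls outside the DW retention window."""
    return (today - partition_date).days > RETENTION_DAYS

def export_partition(table: str, part: date, hdfs_path: str) -> None:
    # Stand-in for the real unload/copy step (e.g. a DW export followed by an HDFS put).
    print(f"unload {table} partition {part} -> {hdfs_path}")

def offload(partitions, today=None):
    today = today or date.today()
    for table, part in partitions:
        if is_cold(part, today):
            export_partition(table, part, f"/data/landing/archive/{table}/dt={part}")

offload([("sales_fact", date(2011, 6, 30)), ("sales_fact", date(2014, 1, 31))],
        today=date(2014, 3, 1))
```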
Hadoop has a "Wild West" mentality today, the same as the Data warehouse had 10 years ago! For example, in the DW world there used to be no systems management, nor recovery. This is now the case in Hadoop: e.g. the relatively poor POSIX support of HDFS, the non-existent audit trail, etc.
Data Warehouse
The definition of a DW is a central repository used for reporting & data analysis.
Its challenges are:
- It stores structured data.
- It mainly uses batch data.
- It has limited history, due to data volume constraints: thus, it mainly stores aggregated views (a toy illustration follows below).
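As a toy illustration of that last point (the table and numbers are made up), daily detail can be collapsed into a monthly aggregate so that only the smaller view is kept long term:

```python
from collections import defaultdict

# Toy illustration (not from the training): detailed daily rows are collapsed
# into a monthly aggregate view, trading history granularity for volume.
daily_sales = [
    ("2014-01-03", "store_1", 120.0),
    ("2014-01-17", "store_1",  80.0),
    ("2014-02-02", "store_1", 200.0),
]

monthly = defaultdict(float)
for day, store, amount in daily_sales:
    month = day[:7]                       # "YYYY-MM"
    monthly[(month, store)] += amount     # only the aggregate is kept long-term

print(dict(monthly))
# {('2014-01', 'store_1'): 200.0, ('2014-02', 'store_1'): 200.0}
```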
The data-warehousing instance usually follows a set of processes, with the following:
- A data owner
- A data steward
- Data governance
- Measurements via KPIs
- Data lineage to trace the data
With Hadoop we now talk about DW Augmentation: leveraging all the data and getting timely insights, in a cost-optimized way. There is also starting to be some data federation between the DW and Hadoop, via tools like Cirro; but typically you want to avoid data latency and data movement (when the data is not collocated), depending on the cardinality of the data (a crude placement heuristic is sketched below).
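A crude placement heuristic along those lines might look like the sketch below; the threshold and function are assumptions, not something presented in the training:

```python
# Crude placement heuristic (a sketch, with an arbitrary threshold): federate a
# query only when the amount of data that would have to move is small;
# otherwise run it where the bulk of the data already lives.
def plan_query(dw_rows: int, hadoop_rows: int, max_moved_rows: int = 1_000_000) -> str:
    smaller = min(dw_rows, hadoop_rows)
    if smaller <= max_moved_rows:
        return "federate: ship the smaller side to the other system"
    return "run on Hadoop" if hadoop_rows >= dw_rows else "run on the DW"

print(plan_query(dw_rows=5_000_000, hadoop_rows=200_000))      # federate
print(plan_query(dw_rows=5_000_000, hadoop_rows=900_000_000))  # run on Hadoop
```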
Use Cases
IBM sees Big Data exploration as 90% of the use cases. A lot of use cases have to do with finding new, unexpected traits via exhaust data.
Other use cases vary depending on the vertical; an example is fraud detection with Hadoop at a major credit card company. In that use case, detailed transactions are aggregated into a fraud model, utilizing a large volume of structured data, with a small set of users; this is NOT like the standard social data analytics use case that creates all the buzz today.
IBM-specific Big Data tools
BigSQL
BigSQL is IBM's "secret sauce" around querying data in Hadoop: a SQL-like high-level language and wrapper that gives access to the data in BigInsights.
Compared to Cloudera's Impala, IBM says it is fully SQL-92 compliant, and more accurate. That said, BigSQL is still a revision 0 product. It is very effective at what it does because some of IBM's research results went into its design; e.g. the DB2 cost-based optimizer is in there.
It also has a "local table" approach to querying the data, effectively bypassing HDFS by using local storage; this is one of its value-adds. BigSQL can access data stored in HBase, Hive, or even DB2 for that matter.
BigSQL also has advantages over Hive: for example, subqueries are not possible in Hive (see the illustration below).
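For instance, a correlated subquery of the kind shown below is standard SQL-92 and would be accepted by BigSQL, while Hive at the time could not express it directly; the table names and the execute() helper are made up for illustration:

```python
# Illustrative only: the table/column names and the execute() helper are made up;
# the point is the correlated subquery, which SQL-92 engines like BigSQL accept
# but which Hive (at the time of this training) could not express directly.
QUERY = """
SELECT c.customer_id, c.total_spend
FROM customers c
WHERE c.total_spend > (SELECT AVG(o.amount)
                       FROM orders o
                       WHERE o.customer_id = c.customer_id)
"""

def execute(sql: str) -> None:
    # Stand-in for submitting the statement through a BigSQL JDBC/ODBC connection.
    print("submitting:\n", sql)

execute(QUERY)
```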
Data transformations in BigInsights
IBM's BigInsights integrates Apache Pig and Hive, which are the most common tools used by the community to perform transformations.
For text analytics, the Annotation Query Language (AQL) is widely used, via context extractors.
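AQL itself is a declarative language; the snippet below is only a small Python analogue of what a context extractor does (pull a value out of its surrounding text), with a made-up pattern and sentence:

```python
import re

# Not AQL: just a small Python analogue of a context extractor, pulling a dollar
# amount that appears near the word "revenue" out of free text. The pattern and
# sample sentence are made up for illustration.
text = "Q3 revenue came in at $4.2M, up from $3.9M a year earlier."

# Capture an amount only when it is preceded (within a few words) by "revenue".
pattern = re.compile(r"revenue\W+(?:\w+\W+){0,3}?\$(\d+(?:\.\d+)?)([MK]?)", re.IGNORECASE)

match = pattern.search(text)
if match:
    value, unit = match.groups()
    print(f"extracted revenue: ${value}{unit}")   # extracted revenue: $4.2M
```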
BigSheets, the Excel-like Big Data tool, generates Pig under the hood; it will also integrate with BigSQL soon.
IBM also attempts to enhance data science tools like R: today R doesn't work in parallel for the Reduce side of M/R and falls back to a single node. With IBM Big R, the execution of these functions is parallelized (in the spirit of the sketch below); in addition it removes R's memory limitation.
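This is not Big R's API, but the idea can be sketched in plain Python: apply the same function to many data partitions in parallel instead of serially on one node:

```python
from multiprocessing import Pool

# Not Big R's API: a plain Python analogue of the same idea, i.e. applying one
# function to many data partitions in parallel instead of on a single node.
def summarize(partition):
    """Per-partition work that would otherwise run serially on one node."""
    return (len(partition), sum(partition) / len(partition))

if __name__ == "__main__":
    partitions = [[1, 2, 3], [10, 20, 30, 40], [5, 5]]
    with Pool(processes=3) as pool:
        results = pool.map(summarize, partitions)   # one worker per partition
    print(results)  # [(3, 2.0), (4, 25.0), (2, 5.0)]
```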
For data mining, SPSS and SAS are still the most widely used tools for predictive analytics. Of note, the FDA used to certify drug model predictions only with SAS, but has recently opened up to R for statistical models.
Data Explorer
IBM's Data Explorer is a search tool that provides core indexing, discovery, navigation and search capabilities. The user typically visualizes the results on a portal.
However, the data is on its own server, and the index is usually 2.5 to 3 times larger than the data.
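A quick back-of-envelope capacity check using that 2.5x to 3x figure (the 500 GB corpus size is just an example value):

```python
# Back-of-envelope capacity check using the 2.5x-3x index overhead quoted above;
# the 500 GB corpus size is an example value, not from the training.
data_gb = 500
low, high = data_gb * 2.5, data_gb * 3.0
print(f"expect roughly {low:.0f}-{high:.0f} GB of index for {data_gb} GB of data")
# expect roughly 1250-1500 GB of index for 500 GB of data
```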
Analytics
There are essentially 3 levels of analytics reporting (a toy illustration of all three follows below):
- Descriptive: essentially what is known as operational reports, summarizing the data, usually around a certain type of metric (e.g. number of followers). This is akin to looking through your rear-view mirror, without knowing where you are going.
- Predictive: in-depth analysis of the data via data mining tools, to try to make predictions about the future from the data that you have, based on a set of assumptions.
- Prescriptive: in addition to giving predictions about the data, a prescriptive analysis recommends courses of action based on actionable data and a feedback system.
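A toy illustration of the three levels, with made-up numbers:

```python
# Toy illustration of the three levels, with made-up numbers: describe the past,
# predict the next period with a naive linear trend, then prescribe an action.
monthly_signups = [100, 120, 140, 160]          # historical data (illustrative)

# Descriptive: summarize what already happened.
average = sum(monthly_signups) / len(monthly_signups)

# Predictive: naive trend extrapolation (assumes the recent growth continues).
trend = monthly_signups[-1] - monthly_signups[-2]
forecast = monthly_signups[-1] + trend

# Prescriptive: recommend an action based on the prediction and a capacity limit.
capacity = 170
action = "add onboarding staff" if forecast > capacity else "keep current staffing"

print(f"descriptive: average signups = {average:.0f}")
print(f"predictive:  next month forecast = {forecast}")
print(f"prescriptive: {action}")
```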
Use case
In the retail vertical, a typical use case has been around how to better induce a sale. Retailers are essentially looking for the shopper's trigger point.
To deduce this, retailers obtain the MAC addresses of mobile phone devices in range of their routers (no need to be on the Wi-Fi, mobiles broadcast their MAC address!) to find patterns in aisle routes (a sketch of this kind of analysis follows below).
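A sketch of what that analysis could look like (the fields, data, and aisle names are made up): group sightings per device, then count the most common aisle-to-aisle transitions:

```python
from collections import Counter, defaultdict

# Sketch only (made-up fields and data): turn per-device aisle sightings into
# the most common aisle-to-aisle transitions, the kind of pattern retailers
# look for when hunting the shopper's trigger point.
sightings = [  # (mac, timestamp, aisle)
    ("aa:01", 1, "entrance"), ("aa:01", 2, "produce"), ("aa:01", 3, "bakery"),
    ("bb:02", 1, "entrance"), ("bb:02", 2, "produce"), ("bb:02", 3, "checkout"),
]

routes = defaultdict(list)
for mac, ts, aisle in sorted(sightings, key=lambda s: (s[0], s[1])):
    routes[mac].append(aisle)

transitions = Counter()
for path in routes.values():
    transitions.update(zip(path, path[1:]))   # consecutive aisle pairs

print(transitions.most_common(2))
# [(('entrance', 'produce'), 2), (('produce', 'bakery'), 1)]
```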
In the car vertical, IBM has done work with major car manufacturers to pull data in motion from car sensors. Aggregating all of this data can give some very interesting insights into the typical usage of the cars.
They mentioned that a new car like a Ford Fusion, which has a lot of sensors, will yield about 2 TB of data!