Tuesday, January 7, 2014

IBM's Big Insights training notes


IBM’s Big Insights: A primer


In November 2013 I had the opportunity to attend a training session on IBM’s Big Insights. Below are my notes on the product.


What is Big Insights in a nutshell?


Big Insights is IBM’s Big Data platform. It is an all-in-one Big Data infrastructure made up of IBM’s flavor of Hadoop and its ecosystem, proprietary languages for querying the data such as JAQL and AQL, and out-of-the-box connectors and interfaces called accelerators. We’ll review these components in more detail in the sections below.

Big Insights Hadoop infrastructure

Big Insights is built on its own Hadoop infrastructure (independent of vendors like Cloudera). It uses a released, well-tested version of Hadoop, usually a bit behind trunk. It also differs from the Apache version in some ways. Big Insights comes integrated with:
-       GPFS (General Parallel File System, IBM’s alternative to HDFS) for its file system
-       Adaptive MapReduce, an enhanced version of MapReduce that tries to optimize task execution by automatically tuning jobs, including speculative execution and task JVM reuse. MapReduce tasks become aware of the global state of the job they belong to, which helps balance the workload across Map tasks.
-       Zookeeper, HBase, Hive, Pig

Of note is the fact that Big Insights is not bundled with Cloudera’s CDH anymore; IBM has its own version of Hadoop.

New query language: JAQL

Big Insights offers a language called JAQL, a functional language that can interface with all of the Big Insights tools. It provides APIs (or modules) for reaching out to external IBM and third-party tools such as relational databases, indexing services, text analytics, and machine learning. JAQL stands for JSON Query Language, because it is based on JSON. Similar to Pig, Jaql automatically takes care of the complexities of the MapReduce world so that the work is performed efficiently. In addition, it handles deeply nested semi-structured data.
Jaql can be executed either from its own shell, or from within Eclipse.
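
To give a feel for the kind of work Jaql does on nested, semi-structured records, here is a minimal sketch written in Python rather than in Jaql itself. The sample records, the field names and the helper function total_spend_by_user are all made up for illustration; in Jaql the equivalent flatten, filter and group steps would be written as a query, and the MapReduce plumbing would be handled for you.

```python
import json
from collections import defaultdict

# Hypothetical sample data: nested, semi-structured JSON records.
# Note that the "vip" field is present on only one record.
RAW = """
[
  {"user": "ann", "orders": [{"item": "disk", "price": 80},
                             {"item": "ram", "price": 40}]},
  {"user": "bob", "orders": []},
  {"user": "cat", "orders": [{"item": "cpu", "price": 200}], "vip": true}
]
"""


def total_spend_by_user(records, min_price=0):
    """Flatten the nested orders, filter by price, and sum per user."""
    totals = defaultdict(int)
    for rec in records:
        # Fields may be missing in semi-structured data, so default to [].
        for order in rec.get("orders", []):
            if order["price"] >= min_price:
                totals[rec["user"]] += order["price"]
    return dict(totals)


records = json.loads(RAW)
result = total_spend_by_user(records, min_price=50)
print(result)
```

The point of the sketch is only the shape of the pipeline: read JSON, reach into a nested array, filter, then aggregate. On a real cluster each of these steps would run in parallel over many files.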

Big Insights Applications

Big Insights provides an environment for developing and executing applications. A business user can launch existing applications from the Web console, supply any input parameters, and view the results. These applications may be developed using Big Insights’ development tooling, which enables programmers to publish completed applications through the Web console.
The BigInsights Eclipse tools include wizards, code generators, context-sensitive help, and a test environment to simplify your development efforts.
Workflow applications are run by Oozie as a workflow job.

Big Sheets

Big Insights also comes with a spreadsheet-like interface, Big Sheets, that lets business users work with Big Data much as they would use Excel. It presents familiar functions (e.g. Pivot, Union, Intersection) that allow users to gather, filter, combine, explore, and visualize data from various sources. Big Sheets is designed for non-technical professionals who need to rapidly gain insight from huge amounts of data and act on it in a timely manner (Big Sheets first executes work in a simulated environment on sample data). There is no need to understand database schemas or a query language. Big Sheets also conveniently has a built-in visualization module to chart and publish the results.
Another nice thing is that Big Sheets is integrated natively with the other Big Insights components, so it’s easy to navigate between the different tools that Big Insights provides, e.g. create an ETL job in Jaql and export the results to Big Sheets.
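
For readers who have not used these spreadsheet operations, here is a small Python sketch of what Union, Intersection and Pivot mean on two tiny "sheets". This is not Big Sheets code: the sheets, the column names and the three helper functions are assumptions made for illustration, and the exact semantics in Big Sheets may differ (for example in how duplicate rows are treated).

```python
from collections import defaultdict

# Hypothetical sheets: each sheet is a list of rows, each row a dict.
sheet_a = [
    {"region": "east", "product": "disk", "sales": 10},
    {"region": "west", "product": "disk", "sales": 7},
]
sheet_b = [
    {"region": "west", "product": "disk", "sales": 7},
    {"region": "east", "product": "ram", "sales": 3},
]


def _key(row):
    # Rows are compared by their full contents.
    return tuple(sorted(row.items()))


def union(a, b):
    """All distinct rows from both sheets, in first-seen order."""
    seen, out = set(), []
    for row in a + b:
        if _key(row) not in seen:
            seen.add(_key(row))
            out.append(row)
    return out


def intersection(a, b):
    """Rows of sheet a that also appear in sheet b."""
    keys_b = {_key(row) for row in b}
    return [row for row in a if _key(row) in keys_b]


def pivot(rows, group_by, value):
    """Sum one column for each distinct value of another column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_by]] += row[value]
    return dict(totals)


combined = union(sheet_a, sheet_b)
common = intersection(sheet_a, sheet_b)
by_region = pivot(combined, "region", "sales")
print(len(combined), common, by_region)
```

In Big Sheets the user picks these operations from the interface instead of writing code, which is the whole appeal for business users.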


Big Data Accelerators

Big Insights bundles pre-built components that accelerate development for specific use cases. The accelerators generally provide business logic, data processing, and visualization. One example is the Social Data Analytics accelerator, which provides a set of predefined elements such as workbooks and dashboards for analysing social data.

Other Big Data tools

The IBM Big Data platform comprises Big Sheets, but also other tools such as InfoSphere Streams for low-latency data and an MPP (Massively Parallel Processing) database. The wider IBM ecosystem also seems to support Big Data: R is supported in Big Insights, Cognos supports Hive, and Netezza integrates with Streams. These systems offer complementary analytical approaches.

IBM offers a free downloadable virtual machine to play with Big Insights.

Overall a good experience, although one can easily get lost in the sea of products IBM offers. On the other hand, tools like Big Sheets and the Accelerators seem very valuable.