Monday, November 10, 2014

An overview of Apache Spark

            It is a well-known fact that Map Reduce is not good at multi-stage queries, and is a rather cumbersome fit for iterative and interactive workloads, such as interactive queries and some machine learning algorithms. The motivation to create Spark was centered around these areas. Spark's goal is two-fold:

-       Have primitives for an in-memory Map Reduce-like engine to support iterative algorithms.
-       Not be restricted to the Map Reduce model, replacing it with a DAG engine instead of discrete M/R steps. Spark provides a set of higher-level, expressive operators with clean APIs.
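To make the second point concrete, here is a minimal local sketch in plain Python (no Spark installation assumed; `ToyRDD` is a hypothetical stand-in, not Spark's actual API) of what chainable, higher-level operators look like compared to spelling out discrete M/R steps:

```python
from functools import reduce
from itertools import groupby

# A toy, purely local model of Spark-style chainable operators.
class ToyRDD:
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):
        return ToyRDD(f(x) for x in self.data)

    def flatMap(self, f):
        return ToyRDD(y for x in self.data for y in f(x))

    def reduceByKey(self, f):
        # Group (key, value) pairs by key, then fold values per key.
        keyed = sorted(self.data, key=lambda kv: kv[0])
        return ToyRDD(
            (k, reduce(f, (v for _, v in grp)))
            for k, grp in groupby(keyed, key=lambda kv: kv[0])
        )

    def collect(self):
        return self.data

# Word count as one fluent chain of operators, not a series of M/R jobs.
lines = ToyRDD(["spark is fast", "spark is expressive"])
result = dict(
    lines.flatMap(str.split)
         .map(lambda w: (w, 1))
         .reduceByKey(lambda a, b: a + b)
         .collect()
)
```

In real Spark the same chain runs distributed across partitions; the point here is only the shape of the API.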

The philosophy of Spark (read: business model?) is that, unlike many open-source projects, Spark was built and marketed from the start to gain maximum exposure and become popular as a useful, notable distributed framework replacing Map Reduce, as opposed to being a university research project made open-source after the fact with no real support.
In light of this, Spark offers powerful standard libraries, a relational data model, and canned machine learning algorithms, which we'll review in more detail. Another definite advantage is Spark's support across many environments: it is cloud-friendly and runs under several resource managers (YARN, Mesos, or its own standalone manager). One of the biggest advantages Spark offers is a unified API and framework across data-analysis contexts: online, offline, machine learning, and graph.

Spark aims to be compatible with Hadoop's Input and Output format APIs, meaning a user can plug Spark into an existing Hadoop cluster and use it directly on top of the existing data; Spark supports HDFS, but also HBase, S3, Cassandra, etc.

In terms of support, Spark releases are pretty regimented (every 3 months) and are defined by time, not by scope; this is to give predictability to users and a sense of stability to the project, as opposed to Hadoop in its earlier years, and probably also to drive user adoption. Suffice it to say that Apache Spark is the most active project in the Apache foundation as of November 2014, and currently boasts exponential growth. And for good reason: Spark broke the performance record for the sorting benchmark, sorting 100TB in 23 minutes on 200 machines (October 2014), as a mostly network-bound job; comparatively, Hadoop doesn't take advantage of hardware resources as efficiently. It is worth noting that this was not done using only memory (100TB cannot easily be stored in memory) but actually spilled to disk, which shows Spark is also performant using disk.

Map Reduce versus Spark

            In Map Reduce (MR), data is written back to HDFS between iterations. Bear in mind that MR was designed 10+ years ago when memory was expensive. Consider a set of Hive queries on the same data:

HDFS input
read -> query1
read -> query2

-       the ‘read’ portion is considerably slow in MR, due to 3 phases: replication, serialization, and disk I/O. In fact, on average 90% of the time is spent on these phases instead of computing the actual algorithm from the query itself! The same principle applies to some machine learning algorithms, like gradient descent (which is essentially a for-loop descending to a local minimum).
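To illustrate the gradient-descent point, here is a toy 1-D least-squares descent in plain Python (the dataset and learning rate are made up for illustration). The dataset is a small in-memory list standing in for a cached RDD, and every iteration scans all of it, which is exactly the access pattern that makes re-reading from HDFS on each pass so costly:

```python
# Toy gradient descent fitting y ≈ w * x by least squares.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # (x, y) pairs, "cached" once

w = 0.0      # model parameter
lr = 0.05    # learning rate
for _ in range(200):            # the iterative for-loop
    # Each iteration does a full scan of the dataset; with MR this scan
    # would mean re-reading (and re-deserializing) the data every time.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad
# w converges to the least-squares solution sum(x*y) / sum(x*x)
```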

Besides the in-memory primitives and the new operators, Spark’s architecture is often optimized to avoid multiple passes over the data; sometimes a Map + Reduce (group by) + Reduce (sort) is done in one pass in Spark and runs faster than MR, even without using the in-memory paradigm.

Advantages of Spark

-       In practice, it is said that Spark is 10x faster than Hadoop on average.
-       Agility compared to the monolithic aspect of Hadoop: Spark allows rapid changes, thanks to loading the data into memory and interacting with it in a rapid manner. The shell (REPL) is great to test things out.
-       Data scientists & non-data engineers can use Spark through Python.
-       Newer platform with multiple tools like Machine learning, Graph and streaming included, with strong community support.
-       Scala is superior for data processing thanks to its higher level of abstraction. Although Spark supports Java, it is recommended to use Scala in Spark, as a non-functional programming style makes coding less intuitive and lower level; using Scala also makes debugging easier. A combination of an IDE (for API autocompletion and typing) and the REPL (interactive shell) is actually best for efficiency.
-       Databricks makes cluster provisioning very easy.

However Spark is definitely less mature than Hadoop, is more bug-prone, and doesn’t have a good solution for managing and monitoring the system.

Spark architecture

Architecturally, Spark is made up of two concepts:

-       Resilient Distributed Datasets (RDDs): a collection of Java objects spread across the cluster, i.e. a distributed collection/dataset. This gives direct access to a higher-level API. An RDD is split into partitions; each partition must fit on one node, and a node hosts multiple partitions. An RDD is typed like a Scala collection (i.e. RDD[Int]); for example, reading a text file/line input in Spark returns an RDD[String].
-       A DAG execution engine, in a master-slave architecture. The master is called the Driver, and the slaves are the Executors/Workers (two processes). Executors are where computation runs and data is cached; they keep running even when there are no running jobs, which avoids the JVM start-up time paid by MR's task trackers. The downside is that you get a fixed number of Executors, remedied only if you use YARN's Resource Manager to allocate resources dynamically. The Driver controls the program flow and executes the steps.
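A rough sketch of the partitioning idea in plain Python (the `partition` helper is hypothetical, not Spark's actual partitioner): an RDD is split into partitions, and an operation like map runs independently on every partition, as each executor would do locally.

```python
# Split a dataset into round-robin partitions, each of which would
# live on (and must fit on) a single node.
def partition(data, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for i, x in enumerate(data):
        parts[i % num_partitions].append(x)
    return parts

lines = ["a b", "c", "d e f", "g"]   # conceptually an RDD[String]
parts = partition(lines, 2)          # 2 partitions, e.g. on 2 nodes
# A map (here: word count per line) applied per partition, in parallel
# in real Spark; sequentially in this toy version.
lengths = [[len(s.split()) for s in p] for p in parts]
```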

It should be noted that Spark also has an elastic scaling feature for ETL jobs, a la EMR, where you set your configuration to allow a minimum/maximum number of executor instances.

Fault Tolerance

            Fault tolerance is provided by having RDDs track the series of transformations making up the data processing, and recompute lost data in case of failure, through an operation log called the lineage. This is the 'resilient' part of the RDD acronym. It is worth noting that this lineage feature is nothing new: it actually comes from the M/R paradigm; i.e. if a map task fails, the whole Hadoop job doesn't fail, and instead the task is recomputed on a different node. However, in M/R the computation is segregated by job (a data analysis formed of multiple steps (filter, group, etc.) is implemented as a separate M/R job per step), so there is no built-in resiliency across jobs; in Spark the computation is much more fluid, and resiliency works across the data-analysis steps.
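The lineage idea can be sketched as follows in plain Python (a hypothetical `LineageRDD`, not Spark's internals): the RDD remembers the chain of transformations that produced it, so a lost partition can simply be recomputed from the source by replaying that chain.

```python
# Toy model of lineage-based fault tolerance.
class LineageRDD:
    def __init__(self, source, lineage=()):
        self.source = source      # base data (e.g. a file on HDFS)
        self.lineage = lineage    # ordered log of transformations

    def map(self, f):
        return LineageRDD(self.source, self.lineage + (("map", f),))

    def filter(self, p):
        return LineageRDD(self.source, self.lineage + (("filter", p),))

    def compute(self):
        """Replay the lineage from the source: used on first run,
        or to recompute lost data after an executor failure."""
        data = list(self.source)
        for op, fn in self.lineage:
            if op == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

rdd = LineageRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
first = rdd.compute()       # normal execution
recovered = rdd.compute()   # after a "failure": replay the same lineage
```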
Also, the data is replicated in memory twice, in case an Executor fails; however, if replication has not completed before the node fails, data may be lost.
Provisioning Spark on YARN lets the Resource Manager spin up a fail-over proxy for the Spark master (the same applies in Mesos) and offer HA; another way is to set up ZooKeeper for Spark's Driver in standalone mode, to mitigate some of the resiliency problems in case the Driver goes down.

Using memory

            Spark uses memory for fast computation. However, if memory is unavailable, Spark will gracefully spill to disk. The strategy used is Least-Recently-Used: the dataset that has been used least recently is spilled to disk.
Currently the Spark user has to specify which dataset (once processed) should be kept in memory. Automatic, adaptive caching is currently a subject of research and is not possible today.
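The Least-Recently-Used policy described above can be sketched in a few lines of plain Python (a toy cache, not Spark's memory manager): when the memory budget is exceeded, the dataset that was used least recently is "spilled" to disk, here represented by a second dict.

```python
from collections import OrderedDict

# Toy LRU cache with graceful spill-to-disk behavior.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.memory = OrderedDict()   # dataset name -> data (in memory)
        self.disk = {}                # spilled datasets

    def put(self, name, data):
        self.memory[name] = data
        self.memory.move_to_end(name)           # most recently used
        while len(self.memory) > self.capacity:
            # Over budget: spill the least recently used dataset.
            victim, spilled = self.memory.popitem(last=False)
            self.disk[victim] = spilled         # graceful, not an error

    def get(self, name):
        if name in self.memory:
            self.memory.move_to_end(name)       # mark as recently used
            return self.memory[name]
        return self.disk[name]                  # slower path: from disk

cache = LRUCache(capacity=2)
cache.put("ds1", [1, 2])
cache.put("ds2", [3, 4])
cache.get("ds1")            # ds1 becomes most recently used
cache.put("ds3", [5, 6])    # over budget: ds2 is spilled to disk
```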


            Security features like Kerberos can be set up on Spark as long as Spark is used in a YARN configuration.

Spark ecosystem

Spark essentially expanded its ecosystem of tools to provide a one-stop shop for all kinds of analytics. Let’s review these components.

Spark SQL

Spark SQL is a complete (and faster) rewrite of what used to be Shark, a replacement for Hive built as a tool on top of Spark with relational semantics. Spark SQL is implemented to be compatible with Hive, so existing Hive tables can still be used within Spark SQL.
Spark SQL does not cache Hive records as Java objects, which would incur too much overhead. Instead it uses column-oriented storage with primitive types (int, string, etc.), similar to Parquet or ORC, with the same advantages: faster response time by scanning only the needed columns, automatic selection of the best compression algorithm per column, etc.
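The row-versus-column contrast can be shown with a toy example in plain Python (made-up data, not Spark SQL's actual storage format): scanning a single column touches far less data in the columnar layout.

```python
# Row-oriented layout: one record object per row.
rows = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob",   "age": 25},
    {"id": 3, "name": "carol", "age": 35},
]

# Column-oriented layout: one primitive-typed array per column.
columns = {
    "id":   [1, 2, 3],
    "name": ["alice", "bob", "carol"],
    "age":  [30, 25, 35],
}

# SELECT avg(age): row storage must touch every full record ...
avg_row = sum(r["age"] for r in rows) / len(rows)
# ... while column storage scans only the 'age' array.
avg_col = sum(columns["age"]) / len(columns["age"])
```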
The icing on the cake is that Spark SQL code can be mixed with “pure” Spark code, and one can be called from the other; the Spark/Scala console makes this easy.
Spark SQL is compatible with Hive 0.13.

Spark Streaming

Spark was extended to perform stream computations. Spark Streaming works by running a series of small batch jobs over a (configurable) period of time, an approach called micro-batching. The state of the RDDs for these jobs is kept in memory.
The lowest latency in Spark Streaming is on the order of seconds, not less; this usually accommodates 90% of streaming use cases. Storm, on the other hand, can handle discrete events in a flow of steps, akin to a CEP system in terms of processing speed.
Regarding fault tolerance, Spark Streaming offers a write-ahead log for full HA operation. Comparatively, in Storm, if the supervisor fails, the data gets reassigned and replayed, at the cost of having some processing done twice.
Usually streaming frameworks consist of a pipeline of nodes, each maintaining mutable state; however, this state is lost if the node fails. So in Storm, for example, each record is in the worst case processed at least once (possibly multiple times). A remedy in Storm is the Trident API, which works as micro-batching and offers transactions to update state; however, this comes with slower throughput.
Storm also has a lower-level API than Spark Streaming, and it has no built-in concept of look-back aggregation, nor a way to easily combine batch with streaming.

Technically, Spark Streaming offers a new interface, DStream, to deal with streaming, as well as a new operator to work on a timed window: ‘reduceByWindow()’ for incremental computation, taking the window length and the sliding interval as arguments. This can also be run on a per-key basis.
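A toy model of such a windowed reduction in plain Python (a hypothetical helper, not the DStream API; here windows are measured in batches rather than seconds): the stream is a sequence of micro-batches, and the reduction aggregates the last `window_length` batches, advancing by `slide_interval` batches each step.

```python
# Sliding-window reduction over a stream of micro-batches.
def reduce_by_window(batches, window_length, slide_interval, f):
    results = []
    for end in range(window_length, len(batches) + 1, slide_interval):
        window = batches[end - window_length:end]   # last N batches
        flat = [x for batch in window for x in batch]
        acc = flat[0]
        for x in flat[1:]:
            acc = f(acc, x)                          # fold the window
        results.append(acc)
    return results

# Each inner list is one micro-batch (e.g. one second of events).
stream = [[1, 2], [3], [4, 5], [6]]
# Sum over a window of 2 batches, sliding by 1 batch.
sums = reduce_by_window(stream, window_length=2, slide_interval=1,
                        f=lambda a, b: a + b)
```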
One of the advantages of Spark Streaming is code reuse, and intermixing of it with standard Spark code, i.e. the ability to mix batch (offline) with real-time (online) computing.
Some ML algorithms, like K-means, are also provided as libraries that can run online, on streaming data.


GraphX

Spark was extended with GraphX, a graph-processing library.

