Overview of Apache Spark
Origins
It is well known that MapReduce is not good at multi-stage queries, and is a rather cumbersome solution for iterative applications such as interactive queries and some machine learning algorithms. The motivation to create Spark was centered around these areas.
Spark’s goal is two-fold:
- Provide primitives for an in-memory, MapReduce-like engine that supports iterative algorithms.
- Not be restricted to the MapReduce model: replace discrete M/R steps with a DAG execution engine. Spark provides a set of higher-level, expressive operators with clean APIs.
The philosophy of Spark (read: business model?) is that, unlike many other open-source projects, Spark was built and marketed to get maximum exposure and be popularized as a useful and notable distributed framework replacing MapReduce, as opposed to being a university research project open-sourced after the fact with no real support.
In light of this, Spark offers powerful standard libraries, a relational data model, and canned machine learning algorithms that we’ll review in more detail. Another definite advantage is Spark’s unified support across many environments: it is cloud-friendly and works with several resource managers (YARN, Mesos, plus its own standalone manager). One of the biggest advantages Spark has to offer is a unified API and framework across data analysis contexts: online, offline, machine learning, and graph.
Spark aims to be compatible with Hadoop’s InputFormat and OutputFormat APIs, meaning users can plug Spark into their existing Hadoop cluster and data and use Spark directly on top of it; Spark supports HDFS, but also HBase, S3, Cassandra, etc.
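As a minimal sketch of this compatibility (the application name, paths, and bucket are placeholders I made up), reading existing Hadoop data from Spark is a one-liner per source:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical setup; paths and bucket names are placeholders.
val conf = new SparkConf().setAppName("hadoop-compat-demo")
val sc   = new SparkContext(conf)

// Any Hadoop-supported URI scheme goes through the same API.
val fromHdfs = sc.textFile("hdfs:///data/events/part-*")
val fromS3   = sc.textFile("s3n://my-bucket/events/")   // s3n was the common scheme circa 2014

println(fromHdfs.count() + fromS3.count())
```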
In terms of support, Spark releases are pretty regimented (every 3 months) and are defined by time, not by scope; again, this is to give predictability to users and a sense of stability to the project, as opposed to Hadoop in its earlier years, and probably also to drive user adoption of Spark. Suffice it to say that Apache Spark is the most active project in the Apache Software Foundation as of November 2014, and currently boasts exponential growth. And for good reason: Spark broke the performance record for the 100 TB sort benchmark, finishing in 23 minutes on 200 machines (October 2014), as a mostly network-bound job; comparatively, Hadoop does not take advantage of the hardware resources as efficiently. It is worth noting that this was not done using only memory (100 TB cannot easily be stored in memory) but by actually spilling to disk, which shows Spark is also performant on disk.
MapReduce versus Spark
In MapReduce (MR), data is written back to HDFS between iterations. Bear in mind that MR was designed 10+ years ago, when memory was expensive. Consider a set of Hive queries on the same data:
HDFS input
- read -> query1
- read -> query2
The ‘read’ portion is considerably slow in MR, due to three phases: replication, serialization, and disk I/O. In fact, on average 90% of the time is spent on these phases instead of computing the actual algorithm from the query itself! The same principle applies to some machine learning algorithms, like gradient descent (which is essentially a for-loop descending to a local minimum).
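To make the contrast concrete, here is a minimal sketch of the iterative pattern in Spark; the input path, parsing, and update rule are invented for illustration, and the point is the single cache() call that keeps the working set in memory across iterations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("iterative-demo"))

// Parse once, keep the working set in memory; hypothetical path and format (f1,f2,f3,label per line).
val points = sc.textFile("hdfs:///data/points.txt")
  .map(_.split(",").map(_.toDouble))
  .map(arr => (arr.init, arr.last))          // (features, label)
  .cache()                                   // avoid re-reading HDFS every iteration

var w = Array.fill(3)(0.0)                   // 3 features assumed, for illustration
for (i <- 1 to 10) {
  // Each pass reuses the cached partitions instead of hitting disk again.
  val gradient = points.map { case (x, y) =>
    val pred = x.zip(w).map { case (xi, wi) => xi * wi }.sum
    x.map(_ * (pred - y))
  }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
  w = w.zip(gradient).map { case (wi, gi) => wi - 0.1 * gi }
}
```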
Besides the in-memory primitives and the new operators, Spark’s architecture is often optimized to avoid multiple passes over the data; so sometimes a Map + Reduce (group by) + Reduce (sort) is done in one pass in Spark and runs faster than MR, even without using the in-memory paradigm.
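As a rough illustration (the data and pipeline are made up; `sc` is the SparkContext, e.g. from spark-shell), a group-by followed by a sort can be expressed as one Spark program rather than two chained MR jobs writing intermediate results to HDFS:

```scala
// Hypothetical "count then rank" pipeline, scheduled as a single Spark job graph.
val topWords = sc.textFile("hdfs:///data/logs")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                    // the "group by" step
  .sortBy { case (_, count) => -count }  // the "sort" step, part of the same DAG
  .take(20)
```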
Advantages of Spark
- In practice, it is said that Spark is 10x faster than Hadoop on average.
- Agility, compared to the monolithic aspect of Hadoop: Spark allows rapid changes, thanks to loading the data into memory and interacting with it rapidly. The shell (REPL) is great for testing things out (a short session is sketched after this list).
- Data scientists and non-data-engineers can use Spark through Python.
- It is a newer platform with multiple tools included, such as machine learning, graph, and streaming, and with strong community support.
- Scala is superior for data processing thanks to its higher level of abstraction. Although Spark supports Java, it is recommended to use Scala in Spark, as non-functional programming makes coding less intuitive and lower-level. Using Scala also makes debugging easier. A combination of an IDE (for API autocompletion and typing) and the REPL (interactive shell) is actually best for efficiency.
- Databricks makes cluster provisioning very easy.
However, Spark is definitely less mature than Hadoop, is more bug-prone, and doesn’t have a good solution for managing and monitoring the system.
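For example, a quick exploratory session might look like this (the file and filters are hypothetical; spark-shell provides `sc` out of the box):

```scala
// Launched with `spark-shell`.
val lines  = sc.textFile("hdfs:///data/access.log")   // placeholder path
val errors = lines.filter(_.contains("ERROR")).cache()

errors.count()                                   // poke at the data interactively
errors.map(_.split(" ")(0)).distinct().count()   // e.g. distinct client IPs, assuming that layout
```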
Spark architecture
Architecturally, Spark is made up of two concepts:
- Resilient Distributed Datasets (RDDs): a collection of Java objects spread across the cluster, as a distributed collection/dataset. This directly provides a higher-level API. An RDD is split into partitions; each partition must fit on one node, and a node hosts multiple partitions. An RDD is typed like a Scala collection (e.g. RDD[Int]); for example, reading a text file (line input) in Spark returns an RDD[String]. A short sketch follows this list.
- A DAG execution engine, with a master-slave architecture. The master is called the Driver and the slaves are the Executors/Workers (two kinds of processes). Executors are where the computation runs and data is cached; they keep running even when there are no running jobs, which avoids the JVM start-up time paid by MR’s task trackers. The downside is that you get a fixed number of Executors, remedied only if you use YARN’s Resource Manager to allocate resources dynamically. The Driver controls the program flow and drives the steps.
It should be noted that Spark also has an elastic scaling feature for ETL jobs, a la EMR, where you set your configuration to allow a minimum/maximum number of executor instances.
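A minimal sketch of the typed-RDD point above (the path is a placeholder; `sc` is the SparkContext):

```scala
// Reading lines yields a typed RDD; the data is partitioned across executors.
val lines: org.apache.spark.rdd.RDD[String] = sc.textFile("hdfs:///data/book.txt")

val lengths: org.apache.spark.rdd.RDD[Int] = lines.map(_.length)  // RDD[Int]

println(lines.partitions.length)   // number of partitions spread over the cluster
```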
Fault Tolerance
Fault tolerance is provided by having each RDD track the series of transformations that make up the data processing, and recomputing the lost data in case of failure, using an operation log called the lineage. This is the 'resilient' part of the RDD acronym. It is worth noting that this lineage feature is nothing new: it actually comes from the M/R paradigm; i.e. if a map task fails, the whole Hadoop job doesn't fail, but instead the task is recomputed on a different node. However, in M/R the computation is segregated by jobs (a data analysis is made of multiple steps (filter, group, etc.), each implemented by its own M/R job), so there is no resiliency built in across jobs; in Spark the computation is much more fluid, and resiliency works across the data analysis steps.
Also, the data can be replicated twice in memory, in case an Executor fails. However, if the replication is not fully done before the node fails, data may be lost.
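A small sketch of both mechanisms, lineage and in-memory replication (the pipeline and storage-level choice are illustrative; `sc` is the SparkContext):

```scala
import org.apache.spark.storage.StorageLevel

val words = sc.textFile("hdfs:///data/book.txt")   // placeholder path
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)

// The lineage Spark would use to recompute lost partitions:
println(words.toDebugString)

// Keep two in-memory replicas so an Executor failure doesn't force a recompute.
words.persist(StorageLevel.MEMORY_ONLY_2)
```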
Provisioning Spark on YARN allows the Resource Manager to spin up the Spark master fail-over proxy (and the same applies in Mesos) and offer HA; another way is to set up ZooKeeper for Spark’s Driver in standalone mode, to mitigate some of the resiliency problems in case the Driver goes down.
Using memory
Spark uses memory for fast computation. However, if memory is unavailable, Spark will gracefully spill to disk. The strategy used for this is Least Recently Used: the dataset that has been used least recently is spilled to disk.
Currently, the Spark user has to specify which dataset (once processed) should be kept in memory. Automatic, adaptive caching in memory is currently a subject of research and is not possible today.
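A minimal sketch of explicitly marking a processed dataset to be kept in memory (the dataset and path are made up; `sc` is the SparkContext):

```scala
import org.apache.spark.storage.StorageLevel

val cleaned = sc.textFile("hdfs:///data/raw")      // placeholder path
  .filter(_.nonEmpty)
  .map(_.toLowerCase)

// Explicitly mark the processed dataset to be kept in memory.
// Note: an RDD's storage level can only be set once.
cleaned.persist(StorageLevel.MEMORY_ONLY)          // equivalent to cleaned.cache()
// Alternative: StorageLevel.MEMORY_AND_DISK lets partitions that don't fit
// spill to disk instead of being recomputed.

cleaned.count()                                    // first action materializes the cache
```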
Security
Security features like Kerberos can be
set up on Spark as long as Spark is used in a YARN configuration.
Spark ecosystem
Spark essentially expanded its ecosystem of tools to provide a one-stop shop for doing all kinds of analytics. Let’s review these components.
Spark SQL
Spark SQL is a complete (and faster) rewrite of what used to be Shark, a replacement for Hive, as a tool on top of Spark with relational semantics. Spark SQL is implemented to be compatible with Hive, so existing Hive tables can still be used within Spark SQL.
Spark SQL does not cache Hive records as Java objects, which would incur too much overhead. Instead, it uses column-oriented storage with primitive types (int, string, etc.), similar to Parquet or ORC, with the same advantages: faster response time by scanning only the needed columns, automatic selection of the best compression algorithm per column, etc.
The icing on the cake is that Spark SQL code can be mixed with “pure” Spark code, and one can be called from the other. The Spark/Scala console also makes this easy.
Spark SQL is compatible with Hive 0.13.
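A sketch of this mixing, using roughly the Spark 1.x API (the Hive table and column names are hypothetical, and a Hive-enabled Spark build is assumed):

```scala
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Relational side: query an existing Hive table (hypothetical table/columns).
val visits = sqlContext.sql(
  "SELECT page, count(*) AS hits FROM web_logs GROUP BY page")

// "Pure" Spark side: keep going with regular RDD operators on the result rows.
val top = visits
  .map(row => (row.getString(0), row.getLong(1)))
  .sortBy { case (_, hits) => -hits }
  .take(10)

top.foreach(println)
```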
Spark Streaming
Spark was extended to perform stream computations. The way this works is that Spark Streaming runs a series of small batch jobs over a (configurable) period of time, also called micro-batching. The state of the RDDs for these jobs is kept in memory.
The lowest latency in Spark Streaming is on the order of seconds, not less; this usually accommodates 90% of streaming use cases. Storm, on the other hand, can handle discrete events in a flow of steps, akin to a CEP system in terms of processing speed.
Regarding fault tolerance, Spark Streaming offers a write-ahead log for full HA operation. Comparatively, in Storm, if the supervisor fails, the data gets reassigned and replayed, at the cost of having processing done twice.
Usually, streaming frameworks are composed of a pipeline of nodes, where each node maintains a mutable state. However, this state is lost if the node fails. So in Storm, for example, each record in the worst case is processed at least once (it could be multiple times). A remedy to this in Storm is to use the Trident API, which performs micro-batching and offers transactions to update state. However, this comes with lower throughput.
Also, Storm has a lower-level API than Spark Streaming, and it has no built-in concept of look-back aggregation, nor a way to easily combine batch with streaming.
Technically, Spark Streaming offers a new interface, the DStream, to deal with streaming, as well as a new operator to work on a timed window: ‘reduceByWindow()’ for incremental computation, taking the window length and the sliding interval as arguments. This can also be run on a per-key basis.
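A minimal sketch of a windowed DStream computation (the socket source, batch interval, and window sizes are illustrative choices; `sc` is the SparkContext):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Micro-batches of 5 seconds; source is a hypothetical text socket.
val ssc = new StreamingContext(sc, Seconds(5))

val words = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split("\\s+"))

// Sliding count over the last 60 seconds, recomputed every 10 seconds.
val counts = words.map(_ => 1L)
  .reduceByWindow((a, b) => a + b, Seconds(60), Seconds(10))

// The same idea on a per-key basis: word counts over the window.
val wordCounts = words.map((_, 1L))
  .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(60), Seconds(10))

counts.print()
wordCounts.print()
ssc.start()
ssc.awaitTermination()
```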
One of the advantages of Spark Streaming is code reuse and the ability to intermix it with standard Spark code, i.e. to mix batch (offline) with real-time (online) computing. Some ML algorithms, such as K-means, are also provided as libraries so they can be used online.
GraphX
Spark was extended to add a graph-processing library.
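A tiny sketch of the kind of API GraphX exposes (the toy graph is made up; `sc` is the SparkContext):

```scala
import org.apache.spark.graphx.{Edge, Graph}

// A toy graph: vertices carry a name, edges carry a relationship label.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))

val graph = Graph(vertices, edges)

// Built-in algorithms, e.g. PageRank run to a convergence tolerance.
val ranks = graph.pageRank(0.001).vertices
ranks.collect().foreach(println)
```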