I'm trying to get my head around fault tolerance in Spark Streaming in light of the recent changes made to it. Below is my high-level understanding, based on conversations with a colleague.
First, the basics:
Spark Streaming components
Data model
All data is modeled as RDDs, which are built by design with a lineage of deterministic operations, i.e. any re-computation always leads to the same result. This serves essentially the same purpose as Hadoop's fault tolerance for worker failures, though through a different mechanism.
- An RDD is an immutable, deterministically re-computable, distributed dataset in Spark.
- A DStream is an abstraction used in Spark Streaming over RDDs; it is essentially a stream of RDDs. Many of the same APIs apply to DStreams.
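To make the lineage idea concrete, here is a minimal sketch in plain Python. It does not use Spark; `ToyRDD` and its methods are hypothetical names invented for illustration. It assumes the source data can be re-read and that every step is deterministic, which is what makes re-computation safe.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, defined only by its lineage."""
    source: Optional[Tuple] = None               # base data (assumed re-readable)
    parent: Optional["ToyRDD"] = None            # upstream dataset, if derived
    fn: Optional[Callable[[list], list]] = None  # deterministic step applied to parent

    def map(self, f):
        return ToyRDD(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, p):
        return ToyRDD(parent=self, fn=lambda rows: [r for r in rows if p(r)])

    def compute(self) -> list:
        # Nothing is cached here: every call replays the lineage from the source.
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.compute())


events = ToyRDD(source=(1, 2, 3, 4, 5, 6))
evens_squared = events.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.compute()
recomputed = evens_squared.compute()  # e.g. after the first result was lost

# A toy "DStream": one ToyRDD per batch, with the same transformation applied to each.
toy_dstream = [ToyRDD(source=(1, 2, 3)), ToyRDD(source=(4, 5, 6))]
per_batch = [rdd.map(lambda x: x * 10).compute() for rdd in toy_dstream]
```

Because each derived dataset stores only its parent and a deterministic function, a lost result can always be rebuilt by replaying the chain, and the rebuilt result is identical to the original.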
Types of nodes
- Worker node: a slave node that runs the application code on the cluster.
- Driver node: runs the main program of the application. Similar to the Application Master in the Hadoop YARN world, the Driver owns the Spark context and hence all the state of the application.
Main components in a streaming application
- Driver: akin to the master node in a Storm application from a conceptual point of view.
- Receiver: living on a worker node, the Receiver is similar to a spout in Apache Storm and consumes the data from the source; built-in receivers are available out of the box for the common sources.
- Executor: processes the data; similar to a bolt in Apache Storm from a conceptual point of view.
Main steps in a streaming application
There are essentially three steps in a streaming application, so understanding the record processing guarantees (at-least-once, at-most-once, or exactly-once semantics) at each step is essential:
1. Receiving the streaming data
- Depending on the kind of input source, either a reliable or an unreliable receiver is used at this step; e.g. a stream from a file (local or HDFS) is reliable, a Kafka stream is reliable, but data read directly from a socket connection is unreliable.
- In Spark Streaming, data received by any receiver is by default replicated (in memory) to two worker nodes. A reliable receiver sends the acknowledgement to the source only after this replication, so on failure the source resends anything unacknowledged (i.e. an at-least-once scenario). An unreliable receiver sends no acknowledgement, so data that was received but not yet replicated is lost if the receiver fails.
- In the event of a failure of the Driver node, the Spark context is lost, and with it all the past data. The initial remedy is Spark's WAL (write-ahead log) mechanism, but the cleaner way, if the data sender allows for it, is to simply re-use and consume the sender's own WAL instead.
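The difference between reliable and unreliable receivers can be sketched with a toy model in plain Python. This is not Spark's receiver API; `ToySource` and `receive` are hypothetical names, and the source is assumed to keep records until they are acknowledged.

```python
class ToySource:
    """Hypothetical source that keeps records until they are acknowledged."""

    def __init__(self, records):
        self.pending = list(records)

    def poll(self):
        return list(self.pending)

    def ack(self, records):
        self.pending = [r for r in self.pending if r not in records]


def receive(source, replicas, reliable, fail_before_replication=False):
    """Pull one batch from the source and copy it to every replica."""
    batch = source.poll()
    if not reliable:
        # Models a source with no acknowledgement (e.g. a raw socket):
        # once the data is sent, the source no longer holds it.
        source.ack(batch)
    if fail_before_replication:
        return  # the receiver died before anything was stored
    for replica in replicas:
        replica.extend(batch)
    if reliable:
        # Acknowledge only after the data is safely replicated.
        source.ack(batch)


# Reliable receiver: the first attempt fails, the retry gets the same data again.
reliable_source = ToySource(["a", "b"])
reliable_replicas = [[], []]
receive(reliable_source, reliable_replicas, reliable=True, fail_before_replication=True)
receive(reliable_source, reliable_replicas, reliable=True)

# Unreliable receiver: the same failure loses the data for good.
unreliable_source = ToySource(["a", "b"])
unreliable_replicas = [[], []]
receive(unreliable_source, unreliable_replicas, reliable=False, fail_before_replication=True)
receive(unreliable_source, unreliable_replicas, reliable=False)
```

In the reliable case both replicas end up holding `["a", "b"]`; in the unreliable case they stay empty, because nothing is left at the source to replay.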
2. Transforming the data
- At this stage we have a guarantee of exactly-once semantics due to the underlying RDD guarantees; i.e. in case of a worker node failure, the transformation is re-computed on another node where the data is replicated.
3. Outputting the transformed data
- Output operations have at-least-once semantics; that is, the transformed data may get written to an external entity more than once in the event of a worker failure. Additional effort may be necessary to achieve exactly-once semantics. There are two approaches:
- Idempotent updates: multiple attempts always write the same data.
- Transactional updates: all updates are made atomically, so that they are applied exactly once.
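The two approaches can be sketched against a toy in-memory store in plain Python. `ToyStore` and its methods are hypothetical; the batch and partition identifiers are assumed to be stable across retries, and a real external store would need an actual transaction for the check-and-commit step.

```python
class ToyStore:
    """Hypothetical external store, e.g. a key-value table."""

    def __init__(self):
        self.rows = {}
        self.committed_batches = set()

    def idempotent_write(self, batch_id, partition_id, value):
        # The key is derived from (batch, partition), so a retried write
        # overwrites the same row instead of adding a new one.
        self.rows[(batch_id, partition_id)] = value

    def transactional_write(self, batch_id, delta):
        # Check and commit as one step; skip batches that were already applied.
        if batch_id in self.committed_batches:
            return False
        self.rows["total"] = self.rows.get("total", 0) + delta
        self.committed_batches.add(batch_id)
        return True


store = ToyStore()

# The same output attempted twice, as after a worker failure and retry.
store.idempotent_write(batch_id=7, partition_id=0, value="x")
store.idempotent_write(batch_id=7, partition_id=0, value="x")

first_attempt = store.transactional_write(batch_id=7, delta=4)
second_attempt = store.transactional_write(batch_id=7, delta=4)
```

In both cases the retry leaves the store in the same state as a single successful write: one row for the idempotent key, and a total of 4 rather than 8.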
Example
Let's say there is a batch of events, and one of the operations maintains a ‘global count’, i.e. a counter of the total events streamed so far. Suppose that midway through processing the batch, the node doing the processing goes down. What happens now: does the global count reflect the events processed up to the halfway point?
Strictly speaking, for a global count there is a built-in global counter available in Spark which takes care of this problem. But this is just an example; for all situations other than the counter, the lineage of transformations applied to the whole batch of data will remedy this. As mentioned, RDD transformations are deterministically re-computable, which means the re-computation will give the same resultant state. However, if the result also needs to be stored externally, that logic needs to be handled independently.
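The scenario above can be mimicked in plain Python. This is only a sketch, not Spark code: `process_batch` and `naive_process` are hypothetical functions, and the failure is simulated by raising an exception partway through the batch.

```python
def process_batch(prev_count, batch, fail_at=None):
    """Return the new global count; raise midway to mimic a node failure."""
    seen = 0
    for i, _event in enumerate(batch):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("worker lost")
        seen += 1
    # The new state is published only once the whole batch has been processed.
    return prev_count + seen


external_total = {"n": 10}  # stands in for a counter kept outside the job


def naive_process(batch, fail_at=None):
    """Update the external counter per event, with no protection against retries."""
    for i, _event in enumerate(batch):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("worker lost")
        external_total["n"] += 1


batch = ["e1", "e2", "e3", "e4"]

# Recomputed state: the half-finished attempt leaves no trace.
count = 10
try:
    count = process_batch(count, batch, fail_at=2)
except RuntimeError:
    count = process_batch(count, batch)  # re-run the whole batch

# External side effects: the two events counted before the failure are counted again.
try:
    naive_process(batch, fail_at=2)
except RuntimeError:
    naive_process(batch)
```

After the retry, `count` is 14 (10 plus the 4 events), while `external_total["n"]` is 16, because the partial increments from the failed attempt were never undone. That gap is the part that has to be handled independently when results are stored externally.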