Wednesday, June 1, 2016

Spark



Converting a user program into tasks

The Spark driver is responsible for converting a user program into units of physical
execution called tasks. At a high level, all Spark programs follow the same
structure: they create RDDs from some input, derive new RDDs from those
using transformations, and perform actions to collect or save data. A Spark program
implicitly creates a logical directed acyclic graph (DAG) of operations.
When the driver runs, it converts this logical graph into a physical execution
plan.
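To make the lazy, DAG-building structure concrete, here is a minimal toy sketch in plain Python (not Spark's actual implementation; the `ToyRDD` class and its internals are invented for illustration). Transformations only add nodes to a logical graph; nothing executes until an action is called.

```python
# Illustrative sketch only: a toy "RDD" that records transformations lazily,
# mimicking how a Spark program builds a logical DAG that is executed only
# when an action runs. All names here are invented for this example.

class ToyRDD:
    def __init__(self, data, parent=None, transform=None):
        self._data = data            # only set for the input (source) RDD
        self._parent = parent        # parent node in the logical DAG
        self._transform = transform  # function applied to the parent's output

    def map(self, fn):
        # Transformation: returns a new DAG node, computes nothing yet.
        return ToyRDD(None, parent=self,
                      transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return ToyRDD(None, parent=self,
                      transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Action: walk back to the source, then replay the transformations.
        if self._parent is None:
            return list(self._data)
        return self._transform(self._parent.collect())

# Nothing runs until collect() (the action) is called:
rdd = ToyRDD(range(10)).map(lambda x: x * 2).filter(lambda x: x > 10)
print(rdd.collect())  # [12, 14, 16, 18]
```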

Spark performs several optimizations, such as “pipelining” map transformations
together to merge them, and converts the execution graph into a set of stages.
Each stage, in turn, consists of multiple tasks. The tasks are bundled up and prepared
to be sent to the cluster. Tasks are the smallest unit of work in Spark; a
typical user program can launch hundreds or thousands of individual tasks.
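The pipelining optimization mentioned above can be sketched in a few lines of plain Python (a conceptual illustration, not Spark's internals; the `pipeline` helper is invented for this example). Two consecutive maps are fused into one function, so each element flows through both in a single pass rather than materializing an intermediate collection.

```python
# Illustrative sketch only: "pipelining" merges consecutive map
# transformations so each element passes through all of them in one pass,
# avoiding an intermediate collection per step.

def pipeline(*fns):
    """Compose map functions into a single per-element function."""
    def fused(x):
        for fn in fns:
            x = fn(x)
        return x
    return fused

data = [1, 2, 3, 4]

# Unpipelined: two full passes, with an intermediate list in between.
step1 = [x + 1 for x in data]
step2 = [x * 10 for x in step1]

# Pipelined: one pass; each element goes through both functions at once.
fused = pipeline(lambda x: x + 1, lambda x: x * 10)
pipelined = [fused(x) for x in data]

print(step2 == pipelined)  # True: same result, fewer passes over the data
```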

Scheduling tasks on executors

Given a physical execution plan, a Spark driver must coordinate the scheduling
of individual tasks on executors. When executors are started they register themselves
with the driver, so it has a complete view of the application’s executors at
all times. Each executor represents a process capable of running tasks and storing
RDD data.
The Spark driver will look at the current set of executors and try to schedule each
task in an appropriate location, based on data placement. When tasks execute,
they may have a side effect of storing cached data. The driver also tracks the location
of cached data and uses it to schedule future tasks that access that data.
The driver exposes information about the running Spark application through a
web interface, which by default is available at port 4040. For instance, in local
mode, this UI is available at http://localhost:4040.
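The data-placement-aware scheduling described above can be sketched with a toy scheduler (all names here are invented; Spark's real scheduler is far more involved, with delay scheduling and multiple locality levels). Each task carries a list of preferred hosts, e.g. where its cached partition lives, and the scheduler prefers an executor on one of those hosts.

```python
# Illustrative sketch only: a toy scheduler that prefers placing each task
# on an executor whose host already holds the task's cached or input data,
# falling back to any executor otherwise.

def schedule(tasks, executors):
    """tasks: list of (task_id, preferred_hosts); executors: list of hosts.
    Returns {task_id: host} assignments."""
    assignments = {}
    for task_id, preferred in tasks:
        local = [h for h in preferred if h in executors]
        # Prefer a data-local executor; otherwise pick any available one.
        assignments[task_id] = (local[0] if local
                                else executors[task_id % len(executors)])
    return assignments

executors = ["host1", "host2", "host3"]
tasks = [
    (0, ["host2"]),           # cached data lives on host2
    (1, ["host9"]),           # data on a host with no executor -> fall back
    (2, ["host3", "host1"]),  # two replicas; take the first local match
]
print(schedule(tasks, executors))
```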

File Compression

Choosing an output compression codec can have a big impact on future users of the
data. With distributed systems such as Spark, we normally try to read our data
in from multiple machines. To make this possible, each worker needs to be able
to find the start of a new record. Some compression formats make this impossible,
requiring a single node to read in all of the data, which can easily become a
bottleneck. Formats that can be read efficiently from multiple machines are called “splittable.”
Table 5-3 lists the available compression options.
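The effect of splittability on read parallelism can be sketched with a back-of-the-envelope calculation (the `read_tasks` helper and the sizes below are invented for illustration). With a splittable codec such as bzip2, each block of the file can become its own read task; with a non-splittable one such as plain gzip, a single reader must consume the whole file.

```python
# Illustrative sketch only: why splittability matters for read parallelism.
# A splittable format lets each worker start at a block boundary; a
# non-splittable one forces a single reader.

def read_tasks(file_size_bytes, block_size_bytes, splittable):
    """How many parallel read tasks can process one compressed file."""
    if not splittable:
        return 1  # e.g. a plain gzip file: one node must read it all
    # e.g. bzip2 (or LZO with an index): roughly one task per block
    return -(-file_size_bytes // block_size_bytes)  # ceiling division

block = 128 * 1024 * 1024    # a typical 128 MB HDFS block
one_gb = 1024 * 1024 * 1024

print(read_tasks(one_gb, block, splittable=True))   # 8 tasks in parallel
print(read_tasks(one_gb, block, splittable=False))  # 1 task: a bottleneck
```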


