Saturday, September 12, 2015

Spark Program Structure and Execution

Program Execution in Spark

Spark Context - A Spark context object (sc) is the main entry point for Spark functionality. A Spark context can be used to create Resilient Distributed Datasets (RDDs) on a cluster. When running Spark, you start a new Spark application by creating a Spark Context. When the SparkContext is created, it asks the master for some cores to use to do work. The master sets these cores aside. 
(Figure: Spark cluster components)


A Spark program consists of two parts: a driver program and worker programs. The driver program runs on the driver machine, and worker programs run on cluster nodes (under a cluster manager such as YARN, Mesos, or Spark's standalone cluster manager) or in local threads (local mode). Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster.
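As a quick illustration (a minimal sketch, not part of the original example; the application name and master URL are placeholders), a driver program typically creates its SparkContext along these lines:

import org.apache.spark.{SparkConf, SparkContext}

object LogAnalyzer {
  def main(args: Array[String]): Unit = {
    // App name and master are placeholders; on a real cluster the master
    // is usually supplied by spark-submit rather than hard-coded.
    val conf = new SparkConf()
      .setAppName("LogAnalyzer")
      .setMaster("local[*]") // local threads; use a cluster URL for YARN/Mesos/standalone

    // Creating the SparkContext is what asks the cluster manager for cores/executors.
    val sc = new SparkContext(conf)

    // ... create RDDs, apply transformations and actions here ...

    sc.stop()
  }
}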
There are several useful things to note about this architecture:

  1. Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. 
  2. This has the benefit of isolating applications from each other, on both the scheduling side (each driver schedules its own tasks) and executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.
  3. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and these communicate with each other, it is relatively easy to run it even on a cluster manager that also supports other applications (e.g. Mesos/YARN).
  4. The driver program must listen for and accept incoming connections from its executors throughout its lifetime. The driver program must be network addressable from the worker nodes.
  5. Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network. If you’d like to send requests to the cluster remotely, it’s better to open an RPC to the driver and have it submit operations from nearby than to run a driver far away from the worker nodes.

Spark Program Structure
  1. Create RDDs from an external data source, or parallelize a collection in your driver program.
  2. Lazily transform these RDDs into new RDDs.
  3. Cache some of those RDDs for future reuse.
  4. Perform actions to execute parallel computation and produce results.
val lines = sc.textFile("logfile.log") // point 1: create an RDD from an external data source
val errors = lines.filter(_.startsWith("ERROR")) // point 2: lazy transformation into a new RDD
errors.cache() // point 3: cache this RDD for reuse

val sqlError = errors.filter(_.contains("sql")) // perform more transformations as needed
sqlError.count() // point 4: action triggers the parallel computation

val mqError = errors.filter(_.contains("MQ")) // perform more transformations as needed
mqError.count() // point 4: second action reuses the cached errors RDD


The goal of the above program is to identify the errors in the log file and count them by type: SQL errors and MQ errors. When you load the data, it is partitioned into blocks across the cluster. The driver sends the code to the worker nodes to be executed against each partition; in this example, the various transformations and actions are shipped to the workers for execution.
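If you want to confirm how the loaded file was split up, a small sketch like the following can be run from the driver (the partition count here is just illustrative):

// Optionally request a minimum number of partitions when loading the file.
val lines = sc.textFile("logfile.log", minPartitions = 8)

// Inspect how many partitions (blocks) the RDD was split into.
println(lines.partitions.length)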


The executors on each worker perform these tasks: they read the data from HDFS and apply the transformations in parallel.
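For reference, loading the log file directly from HDFS looks like this (a sketch; the namenode host, port, and path are placeholders). Each HDFS block of the file typically becomes one partition handled by an executor:

val lines = sc.textFile("hdfs://namenode:8020/logs/logfile.log")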

After a series of transformations, you can cache the result up to that point in memory. Once the first action completes, the result is sent back to the driver; in this case, the SQL error count is returned to the driver.

For the second action, Spark reuses the cached data, performs the operation on it, and returns the result.
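One way to convince yourself that the second action hits the cache (a small sketch, assuming the errors RDD from the example above) is to check its storage level and lineage:

// Storage level is MEMORY_ONLY once errors.cache() has been called.
println(errors.getStorageLevel)

// toDebugString prints the RDD lineage, including persistence information.
println(errors.toDebugString)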

