Monday, August 7, 2017

Structured Streaming


Pain points of DStream -

  • Issues with processing late data
  • Conversion from RDD to DStream and vice versa. RDDs and DStreams have similar APIs, but moving between them still requires translation.


Structured Streaming - streaming with DataFrames (no DStreams). It is fast, fault-tolerant, exactly-once stateful stream processing.

  • High-level streaming API built on top of DataFrames.
  • Unified streaming, interactive, and batch processing.

The DataFrame planner has been modified to create incremental execution plans. At every trigger interval it generates an incremental execution plan and uses it to read and write the data.

The new model is based on trigger time:

Input - data from the source, treated as an append-only table
Trigger - how frequently to check for new data
Query - operations on the input data (map, filter, reduce)
Result - the final result table, updated at every trigger interval
Output - what is written to the sink after every trigger; the output mode can be append or complete
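
The five terms above can be walked through with a toy simulation. This is only a sketch in plain Python, not Spark code: the function name `run_stream`, the list-of-batches input, and the filter query are all made up for illustration. Each inner list stands for the rows that arrived since the previous trigger.

```python
def run_stream(batches, predicate, output_mode):
    """Toy model of the trigger loop (hypothetical helper, not a Spark API).

    `batches` taken together play the role of the append-only input table.
    """
    result_table = []   # Result: grows as new input rows are processed
    sink_writes = []    # Output: one entry per trigger
    for new_rows in batches:                      # one iteration = one trigger
        # Query: applied incrementally, only to rows that arrived since last trigger
        new_result_rows = [row for row in new_rows if predicate(row)]
        result_table.extend(new_result_rows)
        if output_mode == "append":
            # append: write only the result rows added by this trigger
            sink_writes.append(list(new_result_rows))
        elif output_mode == "complete":
            # complete: rewrite the whole result table on every trigger
            sink_writes.append(list(result_table))
        else:
            raise ValueError("unknown output mode: " + output_mode)
    return sink_writes


def is_even(n):
    return n % 2 == 0


batches = [[1, 2], [3, 4], [6]]
append_writes = run_stream(batches, is_even, "append")
complete_writes = run_stream(batches, is_even, "complete")
```

With append, the sink only ever sees each result row once; with complete, every trigger rewrites everything computed so far.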

https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html


Mistakes in writing Spark programs -

1. Size your executors right (a few things to think about: YARN memory overhead (7%), and the Application Master uses a core and takes the place of one executor)

  • keep 4-5 cores per executor
  • 2-4 GB RAM per core
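
A rough worked example of that arithmetic, as a sketch. The cluster shape (6 nodes, 16 cores and 64 GB each) is made up, and reserving 1 core and 1 GB per node for the OS and daemons is my assumption, not something from the notes above. Treating the 7% overhead as a flat cut of the container is also a simplification.

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, overhead_fraction=0.07):
    """Back-of-the-envelope executor sizing (hypothetical helper)."""
    usable_cores = cores_per_node - 1      # assumption: 1 core left for OS/daemons
    usable_mem_gb = mem_gb_per_node - 1    # assumption: 1 GB left for OS/daemons
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("node too small for the requested cores per executor")
    # The Application Master takes the place of one executor.
    total_executors = executors_per_node * nodes - 1
    container_gb = usable_mem_gb / executors_per_node
    # Simplification: take the YARN overhead out of the container as a flat fraction.
    heap_gb = int(container_gb * (1 - overhead_fraction))
    return {
        "executors": total_executors,
        "cores_per_executor": cores_per_executor,
        "heap_gb": heap_gb,
    }


plan = size_executors(nodes=6, cores_per_node=16, mem_gb_per_node=64)
gb_per_core = plan["heap_gb"] / plan["cores_per_executor"]
```

For this made-up cluster the plan lands inside both rules of thumb: 5 cores per executor and a little under 4 GB per core.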


2. 2 GB limit on a Spark shuffle block - this is a Spark limit, and shuffle still sucks

  • aim for about 128 MB per partition
  • if your number of partitions is close to 2000, bump it to just over 2000 so that Spark switches to its more compressed bookkeeping of shuffle block sizes
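
Both bullets can be folded into one small calculation. A sketch only: `suggest_partitions` is a made-up helper, and reading "close to 2000" as "within 10% below 2000" is my own assumption.

```python
def suggest_partitions(shuffle_bytes,
                       target_partition_bytes=128 * 1024 ** 2,
                       threshold=2000,
                       near_fraction=0.9):
    """Pick a shuffle partition count from the data size (hypothetical helper)."""
    # Ceiling division: enough partitions that each holds about 128 MB.
    count = max(1, -(-shuffle_bytes // target_partition_bytes))
    # Assumption: "close to 2000" means within 10% below the threshold.
    if near_fraction * threshold <= count <= threshold:
        count = threshold + 1
    return count


GB = 1024 ** 3
small_job = suggest_partitions(100 * GB)    # well below the threshold
borderline = suggest_partitions(240 * GB)   # 1920 by size alone, so bumped
large_job = suggest_partitions(1024 * GB)   # already far above the threshold
```

The result would then go wherever the job sets its partition count, for example a `repartition` call or the shuffle partitions setting.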

3. Skew with joins - most of the data goes to one partition; use salting to distribute your keys
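
Salting in miniature, as a sketch in plain Python rather than Spark. All names here are made up for illustration. The large, skewed side gets a random salt appended to each key; the small side is replicated once per salt value so every salted key still finds its match.

```python
import random
from collections import Counter, defaultdict


def salt_large_side(rows, num_salts, rng):
    # Spread each key over `num_salts` sub-keys by attaching a random salt.
    return [((key, rng.randrange(num_salts)), value) for key, value in rows]


def replicate_small_side(rows, num_salts):
    # Every row must exist once per salt value, or salted keys would miss it.
    return [((key, salt), value) for key, value in rows for salt in range(num_salts)]


def hash_join(left, right):
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]


def largest_key_group(rows):
    # Size of the biggest group of rows sharing one join key.
    return max(Counter(key for key, _ in rows).values())


NUM_SALTS = 10
large = [("hot", i) for i in range(900)] + [("cold", i) for i in range(100)]
small = [("hot", "H"), ("cold", "C")]

plain = hash_join(large, small)
salted_large = salt_large_side(large, NUM_SALTS, random.Random(0))
salted = hash_join(salted_large, replicate_small_side(small, NUM_SALTS))
# Drop the salt again so the result has the same shape as the plain join.
unsalted = [(key, lv, rv) for (key, _salt), lv, rv in salted]
```

The join result is unchanged, but the "hot" key no longer lands in a single group. The price is that the small side grows by a factor of `NUM_SALTS`.
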
4. Manage your DAG

  • avoid shuffles where you can (map-side reduction, only send what you have to)
  • use reduceByKey over groupByKey (groupByKey suffers from skew because it is data dependent)
  • use treeReduce (reduces on the executors) over reduce (reduces at the driver)
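
A toy model of why these two choices help, in plain Python. It is a simplified sketch, not how Spark implements either operation, and the helper names are made up. The first pair of functions counts how many records would cross the network; `tree_reduce` merges partial results in levels instead of all at one place.

```python
from functools import reduce
from operator import add


def shuffled_records_group_by_key(partitions):
    # groupByKey style: every (key, value) record is shuffled as-is.
    return sum(len(partition) for partition in partitions)


def shuffled_records_reduce_by_key(partitions):
    # reduceByKey style: values are combined per key inside each partition
    # first, so at most one record per distinct key leaves a partition.
    return sum(len({key for key, _ in partition}) for partition in partitions)


def tree_reduce(partials, op, fan_in=2):
    # Merge partial results level by level, `fan_in` at a time,
    # so no single step has to combine all of them.
    level = list(partials)
    if not level:
        raise ValueError("nothing to reduce")
    while len(level) > 1:
        level = [reduce(op, level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
    return level[0]


partitions = [
    [("a", 1)] * 4 + [("b", 1)],
    [("a", 1)] * 3,
    [("b", 1), ("b", 1)],
]
without_combine = shuffled_records_group_by_key(partitions)
with_combine = shuffled_records_reduce_by_key(partitions)
total = tree_reduce([1, 2, 3, 4, 5], add)
```

The more repeated keys a partition holds, the bigger the gap between the two shuffle counts.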

5. Do shading (Maven) to avoid method-not-found exceptions. You need to use the same library versions that Spark itself uses, otherwise it will throw an exception; alternatively, shade your dependencies with Maven.

6. The driver program runs on a single machine
