Friday, November 9, 2018

Tips to consider before writing Spark programs





1. Size your executors right 



Ideal executor size 
  • Keep 4-5 cores per executor
  • Allocate 2-4 GB of RAM per core
  • Leave headroom for the YARN memory overhead (~7% of executor memory)
  • The Application Master takes 1 core and 1 executor's worth of resources to run

     
Sometimes you may need to run a very large data load, and the configuration above may not be enough; treat these numbers as a starting point and adjust them while sizing your executors. A configuration sketch follows below.
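As a rough sketch, here is how those numbers might translate into a SparkSession configuration. The application name and exact values are assumptions for a node with roughly 16 GB available per executor; on Spark versions before 2.3 the overhead setting is spark.yarn.executor.memoryOverhead instead:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("executor-sizing-example")             // hypothetical app name
    .config("spark.executor.cores", "5")            // 4-5 cores per executor
    .config("spark.executor.memory", "14g")         // ~2-4 GB per core (5 x ~3 GB)
    .config("spark.executor.memoryOverhead", "1g")  // ~7% of executor memory for YARN
    .getOrCreate()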


2. The driver program runs on a single machine 



This can be a real limitation, and for heavy data-load applications you may choose not to use Spark at all and opt for other solutions like Hive; Hive is better for a single, heavy data-load aggregation. If you stay with Spark there are workarounds, but you need to take some additional steps, chiefly keeping large results off the driver, as sketched below. 
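As an illustration of the single-driver bottleneck, the sketch below contrasts pulling results to the driver with keeping the work distributed. The DataFrame df, the column name, and the output path are all hypothetical:

  // collect() materializes every row on the one driver machine:
  val all = df.collect()                     // risks driver OOM on large data

  // Prefer aggregations and writes that stay on the executors:
  df.groupBy("customer_id").count()          // hypothetical column
    .write.parquet("/tmp/customer_counts")   // hypothetical output path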



3. Manage your DAG



  • Avoid shuffles where you can (do map-side reduction; only send what you have to)
  • Use reduceByKey over groupByKey (groupByKey is prone to skew since its cost is data dependent); see the sketch after this list
  • Use treeReduce (reduces on the executors) over reduce (reduces on the driver)
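A minimal sketch of those choices, assuming sc is an existing SparkContext and the toy data stands in for a real pair RDD:

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

  // groupByKey shuffles every value, then sums on the reduce side:
  val countsSlow = pairs.groupByKey().mapValues(_.sum)

  // reduceByKey combines map-side first, so far less data crosses the wire:
  val countsFast = pairs.reduceByKey(_ + _)

  // treeReduce aggregates in rounds on the executors instead of pulling
  // every partition's result straight back to the driver like reduce does:
  val total = pairs.map(_._2).treeReduce(_ + _)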


4. 2 GB limit on a Spark shuffle block - a hard Spark limit, and shuffle still sucks



  • Aim for roughly 128 MB per partition
  • If your number of partitions is close to 2,000, bump it to just over 2,000 so Spark uses compressed shuffle map statuses (see the sketch after this list)
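For illustration, assuming a SparkSession named spark and an RDD named bigRdd, crossing the 2,000-partition threshold switches Spark to a highly compressed map status for shuffle bookkeeping:

  // DataFrame/SQL shuffles: raise the shuffle partition count past 2,000
  spark.conf.set("spark.sql.shuffle.partitions", "2001")

  // RDDs: repartition explicitly
  val repartitioned = bigRdd.repartition(2001)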

5. Skew with joins

Most of the data ends up in one partition; use salting to distribute your keys more evenly, as sketched below.
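A minimal salting sketch, assuming a large skewed DataFrame events, a small DataFrame dims, a join column named key, and an arbitrary salt factor of 10:

  import org.apache.spark.sql.functions._

  // Salt the skewed side: append a random suffix 0-9 to the hot key.
  val salted = events.withColumn("salted_key",
    concat(col("key"), lit("_"), (rand() * 10).cast("int").cast("string")))

  // Replicate the small side once per salt value so every salted key matches.
  val exploded = dims
    .withColumn("salt", explode(array((0 until 10).map(lit): _*)))
    .withColumn("salted_key", concat(col("key"), lit("_"), col("salt").cast("string")))

  val joined = salted.join(exploded, "salted_key")

The random suffix spreads the hot key across 10 partitions, at the cost of duplicating the small side 10 times.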





