Tips to consider before writing a Spark program
1. Size your executors right
Ideal executor size
- Keep 4-5 cores per executor
- Allow 2-4 GB of RAM per core
- Leave headroom for YARN memory overhead (~7% of executor memory)
- Reserve resources for the Application Master: it takes a core and the equivalent of 1 executor to run
Sometimes you may need to run a large data load and the above configuration may not help; in that case, revisit the points above and adjust them for your workload. A sizing sketch follows below.
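As a rough worked example, here is a minimal sketch of how those guidelines might translate into a SparkSession configuration, assuming a hypothetical 10-node cluster with 16 cores and 64 GB of RAM per node. The numbers are illustrative, not a recommendation for your cluster.

```scala
import org.apache.spark.sql.SparkSession

// Assumed node: 16 cores, 64 GB RAM. Reserve 1 core and ~1 GB for OS/daemons,
// leaving 15 cores and ~63 GB. At 5 cores per executor that is 3 executors per
// node, ~21 GB each; taking ~7% off for YARN overhead leaves ~19 GB of heap,
// i.e. close to 4 GB per core. With 10 nodes that is 30 executors, minus 1
// slot kept free for the Application Master.
val spark = SparkSession.builder()
  .appName("executor-sizing-sketch")
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "19g")
  .config("spark.executor.memoryOverhead", "2g")   // ~7% off-heap overhead for YARN
  .config("spark.executor.instances", "29")
  .getOrCreate()
```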
2. The driver program runs on a single machine
This can become a limitation, and for heavy data-load applications you may choose not to use Spark at all and opt for other solutions such as Hive, which is better suited to a single, heavy data-load aggregation. If you still want Spark, there are workarounds, but you need to take some additional steps.
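Where Spark is still the right tool, the usual workaround is to keep the heavy lifting on the executors and never pull a large result into the driver. A minimal sketch, with hypothetical paths and column names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("keep-work-off-the-driver").getOrCreate()

// Hypothetical input path and columns.
val events = spark.read.parquet("hdfs:///data/events")

val dailyTotals = events
  .groupBy("event_date", "event_type")
  .count()

// Risky: collect() pulls the whole result into the single driver JVM.
// val rows = dailyTotals.collect()

// Safer: the executors write the result out in parallel; the driver only coordinates.
dailyTotals.write.mode("overwrite").parquet("hdfs:///output/daily_totals")
```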
3. Manage your DAG
- Avoid shuffles where you can (do map-side reduction; only send across the network what you have to)
- Prefer reduceByKey over groupByKey; groupByKey shuffles every value and is prone to skew, since the load is data dependent (see the sketch after this list)
- Prefer treeReduce (partial reduction on the executors) over reduce (everything reduced on the driver)
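A minimal sketch of the last two points on a toy RDD; the data is made up, but reduceByKey, groupByKey, and treeReduce are standard Spark core operations.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dag-hygiene-sketch").getOrCreate()
val sc = spark.sparkContext

// Toy key/value pairs standing in for real data.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)))

// groupByKey ships every value across the shuffle and sums afterwards;
// a hot key gets all of its values on one reducer (skew).
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so far less data crosses the network.
val viaReduce = pairs.reduceByKey(_ + _)

// treeReduce merges partial results on the executors before the final step,
// instead of sending every partition's result straight to the driver like reduce.
val total = sc.parallelize(1 to 1000000).treeReduce(_ + _, depth = 2)
```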
4. 2 GB limit on a Spark shuffle block (a hard Spark limit; shuffles still hurt)
- Aim for roughly 128 MB per partition
- If your partition count is close to 2000, bump it above 2000 so Spark uses the compressed format for shuffle map statuses (a sketch follows below)
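A minimal sketch of both points, assuming a hypothetical input path; the 2000 figure is the threshold above which Spark switches to a highly compressed representation for tracking shuffle map statuses.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-sizing-sketch").getOrCreate()

// Hypothetical input; the goal is partitions of roughly 128 MB each.
val df = spark.read.parquet("hdfs:///data/large_table")
println(s"current partitions: ${df.rdd.getNumPartitions}")

// If the natural partition count lands near 2000, round it up past the threshold.
val sized = df.repartition(2048)

// Shuffle-heavy DataFrame/SQL operations take their output partition count from here.
spark.conf.set("spark.sql.shuffle.partitions", "2048")
```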