Tips to consider before writing a Spark program
1. Size your executors right
Ideal executor size
- Keep 4-5 cores per executor
- Allow 2-4 GB of RAM per core
- Leave headroom for YARN memory overhead (~7% of executor memory)
- Reserve resources for the Application Master: it takes a core and the equivalent of 1 executor to run
Sometimes you may need to run a large data load and the above configuration may not help; in that case, revisit the points above and adjust them for your workload. A sizing sketch follows below.
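As a rough worked example, here is a minimal sketch of how those guidelines might translate into a SparkSession configuration, assuming a hypothetical 10-node cluster with 16 cores and 64 GB of RAM per node. The numbers are illustrative, not a recommendation for your cluster.

```scala
import org.apache.spark.sql.SparkSession

// Assumed node: 16 cores, 64 GB RAM. Reserve 1 core and ~1 GB for OS/daemons,
// leaving 15 cores and ~63 GB. At 5 cores per executor that is 3 executors per
// node, ~21 GB each; taking ~7% off for YARN overhead leaves ~19 GB of heap,
// i.e. close to 4 GB per core. With 10 nodes that is 30 executors, minus 1
// slot kept free for the Application Master.
val spark = SparkSession.builder()
  .appName("executor-sizing-sketch")
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "19g")
  .config("spark.executor.memoryOverhead", "2g")   // ~7% off-heap overhead for YARN
  .config("spark.executor.instances", "29")
  .getOrCreate()
```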
2. The driver program runs on a single machine
This can become a limitation, and for heavy data-load applications you may choose not to use Spark at all and opt for other solutions such as Hive, which is better suited to a single, heavy data-load aggregation. If you still want Spark, there are workarounds, but you need to take some additional steps.
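Where Spark is still the right tool, the usual workaround is to keep the heavy lifting on the executors and never pull a large result into the driver. A minimal sketch, with hypothetical paths and column names:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("keep-work-off-the-driver").getOrCreate()

// Hypothetical input path and columns.
val events = spark.read.parquet("hdfs:///data/events")

val dailyTotals = events
  .groupBy("event_date", "event_type")
  .count()

// Risky: collect() pulls the whole result into the single driver JVM.
// val rows = dailyTotals.collect()

// Safer: the executors write the result out in parallel; the driver only coordinates.
dailyTotals.write.mode("overwrite").parquet("hdfs:///output/daily_totals")
```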
3. Manage your DAG
- Avoid shuffles where you can (do map-side reduction; only send across the network what you have to)
- Prefer reduceByKey over groupByKey; groupByKey shuffles every value and is prone to skew, since the load is data dependent (see the sketch after this list)
- Prefer treeReduce (partial reduction on the executors) over reduce (everything reduced on the driver)
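A minimal sketch of the last two points on a toy RDD; the data is made up, but reduceByKey, groupByKey, and treeReduce are standard Spark core operations.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dag-hygiene-sketch").getOrCreate()
val sc = spark.sparkContext

// Toy key/value pairs standing in for real data.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)))

// groupByKey ships every value across the shuffle and sums afterwards;
// a hot key gets all of its values on one reducer (skew).
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so far less data crosses the network.
val viaReduce = pairs.reduceByKey(_ + _)

// treeReduce merges partial results on the executors before the final step,
// instead of sending every partition's result straight to the driver like reduce.
val total = sc.parallelize(1 to 1000000).treeReduce(_ + _, depth = 2)
```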
4. 2 GB limit on a Spark shuffle block (a hard Spark limit; shuffles still hurt)
- Aim for roughly 128 MB per partition
- If your partition count is close to 2000, bump it above 2000 so Spark uses the compressed format for shuffle map statuses (a sketch follows below)
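A minimal sketch of both points, assuming a hypothetical input path; the 2000 figure is the threshold above which Spark switches to a highly compressed representation for tracking shuffle map statuses.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-sizing-sketch").getOrCreate()

// Hypothetical input; the goal is partitions of roughly 128 MB each.
val df = spark.read.parquet("hdfs:///data/large_table")
println(s"current partitions: ${df.rdd.getNumPartitions}")

// If the natural partition count lands near 2000, round it up past the threshold.
val sized = df.repartition(2048)

// Shuffle-heavy DataFrame/SQL operations take their output partition count from here.
spark.conf.set("spark.sql.shuffle.partitions", "2048")
```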