Tuesday, July 16, 2024

Spark - Shuffle Optimisation:

  • spark.sql.shuffle.partitions - the number of partitions used when shuffling data for joins and aggregations. The default is 200 (a partition count, not a size); tune it so each shuffle partition works out to roughly 100-200 MB of data. See the configuration sketch after this list.
  • spark.serializer - set this to org.apache.spark.serializer.KryoSerializer; Kryo serialization and deserialization are faster and more compact than the default Java serializer.
  • Reduce Disk I/O - spark.shuffle.compress controls whether the engine compresses shuffle map outputs. Default value is “true”.
  • spark.shuffle.spill.compress - whether to compress intermediate shuffle spill files. Default value is “true”.
  • spark.io.compression.codec - the codec used to compress this shuffle data; the default is lz4 in current Spark releases.
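
All of these can be applied when the SparkSession is built. The sketch below is a minimal illustration in Scala: the partition count of 400 is a hypothetical value for a mid-sized job, not a recommendation, while the config keys themselves are standard Spark settings.

  import org.apache.spark.sql.SparkSession

  // Illustrative shuffle tuning; adjust values per workload and cluster.
  val spark = SparkSession.builder()
    .appName("shuffle-tuning-sketch")
    // Number of shuffle partitions for joins/aggregations (default 200).
    .config("spark.sql.shuffle.partitions", "400")
    // Kryo is faster and more compact than Java serialization.
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Compress shuffle map outputs and spill files (both default to true).
    .config("spark.shuffle.compress", "true")
    .config("spark.shuffle.spill.compress", "true")
    .getOrCreate()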

For smaller datasets, Snappy generally works well, but the emerging Zstandard (zstd) codec, developed at Facebook, often surpasses Snappy’s performance, compressing better at comparable speed.
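
Here is a minimal sketch of switching the codec to zstd, assuming Spark 2.3 or later (where “zstd” is an accepted value for spark.io.compression.codec); the compression level shown is Spark’s default:

  import org.apache.spark.sql.SparkSession

  // Use Zstandard for shuffle, spill, and broadcast block compression.
  val spark = SparkSession.builder()
    .appName("zstd-codec-sketch")
    .config("spark.io.compression.codec", "zstd")
    // zstd compression level; 1 is Spark's default (higher = smaller but slower).
    .config("spark.io.compression.zstd.level", "1")
    .getOrCreate()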

Optimize Spark’s in-memory computation: Spark uses executor memory to store intermediate results during shuffling. Adjust this by tuning the spark.memory.fraction parameter. By default it is set to 0.6, meaning 60% of the usable executor heap (the heap minus a roughly 300 MB reserve) is shared between execution and storage; the remaining 40% is left for user data structures and Spark’s internal metadata. Within that unified region, spark.memory.storageFraction (default 0.5) controls how much cached data is protected from eviction by execution.
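
Both knobs can be set at session startup, as sketched below; the 0.7/0.4 values are hypothetical, chosen only to illustrate shifting the boundary toward execution:

  import org.apache.spark.sql.SparkSession

  // Illustrative memory tuning; measure spill and GC behaviour before changing defaults.
  val spark = SparkSession.builder()
    .appName("memory-tuning-sketch")
    // Fraction of usable heap shared by execution and storage (default 0.6).
    .config("spark.memory.fraction", "0.7")
    // Portion of that region protected for cached blocks (default 0.5).
    .config("spark.memory.storageFraction", "0.4")
    .getOrCreate()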

