Optimizing Spark jobs is crucial for performance. Here’s how to use partitioning, bucketing, broadcast joins, and caching effectively.

Partitioning

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Optimization").getOrCreate()

# Assume df has already been loaded, e.g. df = spark.read.parquet("path/to/data")

# Repartition: full shuffle into 10 partitions, hash-partitioned by column_name
df = df.repartition(10, "column_name")

# Coalesce: reduce to 5 partitions without a full shuffle (it cannot increase the count)
df = df.coalesce(5)
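
To verify that a repartition or coalesce had the intended effect, you can inspect the partition layout directly. A minimal sketch, assuming df is the DataFrame from above:

# Number of partitions backing the DataFrame
print(df.rdd.getNumPartitions())

# Rough row distribution across partitions (helps spot skew)
from pyspark.sql.functions import spark_partition_id
df.groupBy(spark_partition_id().alias("partition")).count().show()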

Bucketing

# Bucketing pre-shuffles and pre-sorts the data on disk, so later joins
# and aggregations on bucket_column can skip the shuffle
df.write \
    .bucketBy(10, "bucket_column") \
    .sortBy("sort_column") \
    .saveAsTable("bucketed_table")
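
Bucketing pays off when the table is read back and joined on the bucket column. A minimal sketch, assuming a second, hypothetical table orders_bucketed was written the same way with the same bucket count and column:

# Both tables were written with bucketBy(10, "bucket_column"), so Spark can
# plan the join without shuffling either side
left = spark.table("bucketed_table")
right = spark.table("orders_bucketed")  # hypothetical second bucketed table

joined = left.join(right, "bucket_column")
joined.explain()  # the plan should show no Exchange on the join key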

Broadcast Joins

from pyspark.sql.functions import broadcast

# Ship small_df to every executor so large_df is never shuffled for the join
result = large_df.join(broadcast(small_df), "key")
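
Spark also broadcasts automatically when it estimates a table is below a size threshold. A minimal sketch of tuning that behavior, assuming the default 10 MB threshold is too low for your dimension tables:

# Raise the automatic broadcast threshold to ~50 MB (value is in bytes);
# set it to -1 to disable automatic broadcasting entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# The explicit broadcast(small_df) hint above still forces a broadcast
# regardless of this threshold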

Caching

# cache() is shorthand for persist() with the default storage level
# (MEMORY_AND_DISK for DataFrames); use one or the other, not both
df.cache()
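
If you need control over where the cached data lives, persist() accepts an explicit storage level. A minimal sketch:

from pyspark import StorageLevel

# Keep the data in memory only, without spilling to disk
df.persist(StorageLevel.MEMORY_ONLY)

df.count()   # the first action materializes the cache
df.count()   # subsequent actions reuse it

df.unpersist()  # release the cached blocks when you are done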

Best Practices

  1. Partition on the columns you filter and join by, and avoid creating many tiny partitions
  2. Use bucketing for tables that are repeatedly joined on the same key
  3. Broadcast tables small enough to fit comfortably in executor memory
  4. Cache DataFrames that are reused across multiple actions, and unpersist them afterwards
  5. Monitor jobs in the Spark UI to catch skew, spills, and oversized shuffles (see the sketch below)
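
A quick way to sanity-check several of these practices from code, as a minimal sketch assuming the df and result DataFrames from the earlier examples:

# Inspect the physical plan: look for BroadcastHashJoin vs. SortMergeJoin
# and for unnecessary Exchange (shuffle) operators
result.explain()

# Tune the number of shuffle partitions (default is 200); too many creates
# lots of tiny tasks, too few creates oversized partitions
spark.conf.set("spark.sql.shuffle.partitions", 64)

# The Spark UI (usually http://<driver-host>:4040) shows per-task skew and spills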

Conclusion

Partitioning, bucketing, broadcast joins, and caching cover most of the easy wins in Spark tuning; measure in the Spark UI before and after each change to confirm the gains. ⚡