Optimizing Spark jobs is crucial for performance. Here's how to use partitioning, bucketing, broadcast joins, and caching effectively in PySpark.
## Partitioning
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Optimization").getOrCreate()

# df is assumed to be an existing DataFrame, e.g. spark.read.parquet(...)
# repartition() does a full shuffle into 10 partitions keyed by column_name
df = df.repartition(10, "column_name")
# coalesce() merges partitions to reduce their count without a full shuffle
df = df.coalesce(5)
```
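Repartitioning controls the in-memory layout; when writing results out, you can also partition on disk so downstream reads prune whole directories. A minimal sketch, assuming a hypothetical `event_date` column and output path:

```python
# Check how many in-memory partitions the DataFrame currently has
print(df.rdd.getNumPartitions())

# Hypothetical example: partition the output by a date column so queries
# that filter on event_date only read the matching directories
df.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events_by_date")
```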
## Bucketing
```python
# Bucketing pre-shuffles rows into a fixed number of buckets by key, so later
# joins and aggregations on bucket_column can skip the shuffle entirely
df.write \
    .bucketBy(10, "bucket_column") \
    .sortBy("sort_column") \
    .saveAsTable("bucketed_table")
```
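The payoff comes when the bucketed table is read back and joined on the bucket column. Here's a sketch under the assumption that a second table, `other_bucketed_table`, was written with the same bucket count and key (both names are hypothetical):

```python
orders = spark.table("bucketed_table")        # bucketed on bucket_column
lookup = spark.table("other_bucketed_table")  # assumed: same bucket spec

joined = orders.join(lookup, "bucket_column")
# With matching bucket specs, the physical plan should show no Exchange
# (shuffle) on either side of the join
joined.explain()
```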
## Broadcast Joins
```python
from pyspark.sql.functions import broadcast

# Send small_df to every executor so the large side joins without a shuffle
result = large_df.join(broadcast(small_df), "key")
```
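Spark also broadcasts automatically when the smaller side is under `spark.sql.autoBroadcastJoinThreshold` (10 MB by default). A quick sketch of raising that threshold and confirming the strategy; the 50 MB figure is just an illustrative value:

```python
# Raise the auto-broadcast threshold (in bytes); tables estimated below this
# size are broadcast without an explicit broadcast() hint
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# The physical plan should show BroadcastHashJoin rather than SortMergeJoin
result.explain()
```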
## Caching
```python
# cache() is shorthand for persist() with the default MEMORY_AND_DISK level;
# call one or the other, not both, on the same DataFrame
df.cache()
df.persist()
```
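Caching is lazy: nothing is stored until an action runs. A minimal sketch of materializing a cached subset, reusing it, and releasing it (the `status` and `country` columns and the filter are hypothetical):

```python
from pyspark import StorageLevel

# Persist a reused subset; the storage level is chosen explicitly here
hot = df.filter(df["status"] == "active").persist(StorageLevel.MEMORY_AND_DISK)
hot.count()                            # first action materializes the cache
hot.groupBy("country").count().show()  # subsequent actions reuse cached data
hot.unpersist()                        # free executor memory when finished
```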
## Best Practices
- Partition on columns you filter or join on, and avoid creating many tiny partitions
- Bucket large tables on their join keys to avoid repeated shuffles
- Broadcast small lookup tables instead of shuffling the large side
- Cache DataFrames that are reused across multiple actions, and unpersist them afterwards
- Monitor jobs in the Spark UI and with explain() to confirm shuffles are actually reduced (see the sketch below)
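For the monitoring point above, a small sketch of the hooks Spark exposes out of the box (the UI URL depends on where the driver runs):

```python
# The Spark UI (stages, shuffle read/write, storage tab) is the first stop
print(spark.sparkContext.uiWebUrl)

# explain() shows whether shuffles and broadcasts happen where you expect;
# the mode argument is available from Spark 3.0 onward
df.explain(mode="formatted")
```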
## Conclusion
Combine these techniques, measure as you go, and your Spark jobs will run faster and cheaper! ⚡