๐—›๐—ผ๐˜„ ๐——๐—ผ ๐—ฌ๐—ผ๐˜‚ ๐—›๐—ฎ๐—ป๐—ฑ๐—น๐—ฒ ๐—ฆ๐—ธ๐—ฒ๐˜„๐—ฒ๐—ฑ ๐——๐—ฎ๐˜๐—ฎ ๐—ถ๐—ป ๐—ฃ๐˜†๐—ฆ๐—ฝ๐—ฎ๐—ฟ๐—ธ?

This is a critical PySpark interview question! Here's the breakdown:

โœ… ๐—ช๐—ต๐—ฎ๐˜ ๐—ถ๐˜€ ๐—ฆ๐—ธ๐—ฒ๐˜„๐—ฒ๐—ฑ ๐——๐—ฎ๐˜๐—ฎ?
Aย skewed partitionย in Sparkย occurs when the data is distributed unevenly across the available partitions, causing some partitions to be significantly larger than others. This imbalance creates performance bottlenecks because the tasks processing the large partitions, known as “straggler tasks,” take much longer to complete, while the tasks for smaller partitions finish quickly and remain idle.ย 

Causes of Skewed Partitions

Skew typically arises during shuffle-intensive operations like joins or aggregations, where all rows with the same key must be shuffled into the same partition.

Dominant Keys: The most common cause is when a few specific keys in a dataset appear much more frequently than others (e.g., a customer ID in an orders table where one customer accounts for 90% of the orders).

Poor Partitioning Strategy: Using a partitioning key that naturally has uneven data distribution (like a country column where 90% of records are from one country).

Data Quality Issues: A high percentage of null or default values in a key column can cause all of those records to end up in the same partition.

Consequences of Data Skewness in Spark

  • Slow Performance: The overall job is limited by the time it takes for the slowest straggler tasks to finish.
  • Resource Inefficiency: Most executors sit idle, waiting for the few overloaded ones to complete their work.
  • Out-of-Memory (OOM) Errors: The large partitions may exceed the memory capacity of a single executor, causing the job to fail or spill data to disk, which is much slower.

Detection of Data Skewness in Spark

Skew can be detected in the Spark UI by observing:

  • Uneven Task Durations: Some tasks take significantly longer (minutes vs. seconds) within the same stage.
  • Large Shuffle Read Sizes: A few tasks show disproportionately larger "Shuffle Read Size" metrics compared to the median.


โœ… ๐—›๐—ผ๐˜„ ๐˜๐—ผ ๐—›๐—ฎ๐—ป๐—ฑ๐—น๐—ฒ ๐—ฆ๐—ธ๐—ฒ๐˜„๐—ฒ๐—ฑ ๐——๐—ฎ๐˜๐—ฎ:ย 
1. ๐—ฆ๐—ฎ๐—น๐˜๐—ถ๐—ป๐—ด: Add a random prefix to keys to distribute data more evenly.ย 
2. **๐—–๐˜‚๐˜€๐˜๐—ผ๐—บ ๐—ฃ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป๐—ถ๐—ป๐—ด**: Manually repartition the data based on a more balanced key or use a custom partitioner to ensure even distribution.ย 
3. ๐—•๐—ฟ๐—ผ๐—ฎ๐—ฑ๐—ฐ๐—ฎ๐˜€๐˜ ๐—๐—ผ๐—ถ๐—ป๐˜€: If one of the datasets is small enough (under 10MB or a configurable threshold), broadcasting it to all executors avoids the shuffle operation entirely.
4. ๐—ฅ๐—ฒ๐—ฝ๐—ฎ๐—ฟ๐˜๐—ถ๐˜๐—ถ๐—ผ๐—ป๐—ถ๐—ป๐—ด: Use ๐š›๐šŽ๐š™๐šŠ๐š›๐š๐š’๐š๐š’๐š˜๐š—() to increase the number of partitions and reduce skew.ย 
5. Adaptive Query Execution (AQE):ย Spark 3.x and later can dynamically handle skewed joins by splitting large partitions into smaller ones at runtime.

Example (Salting):

```python
from pyspark.sql.functions import col, concat, lit, rand

# Add a random salt (0-9) to the key
df_salted = df.withColumn(
    "salted_key",
    concat(col("key"), lit("_"), (rand() * 10).cast("int"))
)

# Perform operations on the salted key
result = df_salted.groupBy("salted_key").count()
```


๐—ฃ๐—ฟ๐—ผ ๐—ง๐—ถ๐—ฝ: Monitor your Spark UI to identify skewed partitions and apply the right strategy!ย 

What's your go-to method for handling skewed data? Share your tips below! 👇

#PySpark #DataEngineering #InterviewPrep #BigData #TechTips
