What Are Accumulators, and How Do They Work?
This is one of the most frequently asked PySpark interview questions! Here’s the breakdown:
What Are Accumulators?
- Accumulators are shared variables used to aggregate values across tasks in a distributed system.
- They are primarily used for counters or sums (e.g., counting errors or tracking metrics).
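For example, here is a minimal sketch of a sum-style accumulator (the numbers and variable names are made up for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SumAccumulator").getOrCreate()
sc = spark.sparkContext

# Accumulator used as a running sum (floats work the same way as ints)
total_sales = sc.accumulator(0.0)

# Each task adds its values; the driver reads the total afterwards
sc.parallelize([10.5, 20.0, 3.25]).foreach(lambda amount: total_sales.add(amount))
print("Total sales:", total_sales.value)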
How Do They Work?
- Only the driver program can read the accumulator’s value.
- Worker nodes can add to the accumulator, but they cannot read its value.
Example:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AccumulatorExample").getOrCreate()
df = spark.createDataFrame([("a",), ("b",), (None,)], ["value"])  # sample data with a null row

# Initialize an accumulator on the driver
error_counter = spark.sparkContext.accumulator(0)

# Define a function that adds to the accumulator on the workers
def count_errors(row):
    if row["value"] is None:
        error_counter.add(1)

# Apply the function to each row (foreach is an action, so the updates actually run)
df.foreach(count_errors)

# Only the driver can read the accumulated value
print("Total errors:", error_counter.value)
Pro Tip: Use accumulators for debugging or tracking metrics during distributed processing. Keep in mind that Spark only guarantees accumulator updates made inside actions (like foreach) are applied once per task; updates made inside transformations can be re-applied if a task is retried, so treat those counts as approximate.
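As a quick illustration of that tip, here is a hedged sketch (the column and variable names are illustrative) that tracks two metrics in a single pass over a DataFrame:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MetricsAccumulators").getOrCreate()
sc = spark.sparkContext

rows_seen = sc.accumulator(0)
null_values = sc.accumulator(0)

df = spark.createDataFrame([("a",), (None,), ("c",)], ["value"])

def track(row):
    # Runs on the executors; each task only adds, never reads
    rows_seen.add(1)
    if row["value"] is None:
        null_values.add(1)

df.foreach(track)  # foreach is an action, so each task's updates are counted once

# Back on the driver, read both metrics
print("Rows seen:", rows_seen.value, "| Null values:", null_values.value)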