What Are Accumulators, and How Do They Work?

This is one of the most frequently asked PySpark interview questions. Here's the breakdown:

What Are Accumulators?

  • Accumulators are shared variables used to aggregate values across tasks in a distributed system.
  • They are primarily used for counters or sums (e.g., counting errors or tracking metrics).

How Do They Work?

  • Only the driver program can read the accumulator’s value.
  • Worker nodes can add to the accumulator, but they cannot read its value; if a task tries, PySpark raises an error (see the sketch below).
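
As a quick illustration, here is a minimal sketch (names are hypothetical) of the read restriction. The exact error type and message vary by Spark version, but the job fails because .value is accessed inside a task:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-read-restriction").getOrCreate()
sc = spark.sparkContext

acc = sc.accumulator(0)

def try_to_read(x):
    acc.add(x)        # fine: tasks may add to the accumulator
    return acc.value  # fails: the value can only be read on the driver

# Raises an error at runtime because .value is accessed inside a task
sc.parallelize([1, 2, 3]).map(try_to_read).collect()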

Example (made self-contained with a small sample DataFrame):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()

# Sample DataFrame whose "value" column contains two None entries
df = spark.createDataFrame([(1,), (None,), (3,), (None,)], ["value"])

# Initialize an accumulator on the driver
error_counter = spark.sparkContext.accumulator(0)

# Define a function that adds to the accumulator for each None value
def count_errors(row):
    if row["value"] is None:
        error_counter.add(1)

# Apply the function to the DataFrame
df.foreach(count_errors)

# Check the result on the driver
print("Total errors:", error_counter.value)  # Total errors: 2

Pro Tip: Use accumulators for debugging or tracking side metrics during distributed processing, not for results your job's correctness depends on. Spark only guarantees exactly-once accumulator updates inside actions; updates made inside transformations can be applied more than once if a stage is recomputed or a task is retried.
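
The sketch below (names are hypothetical, and the RDD is deliberately left uncached) shows the two classic pitfalls that follow from this: nothing is counted until an action runs, and recomputing the transformation applies the updates again:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-pitfall").getOrCreate()
sc = spark.sparkContext

none_counter = sc.accumulator(0)
rdd = sc.parallelize([1, None, 3, None])

def mark_none(x):
    if x is None:
        none_counter.add(1)
    return x

mapped = rdd.map(mark_none)  # transformation: nothing runs yet

print(none_counter.value)    # 0, because Spark is lazy; no tasks have executed

mapped.count()               # action: triggers the map
print(none_counter.value)    # 2

mapped.count()               # the uncached map is recomputed
print(none_counter.value)    # 4, because the updates were applied a second time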
