What Are Accumulators, and How Do They Work?

This is one of the most frequently asked PySpark interview questions. Here's the breakdown:

What Are Accumulators?

  • Accumulators are shared variables used to aggregate values across tasks in a distributed system.
  • They are primarily used for counters or sums (e.g., counting errors or tracking metrics).

How Do They Work?

  • Only the driver program can read the accumulator’s value.
  • Worker nodes can add to the accumulator, but they cannot read its value; if a task tries, PySpark raises an error (see the sketch below).
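
As a quick illustration, here is a minimal sketch (names are hypothetical) of the read restriction. The exact error type and message vary by Spark version, but the job fails because .value is accessed inside a task:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-read-restriction").getOrCreate()
sc = spark.sparkContext

acc = sc.accumulator(0)

def try_to_read(x):
    acc.add(x)        # fine: tasks may add to the accumulator
    return acc.value  # fails: the value can only be read on the driver

# Raises an error at runtime because .value is accessed inside a task
sc.parallelize([1, 2, 3]).map(try_to_read).collect()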

Example (made self-contained with a small sample DataFrame):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()

# Sample DataFrame whose "value" column contains two None entries
df = spark.createDataFrame([(1,), (None,), (3,), (None,)], ["value"])

# Initialize an accumulator on the driver
error_counter = spark.sparkContext.accumulator(0)

# Define a function that adds to the accumulator for each None value
def count_errors(row):
    if row["value"] is None:
        error_counter.add(1)

# Apply the function to the DataFrame
df.foreach(count_errors)

# Check the result on the driver
print("Total errors:", error_counter.value)  # Total errors: 2

Pro Tip: Use accumulators for debugging or tracking side metrics during distributed processing, not for results your job's correctness depends on. Spark only guarantees exactly-once accumulator updates inside actions; updates made inside transformations can be applied more than once if a stage is recomputed or a task is retried.
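
The sketch below (names are hypothetical, and the RDD is deliberately left uncached) shows the two classic pitfalls that follow from this: nothing is counted until an action runs, and recomputing the transformation applies the updates again:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("accumulator-pitfall").getOrCreate()
sc = spark.sparkContext

none_counter = sc.accumulator(0)
rdd = sc.parallelize([1, None, 3, None])

def mark_none(x):
    if x is None:
        none_counter.add(1)
    return x

mapped = rdd.map(mark_none)  # transformation: nothing runs yet

print(none_counter.value)    # 0, because Spark is lazy; no tasks have executed

mapped.count()               # action: triggers the map
print(none_counter.value)    # 2

mapped.count()               # the uncached map is recomputed
print(none_counter.value)    # 4, because the updates were applied a second time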
