• Spark/PySpark performance optimization techniques
  • PySpark optimization techniques: a 101 guide
    • PySpark jobs can be resource-intensive, but these tips can make a big difference:
    • 1️⃣ Cache Data: Use .cache() for repeated operations on the same dataset.
    • 2️⃣ Repartition Strategically: Avoid excessive partitions; balance between computation and storage.
    • 3️⃣ Use Broadcast Variables: Share small datasets across executors efficiently.
    • 4️⃣ Avoid Wide Transformations: Minimize operations like joins or groupBy when possible.
    • 5️⃣ Enable Dynamic Allocation: Adjust resources dynamically for better performance.
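A minimal sketch of tips 1–5 in code (file paths, table names, and partition counts are illustrative assumptions, not recommendations):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# 5️⃣ Dynamic allocation is really a cluster-level setting; shown here only for illustration.
spark = (
    SparkSession.builder
    .appName("optimization-101")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)

sales = spark.read.parquet("/data/sales")           # hypothetical large fact table
countries = spark.read.parquet("/data/countries")   # hypothetical small dimension table

# 1️⃣ Cache a dataset that several later actions reuse
sales.cache()

# 2️⃣ Repartition deliberately before a heavy shuffle (200 is illustrative)
sales = sales.repartition(200, "country_code")

# 3️⃣ Broadcast the small table so the join avoids shuffling the large one
joined = sales.join(broadcast(countries), "country_code")

# 4️⃣ Wide transformations (joins, groupBy) still happen, just fewer of them
joined.groupBy("country_name").count().show()
```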
  • Functions asked in interviews
    • map(), flatMap(), and filter()
    • groupBy() and reduceByKey()
    • select() and selectExpr()
    • coalesce() and repartition()
    • persist() and cache() – interview question
    • Adaptive Query Execution (AQE), Spark 3.0+
    • Catalyst optimizer vs. Tungsten execution engine
    • Dynamic partitioning
    • checkpoint
    • PySpark streaming
      • Watermark
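A quick, hedged illustration of several of the functions above on toy data (all values are made up):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("functions-demo").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["a b", "b c", ""])

# map vs flatMap vs filter
words = lines.flatMap(lambda l: l.split())      # ["a", "b", "b", "c"]
lengths = lines.map(len)                        # [3, 3, 0]
non_empty = lines.filter(lambda l: l != "")     # ["a b", "b c"]

# reduceByKey combines values per partition before shuffling,
# which usually moves far less data than a plain group-by-key.
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
df = counts.toDF(["word", "cnt"])

# coalesce(n) merges partitions without a full shuffle; repartition(n) reshuffles.
fewer = df.coalesce(1)
more = df.repartition(8)

# cache() uses a default storage level; persist() lets you choose one explicitly.
df.persist(StorageLevel.MEMORY_ONLY)
print(counts.collect())
```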
  • Execution plan example
    • Unresolved logical plan
    • Resolved logical plan (after analysis)
    • Logical plan
      • catalyst optimizer
        • Predicate Pushdown → Moves filters as early as possible (ideally down to the data source) so less data is scanned and shuffled.
        • Projection Pruning → Drops unnecessary columns.
    • Physical planning
      It chooses the best join strategy based on dataset size:
      • Broadcast Hash Join (if one dataset is small).
      • Sort-Merge Join (if both datasets are large).
      • Shuffle Hash Join (for medium-sized datasets).
      It also applies column pruning (removing columns the query does not need).
    • Code generation (Tungsten whole-stage code generation)
    • Final optimized execution plan
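You can inspect all of these stages yourself; in Spark 3.x, `explain(mode="extended")` prints the parsed (unresolved), analyzed (resolved), optimized logical, and physical plans. A small sketch (the path and columns are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("plan-demo").getOrCreate()

orders = spark.read.parquet("/data/orders")      # hypothetical input

query = (
    orders
    .select("order_id", "amount", "country")     # projection pruning candidate
    .filter(col("amount") > 100)                 # predicate pushdown candidate
)

# Prints the parsed, analyzed, optimized logical, and physical plans.
query.explain(mode="extended")
```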
  • Given a program, tell me how many jobs, stages, and tasks were created.
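A hedged rule of thumb for answering this: roughly one job per action, one stage per shuffle boundary inside a job, and one task per partition per stage. For example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("job-stage-task").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1000), 4)             # 4 partitions

pairs = rdd.map(lambda x: (x % 10, x))           # narrow: stays in the same stage
totals = pairs.reduceByKey(lambda a, b: a + b)   # wide: shuffle => new stage

totals.count()    # action #1 -> job 1 (map stage + reduce stage)
totals.collect()  # action #2 -> job 2 (the earlier shuffle output can be reused)

# Rough accounting for this program:
#   jobs   ≈ 2 (one per action)
#   stages ≈ 2 in the first job (split at the reduceByKey shuffle)
#   tasks  ≈ one per partition per stage (4 map tasks; 4 reduce tasks by default,
#            unless spark.default.parallelism is set explicitly)
```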
  • To process 25 GB of data in Spark (a rule-of-thumb calculation follows the video links below):
    a. How many CPU cores are required?
    b. How many executors are required?
    c. How much memory does each executor need?
    d. What is the total memory required?
  • https://youtu.be/NFpTXUVPyfQ?si=GYaEXGawN8ia-4Oa https://youtu.be/9onb1XPTr0k?si=Uj3B0xxtDJcBtL_h
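One common rule-of-thumb answer, assuming 128 MB partitions, 5 cores per executor, and roughly 4 GB of memory per core (real numbers depend on the cluster, file format, and workload):

```python
# Hedged back-of-the-envelope sizing for 25 GB (interview rule of thumb, not a law)
data_size_mb = 25 * 1024                          # 25 GB ≈ 25,600 MB
partition_size_mb = 128                           # typical HDFS/Parquet split size
partitions = data_size_mb // partition_size_mb    # ≈ 200 partitions (= tasks)

cores_per_executor = 5                            # common sweet spot
total_cores = partitions                          # a. ≈ 200 cores for full parallelism
executors = total_cores // cores_per_executor     # b. ≈ 40 executors

memory_per_core_gb = 4                            # rule of thumb
executor_memory_gb = cores_per_executor * memory_per_core_gb   # c. ≈ 20 GB each
total_memory_gb = executors * executor_memory_gb                # d. ≈ 800 GB

print(partitions, total_cores, executors, executor_memory_gb, total_memory_gb)
```

In practice you rarely allocate one core per partition; a smaller cluster simply runs the ~200 tasks in several waves.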

  • Narrow vs wide transformation
  • Partitioning vs bucketing/clustering
  • Small file problem
  • Difference between RDD, DataFrame, and Dataset
  • data skew
  • catalyst optimizer
  • broadcast join (rajajs data engineering)
  • What are accumulators in PySpark?
    • Accumulators are shared variables used to aggregate values (e.g., sums or counts) across tasks, such as counting empty lines in a file processed across many executors (see the sketch after this list).
    • Two types of shared variables:
      • Accumulator (a single aggregated copy lives on the driver; executors can only add to it)
      • Broadcast variable (a read-only copy of the data is shipped to each executor once, which reduces data shuffling)
  • Partitioning: multilevel partitioning, dynamic partitioning, static partitioning
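A minimal sketch of both shared-variable types, including the empty-line counting example mentioned above (the data is made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shared-variables").getOrCreate()
sc = spark.sparkContext

# Accumulator: executors only add to it; the single aggregated copy lives on the driver.
empty_lines = sc.accumulator(0)

lines = sc.parallelize(["spark", "", "pyspark", ""])

def track_empty(line):
    if line == "":
        empty_lines.add(1)

lines.foreach(track_empty)
print("empty lines:", empty_lines.value)   # 2

# Broadcast variable: a read-only lookup table copied to every executor once.
country_lookup = sc.broadcast({"IN": "India", "US": "United States"})

codes = sc.parallelize(["IN", "US", "IN"])
print(codes.map(lambda c: country_lookup.value.get(c, "unknown")).collect())
```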

The following questions are commonly asked at top product-based companies.

How would you handle skewed data in PySpark?
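One widely used mitigation is key salting; on Spark 3.x, enabling AQE's skew-join handling is usually the first thing to try. A hedged sketch (paths, column names, and the salt factor are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("skew-salting").getOrCreate()

# Often enough on Spark 3.x: let AQE split skewed partitions automatically.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

SALT_BUCKETS = 10                                   # illustrative

clicks = spark.read.parquet("/data/clicks")         # skewed on user_id (hypothetical)
users = spark.read.parquet("/data/users")

# Manual salting: spread the hot keys of the big side across N sub-keys...
clicks_salted = clicks.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# ...and replicate the small side once per salt value so every sub-key finds a match.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
users_salted = users.crossJoin(salts)

joined = clicks_salted.join(users_salted, ["user_id", "salt"]).drop("salt")
```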

How to Join Two Large DataFrames in PySpark Efficiently
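When neither side fits in memory, a broadcast join is off the table, so the goal is a well-partitioned sort-merge join with AQE enabled. A hedged sketch (paths, keys, and partition counts are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-join").getOrCreate()
spark.conf.set("spark.sql.adaptive.enabled", "true")

orders = spark.read.parquet("/data/orders")         # both inputs are large
customers = spark.read.parquet("/data/customers")

# Repartition both sides on the join key so matching keys land together,
# then let Spark pick a sort-merge join (its default for large inputs).
orders_p = orders.repartition(400, "customer_id")
customers_p = customers.repartition(400, "customer_id")

result = orders_p.join(customers_p, "customer_id", "inner")
result.explain()

# If the same tables are joined repeatedly, bucketing them by the join key at
# write time avoids the shuffle on every query:
# orders.write.bucketBy(64, "customer_id").sortBy("customer_id").saveAsTable("orders_bucketed")
```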

📌 PySpark Interview Questions (Sorted & Categorized)

1. PySpark Fundamentals

  • What is PySpark, and how is it different from Apache Spark?
  • Explain the Spark execution model and the role of DAG (Directed Acyclic Graph).
  • What are the main components of Apache Spark?
  • What are Resilient Distributed Datasets (RDDs), DataFrames, and Datasets in PySpark? How do they differ? (The Dataset API is available only in Scala and Java; it combines the type safety of RDDs with the Catalyst optimizations of DataFrames.)
  • What are the advantages of PySpark over Pandas for big data processing? (Pandas runs on a single machine, whereas PySpark distributes computation across a cluster.)
  • What are the different deployment modes in PySpark? (Standalone, YARN, Mesos, Kubernetes)
  • Explain the concept of lazy evaluation in PySpark.
  • What is the difference between transformations and actions in PySpark? Provide examples.
  • What are the different ways to create a DataFrame in PySpark?
  • How do you read different file formats (CSV, JSON, Parquet, Avro) in PySpark?
  • What is the difference between SparkSession, SparkContext, and SQLContext?
  • Explain the difference between map(), flatMap(), and filter() in PySpark.
  • What is the difference between narrow and wide transformations?
  • What is a broadcast variable in PySpark, and when would you use it?
  • What are accumulators in PySpark, and how do they work?
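A tiny, hedged illustration of lazy evaluation and the transformation/action split asked about above:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lazy-eval").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 29), ("carol", 41)], ["name", "age"]
)

# Transformations are lazy: nothing executes yet, Spark only records the plan (DAG).
adults = df.filter(col("age") > 30).select("name")

# Actions trigger execution of the recorded plan.
adults.show()          # action
print(adults.count())  # another action -> the plan runs again unless the result is cached
```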

2. PySpark SQL & DataFrames

  • What is Spark SQL, and how does it differ from Spark Core?
  • How do you create a temporary or global table using Spark SQL?
  • What are the advantages of using Spark SQL over traditional RDDs?
  • What is a Catalyst Optimizer in Spark SQL? How does it work?
  • What are the different types of joins supported in PySpark?
  • What is the difference between an inner join, outer join, left join, and right join in PySpark?
  • What are window functions in PySpark SQL? Provide examples.
  • How do you implement UDFs (User Defined Functions) in PySpark SQL?
  • How do you handle null values in PySpark SQL?
  • What are the different ways to filter, group, and aggregate data in PySpark SQL?
  • What is the difference between groupBy() and reduceByKey()?
  • How do you execute raw SQL queries using PySpark?
  • What is the difference between the select() and selectExpr() functions in PySpark?
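A compact, hedged example touching several of these questions: a temporary view queried with raw SQL, a window function, and a Python UDF (the data and names are made up):

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

sales = spark.createDataFrame(
    [("asia", "tv", 1200), ("asia", "phone", 800), ("eu", "tv", 900)],
    ["region", "product", "revenue"],
)

# Temporary view + raw SQL
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(revenue) AS total FROM sales GROUP BY region").show()

# Window function: rank products by revenue within each region
w = Window.partitionBy("region").orderBy(F.desc("revenue"))
sales.withColumn("rank", F.rank().over(w)).show()

# Simple Python UDF (prefer built-in functions when they exist; UDFs are slower)
shout = F.udf(lambda s: s.upper(), StringType())
sales.withColumn("product_uc", shout("product")).show()
```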

3. PySpark Performance Optimization

  • What are some common PySpark performance tuning techniques?
  • How does PySpark handle partitioning, and why is it important?
  • Explain the difference between coalesce() and repartition().
  • What is data shuffling in PySpark, and how can you reduce it?
  • How does caching (persist() and cache()) work in PySpark, and when should you use it?
  • What are the different persistence storage levels in PySpark?
  • How do you handle data skew in PySpark?
  • How does PySpark optimize query execution using the Catalyst Optimizer?
  • What is the Tungsten execution engine in Spark, and how does it improve performance?
  • How can you optimize join operations in PySpark?
  • What is speculative execution in PySpark, and how does it help?
  • What is adaptive query execution (AQE) in PySpark 3.0?
  • How does PySpark handle small files, and what are the best practices for dealing with them?
  • How do you identify and resolve memory bottlenecks in PySpark applications?
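A hedged sketch of a few of these tuning levers in one place (the values are illustrative, not recommendations):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tuning-demo").getOrCreate()

# Adaptive Query Execution (Spark 3.x): coalesces shuffle partitions and
# handles skewed joins at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

# Fewer shuffle partitions for small/medium data (the default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "64")

events = spark.read.parquet("/data/events")          # hypothetical input

# Explicit storage level instead of relying on the cache() default.
events.persist(StorageLevel.MEMORY_AND_DISK)

# coalesce() reduces the number of output files without a full shuffle,
# which also helps avoid creating a small-file problem downstream.
events.coalesce(8).write.mode("overwrite").parquet("/data/events_compacted")
```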

4. PySpark Streaming & Real-Time Data Processing

  • What is Structured Streaming in PySpark, and how does it work?
  • How does PySpark handle real-time data processing?
  • What are watermarks in Structured Streaming?
  • What are triggers in PySpark Streaming, and how are they used?
  • How does PySpark handle late-arriving data in streaming?
  • How does checkpointing work in PySpark Structured Streaming?
  • What is the difference between batch processing and stream processing in PySpark?
  • What is the role of Apache Kafka in PySpark Streaming?
  • How do you integrate PySpark with real-time streaming tools like Kafka and Flume?
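A hedged Structured Streaming sketch that reads from Kafka, applies a watermark for late data, and checkpoints its state (the topic, servers, schema, and paths are assumptions, and the Kafka connector package must be on the classpath):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

schema = StructType([
    StructField("user", StringType()),
    StructField("event_time", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

events = (
    raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Watermark: accept events up to 10 minutes late, then drop state for older windows.
counts = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "user")
    .count()
)

query = (
    counts.writeStream.outputMode("update")
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/stream-demo")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```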

5. PySpark Deployment & Cluster Management

  • What are the different cluster managers available in PySpark?
  • What is YARN, and how does Spark work on YARN?
  • What is the difference between client mode and cluster mode in PySpark?
  • How do you configure dynamic resource allocation in PySpark?
  • What are executor memory and driver memory in PySpark?
  • How do you monitor and debug PySpark jobs?
  • What is Spark UI, and how can you use it for monitoring?
  • What are the common errors and exceptions in PySpark, and how do you debug them?
  • How do you deploy a PySpark application on Kubernetes?
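Deploy mode and cluster manager are normally chosen at submit time, but the dynamic-allocation and memory questions above can be illustrated with configuration (values are illustrative and would usually live in spark-defaults.conf or the spark-submit command rather than in code):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-demo")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.executor.memory", "8g")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)

# The Spark UI (usually http://<driver-host>:4040) shows jobs, stages, storage,
# and executor metrics, which is the first stop for monitoring and debugging.
print(spark.sparkContext.uiWebUrl)
```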

6. PySpark Advanced Topics

  • What are lineage graphs, and how do they help in debugging Spark jobs?
  • What is a shuffle operation in PySpark?
  • Explain how PySpark integrates with Hadoop and Hive.
  • What are the differences between Apache Spark and Apache Flink?
  • What happens when a PySpark task fails?
  • How does PySpark handle OutOfMemory (OOM) errors?
  • What is the difference between shuffle partitioning and hash partitioning?
  • What is speculative execution, and when should it be used?
  • How do you configure PySpark logging for better debugging?
  • What is the difference between Order By, Sort By, and Cluster By?
  • How does Spark handle failures?
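A small, hedged sketch of lineage and checkpointing, which several of these questions touch on (the checkpoint directory is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
sc = spark.sparkContext
sc.setCheckpointDir("/tmp/checkpoints/lineage-demo")   # hypothetical directory

rdd = sc.parallelize(range(100), 4)
pipeline = rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)

# Lineage graph: how Spark would recompute this RDD after a failure.
print(pipeline.toDebugString().decode())

# Checkpointing persists the data to reliable storage and truncates the lineage,
# which helps long iterative jobs and speeds up recovery.
pipeline.checkpoint()
pipeline.count()                     # an action materializes the checkpoint
print(pipeline.toDebugString().decode())
```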

7. PySpark Coding & Hands-on Questions

  • Write a PySpark code to count the number of words in a file.
  • Write a PySpark script to read a CSV file and filter data based on a column value.
  • Write a PySpark program to remove duplicate rows from a DataFrame.
  • Write a PySpark job to perform an inner join between two DataFrames.
  • Write a PySpark script to compute the moving average of a column using window functions.
  • Write a PySpark script to find the top N products by revenue from a sales dataset.
  • How do you convert a PySpark DataFrame to a Pandas DataFrame?
  • How do you sort a DataFrame based on multiple columns in PySpark?
  • How would you implement a custom UDF in PySpark?
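For reference, a hedged sketch answering the first coding question, word count over a file (the input path is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("word-count").getOrCreate()

lines = spark.read.text("/data/input.txt")           # hypothetical path

counts = (
    lines.select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
    .filter(F.col("word") != "")
    .groupBy("word")
    .count()
    .orderBy(F.desc("count"))
)

counts.show()
```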

Final Notes & Strategy for PySpark Interviews

  • Understand Spark internals: focus on the DAG, the execution model, and optimizations.
  • Practice coding problems: work on PySpark transformations, actions, and SQL queries.
  • Performance tuning: know techniques to optimize memory, joins, and shuffles.
  • Real-world scenarios: be prepared for questions on handling large datasets, debugging, and deployment.
  • Hands-on experience: work on projects integrating PySpark with Kafka, Hive, and databases.
