Given a Code Snippet – How Does the Catalyst Optimizer Work in It?
You have the following code. Explain in detail how the Catalyst Optimizer works in it. PySpark's Catalyst Optimizer is a powerful query optimizer used by Spark SQL to…
If your code/query filters the data at the end, the Catalyst Optimizer (via predicate pushdown) will apply the filtering at the input or source and then do the other…
Both cache() and persist() store data in memory to speed up retrieval of intermediate data used in computation. However, persist() is more flexible and allows users to specify a storage…
Slowly Changing Dimensions (SCD) are used in data warehousing to manage historical changes in dimension tables. There are several types of SCDs, each handling data changes differently. Types of SCDs…
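As a concrete illustration of the most common variant, SCD Type 2 keeps history by closing the current row and appending a new versioned row. This is a minimal pure-Python sketch; the table and column names (customer_id, city, start_date, end_date, is_current) are invented for the example:

```python
# Hedged sketch of an SCD Type 2 update: never overwrite, append a new version.
from datetime import date

dim = [  # current state of the dimension table (illustrative)
    {"customer_id": 1, "city": "Pune", "start_date": date(2020, 1, 1),
     "end_date": None, "is_current": True},
]

def scd2_update(dim, customer_id, new_city, change_date):
    """Close the current row for this key, then append a new current row."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            row["end_date"] = change_date    # close out the history row
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "city": new_city,
                "start_date": change_date, "end_date": None, "is_current": True})

scd2_update(dim, 1, "Mumbai", date(2024, 6, 1))
# dim now holds two rows for customer 1: the closed historical row (Pune)
# and the new current row (Mumbai).
```

Type 1, by contrast, would simply overwrite the city in place and lose the history.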
Set the speculative execution configuration -> spark.speculation = true
In this article, we are converting SQL queries to PySpark code.

| Task | SQL Command | PySpark Command |
| --- | --- | --- |
| Selecting Data | SELECT col1, col2 FROM table; | df.select("col1", "col2") |
| Filtering Data | SELECT *… | |
PySpark Interview Questions (lec 6): How do you read files in Spark? You are going to see how to read different file formats in PySpark. First, you need to create the…
Handling skewed data in PySpark is crucial for optimizing performance and ensuring efficient processing. Skewed data occurs when some partitions have significantly more data than others, leading to uneven workload…
Steps: 1. Iterate through the array nums from index 0 to len(nums) – k. 2. For each window of size k, find the maximum element. 3. Store the maximum value…
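The steps above can be sketched as a straightforward brute-force pass (O(n·k)); a monotonic deque would bring this to O(n), but this version mirrors the steps directly:

```python
def max_sliding_window(nums, k):
    """Brute force: for each window of size k, record its maximum."""
    result = []
    for i in range(len(nums) - k + 1):   # windows start at 0 .. len(nums) - k
        result.append(max(nums[i:i + k]))
    return result

print(max_sliding_window([1, 3, -1, -3, 5, 3, 6, 7], 3))  # [3, 3, 5, 5, 6, 7]
```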
Given a string s containing just the characters '(', ')', '{', '}', '[' and ']', determine if the input string is valid. An input string is valid if: (1) open brackets are closed by the same type of brackets, and (2) open brackets are closed in the correct order. Example 1: Input: s…
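The usual solution is a stack: push every opening bracket, and on a closing bracket pop and check that it matches. A minimal sketch:

```python
def is_valid(s: str) -> bool:
    """Stack-based bracket matching: push opens, pop and compare on closes."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)                       # opening bracket: remember it
        elif not stack or stack.pop() != pairs[ch]:
            return False                           # close with no matching open
    return not stack                               # valid only if nothing is left open

print(is_valid("()[]{}"))  # True
print(is_valid("(]"))      # False
```

Each character is handled once, so the check runs in O(n) time and O(n) space.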