Data Engineer interview.

CAP theorem

Lambda vs Kappa architecture

Star vs snowflake schema

Data warehouse, data lake, Delta Lake

SCD types

SQL

– How would you write a query to calculate a cumulative sum or running total within a specific partition in SQL?

– How do window functions differ from aggregate functions, and when would you use them?

– How do you identify and remove duplicate records in SQL without using temporary tables?
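
As a starting point for the running-total and duplicate-removal questions above, here is a minimal sketch that runs the SQL through Python's `sqlite3` module. The `sales` table and its columns are hypothetical, and window functions assume SQLite 3.25 or newer. The `rowid` trick is SQLite-specific; on other engines a `ROW_NUMBER()` window in a CTE is a common alternative.

```python
import sqlite3

# In-memory database with a hypothetical "sales" table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, sale_day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", 1, 10),
        ("east", 2, 20),
        ("east", 2, 20),  # exact duplicate of the previous row
        ("west", 1, 5),
        ("west", 2, 15),
    ],
)

# Remove duplicates in place, without a temporary table: keep the lowest
# rowid in each group of identical rows and delete the rest.
conn.execute(
    """
    DELETE FROM sales
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM sales GROUP BY region, sale_day, amount
    )
    """
)

# Running total per region: SUM() used as a window function keeps every row,
# whereas a plain GROUP BY aggregate would collapse each region to one row.
running_totals = conn.execute(
    """
    SELECT region, sale_day, amount,
           SUM(amount) OVER (
               PARTITION BY region
               ORDER BY sale_day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM sales
    ORDER BY region, sale_day
    """
).fetchall()

for row in running_totals:
    print(row)
```

The last column restarts at each new region, which is the behaviour `PARTITION BY` is there to provide.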

Python

– How do you manage memory efficiently when processing large files in Python?

– What are Python decorators, and how would you use them to optimize reusable code in ETL processes?

– How do you use Python’s built-in logging module to capture detailed error and audit logs?
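
The three Python questions above can be illustrated together. The sketch below is one possible approach, not a definitive pattern: a generator reads a file line by line so memory use stays flat, a decorator adds timing and error logging to any step, and the standard `logging` module records the result. The names `log_step`, `read_lines`, and `sum_amounts` are hypothetical.

```python
import functools
import logging
import os
import tempfile
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("etl")


def log_step(func):
    """Decorator: log the start, duration, and any exception of an ETL step."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        logger.info("step=%s status=started", func.__name__)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # logger.exception records the full traceback at ERROR level.
            logger.exception("step=%s status=failed", func.__name__)
            raise
        elapsed = time.perf_counter() - start
        logger.info("step=%s status=ok seconds=%.3f", func.__name__, elapsed)
        return result

    return wrapper


def read_lines(path):
    """Yield one line at a time so the whole file never sits in memory."""
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


@log_step
def sum_amounts(path):
    """Hypothetical ETL step: total the integer found on each line."""
    return sum(int(line) for line in read_lines(path) if line)


# Usage with a small temporary file standing in for a large input.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("10\n20\n30\n")
    sample_path = tmp.name

total = sum_amounts(sample_path)
os.remove(sample_path)
print(total)
```

Because the decorator re-raises after logging, a failed step still stops the pipeline while leaving a traceback in the log.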

PySpark

– How would you handle skewed data in a Spark job to prevent performance issues?

– What is the difference between the Spark Session and Spark Context? When should each be used?

– How do you handle backpressure in Spark Streaming applications to manage load effectively?

Azure Databricks

– How do you configure cluster autoscaling in Databricks, and when should it be used?

– How do you implement data versioning in Delta Lake tables within Databricks?

– How would you monitor and optimize Databricks job performance metrics?

Azure Data Factory

– What are tumbling window triggers in Azure Data Factory, and how do you configure them?

– How would you enable managed identity-based authentication for linked services in ADF?

– How do you create custom activity logs in ADF for monitoring data pipeline execution?

CI/CD

– What are blue-green deployments, and how would you use them for ETL jobs?

– How do you implement rollback mechanisms in CI/CD pipelines for data integration processes?

– What strategies do you use to handle schema evolution in data pipelines as part of CI/CD?
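
One way to approach the schema evolution question is a compatibility check that runs as a CI step before deployment. The sketch below is a simplified illustration: the function name `find_breaking_changes` and the example schemas are hypothetical, and it assumes that added columns are safe while removed columns and type changes are breaking. Real pipelines may need finer rules, such as allowing type widening.

```python
def find_breaking_changes(old_schema, new_schema):
    """Compare two {column: type} mappings and list backward-incompatible changes."""
    problems = []
    for column, old_type in old_schema.items():
        if column not in new_schema:
            problems.append(f"removed column: {column}")
        elif new_schema[column] != old_type:
            problems.append(
                f"type change on {column}: {old_type} -> {new_schema[column]}"
            )
    # Columns that exist only in new_schema are treated as additive and allowed.
    return problems


# Hypothetical schemas: the deployed table versus the one proposed in a pull request.
deployed = {"order_id": "bigint", "amount": "double", "country": "string"}
proposed = {"order_id": "bigint", "amount": "string", "channel": "string"}

breaking = find_breaking_changes(deployed, proposed)
for problem in breaking:
    print(problem)
```

A CI job could fail the build whenever the returned list is non-empty, which forces breaking changes through an explicit migration instead of a silent deploy.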

Data Warehousing

– How do you optimize join operations in a data warehouse to improve query performance?

– What is a slowly changing dimension (SCD), and what are different ways to implement it in a data warehouse?

– How do surrogate keys benefit data warehouse design over natural keys?
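
To make the SCD and surrogate key questions concrete, here is a minimal in-memory sketch of a Type 2 update, in which a changed attribute expires the current row and appends a new version. The column names (`customer_sk`, `customer_id`, `city`, `valid_from`, `valid_to`, `is_current`) and the function `apply_scd2` are hypothetical; in a warehouse the same logic is usually expressed as a `MERGE` statement.

```python
from datetime import date


def apply_scd2(dimension, incoming, today):
    """Apply Type 2 changes: expire the current row and append a new version."""
    # The surrogate key is a plain counter, independent of the natural key.
    next_key = max((row["customer_sk"] for row in dimension), default=0) + 1
    current = {row["customer_id"]: row for row in dimension if row["is_current"]}

    for record in incoming:
        existing = current.get(record["customer_id"])
        if existing is not None and existing["city"] == record["city"]:
            continue  # attribute unchanged, nothing to do
        if existing is not None:
            # Close the old version instead of overwriting it (that would be Type 1).
            existing["valid_to"] = today
            existing["is_current"] = False
        new_row = {
            "customer_sk": next_key,
            "customer_id": record["customer_id"],
            "city": record["city"],
            "valid_from": today,
            "valid_to": None,
            "is_current": True,
        }
        dimension.append(new_row)
        current[record["customer_id"]] = new_row
        next_key += 1
    return dimension


# Hypothetical data: C1 moves to a new city and C2 is a new customer.
dimension = [
    {
        "customer_sk": 1,
        "customer_id": "C1",
        "city": "Pune",
        "valid_from": date(2023, 1, 1),
        "valid_to": None,
        "is_current": True,
    }
]
incoming = [
    {"customer_id": "C1", "city": "Mumbai"},
    {"customer_id": "C2", "city": "Delhi"},
]

result = apply_scd2(dimension, incoming, date(2024, 6, 1))
for row in result:
    print(row)
```

The natural key `C1` now appears on two rows, which is why fact tables reference the surrogate key: each version of the customer gets its own stable identifier.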

Data Modeling

– How do you decide between a star schema and a snowflake schema for a data warehouse? Provide examples of scenarios where each is ideal.

– What is dimensional modeling, and how does it differ from entity-relationship modeling in terms of use cases?

– How do you handle one-to-many relationships in a dimensional model to ensure efficient querying?

By Vishal Jadhav
