CAP theorem
Lambda vs Kappa architecture
Star vs snowflake schema
Data warehouse, data lake, Delta Lake
SCD types
SQL
– How would you write a query to calculate a cumulative sum or running total within a specific partition in SQL?
– How do window functions differ from aggregate functions, and when would you use them?
– How do you identify and remove duplicate records in SQL without using temporary tables?
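One way to practice the SQL questions above is a small in-memory sketch. It assumes SQLite 3.25 or newer (for window functions) through Python's built-in sqlite3 module. The sales table and its columns are hypothetical, and the rowid column used for the delete is SQLite-specific, so other engines would need a key column or a CTE instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table used only for illustration.
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", 1, 10),
        ("east", 2, 5),
        ("west", 1, 7),
        ("west", 2, 3),
        ("west", 2, 3),  # exact duplicate of the previous row
    ],
)

# Remove duplicates without a temporary table: number the rows inside each
# group of identical values and delete every row after the first.
removed = conn.execute(
    """
    DELETE FROM sales
    WHERE rowid IN (
        SELECT rid FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY region, day, amount ORDER BY rowid
                   ) AS rn
            FROM sales
        )
        WHERE rn > 1
    )
    """
).rowcount

# Window function: one output row per input row, with a running total
# that restarts for each region.
running = conn.execute(
    """
    SELECT region, day, amount,
           SUM(amount) OVER (
               PARTITION BY region
               ORDER BY day
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM sales
    ORDER BY region, day
    """
).fetchall()

# Aggregate function: rows collapse to one row per group.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
```

The contrast between running and totals is the core of the window versus aggregate question: the window query keeps row-level detail, while GROUP BY returns a single row per region.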
Python
– How do you manage memory efficiently when processing large files in Python?
– What are Python decorators, and how would you use them to optimize reusable code in ETL processes?
– How do you use Python’s built-in logging module to capture detailed error and audit logs?
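The three Python questions above can be tied together in one minimal sketch: a generator that streams records instead of loading the whole file, a decorator that wraps any ETL step with timing and error logging, and the standard logging module for the audit trail. The names log_step, iter_records, and total_amount are hypothetical, and an in-memory io.StringIO stands in for a large file.

```python
import functools
import io
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
logger = logging.getLogger("etl")


def log_step(func):
    """Decorator (hypothetical helper): log duration and failures of a step."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            # logger.exception records the full traceback at ERROR level.
            logger.exception("step %s failed", func.__name__)
            raise
        elapsed = time.perf_counter() - start
        logger.info("step %s finished in %.3fs", func.__name__, elapsed)
        return result
    return wrapper


def iter_records(lines):
    """Yield one parsed row at a time so only a single line is held in memory."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield line.split(",")


@log_step
def total_amount(lines):
    # The generator expression consumes rows lazily; no list is built.
    return sum(int(row[1]) for row in iter_records(lines))


# A file object opened with open(path) could be passed in the same way.
sample = io.StringIO("a,10\nb,20\n\nc,12\n")
total = total_amount(sample)
```

Because file objects are themselves lazy iterators over lines, the same function works on a multi-gigabyte file without changing the code.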
PySpark
– How would you handle skewed data in a Spark job to prevent performance issues?
– What is the difference between SparkSession and SparkContext? When should each be used?
– How do you handle backpressure in Spark Streaming applications to manage load effectively?
Azure Databricks
– How do you configure cluster autoscaling in Databricks, and when should it be used?
– How do you implement data versioning in Delta Lake tables within Databricks?
– How would you monitor and optimize Databricks job performance metrics?
Azure Data Factory
– What are tumbling window triggers in Azure Data Factory, and how do you configure them?
– How would you enable managed identity-based authentication for linked services in ADF?
– How do you create custom activity logs in ADF for monitoring data pipeline execution?
CI/CD
– What are blue-green deployments, and how would you use them for ETL jobs?
– How do you implement rollback mechanisms in CI/CD pipelines for data integration processes?
– What strategies do you use to handle schema evolution in data pipelines as part of CI/CD?
Data Warehousing
- How do you optimize join operations in a data warehouse to improve query performance?
- What is a slowly changing dimension (SCD), and what are different ways to implement it in a data warehouse?
- How do surrogate keys benefit data warehouse design over natural keys?
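For the SCD and surrogate key questions above, a small in-memory sketch of a Type 2 change may help. It is not a warehouse implementation: the dimension is a plain list of dictionaries, and the column names (customer_sk, customer_id, city, valid_from, valid_to, is_current) and the function apply_scd2 are hypothetical.

```python
from datetime import date


def apply_scd2(dimension, incoming, today):
    """Type 2 sketch: expire the current row and append a new version."""
    next_key = max((row["customer_sk"] for row in dimension), default=0) + 1
    # Index the current version of each customer by its natural key.
    current = {row["customer_id"]: row for row in dimension if row["is_current"]}

    for record in incoming:
        existing = current.get(record["customer_id"])
        if existing is not None and existing["city"] == record["city"]:
            continue  # attribute unchanged, nothing to version
        if existing is not None:
            # Close the old version instead of overwriting it (Type 1 would overwrite).
            existing["valid_to"] = today
            existing["is_current"] = False
        new_row = {
            "customer_sk": next_key,  # surrogate key: unique per version
            "customer_id": record["customer_id"],  # natural key: repeats across versions
            "city": record["city"],
            "valid_from": today,
            "valid_to": None,
            "is_current": True,
        }
        dimension.append(new_row)
        current[record["customer_id"]] = new_row
        next_key += 1
    return dimension


dim = [
    {"customer_sk": 1, "customer_id": "C1", "city": "Pune",
     "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True},
]
updates = [{"customer_id": "C1", "city": "Delhi"}, {"customer_id": "C2", "city": "Goa"}]
apply_scd2(dim, updates, date(2024, 1, 1))
```

The sketch also shows why a surrogate key is needed here: after the change, customer C1 has two rows, so the natural key alone can no longer identify a single dimension row for fact tables to reference.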
Data Modeling
- How do you decide between a star schema and a snowflake schema for a data warehouse? Provide examples of scenarios where each is ideal.
- What is dimensional modeling, and how does it differ from entity-relationship modeling in terms of use cases?
- How do you handle one-to-many relationships in a dimensional model to ensure efficient querying?