Data Engineer Interview Questions
My interview experiences and interview questions.
- Python interview questions asked in the interview
- SQL questions asked in the interview
- PySpark questions asked in the interview
Data Engineering concepts
- ETL vs ELT
- Data warehouse, data mart, data lake, data lakehouse, Delta Lake, data mesh
- Data Modeling and Concepts – Best Resource
- Star schema vs Snowflake schema
- SCD Types 1, 2, and 3, with examples (how to track history in a data warehouse)
- Full load vs incremental load
- Lambda vs Kappa architecture
- Data governance and security – who can access the data, data security.
- Data integrity
- Data Quality
- Data Privacy, security, and compliance – Role-based access control
- Data Discovery
- Data lineage
- Data Profiling
- Data Catalogue
- Data granularity
- How to handle data granularity throughout the life cycle
- Data Architect
- Batch processing
- Real-time/stream processing
- Event-driven architecture
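As an illustration of the SCD item in the list above: Type 1 overwrites the old value, Type 3 keeps the previous value in an extra column, and Type 2 keeps full history by closing the current row and adding a new version. The following is a minimal sketch of Type 2 in plain Python, not a production implementation. The function name `apply_scd2` and the column names (`start_date`, `end_date`, `is_current`) are assumptions chosen for this example.

```python
from datetime import date


def apply_scd2(dimension, key, new_attrs, effective):
    """Apply one SCD Type 2 change to an in-memory dimension table.

    `dimension` is a list of dict rows (hypothetical layout for this sketch).
    If the attributes for `key` changed, the current row is closed and a new
    current row is appended, so the old version stays available as history.
    """
    for row in dimension:
        if row["key"] == key and row["is_current"]:
            if row["attrs"] == new_attrs:
                return dimension  # nothing changed, keep history as is
            # Close the old version instead of overwriting it.
            row["end_date"] = effective
            row["is_current"] = False
            break
    # Append the new version as the current row.
    dimension.append({
        "key": key,
        "attrs": new_attrs,
        "start_date": effective,
        "end_date": None,
        "is_current": True,
    })
    return dimension


dim = []
apply_scd2(dim, 1, {"city": "Pune"}, date(2023, 1, 1))
apply_scd2(dim, 1, {"city": "Mumbai"}, date(2024, 6, 1))
for row in dim:
    print(row)
```

After the second call the table holds two rows for the same key: the closed "Pune" row with an end date, and the current "Mumbai" row. In a real warehouse the same idea is usually expressed as a merge/upsert statement rather than a Python loop.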
Big Data
PySpark
- Prepare these PySpark questions and crack any interview – PySpark Top 100 Interview Questions
- PySpark Most Important Questions
- PySpark SQL syntax/coding
- PySpark Streaming
- DStreams in Spark Streaming
- Watermarking in PySpark streaming (watermarking lets you drop data that arrives too late to process, so you avoid the complexity of handling records that show up after the expected window)
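The watermarking idea from the list above can be sketched in plain Python. This is a simplified model, not Spark's implementation: it assumes the watermark is the latest event time seen so far minus an allowed delay, and it checks every event individually, whereas Spark updates the watermark per micro-batch. The function name `filter_late_events` and the sample events are made up for this example.

```python
from datetime import datetime, timedelta


def filter_late_events(events, delay):
    """Return the names of events that are not older than the watermark.

    `events` is a list of (name, event_time) tuples in arrival order.
    The watermark is modeled as: max event time seen so far minus `delay`.
    """
    max_event_time = None
    accepted = []
    for name, event_time in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - delay
        if event_time >= watermark:
            accepted.append(name)
        # Events older than the watermark are too late and are dropped.
    return accepted


t0 = datetime(2024, 1, 1, 12, 0)
events = [
    ("a", t0),
    ("b", t0 + timedelta(minutes=20)),
    ("c", t0 + timedelta(minutes=5)),   # 15 minutes behind the latest event
    ("d", t0 + timedelta(minutes=12)),  # 8 minutes behind the latest event
]
kept = filter_late_events(events, timedelta(minutes=10))
print(kept)  # ['a', 'b', 'd']
```

With a 10-minute delay, event "c" is dropped because it arrives 15 minutes behind the latest event time, while "d" is only 8 minutes behind and is kept. In PySpark Structured Streaming the corresponding call is `DataFrame.withWatermark(eventTime, delayThreshold)`, typically followed by a windowed aggregation.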
The most common skills companies look for in a data engineer
I made a list of the most in-demand technologies
- Programming languages – Python, PySpark, SQL
- PySpark – PySpark Streaming
- Cloud
- AWS
- ETL/Analytics: EMR, Glue, Athena, Redshift
- ECS, EC2, S3, Lambda, Step Functions, API Gateway, SNS, RDS, Aurora PostgreSQL
- Data Migration: AWS DMS
- GCP
- BigQuery, Cloud Storage, Dataproc
- Orchestration: Airflow, Docker, Kubernetes
- Big data: Hadoop/HDFS, Kafka, Hive
- Non-Relational Database/Data Store (Good to have, company-specific)
- Object storage
- Document storage
- Key-value store
- Graph database
- Column-family database
- MongoDB, Redis, Elasticsearch
The following technologies are alternatives to each other
- Data warehouse: Hive / Redshift / BigQuery / Snowflake
- Distributed/Spark processing: Dataproc / EMR / Databricks