System Design for Data Engineers
The system design round for data engineers is slightly different from traditional system design interviews. In this article, you will explore various interview questions along with their answers.
Design storage system
How to choose a file format, for example when to choose the Parquet format vs the Avro format
Different types of systems
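A common rule of thumb is that Avro is row-oriented (good for record-at-a-time writes and reads) while Parquet is column-oriented (good for analytical scans over a few columns). The sketch below only imitates the two layouts with plain Python structures; it does not use either library, and the records and field names are made up for illustration.

```python
# Toy records; the field names are hypothetical.
records = [
    {"user_id": 1, "country": "DE", "amount": 10.0},
    {"user_id": 2, "country": "US", "amount": 25.5},
    {"user_id": 3, "country": "DE", "amount": 7.25},
]


def to_row_layout(rows):
    """Row-oriented (Avro-like): all fields of one record are stored together."""
    return [tuple(r.values()) for r in rows]


def to_column_layout(rows):
    """Column-oriented (Parquet/ORC-like): all values of one field are stored together."""
    return {name: [r[name] for r in rows] for name in rows[0]}


row_store = to_row_layout(records)
col_store = to_column_layout(records)

# Analytical query: only the "amount" column has to be read.
total_amount = sum(col_store["amount"])

# Record-at-a-time access: one complete record is read in a single step.
second_record = row_store[1]
```

In an interview, this usually translates to: prefer a columnar format when queries aggregate a few columns over many rows, and a row format when whole records are written or consumed one at a time.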
- Batch processing
- Real-time/Streaming (most interviews expect this)
- Event-driven
Different types of architectures/system design concepts
- CAP theorem
- Lambda vs Kappa architecture
- ETL vs ELT – How to choose
- Data warehouse, data mart, data lake, data lakehouse, Delta Lake, data mesh
- Data modeling and concepts
- Star schema vs Snowflake schema
- SCD Type 1, 2, 3, with examples (how to track history in a data warehouse)
- Full load vs incremental load
- Data governance and security – who can access the data, data security.
- Data integrity
- Data Quality
- Data Privacy, security, and compliance – Role-based access control
- Data Discovery
- Data lineage
- Data Profiling
- Data Catalogue
- Data granularity
- How to handle data granularity throughout the life cycle
- Data Architect
- Batch processing
- Real-time/stream processing
- Event-driven architecture
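Of the concepts listed above, SCD Type 2 is the one interviewers most often ask to see worked through: instead of overwriting a changed attribute (Type 1) or keeping only a "previous value" column (Type 3), the current row is closed and a new version is appended. The following is a minimal in-memory sketch; the function name `apply_scd2`, the row fields (`valid_from`, `valid_to`, `is_current`), and the sample customer are assumptions made for illustration, not a warehouse API.

```python
from datetime import date


def apply_scd2(dimension, key, new_attrs, effective):
    """Close the current row for `key` if its attributes changed, then append a new version."""
    current = next(
        (row for row in dimension if row["key"] == key and row["is_current"]),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return dimension  # nothing changed, so history stays as it is
        # Expire the old version instead of overwriting it.
        current["valid_to"] = effective
        current["is_current"] = False
    dimension.append(
        {
            "key": key,
            "attrs": dict(new_attrs),
            "valid_from": effective,
            "valid_to": None,  # open-ended until the next change
            "is_current": True,
        }
    )
    return dimension


# Hypothetical customer who moves city.
customer_dim = []
apply_scd2(customer_dim, 42, {"city": "Pune"}, date(2023, 1, 1))
apply_scd2(customer_dim, 42, {"city": "Berlin"}, date(2024, 6, 1))
```

After the second call the dimension holds two rows for key 42: the Pune row closed on 2024-06-01 and the Berlin row marked as current. In a real warehouse the same logic is typically expressed as a merge statement, but the row-closing idea is the same.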
Choosing file formats, databases, and data warehouse solutions
- Parquet vs Avro vs ORC vs CSV: when to use which format
- Data warehouse
- Hive/BigQuery/GCP/Snowflake
- Star schema vs Snowflake schema
- Databases
- Graph database
- Relational database
- NoSQL database
- Time-series database
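To make the star schema item above concrete, here is a small sketch that uses Python's built-in `sqlite3` as a stand-in for a warehouse. The table names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are invented for illustration. A snowflake schema would normalize the dimensions further, for example by moving `category` into its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One central fact table referencing two denormalized dimension tables.
conn.executescript(
    """
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_date VALUES (20240101, 2024), (20250101, 2025);
    INSERT INTO fact_sales VALUES
        (1, 20240101, 10.0),
        (1, 20250101, 5.0),
        (2, 20240101, 20.0);
    """
)

# Typical analytical query: join the fact table to its dimensions and aggregate.
sales_by_category_year = conn.execute(
    """
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
    """
).fetchall()
```

The trade-off to mention is that a star schema needs fewer joins per query, while a snowflake schema reduces redundancy in the dimensions at the cost of extra joins.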
Processing
- Spark Streaming – for real-time/stream processing
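A typical stream-processing task is a windowed aggregation. The sketch below shows the idea of a tumbling (fixed, non-overlapping) window in plain Python. It is not the Spark API, and the function name, event tuples, and window size are assumptions for illustration only.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Round the event time down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical (event_time_in_seconds, event_type) pairs.
events = [(0, "click"), (12, "click"), (59, "view"), (61, "click"), (125, "view")]
counts = tumbling_window_counts(events, window_seconds=60)
```

A real streaming engine applies the same grouping to an unbounded input and additionally has to decide how long to wait for late events before a window is finalized, which is a good point to raise in the interview.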
Storage
In this layer, we store clean data for business use. Generally, this data is stored in a data warehouse such as Snowflake, Redshift, BigQuery, or Microsoft Fabric.