System Design for Data Engineers

The system design round for data engineers is slightly different from traditional system design interviews. In this article, you will explore various interview questions along with their answers.

Designing a storage system

How to choose a file format, for example when to choose the Parquet format versus the Avro format

Different types of systems

  • Batch processing
  • Real-time/streaming (most interviews expect this)
  • Event Driven

Different types of architectures/system design concepts

  • CAP theorem
  • Lambda vs Kappa architecture
  • ETL vs ELT – How to choose
  • Data warehouse, data mart, data lake, data lakehouse, Delta Lake, data mesh
  • Data modeling concepts
    • Star schema vs Snowflake schema
    • SCD Types 1, 2, and 3, with examples (how to track history in a data warehouse)
    • Full load vs incremental load
  • Data governance and security – who can access the data, data security.
    • Data integrity
    • Data Quality
    • Data Privacy, security, and compliance – Role-based access control
    • Data Discovery
  • Data lineage
  • Data Profiling
  • Data Catalogue
  • Data granularity
    • How to handle data granularity throughout the life cycle
  • Data architecture
    • Batch processing
    • Real-time/stream processing
    • Event-driven architecture
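
To make the SCD item in the list above concrete, here is a minimal sketch of a Type 2 update, the variant that tracks history by closing the old row and adding a new one. The column names (customer_id, city, valid_from, valid_to, is_current) and the helper apply_scd2 are hypothetical and chosen only for illustration; a real warehouse would normally do this with a MERGE statement rather than in-memory Python.

```python
from datetime import date


def apply_scd2(dimension, incoming, today):
    """Apply one incoming record to a Type 2 dimension held as a list of dicts.

    If the tracked attribute changed, the current row is closed and a new
    current row is appended, so the old value stays queryable.
    """
    for row in dimension:
        if row["customer_id"] == incoming["customer_id"] and row["is_current"]:
            if row["city"] == incoming["city"]:
                return dimension  # nothing changed, keep the current row
            # Close the old version instead of overwriting it
            # (overwriting in place would be Type 1).
            row["valid_to"] = today
            row["is_current"] = False
            break
    # Append the new version (this also covers a brand-new customer).
    dimension.append({
        "customer_id": incoming["customer_id"],
        "city": incoming["city"],
        "valid_from": today,
        "valid_to": None,
        "is_current": True,
    })
    return dimension


# Hypothetical sample data: one customer who moves from Pune to Delhi.
dim = [{"customer_id": 1, "city": "Pune", "valid_from": date(2023, 1, 1),
        "valid_to": None, "is_current": True}]
apply_scd2(dim, {"customer_id": 1, "city": "Delhi"}, date(2024, 6, 1))
for row in dim:
    print(row["city"], row["valid_from"], row["valid_to"], row["is_current"])
```

By contrast, Type 1 would simply overwrite city in place, and Type 3 would keep the previous value in an extra column such as previous_city (again a hypothetical name) on the same row.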

Choosing file formats, databases, and data warehouse solutions

  • Parquet vs Avro vs ORC vs CSV: when to use which format
  • Data warehouse
    • Hive/BigQuery/GCP/Snowflake
      • Star schema vs Snowflake schema
  • Databases
    • Graph database
    • Relational database
    • NoSQL database
    • Time-series database
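
One way to reason about the file format question in the list above: Parquet and ORC are columnar formats, while Avro and CSV are row-oriented. The toy sketch below models the two layouts with plain Python structures to show why an analytical query that needs one column touches less data in a columnar layout. It is an in-memory illustration only, not the actual on-disk encoding of any of these formats, and the sample records and function names are hypothetical.

```python
# Row-oriented layout (how Avro and CSV organize data): one full record at a time.
rows = [
    {"order_id": 1, "country": "IN", "amount": 120.0},
    {"order_id": 2, "country": "US", "amount": 80.5},
    {"order_id": 3, "country": "IN", "amount": 42.0},
]


def to_columnar(records):
    """Pivot row-oriented records into one list per column (Parquet/ORC style)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_from_rows(records):
    # Every full record has to be visited, even though only "amount" is needed.
    return sum(record["amount"] for record in records)


def total_amount_from_columns(columns):
    # Only the "amount" column is read; the other columns are never touched.
    return sum(columns["amount"])


columns = to_columnar(rows)
print(total_amount_from_rows(rows))
print(total_amount_from_columns(columns))
```

As a rule of thumb, the columnar layout suits analytical scans over a few columns, while the row layout suits write-heavy or record-at-a-time workloads such as streaming ingestion.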

Processing

  1. Spark Streaming – for real-time/stream processing

Storage

In this layer, we store clean data for business usage. Generally, this data is stored in a data warehouse like Snowflake, Redshift, BigQuery, or Fabric.