Slowly Changing Dimensions (SCD) are used in data warehousing to manage historical changes in dimension tables. There are several types of SCDs, each handling data changes differently.

Types of SCDs with Examples and Implementations

SCD Type 0 (Fixed Dimension)

  • No changes are allowed once the data is inserted or No changes are allowed after the initial insertion.

Implementation of SCD Type 0

  • No update queries; only inserts are allowed.
CREATE TABLE product (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(50)
);

SCD Type 1 Overwrite (No History Tracking)

  • Old Values are overwritten when updates occur, and no history is maintained.
  • Example: A customer’s address is updated in the record, but the old address is lost.

Implementation of SCD Type 1
SQL Approach

UPDATE customer
SET address = 'New Address'
WHERE customer_id = 101;

PySpark Implementation

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("SCD1").getOrCreate()

data = [(101, "John Doe", "Old Address"),
        (102, "Jane Doe", "Another Address")]

columns = ["customer_id", "name", "address"]
df = spark.createDataFrame(data, columns)

updated_data = [(101, "John Doe", "New Address")]
df_updates = spark.createDataFrame(updated_data, columns)

df_final = df.alias("old").join(df_updates.alias("new"), "customer_id", "left") \
    .select(
        col("old.customer_id"),
        col("old.name"),
        col("new.address").alias("address")
    )

df_final.show()

SCD Type 2 (History Tracking)

  • Maintains historical records by adding new rows with versioning or effective date ranges.
  • Example: Employee salary changes over time (mark old salary inactive and new salary active.)

Implementation
SQL Approach

CREATE TABLE employee (
    employee_id INT,
    name VARCHAR(100),
    salary DECIMAL(10,2),
    start_date DATE,
    end_date DATE NULL,
    is_active BOOLEAN,
    PRIMARY KEY (employee_id, start_date)
);

-- Insert new record
INSERT INTO employee (employee_id, name, salary, start_date, end_date, is_active)
VALUES (1, 'John Doe', 50000, '2024-01-01', NULL, TRUE);

-- When salary updates, mark old record inactive and insert a new one
UPDATE employee
SET end_date = '2024-06-01', is_active = FALSE
WHERE employee_id = 1 AND is_active = TRUE;

INSERT INTO employee (employee_id, name, salary, start_date, end_date, is_active)
VALUES (1, 'John Doe', 55000, '2024-06-02', NULL, TRUE);

PySpark Implementation

from pyspark.sql.functions import lit

data = [(1, "John Doe", 50000, "2024-01-01", None, True)]
columns = ["employee_id", "name", "salary", "start_date", "end_date", "is_active"]
df = spark.createDataFrame(data, columns)

# Updating salary (SCD Type 2 logic)
df_old = df.withColumn("end_date", lit("2024-06-01")).withColumn("is_active", lit(False))
df_new = spark.createDataFrame([(1, "John Doe", 55000, "2024-06-02", None, True)], columns)

df_final = df_old.union(df_new)
df_final.show()

SCD Type 3 (Limited History)

  • Maintains limited history by adding new columns for previous values.
  • Example: A customer’s last two addresses are stored in separate columns.

Implementation
SQL Approach

ALTER TABLE customer ADD COLUMN previous_address VARCHAR(255);

UPDATE customer
SET previous_address = address, address = 'New Address'
WHERE customer_id = 101;

PySpark Implementation

df = df.withColumn("previous_address", col("address"))
df = df.withColumn("address", lit("New Address"))
df.show()

SCD Type 4 (Hybrid)

  • Maintains history in a separate historical table while keeping the latest data in the main table.
  • Example: A student’s enrollment history is stored separately, while the current record is in the main table.

Implementation
SQL Approach

CREATE TABLE student (
    student_id INT PRIMARY KEY,
    name VARCHAR(100),
    current_grade VARCHAR(10)
);

CREATE TABLE student_history (
    student_id INT,
    name VARCHAR(100),
    grade VARCHAR(10),
    change_date DATE
);

-- Update current table and move old record to history
INSERT INTO student_history (student_id, name, grade, change_date)
SELECT student_id, name, current_grade, CURRENT_DATE FROM student WHERE student_id = 1;

UPDATE student SET current_grade = '12th' WHERE student_id = 1;

PySpark Implementation

df_history = df.select("customer_id", "name", "address").withColumn("change_date", lit("2024-06-01"))
df_main = df.withColumn("address", lit("New Address"))

df_main.show()
df_history.show()

SCD Type 6 (Hybrid of SCD 1, 2, and 3)

  • Combines history tracking (SCD2) and limited history(SCD3).
  • Example: A customer’s current and previous addresses are stored, while historical data is also maintained.

Implementation
SQL Approach

ALTER TABLE customer ADD COLUMN previous_address VARCHAR(255);

UPDATE customer
SET previous_address = address, address = 'New Address'
WHERE customer_id = 101;

INSERT INTO customer_history (customer_id, name, address, change_date)
VALUES (101, 'John Doe', 'Old Address', '2024-06-01');

PySpark Implementation

df_history = df.withColumn("change_date", lit("2024-06-01"))
df = df.withColumn("previous_address", col("address")).withColumn("address", lit("New Address"))

df_history.show()
df.show()

Choosing the right SCD Type

SCD Type
SCD 0 No changes allowed
SCD 1Overwrites old data.
SCD 2Keeps history with timestamps/flags.
SCD 3Keeps only the last change.
SCD 4Uses a separate history table.
SCD 6Mix of Type 1, 2, and 3.

Each type serves a different purpose, and the choice depends on business requirements and how much historical data needs to be stored.

Leave a Reply

Your email address will not be published. Required fields are marked *