Slowly Changing Dimensions (SCD) are used in data warehousing to manage historical changes in dimension tables. There are several types of SCDs, each handling data changes differently.
Types of SCDs with Examples and Implementations
SCD Type 0 (Fixed Dimension)
- No changes are allowed after the initial insertion; the attribute value is fixed.
- Example: A product SKU (Stock Keeping Unit) that never changes.
Implementation of SCD Type 0
- No update queries; only inserts are allowed.
CREATE TABLE product (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100),
    category VARCHAR(50)
);
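For symmetry with the later types, here is a minimal PySpark sketch of the same append-only behaviour; the SparkSession setup and the registered table name "product" are assumptions for illustration.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SCD0").getOrCreate()
# Type 0: new products are only appended; existing rows are never updated or deleted
new_products = spark.createDataFrame(
    [(3, "Wireless Mouse", "Accessories")],
    ["product_id", "product_name", "category"]
)
new_products.write.mode("append").saveAsTable("product")  # assumed table name; inserts only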
SCD Type 1 (Overwrite, No History Tracking)
- Old values are overwritten when updates occur, and no history is maintained.
- Example: A customer’s address is updated in the record, but the old address is lost.
Implementation of SCD Type 1
SQL Approach
UPDATE customer
SET address = 'New Address'
WHERE customer_id = 101;
PySpark Implementation
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, coalesce

spark = SparkSession.builder.appName("SCD1").getOrCreate()

# Existing dimension data
data = [(101, "John Doe", "Old Address"),
        (102, "Jane Doe", "Another Address")]
columns = ["customer_id", "name", "address"]
df = spark.createDataFrame(data, columns)

# Incoming updates
updated_data = [(101, "John Doe", "New Address")]
df_updates = spark.createDataFrame(updated_data, columns)

# Overwrite the address where an update exists; keep the old value for unchanged customers
df_final = df.alias("old").join(df_updates.alias("new"), "customer_id", "left") \
    .select(
        col("old.customer_id"),
        col("old.name"),
        coalesce(col("new.address"), col("old.address")).alias("address")
    )
df_final.show()
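Because Type 1 keeps no history, persisting the result is simply an overwrite of the stored dimension; a minimal sketch, assuming a hypothetical Parquet path for the customer dimension.
# Hypothetical storage location; overwriting the previous snapshot is the essence of Type 1
df_final.write.mode("overwrite").parquet("/warehouse/dim_customer")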
SCD Type 2 (History Tracking)
- Maintains historical records by adding new rows with versioning or effective date ranges.
- Example: Employee salary changes over time (the old salary record is marked inactive and a new active record is inserted).
Implementation
SQL Approach
CREATE TABLE employee (
    employee_id INT,
    name VARCHAR(100),
    salary DECIMAL(10,2),
    start_date DATE,
    end_date DATE NULL,
    is_active BOOLEAN,
    PRIMARY KEY (employee_id, start_date)
);
-- Insert new record
INSERT INTO employee (employee_id, name, salary, start_date, end_date, is_active)
VALUES (1, 'John Doe', 50000, '2024-01-01', NULL, TRUE);
-- When salary updates, mark old record inactive and insert a new one
UPDATE employee
SET end_date = '2024-06-01', is_active = FALSE
WHERE employee_id = 1 AND is_active = TRUE;
INSERT INTO employee (employee_id, name, salary, start_date, end_date, is_active)
VALUES (1, 'John Doe', 55000, '2024-06-02', NULL, TRUE);
PySpark Implementation
from pyspark.sql.functions import lit

# Explicit schema so the all-NULL end_date column does not break type inference
schema = "employee_id INT, name STRING, salary INT, start_date STRING, end_date STRING, is_active BOOLEAN"
data = [(1, "John Doe", 50000, "2024-01-01", None, True)]
df = spark.createDataFrame(data, schema)

# Updating salary (SCD Type 2 logic): close the old record, then append a new active one
df_old = df.withColumn("end_date", lit("2024-06-01")).withColumn("is_active", lit(False))
df_new = spark.createDataFrame([(1, "John Doe", 55000, "2024-06-02", None, True)], schema)
df_final = df_old.union(df_new)
df_final.show()
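With the effective dates and active flag in place, current and point-in-time queries become simple filters; a small sketch against df_final above (the as-of date is an arbitrary example).
from pyspark.sql.functions import col
# Current state of the dimension: only the active rows
df_final.filter(col("is_active") == True).show()
# Point-in-time lookup: the salary effective on 2024-03-15 (ISO date strings compare correctly)
df_final.filter(
    (col("start_date") <= "2024-03-15") &
    (col("end_date").isNull() | (col("end_date") >= "2024-03-15"))
).show()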
SCD Type 3 (Limited History)
- Maintains limited history by adding new columns for previous values.
- Example: A customer’s current and previous addresses are stored in separate columns.
Implementation
SQL Approach
ALTER TABLE customer ADD COLUMN previous_address VARCHAR(255);
UPDATE customer
SET previous_address = address, address = 'New Address'
WHERE customer_id = 101;
PySpark Implementation
df = df.withColumn("previous_address", col("address"))
df = df.withColumn("address", lit("New Address"))
df.show()
SCD Type 4 (Hybrid)
- Maintains history in a separate historical table while keeping the latest data in the main table.
- Example: A student’s grade history is stored in a separate history table, while the current grade is kept in the main table.
Implementation
SQL Approach
CREATE TABLE student (
    student_id INT PRIMARY KEY,
    name VARCHAR(100),
    current_grade VARCHAR(10)
);
CREATE TABLE student_history (
    student_id INT,
    name VARCHAR(100),
    grade VARCHAR(10),
    change_date DATE
);
-- Update current table and move old record to history
INSERT INTO student_history (student_id, name, grade, change_date)
SELECT student_id, name, current_grade, CURRENT_DATE FROM student WHERE student_id = 1;
UPDATE student SET current_grade = '12th' WHERE student_id = 1;
PySpark Implementation
# Snapshot the current row into the history DataFrame, then overwrite the value in the main table
df_history = df.select("customer_id", "name", "address").withColumn("change_date", lit("2024-06-01"))
df_main = df.withColumn("address", lit("New Address"))
df_main.show()
df_history.show()
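To view the full change history alongside the current state, the two DataFrames can be stacked; a small sketch assuming the column layout above.
from pyspark.sql.functions import lit
# Align the columns, then stack the current row on top of the historical snapshots
df_all = df_main.select("customer_id", "name", "address") \
    .withColumn("change_date", lit(None).cast("string")) \
    .unionByName(df_history)
df_all.show()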
SCD Type 6 (Hybrid of SCD 1, 2, and 3)
- Combines overwriting of current values (SCD 1), row-based history tracking (SCD 2), and previous-value columns (SCD 3).
- Example: A customer’s current and previous addresses are stored, while historical data is also maintained.
Implementation
SQL Approach
ALTER TABLE customer ADD COLUMN previous_address VARCHAR(255);
UPDATE customer
SET previous_address = address, address = 'New Address'
WHERE customer_id = 101;
-- Also record the old value in a history table (assumed to be structured like the SCD Type 4 history table)
INSERT INTO customer_history (customer_id, name, address, change_date)
VALUES (101, 'John Doe', 'Old Address', '2024-06-01');
PySpark Implementation
# Capture the current state (with a change date) before applying the Type 1/Type 3 update
df_history = df.withColumn("change_date", lit("2024-06-01"))
df = df.withColumn("previous_address", col("address")).withColumn("address", lit("New Address"))
df_history.show()
df.show()
Choosing the right SCD Type
| SCD Type | Behavior |
|----------|----------|
| SCD 0 | No changes allowed. |
| SCD 1 | Overwrites old data. |
| SCD 2 | Keeps history with timestamps/flags. |
| SCD 3 | Keeps only the last change. |
| SCD 4 | Uses a separate history table. |
| SCD 6 | Mix of Types 1, 2, and 3. |
Each type serves a different purpose, and the choice depends on business requirements and how much historical data needs to be stored.
Slowly Changing Dimensions help a data warehouse track historical changes efficiently. When choosing an SCD type, start from how much history the business needs: Type 1 overwrites with no history, Type 2 keeps complete history through versioned rows, and Type 3 stores only the previous value in extra columns, while Types 4 and 6 combine these ideas with separate history tables. It also helps to distinguish slowly changing dimensions from rapidly changing dimensions when deciding how to track changes. The SQL and PySpark examples above can be adapted as starting points for handling SCDs in your own data architecture.