Why Your Data History Matters: Intro to Slowly Changing Dimensions
1. Introduction
Imagine you’re running an online store. One of your loyal customers, Sarah, moves to a new city and updates her shipping address. A few months later, she gets promoted at her company and starts ordering more frequently. Now here’s a question: should your data warehouse remember her old address and old job title, or just keep the most recent version?
This might sound like a small detail—but in the world of analytics, it’s a big deal.
When building a data warehouse, you're not just collecting data; you're preserving history. Understanding how your data changes over time helps you answer questions like:
-
How did sales look when Sarah lived in her old city?
-
Did customer behavior shift after price changes?
-
How many people upgraded their memberships last year?
To handle questions like these, you need more than just raw facts. You need a smart way to track changes in your data over time—without losing what came before.
That’s where the concept of Slowly Changing Dimensions (SCD) comes in. It’s a fancy name, but the idea is simple: not all data stays the same forever, and we need a strategy to deal with that.
2. What is a Slowly Changing Dimension (SCD)?
Let’s break it down.
In a data warehouse, you’ll often hear about two types of data: facts and dimensions. Facts are your numbers—sales, revenue, clicks, etc. Dimensions are the things that give those numbers context, like customer info, product details, or store locations.
Now, while your facts are usually locked in time (you sold 10 items on July 1st, and that won’t change), your dimensions can evolve. People move, prices change, product names get updated, job titles shift.
A Slowly Changing Dimension is simply a dimension that… changes slowly over time.
But here's the tricky part: how do we handle those changes?
You could just update the existing record with the new info. Simple, right? But that also means losing the old data. In some cases, that’s fine. But in others, it’s a problem—especially when you need to analyze trends over time.
Let’s say you’re tracking a customer’s spending by region. If you overwrite their old address with the new one, your past reports suddenly start showing purchases in the wrong region. Your data becomes misleading.
So instead of overwriting, you need a way to manage those changes properly. That’s what SCD is all about: a strategy for handling changes in your dimension data while preserving the history.
And depending on how much history you want to keep—and how your business uses that data—there are different types of SCD you can choose from.
We’ll get into those next.
3. Why We Care About Data History
So, why go through all this trouble to keep old data? Why not just update everything and move on?
Well, here’s the thing: data history tells a story. And if you erase the past, you lose the ability to understand how and why things changed.
Let’s look at a few simple examples:
-
Customer behavior: Suppose a customer used to live in a rural area and made small, infrequent purchases. Then they moved to a big city and started buying more often. If you overwrite their location, you might wrongly assume city residents always spent that much. In reality, the change happened after the move.
-
Product pricing: A product that cost $100 last year now costs $120. If your product table only keeps the latest price, you can’t accurately calculate last year’s revenue—or explain why sales dropped after the price change.
-
Employee tracking: If an employee changes roles within a company, tracking performance or department-level metrics over time becomes a mess if you don’t store their job history.
In short, without data history, your reports lie—not because of bad intentions, but because the context has been erased.
Keeping historical data allows you to:
-
Perform accurate time-based analysis
-
Understand trends and behavior shifts
-
Investigate cause and effect
-
Make better business decisions based on how things evolved
And that’s exactly what Slowly Changing Dimensions help you do. It’s not just about storing old data—it’s about preserving meaning.
4. Common SCD Types Explained
Now that you understand why data history matters, let’s dive into the different ways we can actually manage that history. These are known as Slowly Changing Dimension (SCD) types.
Each type represents a different strategy for handling changes in your dimension data. Let’s walk through the most common ones, using simple examples.
SCD Type 0 – No Change Allowed
Think of this as "read-only" data. Once it's in the warehouse, it stays that way—forever.
✅ Use this when: The data should never change, like a product SKU or birthdate.
📦 Example: A product with a unique ID and name that must stay the same for audit purposes.
SCD Type 1 – Overwrite the Old Data
This is the "just update it" method. When a change happens, the old value is replaced with the new one. No history is kept.
✅ Use this when: You don’t care about the old value, or the change is a correction (like fixing a typo).
📦 Example: Updating a customer’s email address when they switch accounts.
⚠️ Caution: If you rely on old values for analysis, this type will erase your history.
SCD Type 2 – Keep Full History
This is the most popular type. When a change happens, instead of overwriting, you insert a new row with the updated info, and mark the old row as expired.
✅ Use this when: You want to track every change over time.
📦 Example: A customer moves to a new city. The old row ends (with an end date), and a new row starts.
💡 Bonus: You can add metadata like effective_date, end_date, and is_current to make querying easier.
SCD Type 3 – Track Limited History
This method stores only the previous value along with the current one—usually in the same row.
✅ Use this when: You only care about the most recent change, not the entire history.
📦 Example: A customer’s current and previous job title.
⚠️ Limitation: You lose anything beyond the last change.
(Optional Mentions – Advanced Types)
Some systems use hybrid types like:
-
SCD Type 4: Keep current data in one table, and full history in a separate history table.
-
SCD Type 6: A combo of Type 1, 2, and 3 (yes, really).
But if you’re just starting out, focusing on Type 1 and Type 2 is enough for 90% of use cases.
So, which type should you use?
It depends on business needs, query requirements, and storage trade-offs. But the good news is: once you understand these basic types, choosing the right one becomes much easier.
4. Common SCD Types Explained
Now that you understand why data history matters, let’s walk through the most common types of Slowly Changing Dimensions (SCD), along with simple examples and sample tables.
SCD Type 0 – No Change Allowed
This is the strictest type. Once a value is inserted, it stays locked. No updates allowed.
✅ Use this when: The data should never change, like a birthdate or a government-issued ID.
📦 Example: You store customer birthdates and do not allow edits, even if entered wrong.
🧾 Table (Customer Dimension):
| customer_id | name | birthdate |
|---|---|---|
| 101 | Sarah Lee | 1990-05-10 |
Even if Sarah says her birthdate was entered wrong, it won’t be changed. It's fixed.
SCD Type 1 – Overwrite the Old Data
The simplest strategy. You just update the existing row with the new value—no history.
✅ Use this when: You don't need to keep old values, like correcting typos.
📦 Example: Fixing a misspelled name.
🧾 Before Update:
| customer_id | name | city |
|---|---|---|
| 101 | Srah Lee | Bandung |
🧾 After Update:
| customer_id | name | city |
|---|---|---|
| 101 | Sarah Lee | Bandung |
The previous typo is gone. There’s no record of “Srah.”
SCD Type 2 – Keep Full History
This type creates a new row every time a change happens, so all historical versions are preserved.
✅ Use this when: You need to track changes over time.
📦 Example: Customer changes city.
🧾 Table (Customer Dimension):
| customer_id | name | city | start_date | end_date | is_current |
|---|---|---|---|---|---|
| 101 | Sarah Lee | Bandung | 2022-01-01 | 2023-04-01 | FALSE |
| 101 | Sarah Lee | Jakarta | 2023-04-01 | NULL | TRUE |
Each row represents a version of Sarah’s data. Queries can now reflect her correct location at any point in time.
SCD Type 3 – Track Limited History
This stores both current and previous values in the same row, typically in separate columns.
✅ Use this when: You only care about the last change—not the full history.
📦 Example: You want to store a customer’s previous and current city.
🧾 Table (Customer Dimension):
| customer_id | name | current_city | previous_city |
|---|---|---|---|
| 101 | Sarah Lee | Jakarta | Bandung |
You only get the last change. If Sarah moves again, “Bandung” will be replaced.
(Bonus Mentions – Advanced Types)
-
SCD Type 4: Separate history table for old versions.
-
SCD Type 6: A hybrid of Types 1, 2, and 3—track full history, update current values, and keep limited history—all in one.
These are more advanced and used in complex scenarios. As a beginner, focusing on Type 1 and Type 2 will cover most needs.
Having clear examples and tables helps make the differences between SCD types easier to understand. The next section will bring these concepts into real-world situations.
5. Real-World Examples
Now that we’ve covered the theory, let’s make it real.
Understanding SCD types is one thing—but choosing the right type in practice depends on the situation. Let’s go through a few real-world scenarios and see how SCD applies.
🛍️ Example 1: Customer Address Change (SCD Type 2)
Sarah, your loyal customer, moves from Bandung to Jakarta. Should we just update her address?
If you're doing regional sales analysis over time, the answer is no.
You want to know how much she spent while she lived in Bandung vs. after she moved to Jakarta.
✅ Best choice: SCD Type 2
You create a new row for Sarah’s Jakarta address and mark the old one as expired. This way, your reports stay accurate—because they reflect the correct location at the time of the transaction.
💼 Example 2: Job Title Update (SCD Type 3)
Let’s say you’re tracking employee promotions, but you only care about what their current and previous title was (not the full history).
✅ Best choice: SCD Type 3
In this case, you can keep both the current and previous job title in the same row, like:
current_title = 'Manager', previous_title = 'Senior Analyst'.
It’s quick, light, and gives you just enough history for basic reporting.
🧾 Example 3: Fixing a Typo in Product Name (SCD Type 1)
You found a mistake—someone entered “iPohne 14” instead of “iPhone 14”. This is clearly a typo.
✅ Best choice: SCD Type 1
No need to keep the wrong value around. Just update the product name directly. No one needs a history of typos.
🏢 Example 4: Department Changes with Full Audit (SCD Type 2 or 4)
Imagine your HR team needs a full audit trail of which departments employees worked in over the years.
✅ Best choice: SCD Type 2 (or Type 4 if you want to store history separately)
This lets you track changes over time, so you can answer questions like:
“How long did John work in Sales before moving to Marketing?”
As you can see, the “right” SCD type depends on how much history you need, how you use the data, and how important accuracy is for your reports.
6. How to Implement SCD
So, you’ve picked the right SCD type—now what?
Let’s talk about how to actually implement Slowly Changing Dimensions in your data warehouse. Don’t worry, we’ll keep it simple and tool-agnostic so you can apply the ideas whether you use SQL, dbt, Airflow, or even a basic ETL tool.
🛠️ SCD Type 1 – Overwrite in Place
This one’s easy. You just find the record and update it directly.
Example SQL:
✅ No need for extra columns or logic.
⚠️ Just remember: this erases the old value, forever.
🛠️ SCD Type 2 – Insert New Row with Versioning
This requires a bit more setup, but it's the most powerful.
You’ll typically add columns like:
-
start_date -
end_date -
is_current(or a similar flag)
Basic workflow:
-
Check if the incoming data has changes.
-
If there’s a change:
-
Set
end_dateon the old row. -
Insert a new row with
start_date = todayandis_current = TRUE.
-
Example SQL (simplified):
✅ Preserves full history
⚠️ Requires deduplication and versioning logic
🛠️ SCD Type 3 – Add Columns for Previous Value
No need to insert new rows—just shift values around.
Example SQL:
✅ Lightweight and fast
⚠️ Only supports one level of history
🧰 Common Tools That Can Help
You can build SCD logic with:
-
SQL: for full control
-
dbt: with macros like
dbt_utils.surrogate_keyand versioned models -
ETL Tools (e.g., Talend, Informatica, Apache NiFi): many have built-in SCD components
-
Airflow or Prefect: to orchestrate SCD logic in Python or SQL
Some cloud DWH tools like BigQuery, Snowflake, or Redshift may not offer built-in SCD features, but they’re easy to implement using views, merge queries, or scheduled jobs.
💡 Tips for Clean SCD Implementation
-
Always use a surrogate key (an internal ID) instead of relying only on natural keys like
customer_id -
Add timestamps like
created_atorupdated_atfor traceability -
Validate changes carefully—small changes like whitespace or casing may not need a version update
-
Test your logic with a few rows before applying it to production
Implementing SCD might look tricky at first, but once you understand the logic, it becomes second nature. Start small, test often, and document your design choices—your future self will thank you.
7. Best Practices for Beginners
You’ve learned what SCD is, how it works, and how to implement it. Great! But before you start applying it in the wild, here are some simple but powerful best practices to help you avoid common mistakes.
✅ 1. Let Business Needs Guide You
Don't choose an SCD type just because it's more "advanced."
Always ask:
-
Does the business care about the history of this field?
-
How will this data be used in reports?
If the answer is no, then maybe you don’t need SCD at all—or just use Type 1.
📝 2. Document the SCD Strategy for Each Dimension
Keep track of which dimension uses which SCD type, and why.
Your future self (or teammate) will need to know:
-
Which columns are tracked
-
How versioning works
-
What logic defines a “change”
Use a simple table like this:
| Dimension | SCD Type | Tracked Columns | Notes |
|---|---|---|---|
| dim_customer | Type 2 | city, status | Full history by row version |
| dim_product | Type 1 | product_name | Overwrite only |
| dim_employee | Type 3 | current_title | One-level job history |
🧪 3. Test with Realistic Scenarios
Before rolling out your SCD logic to production, run it on sample data. Simulate changes, check results, and make sure:
-
The old rows are preserved (for Type 2)
-
New versions are inserted properly
-
Queries using
is_current = TRUEreturn the latest data
🔁 4. Be Careful with Updates from Source Systems
Sometimes, upstream systems make small changes that aren’t meaningful—like trimming whitespace or changing letter case.
✅ Tip: Add logic to ignore these “false changes,” or you’ll end up versioning data for no real reason.
🧩 5. Build Reusable Components
If you’re using tools like dbt, SQL scripts, or ETL jobs, make your SCD logic modular and reusable:
-
Create macros or stored procedures for SCD Type 2 logic
-
Use helper columns (
is_current,row_hash, etc.) -
Build standard views to always query the latest data
This saves time and reduces bugs.
🗣️ 6. Communicate with Stakeholders
SCD is about data behavior over time, so it’s closely tied to business logic.
✅ Make sure you’re on the same page with:
-
Analysts (so their dashboards stay accurate)
-
Business users (so they get what they expect)
-
Engineers (so they know how changes are handled)
Mastering Slowly Changing Dimensions isn’t just about code—it's about understanding the why behind data changes and building systems that respect that history.
8. Conclusion
Data doesn’t live in a bubble—it evolves. Customers move, employees get promoted, products change names, and prices go up (or down). If your data warehouse can’t keep up with these changes, your reports will lose meaning—and trust.
That’s why Slowly Changing Dimensions (SCD) matter.
They’re not just about saving old records—they’re about capturing the story behind the numbers. Whether you’re correcting a typo (Type 1), tracking every version of a customer profile (Type 2), or keeping just the last two job titles (Type 3), the goal is the same: preserve truth over time.
As a beginner data warehouse engineer, don’t worry if all this feels new or a bit complex. Start small:
-
Pick the right SCD type based on business needs.
-
Practice with sample data.
-
Document everything clearly.
-
Talk to your team and stakeholders.
With time, you’ll see that managing data history isn’t just a technical task—it’s a mindset.
So the next time someone asks, “Do we need to keep the old value?”, you’ll know exactly what to say—and more importantly, how to handle it.
Post a Comment