Compare Tables Dynamically: Simplified SQL Approach



Ever needed to compare two tables in BigQuery but dreaded the manual work? Maybe the tables have slightly different schemas, or you only want to check specific columns. Manually writing comparison queries can be tedious and error-prone. But don’t worry, there’s a way to make this painless with a dynamic SQL approach!

In this article, I’ll walk you through how to create a flexible query that compares tables dynamically, highlights the differences, and keeps things clean and efficient.


The Problem: Comparing Tables Isn’t Always Straightforward

Let’s say you’re working with two tables, dwh_prod and dwh_dev. Both have similar data but may differ in some columns or rows. Your task? Compare them to:

  • Match columns side by side.
  • Highlight differences row by row.
  • Do this dynamically so the query works even if the column list changes.

Sounds like a lot, right? Here’s how to make it easier.


The Solution: Dynamic SQL in Action

Instead of hardcoding the column names, we can use BigQuery’s INFORMATION_SCHEMA.COLUMNS to fetch metadata dynamically. Then, we’ll construct the comparison query programmatically. Here’s what happens step by step:

  1. Choose Columns to Compare
    Select specific columns from both tables using metadata. This keeps the query focused and avoids comparing unnecessary data.

    SELECT column_name
    FROM `region-asia-southeast2.INFORMATION_SCHEMA.COLUMNS`
    WHERE table_name = 'dwh_prod'
      AND column_name IN ('customer_name', 'customer_address')
    
  2. Cast Columns to a Common Format
    To avoid mismatched data types, cast everything to STRING. This ensures compatibility when comparing values from the two tables.

    Example:

    CAST(customer_name AS STRING) AS customer_name
    
  3. Align Columns Side by Side
    Display matching columns from both tables next to each other for easy comparison. For instance:

    SELECT 
      dwh_prod.customer_name AS customer_name_prod,
      dwh_dev.customer_name AS customer_name_dev
    
  4. Highlight Differences
    Add a column that flags mismatched fields. If a value is different, it will show the column name in a summary.

    Example:

    ARRAY_TO_STRING(ARRAY(SELECT column_name 
      FROM UNNEST(['customer_name', 'customer_address']) 
      WHERE dwh_prod.column_name IS DISTINCT FROM dwh_dev.column_name), ', ') AS different_columns
    

Dynamic Query in Action

Here’s a simplified version of the final query:

WITH table1_data AS (
  SELECT CAST(customer_name AS STRING) AS customer_name
  FROM `dwh_prod.customers`
),
table2_data AS (
  SELECT CAST(customer_name AS STRING) AS customer_name
  FROM `dwh_dev.customers`
)
SELECT
  prod.customer_name AS customer_name_prod,
  dev.customer_name AS customer_name_dev,
  CASE 
    WHEN prod.customer_name IS DISTINCT FROM dev.customer_name THEN 'customer_name'
    ELSE NULL
  END AS different_columns
FROM table1_data prod
FULL JOIN table2_data dev
ON prod.primary_key = dev.primary_key;

This query dynamically builds the comparison logic, aligns matching columns, and identifies differences in a clean and efficient way.


Why Use This Approach?

Here’s why this method is a game-changer:

  • No Hardcoding: It works dynamically even if column names or schemas change.
  • Readability: Matching columns are displayed side by side for easy comparison.
  • Scalability: Handles large datasets efficiently, perfect for BigQuery's strengths.

Try It Out!

Whether you’re reconciling datasets or debugging data pipelines, this approach is a must-have in your BigQuery toolbox. Give it a try, and let me know how it simplifies your work!



-- Deklarasi variabel
DECLARE project_id STRING DEFAULT 'myproject';
DECLARE table1 STRING DEFAULT 'dwh_prod.customers'; -- Nama tabel pertama
DECLARE table2 STRING DEFAULT 'dwh_dev.customers'; -- Nama tabel kedua
DECLARE alias1 STRING DEFAULT 'prod'; -- Alias tabel pertama
DECLARE alias2 STRING DEFAULT 'dev'; -- Alias tabel kedua
DECLARE primary_key STRING DEFAULT 'customer_id'; -- Nama primary key
DECLARE included_columns ARRAY DEFAULT ['customer_name', 'customer_address']; -- Kolom yang di-include

-- Variabel untuk menyimpan query hasil
DECLARE comparison_query STRING;

-- Membuat query dinamis
SET comparison_query = (
  WITH
    -- Mendapatkan metadata kolom dari tabel pertama yang sesuai dengan included_columns
    columns_to_compare AS (
      SELECT column_name
      FROM `region-asia-southeast2.INFORMATION_SCHEMA.COLUMNS`
      WHERE table_name = SPLIT(table1, '.')[OFFSET(1)]
        AND table_schema = SPLIT(table1, '.')[OFFSET(0)]
        AND column_name IN UNNEST(included_columns)
    )
  SELECT
    '''
    WITH 
    table1_data AS (
      SELECT ''' || STRING_AGG(
          'CAST(' || column_name || ' AS STRING) AS ' || column_name, ', '
      ) || ', CAST(' || primary_key || ' AS STRING) AS ' || primary_key || ''' 
      FROM ''' || table1 || '''
    ),
    table2_data AS (
      SELECT ''' || STRING_AGG(
        'CAST(' || column_name || ' AS STRING) AS ' || column_name, ', '
      ) || ', CAST(' || primary_key || ' AS STRING) AS ' || primary_key || ''' 
      FROM ''' || table2 || '''
    )
    SELECT
      ''' || alias1 || '.' || primary_key || ''' AS primary_key,
      ''' || STRING_AGG(
            alias1 || '.' || column_name || ' AS ' || column_name || '_' || alias1 || ', ' || 
            alias2 || '.' || column_name || ' AS ' || column_name || '_' || alias2, ', '
        ) || ''',
      ARRAY_TO_STRING(ARRAY(SELECT column_name
        FROM UNNEST([ ''' || STRING_AGG(
            'IF(' || alias1 || '.' || column_name || ' IS DISTINCT FROM ' || alias2 || '.' || column_name || 
            ', "' || column_name || '", NULL)', 
            ', '
          ) || '''
        ]) AS column_name WHERE column_name IS NOT NULL), ', ') AS different_columns
    FROM table1_data ''' || alias1 || '''
    FULL JOIN table2_data ''' || alias2 || '''
      ON ''' || alias1 || '.' || primary_key || ' = ' || alias2 || '.' || primary_key || '''
    WHERE ''' || STRING_AGG(
        '(' || alias1 || '.' || column_name || ' IS DISTINCT FROM ' || alias2 || '.' || column_name || ')', 
        ' OR '
      ) || '''
    ''' AS final_query
  FROM columns_to_compare
);

-- Menampilkan query hasil
SELECT comparison_query;