Comparing Two SQL Tables: A Deep Dive into Efficient Techniques for Data Professionals

Ensuring data integrity and managing changes in database environments often requires the ability to effectively compare SQL tables. Whether you’re validating data migrations, auditing database modifications, or simply trying to understand the differences between two datasets, knowing how to efficiently compare SQL tables is a crucial skill for any data professional.

This article provides a comprehensive guide to comparing two SQL tables, focusing on set-based operations and schema analysis to identify discrepancies and ensure data consistency. We’ll explore practical techniques using SQL Server, applicable to scenarios ranging from simple data validation to complex database auditing.

Setting Up Your Comparison Environment

To illustrate the techniques discussed, let’s establish a test environment with two sample tables, Original and Revised. These tables will represent the initial and modified states of customer data, allowing us to clearly demonstrate comparison methods.

USE [SQLShack]
GO
/****** Object:  Table [dbo].[Original]    Script Date: 9/14/2017 7:57:37 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Original](
    [CustId] [int] IDENTITY(1,1) NOT NULL,
    [CustName] [varchar](255) NOT NULL,
    [CustAddress] [varchar](255) NOT NULL,
    [CustPhone] [numeric](12, 0) NULL,
 PRIMARY KEY CLUSTERED
(
    [CustId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
/****** Object:  Table [dbo].[Revised]    Script Date: 9/14/2017 7:57:37 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Revised](
    [CustId] [int] IDENTITY(1,1) NOT NULL,
    [CustName] [varchar](255) NOT NULL,
    [CustAddress] [varchar](255) NOT NULL,
    [CustPhone] [numeric](12, 0) NULL,
 PRIMARY KEY CLUSTERED
(
    [CustId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

--Populate Original Table
SET IDENTITY_INSERT [dbo].[Original] ON
GO
INSERT [dbo].[Original] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (1, N'Salvador', N'1 Main Street North', CAST(76197081653 AS Numeric(12, 0)))
INSERT [dbo].[Original] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (2, N'Edward', N'142 Main Street West', CAST(80414444338 AS Numeric(12, 0)))
INSERT [dbo].[Original] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (3, N'Gilbert', N'51 Main Street East', CAST(23416310745 AS Numeric(12, 0)))
INSERT [dbo].[Original] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (4, N'Nicholas', N'7 Walnut Street', CAST(62051432934 AS Numeric(12, 0)))
INSERT [dbo].[Original] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (5, N'Jorge', N'176 Washington Street', CAST(58796383002 AS Numeric(12, 0)))
INSERT [dbo].[Original] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (6, N'Ernest', N'39 Main Street', CAST(461992109 AS Numeric(12, 0)))
INSERT [dbo].[Original] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (7, N'Stella', N'191 Church Street', CAST(78584836879 AS Numeric(12, 0)))
INSERT [dbo].[Original] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (8, N'Jerome', N'177 Elm Street', CAST(30235760533 AS Numeric(12, 0)))
INSERT [dbo].[Original] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (9, N'Ray', N'214 High Street', CAST(57288772686 AS Numeric(12, 0)))
INSERT [dbo].[Original] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (10, N'Lawrence', N'53 Main Street South', CAST(92544965861 AS Numeric(12, 0)))
GO
SET IDENTITY_INSERT [dbo].[Original] OFF
GO

--Populate Revised Table
SET IDENTITY_INSERT [dbo].[Revised] ON
GO
INSERT [dbo].[Revised] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (1, N'Jerome', N'1 Main Street North', CAST(36096777923 AS Numeric(12, 0)))
INSERT [dbo].[Revised] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (2, N'Lawrence', N'53 Main Street South', CAST(73368786216 AS Numeric(12, 0)))
INSERT [dbo].[Revised] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (3, N'Ray', N'214 High Street', CAST(64765571087 AS Numeric(12, 0)))
INSERT [dbo].[Revised] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (4, N'Gilbert', N'177 Elm Street', CAST(4979477778 AS Numeric(12, 0)))
INSERT [dbo].[Revised] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (5, N'Jorge', N'7 Walnut Street', CAST(88842643373 AS Numeric(12, 0)))
INSERT [dbo].[Revised] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (6, N'Ernest', N'176 Washington Street', CAST(17153094018 AS Numeric(12, 0)))
INSERT [dbo].[Revised] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (7, N'Edward', N'142 Main Street West', CAST(66115434358 AS Numeric(12, 0)))
INSERT [dbo].[Revised] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (8, N'Stella', N'51 Main Street East', CAST(94093532159 AS Numeric(12, 0)))
INSERT [dbo].[Revised] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (9, N'Nicholas', N'191 Church Street', CAST(54482064421 AS Numeric(12, 0)))
INSERT [dbo].[Revised] ([CustId], [CustName], [CustAddress], [CustPhone]) VALUES (10, N'Salvador', N'39 Main Street', CAST(94689656558 AS Numeric(12, 0)))
GO
SET IDENTITY_INSERT [dbo].[Revised] OFF
GO

This SQL script sets up two tables, Original and Revised, both structured to hold customer information. Initially, these tables contain distinct datasets. This setup allows us to explore different comparison scenarios, from identifying completely different tables to pinpointing subtle variations in data.

Leveraging Set-Based Operations for Efficient Comparison

SQL databases are built upon set theory, providing powerful set operations that are highly efficient for comparing data. These operations allow us to treat tables as sets of rows and perform comparisons using mathematical principles. The core set operations we’ll utilize are UNION, INTERSECT, and EXCEPT.

Unveiling Common and Unique Rows with INTERSECT and EXCEPT

INTERSECT helps identify common rows between two tables, while EXCEPT reveals the rows that are unique to one table compared to another.

To find rows present in both Original and Revised tables, we use INTERSECT:

SELECT CustId, CustName, CustAddress, CustPhone
FROM Original
INTERSECT
SELECT CustId, CustName, CustAddress, CustPhone
FROM Revised;

With our initial setup, this query returns no rows: although both tables use CustId values 1 through 10, no row matches across every column.

To identify rows present in Original but not in Revised, we use EXCEPT:

SELECT CustId, CustName, CustAddress, CustPhone
FROM Original
EXCEPT
SELECT CustId, CustName, CustAddress, CustPhone
FROM Revised;

In our initial setup, this will return all rows from Original. Conversely, reversing the table order in the EXCEPT clause will show rows unique to Revised.

These set operations provide a foundational method to quickly understand the overlap and differences between entire rows in two tables.

This image visually represents the result of an INTERSECT operation, highlighting the rows that are identical in both the Original and Revised tables.
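The third set operator, UNION, rounds out the toolkit: it can stitch the two EXCEPT results into a single difference report. The following sketch (using the same column list as above) tags each differing row with the table it came from:

```sql
-- Sketch: a two-sided difference report built from both EXCEPT results.
-- UNION ALL is used for the final combination because the two halves are
-- disjoint by construction, so UNION's duplicate-removal step is wasted work.
SELECT 'Original' AS SourceTable, CustId, CustName, CustAddress, CustPhone
FROM (
    SELECT CustId, CustName, CustAddress, CustPhone FROM Original
    EXCEPT
    SELECT CustId, CustName, CustAddress, CustPhone FROM Revised
) AS OnlyInOriginal
UNION ALL
SELECT 'Revised', CustId, CustName, CustAddress, CustPhone
FROM (
    SELECT CustId, CustName, CustAddress, CustPhone FROM Revised
    EXCEPT
    SELECT CustId, CustName, CustAddress, CustPhone FROM Original
) AS OnlyInRevised;
```

This one query answers "what differs, and on which side?" in a single result set, which is often more convenient than running the two EXCEPT queries separately.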

Addressing the Nuances of Non-Keyed Tables

The examples so far assume keyed tables, where each row is uniquely identifiable. However, when comparing non-keyed tables that may contain duplicate rows, the behavior of EXCEPT, INTERSECT, and UNION changes slightly as they inherently eliminate duplicates.

Consider temporary tables @t1 and @t2 with duplicate entries:

DECLARE @t1 TABLE(a INT, b INT);
INSERT INTO @t1 VALUES
(1, 2),
(1, 2),
(2, 3),
(3, 4);

DECLARE @t2 TABLE(a INT, b INT);
INSERT INTO @t2 VALUES
(1, 2),
(2, 3),
(2, 3),
(3, 4);

SELECT * FROM @t1
EXCEPT
SELECT * FROM @t2;

Surprisingly, this EXCEPT query returns an empty result, even though @t1 has a duplicate row (1, 2) that @t2 does not. This is because EXCEPT removes duplicates before comparison.

To accurately compare non-keyed tables that contain duplicates, a technique must account for how many times each row appears, not merely whether it appears at all; basic set operations alone cannot do this.
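Though a full treatment of duplicate-aware comparison is beyond our scope, one workable sketch is to group each table on all of its columns and compare the resulting row counts. This assumes the @t1 and @t2 table variables declared above are still in scope (table variables only live for the duration of their batch):

```sql
-- Sketch: compare row multiplicities rather than distinct rows.
-- Group each table on all columns, then line the counts up with a
-- FULL OUTER JOIN; any mismatch in counts marks a real difference.
-- (The join predicate does not match NULLs; the sample data has none.)
WITH t1 AS (
    SELECT a, b, COUNT(*) AS cnt FROM @t1 GROUP BY a, b
),
t2 AS (
    SELECT a, b, COUNT(*) AS cnt FROM @t2 GROUP BY a, b
)
SELECT COALESCE(t1.a, t2.a) AS a,
       COALESCE(t1.b, t2.b) AS b,
       ISNULL(t1.cnt, 0) AS CountInT1,
       ISNULL(t2.cnt, 0) AS CountInT2
FROM t1
FULL OUTER JOIN t2
    ON t1.a = t2.a AND t1.b = t2.b
WHERE ISNULL(t1.cnt, 0) <> ISNULL(t2.cnt, 0);
```

With the sample data above, this reports (1, 2) appearing twice in @t1 but once in @t2, and (2, 3) appearing once in @t1 but twice in @t2 — exactly the duplicates that EXCEPT discarded.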

Simulating Realistic Data Overlap for Meaningful Comparison

To make our table comparisons more practical, let’s introduce some overlap between the Original and Revised tables. Because both tables already use CustId values 1 through 10, re-inserting rows from one table into the other would violate the primary key; instead, we copy a random half of the rows from Original over their counterparts in Revised, creating a scenario where the tables share some data but also retain unique entries.

--Copy roughly half of the rows from Original over the matching rows in Revised
UPDATE r
SET r.CustName    = o.CustName,
    r.CustAddress = o.CustAddress,
    r.CustPhone   = o.CustPhone
FROM Revised AS r
JOIN (
    SELECT TOP 50 PERCENT CustId, CustName, CustAddress, CustPhone
    FROM Original
    ORDER BY NEWID()
) AS o
    ON r.CustId = o.CustId;

This script randomly selects about half the rows and makes them identical in both tables, ensuring some common data while retaining differences. Now, our INTERSECT and EXCEPT queries will yield more informative results, showing both shared and unique rows.

SELECT CustId, CustName, CustAddress, CustPhone
FROM Revised
EXCEPT
SELECT CustId, CustName, CustAddress, CustPhone
FROM Original;

This EXCEPT query will now return rows that are in Revised but not in Original, showcasing the differences after our data mixing.

This image exemplifies the output of an EXCEPT operation, displaying the rows that are present in the Revised table but absent from the Original table.

Moving Beyond Row-Level Differences: Identifying Column-Specific Changes

While identifying differing rows is valuable, often we need to pinpoint which columns within those rows have changed. For instance, a customer’s address might be updated while their name remains the same. To achieve this granularity, we need to compare tables column by column.

Imagine needing to report not just that rows are different, but also generate a summary like this:

This image demonstrates a summary of column differences, showing the count of rows where each specific column (CustId, CustName, CustAddress, CustPhone) differs between the two tables.

This level of detail requires a more refined approach than simple EXCEPT and INTERSECT operations alone.

Verifying Table Structure: Schema Comparison with sys.columns

Before diving into data comparison, it’s crucial to ensure that the schemas of the two tables are comparable. Comparing tables with different column structures or data types can lead to inaccurate or misleading results. SQL Server’s sys.columns system catalog view provides metadata about table columns, enabling schema comparison.

SELECT *
FROM sys.columns
WHERE object_id = OBJECT_ID(N'Original')
EXCEPT
SELECT *
FROM sys.columns
WHERE object_id = OBJECT_ID(N'Revised');

This query compares the column definitions of Original and Revised. Note, however, that object-specific columns such as object_id (and potentially default_object_id) always differ between two tables, so this naive query returns rows even when the schemas are otherwise identical. It is a starting point that needs refinement.

To refine schema comparison, we can exclude object-specific columns that are expected to differ:

IF OBJECT_ID(N'tempdb..#Source', N'U') IS NOT NULL
    DROP TABLE #Source;
SELECT TOP (0) *
INTO #Source
FROM sys.columns;

INSERT INTO #Source
SELECT *
FROM sys.columns c
WHERE c.object_id = OBJECT_ID(N'Original');

ALTER TABLE #Source DROP COLUMN object_id, default_object_id;

--Repeat for the target table (Revised) into #Target
IF OBJECT_ID(N'tempdb..#Target', N'U') IS NOT NULL
    DROP TABLE #Target;
SELECT TOP (0) *
INTO #Target
FROM sys.columns;

INSERT INTO #Target
SELECT *
FROM sys.columns c
WHERE c.object_id = OBJECT_ID(N'Revised');

ALTER TABLE #Target DROP COLUMN object_id, default_object_id;

SELECT * FROM #Source
EXCEPT
SELECT * FROM #Target;

By creating temporary tables and excluding potentially differing columns, we ensure a more accurate schema comparison, focusing on the core column definitions relevant for data comparison.
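Another option — a sketch rather than a complete schema diff — is to join the two tables’ column metadata directly on column name and report mismatches in type, length, precision, scale, or nullability. (Columns that exist in only one of the tables would still need a separate check.)

```sql
-- Sketch: report columns whose definitions differ between the two tables.
-- Joins sys.columns to itself on column name, filtered to each table's
-- object_id, and flags any mismatch in the core type attributes.
SELECT o.name                     AS ColumnName,
       TYPE_NAME(o.user_type_id)  AS OriginalType,
       o.max_length               AS OriginalLength,
       TYPE_NAME(r.user_type_id)  AS RevisedType,
       r.max_length               AS RevisedLength
FROM sys.columns o
JOIN sys.columns r
    ON o.name = r.name
WHERE o.object_id = OBJECT_ID(N'Original')
  AND r.object_id = OBJECT_ID(N'Revised')
  AND (o.user_type_id <> r.user_type_id
       OR o.max_length  <> r.max_length
       OR o.precision   <> r.precision
       OR o.scale       <> r.scale
       OR o.is_nullable <> r.is_nullable);
```

For our identically defined Original and Revised tables, this query returns no rows, confirming the schemas are safe to compare row by row.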

Introducing Data Modifications for Column-Level Comparison

To demonstrate column-level difference detection, let’s intentionally modify data in the Revised table, changing CustName, CustAddress, and CustPhone values in a subset of rows:

WITH
    MixUpCustName(CustId, CustName, CustAddress, CustPhone) AS (
        SELECT TOP 50 PERCENT
            CustId,
            SUBSTRING(CustName, 6, LEN(CustName)) + LEFT(CustName, 5),
            CustAddress,
            CustPhone
        FROM Original ORDER BY NEWID()
    ),
    MixUpCustAddress(CustId, CustName, CustAddress, CustPhone) AS (
        SELECT TOP 50 PERCENT
            CustId,
            CustName,
            SUBSTRING(CustAddress, 6, LEN(CustAddress)) + LEFT(CustAddress, 5),
            CustPhone
        FROM Original ORDER BY NEWID()
    ),
    MixUpCustPhone(CustId, CustName, CustAddress, CustPhone) AS (
        SELECT TOP 50 PERCENT
            CustId,
            CustName,
            CustAddress,
            CAST(CustPhone / 100 AS INT) + 42
        FROM Original ORDER BY NEWID()
    ),
    MixItUp(CustId, CustName, CustAddress, CustPhone) AS
    (
        SELECT CustId, CustName, CustAddress, CustPhone
        FROM MixUpCustName
        UNION
        SELECT CustId, CustName, CustAddress, CustPhone
        FROM MixUpCustAddress
        UNION
        SELECT CustId, CustName, CustAddress, CustPhone
        FROM MixUpCustPhone
    )
    --Main query to mix up the data
UPDATE Revised
SET Revised.CustName = MixItUp.CustName,
    Revised.CustAddress = MixItUp.CustAddress,
    Revised.CustPhone = MixItUp.CustPhone
FROM Revised
JOIN MixItUp
    ON Revised.CustId = MixItUp.CustId;

This script randomly modifies data in Revised, creating variations in CustName, CustAddress, and CustPhone columns across different rows. Now, we can effectively demonstrate techniques for identifying these specific column differences.

Gathering Basic Comparison Statistics

Before delving into column-level details, let’s obtain high-level statistics summarizing the overall differences between the tables. This provides a quick overview of the extent of changes.

WITH
    InOriginal AS (
        SELECT * FROM Original
        EXCEPT
        SELECT * FROM Revised
    ),
    InRevised AS (
        SELECT * FROM Revised
        EXCEPT
        SELECT * FROM Original
    ),
    InBoth AS (
        SELECT * FROM Revised
        INTERSECT
        SELECT * FROM Original
    )
SELECT
    (SELECT COUNT(*) FROM Original) AS Original,
    (SELECT COUNT(*) FROM Revised) AS Revised,
    (SELECT COUNT(*) FROM InOriginal) AS InOriginal,
    (SELECT COUNT(*) FROM InRevised) AS InRevised,
    (SELECT COUNT(*) FROM InBoth) AS InBoth;

This query provides counts of total rows in each table, rows unique to Original, rows unique to Revised, and rows common to both. These statistics offer a valuable initial assessment of the data divergence.

This image illustrates the output of the basic statistics query, showing counts of rows in the Original and Revised tables, as well as the number of rows unique to each and common to both.

Pinpointing Column-Level Differences Using EXCEPT in Conditional Logic

To identify which columns differ within changed rows, we can leverage the EXCEPT operator in a clever way within conditional logic. The core idea is to check if EXCEPT returns any rows when comparing individual columns between corresponding rows in the Original and Revised tables.

The expression WHERE NOT EXISTS(SELECT a.col EXCEPT SELECT b.col) elegantly checks if two columns (a.col and b.col) from matched rows are identical, even handling NULL values correctly.
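To see why this idiom beats a plain comparison operator, consider a small illustration: with ordinary operators, NULL <> NULL evaluates to UNKNOWN, so rows where both sides are NULL would be silently skipped.

```sql
-- Illustration: comparing two NULLs with <> versus the EXCEPT idiom.
DECLARE @a INT = NULL, @b INT = NULL;

SELECT
    CASE WHEN @a <> @b THEN 'different' ELSE 'not flagged' END AS PlainCompare,
    CASE WHEN EXISTS (SELECT @a EXCEPT SELECT @b)
         THEN 'different' ELSE 'same' END AS ExceptCompare;
-- PlainCompare yields 'not flagged' because NULL <> NULL is UNKNOWN,
-- while ExceptCompare correctly reports 'same': EXCEPT treats two NULLs
-- as equal and a NULL versus a non-NULL value as different.
```

This NULL-safe behavior is what makes the EXCEPT trick reliable for column-by-column comparison without sprinkling ISNULL or COALESCE throughout the predicates.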

Applying this technique, we can construct a query to identify differing columns for each customer:

IF OBJECT_ID(N'tempdb..#diffcols', N'U') IS NOT NULL
    DROP TABLE #diffcols;

SELECT
    src.CustId,
    CONCAT(
        IIF(EXISTS(SELECT src.CustId EXCEPT SELECT tgt.CustId), RTRIM(', CustId   '), ''),
        IIF(EXISTS(SELECT src.CustName EXCEPT SELECT tgt.CustName), RTRIM(', CustName   '), ''),
        IIF(EXISTS(SELECT src.CustAddress EXCEPT SELECT tgt.CustAddress), RTRIM(', CustAddress   '), ''),
        IIF(EXISTS(SELECT src.CustPhone EXCEPT SELECT tgt.CustPhone), RTRIM(', CustPhone   '), '')
    ) + ', ' AS cols
INTO #diffcols
FROM Original src
JOIN Revised tgt
    ON src.CustId = tgt.CustId
WHERE EXISTS(SELECT src.* EXCEPT SELECT tgt.*);

This query joins Original and Revised on CustId. For each joined row pair where the rows are different (detected by EXISTS(SELECT src.* EXCEPT SELECT tgt.*)), it generates a comma-separated list of column names (cols) that differ.

This image displays the content of the temporary table #diffcols, which lists the CustId and a comma-separated string ‘cols’ indicating the columns that differ for each customer between the Original and Revised tables.

We can then easily join this temporary table #diffcols back to Original and Revised to examine the actual data differences in context:

SELECT *
FROM Original o
JOIN Revised r
    ON o.CustId = r.CustId
JOIN #diffcols d
    ON o.CustId = d.CustId;

This provides a detailed view of the rows with differences, highlighting the specific columns that have changed.

This image showcases the result of joining the Original, Revised, and #diffcols tables, providing a comprehensive view of rows that differ and the specific column values from both tables for those rows.

Generating Detailed Column Difference Statistics

Finally, to create the column difference summary we envisioned earlier, we can analyze the #diffcols temporary table. By parsing the comma-separated cols string, we can count the occurrences of each column name, providing a statistical breakdown of column-level changes.

WITH src AS (
    SELECT
        SUM(IIF(d.cols LIKE '%, CustId, %', 1, 0)) AS CustId,
        SUM(IIF(d.cols LIKE '%, CustName, %', 1, 0)) AS CustName,
        SUM(IIF(d.cols LIKE '%, CustAddress, %', 1, 0)) AS CustAddress,
        SUM(IIF(d.cols LIKE '%, CustPhone, %', 1, 0)) AS CustPhone
    FROM #diffcols d
)
SELECT
    ca.col AS ColumnName,
    ca.diff AS [Difference Count]
FROM src
CROSS APPLY (VALUES
    ('CustId', CustId),
    ('CustName', CustName),
    ('CustAddress', CustAddress),
    ('CustPhone', CustPhone)
) ca(col, diff)
WHERE diff > 0
ORDER BY diff DESC;

This query uses CROSS APPLY to unpivot the counts from the src CTE, turning one wide row of counts into one row per column name for easy readability. The LIKE operator checks for the presence of each column name, with its surrounding comma delimiters, within the cols string.

This image presents the final column difference statistics, listing each column name (ColumnName) and the number of rows where that column differs between the Original and Revised tables (Difference Count).

This provides the desired summary, showing the count of differences for each column, giving a clear picture of the types of changes that occurred between the two tables.

Streamlining Comparisons with Stored Procedures

While the queries presented are effective, encapsulating them within a stored procedure can greatly simplify and automate the table comparison process. A stored procedure can take table names and key columns as parameters, dynamically generate and execute the comparison queries, and return the difference statistics, making it reusable and efficient for routine data validation tasks.
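As a sketch (the procedure name and parameters here are hypothetical, and CREATE OR ALTER requires SQL Server 2016 SP1 or later), such a procedure might look like this, using QUOTENAME to guard the dynamic identifiers:

```sql
-- Sketch: wrap the two-sided EXCEPT comparison in a reusable procedure.
-- Dynamic SQL is built from the supplied table names; QUOTENAME prevents
-- injection through the identifiers. Assumes both tables share one schema.
CREATE OR ALTER PROCEDURE dbo.usp_CompareTables
    @SourceTable sysname,
    @TargetTable sysname
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @sql nvarchar(max) = N'
        SELECT ''OnlyInSource'' AS Side, * FROM (
            SELECT * FROM ' + QUOTENAME(@SourceTable) + N'
            EXCEPT
            SELECT * FROM ' + QUOTENAME(@TargetTable) + N'
        ) s
        UNION ALL
        SELECT ''OnlyInTarget'', * FROM (
            SELECT * FROM ' + QUOTENAME(@TargetTable) + N'
            EXCEPT
            SELECT * FROM ' + QUOTENAME(@SourceTable) + N'
        ) t;';

    EXEC sys.sp_executesql @sql;
END;
GO
```

Calling EXEC dbo.usp_CompareTables N'Original', N'Revised'; would then return the full two-sided difference report; extending the procedure to accept key columns and emit column-level statistics follows the same dynamic-SQL pattern.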

Conclusion: Mastering SQL Table Comparison for Data Integrity

Comparing SQL tables effectively is paramount for maintaining data integrity, validating changes, and ensuring the accuracy of database operations. This article has explored powerful set-based operations and schema analysis techniques in SQL Server to facilitate comprehensive table comparisons.

By leveraging INTERSECT, EXCEPT, and sys.columns, and applying conditional logic for column-level analysis, data professionals can gain deep insights into data differences, streamline data validation workflows, and confidently manage database changes. These techniques are essential tools in the arsenal of any SQL developer or database administrator seeking to maintain robust and reliable data environments.


Author: Gerald Britton, Senior SQL Server Solution Designer, Author, Software Developer, Teacher, and Microsoft Data Platform MVP.
