Comparing Two DataFrames in Pandas: A Comprehensive Guide

In data analysis with Python, the pandas library is indispensable for manipulating and analyzing datasets. A common task is to compare two DataFrames to identify differences, whether for data validation, auditing changes, or understanding data evolution. Pandas provides a powerful and flexible function, compare(), specifically designed for this purpose. This guide will delve into how to effectively use the compare() function in pandas for Pandas Comparing Two Dataframes, enhancing your data analysis workflow.

Understanding DataFrame.compare()

The DataFrame.compare() function in pandas is a method that allows you to compare a DataFrame with another DataFrame and highlight the differences between them. It returns a new DataFrame that succinctly presents these discrepancies, making it easy to pinpoint where your DataFrames diverge.

Let’s break down the key parameters of the compare() function to understand its versatility:

  • other: This is the first essential parameter, representing the other DataFrame you want to compare against your original DataFrame. The comparison is performed element-wise between the calling DataFrame and this other DataFrame.

  • align_axis: This parameter dictates the orientation of the comparison results. It accepts two possible values:

    • 0 or 'index': When set to 0 or 'index', the differences are stacked vertically. Rows from the original DataFrame (self) and the other DataFrame are interleaved in the resulting DataFrame. This is useful when you want to see row-by-row differences.
    • 1 or 'columns' (default): With 1 or 'columns', the differences are aligned horizontally. Columns from both DataFrames are shown side-by-side in the output, making it convenient to compare column values.
  • keep_shape: A boolean parameter that controls the shape of the resulting DataFrame.

    • False (default): If keep_shape is False, the resulting DataFrame will only include rows and columns where differences were found. This provides a compact view of only the changes.
    • True: Setting keep_shape to True ensures that the output DataFrame retains all rows and columns from the original DataFrames. Rows and columns without differences will be filled with NaN values, allowing you to see the differences within the context of the original structure.
  • keep_equal: Another boolean parameter that determines how equal values are handled in the comparison result.

    • False (default): When keep_equal is False, values that are the same in both DataFrames are represented as NaN in the output, focusing attention solely on the differences.
    • True: If keep_equal is True, the output will retain the equal values alongside the differing ones. This can be helpful when you need to see both the similarities and differences in the context of the comparison.
  • result_names: This parameter, introduced in pandas version 1.5.0, allows you to customize the names used to identify the original DataFrames in the comparison result, particularly in the MultiIndex. It accepts a tuple of two strings, where the first string represents the name for the original DataFrame (self) and the second for the other DataFrame. The default is ('self', 'other').

The compare() function returns a DataFrame that highlights the differences. Crucially, the index of the resulting DataFrame becomes a MultiIndex when differences are found, with ‘self’ and ‘other’ levels indicating the source of the compared values.

It’s important to note that DataFrame.compare() raises a ValueError if the two DataFrames being compared do not have identical labels (both row indices and column names) or if their shapes are not the same. This ensures a meaningful element-wise comparison is possible.

Furthermore, when comparing DataFrames, pandas treats matching NaN values as not being different. This behavior is consistent with how pandas handles NaN values in other operations.

Practical Examples of Comparing DataFrames with Pandas

Let’s illustrate the compare() function with practical examples. We’ll start with creating two sample DataFrames:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    columns=["col1", "col2", "col3"],
)
print("DataFrame df:")
print(df)
DataFrame df:
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0
df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0
print("nDataFrame df2 (modified copy of df):")
print(df2)
DataFrame df2 (modified copy of df):
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

Now, let’s explore different ways to use compare() with these DataFrames.

Basic Comparison: Aligning on Columns (Default)

By default, compare() aligns the comparison on columns (align_axis=1).

comparison_df_col = df.compare(df2)
print("nComparison aligned on columns:")
print(comparison_df_col)
Comparison aligned on columns:
  col1      col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

In this output, we see a MultiIndex for columns (‘col1’ and ‘col3’) and a two-level index (‘self’, ‘other’). For ‘col1’, row 0 shows ‘a’ from df (‘self’) and ‘c’ from df2 (‘other’), indicating a difference. For ‘col3’, row 2 shows ‘3.0’ from df and ‘4.0fromdf2, another difference. Rows and columns without differences are omitted by default (keep_shape=False,keep_equal=False`).

Customizing Result Names

We can customize the ‘self’ and ‘other’ labels using result_names.

comparison_df_custom_names = df.compare(df2, result_names=("left", "right"))
print("nComparison with custom result names:")
print(comparison_df_custom_names)
Comparison with custom result names:
  col1      col3
  left right left right
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

Here, ‘self’ and ‘other’ are replaced by ‘left’ and ‘right’ in the MultiIndex, as specified by result_names.

Aligning Differences on Rows

To stack differences vertically, set align_axis=0 (or 'index').

comparison_df_row = df.compare(df2, align_axis=0)
print("nComparison aligned on rows:")
print(comparison_df_row)
Comparison aligned on rows:
      col1  col3
0 self    a   NaN
  other   c   NaN
2 self  NaN   3.0
  other NaN   4.0

Now, the index is a MultiIndex with levels ‘0’ and ‘2’, representing the rows where differences occur, and a second level ‘self’ and ‘other’ indicating the source DataFrame for each row.

Keeping Equal Values

To include equal values in the output (represented as non-NaN), use keep_equal=True.

comparison_df_keep_equal = df.compare(df2, keep_equal=True)
print("nComparison keeping equal values:")
print(comparison_df_keep_equal)
Comparison keeping equal values:
  col1      col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0

In this result, even where values are the same (e.g., ‘col3’ in row 0 for ‘self’ and ‘other’), they are shown instead of NaN.

Keeping Original Shape

To retain the original shape of the DataFrames in the comparison output, set keep_shape=True.

comparison_df_keep_shape = df.compare(df2, keep_shape=True)
print("nComparison keeping original shape:")
print(comparison_df_keep_shape)
Comparison keeping original shape:
  col1      col2      col3
  self other self other self other
0    a     c  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN  NaN   NaN
2  NaN   NaN  NaN   NaN  3.0   4.0
3  NaN   NaN  NaN   NaN  NaN   NaN
4  NaN   NaN  NaN   NaN  NaN   NaN

The output now has the same rows and columns as the original DataFrames. Rows and columns without differences are filled with NaN across both ‘self’ and ‘other’ levels.

Keeping Original Shape and Equal Values

Combining keep_shape=True and keep_equal=True gives the most comprehensive comparison, showing all original rows and columns, and displaying both equal and unequal values.

comparison_df_all_kept = df.compare(df2, keep_shape=True, keep_equal=True)
print("nComparison keeping original shape and equal values:")
print(comparison_df_all_kept)
Comparison keeping original shape and equal values:
  col1      col2      col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0

This output presents a full side-by-side comparison of all values, clearly showing both where the DataFrames are the same and where they differ.

Conclusion

The DataFrame.compare() function in pandas is an invaluable tool for effectively pandas comparing two dataframes. By understanding its parameters like align_axis, keep_shape, keep_equal, and result_names, you can tailor the comparison output to suit your specific needs. Whether you need a concise view of only the differences or a detailed side-by-side comparison of entire DataFrames, compare() offers the flexibility and power required for robust data analysis and manipulation in pandas. This function significantly streamlines the process of identifying changes and discrepancies between datasets, making your data workflows more efficient and insightful.

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply

Your email address will not be published. Required fields are marked *