In data analysis with Python, the pandas library is indispensable for manipulating and analyzing datasets. A common task is to compare two DataFrames to identify differences, whether for data validation, auditing changes, or understanding data evolution. Pandas provides a powerful and flexible function, compare()
, specifically designed for this purpose. This guide will delve into how to effectively use the compare()
function in pandas for Pandas Comparing Two Dataframes, enhancing your data analysis workflow.
Understanding DataFrame.compare()
The DataFrame.compare()
function in pandas is a method that allows you to compare a DataFrame with another DataFrame and highlight the differences between them. It returns a new DataFrame that succinctly presents these discrepancies, making it easy to pinpoint where your DataFrames diverge.
Let’s break down the key parameters of the compare()
function to understand its versatility:
-
other
: This is the first essential parameter, representing the other DataFrame you want to compare against your original DataFrame. The comparison is performed element-wise between the calling DataFrame and thisother
DataFrame. -
align_axis
: This parameter dictates the orientation of the comparison results. It accepts two possible values:0
or'index'
: When set to0
or'index'
, the differences are stacked vertically. Rows from the original DataFrame (self
) and theother
DataFrame are interleaved in the resulting DataFrame. This is useful when you want to see row-by-row differences.1
or'columns'
(default): With1
or'columns'
, the differences are aligned horizontally. Columns from both DataFrames are shown side-by-side in the output, making it convenient to compare column values.
-
keep_shape
: A boolean parameter that controls the shape of the resulting DataFrame.False
(default): Ifkeep_shape
isFalse
, the resulting DataFrame will only include rows and columns where differences were found. This provides a compact view of only the changes.True
: Settingkeep_shape
toTrue
ensures that the output DataFrame retains all rows and columns from the original DataFrames. Rows and columns without differences will be filled withNaN
values, allowing you to see the differences within the context of the original structure.
-
keep_equal
: Another boolean parameter that determines how equal values are handled in the comparison result.False
(default): Whenkeep_equal
isFalse
, values that are the same in both DataFrames are represented asNaN
in the output, focusing attention solely on the differences.True
: Ifkeep_equal
isTrue
, the output will retain the equal values alongside the differing ones. This can be helpful when you need to see both the similarities and differences in the context of the comparison.
-
result_names
: This parameter, introduced in pandas version 1.5.0, allows you to customize the names used to identify the original DataFrames in the comparison result, particularly in the MultiIndex. It accepts a tuple of two strings, where the first string represents the name for the original DataFrame (self
) and the second for theother
DataFrame. The default is('self', 'other')
.
The compare()
function returns a DataFrame that highlights the differences. Crucially, the index of the resulting DataFrame becomes a MultiIndex when differences are found, with ‘self’ and ‘other’ levels indicating the source of the compared values.
It’s important to note that DataFrame.compare()
raises a ValueError
if the two DataFrames being compared do not have identical labels (both row indices and column names) or if their shapes are not the same. This ensures a meaningful element-wise comparison is possible.
Furthermore, when comparing DataFrames, pandas treats matching NaN
values as not being different. This behavior is consistent with how pandas handles NaN
values in other operations.
Practical Examples of Comparing DataFrames with Pandas
Let’s illustrate the compare()
function with practical examples. We’ll start with creating two sample DataFrames:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"col1": ["a", "a", "b", "b", "a"],
"col2": [1.0, 2.0, 3.0, np.nan, 5.0],
"col3": [1.0, 2.0, 3.0, 4.0, 5.0],
},
columns=["col1", "col2", "col3"],
)
print("DataFrame df:")
print(df)
DataFrame df:
col1 col2 col3
0 a 1.0 1.0
1 a 2.0 2.0
2 b 3.0 3.0
3 b NaN 4.0
4 a 5.0 5.0
df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0
print("nDataFrame df2 (modified copy of df):")
print(df2)
DataFrame df2 (modified copy of df):
col1 col2 col3
0 c 1.0 1.0
1 a 2.0 2.0
2 b 3.0 4.0
3 b NaN 4.0
4 a 5.0 5.0
Now, let’s explore different ways to use compare()
with these DataFrames.
Basic Comparison: Aligning on Columns (Default)
By default, compare()
aligns the comparison on columns (align_axis=1
).
comparison_df_col = df.compare(df2)
print("nComparison aligned on columns:")
print(comparison_df_col)
Comparison aligned on columns:
col1 col3
self other self other
0 a c NaN NaN
2 NaN NaN 3.0 4.0
In this output, we see a MultiIndex for columns (‘col1’ and ‘col3’) and a two-level index (‘self’, ‘other’). For ‘col1’, row 0 shows ‘a’ from df
(‘self’) and ‘c’ from df2
(‘other’), indicating a difference. For ‘col3’, row 2 shows ‘3.0’ from df
and ‘4.0from
df2, another difference. Rows and columns without differences are omitted by default (
keep_shape=False,
keep_equal=False`).
Customizing Result Names
We can customize the ‘self’ and ‘other’ labels using result_names
.
comparison_df_custom_names = df.compare(df2, result_names=("left", "right"))
print("nComparison with custom result names:")
print(comparison_df_custom_names)
Comparison with custom result names:
col1 col3
left right left right
0 a c NaN NaN
2 NaN NaN 3.0 4.0
Here, ‘self’ and ‘other’ are replaced by ‘left’ and ‘right’ in the MultiIndex, as specified by result_names
.
Aligning Differences on Rows
To stack differences vertically, set align_axis=0
(or 'index'
).
comparison_df_row = df.compare(df2, align_axis=0)
print("nComparison aligned on rows:")
print(comparison_df_row)
Comparison aligned on rows:
col1 col3
0 self a NaN
other c NaN
2 self NaN 3.0
other NaN 4.0
Now, the index is a MultiIndex with levels ‘0’ and ‘2’, representing the rows where differences occur, and a second level ‘self’ and ‘other’ indicating the source DataFrame for each row.
Keeping Equal Values
To include equal values in the output (represented as non-NaN), use keep_equal=True
.
comparison_df_keep_equal = df.compare(df2, keep_equal=True)
print("nComparison keeping equal values:")
print(comparison_df_keep_equal)
Comparison keeping equal values:
col1 col3
self other self other
0 a c 1.0 1.0
2 b b 3.0 4.0
In this result, even where values are the same (e.g., ‘col3’ in row 0 for ‘self’ and ‘other’), they are shown instead of NaN
.
Keeping Original Shape
To retain the original shape of the DataFrames in the comparison output, set keep_shape=True
.
comparison_df_keep_shape = df.compare(df2, keep_shape=True)
print("nComparison keeping original shape:")
print(comparison_df_keep_shape)
Comparison keeping original shape:
col1 col2 col3
self other self other self other
0 a c NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN 3.0 4.0
3 NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
The output now has the same rows and columns as the original DataFrames. Rows and columns without differences are filled with NaN
across both ‘self’ and ‘other’ levels.
Keeping Original Shape and Equal Values
Combining keep_shape=True
and keep_equal=True
gives the most comprehensive comparison, showing all original rows and columns, and displaying both equal and unequal values.
comparison_df_all_kept = df.compare(df2, keep_shape=True, keep_equal=True)
print("nComparison keeping original shape and equal values:")
print(comparison_df_all_kept)
Comparison keeping original shape and equal values:
col1 col2 col3
self other self other self other
0 a c 1.0 1.0 1.0 1.0
1 a a 2.0 2.0 2.0 2.0
2 b b 3.0 3.0 3.0 4.0
3 b b NaN NaN 4.0 4.0
4 a a 5.0 5.0 5.0 5.0
This output presents a full side-by-side comparison of all values, clearly showing both where the DataFrames are the same and where they differ.
Conclusion
The DataFrame.compare()
function in pandas is an invaluable tool for effectively pandas comparing two dataframes. By understanding its parameters like align_axis
, keep_shape
, keep_equal
, and result_names
, you can tailor the comparison output to suit your specific needs. Whether you need a concise view of only the differences or a detailed side-by-side comparison of entire DataFrames, compare()
offers the flexibility and power required for robust data analysis and manipulation in pandas. This function significantly streamlines the process of identifying changes and discrepancies between datasets, making your data workflows more efficient and insightful.