Comparing two DataFrames in Pandas to pinpoint differences is a common task in data analysis and manipulation. Pandas offers a straightforward way to do this: the compare() function. This guide walks through the compare() function and its parameters so that you can efficiently highlight discrepancies between two DataFrames.
Understanding the compare() Function in Pandas
The compare() function in Pandas (called as a DataFrame method, and available since pandas 1.1) identifies and displays the differences between two DataFrames that share the same index and column labels. If the labels do not match, it raises a ValueError instead of aligning the data for you. By default it returns a DataFrame containing only the discrepancies, making it easy to spot variations in your datasets. This is particularly useful when you’re tracking changes in data, debugging data transformations, or checking data consistency across different sources.
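The requirement for identical labels can be illustrated with a small sketch. The frames left and right and the column x are made up for this illustration, and restricting both frames to their shared labels is only one possible workaround, not something compare() does itself:

```python
import pandas as pd

# Hypothetical frames: `right` is missing the row labelled 2.
left = pd.DataFrame({"x": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"x": [1, 9]}, index=[0, 1])

mismatch_error = None
try:
    left.compare(right)
except ValueError as exc:
    # compare() rejects frames whose index or columns differ.
    mismatch_error = str(exc)
    print("compare() refused:", mismatch_error)

# One possible workaround: compare only the labels both frames share.
common_index = left.index.intersection(right.index)
common_cols = left.columns.intersection(right.columns)
diff = left.loc[common_index, common_cols].compare(
    right.loc[common_index, common_cols]
)
print(diff)
```

Note that this workaround silently ignores rows or columns that exist in only one frame, so it is worth checking for those separately if they matter.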
The basic syntax for using the compare() function is as follows:
df1.compare(other_dataframe, align_axis=1, keep_shape=False, keep_equal=False, result_names=('self', 'other'))
Let’s break down the key parameters that control the behavior of the compare() function.
Key Parameters of compare()
- other: The second DataFrame, the one you want to compare against your initial DataFrame. This is the only required parameter. The comparison is performed element-wise between the two DataFrames.
- align_axis: Determines the axis along which the comparison results are aligned. It accepts two possible values:
  - 1 or 'columns' (default): Differences are displayed horizontally, with values from the original DataFrame (‘self’) and the ‘other’ DataFrame shown side-by-side.
  - 0 or 'index': Differences are stacked vertically, with rows alternating between the original DataFrame (‘self’) and the ‘other’ DataFrame.
- keep_shape: A boolean parameter that determines the shape of the resulting DataFrame.
  - False (default): Only rows and columns containing differences are included in the result, leading to a potentially smaller DataFrame focused solely on changes.
  - True: The result retains all original rows and columns, using NaN to indicate where no differences exist. This option is useful when you need to maintain the original structure for further analysis or reporting.
- keep_equal: Another boolean parameter that controls how equal values are handled in the output.
  - False (default): Equal values are represented as NaN in the output, emphasizing only the differing values.
  - True: Equal values are kept in the resulting DataFrame, allowing you to see both the similarities and differences in context.
- result_names: Customizes the labels used to identify the two source DataFrames in the comparison output. The labels appear in the column headers when align_axis is 1 or 'columns', and in the row index when it is 0 or 'index'. It expects a tuple of two strings, with the first name corresponding to the ‘self’ DataFrame and the second to the ‘other’ DataFrame. By default, these are set to ('self', 'other'). This parameter was added in pandas 1.5.
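As a rough illustration of result_names, the sketch below relabels the two sides of a comparison. The frames before and after and the columns price and stock are invented for this example, and it assumes pandas 1.5 or later:

```python
import pandas as pd

# Hypothetical "before" and "after" snapshots of the same table.
before = pd.DataFrame({"price": [10.0, 12.5, 8.0], "stock": [5, 0, 7]})
after = before.copy()
after.loc[1, "price"] = 13.0  # the only change

# Label the two sides "before"/"after" instead of the default "self"/"other".
diff = before.compare(after, result_names=("before", "after"))
print(diff)

# The custom labels form the inner level of the column MultiIndex.
labels = diff.columns.get_level_values(-1).tolist()
print(labels)
```

Descriptive labels like these tend to make the output easier to read when it is shared in a report or a log.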
Practical Examples of DataFrame Comparison
To illustrate the practical application of the compare() function, let’s consider two example DataFrames, df and df2.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    columns=["col1", "col2", "col3"],
)
[Image: the DataFrame df, with columns col1, col2, and col3 holding string and numeric data.]
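df2 = df.copy()  # independent copy, so df itself stays unchanged
df2.loc[0, 'col1'] = 'c'  # change a string value in row 0
df2.loc[2, 'col3'] = 4.0  # change a numeric value in row 2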
[Image: the DataFrame df2, a copy of df with changes in 'col1' at index 0 and 'col3' at index 2.]
Basic Comparison
By default, df.compare(df2) aligns differences horizontally (align_axis=1), keeps only the rows and columns with differences (keep_shape=False), and shows only differing values (keep_equal=False).
df.compare(df2)
[Image: output of the basic comparison, showing the differences in 'col1' and 'col3' under 'self' and 'other' labels.]
This output highlights the columns col1 and col3, where differences exist, and presents the ‘self’ (from df) and ‘other’ (from df2) values side-by-side for easy comparison.
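For reference, the basic comparison can be reproduced end to end as in the sketch below. The variable names diff, changed_columns, and changed_rows are introduced here for illustration, and the printed table in the comments is what this example should produce:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

diff = df.compare(df2)
print(diff)
#   col1       col3
#   self other self other
# 0    a     c  NaN   NaN
# 2  NaN   NaN  3.0   4.0

# The columns are a two-level MultiIndex: (original column, 'self' or 'other').
changed_columns = diff.columns.get_level_values(0).unique().tolist()
changed_rows = diff.index.tolist()
print(changed_columns, changed_rows)
```

Column col2 and rows 1, 3, and 4 are absent because they contain no differences. The two NaN values in col2 at index 3 count as equal.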
Aligning Differences Vertically
To stack the differences vertically, set align_axis=0:
df.compare(df2, align_axis=0)
[Image: comparison result with vertical alignment, showing 'self' and 'other' values stacked for the differences at index 0 and 2.]
In this view, the ‘self’ and ‘other’ values are stacked, making it easy to read row-wise differences.
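With vertical alignment the result has a two-level row index, which can be sliced to pull out one side of the comparison. The sketch below assumes the same example data; self_rows and other_rows are names chosen for this illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

diff = df.compare(df2, align_axis=0)
print(diff)
#         col1  col3
# 0 self     a   NaN
#   other    c   NaN
# 2 self   NaN   3.0
#   other  NaN   4.0

# The row index is (original row label, 'self' or 'other');
# xs() selects one side by the inner level.
self_rows = diff.xs("self", level=1)
other_rows = diff.xs("other", level=1)
print(self_rows)
print(other_rows)
```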
Keeping Equal Values
To include equal values in the output, use keep_equal=True:
df.compare(df2, keep_equal=True)
[Image: comparison output that includes equal values alongside the differences for columns 'col1' and 'col3'.]
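Now the output still contains only the rows and columns that have at least one difference (rows 0 and 2, columns col1 and col3), but within them the equal values are shown instead of NaN. They appear as identical values under ‘self’ and ‘other’, next to the values that differ.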
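Because equal and differing values now sit side by side, it can help to recompute which cells actually changed. The following is a minimal sketch on the same example data. The is_changed name is made up here, and the plain != check treats a pair of NaN values as different, which does not matter for this data but could for other inputs:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

diff = df.compare(df2, keep_equal=True)
print(diff)
#   col1       col3
#   self other self other
# 0    a     c  1.0   1.0
# 2    b     b  3.0   4.0

# Split the two sides and flag the cells where they disagree.
self_side = diff.xs("self", axis=1, level=1)
other_side = diff.xs("other", axis=1, level=1)
is_changed = self_side != other_side
print(is_changed)
```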
Keeping Original Shape
To maintain the original shape of the DataFrame in the output, set keep_shape=True:
df.compare(df2, keep_shape=True)
[Image: compare result with keep_shape set to True, preserving the original shape and showing NaN where there are no differences.]
With keep_shape=True, the output DataFrame retains all original rows and columns. Every position without a difference is filled with NaN, preserving the original structure.
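One way to use the preserved shape is to derive a boolean mask with the same rows and columns as the original DataFrame. This is only a sketch on the example data, and the changed name is introduced for illustration. It relies on the fact that compare() treats two NaN values as equal, so a differing cell always has a non-NaN value on at least one side:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

diff = df.compare(df2, keep_shape=True)
print(diff.shape)  # 5 rows, and 2 result columns per original column

# A cell changed if either side holds a non-NaN value in the result.
changed = (
    diff.xs("self", axis=1, level=1).notna()
    | diff.xs("other", axis=1, level=1).notna()
)
print(changed)
```

The resulting mask lines up with df, so it can be passed to something like df.where(changed) to blank out the unchanged cells.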
Keeping Original Shape and Equal Values
Combining keep_shape=True and keep_equal=True provides a comprehensive view, showing all original rows and columns with both equal and unequal values:
df.compare(df2, keep_shape=True, keep_equal=True)
[Image: comparison output keeping both shape and equal values, showing all original data alongside the differences.]
This final example shows the full structure of the DataFrames, displaying both the values that are the same and those that differ, which gives a complete side-by-side comparison within the original DataFrame context.
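A runnable version of the combined call is sketched below on the same example data (the full variable name is chosen for illustration). One detail worth noticing is that, with keep_equal=True, a NaN in the result is a genuinely missing value from the source data rather than a marker for "no difference":

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    }
)
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

full = df.compare(df2, keep_shape=True, keep_equal=True)
print(full)
#   col1       col2       col3
#   self other self other self other
# 0    a     c  1.0   1.0  1.0   1.0
# 1    a     a  2.0   2.0  2.0   2.0
# 2    b     b  3.0   3.0  3.0   4.0
# 3    b     b  NaN   NaN  4.0   4.0
# 4    a     a  5.0   5.0  5.0   5.0

# The NaN pair at row 3, col2 comes from the original data (np.nan in df).
print(full.loc[3, "col2"])
```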
Conclusion
The compare() function in Pandas is a convenient tool for identifying and analyzing differences between two DataFrames. By understanding its parameters, namely align_axis, keep_shape, keep_equal, and result_names, you can tailor the comparison output to your needs, whether you’re focused on pinpointing changes, maintaining data structure, or getting a complete view of similarities and differences. Getting comfortable with compare() makes it easier to check data quality and track data modifications in your Python data analysis workflow.