Comparing two Pandas DataFrames for differences is a common task in data analysis. This article explains how to use the compare() method in Pandas to highlight discrepancies between two DataFrames.
Understanding the compare() Method
The compare() method in Pandas compares two DataFrames and returns a new DataFrame that shows only the values that differ, which helps pinpoint changes or inconsistencies between datasets. Both DataFrames must have identical row and column labels; otherwise compare() raises a ValueError. Let’s explore its parameters:
Key Parameters
other: The DataFrame to compare with the original DataFrame.
align_axis: Determines how the comparison is aligned:
    0 or 'index': Differences are stacked vertically, with rows alternating between the two DataFrames.
    1 or 'columns': Differences are aligned horizontally, with columns alternating between the two DataFrames (default).
keep_shape: If True, preserves all rows and columns, even those with identical values. Otherwise, only differing rows and columns are shown (default False).
keep_equal: If True, displays both differing and identical values. If False, identical values are represented as NaN (default False).
result_names: Allows customization of the labels for the compared DataFrames in the output (default ('self', 'other')).
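Of the parameters listed above, result_names is the only one the later examples do not exercise, so here is a minimal sketch of it. It assumes pandas 1.5 or later, where result_names is available. The DataFrame names (left, right), the column names (price, qty), and the labels ("before", "after") are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two snapshots of the same table
left = pd.DataFrame({"price": [1.0, 2.0, 3.0], "qty": [10, 20, 30]})
right = left.copy()
right.loc[1, "price"] = 2.5  # change a single cell

# Label the two sides "before" and "after" instead of "self" and "other"
diff = left.compare(right, result_names=("before", "after"))
print(diff)

# Only row 1 and the "price" column differ, so the result has one row and
# two columns: ("price", "before") and ("price", "after").
```

Descriptive labels like these can make the result easier to read when it is shared with people who did not write the comparison.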
Practical Examples
Let’s illustrate with examples:
Basic Comparison
First, create two DataFrames:
import pandas as pd
import numpy as np
# Original DataFrame; col2 contains one missing value
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "a"],
    "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
}, columns=["col1", "col2", "col3"])
# Copy it, then change two cells so the DataFrames differ
df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0
Now, compare df and df2 using the default settings:
df.compare(df2)
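The result contains only the rows and columns that differ: rows 0 and 2, and columns col1 and col3. Each of those columns is split into a self and an other sub-column, and cells where the two DataFrames agree are shown as NaN. The NaN that both DataFrames hold in col2 counts as equal, so col2 does not appear.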
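As a self-contained sketch of the basic comparison, the snippet below rebuilds the two DataFrames from this article and inspects the structure of the result. The variable name diff is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "a"],
    "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
})
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

# Default settings: align_axis=1, keep_shape=False, keep_equal=False
diff = df.compare(df2)
print(diff)

# Rows that contain at least one difference
print(diff.index.tolist())
# Two-level column labels: (original column, 'self' or 'other')
print(diff.columns.tolist())
```

The two-level columns mean a single side of a column can be selected with a tuple, for example diff[("col1", "other")].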
Aligning by Index
To align the differences by rows, use align_axis=0:
df.compare(df2, align_axis=0)
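A minimal sketch of the row-aligned form, using the same two DataFrames as above. With align_axis=0 the self/other labels move from the columns into a second index level. The name stacked is an illustrative choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "a"],
    "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
})
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

# Stack the two sides vertically: each differing row appears twice,
# once labelled 'self' and once labelled 'other'
stacked = df.compare(df2, align_axis=0)
print(stacked)

# The index now has two levels: (original row label, 'self' or 'other')
print(stacked.index.tolist())
```

This layout tends to be easier to scan when many columns differ, since the column headers stay the same as in the original DataFrames.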
Keeping Equal Values
To display all values, including those that are equal, set keep_equal=True:
df.compare(df2, keep_equal=True)
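The sketch below shows the effect of keep_equal=True on the same data. The result still contains only the differing rows and columns, but matching cells inside them now show their actual values instead of NaN. The name with_equal is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "a"],
    "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
})
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

# Keep equal values inside the rows and columns that contain a difference
with_equal = df.compare(df2, keep_equal=True)
print(with_equal)

# Row 0 differs only in col1, yet col3 now shows 1.0 on both sides
print(with_equal.loc[0, ("col3", "self")], with_equal.loc[0, ("col3", "other")])
```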
Preserving Shape
To maintain the original shape of the DataFrames, use keep_shape=True:
df.compare(df2, keep_shape=True)
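A short sketch of keep_shape=True on the same data. Every original row and column is retained, and cells without a difference are filled with NaN. The name full_shape is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "a"],
    "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
})
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

# Keep all 5 rows and all 3 columns (each split into self/other)
full_shape = df.compare(df2, keep_shape=True)
print(full_shape)
print(full_shape.shape)

# A row with no differences, such as row 1, is entirely NaN
print(full_shape.loc[1].isna().all())
```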
Combining keep_shape and keep_equal
For a comprehensive comparison showing all values in the original structure:
df.compare(df2, keep_shape=True, keep_equal=True)
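With both options enabled, the result is effectively the two DataFrames interleaved column by column. The following sketch checks that on the same data. The name side_by_side is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "a"],
    "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
})
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

# Keep every row and column, and show real values everywhere
side_by_side = df.compare(df2, keep_shape=True, keep_equal=True)
print(side_by_side)

# The 'self' side of col3 matches df, the 'other' side matches df2
print(side_by_side[("col3", "self")].tolist())
print(side_by_side[("col3", "other")].tolist())
```

In this form a NaN in the output can only come from the data itself (as in col2, row 3), since equal values are no longer masked.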
Conclusion
The compare() method in Pandas identifies the differences between two identically labeled DataFrames. Its parameters control the layout of the result: align_axis chooses between side-by-side and stacked output, keep_shape and keep_equal decide how much unchanged data is shown, and result_names sets the labels. Choose the options that suit your needs, whether that is highlighting only the discrepancies or viewing a complete side-by-side comparison.