How to Compare Two DataFrames in Pandas

Comparing two Pandas DataFrames for differences is a common task in data analysis. This article explains how to use the compare() method, available since pandas 1.1.0, to highlight the discrepancies between two DataFrames.

Understanding the compare() Method

The compare() method compares two DataFrames element by element and returns a new DataFrame containing only the values that differ, which helps pinpoint changes or inconsistencies between datasets. Missing values in the same position in both DataFrames are treated as equal. Its main parameters are described below.
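One constraint is worth seeing in action: compare() expects both DataFrames to carry identical row and column labels and raises a ValueError otherwise. The sketch below uses two small hypothetical frames (left and right are illustrative names, not part of the example later in this article) and restricts both to their shared index before comparing. That is one possible workaround, not the only one.

```python
import pandas as pd

# Hypothetical frames: `right` has one extra row, so the labels do not match.
left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"a": [1, 2, 9, 4]}, index=[0, 1, 2, 3])

try:
    left.compare(right)
    raised = False
except ValueError:
    # compare() refuses frames whose labels differ.
    raised = True

# One workaround: compare only the rows that both frames share.
shared = left.index.intersection(right.index)
diff = left.loc[shared].compare(right.loc[shared])

print(raised)
print(diff)
```

Rows that exist in only one frame are silently left out by this approach, so they need to be reported separately if they matter.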

Key Parameters

  • other: The DataFrame to compare against. It must have the same row and column labels as the original; otherwise compare() raises a ValueError.
  • align_axis: Determines how the comparison is aligned:
    • 0 or 'index': Differences are stacked vertically, with rows alternating between the two DataFrames.
    • 1 or 'columns': Differences are aligned horizontally, with columns alternating between the two DataFrames (default).
  • keep_shape: If True, keeps all rows and columns, including those with no differences. If False (the default), only rows and columns that contain at least one difference are shown.
  • keep_equal: If True, equal values are shown as they are. If False (the default), equal values are replaced with NaN. On its own this does not add rows or columns; it only affects cells in the rows and columns that are already kept.
  • result_names: A tuple of labels for the two DataFrames in the output (default ('self', 'other')). Available in pandas 1.5.0 and later.
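Of these parameters, result_names is the only one not demonstrated in the examples that follow, so here is a minimal sketch of it. The frames before and after and their columns are hypothetical, and the call assumes pandas 1.5.0 or later.

```python
import pandas as pd

# Hypothetical data: the same table at two points in time.
before = pd.DataFrame({"price": [10.0, 12.5, 8.0], "stock": [5, 0, 7]})
after = before.copy()
after.loc[1, "price"] = 13.0  # the only change

# Label the two sides "before"/"after" instead of the default "self"/"other".
diff = before.compare(after, result_names=("before", "after"))
print(diff)
```

Only the price column and row 1 appear in the result, with the sub-columns named before and after.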

Practical Examples

Let’s illustrate with examples:

Basic Comparison

First, create two DataFrames:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "a"],
    "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
}, columns=["col1", "col2", "col3"])

df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0

Now, compare df and df2 using the default settings:

df.compare(df2)

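The result contains only rows 0 and 2 and columns col1 and col3, the places where the two DataFrames differ. Each column is split into self (values from df) and other (values from df2), and cells that match are shown as NaN. col2 does not appear: its NaN in row 3 is present in both DataFrames and counts as equal.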
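As a quick check of the default call, the sketch below rebuilds the two DataFrames from this section and lists which rows and columns end up in the result. The names diff, changed_rows, and changed_columns are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "a"],
    "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
})
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

diff = df.compare(df2)

# Row labels that contain at least one difference.
changed_rows = diff.index.tolist()
# Top level of the column MultiIndex: the original column names.
changed_columns = diff.columns.get_level_values(0).unique().tolist()

print(changed_rows)     # [0, 2]
print(changed_columns)  # ['col1', 'col3']
```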

Aligning by Index

To align the differences by rows, use align_axis=0:

df.compare(df2, align_axis=0)
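With align_axis=0 the self/other labels move from the columns into the row index. The sketch below, which rebuilds the same two DataFrames, shows the resulting index structure; stacked is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "a"],
    "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
})
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

stacked = df.compare(df2, align_axis=0)

# Each differing row now appears twice: once for df, once for df2.
print(stacked.index.tolist())
# [(0, 'self'), (0, 'other'), (2, 'self'), (2, 'other')]
print(stacked.columns.tolist())
# ['col1', 'col3']

# A single value can be looked up by (row label, side) and column.
print(stacked.loc[(2, "other"), "col3"])  # 4.0
```

This layout tends to be easier to read when the DataFrames have many columns, because the result does not double in width.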

Keeping Equal Values

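By default, matching cells inside the result are shown as NaN. To show their actual values instead, set keep_equal=True. The result still contains only the rows and columns that have at least one difference: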

df.compare(df2, keep_equal=True)
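To make the effect of keep_equal concrete, the sketch below compares the same cell with and without the option. Row 2 differs only in col3, so its col1 value is identical in both DataFrames. The names masked and filled are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "a"],
    "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
})
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

masked = df.compare(df2)                   # equal cells become NaN
filled = df.compare(df2, keep_equal=True)  # equal cells keep their values

cell = (2, ("col1", "self"))
print(pd.isna(masked.loc[cell]))  # True
print(filled.loc[cell])           # b

# The shape is the same either way: 2 differing rows, 2 x 2 columns.
print(masked.shape, filled.shape)
```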

Preserving Shape

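To keep every row and column of the original DataFrames in the result, use keep_shape=True. Each original column still appears twice (self and other), and cells without a difference are NaN: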

df.compare(df2, keep_shape=True)
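A short sketch of what keep_shape=True returns for the same two DataFrames: all five rows are present, and the rows without any difference consist entirely of NaN. The names shaped and unchanged_rows are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "a"],
    "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
})
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

shaped = df.compare(df2, keep_shape=True)

# 5 original rows, and 3 original columns x 2 sides = 6 columns.
print(shaped.shape)  # (5, 6)

# Rows where every cell is NaN are rows with no difference at all.
unchanged_rows = shaped.index[shaped.isna().all(axis=1)].tolist()
print(unchanged_rows)  # [1, 3, 4]
```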

Combining keep_shape and keep_equal

To see every value from both DataFrames side by side, with nothing dropped or masked, combine the two options:

df.compare(df2, keep_shape=True, keep_equal=True)
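With both options set, the result is effectively the two DataFrames interleaved column by column. The sketch below pulls the two sides of col3 back out of the result to show this; full is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "a"],
    "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
    "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
})
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0

full = df.compare(df2, keep_shape=True, keep_equal=True)

# Every original value is present on both sides.
print(full[("col3", "self")].tolist())   # [1.0, 2.0, 3.0, 4.0, 5.0]
print(full[("col3", "other")].tolist())  # [1.0, 2.0, 4.0, 4.0, 5.0]

# The only NaN left is the one that was already in the source data.
print(int(full[("col2", "self")].isna().sum()))  # 1
```

Because nothing is masked in this mode, the differences are no longer visible at a glance; the differing cells have to be found by comparing the self and other columns.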

Conclusion

The compare() method is a concise way to identify differences between two identically labeled DataFrames. Use the defaults to see only the discrepancies, align_axis to choose between a side-by-side and a stacked layout, and keep_shape together with keep_equal when you need a complete side-by-side view.
