Comparing two Excel spreadsheets manually can be tedious and error-prone, especially with large datasets. Fortunately, Python with the Pandas library offers a powerful and efficient solution. This tutorial shows how to compare two Excel sheets using Python and Pandas, highlighting changes, additions, and removals. We’ll cover two approaches: a basic comparison first, then a more robust method for larger datasets.
Basic Comparison with Pandas
The original version of this approach relied on Pandas’ Panel data structure, which was deprecated and has since been removed from Pandas. The steps below compare the two dataframes with current dataframe methods instead.
- Import necessary libraries:
import pandas as pd
import numpy as np
- Read Excel files into dataframes:
df1 = pd.read_excel('sample-address-1.xlsx', 'Sheet1', na_values=['NA'])
df2 = pd.read_excel('sample-address-2.xlsx', 'Sheet1', na_values=['NA'])
Replace ‘sample-address-1.xlsx’ and ‘sample-address-2.xlsx’ with your file names.
- Sort and reindex dataframes: This ensures a consistent row order for comparison. The original example used reindex; a more current approach is to use reset_index:
df1.sort_values(by="account number", inplace=True)
df1.reset_index(drop=True, inplace=True)
df2.sort_values(by="account number", inplace=True)
df2.reset_index(drop=True, inplace=True)
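- Define a diff function: This function takes a pair of cell values and returns the value unchanged if both match, or an "old ---> new" string if they differ.
def report_diff(x):
    # x is a pair of values: x[0] from the first sheet, x[1] from the second
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)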
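- Concatenate and Compare: In current Pandas (1.1 and later), concat with drop_duplicates lists the rows that changed, and compare reports the cell-level differences.
# Rows that differ between the two files (both versions of each changed row)
comparison_result = pd.concat([df1, df2]).drop_duplicates(keep=False).sort_values('account number')
# compare() requires identical row and column labels, so both files must hold the same accounts
diff_output = df1.set_index('account number').compare(df2.set_index('account number'))
print(diff_output)
This outputs a dataframe containing only the changed cells, with the first file's value under self and the second file's value under other.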
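To make the cell-level comparison concrete, here is a minimal sketch that runs DataFrame.compare on two small in-memory dataframes instead of Excel files. The account data is hypothetical, and the sketch assumes both dataframes contain the same account numbers, since compare requires identical row and column labels.

```python
import pandas as pd

# Hypothetical data standing in for the two Excel sheets.
first = pd.DataFrame({
    "account number": [101, 102, 103],
    "name": ["Ann", "Bob", "Cy"],
    "city": ["Austin", "Boston", "Chicago"],
})
second = pd.DataFrame({
    "account number": [101, 102, 103],
    "name": ["Ann", "Bob", "Cy"],
    "city": ["Austin", "Denver", "Chicago"],
})

# Index both frames by account number so their labels line up,
# then keep only the cells that differ.
diff = first.set_index("account number").compare(second.set_index("account number"))
print(diff)
```

Only account 102 appears in the result, with the city column split into a self value (first sheet) and an other value (second sheet).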
Scaling Up for Larger Datasets
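The basic approach reports changed cells but does not separate added rows from removed ones, which matters more as datasets grow. Here’s a more robust method that labels each row with the file it came from: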
- Import Libraries and Define report_diff: Same as the first and fourth steps of the basic approach.
Read and Label Data:
old = pd.read_excel('sample-address-old.xlsx', 'Sheet1', na_values=['NA'])
new = pd.read_excel('sample-address-new.xlsx', 'Sheet1', na_values=['NA'])
old['version'] = "old"
new['version'] = "new"
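- Concatenate and Identify Changes:
full_set = pd.concat([old, new], ignore_index=True)
changes = full_set.drop_duplicates(subset=["account number", "name", "street", "city", "state", "postal code"], keep=False)
Note keep=False, which drops every copy of a row that is identical in both files, so changes holds only rows that were modified, removed, or added. With keep='last', unchanged rows would survive labeled "new" and later be misreported as added accounts.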
- Identify Duplicates and Changes: An account number that still appears more than once in changes has both an old and a new row, which means the row was modified.
# Index.get_duplicates() was removed in pandas 1.0; use duplicated() instead
acct = changes["account number"]
dupe_accts = acct[acct.duplicated()].unique()
dupes = changes[changes["account number"].isin(dupe_accts)]
# Compare changed rows
change_new = dupes[dupes["version"] == "new"].drop('version', axis=1).set_index('account number')
change_old = dupes[dupes["version"] == "old"].drop('version', axis=1).set_index('account number')
diff_output = pd.concat([change_old, change_new], axis='columns', keys=['old', 'new'])
This updated method provides a side-by-side comparison of the old and new values for changed rows. Using concat with the keys argument clearly labels which columns contain the old data and which contain the new.
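If a single merged view is easier to read than two column groups, the side-by-side frame can be collapsed with a report_diff-style helper. The following is a minimal sketch under stated assumptions: the data is hypothetical, and this report_diff is a variant written for illustration that unpacks an (old, new) tuple, so it may differ from how you wire up the helper defined earlier.

```python
import pandas as pd

def report_diff(pair):
    # pair is (old_value, new_value); unchanged values pass through as-is.
    old_value, new_value = pair
    if old_value == new_value:
        return old_value
    return "{} ---> {}".format(old_value, new_value)

# Hypothetical changed rows, indexed by account number.
idx = pd.Index([102, 103], name="account number")
change_old = pd.DataFrame({"city": ["Boston", "Chicago"], "state": ["MA", "IL"]}, index=idx)
change_new = pd.DataFrame({"city": ["Springfield", "Peoria"], "state": ["MA", "IL"]}, index=idx)

# keys=['old', 'new'] adds a top column level naming each source.
side_by_side = pd.concat([change_old, change_new], axis="columns", keys=["old", "new"])

# Pair up each old/new column and collapse it into one column of text.
merged = pd.DataFrame(
    {
        col: [report_diff(p) for p in zip(side_by_side["old"][col], side_by_side["new"][col])]
        for col in change_old.columns
    },
    index=side_by_side.index,
)
print(merged)
```

Cells that changed read as "old ---> new", while unchanged cells keep their original value.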
- Identify Removed and Added Accounts:
removed_accounts = changes[(~changes["account number"].isin(dupe_accts)) & (changes["version"] == "old")]
added_accounts = changes[(~changes["account number"].isin(dupe_accts)) & (changes["version"] == "new")]
- Export Results to Excel:
with pd.ExcelWriter("compared_data.xlsx") as writer:
    diff_output.to_excel(writer, sheet_name="changed")
    removed_accounts.to_excel(writer, sheet_name="removed", index=False)
    added_accounts.to_excel(writer, sheet_name="added", index=False)
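The classification logic in the steps above can be checked end to end on a few rows before pointing it at real files. This is a minimal sketch, not a definitive implementation: the data and the reduced column list are hypothetical, the Excel read and write steps are left out, and it uses drop_duplicates with keep=False so that rows identical in both versions drop out entirely.

```python
import pandas as pd

cols = ["account number", "name", "city"]

# Hypothetical old and new sheets: 101 unchanged, 102 changed,
# 103 removed, 104 added.
old = pd.DataFrame(
    [[101, "Ann", "Austin"], [102, "Bob", "Boston"], [103, "Cy", "Chicago"]],
    columns=cols,
)
new = pd.DataFrame(
    [[101, "Ann", "Austin"], [102, "Bob", "Denver"], [104, "Dee", "Dallas"]],
    columns=cols,
)
old["version"] = "old"
new["version"] = "new"

full_set = pd.concat([old, new], ignore_index=True)

# keep=False removes every copy of a row that is identical in both versions.
changes = full_set.drop_duplicates(subset=cols, keep=False)

# An account that still appears twice has an old row and a new row: it changed.
acct = changes["account number"]
dupe_accts = acct[acct.duplicated()].unique()
is_changed = acct.isin(dupe_accts)

changed = changes[is_changed]
removed = changes[~is_changed & (changes["version"] == "old")]
added = changes[~is_changed & (changes["version"] == "new")]

print("changed:", sorted(changed["account number"].unique().tolist()))
print("removed:", removed["account number"].tolist())
print("added:", added["account number"].tolist())
```

Running the check on a handful of known rows like this makes it easy to confirm that unchanged accounts do not leak into the added or removed sheets.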
Conclusion
Python Pandas provides flexible and efficient tools for comparing Excel sheets. By leveraging its powerful functions, you can automate the process of identifying differences, additions, and removals in your data, saving time and reducing errors. Choose the method that best suits your data size and complexity. Remember to adjust file names and column names to match your specific needs.