Compare Two Files in Linux: A Practical Guide to Finding Differences

Comparing files is a routine task on Linux for system administrators, developers, and anyone managing data. Whether you’re tracking changes in configuration files, verifying data integrity, or debugging scripts, it helps to know how to compare files from the command line. This article walks through a practical bash script, colcmp.sh, that compares two files of name/value pairs and reports which entries were added, removed, or changed.

What is `colcmp.sh` and Why Use It?

The colcmp.sh script is a command-line tool for comparing files formatted as name/value pairs, where each line consists of a name followed by its value. This format is common in configuration files and simple data lists. Unlike generic comparison tools such as diff, which report line-by-line differences, colcmp.sh compares the value associated with each name. That makes it useful when the order of entries may vary between the two files and what matters is whether the value for a given name has changed.
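To see why line order matters for a generic tool, here is a small sketch. The file names and contents are made up for illustration: both files hold the same name/value pairs in a different order, and diff still treats them as different, while sorting both files first makes the check order-independent.

```bash
#!/usr/bin/env bash
# Illustrative data only: the same pairs, in a different line order.
tmpdir=$(mktemp -d)
printf 'User1 US\nUser2 NG\n' > "$tmpdir/a.txt"
printf 'User2 NG\nUser1 US\n' > "$tmpdir/b.txt"

# diff compares line by line, so reordered lines count as differences.
if diff "$tmpdir/a.txt" "$tmpdir/b.txt" > /dev/null; then
    plain_result="same"
else
    plain_result="different"
fi

# Sorting both files first removes the effect of line order.
sort "$tmpdir/a.txt" > "$tmpdir/a.sorted"
sort "$tmpdir/b.txt" > "$tmpdir/b.sorted"
if diff "$tmpdir/a.sorted" "$tmpdir/b.sorted" > /dev/null; then
    sorted_result="same"
else
    sorted_result="different"
fi

echo "plain diff:  $plain_result"
echo "sorted diff: $sorted_result"
rm -rf "$tmpdir"
```

Sorting answers whether the two files hold the same lines, but it does not say which name changed or what its old and new values were, which is the gap colcmp.sh is meant to fill.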

Key benefits of using colcmp.sh include:

Focused comparison: It specifically targets name/value pairs, making it ideal for configuration files and structured data.
Change detection: Clearly identifies names with changed values, added names, and removed names between two files.
Output to file: Writes changed names to an Output_File, providing a concise list of modifications.
Command-line efficiency: As a bash script, it’s lightweight and fits into Linux command-line workflows with no dependencies beyond bash and standard tools such as sed and cmp.

How to Use `colcmp.sh`

To use colcmp.sh, save the script on your system and make it executable. Here’s a step-by-step guide:

Save the script: Copy the script code provided below into a file named colcmp.sh.
Set execute permissions: Open your terminal and navigate to the directory where you saved colcmp.sh. Run the command chmod +x colcmp.sh to make the script executable.
Run the script: Execute the script with two file paths as arguments, representing the two files you want to compare:
```
./colcmp.sh File_1.txt File_2.txt
```
Replace File_1.txt and File_2.txt with the actual paths to your files.

Example:

Let’s assume you have two files, config_v1.txt and config_v2.txt, representing two versions of a configuration file.

File_1.txt (config_v1.txt):

User1 US
User2 US
User3 US
SettingA on
SettingB off

File_2.txt (config_v2.txt):

User1 US
User2 US
User3 NG
SettingA off
SettingB off
User4 newValue

Running the script:

$ ./colcmp.sh config_v1.txt config_v2.txt
User3 changed from 'US' to 'NG'
SettingA changed from 'on' to 'off'
User4 was added as 'newValue'
no change: User1,User2,SettingB

Output_File:

After running the script, a file named Output_File is created (or overwritten) in the current working directory. It lists the names of the entries that have changed.

$ cat Output_File
User3 has changed
SettingA has changed
User4 has changed

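This output shows that the values of User3 and SettingA changed and that User4 was added. The script also lists the entries that remained unchanged. The order of the lines, and of the names in the no change list, may differ from what is shown here, because bash does not guarantee any particular iteration order for associative arrays.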

Understanding the `colcmp.sh` Script

The colcmp.sh script leverages bash associative arrays (available in bash version 4 and later) to efficiently compare the name/value pairs. Here’s a breakdown of the script’s logic:

1. Basic File Comparison and Initial Setup

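# Quick byte-for-byte check first; -s makes cmp report through its exit status only.
cmp -s "$1" "$2"
case "$?" in
    0)
        : > Output_File    # truncate Output_File: nothing changed
        echo "files are identical" ;;
    1)
        : > Output_File    # start from an empty Output_File
        # ... rest of the script ...
    ;;
    *)
        echo "error: file not found, access denied, etc..."
        echo "usage: ./colcmp.sh File_1.txt File_2.txt" ;;
esac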

cmp -s "$1" "$2": This command silently compares the two input files ($1 and $2). The -s option prevents cmp from writing any output to standard output.
case "$?" in ... esac: This structure evaluates the exit status $? of the cmp command.
- 0): If cmp returns 0 (files are identical), it clears Output_File and prints “files are identical”.
- 1): If cmp returns 1 (files differ), it proceeds with the detailed comparison logic.
- *): For any other exit status (usually 2, indicating an error such as a missing file), it prints an error message and usage instructions.
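The three branches can be tried in isolation with a small sketch. The helper name compare_status and the temporary files are hypothetical and are not part of colcmp.sh; the sketch only maps the exit status of cmp -s to a word.

```bash
#!/usr/bin/env bash
# Hypothetical helper: translate the exit status of `cmp -s` into a word.
compare_status() {
    cmp -s "$1" "$2"
    case "$?" in
        0) echo "identical" ;;
        1) echo "different" ;;
        *) echo "error" ;;
    esac
}

# Illustrative files only.
tmpdir=$(mktemp -d)
printf 'User1 US\n' > "$tmpdir/one.txt"
printf 'User1 US\n' > "$tmpdir/same.txt"
printf 'User1 NG\n' > "$tmpdir/other.txt"

r_same=$(compare_status "$tmpdir/one.txt" "$tmpdir/same.txt")
r_diff=$(compare_status "$tmpdir/one.txt" "$tmpdir/other.txt")
r_missing=$(compare_status "$tmpdir/one.txt" "$tmpdir/does-not-exist.txt" 2>/dev/null)

echo "same content:  $r_same"
echo "other content: $r_diff"
echo "missing file:  $r_missing"
rm -rf "$tmpdir"
```

Checking with cmp first lets the script skip the more expensive parsing step whenever the two files are byte-for-byte identical.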

2. Processing Files into Associative Arrays

The script processes each input file to create bash associative arrays (A1 and A2). This involves several steps for each file:

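cp "$1" ~/.colcmp.array1.tmp.sh
# Escape every character that is not alphanumeric or a space with a backslash.
sed -i -E 's/([^A-Za-z0-9 ])/\\\1/g' ~/.colcmp.array1.tmp.sh
# Comment out every line so nothing from the input can run as code.
sed -i -E "s/^(.*)$/#\1/" ~/.colcmp.array1.tmp.sh
# Turn "#name value" lines into A1[name]="value" assignments.
sed -i -E "s/^#\s*(\S+)\s+(\S.*?)\s*$/A1\[\1\]=\"\2\"/" ~/.colcmp.array1.tmp.sh
chmod 755 ~/.colcmp.array1.tmp.sh
declare -A A1
source ~/.colcmp.array1.tmp.sh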

These lines are repeated for the second file, creating A2 from $2. Let’s break down what happens for each file (using File_1.txt and A1 as example):

cp "$1" ~/.colcmp.array1.tmp.sh: Copies the input file (File_1.txt) to a temporary script file in the user’s home directory (~/.colcmp.array1.tmp.sh).
First sed command (the escaping substitution): Escapes special characters. It finds any character that is not alphanumeric or a space and prefixes it with a backslash (\). This prevents characters such as $, ", and ` from being interpreted by the shell when the file is later sourced as a script.
sed -i -E "s/^(.*)$/#\1/" ~/.colcmp.array1.tmp.sh: Comments out every line in the temporary script file by adding a # at the beginning of each line. This is a safety measure: any line that the next step does not rewrite stays a comment and is never executed.
Third sed command (the substitution that produces the A1[...] lines): This is the core transformation step. It matches a commented-out line (a leading #, then optional whitespace), captures the first run of non-whitespace characters as the name (\1) and the rest of the line as the value (\2), and replaces the whole line with a bash associative array assignment of the form A1[name]="value".
chmod 755 ~/.colcmp.array1.tmp.sh: Makes the temporary script file executable. This is not required for source, which only needs read access, but it is a common habit when dealing with script files.
declare -A A1: Declares A1 as an associative array so that names (arbitrary strings) can be used as keys. Without this declaration, bash would treat A1 as an ordinary indexed array.
source ~/.colcmp.array1.tmp.sh: Executes the temporary script in the current shell. This runs all the array assignment commands within the temporary script, populating the associative array A1 with name-value pairs from File_1.txt.
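For comparison, here is an alternative sketch that fills the same kind of array without generating and sourcing a temporary script. This is not how colcmp.sh works; it is a different technique (a while read loop), shown under the assumption that names contain no whitespace. The sample data and the temporary file are illustrative.

```bash
#!/usr/bin/env bash
# Sketch only: fill an associative array without sourcing generated code.
declare -A A1

# Illustrative input with extra whitespace, a blank line, and a multi-word value.
tmpfile=$(mktemp)
printf 'User1 US\n  User2   NG  \n\nSettingA on off\n' > "$tmpfile"

# read splits on whitespace: the first word becomes the name, the remainder
# of the line becomes the value. Leading and trailing whitespace is trimmed.
while read -r name value; do
    [ -n "$name" ] || continue      # skip blank lines
    [ -n "$value" ] || continue     # skip lines that have no value
    A1["$name"]=$value
done < "$tmpfile"
rm -f "$tmpfile"

# Print the pairs in a stable order for inspection.
for name in "${!A1[@]}"; do
    printf '%s=%s\n' "$name" "${A1[$name]}"
done | sort
```

Because the input is only ever treated as data, this variant needs no escaping step, at the cost of a slower shell loop on very large files.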

3. Detecting Changes and Generating Output

After creating associative arrays A1 and A2 from both input files, the script proceeds to compare them and identify differences:

USERSWHODIDNOTCHANGE=
# Names present in the first file but missing from the second were removed.
for i in "${!A1[@]}"; do
    if [ "${A2[$i]+x}" = "" ]; then
        echo "$i was removed"
        echo "$i has changed" >> Output_File    # append, do not overwrite
    fi
done
# Names in the second file are either new, changed, or unchanged.
for i in "${!A2[@]}"; do
    if [ "${A1[$i]+x}" = "" ]; then
        echo "$i was added as '${A2[$i]}'"
        echo "$i has changed" >> Output_File
    elif [ "${A1[$i]}" != "${A2[$i]}" ]; then
        echo "$i changed from '${A1[$i]}' to '${A2[$i]}'"
        echo "$i has changed" >> Output_File
    else
        # Build a comma-separated list of unchanged names.
        if [ x$USERSWHODIDNOTCHANGE != x ]; then
            USERSWHODIDNOTCHANGE=",$USERSWHODIDNOTCHANGE"
        fi
        USERSWHODIDNOTCHANGE="$i$USERSWHODIDNOTCHANGE"
    fi
done
if [ x$USERSWHODIDNOTCHANGE != x ]; then
    echo "no change: $USERSWHODIDNOTCHANGE"
fi

USERSWHODIDNOTCHANGE=: Initializes an empty variable to store names that have not changed.
First for loop (Iterating through keys of A1):
- for i in "${!A1[@]}"; do ... done: Iterates through all the names (keys) in the associative array A1 (derived from File_1.txt).
- if [ "${A2[$i]+x}" = "" ]; then ... fi: Checks if a name from A1 exists as a key in A2. The construct ${A2[$i]+x} is a bash parameter expansion that checks if the key $i exists in A2. If it doesn’t exist, ${A2[$i]+x} evaluates to an empty string.
  - If the name from A1 is not in A2, it means the name was removed in the second file. The script outputs “$i was removed” and adds “$i has changed” to Output_File.
Second for loop (Iterating through keys of A2):
- for i in "${!A2[@]}"; do ... done: Iterates through all the names (keys) in the associative array A2 (derived from File_2.txt).
- if [ "${A1[$i]+x}" = "" ]; then ... elif [ "${A1[$i]}" != "${A2[$i]}" ]; then ... else ... fi: Checks different conditions for each name from A2.
  - if [ "${A1[$i]+x}" = "" ]; then ... fi: If the name from A2 is not in A1, it means the name was added in the second file. The script outputs “$i was added as ‘${A2[$i]}'” and adds “$i has changed” to Output_File.
  - elif [ "${A1[$i]}" != "${A2[$i]}" ]; then ... fi: If the name exists in both A1 and A2, this condition checks if the values associated with that name are different ("${A1[$i]}" != "${A2[$i]}"). If the values are different, it means the value has changed. The script outputs “$i changed from ‘${A1[$i]}’ to ‘${A2[$i]}'” and adds “$i has changed” to Output_File.
  - else ... fi: If the name exists in both A1 and A2 and their values are the same, it means the name/value pair has not changed. The script appends the name to the USERSWHODIDNOTCHANGE variable to keep track of unchanged entries.
if [ x$USERSWHODIDNOTCHANGE != x ]; then ... fi: Finally, if there are any names in the USERSWHODIDNOTCHANGE variable (meaning there were unchanged entries), the script outputs “no change: $USERSWHODIDNOTCHANGE”.
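The key-existence test is the least obvious part of these loops, so here is a minimal sketch of it on its own. The array contents and the helper name key_state are illustrative. The point it shows is that the + form of the expansion distinguishes a key with an empty value from a key that was never set.

```bash
#!/usr/bin/env bash
# Illustrative array: User2 exists with an empty value, User9 is never set.
declare -A A2=( [User1]="US" [User2]="" )

# Hypothetical helper: report whether a key exists in A2.
key_state() {
    # ${A2[key]+x} expands to "x" if the key is set (even to an empty
    # string) and to nothing if the key does not exist.
    if [ "${A2[$1]+x}" = "" ]; then
        echo "missing"
    else
        echo "present"
    fi
}

s1=$(key_state User1)   # key exists, non-empty value
s2=$(key_state User2)   # key exists, empty value
s3=$(key_state User9)   # key was never set

echo "User1: $s1"
echo "User2: $s2"
echo "User9: $s3"
```

A plain test such as [ -z "${A2[$i]}" ] would report User2 as missing, which is why the script relies on the + expansion instead.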

Alternatives to `colcmp.sh`

While colcmp.sh is effective for comparing name/value pairs, Linux offers other powerful command-line tools for file comparison:

diff: The classic difference utility. It excels at showing line-by-line changes between files and is highly configurable for various output formats (e.g., unified diffs, context diffs). However, it’s not specifically designed for name/value pairs.
comm: Compares two sorted files line by line and outputs lines unique to file 1, lines unique to file 2, and lines common to both. Useful for finding common and unique entries but requires sorted input and isn’t ideal for value comparisons.
cmp: A simpler comparison tool that identifies the first byte and line number where two files differ. Useful for quick binary or text file comparison to determine if they are identical or not.

Choosing the right tool depends on your specific comparison needs. For structured name/value data, colcmp.sh offers a targeted and efficient solution. For general text file comparisons or patch generation, diff remains the go-to tool.
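As a rough illustration of the comm approach from the list above, the sketch below sorts two small files and prints the lines unique to each. The file names and contents are placeholders.

```bash
#!/usr/bin/env bash
# Illustrative data only; the file names are placeholders.
tmpdir=$(mktemp -d)
printf 'User1 US\nUser2 US\nUser3 US\n' > "$tmpdir/v1.txt"
printf 'User3 NG\nUser1 US\nUser2 US\n' > "$tmpdir/v2.txt"

# comm expects sorted input, so sort both files first.
sort "$tmpdir/v1.txt" > "$tmpdir/v1.sorted"
sort "$tmpdir/v2.txt" > "$tmpdir/v2.sorted"

# -23 hides columns 2 and 3: only lines unique to the first file remain.
only_in_v1=$(comm -23 "$tmpdir/v1.sorted" "$tmpdir/v2.sorted")
# -13 hides columns 1 and 3: only lines unique to the second file remain.
only_in_v2=$(comm -13 "$tmpdir/v1.sorted" "$tmpdir/v2.sorted")

echo "only in v1: $only_in_v1"
echo "only in v2: $only_in_v2"
rm -rf "$tmpdir"
```

Note that comm reports a changed value as two unrelated lines, one unique to each file. Pairing them up by name is left to the reader, whereas colcmp.sh does that pairing itself.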

Conclusion

The colcmp.sh script provides a practical way to compare two files in Linux when they contain name/value pairs. By combining bash associative arrays with sed for text processing, it identifies changed, added, and removed entries. It is a useful addition to a Linux user’s toolkit for managing configuration files, tracking data modifications, and automating comparison tasks in scripts and workflows. Understanding how it works also offers insight into bash techniques for file manipulation and data comparison.
