Linux File Comparison: Deep Dive into the `colcmp.sh` Script

In Linux system administration and development, comparing files is a routine yet critical task. Whether you are tracking configuration changes, analyzing log files, or managing datasets, the ability to compare file content efficiently is indispensable. This article examines colcmp.sh, a bash script that compares two files specifically for changes in name-value pairs, offering a focused approach to comparing two files on Linux. The script is particularly useful when you need to identify modifications in configuration files or data lists where each entry consists of a name and its corresponding value.

Understanding colcmp.sh

colcmp.sh is a command-line utility crafted in bash to pinpoint changes between two files formatted as name value pairs, each on a new line. It leverages the power of bash associative arrays (available in bash version 4 and above) to efficiently process and compare these files. The script is designed to output the names of entries that have been modified, added, or removed between the two input files.

Key Features and Benefits

  • Focused Comparison: Unlike generic file comparison tools like diff or cmp, colcmp.sh is tailored for name-value pair comparisons. This specialization allows it to provide more meaningful output in scenarios where the order of lines is not significant, but the association between names and values is crucial.
  • Change Detection: The script not only identifies differences but categorizes them as changes, additions, or removals. This detailed output makes it easy to understand the nature of modifications between file versions.
  • Bash Script Simplicity: Being a bash script, colcmp.sh is portable across Linux environments and can be easily understood, modified, and integrated into larger shell scripts or automation workflows.
  • Clear Output: The script provides human-readable output, indicating which names have changed and which have not. It also writes a file named Output_File that flags whether any entry changed.

How to Use colcmp.sh for Linux File Comparison

To use colcmp.sh to compare two files on Linux, follow these steps:

Prerequisites

  • Bash v4+: Ensure your Linux environment is running bash version 4 or higher. You can check your bash version by running bash --version in the terminal.
  • Executable Script: Save the script content (provided in the Source (colcmp.sh) section below) into a file named colcmp.sh, and make it executable using the command chmod +x colcmp.sh.
  • Input Files: Prepare two text files (File_1.txt and File_2.txt) that you want to compare. Each file should contain name-value pairs in the format name value, with each pair on a new line.
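The first prerequisite can also be checked from inside a script rather than by reading the output of bash --version. The following is a minimal sketch, not part of colcmp.sh; the function name has_assoc_arrays is hypothetical, and it relies only on bash's built-in BASH_VERSINFO array.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not part of colcmp.sh): associative arrays need bash 4+.
has_assoc_arrays() {
    # BASH_VERSINFO[0] holds the major version of the running bash.
    [ "${BASH_VERSINFO[0]:-0}" -ge 4 ]
}

if has_assoc_arrays; then
    echo "bash ${BASH_VERSION} supports associative arrays"
else
    echo "bash 4 or newer is required (found ${BASH_VERSION:-unknown})" >&2
fi
```

A guard like this could sit at the top of the script so that an older bash fails with a clear message instead of a syntax error at declare -A.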

Running the Script

Execute colcmp.sh from your terminal with the following syntax:

./colcmp.sh File_1.txt File_2.txt

Replace File_1.txt and File_2.txt with the actual paths to your input files.

Interpreting the Output

After running the script, you will see output directly in your terminal, and a file named Output_File will be created (or overwritten) in the current working directory.

Terminal Output:

The terminal output provides a summary of the comparison:

  • Files are identical: If File_1.txt and File_2.txt are identical, the output will be: files are identical.
  • Changes detected: If differences are found, the output lists:
    • Changed values: e.g., User3 changed from 'US' to 'NG'
    • Names present in File_1.txt but missing from File_2.txt: e.g., User4 was removed
    • Names present only in File_2.txt: e.g., User5 was added as 'CA'
    • Names whose values are unchanged: e.g., no change: User1,User2

Output_File Content:

Output_File always contains a single line:

  • If the files are byte-for-byte identical, or if they differ only in ways that leave every name-value pair unchanged (for example, reordered lines), Output_File contains just an empty line.
  • If any change (addition, removal, or value modification) is detected, Output_File contains a line of the form UserX has changed. Every detected change rewrites the file with >, so UserX is whichever changed name the script processed last, not a full list of changes.

Example:

Let’s assume File_1.txt contains:

User1 US
User2 CA
User3 US
User4 UK

And File_2.txt contains:

User1 US
User2 CA
User3 NG
User5 CA

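Running ./colcmp.sh File_1.txt File_2.txt will produce terminal output like the following. Removals are reported first because the script checks for them in a separate first pass. Within each pass the order can vary, since bash does not guarantee the iteration order of associative array keys:

User4 was removed
User3 changed from 'US' to 'NG'
User5 was added as 'CA'
no change: User1,User2

Output_File will contain only the last change the script recorded. With the order shown above, that is:

User5 has changed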

Figure: terminal session running colcmp.sh on File_1.txt and File_2.txt, showing the reported changes and unchanged names alongside the content of Output_File.

Deep Dive into colcmp.sh Script Logic

To fully appreciate how colcmp.sh compares two files, let’s break down its source code step by step.

Basic File Comparison and Initial Setup

cmp -s "$1" "$2"
case "$?" in
    0)
        echo "" > Output_File
        echo "files are identical"
        ;;
    1)
        # Compare logic
        ;;
    *)
        echo "error: file not found, access denied, etc..."
        echo "usage: ./colcmp.sh File_1.txt File_2.txt"
        ;;
esac

This section starts by using the cmp -s command to perform a silent byte-by-byte comparison of the two input files ($1 and $2). The cmp command sets the exit status $? based on the comparison result:

  • 0: Files are identical. In this case, the script clears Output_File and outputs “files are identical”.
  • 1: Files differ. This triggers the main comparison logic to identify name-value pair changes.
  • 2 (or any other non-zero, non-one): An error occurred (e.g., file not found). The script outputs an error message and usage instructions.

The case statement is used to handle these different exit statuses gracefully.
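The three classes of exit status are easy to observe directly. The sketch below creates throwaway files with mktemp (the file names are illustrative assumptions) and records what cmp -s returns in each situation.

```shell
#!/usr/bin/env bash
# Observe the exit status of cmp -s for identical, differing, and missing files.
tmpdir=$(mktemp -d)
printf 'User1 US\n' > "$tmpdir/a.txt"
printf 'User1 US\n' > "$tmpdir/same.txt"
printf 'User1 NG\n' > "$tmpdir/diff.txt"

identical_status=0
cmp -s "$tmpdir/a.txt" "$tmpdir/same.txt" || identical_status=$?

differ_status=0
cmp -s "$tmpdir/a.txt" "$tmpdir/diff.txt" || differ_status=$?

error_status=0
# The second file does not exist, so cmp reports trouble (status above 1).
cmp -s "$tmpdir/a.txt" "$tmpdir/missing.txt" 2>/dev/null || error_status=$?

echo "identical: $identical_status"
echo "different: $differ_status"
echo "missing:   $error_status"

rm -rf "$tmpdir"
```

The first two values are 0 and 1. The third is some value greater than 1, which is why the script's catch-all * branch is the right place for the error message.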

Processing File 1 into an Associative Array A1

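        echo "" > Output_File
        # Work on a copy so the input file itself is never modified
        cp "$1" ~/.colcmp.array1.tmp.sh
        # Backslash-escape everything except letters, digits, and spaces
        sed -i -E "s/([^A-Za-z0-9 ])/\\\\\1/g" ~/.colcmp.array1.tmp.sh
        # Turn every line into a comment
        sed -i -E "s/^(.*)$/#\1/" ~/.colcmp.array1.tmp.sh
        # Rewrite "#name value" lines as A1[name]="value"
        sed -i -E "s/^#\s*(\S+)\s+(\S.*?)\s*$/A1\[\1\]=\"\2\"/" ~/.colcmp.array1.tmp.sh
        chmod 755 ~/.colcmp.array1.tmp.sh
        declare -A A1
        source ~/.colcmp.array1.tmp.sh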

This block processes the first input file ($1) to create a bash associative array named A1.

  1. Clear Output and Copy File: echo "" > Output_File clears the output file, and cp "$1" ~/.colcmp.array1.tmp.sh copies File_1.txt to a temporary script file in the user’s home directory.
  2. Escape Special Characters: sed -i -E "s/([^A-Za-z0-9 ])/\\\\\1/g" ~/.colcmp.array1.tmp.sh prefixes every character other than an ASCII letter, digit, or space with a backslash. This is a safety measure: characters such as $, `, and " would otherwise be interpreted by the shell when the file is later sourced. Inside the double-quoted shell string, \\\\ reaches sed as \\, which sed reads as one literal backslash placed before the captured character \1.
  3. Comment Out All Lines: sed -i -E "s/^(.*)$/#\1/" ~/.colcmp.array1.tmp.sh prepends # to every line in the temporary file. Any line that the next step does not recognize as a name-value pair therefore stays a comment and is ignored when the file is sourced.
  4. Convert to Array Assignments: sed -i -E "s/^#\s*(\S+)\s+(\S.*?)\s*$/A1\[\1\]=\"\2\"/" ~/.colcmp.array1.tmp.sh is the core transformation step. It matches lines that start with #, followed by optional whitespace, captures the first word as the name (\1) and the rest of the line as the value (\2), and replaces each such line with a bash assignment to the associative array A1 using the name as the key: A1[name]="value". The inner double quotes are written as \" so that they reach sed instead of ending the shell string.
  5. Make Executable (Potentially Redundant): chmod 755 ~/.colcmp.array1.tmp.sh makes the temporary script executable. While not strictly necessary for source, it might have been included as a precautionary measure.
  6. Declare Associative Array: declare -A A1 explicitly declares A1 as an associative array, which is essential for using string keys.
  7. Source the Script: source ~/.colcmp.array1.tmp.sh executes the temporary script in the current shell environment. This effectively populates the associative array A1 with name-value pairs from File_1.txt.
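To see what the three sed passes produce, the sketch below runs them on a small sample through pipes instead of editing a temporary file in place. It is an illustration under a few assumptions: GNU sed (for -E together with \s and \S), single-quoted expressions so that backslashes reach sed unchanged, and (\S.*) in place of (\S.*?), since POSIX extended regular expressions have no lazy quantifier. The sample lines are made up.

```shell
#!/usr/bin/env bash
# Apply the three transformations step by step to sample input (GNU sed assumed).
sample=$(printf 'User1 US\nUser3 N.G\nmalformed\n')

# Pass 1: backslash-escape everything except letters, digits, and spaces.
step1=$(printf '%s\n' "$sample" | sed -E 's/([^A-Za-z0-9 ])/\\\1/g')

# Pass 2: turn every line into a comment.
step2=$(printf '%s\n' "$step1" | sed -E 's/^(.*)$/#\1/')

# Pass 3: rewrite "#name value" lines as array assignments; lines that do
# not have both a name and a value stay commented out.
step3=$(printf '%s\n' "$step2" | sed -E 's/^#\s*(\S+)\s+(\S.*)\s*$/A1[\1]="\2"/')

printf '%s\n' "$step3"
```

The result is A1[User1]="US", A1[User3]="N\.G", and #malformed. Note that inside double quotes bash only removes a backslash that precedes $, `, ", \, or a newline, so the second value is stored with its backslash. Both input files go through the same transformation, so comparisons between them still agree.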

Processing File 2 into Associative Array A2

The script repeats the same process for the second input file ($2) to create another associative array named A2, storing its name-value pairs.

        cp "$2" ~/.colcmp.array2.tmp.sh
        sed -i -E "s/([^A-Za-z0-9 ])/\\\\\1/g" ~/.colcmp.array2.tmp.sh
        sed -i -E "s/^(.*)$/#\1/" ~/.colcmp.array2.tmp.sh
        sed -i -E "s/^#\s*(\S+)\s+(\S.*?)\s*$/A2\[\1\]=\"\2\"/" ~/.colcmp.array2.tmp.sh
        chmod 755 ~/.colcmp.array2.tmp.sh
        declare -A A2
        source ~/.colcmp.array2.tmp.sh

This section mirrors the previous one, but operates on File_2.txt and populates the associative array A2.
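Because the two blocks differ only in the file and the array name, the loading step is a natural candidate for a function. The sketch below is one possible alternative, not the script's own method: it reads the pairs with a while read loop instead of generating and sourcing a temporary script. The helper name load_pairs is hypothetical, and local -n (a nameref) assumes bash 4.3 or newer.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of colcmp.sh): read "name value" lines from
# file $1 into the associative array named by $2. "local -n" needs bash 4.3+.
load_pairs() {
    local file=$1
    local -n pairs_ref=$2
    local name value
    # read splits on whitespace: first word -> name, remainder -> value.
    # The "|| [ -n ... ]" part keeps a final line that lacks a newline.
    while read -r name value || [ -n "$name" ]; do
        # Skip blank lines and lines without a value.
        [ -n "$name" ] && [ -n "$value" ] || continue
        pairs_ref["$name"]=$value
    done < "$file"
}

workdir=$(mktemp -d)
printf 'User1 US\nUser3 US\n' > "$workdir/File_1.txt"
printf 'User1 US\nUser3 N.G\n' > "$workdir/File_2.txt"

declare -A A1 A2
load_pairs "$workdir/File_1.txt" A1
load_pairs "$workdir/File_2.txt" A2

echo "File 1: User3 -> ${A1[User3]}"
echo "File 2: User3 -> ${A2[User3]}"
rm -rf "$workdir"
```

Since nothing from the input is ever executed, this variant needs no escaping pass, and values such as N.G are stored exactly as written.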

Comparing Arrays and Identifying Changes

        USERSWHODIDNOTCHANGE=
        for i in "${!A1[@]}"; do
            if [ "${A2[$i]+x}" = "" ]; then
                echo "$i was removed"
                echo "$i has changed" > Output_File
            fi
        done
        for i in "${!A2[@]}"; do
            if [ "${A1[$i]+x}" = "" ]; then
                echo "$i was added as '${A2[$i]}'"
                echo "$i has changed" > Output_File
            elif [ "${A1[$i]}" != "${A2[$i]}" ]; then
                echo "$i changed from '${A1[$i]}' to '${A2[$i]}'"
                echo "$i has changed" > Output_File
            else
                if [ x$USERSWHODIDNOTCHANGE != x ]; then
                    USERSWHODIDNOTCHANGE=",$USERSWHODIDNOTCHANGE"
                fi
                USERSWHODIDNOTCHANGE="$i$USERSWHODIDNOTCHANGE"
            fi
        done
        if [ x$USERSWHODIDNOTCHANGE != x ]; then
            echo "no change: $USERSWHODIDNOTCHANGE"
        fi

This crucial part of the script compares the two associative arrays A1 and A2 to detect changes:

  1. Initialize USERSWHODIDNOTCHANGE: USERSWHODIDNOTCHANGE= initializes an empty variable to store names that have not changed.
  2. Detect Removals: The first for loop iterates through the keys of array A1 (names from File_1.txt). For each name $i, it checks if the key exists in A2 using [ "${A2[$i]+x}" = "" ]. If the key is not present in A2, it means the name (and its corresponding entry) was removed in File_2.txt. The script then outputs $i was removed and writes to Output_File.
  3. Detect Additions, Modifications, and No Changes: The second for loop iterates through the keys of array A2 (names from File_2.txt). For each name $i:
    • Addition: It checks if the key $i exists in A1 using [ "${A1[$i]+x}" = "" ]. If not present in A1, it’s a new entry added in File_2.txt. The script outputs $i was added as '${A2[$i]}' and writes to Output_File.
    • Modification: If the key exists in both A1 and A2, it compares the values using [ "${A1[$i]}" != "${A2[$i]}" ]. If the values are different, it indicates a modification. The script outputs $i changed from '${A1[$i]}' to '${A2[$i]}' and writes to Output_File.
    • No Change: If the key and value are the same in both arrays, nothing changed for this name. The script prepends the name $i to the USERSWHODIDNOTCHANGE variable, building a comma-separated list of unchanged names.
  4. Output Unchanged Users: Finally, if USERSWHODIDNOTCHANGE is not empty, the script outputs no change: $USERSWHODIDNOTCHANGE.
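The key-existence test used in both loops relies on the ${parameter+word} expansion, which yields word whenever the parameter is set, even if it is set to an empty string. A short sketch with a made-up array named demo shows why this is more reliable than testing the value itself:

```shell
#!/usr/bin/env bash
# ${array[key]+x} expands to "x" when the key is set (even to an empty
# string) and to nothing when the key is absent.
declare -A demo
demo[User1]="US"
demo[User2]=""          # present, but with an empty value

for key in User1 User2 User9; do
    if [ "${demo[$key]+x}" = "" ]; then
        echo "$key: absent"
    else
        echo "$key: present with value '${demo[$key]}'"
    fi
done
```

User1 and User2 are reported as present and User9 as absent. A plain [ -z "${demo[$key]}" ] test would wrongly treat User2 as missing.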

Enhancements and Alternatives for File Comparison in Linux

While colcmp.sh provides a specialized solution for comparing name-value pairs, it’s beneficial to consider enhancements and alternative Linux commands for broader file comparison needs.

Potential Enhancements to colcmp.sh

  • Error Handling: Improve error handling for cases like incorrect file formats, missing files, or permission issues.
  • More Robust Input Validation: Add checks to ensure input files adhere to the name value format.
  • Function Refactoring: Encapsulate the array creation logic into functions to reduce code duplication and improve script readability.
  • Output Formatting Options: Allow users to customize the output format, perhaps with command-line options to control the verbosity or output delimiters.
  • Ignoring Case or Whitespace: Implement options to ignore case differences or leading/trailing whitespace in name or value comparisons.
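As an illustration of the input validation idea, the following sketch checks that every non-blank line looks like a name followed by a value. It is a hypothetical pre-check, not part of colcmp.sh; the function name validate_pairs and the message format are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical pre-check (not part of colcmp.sh): report lines that are not
# blank and do not look like "name value". Returns 1 if any were found.
validate_pairs() {
    local file=$1 line lineno=0 status=0
    local pair_re='^[[:space:]]*[^[:space:]]+[[:space:]]+[^[:space:]]'
    local blank_re='^[[:space:]]*$'
    while IFS= read -r line || [ -n "$line" ]; do
        lineno=$((lineno + 1))
        if [[ $line =~ $blank_re ]]; then
            continue
        elif ! [[ $line =~ $pair_re ]]; then
            echo "$file:$lineno: expected 'name value', got '$line'" >&2
            status=1
        fi
    done < "$file"
    return "$status"
}

workdir=$(mktemp -d)
printf 'User1 US\n\nUser2 CA\n' > "$workdir/good.txt"
printf 'User1 US\nUser2\n' > "$workdir/bad.txt"

if validate_pairs "$workdir/good.txt"; then good_result=ok; else good_result=invalid; fi
if validate_pairs "$workdir/bad.txt" 2>/dev/null; then bad_result=ok; else bad_result=invalid; fi
echo "good.txt: $good_result"
echo "bad.txt:  $bad_result"
rm -rf "$workdir"
```

Calling such a check on both inputs before the comparison would let the script stop early with a line number instead of silently ignoring malformed entries.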

Alternative Linux Commands for File Comparison

For more general file comparison tasks on Linux, consider these standard command-line utilities:

  • diff: A powerful tool for finding line-by-line differences between files. It’s highly versatile and offers various output formats (e.g., unified diff, context diff) suitable for patching and code review. diff File_1.txt File_2.txt
  • comm: Compares two sorted files and outputs lines unique to each file and lines common to both. Useful for identifying common and distinct entries in lists. comm File_1.txt File_2.txt (requires sorted input files)
  • cmp: Performs byte-by-byte comparison and is very efficient for quickly checking if two files are identical. cmp File_1.txt File_2.txt (as used in colcmp.sh for initial file identity check)
  • vimdiff (or gvimdiff): The side-by-side diff mode of the Vim text editor. vimdiff runs in the terminal, while gvimdiff opens the same view in Vim's GUI. Differences are highlighted, making them easy to spot and navigate, especially in code or structured text files. vimdiff File_1.txt File_2.txt
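For comparison with colcmp.sh, here is a sketch of how comm handles the example data from earlier in this article. It assumes bash (for process substitution) and sorts the inputs on the fly because comm requires sorted input; the variable names are illustrative.

```shell
#!/usr/bin/env bash
# Use comm on the article's sample data; inputs are sorted on the fly.
workdir=$(mktemp -d)
printf 'User1 US\nUser2 CA\nUser3 US\nUser4 UK\n' > "$workdir/File_1.txt"
printf 'User1 US\nUser2 CA\nUser3 NG\nUser5 CA\n' > "$workdir/File_2.txt"

# -23 suppresses columns 2 and 3, leaving lines found only in the first file.
only_in_1=$(comm -23 <(sort "$workdir/File_1.txt") <(sort "$workdir/File_2.txt"))
# -13 suppresses columns 1 and 3, leaving lines found only in the second file.
only_in_2=$(comm -13 <(sort "$workdir/File_1.txt") <(sort "$workdir/File_2.txt"))

echo "Only in File_1.txt:"
echo "$only_in_1"
echo "Only in File_2.txt:"
echo "$only_in_2"
rm -rf "$workdir"
```

comm works on whole lines, so a changed value shows up twice: User3 US appears as unique to the first file and User3 NG as unique to the second. Pairing those up by name is the step that colcmp.sh adds.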

Figure: vimdiff showing a side-by-side comparison of two text files with the differences highlighted.

Conclusion

colcmp.sh offers a specialized and effective way to compare two files on Linux when dealing with data formatted as name-value pairs. As a bash script built on associative arrays, it provides a clear and concise solution for detecting changes in configuration files, user lists, or similar datasets. While standard tools like diff, comm, and cmp offer broader file comparison capabilities, colcmp.sh excels in its niche, providing targeted insights into modifications within name-value structures. For more complex or visual comparisons, tools like vimdiff can complement command-line utilities, rounding out the toolkit for file comparison in Linux environments.

Source (colcmp.sh)


#!/bin/bash

cmp -s "$1" "$2"
case "$?" in
    0)
        echo "" > Output_File
        echo "files are identical"
        ;;
    1)
        echo "" > Output_File
        cp "$1" ~/.colcmp.array1.tmp.sh
        sed -i -E "s/([^A-Za-z0-9 ])/\\\\\1/g" ~/.colcmp.array1.tmp.sh
        sed -i -E "s/^(.*)$/#\1/" ~/.colcmp.array1.tmp.sh
        sed -i -E "s/^#\s*(\S+)\s+(\S.*?)\s*$/A1\[\1\]=\"\2\"/" ~/.colcmp.array1.tmp.sh
        chmod 755 ~/.colcmp.array1.tmp.sh
        declare -A A1
        source ~/.colcmp.array1.tmp.sh

        cp "$2" ~/.colcmp.array2.tmp.sh
        sed -i -E "s/([^A-Za-z0-9 ])/\\\\\1/g" ~/.colcmp.array2.tmp.sh
        sed -i -E "s/^(.*)$/#\1/" ~/.colcmp.array2.tmp.sh
        sed -i -E "s/^#\s*(\S+)\s+(\S.*?)\s*$/A2\[\1\]=\"\2\"/" ~/.colcmp.array2.tmp.sh
        chmod 755 ~/.colcmp.array2.tmp.sh
        declare -A A2
        source ~/.colcmp.array2.tmp.sh

        USERSWHODIDNOTCHANGE=
        for i in "${!A1[@]}"; do
            if [ "${A2[$i]+x}" = "" ]; then
                echo "$i was removed"
                echo "$i has changed" > Output_File
            fi
        done
        for i in "${!A2[@]}"; do
            if [ "${A1[$i]+x}" = "" ]; then
                echo "$i was added as '${A2[$i]}'"
                echo "$i has changed" > Output_File
            elif [ "${A1[$i]}" != "${A2[$i]}" ]; then
                echo "$i changed from '${A1[$i]}' to '${A2[$i]}'"
                echo "$i has changed" > Output_File
            else
                if [ x$USERSWHODIDNOTCHANGE != x ]; then
                    USERSWHODIDNOTCHANGE=",$USERSWHODIDNOTCHANGE"
                fi
                USERSWHODIDNOTCHANGE="$i$USERSWHODIDNOTCHANGE"
            fi
        done
        if [ x$USERSWHODIDNOTCHANGE != x ]; then
            echo "no change: $USERSWHODIDNOTCHANGE"
        fi
        ;;
    *)
        echo "error: file not found, access denied, etc..."
        echo "usage: ./colcmp.sh File_1.txt File_2.txt"
        ;;
esac
