Manually reviewing lengthy contracts for unauthorized handwritten changes is a tedious and error-prone process. Imagine sifting through 50-page documents, line by line, trying to spot subtle pen marks. This was a real challenge for a company processing complex contracts, where even a tiny modification could have significant legal implications. Traditional digital signature workflows handled most cases, but these intricate deals, often negotiated over months, required a cumbersome print-sign-scan process, opening the door to unintended alterations.
The critical need was to ensure these extensive contracts remained unaltered after customer review and signing. The question arose: how could this painstaking visual inspection be automated, eliminating human error and saving valuable time?
The answer lies in leveraging the power of Computer Vision, specifically with tools like Intelligence Suite. This suite offers a collection of building blocks designed to interact with image-based documents like PDFs, JPEGs, and PNGs. It allows for image enhancement through alignment, cropping, and noise reduction. More importantly, it provides capabilities for data extraction, template application for text recognition, and even barcode and QR code processing. Furthermore, you can train image recognition models to classify new images based on learned patterns.
✔️ Ready to revolutionize your document processing? Explore the Intelligence Suite Trial and get started with the Starter Kit today!
An overview of a computer vision workflow for document comparison, highlighting the steps involved in automating the review process.
To tackle the challenge of comparing original documents against potentially modified versions, the approach centers on image processing, image profiling, and strategic data preparation. This article, Part 1, will delve into these foundational steps. Part 2 will then expand on this, incorporating advanced image processing and reporting techniques to create a comprehensive solution for identifying and reporting marked-up pages.
The Automated Workflow for Document Comparison
The objective is to create a streamlined workflow capable of comparing an original document with a revised version, specifically pinpointing pages containing handwritten markups. The designed workflow efficiently generates a report that isolates only those pages flagged as potentially modified. Instead of manually scrutinizing all 50 pages of a contract, imagine receiving a concise report highlighting just the two or three pages requiring closer attention – a significant leap in efficiency.
A detailed diagram illustrating the automated workflow for comparing documents and detecting handwritten markups, from data input to report generation.
Sample Data: End User License Agreement (EULA)
To demonstrate this workflow with publicly accessible data, the Alteryx Designer End User License Agreement (EULA) was chosen. The 6-page document was downloaded from this URL and saved as a PDF file named Original Alteryx EULA v20 2006.pdf.
To simulate a modified contract, the EULA was printed, and handwritten markups in red ink were added to pages 3 and 4. This marked-up document was then scanned and saved as Signed Alteryx EULA with Markups.pdf (available at the end of the original article). While referred to as “signed,” it’s important to note this EULA lacks a signature page – imagine it as a representative section of a larger contract preceding the signature section. A separate, future exploration will focus on automating signature detection itself. For this example, the “original” and “signed” documents represent the pages prior to any signature pages.
The crucial aspect of the data setup is having two distinct PDF files, each with an identical number of pages arranged in the same sequential order, representing the original and potentially modified documents.
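Designer handles the pairing implicitly, but if you want to verify this precondition up front, here is a minimal Python sketch using the pypdf library (the file names match this example; the check itself is not part of the workflow):

```python
# Standalone sanity check: both PDFs must have the same number of pages.
from pypdf import PdfReader

original = PdfReader("Original Alteryx EULA v20 2006.pdf")
signed = PdfReader("Signed Alteryx EULA with Markups.pdf")

assert len(original.pages) == len(signed.pages), \
    f"Page counts differ: {len(original.pages)} vs {len(signed.pages)}"
```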
Here’s a visual comparison of page 3, showcasing the original document on the left and the marked-up version on the right:
A side-by-side comparison of page 3 from the original EULA (left) and the EULA with handwritten markups (right), visually demonstrating the changes made.
Workflow Results: Pinpointing Modified Pages
Upon executing the workflow and examining the final output, the results clearly present pages 3 and 4, both the original and marked-up versions, side-by-side for easy comparison.
The workflow output displaying page 3 of both the original and marked-up documents side-by-side, highlighting the detected differences.
The workflow output displaying page 4, demonstrating the detection of even minor handwritten changes like the addition of the word “not”.
Notably, page 4 contained only a minor alteration – the insertion of the word “not.” The workflow successfully identified even this subtle change, showcasing its precision.
The following steps outline the process for identifying pages with handwritten markups.
Step 1 – Data Input with Image Input Tool
The process begins with the Image Input tool, which allows you to specify a folder containing both the original and revised PDF files. This workflow is designed to process two files at a time, so ensure only the pair of documents you wish to compare is in the designated folder. The Image Input tool, part of the Intelligence Suite's Computer Vision tool set, treats each page within the PDF as an individual image.
Configuration of the Image Input tool, demonstrating how to specify the folder containing the PDF files for document comparison.
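For readers who want to replicate the page-splitting behavior outside Designer, a rough Python equivalent uses the pdf2image library (which requires poppler; this is an illustration, not what the tool actually does internally):

```python
# Split a PDF into one image per page, mimicking how Image Input
# treats each PDF page as an individual image.
from pdf2image import convert_from_path

pages = convert_from_path("Original Alteryx EULA v20 2006.pdf", dpi=200)
for number, page in enumerate(pages, start=1):
    page.save(f"original_page_{number}.png")
```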
Step 2 – Image Pre-processing: Converting to Black and White
Although the markups were intentionally made in red ink, converting the images to black and white enhances the workflow’s reliability. The Image Processing tool facilitates this conversion.
Configuration of the Image Processing tool, showing the steps for converting color images to black and white for improved markup detection accuracy.
Achieving optimal black and white conversion involves a sequence of three operations within the Image Processing tool:
- Brightness Balance Adjustment: Setting Brightness Balance to -77 darkens the text, improving contrast. This value may need adjustment depending on document characteristics, but a negative value generally darkens the image.
- Grayscale Conversion: The Grayscale function converts the color image into shades of gray, a necessary intermediate step for black and white conversion.
- Binary Thresholding: Binary thresholding converts the grayscale image to pure black and white, minimizing stray pixels and noise – particularly helpful for mitigating shadows or artifacts introduced during the scanning process.
Detailed steps within the Image Processing tool: Brightness Balance, Grayscale conversion, and Binary thresholding, illustrating the image enhancement process.
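As a point of reference, the same three operations can be sketched in Python with Pillow; the brightness factor and threshold value below are illustrative assumptions, not the exact settings used by the Image Processing tool:

```python
# Illustrative equivalents of the three Image Processing operations.
from PIL import Image, ImageEnhance

image = Image.open("signed_page_3.png")

darkened = ImageEnhance.Brightness(image).enhance(0.7)  # factor < 1.0 darkens, like a negative Brightness Balance
gray = darkened.convert("L")                            # grayscale intermediate step
binary = gray.point(lambda v: 255 if v > 128 else 0)    # binary thresholding removes stray pixels and noise
binary.save("signed_page_3_bw.png")
```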
Step 3 – Image Profiling and Page Arrangement
Next, the Image Profile tool is employed to extract metadata from the processed images, adding descriptive fields. The pages are then arranged side-by-side, and duplicate entries are eliminated through subsequent steps. A temporary “Test” field is created and later removed during this phase.
A visual representation of the workflow segment involving Image Profiling and page arrangement, showing the tools used to organize and prepare the data for comparison.
Following the Image Input and Image Processing steps, which generated the [image] and [image_processed] fields, the Image Profile tool is applied to the [image_processed] field. By default, Image Profile is configured to “Select All” metadata profiles. However, for this workflow, only the bright pixel counts are relevant. Therefore, the configuration is changed to “Only Include Selected Profiles,” and the “Base” set is chosen from the dropdown.
The Image Profile tool icon and its configuration panel, demonstrating the selection of “Base” metadata profiles to extract bright pixel count data.
The Image Profile tool generates numerous metadata fields (17 in this case). A Select tool is then used to keep [Bright_Pixel_Count], discard the other profile fields, and apply some renaming and resizing for clarity and efficiency.
The fields kept are:
- file (resized to 100 characters)
- page (renamed to Page)
- image (renamed to Original Image)
- Bright_Pixel_Count
Resizing the [file] field caps the string width at what the data actually needs, a general best practice for managing memory. Renaming [page] to [Page] and [image] to [Original Image] keeps field names consistent, improving readability and workflow maintainability.
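To make the metric concrete: on a binarized page, the bright pixel count is simply the number of white pixels. A small NumPy sketch of that idea (an assumption about how the Image Profile tool defines "bright", reusing the file name from the previous sketch):

```python
# Count white pixels on a binarized page image.
import numpy as np
from PIL import Image

binary = np.array(Image.open("signed_page_3_bw.png"))
bright_pixel_count = int(np.count_nonzero(binary == 255))
print(bright_pixel_count)
```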
The crucial transformation at this stage is restructuring the data from a single list containing all pages from both PDF files into a consolidated list. This list should have one row per page, with corresponding image and pixel count fields from both the original and signed documents aligned on the same row.
The initial data structure looks like this:
An example of the initial data structure after Image Profiling, showing separate rows for each page from both the original and signed documents.
The desired data structure for comparison is this:
The desired data structure after joining and filtering, with each row representing a page and containing data from both the original and signed documents for direct comparison.
To achieve this restructuring, a Join tool is used to join the data to itself based on the [Page] field. The checkbox next to [Page] from the right join input is unselected to avoid duplicate page fields.
The Join tool icon and configuration, showing the self-join based on the “Page” field to combine data from the original and signed documents.
Joining the 12-row dataset (two files × 6 pages) to itself on [Page] produces 24 rows: 4 file pairings for each of the 6 pages. The pairings are:

| File 1 | File 2 |
|---|---|
| Original contract | Original contract |
| Signed contract | Signed contract |
| Original contract | Signed contract |
| Signed contract | Original contract |
A Filter tool is then used to eliminate rows where both [File 1] and [File 2] refer to the same file (the first two combinations in the table above).
Configuration of the Filter tool, used to remove redundant rows where both file inputs are the same, leaving only rows with paired original and signed documents.
This filtering reduces the dataset to 12 records.
The filtered data output, showing 12 records with paired original and signed document pages, ready for deduplication.
Upon closer inspection, each page now has two mirrored records rather than one. For example, records 1 and 2 both represent Page 4, with the original and signed files simply swapping positions between [File 1] and [File 2], [File 1 Original Image] and [File 2 Original Image], and [File 1 Bright Pixel Count] and [File 2 Bright Pixel Count]. To remove this redundancy, one row from each pair must be dropped consistently.
To ensure consistent deduplication, a Sort tool is applied, sorting the data by [Page] and then [File 1]. While the data may appear ordered already, this step ensures consistent sorting even with different input data in the future.
Configuration of the Sort tool, ordering the data by “Page” and “File 1” to prepare for deduplication using the Unique tool.
Finally, the Unique tool is used to retain only the first unique row for each [Page], effectively removing the duplicate records.
The Unique tool icon and configuration, set to identify unique rows based on the “Page” field, removing duplicate records and resulting in one row per page.
This process results in a dataset with one row per page, each containing the original and signed file names, images, and bright pixel counts, ready for the markup detection step.
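The whole Step 3 restructuring can be summarized in a few lines of pandas. The toy counts below stand in for real profiling output, the image columns are omitted for brevity, and pandas suffixes the duplicated column names rather than prefixing them the way the workflow does:

```python
# pandas sketch of the Join -> Filter -> Sort -> Unique sequence.
import pandas as pd

# Toy stand-in for the profiled data: one row per page per file.
pages = pd.DataFrame({
    "file": ["original.pdf"] * 6 + ["signed.pdf"] * 6,
    "Page": list(range(1, 7)) * 2,
    "Bright_Pixel_Count": [100, 200, 300, 400, 500, 600,
                           101, 199, 340, 420, 502, 601],
})

joined = pages.merge(pages, on="Page", suffixes=(" 1", " 2"))  # Join tool: 12 rows -> 24
paired = joined[joined["file 1"] != joined["file 2"]]          # Filter tool: 24 rows -> 12
paired = paired.sort_values(["Page", "file 1"])                # Sort tool: deterministic ordering
deduped = paired.drop_duplicates("Page", keep="first")         # Unique tool: 12 rows -> 6, one per page
```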
Step 4 – Markup Detection and Page Filtering
The crucial step is to test for markups by calculating a ratio based on bright pixel counts and applying a threshold to identify pages likely containing handwritten modifications. Pages exceeding this threshold are flagged for manual review.
A visual representation of the workflow segment for markup detection, showing the Formula and Filter tools used to identify pages with handwritten changes.
The ratio is calculated as follows: the absolute difference in bright pixel counts between the two pages is divided by the smaller of the two bright pixel counts. This normalization is essential to account for potential variations in image size or scanning quality between the original and signed documents.
The Formula tool icon and configuration, showing the formula used to calculate the markup ratio based on bright pixel counts.
Using a ratio, rather than raw pixel count difference, is critical. Variations in image dimensions between the original (likely digitally printed) and signed (printed and scanned) contracts can lead to significant pixel count differences unrelated to markups. The ratio normalizes this size difference, preventing disproportionately larger images from skewing the comparison.
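A direct restatement of that calculation in Python:

```python
def markup_ratio(count_a: int, count_b: int) -> float:
    """Absolute difference in bright pixel counts, normalized by the smaller count."""
    return abs(count_a - count_b) / min(count_a, count_b)
```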
Another Formula tool then creates the [Test for Markups] field, assigning text labels based on the calculated ratio. Pages exceeding a threshold (empirically determined to be 0.046 for this document) are flagged as “Markup likely here,” while others are labeled “No markups here.”
The Formula tool icon and configuration, showing the conditional formula used to assign “Markup likely here” or “No markups here” labels based on the calculated ratio and threshold.
As expected, pages 3 and 4, which contain markups in the signed contract, are correctly identified. The threshold of 0.046 effectively distinguishes pages with and without markups for this particular document.
Finally, a basic Filter tool is used to isolate only the pages where [Test for Markups] is not equal to “No markups here,” leaving only the pages flagged as likely containing markups.
The Filter tool icon and configuration, set to filter the data and output only the rows labeled “Markup likely here,” isolating the pages with potential handwritten modifications.
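Continuing the earlier pandas sketch, the last two tools reduce to a labeling column and a filter (0.046 being the empirically chosen threshold from above):

```python
# Label each page against the threshold, then keep only flagged pages.
THRESHOLD = 0.046

deduped["Markup Ratio"] = deduped.apply(
    lambda row: markup_ratio(row["Bright_Pixel_Count 1"],
                             row["Bright_Pixel_Count 2"]),
    axis=1,
)
deduped["Test for Markups"] = deduped["Markup Ratio"].apply(
    lambda r: "Markup likely here" if r > THRESHOLD else "No markups here"
)
flagged = deduped[deduped["Test for Markups"] != "No markups here"]
print(flagged[["Page", "Markup Ratio", "Test for Markups"]])
```

With the toy counts above, pages 3 and 4 are the rows that come out flagged.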
The workflow successfully isolates pages 3 and 4, the pages with handwritten markups, demonstrating the effectiveness of this computer vision-based approach.
This workflow provides an automated solution to a previously manual and time-consuming problem. The ability to automatically identify pages with handwritten markups in contracts significantly streamlines the review process, saving time and reducing the risk of overlooking critical changes.
While this part of the workflow provides the core functionality, the output format is basic. Part 2 of this exploration will focus on enhancing the presentation of these results, creating a visually appealing and informative report. Stay tuned for part 2 next week to see how to make these results presentation-ready.
Find contract pages with markups.yxmd