Unit testing tips: diffing PDF files

How do you unit test a piece of code that generates a PDF file? There are a number of interesting answers to this question around the web, including some neat ideas such as:

  • Use something like PIL to convert the PDF to a PNG or similar, then iterate over the pixels in the resulting bitmap.
  • OCR the PDF file and check the resulting text against ground truth.
  • Use a specialist PDF-diff tool to test the generated PDF against ground truth.

This seems like overkill to me! A simple way forward is just to use the diff tool that comes as standard on UNIX platforms.

Usually diff is used with plain text files, but it can work with binary files as well. Here’s a very simple example:


$ diff report.pdf expected.pdf
Binary files report.pdf and expected.pdf differ


Hmm! Neat, but not terribly useful. What else can we do? A quick browse through the diff man page shows that the -a command-line switch tells diff to treat a binary file as if it were text. This sounds like a step forward.


$ diff -a report.pdf expected.pdf
< /CreationDate (D:20140812210344+01'00')
< /ModDate (D:20140812210344+01'00')
> /CreationDate (D:20140812012140+01'00')
> /ModDate (D:20140812012140+01'00')
< /ID [<3428D71EEBFEECF7176993643DEA57D0> <3428D71EEBFEECF7176993643DEA57D0>]
> /ID [<3FD57F91F32489646331D1DBBF510CDA> <3FD57F91F32489646331D1DBBF510CDA>]


As you’d expect with PDF, the files contain metadata (creation and modification dates, and a document ID) that will differ between two PDFs even when their content is the same. What we need to do next is tell diff to ignore this metadata, and we can do that with the -I switch, which takes a regular expression and ignores any change whose lines all match it. We might also want to ignore whitespace, which we can do with -w:


$ diff -w -a -I '.*Date.*' -I '/ID.*' report.pdf expected.pdf


Just what we wanted! As with most UNIX tools, no output means success: the files were ‘identical’ as far as diff is concerned, and the exit status is 0. To put that in a unit testing context, we can write it up as a pytest unit test:


import os
import subprocess

def test_pdf():
    # Generate PDF here ...
    assert os.path.exists('expected.pdf')
    assert os.path.exists('report.pdf')
    # Diff the resulting PDF file with a ground truth.
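    diff_command = ['diff', '-w', '-a', '-I', '.*Date.*', '-I', '/ID.*',
                    'report.pdf', 'expected.pdf']
    # Capture diff's output so it does not clutter the test run.
    child = subprocess.Popen(diff_command,
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
    out, err = child.communicate()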
    assert 0 == child.returncode
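
When a test like this fails, a bare return code doesn't say much about what went wrong. One possible refinement, sketched here as an assumption rather than part of the original recipe, is to wrap the same diff command in a small helper that hands back diff's output, so pytest can show the differing lines. The names pdf_diff and IGNORED_PATTERNS are made up for this example, and subprocess.run needs Python 3.5 or later.

```python
import subprocess

# Metadata lines that legitimately differ between runs. These mirror the
# -I arguments used above; the constant's name is hypothetical.
IGNORED_PATTERNS = ('.*Date.*', '/ID.*')


def pdf_diff(actual, expected, ignored=IGNORED_PATTERNS):
    """Return diff's output for two files, or '' if diff sees no changes."""
    command = ['diff', '-w', '-a']
    for pattern in ignored:
        command += ['-I', pattern]
    command += [actual, expected]
    result = subprocess.run(command,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    # diff exits with 0 (no differences), 1 (differences found)
    # or 2 (trouble, for example a missing file).
    if result.returncode > 1:
        raise RuntimeError(result.stderr.decode(errors='replace'))
    return result.stdout.decode(errors='replace')


# Inside a test, after generating report.pdf:
#     differences = pdf_diff('report.pdf', 'expected.pdf')
#     assert differences == '', differences
```

Raising on exit status 2 keeps "the PDFs differ" separate from "diff could not run at all", which the plain return code check lumps together as a failure.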

