How do you unit test a piece of code that generates a PDF file? There are a number of interesting answers to this question around the web, including some neat ideas such as:
- Use something like PIL to convert the PDF to a PNG or similar, then iterate over the pixels in the resulting bitmap.
- OCR the PDF file and check the resulting text against ground truth.
- Use a specialist PDF-diff tool to test the generated PDF against ground truth.
This seems like overkill to me! A simple way forward is just to use the diff tool that comes as standard on UNIX platforms.
Usually diff is used with plain text files, but it can work with binary files as well. Here’s a very simple example:
$ diff report.pdf expected.pdf Binary files report.pdf and expected.pdf differ $
Hmm! Neat, but not terribly useful. What else can we do? A quick browse through the diff man-page show that the -a command-line switch tells diff to treat a binary file as if it were text. This sounds like a step forward.
diff -a report.pdf expected.pdf 162,163c162,163 < /CreationDate (D:20140812210344+01'00') < /ModDate (D:20140812210344+01'00') --- > /CreationDate (D:20140812012140+01'00') > /ModDate (D:20140812012140+01'00') 187c187 < /ID [<3428D71EEBFEECF7176993643DEA57D0> <3428D71EEBFEECF7176993643DEA57D0>] --- > /ID [<3FD57F91F32489646331D1DBBF510CDA> <3FD57F91F32489646331D1DBBF510CDA>] $
As you’d expect with PDF, there is some metadata inside the files that we would expect to differ between PDF files, even if the files have the same content. What we need to do next is to tell diff to ignore this metadata, and we can do that with the -I switch. We might also want to ignore whitespace, which we can do with -w:
$ diff -w -a -I .*Date.* -I \/ID.* report.pdf expected.pdf $
Just what we wanted! As with all UNIX tools here, the command was successful (the files were ‘identical’) so we didn’t get any output. To put that in a unit testing context, we can write that up as pytest unit test:
import os import subprocess def test_pdf(): # Generate PDF here ... assert os.path.exists('expected.pdf') assert os.path.exists('report.pdf') # Diff the resulting PDF file with a ground truth. diff_command = ['diff', '-w', '-a', '-I', '.*Date.*', '-I', '\/ID.*', 'report.pdf', 'expected.pdf'] child = subprocess.Popen(diff_command, stdout=subprocess.PIPE, cwd=os.path.dirname(__file__)) out, err = child.communicate() assert 0 == child.returncode