harsig.org broken PDFs

Fix:

Install ghostscript version 9.19 or greater (probably), then do:

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=newfile.pdf fuckedfile.pdf

... and then read the newly created file, which should work fine.
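
If there is a whole pile of them to get through, the same command can be wrapped in a loop. What follows is only a rough sketch: the function names unfuck_pdf and unfuck_all and the idea of a separate output directory are made up for this example, and the only thing taken from above is the gs invocation itself.

```shell
#!/bin/sh
# Sketch: run the ghostscript rewrite shown above over several PDFs.
# unfuck_pdf / unfuck_all are hypothetical names invented for this example.

unfuck_pdf() {
    # $1 = broken input file, $2 = directory to put the rewritten copy in
    src=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    out="$outdir/$(basename "$src")"
    # Same gs invocation as above, with the file names filled in.
    gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile="$out" "$src"
}

unfuck_all() {
    # $1 = output directory, remaining arguments = input files
    outdir=$1
    shift
    for f in "$@"; do
        if unfuck_pdf "$f" "$outdir"; then
            echo "ok: $f"
        else
            echo "FAILED: $f" >&2
        fi
    done
}
```

Having sourced that, something like "unfuck_all fixed *.pdf" would leave the rewritten copies in a directory called fixed and the originals untouched, so nothing is lost if ghostscript chokes on one of them.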

Waffle:

There are some jolly good London Underground historical signalling diagrams on harsig.org in PDF format. Unfortunately, something over half the PDFs are fucked.

The first indication one gets is that on clicking the PDF link in the browser, it appears to download the file just fine, but then instead of opening it, nothing happens at all. And indeed on looking in the browser's temporary downloads directory, there the file is. But on trying to open it from the command line, xpdf reports:

$ xpdf Met1933.pdf
Syntax Error: Invalid encryption key length
Command Line Error: Incorrect password

Which is a load of bollocks, because there is no fucking password. It is not an encrypted PDF. It's just fucked.

It doesn't work with other tools either. Trying to fix it with pdftk gives:

$ pdftk Met1933.pdf output met1933.pdf
Error: Unexpected Exception in open_reader()
Unhandled Java Exception in main():
java.lang.NullPointerException
   at gnu.gcj.runtime.NameFinder.lookup(libgcj.so.15)
   at java.lang.Throwable.getStackTrace(libgcj.so.15)
   at java.lang.Throwable.stackTraceString(libgcj.so.15)
   at java.lang.Throwable.printStackTrace(libgcj.so.15)
   at java.lang.Throwable.printStackTrace(libgcj.so.15)

On GitHub it is suggested that this is due to a known bug that hasn't been fixed, and that while pdftk can't handle the file, it can be fixed by using ghostscript to read it and create a new, unfucked version:

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=newfile.pdf myfile.pdf

Except this STILL doesn't work. ghostscript produces: (line breaks inserted for readability)

$ gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=met1933.pdf Met1933.pdf
   **** This file uses an unknown standard security handler revision: 6
Error: /undefined in pdf_check_password
Operand stack:
   ()
Execution stack:
   %interp_exit .runexec2 --nostringval-- --nostringval-- --nostringval-- 2
   %stopped_push --nostringval-- --nostringval-- --nostringval-- false 1 %stopped_push
   1910 1 3 %oparray_pop 1909 1 3 %oparray_pop 1893 1 3 %oparray_pop
   --nostringval-- --nostringval-- --nostringval-- --nostringval-- false 1
   %stopped_push --nostringval--
Dictionary stack:
   --dict:1167/1684(ro)(G)-- --dict:1/20(G)-- --dict:82/200(L)-- --dict:82/200(L)--
   --dict:109/127(ro)(G)-- --dict:293/300(ro)(G)-- --dict:21/31(L)--
Current allocation mode is local
GPL Ghostscript 9.06: Unrecoverable error, exit code 1

So three different tools which each use their own different method of opening a PDF all fail to open it. I think we can say pretty definitely that yes, this file is indeed fucked.

But what's interesting is that some of the fucked files were actually produced by ghostscript in the first place. At least that's what pdfinfo says:

$ pdfinfo Met1933.pdf
Producer:       GPL Ghostscript 9.50
CreationDate:   Sun Nov 1 12:32:13 2020 GMT-1
ModDate:        Sun Nov 1 12:35:13 2020 GMT-1
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      yes (print:no copy:no change:no addNotes:no algorithm:AES-256)
Page size:      11905.5 x 2834.25 pts
Page rot:       0
File size:      1077267 bytes
Optimized:      yes
PDF version:    1.7

And it's not only ghostscript that produces fucked files (unless this "Online2PDF" crap is using ghostscript and then fucking with the metadata to hide it). Here's the result for another one that's fucked:

$ pdfinfo MJB2005.pdf
Creator:        Online2PDF.com
Producer:       Online2PDF.com
CreationDate:   Mon Mar 1 12:53:15 2021 GMT-1
ModDate:        Mon Mar 1 12:58:18 2021 GMT-1
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      yes (print:no copy:no change:no addNotes:no algorithm:AES-256)
Page size:      14400 x 4251.75 pts
Page rot:       0
File size:      1580631 bytes
Optimized:      yes
PDF version:    1.7

On the other hand, some of them actually do work:

$ pdfinfo WidenedLines1958.pdf
Title:          Widened Lines.SKF
Author:         Patrick
Creator:        PScript5.dll Version 5.2.2
Producer:       Acrobat Distiller 20.0 (Windows)
CreationDate:   Mon May 18 18:03:52 2020 GMT-1
ModDate:        Mon May 18 18:06:54 2020 GMT-1
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      yes (print:no copy:no change:no addNotes:no algorithm:AES)
Page size:      8504 x 2126 pts
Page rot:       0
File size:      246429 bytes
Optimized:      yes
PDF version:    1.6

Even though it's still got this bollocks about it being encrypted, that one does work. It is not encrypted, and xpdf (which is part of the same bunch of stuff as pdfinfo) opens it fine without any complaint.
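
One thing that does differ in the pdfinfo output quoted above is the algorithm on the "Encrypted:" line: the two broken files say AES-256 and the working one says plain AES. Three files is not much to go on, so treat this as a hint rather than a rule, but it gives a quick way of sorting a directory of downloads into likely-broken and likely-fine. The sketch below assumes pdfinfo prints the line in the form shown above; encryption_algorithm and report_encryption are names invented for the example.

```shell
#!/bin/sh
# Sketch: pull the algorithm out of pdfinfo's "Encrypted:" line.
# encryption_algorithm / report_encryption are hypothetical helper names.

encryption_algorithm() {
    # Reads pdfinfo output on stdin; prints e.g. "AES-256" or "AES",
    # or "none" if there is no algorithm mentioned at all.
    awk '
        /^Encrypted:/ {
            # "algorithm:" is 10 characters long; keep what follows it.
            if (match($0, /algorithm:[^)]*/))
                alg = substr($0, RSTART + 10, RLENGTH - 10)
        }
        END { print (alg == "" ? "none" : alg) }
    '
}

report_encryption() {
    # Prints "file: algorithm" for each PDF named on the command line.
    for f in "$@"; do
        printf '%s: %s\n' "$f" "$(pdfinfo "$f" 2>/dev/null | encryption_algorithm)"
    done
}
```

Running "report_encryption *.pdf" then lists each file with its claimed algorithm, and the AES-256 ones are the candidates for the ghostscript treatment.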

Right, so that's the situation; what to do about it? It turns out that the ghostscript fix trick does work if you use a different version of ghostscript. With ghostscript 9.19 installed, it works fine.

The changelog says that "Support for PDF security handler revision 6" was added in version 9.15. Looking at the error output quoted above, that looks like at least part of the reason 9.19 works and 9.06 doesn't. It still isn't clear why the files are showing up as "encrypted" in the first place when they are in fact not encrypted and don't need a password. Some of them show up spuriously as "encrypted" without it fucking things up, and there are bug reports from people who are getting the same error on PDFs that really are encrypted rather than just claiming to be so because of some bug. Together these suggest that "spurious encryption" and "revision 6" are probably two separate, but related, bugs. So it could well be that 9.15 is as far as you need to go to make it work, but I shall stick to recommending 9.19 because I've actually tried that one.
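
To find out which side of that line an installed ghostscript falls on, "gs --version" prints the bare version number, which can then be compared against whichever minimum you choose to believe in. This is a minimal sketch and assumes the version string always has at least a major and a minor part separated by a dot, as in 9.06 or 9.19; version_ge and gs_new_enough are names made up for the example.

```shell
#!/bin/sh
# Sketch: check the installed ghostscript against a minimum version.
# version_ge / gs_new_enough are hypothetical helper names.

version_ge() {
    # True if version $1 >= version $2, comparing major then minor as numbers.
    # Anything after the minor part (e.g. the ".1" in 10.02.1) is ignored.
    a_major=${1%%.*}; a_rest=${1#*.}; a_minor=${a_rest%%.*}
    b_major=${2%%.*}; b_rest=${2#*.}; b_minor=${b_rest%%.*}
    [ "$a_major" -gt "$b_major" ] && return 0
    [ "$a_major" -lt "$b_major" ] && return 1
    [ "$a_minor" -ge "$b_minor" ]
}

gs_new_enough() {
    # $1 = minimum version wanted. Returns 2 if gs cannot be run at all.
    have=$(gs --version 2>/dev/null) || return 2
    version_ge "$have" "$1"
}
```

So "gs_new_enough 9.19 && echo fine" is the cautious test, and swapping in 9.15 is the optimistic one.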

The "spurious encryption" thing may be related to "setting an owner password" but not actually encrypting it. I'm not sure what this is supposed to do. The pdfinfo output above suggests it might somehow let you read it but stop you printing or copying it, or something, but it doesn't (unsurprisingly; how the fuck could it?), and the ghostscript unfucking trick gets rid of it anyway, so it seems completely fucking pointless to me.

I think it is just coincidence that in the examples above all the fucked ones are PDF version 1.7, while the working one is 1.6. I've got plenty of version 1.7 PDFs which are a lot older than any of the components of the tools I've got to read them with.

Anyway, it works now, and I seriously do not care enough about the unresolved details to want to bother digging through immense amounts of guff about all the weird and shitty obscure things a PDF can do in order to fully solve those tedious mysteries.



