HACKER Q&A
📣 shawnfrostx

Why is the PDF format so inaccessible?


I am working on some typographical software that is supposed to generate PDFs at the end. It seems like there is no accessible information on how to do this. The PDF ISO specification is behind a paywall and has a dead link to a 2008 spec. There are open source converters like pandoc, but nothing that actually writes to PDF that I can find. Is there any resource that goes over the process of PDF generation?


  👤 layer8 Accepted Answer ✓
The PDF spec is officially available here: http://www.adobe.com/go/pdfreference

There’s also this book which provides a good introduction and overview and is useful for understanding how the format works (although the PDF reference itself is pretty decent too, as far as specs go): https://www.oreilly.com/library/view/developing-with-pdf/978... (You can find a PDF copy if you look around.) EDIT: There’s also https://www.oreilly.com/library/view/pdf-explained/978144932... which might be even better.

However, be warned that the PDF format can be quite complex and is not exactly for the faint of heart. It’s best to use an established library to generate PDF output, like PDFBox, iText, PDFSharp, PDFKit, etc. Those tend to have their own tutorials.

For emphasis: Do not generate PDFs “by hand”! You risk inadvertently generating PDFs that do not fully conform to the spec, and not noticing it because PDF readers are quite lenient in what they accept. A lot of PDFs in the wild are not standard-conforming in some way or other, because their generators were not carefully written against the spec, but against “whatever Acrobat Reader accepts”. This is the bane of all software on the receiving end that needs to process PDFs.


👤 perardi
As for libraries, it seems PDFKit is the dominant one.

https://github.com/foliojs/pdfkit

As to why it’s so inaccessible…because Adobe created this monstrosity to do just about everything. Text, fonts, vector graphics, raster graphics, forms, color spaces, JavaScript, encryption, signatures, 3D artwork, video, audio, Flash, and probably more. It’s bonkers as to what it can possibly include, and it was developed during a way different time.


👤 kragen
Most recently I used ReportLab for direct PDF writing from Python¹, but generating them from PostScript is often easier², depending on what you're doing. https://en.wikipedia.org/wiki/PDF#External_links has a lot of information; also the "Further reading" section has some links which Adobe has broken at the moment, but archive.org versions of them like https://web.archive.org/web/20200127173721/https://www.adobe... work. Also I think Adobe put PDF 1.7 on the Archive themselves: https://archive.org/details/pdf1.7

The ReportLab APIs mirror the PDF file structure relatively closely.

Don't listen to the people who are nattering on about how PDF is proprietary on purpose. I think that may have been the case in its early years but it hasn't been the case this millennium.

PDF 1.7 (the spec from 02008) and even earlier versions are most often used, as you'll see if you run head -1 *.pdf in a directory with a lot of random PDFs. PDF 2.0 is not important and you may want to intentionally write an earlier version for broader compatibility. The big incompatibility is actually PDF 1.4 to 1.5: 1.5 added compressed object streams, and a lot of readers still don't support those.
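For a slightly more structured version of the head -1 survey, here is a minimal Python sketch that tallies header versions across a directory. The names `pdf_version` and `tally_versions` are made up for illustration. It only reads the `%PDF-x.y` header, so it does not notice a document catalog that declares a later version, and it assumes the header appears within the first 1024 bytes.

```python
import re
from collections import Counter
from pathlib import Path

# The header looks like "%PDF-1.7"; capture the version number.
HEADER = re.compile(rb"%PDF-(\d+\.\d+)")


def pdf_version(path):
    """Return the version string from the file header, or None if no header is found."""
    with open(path, "rb") as f:
        head = f.read(1024)  # assumption: the header sits near the start of the file
    match = HEADER.search(head)
    return match.group(1).decode("ascii") if match else None


def tally_versions(directory):
    """Count header versions for every *.pdf file directly inside `directory`."""
    return Counter(pdf_version(p) for p in sorted(Path(directory).glob("*.pdf")))
```

Calling `tally_versions(".").most_common()` lists the versions found in the current directory, most frequent first.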

______

¹ https://github.com/kragen/dercuano/blob/master/genpdf.py

² http://canonical.org/~kragen/sw/laserboot/Makefile


👤 midnitewarrior
It's old, proprietary, modeled after the PostScript printer control language, from an era before XML, and never had the intention of being open.

👤 sideproject
My previous startup worked with parsing PDFs, trying to apply NLP to the texts within PDFs - extracting titles, paragraphs, tables, bullet points etc. Oh my that was a nightmare. Sure we were doing difficult things, so that made us unique, but it was a slog. Working with different dimensions, pages upside down, sentences spanning across multiple pages etc etc.

I've also recently worked on a small tool called scholars.io [1] where I had to work with PDFs. I wasn't doing anything like parsing, but I just used existing PDF tools and libraries, which were much more pleasant, but still working on top of PDF is a challenge.

[1] - https://scholars.io (a tool to read & review research papers together with colleagues)


👤 ilayn
Make use of the TeX and Friends source code for handling PDF symbols; it then becomes much easier to check different implementations. For example, the TikZ/PGF package has both PS and PDF implementations of the same graphical objects, so you can see how PDF literals or PS specials make their way into the object stream.

Also, it is really not that cryptic, just very laborious, hence many people rely on classical tools to generate PDF instead of handcrafting PDF files from scratch. Here is a nice introduction from a decade ago for you: https://blog.idrsolutions.com/2010/09/grow-your-own-pdf-file...


👤 airbreather
I see no mention here of the most straightforward way to generate a pdf - pdfmarks.

Create blank pdf with Adobe as your base, then add what you want to it using pdfmarks and distilling.

I spent a very, very long time diving into the rabbit hole that is pdf to come to this conclusion.

There are lots of libraries out there, but none I came across that met my needs would do named destinations, for one example. I think there might be some very expensive ones that might, but pdfmarks will get you sorted.

Here is the manual; if you search around there are a few other references.

https://opensource.adobe.com/dc-acrobat-sdk-docs/acrobatsdk/...


👤 chrisseaton
> There are open source converters like pandoc

I don't think Pandoc knows anything about the PDF format. It can't read it https://github.com/jgm/pandoc/tree/master/src/Text/Pandoc/Re... or write it https://github.com/jgm/pandoc/tree/master/src/Text/Pandoc/Wr.... It uses other tools to do that.


👤 jjgreen
You could have a look at how ghostscript does it: http://git.ghostscript.com/?p=ghostpdl.git;a=tree;f=pdf;hb=r...

👤 tonetheman
The only link I could find was an archive link https://archive.org/details/pdf1.7

The open PDF standard now costs 250 USD. Adobe is supposed to have an archive of the 1.7 spec online but they do not care enough to keep that up it appears.

I am trying to think of a reason why they would do such a blindly dumb thing but the C++ people used to do the same thing.


👤 version_five
Not really answering your question but you could consider generating postscript output and then using ghostscript to convert it to pdf. That would let you create and write arbitrary stuff. I think pandoc uses pdflatex to generate a pdf via latex from the internal pandoc representation.

ImageMagick also writes to PDF, I believe, but it may only convert raster images. With PostScript you can generate a vector PDF.
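The PostScript-then-Ghostscript route could look roughly like the Python sketch below. The helper names (`postscript_page`, `ghostscript_command`, `convert`) are invented for illustration, and it assumes a Ghostscript binary called `gs` on the PATH (on Windows the executable name is different). The flags used are the usual ones for the `pdfwrite` device.

```python
import subprocess


def postscript_page(text, x=72, y=720, size=12):
    """Return a one-page PostScript program that draws `text` in Helvetica."""
    # Backslashes and parentheses must be escaped inside a PostScript string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return (
        "%!PS\n"
        f"/Helvetica findfont {size} scalefont setfont\n"
        f"{x} {y} moveto\n"
        f"({escaped}) show\n"
        "showpage\n"
    )


def ghostscript_command(ps_path, pdf_path, gs="gs"):
    """Build the Ghostscript invocation that converts a PostScript file to PDF."""
    return [gs, "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
            f"-sOutputFile={pdf_path}", ps_path]


def convert(ps_path, pdf_path):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(ghostscript_command(ps_path, pdf_path), check=True)
```

Typical use would be to write `postscript_page("Hello")` to `page.ps` and then call `convert("page.ps", "page.pdf")`. Since the text is drawn with PostScript operators, the resulting PDF stays vector rather than rasterized.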


👤 thorum
Most of the best, most comprehensive PDF libraries are written for Java. There are libraries for other languages but they tend to be incomplete or flawed. There are also some great paid libraries for C#, but if you want free, I’d recommend looking into Apache PDFBox.

👤 gettalong
As others have already written, there is a free version of the PDF 1.7 specification available and using this you are (nearly) able to implement a PDF reader/writer. I wrote "nearly" because of the many malformed PDFs out there and because of some ambiguities in the spec that will have you look at certain parts of the implementation of existing libraries.

That said, implementing a basic PDF reader/writer is not that complex and can easily be done in a few months. However, since you seem to also want to generate PDF pages with content, a whole lot of things have to be considered, like fonts (Type1, TrueType, CFF) and how to actually generate the content.

Adding some straightforward text using some built-in PDF font onto a PDF page is easy. But if you want to use a (subset) TrueType or OpenType font, have ligatures, contextual character substitutions, (LaTeX like) line wrapping, tagging for accessibility, ... you will open a can of worms ;-)

This is certainly also doable but gets quite complex and is the reason many PDF libraries only implement basic typographic features that are easy. You can probably count the PDF libraries supporting advanced OpenType typographic features on one hand...

However, if you are already in the process of writing typographical software, this last part may actually already be done in your case. So if you have, as output from that software, the glyphs and their positions, there is not that much complexity to implement and you could probably use a basic PDF library to do the PDF writing for you.
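To give a feel for how small the writing side can be once the text positions are known, here is a minimal Python sketch that emits a one-page PDF using the built-in Helvetica font. It is an illustration of the file structure (objects, cross-reference table, trailer), not a substitute for an established library. `build_pdf` and its `(x, y, size, text)` run format are hypothetical, it assumes printable ASCII text and plain decimal coordinates, and it has not been checked against a conformance validator.

```python
def build_pdf(runs, width=612, height=792):
    """Return the bytes of a one-page PDF placing each (x, y, size, text) run.

    Coordinates are in points, measured from the bottom-left corner of the page.
    """
    def escape(text):
        # Backslash and parentheses need escaping inside a PDF literal string.
        return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

    # Content stream: one text object (BT ... ET) per run.
    ops = [f"BT /F1 {size} Tf {x} {y} Td ({escape(text)}) Tj ET"
           for x, y, size, text in runs]
    content = "\n".join(ops).encode("ascii")

    # Object numbers are 1..5, in list order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {width} {height}] "
         f"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>").encode("ascii"),
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj", needed by the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is the head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)
```

For example, `open("hello.pdf", "wb").write(build_pdf([(72, 720, 12, "Hello, world")]))` should give a file that common viewers open. Embedded or subset fonts, ligatures, and tagging are exactly the parts this sketch leaves out.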


👤 cozzyd
Doesn't e.g. cairo solve this problem? https://en.m.wikipedia.org/wiki/Cairo_(graphics)

👤 jasomill
If you're comfortable handling the (typo)graphical aspects of the PDF yourself and have the ability to consume a C++ library, I've had good experiences using the Apache-licensed qpdf[1] library to handle the low-level structural aspects of the PDF standard. It's particularly convenient when your application requires structure-preserving integration of existing PDF content.

Simple example applications, each completed in 2–3 days, both in C#, using C++/CLI to integrate libqpdf:

1. Overlaying fixed-format text on pre-existing blank PDF form pages, ensuring the content of each distinct form page is embedded exactly once, and that all necessary assets (fonts, images, etc.) from the blank form PDF pages are included in the output PDF.

2. Losslessly combining a sequence of PDF, TIFF, and JPEG images into a single PDF with bookmarks pointing to the first page of each source file and existing image compression maintained where possible. In this application, only the source TIFFs were anything other than arbitrary (i.e., the TIFFs were more-or-less baseline images coming from a small number of scanning systems, but the JPEGs and PDFs came from all sorts of different applications).

[1] https://github.com/qpdf/qpdf



👤 Const-me
Find a library.

Couple years ago I needed to generate PDF reports, relatively complicated ones: headers/footers/backgrounds, page numbers, complex tables, jpeg bitmaps, custom vector graphics in diagrams, etc. This one did the job: https://www.nuget.org/packages/iTextSharp-LGPL


👤 kennu
For Node.js there is a nice library called PDFKit (https://pdfkit.org/) which offers a canvas-like API for drawing graphics, text with TTF fonts, and other graphical elements. I would say it's pretty good if you need exact control over the PDF output.

👤 epirogov
Aspose.PDF is a great library for creating and editing PDF documents:

https://products.aspose.app/pdf

and you can render your PDF documents online for free from many source formats at

https://products.aspose.app/pdf/conversion

For example, this library generates documents from scratch according to the ISO specification from Adobe:

https://docs.aspose.com/pdf/net/create-document/


👤 d--b
Cause it belongs to Adobe and they clearly don’t want to make it easy for developers to work with it.

👤 chasil
I had to recreate some PDFs at work that were created by "iText by Lowagie" which must have been a java library at the time.

I redid it with the FPDF library for php, and it worked out fine. I tried some new features of tcpdf, and it wasn't much work to convert.

Using inkscape to make an EPS out of an svg was also challenging.

I know that PostScript and PDF are based on a Forth stack machine, if I really had to get that low.

http://fpdf.org/

https://tcpdf.org/

https://wiki.c2.com/?ForthPostscriptRelationship


👤 jchw
I did this once. Maybe my small journal will be useful.

https://github.com/jchv/resume/blob/master/journal.md


👤 mwcampbell
To everyone thinking about writing their own code to generate PDF, I'm begging you, please either implement tagged PDF support for accessibility, and test it with Adobe Reader and a screen reader, or consider using an existing PDF generator that supports tagged PDF, such as LibreOffice, iText, or a recent version of Chromium. The web already has enough untagged, inaccessible PDFs to provide no shortage of work for multiple document remediation businesses, including my own. But I'm an accessibility advocate first, and as the saying goes, an ounce of prevention is worth a pound of cure.

👤 alanh
From the title, I thought you meant inaccessible as in providing little to no affordances for users of assistive technology (you know, accessibility, a11y… alt text, semantic markup, that sort of thing)

👤 Maursault
This can be done for free using bash and Ghostscript, or even using the free tools from ImageMagick (like convert). But there are decent third-party proprietary solutions, such as the Enfocus PitStop Pro suite of software. Don't use Adobe products, please.

👤 OJFord
Wouldn't that be nice.

Not directly answering your question, but I suppose the solution is to just pick the closest thing and convert. HTML & CSS being the most full-featured/generic. Markdown simplest for basic 'word processing'. LaTeX good for more advanced such cases. Images good for others. Maybe ePub would suit your 'typographical' needs (I think it's a lot more open than PDF, and itself HTML based)?


👤 vbezhenar
Simplest way to generate pdf is to generate xsl-fo document which is good old XML and then convert it to PDF using one of the processors, e.g. Apache FOP.
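A minimal XSL-FO document can be produced with nothing but the Python standard library, as in the sketch below. The function name `build_fo` and the A4 page setup are assumptions chosen for illustration. The element names follow the XSL-FO namespace.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)  # serialize with the conventional "fo:" prefix


def fo(tag):
    """Qualify a tag name with the XSL-FO namespace."""
    return f"{{{FO_NS}}}{tag}"


def build_fo(paragraphs):
    """Return an XSL-FO document (as a string) with one block per paragraph."""
    root = ET.Element(fo("root"))

    # Page geometry: a single A4 master with a body region.
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "A4",
        "page-width": "210mm",
        "page-height": "297mm",
        "margin": "20mm",
    })
    ET.SubElement(page, fo("region-body"))

    # Content: a page sequence whose flow fills the body region.
    sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "A4"})
    flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})
    for text in paragraphs:
        block = ET.SubElement(flow, fo("block"), {"space-after": "6pt"})
        block.text = text  # ElementTree escapes <, > and & on output

    return ET.tostring(root, encoding="unicode")
```

After saving the result as `doc.fo`, Apache FOP's command line can render it with `fop -fo doc.fo -pdf doc.pdf`, assuming the `fop` launcher script is installed.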

👤 farnerup
Years ago I wrote a PDF generator for Passepartout (http://www.stacken.kth.se/project/pptout/) from reading the PDF book. I thought it was a well designed format. It is a binary format though, for the sake of efficiency.

👤 psnehanshu
I am amazed how most of the answers suggest reading the specs. Why overcomplicate things when you just need to generate PDF? The simplest way is to generate HTML files, then use a headless browser to convert them to PDF. Simple!
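A rough Python sketch of the headless-browser route follows. It assumes a Chromium-based browser whose executable is called `chromium` (the name varies by platform and distribution), and the helper names are made up. `--headless` and `--print-to-pdf` are the browser's own command-line switches.

```python
import subprocess
from pathlib import Path


def print_to_pdf_command(html_path, pdf_path, browser="chromium"):
    """Build the command line that prints a local HTML file to PDF."""
    html_url = Path(html_path).resolve().as_uri()  # file:// URL for the input page
    output = Path(pdf_path).resolve()
    return [browser, "--headless", f"--print-to-pdf={output}", html_url]


def html_to_pdf(html_path, pdf_path, browser="chromium"):
    """Run the browser; raises CalledProcessError if it exits with an error."""
    subprocess.run(print_to_pdf_command(html_path, pdf_path, browser), check=True)
```

Page size, margins, and page breaks are then controlled from the HTML side through CSS print rules such as `@page`.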

👤 acehw
There's an open standard version of PDF called PDF/A, which LibreOffice can write and read.

https://en.wikipedia.org/wiki/PDF%2FA?


👤 Thaxll
Because the spec is terrible and full of corner cases.

👤 wronglyprepaid
I can highly recommend pagedjs.org and CSS paged media. This is used by asciidoctor pdf JS and it is an absolute dream to work with.

👤 tmaly
I know there is a Perl module that can write low-level parts of PDF, but it only supports version 1.4 or 1.5.

👤 oshirisuki
I think it's the same as MS Office: they make a format, and they are the only ones that make the software to use files in that format. It's funny, because Adobe Reader is very bad, and they ended up "giving it up"; now ISO handles the spec. As for generating PDFs, maybe check LibreOffice, or some other software that creates PDFs and has its source available?

👤 tracyma
Why is there no native PDF API on Windows? macOS has PDFKit. Win32 has great foundations for a good PDF library: Direct2D/DirectWrite. But it's so inconvenient to do PDF programming on Windows.

👤 mistrial9
ask Leonard R. -- it stretches back to the pre-Internet "multimedia" competition..

👤 b215826
TeX? Troff? PostScript?

👤 zzo38computer
I think PDF is no good; I think it is too messy. I think that better formats are possible, such as maybe PCL, and I have some of my own ideas for making a better format, too.

However, PDF is a commonly used format.

When I wanted to generate PDF (or other formats such as PNG), I just wrote a PostScript program to do it (and then ran it through Ghostscript). (Drivers could also be added to make other output formats too if wanted.)


👤 bob1029
The fact that you have to pay to see the standard should tell you everything you need to know.