There’s also this book which provides a good introduction and overview and is useful for understanding how the format works (although the PDF reference itself is pretty decent too, as far as specs go): https://www.oreilly.com/library/view/developing-with-pdf/978... (You can find a PDF copy if you look around.) EDIT: There’s also https://www.oreilly.com/library/view/pdf-explained/978144932... which might be even better.
However, be warned that the PDF format can be quite complex and is not exactly for the faint of heart. It’s best to use an established library to generate PDF output, like PDFBox, iText, PDFSharp, PDFKit, etc. Those tend to have their own tutorials.
For emphasis: Do not generate PDFs “by hand”! You risk inadvertently generating PDFs that do not fully conform to the spec, and not noticing it because PDF readers are quite lenient in what they accept. A lot of PDFs in the wild are not standard-conforming in some way or other, because their generators were not carefully written against the spec, but against “whatever Acrobat Reader accepts”. This is the bane of every piece of software on the receiving end that needs to process PDFs.
https://github.com/foliojs/pdfkit
As to why it’s so inaccessible…because Adobe created this monstrosity to do just about everything. Text, fonts, vector graphics, raster graphics, forms, color spaces, JavaScript, encryption, signatures, 3D artwork, video, audio, Flash, and probably more. It’s bonkers as to what it can possibly include, and it was developed during a way different time.
The ReportLab APIs mirror the PDF file structure relatively closely.
Don't listen to the people who are nattering on about how PDF is proprietary on purpose. I think that may have been the case in its early years but it hasn't been the case this millennium.
PDF 1.7 (the spec from 2008) and even earlier versions are most often used, as you'll see if you run head -1 *.pdf in a directory with a lot of random PDFs. PDF 2.0 is not important and you may want to intentionally write an earlier version for broader compatibility. The big incompatibility is actually PDF 1.4 to 1.5: 1.5 added compressed object streams, and a lot of readers still don't support those.
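If you'd rather script the head -1 check, here is a rough Python sketch. The function names are made up for illustration, and it only looks at the %PDF-x.y header near the start of the file. Since PDF 1.4 the document catalog can carry a /Version entry that overrides the header, and this sketch ignores that.

```python
import re
from pathlib import Path
from typing import Dict, Optional

# The header normally sits at the very start of the file, e.g. b"%PDF-1.7".
# Searching the first 1024 bytes tolerates a little leading junk.
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the header version such as '1.7', or None if no header is found."""
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return f"{match.group(1).decode()}.{match.group(2).decode()}"


def scan_directory(directory: str) -> Dict[str, Optional[str]]:
    """Map each *.pdf file name in `directory` to its header version."""
    results: Dict[str, Optional[str]] = {}
    for path in sorted(Path(directory).glob("*.pdf")):
        with path.open("rb") as handle:
            results[path.name] = pdf_header_version(handle.read(1024))
    return results
```

Calling scan_directory(".") in a folder of downloaded PDFs gives a quick picture of which versions are actually in circulation.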
______
I've also recently worked on a small tool called scholars.io [1] where I had to work with PDFs. I wasn't doing anything like parsing; I just used existing PDF tools and libraries, which was much more pleasant, but working on top of PDF is still a challenge.
[1] - https://scholars.io (a tool to read & review research papers together with colleagues)
Also, it is really not that cryptic, just very laborious, hence many people rely on established tools to generate PDFs instead of handcrafting PDF files from scratch. Here is a nice introduction from a decade ago: https://blog.idrsolutions.com/2010/09/grow-your-own-pdf-file...
Create a blank PDF with Adobe as your base, then add what you want to it using pdfmarks and distilling.
I spent a very, very long time diving into the rabbit hole that is pdf to come to this conclusion.
There are lots of libraries out there, but none I came across that met my needs would do named destinations, for one example. I think there might be some very expensive ones that might, but pdfmarks will get you sorted.
Here is the manual; if you search around, there are a few other references.
https://opensource.adobe.com/dc-acrobat-sdk-docs/acrobatsdk/...
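To give a flavour of what a pdfmark looks like, here is a small Python sketch that produces the line for one named destination. The helper name is hypothetical, and the syntax is the /DEST form as I understand it from the pdfmark reference, so check it against the manual before relying on it. The /XYZ null null null view is just one choice, meaning "keep the current position and zoom".

```python
def named_destination_pdfmark(name: str, page: int) -> str:
    """Return a pdfmark line defining a named destination on a 1-based page.

    Hypothetical helper: it only accepts plain ASCII alphanumeric names so
    that no PostScript name escaping is needed.
    """
    if page < 1:
        raise ValueError("page numbers start at 1")
    if not (name.isascii() and name.isalnum()):
        raise ValueError("this sketch only accepts ASCII alphanumeric names")
    # Form: [ /Dest <name> /Page <n> /View <destination array> /DEST pdfmark
    return f"[ /Dest /{name} /Page {page} /View [/XYZ null null null] /DEST pdfmark"


if __name__ == "__main__":
    print(named_destination_pdfmark("chapter2", 5))
```

The resulting lines go into the PostScript that gets distilled, one per destination.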
I don't think Pandoc knows anything about the PDF format. It can't read it https://github.com/jgm/pandoc/tree/master/src/Text/Pandoc/Re... or write it https://github.com/jgm/pandoc/tree/master/src/Text/Pandoc/Wr.... It uses other tools to do that.
The open PDF standard now costs 250 USD. Adobe is supposed to have an archive of the 1.7 spec online, but it appears they do not care enough to keep it up.
I am trying to think of a reason why they would do such a blindly dumb thing but the C++ people used to do the same thing.
ImageMagick also writes to PDF, I believe, but it may only convert raster images. With PostScript you can generate a vector PDF.
That said, implementing a basic PDF reader/writer is not that complex and can easily be done in a few months. However, since you seem to also want to generate PDF pages with content, a whole lot of things have to be considered, like fonts (Type1, TrueType, CFF) and how to actually generate the content.
Adding some straightforward text using some built-in PDF font onto a PDF page is easy. But if you want to use a (subset) TrueType or OpenType font, have ligatures, contextual character substitutions, (LaTeX like) line wrapping, tagging for accessibility, ... you will open a can of worms ;-)
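To show how little the easy case needs, here is a minimal Python sketch that writes a one-page PDF with a line of text in the built-in Helvetica font. It is meant for understanding the file structure (objects, cross-reference table, trailer), not for production use. The function name, page size, and text position are my own choices, and it only handles ASCII text.

```python
def build_minimal_pdf(text: str = "Hello, PDF") -> bytes:
    """Assemble a one-page PDF that draws `text` in built-in Helvetica."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Content stream: begin text, select font F1 at 24 pt, move to (72, 720),
    # show the string, end text. ASCII only in this sketch.
    content = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("ascii")

    # Object bodies, numbered 1..5 in this order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Classic cross-reference table: every entry is exactly 20 bytes long.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)


if __name__ == "__main__":
    with open("hello.pdf", "wb") as handle:
        handle.write(build_minimal_pdf())
```

Everything beyond this (embedded fonts, non-ASCII text, line breaking) is where the can of worms starts.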
This is certainly also doable but gets quite complex and is the reason many PDF libraries only implement basic typographic features that are easy. You can probably count the PDF libraries supporting advanced OpenType typographic features on one hand...
However, if you are already in the process of writing typographical software, this last part may actually already be done in your case. So if you have, as output from that software, the glyphs and their positions, there is not that much complexity to implement and you could probably use a basic PDF library to do the PDF writing for you.
Simple example applications, each completed in 2–3 days, both in C#, using C++/CLI to integrate libqpdf:
1. Overlaying fixed-format text on pre-existing blank PDF form pages, ensuring the content of each distinct form page is embedded exactly once, and that all necessary assets (fonts, images, etc.) from the blank form PDF pages are included in the output PDF.
2. Losslessly combining a sequence of PDF, TIFF, and JPEG images into a single PDF with bookmarks pointing to the first page of each source file and existing image compression maintained where possible. In this application, only the source TIFFs were anything other than arbitrary (i.e., the TIFFs were more-or-less baseline images coming from a small number of scanning systems, but the JPEGs and PDFs came from all sorts of different applications).
https://web.archive.org/web/*/https://www.adobe.com/content/...
Edit: Never mind, see this comment: https://news.ycombinator.com/item?id=31267227
A couple of years ago I needed to generate PDF reports, relatively complicated ones: headers/footers/backgrounds, page numbers, complex tables, JPEG bitmaps, custom vector graphics in diagrams, etc. This one did the job: https://www.nuget.org/packages/iTextSharp-LGPL
For example, this library generates documents from scratch following the ISO specification from Adobe:
https://products.aspose.app/pdf
and you can render your PDF documents online for free from many source formats at
https://products.aspose.app/pdf/conversion
I redid it with the FPDF library for php, and it worked out fine. I tried some new features of tcpdf, and it wasn't much work to convert.
Using Inkscape to make an EPS out of an SVG was also challenging.
I know that PostScript is a Forth-like stack language and that PDF content streams use the same operands-then-operator style, if I really had to get that low.
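For anyone who hasn't seen the stack style, here is a toy Python evaluator for a handful of PostScript-style operators (add, sub, mul, div, dup, exch). It is only an illustration of the postfix idea, with a function name I made up, and it is nowhere near a real interpreter.

```python
import operator
from typing import List


def evaluate_postfix(program: str) -> List[float]:
    """Run a whitespace-separated postfix program and return the final stack."""
    binary_ops = {
        "add": operator.add,
        "sub": operator.sub,
        "mul": operator.mul,
        "div": operator.truediv,
    }
    stack: List[float] = []
    for token in program.split():
        if token in binary_ops:
            # Operands were pushed earlier; the operator consumes the top two.
            right = stack.pop()
            left = stack.pop()
            stack.append(binary_ops[token](left, right))
        elif token == "dup":
            stack.append(stack[-1])  # duplicate the top element
        elif token == "exch":
            stack[-1], stack[-2] = stack[-2], stack[-1]  # swap the top two
        else:
            stack.append(float(token))  # anything else is treated as a number
    return stack


if __name__ == "__main__":
    print(evaluate_postfix("3 4 add 2 mul"))
```

A PDF content stream like "72 720 Td" reads the same way: the two numbers come first, then the operator that uses them.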
Not directly answering your question, but I suppose the solution is to just pick the closest thing and convert. HTML & CSS being the most full-featured/generic. Markdown simplest for basic 'word processing'. LaTeX good for more advanced such cases. Images good for others. Maybe ePub would suit your 'typographical' needs (I think it's a lot more open than PDF, and itself HTML-based)?
However, PDF is a commonly used format.
When I wanted to generate PDF (or other formats such as PNG), I just wrote a PostScript program to do it (and then ran it through Ghostscript). (Drivers could also be added to make other output formats too if wanted.)
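Roughly what that workflow looks like when driven from Python, as a sketch: one function writes a tiny PostScript program, another hands it to Ghostscript's pdfwrite device. The function names are mine, and it assumes the Ghostscript executable is on the PATH as gs (on Windows it usually has a different name), returning False if it is not found.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def make_postscript(message: str) -> str:
    """Return a one-page PostScript program that shows `message` (ASCII only)."""
    # Escape the characters that are special inside a PostScript string.
    escaped = message.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "\n".join([
        "%!PS",
        "/Helvetica findfont 24 scalefont setfont",
        "72 720 moveto",
        f"({escaped}) show",
        "showpage",
        "",
    ])


def postscript_to_pdf(ps_source: str, pdf_path: str) -> bool:
    """Convert PostScript to PDF with Ghostscript; False if `gs` is missing."""
    gs = shutil.which("gs")  # assumption: the executable is called "gs"
    if gs is None:
        return False
    with tempfile.TemporaryDirectory() as tmp:
        ps_path = Path(tmp) / "input.ps"
        ps_path.write_text(ps_source, encoding="ascii")
        subprocess.run(
            [gs, "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
             f"-sOutputFile={pdf_path}", str(ps_path)],
            check=True,
        )
    return True


if __name__ == "__main__":
    postscript_to_pdf(make_postscript("Hello from PostScript"), "hello.pdf")
```

Swapping -sDEVICE=pdfwrite for one of Ghostscript's raster devices such as png16m should give PNG output from the same program.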