HACKER Q&A
📣 keepamovin

Has anyone tried turning Web Content into static documents, like PDFs?


I'm aware of good projects like SingleFile and SingleFileZ for archiving a webpage into a file, but I'm thinking of something more like a dedicated, usable, shareable, hackable document format for sending Web Content around, just not over HTTP. I know about MHTML, but I'm thinking of something that better maintains visual fidelity and usability (up to what has been archived). It seems that using Web Content as an authoring medium and sending these documents around could be a nice alternative to PDFs, in a similar market. Serving such Web Content over HTTP then simply becomes another peer distribution medium for content that exists separately from the medium of exchange. These docs could be sent over email, messaging, all kinds of ways. It seems this could offer a richer, more interactive, and potentially easier-to-hack/author alternative to PDFs and other proprietary formats. I tried looking for this but didn't find anything. Is this idea fundamentally broken in ways that aren't obvious?


👤 kordlessagain Accepted Answer ✓
On https://mitta.us, I snapshot the whole page into a PNG using a headless browser. That PNG can then be sent into a pipeline to have the text extracted by Google's Document AI. That text is sent off to OpenAI to be embedded, then stored in Solr, where it can be searched later. The main issue is that I can't pick up links with it, but most of the time that really doesn't matter as long as I can see the page as it was rendered and have the text searchable.

👤 LarryMade2
There's the Print to PDF printer option. So far Chrome seems to do the best job of it, though if the HTML/CSS is not very responsive the result can be poor.
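
For anyone who wants to script this rather than click through the print dialog, here is a minimal sketch that drives Chrome's headless `--print-to-pdf` flag from Python. The function names and the default binary name `google-chrome` are assumptions for illustration; the executable is named differently on some systems (for example `chromium` or `chrome.exe`).

```python
import shutil
import subprocess


def print_to_pdf_command(url: str, pdf_path: str, browser: str = "google-chrome") -> list[str]:
    """Build the argument list for a headless Chrome print-to-PDF run.

    `browser` is an assumed binary name; adjust it for your system.
    """
    return [browser, "--headless", f"--print-to-pdf={pdf_path}", url]


def print_to_pdf(url: str, pdf_path: str, browser: str = "google-chrome") -> None:
    """Run the command, failing early if the browser binary is not on PATH."""
    if shutil.which(browser) is None:
        raise FileNotFoundError(f"browser binary not found on PATH: {browser}")
    # check=True raises CalledProcessError if Chrome exits with an error.
    subprocess.run(print_to_pdf_command(url, pdf_path, browser), check=True)
```

Calling `print_to_pdf("https://example.com", "page.pdf")` should write `page.pdf` to the current directory, using the page's print stylesheet if it has one.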

👤 solardev
I don't get it. What's more portable and hackable than HTML? Why not just make a website and zip it up?
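
To make the "zip it up" idea concrete, here is a rough sketch using only Python's standard library. The function name `zip_site` and the directory layout are made up for illustration; it assumes the site is already a folder of static files with relative links.

```python
import zipfile
from pathlib import Path


def zip_site(site_dir: str, archive_path: str) -> list[str]:
    """Pack every file under site_dir into one zip and return the stored names."""
    root = Path(site_dir)
    archive = Path(archive_path).resolve()
    names = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*")):
            # Skip directories, and skip the archive itself if it lives inside the site.
            if not path.is_file() or path.resolve() == archive:
                continue
            # Store paths relative to the site root so relative links
            # (e.g. href="css/style.css") still resolve after unzipping.
            arcname = path.relative_to(root).as_posix()
            zf.write(path, arcname)
            names.append(arcname)
    return names
```

For example, `zip_site("my-site", "my-site.zip")` yields a single file that can be emailed and unzipped on the other end, though the recipient still has to extract it before opening `index.html` in a browser.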

👤 coip
I’ve used athenapdf before for related tasks, with good results. It’s on GitHub.