HACKER Q&A
📣 ed_balls

Converting PDF into CSV


I'd like to create a simple tool for a coworker that would read a PDF file and convert it to a CSV. Usually it's an invoice or a file with rates.

I could take a screenshot and paste it into ChatGPT, but it struggles (and cannot do PDFs, just images).

Is there a better way to automate this?


  👤 pwg Accepted Answer ✓
If you go this route you will discover a world full of pain.

PDFs look nice on screen and/or printed, but internally they are not always so nice for data extraction (unless the creator specifically set them up for data extraction).

Internally, a PDF is simply a set of instructions to position font glyphs at 2D coordinates on a virtual sheet of paper. Depending on how the creating system generated the PDF, the text might be relatively easy to extract (the PDF was written left to right, top to bottom, and positions nothing smaller than whole words at a time) or a royal pain (each individual letter is independently positioned at a specific x,y coordinate [this is unlikely, but possible]).
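
To make the coordinate model concrete, here is a minimal sketch in Python (standard library only). It assumes some PDF library has already handed you a list of words with x and y positions. The `Word` tuple, the `rows_from_words` helper, and the `y_tolerance` value are all made-up names for illustration, not part of any real library.

```python
import csv
import io
from typing import NamedTuple


class Word(NamedTuple):
    # Hypothetical record: one positioned chunk of text from a PDF page.
    x: float      # distance from the left edge of the page
    y: float      # distance from the top edge of the page
    text: str


def rows_from_words(words, y_tolerance=2.0):
    """Group words into visual rows, then order each row left to right."""
    rows = []
    for word in sorted(words, key=lambda w: w.y):
        # Same row if this word's y is within tolerance of the row's first word;
        # otherwise the word starts a new row.
        if rows and abs(word.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


def to_csv(rows):
    """Serialize rows to CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Words deliberately listed out of reading order, as a PDF may store them.
sample = [
    Word(300.0, 100.4, "Rate"),
    Word(50.0, 120.0, "Widget"),
    Word(50.0, 100.0, "Item"),
    Word(300.0, 119.7, "4.50"),
]
csv_text = to_csv(rows_from_words(sample))
print(csv_text)
```

This only works in the easy case. A cell such as "Blue Widget" would come out as two columns, and a PDF that positions every letter separately would need the letters merged back into words first.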

If you intend to consume a specific PDF from a specific generator, you'll have better luck (because you can adapt to that specific generator's methods), but if you expect to extract from any PDF from any source, you'll be constantly updating to cover some PDF creator program's quirks that you had not seen before.
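
As a rough illustration of adapting to one specific generator: if you know that invoices from one system always place their columns at the same x positions, you can hard-code those boundaries. The `COLUMN_STARTS` values and the `row_to_cells` helper below are invented for this sketch; you would have to measure the real positions from your own files.

```python
from bisect import bisect_right

# Hypothetical left edges of each column for one known invoice layout.
COLUMN_STARTS = [0.0, 250.0, 400.0]   # description, quantity, price


def row_to_cells(words, column_starts=COLUMN_STARTS):
    """Place (x, text) pairs from one visual row into fixed columns."""
    cells = [[] for _ in column_starts]
    for x, text in sorted(words):
        # bisect_right - 1 is the last column whose left edge is <= x;
        # anything left of the first edge is clamped into column 0.
        index = max(bisect_right(column_starts, x) - 1, 0)
        cells[index].append(text)
    return [" ".join(parts) for parts in cells]


row = [(260.0, "3"), (90.0, "Widget"), (50.0, "Blue"), (410.0, "4.50")]
cells = row_to_cells(row)
print(cells)
```

Multi-word cells stay together here because the split is driven by the layout rather than by the gaps between words. The cost is exactly the one described above: a different generator, or a template change, means new boundaries.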


👤 lmorandi
What does the PDF contain? This online tool might help; it converts PDFs into Excel files: https://www.ilovepdf.com/es/pdf_a_excel

If it works you can extract the CSV from there.

Hope it helps!


👤 warrenm
My experience converting from PDF has been ... less than pleasant (even manually copy-pasting from a PDF into Excel only works some of the time).