HACKER Q&A
📣 ThinkBeat

I scanned an old paper book. How do I turn it into an eBook?


Hi,

I have scanned a couple of books written by an ancestor.

He was not famous, and I doubt many people are interested, but I wanted to give the books a new digital life. I think the enormous back catalog of things that have been published but then disappeared is sad.

Anyway, I have a few hundred PDF files. Each has an image of the page and the OCR text. The text will have to be edited by hand, I think. It's not in English.

I was wondering if anyone knew of scripts or applications that can take that as an input, do some decent formatting and typesetting and spit out a couple of different formats of eBooks at the other end.

There are photos and illustrations on some pages. I am hoping to keep them.

The images are OK, but given the curvature of the page and other factors, I feel they are not ready to be used.

I could try to cut and paste it all into Word.

There is also an extensive cross reference and table of contents that I have no idea how to deal with. I presume I will do it by hand but it would take a long time.

Anyway, if you have any tips for me on how to take the raw files and make them into pretty, easy-to-read eBooks, that would be fabulous.


  👤 cyclotron3k Accepted Answer ✓
Coincidentally I'm currently doing exactly the same thing. This is what I've learnt so far.

OCR was the first problem. My photos of the book were not great; the pages were curved in the photos and it was causing trouble for many OCR packages. Ultimately, I found that the built-in OCR in Google Photos is amazingly good, and I was able to just cut and paste the text out of the photos with barely any corrections.

As for PDF/EPUB/etc., I went for EPUB because it works better on a variety of screen sizes (it can reflow the text, of course), but also because I intended to read it on my Kindle.

Amazon produces free software that allows you to create books for Kindle, but it will only allow you to publish directly to the Kindle store. You can't even produce a preview copy to test on your Kindle.

So I abandoned that and used Calibre instead. It's OSS, and not too difficult to work with, but it works by importing Word docs or HTML files, so I had to convert my text to HTML.

An EPUB file is just a zip file filled with XML, any images and some metadata, so it's easy to edit by hand.
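To see that for yourself, here is a minimal Python sketch that opens an EPUB as a plain ZIP archive using only the standard library. The function names are made up for illustration, and it assumes the XML/XHTML members are UTF-8 encoded, which is typical but not guaranteed.

```python
import zipfile


def list_epub_entries(epub_path):
    """Return the names of all files stored inside an EPUB (a ZIP archive)."""
    with zipfile.ZipFile(epub_path) as archive:
        return archive.namelist()


def read_epub_text_file(epub_path, member):
    """Return one XML/XHTML member of the EPUB as text (assumes UTF-8)."""
    with zipfile.ZipFile(epub_path) as archive:
        return archive.read(member).decode("utf-8")
```

Calling `list_epub_entries` on an exported book should show the chapter files, images and metadata as ordinary archive members, which is what makes hand editing practical.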

As for images: my source images are very poor quality and I've been experimenting with AI restoration and upscaling, with limited success so far.

I'm proofreading the book on my Kindle at the moment, and I must say it's very satisfying.


👤 mooreds
I'd suggest submitting it to Project Gutenberg, if it passes copyright muster. They have almost 70k free ebooks they make available in a variety of formats.

https://www.gutenberg.org/policy/collection_development.html

Many years ago I had free time, access to a scanner and a book that was out of copyright, and I submitted it to PG. It's been a while, but as I recall you can submit the OCR files and volunteers help proofread the books.


👤 fractallyte
Coincidentally, I'm working on something similar now.

It's a 100-page paperback book with illustrations and text on facing pages. The pages are yellowed, so simply scanning to PDF is not an option; they'll need to be cleaned up.

I'm undecided about the PDF version, but creating an EPUB is quite interesting:

- Scanned each page with a regular scanner, and cleaned up the illustrations in Gimp

- OCR'd the text using my iPhone (surprisingly quick and easy)

- Identified the typefaces using MyFonts WhatTheFont tool (https://www.myfonts.com/pages/whatthefont)

- Created one long HTML page containing all the text, properly formatted with headings, italics, etc.

- Imported the HTML and graphics into Sigil, splitting it into separate pages (https://sigil-ebook.com/sigil/)

- Exported the EPUB from Sigil
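The "one long HTML page, then split" step can also be roughed out in a script, which helps if the book has a long table of contents. This is only a sketch and not how Sigil works internally: it assumes every chapter starts with an `<h1>` heading, and the function names and the `chapterNN.xhtml` file naming are invented for the example.

```python
import re
from html import escape, unescape

# Matches an <h1> element and captures its inner markup.
HEADING = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)


def split_chapters(body_html):
    """Split body markup into (title, fragment) pairs, one per <h1> heading.

    Anything before the first <h1> is ignored in this sketch.
    """
    matches = list(HEADING.finditer(body_html))
    chapters = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body_html)
        # Strip inline tags such as <i> from the heading to get a plain title.
        title = unescape(re.sub(r"<[^>]+>", "", match.group(1))).strip()
        chapters.append((title, body_html[match.start():end].strip()))
    return chapters


def toc_lines(chapters):
    """Build table-of-contents list items pointing at numbered chapter files."""
    return [
        f'<li><a href="chapter{n:02d}.xhtml">{escape(title)}</a></li>'
        for n, (title, _) in enumerate(chapters, 1)
    ]
```

A regular expression is good enough here because the input is HTML you wrote yourself; for messier markup a real HTML parser would be the safer choice.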

Sigil's UI is slightly daunting at first, but it provides total control over the markup. There are commercial alternatives which are easier to use - Jutoh is highly regarded (https://www.jutoh.com/)


👤 bentley
In addition to whatever else you do, it would be great if you uploaded each book to the Internet Archive/Open Library, with as much metadata as you can provide (title, author, publication date…). IA will automatically convert to a few different formats and provides good futureproofing that the unique information your scan provides will never go away.

👤 Cryptoclidus
http://ocr.space/

95% of the work will be editing and typesetting.

Have a look at a good style guide: https://standardebooks.org/manual/1.7.0


👤 dtagames
The simplest method is to use Adobe Acrobat to make an OCR'd PDF. There is no work to that process other than combining all the PDFs you want into a single file -- and that's something you can do inside Acrobat, too. It has OCR built in and will save the OCR'd text inside the file.

Once you have a PDF, that is an eBook. There are several online converters that can turn it into other formats. As for indices, etc., Acrobat won't make them automatically but your OCR'd eBook is searchable, so it may not matter in the same way it would with print.

I'd say the world would be better off with your rough-and-ready PDF (and yes, in the Internet Archive would be great!) than waiting for a perfect hand-made version.


👤 GianFabien
I would use LaTeX. Although LaTeX has a steep learning curve, it is specifically designed for high-quality typesetting and is widely used to produce complex works. There are templates and tools to output books in EPUB, MOBI and other formats.

While you are proofreading and correcting, you would also mark up chapters, sections, indexable entities, footnotes, etc. LaTeX will then generate the ToC, index, bibliography, etc. You only need to choose a different template, or just different options, to handle different layouts.

If the images are not too distorted, it might be possible to use Gimp, etc to restore them to near correct shape, proportions, etc.


👤 vivegi
A couple of options:

1. Searchable Text PDF: You can use ImageMagick to convert your source TIFF image files (if you have scanned your pages into TIFF format) into Adobe PDF format. If you have Adobe Acrobat, you can run OCR on it to convert it to a "Searchable" text PDF (aka "text under image" PDF).

     There are variations of this workflow with other tools, but this is the basic idea.

     Tip: Ensure that your TIFF image files are uniformly sized.
2. EPUB Ebook Format: EPUB is an ebook standard, and at its basic level it is a collection of (a) a metadata file (in XML format), (b) HTML files for each unit of content (such as Chapter/Section/Unit etc., depending on how the book is organized), (c) image files that are referenced in the HTML files, and (d) a CSS file for styling your HTML.

     There are a number of options for advanced use, but you can avoid almost all of them if you approach it as a basic collection of XML + HTML + CSS files.

     The EPUB itself just uses a folder structure to organize the files (similar to a static website) and the format is a ZIP archive file named with the file extension '.epub'.

     Most of the work in creating the EPUB would be in structuring the HTML/CSS files and creating the XML file.

     Tip: There are many companies/freelancers in India whom you can find on sites like Fiverr etc. who can do this as a project for you at a reasonable cost, if you would rather outsource this to an expert instead of doing it yourself.
3. Print-on-Demand: If you are interested in creating a print run, that is also possible. A typical workflow would be to (a) scan the pages at high resolution (600 dpi), (b) perform image processing to deskew the images, remove noise and perform quality control, (c) batch-align the front and back of the image pages, (d) save the page images to TIFF, and (e) convert the TIFFs to a multipage Adobe PDF.

     You can use this PDF with Print-on-Demand publishers to create a physical copy of the book.

    Note: This requires cutting/unbinding the original (you can re-bind the original after the process, if you want to keep it).
4. Typesetting: In this method, you have a typesetter re-typeset the entire book (in InDesign/QuarkXPress/LaTeX etc.). This is a much more involved process, but it offers you flexibility in how you lay out the content, make editorial changes, etc.

    Again, there are companies you can engage to do this for you and manage this as an outsourced project.
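To make option 2 concrete, here is a rough Python sketch that assembles a small EPUB 3 from a metadata file, a navigation page, one XHTML file per chapter and a CSS file, using only the standard library. Treat it as an illustration of the folder-plus-ZIP idea rather than a finished tool: `build_epub` and the `chapterNN.xhtml` naming are invented for the example, images are left out (each would need its own manifest entry), and the output has not been checked with a validator such as EPUBCheck.

```python
import uuid
import zipfile
from datetime import datetime, timezone
from html import escape

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""

XHTML_PAGE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
<head><title>{title}</title><link rel="stylesheet" type="text/css" href="style.css"/></head>
<body>
{body}
</body>
</html>
"""

OPF = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="book-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="book-id">urn:uuid:{book_id}</dc:identifier>
    <dc:title>{title}</dc:title>
    <dc:language>{language}</dc:language>
    <meta property="dcterms:modified">{modified}</meta>
  </metadata>
  <manifest>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="css" href="style.css" media-type="text/css"/>
{items}
  </manifest>
  <spine>
{itemrefs}
  </spine>
</package>
"""


def build_epub(path, title, language, chapters, css="body { margin: 5%; }"):
    """Write a small EPUB 3 file; chapters is a list of (title, xhtml_body) pairs."""
    names = [f"chapter{n:02d}.xhtml" for n in range(1, len(chapters) + 1)]
    items = "\n".join(
        f'    <item id="ch{n}" href="{name}" media-type="application/xhtml+xml"/>'
        for n, name in enumerate(names, 1))
    itemrefs = "\n".join(
        f'    <itemref idref="ch{n}"/>' for n in range(1, len(names) + 1))
    opf = OPF.format(
        book_id=uuid.uuid4(), title=escape(title), language=escape(language),
        modified=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        items=items, itemrefs=itemrefs)
    toc = "\n".join(
        f'<li><a href="{name}">{escape(ch_title)}</a></li>'
        for name, (ch_title, _) in zip(names, chapters))
    nav_body = f'<nav epub:type="toc"><h1>Contents</h1><ol>\n{toc}\n</ol></nav>'

    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as archive:
        # The mimetype entry must come first and must be stored uncompressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        archive.writestr("META-INF/container.xml", CONTAINER_XML)
        archive.writestr("OEBPS/content.opf", opf)
        archive.writestr("OEBPS/style.css", css)
        archive.writestr("OEBPS/nav.xhtml",
                         XHTML_PAGE.format(title=escape(title), body=nav_body))
        for name, (ch_title, body) in zip(names, chapters):
            archive.writestr("OEBPS/" + name,
                             XHTML_PAGE.format(title=escape(ch_title), body=body))
```

Each chapter body is expected to be well-formed XHTML already, and the `language` argument should be the book's real language code. Opening the result in Calibre or Sigil is a quick way to find out whether a reader accepts it.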

👤 epirogov
In case you have a thin environment without any smart tools installed, I can suggest merging the scans with:

https://products-qa.aspose.app/pdf/merger/jpg-to-pdf

and then recognizing the text to make a searchable PDF with:

https://products-qa.aspose.app/pdf/make-pdf-searchable


👤 CrypticShift
It is an art form. I believe there are some good threads in the workshop section of MobileRead, like this one: https://www.mobileread.com/forums/showthread.php?t=331376&hi...

👤 xeonmc
Turn it into an open source transcription project by putting the files in a Git repository and posting it here on HN.

👤 chris-buck
Use pdftotext, then convert the text to EPUB.
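A rough Python sketch of that pipeline, assuming Poppler's `pdftotext` command is installed and on the PATH and that the PDFs already carry an OCR text layer. The helper names are invented, and the paragraph detection is deliberately naive: blank lines and page breaks both start a new paragraph, so paragraphs that run across a page will need rejoining by hand.

```python
import re
import subprocess
from html import escape


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text ("-" sends it to stdout)."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True, capture_output=True)
    return result.stdout.decode("utf-8")


def text_to_html_paragraphs(text):
    """Turn blank-line-separated plain text into escaped <p> elements."""
    # pdftotext separates pages with form feeds; treat them as paragraph breaks.
    text = text.replace("\f", "\n\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Rejoin the hard-wrapped lines of one paragraph into a single line.
        joined = " ".join(line.strip() for line in block.splitlines() if line.strip())
        if joined:
            paragraphs.append(f"<p>{escape(joined)}</p>")
    return "\n".join(paragraphs)
```

The resulting HTML can then be imported into Calibre or Sigil, both mentioned elsewhere in this thread, to produce the EPUB.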

👤 solardev
Look into the app Calibre.

But it would still be a lot of manual work.