HACKER Q&A
📣 ducktective

How do you organize or rename PDF files (books, papers, etc)?


Is there a tool for finding metadata of PDF files based on their hash or content? Like MusicBrainz Picard for identifying and tagging mp3 files.


  👤 audiothrowaway Accepted Answer ✓
Zotero can do this. It cannot always find the metadata but there are additional plugins available too. Additionally the browser plug-in allows you to easily find the item on a library’s site or Amazon and add the metadata and associate it with the pdf/other file.

It’s open source to boot.

https://www.zotero.org/


👤 Foltos
I throw all documents into Calibre [1]. It can extract metadata from files and extend it from online sources. It has a lot of related plugins too. As a plus, it can convert documents to make them e-reader friendly. There are also web applications that can read Calibre's metadata files and let you share the library.

1: https://calibre-ebook.com/


👤 superkuh
Recoll http://www.recoll.org/ . It's an excellent metadata indexing and search program. It does not use hashing to identify unique files, but it will help you find things within the files.

👤 thefrozenone
PDFs only get names once I'm finished reading them. I click "Save As" and sit for 30 seconds and try to figure out what search terms I would use in the future to look for this document. Sometimes a title, sometimes a summary, sometimes a topic, author, year, journal, etc. Everything is built into the filename. It doesn't matter if it's long. Computers can handle it. This also has the side effect of ensuring I know what I just read and putting it into a mental filesystem as well as an electronic one.

Once I name it, I move it from the Downloads or temp folder to a Documents folder (though it should really be called 'Library') and I sync it to the cloud with the Google Drive app. In it, there's a Reading-List folder, with a _done folder. I have a few category folders - usually for reading groups. If I read the same document in multiple reading groups, I store it in the folder of the first reading group I read it in. This way feels nostalgic to me. I will also put jpeg screenshots, txt files for notes, etc. This way feels nice, like when I used to go to the public library and see the DVDs and audiobooks next to the books.

The main query interface is MacOS spotlight, though Google Drive also works, and so would something like fzf, or any other finder. I like not having to download or use software. For annotations, I usually save a xyz.pdf and a xyz-Annotated.pdf. For news articles, I just Right Click > Print to PDF and save to the subfolder for a particular topic. You never know when something will get taken down from the Internet. If I search for something and find duplicates, I try to prune then and there.

I don't have to download any apps except for Google Drive, which I already use for e-mail, etc. I can at any point port this entire system to Dropbox, Box, self-hosted FUSE solution, or a flash drive, and keep all its functionality with no software on a computer except for a filesystem and a document viewer.

Names are hyphenated - e.g. Intro-to-Civil-War.pdf. This way feels readable with my eyes, and also most filesystem search utilities seem to tokenize well on hyphens. Tokenizing on spaces would mostly work, except you have to escape spaces on the command line.
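The hyphenated naming described above is easy to script. A minimal Python sketch, assuming you want every run of non-alphanumeric characters collapsed into a single hyphen; the function name `hyphenate` and the ASCII folding step are illustrative choices, not part of the original workflow:

```python
import re
import unicodedata


def hyphenate(title: str, ext: str = ".pdf") -> str:
    """Turn a free-form title into a hyphen-separated filename (hypothetical helper)."""
    # Fold accented characters to their closest ASCII form and drop the rest.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Replace each run of characters that is not a letter or digit with one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", ascii_title).strip("-")
    return slug + ext


print(hyphenate("Intro to Civil War"))  # Intro-to-Civil-War.pdf
```

Because the result contains no spaces, nothing needs escaping on the command line.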


👤 mxuribe
Years ago I standardized on a pretty flat folder hierarchy for such PDFs as well as a file naming convention such as the following examples: topic-author-YYYY-MM-DD.pdf or article-title-website-domain-YYYY-MM-DD.pdf, etc. Two or three times in my life, I have also drafted a small summary of the content of a PDF into a sibling README (text file) for ease of findability later on. This was because those few PDFs were very old docs that had been scanned in, etc. Also, the sibling README file would be named exactly the same as the PDF, though with a file extension of .txt or .md or ...-README.txt, etc.

Nowadays, there are more tools available - as others have cited - which help with finding metadata. But I like to keep things simple, and because there are still way too many PDFs whose metadata starts with "Microsoft Word...", I still stick to my file naming convention to help give hints about the contents.


👤 Helmut10001
I think the original filename is a critical piece of information and I usually leave names as they are. For every day when I do literature research (or happen to come across an interesting paper), I create a new folder with the context of my search, e.g. `2022-08-23_einstein_original_papers`. I save the paper (and others on that day) to the folder as is, then add them to Zotero (or Mendeley, if I were less privacy sensitive).

👤 low_tech_punk
I've been using https://www.sejda.com/ and https://lottatools.com/ for light weight tasks. They both run in browser, probably with a wasm backend.

👤 neilv
I usually rename them manually, with Unix-y filenames that start with the last/company name of the author (e.g., `adobe-postscript-language-reference-manual.pdf`, `blandy-programming-rust-1e-early.epub`).

In a company, I put relevant ones in a Git repo, or sometimes in a wiki. (In any case, the company/engineering wiki will probably reference them, and code in the Git repo might as well.)

For non-public documents, I track the provenance of each. In a company Git repo, usually in `README.md`, or maybe `.README.md`. And you have to be clear whether something was received under NDA or other restrictions, which can restrict sharing, quoting, facts, or even who within the organization is allowed to look at it.

For paid docs I personally own, I'm currently experimenting with inventorying all my paid digital content (books, videos, games) in GnuCash (especially since the payment transactions will be there). (I'm still slightly uncomfortable with, say, a $50 `.pdf` just sitting in my homedir, without indication that it's not public and shareable.)


👤 quartzic
DEVONthink (macOS, https://www.devontechnologies.com/apps/devonthink) does this very well. It can search metadata and contents and offers many organizational options.

👤 Syzygies
I manually manage by year/folder, synchronizing thousands of items via DropBox. Each folder has all the associated links and metadata I can muster.

By "manual" I mean various scripts, on MacOS. "bib" for example is an Alfred keyword that runs a script scraping the front browser page for bibtex data, and initializing a new folder based on this data. Other scripts write search output files that "Find Any File" thinks it saved; FAF offers a nice Finder-like interface for selectively opening search results.

I've avoided any canned solution. I don't want vendor lock, and I'd rather be crippled by my own lack of imagination than someone else's.

By far the most useful metadata are the dates of access. One wanders continuously through an idea space no software can adequately categorize, yet adjacent items in time are likely related.

The holy grail I haven't coded yet is to generate web pages with URI schemes that open each PDF in a viewer. GoodReader is a nice viewer on iOS with such an interface; stunningly its own browser can't understand these links, but DropBox can. To simplify versions for MacOS and iOS, one can write a custom URI handler on MacOS that recognizes the GoodReader link, and redirects it on a Mac.

The goal here is to meaningfully browse closer to the speed of light than clumsy current standards. Just as lock-picking is both skill and cycles, research is both skill and cycles: One wants to have the right three ideas in close enough proximity for our feeble brains to notice the connection. We have to be in our happy place when we're flailing for hours, days, years; it would nevertheless be nice to accelerate this process.

I've been in constant friction with the MathSciNet team: I believe that it should present itself as the premier playground for machine learning mind mapping experiments, in addition to maintaining its stodgy hand curation. There is some controversy as to how math is organized. Many great mathematicians wander freely; others never read a paper outside their field after the age of 23. Departmental hiring meetings degenerate to "it's the number theory group's turn", perpetuating a dated view of how math is organized. Independent views of the MathSciNet database could give us new understandings of our field.


👤 stakkur
I've used pikepdf (a Python lib) with success: https://github.com/pikepdf/pikepdf

👤 taubek
I keep the original names. The problem with metadata is that it can be stripped, or you can have the same content with different metadata in it. If you download a paper from the publisher's site and a copy provided by the author, the two don't need to be the same. When I was working on my PhD thesis I tried using tools like Zotero for keeping bibliographical references, and pulling the metadata when I needed citations. I wasn't too satisfied with the results. Maybe I was using it in the wrong way.

👤 kps
I've been using Tellico for cataloguing and ISBN lookup, in conjunction with a script to hunt for ISBNs in PDFs, but may try Zotero now. I've been normalizing file names to the form [author ‘:’] title [metadata] [‘.’ type] where the metadata is ‘[’ key ‘=’ value [‘;’ …] ‘]’, normally including ‘isbn=’ for books and ‘doi=’ for papers (and other things for audio and images).
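A filename grammar like this can be parsed back into fields with a single regular expression. The following is only a sketch in Python: it assumes plain ASCII `:`, `[`, `=`, `;`, `]` as delimiters, the name `parse_name` and the sample filename are made up, and it ignores how values containing awkward characters (such as the `/` in a DOI) would be escaped, since the comment does not say.

```python
import re

# [author ':'] title [ '[' key '=' value (';' ...)* ']' ] ['.' type]
NAME_RE = re.compile(
    r"^(?:(?P<author>[^:\[\]]+):\s*)?"   # optional "author:"
    r"(?P<title>[^\[\]]+?)\s*"           # title (shortest match that still fits)
    r"(?:\[(?P<meta>[^\]]*)\])?"         # optional "[key=value;...]"
    r"(?:\.(?P<type>[A-Za-z0-9]+))?$"    # optional ".type"
)


def parse_name(filename: str) -> dict:
    """Split a filename of the assumed form into author, title, metadata and type."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unrecognized name: {filename!r}")
    meta = {}
    for pair in (m.group("meta") or "").split(";"):
        key, sep, value = pair.partition("=")
        if sep:  # skip fragments that have no '='
            meta[key.strip()] = value.strip()
    author = m.group("author")
    return {
        "author": author.strip() if author else None,
        "title": m.group("title").strip(),
        "meta": meta,
        "type": m.group("type"),
    }


# Made-up example name, not a real book.
print(parse_name("Doe: Example Title [isbn=0000000000000;lang=en].pdf"))
```

Note the inherent ambiguity: a title that itself contains a colon would be read as "author: title" by this sketch.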

👤 blindstitch
I put them into Zotero and let it auto-tag. I add additional tags as needed. Its full text search works quite well.

👤 challenger-derp
Calibre for books. Zotero for papers. Both are compatible with common cloud services like Dropbox.

👤 strangattractor
I don't - I leave them as a pile on my desk then commit their location to my memory palace.

👤 amanagnihotri
I manually name them as follows as soon as I download them:

[title] - [optional subtitle] - [comma separated contributor list] (publisher name, year of publication).

Contributors can include authors, editors, translators; in that order. If the name becomes too long for the file system, I opt for "et al." after a few names.

Examples:

Structure and Interpretation of Computer Programs - Harold Abelson, Gerald Jay Sussman, Julie Sussman (The MIT Press, 1996).pdf

The Phenomenology of Spirit - Georg Wilhelm Friedrich Hegel, Terry Pinkard (Cambridge University Press, 2018).pdf

Computer Graphics - Principles and Practice - John F. Hughes, Andries van Dam, Morgan McGuire, et al. (Addison-Wesley, 2014).pdf
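The pattern above could be generated by a small helper instead of typed by hand. A rough Python sketch: the function name `book_filename` and the `max_names` cutoff are assumptions for illustration (the comment switches to "et al." based on filename length, not on a fixed count).

```python
def book_filename(title, contributors, publisher, year,
                  subtitle=None, max_names=3, ext=".pdf"):
    """Build '[title] - [subtitle] - [contributors] (publisher, year).pdf'."""
    parts = [title]
    if subtitle:
        parts.append(subtitle)
    names = list(contributors)
    # Assumption: truncate after max_names contributors and append "et al."
    if len(names) > max_names:
        names = names[:max_names] + ["et al."]
    parts.append(", ".join(names))
    stem = " - ".join(parts)
    return f"{stem} ({publisher}, {year}){ext}"


print(book_filename(
    "Structure and Interpretation of Computer Programs",
    ["Harold Abelson", "Gerald Jay Sussman", "Julie Sussman"],
    "The MIT Press",
    1996,
))
```

Contributors are passed in the order authors, editors, translators, so the order in the filename follows the rule stated above.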


👤 ravel-bar-foo
I have a Zotero database to which everything gets added. Files are renamed by Zotfile to Author1_Author2_(etal)-Title-Publication-Year.pdf.

Zotero generally is pretty good about finding metadata, and there are plugins for full text search. I've probably had as many cases of SciHub serving up the wrong document as Zotero failing to find metadata.

I tried recoll, but with the default parameters on my setup the db filled my disk. I think the database was close to 1/15 of indexed disk size. After freeing up disk space I still use it to find things occasionally, but haven't checked whether the db is updating.


👤 rg111
I don't organize at the file level. I just dump all pdf files into a folder. That folder syncs with home-server and cloud-storage.

I organize at the app level. I have different apps on all my devices for different kinds of documents.

One for work/research, one for novels/non-fiction/poetry. I rely on the recently opened section for continued reading. All of the apps have big book-shelf features, so the title doesn't matter. When transferring, I sometimes email the document to myself with the title in the subject.

I use Okular, native doc viewer in Pop OS, Ebookdroid and Ebookdroid Pro.


👤 ivanjermakov
Git-versioned folder with by-topic structure. I haven't found a way to generate metadata and file name based on content though - I still name them manually.

👤 ChrisMarshallNY
It depends.

In some cases, I preface with a YYYY-MM-DD date, for sorting. In cases where I want to link to something (like when I attribute images in my writing), I may have a title like "wikimedia-commons-man-on-a-unicycle.pdf", or somesuch.

Other times, I consider it important to keep the original filename.

I'll often rely on the container context, to provide classification.


👤 goosedragons
Depends on the content. For academic papers and books there are tools out there that can rename files by scanning the document for the DOI number and then looking that up on CrossRef or whatever.
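The DOI-scanning step can be approximated with a regular expression over the extracted text. This is only a sketch in Python: the pattern is a commonly used heuristic for modern DOIs, not a full validator; stripping trailing punctuation is a guess that can clip a DOI that really ends in `)`; and the helper names `find_doi` and `crossref_url` are made up. The second helper only builds a URL for the public Crossref REST API and does not perform the request.

```python
import re
from urllib.parse import quote

# Heuristic pattern for modern DOIs: "10.", a 4-9 digit prefix, "/", then a suffix.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Z0-9]+", re.IGNORECASE)


def find_doi(text: str):
    """Return the first DOI-looking string in text, or None."""
    m = DOI_RE.search(text)
    if m is None:
        return None
    # Drop sentence punctuation that tends to stick to the end of a DOI.
    return m.group(0).rstrip(".,;)")


def crossref_url(doi: str) -> str:
    """Build the Crossref works URL for a DOI (lookup itself is left out)."""
    return "https://api.crossref.org/works/" + quote(doi)


# 10.1000/xyz123 is a placeholder DOI used for illustration.
doi = find_doi("Available at https://doi.org/10.1000/xyz123.")
print(doi, crossref_url(doi))
```

Fetching that URL and turning the response into a new filename is the part the dedicated tools take care of.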

👤 tpoacher
Jabref is very good. In the case of articles, I believe it does what you ask (i.e. getting the bibtex citation)

If you want to go even more low level, the scholarref bash script is amazing.


👤 systemvoltage
Surprised no one has mentioned iBooks or Books.app from Apple. It works great and syncs across devices. You can add personal PDFs if you have an iCloud account.

👤 woleium
https://sioyek.info/#

Edit: misread, thought you were asking about annotations.


👤 MengerSponge
If it's a paper with a DOI, it's easy to re-generate the metadata. Same for an ISBN. My guess is that this type of file won't hash well, because the metadata and the data (cover page) are mutable. Zotero tries to extract metadata from PDF files, but it doesn't always succeed.

I use Zotero with the ZotFile and BetterBibTex plugins for my academic papers, and Calibre for my ebooks.

I'm not an archivist, and I don't read enough to benefit from automating this problem. https://xkcd.com/1319/


👤 Lapsa
I place them in a folder `Books`

👤 noud
[author]-[year]-[title].pdf

Pdfs are then grouped into folders per topic.


👤 progend
For papers and books I use “author-year” format.

👤 ur-whale
pdftotext piped into a custom script (python or shell) that tries to lift relevant metadata from the generated text.
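A rough idea of what such a script might look like in Python. The heuristics here are assumptions for illustration, not the commenter's actual script: the title is guessed as the first non-empty line and the year as the first four-digit number starting with 19 or 20. The helper names `pdf_text` and `lift_metadata` are made up, and `pdf_text` expects the `pdftotext` command from Poppler or Xpdf to be on the PATH.

```python
import re
import subprocess


def pdf_text(path: str, last_page: int = 2) -> str:
    """Run pdftotext on the first pages of a PDF and return the text."""
    result = subprocess.run(
        ["pdftotext", "-l", str(last_page), path, "-"],  # "-" writes to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def lift_metadata(text: str) -> dict:
    """Guess a title and year from extracted text (crude heuristics)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title = lines[0] if lines else None
    year = re.search(r"\b(?:19|20)\d{2}\b", text)
    return {"title": title, "year": year.group(0) if year else None}


# Made-up sample text standing in for pdftotext output.
sample = "\n  A Sample Paper Title \nJane Doe\nPublished 2021\n"
print(lift_metadata(sample))
```

On a real file this would be called as `lift_metadata(pdf_text("paper.pdf"))`. Scanned PDFs without a text layer produce little or no text, so the guesses come back empty.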