I want to do the same thing for my web browser. At first I looked at Memex but they disabled browser history search. You have to save or annotate an article first before it becomes searchable. My brain, naturally, does not know ahead of time what could be useful in the future.
Is there any product out there that creates a fully searchable full-text history forever with little fuss?
I'm using Firefox on Linux but could switch to another browser (but not OS) if needed.
Put it in "save" mode when using Chrome (Linux is fine) and it automatically saves every page you browse (so you can read it offline) and indexes it for full-text search. It's a work in progress and there are bugs, so my advice: initialize a git repo in your archive directory and sync regularly to a remote in case of failure -- that also gives you a nice snapshotted archive.
Anyway, best of luck to you! :)
Diskernet: https://github.com/crisdosyago/Diskernet
I have a script to extract the last visited website from chrome for example: https://github.com/BarbUk/dotfiles/blob/master/bin/chrome_hi...
For Firefox, you can use something like:
sqlite3 ~/.mozilla/firefox/*.[dD]efault*/places.sqlite "SELECT strftime('%d.%m.%Y %H:%M:%S', visit_date/1000000, 'unixepoch', 'localtime'), url FROM moz_places, moz_historyvisits WHERE moz_places.id = moz_historyvisits.place_id ORDER BY visit_date;"
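The same query can be run from Python's standard library. A sketch, assuming a standard Firefox profile layout (note that Firefox keeps places.sqlite locked while it's running, so the sketch works on a copy):

```python
import glob
import os
import shutil
import sqlite3
import tempfile

def firefox_history(places_path):
    """Return (visit time, url) pairs from a Firefox places.sqlite file."""
    # Firefox holds a lock on places.sqlite while running, so query a copy.
    with tempfile.TemporaryDirectory() as tmp:
        copy = os.path.join(tmp, "places.sqlite")
        shutil.copy(places_path, copy)
        con = sqlite3.connect(copy)
        rows = con.execute(
            "SELECT datetime(visit_date / 1000000, 'unixepoch', 'localtime'), url "
            "FROM moz_places JOIN moz_historyvisits "
            "ON moz_places.id = moz_historyvisits.place_id "
            "ORDER BY visit_date"
        ).fetchall()
        con.close()
        return rows

# Locate the default profile; the glob may need adjusting for your setup.
paths = glob.glob(os.path.expanduser("~/.mozilla/firefox/*.default*/places.sqlite"))
if paths:
    for when, url in firefox_history(paths[0]):
        print(when, url)
```
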
Edit: I just tested it, and it's pretty neat! However, they note in their documentation[3] that this is a web cache and not intended to be an archive. It can be turned into a de facto archive, though, simply by configuring a large cache size.
[1] https://www.lesbonscomptes.com/recoll/
[2] https://addons.mozilla.org/en-US/firefox/addon/recoll-we/
[3] https://www.lesbonscomptes.com/recoll/faqsandhowtos/IndexWeb...
No full-text search included, though. Grepping through the extension's data might be reasonably fast even with multiple years of browsing history. I too see this data as valuable, so it's probably better to start capturing now and migrate somewhere else later, rather than wait for your desired browser to implement it.
[1] https://addons.mozilla.org/en-GB/firefox/addon/local-cache/
Then there's also ArchiveBox[2] which can convert your browser history into various formats.
menu Tools → Preferences… → Advanced → History → Remember visited addresses for history and autocompletion → [X] Remember content on visited pages
Search from the address field, the history panel, or about:historysearch.
The gist is:
- It unifies your browsing history for Firefox, Chrome, and most other browsers into one SQLite database.
- It provides quick (autocomplete-style) full-text search over that database via a UI.
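The full-text part of such a unified database is the kind of thing SQLite's FTS5 extension handles well. A minimal sketch (the `history` table and all function names here are made up for illustration, not taken from the actual app):

```python
import sqlite3

def build_index(db_path=":memory:"):
    """Create a toy unified-history index with full-text search (FTS5)."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS history "
        "USING fts5(url, title, body, browser)"
    )
    return con

def add_visit(con, url, title, body, browser):
    con.execute("INSERT INTO history VALUES (?, ?, ?, ?)", (url, title, body, browser))

def search(con, query):
    # MATCH is FTS5's full-text operator; bm25() ranks better matches first.
    return con.execute(
        "SELECT url, browser FROM history WHERE history MATCH ? "
        "ORDER BY bm25(history)",
        (query,),
    ).fetchall()

con = build_index()
add_visit(con, "https://example.com/sqlite", "SQLite intro",
          "full text search with fts5", "firefox")
add_visit(con, "https://example.com/cats", "Cats", "pictures of cats", "chrome")
print(search(con, "fts5"))
```

This relies on the FTS5 extension being compiled into your SQLite build, which is the case for stock Python on most platforms.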
I have built an app that lets you easily access your shell history https://loshadki.app/shellhistory/ and sync it via iCloud. macOS only.
It's a different model in that you have to choose what to archive (by hitting a button in a browser extension), but in practice I prefer this to the “trawl everything” model. YMMV, of course.
But the killer feature for me is that it’s a unified “search-first” interface for _all_ your documents - not just your web browsing.
There are TONS of projects like Elasticsearch or just raw Lucene that will allow you to parse text and index it.
HTML? Not so much...
There are just too many problems. Text ads polluting the extracted text are by far the biggest, but there are others: OCR of images, AJAX-paginated pages, lazy-loaded images that might need OCR, and metadata extraction (when was the page published, who was the author, and so on).
There are some projects that take this on but Google just does an amazing job and these secondary tools are pretty limited by comparison.
95% accuracy doesn't help because that 5% usually ends up being 100% of your false positives.
That was back in the day though, before HTTPS was common (outside ecommerce sites, which I didn't care about indexing), and before so many sites were SPAs that got their content via JS APIs. As more sites went to HTTPS, I realized that I'd have to re-write my proxy to MITM certificates if I wanted it to keep working and that wasn't really something I wanted to mess with. The project was useful, but not useful enough that I was willing to dive into the world of writing a browser plugin that could scrape directly from the DOM, so I eventually abandoned it.
I use several machines, plus my phone for browsing. As such my (local) browser history is useless, so I tend to turn it off. Also, I am not in control of it from a privacy point of view (who knows what extensions/browser functions are doing with it?)
With my own endpoint, I can then do what I want with the URLs... put them in a database as a cross machine history index, or schedule a job to index the page contents into a personal search engine, etc.
I've never written a browser extension, but I'm guessing that...
IF (URL.current <> URL.previous) { sendRequest(host=endpointURL, payload=URL.current) }
...can't be that hard?
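The receiving side can be nearly as small. A sketch of a self-hosted endpoint, assuming the extension POSTs JSON shaped like {"url": ...} (the payload shape, port, and class name are all invented here):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

visited = []  # in practice: a database table, keyed by machine and timestamp

class HistoryEndpoint(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # Mirror the pseudocode above: only record when the URL changed.
        if not visited or visited[-1] != payload["url"]:
            visited.append(payload["url"])
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):
        # Keep the console quiet; drop this to see request logs.
        pass

def run(host="127.0.0.1", port=8080):
    HTTPServer((host, port), HistoryEndpoint).serve_forever()
```

From there a cron job could walk `visited` (or the real table) and feed page contents into an indexer, as described above.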
I am curious how having your history is your greatest force multiplier.
I adopted a save/print-to-PDF plus full-text PDF search workflow, in case I need something in the future.
EDIT: Perhaps ranking by how much time you've spent reading the content could help with that.
On the other hand, your shell history is already filtered by your brain: it contains mostly potentially useful things.