HACKER Q&A
📣 emptysongglass

Full-text browser history search forever?


My biggest force-multiplier is my fish shell history, going on 7 years of command line history.

I want to do the same thing for my web browser. At first I looked at Memex but they disabled browser history search. You have to save or annotate an article first before it becomes searchable. My brain, naturally, does not know ahead of time what could be useful in the future.

Is there any product out there that creates a fully searchable full-text history forever with little fuss?

I'm using Firefox on Linux but could switch to another browser (but not OS) if needed.


  👤 graderjs Accepted Answer ✓
Hey. My project Diskernet does this: full text search over browser history.

Put it in "save" mode when using Chrome (Linux is fine) and it automatically saves every page you browse (so you can read it offline), and also indexes it for full text search. It's a work in progress and there are bugs (so my advice: initialize a git repo in your archive directory and sync regularly to a remote in case of failure; that also gives you a nice snapshotted archive).

Anyway, best of luck to you! :)

Diskernet: https://github.com/crisdosyago/Diskernet


👤 barbuk
Chromium and Firefox both store your history in an SQLite database.

I have a script to extract the last visited website from Chrome, for example: https://github.com/BarbUk/dotfiles/blob/master/bin/chrome_hi...

For Firefox, you can use something like:

  sqlite3 ~/.mozilla/firefox/*.[dD]efault/places.sqlite "SELECT strftime('%d.%m.%Y %H:%M:%S', visit_date/1000000, 'unixepoch', 'localtime'),url FROM moz_places, moz_historyvisits WHERE moz_places.id = moz_historyvisits.place_id ORDER BY visit_date;"
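
If you would rather do this from a script, here is a minimal Python sketch of the same join. It assumes only the `moz_places` / `moz_historyvisits` columns used in the query above, and it builds a tiny in-memory stand-in database so it runs without a real profile. `visit_rows` and `demo_db` are made-up helper names, and the `'localtime'` modifier is dropped so the output does not depend on your timezone.

```python
import sqlite3

# Same join as the one-liner, written with an explicit JOIN.
# visit_date is in microseconds since the Unix epoch, hence / 1000000.
QUERY = """
SELECT strftime('%d.%m.%Y %H:%M:%S', v.visit_date / 1000000, 'unixepoch'),
       p.url
FROM moz_historyvisits AS v
JOIN moz_places AS p ON p.id = v.place_id
ORDER BY v.visit_date
"""

def visit_rows(conn):
    """Return (formatted UTC timestamp, url) pairs, oldest visit first."""
    return conn.execute(QUERY).fetchall()

def demo_db():
    """Build an in-memory stand-in with only the columns the query needs."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE moz_places (id INTEGER PRIMARY KEY, url TEXT);
        CREATE TABLE moz_historyvisits (
            id INTEGER PRIMARY KEY, place_id INTEGER, visit_date INTEGER);
    """)
    conn.execute("INSERT INTO moz_places VALUES (1, 'https://example.org/')")
    conn.execute("INSERT INTO moz_places VALUES (2, 'https://example.com/docs')")
    conn.executemany(
        "INSERT INTO moz_historyvisits (place_id, visit_date) VALUES (?, ?)",
        [(2, 86_400_000_000), (1, 0)],
    )
    return conn

rows = visit_rows(demo_db())
for stamp, url in rows:
    print(stamp, url)
```

To point it at a real profile, swap `demo_db()` for `sqlite3.connect(...)` on a copy of `places.sqlite`. Firefox may hold a lock on the live file while it is running, so working on a copy is probably the safer bet.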


👤 ochicial
I use Recoll[1] to search through my local files. Haven't tried it myself, but they also have an extension for Firefox[2] that does what you are looking for.

Edit: I just tested it, and it's pretty neat! However, they note in their documentation[3] that this is a web cache and not intended to be an archive. It can be turned into one, though, simply by configuring a large cache size.

[1] https://www.lesbonscomptes.com/recoll/

[2] https://addons.mozilla.org/en-US/firefox/addon/recoll-we/

[3] https://www.lesbonscomptes.com/recoll/faqsandhowtos/IndexWeb...


👤 phil294
Not quite what you asked, but perhaps interesting either way: I recently made an extension that keeps text-only copies of all visited sites for offline usage [1].

No full-text search is included. Grepping through the extension data might be reasonably fast, however, even with multiple years of browsing history. I too see this data as valuable, so it's probably better to start capturing now and migrate somewhere else later, rather than wait for your desired browser to implement it.
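
To give an idea of what "grepping through the data" could look like, here is a rough Python sketch. It assumes the saved pages have been exported as one plain `.txt` file per page in a directory, which is an assumption for illustration and not necessarily how the extension stores things. `grep_dir` is a made-up name.

```python
import tempfile
from pathlib import Path

def grep_dir(root, needle):
    """Yield (file name, line number, line) for case-insensitive matches."""
    needle = needle.lower()
    for path in sorted(Path(root).rglob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if needle in line.lower():
                    yield path.name, lineno, line.rstrip("\n")

# Demo on throwaway files; the one-file-per-page layout is an assumption.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "page1.txt").write_text(
        "Fish shell tips\nhistory search\n", encoding="utf-8")
    Path(tmp, "page2.txt").write_text(
        "Nothing relevant here\n", encoding="utf-8")
    hits = list(grep_dir(tmp, "HISTORY"))

print(hits)
```

This is a linear scan, so it gets slower as the archive grows, but plain text is small and it needs no index to maintain.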

[1] https://addons.mozilla.org/en-GB/firefox/addon/local-cache/


👤 Farow
Not exactly what you're asking for, but you can set up SingleFile[1] to automatically save each page you visit.

Then there's also ArchiveBox[2] which can convert your browser history into various formats.

[1] https://github.com/gildas-lormeau/SingleFile

[2] https://github.com/ArchiveBox/ArchiveBox


👤 roneoo
Promnesia is meant for that (I didn't test it though): https://beepb00p.xyz/promnesia.html

👤 bmn__
Opera versions 9.5 to 12 do this out of the box. The index is stored in `$OPERA_PREFDIR/vps/0000/`.

menu Tools → Preferences… → Advanced → History → Remember visited addresses for history and autocompletion → [X] Remember content on visited pages

search from address field, history panel or about:historysearch



👤 iansinnott
A bit late to the thread, but I also created something to solve this [1]. Currently only works on Mac though, so it does not solve the OP's problem. Hopefully others may find it useful.

The gist is:

- It unifies your browsing history for Firefox, Chrome, most browsers into one sqlite database.
- Provides quick (autocomplete style) full text search over that database via a UI.

[1]: https://www.browserparrot.com/


👤 _dain_
You can use ArchiveBox. Set it to grab URLs from your browser history database and it will archive them all to disk in whatever formats you want. You can then use whatever tools you want on those local files.

https://archivebox.io/


👤 outcoldman
Self promotion about the shell history:

I have built an app that lets you easily access your shell history https://loshadki.app/shellhistory/ and sync it via iCloud. MacOS only.


👤 darkteflon
I recently started using DEVONthink (MacOS / iOS) after someone mentioned it on here and have found it to be great. Just paid for the Pro license and consider it quite the bargain.

Different model in that you have to choose what to archive (by hitting a button on a browser extension) but in practice I prefer this to the “trawl everything” model. Ymmv, of course.

But the killer feature for me is that it’s a unified “search-first” interface for _all_ your documents - not just your web browsing.


👤 burtonator
The main problem with FTS isn't the search indexing component; it's actually the HTML content parser.

There are TONS of projects like Elasticsearch or just raw Lucene that will allow you to parse text and index it.

HTML? Not so much...

There are just too many problems.

Text ads polluting the extracted text is by far the main issue but there are other issues as well including OCR of images, AJAX paginated pages, lazy loaded images that might need OCR, metadata extraction (when was the page published, who was the author, etc).

There are some projects that take this on but Google just does an amazing job and these secondary tools are pretty limited by comparison.

95% accuracy doesn't help because that 5% usually ends up being 100% of your false positives.
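
To make the parser problem concrete, here is a small sketch using only Python's standard `html.parser`. It strips `<script>` and `<style>` content, which is the easy part, and the sample page (invented for illustration) shows the ad text sailing straight through into the "extracted" text. `TextExtractor` and `visible_text` are made-up names, not from any of the projects mentioned.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample = ("<html><head><style>p{color:red}</style></head>"
          "<body><p>Real article text.</p>"
          "<script>var x = 1;</script>"
          "<div class='ad'>Buy now!</div></body></html>")
text = visible_text(sample)
print(text)
```

Nothing in the markup reliably says "this div is an ad", so a naive extractor indexes "Buy now!" right alongside the article. Telling the two apart is where the real work is.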


👤 whiterock
https://minbrowser.org/ has full-text search on every website you ever visited if I understand correctly.

👤 thraxil
Probably not helpful, but I built a setup like that for myself years ago. I had a local proxy, originally written with Python Twisted, but later ported to Go, and set my browser to use that, so all requests went through it. Every URL that the proxy saw, it logged, and for text/plain or text/html, it also took a copy of the response body and posted that to a web service that I was running. The web service was a Django app with a simple model that would track the URL, timestamp, and some other basic metadata. It would also save a gzipped copy of the response body to S3, and dump it into a Solr instance. That gave me full-text search over the content of every site I visited and a backup copy of the text in case the original ever went offline. It was incredibly useful.
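
For a feel of the storage side, here is a toy Python sketch of that pipeline: filter by content type, record URL and timestamp, keep a gzipped copy of the page body, and index the words. It is not the original code. An in-memory SQLite table stands in for the Django model and S3, a plain dict stands in for Solr, and `PageStore` and its methods are made-up names.

```python
import gzip
import re
import sqlite3
import time
from collections import defaultdict

class PageStore:
    """Toy stand-in for the Django + S3 + Solr pieces described above."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE pages (url TEXT, fetched_at REAL, body_gz BLOB)")
        self.index = defaultdict(set)  # token -> set of page rowids

    def save(self, url, content_type, body):
        # Only text/plain and text/html are kept, as in the setup above.
        mime = content_type.split(";")[0].strip()
        if mime not in ("text/plain", "text/html"):
            return None
        cur = self.db.execute(
            "INSERT INTO pages VALUES (?, ?, ?)",
            (url, time.time(), gzip.compress(body.encode("utf-8"))))
        for token in set(re.findall(r"[a-z0-9]+", body.lower())):
            self.index[token].add(cur.lastrowid)
        return cur.lastrowid

    def search(self, word):
        """Return URLs of stored pages containing the exact word."""
        ids = sorted(self.index.get(word.lower(), ()))
        return [self.db.execute(
            "SELECT url FROM pages WHERE rowid = ?", (i,)).fetchone()[0]
            for i in ids]

    def body(self, rowid):
        """Return the stored copy of a page, decompressed."""
        blob = self.db.execute(
            "SELECT body_gz FROM pages WHERE rowid = ?", (rowid,)).fetchone()[0]
        return gzip.decompress(blob).decode("utf-8")

store = PageStore()
first = store.save("http://example.org/a", "text/html; charset=utf-8",
                   "<p>Twisted proxy notes</p>")
store.save("http://example.org/logo.png", "image/png", "binary-ish")
store.save("http://example.org/b", "text/plain", "notes about Solr")
print(store.search("notes"))
```

The proxy that feeds `save()` is deliberately left out, since that is the part that stopped being practical.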

That was back in the day though, before HTTPS was common (outside ecommerce sites, which I didn't care about indexing), and before so many sites were SPAs that got their content via JS APIs. As more sites went to HTTPS, I realized that I'd have to re-write my proxy to MITM certificates if I wanted it to keep working and that wasn't really something I wanted to mess with. The project was useful, but not useful enough that I was willing to dive into the world of writing a browser plugin that could scrape directly from the DOM, so I eventually abandoned it.


👤 erusev
I make https://ibar.app/ which does something like this.

👤 Jaruzel
I think I'd like to approach this from a different angle: a browser extension that just sends the current URL to an HTTP(S) endpoint of my choosing.

I use several machines, plus my phone for browsing. As such my (local) browser history is useless, so I tend to turn it off. Also, I am not in control of it from a privacy point of view (who knows what extensions/browser functions are doing with it?)

With my own endpoint, I can then do what I want with the URLs... put them in a database as a cross machine history index, or schedule a job to index the page contents into a personal search engine, etc.

I've never written a browser extension, but I'm guessing that...

  IF (URL.current <> URL.previous) { sendRequest(host=endpointURL, payload=URL.current) }
...can't be that hard?
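
The receiving side could be about as small. Here is a hedged Python sketch of what the endpoint might do with each URL it is sent: store it in SQLite as a cross-machine history, skipping a URL that repeats the same machine's previous one (the same check as in the pseudocode, done server-side). The `visits` table, the `machine` label and the function names are all invented for illustration. The HTTP handler and the extension itself are not shown.

```python
import sqlite3
import time

def open_history(path=":memory:"):
    """Open (or create) the history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visits ("
        "id INTEGER PRIMARY KEY, machine TEXT, url TEXT, seen_at REAL)")
    return conn

def record_visit(conn, machine, url, seen_at=None):
    """Store a visit unless it repeats that machine's previous URL."""
    last = conn.execute(
        "SELECT url FROM visits WHERE machine = ? ORDER BY id DESC LIMIT 1",
        (machine,)).fetchone()
    if last is not None and last[0] == url:
        return False  # URL.current == URL.previous: nothing new to record
    conn.execute(
        "INSERT INTO visits (machine, url, seen_at) VALUES (?, ?, ?)",
        (machine, url, time.time() if seen_at is None else seen_at))
    conn.commit()
    return True

conn = open_history()
results = [
    record_visit(conn, "laptop", "https://example.org/"),
    record_visit(conn, "laptop", "https://example.org/"),  # repeat, skipped
    record_visit(conn, "phone", "https://example.org/"),   # other machine, kept
    record_visit(conn, "laptop", "https://example.com/x"),
]
total = conn.execute("SELECT COUNT(*) FROM visits").fetchone()[0]
print(results, total)
```

A scheduled job could then walk the `visits` table and fetch or index page contents, which keeps the extension itself trivial.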

👤 kasperset
Not for Linux :( Works in MacOS but paid software. https://www.stclairsoft.com/HistoryHound/

👤 subutux
If it’s just the metadata that you want, you can use ActivityWatch [1]. They have a browser plug-in.

[1] https://activitywatch.net/


👤 Hnrobert42
I go the opposite direction. I delete browser history on close. My Bash history isn’t reliable.

I am curious how having your history is your greatest force multiplier.


👤 rounakdatta
Nyxt browser is doing this pretty well! https://nyxt.atlas.engineer

👤 pseingatl
There was a Windows program out maybe 20 years ago called Elephant Tracks which did this. The developer shut the program down because of piracy.

👤 suifbwish
An interesting spin on this would be to save all text from every web page you have ever viewed in a browser. I am willing to bet most of us have not viewed more than 10 GB of raw text in the past 10 years.

👤 nojito
Autosaving internet browsing history is largely a waste of time.

I adopted a save/print to pdf + full text pdf search in case I need something in the future.


👤 ComodoHacker
I don't think this could be useful to you. SNR would be too low. The fact you were convinced to click on a link someday doesn't mean the link contains anything useful.

EDIT: Perhaps ranking by how much time you've spent reading the content could help with that.

On the other hand, your shell history is already filtered by your brain, it contains mostly potentially useful things.


👤 hodanli
I think https://memex.garden/ does what you are asking for, but they recently decided to make their code source-available.

👤 encima
histre.com does this

👤 retube
Doesn't Chrome + Google do this by default?

👤 warrenm
7 years of command line history?