HACKER Q&A
📣 vednig

Tips on building a search engine?


I've recently been thinking about building my own search engine from scratch that can be deployed on cheap hardware like an RPi or an old Android phone.

If I were to start today, what are some good resources for building a search engine like Google (including in terms of quality, reliability and following web standards)?

If you've ever worked on anything like this, I'd like to know why, and what challenges I may face when starting out or deploying in production.


  👤 MrVandemar Accepted Answer ✓
Viktor Lofgren's blog is probably worth a read. He is developing the rather delightful Marginalia search engine.

https://www.marginalia.nu/log/


👤 Someone
> that can be deployed on cheap hardware like an RPi or an old Android phone.

Power usage will dominate hardware costs, making that hardware expensive, not cheap, probably more expensive than hardware designed to run 24/7 in a data center.

> build a search engine like Google (including in terms of quality, reliability and following web standards).

You won’t get reliability from “an old Android phone”.

> what challenges I may face when starting out or deploying in production.

Buy a napkin first, and use it to make some calculations. Starting numbers: according to https://blog.hubspot.com/marketing/google-search-statistics, Google handles about 250k queries per second. According to https://zyppy.com/seo/google-index-size/, they have an index of about 400 billion documents.
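Here is a rough sketch of that napkin math in Python. The two starting numbers come from the links above; the per-document index size and the per-device storage are placeholders I made up to show the shape of the calculation, not measurements of anything.

```python
# Napkin math for a Google-scale search engine.
QUERIES_PER_SECOND = 250_000           # from the HubSpot link above
INDEXED_DOCUMENTS = 400_000_000_000    # from the Zyppy link above

# ASSUMPTIONS (hypothetical placeholders, not measured values):
ASSUMED_INDEX_BYTES_PER_DOC = 1_000          # average index footprint per document
ASSUMED_DEVICE_STORAGE_BYTES = 64 * 10**9    # a 64 GB SD card or phone

SECONDS_PER_DAY = 24 * 60 * 60


def queries_per_day(qps: int) -> int:
    """Total queries in one day at a constant rate."""
    return qps * SECONDS_PER_DAY


def index_size_bytes(documents: int, bytes_per_doc: int) -> int:
    """Total index size for a given average per-document footprint."""
    return documents * bytes_per_doc


def devices_needed(total_bytes: int, bytes_per_device: int) -> int:
    """Devices required just to hold the index (ceiling division)."""
    return -(-total_bytes // bytes_per_device)


if __name__ == "__main__":
    total = index_size_bytes(INDEXED_DOCUMENTS, ASSUMED_INDEX_BYTES_PER_DOC)
    print("queries per day:", queries_per_day(QUERIES_PER_SECOND))
    print("index size in TB:", total / 10**12)
    print("devices to store it:", devices_needed(total, ASSUMED_DEVICE_STORAGE_BYTES))
```

Swap in your own assumptions. The point is that even generous guesses put a Google-sized index far beyond a single small device, so the realistic question is how small a slice of the web you want to cover.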


👤 krapp
You're a few years behind the times if you consider Google to be a standard of quality and reliability. Search is all but dead now, killed by an ocean of AI slop and ads because Google wanted search queries to be a moneymaker.

I think the only viable solutions going forward in a post-AI world will be decentralized, small scale and non-general, curated by human beings based on a reputation system rather than algorithms, possibly even torrent-based and not touching the web at all.


👤 readyplayernull
The easiest way would be running Postgres:

https://www.crunchydata.com/blog/postgres-full-text-search-a...
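To give a feel for what a full-text index does, here is a toy inverted index in plain Python. This is not how Postgres implements it (Postgres adds stemming, ranking and on-disk index structures); the class and function names are made up for illustration.

```python
from __future__ import annotations

import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Maps each term to the set of document ids that contain it."""

    def __init__(self) -> None:
        self.postings: dict[str, set[int]] = defaultdict(set)

    def add(self, doc_id: int, text: str) -> None:
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query: str) -> list[int]:
        """Return sorted ids of documents containing every query term."""
        terms = tokenize(query)
        if not terms:
            return []
        # .get() avoids creating empty entries in the defaultdict.
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)


if __name__ == "__main__":
    index = InvertedIndex()
    index.add(1, "Postgres full text search")
    index.add(2, "Building a search engine on a Raspberry Pi")
    index.add(3, "Full-text search engine tips")
    print(index.search("full text"))
```

A real engine also needs ranking, which is where most of the quality work goes.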

The problem is getting all the data. If you try to scrape another search engine, it will punish you.

You could think about a distributed, collaborative search engine that scrapes the web from each participating device's IP address.


👤 sweca
You're going to face a scalability challenge. With as many results as Google has, you can't fit every embedding vector in memory. You'll need a hybrid database that combines in-memory caching with disk-persistent indices.
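A minimal sketch of that hybrid idea, assuming SQLite as the disk store and a small LRU cache in front of it. The class name, table layout and cache size are all hypothetical; a real system would use a proper vector index rather than key lookups.

```python
import sqlite3
import struct
from collections import OrderedDict


class HybridVectorStore:
    """Vectors live in SQLite; recently used ones are kept in an LRU cache."""

    def __init__(self, path: str, cache_size: int = 1024) -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS vectors "
            "(doc_id INTEGER PRIMARY KEY, data BLOB NOT NULL)"
        )
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.disk_reads = 0  # counts cache misses that were served from SQLite

    def put(self, doc_id, vector):
        # Store as packed little-endian float32 values.
        blob = struct.pack(f"<{len(vector)}f", *vector)
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO vectors (doc_id, data) VALUES (?, ?)",
                (doc_id, blob),
            )
        self.cache.pop(doc_id, None)  # drop any stale cached copy

    def get(self, doc_id):
        if doc_id in self.cache:
            self.cache.move_to_end(doc_id)  # mark as most recently used
            return self.cache[doc_id]
        row = self.db.execute(
            "SELECT data FROM vectors WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        if row is None:
            return None
        self.disk_reads += 1
        blob = row[0]
        vector = list(struct.unpack(f"<{len(blob) // 4}f", blob))
        self.cache[doc_id] = vector
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used
        return vector
```

On an RPi the cache size is the knob to tune: it trades RAM for fewer reads from slow storage.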