Tips on building a search engine?

Question

I've recently been thinking about building my own search engine from scratch that can be deployed on cheap hardware like an RPi or an old Android phone. If I were to start today, what are some good resources to build a search engine like Google (including in terms of quality, reliability and following web standards)? If you've ever worked on anything like this, I'd like to know what challenges I may face when starting or deploying in production.

MrVandemar · Accepted Answer

Viktor Lofgren's blog is probably worth a read; he is developing the rather delightful Marginalia search engine: https://www.marginalia.nu/log/

Someone · Answer

> that can be deployed on cheap hardware like an RPi or an old Android phone.
Power usage will dominate hardware costs, making that hardware expensive, not cheap, probably more expensive than hardware designed to run 24/7 in a data center.
> build a search engine like Google (including in terms of quality, reliability and following web standards).
You won’t get reliability from “an old Android phone”.
> what challenges I may face when starting or deploying in production.
Buy a napkin first, and use it to make some calculations. Starting numbers: according to https://blog.hubspot.com/marketing/google-search-statistics, Google handles about 250k queries per second, and according to https://zyppy.com/seo/google-index-size/, they have an index with 400 billion documents.
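As a rough sketch of that napkin math: the two headline figures below are the ones from the links above, while the per-document index size, the per-device disk size and the per-device query rate are pure assumptions chosen for illustration. Swap in your own numbers.

```python
# Back-of-the-envelope sizing for a Google-scale search engine.
# The first two constants come from the links cited above; every value
# named ASSUMED_* is a hypothetical placeholder, not a measurement.
QUERIES_PER_SECOND = 250_000            # cited figure
INDEXED_DOCUMENTS = 400_000_000_000     # cited figure

ASSUMED_BYTES_PER_DOC = 10_000                # assumed: ~10 kB of index data per document
ASSUMED_DEVICE_STORAGE = 1_000_000_000_000    # assumed: 1 TB of disk per device
ASSUMED_DEVICE_QPS = 10                       # assumed: queries/s one small device can serve


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def napkin(qps, docs, bytes_per_doc, device_storage, device_qps):
    """Return rough daily load, index size and device counts."""
    index_bytes = docs * bytes_per_doc
    return {
        "queries_per_day": qps * 86_400,
        "index_petabytes": index_bytes / 1e15,
        "devices_for_storage": ceil_div(index_bytes, device_storage),
        "devices_for_queries": ceil_div(qps, device_qps),
    }


estimate = napkin(
    QUERIES_PER_SECOND,
    INDEXED_DOCUMENTS,
    ASSUMED_BYTES_PER_DOC,
    ASSUMED_DEVICE_STORAGE,
    ASSUMED_DEVICE_QPS,
)

for name, value in estimate.items():
    print(f"{name}: {value:,}")
```

With these assumed inputs the index alone comes to about 4 PB, or 4,000 devices just for storage, and serving the query load would take 25,000 devices, before any replication or crawling.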

krapp · Answer

You're a few years behind the times if you consider Google to be a standard of quality and reliability. Search is all but dead now, killed by an ocean of AI slop and ads because Google wanted search queries to be a moneymaker. I think the only viable solutions going forward in a post-AI world will be decentralized, small-scale and non-general, curated by human beings based on a reputation system rather than algorithms, possibly even torrent-based and not touching the web at all.

readyplayernull · Answer

The easiest way would be running Postgres:
https://www.crunchydata.com/blog/postgres-full-text-search-a...
The problem is getting all the data. If you try to scrape another search engine, it will punish you.
You could think about a distributed, collaborative search engine that scrapes the web from the IP addresses of each participating device.

sweca · Answer

You're going to face a scalability challenge. With as many results as Google has, you can't fit every embedding vector in memory. You'll need a hybrid database design that combines in-memory caching with disk-persistent indices.
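A minimal sketch of that hybrid idea, assuming SQLite as the on-disk store and a small LRU cache in front of it. The `VectorStore` class, its table layout and the cache size are all hypothetical choices for illustration; a real system would more likely use a dedicated vector index.

```python
import sqlite3
from array import array
from collections import OrderedDict


class VectorStore:
    """Embedding vectors persisted on disk, with an in-memory LRU cache.

    Hypothetical illustration only: names and schema are assumptions.
    """

    def __init__(self, path=":memory:", cache_size=2):
        # Pass a file path instead of ":memory:" to persist across runs.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS vectors "
            "(doc_id INTEGER PRIMARY KEY, data BLOB)"
        )
        self.cache = OrderedDict()   # doc_id -> vector, oldest entry first
        self.cache_size = cache_size
        self.disk_reads = 0          # counts cache misses that hit the database

    def put(self, doc_id, vector):
        """Write a vector to disk and drop any stale cached copy."""
        blob = array("d", vector).tobytes()
        with self.db:  # commits on success
            self.db.execute(
                "INSERT OR REPLACE INTO vectors VALUES (?, ?)", (doc_id, blob)
            )
        self.cache.pop(doc_id, None)

    def get(self, doc_id):
        """Return the vector for doc_id, or None if it is not stored."""
        if doc_id in self.cache:
            self.cache.move_to_end(doc_id)  # mark as most recently used
            return self.cache[doc_id]
        self.disk_reads += 1
        row = self.db.execute(
            "SELECT data FROM vectors WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        if row is None:
            return None
        vec = array("d")
        vec.frombytes(row[0])
        result = vec.tolist()
        self.cache[doc_id] = result
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least recently used
        return result


store = VectorStore(cache_size=2)
store.put(1, [0.1, 0.2])
store.put(2, [0.3, 0.4])
store.put(3, [0.5, 0.6])
store.get(1)             # miss: read from disk
store.get(1)             # hit: served from memory
print(store.disk_reads)  # 1
```

Only the hot vectors stay in RAM; everything else is read from disk on demand, which trades latency on cache misses for a memory footprint bounded by `cache_size`.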