HACKER Q&A
📣 oikawa_tooru

I want to dive into how to make search engines


But I don't know where to start or what to study. If I were going for a master's in this subject, what would my courses be?


  👤 jasode Accepted Answer ✓
>search engines

You can decompose a "search engine" into multiple big components and figure out what you want to look at first:

(1) web crawler/spiders

(2) database cache of web content -- aka building the "search index"

(3) algorithm of scoring/weighing/ranking of pages -- e.g. "PageRank"

(4) query engine -- translating user inputs into returning the most "relevant" pages

Each technical topic is a sub-specialty and can be staffed by dedicated engineers. There are also more topics such as lexical analysis, distributed computing (for all 4 areas), etc.

If you're mainly interested in experimenting with your own ranking algorithm, you can skip part (1) by leveraging the dataset from Common Crawl: https://index.commoncrawl.org/
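For instance, the URL index can be queried over plain HTTP. A minimal sketch (the collection name below is just an example; the current ones are listed on the index page):

    import json
    import requests

    # Ask the Common Crawl URL index which captures exist for a domain.
    resp = requests.get(
        "https://index.commoncrawl.org/CC-MAIN-2022-33-index",
        params={"url": "news.ycombinator.com/*", "output": "json"},
    )
    for line in resp.text.splitlines()[:5]:
        record = json.loads(line)                 # one JSON object per capture
        print(record["url"], record["filename"])  # WARC file holding the page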

Here are some videos about PageRank:

https://www.youtube.com/watch?v=JGQe4kiPnrU , https://www.youtube.com/watch?v=qxEkY8OScYY

... but keep in mind that the scope of those videos omits all of (1), (2), and (4).
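If (3) is what you want to experiment with, here's a toy PageRank via power iteration (just a sketch; real implementations run over sparse matrices with billions of pages):

    # Toy PageRank by power iteration over a tiny link graph.
    graph = {                 # page -> pages it links to
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
    }
    damping = 0.85
    rank = {page: 1.0 / len(graph) for page in graph}

    for _ in range(50):       # iterate until (roughly) converged
        new_rank = {page: (1 - damping) / len(graph) for page in graph}
        for page, links in graph.items():
            for target in links:
                new_rank[target] += damping * rank[page] / len(links)
        rank = new_rank

    print(rank)               # "c" ranks highest: two pages link to it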


👤 arturventura
I'm working on building an AWS for anyone who wants to make their own search engine. The idea is to have a single open web-index database, continuously updated, that you can apply ranking and embedding algorithms to. This would reduce the cost of entry and enable developers to build competitors to Google on top of it, or to create new products in the search space, like a search engine for clothes. I don't know if this is interesting to anyone, but if it is, hit me up.

👤 mindcrime
I don't know that anybody offers a master's degree specifically in "search engines", but loosely speaking, the main academic field backing most search engines is "Information Retrieval"[1].

You can get a taste of the kinds of things covered in this field by looking at this class site (among many others):

https://web.stanford.edu/class/cs276/

That said, to build a modern competitive search engine, you'll need to look into software engineering, distributed systems, artificial intelligence, machine learning, linguistics, computer vision, signal processing, graph theory, databases, scheduling, machine translation, and FSM-only-knows what else.

[1]: https://en.wikipedia.org/wiki/Information_retrieval


👤 mrazomor
Basic techniques: https://nlp.stanford.edu/IR-book/information-retrieval-book....

Smartness (NLP): https://web.stanford.edu/~jurafsky/slp3/

There are also newer techniques, like "deep learning" for search. I'm not sure what a good resource for that is (using ML to learn scoring functions, aka "learning to rank", is a field more than a decade old; it just resurfaced because of the new ML techniques).

Check the table of contents of those books. Courses which mention chapters or sections from them are probably a good choice.

You can also check Apache Lucene and do a deep dive to see a state-of-the-art implementation (Lucene in Action is a good introduction).


👤 ianbutler
I built a pretty large (by non-Google standards) search engine a little over a year ago, with a few hundred million pages. Ultimately my cofounder and I decided not to continue, but the tech itself is solid. We should open-source it as a case study for people to learn from.

👤 alexcg1
What kinda thing do you want to search? Text, I guess? But there are search engines for images, GIFs, video, all kinds of stuff.

I work on an open-source project that builds an AI-powered search framework [0], and I've built some examples in very few lines of code (searching fashion products via image or text [1], PDF text/images/tables search [2]); one of our community members built a protein search engine [3].

A good place to start might be with a no-code solution like (shameless self-plug time) Jina NOW [4], which lets you build a search engine and GUI with just one CLI command.

[0] https://github.com/jina-ai/jina/

[1] https://examples.jina.ai/fashion/

[2] https://colab.research.google.com/github/jina-ai/workshops/b...

[3] https://github.com/georgeamccarthy/protein_search

[4] https://now.jina.ai


👤 franczesko
David Evans' CS101 was about building a search engine with Python; however, I think it's no longer hosted on Udacity.

https://www.cs.virginia.edu/~evans/courses/

I highly recommend digging deeper and trying to find the course materials - I'm pretty sure they should be available somewhere on the Internet. Perhaps the author could point you in the right direction.


👤 mrkramer
I'm also very interested in search engines. These are two books I would recommend:

Introduction to Information Retrieval: https://www.amazon.com/Introduction-Information-Retrieval-Ch...

Search User Interfaces: https://www.amazon.com/Search-User-Interfaces-Marti-Hearst-e...


👤 tintedfireglass
I'm on a similar bandwagon. I just started collecting search engines and analyzing them; I've listed some of them at https://github.com/Tintedfireglass/search-engines. What I've found is that it's easy to look at search algorithms, queries, and user interfaces, but I still don't understand how to create one. I'll probably start by learning some JavaScript first, then try. I just host a searx instance at this point.

👤 sea6ear
The book Artificial Intelligence through Prolog [0] was really interesting to me several years ago, and I remember the author [1] had a number of papers on search engines / information retrieval on his research page.

[0] https://faculty.nps.edu/ncrowe/book/book.html

[1] https://faculty.nps.edu/ncrowe/index.html


👤 Kukumber
You do it like DuckDuckGo: you "pay" Microsoft to use Bing and pretend you made a search engine.

👤 f0e4c2f7
Graph theory and PageRank are a good place to start.

https://blogs.cornell.edu/info2040/2011/09/20/pagerank-backb...

Google's algorithm is indecipherably complex today, but in the early days search engines more or less worked like this: they crawled the web and ranked each page by how many other pages had a URL reference pointing to it.

You can apply this idea today in private (or I suppose public) search engines in the same way, with interesting results.

For example, a search engine for scientific papers might use PageRank to rank papers by how many other papers cite them.

Or if you were going to make a search engine for open-source projects, you could create a PageRank-style algorithm based on which projects other projects depend on.
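In its simplest form, that's just counting inbound edges. A minimal sketch, with citations or package dependencies standing in as the edges:

    from collections import Counter

    # Edges are (source, target) pairs, e.g. paper "a" cites paper "b".
    edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]

    inbound = Counter(target for _, target in edges)
    print(inbound.most_common())   # [('c', 3), ('b', 1)] -- most-cited first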

Part of why Google's algorithm today is more complex than this is that people try to game whatever algorithm search engines commonly use. You may remember that back in the '90s and 2000s, people would put backlinks to other websites in their page source to try to game PageRank. Today that kind of behavior has expanded into a whole cottage industry (unfortunately).

What's interesting, though, is that with a lot of more limited datasets, you have less of that SEO-type problem.

Whichever kind you're building, good luck! Search engines usually wind up being pretty cool (and profitable).


👤 jillesvangurp
Start with a basic intro to algorithms. Dig into the field of information retrieval a bit. Read up on things like inverted indices, bloom filters, ranking (tf-idf, BM25, etc.), and state machines. Then maybe look at vector search, NLP, and a few other fields. That should about cover the basics and give you some level of understanding of how different features in search engines work. The bar is pretty high if you want to do a good job.
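To give a taste of the ranking piece, here's tf-idf in a few lines (just a sketch; BM25 is the refinement used in practice):

    import math
    from collections import Counter

    docs = [["web", "search", "engine"], ["web", "crawler"], ["recipe", "search"]]

    def tf_idf(term, doc, docs):
        tf = Counter(doc)[term] / len(doc)      # how often the term is in this doc
        df = sum(1 for d in docs if term in d)  # how many docs contain the term
        idf = math.log(len(docs) / (1 + df))    # dampen terms common to all docs
        return tf * idf

    print(tf_idf("engine", docs[0], docs))      # rare term -> positive score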

👤 meltyness
Assuming you have a background in math, system administration (wget), automation (Python), software (HTML, JS, REST), and Linux - there are good resources for all of these.

Truth is, with this track you'll set up a search engine in about 20 minutes:

Document Store/NoSQL: https://elastic.co https://solr.apache.org

Classical AI: https://youtube.com/playlist?list=PLUl4u3cNGP63gFHB6xb-kVBiQ...

Stat ML: https://www.tensorflow.org/tutorials https://youtube.com/playlist?list=PLoROMvodv4rMiGQp3WXShtMGg...

Frontiers: https://www.ycombinator.com/companies/?query=Search


👤 Wronnay
I wrote my own little search engine years ago.

It's pretty simple: you crawl a website and collect all of its links, then you crawl all of those linked websites, and so on.
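That loop looks roughly like this - a minimal sketch; a real crawler also needs robots.txt handling, politeness delays, and deduplication:

    # Breadth-first crawl: fetch a page, queue its links, repeat.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    queue, seen = ["https://example.com/"], set()
    while queue and len(seen) < 100:          # cap it for the sketch
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = requests.get(url, timeout=10).text
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            queue.append(urljoin(url, a["href"]))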

You can still see some of my code here: https://github.com/Wronnay/search-lib


👤 agencies
Interesting intro/overview in "What every software engineer should know about search" https://scribe.rip/p/what-every-software-engineer-should-kno...

👤 pjmorris
Tim Bray wrote a blog post series on building a full-text search engine [0].

[0] https://www.tbray.org/ongoing/When/200x/2003/07/30/OnSearchT...


👤 unoti
There is a book called “Finding Out About” that I read a long time ago. It describes all aspects of what you would need to do to build a search engine from scratch. It provides details on storage and retrieval algorithms and so on. It’s dated and would need to be revised but the fundamentals are there.

👤 LordHeini
It kind of depends on how in-depth you want to go.

I wrote a small but somewhat complete search engine some time ago.

The steps are basically:

Have a queue of URLs.

Download a page from the queue.

Use an HTML parser to strip the markup and extract the page's links. Add those links to the queue, skipping the ones already visited.

Use a stemmer to clean up the text (Porter stemmer or whatever).

Build an inverted index (see the sketch below): https://en.wikipedia.org/wiki/Inverted_index

Save the result in an appropriate data structure (hash or tree).

Write a query engine for AND, OR, and "all the words".

Calculate a simple PageRank by counting the links to a page.

For just learning purposes it is not that hard, but if you want to handle all the crazy corner cases of the "real" web, you will go insane.
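The index and query steps in miniature (a sketch; a real index stores postings with positions and frequencies, not bare doc ids):

    # Build an inverted index and answer AND / OR queries over it.
    docs = {1: "web search engine", 2: "web crawler", 3: "search index"}

    index = {}                                # term -> set of doc ids
    for doc_id, text in docs.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)

    def query_and(*terms):
        return set.intersection(*(index.get(t, set()) for t in terms))

    def query_or(*terms):
        return set().union(*(index.get(t, set()) for t in terms))

    print(query_and("web", "search"))         # {1}
    print(query_or("crawler", "index"))       # {2, 3}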

The easy alternative:

Use a lib to crawl a page.

Plonk all the documents' text into Postgres with full-text search, or into Elasticsearch.


👤 jesuslop
I would try to grasp the 'random surfer' idea, which is modeled by a Markov chain. A nice free book is Markov Chains for Programmers [1]. A discrete-time Markov chain boils down to a conditional probability, which boils down to a matrix; a steady-state distribution boils down to an eigenvector of that matrix with eigenvalue 1, which determines PageRank. Then one can jump to 'The $25,000,000,000 Eigenvector: The Linear Algebra behind Google' [2].

[1] https://github.com/czekster/markov

[2] https://doi.org/10.1137/050623280
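In symbols (a standard formulation, not specific to [1] or [2]): with damping factor alpha (typically 0.85), column-stochastic link matrix S, and all-ones vector e,

    G = \alpha S + (1 - \alpha) \frac{1}{n} \mathbf{e}\mathbf{e}^{\top},
    \qquad G \pi = \pi, \quad \pi_i \ge 0, \quad \sum_i \pi_i = 1

The steady-state distribution pi is the PageRank vector: the eigenvector of G with eigenvalue 1, normalized to sum to 1.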

EDIT: admittedly there's much more to it


👤 fzliu
This depends mostly on what kind of search engine you're trying to build. I unfortunately won't be able to point you towards courses, but there are tons of great resources online to help you get started.

Search engines are a fairly broad topic, and a lot of it depends on the _type of data_ that you want to build a search engine for. If you're looking towards more traditional, Google/Yahoo-like search, Elasticsearch's learning center (https://www.elastic.co/learn) has quite a few good resources that can point you in the right direction. Many enterprise search solutions are built on top of Apache Lucene (including Elasticsearch), and videos/blogs discussing Lucene's architecture (https://www.endava.com/en/blog/Engineering/2021/Elasticsearc...) are a great starting point as well.

On the other side of text/web search is _unstructured data_ search, i.e. searching across images, video, audio, etc., based on their semantics (https://www.deepset.ai/blog/semantic-search-with-milvus-know...). Work in this space has been ongoing for decades, but an emerging way of doing it is via a _vector database_ (https://frankzliu.com/blog/a-gentle-introduction-to-vector-d...) such as Zilliz Cloud (https://zilliz.com/cloud) or Milvus (https://milvus.io/). The idea here is to turn the data you're searching into a high-dimensional vector called an embedding, and to perform nearest-neighbor search on the embeddings themselves.
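The nearest-neighbor step at its core, as a brute-force sketch (vector databases do this approximately and at scale; the random vectors here just stand in for real embeddings):

    import numpy as np

    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(1000, 128))   # stand-in corpus embeddings
    query = rng.normal(size=128)                # stand-in query embedding

    # Cosine similarity of the query against every item, then take the top 5.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    scores = embeddings @ query / norms
    top5 = np.argsort(scores)[::-1][:5]
    print(top5, scores[top5])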

Disclaimer: I'm a part of the Zilliz team and Milvus community.


👤 zackmorris
Does anyone have a solution for when the spider or backend content aggregator gets its IP address blacklisted?

I think that Facebook got around this back in the day by having the user's device do the initial scraping. A side effect of this is that sometimes I'd post an article I found, but the preview image was blank because I submitted it too quickly or something.

Until we have a free, publicly downloadable cache of all websites, similar to CoralCDN (is this defunct?), building our own search engines is probably a nonstarter.


👤 lovelearning
All the replies here seem to be about building yet another search engine of the current generation. But they all have usability drawbacks.

I asked a question about my ideal search engine a few days ago: https://news.ycombinator.com/item?id=32452318

Basically, a conversational semantic search engine like the ones in Star Trek. I also listed some technologies that can possibly help implement it. You might be interested in that list.


👤 marginalia_nu
I just started building one (with no prior experience) and solved problems as I went along. There are a lot of problems in search, but none of them are spectacularly difficult. It requires breadth more than depth.

I've converged on a design that is actually almost eerily similar to Google's original design. ( http://infolab.stanford.edu/~backrub/google.html )


👤 johnamata
In most CS master's programs you'll get to take a course called Information Retrieval, and typically the main project for such a course is building a search engine.

👤 skoczko
If by a "search engine" you mean a tool to index and retrieve documents (essentially what the terms mean in the traditional Information Retrieval, e.g Lucene, SOLR, Elastic) then this is a pretty good on the subject that taught me a lot:

https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...


👤 almog
I'm no expert in search engines, and as such I found Victor Lavrenko's videos (52 of them) very well made, though they're probably not nearly as rigorous as what you might be looking for: https://www.youtube.com/c/VictorLavrenko/playlists?view=50&s...

👤 arthurjj
When I switched to working on search, my boss had written "Search Engines: Information Retrieval in Practice" [1], and I found it very helpful for wrapping my head around the different subsystems that make up search. It took me about a month to work through, and it's only $16 on Amazon.

[1] https://amzn.to/3dW5YBQ


👤 gregw134
The term you want to Google is information retrieval.

👤 bwb
Maybe rethink the core idea: why are web pages the valued part?

Maybe instead, we need people to answer questions as real people... and then score/weigh/rank those answers/people.

I just think Google is doing an amazing job for the web; where it's failing is not web search but high-quality answers from humans (and not the marketing trash Google is trying to fix this week).


👤 TekMol
I don't think Sergey Brin or Larry Page studied how to make search engines.

Let's start right away!

    urls_old = []
    urls_new = ['news.ycombinator.com']

    while len(urls_new) > 0:
        spider(urls_new[0])
... to be continued ... who writes the next 5 lines?
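One way those next lines might go - just a sketch that reworks the loop, assuming spider() fetches a page and returns its outbound links:

    while len(urls_new) > 0:
        url = urls_new.pop(0)
        urls_old.append(url)
        for link in spider(url):
            if link not in urls_old and link not in urls_new:
                urls_new.append(link)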

👤 freeCandy
Wiby has some implementation details in their install guide: https://wiby.me/about/guide.html

👤 Xeoncross
I've never worked on a project that encompasses as many computer science algorithms as a search engine. There are a lot of topics you can look up under "Information Storage and Retrieval":

- Tries (patricia, radix, etc...)

- Trees (b-trees, b+trees, merkle trees, log-structured merge-tree, etc..)

- Consensus (raft, paxos, etc..)

- Block storage (disk block size optimizations, mmap files, delta storage, etc..)

- Probabilistic filters (HyperLogLog, bloom filters, etc...)

- Binary Search (sstables, sorted inverted indexes, roaring bitmaps)

- Ranking (PageRank, tf-idf, BM25, etc... - see the BM25 sketch after this list)

- NLP (stemming, POS tagging, subject identification, sentiment analysis etc...)

- HTML (document parsing/lexing)

- Images (exif extraction, removal, resizing / proxying, etc...)

- Queues (SQS, NATS, Apollo, etc...)

- Clustering (k-means, density, hierarchical, gaussian distributions, etc...)

- Rate limiting (leaky bucket, windowed, etc...)

- Compression

- Applied linear algebra

- Text processing (unicode-normalization, slugify, sanitation, lossless and lossy hashing like metaphone and document fingerprinting)

- etc...

I'm sure there is plenty more I've missed. There are lots of generic structures involved, like hashes, linked lists, skip lists, heaps, and priority queues - and this is just to get to 2000s-level basic tech.
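To make the ranking bullet concrete, here's a minimal BM25 scorer (a sketch; k1 and b are the usual free parameters, and real engines compute this over an inverted index, not raw token lists):

    import math
    from collections import Counter

    # Okapi BM25: score one document (a token list) against a query.
    def bm25(query_terms, doc, corpus, k1=1.5, b=0.75):
        N = len(corpus)
        avgdl = sum(len(d) for d in corpus) / N
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            n_t = sum(1 for d in corpus if term in d)          # docs with term
            idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)  # smoothed IDF
            f = tf[term]
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
        return score

    docs = [["web", "search", "engine"], ["web", "crawler"], ["cooking", "recipes"]]
    print(bm25(["web", "search"], docs[0], docs))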

If you are comfortable with Go or Rust you should look at the latest projects in this space:

- https://github.com/quickwit-oss/tantivy

- https://github.com/valeriansaliou/sonic

- https://github.com/mosuka/phalanx

- https://github.com/meilisearch/MeiliSearch

- https://github.com/blevesearch/bleve

- https://github.com/thomasjungblut/go-sstables

A lot of people new to this space mistakenly think you can just throw Elasticsearch or Postgres full-text search in front of terabytes of records and have something decent. The problem is that search with good rankings often requires custom storage, so calculations can be sharded among multiple nodes and you can do layered ranking without passing huge blobs of results between systems.

Source: I'm currently working on the third version of my search engine and I've been studying this for 10 years.


👤 aluciani
Have you seen YaCy Search Engine? https://yacy.net/

This might be something to build on or explore.


👤 kordlessagain
Anything to do with vector search is probably a good choice.

👤 Xamayon
Getting and storing the content is one of the main challenges, and it's getting harder by the day, with more and more sites using anti-bot protection from companies like Cloudflare.

With the SauceNAO.com image search engine, I tried to tailor it to my own needs, taking a slow and steady, semi-curated approach. To keep things sane and costs low, I went after specific sites (and other resources) with high signal-to-noise and highly desirable content. I add a couple at a time, finding and fixing bottlenecks as they come up.

Nothing is perfect from the start, so I mainly focus on environment simplicity and getting the minimum viable setup working as quickly as possible. This has caused some problems, to be sure, and led to the site looking and feeling less than awesome in many ways, but at least it (mostly) works... Over time I have had to rewrite everything - the crawling software, search algorithms, back-end database, and front-end - as it became apparent things could be done more efficiently to deal with the ever-increasing usage and scale. Having the content stored so indexes can be re-generated quickly has been very important long term!

It has taken many years (I started in 2008), but in its art/entertainment niche it has really started to take off usage-wise. My advice would be to start semi-small, throwing things at the wall to see if anything works. Try to keep the initial setup as simple and affordable as possible unless you have serious funding available. Building even a small search engine can take a lot of resources and time, but it can also be an amazingly fun hobby.

👤 thirdtrigger
I'm affiliated with a company that makes one. We create a vector search engine called Weaviate [1]. We also publish content on how it's done, and the search engine itself is open source, which might be helpful for you too.

[1] https://weaviate.io


👤 softwaredoug
This is a good opportunity to update my search reading list.

Books

* Lucene in Action - A bit out of date, but it has a lot of the core concepts of Lucene. 10 years ago this was the go-to book for understanding the core data structures, even if you were working with Solr or Elasticsearch.

* Relevant Search (my book) - Introduction to optimizing the relevance of full text search engines. Getting a bit old now, but still relevant for classic search engines.

* AI Powered Search (disclaimer - I contributed) - Author Trey Grainger is brilliant, and has been a long-time colleague of mine. He managed the search team at Careerbuilder (where we did some work together). This is in some ways his perspective on how machine learning and search work together.

* Elasticsearch The Definitive Guide - online free, very comprehensive, book from Elastic

---

Blogs

* OpenSource Connections (my old company) Blog (http://o19s.com/blog) - lots of meaty search and relevance info

* Query Understanding (https://queryunderstanding.com/) - Daniel Tunkelang is a long-time, very smart search person. Has worked at Google, etc.

* James Rubinstein's Blog (https://jamesrubinstein.medium.com/) - I worked closely with James at LexisNexis. He has helped work on search and search evaluation at eBay, Pinterest, LexisNexis, and Apple.

* Sematext (https://sematext.com/blog/) - Sematext are probably best known for being really in-the-weeds search engine engineers, with a fair amount of scaling content, etc., but there's some relevance material too.

* Sease Query Blog (https://sease.io/blog-2/our-blog) - Sease are London based Information Retrieval Experts

* My Blog (http://softwaredoug.com)

---

Paid Trainings

* OpenSource Connections Training (https://opensourceconnections.com/training)

* CoRise (https://corise.com/course/search-with-machine-learning)

* ML Powered Search (My Sphere Class ) - https://www.getsphere.com/ml-engineering/ml-powered-search?s...

* Sease Training - https://sease.io/training

* Sematext Training - https://sematext.com/training/

---

Conferences

* Haystack - http://haystackconf.com

* MICES - http://mices.co (e-commerce search)

* Berlin Buzzwords - search, scale, stream, etc - https://2022.berlinbuzzwords.de/