You can decompose a "search engine" into multiple big components and figure out what you want to look at first:
(1) web crawler/spiders
(2) database cache of web content -- aka building the "search index"
(3) algorithm of scoring/weighing/ranking of pages -- e.g. "PageRank"
(4) query engine -- translating user inputs into returning the most "relevant" pages
Each technical topic is a sub-specialty and can be staffed by dedicated engineers. There are also more topics such as lexical analysis, distributed computing (for all 4 areas), etc.
If you're mainly focused on experimenting with programming another ranking algorithm, you can skip part (1) by leveraging the dataset from Common Crawl: https://index.commoncrawl.org/
Here are some videos about PageRank:
https://www.youtube.com/watch?v=JGQe4kiPnrU , https://www.youtube.com/watch?v=qxEkY8OScYY
... but keep in mind that the scope of those videos omits all of (1), (2), and (4).
You can get a taste of the kinds of things covered in this field by looking at this class site (among many others):
https://web.stanford.edu/class/cs276/
That said, to build a modern competitive search engine, you'll need to look into software engineering, distributed systems, artificial intelligence, machine learning, linguistics, computer vision, signal processing, graph theory, databases, scheduling, machine translation, and FSM-only-knows what else.
Smartness (NLP): https://web.stanford.edu/~jurafsky/slp3/
There are also newer techniques, like "deep learning" for search. I'm not sure what's a good resource for that; using ML to learn the scoring functions ("learning to rank") is a field more than a decade old, it just surfaced again because of the new ML techniques.
Check the table of contents of those books. Courses which mention chapters or sections from them are probably a good choice.
You can also check out Apache Lucene and do a deep dive to see a state-of-the-art implementation (Lucene in Action is a good introduction).
I'm working on an open-source project that builds an AI-powered search framework [0], and I've built some examples in very few lines of code (searching fashion products via image or text [1], and PDF text/image/table search [2]); one of our community members built a protein search engine [3].
A good place to start might be with a no-code solution like (shameless self-plug time) Jina NOW [4], which lets you build a search engine and GUI with just one CLI command.
[0] https://github.com/jina-ai/jina/
[1] https://examples.jina.ai/fashion/
[2] https://colab.research.google.com/github/jina-ai/workshops/b...
https://www.cs.virginia.edu/~evans/courses/
I highly recommend digging deeper and trying to find the course materials - I'm pretty sure they should be available somewhere on the Internet. Perhaps the author could point you in the right direction.
Introduction to Information Retrieval: https://www.amazon.com/Introduction-Information-Retrieval-Ch...
Search User Interfaces: https://www.amazon.com/Search-User-Interfaces-Marti-Hearst-e...
https://blogs.cornell.edu/info2040/2011/09/20/pagerank-backb...
Google's algorithm is indecipherably complex today, but in the early days search engines more or less worked like this: they crawled the web and ranked pages by how many other pages had a URL reference pointing to them.
You can apply this idea today to private (or, I suppose, public) search engines in the same way, with interesting results.
For example, a search engine for scientific papers might use PageRank to rank papers that are cited by the most other papers.
Or if you were going to make a search engine for open source projects, you could build a PageRank-style algorithm based on which projects other projects depend on.
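For illustration, here's a minimal power-iteration PageRank sketch in Python over a made-up link graph (0.85 is the textbook damping factor; the graph and iteration count are just for demonstration):

    # links[p] = pages that p points to
    links = {
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
        "d": ["c"],
    }
    pages = list(links)
    rank = {p: 1 / len(pages) for p in pages}
    d = 0.85  # damping factor

    for _ in range(50):  # iterate until the ranks settle
        new = {p: (1 - d) / len(pages) for p in pages}
        for p, outs in links.items():
            for q in outs:
                new[q] += d * rank[p] / len(outs)
        rank = new

    print(sorted(rank.items(), key=lambda kv: -kv[1]))  # "c" comes out on top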
Part of why Google's algorithm today is more complex than this is that people try to game whatever algorithm search engines commonly use. You may remember that back in the 90s and 2000s people would do things like stuffing backlinks to other websites into their page source to try to game PageRank. Today that kind of behavior has expanded into a whole cottage industry (unfortunately).
What's interesting, though, is that with a lot of more limited data sets you have less of that SEO-type problem.
Whichever kind you're building, good luck! Search engines usually wind up being pretty cool (and profitable).
Truth is, following this track you'll set up a search engine in about 20 minutes:
Document Store/NoSQL: https://elastic.co https://solr.apache.org
Classical AI: https://youtube.com/playlist?list=PLUl4u3cNGP63gFHB6xb-kVBiQ...
Stat ML: https://www.tensorflow.org/tutorials https://youtube.com/playlist?list=PLoROMvodv4rMiGQp3WXShtMGg...
Frontiers: https://www.ycombinator.com/companies/?query=Search
It's pretty simple: you crawl a website and collect all its links, then you crawl all the links from those linked websites, and so on.
You can still see some of my code here: https://github.com/Wronnay/search-lib
I wrote a small but somewhat complete search engine some time ago.
The steps are basically:
Have a queue of URLs.
Download a page from the queue.
Use an HTML parser to strip the markup and extract the page's links. Add the links to the queue, skipping any you've already visited.
Use a stemmer to normalize the text (Porter stemmer or similar).
Calculate an inverted index: https://en.wikipedia.org/wiki/Inverted_index
Save the result in an appropriate data structure (hash or tree).
Write a query engine for AND, OR, and "all the words" queries (a toy sketch follows these steps).
Calculate a simple PageRank by counting the links that point to each page.
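A toy sketch of the stemming/indexing/query steps in Python (the crawl is skipped, and a trivial suffix-stripper stands in for a real Porter stemmer such as NLTK's):

    from collections import defaultdict

    def stem(word):
        # stand-in for a real Porter stemmer
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def tokenize(text):
        return [stem(w) for w in text.lower().split()]

    docs = {
        1: "cats are chasing mice",
        2: "mice eat cheese",
        3: "dogs are chasing cats",
    }

    # inverted index: term -> set of document ids containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)

    def query_and(words):  # documents containing ALL the words
        sets = [index.get(stem(w.lower()), set()) for w in words]
        return set.intersection(*sets) if sets else set()

    def query_or(words):   # documents containing ANY of the words
        return set().union(*(index.get(stem(w.lower()), set()) for w in words))

    print(query_and(["chasing", "cats"]))  # {1, 3}
    print(query_or(["cheese", "dogs"]))    # {2, 3}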
For learning purposes it's not that hard, but if you want to handle all the crazy corner cases of the "real" web, you will go insane.
The easy alternative:
Use a lib to crawl a page.
Plonk all the document text into Postgres with full-text search, or into Elasticsearch.
EDIT: admittedly there's much more to it
Search engines are a fairly broad topic, and a lot of it depends on the _type of data_ that you want to build a search engine for. If you're looking towards more traditional, Google/Yahoo-like search, Elasticsearch's learning center (https://www.elastic.co/learn) has quite a few good resources that can point you in the right direction. Many enterprise search solutions are built on top of Apache Lucene (including Elasticsearch), and videos/blogs discussing Lucene's architecture (https://www.endava.com/en/blog/Engineering/2021/Elasticsearc...) are a great starting point as well.
On the other side of text/web search is _unstructured data_ search, i.e. searching across images, video, audio, etc., based on their semantics (https://www.deepset.ai/blog/semantic-search-with-milvus-know...). Work in this space has been ongoing for decades, but an emerging way of doing this is via a _vector database_ (https://frankzliu.com/blog/a-gentle-introduction-to-vector-d...) such as Zilliz Cloud (https://zilliz.com/cloud) or Milvus (https://milvus.io/). The idea here is to turn the type of data that you're searching for into a high-dimensional vector called an embedding, and to perform nearest-neighbor search on the embeddings themselves.
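To make the embedding idea concrete, here's a brute-force nearest-neighbor sketch with numpy (random vectors stand in for a real embedding model; a vector database would use approximate indexes like HNSW rather than a full scan):

    import numpy as np

    rng = np.random.default_rng(0)
    corpus = rng.normal(size=(1000, 128))  # 1000 items, 128-dim embeddings
    query = rng.normal(size=128)           # embedding of the query

    # normalize so a dot product equals cosine similarity
    corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)

    scores = corpus @ query                # similarity of query to every item
    top_k = np.argsort(scores)[::-1][:5]   # indices of the 5 nearest neighbors
    print(top_k, scores[top_k])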
Disclaimer: I'm a part of the Zilliz team and Milvus community.
I think that Facebook got around this back in the day by having the user's device do the initial scraping. A side effect of this is that sometimes I'd post an article I found, but the preview image was blank because I submitted it too quickly or something.
Until we have a free, publicly downloadable cache of all websites, similar to CoralCDN (is that defunct?), building our own search engines is probably a nonstarter.
I asked a question about my ideal search engine a few days ago: https://news.ycombinator.com/item?id=32452318
Basically, a conversational semantic search engine like the ones in Star Trek. I also listed some technologies that can possibly help implement it. You might be interested in that list.
I've converged on a design that is actually almost eerily similar to Google's original design. ( http://infolab.stanford.edu/~backrub/google.html )
https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...
Maybe instead, we need people to answer questions as real people... and then score/weigh/rank those answers/people.
I just think Google is doing an amazing job for the web; where it's failing is not web search, but high-quality answers from humans (and not the marketing trash Google is trying to fix this week)
Let's start right away!
    urls_old = []
    urls_new = ['https://news.ycombinator.com']
    while len(urls_new) > 0:
        url = urls_new.pop(0)  # take the next URL off the queue
        urls_old.append(url)   # remember it so we don't crawl it twice
        spider(url)            # fetch it and enqueue any new links
... to be continued ... who writes the next 5 lines?
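One way those next lines could look -- a hypothetical spider() sketched with just Python's standard library (urllib.request and html.parser; robots.txt, politeness delays, and error handling all omitted):

    import urllib.request
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        # collects the href of every <a> tag it sees
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links += [v for k, v in attrs if k == "href" and v]

    def spider(url):
        html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            # keep absolute links we haven't seen before
            if link.startswith("http") and link not in urls_old and link not in urls_new:
                urls_new.append(link)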
- Tries (patricia, radix, etc...)
- Trees (b-trees, b+trees, merkle trees, log-structured merge-tree, etc..)
- Consensus (raft, paxos, etc..)
- Block storage (disk block size optimizations, mmap files, delta storage, etc..)
- Probabilistic filters (hyperloglog, bloom filters, etc...)
- Binary Search (sstables, sorted inverted indexes, roaring bitmaps)
- Ranking (pagerank, tf/idf, bm25, etc... -- a toy tf-idf scorer is sketched after this list)
- NLP (stemming, POS tagging, subject identification, sentiment analysis etc...)
- HTML (document parsing/lexing)
- Images (exif extraction, removal, resizing / proxying, etc...)
- Queues (SQS, NATS, Apollo, etc...)
- Clustering (k-means, density, hierarchical, gaussian distributions, etc...)
- Rate limiting (leaky bucket, windowed, etc...)
- Compression
- Applied linear algebra
- Text processing (unicode-normalization, slugify, sanitation, lossless and lossy hashing like metaphone and document fingerprinting)
- etc...
I'm sure there is plenty more I've missed. There are lots of generic structures involved like hashes, linked-lists, skip-lists, heaps and priority queues and this is just to get 2000's level basic tech.
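To make the ranking item above concrete, here's a toy tf-idf scorer in Python (illustrative only; real engines compute BM25-style scores incrementally over inverted indexes):

    import math
    from collections import Counter

    docs = [d.split() for d in ["the cat sat", "the dog sat", "the cat ran fast"]]
    N = len(docs)

    def idf(term):
        # rarer terms get a higher weight
        df = sum(1 for doc in docs if term in doc)
        return math.log(N / (1 + df)) + 1  # smoothed

    def score(query, doc):
        tf = Counter(doc)
        return sum(tf[t] / len(doc) * idf(t) for t in query.split())

    for i, doc in enumerate(docs):
        print(i, round(score("cat sat", doc), 3))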
If you are comfortable with Go or Rust you should look at the latest projects in this space:
- https://github.com/quickwit-oss/tantivy
- https://github.com/valeriansaliou/sonic
- https://github.com/mosuka/phalanx
- https://github.com/meilisearch/MeiliSearch
- https://github.com/blevesearch/bleve
- https://github.com/thomasjungblut/go-sstables
A lot of people new to this space mistakenly think you can just throw Elasticsearch or Postgres full-text search in front of terabytes of records and have something decent. The problem is that search with good rankings often requires custom storage, so calculations can be sharded among multiple nodes and you can do layered ranking without passing huge blobs of results between systems.
Source: I'm currently working on the third version of my search engine and I've been studying this for 10 years.
This might be something to build on or explore.
Books
* Lucene in Action - A bit out of date, but it has a lot of the core concepts of Lucene. 10 years ago, this was the go-to book even if you were working with Solr or Elasticsearch, to understand the core data structures
* Relevant Search (my book) - Introduction to optimizing the relevance of full text search engines. Getting a bit old now, but still relevant for classic search engines.
* AI Powered Search (disclaimer - I contributed) - Author Trey Grainger is brilliant, and has been a long-time colleague of mine. He managed the search team at Careerbuilder (where we did some work together). This is in some ways his perspective on how machine learning and search work together.
* Elasticsearch The Definitive Guide - online free, very comprehensive, book from Elastic
---
Blogs
* OpenSource Connections (my old company) Blog (http://o19s.com/blog) - lots of meaty search and relevance info
* Query Understanding (https://queryunderstanding.com/) - Daniel Tunkelang is a long-time, very smart search person. Has worked at Google, etc
* James Rubinstein's Blog (https://jamesrubinstein.medium.com/) - I worked closely with James at LexisNexis. He has helped work on search and search evaluation at eBay, Pinterest, LexisNexis, and Apple
* Sematext (https://sematext.com/blog/) - Sematext are probably best known as really in-the-weeds search engine engineers; a fair amount on scaling, etc., but some relevance too.
* Sease Query Blog (https://sease.io/blog-2/our-blog) - Sease are London based Information Retrieval Experts
* My Blog (http://softwaredoug.com)
---
Paid Trainings
* OpenSource Connections Training (https://opensourceconnections.com/training)
* CoRise (https://corise.com/course/search-with-machine-learning)
* ML Powered Search (My Sphere Class ) - https://www.getsphere.com/ml-engineering/ml-powered-search?s...
* Sease Training - https://sease.io/training
* Sematext Training - https://sematext.com/training/
---
Conferences
* Haystack - http://haystackconf.com
* MICES - http://mices.co (e-commerce search)
* Berlin Buzzwords - search, scale, stream, etc - https://2022.berlinbuzzwords.de/