HACKER Q&A
📣 siatkowski

How to Build AI Crawler?


Hi, I'm working on a graph-based content aggregator, and I need a crawler to feed it. The idea behind using a graph is that the information is precisely sorted and categorized. The crawler should understand whether a page is, for example, a general article, a tutorial, a curiosity, or tips & tricks, and link it to the right node. It should also understand context: it should distinguish an article on security that mentions JavaScript from a JavaScript course that mentions security, and put them in different places. Do you think this is possible? I was looking for existing solutions but couldn't find any. Does anyone want to help me build one? The aggregator is at https://library.one


  👤 ac2u Accepted Answer ✓
Probably a few methods to get started.

1. Feed each crawled page to an LLM prompt automation like ChatGPT and ask it to score the article against each of your categories, to get an idea of how similar they are. You'll have to decide the thresholds yourself.
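A minimal sketch of that scoring step. The category names, the 0-100 scale, and the threshold of 60 are all assumptions here; a real pipeline would send `build_prompt(...)` to an LLM API (e.g. OpenAI's chat completions endpoint) and feed the reply to `pick_categories`:

```python
import json

# Hypothetical category set for the aggregator.
CATEGORIES = ["general article", "tutorial", "curiosity", "tips & tricks"]

def build_prompt(page_text, categories=CATEGORIES):
    """Ask the model to score the page 0-100 against each category,
    replying as a JSON object {category: score}."""
    cats = ", ".join(f'"{c}"' for c in categories)
    return (
        "Score the following page from 0 to 100 for how well it fits each "
        f"of these categories: {cats}. Reply with only a JSON object "
        "mapping each category name to its score.\n\nPAGE:\n" + page_text[:4000]
    )

def pick_categories(model_reply, threshold=60):
    """Keep categories whose score clears the threshold, best first."""
    scores = json.loads(model_reply)
    return sorted((c for c, s in scores.items() if s >= threshold),
                  key=lambda c: -scores[c])

# Canned model reply, standing in for a real API response:
reply = '{"general article": 20, "tutorial": 85, "curiosity": 10, "tips & tricks": 65}'
print(pick_categories(reply))  # ['tutorial', 'tips & tricks']
```

The threshold is exactly the knob the answer says you'll have to tune yourself: too low and every page lands in several nodes, too high and borderline pages get no node at all.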

2. Grab lots and lots of examples of articles and label them. Run a model, either on your own servers or via something like OpenAI's embeddings endpoint, to generate a vector embedding for each document. Then take your whole dataset of category->vector pairings, split it in two (a training set and a verification set), and train a classifier model that takes a vector embedding and spits out probabilities for the different classes.
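A sketch of that train/verify split with scikit-learn. The synthetic Gaussian clusters below stand in for real document embeddings (which would come from an embeddings endpoint or a local model), and the two labels are placeholders for your own category set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
DIM, N_PER_CLASS = 32, 200

# Fake "tutorial" vs "article" clusters in embedding space; in practice
# X would be real embeddings and y your hand-assigned labels.
centers = [rng.normal(0, 1, DIM), rng.normal(0, 1, DIM)]
X = np.vstack([rng.normal(c, 0.5, (N_PER_CLASS, DIM)) for c in centers])
y = ["tutorial"] * N_PER_CLASS + ["article"] * N_PER_CLASS

# Split into a training set and a verification set, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# predict_proba gives per-class probabilities for a new page's embedding,
# which is what you'd use to place a link in the graph.
print(dict(zip(clf.classes_, clf.predict_proba(X_test[:1])[0])))
```

Any classifier that outputs probabilities works here; logistic regression is just a cheap baseline that tends to do well on top of good embeddings.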

3. Fine-tune an existing language->category classification model.

4. Buy services to do any combination of the above work for you.