HACKER Q&A
📣 spgman

How to Classify Websites?


For years now I've wanted to develop a program that would be able to find similar websites. I am able to write a crawler, no problem. I haven't been able to figure out the classification side of things. I tried Bayesian classification without much success. Would ChatGPT or Llama be able to do this?


  👤 supriyo-biswas Accepted Answer ✓
Word vectorization + cosine distance, perhaps.

See also this post by marginalia[1] (HN discussion[2]) which discusses the same thing.

[1] https://memex.marginalia.nu/log/69-creepy-website-similarity...

[2] https://news.ycombinator.com/item?id=34143101
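
A rough sketch of that idea with scikit-learn's TF-IDF (the page text here is placeholder data standing in for whatever your crawler extracts):

    # Minimal sketch of "word vectorization + cosine distance" using scikit-learn.
    # `pages` maps URLs to crawled page text (placeholder data here).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    pages = {
        "https://example.com/a": "rust programming systems language memory safety",
        "https://example.com/b": "python tutorials machine learning pandas numpy",
        "https://example.com/c": "rust tutorials ownership borrow checker",
    }

    urls = list(pages)
    vectors = TfidfVectorizer(stop_words="english").fit_transform(pages.values())
    similarity = cosine_similarity(vectors)  # pairwise similarity matrix

    # Most similar site to the first one (excluding itself)
    scores = list(enumerate(similarity[0]))
    best = max((s for s in scores if s[0] != 0), key=lambda s: s[1])
    print(f"Closest to {urls[0]}: {urls[best[0]]} (score {best[1]:.2f})")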


👤 GistNoesis
That's the basis for a search engine.

There are plenty of ways. For example, you can render the home page of a website to an image, then run CLIP to get a feature vector, then use an approximate nearest-neighbor search library like FAISS or HNSWlib to index it.
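
A rough sketch of that pipeline, assuming the screenshots have already been rendered with a headless browser (the file names below are placeholders), using sentence-transformers' CLIP model and FAISS:

    # Sketch: embed homepage screenshots with CLIP, index them with FAISS.
    # Assumes screenshots were already rendered; file names are placeholders.
    import faiss
    import numpy as np
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-32")
    screenshots = ["site_a.png", "site_b.png", "site_c.png"]

    vectors = model.encode([Image.open(p) for p in screenshots])
    vectors = np.asarray(vectors, dtype="float32")
    faiss.normalize_L2(vectors)  # so inner product equals cosine similarity

    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)

    # Nearest neighbours of the first screenshot (itself plus the closest site).
    scores, ids = index.search(vectors[:1], 2)
    print(screenshots[ids[0][1]], scores[0][1])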

Or you can ask ChatGPT or another neural network to summarize the webpages into a short description and then turn that into a feature vector. Old-school approaches are things like bag-of-words for document classification. Then you run a hierarchical clustering algorithm (something like hierarchical k-means). This lets you present things that are similar, but not so similar that they are duplicates.
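
A minimal sketch of the bag-of-words plus hierarchical clustering route with scikit-learn (the documents are placeholders for per-site text):

    # Sketch: bag-of-words features + agglomerative (hierarchical) clustering.
    # `docs` stands in for per-site text collected by the crawler.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import AgglomerativeClustering

    docs = [
        "recipes cooking baking bread sourdough",
        "sourdough starter baking tips",
        "javascript react frontend tutorial",
        "typescript react hooks guide",
    ]

    X = CountVectorizer(stop_words="english").fit_transform(docs).toarray()
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
    print(labels)  # e.g. [0 0 1 1]: cooking sites vs. frontend sites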

Interesting distances between websites are often described by considering which other websites they link to and which other websites link to them. Graph neural networks let you build a feature vector based on these links between websites. This is also related to the well-known PageRank algorithm.
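
For the link-based view, a small sketch with networkx: PageRank over the crawl's link graph, plus a simple Jaccard similarity on outgoing links (the edges are made up):

    # Sketch: link-graph signals with networkx. Edges are "site A links to site B"
    # pairs from the crawl (placeholder data below).
    import networkx as nx

    edges = [("a.com", "b.com"), ("a.com", "c.com"),
             ("d.com", "b.com"), ("d.com", "c.com"), ("b.com", "c.com")]
    G = nx.DiGraph(edges)

    print(nx.pagerank(G))  # importance of each site in the link graph

    # Simple link-based similarity: Jaccard overlap of outgoing links.
    def out_link_similarity(g, u, v):
        a, b = set(g.successors(u)), set(g.successors(v))
        return len(a & b) / len(a | b) if a | b else 0.0

    print(out_link_similarity(G, "a.com", "d.com"))  # 1.0: same outgoing links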

Finally, gathering metadata about websites can also be an interesting axis of similarity: Who owns the site? How often do they update it? How much money does it generate? How does it generate money? How big is it? How fast does it render? What do people think about the site? Basically, answer the 5Ws about the website and build a database of it; an LLM can help answer those questions automatically (you do a web search about the site, summarize the results, put them in the context of the LLM, ask the question in a prompt, and index the answer).
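
One way that loop might look with the OpenAI client; the search snippets and the model name are placeholders for whatever your own search step and model choice turn out to be:

    # Sketch: ask an LLM metadata questions about a site, given search-result
    # snippets gathered beforehand. `search_results` is a placeholder; plug in
    # whatever your own search step returns. The model name is only an example.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    site = "example.com"
    search_results = "...snippets from a web search about example.com..."
    questions = ["Who owns the site?", "How does it make money?", "How often is it updated?"]

    answers = {}
    for q in questions:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Answer briefly using only the provided context."},
                {"role": "user", "content": f"Context about {site}:\n{search_results}\n\nQuestion: {q}"},
            ],
        )
        answers[q] = response.choices[0].message.content

    print(answers)  # index these answers alongside the site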


👤 polote
I'm working on exactly that too, and I don't have the answer. The problem is that we all have our own way of classifying things, and it's never the same way. The same word also never means the same thing to each of us.

Two approaches I'm currently trying (both need users' browsing history):

- Don't try to recommend similar websites; instead, recommend users who like similar things as you, and then you can list the websites that those users like

- Create tags with accuracy scores. For example, you tag a website "product management", "startup", and "b2b". You can go one step further and ask users to rate how well each tag matches the website, say 90% for "b2b", 50% for "startup", and 20% for "product management". Then you can let users search by tag and accuracy ("product management" averaging more than 50%, say); there's a small code sketch of this at the end of this comment

Like you, I feel something can be done with LLMs, but I just haven't found it yet. Maybe suggest the tags of a website from a restricted list of tags, then suggest tags from an explanation of what the user is searching for, and then search those tags.
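
A tiny sketch of the tag-accuracy search described above, with made-up ratings:

    # Sketch of the tag-accuracy search: keep sites whose average user rating
    # for a tag clears a threshold. Ratings are made-up placeholders.
    from statistics import mean

    ratings = {  # site -> tag -> list of user ratings (0..1)
        "siteA.com": {"b2b": [0.9, 0.8], "startup": [0.5], "product management": [0.2]},
        "siteB.com": {"b2b": [0.3], "product management": [0.7, 0.6]},
    }

    def search(tag, min_avg):
        return [site for site, tags in ratings.items()
                if tag in tags and mean(tags[tag]) >= min_avg]

    print(search("product management", 0.5))  # ['siteB.com']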


👤 DeathArrow
There are countless ML algorithms that are good at classification; no need for LLMs.

But the problem is how you feed data into them. Some websites are vast depots of very different types of text spanning a huge number of domains. If you use data from /r/programming, you will classify Reddit as a programming website. If you use data from /r/food, you will classify it as a culinary website.

Some websites like Pinterest or Dailymotion are media-heavy, so using just text might not be helpful.

What I want to say is that the actual classification is the last problem to solve; the real problem is feeding relevant data into it.


👤 btrettel
The comments here focus on various AI/ML ways of classifying websites. That's surely part of a solution, but I think manual classification is still important. If you're making a public-facing website, try to crowd-source the manual classification if you can't classify pages yourself. At present, AI/ML classification probably doesn't work anywhere near as well as you think it does, which became obvious to me when I worked as a patent examiner and used some of the many AI/ML search tools. These tools are quite good at finding somewhat similar things, but they usually miss things that I'd consider very similar. And when I was a patent examiner, "somewhat similar" wasn't good enough. Many people looking for similar websites won't be satisfied with a list of somewhat similar pages if a very similar page exists in the database.

👤 jerpint
The hard part will be figuring out your different categories to classify.

As some have pointed out, embedding your documents using an LLM would be a good bet.

If you take the time to manually annotate a portion of your data, you could then fine-tune a model. You could also try doing some few-shot / zero-shot classification with ChatGPT. You could also try clustering on your embeddings to see if categories emerge and label them afterwards.
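
A rough sketch of the clustering-on-embeddings idea with sentence-transformers and k-means (the texts and model name are only examples); you'd then inspect each cluster and name it afterwards:

    # Sketch: embed page text, cluster with k-means, then eyeball each cluster
    # to decide what category it represents. Texts and model are just examples.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    texts = [
        "Buy handmade ceramic mugs and bowls",
        "Artisan pottery and ceramics shop",
        "Breaking news, politics and world events",
        "Latest headlines and live news coverage",
    ]

    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

    for label, text in zip(labels, texts):
        print(label, text)  # inspect clusters, then name them ("ecommerce", "news", ...)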


👤 byschii
Maybe you can take inspiration from this https://youtu.be/z6ep308goxQ?t=187

It's not an explanation, but it shows a possible way to cluster (and thus classify) websites based on how they appear. If you want your classification based on content, maybe you need something different.


👤 djrockstar1
Someone had success using GPT-3 to classify episodes of a podcast[1]. I imagine if you fed the HTML from the crawler into an LLM, it could come up with a usable classification for it.

[1] https://news.ycombinator.com/item?id=35073603
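
If you go that route, it's probably worth boiling the crawled HTML down to visible text first so markup doesn't waste the model's context; a small sketch with BeautifulSoup (the HTML below is a stand-in for a crawled page):

    # Sketch: strip crawled HTML to visible text before sending it to an LLM.
    from bs4 import BeautifulSoup

    html = ("<html><head><title>Example</title><style>body{}</style></head>"
            "<body><nav>menu</nav><h1>Sourdough basics</h1>"
            "<p>How to feed a starter.</p></body></html>")

    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()  # drop boilerplate that says little about the content

    text = " ".join(soup.get_text(separator=" ").split())
    print(text[:2000])  # truncate to fit the model's context window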


👤 alangibson
Google "unsupervised clustering". There are many well known algorithms.