See also this post by marginalia[1] (HN discussion[2]) which discusses the same thing.
[1] https://memex.marginalia.nu/log/69-creepy-website-similarity...
There are plenty of ways. For example, you can render the home page of a website to an image, then run CLIP on it to get a feature vector, then use an approximate nearest-neighbor search library like FAISS or HNSWlib to index it.
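A rough sketch of that pipeline, assuming the screenshots were already captured with a headless browser; the file names and the choice of a flat FAISS index (exact search, fine for small collections) are just placeholders:

    # Embed homepage screenshots with CLIP, then index them with FAISS.
    from PIL import Image
    import numpy as np
    import faiss
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-32")             # CLIP image/text encoder

    paths = ["example.com.png", "othersite.org.png"]         # made-up screenshot files
    vectors = np.asarray(model.encode([Image.open(p) for p in paths]), dtype="float32")
    faiss.normalize_L2(vectors)                              # so inner product == cosine similarity

    index = faiss.IndexFlatIP(vectors.shape[1])              # exact; switch to an HNSW index at scale
    index.add(vectors)

    # Sites whose homepages look most like the first one
    scores, ids = index.search(vectors[:1], min(5, len(paths)))
    print([paths[i] for i in ids[0]])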
Or you can ask ChatGPT, or another neural network, to summarize the webpages into a short description and then turn that into a feature vector. Old-school approaches are things like bag-of-words for document classification. Then you run a hierarchical clustering algorithm (something like hierarchical k-means). This lets you present things that are similar, but not so similar that they are duplicates.
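A sketch of the bag-of-words plus hierarchical clustering route, using recent scikit-learn; the example texts and the distance threshold are placeholders you would tune:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import AgglomerativeClustering

    docs = {                                   # site -> extracted text (toy stand-ins)
        "news-site.example":   "politics election economy markets ...",
        "recipe-blog.example": "recipes baking sourdough dinner ...",
        "dev-blog.example":    "python rust compilers debugging ...",
    }

    X = TfidfVectorizer(max_features=5000).fit_transform(docs.values()).toarray()

    # Cut the tree by distance instead of fixing the number of clusters, so
    # near-duplicates land together and loosely related sites sit in nearby clusters.
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0.8, metric="cosine", linkage="average"
    ).fit_predict(X)

    for site, label in zip(docs, labels):
        print(label, site)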
Interesting distances between websites are often described by considering which other websites they link to and which other websites link to them. Graph neural networks let you build a feature vector from these links between websites. This is also related to the well-known PageRank algorithm.
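A hand-rolled baseline for link-based similarity, without any ML: Jaccard overlap of outlinks, plus PageRank on the same graph (a GNN or node2vec would learn embeddings from this same structure). The crawl data here is made up:

    import networkx as nx

    outlinks = {                               # site -> sites it links to
        "a.example": {"w3.org", "github.com"},
        "b.example": {"w3.org", "github.com", "arxiv.org"},
        "c.example": {"pinterest.com", "dailymotion.com"},
    }

    def jaccard(u, v):
        return len(outlinks[u] & outlinks[v]) / len(outlinks[u] | outlinks[v])

    print(jaccard("a.example", "b.example"))   # high: they link to the same places
    print(jaccard("a.example", "c.example"))   # 0.0: nothing in common

    G = nx.DiGraph([(src, dst) for src, targets in outlinks.items() for dst in targets])
    print(nx.pagerank(G))                      # global importance scores from the same link graph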
Finally, gathering metadata about websites can also be an interesting axis of similarity: Who owns the site? How often do they update? How much money does it generate? How does it generate money? How big is it? How fast does it render? What do people think about the site? Basically, answer the 5Ws about the website and build a database from it, and an LLM can help answer those questions automatically (do a web search about the site, summarize the results, put them in the LLM's context, ask the question in a prompt, and index the answer).
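A rough sketch of that last step, assuming the OpenAI Python client and that the web search and summarization happen elsewhere; the model name is just an example:

    from openai import OpenAI

    client = OpenAI()

    def answer_about_site(site, question, search_snippets):
        context = "\n\n".join(search_snippets)   # hypothetical search results for the site
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Answer concisely, using only the provided context."},
                {"role": "user", "content": f"Context about {site}:\n{context}\n\nQuestion: {question}"},
            ],
        )
        return resp.choices[0].message.content   # store/index this answer per site

    # answer_about_site("example.com", "Who owns this site and how does it make money?", snippets)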
Two approaches I'm currently trying (both need users' browsing history):
- Don't try to recommend similar websites; instead recommend users who like similar things as you, and list the websites those users like (see the sketch after this list).
- Create tags with accuracy scores. For example, you tag a website "product management", "startup" and "b2b". You can go one step further and ask users to rate how well each tag matches the website, like 90% for "b2b", 50% for "startup" and 20% for "product management". Then you can let users search by tags and their accuracy ("I want 'product management' with an average above 50%").
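A rough sketch of both ideas; the data structures and example ratings are made up:

    from collections import defaultdict

    likes = {                                    # user -> sites they like
        "alice": {"a.example", "b.example", "c.example"},
        "bob":   {"a.example", "b.example", "d.example"},
        "carol": {"x.example"},
    }

    def similar_users(me):
        mine = likes[me]
        scored = [(len(mine & s) / len(mine | s), u) for u, s in likes.items() if u != me]
        return sorted(scored, reverse=True)      # Jaccard similarity, best match first

    def recommend(me):
        for score, user in similar_users(me):
            for site in likes[user] - likes[me]:
                yield site, user, score          # surface *who* liked it, not just what

    print(list(recommend("alice")))              # d.example via bob, then carol's picks

    # Tag ratings: site -> tag -> user ratings in [0, 1]; search by average accuracy.
    ratings = defaultdict(lambda: defaultdict(list))
    ratings["saas-tool.example"]["b2b"] += [0.9, 0.8]
    ratings["saas-tool.example"]["startup"] += [0.5]

    def search(tag, min_avg):
        return [s for s, tags in ratings.items()
                if tags[tag] and sum(tags[tag]) / len(tags[tag]) >= min_avg]

    print(search("b2b", 0.5))                    # ['saas-tool.example']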
Like you, I feel like something can be done with LLMs but I just haven't found it yet. Maybe suggesting a website's tags from a restricted list of tags, and then suggesting tags from an explanation of what the user is searching for, and searching those tags.
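Roughly what I mean by a restricted tag list, as a sketch (the tag list and model name are placeholders): constrain the model to a fixed vocabulary so its output stays searchable.

    from openai import OpenAI

    ALLOWED_TAGS = ["product management", "startup", "b2b", "gaming", "cooking", "programming"]

    def suggest_tags(page_text):
        resp = OpenAI().chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"Pick at most 3 tags from this list: {ALLOWED_TAGS}. "
                    "Reply with a comma-separated list only.\n\n"
                    f"Page text:\n{page_text[:4000]}"
                ),
            }],
        )
        picked = [t.strip() for t in resp.choices[0].message.content.split(",")]
        return [t for t in picked if t in ALLOWED_TAGS]   # drop anything off-list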
But the problem is how you feed data into them. Some websites are vast depots of very different kinds of text spanning a huge number of domains. If you use data from /r/programming you will classify Reddit as a programming website. If you use data from /r/food you will classify it as a culinary website.
Some websites like Pinterest or Dailymotion are media-heavy, so using just text might not be helpful.
What I want to say is that the actual classification is the last problem to solve; the real problem is feeding relevant data into it.
As some have pointed out, embedding your documents using an LLM would be a good bet.
If you take the time to manually annotate a portion of your data, you could then fine-tune a model. You could also try doing some few-shot / zero-shot classification with ChatGPT. You could also try clustering your embeddings to see if categories emerge and label them afterwards.
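A minimal sketch of the "cluster your embeddings and see what emerges" route; the embedding model, cluster count and example texts are placeholders:

    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    docs = {
        "site-a.example": "We sell project management software for small teams ...",
        "site-b.example": "Daily recipes, baking tips and restaurant reviews ...",
        "site-c.example": "Kanban boards, sprints and roadmaps for startups ...",
    }

    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(list(docs.values()))

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
    for site, label in zip(docs, labels):
        print(label, site)                   # inspect the clusters by hand, then name them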
It's not an explanation, but it shows a possible way to cluster (and thus classify) websites based on how they appear. If you want your classification based on content, maybe you need something different.