HACKER Q&A
📣 verygoode

List of known AI dataset, training, and access crawlers


Has anyone created or found a well-maintained list of known AI dataset and training crawlers?

A list like this is useful if you want to keep stats on crawler traffic, or attempt to limit or block crawling for training.

E.g., the user-agent token for OpenAI's bot is `GPTBot`, and Common Crawl's is `CCBot`.
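
For the stats use case, here is a rough sketch of how such a list might be applied. It only knows the two tokens mentioned above; the function names and the sample User-Agent strings are made up for illustration, and a real list would be longer and need regular upkeep.

```python
from collections import Counter
from typing import Iterable, Optional

# Only the two tokens named in this post; a maintained list would have more.
AI_CRAWLER_TOKENS = ("GPTBot", "CCBot")


def match_ai_crawler(user_agent: str) -> Optional[str]:
    """Return the first known crawler token found in the User-Agent, else None."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        # Case-insensitive substring match, since the token usually sits
        # inside a longer User-Agent string.
        if token.lower() in ua:
            return token
    return None


def count_ai_hits(user_agents: Iterable[str]) -> Counter:
    """Tally requests per known crawler token (hypothetical helper)."""
    counts: Counter = Counter()
    for ua in user_agents:
        token = match_ai_crawler(ua)
        if token is not None:
            counts[token] += 1
    return counts


if __name__ == "__main__":
    # Illustrative strings, not copied from real traffic.
    sample = [
        "ExampleBrowser/1.0 (compatible; GPTBot/1.0)",
        "CCBot/2.0",
        "ExampleBrowser/1.0 (X11; Linux x86_64)",
    ]
    print(count_ai_hits(sample))
```

Keep in mind a User-Agent header is self-reported, so this counts clients that claim to be these bots.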

If you aren't aware of such a list, would you find one useful?


  👤 verygoode Accepted Answer ✓
I've not found much so far.

I think I'd also like to expand beyond just UAs and curate IP ranges, docs, etc.

Starting a repo at https://github.com/JoshuaGoode/ai-user-agents
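
On the IP range idea: since a UA string can be spoofed, one way curated ranges might be used is to check whether a request's source address falls inside a range listed for the bot it claims to be. A minimal sketch, assuming a hand-maintained mapping; `CRAWLER_RANGES` and `ip_matches_crawler` are hypothetical names, and the CIDR blocks below are reserved documentation ranges used as placeholders, not real crawler ranges.

```python
import ipaddress

# PLACEHOLDER data: these are reserved documentation blocks (RFC 5737),
# not ranges actually used by any crawler.
CRAWLER_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "CCBot": ["198.51.100.0/24"],
}


def ip_matches_crawler(ip: str, crawler: str) -> bool:
    """True if `ip` falls inside any range listed for `crawler`."""
    addr = ipaddress.ip_address(ip)
    return any(
        addr in ipaddress.ip_network(cidr)
        for cidr in CRAWLER_RANGES.get(crawler, [])
    )


if __name__ == "__main__":
    print(ip_matches_crawler("192.0.2.10", "GPTBot"))   # inside the placeholder range
    print(ip_matches_crawler("203.0.113.5", "GPTBot"))  # outside it
```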