HACKER Q&A
📣 thomasfromcdnjs

Are there any movements to create the equivalent of a sitemap for LLMs?


I've been working on an indigenous language project for a while now -> https://github.com/australia/mobtranslate-server/blob/master/dictionaries/kuku_yalanji/dictionary.yaml

It currently fits inside the token limit of gpt-4o (120k tokens), so I am able to prompt inject it and make a translator-like bot that has amazing results -> https://i.imgur.com/UuEQPUA.png (this is above and beyond good enough for the project goals)

The problem is that I'm paying for 100k+ tokens per translation.
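Roughly, the setup is the pattern sketched below. This is a minimal sketch assuming the official OpenAI Python SDK; the file path and prompt wording are illustrative, not the project's actual code. Every call re-sends the whole dictionary, which is where the 100k+ tokens per translation go.

```python
# Minimal sketch of the "prompt inject the whole dictionary" approach described
# above. Assumes the official OpenAI Python SDK (`pip install openai`); the file
# path and prompt wording are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("dictionaries/kuku_yalanji/dictionary.yaml") as f:
    dictionary_yaml = f.read()  # ~100k+ tokens, re-sent on every request

def translate(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are a Kuku Yalanji translator. "
                        "Use only this dictionary:\n" + dictionary_yaml},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

print(translate("Where is the river?"))
```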

Other than the Github YAML version of the dict, I have a publicly indexed (Google) html version of it -> https://mobtranslate.com/dictionaries/kuku_yalanji

But obviously, if the models had been trained on this data, they would already have intrinsic knowledge of it and I would have to prompt inject a lot less.

I know that models have a training cutoff with each iteration, but is there a way to ensure that you are crawled before the next one? (I'm asking in the context of OpenAI, but I'm curious about answers for other models too.)

Essentially, is there an equivalent of Google Webmaster Tools where I can submit a sitemap, check the progress of crawls, or submit individual pages?

If there isn't, are there any movements to create such a resource?


👤 altdataseller Accepted Answer ✓
You can ensure you are crawled by simply not blocking the crawlers in your robots.txt. They'll find you eventually, unless you never let anyone know about your website.
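For reference, a permissive robots.txt might look like the sketch below. GPTBot is OpenAI's documented crawler and CCBot is Common Crawl's; check each vendor's docs for the user-agent strings they currently use, and note the sitemap URL here is hypothetical.

```
# Sketch of a permissive robots.txt.
# GPTBot = OpenAI's crawler, CCBot = Common Crawl's crawler (per their docs).
User-agent: GPTBot
Allow: /

User-agent: CCBot
Allow: /

# Hypothetical sitemap location for the site above.
Sitemap: https://mobtranslate.com/sitemap.xml
```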

👤 sabbaticaldev
The next step would be LEO: LLM engine optimization. Even if the LLM crawls your content, it will be ranked low, so hallucinations about your data will be high.

A GitHub repository with 11 stars would probably be almost ignored by an LLM.


👤 fragmede
OpenAI doesn't have a public submission form, but they do have a public framework for submitting evals on specific topics: https://github.com/openai/evals
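For a concrete sense of what an eval submission involves: the repo's basic "match" evals take a samples JSONL file where each line has an `input` chat transcript and an `ideal` answer. The sketch below builds such a file; the word pairs are placeholders, since the real dictionary.yaml structure isn't shown here.

```python
# Sketch of turning dictionary entries into an openai/evals "basic match"
# samples file. The JSONL schema ({"input": [...chat messages...], "ideal": ...})
# follows the evals repo's docs; the word pairs below are placeholders.
import json

# Hypothetical (english, kuku_yalanji) pairs pulled from the dictionary.
pairs = [
    ("water", "<kuku yalanji word for water>"),
    ("fire", "<kuku yalanji word for fire>"),
]

with open("samples.jsonl", "w") as f:
    for english, kuku_yalanji in pairs:
        sample = {
            "input": [
                {"role": "system", "content": "Translate the word into Kuku Yalanji."},
                {"role": "user", "content": english},
            ],
            "ideal": kuku_yalanji,
        }
        f.write(json.dumps(sample) + "\n")
```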

They use Common Crawl, so get your site in there.
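You can check whether your pages are already in a crawl via Common Crawl's public CDX index API. A minimal sketch, assuming a recent crawl ID (the current list of crawls is shown at https://index.commoncrawl.org/):

```python
# Sketch: check whether pages from the site already appear in a Common Crawl
# snapshot, using Common Crawl's public CDX index API. The crawl ID below is
# just an example; substitute a recent one from https://index.commoncrawl.org/.
import json
import urllib.error
import urllib.parse
import urllib.request

CRAWL_ID = "CC-MAIN-2024-33"  # example crawl ID
query = urllib.parse.urlencode({
    "url": "mobtranslate.com/dictionaries/*",
    "output": "json",
})
url = f"https://index.commoncrawl.org/{CRAWL_ID}-index?{query}"

try:
    with urllib.request.urlopen(url) as resp:
        # The API returns one JSON record per line for each captured URL.
        for line in resp:
            record = json.loads(line)
            print(record["url"], record.get("status"))
except urllib.error.HTTPError as e:
    # A 404 from the index typically means no captures matched the query.
    print("No captures found:", e.code)
```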