HACKER Q&A
📣 thomasfromcdnjs

Are there any movements to create the equivalent of a sitemap for LLMs?


I've been working on an indigenous language project for a while now -> https://github.com/australia/mobtranslate-server/blob/master/dictionaries/kuku_yalanji/dictionary.yaml

It currently fits inside the token limit of gpt-4o (120k tokens), so I am able to prompt inject it and make a translator-like bot that has amazing results -> https://i.imgur.com/UuEQPUA.png (this is above and beyond good enough for the project goals)

The problem is that I'm paying for 100k+ tokens per translation.
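Roughly, the setup is the pattern sketched below. This is a minimal sketch assuming the official OpenAI Python SDK; the file path and prompt wording are illustrative, not the project's actual code. Every call re-sends the whole dictionary, which is where the 100k+ tokens per translation go.

```python
# Minimal sketch of the "prompt inject the whole dictionary" approach described
# above. Assumes the official OpenAI Python SDK (`pip install openai`); the file
# path and prompt wording are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("dictionaries/kuku_yalanji/dictionary.yaml") as f:
    dictionary_yaml = f.read()  # ~100k+ tokens, re-sent on every request

def translate(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are a Kuku Yalanji translator. "
                        "Use only this dictionary:\n" + dictionary_yaml},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

print(translate("Where is the river?"))
```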

Other than the Github YAML version of the dict, I have a publicly indexed (Google) html version of it -> https://mobtranslate.com/dictionaries/kuku_yalanji

But obviously, if the models had been trained on this data, they would already have intrinsic knowledge of it and I would have to prompt inject a lot less.

I know that models have a training cutoff with each iteration, but is there a way to ensure that you are crawled before the next one? (I'm asking in the context of OpenAI, but I'm curious about answers for other models too.)

Essentially, is there an equivalent of Google Webmaster Tools where I can submit a sitemap, check the progress of crawls, or submit individual pages?

If there isn't, are there any movements to create such a resource?


👤 altdataseller Accepted Answer ✓
You can ensure you are crawled by simply not blocking the crawlers in your robots.txt. They'll find you eventually, unless you never let anyone know about your website.
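For reference, a permissive robots.txt might look like the sketch below. GPTBot is OpenAI's documented crawler and CCBot is Common Crawl's; check each vendor's docs for the user-agent strings they currently use, and note the sitemap URL here is hypothetical.

```
# Sketch of a permissive robots.txt.
# GPTBot = OpenAI's crawler, CCBot = Common Crawl's crawler (per their docs).
User-agent: GPTBot
Allow: /

User-agent: CCBot
Allow: /

# Hypothetical sitemap location for the site above.
Sitemap: https://mobtranslate.com/sitemap.xml
```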

👤 sabbaticaldev
The next step would be LEO: LLM engine optimization. Even if the LLM crawls your content, it will be ranked low, so hallucinations about your data will be high.

A GitHub repository with 11 stars would probably be almost ignored by an LLM.


👤 fragmede
OpenAI doesn't have a public submission form, but they do have a public framework for submitting evals on specific topics: https://github.com/openai/evals
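For a concrete sense of what an eval submission involves: the repo's basic "match" evals take a samples JSONL file where each line has an `input` chat transcript and an `ideal` answer. The sketch below builds such a file; the word pairs are placeholders, since the real dictionary.yaml structure isn't shown here.

```python
# Sketch of turning dictionary entries into an openai/evals "basic match"
# samples file. The JSONL schema ({"input": [...chat messages...], "ideal": ...})
# follows the evals repo's docs; the word pairs below are placeholders.
import json

# Hypothetical (english, kuku_yalanji) pairs pulled from the dictionary.
pairs = [
    ("water", "<kuku yalanji word for water>"),
    ("fire", "<kuku yalanji word for fire>"),
]

with open("samples.jsonl", "w") as f:
    for english, kuku_yalanji in pairs:
        sample = {
            "input": [
                {"role": "system", "content": "Translate the word into Kuku Yalanji."},
                {"role": "user", "content": english},
            ],
            "ideal": kuku_yalanji,
        }
        f.write(json.dumps(sample) + "\n")
```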

They use Common Crawl, so get your site in there.
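You can check whether your pages are already in a crawl via Common Crawl's public CDX index API. A minimal sketch, assuming a recent crawl ID (the current list of crawls is shown at https://index.commoncrawl.org/):

```python
# Sketch: check whether pages from the site already appear in a Common Crawl
# snapshot, using Common Crawl's public CDX index API. The crawl ID below is
# just an example; substitute a recent one from https://index.commoncrawl.org/.
import json
import urllib.error
import urllib.parse
import urllib.request

CRAWL_ID = "CC-MAIN-2024-33"  # example crawl ID
query = urllib.parse.urlencode({
    "url": "mobtranslate.com/dictionaries/*",
    "output": "json",
})
url = f"https://index.commoncrawl.org/{CRAWL_ID}-index?{query}"

try:
    with urllib.request.urlopen(url) as resp:
        # The API returns one JSON record per line for each captured URL.
        for line in resp:
            record = json.loads(line)
            print(record["url"], record.get("status"))
except urllib.error.HTTPError as e:
    # A 404 from the index typically means no captures matched the query.
    print("No captures found:", e.code)
```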