Some ideas that pull me to thinking about this:
- LLMs build a world model and are not 'just' parrots
- The Buddha spoke and people wrote it down, maybe the 'original' world model can be extracted from the writing?
- The Pali Canon has 15-17k pages, which sounds like a lot but is a tiny dataset compared to how much data is used to train models these days
- There's orders of magnitude more commentary that could be used, but would it dilute the 'original' world model?
- Say we dump in everything ever written about Buddhism: there's probably a lot of redundancy, but maybe enough material to get 'enough' data to train on
e.g. there's https://chat.openai.com/g/g-WxckXARTP-astrology-birth-chart-... for astrology
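To make the "small dataset" point concrete, here's a back-of-envelope token estimate. The 15-17k page count is from the comment above; the words-per-page and tokens-per-word figures, and the ~15 trillion tokens for a modern frontier pretraining run, are assumptions for the sake of the estimate:

```python
# Rough token estimate for the Pali Canon as a training corpus.
# All per-page and per-word figures below are assumptions, not measurements.
pages_low, pages_high = 15_000, 17_000
words_per_page = 400           # assumed average for a printed translation
tokens_per_word = 1.3          # rough English-tokenizer average, assumed

canon_tokens_low = pages_low * words_per_page * tokens_per_word
canon_tokens_high = pages_high * words_per_page * tokens_per_word
frontier_tokens = 15e12        # assumed order of magnitude for recent pretraining runs

print(f"Canon: ~{canon_tokens_low / 1e6:.1f}-{canon_tokens_high / 1e6:.1f}M tokens")
print(f"Fraction of a frontier run: ~{canon_tokens_high / frontier_tokens:.1e}")
```

Under these assumptions the whole canon comes to under 10M tokens, i.e. roughly a millionth of a frontier pretraining run, so it's fine-tuning material at best, not a pretraining corpus.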
I think the problem is that the 15-17k pages of the Pali Canon are already hopelessly sectarian. You won't get ChatBuddha, you'll get ChatTheravada.
As an aside, this answers the oft-posed question of how Ananda could possibly have recited the entire canon after Master Gotama's death: there was a lot less of it then.
More useful is to discard the sectarian suttas, read the rest and do what they say. In my experience it is highly effective. When you realize you must be "a lamp unto yourself" then you're almost there.