How to prompt an LLM to infer an XSD from many XML documents?

Question

I've got thousands of XML documents in a format whose XSD is not published. I'd like to produce an XSD for it, and I am wondering if an LLM could help. I've tried a few online LLMs like Claude and Copilot, and the best they (or I) could do was to generate an XSD from a handful of XML files. While the XSD was more or less valid, it was far from capturing all cases of the underlying format, and failed on the very next XML document I tried.

I am ready to run a local LLM for this task, but can someone with more LLM experience than me (I have none) describe a good process to do so? And which LLM might be suited?

Thanks!

ksr · Accepted Answer

In case anyone is interested in the details, I am trying to infer the MuseScore MSCX format to ensure that my MusicXML => MuseScore generator at https://github.com/infojunkie/musicxml-mscx produces valid scores. I want to use all the supplied MSCX files in the repo https://github.com/musescore/MuseScore for inference.

codingdave · Answer

LLMs are good at the gist of accuracy, not actual accuracy. They could create new XML files that follow a schema, sure. But they won't even do that correctly 100% of the time, and definitely are not awesome at going the other direction. So just write a parser. It will be no more work, and more correct.
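
A minimal sketch of what such a parser could look like, assuming Python's standard `xml.etree.ElementTree`. The sample documents and element names are made up for illustration and are not taken from real MSCX files, and `collect_structure` is a hypothetical helper name. It only records which attributes and children each element has been seen with; turning that summary into an actual XSD is left out.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def new_summary():
    # One record per element name, merged across all documents.
    return {"attributes": set(), "children": set(), "has_text": False}


def collect_structure(xml_strings):
    """Merge the element structure observed across many XML documents."""
    summary = defaultdict(new_summary)
    for xml_text in xml_strings:
        root = ET.fromstring(xml_text)
        for element in root.iter():  # includes the root itself
            info = summary[element.tag]
            info["attributes"].update(element.attrib)  # adds attribute names
            info["children"].update(child.tag for child in element)
            if element.text and element.text.strip():
                info["has_text"] = True
    return dict(summary)


# Illustrative input only; real files would be read from disk.
docs = [
    '<Score><Part id="1"><Note>C4</Note></Part></Score>',
    '<Score version="4"><Part><Rest/></Part></Score>',
]
structure = collect_structure(docs)
for tag, info in sorted(structure.items()):
    print(tag, sorted(info["attributes"]), sorted(info["children"]), info["has_text"])
```

Because the summary is a union over every file, an element or attribute that appears in only one document out of thousands still ends up in the result, which is exactly the case a handful of samples misses.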

seabass-labrax · Answer

What I'd do is extract all the deepest-level content (leaf element text and attribute values) by conventional means (XSL or a standalone parser) and pass the values into an LLM one by one. It would be easier for an LLM to tell if a given string is an ISO-formatted date, for instance, than to identify the entire schema at once. You might not even need the LLM if you use type inference libraries and the schema isn't too exotic.
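
As a rough illustration of the no-LLM route, here is a sketch of leaf-value type inference using only the Python standard library. The function names `infer_type` and `infer_common_type` are hypothetical, the set of types is deliberately tiny, and the widening rule (integer and decimal merge to decimal, anything else falls back to string) is an assumption, not a standard algorithm.

```python
import re
from datetime import date

INTEGER = re.compile(r"[+-]?[0-9]+")
DECIMAL = re.compile(r"[+-]?([0-9]+(\.[0-9]*)?|\.[0-9]+)")
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def infer_type(value):
    """Guess an XSD simple type for a single leaf value."""
    text = value.strip()
    if text in ("true", "false"):
        return "xs:boolean"  # "1"/"0" are treated as integers here
    if INTEGER.fullmatch(text):
        return "xs:integer"
    if DECIMAL.fullmatch(text):
        return "xs:decimal"
    if ISO_DATE.fullmatch(text):
        try:
            date.fromisoformat(text)  # rejects impossible dates
            return "xs:date"
        except ValueError:
            pass
    return "xs:string"


def infer_common_type(values):
    """Pick one type that covers every value seen for an element or attribute."""
    types = {infer_type(v) for v in values}
    if not types:
        return "xs:string"
    if len(types) == 1:
        return types.pop()
    if types <= {"xs:integer", "xs:decimal"}:
        return "xs:decimal"
    return "xs:string"


print(infer_type("2024-02-29"))
print(infer_common_type(["1", "2.5"]))
print(infer_common_type(["1", "abc"]))
```

Values that come back as `xs:string` are the ones worth showing to an LLM or a human, since they may be enumerations or custom formats that simple patterns cannot classify.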
Having used the results so far to annotate the original elements and attributes with their types, you could then pass a generated, simplified XML document into the LLM. So where the original document has real data, you can start replacing it with simple data that conforms to the same structure and data type. If the LLM is still confused, try giving it just the structure which you've identified with no actual data within the elements and attributes, only type annotations.
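
The last step, a document with only structure and type annotations, could be sketched like this. `to_skeleton` and `guess_type` are hypothetical names, `guess_type` is a deliberately crude stand-in for whatever type inference you actually use, and the sample element is invented. Text that follows a child element (mixed content) is ignored in this sketch.

```python
import re
import xml.etree.ElementTree as ET


def guess_type(value):
    # Crude stand-in classifier: integers versus everything else.
    if re.fullmatch(r"[+-]?[0-9]+", value.strip()):
        return "xs:integer"
    return "xs:string"


def to_skeleton(element, classify=guess_type):
    """Copy an element tree, replacing every piece of data with its type name."""
    skeleton = ET.Element(element.tag)
    for name, value in element.attrib.items():
        skeleton.set(name, classify(value))
    if element.text and element.text.strip():
        skeleton.text = classify(element.text)
    for child in element:
        skeleton.append(to_skeleton(child, classify))
    return skeleton


# Illustrative input only.
original = ET.fromstring('<Note pitch="60"><name>C4</name></Note>')
skeleton_xml = ET.tostring(to_skeleton(original), encoding="unicode")
print(skeleton_xml)
```

The resulting skeleton is much shorter than the real data, so more distinct structures fit into a single prompt.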
TL;DR: a depth-first approach and then building up from there will work better than giving everything to an LLM all at once. They are only clever thematic Markov chains after all.