How to automatically generate a broad array of security tests, the most efficient code, or the most readable and extensible code:
- Use multi-shot prompting with something like Guardrails to re-prompt a commercial model until its output validates. [1] (A minimal retry loop in this style is sketched after the references.)
- Use a local model with a final layer that steers token selection towards syntactically valid tokens. [2]
[1] https://github.com/ShreyaR/guardrails
[2] "Structural Alignment: Modifying Transformers (like GPT) to Follow a JSON Schema" @ https://github.com/newhouseb/clownfish (full disclosure: this is my work)
If you want the LLM to generate or manipulate a particular data structure that isn't well represented in its training set, consider writing a translator: convert the structure into a format the LLM natively 'speaks', run the LLM on that, and then translate the result back into your DSL.
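For example, with a made-up 'key = value' DSL (the translator pair is the point, not the format):

```python
import json

def dsl_to_json(dsl: str) -> str:
    """Translate the toy DSL into JSON, a format the model speaks natively."""
    pairs = {}
    for line in dsl.strip().splitlines():
        key, _, value = line.partition("=")
        pairs[key.strip()] = value.strip()
    return json.dumps(pairs, indent=2)

def json_to_dsl(blob: str) -> str:
    """Translate the model's JSON answer back into the DSL."""
    return "\n".join(f"{k} = {v}" for k, v in json.loads(blob).items())

# Round trip: DSL -> JSON -> (hand to the LLM) -> JSON -> DSL.
config = "timeout = 30\nretries = 3"
as_json = dsl_to_json(config)   # prompt the LLM with this instead of raw DSL
print(json_to_dsl(as_json))     # translate its (edited) JSON back
```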
Pairing this with examples retrieved from some sort of vector store, as others have suggested, could work well.
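One way to wire up the retrieval, with a toy bag-of-letters embedding standing in for a real sentence-embedding model and plain cosine similarity standing in for a vector store:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in: bag-of-letters vector. Use a real embedding model in practice."""
    v = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            v[ord(ch) - ord("a")] += 1
    return v

# Stored (natural-language request, DSL snippet) example pairs.
EXAMPLES = [
    ("make the widget red", 'set widget.color = "red"'),
    ("double the request timeout", "set config.timeout = config.timeout * 2"),
]

def top_k(query: str, k: int = 2):
    """Return the k examples whose requests are most similar to the query."""
    q = embed(query)
    def score(pair):
        e = embed(pair[0])
        return float(np.dot(q, e) / (np.linalg.norm(q) * np.linalg.norm(e)))
    return sorted(EXAMPLES, key=score, reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Prepend the retrieved pairs so the model sees in-domain few-shot examples."""
    shots = "\n\n".join(f"Request: {r}\nDSL: {d}" for r, d in top_k(query))
    return f"{shots}\n\nRequest: {query}\nDSL:"

print(build_prompt("turn the widget blue"))
```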
I'd be super impressed if any other approach worked as well and would fall under the category of "easy". Keep us updated on what you go with!
Right now, we are playing around with the idea of using a classification layer to detect which schema elements are likely involved, and then dynamically including explanations for those elements in the final prompt.
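A rough sketch of that idea, with a keyword matcher standing in for the real classification layer (the schema names here are hypothetical):

```python
# Explanations for each schema element, keyed by name (hypothetical schema).
SCHEMA_DOCS = {
    "user": "user: an account with fields id, email, created_at",
    "order": "order: a purchase with fields id, user_id, total, line_items",
    "invoice": "invoice: a billing document linked to exactly one order",
}

def classify_elements(query: str) -> list[str]:
    """Stand-in for the classification layer: a trained multi-label
    classifier would go here instead of substring matching."""
    return [name for name in SCHEMA_DOCS if name in query.lower()]

def assemble_prompt(query: str) -> str:
    """Include explanations only for the flagged elements, instead of
    dumping the entire schema into every prompt."""
    docs = "\n".join(SCHEMA_DOCS[name] for name in classify_elements(query))
    return f"Relevant schema elements:\n{docs}\n\nTask: {query}"

print(assemble_prompt("total order value per user last month"))
```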
Our attempts at fine-tuning ended after about two weeks of struggling; I don't think it's viable for a certain range of domain-specific tasks.
See previous comment here: https://news.ycombinator.com/item?id=35447368
I’m still noodling on how to send a full-page screenshot to a model and get back the individual images (or their bounds) within the page.
https://github.com/neuml/txtai/blob/master/examples/33_Query...