While developing a text and binary data format (https://concise-encoding.org/), I ran into trouble building a formal description for them. Many formats use EBNF or ABNF, or just embed their own BNF-style metalanguage in the spec itself (I started taking cues from the XML spec).
But describing a binary format is deceptively complex, and eventually I had to split out the metalanguage. KBNF is the result.
I want it to be useful for other people, so I've decided to release it standalone here: https://github.com/kstenerud/kbnf/blob/master/kbnf_v1.md
I'd really appreciate some proofreading, as right now the only way I'm able to effectively find issues is to let it sit for a couple of weeks so that I can look at it from a fresh viewpoint, and even then I'm not confident that I'm discovering as much as multiple eyes could.
Cheers!
Links: https://www.iwriteiam.nl/Ha_BFF.html https://www.iwriteiam.nl/Ha_HTCABFF.html https://www.iwriteiam.nl/D0205.html#13MMF
The former are patterns in the data, e.g.
rpm = float(32, -1000~1000);
which specifies a subset of the 2^32 floating point values, and the latter are actual numbers, e.g.
identifier = 'a'~'z'{5~8};
which shouldn't worry about rounding, signed zero, NaN values etc.
What arithmetic do you expect to need, in practice, at each of the two levels?
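To make the first level concrete, here is a minimal Python sketch of what a rule like float(32, -1000~1000) would have to do when treated as a pattern over bit sequences. The function name, the big-endian byte order, the inclusive bounds and the decision to reject NaN are all my assumptions, not something taken from the KBNF spec.

```python
import math
import struct


def matches_float32_range(raw: bytes, lo: float, hi: float) -> bool:
    """Hypothetical matcher: does a 4-byte pattern decode to a float32 in [lo, hi]?"""
    if len(raw) != 4:
        return False
    # Assumption: big-endian IEEE 754 binary32 encoding.
    (value,) = struct.unpack(">f", raw)
    # Assumption: NaN is never inside a numeric range.
    if math.isnan(value):
        return False
    return lo <= value <= hi


in_range = matches_float32_range(struct.pack(">f", 500.0), -1000.0, 1000.0)
out_of_range = matches_float32_range(struct.pack(">f", 1500.0), -1000.0, 1000.0)
```

Note that the only arithmetic needed here is comparison against the bounds; the repetition count in 'a'~'z'{5~8} is plain integer arithmetic and never touches this code path.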
> real: any value from the set of reals, including qnan and snan unless otherwise specified
You mean floating point, not actual real numbers. Then, even within the IEEE-754 options, you need to specify what variant of floating point.
Note that floating point constants by themselves have ambiguous type: 1.3e5 can be float(32...), float(64...), etc.
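A quick way to see the ambiguity is to encode the same constant at both widths. This is only an illustrative Python sketch using the standard struct module; big-endian byte order is my assumption.

```python
import struct

constant = 1.3e5

# The same literal produces different bit patterns depending on the width chosen.
as_float32 = struct.pack(">f", constant)  # 4 bytes, IEEE 754 binary32
as_float64 = struct.pack(">d", constant)  # 8 bytes, IEEE 754 binary64

print(as_float32.hex())
print(as_float64.hex())
```

Without a declared width, a grammar cannot say which of the two byte sequences the constant is supposed to match.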
> Note: Calculations can produce a quiet NaN value under certain conditions in accordance with the IEEE 754 specification. If different processing is required (such as traps or exceptions), this must be documented in your specification.
This means specifying what to do with NaN results and similar errors: a heavy burden for the user.
> unsigned: limited to positive integers and 0
> signed: limited to positive and negative integers, and 0 (but excluding -0)
These are floating point values pretending to be integers. You should have actual unlimited precision integers with constraints like explicit maximum and minimum values.
Moreover, "-0" is an IEEE-754 technical detail that has no place in a formal, abstract language.
YouTube: https://youtu.be/7HKbjYqqPPQ
If you search for the talk title you can find the PDF of the slides.
It’s always nice when I can see upfront at a glance what the language looks like, rather than first going through all the grammar spec.
Since code must come back out after XTRAN rules have created / changed / translated it, I also created a rendering engine that is responsible for rendering XTRAN's internal format of language content out as text source code, complete with extensive styling controls. The rendering engine is also driven by EBNF, which it executes in order to render code content to text source code.
If you're interested in learning more, see WWW.XTRAN-LLC.com. I'll be glad to answer any questions.
I haven't read it all (and probably won't, these kinds of works are not my forte), but there's something that doesn't look that nice.
The swapped function seems strange, mostly because you treat 1 as a special case which reverses the content.
I would either have a reverse function or use negative numbers to reverse those chunks, like this:
uint(16,0xc01f) matches big endian 0xc01f (bit sequence 1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,1).
swapped(8, uint(16,0xc01f)) (bit sequence 0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0).
swapped(-16, uint(16,0xc01f)) (bit sequence 1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1).
swapped(-8, uint(16,0xc01f)) (bit sequence 1,1,1,1,1,0,0,0, 0,0,0,0,0,0,1,1).
swapped(-4, uint(16,0xc01f)) (bit sequence 0,0,0,0, 1,1,0,0, 1,1,1,1, 1,0,0,0).
There might be better solutions, I just dislike exceptions in rules, if it can be helped.
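For comparison, here is a minimal Python sketch of one reading of swapped with a positive granularity: split the bit sequence into chunks of that size and reverse the order of the chunks. This is my interpretation, not the spec's definition, and the helper names are made up. It does not attempt the negative-granularity variants.

```python
from __future__ import annotations


def to_bits(value: int, width: int) -> list[int]:
    """Big-endian bit sequence of an unsigned integer, most significant bit first."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]


def swapped(granularity: int, bits: list[int]) -> list[int]:
    """Assumed semantics: reverse the order of granularity-sized chunks."""
    if granularity <= 0 or len(bits) % granularity != 0:
        raise ValueError("granularity must be positive and divide the bit count")
    chunks = [bits[i:i + granularity] for i in range(0, len(bits), granularity)]
    return [bit for chunk in reversed(chunks) for bit in chunk]


bits = to_bits(0xC01F, 16)
byte_swapped = swapped(8, bits)   # order of the two bytes reversed
bit_reversed = swapped(1, bits)   # order of the sixteen 1-bit chunks reversed
```

Under this reading a granularity of 1 needs no separate rule, since reversing the order of 1-bit chunks is the same operation as reversing all the bits.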
Also the string rules look like they don’t allow an empty string, maybe intentionally as that’s probably a bug in someone’s grammar.
That's about as far as I got on a quick glance; I got distracted by the mathematical operators and other things like concatenation sharing the same tokens, and started thinking about how hard it would be to parse.