HACKER Q&A
📣 kstenerud

Please Review My Metalanguage


Please Review My Metalanguage


  👤 kstenerud Accepted Answer ✓
Hi all!

While developing a text and binary data format (https://concise-encoding.org/), I ran into trouble building a formal description for them. Many formats use EBNF or ABNF, or just embed their own BNF-style metalanguage in the spec itself (I started taking cues from the XML spec).

But describing a binary format is deceptively complex, and eventually I had to split out the metalanguage. KBNF is the result.

I want it to be useful for other people, so I've decided to release it standalone here: https://github.com/kstenerud/kbnf/blob/master/kbnf_v1.md

I'd really appreciate some proofreading, as right now the only way I'm able to effectively find issues is to let it sit for a couple of weeks so that I can look at it from a fresh viewpoint, and even then I'm not confident that I'm discovering as much as multiple eyes could.

Cheers!


👤 fjfaase
A long time ago, I started working on something similar. There are some binary editors that support writing grammars. I worked on reverse engineering several binary file formats. When working on Quark Express file format, I discovered you do need two layers of grammars, as the file contains a header followed by allocated blocks, where the bocks are linked (by indices) and chains of linked blocks contain streams of data. I remember that there are also file formats, that simply have pointers to certain sections of the data contained in the file. I once implemented a memory mapped file data storage where all pointers to allocated blocks where represented by an offset from the start of the file. (I later, came up with a better idea and that is to make them relative to the location of the pointer, such that you do need to know the offset at which the memory mapped file is placed in memory. So basically, there are lots of 'pointers' in the file and you can only read the data in the file following those pointers.)

Links: https://www.iwriteiam.nl/Ha_BFF.html https://www.iwriteiam.nl/Ha_HTCABFF.html https://www.iwriteiam.nl/D0205.html#13MMF


👤 HelloNurse
You are conflating, in a typical model/metamodel confusion, numbers in the data structures you are describing, with their machine representations that you are modeling decently, and numbers in your language, which should probably be just integers without complications.

The former are are patterns in the data, e.g.

rpm = float(32, -1000~1000);

which specifies a subset of the 2^32 floating point values, and the latter are actual numbers, e.g.

identifier = 'a'~'z'{5~8};

which shouldn't worry about rounding, signed zero, NaN values etc.

What arithmetic do you expect to need, in practice, at each of the two levels?

> real: any value from the set of reals, including qnan and snan unless otherwise specified

You mean floating point, not actual real numbers. Then, even within the IEEE-754 options, you need to specify what variant of floating point.

Note that floating point constants by themselves have ambiguous type: 1.3e5 can be float(32...), float(64...), etc.

> Note: Calculations can produce a quiet NaN value under certain conditions in accordiance with the IEEE 754 specification. If different processing is required (such as traps or exceptions), this must be documented in your specification.

This means specifying what to do with NaN results and similar errors: a heavy burden for the user.

> unsigned: limited to positive integers and 0 > signed: limited to positive and negative integers, and 0 (but excluding -0)

These are floating point values pretending to be integers. You should have actual unlimited precision integers with constraints like explicit maximum and minimum values.

Moreover, "-0" is a IEEE-754 technical detail that has no place in a formal, abstract language.


👤 porcoda
You might be interested in Guy Steele’s talk “It’s time for a new old language”. He talks about similar notations and the problem of inventing new ones (although most of his talk is about things other than BNF-like notations : he just talks about them for a bit in the beginning). It’s an interesting talk and is useful in thinking about proposing new notations.

YouTube: https://youtu.be/7HKbjYqqPPQ

If you search for the talk title you can find the PDF of the slides.


👤 yardstick
Small suggestion: put an example at the start, maybe a reference to the IP spec, or maybe something smaller if you’ve got one.

It’s always nice when I can see upfront at a glance what the language looks like, rather than first going through all the grammar spec.


👤 Expert-007
In creating and enhancing XTRAN, my Expert System that knows over 40 computer languages, I got tired of hard-coding parsers, so I created a parsing engine that takes EBNF (specified in more friendly syntax I created), parses the EBNF to an internal format, and then when asked to parse a language, executes its EBNF in order to parse. I would be interested to know if anyone else is executing EBNF to parse.

Since code must come back out after XTRAN rules have created / changed / translated it, I also created a rendering engine that is responsible for rendering XTRAN's internal format of language content out as text source code, complete with extensive styling controls. The rendering engine is also driven by EBNF, which it executes in order to render code content to text source code.

If you're interested in learning more, see WWW.XTRAN-LLC.com. I'll be glad to answer any questions.


👤 JaumeGreen
EDIT: I misunderstood how it works, read the reply to see why I was in the wrong.

I haven't read it all (and probably won't, these kinds of works are not my forte), but there's something that doesn't look that nice.

The swapped function seems strange, mostly because you treat 1 as a special case which reverses the content.

I would either have a reverse function or use negative numbers to reverse those chunks, like this:

    uint(16,0xc01f) matches big endian 0xc01f (bit sequence 1,1,0,0,0,0,0,0,0,0,0,1,1,1,1,1).
    swapped(8, uint(16,0xc01f)) (bit sequence 0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0).
    swapped(-16, uint(16,0xc01f)) (bit sequence 1,1,1,1,1,0,0,0,0,0,0,0,0,0,1,1).
    swapped(-8, uint(16,0xc01f)) (bit sequence 1,1,1,1,1,0,0,0, 0,0,0,0,0,0,1,1).
    swapped(-4, uint(16,0xc01f)) (bit sequence 0,0,0,0, 1,1,0,0, 1,1,1,1, 1,0,0,0).
There might be better solutions, I just dislike exceptions in rules, if it can be helped.

👤 UncleEntity
Not really a fan of using ‘&’ as the concatenation operator instead of the traditional EBNF way. The examples show both ways with the TOKEN_SEP (? space I guess) not seeming to need it.

Also the string rules look like they don’t allow an empty string, maybe intentionally as that’s probably a bug in someone’s grammar.

About as far as I got on a quick glance, got distracted by the mathematical operators and other things like concatenation sharing the same tokens and started thinking how hard it would be to parse it.


👤 anonymoushn
If you would like to include unicode support rather than supporting bytes and punting unicode to a library or the end user, it may be good to have opinions about unicode normalization.

👤 napsterbr
FYI there's a copy-pasta/typo at the very last section. You (probably) meant to type `digit_hex` but typed `digit_bin`.

👤 hyperhello
Does the format take left recursion into account or leave it undefined?

👤 aliqot
needs syntax above the fold