My use case is I have a DSL with a custom parser and interpreter. The DSL is essentially a programming language and is proving too slow (in terms of latency). The bottleneck is in the interpreter. I want to replace the interpreter with a JIT without having to deal with assembly code generation myself.
Preferably in Rust and/or Rust bindings. Preferably lightweight (small object code footprint). Preferably cross-arch (x86, arm, arm64).
You'll have to rewrite your parser/interpreter in Truffle, but you get everything else "for free".
Not in Rust. I wouldn't call it at all lightweight. It is cross-arch in the sense that the Graal JVM is cross-arch, which may or may not be sufficient for your purposes.
[0] https://www.graalvm.org/22.0/graalvm-as-a-platform/language-...
However, LLVM is extremely heavyweight. Which "latency" did you mean? Are you going to run these functions 1M times, so that the quality of the generated code is paramount (and you can afford really long compile time) or do you care more about "I hit enter and get the answer"? You can tune LLVM (disable almost all passes, use fast instruction selection) but it's really not focused on millisecond-ish compilation.
There are a lot of "simple JIT" libraries for the latter case (you just want to feed in a simple IR and get machine code out, do an okay job at register allocation, but nothing fancy). None of them has "won" and most only have C bindings (to my knowledge).
(For what it's worth I'm a very minor contributor.)
* https://en.wikipedia.org/wiki/GNU_lightning
* https://www.gnu.org/software/lightning/
If you want a more sophisticated JIT engine, others have already mentioned libgccjit and LLVM (heavyweight compiler solutions), as well as Cranelift and Mir (more lightweight).
Of these, only Cranelift is written in Rust.
I have a question for the experts though: the principle of meta-tracing suggests you might be able to write your guest language in Python. Is that currently possible with RPython/PyPy?
[edit: Changed "host" to "guest"; Python to Python2]
That said, Cranelift is still experimental.
A very useful feature is that you can write C or C++ and run it through LLVM to see the IR it generates, and adapt it to your needs. You can even do it in Godbolt.
If your generated code is crunching over large amounts of data, an alternative to a JIT is to make the interpreter implicitly parallel. So each interpreter dispatch operation does N (say, 16) parallel operations, effectively cutting interpreter overhead by N. It works if there isn't data-dependent branching, which is often true for numerical operations.
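The implicitly parallel interpreter idea above can be sketched in a few lines of Rust. This is a minimal illustration with hypothetical opcodes (`Add`, `Mul`): each dispatch applies one opcode to N lanes at once, so the per-opcode dispatch cost is amortized over N elements.

```rust
// Minimal sketch of an implicitly parallel interpreter: one dispatch
// per opcode covers N data lanes, amortizing interpreter overhead.
const N: usize = 16;

enum Op {
    Add, // acc[i] += operand[i] for all lanes
    Mul, // acc[i] *= operand[i] for all lanes
}

fn run(program: &[Op], acc: &mut [f64; N], operand: &[f64; N]) {
    for op in program {
        // A single match (dispatch) here serves all N lanes.
        match op {
            Op::Add => {
                for i in 0..N {
                    acc[i] += operand[i];
                }
            }
            Op::Mul => {
                for i in 0..N {
                    acc[i] *= operand[i];
                }
            }
        }
    }
}

fn main() {
    let mut acc = [1.0; N];
    let operand = [2.0; N];
    run(&[Op::Add, Op::Mul], &mut acc, &operand);
    // (1 + 2) * 2 = 6 in every lane
    assert!(acc.iter().all(|&x| x == 6.0));
}
```

The inner loops are also easy for the compiler to auto-vectorize, which is a bonus on top of the reduced dispatch count.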
- QBE [1] - small compiler backend with nice IL
- DynASM [2] - IIUC LuaJIT's backend, which can be and is used by other languages
- uBPF - Userspace eBPF VM. Depending on your DSL, the eBPF toolchain could fit your use case, but this would probably be the biggest excursion. There is a basic assembler in Python.
[1] https://c9x.me/compile/
[2] https://luajit.org/dynasm.html
https://www.gnu.org/software/libjit/
I've used that in the past to speed up a toy interpreter, but of course it is in C, rather than Rust.
There is at least one binding for it in Rust:
https://github.com/MonliH/jit-sys
Finally, here's a good introduction covering several approaches to JITs:
I’ve heard good things about cranelift and I believe it’s sort of meant to fulfill the same role as B3. Might be worth checking out.
Most likely though, you should start by writing a template JIT before you try to optimize. WebKit's "assembler" and "jit" directories will show you how, and you can probably extract most of the relevant code, as it's not WebKit-specific; in particular the cross-platform machine code gen.
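For reference, the core of a template JIT is just concatenating pre-assembled machine-code fragments, one per bytecode op. A rough sketch, with illustrative x86-64 byte sequences and a hypothetical three-op bytecode; a real JIT would also copy the buffer into executable (mmap'd) memory before calling it, which is omitted here:

```rust
// Template-JIT sketch: each bytecode op maps to a canned machine-code
// fragment, and "compilation" is pasting fragments into a buffer.
enum Op {
    LoadArg, // mov rax, rdi  -- move the first argument into the accumulator
    Inc,     // inc rax
    Ret,     // ret
}

fn template(op: &Op) -> &'static [u8] {
    match op {
        Op::LoadArg => &[0x48, 0x89, 0xF8], // mov rax, rdi
        Op::Inc => &[0x48, 0xFF, 0xC0],     // inc rax
        Op::Ret => &[0xC3],                 // ret
    }
}

fn emit(program: &[Op]) -> Vec<u8> {
    let mut code = Vec::new();
    for op in program {
        code.extend_from_slice(template(op));
    }
    code
}

fn main() {
    // "Compiles" f(x) = x + 1 by concatenating three templates.
    let code = emit(&[Op::LoadArg, Op::Inc, Op::Ret]);
    assert_eq!(code, vec![0x48, 0x89, 0xF8, 0x48, 0xFF, 0xC0, 0xC3]);
}
```

The cross-arch part is then a matter of swapping the template table per target, which is essentially what WebKit's macro-assembler layer abstracts for you.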
Lastly I would advise against trying to reuse a C compiler backend like llvm unless your language is very close to C.
If your DSL is dynamically typed I recommend LuaJIT; the bytecode is lean and documented (not as good as CIL though). LuaJIT also works well with statically typed languages, but Mono is faster in that case. Even though it was originally built for Lua, any compiler can generate LuaJIT bytecode.
Both approaches are lean (Mono about 8 MB, LuaJIT about 1 MB, much leaner and less complex than e.g. LLVM), general purpose, available on many platforms (especially the ones you're mentioning) and work well (see e.g. https://github.com/rochus-keller/Oberon/ and https://github.com/rochus-keller/Som/).
https://blog.cloudflare.com/building-fast-interpreters-in-ru...
https://ndmitchell.com/downloads/slides-cheaply_writing_a_fa...
P.S.: A simple trick I apply in mine is to inline the looping for equivalents of folds/maps/filters like `[1, 2, 3] + 1`. You can do that calculation directly inside Rust, and even eliminate all interpretation if you allow for specialization on the AST, i.e. Ast.Map(Fn()->Ast, Vec
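That specialization trick can be sketched as follows. The `MapAddConst` node here is a hypothetical stand-in for a specialized form of the generic `Map` node mentioned above: instead of re-entering the interpreter per element with a closure, the common "add a constant to a list" shape runs as one native Rust loop.

```rust
// Hypothetical AST sketch: specializing a generic Map node into a
// dedicated "add constant to every element" node removes per-element
// interpreter dispatch entirely.
enum Ast {
    List(Vec<i64>),
    MapAddConst(Box<Ast>, i64), // specialized form of Map(|x| x + k, list)
}

fn eval(ast: &Ast) -> Vec<i64> {
    match ast {
        Ast::List(v) => v.clone(),
        Ast::MapAddConst(inner, k) => {
            // The whole map runs inside Rust; no interpreter round-trips.
            eval(inner).into_iter().map(|x| x + *k).collect()
        }
    }
}

fn main() {
    // Equivalent of the DSL expression `[1, 2, 3] + 1`.
    let ast = Ast::MapAddConst(Box::new(Ast::List(vec![1, 2, 3])), 1);
    assert_eq!(eval(&ast), vec![2, 3, 4]);
}
```

A pass over the AST can rewrite matching generic `Map` nodes into this form before execution, so the interpreter only pays dispatch cost for shapes it couldn't specialize.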
Although I might recommend interpreter optimizations before you go straight to machine code. While writing a just-in-time compiler for your DSL will remove interpretation overhead in software and in hardware, you will probably want to have more type information so that you can generate better code. Check out my PL resources page, which has multiple sections on runtime optimization: https://bernsteinbear.com/pl-resources/
Happy to chat, if you like.