HACKER Q&A
📣 brundolf

Do you really need a lexer/tokenizer?


Every parser/interpreter/compiler tutorial has you start with a tokenizer. Intuitively it makes sense to slice off this piece of work up front.

But in practice, in my hobby projects, I've found it trivially easy to skip this step and operate directly on characters in the recursive-descent pass. Doing so makes the code simpler, and you avoid repeating work (for example, identifying the bounds of a number literal in one pass and then fully parsing it in another).
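For concreteness, here's a minimal sketch (in Python; the names are made up, not from any real project) of what this looks like for a decimal integer literal. Finding the literal's bounds and computing its value happen in a single pass over the characters:

    # Scannerless parsing: the parser consumes raw characters directly.
    def parse_number(src: str, pos: int) -> tuple[int, int]:
        """Parse a decimal integer starting at `pos`; return (value, new_pos)."""
        if pos >= len(src) or not src[pos].isdigit():
            raise SyntaxError(f"expected digit at position {pos}")
        value = 0
        while pos < len(src) and src[pos].isdigit():
            # The literal's end is found and its value built in the same loop.
            value = value * 10 + (ord(src[pos]) - ord("0"))
            pos += 1
        return value, pos

    print(parse_number("42+1", 0))  # (42, 2)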

Is it more necessary in full-scale compilers? Is it an old-fashioned practice that people don't really do anymore? Or am I missing something?


  👤 jasonhansel Accepted Answer ✓
IMHO one of the benefits of having a separate tokenization step is error handling. It can be easier to generate good error messages if you've already grouped the input into tokens (e.g. "unexpected identifier" instead of "unexpected letter 'i'").
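A rough sketch of the idea (Python; the Token shape and names here are just illustrative, not from any particular implementation):

    from dataclasses import dataclass

    @dataclass
    class Token:
        kind: str   # e.g. "identifier", "number", "lparen"
        text: str
        pos: int    # offset in the source where the token starts

    def expect(tokens: list[Token], i: int, kind: str) -> Token:
        """Consume a token of the given kind or raise a token-level error."""
        tok = tokens[i]
        if tok.kind != kind:
            # With whole tokens in hand, the message can name the token
            # ("unexpected identifier 'if'") rather than a single character.
            raise SyntaxError(
                f"expected {kind}, found {tok.kind} {tok.text!r} at position {tok.pos}"
            )
        return tok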

There can also be performance advantages, at least in some cases: a parser that looks ahead or backtracks can re-read buffered tokens instead of re-scanning raw characters, and a dedicated lexing loop is easy to keep tight.


👤 bjourne
No, you don't. The common wisdom is that a separate lexer/parser setup is easier to extend as your project grows. But only you can say whether that bears out in practice for your particular project.