Rolling your own languages and frameworks

Published by marco on

The article So You Want To Write Your Own Language? by Walter Bright (Dr. Dobb’s) contains a lot of interesting information, related not only to parsing but also to runtime and framework design. Bright is well known as the designer of the D programming language, so he’s definitely worth a read.

I thought he jumped back and forth between topics a bit, so I summarized the contents for myself below:

Parsing

Bright identifies minimizing keystrokes, easy parsing and minimizing the number of keywords as false gods. Do not waste any time trying to satisfy these requirements; instead, let them flow naturally from a good design.

Your language should consist of productions that have only a single non-terminal on the left-hand side. That is, strive to make your language context-free.[1] The implication is that you’re actually going to define the grammar rather than just winging it. Once you have a defined grammar, you can use a parser generator, even though Bright says not to “bother wasting time with lexer or parser generators and other so-called ‘compiler compilers.’”

I instead agree with the article Advice on writing a programming language by Ted Kaminski (Generic Language), which advises providing a grammar that can be used with parser generators because “many of those people eager to contribute either get stuck trying and failing to build a parser or trying and failing to learn to use the daunting internals of your compiler”.

You can either make it easy for people to build compilers for your language or you can maintain a very friendly API for your own compiler. If you choose the API route, it might force you to be more disciplined, but it might also cause you no end of backwards-compatibility headaches as your compiler quickly evolves. Not only that, but you’d then have to make that API available for any number of languages and any number of platforms.

If you take the route of publishing the BNF, that may still not be enough, because it can be daunting to convert a BNF grammar to something that a parser generator can use, especially for non-trivial languages. Providing a grammar for a widely supported parser generator like ANTLR [2] will give those willing to build tools for your language a good jump-start.

“Use an LR parser generator. It’ll keep your language parsable, and make it easier to change early on. When your language becomes popular enough that you have the problem of parsing errors not being friendly enough for your users, celebrate that fact and hand-roll only then.

“And then reap the benefit of all the tooling an easily parsed language gets, since you know you kept it LR(1).”
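As a rough illustration of the “single non-terminal on the left-hand side” rule from the start of this section, here is a minimal Python sketch. The grammar representation (tuples of symbols), the convention that non-terminals are capitalized, and the function names are all assumptions made for this example; none of them come from Bright’s or Kaminski’s articles.

```python
# Hypothetical representation: a production is a pair (lhs, rhs),
# where both sides are tuples of symbol names.
# Assumed convention: non-terminals start with an uppercase letter.

def is_nonterminal(symbol):
    return symbol[:1].isupper()

def is_context_free(productions):
    """True if every production has exactly one non-terminal on its left-hand side."""
    return all(
        len(lhs) == 1 and is_nonterminal(lhs[0])
        for lhs, _rhs in productions
    )

# A small expression grammar: every left-hand side is a single non-terminal.
expr_grammar = [
    (("Expr",), ("Expr", "+", "Term")),
    (("Expr",), ("Term",)),
    (("Term",), ("number",)),
]

# A production with two symbols on the left depends on context,
# so this grammar is not in context-free form.
context_dependent_grammar = [
    (("A", "B"), ("B", "A")),
]

print(is_context_free(expr_grammar))
print(is_context_free(context_dependent_grammar))
```

A check like this only inspects the shape of the productions. It says nothing about whether the grammar is also LR(1), which is what a parser generator would verify for you.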

Error-handling

Introduce redundancy into the language definition (e.g. semicolons as line-terminators in addition to whitespace/newlines) in order to make error-message generation much easier and much more likely to produce friendly output.
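One way to see why the redundancy helps is a toy parser that uses the terminator as a resynchronization point. The sketch below is only an assumption-laden illustration in Python: the statement form (NAME = NUMBER ;), the pre-split token list, and the function name are invented for the example and are not taken from Bright’s article.

```python
def parse_assignments(tokens):
    """Check statements of the form  NAME = NUMBER ;  and collect error messages."""
    errors = []
    pos = 0
    while pos < len(tokens):
        stmt_start = pos
        # The redundant terminator tells us where the statement ends,
        # even when the statement itself is malformed.
        try:
            end = tokens.index(";", pos)
        except ValueError:
            errors.append(f"statement starting at token {stmt_start}: missing ';'")
            break
        body = tokens[pos:end]
        well_formed = (
            len(body) == 3
            and body[0].isidentifier()
            and body[1] == "="
            and body[2].isdigit()
        )
        if not well_formed:
            errors.append(
                f"statement starting at token {stmt_start}: "
                f"expected NAME = NUMBER, got {' '.join(body)!r}"
            )
        # Resume right after the terminator, so one bad statement
        # does not disturb the statements that follow it.
        pos = end + 1
    return errors

tokens = ["a", "=", "1", ";", "b", "=", ";", "c", "=", "3", ";"]
errors = parse_assignments(tokens)
for message in errors:
    print(message)
```

Here the malformed second statement produces a single, localized message, and the third statement is still checked normally because the parser knows exactly where to pick up again.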

Compilers can handle error messages in different ways:

Bail out on the first error

This is a good fallback, but it saves the developer a lot of work if the compiler instead identifies all of the root errors in the source, that is, errors that are not a consequence of another error.

Collect multiple errors

In order to continue parsing/compiling after an error, the machine can take one of two approaches:

“Guess what the programmer intended, repair the syntax trees, and continue.” (Bright) Bad guesses lead to spurious and inscrutable error messages which lead to developers no longer trusting their compilers. Avoid this approach as it is very difficult to get right.
Take the approach that Bright did with the D compiler: consider any part of the code that has an error as “poisoned”. He likens it to the way that “floating-point NaNs are handled. Any operation with a NaN operand silently results in a NaN.” With this approach, “the compiler is able to detect multiple errors as long as the errors are in sections of code with no dependency between them”, which yields only high-quality and relevant error messages for the developer.
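The poisoning idea can be sketched with a tiny type checker. This is a minimal Python illustration under stated assumptions, not the D compiler’s actual implementation: the tuple-based AST, the POISON sentinel, and the function names are all hypothetical.

```python
POISON = object()  # sentinel type meaning "an error was already reported here"

def check(node, env, errors):
    """Return the type of `node`, or POISON if it or any operand had an error."""
    kind = node[0]
    if kind == "num":
        return "int"
    if kind == "str":
        return "string"
    if kind == "var":
        name = node[1]
        if name not in env:
            errors.append(f"undefined variable '{name}'")
            return POISON
        return env[name]
    if kind == "add":
        left = check(node[1], env, errors)
        right = check(node[2], env, errors)
        if left is POISON or right is POISON:
            # Like a NaN operand: the result is poisoned and we stay silent,
            # because the root error has already been reported.
            return POISON
        if left != right:
            errors.append(f"cannot add {left} and {right}")
            return POISON
        return left
    raise ValueError(f"unknown node kind: {kind}")

def check_program(statements, env):
    errors = []
    for stmt in statements:
        check(stmt, env, errors)
    return errors

program = [
    # (x + 1) + "a": x is undefined, so the whole expression is poisoned.
    ("add", ("add", ("var", "x"), ("num", 1)), ("str", "a")),
    # 1 + "b": an independent error in a separate statement.
    ("add", ("num", 1), ("str", "b")),
]
errors = check_program(program, env={})
for message in errors:
    print(message)
```

The first statement yields only the root error about the undefined variable; the follow-on type mismatch in the outer addition is suppressed. The second statement has no dependency on the first, so its error is still reported.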

Stand on the shoulders of giants

Do not re-invent the syntax for everything in your language. Instead, as Bright says, “[s]ave the divergence for features not generally seen before, which also signals the user that this is new.”

The runtime

A language definition is nothing without a runtime. Bright recommends “taking the common sense approach and using an existing back end, such as the JVM, CLR, gcc, or LLVM. (Of course, I can always set you up with the glorious Digital Mars back end!)” If you can avoid writing your own back-end, you should definitely do so. Similar to the approach recommended for parsing the language: start with a stock runtime and migrate to something custom if the needs of your project warrant it (they almost certainly won’t). This is the approach taken by any number of other popular languages, like Scala.

The framework

And then there’s the library/framework that accompanies the language and, arguably, helps to define it for people. Complaints about a language are often complaints about the standard runtime library/framework for the language. Developers quickly associate them and treat them as one entity. Bright’s focus is on very low-level runtimes (such as the one for his language, D) and thus his advice focuses on fast I/O, fast and efficient memory allocation/de-allocation and robust/fast transcendental functions[3]. However, he also offers the following excellent rule of thumb for any framework:

“My general rule is if the explanation for what the function does is more lines than the implementation code, then the function is likely trivia and should be booted out.”

[1] See Example of why C++ is NOT a context free grammar? by Kaz Kylheku (Velocity Reviews) and Context-free grammar (Wikipedia) for more information.↩

[2] As of publication time, the current stable release of ANTLR for C# available via NuGet (and used by most other packages) is version 3.5.1. However, the official home page is now ANTLR rather than ANTLR3. The NuGet packages for version 4 are in pre-release, but it’s very nice to see some progress being made after years of relatively minor upgrades. In particular, the new version of the IDE ANTLRWorks looks much nicer and seems to have been based on the JetBrains IDE framework. I’m definitely looking forward to checking it out in more detail.↩

[3] These are functions that cannot be composed of other functions in the framework, what I would call “core” or “root” functions. These are the functions that the developer would find it either impossible or incredibly difficult to replicate efficiently in the language itself.↩