Why can’t identifiers start with a number?

Published by marco on

The video “I’m not sure how much longer I can wait!” by Kevin Powell is an excellent introduction to subgrid in CSS. But I was more interested in the fact that he told his viewers,

“you can use numbers in classes, but if you have a class or id that starts with a number, it’s invalid. […] It’s one of those weird things in CSS that sometimes trips people up.”

I immediately thought to myself, “it’s not weird. Every programming language is like that.”

Then, I thought, “I bet this guy only knows CSS, so he doesn’t have anything to compare it to.”

Then, I thought, “Wait…why can’t you start an identifier with a number?”

And, finally, “I bet it’s a lexing/parsing thing.”

Parser or lexer?

I’ve written several parsers for medium-sized languages, and my gut feeling is that letting an identifier start with a number is a surefire way of making the lexer ambiguous or of pushing more work into the parsing stage.

For example, if 25L can be either an identifier or a long integer, then the parser has to figure out from context which one it is (e.g. by checking whether that identifier is declared). If it can only be a number, then it comes out of the lexer as a number token and the parser doesn’t have to disambiguate.

Even if your language doesn’t allow suffixes, you’d still have the problem with an identifier like 25, which would be legal unless you introduce the additional restriction that an identifier must have at least one alphabetic character. In that case, though, you might as well make the rule that the identifier has to start with an alphabetic character and avoid the whole ambiguity.

With that common—not weird!—rule, the disambiguation happens in the lexer, where the operation is clearer and less expensive, performance-wise.
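To make that concrete, here is a minimal sketch of a lexer for a made-up toy language, written in Python. The token names, the L suffix, and the tokenize function are all assumptions for illustration, not any real language's grammar. Because an identifier must start with a letter or underscore, the first character alone decides the token kind and no symbol table is needed.

```python
import re

# Hypothetical token rules for a toy language: numbers may carry an "L"
# suffix, and identifiers must start with a letter or underscore.
TOKEN_SPEC = [
    ("NUMBER", r"\d+L?"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r"[=;+\-*/]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return (kind, text) pairs; raise ValueError on text no rule accepts."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        if match.lastgroup != "SKIP":  # whitespace is dropped, not emitted
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

With these rules, tokenize("x = 25L;") classifies 25L as a NUMBER purely from its spelling. If the IDENT pattern also accepted a leading digit, both rules would match 25L and something further down the pipeline would have to pick one.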

Unresolvable ambiguity

It’s actually worse than that, though. In the case of a programming language, you could see how the following would result in a compiler ambiguity:

var 3 = 5; // I'm already confused
           //…the compiler gets it, though

var a = 3; // Now, the compiler's confused as well

Is the developer assigning the literal value 3 to a, or the value of the variable named 3 (which is 5)? Not only is this a terrible idea for readability, the compiler literally cannot resolve this ambiguity without additional information. So there have to be restrictions on identifier names in order to avoid clashes with not only reserved words (e.g. if) but also manifest constants (e.g. 3).
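The same point as a small Python sketch. The evaluate function and its numerals_can_be_names flag are hypothetical, invented only to show that once a numeral may be a name, the meaning of the token 3 depends on what has been declared. The sketch arbitrarily lets the variable win; letting the literal win would be just as defensible, which is exactly the problem.

```python
def evaluate(token, variables, numerals_can_be_names):
    """Resolve a single token to a value under one of two naming rules."""
    if numerals_can_be_names and token in variables:
        # The token looks like a number, but a variable of that name shadows it.
        return variables[token]
    if token.isdigit():
        # Under the usual rule, a run of digits is always a literal.
        return int(token)
    return variables[token]


# The state after the hypothetical "var 3 = 5;"
variables = {"3": 5}

permissive = evaluate("3", variables, numerals_can_be_names=True)
strict = evaluate("3", variables, numerals_can_be_names=False)
```

Under the permissive rule the token resolves to 5, and under the usual rule it resolves to 3, so the same source text means two different things.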

Where’s the problem with CSS?

In the case of CSS, where you do have suffixes (e.g. 25px) but you can’t really mix class identifiers with values, it’s possible that you could get away with no ambiguities right now. So it’s not weird that you can’t start an identifier with a number—it’s perfectly natural for developers—but it is, in the case of CSS, not required for unambiguous processing. As you can see below, though, it’s still kind of confusing for the user.

What if we have a class named “3”? It’s not very expressive—we’d probably call the class something like “3-part-panel”—but it’s the pathological case. Maybe a class called “3px” would be even worse.

.3-part-panel {
  /* This is fine */
}
.3 {
  /* Weird, but OK */
}
.3px {
  /* Now you're just being obnoxious */
}

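Do we actually get any ambiguities, though? Mostly not, I think, with one catch: CSS lets you write a fractional number without the leading zero, so a tokenizer that sees a dot followed by a digit already reads .3 as the number 0.3 and .3px as the length 0.3px. Apart from that, I think the authors of CSS just used the “standard” (not weird!) definition of an identifier. It’s only when you have people using CSS who have had no exposure to any other programming languages (or parsing/lexing) that you get people thinking it’s “weird” that you can’t start with a number.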

The only place where you could get an ambiguity is with CSS custom properties. In that case, though, “[a] custom property is any property whose name starts with two dashes”, according to CSS Custom Properties for Cascading Variables Module Level 1 (W3C). So, variable names in CSS are even more restricted than in most programming languages. Is that weird? Again, no. As in the case above with other programming languages, the end result is more clarity for the user.

For example, suppose the two-dash prefix weren’t required. The following hypothetical (and not actually valid) CSS declares a few custom properties with deliberately obnoxious names.

:root {
  red: #F33;
  color: #FF0;
  0: 1;
  3px: 1px;
}

.error-text {
  color: var(red);
  background-color: var(color);
  border-width: var(3px);
  opacity: var(0);
}

Although I’ve chosen confusing values and names, this doesn’t—at first glance—seem to cause any ambiguities. As with the examples above, it does force implementations to handle enumerations (e.g. all of the colors) in the parser, rather than the lexer. If the word “red” cannot be used as a variable, then it could (possibly) be recognized as its own token in the lexer, (possibly) improving performance.

The same goes for the property names. If custom properties can use the same names as built-in properties, then the lexer can’t tell the two apart. On the value side there is no ambiguity, because a custom property’s value must be retrieved with the CSS function var().

The problem is worse than that, though. There is an actual ambiguity that isn’t obvious because we’re using the :root pseudo-class[1]. The example below, using the html selector instead, makes it clearer.

html {
  color: #F33; /* Is this setting the color
                  …or declaring a color variable? */
}

This is an ambiguity that the parser cannot resolve. That’s why the CSS designers settled on a prefix for custom properties.
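Here is a rough Python sketch of what the prefix buys a processor. The classify_declarations function is hypothetical, and its naive splitting on semicolons and colons is an assumption that ignores comments, strings, and url() values; it is nowhere near a real CSS parser. It only shows that, with the two-dash rule, the spelling of a name is enough to tell a custom property from a standard one.

```python
def classify_declarations(block):
    """Map each declared name in a 'name: value;' block to its kind."""
    kinds = {}
    for declaration in block.split(";"):
        if ":" not in declaration:
            continue  # skip the empty piece after the last semicolon
        name = declaration.split(":", 1)[0].strip()
        # The two-dash prefix is the only signal needed.
        kinds[name] = "custom" if name.startswith("--") else "standard"
    return kinds


kinds = classify_declarations("color: #F33; --color: #FF0;")
```

Here color and --color can sit side by side in the same rule without any guessing, because they are different names that land in different categories.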

So, to a layman or user of CSS, naming restrictions on class or custom-property identifiers may seem arbitrary and “weird”, but they are a logical requirement of being able to process the grammar unambiguously.

[1] If you know where I’m headed, then fine, it’s obvious to you. Congratulations. I didn’t see it immediately, so I’m writing it this way.