The Simple-Minded Indentation Engine

I've written a few major-modes for Emacs over the years, and every time I start on a new one I take a look at the smie documentation and conclude that it seems too complicated to add a proper indentation engine for the mode.

This time however, I've had enough. It's time to get my hands dirty and write a major mode that utilizes smie.

Now, if you've ever written a major-mode for Emacs, you may have noticed that there is quite a few things that should be done - setting up a syntax table, enabling keyword highlighting etc. Doing these things are however well documented in various tutorials.

When it comes to setting up a proper indentation function however, some guides properly describes the implementation of a manual indentation function, and some others recommend that you use smie for structuring your indentation function.

The main problem is really that the documentation for smie is rather poor, and nobody has (to my knowledge) written a proper guide for using it yet. Obviously, smie is used by some built-in major-modes, such as ruby-mode and sh-script-mode, so there are some examples to be found, but it's about time some implementers notes' are made available to ease the use of this framework.

Now, smie is a beast in its own right, so I'll start us off with some requirements that the smie documentation implicitly assumes the reader knows about.

Context-Free Grammars - Backus-Naur-Form

In order to work efficiently with smie, I think it's important to have a basic understanding of context-free grammars. In particular of the so called Backus-Naur form (BNF) grammar or any of the derivatives from it.

Formally, BNF grammars are simply a way of exactly describing the syntax of formal languages used in computing, such as programming languages. For instance, a BNF grammar for the C programming language can be used to verify if a file is syntactically correct.

So why is this important? Well, for smie to provide intelligent navigation indentation, it needs to have at least a rudimentary understanding of the underlying language.

Practically speaking, BNF grammars are a set of rules of the kind:

<symbol> := __expressions__

Where symbol is often called a non-terminal and __expressions__ is one or more ways of combining multiple (possibly different) symbols. In contrast to the non-terminal, any symbol that doesn't appear on the left hand side of these rules is often called a terminal.

To dig a little deeper in this, let us start a small example:

Lets say that we want to create a set of rules to correctly identify different number-types:

  • Natural numbers,
  • Integers,
  • Rational numbers, and
  • Real numbers

To do this, we could setup a BNF grammar as follows:

zero              := "0";
non-zero-digit    := "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
digits            := "" | zero | non-zero-digit | digits
natural_number    := non-zero-digit, digits
integer           := zero | "-", natural_number | natural_number
positive_rational := natural_number, "/", natural_number
rational_number   := "-", positive_rational | positive_rational
real_number       := integer, ".", natural_number
number            := real_number | rational_number | integer | natural_number

Here, we let | (vertical bar) represent the the choice between two or more symbols, , (comma) represents the combined use of symbols and quote-delimited strings represent terminals.

Now, walking on a per-character basis, the following input strings will all match the respective number-types:

0
-1
-100/101
1.0151

Take for instance 100/101 above, we first begin by trying to see if it matches our number rule.

For the interested, Wikipedia describes all of this detail and has several useful examples. For now, I think this is more than enough BNF knowledge than what is strictly necessary to understand how smie works.

Operator Precedence Grammars

In theory, the BNF-grammar along with a lexer could be used to accurately parse any given file to ensure that it complies with the grammar, so why not use that instead of this smie thing? Essentially, it all boils down to runtime performance. A generic BNF-grammar requires a recursive-decent parser, which can be very slow, depending on the complexity of the grammar. Therefore, a much simpler parser strategy is used by smie, the operator precedence grammar.

A parser based on this kind of grammar can parse both forward and backwards from any token in a file, which makes it ideal to figure out where the next or previous token is located.

Obviously, such a grammar also come with a few limitations:

  • Prefix operators, such unary plus or minus operators, may require special handling.
  • Given only a set of operator precedence, it is difficult limit the parser to just the intended language.
  • Only a small class of grammars can be parsed.

As a consequence of this, smie cannot take an arbitrary BNF grammars and directly map that to an operator precedence grammar. So the implementer must simplify the BNF grammar in such a way that smie can handle it using operator precedence parsing techniques and create operator precedence tables.

The main reason for setting up these table is so that smie can map the expressions in the new major-mode to symbolic-expressions (sexps). Once it has done that, the usual navigation and indentation functions used for sexp-based modes (such as lisp) can be used directly. Thereby avoiding the need to create custom ones for each mode.

SMIE BNF Rules

As I mentioned, the operator precedence grammar sets some limitations on the parser, but smie still accepts a BNF grammar as an input that it then maps to a set of operator precedence tables. This BNF grammar must however follow a set of rules:

  • The right hand side of a non-terminal cannot be the empty list.
  • The right hand side cannot have two consecutive non-terminals.

The first rule isn't much of a problem in practice, but the second one puts some real limitations on what can be given to smie. For instance, a grammar such as this is not accepted:

a = "A";
b = "B";
c = a c;

Examples

I may have gone a bit too meta while trying things out, so if these examples don't make any sense, I apologize. Anyways, as I mentioned above, my logical starting point of the indentation engine is to write down a simplified version of the BNF grammar, and since the BNF grammar can be used to describe itself, I figured that creating a major mode for BNF grammar(s) themselves was a good starting point and good learning opportunity.

After digging a bit, there doesn't seem to be any widely recognized standard for BNF grammars. As far as I can tell, it is mostly used in an ad-hoc fashion. There are some standards however and Emacs itself can parse a few of these to create postscript files (see ebnf2ps.el). Thus, I decided to implement one or more major-modes that could support the same BNF grammars that ebnf2ps.el could understand.

So, looking over the styles that ebnf2ps.el supports, I start with the so called iso-ebnf, since it seems to be the most well-defined BNF grammar. A slightly simplified version of it is included here:

(* Basic terminals *)
letter = "A" | "B" | "C" | "D" | "E" | "F" | "G" |
      "H" | "I" | "J" | "K" | "L" | "M" | "N" |
      "O" | "P" | "Q" | "R" | "S" | "T" | "U" |
      "V" | "W" | "X" | "Y" | "Z" | "a" | "b" |
      "c" | "d" | "e" | "f" | "g" | "h" | "i" |
      "j" | "k" | "l" | "m" | "n" | "o" | "p" |
      "q" | "r" | "s" | "t" | "u" | "v" | "w" |
      "x" | "y" | "z" ;

digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;

symbol = "[" | "]" | "{" | "}" | "(" | ")" | "<" | ">" |
      "'" | '"' | "=" | "|" | "." | "," | ";" ;

(* Simple compounds *)
character = letter | digit | symbol | "_" ;

identifier = letter , { letter | digit | "_" } ;

terminal = "'" , character , { character } , "'" |
        '"' , character , { character } , '"' ;

(* Rule creation *)
lhs = identifier ;
rhs = identifier |
   terminal |
   "[" , rhs , "]" |
   "{" , rhs , "}" |
   "(" , rhs , ")" |
   rhs , "|" , rhs |
   rhs , "," , rhs ;
rule = lhs , "=" , rhs , ";" ;
grammar = { rule } ;

This will be out starting point for working with smie.

The first thing we should ask ourselves is simply: How do we want smie to indent the code? In this case, there are no extensive control structures, so we simply want to make sure that everything lines up nicely, similar to how the example above is structured.

Thus, let us begin by creating the simplified BNF-grammar that smie require:

   (defvar ebnf-mode-smie-iso-ebnf-grammar
     (smie-prec2->grammar
      (smie-bnf->prec2
       '((id)
	 (lhs (id))
	 (rhs (rhs "|" rhs)
	      (rhs "," rhs)
	      (id))
	 (rule (lhs "=" rhs))
	 (rules (rules ";" rules) (rule)))
       '((assoc ";"))
       '((assoc "|") (assoc ","))))
     "Simplfied BNF grammar for `smie' indentation of the `iso-ebnf' style.")

Smie structures the BNF quite a bit differently from the BNF shown above, but it still follows the same basic rules. Looking at the list given to smie-bnf->prec2 we find a list of rules where:

  • Each rule is a new list, where:
  • The left-hand side of a rule is the first entry of the list.
  • Nested lists represent choices, and
  • A symbol or a string represent a combination.

Take the rule symbol above as an example: A rule is something with a left-hand side (lhs) combined with an equal sign (=), which in turn is combined with a right-hand side (rhs). Furthermore, the right-hand side is either a single identifier (id) or a new right-hand side separated by a vertical bar ("|") or a comma (","), followed by another right-hand symbol.

It is clearly different from what we started with, but if you look closely, you can see some similarities between the grammars.

There are some quirks that should be noted however:

  • Notice the id symbol above, this entry does not have a right-hand side. Semantically, this means that id should match anything that represents an identifier. Practically however, it will match any sequence of characters that the syntax table has specified as having word syntax.
  • Note that there are no two consecutive non-terminals.
  • The first list given to smie-bnf->prec2 represents the BNF grammar, any consecutive lists will instead specify a number of resolvers, i.e., symbols that specify inter-symbol precedence resolutions. The smie documentation has some more details on this. For this example though, it is enough to specify that the rule terminator (semicolon, ;) and the choice/combination operators are associative, as seen above.

Next, we need to setup the lexer. In this case, we don't actually need to do much. The default smie lexer will simply regard any sequence of characters that have word or symbol syntax as a token, so we only need to setup the syntax table properly:

   (defconst ebnf-mode-iso-ebnf-syntax-table
     (let ((table (make-syntax-table)))
       (modify-syntax-entry ?'  "\""  table) ; String delimiter
       (modify-syntax-entry ?\" "\"" table)  ; String delimiter
       (modify-syntax-entry ?\( "()1" table)  ; Comment start
       (modify-syntax-entry ?*  ". 23" table) ; Comment start/end
       (modify-syntax-entry ?\) ")(4" table)  ; Comment end
       table))

In this case we only really need the first two entries to ensure that strings get properly delimited, but the various comment entries also allow us to treat (* as a comment starter, and *) as a comment finisher.

Now, we need to create the rules for how smie should indent the code. In practice, this is determined with a single function that receives two symbolic arguments, often called kind and token (Note that the official documentation calls these functions method and arg instead). To be honest though, I mostly relied on trial and error while creating this function, taking some cues from the corresponding functions in sh-script-mode, ruby-mode or sgml-mode.

   (defcustom ebnf-mode-indent-offset 4)

   (defun ebnf-mode-smie-iso-ebnf-rules (kind token)
     "Perform indentation of KIND on TOKEN using the `smie' engine."
     (pcase (cons kind token)
       (`(:elem . basic)            ebnf-mode-indent-offset)
       (`(:elem . args)             ebnf-mode-indent-offset)
       (`(:list-intro . "=")        0)
       (`(:before . ,(or ";" "," "|"))  (smie-rule-separator kind))
       (`(:after . "=") ebnf-mode-indent-offset)))

There are some tricks to it though,

Finally, we need to call the smie-setup function with the previously defined grammar and indentation rules to apply the configuration:

   (defun ebnf-iso-ebnf-setup ()
     "Setup the `major-mode' for the `iso-ebnf' style."
     (set-syntax-table ebnf-mode-iso-ebnf-syntax-table)
     (smie-setup ebnf-mode-smie-iso-ebnf-grammar #'ebnf-mode-smie-iso-ebnf-rules))

Additional BNF Styles

As I mentioned earlier, there are several distinct styles of BNF grammars, and while adding support for those I encountered a few interesting caveats to the smie engine that may be of interest. First off: How do we handle whitespace separators?

Some languages, such as Python, uses whitespace to separate statements and as you may have noticed from the example above, every statement were separated by non-whitespace symbols. In fact, this is the simplest case for smie to work with, since it can easily treat everything binary operators with different levels of precedence.

In the case of the abnf BNF style, this is no longer the case:

concatenation  =  repetition *(1*c-wsp repetition)

repetition     =  [repeat] element

repeat         =  1*DIGIT / (*DIGIT "*" *DIGIT)

element        =  rulename / group / option /
                  char-val / num-val / prose-val

As you can see, the rules themselves are no longer ending with a semicolon, so we now need to make sure that smie can parse rules without explicit symbols.

This can be solved a number of ways, but the recommended way is to have the lexer analyze the whitespace and generate a 'synthetic' operator when it encounters the end of a rule.

To achieve this, we create a BNF grammar similar to the one before:

    (defvar ebnf-mode-smie-abnf-grammar
      (smie-prec2->grammar
       (smie-bnf->prec2
	'((id)
	  (lhs (id))
	  (rhs (rhs "/" rhs)
	       (id))
	  (rule (lhs "=/" rhs)
		(lhs "="  rhs))
	  (rules (rules "NEWLINE-SEPARATOR" rules) ; Token generated by lexer.
		 (rule)))
	'((assoc "NEWLINE-SEPARATOR"))
	'((assoc "/"))))
      "Simplfied BNF grammar for `smie' indentation of the `abnf' style.")

The major difference is the introduction of the NEWLINE-SEPARATOR operator. This is simply a name I choose for this operator, in practice, it could have been any other string. Due note however that if the chosen string is present in the code, smie may get confused.

Next, we need to create the lexer.

Kommentarer

Comments powered by Disqus