Updated Format Specifier Highlighting in Emacs

What feels like ages ago now, I wrote a short post about improving the syntax highlighting of format specifiers for printf-style functions (actually, in all strings). It has worked great for a long time, but it is about time to give it a bit of a face-lift and, at the same time, make it even more general.

The first item on the agenda, however, is to make the original regexp manageable. When we finished things up, we were left with the following:

(defvar printf-fmt-regexp
  (concat "\\(%"
          "\\([[:digit:]]+\\$\\)?"   ; Posix argument position extension.
          "[-+' #0*]*"
          "\\(?:[[:digit:]]*\\|\\*\\|\\*[[:digit:]]+\\$\\)"
          "\\(?:\\.\\(?:[[:digit:]]*\\|\\*\\|\\*[[:digit:]]+\\$\\)\\)?"
          "\\(?:[hlLjzt]\\|ll\\|hh\\)?"
          "\\(?:[aAbdiuoxXDOUfFeEgGcCsSpn]\\|\\[\\^?.[^]]*\\]\\)\\)")
  "Regular expression to capture all possible `printf' formats in C/C++.")

While this was a lot better than the original, it is still incredibly hard to read. So I decided I was better off rewriting it completely using rx, the arguably more readable Lisp-style regexp notation. The code I ended up with is this:

  (defvar combined-fmt-rx
    (rx
     (or
      (group-n 1 "%%")
      (group-n 2
        (or
         (seq
          "{"                             ; `std::format' (C++20) specification.
          (group-n 3 (* (in digit)))
          (? ":"
             (? (? (not (in "{}"))) (in "<^>"))
             (? (in "-+ #0"))
             (? (or (+ (in digit))
                    (seq "{" (group-n 4 (* (in digit))) "}")))
             (? (seq "." (or (+ (in digit))
                             (seq "{" (group-n 5 (* (in digit))) "}"))))
             (? "L")
             (? (in "sbBcdoxXaAeEfFgGpP?")))
          "}")
         (seq
          "%"                             ; Positional `printf' specification.
          (seq (group-n 3 (+ (in digit))) "$")
          (? (in "-+' #0"))
          (? (or (+ (in digit))
                 (seq "*" (group-n 4 (+ (in digit))) "$")))
          (? (seq "." (or (+ (in digit))
                          (seq "*" (group-n 5 (+ (in digit))) "$"))))
          (? (or (in "hlLjzt") "ll" "hh"))
          (? (seq "v" (in "234")))
          (in "aAbdiuoxXDOUfFeEgGcCsSpn"))
         (seq "%"                         ; Regular `printf' specification.
              (? (+ (in "-+' #0*")))
              (* (in digit))
              (? (seq "." (? "*") (* (in digit))))
              (? (or (in "hlLjzt") "ll" "hh"))
              (? (seq "v" (in "234")))
              (in "aAbdiuoxXDOUfFeEgGcCsSpn"))
         (seq "%"                         ; Regular `scanf' specification.
           (? "*")
           (? (* (in digit)))
           (? (or (in "hlLjzt") "ll" "hh"))
           (or (in "aAbdiuoxXDOUfFeEgGcCsSpn")
               (seq "[" (+ (not (in "[]^"))) "]")))))))
    "Regular expression to capture `printf', `scanf' and `std::format' specifiers.")


  (defun printf-fmt-matcher (end)
    "Search for `printf' format specifiers within strings up to END."
    (let ((case-fold-search nil)
          (rx combined-fmt-rx)
          (pos)
          (found))
      (setq pos (re-search-forward rx end t))
      (while (and (not found) pos)
        (if (nth 3 (save-excursion (syntax-ppss pos)))
            (setq found pos)
          (setq pos (re-search-forward rx end t))))
      found))

(defun my-cc-mode-common-hook ()
  "Setup common utilities for all C-like modes."
  (face-remap-add-relative 'font-lock-doc-face
                           :foreground (face-foreground font-lock-comment-face))
  (font-lock-add-keywords
   nil
   '(("\\<\\(FIXME\\|TODO\\):?" 1 font-lock-warning-face prepend)
     ;; Add extra constants for true/false and NULL.
     ("\\<\\(true\\|false\\|NULL\\)" . font-lock-constant-face)
     ;; Add a printf() modifier highlighter.
     (printf-fmt-matcher (2 '(face font-lock-format-specifier-face) prepend lax)
                         (3 '(face font-lock-format-position-face) prepend lax)
                         (4 '(face font-lock-format-position-face) prepend lax)
                         (5 '(face font-lock-format-position-face) prepend lax)))))

Now this is substantially more verbose, but it is arguably a lot easier to read (at least for those of us who are familiar with Lisp). Feature-wise, however, it is now a lot more complete, and I want to highlight the following improvements in particular:

  1. The printf and scanf modifiers have been separated to support each of them better.
  2. Added support for vector types using the v{2,3,4} modifier, e.g., in GLSL code using debugPrintfEXT.
  3. Positional (POSIX-style) printf modifiers have also been separated, and the position specification is now placed in a separate match group so that it receives extra highlighting.
  4. Added support for the C++20 formatting library: std::format.
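
As a quick sanity check of the positional capture described in item 3, here is a minimal sketch. It uses a cut-down pattern that mirrors only the positional printf branch, so it can be evaluated on its own. The names my-positional-fmt-rx and my-fmt-position are hypothetical and only exist for this illustration; they are not part of the configuration above.

```elisp
(require 'rx)

;; Hypothetical, cut-down version of the positional `printf' branch.
;; It keeps the group numbers of the full pattern: group 2 is the whole
;; specifier and group 3 is the argument position.
(defvar my-positional-fmt-rx
  (rx (group-n 2
        "%"
        (group-n 3 (+ (in digit))) "$"
        (? (in "-+' #0"))
        (? (+ (in digit)))
        (? (seq "." (+ (in digit))))
        (? (or (in "hlLjzt") "ll" "hh"))
        (in "aAbdiuoxXDOUfFeEgGcCsSpn")))
  "Simplified regexp for positional `printf' specifiers (illustration only).")

(defun my-fmt-position (string)
  "Return the argument position of the first positional specifier in STRING.
Return nil when STRING contains no positional specifier."
  (let ((case-fold-search nil))
    (when (string-match my-positional-fmt-rx string)
      (match-string 3 string))))

(my-fmt-position "%2$-10s and %1$d")   ; => "2"
```

Because the position lives in its own numbered group, the font-lock entry can give group 3 a different face from the rest of the specifier in group 2.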

And, to show it in action:

[Image: updated format specifier highlighting in action]

I am particularly pleased with how this code gives positional modifiers extra highlighting, since they could be really hard to read before.
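
The hook above refers to two faces, font-lock-format-specifier-face and font-lock-format-position-face, whose definitions are not shown in this post. The sketch below is one way they could be defined. The face names are taken from the hook, but the attributes are placeholders I am assuming for illustration, not necessarily the settings used for the screenshot.

```elisp
;; Assumed definitions: only the face names come from the hook above.
;; The inherited faces and the bold weight are placeholder choices.
(defface font-lock-format-specifier-face
  '((t :inherit font-lock-builtin-face))
  "Face for format specifiers inside strings (assumed attributes)."
  :group 'font-lock-faces)

(defface font-lock-format-position-face
  '((t :inherit font-lock-format-specifier-face :weight bold))
  "Face for argument positions in format specifiers (assumed attributes)."
  :group 'font-lock-faces)
```

Letting the position face inherit from the specifier face keeps the two visually related, so only the position digits stand out within a specifier.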

I am a little concerned about performance, however: adding extra 'branches' to a regexp (or rx) like this can be expensive. I have not actually encountered any such problem yet, even in relatively large code bases, but I really should add some way of profiling this code to find out whether it could become a problem in the future.
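
A rough first measurement could look something like the sketch below, which uses the benchmark library that ships with Emacs to time repeated scans of the current buffer. The function name my-time-regexp-scan is hypothetical. Note that it only measures the raw regexp search, not the additional syntax-ppss check done in printf-fmt-matcher, so treat the numbers as a lower bound.

```elisp
(require 'benchmark)

(defun my-time-regexp-scan (regexp &optional repetitions)
  "Time scanning the current buffer for REGEXP, REPETITIONS times (default 10).
Return the average elapsed time per scan in seconds.
This assumes REGEXP never matches the empty string; otherwise the
search loop would not advance."
  (let* ((n (or repetitions 10))
         (case-fold-search nil)
         ;; `benchmark-run' returns (ELAPSED GC-COUNT GC-ELAPSED).
         (result (benchmark-run n
                   (save-excursion
                     (goto-char (point-min))
                     (while (re-search-forward regexp nil t))))))
    (/ (car result) n)))
```

In a large C or C++ buffer, evaluating (my-time-regexp-scan combined-fmt-rx) with M-: and comparing the result against the same call with the old printf-fmt-regexp should give a first idea of what the extra branches cost.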

But that, as well as bundling all of this into a proper package for Emacs, will have to remain future work for now.