How It Works

JuliaFormatter works in several stages:

  1. Parsing the source code to generate a concrete syntax tree (CST).

  2. Generating a formatted syntax tree (FST) from the CST. This is not a formal term; it's just a term that's used internally to represent a CST with additional formatting-specific metadata, such as placeholders indicating spots where newlines can be inserted.

  3. Nesting the FST, i.e., determining where newlines should be inserted.

  4. Printing the FST to text.

These stages are described in more detail below.

Parsing: String -> CST

A concrete syntax tree differs from the more familiar abstract syntax tree in that it retains all information in the source code that is not semantically relevant, such as whitespace, extra parentheses, and comments.

In particular, JuliaFormatter v2.5 uses JuliaSyntax.jl v1 as its CST parser. Here is an example of a CST:

using JuliaSyntax: parseall, GreenNode

text = "x = f(y + z) # comment"
cst = parseall(GreenNode, text)
     1:22     │[toplevel]
     1:12     │  [=]
     1:1      │    Identifier           ✔
     2:2      │    Whitespace
     3:3      │    =
     4:4      │    Whitespace
     5:12     │    [call]
     5:5      │      Identifier         ✔
     6:6      │      (
     7:11     │      [call]
     7:7      │        Identifier       ✔
     8:8      │        Whitespace
     9:9      │        Identifier       ✔
    10:10     │        Whitespace
    11:11     │        Identifier       ✔
    12:12     │      )
    13:13     │  Whitespace
    14:22     │  Comment

In this example, you can see that JuliaSyntax has generated a tree structure. The numbers on the left-hand side indicate the byte offsets of each node in the original source code.

Notice that the whitespace and comment in the original source code is preserved in the CST. In contrast, an AST, generated e.g. using Meta.parse, would discard this information. In principle, we could pretty-print the AST and thus reconstruct "formatted" code with the same semantics. However, this would not be ideal since we would lose (for example) comments as well as the original parenthesisation of expressions, which a user may have intentionally added for readability.

dump(Meta.parse(text))
Expr
  head: Symbol =
  args: Array{Any}((2,))
    1: Symbol x
    2: Expr
      head: Symbol call
      args: Array{Any}((2,))
        1: Symbol f
        2: Expr
          head: Symbol call
          args: Array{Any}((3,))
            1: Symbol +
            2: Symbol y
            3: Symbol z
Note

The CST parsing backend has varied across versions. JuliaFormatter v2 - v2.4 used JuliaSyntax.jl v0, and JuliaFormatter v1 used CSTParser.jl.

Generating an FST

From the CST we transform it into a slightly richer representation, which is internally called an FST. The FST is very similar to the CST, but it contains additional information that is specific to formatting.

For example, the CST is very lightweight: it actually does not store any strings for the various nodes, only byte offsets. When generating the FST, we use the byte offsets plus the original source code to extract the relevant substrings and store them in the FST.

using JuliaFormatter: State, Document, Options, pretty, DefaultStyle

state = State(Document(text), Options(; margin=10))
fst = pretty(DefaultStyle(), cst, state)
FST: TopLevel 1
  (1, 1, 0, 12)
  nest_behavior: AllowNest
  extra_margin: 0
  nodes:
    [1]     FST: Binary 6
      (1, 1, 0, 12)
      nest_behavior: AllowNest
      extra_margin: 0
      nodes:
        [1]         FST: IDENTIFIER
          val: "x"
          line_offset: 1
          indent: 0
        [2]         FST: WHITESPACE
          val: " "
          line_offset: -1
          indent: 0
        [3]         FST: OPERATOR
          val: "="
          line_offset: 3
          indent: 0
        [4]         FST: PLACEHOLDER
          val: " "
          line_offset: -1
          indent: 0
        [5]         FST: PLACEHOLDER
          val: ""
          line_offset: -1
          indent: 0
        [6]         FST: Call 7
          (1, 1, 0, 8)
          nest_behavior: AllowNest
          extra_margin: 0
          nodes:
            [1]             FST: IDENTIFIER
              val: "f"
              line_offset: 5
              indent: 0
            [2]             FST: PUNCTUATION
              val: "("
              line_offset: 6
              indent: 0
            [3]             FST: PLACEHOLDER
              val: ""
              line_offset: -1
              indent: 0
            [4]             FST: Binary 6
              (1, 1, 0, 5)
              nest_behavior: AllowNest
              extra_margin: 0
              nodes:
                [1]                 FST: IDENTIFIER
                  val: "y"
                  line_offset: 7
                  indent: 0
                [2]                 FST: WHITESPACE
                  val: " "
                  line_offset: -1
                  indent: 0
                [3]                 FST: OPERATOR
                  val: "+"
                  line_offset: 9
                  indent: 0
                [4]                 FST: PLACEHOLDER
                  val: " "
                  line_offset: -1
                  indent: 0
                [5]                 FST: PLACEHOLDER
                  val: ""
                  line_offset: -1
                  indent: 0
                [6]                 FST: IDENTIFIER
                  val: "z"
                  line_offset: 11
                  indent: 0
            [5]             FST: TRAILINGCOMMA
              val: ""
              line_offset: -1
              indent: 0
            [6]             FST: PLACEHOLDER
              val: ""
              line_offset: -1
              indent: 0
            [7]             FST: PUNCTUATION
              val: ")"
              line_offset: 12
              indent: 0

In this very simple example here, margin=10 does not actually affect the FST. However, in general, the style as well as the options passed in here will affect the structure of the generated FST, either via multiple dispatch, or via the state.opts field.

Notice that each node in this tree contains a val field, which is the string corresponding to that node in the CST. On top of that, there are some mysterious PLACEHOLDER nodes, which are not present in the CST. These indicate potential locations where we can insert newlines when we later nest the FST. For example, in the string x = f(y + z) we can insert a newline in several places:

x = f( y + z )
   ^  ^   ^ ^

Each of these potential newline locations is represented by a PLACEHOLDER node in the FST.

Several other transformations happen at this stage:

  • Code and comments are indented to match surrounding code blocks.
  • Unnecessary whitespace is removed (although newlines in between code blocks are untouched).
  • Code is flattened as much as possible. For example, if an expression can be put on a single line, it will be: it doesn't matter if it's over the margin or not at this stage. However, if the expression has a block-like structure to it, such as a try, if, or a struct definition, it will be spread across multiple lines appropriately:

The important invariant underpinning a FST is any two strings that are syntactically the same (ignoring whitespace) will produce an identical FST. For example:

a = 
       foo(a,                     b,           
       c,d)

and

a =                      foo(a,
b,
c,d)

will produce the same FST, which when printed would look like a = foo(a, b, c, d).

Nesting the FST

At this point we have not actually yet decided which of these (if any) need to be converted into newlines. This depends on, for example, the margin. We can see how this affects the output here:

using JuliaFormatter: format_text

format_text(text; margin=40) |> print
x = f(y + z) # comment
format_text(text; margin=10) |> print
x = f(
    y + z,
) # comment

The decision on where to insert newlines is made in the nesting stage, which converts PLACEHOLDER nodes to NEWLINE nodes as needed. Recall that above we specified margin=10. This will cause the nest! function to insert newlines such that no line exceeds 10 characters. Just as before, the style and options passed in here will affect the decisions made inside nest!.

using JuliaFormatter: nest!

nest!(DefaultStyle(), fst, state)
fst
FST: TopLevel 1
  (1, 1, 0, 12)
  nest_behavior: AllowNest
  extra_margin: 0
  nodes:
    [1]     FST: Binary 6
      (1, 1, 0, 12)
      nest_behavior: AllowNest
      extra_margin: 0
      nodes:
        [1]         FST: IDENTIFIER
          val: "x"
          line_offset: 1
          indent: 0
        [2]         FST: WHITESPACE
          val: " "
          line_offset: -1
          indent: 0
        [3]         FST: OPERATOR
          val: "="
          line_offset: 3
          indent: 0
        [4]         FST: WHITESPACE
          val: " "
          line_offset: -1
          indent: 0
        [5]         FST: WHITESPACE
          val: ""
          line_offset: -1
          indent: 0
        [6]         FST: Call 7
          (1, 1, 4, 8)
          nest_behavior: AllowNest
          extra_margin: 0
          nodes:
            [1]             FST: IDENTIFIER
              val: "f"
              line_offset: 5
              indent: 4
            [2]             FST: PUNCTUATION
              val: "("
              line_offset: 6
              indent: 8
            [3]             FST: NEWLINE
              val: "
"
              line_offset: -1
              indent: 0
            [4]             FST: Binary 6
              (1, 1, 4, 5)
              nest_behavior: AllowNest
              extra_margin: 1
              nodes:
                [1]                 FST: IDENTIFIER
                  val: "y"
                  line_offset: 7
                  indent: 8
                [2]                 FST: WHITESPACE
                  val: " "
                  line_offset: -1
                  indent: 8
                [3]                 FST: OPERATOR
                  val: "+"
                  line_offset: 9
                  indent: 8
                [4]                 FST: WHITESPACE
                  val: " "
                  line_offset: -1
                  indent: 0
                [5]                 FST: PLACEHOLDER
                  val: ""
                  line_offset: -1
                  indent: 8
                [6]                 FST: IDENTIFIER
                  val: "z"
                  line_offset: 11
                  indent: 8
            [5]             FST: TRAILINGCOMMA
              val: ","
              line_offset: -1
              indent: 4
            [6]             FST: NEWLINE
              val: "
"
              line_offset: -1
              indent: 0
            [7]             FST: PUNCTUATION
              val: ")"
              line_offset: 12
              indent: 0

Printing the FST

Once the FST has been nested the rest is straightforward: traverse the tree and print it out into a string!

using JuliaFormatter: print_tree

io = IOBuffer()
print_tree(io, fst, state)
String(take!(io)) |> print
x = f(
    y + z,
)