How It Works
JuliaFormatter works in several stages:
Parsing the source code to generate a concrete syntax tree (CST).
Generating a formatted syntax tree (FST) from the CST. This is not a formal term; it's just a term that's used internally to represent a CST with additional formatting-specific metadata, such as placeholders indicating spots where newlines can be inserted.
Nesting the FST, i.e., determining where newlines should be inserted.
Printing the FST to text.
These stages are described in more detail below.
Parsing: String -> CST
A concrete syntax tree differs from the more familiar abstract syntax tree in that it retains all information in the source code that is not semantically relevant, such as whitespace, extra parentheses, and comments.
In particular, JuliaFormatter v2.5 uses JuliaSyntax.jl v1 as its CST parser. Here is an example of a CST:
using JuliaSyntax: parseall, GreenNode
text = "x = f(y + z) # comment"
cst = parseall(GreenNode, text) 1:22 │[toplevel]
1:12 │ [=]
1:1 │ Identifier ✔
2:2 │ Whitespace
3:3 │ =
4:4 │ Whitespace
5:12 │ [call]
5:5 │ Identifier ✔
6:6 │ (
7:11 │ [call]
7:7 │ Identifier ✔
8:8 │ Whitespace
9:9 │ Identifier ✔
10:10 │ Whitespace
11:11 │ Identifier ✔
12:12 │ )
13:13 │ Whitespace
14:22 │ Comment
In this example, you can see that JuliaSyntax has generated a tree structure. The numbers on the left-hand side indicate the byte offsets of each node in the original source code.
Notice that the whitespace and comment in the original source code is preserved in the CST. In contrast, an AST, generated e.g. using Meta.parse, would discard this information. In principle, we could pretty-print the AST and thus reconstruct "formatted" code with the same semantics. However, this would not be ideal since we would lose (for example) comments as well as the original parenthesisation of expressions, which a user may have intentionally added for readability.
dump(Meta.parse(text))Expr
head: Symbol =
args: Array{Any}((2,))
1: Symbol x
2: Expr
head: Symbol call
args: Array{Any}((2,))
1: Symbol f
2: Expr
head: Symbol call
args: Array{Any}((3,))
1: Symbol +
2: Symbol y
3: Symbol zThe CST parsing backend has varied across versions. JuliaFormatter v2 - v2.4 used JuliaSyntax.jl v0, and JuliaFormatter v1 used CSTParser.jl.
Generating an FST
From the CST we transform it into a slightly richer representation, which is internally called an FST. The FST is very similar to the CST, but it contains additional information that is specific to formatting.
For example, the CST is very lightweight: it actually does not store any strings for the various nodes, only byte offsets. When generating the FST, we use the byte offsets plus the original source code to extract the relevant substrings and store them in the FST.
using JuliaFormatter: State, Document, Options, pretty, DefaultStyle
state = State(Document(text), Options(; margin=10))
fst = pretty(DefaultStyle(), cst, state)FST: TopLevel 1
(1, 1, 0, 12)
nest_behavior: AllowNest
extra_margin: 0
nodes:
[1] FST: Binary 6
(1, 1, 0, 12)
nest_behavior: AllowNest
extra_margin: 0
nodes:
[1] FST: IDENTIFIER
val: "x"
line_offset: 1
indent: 0
[2] FST: WHITESPACE
val: " "
line_offset: -1
indent: 0
[3] FST: OPERATOR
val: "="
line_offset: 3
indent: 0
[4] FST: PLACEHOLDER
val: " "
line_offset: -1
indent: 0
[5] FST: PLACEHOLDER
val: ""
line_offset: -1
indent: 0
[6] FST: Call 7
(1, 1, 0, 8)
nest_behavior: AllowNest
extra_margin: 0
nodes:
[1] FST: IDENTIFIER
val: "f"
line_offset: 5
indent: 0
[2] FST: PUNCTUATION
val: "("
line_offset: 6
indent: 0
[3] FST: PLACEHOLDER
val: ""
line_offset: -1
indent: 0
[4] FST: Binary 6
(1, 1, 0, 5)
nest_behavior: AllowNest
extra_margin: 0
nodes:
[1] FST: IDENTIFIER
val: "y"
line_offset: 7
indent: 0
[2] FST: WHITESPACE
val: " "
line_offset: -1
indent: 0
[3] FST: OPERATOR
val: "+"
line_offset: 9
indent: 0
[4] FST: PLACEHOLDER
val: " "
line_offset: -1
indent: 0
[5] FST: PLACEHOLDER
val: ""
line_offset: -1
indent: 0
[6] FST: IDENTIFIER
val: "z"
line_offset: 11
indent: 0
[5] FST: TRAILINGCOMMA
val: ""
line_offset: -1
indent: 0
[6] FST: PLACEHOLDER
val: ""
line_offset: -1
indent: 0
[7] FST: PUNCTUATION
val: ")"
line_offset: 12
indent: 0
In this very simple example here, margin=10 does not actually affect the FST. However, in general, the style as well as the options passed in here will affect the structure of the generated FST, either via multiple dispatch, or via the state.opts field.
Notice that each node in this tree contains a val field, which is the string corresponding to that node in the CST. On top of that, there are some mysterious PLACEHOLDER nodes, which are not present in the CST. These indicate potential locations where we can insert newlines when we later nest the FST. For example, in the string x = f(y + z) we can insert a newline in several places:
x = f( y + z )
^ ^ ^ ^Each of these potential newline locations is represented by a PLACEHOLDER node in the FST.
Several other transformations happen at this stage:
- Code and comments are indented to match surrounding code blocks.
- Unnecessary whitespace is removed (although newlines in between code blocks are untouched).
- Code is flattened as much as possible. For example, if an expression can be put on a single line, it will be: it doesn't matter if it's over the margin or not at this stage. However, if the expression has a block-like structure to it, such as a
try,if, or astructdefinition, it will be spread across multiple lines appropriately:
The important invariant underpinning a FST is any two strings that are syntactically the same (ignoring whitespace) will produce an identical FST. For example:
a =
foo(a, b,
c,d)and
a = foo(a,
b,
c,d)will produce the same FST, which when printed would look like a = foo(a, b, c, d).
Nesting the FST
At this point we have not actually yet decided which of these (if any) need to be converted into newlines. This depends on, for example, the margin. We can see how this affects the output here:
using JuliaFormatter: format_text
format_text(text; margin=40) |> printx = f(y + z) # commentformat_text(text; margin=10) |> printx = f(
y + z,
) # commentThe decision on where to insert newlines is made in the nesting stage, which converts PLACEHOLDER nodes to NEWLINE nodes as needed. Recall that above we specified margin=10. This will cause the nest! function to insert newlines such that no line exceeds 10 characters. Just as before, the style and options passed in here will affect the decisions made inside nest!.
using JuliaFormatter: nest!
nest!(DefaultStyle(), fst, state)
fstFST: TopLevel 1
(1, 1, 0, 12)
nest_behavior: AllowNest
extra_margin: 0
nodes:
[1] FST: Binary 6
(1, 1, 0, 12)
nest_behavior: AllowNest
extra_margin: 0
nodes:
[1] FST: IDENTIFIER
val: "x"
line_offset: 1
indent: 0
[2] FST: WHITESPACE
val: " "
line_offset: -1
indent: 0
[3] FST: OPERATOR
val: "="
line_offset: 3
indent: 0
[4] FST: WHITESPACE
val: " "
line_offset: -1
indent: 0
[5] FST: WHITESPACE
val: ""
line_offset: -1
indent: 0
[6] FST: Call 7
(1, 1, 4, 8)
nest_behavior: AllowNest
extra_margin: 0
nodes:
[1] FST: IDENTIFIER
val: "f"
line_offset: 5
indent: 4
[2] FST: PUNCTUATION
val: "("
line_offset: 6
indent: 8
[3] FST: NEWLINE
val: "
"
line_offset: -1
indent: 0
[4] FST: Binary 6
(1, 1, 4, 5)
nest_behavior: AllowNest
extra_margin: 1
nodes:
[1] FST: IDENTIFIER
val: "y"
line_offset: 7
indent: 8
[2] FST: WHITESPACE
val: " "
line_offset: -1
indent: 8
[3] FST: OPERATOR
val: "+"
line_offset: 9
indent: 8
[4] FST: WHITESPACE
val: " "
line_offset: -1
indent: 0
[5] FST: PLACEHOLDER
val: ""
line_offset: -1
indent: 8
[6] FST: IDENTIFIER
val: "z"
line_offset: 11
indent: 8
[5] FST: TRAILINGCOMMA
val: ","
line_offset: -1
indent: 4
[6] FST: NEWLINE
val: "
"
line_offset: -1
indent: 0
[7] FST: PUNCTUATION
val: ")"
line_offset: 12
indent: 0
Printing the FST
Once the FST has been nested the rest is straightforward: traverse the tree and print it out into a string!
using JuliaFormatter: print_tree
io = IOBuffer()
print_tree(io, fst, state)
String(take!(io)) |> printx = f(
y + z,
)