Go compiler explained (1) - The parser
Introduction⌗
In this series we are going to take a look at some of the inner workings of the Go compiler.
Starting from a Go package and finishing with a native binary is a process that goes roughly through 4 steps (each will hopefully be somewhat explained in its own blog post):
- Lexing, Parsing and AST generation
- Type checking
- SSA generation
- Machine code generation
This post is part 1 and touches on the lexer/parser aspect of the process.
Note: The Go compiler codebase can be found here
Audience⌗
- This is not a compiler tutorial; these are merely notes I took while reading the code. They can add some context to code segments, but they won't teach you how to write a compiler from scratch.
- I am assuming that the reader is familiar with some compiler lingo, or at least willing to Google :)
The Lexer⌗
A lexer is a simple software component that goes through a source file and extracts lexical tokens from it (examples: var, "hello", 1123, ().
The Go source lexer is no different: it has a list of known tokens and token patterns, and it exposes a scanner struct with a next method to walk through the tokens in the order they are discovered.
Every call to the next method tries to find the next available token based on the scanner's state, or reports an error back via an error handler function supplied at construction time.
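The compiler's scanner lives in an internal package, so it can't be imported directly, but the standard library ships a very similar standalone lexer in go/scanner. Here is a runnable snippet showing tokens being produced one Scan call at a time (the filename and source are made up for the example):

package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

func main() {
	src := []byte(`var greeting = "hello" + name`)

	// The scanner records positions relative to a *token.File.
	fset := token.NewFileSet()
	file := fset.AddFile("example.go", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, src, nil /* error handler */, 0)

	// Walk the tokens in the order they are discovered.
	for {
		pos, tok, lit := s.Scan()
		if tok == token.EOF {
			break
		}
		fmt.Printf("%s\t%s\t%q\n", fset.Position(pos), tok, lit)
	}
}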
Now let's go through an example from the compiler's own scanner:
Say the lexer is at the character ": this can only mean that the current token is a string, so the lexer runs the stdString method to verify that the current token is in fact a string (see inline comments for details):
func (s *scanner) stdString() {
	ok := true // marks whether the string is lexically valid
	s.nextch() // move the underlying source pointer past the opening '"'
	for {
		if s.ch == '"' {
			s.nextch() // move the source pointer to the character after the closing '"'
			break
		}
		...
		if s.ch == '\n' { // an example of an error: standard strings can't contain newline characters
			s.errorf("newline in string")
			ok = false // the literal token will be marked as invalid because of the ok value
			break
		}
		if s.ch < 0 { // we never closed the opening '"' and reached EOF
			s.errorAtf(0, "string not terminated")
			ok = false
			break
		}
		s.nextch() // everything else goes
	}
	// record the token on the scanner as a string literal, valid or not.
	s.setLit(StringLit, ok)
}
Almost all of the Go tokens are identified in a similar fashion. I'll leave the details as an exercise for the reader, as the lexer is one of the simplest components in the compilation process.
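That said, to give a feel for how the scanner decides which token function to run, here is a heavily simplified sketch of the dispatch inside the scanner's next method; the helper names are illustrative, not the compiler's exact source:

func (s *scanner) next() {
	// ... skip whitespace and comments ...
	switch {
	case s.ch == '"':
		s.stdString() // the method we just walked through
	case s.ch == '`':
		s.rawString() // raw string literals
	case isDigit(s.ch):
		s.number() // numeric literals
	case isLetter(s.ch):
		s.ident() // identifiers and keywords like var
	// ... operators, delimiters, and so on
	}
}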
The Parser⌗
The parser is (in very simple terms) a software component that, given a source file, constructs an AST (Abstract Syntax Tree) from it: a tree of nodes that have semantic relationships between them according to the language definition.
The Go compiler defines its own list of nodes.
Every node must adhere to the Node interface, which states that every node should be able to report its position in the source. To accomplish this, the compiler authors used a very neat trick: they created a dummy struct that implements the Node interface and added it as an embedded field to all of the other node definitions. This way, they didn't have to implement the same thing over and over again for every node (neat).
The same trick is used to mark nodes that are declarations and nodes that are expressions. The linked Go file has all of the node definitions; please take a look (it is very well documented) before continuing.
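To make the trick concrete, here is a stripped-down sketch of the idea; the names are simplified, and the real definitions in the compiler's syntax package carry more detail:

// A stripped-down sketch of the embedding trick (not the compiler's actual source).
type Pos struct{ Line, Col uint } // stand-in for the real position type

type Node interface {
	Pos() Pos
	aNode() // unexported marker: only types embedding node can satisfy Node
}

// node is the dummy struct; it implements Node once, for everyone.
type node struct {
	pos Pos
}

func (n *node) Pos() Pos { return n.pos }
func (*node) aNode()     {}

// Every concrete node embeds node and gets Pos() (and Node-ness) for free.
type VarDecl struct {
	NameList []string // simplified; the real field holds name nodes
	Values   []Node
	node
}

var _ Node = (*VarDecl)(nil) // compile-time proof that VarDecl is a Node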
The parser uses the lexer to produce tokens from the given source file. The parsing loop is fairly simple, as shown below:
for p.tok != _EOF {
	switch p.tok {
	case _Const:
		p.next()
		f.DeclList = p.appendGroup(f.DeclList, p.constDecl)
	case _Type:
		p.next()
		f.DeclList = p.appendGroup(f.DeclList, p.typeDecl)
	case _Var:
		p.next()
		f.DeclList = p.appendGroup(f.DeclList, p.varDecl)
	case _Func:
		p.next()
		if d := p.funcDeclOrNil(); d != nil {
			f.DeclList = append(f.DeclList, d)
		}
	// ...
	}
}
// ...
return f
Parsing a file is as simple as taking the first token of each new top-level declaration, deciding how to parse it, and adding the result to the children of the current root node (i.e. the file).
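The compiler's parser is also internal, but the standard library's go/parser behaves the same way at the top level. This runnable snippet shows the per-declaration result (keep in mind it produces go/ast nodes, which, as noted at the end of this post, are not the compiler's own nodes):

package main

import (
	"fmt"
	"go/parser"
	"go/token"
)

func main() {
	src := `package main

const answer = 42

var a = 10

func main() {}
`
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "example.go", src, 0)
	if err != nil {
		panic(err)
	}
	// f.Decls plays the role of DeclList: one entry per top-level declaration.
	for _, decl := range f.Decls {
		fmt.Printf("%T at %s\n", decl, fset.Position(decl.Pos()))
	}
}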
The parser itself follows a classic recursive-descent design: each non-terminal rule has a corresponding function that handles parsing that rule specifically. For example, the ConstDecl declaration node has a corresponding parsing function constDecl in the parser, and you can expect the same for the other declarations/expressions as well.
For a better understanding of parsing top-level declarations, let's walk through an example: the var declaration.
The Go language allows declaring top-level variables using the var keyword; if you are reading this and have never seen Go code before, it looks something like this:
var a = 10            // simple declaration
var b int             // simple declaration with a type
var c, d int          // two variable declarations
var e, f int = 10, 12 // two variable declarations with a type and values
To parse this declaration, the parser uses the varDecl function:
func (p *parser) varDecl(group *Group) Decl {
	// ...
	d := new(VarDecl)
	d.pos = p.pos() // position metadata for error messages and tooling
	d.Group = group // the surrounding var ( ... ) group, if any
	d.Pragma = p.takePragma()

	// Parse the comma-separated names on the left-hand side.
	d.NameList = p.nameList(p.name())
	if p.gotAssign() {
		// e.g. var a, b = 1, 2: no type, values only.
		d.Values = p.exprList()
	} else {
		// e.g. var a, b int: a type, optionally followed by values.
		d.Type = p.type_()
		if p.gotAssign() {
			d.Values = p.exprList()
		}
	}
	return d
}
So we start by creating a VarDecl node, since this is what we expect to return from this function invocation, and we initialize it with some metadata from the parser, such as the current position, the Group, etc.
Next we parse a nameList, which takes care of the comma-separated names on the left side of the declaration. From there it's one of two things: either we find a = token, which means it's an initialization with values, and we set the values to the parsed expression list; or a type declaration comes first, and we parse it before trying to parse the (optional) values of the variables.
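To tie this back to the earlier examples, here is how varDecl's fields end up populated for each form (illustrative):

// var a = 10            -> NameList=[a]    Type=nil  Values=[10]
// var b int             -> NameList=[b]    Type=int  Values=nil
// var c, d int          -> NameList=[c,d]  Type=int  Values=nil
// var e, f int = 10, 12 -> NameList=[e,f]  Type=int  Values=[10, 12]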
Note about AST nodes⌗
The AST nodes defined by the compiler are not the same ones defined in the go/ast package that is used by Go tools like gofmt.