Parser

Parser is responsible for interpreting a SQL string into an abstract syntax tree (AST), which is more structural and easier to process. AST can be used for preprocessing, syntactic analysis, and so on.

The code lives in the pingcap/tidb repo, parser directory.

Understand Parser

Parser is generated by a parser generator named yacc. It takes the grammar file parser.y as the input and outputs the source code file parser.go, which is the real parser imported by TiDB. Thus, the core file is parser.y because when the SQL syntax changes, most of the changes take place in parser.y.

In case you are unfamiliar with yacc, some concepts are listed here:

  • Terminal Symbol is also known as "token". When a SQL string reaches parser, the first step is to tokenize them into an array of tokens. For example, "SELECT * FROM t" is tokenized to [selectKeyword, '*', fromKeyword, identifier(t)] by lexer.Lex().
  • Non-terminal Symbol is a syntactic variable, which can represent a group of terminal/non-terminal symbols.
  • Grammar Rule specifies which symbols can replace which other non-terminal symbol.
  • Semantic Action defines how an AST node is constructed.

An example of a grammar rule is as follows:

AlterDatabaseStmt:
	"ALTER" DatabaseSym DBName DatabaseOptionList
	{
		$$ = &ast.AlterDatabaseStmt{
			Name:                 $3,
			AlterDefaultDatabase: false,
			Options:              $4.([]*ast.DatabaseOption),
		}
	}
  • AlterDatabaseStmt is a non-terminal symbol because there is no such token.
  • "ALTER" is a terminal symbol.
  • DatabaseSym, DBName and DatabaseOptionList are non-terminal symbols that are defined in other grammar rules.
  • The pseudo-code in brackets is the semantic action. It means an AST node ast.AlterDatabaseStmt will be constructed when the rule is reduced by the parser. Note that a dollar character $ followed by a number represents the binding Golang value previously (in other rules), where the number is the index of symbol in rule (1-based). $$ represents current binding value. After goyacc substitution, this code snippet will be valid Golang code.

Getting back to parser.y, the structure of this file is divided into three parts:

  1. %union enumerates all the Golang types that can be passed around grammar rules.
  2. %token or %type declares the terminal or non-terminal symbols that will be used in grammar rules.
  3. Grammar rules define the syntax of SQL and related semantic actions.

Except for parser.y, other sub-package/files should be easy to understand, feel free to explore them by yourself:

PackageDescription
astThe AST definition used by TiDB.
authAuthentication-related functions.
charsetCurrently supported charsets and encodings.
formatThe Formatters for yacc file and the functions for restoring AST to SQL.
goyaccThe generator for parser.go.
modelBasic structures in TiDB like TableInfo, ColumnInfo...
mysqlMySQL constants, errors, privileges, types, and others.
opcodeOperator code like <, >, +, =...
terrorThe errors used by TiDB.
test_driverA parser driver only for unit tests.
tidbTiDB related features' keywords.
typesThe field types and evaluation types that used in TiDB.

Develop and Build

To get started with the parser development, please also take a look at quickstart.md. It shows the basic usage of the parser and it explains some concepts like parser_driver.

Run make parser in the project root directory to generate a new parser.go.

FAQ

  1. How to debug the parsing procedure?

Put the test file in the parser package. Set yyDebug level to 4(or any integers >= 4) before calling Parse(). The parser will try to show state information in each step.

  1. How to resolve shift-reduce or reduce-reduce conflicts?

Shift means "move the next token in" to match the current rule. Reduce means "replace current tokens/symbols to a non-terminal symbol". Shift-reduce conflicts occur when the parser cannot decide the next step is to shift or to reduce.

When yacc reports such conflicts, it also keeps the file y.output. You can search "conflict on" in the file to locate which rule conflicts with other rules. Then you can try to annotate the %precedence to tokens, rewrite the grammar rule, or ask for help on GitHub.