Implementation of Vectorized Execution
This section introduces the implementation details of the TiDB vectorized execution model.
Understanding Vectorized Execution
Vectorized execution, also known as batch processing, is a method of processing data in batches, rather than row by row. Traditional row-based processing handles one row at a time, which can lead to significant overhead and reduced efficiency, especially when dealing with large datasets. Vectorized execution, on the other hand, processes data in chunks or vectors, allowing for better utilization of CPU and memory resources.
Key Benefits of Vectorized Execution
- Improved CPU Utilization: Processing data in batches minimizes the overhead associated with instruction fetching and decoding, leading to better CPU utilization and improved performance.
- Reduced Memory Access: Data processed in vectors is more likely to be present in the CPU cache, reducing the need for memory access, which is often a performance bottleneck.
- Reduced Branching: Traditional row-based processing often involves conditional statements and branching, which can hinder performance. Vectorized execution minimizes branching, leading to more predictable and faster execution.
Implementing Vectorized Execution in TiDB
TiDB leverages a memory layout similar to Apache Arrow to enable the execution of a batch of data at a time. This approach allows for efficient data processing, reducing overhead and improving performance.
Columnar Memory Layout Implementation in TiDB
In TiDB, the columnar memory layout is defined as a `Column`, and a batch of these `Column`s is referred to as a `Chunk`. The implementation of `Column` draws inspiration from Apache Arrow, ensuring efficient data processing. Depending on the data type they store, TiDB employs two types of `Column`s:
- Column with Fixed Length: These `Column`s store data of a fixed length, such as Double, Bigint, Decimal, and similar data types. This structure is optimized for predictable and consistent data sizes, facilitating efficient processing.
- Column with Variable Length: These `Column`s accommodate variable-length data types, including Char, Varchar, and others. Variable-length data types can hold a range of character lengths, and the `Column` structure adapts to handle such variability.

In TiDB, the `Column` and `Chunk` types are defined as follows:
```go
type Column struct {
    length     int     // the number of elements in the column
    nullBitmap []byte  // bitmap indicating null values
    offsets    []int64 // used for varLen columns; row i starts from data[offsets[i]]
    data       []byte  // the actual data
    elemBuf    []byte  // used to indicate the byte length of each element for fixed-length objects
    // ... (additional fields)
}

type Chunk struct {
    columns  []*Column
    sel      []int // sel indicates which rows are selected; if it is nil, all rows are selected
    capacity int   // capacity indicates the maximum number of rows this chunk can hold
    // ... (additional fields)
}
```
Column and Chunk Data Manipulation
TiDB supports various data manipulation operations on `Column` and `Chunk`:
Appending a Fixed-Length Non-NULL Value to a Column:
- To append an element, a specific append method tailored to the data type is called (e.g., `AppendInt64`).
- The data to be appended is shallow copied into `elemBuf` using an `unsafe.Pointer`.
- The data in `elemBuf` is then appended to `data`.
- A `1` is appended to the `nullBitmap`.
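A minimal sketch of these steps, written against the `Column` struct above. The bit-level `nullBitmap` handling is simplified, and the method bodies are illustrative rather than the exact TiDB implementation; `elemBuf` is assumed to be pre-allocated with the element size (8 bytes for int64):

```go
import "unsafe"

// appendNullBitmap appends one bit to the null bitmap; 1 means non-NULL.
func (c *Column) appendNullBitmap(notNull bool) {
    idx := c.length >> 3
    if idx >= len(c.nullBitmap) {
        c.nullBitmap = append(c.nullBitmap, 0)
    }
    if notNull {
        c.nullBitmap[idx] |= byte(1 << (uint(c.length) & 7))
    }
}

// AppendInt64 appends a fixed-length, non-NULL int64 value.
func (c *Column) AppendInt64(v int64) {
    // Shallow-copy the value into elemBuf through an unsafe.Pointer.
    *(*int64)(unsafe.Pointer(&c.elemBuf[0])) = v
    // Append the bytes of elemBuf to data, then mark the row as non-NULL.
    c.data = append(c.data, c.elemBuf...)
    c.appendNullBitmap(true)
    c.length++
}
```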
Appending a Non-Fixed-Length Non-NULL Value to a Column:
- To append a variable-length data value, such as a string, it is directly appended to `data`.
- A `1` is appended to the `nullBitmap`.
- The starting point of the newly appended data within `data` is recorded in `offsets`.
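A corresponding sketch for the variable-length case, again illustrative; it assumes `offsets` is initialized with a leading `0`, so that `offsets[i]` and `offsets[i+1]` bracket row `i`:

```go
// AppendString appends a variable-length, non-NULL string value.
func (c *Column) AppendString(s string) {
    // The bytes go straight into data; no elemBuf is involved.
    c.data = append(c.data, s...)
    c.appendNullBitmap(true)
    // Record where the next element will start, i.e. where this one ends.
    c.offsets = append(c.offsets, int64(len(c.data)))
    c.length++
}
```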
Appending a NULL Value to a Column:
- To append a NULL value, the `AppendNull` function is used.
- A `0` is appended to the `nullBitmap`.
- If it is a fixed-length column, placeholder data of the same size as `elemBuf` is appended to `data`.
- If it is a variable-length column, no data is appended to `data`, but the starting point of the next element is recorded in `offsets`.
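Both cases fit into one sketch of `AppendNull`; the `isFixed` helper below (treating a non-nil `elemBuf` as the marker of a fixed-length column) is an assumption made for illustration:

```go
// isFixed reports whether the column stores fixed-length elements.
func (c *Column) isFixed() bool { return c.elemBuf != nil }

// AppendNull appends a NULL value to the column.
func (c *Column) AppendNull() {
    c.appendNullBitmap(false) // a 0 bit marks the row as NULL
    if c.isFixed() {
        // Fixed-length column: append placeholder bytes of the element size,
        // so rowIdx*elemSize still addresses every row correctly.
        c.data = append(c.data, c.elemBuf...)
    } else {
        // Variable-length column: append no data, but still record the
        // starting point of the next element.
        c.offsets = append(c.offsets, int64(len(c.data)))
    }
    c.length++
}
```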
Reading a Value from a Column:
- Values in a `Column` can be read using functions like `GetInt64(rowIdx)` and `GetString(rowIdx)`. The reading principle mirrors the appending mechanism described above: the specified element is located in the `Column` by its row index, using the same layout rules that governed appending.
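Continuing the sketch above, reading simply inverts the layout logic (bounds checks omitted):

```go
// GetInt64 reads the fixed-length element at rowIdx; an int64 occupies
// 8 bytes, so the element starts at byte offset rowIdx*8.
func (c *Column) GetInt64(rowIdx int) int64 {
    return *(*int64)(unsafe.Pointer(&c.data[rowIdx*8]))
}

// GetString reads the variable-length element at rowIdx using the
// offsets recorded while appending.
func (c *Column) GetString(rowIdx int) string {
    return string(c.data[c.offsets[rowIdx]:c.offsets[rowIdx+1]])
}
```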
Reading a Row from a Chunk:
- Within a `Chunk`, the concept of a Row is logical: the data of a row is stored across the different `Column`s of the `Chunk`, and the fields of the same row are not necessarily stored consecutively in memory. Obtaining a `Row` object therefore requires no data copying, as the data of the row already resides in the corresponding `Column`s.
- The concept of a `Row` is useful because, during the operation of operators, data is often accessed and processed on a per-row basis; for example, aggregation, sorting, and similar tasks work with data at the row level.
- You can retrieve a row from a `Chunk` using the `GetRow(rowIdx)` function. Once you have a `Row` object, you can further access the data in specific columns within that row using functions like `Row::GetInt64(colIdx)`, which retrieves the value at the specified column index for that row.
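Because a `Row` is only a logical view, it can be represented as a chunk pointer plus a row index; a minimal sketch (the `NumRows` helper is an assumption added for the later examples):

```go
// Row is a logical view of one row inside a Chunk; it copies no data.
type Row struct {
    c   *Chunk
    idx int
}

// GetRow returns a lightweight handle to the rowIdx-th row.
func (c *Chunk) GetRow(rowIdx int) Row {
    return Row{c: c, idx: rowIdx}
}

// NumRows returns the number of rows currently held by the chunk.
func (c *Chunk) NumRows() int {
    return c.columns[0].length
}

// GetInt64 reads column colIdx of this row by delegating to the Column.
func (r Row) GetInt64(colIdx int) int64 {
    return r.c.columns[colIdx].GetInt64(r.idx)
}
```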
Examples
How an expression is evaluated
In this section, we'll use the TiDB expression `colA * 0.8 + colB` to demonstrate how expression evaluation works using vectorized execution and to highlight the performance gap between row-based execution and vectorized execution.
Expression Tree Representation
The TiDB expression `colA * 0.8 + colB` is parsed into an expression evaluation tree, where each non-leaf node represents an arithmetic operator and the leaf nodes represent the data sources. Each leaf node can be either a constant (like `0.8`) or a field (like `colA`) in the table. The parent-child relationship between nodes indicates a computational dependency: the evaluation result of the child node is the input data for the parent node.
```
                             ┌─┐
                         ┌───┤+├───┐
                         │   └─┘   │
                        ┌┴┐       ┌┴┐
colA*0.8+colB───►    ┌──┤*├──┐    │B│
                     │  └─┘  │    └─┘
                    ┌┴┐    ┌─┴─┐
                    │A│    │0.8│
                    └─┘    └───┘
```
Non-Vectorized Execution
In a non-vectorized execution model, the computing logic of each node can be abstracted using the following evaluation interface:
```go
type Node interface {
    evalReal(row Row) (val float64, isNull bool)
}
```
Taking the `*`, `0.8`, and `colA` nodes as examples, all three implement the interface above. Their pseudocode is as follows:
```go
func (node *multiplyRealNode) evalReal(row Row) (float64, bool) {
    v1, null1 := node.leftChild.evalReal(row)
    v2, null2 := node.rightChild.evalReal(row)
    return v1 * v2, null1 || null2
}

func (node *constantNode) evalReal(row Row) (float64, bool) {
    return node.someConstantValue, node.isNull // 0.8 in this case
}

func (node *columnNode) evalReal(row Row) (float64, bool) {
    return row.GetReal(node.colIdx)
}
```
In non-vectorized execution, the expression tree is interpreted once per row. Each time `evalReal` performs a multiplication, only a few of the executed instructions do the "real" multiplication; the bulk of the assembly instructions go into the surrounding function calls and interface dispatch.
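To make the per-row overhead concrete, a hypothetical driver loop might look as follows; `root`, `input`, `result`, and `AppendFloat64` are illustrative names following the sketches above, not TiDB API:

```go
// Row-at-a-time evaluation: the whole expression tree is re-interpreted
// for every row, so the dispatch overhead is paid NumRows() times per node.
for i := 0; i < input.NumRows(); i++ {
    val, isNull := root.evalReal(input.GetRow(i))
    if isNull {
        result.AppendNull()
    } else {
        result.AppendFloat64(val)
    }
}
```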
Vectorized Execution
In vectorized execution, the interface to evaluate an expression in a batch manner in TiDB looks like this:
```go
type VecNode interface {
    vecEvalReal(input *Chunk, result *Column)
}
```
Taking `multiplyRealNode` as an example:
```go
func (node *multiplyRealNode) vecEvalReal(input *Chunk, result *Column) {
    buf := pool.AllocColumnBuffer(TypeReal, input.NumRows())
    defer pool.ReleaseColumnBuffer(buf)
    node.leftChild.vecEvalReal(input, result)
    node.rightChild.vecEvalReal(input, buf)

    f64s1 := result.Float64s()
    f64s2 := buf.Float64s()
    result.MergeNulls(buf)
    for i := range f64s1 {
        if result.IsNull(i) {
            continue
        }
        f64s1[i] *= f64s2[i]
    }
}
```
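For comparison, the leaf nodes fill their result column in a single pass. The sketch below is illustrative; `ResizeFloat64`, `Column(colIdx)`, and `CopyConstruct` are hypothetical stand-ins for whatever resize, accessor, and copy helpers the real column type provides:

```go
func (node *constantNode) vecEvalReal(input *Chunk, result *Column) {
    result.ResizeFloat64(input.NumRows()) // hypothetical resize helper
    f64s := result.Float64s()
    for i := range f64s {
        f64s[i] = node.someConstantValue // 0.8 in this case
    }
}

func (node *columnNode) vecEvalReal(input *Chunk, result *Column) {
    // One bulk copy of the source column replaces NumRows() interface calls.
    input.Column(node.colIdx).CopyConstruct(result) // hypothetical copy helper
}
```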
Compared with row-by-row interpretation, this implementation reduces interpretation overhead through batch processing, which also suits modern CPUs better:
- A vector of data is sequentially accessed. This reduces CPU cache misses.
- Most of the computational work is within a simple loop. This facilitates CPU branch prediction and instruction pipelining.
We use the same dataset (1024 rows formed by two columns of floating-point numbers) to compute `colA * 0.8 + colB` in two ways: row-based execution and vectorized execution. The results are as follows:

```
BenchmarkVec-12    152166     7056 ns/op    0 B/op    0 allocs/op
BenchmarkRow-12     28944    38461 ns/op    0 B/op    0 allocs/op
```
The results above show that vectorized execution is more than five times faster than row-based execution (7056 ns/op versus 38461 ns/op). For more details about vectorized expression evaluation, you can refer to this link.
How operators are evaluated
In this section, we'll dive deeper into the evaluation of operators, focusing on HashJoin as an example.
HashJoin in vectorized execution consists of the following steps:
Hashing Phase
Let's take `t` as the table used to construct the hash table. The data from table `t` is read into `Chunk`s in batches. First, the data in each `Chunk` is filtered column by column based on the predicates on table `t`, and the filtered results of these columns are combined into a `selected` array, in which true values indicate valid rows. The relevant code is available in VectorizedFilter.

Subsequently, the hash values of the remaining valid rows in the `Chunk` are calculated column-wise. If multiple columns are used in the hash, their values are concatenated to form the final hash key of a row. Further details can be found in HashChunkSelected.

Finally, the `selected` array is used for filtering, and the hash key of each valid row, together with its row pointer, is used to construct the hash table.
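A schematic of this build phase is sketched below. It is not the TiDB implementation: the chunk fetcher, filter, and hashing steps are passed in as hypothetical function parameters standing in for the TiDB functions referenced above, and `RowPtr` is an illustrative type:

```go
// RowPtr locates a build-side row by chunk index and row index.
type RowPtr struct {
    ChkIdx, RowIdx int
}

// buildHashTable sketches the build phase of vectorized HashJoin.
func buildHashTable(
    fetch func() *Chunk,                    // reads table t in batches
    filter func(*Chunk) []bool,             // column-wise predicate evaluation
    hashKeys func(*Chunk, []bool) []uint64, // column-wise key hashing
) map[uint64][]RowPtr {
    ht := make(map[uint64][]RowPtr)
    for chkIdx := 0; ; chkIdx++ {
        chk := fetch()
        if chk == nil || chk.NumRows() == 0 {
            return ht // build side exhausted
        }
        selected := filter(chk) // true marks a valid row
        keys := hashKeys(chk, selected)
        for i := 0; i < chk.NumRows(); i++ {
            if !selected[i] {
                continue // filtered out by the predicates on t
            }
            // Store the hash key together with the row pointer.
            ht[keys[i]] = append(ht[keys[i]], RowPtr{ChkIdx: chkIdx, RowIdx: i})
        }
    }
}
```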
Probe Phase
The probe phase in HashJoin mirrors the build phase. Initially, data from the probe table is read into `Chunk`s in batches. Predicates are applied to filter each `Chunk` column by column, and a `selected` array is generated to mark valid rows. The hash keys are then calculated for each of the valid rows.
For the valid rows in the `Chunk`, the calculated hash values are used to probe the hash table constructed during the build phase and identify matching rows. The implementation can be explored in join2Chunk.
Matching and Output
Upon discovering matching rows in the hash table, the outcomes are output as joined rows and saved in a new `Chunk`. For deeper insights, see the code in joinMatchedProbeSideRow2Chunk.
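A schematic of the probe-and-output path, reusing the hypothetical helper signatures from the build sketch; `emit` stands in for the logic that appends one joined row to the output chunk:

```go
// probeAndJoin sketches the probe, match, and output steps for one chunk.
func probeAndJoin(
    chk *Chunk,             // one batch read from the probe side
    ht map[uint64][]RowPtr, // hash table built in the previous phase
    filter func(*Chunk) []bool,
    hashKeys func(*Chunk, []bool) []uint64,
    emit func(probeRow Row, build RowPtr), // writes one joined row
) {
    // Filter and hash the probe side exactly as in the build phase.
    selected := filter(chk)
    keys := hashKeys(chk, selected)
    for i := 0; i < chk.NumRows(); i++ {
        if !selected[i] {
            continue
        }
        // Probe the hash table; each RowPtr is a candidate match.
        for _, ptr := range ht[keys[i]] {
            emit(chk.GetRow(i), ptr)
        }
    }
}
```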
Vectorized computation in HashJoin offers considerable performance advantages over row-based computation. By processing data in batches, it minimizes instruction fetch and decode overhead, improves CPU utilization, reduces memory accesses and conditional branches, and increases opportunities for parallelism. These properties make vectorized HashJoin highly efficient when processing large datasets.
Conclusion
In conclusion, TiDB's columnar `Column` and `Chunk` layout, inspired by Apache Arrow, pairs naturally with vectorized execution: it improves CPU utilization, reduces memory access overhead, and curbs branching, resulting in substantially faster and more efficient query processing.