Optimistic Transaction

Under the optimistic transaction model, modification conflicts are detected as part of the transaction commit. A TiDB cluster uses the optimistic transaction model by default before version 3.0.8 and the pessimistic transaction model from version 3.0.8 onwards. The system variable tidb_txn_mode controls whether TiDB uses optimistic or pessimistic transaction mode.
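
As a quick usage illustration, the mode can be selected per session or per statement from any MySQL client. The following is a minimal sketch, assuming a local TiDB instance on port 4000 and the go-sql-driver/mysql driver; the DSN and database name are assumptions, not taken from this document.

package main

import (
    "context"
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql"
)

func main() {
    // assumed DSN for a local TiDB instance; adjust to your deployment
    db, err := sql.Open("mysql", "root@tcp(127.0.0.1:4000)/test")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    ctx := context.Background()
    // pin one connection so the session variable applies to later statements
    conn, err := db.Conn(ctx)
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    // make all later transactions in this session optimistic ...
    if _, err := conn.ExecContext(ctx, "SET SESSION tidb_txn_mode = 'optimistic'"); err != nil {
        log.Fatal(err)
    }
    // ... or choose the mode for a single transaction explicitly
    if _, err := conn.ExecContext(ctx, "BEGIN OPTIMISTIC"); err != nil {
        log.Fatal(err)
    }
    if _, err := conn.ExecContext(ctx, "COMMIT"); err != nil {
        log.Fatal(err)
    }
}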

This document talks about the implementation of optimistic transactions in TiDB. It is recommended that you already understand the principles of optimistic transactions.

This document refers to the code of TiDB v5.2.1.

Begin Optimistic Transaction

The main function call stack to start an optimistic transaction is as follows.

(a *ExecStmt) Exec
    (a *ExecStmt) handleNoDelay
        (a *ExecStmt) handleNoDelayExecutor
            Next
                (e *SimpleExec) Next
                    (e *SimpleExec) executeBegin

The function (e *SimpleExec) executeBegin does the main work for a "BEGIN" statement. The important comments and simplified code are as follows. The complete code is here.

/*
Check that the syntax "start transaction read only" and "as of timestamp" is used correctly.
If a stale read timestamp was set, create a new stale read transaction, set the "in transaction" state, and return.
Otherwise, create a new transaction and set properties such as the snapshot and startTS.
*/

func (e *SimpleExec) executeBegin(ctx context.Context, s *ast.BeginStmt) error {
    if s.ReadOnly {
        // the statement "start transaction read only" requires tidb_enable_noop_functions to be true
        // the statement "start transaction read only as of timestamp" can be used whether
        // tidb_enable_noop_functions is true or false, but tx_read_ts must not be set
        // the statement "start transaction read only as of timestamp" must ensure the timestamp is within the legal safe point range
        enableNoopFuncs := e.ctx.GetSessionVars().EnableNoopFuncs
        if !enableNoopFuncs && s.AsOf == nil {
            return expression.ErrFunctionsNoopImpl.GenWithStackByArgs("READ ONLY")
        }

        if s.AsOf != nil {
            if e.ctx.GetSessionVars().TxnReadTS.PeakTxnReadTS() > 0 {
                return errors.New("start transaction read only as of is forbidden after set transaction read only as of")
            }
        }
    }

    // process a stale read transaction
    if e.staleTxnStartTS > 0 {
        // check that the stale read timestamp is valid
        if err := e.ctx.NewStaleTxnWithStartTS(ctx, e.staleTxnStartTS); err != nil {
            return err
        }

        // ignore the tidb_snapshot configuration if in a stale read transaction
        vars := e.ctx.GetSessionVars()
        vars.SetSystemVar(variable.TiDBSnapshot, "")

        // set the "in transaction" state and return
        vars.SetInTxn(true)

        return nil
    }

    // If BEGIN is the first statement in TxnCtx, we can reuse the existing transaction, without
    // the need to call NewTxn, which commits the existing transaction and begins a new one.
    // If the last un-committed/un-rollbacked transaction is a time-bounded read-only transaction,
    // we should always create a new transaction.
    txnCtx := e.ctx.GetSessionVars().TxnCtx
    if txnCtx.History != nil || txnCtx.IsStaleness {
        if err := e.ctx.NewTxn(ctx); err != nil {
            return err
        }
    }

    // set the "in transaction" state
    e.ctx.GetSessionVars().SetInTxn(true)

    // create a new transaction and set properties such as the snapshot and startTS
    txn, err := e.ctx.Txn(true)
    if err != nil {
        return err
    }
    // set the linearizability option
    if s.CausalConsistencyOnly {
        txn.SetOption(kv.GuaranteeLinearizability, false)
    }

    return nil
}
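
As a usage example of the read-only branch checked above, a stale read transaction can be started as follows. This is a minimal sketch, assuming a *sql.Conn as in the earlier example; the timestamp literal and table name are assumptions, and the timestamp must lie within the cluster's GC safe point range or TiDB rejects the statement.

import (
    "context"
    "database/sql"
)

// staleReadCount starts a read-only transaction that reads data as of a
// historical timestamp; executeBegin takes the stale read path for it.
func staleReadCount(ctx context.Context, conn *sql.Conn) (int, error) {
    if _, err := conn.ExecContext(ctx,
        "START TRANSACTION READ ONLY AS OF TIMESTAMP '2021-09-01 10:00:00'"); err != nil {
        return 0, err
    }
    var cnt int
    // queries inside the transaction see the snapshot at that timestamp
    if err := conn.QueryRowContext(ctx, "SELECT count(*) FROM t1").Scan(&cnt); err != nil {
        return 0, err
    }
    if _, err := conn.ExecContext(ctx, "COMMIT"); err != nil {
        return 0, err
    }
    return cnt, nil
}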

DML Executed In Optimistic Transaction

There are three kinds of DML operations: update, delete, and insert. For simplicity, this article only describes the execution process of the update statement. DML mutations are not sent to TiKV directly; they are buffered in TiDB temporarily, and the commit operation finally makes the mutations take effect.
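
To make the buffering concrete, here is a toy sketch of the idea. It is illustrative only and is not TiDB's actual data structure; the real buffer is the transaction's in-memory MemDB inside client-go.

// toyTxn is an illustrative stand-in for a transaction's write buffer:
// mutations are staged in memory and only pushed out at commit time.
type toyTxn struct {
    buf map[string][]byte // staged mutations, keyed by encoded row key
}

// Set buffers a mutation locally; nothing is sent to TiKV yet.
func (t *toyTxn) Set(key string, val []byte) {
    t.buf[key] = val
}

// Commit flushes the buffered mutations; in TiDB this is where the
// two-phase commit described later takes over.
func (t *toyTxn) Commit(send func(key string, val []byte) error) error {
    for k, v := range t.buf {
        if err := send(k, v); err != nil {
            return err
        }
    }
    return nil
}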

The main function call stack to execute an update statement such as "update t1 set id2 = 2 where pk = 1" is as follows.

(a *ExecStmt) Exec
    (a *ExecStmt) handleNoDelay
        (a *ExecStmt) handleNoDelayExecutor
            (e *UpdateExec) updateRows
                Next
                    (e *PointGetExecutor) Next

(e *UpdateExec) updateRows

The function (e *UpdateExec) updateRows does the main work for the update statement. The important comments and simplified code are as follows. The complete code is here.

/*
Take a batch of rows that need to be updated each time.
Traverse every row in the batch, make the handle that uniquely identifies the row, and generate a new row.
Write the new row to the table.
*/

func (e *UpdateExec) updateRows(ctx context.Context) (int, error) {
    globalRowIdx := 0
    chk := newFirstChunk(e.children[0])
    // composeFunc generates a new row
    composeFunc := e.composeNewRow
    totalNumRows := 0

    for {
        // call the "Next" method recursively until every child executor is finished; every "Next" returns a batch of rows
        err := Next(ctx, e.children[0], chk)
        if err != nil {
            return 0, err
        }

        // if all rows have been updated, stop
        if chk.NumRows() == 0 {
            break
        }

        for rowIdx := 0; rowIdx < chk.NumRows(); rowIdx++ {
            // take one row from the batch above
            chunkRow := chk.GetRow(rowIdx)
            // convert the data from chunk.Row to types.DatumRow, stored by fields
            datumRow := chunkRow.GetDatumRow(fields)
            // generate the handle, which uniquely identifies the row
            e.prepare(datumRow)
            // compose non-generated columns
            newRow, err := composeFunc(globalRowIdx, datumRow, colsInfo)
            if err != nil {
                return 0, err
            }
            // merge non-generated columns
            e.merge(datumRow, newRow, false)
            // compose and merge generated columns
            if e.virtualAssignmentsOffset < len(e.OrderedList) {
                newRow = e.composeGeneratedColumns(globalRowIdx, newRow, colsInfo)
                e.merge(datumRow, newRow, true)
            }

            // write the new row to the table
            e.exec(ctx, e.children[0].Schema(), datumRow, newRow)
            globalRowIdx++
        }
        totalNumRows += chk.NumRows()
    }
    return totalNumRows, nil
}

Commit Optimistic Transaction

Committing a transaction includes two phases, "prewrite" and "commit", which are explained separately below. The function (c *twoPhaseCommitter) execute does the main work for committing a transaction. The important comments and simplified code are as follows. The complete code is here.

/*
do the "prewrite" operation first
if it is a OnePC transaction, return immediately
if it is an AsyncCommit transaction, commit the transaction asynchronously and return
if it is neither a OnePC nor an AsyncCommit transaction, commit the transaction synchronously
*/

func (c *twoPhaseCommitter) execute(ctx context.Context) (err error) {
    // do the "prewrite" operation
    c.prewriteStarted = true

    start := time.Now()
    err = c.prewriteMutations(bo, c.mutations)
    if c.isOnePC() {
        // if it is a OnePC transaction, return immediately
        return nil
    }

    if c.isAsyncCommit() {
        // for AsyncCommit, commit the transaction asynchronously and return
        go func() {
            // the error of the asynchronous commit is only logged in the real code
            _ = c.commitMutations(commitBo, c.mutations)
        }()

        return nil
    }
    // otherwise do the "commit" phase synchronously
    return c.commitTxn(ctx, commitDetail)
}
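
Which of the three paths execute takes depends on the async commit and one-phase commit features, controlled by the system variables tidb_enable_async_commit and tidb_enable_1pc. A hedged sketch of toggling them per session, reusing the *sql.Conn pattern from the earlier example:

// enableFastCommit turns on async commit and one-phase commit for the
// current session, so execute may take the OnePC or AsyncCommit branches
// above instead of the fully synchronous two-phase path.
func enableFastCommit(ctx context.Context, conn *sql.Conn) error {
    if _, err := conn.ExecContext(ctx, "SET SESSION tidb_enable_async_commit = ON"); err != nil {
        return err
    }
    _, err := conn.ExecContext(ctx, "SET SESSION tidb_enable_1pc = ON")
    return err
}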

prewrite

The entry function for prewriting a transaction is (c *twoPhaseCommitter) prewriteMutations, which calls the function (batchExe *batchExecutor) process to do it. The function (batchExe *batchExecutor) process calls (batchExe *batchExecutor) startWorker to prewrite every batch in parallel, and (batchExe *batchExecutor) startWorker in turn calls (action actionPrewrite) handleSingleBatch to prewrite a single batch.
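
Expressed in the call stack style used earlier, the prewrite chain is:

(c *twoPhaseCommitter) prewriteMutations
    (batchExe *batchExecutor) process
        (batchExe *batchExecutor) startWorker
            (action actionPrewrite) handleSingleBatch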

(batchExe *batchExecutor) process

The important comments and simplified code are as follows. The complete code is here.

// start a worker goroutine to prewrite every batch in parallel and collect the results
func (batchExe *batchExecutor) process(batches []batchMutations) error {
    var err error
    if err = batchExe.initUtils(); err != nil {
        return err
    }
    // for prewrite, stop sending other requests after receiving the first error
    var cancel context.CancelFunc

    if _, ok := batchExe.action.(actionPrewrite); ok {
        batchExe.backoffer, cancel = batchExe.backoffer.Fork()
        defer cancel()
    }

    // concurrently do the work for each batch
    ch := make(chan error, len(batches))
    exitCh := make(chan struct{})
    go batchExe.startWorker(exitCh, ch, batches)
    // check the result of every batch prewrite synchronously; if one batch fails,
    // stop every prewrite worker goroutine immediately
    for i := 0; i < len(batches); i++ {
        if e := <-ch; e != nil {
            // cancel the other requests and return the first error
            if cancel != nil {
                cancel()
            }

            if err == nil {
                err = e
            }
        }
    }

    close(exitCh) // break the loop of the function startWorker

    return err
}

(batchExe *batchExecutor) startWorker

The important comments and simplified code are as follows. The complete code is here.

// startWorker concurrently does the work for each batch, subject to the rate limit
func (batchExe *batchExecutor) startWorker(exitCh chan struct{}, ch chan error, batches []batchMutations) {
    for _, batch1 := range batches {
        waitStart := time.Now()
        // limit the number of goroutines by the buffer size of the channel
        if exit := batchExe.rateLimiter.GetToken(exitCh); !exit {
            batchExe.tokenWaitDuration += time.Since(waitStart)
            batch := batch1
            // call the function "handleSingleBatch" to prewrite the keys of one batch
            go func() {
                defer batchExe.rateLimiter.PutToken() // release the channel buffer slot
                ch <- batchExe.action.handleSingleBatch(batchExe.committer, singleBatchBackoffer, batch)
            }()
        } else {
            break
        }
    }
}
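
The rate limiting here is a buffered channel used as a token bucket: GetToken blocks until a slot is free and PutToken releases it. Below is a minimal standalone sketch of the same idea; it is not client-go's actual rateLimiter, just the underlying pattern.

package main

import (
    "fmt"
    "sync"
)

func main() {
    const limit = 2 // at most 2 batches processed concurrently
    tokens := make(chan struct{}, limit)
    var wg sync.WaitGroup

    for i := 0; i < 5; i++ {
        tokens <- struct{}{} // GetToken: blocks once the limit is reached
        wg.Add(1)
        go func(batch int) {
            defer wg.Done()
            defer func() { <-tokens }() // PutToken: release the slot
            fmt.Println("processing batch", batch)
        }(i)
    }
    wg.Wait()
}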

(action actionPrewrite) handleSingleBatch

The important comments and simplified code are as follows. The complete code is here.

/*
create a prewrite request and a region request sender that sends the prewrite request to tikv
(1) get the prewrite response from tikv
(2) if no error happened and it is a OnePC transaction, update onePCCommitTS from the prewrite response and return
(3) if no error happened and it is an AsyncCommit transaction, update minCommitTS if needed and return
(4) if errors happened because of lock conflicts, extract the locks from the returned error and resolve the expired locks
(5) do the backoff for the prewrite retry
*/

func (action actionPrewrite) handleSingleBatch(c *twoPhaseCommitter, bo *retry.Backoffer, batch batchMutations) (err error) {
    // create a prewrite request and a region request sender that sends the prewrite request to tikv
    txnSize := uint64(c.regionTxnSize[batch.region.GetID()])
    req := c.buildPrewriteRequest(batch, txnSize)
    sender := locate.NewRegionRequestSender(c.store.GetRegionCache(), c.store.GetTiKVClient())

    for {
        resp, err := sender.SendReq(bo, req, batch.region, client.ReadTimeoutShort)
        regionErr, err := resp.GetRegionError()

        // get the prewrite response from tikv
        prewriteResp := resp.Resp.(*kvrpcpb.PrewriteResponse)
        keyErrs := prewriteResp.GetErrors()
        if len(keyErrs) == 0 {
            // if no error happened and it is a OnePC transaction, update onePCCommitTS from the prewrite response and return
            if c.isOnePC() {
                c.onePCCommitTS = prewriteResp.OnePcCommitTs
                return nil
            }

            // if no error happened and it is an AsyncCommit transaction, update minCommitTS if needed and return
            if c.isAsyncCommit() {
                if prewriteResp.MinCommitTs > c.minCommitTS {
                    c.minCommitTS = prewriteResp.MinCommitTs
                }
            }
            return nil
        } // if len(keyErrs) == 0

        // if errors happened because of lock conflicts, extract the locks from the returned error
        var locks []*txnlock.Lock
        for _, keyErr := range keyErrs {
            // extract the lock from the key error
            lock, err1 := txnlock.ExtractLockFromKeyErr(keyErr)
            if err1 != nil {
                return errors.Trace(err1)
            }
            locks = append(locks, lock)
        } // for _, keyErr := range keyErrs

        // resolve the expired conflicting locks, then do the backoff for the prewrite retry
        start := time.Now()
        msBeforeExpired, err := c.store.GetLockResolver().ResolveLocksForWrite(bo, c.startTS, c.forUpdateTS, locks)
        if msBeforeExpired > 0 {
            err = bo.BackoffWithCfgAndMaxSleep(retry.BoTxnLock, int(msBeforeExpired), errors.Errorf("2PC prewrite lockedKeys: %d", len(locks)))
            if err != nil {
                return errors.Trace(err) // the backoff exceeded its max time, return the error
            }
        }
    } // for loop
}
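
The BackoffWithCfgAndMaxSleep call above sleeps before the next prewrite attempt and gives up once the accumulated sleep exceeds its budget. Below is a simplified standalone sketch of such a capped exponential backoff loop; it is an illustration of the pattern, not client-go's retry.Backoffer.

import (
    "fmt"
    "time"
)

// backoffLoop retries do with exponentially growing sleeps until it
// succeeds or the total sleep time exceeds maxSleep.
func backoffLoop(do func() error, maxSleep time.Duration) error {
    sleep := 2 * time.Millisecond
    var total time.Duration
    for {
        if err := do(); err == nil {
            return nil
        } else if total+sleep > maxSleep {
            return fmt.Errorf("backoff budget exceeded: %w", err)
        }
        time.Sleep(sleep)
        total += sleep
        sleep *= 2 // exponential growth, bounded by the budget check
    }
}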

commit

The entry function for committing a transaction is (c *twoPhaseCommitter) commitMutations, which calls the function (c *twoPhaseCommitter) doActionOnGroupMutations to do it. The batch containing the primary key is committed first; then the function (batchExe *batchExecutor) process calls (batchExe *batchExecutor) startWorker to commit the remaining batches in parallel and asynchronously. The function (batchExe *batchExecutor) startWorker calls (actionCommit) handleSingleBatch to commit a single batch.
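
Expressed in the call stack style used earlier, the commit chain is:

(c *twoPhaseCommitter) commitMutations
    (c *twoPhaseCommitter) doActionOnGroupMutations
        (batchExe *batchExecutor) process
            (batchExe *batchExecutor) startWorker
                (actionCommit) handleSingleBatch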

(c *twoPhaseCommitter) doActionOnGroupMutations

The important comments and simplified code are as follows. The complete code is here.

/*
if the groups contain the primary, commit the primary batch synchronously
if this is the first attempt to commit, spawn a goroutine to commit the secondary batches asynchronously
if this is a commit retry, commit the secondary batches synchronously, because the retry itself already runs in the asynchronous goroutine
*/

func (c *twoPhaseCommitter) doActionOnGroupMutations(bo *retry.Backoffer, action twoPhaseCommitAction, groups []groupedMutations) error {
    var err error
    batchBuilder := newBatched(c.primary())
    // whether the groups being operated on contain the primary
    firstIsPrimary := batchBuilder.setPrimary()
    actionCommit, actionIsCommit := action.(actionCommit)
    c.checkOnePCFallBack(action, len(batchBuilder.allBatches()))
    // if the groups contain the primary, commit the primary batch synchronously
    if firstIsPrimary &&
        (actionIsCommit && !c.isAsyncCommit()) {
        // the primary should be committed (non-async commit)/cleaned up/pessimistically locked first
        err = c.doActionOnBatches(bo, action, batchBuilder.primaryBatch())
        batchBuilder.forgetPrimary()
    }
    // if this is the first attempt to commit, spawn a goroutine to commit the secondary batches
    // asynchronously; if this is a commit retry, commit the secondary batches synchronously,
    // because the retry itself already runs in the asynchronous goroutine
    if actionIsCommit && !actionCommit.retry && !c.isAsyncCommit() {
        secondaryBo := retry.NewBackofferWithVars(c.store.Ctx(), CommitSecondaryMaxBackoff, c.txn.vars)
        c.store.WaitGroup().Add(1)
        go func() {
            defer c.store.WaitGroup().Done()
            // the error of the asynchronous secondary commit is only logged in the real code
            _ = c.doActionOnBatches(secondaryBo, action, batchBuilder.allBatches())
        }()
    } else {
        err = c.doActionOnBatches(bo, action, batchBuilder.allBatches())
    }

    return errors.Trace(err)
}
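
Committing the primary batch first is what makes the asynchronous secondary commit safe: once the primary key is committed, the transaction is logically committed, and a reader that encounters a still-locked secondary key can resolve its status by checking the primary key.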

(batchExe *batchExecutor) process

The function (c *twoPhaseCommitter) doActionOnGroupMutations calls (c *twoPhaseCommitter) doActionOnBatches to do the second phase of the commit, and (c *twoPhaseCommitter) doActionOnBatches calls (batchExe *batchExecutor) process to do the main work.

The important comments and simplified code of the function (batchExe *batchExecutor) process are as shown above in the prewrite part. The complete code is here.

(actionCommit) handleSingleBatch

The function (batchExe *batchExecutor) process calls the function (actionCommit) handleSingleBatch to send commit requests to the TiKV nodes.

The important comments and simplified code are as follows. The complete code is here.

/*
create a commit request and a commit sender
if a region error happened, backoff and retry the commit operation
if the error is not a region error but the request was rejected by TiKV because the commit ts had expired, retry with a newer commit ts
if any other error happened, return the error immediately
if no error happened, exit the for loop and return success
*/

func (actionCommit) handleSingleBatch(c *twoPhaseCommitter, bo *retry.Backoffer, batch batchMutations) error {
    // create a commit request and a commit sender
    keys := batch.mutations.GetKeys()
    req := tikvrpc.NewRequest(tikvrpc.CmdCommit, &kvrpcpb.CommitRequest{
        StartVersion:  c.startTS,
        Keys:          keys,
        CommitVersion: c.commitTS,
    }, kvrpcpb.Context{Priority: c.priority, SyncLog: c.syncLog,
        ResourceGroupTag: c.resourceGroupTag, DiskFullOpt: c.diskFullOpt})

    tBegin := time.Now()
    attempts := 0
    sender := locate.NewRegionRequestSender(c.store.GetRegionCache(), c.store.GetTiKVClient())

    for {
        attempts++
        resp, err := sender.SendReq(bo, req, batch.region, client.ReadTimeoutShort)
        regionErr, err := resp.GetRegionError()
        // if a region error happened, backoff and retry the commit operation
        if regionErr != nil {
            // for other region errors and the fake region error, backoff because something is wrong;
            // for the real EpochNotMatch error, don't backoff
            if regionErr.GetEpochNotMatch() == nil || locate.IsFakeRegionError(regionErr) {
                err = bo.Backoff(retry.BoRegionMiss, errors.New(regionErr.String()))
                if err != nil {
                    return errors.Trace(err)
                }
            }

            same, err := batch.relocate(bo, c.store.GetRegionCache())
            if err != nil {
                return errors.Trace(err)
            }

            if same {
                continue
            }

            err = c.doActionOnMutations(bo, actionCommit{true}, batch.mutations)
            return errors.Trace(err)
        } // if regionErr != nil

        // if the error is not a region error but the request was rejected by TiKV because the commit
        // ts had expired, retry with a newer commit ts; if any other error happened, return it immediately
        commitResp := resp.Resp.(*kvrpcpb.CommitResponse)
        if keyErr := commitResp.GetError(); keyErr != nil {
            if rejected := keyErr.GetCommitTsExpired(); rejected != nil {
                // the 2PC commitTS was rejected by TiKV; fetch a newer commit ts and retry
                commitTS, err := c.store.GetTimestampWithRetry(bo, c.txn.GetScope())
                c.mu.Lock()
                c.commitTS = commitTS
                c.mu.Unlock()
                // update the commitTS of the request and retry
                req.Commit().CommitVersion = commitTS
                continue
            }

            if c.mu.committed {
                // 2PC failed to commit a key after the primary key was committed;
                // no secondary key can be rolled back after its primary key is committed
                return errors.Trace(err)
            }
            return err
        }

        // no error happened, exit the for loop
        break
    } // for loop
    return nil
}