Skip to content

Vector Search

Vector search uses semantic similarity to help you find the most relevant records, even if your query does not explicitly include all the keywords.

Tip

For a complete example of vector search, see the vector-search example.

Basic Usage

This section shows you how to use vector search in your application in minimal steps. Before you start, you need to connect to the database.

Step 1. Create a table with a vector field

You can use client.create_table() to create a table and use VectorField to define a vector field.

In this example, we create a table named documents with four columns:

  • id: The primary key of the table.
  • text: The text content of the document.
  • text_vec: The vector embedding of the text content.
  • meta: The metadata of the document, which is a JSON object.
from pytidb.schema import TableModel, Field, VectorField
from pytidb.datatype import TEXT, JSON

class Document(TableModel):
    __tablename__ = "documents"

    id: int = Field(primary_key=True)
    text: str = Field(sa_type=TEXT)
    text_vec: list[float] = VectorField(dimensions=3)
    meta: dict = Field(sa_type=JSON, default_factory=dict)

table = client.create_table(schema=Document, mode="overwrite")

The VectorField class accepts the following parameters:

  • dimensions: The number of dimensions of the vector. Once specified, only vectors with this exact dimension can be stored in this field.
  • index: Whether to create a vector index for the vector field. Defaults to True.
  • distance_metric: The distance metric to use for the vector index. Supported values:
    • DistanceMetric.COSINE (default): Cosine distance metric, suitable for measuring text similarity
    • DistanceMetric.L2: L2 distance metric, suitable for capturing overall difference

You can use the CREATE TABLE statement to create a table and using VECTOR type to define a vector column.

CREATE TABLE documents (
    id INT PRIMARY KEY,
    text TEXT,
    text_vec VECTOR(3),
    VECTOR INDEX `vec_idx_text_vec`((VEC_COSINE_DISTANCE(`text_vec`)))
);

In this example:

  • The text_vec column is defined as a VECTOR type with 3 dimensions, it means that the vector to be stored in this column must have 3 dimensions.
  • A vector index is created using the VEC_COSINE_DISTANCE function to optimize vector search performance

TiDB supports two distance functions for vector indexes:

  • VEC_COSINE_DISTANCE: Calculates the cosine distance between two vectors
  • VEC_L2_DISTANCE: Calculates L2 distance (Euclidean distance) between two vectors

Step 2. Insert vector data into the table

For demonstration purposes, insert some text and their corresponding vector embeddings into the table. In this example, we use simple 3-dimensional vectors.

We insert three documents:

  • dog with the vector embedding [1, 2, 1]
  • fish with the vector embedding [1, 2, 4]
  • tree with the vector embedding [1, 0, 0]
table.bulk_insert([
    Document(text="dog", text_vec=[1,2,1], meta={"category": "animal"}),
    Document(text="fish", text_vec=[1,2,4], meta={"category": "animal"}),
    Document(text="tree", text_vec=[1,0,0], meta={"category": "plant"}),
])
INSERT INTO documents (id, text, text_vec, meta)
VALUES
    (1, 'dog', '[1,2,1]', '{"category": "animal"}'),
    (2, 'fish', '[1,2,4]', '{"category": "animal"}'),
    (3, 'tree', '[1,0,0]', '{"category": "plant"}');

Tip

In real-world applications, vector embeddings are usually generated by an embedding model.

For convenience, pytidb provides an auto embedding feature that can automatically generate vector embeddings for your text fields when you insert, update, or search—no manual processing needed.

For details, see the Auto Embedding guide.

Vector search uses vector distance metrics to measure the similarity and relevance between vectors. The closer the distance, the more relevant the record. To find the most relevant documents in the table, you need to specify a query vector.

In this example, we assume the query is A swimming animal and its vector embedding is [1, 2, 3].

You can use the table.search() method to perform vector search, which uses search_mode="vector" by default.

table.search([1, 2, 3]).limit(3).to_list()
Execution result
[
    {"id": 2, "text": "fish", "text_vec": [1,2,4], "_distance": 0.00853986601633272},
    {"id": 1, "text": "dog", "text_vec": [1,2,1], "_distance": 0.12712843905603044},
    {"id": 3, "text": "tree", "text_vec": [1,0,0], "_distance": 0.7327387580875756},
]

The result shows that the most relevant document is fish with a distance of 0.00853986601633272.

You can use the ORDER BY <distance_function>(<column_name>, <query_vector>) LIMIT <n> clause in the SELECT statement to get the n nearest neighbors of the query vector.

In this example, we use the vec_cosine_distance function to calculate the cosine distance between the vectors stored in the text_vec column and the provided query vector [1, 2, 3].

SELECT id, text, vec_cosine_distance(text_vec, '[1,2,3]') AS distance
FROM documents
ORDER BY distance
LIMIT 3;
Execution result
+----+----------+---------------------+
| id | text     | distance            |
+----+----------+---------------------+
|  2 | fish     | 0.00853986601633272 |
|  1 | dog      | 0.12712843905603044 |
|  3 | tree     |  0.7327387580875756 |
+----+----------+---------------------+
3 rows in set (0.15 sec)

The result shows that the most relevant document is fish with a distance of 0.00853986601633272.

Distance metrics

Distance metrics are a measure of the similarity between a pair of vectors. Currently, TiDB supports the following distance metrics:

The table.search() API supports the following distance metrics:

Metric Name Description Best For
DistanceMetric.COSINE Calculates the cosine distance between two vectors (default). Measures the angle between vectors. Text embeddings, semantic search
DistanceMetric.L2 Calculates the L2 distance (Euclidean distance) between two vectors. Measures the straight-line distance. Image features

To change the distance metric used for vector search, use the .distance_metric() method.

Example: Use the L2 distance metric

from pytidb.schema import DistanceMetric

results = (
    table.search([1, 2, 3])
        .distance_metric(DistanceMetric.L2)
        .limit(10)
        .to_list()
)

In SQL, you can use the following built-in functions to calculate vector distances directly in your queries:

Function Name Description
VEC_L2_DISTANCE Calculates L2 distance (Euclidean distance) between two vectors
VEC_COSINE_DISTANCE Calculates the cosine distance between two vectors
VEC_NEGATIVE_INNER_PRODUCT Calculates the negative of the inner product between two vectors
VEC_L1_DISTANCE Calculates L1 distance (Manhattan distance) between two vectors

Distance threshold

The table.search() API allows you to set a distance threshold to control the similarity of the returned results. By specifying this threshold, you can exclude less similar vectors and return only those that meet your relevance criteria.

Use the .distance_threshold() method to set a maximum distance for the search results. Only records with a distance less than the threshold are returned.

Example: Only return documents with a distance less than 0.5

results = table.search([1, 2, 3]).distance_threshold(0.5).limit(10).to_list()

In SQL, use the HAVING clause with a distance function to filter results by distance:

Example: Only return documents with a distance less than 0.1

SELECT id, text, vec_cosine_distance(text_vec, '[1,2,3]') AS distance
FROM documents
HAVING distance < 0.1
ORDER BY distance
LIMIT 10;

Distance range

The table.search() API also supports specifying a distance range to further refine the results.

Use the .distance_range() method to set both minimum and maximum distance values. Only records with a distance within this range are returned.

Example: Only return documents with a distance between 0.01 and 0.05

results = table.search([1, 2, 3]).distance_range(0.01, 0.05).limit(10).to_list()

To specify a distance range in SQL, use BETWEEN or other comparison operators in the HAVING clause:

Example: Only return documents with a distance between 0.01 and 0.05

SELECT id, text, vec_l2_distance(text_vec, '[1,2,3]') AS distance
FROM documents
HAVING distance BETWEEN 0.01 AND 0.05
ORDER BY distance
LIMIT 10;

Metadata filtering

As a relational database, TiDB supports a rich set of SQL operators and allows flexible combinations of filtering conditions.

For vector search in TiDB, you can apply metadata filtering on scalar fields (e.g., integers, strings) or JSON fields.

Typically, vector search combined with metadata filtering operates in two modes:

  • Post-filtering: In a two-stage retrieval process, TiDB first performs vector search to retrieve the top-k candidate results from the entire vector space, then applies the filter to this candidate set. The vector search stage typically leverages a vector index for efficiency.
  • Pre-filtering: The filter is applied before vector search. If the filter is highly selective and the filtered field is indexed with a scalar index, this approach can significantly reduce the search space and improve performance.

Post-filtering

Use the .filter() method with a filter dictionary to apply filtering to vector search.

By default, the table.search() API uses post-filtering mode to maximize search performance with the vector index.

Example: Vector search with post-filtering

results = (
    table.search([1, 2, 3])
        # The `meta` is a JSON field, and its value is a JSON object
        # like {"category": "animal"}
        .filter({"meta.category": "animal"})
        .num_candidate(50)
        .limit(10)
        .to_list()
)

Tip

When using a vector index, if the final limit is very small, the accuracy of the results may decrease. You can use the .num_candidate() method to control how many candidates to retrieve from the vector index during the vector search phase, without changing the limit parameter.

A higher num_candidate value generally improves recall but may reduce query performance. Adjust this value based on your dataset and accuracy requirements.

Currently, vector indexes are only effective in strict ANN (Approximate Nearest Neighbor) queries, such as:

SELECT * FROM <table> ORDER BY <distance_func>(<column>) LIMIT <n>

In other words, you cannot use a WHERE clause together with a vector index in the same query.

If you need to combine vector search with additional filtering conditions, you can use the post-filtering pattern. In this approach, the ANN query will be divided into two parts:

  • The inner query performs the vector search using the vector index.
  • The outer query applies the WHERE condition to filter the results.
SELECT *
FROM (
    SELECT id, text, meta, vec_cosine_distance(text_vec, '[1,2,3]') AS distance
    FROM documents
    ORDER BY distance
    LIMIT 50
) candidates
WHERE meta->>'$.category' = 'animal'
ORDER BY distance
LIMIT 10;

Tip

The post-filtering pattern may lead to false positives — for example, the inner query may retrieve the top 50 most similar records, but none of them match the WHERE condition.

To mitigate this, you can increase the LIMIT value (e.g., 50) in the inner query to fetch more candidates, improving the chances of returning enough valid results after filtering.

For supported SQL operators, see Operators in the TiDB Cloud documentation.

Pre-filtering

To enable pre-filtering, set the prefilter parameter to True in the .filter() method.

Example: Vector search with pre-filtering

results = (
    table.search([1, 2, 3])
        .filter({"meta.category": "animal"}, prefilter=True)
        .limit(10)
        .to_list()
)

For supported filter operators, see Filtering.

In SQL, use the ->> operator or JSON_EXTRACT to access JSON fields in the WHERE clause:

SELECT id, text, meta, vec_cosine_distance(text_vec, '[1,2,3]') AS distance
FROM documents
WHERE meta->>'$.category' = 'animal'
ORDER BY distance
LIMIT 10;

For supported SQL operators, see Operators in the TiDB Cloud documentation.

Multiple vector fields

TiDB supports defining multiple vector columns in a single table, allowing you to store and search different types of vector embeddings.

For example, you can store both text embeddings and image embeddings in the same table, making it convenient to manage multi-modal data.

You can define multiple vector fields in the schema and perform vector search on the specified vector field by using the .vector_column() method.

Example: Specify the vector field to search on

# Create a table with multiple vector fields
class RichTextDocument(TableModel):
    __tablename__ = "rich_text_documents"
    id: int = Field(primary_key=True)
    text: str = Field(sa_type=TEXT)
    text_vec: list[float] = VectorField(dimensions=3)
    image_url: str
    image_vec: list[float] = VectorField(dimensions=3)

table = client.create_table(schema=RichTextDocument, mode="overwrite")

# Insert sample data ...

# Search using image vector field
results = (
    table.search([1, 2, 3])
        .vector_column("image_vec")
        .distance_metric(DistanceMetric.COSINE)
        .limit(10)
        .to_list()
)

You can create multiple vector columns in a table and search them using suitable distance functions:

-- Create a table with multiple vector fields
CREATE TABLE rich_text_documents (
    id BIGINT PRIMARY KEY,
    text TEXT,
    text_vec VECTOR(3),
    image_url VARCHAR(255),
    image_vec VECTOR(3)
);

-- Insert sample data ...

-- Search using text vector
SELECT id, image_url, vec_l2_distance(image_vec, '[4,5,6]') AS image_distance
FROM rich_text_documents
ORDER BY image_distance
LIMIT 10;

Output search results

The table.search() API lets you convert search results into several common data processing formats:

As SQLAlchemy result rows

To work with raw SQLAlchemy result rows, use:

table.search([1, 2, 3]).limit(10).to_rows()

As a list of Python dictionaries

For easier manipulation in Python, convert the results to a list of dictionaries:

table.search([1, 2, 3]).limit(10).to_list()

As a pandas DataFrame

To display results in a user-friendly table—especially useful in Jupyter notebooks—convert them to a pandas DataFrame:

table.search([1, 2, 3]).limit(10).to_pandas()

As a list of Pydantic model instances

The TableModel class can also be used as a Pydantic model to represent data entities. To work with results as Pydantic model instances, use:

table.search([1, 2, 3]).limit(10).to_pydantic()