Vector Search
Vector search uses semantic similarity to help you find the most relevant records, even if your query does not explicitly include all the keywords.
Tip
For a complete example of vector search, see the vector-search example.
Basic Usage
This section shows you how to use vector search in your application in minimal steps. Before you start, you need to connect to the database.
Step 1. Create a table with a vector field
You can use client.create_table()
to create a table and use VectorField
to define a vector field.
In this example, we create a table named documents
with four columns:
id
: The primary key of the table.text
: The text content of the document.text_vec
: The vector embedding of the text content.meta
: The metadata of the document, which is a JSON object.
from pytidb.schema import TableModel, Field, VectorField
from pytidb.datatype import TEXT, JSON
class Document(TableModel):
__tablename__ = "documents"
id: int = Field(primary_key=True)
text: str = Field(sa_type=TEXT)
text_vec: list[float] = VectorField(dimensions=3)
meta: dict = Field(sa_type=JSON, default_factory=dict)
table = client.create_table(schema=Document, mode="overwrite")
The VectorField
class accepts the following parameters:
dimensions
: The number of dimensions of the vector. Once specified, only vectors with this exact dimension can be stored in this field.index
: Whether to create a vector index for the vector field. Defaults toTrue
.distance_metric
: The distance metric to use for the vector index. Supported values:DistanceMetric.COSINE
(default): Cosine distance metric, suitable for measuring text similarityDistanceMetric.L2
: L2 distance metric, suitable for capturing overall difference
You can use the CREATE TABLE
statement to create a table and using VECTOR
type to define a vector column.
CREATE TABLE documents (
id INT PRIMARY KEY,
text TEXT,
text_vec VECTOR(3),
VECTOR INDEX `vec_idx_text_vec`((VEC_COSINE_DISTANCE(`text_vec`)))
);
In this example:
- The
text_vec
column is defined as aVECTOR
type with 3 dimensions, it means that the vector to be stored in this column must have 3 dimensions. - A vector index is created using the
VEC_COSINE_DISTANCE
function to optimize vector search performance
TiDB supports two distance functions for vector indexes:
VEC_COSINE_DISTANCE
: Calculates the cosine distance between two vectorsVEC_L2_DISTANCE
: Calculates L2 distance (Euclidean distance) between two vectors
Step 2. Insert vector data into the table
For demonstration purposes, insert some text and their corresponding vector embeddings into the table. In this example, we use simple 3-dimensional vectors.
We insert three documents:
dog
with the vector embedding[1, 2, 1]
fish
with the vector embedding[1, 2, 4]
tree
with the vector embedding[1, 0, 0]
Tip
In real-world applications, vector embeddings are usually generated by an embedding model.
For convenience, pytidb provides an auto embedding feature that can automatically generate vector embeddings for your text fields when you insert, update, or search—no manual processing needed.
For details, see the Auto Embedding guide.
Step 3. Perform vector search
Vector search uses vector distance metrics to measure the similarity and relevance between vectors. The closer the distance, the more relevant the record. To find the most relevant documents in the table, you need to specify a query vector.
In this example, we assume the query is A swimming animal
and its vector embedding is [1, 2, 3]
.
You can use the table.search()
method to perform vector search, which uses search_mode="vector"
by default.
[
{"id": 2, "text": "fish", "text_vec": [1,2,4], "_distance": 0.00853986601633272},
{"id": 1, "text": "dog", "text_vec": [1,2,1], "_distance": 0.12712843905603044},
{"id": 3, "text": "tree", "text_vec": [1,0,0], "_distance": 0.7327387580875756},
]
The result shows that the most relevant document is fish
with a distance of 0.00853986601633272
.
You can use the ORDER BY <distance_function>(<column_name>, <query_vector>) LIMIT <n>
clause in the SELECT
statement to get the n nearest neighbors of the query vector.
In this example, we use the vec_cosine_distance
function to calculate the cosine distance between the vectors stored in the text_vec
column and the provided query vector [1, 2, 3]
.
SELECT id, text, vec_cosine_distance(text_vec, '[1,2,3]') AS distance
FROM documents
ORDER BY distance
LIMIT 3;
+----+----------+---------------------+
| id | text | distance |
+----+----------+---------------------+
| 2 | fish | 0.00853986601633272 |
| 1 | dog | 0.12712843905603044 |
| 3 | tree | 0.7327387580875756 |
+----+----------+---------------------+
3 rows in set (0.15 sec)
The result shows that the most relevant document is fish
with a distance of 0.00853986601633272
.
Distance metrics
Distance metrics are a measure of the similarity between a pair of vectors. Currently, TiDB supports the following distance metrics:
The table.search()
API supports the following distance metrics:
Metric Name | Description | Best For |
---|---|---|
DistanceMetric.COSINE |
Calculates the cosine distance between two vectors (default). Measures the angle between vectors. | Text embeddings, semantic search |
DistanceMetric.L2 |
Calculates the L2 distance (Euclidean distance) between two vectors. Measures the straight-line distance. | Image features |
To change the distance metric used for vector search, use the .distance_metric()
method.
Example: Use the L2 distance metric
In SQL, you can use the following built-in functions to calculate vector distances directly in your queries:
Function Name | Description |
---|---|
VEC_L2_DISTANCE |
Calculates L2 distance (Euclidean distance) between two vectors |
VEC_COSINE_DISTANCE |
Calculates the cosine distance between two vectors |
VEC_NEGATIVE_INNER_PRODUCT |
Calculates the negative of the inner product between two vectors |
VEC_L1_DISTANCE |
Calculates L1 distance (Manhattan distance) between two vectors |
Distance threshold
The table.search()
API allows you to set a distance threshold to control the similarity of the returned results. By specifying this threshold, you can exclude less similar vectors and return only those that meet your relevance criteria.
Use the .distance_threshold()
method to set a maximum distance for the search results. Only records with a distance less than the threshold are returned.
Example: Only return documents with a distance less than 0.5
Distance range
The table.search()
API also supports specifying a distance range to further refine the results.
Use the .distance_range()
method to set both minimum and maximum distance values. Only records with a distance within this range are returned.
Example: Only return documents with a distance between 0.01 and 0.05
To specify a distance range in SQL, use BETWEEN
or other comparison operators in the HAVING
clause:
Example: Only return documents with a distance between 0.01 and 0.05
Metadata filtering
As a relational database, TiDB supports a rich set of SQL operators and allows flexible combinations of filtering conditions.
For vector search in TiDB, you can apply metadata filtering on scalar fields (e.g., integers, strings) or JSON fields.
Typically, vector search combined with metadata filtering operates in two modes:
- Post-filtering: In a two-stage retrieval process, TiDB first performs vector search to retrieve the top-k candidate results from the entire vector space, then applies the filter to this candidate set. The vector search stage typically leverages a vector index for efficiency.
- Pre-filtering: The filter is applied before vector search. If the filter is highly selective and the filtered field is indexed with a scalar index, this approach can significantly reduce the search space and improve performance.
Post-filtering
Use the .filter()
method with a filter dictionary to apply filtering to vector search.
By default, the table.search()
API uses post-filtering mode to maximize search performance with the vector index.
Example: Vector search with post-filtering
results = (
table.search([1, 2, 3])
# The `meta` is a JSON field, and its value is a JSON object
# like {"category": "animal"}
.filter({"meta.category": "animal"})
.num_candidate(50)
.limit(10)
.to_list()
)
Tip
When using a vector index, if the final limit
is very small, the accuracy of the results may decrease. You can use the .num_candidate()
method to control how many candidates to retrieve from the vector index during the vector search phase, without changing the limit
parameter.
A higher num_candidate
value generally improves recall but may reduce query performance. Adjust this value based on your dataset and accuracy requirements.
Currently, vector indexes are only effective in strict ANN (Approximate Nearest Neighbor) queries, such as:
In other words, you cannot use a WHERE
clause together with a vector index in the same query.
If you need to combine vector search with additional filtering conditions, you can use the post-filtering pattern. In this approach, the ANN query will be divided into two parts:
- The inner query performs the vector search using the vector index.
- The outer query applies the
WHERE
condition to filter the results.
SELECT *
FROM (
SELECT id, text, meta, vec_cosine_distance(text_vec, '[1,2,3]') AS distance
FROM documents
ORDER BY distance
LIMIT 50
) candidates
WHERE meta->>'$.category' = 'animal'
ORDER BY distance
LIMIT 10;
Tip
The post-filtering pattern may lead to false positives — for example, the inner query may retrieve the top 50 most similar records, but none of them match the WHERE
condition.
To mitigate this, you can increase the LIMIT
value (e.g., 50) in the inner query to fetch more candidates, improving the chances of returning enough valid results after filtering.
For supported SQL operators, see Operators in the TiDB Cloud documentation.
Pre-filtering
To enable pre-filtering, set the prefilter
parameter to True
in the .filter()
method.
Example: Vector search with pre-filtering
results = (
table.search([1, 2, 3])
.filter({"meta.category": "animal"}, prefilter=True)
.limit(10)
.to_list()
)
For supported filter operators, see Filtering.
In SQL, use the ->>
operator or JSON_EXTRACT
to access JSON fields in the WHERE
clause:
SELECT id, text, meta, vec_cosine_distance(text_vec, '[1,2,3]') AS distance
FROM documents
WHERE meta->>'$.category' = 'animal'
ORDER BY distance
LIMIT 10;
For supported SQL operators, see Operators in the TiDB Cloud documentation.
Multiple vector fields
TiDB supports defining multiple vector columns in a single table, allowing you to store and search different types of vector embeddings.
For example, you can store both text embeddings and image embeddings in the same table, making it convenient to manage multi-modal data.
You can define multiple vector fields in the schema and perform vector search on the specified vector field by using the .vector_column()
method.
Example: Specify the vector field to search on
# Create a table with multiple vector fields
class RichTextDocument(TableModel):
__tablename__ = "rich_text_documents"
id: int = Field(primary_key=True)
text: str = Field(sa_type=TEXT)
text_vec: list[float] = VectorField(dimensions=3)
image_url: str
image_vec: list[float] = VectorField(dimensions=3)
table = client.create_table(schema=RichTextDocument, mode="overwrite")
# Insert sample data ...
# Search using image vector field
results = (
table.search([1, 2, 3])
.vector_column("image_vec")
.distance_metric(DistanceMetric.COSINE)
.limit(10)
.to_list()
)
You can create multiple vector columns in a table and search them using suitable distance functions:
-- Create a table with multiple vector fields
CREATE TABLE rich_text_documents (
id BIGINT PRIMARY KEY,
text TEXT,
text_vec VECTOR(3),
image_url VARCHAR(255),
image_vec VECTOR(3)
);
-- Insert sample data ...
-- Search using text vector
SELECT id, image_url, vec_l2_distance(image_vec, '[4,5,6]') AS image_distance
FROM rich_text_documents
ORDER BY image_distance
LIMIT 10;
Output search results
The table.search()
API lets you convert search results into several common data processing formats:
As SQLAlchemy result rows
To work with raw SQLAlchemy result rows, use:
As a list of Python dictionaries
For easier manipulation in Python, convert the results to a list of dictionaries:
As a pandas DataFrame
To display results in a user-friendly table—especially useful in Jupyter notebooks—convert them to a pandas DataFrame:
As a list of Pydantic model instances
The TableModel
class can also be used as a Pydantic model to represent data entities. To work with results as Pydantic model instances, use: