Skip to content

Vector Search

Vector search uses semantic similarity to help you find the most relevant records, even if your query does not explicitly include all the keywords.

Tip

To check complete example of vector search, please refer to the vector-search example.

Basic Usage

Step 1. Create a table with a vector field

To store vector data, you need to define a vector field via VectorField in the table schema.

from pytidb.schema import TableModel, Field, VectorField
from pytidb.datatype import TEXT

class Chunk(TableModel, table=True):
    __table__ = "chunks"
    id: int = Field(primary_key=True)
    text: str = Field(sa_type=TEXT)
    text_vec: list[float] = VectorField(dimension=3)

table = db.create_table(schema=Chunk)

By default, the create_table method will create a vector index for cosine distance metric on the vector field.

To store vector data, you need to define a column of VECTOR type in the CREATE TABLE statement.

CREATE TABLE chunks (
    id INT PRIMARY KEY,
    text TEXT,
    text_vec VECTOR(3),
    VECTOR INDEX `vec_idx_text_vec`((VEC_COSINE_DISTANCE(`text_vec`)))
);

In the example above, we also create a VECTOR INDEX on the text_vec column to optimize vector search using the VEC_COSINE_DISTANCE function.

Step 2. Insert vector data into the table

For demonstration purposes, we will ingest some text and their corresponding vector embeddings into the table. To simplify the example, we use simplified 3-dimensional vectors.

In this example, we insert 3 documents:

  • dog with vector embedding [1, 2, 1]
  • fish with vector embedding [1, 2, 4]
  • tree with vector embedding [1, 0, 0]
table.bulk_insert([
    Chunk(text="dog", text_vec=[1,2,1]),
    Chunk(text="fish", text_vec=[1,2,4]),
    Chunk(text="tree", text_vec=[1,0,0]),
])
INSERT INTO chunks (id, text, text_vec)
VALUES
    (1, 'dog', '[1,2,1]'),
    (2, 'fish', '[1,2,4]'),
    (3, 'tree', '[1,0,0]');

Tip

In real-world applications, the vector embedding is usually generated by a embedding model.

For convenience, pytidb provides an auto embedding feature that can automatically generate vector embeddings for your text fields when you insert, update, or search—no manual processing needed.

For details, see the Auto Embedding guide.

Vector search uses vector distance metrics to find the most relevant records, the closer the distance, the more relevant the record.

To find the most relevant documents in the table, you need to specify a query vector. Assume the query is A swimming animal and its vector embedding is [1, 2, 3].

You can use the table.search() method to perform vector search, which uses search_mode="vector" by default.

table.search([1, 2, 3]).limit(3).to_list()
Execution result
[
    {"id": 2, "text": "fish", "text_vec": [1,2,4], "_distance": 0.00853986601633272},
    {"id": 1, "text": "dog", "text_vec": [1,2,1], "_distance": 0.12712843905603044},
    {"id": 3, "text": "tree", "text_vec": [1,0,0], "_distance": 0.7327387580875756},
]

The result shows that the most relevant document is fish with a distance of 0.00853986601633272.

You can use the ORDER BY <distance_function>(<column_name>, <query_vector>) LIMIT <n> clause in the SELECT statement to get the n nearest neighbors of the query vector.

In this example, we use the vec_cosine_distance function to calculate the cosine distance between the text_vec column and the query vector [1, 2, 3].

SELECT id, text, vec_cosine_distance(text_vec, '[1,2,3]') AS distance
FROM chunks
ORDER BY distance
LIMIT 3;
Execution result
+----+----------+---------------------+
| id | text     | distance            |
+----+----------+---------------------+
|  2 | fish     | 0.00853986601633272 |
|  1 | dog      | 0.12712843905603044 |
|  3 | tree     |  0.7327387580875756 |
+----+----------+---------------------+
3 rows in set (0.15 sec)

The result shows that the most relevant document is fish with a distance of 0.00853986601633272.

Distance metrics

Distance metrics are a measure of the similarity between a pair of vectors. Currently, TiDB supports the following metrics:

pytidb provides the following functions on vector column to calculate vector distances:

Function Name Description
l1_distance() Calculates the L1 distance (Manhattan distance) between two vectors
l2_distance() Calculates the L2 distance (Euclidean distance) between two vectors
cosine_distance() Calculates the cosine distance between two vectors
negative_inner_product() Calculates the negative of the inner product between two vectors

For example, to calculate the L1 distance between the text_vec column and the query vector [1, 2, 3], you can use the following SQL statement:

from pytidb.sql import select

# table is created in the previous basic usage example.
stmt = select(
    table.id,
    table.text,
    table.text_vec.l1_distance([1, 2, 3])
).limit(3)

db.query(stmt).to_list()

In SQL, you can use the following built-in functions to calculate vector distances directly in your queries:

Function Name Description
VEC_L2_DISTANCE Calculates L2 distance (Euclidean distance) between two vectors
VEC_COSINE_DISTANCE Calculates the cosine distance between two vectors
VEC_NEGATIVE_INNER_PRODUCT Calculates the negative of the inner product between two vectors
VEC_L1_DISTANCE Calculates L1 distance (Manhattan distance) between two vectors

Output search results

pytidb provides several helper methods to output the search results in a convenient format as you need.

As SQLAlchemy result rows

If you want to work with the raw SQLAlchemy result rows, you can use:

table.search([1, 2, 3]).to_rows()

As a list of Python dicts

For easier manipulation in Python, you can convert the results to a list of dictionaries:

table.search([1, 2, 3]).to_list()

As a pandas DataFrame

To display the results in a user-friendly table—especially useful in Jupyter notebooks—you can convert them to a pandas DataFrame:

table.search([1, 2, 3]).to_pandas()