Vector Search
Vector search uses semantic similarity to help you find the most relevant records, even if your query does not explicitly include all the keywords.
Tip
To check complete example of vector search, please refer to the vector-search example.
Basic Usage
Step 1. Create a table with a vector field
To store vector data, you need to define a vector field via VectorField
in the table schema.
from pytidb.schema import TableModel, Field, VectorField
from pytidb.datatype import TEXT
class Chunk(TableModel, table=True):
__table__ = "chunks"
id: int = Field(primary_key=True)
text: str = Field(sa_type=TEXT)
text_vec: list[float] = VectorField(dimension=3)
table = db.create_table(schema=Chunk)
By default, the create_table
method will create a vector index for cosine distance metric on the vector field.
To store vector data, you need to define a column of VECTOR
type in the CREATE TABLE
statement.
CREATE TABLE chunks (
id INT PRIMARY KEY,
text TEXT,
text_vec VECTOR(3),
VECTOR INDEX `vec_idx_text_vec`((VEC_COSINE_DISTANCE(`text_vec`)))
);
In the example above, we also create a VECTOR INDEX
on the text_vec
column to optimize vector search using the VEC_COSINE_DISTANCE
function.
Step 2. Insert vector data into the table
For demonstration purposes, we will ingest some text and their corresponding vector embeddings into the table. To simplify the example, we use simplified 3-dimensional vectors.
In this example, we insert 3 documents:
dog
with vector embedding[1, 2, 1]
fish
with vector embedding[1, 2, 4]
tree
with vector embedding[1, 0, 0]
Tip
In real-world applications, the vector embedding is usually generated by a embedding model.
For convenience, pytidb provides an auto embedding feature that can automatically generate vector embeddings for your text fields when you insert, update, or search—no manual processing needed.
For details, see the Auto Embedding guide.
Step 3. Perform vector search
Vector search uses vector distance metrics to find the most relevant records, the closer the distance, the more relevant the record.
To find the most relevant documents in the table, you need to specify a query vector. Assume the query is A swimming animal
and its vector embedding is [1, 2, 3]
.
You can use the table.search()
method to perform vector search, which uses search_mode="vector"
by default.
[
{"id": 2, "text": "fish", "text_vec": [1,2,4], "_distance": 0.00853986601633272},
{"id": 1, "text": "dog", "text_vec": [1,2,1], "_distance": 0.12712843905603044},
{"id": 3, "text": "tree", "text_vec": [1,0,0], "_distance": 0.7327387580875756},
]
The result shows that the most relevant document is fish
with a distance of 0.00853986601633272
.
You can use the ORDER BY <distance_function>(<column_name>, <query_vector>) LIMIT <n>
clause in the SELECT
statement to get the n nearest neighbors of the query vector.
In this example, we use the vec_cosine_distance
function to calculate the cosine distance between the text_vec
column and the query vector [1, 2, 3]
.
SELECT id, text, vec_cosine_distance(text_vec, '[1,2,3]') AS distance
FROM chunks
ORDER BY distance
LIMIT 3;
+----+----------+---------------------+
| id | text | distance |
+----+----------+---------------------+
| 2 | fish | 0.00853986601633272 |
| 1 | dog | 0.12712843905603044 |
| 3 | tree | 0.7327387580875756 |
+----+----------+---------------------+
3 rows in set (0.15 sec)
The result shows that the most relevant document is fish
with a distance of 0.00853986601633272
.
Distance metrics
Distance metrics are a measure of the similarity between a pair of vectors. Currently, TiDB supports the following metrics:
pytidb provides the following functions on vector column to calculate vector distances:
Function Name | Description |
---|---|
l1_distance() |
Calculates the L1 distance (Manhattan distance) between two vectors |
l2_distance() |
Calculates the L2 distance (Euclidean distance) between two vectors |
cosine_distance() |
Calculates the cosine distance between two vectors |
negative_inner_product() |
Calculates the negative of the inner product between two vectors |
For example, to calculate the L1 distance between the text_vec
column and the query vector [1, 2, 3]
, you can use the following SQL statement:
In SQL, you can use the following built-in functions to calculate vector distances directly in your queries:
Function Name | Description |
---|---|
VEC_L2_DISTANCE |
Calculates L2 distance (Euclidean distance) between two vectors |
VEC_COSINE_DISTANCE |
Calculates the cosine distance between two vectors |
VEC_NEGATIVE_INNER_PRODUCT |
Calculates the negative of the inner product between two vectors |
VEC_L1_DISTANCE |
Calculates L1 distance (Manhattan distance) between two vectors |
Output search results
pytidb provides several helper methods to output the search results in a convenient format as you need.
As SQLAlchemy result rows
If you want to work with the raw SQLAlchemy result rows, you can use:
As a list of Python dicts
For easier manipulation in Python, you can convert the results to a list of dictionaries:
As a pandas DataFrame
To display the results in a user-friendly table—especially useful in Jupyter notebooks—you can convert them to a pandas DataFrame: