Hybrid Search
Hybrid search is a technique that combines multiple search algorithms to deliver more accurate and relevant results.
TiDB supports both semantic search (also termed vector search) and keyword-based search (also known as full-text search). You can leverage the strengths of both approaches to achieve better search results via hybrid search.
Tip
To check complete example of hybrid search, please refer to the hybrid-search example.
Basic Usage
Step 1. Define a embedding function
Define an embedding function to generate vector representations of text data.
from pytidb.embeddings import EmbeddingFunction
embed_fn = EmbeddingFunction(
model_name="openai/text-embedding-3-small",
api_key=os.getenv("OPENAI_API_KEY"),
)
Step 2. Create a table with vector index and full-text index
Assuming you have already connected to your TiDB database using TiDBClient
:
Now, you can create a table with a text field and a vector field. PyTiDB will automatically create a vector index on the text_vec
column and a full-text index can be created via create_fts_index()
method.
Sample code:
from pytidb.schema import TableModel, Field
from pytidb.datatype import Text
class Chunk(TableModel, table=True):
__tablename__ = "chunks_for_hybrid_search"
id: int = Field(primary_key=True)
text: str = Field(sa_type=Text)
text_vec: list[float] = embed_fn.VectorField(source_field="text")
table = db.create_table(schema=Chunk)
if not table.has_fts_index("text"):
table.create_fts_index("text")
Step 3. Insert some sample data
Using bulk_insert()
method to insert some sample data into the table.
table.delete()
table.bulk_insert([
Chunk(
text="TiDB is a distributed database that supports OLTP, OLAP, HTAP and AI workloads.",
),
Chunk(
text="PyTiDB is a Python library for developers to connect to TiDB.",
),
Chunk(
text="LlamaIndex is a Python library for building AI-powered applications.",
),
])
Step 4. Perform hybrid search
To enable hybrid search, you need to set the search_type
parameter to hybrid
when calling the search()
method.
results = (
table.search(
"AI database", search_type="hybrid"
)
.limit(3)
.to_list()
)
for item in results:
item.pop("text_vec")
print(json.dumps(results, indent=4, sort_keys=True))
The search results contains three special fields:
_distance
: The distance between the query vector and the vector data in the table returned by the vector search._match_score
: The match score between the query and the text field returned by the full-text search._score
: The final score of the search result, which is calculated by the fusion algorithm.
[
{
"_distance": 0.4740166257687124,
"_match_score": 1.6804268,
"_score": 0.03278688524590164,
"id": 60013,
"text": "TiDB is a distributed database that supports OLTP, OLAP, HTAP and AI workloads."
},
{
"_distance": 0.6428459116216618,
"_match_score": 0.78427225,
"_score": 0.03200204813108039,
"id": 60015,
"text": "LlamaIndex is a Python library for building AI-powered applications."
},
{
"_distance": 0.641581407158715,
"_match_score": null,
"_score": 0.016129032258064516,
"id": 60014,
"text": "PyTiDB is a Python library for developers to connect to TiDB."
}
]
Fusion Method
PyTiDB currently supports one fusion methods:
rrf
: Reciprocal Rank Fusion (RRF) (default)
Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion (RRF) is an algorithm that evaluates the search by leveraging the rank of the documents in multiple search result sets.
For more details, please refer to the RRF paper.
You can use fusion()
method to specify the fusion method and parameters.
results = (
table.search(
"AI database", search_type="hybrid"
)
.fusion(method="rrf", k=60)
.limit(3)
.to_list()
)
Parameters:
k
: is a constant (default: 60) to prevent division by zero and control the impact of high-ranked documents.
Reranking
Use the rerank()
method to specify a reranker that sorts search results by relevance between the query and documents. See the Reranking section for details.