From ef9664c986372a569c520f894868037ef6cc52d1 Mon Sep 17 00:00:00 2001 From: Docsite Preview Bot <> Date: Mon, 25 Nov 2024 09:02:47 +0000 Subject: [PATCH] Preview PR https://github.com/pingcap/docs/pull/19494 and this preview is triggered from commit https://github.com/pingcap/docs/pull/19494/commits/31a0065d8ec64375bf2b126fd5d5a59b5c5b752a --- .../tidb-cloud/vector-search-data-types.md | 246 ++++++++ .../vector-search-functions-and-operators.md | 284 +++++++++ .../vector-search-get-started-using-python.md | 195 ++++++ .../vector-search-get-started-using-sql.md | 150 +++++ .../master/tidb-cloud/vector-search-index.md | 243 ++++++++ ...vector-search-integrate-with-django-orm.md | 221 +++++++ ...-search-integrate-with-jinaai-embedding.md | 240 +++++++ .../vector-search-integrate-with-langchain.md | 586 ++++++++++++++++++ ...vector-search-integrate-with-llamaindex.md | 266 ++++++++ .../vector-search-integrate-with-peewee.md | 211 +++++++ ...vector-search-integrate-with-sqlalchemy.md | 185 ++++++ .../vector-search-integration-overview.md | 71 +++ .../tidb-cloud/vector-search-limitations.md | 42 ++ .../tidb-cloud/vector-search-overview.md | 72 +++ 14 files changed, 3012 insertions(+) create mode 100644 markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-data-types.md create mode 100644 markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-functions-and-operators.md create mode 100644 markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-get-started-using-python.md create mode 100644 markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-get-started-using-sql.md create mode 100644 markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-index.md create mode 100644 markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-django-orm.md create mode 100644 markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-jinaai-embedding.md create mode 100644 
markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-langchain.md create mode 100644 markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-llamaindex.md create mode 100644 markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-peewee.md create mode 100644 markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-sqlalchemy.md create mode 100644 markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integration-overview.md create mode 100644 markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-limitations.md create mode 100644 markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-overview.md diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-data-types.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-data-types.md new file mode 100644 index 00000000..90a46cfa --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-data-types.md @@ -0,0 +1,246 @@ +--- +title: Vector Data Types +summary: Learn about the Vector data types in TiDB. +--- + +# Vector Data Types + +A vector is a sequence of floating-point numbers, such as `[0.3, 0.5, -0.1, ...]`. TiDB offers Vector data types, specifically optimized for efficiently storing and querying vector embeddings widely used in AI applications. + +The following Vector data types are currently available: + +- `VECTOR`: A sequence of single-precision floating-point numbers with any dimension. +- `VECTOR(D)`: A sequence of single-precision floating-point numbers with a fixed dimension `D`. + +Using vector data types provides the following advantages over using the [`JSON`](/data-type-json.md) type: + +- Vector index support: You can build a [vector search index](/tidb-cloud/vector-search-index.md) to speed up vector searching. +- Dimension enforcement: You can specify a dimension to forbid inserting vectors with different dimensions. 
+- Optimized storage format: Vector data types are optimized for handling vector data, offering better space efficiency and performance compared to `JSON` types. + +> **Note** +> +> TiDB Vector Search is only available for TiDB (>= v8.4) and [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless). It is not available for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). + +## Syntax + +You can use a string in the following syntax to represent a Vector value: + +```sql +'[<float>, <float>, ...]' +``` + +Example: + +```sql +CREATE TABLE vector_table ( + id INT PRIMARY KEY, + embedding VECTOR(3) +); + +INSERT INTO vector_table VALUES (1, '[0.3, 0.5, -0.1]'); + +INSERT INTO vector_table VALUES (2, NULL); +``` + +Inserting vector values with invalid syntax will result in an error: + +```sql +[tidb]> INSERT INTO vector_table VALUES (3, '[5, ]'); +ERROR 1105 (HY000): Invalid vector text: [5, ] +``` + +In the following example, because dimension `3` is enforced for the `embedding` column when the table is created, inserting a vector with a different dimension will result in an error: + +```sql +[tidb]> INSERT INTO vector_table VALUES (4, '[0.3, 0.5]'); +ERROR 1105 (HY000): vector has 2 dimensions, does not fit VECTOR(3) +``` + +For available functions and operators over the vector data types, see [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md). + +For more information about building and using a vector search index, see [Vector Search Index](/tidb-cloud/vector-search-index.md). 
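The string syntax and the dimension check described in this section can also be mirrored on the client side before a value is sent to TiDB. The following is a minimal Python sketch, not part of TiDB or of any SDK: the helper name `to_vector_literal` and its `dim` parameter are hypothetical, and the error text only imitates the server message shown in the example above.

```python
def to_vector_literal(values, dim=None):
    """Format numbers as a vector string such as '[0.3,0.5,-0.1]'.

    Hypothetical helper for illustration only. When `dim` is given, it
    imitates the dimension check that a VECTOR(D) column performs.
    """
    floats = [float(v) for v in values]
    if dim is not None and len(floats) != dim:
        raise ValueError(
            f"vector has {len(floats)} dimensions, does not fit VECTOR({dim})"
        )
    # Join the elements with commas inside square brackets.
    return "[" + ",".join(repr(f) for f in floats) + "]"


print(to_vector_literal([0.3, 0.5, -0.1], dim=3))
```

A string built this way could then be bound as an ordinary string parameter of an `INSERT` statement. Note that Python floats are double-precision, while the Vector data types store single-precision values.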
+ +## Store vectors with different dimensions + +You can store vectors with different dimensions in the same column by omitting the dimension parameter in the `VECTOR` type: + +```sql +CREATE TABLE vector_table ( + id INT PRIMARY KEY, + embedding VECTOR +); + +INSERT INTO vector_table VALUES (1, '[0.3, 0.5, -0.1]'); -- 3-dimensional vector, OK +INSERT INTO vector_table VALUES (2, '[0.3, 0.5]'); -- 2-dimensional vector, OK +``` + +However, note that you cannot build a [vector search index](/tidb-cloud/vector-search-index.md) for this column, because vector distances can only be calculated between vectors with the same dimension. + +## Comparison + +You can compare vector data types using [comparison operators](/functions-and-operators/operators.md) such as `=`, `!=`, `<`, `>`, `<=`, and `>=`. For a complete list of comparison operators and functions for vector data types, see [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md). + +Vector data types are compared element-wise numerically. For example: + +- `[1] < [12]` +- `[1,2,3] < [1,2,5]` +- `[1,2,3] = [1,2,3]` +- `[2,2,3] > [1,2,3]` + +Two vectors with different dimensions are compared using lexicographical comparison, with the following rules: + +- Two vectors are compared element by element from the start, and each element is compared numerically. +- The first mismatching element determines which vector is lexicographically _less_ or _greater_ than the other. +- If one vector is a prefix of another, the shorter vector is lexicographically _less_ than the other. For example, `[1,2,3] < [1,2,3,0]`. +- Vectors of the same length with identical elements are lexicographically _equal_. +- An empty vector is lexicographically _less_ than any non-empty vector. For example, `[] < [1]`. +- Two empty vectors are lexicographically _equal_. 
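The lexicographical rules listed above can be sketched in a few lines of plain Python. This is an illustrative model only, assuming vectors are represented as Python lists of numbers; `compare_vectors` is a hypothetical name and not a TiDB function.

```python
def compare_vectors(a, b):
    """Return -1, 0, or 1 following the lexicographical rules listed above."""
    for x, y in zip(a, b):
        if x != y:
            # The first mismatching element decides the result.
            return -1 if x < y else 1
    # All shared positions are equal, so only the lengths can differ.
    if len(a) == len(b):
        return 0
    # The shorter vector is a prefix of the longer one and compares as less.
    return -1 if len(a) < len(b) else 1


print(compare_vectors([1], [12]))                # -1: 1 < 12
print(compare_vectors([1, 2, 3], [1, 2, 3, 0]))  # -1: a prefix is less
print(compare_vectors([], [1]))                  # -1: the empty vector is less
print(compare_vectors([2, 2, 3], [1, 2, 3]))     #  1: the first element decides
```

Python's built-in list comparison follows the same ordering, so `[1, 2, 3] < [1, 2, 3, 0]` also evaluates to `True` there.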
+ +When comparing vector constants, consider performing an [explicit cast](#cast) from string to vector to avoid comparisons based on string values: + +```sql +-- Because strings are given, TiDB compares them as strings: +[tidb]> SELECT '[12.0]' < '[4.0]'; ++--------------------+ +| '[12.0]' < '[4.0]' | ++--------------------+ +| 1 | ++--------------------+ +1 row in set (0.01 sec) + +-- Cast to vectors explicitly to compare them as vectors: +[tidb]> SELECT VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]'); ++--------------------------------------------------+ +| VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]') | ++--------------------------------------------------+ +| 0 | ++--------------------------------------------------+ +1 row in set (0.01 sec) +``` + +## Arithmetic + +Vector data types support arithmetic operations `+` (addition) and `-` (subtraction). However, arithmetic operations between vectors with different dimensions are not supported and will result in an error. + +Examples: + +```sql +[tidb]> SELECT VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[5]'); ++---------------------------------------------+ +| VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[5]') | ++---------------------------------------------+ +| [9] | ++---------------------------------------------+ +1 row in set (0.01 sec) + +[tidb]> SELECT VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]'); ++-----------------------------------------------------+ +| VEC_FROM_TEXT('[2,3,4]') - VEC_FROM_TEXT('[1,2,3]') | ++-----------------------------------------------------+ +| [1,1,1] | ++-----------------------------------------------------+ +1 row in set (0.01 sec) + +[tidb]> SELECT VEC_FROM_TEXT('[4]') + VEC_FROM_TEXT('[1,2,3]'); +ERROR 1105 (HY000): vectors have different dimensions: 1 and 3 +``` + +## Cast + +### Cast between Vector ⇔ String + +To cast between Vector and String, use the following functions: + +- `CAST(... 
AS CHAR)`: Vector ⇒ String +- `VEC_FROM_TEXT`: String ⇒ Vector +- `VEC_AS_TEXT`: Vector ⇒ String + +For convenience, if you call a function that only supports vector data types, such as a vector distance function, you can pass in a format-compliant string directly. TiDB automatically performs an implicit cast in this case. + +```sql +-- The VEC_DIMS function only accepts VECTOR arguments, so you can directly pass in a string for an implicit cast. +[tidb]> SELECT VEC_DIMS('[0.3, 0.5, -0.1]'); ++------------------------------+ +| VEC_DIMS('[0.3, 0.5, -0.1]') | ++------------------------------+ +| 3 | ++------------------------------+ +1 row in set (0.01 sec) + +-- You can also explicitly cast a string to a vector using VEC_FROM_TEXT and then pass the vector to the VEC_DIMS function. +[tidb]> SELECT VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]')); ++---------------------------------------------+ +| VEC_DIMS(VEC_FROM_TEXT('[0.3, 0.5, -0.1]')) | ++---------------------------------------------+ +| 3 | ++---------------------------------------------+ +1 row in set (0.01 sec) + +-- You can also cast explicitly using CAST(... AS VECTOR): +[tidb]> SELECT VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR)); ++----------------------------------------------+ +| VEC_DIMS(CAST('[0.3, 0.5, -0.1]' AS VECTOR)) | ++----------------------------------------------+ +| 3 | ++----------------------------------------------+ +1 row in set (0.01 sec) +``` + +When using an operator or function that accepts multiple data types, you need to explicitly cast the string type to the vector type before passing the string to that operator or function, because TiDB does not perform implicit casts in this case. 
For example, before performing comparison operations, you need to explicitly cast strings to vectors; otherwise, TiDB compares them as string values rather than as numeric vector values: + +```sql +-- Because strings are given, TiDB compares them as strings: +[tidb]> SELECT '[12.0]' < '[4.0]'; ++--------------------+ +| '[12.0]' < '[4.0]' | ++--------------------+ +| 1 | ++--------------------+ +1 row in set (0.01 sec) + +-- Cast to vectors explicitly to compare them as vectors: +[tidb]> SELECT VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]'); ++--------------------------------------------------+ +| VEC_FROM_TEXT('[12.0]') < VEC_FROM_TEXT('[4.0]') | ++--------------------------------------------------+ +| 0 | ++--------------------------------------------------+ +1 row in set (0.01 sec) +``` + +You can also explicitly cast a vector to its string representation. For example, using the `VEC_AS_TEXT()` function: + +```sql +-- The string is first implicitly cast to a vector, and then the vector is explicitly cast to a string, thus returning a string in the normalized format: +[tidb]> SELECT VEC_AS_TEXT('[0.3, 0.5, -0.1]'); ++--------------------------------------+ +| VEC_AS_TEXT('[0.3, 0.5, -0.1]') | ++--------------------------------------+ +| [0.3,0.5,-0.1] | ++--------------------------------------+ +1 row in set (0.01 sec) +``` + +For additional cast functions, see [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md). + +### Cast between Vector ⇔ other data types + +Currently, direct casting between Vector and other data types (such as `JSON`) is not supported. To work around this limitation, use String as an intermediate data type for casting in your SQL statement. + +Note that vector data type columns stored in a table cannot be converted to other data types using `ALTER TABLE ... MODIFY COLUMN ...`. + +## Limitations + +See [Vector data type limitations](/tidb-cloud/vector-search-limitations.md#vector-data-type-limitations). 
+ +## MySQL compatibility + +Vector data types are TiDB specific, and are not supported in MySQL. + +## See also + +- [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md) +- [Vector Search Index](/tidb-cloud/vector-search-index.md) +- [Improve Vector Search Performance](/tidb-cloud/vector-search-improve-performance.md) diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-functions-and-operators.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-functions-and-operators.md new file mode 100644 index 00000000..d378b9fd --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-functions-and-operators.md @@ -0,0 +1,284 @@ +--- +title: Vector Functions and Operators +summary: Learn about functions and operators available for Vector data types. +--- + +# Vector Functions and Operators + +This document lists the functions and operators available for Vector data types. + +> **Note** +> +> TiDB Vector Search is only available for TiDB (>= v8.4) and [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless). It is not available for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). + +## Vector functions + +The following functions are designed specifically for [Vector data types](/tidb-cloud/vector-search-data-types.md). 
+ +**Vector distance functions:** + +| Function Name | Description | +| ----------------------------------------------------------- | ---------------------------------------------------------------- | +| [`VEC_L2_DISTANCE`](#vec_l2_distance) | Calculates L2 distance (Euclidean distance) between two vectors | +| [`VEC_COSINE_DISTANCE`](#vec_cosine_distance) | Calculates the cosine distance between two vectors | +| [`VEC_NEGATIVE_INNER_PRODUCT`](#vec_negative_inner_product) | Calculates the negative of the inner product between two vectors | +| [`VEC_L1_DISTANCE`](#vec_l1_distance) | Calculates L1 distance (Manhattan distance) between two vectors | + +**Other vector functions:** + +| Function Name | Description | +| --------------------------------- | --------------------------------------------------- | +| [`VEC_DIMS`](#vec_dims) | Returns the dimension of a vector | +| [`VEC_L2_NORM`](#vec_l2_norm) | Calculates the L2 norm (Euclidean norm) of a vector | +| [`VEC_FROM_TEXT`](#vec_from_text) | Converts a string into a vector | +| [`VEC_AS_TEXT`](#vec_as_text) | Converts a vector into a string | + +## Extended built-in functions and operators + +The following built-in functions and operators are extended to support operations on [Vector data types](/tidb-cloud/vector-search-data-types.md). + +**Arithmetic operators:** + +| Name | Description | +| :-------------------------------------------------------------------------------------- | :--------------------------------------- | +| [`+`](https://dev.mysql.com/doc/refman/8.0/en/arithmetic-functions.html#operator_plus) | Vector element-wise addition operator | +| [`-`](https://dev.mysql.com/doc/refman/8.0/en/arithmetic-functions.html#operator_minus) | Vector element-wise subtraction operator | + +For more information about how vector arithmetic works, see [Vector Data Type | Arithmetic](/tidb-cloud/vector-search-data-types.md#arithmetic). 
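As a rough illustration of what element-wise addition and subtraction mean, here is a plain-Python sketch that assumes vectors are lists of numbers. The names `vec_add` and `vec_sub` are hypothetical; the dimension check only models the error that TiDB reports for mismatched dimensions and does not reflect how TiDB implements these operators.

```python
def _check_dims(a, b):
    # Arithmetic between vectors with different dimensions is rejected.
    if len(a) != len(b):
        raise ValueError(
            f"vectors have different dimensions: {len(a)} and {len(b)}"
        )


def vec_add(a, b):
    """Element-wise addition of two same-dimension vectors."""
    _check_dims(a, b)
    return [x + y for x, y in zip(a, b)]


def vec_sub(a, b):
    """Element-wise subtraction of two same-dimension vectors."""
    _check_dims(a, b)
    return [x - y for x, y in zip(a, b)]


print(vec_add([4], [5]))
print(vec_sub([2, 3, 4], [1, 2, 3]))
```

Python computes in double precision, so results for non-integer inputs can differ in the last digits from the single-precision values that TiDB returns.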
+ +**Aggregate (GROUP BY) functions:** + +| Name | Description | +| :------------------------------------------------------------------------------------------------------------ | :----------------------------------------------- | +| [`COUNT()`](https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html#function_count) | Return a count of the number of rows returned | +| [`COUNT(DISTINCT)`](https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html#function_count-distinct) | Return the count of a number of different values | +| [`MAX()`](https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html#function_max) | Return the maximum value | +| [`MIN()`](https://dev.mysql.com/doc/refman/8.0/en/aggregate-functions.html#function_min) | Return the minimum value | + +**Comparison functions and operators:** + +| Name | Description | +| ------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------- | +| [`BETWEEN ... 
AND ...`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_between) | Check whether a value is within a range of values | +| [`COALESCE()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#function_coalesce) | Return the first non-NULL argument | +| [`=`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_equal) | Equal operator | +| [`<=>`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_equal-to) | NULL-safe equal to operator | +| [`>`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_greater-than) | Greater than operator | +| [`>=`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_greater-than-or-equal) | Greater than or equal operator | +| [`GREATEST()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#function_greatest) | Return the largest argument | +| [`IN()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_in) | Check whether a value is within a set of values | +| [`IS NULL`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_is-null) | Test whether a value is `NULL` | +| [`ISNULL()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#function_isnull) | Test whether the argument is `NULL` | +| [`LEAST()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#function_least) | Return the smallest argument | +| [`<`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_less-than) | Less than operator | +| [`<=`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_less-than-or-equal) | Less than or equal operator | +| [`NOT BETWEEN ... 
AND ...`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_not-between) | Check whether a value is not within a range of values | +| [`!=`, `<>`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_not-equal) | Not equal operator | +| [`NOT IN()`](https://dev.mysql.com/doc/refman/8.0/en/comparison-operators.html#operator_not-in) | Check whether a value is not within a set of values | + +For more information about how vectors are compared, see [Vector Data Type | Comparison](/tidb-cloud/vector-search-data-types.md#comparison). + +**Control flow functions:** + +| Name | Description | +| :------------------------------------------------------------------------------------------------ | :----------------------------- | +| [`CASE`](https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#operator_case) | Case operator | +| [`IF()`](https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#function_if) | If/else construct | +| [`IFNULL()`](https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#function_ifnull) | Null if/else construct | +| [`NULLIF()`](https://dev.mysql.com/doc/refman/8.0/en/flow-control-functions.html#function_nullif) | Return `NULL` if expr1 = expr2 | + +**Cast functions:** + +| Name | Description | +| :------------------------------------------------------------------------------------------ | :--------------------------------- | +| [`CAST()`](https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html#function_cast) | Cast a value as a string or vector | +| [`CONVERT()`](https://dev.mysql.com/doc/refman/8.0/en/cast-functions.html#function_convert) | Cast a value as a string | + +For more information about how to use `CAST()`, see [Vector Data Type | Cast](/tidb-cloud/vector-search-data-types.md#cast). 
+ +## Full references + +### VEC_L2_DISTANCE + +```sql +VEC_L2_DISTANCE(vector1, vector2) +``` + +Calculates the [L2 distance](https://en.wikipedia.org/wiki/Euclidean_distance) (Euclidean distance) between two vectors using the following formula: + +$DISTANCE(p,q)=\sqrt {\sum \limits _{i=1}^{n}{(p_{i}-q_{i})^{2}}}$ + +The two vectors must have the same dimension. Otherwise, an error is returned. + +Example: + +```sql +[tidb]> SELECT VEC_L2_DISTANCE('[0,3]', '[4,0]'); ++-----------------------------------+ +| VEC_L2_DISTANCE('[0,3]', '[4,0]') | ++-----------------------------------+ +| 5 | ++-----------------------------------+ +``` + +### VEC_COSINE_DISTANCE + +```sql +VEC_COSINE_DISTANCE(vector1, vector2) +``` + +Calculates the [cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity) between two vectors using the following formula: + +$DISTANCE(p,q)=1.0 - {\frac {\sum \limits _{i=1}^{n}{p_{i}q_{i}}}{{\sqrt {\sum \limits _{i=1}^{n}{p_{i}^{2}}}}\cdot {\sqrt {\sum \limits _{i=1}^{n}{q_{i}^{2}}}}}}$ + +The two vectors must have the same dimension. Otherwise, an error is returned. + +Example: + +```sql +[tidb]> SELECT VEC_COSINE_DISTANCE('[1, 1]', '[-1, -1]'); ++-------------------------------------------+ +| VEC_COSINE_DISTANCE('[1, 1]', '[-1, -1]') | ++-------------------------------------------+ +| 2 | ++-------------------------------------------+ +``` + +### VEC_NEGATIVE_INNER_PRODUCT + +```sql +VEC_NEGATIVE_INNER_PRODUCT(vector1, vector2) +``` + +Calculates the distance by using the negative of the [inner product](https://en.wikipedia.org/wiki/Dot_product) between two vectors, using the following formula: + +$DISTANCE(p,q)=- INNER\_PROD(p,q)=-\sum \limits _{i=1}^{n}{p_{i}q_{i}}$ + +The two vectors must have the same dimension. Otherwise, an error is returned. 
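The three distance formulas given so far can be cross-checked with a small plain-Python sketch. It is a direct transcription of the formulas under the assumption that vectors are lists of numbers; the function names are hypothetical stand-ins and not the TiDB implementation, and the zero-vector case for cosine distance is deliberately left unhandled.

```python
import math


def _require_same_dims(p, q):
    # Models the "same dimension" requirement stated for each function.
    if len(p) != len(q):
        raise ValueError("vectors must have the same dimension")


def l2_distance(p, q):
    """Square root of the sum of squared element differences."""
    _require_same_dims(p, q)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_distance(p, q):
    """1.0 minus the cosine similarity of the two vectors."""
    _require_same_dims(p, q)
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    # A zero vector would cause a division by zero; this sketch does not handle it.
    return 1.0 - dot / (norm_p * norm_q)


def negative_inner_product(p, q):
    """Negated sum of element-wise products."""
    _require_same_dims(p, q)
    return -sum(a * b for a, b in zip(p, q))


print(l2_distance([0, 3], [4, 0]))
print(cosine_distance([1, 1], [-1, -1]))
print(negative_inner_product([1, 2], [3, 4]))
```

Because this sketch uses double-precision floats, the cosine result may differ from the SQL output in the last decimal places.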
+ +Example: + +```sql +[tidb]> SELECT VEC_NEGATIVE_INNER_PRODUCT('[1,2]', '[3,4]'); ++----------------------------------------------+ +| VEC_NEGATIVE_INNER_PRODUCT('[1,2]', '[3,4]') | ++----------------------------------------------+ +| -11 | ++----------------------------------------------+ +``` + +### VEC_L1_DISTANCE + +```sql +VEC_L1_DISTANCE(vector1, vector2) +``` + +Calculates the [L1 distance](https://en.wikipedia.org/wiki/Taxicab_geometry) (Manhattan distance) between two vectors using the following formula: + +$DISTANCE(p,q)=\sum \limits _{i=1}^{n}{|p_{i}-q_{i}|}$ + +The two vectors must have the same dimension. Otherwise, an error is returned. + +Example: + +```sql +[tidb]> SELECT VEC_L1_DISTANCE('[0,0]', '[3,4]'); ++-----------------------------------+ +| VEC_L1_DISTANCE('[0,0]', '[3,4]') | ++-----------------------------------+ +| 7 | ++-----------------------------------+ +``` + +### VEC_DIMS + +```sql +VEC_DIMS(vector) +``` + +Returns the dimension of a vector. + +Examples: + +```sql +[tidb]> SELECT VEC_DIMS('[1,2,3]'); ++---------------------+ +| VEC_DIMS('[1,2,3]') | ++---------------------+ +| 3 | ++---------------------+ + +[tidb]> SELECT VEC_DIMS('[]'); ++----------------+ +| VEC_DIMS('[]') | ++----------------+ +| 0 | ++----------------+ +``` + +### VEC_L2_NORM + +```sql +VEC_L2_NORM(vector) +``` + +Calculates the L2 norm (Euclidean norm) of a vector using the following formula: + +$NORM(p)=\sqrt {\sum \limits _{i=1}^{n}{p_{i}^{2}}}$ + +Example: + +```sql +[tidb]> SELECT VEC_L2_NORM('[3,4]'); ++----------------------+ +| VEC_L2_NORM('[3,4]') | ++----------------------+ +| 5 | ++----------------------+ +``` + +### VEC_FROM_TEXT + +```sql +VEC_FROM_TEXT(string) +``` + +Converts a string into a vector. 
+ +Example: + +```sql +[tidb]> SELECT VEC_FROM_TEXT('[1,2]') + VEC_FROM_TEXT('[3,4]'); ++-------------------------------------------------+ +| VEC_FROM_TEXT('[1,2]') + VEC_FROM_TEXT('[3,4]') | ++-------------------------------------------------+ +| [4,6] | ++-------------------------------------------------+ +``` + +### VEC_AS_TEXT + +```sql +VEC_AS_TEXT(vector) +``` + +Converts a vector into a string. + +Example: + +```sql +[tidb]> SELECT VEC_AS_TEXT('[1.000, 2.5]'); ++-------------------------------+ +| VEC_AS_TEXT('[1.000, 2.5]') | ++-------------------------------+ +| [1,2.5] | ++-------------------------------+ +``` + +## MySQL compatibility + +The vector functions and the extended usage of built-in functions and operators over vector data types are TiDB specific, and are not supported in MySQL. + +## See also + +- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-get-started-using-python.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-get-started-using-python.md new file mode 100644 index 00000000..ac414b19 --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-get-started-using-python.md @@ -0,0 +1,195 @@ +--- +title: Get Started with TiDB + AI via Python +summary: Learn how to quickly develop an AI application that performs semantic search using Python and TiDB Vector Search. +--- + +# Get Started with TiDB + AI via Python + +This tutorial demonstrates how to develop a simple AI application that provides **semantic search** features. Unlike traditional keyword search, semantic search intelligently understands the meaning behind your query and returns the most relevant result. For example, if you have documents titled "dog", "fish", and "tree", and you search for "a swimming animal", the application would identify "fish" as the most relevant result. 
+ +Throughout this tutorial, you will develop this AI application using [TiDB Vector Search](/tidb-cloud/vector-search-overview.md), Python, [TiDB Vector SDK for Python](https://github.com/pingcap/tidb-vector-python), and AI models. + +> **Note** +> +> TiDB Vector Search is only available for TiDB (>= v8.4) and [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless). It is not available for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). + +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Git](https://git-scm.com/downloads) installed. +- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. + +## Get started + +The following steps show how to develop the application from scratch. To run the demo directly, you can check out the sample code in the [pingcap/tidb-vector-python](https://github.com/pingcap/tidb-vector-python/blob/main/examples/python-client-quickstart) repository. + +### Step 1. Create a new Python project + +In your preferred directory, create a new Python project and a file named `example.py`: + +```shell +mkdir python-client-quickstart +cd python-client-quickstart +touch example.py +``` + +### Step 2. Install required dependencies + +In your project directory, run the following command to install the required packages: + +```shell +pip install sqlalchemy pymysql sentence-transformers tidb-vector python-dotenv +``` + +- `tidb-vector`: the Python client for interacting with TiDB Vector Search. +- [`sentence-transformers`](https://sbert.net): a Python library that provides pre-trained models for generating [vector embeddings](/tidb-cloud/vector-search-overview.md#vector-embedding) from text. + +### Step 3. Configure the connection string to the TiDB cluster + +1. 
Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. Ensure the configurations in the connection dialog match your operating environment. + + - **Connection Type** is set to `Public`. + - **Branch** is set to `main`. + - **Connect With** is set to `SQLAlchemy`. + - **Operating System** matches your environment. + + > **Tip:** + > + > If your program is running in Windows Subsystem for Linux (WSL), switch to the corresponding Linux distribution. + +4. Click the **PyMySQL** tab and copy the connection string. + + > **Tip:** + > + > If you have not set a password yet, click **Generate Password** to generate a random password. + +5. In the root directory of your Python project, create a `.env` file and paste the connection string into it. + + The following is an example for macOS: + + ```dotenv + TIDB_DATABASE_URL="mysql+pymysql://<prefix>.root:<password>@gateway01.<region>.prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" + ``` + +### Step 4. Initialize the embedding model + +An [embedding model](/tidb-cloud/vector-search-overview.md#embedding-model) transforms data into [vector embeddings](/tidb-cloud/vector-search-overview.md#vector-embedding). This example uses the pre-trained model [**msmarco-MiniLM-L12-cos-v5**](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L12-cos-v5) for text embedding. This lightweight model, provided by the `sentence-transformers` library, transforms text data into 384-dimensional vector embeddings. + +To set up the model, copy the following code into the `example.py` file. This code initializes a `SentenceTransformer` instance and defines a `text_to_embedding()` function for later use. 
+ +```python +from sentence_transformers import SentenceTransformer + +print("Downloading and loading the embedding model...") +embed_model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L12-cos-v5", trust_remote_code=True) +embed_model_dims = embed_model.get_sentence_embedding_dimension() + +def text_to_embedding(text): + """Generates vector embeddings for the given text.""" + embedding = embed_model.encode(text) + return embedding.tolist() +``` + +### Step 5. Connect to the TiDB cluster + +Use the `TiDBVectorClient` class to connect to your TiDB cluster and create a table `embedded_documents` with a vector column. + +> **Note** +> +> Make sure the dimension of your vector column in the table matches the dimension of the vectors generated by your embedding model. For example, the **msmarco-MiniLM-L12-cos-v5** model generates vectors with 384 dimensions, so the dimension of your vector columns in `embedded_documents` should be 384 as well. + +```python +import os +from tidb_vector.integrations import TiDBVectorClient +from dotenv import load_dotenv + +# Load the connection string from the .env file +load_dotenv() + +vector_store = TiDBVectorClient( + # The 'embedded_documents' table will store the vector data. + table_name='embedded_documents', + # The connection string to the TiDB cluster. + connection_string=os.environ.get('TIDB_DATABASE_URL'), + # The dimension of the vector generated by the embedding model. + vector_dimension=embed_model_dims, + # Recreate the table if it already exists. + drop_existing_table=True, +) +``` + +### Step 6. Embed text data and store the vectors + +In this step, you will prepare sample documents containing single words, such as "dog", "fish", and "tree". The following code uses the `text_to_embedding()` function to transform these text documents into vector embeddings, and then inserts them into the vector store. 
+ +```python +documents = [ + { + "id": "f8e7dee2-63b6-42f1-8b60-2d46710c1971", + "text": "dog", + "embedding": text_to_embedding("dog"), + "metadata": {"category": "animal"}, + }, + { + "id": "8dde1fbc-2522-4ca2-aedf-5dcb2966d1c6", + "text": "fish", + "embedding": text_to_embedding("fish"), + "metadata": {"category": "animal"}, + }, + { + "id": "e4991349-d00b-485c-a481-f61695f2b5ae", + "text": "tree", + "embedding": text_to_embedding("tree"), + "metadata": {"category": "plant"}, + }, +] + +vector_store.insert( + ids=[doc["id"] for doc in documents], + texts=[doc["text"] for doc in documents], + embeddings=[doc["embedding"] for doc in documents], + metadatas=[doc["metadata"] for doc in documents], +) +``` + +### Step 7. Perform semantic search + +In this step, you will search for "a swimming animal", which doesn't directly match any words in existing documents. + +The following code uses the `text_to_embedding()` function again to convert the query text into a vector embedding, and then queries with the embedding to find the top three closest matches. + +```python +def print_result(query, result): + print(f"Search result (\"{query}\"):") + for r in result: + print(f"- text: \"{r.document}\", distance: {r.distance}") + +query = "a swimming animal" +query_embedding = text_to_embedding(query) +search_result = vector_store.query(query_embedding, k=3) +print_result(query, search_result) +``` + +Run the `example.py` file and the output is as follows: + +```plain +Search result ("a swimming animal"): +- text: "fish", distance: 0.4562914811223072 +- text: "dog", distance: 0.6469335836410557 +- text: "tree", distance: 0.798545178640937 +``` + +The three terms in the search results are sorted by their respective distance from the queried vector: the smaller the distance, the more relevant the corresponding `document`. + +Therefore, according to the output, the swimming animal is most likely a fish, or a dog with a gift for swimming. 
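To make the distance values above more concrete, the following is a minimal sketch of how a cosine distance can be computed by hand. It assumes that the distances printed by the search are cosine distances, which is an assumption made here for illustration. The helper name `cosine_distance` is hypothetical and is not part of `tidb-vector`.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical 2-dimensional vectors, chosen only to show the value range.
print(cosine_distance([1, 0], [2, 0]))   # same direction
print(cosine_distance([1, 0], [0, 3]))   # orthogonal
print(cosine_distance([1, 0], [-1, 0]))  # opposite direction
```

Under this definition, vectors pointing in the same direction have a distance of 0, orthogonal vectors a distance of 1, and opposite vectors a distance of 2, which is why a smaller distance is read as a closer semantic match.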
+ +## See also + +- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) +- [Vector Search Index](/tidb-cloud/vector-search-index.md) diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-get-started-using-sql.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-get-started-using-sql.md new file mode 100644 index 00000000..b4f8538b --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-get-started-using-sql.md @@ -0,0 +1,150 @@ +--- +title: Get Started with Vector Search via SQL +summary: Learn how to quickly get started with Vector Search in TiDB using SQL statements to power your generative AI applications. +--- + +# Get Started with Vector Search via SQL + +TiDB extends MySQL syntax to support [Vector Search](/tidb-cloud/vector-search-overview.md) and introduces new [Vector data types](/tidb-cloud/vector-search-data-types.md) and several [vector functions](/tidb-cloud/vector-search-functions-and-operators.md). + +This tutorial demonstrates how to get started with TiDB Vector Search using only SQL statements. You will learn how to use the [MySQL command-line client](https://dev.mysql.com/doc/refman/8.4/en/mysql.html) to complete the following operations: + +- Connect to your TiDB cluster. +- Create a vector table. +- Store vector embeddings. +- Perform vector search queries. + +> **Note** +> +> TiDB Vector Search is only available for TiDB (>= v8.4) and [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless). It is not available for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). + +## Prerequisites + +To complete this tutorial, you need: + +- [MySQL command-line client](https://dev.mysql.com/doc/refman/8.4/en/mysql.html) (MySQL CLI) installed on your machine. +- A TiDB Cloud Serverless cluster.
Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. + +## Get started + +### Step 1. Connect to the TiDB cluster + +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. In the connection dialog, select **MySQL CLI** from the **Connect With** drop-down list and keep the default setting of the **Connection Type** as **Public**. + +4. If you have not set a password yet, click **Generate Password** to generate a random password. + +5. Copy the connection command and paste it into your terminal. The following is an example for macOS: + + ```bash + mysql -u '.root' -h '' -P 4000 -D 'test' --ssl-mode=VERIFY_IDENTITY --ssl-ca=/etc/ssl/cert.pem -p'' + ``` + +### Step 2. Create a vector table + +When creating a table, you can define a column as a [vector](/tidb-cloud/vector-search-overview.md#vector-embedding) column by specifying the `VECTOR` data type. + +For example, to create a table `embedded_documents` with a three-dimensional `VECTOR` column, execute the following SQL statements using your MySQL CLI: + +```sql +USE test; +CREATE TABLE embedded_documents ( + id INT PRIMARY KEY, + -- Column to store the original content of the document. + document TEXT, + -- Column to store the vector representation of the document. + embedding VECTOR(3) +); +``` + +The expected output is as follows: + +```text +Query OK, 0 rows affected (0.27 sec) +``` + +### Step 3. 
Insert vector embeddings into the table + +Insert three documents with their [vector embeddings](/tidb-cloud/vector-search-overview.md#vector-embedding) into the `embedded_documents` table: + +```sql +INSERT INTO embedded_documents +VALUES + (1, 'dog', '[1,2,1]'), + (2, 'fish', '[1,2,4]'), + (3, 'tree', '[1,0,0]'); +``` + +The expected output is as follows: + +```text +Query OK, 3 rows affected (0.15 sec) +Records: 3 Duplicates: 0 Warnings: 0 +``` + +> **Note** +> +> This example simplifies the dimensions of the vector embeddings and uses only 3-dimensional vectors for demonstration purposes. +> +> In real-world applications, [embedding models](/tidb-cloud/vector-search-overview.md#embedding-model) often produce vector embeddings with hundreds or thousands of dimensions. + +### Step 4. Query the vector table + +To verify that the documents have been inserted correctly, query the `embedded_documents` table: + +```sql +SELECT * FROM embedded_documents; +``` + +The expected output is as follows: + +```sql ++----+----------+-----------+ +| id | document | embedding | ++----+----------+-----------+ +| 1 | dog | [1,2,1] | +| 2 | fish | [1,2,4] | +| 3 | tree | [1,0,0] | ++----+----------+-----------+ +3 rows in set (0.15 sec) +``` + +### Step 5. Perform a vector search query + +Similar to full-text search, users provide search terms to the application when using vector search. + +In this example, the search term is "a swimming animal", and its corresponding vector embedding is assumed to be `[1,2,3]`. In practical applications, you need to use an embedding model to convert the user's search term into a vector embedding. + +Execute the following SQL statement, and TiDB will identify the top three documents closest to `[1,2,3]` by calculating the cosine distance (`vec_cosine_distance`) between `[1,2,3]` and each vector embedding in the table, and then sorting the rows by that distance.
+ +```sql +SELECT id, document, vec_cosine_distance(embedding, '[1,2,3]') AS distance +FROM embedded_documents +ORDER BY distance +LIMIT 3; +``` + +The expected output is as follows: + +```plain ++----+----------+---------------------+ +| id | document | distance | ++----+----------+---------------------+ +| 2 | fish | 0.00853986601633272 | +| 1 | dog | 0.12712843905603044 | +| 3 | tree | 0.7327387580875756 | ++----+----------+---------------------+ +3 rows in set (0.15 sec) +``` + +The three terms in the search results are sorted by their respective distance from the queried vector: the smaller the distance, the more relevant the corresponding `document`. + +Therefore, according to the output, the swimming animal is most likely a fish, or a dog with a gift for swimming. + +## See also + +- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) +- [Vector Search Index](/tidb-cloud/vector-search-index.md) diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-index.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-index.md new file mode 100644 index 00000000..00e6dce6 --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-index.md @@ -0,0 +1,243 @@ +--- +title: Vector Search Index +summary: Learn how to build and use the vector search index to accelerate K-Nearest neighbors (KNN) queries in TiDB. +--- + +# Vector Search Index + +K-nearest neighbors (KNN) search is the method for finding the K closest points to a given point in a vector space. The most straightforward approach to perform KNN search is a brute force search, which calculates the distance between the given vector and all other vectors in the space. This approach guarantees perfect accuracy, but it is usually too slow for real-world use. Therefore, approximate algorithms are commonly used in KNN search to enhance speed and efficiency. 
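The brute force approach described above can be sketched in a few lines of Python. This is an illustration only and not how TiDB executes queries: the function name `brute_force_knn` and the in-memory `rows` list are assumptions made for the example, and L2 distance is used simply because the standard library provides it as `math.dist`.

```python
import heapq
import math


def brute_force_knn(query, rows, k):
    """Return the k rows closest to `query` by L2 distance, nearest first.

    `rows` is an iterable of (label, vector) pairs. Every vector is compared
    with the query, so the cost grows linearly with the number of rows.
    """
    scored = ((math.dist(query, vec), label) for label, vec in rows)
    return heapq.nsmallest(k, scored)


# Hypothetical sample data: three labeled 3-dimensional vectors.
rows = [("dog", [1, 2, 1]), ("fish", [1, 2, 4]), ("tree", [1, 0, 0])]
for distance, label in brute_force_knn([1, 2, 3], rows, k=2):
    print(f"{label}: {distance:.4f}")
```

Because every stored vector is visited for each query, the result is exact, but the work scales with the size of the table. An approximate index gives up some of that exactness to avoid the full scan.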
+ +In TiDB, you can create and use vector search indexes for such approximate nearest neighbor (ANN) searches over columns with [vector data types](/tidb-cloud/vector-search-data-types.md). By using vector search indexes, vector search queries could be finished in milliseconds. + +Currently, TiDB supports the [HNSW (Hierarchical Navigable Small World)](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) vector search index algorithm. + +## Create the HNSW vector index + +[HNSW](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) is one of the most popular vector indexing algorithms. The HNSW index provides good performance with relatively high accuracy, up to 98% in specific cases. + +In TiDB, you can create an HNSW index for a column with a [vector data type](/tidb-cloud/vector-search-data-types.md) in either of the following ways: + +- When creating a table, use the following syntax to specify the vector column for the HNSW index: + + ```sql + CREATE TABLE foo ( + id INT PRIMARY KEY, + embedding VECTOR(5), + VECTOR INDEX idx_embedding ((VEC_COSINE_DISTANCE(embedding))) + ); + ``` + +- For an existing table that already contains a vector column, use the following syntax to create an HNSW index for the vector column: + + ```sql + CREATE VECTOR INDEX idx_embedding ON foo ((VEC_COSINE_DISTANCE(embedding))); + ALTER TABLE foo ADD VECTOR INDEX idx_embedding ((VEC_COSINE_DISTANCE(embedding))); + + -- You can also explicitly specify "USING HNSW" to build the vector search index. + CREATE VECTOR INDEX idx_embedding ON foo ((VEC_COSINE_DISTANCE(embedding))) USING HNSW; + ALTER TABLE foo ADD VECTOR INDEX idx_embedding ((VEC_COSINE_DISTANCE(embedding))) USING HNSW; + ``` + +> **Note:** +> +> The vector search index feature relies on TiFlash replicas for tables. +> +> - If a vector search index is defined when a table is created, TiDB automatically creates a TiFlash replica for the table. 
+> - If no vector search index is defined when a table is created, and the table currently does not have a TiFlash replica, you need to manually create a TiFlash replica before adding a vector search index to the table. For example: `ALTER TABLE table_name SET TIFLASH REPLICA 1;`. + +When creating an HNSW vector index, you need to specify the distance function for the vector: + +- Cosine Distance: `((VEC_COSINE_DISTANCE(embedding)))` +- L2 Distance: `((VEC_L2_DISTANCE(embedding)))` + +The vector index can only be created for fixed-dimensional vector columns, such as a column defined as `VECTOR(3)`. It cannot be created for non-fixed-dimensional vector columns (such as a column defined as `VECTOR`) because vector distances can only be calculated between vectors with the same dimension. + +For other limitations, see [Vector index limitations](/tidb-cloud/vector-search-limitations.md#vector-index-limitations). + +## Use the vector index + +The vector search index can be used in K-nearest neighbor search queries by using the `ORDER BY ... LIMIT` clause as follows: + +```sql +SELECT * +FROM foo +ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3, 4, 5]') +LIMIT 10; +``` + +To use an index in a vector search, make sure that the `ORDER BY ... LIMIT` clause uses the same distance function as the one specified when creating the vector index. + +## Use the vector index with filters + +Queries that contain a pre-filter (using the `WHERE` clause) cannot utilize the vector index because they are not querying for K-Nearest neighbors according to the SQL semantics.
For example: + +```sql +-- For the following query, the `WHERE` filter is performed before KNN, so the vector index cannot be used: + +SELECT * FROM vec_table +WHERE category = "document" +ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') +LIMIT 5; +``` + +To use the vector index with filters, query for the K-Nearest neighbors first using vector search, and then filter out unwanted results: + +```sql +-- For the following query, the `WHERE` filter is performed after KNN, so the vector index can be used: + +SELECT * FROM +( + SELECT * FROM vec_table + ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') + LIMIT 5 +) t +WHERE category = "document"; + +-- Note that this query might return fewer than 5 results if some are filtered out. +``` + +## View index build progress + +After you insert a large volume of data, some of it might not be instantly persisted to TiFlash. For vector data that has already been persisted, the vector search index is built synchronously. For data that has not yet been persisted, the index will be built once the data is persisted. This process does not affect the accuracy and consistency of the data. You can still perform vector searches at any time and get complete results. However, performance will be suboptimal until vector indexes are fully built.
+ +To view the index build progress, you can query the `INFORMATION_SCHEMA.TIFLASH_INDEXES` table as follows: + +```sql +SELECT * FROM INFORMATION_SCHEMA.TIFLASH_INDEXES; ++---------------+------------+----------+-------------+---------------+-----------+----------+------------+---------------------+-------------------------+--------------------+------------------------+---------------+------------------+ +| TIDB_DATABASE | TIDB_TABLE | TABLE_ID | COLUMN_NAME | INDEX_NAME | COLUMN_ID | INDEX_ID | INDEX_KIND | ROWS_STABLE_INDEXED | ROWS_STABLE_NOT_INDEXED | ROWS_DELTA_INDEXED | ROWS_DELTA_NOT_INDEXED | ERROR_MESSAGE | TIFLASH_INSTANCE | ++---------------+------------+----------+-------------+---------------+-----------+----------+------------+---------------------+-------------------------+--------------------+------------------------+---------------+------------------+ +| test | tcff1d827 | 219 | col1fff | 0a452311 | 7 | 1 | HNSW | 29646 | 0 | 0 | 0 | | 127.0.0.1:3930 | +| test | foo | 717 | embedding | idx_embedding | 2 | 1 | HNSW | 0 | 0 | 0 | 3 | | 127.0.0.1:3930 | ++---------------+------------+----------+-------------+---------------+-----------+----------+------------+---------------------+-------------------------+--------------------+------------------------+---------------+------------------+ +``` + +- You can check the `ROWS_STABLE_INDEXED` and `ROWS_STABLE_NOT_INDEXED` columns for the index build progress. When `ROWS_STABLE_NOT_INDEXED` becomes 0, the index build is complete. + + As a reference, indexing a 500 MiB vector dataset with 768 dimensions might take up to 20 minutes. The indexer can run in parallel for multiple tables. Currently, adjusting the indexer priority or speed is not supported. + +- You can check the `ROWS_DELTA_NOT_INDEXED` column for the number of rows in the Delta layer. Data in the storage layer of TiFlash is stored in two layers: Delta layer and Stable layer. 
The Delta layer stores recently inserted or updated rows and is periodically merged into the Stable layer according to the write workload. This merge process is called Compaction. + + The Delta layer is never indexed. To achieve optimal performance, you can force the merge of the Delta layer into the Stable layer so that all data can be indexed: + + ```sql + ALTER TABLE COMPACT; + ``` + + For more information, see [`ALTER TABLE ... COMPACT`](/sql-statements/sql-statement-alter-table-compact.md). + +In addition, you can monitor the execution progress of the DDL job by executing `ADMIN SHOW DDL JOBS;` and checking the `row count`. However, this method is not fully accurate, because the `row count` value is obtained from the `rows_stable_indexed` field in `TIFLASH_INDEXES`. You can use this approach as a reference for tracking the progress of indexing. + +## Check whether the vector index is used + +Use the [`EXPLAIN`](/sql-statements/sql-statement-explain.md) or [`EXPLAIN ANALYZE`](/sql-statements/sql-statement-explain-analyze.md) statement to check whether a query is using the vector index. When `annIndex:` appears in the `operator info` column for the `TableFullScan` executor, the table scan is utilizing the vector index. + +**Example: the vector index is used** + +```sql +[tidb]> EXPLAIN SELECT * FROM vector_table_with_index +ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') +LIMIT 10; ++-----+-------------------------------------------------------------------------------------+ +| ... | operator info | ++-----+-------------------------------------------------------------------------------------+ +| ... | ... | +| ... | Column#5, offset:0, count:10 | +| ... | ..., vec_cosine_distance(test.vector_table_with_index.embedding, [1,2,3])->Column#5 | +| ... | MppVersion: 1, data:ExchangeSender_16 | +| ... | ExchangeType: PassThrough | +| ... | ... | +| ... | Column#4, offset:0, count:10 | +| ...
| ..., vec_cosine_distance(test.vector_table_with_index.embedding, [1,2,3])->Column#4 | +| ... | annIndex:COSINE(test.vector_table_with_index.embedding..[1,2,3], limit:10), ... | ++-----+-------------------------------------------------------------------------------------+ +9 rows in set (0.01 sec) +``` + +**Example: The vector index is not used because of not specifying a Top K** + +```sql +[tidb]> EXPLAIN SELECT * FROM vector_table_with_index + -> ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]'); ++--------------------------------+-----+--------------------------------------------------+ +| id | ... | operator info | ++--------------------------------+-----+--------------------------------------------------+ +| Projection_15 | ... | ... | +| └─Sort_4 | ... | Column#4 | +| └─Projection_16 | ... | ..., vec_cosine_distance(..., [1,2,3])->Column#4 | +| └─TableReader_14 | ... | MppVersion: 1, data:ExchangeSender_13 | +| └─ExchangeSender_13 | ... | ExchangeType: PassThrough | +| └─TableFullScan_12 | ... 
| keep order:false, stats:pseudo | ++--------------------------------+-----+--------------------------------------------------+ +6 rows in set, 1 warning (0.01 sec) +``` + +When the vector index cannot be used, a warning occurs in some cases to help you learn the cause: + +```sql +-- Using a wrong distance function: +[tidb]> EXPLAIN SELECT * FROM vector_table_with_index +ORDER BY VEC_L2_DISTANCE(embedding, '[1, 2, 3]') +LIMIT 10; + +[tidb]> SHOW WARNINGS; +ANN index not used: not ordering by COSINE distance + +-- Using a wrong order: +[tidb]> EXPLAIN SELECT * FROM vector_table_with_index +ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') DESC +LIMIT 10; + +[tidb]> SHOW WARNINGS; +ANN index not used: index can be used only when ordering by vec_cosine_distance() in ASC order +``` + +## Analyze vector search performance + +To learn detailed information about how a vector index is used, you can execute the [`EXPLAIN ANALYZE`](/sql-statements/sql-statement-explain-analyze.md) statement and check the `execution info` column in the output: + +```sql +[tidb]> EXPLAIN ANALYZE SELECT * FROM vector_table_with_index +ORDER BY VEC_COSINE_DISTANCE(embedding, '[1, 2, 3]') +LIMIT 10; ++-----+--------------------------------------------------------+-----+ +| | execution info | | ++-----+--------------------------------------------------------+-----+ +| ... | time:339.1ms, loops:2, RU:0.000000, Concurrency:OFF | ... | +| ... | time:339ms, loops:2 | ... | +| ... | time:339ms, loops:3, Concurrency:OFF | ... | +| ... | time:339ms, loops:3, cop_task: {...} | ... | +| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | +| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | +| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | +| ... | tiflash_task:{time:327.5ms, loops:1, threads:4} | ... | +| ... | tiflash_task:{...}, vector_idx:{ | ... 
| +| | load:{total:68ms,from_s3:1,from_disk:0,from_cache:0},| | +| | search:{total:0ms,visited_nodes:2,discarded_nodes:0},| | +| | read:{vec_total:0ms,others_total:0ms}},...} | | ++-----+--------------------------------------------------------+-----+ +``` + +> **Note:** +> +> The execution information is internal. Fields and formats are subject to change without any notification. Do not rely on them. + +Explanation of some important fields: + +- `vector_index.load.total`: The total duration of loading index. This field might be larger than the actual query time because multiple vector indexes might be loaded in parallel. +- `vector_index.load.from_s3`: Number of indexes loaded from S3. +- `vector_index.load.from_disk`: Number of indexes loaded from disk. The index was already downloaded from S3 previously. +- `vector_index.load.from_cache`: Number of indexes loaded from cache. The index was already downloaded from S3 previously. +- `vector_index.search.total`: The total duration of searching in the index. Large latency usually means the index is cold (never accessed before, or accessed long ago) so that there are heavy I/O operations when searching through the index. This field might be larger than the actual query time because multiple vector indexes might be searched in parallel. +- `vector_index.search.discarded_nodes`: Number of vector rows visited but discarded during the search. These discarded vectors are not considered in the search result. Large values usually indicate that there are many stale rows caused by `UPDATE` or `DELETE` statements. + +See [`EXPLAIN`](/sql-statements/sql-statement-explain.md), [`EXPLAIN ANALYZE`](/sql-statements/sql-statement-explain-analyze.md), and [EXPLAIN Walkthrough](/explain-walkthrough.md) for interpreting the output. + +## Limitations + +See [Vector index limitations](/tidb-cloud/vector-search-limitations.md#vector-index-limitations). 
+ +## See also + +- [Improve Vector Search Performance](/tidb-cloud/vector-search-improve-performance.md) +- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-django-orm.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-django-orm.md new file mode 100644 index 00000000..54950429 --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-django-orm.md @@ -0,0 +1,221 @@ +--- +title: Integrate TiDB Vector Search with Django ORM +summary: Learn how to integrate TiDB Vector Search with Django ORM to store embeddings and perform semantic search. +--- + +# Integrate TiDB Vector Search with Django ORM + +This tutorial walks you through how to use [Django](https://www.djangoproject.com/) ORM to interact with the [TiDB Vector Search](/tidb-cloud/vector-search-overview.md), store embeddings, and perform vector search queries. + +> **Note** +> +> TiDB Vector Search is only available for TiDB (>= v8.4) and [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless). It is not available for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). + +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Git](https://git-scm.com/downloads) installed. +- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. + +## Run the sample app + +You can quickly learn about how to integrate TiDB Vector Search with Django ORM by following the steps below. + +### Step 1. Clone the repository + +Clone the `tidb-vector-python` repository to your local machine: + +```shell +git clone https://github.com/pingcap/tidb-vector-python.git +``` + +### Step 2. 
Create a virtual environment + +Create a virtual environment for your project: + +```bash +cd tidb-vector-python/examples/orm-django-quickstart +python3 -m venv .venv +source .venv/bin/activate +``` + +### Step 3. Install required dependencies + +Install the required dependencies for the demo project: + +```bash +pip install -r requirements.txt +``` + +Alternatively, you can install the following packages for your project: + +```bash +pip install Django django-tidb mysqlclient numpy python-dotenv +``` + +If you encounter installation issues with mysqlclient, refer to the mysqlclient official documentation. + +#### What is `django-tidb` + +`django-tidb` is a TiDB dialect for Django, which enhances the Django ORM to support TiDB-specific features (for example, TiDB Vector Search) and resolves compatibility issues between TiDB and Django. + +To install `django-tidb`, choose a version that matches your Django version. For example, if you are using `django==4.2.*`, install `django-tidb==4.2.*`. The minor version does not need to be the same. It is recommended to use the latest minor version. + +For more information, refer to [django-tidb repository](https://github.com/pingcap/django-tidb). + +### Step 4. Configure the environment variables + +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. Ensure the configurations in the connection dialog match your operating environment. + + - **Connection Type** is set to `Public` + - **Branch** is set to `main` + - **Connect With** is set to `General` + - **Operating System** matches your environment. + + > **Tip:** + > + > If your program is running in Windows Subsystem for Linux (WSL), switch to the corresponding Linux distribution. + +4. Copy the connection parameters from the connection dialog. 
+ + > **Tip:** + > + > If you have not set a password yet, click **Generate Password** to generate a random password. + +5. In the root directory of your Python project, create a `.env` file and paste the connection parameters to the corresponding environment variables. + + - `TIDB_HOST`: The host of the TiDB cluster. + - `TIDB_PORT`: The port of the TiDB cluster. + - `TIDB_USERNAME`: The username to connect to the TiDB cluster. + - `TIDB_PASSWORD`: The password to connect to the TiDB cluster. + - `TIDB_DATABASE`: The database name to connect to. + - `TIDB_CA_PATH`: The path to the root certificate file. + + The following is an example for macOS: + + ```dotenv + TIDB_HOST=gateway01.****.prod.aws.tidbcloud.com + TIDB_PORT=4000 + TIDB_USERNAME=********.root + TIDB_PASSWORD=******** + TIDB_DATABASE=test + TIDB_CA_PATH=/etc/ssl/cert.pem + ``` + +### Step 5. Run the demo + +Migrate the database schema: + +```bash +python manage.py migrate +``` + +Run the Django development server: + +```bash +python manage.py runserver +``` + +Open your browser and visit `http://127.0.0.1:8000` to try the demo application. Here are the available API paths: + +| API Path | Description | +| --------------------------------------- | ---------------------------------------- | +| `POST: /insert_documents` | Insert documents with embeddings. | +| `GET: /get_nearest_neighbors_documents` | Get the 3-nearest neighbor documents. | +| `GET: /get_documents_within_distance` | Get documents within a certain distance. | + +## Sample code snippets + +You can refer to the following sample code snippets to complete your own application development. 
+ +### Connect to the TiDB cluster + +In the file `sample_project/settings.py`, add the following configurations: + +```python +import os + +import dotenv + +# Load the TIDB_* variables from the .env file into the environment. +dotenv.load_dotenv() + +DATABASES = { + "default": { + # https://github.com/pingcap/django-tidb + "ENGINE": "django_tidb", + "HOST": os.environ.get("TIDB_HOST", "127.0.0.1"), + "PORT": int(os.environ.get("TIDB_PORT", 4000)), + "USER": os.environ.get("TIDB_USERNAME", "root"), + "PASSWORD": os.environ.get("TIDB_PASSWORD", ""), + "NAME": os.environ.get("TIDB_DATABASE", "test"), + "OPTIONS": { + "charset": "utf8mb4", + }, + } +} + +# Enable TLS with identity verification only when a CA path is configured. +TIDB_CA_PATH = os.environ.get("TIDB_CA_PATH", "") +if TIDB_CA_PATH: + DATABASES["default"]["OPTIONS"]["ssl_mode"] = "VERIFY_IDENTITY" + DATABASES["default"]["OPTIONS"]["ssl"] = { + "ca": TIDB_CA_PATH, + } +``` + +You can create a `.env` file in the root directory of your project and set up the environment variables `TIDB_HOST`, `TIDB_PORT`, `TIDB_USERNAME`, `TIDB_PASSWORD`, `TIDB_DATABASE`, and `TIDB_CA_PATH` with the actual values of your TiDB cluster. + +### Create vector tables + +#### Define a vector column + +`django-tidb` provides a `VectorField` to store vector embeddings in a table. + +Create a table with a column named `embedding` that stores a 3-dimensional vector. + +```python +class Document(models.Model): + content = models.TextField() + embedding = VectorField(dimensions=3) +``` + +### Store documents with embeddings + +```python +Document.objects.create(content="dog", embedding=[1, 2, 1]) +Document.objects.create(content="fish", embedding=[1, 2, 4]) +Document.objects.create(content="tree", embedding=[1, 0, 0]) +``` + +### Search the nearest neighbor documents + +TiDB Vector Search supports the following distance functions: + +- `L1Distance` +- `L2Distance` +- `CosineDistance` +- `NegativeInnerProduct` + +Search for the top-3 documents that are semantically closest to the query vector `[1, 2, 3]` based on the cosine distance function.
+ +```python +results = Document.objects.annotate( + distance=CosineDistance('embedding', [1, 2, 3]) +).order_by('distance')[:3] +``` + +### Search documents within a certain distance + +Search for the documents whose cosine distance from the query vector `[1, 2, 3]` is less than 0.2. + +```python +results = Document.objects.annotate( + distance=CosineDistance('embedding', [1, 2, 3]) +).filter(distance__lt=0.2).order_by('distance')[:3] +``` + +## See also + +- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) +- [Vector Search Index](/tidb-cloud/vector-search-index.md) diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-jinaai-embedding.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-jinaai-embedding.md new file mode 100644 index 00000000..719bd15b --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-jinaai-embedding.md @@ -0,0 +1,240 @@ +--- +title: Integrate TiDB Vector Search with Jina AI Embeddings API +summary: Learn how to integrate TiDB Vector Search with Jina AI Embeddings API to store embeddings and perform semantic search. +--- + +# Integrate TiDB Vector Search with Jina AI Embeddings API + +This tutorial walks you through how to use [Jina AI](https://jina.ai/) to generate embeddings for text data, and then store the embeddings in TiDB vector storage and search similar texts based on embeddings. + +> **Note** +> +> TiDB Vector Search is only available for TiDB (>= v8.4) and [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless). It is not available for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). + +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Git](https://git-scm.com/downloads) installed. +- A TiDB Cloud Serverless cluster. 
Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. + +## Run the sample app + +To learn how to integrate TiDB Vector Search with the Jina AI Embeddings API, follow the steps below. + +### Step 1. Clone the repository + +Clone the `tidb-vector-python` repository to your local machine: + +```shell +git clone https://github.com/pingcap/tidb-vector-python.git +``` + +### Step 2. Create a virtual environment + +Create a virtual environment for your project: + +```bash +cd tidb-vector-python/examples/jina-ai-embeddings-demo +python3 -m venv .venv +source .venv/bin/activate +``` + +### Step 3. Install required dependencies + +Install the required dependencies for the demo project: + +```bash +pip install -r requirements.txt +``` + +### Step 4. Configure the environment variables + +Get the Jina AI API key from the [Jina AI Embeddings API](https://jina.ai/embeddings/) page. Then, obtain the cluster connection string and configure environment variables as follows: + +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. Ensure the configurations in the connection dialog match your operating environment. + + - **Connection Type** is set to `Public`. + - **Branch** is set to `main`. + - **Connect With** is set to `SQLAlchemy`. + - **Operating System** matches your environment. + + > **Tip:** + > + > If your program is running in Windows Subsystem for Linux (WSL), switch to the corresponding Linux distribution. + +4. Switch to the **PyMySQL** tab and click the **Copy** icon to copy the connection string. + + > **Tip:** + > + > If you have not set a password yet, click **Create password** to generate a random password. + +5.
Set the Jina AI API key and the TiDB connection string as environment variables in your terminal, or create a `.env` file with the following environment variables: + + ```dotenv + JINAAI_API_KEY="****" + TIDB_DATABASE_URL="{tidb_connection_string}" + ``` + + The following is an example connection string for macOS: + + ```dotenv + TIDB_DATABASE_URL="mysql+pymysql://.root:@gateway01..prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" + ``` + +### Step 5. Run the demo + +```bash +python jina-ai-embeddings-demo.py +``` + +Example output: + +```text +- Inserting Data to TiDB... + - Inserting: Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI. + - Inserting: TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. +- List All Documents and Their Distances to the Query: + - distance: 0.3585317326132522 + content: Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI. + - distance: 0.10858102967720984 + content: TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. +- The Most Relevant Document and Its Distance to the Query: + - distance: 0.10858102967720984 + content: TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. 
+``` + +## Sample code snippets + +### Get embeddings from Jina AI + +Define a `generate_embeddings` helper function to call the Jina AI Embeddings API: + +```python +import os +import requests +import dotenv + +dotenv.load_dotenv() + +JINAAI_API_KEY = os.getenv('JINAAI_API_KEY') + +def generate_embeddings(text: str): + JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings' + JINAAI_HEADERS = { + 'Content-Type': 'application/json', + 'Authorization': f'Bearer {JINAAI_API_KEY}' + } + JINAAI_REQUEST_DATA = { + 'input': [text], + 'model': 'jina-embeddings-v2-base-en' # with dimension 768. + } + response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA) + return response.json()['data'][0]['embedding'] +``` + +### Connect to the TiDB cluster + +Connect to the TiDB cluster through SQLAlchemy: + +```python +import os +import dotenv + +from tidb_vector.sqlalchemy import VectorType +from sqlalchemy import create_engine +from sqlalchemy.orm import Session, declarative_base + +dotenv.load_dotenv() + +TIDB_DATABASE_URL = os.getenv('TIDB_DATABASE_URL') +assert TIDB_DATABASE_URL is not None +engine = create_engine(url=TIDB_DATABASE_URL, pool_recycle=300) +``` + +### Define the vector table schema + +Create a table named `jinaai_tidb_demo_documents` with a `content` column for storing texts and a vector column named `content_vec` for storing embeddings: + +```python +from sqlalchemy import Column, Integer, String, create_engine +from sqlalchemy.orm import declarative_base + +Base = declarative_base() + +class Document(Base): + __tablename__ = "jinaai_tidb_demo_documents" + + id = Column(Integer, primary_key=True) + content = Column(String(255), nullable=False) + content_vec = Column( + # DIMENSIONS is determined by the embedding model, + # for Jina AI's jina-embeddings-v2-base-en model it's 768. + VectorType(dim=768), + comment="hnsw(distance=cosine)" + ) +``` + +> **Note:** +> +> - The dimension of the vector column must match the dimension of the embeddings generated by the embedding model.
+> - In this example, the dimension of embeddings generated by the `jina-embeddings-v2-base-en` model is `768`. + +### Create embeddings with Jina AI and store in TiDB + +Use the Jina AI Embeddings API to generate embeddings for each piece of text and store the embeddings in TiDB: + +```python +TEXTS = [ + 'Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.', + 'TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.', +] +data = [] + +for text in TEXTS: + # Generate embeddings for the texts via Jina AI API. + embedding = generate_embeddings(text) + data.append({ + 'text': text, + 'embedding': embedding + }) + +with Session(engine) as session: + print('- Inserting Data to TiDB...') + for item in data: + print(f' - Inserting: {item["text"]}') + session.add(Document( + content=item['text'], + content_vec=item['embedding'] + )) + session.commit() +``` + +### Perform semantic search with Jina AI embeddings in TiDB + +Generate the embedding for the query text via Jina AI embeddings API, and then search for the most relevant document based on the cosine distance between **the embedding of the query text** and **each embedding in the vector table**: + +```python +query = 'What is TiDB?' +# Generate the embedding for the query via Jina AI API. 
+query_embedding = generate_embeddings(query) + +with Session(engine) as session: + print('- The Most Relevant Document and Its Distance to the Query:') + doc, distance = session.query( + Document, + Document.content_vec.cosine_distance(query_embedding).label('distance') + ).order_by( + 'distance' + ).limit(1).first() + print(f' - distance: {distance}\n' + f' content: {doc.content}') +``` + +## See also + +- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) +- [Vector Search Index](/tidb-cloud/vector-search-index.md) diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-langchain.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-langchain.md new file mode 100644 index 00000000..0b545e1b --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-langchain.md @@ -0,0 +1,586 @@ +--- +title: Integrate Vector Search with LangChain +summary: Learn how to integrate Vector Search in TiDB Cloud with LangChain. +--- + +# Integrate Vector Search with LangChain + +This tutorial demonstrates how to integrate the [vector search](/tidb-cloud/vector-search-overview.md) feature in TiDB Cloud with [LangChain](https://python.langchain.com/). + +> **Note** +> +> TiDB Vector Search is only available for TiDB (>= v8.4) and [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless). It is not available for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). + +> **Tip** +> +> You can view the complete [sample code](https://github.com/langchain-ai/langchain/blob/master/docs/docs/integrations/vectorstores/tidb_vector.ipynb) on Jupyter Notebook, or run the sample code directly in the [Colab](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/integrations/vectorstores/tidb_vector.ipynb) online environment. 
+ +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Jupyter Notebook](https://jupyter.org/install) installed. +- [Git](https://git-scm.com/downloads) installed. +- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. + +## Get started + +This section provides step-by-step instructions for integrating TiDB Vector Search with LangChain to perform semantic searches. + +### Step 1. Create a new Jupyter Notebook file + +In your preferred directory, create a new Jupyter Notebook file named `integrate_with_langchain.ipynb`: + +```shell +touch integrate_with_langchain.ipynb +``` + +### Step 2. Install required dependencies + +In your project directory, run the following command to install the required packages: + +```shell +!pip install langchain langchain-community +!pip install langchain-openai +!pip install pymysql +!pip install tidb-vector +``` + +Open the `integrate_with_langchain.ipynb` file in Jupyter Notebook, and then add the following code to import the required packages: + +```python +from langchain_community.document_loaders import TextLoader +from langchain_community.vectorstores import TiDBVectorStore +from langchain_openai import OpenAIEmbeddings +from langchain_text_splitters import CharacterTextSplitter +``` + +### Step 3. Set up your environment + +Take the following steps to obtain the cluster connection string and configure environment variables: + +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. Ensure the configurations in the connection dialog match your operating environment. + + - **Connection Type** is set to `Public`. 
+ - **Branch** is set to `main`. + - **Connect With** is set to `SQLAlchemy`. + - **Operating System** matches your environment. + +4. Click the **PyMySQL** tab and copy the connection string. + + > **Tip:** + > + > If you have not set a password yet, click **Generate Password** to generate a random password. + +5. Configure environment variables. + + This document uses [OpenAI](https://platform.openai.com/docs/introduction) as the embedding model provider. In this step, you need to provide the connection string obtained from the previous step and your [OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key). + + To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key: + + ```python + # Use getpass to securely prompt for environment variables in your terminal. + import getpass + import os + + # Copy your connection string from the TiDB Cloud console. + # Connection string format: "mysql+pymysql://:@:4000/?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" + tidb_connection_string = getpass.getpass("TiDB Connection String:") + os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:") + ``` + +### Step 4. Load the sample document + +#### Step 4.1 Download the sample document + +In your project directory, create a directory named `data/how_to/` and download the sample document [`state_of_the_union.txt`](https://github.com/langchain-ai/langchain/blob/master/docs/docs/how_to/state_of_the_union.txt) from the [langchain-ai/langchain](https://github.com/langchain-ai/langchain) GitHub repository. 
+ +```shell +!mkdir -p 'data/how_to/' +!wget 'https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/how_to/state_of_the_union.txt' -O 'data/how_to/state_of_the_union.txt' +``` + +#### Step 4.2 Load and split the document + +Load the sample document from `data/how_to/state_of_the_union.txt` and split it into chunks of approximately 1,000 characters each using a `CharacterTextSplitter`. + +```python +loader = TextLoader("data/how_to/state_of_the_union.txt") +documents = loader.load() +text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0) +docs = text_splitter.split_documents(documents) +``` + +### Step 5. Embed and store document vectors + +TiDB vector store supports both cosine distance (`cosine`) and Euclidean distance (`l2`) for measuring similarity between vectors. The default strategy is cosine distance. + +The following code creates a table named `embedded_documents` in TiDB, which is optimized for vector search. + +```python +embeddings = OpenAIEmbeddings() +vector_store = TiDBVectorStore.from_documents( + documents=docs, + embedding=embeddings, + table_name="embedded_documents", + connection_string=tidb_connection_string, + distance_strategy="cosine", # default, another option is "l2" +) +``` + +Upon successful execution, you can directly view and access the `embedded_documents` table in your TiDB database. + +### Step 6. Perform a vector search + +This step demonstrates how to query "What did the president say about Ketanji Brown Jackson" from the document `state_of_the_union.txt`. + +```python +query = "What did the president say about Ketanji Brown Jackson" +``` + +#### Option 1: Use `similarity_search_with_score()` + +The `similarity_search_with_score()` method calculates the vector space distance between the documents and the query. This distance serves as a similarity score, determined by the chosen `distance_strategy`. The method returns the top `k` documents with the lowest scores.
A lower score indicates a higher similarity between a document and your query. + +```python +docs_with_score = vector_store.similarity_search_with_score(query, k=3) +for doc, score in docs_with_score: + print("-" * 80) + print("Score: ", score) + print(doc.page_content) + print("-" * 80) +``` + +
+ Expected output + +```plain +-------------------------------------------------------------------------------- +Score: 0.18472413652518527 +Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. + +Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. + +One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. + +And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. +-------------------------------------------------------------------------------- +-------------------------------------------------------------------------------- +Score: 0.21757513022785557 +A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. + +And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. + +We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. + +We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. + +We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. 
+ +We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders. +-------------------------------------------------------------------------------- +-------------------------------------------------------------------------------- +Score: 0.22676987253721725 +And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong. + +As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential. + +While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice. + +And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things. + +So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together. + +First, beat the opioid epidemic. +-------------------------------------------------------------------------------- +``` + +
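To make the score in Option 1 more concrete, the following is a minimal sketch of the two distance measures, assuming the standard definitions (cosine distance as 1 minus the cosine of the angle between two vectors, and Euclidean distance for `l2`). The helper names and the toy vectors are hypothetical and are not part of the LangChain or TiDB APIs. Real embeddings have hundreds of dimensions, and the distance is computed inside TiDB rather than in Python.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, assuming the standard definition 1 - cos(angle)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 3-dimensional "embeddings" used only for illustration.
query_vec = [1.0, 2.0, 3.0]
candidates = {
    "a": [1.0, 2.0, 1.0],
    "b": [1.0, 2.0, 4.0],
    "c": [1.0, 0.0, 0.0],
}

# Rank candidates from most to least similar: lower distance comes first.
ranked = sorted(candidates, key=lambda name: cosine_distance(query_vec, candidates[name]))
print(ranked)  # ['b', 'a', 'c']
```

Sorting by ascending distance mirrors how the method returns the `k` documents with the lowest scores.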
+ +#### Option 2: Use `similarity_search_with_relevance_scores()` + +The `similarity_search_with_relevance_scores()` method returns the top `k` documents with the highest relevance scores. A higher score indicates a higher degree of similarity between a document and your query. + +```python +docs_with_relevance_score = vector_store.similarity_search_with_relevance_scores(query, k=2) +for doc, score in docs_with_relevance_score: + print("-" * 80) + print("Score: ", score) + print(doc.page_content) + print("-" * 80) +``` + +
+ Expected output + +```plain +-------------------------------------------------------------------------------- +Score: 0.8152758634748147 +Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. + +Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. + +One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. + +And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. +-------------------------------------------------------------------------------- +-------------------------------------------------------------------------------- +Score: 0.7824248697721444 +A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans. + +And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system. + +We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling. + +We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers. + +We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster. 
+ +We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders. +-------------------------------------------------------------------------------- +``` + +
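The sample outputs of the two options appear to be related: each relevance score in Option 2 equals one minus the corresponding cosine distance in Option 1. The sketch below checks this against the numbers printed above. The function name `cosine_relevance` is hypothetical, and the `1 - distance` mapping is an assumption inferred from these outputs for the `cosine` strategy; it might not hold for `l2`.

```python
import math


def cosine_relevance(distance: float) -> float:
    """Map a cosine distance to a relevance score (assumed: 1 - distance)."""
    return 1.0 - distance


# Scores copied from the expected outputs of Option 1 and Option 2.
distances = [0.18472413652518527, 0.21757513022785557]
relevances = [0.8152758634748147, 0.7824248697721444]

for distance, relevance in zip(distances, relevances):
    # Compare with a tolerance because these are floating-point values.
    print(math.isclose(cosine_relevance(distance), relevance))
```

Under this assumption, a relevance threshold of `0.8` corresponds to a cosine distance of at most `0.2`, so only the first of the two documents above would pass it.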
+ +### Use as a retriever + +In LangChain, a [retriever](https://python.langchain.com/v0.2/docs/concepts/#retrievers) is an interface that retrieves documents in response to an unstructured query, providing more functionality than a vector store. The following code demonstrates how to use the TiDB vector store as a retriever. + +```python +retriever = vector_store.as_retriever( + search_type="similarity_score_threshold", + search_kwargs={"k": 3, "score_threshold": 0.8}, +) +docs_retrieved = retriever.invoke(query) +for doc in docs_retrieved: + print("-" * 80) + print(doc.page_content) + print("-" * 80) +``` + +The expected output is as follows: + +```plain +-------------------------------------------------------------------------------- +Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. + +Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. + +One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. + +And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. +-------------------------------------------------------------------------------- +``` + +### Remove the vector store + +To remove an existing TiDB vector store, use the `drop_vectorstore()` method: + +```python +vector_store.drop_vectorstore() +``` + +## Search with metadata filters + +To refine your searches, you can use metadata filters to retrieve specific nearest-neighbor results that match the applied filters.
+ +### Supported metadata types + +Each document in the TiDB vector store can be paired with metadata, structured as key-value pairs within a JSON object. Keys are always strings, while values can be any of the following types: + +- String +- Number: integer or floating point +- Boolean: `true` or `false` + +For example, the following is a valid metadata payload: + +```json +{ + "page": 12, + "book_title": "Siddhartha" +} +``` + +### Metadata filter syntax + +Available filters include the following: + +- `$or`: Selects vectors that match any one of the specified conditions. +- `$and`: Selects vectors that match all the specified conditions. +- `$eq`: Equal to the specified value. +- `$ne`: Not equal to the specified value. +- `$gt`: Greater than the specified value. +- `$gte`: Greater than or equal to the specified value. +- `$lt`: Less than the specified value. +- `$lte`: Less than or equal to the specified value. +- `$in`: In the specified array of values. +- `$nin`: Not in the specified array of values. + +If the metadata of a document is as follows: + +```json +{ + "page": 12, + "book_title": "Siddhartha" +} +``` + +The following metadata filters can match this document: + +```json +{ "page": 12 } +``` + +```json +{ "page": { "$eq": 12 } } +``` + +```json +{ + "page": { + "$in": [11, 12, 13] + } +} +``` + +```json +{ "page": { "$nin": [13] } } +``` + +```json +{ "page": { "$lt": 13 } } +``` + +```json +{ + "$or": [{ "page": 11 }, { "page": 12 }], + "$and": [{ "page": 12 }, { "book_title": "Siddhartha" }] +} +``` + +In a metadata filter, TiDB treats each key-value pair as a separate filter clause and combines these clauses using the `AND` logical operator.
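The filter semantics described above can be sketched as a small in-memory matcher. This is only an illustration of the operator definitions in the list, not how TiDB or `TiDBVectorStore` evaluates filters; the function `matches` and the table `_OPERATORS` are hypothetical names. The sketch also assumes that an operator clause on a key missing from the metadata never matches.

```python
from typing import Any, Dict

# Comparison operators from the list above, mapped to plain Python checks.
_OPERATORS = {
    "$eq": lambda actual, expected: actual == expected,
    "$ne": lambda actual, expected: actual != expected,
    "$gt": lambda actual, expected: actual > expected,
    "$gte": lambda actual, expected: actual >= expected,
    "$lt": lambda actual, expected: actual < expected,
    "$lte": lambda actual, expected: actual <= expected,
    "$in": lambda actual, expected: actual in expected,
    "$nin": lambda actual, expected: actual not in expected,
}


def matches(metadata: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Return True if `metadata` satisfies every clause in `flt` (clauses are ANDed)."""
    for key, condition in flt.items():
        if key == "$or":
            ok = any(matches(metadata, sub) for sub in condition)
        elif key == "$and":
            ok = all(matches(metadata, sub) for sub in condition)
        elif isinstance(condition, dict):
            # Operator form, for example {"page": {"$gte": 10, "$lt": 20}}.
            ok = key in metadata and all(
                _OPERATORS[op](metadata[key], value) for op, value in condition.items()
            )
        else:
            # Shorthand form, for example {"page": 12}, treated as equality.
            ok = metadata.get(key) == condition
        if not ok:
            return False
    return True


doc = {"page": 12, "book_title": "Siddhartha"}
print(matches(doc, {"page": {"$in": [11, 12, 13]}}))  # True
print(matches(doc, {"page": {"$gt": 12}}))  # False
```

Because top-level clauses are combined with `AND`, a filter such as `{"page": 12, "book_title": "Demian"}` does not match this document even though its first clause does.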
+ +### Example + +The following example adds two documents to `TiDBVectorStore` and adds a `title` field to each document as the metadata: + +```python +vector_store.add_texts( + texts=[ + "TiDB Vector offers advanced, high-speed vector processing capabilities, enhancing AI workflows with efficient data handling and analytics support.", + "TiDB Vector, starting as low as $10 per month for basic usage", + ], + metadatas=[ + {"title": "TiDB Vector functionality"}, + {"title": "TiDB Vector Pricing"}, + ], +) +``` + +The expected output is as follows: + +```plain +[UUID('c782cb02-8eec-45be-a31f-fdb78914f0a7'), + UUID('08dcd2ba-9f16-4f29-a9b7-18141f8edae3')] +``` + +Perform a similarity search with metadata filters: + +```python +docs_with_score = vector_store.similarity_search_with_score( + "Introduction to TiDB Vector", filter={"title": "TiDB Vector functionality"}, k=4 +) +for doc, score in docs_with_score: + print("-" * 80) + print("Score: ", score) + print(doc.page_content) + print("-" * 80) +``` + +The expected output is as follows: + +```plain +-------------------------------------------------------------------------------- +Score: 0.12761409169211535 +TiDB Vector offers advanced, high-speed vector processing capabilities, enhancing AI workflows with efficient data handling and analytics support. +-------------------------------------------------------------------------------- +``` + +## Advanced usage example: travel agent + +This section demonstrates a use case of integrating vector search with LangChain for a travel agent. The goal is to create personalized travel reports for clients, helping them find airports with specific amenities, such as clean lounges and vegetarian options. + +The process involves two main steps: + +1. Perform a semantic search across airport reviews to identify airport codes that match the desired amenities. +2.
Execute a SQL query to merge these codes with route information, highlighting airlines and destinations that align with the user's preferences. + +### Prepare data + +First, create a table to store airport route data: + +```python +# Create a table to store flight plan data. +vector_store.tidb_vector_client.execute( + """CREATE TABLE airplan_routes ( + id INT AUTO_INCREMENT PRIMARY KEY, + airport_code VARCHAR(10), + airline_code VARCHAR(10), + destination_code VARCHAR(10), + route_details TEXT, + duration TIME, + frequency INT, + airplane_type VARCHAR(50), + price DECIMAL(10, 2), + layover TEXT + );""" +) + +# Insert some sample data into airplan_routes and the vector table. +vector_store.tidb_vector_client.execute( + """INSERT INTO airplan_routes ( + airport_code, + airline_code, + destination_code, + route_details, + duration, + frequency, + airplane_type, + price, + layover + ) VALUES + ('JFK', 'DL', 'LAX', 'Non-stop from JFK to LAX.', '06:00:00', 5, 'Boeing 777', 299.99, 'None'), + ('LAX', 'AA', 'ORD', 'Direct LAX to ORD route.', '04:00:00', 3, 'Airbus A320', 149.99, 'None'), + ('EFGH', 'UA', 'SEA', 'Daily flights from SFO to SEA.', '02:30:00', 7, 'Boeing 737', 129.99, 'None'); + """ +) +vector_store.add_texts( + texts=[ + "Clean lounges and excellent vegetarian dining options.
Highly recommended.", + "Comfortable seating in lounge areas and diverse food selections, including vegetarian.", + "Small airport with basic facilities.", + ], + metadatas=[ + {"airport_code": "JFK"}, + {"airport_code": "LAX"}, + {"airport_code": "EFGH"}, + ], +) +``` + +The expected output is as follows: + +```plain +[UUID('6dab390f-acd9-4c7d-b252-616606fbc89b'), + UUID('9e811801-0e6b-4893-8886-60f4fb67ce69'), + UUID('f426747c-0f7b-4c62-97ed-3eeb7c8dd76e')] +``` + +### Perform a semantic search + +The following code searches for airports with clean facilities and vegetarian options: + +```python +retriever = vector_store.as_retriever( + search_type="similarity_score_threshold", + search_kwargs={"k": 3, "score_threshold": 0.85}, +) +semantic_query = "Could you recommend a US airport with clean lounges and good vegetarian dining options?" +reviews = retriever.invoke(semantic_query) +for r in reviews: + print("-" * 80) + print(r.page_content) + print(r.metadata) + print("-" * 80) +``` + +The expected output is as follows: + +```plain +-------------------------------------------------------------------------------- +Clean lounges and excellent vegetarian dining options. Highly recommended. +{'airport_code': 'JFK'} +-------------------------------------------------------------------------------- +-------------------------------------------------------------------------------- +Comfortable seating in lounge areas and diverse food selections, including vegetarian. 
+{'airport_code': 'LAX'} +-------------------------------------------------------------------------------- +``` + +### Retrieve detailed airport information + +Extract airport codes from the search results and query the database for detailed route information: + +```python +# Extracting airport codes from the metadata +airport_codes = [review.metadata["airport_code"] for review in reviews] + +# Executing a query to get the airport details +search_query = "SELECT * FROM airplan_routes WHERE airport_code IN :codes" +params = {"codes": tuple(airport_codes)} + +airport_details = vector_store.tidb_vector_client.execute(search_query, params) +airport_details.get("result") +``` + +The expected output is as follows: + +```plain +[(1, 'JFK', 'DL', 'LAX', 'Non-stop from JFK to LAX.', datetime.timedelta(seconds=21600), 5, 'Boeing 777', Decimal('299.99'), 'None'), + (2, 'LAX', 'AA', 'ORD', 'Direct LAX to ORD route.', datetime.timedelta(seconds=14400), 3, 'Airbus A320', Decimal('149.99'), 'None')] +``` + +### Streamline the process + +Alternatively, you can streamline the entire process using a single SQL query: + +```python +# Name of the vector table created in Step 5. +TABLE_NAME = "embedded_documents" + +search_query = f""" + SELECT + VEC_Cosine_Distance(se.embedding, :query_vector) as distance, + ar.*, + se.document as airport_review + FROM + airplan_routes ar + JOIN + {TABLE_NAME} se ON ar.airport_code = JSON_UNQUOTE(JSON_EXTRACT(se.meta, '$.airport_code')) + ORDER BY distance ASC + LIMIT 5; +""" +query_vector = embeddings.embed_query(semantic_query) +params = {"query_vector": str(query_vector)} +airport_details = vector_store.tidb_vector_client.execute(search_query, params) +airport_details.get("result") +``` + +The expected output is as follows: + +```plain +[(0.1219207353407008, 1, 'JFK', 'DL', 'LAX', 'Non-stop from JFK to LAX.', datetime.timedelta(seconds=21600), 5, 'Boeing 777', Decimal('299.99'), 'None', 'Clean lounges and excellent vegetarian dining options.
Highly recommended.'), + (0.14613754359804654, 2, 'LAX', 'AA', 'ORD', 'Direct LAX to ORD route.', datetime.timedelta(seconds=14400), 3, 'Airbus A320', Decimal('149.99'), 'None', 'Comfortable seating in lounge areas and diverse food selections, including vegetarian.'), + (0.19840519342700513, 3, 'EFGH', 'UA', 'SEA', 'Daily flights from SFO to SEA.', datetime.timedelta(seconds=9000), 7, 'Boeing 737', Decimal('129.99'), 'None', 'Small airport with basic facilities.')] +``` + +### Clean up data + +Finally, clean up the resources by dropping the created table: + +```python +vector_store.tidb_vector_client.execute("DROP TABLE airplan_routes") +``` + +The expected output is as follows: + +```plain +{'success': True, 'result': 0, 'error': None} +``` + +## See also + +- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) +- [Vector Search Index](/tidb-cloud/vector-search-index.md) diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-llamaindex.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-llamaindex.md new file mode 100644 index 00000000..38a34719 --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-llamaindex.md @@ -0,0 +1,266 @@ +--- +title: Integrate Vector Search with LlamaIndex +summary: Learn how to integrate TiDB Vector Search with LlamaIndex. +--- + +# Integrate Vector Search with LlamaIndex + +This tutorial demonstrates how to integrate the [vector search](/tidb-cloud/vector-search-overview.md) feature of TiDB with [LlamaIndex](https://www.llamaindex.ai). + +> **Note** +> +> TiDB Vector Search is only available for TiDB (>= v8.4) and [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless). It is not available for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). 
+ +> **Tip** +> +> You can view the complete [sample code](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/TiDBVector.ipynb) on Jupyter Notebook, or run the sample code directly in the [Colab](https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/TiDBVector.ipynb) online environment. + +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Jupyter Notebook](https://jupyter.org/install) installed. +- [Git](https://git-scm.com/downloads) installed. +- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. + +## Get started + +This section provides step-by-step instructions for integrating TiDB Vector Search with LlamaIndex to perform semantic searches. + +### Step 1. Create a new Jupyter Notebook file + +In the root directory, create a new Jupyter Notebook file named `integrate_with_llamaindex.ipynb`: + +```shell +touch integrate_with_llamaindex.ipynb +``` + +### Step 2. Install required dependencies + +In your project directory, run the following command to install the required packages: + +```shell +pip install llama-index-vector-stores-tidbvector +pip install llama-index +``` + +Open the `integrate_with_llamaindex.ipynb` file in Jupyter Notebook and add the following code to import the required packages: + +```python +import textwrap + +from llama_index.core import SimpleDirectoryReader, StorageContext +from llama_index.core import VectorStoreIndex +from llama_index.vector_stores.tidbvector import TiDBVectorStore +``` + +### Step 3. Configure environment variables + +Take the following steps to obtain the cluster connection string and configure environment variables: + +1. 
Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. Ensure the configurations in the connection dialog match your operating environment. + + - **Connection Type** is set to `Public`. + - **Branch** is set to `main`. + - **Connect With** is set to `SQLAlchemy`. + - **Operating System** matches your environment. + +4. Click the **PyMySQL** tab and copy the connection string. + + > **Tip:** + > + > If you have not set a password yet, click **Generate Password** to generate a random password. + +5. Configure environment variables. + + This document uses [OpenAI](https://platform.openai.com/docs/introduction) as the embedding model provider. In this step, you need to provide the connection string obtained from the previous step and your [OpenAI API key](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key). + + To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key: + + ```python + # Use getpass to securely prompt for environment variables in your terminal. + import getpass + import os + + # Copy your connection string from the TiDB Cloud console. + # Connection string format: "mysql+pymysql://:@:4000/?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" + tidb_connection_string = getpass.getpass("TiDB Connection String:") + os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:") + ``` + +### Step 4.
Load the sample document + +#### Step 4.1 Download the sample document + +In your project directory, create a directory named `data/paul_graham/` and download the sample document [`paul_graham_essay.txt`](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt) from the [run-llama/llama_index](https://github.com/run-llama/llama_index) GitHub repository. + +```shell +!mkdir -p 'data/paul_graham/' +!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt' +``` + +#### Step 4.2 Load the document + +Load the sample document from `data/paul_graham/paul_graham_essay.txt` using the `SimpleDirectoryReader` class. + +```python +documents = SimpleDirectoryReader("./data/paul_graham").load_data() +print("Document ID:", documents[0].doc_id) + +for index, document in enumerate(documents): + document.metadata = {"book": "paul_graham"} +``` + +### Step 5. Embed and store document vectors + +#### Step 5.1 Initialize the TiDB vector store + +The following code creates a table named `paul_graham_test` in TiDB, which is optimized for vector search. + +```python +tidbvec = TiDBVectorStore( + connection_string=tidb_connection_string, + table_name="paul_graham_test", + distance_strategy="cosine", + vector_dimension=1536, + drop_existing_table=False, +) +``` + +Upon successful execution, you can directly view and access the `paul_graham_test` table in your TiDB database. + +#### Step 5.2 Generate and store embeddings + +The following code parses the documents, generates embeddings, and stores them in the TiDB vector store.
+ +```python +storage_context = StorageContext.from_defaults(vector_store=tidbvec) +index = VectorStoreIndex.from_documents( + documents, storage_context=storage_context, show_progress=True +) +``` + +The expected output is as follows: + +```plain +Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 8.76it/s] +Generating embeddings: 100%|██████████| 21/21 [00:02<00:00, 8.22it/s] +``` + +### Step 6. Perform a vector search + +The following creates a query engine based on the TiDB vector store and performs a semantic similarity search. + +```python +query_engine = index.as_query_engine() +response = query_engine.query("What did the author do?") +print(textwrap.fill(str(response), 100)) +``` + +> **Note** +> +> `TiDBVectorStore` only supports the [`default`](https://docs.llamaindex.ai/en/stable/api_reference/storage/vector_store/?h=vectorstorequerymode#llama_index.core.vector_stores.types.VectorStoreQueryMode) query mode. + +The expected output is as follows: + +```plain +The author worked on writing, programming, building microcomputers, giving talks at conferences, +publishing essays online, developing spam filters, painting, hosting dinner parties, and purchasing +a building for office use. +``` + +### Step 7. Search with metadata filters + +To refine your searches, you can use metadata filters to retrieve specific nearest-neighbor results that match the applied filters. 
+ +#### Query with `book != "paul_graham"` filter + +The following example excludes results where the `book` metadata field is `"paul_graham"`: + +```python +from llama_index.core.vector_stores.types import ( + MetadataFilter, + MetadataFilters, +) + +query_engine = index.as_query_engine( + filters=MetadataFilters( + filters=[ + MetadataFilter(key="book", value="paul_graham", operator="!="), + ] + ), + similarity_top_k=2, +) +response = query_engine.query("What did the author learn?") +print(textwrap.fill(str(response), 100)) +``` + +The expected output is as follows: + +```plain +Empty Response +``` + +#### Query with `book == "paul_graham"` filter + +The following example filters results to include only documents where the `book` metadata field is `"paul_graham"`: + +```python +from llama_index.core.vector_stores.types import ( + MetadataFilter, + MetadataFilters, +) + +query_engine = index.as_query_engine( + filters=MetadataFilters( + filters=[ + MetadataFilter(key="book", value="paul_graham", operator="=="), + ] + ), + similarity_top_k=2, +) +response = query_engine.query("What did the author learn?") +print(textwrap.fill(str(response), 100)) +``` + +The expected output is as follows: + +```plain +The author learned programming on an IBM 1401 using an early version of Fortran in 9th grade, then +later transitioned to working with microcomputers like the TRS-80 and Apple II. Additionally, the +author studied philosophy in college but found it unfulfilling, leading to a switch to studying AI. +Later on, the author attended art school in both the US and Italy, where they observed a lack of +substantial teaching in the painting department. +``` + +### Step 8. 
Delete documents + +Delete the first document from the index: + +```python +tidbvec.delete(documents[0].doc_id) +``` + +Check whether the document has been deleted: + +```python +query_engine = index.as_query_engine() +response = query_engine.query("What did the author learn?") +print(textwrap.fill(str(response), 100)) +``` + +The expected output is as follows: + +```plain +Empty Response +``` + +## See also + +- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) +- [Vector Search Index](/tidb-cloud/vector-search-index.md) diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-peewee.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-peewee.md new file mode 100644 index 00000000..cb284ce5 --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-peewee.md @@ -0,0 +1,211 @@ +--- +title: Integrate TiDB Vector Search with peewee +summary: Learn how to integrate TiDB Vector Search with peewee to store embeddings and perform semantic searches. +--- + +# Integrate TiDB Vector Search with peewee + +This tutorial walks you through how to use [peewee](https://docs.peewee-orm.com/) to interact with [TiDB Vector Search](/tidb-cloud/vector-search-overview.md), store embeddings, and perform vector search queries. + +> **Note** +> +> TiDB Vector Search is only available for TiDB (>= v8.4) and [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless). It is not available for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). + +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Git](https://git-scm.com/downloads) installed. +- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one.
+ +## Run the sample app + +You can quickly learn about how to integrate TiDB Vector Search with peewee by following the steps below. + +### Step 1. Clone the repository + +Clone the [`tidb-vector-python`](https://github.com/pingcap/tidb-vector-python) repository to your local machine: + +```shell +git clone https://github.com/pingcap/tidb-vector-python.git +``` + +### Step 2. Create a virtual environment + +Create a virtual environment for your project: + +```bash +cd tidb-vector-python/examples/orm-peewee-quickstart +python3 -m venv .venv +source .venv/bin/activate +``` + +### Step 3. Install required dependencies + +Install the required dependencies for the demo project: + +```bash +pip install -r requirements.txt +``` + +Alternatively, you can install the following packages for your project: + +```bash +pip install peewee pymysql python-dotenv tidb-vector +``` + +### Step 4. Configure the environment variables + +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. Ensure the configurations in the connection dialog match your operating environment. + + - **Connection Type** is set to `Public`. + - **Branch** is set to `main`. + - **Connect With** is set to `General`. + - **Operating System** matches your environment. + + > **Tip:** + > + > If your program is running in Windows Subsystem for Linux (WSL), switch to the corresponding Linux distribution. + +4. Copy the connection parameters from the connection dialog. + + > **Tip:** + > + > If you have not set a password yet, click **Generate Password** to generate a random password. + +5. In the root directory of your Python project, create a `.env` file and paste the connection parameters to the corresponding environment variables. + + - `TIDB_HOST`: The host of the TiDB cluster. 
+ - `TIDB_PORT`: The port of the TiDB cluster. + - `TIDB_USERNAME`: The username to connect to the TiDB cluster. + - `TIDB_PASSWORD`: The password to connect to the TiDB cluster. + - `TIDB_DATABASE`: The database name to connect to. + - `TIDB_CA_PATH`: The path to the root certificate file. + + The following is an example for macOS: + + ```dotenv + TIDB_HOST=gateway01.****.prod.aws.tidbcloud.com + TIDB_PORT=4000 + TIDB_USERNAME=********.root + TIDB_PASSWORD=******** + TIDB_DATABASE=test + TIDB_CA_PATH=/etc/ssl/cert.pem + ``` + +### Step 5. Run the demo + +```bash +python peewee-quickstart.py +``` + +Example output: + +```text +Get 3-nearest neighbor documents: + - distance: 0.00853986601633272 + document: fish + - distance: 0.12712843905603044 + document: dog + - distance: 0.7327387580875756 + document: tree +Get documents within a certain distance: + - distance: 0.00853986601633272 + document: fish + - distance: 0.12712843905603044 + document: dog +``` + +## Sample code snippets + +You can refer to the following sample code snippets to develop your application. + +### Create vector tables + +#### Connect to TiDB cluster + +```python +import os +import dotenv + +from peewee import Model, MySQLDatabase, SQL, TextField +from tidb_vector.peewee import VectorField + +dotenv.load_dotenv() + +# Using `pymysql` as the driver. +connect_kwargs = { + 'ssl_verify_cert': True, + 'ssl_verify_identity': True, +} + +# Using `mysqlclient` as the driver. 
+# connect_kwargs = { +# 'ssl_mode': 'VERIFY_IDENTITY', +# 'ssl': { +# # Root certificate default path +# # https://docs.pingcap.com/tidbcloud/secure-connections-to-serverless-clusters/#root-certificate-default-path +# 'ca': os.environ.get('TIDB_CA_PATH', '/path/to/ca.pem'), +# }, +# } + +db = MySQLDatabase( + database=os.environ.get('TIDB_DATABASE', 'test'), + user=os.environ.get('TIDB_USERNAME', 'root'), + password=os.environ.get('TIDB_PASSWORD', ''), + host=os.environ.get('TIDB_HOST', 'localhost'), + port=int(os.environ.get('TIDB_PORT', '4000')), + **connect_kwargs, +) +``` + +#### Define a vector column + +Create a table named `peewee_demo_documents` with a column named `embedding` that stores a 3-dimensional vector. + +```python +class Document(Model): + class Meta: + database = db + table_name = 'peewee_demo_documents' + + content = TextField() + embedding = VectorField(3) +``` + +### Store documents with embeddings + +```python +Document.create(content='dog', embedding=[1, 2, 1]) +Document.create(content='fish', embedding=[1, 2, 4]) +Document.create(content='tree', embedding=[1, 0, 0]) +``` + +### Search the nearest neighbor documents + +Search for the top-3 documents that are semantically closest to the query vector `[1, 2, 3]` based on the cosine distance function. + +```python +distance = Document.embedding.cosine_distance([1, 2, 3]).alias('distance') +results = Document.select(Document, distance).order_by(distance).limit(3) +``` + +### Search documents within a certain distance + +Search for the documents whose cosine distance from the query vector `[1, 2, 3]` is less than 0.2.
+ +```python +distance_expression = Document.embedding.cosine_distance([1, 2, 3]) +distance = distance_expression.alias('distance') +results = Document.select(Document, distance).where(distance_expression < 0.2).order_by(distance).limit(3) +``` + +## See also + +- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) +- [Vector Search Index](/tidb-cloud/vector-search-index.md) diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-sqlalchemy.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-sqlalchemy.md new file mode 100644 index 00000000..fb901fd8 --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integrate-with-sqlalchemy.md @@ -0,0 +1,185 @@ +--- +title: Integrate TiDB Vector Search with SQLAlchemy +summary: Learn how to integrate TiDB Vector Search with SQLAlchemy to store embeddings and perform semantic searches. +--- + +# Integrate TiDB Vector Search with SQLAlchemy + +This tutorial walks you through how to use [SQLAlchemy](https://www.sqlalchemy.org/) to interact with [TiDB Vector Search](/tidb-cloud/vector-search-overview.md), store embeddings, and perform vector search queries. + +> **Note** +> +> TiDB Vector Search is only available for TiDB (>= v8.4) and [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless). It is not available for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). + +## Prerequisites + +To complete this tutorial, you need: + +- [Python 3.8 or higher](https://www.python.org/downloads/) installed. +- [Git](https://git-scm.com/downloads) installed. +- A TiDB Cloud Serverless cluster. Follow [creating a TiDB Cloud Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one. + +## Run the sample app + +You can quickly learn about how to integrate TiDB Vector Search with SQLAlchemy by following the steps below. 
+ +### Step 1. Clone the repository + +Clone the `tidb-vector-python` repository to your local machine: + +```shell +git clone https://github.com/pingcap/tidb-vector-python.git +``` + +### Step 2. Create a virtual environment + +Create a virtual environment for your project: + +```bash +cd tidb-vector-python/examples/orm-sqlalchemy-quickstart +python3 -m venv .venv +source .venv/bin/activate +``` + +### Step 3. Install the required dependencies + +Install the required dependencies for the demo project: + +```bash +pip install -r requirements.txt +``` + +Alternatively, you can install the following packages for your project: + +```bash +pip install pymysql python-dotenv sqlalchemy tidb-vector +``` + +### Step 4. Configure the environment variables + +1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page. + +2. Click **Connect** in the upper-right corner. A connection dialog is displayed. + +3. Ensure the configurations in the connection dialog match your environment. + + - **Connection Type** is set to `Public`. + - **Branch** is set to `main`. + - **Connect With** is set to `SQLAlchemy`. + - **Operating System** matches your environment. + + > **Tip:** + > + > If your program is running in Windows Subsystem for Linux (WSL), switch to the corresponding Linux distribution. + +4. Click the **PyMySQL** tab and copy the connection string. + + > **Tip:** + > + > If you have not set a password yet, click **Generate Password** to generate a random password. + +5. In the root directory of your Python project, create a `.env` file and paste the connection string into it. + + The following is an example for macOS: + + ```dotenv + TIDB_DATABASE_URL="mysql+pymysql://.root:@gateway01..prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true" + ``` + +### Step 5. 
Run the demo + +```bash +python sqlalchemy-quickstart.py +``` + +Example output: + +```text +Get 3-nearest neighbor documents: + - distance: 0.00853986601633272 + document: fish + - distance: 0.12712843905603044 + document: dog + - distance: 0.7327387580875756 + document: tree +Get documents within a certain distance: + - distance: 0.00853986601633272 + document: fish + - distance: 0.12712843905603044 + document: dog +``` + +## Sample code snippets + +You can refer to the following sample code snippets to develop your application. + +### Create vector tables + +#### Connect to TiDB cluster + +```python +import os +import dotenv + +from sqlalchemy import Column, Integer, create_engine, Text +from sqlalchemy.orm import declarative_base, Session +from tidb_vector.sqlalchemy import VectorType + +dotenv.load_dotenv() + +tidb_connection_string = os.environ['TIDB_DATABASE_URL'] +engine = create_engine(tidb_connection_string) +``` + +#### Define a vector column + +Create a table with a column named `embedding` that stores a 3-dimensional vector. + +```python +Base = declarative_base() + +class Document(Base): + __tablename__ = 'sqlalchemy_demo_documents' + id = Column(Integer, primary_key=True) + content = Column(Text) + embedding = Column(VectorType(3)) +``` + +### Store documents with embeddings + +```python +with Session(engine) as session: + session.add(Document(content="dog", embedding=[1, 2, 1])) + session.add(Document(content="fish", embedding=[1, 2, 4])) + session.add(Document(content="tree", embedding=[1, 0, 0])) + session.commit() +``` + +### Search the nearest neighbor documents + +Search for the top-3 documents that are semantically closest to the query vector `[1, 2, 3]` based on the cosine distance function. 
+ +```python +with Session(engine) as session: + distance = Document.embedding.cosine_distance([1, 2, 3]).label('distance') + results = session.query( + Document, distance + ).order_by(distance).limit(3).all() +``` + +### Search documents within a certain distance + +Search for documents whose cosine distance from the query vector `[1, 2, 3]` is less than 0.2. + +```python +with Session(engine) as session: + distance = Document.embedding.cosine_distance([1, 2, 3]).label('distance') + results = session.query( + Document, distance + ).filter(distance < 0.2).order_by(distance).limit(3).all() +``` + +## See also + +- [Vector Data Types](/tidb-cloud/vector-search-data-types.md) +- [Vector Search Index](/tidb-cloud/vector-search-index.md) diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integration-overview.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integration-overview.md new file mode 100644 index 00000000..a6755c30 --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-integration-overview.md @@ -0,0 +1,71 @@ +--- +title: Vector Search Integration Overview +summary: An overview of TiDB Vector Search integration, including supported AI frameworks, embedding models, and ORM libraries. +--- + +# Vector Search Integration Overview + +This document provides an overview of TiDB Vector Search integration, including supported AI frameworks, embedding models, and Object Relational Mapping (ORM) libraries. + +> **Note** +> +> TiDB Vector Search is only available for TiDB (>= v8.4) and [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless). It is not available for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). + +## AI frameworks + +TiDB provides official support for the following AI frameworks, enabling you to easily integrate AI applications developed based on these frameworks with TiDB Vector Search. 
+ +| AI frameworks | Tutorial | +| ------------- | ------------------------------------------------------------------------------------------------- | +| LangChain | [Integrate Vector Search with LangChain](/tidb-cloud/vector-search-integrate-with-langchain.md) | +| LlamaIndex | [Integrate Vector Search with LlamaIndex](/tidb-cloud/vector-search-integrate-with-llamaindex.md) | + +You can also use TiDB for other purposes in AI applications, such as document storage and knowledge graph storage. + +## Embedding models and services + +TiDB Vector Search supports storing vectors of up to 16383 dimensions, which accommodates most embedding models. + +To generate vectors, you can use either self-deployed open-source embedding models or embedding APIs from third-party providers. + +The following table lists some mainstream embedding service providers and the corresponding integration tutorials. + +| Embedding service providers | Tutorial | +| --------------------------- | ------------------------------------------------------------------------------------------------------------------- | +| Jina AI | [Integrate Vector Search with Jina AI Embeddings API](/tidb-cloud/vector-search-integrate-with-jinaai-embedding.md) | + +## Object Relational Mapping (ORM) libraries + +You can integrate TiDB Vector Search with your ORM library to interact with the TiDB database. + +The following table lists the supported ORM libraries and the corresponding integration tutorials: +
+| Language | ORM/Client | How to install | Tutorial |
+| -------- | ---------- | -------------- | -------- |
+| Python | TiDB Vector Client | `pip install tidb-vector[client]` | Get Started with Vector Search Using Python |
+| Python | SQLAlchemy | `pip install tidb-vector` | Integrate TiDB Vector Search with SQLAlchemy |
+| Python | peewee | `pip install tidb-vector` | Integrate TiDB Vector Search with peewee |
+| Python | Django | `pip install django-tidb[vector]` | Integrate TiDB Vector Search with Django |
diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-limitations.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-limitations.md new file mode 100644 index 00000000..8b573a7c --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-limitations.md @@ -0,0 +1,42 @@ +--- +title: Vector Search Limitations +summary: Learn the limitations of TiDB Vector Search. +--- + +# Vector Search Limitations + +This document describes the known limitations of TiDB Vector Search. + +> **Note** +> +> TiDB Vector Search is only available for TiDB (>= v8.4) and [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless). It is not available for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). + +## Vector data type limitations + +- Each [vector](/tidb-cloud/vector-search-data-types.md) supports up to 16383 dimensions. +- Vector data types cannot store `NaN`, `Infinity`, or `-Infinity` values. +- Vector data types cannot store double-precision floating-point numbers. If you insert or store double-precision floating-point numbers in vector columns, TiDB converts them to single-precision floating-point numbers. +- Vector columns cannot be used in primary keys, unique indexes, or partition keys. To accelerate vector search, use a [vector search index](/tidb-cloud/vector-search-index.md). +- Multiple vector columns in a table are allowed. However, there is [a limit on the total number of columns in a table](/tidb-limitations.md#limitations-on-a-single-table). +- Currently, TiDB does not support dropping a vector column that has a vector index. To drop such a column, drop the vector index first, and then drop the vector column. +- Currently, TiDB does not support changing a vector column to another data type, such as `JSON` or `VARCHAR`. + +## Vector index limitations + +- A vector index is used for vector search.
It cannot accelerate other queries, such as range queries or equality queries. Therefore, you cannot create a vector index on a non-vector column or on multiple vector columns. +- Multiple vector indexes in a table are allowed. However, there is [a limit on the total number of indexes in a table](/tidb-limitations.md#limitations-on-a-single-table). +- Multiple vector indexes on the same column are allowed only if they use different distance functions. +- Currently, only `VEC_COSINE_DISTANCE()` and `VEC_L2_DISTANCE()` are supported as the distance functions for vector indexes. +- Currently, TiDB does not support dropping a vector column that has a vector index. To drop such a column, drop the vector index first, and then drop the vector column. +- Currently, TiDB does not support setting a vector index as [invisible](/sql-statements/sql-statement-alter-index.md). + +## Compatibility with TiDB tools + +- The Data Migration feature in the TiDB Cloud console does not support migrating or replicating MySQL 9.0 vector data types to TiDB Cloud. + +## Feedback + +We value your feedback and are always here to help: + +- [Join our Discord](https://discord.gg/zcqexutz2R) +- [Visit our Support Portal](https://tidb.support.pingcap.com/) diff --git a/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-overview.md b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-overview.md new file mode 100644 index 00000000..92ce4ce2 --- /dev/null +++ b/markdown-pages/en/tidbcloud/master/tidb-cloud/vector-search-overview.md @@ -0,0 +1,72 @@ +--- +title: Vector Search (Beta) Overview +summary: Learn about Vector Search in TiDB. This feature provides an advanced search solution for performing semantic similarity searches across various data types, including documents, images, audio, and video.
+--- + +# Vector Search (Beta) Overview + +TiDB Vector Search (beta) provides an advanced search solution for performing semantic similarity searches across various data types, including documents, images, audio, and video. This feature enables developers to easily build scalable applications with generative artificial intelligence (AI) capabilities using familiar MySQL skills. + +> **Note** +> +> TiDB Vector Search is only available for TiDB (>= v8.4) and [TiDB Cloud Serverless](/tidb-cloud/select-cluster-tier.md#tidb-cloud-serverless). It is not available for [TiDB Cloud Dedicated](/tidb-cloud/select-cluster-tier.md#tidb-cloud-dedicated). + +## Concepts + +Vector search is a search method that prioritizes the meaning of your data to deliver relevant results. + +Unlike traditional full-text search, which relies on exact keyword matching and word frequency, vector search converts various data types (such as text, images, or audio) into high-dimensional vectors and queries based on the similarity between these vectors. This search method captures the semantic meaning and contextual information of the data, leading to a more precise understanding of user intent. + +Even when the search terms do not exactly match the content in the database, vector search can still provide results that align with the user's intent by analyzing the semantics of the data. + +For example, a full-text search for "a swimming animal" only returns results containing these exact keywords. In contrast, vector search can return results for other swimming animals, such as fish or ducks, even if these results do not contain the exact keywords. + +### Vector embedding + +A vector embedding, also known as an embedding, is a sequence of numbers that represents real-world objects in a high-dimensional space. It captures the meaning and context of unstructured data, such as documents, images, audio, and videos. 
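As a rough illustration of what "a sequence of numbers" means in practice, the following sketch treats three hand-written 3-dimensional vectors as embeddings and ranks them by cosine distance to a query vector. It reuses the toy vectors from the peewee and SQLAlchemy samples in this patch. Real embeddings come from an embedding model and usually have hundreds or thousands of dimensions, and `cosine_distance` here is a plain-Python helper written for this example, not a TiDB or `tidb-vector` API.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors (illustrative helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy "embeddings": the same vectors the ORM samples store for each document.
documents = {"dog": [1, 2, 1], "fish": [1, 2, 4], "tree": [1, 0, 0]}
query = [1, 2, 3]

# Rank documents from the smallest distance (most similar) to the largest.
ranked = sorted((cosine_distance(query, vec), name) for name, vec in documents.items())

for distance, name in ranked:
    print(f"{name}: {distance:.4f}")
# fish: 0.0085
# dog: 0.1271
# tree: 0.7327
```

The ordering matches the sample output of the quickstart scripts: the vector that points in nearly the same direction as the query (`fish`) has the smallest distance.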
+ +Vector embeddings are essential in machine learning and serve as the foundation for semantic similarity searches. + +TiDB introduces [Vector data types](/tidb-cloud/vector-search-data-types.md) and [Vector search index](/tidb-cloud/vector-search-index.md) designed to optimize the storage and retrieval of vector embeddings, enhancing their use in AI applications. You can store vector embeddings in TiDB and perform vector search queries to find the most relevant data using these data types. + +### Embedding model + +Embedding models are algorithms that transform data into [vector embeddings](#vector-embedding). + +Choosing an appropriate embedding model is crucial for ensuring the accuracy and relevance of semantic search results. For unstructured text data, you can find top-performing text embedding models on the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard). + +To learn how to generate vector embeddings for your specific data types, refer to integration tutorials or examples of embedding models. + +## How vector search works + +After converting raw data into vector embeddings and storing them in TiDB, your application can execute vector search queries to find the data most semantically or contextually relevant to a user's query. + +TiDB Vector Search identifies the top-k nearest neighbor (KNN) vectors by using a [distance function](/tidb-cloud/vector-search-functions-and-operators.md) to calculate the distance between the given vector and vectors stored in the database. The vectors closest to the given vector in the query represent the most similar data in meaning. + +![The Schematic TiDB Vector Search](/media/vector-search/embedding-search.png) + +As a relational database with integrated vector search capabilities, TiDB enables you to store data and their corresponding vector representations (that is, vector embeddings) together in one database. 
You can store them in either of the following ways: + +- Store data and their corresponding vector representations in different columns of the same table. +- Store data and their corresponding vector representations in different tables. In this case, you need to use `JOIN` queries to combine the tables when retrieving data. + +## Use cases + +### Retrieval-Augmented Generation (RAG) + +Retrieval-Augmented Generation (RAG) is an architecture designed to optimize the output of Large Language Models (LLMs). By using vector search, RAG applications can store vector embeddings in the database and retrieve relevant documents as additional context when the LLM generates responses, thereby improving the quality and relevance of the answers. + +### Semantic search + +Semantic search is a search technology that returns results based on the meaning of a query, rather than simply matching keywords. It interprets the meaning across different languages and various types of data (such as text, images, and audio) using embeddings. Vector search algorithms then use these embeddings to find the most relevant data that satisfies the user's query. + +### Recommendation engine + +A recommendation engine is a system that proactively suggests content, products, or services that are relevant and personalized to users. It accomplishes this by creating embeddings that represent user behavior and preferences. These embeddings help the system identify similar items that other users have interacted with or shown interest in. This increases the likelihood that the recommendations will be both relevant and appealing to the user. + +## See also + +To get started with TiDB Vector Search, see the following documents: + +- [Get started with vector search using Python](/tidb-cloud/vector-search-get-started-using-python.md) +- [Get started with vector search using SQL](/tidb-cloud/vector-search-get-started-using-sql.md)