Embedding results from the langchain_openai embeddings library are wrong #2589

Closed

xiyuan-lee opened this issue Nov 26, 2024 · 9 comments

@xiyuan-lee
Contributor

xiyuan-lee commented Nov 26, 2024

System Info

Ubuntu 20.04

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

All versions

The command used to start Xinference

Deployed with Docker

Reproduction

With an m3e-large model deployed via Xinference and its model ID set to text-embedding-ada-002, the embedding vectors returned by the langchain_openai embeddings library are wrong: as long as two texts start with the same character, their embedding vectors come back identical.

import os
from langchain_openai import OpenAIEmbeddings

# Point the OpenAI-compatible client at the local Xinference endpoint
os.environ["OPENAI_API_KEY"] = "empty"
os.environ["OPENAI_API_BASE"] = "http://127.0.0.1:9997/v1/"

# Two different texts that share the same leading characters
text = ['《中华人民共和国突发事件应对法》\n\n\n','《中华人民共和国突发事件应对法》\n\n\n(2007年8月30日第十届全国人民代表大会常务委员会第二十九次\n会议通过)\n\n\n']

embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002", chunk_size=1000)
result1 = embedding_model.embed_query(text[0])
result2 = embedding_model.embed_query(text[1])

if str(result1) == str(result2):
    print("identical")
else:
    print("not identical")

print(result2[:5])
print(result1[:5])
print()
print(result2[-5:])
print(result1[-5:])

Expected behavior

not identical

@XprobeBot XprobeBot added this to the v1.x milestone Nov 26, 2024
@qinxuye
Contributor

qinxuye commented Nov 27, 2024

@codingl2k1 could you take a look?

@codingl2k1 codingl2k1 self-assigned this Nov 27, 2024
@xiyuan-lee
Contributor Author

Found the problem: the openai embedding client converts the query into tokens. I added code that converts the tokens back into the query text and that fixes it. I'll submit a PR for this.
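
For reference, a minimal sketch of the mechanism described above, assuming the cl100k_base encoding that tiktoken associates with text-embedding-ada-002: langchain_openai pre-encodes the query, so the server receives a list of token IDs instead of text, and decoding those IDs with the same encoding recovers the original string.

import tiktoken

# tiktoken maps text-embedding-ada-002 to the cl100k_base encoding
enc = tiktoken.encoding_for_model("text-embedding-ada-002")

query = "《中华人民共和国突发事件应对法》\n\n\n"
token_ids = enc.encode(query)

# A list of integers like this is what the embeddings endpoint receives
# as `input` when the client pre-tokenizes the query.
print(token_ids[:10])

# Decoding the token IDs recovers the original query text.
print(enc.decode(token_ids) == query)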

@liunux4odoo
Contributor

Found the problem: the openai embedding client converts the query into tokens. I added code that converts the tokens back into the query text and that fixes it. I'll submit a PR for this.

A check is needed here.
Only langchain_openai.OpenAIEmbeddings converts the query with tiktoken; the openai SDK does not have this problem.
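
To illustrate the kind of check being suggested, a rough sketch (the function name normalize_embedding_input and the cl100k_base default are my own assumptions, not the actual Xinference change):

from typing import List, Union

import tiktoken

def normalize_embedding_input(
    inp: Union[str, List[str], List[int], List[List[int]]],
    encoding_name: str = "cl100k_base",
) -> List[str]:
    """Return the embedding input as a list of plain strings.

    Plain strings (as sent by the openai SDK) pass through unchanged;
    token-ID lists (as sent by langchain_openai.OpenAIEmbeddings, which
    pre-encodes queries with tiktoken) are decoded back into text.
    """
    if isinstance(inp, str):
        return [inp]
    if isinstance(inp, list):
        if not inp:
            return []
        if isinstance(inp[0], int):         # a single tokenized query
            enc = tiktoken.get_encoding(encoding_name)
            return [enc.decode(inp)]
        if isinstance(inp[0], list):        # a batch of tokenized queries
            enc = tiktoken.get_encoding(encoding_name)
            return [enc.decode(tokens) for tokens in inp]
        return [str(item) for item in inp]  # a batch of plain strings
    raise TypeError(f"Unsupported embedding input type: {type(inp)!r}")

Only the token branches pay the decoding cost; plain-string callers pass straight through.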

@xiyuan-lee changed the title from "Embedding results from the openai embeddings library are wrong" to "Embedding results from the langchain_openai embeddings library are wrong" Nov 28, 2024
@xiyuan-lee
Contributor Author

Found the problem: the openai embedding client converts the query into tokens. I added code that converts the tokens back into the query text and that fixes it. I'll submit a PR for this.

A check is needed here. Only langchain_openai.OpenAIEmbeddings converts the query with tiktoken; the openai SDK does not have this problem.

Yes, I noticed that too.

@qinxuye
Contributor

qinxuye commented Nov 28, 2024

Found the problem: the openai embedding client converts the query into tokens. I added code that converts the tokens back into the query text and that fixes it. I'll submit a PR for this.

Where will the PR be submitted?

xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 28, 2024
@xiyuan-lee
Contributor Author

Found the problem: the openai embedding client converts the query into tokens. I added code that converts the tokens back into the query text and that fixes it. I'll submit a PR for this.

Where will the PR be submitted?

Just submitted it.

@xiyuan-lee
Contributor Author

Because of the added decoding step, the response time roughly doubles.

xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 28, 2024
@qinxuye
Contributor

qinxuye commented Nov 28, 2024

Because of the added decoding step, the response time roughly doubles.

This should have no impact on clients other than langchain's OpenAIEmbeddings; can you confirm?

xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 28, 2024
xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 28, 2024
xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 28, 2024
xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 29, 2024
xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 29, 2024
xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 29, 2024
qinxuye added a commit that referenced this issue Nov 29, 2024
@qinxuye
Contributor

qinxuye commented Nov 29, 2024

Fixed by #2600.

@qinxuye qinxuye closed this as completed Nov 29, 2024