Embedding results from the langchain_openai embeddings library are wrong #2589

Closed

xiyuan-lee opened this issue Nov 26, 2024 · 9 comments

@xiyuan-lee
Contributor

xiyuan-lee commented Nov 26, 2024

System Info

Ubuntu 20.04

Running Xinference with Docker?

  • docker
  • pip install
  • installation from source

Version info

All versions

The command used to start Xinference

Deployed with Docker

Reproduction

With an m3e-large model deployed via Xinference and its model ID set to text-embedding-ada-002, the embedding vectors returned by the langchain_openai embeddings library are wrong: as long as two texts start with the same character, their embedding vectors come back identical.

import os
from langchain_openai import OpenAIEmbeddings

# Point the OpenAI-compatible client at the local Xinference endpoint
os.environ["OPENAI_API_KEY"] = "empty"
os.environ["OPENAI_API_BASE"] = "http://127.0.0.1:9997/v1/"

# Two different texts that share the same leading characters
text = ['《中华人民共和国突发事件应对法》\n\n\n','《中华人民共和国突发事件应对法》\n\n\n(2007年8月30日第十届全国人民代表大会常务委员会第二十九次\n会议通过)\n\n\n']

embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002", chunk_size=1000)
result1 = embedding_model.embed_query(text[0])
result2 = embedding_model.embed_query(text[1])

if str(result1) == str(result2):
    print("identical")
else:
    print("not identical")

print(result2[:5])
print(result1[:5])
print()
print(result2[-5:])
print(result1[-5:])

Expected behavior

not identical

@XprobeBot XprobeBot added this to the v1.x milestone Nov 26, 2024
@qinxuye
Contributor

qinxuye commented Nov 27, 2024

@codingl2k1 could you take a look?

@codingl2k1 codingl2k1 self-assigned this Nov 27, 2024
@xiyuan-lee
Contributor Author

Found the problem: the openai embedding client converts the query into tokens. I added code that converts the tokens back into the query text and that fixes it. I'll submit a PR for this.
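
For reference, a minimal sketch of the mechanism described above, assuming the cl100k_base encoding that tiktoken associates with text-embedding-ada-002: langchain_openai pre-encodes the query, so the server receives a list of token IDs instead of text, and decoding those IDs with the same encoding recovers the original string.

import tiktoken

# tiktoken maps text-embedding-ada-002 to the cl100k_base encoding
enc = tiktoken.encoding_for_model("text-embedding-ada-002")

query = "《中华人民共和国突发事件应对法》\n\n\n"
token_ids = enc.encode(query)

# A list of integers like this is what the embeddings endpoint receives
# as `input` when the client pre-tokenizes the query.
print(token_ids[:10])

# Decoding the token IDs recovers the original query text.
print(enc.decode(token_ids) == query)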

@liunux4odoo
Contributor

Found the problem: the openai embedding client converts the query into tokens. I added code that converts the tokens back into the query text and that fixes it. I'll submit a PR for this.

A check is needed here.
Only langchain_openai.OpenAIEmbeddings converts the query with tiktoken; the openai SDK does not have this problem.
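
To illustrate the kind of check being suggested, a rough sketch (the function name normalize_embedding_input and the cl100k_base default are my own assumptions, not the actual Xinference change):

from typing import List, Union

import tiktoken

def normalize_embedding_input(
    inp: Union[str, List[str], List[int], List[List[int]]],
    encoding_name: str = "cl100k_base",
) -> List[str]:
    """Return the embedding input as a list of plain strings.

    Plain strings (as sent by the openai SDK) pass through unchanged;
    token-ID lists (as sent by langchain_openai.OpenAIEmbeddings, which
    pre-encodes queries with tiktoken) are decoded back into text.
    """
    if isinstance(inp, str):
        return [inp]
    if isinstance(inp, list):
        if not inp:
            return []
        if isinstance(inp[0], int):         # a single tokenized query
            enc = tiktoken.get_encoding(encoding_name)
            return [enc.decode(inp)]
        if isinstance(inp[0], list):        # a batch of tokenized queries
            enc = tiktoken.get_encoding(encoding_name)
            return [enc.decode(tokens) for tokens in inp]
        return [str(item) for item in inp]  # a batch of plain strings
    raise TypeError(f"Unsupported embedding input type: {type(inp)!r}")

Only the token branches pay the decoding cost; plain-string callers pass straight through.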

@xiyuan-lee changed the title from "Embedding results from the openai embeddings library are wrong" to "Embedding results from the langchain_openai embeddings library are wrong" Nov 28, 2024
@xiyuan-lee
Contributor Author

Found the problem: the openai embedding client converts the query into tokens. I added code that converts the tokens back into the query text and that fixes it. I'll submit a PR for this.

A check is needed here. Only langchain_openai.OpenAIEmbeddings converts the query with tiktoken; the openai SDK does not have this problem.

Yes, I noticed that too.

@qinxuye
Contributor

qinxuye commented Nov 28, 2024

Found the problem: the openai embedding client converts the query into tokens. I added code that converts the tokens back into the query text and that fixes it. I'll submit a PR for this.

Where will the PR be submitted?

xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 28, 2024
@xiyuan-lee
Contributor Author

Found the problem: the openai embedding client converts the query into tokens. I added code that converts the tokens back into the query text and that fixes it. I'll submit a PR for this.

Where will the PR be submitted?

Just submitted it.

@xiyuan-lee
Contributor Author

Because of the added decoding step, the response time roughly doubles.

xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 28, 2024
@qinxuye
Contributor

qinxuye commented Nov 28, 2024

Because of the added decoding step, the response time roughly doubles.

This should have no impact on clients other than langchain's OpenAIEmbeddings; can you confirm?

xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 28, 2024
xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 28, 2024
xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 28, 2024
xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 29, 2024
xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 29, 2024
xiyuan-lee added a commit to xiyuan-lee/inference that referenced this issue Nov 29, 2024
qinxuye added a commit that referenced this issue Nov 29, 2024
@qinxuye
Contributor

qinxuye commented Nov 29, 2024

Fixed by #2600.

@qinxuye qinxuye closed this as completed Nov 29, 2024