Pinecone metadata to include confidence & summary #168

Closed
ccstan99 opened this issue Aug 26, 2023 · 4 comments
Labels
help wanted Extra attention is needed

Comments

@ccstan99
Collaborator

In the Pinecone metadata we need to include confidence & summary too.

  • It looks like the arXiv paper abstracts are stored in the summary, not in the text itself. The summary needs to be embedded too, to help us retrieve relevant papers.
  • When editors flag an article "thumbs down," it has already been embedded and stored in Pinecone, so we need to make sure it is not provided as context to the chatbot.
@ccstan99 added the "help wanted" label on Aug 27, 2023
@henri123lemoine
Collaborator

I'm not sure I understand what exactly that would look like.
It would be easy to add the confidence number to the pinecone metadata, which I can do. I could also add the summaries in the metadata, that's very doable, but I don't see the advantage of doing so.
Regarding embedding the summaries, should they be split in the same way the text is, and embedded the same way the text is, with no difference otherwise? That would be easy. I'm not sure however if things like the header should be different and should note where the summary is from or something like that.
Regarding articles that have been thumbed down, it's pretty easy to remove them if we add a metadata for that, or otherwise we can just temporarily remove them from the pinecone and add them later? Which do you think is best?

@ccstan99
Collaborator Author

> I could also add the summaries in the metadata, that's very doable, but I don't see the advantage of doing so.

Right now, it looks like the arXiv abstracts are stored in the summaries and are not included in the text. I think that's a key missing component. Initially, we can just split the abstracts/summaries like any other text, with the same header.
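To make the "split like any other text with the same header" idea concrete, here is a minimal sketch. The names `split_words` and `chunks_to_embed` are hypothetical, and the fixed-size word splitter is only a stand-in for whatever splitter the project actually uses.

```python
from typing import Callable, List


def split_words(text: str, max_words: int = 50) -> List[str]:
    """Stand-in splitter (an assumption): fixed-size windows of words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def chunks_to_embed(
    header: str,
    text: str,
    summary: str = "",
    split: Callable[[str], List[str]] = split_words,
) -> List[str]:
    """Return the strings to embed for one article.

    The summary (e.g. an arXiv abstract) is split with the same splitter
    as the body text, and every chunk gets the same header prepended.
    """
    pieces: List[str] = []
    for body in (summary, text):
        if body and body.strip():  # skip empty or missing fields
            pieces.extend(split(body))
    return [f"{header}\n{piece}" for piece in pieces]
```

With this shape, an article without a summary produces exactly the chunks it did before, so only articles that have one gain extra vectors.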

Eventually, we might want a separate namespace where all summaries are embedded without splitting into chunks. This could help retrieve sources to answer questions like "What was the paper/blog post about...", which is an often-requested feature. We should discuss before implementing this.

> Regarding articles that have been thumbed down, it's pretty easy to remove them if we add a metadata for that, or otherwise we can just temporarily remove them from the pinecone and add them later? Which do you think is best?

For thumbs down, I would set confidence=0 in the metadata, then make sure those entries don't get returned in Pinecone search results. As long as the results are filtered out, it's not as important whether the data remains in Pinecone or not... maybe leave it for now until we decide otherwise. Obviously, anything already low confidence in MySQL should NOT be added to Pinecone.
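A rough sketch of both halves of that, under stated assumptions: `thumbs_down_updates`, `should_index`, and the `MIN_CONFIDENCE` threshold are hypothetical names and values (the thread does not define "low confidence"), and the chunk IDs for an article are assumed to be known. The keyword arguments are shaped for the Pinecone Python client's `index.update(id=..., set_metadata=...)` call.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical threshold: the thread does not say what counts as "low confidence".
MIN_CONFIDENCE = 0.0


def thumbs_down_updates(chunk_ids: Iterable[str]) -> List[Dict[str, Any]]:
    """Build one metadata update per chunk of a thumbed-down article.

    Each dict is meant to be passed as index.update(**kwargs), which
    leaves the vector in place and only overwrites the metadata field.
    """
    return [{"id": cid, "set_metadata": {"confidence": 0}} for cid in chunk_ids]


def should_index(row: Dict[str, Any], min_confidence: float = MIN_CONFIDENCE) -> bool:
    """Decide whether a MySQL row should be embedded and upserted at all.

    Assumption: rows with no confidence value are still indexed.
    """
    confidence = row.get("confidence")
    return confidence is None or confidence > min_confidence
```

Usage would look roughly like `for kwargs in thumbs_down_updates(ids): index.update(**kwargs)`, where `index` is an existing Pinecone index handle and `ids` are the chunk IDs of the flagged article.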

@henri123lemoine
Collaborator

Ok, perfect. To have the chatbot not use those, we just need to add a filter that doesn't accept confidence=0: filter={"confidence": {"$ne": 0}}.
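For illustration, a small sketch of how that filter could be attached to every retrieval query. The helper `build_query_kwargs` and the default `top_k` are hypothetical; the keyword names follow the Pinecone Python client's `index.query` call.

```python
from typing import Any, Dict, List

# Filter from the comment above: drop anything marked confidence=0.
EXCLUDE_THUMBS_DOWN: Dict[str, Any] = {"confidence": {"$ne": 0}}


def build_query_kwargs(embedding: List[float], top_k: int = 10) -> Dict[str, Any]:
    """Assemble keyword arguments for index.query with the filter applied."""
    return {
        "vector": embedding,
        "top_k": top_k,
        "filter": EXCLUDE_THUMBS_DOWN,
        "include_metadata": True,  # return metadata so confidence can be inspected
    }
```

It would be called as `index.query(**build_query_kwargs(query_embedding))`. Keeping the filter in one constant means every chatbot query path excludes thumbed-down articles the same way.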

@henri123lemoine
Collaborator

fixed #171
