
About The Retrieval Source #12

Open
yunfan42 opened this issue Oct 26, 2024 · 1 comment

@yunfan42

First of all, I would like to express my sincere gratitude to all the authors for their outstanding work and for open-sourcing such an excellent dataset. 👍 👍 👍

I am currently attempting to run some RAG (Retrieval-Augmented Generation) experiments on this dataset, and I have a question about the retrieval source that I hope the authors can help clarify.

In RAG, it is necessary to first chunk and index the Wikipedia pages that may be used for retrieval.

Section 3.1 of the paper mentions that the QA pairs collectively involve a total of 4,121 Wikipedia articles. Is this the complete retrieval source?

Or should I use the author-provided wikicache.tar.gz file (~9.43 GB)? (Embedding all of it would, of course, consume a massive number of embedding tokens and take a significant amount of time.)

My understanding is that this cached Wikipedia was pre-filtered by running BM25 over all Wikipedia page titles, using each question as the query. I am not sure whether this is correct.

Additionally, I would like to ask where I can directly download the 4,121 Wikipedia articles that are actually used.

@zhudotexe (Owner)

The Open Book setting is conducted over all of English Wikipedia as the knowledge base. If you have the ability to self-host Wikipedia, the XML dump is provided at https://datasets.mechanus.zhu.codes/fanoutqa/enwiki-20231120-pages-articles-multistream.xml.bz2, though this will take a couple hundred GB of disk space!
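
If you do self-host, one lightweight approach is to stream pages straight out of the dump rather than loading it into a database. Here is a rough sketch using the mwxml package - mwxml is not part of this repo, and reading the bz2 stream directly is an assumption, so treat it as a starting point:

```python
import bz2

import mwxml  # pip install mwxml

# Stream pages out of the XML dump without decompressing it to disk first.
# Assumption: mwxml.Dump.from_file accepts any binary file-like object.
dump = mwxml.Dump.from_file(
    bz2.open("enwiki-20231120-pages-articles-multistream.xml.bz2", "rb")
)

for page in dump:
    if page.namespace != 0:  # keep only articles; skip talk/user/etc. pages
        continue
    for revision in page:  # pages-articles dumps carry one revision per page
        print(page.title, len(revision.text or ""))
    break  # remove this to walk the whole dump (it is very large!)
```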

We recommend using the pip library provided by this repo to download Wikipedia articles as needed. The library ensures that the downloaded text is the revision as of November 2023. The wikicache.tar.gz file can be used to prepopulate the cache for this library, but it is optional - it is simply all of the files downloaded onto my machine while we were running experiments, not filtered in any way.
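
That workflow looks roughly like the following (check the README for the exact helper names - load_dev, wiki_search, and wiki_content here are my shorthand for the documented entry points):

```python
import fanoutqa  # pip install fanoutqa

# Load the dev questions; each one carries its sub-questions and gold answers.
questions = fanoutqa.load_dev()

# Search Wikipedia for candidate articles, then fetch the text of one of them
# (pinned to the November 2023 revision). If you extracted wikicache.tar.gz
# into the library's cache directory beforehand, this reads from disk instead
# of hitting the network.
results = fanoutqa.wiki_search("Nextdoor")
article_text = fanoutqa.wiki_content(results[0])
print(article_text[:500])
```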

We don't recommend embedding all of Wikipedia - that would be prohibitively expensive for all but large organizations! Instead, your model should use a search tool to find relevant article titles, then retrieve from the text of individual articles returned by the search tool (i.e., index individual pages instead of the entire knowledge base).
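
Concretely, that pattern might look roughly like the sketch below, with BM25 over chunks (via the rank-bm25 package) standing in as an illustration for whatever embedding index you prefer; the fanoutqa helper names carry the same caveat as above:

```python
import fanoutqa
from rank_bm25 import BM25Okapi  # pip install rank-bm25


def retrieve(question: str, max_pages: int = 3, top_chunks: int = 5) -> list[str]:
    """Search for candidate articles, then index and rank only those pages."""
    # 1. Use search to narrow the knowledge base to a handful of articles.
    candidates = fanoutqa.wiki_search(question)[:max_pages]

    # 2. Chunk just those pages (naive fixed-size chunks, for illustration).
    corpus = []
    for page in candidates:
        text = fanoutqa.wiki_content(page)
        corpus += [text[i : i + 1000] for i in range(0, len(text), 1000)]

    # 3. Rank chunks against the question; swap BM25 for embeddings if desired.
    bm25 = BM25Okapi([chunk.lower().split() for chunk in corpus])
    return bm25.get_top_n(question.lower().split(), corpus, n=top_chunks)


for chunk in retrieve("Who founded the company that owns Nextdoor?"):
    print(chunk[:80], "...")
```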
