Broken Link to PubMed Abstracts dataset #623

Open
yacinebouaouni opened this issue Sep 26, 2023 · 2 comments

@yacinebouaouni

The link given in Section 5, "Big data? 🤗 Datasets to the rescue!", is broken:
data_files = "https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst"

@qualis2006

Here is a Hugging Face repository I created for the PubMed abstracts dataset, which you may want to use instead:

from datasets import load_dataset
pubmed_dataset = load_dataset("qualis2006/PUBMED_title_abstracts_2020_baseline")
pubmed_dataset

Downloading data: 100%
7.98G/7.98G [11:47<00:00, 9.68MB/s]
Generating train split: 17722096/0 [00:36<00:00, 505376.37 examples/s]

DatasetDict({
    train: Dataset({
        features: ['meta', 'text'],
        num_rows: 17722096
    })
})

@mik-tf

mik-tf commented Feb 26, 2024

@qualis2006 Nice, thanks! Your code works on my end, but it returns a DatasetDict, so I need to use pubmed_dataset['train'] instead of pubmed_dataset throughout the rest of the page.

To run the code on the page unchanged, we can point data_files at the full URL of the file in that repository:

data_files="https://huggingface.co/datasets/qualis2006/PUBMED_title_abstracts_2020_baseline/resolve/main/PUBMED_title_abstracts_2020_baseline.jsonl.zst"

@yacinebouaouni this line should work.
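
For anyone landing here later, the workaround above could be sketched roughly as follows. This is a minimal sketch, not the course's official fix: the helper names `resolve_url` and `load_pubmed` are hypothetical, the `streaming=True` default is my own choice to avoid fetching the full ~8 GB file up front, and reading `.zst` files assumes the `zstandard` package is installed.

```python
# Hypothetical helpers illustrating the workaround from this thread.
REPO_ID = "qualis2006/PUBMED_title_abstracts_2020_baseline"
FILENAME = "PUBMED_title_abstracts_2020_baseline.jsonl.zst"


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the direct-download URL of a file in a Hub dataset repository."""
    return f"https://huggingface.co/datasets/{repo_id}/resolve/{revision}/{filename}"


def load_pubmed(streaming: bool = True):
    """Load the mirrored PubMed abstracts as a single 'train' split.

    Passing split="train" returns the split itself rather than a DatasetDict,
    so later code can use the result directly without indexing ['train'].
    """
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "json",
        data_files=resolve_url(REPO_ID, FILENAME),
        split="train",
        streaming=streaming,  # assumption: iterate lazily instead of downloading everything
    )
```

Calling `load_pubmed(streaming=False)` should behave like the one-liner above, downloading and caching the whole file before returning.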
