Where can I find the the-stack-v2-train-extras and LHQ datasets? #7

Open
jeykigung opened this issue Feb 29, 2024 · 10 comments

Comments

@jeykigung

Thanks for your wonderful work! In https://huggingface.co/datasets/bigcode/the-stack-v2-dedup, I can only find the the-stack-v2-train-smol and the-stack-v2-train-full data. Where can I find the the-stack-v2-train-extras and LHQ datasets? Do you plan to release them?

@loubnabnl
Contributor

Hi, all the extras will be available in a few weeks, along with the Stack v2's content.

@ShaneTian

> Hi, all the extras will be available in a few weeks along with the stack v2's content

Hi, any updates for the-stack-v2-train-extras?

@ShaneTian

>> Hi, all the extras will be available in a few weeks along with the stack v2's content
>
> Hi, any updates for the-stack-v2-train-extras?

@loubnabnl Any updates?

@Casi11as

Hi, any updates?

@ShaneTian

>>> Hi, all the extras will be available in a few weeks along with the stack v2's content
>>
>> Hi, any updates for the-stack-v2-train-extras?
>
> @loubnabnl Any updates?

@loubnabnl @bigximik @anton-l @iNeil77 @lvwerra Hi, any updates?

@noforit

noforit commented Jun 28, 2024

@loubnabnl Hi, any updates?

@fghccv

fghccv commented Jun 28, 2024

@loubnabnl Hi, any updates?

@twelveand0

> Hi, all the extras will be available in a few weeks along with the stack v2's content

Hello, is this still on your release schedule?

@takiholadi

Hi! Any update?

@yucc-leon

yucc-leon commented Nov 7, 2024

Still no updates? BigCode does not seem to be working well these days...
I checked the datasets used to train StarCoder2. Most of them, such as Arxiv, LHQ, and Wiki, were already released before this project. Interestingly, they have been uploaded but are not publicly available. What is really missing is the processed StackOverflow dataset.

Edit: the above may be incorrect. I guess this is the dataset they used: https://huggingface.co/datasets/bigcode/stack-exchange-preferences-20230914-clean-anonymization

9 participants