S2ORC bulk dataset finding only ~210k papers #220

Open
marius10p opened this issue Dec 4, 2024 · 0 comments
Labels
question Further information is requested


@marius10p

Thanks so much for maintaining this resource.

I received an API key (thank you) and followed the instructions to download the S2ORC dataset here and here. The download found 30 files on AWS, each containing almost 7,000 rows, where each row is a paper with text. I believe the S2ORC dataset should include over 8M papers with full text, not 30 * 7,000 = 210k. What am I missing?

I would love to ask this in the Slack community, but I cannot create an account because I do not belong to any of the 5 organizations listed.
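For reference, the row count quoted above could be checked with something like the sketch below. It assumes the downloaded files are gzip-compressed JSON Lines with one paper per line, which is an assumption about the file layout rather than something confirmed in this thread. The directory name `s2orc_download` and the helper names `count_rows` and `count_dataset` are hypothetical.

```python
import gzip
from pathlib import Path


def count_rows(path):
    """Count non-empty lines in one gzip file (assumed: one paper per line)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def count_dataset(directory, pattern="*.gz"):
    """Return (number of files, total rows) over all matching files in directory."""
    files = sorted(Path(directory).glob(pattern))
    return len(files), sum(count_rows(f) for f in files)


if __name__ == "__main__":
    # "s2orc_download" is a hypothetical local directory holding the downloaded files.
    n_files, n_rows = count_dataset("s2orc_download")
    print(f"{n_files} files, {n_rows} rows")
```

If this reports roughly 30 files and about 210k rows in total, the local download matches the numbers described above, and the gap to the expected 8M papers would have to come from the file listing itself rather than from how the rows were counted.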

@marius10p marius10p added the question Further information is requested label Dec 4, 2024