S2ORC bulk dataset finding only ~210k papers #220

marius10p · 2024-12-04T03:32:11Z

Thanks so much for maintaining this resource.

I received an API key (thank you) and followed the instructions to download the S2ORC dataset here and here. This found 30 files on AWS, each containing almost 7,000 rows, where each row is a paper with text. I believe the S2ORC dataset should include over 8M papers with full text, not 30 * 7,000 = 210k. What am I missing?
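For reference, the row count above could be double-checked with something like the sketch below. It assumes the downloaded shards are gzip-compressed JSON Lines files (one paper per line) sitting in a single local directory; the function names and the `*.gz` pattern are hypothetical, chosen only for illustration.

```python
import gzip
from pathlib import Path


def count_rows(shard_path):
    """Count non-empty lines in one gzip-compressed JSON Lines shard.

    Assumption: each non-empty line is one paper record.
    """
    with gzip.open(shard_path, "rt", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def count_all(shard_dir, pattern="*.gz"):
    """Return {file name: row count} for every shard matching `pattern`."""
    shards = sorted(Path(shard_dir).glob(pattern))
    return {shard.name: count_rows(shard) for shard in shards}


def summarize(counts):
    """Return (number of shards, total rows) for a count mapping."""
    return len(counts), sum(counts.values())
```

Calling `summarize(count_all("path/to/shards"))` (the path is a placeholder) would give the number of files and the total number of rows, which can then be compared with the expected dataset size.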

I would love to ask this in the Slack community, but I cannot create an account because I do not belong to any of the 5 organizations listed.

marius10p added the "question" label (Further information is requested) Dec 4, 2024

marius10p assigned cfiorelli Dec 4, 2024
