
How to convert WikiTables dataset to your JSON format? #4

Open
eloukas opened this issue Nov 5, 2021 · 11 comments


eloukas commented Nov 5, 2021

I've downloaded the dataset but it needs some pre-processing to get it to your format, as in the sample you provide in the repo.
Do you have the scripts for this process?


eloukas commented Nov 10, 2021

Hi @HaoAreYuDong, I saw your reply to the WDC dataset request.
Could you also maybe share here the complete pre-processed WikiTables dataset or your sample script for the preprocessing?

I need it to reproduce your work.

Many thanks!


nickmagginas commented Nov 17, 2021

Hello @HaoAreYuDong. I am also interested in replicating parts of your work. I would greatly appreciate it if you could share the preprocessing script for WikiTables, as it seems essential for reproducing the impressive results you reported.
Thank you very much in advance, and thank you for your great work and for open-sourcing it. :)


HaoAreYuDong commented Nov 18, 2021

Data can be downloaded from: https://drive.google.com/file/d/1XyZAtH9F8UoLsXHsBWriWkh9pZ92e3sX/view?usp=sharing (for wikidata.pt) and https://drive.google.com/file/d/19GYZyJNlOMk8xB_nhn1lfwfEWXWj4Iww (for wikidata.json).
Script will be uploaded soon.
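If you just want a quick look at the two files before the script is up, something along these lines should work, assuming wikidata.pt is a regular torch.save() dump; wikidata.json is large, so it only peeks at the first part of it:

```python
# Quick inspection sketch -- assumes wikidata.pt is a plain torch.save()
# object and wikidata.json is UTF-8 text; adjust if your copies differ.
import torch

obj = torch.load("wikidata.pt", map_location="cpu")  # on PyTorch >= 2.6 you may need weights_only=False
print(type(obj))
if hasattr(obj, "__len__"):
    print("number of items:", len(obj))

# wikidata.json is large, so peek at the first couple of kilobytes
# instead of loading the whole file into memory.
with open("wikidata.json", "r", encoding="utf-8") as f:
    print(f.read(2000))
```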

@HaoAreYuDong (Contributor)

> Hello @HaoAreYuDong. I am also interested in replicating parts of your work. I would greatly appreciate it if you could share the preprocessing script for WikiTables, as it seems essential for reproducing the impressive results you reported. Thank you very much in advance, and thank you for your great work and for open-sourcing it. :)

Please find an example preprocessing script at https://drive.google.com/drive/folders/1fEiBs9d6GwV1zq8Kk4bMbWMEUeYrLkeL?usp=sharing.

@HaoAreYuDong (Contributor)

> Hi @HaoAreYuDong, I saw your reply to the WDC dataset request. Could you also maybe share here the complete pre-processed WikiTables dataset or your sample script for the preprocessing?
>
> I need it to reproduce your work.
>
> Many thanks!

Please find an example script at https://drive.google.com/drive/folders/1fEiBs9d6GwV1zq8Kk4bMbWMEUeYrLkeL?usp=sharing.


codingforpleasure commented Aug 3, 2023

Hi @HaoAreYuDong,
Thank you for sharing your work; I find the article very interesting.

  1. I noticed that in tuta/data/pretrain/wiki-table-samples.json each table has a unique id. Is there a way to see the original table on the web?

  2. The file split_wiki.py mentions that updated_structured_html_tables can be downloaded from:

http://l3s.de/~fetahu/wiki_tables/data/table_data/html_data/structured_html_table_data.json.gz

But that link is down (page not found). Could you please post a working link?
Thank you in advance.

@HaoAreYuDong (Contributor)

> Hi @HaoAreYuDong, thank you for sharing your work; I find the article very interesting.
>
> 1. I noticed that in tuta/data/pretrain/wiki-table-samples.json each table has a unique id. Is there a way to see the original table on the web?
> 2. The file split_wiki.py mentions that updated_structured_html_tables can be downloaded from http://l3s.de/~fetahu/wiki_tables/data/table_data/html_data/structured_html_table_data.json.gz, but that link is down (page not found). Could you please post a working link? Thank you in advance.

I did not save the original file, but I still have a processed file that largely preserves the information. It is quite big, though. Is there a way to share it with you, e.g., a shared folder?

@codingforpleasure

@HaoAreYuDong thank you for the quick response. I'd be glad if you could upload the processed file to this Google Drive.

I made sure it has enough free space (about 55 GB).
Thank you!

@HaoAreYuDong (Contributor)

> @HaoAreYuDong thank you for the quick response. I'd be glad if you could upload the processed file to this Google Drive.
>
> I made sure it has enough free space (about 55 GB). Thank you!

Great. Thanks.

@HaoAreYuDong (Contributor)

> @HaoAreYuDong thank you for the quick response. I'd be glad if you could upload the processed file to this Google Drive.
>
> I made sure it has enough free space (about 55 GB). Thank you!

It's here: https://drive.google.com/file/d/19GYZyJNlOMk8xB_nhn1lfwfEWXWj4Iww/view?usp=drive_link
Hope it can help you.


codingforpleasure commented Aug 8, 2023

@HaoAreYuDong I'd like to explore the model and run inference on a few of my own tables. The data you shared at
https://drive.google.com/file/d/19GYZyJNlOMk8xB_nhn1lfwfEWXWj4Iww/view?usp=drive_link
already holds the RI, CI, Cd (row index, column index, children) fields, which were generated with the process_wiki.pt file you posted here.

However, that script refers to a few JSON files consisting of four fields: id, caption, header, rows. I can't find those fields in the shared data. Could you please share at least one JSON file that demonstrates the file structure before the script is run on it? A hedged sketch of what I assume that input looks like is below.

I have also checked the raw data in the download section of TabEL ("A dataset of 1.6M Wikipedia Tables in JSON format", http://websail-fe.cs.northwestern.edu/TabEL/), and neither of those files holds the fields in question.
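For reference, this is roughly what I assumed the pre-script input should look like and how I tried to build it from the TabEL dump; the TabEL field names (_id, tableCaption, tableHeaders, tableData) and the one-record-per-line layout are only my guesses, so please correct me if the real format differs:

```python
# Hedged sketch (my guess, not the authors' actual preprocessing):
# build {id, caption, header, rows} records from TabEL-style table JSON.
# The TabEL field names used below are assumptions and may need adjusting.
import json

def convert_table(tabel_record):
    headers = tabel_record.get("tableHeaders") or [[]]
    return {
        "id": tabel_record.get("_id"),
        "caption": tabel_record.get("tableCaption", ""),
        "header": [cell.get("text", "") for cell in headers[0]],
        "rows": [
            [cell.get("text", "") for cell in row]
            for row in tabel_record.get("tableData", [])
        ],
    }

# Assuming the raw dump is one JSON object per line; adjust if it is
# a single JSON array instead.
with open("tables.json", "r", encoding="utf-8") as fin, \
        open("tables_4field.json", "w", encoding="utf-8") as fout:
    for line in fin:
        if not line.strip():
            continue
        fout.write(json.dumps(convert_table(json.loads(line)), ensure_ascii=False) + "\n")
```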

Hope it's not a hassle for you.
Thank you, I appreciate it!
