Add the raw data statistics for all published studies #188

sbesson · 2023-11-22T10:35:31Z

As discussed last Monday at the IDR weekly meeting, with the ongoing migration of the public downloadable data to https://ftp.ebi.ac.uk/pub/databases/IDR/, it is useful for end-users to know the amount of data for each study.

This PR uses the statistics reported by the transfer command to capture the number of files and the total size in bytes for each top-level study.
The last column attempts to normalize the size in TB. Arguably this can be recomputed from the third column, so I am leaving it up to the reviewers to decide whether it is useful.

Includes the number of files as well as the total size in bytes
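
For reference, the TB column can be derived from the byte count alone. The snippet below is a minimal sketch, assuming decimal terabytes (10^12 bytes); the function names are illustrative and not part of this PR. The sample value is the idr0001 byte count that appears in the transfer log quoted later in this thread, which reports the same figure in binary units (TiB).

```python
def bytes_to_tb(n_bytes: int) -> float:
    """Convert a byte count to decimal terabytes (1 TB = 10**12 bytes)."""
    return n_bytes / 10**12


def bytes_to_tib(n_bytes: int) -> float:
    """Convert a byte count to binary tebibytes (1 TiB = 2**40 bytes)."""
    return n_bytes / 2**40


# Byte count for idr0001 as reported by the transfer log in this thread
size = 37658654203804
print(f"{bytes_to_tb(size):.2f} TB")    # 37.66 TB
print(f"{bytes_to_tib(size):.3f} TiB")  # 34.250 TiB
```

Since the two units differ by roughly 10%, the column header should probably state which one is used.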

sbesson · 2023-11-22T10:36:51Z

Note the current state of this PR omits the data for HPA as I am still transferring the last few folders but I will update it as soon as I have the final figures for the current data.
Excluding HPA, the volume of data available for download is ~40M files totalling ~260 TB.

joshmoore · 2023-11-23T09:32:23Z

This looks straightforward but I imagine quite useful. (The idea that you just append the first column to https://ftp.ebi.ac.uk/pub/databases/IDR/ is lovely 🎉) Probably the only question is whether or not keeping this up-to-date is too onerous.
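
A minimal sketch of that idea, assuming the first column holds the study directory name (e.g. `idr0001-graml-sysgro`) and that directory URLs end with a trailing slash; `study_url` is a hypothetical helper, not something in this PR:

```python
BASE_URL = "https://ftp.ebi.ac.uk/pub/databases/IDR/"


def study_url(study: str) -> str:
    """Build the download URL for a study by appending its name to the base URL."""
    # Strip stray slashes so the result has exactly one separator and one trailing slash
    return BASE_URL + study.strip("/") + "/"


print(study_url("idr0001-graml-sysgro"))
```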

will-moore · 2023-11-23T09:46:19Z

This looks similar to the file we're using for stats on the IDR home page:
https://raw.githubusercontent.com/IDR/idr.openmicroscopy.org/master/_data/studies.tsv
I haven't compared the numbers, but we'd expect them to be the same, right? Do we need both?
How will this rawdata.tsv be "viewable" on the website?

sbesson · 2023-11-23T10:03:25Z

> Probably the only question is whether or not keeping this up-to-date is too onerous.

With the current transfer script, the reported numbers are written to the log at the end of each transfer, e.g.

[idr-virtual@codon-slurm-login-02 ~]$ tail -n 10 completed/idr0001-graml-sysgro_out.37867557 
[2023-11-09T23:02:06] Seconds: 93172.864
[2023-11-09T23:02:06] Items: 411555
[2023-11-09T23:02:06]   Directories: 0
[2023-11-09T23:02:06]   Files: 411555
[2023-11-09T23:02:06]   Links: 0
[2023-11-09T23:02:06] Data: 34.250 TiB (37658654203804 bytes)
[2023-11-09T23:02:06] Rate: 385.457 MiB/s (37658654203804 bytes in 93172.864 seconds)
[2023-11-09T23:02:06] Updating timestamps on newly copied files
[2023-11-09T23:03:30] Completed updating timestamps
[2023-11-09T23:03:30] Completed sync

So I expect the cost of maintaining this file will be very low (but I definitely need to document the above).

> I haven't compared the numbers, but we'd expect them to be the same, right? Do we need both?

This file captures filesystem metrics for the data we receive from the submitter and make available for direct download. This includes all image files, analysis files etc., only a subset of which is imported/registered in IDR.
On the other hand, studies.tsv only reports the imaging data that is imported into OMERO, and it is broken down by container (screen/project) rather than being study-wide.

> How will this rawdata.tsv be "viewable" on the website?

At the moment it's not and that's something that should be discussed as we rework the download instructions.

pwalczysko · 2023-11-24T12:13:00Z

> > How will this rawdata.tsv be "viewable" on the website?
>
> At the moment it's not and that's something that should be discussed as we rework the download instructions.

I like the table and the idea. Let's remember the main purpose of this: give the downloaders (== NOT the OME Team) an overview of the approximate size of what they are downloading.
Give the information in such a way that it is available at the place where the download happens, at the time of the download.
I would claim that even having a link on the https://ftp.ebi.ac.uk/pub/databases/IDR/ site where this table would be downloadable as, say, a PDF, is far, far superior to having nothing. Even an outdated table would do.

pwalczysko

Comment on the possible linkage to the to-be-downloaded data. This can be done in a separate PR or by other means, depending on the team-wide discussion result.

sbesson · 2023-11-24T12:25:40Z

> Give the information in such a way that it is available at the place where the download happens, at the time of the download.

Interesting, a top-level file under https://ftp.ebi.ac.uk/pub/databases/IDR/ would be an easy way to colocate this metadata with the data to be downloaded as mentioned in #188 (comment). Also thinking of the process, assuming we make the right decisions, managing this information directly under this hierarchy is even easier as the metadata can be updated directly once the data is copied.

> I would claim that even having a link on the https://ftp.ebi.ac.uk/pub/databases/IDR/ site where this table would be downloadable as, say, pdf, is far far superior to having nothing. Even an outdated table would do.

That probably gets us down to agreeing on the minimal requirements for a first version:

format: currently set as TSV, mostly to match the existing tabular files in this repository. It can easily be CSV or even JSON.
data: study name, number of files and total size (bytes) are the bare minimum columns. Everything else is up for discussion (or future amendments).
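
To illustrate how cheap the format switch would be, here is a minimal sketch using only the Python standard library. The column names `study`, `files` and `bytes` are assumptions for illustration (the actual header in rawdata.tsv may differ), and the single sample row reuses the idr0001 figures from the transfer log quoted earlier in this thread.

```python
import csv
import io
import json

# Hypothetical three-column TSV with the bare minimum columns discussed above
TSV = "study\tfiles\tbytes\nidr0001-graml-sysgro\t411555\t37658654203804\n"
FIELDS = ["study", "files", "bytes"]


def read_rows(tsv_text: str) -> list[dict]:
    """Parse the TSV text into a list of rows with integer counts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {"study": r["study"], "files": int(r["files"]), "bytes": int(r["bytes"])}
        for r in reader
    ]


def to_csv(rows: list[dict]) -> str:
    """Serialize the rows as comma-separated text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def to_json(rows: list[dict]) -> str:
    """Serialize the rows as a JSON array of objects."""
    return json.dumps(rows, indent=2)


rows = read_rows(TSV)
print(to_csv(rows))
print(to_json(rows))
```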

joshmoore · 2023-11-24T15:51:19Z

> Interesting, a top-level file under https://ftp.ebi.ac.uk/pub/databases/IDR/ would be an easy way to colocate this metadata with the data to be downloaded as mentioned in #188 (comment).

Agreed, though would it be easier/as-effective to just have a README per directory then?

pwalczysko · 2023-11-27T11:07:17Z

> Agreed, though would it be easier/as-effective to just have a README per directory then?

I am happy with that @sbesson

sbesson · 2023-11-27T21:06:46Z

Closing in favour of files directly hosted on the public storage infrastructure:

a top-level CSV file - see https://ftp.ebi.ac.uk/pub/databases/IDR/studies.csv
individual readme files under each study folder TBD

sbesson added 2 commits November 22, 2023 10:19

Add spreadsheet with the raw data statistics for each published study

cb98245

Includes the number of files as well as the total size in bytes

Use tabs

6a0e889

sbesson requested review from pwalczysko and francesw November 22, 2023 10:35

Add current size of idr0043

5e6462a

pwalczysko requested changes Nov 24, 2023

sbesson closed this Nov 27, 2023
