Add the raw data statistics for all published studies #188

Closed
wants to merge 3 commits into from


sbesson
Member

@sbesson sbesson commented Nov 22, 2023

As discussed last Monday at the IDR weekly meeting, with the ongoing migration of the public downloadable data to https://ftp.ebi.ac.uk/pub/databases/IDR/, it is useful for end users to know the amount of data for each study.

This PR uses the statistics reported by the transfer command to capture the number of files and the total size in bytes for each top-level study.
The last column normalizes the size to TB, but this can arguably be recomputed from the third column, so I am leaving it to the reviewers to decide whether it is useful.
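
To illustrate why the last column is derivable from the byte count, here is a minimal sketch of the conversion. The function names are hypothetical, and the PR does not state whether the column uses decimal TB (10^12 bytes) or binary TiB (2^40 bytes), so both are shown.

```python
def bytes_to_tb(size_bytes: int) -> float:
    """Convert bytes to decimal terabytes (1 TB = 10**12 bytes)."""
    return size_bytes / 10**12


def bytes_to_tib(size_bytes: int) -> float:
    """Convert bytes to binary tebibytes (1 TiB = 2**40 bytes)."""
    return size_bytes / 2**40


# Illustrative byte count only; not a value taken from the PR's table.
size = 5 * 10**12
print(f"{bytes_to_tb(size):.2f} TB")
print(f"{bytes_to_tib(size):.2f} TiB")
```

The two units differ by roughly 10% at this scale, so whichever is used should probably be named in the column header.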

Includes the number of files as well as the total size in bytes
@sbesson
Member Author

sbesson commented Nov 22, 2023

Note that the current state of this PR omits the data for HPA, as I am still transferring the last few folders; I will update it as soon as I have the final figures for the current data.
Excluding HPA, the volume of data available for download is ~40M files totalling ~260TB.

@joshmoore
Member

This looks straightforward but I imagine quite useful. (The idea that you just append the first column to https://ftp.ebi.ac.uk/pub/databases/IDR/ is lovely 🎉) Probably the only question is whether or not keeping this up-to-date is too onerous.
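
A minimal sketch of the "append the first column" idea, assuming the first column holds the study directory name exactly as it appears under the base URL. The helper name and the example study value are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://ftp.ebi.ac.uk/pub/databases/IDR/"


def study_url(study: str) -> str:
    """Append a study name (first column of the table) to the base URL."""
    # The trailing slash on BASE_URL makes urljoin append rather than replace
    # the last path segment.
    return urljoin(BASE_URL, study.strip("/") + "/")


# Hypothetical first-column value, assumed to match the directory name.
print(study_url("idr0001-graml-sysgro"))
```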

@will-moore
Member

This looks similar to the file we're using for stats on the IDR home page:
https://raw.githubusercontent.com/IDR/idr.openmicroscopy.org/master/_data/studies.tsv
I haven't compared the numbers, but we'd expect them to be the same, right? Do we need both?
How will this rawdata.tsv be "viewable" on the website?

@sbesson
Member Author

sbesson commented Nov 23, 2023

> Probably the only question is whether or not keeping this up-to-date is too onerous.

With the current transfer script, the reported numbers are already written to the log, e.g.

[idr-virtual@codon-slurm-login-02 ~]$ tail -n 10 completed/idr0001-graml-sysgro_out.37867557 
[2023-11-09T23:02:06] Seconds: 93172.864
[2023-11-09T23:02:06] Items: 411555
[2023-11-09T23:02:06]   Directories: 0
[2023-11-09T23:02:06]   Files: 411555
[2023-11-09T23:02:06]   Links: 0
[2023-11-09T23:02:06] Data: 34.250 TiB (37658654203804 bytes)
[2023-11-09T23:02:06] Rate: 385.457 MiB/s (37658654203804 bytes in 93172.864 seconds)
[2023-11-09T23:02:06] Updating timestamps on newly copied files
[2023-11-09T23:03:30] Completed updating timestamps
[2023-11-09T23:03:30] Completed sync

So I expect the cost of maintaining this file will be very low (but I definitely need to document the above).
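
As a starting point for that documentation, here is a minimal sketch of how the two figures could be pulled out of such a log. The regular expressions are modelled only on the excerpt above, the helper name is hypothetical, and the column order of the output row is an assumption rather than the layout used in this PR.

```python
import re

# Patterns modelled on the log excerpt above; the transfer tool's output
# format is assumed to be stable, which is not guaranteed.
FILES_RE = re.compile(r"\]\s+Files:\s+(\d+)")
DATA_RE = re.compile(r"\]\s+Data:.*\((\d+) bytes\)")


def parse_transfer_log(text: str) -> tuple[int, int]:
    """Return (number of files, total size in bytes) from a transfer log."""
    files = FILES_RE.search(text)
    data = DATA_RE.search(text)
    if files is None or data is None:
        raise ValueError("log does not contain Files/Data summary lines")
    return int(files.group(1)), int(data.group(1))


log = """\
[2023-11-09T23:02:06] Items: 411555
[2023-11-09T23:02:06]   Directories: 0
[2023-11-09T23:02:06]   Files: 411555
[2023-11-09T23:02:06]   Links: 0
[2023-11-09T23:02:06] Data: 34.250 TiB (37658654203804 bytes)
"""
n_files, n_bytes = parse_transfer_log(log)
# Tab-separated row: study, files, bytes (column order is an assumption).
print("\t".join(["idr0001-graml-sysgro", str(n_files), str(n_bytes)]))
```

Raising on a missing summary line is deliberate here: an incomplete transfer log should not silently produce a row with zero files.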

> I haven't compared the numbers, but we'd expect them to be the same, right? Do we need both?

This file captures filesystem metrics for the data we receive from the submitter and make available for direct download. This includes all image files, analysis files, etc., only a subset of which is imported/registered in IDR.
On the other hand, studies.tsv only reports the imaging data that is imported into OMERO, and it is broken down by container (screen/project) rather than being study-wide.

> How will this rawdata.tsv be "viewable" on the website?

At the moment it's not and that's something that should be discussed as we rework the download instructions.

@pwalczysko
Contributor

> > How will this rawdata.tsv be "viewable" on the website?
>
> At the moment it's not and that's something that should be discussed as we rework the download instructions.

I like the table and the idea. Let's remember the main purpose of this: give the downloaders (== NOT the OME Team) an overview of the approximate sizes of what they are downloading.
Give the information in such a way that it is available at the place where the download happens, at the time of the download.
I would claim that even having a link on the https://ftp.ebi.ac.uk/pub/databases/IDR/ site where this table would be downloadable as, say, pdf, is far far superior to having nothing. Even an outdated table would do.

Contributor

@pwalczysko pwalczysko left a comment


Comment on the possible linkage to the to-be-downloaded data. This can be done in a separate PR or by other means, depending on the team-wide discussion result.

@sbesson
Member Author

sbesson commented Nov 24, 2023

> Give the information in such a way that it is available at the place where the download happens, at the time of the download.

Interesting, a top-level file under https://ftp.ebi.ac.uk/pub/databases/IDR/ would be an easy way to colocate this metadata with the data to be downloaded as mentioned in #188 (comment). Also thinking of the process, assuming we make the right decisions, managing this information directly under this hierarchy is even easier as the metadata can be updated directly once the data is copied.

> I would claim that even having a link on the https://ftp.ebi.ac.uk/pub/databases/IDR/ site where this table would be downloadable as, say, pdf, is far far superior to having nothing. Even an outdated table would do.

That probably gets us down to agreeing the minimal requirements for a first version:

  1. format: currently set as TSV, mostly to match the existing tabular files in this repository. Can easily be CSV or even JSON
  2. data: study name, number of files and total size (bytes) are the bare minimum columns. Everything else is up for discussion (or future amendments)
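
To make the format question concrete, the sketch below serializes the same minimal record as TSV, CSV and JSON. The column names are hypothetical, and the single row reuses the idr0001 figures from the log excerpt earlier in this thread, with the study name assumed from the log file name.

```python
import csv
import io
import json

# Hypothetical column names; only the three minimal fields are included.
FIELDS = ["study", "files", "bytes"]
rows = [
    {"study": "idr0001-graml-sysgro", "files": 411555, "bytes": 37658654203804},
]


def to_delimited(records, delimiter):
    """Serialize records as delimited text (tab for TSV, comma for CSV)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter=delimiter,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


tsv_text = to_delimited(rows, "\t")
csv_text = to_delimited(rows, ",")
json_text = json.dumps(rows, indent=2)
print(tsv_text, csv_text, json_text, sep="\n")
```

Since the three outputs carry identical information, the choice mostly comes down to what the consumers of the file find easiest to read or parse.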

@joshmoore
Member

> Interesting, a top-level file under https://ftp.ebi.ac.uk/pub/databases/IDR/ would be an easy way to colocate this metadata with the data to be downloaded as mentioned in #188 (comment).

Agreed, though would it be easier/as-effective to just have a README per directory then?

@pwalczysko
Contributor

> Agreed, though would it be easier/as-effective to just have a README per directory then?

I am happy with that @sbesson

@sbesson
Member Author

sbesson commented Nov 27, 2023

Closing in favour of files directly hosted on the public storage infrastructure:

@sbesson sbesson closed this Nov 27, 2023