Add the raw data statistics for all published studies #188
Includes the number of files as well as the total size in bytes.
Note that the current state of this PR omits the data for HPA, as I am still transferring the last few folders. I will update it as soon as I have the final figures for the current data.
This looks straightforward, but I imagine it will be quite useful. (The idea that you just append the first column to https://ftp.ebi.ac.uk/pub/databases/IDR/ is lovely 🎉) Probably the only question is whether keeping this up to date is too onerous.
This looks similar to the file we're using for stats on the IDR home page:
With the current transfer script, the reported numbers are actually generated in the log e.g.
So I expect the cost of maintaining this file will be very low (but I definitely need to document the above).
This file captures filesystem metrics for the data we receive from the submitter and make available for direct download. This includes all image files, analysis files, etc., only a subset of which is imported/registered in IDR.
At the moment it's not, and that's something that should be discussed as we rework the download instructions.
I like the table and the idea. Let's remember the main purpose of this: to give the downloaders (== NOT the OME Team) an overview of the approximate size of what they are downloading.
Comment on the possible linkage to the to-be-downloaded data. This can be done in a separate PR or by other means, depending on the team-wide discussion result.
Interesting, a top-level file under https://ftp.ebi.ac.uk/pub/databases/IDR/ would be an easy way to colocate this metadata with the data to be downloaded as mentioned in #188 (comment). Also thinking of the process, assuming we make the right decisions, managing this information directly under this hierarchy is even easier as the metadata can be updated directly once the data is copied.
That probably gets us down to agreeing the minimal requirements for a first version:
Agreed, though would it be easier/as-effective to just have a README per directory then?
I am happy with that @sbesson |
Closing in favour of files directly hosted on the public storage infrastructure:
As discussed last Monday at the IDR weekly meeting, with the ongoing migration of the public downloadable data to https://ftp.ebi.ac.uk/pub/databases/IDR/, it is useful for end-users to know the amount of data for each study.
This PR uses the statistics reported by the transfer command to capture the number of files and the total size in bytes for each top-level study.
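For anyone wanting to cross-check these two figures on a local copy of a study, here is a minimal sketch. It is not the transfer command used in this PR: it simply walks a directory tree, and the function name `tree_stats` is hypothetical. It assumes symbolic links should be skipped rather than followed.

```python
import os


def tree_stats(root):
    """Return (number of files, total size in bytes) under root.

    Hypothetical helper for illustration only; symbolic links are
    skipped so that their targets are not counted.
    """
    n_files = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            n_files += 1
            total_bytes += os.path.getsize(path)
    return n_files, total_bytes
```

Calling `tree_stats("/path/to/study")` returns a pair that can be compared against the second and third columns, though the numbers may differ slightly from the transfer log depending on how links and special files are handled.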
The last column attempts to normalize the size in TB, but arguably this can be recomputed from the third column, so I am leaving it up to the reviewers to decide whether this is useful.
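To illustrate how the TB value can be recomputed from the size in bytes, a small sketch follows. The function name `bytes_to_tb` is hypothetical, and it assumes decimal terabytes (10^12 bytes) by default. Whether the table uses decimal TB or binary TiB (2^40 bytes) is not stated here, so both are offered.

```python
def bytes_to_tb(size_bytes, binary=False):
    """Convert a size in bytes to terabytes.

    Assumption: decimal TB (10**12 bytes) unless binary=True,
    in which case TiB (2**40 bytes) is used.
    """
    unit = 2**40 if binary else 10**12
    return size_bytes / unit
```

The two conventions differ by about 10% at this scale, so whichever is chosen should be stated in the column header.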