Data dump to GitHub #2
We already have a repo for this (https://github.com/bio-tools/bio.tools-content) but the names may be a bit crappy? How about:
Preferences? I'll need to spell out this is strictly for experimental purposes (like what I said here already). And in what format:
Preferences? I'd personally prefer XML because it makes validation direct and easier (and avoids any drift towards less rigorous JSON Schema equivalents of biotoolsSchema etc.). And what about the structure? I propose one folder per tool, where the folder name is the bio.tools toolID, which allows for adding other tool descriptors / files / formats under a common directory. Also one XML with everything in. Preferences? cc @bgruening @hmenager @hansioan: what do you think?
From @bgruening on December 8, 2018 10:56 My gut feeling is I would prefer YAML, as this is currently the easiest format for people to edit in an editor or browser. This can change if we dump the final version and when we have a curation interface, but for now I would prefer YAML. The shim is hopefully not complicated to write and would be used on CI to 1) convert it to XML and 2) validate any changes. Thanks @joncison for working on this.
OK thanks! Just a note that https://dev.bio.tools/api/tool/ is currently serving a mess (in all of XML, JSON and YAML :) ) - but this has been sorted locally. I plan to play more with shims next week, so let's see how this goes ... And @bgruening - what about the directory structure; are you happy with folders as we talked about previously?
From @scapella on December 8, 2018 20:22 IMHO I'd go either for https://github.com/bio-tools/resources or Regarding the structure of the repo I'd suggest to have a general folder Cheers, Salva
From @hansioan on December 10, 2018 11:39 @joncison @hmenager @bgruening https://bio.tools/api/t?page=1&format=json In the case of biotoolsSchema XML, for now we only have that on a per-tool basis (example shown on dev but will soon work on production too)
From @jlgelpi on December 10, 2018 11:45 I would go for a single format for the repository (one that can be easily checked against a schema). Having several formats may introduce inconsistencies. Perhaps we can accept any input format (XML, JSON, YAML) and convert the data on the pull request.
Please let us know what you think @hmenager then I'll write back addressing all comments above ...
From @bgruening on December 10, 2018 11:59
Yes. Folders are good.
Most likely. Would make sense. Whatever we do, we can change this easily later on. So nothing is set in stone imho.
@hansioan any reason? JSON is a subset of YAML so that should be fine for both worlds and conversion is easy.
I guess the idea was to accept only one format and then on CI add all validation. This validation could happen by intermediate conversion to XML if @joncison thinks that's best.
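A rough idea of what such a conversion shim could look like, as a minimal sketch only. It assumes the entry arrives as JSON (which is also valid YAML, per the comment above) and uses only the Python standard library. The element names (`tool`, `item`) and the sample field `toolID` are placeholders for illustration and are not taken from biotoolsSchema; real schema validation would need an XSD-aware library and is not shown.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(tag, value):
    """Recursively convert a JSON-like value into an XML element."""
    elem = ET.Element(tag)
    if isinstance(value, dict):
        # One child element per key; keys are assumed to be valid XML names.
        for key, child in value.items():
            elem.append(to_xml(key, child))
    elif isinstance(value, list):
        # Simplification: each list entry becomes a generic <item> child.
        for child in value:
            elem.append(to_xml("item", child))
    elif value is not None:
        elem.text = str(value)
    return elem


def convert_entry(json_text):
    """Parse one tool entry and return it as an XML string."""
    entry = json.loads(json_text)  # fails loudly on malformed input
    root = to_xml("tool", entry)   # "tool" is a placeholder root name
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = '{"toolID": "example_tool", "name": "Example", "topic": ["Genomics"]}'
    print(convert_entry(sample))
```

On CI, a step like this would run on every changed file, and the build would fail if parsing or the subsequent validation step raised an error.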
From @redmitry on December 10, 2018 12:10 Hello, I know that for a mere human being the form ?page=1&format=json is the natural way, as it allows using an ordinary browser for GET requests, but in terms of REST architecture it is better to use headers:
The advantage of standard HTTP pagination is that a client knows the total size from the beginning (headers come before the body) and can calculate the number of pages in the table while loading only one page. Of course, nothing prevents someone from implementing both forms.
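To illustrate the page-count calculation described above, here is a minimal sketch. It assumes, purely for illustration, that the server reports the range in a `Content-Range`-style header value such as `items 0-9/12000` (unit, first and last index, total). The unit name `items` and the numbers are hypothetical and do not describe what bio.tools actually returns.

```python
import math
import re


def parse_content_range(header_value):
    """Parse a value like 'items 0-9/12000' into (first, last, total)."""
    match = re.fullmatch(r"\s*(\w+)\s+(\d+)-(\d+)/(\d+)\s*", header_value)
    if match is None:
        raise ValueError(f"unrecognised Content-Range value: {header_value!r}")
    _unit, first, last, total = match.groups()
    return int(first), int(last), int(total)


def page_count(header_value):
    """Number of pages implied by the range of the first page and the total."""
    first, last, total = parse_content_range(header_value)
    page_size = last - first + 1
    if page_size <= 0:
        raise ValueError("range end must not be before range start")
    return math.ceil(total / page_size)


if __name__ == "__main__":
    # Hypothetical header value received with the first page of results.
    print(page_count("items 0-9/12000"))
```

Because the header arrives before the body, a client can size its progress bar or plan its requests after reading the first response headers.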
From @hansioan on December 10, 2018 12:20 Regarding the versions... Technically in bio.tools we don't have fine-grained tracking of tool versions. In the new schema 3.0, which will go in this week if not today, we allow multiple version annotations per tool. The reason for this is that if there is no difference in annotation for a set of tool versions, they all go in the same tool annotation. Thus I am not really sure if any of the folder structure is needed, at least not at the core of the tool descriptions. We can have the option of providing multiple copies of the same tool, separated by version, but that should be something that results from the initial structure, and not something that IS the initial structure.
From @scapella on December 10, 2018 12:49 @hansioan I agree with you that when there are no changes among versions, it is easy Cheers, Salva
From @hansioan on December 10, 2018 13:15 Yes, but having version-specific information for each tool gets us back to 2 years ago, when tools were accessed like https://bio.tools/toolid/version. This way was basically creating a tool whenever a new version appeared, and in 90% of all cases there were no (zero) differences between the annotations, except for the version property. We had a very famous example of a tool that appeared over 10 times in bio.tools with the same annotation, because people were just going in and updating the version information whenever they released a new version (e.g. new tool between tool version 1.2.23 and 1.2.24). There is no good way to do separate versions for each tool except modeling this in the API request, and even if there was we would still have to store versioned tools in the database. While this can certainly apply for things like conda, containers and other projects that require the exact versions, I don't think it applies as much to bio.tools. We must remember that 90% of our users just want to find a tool that meets their scientific requirements (focus on find). All this being said, I am not opposed to having a good solution that can work for everyone; it is just something which is complicated and not in our list of main tasks right now. We have opened the code and once all the remaining plumbing tasks are done and we are ready to accept pull requests, perhaps this can be one of the initial tasks for contributors.
From @bgruening on December 10, 2018 13:17 @redmitry JSON will be served from the web service. But this issue is about the data storage in GitHub; no one disagrees that JSON will be served from the web service, imho. @scapella @hansioan don't let the version discussion stop the initial drop, please. Versioning can be added at any time. Subfolders are easy to add for anyone if they care about the version or if the differences between the tools/benchmarks are too big. You simply need to adjust your bio.tools-github-parser to traverse recursively through all folders. And people who don't need this can simply assume that the latest version is in the root dir. Really not a big deal. We can add this later if we need to.
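The recursive traversal mentioned above could look roughly like the following sketch. It assumes a layout of one folder per tool under a data root, with the latest description directly in the tool folder and optional version subfolders below it. The function name `collect_tool_files`, the `.json` suffix and the layout are assumptions for illustration, not an agreed convention.

```python
from pathlib import Path


def collect_tool_files(data_root, suffix=".json"):
    """Map each tool folder to its root-level files and its versioned files."""
    tools = {}
    tool_dirs = sorted(p for p in Path(data_root).iterdir() if p.is_dir())
    for tool_dir in tool_dirs:
        latest = []     # files directly in the tool folder
        versioned = {}  # first-level subfolder name -> files found below it
        for path in sorted(tool_dir.rglob("*" + suffix)):
            rel = path.relative_to(tool_dir)
            if len(rel.parts) == 1:
                latest.append(path)
            else:
                versioned.setdefault(rel.parts[0], []).append(path)
        tools[tool_dir.name] = {"latest": latest, "versioned": versioned}
    return tools


if __name__ == "__main__":
    # "data" is a placeholder path; point it at a local checkout.
    for tool_id, files in collect_tool_files("data").items():
        print(tool_id, len(files["latest"]), sorted(files["versioned"]))
```

A consumer that does not care about versions reads only the `latest` list and ignores the rest.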
From @hmenager on December 10, 2018 14:23 @joncison As far as I'm concerned YAML would probably be the best choice,
For the repository, I would go for either https://github.com/bio-tools/tools
From @scapella on December 10, 2018 15:33 Fine with me to have the versioning stuff on the radar but not to stop the Salva
Quick update - will be revisiting this in new year - but for now a few points:
Let's keep this issue for the data dump and use this for technical discussions about a GitHub-based content architecture. Pls. bear in mind the priority on the DK side is getting the deployment and open-dev process sorted, critical / high priority issues scheduled for the 2019 Q1 release, the website redesign, and other features with direct impact on end-users. The new content architecture under discussion would be awesome, but depends on other components including an independent curators interface e.g. based on edamToolAnnotator and independent validation mechanism, e.g. biotoolsLint. It's a lot of work, hence a matter of priorities.
UPDATE
Let's keep this thread specifically for issues around the data dump using e.g. #7 for discussion around formats/transforms. @hansioan & I have started on curation work to assure all entries give "canonical" tool descriptions, which is a pre-requisite really to the complete dump.
PS. one huge help @bgruening would be to work with @piotrgithub1 to get the local deployment working - as all the community bio.tools dev will depend on it. I know you started looking at this with @hansioan in FR last year.
Yeah, I think we got the backend to start up in a redistributable conda environment.
UPDATE
Thanks to @hansioan & @piotrgithub1 we now have 2000 tools in JSON format in https://github.com/bio-tools/content PJ says ... "I have added 2000 tools from bio.tools registry to github repo in a prettified JSON format (https://github.com/bio-tools/content ). The root folder is called ‘data’ for the lack of a better name and as we have previously agreed almost everything about this is can change. I invite you to try it out and we should organize a call in a couple of weeks (18th onwards?) to summarize the experiences and present and/or evaluate ideas on the data interoperability within the platform’s systems." The curation work to assure all entries give "canonical" tool descriptions is almost done - some of the entries you see in the repo might disappear, or have their name / IDs changed over the next weeks.
@piotrgithub1 now that the big curation work to ensure all entries give "canonical" tool descriptions is done (or "done enough" for now), can we dump everything (all 12,000+ entries) in JSON to https://github.com/bio-tools/content/tree/master/data ? I understand @hansioan has something to do the dump automatically? On the call today there was general agreement it would be nice to do this, esp. in preparation for All-hands. Prob. easiest to delete what's there / start again (IDs, hence file names, will have changed in some cases) cc @bgruening @scapella @osallou
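For reference, a "delete what's there and start again" dump could be sketched as below. This is not the script @hansioan has; it is a hypothetical illustration that assumes the entries are already available as a list of dicts, that each carries its ID under a key named `toolID`, and that each tool is written as prettified JSON to `<root>/<toolID>/<toolID>.json`. All of these names are assumptions.

```python
import json
import shutil
from pathlib import Path


def dump_entries(entries, data_root, id_field="toolID"):
    """Wipe data_root and write one prettified JSON file per tool folder."""
    root = Path(data_root)
    if root.exists():
        shutil.rmtree(root)  # start from scratch: IDs may have changed
    root.mkdir(parents=True)
    written = []
    for entry in entries:
        tool_id = entry[id_field]
        tool_dir = root / tool_id
        tool_dir.mkdir(exist_ok=True)
        target = tool_dir / f"{tool_id}.json"
        text = json.dumps(entry, indent=2, sort_keys=True, ensure_ascii=False)
        target.write_text(text + "\n", encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    # Placeholder entries; a real dump would read them from the registry.
    sample = [{"toolID": "example_tool", "name": "Example"}]
    for path in dump_entries(sample, "data"):
        print(path)
```

Sorting keys and using a fixed indent keeps the files stable between dumps, so Git diffs show only real content changes.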
nudge-nudge @piotrgithub1 @hansioan - in case you think it'd be good to drop more files in time for All-hands |
From @joncison on September 4, 2018 11:27
Nightly dump of all content (in XML and JSON formats?) to GitHub, as a convenience (or at least, to begin with, just a one-off dump)
Copied from original issue: bio-tools/biotoolsRegistry#355