Multiple downloads of the crosswalk table #29

Open
progval opened this issue Mar 23, 2020 · 4 comments
@progval
Member

progval commented Mar 23, 2020

Hi,

Every time one runs blogdown::build_site(), https://github.com/codemeta/codemeta/raw/master/crosswalk.csv is downloaded 16 times.
After several builds in a short time, GitHub rate-limits these requests, which makes the build fail.

Do you know if there is a way to make blogdown cache the crosswalk table across page builds?

@cboettig
Member

@progval Yes, thanks for the ping. Simple question, many answers:

  • For a simple fix, rebuild the site with serve_site() instead of build_site(). That tells blogdown not to re-render the .Rmd files in the content dir, and to stick with the already-knitted HTML outputs from them instead.

  • Second, yeah, we could avoid having each of the crosswalk pages do its own download, or we could have R cache those downloads (e.g. by wrapping the download URL in pins::pin() or manually caching a copy), in
    https://github.com/codemeta/codemeta.github.io/blob/hugo/content/crosswalk/datacite.Rmd#L14

  • Zooming out, the whole design here probably needs an overhaul. As you probably know, we haven't kept up with manually adding a new .Rmd for each new source column in the crosswalk, so those crosswalks really aren't complete any more. Would love your opinion on this. Arguably it is quite useful to have a page with stuff like more background on maven or whatnot, but this Rmd approach clearly doesn't scale super well. @mbjones and I were just discussing this in the context of a larger overhaul of the codemeta.github.io website that would strip it down to something more minimal that is easier to maintain and keep current. The site today feels a bit bloated and stale to me, and not all that user-friendly.

  • Related to the last is the fact that codemeta is now really two somewhat separate projects - while we set out primarily to create a crosswalk, we now basically maintain a 'new' standard and set of supporting tools, and rather separately maintain a list of crosswalk tables from other standards (largely without a lot of supporting tools except for some special cases like R, where codemetar crosswalks a lot more terms than are listed in the R crosswalk table anyway). Some ideas on how to proceed with these two pieces (e.g. should we omit or move the crosswalk stuff off of the main codemeta website?) would be helpful.
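
As a rough illustration of the "manually caching a copy" option in the second bullet, here is a minimal sketch. The helper name `cached_download()`, the cache location, and the 24-hour freshness window are all assumptions made for this example and are not part of the site's code; `tools::R_user_dir()` needs R 4.0 or later.

```r
# Hypothetical helper (not part of the codemeta site): download a file once
# and reuse the local copy until it is older than `max_age_hours`.
cached_download <- function(url,
                            cache_dir = tools::R_user_dir("codemeta-site", "cache"),
                            max_age_hours = 24) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, basename(url))

  # A cached copy is reused only if it exists and is recent enough.
  is_fresh <- file.exists(dest) &&
    as.numeric(difftime(Sys.time(), file.mtime(dest), units = "hours")) < max_age_hours

  if (!is_fresh) {
    utils::download.file(url, dest, mode = "wb", quiet = TRUE)
  }
  dest
}

# Usage inside a crosswalk .Rmd (assumed, shown as a comment to avoid a
# network call here):
# path <- cached_download("https://github.com/codemeta/codemeta/raw/master/crosswalk.csv")
# crosswalk <- read.csv(path, check.names = FALSE)
```

Because the cache lives outside the R session's temporary directory, it should also survive page builds that run in separate R processes.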

Thanks so much for all your work and contributions, it's really fantastic!

@progval
Member Author

progval commented Mar 23, 2020

> • For a simple fix, rebuild the site with serve_site() instead of build_site(). That tells blogdown not to re-render the .Rmd files in the content dir, and to stick with the already-knitted HTML outputs from them instead.

Excellent!

> • Second, yeah, we could avoid having each of the crosswalk pages do its own download, or we could have R cache those downloads (e.g. by wrapping the download URL in pins::pin() or manually caching a copy), in
>   https://github.com/codemeta/codemeta.github.io/blob/hugo/content/crosswalk/datacite.Rmd#L14
>
> • Zooming out, the whole design here probably needs an overhaul. As you probably know, we haven't kept up with manually adding a new .Rmd for each new source column in the crosswalk, so those crosswalks really aren't complete any more. Would love your opinion on this. Arguably it is quite useful to have a page with stuff like more background on maven or whatnot, but this Rmd approach clearly doesn't scale super well. @mbjones and I were just discussing this in the context of a larger overhaul of the codemeta.github.io website that would strip it down to something more minimal that is easier to maintain and keep current. The site today feels a bit bloated and stale to me, and not all that user-friendly.

You could also use Travis (or any other CI) to automatically build the branch with the HTML (currently master) from just the .Rmd files: https://docs.travis-ci.com/user/deployment/pages/. It won't automatically rebuild on changes to crosswalk.csv, but you could set up a daily rebuild. You could then remove the .md files from the hugo branch (which might need to be renamed; maybe rename it to master and rename the current master to gh-pages).

This way, humans never have to commit generated code.

Regarding the crosswalk, we could add a single script that generates them all, from a single input file. That would also mostly solve the multiple downloads issue (there'd be only this script and terms.Rmd)
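
A minimal sketch of what such a single script could look like, assuming the crosswalk has one key column (called "Property" here) plus one column per source format. The function names, the `key_col`/`meta_cols` parameters, and the output layout are hypothetical choices for illustration, not the site's actual design.

```r
# Hypothetical sketch: split one crosswalk data frame into one small table
# per source column, then write one Markdown page per table.
build_crosswalk_tables <- function(crosswalk, key_col = "Property",
                                   meta_cols = character()) {
  # Every column that is neither the key nor metadata is treated as a source.
  source_cols <- setdiff(names(crosswalk), c(key_col, meta_cols))
  tables <- lapply(source_cols, function(col) {
    term <- as.character(crosswalk[[col]])
    keep <- !is.na(term) & nzchar(term)  # drop properties with no mapping
    data.frame(property = crosswalk[[key_col]][keep],
               source_term = term[keep],
               stringsAsFactors = FALSE)
  })
  stats::setNames(tables, source_cols)
}

write_crosswalk_pages <- function(tables, out_dir) {
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
  for (name in names(tables)) {
    tbl <- tables[[name]]
    # File name derived from the column name, e.g. "DataCite" -> "datacite.md".
    slug <- tolower(gsub("[^A-Za-z0-9]+", "-", name))
    lines <- c(
      "---", paste0("title: \"", name, "\""), "---", "",
      "| Property | Source term |", "|---|---|",
      sprintf("| %s | %s |", tbl$property, tbl$source_term)
    )
    writeLines(lines, file.path(out_dir, paste0(slug, ".md")))
  }
  invisible(names(tables))
}
```

With this shape, the CSV is read once and every crosswalk page is derived from the same in-memory table, so a new source column yields a new page without a hand-written .Rmd.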

> • Related to the last is the fact that codemeta is now really two somewhat separate projects - while we set out primarily to create a crosswalk, we now basically maintain a 'new' standard and set of supporting tools, and rather separately maintain a list of crosswalk tables from other standards (largely without a lot of supporting tools except for some special cases like R, where codemetar crosswalks a lot more terms than are listed in the R crosswalk table anyway).

Even though they don't support many package-manager/language metadata formats, AFAIK Bolognese would accept contributions in that direction.

I also wrote a tool running at Software Heritage that converts several formats to CodeMeta and stores it in our database.
Its reach is limited by most languages using a script in lieu of a metadata file, and we don't want to run arbitrary code on our infrastructure (though parsing with regexps seems to work in most cases, I just didn't spend much time on it).

> Some ideas on how to proceed with these two pieces (e.g. should we omit or move the crosswalk stuff off of the main codemeta website?) would be helpful.

With Travis auto-building the website, most of this would no longer be a problem. We could also make https://github.com/codemeta/codemeta a git submodule of https://github.com/codemeta/codemeta.github.io and have the build process read the local file, which would avoid downloads at build time.
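
On the R side, reading the local copy could be as small as the sketch below. The helper name `read_crosswalk()` and the example submodule path are assumptions for illustration; the fallback to the URL is an optional extra for checkouts where the submodule has not been initialised.

```r
# Hypothetical helper: prefer the crosswalk file from a local checkout
# (e.g. a git submodule) and fall back to the remote URL only if it is missing.
read_crosswalk <- function(local_path, url) {
  src <- if (file.exists(local_path)) local_path else url
  utils::read.csv(src, stringsAsFactors = FALSE, check.names = FALSE)
}

# Usage (paths are assumptions, shown as a comment to avoid a network call):
# crosswalk <- read_crosswalk(
#   "codemeta/crosswalk.csv",
#   "https://github.com/codemeta/codemeta/raw/master/crosswalk.csv"
# )
```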

> Thanks so much for all your work and contributions, it's really fantastic!

You're welcome :)
Thanks for all your work as well!

@cboettig
Member

> You could also use Travis (or any other CI) to automatically build the branch with the HTML (currently master) from just the .Rmd files: https://docs.travis-ci.com/user/deployment/pages/. It won't automatically rebuild on changes to crosswalk.csv, but you could set up a daily rebuild. You could then remove the .md files from the hugo branch (which might need to be renamed; maybe rename it to master and rename the current master to gh-pages).

Yes, this totally should be done. It would be easiest to do so with the existing GitHub Actions script for blogdown: https://github.com/r-lib/actions/blob/master/examples/blogdown.yaml. This would avoid the extra faffing with credentials that Travis requires. A PR would be great for this; I'm juggling too many things to do this anytime soon!

Re crosswalk scripts -- yeah, definitely makes sense to automate that more, contributions welcome there too!! Though the crosswalk tables we have lack important metadata about "what" a given column actually is: a link to a homepage, an icon, a title and a description would be a big help.

Re translation, linking more of those tools would be a great addition.

Thanks again!

@progval
Member Author

progval commented Mar 24, 2020

Unfortunately, I'm going to be busy with another project too, but I'll keep this issue in mind.
