
Alternative approach to reproducible research with drakepkg #6

januz opened this issue Nov 28, 2018 · 5 comments

januz commented Nov 28, 2018

@tiernanmartin Thanks for developing drakepkg! I found it 2 weeks ago when I was researching ways to package a drake workflow for a research project I plan to publish as a "research compendium" (according to the methods outlined by @benmarwick in rrtools).

I have since been working on an alternative to your approach, which I have now uploaded to a fork of your repo here.

The main difference, I think, is that I also distribute the .drake/ directory with the package, so that users can check the consistency of the workflow with all its inputs and outputs and can check out intermediate results/targets, without having to re-run the analysis on their own computer. The package includes simple wrapper functions that are intended to lower the barrier for the user to interact with the analysis (e.g., just running reproduce_analysis() is enough to copy and check the analysis). I wrote a vignette that hopefully explains the procedure well.
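
For context, here is a minimal sketch of what such a wrapper could look like. The function name reproduce_analysis() comes from the fork, but the body below is purely illustrative and assumes the analysis directory (including .drake/) ships under inst/analysis/ of the installed package:

# Illustrative only: copy the packaged analysis directory (with its .drake/
# cache) to a working location and check that every target is up to date,
# without rebuilding anything.
reproduce_analysis <- function(pkg = "drakepkg", dest = "analysis") {
  src <- system.file("analysis", package = pkg)  # assumed install location
  fs::dir_copy(src, dest)
  withr::with_dir(dest, {
    config <- drake::drake_config(plan)          # `plan` is defined by the package
    drake::outdated(config)                      # character(0) means fully up to date
  })
}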

I am interested in your thoughts on my approach. Thank you!

tiernanmartin (Owner) commented

This looks great!

I think your decision to include the .drake/ directory makes sense as a default, but there should probably be an easy way to turn this option off. The primary reason for that is that projects working with medium- to large-size data objects will have very large .drake/ directories that exceed the GitHub repo/file size limit.
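
As a sketch of what that opt-out could look like (not part of either package yet, just an illustration using usethis), the cache could be kept out of version control and out of the built package entirely:

# Keep .drake/ out of the built package tarball (it can still live locally):
usethis::use_build_ignore(".drake")

# And, if the cache should not be pushed to GitHub at all, git-ignore it too:
usethis::use_git_ignore(".drake")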

tiernanmartin (Owner) commented

@januz

Also, FYI: you may have noticed that the drakepkg repo hasn't had a lot of recent activity. I have treated it as a prototype or proof of concept and have been refining it in other active projects. I have a growing list of changes and improvements that I will begin implementing sometime in early 2019, so keep an eye out for those.

I look forward to playing with your forked version and will be happy to give you more detailed feedback afterwards!

januz commented Nov 29, 2018

@tiernanmartin Thanks for getting back to me!

> The primary reason for that is that projects working with medium- to large-size data objects will have very large .drake/ directories that exceed the GitHub repo/file size limit.

You're absolutely right. I saw your discussion with @wlandau about using the OSF instead of GitHub to store files. One option for projects with larger data files and a larger .drake/ directory might be to download the analysis directory structure from an external source (e.g., the OSF) instead of delivering it with the package. I think that the functions could easily be rewritten to allow for this option.
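
As a rough sketch of that option (the OSF node id and file layout below are made up, and osfr is just one possible client):

library(osfr)

# Hypothetical: fetch the archived analysis directory (including .drake/)
# from an OSF project instead of shipping it inside the package.
project <- osf_retrieve_node("abc12")   # placeholder OSF project id
files   <- osf_ls_files(project)
osf_download(files, path = "analysis", conflicts = "overwrite")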

> I have a growing list of changes and improvements that I will begin implementing sometime in early 2019, so keep an eye out for those.

> I look forward to playing with your forked version and will be happy to give you more detailed feedback afterwards!

I'm looking forward to seeing your changes/improvements and hearing your suggestions! Maybe we can come up with a common framework that accommodates the most common use cases. In the long run, an integration into rrtools might make sense, e.g., with a function use_drake() that sets up a template package including the scripts, a toy analysis directory structure, and instructions on how users can adjust it to their own analysis.
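
Purely as an illustration of that idea (everything below is hypothetical and not part of rrtools), such a helper might do little more than scaffold a toy plan inside an existing compendium:

# Hypothetical sketch of an rrtools::use_drake() that drops a template drake
# workflow into an existing research compendium.
use_drake <- function(path = ".") {
  fs::dir_create(fs::path(path, "analysis"))
  writeLines(
    c('# Toy drake plan: replace these targets with your own analysis',
      'plan <- drake::drake_plan(',
      '  data = read.csv(drake::file_in("analysis/data/raw-data.csv")),',
      '  model = lm(y ~ x, data = data)',
      ')'),
    fs::path(path, "analysis", "plan.R")
  )
  usethis::use_package("drake")  # declare drake as a dependency of the compendium
  invisible(path)
}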


wlandau commented Dec 3, 2018

Great discussion, you guys! It is exciting to see drakepkg in both its current forms. A few comments:

  • @januz, I love your idea of a use_drake() function in rrtools. Maybe an idea for pRojects as well. Related: Facilitate use of drake? lockedata/starters#39.
  • I think @tiernanmartin's version is a bit more pedagogically accessible. Its small size makes it easy to read, and the essential concepts come across easily. Maybe the fork could serve as a deeper demonstration for advanced users.
  • When you mentioned including the .drake/ cache with the package, that also reminded me of DataPackageR. That might even work out of the box with make(cache = storr::storr_dbi(...)) (a rough sketch follows at the end of this comment).
  • While you could ship the .drake/ cache with the unbuilt source, I would recommend at least adding a line in .Rbuildignore because of the storage size issues.
  • In your fork, I noticed some functions that hash input files. I think drake can do much of that work for you.
library(drake)
load_main_example()
make(plan)
#> target raw_data
#> target data
#> target fit
#> target hist
#> target report
cache <- get_cache()
cache$get(file_store("report.Rmd"))
#> [1] "5a49b18f8d579dffda0983cbdecc44acc5099f3ee92c34b18f641fab86e6558e"
drake_cache_log()
#> # A tibble: 12 x 3
#>    hash             type   name                
#>    <chr>            <chr>  <chr>               
#>  1 a668e310782f864c import create_plot         
#>  2 27115496d692f3d2 target data                
#>  3 cfae01896d60312f target fit                 
#>  4 25cdbd93912e0269 import forcats::fct_inorder
#>  5 eb54142bf4029c58 target hist                
#>  6 25efbda5da1aa408 target raw_data            
#>  7 7cfd4cac5787a46e import "\"raw_data.xlsx\"" 
#>  8 c2ee4ecf9dd1c922 import readxl::read_excel  
#>  9 d1813aad07a6a9ba target report              
#> 10 01cc33b4bbba9d14 target "\"report.html\""   
#> 11 b9bbfe573f3087b7 import "\"report.Rmd\""    
#> 12 b7299c6d33b92763 import rmarkdown::render

Created on 2018-12-02 by the reprex package (v0.2.1)
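
To make the storr_dbi() idea above a bit more concrete, here is an untested sketch of pointing make() at a single-file SQLite cache instead of the default .drake/ folder (the table and file names are arbitrary):

library(drake)
library(storr)

# A single SQLite file can be easier to ship or archive than a .drake/ folder.
con   <- DBI::dbConnect(RSQLite::SQLite(), "drake-cache.sqlite")
cache <- storr_dbi(tbl_data = "data", tbl_keys = "keys", con = con)

load_main_example()
make(plan, cache = cache)  # untested: make() accepts a storr cache via `cache`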


wlandau commented Dec 3, 2018

Note: the hash from cache$get(file_store("report.Rmd")) is different from the corresponding hash from drake_cache_log() because storr computes its own separate hashes.
