
Alternative approach to reproducible research with drakepkg #6

januz opened this issue Nov 28, 2018 · 5 comments

januz commented Nov 28, 2018

@tiernanmartin Thanks for developing drakepkg! I found it 2 weeks ago when I was researching ways to package a drake workflow for a research project I plan to publish as a "research compendium" (according to the methods outlined by @benmarwick in rrtools).

I have since been working on an alternative to your approach, which I have now uploaded to a fork of your repo here.

The main difference, I think, is that I also distribute the .drake/ directory with the package, so that users can check the consistency of the workflow with all its inputs and outputs and can check out intermediate results/targets, without having to re-run the analysis on their own computer. The package includes simple wrapper functions that are intended to lower the barrier for the user to interact with the analysis (e.g., just running reproduce_analysis() is enough to copy and check the analysis). I wrote a vignette that hopefully explains the procedure well.
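
For context, here is a minimal sketch of what such a wrapper could look like. The function name reproduce_analysis() comes from the fork, but the body below is purely illustrative and assumes the analysis directory (including .drake/) ships under inst/analysis/ of the installed package:

# Illustrative only: copy the packaged analysis directory (with its .drake/
# cache) to a working location and check that every target is up to date,
# without rebuilding anything.
reproduce_analysis <- function(pkg = "drakepkg", dest = "analysis") {
  src <- system.file("analysis", package = pkg)  # assumed install location
  fs::dir_copy(src, dest)
  withr::with_dir(dest, {
    config <- drake::drake_config(plan)          # `plan` is defined by the package
    drake::outdated(config)                      # character(0) means fully up to date
  })
}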

I am interested in your thoughts on my approach. Thank you!

tiernanmartin (Owner) commented

This looks great!

I think your decision to include the .drake/ directory makes sense as a default, but there should probably be an easy way to turn this option off. The primary reason for that is that projects working with medium- to large-size data objects will have very large .drake/ directories that exceed the GitHub repo/file size limit.
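
As a sketch of what that opt-out could look like (not part of either package yet, just an illustration using usethis), the cache could be kept out of version control and out of the built package entirely:

# Keep .drake/ out of the built package tarball (it can still live locally):
usethis::use_build_ignore(".drake")

# And, if the cache should not be pushed to GitHub at all, git-ignore it too:
usethis::use_git_ignore(".drake")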

tiernanmartin (Owner) commented

@januz

Also, FYI: you may have noticed that the drakepkg repo hasn't had a lot of recent activity. I have treated it as a prototype or proof of concept and have been refining it in other active projects. I have a growing list of changes and improvements that I will begin implementing sometime in early 2019, so keep an eye out for those.

I look forward to playing with your forked version and will be happy to give you more detailed feedback afterwards!

januz commented Nov 29, 2018

@tiernanmartin Thanks for getting back to me!

> The primary reason for that is that projects working with medium- to large-size data objects will have very large .drake/ directories that exceed the GitHub repo/file size limit.

You're absolutely right. I saw your discussion with @wlandau about using the OSF instead of GitHub to store files. One option for projects with larger data files and a larger .drake/ directory might be to download the analysis directory structure from an external source (e.g., the OSF) instead of delivering it with the package. I think that the functions could easily be rewritten to allow for this option.
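
As a rough sketch of that option (the OSF node id and file layout below are made up, and osfr is just one possible client):

library(osfr)

# Hypothetical: fetch the archived analysis directory (including .drake/)
# from an OSF project instead of shipping it inside the package.
project <- osf_retrieve_node("abc12")   # placeholder OSF project id
files   <- osf_ls_files(project)
osf_download(files, path = "analysis", conflicts = "overwrite")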

> I have a growing list of changes and improvements that I will begin implementing sometime in early 2019, so keep an eye out for those.

> I look forward to playing with your forked version and will be happy to give you more detailed feedback afterwards!

I'm looking forward to seeing your changes/improvements and hearing your suggestions! Maybe we can come up with a common framework that accommodates the most common use cases. In the long run, an integration into rrtools might make sense, e.g., with a function use_drake() that sets up a template package including the scripts, a toy analysis directory structure, and instructions on how users can adjust it to their own analysis.
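
Purely as an illustration of that idea (everything below is hypothetical and not part of rrtools), such a helper might do little more than scaffold a toy plan inside an existing compendium:

# Hypothetical sketch of an rrtools::use_drake() that drops a template drake
# workflow into an existing research compendium.
use_drake <- function(path = ".") {
  fs::dir_create(fs::path(path, "analysis"))
  writeLines(
    c('# Toy drake plan: replace these targets with your own analysis',
      'plan <- drake::drake_plan(',
      '  data = read.csv(drake::file_in("analysis/data/raw-data.csv")),',
      '  model = lm(y ~ x, data = data)',
      ')'),
    fs::path(path, "analysis", "plan.R")
  )
  usethis::use_package("drake")  # declare drake as a dependency of the compendium
  invisible(path)
}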


wlandau commented Dec 3, 2018

Great discussion, you guys! It is exciting to see drakepkg in both its current forms. A few comments:

  • @januz, I love your idea of a use_drake() function in rrtools. Maybe an idea for pRojects as well. Related: Facilitate use of drake? lockedata/starters#39.
  • I think @tiernanmartin's version is a bit more pedagogically accessible. Its small size makes it easy to read, and the essential concepts come across easily. Maybe the fork could serve as a deeper demonstration for advanced users.
  • When you mentioned including the .drake/ cache with the package, that also reminded me of DataPackageR. That might even work out of the box with make(cache = storr::storr_dbi(...)) (a rough sketch follows at the end of this comment).
  • While you could ship the .drake/ cache with the unbuilt source, I would recommend at least adding a line in .Rbuildignore because of the storage size issues.
  • In your fork, I noticed some functions that hash input files. I think drake can do much of that work for you.
library(drake)
load_main_example()
make(plan)
#> target raw_data
#> target data
#> target fit
#> target hist
#> target report
cache <- get_cache()
cache$get(file_store("report.Rmd"))
#> [1] "5a49b18f8d579dffda0983cbdecc44acc5099f3ee92c34b18f641fab86e6558e"
drake_cache_log()
#> # A tibble: 12 x 3
#>    hash             type   name                
#>    <chr>            <chr>  <chr>               
#>  1 a668e310782f864c import create_plot         
#>  2 27115496d692f3d2 target data                
#>  3 cfae01896d60312f target fit                 
#>  4 25cdbd93912e0269 import forcats::fct_inorder
#>  5 eb54142bf4029c58 target hist                
#>  6 25efbda5da1aa408 target raw_data            
#>  7 7cfd4cac5787a46e import "\"raw_data.xlsx\"" 
#>  8 c2ee4ecf9dd1c922 import readxl::read_excel  
#>  9 d1813aad07a6a9ba target report              
#> 10 01cc33b4bbba9d14 target "\"report.html\""   
#> 11 b9bbfe573f3087b7 import "\"report.Rmd\""    
#> 12 b7299c6d33b92763 import rmarkdown::render

Created on 2018-12-02 by the reprex package (v0.2.1)
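
To make the storr_dbi() idea above a bit more concrete, here is an untested sketch of pointing make() at a single-file SQLite cache instead of the default .drake/ folder (the table and file names are arbitrary):

library(drake)
library(storr)

# A single SQLite file can be easier to ship or archive than a .drake/ folder.
con   <- DBI::dbConnect(RSQLite::SQLite(), "drake-cache.sqlite")
cache <- storr_dbi(tbl_data = "data", tbl_keys = "keys", con = con)

load_main_example()
make(plan, cache = cache)  # untested: make() accepts a storr cache via `cache`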


wlandau commented Dec 3, 2018

Note: the hash from cache$get(file_store("report.Rmd")) is different from the corresponding hash from drake_cache_log() because storr computes its own separate hashes.
