diff --git a/.nojekyll b/.nojekyll
index 0448b1f..d203d3d 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-df21fdd3
\ No newline at end of file
+c3976136
\ No newline at end of file
diff --git a/04-git_files/figure-pdf/unnamed-chunk-36-1.pdf b/04-git_files/figure-pdf/unnamed-chunk-36-1.pdf
index 838b3d0..7932ea2 100644
Binary files a/04-git_files/figure-pdf/unnamed-chunk-36-1.pdf and b/04-git_files/figure-pdf/unnamed-chunk-36-1.pdf differ
diff --git a/Building-Reproducible-Analytical-Pipelines.epub b/Building-Reproducible-Analytical-Pipelines.epub
index 9cedd43..ddf6dd9 100644
Binary files a/Building-Reproducible-Analytical-Pipelines.epub and b/Building-Reproducible-Analytical-Pipelines.epub differ
diff --git a/Building-Reproducible-Analytical-Pipelines.pdf b/Building-Reproducible-Analytical-Pipelines.pdf
index 9fec038..a9f874f 100644
Binary files a/Building-Reproducible-Analytical-Pipelines.pdf and b/Building-Reproducible-Analytical-Pipelines.pdf differ
diff --git a/img/repro_spectrum.png b/img/repro_spectrum.png
new file mode 100644
index 0000000..25902a0
Binary files /dev/null and b/img/repro_spectrum.png differ
diff --git a/index.html b/index.html
index e470315..2a1907e 100644
--- a/index.html
+++ b/index.html
@@ -56,7 +56,27 @@
 @media screen {
 pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
 }
-
+/* CSS for citations */
+div.csl-bib-body { }
+div.csl-entry {
+  clear: both;
+  margin-bottom: 0em;
+}
+.hanging-indent div.csl-entry {
+  margin-left:2em;
+  text-indent:-2em;
+}
+div.csl-left-margin {
+  min-width:2em;
+  float:left;
+}
+div.csl-right-inline {
+  margin-left:2em;
+  padding-left:1em;
+}
+div.csl-indent {
+  margin-left: 2em;
+}
@@ -276,6 +296,10 @@

Table of contents

  • Reproducible analytical pipelines?
  • Data products?
  • Machine learning?
+ • What actually is reproducibility?
  • Why R? Why not [insert your favourite programming language]
  • Pre-requisites
  • Grading
  • @@ -370,6 +394,37 @@

    Machine learning?

    But what about machine learning? Well, depending on what you end up doing, you might indeed focus a lot on machine learning and/or statistical modeling. That being said, in practice, it is very often much more efficient to let some automl algorithm figure out the best hyperparameters of an XGBoost model and simply use that, at least as a starting point (but good luck improving upon automl…). What matters is that the data you’re feeding to your model is clean, that your analysis is sensible, and most importantly, that it could be understood by someone taking over (imagine you get sick) and rerun with minimal effort in the future. The model here should simply be a piece that could be replaced by another model without much impact. The model is rarely central… Of course there are exceptions to this, especially in research, but every other point I’ve made still stands. It’s just that not only do you have to care about your model a lot, you also have to care about everything else.

    So in this course we’re going to learn a bit of all of this. We’re going to learn how to write reusable code, learn some basics of the Linux command line, Git and Docker.

    +
    +

    What actually is reproducibility?

    +

    A reproducible project means that this project can be rerun by anyone at 0 (or very minimal) cost. But there are different levels of reproducibility, and I will discuss this in the next section. Let’s first discuss some requirements that a project must have to be considered a RAP.

    +
    +

    The requirements of a RAP

    +

    For something to be truly reproducible, it has to respect the following bullet points:

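    +
    +  • Source code must obviously be available and thoroughly tested and documented (which is why we will be using Git and GitHub);
    +  • All the dependencies must be easy to find and install (we are going to deal with this using dependency management tools);
    +  • The project must be written in an open source programming language (nocode tools like Excel are by default non-reproducible because they can’t be used non-interactively, which is why we are going to use the R programming language);
    +  • The project needs to be run on an open source operating system (thankfully, we can deal with this without having to install and learn to use a new operating system, thanks to Docker);
    +  • Data and the paper/report obviously need to be accessible as well, if not publicly as is the case for research, then within your company.
    +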

    Also, reproducibility is on a continuum, and depending on the constraints you face, your project can range from “not very reproducible” to “totally reproducible”. Let’s consider the following list of factors that can influence how reproducible your project truly is:

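    +
    +  • Version of the programming language used;
    +  • Versions of the packages/libraries of said programming language used;
    +  • Operating system, and its version;
    +  • Versions of the underlying system libraries (which often go hand in hand with the OS version, but not necessarily);
    +  • And even the hardware architecture that you run that whole software stack on.
    +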

    So by “reproducibility is on a continuum”, what I mean is that you could set up your project in a way that none, one, two, three, four or all of the preceding items are taken into consideration when making your project reproducible.
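
    To make the items above concrete, here is a minimal sketch that records each of them (language version, package versions, operating system and its version, one system library, and hardware architecture) for the current machine. It is written in Python purely for illustration, even though this course uses R (where `sessionInfo()` reports similar information). The function name `environment_snapshot` and the dictionary keys are my own choices for this example, not part of any tool used in the course, and the C library stands in for "system libraries" in general.

```python
import json
import platform
from importlib import metadata


def environment_snapshot(packages=()):
    """Record the factors that influence reproducibility (hypothetical helper)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "language_version": platform.python_version(),
        "package_versions": versions,
        "os": platform.system(),
        "os_version": platform.release(),
        # Only one system library (the C library) is recorded here, as an
        # example; on some systems both fields come back as empty strings.
        "system_libc": list(platform.libc_ver()),
        "architecture": platform.machine(),
    }


if __name__ == "__main__":
    # "pip" is only an example package name; replace it with your own dependencies.
    print(json.dumps(environment_snapshot(["pip"]), indent=2))
```

    Saving such a snapshot next to your results only documents the environment; it does not recreate it. Pinning and rebuilding the environment is what dependency management tools and Docker are for.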

    +

    This is not a new idea. Peng (2011) already discussed this concept, calling it the reproducibility spectrum.

    +
    +The reproducibility spectrum from Peng’s 2011 paper. +
    +
    +
    +

    Why R? Why not [insert your favourite programming language]

    In my absolutely objective opinion, R is currently the most interesting and simple language you can use to create such data products. If you learn R you have access to almost 20’000 packages (as of October 2023) to:

    @@ -450,6 +505,11 @@

    License

    WTFPL

    +
    +
    +Peng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–27. +
    +
diff --git a/search.json b/search.json
index a52ff4f..be9829e 100644
--- a/search.json
+++ b/search.json
@@ -49,6 +49,16 @@
       "Introduction"
     ]
   },
+  {
+    "objectID": "index.html#what-actually-is-reproducibility",
+    "href": "index.html#what-actually-is-reproducibility",
+    "title": "Building Reproducible Analytical Pipelines",
+    "section": "What actually is reproducibility?",
+    "text": "What actually is reproducibility?\nA reproducible project means that this project can be rerun by anyone at 0 (or very minimal) cost. But there are different levels of reproducibility, and I will discuss this in the next section. Let’s first discuss some requirements that a project must have to be considered a RAP.\n\nThe requirements of a RAP\nFor something to be truly reproducible, it has to respect the following bullet points:\n\nSource code must obviously be available and thoroughly tested and documented (which is why we will be using Git and GitHub);\nAll the dependencies must be easy to find and install (we are going to deal with this using dependency management tools);\nThe project must be written in an open source programming language (nocode tools like Excel are by default non-reproducible because they can’t be used non-interactively, which is why we are going to use the R programming language);\nThe project needs to be run on an open source operating system (thankfully, we can deal with this without having to install and learn to use a new operating system, thanks to Docker);\nData and the paper/report obviously need to be accessible as well, if not publicly as is the case for research, then within your company.\n\nAlso, reproducibility is on a continuum, and depending on the constraints you face, your project can range from “not very reproducible” to “totally reproducible”. Let’s consider the following list of factors that can influence how reproducible your project truly is:\n\nVersion of the programming language used;\nVersions of the packages/libraries of said programming language used;\nOperating system, and its version;\nVersions of the underlying system libraries (which often go hand in hand with the OS version, but not necessarily);\nAnd even the hardware architecture that you run that whole software stack on.\n\nSo by “reproducibility is on a continuum”, what I mean is that you could set up your project in a way that none, one, two, three, four or all of the preceding items are taken into consideration when making your project reproducible.\nThis is not a new idea. Peng (2011) already discussed this concept, calling it the reproducibility spectrum.\n\n\n\nThe reproducibility spectrum from Peng’s 2011 paper.",
+    "crumbs": [
+      "Introduction"
+    ]
+  },
   {
     "objectID": "index.html#why-r-why-not-insert-your-favourite-programming-language",
     "href": "index.html#why-r-why-not-insert-your-favourite-programming-language",
@@ -104,7 +114,7 @@
     "href": "index.html#license",
     "title": "Building Reproducible Analytical Pipelines",
     "section": "License",
-    "text": "License\nThis course is licensed under the WTFPL.",
+    "text": "License\nThis course is licensed under the WTFPL.\n\n\n\n\n\nPeng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–27.",
     "crumbs": [
       "Introduction"
     ]