
Commit

Built site for gh-pages
Quarto GHA Workflow Runner committed Nov 24, 2023
1 parent 052a52b commit a23a02e
Showing 7 changed files with 73 additions and 3 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
df21fdd3
c3976136
Binary file modified 04-git_files/figure-pdf/unnamed-chunk-36-1.pdf
Binary file not shown.
Binary file modified Building-Reproducible-Analytical-Pipelines.epub
Binary file not shown.
Binary file modified Building-Reproducible-Analytical-Pipelines.pdf
Binary file not shown.
Binary file added img/repro_spectrum.png
62 changes: 61 additions & 1 deletion index.html
@@ -56,7 +56,27 @@
@media screen {
pre > code.sourceCode > span > a:first-child::before { text-decoration: underline; }
}
</style>
/* CSS for citations */
div.csl-bib-body { }
div.csl-entry {
clear: both;
margin-bottom: 0em;
}
.hanging-indent div.csl-entry {
margin-left:2em;
text-indent:-2em;
}
div.csl-left-margin {
min-width:2em;
float:left;
}
div.csl-right-inline {
margin-left:2em;
padding-left:1em;
}
div.csl-indent {
margin-left: 2em;
}</style>


<script src="site_libs/quarto-nav/quarto-nav.js"></script>
@@ -276,6 +296,10 @@ <h2 id="toc-title">Table of contents</h2>
<li><a href="#reproducible-analytical-pipelines" id="toc-reproducible-analytical-pipelines" class="nav-link" data-scroll-target="#reproducible-analytical-pipelines">Reproducible analytical pipelines?</a></li>
<li><a href="#data-products" id="toc-data-products" class="nav-link" data-scroll-target="#data-products">Data products?</a></li>
<li><a href="#machine-learning" id="toc-machine-learning" class="nav-link" data-scroll-target="#machine-learning">Machine learning?</a></li>
<li><a href="#what-actually-is-reproducibility" id="toc-what-actually-is-reproducibility" class="nav-link" data-scroll-target="#what-actually-is-reproducibility">What actually is reproducibility?</a>
<ul class="collapse">
<li><a href="#the-requirements-of-a-rap" id="toc-the-requirements-of-a-rap" class="nav-link" data-scroll-target="#the-requirements-of-a-rap">The requirements of a RAP</a></li>
</ul></li>
<li><a href="#why-r-why-not-insert-your-favourite-programming-language" id="toc-why-r-why-not-insert-your-favourite-programming-language" class="nav-link" data-scroll-target="#why-r-why-not-insert-your-favourite-programming-language">Why R? Why not [insert your favourite programming language]</a></li>
<li><a href="#pre-requisites" id="toc-pre-requisites" class="nav-link" data-scroll-target="#pre-requisites">Pre-requisites</a></li>
<li><a href="#grading" id="toc-grading" class="nav-link" data-scroll-target="#grading">Grading</a></li>
@@ -370,6 +394,37 @@ <h2 class="anchored" data-anchor-id="machine-learning">Machine learning?</h2>
<p>But what about machine learning? Well, depending on what you’ll end up doing, you might indeed focus a lot on machine learning and/or statistical modeling. That being said, in practice, it is very often much more efficient to let some automl algorithm figure out the best hyperparameters of an XGBoost model and simply use that, at least as a starting point (but good luck improving upon automl…). What matters is that the data you’re feeding to your model is clean, that your analysis is sensible, and most importantly, that it could be understood by someone taking over (imagine you get sick) and rerun with minimal effort in the future. The model here should simply be a piece that could be replaced by another model without much impact. The model is rarely central… Of course there are exceptions to this, especially in research, but every other point I’ve made still stands: not only do you have to care about your model a lot, you also have to care about everything else.</p>
<p>So in this course we’re going to learn a bit of all of this. We’re going to learn how to write reusable code, learn some basics of the Linux command line, Git and Docker.</p>
</section>
<section id="what-actually-is-reproducibility" class="level2">
<h2 class="anchored" data-anchor-id="what-actually-is-reproducibility">What actually is reproducibility?</h2>
<p>A project is reproducible if anyone can rerun it at zero (or very minimal) cost. But there are different levels of reproducibility, and I will discuss this below. Let’s first discuss some requirements that a project must meet to be considered a RAP.</p>
<section id="the-requirements-of-a-rap" class="level3">
<h3 class="anchored" data-anchor-id="the-requirements-of-a-rap">The requirements of a RAP</h3>
<p>For something to be truly reproducible, it has to meet the following requirements:</p>
<ul>
<li>The source code must be available, thoroughly tested and documented (which is why we will be using Git and GitHub);</li>
<li>All the dependencies must be easy to find and install (we are going to deal with this using dependency management tools);</li>
<li>It must be written in an open source programming language (no-code tools like Excel are by default non-reproducible because they can’t be used non-interactively, which is why we are going to use the R programming language);</li>
<li>The project needs to run on an open source operating system (thankfully, Docker lets us deal with this without having to install and learn a new operating system);</li>
<li>The data and the paper/report obviously need to be accessible as well: if not publicly, as is the case for research, then within your company.</li>
</ul>
<p>Also, reproducibility is on a continuum, and depending on the constraints you face, your project can range from “not very reproducible” to “totally reproducible”. Consider the following list of factors that can influence how reproducible your project truly is:</p>
<ul>
<li>Version of the programming language used;</li>
<li>Versions of the packages/libraries of said programming language used;</li>
<li>Operating system and its version;</li>
<li>Versions of the underlying system libraries (which often go hand in hand with the OS version, but not necessarily);</li>
<li>And even the hardware architecture that you run all that software stack on.</li>
</ul>
<p>So by “reproducibility is on a continuum”, what I mean is that you could set up your project in a way that none, one, two, three, four or all of the preceding items are taken into consideration when making your project reproducible.</p>
<p>This is not a new idea. <span class="citation" data-cites="peng2011">Peng (<a href="#ref-peng2011" role="doc-biblioref">2011</a>)</span> already discussed this concept, calling it the <em>reproducibility spectrum</em>.</p>
<figure class="figure">
<img src="img/repro_spectrum.png" alt="The reproducibility spectrum from Peng's 2011 paper." class="figure-img">
<figcaption>
The reproducibility spectrum from Peng’s 2011 paper.
</figcaption>
</figure>
</section>
</section>
<section id="why-r-why-not-insert-your-favourite-programming-language" class="level2">
<h2 class="anchored" data-anchor-id="why-r-why-not-insert-your-favourite-programming-language">Why R? Why not [insert your favourite programming language]</h2>
<p>In my absolutely objective opinion R is currently the most interesting and simple language you can use to create such data products. If you learn R you have access to almost 20’000 packages (as of October 2023) to:</p>
@@ -450,6 +505,11 @@ <h2 class="anchored" data-anchor-id="license">License</h2>
<p><a href="http://www.wtfpl.net/"><img src="img/wtfpl-badge-4.png" width="80" height="15" alt="WTFPL"></a></p>


<div id="refs" class="references csl-bib-body hanging-indent" data-entry-spacing="0" role="list">
<div id="ref-peng2011" class="csl-entry" role="listitem">
Peng, Roger D. 2011. <span>“Reproducible Research in Computational Science.”</span> <em>Science</em> 334 (6060): 1226–27.
</div>
</div>
</section>
</section>

12 changes: 11 additions & 1 deletion search.json
@@ -49,6 +49,16 @@
"Introduction"
]
},
{
"objectID": "index.html#what-actually-is-reproducibility",
"href": "index.html#what-actually-is-reproducibility",
"title": "Building Reproducible Analytical Pipelines",
"section": "What actually is reproducibility?",
"text": "What actually is reproducibility?\nA reproducible project means that this project can be rerun by anyone at 0 (or very minimal) cost. But there are different levels of reproducibility, and I will discuss this in the next section. Let’s first discuss some requirements that a project must have to be considered a RAP.\n\nThe requirements of a RAP\nFor something to be truly reproducible, it has to respect the following bullet points:\n\nSource code must obviously be available and thoroughly tested and documented (which is why we will be using Git and Github);\nAll the dependencies must be easy to find and install (we are going to deal with this using dependency management tools);\nTo be written with an open source programming language (nocode tools like Excel are by default non-reproducible because they can’t be used non-interactively, and which is why we are going to use the R programming language);\nThe project needs to be run on an open source operating system (thankfully, we can deal with this without having to install and learn to use a new operating system, thanks to Docker);\nData and the paper/report need obviously to be accessible as well, if not publicly as is the case for research, then within your company.\n\nAlso, reproducibility is on a continuum, and depending on the constraints you face your project can be “not very reproducible” to “totally reproducible”. 
Let’s consider the following list of anything that can influence how reproducible your project truly is:\n\nVersion of the programming language used;\nVersions of the packages/libraries of said programming language used;\nOperating System, and its version;\nVersions of the underlying system libraries (which often go hand in hand with OS version, but not necessarily).\nAnd even the hardware architecture that you run all that software stack on.\n\nSo by “reproducibility is on a continuum”, what I mean is that you could set up your project in a way that none, one, two, three, four or all of the preceding items are taken into consideration when making your project reproducible.\nThis is not a novel, or new idea. Peng (2011) already discussed this concept but named it the reproducibility spectrum.\n\n\n\nThe reproducibility spectrum from Peng’s 2011 paper.",
"crumbs": [
"Introduction"
]
},
{
"objectID": "index.html#why-r-why-not-insert-your-favourite-programming-language",
"href": "index.html#why-r-why-not-insert-your-favourite-programming-language",
@@ -104,7 +114,7 @@
"href": "index.html#license",
"title": "Building Reproducible Analytical Pipelines",
"section": "License",
"text": "License\nThis course is licensed under the WTFPL.",
"text": "License\nThis course is licensed under the WTFPL.\n\n\n\n\n\nPeng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–27.",
"crumbs": [
"Introduction"
]
