This set of instructions describes a suggested workflow and best practices for
working smoothly across the CoVPN Correlates Report GitHub repository. The goal
is to ensure that we all follow the same process when merging new code into the
master
branch intended for use by collaborators (including our future selves).
Ultimately, the aim is to make sure that our code works across as many instances
and compute setups as possible; thus, for all contributed code, the original
programmer is responsible for resolving merge conflicts prior to requesting code
review from other members.
Git is a powerful collaborative tool: its major strength is that multiple people can work across the same source repository at the same time, with the ability to seamlessly combine their work. This is done through several steps: branching, committing changes, and generating pull requests.
Branching is the process of creating a distinct line of work (i.e., a sequence
of commits) in a local copy of the repository. Your work on a branch does not
impact the state of the master
branch. This has the advantage of allowing you
to develop and test new code, without any disruptions to the stable version of
the code stored on the shared master
branch. This way, the programmer is able
to make as many changes as desired, then commit and push the updated code to
make it available to other programmers; however, the new code will not impact
the master
branch until it is manually requested to be added in a pull
request.
A pull request (PR)
is a feature specific to GitHub that allows a programmer to request that changes
in a given branch be pulled into the master
branch (or any other target
branch). Upon opening a new PR, a few things can happen. First, GitHub will
check that there are no conflicts with the master
branch (see later),
automatically flagging any cases of such conflicts. Next, an opportunity to
request a review is made available. In a review of a PR, the programmer is able
to tag others to check their programming logic, a chance to ensure that the
underlying logic is clear and that the code included in the PR runs on more than
the machine on which it was written. The interface GitHub provides is very nice:
reviewer can point to specific areas in the code with questions and comments, or
even suggest changes to individual lines of code. As you work, make sure you've
created a new branch, commit often with clear commit messages, and generate a PR
once the code is ready to be reviewed and merged.
An important downside of branching and pulling branches into master
is that
two programmers can unwittingly work on the same file across multiple branches.
Should a PR updating a particular file (or files) also changed on your branch be
approved and merged into master
before the merging of your PR, Git will be
unsure of which version of the updated file ought to be used. This results in
a merge conflict: Git cannot be sure whether to accept your changes or those
made by another programmer (on a different branch). This can happen for many
reasons. The best method for avoiding such issues is simply for each programmer
to make changes only to those files that they need to edit and that such changes
be explicit and contained.
If, despite your best efforts, a merge conflict does occur when making a PR, the
conflict(s) will need to be evaluated and resolved manually prior to merging the
PR into the master
branch. Merge conflicts should be resolved by the original
programmer, as they are most familiar with the code being contributed in the PR
in question. Once all changes to be included in your PR have been committed on
your branch, checkout
the master
branch (i.e., git checkout master
) and
pull any/all changes from the remote (i.e., git pull REMOTE master
). Now,
switch back to your branch (git checkout MYBRANCH
) and attempt to merge in the
up-to-date master
branch (i.e., git merge master
). If there are any
conflicts, a warning message will appear in the command line, indicating the
presence of merge conflicts requiring manual resolution. Use git status
to
identify the files with the conflicts, or if you are using RStudio, look at the
Git pane (orange boxes with a "U" in them will appear before the file name).
For each of the files identified to have a merge conflict, read through the code
-- the differences should be highlighted. Sometimes, these will be large chunks
of code, other times just single lines. Differences start with <<<<<<< BRANCH NAME
, followed by all code from your branch, and then then several equal signs
(=======
) followed by the code from the master
branch, which itself ends
with >>>>>>> master
. Select which version of the code, or blend of them, to
keep and delete the remainder of the code. There may be multiple conflicts in
a file, so make sure that all are found and resolved. After correcting each
conflict across all affected files, ensure the code still runs as intended.
Commit the edits, push, and voila...the PR should no longer indicate any merge
conflicts.
Please adhere to the following rules before submitting a pull request.
- As noted above, do not submit a PR that has merge conflicts.
- Do not add compiled graphics and PDFs to your pull requests (with the
possible exception of verification documentation). Pull requests should only be
code. If you need to store compiled files locally, add them to
.gitignore
. - Update the
Makefile
to include relevant documentation for how to execute the code and dependencies between scripts. - Include a rule
all
at the top of theMakefile
that compiles all of the output that is needed to build the report for your subdirectory. - Include a rule
clean
that removes all compiled graphs and data sets from your subdirectory. - Confirm that after executing your
make clean
command (to remove any compiled results from your local directory), yourmake all
command executes on your system.
Git has an interesting approach to tracking how folder structures behave: it
does not look at the directory structure of the repository but instead tracks
files and their paths. This means that folders that do not contain files are not
added to git for tracking; therefore, such files will not be added or shared to
the remote repository. To get around this, add a simple empty file .gitkeep
to
the folder. Now, git has a file to track and will add the folder to version
control.
This is important because if your code expects a folder to exist and does not
check first if it needs to create it, the code will stop running. Since the goal
of this project is to run seamlessly, adding the .gitkeep
file will make this
possible with minimal effort.
.gitkeep
can be created a few different ways. Through the command line, use
touch /path/to/folder/.gitkeep
; or through R, use
file.create("/path/to/folder/.gitkeep")
.
Here are a few useful notes and readings that may be useful in remedying any
problems that may arise when working with git
. These range the spectrum from
applied to fairly technical:
- "Version Control with
git
" (Software Carpentry): a comprehensive introduction to the uses ofgit
and social coding with GitHub. This is aimed towards students and research professionals. - "Happy
git
and GitHub for the useR" (Jenny Bryan, RStudio): a comprehensive introduction to both the inner workings ofgit
and best practices in using GitHub, with a focus on integrating these tools into your R workflow. Aimed towards masters-level university students. - "Tools for Reproducible Research" (Karl
Broman): a semester-long course on best
practices for modern reproducible research, including a couple of lectures
on
git
and GitHub. - "Introduction to
git
" (Berkeley's Stat 159/259, Fall 2015, KJ Millman): a thorough walkthrough of some commongit
commands as well as explanations of what these commands do when invoked.
Some basic commands:
To clone the repo,
git clone https://{USER}@github.com/CoVPN/correlates_reporting_usgcove_archive.git
In an effort to promote computational reporoducibility, we are making an effort to all use the same version of R and R packages, using {renv}. When opening the project in R, {renv}'s tracking system should start automatically and highlight steps to follow. Things to note include
- Make sure that all R packages being used are the version specified in the
{renv} lockfile (
renv.lock
), ensuring consistency across all programmers' work by callingrenv::restore()
. Do this every time you manually start working, frequently updating from themaster
branch to ensure that the lockfile is itself up-to-date. - To add a new package, there are a few steps. First, ensure there are no
differences in your installed packages and the ones specified in the lockfile
by calling
renv::restore()
. This will enforce the correct versions. Next, install any package(s) as usual (e.g.,install.packages()
orremotes::install_github()
) and integrate use of the package into your code as necessary. Finally, callrenv::snapshot()
. This will record use of the new package(s) into the package library, adding to the lockfile.
Also take note of the {renv} collaboration guide.
To accommodate code developed to span multiple analyses (e.g., for Moderna and
for Janssen), we make use of the [config
]
(https://cran.r-project.org/web/packages/config/vignettes/introduction.html)
package. The file config.yml
in the base repository directory contains
specifications pertaining to each analysis. This includes specifications for
the mock and real Moderna data analysis, as well as several
specifications for Janssen's pooled and region-specific analyses. More
configurations may be added in the future.
To load an analysis configuration, all the user must do is set an environment variable in a shell session. This can be achieved via the command line as follows.
export TRIAL=moderna_mock
setenv TRIAL moderna_mock
On Windows, replace export with set in command line prompt. This can also be achieved in an interactive R
session as follows.
Sys.setenv(TRIAL = "moderna_mock")
Note that if the TRIAL
variable has not been set, the data processing
script will fail. Users should get in the habit of setting this variable
before executing any R code. Once TRIAL
has been set, configurations can be
loaded into an R session using config::get(config = Sys.getenv("TRIAL"))
.
However, in general a user will not need to explicitly call config::get()
in
their code, so long as _common.R
is
source
'd into the code.
Examining _common.R
, we see that config::get
is called in that script and
arguments in the config
list are read into the global environment of that R
session (so users do not need to refer to e.g., config$study_name
and instead
can simply write study_name
).
In order to facilitate reproducible reporting, this project makes use of
automated testing and checking via the continuous integration service
Travis-CI.
For any contributed or changed code, the Travis-CI service will automatically
attempt to build the reports corresponding to each of the statistical analyses
outline in the statistical analysis plan. This automatic check serves to ensure
that all dependencies and analyses, as well as the reports themselves, can be
executed/generated from a bare machine instance (Linux Ubuntu, by default). Any
contributed code that fails to pass these checks will be deemed unsuitable for
merging into the project; thus, it is imperative that contributors not only
verify their updated code by building reports locally (via the Make
recipes
provided) but also check that their contributions pass on the machine instances
spawned by the Travis-CI service. For any given build, logs are provided to help
elucidate any potential sources of error. Upon passing a continuous integration
check, Travis-CI will automatically post the resultant reports to the gh-pages
branch at https://github.com/CoVPN/correlates_reporting_usgcove_archive/tree/gh-pages.
With all this being said, if we all work together, we can try to minimize cases in which we need to resolve merge conflicts. Work in your own folder, make changes specific and manageable, and commit/push regularly.
Follow this flow closely:
- Check out
master
if not there already. If using RStudio, select themaster
branch; if on the command line:git checkout master
will do. - Pull any updates from the GitHub repo. If in RStudio, click the "pull"
button; if on the command line, simply
git pull
. - Make a new branch, check it out, and push to GitHub. If using RStudio, make
a new branch via the GUI. If on the command line:
git branch NAME
, thengit checkout NAME
, thengit push -u origin NAME
(n.b., the first two of these may be combined intogit checkout -b NAME
). - Make sure that the packages you are using match those in the shared library
by calling
renv::restore()
. Resolve any issues as necessary. - Write your code and commit regularly. Install packages as needed. Alternate
between
renv::restore()
andrenv::snapshot()
to capture changes to the active library. To add folders that do not have content in them, or whose content are ignored through.gitignore
, add a.gitkeep
to the folder and add the folder to be tracked. If using RStudio, use the git pane and add the folder; if on the command line, usegit add /path/to/folder/.gitkeep
. Commit the file to complete the addition to tracking the folder through git. - At the beginning of starting work on a branch for the day or reverting back
to prior work, make sure as much of the code is up-to-date with the remote.
Make sure all your edits are committed,
checkout master
, pull updates from the remotemaster
branch,checkout
your branch and mergemaster
into your branch. Resolve conflicts as necessary. - When preparing to open a PR, make sure all your edits are committed. Check
out the
master
branch, pull in changes to the remote, check out your branch and mergemaster
into your branch. Resolve conflicts. Commit and push any updates. - Go to GitHub and open a PR. Identify the updates clearly, and tag reviewers, who will provide feedback on how to improve/streamline the contributed code. Note that the PR will be subject to automated continuous integration checks by the Travis-CI service; a PR that fails to pass these automated checks will not be merged.
- Once the PR has been accepted, merge normally (i.e., do not use a special merge), then delete the feature branch.
- On your machine, clean things up by checking out
master
and pulling. Delete the local copy of that working branch (git branch -d NAME
), and prune your remote copies (git remote prune origin
).