Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nf-core download should support private repos #2406

Open
adamrtalbot opened this issue Aug 21, 2023 · 10 comments
Open

nf-core download should support private repos #2406

adamrtalbot opened this issue Aug 21, 2023 · 10 comments
Labels
download nf-core download enhancement

Comments

@adamrtalbot
Copy link
Contributor

Description of feature

Currently, nf-core download breaks on private repos, mainly because it tries to download some files as compressed zips. We should support a private Github repo where possible, authenticating using the existing Github mechanisms or general Git (e.g. run git clone && zip). This mainly affects pulling the repository contents itself but may affect configuration. This will enable users of private repos to stage them to an offline area.

@ewels
Copy link
Member

ewels commented Aug 22, 2023

huh, interesting. Does this URL not work for private repos?

revision: f"https://github.com/{self.pipeline}/archive/{wf_sha}.zip",

I guess it's because we're doing a plain download. I wonder if we could get it via the (authenticated) API instead.. 🤔 Or at least add in some headers to the download to provide auth.

@adamrtalbot
Copy link
Contributor Author

I'm not certain but I think it just does a requests.get

@ewels
Copy link
Member

ewels commented Aug 22, 2023

Yeah exactly, sorry I phrased it badly. I just meant does that URL exist with authentication on private repos, or is it just not there at all?

I think it's quite easy to extend requests.get to use auth.
Maybe even as simple as requests.get(url, auth=(username, token))..? (see Stack Overflow)

We already have the username and token in code (if we were able to find it, eg. by the user having the gh cli installed or by an env var). So then it would be trivial to add that in, and it would work with private repos. Maybe.

@MatthiasZepper
Copy link
Member

MatthiasZepper commented Aug 25, 2023

Does it work on private repos with the --tower flag?

Because for Tower downloads, it runs git clone https://github.com/{self.pipeline}.git and self.pipeline is e.g. nf-core/rnaseq or whatever. I do recall, however, that I had to use https://, because git clone [email protected]:{self.pipeline}.git always failed in the tests on GithubActions due to authentication issues (it basically does not allow for anonymous downloads). If we would start using SSH for every download, everyone would need to sign up at Github and create a SSH keypair, just to download a public pipeline. Thus, we should have a --private flag for the command to switch between the two methods?

@adamrtalbot
Copy link
Contributor Author

Ahhhh I looked at the code some more. So if the tower flag is false, you download the repo as a zip, but if it's true you clone it . That's clever stuff.

@adamrtalbot
Copy link
Contributor Author

So is there any reason not to use --tower?

@MatthiasZepper
Copy link
Member

MatthiasZepper commented Aug 25, 2023

So is there any reason not to use --tower?

The main reason is the single cryptic package output that does not allow for easy config additions.

When you tw launch a file:/ pipeline, a temporary copy of the pipeline is created in the runfolder and then executed by Nextflow. This makes any config adaptions quite challenging, because the profiles or config files are part of the .git package.

So for every pipeline (and revision!), I need to get the respective configs and test files one by one, adapt the paths in them to the offline locations and then save the output in a separate directory for Tower to access (or submit them to tw datasets)

git --git-dir="$DEPLOY_PIPELINES/rnaseq.git" show latest:conf/test.config | sed "s=https://raw.githubusercontent.com/nf-core/test-datasets=$DEPLOY_TESTDATA=g" > "$DEPLOY_PARAMS/rnaseq/test_miarka.yml" 

In contrast to that, the "classic" download is much more user-friendly. It creates a self-contained directory with the folders config, singularity-images and each pipeline revision. For each revision, the nextflow.config is already updated such that custom_config_base points to "${projectDir}/../configs/" and singularity.cacheDir to "${projectDir}/../singularity-images/":


├── 3_12_0
│   ├── assets
│   ├── bin
│   ├── conf
│   ├── docs
│   ├── lib
│   ├── modules
│   ├── subworkflows
│   └── workflows
├── 3_9
│   ├── assets
│   ├── bin
│   ├── conf
│   ├── docs
│   ├── lib
│   ├── modules
│   ├── subworkflows
│   └── workflows
├── configs
│   ├── bin
│   ├── conf
│   ├── docs
│   └── pipeline
└── singularity-images 

Since you can access everything with a file manager and inspect/edit the file contents with an editor, it is easily customizable.

In contrast, if you want to customize a released pipeline from a --tower download, you would need to commit the changes, delete the original tag and then move the tag to the newly created commit. So without Tower and its ability to e.g. manage sample sheets separately with tw datasets, a pipeline downloaded with --tower is just unwieldy.

@stevekm
Copy link
Contributor

stevekm commented Apr 30, 2024

Why is this hard-coded to only support GitHub? There are plenty of other git hosting platforms that people may be using.

@ewels
Copy link
Member

ewels commented May 1, 2024

There are a few reasons why the pipeline download code specifically is quite tied to GitHub:

  • It's old and mostly predates any ambitions to support pipelines outside of nf-core
  • It involves shared configs which are specific to nf-core
  • Downloading a zip file for the repo is easier and arguably better in this case
    • It edits the downloaded files, so we actively do not want to track git history
    • Downloading a zip file is faster and easier than running git clone
    • The mechanism for fetching zip files is specific to GitHub
  • Not many people have asked for it to support other git hosting platforms
    • If you're not using nf-core configs, then downloading a pipeline from another platform is usually a single click in that UI, or a single git clone command. So most people don't come looking for anything more.

If there is interest in supporting other git hosting platforms for nf-core download, we could look into that. For example, trying to fetch the zip file and falling back to a git clone (and rm -rf .git?) if the download fails. That could also be a feasible solution for this issue, if we can assume that the user already has access to the private git repo configured locally.

@MatthiasZepper
Copy link
Member

MatthiasZepper commented May 2, 2024

If there is interest in supporting other git hosting platforms for nf-core download, we could look into that.

I hate to be a spoilsport, but my personal interest in looking into that is relatively low and since I was virtually the only one writing code for nf-core download in the past ~2 years, I don't see how this would materialize soon.

All nf-core pipelines are on GitHub. Ensuring that they download correctly, including the containers, is already a cat-and-mouse game and even harder, if there is absolutely no standardization or uniformity in how the pipeline is built. Here be dragons is the level of liability I am willing to assume for anything that happens outside of nf-core. I shrug, if somebody tries downloading a non-nf-core pipeline and then complains about inappropriate defaults (#2892), but I fear more of those issues will pop up, if we open the floodgates a little wider by supporting additional git hosting platforms.

That being said, nf-core download is in dire need of a full rewrite, so if you, @stevekm, feel like getting invested into that, I am happy to provide guidance, review PRs and help out.

The coarse roadmap is, that downloading pipelines and associated containers should become separate functions #2408. The pipeline download would then be harmonized to use only the WorkflowRepo class, which increases the flexibility to support other Git hosting platforms as well as private and local repos (#2406/#2407). Tidying up the superclass SyncedRepo (#2940) is a paramount prerequisite to handle this reliably and cleanly with as little code duplication as possible. Once this is done, both the classic and the --platform downloads will in the background be channeled through this class. Instead of downloading the zip files of the releases, the repo will be checkout out for each revision into a temporary folder and after performing the config edits copied to the output directory.

Getting the separate container download to work will be more difficult, because a WorkflowRepo obviously contains all revisions and commits, but the information, which of them are relevant, is not readily available anymore. At the moment, the script downloads all container images that it can find in the specified revisions, so regardless of the chosen parameters, the image will be available. nextflow inspect, in contrast, requires all the configs and parameters to run. If you download n revisions of a pipeline, you might need to provide n revision-specific param files because of changing parameters. Because of this complexity, I am inclined to preserve the current functionality as a fallback option (but move the code over to the new CLI subcommand). In either way, we need to look into supporting Docker images (#2309), improving Singularity/Apptainer support (#1320, #2020,#2357), making it to play nicely with the new Wave community registry.

All of this would be nice to have, but I basically only get round to develop features that we as a facility need to have for our egoistic purposes (#2938). Anything that we will never use ourselves unfortunately has a relatively low priority. Writing a pipeline might at least result in a paper and a few citations ✌😏.

@edmundmiller edmundmiller added the download nf-core download label Jun 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
download nf-core download enhancement
Projects
None yet
Development

No branches or pull requests

5 participants