Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Now able to specify filenames of downloads #84

Closed
wants to merge 2 commits into from

Conversation

sbthandras
Copy link

No description provided.

@sbthandras sbthandras marked this pull request as draft July 26, 2024 14:45
@sbthandras sbthandras marked this pull request as ready for review July 29, 2024 14:40
@sbthandras
Copy link
Author

I can confirm that this works now with the modifications, as intended (downloading to the filename specified)

@stitam
Copy link
Owner

stitam commented Jul 29, 2024

Thanks @sbthandras for opening this PR!

While I understand it would be great to be able to give the files e.g. simpler names, I wonder if this would introduce too much complexity for the function? Apart from simplicity, I'm a bit worried about the errors this could introduce:

  • If we want to download multiple genomes, we have to provide multiple names (same as the number of UIDs). It seems this is not yet working, the function currently requires a single name and then only the first genome will be downloaded. Afterwards the file will already exist so subsequent genomes will be skipped.
  • When downloading multiple files I'm quite concerned if we can guarantee the files and their names will never get mixed up. Do you think this can be guaranteed?
  • It may happen that e.g. we choose to download .fna files but then in the filename we accidentally specify .faa and then we won't know why our scripts downstream are not working until we find the root of the problem.

I'm OK with adding this new argument, but I wonder if it would be simpler to just download the files using this function and then rename them separately? Not sure, what do you think?

(The tests fail due to some other reason. This is some new stuff, I hope to fix this soon)

@sbthandras
Copy link
Author

I thi

Thanks @sbthandras for opening this PR!

While I understand it would be great to be able to give the files e.g. simpler names, I wonder if this would introduce too much complexity for the function? Apart from simplicity, I'm a bit worried about the errors this could introduce:

  • If we want to download multiple genomes, we have to provide multiple names (same as the number of UIDs). It seems this is not yet working, the function currently requires a single name and then only the first genome will be downloaded. Afterwards the file will already exist so subsequent genomes will be skipped.
  • When downloading multiple files I'm quite concerned if we can guarantee the files and their names will never get mixed up. Do you think this can be guaranteed?
  • It may happen that e.g. we choose to download .fna files but then in the filename we accidentally specify .faa and then we won't know why our scripts downstream are not working until we find the root of the problem.

I'm OK with adding this new argument, but I wonder if it would be simpler to just download the files using this function and then rename them separately? Not sure, what do you think?

(The tests fail due to some other reason. This is some new stuff, I hope to fix this soon)

I think these are all valid points, I myself was thinking a lot about the last one. Especially that a user might mistakenly set an unzipped extension for a compressed file. The reason I added it, is because I don't like the idea of the identity of the files being 'lost' when I'm downloading 'genomic.fna'. I assume that if I was downloading a more compelx file format then I could parse that and retrieve the information about what genome it is exactly, but now when I download a set of accessions the file names are not regular at all but come with different suffixes. I assume, based on the UID the function should be able to find an accession name? How about an option that forces the downloaded file to be ACCESSION.fna?

@stitam
Copy link
Owner

stitam commented Jul 29, 2024

OK, so if I download e.g. GCF_003007635.1 the filename will be GCF_003007635.1_ASM300763v1_genomic.fna.gz by default and instead of this you would want to have GCF_003007635.1.fna.gz, right? I think this is possible, but it may be simpler to add an argument that specifies whether the filenames should contain 1. only the assembly, 2. only the name or 3. both (default) and then assemble the final filename using the FTP filename (which contains both). Would this solve the issue?

@stitam stitam closed this Sep 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants