Add subsection on spruce case study (DRAFT) #176

Merged (5 commits) on Nov 18, 2024

Conversation

percyfal
Contributor

Addresses #169.

I have added a subsection with text and a table on the spruce case study, modeled on the Genomics England text. I'm currently working on converting the coordinate system, as a demonstration of one of the advantages that Zarr provides compared to VCF (the coordinate overflow problem). I added some comments in the text about the reported file and chunk sizes, as I'm not always sure which conventions you have used (e.g., du -hs, or summing up the stats in the inspect files?).
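To make the question about size conventions concrete: GNU du reports allocated disk blocks by default, while summing per-file byte counts gives the apparent size, and the two can differ noticeably for a store made of many small chunk files. Below is a minimal sketch of both measures. The function name `directory_sizes` and the throwaway demo directory are illustrative assumptions, not part of bio2zarr or of the notebooks discussed here.

```python
import os
import tempfile


def directory_sizes(root):
    """Return (apparent_bytes, allocated_bytes) for all files under root.

    apparent_bytes sums the byte length of every file; allocated_bytes sums
    the space the filesystem reserved for them, which is closer to what a
    plain `du -s` reports (du also counts the directories themselves).
    """
    apparent = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size
            # st_blocks is in 512-byte units on POSIX; fall back to st_size
            # on platforms that do not expose it.
            blocks = getattr(st, "st_blocks", None)
            allocated += blocks * 512 if blocks is not None else st.st_size
    return apparent, allocated


# Demo on a throwaway directory holding ten small "chunk" files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(10):
        with open(os.path.join(tmp, f"chunk_{i}"), "wb") as fh:
            fh.write(b"x" * 100)
    apparent, allocated = directory_sizes(tmp)

print("apparent bytes:", apparent)
print("allocated bytes:", allocated)
```

The allocated figure depends on the filesystem's block size, so it is worth stating which of the two measures a table uses.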

@jeromekelleher is this too verbose, or is it what you had in mind?

@jeromekelleher
Contributor

Looks great, thanks! A few quick notes (I'm afk)

  • can you include a notebook that stores the raw data for the table? We want to store the output of the inspect commands for later use. There should be a notebook on the gel data you can follow in the notebooks dir

  • can you include a reference for the species information stated in the first paragraph please?

Your storage is dominated by PL, so we should look at using local alleles to see whether that reduces it much. This needs the dev version of bio2zarr.

@jeromekelleher
Contributor

Why only store PL and not other call fields? I would have thought you'd drop PL and keep the others as it's huge and not especially useful. We'd need to explain that point a bit more.

@percyfal
Contributor Author

> Why only store PL and not other call fields? I would have thought you'd drop PL and keep the others as it's huge and not especially useful. We'd need to explain that point a bit more.

Well, it was my colleague who did the variant calling. I think the main reason was to use the PL fields for downstream analyses with ANGSD, which was the goal at the time.

@percyfal
Contributor Author

> Looks great, thanks! A few quick notes (I'm afk)
>
> * can you include a notebook that stores the raw data for the table? We want to store the output of the inspect commands for later use. There should be a notebook on the gel data you can follow in the notebooks dir

Will do! On that note, is there an option to output vcf2zarr inspect results as csv? Parsing the text output with, e.g., R's read.table(sep="\t") does not work, or am I missing something?
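As a stopgap for the parsing problem, here is a minimal sketch that converts a column-aligned text table to CSV. It assumes columns are separated by runs of two or more spaces, with an optional dashed rule under the header. That assumption has not been checked against the real vcf2zarr inspect output. The sample text, its column names and values, and both function names are hypothetical.

```python
import csv
import io
import re

# Hypothetical sample; the real `vcf2zarr inspect` layout may differ.
SAMPLE = """\
name        dtype    stored    chunks
----------  -------  --------  --------
call_PL     int32    1.5 GiB   120
variant_id  object   20 MiB    12
"""


def aligned_text_to_rows(text):
    """Split a column-aligned text table into a list of rows of cells."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and rules made only of dashes and spaces.
        if not stripped or set(stripped) <= {"-", " "}:
            continue
        # Treat runs of two or more whitespace characters as the separator,
        # so values such as "1.5 GiB" stay in one cell.
        rows.append(re.split(r"\s{2,}", stripped))
    return rows


def rows_to_csv(rows):
    """Serialise rows to CSV text using the standard csv module."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


rows = aligned_text_to_rows(SAMPLE)
csv_text = rows_to_csv(rows)
print(csv_text)
```

This splitting rule breaks if a cell is empty or if two columns are separated by a single space, so a fixed-width reader may be the safer choice for such tables.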

> * can you include a reference for the species information stated in the first paragraph please?

Yes.

> Your storage is dominated by pl, so we should look at using local alleles to see if that reduces much. This needs dev version of bio2zarr

Are you referring to sgkit-dev/bio2zarr#285 here?

@percyfal
Contributor Author

Ok, now I see it, the inspect file is generated in the notebook!

@jeromekelleher
Contributor

Looks good, I'm happy to merge and iterate if you are?

@percyfal
Contributor Author

Great! I was looking into the coordinate conversion issue today, hopefully I can have it done come Wednesday. I was thinking I'd add a paragraph and a notebook to describe the process briefly.

@percyfal marked this pull request as ready for review November 18, 2024 16:22
Details are now taken from notebooks
@jeromekelleher merged commit 1157fec into sgkit-dev:main Nov 18, 2024
1 check passed
@percyfal deleted the spruce-case-study branch November 21, 2024 09:31