-
Notifications
You must be signed in to change notification settings - Fork 67
Define most focal units of recurrent CNVs #644
Define most focal units of recurrent CNVs #644
Conversation
- add back `left_join` where it was replaced with `inner_join`in `03` nb and re-rerun nb
- the genes in the gene level status data.frame are likely not in the subset testing files, therefore causing circleCI tests to fails (comment out this section for now)
…into define-most-focal-units-cnvs
…into define-most-focal-units-cnvs
- rerun nb - save final table to file
- uncomment the previously commented out steps in shell script - rerun nb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbethell , I really appreciate your nice explanations in your PR! This made it easy for me to follow what the goal was. I still need to go over some of these tidy
manipulations with a finer tooth comb, but I wanna give you some of my first big picture thoughts now so we can discuss it.
My main question is about how we make the calls. How close is it, what does the percent of each status look like for each call? e.g. if we call something gain
is it ever the case that the percent of gain
is only a hair larger than loss
? To ask these questions you may want to calculate percentages first, and then in a separate step where you make the call.
I think we may need to restructure that particular case_when
but I have to think about what the best logic is for calling that. May want others to weigh in on that.
You could draft up some quick and dirty plots that shows the percent of each status and separate on the x-axis what the final call was. Ideally if we are calling things right, we should see that there are hardly ever "ties" but there's a clear differentiation between gains and losses.
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
I will implement these suggestions in the upcoming commits and post an updated comment here once I've summarized the findings. |
- functionalize the calculating of coverage percentage values in each given region and the logic determining the dominant status - tweak logic determining dominant status (will need further tweaking) - keep status coverage percentage values in final table - add plots to look at dominant status versus the percentage values for each status in each region - rerun nb
@cansavvy here is the rendered notebook of my first go at implementing the suggestions you made above. You can let me know if that's what you were thinking the plots should look like or if I should tweak them (or if I should tweak anything else at this point). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @cbethell , I think this is coming along! I have a few more questions and comments. I think the main function could use a bit of streamlining and DRYing up, and I have some questions about what the final table should look like.
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
percent_neutral = neutral / (gain + loss + neutral), | ||
dominant_status = case_when( | ||
percent_gain > 0.5 & | ||
percent_gain > 1.5 * percent_loss ~ "gain", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain these logic statements? I'm not saying they aren't what we want, but I wanna understand the changes you made here and what you are intending.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am sure that I probably went about implementing this logic is the least efficient way (it's probably not doing what I intended), but the idea here is that we want the status percentage to be responsible for more than 50% of calls in that region and to be greater than 1.5 times the alternate status call.
This was an attempt to account for a loss/gain dominant status call being made when there is just slim difference between calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. Okay. Well as a more general statement, I think whatever the logic is that we have for deciding majority status we should probably these methods pretty upfront laid out in documentation (probably at the top of the notebook and READMEs as well as comment). Not only what the methods are for the majority status call decisions but also what was our reasoning behind that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, will do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That being said, I think we are getting closer, but there are two things that we may want to think about when calling majority status (and these are related).
- Not calling a region if too much of it is uncallable
- Not calling a region if the absolute number of calls are too small (related to 1) e.g. 3 calls vs 2 calls would technically be greater than 50% but we probably don't have enough data to make that call.
But lastly, I'm wondering whether we need a > 0.5 cutoff in addition to a 1.5x cutoff or if raising the threshold of the percent cutoff might establish the same thing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But lastly, I'm wondering whether we need a > 0.5 cutoff in addition to a 1.5x cutoff or if raising the threshold of the percent cutoff might establish the same thing?
Good question, I'll test this out and include my findings in the notebook (in addition to the points you made above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need do an entire side analysis but, my point is just that whatever decision logic we come up with we should make sure we are clear about what it is and why we think this gives us a reasonable answer.
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
|
||
```{r} | ||
# Filter the annotated CN data to include only these cytobands | ||
consensus_seg_autosomes_df_develop <- consensus_seg_autosomes_df %>% |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you are reusing the name consensus_seg_autosomes_df_develop
a few times. This is a tad hard to follow. Can you make them different things? We should probably see if we can streamline this some, or add this formatting step to the main function.
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
add the `percent_uncallable` and `percent_unstable` labels where applicable - increase the threshold and add more explanation as to why - make else statement more specific - convert plots to jitter plots - add final focal_call column to final table - add suffixes when joining to create final table - rerun nb
- update documentation in notebook and `README` file to make note of decisions around cutoffs
OpenPBTA-analysis/analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd Lines 253 to 259 in 1e89da2
The OpenPBTA-analysis/analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd Lines 89 to 102 in 1e89da2
you are getting the count of the number of genes in that region. At least that was how I was able to read it. |
I think changing to |
Yes. I think you are right. I imagine it might be more exact to move to a percent length based assessment? Or is that not necessary and using weighted means is sufficient? |
Weighted mean of call fraction and length will give the correct fractions for the total. |
Co-Authored-By: jashapiro <[email protected]>
- have `status_majority_caller` function calculate weighted means and define dominant status based on these values - comment out the plotting steps (for development purposes) because these now have to change now that the previous function's logic changed - rerun nb - rename `percentages` as `fractions`
…into define-most-focal-units-cnvs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are getting there. I have suggestions now about changing the filtering for the sub-arm calls: the current filtering is treating all samples the same way, and we really need to filter each sample separately. The easiest way to do this, I think is to use a join and then filter, rather than trying to get the filter list just from the prior table.
We will also need to decide what to do with the "amplification" calls at the gene level; I tend to think we should demote them all to "gain" for consistency, but maybe have a separate column that denotes what might have been called an amplification? Ideally, we would tie things back to the seg.mean call, but I think that is probably beyond the scope here.
I am also not quite sure what to do about the fact that the gene level calls were done with respect to ploidy, but the cytoband calls were not. We may have to revisit the cytoband calls to add this in.
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd
Outdated
Show resolved
Hide resolved
loss_fraction, | ||
gain_fraction, | ||
callable_fraction, | ||
dominant_status = dominant_cytoband_status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are recalling the dominant_status
below, we can probably just drop this column here (or even earlier).
neutral_cytobands <- cytoband_status_count %>% | ||
filter(dominant_status %in% c("unstable", "neutral")) %>% | ||
dplyr::pull("cytoband") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly to above, I think we want to do this via a left_join
and comparisons within each row. The reason being we need to preserve the relationship between samples and cytobands, not just keep all cytobands with at least one neutral call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that when we join these genes to cytobands, some of the cytoband columns in consensus_seg_cytoband_status_df
have more than one cytoband as genes can cross boundaries. We will need to add logic to fill those in as possible.
We will also have to be careful here to be sure the logic flows correctly, as we don't want to make mistakes in filtering at this point because of any filtering we might have done at the cytoband level! That is, if we filtered a cytoband because it was the same call as the arm that contains it, we still want to filter out the genes within that cytoband that have the same call!
if (!(running_in_ci)) { | ||
# Now count the gene-level calls (for each status call) and define | ||
# the gene as that status if more than 50% of the total counts are | ||
# for that particular status | ||
gene_status_count <- status_majority_caller( | ||
combined_neutral_status_cytoband_df, | ||
"gene_symbol", | ||
status_threshold | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what this is meant to do. Each sample should only have one call for each gene, right?
- remove `status_majority_caller` function and tailor to the arm status data frame - change filtering logic to include cytoband ranges - change logic joining gene-level status calls - remove unnecessary `group_by` statements - change logic around filtering on arms and cytobands - use more unqiue `dominant_status` columns when handling each region - rerun nb
- propagate change to shell script - rerun nb
@jashapiro, I believe I addressed your most recent review comments here in this commit, with the exception of:
Please let me know if I misinterpreted any of your comments or suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! I think I recreated most of the comments that github lost, but I might be back with more later.
The biggest thing now I think is the logic for the gene level calls, which might get a bit strange. I think the approach I would take with it is to make a data frame with all of the status calls from gene, band, and arm, and look at how those break down. Do we see places where arm and gene agree but band is different? If so, should that gene be included as focal, or not?
dominant_cytoband_status = case_when( | ||
callable_fraction.x < (1 - uncallable_threshold) ~ "uncallable", | ||
loss_fraction.x > status_threshold ~ "loss", | ||
gain_fraction.x > status_threshold ~ "gain", | ||
loss_fraction.x + gain_fraction.x > status_threshold ~ "unstable", | ||
TRUE ~ "neutral" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets do this recalling first (before the join) and save the results. The reason is that we will be these calls again later (when making the gene level calls), so they should be consistent between full data set and the filtered one.
# Now create a data.frame with the gene-level status calls for the | ||
# neutral, uncallable, and unstable cytoband-level calls | ||
consensus_seg_gene_status <- consensus_seg_autosomes_df %>% | ||
# Create a cytoband column that keeps only the first half of the ranges | ||
# for joining purposes | ||
mutate(cytoband_no_ranges = gsub("-.*", "\\1", cytoband)) %>% | ||
left_join( | ||
consensus_seg_cytoband_status_df, | ||
by = c( | ||
"biospecimen_id" = "Kids_First_Biospecimen_ID", | ||
"cytoband_no_ranges" = "cytoband" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mutate()
statement you had here could have resulted in some mistakes/information loss, so I am replacing it here with a way to keep all the bands. The result may be some duplicated genes down the line, but I think we can handle that later.
# Now create a data.frame with the gene-level status calls for the | |
# neutral, uncallable, and unstable cytoband-level calls | |
consensus_seg_gene_status <- consensus_seg_autosomes_df %>% | |
# Create a cytoband column that keeps only the first half of the ranges | |
# for joining purposes | |
mutate(cytoband_no_ranges = gsub("-.*", "\\1", cytoband)) %>% | |
left_join( | |
consensus_seg_cytoband_status_df, | |
by = c( | |
"biospecimen_id" = "Kids_First_Biospecimen_ID", | |
"cytoband_no_ranges" = "cytoband" | |
) | |
# Create separate rows for genes span multiple cytobands | |
# CAUTION: This will require addition of a distinct() statment later to resolve duplicates | |
# Filtering to only multiband genes first because regexes are slow. | |
multi_band_genes <- consensus_seg_autosomes_df %>% | |
filter(grepl("-", cytoband)) %>% | |
extract(cytoband, into = c("chrom", "band"), regex = "([0-9]+)(.+)") %>% # make chrom and band cols | |
separate_rows(band, sep = "-") %>% # duplicate rows with more than one band | |
unite(cytoband, chrom, band, sep = "") # rejoin chrom and band | |
single_band_genes <- consensus_seg_autosomes_df %>% | |
filter(!grepl("-", cytoband)) | |
gene_df <- bind_rows(single_band_genes, multi_band_genes) | |
# Now create a data.frame with the gene-level status calls for the | |
# neutral, uncallable, and unstable cytoband-level calls | |
consensus_seg_gene_status <- gene_df %>% | |
left_join( | |
consensus_seg_cytoband_status_df, | |
by = c( | |
"biospecimen_id" = "Kids_First_Biospecimen_ID", | |
"cytoband" | |
) |
dominant_cytoband_status %in% c("neutral", "uncallable", "unstable") | | ||
(status != dominant_cytoband_status & | ||
# bands that disagree with arm, but are not neutral (or uncallable) | ||
!(status %in% c("neutral", "uncallable", "unstable") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are no neutral calls at the gene level, so I think we can skip this part of the filter.
# Filter the annotated CN data to include only neutral cytobands and disagreements | ||
filter( | ||
dominant_cytoband_status %in% c("neutral", "uncallable", "unstable") | | ||
(status != dominant_cytoband_status & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will catch all of the places where the gene level call is "amplification" because we don't make that call at the cytoband level. I think the right thing to do is to change all "amplification" calls to "gain" for the purposes of this script and its output.
I also have some thought that we should also be looking at arm-level calls at this point. I can imagine a case where an arm is a gain and a cytoband within it is neutral and then a gene within that band is a gain. We would not be including the cytoband in the final set (on the assumption it is a false negative), but by the current logic we would include the individual gene, but I don't think we should. The logic here may get a bit messy, but I think start is to be sure to join in the arm level calls as well, so we have both dominant_cytoband_status
and dominant_arm_status
to look at before making decisions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the most recent commit, I recoded the "amplification" calls to be "gain" calls as you suggested here.
I also believe that I implemented the logic of including individual gene calls using both the dominant_cytoband_status
and dominant_arm_status
data.
Again, please let me know if I misinterpreted this comment in the way I re-formatted the code.
loss_fraction, | ||
gain_fraction, | ||
callable_fraction, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are cytoband level info, so we probably don't want these in this gene-level output.
gain_fraction, | ||
callable_fraction, | ||
status | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the distinct()
we need because of the band separation into multiple rows can go right here.
- reformat logic around creating the gene-level status calls data.frame (@jashapiro's review comment) - distinguish the filtered data.frame objects - remove unnecessary filtering out of neutral gene-level calls and include chromosome arm calls in logic determining which gene-level calls to filter out - re-rerun nb
@jashapiro, this is ready for another look 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good!
I think I figured out the logic for the gene-level calls, but I do want some more eyes on that to see if I got it right.
Looking at the results, it seems like there are still regions where there are a lot of regions where many consecutive gene level calls are not reflected in the higher level calls; I suspect that is due to the fact that gene level were made with respect to ploidy, which the larger calls were not. In a separate analysis we will need to figure out how to properly account for that difference in approach.
### Determine chromosome arm status | ||
|
||
```{r} | ||
# Use our custom function to make the arm status calls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Use our custom function to make the arm status calls |
loss_fraction = weighted.mean(loss_fraction, band_length), | ||
gain_fraction = weighted.mean(gain_fraction, band_length), | ||
callable_fraction = weighted.mean(callable_fraction, band_length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A suggestion to save some confusion later.
loss_fraction = weighted.mean(loss_fraction, band_length), | |
gain_fraction = weighted.mean(gain_fraction, band_length), | |
callable_fraction = weighted.mean(callable_fraction, band_length) | |
loss_fraction_arm = weighted.mean(loss_fraction, band_length), | |
gain_fraction_arm = weighted.mean(gain_fraction, band_length), | |
callable_fraction_arm = weighted.mean(callable_fraction, band_length) |
callable_fraction < (1 - uncallable_threshold) ~ "uncallable", | ||
loss_fraction > status_threshold ~ "loss", | ||
gain_fraction > status_threshold ~ "gain", | ||
loss_fraction + gain_fraction > status_threshold ~ "unstable", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following on previous suggestion.
callable_fraction < (1 - uncallable_threshold) ~ "uncallable", | |
loss_fraction > status_threshold ~ "loss", | |
gain_fraction > status_threshold ~ "gain", | |
loss_fraction + gain_fraction > status_threshold ~ "unstable", | |
callable_fraction_arm < (1 - uncallable_threshold) ~ "uncallable", | |
loss_fraction_arm > status_threshold ~ "loss", | |
gain_fraction_arm > status_threshold ~ "gain", | |
loss_fraction_arm + gain_fraction_arm > status_threshold ~ "unstable", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if in an example situation where one status is >> than the other whether unstable is the right call. For example, if loss_fraction_arm
is 0.89 and gain_fraction_arm
is 0.01 whether this should be called as unstable
but I guess we could be asking these kinds of questions all day and maybe they don't actually happen often if at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and also yes. I think this is a place where refinement could happen, but I don't know how much of a difference it will make in final conclusions. If we were to refine it, we would probablyu need to add another threshold for unstable, but I don't think we would want to call a region with 80% gain and 10% loss as "neutral". In some sense, if all this does is catch a few things that fall just below threshold, that might be a good thing. Note that the consequences for the final table are the same: neutral and unstable are treated essentially the same way when looking at smaller regions/genes.
loss_fraction = loss_fraction.x, | ||
gain_fraction = gain_fraction.x, | ||
callable_fraction = callable_fraction.x, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you took my *_fraction_arm
suggestion above, you won't need to rename anymore!
loss_fraction = loss_fraction.x, | |
gain_fraction = gain_fraction.x, | |
callable_fraction = callable_fraction.x, | |
loss_fraction, | |
gain_fraction, | |
callable_fraction, |
filter( | ||
dominant_arm_status %in% c("neutral", "uncallable", "unstable") | | ||
dominant_cytoband_status %in% c("neutral", "uncallable", "unstable") | | ||
status != dominant_arm_status | | ||
status != dominant_cytoband_status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the critical bit, and the logic is tough! I think we need some &
s in here and parenthesis... I'm going to try to figure it out, with annotation. I would love another set of eyes on this (as many sets as possible?)
filter( | |
dominant_arm_status %in% c("neutral", "uncallable", "unstable") | | |
dominant_cytoband_status %in% c("neutral", "uncallable", "unstable") | | |
status != dominant_arm_status | | |
status != dominant_cytoband_status | |
filter( | |
# Case 1) Gene call disagrees with both arm and cytoband (This captures most | |
# we want to keep, including all of when arm and cytoband are neutral, uncallable | |
# or unstable, since we have no neutral gene calls in this df): | |
(status != dominant_arm_status & status != dominant_cytoband_status) | |
# Case 2) Gene call disagrees with a non-neutral cytoband call. | |
# Keep no matter what the arm status | |
| (status != dominant_cytoband_status | |
& dominant_cytoband_status %in% c("gain", "loss")) | |
# I think that captures everything we want. Cases we don't want include: | |
# gene & arm agree, but cytoband is neutral | |
# gene & cytoband agree | |
# all 3 agree |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this logic makes much more sense to me, thank you for the added comments here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those comments were not just for you! They were for me as I was writing it. Not even me tomorrow; me today!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking my understanding here - we do have neutrals for cytoband and arm. @jashapiro can you tell me a little bit more about
# gene & arm agree, but cytoband is neutral
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct, we have neutrals for cytoband and arm. But if we call the whole arm as loss (for example), but the cytoband is called neutral, the assumption I was making is that the cytoband is probably a false negative, and we would already have the arm that "covers" all of the genes in this region, so an individual gene that is lost in that cytoband would not be worth calling out as "focal".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jashapiro ! This logic seems correct to me.
- add "_arm" to the status fraction columns for consensus_seg_arm_status object - adjust filtering logic at the gene level - rerun nb
I implemented the latest of @jashapiro's suggestions in the last commit, including the logic mentioned in this comment. I ran it with the logic in that comment and produced a new output file for folks to take a look at when (or if) thinking about how this logic may or may not need to be adjusted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me!
My only suggestion is really a forward-looking one that we might want to retain region start and end positions for each region in the final table to make later analysis/plots a bit easier. I do not think that change should be in this PR, but might be noted as a potential future enhancement.
select(Kids_First_Biospecimen_ID, | ||
status = dominant_status, | ||
region, | ||
region_type) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not critical, but it might be worth sorting by ID, just to keep things in a somewhat reasonable order.
If we want to do even better, we could merge back in start and end positions for each region? Doing so might save a lot of redundant work in later scripts.
region_type) | |
region_type) %>% | |
arrange(Kids_First_Biospecimen_ID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha, added this change in the last commit.
@jashapiro In other words, in a future PR, you may imagine getting the chromosomal start and end regions from the consensus seg file? |
I imagine we would get them from the original annotation files that define arms, bands, and genes, but yes. If that information would be broadly used, it might make sense to include it here, rather than requiring those annotations being repeatedly added back. But I don't think it shoudl be done now. Once we figure out why CI is failing, this should go in as it is. |
Ahh okay, understood 👍 I will work on getting that ready in a separate PR. |
I would work on other things first as I would view this as very low priority at this point. I really just wanted to document the thought. |
Purpose/implementation Section
The purpose of this PR is to define the most focal units of recurrent CNVs.
What scientific question is your analysis addressing?
Related to PR #485, this PR is working to compile a table/list of the most focal units of recurrent CNVs, most focal meaning:
This PR is working to compile this information into one table.
What was your approach?
Using the file generated in PR #617:
What GitHub issue does your pull request address?
Related: #486
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
The
running_in_ci
flag was placed around the gene level steps because the testing subset files throw an error that a certain status is not found (as the subset files do not have each status at least once at the gene level after filtering out arms and cytobands that are non-neutral).Is there possibly a more elegant solution here?
The threshold used to determine the most dominant status for each level should also receive a closer look.
Is there anything that you want to discuss further?
Note that in this PR I also replace an
inner_join
in the03-add-cytoband-status.Rmd
notebook, that was unintentially added in this commit of PR #617 withleft_join
(as was originally implemented and approved) -- @jashapiro you may want to take a look at this and the resulting file to make sure its still on par with what you approved.Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes, this is ready for review.
Results
What types of results are included (e.g., table, figure)?
A final table of results can be found at
results/consensus_seg_most_focal_cn_status.tsv.gz
What is your summary of the results?
The rendered html output of the notebook in this PR as it is currently can be seen here.
Reproducibility Checklist
Documentation Checklist
README
and it is up to date.analyses/README.md
and the entry is up to date.