Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Define most focal units of recurrent CNVs #644

Merged
merged 50 commits into from
May 15, 2020

Conversation

cbethell
Copy link
Contributor

@cbethell cbethell commented Mar 23, 2020

Purpose/implementation Section

The purpose of this PR is to define the most focal units of recurrent CNVs.

What scientific question is your analysis addressing?

Related to PR #485, this PR is working to compile a table/list of the most focal units of recurrent CNVs, most focal meaning:

  • If a chromosome arm is not clearly defined as a gain or loss (and is callable) we look to define the cytoband level status
  • If a cytoband is not clearly defined as a gain or loss (and is callable) we then look to define the gene level status

This PR is working to compile this information into one table.

What was your approach?

Using the file generated in PR #617:

  • The chromosome arm statuses are determined by counting the amount of instances of each call on each chromosome arm and defining the arm status based on the status call that is responsible for more than 90% of the total arm calls
  • If a chromosome arm is not clearly defined as a gain or loss (and is more than 50% callable), look to define the cytoband level status
  • If a cytoband is not clearly defined as a gain or loss (and is more than 50% callable), look to define the gene level status
  • Compile this information into one table

What GitHub issue does your pull request address?

Related: #486

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

  • The running_in_ci flag was placed around the gene level steps because the testing subset files throw an error that a certain status is not found (as the subset files do not have each status at least once at the gene level after filtering out arms and cytobands that are non-neutral).
    Is there possibly a more elegant solution here?

  • The threshold used to determine the most dominant status for each level should also receive a closer look.

Is there anything that you want to discuss further?

Note that in this PR I also replace an inner_join in the 03-add-cytoband-status.Rmd notebook, that was unintentially added in this commit of PR #617 with left_join (as was originally implemented and approved) -- @jashapiro you may want to take a look at this and the resulting file to make sure its still on par with what you approved.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes, this is ready for review.

Results

What types of results are included (e.g., table, figure)?

A final table of results can be found at results/consensus_seg_most_focal_cn_status.tsv.gz

What is your summary of the results?

The rendered html output of the notebook in this PR as it is currently can be seen here.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

cbethell and others added 11 commits March 23, 2020 10:44
- add back `left_join` where it was replaced with `inner_join`in `03` nb and re-rerun nb
- the genes in the gene level status data.frame are likely not in the subset testing files, therefore causing circleCI tests to fails (comment out this section for now)
- rerun nb
- save final table to file
- uncomment the previously commented out steps in shell script
- rerun nb
@cbethell cbethell marked this pull request as ready for review March 24, 2020 19:27
@cbethell cbethell changed the title WIP: Define most focal units of recurrent CNVs Define most focal units of recurrent CNVs Mar 24, 2020
@cbethell cbethell requested a review from cansavvy March 24, 2020 19:31
Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cbethell , I really appreciate your nice explanations in your PR! This made it easy for me to follow what the goal was. I still need to go over some of these tidy manipulations with a finer tooth comb, but I wanna give you some of my first big picture thoughts now so we can discuss it.

My main question is about how we make the calls. How close is it, what does the percent of each status look like for each call? e.g. if we call something gain is it ever the case that the percent of gain is only a hair larger than loss? To ask these questions you may want to calculate percentages first, and then in a separate step where you make the call.

I think we may need to restructure that particular case_when but I have to think about what the best logic is for calling that. May want others to weigh in on that.

You could draft up some quick and dirty plots that shows the percent of each status and separate on the x-axis what the final call was. Ideally if we are calling things right, we should see that there are hardly ever "ties" but there's a clear differentiation between gains and losses.

@cbethell
Copy link
Contributor Author

To ask these questions you may want to calculate percentages first, and then in a separate step where you make the call.
You could draft up some quick and dirty plots that shows the percent of each status and separate on the x-axis what the final call was. Ideally if we are calling things right, we should see that there are hardly ever "ties" but there's a clear differentiation between gains and losses.

I will implement these suggestions in the upcoming commits and post an updated comment here once I've summarized the findings.

- functionalize the calculating of coverage percentage values in each given region and the logic determining the dominant status
- tweak logic determining dominant status (will need further tweaking)
- keep status coverage percentage values in final table 
- add plots to look at dominant status versus the percentage values for each status in each region
- rerun nb
@cbethell
Copy link
Contributor Author

@cansavvy here is the rendered notebook of my first go at implementing the suggestions you made above.

You can let me know if that's what you were thinking the plots should look like or if I should tweak them (or if I should tweak anything else at this point).

Copy link
Collaborator

@cansavvy cansavvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cbethell , I think this is coming along! I have a few more questions and comments. I think the main function could use a bit of streamlining and DRYing up, and I have some questions about what the final table should look like.

percent_neutral = neutral / (gain + loss + neutral),
dominant_status = case_when(
percent_gain > 0.5 &
percent_gain > 1.5 * percent_loss ~ "gain",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain these logic statements? I'm not saying they aren't what we want, but I wanna understand the changes you made here and what you are intending.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am sure that I probably went about implementing this logic is the least efficient way (it's probably not doing what I intended), but the idea here is that we want the status percentage to be responsible for more than 50% of calls in that region and to be greater than 1.5 times the alternate status call.

This was an attempt to account for a loss/gain dominant status call being made when there is just slim difference between calls.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Okay. Well as a more general statement, I think whatever the logic is that we have for deciding majority status we should probably these methods pretty upfront laid out in documentation (probably at the top of the notebook and READMEs as well as comment). Not only what the methods are for the majority status call decisions but also what was our reasoning behind that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, will do.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That being said, I think we are getting closer, but there are two things that we may want to think about when calling majority status (and these are related).

  1. Not calling a region if too much of it is uncallable
  2. Not calling a region if the absolute number of calls are too small (related to 1) e.g. 3 calls vs 2 calls would technically be greater than 50% but we probably don't have enough data to make that call.

But lastly, I'm wondering whether we need a > 0.5 cutoff in addition to a 1.5x cutoff or if raising the threshold of the percent cutoff might establish the same thing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But lastly, I'm wondering whether we need a > 0.5 cutoff in addition to a 1.5x cutoff or if raising the threshold of the percent cutoff might establish the same thing?

Good question, I'll test this out and include my findings in the notebook (in addition to the points you made above).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need do an entire side analysis but, my point is just that whatever decision logic we come up with we should make sure we are clear about what it is and why we think this gives us a reasonable answer.


```{r}
# Filter the annotated CN data to include only these cytobands
consensus_seg_autosomes_df_develop <- consensus_seg_autosomes_df %>%
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you are reusing the name consensus_seg_autosomes_df_developa few times. This is a tad hard to follow. Can you make them different things? We should probably see if we can streamline this some, or add this formatting step to the main function.

cbethell and others added 5 commits March 27, 2020 08:22
add the `percent_uncallable` and `percent_unstable` labels where applicable
- increase the threshold and add more explanation as to why
- make else statement more specific
- convert plots to jitter plots
- add final focal_call column to final table
- add suffixes when joining to create final table
- rerun nb
- update documentation in notebook and `README` file to make note of decisions around cutoffs
@jashapiro
Copy link
Member

Right now, it appears that calculations are being done based on the gene-level calls, with counts translated up to arms, but we have data for the cytobands which seems like a much more natural place to look for these high-level calls.

Do you think you could point out in the code what you are referring to specifically?

arm_status_count <- status_majority_caller(
combined_status_df,
"chromosome_arm",
"status",
"biospecimen_id",
percent_threshold
)

The status column is coming from the gene-level calls, right? The combined_status_df table has one row for every specimen/gene combination, so when you count up calls for a region here:

status_df <- status_df %>%
# Keep only the columns we need
select(!!id_sym, !!region_sym, !!status_sym) %>%
# Group by status, sample, and region
group_by(!!id_sym, !!region_sym, !!status_sym) %>%
# Get a total count of each s
summarize(status_n = n())
# Getting total counts by region
region_total_df <- status_df %>%
# Group by region and for each sample
group_by(!!region_sym, !!id_sym) %>%
# Get totals
summarize(total = sum(status_n))

you are getting the count of the number of genes in that region.

At least that was how I was able to read it.

@jashapiro
Copy link
Member

One disadvantage compared to your current approach is that it loses the "amplification" calls...

Should the amplifications calls be recoded to gains and this is resolved? Or is this something else that needs to be done?

I think changing to gain is probably the easiest thing here, as it is consistent with other CN calling that has been done. It may be reasonable to come back and looks specifically for amplifications if we go back to where we have seg.means, but that might be a separate step.

@cansavvy
Copy link
Collaborator

cansavvy commented Apr 3, 2020

Yes. I think you are right. I imagine it might be more exact to move to a percent length based assessment? Or is that not necessary and using weighted means is sufficient?

@jashapiro
Copy link
Member

Yes. I think you are right. I imagine it might be more exact to move to a percent length based assessment? Or is that not necessary and using weighted means is sufficient?

Weighted mean of call fraction and length will give the correct fractions for the total.

cbethell and others added 3 commits April 3, 2020 19:38
- have `status_majority_caller` function calculate weighted means and define dominant status based on these values 
- comment out the plotting steps (for development purposes) because these now have to change now that the previous function's logic changed
- rerun nb
- rename `percentages` as `fractions`
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are getting there. I have suggestions now about changing the filtering for the sub-arm calls: the current filtering is treating all samples the same way, and we really need to filter each sample separately. The easiest way to do this, I think is to use a join and then filter, rather than trying to get the filter list just from the prior table.

We will also need to decide what to do with the "amplification" calls at the gene level; I tend to think we should demote them all to "gain" for consistency, but maybe have a separate column that denotes what might have been called an amplification? Ideally, we would tie things back to the seg.mean call, but I think that is probably beyond the scope here.

I am also not quite sure what to do about the fact that the gene level calls were done with respect to ploidy, but the cytoband calls were not. We may have to revisit the cytoband calls to add this in.

loss_fraction,
gain_fraction,
callable_fraction,
dominant_status = dominant_cytoband_status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are recalling the dominant_status below, we can probably just drop this column here (or even earlier).

Comment on lines 253 to 255
neutral_cytobands <- cytoband_status_count %>%
filter(dominant_status %in% c("unstable", "neutral")) %>%
dplyr::pull("cytoband")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to above, I think we want to do this via a left_join and comparisons within each row. The reason being we need to preserve the relationship between samples and cytobands, not just keep all cytobands with at least one neutral call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that when we join these genes to cytobands, some of the cytoband columns in consensus_seg_cytoband_status_df have more than one cytoband as genes can cross boundaries. We will need to add logic to fill those in as possible.

We will also have to be careful here to be sure the logic flows correctly, as we don't want to make mistakes in filtering at this point because of any filtering we might have done at the cytoband level! That is, if we filtered a cytoband because it was the same call as the arm that contains it, we still want to filter out the genes within that cytoband that have the same call!

Comment on lines 270 to 278
if (!(running_in_ci)) {
# Now count the gene-level calls (for each status call) and define
# the gene as that status if more than 50% of the total counts are
# for that particular status
gene_status_count <- status_majority_caller(
combined_neutral_status_cytoband_df,
"gene_symbol",
status_threshold
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this is meant to do. Each sample should only have one call for each gene, right?

- remove `status_majority_caller` function and tailor to the arm status data frame
- change filtering logic to include cytoband ranges
- change logic joining gene-level status calls 
- remove unnecessary `group_by` statements
- change logic around filtering on arms and cytobands
- use more unqiue `dominant_status` columns when handling each region
- rerun nb
- propagate change to shell script 
- rerun nb
@cbethell
Copy link
Contributor Author

cbethell commented Apr 6, 2020

@jashapiro, I believe I addressed your most recent review comments here in this commit, with the exception of:

We will also need to decide what to do with the "amplification" calls at the gene level; I tend to think we should demote them all to "gain" for consistency, but maybe have a separate column that denotes what might have been called an amplification? Ideally, we would tie things back to the seg.mean call, but I think that is probably beyond the scope here.

I am also not quite sure what to do about the fact that the gene level calls were done with respect to ploidy, but the cytoband calls were not. We may have to revisit the cytoband calls to add this in.

Please let me know if I misinterpreted any of your comments or suggestions.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! I think I recreated most of the comments that github lost, but I might be back with more later.

The biggest thing now I think is the logic for the gene level calls, which might get a bit strange. I think the approach I would take with it is to make a data frame with all of the status calls from gene, band, and arm, and look at how those break down. Do we see places where arm and gene agree but band is different? If so, should that gene be included as focal, or not?

Comment on lines 141 to 146
dominant_cytoband_status = case_when(
callable_fraction.x < (1 - uncallable_threshold) ~ "uncallable",
loss_fraction.x > status_threshold ~ "loss",
gain_fraction.x > status_threshold ~ "gain",
loss_fraction.x + gain_fraction.x > status_threshold ~ "unstable",
TRUE ~ "neutral"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets do this recalling first (before the join) and save the results. The reason is that we will be these calls again later (when making the gene level calls), so they should be consistent between full data set and the filtered one.

Comment on lines 174 to 185
# Now create a data.frame with the gene-level status calls for the
# neutral, uncallable, and unstable cytoband-level calls
consensus_seg_gene_status <- consensus_seg_autosomes_df %>%
# Create a cytoband column that keeps only the first half of the ranges
# for joining purposes
mutate(cytoband_no_ranges = gsub("-.*", "\\1", cytoband)) %>%
left_join(
consensus_seg_cytoband_status_df,
by = c(
"biospecimen_id" = "Kids_First_Biospecimen_ID",
"cytoband_no_ranges" = "cytoband"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mutate() statement you had here could have resulted in some mistakes/information loss, so I am replacing it here with a way to keep all the bands. The result may be some duplicated genes down the line, but I think we can handle that later.

Suggested change
# Now create a data.frame with the gene-level status calls for the
# neutral, uncallable, and unstable cytoband-level calls
consensus_seg_gene_status <- consensus_seg_autosomes_df %>%
# Create a cytoband column that keeps only the first half of the ranges
# for joining purposes
mutate(cytoband_no_ranges = gsub("-.*", "\\1", cytoband)) %>%
left_join(
consensus_seg_cytoband_status_df,
by = c(
"biospecimen_id" = "Kids_First_Biospecimen_ID",
"cytoband_no_ranges" = "cytoband"
)
# Create separate rows for genes span multiple cytobands
# CAUTION: This will require addition of a distinct() statment later to resolve duplicates
# Filtering to only multiband genes first because regexes are slow.
multi_band_genes <- consensus_seg_autosomes_df %>%
filter(grepl("-", cytoband)) %>%
extract(cytoband, into = c("chrom", "band"), regex = "([0-9]+)(.+)") %>% # make chrom and band cols
separate_rows(band, sep = "-") %>% # duplicate rows with more than one band
unite(cytoband, chrom, band, sep = "") # rejoin chrom and band
single_band_genes <- consensus_seg_autosomes_df %>%
filter(!grepl("-", cytoband))
gene_df <- bind_rows(single_band_genes, multi_band_genes)
# Now create a data.frame with the gene-level status calls for the
# neutral, uncallable, and unstable cytoband-level calls
consensus_seg_gene_status <- gene_df %>%
left_join(
consensus_seg_cytoband_status_df,
by = c(
"biospecimen_id" = "Kids_First_Biospecimen_ID",
"cytoband"
)

dominant_cytoband_status %in% c("neutral", "uncallable", "unstable") |
(status != dominant_cytoband_status &
# bands that disagree with arm, but are not neutral (or uncallable)
!(status %in% c("neutral", "uncallable", "unstable")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no neutral calls at the gene level, so I think we can skip this part of the filter.

# Filter the annotated CN data to include only neutral cytobands and disagreements
filter(
dominant_cytoband_status %in% c("neutral", "uncallable", "unstable") |
(status != dominant_cytoband_status &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will catch all of the places where the gene level call is "amplification" because we don't make that call at the cytoband level. I think the right thing to do is to change all "amplification" calls to "gain" for the purposes of this script and its output.

I also have some thought that we should also be looking at arm-level calls at this point. I can imagine a case where an arm is a gain and a cytoband within it is neutral and then a gene within that band is a gain. We would not be including the cytoband in the final set (on the assumption it is a false negative), but by the current logic we would include the individual gene, but I don't think we should. The logic here may get a bit messy, but I think start is to be sure to join in the arm level calls as well, so we have both dominant_cytoband_status and dominant_arm_status to look at before making decisions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the most recent commit, I recoded the "amplification" calls to be "gain" calls as you suggested here.

I also believe that I implemented the logic of including individual gene calls using both the dominant_cytoband_status and dominant_arm_status data.

Again, please let me know if I misinterpreted this comment in the way I re-formatted the code.

Comment on lines 198 to 200
loss_fraction,
gain_fraction,
callable_fraction,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are cytoband level info, so we probably don't want these in this gene-level output.

gain_fraction,
callable_fraction,
status
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the distinct() we need because of the band separation into multiple rows can go right here.

- reformat logic around creating the gene-level status calls data.frame (@jashapiro's review comment)
- distinguish the filtered data.frame objects
- remove unnecessary filtering out of neutral gene-level calls and include chromosome arm calls in logic determining which gene-level calls to filter out
- re-rerun nb
@cbethell
Copy link
Contributor Author

cbethell commented Apr 8, 2020

@jashapiro, this is ready for another look 👍

@jashapiro jashapiro mentioned this pull request Apr 8, 2020
5 tasks
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good!

I think I figured out the logic for the gene-level calls, but I do want some more eyes on that to see if I got it right.

Looking at the results, it seems like there are still regions where there are a lot of regions where many consecutive gene level calls are not reflected in the higher level calls; I suspect that is due to the fact that gene level were made with respect to ploidy, which the larger calls were not. In a separate analysis we will need to figure out how to properly account for that difference in approach.

### Determine chromosome arm status

```{r}
# Use our custom function to make the arm status calls
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Use our custom function to make the arm status calls

Comment on lines 108 to 110
loss_fraction = weighted.mean(loss_fraction, band_length),
gain_fraction = weighted.mean(gain_fraction, band_length),
callable_fraction = weighted.mean(callable_fraction, band_length)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A suggestion to save some confusion later.

Suggested change
loss_fraction = weighted.mean(loss_fraction, band_length),
gain_fraction = weighted.mean(gain_fraction, band_length),
callable_fraction = weighted.mean(callable_fraction, band_length)
loss_fraction_arm = weighted.mean(loss_fraction, band_length),
gain_fraction_arm = weighted.mean(gain_fraction, band_length),
callable_fraction_arm = weighted.mean(callable_fraction, band_length)

Comment on lines 116 to 119
callable_fraction < (1 - uncallable_threshold) ~ "uncallable",
loss_fraction > status_threshold ~ "loss",
gain_fraction > status_threshold ~ "gain",
loss_fraction + gain_fraction > status_threshold ~ "unstable",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following on previous suggestion.

Suggested change
callable_fraction < (1 - uncallable_threshold) ~ "uncallable",
loss_fraction > status_threshold ~ "loss",
gain_fraction > status_threshold ~ "gain",
loss_fraction + gain_fraction > status_threshold ~ "unstable",
callable_fraction_arm < (1 - uncallable_threshold) ~ "uncallable",
loss_fraction_arm > status_threshold ~ "loss",
gain_fraction_arm > status_threshold ~ "gain",
loss_fraction_arm + gain_fraction_arm > status_threshold ~ "unstable",

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if in an example situation where one status is >> than the other whether unstable is the right call. For example, if loss_fraction_arm is 0.89 and gain_fraction_arm is 0.01 whether this should be called as unstable but I guess we could be asking these kinds of questions all day and maybe they don't actually happen often if at all?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and also yes. I think this is a place where refinement could happen, but I don't know how much of a difference it will make in final conclusions. If we were to refine it, we would probablyu need to add another threshold for unstable, but I don't think we would want to call a region with 80% gain and 10% loss as "neutral". In some sense, if all this does is catch a few things that fall just below threshold, that might be a good thing. Note that the consequences for the final table are the same: neutral and unstable are treated essentially the same way when looking at smaller regions/genes.

Comment on lines 169 to 171
loss_fraction = loss_fraction.x,
gain_fraction = gain_fraction.x,
callable_fraction = callable_fraction.x,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you took my *_fraction_arm suggestion above, you won't need to rename anymore!

Suggested change
loss_fraction = loss_fraction.x,
gain_fraction = gain_fraction.x,
callable_fraction = callable_fraction.x,
loss_fraction,
gain_fraction,
callable_fraction,

Comment on lines 212 to 216
filter(
dominant_arm_status %in% c("neutral", "uncallable", "unstable") |
dominant_cytoband_status %in% c("neutral", "uncallable", "unstable") |
status != dominant_arm_status |
status != dominant_cytoband_status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the critical bit, and the logic is tough! I think we need some &s in here and parenthesis... I'm going to try to figure it out, with annotation. I would love another set of eyes on this (as many sets as possible?)

Suggested change
filter(
dominant_arm_status %in% c("neutral", "uncallable", "unstable") |
dominant_cytoband_status %in% c("neutral", "uncallable", "unstable") |
status != dominant_arm_status |
status != dominant_cytoband_status
filter(
# Case 1) Gene call disagrees with both arm and cytoband (This captures most
# we want to keep, including all of when arm and cytoband are neutral, uncallable
# or unstable, since we have no neutral gene calls in this df):
(status != dominant_arm_status & status != dominant_cytoband_status)
# Case 2) Gene call disagrees with a non-neutral cytoband call.
# Keep no matter what the arm status
| (status != dominant_cytoband_status
& dominant_cytoband_status %in% c("gain", "loss"))
# I think that captures everything we want. Cases we don't want include:
# gene & arm agree, but cytoband is neutral
# gene & cytoband agree
# all 3 agree

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this logic makes much more sense to me, thank you for the added comments here!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those comments were not just for you! They were for me as I was writing it. Not even me tomorrow; me today!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking my understanding here - we do have neutrals for cytoband and arm. @jashapiro can you tell me a little bit more about

# gene & arm agree, but cytoband is neutral

as well.

Copy link
Member

@jashapiro jashapiro Apr 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, we have neutrals for cytoband and arm. But if we call the whole arm as loss (for example), but the cytoband is called neutral, the assumption I was making is that the cytoband is probably a false negative, and we would already have the arm that "covers" all of the genes in this region, so an individual gene that is lost in that cytoband would not be worth calling out as "focal".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jashapiro ! This logic seems correct to me.

- add "_arm" to the status fraction columns for consensus_seg_arm_status object
- adjust filtering logic at the gene level 
- rerun nb
@cbethell
Copy link
Contributor Author

I implemented the latest of @jashapiro's suggestions in the last commit, including the logic mentioned in this comment.

I ran it with the logic in that comment and produced a new output file for folks to take a look at when (or if) thinking about how this logic may or may not need to be adjusted.

@cbethell cbethell requested a review from jashapiro April 13, 2020 13:34
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me!

My only suggestion is really a forward-looking one that we might want to retain region start and end positions for each region in the final table to make later analysis/plots a bit easier. I do not think that change should be in this PR, but might be noted as a potential future enhancement.

select(Kids_First_Biospecimen_ID,
status = dominant_status,
region,
region_type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not critical, but it might be worth sorting by ID, just to keep things in a somewhat reasonable order.
If we want to do even better, we could merge back in start and end positions for each region? Doing so might save a lot of redundant work in later scripts.

Suggested change
region_type)
region_type) %>%
arrange(Kids_First_Biospecimen_ID)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha, added this change in the last commit.

@cbethell
Copy link
Contributor Author

cbethell commented May 12, 2020

My only suggestion is really a forward-looking one that we might want to retain region start and end positions for each region in the final table to make later analysis/plots a bit easier. I do not think that change should be in this PR, but might be noted as a potential future enhancement.

@jashapiro In other words, in a future PR, you may imagine getting the chromosomal start and end regions from the consensus seg file?

@jashapiro
Copy link
Member

@jashapiro In other words, in a future PR, you may imagine getting the chromosomal start and end regions from the consensus seg file?

I imagine we would get them from the original annotation files that define arms, bands, and genes, but yes. If that information would be broadly used, it might make sense to include it here, rather than requiring those annotations being repeatedly added back. But I don't think it shoudl be done now.

Once we figure out why CI is failing, this should go in as it is.

@cbethell
Copy link
Contributor Author

I imagine we would get them from the original annotation files that define arms, bands, and genes, but yes. If that information would be broadly used, it might make sense to include it here, rather than requiring those annotations being repeatedly added back. But I don't think it shoudl be done now.

Once we figure out why CI is failing, this should go in as it is.

Ahh okay, understood 👍 I will work on getting that ready in a separate PR.

@jashapiro
Copy link
Member

Ahh okay, understood 👍 I will work on getting that ready in a separate PR.

I would work on other things first as I would view this as very low priority at this point. I really just wanted to document the thought.

@cbethell cbethell mentioned this pull request May 12, 2020
5 tasks
@jashapiro jashapiro merged commit aeb5847 into AlexsLemonade:master May 15, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants