Define most focal units of recurrent CNVs #644

cbethell · 2020-03-23T15:36:42Z

Purpose/implementation Section

The purpose of this PR is to define the most focal units of recurrent CNVs.

What scientific question is your analysis addressing?

Related to PR #485, this PR is working to compile a table/list of the most focal units of recurrent CNVs, most focal meaning:

If a chromosome arm is not clearly defined as a gain or loss (and is callable) we look to define the cytoband level status
If a cytoband is not clearly defined as a gain or loss (and is callable) we then look to define the gene level status

This PR is working to compile this information into one table.

What was your approach?

Using the file generated in PR #617:

The chromosome arm statuses are determined by counting the amount of instances of each call on each chromosome arm and defining the arm status based on the status call that is responsible for more than 90% of the total arm calls
If a chromosome arm is not clearly defined as a gain or loss (and is more than 50% callable), look to define the cytoband level status
If a cytoband is not clearly defined as a gain or loss (and is more than 50% callable), look to define the gene level status
Compile this information into one table

What GitHub issue does your pull request address?

Related: #486

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

The running_in_ci flag was placed around the gene level steps because the testing subset files throw an error that a certain status is not found (as the subset files do not have each status at least once at the gene level after filtering out arms and cytobands that are non-neutral).
Is there possibly a more elegant solution here?
The threshold used to determine the most dominant status for each level should also receive a closer look.

Is there anything that you want to discuss further?

Note that in this PR I also replace an inner_join in the 03-add-cytoband-status.Rmd notebook, that was unintentially added in this commit of PR #617 with left_join (as was originally implemented and approved) -- @jashapiro you may want to take a look at this and the resulting file to make sure its still on par with what you approved.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes, this is ready for review.

Results

What types of results are included (e.g., table, figure)?

A final table of results can be found at results/consensus_seg_most_focal_cn_status.tsv.gz

What is your summary of the results?

The rendered html output of the notebook in this PR as it is currently can be seen here.

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

- add back `left_join` where it was replaced with `inner_join`in `03` nb and re-rerun nb

- the genes in the gene level status data.frame are likely not in the subset testing files, therefore causing circleCI tests to fails (comment out this section for now)

…into define-most-focal-units-cnvs

- rerun nb - save final table to file

- uncomment the previously commented out steps in shell script - rerun nb

cansavvy

@cbethell , I really appreciate your nice explanations in your PR! This made it easy for me to follow what the goal was. I still need to go over some of these tidy manipulations with a finer tooth comb, but I wanna give you some of my first big picture thoughts now so we can discuss it.

My main question is about how we make the calls. How close is it, what does the percent of each status look like for each call? e.g. if we call something gain is it ever the case that the percent of gain is only a hair larger than loss? To ask these questions you may want to calculate percentages first, and then in a separate step where you make the call.

I think we may need to restructure that particular case_when but I have to think about what the best logic is for calling that. May want others to weigh in on that.

You could draft up some quick and dirty plots that shows the percent of each status and separate on the x-axis what the final call was. Ideally if we are calling things right, we should see that there are hardly ever "ties" but there's a clear differentiation between gains and losses.

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

analyses/focal-cn-file-preparation/run-prepare-cn.sh

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

cbethell · 2020-03-25T12:31:55Z

To ask these questions you may want to calculate percentages first, and then in a separate step where you make the call.
You could draft up some quick and dirty plots that shows the percent of each status and separate on the x-axis what the final call was. Ideally if we are calling things right, we should see that there are hardly ever "ties" but there's a clear differentiation between gains and losses.

I will implement these suggestions in the upcoming commits and post an updated comment here once I've summarized the findings.

- functionalize the calculating of coverage percentage values in each given region and the logic determining the dominant status - tweak logic determining dominant status (will need further tweaking) - keep status coverage percentage values in final table - add plots to look at dominant status versus the percentage values for each status in each region - rerun nb

cbethell · 2020-03-25T17:53:32Z

@cansavvy here is the rendered notebook of my first go at implementing the suggestions you made above.

You can let me know if that's what you were thinking the plots should look like or if I should tweak them (or if I should tweak anything else at this point).

cansavvy

Hi @cbethell , I think this is coming along! I have a few more questions and comments. I think the main function could use a bit of streamlining and DRYing up, and I have some questions about what the final table should look like.

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

cansavvy · 2020-03-26T13:20:16Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+        percent_neutral = neutral / (gain + loss + neutral),
+        dominant_status = case_when(
+          percent_gain > 0.5 &
+            percent_gain > 1.5 * percent_loss ~ "gain",


Can you explain these logic statements? I'm not saying they aren't what we want, but I wanna understand the changes you made here and what you are intending.

I am sure that I probably went about implementing this logic is the least efficient way (it's probably not doing what I intended), but the idea here is that we want the status percentage to be responsible for more than 50% of calls in that region and to be greater than 1.5 times the alternate status call.

This was an attempt to account for a loss/gain dominant status call being made when there is just slim difference between calls.

Ah. Okay. Well as a more general statement, I think whatever the logic is that we have for deciding majority status we should probably these methods pretty upfront laid out in documentation (probably at the top of the notebook and READMEs as well as comment). Not only what the methods are for the majority status call decisions but also what was our reasoning behind that.

Gotcha, will do.

That being said, I think we are getting closer, but there are two things that we may want to think about when calling majority status (and these are related).

Not calling a region if too much of it is uncallable

Not calling a region if the absolute number of calls are too small (related to 1) e.g. 3 calls vs 2 calls would technically be greater than 50% but we probably don't have enough data to make that call.

But lastly, I'm wondering whether we need a > 0.5 cutoff in addition to a 1.5x cutoff or if raising the threshold of the percent cutoff might establish the same thing?

But lastly, I'm wondering whether we need a > 0.5 cutoff in addition to a 1.5x cutoff or if raising the threshold of the percent cutoff might establish the same thing?

Good question, I'll test this out and include my findings in the notebook (in addition to the points you made above).

No need do an entire side analysis but, my point is just that whatever decision logic we come up with we should make sure we are clear about what it is and why we think this gives us a reasonable answer.

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

cansavvy · 2020-03-26T13:40:31Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+
+```{r}
+# Filter the annotated CN data to include only these cytobands
+consensus_seg_autosomes_df_develop <- consensus_seg_autosomes_df %>%


It looks like you are reusing the name consensus_seg_autosomes_df_developa few times. This is a tad hard to follow. Can you make them different things? We should probably see if we can streamline this some, or add this formatting step to the main function.

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

add the `percent_uncallable` and `percent_unstable` labels where applicable - increase the threshold and add more explanation as to why - make else statement more specific - convert plots to jitter plots - add final focal_call column to final table - add suffixes when joining to create final table - rerun nb

- update documentation in notebook and `README` file to make note of decisions around cutoffs

jashapiro · 2020-04-03T19:05:21Z

Right now, it appears that calculations are being done based on the gene-level calls, with counts translated up to arms, but we have data for the cytobands which seems like a much more natural place to look for these high-level calls.

Do you think you could point out in the code what you are referring to specifically?

OpenPBTA-analysis/analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

Lines 253 to 259 in 1e89da2

    
           arm_status_count <- status_majority_caller( 
        
             combined_status_df, 
        
             "chromosome_arm", 
        
             "status", 
        
             "biospecimen_id", 
        
             percent_threshold 
        
           )

The status column is coming from the gene-level calls, right? The combined_status_df table has one row for every specimen/gene combination, so when you count up calls for a region here:

OpenPBTA-analysis/analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

Lines 89 to 102 in 1e89da2

    
             status_df <- status_df %>% 
        
               # Keep only the columns we need 
        
               select(!!id_sym, !!region_sym, !!status_sym) %>% 
        
               # Group by status, sample, and region 
        
               group_by(!!id_sym, !!region_sym, !!status_sym) %>% 
        
               # Get a total count of each s 
        
               summarize(status_n = n()) 
        
             # Getting total counts by region 
        
             region_total_df <- status_df %>% 
        
               # Group by region and for each sample 
        
               group_by(!!region_sym, !!id_sym) %>% 
        
               # Get totals 
        
               summarize(total = sum(status_n))

you are getting the count of the number of genes in that region.

At least that was how I was able to read it.

jashapiro · 2020-04-03T19:11:09Z

One disadvantage compared to your current approach is that it loses the "amplification" calls...

Should the amplifications calls be recoded to gains and this is resolved? Or is this something else that needs to be done?

I think changing to gain is probably the easiest thing here, as it is consistent with other CN calling that has been done. It may be reasonable to come back and looks specifically for amplifications if we go back to where we have seg.means, but that might be a separate step.

cansavvy · 2020-04-03T19:11:53Z

Yes. I think you are right. I imagine it might be more exact to move to a percent length based assessment? Or is that not necessary and using weighted means is sufficient?

jashapiro · 2020-04-03T19:33:49Z

Yes. I think you are right. I imagine it might be more exact to move to a percent length based assessment? Or is that not necessary and using weighted means is sufficient?

Weighted mean of call fraction and length will give the correct fractions for the total.

Co-Authored-By: jashapiro <[email protected]>

- have `status_majority_caller` function calculate weighted means and define dominant status based on these values - comment out the plotting steps (for development purposes) because these now have to change now that the previous function's logic changed - rerun nb - rename `percentages` as `fractions`

…into define-most-focal-units-cnvs

jashapiro

I think we are getting there. I have suggestions now about changing the filtering for the sub-arm calls: the current filtering is treating all samples the same way, and we really need to filter each sample separately. The easiest way to do this, I think is to use a join and then filter, rather than trying to get the filter list just from the prior table.

We will also need to decide what to do with the "amplification" calls at the gene level; I tend to think we should demote them all to "gain" for consistency, but maybe have a separate column that denotes what might have been called an amplification? Ideally, we would tie things back to the seg.mean call, but I think that is probably beyond the scope here.

I am also not quite sure what to do about the fact that the gene level calls were done with respect to ploidy, but the cytoband calls were not. We may have to revisit the cytoband calls to add this in.

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

jashapiro · 2020-04-05T16:51:05Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+    loss_fraction,
+    gain_fraction,
+    callable_fraction,
+    dominant_status = dominant_cytoband_status


If we are recalling the dominant_status below, we can probably just drop this column here (or even earlier).

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

jashapiro · 2020-04-05T17:13:37Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+neutral_cytobands <- cytoband_status_count %>%
+  filter(dominant_status %in% c("unstable", "neutral")) %>%
+  dplyr::pull("cytoband")


Similarly to above, I think we want to do this via a left_join and comparisons within each row. The reason being we need to preserve the relationship between samples and cytobands, not just keep all cytobands with at least one neutral call.

Note that when we join these genes to cytobands, some of the cytoband columns in consensus_seg_cytoband_status_df have more than one cytoband as genes can cross boundaries. We will need to add logic to fill those in as possible.

We will also have to be careful here to be sure the logic flows correctly, as we don't want to make mistakes in filtering at this point because of any filtering we might have done at the cytoband level! That is, if we filtered a cytoband because it was the same call as the arm that contains it, we still want to filter out the genes within that cytoband that have the same call!

jashapiro · 2020-04-05T17:14:39Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+if (!(running_in_ci)) {
+  # Now count the gene-level calls (for each status call) and define
+  # the gene as that status if more than 50% of the total counts are
+  # for that particular status
+  gene_status_count <- status_majority_caller(
+    combined_neutral_status_cytoband_df,
+    "gene_symbol",
+    status_threshold
+  )


I don't understand what this is meant to do. Each sample should only have one call for each gene, right?

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

- remove `status_majority_caller` function and tailor to the arm status data frame - change filtering logic to include cytoband ranges - change logic joining gene-level status calls - remove unnecessary `group_by` statements - change logic around filtering on arms and cytobands - use more unqiue `dominant_status` columns when handling each region - rerun nb

- propagate change to shell script - rerun nb

cbethell · 2020-04-06T23:22:34Z

@jashapiro, I believe I addressed your most recent review comments here in this commit, with the exception of:

We will also need to decide what to do with the "amplification" calls at the gene level; I tend to think we should demote them all to "gain" for consistency, but maybe have a separate column that denotes what might have been called an amplification? Ideally, we would tie things back to the seg.mean call, but I think that is probably beyond the scope here.

I am also not quite sure what to do about the fact that the gene level calls were done with respect to ploidy, but the cytoband calls were not. We may have to revisit the cytoband calls to add this in.

Please let me know if I misinterpreted any of your comments or suggestions.

jashapiro

Looking good! I think I recreated most of the comments that github lost, but I might be back with more later.

The biggest thing now I think is the logic for the gene level calls, which might get a bit strange. I think the approach I would take with it is to make a data frame with all of the status calls from gene, band, and arm, and look at how those break down. Do we see places where arm and gene agree but band is different? If so, should that gene be included as focal, or not?

jashapiro · 2020-04-07T14:39:53Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+    dominant_cytoband_status = case_when(
+      callable_fraction.x < (1 - uncallable_threshold) ~ "uncallable",
+      loss_fraction.x > status_threshold ~ "loss",
+      gain_fraction.x > status_threshold ~ "gain",
+      loss_fraction.x + gain_fraction.x > status_threshold ~ "unstable",
+      TRUE ~ "neutral"


Lets do this recalling first (before the join) and save the results. The reason is that we will be these calls again later (when making the gene level calls), so they should be consistent between full data set and the filtered one.

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

jashapiro · 2020-04-07T14:56:54Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+# Now create a data.frame with the gene-level status calls for the
+# neutral, uncallable, and unstable cytoband-level calls
+consensus_seg_gene_status <- consensus_seg_autosomes_df %>%
+  # Create a cytoband column that keeps only the first half of the ranges
+  # for joining purposes
+  mutate(cytoband_no_ranges = gsub("-.*", "\\1", cytoband)) %>%
+  left_join(
+    consensus_seg_cytoband_status_df,
+    by = c(
+      "biospecimen_id" = "Kids_First_Biospecimen_ID",
+      "cytoband_no_ranges" = "cytoband"
+    )


The mutate() statement you had here could have resulted in some mistakes/information loss, so I am replacing it here with a way to keep all the bands. The result may be some duplicated genes down the line, but I think we can handle that later.

Suggested change

# Now create a data.frame with the gene-level status calls for the

# neutral, uncallable, and unstable cytoband-level calls

consensus_seg_gene_status <- consensus_seg_autosomes_df %>%

# Create a cytoband column that keeps only the first half of the ranges

# for joining purposes

mutate(cytoband_no_ranges = gsub("-.*", "\\1", cytoband)) %>%

left_join(

consensus_seg_cytoband_status_df,

by = c(

"biospecimen_id" = "Kids_First_Biospecimen_ID",

"cytoband_no_ranges" = "cytoband"

)

# Create separate rows for genes span multiple cytobands

# CAUTION: This will require addition of a distinct() statment later to resolve duplicates

# Filtering to only multiband genes first because regexes are slow.

multi_band_genes <- consensus_seg_autosomes_df %>%

filter(grepl("-", cytoband)) %>%

extract(cytoband, into = c("chrom", "band"), regex = "([0-9]+)(.+)") %>% # make chrom and band cols

separate_rows(band, sep = "-") %>% # duplicate rows with more than one band

unite(cytoband, chrom, band, sep = "") # rejoin chrom and band

single_band_genes <- consensus_seg_autosomes_df %>%

filter(!grepl("-", cytoband))

gene_df <- bind_rows(single_band_genes, multi_band_genes)

# Now create a data.frame with the gene-level status calls for the

# neutral, uncallable, and unstable cytoband-level calls

consensus_seg_gene_status <- gene_df %>%

left_join(

consensus_seg_cytoband_status_df,

by = c(

"biospecimen_id" = "Kids_First_Biospecimen_ID",

"cytoband"

)

jashapiro · 2020-04-07T15:02:03Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+    dominant_cytoband_status %in% c("neutral", "uncallable", "unstable") |
+      (status != dominant_cytoband_status &
+         # bands that disagree with arm, but are not neutral (or uncallable)
+         !(status %in% c("neutral", "uncallable", "unstable")


There are no neutral calls at the gene level, so I think we can skip this part of the filter.

jashapiro · 2020-04-07T15:03:59Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+  # Filter the annotated CN data to include only neutral cytobands and disagreements
+  filter(
+    dominant_cytoband_status %in% c("neutral", "uncallable", "unstable") |
+      (status != dominant_cytoband_status &


This will catch all of the places where the gene level call is "amplification" because we don't make that call at the cytoband level. I think the right thing to do is to change all "amplification" calls to "gain" for the purposes of this script and its output.

I also have some thought that we should also be looking at arm-level calls at this point. I can imagine a case where an arm is a gain and a cytoband within it is neutral and then a gene within that band is a gain. We would not be including the cytoband in the final set (on the assumption it is a false negative), but by the current logic we would include the individual gene, but I don't think we should. The logic here may get a bit messy, but I think start is to be sure to join in the arm level calls as well, so we have both dominant_cytoband_status and dominant_arm_status to look at before making decisions.

In the most recent commit, I recoded the "amplification" calls to be "gain" calls as you suggested here.

I also believe that I implemented the logic of including individual gene calls using both the dominant_cytoband_status and dominant_arm_status data.

Again, please let me know if I misinterpreted this comment in the way I re-formatted the code.

jashapiro · 2020-04-07T15:04:23Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+    loss_fraction,
+    gain_fraction,
+    callable_fraction,


These are cytoband level info, so we probably don't want these in this gene-level output.

jashapiro · 2020-04-07T15:13:13Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+    gain_fraction,
+    callable_fraction,
+    status
+  )


I think the distinct() we need because of the band separation into multiple rows can go right here.

@jashapiro

- reformat logic around creating the gene-level status calls data.frame (@jashapiro's review comment) - distinguish the filtered data.frame objects - remove unnecessary filtering out of neutral gene-level calls and include chromosome arm calls in logic determining which gene-level calls to filter out - re-rerun nb

cbethell · 2020-04-08T15:08:37Z

@jashapiro, this is ready for another look 👍

jashapiro

This looks good!

I think I figured out the logic for the gene-level calls, but I do want some more eyes on that to see if I got it right.

Looking at the results, it seems like there are still regions where there are a lot of regions where many consecutive gene level calls are not reflected in the higher level calls; I suspect that is due to the fact that gene level were made with respect to ploidy, which the larger calls were not. In a separate analysis we will need to figure out how to properly account for that difference in approach.

jashapiro · 2020-04-10T13:26:38Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+### Determine chromosome arm status
+
+```{r}
+# Use our custom function to make the arm status calls


Suggested change

# Use our custom function to make the arm status calls

jashapiro · 2020-04-10T13:29:11Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+      loss_fraction = weighted.mean(loss_fraction, band_length),
+      gain_fraction = weighted.mean(gain_fraction, band_length),
+      callable_fraction = weighted.mean(callable_fraction, band_length)


A suggestion to save some confusion later.

Suggested change

loss_fraction = weighted.mean(loss_fraction, band_length),

gain_fraction = weighted.mean(gain_fraction, band_length),

callable_fraction = weighted.mean(callable_fraction, band_length)

loss_fraction_arm = weighted.mean(loss_fraction, band_length),

gain_fraction_arm = weighted.mean(gain_fraction, band_length),

callable_fraction_arm = weighted.mean(callable_fraction, band_length)

jashapiro · 2020-04-10T13:29:41Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+        callable_fraction < (1 - uncallable_threshold) ~ "uncallable",
+        loss_fraction > status_threshold ~ "loss",
+        gain_fraction > status_threshold ~ "gain",
+        loss_fraction + gain_fraction > status_threshold ~ "unstable",


Following on previous suggestion.

Suggested change

callable_fraction < (1 - uncallable_threshold) ~ "uncallable",

loss_fraction > status_threshold ~ "loss",

gain_fraction > status_threshold ~ "gain",

loss_fraction + gain_fraction > status_threshold ~ "unstable",

callable_fraction_arm < (1 - uncallable_threshold) ~ "uncallable",

loss_fraction_arm > status_threshold ~ "loss",

gain_fraction_arm > status_threshold ~ "gain",

loss_fraction_arm + gain_fraction_arm > status_threshold ~ "unstable",

I wonder if in an example situation where one status is >> than the other whether unstable is the right call. For example, if loss_fraction_arm is 0.89 and gain_fraction_arm is 0.01 whether this should be called as unstable but I guess we could be asking these kinds of questions all day and maybe they don't actually happen often if at all?

Yes and also yes. I think this is a place where refinement could happen, but I don't know how much of a difference it will make in final conclusions. If we were to refine it, we would probablyu need to add another threshold for unstable, but I don't think we would want to call a region with 80% gain and 10% loss as "neutral". In some sense, if all this does is catch a few things that fall just below threshold, that might be a good thing. Note that the consequences for the final table are the same: neutral and unstable are treated essentially the same way when looking at smaller regions/genes.

jashapiro · 2020-04-10T13:33:19Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+    loss_fraction = loss_fraction.x,
+    gain_fraction = gain_fraction.x,
+    callable_fraction = callable_fraction.x,


If you took my *_fraction_arm suggestion above, you won't need to rename anymore!

Suggested change

loss_fraction = loss_fraction.x,

gain_fraction = gain_fraction.x,

callable_fraction = callable_fraction.x,

loss_fraction,

gain_fraction,

callable_fraction,

jashapiro · 2020-04-10T14:39:20Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+  filter(
+    dominant_arm_status %in% c("neutral", "uncallable", "unstable") |
+      dominant_cytoband_status %in% c("neutral", "uncallable", "unstable") |
+      status != dominant_arm_status |
+      status != dominant_cytoband_status


This is the critical bit, and the logic is tough! I think we need some &s in here and parenthesis... I'm going to try to figure it out, with annotation. I would love another set of eyes on this (as many sets as possible?)

Suggested change

filter(

dominant_arm_status %in% c("neutral", "uncallable", "unstable") |

dominant_cytoband_status %in% c("neutral", "uncallable", "unstable") |

status != dominant_arm_status |

status != dominant_cytoband_status

filter(

# Case 1) Gene call disagrees with both arm and cytoband (This captures most

# we want to keep, including all of when arm and cytoband are neutral, uncallable

# or unstable, since we have no neutral gene calls in this df):

(status != dominant_arm_status & status != dominant_cytoband_status)

# Case 2) Gene call disagrees with a non-neutral cytoband call.

# Keep no matter what the arm status

| (status != dominant_cytoband_status

& dominant_cytoband_status %in% c("gain", "loss"))

# I think that captures everything we want. Cases we don't want include:

# gene & arm agree, but cytoband is neutral

# gene & cytoband agree

# all 3 agree

Ah, this logic makes much more sense to me, thank you for the added comments here!

Those comments were not just for you! They were for me as I was writing it. Not even me tomorrow; me today!

Checking my understanding here - we do have neutrals for cytoband and arm. @jashapiro can you tell me a little bit more about

# gene & arm agree, but cytoband is neutral

as well.

Correct, we have neutrals for cytoband and arm. But if we call the whole arm as loss (for example), but the cytoband is called neutral, the assumption I was making is that the cytoband is probably a false negative, and we would already have the arm that "covers" all of the genes in this region, so an individual gene that is lost in that cytoband would not be worth calling out as "focal".

Thanks @jashapiro ! This logic seems correct to me.

- add "_arm" to the status fraction columns for consensus_seg_arm_status object - adjust filtering logic at the gene level - rerun nb

cbethell · 2020-04-10T20:59:29Z

I implemented the latest of @jashapiro's suggestions in the last commit, including the logic mentioned in this comment.

I ran it with the logic in that comment and produced a new output file for folks to take a look at when (or if) thinking about how this logic may or may not need to be adjusted.

jashapiro

This looks good to me!

My only suggestion is really a forward-looking one that we might want to retain region start and end positions for each region in the final table to make later analysis/plots a bit easier. I do not think that change should be in this PR, but might be noted as a potential future enhancement.

jashapiro · 2020-05-12T15:09:04Z

analyses/focal-cn-file-preparation/05-define-most-focal-cn-units.Rmd

+  select(Kids_First_Biospecimen_ID,
+         status = dominant_status,
+         region,
+         region_type)


This is not critical, but it might be worth sorting by ID, just to keep things in a somewhat reasonable order.
If we want to do even better, we could merge back in start and end positions for each region? Doing so might save a lot of redundant work in later scripts.

Suggested change

region_type)

region_type) %>%

arrange(Kids_First_Biospecimen_ID)

Gotcha, added this change in the last commit.

cbethell · 2020-05-12T16:15:27Z

My only suggestion is really a forward-looking one that we might want to retain region start and end positions for each region in the final table to make later analysis/plots a bit easier. I do not think that change should be in this PR, but might be noted as a potential future enhancement.

@jashapiro In other words, in a future PR, you may imagine getting the chromosomal start and end regions from the consensus seg file?

jashapiro · 2020-05-12T16:21:27Z

@jashapiro In other words, in a future PR, you may imagine getting the chromosomal start and end regions from the consensus seg file?

I imagine we would get them from the original annotation files that define arms, bands, and genes, but yes. If that information would be broadly used, it might make sense to include it here, rather than requiring those annotations being repeatedly added back. But I don't think it shoudl be done now.

Once we figure out why CI is failing, this should go in as it is.

cbethell · 2020-05-12T16:23:18Z

I imagine we would get them from the original annotation files that define arms, bands, and genes, but yes. If that information would be broadly used, it might make sense to include it here, rather than requiring those annotations being repeatedly added back. But I don't think it shoudl be done now.

Once we figure out why CI is failing, this should go in as it is.

Ahh okay, understood 👍 I will work on getting that ready in a separate PR.

jashapiro · 2020-05-12T17:02:52Z

Ahh okay, understood 👍 I will work on getting that ready in a separate PR.

I would work on other things first as I would view this as very low priority at this point. I really just wanted to document the thought.

…-units-cnvs

cbethell and others added 11 commits March 23, 2020 10:44

Add notebook to define most focal units of recurrent CNVs

fa323f4

- add back `left_join` where it was replaced with `inner_join`in `03` nb and re-rerun nb

add further documentation and rerun nb

be79b8e

add 05 nb to module shell script

3a83534

fix object name

76d7871

add logic for defining dominant cytoband and gene level status calls

f71f317

comment out gene level status wrangling

2ae1ae7

- the genes in the gene level status data.frame are likely not in the subset testing files, therefore causing circleCI tests to fails (comment out this section for now)

Merge branch 'master' into define-most-focal-units-cnvs

fcdba72

Merge branch 'master' of https://github.com/cbethell/OpenPBTA-analysis …

c23f1a5

…into define-most-focal-units-cnvs

Merge branch 'master' of https://github.com/cbethell/OpenPBTA-analysis …

6bd4609

…into define-most-focal-units-cnvs

combine arm, cytoband and gene level status data

0a061c0

- rerun nb - save final table to file

add logic to run in circle CI tests

ce09ee1

- uncomment the previously commented out steps in shell script - rerun nb

cbethell marked this pull request as ready for review March 24, 2020 19:27

cbethell changed the title ~~WIP: Define most focal units of recurrent CNVs~~ Define most focal units of recurrent CNVs Mar 24, 2020

cbethell requested a review from cansavvy March 24, 2020 19:31

cansavvy reviewed Mar 24, 2020

View reviewed changes

update comments/documentation in 05 nb and rerun

fa9010f

cbethell and others added 4 commits March 25, 2020 14:56

fix logic around plotting gene status (for circlCI tests)

0ee5b34

fix display of gene level status plots for rendered notebook

409ba19

use list function to display gene plots when not running in CI

820652f

Merge branch 'master' into define-most-focal-units-cnvs

2aebcb8

cansavvy reviewed Mar 26, 2020

View reviewed changes

jaclyn-taroni mentioned this pull request Mar 26, 2020

WIP: Resolving "gene-level" copy number alterations with broader alterations #485

Closed

5 tasks

cbethell and others added 5 commits March 27, 2020 08:22

add logic for running in CI

a4e38e3

fix placement of logic and rerun

2715107

@cansavvy and @cbethell paircoding changes

37c43b7

functionalize facet_wrap plotting

82043c6

- update documentation in notebook and `README` file to make note of decisions around cutoffs

cbethell and others added 3 commits April 3, 2020 19:38

Apply @jashapiro's sentence re-structuring suggestions

b99c9df

Co-Authored-By: jashapiro <[email protected]>

Merge branch 'master' of https://github.com/cbethell/OpenPBTA-analysis …

541fabd

…into define-most-focal-units-cnvs

jashapiro reviewed Apr 5, 2020

View reviewed changes

cbethell added 2 commits April 6, 2020 19:05

remove running in CI flag that is no longer needed with new logic

6c23024

- propagate change to shell script - rerun nb

jashapiro reviewed Apr 7, 2020

View reviewed changes

jashapiro mentioned this pull request Apr 8, 2020

CN Status Heatmap (PR 1 of 2) #602

Merged

5 tasks

jashapiro reviewed Apr 10, 2020

View reviewed changes

incorporate @jashapiro's suggestions

f58c4d1

- add "_arm" to the status fraction columns for consensus_seg_arm_status object - adjust filtering logic at the gene level - rerun nb

cbethell requested a review from jashapiro April 13, 2020 13:34

cbethell and others added 2 commits April 13, 2020 15:20

Merge branch 'master' into define-most-focal-units-cnvs

550a4f6

Merge branch 'master' into define-most-focal-units-cnvs

abe0ec3

jashapiro approved these changes May 12, 2020

View reviewed changes

arrange by biospecimen ID and rerun

5146ac4

cbethell mentioned this pull request May 12, 2020

Find recurrent focal CN calls #686

Merged

5 tasks

Merge remote-tracking branch 'upstream/master' into define-most-focal…

e48c15c

…-units-cnvs

jashapiro merged commit aeb5847 into AlexsLemonade:master May 15, 2020

-  filter(
-    dominant_arm_status %in% c("neutral", "uncallable", "unstable") |
-      dominant_cytoband_status %in% c("neutral", "uncallable", "unstable") |
-      status != dominant_arm_status |
-      status != dominant_cytoband_status
+  filter(
+    # Case 1) Gene call disagrees with both arm and cytoband (This captures most
+    # we want to keep, including all of when arm and cytoband are neutral, uncallable
+    # or unstable, since we have no neutral gene calls in this df):
+    (status != dominant_arm_status & status != dominant_cytoband_status)
+    # Case 2) Gene call disagrees with a non-neutral cytoband call.
+    # Keep no matter what the arm status
+    | (status != dominant_cytoband_status
+       & dominant_cytoband_status %in% c("gain", "loss"))
+    # I think that captures everything we want. Cases we don't want include:
+    # gene & arm agree, but cytoband is neutral
+    # gene & cytoband agree
+    # all 3 agree

	region_type)
	region_type) %>%
	arrange(Kids_First_Biospecimen_ID)

Define most focal units of recurrent CNVs #644

Define most focal units of recurrent CNVs #644

Conversation

cbethell commented Mar 23, 2020 • edited Loading

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

Documentation Checklist

cansavvy left a comment • edited Loading

Choose a reason for hiding this comment

cbethell commented Mar 25, 2020

cbethell commented Mar 25, 2020

cansavvy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jashapiro commented Apr 3, 2020

jashapiro commented Apr 3, 2020

cansavvy commented Apr 3, 2020 • edited by jashapiro Loading

jashapiro commented Apr 3, 2020

jashapiro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbethell commented Apr 6, 2020

jashapiro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbethell commented Apr 8, 2020

jashapiro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jashapiro Apr 13, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbethell commented Apr 10, 2020

jashapiro left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cbethell commented May 12, 2020 • edited Loading

jashapiro commented May 12, 2020

cbethell commented May 12, 2020

jashapiro commented May 12, 2020

cbethell commented Mar 23, 2020 •

edited

Loading

cansavvy left a comment •

edited

Loading

cansavvy commented Apr 3, 2020 •

edited by jashapiro

Loading

jashapiro Apr 13, 2020 •

edited

Loading

cbethell commented May 12, 2020 •

edited

Loading