-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Question regarding how Lineplots values are calculated, related to #116 #252
Comments
Hi @AlexBlais74 Sorry for the late reply.
Correct.
Correct. Here we followed the DEseq2 assumption to normalize the TF activity which was calculated by summing up the number counts around the binding sites within a specific window.
Correct. This step is optional and we only ran it to visualize the signal in IGV.
Correct.
Here the lineplot are generated by using the signal (after bias correction) around the TF binding sites. In step 3, the signal is based on peaks. We actually re-generated the signal instead of using the signal from step 3. Regarding normalization, indeed, here we have to remove the sample-specific bias caused by sequencing depth. The normalization was perform by using method from DEseq, i.e., median-based normalization.
As mentioned above, this normalization tried to scale the signal with a normalization factor for each sample which assumes that most TFs are not changing.
The TF activity in "differential_statistics.txt" have been normalized by the scaling factors, so if you calculate the factor based on the normalized data, you will have value close to 1.
You are right, the normalization factors are computed based on the ATAC signal, and we first normalized the ATAC-seq signal with the scaling factors. Based on the normalize signal, we can compute the activity score by combining TF protetction and accessibility. Please let me know if there are anything unclear. :) Zhijian |
Hello, I am interested in understanding precisely how the lineplots are generated, with rgt-hint --differential for ATAC-seq. This is as a follow up to issue #116, where @lzj1769 offered some information but I am not super clear on the process. The reason for my questions is that this would allow me to generate similar plots, but with varying widths around the TF motif hit, or at a subset of genomic regions of my choice, at subsets of the TF motif hits, etc.
Note: I am working with
HINT - Regulatory Analysis Toolbox (RGT) - v1.0.1
Perhaps you can tell me if what I understood is correct.
1-run
rgt-hint footprinting --atac-seq --paired-end
to identify footprints at regions of interest (e.g. peaks)2-run
rgt-motifanalysis matching
in order to locate TF motif hits among the identified footprints. Important at this step to screen a large collection of diverse TF motifs because the stats in "differential" assume a normal distribution with most TFs having no differential activity.3-run
rgt-hint tracks --bc --norm
in order to generate wig/bigWig files for each of my samples at regions of interest (e.g. peaks) that are bias-corrected and normalized. This normalization is from Gusmao et al. Bioinformatics. 2014, section 2.1 of the paper (equation #1, within datasets normalization only). Between datasets normalization cannot be applied here since rgt-hint tracks only takes one bam file at a time.4-run rgt-hint --differential --bc to calculate activity scores for each tested TF, statistics, and generate lineplots.
5-Lineplots are made by taking the ATAC read coverage (e.g. from bias-corrected signal made like at step 3) at TF motif hits (from step 2 above). Here, I suppose that between-sample normalization is also applied (Gusmao et al., equation 2).
In issue 116, Li mentions a normalization step of the signal (additional to or replacing step 5 above?) in the two tested conditions by taking, for each TF, the ratio of activity in condition A divided by activity in condition B (activity values for each TF are calculated from the output file "differential_statistics.txt", activity in condition i being PSi+TCi), then calculating the median of these ratios to obtain a scaling factor so that the ATAC signal in B can be scaled to be comparable to A, to generate the lineplot curves y-axis values.
Did I get that correctly?
Additionally, I am curious about the two values provided in "differential_factor.txt". In the case of my own analysis, the values are about 0.45 and 2.25, for conditions A and B respectively. However, if I calculate from the "differential_statistics.txt" file, the median of activity in A is 0.95 and median of activity in B is 0.93, giving a ratio of medians (B/A) of 0.97. Calculating instead of median of the ratio for each factor gives 0.99.
I tried to make sense of the python code in reg-gen/rgt/HINT/DifferentialAnalysis.py around lines 234-242 but I am having a hard time understanding how that fits with what Li wrote in the issue thread I referenced above. Reading the function "compute_factors", it seems that the normalization factors are calculated on the sum of ATAC signal at the TF motif hits rather than using the activity scores.
Thanks in advance for any clarification.
Alex
The text was updated successfully, but these errors were encountered: