forked from IFB-ElixirFr/EBAII
-
Notifications
You must be signed in to change notification settings - Fork 1
/
hands-on_empty.Rmd
1048 lines (814 loc) · 45.9 KB
/
hands-on_empty.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "M2.2 BIMS - epigenomics"
author: "Stephanie Le Gras$^{1}$ - [email protected]"
date: "2024 September 05-06"
output:
html_document:
fig_caption: yes
toc: yes
toc_depth: 5
toc_float: yes
number_sections: yes
word_document:
toc: yes
toc_depth: '2'
pdf_document:
fig_caption: yes
keep_tex: yes
toc: yes
toc_depth: 2
bibliography: references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, root.dir="~/Documents/Formations/M2.2-BIMS-epigenomique")
```
$^1$ GenomEast platform, IGBMC
# - Introduction
## - Goal
The aim is to :
* understand the nature of ChIP-Seq data
* perform a analysis workflow including quality check (QC), read mapping, visualization in a genome browser and peak-calling. Use command line and open source software for each step of the workflow and feel the complexity of the task
* have an overview of some possible downstream analyses
## - Dataset description
For this training, we will use two datasets:
* a dataset produced by Myers et al [Pubmed](http://www.ncbi.nlm.nih.gov/pubmed/23818864) involved in the regulation of gene expression under anaerobic conditions in bacteria. We will focus on one factor: **FNR**. The advantage of this dataset is its small size, allowing real time execution of all steps of the dataset
* a dataset of ChIP-seq peaks obtained in different mouse tissues for the p300 co-activator protein by Visel et al. [Pubmed](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2745234/); we will use this dataset to illustrate downstream annotation of peaks using R.
## - Data availability
Data are publicly available as the GEO's experiment dataset GSE41187. Raw data were downloaded from [Sequence Read Archive (SRA)](http://www.ncbi.nlm.nih.gov/sra) and put on the server (see #faq). Samples of interest are :
* FNR IP ChIP-seq Anaerobic A (GEO:GSM1010219 / SRA:SRR576933)
* FNR IP ChIP-seq Anaerobic B (GEO:GSM1010220 / SRA:SRR576934)
* anaerobic INPUT DNA (GEO:GSM1010224 / SRA:SRR576938).
# - Connect to the server and set up your environment
During this training, we will work on the cluster provided by the Institut Français de Bioinformatique (IFB) using JupyterLab through the ondemand system.
1. Go to [ondemand](https://ondemand.cluster.france-bioinformatique.fr/)
2. Select JupyterLab: Core
![alt text][selectjupyterlab]
3. Fill the form as such:
- account: 2421_m22_bims,
- CPUS: 2
- Amount of memory: 10G
- Number of hours: 7
![alt text][jupyterlabform]
4. Once the job is running, click on Connect to Jupyter
![alt text][launchjupyterlab]
## - Set up your working environment
1. Go to the directory used for this training
```{bash eval=FALSE}
cd /shared/projects/2421_m22_bims/
```
2. Create a directory with your user id. You're going to work in this directory. Make sure not to use other's!
```{bash eval=FALSE}
mkdir <userid>
```
3. Go to the newly created directory
```{bash eval=FALSE}
cd <userid>
```
4. Copy the data we are going to use during the training
```{bash eval=FALSE}
cp -r /shared/projects/2421_m22_bims/data .
```
7. Your directory structure should be like this
```
/shared/projects/2421_m22_bims/<userid>
│
└───data
```
If you wish, you can check your directory structure:
```{bash eval=FALSE}
tree
```
# - Quality control of the reads and statistics
**Goal**: Get some basic information on the data (read length, number of reads, global quality of datasets)
## - Generating the FASTQC report
Before you analyze the data, it is crucial to check the quality of the data. We will use the standard tool for checking the quality of data generated on the Illumina platform: [FASTQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
1. Create a directory named **01-QualityControl** in which to output results from fastqc
```{bash eval=FALSE}
```
2. Go to the directory you've just created
```{bash eval=FALSE}
```
Your directory structure should be like this
```
/shared/projects/2421_m22_bims/<userid>
│
└───data
│
└───01-QualityControl <- you should be in this folder
```
3. Get FastQC available in your environment
```{bash eval=FALSE}
module load fastqc/0.12.1
```
4. Check the help page of the program to see its usage and parameters.
```{bash eval=FALSE}
fastqc --help
```
5. Launch the FASTQC program on the experiment file (FNR_IP_ChIP-seq_Anaerobic_A.fastq.gz)
* -o: creates all output files in the specified output directory. '.' means current directory.
```{bash eval=FALSE}
```
6. Wait until the analysis is finished. Check the FastQC result files.
```{bash eval=FALSE}
ls
```
> FNR_IP_ChIP-seq_Anaerobic_A_fastqc.html FNR_IP_ChIP-seq_Anaerobic_A_fastqc.zip
7. Go to the directory /shared/projects/2421_m22_bims/<userid>/1-QualityControl in the tree directory on the left of the jupyterhub window and double click on FNR_IP_ChIP-seq_Anaerobic_A_fastqc.html to visualize the file.
![alt text][fastqc]
8. Launch the FASTQC program on the replicate (FNR_IP_ChIP-seq_Anaerobic_B.fastq.gz) and on the control file (Anaerobic_INPUT_DNA.fastq.gz)
**Analyze the result of the FASTQC program:**
* **How many reads are present in each file ? **
* **What is the read length ? **
* **Is the overall quality good for the three samples ? **
* **Are there any concerns raised by the report ? If so, can you tell where the problem might come from ?**
9. Once you are done with FastQC, unload it
```{bash eval=FALSE}
module unload fastqc/0.12.1
```
# - Mapping the reads with Bowtie
**Goal**: Obtain the coordinates of each read to the reference genome.
## - Choosing a mapping program
There are multiple programs to perform the mapping step. For reads produced by an Illumina machine for ChIP-seq, the currently "standard" programs is Bowtie (versions 1 and 2)[@langmead_ultrafast_2009] [@langmead_fast_2012]. We will use **Bowtie version 2.5.1** for this exercise.
## - Bowtie
1. Load Bowtie
```{bash eval=FALSE}
module load bowtie2/2.5.1
```
2. Try out bowtie
```{bash eval=FALSE}
bowtie2
```
This prints the help of the program. However, this is a bit difficult to read ! If you need to know more about the program, it's easier to directly check out the manual on the [website](https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml).
3. Bowtie needs the reference genome to align each read on it. The genome needs to be in a specific format (=index) for bowtie to be able to use it. Several pre-built indexes are available for download on bowtie webpage, but our genome is not there. You will need to make this index file.
4. Create a directory named **02-Mapping** in which to output mapping results
```{bash eval=FALSE}
```
5. Go to the directory you've just created
```{bash eval=FALSE}
```
## - Prepare the index file
1. To make the index file, you will need the complete genome, in FASTA format. It has already been downloaded to gain time (Escherichia_coli_K12.fasta in the course folder) (The genome was downloaded from the NCBI).
2. Create a directory named **index** in which to output bowtie indexes
```{bash eval=FALSE}
```
3. Go to the newly created directory
```{bash eval=FALSE}
```
4. Try out bowtie2-build
```{bash eval=FALSE}
bowtie2-build
```
5. Build the index for bowtie
```{bash eval=FALSE}
## Creating genome index : provide the path to the genome file and the name to give to the index (Escherichia_coli_K12)
```
6. Go back to upper directory i.e 02-Mapping
```{bash eval=FALSE}
```
## - Mapping the samples
1. Create a directory named **bam** to put mapping results
```{bash eval=FALSE}
```
2. Go to the newly created directory bam
```{bash eval=FALSE}
```
Your directory structure should be like this:
```
/shared/projects/2421_m22_bims/<userid>
│
└───data
│
└───01-QualityControl
│
└───02-Mapping
| └───index
| └───bam <- you should be here
```
3. Let's see the parameters of bowtie before launching the mapping:
* -x to specify genome index prefix
* -U to specify file with reads to be mapped
* -3 will trim x base from the end of the read. As our last position is of low quality, we'll trim 1 base.
* -S will output the result in SAM format
* --mm allows many concurrent bowtie processes on the same computer to share the same memory image of the index
* 2> FNR_IP_ChIP-seq_Anaerobic_A.out will output some statistics about the mapping in the file FNR_IP_ChIP-seq_Anaerobic_A.out
```{bash eval=FALSE}
## Run alignment
## Tip: first type bowtie command line then add quotes around and prefix it with "sbatch --cpus 10 --wrap="
module load formation/2421
sbatch -p fast -o FNR_IP_ChIP-seq_Anaerobic_A.mapping.out --cpus-per-task 10 --wrap="bowtie2 -p 10 --mm -3 1 -x ../index/Escherichia_coli_K12 -U ../../data/FNR_IP_ChIP-seq_Anaerobic_A.fastq.gz -S FNR_IP_ChIP-seq_Anaerobic_A.sam"
```
This should take few minutes as we work with a small genome. For the human genome, we would need either more time and more resources.
**Analyze the result of the mapped reads:
Open the file FNR_IP_ChIP-seq_Anaerobic_A.mapping.out (for example using the ` less ` command), which contains some statistics about the mapping. How many reads were mapped? How many multi-mapped reads were originally present in the sample? To quit less press 'q'**
Bowtie output is a [SAM](https://samtools.github.io/hts-specs/SAMv1.pdf) file. The SAM format corresponds to large text files, that can be compressed ("zipped") into a BAM format. The BAM files takes up to 4 time less disk space and are usually sorted and indexed for fast access to the data it contains. The index of a given <prefix>.bam file is named <prefix>.bam.bai or <prefix>.bai file. Some tools require to have the index of the bam file to process it.
4. multimapped reads are given a very low mapping quality (below 10). Remove reads which mapping quality is below 10, sort the sam file and create a bam file using samtools [@li_sequence_2009]. samtools view is used to filter data based on mapping quality and samtools sort is used to sort data based on genomic coordinates.
* -@: number of processors to use
* -q: to set a threshold to the mapping quality
* -b: to output a BAM file (it is a SAM file by default)
* -o: to specify a output file name
```{bash eval=FALSE}
## First load samtools
module load samtools/1.18
## Then run samtools
samtools view -@ 2 -q 10 -b FNR_IP_ChIP-seq_Anaerobic_A.sam | samtools sort -@ 2 - -o FNR_IP_ChIP-seq_Anaerobic_A.bam
```
5. Create an index for the bam file
```{bash eval=FALSE}
```
6. Compress the .sam file (you could also delete the file)
```{bash eval=FALSE}
```
7. Once it's done, unload the tools you used
```{bash eval=FALSE}
module unload samtools/1.18 bowtie2/2.5.1
```
## - Map the second replicate and the control
1. Repeat the steps above (3 -> 6 - Mapping) for the files FNR_IP_ChIP-seq_Anaerobic_B.fastq.gz and Anaerobic_INPUT_DNA.fastq.gz in the directory named "**bam**" within the directory 02-Mapping.
**Analyze the result of the mapped reads:
How many reads were mapped for samples Anaerobic_INPUT_DNA and FNR_IP_ChIP-seq_Anaerobic_B?**
# - Estimating the number of duplicated reads
**Goal**: Duplicated reads i.e reads mapped at the same positions in the genome are present in ChIP-seq results. They can arise from several reasons including a biased amplification during the PCR step of the library prep, DNA fragments coming from repetitive elements of the genome, sequencing saturation or the same clusters read several times on the flowcell (i.e optical duplicates). As analyzing ChIP-Seq data consist in detecting signal enrichment, we can not keep duplicated reads for subsequent analysis. So let's detect them using [Picard](http://broadinstitute.github.io/picard/) [@broadinstitute_picard].
1. Go to the directory with alignment files
```{bash eval=FALSE}
cd /shared/projects/2421_m22_bims/<userid>/02-Mapping/bam
```
2. Run Picard markDuplicates to mark duplicated reads (= reads mapping at the exact same location on the genome)
* CREATE_INDEX: Create .bai file for the result bam file with marked duplicate reads
* INPUT: input file name to mark for duplicate reads
* OUTPUT: output file name
* METRICS: file with duplicates marking statistics
* VALIDATION_STRINGENCY: Validation stringency for all SAM files read by picard.
```{bash eval=FALSE}
## Load picard
module load picard/2.23.5
## Run picard
```
To determine the number of duplicated reads marked by Picard, we can run the `samtools flagstat` command:
```{bash eval=FALSE}
## Add samtools to your environment
module load samtools/1.18
## run samtools
```
**Run picard MarkDuplicates on the 2 other samples. How many duplicates are found in each sample?**
Go back to working home directory (i.e /shared/projects/2421_m22_bims/<userid>/)
```{bash eval=FALSE}
## Unload picard and samtools
module unload samtools/1.18 picard/2.23.5
## If you are in 02-Mapping/bam
```
# - ChIP quality controls
**Goal**: This exercise aims at plotting the **Lorenz curve** to assess the quality of the chIP.
## - Plot the Lorenz curve with Deeptools
1. Create a directory named **03-ChIPQualityControls** in which to put mapping results for IP
```{bash eval=FALSE}
```
2. Go to the newly created directory
```{bash eval=FALSE}
```
3. Run Deeptools [plotFingerprint](http://deeptools.readthedocs.io/en/latest/content/tools/plotFingerprint.html) [@ramirez_deeptools2:_2016] to draw the Lorenz curve
* -b: List of indexed BAM files
* -plot: File name of the output figure (extension can be either “png”, “eps”, “pdf” or “svg”)
* --numberOfSamples: how many regions are used to plot the graph
* -p: Number of processors to use (2 processors)
```{bash eval=FALSE}
## Load deeptools in your environment
module load deeptools/3.5.4
## Run deeptools fingerprint
```
4. If plotFingerprint takes to much time to run. Take the file that has already been prepared for the training.
```{bash eval=FALSE}
cp /shared/home/slegras/2421_m22_bims/slegras/03-ChIPQualityControls/fingerprint.png .
```
5. Go find the file using the directory tree on the left of the Jupyterlab panel and click on the fingerprint.png file to display it in Jupyterlab.
**Look at the result files fingerprint.png (add the plot to this report). Give an explanation of the curves?**
Go back to the working home directory (i.e /shared/projects/2421_m22_bims/\<login\>)
```{bash eval=FALSE}
## Unload deepTools
module unload deeptools/3.5.4
## If you are in 03-ChIPQualityControls
cd ..
```
# - Visualizing the data in a genome browser
**Goal**: Check whether the IP worked: visualize the data in their genomic context.
## - Choosing a genome browser
There are several options for genome browsers, divided between the local browsers (which you need to install on your computer, eg. IGV) and the online genome browsers (eg. UCSC genome browser, Ensembl). We often use both types, depending on the aim and the localization of the data.
If the data are on your computer, to prevent data transfer, it's easier to visualize the data locally (IGV). Note that if you're working on a non-model organism, the local viewer will be the only choice. If the aim is to share the results with your collaborators, view many tracks in the context of many existing annotations, then the online genome browsers are more suitable.
## - Viewing the raw alignment data in IGV
1. Download the following files from the server onto your computer
* data/Escherichia_coli_K12.fasta
* data/Escherichia_coli_K_12_MG1655.annotation.fixed.gtf
* 02-Mapping/bam/FNR_IP_ChIP-seq_Anaerobic_A.bam
* 02-Mapping/bam/FNR_IP_ChIP-seq_Anaerobic_A.bam.bai
* 02-Mapping/bam/FNR_IP_ChIP-seq_Anaerobic_B.bam
* 02-Mapping/bam/FNR_IP_ChIP-seq_Anaerobic_B.bam.bai
* 02-Mapping/bam/Anaerobic_INPUT_DNA.bam
* 02-Mapping/bam/Anaerobic_INPUT_DNA.bam.bai
2. Open IGV on your computer
3. Load the genome
* Genomes / Load Genome from File...
* Select the fasta file Escherichia_coli_K12.fasta located into the data directory
4. Load an annotation file named Escherichia_coli_K_12_MG1655.annotation.fixed.gtf into IGV
* File / Load from File...
* Select the annotation file. The positions of the genes are now loaded.
5. Load the three bam files (FNR_IP_ChIP-seq_Anaerobic_A.bam, FNR_IP_ChIP-seq_Anaerobic_B.bam and Anaerobic_INPUT_DNA.bam) in IGV.
* File / Load from File...
* Select the bam files.
![alt text][igvbam]
**Browse around in the genome. Specifically go to the following genes: pepT (geneID:b1127), ycfP (geneID:b1108). Do you see peaks (add screenshots to this report).**
However, looking at BAM file as such does not allow to directly compare the two samples as data are not normalized. Let's generate normalized data for visualization.
## - Viewing scaled data
[bamCoverage](https://deeptools.readthedocs.io/en/latest/content/tools/bamCoverage.html) from deepTools generates BigWigs out of BAM files
1. Try it out
```{bash eval=FALSE}
## Load deeptools in your environment
module load deeptools/3.5.4
## run bamCoverage
bamCoverage --help
```
2. Create a directory named **04-Visualization** to store bamCoverage outputs
```{bash eval=FALSE}
```
3. Go to the newly created directory
```{bash eval=FALSE}
```
Your directory structure should be like this:
```
/shared/projects/2421_m22_bims/<userid>
│
└───data
│
└───01-QualityControl
│
└───02-Mapping
| └───index
| └───bam
│
└───03-ChIPQualityControls
│
└───04-Visualization <- you should be in this folder
```
4. Generate a scaled bigwig file on the IP with bamCoverage
* --bam: BAM file to process
* --outFileName: output file name
* --outFileFormat: output file type (bigwig)
* --effectiveGenomeSize : size of the mappable genome (use 4639675)
* --normalizeUsing : different overall normalization methods; we will use CPM
* --skipNonCoveredRegions: skip non-covered regions
* --extendReads 200: Extend reads to fragment size
* --ignoreDuplicates: reads that have the same orientation and start position will be considered only once
```{bash eval=FALSE}
bamCoverage \
--bam ../02-Mapping/bam/Marked_FNR_IP_ChIP-seq_Anaerobic_A.bam \
--outFileName FNR_IP_ChIP-seq_Anaerobic_A_nodup.bw \
--outFileFormat bigwig \
--effectiveGenomeSize 4639675 \
--normalizeUsing CPM \
--skipNonCoveredRegions \
--extendReads 200 \
--ignoreDuplicates
```
5. Do it for the replicate and the control.
6. Download the three bigwig files you have just generated
* 04-Visualization/FNR_IP_ChIP-seq_Anaerobic_A_nodup.bw
* 04-Visualization/FNR_IP_ChIP-seq_Anaerobic_B_nodup.bw
* 04-Visualization/Anaerobic_INPUT_DNA_nodup.bw
7. Load the three bigwig files in IGV
* File / Load from File...
* Select the three bigwig files.
8. Set the visualization of the three bigwig files to be autoscaled
* Click right on the name of the tracks and select **Autoscale**
* Click right on the name of the tracks and set the windowing function to Maximum
**Go back to the genes we looked at earlier: pepT, ycfP (add screenshots to this report). Look at the shape of the signal.**
**Keep IGV opened.**
Go back to working home directory (i.e /shared/projects/2421_m22_bims/<userid>)
```{bash eval=FALSE}
## If you are in 04-Visualization
```
# - Peak calling with MACS2
**Goal**: Detect the peaks which are regions with high densities of reads and that correspond to where the studied factor was bound
## - Choosing a peak-calling program
There are multiple programs to perform the peak-calling step. Some are more directed towards histone marks (broad peaks) while others are specific to transcription factors which present narrow peaks. Here we will use the callpeak function of MACS2 (version 2.2.7.1) because it's known to produce generally good results, and it is well-maintained by the developer.
## - Calling the peaks
1. Create a directory named **05-PeakCalling** and one directory named **replicates** within to store peaks coordinates.
```{bash eval=FALSE}
```
2. Go to the newly created directory replicates
```{bash eval=FALSE}
```
3. Try out MACS2
```{bash eval=FALSE}
## Load macs2 in your environment
module load macs2/2.2.7.1
macs2 callpeak --help
```
This prints the help of the program.
4. Let's see the parameters of MACS before launching the mapping:
* ChIP-seq tag file (-t) is the name of our experiment (treatment) mapped read file FNR_IP_ChIP-seq_Anaerobic_A.bam
* ChIP-seq control file (-c) is the name of our input (control) mapped read file Anaerobic_INPUT_DNA.bam
* --format BAM indicates the input file are in BAM format. Other formats can be specified (SAM,BED...)
* --gsize Effective genome size: this is the size of the genome considered "usable" for peak calling. This value is given by the MACS developers on their website. It is smaller than the complete genome because many regions are excluded (telomeres, highly repeated regions...). The default value is for human (2700000000.0), so we need to change it. As the value for E. coli is not provided, we will take the complete genome size 4639675.
* --name provides a prefix for the output files. We set this to FNR_Anaerobic_A, but it could be any name.
* --bw The bandwidth is the size of the fragment extracted from the gel electrophoresis or expected from sonication. By default, this value is 300bp. Usually, this value is indicated in the Methods section of publications. In the studied publication, a sentence mentions "400bp fragments (FNR libraries)". We thus set this value to 400.
* --fix-bimodal indicates that in the case where macs2 cannot find enough paired peaks between the plus strand and minus strand to build the shifting model, it can bypass this step and use a extension size of 200bp by default.
* -p 1e-2 indicates that we report the peaks if their associated p-value is lower than 1e-2. This is a relaxed threshold as we want to keep a high number of false positives in our peak set to later compute the IDR analysis.
* &> MACS.out will output the verbosity (=information) in the file MACS.out
```{bash eval=FALSE}
```
5. Run macs2 for replicate A and replicate B.
6. In a new directory called pool, run macs2 for the pooled replicates A and B by giving both bam files as input treatment files (-t).
```{bash eval=FALSE}
# You should be in 05-PeakCalling
# Run macs2 for pooled replicates
```
## - Analyzing MACS results
**Look at the files that were created by MACS. Explain the content of the result files ?**
**How many peaks were detected by MACS2 for each sample and in the pool of samples ?**
## - Calling peaks in a replicate-aware method (IDR)
In order to take advantage of having biological replicates, we will create a combine set of peaks based on the reproducibility of each individual replicate peak calling. We will use the **Irreproducible Discovery Rate** (IDR) algorithm.
1. Create a new directory to store the peak coordinates resulting after idr analysis
```{bash eval=FALSE}
## You should be 05-PeakCalling
```
Your directory structure should be like this:
```
/shared/projects/2421_m22_bims/<userid>
│
└───data
│
└───01-QualityControl
│
└───02-Mapping
| └───index
| └───bam
│
└───03-ChIPQualityControls
│
└───04-Visualization
|
└───05-PeakCalling
| └───replicates
| └───pool
| └───idr <- you should be in this folder
```
2. Load the module idr and have a look at its parameters
```{bash eval=FALSE}
## Load idr in your environment
module load idr/2.0.4.2
idr --help
```
* --samples : peak files of each individual replicate
* --peak-list : the peak file of the pooled replicates, it will be used as a master peak set to compare with the regions from each replicates
* --input-file-type : format of the peak file, in our case it is narrowPeak
* --output-file : name of the result file
* --plot : plot additional diagnosis plot
3. Run idr
```{bash eval=FALSE}
```
**Add the IDR graph to this report. How many peaks are found with the IDR method?**
4. Remove IDR and MACS2 from your environment and go back to working home directory (i.e /shared/projects/2421_m22_bims/<userid>)
```{bash eval=FALSE}
module unload macs2/2.2.7.1
module unload idr/2.0.4.2
## If you are in 05-PeakCalling/idr
cd ../..
```
## - Visualize peaks into IGV
1. Download the following BED files from the server into your computer to visualise in IGV :
* 05-PeakCalling/replicates/FNR_Anaerobic_A_peaks.narrowPeak
* 05-PeakCalling/replicates/FNR_Anaerobic_B_peaks.narrowPeak
* 05-PeakCalling/pool/FNR_Anaerobic_pool_peaks.narrowPeak
* 05-PeakCalling/idr/FNR_anaerobic_idr_peaks.bed
**Go back again to the genes we looked at earlier: pepT, ycfP. Do you see peaks (add the 2 screenshots to this report)?**
**Navigate throught the genome to find peaks detected in the replicates (peak calling per replicate) and not found/kept with the IDR method**
**From now on, peak set we keep is the IDR peak set.**
# - Motif analysis
**Goal**: Define binding motif(s) for the ChIPed transcription factor and identify potential cofactors
## - Retrieve the peak sequences corresponding to the peak coordinate file (BED)
For the motif analysis, you first need to extract the sequences corresponding to the peaks. There are several ways to do this (as usual...). If you work on a UCSC-supported organism, the easiest is to use RSAT fetch-sequences or Galaxy. Here, we will use Bedtools [@bedtools], as we have the genome of interest on our computer (Escherichia_coli_K12.fasta).
1. Create a directory named **06-MotifAnalysis** to store data needed for motif analysis
```{bash eval=FALSE}
```
2. Go to the newly created directory
```{bash eval=FALSE}
```
Your directory structure should be like this:
```
/shared/projects/2421_m22_bims/<userid>
│
└───data
│
└───01-QualityControl
│
└───02-Mapping
| └───index
| └───bam
│
└───03-ChIPQualityControls
│
└───04-Visualization
│
└───05-PeakCalling
│
└───06-MotifAnalysis <- you should be in this folder
```
3. Extract peak sequence in fasta format
```{bash eval=FALSE}
## First load samtools
module load samtools/1.18
## Create an index of the genome fasta file
## First load bedtools
module load bedtools/2.30.0
## Extract fasta sequence from genomic coordinate of peaks
```
4. Download the file FNR_anaerobic_idr_peaks.fa on your computer
## - Motif discovery with RSAT
1. Open a connection to a Regulatory Sequence Analysis Tools server. You can choose between various website mirrors.
* Teaching Server (recommended for this training) [https://rsat.france-bioinformatique.fr/teaching/](https://rsat.france-bioinformatique.fr/teaching/)
2. In the left menu, click on **NGS ChIP-seq** and then click on **peak-motifs**. A new page opens, with a form
3. The default peak-motifs web form only displays the essential options. There are only two mandatory parameters.
* The **title box**, which you will set as **FNR Anaerobic** . The **sequences**, that you will **upload from your computer**, by clicking on the button Choose file, and select the file **FNR_anaerobic_idr_peaks.fa** from your computer.
4. We will now modify some of the advanced options in order to fine-tune the analysis according to your data set.
* Open the "Reduce peak sequences" title, and make sure the **Cut peak sequences: +/- ** option is set to **0** (we wish to analyze our full dataset)
* Open the “Motif Discovery parameters” title, and check the **oligomer sizes 6 and 7** (but not 8). Check "Discover over-represented spaced word pairs **[dyad-analysis]**"
* Under “Compare discovered motifs with databases”, **add RegulonDB prokaryotes** (2015_08) as the studied organism is the bacteria E. coli.
5. Click “**GO**”.
6. The Web page displays a link, You can already click on this link. The report will be progressively updated during the processing of the workflow.
**Is there anything interesting in RSAT results? If so, which motif is of interest and why (add screenshot of the results).**
# - Peak annotation
**Goals**: Associate ChIP-seq peaks to genomic features, identify closest genes and run ontology analyses
1. Create a directory named **07-PeakAnnotation**
```{bash eval=FALSE}
# aller dans le répertoire si besoin
```
2. Go to the newly created directory
```{bash eval=FALSE}
```
## - Associate peaks to closest genes
[annotatePeaks.pl](http://homer.ucsd.edu/homer/ngs/annotation.html) from the Homer suite [@heinz_simple_2010] associates peaks with nearby genes.
1. Create a file suitable for annotatePeaks.pl. To run the tool needs a peak bed file composed of 6 fields (chr, start, end, name, score, strand). The 5 first columns of the file ../05-PeakCalling/idr/FNR_anaerobic_idr_peaks.bed are good but all other colums are of no use and the strand is missing. To generate a file with a correct format, we are using the tool cut to select fields 1 to 5 of the peak file and we add a "+" to every line using awk (*this is code example that can do what we want, not the only solution to do so.*).
```{bash eval=FALSE}
cut \
-f 1-5 \
../05-PeakCalling/idr/FNR_anaerobic_idr_peaks.bed | \
awk -F "\t" '{print $0"\t+"}' \
> FNR_anaerobic_idr_peaks.bed
```
2. Try annotatePeaks.pl
```{bash eval=FALSE}
## First load bedtools
module load homer/4.11
## run Homer annotatePeaks
annotatePeaks.pl --help
```
Let's see the parameters:
annotatePeaks.pl peak/BEDfile genome > outputfile
User defined annotation files (default is UCSC refGene annotation):
annotatePeaks.pl accepts GTF (gene transfer formatted) files to annotate positions relative
to custom annotations, such as those from de novo transcript discovery or Gencode.
-gtf <gtf format file> (Use -gff and -gff3 if appropriate, but GTF is better)
3. Annotation peaks with nearby genes with Homer
```{bash eval=FALSE}
```
**Look at the file you generated. Gene symbols are not present. Let's add them with some R code.**
4. Launch Rstudio in [ondemand](https://ondemand.cluster.france-bioinformatique.fr/)
![alt text][launchrstudio]
5. Add gene symbol annotation using R with Rstudio
```{R eval=FALSE, include=TRUE}
## set working directory
setwd("/shared/projects/2421_m22_bims/<userid>/07-PeakAnnotation")
## Or navigate using the "Files" tab and click on "More">"Set as Working Directory"
## read the file with peaks annotated with homer
## data are loaded into a data frame
## sep="\t": this is a tab separated file
## header=TRUE: there is a line with headers (ie. column names)
d <-
## Load a 2-columns files which contains in the first column gene IDs
## and in the second column gene symbols
## data are loaded into a data frame
## header=FALSE: there is no header line
gene.symbol <-
## Merge the 2 data frames based on a common field
## by.x gives the columns name in which the common field is for the d data frame
## by.y gives the columns name in which the common field is for the gene.symbol data frame
## d contains several columns with no information. We select only interesting columns
d.annot <-
## Change column names of the resulting data frame
colnames(d.annot)[2] <- "PeakID" # name the 2d column of the new file "PeakID"
colnames(d.annot)[dim(d.annot)[2]] <- "Gene.Symbol"
## output the merged data frame to a file named "FNR_anaerobic_idr_final_peaks_annotation.tsv"
## col.names=TRUE: output column names
## row.names=FALSE: don't output row names
## sep="\t": table fields are separated by tabs
## quote=FALSE: don't put quote around text.
write.table
```
**What information is listed in each column of the file? (print column names and explain them)**
**How many genes are associated to the "promoter-TSS" feature?**
**What are all the possible gene features? (see in column Annotation - extract information like promoter-TSS, TSS, ...). Create a plot (pie chart, barplot...) showing the proportion of each of them (include both the plot and the code that created it in the report).**
6. Go back to working home directory (i.e /shared/projects/training/\<login\>/M2.2-BIMS-epigenomique)
```{bash eval=FALSE}
## If you are in 07-PeakAnnotation
```
## - Search for Biological Processes, Molecular Functions or Cellular Compartments enrichment
Use Official gene symbols of the file FNR_anaerobic_idr_final_peaks_annotation.tsv to search for enriched gene ontologies with the tool DAVID (Database for Annotation, Visualization and Integrated Discovery). Input your gene list on the DAVID website: https://david.ncifcrf.gov/. **Use DAVID convert ID tool if needed**
**Are there biological processes enriched in the list of genes associated to the peaks? Show the top results of the Functional Annotation Clustering.**
**Are these genes enriched in some KEGG pathway? Which ones?**
# - Bonus: Annotation of ChIP-peaks using R tools
**In this part, we will use a different set of peaks obtained using a peak caller from a set of p300 ChIP-seq experiments in different mouse embryonic tissues (midbrain, forebrain and limb).**
## - Obtain the bed files from GEO
1. We will download the already called peak files in bed format from GEO.
Create a new folder and go in it.
```{bash eval=FALSE}
cd /shared/projects/2421_m22_bims/<userid>
```
2. Search for the dataset **GSE13845** either using Google or from the front page of [GEO](https://www.ncbi.nlm.nih.gov/geo/)
3. On the description page, find the three GSM files, and click on each of then
4. On each page, select and download the `GSMxxxxx_p300_peaks.txt.gz` file to the newly created folder (where `xxxxx` represents the GSM number)
You should now have downloaded 3 files:
> GSM348064_p300_peaks.txt.gz (Forebrain)
> GSM348065_p300_peaks.txt.gz (Midbrain)
> GSM348066_p300_peaks.txt.gz (limb)
```{bash eval=TRUE, include=FALSE}
mkdir 07-PeakAnnotation-bonus
cd 07-PeakAnnotation-bonus
curl -O https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM348nnn/GSM348064/suppl/GSM348064_p300_peaks.txt.gz
curl -O https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM348nnn/GSM348065/suppl/GSM348065_p300_peaks.txt.gz
curl -O https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM348nnn/GSM348066/suppl/GSM348066_p300_peaks.txt.gz
```
*Beware: Make sure to check which genome version was used to call the peaks (remember: this is mouse data!)*
## - Performing a first evaluation of peak sets using R
Now, we will use **RStudio** to perform the rest of the analysis in R. For the analysis, we will need some R/Bioconductor libraries
* [ChIPSeeker](https://bioconductor.org/packages/release/bioc/html/ChIPseeker.html) [@chipseeker]
* [mouse gene annotation](http://bioconductor.org/packages/release/data/annotation/html/TxDb.Mmusculus.UCSC.mm9.knownGene.html)
* [mouse functional annotation](http://bioconductor.org/packages/release/data/annotation/html/org.Mm.eg.db.html)
* [clusterProfiler: Gene set annotation tool](http://bioconductor.org/packages/release/bioc/html/clusterProfiler.html) [@clusterprofiler]
1. Go to Rstudio and execute the R code below (show results in the report)
```{r eval=TRUE, message=FALSE, warning=FALSE, include=TRUE}
# load the required libraries
library(RColorBrewer)
library(ChIPseeker)
library(TxDb.Mmusculus.UCSC.mm9.knownGene)
library(org.Mm.eg.db)
# define the annotation of the mouse genome
txdb = TxDb.Mmusculus.UCSC.mm9.knownGene
# define colors
col = brewer.pal(9,'Set1')
```
2. read the peak files for the three datasets:
```{r eval=FALSE, include=TRUE}
# set the working directory to the folder in which the peaks are stored
setwd("/shared/projects/2421_m22_bims/<userid>/07-PeakAnnotation-bonus")
```
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
peaks.forebrain =
peaks.midbrain =
peaks.limb =
```
```{r eval=FALSE, include=TRUE}
# create a list containing all the peak sets
all.peaks = list(forebrain=peaks.forebrain,
midbrain=peaks.midbrain,
limb=peaks.limb)
```
The peaks are stored as **GenomicRanges** object; this is an R format which look like the bed format, but is optimized in terms of memory requirements and speed of execution.
We can start by computing some basic statistics on the peak sets.
### - How many peaks?
```{r eval=FALSE, include=TRUE}
# check the number of peaks for the forebrain dataset
# compute the number of peaks for all datasets using the list object
# display this as a barplot
```
### - How large are these peaks?
```{r eval=FALSE, include=TRUE}
# statistics on the peak length for forebrain
# size distribution of the peaks
# boxplot of the sizes
```
### - What is the score of these peaks?
Can you adapt the previous code to display a boxplot of the peak score distribution for the Forebrain peak set (column `Maximum.Peak.Height`)?
### - Where are the peaks located?
We can now display the genomic distribution of the peaks along the chromosomes, including the peak scores, using the `covplot` function from `ChIPSeeker`:
```{r eval=FALSE, include=TRUE}
# genome wide distribution
covplot(peaks.forebrain, weightCol="Maximum.Peak.Height")
```
**Exercice: use the option "lower" in covplot to display only the peaks with a score (Max.Peak.Height) above 10**
### - How does the signal look like at TSS?
In addition to the genome wide plot, we can check if there is a tendency for the peaks to be located close to gene promoters.
```{r eval=FALSE, include=TRUE}
# define gene promoters
promoter = getPromoters(TxDb=txdb, upstream=5000, downstream=5000)
# compute the density of peaks within the promoter regions
tagMatrix = getTagMatrix(peaks.limb, windows=promoter)
# plot the density
tagHeatmap(tagMatrix, palette = "RdYlBu")
```
## - Functional annotation of the peaks
We can now assign the peaks to the closest genes and genomic compartments (introns, exons, promoters, distal regions, etc...)
This is done using the function `annotatePeak` which compares the peak files with the annotation file of the mouse genome. This function returns
a complex object which contains all this information.
```{r eval=FALSE, include=TRUE}
peakAnno.forebrain = annotatePeak(peaks.forebrain, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Mm.eg.db")
peakAnno.midbrain = annotatePeak(peaks.midbrain, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Mm.eg.db")
peakAnno.limb = annotatePeak(peaks.limb, tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Mm.eg.db")
```
### - genomic localization
We can now analyze more in details the localization of the peaks (introns, exons, promoters, distal regions,...)
```{r eval=FALSE, include=TRUE}
# distribution of genomic compartments for forebrain peaks
plotAnnoPie(peakAnno.forebrain)
# for all the peaks
plotAnnoBar(list(forebrain=peakAnno.forebrain, midbrain=peakAnno.midbrain,limb=peakAnno.limb))
```
**Question: do you see differences between the three peak sets?**
### - functional annotation
An important step in ChIP-seq analysis is to interpret genes that are located close to the ChIP peaks. Hence, we need to
1. assign genes to peaks
2. compute functional enrichments of the target genes.
**Beware:**
By doing so, we assume that the target gene of the peak is always the closest one. Hi-C/4C analysis have shown that in higher eukaryotes, this is not always the case. However, in the absence of data on the real target gene of ChIP-peaks, we can work with this approximation.
We will compute the enrichment of the Gene Ontology "Biological Process" categories in the set of putative target genes.
```{r eval=FALSE, message=FALSE, warning=FALSE, include=TRUE}
# load the library
library(clusterProfiler)
```
```{r eval=FALSE, include=TRUE}
# define the list of all mouse genes as a universe for the enrichment analysis
universe = mappedkeys(org.Mm.egACCNUM)
## extract the gene IDs of the forebrain target genes
genes.forebrain = peakAnno.forebrain@anno$geneId
ego.forebrain = enrichGO(gene = genes.forebrain,
universe = universe,
OrgDb = org.Mm.eg.db,
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
# display the results as barplots
barplot(ego.forebrain,showCategory=10)
```
**Question: do you see an enrichment of the expected categories? What does the x-axis mean? What does the color mean?**
**Exercise:** redo this analysis for the limb dataset and check if the enriched categories make sense.
## FAQ
### How to download the data
**Goal**: Identify the datasets corresponding to the studied article and retrieve the data (reads as FASTQ files) corresponding to 2 replicates of a condition and the corresponding control.
#### - Obtaining an identifier for a chosen dataset
NGS datasets are (usually) made freely accessible for other scientists, by depositing these datasets into specialized databanks. [Sequence Read Archive (SRA)](http://www.ncbi.nlm.nih.gov/sra) located in USA hosted by NCBI, and its European equivalent [European Nucleotide Archive (ENA)](http://www.ebi.ac.uk/ena) located in England hosted by EBI both contains **raw reads**.
Functional genomic datasets (transcriptomics, genome-wide binding such as ChIP-seq,...) are deposited in the databases [Gene Expression Omnibus (GEO)](http://www.ncbi.nlm.nih.gov/geo/) or its European equivalent [ArrayExpress](https://www.ebi.ac.uk/arrayexpress/).
Within an article of interest, search for a sentence mentioning the deposition of the data in a database. Here, the following sentence can be found at the end of the Materials and Methods section:
*"All genome-wide data from this publication have been deposited in NCBI’s Gene Expression Omnibus (**GSE41195**)."*
We will thus use the **GSE41195** identifier to retrieve the dataset from the **NCBI GEO** (Gene Expression Omnibus) database.
#### - Accessing GSE41195 from GEO
1. The GEO database hosts processed data files and many details related to the experiments. SRA (Sequence Read Archive) stores the actual raw sequence data.
2. Search in Google **GSE41195**. Click on the first link to directly access the correct page on the GEO database.
![alt text][geo]
3. This GEO entry is a mixture of expression analysis (Nimblegen Gene Expression Array), chip-chip and chip-seq. At the bottom of the page, click on the subseries related to the chip-seq datasets. (this subseries has its own identifier: **GSE41187**).
![alt text][geo2]
4. From this page, we will focus on the experiment **FNR IP ChIP-seq Anaerobic A**. At the bottom of the page, click on the link "**GSM1010219** - FNR IP ChIP-seq Anaerobic A".
5. In the new page, go to the bottom to find the SRA identifier. This is the identifier of the raw dataset stored in the SRA database.
![alt text][geo3]
6. Copy the identifier **SRX189773** (do not click on the link that would take you to the SRA database, see below why)
#### - Downloading FASTQ file from the ENA database
Although direct access to the SRA database at the NCBI is doable, SRA does not store sequences in a FASTQ format. So, in practice, it's simpler (and quicker!!) to download datasets from the ENA database (European Nucleotide Archive) hosted by EBI (European Bioinformatics Institute) in UK. ENA encompasses the data from SRA.
1. Go to the [EBI](http://www.ebi.ac.uk/) website. Paste your SRA identifier (SRX189773) and click on the button "search".