# ggplot2
```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=5, fig.height=4,
echo=TRUE, warning=FALSE, message=FALSE)
```
* Graphing package inspired by the **G**rammar of **G**raphics work of Leland Wilkinson:
* *The Grammar of Graphics is based on the idea that every graphic can be broken down into a series of components or layers. These components include the data, the aesthetic mapping, the geometric shapes, the statistical transformation, and the scales.* [source](https://medium.com/aiskunks/data-visualization-grammar-of-graphics-fccf78379b52)
* Flexible, versatile, customizable.
* Well documented.
<img src="images/ggplot_plots_cscherer.png" alt="import zip" width="700"/>
*image from https://www.cedricscherer.com/img/ggplot-tutorial/overview.png*
## Getting started
A ggplot graph needs at least 3 components:
* **Data**: that is the source data that we want to represent.
* **Aesthetic** mappings: they describe what will be visualized from **data**. What are you trying to show?
* **Geometrics**: functions that represent what we see in the graph: lines, points, boxes, etc. For example:
    * geom_point()
    * geom_line()
    * geom_histogram()
    * geom_boxplot()
    * geom_bar()
    * geom_smooth()
    * geom_tile()
The base structure is the following:
**ggplot(\<DATA\>, \<AESTHETICS\>) + \<GEOMETRICS\>**
For example if we want to represent **column1** (on the x axis) and **column2** (on the y axis) of **data** as **points**, we can use the following structure:
```{r, eval=F, echo=TRUE}
ggplot(data=dataframe, mapping=aes(x=column1, y=column2)) + geom_point()
```
This will be our template as we explore different types of graphs.
We can add **more layers and components** to this base structure to customize the plot, as we will see in the next examples.
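To make the template concrete, here is a minimal, self-contained sketch. The `toy` data frame and its values are made up for illustration (they are not part of the workshop files); only the `ggplot()`, `aes()` and `geom_point()` calls follow the template above.

```{r}
library(ggplot2)

# Hypothetical data: two numeric columns, invented for this illustration
toy <- data.frame(column1 = c(1, 2, 3, 4, 5),
                  column2 = c(2.1, 3.9, 6.2, 8.1, 9.8))

# <DATA> + <AESTHETICS> + <GEOMETRICS>
p <- ggplot(data = toy, mapping = aes(x = column1, y = column2)) +
  geom_point()

# Printing the object draws the plot
p
```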
## Scatter plot
### Base plot
We can start from the **geneexp** object, that holds the content of file *expression_20genes.csv*: we want to plot **sample1** on the x axis and **sample2** on the y axis.
The base layer will be the following:
```{r, eval=T}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2))
```
Copy-paste this in the console, and hit Enter.
As you can see, nothing is plotted yet: the base is set.
Adding to the base layer the geometrics called **geom_point()**, we **tell ggplot to produce a scatter/point plot**:
```{r, eval=T}
# This line is a comment: a comment is not interpreted by R.
# Example of a scatter plot: add the geom_point() layer
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point()
# Note that the new line is NOT necessary after the "+": it is done for clarity / readability.
```
Please, copy the code above in your script, and hit Enter!
Your plot should appear in the "Plots" tab in the bottom-right panel.
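Because the base layer and the geometrics are combined with `+`, a plot can also be stored in an object and extended later. The sketch below assumes a small made-up data frame `toy` standing in for **geneexp** (the values are invented).

```{r}
library(ggplot2)

# Hypothetical stand-in for geneexp (invented values)
toy <- data.frame(sample1 = c(5, 12, 20, 33, 41),
                  sample2 = c(7, 10, 24, 30, 45))

# Store the base layer: it holds the data and the mapping, but no geometrics
base <- ggplot(data = toy, mapping = aes(x = sample1, y = sample2))

# Add the point layer to the stored base, then print the result
scatter <- base + geom_point()
scatter
```

Storing the base once can be convenient when you want to try several geometrics on the same data and mapping.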
### Customize the points
**geom_point()** can take parameters, including the point color and size:
Color all points in red:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red")
```
Increase point size (default size is 1.5):
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5)
```
This is a good place to introduce the **help pages** of functions.
Functions in **ggplot2** (and **tidyverse** in general) are richly documented.
While documentation can be quite technical it is always good practice to take a look at it.
You can access the help page of a function in the **Help** tab in the bottom-right panel. Give it a try with "geom_point":
<img src="images/ggplot_help.png" alt="rstudio help" width="500"/>
Back to our customization: let's set different shapes for the points!
This is done by setting the **shape** parameter in **geom_point()**.
Points can become, for example, triangles:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="triangle")
```
See more options in the following image:
<img src="images/ggplot2_shape.png" alt="import zip" width="700"/>
*Image from ggplot2 documentation*
Note that you can also replace the points by any character, the following way:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="$")
```
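Shapes can also be specified by number (0 to 25) instead of by name. Shapes 21 to 25 have both an outline (`color`) and an interior (`fill`). A small sketch, using a made-up `toy` data frame in place of **geneexp**:

```{r}
library(ggplot2)

# Hypothetical stand-in for geneexp (invented values)
toy <- data.frame(sample1 = c(5, 12, 20, 33, 41),
                  sample2 = c(7, 10, 24, 30, 45))

# Shape 21 is a filled circle: black outline, red interior.
# "stroke" controls the thickness of the outline.
p_shape <- ggplot(data = toy, mapping = aes(x = sample1, y = sample2)) +
  geom_point(shape = 21, color = "black", fill = "red", size = 4, stroke = 1)
p_shape
```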
### Add more layers
We can add more layers to the plot, using the same structure (**+ layer_name()**)
#### ggtitle()
Add a title using the **ggtitle()** layer:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="my first ggplot")
```
**label** is a parameter of the **ggtitle()** function.
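The title can also be set with **labs()**, which can rename the axis titles in the same call. A minimal sketch, with a made-up `toy` data frame standing in for **geneexp** and label texts chosen only for illustration:

```{r}
library(ggplot2)

# Hypothetical stand-in for geneexp (invented values)
toy <- data.frame(sample1 = c(5, 12, 20, 33, 41),
                  sample2 = c(7, 10, 24, 30, 45))

# labs() sets the plot title and the axis titles together
p_labs <- ggplot(data = toy, mapping = aes(x = sample1, y = sample2)) +
  geom_point(color = "red", size = 2.5, shape = "diamond") +
  labs(title = "my first ggplot",
       x = "Expression in sample 1",
       y = "Expression in sample 2")
p_labs
```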
#### Background
Not a big fan of the grey background?
This is the default "theme", but there are [more options](https://ggplot2.tidyverse.org/reference/ggtheme.html).
For example:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="theme grey (the default theme)") +
theme_grey()
```
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="theme linedraw") +
theme_linedraw()
```
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="theme bw = black and white") +
theme_bw()
```
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="theme void") +
theme_void()
```
Here is a good page to check the different backgrounds:
https://ggplot2-book.org/themes#sec-theme
Note that you can also change some settings globally as you use a new theme, e.g.
* *base_size*: by default, 11.
* *base_family*: the font family (by default, the default sans-serif font of the graphics device, e.g. Arial). To check the fonts that are available, type *systemfonts::system_fonts()$family*
* *base_line_size*: by default, base_size/22.
* *base_rect_size*: by default, base_size/22
```{r}
# get the full list of fonts available on your system with: systemfonts::system_fonts()$family
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="my first ggplot") +
theme_bw(base_size=18, base_family = "Laksaman", base_line_size = 2, base_rect_size = 4)
```
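If the same theme should apply to every plot in a session, it can be set once with **theme_set()** instead of being added to each plot. `theme_set()` returns the previously active theme, so the default can be restored afterwards. A sketch with a made-up `toy` data frame (invented values):

```{r}
library(ggplot2)

# Hypothetical stand-in for geneexp (invented values)
toy <- data.frame(sample1 = c(5, 12, 20, 33, 41),
                  sample2 = c(7, 10, 24, 30, 45))

# Make theme_bw() with a larger base font the session default,
# and keep the previous default so it can be restored later
old_theme <- theme_set(theme_bw(base_size = 18))

# No theme layer is needed here: the plot picks up the session default
ggplot(data = toy, mapping = aes(x = sample1, y = sample2)) +
  geom_point()

# Restore the previous default theme
theme_set(old_theme)
```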
#### Regression line
Add a regression line with **geom_smooth()**. A smoothed line can help highlight the dominant pattern/trend.
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="my first ggplot") +
theme_linedraw() +
geom_smooth()
```
Remove the confidence interval:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="my first ggplot") +
theme_linedraw() +
geom_smooth(se=FALSE)
```
Different methods can be used to fit the smoothing line:
* "lm": linear model.
* "glm": generalized linear model.
* "gam": generalized additive model.
* "loess": local polynomial regression.
* A function (more advanced)
By default, the smoothing method is picked based on the size of the largest group across all panels: "loess" is used for fewer than 1,000 observations, and "gam" otherwise.
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="my first ggplot") +
theme_linedraw() +
geom_smooth(se=FALSE, method="lm")
```
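With `method="lm"`, the line drawn by `geom_smooth()` is the least-squares fit of y on x. As a sketch, the same intercept and slope can be obtained with base R's `lm()` and drawn with `geom_abline()`. The `toy` data frame is made up and only stands in for **geneexp**.

```{r}
library(ggplot2)

# Hypothetical stand-in for geneexp (invented values)
toy <- data.frame(sample1 = c(5, 12, 20, 33, 41),
                  sample2 = c(7, 10, 24, 30, 45))

# Fit the linear model directly
fit <- lm(sample2 ~ sample1, data = toy)
coefs <- coef(fit)  # named vector: "(Intercept)" and "sample1" (the slope)
coefs

# The dashed red line (from lm) overlaps the blue line (from geom_smooth)
p_lm <- ggplot(data = toy, mapping = aes(x = sample1, y = sample2)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_abline(intercept = coefs[["(Intercept)"]], slope = coefs[["sample1"]],
              linetype = "dashed", color = "red")
p_lm
```

Note that `geom_abline()` spans the whole panel, while `geom_smooth()` only covers the range of the data.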
<details>
<summary>
*More advanced (as reference, or if someone asks): add correlation coefficient:*
</summary>
You can add the correlation coefficient between the 2 variables, using another function from the {ggpubr} package:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="my first ggplot") +
theme_linedraw() +
geom_smooth() +
ggpubr::stat_cor(method = "pearson", label.x = 3, label.y = 30)
```
</details>
## Save your plot
### From the RStudio interface
Before we dive into more graph types, let's pause and learn how to easily save the current plot.
In the "Plots" tab, click on "Export" and "Save as image":
<img src="images/ggplot_save_image.png" alt="import zip" width="700"/>
From that window, you can:
* Pick an image format between: PNG, JPEG, TIFF, BMP, SVG, EPS.
* Choose where you want to **save the output file** (by default, R will propose the current working directory).
* Choose the **file name**.
* Set the dimensions, by either:
* Setting the Width and Height of the figure (in pixels)
* Moving the graph manually (bottom-right corner of the plot) until you obtain the size and proportions that you want.
<img src="images/ggplot_save_parameters.png" alt="import zip" width="700"/>
### From the console
The best way to save a plot to a file from the console is to use the **ggsave()** function.
First, you need to save the plot to an object (if you don't, ggsave() will save the last plot displayed, which is fine too!).
```{r}
myplot <- ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="my first ggplot")
```
Many different formats are available:
* eps
* ps
* tex
* pdf
* jpeg
* tiff
* png
* bmp
* svg
* wmf
```{r}
ggsave(filename="myplot.png", plot=myplot, device="png")
```
You can specify the units of the plot size: inches "in", centimeters "cm", millimeters "mm" or pixels "px".
You can also specify the **dpi**, i.e. dots per inch.
If we take as an example the requirements of electronic image formats [for Nature publishing group](https://www.nature.com/nature/for-authors/final-submission):
"Layered Photoshop (PSD) or TIFF format (high resolution, 300–600 dots per inch (dpi) )"
We could save the plot as a file the following way:
```{r}
ggsave(filename="myplot.tiff",
plot=myplot,
device="tiff",
dpi=300,
units="in",
width=5, height=5)
```
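The size of the saved image in pixels is the size in inches multiplied by the dpi: 5 inches at 300 dpi gives 1500 pixels per side. The sketch below illustrates this with a made-up `toy` data frame, and writes a PNG to a temporary file (`tempfile()`) so that nothing is created in your working directory; both choices are assumptions made for the example.

```{r}
library(ggplot2)

# Hypothetical stand-in for geneexp (invented values)
toy <- data.frame(sample1 = c(5, 12, 20, 33, 41),
                  sample2 = c(7, 10, 24, 30, 45))
toyplot <- ggplot(data = toy, mapping = aes(x = sample1, y = sample2)) +
  geom_point()

# Pixels = inches x dots per inch
width_in <- 5
dpi <- 300
width_px <- width_in * dpi
width_px

# Save to a temporary file, then check that the file exists
out_file <- tempfile(fileext = ".png")
ggsave(filename = out_file, plot = toyplot,
       dpi = dpi, units = "in", width = width_in, height = width_in)
file.exists(out_file)
```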
## Exercise 1
Time for our first exercise!
Starting from the same object **geneexp**:
1. Create a scatter plot that shows sample2 on the x-axis and sample1 on the y-axis.
<details>
<summary>
correction
</summary>
```{r}
ggplot(data=geneexp, mapping=aes(x=sample2, y=sample1)) +
geom_point()
```
</details>
<br>
2. Change the point color to blue, and the point size to 2.
<details>
<summary>
correction
</summary>
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="blue", size=2)
```
</details>
<br>
3. Change the point shape to "square cross"
<details>
<summary>
correction
</summary>
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="blue", size=2, shape="square cross")
```
</details>
<br>
4. Add the title of your choice.
<details>
<summary>
correction
</summary>
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="blue", size=2, shape="square cross") +
ggtitle(label="my second ggplot")
```
</details>
<br>
5. Add a subtitle (wait: that's new! Check **ggtitle** help page and/or Google "ggtitle subtitle" and see if you can find!)
<details>
<summary>
correction
</summary>
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="blue", size=2, shape="square cross") +
ggtitle(label="my second ggplot", subtitle="nice blue squares")
```
</details>
<br>
6. Save your plot as a JPEG file, in the workshop folder, with dimensions 600X600 pixels.
<details>
<summary>
correction
</summary>
From the interface:
Bottom-right panel -> Plots tab -> Export -> ...
From the console:
```{r}
# first, save in an object
mybluescatterplot <- ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="blue", size=2, shape="square cross") +
ggtitle(label="my second ggplot", subtitle="nice blue squares")
# then save with ggsave
ggsave(filename="myblueplot.jpg", plot=mybluescatterplot,
device="jpeg",
units="px", width=600, height=600)
```
</details>
## Scatter plots: more features
We can customize our scatter plot a bit more.
### Labels
We may want to show the gene names that the points represent.
This is done by:
* setting the **label** parameter, in the ggplot **aes()** function
* adding the **geom_text()** layer
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2, label=Gene)) +
geom_point() +
geom_text()
```
We can adjust the position of the labels relative to the points, so they do not overlap: this is done with **nudge_x** (moves the labels horizontally / on the **x** axis).
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2, label=Gene)) +
geom_point() +
geom_text(nudge_x=1.5)
```
We can also decrease or increase the label size:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2, label=Gene)) +
geom_point() +
geom_text(nudge_x=1.5, size=3)
```
You can also set the label color directly in **geom_text()**. This keeps all labels black, for example, even when color is mapped to a variable:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2, label=Gene)) +
geom_point() +
geom_text(nudge_x=1.5, size=3, color="black")
```
Note that the automatic organization of labels, so that they do not overlap, can be done using the {ggrepel} package.
You only need to load the package and replace **geom_text()** with **geom_text_repel()**:
```{r, eval=T, echo=F}
library(ggrepel)
```
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2, label=Gene)) +
geom_point() +
geom_text_repel()
```
### Color and shape mapping
Point color and shape can be **dependent on another column / variable of the data**. This is called **mapping an aesthetic to a variable**.
Columns used to **conditionally color or shape the points** are set inside the **aes()** function.
For **shape**:
```{r, fig.width=7}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2, label=Gene, shape=DE)) +
geom_point() +
geom_text(nudge_x=1.2, size=3)
```
For **color**:
```{r, fig.width=7}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2, label=Gene, color=DE)) +
geom_point() +
geom_text(nudge_x=1.2, size=3)
```
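The difference between *setting* and *mapping* a color is a common source of confusion: a fixed color goes outside **aes()**, as a parameter of the geometrics, while a column name goes inside **aes()**. A sketch with a made-up `toy` data frame (column names mirror **geneexp**, values are invented):

```{r}
library(ggplot2)

# Hypothetical stand-in for geneexp (invented values)
toy <- data.frame(Gene = c("g1", "g2", "g3", "g4", "g5", "g6"),
                  sample1 = c(5, 12, 20, 33, 41, 50),
                  sample2 = c(7, 10, 24, 30, 45, 20),
                  DE = c("Down", "No", "Up", "No", "Up", "Down"))

# Setting: every point is blue, and no legend is created
p_set <- ggplot(data = toy, mapping = aes(x = sample1, y = sample2)) +
  geom_point(color = "blue")

# Mapping: the color depends on the DE column, and a legend is created
p_map <- ggplot(data = toy, mapping = aes(x = sample1, y = sample2, color = DE)) +
  geom_point()

p_set
p_map
```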
TIP: remove the double labeling in the legend (a letter behind the point because both labels and colors are mapped to the same variable): set **show.legend=FALSE** in **geom_text()**:
```{r, fig.width=7}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2, label=Gene, color=DE)) +
geom_point() +
geom_text(nudge_x=1.2, size=3, show.legend=FALSE)
```
You can change the legend title the following way:
```{r, fig.width=7}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2, label=Gene, color=DE)) +
geom_point() +
geom_text(nudge_x=1.2, size=3, show.legend=FALSE) +
scale_color_discrete(name="DiffExp")
```
<details>
<summary>
*More advanced (as reference, or if someone asks): how to change default colors:*
</summary>
Colors can be set manually using (yet another) layer: **scale_color_manual()**.
```{r, fig.width=7}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2, label=Gene, color=DE)) +
geom_point() +
geom_text(nudge_x=1.2, size=3) +
scale_color_manual(values=c(Down="blue", No="black", Up="red"))
```
</details>
### Additional ticks
**geom_rug** creates a compact visualization along the axes to help read the information of individual cases. You can simply add it as an additional layer.
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="my first ggplot") +
theme_linedraw() +
geom_rug()
```
As usual, you can customize several parameters, such as:
* *sides*: sides where to draw the lines (**t**op, **b**ottom, **r**ight, **l**eft)
* *alpha*: opacity; ranges from 0 (transparent) to 1 (opaque).
* *linewidth*, *linetype*
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="my first ggplot") +
theme_linedraw() +
geom_rug(sides="tr", alpha=0.3, linewidth=1)
```
### Density estimates
**geom_density_2d** performs a 2D kernel density estimation and displays the results with contours.
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="my first ggplot") +
theme_linedraw() +
geom_density_2d()
```
Play with some of the parameters we already know:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point(color="red", size=2.5, shape="diamond") +
ggtitle(label="my first ggplot") +
theme_linedraw() +
geom_density_2d(color="pink", alpha=0.5, linewidth = 2)
```
## Barplots
A barplot (or barchart) is a graph that represents categorical data with rectangular bars, whose heights are proportional to the values they represent.
The first layer of the **ggplot()** function is similar.
However, note that only **x=** is set in **aes()** function (the basic way to plot a barplot):
```{r, eval=F}
ggplot(data=dataframe, mapping=aes(x=column1)) +
geom_bar()
```
Using our previous **geneexp** data, we can produce a bar plot out of the **DE** column, such as:
```{r}
ggplot(geneexp, aes(x=DE)) +
geom_bar()
```
This produces a barplot containing 3 bars: **Down**, **No** and **Up**. Their height represents the number of genes found in each category.
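The bar heights are the same counts that base R's `table()` returns. A quick check on a made-up `toy` data frame (the **DE** values are invented for the example):

```{r}
library(ggplot2)

# Hypothetical data: one categorical column, invented values
toy <- data.frame(DE = c("Down", "No", "Up", "No", "Up", "No"))

# Count the occurrences of each category
counts <- table(toy$DE)
counts

# geom_bar() draws one bar per category, with these counts as heights
ggplot(data = toy, mapping = aes(x = DE)) +
  geom_bar()
```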
## Exercise 2
1. Import file **DataViz_source_files-main/files/gencode.v44.annotation.csv** in an object called **gtf**.
<details>
<summary>
correction
</summary>
```{r}
gtf <- read_csv("DataViz_source_files-main/files/gencode.v44.annotation.csv")
```
This is a small subset of the gencode v44 human gene annotation, created the following way:
* Selection of protein coding genes, long non-coding genes, miRNAs, snRNAs and snoRNAs.
* Selection of chromosomes 1 to 10 only.
* Creation of a random subset of 1000 genes.
* Conversion to a friendly csv format.
</details>
<br>
2. Create a simple barplot representing the count of genes per chromosome:
<details>
<summary>
correction
</summary>
```{r}
ggplot(data=gtf, mapping=aes(x=chr)) +
geom_bar()
```
</details>
<br>
3. Keep the chromosome represented on the x axis, and split the barplot **per gene type**.
TIP: remember how we set **color=** in **mapping=aes()** function in the scatter plot section? Give it a try here!
<details>
<summary>
correction
</summary>
```{r}
ggplot(data=gtf, mapping=aes(x=chr, color=gene_type)) +
geom_bar()
```
</details>
<br>
4. Change **color=** with **fill=** in **aes()**. What changes?
<details>
<summary>
correction
</summary>
```{r}
ggplot(data=gtf, mapping=aes(x=chr, fill=gene_type)) +
geom_bar()
```
</details>
<br>
5. Add a title to the graph:
<details>
<summary>
correction
</summary>
```{r}
ggplot(data=gtf, mapping=aes(x=chr, fill=gene_type)) +
geom_bar() +
ggtitle(label = "Number of genes per chromosome, split by gene type")
```
</details>
<br>
6. Change the default [**theme**](https://ggplot2-book.org/themes):
<details>
<summary>
correction
</summary>
```{r}
ggplot(data=gtf, mapping=aes(x=chr, fill=gene_type)) +
geom_bar() +
ggtitle(label = "Number of genes per chromosome, split by gene type") +
theme_bw()
```
</details>
<br>
7. Save the graph in PNG format in the workshop's directory.
<details>
<summary>
correction
</summary>
```{r}
# save plot in an object
gtfbars <- ggplot(data=gtf, mapping=aes(x=chr, fill=gene_type)) +
geom_bar() +
ggtitle(label = "Number of genes per chromosome, split by gene type") +
theme_bw()
# save as PNG file
ggsave(filename="gtfbarplot.png", plot=gtfbars,
device="png")
```
</details>
## Barplots: bars position
We can also play with the **position** of the bars. By default, position is **stack**, i.e. categories are stacked on top of each other along the bar.
Position **fill** scales data so the top is always 1, i.e. it shows **proportions**, instead of the absolute values:
```{r}
ggplot(data=gtf, mapping=aes(x=chr, fill=gene_type)) +
geom_bar(position="fill")
```
Position **dodge** represents each category (here, gene types) side-by-side:
```{r}
ggplot(data=gtf, mapping=aes(x=chr, fill=gene_type)) +
geom_bar(position="dodge")
```
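To see what each position draws, the underlying numbers can be computed with base R: `table()` gives the counts used by **stack** and **dodge**, and `prop.table()` gives the per-chromosome proportions used by **fill**. The `toy` data frame below is made up and only mimics the **chr** and **gene_type** columns of **gtf**.

```{r}
library(ggplot2)

# Hypothetical data mimicking the gtf columns (invented values)
toy <- data.frame(
  chr = c("chr1", "chr1", "chr1", "chr1", "chr2", "chr2"),
  gene_type = c("protein_coding", "protein_coding", "protein_coding",
                "lncRNA", "protein_coding", "lncRNA")
)

# Counts per chromosome and gene type (what "stack" and "dodge" display)
counts <- table(toy$chr, toy$gene_type)
counts

# Proportions within each chromosome (what "fill" displays): each row sums to 1
props <- prop.table(counts, margin = 1)
props

ggplot(data = toy, mapping = aes(x = chr, fill = gene_type)) +
  geom_bar(position = "fill")
```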
<details>
<summary>
*More advanced (as reference, or if someone asks): how to reorder x-axis labels:*
</summary>
Factors are a data type in R: they are used to represent categorical data. Using factors requires a bit more understanding of how R works, but here is an application using **ordered factors/categories**:
```{r}
ggplot(data=gtf, mapping=aes(x=factor(chr, levels=c("chr1", "chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", "chr10"), ordered=TRUE), fill=gene_type)) +
geom_bar(position="dodge") +
xlab("chromosome")
```
</details>
### stat="identity" parameter
**stat** represents a statistical transformation of the data. It typically aims to summarize the data.
geom_bar() provides different options for **stat**:
* **count** (default): **counts the number of occurrences of each value / category in x**. It does not expect an input in **y**.
* **identity**: uses the data **as is** (i.e. no transformation is applied) and skips the aggregation. The categories (one per bar) are provided by the user in **x**, and the heights of the bars in **y**.
Let's import data from file: **DataViz_source_files-main/files/stats_continents_barcelona_2013-2023_long.csv**
in an object called **statsbcn**.
The data contains the number of foreign residents in Barcelona from 2013 to 2023.
```{r}
statsbcn <- read_csv("DataViz_source_files-main/files/stats_continents_barcelona_2013-2023_long.csv")
```
How many rows and how many columns does the data contain?
In the barplots we created so far, R takes categories in the columns specified in **x=** and counts the number of occurrences.
If we now set **stat="identity"** in geom_bar(), R uses the sum of the variable specified in **y=**, **grouped by the x variable**.
In the following example, we are plotting the sum of foreign residents in Barcelona (Population provided in **y**) per year (Year provided in **x**):
```{r}
ggplot(statsbcn, aes(x=Year, y=Population)) +
geom_bar(stat="identity")
```
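**geom_col()** is a shorthand for `geom_bar(stat="identity")`. Bars that share the same x value are stacked, which is why the total height per year equals the sum of **y**. A sketch on a made-up `toy` data frame (column names mirror **statsbcn**, the numbers are invented):

```{r}
library(ggplot2)

# Hypothetical data mimicking the statsbcn columns (invented numbers)
toy <- data.frame(Year = c(2013, 2013, 2014, 2014),
                  Continent = c("Africa", "Asia", "Africa", "Asia"),
                  Population = c(10, 20, 15, 25))

# Sum of Population per Year: these are the total bar heights
totals <- aggregate(Population ~ Year, data = toy, FUN = sum)
totals

# Same plot as geom_bar(stat = "identity")
p_col <- ggplot(data = toy, mapping = aes(x = Year, y = Population)) +
  geom_col()
p_col
```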
We can map, for example, **fill** to **Continent**:
```{r}
ggplot(statsbcn, aes(x=Year, y=Population, fill=Continent)) +
geom_bar(stat="identity")
```
We can further play with the **position**, as previously done.
* Position **fill** :
```{r}
ggplot(statsbcn, aes(x=Year, y=Population, fill=Continent)) +
geom_bar(stat="identity", position="fill")
```
* Position **dodge** :
```{r}
ggplot(statsbcn, aes(x=Year, y=Population, fill=Continent)) +
geom_bar(stat="identity", position="dodge")
```
We can control the width of bars (hence, the spacing between 2 bars) using the **width** parameter of geom_bar():
```{r}
ggplot(statsbcn, aes(x=Year, y=Population, fill=Continent)) +
geom_bar(stat="identity", position="dodge", width = 0.8)
```
<details>
<summary>
*More advanced (as reference, or if someone asks): display all labels:*
</summary>
Convert the "Year" column to character, instead of numbers:
```{r}
# convert the x-axis from a continuous to a discrete variable (as.character)
ggplot(statsbcn, aes(x=as.character(Year), y=Population, fill=Continent)) +
geom_bar(stat="identity", position="dodge")
```
</details>
## Boxplots
A boxplot is used to visualize the distribution of data.
<img src="images/ggplot_boxplot_definition.jpg" alt="import zip" width="600"/>
*[Source](https://i.ytimg.com/vi/BE8CVGJuftI/maxresdefault.jpg)*
We will import data from a file that contains the same information as **geneexp** but in a slightly different format:
```{r, echo=T, eval=T, message=F, warning=F}
geneexp2 <- read_csv("DataViz_source_files-main/files/expression_20genes_long.csv")
```
In our first boxplot, one box corresponds to one sample:
```{r}
ggplot(geneexp2, aes(x=sample, y=expression)) +
geom_boxplot()
```
We can split boxes by **DE**, the same way we did for barplots, by mapping **fill** or **color** to the variable:
```{r}
ggplot(geneexp2, aes(x=sample, y=expression, fill=DE)) +
geom_boxplot()
```
If you prefer a violin plot, it is easy:
```{r}
ggplot(geneexp2, aes(x=sample, y=expression, fill=DE)) +
geom_violin()
```
Violin plots also aim to visualize data distribution. While boxplots can only show summary statistics / quantiles, violin plots also show the density of each variable.
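The numbers behind a box can be reproduced with base R. In `geom_boxplot()`, the box spans the first to the third quartile, the line inside is the median, and the whiskers reach the most extreme values within 1.5 times the interquartile range (IQR) from the box; points beyond are drawn individually. The vector `x` below is made up for illustration.

```{r}
library(ggplot2)

# Hypothetical values with one obvious outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Quartiles: bottom of the box, median line, top of the box
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
q

# Limits beyond which points are drawn individually
iqr <- IQR(x)
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# The upper whisker ends at the largest value within the upper limit
upper_whisker <- max(x[x <= upper_fence])
outliers <- x[x < lower_fence | x > upper_fence]
upper_whisker
outliers

ggplot(data = data.frame(x = x), mapping = aes(y = x)) +
  geom_boxplot()
```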
## Fine-tuning text
Controlling the **font size and style** of the different components of the graph (axis text, title, legend, etc.) is important for the image readability and impact.
Text size and style of ggplot2 graphs can be changed using the **theme()** function. While very powerful, it can take a bit of time to get used to the structure.
We will illustrate how to make some changes using **theme()** layer on our first scatter plot.
* Change **overall** font size:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point() +
ggtitle("scatter plot") +
theme(text = element_text(size = 20))
```
* Change font size of **axis text**:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point() +
ggtitle("scatter plot") +
theme(axis.text = element_text(size = 20))
```
* Change font size of **axis titles**:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point() +
ggtitle("scatter plot") +
theme(axis.title = element_text(size = 20))
```
* **Remove axis titles** (e.g. below: removing x-axis title):
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point() +
ggtitle("scatter plot") +
theme(axis.title.x = element_blank())
```
* **Center** the graph title (it is left-aligned by default):
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point() +
ggtitle("scatter plot") +
theme(plot.title = element_text(hjust = 0.5))
```
* Change font size of the **graph title**:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point() +
ggtitle("scatter plot") +
theme(plot.title = element_text(size = 20, hjust = 0.5))
```
* Change the **color of the title**, and make it **bold**:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point() +
ggtitle("scatter plot") +
theme(plot.title = element_text(size = 20, hjust = 0.5, face = "bold", colour = "blue"))
```
* You can also use theme() to **rotate the x-axis labels** of plots, for example:
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2)) +
geom_point() +
ggtitle("scatter plot") +
theme(axis.text.x = element_text(angle=90))
```
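Several settings can be combined in a single **theme()** call, and the result can be stored in an object to reuse across plots. A sketch with a made-up `toy` data frame; the name `my_text_theme` is arbitrary.

```{r}
library(ggplot2)

# Hypothetical stand-in for geneexp (invented values)
toy <- data.frame(sample1 = c(5, 12, 20, 33, 41),
                  sample2 = c(7, 10, 24, 30, 45))

# Group several text settings in one reusable theme object
my_text_theme <- theme(
  plot.title = element_text(size = 20, hjust = 0.5, face = "bold"),
  axis.title = element_text(size = 14),
  axis.text.x = element_text(angle = 90)
)

# Add the stored settings like any other layer
p_theme <- ggplot(data = toy, mapping = aes(x = sample1, y = sample2)) +
  geom_point() +
  ggtitle("scatter plot") +
  my_text_theme
p_theme
```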
* As a last example, let's see how we can control the **plot's legend**:
**Colored background:**
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2, color=DE)) +
geom_point() +
theme(legend.background = element_rect(fill="yellow"))
```
**Decrease or increase the space between the legend box and the plot:**
```{r}
ggplot(data=geneexp, mapping=aes(x=sample1, y=sample2, color=DE)) +
geom_point() +
theme(legend.box.spacing = unit(3, "cm")) # default is 0.4cm
```
Remove the **key background**: