-
Notifications
You must be signed in to change notification settings - Fork 109
/
Comparative_Study_of_Mosaic.Rmd
334 lines (213 loc) · 15.4 KB
/
Comparative_Study_of_Mosaic.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
# Comparative Study of vcd::mosaic and geom_mosaic
Mamunur Rashid and Jiaying Chen
`vcd::mosaic` and `geom_mosaic` are two mostly used functions to create mosaic plot - a way of visualizing multivariate categorical data which is an area-proportional visualization of frequencies, composed of tiles (or block) created by recursive vertical and horizontal splits of a rectangle.
There are several functions available to produce mosaic plot including mosaicplot - a generic function from base R function, mosaic from vcd (Visualizing Categorical Data) package, geom_mosaic from ggmosaic package etc.
Currently, there is inadequate online help or guidance to use those functions. Moreover, the existing documentation of these functions is not user-friendly for beginners. It has a limited number of examples and test cases.
In the following sections, we are going to discuss a comparative study of `vcd::mosaic` and `geom_mosaic`. Our goal is to describe different aspects of these functions with some user-friendly examples and detail explanation of some of the parameters, and finally to compare ease-of-use & effectiveness of each of the functions.
In this study, we tried to explain how to use these two functions to produce different mosaic plots. We have elaborated some of the necessary parameters/ arguments in detail that are used in the functions. As a matter of fact, it will be easy to learn for a new user. Finally, we showed which function is better in which scenario such as `vcd::mosaic` is better in some cases whereas `geom_mosaic` is better in other cases based on the available data and the objective.
To demonstrate different aspects of these functions, we will use the Titanic dataset to plot different mosaic plots.
```{r}
str(Titanic)
```
```{r}
tail(as.data.frame(Titanic))
```
## 1. vcd::mosaic:
```{r}
library(vcd)
```
`mosaic` is one of the functions of strucplot framework which we can get from vcd R package. The main idea of strucplot is to visualize the tables’ cells arranged in rectangular form structable objects. The “flattened” contingency table can be obtained using the structable() function.
The basic Usage of mosaic functions is:
```{r eval=FALSE, message=FALSE, warning=FALSE}
vcd::mosaic(x, condvars = NULL,
split_vertical = NULL, direction = NULL, spacing = NULL,
spacing_args = list(), gp = NULL, expected = NULL, shade = NULL,
highlighting = NULL, highlighting_fill = grey.colors, highlighting_direction = NULL,
zero_size = 0.5, zero_split = FALSE, zero_shade = NULL,
zero_gp = gpar(col = 0), panel = NULL, main = NULL, sub = NULL, ...)
OR
vcd::mosaic(formula, data, highlighting = NULL,
..., main = NULL, sub = NULL, subset = NULL, na.action = NULL)
```
Some of the basic parameters/arguments are described in the following section:
**x:**
A contingency table in array form, with optional category labels specified in the dimnames(x) attribute, or an object of class "structable".
x is a table or formula. A structable can also be deduced from data frame or contengency table using formula. For example,
```{r}
HEC <- structable(Survived ~ Sex + Age, data = Titanic)
HEC
```
**formula:**
A formula specifying the variables used to create a contingency table from data. For convenience, conditioning formulas can be specified; the conditioning variables will then be used first for splitting. If any, a specified response variable will be highlighted in the cells.
For example, If we want to split the plot by 1 varialbe, we can specify as ~V1.
If V2 is dependent on V1 and V1 needs to be highlighted, we can do V2 ~ V1.
In the following example, a plot is crated from Titanic data:
```{r}
vcd::mosaic(Survived ~ Sex + Age, data = Titanic,
main = "Survival on the Titanic")
```
In the above plot, the data is partinioned by sex first and then age; after that survival is used to show the dependency.
```{r fig.width=15.5, fig.height=6.5}
vcd::mosaic(Survived ~ ., data = Titanic)
```
This type of formuala is used when one variable is needed to be highlighted based on rest of the varialbes. Survived is highlihgted based on all otther dependend variables.
**data:**
Either a data frame, or an object of class "table" or "ftable". If we provide, structable contingency table, we do not need to provide data (df or table). If data is in a data frame, the count column must be called Freq.
**direction:**
character vector of length k, where k is the number of margins of x (values are recycled as needed). For each component, a value of "h" indicates that the tile(s) of the corresponding dimension should be split horizontally, whereas "v" indicates vertical split(s).
By default, vcd::mosaic() splits horizontally from left for the first variable (e.g. ~ V1, ~ V1 + V2, ~ V2 | V1 or V2 ~ V1 -- here V1 is considered as first variable) and vertically from top for the second variable and so on. If we want to alter this direction, we can provide this "direction" parameter. For example,
```{r}
vcd::mosaic(Survived ~ Sex + Age, data = Titanic,
main = "Survival on the Titanic",direction=c("v","v","h"))
# Sex = Vertical, Age = Vertical, Survived = Horizonal
```
In the above plot, the sex variable is used to split the plot vertically, then age variable is used to sub-devide verticaly and then survived sub-division is highlighted horizontally from the other side.
**gp:**
Object of class "gpar", shading function or a corresponding generating function (see details and shadings). Components of "gpar" objects are recycled as needed along the last splitting dimension. Ignored if shade = FALSE.
if gp is not provided, the default gray color is applied with shade = TRUE
```{r}
vcd::mosaic(Survived~ Sex + Age, data = Titanic,
main = "Survival on the Titanic", shade = TRUE,
direction=c("v","v","h"))
```
If gp is provided with shade = TRUE (or no shade parameter but not shade = FALSE), the provided color along with other graphical paramters (if provided any) will apply.
```{r}
vcd::mosaic(Survived~ Sex + Age, data = Titanic,
main = "Survival on the Titanic", shade = TRUE,
direction=c("v","v","h"), gp = gpar(fill=c('lightblue', 'pink'))
)
```
**shade:**
Logical specifying whether gp should be used or not (see gp). If TRUE and expected is unspecified, a default model is fitted: if condvars (see strucplot) is specified, a corresponding conditional independence model, and else the total independence model.
If we provide shade = FALSE, the gp will not apply at all.
```{r}
vcd::mosaic(Survived~ Sex + Age, data = Titanic,
main = "Survival on the Titanic", shade = FALSE,
direction=c("v","v","h"), gp = gpar(fill=c('lightblue', 'pink'))
)
```
## 2. geom_mosaic:
`geom_mosaic` is similar to `vcd::mosaic`, which is also designed to display the relationship between categorical variables. `geom_mosaic` can be imported from ggmosaic package which is integrated in ggplot2 as a geom and allows for facetting and layering.
```{r message=FALSE, warning=FALSE}
library(ggmosaic)
library(gridExtra)
```
We are also using the same Titanic dataset. But, here will convert the dataset from contengency table to data frame to easy to use.
```{r}
Titanic_df<-as.data.frame(Titanic)
```
The basic usage or parameters of geom_mosaic is:
```{r eval=FALSE}
geom_mosaic(mapping = NULL, data = NULL, stat = "mosaic",
position = "identity", conds=NULL, na.rm = FALSE,
divider = mosaic(), offset = 0.01, show.legend = NA,
inherit.aes = FALSE, ...)
```
The following is the description of parameters:
**mapping:**
In geom_mosaic, this is used to specify categorical varaibles and numeric variable to be displayed. The basic parameters of mapping is
```{r eval=FALSE}
aes(weight=NULL, x=NULL, fill=NULL, conds=NULL)
```
Aesthetics contains the arugments of weight, x, fill and conds.
* **weight:** select a weighting variable, it can be frequency, relative frequency or other similar weighting variables, but it must be numeric. In mosaic graph, this parameter is used to show the area of each splitting rectangle.
* **x:** select variables to add to formula, which is used to assign categorical varaibles. The basic parameters of x is:
```{r eval=FALSE}
x=product(col_1, col_2, col_3,...)
```
For example,
```{r }
ggplot(data = Titanic_df) +
geom_mosaic(aes(weight = Freq, x = product(Survived, Age, Sex)))
```
The data is partinioned by Sex first, then Age, and finally Survived.
* **fill:** Select a variable to be filled, it is used to color each rectangle in different colors, which is same to splitting each rectangle into two smaller rectangle and then coloring each smaller rectangle in the same parent rectangle in different colors. For example,
```{r }
ggplot(data = Titanic_df) +
geom_mosaic(aes(weight = Freq, x = product( Age, Sex), fill=Survived))
```
* **conds:**
Select variables to condition on, which will be used first for splitting. The basic parameters of conds is
```{r eval=FALSE}
conds=product(col_1, col_2, ...)
```
For example,
```{r }
ggplot(data = Titanic_df) +
geom_mosaic(aes(weight = Freq, x = product(Age, Sex),
conds=product(Class, Survived),fill=Survived))
```
In this example, the data is partitioned by variables in conds first, (Survived, then Class), and then by variables in x.
When variables in "product", then the order is always from the right most varible to the left most variable, not only in x but also in conds.
And geom_mosaic is a little different to other geom_* functions, we don't need to and also can't specific y. In geom_mosaic, there is no y value.
**order of split**
```{r }
ggplot(data = Titanic_df) +
geom_mosaic(aes(weight = Freq, x = product(Survived, Age, Sex), fill=Class))
```
In this example, the order of partition is Sex, Age, Survived, and then each rectangle splitted by Sex, Age and Survived will also be partitioned by Class into 4 smaller rectanglers and then color these smaller ones into four different colors.
The most important thing for this parameter is that the direction of splitting depends on the number of patameters in x. In the above example, the number of patameters in x is 3, which is odd, then the data is partitioned by Class horizontally. While when the number of patameters in x is even, for example, if it is 2, we can see the direction is now vertical.
```{r }
ggplot(data = Titanic_df) +
geom_mosaic(aes(weight = Freq, x = product(Age, Sex), fill=Class))
```
**data:** Only data frame can be used, no "table" or "ftable". If we want to use an object of "table", we must transfer it into dataframe.
```{r eval=FALSE}
#Titanic <- as.data.frame(Titanic)
```
And we can also put data in the function of ggplot.
**na.rm:** The default is "FALSE", which means it will not remove "NA"; while when it equals to "TURE", it will remove missing values.
**divider:** We have two methods to set divider.
The first one is mosaic().The default of directions is first vertical and then horizontal, repeatedly. And we can also use divider to reset the directions of splitting. For example,
```{r }
ggplot(data = Titanic_df) +
geom_mosaic(aes(weight = Freq, x = product(Survived, Age, Sex),
fill=Survived), divider = mosaic("v"))
```
The default parameter is "horizontal"(or "h"), the splitting directions is v-h-v-h-...; while when using "vertical"(or "v"), the directions will become h-v-h-v-.... In other words, it is transversed, varaibles in y-axis will be in x-axis, and varaibles in x-axis will be in y-axis.
The second one is setting divider directly, hspine, vspine, vbar or hbar. But in this method, the data can only be splitted by one categorical variable. Using hspine or vspine, it is same to the original mosaic graph, which using vbar or hbar, it will be like bar chart.
For example,
```{r }
p1<-ggplot(data = Titanic_df) +
geom_mosaic(aes(weight = Freq, x = product(Sex)), divider="hspine")+ggtitle("Hspine")
p2<-ggplot(data = Titanic_df) +
geom_mosaic(aes(weight = Freq, x = product(Sex)), divider="hbar")+ggtitle("Hbar")
grid.arrange(p1, p2, nrow = 1)
```
**offset:** Set the space between the first spine, but at the same time, the space between other spines also will change in equal proportion. For example,
```{r }
g1<-ggplot(data = Titanic_df) +
geom_mosaic(aes(weight = Freq, x = product(Survived, Age, Sex),fill=Survived),offset = 0.01)+ggtitle("Offset=0.01")
g2<-ggplot(data = Titanic_df) +
geom_mosaic(aes(weight = Freq, x = product(Survived, Age, Sex),fill=Survived),offset = 0.05)+ggtitle("Offset=0.1")
grid.arrange(g1, g2, nrow = 1)
```
Besides, geom_mosaic also has other similar features of ggplot, like it can use facet_* to facet. For example,
```{r }
ggplot(data = Titanic_df) +
geom_mosaic(aes(weight = Freq, x = product(Survived, Age, Sex), fill=Survived))+facet_grid(Class~.)
```
## 3. vcd::mosaic vs geom_mosaic -- which one is better?
* **Type of data**
`geom_mosaic` can only use data frame as its `data` argument, however the type of data in `vcd::mosaic` can be data frame or table.
* **Handling variables used in partition**
`geom_mosaic`, in fact ggplot2, is not capable of handling a different number of variables in aesthetics. `product`, a wrapper function for a list, is used as a workaround for this. However, this workaround leads to issues with the labeling; those can be fixed manually though.
On the other hand, the `formualla` of `vcd::mosaic` is capable of handling a number of splitting variables and labeling is also very easy in this case.
* **Use of different layer including faceting**
Since `geom_mosaic` is a faction of ggplot2, it inherits the layered mechanism. So, we can plot mosaic with other plots. We can also use other grouping techniques like faceting.
* **Order of splits**
`vcd::mosaic` is much easier to control the order of split than `geom_mosaic`. `vcd::mosaic` uses formula (e.g. V3 ~ V1 + V2) to set the split order which has a clear explanation of the position of the variables. On the other hand, `geom_mosaic` uses odd or even number of variables to decide the split. If the number of parameters is odd, the data is partitioned by the last dependent variable horizontally.
* **Direction of splits**
`vcd::mosaic` has a `direction` argument that directs split directions whether the variable should be split horizontally or vertically.
`geom_mosaic` has default direction to split the product variables. The default of directions is first vertical and then horizontal, repeatedly. `divider` is used to reset the directions of splitting, but a different combination of splitting direction with the variable is very challenging
* **Color**
In the use of color in different places (e.g. fill, border), both functions have a straightforward structure. `geom_mosaic` also gives an auto legend facility in the use of color.
* **offset**
Setting offset and spacing, `geom_mosaic` is very easy to work with.
In conclusion, it's easier to use `vcd::mosaic`, especially when setting the arguments of categorical variables and their splitting directions. What's more, `vcd::mosaic` can also display the relationship between categorical variables more clearly. But `geom_mosaic` is integrated in ggplot2 as a geom which allows for faceting, layering, setting offset and so on.
**Reference:**
1) https://pdfs.semanticscholar.org/2416/2c1a46669f94356854176ece8548bf7fb989.pdf
2) https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Mosaic_Plots.pdf
3) https://en.wikipedia.org/wiki/Mosaic_plot
4) http://www.pmean.com/definitions/mosaic.htm
5) https://cran.r-project.org/web/packages/ggmosaic/vignettes/ggmosaic.html