-
Notifications
You must be signed in to change notification settings - Fork 1
/
Data_Analysis_3.Rmd
353 lines (276 loc) · 12.9 KB
/
Data_Analysis_3.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
---
title: 'Descriptive Statistics'
subtitle: 'Bivariate Analysis'
author: "Karol Flisikowski"
date: "`r Sys.Date()`"
output:
rmdformats::downcute:
self_contained: true
thumbnails: false
lightbox: true
gallery: true
highlight: pygments
---
## Introduction
This is our first lab when we are considering 2 dimensions and instead of calculating univariate statistics by groups (or factors) of other variable - we will measure their common relationships based on co-variance and correlation coefficients.
*Please be very careful when choosing the measure of correlation! In case of different measurument scales we have to recode one of the variables into weaker scale.
It would be nice to add some additional plots in the background. Feel free to add your own sections and use external packages.
## Data
This time we are going to use a typical credit scoring data with predefined "default" variables and personal demografic and income data. Please take a look closer at headers and descriptions of each variable.
```{r load-data, warning=TRUE, include=FALSE}
library(haven)
download.file("https://github.com/kflisikowski/ds/blob/master/bank_defaults.sav?raw=true", destfile ="bank_defaults.sav",mode="wb")
bank_defaults <- read_sav("bank_defaults.sav")
```
## Scatterplots
First let's visualize our quantitative relationships using scatterplots.
```{r echo=FALSE, warning=TRUE}
#install.packages("ggplot2", dependencies=TRUE)
library(ggplot2)
# Basic scatter plot
ggplot(bank_defaults, aes(x=age, y=income)) + geom_point()
# Change the point size, and shape
ggplot(bank_defaults, aes(x=age, y=income)) +
geom_point(size=2, shape=23)
```
You can also transform the skewed distribution of incomes using log:
```{r echo=FALSE, warning=TRUE}
bank_defaults$log_income<-log(bank_defaults$income)
# Basic scatter plot with the log of income
ggplot(bank_defaults, aes(x=age, y=log_income)) + geom_point()
```
We can add an estimated linear regression line:
```{r echo=FALSE, warning=TRUE}
ggplot(bank_defaults, aes(x=age, y=log_income)) +
geom_point(shape=18, color="blue")+
geom_smooth(method=lm, linetype="dashed",
color="darkred", fill="blue")
```
## Scatterplots by groups
We can finally see if there any differences between risk status:
```{r echo=FALSE, warning=TRUE}
#install.packages('knitr')
#Loading the libraries and datasets
#install.packages('HSAUR3')
#install.packages('tidyverse')
#install.packages('dplyr')
library(dplyr)
library(tidyverse)
library(HSAUR3)
bank<-na.omit(bank_defaults)
bank$def<-as.factor(bank$default)
ggplot(data=bank, aes(x=age, y=log_income, color=def, shape=def)) + geom_point() + geom_smooth(method="lm") +
labs(title="Income profiles by age vs. risk status",
x="Age",
y="Log of Income")
```
We can also see more closely if there any differences between those two distributions adding their estimated density plots:
```{r echo=FALSE, warning=TRUE}
# scatter plot of x and y variables
# color by groups
scatterPlot <- ggplot(bank,aes(age, log_income, color=def)) +
geom_point() +
scale_color_manual(values = c('#999999','#E69F00')) +
theme(legend.position=c(0,1), legend.justification=c(0,1))
scatterPlot
# Marginal density plot of age (top panel)
agedensity <- ggplot(bank, aes(age, fill=def)) +
geom_density(alpha=.5) +
scale_fill_manual(values = c('#999999','#E69F00')) +
theme(legend.position = "none")
agedensity
# Marginal density plot of y (right panel)
incomedensity <- ggplot(bank, aes(income, fill=def)) +
geom_density(alpha=.5) +
scale_fill_manual(values = c('#999999','#E69F00')) +
theme(legend.position = "none")
incomedensity
```
We can also put those plots together:
```{r echo=FALSE, warning=TRUE}
#install.packages('knitr')
#Loading the libraries and datasets
#install.packages('HSAUR3')
#install.packages('tidyverse')
#install.packages('dplyr')
library(dplyr)
library(tidyverse)
library(HSAUR3)
bank<-na.omit(bank_defaults)
bank$def<-as.factor(bank$default)
ggplot(data=bank, aes(x=age, y=log_income, color=def, shape=def)) + geom_point() + geom_smooth(method="lm") +
labs(title="Income profiles by age vs. risk status",
x="Age",
y="Log of Income")
```
## Scatterplots with density curves
We can also see more closely if there any differences between those two distributions adding their estimated density plots:
```{r echo=FALSE, warning=TRUE}
#install.packages("gridExtra")
library("gridExtra")
# Create a blank placeholder plot:
blankPlot <- ggplot()+geom_blank(aes(1,1))+
theme(plot.background = element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank()
)
grid.arrange(agedensity, blankPlot, scatterPlot, incomedensity,
ncol=2, nrow=2, widths=c(4, 1.4), heights=c(1.4, 4))
```
## Correlation coefficients - Pearson's linear correlation
Ok, let's move to some calculations.
In R, we can use the cor() function. It takes three arguments and the method: cor(x, y, method)
For 2 quantitative data, with all assumptions met, we can calculate simple Pearson's coefficient of linear correlation:
```{r echo=FALSE, warning=TRUE}
r<-cor(bank$log_income, bank$age, method = "pearson")
r
```
Ok, what about the percentage of the explained variability?
```{r echo=FALSE, warning=TRUE}
r2<-r^2
r2
```
So as we can see almost 33% of total log of incomes' variability is explained by differences in age. The rest (67%) is probably explained by other factors.
## Partial and semipartial correlation
The partial and semi-partial (also known as part) correlations are used to express the specific portion of variance explained by eliminating the effect of other variables when assessing the correlation between two variables.
Partial correlation holds constant one variable when computing the relations to others. Suppose we want to know the correlation between X and Y holding Z constant for both X and Y. That would be the partial correlation between X and Y controlling for Z.
Semipartial correlation holds Z constant for either X or Y, but not both, so if we wanted to control X for Z, we could compute the semipartial correlation between X and Y holding Z constant for X.
Suppose we want to know the correlation between the log of income and age controlling for years of employment. How highly correlated are these after controlling for tenure?
**There can be more than one control variable.
```{r echo=FALSE, warning=FALSE}
#install.packages("ppcor")
library(ppcor) # this package computes partial and semipartial correlations.
pcor.test(bank$log_income,bank$age,bank$employ)
```
How can we interpret the obtained partial correlation coefficient? What is the difference between that one and the semi-partial coefficient:
```{r echo=FALSE, warning=FALSE}
library(ppcor)
spcor.test(bank$log_income,bank$age,bank$employ)
```
## Rank correlation
For 2 different scales - like for example this pair of variables: income vs. education levels - we cannot use Pearson's coefficient. The only possibility is to rank also incomes... and lose some more detailed information about them.
First, let's see boxplots of income by education levels.
```{r echo=FALSE, warning=TRUE}
bank$educ<-as.factor(bank$ed)
ggplot(data=bank, aes(x=educ, y=log_income, fill=def)) +
geom_boxplot() +
labs(title="Income profiles by education vs. risk status",
x="Education levels",
y="Income")
```
Now, let's see Kendal's coefficient of rank correlation (robust for ties).
```{r echo=FALSE, warning=TRUE}
r_ken1<-cor(bank$income, bank$ed, method = "kendal")
r_ken1
```
## Point-biserial correlation
Let's try to verify if there is a significant relationship between incomes and risk status. First, let's take a look at the boxplot:
```{r echo=FALSE, warning=TRUE}
ggplot(data=bank, aes(x=def, y=log_income)) +
geom_boxplot() +
labs(title="Income profiles vs. risk status",
x="Risk status (default)",
y="Income")
```
If you would like to compare 1 quantitative variable (income) and 1 dychotomous variable (default status - binary), then you can use point-biserial coefficient:
```{r echo=FALSE, warning=FALSE}
#install.packages("ltm")
library(ltm)
r_bin1<-biserial.cor(bank$log_income, bank$def)
r_bin1
```
## Nonlinear correlation - eta coefficient
If you would like to check if there are any nonlinearities between 2 variables, the only possibility (beside transformations and linear analysis) is to calculate "eta" coefficient and compare it with the Pearson's linear coefficient.
```{r echo=FALSE, warning=FALSE}
library(ryouready)
r_eta1<-eta(as.numeric(bank$income), as.numeric(bank$age))
r_cor1<-cor(as.numeric(bank$income), as.numeric(bank$age))
r_eta1 #eta coefficient
r_cor1 #linear coefficient
difference<-(r_eta1)^2-(r_cor1)^2
difference #can you see a significant difference? >0.2?
```
## Correlation matrix
We can also prepare the correlation matrix for all quantitative variables stored in our data frame.
We can use ggcorr() function:
ggcorr(df, method = c("pairwise", "pearson"),
nbreaks = NULL, digits = 2, low = "#3B9AB2",
mid = "#EEEEEE", high = "#F21A00",
geom = "tile", label = FALSE,
label_alpha = FALSE)
```{r echo=FALSE, warning=TRUE}
#install.packages("GGally")
library(GGally)
ggcorr(bank,
nbreaks = 6,
low = "steelblue",
mid = "white",
high = "darkred",
geom = "circle")
```
As you can see - the default correlation matrix is not the best idea for all measurement scales (including binary variable "default").
That's why now we can perform our bivariate analysis with ggpair with grouping.
## Correlation matrix with scatterplots
Here is what we are about to calculate:
- The correlation matrix between age, log_income, employ, address, debtinc, creddebt, and othdebt variable grouped by whether the person has a default status or not.
- Plot the distribution of each variable by group
- Display the scatter plot with the trend by group
```{r echo=FALSE, warning=TRUE}
ggpairs(bank, columns = c("age", "log_income", "employ", "address", "debtinc", "creddebt", "othdebt"), title = "Bivariate analysis of customers' risk characteristics", upper = list(continuous = wrap("cor",
size = 3)),
lower = list(continuous = wrap("smooth",
alpha = 0.3,
size = 0.1)),
mapping = aes(color = def))
```
## Contingency tables - qualitative data
Do you believe in the Afterlife?
https://nationalpost.com/news/canada/millennials-do-you-believe-in-life-after-life
A survey was conducted and a random sample of 1091 questionnaires is given in the form of the following contingency table:
```{r echo=FALSE, warning=FALSE}
x=c(435,147,375,134)
dim(x)=c(2,2)
dane<-as.table(x)
dimnames(dane)=list(Gender=c('Female','Male'),Believe=c('Yes','No'))
dane
fourfoldplot(dane)
```
Our task is to check if there is a significant relationship between the belief in the afterlife and gender. We can perform this procedure with the simple chi-square statistics and chosen qualitative correlation coefficient (two-way 2x2 table).
```{r echo=FALSE, warning=FALSE}
#install.packages("psych")
library(psych)
yes<-c(435,147)
no<-c(375,134)
#cohen.kappa(cbind(yes,no))
chisq.test(dane)
prop.table(dane)
```
As you can see we can calculate our chi-square statistic really quickly for two-way tables or larger.
Now we can standardize this contingency measure to see if the relationship is significant.
```{r echo=FALSE, warning=FALSE}
library(DescTools)
Phi(dane)
#?ContCoef
#ContCoef(dane)
#CramerV(dane)
#TschuprowT(dane)
mosaicplot(dane)
barplot(dane)
```
## Exercise 1. Bivariate analysis for the 'Titanic' data.
Let's consider the titanic dataset which contains a complete list of passengers and crew members on the RMS Titanic. It includes a variable indicating whether a person did survive the sinking of the RMS Titanic on April 15, 1912.
A data frame contains 2456 observations on 14 variables.
```{r load-data2, warning=TRUE, include=FALSE}
library(haven)
download.file("https://github.com/kflisikowski/ds/blob/master/titanic.csv?raw=true", destfile ="titanic.csv",mode="wb")
titanic <- read.csv("titanic.csv",row.names=1,sep=";")
```
The website http://www.encyclopedia-titanica.org/ offers detailed information about passengers and crew members on the RMS Titanic. According to the website 1317 passengers and 890 crew member were aboard.
8 musicians and 9 employees of the shipyard company are listed as passengers, but travelled with a free ticket, which is why they have NA values in fare. In addition to that, fare is truely missing for a few regular passengers.