forked from edgararuiz-zz/dbplot
---
output: github_document
---
# dbplot <img src="man/figures/logo.png" align="right" alt="" width="220" />
```{r, setup, include = FALSE}
library(dplyr)
library(dbplot)
library(sparklyr)
library(nycflights13)
knitr::opts_chunk$set(fig.height = 3.5, fig.width = 4, fig.align = 'center')
```
[![Build Status](https://travis-ci.org/edgararuiz/dbplot.svg?branch=master)](https://travis-ci.org/edgararuiz/dbplot)
[![CRAN\_Status\_Badge](http://www.r-pkg.org/badges/version/dbplot)](https://cran.r-project.org/package=dbplot)
[![Coverage status](https://codecov.io/gh/edgararuiz/dbplot/branch/master/graph/badge.svg)](https://codecov.io/github/edgararuiz/dbplot?branch=master)
- [Installation](#installation)
- [Connecting to a data source](#connecting-to-a-data-source)
- [Example](#example)
- [`ggplot`](#ggplot)
  - [Histogram](#histogram)
  - [Raster](#raster)
  - [Bar Plot](#bar-plot)
  - [Line plot](#line-plot)
  - [Boxplot](#boxplot)
- [Calculation functions](#calculation-functions)
- [`db_bin()`](#db_bin)
Leverages `dplyr` to process the calculations of a plot inside a database. This package provides helper functions that abstract the work at three levels:
1. Functions that output a `ggplot2` object
2. Functions that output a `data.frame` object with the calculations
3. A function that creates the formula needed to calculate the bins for a Histogram or a Raster plot
## Installation
You can install the released version from CRAN:
```{r, eval = FALSE}
# install.packages("dbplot")
```
Or the development version from GitHub, using the `remotes` package:
```{r, eval = FALSE}
# install.packages("remotes")
# remotes::install_github("edgararuiz/dbplot")
```
## Connecting to a data source
- For more information on how to connect to databases, including Hive, please visit http://db.rstudio.com
- To use Spark, please visit the `sparklyr` official website: http://spark.rstudio.com
## Example
In addition to database connections, the functions work with `sparklyr`. A Spark DataFrame will be used for the examples in this README.
```{r, include = FALSE}
library(sparklyr)
conf <- spark_config()
conf$`sparklyr.shell.driver-memory` <- "1G"
conf$spark.memory.fraction <- 0.9
sc <- spark_connect(master = "local", config = conf)
spark_flights <- copy_to(sc, nycflights13::flights, "flights")
```
```{r, eval = FALSE}
library(sparklyr)
sc <- spark_connect(master = "local")
spark_flights <- copy_to(sc, nycflights13::flights, "flights")
```
## `ggplot`
### Histogram
By default, `dbplot_histogram()` creates a 30-bin histogram.
```{r}
library(ggplot2)
spark_flights %>%
  dbplot_histogram(distance)
```
Use `binwidth` to set a fixed bin width.
```{r}
spark_flights %>%
  dbplot_histogram(distance, binwidth = 400)
```
Because the function returns a `ggplot2` object, the plot can be customized further with regular `ggplot2` layers.
```{r}
spark_flights %>%
  dbplot_histogram(distance, binwidth = 400) +
  labs(title = "Flights - Distance traveled") +
  theme_bw()
```
### Raster
To visualize two continuous variables, we typically resort to a Scatter plot. However, this may not be practical when visualizing millions or billions of dots representing the intersections of the two variables. A Raster plot may be a better option, because it concentrates the intersections into squares that are easier to parse visually.
A Raster plot works much like a Histogram. It takes two continuous variables and creates discrete 2-dimensional bins, represented as squares in the plot. It then either counts the number of rows that fall inside each square or computes an aggregation over them, such as an average.
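The binning described above can be sketched in base R on a small local data frame. This is only an illustration of the idea, not the package's implementation: the data values, the `bin_floor()` helper, and the fixed bin width of 10 are made up for the example, whereas `dbplot_raster()` builds the equivalent calculation for the database to run.

```{r}
# Toy data standing in for two continuous columns and a value to aggregate
# (all values are made up for illustration).
df <- data.frame(
  x = c(1, 2, 3, 11, 12, 19),
  y = c(1, 2, 18, 11, 12, 19),
  z = c(10, 20, 30, 40, 50, 60)
)

# Hypothetical helper: lower edge of the fixed-width bin a value falls in.
bin_floor <- function(v, width) floor(v / width) * width

# Assign every row to a square, identified by the lower edges of its x and y bins.
df$x_bin <- bin_floor(df$x, 10)
df$y_bin <- bin_floor(df$y, 10)

# Option 1: number of rows inside each square.
counts <- aggregate(z ~ x_bin + y_bin, data = df, FUN = length)
names(counts)[3] <- "n"

# Option 2: an aggregation per square, here the average of z.
means <- aggregate(z ~ x_bin + y_bin, data = df, FUN = mean)
```

Each row of `counts` or `means` corresponds to one square of the plot, so the result has at most one row per occupied square no matter how many rows the source table holds.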
- If no `fill` argument is passed, the default calculation will be count, `n()`
```{r}
spark_flights %>%
  dbplot_raster(sched_dep_time, sched_arr_time)
```
- Pass an aggregation formula that can run inside the database
```{r}
spark_flights %>%
  dbplot_raster(
    sched_dep_time,
    sched_arr_time,
    mean(distance, na.rm = TRUE)
  )
```
- Use the `resolution` argument to increase or decrease the definition of the plot. It defaults to 100
```{r}
spark_flights %>%
  dbplot_raster(
    sched_dep_time,
    sched_arr_time,
    mean(distance, na.rm = TRUE),
    resolution = 20
  )
```
### Bar Plot
- `dbplot_bar()` defaults to a `tally()` of each value in a discrete variable
```{r}
spark_flights %>%
  dbplot_bar(origin)
```
- Pass a named formula to compute for each value of the discrete variable. The name is used as the column name
```{r}
spark_flights %>%
  dbplot_bar(origin, avg_delay = mean(dep_delay, na.rm = TRUE))
```
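As a rough picture of the two calculations a bar plot can be based on, the base R sketch below computes a count per discrete value and an average per discrete value on a small local data frame. The data values are invented, and base R's `table()` and `aggregate()` stand in for the grouped query that would run inside the database.

```{r}
# Toy stand-in for the flights table (values are made up for illustration).
df <- data.frame(
  origin    = c("EWR", "EWR", "JFK", "LGA", "LGA", "LGA"),
  dep_delay = c(10, NA, 5, 0, 20, 40)
)

# Default behavior: one count per discrete value.
counts <- as.data.frame(table(origin = df$origin), responseName = "n")

# Custom aggregation: average delay per discrete value. The formula
# interface of aggregate() drops rows with missing values by default,
# which plays the role of na.rm = TRUE here.
avg_delay <- aggregate(dep_delay ~ origin, data = df, FUN = mean)
names(avg_delay)[2] <- "avg_delay"
```

Either result has one row per discrete value, which is all a bar or line plot needs.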
### Line plot
- `dbplot_line()` defaults to a `tally()` of each value in a discrete variable
```{r}
spark_flights %>%
  dbplot_line(month)
```
- Pass a formula to compute for each value of the discrete variable
```{r}
spark_flights %>%
  dbplot_line(month, avg_delay = mean(dep_delay, na.rm = TRUE))
```
### Boxplot
- `dbplot_boxplot()` expects a discrete variable to group by, and a continuous variable for which to calculate the percentiles and IQR. It does not calculate outliers. Currently, this feature works with `sparklyr` and Hive connections.
```{r}
spark_flights %>%
  dbplot_boxplot(origin, dep_delay)
```
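The per-group statistics behind a boxplot can be sketched locally in base R. This is an illustration under stated assumptions rather than the package's code: the data values and the `box_stats()` helper are hypothetical, the whiskers are assumed to extend 1.5 times the IQR and to be clipped to the observed range, and `quantile()` computes exact percentiles whereas a database may return approximate ones.

```{r}
# Toy data: one discrete grouping column and one continuous column
# (values are made up for illustration).
df <- data.frame(
  origin    = rep(c("A", "B"), each = 5),
  dep_delay = c(1, 2, 3, 4, 5, 10, 20, 30, 40, 50)
)

# Hypothetical helper: quartiles plus whisker positions for one group.
box_stats <- function(v) {
  q <- quantile(v, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[3] - q[1]
  c(
    lower  = q[1],
    middle = q[2],
    upper  = q[3],
    # Whiskers: 1.5 * IQR beyond the box, clipped to the observed range.
    ymin   = max(min(v, na.rm = TRUE), q[1] - 1.5 * iqr),
    ymax   = min(max(v, na.rm = TRUE), q[3] + 1.5 * iqr)
  )
}

# One row of statistics per value of the discrete variable.
stats <- do.call(rbind, lapply(split(df$dep_delay, df$origin), box_stats))
```

Because only these five numbers per group are needed to draw the boxes, individual outlier points are not part of the result.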
## Calculation functions
If a more customized plot is needed, the data that underpins the plots can also be accessed:
1. `db_compute_bins()` - Returns a data frame with the bins and count per bin
2. `db_compute_count()` - Returns a data frame with the count per discrete value
3. `db_compute_raster()` - Returns a data frame with the results per x/y intersection
4. `db_compute_raster2()` - Returns the same as the `db_compute_raster()` function, plus the coordinates of the x/y boxes
5. `db_compute_boxplot()` - Returns a data frame with boxplot calculations
```{r}
spark_flights %>%
  db_compute_bins(arr_delay)
```
The resulting data can be piped into a plot.
```{r}
spark_flights %>%
  filter(arr_delay < 100, arr_delay > -50) %>%
  db_compute_bins(arr_delay) %>%
  ggplot() +
  geom_col(aes(arr_delay, count, fill = count))
```
## `db_bin()`
Uses `rlang` to build the formula needed to create the bins of a numeric variable, and returns it unevaluated. This way, the formula can then be passed to a `dplyr` verb.
```{r}
db_bin(var)
```
```{r}
spark_flights %>%
  group_by(x = !! db_bin(arr_delay)) %>%
  tally()
```
```{r}
spark_flights %>%
  filter(!is.na(arr_delay)) %>%
  group_by(x = !! db_bin(arr_delay)) %>%
  tally() %>%
  collect() %>%
  ggplot() +
  geom_col(aes(x, n))
```
```{r}
spark_disconnect(sc)
```