-
-
Notifications
You must be signed in to change notification settings - Fork 40
/
README.Rmd
346 lines (236 loc) · 16.6 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
#fig.path = "",
out.width = "100%"
)
```
# disk.frame <img src="inst/figures/disk.frame.png" align="right">
# NOTICE
**{disk.frame} has been soft-deprecated in favor of {arrow}**. With the {arrow} 6.0.0 release, it's now capable of doing larger-than-RAM data analysis quite well see [release note](https://arrow.apache.org/docs/r/news/index.html#1-expanded-arrow-native-queries-aggregation-and-joins-6-0-0). Hence, there is no strong reason to prefer {disk.frame} unless you have very specific feature needs.
For the above reason, I've decided to soft-deprecate {disk.frame} which means I will no longer actively develop new features for it but it will remain on CRAN in maintenance mode.
To help with the transition I've created a function, `disk.frame::disk.frame_to_parquet(df, outdir)` to help you convert existing {disk.frame}s to the parquet format so you can use {arrow} with it.
I am working on an reincarnation of {disk.frame} in Julia, so the {disk.frame} will live on!
Thank your for support {disk.frame}. I've learnt alot along the way, but time has come to move on!
# Introduction
How do I manipulate tabular data that doesn't fit into Random Access Memory (RAM)?
Use `{disk.frame}`!
In a nutshell, `{disk.frame}` makes use of two simple ideas
1) split up a larger-than-RAM dataset into chunks and store each chunk in a separate file inside a folder and
2) provide a convenient API to manipulate these chunks
`{disk.frame}` performs a similar role to distributed systems such as Apache Spark, Python's Dask, and Julia's JuliaDB.jl for *medium data* which are datasets that are too large for RAM but not quite large enough to qualify as *big data*.
## Installation
You can install the released version of `{disk.frame}` from [CRAN](https://CRAN.R-project.org) with:
```r
install.packages("disk.frame")
```
And the development version from [GitHub](https://github.com/) with:
```r
# install.packages("devtools")
devtools::install_github("DiskFrame/disk.frame")
```
On some platforms, such as SageMaker, you may need to explicitly specify a repo like this
```r
install.packages("disk.frame", repo="https://cran.rstudio.com")
```
## Vignettes and articles
Please see these vignettes and articles about `{disk.frame}`
- [Quick start:
`{disk.frame}`](https://diskframe.com/articles/intro-disk-frame.html)
which replicates the `sparklyr` vignette for manipulating the
`nycflights13` flights data.
- [Ingesting data into `{disk.frame}`](https://diskframe.com/articles/ingesting-data.html) which lists some commons way of creating disk.frames
- [`{disk.frame}` can be more epic!](https://diskframe.com/articles/more-epic.html) shows some ways of loading large CSVs and the importance of `srckeep`
- [Group-by](https://diskframe.com/articles/group-by.html) the various types of group-bys
- [Custom one-stage group-by functions](https://diskframe.com/articles/custom-group-by.html) how to define custom one-stage group-by functions
- [Fitting GLMs (including logistic regression)](https://diskframe.com/articles/glm.html) introduces the `dfglm` function for fitting generalized linear models
- [Using data.table syntax with disk.frame](https://diskframe.com/articles/data-table-syntax.html)
- [disk.frame concepts](https://diskframe.com/articles/concepts.html)
- [Benchmark 1: disk.frame vs Dask vs
JuliaDB](https://diskframe.com/articles/vs-dask-juliadb.html)
## Common questions
### a) What is `{disk.frame}` and why create it?
`{disk.frame}` is an R package that provides a framework for manipulating larger-than-RAM structured tabular data on disk efficiently. The reason one would want to manipulate data on disk is that it allows arbitrarily large datasets to be processed by R. In other words, we go from "R can only deal with data that fits in RAM" to "R can deal with any data that fits on disk". See the next section.
### b) How is it different to `data.frame` and `data.table`?
A `data.frame` in R is an in-memory data structure, which means that R must load the data in its entirety into RAM. A corollary of this is that only data that can fit into RAM can be processed using `data.frame`s. This places significant restrictions on what R can process with minimal hassle.
In contrast, `{disk.frame}` provides a framework to store and manipulate data on the hard drive. It does this by loading only a small part of the data, called a chunk, into RAM; process the chunk, write out the results and repeat with the next chunk. This chunking strategy is widely applied in other packages to enable processing large amounts of data in R, for example, see [`chunkded`](https://github.com/edwindj/chunked) [`arkdb`](https://github.com/ropensci/arkdb), and [`iotools`](https://github.com/s-u/iotools).
Furthermore, there is a row-limit of 2^31 for `data.frame`s in R; hence an alternate approach is needed to apply R to these large datasets. The chunking mechanism in `{disk.frame}` provides such an avenue to enable data manipulation beyond the 2^31 row limit.
### c) How is `{disk.frame}` different to previous "big" data solutions for R?
R has many packages that can deal with larger-than-RAM datasets, including `ff` and `bigmemory`. However, `ff` and `bigmemory` restrict the user to primitive data types such as double, which means they do not support character (string) and factor types. In contrast, `{disk.frame}` makes use of `data.table::data.table` and `data.frame` directly, so all data types are supported. Also, `{disk.frame}` strives to provide an API that is as similar to `data.frame`'s where possible. `{disk.frame}` supports many `dplyr` verbs for manipulating `disk.frame`s.
Additionally, `{disk.frame}` supports parallel data operations using infrastructures provided by the excellent [`future` package](https://CRAN.R-project.org/package=future) to take advantage of multi-core CPUs. Further, `{disk.frame}` uses state-of-the-art data storage techniques such as fast data compression, and random access to rows and columns provided by the [`fst` package](http://www.fstpackage.org/) to provide superior data manipulation speeds.
### d) How does `{disk.frame}` work?
`{disk.frame}` works by breaking large datasets into smaller individual chunks and storing the chunks in `fst` files inside a folder. Each chunk is a `fst` file containing a `data.frame/data.table`. One can construct the original large dataset by loading all the chunks into RAM and row-bind all the chunks into one large `data.frame`. Of course, in practice this isn't always possible; hence why we store them as smaller individual chunks.
`{disk.frame}` makes it easy to manipulate the underlying chunks by implementing `dplyr` functions/verbs and other convenient functions (e.g. the `cmap(a.disk.frame, fn, lazy = F)` function which applies the function `fn` to each chunk of `a.disk.frame` in parallel). So that `{disk.frame}` can be manipulated in a similar fashion to in-memory `data.frame`s.
### e) How is `{disk.frame}` different from Spark, Dask, and JuliaDB.jl?
Spark is primarily a distributed system that also works on a single machine. Dask is a Python package that is most similar to `{disk.frame}`, and JuliaDB.jl is a Julia package. All three can distribute work over a cluster of computers. However, `{disk.frame}` currently cannot distribute data processes over many computers, and is, therefore, single machine focused.
In R, one can access Spark via `sparklyr`, but that requires a Spark cluster to be set up. On the other hand `{disk.frame}` requires zero-setup apart from running `install.packages("disk.frame")` or `devtools::install_github("xiaodaigh/disk.frame")`.
Finally, Spark can only apply functions that are implemented for Spark, whereas `{disk.frame}` can use any function in R including user-defined functions.
# Example usage
## Set-up `{disk.frame}`
`{disk.frame}` works best if it can process multiple data chunks in parallel. The best way to set-up `{disk.frame}` so that each CPU core runs a background worker is by using
```r
setup_disk.frame()
# this allows large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)
```
The `setup_disk.frame()` sets up background workers equal to the number of CPU cores; please note that, by default, hyper-threaded cores are counted as one not two.
Alternatively, one may specify the number of workers using `setup_disk.frame(workers = n)`.
## Quick-start
```{r setup, cache=TRUE}
suppressPackageStartupMessages(library(disk.frame))
library(nycflights13)
# this will setup disk.frame's parallel backend with number of workers equal to the number of CPU cores (hyper-threaded cores are counted as one not two)
setup_disk.frame()
# this allows large datasets to be transferred between sessions
options(future.globals.maxSize = Inf)
# convert the flights data.frame to a disk.frame
# optionally, you may specify an outdir, otherwise, the
flights.df <- as.disk.frame(nycflights13::flights)
```
## Example: dplyr verbs
### dplyr verbs
{disk.frame} aims to support as many dplyr verbs as possible. For example
```{r, dependson='setup'}
flights.df %>%
filter(year == 2013) %>%
mutate(origin_dest = paste0(origin, dest)) %>%
head(2)
```
### Group-by
Starting from `{disk.frame}` v0.3.0, there is `group_by` support for a limited set of functions. For example:
```r
result_from_disk.frame = iris %>%
as.disk.frame %>%
group_by(Species) %>%
summarize(
mean(Petal.Length),
sumx = sum(Petal.Length/Sepal.Width),
sd(Sepal.Width/ Petal.Length),
var(Sepal.Width/ Sepal.Width),
l = length(Sepal.Width/ Sepal.Width + 2),
max(Sepal.Width),
min(Sepal.Width),
median(Sepal.Width)
) %>%
collect
```
The results should be exactly the same as if applying the same group-by operations on a data.frame. If not, please [report a bug](https://github.com/DiskFrame/disk.frame/issues).
#### List of supported group-by functions
If a function you like is missing, please make a feature request [here](https://github.com/DiskFrame/disk.frame/issues). It is a limitation that function that depend on the order a column can only be obtained using estimated methods.
| Function | Exact/Estimate | Notes |
| -- | -- | -- |
| `min` | Exact | |
| `max` | Exact | |
| `mean` | Exact | |
| `sum` | Exact | |
| `length` | Exact | |
| `n` | Exact | |
| `n_distinct` | Exact | |
| `sd` | Exact | |
| `var` | Exact | `var(x)` only `cor, cov` support *planned* |
| `any` | Exact | |
| `all` | Exact | |
| `median` | Estimate | |
| `quantile` | Estimate | One quantile only |
| `IQR` | Estimate | |
## Example: data.table syntax
```{r}
library(data.table)
suppressWarnings(
grp_by_stage1 <-
flights.df[
keep = c("month", "distance"), # this analysis only required "month" and "dist" so only load those
month <= 6,
.(sum_dist = sum(distance)),
.(qtr = ifelse(month <= 3, "Q1", "Q2"))
]
)
grp_by_stage1
```
The result `grp_by_stage1` is a `data.table` so we can finish off the two-stage aggregation using data.table syntax
```{r}
grp_by_stage2 = grp_by_stage1[,.(sum_dist = sum(sum_dist)), qtr]
grp_by_stage2
```
## Basic info
To find out where the disk.frame is stored on disk:
```{r, dependson='setup'}
# where is the disk.frame stored
attr(flights.df, "path")
```
A number of data.frame functions are implemented for disk.frame
```{r, dependson='setup'}
# get first few rows
head(flights.df, 1)
```
```{r, dependson='setup'}
# get last few rows
tail(flights.df, 1)
```
```{r, dependson='setup'}
# number of rows
nrow(flights.df)
```
```{r, dependson='setup'}
# number of columns
ncol(flights.df)
```
## Hex logo
![disk.frame logo](inst/figures/logo.png)
## Contributors
This project exists thanks to all the people who contribute.
<a href="https://github.com/DiskFrame/disk.frame/graphs/contributors"><img src="https://opencollective.com/diskframe/contributors.svg?width=890&button=false" /></a>
## Current Priorities
The work priorities at this stage are
1. Bugs
2. Urgent feature implementations that can improve an awful user-experience
3. More vignettes covering every aspect of disk.frame
4. Comprehensive Tests
5. Comprehensive Documentation
6. More features
## Blogs and other resources
| Title | Language | Author | Date | Description |
| ------------------------------------------------------------ | -------- | --------------- | ---------- | ------------------------------------------------------------ |
| [25 days of disk.frame](https://twitter.com/evalparse/status/1200963268270886912) | English | ZJ | 2019-12-01 | 25 tweets about `{disk.frame}` |
| https://www.researchgate.net/post/What-is-the-Maximum-size-of-data-that-is-supported-by-R-datamining | English | Knut Jägersberg | 2019-11-11 | Great answer on using disk.frame |
| [`{disk.frame}` is epic](https://www.brodrigues.co/blog/2019-09-03-disk_frame/) | English | Bruno Rodriguez | 2019-09-03 | It's about loading a 30G file into `{disk.frame}` |
| [My top 10 R packages for data analytics](https://www.actuaries.digital/2019/09/26/my-top-10-r-packages-for-data-analytics/) | English | Jacky Poon | 2019-09-03 | `{disk.frame}` was number 3 |
| [useR! 2019 presentation video](https://www.youtube.com/watch?v=3XMTyi_H4q4) | English | Dai ZJ | 2019-08-03 | |
| [useR! 2019 presentation slides](https://www.beautiful.ai/player/-LphQ0YaJwRektb8nZoY) | English | Dai ZJ | 2019-08-03 | |
| [Split-apply-combine for Maximum Likelihood Estimation of a linear model](https://www.brodrigues.co/blog/2019-10-05-parallel_maxlik/) | English | Bruno Rodriguez | 2019-10-06 | `{disk.frame}` used in helping to create a maximum likelihood estimation program for linear models |
| [Emma goes to useR! 2019](https://emmavestesson.netlify.app/2019/07/user2019/) | English | Emma Vestesson | 2019-07-16 | The first mention of `{disk.frame}` in a blog post |
| [深入对比数据科学工具箱:Python3 和 R 之争(2020版)](https://segmentfault.com/a/1190000021653567) | Chinese | Harry Zhu | 2020-02-16 | Mentions disk.frame |
### Interested in learning `{disk.frame}` in a structured course?
Please register your interest at:
https://leanpub.com/c/taminglarger-than-ramwithdiskframe
## Open Collective
If you like `{disk.frame}` and want to speed up its development or perhaps you have a feature request? Please consider sponsoring `{disk.frame}` on Open Collective
### Backers
Thank you to all our backers!
<a href="https://opencollective.com/diskframe#backers" target="_blank"><img src="https://opencollective.com/diskframe/backers.svg?width=890"></a>
### Sponsor and back `{disk.frame}`
Support `{disk.frame}` development by becoming a sponsor. Your logo will show up here with a link to your website.
<a href="https://opencollective.com/diskframe#sponsors" target="_blank"><img src="https://opencollective.com/diskframe/sponsors.svg?width=890"></a>
## Contact me for consulting
**Do you need help with machine learning and data science in R, Python, or Julia?**
I am available for Machine Learning/Data Science/R/Python/Julia consulting! [Email me](mailto:[email protected])
## Non-financial ways to contribute
Do you wish to give back the open-source community in non-financial ways? Here are some ways you can contribute
* Write a blogpost about your `{disk.frame}` usage or experience. I would love to learn more about how `{disk.frame}` has helped you
* Tweet or post on social media (e.g LinkedIn) about `{disk.frame}` to help promote it
* Bring attention to typos and grammatical errors by correcting and making a PR. Or simply by [raising an issue here](https://github.com/DiskFrame/disk.frame/issues)
* Star the [`{disk.frame}` Github repo](https://github.com/DiskFrame/disk.frame)
* Star any repo that `{disk.frame}` depends on e.g. [`{fst}`](https://github.com/fstpackage/fst) and [`{future}`](https://github.com/HenrikBengtsson/future)
## Related Repos
https://github.com/DiskFrame/disk.frame-fannie-mae-example
https://github.com/DiskFrame/disk.frame-vs
https://github.com/DiskFrame/disk.frame.ml
## Download Counts & Build Status
[![](https://cranlogs.r-pkg.org/badges/disk.frame)](https://cran.r-project.org/package=disk.frame)
[![](http://cranlogs.r-pkg.org/badges/grand-total/disk.frame)](https://cran.r-project.org/package=disk.frame)