---
title: Memory usage
layout: default
---
It's important to understand memory usage in R: firstly because you might be running out of it, and secondly because managing memory efficiently can make your code faster. The goals of this chapter are:
* to give you a basic understanding of how memory works in R
* to help you predict when an object will be copied, show you how to test your prediction, and give you some tips for avoiding copies
* to give you practical tools for understanding memory allocation in a given problem
* to build your vocabulary so you can more easily understand more advanced documentation.
Unfortunately the details of memory management in R are not documented in one place. Most of the information in this chapter was gleaned from close reading of the documentation (particularly `?Memory` and `?gc`), the [memory profiling](http://cran.r-project.org/doc/manuals/R-exts.html#Profiling-R-code-for-memory-use) section of R-exts, and the [SEXPs](http://cran.r-project.org/doc/manuals/R-ints.html#SEXPs) section of R-ints. The rest I figured out through small experiments and by asking questions on R-devel.
## `object.size()`
One of the most useful tools for understanding memory usage in R is `object.size()`: it tells you how large an object is, and plays a similar role to `system.time()` for understanding time usage. This section explores how to use `object.size()`, and by explaining some unusual findings along the way, will help you understand some important aspects of memory allocation.
We'll start with a surprising plot: vector length vs. the number of bytes of memory the vector occupies. You might expect that the size of an empty vector would be 0 and that memory usage would grow proportionately with length. Neither of those things is true!
```{r size-q}
sizes <- sapply(0:50, function(n) object.size(seq_len(n)))
plot(0:50, sizes, xlab = "Length", ylab = "Bytes", type = "s")
```
It's not just numeric vectors of length 0 that occupy 40 bytes of memory, it's every empty vector type:
```{r}
object.size(numeric())
object.size(integer())
object.size(raw())
object.size(list())
```
What is that 40 bytes of memory used for? There are four components that every object in R has:
* 4 bytes: information about the object, called the sxpinfo. This includes its base type, plus information used for debugging and memory management.
* 2 * 8 bytes: two pointers needed for memory management. Objects in R are stored in a doubly-linked list, so that R can easily iterate through every object stored in memory.
* 8 bytes: a pointer to the attributes.
And three components possessed by all vector types:
* 4 bytes: the length of the vector. This gave rise to the previous limitation in R of only supporting vectors up to 2 ^ 31 - 1 (about two billion) elements long. You can read in R-internals about how support for [long vectors](http://cran.r-project.org/doc/manuals/R-ints.html#Long-vectors) was added, without changing this number.
* 4 bytes: the "true" length, which is basically never used (the exception is for environments with hastables, where the hashtable is a list, and the truelength represents the allocated space and length represents the space)
* 0 bytes: the data. That's not used for an empty vector, but is obviously very important otherwise!
If you're counting closely you'll note that this only adds up to 36 bytes. The remaining 4 bytes are padding, needed because 36 is not divisible by 8: on a 64-bit architecture you're best off aligning objects to the nearest 8-byte (= 64-bit) boundary.
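As a quick sanity check, here's that arithmetic spelled out in R (the component names are just informal labels for the pieces described above):
```R
# Header components of an empty vector, in bytes (labels are informal)
header <- c(sxpinfo = 4, gc_pointers = 2 * 8, attr_pointer = 8,
  length = 4, true_length = 4)
sum(header)                   # 36 bytes
8 * ceiling(sum(header) / 8)  # rounded up to the next 8-byte boundary: 40
```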
So that explains the intercept on the graph. But why does the memory size grow in irregular jumps? To understand that, you need to know a little about how R requests memory from the operating system. Requesting memory with `malloc()` is a relatively expensive operation, and R would be slow if it had to request memory every time you created a small vector. Instead, it asks for a big block of memory and then manages that block itself: this is called the small vector pool. R uses this pool for vectors less than 128 bytes long, and, for efficiency and simplicity, only allocates vectors that are 8, 16, 32, 48, 64 or 128 bytes long. If we adjust our previous plot by removing the 40 bytes of overhead, we can see that those values correspond to the jumps.
```{r size-a}
plot(0:50, sizes - 40, xlab = "Length", ylab = "Bytes excluding overhead", type = "n")
abline(h = 0, col = "grey80")
abline(h = c(8, 16, 32, 48, 64, 128), col = "grey80")
abline(a = 0, b = 4, col = "grey90", lwd = 4)
lines(sizes - 40, type = "s")
```
It only remains to explain the jumps after the 128-byte limit. At that point R has moved to the large vector pool, and asks the operating system for memory every time it needs a new vector. It always asks for memory in multiples of 8 bytes: partly to keep the internal code simple across very different operating systems, partly for legacy reasons, and partly for performance.
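As a rough check of that claim (assuming a 64-bit build), sizes beyond the small vector pool should grow in exact 8-byte steps, one double per element:
```R
# Vectors of 100+ doubles are well past the 128-byte small vector pool
big_sizes <- sapply(100:104, function(n) object.size(numeric(n)))
diff(big_sizes)  # each extra double adds exactly 8 bytes
```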
## Total memory use
`object.size()` allows you to determine the size of a single object, and you can use `gc()` to determine how much memory you're currently using:
```{r}
gc()
```
R breaks down memory usage into Vcells (vector memory usage) and Ncells (everything else), but this distinction usually doesn't matter, and neither do the gc trigger and max used columns. The function below wraps around `gc()` to just return the number of megabytes of memory you're currently using.
```{r}
mem <- function() {
  bit <- 8L * .Machine$sizeof.pointer
  if (bit != 32L && bit != 64L) {
    stop("Unknown architecture", call. = FALSE)
  }
  node_size <- if (bit == 32) 28L else 56L
  usage <- gc()
  sum(usage[, 1] * c(node_size, 8)) / (1024 ^ 2)
}
mem()
```
Don't expect this number to agree with the amount of memory reported by your operating system. There is some overhead associated with the R interpreter that is not captured by these numbers, and both R and the operating system are lazy: they won't try and reclaim memory until it's actually needed. Another problem is memory fragmentation: R counts the memory occupied by objects, but not the memory occupied by the gaps between objects where old objects used to live.
We can use `mem()` to create a handy function that tells us how much memory a particular operation needs:
```{r}
mem_change <- function(code) {
  start <- mem()
  expr <- substitute(code)
  eval(expr, parent.frame())
  rm(code, expr)
  round(mem() - start, 3)
}
```
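For example (the exact numbers will vary by platform and R version):
```R
mem_change(x <- numeric(1e6))  # allocating a million doubles: roughly +8 MB
mem_change(rm(x))              # releasing them again: roughly -8 MB
```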
## Garbage collection
The job of the garbage collector is to reclaim memory from objects that are no longer available. There are two ways objects can be removed:
* manually, with `rm()`
```{r}
f <- function() {
  1:1e6
}
mem_change(big <- f())
mem_change(rm(big))
```
* automatically, when the environment in which they were defined is no longer used.
```{r}
# Notice that there's no net memory change
mem_change(f())
```
However, you need to be careful with anything that captures the enclosing environment, like formulas or closures:
```{r}
f1 <- function() {
  x <- 1:1e6
  10
}
f2 <- function() {
  x <- 1:1e6
  a ~ b
}
f3 <- function() {
  x <- 1:1e6
  function() 10
}
mem_change(x <- f1())
mem_change(y <- f2())
mem_change(z <- f3())
rm(x, y, z)
```
They prevent the environment from going out of scope and so the memory is never reclaimed.
Also note that using `object.size()` on an environment tells you the size of the environment, not the total size of its contents.
```{r}
env_size <- function(env) {
  objs <- ls(env, all = TRUE)
  sizes <- vapply(objs, function(x) object.size(get(x, env)), double(1))
  sum(sizes)
}
object.size(environment())
env_size(environment())
```
There's a good reason for this: it's not immediately obvious how much space you should say an environment takes up, because environments are reference based. In the following example, what is the size of `a1`? What is the size of `a2`?
```{r}
e <- new.env()
e$x <- 1:1e6
a1 <- list(e)
object.size(a1)
a2 <- list(e)
object.size(a2)
```
Finally, despite what you might have read elsewhere, there's never any point in calling `gc()` yourself, apart from seeing how much memory is in use. R will automatically run garbage collection whenever it needs more space; if you want to see when that happens, call `gcinfo(TRUE)`. The only reason you _might_ want to call `gc()` is that it instructs R to return memory to the operating system. (Even then it might not do anything: older versions of Windows had no way for a program to return memory to the OS.)
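A minimal sketch of using `gcinfo()` to watch R collect on its own (the exact timing of collections will vary with your session):
```R
old <- gcinfo(TRUE)  # report each garbage collection as it happens
invisible(lapply(1:20, function(i) runif(1e6)))  # enough allocation to trigger GC
gcinfo(old)          # restore the previous setting
```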
## Modification in place
Generally, any primitive replacement function will modify in place, provided that the object is not referred to elsewhere.
```R
library(pryr)
x <- 1:5
address(x)
x[2] <- 3L
address(x)
# Assigning in a real number forces conversion of x to real
x[2] <- 3
address(x)
# Modifying class or other attributes modifies in place
attr(x, "a") <- "a"
class(x) <- "b"
address(x)
# But making a reference to x elsewhere, will create a modified
# copy when you modify x - no longer modifies in place
y <- x
x[1] <- 2
address(x)
```
In R, it's easy to think that you're modifying an object in place, but you're actually creating a new copy each time.
It's not that loops are slow; it's that, if you're not careful, every time you modify an object inside a loop R makes a complete copy. C functions are usually faster not just because the loop is written in C, but because C's default behaviour is to modify in place rather than make a copy. This is less safe, but much more efficient. If you're modifying a data structure in a loop, you can often get big performance gains by switching to the vectorised equivalent. When working with matrices and data frames, this often means creating a large object that you can combine with the original in a single operation.
Take the following code that subtracts the median from each column of a large data.frame:
```{r, cache = TRUE}
x <- data.frame(matrix(runif(100 * 1e4), ncol = 100))
medians <- vapply(x, median, numeric(1))
system.time({
  for (i in seq_along(medians)) {
    x[, i] <- x[, i] - medians[i]
  }
})
```
It's rather slow: we only have 100 columns and 10,000 rows, but it's still taking over a second. We can use `address()` and `track_copy()` from pryr to see what's going on: `address()` returns the memory address that an object currently occupies, and `track_copy()` creates a function that reports whenever that address changes.
```{r, results = 'hide'}
library(pryr)
track_x <- track_copy(x)
system.time({
  for (i in seq_along(medians)) {
    x[, i] <- x[, i] - medians[i]
    track_x()
  }
})
```
Each iteration of the loop prints a different memory address - the complete data frame is being modified and copied for each iteration.
We can make this code substantially more efficient by using a list, which can be modified in place:
```{r}
y <- as.list(x)
track_y <- track_copy(y)
system.time({
  for (i in seq_along(medians)) {
    y[[i]] <- y[[i]] - medians[i]
    track_y()
  }
})
```
We can rewrite it to be much faster still by eliminating all those copies and instead relying on vectorised data frame subtraction: if you subtract a list from a data frame, the elements of the list are matched up with the columns of the data frame. That loop occurs at the C level, which means the data frame is only copied once, not many times.
```{r}
z <- as.data.frame(x)
system.time({
  z <- z - as.list(medians)
})
```
The art of R performance improvement is to build up good intuitions for which operations incur a copy and which occur in place. Each new version of R implements a few performance improvements that eliminate copies, so it's impossible to give an up-to-date list, but some rules of thumb are:
* `structure(x, class = "c")` makes a copy. `class(x) <- c` does not.
* Modifying a vector in place with `[<-` or `[[<-` does not make a copy. Modifying a data frame in place does make a copy. Modifying a list in place makes a copy, but it's a shallow copy: each individual component of the list is not copied.
* `names<-`, `attr<-` and `attributes<-` don't make a copy
* Avoid modifying complex objects (like data frames) repeatedly. Instead, pull out the component you want to modify, modify it, and then put it back in (see the sketch below). If that doesn't work, converting the object to a simpler type and then converting back might help.
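Here's a small sketch of that last rule of thumb, using a made-up data frame `df`: the repeated modifications happen on a plain vector, and the data frame is only touched once at the end.
```R
df <- data.frame(a = runif(1e4), b = runif(1e4))
col <- df$a          # pull the component out as a plain vector
col[col < 0.5] <- 0  # modify it as many times as needed
df$a <- col          # put it back with a single assignment
```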
Generally, building up a rich vocabulary of vectorised functions will help you write performant code. Vectorisation basically means pushing a for-loop from R into C so that only one copy of the data structure is made.
If you think copying is causing a bottleneck in your program, I recommend running some small experiments using `address()` and `microbenchmark`, as described below.
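A minimal experiment of that kind might look like this (behaviour can vary across R versions and depending on how the code is run, so treat it as a sketch):
```R
library(pryr)
x <- runif(1e5)
address(x)
x[100] <- 0   # only one reference to x: typically modified in place
address(x)    # usually unchanged
y <- x        # create a second reference
x[100] <- 1   # now R must copy before modifying
address(x)    # usually a new address
```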
## Memory profiling
Memory profiling is a little tricky, because looking at total memory use is not that useful: some of that memory may be used by unreferenced objects that haven't yet been removed by the garbage collector. Additionally, R's memory profiler is timer based: R regularly stops the execution of the script and records memory information. This has two consequences: firstly, enabling profiling will slow down the execution of your script; and secondly, the timer has only limited resolution, so it can't capture expressions that execute quickly. (Fortunately, big memory allocations are relatively expensive, so they're likely to be caught.)
Despite these caveats, memory profiling is still useful. Rather than looking at total memory use, we focus on allocations, and we bear in mind that the attribution of allocations to lines of code might be a few lines off.
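To make this concrete, here's a minimal sketch of timer-based memory profiling with base R's `Rprof()`; the file name and the toy workload are just placeholders:
```R
Rprof("memory.out", memory.profiling = TRUE, interval = 0.01)
x <- lapply(1:100, function(i) data.frame(matrix(runif(1e4), ncol = 10)))
Rprof(NULL)  # stop profiling
summaryRprof("memory.out", memory = "both")$by.total
```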
Another option is to use `gctorture(TRUE)`: this forces R to run the garbage collector after every allocation. It helps with both problems: memory is freed as soon as possible, and because R runs much more slowly (10-100x slower in my experience), the effective resolution of the timer becomes about 10x greater. So only turn it on once you've isolated the small part of your code whose memory usage you want to understand, or if you're very patient. In my experience, it mostly helps with smaller allocations and with associating allocations with exactly the right line of code. It also helps you see which objects would be reclaimed if absolutely necessary.
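Since everything runs 10-100x slower with it switched on, wrap only the code under study, for example:
```R
gctorture(TRUE)   # collect after every allocation (very slow!)
res <- vapply(1:100, function(i) sum(runif(1e3)), numeric(1))
gctorture(FALSE)  # back to normal
```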