---
output: html_document
editor_options:
chunk_output_type: console
---
# Memory Efficiency {#memory}
As noted by @kane2013scalable, it was quite puzzling that very few of the competitors for the million-dollar prize in the [Netflix challenge](https://en.wikipedia.org/wiki/Netflix_Prize) were statisticians.
This is perhaps because the statistical community historically uses SAS, SPSS, and R.
The first two tools are very well equipped to deal with big data, but are very unfriendly when trying to implement a new method.
R, on the other hand, is very friendly for innovation, but was not equipped to deal with the large data sets of the Netflix challenge.
A lot has changed in R since 2006. This is the topic of this chapter.
As we have seen in the Sparsity Chapter \@ref(sparse), an efficient representation of your data in RAM will reduce computing time, and will allow you to fit models that would otherwise require tremendous amounts of RAM.
Not all problems are sparse however.
It is also possible that your data does not fit in RAM, even if sparse.
There are several scenarios to consider:
1. Your data fits in RAM, but is too big to compute with.
1. Your data does not fit in RAM, but fits in your local storage (HD, SSD, etc.)
1. Your data does not fit in your local storage.
If your data fits in RAM, but is too large to compute with, a solution is to replace the algorithm you are using.
Instead of computing with the whole data, your algorithm will compute with parts of the data, also called _chunks_, or _batches_.
These algorithms are known as _external memory algorithms_ (EMA), or _batch processing_.
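To fix ideas, here is a minimal sketch of the chunking idea, not tied to any particular package: a grand mean is computed by accumulating chunk-wise sums and counts, so the whole file is never held in RAM. The file name, chunk size, and the assumption that the first column is numeric are all hypothetical.

```{r, eval=FALSE}
chunk_size <- 1e5
con <- file("big_file.csv", open = "r")      # hypothetical csv, too big for RAM
invisible(readLines(con, n = 1))             # discard the header line
running_sum <- 0
running_n <- 0
repeat {
  lines <- readLines(con, n = chunk_size)    # read the next chunk of rows
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE)
  running_sum <- running_sum + sum(chunk[[1]])  # accumulate over the first column
  running_n <- running_n + nrow(chunk)
}
close(con)
running_sum / running_n                      # the grand mean, computed chunk by chunk
```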
If your data does not fit in RAM, but fits in your local storage, you have two options.
The first is to save your data in a _database management system_ (DBMS).
This will allow you to use the algorithms provided by your DBMS, or let R use an EMA while "chunking" from your DBMS.
Alternatively, and preferably, you may avoid using a DBMS, and work with the data directly from your local storage by saving your data in some efficient manner.
Finally, if your data does not fit on your local storage, you will need some external storage solution such as a distributed DBMS, or distributed file system.
```{remark}
If you use Linux, you may be better off than Windows users.
Linux will allow you to compute with larger datasets using its _swap file_ that extends RAM using your HD or SSD.
On the other hand, relying on the swap file is a BAD practice since it is much slower than RAM, and you can typically do much better using the tricks of this chapter.
Also, while I LOVE Linux, I would never dare to recommend switching to Linux just to deal with memory constraints.
```
## Efficient Computing from RAM
If our data fits in RAM, but is still too large to compute with (recall that fitting a model typically requires 5-10 times the memory that the data itself occupies), there are several facilities we can use.
The first is the sparse representation discussed in Chapter \@ref(sparse), which is relevant when you have factors, since these typically map to sparse model matrices.
Another way is to use _external memory algorithms_ (EMA).
The `biglm::biglm` function provides an EMA for linear regression.
The following is taken from the function's example.
```{r, echo=FALSE}
library(magrittr)
```
```{r biglm}
data(trees)
ff<-log(Volume)~log(Girth)+log(Height)
chunk1<-trees[1:10,]
chunk2<-trees[11:20,]
chunk3<-trees[21:31,]
library(biglm)
a <- biglm(ff,chunk1)
a <- update(a,chunk2)
a <- update(a,chunk3)
coef(a)
```
Things to note:
- The data has been chunked along rows.
- The initial fit is done with the `biglm` function.
- The model is updated with further chunks using the `update` function.
We now compare it to the in-memory version of `lm` to verify the results are the same.
```{r}
b <- lm(ff, data=trees)
rbind(coef(a),coef(b))
```
Other packages along these lines, particularly for classification with SVMs, are __LiblineaR__ and __RSofia__.
### Summary Statistics from RAM
If you are not going to do any model fitting, and all you want is efficient filtering, selection and summary statistics, then a lot of my warnings above are irrelevant.
For these purposes, the facilities provided by __base__, __stats__, and __dplyr__ are probably enough.
If the data is large, however, these facilities may be too slow.
If your data fits into RAM, but speed bothers you, take a look at the __data.table__ package.
The syntax is less friendly than __dplyr__, but __data.table__ is BLAZING FAST compared to competitors.
Here is a little benchmark^[The code was contributed by Liad Shekel.].
First, we setup the data.
```{r dplyr vs datatable}
library(data.table)
n <- 1e6 # number of rows
k <- c(200,500) # number of distinct values for each 'group_by' variable
p <- 3 # number of variables to summarize
L1 <- sapply(k, function(x) as.character(sample(1:x, n, replace = TRUE) ))
L2 <- sapply(1:p, function(x) rnorm(n) )
tbl <- data.table(L1,L2) %>%
setnames(c(paste("v",1:length(k),sep=""), paste("x",1:p,sep="") ))
tbl_dt <- tbl
tbl_df <- tbl %>% as.data.frame
```
We compare the aggregation speeds.
Here is the timing for __dplyr__.
```{r dplyr aggregation}
library(dplyr)
system.time( tbl_df %>%
group_by(v1,v2) %>%
summarize(
x1 = sum(abs(x1)),
x2 = sum(abs(x2)),
x3 = sum(abs(x3))
)
)
```
And now the timing for __data.table__.
```{r}
system.time(
tbl_dt[ , .( x1 = sum(abs(x1)), x2 = sum(abs(x2)), x3 = sum(abs(x3)) ), .(v1,v2)]
)
```
The winner is obvious.
Let's compare filtering (i.e. row subsets, as in SQL's WHERE clause).
```{r}
system.time(
tbl_df %>% filter(v1 == "1")
)
```
```{r}
system.time(
tbl_dt[v1 == "1"]
)
```
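If you filter on the same column over and over, __data.table__ can do even better with a _key_: the table is sorted once, and subsequent lookups use binary search instead of a full scan. A minimal sketch, continuing with `tbl_dt` from above:

```{r}
setkey(tbl_dt, v1)   # sort once by v1; later lookups on v1 use binary search
system.time(
  tbl_dt[.("1")]     # keyed equivalent of tbl_dt[v1 == "1"]
)
```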
## Computing from a Database
The early solutions to oversized data relied on storing your data in some DBMS such as _MySQL_, _PostgreSQL_, _SQLite_, _H2_, _Oracle_, etc.
Several R packages provide interfaces to these DBMSs, such as __sqldf__, __RDBI__, and __RSQLite__.
Some will even include the DBMS as part of the package itself.
Storing your data in a DBMS has the advantage that you can typically rely on DBMS providers to include very efficient algorithms for the queries they support.
On the downside, SQL queries may include a lot of summary statistics, but will rarely include model fitting^[This is slowly changing. Indeed, Microsoft's SQL Server 2016 is already providing [in-database-analytics](https://blogs.technet.microsoft.com/dataplatforminsider/2016/03/29/in-database-advanced-analytics-with-r-in-sql-server-2016/), and others will surely follow.].
This means that even for simple things like linear models, you will have to resort to R's facilities, typically some sort of EMA with chunking from the DBMS.
For this reason, and others, we prefer to compute from efficient file structures, as described in Section \@ref(file-structure).
If, however, you have a powerful DBMS around, or you only need summary statistics, or you are an SQL master, keep reading.
The package __RSQLite__ includes an SQLite server, which we now setup for demonstration.
The package __dplyr__, discussed in the Hadleyverse Chapter \@ref(hadley), will take care of translating the __dplyr__ syntax to the SQL syntax of the DBMS.
The following example is taken from the __dplyr__ [Databases vignette](https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html).
```{r rsqlite, eval=FALSE}
library(RSQLite)
library(dplyr)
if (file.exists('my_db.sqlite3')) file.remove('my_db.sqlite3') # remove any leftover database from a previous run
my_db <- src_sqlite(path = "my_db.sqlite3", create = TRUE)
library(nycflights13)
flights_sqlite <- copy_to(
dest= my_db,
df= flights,
temporary = FALSE,
indexes = list(c("year", "month", "day"), "carrier", "tailnum"))
```
Things to note:
- `src_sqlite` starts an empty database, managed by SQLite, at the desired path.
- `copy_to` copies data from R to the database.
- Typically, setting up a DBMS like this makes no sense, since it requires loading the data into RAM, which is precisely what we want to avoid.
We can now start querying the DBMS.
```{r, eval=FALSE}
select(flights_sqlite, year:day, dep_delay, arr_delay)
```
```{r, eval=FALSE}
filter(flights_sqlite, dep_delay > 240)
```
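The query is only translated, not executed, until the results are actually needed. You can inspect the generated SQL, and pull the results into RAM explicitly, with `show_query` and `collect`. A minimal sketch, assuming the `flights_sqlite` table from above:

```{r, eval=FALSE}
delayed <- flights_sqlite %>%
  filter(dep_delay > 240) %>%
  select(year:day, dep_delay, arr_delay)
show_query(delayed)   # print the SQL that dplyr generated for the DBMS
collect(delayed)      # execute the query and bring the result into RAM
```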
## Computing From Efficient File Structures {#file-structure}
It is possible to save your data on your storage device, without the DBMS layer to manage it.
This has several advantages:
- You don't need to manage a DBMS.
- You don't have the computational overhead of the DBMS.
- You may optimize the file structure for statistical modelling, and not for join and summary operations, as in relational DBMSs.
There are several facilities that allow you to save and compute directly from your storage:
1. __Memory Mapping__:
Where RAM addresses are mapped to a file on your storage.
This extends the RAM to the capacity of your storage (HD, SSD,...).
Performance slightly deteriorates, but the access is typically very fast.
This approach is implemented in the __bigmemory__ package.
1. __Efficient Binaries__:
Where the data is stored as a file on the storage device.
The file is binary, with a well designed structure, so that chunking is easy.
This approach is implemented in the __ff__ package, and the commercial __RevoScaleR__ package.
Your algorithms need to be aware of the facility you are using.
For this reason, each facility (__bigmemory__, __ff__, __RevoScaleR__, ...) has an ecosystem of packages that implement various statistical methods using that facility.
As a general rule, you can see which packages build on a given package using the _Reverse Depends_ entry in its package description.
For the __bigmemory__ package, for instance, [we can see](https://cran.r-project.org/web/packages/bigmemory/index.html) that the packages __bigalgebra__, __biganalytics__, __bigFastlm__, __biglasso__, __bigpca__, __bigtabulate__, __GHap__, and __oem__, build upon it.
We can expect this list to expand.
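You can also query this list from within R. A minimal sketch using `tools::package_dependencies`; the result reflects CRAN at the time you run it:

```{r, eval=FALSE}
db <- available.packages()                   # current CRAN package metadata
tools::package_dependencies("bigmemory",
                            db = db,
                            which = c("Depends", "Imports"),
                            reverse = TRUE)  # packages that build on bigmemory
```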
Here is a benchmark result, from @wang2015statistical.
It can be seen that __ff__ and __bigmemory__ have similar performance, while __RevoScaleR__ (RRE in the figure) outperforms them.
This has to do both with the efficiency of the binary representation, and with the fact that __RevoScaleR__ is inherently parallel.
More on this in the Parallelization Chapter \@ref(parallel).
![__bigmemory__ vs. __ff__ vs. __RevoScaleR__ when reading, transforming, or fitting a model to 12GB of data, on a standard laptop (by 2015 standards).](art/benchmark.png)
### bigmemory {#bigmemory}
We now demonstrate the workflow of the __bigmemory__ package.
We will see that __bigmemory__, with its `big.matrix` object, is a very powerful mechanism.
If you deal with big numeric matrices, you will find it very useful.
If you deal with big data frames, or any other non-numeric matrix, __bigmemory__ may not be the appropriate tool, and you should try __ff__, or the commercial __RevoScaleR__.
```{r}
# download.file("http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Downloads/2010_Carrier_PUF.zip", "2010_Carrier_PUF.zip")
# unzip(zipfile="2010_Carrier_PUF.zip")
library(bigmemory)
x <- read.big.matrix("data/2010_BSA_Carrier_PUF.csv", header = TRUE,
backingfile = "airline.bin",
descriptorfile = "airline.desc",
type = "integer")
dim(x)
pryr::object_size(x)
class(x)
```
Things to note:
- The basic building block of the __bigmemory__ ecosystem is the `big.matrix` class, which we constructed here with `read.big.matrix`.
- `read.big.matrix` handles the import to R, and the saving to a memory mapped file. The implementation is such that at no point does R hold the data in RAM.
- The memory mapped file will still be there after the session is over. It can thus be attached by other R sessions using `attach.big.matrix("airline.desc")`, as sketched after this list. This will be useful when parallelizing.
- `pryr::object_size` returns the size of the object. Since `x` holds only the memory mapping, it is much smaller than the 100MB of data it refers to.
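Here is a minimal sketch of re-attaching the backing file via the descriptor written above, say from a fresh R session or a parallel worker:

```{r, eval=FALSE}
library(bigmemory)
# No data is read into RAM; we merely re-attach the memory mapped file.
y <- attach.big.matrix("airline.desc")
dim(y)
```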
We can now start computing with the data.
Many statistical procedures for the `big.matrix` object are provided by the __biganalytics__ package.
In particular, the `biglm.big.matrix` and `bigglm.big.matrix` functions, provide an interface from `big.matrix` objects, to the EMA linear models in `biglm::biglm` and `biglm::bigglm`.
```{r, cache=TRUE}
library(biganalytics)
biglm.2 <- bigglm.big.matrix(BENE_SEX_IDENT_CD~CAR_LINE_HCPCS_CD, data=x)
coef(biglm.2)
```
Other notable packages that operate with `big.matrix` objects include:
- __bigtabulate__: Extends __bigmemory__ with `table`, `tapply`, and `split` support for `big.matrix` objects (see the sketch after this list).
- __bigalgebra__: For matrix operations.
- __bigpca__: For principal component analysis (PCA) and singular value decomposition (SVD).
- __bigFastlm__: for (fast) linear models.
- __biglasso__: extends lasso and elastic nets.
- __GHap__: Haplotype calling from phased SNP data.
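For instance, here is a minimal sketch of tabulating a column of the `big.matrix` from above with __bigtabulate__, assuming its `bigtable` interface:

```{r, eval=FALSE}
library(bigtabulate)
# A frequency table of a big.matrix column, computed without copying x into RAM.
bigtable(x, ccols = "BENE_SEX_IDENT_CD")
```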
### bigstep
The [bigstep](https://cran.r-project.org/web/packages/bigstep/vignettes/bigstep.html) package uses the __bigmemory__ framework to perform stepwise model selection when the data cannot fit into RAM.
TODO
## ff
The __ff__ package replaces R's in-RAM storage mechanism with on-disk (efficient) storage.
Unlike __bigmemory__, __ff__ supports all of R's vector types, such as factors, and not only numeric.
Unlike `big.matrix`, which deals with (numeric) matrices, the `ffdf` class can deal with data frames.
Here is an example.
First, open a connection to the file using the `LaF::laf_open_csv` function, without actually importing it.
```{r}
.dat <- LaF::laf_open_csv(filename = "data/2010_BSA_Carrier_PUF.csv",
column_types = c("integer", "integer", "categorical", "categorical", "categorical", "integer", "integer", "categorical", "integer", "integer", "integer"),
column_names = c("sex", "age", "diagnose", "healthcare.procedure", "typeofservice", "service.count", "provider.type", "servicesprocessed", "place.served", "payment", "carrierline.count"),
skip = 1)
```
Now write the data to local storage as an ff data frame, using `laf_to_ffdf`.
```{r}
data.ffdf <- ffbase::laf_to_ffdf(laf = .dat)
head(data.ffdf)
```
We can verify that the `ffdf` data frame has a small RAM footprint.
```{r}
pryr::object_size(data.ffdf)
```
The __ffbase__ package provides several statistical tools to compute with `ff` class objects.
Here is a simple table.
```{r ffdf table, cache=TRUE}
ffbase::table.ff(data.ffdf$age)
```
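Other chunk-wise summaries work directly on the `ff` vectors inside the `ffdf`. A minimal sketch, assuming __ffbase__ is loaded so that its `ff` methods are registered:

```{r, eval=FALSE}
library(ffbase)
mean(data.ffdf$payment)             # chunk-wise mean of an ff vector
quantile(data.ffdf$payment, 0.9)    # chunk-wise quantile
```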
The EMA implementations of `biglm::biglm` and `biglm::bigglm` have their __ff__ versions.
```{r ffdf bigglm}
library(biglm)
mymodel.ffdf <- biglm(payment ~ factor(sex) + factor(age) + place.served,
data = data.ffdf)
summary(mymodel.ffdf)
```
Things to note:
- `biglm::biglm` notices that the input is of class `ffdf` and calls the appropriate implementation.
- The model formula, `payment ~ factor(sex) + factor(age) + place.served`, includes factors which cause no difficulty.
- You cannot inspect the factor coding (dummy? effect?) using `model.matrix`.
This is because EMAs never really construct the whole model matrix, let alone save it in memory.
## disk.frame
TODO: https://github.com/xiaodaigh/disk.frame
## matter
Memory-efficient reading, writing, and manipulation of structured binary data on disk as vectors, matrices, arrays, lists, and data frames.
TODO
## iotools
A low-level facility for connecting to on-disk binary storage.
Unlike __ff__ and __bigmemory__, it behaves like native R objects, with their copy-on-write policy.
Unlike __reader__, it allows chunking.
Unlike `read.csv`, it allows fast I/O.
__iotools__ is thus a potentially very powerful facility.
See @arnold2015iotools for details.
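A minimal sketch of the chunk-wise pattern, assuming the csv file from above and the `chunk.reader`/`chunk.apply` interface (which may differ slightly across __iotools__ versions):

```{r, eval=FALSE}
library(iotools)
# Stream the file in raw chunks, parse each chunk into a matrix with mstrsplit,
# and collect the per-chunk row counts (the first chunk includes the header line).
cr <- chunk.reader(file("data/2010_BSA_Carrier_PUF.csv", "rb"))
counts <- chunk.apply(cr,
                      function(chunk) nrow(mstrsplit(chunk, sep = ",")),
                      CH.MERGE = c)
sum(counts)
```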
TODO
## HDF5
Like __ff__, HDF5 is an efficient on-disk file format.
The package __h5__ is an interface to the HDF5 library, supporting fast storage and retrieval of R objects like vectors, matrices and arrays.
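A minimal sketch of the write/read round trip, assuming the __h5__ interface; the file and dataset names are hypothetical:

```{r, eval=FALSE}
library(h5)
f <- h5file("test.h5")                             # create/open an HDF5 file on disk
f["grp/testmat"] <- matrix(rnorm(300), ncol = 3)   # write a matrix as a dataset
f["grp/testmat"][1:5, ]                            # read back only a slice
h5close(f)
```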
TODO
## DelayedArray
An abstraction layer for operations on array objects, which supports various storage backends for arrays, such as:
- In RAM: __base__, __Matrix__, and [DelayedArray](https://bioconductor.org/packages/release/bioc/html/DelayedArray.html) itself.
- On disk: [HDF5Array](https://bioconductor.org/packages/release/bioc/html/HDF5Array.html) and [matterArray](https://github.com/PeteHaitch/matterArray).
[Link](https://bioconductor.org/packages/release/bioc/html/DelayedArray.html)
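A minimal sketch of the idea, assuming the Bioconductor __HDF5Array__ backend is installed: the data lives on disk, operations are delayed, and reductions are computed block by block.

```{r, eval=FALSE}
library(HDF5Array)                                   # also loads DelayedArray
M <- writeHDF5Array(matrix(rnorm(1e4), nrow = 100))  # write the data to an HDF5 file
class(M)                # an HDF5Matrix: a DelayedMatrix with an on-disk seed
M2 <- log(abs(M) + 1)   # a delayed operation: nothing is computed yet
colSums(M2)             # triggers block-wise computation from disk
```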
Several application packages already build upon the __DelayedArray__ package:
- [DelayedMatrixStats](https://github.com/PeteHaitch/DelayedMatrixStats): Functions that Apply to Rows and Columns of __DelayedArray__ Objects.
- __beachmat__: a C++ API for (most) __DelayedMatrix__ backends.
## Computing from a Distributed File System
If your data is SOOO big that it cannot fit on your local storage, you will need a distributed file system or DBMS.
We do not cover this topic here, and refer the reader to the __RHipe__, __RHadoop__, and __RSpark__ packages and references therein.
## Bibliographic Notes
An absolutely SUPERB review of computing with big data is @wang2015statistical, and references therein (@kane2013scalable in particular).
Michael Kane also reports [his benchmarks](http://technodocbox.com/C_and_CPP/112025624-Massive-data-shared-and-distributed-memory-and-concurrent-programming-bigmemory-and-foreach.html) of in-memory, vs. DBMS operations.
Here is also an excellent talk by [Charles DiMaggio](http://www.columbia.edu/~sjm2186/EPIC_R/EPIC_R_BigData.pdf).
For an up-to-date list of the packages that deal with memory constraints, see the __Large memory and out-of-memory data__ section in the [High Performance Computing](https://cran.r-project.org/web/views/HighPerformanceComputing.html) task view.
For a list of resources to interface to DMBS, see the [Databases with R](https://cran.r-project.org/web/views/Databases.html) task view.
For more on data analysis from disk, and not from RAM, see [Peter_Hickey's JSM talk](https://www.peterhickey.org/slides/2017/2017-08-01_Peter_Hickey_JSM.pdf).
## Practice Yourself