---
output: html_document
editor_options:
chunk_output_type: console
---
# Sparse Representations {#sparse}
Analyzing "big data" in R is a challenge because the workspace is memory resident, i.e., all your objects are stored in RAM.
As a rule of thumb, fitting a model requires about 5 times the size of the data.
This means that if you have 1 GB of data, you might need about 5 GB to fit a linear model.
We will discuss how to compute _out of RAM_ in the Memory Efficiency Chapter \@ref(memory).
In this chapter, we discuss efficient representations of your data, so that it consumes less RAM.
The fundamental idea is that if your data is _sparse_, i.e., it has many zero entries, then a naive `data.frame` or `matrix` will consume memory for all these zeroes.
If, however, you have many recurring zeroes, it is more efficient to save only the non-zero entries.
When we say _data_, we actually mean the `model.matrix`.
The `model.matrix` is the matrix that R constructs from your data, converting all your factors to numeric variables that can be computed with.
_Dummy coding_ of your factors, for instance, is something that is done in your `model.matrix`.
If you have a factor with many levels, you can imagine that after dummy coding it, many zeroes will be present.
```{remark}
PyTorch users may find similar ideas in the [PyTorch Block-Sparse](https://github.com/huggingface/pytorch_block_sparse) extension.
```
## Introduction to Sparse Matrices
The __Matrix__ package extends the base `matrix` class with several sparse representations of matrix objects.
When using the sparse representations of the __Matrix__ package, you will need an implementation of your favorite model-fitting algorithm (e.g., `lm`) that is adapted to them; otherwise, R will cast the sparse matrix into a regular (dense) matrix, and you will have saved nothing in RAM.
```{remark}
If you are familiar with MATLAB, you may know that one of its great capabilities is its excellent treatment of sparse matrices via the `sparse` function.
```
Before we go into details, here is a simple example.
We will create a factor of letters by sampling from the built-in `letters` vector.
Clearly, this factor can take only $26$ values.
This means that $25/26$ of the `model.matrix` will be zeroes after dummy coding.
We will compare the memory footprint of the naive `model.matrix` with the sparse representation of the same matrix.
```{r make x, cache=TRUE}
library(magrittr)
reps <- 1e6 # number of samples
y <- rnorm(reps)
x <- letters %>%
  sample(reps, replace = TRUE) %>%
  factor
```
The object `x` is a factor of letters:
```{r}
head(x)
```
We dummy code `x` with the `model.matrix` function.
```{r make X1, cache=TRUE}
X.1 <- model.matrix(~x-1)
head(X.1)
```
We call the __Matrix__ package for `sparse.model.matrix`, an implementation of `model.matrix` that returns a sparse representation.
```{r make X2, cache=TRUE}
library(Matrix)
X.2 <- sparse.model.matrix(~ x - 1) # makes a sparse dummy model.matrix
head(X.2)
class(X.2)
```
Notice that the matrices have the same dimensions:
```{r}
dim(X.1)
dim(X.2)
```
The memory footprints of the two matrices, reported by the `pryr::object_size` function, are very different.
```{r}
pryr::object_size(X.1)
pryr::object_size(X.2)
```
Things to note:
- The sparse representation takes a whole lot less memory than the non-sparse one.
- `Matrix::sparse.model.matrix` builds the dummy variable representation of the factor `x`, and stores it as a sparse matrix object.
- The __pryr__ package provides many facilities for inspecting the memory footprint of your objects and code.
With a sparse representation, we not only save some RAM, but also shorten the time it takes to fit a model, because we are not summing over zeroes.
Here is the timing with the non-sparse representation:
```{r nonSparse lm, cache=TRUE, dependson='make X1'}
system.time(lm.1 <- lm(y ~ X.1))
```
Actually, `lm` is a wrapper around the `lm.fit` function.
If we bypass the overhead of `lm` and call `lm.fit` directly, we gain some time:
```{r nonSparse lmfit, cache=TRUE, dependson='make X1'}
system.time(lm.1 <- lm.fit(y=y, x=X.1))
```
We now do the same with the sparse representation:
```{r sparselm, cache=TRUE, dependson='make X2'}
system.time(lm.2 <- MatrixModels:::lm.fit.sparse(X.2, y))
```
It only remains to verify that the returned coefficients are the same:
```{r}
all.equal(lm.2, unname(lm.1$coefficients), tolerance = 1e-12)
```
You can also visualize the non-zero entries, i.e., the _sparsity structure_.
```{r}
image(X.2[1:26,1:26])
```
## Efficient Memory Representation
We first distinguish between the two main goals of an efficient representation:
(i) efficient writing, i.e., modification;
(ii) efficient reading, i.e., access.
For statistics and machine learning, we will typically want efficient reading, since the `model.matrix` does not change while a model is being fit.
Efficient representations for writing include the _dictionary of keys_, _list of lists_, and a _coordinate list_.
Efficient representations for reading include the _compressed sparse row_ and _compressed sparse column_.
### Coordinate List Representation {#coo}
A _coordinate list representation_, also known as _COO_ or _triplet representation_, is simply a list of the non-zero entries.
Each element in the list is a triplet of the row index, the column index, and the value of a non-zero entry in the matrix.
For instance, the matrix
$$ \begin{bmatrix}
0 & a_2 & 0 \\
0 & 0 & b_3
\end{bmatrix} $$
will be
$$ \begin{bmatrix}
1 & 2 & a_2 \\
2 & 3 & b_3
\end{bmatrix}. $$
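To make this concrete, here is a small sketch using the __Matrix__ package: we build the $2 \times 3$ example above (with arbitrary numbers standing in for $a_2$ and $b_3$) and coerce it to the triplet class, whose slots store exactly these (row, column, value) triplets.
```{r coo example}
library(Matrix)
# Arbitrary values standing in for a_2 and b_3:
m <- sparseMatrix(i = c(1, 2), j = c(2, 3), x = c(0.5, 0.7), dims = c(2, 3))
m.coo <- as(m, "TsparseMatrix") # triplet (COO) representation
str(m.coo) # @i and @j hold the (0-based) row and column indices; @x the values
```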
### Compressed Row Oriented Representation
A _compressed row oriented representation_ is also known as _compressed sparse row_, or _CSR_.
CSR is similar to COO, but with a compressed row vector: instead of holding the row index of each non-zero entry, the row vector holds the positions in the column vector at which a new row begins.
See the next illustration.
![The CSR data structure. From @shah2004sparse. Remember that MATLAB is written in C, where the indexing starts at $0$, and not $1$.](art/crc.png)
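Reusing the object `m` from the COO sketch above, the same matrix can be coerced to a CSR class so you can see the compressed row vector; this is only an illustration, and assumes your version of __Matrix__ supports the coercion.
```{r csr example}
m.csr <- as(m, "RsparseMatrix") # CSR representation
str(m.csr) # @j holds the (0-based) column indices; @p the row pointers; @x the values
```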
### Compressed Column Oriented Representation
A _compressed column oriented representation_ is also known as _compressed sparse column_, or _CSC_.
In CSC it is the column vector that is compressed, unlike CSR, where the row vector is compressed.
The nature of statistical applications is such that the CSC representation is typically the most economical, which justifies its popularity.
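Again reusing `m` from above, here is a short sketch of the CSC slots; this is the layout that `dgCMatrix` objects, such as `X.2` above, use internally.
```{r csc example}
m.csc <- as(m, "CsparseMatrix") # CSC representation (a dgCMatrix)
str(m.csc) # @i holds the (0-based) row indices; @p the column pointers; @x the values
```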
### Algorithms for Sparse Matrices
Working with sparse representations requires algorithms, i.e., functions, that are aware of the representation you are using.
We will not go into the details of numerical linear algebra with or without sparse representations.
We merely note that you need to make sure the functions you use are aware of how your data is represented.
If they are not, R will either try to cast the sparse matrix into a regular (dense) matrix, so that you gain nothing, or you will simply get an error.
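As a small illustration (assuming you have enough RAM for the dense copy), the following sketch compares a sparse-aware cross product with the same computation after casting to a dense matrix.
```{r sparse aware crossprod, cache=TRUE}
library(Matrix)
S <- rsparsematrix(3e3, 3e3, density = 0.01)  # a random sparse matrix
system.time(crossprod(S))                     # Matrix dispatches a sparse-aware method
system.time(crossprod(as.matrix(S)))          # casting to dense first: slower, more RAM
```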
## Sparse Matrices and Sparse Models in R
### The Matrix Package
The __Matrix__ package provides facilities to deal with real (stored as double precision), logical and so-called "pattern" (binary) dense and sparse matrices.
There are provisions for integer and complex (stored as double precision complex) matrices.
The sparse matrix classes include:
- `TsparseMatrix`: a virtual class of the various sparse matrices in triplet representation.
- `CsparseMatrix`: a virtual class of the various sparse matrices in CSC representation.
- `RsparseMatrix`: a virtual class of the various sparse matrices in CSR representation.
For matrices of real numbers, stored in _double precision_, the __Matrix__ package provides the following (non virtual) classes:
- `dgTMatrix`: a __general__ sparse matrix of __doubles__, in __triplet__ representation.
- `dgCMatrix`: a __general__ sparse matrix of __doubles__, in __CSC__ representation.
- `dsCMatrix`: a __symmetric__ sparse matrix of __doubles__, in __CSC__ representation.
- `dtCMatrix`: a __triangular__ sparse matrix of __doubles__, in __CSC__ representation.
Why bother distinguishing between the different shapes of a matrix?
Because the more structure we can assume on a matrix, the more our (statistical) algorithms can be optimized.
For our purposes, `dgCMatrix` will be the most useful: a general sparse matrix of doubles, in CSC representation, with no further structure assumed.
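Here is a short sketch of how these classes come about; the classes noted in the comments are what I would expect `Matrix()` to return, since it inspects its input for symmetry, but the exact class may vary with the package version.
```{r matrix classes}
library(Matrix)
g <- Matrix(c(0, 0, 2, 6, 0, -1), nrow = 2, sparse = TRUE)
class(g)                      # expected: "dgCMatrix" (general, doubles, CSC)
class(as(g, "TsparseMatrix")) # expected: "dgTMatrix" (triplet representation)
s <- Matrix(c(2, 1, 0, 1, 2, 0, 0, 0, 2), nrow = 3, sparse = TRUE)
class(s)                      # expected: "dsCMatrix" (symmetry is detected)
```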
### The glmnet Package
As previously stated, efficient storage of the `model.matrix` is only half the story.
We also need implementations of our favorite statistical algorithms that make use of this representation.
A very useful package that does so is __glmnet__, which allows fitting linear and generalized linear models with ridge, lasso, and elastic-net regularization.
The __glmnet__ package does all of this while using the sparse matrices of the __Matrix__ package.
The following example is taken from [John Myles White's blog](http://www.johnmyleswhite.com/notebook/2011/10/31/using-sparse-matrices-in-r/), and compares the runtime of fitting an OLS model, using `glmnet` with both sparse and dense matrix representations.
```{r sparse glmnet simulation, cache=TRUE}
library('glmnet')
set.seed(1)
performance <- data.frame()
for (sim in 1:10){
  n <- 10000
  p <- 500
  nzc <- trunc(p / 10)
  x <- matrix(rnorm(n * p), n, p) # make a dense matrix
  iz <- sample(1:(n * p),
               size = n * p * 0.85,
               replace = FALSE)
  x[iz] <- 0 # sparsify by injecting zeroes
  sx <- Matrix(x, sparse = TRUE) # save as a sparse object
  beta <- rnorm(nzc)
  fx <- x[, seq(nzc)] %*% beta
  eps <- rnorm(n)
  y <- fx + eps # make data
  # Now to the actual model fitting:
  sparse.times <- system.time(fit1 <- glmnet(sx, y)) # sparse glmnet
  full.times <- system.time(fit2 <- glmnet(x, y)) # dense glmnet
  sparse.size <- as.numeric(object.size(sx))
  full.size <- as.numeric(object.size(x))
  performance <- rbind(performance, data.frame(Format = 'Sparse',
                                               UserTime = sparse.times[1],
                                               SystemTime = sparse.times[2],
                                               ElapsedTime = sparse.times[3],
                                               Size = sparse.size))
  performance <- rbind(performance, data.frame(Format = 'Full',
                                               UserTime = full.times[1],
                                               SystemTime = full.times[2],
                                               ElapsedTime = full.times[3],
                                               Size = full.size))
}
```
Things to note:
- The simulation calls `glmnet` twice: once with the dense object `x`, and once with its sparse version `sx`.
- The degree of sparsity of `sx` is $85\%$. We know this because we "injected" zeroes in $0.85$ of the locations of `x`.
- Because `y` is continuous, `glmnet` fits a linear regression (Gaussian) model. We will see later how to fit other GLMs, and how to use lasso, ridge, and elastic-net regularization.
We now inspect the computing time and the memory footprint, only to discover that sparse representations make a BIG difference.
```{r}
library('ggplot2')
ggplot(performance, aes(x = Format, y = ElapsedTime, fill = Format)) +
stat_summary(fun.data = 'mean_cl_boot', geom = 'bar') +
stat_summary(fun.data = 'mean_cl_boot', geom = 'errorbar') +
ylab('Elapsed Time in Seconds')
```
```{r}
ggplot(performance, aes(x = Format, y = Size / 1000000, fill = Format)) +
stat_summary(fun.data = 'mean_cl_boot', geom = 'bar') +
stat_summary(fun.data = 'mean_cl_boot', geom = 'errorbar') +
ylab('Matrix Size in MB')
```
How do we perform other types of regression with __glmnet__?
We just need to use the `family` and `alpha` arguments of `glmnet::glmnet`.
The `family` argument governs the type of GLM to fit: logistic, Poisson, probit, or other types of GLM.
The `alpha` argument controls the type of regularization: set `alpha=0` for ridge, `alpha=1` for lasso, and any value in between for elastic-net regularization.
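For illustration, here is a minimal sketch of a lasso-regularized logistic regression on simulated data (the sizes, the sparse design matrix, and the coefficients are made up for the example), with cross-validation to pick the regularization level.
```{r sparse logistic lasso, cache=TRUE}
library(glmnet)
library(Matrix)
set.seed(1)
n <- 1e3; p <- 50
X <- rsparsematrix(n, p, density = 0.1)             # a sparse design matrix
beta <- c(rnorm(5), rep(0, p - 5))                   # only 5 non-zero coefficients
y <- rbinom(n, 1, plogis(as.numeric(X %*% beta)))    # binary outcome
fit <- glmnet(X, y, family = "binomial", alpha = 1)  # alpha=1: lasso-penalized logistic
cv <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cv, s = "lambda.min")                           # coefficients at the CV-chosen lambda
```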
## Beyond Sparsity and Apache Arrow
It is quite possible that your data contain redundancies, other than the occurrences of the number 0.
We would not call it _sparse_, but there is still room for its efficient representation in memory.
_Apache Arrow_ is a set of C++ libraries for the efficient representation of objects in memory.
Just like the __Matrix__ package, it can detect these redundancies and exploit them for efficient memory representation.
It can actually do much more, and it is definitely a technology to follow.
- The memory representation is designed for easy read/writes into disk or network. This means that saving your file, or sending it over the network, will require very little CPU.
- It is supported by R and Python, making data exchange between the two possible and fast (see previous item). For optimal performance, save your data in the [Apache Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) format using `arrow::write_parquet()`, or in the [Feather](http://arrow.apache.org/blog/2019/08/08/r-package-on-cran/) format using `arrow::write_feather()`; matching read functions are also provided (see the sketch after this list).
- The previous benefits extend to many other [software suites](https://arrow.apache.org/powered_by/) that support this format.
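As a minimal sketch (with hypothetical file paths, and assuming the __arrow__ package is installed), writing and reading Parquet and Feather files looks like this:
```{r arrow example, eval=FALSE}
library(arrow)
df <- data.frame(id = 1:5, group = letters[1:5], value = rnorm(5))
write_parquet(df, "df.parquet")      # columnar, compressed; readable from Python too
df.parquet <- read_parquet("df.parquet")
write_feather(df, "df.feather")      # Feather: fast read/write of Arrow data
df.feather <- read_feather("df.feather")
```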
## Bibliographic Notes
The best place to start reading on sparse representations and algorithms is the [vignettes](https://cran.r-project.org/web/packages/Matrix/vignettes/Intro2Matrix.pdf) of the __Matrix__ package.
@gilbert1992sparse is also a great read for some general background.
See [here](http://netlib.org/linalg/html_templates/node90.html) for a blog-level review of sparse matrix formats.
For the theory on solving sparse linear systems see @davis2006direct.
For general numerical linear algebra see @gentle2012numerical.
## Practice Yourself
1. What is the CSC representation of the following matrix:
$$\begin{bmatrix}
0 & 1 & 0 \\
0 & 0 & 6 \\
1 & 0 & 1
\end{bmatrix}$$
1. Write a function that takes two matrices in CSC and returns their matrix product.