generated from jtr13/quarto-edav-template
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathch9.qmd
355 lines (276 loc) · 15.3 KB
/
ch9.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
---
title-block-banner: true
bibliography: references.bib
editor:
markdown:
wrap: sentence
---
# Parallel Geospatial Computing
Spatial operations in R can be made faster by using parallel processing.
Generally, parallel computation is the simultaneous execution of different pieces of a larger operation across multiple computing processors or cores.
Imagine if you can execute an operation in X seconds on a single processor.
Would you be able to divide the processing time by n if you divide the task between n processors?
The answer is...almost.
Parallel processing comes with overhead computing costs associated with several processors/nodes interacting with each other over shared memory.
So, for smaller datasets, the increment in computing performance using parallel processing will be minimal (if not negative).
The real difference in computing performance is apparent in the case or large datasets.
For more details read: <https://www.linkedin.com/pulse/thinking-parallel-high-performance-computing-hpc-debasish-mishra>.
In this chapter we will try several examples of the cell-wise, layer-wise and block-wise implementation of custom functions in parallel.
Let us start by first importing surface soil moisture from NASA's SMAP satellite from the netCDF provided with the sample dataset.
```{r 9a1, message = FALSE, warning = FALSE}
library(terra)
# Import SMAP soil moisture NetCDF to the workspace
SMAPrast = rast("./SampleData-master/SMAP_L3_USA.nc")
# Cell-wise mean of all layers
meanSMAP = mean(SMAPrast, na.rm=TRUE)
# Plot spatial map for mean soil moisture
library(spData)
us_shp = vect(us_states)
plot(meanSMAP)
plot(us_shp, add=TRUE)
```
## Cell–wise Operation
### Apply custom function to pixel time series
![](images/clipboard-1143469546.png){fig-align="center" width="550"}
Once we have imported the netCDF file as `rast` object, we will apply a slightly modified version of previously used function `my_fun` (from Ch 8) for calculating mean, variance and skewness for time series data for each cell in parallel.
We will use `terra::app` function to apply `my_fun` on SpatRaster in parallel.
For seamless implementation of function in parallel mode, care must be taken that all necessary are accessible to ALL cores and error exceptions are handles appropriately.
We will modify `my_fun` slightly to highlight what it means in practice.
> 1. We will convert input `x` to a numeric array
>
> 2. We will remove `NA` values from dataset before calculation
>
> 3. We will use `minSamp` to fix minimum sample counts for calculation.
> If the number of observations for a pixel are less than `minSamp`, the grid is skipped.
>
> 4. We will use `tryCatch` to handle error exceptions
The basic rules to avoid errors: (a) checking that inputs are correct, (b) avoiding non-standard evaluation, and (c) avoiding functions that can return different types of output.
```{r 9a2, message = FALSE, warning = FALSE}
#~~ We will make some changes in the custom function for mean, variance and skewness
my_fun = function(x, minSamp){
smTS=as.numeric(as.vector(x)) # Convert dataset to numeric array
smTS=as.numeric(na.omit(smTS)) # Omit NA values
# Implement function with trycatch for catching exception
tryCatch(if(length(smTS)>minSamp) { # Apply minimum sample filter
######## OPERATION BEGINS #############
meanVal=mean(smTS, na.rm=TRUE) # Mean
varVal=var(smTS, na.rm=TRUE) # Variance
skewVal=moments::skewness(smTS, na.rm=TRUE) # Skewness
output=c(meanVal,varVal,skewVal) # Combine all statistics
return(output) # Return output
######## OPERATION ENDS #############
} else {
return(rep(NA,3)) # If conditions !=TRUE, return array with NA
},error =function(e){return(rep(NA, 3))}) # If Error== TRUE, return array with NA
}
# Apply function to all grids in parallel
library(tictoc)
tic()
stat_brk = app(SMAPrast,
my_fun,
minSamp = 50, # Minimum assured samples for statistics
cores =parallel::detectCores(logical = FALSE) - 1) # Leave one core for housekeeping
# Beware while using detectCores().
# The argument logical = FALSE returns the number of physical cores.
# logical = TRUE returns the number of available hardware threads.
names(stat_brk)=c("Mean", "Variance", "Skewness") # Add layer names
toc()
# Plot statistics
library(cetcolor)
colpal = cetcolor::cet_pal(20, name = "r2")
plot(stat_brk, col=colpal)
```
### Best practices for large-scale parallel operations
Error handling is the art of debugging unexpected problems in your code.
One easy solution when looping through customized functions is to include `print()` messages after each major operation which can help indicate where the error might be happening.
When working with large spatial data, the following the steps listed below can be very helpful:
> 1. Check if the function works as expected by testing it first on a sample series extracted (using `terra::extract` function) for a test location from the multilayer raster.
> For example, for a sample location with Long=-100, and Lat=35 (in decimal degrees) we can extract a time series for soil moisture and test the custom function as follows:
>
> ```{r 9a3a, message = FALSE, warning = FALSE}
> # Location with Long=-100, and Lat=35 (in decimal degrees) extract time series
> ts_sample=terra::extract(SMAPrast, cbind(-100,35))
> >
> # Does it generate three numeric values as expected?
> my_fun(ts_sample, minSamp=50)
> ```
>
> 2. Try parallel operation on a smaller region before mounting large jobs for cpmputing.
> Pixel-wise implementation of the function can help identify errors in the code.
> Convert the cropped region into a data frame and apply function to time series of each cell.
> If your code throws error, troubleshoot carefully for the series which generates the error.
>
> ```{r 9a3b, message = FALSE, warning = FALSE}
> library(terra)
> e <- ext( c(-110,-108, 35,37) ) # Sample 2X2 degree domain
> p <- as.polygons(e)
> crs(p) <- "EPSG:4326" # Use this polygon to crop and mask the larger SpatRaster
> ```
>
> 3. Use `tryCatch` **carefully** as it may suppress legitimate errors as well, generating spurious results.
> Test the codes for smaller region without `tryCatch` to test the robustness of your codes.
Remember, parallel computing may have some overheads upon creation and closing of clusters.
A significant improvement in computing times using parallel techniques would be visible for large jobs.
## Cell–wise Operation On Two Multilayer Rasters
![](images/clipboard-2268304266.png){fig-align="center" width="750"}
We will use modeled daily evapotranspiration (ET) and maximum daily temperature (Tmax) from NOAA-Physical Sciences Lab's repository for the year 2011 to find the correlation between ET and Tmax at each grid over CONUS.
This dataset is available in NetCDF format (`tmax.2011.nc` and `et.2011.nc`).
Due to sufficient availability of moisture in humid and sub-humid climates, an increase in temperature is matched with a corresponding increase in evapotranspiration (hence, positive correlation).
In contrast, in arid and semi-arid regions, general scarcity of moisture restricts cooling of the surface by evaporation.
Hence, larger fraction of incoming radiation is used up to heat up the land surface.
This shows up as near zero or negative correlation between temperature and evapotranspiration in our analysis.
This forms the basis of terrestrial energy balance.
```{r 9a4, message = FALSE, warning = FALSE}
# Open access path of the daily ET and Tmax .nc files for the year 2011
tmax_path = "https://downloads.psl.noaa.gov//Datasets/livneh/metvars/tmax.2011.nc"
et_path = "https://downloads.psl.noaa.gov//Datasets/livneh/fluxvars/et.2011.nc"
# Download multilayer rasters from the NOAA-PSL servers
download.file(url = tmax_path, method="curl", destfile = "tmax.2011.nc")
download.file(url = et_path, method="curl", destfile = "et.2011.nc")
# Custon function for correlation between time series from ET and Tmax multilayer rasters
#~~~ arg: "pairwise.complete.obs" ignores NA values in either dataset
corfun=function (x, y) {
return(cor(x, y, use = "pairwise.complete.obs"))
}
# Import netCDFs as multilayer rasters
tmax=rast("tmax.2011.nc")
et=rast("et.2011.nc")
# Pixelwise correlation between daily ET and Tmax
xcor = terra::xapp(et, tmax, fun= corfun)
# Plot map of the correlation
plot(xcor, main=" Correlation: ET vs Tmax")
```
## Layer–wise Parallel Computing
![](images/clipboard-1723748174.png){fig-align="center" width="750"}
We will convert SpatRaster to a list of rasters and then we will apply `my_fun` to each element of the list in parallel using `future_lapply`.
Beware while using `detectCores()`.
The argument `logical = FALSE` returns the number of physical cores and `logical = TRUE` returns the number of available hardware threads.
```{r 9a5, message = FALSE, warning = FALSE}
library(terra)
# Import SMAP soil moisture NetCDF to the workspace
SMAPrast = rast("./SampleData-master/SMAP_L3_USA.nc")
# Convert Spatraster to a list of rasters
rasList=as.list(SMAPrast[[1:10]]) # What will happen if we pass rast(rasList)?
length(rasList)
# Custom function to be implemented on each layer
my_fun = function(x){
x=as.numeric(values(x)) # Create vector of numeric values of SpatRaster
meanVal=base::mean(x, na.rm=TRUE) # Mean
varVal=stats::var(x, na.rm=TRUE) # Variance
skewVal=moments::skewness(x, na.rm=TRUE) # Skewness
output=c(meanVal,varVal,skewVal) # Combine all statistics
return(output) # Return output
}
# Test the function for one raster
my_fun(rasList[[10]])
# Apply function in parallel for all layers
library(parallel)
library(future.apply)
library(future)
library(tictoc)
# Create worker nodes with shared environment
#~~~ employ max core-1 for processing
future::plan(multicore, workers = detectCores(logical = FALSE) - 1)
# Deploy function in parallel
tic()
outStat= future_lapply(rasList, my_fun)
toc()
# Check output for one layer
# outStat[[2]]
```
## Block–wise Parallel Computing
![](images/clipboard-3165518732.png){fig-align="center" width="750"}
In this section we will use a shapefile to extract cell values from a SpatRaster as a list using `exact_extract`.
Summary statistics will be calculated in parallel using `my_fun` for dataset for each feature.
Function `exactextractr::exact_extract` is faster and more suited for large applications compared to `terra::extract`.
Although both perform similar operation with little changes in output format.
```{r 9a6, message = FALSE, warning = FALSE}
#~ Extract feature data as data frame
library(exactextractr)
library(sf)
library(sp)
featureData=exact_extract(SMAPrast, # Raster brick
st_as_sf(us_shp), # Convert shapefile to sf (simple feature)
force_df = FALSE, # Output as a data.frame?
include_xy = FALSE, # Include cell lat-long in output?
fun = NULL, # Specify the function to apply for each feature extracted data
progress = TRUE) # Progressbar
length(featureData) # Same as feature count in CONUS? i.e. nrow(conus)
# Lets try out data for Louisiana
which(us_shp$NAME=="Louisiana") # Find feature number for Louisiana
# View(featureData[[10]]) # View the extracted data frame
nrow(featureData[[10]]) # No. pixels within selected feature
```
Each row in `featureData[[10]]` is the time series of cell values which fall within the boundary of feature number 10, i.e.
Louisiana.
Since `exact_extract` function provides `coverage_fraction` for each pixel in the output, we will make some minor change in the `my_fun` function to remove this variable before calculating the statistics.
```{r 9a7, message = FALSE, warning = FALSE}
# Extract SM time series for first pixel by removing percentage fraction
cellTS=as.numeric(featureData[[10]][1,1:nlyr(SMAPrast)])
# Plot time time series for the selected feature
plot(cellTS, type="l", xlab="Time", ylab="Soil moisture")
#~~ We will make another small change in the custom function for mean, variance and skewness
my_fun = function(x, minSamp =50, na.rm=TRUE){
xDF=data.frame(x) # Convert list to data frame
xDF=xDF[ , !(names(xDF) %in% 'coverage_fraction')] # Remove coverage_fraction column
xData=as.vector(as.matrix(xDF)) # Convert data.frame to 1-D matrix
smTS=as.numeric(na.omit(xData)) # Omit NA values
# Implement function with trycatch for catching exception
tryCatch(if(length(smTS)>minSamp) { # Apply minimum sample filter
######## OPERATION BEGINS #############
meanVal=mean(smTS, na.rm=TRUE) # Mean
varVal=var(smTS, na.rm=TRUE) # Variance
skewVal=moments::skewness(smTS, na.rm=TRUE) # Skewness
output=c(meanVal,varVal,skewVal) # Combine all statistics
return(output) # Return output
######## OPERATION ENDS #############
} else {
return(rep(NA,3)) # If conditions !=TRUE, return array with NA
},error =function(e){return(rep(NA, 3))}) # If Error== TRUE, return array with NA
}
```
Let’s apply `my_fun` to extracted data for each feature.
```{r 9a8, message = FALSE, warning = FALSE}
# Test the function for one block
my_fun(featureData[[10]])
# Apply function in parallel for all layers
library(parallel)
library(snow)
library(future.apply)
library(future)
# Specify argument minSamp to be passed along to all nodes
minSamp=50 # Minimum assured samples for statistics
# Create worker nodes with shared environment
future::plan(multisession, workers = detectCores(logical = FALSE) - 1)
# Apply the function in parallel
outStat= future_lapply(featureData, my_fun)
# Test output for one feature
outStat[[10]] # Is this the same as before?
# Extract each summary stats for all features from the output list
FeatureMean=sapply(outStat,"[[",1) # Extract mean for all features
FeatureVar=sapply(outStat,"[[",2) # Extract variance for all features
FeatureSkew=sapply(outStat,"[[",3) # Extract skewness for all features
# Let's place mean statistics as an attribute to the shapefile
us_shp$meanSM=FeatureMean
# Plot mean soil moisture map for CONUS
library(rcartocolor)
library(ggplot2)
library(sf)
library(sp)
mean_map=ggplot() +
geom_sf(data = st_as_sf(us_shp), # CONUS shp as sf object (simple feature)
aes(fill = meanSM)) + # Plot fill color= mean soil moisture
scale_fill_carto_c(palette = "BluYl", # Using carto color palette
name = "Mean SM", # Legend name
na.value = "#e9e9e9", # Fill values for NA
direction = 1)+ # To invert color, use -1
coord_sf(crs = 2163)+ # Reprojecting polygon 4326 or 3083
theme_void() + # Plot theme. Try: theme_bw
theme(legend.position = c(0.2, 0.1),
legend.direction = "horizontal",
legend.key.width = unit(5, "mm"),
legend.key.height = unit(4, "mm"))
mean_map
```
------------------------------------------------------------------------
![](images/clipboard-2382276931.png){fig-align="center" width="600"}