---
title: "Climate reconstruction from tree rings by stepwise regression"
output:
html_document:
df_print: paged
html_notebook: default
pdf_document: default
# Edit following line to path where ReconStep.bib stored on your computer
bibliography: /home/dave/GitWork/TRISH-R/ReconStep.bib
---
ReconStep.Rmd is an instructional [R Markdown](http://rmarkdown.rstudio.com) notebook ("Rmd" file) to illustrate the reconstruction of a climatic or hydrologic time series from a network of tree-ring site chronologies by stepwise multiple linear regression. The latest version of ReconStep was revised for the course "Tree Rings, Climate, Natural Resources, and Human Interaction," to be held in August 2025 at the Petnica Science Research Center in Serbia.
The code in ReconStep.Rmd is an R script modified from an earlier script, StepwiseRecon.Rmd, written for the dendroclimatology module of the summer 2022 Dendrochronology Intensive Summer Course (DISC) at the University of Arizona.
ReconStep.Rmd comprises "chunks" of code preceded by text explaining what the chunks do. "Rmd" stands for "R Markdown." Rmd files can be run in RStudio using a dropdown menu ("Run" button) that lets you go chunk-by-chunk or run the entire Rmd at once. Results, which may be graphics, tables, or data listings, appear beneath the chunk that generated them.
After running this script, you can right-click on any graphic and save it as a png for outside use. You have access to all computed statistics and time series in the workspace. Key statistics are also saved in an output file "RecOutput.Rdata" that you can re-load into R. Some of the contents of RecOutput.Rdata are:
- F: list with results of regression modeling and cross-validation
- yrgoc and yrspc: first and last year of calibration period
- resSS: list with results of split-sample stability check of regression model
- resIV: list with results of independent verification on a period held back from calibration/validation; relevant only if the specifications call for independent verification (see User settings).
- resPCA: list with results of PCA run on matrix of screened lagged chronologies
- YhCI.ts: matrix with reconstructed predictand, full length, with 50% confidence interval (CI)
- Z.ts: time series of observed and reconstructed predictand for the calibration period
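The saved file can be re-loaded in a later R session with `load()`. Here is a self-contained sketch of that round trip, using a toy list as a stand-in for the real results object F:

```r
# Sketch of the save/load round trip used for RecOutput.Rdata. A toy list
# stands in for the real results object F described above.
F <- list(stepFinal = 3, REcv = 0.41)            # toy stand-in for regression results
pathOut <- file.path(tempdir(), "RecOutputDemo.Rdata")
save(F, file = pathOut)                          # analogous to saving RecOutput.Rdata
rm(F)
load(pathOut)                                    # restores the object under its saved name
names(F)
```

Note that `load()` restores objects under the names they were saved with, so there is no assignment on the `load()` line.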
### Preliminary steps
Preliminary steps are to install R and RStudio along with any needed R packages. The R packages essential to this script are listed in the first few "library" commands of the next code chunk; the remaining packages loaded there serve other scripts used in the course. Compatibility problems sometimes come up when using old versions of R or RStudio. My versions for testing this program are:
- R version 4.5.0 (2025-04-11)
- RStudio version 2024.12.1 Build 563
You also need access to eight user-written functions (UWFs), which for this script are defined as functions written by David Meko. The "source" commands in the chunk below indicate the UWFs needed. Notice that these source commands include a path to a folder on Meko's laptop. You need to edit the lines to replace that path with a path to where you are storing the UWFs on your laptop. The simplest solution is to keep everything -- this script, the UWFs, and all data files -- in a single R project folder. In that case no path is needed in the "source" commands for the UWFs.
You also need to prepare your data files. Two files are required:
1) a tab-separated numeric-only two-column file (year and data) of the climate variable to be reconstructed\
2) a tab-separated numeric-only *m*-column file (year and chronologies) of the *m-1* chronologies you intend to pull your predictors from.
Your input tree-ring file can contain a maximum of five chronologies. This limit was set to avoid overly cumbersome tables and other output in an instructional setting. See the sample input data files for an example of the data formats for the input.
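As a self-contained sketch of the climate-file format (the sample files in the repository are the authoritative reference; the tree-ring file differs in having a header line and more columns):

```r
# Write and re-read a toy climate input file: two numeric columns (year, value),
# tab-separated, no header, as described above for the predictand file.
clim <- data.frame(year  = 1950:1954,
                   value = c(3.1, 2.7, 4.0, 3.3, 2.9))
pathDemo <- file.path(tempdir(), "ClimateDemo.txt")
write.table(clim, pathDemo, sep = "\t", row.names = FALSE, col.names = FALSE)
V <- read.table(pathDemo, header = FALSE)        # year in col 1, data in col 2
dim(V)
```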
This Rmd file (ReconStep.Rmd), the UWFs it calls, sample input data, and running instructions can be downloaded from the GitHub repository
<https://github.com/ltrr-arizona-edu/TRISH-R>
1. Go to the above URL
2. Click the green button labeled "Code" and click "Download zip" from the dropdown menu. The file "TRISH-R-main.zip" will be downloaded; this zip file contains all the files you will need
3. Unzip TRISH-R-main.zip such that all contained files are in the R project directory you created in an earlier step for working with ReconStep.Rmd. Beginning R users should store everything (e.g., UWFs, data files, output files) in the R project directory. Advanced users will likely tailor the paths in ReconStep.Rmd and keep some types of files in separate folders.
The best way to start using this script is to first get it working on the sample data. If that works, you know that R and RStudio have been properly set up and that the necessary R packages have been successfully installed.
Then modify the script and try running it on your own data. You will need to edit the lines in the section "User settings" to match your data files and desired settings for the analysis.
Besides lines in "User settings," the only other lines you need to modify are those mentioned above that deal with paths to UWFs.
### Outline of reconstruction method
A predictand, *y*, is regressed forward stepwise [@Draper&Smith1981] on screened, lagged tree-ring chronologies or their principal components. Stepwise regression on lagged predictors has a long history of use in dendroclimatology [@Stockton&Fritts1973; @Fritts1976; @Stockton&Meko1983; @Cook&Meko1999; @Meko&Stahle2011; @Meko&Biondi2024].
Stepwise regression relies on automated selection of predictors, and so has a risk of overfitting the regression model [@Rencher&Pun1980]. This script has two safeguards against overfitting. First, you can restrict the size of the pool of potential predictors through a parameter in "User settings." Second, an option allows a cross-validation stopping rule [@Wilks2019] to terminate the entry of predictors in the stepwise regression whenever an additional predictor would lower the skill of cross-validation, as measured by the reduction-of-error statistic [@Fritts&Guiot1990].
Cross-validation, following @Meko1997a, leaves out *m* observations in developing cross-validation residuals. If the specifications allow no lags on predictors, $m=1$, and cross-validation is "leave-1-out" [@Michaelsen1987]. Otherwise, *m* is set to $m=1+4L_{max}$, where $L_{max}$ is the highest positive or negative lag allowed on the chronologies. This results in more than one observation being omitted for each iteration of cross-validation modeling. For example, if two negative and two positive lags are allowed, $L_{max}=2$ and $m=9$ observations are omitted in each cross-validation iteration.
In leave-*m*-out cross-validation *m* is always an odd number. A separate model to predict *y* for each year of the calibration period is fit to a subset of years from which *m* years of the full calibration period, centered on the target year, have been omitted. The procedure guarantees that for a regression model that includes lagged predictors the cross-validation prediction for any year, *t*=*j*, does not depend on any of the same tree-ring values used to calibrate the cross-validation model [@Meko1997a].
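The leave-*m*-out windows can be sketched as follows (an illustration only, not the code of the cross-validation UWF; in particular, how the window is handled near the ends of the calibration period may differ in CrossValid1.R):

```r
# Leave-m-out cross-validation windows. With Lmax = 2, m = 1 + 4*Lmax = 9
# observations centered on the target year are omitted from each fit.
Lmax <- 2                        # highest lag allowed on the chronologies
m    <- 1 + 4 * Lmax             # number of omitted observations (always odd)
n    <- 60                       # length of a hypothetical calibration period
half <- (m - 1) / 2
omitted <- function(j) {
  w <- (j - half):(j + half)     # window centered on target year j
  w[w >= 1 & w <= n]             # clip the window at the calibration-period ends
}
omitted(30)                      # interior year: the full 9-year block, 26 to 34
omitted(2)                       # near the start, the window is necessarily clipped
```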
Reconstruction steps are summarized below.
1) Starting point is a matrix of up to five tree-ring chronologies and an annual climate series
2) Optionally, expand chronology matrix by including lags on the chronologies
3) Screen the matrix of lagged chronologies to include only those variables most highly correlated (absolute correlation) with *y*.
4) Run a principal components analysis (PCA; @Mardia&Kent1979) on the matrix of screened lagged chronologies, and subsequently use either the screened chronologies or their principal component (PC) scores as the pool of potential predictors in stepwise regression. Assume that *X* is a time series matrix of this pool of variables, whether chronologies or PCs.
5) Regress *y* on variables in *X* forward stepwise to estimate the reconstruction model, using one of two alternative "stopping rules" dictating the final step of the model to be used for reconstruction:
1) Maximum adjusted *R^2^* [@Draper&Smith1981]
2) Maximum cross-validation RE [@Wilks2019]
Rule 2 is called a "cross-validation stopping rule." Usually, rule 2 will give the same model or a simpler model (fewer steps) than rule 1, but it is not constrained to do so. Whichever stopping rule you choose, the model is cross-validated at each step and the cross-validation statistics are saved.
6) Apply the model to full-length matrix of available tree-ring data to generate reconstructed *y*, or $\hat{y}$.
The reconstruction process also includes analysis of residuals and split-sample calibration/validation of the final regression model. Optionally, "independent verification" of the model is also implemented. Independent verification is checking skill of prediction on observations (years) of data withheld from data screening or model calibration. More details are given in the text preceding each code chunk, as well as in the comments inside the chunks.
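For reference, the reduction-of-error statistic behind stopping rule 2 is not written out in this document; in its usual form it compares squared prediction errors against a null prediction equal to the calibration-period mean. A minimal sketch (variable names here are illustrative, not those of the UWFs):

```r
# Reduction of error (RE): 1 minus the ratio of the sum of squared prediction
# errors to the squared errors of always predicting the calibration-period
# mean. RE > 0 indicates skill over that null prediction.
RE <- function(obs, pred, calib_mean) {
  1 - sum((obs - pred)^2) / sum((obs - calib_mean)^2)
}
obs <- c(10, 12, 9, 14, 11)              # toy observed series
RE(obs, obs, mean(obs))                  # perfect prediction gives RE = 1
RE(obs, rep(mean(obs), 5), mean(obs))    # predicting the mean gives RE = 0
```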
### Libraries and functions
The next chunk loads libraries and makes accessible, or "sources," any needed UWFs. You need to edit just one line in the next chunk: change the folder in the "code_dir" statement to match the folder where you have stored the UWFs on your laptop.
Try executing the chunk below by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Ctrl+Shift+Enter*. You will get a couple warnings about objects being "masked" from a couple R packages; ignore the warnings.
```{r}
rm(list=ls())
# Edit following line to match where you have stored user-written functions (UWFs)
code_dir <- "/home/dave/GitWork/TRISH-R/"
# The following packages are essential to ReconStep.Rmd
library(rmarkdown)
library(car)
library(pracma)
library(nortest)
library(stringr)
# The next packages are not needed by ReconStep.Rmd but are needed for a
# couple chronology development scripts in the international course
library(dplR)
library(data.table)
library(treeclim)
# The next packages are not needed by ReconStep.Rmd, but are needed by
# ReconAnalog.R
library(resample)
library(rjson)
library(ggplot2)
library(gplots)
source(paste(code_dir,'LagYear.R',sep='')) # build lagged matrix of chrons
source(paste(code_dir,'PeriodCommon.R',sep='')) # common period of overlap of chrons and predictand
source(paste(code_dir,'ForwStep1.R',sep='')) # forward stepwise regression
source(paste(code_dir,'mannken1.R',sep='')) # Mann-Kendall test for trend
source(paste(code_dir,'xyCI.R',sep='')) # utility function to compute x and y coordinates for patch for CI on recon plot
source(paste(code_dir,'ssValid.R',sep='')) # split-sample calib-valid
source(paste(code_dir,'NashSutt.R',sep='')) # independent verification check
source(paste(code_dir,'CrossValid1.R',sep='')) # cross-validation, called by ForwStep1
```
The next chunk reads Meko's full bibliography from his laptop and writes a file, ReconStepMeko.bib, that includes references for only the citations in this Rmd file. The general user should leave the last line of this chunk commented out; that line is
"clean_bib('ReconStep.Rmd','/home/dave/Data/BibTex/BibliographyMeko.bib','ReconStepMeko.bib')"
```{r}
clean_bib <- function(input_file, input_bib, output_bib){
lines <- paste(readLines(input_file), collapse = "")
entries <- unique(str_match_all(lines, "@([a-zA-Z0-9]+)[,\\. \\?\\!\\]\\;]")[[1]][, 2])
bib <- paste(readLines(input_bib), collapse = "\n")
bib <- unlist(strsplit(bib, "\n@"))
output <- sapply(entries, grep, bib, value = T)
output <- paste("@", output, sep = "")
writeLines(unlist(output), output_bib)
}
# Call function to generate bib file for this project from my bib file. User
# can ignore this code. I must be aware that clean_bib does not work properly with
# bibtex files, like mine, that have key fields with more than one name, such
# as "Meko&Stockton1983." So, I have to build ReconStep.bib manually, pasting
# from /media/dave/Red128/BibliographyMeko.bib to ReconStep.bib. To avoid catastrophic
# overwriting of ReconStep.bib, I name the NEVER TO BE CREATED output file below
# 'ReconStepMeko.bib' instead of ReconStep.bib.
#
# clean_bib('ReconStep.Rmd','/home/dave/Data/BibTex/BibliographyMeko.bib','ReconStepMeko.bib')
```
### User settings
The next chunk allows you to tailor the reconstruction model. Settings include input filenames and various parameters that control the modeling. You can allow lags (up to 2 years) on the tree-ring chronologies used as predictor variables. To guard against overfitting the regression model, you can specify that the pool of potential predictors must be no larger than some specified fraction of the number of years in the calibration period. You can optionally specify transformation of the regression predictand, *y*. You can also optionally specify that variables in the pool of potential predictors be principal components of chronologies rather than chronologies themselves. Another setting allows you to specify a stopping rule for entry of variables in stepwise regression. Predictors for stepwise regression are entered in order of maximum reduction of variance of residuals.
Two alternative rules are available for stopping entry of predictors in stepwise regression: 1) maximum adjusted *R^2^* of regression, and 2) maximum cross-validation RE. Please read the comments in the chunk below for additional guidance in editing the code.
```{r}
# Specify paths and filenames for the files with the input climate and tree-ring time series.
# A path, if included, must match the setup of your laptop. If no path is specified, the file
# is assumed to be in the R project directory.
# Edit following line (two lines) for path and filenames
fileClimate <- '/home/dave/GitWork/TRISH-R/PsepAugTahoe.txt'
fileTree <- '/home/dave/GitWork/TRISH-R/ChronsTahoe5.txt'
# Specify desired calibration period of the reconstruction model. This period is also used for any
# correlation screening of the pool of potential predictors, and for cross-validation of the
# reconstruction model. The specified first and last years apply to years of the predictand, y. Setting
# yrgoc and yrspc to "NA" means the calibration period should be chosen automatically from the
# maximum overlap of y (climate) and the tree-ring series. If you mistakenly specify an
# impossible calibration period, the settings will be ignored and the period will be
# chosen automatically.
yrgoc<-NA # start of calib period (if data and lags allow)
yrspc<- NA # end of calib period (...)
# Set the period for "independent" verification. This must not overlap any of the years in
# yrgoc-yrspc. Independent verification checks how well the model predicts the predictand
# for a period NOT USED IN ANY WAY to screen predictors or calibrate the model. If both
# yrgoi and yrspi are set to NA, no independent verification is attempted. This is an appropriate
# setting when there is no additional overlap of tree rings and the predictand beyond what
# is used for calibration/validation.
yrgoi <- NA # start year of independent verification
yrspi <- NA # end year of independent verification
# Specify the maximum number of negative and positive lags on the predictors allowed
# in the predictor pool. The final model may not include such lags, depending on the
# results of stepwise regression. The maximum allowable lag is 2 years. In other
# words, the only acceptable settings for nNeg and nPos are 0, 1, or 2.
nNeg=1 # maximum negative lag allowed on chronologies
nPos=1 # maximum positive lag allowed on chronologies
# Specify a decimal fraction that caps the size of the pool of potential predictors
# as some fraction of number of years in the calibration period. This is a safeguard
# against model overfitting. Regression models that select predictors automatically
# can result in chance relationships if the pool of potential predictors is some large
# fraction of the number of observations (years) available for calibrating the model.
# Various rules exist on the allowable ratio of number of potential predictors to
# number of observations for calibration. Ideally, there should be at least 5 to 10
# times as many years in the calibration period as variables in the pool. Here by default
# the predictor pool is capped at no larger than 1/10 the number of observations in the
# calibration period. Beware that with lagged models the pool can become large even if
# using only a few chronologies. For example, 5 chronologies and a specified maximum lag
# of 2 years gives a pool of 25 potential predictors (five lags, -2 to +2, times 5
# chronologies). In that case, if the calibration period were 100 years, only the 10
# potential predictors most highly correlated (Pearson) with the predictand would
# be allowed in the pool.
f<-1/10 # pool of potential predictors not allowed to be larger than this fraction of calibration period observations
# Optionally specify to transform y before regression using either a log10
# or square root transform. Keep in mind that log10 is invalid if y has
# any zero or negative values, and that the square root transform is invalid
# if y has any negative values.
# 1 = no transform
# 2 = square root transform
# 3 = log10 transform
ktransform<-1 # Transform predictand?
# Specify the stopping rule for stepwise regression. Two options are available.
# Regardless of the option selected, regression proceeds stepwise forward, at
# each step entering the variable that gives the greatest reduction of residual
# variance from the previous step. The stopping rule specifies how the final
# regression model is selected. One option is step at which the regression model
# has maximum adjusted R-squared. The other option is the step with maximum
# RE of cross-validation
kStopWhen<-2 # 1=max adjusted R-squared, 2= maximum RE
# Specify whether the regression model for reconstruction uses original
# chronologies or their principal components (PCs). A setting of "FALSE"
# instructs to use the screened lagged chronologies as the pool. A setting of
# "TRUE" instructs to use the PCs of those screened lagged chronologies.
PCA<-TRUE # regression on PCs of screened lagged chronologies
par(ask=TRUE) # pause before each plot; no need to edit this line
# END of user-defined variables/parameters
```
################################################################################
#### PART 1. Read input, optionally transform *y*, and build a lagged matrix of chronologies
################################################################################
The next chunk reads input time series, optionally builds a lagged time series matrix of chronologies, optionally transforms *y*, checks whether the tree-ring matrix has enough columns to support a PCA, plots the time series *y*, and lists the coded names of tree-ring variables in the console.
If you have allowed lagged predictors, the lagged time series matrix of the input tree-ring chronologies is organized with unlagged series in the leftmost columns, followed by negative lags and positive lags. The column names of this lagged matrix identify the original chronology and lag. For example, X5, X5N1, X5P2 are chronology #5 lagged 0, -1 and +2 years from the year of the predictand.
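The naming scheme can be illustrated with a hand-built toy example for a single chronology (this is not the output of LagYear.R, and the real UWF's trimming convention may differ):

```r
# Toy illustration of lagging one chronology with nNeg = nPos = 1.
# Rows are trimmed so that every lagged column is defined for the kept years.
x    <- c(1.02, 0.95, 1.10, 0.88, 1.05, 0.99)   # toy chronology for years 1:6
yr   <- 1:6
keep <- 2:5                       # years for which all three lags are defined
W <- cbind(X1   = x[keep],        # unlagged chronology (lag 0)
           X1N1 = x[keep - 1],    # lag -1: prior year's tree-ring value
           X1P1 = x[keep + 1])    # lag +1: following year's value
rownames(W) <- yr[keep]
W
```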
The program stops with an error message if you called for reconstruction by PCA and the matrix of lagged chronologies has fewer than three columns.
Optional transformations for *y* are log10 and square root.
The time series plot of *y* covers all years of the provided *y*. The list of coded names of tree-ring variables that appears in the console identifies columns of the tree-ring matrix before screening against *y*.
```{r}
# READ INPUT TIME SERIES
#V <- read.table(fileClimate,sep="\t",header=FALSE) # input predictand, tab separated
V <- read.table(fileClimate,sep="",header=FALSE) # input predictand, space separated
#U <- read.table(fileTree,sep="\t",header=TRUE) # input chronologies, tab-sep
U <- read.table(fileTree,sep="",header=TRUE) # input chronologies, space-sep
#--- TRIM PREDICTAND TO ELIMINATE NA'S
V<-as.matrix(V)
L<-complete.cases(V)
V<-V[L,]
#--- TRANSFORM PREDICTAND?
v<-V[,2]
yrv=V[,1]
if (ktransform==1){
v<-v
dunits <-'original'
} else if (ktransform==2) {
v<-sqrt(v)
dunits <- 'square root of original'
} else {
v=log10(v)
dunits <- 'log10 of original'
}
# want chronologies as matrix class
U<-as.matrix(U)
yrU<-U[,1]
U<-U[,-1]
yrU<-as.matrix(yrU)
U<-as.matrix(U)
# Call a UWF to build a lagged matrix of the input chronologies. The columns of
# the lagged matrix, W, will be organized left to right in this order:
# 1) unlagged chronologies
# 2) negative lags on chronologies, beginning with lag t-1 from y
# 3) positive lags on chronologies, beginning with lag t+1 from y
mU<-nrow(U)
tgo<-yrU[1]
tsp<-yrU[mU]
ktrim=2
D<-LagYear(U,tgo,tsp,nNeg,nPos,ktrim) # list with lagged matrix, ids, etc
W<-D$X # lagged matrix of chrons
colnames(W)<-D$ids # add column names associating variables and lags with W
yrW<-D$tGo:D$tSp
# W, yrW, colnames(W) now describe the full lagged chronology matrix
# Abort with error message if have called for reconstruction with PCA but
# fewer than 3 variables in pool of potential predictors
if ((ncol(W)<3) && (PCA))
  stop("Error: You chose reconstruction by PCA, but the pool of potential predictors has fewer than 3 variables")
# Time plot of predictand
tit1 <- paste('Time plot of predictand (',dunits,' units); line at mean')
plot(yrv,v,type='b',xlab='Year',ylab='Predictand',main=tit1,col='blue')
abline(h=mean(v),col='gray')
# Listing of variable codes in lagged chronology matrix
sprintf('%s','Variables (cols) in the lagged chronology matrix W, before any correlation screening')
sprintf('%s',colnames(W))
rm(tit1)
```
################################################################
#### PART 2. Correlation screening of lagged chronology matrix
#################################################################
The next chunk of code screens the matrix of lagged chronologies to include only the *m* variables most highly correlated (absolute correlation) with *y*. The purpose of the screening is to avoid artificial, or "chance," relationships that can result from an overly large pool of potential predictors [@Rencher&Pun1980]. Your setting ("User settings") for parameter *f* in the next chunk controls the screening. Let's take an example. With 5 chronologies, and allowing for lags of up to ±2 years, the pool of potential predictors has a total of 5x5=25 variables. If the length of the calibration period is 50 years, the ratio of the size of the pool to the number of calibration observations is 25/50=0.5. Such a high ratio gives a high likelihood of chance relationships in stepwise regression. This risk can be reduced by keeping just those tree-ring variables most highly correlated with *y* in the pool (correlation screening). The setting for parameter *f* forces the pool of potential predictors to include at most *fN* variables; those variables are selected in order of decreasing absolute correlation with *y.* The default is *f=0.1*. You can change this to make the screening more or less stringent. The default follows the "10:1" rule of minimum recommended ratio of number of calibration observations to number of potential predictors in the pool for stepwise regression [@Chowdhury&Turin2020].
The next chunk also creates a list of the ids of the screened tree-ring variables. These or their principal components will comprise the pool of potential predictors for the reconstruction model. A bar plot of the correlations is also displayed. The longest bars in the chart correspond to the predictors passing screening. The coded names of those predictors are listed in the console. The correlations are computed for the calibration period that you specified in "User settings," or otherwise for the years in which all series overlap the predictand (default calibration period).
The "critical" absolute correlation as defined here is not necessarily statistically significant and indeed is not tested for statistical significance. The variables in the tree-ring matrix -- possibly including lagged variables -- are ranked according to absolute correlation with the predictand. The "nkeep" variables most highly correlated (absolute) with the predictand are retained as the pool of potential predictors. In that sense, the critical correlation is the absolute correlation that, if used as a threshold, would result in "nkeep" tree-ring variables in the pool of potential predictors. By lowering input parameter *f* you effectively require a stronger correlation for a variable to be in the pool of potential predictors.
```{r}
#--- Cull common period predictand and all cols of lagged chron matrix
B<-cbind(yrv,v)
A<-cbind(yrW,W)
D<-PeriodCommon(A,B) # D$Y, D$X: predictand predictor matrices, with year in col 1
mv1<-D$tsp-D$tgo+1 # number of observations in common period
v1<-D$Y[,2] # common-period predictand
yrv1<-D$Y[,1] # common-period time vector
ntemp<-ncol(D$X) # col-size of lagged predictor matrix, with year col
W1<-D$X[,2:ntemp] # common period predictor matrix
W1<-as.matrix(W1) # in case 1-col W1
yrW1<-yrv1 # common period time vector, duped
colnames(W1)<-colnames(W) # column names remain the same as before culling
#--- If you specified yrgoc or yrspc, truncate W1, yrW1, v1, yrv1 accordingly
# If you specified a yrgoc or yrspc outside the range of yrv1, give error message
Lfront=FALSE
Lback=FALSE
# front
Ltemp = is.na(yrgoc)
if (Ltemp){
yrgoc <- yrv1[1]
} else {
if (yrgoc<yrv1[1]){
strtemp<-paste('Re-specify yrgoc.\n','You set yrgoc = ',sprintf('%5.0f',yrgoc),'.',
'\nThe data are not consistent with yrgoc<',sprintf('%5.0f',yrv1[1]))
stop(strtemp)
} else {
}
if (yrgoc == yrv1[1]) {
} else {
Lfront=TRUE
}
}
# back
Lstopc <- FALSE # initial flag for locking in stop year of calibration to what user specified, as yrspc
Ltemp <- is.na(yrspc)
if (Ltemp){
Lstopc <- TRUE
yrspc <- yrv1[length(yrv1)]
} else {
if (yrspc>yrv1[length(yrv1)]){
strtemp<-paste('Re-specify yrspc.\n','You set yrspc = ',sprintf('%5.0f',yrspc),'.',
'\nThe data are not consistent with yrspc > ',sprintf('%5.0f',yrv1[length(yrv1)]))
stop(strtemp)
} else {
}
if (yrspc == yrv1[length(yrv1)]) {
} else {
Lback=TRUE
}
}
# If you have specified a calibration period shorter than the common period of lagged chronologies and predictand,
# truncate accordingly because want the correlation screening as well as model calibration on the shortened interval
Ltemp = Lfront || Lback
if (Ltemp){
L <- (yrv1 >= yrgoc) & (yrv1<=yrspc)
v1 <- v1[L]; yrv1 <- yrv1[L]
W1 <- W1[L,]; yrW1 <- yrW1[L]
mv1 = length(v1)
} else {
}
rm(ntemp,A,B,D,Lfront,Lback,Ltemp,L)
# Check specified years for "independent verification" (IV)." If both yrgoi and yrspi are NA, no
# IV is attempted. Settings yrgoi and yrspi must BOTH be years or BOTH be NA. If IV is desired,
# no overlap is allowed of period yrgoi-yrspi with yrgoc-yrspc, and the IV period must be at least 5 years long.
LIV <- !(is.na(yrgoi) && is.na(yrspi)) # if true, you want IV
if (sum(is.na(c(yrgoi,yrspi)))==1){
stop('Not allowed to set just one of yrgoi and yrspi to NA')
}
if (LIV){
# IV period must be at least 5 years
nIV <- yrspi-yrgoi+1
if (nIV < 5){
stop('Specified period for independent verification shorter than 5 years')
} else {
yrIV <- yrgoi:yrspi # vector of IV years
}
# IV period cannot overlap calibration/validation period
if (any(yrIV %in% yrv1)){
stop('Specified period for independent verification overlaps calibration period')
}
}
# Status: yrv1,v1, yrW1, W1 are calibration-period predictand (v1) and
# predictor. Names of columns of W1, colnames(W1), are coded to indicate lag
#--- Correlate v1 with cols of W1 and identify threshold correlation,
# rc, marking the floor(f*mv1) highest correlation, where f is a decimal fraction and
# mv1 is the number of observations in the calibration period.
#--- Critical threshold absolute correlation
nkeep<-floor(f*mv1) # want no more than this number of potential predictors in pool
# Cannot keep more cols of W1 than W1 has
nkeep <-min(nkeep,ncol(W1))
r<-cor(W1,v1) # Pearson correlation
rs<-sort(abs(r),decreasing=TRUE,index.return=TRUE) # sorted r (list object)
rc = rs$x[nkeep] # threshold absolute correlation for screening
# Pull subset of predictors with absolute correlation >rc
j<-rs$ix;
nW2 <- nkeep # number of screened potential predictors
ix1<-j[1:nkeep] # column indices of screened ...
W2<-W1[,ix1] # correlation-screened set of predictors; note these are ordered
# left to right in W2 in order of decreasing absolute correlation with v1
W2<-as.matrix(W2)
yrW2<-yrW1
# STATUS: W2, yrW2 are the screened potential predictors; ix1 is
# pointer to columns of the screened potential predictors in the
# full matrix of lagged potential predictors
# Barplot of calib-period correlation of predictand with lagged chronologies
barplot(t(r))
tit1 <- paste('r of predictand with lagged chronologies ',
as.integer(yrgoc),'-',as.integer(yrspc),'\n',as.integer(nkeep),' passed screening',
'(critical absolute r=',sprintf('%5.2f',rc),')')
title(tit1)
# List tree-ring variables passing screening
sprintf('%s','Variables passing screening (these in W2 in order L to R by ranked absolute r)')
sprintf('%s',colnames(W2))
```
################################################################
#### PART 3. Optionally extend calibration period
The next chunk optionally revises the end year for calibration of the reconstruction model if the model specifications have called for lagged predictors and correlation screening indicates that positively lagged tree-ring data will not be in the predictor pool. For example, if you specified that the regression should allow lags +1 and +2 years relative to *y* as predictors, and screening shows that positively lagged chronologies are not significantly correlated with *y*, the calibration period can be extended two years on the recent end. Such extension is done only if you have specified that yrspc is to be determined automatically from the overlap of *y* and tree rings (i.e., yrspc=NA).
#################################################################
```{r}
Wtemp<-W[,ix1] # full lagged matrix, screened cols only
Wtemp<-as.matrix(Wtemp)
L <- yrW>=yrv1[1] & yrW<=yrv1[mv1] # logical to years of predictand
Wtemp<-Wtemp[L,]
yrWtemp<-yrW[L]
Wtemp<-as.matrix(Wtemp)
#--- Revise screened predictor pool calibration years if possible and your input setting of yrspc was NA
if ((nrow(Wtemp) != nrow(W2)) && Lstopc ) {
W2<-Wtemp
yrW2<-yrWtemp
# Check yrv against yrW2 and set calibration period as years with complete
# cases of v and W2
B<-cbind(yrv,v)
A<-cbind(yrW2,W2)
D<-PeriodCommon(A,B) # D$Y, D$X: predictand predictor matrices, with year in col 1
mv1<-D$tsp-D$tgo+1 # number of observations in common period
v1<-D$Y[,2] # common period predictand
yrv1<-D$Y[,1] # common period time vector
ntemp<-ncol(D$X) # col-size of lagged predictor matrix, with year col
W2<-D$X[,2:ntemp] # common period predictor matrix
yrW2<-yrv1 # common period time vector, duped
colnames(W2)<-colnames(W)[ix1] # column names remain the same as before culling
yrspc=yrv1[length(yrv1)] # revised end year for calibration
rm(ntemp,A,B,D)
}
```
################################################################################
#### PART 4. PCA transformation of chronologies
The next chunk does a principal component analysis (PCA) of the screened, lagged tree-ring chronologies. Depending on the specifications, the pool of potential predictors for the reconstruction model will be the chronologies themselves or their PCs. A PCA is run and PC time series are generated even if you are using the original chronologies as predictors in the reconstruction model. The PCA is done on the full common period of the screened lagged chronologies. Two plots are produced: a scree plot of eigenvalues, and a time plot of the scores of the first component (PC1). Additional data from the PCA is stored in the output list resPCA.
PCA is not attempted if the matrix of screened, lagged chronologies has fewer than three columns. If PCA is not attempted, the output variable resPCA is set equal to the string "Too few tree-ring variables to justify PCA."
###############################################################################
```{r}
#--- Get matrix of screened lagged chronologies for full common period of those
# chronologies. This will be used for the PCA, and will also be needed later for
# reconstruction by the non-PCA option
Wf<-W[,ix1] # full matrix of screened lagged chronologies
yrWf <-yrW
L<-complete.cases(Wf) # logical to years with no missing data
W3<-Wf[L,] # common-period screened lagged Chronologies
yrW3<-yrWf[L]
# PCA: attempted only if there are 3 or more screened lagged chronologies
if (dim(W3)[2]>2){
# if (dim(W3)[2]>2 && dim(U)[2]>1){ # earlier version also had condition on number of original chronologies (before any lagging) for PCA to be attempted
resPCA<-princomp(x=W3,cor=TRUE,fix_sign=TRUE) # PCA, on correlation matrix
summary(resPCA) # summary of PCA
screeplot(resPCA) # scree plot of variance accounted for by PCs
titTemp <- paste('Scree plot of principal components of screened lagged chronologies',
'\nFor details of the PCA, type summary(resPCA) at R console')
title(titTemp)
A<-resPCA$scores # the new times series; PC "scores"
yrA<-yrW3
plot(yrA,A[,1],type="l",xlab="Year",ylab="Score") # time plot of PC1
title('Scores of PC1 of screened lagged chronologies')
# Pull rows of A for regression modeling;
yrA2<-intersect(yrA,yrv1)
L<- (yrA>=yrA2[1]) & (yrA<=yrA2[length(yrA2)])
A2<-A[L,]
# STATUS. resPCA is a "princomp" object. A summary of the PCA is
# printed by >summary(resPCA). Loadings, scores and other data
# can be obtained by such commands as resPCA$loadings. For
# regression and reconstruction, the scores are stored
# in matrices A2 and A, with years in yrA2 and yrA
} else {
resPCA <- ('Too few tree-ring variables to justify PCA.')
}
```
################################################################################
#### PART 5. Regression modeling
The next chunk does stepwise regression modeling of *y* on either the screened lagged chronologies or their PCs. The tree-ring variable most highly correlated with *y* enters first. Subsequent predictors enter forward stepwise in order of decreasing percentage of residual variance (from previous step) explained. At each step adjusted *R^2^* is recorded and the model is cross-validated. All variables in the pool of potential predictors eventually enter and statistics are examined to determine a final step. This step follows the stopping rule you have specified (see User settings).
The following output appears in the console:
1) Table with model coefficients
2) Time series plots of observed and reconstructed *y*
3) Scatterplot of reconstructed *y* against observed *y*, with adjusted *R*^2^ annotated
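The forward-selection logic described above can be sketched in base R. This is a minimal illustration only, not the ForwStep1 function actually called below, which also handles cross-validation and the user-specified stopping rule:

```r
# Minimal forward-stepwise sketch (illustration only; not ForwStep1):
# at each step add the candidate predictor that most increases adjusted
# R-squared, and stop when no candidate improves it.
forward_select <- function(y, X) {
  X <- as.data.frame(X)
  dat <- cbind(y = y, X)        # single data frame for lm()
  inmodel <- character(0)       # predictors entered so far
  pool <- colnames(X)           # candidates not yet entered
  bestAdj <- -Inf
  repeat {
    if (length(pool) == 0) break
    adj <- sapply(pool, function(p) {
      f <- reformulate(c(inmodel, p), response = "y")
      summary(lm(f, data = dat))$adj.r.squared
    })
    if (max(adj) <= bestAdj) break   # no improvement: stop
    k <- pool[which.max(adj)]
    inmodel <- c(inmodel, k)
    pool <- setdiff(pool, k)
    bestAdj <- max(adj)
  }
  list(predictors = inmodel, adjR2 = bestAdj)
}
```

A stopping rule based on adjusted *R*^2^ alone, as here, differs from one based on cross-validation skill; ForwStep1's rule follows the User settings.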
################################################################################
```{r}
# Review of needed variables, which are ready:
# v1, yrv1: [numeric] time series of predictand and its years
# W2, yrW2: [matrix, vector]: time series matrix of screened predictor pool
# A2, yrA2 is the PC equivalent of W2, yrW2
# colnames(W2) [character] ids of screened lagged predictors, coded by column in
# original loaded tree-ring matrix and by lag
# ix1 [integer]: indices of the screened lagged predictors, giving their
# columns in the full lagged tree-ring matrix before screening.
# ix1 will be needed later when we want to pull tree-ring data to generate
# the long-term reconstruction
#
# For that and for reconstruction we have the full lagged tree-ring matrix
# W, yrW [matrix, numeric], and we know that the screened predictors
# are columns ix1 of W
if (!PCA || (dim(W3)[2] < 3)){ # use chronologies if PCA not requested or too few variables for PCA
# Regression on chronologies themselves (possibly with lags)
namesW2<-colnames(W2) # want these names as a separate input arg to ForwStep1
F<-ForwStep1(W2,namesW2, v1,kStopWhen,nNeg,nPos) # forward stepwise
# regression of v1 on W2
}else{
namesA2<-colnames(A2) # names of PCs: Comp.1, Comp.2, etc
F <- ForwStep1(A2,namesA2,v1,kStopWhen,nNeg,nPos) # forward stepwise
# regression of v1 on A2
}
#--- Get the model object and see a summary
M<-F$Model
summary(M)
# Get fitted values, bind into a matrix with the observed, and plot
# time series
yh <- M$fitted.values
yrgo<-yrv1[1]
Z<-cbind(v1,yh)
Z.ts <-ts(Z,start=yrv1[1])
ts.plot(Z.ts,type='b',col=c('blue','red'),xlim=c(yrgo,yrv1[length(v1)]))
abline(h=0)
#--- Add a title and line at the mean
title('Observed (blue) and Reconstructed (red)')
abline(h=mean(v1),col='grey')
# Scatterplot of predicted on observed, with least-squares
# straight-line fit as well as loess fit of location and spread
scatterplot(v1,yh,xlab='Observed',ylab='Predicted')
title(sprintf('Adjusted R-squared = %5.2f',F$RsquaredAdj))
```
###############################################################################
#### PART 6. Reconstruction
The next chunk multiplies the estimated regression coefficients by the matrix of screened lagged tree-ring chronologies (or their PCs) to generate the reconstruction, $\hat{y}$. This chunk also uses R's built-in "lm" function to recalibrate the final model and compute additional calibration statistics, and checks that the "fitted values" returned by "lm" are the same as the calibration-period segment of $\hat{y}$. These data are listed in two columns in the console so that you can verify that they are the same.
################################################################################
```{r}
#X<-as.data.frame(W3[,F$ColsInModel]) # predictor matrix f
if (!PCA){
X<-W3[,F$ColsInModel] # predictor matrix for recon
yrX<-yrW3
}else{
X<-A[,F$ColsInModel] # predictor matrix for recon
yrX<-yrA
}
X <- as.matrix(X)
mX = nrow(X) # number of rows (years) in predictor matrix
Xones=as.matrix(rep(1,mX))
X<-cbind(Xones,X) # add column of 1's to predictor matrix
b<-F$Coefficients # regression coeffs
b<-as.matrix(b) #... to column vector
v1hat<-X %*% b # reconstructions
#-- renaming for simplicity; we now have v1, yrv1 as the calibration-period predictand
y <- v # observed predictand, all available years
yry <- yrv
yc <- v1; # calibration period predictand
yryc <-yrv1
yh <- v1hat # reconstructed ("hat") predictand
yryh <-yrX
#---- Quality control: are predictions for calib period the same as
# fitted values from the lm function?
yf<-fitted(M) # fitted values from the model object
yryf <-yrv1 # these are for the calib years
yr1 <- yryf[1] # start year of calib
yr2 <-yryf[length(yryf)] # end year of calibration
L <- yryh>=yr1 & yryh<=yr2
ycheck<-yh[L]
# Check that the specified period yrgoi-yrspi for independent verification is consistent with
# overlap of reconstructed and observed predictand
# Know that if LIV =TRUE, you want independent verification, and if FALSE, will skip it
if (LIV){
yrgoO <- max(c(yrv[1],yryh[1])) # first year of overlap of recon and observed predictand
yrspO <- min(c(yrv[length(yrv)]),yryh[length(yryh)]) # last year of ...
yrO <- yrgoO:yrspO
Lkill <- !all(yrIV %in% yrO)
if (Lkill){
str_kill <- paste('You specified independent verification (IV) period ',as.integer(yrgoi),'-',
as.integer(yrspi),' but overlap of observed and reconstructed predictand covers only ',
as.integer(yrgoO),'-',as.integer(yrspO),
'. And keep in mind that IV cannot include any years in the ',as.integer(yrgoc),
'-',as.integer(yrspc),' calibration period.')
stop(str_kill)
}
rm(Lkill)
}
# Quality control (QC1)
#--- Print 2-col matrix and check that the column of lm-fitted values is
# identical to the column of reconstructed values generated by applying the
# regression equation to the matrix of screened lagged chronologies for
# the calibration period
QC1<-cbind(as.matrix(yf),as.matrix(ycheck))
print(QC1)
```
################################################################################
#### PART 7. Independent verification
The next chunk optionally verifies the reconstruction model on a segment of earlier observations withheld from calibration and not used for preliminary screening of tree rings against *y* (e.g., @Meko&Biondi2024). Independent verification (IV) is optional in ReconStep, and is done only if you have specified years for both yrgoi and yrspi (see User settings). If yrgoi and yrspi are set to NA, IV is skipped. IV is rarely done in dendroclimatology because the period of overlap of climate data and tree-ring data is usually so short that people want to take advantage of all of the available data for calibration and cross-validation.
Skill of IV is measured in ReconStep by the Nash-Sutcliffe efficiency (NSE). The NSE, computed as 1 minus the ratio of the variance of errors to variance of observed *y*, is widely used in hydrology [@Nash&Sutcliffe1970] and is equivalent to the coefficient of efficiency (CE), a dendroclimatic validation statistic for which the "null" prediction is the validation-period mean of *y* [@Cook&Kairiukstis1990].
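With those definitions, the NSE reduces to a one-line computation. The sketch below is illustrative only; the NashSutt function actually called in the chunk may differ in its internals:

```r
# Nash-Sutcliffe efficiency (sketch): 1 minus the ratio of the error
# sum of squares to the sum of squared departures of the observations
# from their own mean. NSE = 1 is a perfect prediction; NSE <= 0 means
# no skill over the null prediction (the validation-period mean).
nse <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}
```

Replacing mean(obs) with the calibration-period mean of *y* turns the same formula into the reduction-of-error (RE) statistic.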
If you have called for independent verification, the next chunk returns a time series plot and statistics. The plot has the following features:
1) Time series of observed and reconstructed *y* for the IV period.
2) Horizontal line at the observed mean for the IV period. This line is the null prediction whose errors are used to compute the NSE.
3) Horizontal line at the observed mean for the calibration period of the regression model. This line is the null prediction whose errors are used to compute the reduction-of-error (RE) statistic. Besides the NSE and RE, the Pearson correlation of predicted with observed *y* for the IV period is also computed. These statistics are annotated above the plot. Those and other statistics are stored in the list ResIV, available in the workspace and saved in the output file "RecOutput.Rdata".
```{r}
LIV <- !(is.na(yrgoi) | is.na(yrspi)) # TRUE if IV wanted (consistent with earlier use of LIV)
if (LIV){
# Recall have reconstruction in yh, yryh; observed predictand in y,yry; calibration predictand in yc
# Will call function NashSutt to do the independent verification
kplotIV <- 1 # plot option calling for plot in next device
outputDir <- NA # ignored by NashSutt for kplotIV=1
gFileType <- 'png' # file-type of plot (ignored for kplotIV=1)
# Cull time series of observed and reconstructed predictand for IV period
L <- (yry >= yrgoi) & (yry <= yrspi)
yIV <- y[L]; yryIV <- yrIV # observed
L <- (yryh >= yrgoi) & (yryh <= yrspi)
yhIV <- yh[L]; yryhIV <- yryh[L] # recon
ycMean <- mean(yc)
ResIV <- NashSutt(yIV, yhIV, yrIV, ycMean, kplotIV, outputDir,gFileType)
} else {
ResIV <- 'You elected to not do independent verification'
}
```
################################################################################
#### PART 8. Analysis of residuals
The next chunk does an analysis of residuals, which is a check for violation of regression assumptions. The residuals in regression are assumed to be normally distributed, non-autocorrelated, and homoscedastic [@Wilks2019].
The next chunk also tests the residuals for trend, which is a particular aspect of autocorrelation [@Chatfield2004]. A time series with positive or negative trend will generally have a positive autocorrelation out to some number of lags that depends on the length of the time series and the form of the trend (e.g. linear vs cyclic). It is important to identify trend in the residuals of regression models for reconstruction of climate from tree rings because the trend could indicate a flaw in the model. For example, consider a reconstruction model for annual precipitation, *y*, from tree rings and residuals with a positive trend. Because the residuals are computed as observed minus reconstructed *y*, the positive trend in residuals could indicate a downward trend in tree-growth driven by a climate variable (e.g., air temperature) other than *y*. The most likely suspect in this case is increased water stress in trees over time due to increasing evaporative stress related to warming. Another plausible source of trend in residuals here is improper removal of the age trend in converting measured ring widths to indices. The various tests applied in analysis of residuals here are outlined below. For all tests, the respective null hypothesis is rejected with 95% confidence if the *p* value of the statistic is lower than 0.05.
- **Normality** is tested with the Lilliefors test [@Conover1999], whose p-value is annotated on a plotted histogram. The null hypothesis is that the residuals are a sample from a normal distribution.
- **Heteroscedasticity** is tested with the Breusch-Pagan statistic [@Breusch&Pagan1979], which is annotated on a scatterplot of residuals against predicted *y*. Heteroscedasticity is defined as residual variance that is not constant over the range of predicted *y*. The opposite of heteroscedasticity is homoscedasticity, which is constancy of variance of residuals. Regression residuals are assumed to be homoscedastic, and that is the null hypothesis for the Breusch-Pagan test.\
\
A common sign of heteroscedasticity in streamflow reconstructions is a fanning-out pattern in the scatterplot of residuals against predicted *y*. In streamflow reconstruction, this pattern indicates that error variance increases toward higher predicted flows. The problem can sometimes be corrected by log10 [@Meko&Therrell2001] or square-root transformation [@Meko&Touchan2020] of the flows before using them as the regression predictand. A trade-off of transformation is that the reconstruction is then in transformed units, and the accuracy statistics apply to the reconstruction in the transformed units.
- **Autocorrelation** is tested with the Durbin-Watson (DW) statistic [@Wilks2019]. The null hypothesis is that the population lag-1 autocorrelation of residuals is zero. The *p*-value of the DW statistic is annotated on a plot of the autocorrelation function of residuals.
- **Trend** is tested with the Mann-Kendall trend statistic [@Wilks2019], which addresses monotonic trend, whether linear or not. The null hypothesis is that the residuals have no trend. The Mann-Kendall test statistic and its *p*-value are annotated on a time series plot of residuals. A non-parametric trend line [@Haan2002] is added to the plot to help in the assessment. The significance of the Mann Kendall test is adjusted, if necessary, for any significant autocorrelation over and above that due to trend [@Wilks2019]. An annotated variance inflation factor ("VIF") indicates if any such adjustment was applied. If VIF=1, no adjustment was needed. If VIF\>1, an adjustment was applied, and the size of VIF is related to the amplification of the variance of the test statistic due to autocorrelation.
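The variance adjustment for the Mann-Kendall test can be illustrated for the common case of AR(1) persistence. This is a sketch under that assumption; mannken1's exact correction may differ:

```r
# Variance inflation factor for a test statistic under assumed AR(1)
# persistence with lag-1 autocorrelation r1 (sketch). VIF = 1 when
# r1 = 0 (no adjustment needed); VIF > 1 inflates the variance, which
# is equivalent to shrinking the effective sample size.
vif_ar1 <- function(r1) (1 + r1) / (1 - r1)
n_eff   <- function(n, r1) n / vif_ar1(r1)  # effective sample size
```

For example, with r1 = 0.5 the variance is tripled, so a 100-year residual series carries roughly the information of 33 independent observations.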
################################################################################
```{r}
# A few tests on residuals
F$DurbinWatson <- durbinWatsonTest(M) #--- Durbin-Watson test for lag-1 autocorrelation
hLillie <- lillie.test(M$residuals); # Lilliefors test for normality
#--- Time plot with Mann-Kendall test for trend
# Prepare input for mannken1
X <-cbind(yryc,M$residuals) # matrix with year and regression residuals
kopt<- c(2,1) # want plot; want adjustment of significance of Mann-Kendall statistic
# for autocorrelation if warranted
kplot <-1 # non-TRISH version, simple plot with trend line
ylabTemp1 <-'Residual'
ylabTemp2 <- 'Detrended Residual'
textPlot <- c('Regression Residuals with Nonparametric-Fit Trend Line,','Year',ylabTemp1,
ylabTemp2)
FigNumber=0 # meaningless input to mannken1 outside of TRISH
Din <- list(X=X,kopt=kopt,kplot=kplot,NextFigNumber=FigNumber,textPlot=textPlot,outputDir='Null')
# mannken1 to get statistics and plot
ResMK <- mannken1(Din)
rm(Din)
#---Histogram, with annotated Lilliefors test p-value
Tit1 <- paste('Residuals, E',
'\n (p=',sprintf('%.2g',hLillie$p.value),'for H0 that E normal, from Lilliefors test)')
hist(M$residuals,xlab='E (predictand data units)',ylab='Frequency',main=Tit1)
# acf and its 95% CI
len1 <- floor(length(yryc)/4)
lagsPlot <- min(len1,20)
acfMy <- acf(M$residuals, lag.max=lagsPlot, type = "correlation",
plot = TRUE,xlab='Lag (yr)', ylab = 'ACF')
r1 <- acfMy$acf[2] # lag-1 autocorrelation
r1Str =paste('r(1)=',sprintf('%5.2f',r1),'\nDW statistic=',sprintf('%6.1f',F$DurbinWatson$dw),
'\np=',sprintf('%6.3f',F$DurbinWatson$p))
text(lagsPlot,0.9,r1Str,adj=c(1,1))
# Scatter of residuals against fitted values, with annotated p=value for Breusch-Pagan test.
BP <- ncvTest(M)
if (BP$p>0.05){
strBP <- ' (cannot reject at 0.05)'
} else {
strBP <-' (reject at 0.05)'
}
Tit2 <- paste('Breusch-Pagan test; H0: errors homoskedastic',
'\np=',sprintf('%G',BP$p),strBP)
plot(M$fitted.values,M$residuals,xlab='Fitted Values',
ylab='Residuals',
main=Tit2)
abline(h=0,lty=2,col='#808080') # dash gray
```
################################################################################
#### PART 9. Split-sample analysis to check stability of reconstruction model
The next chunk checks the temporal stability of the model estimated on the full calibration period by split-sample calibration/validation [@Snee1977]. The full calibration period is split in half (earlier half longer if total length is odd). The model is then re-calibrated on each half of the full period and validated on the other half. Validation skill is measured with the reduction-of-error (RE) statistic.
The split-sample exercise applies linear regression, but in ReconStep there is no automatic selection of predictors in calibrating the split-sample models. Those predictors are the same variables selected as predictors for the full-period reconstruction model.
Consider the first half as period A and the second half as period B. The model is first fit to A and validated on B. The model is then fit to B and validated on A. Validation consists of computing the RE statistic on residuals for the validation period. An RE\>0 indicates the model performs better, or has a lower sum of squares of errors, than a null model in which the prediction for each year is the calibration-period observed mean of *y*. In that sense, RE\>0 indicates "some" skill.
Split-sample calibration/validation as described here is not strictly independent validation, because the full period of overlap of *y* with tree-ring data has been used to pick the predictor variables for the reconstruction model.
A figure with time series plots of observed *y* and split-sample predictions of *y* is produced in the console. A horizontal line, called the "null prediction," is plotted to mark the no-skill prediction for each validation period. The reduction of error (RE; @Fritts&Guiot1990) statistic for validation on the early and late halves is annotated on the figure. If the reconstruction for some observation is closer than the null prediction to the observed *y*, that observation contributes positively to RE; otherwise, the contribution is negative. Large departures have an amplified effect on RE because RE is computed from the sum of squares of differences of observed and predicted. RE\>0 is interpreted as a model with "some" skill.
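The RE computation described above can be written compactly. The internals of ssValid are not shown in this document, so the following is an assumed minimal form:

```r
# Reduction of error (sketch): like the NSE, but the null prediction is
# the calibration-period mean of y rather than the validation-period
# mean. RE > 0 means the model beats the null prediction in
# sum-of-squares terms.
re_stat <- function(obs_valid, pred_valid, calib_mean) {
  sse    <- sum((obs_valid - pred_valid)^2)   # model errors
  ssnull <- sum((obs_valid - calib_mean)^2)   # null-prediction errors
  1 - sse / ssnull
}
```

Because both sums are of squared differences, a few large misses can pull RE down sharply even when most years are predicted well.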
################################################################################
```{r}
# Review
# M$Model is data frame with calibration-period predictand in col 1 and predictors in remaining columns
# yrv1 is vector of predictand years for calibration period, with length mv1
# Need to regress col 1 of M$Model on the remaining columns
# Will call Meko's function ssValid to do the split-sample calibration-validation
# First make pointer to the rows of M$Model corresponding to A (early) and B (late)
# Call ssValid twice, switching the row pointers for calibration and validation
# Store results for both halves.
# Plot results in a 1x1 figure with time plots of observed and validation-predicted predictand, with vertical line splitting halves
# Make pointer to halves
nE <- ceiling(mv1/2) # sample size, early half (base R has ceiling, not ceil)
nL = mv1-nE # sample size late
i1 <- 1:nE; i2 <- (nE+1):mv1 # pointer to rows
y1 <- as.matrix(M$model[,1])
X1 <- as.matrix(M$model[,-1])
nX1 <- dim(X1)[2]
iuse <- 1:nX1
resSScalE <- ssValid(y1,X1,i1,i2,iuse) # cal on first, val on 2nd
resSScalL <- ssValid(y1,X1,i2,i1,iuse) # cal on 2nd (late) half, val on 1st
strE <- paste('RE=',sprintf('%5.2f',resSScalL$RE))
strL <- paste('RE=',sprintf('%5.2f',resSScalE$RE))
# Get SS predctions
yra1 <-yrv1[i1]; yra2 <-yrv1[i2]
yrmid <- 0.5*(yra1[length(yra1)] + yra2[1])
a1 <-resSScalL$ssPreds
a2 <-resSScalE$ssPreds
# x and y for null prediction horiz lines
mn1 <- resSScalL$MeanObsCalibPd # Late period obs calib mean
mn2 <- resSScalE$MeanObsCalibPd # Early period obs calib mean
xmn1 = c(yra1[1],yra1[length(yra1)])
xmn2 = c(yra2[1],yra2[length(yra2)])
ymn1 =c(mn1,mn1); ymn2 =c(mn2,mn2)
# build ylims
fscale1 <- 0.01; fscale2 <- 0.6
yhi <- max(y1,a1,a2); ylo <- min(y1,a1,a2)
deltay <- (yhi-ylo)
yhi <- yhi+fscale2*deltay
ylo <- ylo - fscale1*deltay
ylims <-c(ylo,yhi)
xlims <- c(yrgoc-1, yrspc+1)
plot(yrv1,v1,type="o", col="black", pch="o",
xlab='Year',ylab="Predictand", lty=1,xlim=xlims,ylim=ylims)
lines(yra1, a1, col="red",lty=2, pch="*",type="b")
lines(xmn1,ymn1,col="red")
lines(yra2, a2, col="blue",lty=2,pch="*",type="b")
lines(xmn2,ymn2,col="blue")
abline(v=yrmid,col="gray")
text(yra1[1],ylims[2],strE,adj=c(0,1))
text(yra2[1],ylims[2],strL,adj=c(0,1))
title('Split-sample prediction skill')
legend("topright", legend=c("Observed","Recon-early","Null-early","Recon-Late","Null-late"),
col=c("black","red", "red","blue","blue"),
lty=c(1,2,1,2,1), cex=0.8)
whatsUp <- 'Results for split-sample (SS) analysis, calibrated on early (...calE) and late (...calL)'
EarlyHalf <- paste('Early = ',as.integer(yra1[1]),'-',as.integer(yra1[length(yra1)]))
LateHalf <- paste('Late = ',as.integer(yra2[1]),'-',as.integer(yra2[length(yra2)]))
resSS <- list(what=whatsUp,EarlyHalf=EarlyHalf,LateHalf=LateHalf,
resSScalE=resSScalE,resSScalL=resSScalL)
rm(whatsUp)
```
################################################################################
#### PART 10. Confidence interval for reconstruction
The next chunk plots the annual reconstruction of *y* with its 50% confidence interval (CI). The CI is computed as $\hat{y} \pm 0.6745\ \rm{RMSE_{cv}}$, where RMSE~cv~ is the root-mean-square error from cross-validation and the errors are assumed to be normally distributed.
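The 0.6745 multiplier is simply the 75th-percentile point of the standard normal distribution, which bounds the central 50% of probability. A short sketch with made-up values (yhDemo and RMSEcvDemo are hypothetical, not variables from this script):

```r
# Half-width of a 50% CI for normally distributed errors: the z-value
# bounding the central 50% of a standard normal, times the RMSE
z50 <- qnorm(0.75)                 # about 0.6745
yhDemo     <- c(10.2, 9.8, 11.1)   # hypothetical reconstructed values
RMSEcvDemo <- 1.5                  # hypothetical cross-validation RMSE
ciLo <- yhDemo - z50 * RMSEcvDemo  # lower bound of 50% CI
ciHi <- yhDemo + z50 * RMSEcvDemo  # upper bound of 50% CI
```

A wider interval (say 90%) would use qnorm(0.95) in place of qnorm(0.75).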
###############################################################################
```{r}
#--- Compute calibration root-mean-square error (RMSEc), just to be able to compare
# that to the cross-validation RMSE. Then build a matrix with the
# reconstructed value, lower 50% CI and upper 50% CI, and plot the series
RMSEc <- sqrt(sum(M$residuals^2)/(M$df.residual)) # root-mean-square error of calib
RMSE <- F$RMSEcv # but use the cross-validation RMSE for the confidence interval on reconstruction
yhlo<-yh-0.6745*RMSE # lower 50% CI bound (0.6745 = qnorm(0.75))
yhhi<-yh+0.6745*RMSE # upper 50% CI bound
YhCI<-cbind(yh,yhlo,yhhi)
namestemp<-c('Recon','Lower 50%','Upper 50%')
YhCI.ts<- ts(YhCI,start=yryh[1],frequency=1,names=namestemp)
# time series for plot and CI
w <- yh; yrw <- yryh # reconstruction
# Compute shaded polygon x and y
Xtemp <- cbind(yryh,yhlo,yhhi) # matrix with year as col 1, lower CI as col 2, upper CI as col 3
ResTemp <- xyCI(Xtemp)
xP <- ResTemp$x; yP <- ResTemp$y
# Limits for plot
yLo <- min(Xtemp[,2:3])
yHi <- max(Xtemp[,2:3])
ynudge <- 0.02 * (yHi-yLo)
ylims = c(yLo-ynudge, yHi+ynudge)
xlims = c(yryh[1]-1,(max(yryh)+1))
# Strings for plot
strRecYrs <- paste(sprintf('%d',yryh[1]),'-',sprintf('%d',yryh[length(yryh)]),sep='')
Tit1 <- paste('Reconstruction',', ',strRecYrs,
'\n(50% CI shaded; dashed line at reconstructed mean)',sep='')
ylab1 <- 'Predictand'
# Plot
plot(yryh,yh,type="l",col="blue",xlim=xlims,ylim=ylims,
ylab=ylab1,xlab='Year',main=Tit1)
abline(h=mean(yh),lty=2,col='#808080') # dash gray
polygon(xP,yP, col=rgb(1.00,0,0,0.1),border=NA) # translucent red
#--- STATUS
# yh [matrix] is reconstruction for years yryh [numeric]
# M [lm object] returned model from lm function
# F$Rsquared, F$RsquaredAdj, etc, model statistics stored in list F. The list
# items actually generated from lm can also be extracted from the model
# object M. Other data and statistics can also be pulled from that model
# object. For example summary(M), fitted(M), ...
# colnames(W2)[F$ColsInModel] ... this returns "ids" of variables in the model.
# "Xn" refers to column n of chronology in input data frame; Ni or Pi
# refer to lag i, negative (N) or positive (P) relative to year of predictand
```
###############################################################################
#### PART 11. Save reconstruction and statistics
The next chunk saves time series and statistics output in the file "RecOutput.Rdata." The file includes the following:
1) YhCI.ts: time series of reconstructed *y* with its 50% CI