-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdiagnostics_e.sas
308 lines (199 loc) · 5.57 KB
/
diagnostics_e.sas
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
data toluca;
input obs LotSize WorkHrs;
cards;
1 80 399
2 30 121
3 50 221
4 90 376
5 70 361
6 60 224
7 120 546
8 80 352
9 100 353
10 50 157
11 40 160
12 70 252
13 90 389
14 20 113
15 110 435
16 100 420
17 30 212
18 50 268
19 90 377
20 110 421
21 30 273
22 90 468
23 40 244
24 80 342
25 70 323
;
run;
/* To start with, we cam make a scatter plot to verify that we have a linear relationship
but this is not the best method
symbol1 v=circle c = red;
PROC GPLOT DATA=toluca;
PLOT WorkHrs*LotSize;
RUN;
*/
PROC SGPLOT DATA=toluca;
scatter x=LotSize y=WorkHrs;
RUN;
/* To perform the diagnostics, we need to run the model and obtain our residuals and predicted values.
Again, we do this by outputting a new file with the "output out=filename" command and including
"p=name of predicted values" and "r=name of residuals".
*/
proc reg data=toluca;
model WorkHrs = LotSize;
OUTPUT OUT=results p=predicted_WorkHrs r=residuals ;
run;
quit;
proc print data=results; run;
/*
Linearity:
To check assumption of linearity, we plot the predictor variable against the residuals.
If we have linearity, we should get a random scatter about 0
proc gplot data=results; plot residuals*LotSize ; run; */
PROC SGPLOT DATA=results;
scatter x=LotSize y=residuals;
RUN;
/* We can also plot the fitted values against the residuals too check the assumption of linearity */
/* proc gplot data=results; plot residuals*predicted_WorkHrs ; run; */
PROC SGPLOT Data=results;
scatter x=predicted_WorkHrs y=residuals;
RUN;
/* an example of a curve relationship between X and Y*/
data curve; input x y;
cards;
2 5
1 2
4 15
3 10
5 24
5 26
10 101
8 62
;
run;
proc reg data=curve;
model y=x;OUTPUT OUT=residscurve p=predicted_curve r=residualscurve ;
run;
* proc gplot data=residscurve; plot residualscurve*x ; run;
PROC SGPLOT Data=residscurve;
scatter x=x y=residualscurve;
RUN;
quit;
/*
Constant Variance:
We use the same residual plots to check the assumption of constant variance. If there is constant variance
then again the residuals should form a random scatter around 0. If the variance increases or decreases as X gets
larger, the residuals will form a funnel shape. Our original data looked fine. We wil create a
new data set that does not have constant variance for an example.*/
data nocon; input y x;
cards;
1.08 1
2.56 2
3.04 3
2.62 4
5.90 5
2.28 6
1.06 7
9.74 8
14.22 9
-3.20 10
;
run;
proc reg data=nocon;
model y=x;OUTPUT OUT=residsnocon p=predicted_nocon r=residualsnocon ;
run;
/* proc gplot data=residsnocon; plot residualsnocon*x ; run; */
PROC SGPLOT Data=residsnocon;
scatter x=x y=residualsnocon;
RUN;
quit;
/* Checking for outliers.
We can use standardized residuals to help determine if there are any outliers.
A rule of thumb is that standardized residuals with an absolute value of four or more correspond with outliers.
To get standardized residuals, we include "Student=name of standardized residuals".
*/
proc reg data=toluca;
model WorkHrs = LotSize ;
OUTPUT OUT=results_stan p=predicted_WorkHrs r=residuals Student=standardized_residuals;
run;
quit;
proc print data=results_stan; run;
proc means data=results_stan;
var standardized_residuals;
run;
/* Residual plot using standardized residuals*/
proc sgplot data=results_stan;
scatter x=LotSize y=standardized_residuals ;
run;
/* None of the absolute values of the standardized residuals are greater than 4 so we conclude there
are no outliers. */
/*Checking for independence of error terms.
To visually inspect for independence, we will create a sequence plot of the residuals over time.
To do so, we need to include a variable in our dataset that indexes the order the values were collected in.*/
data toluca_seq;
input obs LotSize WorkHrs;
cards;
1 80 399
2 30 121
3 50 221
4 90 376
5 70 361
6 60 224
7 120 546
8 80 352
9 100 353
10 50 157
11 40 160
12 70 252
13 90 389
14 20 113
15 110 435
16 100 420
17 30 212
18 50 268
19 90 377
20 110 421
21 30 273
22 90 468
23 40 244
24 80 342
25 70 323
;
run;
proc reg data=toluca_seq;
model WorkHrs = LotSize;
OUTPUT OUT=results_seq r=residuals_seq ;
run;
quit;
proc sgplot data=results_seq;
scatter y=residuals_seq x=obs;
run;
/* It is clear there is no time (sequence) effect here*/
/*
Noramility:
Graphical check of normality
To verify the assumption of normality, we can use proc univariate. Note, it is the residuals
that are the variable we are interested in. We want to plot and
test if the residuals are normal.
The command "plot" produces stem and leaf plots, boxplots, and normal probability plots of the residuals.
A better graphical tool uses the qqplot command. The qqplot command tells SAS to produce the q-q plots.
Specifying MU=EST and SIGMA=EST with the NORMAL primary option
requests the reference line for which the mean and standard deviation are estimated
by the sample mean and standard deviation.
Again, we are doing this on the residuals (here named residuals_seq).
(There is also a lot of other output which we are not interested in now)
*/
Proc univariate data=results plot;
qqplot residuals/NORMAL(MU=EST SIGMA=EST COLOR=RED L=1); run;
/*
Formal test of normality
To do a formal test of normality, we use the "normal" option within proc univariate, which
produces formal tests of normality.
And note, the variable we are running these tests on is the residuals.
*/
proc univariate data=results normal;
var residuals;
run;