---
title: "**Prediction using Multiple Linear Regression**"
subtitle: "Video Transcoding Time"
author: Aisling Kinsella
date: "`r Sys.Date()`"
output:
  pdf_document:
    latex_engine: xelatex
    toc: true
    toc_depth: 2
    number_sections: true
    df_print: kable
    fig_width: 7
    fig_height: 4
    fig_caption: true
    extra_dependencies: rotating
geometry: margin=1in
fontsize: 12pt
header-includes:
  # - \usepackage[utf8]{inputenc}
  - \usepackage{booktabs}
  - \usepackage{float}
  - \usepackage{dcolumn}
  - \usepackage{fontspec}
  - \setmainfont{Avenir Light} # GFS Didot
  - \setsansfont{Raleway}
---
\newpage
``` {r setup, include = FALSE}
knitr::opts_chunk$set(echo = FALSE, fig.showtext = TRUE)
# Load Libraries
library(knitr)
library(rio) # multiple format file import
library(dplyr)
library(tidyverse)
library(kableExtra)
library(xtable)
library(stargazer)
library(pander)
library(viridis)
library(RColorBrewer)
library(ggsci)
library(ggplot2)
library(gridExtra)
library(grid)
library(gtable)
library(showtext)
library(GGally)
library(reshape2)
library(ggcorrplot)
library(psych)
library(caret)
library(texreg)
library(jtools)
# Load fonts
font_add(family = "Proxima Nova Light", regular = "/Users/aislingkinsella/Library/Fonts/Proxima Nova Light.otf")
font_add(family = "Proxima Nova Thin", regular = "/Users/aislingkinsella/Library/Fonts/Proxima Nova Thin.otf")
font_add(family = "Proxima Nova Regular", regular = "/Users/aislingkinsella/Library/Fonts/Proxima Nova Regular.otf")
install_formats()
options(xtable.comment = FALSE)
# to switch between latex font sizes from chunk to markdown text
def.chunk.hook <- knitr::knit_hooks$get("chunk")
knitr::knit_hooks$set(chunk = function(x, options) {
x <- def.chunk.hook(x, options)
ifelse(options$size != "normalsize", paste0("\n \\", options$size,"\n\n", x, "\n\n \\normalsize"), x)
})
```
# Introduction
This work aims to predict video transcode time using multiple variables as inputs to a prediction model. Multiple linear regression, an extension of simple linear regression, is a useful model here because there are several independent variables to consider when predicting the dependent variable, video transcode time. The goal is to establish a function of selected independent variables that best explains the dependent variable. Multiple linear regression has the added advantage of revealing potential relationships among the independent variables themselves. This helps mitigate the over-fitting that can occur when many variables are used, while removing the imprecision that redundancy introduces.
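Formally, the multiple linear regression model expresses the dependent variable as a linear combination of the independent variables plus an error term:

$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i
$$

where $y_i$ is the transcode time of the $i$-th observation, $x_{i1}, \dots, x_{ip}$ are the independent variables, $\beta_0, \dots, \beta_p$ are the coefficients estimated by least squares, and $\varepsilon_i$ is the error term.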
```{r importData, echo=FALSE}
# Load Datasets
transcodeData <- rio::import("Data/transcoding_mesurment.tsv", format = "tsv")
youtubeData <- rio::import("Data/youtube_videos.tsv", format = "tsv")
# str(transcodeData)
dfDim <- dim(transcodeData)
```
## Dataset
The dataset was provided by Tewodros Deneke, hosted on the UCI machine learning repository.
It consists of $`r dfDim[1]`$ records and $`r dfDim[2]`$ variables.
\footnotesize
```{r viewHead, echo = FALSE}
# View DF Heads
options(knitr.kable.NA = "**")
pander(head(transcodeData), caption= "Video Transcode Specifications\n")
```
\newpage
\normalsize
```{r uniqueCounts, echo = FALSE}
# Get unique counts for id to confirm only one video. For codecs to count the number of
# transcodeData %>% group_by("id") %>% summarize(count=n())
# transcodeData %>% group_by("o_codec") %>% summarize(count=n())
noVideos <- length(unique(transcodeData[,"id"]))
noDurations <- length(unique(transcodeData[,"duration"]))
# **** Discrepancy between no of unique vid id's and unique vid durations
```
The table presents data for $`r noVideos`$ uniquely identified videos. Each video is presented in a number of different frame sizes, frame rates, bitrates and codecs. The time taken to transcode the original video to each variant of the format is stored in the *utime* column.
This is the variable that will be predicted: the dependent variable.
```{r uniqueCodecs, echo=FALSE, results='asis'}
# Find number of unique codecs
inputCodecs <- unique(transcodeData[,"codec"])
outputCodecs <- unique(transcodeData[,"o_codec"])
# Find number of unique width/height
widthData <- unique(transcodeData[, "width"])
heightData <- unique(transcodeData[,"height"])
# join width/height data into matrix
join <- c(widthData, heightData)
widthHeight <- matrix(join, nrow = 6, ncol = 2)
# label columns
colnames(widthHeight) <- c("Width", "Height")
# Show in Table
kbl(widthHeight, format ="latex", caption = "Video Dimensions/Frame Sizes", booktabs = T) %>%
kable_styling(latex_options = c("striped", "HOLD_position"))
# output below bypassed, goes straight to LaTeX
```
The dataset is suitably prepared for multiple regression in so far as there is a one-unit change in a single variable per row, while all other variables are held constant.
__Example:__
The first video, of duration 130.4 and input codec mpeg4, has been transcoded to 6 different frame sizes (*Table 4*), each with an amended bitrate of 56000. The initial file is then output to the same 6 frame sizes with an amended frame rate of 15 fps. The data continues in this manner until all possible output variations have been achieved for the specified output codec, mpeg4. The cycle then begins again with the next output codec, of which 4 are trialled: `r outputCodecs`.
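As an illustration, the rows generated from a single source video can be inspected with a filter such as the one below (a minimal sketch, not evaluated here; the column names are those shown in the table above):

```{r exampleRows, echo = TRUE, eval = FALSE}
# Inspect the output variations generated from a single source video
transcodeData %>%
  filter(id == first(id), o_codec == "mpeg4") %>%
  select(id, duration, codec, o_codec, o_bitrate, o_framerate, o_width, o_height, utime) %>%
  head(12)
```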
***
\newpage
# Exploratory Data Analysis
```{r nas, echo = FALSE}
# sum(is.na(transcodeData))
# No missing values in dataset
```
## Summary Statistics
``` {r sumStatsUTime, size = 'small', results = "asis"}
pander(summary(transcodeData$utime), caption = "Summary Statistics: Transcode Time")
```
There is high dispersion in the *utime* (transcode time) data. The data may not be normally distributed, as the mean is higher than the median. The median is an appropriate choice of central value here because, unlike the mean, it is not sensitive to the shape of the distribution.
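This can be confirmed numerically with a quick comparison of centre and spread (a minimal sketch using the already-loaded dplyr verbs, not evaluated here):

```{r utimeSpread, echo = TRUE, eval = FALSE}
# Compare the centre and spread of the transcode time
transcodeData %>%
  summarise(mean = mean(utime), median = median(utime), sd = sd(utime), iqr = IQR(utime))
```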
## Distribution
The dependent variable, *utime* or video transcode time, is plotted in order to clarify its distribution. The data is clearly right skewed, with clustering below the mean. Skewed data is not ideal, as linear regression assumes normally distributed errors. In practice, however, this is common and may be tempered at a later stage.
\hfill\break
```{r plotDist, echo = FALSE, warning = FALSE}
pTheme <- theme(plot.title = element_text(hjust = 0.5, family = "Proxima Nova Regular", size = (12), colour = "#3b518b"),
axis.title = element_text(family = "Proxima Nova Regular", size = (8), colour = "#3b518b"),
axis.text = element_text(family = "Proxima Nova Thin", colour = "#3b518b", size = (8)))
# Plot the data
p <- ggplot(transcodeData, aes(x = utime, fill=..x..)) + geom_histogram(alpha = 0.8, binwidth=5, color="#FFFFFF") + scale_fill_viridis_c(option = 'E', direction = -1) +
scale_color_viridis() + labs(title = "Transcode Time Distribution",y = "Count", x = "Transcode Time") + pTheme + xlim(0, 100) + theme(legend.position = "bottom")
# Add mean line
p2 <- p + geom_vline(aes(xintercept=mean(utime)),
color = "#3b518b", linetype = "dashed", size = 1)
p2
```
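Should the skew prove problematic for the regression assumptions, a log transformation of the dependent variable is one common remedy. A sketch of that check (not applied to the models that follow):

```{r logUtime, echo = TRUE, eval = FALSE}
# Histogram of the log-transformed transcode time; log1p() guards against zero values
ggplot(transcodeData, aes(x = log1p(utime))) +
  geom_histogram(bins = 40, fill = "#3b518b", colour = "#FFFFFF", alpha = 0.8) +
  labs(title = "Log-Transformed Transcode Time", x = "log(1 + utime)", y = "Count") +
  pTheme
```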
\newpage
## Factor Coding
Regression models require every feature to be numeric. The categorical variable *codec* has four levels: the input codec types. This variable has been coded for inclusion in the model and for pairwise comparisons. Additional columns have been appended to the data frame containing a binary coding for each individual codec (see also the alternative sketched below).
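For comparison, R can generate equivalent indicator columns automatically when the character column is converted to a factor and passed to `lm()`; one level is then dropped as the reference category, which also avoids the dummy variable trap. A minimal sketch of that alternative (not evaluated here):

```{r factorAlternative, echo = TRUE, eval = FALSE}
# Let lm() build the dummy variables from factor columns automatically
autoModel <- lm(utime ~ factor(codec) + factor(o_codec) + bitrate, data = transcodeData)
summary(autoModel)$coefficients
```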
\hfill\break
``` {r codecDist, results = "hide"}
# Distribution of input and output codec data/ factor-type data
a <- table(transcodeData$codec)
b <- table(transcodeData$o_codec)
codecDist <- print(xtable(t(a),
align=rep("r",5)),
table.placement="H")
```
```{r cat2Bin, results = 'hide'}
# Data Preparation - Dummy Coding Missing Values
# # Comparing constructed dummy variables to the original codec variable:
# table(transcodeData$codec, useNA = "ifany")
# table(transcodeData$mp4, useNA = "ifany")
# table(transcodeData$h264, useNA = "ifany")
# table(transcodeData$flv, useNA = "ifany")
# The number of transcodeData$mp4 and transcodeData$h264 etc. matches the number of mpeg4,h264 and flv values in the initial coding so dummy coding can be trusted
# Codify codec colum and append new cols to dataframe
transcodeData <- transcodeData %>% mutate(mpeg4 = if_else(codec=="mpeg4", 1, 0)) %>% mutate(h264 = if_else(codec=="h264", 1, 0)) %>% mutate(flv = if_else(codec=="flv", 1, 0)) %>% mutate(vp8 = if_else(codec=="vp8", 1, 0))
round(transcodeData$mpeg4)
round(transcodeData$h264)
round(transcodeData$flv)
round(transcodeData$vp8)
newCols <- head(transcodeData[1,23:26], drop = FALSE)
newCols
# Print new cols added
colsAdded <- print(xtable(newCols,
align=rep("r",5), digits=c(0,0,0,0,0)), # vector of values to pass to decimal places required per column
table.placement="H")
# Codify o_codec colum and append new cols to dataframe
transcodeData <- transcodeData %>% mutate(o_mpeg4 = if_else(o_codec=="mpeg4", 1, 0)) %>% mutate(o_h264 = if_else(o_codec=="h264", 1, 0)) %>% mutate(o_flv = if_else(o_codec=="flv", 1, 0)) %>% mutate(o_vp8 = if_else(o_codec=="vp8", 1, 0))
round(transcodeData$o_mpeg4)
round(transcodeData$o_h264)
round(transcodeData$o_flv)
round(transcodeData$o_vp8)
newCols <- head(transcodeData[1,23:30], drop = FALSE)
newCols
# Print new cols added
colsAdded <- print(xtable(newCols,
align=rep("r",9), digits=c(0,0,0,0,0,0,0,0,0)), # vector of values to pass to decimal places required per column
table.placement="H")
# table printed in next code chunk
```
```{r latAlignTables, results = 'asis'}
# align tables
cat(c("\\begin{table}[!htb]
\\begin{minipage}{.5\\linewidth}
\\centering
\\caption{Distribution: Input Codecs}",
codecDist,
"\\end{minipage}
\\begin{minipage}{.5\\linewidth}
\\centering
\\caption{Appended Columns}",
colsAdded,
"\\end{minipage}
\\end{table}"
))
```
\hfill\break
```{r codecPlot, fig.height = 6 , fig.width = 7 , echo = FALSE}
codecPlot <- ggplot(transcodeData, aes(codec, ..count..)) + geom_bar(alpha = 0.8, aes(fill = o_codec), color = "#FFFFFF", position = "dodge") + labs(title="Output Codec Distribution", y = "Count", x = "Codec")+ scale_fill_viridis_d(begin = 0.3, end = 1, option = 'E', direction = -1)+ theme(legend.position = "bottom") + pTheme
codecPlot
```
\newpage
## Correlation Analysis
Correlation indicates the extent to which two or more variables move together. Before fitting any regression model, the relationships between the dependent variable and each independent variable must be checked for correlation. The Pearson correlation coefficient is calculated for each pair of features, giving a value between -1 and +1 that reveals the strength of the linear relationship between the pair.
A positive linear relationship indicates that a feature can explain some of the variance in the dependent variable and could therefore be useful in the prediction model. A coefficient of zero simply indicates the absence of a *linear* relationship between two variables; they may still be associated and not independent of each other.
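For reference, the Pearson correlation coefficient between two variables $x$ and $y$ is

$$
r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
$$

and is what `cor()` computes for every pair of numeric columns in the chunk below.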
```{r corMatrix, size = 'footnotesize', results = 'asis'}
corMatrix <- correlation.matrix <- cor(transcodeData[,c("duration", "width", "height", "bitrate", "framerate", "size", "o_bitrate", "o_framerate","o_width", "o_height", "mpeg4", "h264", "flv", "vp8")])
pander(corMatrix, caption= "Correlation Matrix\n")
# stargazer(correlation.matrix, title="Correlation Matrix", type = 'latex', header = FALSE, font.size = "tiny", float.env = "sidewaystable") - Landscape matrix
```
The table above shows the correlation matrix for the numeric variables in the transcode dataset. The features *i, p, b, frames, i_size, p_size* and *b_size* are not used in the model, as these are codec-specific compression values.
The numbers along the diagonal of the matrix are always one, as a variable is perfectly correlated with itself. The correlation matrix is plotted below for clarity, with duplicates removed.
\newpage
```{r corMatPlot, fig.height = 6 , fig.width = 9 , echo = FALSE, message = FALSE}
# plot the cor matrix
gp <- ggcorrplot(corMatrix, hc.order = TRUE, type = "lower", lab = TRUE, lab_col= "white", insig = "blank") + scale_fill_viridis_c(alpha = 0.8, begin = 0, end = 1, option = 'E', direction = -1) +
scale_color_viridis() # method = "circle",
gp2 <- gp + pTheme + labs(title = "Correlation Matrix")
gp2
```
\hfill\break
Reading the correlation matrix, it becomes apparent that there are potential multicollinearity problems. Multicollinearity occurs when two or more independent variables are highly linearly correlated, which makes the model results difficult to interpret. Multiple linear regression assumes no multicollinearity, so this must be investigated further in order to establish a reliable prediction model.
__The following results can be surmised:__
* *height* and *width* appear highly positively correlated with each other
* *bitrate* and *height* appear highly positively correlated with each other
* *bitrate* and *width* appear highly positively correlated with each other
* *size* shows positive correlation to *bitrate*, *width* and *height*
\newpage
Generally, a coefficient above 0.7 between two variables indicates the presence of multicollinearity. The correlation coefficients between *size* and each of *bitrate*, *width* and *height* fall below this cut-off, at 0.62, 0.58 and 0.55 respectively. The relationships among *width*, *height* and *bitrate*, however, exceed the threshold.
When building a prediction model, some independent variables predict the dependent variable better than others. With multicollinearity present, detecting which variables are the better predictors becomes more difficult.
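A standard numeric check for multicollinearity is the Variance Inflation Factor (VIF), where values above roughly 5 to 10 flag problematic predictors. A minimal sketch, assuming the `car` package is available (it is not loaded above):

```{r vifCheck, echo = TRUE, eval = FALSE}
# Variance Inflation Factors for a candidate set of predictors
vifModel <- lm(utime ~ duration + width + height + bitrate + framerate + size,
               data = transcodeData)
car::vif(vifModel)
```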
***
\newpage
## Scatterplot Matrix Analysis
The *o_width* and *o_height* variables show a strong positive correlation of 0.99, so they can be removed from further scatterplot matrix analysis.
The *o_framerate* and *o_bitrate* variables have also been removed from the correlation analysis, as they showed values of 0 in the correlation matrix above. This rules out a linear association, though some other form of relationship could still exist.
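A rank-based correlation can probe for a monotonic but non-linear association; a minimal sketch (not evaluated here):

```{r spearmanCheck, echo = TRUE, eval = FALSE}
# Spearman rank correlation picks up monotonic, non-linear associations
cor(transcodeData$utime, transcodeData$o_bitrate, method = "spearman")
cor(transcodeData$utime, transcodeData$o_framerate, method = "spearman")
```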
\hfill\break
```{r SPLOM, fig.height = 6 , fig.width = 9, echo = FALSE}
pairs.panels(transcodeData[c("duration", "width", "height", "bitrate", "framerate", "size", "mpeg4", "h264", "flv", "vp8")])
```
\hfill\break
In the scatterplot matrix above, the upper-right portion has been replaced with the correlation coefficient for each pair of variables. Along the diagonal is a histogram showing the distribution of each variable. Importantly, there is now additional visual information to aid interpretation of these results: the correlation ellipse and the loess smooth.
Each oval depicts the correlation ellipse. The more stretched the ellipse, the stronger the correlation; the more circular it appears, the weaker the correlation (e.g. *framerate*:*flv* indicates a weak correlation).
Stretched correlation ellipses, indicating higher correlations, appear for:

* *framerate*:*width* (0.40), *framerate*:*height* (0.46)
* *bitrate*:*width* (0.82), *bitrate*:*height* (0.80)
The curved line across each scatterplot is the loess smooth, which helps to show the relationship between the x and y variables. Non-linear correlations become visible where before they were obscured.
The relationship between *size* and *duration* indicates a strong positive linear correlation up until a certain duration, at which point it becomes a negative linear relationship, indicating that other factors come into play at that stage.
A closer look at the scatterplot matrix enables a clearer reading.
\hfill\break
```{r SPLOM2, fig.height = 6 , fig.width = 9, echo = FALSE}
pairs.panels(transcodeData[c("duration", "width", "height", "bitrate", "framerate", "size")])
```
While it appears there are only weak negative correlations between the individual codec indicators and each of the other selected independent variables, a closer inspection will clarify this.
\hfill\break
```{r SPLOM3, fig.height = 6 , fig.width = 9, echo = FALSE}
pairs.panels(transcodeData[c("duration","height", "framerate", "size", "mpeg4", "h264", "flv", "vp8")])
```
\newpage
# Building the Model
## Partitioning the data
```{r partition, echo = FALSE}
# split the data into training and test set
set.seed(123) # setting the partitions
training.samples <- transcodeData$utime %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- transcodeData[training.samples,]
test.data <- transcodeData[-training.samples,]
```
The data is partitioned into two subsets: training and test. Eighty per cent of the data is used for training the model; the remaining twenty per cent is held back to test the strength of the model.
## Training the Model
The `lm` function is used to fit the linear model; the fitted object contains the estimated coefficients and can be used to generate predicted values.
```{r train, echo = FALSE}
# training the model on the 80% training partition
transcodeModel <- lm(utime ~ duration + width + bitrate + framerate + size, data = train.data)
```
The model is trained on *duration*, *width*, *bitrate*, *framerate* and *size* in accordance with the results from the correlation matrix.
*Width* and *height* are strongly positively correlated, so one of the two will suffice. The one with the higher correlation to both *size* and *bitrate* is *width*, which follows logic, as width carries more picture information than height (the frame sizes are all wider than they are tall; 16:9 and 4:3 aspect ratios). *Size* and *bitrate* were correlated at 0.62, a positive linear association that does not reach the 0.7 multicollinearity threshold, so both features are included to begin training.
## Evaluating Model Performance
The estimated beta coefficients reveal the impact on the dependent variable of a rise of one unit in each independent variable, while all other variables remain unchanged. The intercept shows the value of *utime* when the variables are equal to 0.
```{r betaCo, echo = FALSE, message = FALSE}
# check estimated beta coefficients
#summary(transcodeModel)
```
To see how well the model performs, a summary output appears below. *Duration*, *bitrate* and *framerate* appear to be the most statistically significant of this set of independent variables. The R-Squared and Adjusted R-Squared values are measures of explanatory power; they do not by themselves show whether the model fits well. Adding more variables would increase the R-Squared value, but the extra variables may not be statistically significant. These R-Squared values are low, however, and will not yield accurate predictions from this model. Alternative variables should be explored.
```{r modelSum, echo = FALSE, message = FALSE, results= 'asis'}
# summary of model fitting
summ(transcodeModel)
```
\begin{center}
Summary of Model Performance
\end{center}
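Since twenty per cent of the data was held back, the more honest check of accuracy is against the test partition rather than the training R-Squared. A minimal sketch using `caret::postResample` (caret is loaded above; not evaluated here):

```{r testEval, echo = TRUE, eval = FALSE}
# Evaluate the first model on the held-out test partition
testPreds <- predict(transcodeModel, newdata = test.data)
caret::postResample(pred = testPreds, obs = test.data$utime)  # RMSE, Rsquared, MAE
```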
\newpage
\hfill\break
# Improving Model Performance
Overfitting can occur when too many independent variables have been added to the prediction model: the model becomes so 'tuned' to the training data that it loses predictive power on new data. Underfitting is the opposite concern, and in this case the initial model is performing poorly even on the training data. Regression modelling leaves feature selection to the user, and with this data it is apparent that the output settings for each codec would further help explain the dependent variable, *utime*.
## An Improved Regression Model
Model 2 has been fitted using the following independent variables:
```{r featureList, echo = FALSE}
featureList <- list("bitrate", "flv", "h264", "mpeg4", "vp8", "o_framerate", "o_bitrate", "o_width")
kable(featureList, col.names = NULL, format = 'latex', booktabs = T) %>% kable_styling(latex_options = "HOLD_position")
```
```{r reTrain, echo = FALSE}
# training the model
# transcodeModel2 <- lm(utime ~ bitrate + framerate, data = transcodeData)
# summ(transcodeModel2, caption = "Model 2 Performance Summary")
# training the model
# transcodeModel3 <- lm(utime ~ bitrate + framerate + mpeg4 + h264 + flv + vp8, data = transcodeData)
# summ(transcodeModel3)
# training the model - model 2 in text
transcodeModel4 <- lm(utime ~ bitrate + mpeg4 + h264 + flv + vp8 + o_framerate + o_bitrate + o_width, data = train.data)
summ(transcodeModel4)
```
\begin{center}
Summary of Model Performance: Model 2
\end{center}
\newpage
The Adjusted R-Squared value has improved. Employing the use of the output codec type could further increase this value.
\hfill\break
The third model has been predicted using the following independent variables;
```{r featureList2, results = 'asis'}
# names(transcodeModel5$coefficients)
featureList <- list("bitrate", "mpeg4", "h264", "flv", "vp8", "o_framerate", "o_bitrate", "o_width", "o_mpeg4", "o_h264", "o_flv", "o_vp8")
kable(featureList, col.names = NULL, format = 'latex', booktabs = T) %>% kable_styling(latex_options = "HOLD_position")
```
```{r impModel, echo = FALSE}
# training the model - model 3 in text, including the output codec dummies
transcodeModel5 <- lm(utime ~ bitrate + mpeg4 + h264 + flv + vp8 + o_framerate + o_bitrate + o_width + o_mpeg4 + o_h264 + o_flv + o_vp8, data = train.data)
summ(transcodeModel5)
```
Including the output codecs returns a reduced Adjusted R-Squared value. The correlation matrix revealed more correlations for h264 and mpeg4, so these could be selected as alternatives.
\newpage
## Predicting
The second model, returning an Adjusted R-Squared of 0.57, is chosen to predict on the test data. The results appear below.
```{r predict, echo = FALSE}
# make a prediction - bitrate + mpeg4 + h264 + flv + vp8 + o_framerate + o_bitrate + o_width
newdata = data.frame(bitrate = 56000, mpeg4 = 0, h264 = 1, flv = 0, vp8 = 0, o_framerate = 25, o_bitrate= 88000, o_width = 1080)
predict(transcodeModel4, newdata)
# what is the coefficient of determination?
coeff <- summary(transcodeModel4)$r.squared
# what is the confidence interval for the mean response at these inputs?
ConfInt <- predict(transcodeModel4, newdata, interval="confidence")
# prediction interval for a single new observation with these inputs
ConfIntExact <- predict(transcodeModel4, newdata, interval="prediction")
```
The coefficient of determination for this model is $`r coeff`$. The prediction interval for the prediction with the input values used is $`r ConfIntExact`$. However, multicollinearity causes problems for this prediction: the fit is rank deficient, because including all four codec dummy variables alongside the intercept creates perfect collinearity (the dummy variable trap), so the predicted value of 23.26 may not be reliable. The model needs to be revised to remove all avoidable multicollinearities, for example by dropping one dummy per categorical variable as a reference level.
# Conclusion
While correlation does not imply causation, strong correlation among predictors can lead to problems of over-fitting and bias in the resulting model.
The model suffers from high correlations between independent variables, which complicates selecting the variables needed to increase the Adjusted R-Squared, the measure of how much of the dependent variable is explained. With such a large number of features to analyse and select from, the task is challenging. Carrying minimum/maximum thresholds from the correlation matrix into the scatterplot matrix would enable inspection of all significant parameters and aid the reduction of multicollinearity. Achieving optimum model precision takes time and there are several possibilities to explore further. Reducing multicollinearity is the next step. Plotting the predicted and residual values would aid further investigation of the prediction model and identify weaknesses; a sketch of such a plot appears below. Robust standard errors could also be applied to improve accuracy. This would be interesting to explore further.
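As a pointer for that next step, the fitted-versus-residual check could take a form such as the following sketch, based on the chosen second model (not evaluated here):

```{r residualSketch, echo = TRUE, eval = FALSE}
# Residuals versus fitted values for the chosen model
ggplot(data.frame(fitted = fitted(transcodeModel4), residual = residuals(transcodeModel4)),
       aes(x = fitted, y = residual)) +
  geom_point(alpha = 0.3, colour = "#3b518b") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs Fitted Values", x = "Fitted", y = "Residual") +
  pTheme
```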