<!DOCTYPE HTML>
<html>
<head>
<title>Steve Ellis</title>
<link rel="stylesheet" type="text/css" href="self.css">
<link rel="icon" type="image/png" href="favicon.png">
</head>
<body>
<a href="index.html"><img id="logo" src="ideographic_description_character_overlaid.svg" alt="logo"></a>
<div id="wrapdiv">
<div id="links"><h2>
<a href="index.html">Profile</a>
<a id="sel" href="research.html">Research</a>
<a href="visualization.html">Visualization</a>
</h2></div>
<br>
<div id="story">
<h1>KnowYourNeighbr</h1>
<p>KnowYourNeighbr is an ensemble modeling approach that combines a black-box model with a white-box model. This allows the high accuracy typically associated with black-box models to be augmented with the interpretability generally associated with white-box models. In most applications, I have used information from an initial XGBoost step to improve a second multivariate matching or KNN step. I have also combined the results of the two steps, as a form of model stacking. The description below covers the non-stacking approach; a stacking approach would simply go one step further and combine the predictions of the two models (a brief sketch appears at the end of this page).</p>
<a href="modeling.png"><div class="image" style="text-align:center;"><img src="knowyourneighbr_fig_1.png" width="500"></div></a>
<p>I have used this approach with both matching outcomes (generation of synthetic controls, either for post-hoc inference or for prospective experimental-group assignment) and neighbor outcomes (generation of a KNN prediction and/or display of the selected neighbors). A matching approach underlies the (linked) <a href="acu_brfss.html">"Does Community Acupuncture Ameliorate Chronic Illness"</a> post.</p>
<p>The utility of such an approach is that its output takes the form of raw data -- rows of your original inputs -- collated optimally with respect to the importance of each column (predictor variable) in predicting your outcome variable.</p>
<p>This can be useful when:</p>
<p>1) Data structures suggest applicability of tree-based learning:</p>
<ol>
<li>complex non-linear interactions</li>
<li>theory-driven hierarchical structure</li>
</ol>
<p>and there exists:</p>
<p>2) Potential benefit from match generation:</p>
<ol>
<li>synthetic controls</li>
<li>interpretable measurements of bias between/across variables</li>
</ol>
<p>3) Potential benefit from neighbor generation:</p>
<ol>
<li>end-user interpretability via neighbor display</li>
<li>internal (or external) quality control (including regulatory oversight)</li>
</ol>
<h2>Process:</h2>
<p>First, generate a tree-based model:</p>
<p>Using an approach such as classification/regression trees, random forest, or xgboost, generate a model which predicts your outcome variable based on your predictor variables.</p>
<code>library(xgboost)
library(data.table)
library(caret)

# 'input': a data.frame/data.table of predictor columns plus a numeric 'outcome' column
dat_trva <- data.table(input)

# 80/20 train/validation split; [, 1] drops the matrix dimension so the
# indices can be used directly as data.table row subscripts
tr_rows <- createDataPartition(input$outcome,
                               p = .8, list = FALSE, times = 1)[, 1]
va_rows <- setdiff(seq_len(nrow(dat_trva)), tr_rows)

# predictor matrices exclude the outcome column so it cannot leak into the features
tr_dm  <- data.matrix(dat_trva[tr_rows, !"outcome"])
tr_lab <- dat_trva[tr_rows, outcome]
tr_packaged <- xgb.DMatrix(tr_dm, label = tr_lab)

va_dm  <- data.matrix(dat_trva[va_rows, !"outcome"])
va_lab <- dat_trva[va_rows, outcome]
va_packaged <- xgb.DMatrix(va_dm, label = va_lab)

# gradient-boosted trees with early stopping monitored on the validation set
tr_va_xgb_m <- xgb.train(
  objective = "reg:squarederror",
  eta = .1,
  early_stopping_rounds = 100,
  nrounds = 1000,
  data = tr_packaged,
  max_depth = 10,
  print_every_n = 50,
  watchlist = list(train = tr_packaged,
                   validate = va_packaged)
)
</code>
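<p>A quick sanity check of the fit before moving on -- a minimal sketch, assuming a reasonably recent xgboost version (best_iteration is populated because early stopping was enabled above):</p>
<code>va_pred <- predict(tr_va_xgb_m, va_packaged)
sqrt(mean((va_pred - va_lab)^2))   # validation RMSE
tr_va_xgb_m$best_iteration         # boosting round selected by early stopping
</code>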
<p>Second, extract feature importances:</p>
<p>Feature importances are encoded in the model object, which we access via a function.</p>
<code># importance table (Feature, Gain, Cover, Frequency) from the fitted booster;
# the outer parentheses print it as well as assign it
(model_imp <- xgb.importance(model = tr_va_xgb_m))
imp_val  <- model_imp$Gain     # relative contribution of each feature to the model
imp_vars <- model_imp$Feature  # feature names, ordered by importance
</code>
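<p>Optionally, the importance table can be plotted to eyeball which predictors will dominate the weighted distance in the next step -- a minimal sketch using xgboost's built-in plotting helper (the top_n value is an illustrative choice):</p>
<code># bar chart of per-feature Gain for the most important features
xgb.plot.importance(importance_matrix = model_imp, top_n = 20)
</code>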
<p>Third, normalize and scale non-outcome variables according to feature importances:</p>
<code># z-score every column; constant columns produce NaN, which is zeroed out
dat_trva_scaled <- scale(data.matrix(dat_trva))
dat_trva_scaled[is.na(dat_trva_scaled)] <- 0

# the same train/validation split as above, keeping matrix shape and column names
tr_x    <- dat_trva_scaled[tr_rows, , drop = FALSE]
va_x    <- dat_trva_scaled[va_rows, , drop = FALSE]
tr_va_x <- dat_trva_scaled

# one weight per model feature: the larger of its Cover and Frequency importance
to_diag <- diag(apply(model_imp[, .(Cover, Frequency)], 1, max))

# reorder columns to the importance table's order, then apply the weights
col_matchup <- match(model_imp$Feature, colnames(tr_x))
tr_x    <- tr_x[, col_matchup]    %*% to_diag
va_x    <- va_x[, col_matchup]    %*% to_diag
tr_va_x <- tr_va_x[, col_matchup] %*% to_diag
</code>
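<p>Geometrically, this step turns plain Euclidean distance into an importance-weighted distance: each standardized column contributes in proportion to its weight. A quick optional check, assuming the code above ran as written, is that each weighted column's spread now tracks its weight:</p>
<code># columns were standardized to sd 1, so after weighting each column's sd
# should match the weight applied to it (up to missing values zeroed earlier)
round(apply(tr_va_x, 2, sd), 3)
round(apply(model_imp[, .(Cover, Frequency)], 1, max), 3)
</code>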
<p>Fourth, apply your matching or knn model:</p>
<p>NB: to achieve this, I modified the R package FNN's default knn.reg function to expose neighbor metadata (each test row's neighbor indices and distances); an alternative that avoids patching the package, using FNN's get.knnx, is sketched at the end of this page.</p>
<code># The two lines below are added inside a locally modified copy of FNN::knn.reg,
# just before it returns 'res', so the result carries the neighbor metadata:
#   attr(res, "nn.index") <- matrix(Z$nn.index, ncol = k)
#   attr(res, "nn.dist")  <- matrix(Z$nn.dist, ncol = k)

library(FNN)
k <- 10   # number of neighbors; tune per application
kr <- knn.reg(train = tr_x,
              test  = va_x,
              y     = dat_trva[tr_rows, outcome],
              k     = k)

# relative error of the KNN prediction against the observed outcome
kr_resid_centered <- kr$pred / dat_trva[va_rows, outcome] - 1
</code>
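<p>If patching the package is undesirable, FNN's get.knnx returns the neighbor indices and distances directly, so the same metadata is available without modifying knn.reg. A minimal sketch, reusing the variable names from above:</p>
<code>library(FNN)
# neighbor indices and distances of each validation row among the training rows
nn <- get.knnx(data = tr_x, query = va_x, k = k)
str(nn$nn.index)   # one row per validation case, one column per neighbor
str(nn$nn.dist)
# the corresponding KNN prediction: average the outcome over each row's neighbors
tr_y <- dat_trva[tr_rows, outcome]
knn_pred <- apply(nn$nn.index, 1, function(idx) mean(tr_y[idx]))
</code>
<p>Finally, the stacking variant mentioned at the top simply blends the predictions of the two models. The blend weight below is an illustrative assumption rather than a fitted value; in practice it would be chosen on the validation set:</p>
<code># blend the black-box (xgboost) and white-box (KNN) predictions
xgb_pred <- predict(tr_va_xgb_m, va_packaged)
w <- 0.5   # assumed blend weight, not tuned
stacked_pred <- w * xgb_pred + (1 - w) * kr$pred
</code>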