-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpost-knowing.html
102 lines (77 loc) · 12 KB
/
post-knowing.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Knowing</title>
<link rel="stylesheet" href="style.css">
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
</head>
<body>
<!-- Sidebar is loaded dynamically -->
<div id="sidebar"></div>
<div id="content">
<h1>Knowing - what do I mean when I say I know something</h1>
<p>I was told "the joint probability density function between two variables \( X \) and \( Y \) captures everything that there is ever to be known about the relation between those \( X \) and \( Y \)" 25 years ago (¡Hola! <a href="https://faculty.ucmerced.edu/mcarreira-perpinan/">Miguel</a> :-)), and it's been a blessing and a curse. Blessing - yeah the joint pdf \( f_{X,Y}(x,y) \) really does capture everything. Curse - often I read an article and think of the author "wish someone told you too", you poor soul.</p>
<p>So for me knowing something about \( X \) means knowing the distribution, the pdf \( f_X(x) \). Most of the time our knowledge is more than 1-dimensional, we have at least two qualities that we want to quantify the relationship of. So knowing something about \( (X,Y) \) jointly, for me means knowing the joint pdf \( f_{X,Y}(x,y) \).</p>
<p>Below I illustrate this point on the example of a joint pdf \( p = f_{X,Y}(x,y) \) that is a mix of two Gaussians in 2D space \( (x,y) \). We observe the variable \( X \), and that observations is \( x=1 \). The question is - what do we now know about the variable \( Y \), having observed the variable \( X \) (to be \( x=1 \)).</p>
<p>The observation \(x=1 \) is equivalent to the joint pdf being cut by the plane \( x=1 \). The intersection of the joint pdf \( f_{X,Y}(x,y) \) and the plane \( x=1 \) is \( f_{X,Y}(x=1,y) \). This curve is the best description of what we now know about the distribution of the unobserved variable \( Y \).</p>
<p>The starting model that was \( f_{X,Y})(x,y) \) is affected by the observation \( x=1 \). The effect is the intersection \( f_{X,Y}(x=1,y) \), and is outlined below. It is a function of \( y \), that is a scaled conditional \( f_Y(y|x=1) = \frac{f_{X,Y}(x=1,y)}{f_X(x=1)} \). The conditional pdf is \( f_Y(y|x) \).</p>
<p>The scaler \( f_X(x=1) \) is the marginal pdf \( f_X(x) \) of \( X \) at point \( x=1 \). The marginal pdf \( f_X(x) \) is computed from the joint pdf \( f_{X,Y}(x,y) \) by marginalization, by integrating out \( Y \) as \( f_X(x) = \int f_{X,Y}(x,y)\,dy \) and then plugging in \( x=1 \).</p>
<p><a href="#" onclick="toggleShowImage('pdf-joint-cond-marg-1of3')">Joint marginal conditional pdf 1 of 3</a>. (click to zoom)
<img id="pdf-joint-cond-marg-1of3" src="pdf-joint-cond-marg-1of3.png" style="display: none; width: 100%; height: auto;" onclick="zoomImage(this)"></p>
<p><a href="#" onclick="toggleShowImage('pdf-joint-cond-marg-2of3')">Conditional pdf is ratio of joint (at point) and marginal 2 of 3</a>. (click to zoom)
<img id="pdf-joint-cond-marg-2of3" src="pdf-joint-cond-marg-2of3.png" style="display: none; width: 100%; height: auto;" onclick="zoomImage(this)"></p>
<p><a href="#" onclick="toggleShowImage('pdf-joint-cond-marg-3of3')">Marginal pdf is derived from the joint pdf 3 of 3</a>. (click to zoom)
<img id="pdf-joint-cond-marg-3of3" src="pdf-joint-cond-marg-3of3.png" style="display: none; width: 100%; height: auto;" onclick="zoomImage(this)"></p>
<p>Coming back to "joint pdf captures everything there is in the relationship \(X,Y\)". Putting it in a wider context.</p>
<p>When reading about knowledge, I have come across the following so collected here for future reference.</p>
<p>We can have 2 types of knowledge about the outcome of (repeated) experiment(s):</p>
<ol type="a">
<li id="know-dirac">We know what will happen and when it will happen in each experiment. This is non-probabilistic, deterministic knowledge.</br>
NB it is a special case of <a href="#know-pdf">both</a> <a href="#aleo-pdf">(b)</a> cases below with the pdf being a Dirac impulse function.</li>
<li id="know-pdf">We know the possible outcomes, we know how many of each will happen if we do 100 experiments, but for each 1 experiment, we can't tell the outcome.</br>
This is probabilistic knowledge where we know the pdf (=probability density function) of the experiment outcome.<br>
It is the aleatoric kind of uncertainty (see <a href="#aleo-pdf">below</a>) - we know the statistics, the counts, but not what one outcome is going to be in every one experiment.</li>
</ol>
<p>Uncertainty - obverse to knowing, to knowledge, lacking (perfect, deterministic) knowledge, we can think of types:</p>
<ol type="a" start="2">
<li id="aleo-pdf">Aleatoric uncertainty means not being certain what the random sample drawn (from the probability distribution) will be: the p.d.f. is known, only point samples will be variable (but from that p.d.f.).</br>
We can actually reliably count the expected number of times an event will happen.</li>
<li id="epi-unpdf">Epistemic uncertainty is not being certain what the relevant probability distribution is: it is the p.d.f. that is unknown.</br>
We can't even reliably count the expected number of times an event will happen.</li>
</ol>
<p>The probabilistic knowledge of type <a href="#know-pdf">(b)</a> above and aleatoric uncertainty of type <a href="#aleo-pdf">(b)</a> are one and the same.</p>
<p>The 2D \( (X,Y) \) example is also useful to illustrate a further point. Once we observe \( X \), and work out the conditional pdf \( f_Y(y|x) \), the question arises - what next? What do we do with it?</p>
<p>If \( Y \) is discrete, we have a problem of classification. If \( Y \) is continuous, we have a problem of regression.</p>
<p>We have the entire curve to work with - and that's the best. But often, we approximate the entire curve, with a representative value, and soldier on. Then the question becomes: well how do we chose one representative value from that curve?</p>
<p>The "\( X \) observed \( Y \) not observed" is arbitrary - it could be the other way around. We can generalize this by introducing a 2D binary <b>mask</b> \( M \), to indicate what parts of the vector \( (X,Y) \) are present (observed), and what parts are missing (and thus of some interest, e.g. we want to predict or forecast them).</p>
<p>With present data \( X \) and missing data \( Y \) in \( (X,Y) \), then missing data imputation is actually equivalent to forecasting regression or classification.</p>
<p>TBD When time is one of the dimensions, with Now separating the Past from the Future: there is a big difference is whether \( X \) and \( Y \) are contemporaneous, or not. Not contemporaneous, having \( X \) in the past, predicting \( Y \) in the future, makes the signal connection \( X \longrightarrow Y \) much, much weaker, by orders of magnitude, then it would be otherwise (if \( X \) and \( Y \) are not separate by time, but happen at the same time).</p>
<p>TBD link to entropy and average information, specific <a href="DeWeese_Meister_-_How_to_measure_the_information_gained_from_one_symbol_-_ne9403.pdf">information from one symbol</a> -<br>
Michael R DeWeese and Markus Meister (1999), <a href="DeWeese_Meister_-_How_to_measure_the_information_gained_from_one_symbol_-_ne9403.pdf">"How to measure the information gained from one symbol"</a>, <a href="https://www.tandfonline.com/doi/abs/10.1088/0954-898X_10_4_303">Network: Computation in Neural Systems, 10:4, 325-340</A>, <a href="https://iopscience.iop.org/article/10.1088/0954-898X/10/4/303">DOI: 10.1088/0954-898X_10_4_303</a> [[excellent paper; introduces the idea that more information can make the entropy higher, thus reducing our knowledge if the knowledge measure is the spikiness of the probability density function; after we have additional observation (information), the posterior p.d.f. given the observation, maybe flatter then before => our knowledge decreased ]]</p>
<p>TBD Knowing and knowledge put into an even wider context.</p>
<ol>
<li>Known Knowns. Deterministic knowledge - we know exactly which one. (<a href="#know-dirac">deterministic knowledge</a> above)</li>
<li>Known Unknowns. We don't know which one, but we know how many of which type; i.e. the distribution. (<a href="#know-pdf">known pdf</a>, <a href="#aleo-pdf">aleatoric uncertainty</a> above)</li>
<li>Unknown Unknowns. We don't know the p.d.f. either. (<a href="#epi-unpdf">epistemic uncertainty</a> above)</li>
<li>Unknown Knowns. Ideology. Fish swimming in water never knowing anything else but water.Possibly thus being unable to perceive the water too? Not sure - that maybe a step too far. Just not knowing anything outside water, but knowing water suffices imo. (Zizek @ YT; self-reflective, I using the framework described above by the 3 cases {pdf-yes-Dirac,pdf-yes-non-Dirac,pdf-no} am like the fish, and the framework is my water, in the sense that's my entire knowledge and I know not outside of it)</li>
</ol>
<p>TBD Quantity and quality. Dimensions \( (X,Y) \) are qualities, and we quantify them each too. When do we add new quality and obversely when do we lose a quality (dimension)?</p>
<p>(<a href="https://news.ycombinator.com/item?id=38261719">LJ @ HN</a>) There is a spin on the same idea when working with data (maths/stats/comp/ML) and having to skirt around the curse of dimensionality. Suppose I have a 5-dimensional observation and I'm wondering if it's really only 4 dimensions there. One way I check is - do a PCA, then look at the size of the remaining variance along the axis that is the smallest component (the one at the tail end, when sorting the PCA components by size). If the remaining variance is 0 - that's easy, I can say: well, it was only ever a 4-dimensional observation that I had after all. However, in the real world it's never going to be exactly 0. What if it is 1e-10? 1e-2? 0.1? At what size does the variance along that smallest PCA axis count as an additional dimension in my data? The thresholds are domain dependent - I can for sure say that enough quantity in the extra dimension gives a rise to that new dimension, adds a new quality. Obversely - diminishing the (variance) quantity in the extra dimension removes that dimension eventually (and with total certainty at the limit of 0). I can extend the logic from this simplest case of linear dependency (where PCA suffices) all the way to to the most general case where I have a general program (instead of PCA) and the criterion is predicting the values in the extra dimension (with the associated error having the role of the variance in the PCA case). At some error quantity \( \gt 0 \) I have to admit I have a new dimension (quality). TBD</p>
<p>TBD Ndim space. Ratio of Ncube/Nball. Does our intuition fail us about the representative values of a distribution when we go from low \( N \) \( (N = 2) \) to high(er) \( N \) \( (N \gt 10) \)? For large N, Nspace in Ndim: (a) moves into the edges (b) every observation is an outlier (in some dimension). Does that mean the space becomes discrete, it discretizes?</p>
<p>TBD Sparse representation, moving to symbolics and rules. Once the Ndim vector becomes sparse, we move from continuous representations to discrete symbolic rules?</p>
<p></p>
<p></p>
<p></p>
<p></p>
<p></p>
<p><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>-- <br>LJ HPD Sun 20 Oct 07:31:04 BST 2024</p>
</div>
<!-- Link to the external script -->
<script src="scripts.js"></script>
<!--Load the sidebar html that is table of contents -->
<script>loadSidebar();</script>
</body>
</html>