For a roll of a die, there are 6 possible worlds: 1, 2, 3, 4, 5, and 6. Each world has a probability of being true, and we can represent both the world and its probability as follows.
ω -> A possible world.
P(ω) -> The probability of world ω being true, where 0 ≤ P(ω) ≤ 1.
The sum of all P(ω) has to be equal to 1. Logically, this tells us that some world will be true. In the dice example, it is guaranteed that we will get a number ω where 1 ≤ ω ≤ 6. We can represent this with the equation:

Σ_ω P(ω) = 1

That is, the sum of the probabilities of all possible worlds is equal to 1.
In a more complex model, it's not always the case that each world is equally likely.
Example:
In this example, although each world (a combination of two dice rolls) is equally likely, the possible sums of the rolls are not. There are 6 possible worlds where the sum is 7, but only 1 possible world where the sum is 12. We can represent this as follows.
P(sum = 12) = 1/36 -> only 1 out of the 36 worlds has a sum of 12.
P(sum = 7) = 6/36 = 1/6 -> 6 worlds have a sum of 7.
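We can check these numbers by enumerating all 36 worlds. A minimal sketch in Python (the helper name `p_sum` is just for illustration):

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely worlds: (red, blue) die pairs.
worlds = list(product(range(1, 7), repeat=2))

def p_sum(target):
    """Probability that the two dice sum to `target`."""
    favorable = sum(1 for red, blue in worlds if red + blue == target)
    return Fraction(favorable, len(worlds))

print(p_sum(7))   # 1/6
print(p_sum(12))  # 1/36
```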
Unconditional Probability: Degree of belief in a proposition in the absence of any other evidence.
Conditional Probability: Degree of belief in a proposition given some evidence that has already been revealed.
P(a | b) -> The probability that a is true, given that b is true:

P(a | b) = P(a ∧ b) / P(b)

In other words: restrict attention to the cases where b is true, and ask in what fraction of those cases a is also true.
Equivalently, P(a ∧ b) = P(b) P(a | b).

Example: In the two-dice example above, the probability of the sum being 12 given a red die with value 6.
a: the sum being 12.
b: the red die showing 6.
First, the probability of red being 6 is P(b) = 1/6. Next, we need the probability of both a and b being true, which is P(a ∧ b) = 1/36, because only 1 of the 36 worlds has a sum of 12 with red being 6. Finally, we divide P(a ∧ b) by P(b) to get the conditional probability:

P(sum = 12 | red = 6) = (1/36) / (1/6) = 1/6

Therefore, given that the red die rolled 6, the probability of the sum being 12 is 1/6.
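The same enumeration idea verifies this conditional probability by counting worlds; a small sketch:

```python
from fractions import Fraction
from itertools import product

worlds = list(product(range(1, 7), repeat=2))  # (red, blue) pairs

# P(a | b) = P(a ∧ b) / P(b), computed by counting worlds.
b_worlds = [w for w in worlds if w[0] == 6]             # red = 6
ab_worlds = [w for w in b_worlds if w[0] + w[1] == 12]  # and sum = 12

p_b = Fraction(len(b_worlds), len(worlds))    # 1/6
p_ab = Fraction(len(ab_worlds), len(worlds))  # 1/36
print(p_ab / p_b)                             # 1/6
```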
Random Variable: A variable in probability theory with a domain of possible values it can take on.
Ex: Roll = {1, 2, 3, 4, 5, 6}
Ex: Weather = {sun, cloud, rain, wind, snow}
Ex: Traffic = {none, light, heavy}
Ex: Flight = {on time, delayed, cancelled}
Ex: P(Flight = on time) = 0.6
Ex: P(Flight = delayed) = 0.3
Ex: P(Flight = cancelled) = 0.1
This can be represented using a vector: a sequence of values (interpreted in order).
P(Flight) = <0.6, 0.3, 0.1>
Independence: The knowledge that one event occurs does not affect the probability of other events.
Ex: The red die roll does not influence the blue die roll.
Ex: Clouds and rain are most likely not independent.
There is no way the red die can show 4 if it shows 6 on the same roll. Therefore, the event that red rolls 6 and the event that red rolls 4 are not independent: if one of them occurs, the other is guaranteed not to occur on that specific roll.
Given:

P(a ∧ b) = P(b) P(a | b) and P(a ∧ b) = P(a) P(b | a)

we can use substitution to derive Bayes' Rule. Setting the right-hand sides equal and dividing by P(a) gives:

P(b | a) = P(b) P(a | b) / P(a)

This is useful when one probability is easier to know than another probability. Knowing P(visible effect | unknown cause), we can calculate P(unknown cause | visible effect).
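As a worked sketch, Bayes' Rule as a one-line function; the numbers below are made up for illustration:

```python
def bayes(p_b, p_a_given_b, p_a):
    """P(b | a) = P(b) * P(a | b) / P(a)."""
    return p_b * p_a_given_b / p_a

# Hypothetical numbers: P(rain) = 0.1, P(clouds | rain) = 0.8, P(clouds) = 0.4.
# Knowing P(visible effect | cause), we recover P(cause | visible effect):
print(bayes(0.1, 0.8, 0.4))  # P(rain | clouds) = 0.2
```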
AM:
| C = cloud | C = ¬cloud |
| --- | --- |
| 0.4 | 0.6 |
PM:
| R = rain | R = ¬rain |
| --- | --- |
| 0.1 | 0.9 |
AM & PM:
|  | R = rain | R = ¬rain |
| --- | --- | --- |
| C = cloud | 0.08 | 0.32 |
| C = ¬cloud | 0.02 | 0.58 |
Probability of clouds given it is raining (the comma "," and "∧" are used interchangeably):

P(C | rain) = P(C, rain) / P(rain) = α P(C, rain)

The conditional distribution of C given rain is proportional to the joint probability of C and rain, where P(C, rain) = <0.08, 0.02> (the cloud probabilities when it's raining). Multiplying by the normalization constant α so the values sum to 1 gives P(C | rain) = <0.8, 0.2>.
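A minimal sketch of this normalization in Python, using the joint values from the table above:

```python
# P(C, R = rain) for C = cloud and C = ¬cloud, read off the joint table.
joint_given_rain = [0.08, 0.02]

alpha = 1 / sum(joint_given_rain)                    # normalization constant
conditional = [alpha * p for p in joint_given_rain]  # P(C | rain)
print([round(p, 2) for p in conditional])            # [0.8, 0.2]
```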
Negation: The probability of ¬a happening is 1 minus the probability of a happening: P(¬a) = 1 − P(a).
Inclusion-Exclusion: The probability of a or b being true is the probability of a being true, plus the probability of b being true, minus the probability of both a and b being true: P(a ∨ b) = P(a) + P(b) − P(a ∧ b).
Marginalization: This formula helps find the probability of a using some other information, like b:

P(a) = P(a, b) + P(a, ¬b)

Either b is true or b is not true, and the probability of a being true is the sum of both these cases. Sometimes b takes the form of a random variable with multiple possible values, so we have to sum over all the possible values that b could take. This can be represented as follows:

P(X = x_i) = Σ_j P(X = x_i, Y = y_j)
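For instance, marginalizing over the joint AM & PM table above recovers the AM table; a short sketch:

```python
# Joint probabilities from the AM & PM table above.
joint = {
    ("cloud", "rain"): 0.08, ("cloud", "¬rain"): 0.32,
    ("¬cloud", "rain"): 0.02, ("¬cloud", "¬rain"): 0.58,
}

# P(C = cloud) = P(cloud, rain) + P(cloud, ¬rain)
p_cloud = sum(p for (c, r), p in joint.items() if c == "cloud")
print(round(p_cloud, 2))  # 0.4, matching P(C = cloud) in the AM table
```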
Conditioning: Similar to marginalization, except it uses conditional probabilities instead of joint probabilities:

P(a) = P(a | b) P(b) + P(a | ¬b) P(¬b)

Again, there is a version of conditioning that can be applied to random variables:

P(X = x_i) = Σ_j P(X = x_i | Y = y_j) P(Y = y_j)

The above means: for each state y_j that variable Y can take, find the probability that Y takes on state y_j, multiply it by the probability of X taking on state x_i given that Y took on state y_j, and sum over all of them.
Data structure that represents the dependencies among random variables.
- Directed graph
- Each node represents a random variable
- Arrow from X to Y means X is a parent of Y
- Each node X has a probability distribution P(X | Parents(X))
Description:
Appointment: Dependent on Train (1 arrow pointing to it)
Train: Dependent on Maintenance and Rain (2 arrows pointing to it)
Maintenance: Dependent on Rain (1 arrow pointing to it)
Rain: Not dependent on anything (no arrows pointing to it)
Using Bayesian Networks:
Let's say we wanted to find P(light, no, delayed, miss).
Maintenance is dependent on Rain, so we write the joint probability of rain being light and maintenance taking on the value of no as P(light) * P(no | light).
The train is dependent on both Rain and Maintenance, so we add both light and no to the conditioning side, extending the previous expression to P(light) * P(no | light) * P(delayed | light, no).
Ultimately, Appointment is dependent only on Train, so we add one final factor to get the full probability P(light, no, delayed, miss):

P(light) * P(no | light) * P(delayed | light, no) * P(miss | delayed)
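A minimal sketch of this chain-rule product in Python. The CPT numbers are hypothetical, not the values from the original network:

```python
# Hypothetical CPT entries for Rain -> Maintenance -> Train -> Appointment.
p_light = 0.3                     # P(Rain = light)
p_no_given_light = 0.8            # P(Maintenance = no | Rain = light)
p_delayed_given_light_no = 0.4    # P(Train = delayed | light, no)
p_miss_given_delayed = 0.6        # P(Appointment = miss | delayed)

# The joint probability is the product of each node's CPT entry
# conditioned on its parents.
p_joint = (p_light * p_no_given_light
           * p_delayed_given_light_no * p_miss_given_delayed)
print(round(p_joint, 4))  # 0.0576 with these made-up numbers
```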
- Query X: variable for which to compute distribution.
- Evidence variables E: observed variables for event e.
- Hidden variables Y: non-evidence, non-query variables.
Query: Appointment
Evidence: light, no
Hidden: Train
We know that a conditional probability is equivalent to α times the joint probability: P(X | e) = α P(X, e).
Either the train is on time, or the train is delayed, so we sum the joint probability over both values of the hidden variable:

P(X | e) = α P(X, e) = α Σ_y P(X, e, y)

X is the query variable.
e is the evidence.
y ranges over values of hidden variables.
α normalizes the result.
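A rough sketch of inference by enumeration for this query, reusing the hypothetical CPT numbers from the earlier sketch:

```python
# Query: P(Appointment | Rain = light, Maintenance = no); hidden: Train.
p_light = 0.3                                  # hypothetical CPT entries
p_no_given_light = 0.8
p_train = {"on time": 0.6, "delayed": 0.4}     # P(Train | light, no)
p_miss = {"on time": 0.1, "delayed": 0.6}      # P(miss | Train)

# alpha * sum over the hidden variable y of P(X, e, y)
unnormalized = {"miss": 0.0, "attend": 0.0}
for t, p_t in p_train.items():
    joint_base = p_light * p_no_given_light * p_t      # P(light, no, t)
    unnormalized["miss"] += joint_base * p_miss[t]
    unnormalized["attend"] += joint_base * (1 - p_miss[t])

alpha = 1 / sum(unnormalized.values())                 # 1 / P(light, no)
print({k: round(alpha * v, 2) for k, v in unnormalized.items()})
# {'miss': 0.3, 'attend': 0.7} with these made-up numbers
```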
Sampling: Randomly choose a value for one of the random variables according to its distribution; every variable from that point on is sampled based on the values already chosen. Now, if we run this sampling simulation many times, we can take the results where the query is true and estimate the probability, rather than computing every possible case.
Rejection Sampling: This works similarly to the sampling above, except we reject the sampled worlds where the evidence is not satisfied and keep the ones where it is.
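A toy sketch of rejection sampling on the two-dice example, estimating P(sum = 12 | red = 6):

```python
import random

random.seed(0)
accepted = hits = 0
for _ in range(100_000):
    red, blue = random.randint(1, 6), random.randint(1, 6)
    if red != 6:              # evidence not satisfied: reject this world
        continue
    accepted += 1
    if red + blue == 12:      # query holds in this sampled world
        hits += 1

print(hits / accepted)        # ≈ 0.167, close to the exact 1/6
```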
Problems with Sampling: If the evidence is very unlikely to occur, most samples are rejected and a lot of computation is wasted.
Likelihood Weighting:
- Start by fixing the values for evidence variables
- Sample the non-evidence variables using conditional probabilities in the Bayesian Network
- Weight each sample by its likelihood: the probability of all the evidence given the sampled values (a sketch follows after this list).
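A minimal likelihood-weighting sketch on a two-node chain Rain -> Train with made-up CPTs, estimating P(Rain | Train = delayed):

```python
import random

random.seed(0)
p_rain = 0.3                               # hypothetical P(Rain = rain)
p_delayed = {"rain": 0.7, "sun": 0.2}      # hypothetical P(delayed | Rain)

weights = {"rain": 0.0, "sun": 0.0}
for _ in range(100_000):
    # Sample the non-evidence variable Rain from its distribution.
    r = "rain" if random.random() < p_rain else "sun"
    # The evidence Train = delayed is fixed; weight by its likelihood.
    weights[r] += p_delayed[r]

total = sum(weights.values())
print({k: round(w / total, 2) for k, w in weights.items()})
# ≈ {'rain': 0.6, 'sun': 0.4}; exact: 0.21 / 0.35 = 0.6
```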
Markov Assumption: The assumption that the current state depends on only a finite, fixed number of previous states.
Markov Chain: A sequence of random variables where the distribution of each variable follows the Markov assumption.
Transition Model: Specifies the probability of transitioning from one state to the next state.
Given today's weather, predict tomorrow's weather. Use tomorrow's weather to predict the day after's weather and so on...
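A small sketch of sampling such a chain, with a made-up transition model:

```python
import random

random.seed(0)
# Hypothetical transition model: P(tomorrow's weather | today's weather).
transitions = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

def next_state(today):
    """Sample tomorrow's weather from the transition model."""
    r, cumulative = random.random(), 0.0
    for state, p in transitions[today].items():
        cumulative += p
        if r < cumulative:
            return state
    return state  # guard against floating-point round-off

chain, state = [], "sun"
for _ in range(10):
    state = next_state(state)
    chain.append(state)
print(chain)
```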
Hidden Markov Model:
A Markov model for a system with hidden states that generate some observed event.
Sensor Model: The probability of an observation given the hidden state, e.g. P(umbrella | rain).
In this case, we aren't directly observing sun and rain; instead, we observe umbrellas and use them to infer the weather, which is "hidden."
Sensor Markov Assumption: The assumption that the evidence variable depends only on the corresponding state.
Filtering: Given observations from start until now, calculate probability for the current state (a filtering sketch follows after this list).
Prediction: Given observations from start until now, calculate probability for a future state.
Smoothing: Given observations from start until now, calculate probability for a past state.
Most likely explanation: Given observations from start until now, calculate the most likely sequence of states.
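A sketch of one filtering step (the forward algorithm) for the umbrella HMM; all the numbers are made up for illustration:

```python
# Hypothetical HMM for the umbrella example.
transition = {                 # P(weather_t | weather_{t-1})
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}
sensor = {"sun": 0.1, "rain": 0.9}   # P(umbrella | weather)

def filter_step(belief, saw_umbrella):
    """One forward step: update P(weather | observations so far)."""
    # Predict: push the current belief through the transition model.
    predicted = {
        s: sum(belief[prev] * transition[prev][s] for prev in belief)
        for s in transition
    }
    # Update: weight by the sensor model, then normalize.
    unnormalized = {
        s: predicted[s] * (sensor[s] if saw_umbrella else 1 - sensor[s])
        for s in predicted
    }
    alpha = 1 / sum(unnormalized.values())
    return {s: alpha * p for s, p in unnormalized.items()}

belief = {"sun": 0.5, "rain": 0.5}
for saw_umbrella in [True, True, False]:   # one observation per day
    belief = filter_step(belief, saw_umbrella)
print({s: round(p, 3) for s, p in belief.items()})
```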