Skip to content

Linear Discriminant Analysis

ARKAEA edited this page Sep 15, 2022 · 6 revisions

Introduction:

Dimensionality reduction can be aquired through several methods, for instance: linear and non-linear methods. Here, the linear method LDA will be presented.

Why it is important:

It can be efficient to remove redundant and dependent dimensions by transforming from a higher dimensional space to a lower space without getting the curse of dimensionality[2]. The curse of dimensionality in essence refers to the complications that arise due to having too many dimensions such that the amount of data that needs to be generalized becomes becomes a harder task to accomplish.[3]

Intuition:

Theorem 1: LDA is trying construct a lower dimensional space which maximizes the ratio between between-class variance and within-class variance, which maximizes class separability.[2]

between-class - distance between means of different classes.
within-class  - distance between the mean and the observations of each class

In one equation, $LDA = \frac{Variance_{BetweenClass}}{Variance_{WithinClass}}$ we need maximize the separability between classes and minimize the separability within classes so as to get the largest value possible.

(Optional) In other words the matrix required to transform the data matrix into our desired transformation is directly proportional, or dependent, to the inverse of the within-class matrix times the difference between the class means. [10] Mathematically -> $W \propto S_{W}^{-1} (m_2 - m_1)$ (I don't understand it. If it helps with your understanding congrats)

Algorithm to compute LDA:

Now the algorithm for computing LDA will be presented. Formulas will be omitted. They can be found in [2].

  1. From a set of $N$ observations $[x_i]^{N}_{i=1}$ , each of which is represented as a row of length $M$. Additionally the observations are stored in the data matrix $X$ ( $N \times M$ )
  2. Calculate the mean for each class $\mu_i$ ( $1 \times M$ )
  3. Calculate the mean for all the data $\mu$ ( $1 \times M$ )
  4. Calculate the between-class matrix $S_B$ ( $M \times M$ )
  5. Calculate the within-class matrix $S_W$ ( $M \times M$ )
  6. Calculate the transforming matrix $W$
  7. Calculate the eigenvalues $\lambda$ and eigenvectors $V$ for $W$
  8. Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first $k$ eigenvectors will be used as a lower dimensional space ( $V_k$ )
  9. Project the original data matrix $X$ onto $V_k$

Note regarding LDA:

The algorithm that is shown works for the class-independent technique. In the class-independent (CI) only one global transformation matrix is calculated, and all of the observations are projected on the eigenvectors. For the class-dependent (CD) technique the within-class matrix, tranformation matrix, eigenvalues and eigenvectors need to be computed. Then each observation should be projected onto the class' lower dimensional space.

Key takeway: CD is computationally more expensive than CI, and it may affect the singulairty of the within-class matrices.[2] Other sources [8][9] suggest that CD is more precise because the transformation matrix is calculated in terms of the discrimination information that is contained in the respective class, instead of one singular transformation matrix constructed for all the classes.

Potential Problems: (The problems can be fixed as there are solutions contained in [2], but also other sources)

LDA does not work on non-linearly separable classes - that is, if the means of the classes are approximately equal, then the between-class variance and covariance matrix will be zero, which means that no projection onto a lower dimension will be accomplished.

Another problem which can be encountered while applying LDA is the Singularity problem. It occurs when the within-class variance is singular - that is, the determinant of $S_W$ is zero. In other words, LDA fails to find the lower dimensional space if the dimensions are much higher than the number of samples in the data matrix.[2]

Assumptions:

For optimal performance, LDA assumes that the data has a normal distribution and that the classes have equal covariances. However, reserach shows that it could also do some decent work without the aforementioned assumptions.[5]

Refs and Further reading:

[1] https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.735.8417&rep=rep1&type=pdf

[2] http://usir.salford.ac.uk/id/eprint/52074/1/AI_Com_LDA_Tarek.pdf

[3] https://www.kdnuggets.com/2017/04/must-know-curse-dimensionality.html

[4] https://link.springer.com/content/pdf/10.1007/978-0-387-78189-1.pdf

[5] https://en.wikipedia.org/wiki/Linear_discriminant_analysis

[6] https://papers.nips.cc/paper/2004/file/86ecfcbc1e9f1ae5ee2d71910877da36-Paper.pdf

[7] https://sthalles.github.io/fisher-linear-discriminant/

[8] https://www.eurasip.org/Proceedings/Eusipco/Eusipco2014/HTML/papers/1569926695.pdf

[9] https://journal.xidian.edu.cn/xdxb/EN/abstract/abstract11596.shtml

[10] https://sthalles.github.io/fisher-linear-discriminant/

The math behind: (gave up - too much to write. Just look at [2])

The between-class variance( $S_B$ ):

We denote $S{B}{i}$ as the between-class variance for the $i^{th}$ class. It is expressed as the squared difference between the $i^{th}$ class mean ( $\mu_i$ ) and the total mean ( $\mu$ ). Formally we write this as

$(m_i-m)^2$,

where $m_i$ is the projection of $\mu_{i}$, as is calculated by $m_i$=$W^{T}\mu_{i}$, where $W$ is the transformation matrix. $m$ is the total mean $1\times M$ vector.

Computing the mean for the $i^{th}$ class is done as follows:

$\mu_{j}$ = $\frac{1}{n_j}$ $\sum\limits_{x_{i} \in \omega_{j}} x_i$, and for the total mean respectively:

$\mu$ = $\sum\limits_{i=1}^{c}\frac{n_i}{N}\mu_{i}$

The rest of the formula for $S{_B}$, namely, $(\mu_i-\mu)(\mu_i-\mu)^T$ constitutes the distance between the mean of the $i^{th}$ class and the total mean, which accounts for one between-class variance instance. For the total between-class variance

We assume that we have a data matrix $X$ = { $x_1, x_2, .... x_N$ } where $x_i$ represents the $i^{th}$ observation, and $N$ represents the total number of observations. Each observation is represented in $M$ dimensions, which can be written as $x_i \in R^M$

We assume that we have $c$ classes, so the data matrix $X$ becomes $X$ $=$ { $\omega_1 , \omega_2,. . . \omega_c$ }

$N$ is calculated as $\sum^{c}_{i=1} n_i$

The within-class variance( $S_W$ )

(. . .)

Algorithm for class-dependent LDA:

  1. From a set of $N$ observations $[x_i]^{N}_{i=1}$ , each of which is represented as a row of length $M$. Additionally the observations are stored in the data matrix $X$ ( $N \times M$ )

  2. Calculate the mean for each class $\mu_i$ ( $1 \times M$ )

  3. Calculate the mean for all the data $\mu$ ( $1 \times M$ )

  4. Calculate the between-class matrix $S_B$ ( $M \times M$ )

  5. for all Class i = 1,2... c do

    1. Calculate the within-class matrix for each class $S_{{W}_{i}}$ ( $M \times M$ )

    2. Calculate the transforming matrix for each class $W_i$

    3. Calculate the eigenvalues $\lambda^i$ and eigenvectors $V^i$ for each $W_i$ transformation matrix for the respective class

    4. Sort the eigenvectors in descending order according to their corresponding eigenvalues. The first $k$ eigenvectors will be used as a lower dimensional space ( $V_{k}^{i}$ )

    5. Project the original data matrix $X$ onto $V_{k}^{i}$

  6. end for