
# Notes {-}

## What is statistical learning?

[**Statistical learning**](https://en.wikipedia.org/wiki/Statistical_learning_theory) is a theoretical framework for machine learning. It draws on the fields of statistics, linear algebra, and functional analysis.

In particular, **statistical learning theory** deals with the problem of finding a predictive function based on data, which is best known as **supervised learning**. In this book we will see more than just theory: we will also cover **unsupervised learning** and work through practical applications.


- **Supervised:** "Building a model to **predict an *output* from *inputs*.**"
  - Predict `wage` from `age`, `education`, and `year`.
  - Predict `market direction` from `previous days' performance`.
- **Unsupervised:** Inputs but no specific outputs, **find relationships and structure.**
  - Identify `clusters` within `cancer cell lines`.
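
The contrast between the two settings can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the numbers are made up (they are not the book's `Wage` data), and `fit_line` and `two_means` are hypothetical helpers written for this note, not functions from the ISLP package.

```python
# Supervised: each input (age) comes with a known output (wage).
# Made-up numbers, purely for illustration.
ages = [25, 35, 45, 55]
wages = [50.0, 70.0, 90.0, 110.0]


def fit_line(x, y):
    """Least-squares fit with a single input: returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(ages, wages)
predicted_wage = intercept + slope * 40  # predict the output for a new input

# Unsupervised: inputs only, no output; we look for structure (two groups).
measurements = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]


def two_means(values, n_iter=10):
    """Tiny 1-D k-means with k=2: returns a 0/1 cluster label per value."""
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(n_iter):
        # Assign each value to its nearest center.
        labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        # Move each center to the mean of its assigned values.
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels


labels = two_means(measurements)
print(predicted_wage, labels)
```

In the supervised half, the known wages guide the fit; in the unsupervised half, nothing tells the algorithm which group a measurement belongs to, so the grouping comes from the inputs alone.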


## Why ISLP?

- "Facilitate the transition of statistical learning from an academic to a mainstream field."
- Machine learning is useful to everyone; let's all learn enough to use it responsibly.
- Python "labs" make this make sense for this community!

## Premises of ISLP

From Page 9 of the Introduction:

- "Many statistical learning methods are relevant and useful in a wide range of academic and non-academic disciplines, beyond just the statistical sciences."
- "Statistical learning should not be viewed as a series of black boxes."
- "While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box!"
- "We presume that the reader is interested in applying statistical learning methods to real-world problems."

## Notation

- *n* = number of observations (rows)
- *p* = number of features/variables (columns)
- We'll come back here if we need to as we go!
- Some symbols they assume we know:
  - $\in$ = "is an element of", "in"
  - $\mathbb{R}$ = "real numbers"
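
To make *n* and *p* concrete, here is a minimal sketch using a made-up data matrix stored as a list of rows. The values and the name `X` are assumptions for illustration only.

```python
# A made-up data matrix X: each inner list is one observation (row),
# and each position within it is one feature (column).
X = [
    [25, 12, 2005],
    [35, 16, 2006],
    [45, 12, 2007],
    [55, 18, 2008],
]

n = len(X)     # number of observations (rows)
p = len(X[0])  # number of features (columns)

print(f"n = {n}, p = {p}")  # X can be viewed as an n-by-p matrix of real numbers
```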

## What have we gotten ourselves into?

**An Introduction to Statistical Learning** (ISL, by James, Witten, Hastie, and Tibshirani) is a collection of modern statistical methods for modeling and making predictions from real-world data.

It is a middle way between theoretical statistics and the practice of applying statistics to real-world problems.

It can be considered a user manual: its self-contained Python labs lead you through applying different statistical methods to different kinds of data.

- 2: Terminology & main concepts
- 3-4: Classic linear methods
- 5: Resampling (so we can choose the best method)
- 6: Modern updates to linear methods
- 7+: Beyond Linearity (we can worry about details as we get there)

## Where's the data?

```bash
pip install ISLP
```

We'll look at this data in more detail below.

## Some useful resources:

- the book page: [statlearning.com](https://www.statlearning.com/)
- PDF of the book: [ISLP_website](https://hastie.su.domains/ISLP/ISLP_website.pdf)
- ISLP [labs](https://github.com/intro-stat-learning/ISLP_labs/tree/v2)
- course on edX: [statistical-learning](https://www.edx.org/course/statistical-learning)
- YouTube channel: [playlists](https://www.youtube.com/channel/UCB2p-jaoolkv0h22m4I9l9Q/playlists)
- exercise solutions: [applied solutions](https://botlnec.github.io/islp/)

**Some more theoretical resources:**

- The Elements of Statistical Learning (ESL, by Hastie, Tibshirani, and Friedman) [ESLII](https://web.stanford.edu/~hastie/Papers/ESLII.pdf)


## What is covered in the book?

The book provides a series of toolkits classified as **supervised or unsupervised** techniques for understanding data.

The second edition of the book (2021) adds coverage of more recent statistical methods.


```{r}
#| label: 01-contents
#| echo: false
#| fig-cap: Editions
#| out-width: 100%
knitr::include_graphics("images/01-contents.png")
```


## How is the book divided?

The book is divided into 13 chapters covering:

- Introduction and Statistical Learning:
  - Supervised Versus Unsupervised Learning
  - Regression Versus Classification Problems

**Linear statistical learning**

- Linear Regression:
  - basic concepts
  - introduction of K-nearest neighbor classifier
      
- Classification:
  - logistic regression
  - linear discriminant analysis

- Resampling Methods:
  - cross-validation
  - the bootstrap
    
- Linear Model Selection and Regularization (potential improvements over standard linear regression):
  - stepwise selection
  - ridge regression
  - principal components regression
  - the lasso

**Non-linear statistical learning**

- Moving Beyond Linearity:
  - Polynomial Regression
  - Regression Splines
  - Smoothing Splines
  - Local Regression
  - Generalized Additive Models
    
- Tree-Based Methods:
  - Decision Trees
  - Bagging, Random Forests, Boosting, and Bayesian Additive Regression Trees
    
- Support Vector Machines (linear and non-linear classification)

- Deep Learning (non-linear regression and classification)

- Survival Analysis and Censored Data

- Unsupervised Learning:
  - Principal components analysis
  - K-means clustering
  - Hierarchical clustering
    
- Multiple Testing


**Each chapter includes one self-contained Python lab on the topic.**

## Some examples of the problems addressed with statistical analysis

- Identify the risk factors for some types of cancer
- Predict whether someone will have a heart attack on the basis of demographic, diet, and clinical measurements
- Email spam detection
- Classify a tissue sample into one of several cancer classes, based on a gene expression profile
- Establish the relationship between salary and demographic variables in population survey data

([source](https://web.stanford.edu/~hastie/ISLR2/Slides/Ch1_Inroduction.pdf))


## Datasets provided in the ISLP package

The book provides the ISLP Python package, which includes all the datasets needed for the analysis.

```{r}
#| label: 01-datasets
#| echo: false
#| fig-cap: "Datasets in ISLP package"
#| out-width: 100%
knitr::include_graphics("images/01-datasets.png")
```
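
As a minimal sketch, assuming the ISLP package is installed, a dataset can be loaded by name with the `load_data` function that the ISLP labs import from the package. `shape_summary` is a hypothetical helper introduced here only to report *n* and *p*; it is not part of ISLP.

```python
def shape_summary(df):
    """Describe any object with a (rows, columns) `shape`, e.g. a DataFrame."""
    n, p = df.shape
    return f"{n} observations (n), {p} variables (p)"


try:
    # Assumes `pip install ISLP` has been run in this environment.
    from ISLP import load_data

    Wage = load_data("Wage")    # dataset loaded by name
    print(shape_summary(Wage))  # how many rows and columns it has
    print(list(Wage.columns))   # the variable names
except ImportError:
    print("ISLP is not installed; run `pip install ISLP` first.")
```

The same pattern should apply to the other datasets in the package: pass the dataset name as a string.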