-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy path01_notes.qmd
172 lines (115 loc) · 6 KB
/
01_notes.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
# Notes {-}
## What is statistical learning?
[**Statistical learning**](https://en.wikipedia.org/wiki/Statistical_learning_theory) is the theoretical foundation for machine learning framework. It makes connections between the fields of statistics, linear algebra and functional analysis.
In particular, the **Statistical learning theory** deals with the problem of finding a predictive function based on data and this is what is best known as **supervised learning**, in this book we will see more than just theory, as we will deal with **unsupervised learning** as well as making practical applications.
- **Supervised:** "Building a model to **predict an *output* from *inputs*.**"
- Predict `wage` from `age`, `education`, and `year`.
- Predict `market direction` from `previous days' performance`.
- **Unsupervised:** Inputs but no specific outputs, **find relationships and structure.**
- Identify `clusters` within `cancer cell lines`.
## Why ISLP?
- "Facilitate the transition of statistical learning from an academic to a mainstream field."
- Machine learning* is useful to everyone, let's all learn enough to use it responsibly.
- Python "labs" make this make sense for this community!
## Premises of ISLP
From Page 9 of the Introduction:
- "Many statistical learning methods are relevant and useful in a wide range of academic and non-academic disciplines, beyond just the statistical sciences."
- "Statistical learning should not be viewed as a series of black boxes."
- "While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box!"
- "We presume that the reader is interested in applying statistical learning methods to real-world problems."
## Notation
- *n* = number of observations (rows)
- *p* = number of features/variables (columns)
- We'll come back here if we need to as we go!
- Some symbols they assume we know:
- $\in$ = "is an element of", "in"
- ${\rm I\!R}$ = "real numbers"
## What have we gotten ourselves into?
**An Introduction to Statistical Learning (ISL by James, Witten, Hastie and Tibshiraniis)**, is a collection of modern statistical methods for modeling and making predictions from real-world data.
It is a middle way between theoretical statistics and the practice of applying statistics to real-world problems.
It can be considered as a user manual, with self-contained Python labs, which lead you through the use of different methods for applying statistical analysis to different kinds of data.
- 2: Terminology & main concepts
- 3-4: Classic linear methods
- 5: Resampling (so we can choose the best method)
- 6: Modern updates to linear methods
- 7+: Beyond Linearity (we can worry about details as we get there)
## Where's the data?
```python
pip install ISLP
```
We'll look at this data in more detail below.
## Some useful resources:
- the book page: [statlearning.com](https://www.statlearning.com/)
- pdf of the book: [ISLRv2_website](https://hastie.su.domains/ISLP/ISLP_website.pdf)
- ISLP [labs](https://github.com/intro-stat-learning/ISLP_labs/tree/v2)
- course on edX: [statistical-learning](https://www.edx.org/course/statistical-learning)
- youtube channel: [playlists](https://www.youtube.com/channel/UCB2p-jaoolkv0h22m4I9l9Q/playlists)
- exercise solutions: [applied solutions](https://botlnec.github.io/islp/)
**Some more theoretical resources:**
- The Elements of Statistical Learning (ESL, by Hastie, Tibshirani, and Friedman) [ESLII](https://web.stanford.edu/~hastie/Papers/ESLII.pdf)
## What is covered in the book?
The book provides a series of toolkits classified as **supervised or unsupervised** techniques for understanding data.
The second edition of the book (2021) contains additions within the most updated statistical analysis.
```{r}
#| label: 01-contents
#| echo: false
#| fig-cap: Editions
#| out-width: 100%
knitr::include_graphics("images/01-contents.png")
```
## How is the book divided?
The book is divided into 13 chapters covering:
- Introduction and Statistical Learning:
- Supervised Versus Unsupervised Learning
- Regression Versus Classification Problems
**Linear statistical learning**
- Linear Regression:
- basic concepts
- introduction of K-nearest neighbor classifier
- Classification:
- logistic regression
- linear discriminant analysis
- Resampling Methods:
- cross-validation
- the bootstrap
- Linear Model Selection and Regularization:
potential improvements over standard linear regression
- stepwise selection
- ridge regression
- principal components regression
- the lasso
**Non-linear statistical learning**
- Moving Beyond Linearity:
- Polynomial Regression
- Regression Spline
- Smoothing Splines
- Local Regression
- Generalized Additive Models
- Tree-Based Methods:
- Decision Trees
- Bagging, Random Forests, Boosting, and Bayesian Additive Regression Trees
- Support Vector Machines (linear and non-linear classification)
- Deep Learning (non-linear regression and classification)
- Survival Analysis and Censored Data
- Unsupervised Learning:
- Principal components analysis
- K-means clustering
- Hierarchical clustering
- Multiple Testing
**Each chapter includes 1 self-contained R lab on the topic**
## Some examples of the problems addressed with statistical analysis
- Identify the risk factors for some type of cancers
- Predict whether someone will have a hearth attack on the basis of demographic, diet, and clinical measurements
- Email spam detection
- Classify a tissue sample into one of several cancer classes, based on a gene expression profile
- Establish the relationship between salary and demographic variables in population survey data
([source](https://web.stanford.edu/~hastie/ISLR2/Slides/Ch1_Inroduction.pdf))
## Datasets provided in the ISLP package
The book provides the ISLP Python package with all the datasets needed the analysis.
```{r}
#| label: 01-datasets
#| echo: false
#| fig-cap: "Datasets in ISLP package"
#| out-width: 100%
knitr::include_graphics("images/01-datasets.png")
```