---
pagetitle: "Feature Engineering A-Z | High Correlation Filter"
---
# High Correlation Filter {#sec-correlated-filter}
::: {style="visibility: hidden; height: 0px;"}
## High Correlation Filter
:::
We are looking to remove correlated features.
By correlated features, we typically mean features with a high Pearson correlation.
The methods in this chapter don't require Pearson correlation specifically; any correlation measure will do.
Next, we will look at how we can remove the offending variables.
We will show an **iterative approach** and a **clustering approach**.
Both are learned methods, as they identify which variables should be kept.
I always find it useful to look at a correlation matrix first.
```{r}
#| label: set-seed
#| echo: false
#| message: false
library(tidymodels)
library(corrr)
set.seed(1234)
```
```{r}
#| label: fig-correlation-matrix
#| echo: false
#| message: false
#| fig-cap: |
#| Some clusters of correlated features.
#| fig-alt: |
#| Correlation chart.
ames |>
select(where(is.numeric)) |>
correlate(method = "pearson", quiet = TRUE) |>
autoplot(method = "identity")
```
Looking at the above chart, we see some correlated features.
One way to perform our filtering is to find all the correlated pairs over a certain threshold and remove one variable from each pair.
Below is a table of the 10 most correlated pairs.
```{r}
#| label: tbl-correlation-values
#| tbl-cap: Some predictors appear multiple times in this table.
#| echo: false
ames |>
select(where(is.numeric)) |>
correlate(method = "pearson", quiet = TRUE) |>
shave() |>
stretch() |>
arrange(desc(abs(r))) |>
slice(1:10) |>
mutate(r = round(r, 3)) |>
knitr::kable()
```
One way to do the filtering is to pick a threshold and repeatedly remove one variable from the most correlated pair until no pair has a correlation above the threshold.
This method has a minimal computational footprint, as it only needs to calculate the correlations once at the beginning.
The threshold will likely need to be tuned, as we can't say for sure what a good threshold is.
With the removal of variables, there is always a chance that we are removing signal rather than noise, and this becomes increasingly likely as we remove more and more predictors.
::: callout-caution
# TODO
rewrite this as an algorithm
:::
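To make the procedure concrete, here is a minimal sketch of the iterative filter in base R. This is not how `step_corr()` is implemented; the tie-breaking rule used here, dropping the variable of the pair with the higher mean absolute correlation, is one common choice and an assumption for illustration.

```{r}
#| label: iterative-filter-sketch
correlation_filter <- function(data, threshold = 0.75) {
  # Absolute correlation matrix, ignoring the diagonal
  cors <- abs(cor(data, use = "pairwise.complete.obs"))
  diag(cors) <- 0

  removed <- character(0)
  while (max(cors) > threshold) {
    # Locate the most correlated remaining pair
    pair <- which(cors == max(cors), arr.ind = TRUE)[1, ]
    # Drop the member of the pair with the higher mean absolute correlation
    to_drop <- pair[which.max(rowMeans(cors)[pair])]
    removed <- c(removed, colnames(cors)[to_drop])
    cors <- cors[-to_drop, -to_drop, drop = FALSE]
  }
  removed
}

ames |>
  select(where(is.numeric)) |>
  correlation_filter(threshold = 0.75)
```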
If we look at the above table, we notice that some of the variables occur together.
One such example is `Garage_Cars`, `Garage_Area`, and `Sale_Price`.
These 3 variables are highly correlated with each other, and it would be neat if we could deal with them at the same time.
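As a quick way to see this, we can look at the pairwise correlations among just these three variables.

```{r}
#| label: garage-correlations
ames |>
  select(Garage_Cars, Garage_Area, Sale_Price) |>
  correlate(method = "pearson", quiet = TRUE)
```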
::: callout-caution
# TODO
find a different example so `Sale_Price` isn't part of this as it is usually the outcome.
:::
::: callout-caution
# TODO
Add a good graph showing this effect.
:::
What we could do is take the correlation matrix and apply a clustering model to it.
We then use the clustering model to lump together the groups of highly correlated predictors.
Within each cluster, one predictor is chosen to be kept.
The clusters should ideally be chosen such that uncorrelated predictors end up alone in their own cluster.
This method can better capture the global structure of the data, but it requires fitting and tuning another model.
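Below is a minimal sketch of this idea, assuming hierarchical clustering with `1 - |correlation|` as the distance. The cutting height (`h = 0.25`, roughly corresponding to a within-cluster correlation of 0.75 under complete linkage), the linkage method, and keeping the first variable in each cluster are all assumptions that would need tuning in practice.

```{r}
#| label: cluster-filter-sketch
# Absolute correlation matrix of the numeric columns
cors <- ames |>
  select(where(is.numeric)) |>
  cor(use = "pairwise.complete.obs") |>
  abs()

# Turn similarity into a distance and cluster hierarchically
clusters <- hclust(as.dist(1 - cors)) |>
  cutree(h = 0.25)

# Keep one representative predictor from each cluster
keep <- names(clusters)[!duplicated(clusters)]
keep
```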
## Pros and Cons
### Pros
- Computationally simple and fast
- Easily explainable. "Predictors were removed"
- Will lead to a faster and simpler model
### Cons
- Can be hard to justify. "Why was this predictor kept instead of this one?"
- Will lead to some loss of signal and performance, with the hope that this loss is minimal
## R Examples
We will use the `ames` data set from {modeldata} in this example.
The {recipes} step `step_corr()` performs the simple correlation filter described at the beginning of this chapter.
```{r}
#| label: step_corr
library(recipes)
library(modeldata)
corr_rec <- recipe(Sale_Price ~ ., data = ames) |>
step_corr(all_numeric_predictors(), threshold = 0.75) |>
prep()
```
We can see which variables were removed with the `tidy()` method.
```{r}
#| label: tidy
corr_rec |>
tidy(1)
```
We can see that when we lower the threshold to an extreme value, more predictors are removed.
```{r}
#| label: tidy-threshold
recipe(Sale_Price ~ ., data = ames) |>
step_corr(all_numeric_predictors(), threshold = 0.25) |>
prep() |>
tidy(1)
```
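If we want the filtered data itself rather than the list of removed variables, we can `bake()` the prepped recipe.

```{r}
#| label: bake-corr
corr_rec |>
  bake(new_data = NULL) |>
  ncol()
```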
## Python Examples
I'm not aware of a good way to do this with scikit-learn.
Please file an [issue on github](https://github.com/EmilHvitfeldt/feature-engineering-az/issues/new?assignees=&labels=&projects=&template=general-issue.md&title) if you know of a good way.