Feature Selection: Statistical Method
1. Filter
Evaluate features based on divergence and correlation, set one or more thresholds, and select the features that pass.
Pro: fast, scalable, independent of the model
Con: ignores dependence between features
1.1 Variance/Skewness
For most models, variables with higher variance or skewness carry more information and are deemed more important. Thus, we can:
- calculate the variance (not the standard deviation) of each feature
- set a threshold and select all features whose variance is greater than the threshold
Calculation of Skewness: \[ SK = \frac{n\sum(x_i-\bar{x})^3}{(n-1)(n-2)\sigma^3} \] Below is a minimal sketch of variance-based feature selection in Python, using scikit-learn's VarianceThreshold (the data matrix and threshold value are arbitrary examples):
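```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: columns 0 and 3 are constant, column 1 has low variance.
X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])

# Keep only features whose variance exceeds the threshold
# (threshold=0.5 is an arbitrary example value).
selector = VarianceThreshold(threshold=0.5)
X_selected = selector.fit_transform(X)

print(selector.variances_)  # per-feature variances: ~[0, 0.22, 2.89, 0]
print(X_selected)           # only column 2 survives the threshold
```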
1.2 Correlation
- calculate the correlation between each feature and the target
- select the K features with the largest absolute correlation values
1.2.1 Pearson R
Pearson R calculates correlation based on covariance. \[ \rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y} = \frac{E[XY]-E[X]E[Y]}{\sqrt{E[X^2]-(E[X])^2} \sqrt{E[Y^2]-(E[Y])^2}} \] For a sample with n observations: \[ r = \frac{\sum_i^n(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_i^n(X_i-\bar{X})^2} \sqrt{\sum_i^n(Y_i-\bar{Y})^2}} = \frac{\sum XY - \frac{\sum X \sum Y}{n}}{\sqrt{\sum X^2 - \frac{(\sum X)^2 }{n}} \sqrt{\sum Y^2 - \frac{(\sum Y)^2 }{n}}} \] Pearson R has the following properties:
- The range of Pearson R is [-1,1], positive numbers indicate positive correlation
- Pearson R represents the linear correlation between two variables; a positive linear transformation of X or Y does not change it (a negative scaling flips its sign)
- X and Y must be numerical variables and follow a normal distribution
- The observations of X and Y are in pairs
1.2.2 Spearman
Suppose we have samples of X and Y with n observations. Rank the observations of X and Y to obtain two new sets \(a\) and \(b\), where \(a_i, b_i\) are the ranks of \(X_i, Y_i\) in X and Y.
Define the Spearman correlation as: \[ \rho = 1 - \frac{6\sum_i^n(a_i-b_i)^2}{n(n^2-1)} \] The Spearman correlation has the following properties:
- The range of Spearman correlations is [-1,1], positive numbers indicate positive correlation
- The Spearman correlation represents rank correlation (it depends only on the ordering of the values)
- X and Y do not need to follow any particular distribution. Although the sample sizes must be equal, the observations do not need to be in pairs
- The statistical power of Spearman is relatively lower than that of Pearson R
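To make the rank-based definition concrete, here is a small check (with made-up data) that the Spearman correlation equals the Pearson correlation computed on ranks:

```python
import numpy as np
from scipy import stats

x = np.array([10, 20, 30, 40, 50])
y = x ** 2            # monotone but nonlinear in x

# Spearman directly vs. Pearson applied to the ranks of x and y.
rho, _ = stats.spearmanr(x, y)
r_ranks, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
print(rho, r_ranks)   # both 1.0: the rank orderings are identical
```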
1.2.3 Kendall
Suppose we combine variables X and Y into a new element \((X,Y)\). If two elements \((X_i, Y_i)\) and \((X_j, Y_j)\) satisfy either of these two cases:
- \(X_i > X_j\) and \(Y_i >Y_j\)
- \(X_i < X_j\) and \(Y_i <Y_j\)
then we call these two elements consistent (concordant).
If \(X_i = X_j\) or \(Y_i = Y_j\), we regard these two elements as having neither consistency nor inconsistency (a tie). Otherwise, we call these two elements inconsistent (discordant).
Define the Kendall correlation as: \[ \tau = \frac{C-D}{\sqrt{N_3-N_1}\sqrt{N_3-N_2}} \] where:
C is the number of pairs of elements (two elements \((X_1,Y_1)\) and \((X_2,Y_2)\) form one pair) that are consistent
D is the number of pairs of elements that are inconsistent
\(N_1 = \sum_i^s \frac{1}{2}U_i(U_i-1)\), where:
- s is the number of values in X that appear more than once
- \(U_i\) is the number of times the \(i^{th}\) such value appears
- For example, for X={1,2,2,3,3,3,4}, s=2, \(U_1 = 2, U_2 = 3\)
\(N_2\) is calculated the same way as \(N_1\), but on Y
\(N_3 = \frac{1}{2}N(N-1)\), where N is the number of samples
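As a sanity check, the formula above can be verified against scipy.stats.kendalltau, which computes the same tie-corrected statistic (tau-b); the data here are arbitrary examples:

```python
import numpy as np
from scipy import stats
from itertools import combinations
from collections import Counter

x = np.array([1, 2, 2, 3, 3, 3, 4])
y = np.array([1, 3, 2, 4, 4, 5, 6])

# Count consistent (C) and inconsistent (D) pairs; ties contribute to neither.
C = D = 0
for i, j in combinations(range(len(x)), 2):
    s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    if s > 0:
        C += 1
    elif s < 0:
        D += 1

def tie_term(v):
    # N1/N2: sum of U*(U-1)/2 over values that appear more than once.
    return sum(u * (u - 1) / 2 for u in Counter(v).values() if u > 1)

N3 = len(x) * (len(x) - 1) / 2
N1, N2 = tie_term(x), tie_term(y)
tau_manual = (C - D) / (np.sqrt(N3 - N1) * np.sqrt(N3 - N2))
tau_scipy, _ = stats.kendalltau(x, y)
print(tau_manual, tau_scipy)  # both ~0.922: the formulas agree
```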
The Kendall correlation has similar conditions to Pearson R; the only difference is that it represents rank correlation instead of linear correlation:
- The range of Kendall's \(\tau\) is [-1,1], positive numbers indicate positive correlation
- X and Y must be numerical variables and follow a normal distribution
- The observations of X and Y are in pairs
Below is a minimal sketch of feature selection based on Pearson R in Python (the dataset and K are arbitrary example choices; scipy.stats.spearmanr or kendalltau can be substituted for rank correlations):
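```python
import numpy as np
from scipy import stats
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
K = 3  # number of features to keep (arbitrary example)

# Pearson R between each feature and the target, ranked by absolute value.
scores = np.array([stats.pearsonr(X[:, j], y)[0] for j in range(X.shape[1])])
top_k = np.argsort(np.abs(scores))[-K:]
X_selected = X[:, top_k]
print(top_k, scores[top_k])  # indices and correlations of the kept features
```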
1.3 Hypothesis Testing
We can conduct hypothesis tests on correlations to examine whether the correlation between two variables is significant, and we can drop features that are highly correlated with others.
For example, we can conduct a \(\chi^2\) test to test the correlation between discrete variables, or Fisher's Z test on the correlation coefficient to determine whether the correlation is significant.
For details of these tests, refer to this article.
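As an illustration, here is a minimal sketch of \(\chi^2\)-based selection with scikit-learn's chi2 scorer (which expects non-negative, count-like features; the dataset and K are example choices):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# chi2 tests each (non-negative) feature against the class labels;
# the k features with the largest statistics are kept (k=2 is arbitrary).
selector = SelectKBest(score_func=chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.pvalues_)   # p-value of the test for each feature
print(X_selected.shape)    # (150, 2)
```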
1.4 Mutual Information
For the theoretical background on Entropy and Mutual Information, refer to this article. \[ MI = H(x,y) - H(x|y) - H(y|x) = \sum_x\sum_y p(x,y)\log\frac{p(x,y)}{p(x)p(y)} \] Suppose we have an input variable X with n samples taking m unique values {\(x_1 = c_1, x_2 = c_1, \dots, x_n = c_m\)} and an output variable Y with n samples taking k unique values {\(y_1 = c_1, y_2 = c_1, \dots, y_n = c_k\)}.
It's easy to calculate \(P(X=c_i),P(Y=c_j),P(X=c_i,Y=c_j)\)
Thus the MI can be calculated. The greater MI X and Y share, the greater dependency there exists, and X is thus a more important feature.
The MI method has the following properties:
- The MI method is sometimes impractical with two continuous numerical variables, since there are too many unique values
- MI scores need to be mapped into a common range (usually [0,1]) so that the scores of different variables can be compared
Below is a minimal sketch of feature selection based on Mutual Information in Python, using scikit-learn's mutual_info_classif with SelectKBest (the dataset and K are arbitrary example choices):
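```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score each feature by its estimated mutual information with the target,
# then keep the K highest-scoring features (k=2 is an arbitrary example).
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)
print(selector.scores_)    # estimated MI per feature
print(X_selected.shape)    # (150, 2)
```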
2. Wrapper
Through evaluation of the model, add or drop some features at each trial to obtain a subset of all features, which becomes the selected feature set.
Pro: accurate, model-relevant
Con: time-consuming
2.1 Recursive Feature Elimination
- Train the model with all m features
- Select the k best features and set them aside (or drop the k worst features from the current set)
- Train the model with the remaining features and repeat step 2, until we reach the maximum/minimum number of features we want
- The set-aside / retained features form the final feature space we want to preserve
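A minimal sketch using scikit-learn's RFE (the dataset, estimator, step size, and number of features to keep are example choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# At each iteration, fit the estimator, drop the `step` worst features
# (by coefficient magnitude), and refit until 10 features remain.
estimator = LogisticRegression(max_iter=5000)
selector = RFE(estimator, n_features_to_select=10, step=1)
selector.fit(X, y)
print(selector.support_)   # boolean mask of retained features
print(selector.ranking_)   # rank 1 marks the selected features
```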
2.2 Step-wise Regression
Stepwise regression is a variable selection method used in multiple linear regression to identify the most significant predictors or independent variables in a dataset. It is an iterative process that involves adding or removing variables based on their statistical significance, with the goal of improving the overall model's performance. Stepwise regression can help to mitigate multicollinearity, reduce overfitting, and simplify a model for easier interpretation.
There are three primary types of stepwise regression: forward selection, backward elimination, and bidirectional elimination (also known as stepwise selection).
- Forward selection (see the sketch after this list):
- Start with an empty model (i.e., no independent variables)
- Add the most significant variable based on a predetermined significance level threshold (e.g., p-value < 0.05)
- Continue adding variables one at a time, each time selecting the variable that provides the most significant improvement to the model
- Stop when no further variables meet the significance level for inclusion
- Backward elimination:
- Start with a full model (i.e., all independent variables included)
- Remove the least significant variable based on a predetermined significance level threshold (e.g., p-value > 0.1)
- Continue removing variables one at a time, each time selecting the variable that has the least significant impact on the model
- Stop when all remaining variables meet the significance level for retention
- Bidirectional elimination (stepwise selection):
- Start with either an empty model or a full model
- At each step, consider both adding and removing variables based on predetermined significance levels for inclusion and retention
- Continue adding or removing variables iteratively until no more variables meet the criteria for inclusion or exclusion
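As a concrete illustration of forward selection, here is a minimal sketch built on statsmodels OLS p-values (the helper function, toy data, column names, and alpha threshold are arbitrary examples, not a standard library routine):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_selection(X, y, alpha=0.05):
    """Add, at each step, the predictor with the smallest p-value;
    stop when no remaining predictor is significant at `alpha`."""
    selected, remaining = [], list(X.columns)
    while remaining:
        pvals = {}
        for col in remaining:
            fit = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            pvals[col] = fit.pvalues[col]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: y depends on x1 and x2 only; x3 is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
y = 2 * X["x1"] - 1.5 * X["x2"] + rng.normal(size=200)
print(forward_selection(X, y))  # typically ['x1', 'x2']
```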
Stepwise regression has some limitations and assumptions. The procedure assumes a linear relationship between the independent and dependent variables and requires that the data meet the assumptions of linear regression (e.g., normally distributed errors, homoscedasticity, and absence of multicollinearity). Additionally, stepwise regression can be sensitive to the initial set of variables, and the final model may depend on the order in which variables are considered.
Despite these limitations, stepwise regression can be a valuable tool for identifying the most important predictors in a dataset and building a parsimonious model with improved interpretability.
3. Embedded
Some models, like Lasso, Ridge, and Random Forest, have feature-evaluation methods embedded in the model. Train such a model first and then select features based on its coefficients or importances.
Pro: fast, easy to apply
Con: ignores dependence between features
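A minimal sketch using scikit-learn's SelectFromModel with a Lasso estimator (the dataset and alpha are example choices; a Random Forest's feature_importances_ can be plugged in the same way):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# Fit a Lasso; features whose coefficients are shrunk to (near) zero are dropped.
selector = SelectFromModel(Lasso(alpha=0.1))
selector.fit(X, y)
print(selector.get_support())   # boolean mask of kept features
X_selected = selector.transform(X)
print(X_selected.shape)
```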