ml-notes
General
Hyperparameter Optimization
Boosting
Feature selection/importance
- For linear regression (i.e. after using the `lm` function in R), a good package that provides feature selection based on diverse criteria (e.g. p-value) is `SignifReg`.
- Correlation can be used in 2 ways:
- Remove one feature from each pair with an absolute pairwise correlation of 0.75 or higher, computed between the predictors of the training dataset (NO RESPONSE is used here - unsupervised so-to-speak)
- Variables that correlate highly with the response output have higher importance or significance
- Visualize correlations with the `corrplot` library (examples)
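The first (unsupervised) use can be sketched in base R - a minimal sketch on hypothetical toy predictors, with the 0.75 cutoff from the note:

```r
set.seed(42)

# Toy training predictors: x2 is nearly a copy of x1, x3 is independent
n  <- 100
x1 <- rnorm(n)
df <- data.frame(x1 = x1, x2 = x1 + rnorm(n, sd = 0.1), x3 = rnorm(n))

# Absolute pairwise correlations between predictors only (no response)
cor_mat <- abs(cor(df))
diag(cor_mat) <- 0

# Drop one feature from every pair correlated at 0.75 or higher
drop_feats <- character(0)
for (f in colnames(cor_mat)) {
  if (f %in% drop_feats) next
  partners   <- colnames(cor_mat)[cor_mat[f, ] >= 0.75]
  drop_feats <- union(drop_feats, setdiff(partners, f))
}
df_filtered <- df[, setdiff(colnames(df), drop_feats), drop = FALSE]
colnames(df_filtered)  # x2 is gone, x1 and x3 remain
```

The `caret` package offers this as `findCorrelation(cor_mat, cutoff = 0.75)`, which also picks which member of each pair to drop more carefully.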
- Use Lasso (L1) regularization or Elastic Net (L1 and L2) to remove unimportant features - larger absolute coefficient => more importance (zeroed coefficients => good riddance!)
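A minimal Lasso sketch using the `glmnet` package (not named above, but the standard L1/Elastic Net implementation in R; `alpha = 1` is pure Lasso, values between 0 and 1 give Elastic Net; the data here is simulated):

```r
library(glmnet)

set.seed(1)
n <- 200
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
y <- 2 * X[, 1] - 3 * X[, 2] + rnorm(n)  # only x1 and x2 truly matter

# alpha = 1 => Lasso; cross-validation picks the penalty lambda
cv <- cv.glmnet(X, y, alpha = 1)

# Coefficients at the sparsest lambda within 1 SE of the CV optimum
cf   <- as.matrix(coef(cv, s = "lambda.1se"))
kept <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")
kept  # zeroed coefficients have been dropped
```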
- Use the `randomForest` package
- Visualize importance: `varImpPlot()`
- Article: Tune number of Trees? - 500 is alright in general; tune the `mtry` parameter using the function `tuneRF()`!!!
- If `randomForest()` is run with `proximity=TRUE` (keep N less than 10000, depending on your RAM as well), it generates an N x N proximity (similarity) matrix (N = number of rows/data points). This can be scaled to 2D using `MDSplot()` (pass the same random forest, trained on the same data and response vector), which internally uses `stats::cmdscale()` to view the dataset (every point) in 2D (very slow).
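The `randomForest` workflow above, sketched end-to-end on the built-in `iris` dataset (N = 150, small enough for the proximity matrix):

```r
library(randomForest)

set.seed(7)
# proximity = TRUE builds the N x N similarity matrix, so keep N small
rf <- randomForest(Species ~ ., data = iris, ntree = 500, proximity = TRUE)

varImpPlot(rf)       # the importance plot mentioned above
dim(rf$proximity)    # 150 x 150: one row/column per data point

# 2D view of the proximity matrix (wraps stats::cmdscale internally)
MDSplot(rf, iris$Species, k = 2)
```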
- For faster and multi-threaded Random Forests, use the `ranger` R package
- No tuning offered for `mtry` (so do that with `randomForest` on random samples of your dataset), but everything else is better and faster!
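A minimal `ranger` sketch on `iris` (assuming `importance = "impurity"` to record per-feature importances; the thread count is arbitrary):

```r
library(ranger)

set.seed(7)
# Multi-threaded random forest with impurity-based feature importance
rf <- ranger(Species ~ ., data = iris, num.trees = 500,
             importance = "impurity", num.threads = 2)

sort(rf$variable.importance, decreasing = TRUE)
```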
- Use the `Boruta` R package and plot the importance boxplot result!
- Uses `ranger` under the hood currently, so multi-thread support for free
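A minimal `Boruta` sketch on `iris` - it labels each feature Confirmed, Tentative, or Rejected relative to shuffled "shadow" copies:

```r
library(Boruta)

set.seed(7)
b <- Boruta(Species ~ ., data = iris)
print(b)   # Confirmed / Tentative / Rejected decision per feature
plot(b)    # the importance boxplots mentioned above
```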
Correlation
- Article: Correlation Test Between Two Variables in R
- Use `ggscatter`: prints R^2 and the p-value.
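`ggscatter` comes from the `ggpubr` package; the underlying correlation test itself is base R's `cor.test()` - a quick sketch on the built-in `mtcars` data:

```r
# Pearson correlation test between two continuous variables
ct <- cor.test(mtcars$wt, mtcars$mpg, method = "pearson")
ct$estimate  # correlation coefficient (about -0.87: heavier cars, lower mpg)
ct$p.value   # very small => the correlation is significant
```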
- Use `ComplexHeatmap` to visualize a correlation matrix between two variables of interest, e.g. mRNA and protein expressions (if the data dimensions are large)
- On the correlation measure to use between two variables (continuous vs categorical). From this article I quote: “The idea behind using logistic regression to understand correlation between cont and categorical variables is actually quite straightforward and follows as such: If there is a relationship between the categorical and continuous variable, we should be able to construct an accurate predictor of the categorical variable from the continuous variable. If the resulting classifier has a high degree of fit, is accurate, sensitive, and specific we can conclude the two variables share a relationship and are indeed correlated.”
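The quoted idea can be sketched with base R's `glm()`: fit a classifier of the categorical variable from the continuous one and check how well it separates (a hypothetical two-class subset of `iris` used as a stand-in):

```r
# Two-class subset of iris: can petal length predict the species?
d   <- droplevels(subset(iris, Species != "setosa"))
fit <- glm(Species ~ Petal.Length, data = d, family = binomial)

prob <- predict(fit, type = "response")  # P(second level, i.e. virginica)
pred <- ifelse(prob > 0.5, levels(d$Species)[2], levels(d$Species)[1])
accuracy <- mean(pred == as.character(d$Species))
accuracy  # high accuracy => the two variables share a relationship
```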
Logistic Regression in R
- Application and Interpretation article
- If the response categorical variable is binary, use `stats::glm(family = binomial)`
- For multiple response classes, use Ordinal Logistic Regression only when the proportional odds assumption is true (the relationship between each pair of response groups is the same - e.g. in a survey where you answer with 3 choices, the distance between unlikely and somewhat likely may be shorter than the distance between somewhat likely and very likely). In R, use `MASS::polr` or `rms::lrm`. Otherwise, go with Multinomial Logistic Regression: `nnet::multinom`.
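Minimal sketches of the binary and multinomial cases on built-in datasets (`nnet` ships with standard R installations; the variable choices here are illustrative):

```r
# Binary response: transmission type (am) from car weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)$coefficients

# Coefficients are log-odds; exponentiate them to get odds ratios
exp(coef(fit))

# Multinomial Logistic Regression for 3+ unordered classes
library(nnet)
mfit <- multinom(Species ~ Petal.Length + Petal.Width, data = iris,
                 trace = FALSE)
mean(predict(mfit) == iris$Species)  # in-sample accuracy
```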
- Measures of goodness of fit (fit statistics - related to model validation) for binary logistic regression: IBM article
Dimensionality Reduction
- PCA (Principal Component Analysis): article
- Use the R packages `FactoMineR` (to run PCA) and `factoextra` (for visualization)
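For a dependency-free sketch of the same idea, base R's `stats::prcomp` runs the PCA itself (the `FactoMineR`/`factoextra` pair adds richer summaries and plots):

```r
# PCA on the four numeric iris columns; scale. = TRUE standardizes first
p <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- p$sdev^2 / sum(p$sdev^2)
round(var_explained, 3)  # PC1 alone explains ~73% here

# Coordinates of every observation on the first two PCs (for a 2D plot)
head(p$x[, 1:2])
```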
- MCA (Multiple Correspondence Analysis - PCA for categorical variables): article
- UMAP: a non-linear dimensionality reduction method; check the R package `uwot`
PR (Precision-Recall curves)
Hierarchical Clustering Analysis
Survival Analysis