ml-notes
General
Hyperparameter Optimization
Boosting
Feature selection/importance
- For linear regression (i.e. after using the `lm` function in R), a good package that provides feature selection based on diverse criteria (e.g. p-value) is `SignifReg`.
- Correlation can be used in 2 ways:
- Remove one feature from each pair with an absolute pairwise correlation of 0.75 or higher, computed between the predictors of the training dataset (NO RESPONSE is used here - unsupervised so-to-speak)
- Variables that correlate highly with the response output have higher importance or significance
- Visualize correlations with the `corrplot` library (examples)
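The first (unsupervised) use can be sketched in base R - a minimal sketch on hypothetical toy predictors, with the 0.75 cutoff from the note:

```r
set.seed(42)

# Toy training predictors: x2 is nearly a copy of x1, x3 is independent
n  <- 100
x1 <- rnorm(n)
df <- data.frame(x1 = x1, x2 = x1 + rnorm(n, sd = 0.1), x3 = rnorm(n))

# Absolute pairwise correlations between predictors only (no response)
cor_mat <- abs(cor(df))
diag(cor_mat) <- 0

# Drop one feature from every pair correlated at 0.75 or higher
drop_feats <- character(0)
for (f in colnames(cor_mat)) {
  if (f %in% drop_feats) next
  partners   <- colnames(cor_mat)[cor_mat[f, ] >= 0.75]
  drop_feats <- union(drop_feats, setdiff(partners, f))
}
df_filtered <- df[, setdiff(colnames(df), drop_feats), drop = FALSE]
colnames(df_filtered)  # x2 is gone, x1 and x3 remain
```

The `caret` package offers this as `findCorrelation(cor_mat, cutoff = 0.75)`, which also picks which member of each pair to drop more carefully.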
- Use Lasso (L1) regularization or Elastic Net (L1 and L2) to remove unimportant features - larger absolute coefficient => more importance (zeroed coefficients => good riddance!)
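A minimal Lasso sketch using the `glmnet` package (not named above, but the standard L1/Elastic Net implementation in R; `alpha = 1` is pure Lasso, values between 0 and 1 give Elastic Net; the data here is simulated):

```r
library(glmnet)

set.seed(1)
n <- 200
X <- matrix(rnorm(n * 5), n, 5, dimnames = list(NULL, paste0("x", 1:5)))
y <- 2 * X[, 1] - 3 * X[, 2] + rnorm(n)  # only x1 and x2 truly matter

# alpha = 1 => Lasso; cross-validation picks the penalty lambda
cv <- cv.glmnet(X, y, alpha = 1)

# Coefficients at the sparsest lambda within 1 SE of the CV optimum
cf   <- as.matrix(coef(cv, s = "lambda.1se"))
kept <- setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")
kept  # zeroed coefficients have been dropped
```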
- Use the `randomForest` package
- Visualize importance: `varImpPlot()`
- Article: Tune number of Trees? - 500 is alright in general; tune the `mtry` parameter using the function `tuneRF()`!!!
- If `randomForest()` is run with `proximity=TRUE` (keep N less than 10000, depending on your RAM as well), it generates an N x N proximity (similarity) matrix (N = number of rows/data points). This can be scaled to 2D using `MDSplot()` (pass the same random forest, trained on the same data and response vector), which internally uses `stats::cmdscale()` to view the dataset (every point) in 2D (very slow).
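The `randomForest` workflow above, sketched end-to-end on the built-in `iris` dataset (N = 150, small enough for the proximity matrix):

```r
library(randomForest)

set.seed(7)
# proximity = TRUE builds the N x N similarity matrix, so keep N small
rf <- randomForest(Species ~ ., data = iris, ntree = 500, proximity = TRUE)

varImpPlot(rf)       # the importance plot mentioned above
dim(rf$proximity)    # 150 x 150: one row/column per data point

# 2D view of the proximity matrix (wraps stats::cmdscale internally)
MDSplot(rf, iris$Species, k = 2)
```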
- For faster and multi-threaded Random Forests, use the `ranger` R package
- No tuning offered for `mtry` (so do that with `randomForest` on random samples of your dataset), but everything else is better and faster!
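A minimal `ranger` sketch on `iris` (assuming `importance = "impurity"` to record per-feature importances; the thread count is arbitrary):

```r
library(ranger)

set.seed(7)
# Multi-threaded random forest with impurity-based feature importance
rf <- ranger(Species ~ ., data = iris, num.trees = 500,
             importance = "impurity", num.threads = 2)

sort(rf$variable.importance, decreasing = TRUE)
```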
- Use the `Boruta` R package and plot the importance boxplot result!
- Uses `ranger` under the hood currently, so multi-thread support for free
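A minimal `Boruta` sketch on `iris` - it labels each feature Confirmed, Tentative, or Rejected relative to shuffled "shadow" copies:

```r
library(Boruta)

set.seed(7)
b <- Boruta(Species ~ ., data = iris)
print(b)   # Confirmed / Tentative / Rejected decision per feature
plot(b)    # the importance boxplots mentioned above
```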
Correlation
- Article: Correlation Test Between Two Variables in R
- Use `ggscatter`: prints R^2 and the p-value.
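`ggscatter` comes from the `ggpubr` package; the underlying correlation test itself is base R's `cor.test()` - a quick sketch on the built-in `mtcars` data:

```r
# Pearson correlation test between two continuous variables
ct <- cor.test(mtcars$wt, mtcars$mpg, method = "pearson")
ct$estimate  # correlation coefficient (about -0.87: heavier cars, lower mpg)
ct$p.value   # very small => the correlation is significant
```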
- Use `ComplexHeatmap` to visualize a correlation matrix between two variables of interest, e.g. mRNA and protein expressions (if the data dimensions are large)
- On the correlation measure to use between two variables (continuous vs categorical). From this article I quote: “The idea behind using logistic regression to understand correlation between cont and categorical variables is actually quite straightforward and follows as such: If there is a relationship between the categorical and continuous variable, we should be able to construct an accurate predictor of the categorical variable from the continuous variable. If the resulting classifier has a high degree of fit, is accurate, sensitive, and specific we can conclude the two variables share a relationship and are indeed correlated.”
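The quoted idea can be sketched with base R's `glm()`: fit a classifier of the categorical variable from the continuous one and check how well it separates (a hypothetical two-class subset of `iris` used as a stand-in):

```r
# Two-class subset of iris: can petal length predict the species?
d   <- droplevels(subset(iris, Species != "setosa"))
fit <- glm(Species ~ Petal.Length, data = d, family = binomial)

prob <- predict(fit, type = "response")  # P(second level, i.e. virginica)
pred <- ifelse(prob > 0.5, levels(d$Species)[2], levels(d$Species)[1])
accuracy <- mean(pred == as.character(d$Species))
accuracy  # high accuracy => the two variables share a relationship
```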
Logistic Regression in R
- Application and Interpretation article
- If the response categorical variable is binary, use `stats::glm(family = binomial)`
- For multiple response classes, use Ordinal Logistic Regression only when the proportional odds assumption is true (the relationship between each pair of response groups is the same - e.g. in a survey where you answer with 3 choices, the distance between unlikely and somewhat likely may be shorter than the distance between somewhat likely and very likely). In R, use `MASS::polr` or `rms::lrm`. Otherwise, go with Multinomial Logistic Regression: `nnet::multinom`.
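Minimal sketches of the binary and multinomial cases on built-in datasets (`nnet` ships with standard R installations; the variable choices here are illustrative):

```r
# Binary response: transmission type (am) from car weight
fit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(fit)$coefficients

# Coefficients are log-odds; exponentiate them to get odds ratios
exp(coef(fit))

# Multinomial Logistic Regression for 3+ unordered classes
library(nnet)
mfit <- multinom(Species ~ Petal.Length + Petal.Width, data = iris,
                 trace = FALSE)
mean(predict(mfit) == iris$Species)  # in-sample accuracy
```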
- Measures of goodness of fit (fit statistics - related to model validation) for binary logistic regression: IBM article
Dimensionality Reduction
- PCA (Principal Component Analysis): article
- Use the R packages `FactoMineR` (to run PCA) and `factoextra` (for visualization)
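For a dependency-free sketch of the same idea, base R's `stats::prcomp` runs the PCA itself (the `FactoMineR`/`factoextra` pair adds richer summaries and plots):

```r
# PCA on the four numeric iris columns; scale. = TRUE standardizes first
p <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- p$sdev^2 / sum(p$sdev^2)
round(var_explained, 3)  # PC1 alone explains ~73% here

# Coordinates of every observation on the first two PCs (for a 2D plot)
head(p$x[, 1:2])
```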
- MCA (Multiple Correspondence Analysis - PCA for categorical variables): article
- UMAP: a non-linear dimensionality reduction method; check the R package `uwot`
PR (Precision-Recall curves)
Hierarchical Clustering Analysis
Survival Analysis