Introduction to Machine Learning with R
R has emerged over the last couple of decades as a first-class tool for scientific computing, and has been a consistent leader in implementing statistical methodologies for analyzing data. The usefulness of R for data science stems from its large, active, and growing ecosystem of third-party packages.
First day:
R and RStudio
Reading and importing data, exploratory analysis and visualization
Day two: Regression
Day three: Classification
Hands-on exercise
To download R, go to CRAN, the Comprehensive R Archive Network.
CRAN is composed of a set of mirror servers distributed around the world and is used to distribute R and R packages.
Pick a mirror, or use the cloud mirror: https://cloud.r-project.org
RStudio is an integrated development environment, or IDE, for R programming. Download and install it from http://www.rstudio.com/download.
RStudio is updated a couple of times a year. When a new version is available, RStudio will let you know.
When you start RStudio, you’ll see two key regions in the interface:
Source: R for Data Science
R packages are collections of functions and data sets developed by the community. They increase the power of R by improving existing base R functionalities, or by adding new ones.
- Over 10,000 packages published on CRAN
The tidyverse is a collection of packages that can easily be installed with a single “meta”-package. They share a high-level design philosophy and low-level grammar and data structures, so that learning one package makes it easier to learn the next.
A script is simply a text file containing a set of commands and comments.
The script can be saved and used later to re-execute the saved commands.
The script can also be edited so you can execute a modified version of the commands.
Script example
Illustration credit: https://vas3k.com/blog/machine_learning/
How are they similar? Different?
the “two cultures” (Breiman, 2001)
model first vs. data first
inference vs. prediction
A predictive model is used for tasks that involve the prediction of a given output using other variables (or features) in the data set.
Or, as stated by Kuhn and Johnson (2013, 26:2), predictive modeling is “…the process of developing a mathematical tool or model that generates an accurate prediction.”
The learning algorithm in a predictive model attempts to discover and model the relationship between the variable being predicted (the target) and the other predictor variables.
using home attributes to predict the sales price;
using employee attributes to predict the likelihood of attrition;
using patient attributes and symptoms to predict the risk of readmission;
using production attributes to predict time to market.
In essence, these tasks all seek to learn from data. To address each scenario, we can use a given set of variables to train an algorithm and extract insights.
The “supervision” refers to the fact that the target values play a supervisory role: they indicate to the learner the task it needs to learn.
Specifically, given a set of data, the learning algorithm attempts to optimize a function to find the combination of variable values that results in a predicted value as close to the actual target output as possible.
Most supervised learning problems can be bucketed into one of two categories, regression or classification.
When the objective of our supervised learning is to predict a numeric outcome, we refer to this as a regression problem.
When the objective of our supervised learning is to predict a categorical outcome, we refer to this as a classification problem.
Classification problems most commonly revolve around predicting a binary or multinomial response measure such as:
Did a customer click on our online ad (coded as yes/no or 1/0)?
Commute choice: car, bike or bus.
Predict a numeric outcome
For machine learning, we typically split data into training and test sets:
🚫 Do not use the test set during training.
`set.seed()`
To create that split of the data, R generates “pseudo-random” numbers: while they are made to behave like random numbers, their generation is deterministic given a “seed”.
This allows us to reproduce results by setting that seed.
Which seed you pick doesn’t matter, as long as you don’t try a bunch of seeds and pick the one that gives you the best performance.
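As a minimal sketch using base R and the built-in iris data set, a reproducible 80/20 split looks like this:

```r
set.seed(42)                                   # any fixed seed makes the split reproducible

n         <- nrow(iris)
train_idx <- sample(n, size = floor(0.8 * n))  # 80% of row indices, chosen at random

train_set <- iris[train_idx, ]                 # used to fit the model
test_set  <- iris[-train_idx, ]                # held out; never used during training
```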
| REGRESSION | CLASSIFICATION |
|---|---|
https://h2o.ai/wiki/confusion-matrix/
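A minimal sketch of a confusion matrix with base R (the `truth` and `pred` vectors are hypothetical examples):

```r
# Hypothetical actual classes and model predictions
truth <- factor(c("yes", "yes", "no", "no",  "yes", "no"))
pred  <- factor(c("yes", "no",  "no", "yes", "yes", "no"))

# Cross-tabulate predictions against actual values
table(Predicted = pred, Actual = truth)
```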
To make an ROC (receiver operating characteristic) curve, we:
calculate the sensitivity (true positive rate) and specificity (true negative rate) for all possible thresholds
plot the false positive rate (1 − specificity, x-axis) versus the true positive rate (sensitivity, y-axis)
. . .
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
We can use the area under the ROC curve as a classification metric:
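A minimal sketch computing an ROC curve and its AUC with the pROC package (the example vectors are hypothetical):

```r
library(pROC)

truth  <- c(0, 0, 1, 1, 1, 0, 1, 0)                    # actual classes
scores <- c(0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.45)  # predicted probabilities

roc_obj <- roc(response = truth, predictor = scores)
plot(roc_obj)    # draws the ROC curve
auc(roc_obj)     # area under the curve
```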
Overfitting is the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit additional data or predict future observations reliably.
Using a validation set is just another type of resampling: it holds out part of the training data to estimate model performance without touching the test set.
Training a machine learning model using the `caret` package is deceptively simple. We simply use the `train()` function, specify our outcome variable and data set, and specify the model we would like to apply via the argument `method = ...`.
However, there are over 200 available models in the `caret` package.
Each model has a different composition and its own settings that must be specified, which we refer to as tuning parameters.
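A minimal `caret` sketch under those conventions, using the built-in iris data and the "rpart" method (`tuneLength` is one way to let caret explore a model's tuning parameters):

```r
library(caret)

set.seed(123)
idx       <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_set <- iris[idx, ]
test_set  <- iris[-idx, ]

fit <- train(Species ~ .,          # outcome ~ predictors
             data       = train_set,
             method     = "rpart", # one of the 200+ available models
             tuneLength = 5)       # evaluate 5 values of the tuning parameter

preds <- predict(fit, newdata = test_set)
confusionMatrix(preds, test_set$Species)
```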
The simplest type of tree model is a Decision Tree model. The easiest way to describe a decision tree model is probably to show one.
https://www.kaggle.com/code/akashchola/decision-tree-for-classification-regression
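A minimal sketch fitting and drawing one with the rpart and rpart.plot packages (built-in iris data):

```r
library(rpart)
library(rpart.plot)

# Fit a classification tree
tree <- rpart(Species ~ ., data = iris, method = "class")

# Draw the fitted tree
rpart.plot(tree)
```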
A regression tree is like a decision tree, except that the outcome is numeric: each leaf predicts a numeric value (typically the mean of the training observations in that leaf), and the fitting process optimizes a numeric error criterion rather than class purity.
https://www.sciencedirect.com/science/article/abs/pii/S2210670722004991
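A minimal regression tree sketch with rpart, predicting the numeric `mpg` variable in the built-in mtcars data:

```r
library(rpart)

# method = "anova" requests a regression (numeric-outcome) tree
reg_tree <- rpart(mpg ~ ., data = mtcars, method = "anova")

# Predictions are numeric values, not classes
predict(reg_tree, newdata = mtcars[1:3, ])
```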
Random forest models combine many decision trees, each grown on a random sample of the data, to achieve better results than any single decision tree could offer.
https://blog.toadworld.com/2018/08/31/random-forest-machine-learning-in-r-python-and-sql-part-1
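A minimal sketch with the randomForest package (built-in iris data):

```r
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)  # combine 500 trees
rf   # prints the out-of-bag error estimate and a confusion matrix
```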
Nearest-neighbour models are another class of non-linear models: they assess distances between observations and group nearby observations together, a bit like k-means clustering.
https://www.geeksforgeeks.org/k-nn-classifier-in-r-programming/
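A minimal k-nearest-neighbours sketch with `class::knn` (built-in iris data):

```r
library(class)

set.seed(42)
idx <- sample(nrow(iris), floor(0.8 * nrow(iris)))

train <- iris[idx, 1:4]       # predictor columns only
test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]    # known classes for the training rows

pred <- knn(train, test, cl, k = 5)   # classify by the 5 nearest neighbours
table(Predicted = pred, Actual = iris$Species[-idx])
```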
A linear discriminant analysis (LDA) model is a type of linear model that uses Bayes’ theorem to classify new observations based on characteristics of the outcome variable classes.
https://www.sciencedirect.com/topics/computer-science/linear-discriminant
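A minimal LDA sketch with `MASS::lda` (built-in iris data):

```r
library(MASS)

lda_fit <- lda(Species ~ ., data = iris)

# predict() returns the predicted class and the posterior probabilities
pred <- predict(lda_fit, newdata = iris)
table(Predicted = pred$class, Actual = iris$Species)
```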
A support vector machine (SVM) is a type of non-linear model that operates similarly to LDA models, with a focus on clearly separating outcome variable classes.
By Original: Alisneaky Vector: Zirguezi - Own work based on: Kernel Machine.png, CC BY-SA 4.0
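A minimal SVM sketch with the e1071 package (built-in iris data; the radial kernel gives a non-linear decision boundary):

```r
library(e1071)

svm_fit <- svm(Species ~ ., data = iris, kernel = "radial")
pred    <- predict(svm_fit, iris)
table(Predicted = pred, Actual = iris$Species)
```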
Linear regression is a linear approach for modelling the relationship between a quantitative response and one or more explanatory variables. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression.
https://rpubs.com/cardiomoon/474707
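A minimal multiple linear regression sketch with base R's `lm()`, modelling `mpg` in the built-in mtcars data by weight and horsepower:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # two explanatory variables

summary(fit)   # coefficients, R-squared, p-values

# Predict for a new observation
predict(fit, newdata = data.frame(wt = 3, hp = 120))
```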
https://workshops.tidymodels.org/
https://www.tidymodels.org/learn/
This material is made with Quarto.