Machine learning: classification and regression
Overview

Questions:
- What are classification and regression techniques?
- How can they be used for prediction?
- How can visualizations be used to analyze predictions?

Objectives:
- Explain the types of supervised machine learning: classification and regression.
- Learn how to make predictions using the training and test datasets.
- Visualize the predictions.

Time estimation: 1 hour
Last modification: Oct 18, 2022
Introduction
Machine learning is a subset of artificial intelligence (AI) that gives machines the ability to learn from data automatically, without being explicitly programmed. It combines computer science, mathematics and statistics to create a predictive model by learning patterns in a dataset. A dataset may have an output field, which makes the learning process supervised. Supervised learning methods use outputs (also called targets, classes or categories) defined in a column of the dataset. These targets can be either integers or real (continuous) numbers. When the targets are integers, the learning task is known as classification. Each row in the dataset is a sample, and classification assigns a class label/target to each sample. The algorithm used for this learning task is called a classifier. When the targets are real numbers, the learning task is called regression and the algorithm used for this task is called a regressor. We will go through classification first and look at regression later in this tutorial.
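The Galaxy machine learning tools used in this tutorial are based on scikit-learn, so this distinction can be illustrated with a minimal toy sketch (the data below is invented purely for illustration and is not part of the tutorial's workflow): integer targets call for a classifier, real-valued targets for a regressor.

```python
# Toy sketch (not part of the Galaxy workflow): the same features with integer
# targets form a classification problem, with real-valued targets a regression problem.
import numpy as np
from sklearn.svm import LinearSVC                  # classifier: integer class labels
from sklearn.linear_model import LinearRegression  # regressor: real-valued targets

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])  # one row per sample

y_class = np.array([0, 1, 0, 1])          # integer targets    -> classification
y_real = np.array([0.5, 1.7, 2.4, 3.9])   # continuous targets -> regression

clf = LinearSVC().fit(X, y_class)         # learns a decision boundary
reg = LinearRegression().fit(X, y_real)   # fits a curve (here, a line)

print(clf.predict([[2.5, 2.5]]))          # predicted class label (0 or 1)
print(reg.predict([[2.5, 2.5]]))          # predicted real-valued target
```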
Question
What are features and outputs/targets in a dataset?

The features are the columns that describe each sample, and the output/target is the column containing the class or value to be predicted. For example, in a breast cancer dataset the features include clump thickness, cell size, cell shape and so on, and the target states whether the sample is benign or malignant.
Agenda

In this tutorial, we will deal with:
Classification
A classification task assigns a category/class to each sample by learning a decision boundary in a dataset. This dataset is called a training dataset and contains the samples and the desired class/category for each sample. The training dataset contains "features" as columns, and a mapping between these features and the target is learned for each sample. The performance of this mapping is evaluated using a test dataset (which is separate from the training dataset). The test dataset contains only the feature columns, not the target column; the target column is predicted using the mapping learned on the training dataset. In this tutorial, we will use a classifier to train a model on a training dataset, predict the targets of the test dataset and visualize the results using plots.
Comment
The terms 'targets', 'classes', 'categories' and 'labels' are used interchangeably in the classification part of this tutorial; they mean the same thing. For regression, we will just use 'targets'.
In figure 2, the line is a boundary which separates one class from another (for example tumor from no tumor). The task of a classifier is to learn this boundary, which can then be used to classify or categorize an unseen/new sample. The line is the decision boundary. There are different ways to learn this decision boundary. If the dataset is linearly separable, linear classifiers can produce good classification results. But when the dataset is complex and requires non-linear decision boundaries, more powerful classifiers like support vector machine, tree or ensemble based classifiers may prove to be beneficial. In the following part, we will perform classification on the breast cancer dataset using a linear classifier and then analyze the results with plots. Let's begin by uploading the necessary datasets.
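To make the difference between linear and non-linear decision boundaries concrete, here is a small scikit-learn sketch on a synthetic dataset. This is an illustration outside Galaxy; the toy dataset and the two classifiers below are assumptions for demonstration, not part of this tutorial's workflow.

```python
# Compare a linear classifier with a non-linear (RBF kernel) SVM on a toy dataset
# whose two classes are not separable by a straight line.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC, LinearSVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_clf = LinearSVC().fit(X_train, y_train)      # straight decision boundary
rbf_clf = SVC(kernel="rbf").fit(X_train, y_train)   # non-linear decision boundary

print("linear SVM accuracy:", linear_clf.score(X_test, y_test))
print("RBF SVM accuracy:   ", rbf_clf.score(X_test, y_test))
```

On such data, the non-linear classifier usually achieves higher accuracy because the true boundary between the classes is curved.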
Data upload
The datasets used for classification contain 9 features. Each feature carries some information about breast cancer, such as the thickness of clump, cell size, cell shape and so on. More information about the dataset can be found here - a and b. In addition to these features, the training dataset contains one more column, the target. It has a binary value (0 or 1) for each row: 0 indicates no breast cancer (benign) and 1 indicates breast cancer (malignant). The test dataset does not contain the target column, which is what a classifier should predict. The third dataset contains all the samples from the test dataset, this time including the target column, which is needed to compare the real and predicted targets.
Hands-on: Data upload
- Create a new history for this tutorial.

  Click the new-history icon at the top of the history panel.
  If the new-history icon is missing:
  - Click on the galaxy-gear icon (History options) at the top of the history panel
  - Select the option Create New from the menu

- Import the following datasets and choose the type of data as `tabular`.

  https://zenodo.org/record/3248907/files/breast-w_targets.tsv
  https://zenodo.org/record/3248907/files/breast-w_test.tsv
  https://zenodo.org/record/3248907/files/breast-w_train.tsv

  - Copy the link location
  - Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)
  - Select Paste/Fetch Data
  - Paste the link into the text field
  - Press Start
  - Close the window

- Rename the datasets to `breast-w_train`, `breast-w_test` and `breast-w_targets`.

  - Click on the galaxy-pencil (pencil) icon of a dataset to edit its attributes
  - In the central panel, change the Name field
  - Click the Save button
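If you also want to inspect these files outside Galaxy, a small pandas sketch like the following can be used. The header row and the `target` column are assumptions based on the dataset description above.

```python
# Optional: inspect the same files outside Galaxy with pandas.
import pandas as pd

train = pd.read_csv("https://zenodo.org/record/3248907/files/breast-w_train.tsv", sep="\t")
test = pd.read_csv("https://zenodo.org/record/3248907/files/breast-w_test.tsv", sep="\t")
targets = pd.read_csv("https://zenodo.org/record/3248907/files/breast-w_targets.tsv", sep="\t")

print(train.shape, test.shape, targets.shape)  # rows and columns of each file
print(train.columns.tolist())                  # 9 feature columns plus 'target' in the training file
print(train["target"].value_counts())          # 0 = benign, 1 = malignant
```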
Learn using training dataset
The training dataset is used for learning the associations between features and targets. The classifier learns general patterns in the dataset and saves a trained model. This model can then be used to classify a new sample. In this step, we will use `breast-w_train` as the training dataset and apply an SVM (support vector machine) classifier. It learns features from the dataset and maps them to the targets. This mapping is called a trained model. The training step produces a model file of type `zip`.
Hands-on: Train the model
- Support vector machines (SVMs) for classification tool with the following parameters to train the classifier on the training dataset:
  - "Select a Classification Task": `Train a model`
  - "Classifier type": `Linear Support Vector Classification`
  - "Select input type": `tabular data`
  - "Training samples dataset": `breast-w_train`
  - "Does the dataset contain header": `Yes`
  - "Choose how to select data by column": `All columns EXCLUDING some by column header name(s)`
  - "Type header name(s)": `target`
  - "Dataset containing class labels": `breast-w_train`
  - "Does the dataset contain header": `Yes`
  - "Choose how to select data by column": `Select columns by column header name(s)`
  - "Select target column(s)": `target`
- Rename the generated file to `model`.
Question
What is learned by the classifier?

Two attributes, coef_ and intercept_, are learned by the classifier from the training dataset. coef_ contains an importance weight for each feature and intercept_ is just a constant scalar. These attributes differ between classifiers; the ones shown here are specific to the Linear Support Vector classifier. They are stored in the trained model and can be accessed by reading this file.
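Since the Galaxy tool wraps scikit-learn, the same training step can be sketched in a few lines of code. This is only an illustration, assuming the file downloaded from Zenodo above, with a header row and a `target` column as described earlier.

```python
# Sketch of the training step in scikit-learn: fit a linear SVM and inspect
# the learned attributes (coef_ and intercept_).
import pandas as pd
from sklearn.svm import LinearSVC

train = pd.read_csv("breast-w_train.tsv", sep="\t")
X_train = train.drop(columns=["target"])   # all columns EXCLUDING 'target'
y_train = train["target"]                  # the class labels

model = LinearSVC()
model.fit(X_train, y_train)

print(model.coef_)       # one importance weight per feature
print(model.intercept_)  # a constant offset of the decision boundary
```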
Predict categories of test dataset
After the training process is complete, we can see the trained model file (a `zip` file) which contains the learned patterns in the form of weights. The trained model is used to predict the classes of the test dataset (`breast-w_test`). It assigns a class (either tumor or no tumor) to each row of the `breast-w_test` dataset.
Hands-on: Predict classes using the trained model
- Support vector machines (SVMs) for classification tool with the following parameters to predict the classes of the test dataset using the trained model:
  - "Select a Classification Task": `Load a model and predict`
  - "Models": `model` file (output of the previous step)
  - "Data (tabular)": `breast-w_test` file
  - "Does the dataset contain header": `Yes`
  - "Select the type of prediction": `Predict class labels`
- Rename the generated file to `predicted_labels`.
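For illustration, the equivalent prediction step in scikit-learn might look like the sketch below. The file names and the presence of a header row are assumptions based on the description above; in the tutorial itself the prediction is done by the Galaxy tool.

```python
# Sketch of the prediction step: train as before, then predict one class label per test row.
import pandas as pd
from sklearn.svm import LinearSVC

train = pd.read_csv("breast-w_train.tsv", sep="\t")
model = LinearSVC().fit(train.drop(columns=["target"]), train["target"])

X_test = pd.read_csv("breast-w_test.tsv", sep="\t")   # features only, no 'target' column
predicted_labels = model.predict(X_test)              # one class label (0 or 1) per row

# Save the predictions, roughly matching the 'predicted_labels' dataset produced in Galaxy.
pd.DataFrame({"predicted": predicted_labels}).to_csv("predicted_labels.tsv", sep="\t", index=False)
```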
Visualise the predictions
We should evaluate the quality of the predictions by comparing them against the true targets. To do this, we will use another dataset (`breast-w_targets`). It is the same as the test dataset (`breast-w_test`) but contains an extra `target` column holding the true classes of the test samples. With the predicted and true classes, the learned model is evaluated to verify how correct the predictions are. To visualise these predictions, a plotting tool is used. It creates three plots: a confusion matrix; precision, recall and F1; and ROC and AUC curves. We will mainly analyze the precision and recall plot.
Hands-on: Check and visualize the predictions
- Plot confusion matrix, precision, recall and ROC and AUC curves tool with the following parameters to visualise the predictions:
  - "Select input data file": `breast-w_targets`
  - "Select predicted data file": `predicted_labels`
  - "Select trained model": `model`
We will analyze the resulting plots: the confusion matrix; the precision, recall and F1 scores; and the ROC and AUC curves. Using these plots, the robustness of the classification can be assessed.
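The numbers behind these plots can also be computed directly with scikit-learn's metrics module. The sketch below assumes the `predicted_labels.tsv` file from the previous sketch and a `target` column in `breast-w_targets.tsv`.

```python
# Compute the summary metrics that the plotting tool visualizes.
import pandas as pd
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

true_labels = pd.read_csv("breast-w_targets.tsv", sep="\t")["target"]
predicted = pd.read_csv("predicted_labels.tsv", sep="\t")["predicted"]

print(confusion_matrix(true_labels, predicted))
print("precision:", precision_score(true_labels, predicted))
print("recall:   ", recall_score(true_labels, predicted))
print("F1:       ", f1_score(true_labels, predicted))
print("ROC AUC:  ", roc_auc_score(true_labels, predicted))
```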
Summary
By following these steps from data upload to plotting, we have learned how to perform classification and visualise the predictions using Galaxy's machine learning and plotting tools. A similar analysis can be performed using a different dataset or a different classifier. This machine learning suite provides multiple classifiers, from linear to more complex ones, suited to different classification tasks. For example, for a binary classification task, a support vector machine classifier may perform well. It is recommended to try out different classifiers on a dataset to find the one that works best.
Regression
For classification, the targets are integers. When the targets in a dataset are real numbers, the machine learning task becomes regression. Each sample in the dataset has a real-valued output or target. Figure 6 shows how a (regression) curve is fitted to explain most of the data points (blue dots); here, the curve is a straight line (red). The regression task is to learn this curve, which explains the underlying distribution of the data points. The target for a new sample will lie on the learned curve. A regressor learns the mapping between the features of a dataset row and its target value; inherently, it tries to fit a curve through the targets. This curve can be linear or non-linear. In this part of the tutorial, we will perform regression on a body fat dataset.
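As a minimal illustration of this idea (on synthetic data, not on the body fat dataset), a regressor fits a curve through noisy points and reads the target for a new sample off that curve:

```python
# Toy sketch: fit a straight line through noisy points and predict a new target from it.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50).reshape(-1, 1)            # a single feature
y = 2.0 * X.ravel() + 1.0 + rng.normal(0, 1, 50)     # real-valued targets scattered around a line

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)   # slope and intercept of the fitted line
print(reg.predict([[4.5]]))        # predicted target for a new sample lies on the line
```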
Data upload
The dataset contains information about human body density. It includes 14 features such as underwater body density, age, weight, height, neck circumference and so on. The target is the percentage of body fat. The aim of the task is to learn a mapping between several body features and the fat content inside the human body. Using this learning, the body fat percentage can be predicted from the other features. To carry out this task, we need training and test datasets, and we again prepare an additional test dataset that includes the targets in order to evaluate the regression performance. The `body_fat_train` dataset is used as the training dataset and `body_fat_test` as the test dataset. The dataset `body_fat_test_labels` contains the true targets for the test dataset (`body_fat_test`).
Hands-on: Data upload
- Create a new history for this tutorial.

- Import the following datasets and choose the type of data as `tabular`.

  https://zenodo.org/record/3248907/files/body_fat_train.tsv
  https://zenodo.org/record/3248907/files/body_fat_test_labels.tsv
  https://zenodo.org/record/3248907/files/body_fat_test.tsv

  - Copy the link location
  - Open the Galaxy Upload Manager (galaxy-upload on the top-right of the tool panel)
  - Select Paste/Fetch Data
  - Paste the link into the text field
  - Press Start
  - Close the window

- Rename the datasets to `body_fat_train`, `body_fat_test_labels` and `body_fat_test`.

  - Click on the galaxy-pencil (pencil) icon of a dataset to edit its attributes
  - In the central panel, change the Name field
  - Click the Save button
Learn from training dataset
To learn the mapping between the features and the targets, we will apply a regressor called the Gradient Boosting regressor. It is an ensemble-based regressor because its prediction is the collective performance of multiple weak learners (e.g. decision trees). It learns features from the training dataset (`body_fat_train`) and maps all rows to their respective targets (real numbers). The process of mapping gives a trained model.
Hands-on: Train a model
- Ensemble methods for classification and regression tool with the following parameters to train the regressor:
  - "Select a Classification Task": `Train a model`
  - "Select an ensemble method": `Gradient Boosting Regressor`
  - "Select input type": `tabular data`
  - "Training samples dataset": `body_fat_train`
  - "Does the dataset contain header": `Yes`
  - "Choose how to select data by column": `All columns EXCLUDING some by column header name(s)`
  - "Type header name(s)": `target`
  - "Dataset containing class labels": `body_fat_train`
  - "Does the dataset contain header": `Yes`
  - "Choose how to select data by column": `Select columns by column header name(s)`
  - "Select target column(s)": `target`
- Rename the generated file to `model`.
Question
What is learned by the regressor?

Unlike the Linear Support Vector classifier (used for classification in the first part of the tutorial), which learned only two attributes, the Gradient Boosting regressor learns multiple attributes such as feature_importances_ (a weight for each feature/column), oob_improvement_ (incremental improvements in learning), estimators_ (the collection of weak learners) and a few more. These attributes are used to predict the target of a new sample. They are stored in the trained model and can be accessed by reading this file.
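As with the classifier, the corresponding scikit-learn call can be sketched directly. The file name, header row and `target` column below are assumptions based on the description above.

```python
# Sketch of training a Gradient Boosting regressor and inspecting its learned attributes.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

train = pd.read_csv("body_fat_train.tsv", sep="\t")
X_train, y_train = train.drop(columns=["target"]), train["target"]

regressor = GradientBoostingRegressor()   # default parameters (the hands-on above did not change any)
regressor.fit(X_train, y_train)

print(regressor.feature_importances_)     # one weight per feature column
print(len(regressor.estimators_))         # number of weak learners (decision trees)
```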
Predict using test dataset
After learning on the training dataset, we should evaluate the performance on the test dataset to find out whether the algorithm learned general patterns from the training dataset or not. These patterns are used to predict a new sample, and a similar accuracy is expected. As in the classification task, the trained model is evaluated on `body_fat_test`, predicting a target value for each row. The predicted targets are compared to the expected targets to measure the robustness of learning.
Hands-on: Predict targets using the model
- Ensemble methods for classification and regression tool with the following parameters to predict the targets of the test dataset using the trained model:
  - "Select a Classification Task": `Load a model and predict`
  - "Models": `model`
  - "Data (tabular)": `body_fat_test`
  - "Does the dataset contain header": `Yes`
  - "Select the type of prediction": `Predict class labels`
- Rename the generated file to `predicted_data`.
Visualise the prediction
We will evaluate the predictions by comparing them to the expected targets.
Hands-on: Check and visualize the predictions
- Plot actual vs predicted curves and residual plots tool with the following parameters to visualise the predictions:
  - "Select input data file": `body_fat_test_labels`
  - "Select predicted data file": `predicted_data`
The visualization tool creates the following plots:

- True vs. predicted targets curves
- Scatter plot of true vs. predicted targets
- Residual plot of the residual (predicted - true) against the predicted targets
These plots visualize the quality of the regression by showing how close the true and predicted targets are to each other. The closer they are, the better the prediction.
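The same comparison can be sketched outside Galaxy with scikit-learn and matplotlib. The file names and the `target` column in `body_fat_test_labels.tsv` are assumptions based on the description above.

```python
# Sketch: compare true and predicted targets numerically (R^2, RMSE) and visually
# (scatter plot and residual plot), mirroring the plots produced by the Galaxy tool.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score, mean_squared_error

train = pd.read_csv("body_fat_train.tsv", sep="\t")
regressor = GradientBoostingRegressor().fit(train.drop(columns=["target"]), train["target"])

X_test = pd.read_csv("body_fat_test.tsv", sep="\t")
true = pd.read_csv("body_fat_test_labels.tsv", sep="\t")["target"]
predicted = regressor.predict(X_test)

print("R^2: ", r2_score(true, predicted))
print("RMSE:", mean_squared_error(true, predicted) ** 0.5)

plt.scatter(true, predicted)              # scatter plot of true vs. predicted targets
plt.xlabel("true target")
plt.ylabel("predicted target")

plt.figure()
plt.scatter(predicted, predicted - true)  # residual plot
plt.xlabel("predicted target")
plt.ylabel("residual (predicted - true)")
plt.show()
```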
Summary
By following these steps, we learned how to perform regression and visualise the predictions using Galaxy's machine learning and plotting tools. The features of the training dataset are mapped to the real-valued targets, and this mapping is used to make predictions on an unseen (test) dataset. The quality of the predictions is visualised using a plotting tool. There are multiple other regression algorithms, some simpler to use (with fewer parameters) and some more powerful, which can be tried out on this dataset and on other datasets as well.
Conclusion
We learned how to perform classification and regression using different datasets and machine learning tools in Galaxy, and we visualized the results using multiple plots to assess the robustness of the machine learning tasks. There are many other classifiers and regressors in the machine learning suite which can be tried out on these datasets to see how they perform, and different datasets can also be analysed with them. The classifiers and regressors have many parameters which can be altered during the analyses to see whether they affect the prediction accuracy. It may be beneficial to perform a hyperparameter search to tune these parameters for different datasets. Some data pre-processors can also be used to clean the datasets.
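As an illustration of the hyperparameter search mentioned above, here is a scikit-learn sketch using grid search with cross-validation. The parameter grid below is purely illustrative, and the training file layout is the same assumption as in the earlier regression sketches.

```python
# Sketch of a hyperparameter search: try several parameter combinations with
# cross-validation and report the best one.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

train = pd.read_csv("body_fat_train.tsv", sep="\t")
X_train, y_train = train.drop(columns=["target"]), train["target"]

param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingRegressor(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)   # best parameter combination found
print(search.best_score_)    # its mean cross-validated score (R^2 for regressors)
```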
Key points
There are two types of supervised machine learning approaches: classification and regression.
In supervised approaches, the target for each sample is known.
For classification and regression tasks, data is divided into training and test sets.
Using classification, the categories of rows are learned using the training set and predicted using the test set.
Using regression, real-valued targets are learned using the training set and predicted using the test set.
Citing this Tutorial
- Anup Kumar, Bérénice Batut, 2022. Machine learning: classification and regression (Galaxy Training Materials). https://training.galaxyproject.org/training-material/topics/statistics/tutorials/classification_regression/tutorial.html Online; accessed TODAY
- Batut et al., 2018. Community-Driven Data Analysis Training for Biology. Cell Systems. doi:10.1016/j.cels.2018.05.012
Congratulations on successfully completing this tutorial!