Supervised Learning

Supervised learning refers to methods in which the training data are labeled. Models developed for supervised learning scenarios are therefore categorized into classification and regression.

Classification is divided into binary and multi-class classification. In the former, data points have only two possible labels, commonly represented as 0 and 1; an example is the Breast Cancer dataset in scikit-learn [1], where each instance indicates whether a patient has cancer or not. The latter involves datasets with more than two classes, such as the Iris dataset in scikit-learn, which comprises three distinct classes. The goal of a classifier is to correctly categorize data samples, achieving high predictive performance as measured by accuracy, or by precision and recall in the case of imbalanced datasets.
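As a minimal sketch of the two settings (assuming scikit-learn is installed), the following loads the Breast Cancer and Iris datasets mentioned above and fits a logistic regression classifier to each, reporting held-out accuracy; the train/test split ratio and solver settings are illustrative choices, not prescribed by the text:

```python
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Binary classification: Breast Cancer dataset (labels 0 and 1)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf_bc = LogisticRegression(max_iter=5000).fit(X_train, y_train)
acc_bc = accuracy_score(y_test, clf_bc.predict(X_test))
print("Breast Cancer (binary) accuracy:", acc_bc)

# Multi-class classification: Iris dataset (three classes)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf_iris = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc_iris = accuracy_score(y_test, clf_iris.predict(X_test))
print("Iris (multi-class) accuracy:", acc_iris)
```

The same estimator class handles both cases: scikit-learn infers the binary or multi-class setting from the number of distinct labels in `y`.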

On the other hand, regression models predict continuous numerical values for the input data. The goal of regression is to minimize the prediction loss (or error), i.e., the difference between the true (ground-truth) values and the predicted values.
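A brief sketch of this setting, again using scikit-learn: the choice of the Diabetes dataset and mean squared error as the loss is illustrative; the section itself does not prescribe a particular dataset or metric:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Diabetes dataset: the target is a continuous measure of disease progression
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Prediction loss: mean squared difference between ground-truth and predicted values
mse = mean_squared_error(y_test, y_pred)
print("Test MSE:", mse)
```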

The most common algorithms used in supervised learning include Linear/Logistic Regression, Decision Trees, Random Forests, Gradient Boosting, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Bayesian methods, all of which can be applied to both classification and regression tasks. In the remainder of this section, we provide a detailed review of these algorithms. Notably, the main reference for this topic is scikit-learn [1].
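As a quick illustration of the breadth of these algorithm families (a sketch, assuming the default scikit-learn estimators; Gaussian Naive Bayes stands in here for the Bayesian methods), the snippet below evaluates the classification variant of each on the Iris dataset with 5-fold cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

# Mean accuracy over 5 cross-validation folds for each model
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean accuracy = {results[name]:.3f}")
```

Each of these families also has a regression counterpart in scikit-learn (e.g., `DecisionTreeRegressor`, `RandomForestRegressor`, `SVR`, `KNeighborsRegressor`), reflecting the point that the same algorithms apply to both tasks.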

References

[1] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.