Classification Objective#

Classification is a supervised learning task where the goal is to predict a discrete class label for each input. A typical toy dataset in classification is the Iris dataset, where the task is to classify flower species based on physical features of the plant.

from sklearn.datasets import load_iris
import pandas as pd

# Load the dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="target")

# Preview the dataset
X.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

The Iris dataset contains 150 data points, having 4 features: sepal length, sepal width, petal length, and petal width. The target classes are:

  • \(y=0\): setosa

  • \(y=1\): versicolor

  • \(y=2\): virginica

All features in the Iris dataset are continuous numerical values, as they represent physical measurements of the flower in centimeters. However, classification datasets may also have discrete features such as male/female or color (green, red, blue).

Classification Challenges#

Even though the Iris dataset is relatively “clean,” it still allows us to discuss core challenges in classification tasks.

Class Imbalance#

In many real-world problems, one class dominates the others. Let’s check the class distribution in Iris:

y.value_counts(normalize=True)
0    0.333333
1    0.333333
2    0.333333
Name: target, dtype: float64

Here, each class occurs exactly 50 times (1/3 of the data), making the Iris dataset a perfectly balanced classification dataset. Having one class dominate over others may result in classifiers that predict the majority class in most cases and that are also correct most of the time with that prediction. Hence, the evaluation must be performed carefull in those cases, and maybe the classifier needs to be adapted to this case as well.

Mixed Feature Types#

While the Iris dataset only contains continuous numerical features, many real-world classification problems involve discrete or categorical features, which introduce their own set of challenges. In many datasets, features can be discrete — meaning they take on a finite set of values, such as:

  • Categorical/Nominal: e.g., color \(\in\{\) red, green, blue \(\}\)

  • Ordinal: e.g., education_level \(\in\{\) high school, college, graduate \(\}\)

  • Binary: e.g., has_credit_card \(\in\{\) yes, no\(\}\)

Categorical and binary data has no meaningful \(\leq\) operation and this requires special treatment. For example, simple statistics such as computing the mean and variance don’t have a meaning for categorical data. Not all classifiers can handle all feature types, hence it’s important to match the classifier to the feature types in the data.

Classifier Goal#

Goal (Classification)

Given a dataset consisting of \(n\) observations \(\vvec{x}_i\) and their corresponding labels \(y_i\) indicating one of \(c\) classes

\[\mathcal{D}=\left\{(\vvec{x}_{i},y_i)\vert \vvec{x}_{i}\in\mathbb{R}^{d},y_i\in\{1,\ldots, c\}, 1\leq i \leq n\right\}\]

Find a classifier \(\hat{y}:\mathbb{R}^ d\rightarrow \{1,\ldots,c\}\), that reliably predicts the class of a new observation

\[\hat{y}(\mathbf{x})= y\]

A classifier is often written as a function

\[ f:\mathbb{R}^d \rightarrow \mathbb{R}^c, \]

where \(f(\vvec{x})_l\) denotes a score assigned to class \(l\). That is, the output contains one value for each class. The predicted class label is then obtained by selecting the class with the largest score:

\[ \hat y = \argmax_{l\in\{1,\dots,c\}} f(\vvec{x})_l. \]

For example, in a three-class classification problem,

\[\begin{split} f(\vvec{x}) = \begin{pmatrix} 0.1\\ 2.3\\ 1.7 \end{pmatrix}, \end{split}\]

the classifier assigns the highest score to the second class and therefore predicts class \(\hat{y}(\vvec{x})=2\).

In the following, we will indicate classifiers as functions \(f:\mathbb{R}^d \rightarrow \mathbb{R}^c\) or \(\hat y:\mathbb{R}^d \rightarrow \{1,\dots,c\}\). While \(\hat{y}\) gives us the predictions straight away, \(f\) often represents the underlying model. The scores of \(f\) have an interpretation as a confidence of the model in the prediction.

Classifiers, similar to regression models, are defined by their inference and their training. Inference describes how the model performs prediction of (unseen) data points. The training or learning describes how the model is generated, given the training data.

Evaluation#

We quickly define the most straightforward classification evaluation metrics: the \(L_{01}\)-loss and the accuracy. Both metrics quantify the quality of predictions.

Definition 26 (0-1 Loss)

Given a classifier \(\hat{y}(\vvec{x})\) returning the predicted label. We define the 0-1 loss as

\[\begin{split}L_{01}(y,\hat{y}) = \begin{cases} 1, & \text{if } y\neq \hat{y}\\ 0, & \text{if } y=\hat{y} \end{cases}\end{split}\]

The 0-1 loss indicates whether a classifier makes an error in its prediction. In contrast, the accuracy indicates how much a classifier gets right.

Definition 27 (Accuracy)

Given a classifier \(\hat{y}(\vvec{x})\) returning the predicted label. the accuracy of the classifier on dataset \(\mathcal{D}\) containing \(n\) data points is given as

\[\begin{align*} \mathrm{Acc}(\hat{y},\mathcal{D}) &= \frac{\text{Correct predictions}}{\text{Total predictions}}\\ &= \frac1n \lvert\{(\vvec{x}_i,y_i)\in\mathcal{D}\mid \hat{y}(\vvec{x}_i)=y_i\}\rvert\\ &= 1- \frac1n\sum_{i=1}^n L_{01}(\hat{y}(\vvec{x}_i),y_i). \end{align*}\]

Theoretically Optimal Classifiers#

To allow for a theoretical analysis of classifiers, we need to state some assumptions about the ground truth that we want to recover. In classification, the ground truth is defined by a joint probability function \( p^{\ast}({\bf x}, y) \) that assigns each combination of observation \(\mathbf{x}\) and label \(y\) a probability.

Property 2 (i.i.d. Class Distribution)

Under the i.i.d. class distribution assumption, we assume that the dataset samples are identically distributed and independently drawn from an unknown probability distribution \( p^{\ast}({\bf x}, y) \), i.e.

(30)#\[{\bf x}_{i}, y_{i} \sim p^*({\bf x}, y), \forall i \in \lbrace 1, \ldots, n \rbrace.\]

Considering observations \(\mathbf{x}\), labels \(y\) and the training dataset \(\mathcal{D}\) as random variables, we can define the expected prediction error for a classification problem.

Definition 28 (Classifier EPE)

Given a classifier \(\hat{y}_\mathcal{D}\) that has been trained on dataset \(\mathcal{D}\). The Expected Prediction Error (EPE) of \(\hat{y}_\mathcal{D}\) is given as

(31)#\[p( y \neq \hat{y}(\vvec{x}) ) = \mathbb{E}_{\vvec{x},y,\mathcal{D}} [L_{01} (y, \hat{y}_\mathcal{D}(\vvec{x}) )] ,\]

over three random variables:

  • \(\vvec{x}\) is the random variable of a feature vector in the test set.

  • \(y\) is the random variable of the class of \(\vvec{x}\).

  • \(\mathcal{D}\) is the random variable of the training data.

The EPE is equal to the probability of making an error, because the \(L_{01}\) loss has a binary outcome, and the expected value of a binary random variable is equal to the probability that it is one.

Note

The probability of misclassification \( p( y \neq \hat{y}(\vvec{x})) \) is also known as the risk \( R(\hat{y}) \) of the classifier \( \hat{y} \).

Theoretically, we can state the classifier that minimizes the prediction error, when we assume that our data is generated according to the i.i.d. class distribution. The idea is simple: predict the class that occurs most probably, given an observation \(\vvec{x}\).

Definition 29 (Bayes optimal classifier)

The Bayes classifier is the optimal classifier that minimizes the probability of misclassification. Under the i.i.d. data assumption, the Bayes classifier is given as

(32)#\[y^*(\vvec{x}) = \argmax_{1\leq l\leq c}\ p^* (l\mid \vvec{x}) =\argmax_{1\leq l\leq c}\ \frac{p^*(\vvec{x},l)}{p^*(\vvec{x})},\]

The Bayes classifier has the lowest EPE possible. But, as usual, the problem with these theoretically optimal models is that we do not know the ground truth probability \(p^*\).