Confusion Matrices

Using Confusion Matrices to Evaluate Classifiers


Overview


In the last post, we discussed using stratified k-fold cross-validation to evaluate the accuracy of an SGDClassifier that classifies images from the MNIST dataset as either fives or non-fives.

Confusion matrices are often better at evaluating classifiers than accuracy alone, especially when dealing with unbalanced datasets. Consider a classifier that aims to predict whether a patient has a rare disease. If only 5% of people have the disease, a classifier that always predicts “healthy” achieves an accuracy score of 95%. However, that classifier is practically useless because it never identifies a single case of the disease.

Confusion matrices provide additional metrics to investigate a classifier’s performance. Before diving deeper into confusion matrices, let’s look at baseline classifiers.

Baseline Classifiers


In sklearn, baseline methods are known as dummy methods and represent the most basic estimators. Baseline methods make predictions based on simple statistics or random guesses.

sklearn provides four baseline classification methods. Two use random strategies to make predictions, and two use constant techniques.

Random Strategies


  • The uniform strategy picks a predicted class uniformly at random, so every class is equally likely regardless of how often it appears in the training data.
  • The stratified strategy picks a predicted class at random according to the training set’s class distribution, so more frequent classes are predicted more often.

The two random methods behave differently on unbalanced datasets: the uniform strategy picks evenly amongst classes regardless of the class distribution, whereas the stratified technique respects that distribution.
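
To make the difference concrete, here is a minimal sketch (the toy data, names, and seed are my own, with the imports included so it runs on its own) that fits both random strategies on an unbalanced target and compares how often each one predicts the minority class.

import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data: dummy strategies ignore the features; ~10% of the labels are 1
rng = np.random.default_rng(0)
X_toy = np.zeros((1000, 1))
y_toy = (rng.random(1000) < 0.1).astype(int)

for strategy in ('uniform', 'stratified'):
    clf = DummyClassifier(strategy=strategy, random_state=0).fit(X_toy, y_toy)
    preds = clf.predict(X_toy)
    print(f'{strategy:>10}: fraction of 1s predicted = {preds.mean():.2f}')

# Roughly speaking, uniform should predict 1 about half the time, while
# stratified should predict 1 about 10% of the time, matching the class balance.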

Constant Strategies


  • The constant strategy always returns a single, user-defined target class.
  • The most_frequent technique always returns the most frequent class in the training data (the prior strategy produces the same predictions).

Let’s import the classes we need for this tutorial so we can take a look at a simple example of a DummyClassifier.

Imports


import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn import datasets
from sklearn.dummy import DummyClassifier
from sklearn.base import BaseEstimator
from sklearn.metrics import accuracy_score, precision_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
import seaborn as sns
import textwrap
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Matplotlib & Seaborn Settings


mpl.rcParams.update({'font.size': 26})
mpl.rc('figure', figsize=(8, 8))
sns.set(font_scale=2.0)
sns.color_palette("deep")
%matplotlib inline

Now, let’s build a simple dummy classifier.

Simple DummyClassifier


# Create the features
X = np.array([-1, 1, 1, 1])

# Create the target class
y = np.array([0, 1, 1, 1])

# Create a dummy classifier using the most_frequent strategy
dummy_clf = DummyClassifier(strategy="most_frequent")

# Fit the dummy classifier
dummy_clf.fit(X, y)

# Print the predictions and the score
print(f'Predictions: {dummy_clf.predict(X)}')
print(f'Score: {dummy_clf.score(X, y)}')
Predictions: [1 1 1 1]
Score: 0.75

The dummy classifier always predicts 1 because that is the most frequent class in y (the target). It achieves 75% accuracy because three of the four targets are 1 and only one is 0.

Let’s expand this example to the Iris dataset. First, we need to load the data.

Load Iris Dataset


# Load the iris dataset
iris = datasets.load_iris()

# How many elements are in the data set?
print(f'Size of full dataset: {len(iris.data)}')

# Split the data into training and testing sets
tts_iris = train_test_split(iris.data, iris.target,
                            test_size=.33, random_state=21)

# Unpack the result of the train_test_split
(iris_train_ftrs, iris_test_ftrs,
 iris_train_tgt, iris_test_tgt) = tts_iris

# The test set should contain 50 elements, since we set the size to a third
print(f'Size of test dataset: {len(iris_test_ftrs)}')
Size of full dataset: 150
Size of test dataset: 50

As you can see, train_test_split splits arrays into random training and test sets. The test_size parameter corresponds to the proportion of the dataset to include in the test split. The random_state parameter allows for reproducible output across multiple function calls. Let’s look at a few simple examples.

Splitting data into training and test sets


# Create 5 rows and 2 columns of features ranging from 0 - 9
# and a target variable ranging from 0 - 4
X, y = np.arange(10).reshape((5, 2)), range(5)

# Divide the data into training and testing sets
tts = train_test_split(X, y, test_size=0.33, random_state=42)

# Get the training and testing features and targets
(train_features, test_features, train_target, test_target) = tts

# Print them out to verify the results
print(f'Training Features:')
print(f'{train_features}')
print(f'Training Target:')
print(f'{train_target}')
print(f'Evaluation Features:')
print(f'{test_features}')
print(f'Evaluation Target:')
print(f'{test_target}')
Training Features:
[[4 5]
 [0 1]
 [6 7]]
Training Target:
[2, 0, 3]
Evaluation Features:
[[2 3]
 [8 9]]
Evaluation Target:
[1, 4]

The shuffle parameter controls whether the data is shuffled before splitting. If shuffle=False, then stratify must be None (the default).

train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
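
Relatedly, the stratify parameter is worth a quick illustration because it matters for the unbalanced datasets discussed earlier. Below is a minimal sketch (the toy arrays are my own; numpy and train_test_split are already imported above) showing that passing stratify=y keeps the class proportions the same in both splits.

# Toy target: 80% class 0, 20% class 1
X_s = np.arange(20).reshape(10, 2)
y_s = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_s, y_s, test_size=0.5, random_state=42, stratify=y_s)

print(f'Training class counts: {np.bincount(y_tr)}')  # expect [4 1]
print(f'Test class counts:     {np.bincount(y_te)}')  # expect [4 1]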

Let’s get back to the Iris dataset.

print('Iris Training Features')
print('----------------------')
print(iris_train_ftrs[:10])
print()
print('Iris Training Target')
print('--------------------')
print(iris_train_tgt[:10])
print()
print('Iris Testing Features')
print('---------------------')
print(iris_test_ftrs[:10])
print()
print('Iris Testing Target')
print('---------------------')
print(iris_test_tgt[:10])
Iris Training Features
----------------------
[[6.9 3.1 4.9 1.5]
 [5.  3.3 1.4 0.2]
 [6.7 3.1 4.4 1.4]
 [7.7 2.6 6.9 2.3]
 [5.7 2.8 4.5 1.3]
 [5.8 2.7 4.1 1. ]
 [4.6 3.1 1.5 0.2]
 [5.1 3.5 1.4 0.3]
 [7.7 3.  6.1 2.3]
 [4.7 3.2 1.6 0.2]]

Iris Training Target
--------------------
[1 0 1 2 1 1 0 0 2 0]

Iris Testing Features
---------------------
[[5.8 2.6 4.  1.2]
 [5.1 3.8 1.9 0.4]
 [5.  3.4 1.5 0.2]
 [5.1 3.7 1.5 0.4]
 [5.7 3.  4.2 1.2]
 [6.6 3.  4.4 1.4]
 [5.4 3.4 1.7 0.2]
 [5.6 2.8 4.9 2. ]
 [5.  3.4 1.6 0.4]
 [5.1 3.8 1.5 0.3]]

Iris Testing Target
---------------------
[1 0 0 0 1 1 0 2 0 0]

Iris Dataset DummyClassifier


What’s the most frequently occurring class in the Iris dataset?

Let’s look at a bar chart of the target variable. As a reminder, the target variable in the Iris dataset is the type of Iris plant (e.g., Setosa, Versicolour, or Virginica).

# Get the unique values and counts of the target variable
values, counts = np.unique(iris_train_tgt, return_counts=True)

# Print the output
print('Class counts')
print('------------')
print(np.c_[values, counts])

plt.figure(figsize=(12, 8))

# Create the bar chart
bars = sns.barplot(x=values, y=counts)

# Create the labels above the bars
plt.bar_label(bars.containers[0], label_type="edge", size=26)

# Set the titles, axes, and layout
plt.title("Frequency of Target Variable Classes in the Iris Dataset")
plt.xlabel("Class")
plt.ylabel("Counts")
labels = ['Iris Setosa ("0")', 'Iris Versicolour ("1")', 'Iris Virginica ("2")']
plt.xticks(values, labels, rotation=9)
plt.ylim(0, 40)
plt.show()
Class counts
------------
[[ 0 32]
 [ 1 33]
 [ 2 35]]

[Figure: bar chart of the frequency of target variable classes in the Iris training split]

As you can see, class 2 (Iris Virginica) is the most frequently occurring class in the training split. Supplying the most_frequent argument to the strategy parameter of the DummyClassifier class should therefore produce a classifier that predicts all 2’s.

Iris Dataset - most_frequent DummyClassifier


Let’s create a baseline classifier for the Iris dataset using the most_frequent strategy.

# Create the classifier
baseline = DummyClassifier(strategy="most_frequent")

# Fit the classifier
baseline.fit(iris_train_ftrs, iris_train_tgt)

# Make predictions based on the classifier
base_preds = baseline.predict(iris_test_ftrs)

# Print to verify results
print(f'Length of baseline predictions: {len(base_preds)}')
print(f'Ten baseline predictions (they are all 2\'s): {base_preds[:10]}')
Length of baseline predictions: 50
Ten baseline predictions (they are all 2's): [2 2 2 2 2 2 2 2 2 2]

Let’s check the accuracy of these predictions on the test dataset.

# Check the accuracy of the baseline classifier
base_acc = accuracy_score(base_preds, iris_test_tgt)
print(f'Accuracy of the baseline classifier: {base_acc}')
Accuracy of the baseline classifier: 0.3

As expected, the baseline classifier does not perform that well. Now, let’s check the accuracies of all the baseline classifier strategies.

Comparing the accuracies of all baseline strategies


# Create a list of strategies
strategies = ['constant', 'uniform', 'stratified', 'prior', 'most_frequent']

# Set up args to create different DummyClassifier strategies
baseline_args = [{'strategy': s} for s in strategies]

# Class 0 is setosa
baseline_args[0]['constant'] = 0
accuracies = []

# Loop through the classifiers and display the results in a DF
for bla in baseline_args:
    baseline = DummyClassifier(**bla)
    baseline.fit(iris_train_ftrs, iris_train_tgt)
    base_preds = baseline.predict(iris_test_ftrs)
    accuracies.append(accuracy_score(base_preds, iris_test_tgt))

display(pd.DataFrame({'accuracy': accuracies}, index=strategies))

accuracy
constant 0.36
uniform 0.28
stratified 0.34
prior 0.30
most_frequent 0.30

The uniform and stratified strategies will return different results when re-run multiple times on a fixed train-test split because they are randomized methods. The other techniques always return the same values for a fixed train-test split.
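
If you need those randomized baselines to be reproducible, DummyClassifier also accepts a random_state. A short sketch (reusing the iris splits from above; the seed values are arbitrary choices of mine):

# Fixing random_state makes the stratified strategy reproducible
for seed in (0, 0, 1):
    baseline = DummyClassifier(strategy='stratified', random_state=seed)
    baseline.fit(iris_train_ftrs, iris_train_tgt)
    acc = accuracy_score(baseline.predict(iris_test_ftrs), iris_test_tgt)
    print(f'random_state={seed}: accuracy = {acc:.2f}')

# The two runs with random_state=0 should produce identical accuracies;
# the run with random_state=1 will generally differ.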

Let’s do the same thing using the MNIST dataset.

MNIST Dataset - DummyClassifier


# Read in vars from previous notebook
%store -r

# Create a list of strategies
strategies = ['constant', 'uniform', 'stratified', 'prior', 'most_frequent']

# Set up args to create different DummyClassifier strategies
baseline_args = [{'strategy': s} for s in strategies]

# False is the constant class
baseline_args[0]['constant'] = False
accuracies = []

# Loop through the classifiers and display the results in a DF
for bla in baseline_args:
    baseline = DummyClassifier(**bla)
    baseline.fit(X_train, y_train_5)
    base_preds = baseline.predict(X_test)
    accuracies.append(accuracy_score(base_preds, y_test_5))

display(pd.DataFrame({'accuracy': accuracies}, index=strategies))

accuracy
constant 0.9108
uniform 0.4944
stratified 0.8396
prior 0.9108
most_frequent 0.9108

The constant, prior, and most_frequent strategies all perform identically here: each always predicts “not a 5”. The accuracy is so high because only about 10% of the images in the dataset are 5s, so a classifier that always predicts the negative class is right roughly 90% of the time.
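
To confirm the imbalance driving those scores, here is a quick check (assuming y_train_5, restored above with %store -r, is the boolean is-this-a-5 target from the previous post):

# Count how many training images are 5s
n_fives = np.sum(y_train_5)
print(f'Fives: {n_fives} of {len(y_train_5)} '
      f'({100 * n_fives / len(y_train_5):.1f}% of the training set)')

# Roughly 9% of the training images are 5s, so always predicting "not 5"
# is correct about 91% of the time.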

What are some of the other metrics we can use to evaluate a classifier? Let’s take a look.

Metrics Keys


print(textwrap.fill(str(sorted(metrics.SCORERS.keys())), width=70))
['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score',
'average_precision', 'balanced_accuracy', 'completeness_score',
'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples',
'f1_weighted', 'fowlkes_mallows_score', 'homogeneity_score',
'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples',
'jaccard_weighted', 'matthews_corrcoef', 'max_error',
'mutual_info_score', 'neg_brier_score', 'neg_log_loss',
'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error',
'neg_mean_gamma_deviance', 'neg_mean_poisson_deviance',
'neg_mean_squared_error', 'neg_mean_squared_log_error',
'neg_median_absolute_error', 'neg_root_mean_squared_error',
'normalized_mutual_info_score', 'precision', 'precision_macro',
'precision_micro', 'precision_samples', 'precision_weighted', 'r2',
'rand_score', 'recall', 'recall_macro', 'recall_micro',
'recall_samples', 'recall_weighted', 'roc_auc', 'roc_auc_ovo',
'roc_auc_ovo_weighted', 'roc_auc_ovr', 'roc_auc_ovr_weighted',
'top_k_accuracy', 'v_measure_score']

There certainly are a lot. How can we figure out the default scorer for a particular classifier? Let’s take a look at a k-nearest neighbor classifier, for instance.

Default Scorer for Classifier


knn = KNeighborsClassifier()

Returns_index = '\n'.join(knn.score.__doc__.splitlines()).index('Returns')
print('\n'.join(knn.score.__doc__.splitlines())[Returns_index:])
Returns
        -------
        score : float
            Mean accuracy of ``self.predict(X)`` wrt. `y`.

The above shows that the default evaluation metric for k-NN is mean accuracy. Let’s look into confusion matrices and some other metrics, including the precision, recall, and specificity.
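
Before moving on to those, note that any of the scorer names listed above can be passed to the scoring parameter of sklearn’s cross-validation helpers when the default metric isn’t what we want. Here is a brief sketch (reusing the knn estimator and the iris training split from earlier; the particular scorers and cv=5 are arbitrary choices of mine):

from sklearn.model_selection import cross_val_score

# Score the same k-NN model with several different scorers
for scoring in ('accuracy', 'recall_macro', 'precision_macro'):
    scores = cross_val_score(knn, iris_train_ftrs, iris_train_tgt,
                             cv=5, scoring=scoring)
    print(f'{scoring:>15}: mean score = {scores.mean():.3f}')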

First, we’ll look at a generic confusion matrix.

Generic Confusion Matrix


The table below shows a sample of hypothetical predictions returned by a binary classifier, along with the corresponding result labels.

d = {'Reality': ['0', '1', '0', '1'],
    'Prediction': ['0', '0', '1', '1'],
    'Result': ['True Negative', 'False Negative', 'False Positive', 'True Positive'],
    'Abbreviation': ['TN', 'FN', 'FP', 'TP']}

df = pd.DataFrame(data=d)
df

Reality Prediction Result Abbreviation
0 0 0 True Negative TN
1 1 0 False Negative FN
2 0 1 False Positive FP
3 1 1 True Positive TP

In a generic confusion matrix, the rows represent reality, and the columns represent predictions. However, depending on who draws the confusion matrix, the rows and columns may be swapped, so always check the labels. Let’s take a look at a generic confusion matrix.

data = {'Predicted Negative (PredN)': ['True Negative (TN)', 'False Negative (FN)'],
        'Predicted Positive (PredP)': ['False Positive (FP)', 'True Positive (TP)']}

df = pd.DataFrame(data, index=['Real Negative (RealN)', 'Real Positive (RealP)'])

df

Predicted Negative (PredN) Predicted Positive (PredP)
Real Negative (RealN) True Negative (TN) False Positive (FP)
Real Positive (RealP) False Negative (FN) True Positive (TP)
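
sklearn uses the same layout: confusion_matrix(y_true, y_pred) puts reality in the rows and predictions in the columns, and ravel() returns the counts in TN, FP, FN, TP order. We can verify this with the four hypothetical predictions from the first table, which contain one of each outcome.

reality = [0, 1, 0, 1]
prediction = [0, 0, 1, 1]

cm = confusion_matrix(reality, prediction)
print(cm)
# [[1 1]    row 0: TN, FP
#  [1 1]]   row 1: FN, TP

tn, fp, fn, tp = cm.ravel()
print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')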

We can use the following equations to represent reality and our predictions:

$$\text{Real Negatives} = TN + FP$$

$$\text{Real Positives} = FN + TP$$

$$\text{Predicted Negatives} = TN + FN$$

$$\text{Predicted Positives} = FP + TP$$

Let’s look at some additional metrics the confusion matrix provides.

Metrics from the Confusion Matrix


The following provides an overview of some metrics we can calculate from a confusion matrix.

[Figure: confusion matrix diagram annotated with the metrics derived from each row and column]

Specificity, False Positives, False Alarms, Type I Errors, and Overestimation


Let’s start with the specificity, which we can calculate using the top row of the above confusion matrix. The specificity is also known as the true negative rate (TNR). It measures how many of the real negative cases we correctly identified as negative. The best value is 1 and the worst value is 0.

$$\text{specificity} = \frac{TN}{FP + TN} = \frac{TN}{RealN}$$

Another way to think about the specificity is whether the classifier raises a flag only in the specific cases we want it to. Intuitively, the specificity is the ability of the classifier to identify all the negative samples. Maximizing the specificity minimizes false positives, and false negatives do not affect it at all.

There are two straightforward ways to calculate it in sklearn for a binary classifier. The first is to extract the true negatives and false positives from the confusion matrix and perform the calculation by hand. The second is to set the pos_label argument of the recall_score function to 0, which produces the same result. Let’s look at how different predictions produce different specificities on a balanced dataset.

# specificity
titles = ['All False Predictions', 'Some False Positives',
          'Some False Negatives', 'Perfect Classifier']
y_true = [[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]
y_pred = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]
for i, (y_t, y_p, title) in enumerate(zip(y_true, y_pred, titles)):
    print(f'{i + 1}. {title}')
    print('-----------------------------')
    print(f'y_true: {y_t}')
    print(f'y_pred: {y_p}')
    print()
    (tn, fp, fn, tp) = confusion_matrix(y_t, y_p).ravel()
    print(f'True Negatives: {tn}')
    print(f'False Positives: {fp}')
    print(f'True Positives: {tp}')
    print(f'False Negatives: {fn}')
    print()
    print(f'True Negative Rate 1 (Specificity): {tn / (fp + tn):.3f}')
    print(
        f'True Negative Rate 2 (Specificity): {recall_score(y_t, y_p, pos_label=0):.3f}')
    print()
1. All False Predictions
-----------------------------
y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

True Negatives: 5
False Positives: 0
True Positives: 0
False Negatives: 5

True Negative Rate 1 (Specificity): 1.000
True Negative Rate 2 (Specificity): 1.000

2. Some False Positives
-----------------------------
y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

True Negatives: 3
False Positives: 2
True Positives: 5
False Negatives: 0

True Negative Rate 1 (Specificity): 0.600
True Negative Rate 2 (Specificity): 0.600

3. Some False Negatives
-----------------------------
y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred: [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

True Negatives: 5
False Positives: 0
True Positives: 3
False Negatives: 2

True Negative Rate 1 (Specificity): 1.000
True Negative Rate 2 (Specificity): 1.000

4. Perfect Classifier
-----------------------------
y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

True Negatives: 5
False Positives: 0
True Positives: 5
False Negatives: 0

True Negative Rate 1 (Specificity): 1.000
True Negative Rate 2 (Specificity): 1.000

The specificity only decreases when we ramp up the number of false positives. As long as the classifier identifies all negative cases, the specificity will be 1.0, even if false negatives exist.

Precision, Positive Predictive Value (PPV)


Traveling clockwise around the confusion matrix, let’s look at another evaluation metric: the precision. The precision answers the question, “When the classifier predicts a positive, how often is it right?” and is also known as the positive predictive value (PPV). The formula for the precision is the following:

$$\text{precision} = \frac{TP}{PredP} = \frac{TP}{TP + FP}$$

The precision is appropriate when we want to minimize false positives; it quantifies the proportion of positive predictions that are correct. Intuitively, the precision is the ability of the classifier not to label as positive a sample that is negative. The best value is 1, and the worst value is 0. We can use the precision_score function from sklearn to calculate a classifier’s precision.

# precision
titles = ['All False Predictions', 'Some False Positives',
          'Some False Negatives', 'Perfect Classifier']
y_true = [[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]
y_pred = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]
for i, (y_t, y_p, title) in enumerate(zip(y_true, y_pred, titles)):
    print(f'{i + 1}. {title}')
    print('-----------------------------')
    print(f'y_true: {y_t}')
    print(f'y_pred: {y_p}')
    print()
    (tn, fp, fn, tp) = confusion_matrix(y_t, y_p).ravel()
    print(f'True Negatives: {tn}')
    print(f'False Positives: {fp}')
    print(f'True Positives: {tp}')
    print(f'False Negatives: {fn}')
    print()
    print(
        f'Positive Predictive Value (Precision): {precision_score(y_t, y_p, zero_division=False):.3f}')
    print()
1. All False Predictions
-----------------------------
y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

True Negatives: 5
False Positives: 0
True Positives: 0
False Negatives: 5

Positive Predictive Value (Precision): 0.000

2. Some False Positives
-----------------------------
y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

True Negatives: 3
False Positives: 2
True Positives: 5
False Negatives: 0

Positive Predictive Value (Precision): 0.714

3. Some False Negatives
-----------------------------
y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred: [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

True Negatives: 5
False Positives: 0
True Positives: 3
False Negatives: 2

Positive Predictive Value (Precision): 1.000

4. Perfect Classifier
-----------------------------
y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

True Negatives: 5
False Positives: 0
True Positives: 5
False Negatives: 0

Positive Predictive Value (Precision): 1.000

Whereas the specificity for a classifier that predicts all negative cases was 1.0, the precision is 0. The precision increases as the true positives increase and the false positives decrease. False negatives do not affect the calculation.

Recall, Sensitivity, and True Positive Rate (TPR)


Now let’s move to the bottom row of the above confusion matrix to calculate the recall. The recall is also known as the sensitivity or true positive rate (TPR) and is appropriate when we want to minimize false negatives. Imagine a classifier that predicts whether a web page is a relevant result for a search query. The recall answers the question, “Of the valuable web page results, how many did the classifier identify or recall correctly?” Another way to phrase this is, “Within the real-world cases of the target class, how many did the classifier identify?” Intuitively, the recall is the classifier’s ability to find all the positive samples. In short, the recall is not concerned with false positives; it is the metric to focus on when false negatives are costly.

$$ \text{recall} = \frac{TP}{FN + TP} = \frac{TP}{RealP}$$

The Complement of the Recall, The False Negative Rate, Type II error, Miss, Underestimation


The complement of counting the real positives we got right is counting the real positives we missed. Each miss is a Type II error, better known as a false negative.

$$ \text{false negative rate} = \frac{FN}{TP + FN} = \frac{FN}{RealP} $$

The false negative rate is equivalent to the following:

$$ \text{false negative rate} = 1 - \text{true positive rate} $$

In other words, we can break the target class down into two groups:

  • Real positives the classifier identified correctly
  • Real positives the classifier identified incorrectly

Using the recall and its complement, we can add up the hits we got right with the hits we got wrong, giving us all the real positive cases.

$$ \frac{TP}{TP+FN} + \frac{FN}{TP+FN} = \frac{TP+FN}{TP+FN} = 1 $$
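
For example, the third fictitious classifier below produces three true positives and two false negatives, so

$$ \frac{TP}{TP+FN} + \frac{FN}{TP+FN} = \frac{3}{5} + \frac{2}{5} = 0.6 + 0.4 = 1 $$

which matches the recall of 0.600 and miss rate of 0.400 reported in its output.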

Let’s develop some intuition regarding the recall and its complement.

# Recall and False Negative Rate
titles = ['All False Predictions', 'Some False Positives',
          'Some False Negatives', 'Perfect Classifier']
y_true = [[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]
y_pred = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]
for i, (y_t, y_p, title) in enumerate(zip(y_true, y_pred, titles)):
    print(f'{i + 1}. {title}')
    print('-----------------------------')
    print(f'y_true: {y_t}')
    print(f'y_pred: {y_p}')
    print()
    (tn, fp, fn, tp) = confusion_matrix(y_t, y_p).ravel()
    print(f'True Negatives: {tn}')
    print(f'False Positives: {fp}')
    print(f'True Positives: {tp}')
    print(f'False Negatives: {fn}')
    print()
    print(f'True Positive Rate (Recall): {recall_score(y_t, y_p):.3f}')
    print(
        f'False Negative Rate (Miss Rate): {fn / (tp + fn):.3f}')
    print()
1. All False Predictions
-----------------------------
y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

True Negatives: 5
False Positives: 0
True Positives: 0
False Negatives: 5

True Positive Rate (Recall): 0.000
False Negative Rate (Miss Rate): 1.000

2. Some False Positives
-----------------------------
y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred: [0, 0, 0, 1, 1, 1, 1, 1, 1, 1]

True Negatives: 3
False Positives: 2
True Positives: 5
False Negatives: 0

True Positive Rate (Recall): 1.000
False Negative Rate (Miss Rate): 0.000

3. Some False Negatives
-----------------------------
y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred: [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

True Negatives: 5
False Positives: 0
True Positives: 3
False Negatives: 2

True Positive Rate (Recall): 0.600
False Negative Rate (Miss Rate): 0.400

4. Perfect Classifier
-----------------------------
y_true: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred: [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

True Negatives: 5
False Positives: 0
True Positives: 5
False Negatives: 0

True Positive Rate (Recall): 1.000
False Negative Rate (Miss Rate): 0.000

Coding the Confusion Matrix


Let’s code several confusion matrices for fictitious classifiers and calculate the metrics defined above for each. We use a seaborn heatmap to visualize the confusion matrices.

titles = ['All False Predictions', 'Some False Positives',
          'Some False Negatives', 'Perfect Classifier']
fig, axes = plt.subplots(2, 2, figsize=(12, 12), sharex=True, sharey=True)
fig.tight_layout()
y_true = [[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]
y_pred = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
          [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
          [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
          [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]]

for i, (ax, y_t, y_p, title) in enumerate(zip(axes.flat, y_true, y_pred, titles)):
    cm = confusion_matrix(y_t, y_p)
    sns.heatmap(cm, annot=True, square=True,
                xticklabels=['False', 'True'], yticklabels=['False', 'True'],
                fmt='g', cmap='flare', ax=ax, cbar=False, annot_kws={"size": 24})
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    ax.set_title(f'{i + 1}. {title}')
plt.show()

for i, (y_t, y_p, title) in enumerate(zip(y_true, y_pred, titles)):
    print(f'{i + 1}. {title}')
    print('-----------------------------')
    (tn, fp, fn, tp) = confusion_matrix(y_t, y_p).ravel()
    print(f'True Negatives: {tn}')
    print(f'False Positives: {fp}')
    print(f'True Positives: {tp}')
    print(f'False Negatives: {fn}')
    print()
    print(f'True Negative Rate (Specificity): {recall_score(y_t, y_p, pos_label=0):.3f}')
    print(f'Positive Predictive Value (Precision): {precision_score(y_t, y_p, zero_division=False):.3f}')
    print()
    print(f'True Positive Rate (Recall): {recall_score(y_t, y_p, zero_division=False):.3f}')
    fnr = fn / (tp + fn)
    print(f'False Negative Rate (Recall Complement): {fnr:.3f}')
    print()
    # Overall accuracy
    print(f'Accuracy: {accuracy_score(y_t, y_p):.3f}')
    print()

[Figure: 2x2 grid of confusion matrix heatmaps, one per fictitious classifier]

1. All False Predictions
-----------------------------
True Negatives: 5
False Positives: 0
True Positives: 0
False Negatives: 5

True Negative Rate (Specificity): 1.000
Positive Predictive Value (Precision): 0.000

True Positive Rate (Recall): 0.000
False Negative Rate (Recall Complement): 1.000

Accuracy: 0.500

2. Some False Positives
-----------------------------
True Negatives: 3
False Positives: 2
True Positives: 5
False Negatives: 0

True Negative Rate (Specificity): 0.600
Positive Predictive Value (Precision): 0.714

True Positive Rate (Recall): 1.000
False Negative Rate (Recall Complement): 0.000

Accuracy: 0.800

3. Some False Negatives
-----------------------------
True Negatives: 5
False Positives: 0
True Positives: 3
False Negatives: 2

True Negative Rate (Specificity): 1.000
Positive Predictive Value (Precision): 1.000

True Positive Rate (Recall): 0.600
False Negative Rate (Recall Complement): 0.400

Accuracy: 0.800

4. Perfect Classifier
-----------------------------
True Negatives: 5
False Positives: 0
True Positives: 5
False Negatives: 0

True Negative Rate (Specificity): 1.000
Positive Predictive Value (Precision): 1.000

True Positive Rate (Recall): 1.000
False Negative Rate (Recall Complement): 0.000

Accuracy: 1.000

False positives affect the TNR (specificity) and the Positive Predictive Value (precision). False negatives, on the other hand, affect the true positive and false negative rates, otherwise known as the recall and its complement.

Let’s move on to evaluating the SGDClassifier we built for the MNIST dataset that classifies images as either fives or not fives.

MNIST Dataset


To compute a confusion matrix, we need a set of predictions to compare against the actual targets. We generate predictions from the training data because the test split should be reserved for the final evaluation. We can use the cross_val_predict function from sklearn to get clean, out-of-fold predictions for the sgd_clf we created.

%store -r sgd_clf
%store -r X_train
%store -r y_train_5

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

y_train_pred contains an array of true and false values, indicating whether or not the classifier predicts a given sample to be a five.

y_train_pred[:10]
array([False, False, False, False, False, False, False, False,  True,
       False])

Let’s build the confusion matrix.

confusion_matrix(y_train_5, y_train_pred)
array([[53124,  1455],
       [  949,  4472]], dtype=int64)

We can use the seaborn package to make the confusion matrix easier to read.

fig, ax = plt.subplots(1, 1, figsize=(10, 8))
cm = confusion_matrix(y_train_5, y_train_pred)
ax = sns.heatmap(cm, annot=True, square=True, xticklabels=[
                 'False', 'True'], yticklabels=['False', 'True'],
                 fmt='g', cmap="flare", annot_kws={"size": 24})
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix for SGDClassifier Built on MNIST Dataset')
plt.show()

[Figure: confusion matrix heatmap for the SGDClassifier built on the MNIST dataset]

The classifier correctly identified 53,124 images as non-fives (true negatives). The remaining 1,455 were wrongly classified as fives (false positives). The second row reveals that 949 real fives were incorrectly classified as non-fives (false negatives), while the remaining 4,472 were correctly classified as fives (true positives).

Let’s calculate the remaining metrics.

tn, fp, fn, tp = confusion_matrix(
    y_true=y_train_5, y_pred=y_train_pred).ravel()

print(f'True Negatives: {tn}')
print(f'False Positives: {fp}')
print(f'True Positives: {tp}')
print(f'False Negatives: {fn}')
print()
print(f'True Negative Rate (Specificity): {recall_score(y_train_5, y_train_pred, pos_label=0):.3f}')
print(f'Positive Predictive Value (Precision): {precision_score(y_train_5, y_train_pred, zero_division=False):.3f}')
print()
print(f'True Positive Rate (Recall): {recall_score(y_train_5, y_train_pred, zero_division=False):.3f}')
fnr = fn / (tp + fn)
print(f'False Negative Rate (Recall Complement): {fnr:.3f}')
print()
# Overall accuracy
print(f'Accuracy: {accuracy_score(y_train_5, y_train_pred):.3f}')
print()
True Negatives: 53124
False Positives: 1455
True Positives: 4472
False Negatives: 949

True Negative Rate (Specificity): 0.973
Positive Predictive Value (Precision): 0.755

True Positive Rate (Recall): 0.825
False Negative Rate (Recall Complement): 0.175

Accuracy: 0.960

The precision indicates that when the classifier picks a five, it is right about 75.5% of the time. The recall demonstrates that the classifier detects 82.5% of the 5s. Let’s compare these numbers with a baseline classifier, this time writing the estimator by hand instead of using DummyClassifier.
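
Before we do, it’s worth mentioning that sklearn’s classification_report bundles the precision and recall for every class into a single table. A short sketch (assuming y_train_5 and y_train_pred from the cells above are still in scope); note that the recall reported for the negative class is the specificity we calculated by hand:

from sklearn.metrics import classification_report

# Precision and recall for both the "not 5" and "5" classes in one table
print(classification_report(y_train_5, y_train_pred,
                            target_names=['not 5', '5'], digits=3))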

Baseline Classifier on MNIST Dataset


# Create the class
class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)

# Instantiate and create predictions
never_5_clf = Never5Classifier()
y_train_never_5_pred = cross_val_predict(never_5_clf, X_train, y_train_5, cv=3)

# Draw the confusion matrix
fig, ax = plt.subplots(1, 1, figsize=(10, 8))
cm = confusion_matrix(y_train_5, y_train_never_5_pred)
ax = sns.heatmap(cm, annot=True, square=True,
                 xticklabels=['False', 'True'], yticklabels=['False', 'True'],
                 fmt='g', cmap="flare", annot_kws={"size": 24})
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix for Baseline Estimator Built on MNIST Dataset')
plt.show()

[Figure: confusion matrix heatmap for the baseline (never-5) estimator on the MNIST dataset]

We never predict any fives. Let’s check the metrics.

# Calculate statistics
tn, fp, fn, tp = confusion_matrix(
    y_true=y_train_5, y_pred=y_train_never_5_pred).ravel()

print(f'True Negatives: {tn}')
print(f'False Positives: {fp}')
print(f'True Positives: {tp}')
print(f'False Negatives: {fn}')
print()
print(f'True Negative Rate (Specificity): {recall_score(y_train_5, y_train_never_5_pred, pos_label=0):.3f}')
print(f'Positive Predictive Value (Precision): {precision_score(y_train_5, y_train_never_5_pred, zero_division=False):.3f}')
print()
print(f'True Positive Rate (Recall): {recall_score(y_train_5, y_train_never_5_pred, zero_division=False):.3f}')
fnr = fn / (tp + fn)
print(f'False Negative Rate (Recall Complement): {fnr:.3f}')
print()
# Overall accuracy
print(f'Accuracy: {accuracy_score(y_train_5, y_train_never_5_pred):.3f}')
print()
True Negatives: 54579
False Positives: 0
True Positives: 0
False Negatives: 5421

True Negative Rate (Specificity): 1.000
Positive Predictive Value (Precision): 0.000

True Positive Rate (Recall): 0.000
False Negative Rate (Recall Complement): 1.000

Accuracy: 0.910

Despite its roughly 91% accuracy, the baseline classifier has a precision and recall of zero. These metrics show why accuracy alone can be misleading when evaluating classifiers, especially on unbalanced datasets. We’ll draw confusion matrices for multi-class classification problems in the next post.