Evaluating Binary Classifiers: From Confusion Matrices to Lift Curves

A comprehensive exploration of key evaluation metrics in binary classification using the Titanic dataset.

Introduction

Late on the night of April 14, 1912, during its maiden voyage from Southampton to New York City, the RMS Titanic struck an iceberg and sank in the early hours of April 15. More than 1,500 lives were lost, making it one of the deadliest peacetime maritime disasters in history.

The Titanic disaster underscores the value of safety measures and emergency planning in maritime voyages. In this analysis, we apply machine learning to the Titanic dataset to understand which factors most influenced survival rates, with the expectation that this understanding can inform decision-making in similar future circumstances.

Business Objective

Our primary objective is to build a predictive model that can reliably identify the individuals most likely to survive in comparable catastrophes. The model’s predictions can inform emergency response strategies, such as deciding which passengers to prioritize during lifeboat allocation. The overarching aim is to maximize the number of survivors.

Concerning the Titanic disaster, this could translate into prioritizing individuals who, based on our model’s prediction, show higher chances of surviving after securing a spot on a lifeboat. Thus, we end up maximizing the overall survival rate in such situations.

Hypotheses

Drawing from the historical background of the Titanic disaster, we hypothesize that specific factors like passenger class, sex, and age could have greatly influenced survival probabilities. Common knowledge suggests that women and children were given priority while allocating lifeboats. In a similar vein, first-class passengers might have enjoyed easier access to lifeboats compared to their counterparts in lower classes.

Through our analysis, we will test these hypotheses and look for any other factors that influenced survival rates. The resulting insights will help fine-tune our predictive model, with the aim of delivering accurate predictions that can support future crisis management.

Confusion Matrix Introduction

A Confusion Matrix, also known as an error matrix, is a table layout that visualizes the performance of a classification algorithm, typically a supervised learning one. It is called a confusion matrix because it makes it easy to see where the model confuses one class with another.

The confusion matrix consists of two dimensions: “actual” and “predicted”, each divided into “positive” and “negative” classes. This results in a 2x2 matrix structure with four components:

  • True Negatives (TN) or Correct Misses: The instances that were negative and have been correctly classified by the model.
  • False Positives (FP) or Incorrect Hits: The instances that were negative but have been incorrectly classified as positive by the model.
  • True Positives (TP) or Correct Hits: The instances that were positive and have been correctly classified by the model.
  • False Negatives (FN) or Incorrect Misses: The instances that were positive but have been incorrectly classified as negative by the model.
                   Predicted Negative                   Predicted Positive
Actual Negative    True Negatives (Correct Misses)      False Positives (Incorrect Hits)
Actual Positive    False Negatives (Incorrect Misses)   True Positives (Correct Hits)

By analyzing the confusion matrix, we can derive several evaluation metrics that help in understanding the performance of the classification model.

Metric                                      Formula
Specificity / True Negative Rate            TN / (TN + FP)
Sensitivity / Recall / True Positive Rate   TP / (TP + FN)
Precision                                   TP / (TP + FP)
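
To make these formulas concrete, here is a minimal sketch using small made-up label vectors: it reads TN, FP, FN, and TP off scikit-learn’s confusion_matrix and computes the three metrics from the table above.

import numpy as np
from sklearn.metrics import confusion_matrix

# Small made-up example: 1 = positive class, 0 = negative class
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 1])

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)   # true positive rate / recall
specificity = tn / (tn + fp)   # true negative rate
precision = tp / (tp + fp)

print(f"Sensitivity: {sensitivity:.2f}")   # 0.75
print(f"Specificity: {specificity:.2f}")   # 0.50
print(f"Precision: {precision:.2f}")       # 0.60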

Metrics

Sensitivity (True Positive Rate or Recall)

Sensitivity is the proportion of actual positive cases that were predicted as positive (that is, how many of the actual positives the model recalled). The formula is given by:

$\text{Sensitivity} = \frac{{TP}}{{TP + FN}}$

where (TP) is True Positive and (FN) is False Negative.

Specificity (True Negative Rate)

Specificity is the proportion of actual negative cases that got predicted as negative. The formula is given by:

$\text{Specificity} = \frac{{TN}}{{TN + FP}}$

where (TN) is True Negative and (FP) is False Positive.

Precision

Precision is the proportion of predicted positive cases that were correct. A model with high precision may still miss a lot of positive examples (low recall) but what it predicts as positive is indeed positive (few false positives). The formula is given by:

$\text{Precision} = \frac{{TP}}{{TP + FP}}$

where (TP) is True Positive and (FP) is False Positive.

Recall

Recall is the same as Sensitivity.

F1 Score

The F1 Score is an important evaluation metric for binary classification problems. It is the harmonic mean of Precision and Recall, making it a good measure for classifiers trained on imbalanced class distributions or in scenarios where both false positives and false negatives matter.

Definition:

$F_1\ \text{Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

The F1 Score aims to find the balance between precision and recall, encapsulating both dimensions in a single score. The harmonic mean used in the F1 Score does this effectively because it tends towards the smaller of the two values, so the F1 Score is only high when both recall and precision are high. For example, a precision of 0.9 paired with a recall of 0.1 yields an F1 Score of only 0.18, even though their arithmetic mean is 0.5.

Therefore, a higher F1 Score suggests a more robust model.

The maximum value of the F1 Score is 1, representing perfect precision and recall. Conversely, the worst-case scenario is indicated by a score of 0.

For different tradeoffs between precision and recall, one can also use the generalization of the F1 Score, known as the ( $F_{\beta}$ ) score.
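
As a brief illustration with made-up labels, the sketch below compares the F1 Score with an F2 Score computed via scikit-learn’s fbeta_score. The general formula is $F_{\beta} = (1 + \beta^2) \times \frac{\text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}$; with $\beta = 2$ recall is weighted more heavily than precision, so a model that misses many positives is penalized harder.

from sklearn.metrics import f1_score, fbeta_score

# Made-up labels where the model misses several positives (recall is low)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# F1 weights precision and recall equally; F2 (beta=2) emphasizes recall
print(f"F1: {f1_score(y_true, y_pred):.2f}")             # 0.50
print(f"F2: {fbeta_score(y_true, y_pred, beta=2):.2f}")  # 0.43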

To summarise, the F1 Score is a valuable tool to gauge the performance of your binary classification model, especially in challenging situations where both false positives and false negatives carry significant weight.

Classification Metrics Evaluation Code

The provided script performs a comprehensive evaluation of different classification scenarios using key metrics like Sensitivity (Recall), Specificity, Precision, and the F1 score.

Here’s an explanation of its functions:

Importing Relevant Libraries

  • The script starts by importing precision_score and f1_score from sklearn.metrics, along with pandas for building the example dataframes.

Metric Calculation Function

  • A function named calculate_metrics(df) is defined. It accepts a dataframe df with ‘True Values’ and ‘Predicted Values’ columns and computes four key metrics:
    • Sensitivity (Recall): Represents the ratio of correctly identified positive instances from all actual positive cases.
    • Specificity: Denotes the ratio of correctly identified negative instances from all actual negative cases.
    • Precision: Explains the fraction of relevant instances among the retrieved instances.
    • F1 Score: Gives the harmonic mean of precision and recall.

Classification Scenarios Dataframe Creation

  • Four distinct dataframes representing varying classification scenarios are created and displayed along with their corresponding calculated metrics. These encompass:
    • Scenario where all predictions are False.
    • Scenario with occurrence of false positive predictions.
    • Scenario with incidence of false negative predictions.
    • Scenario portraying a perfect classifier (where all predictions align with the actual values).

Classification Metrics Output

  • Finally, it prints the Sensitivity, Specificity, Precision and F1 score for each scenario to show how these metrics behave under different conditions.

This process provides practical insight into how classification models are assessed in different circumstances where true and predicted values may differ. Understanding these metrics helps you choose and tune models more effectively.

from sklearn.metrics import precision_score, f1_score
import pandas as pd

def calculate_metrics(df):
    # Calculate sensitivity (recall)
    TP = ((df["True Values"] == 1) & (df["Predicted Values"] == 1)).sum()
    FN = ((df["True Values"] == 1) & (df["Predicted Values"] == 0)).sum()
    sensitivity = TP / (TP + FN)

    # Calculate specificity
    TN = ((df["True Values"] == 0) & (df["Predicted Values"] == 0)).sum()
    FP = ((df["True Values"] == 0) & (df["Predicted Values"] == 1)).sum()
    specificity = TN / (TN + FP)

    # Calculate precision using sklearn
    precision = precision_score(
        df["True Values"], df["Predicted Values"], zero_division=0
    )

    # Calculate F1 score using sklearn
    f1 = f1_score(df["True Values"], df["Predicted Values"])

    # Calculate metrics and print with labels
    print()
    print(f"Sensitivity: {round(sensitivity, 2)}")
    print(f"Specificity: {round(specificity, 2)}")
    print(f"Precision: {round(precision, 2)}")
    print(f"F1 Score: {round(f1, 2)}")

# Create dataframes for each scenario
df_all_false = pd.DataFrame(
    {"True Values": [0] * 5 + [1] * 5, "Predicted Values": [0] * 10}
)
df_some_false_positives = pd.DataFrame(
    {"True Values": [0] * 5 + [1] * 5, "Predicted Values": [0] * 3 + [1] * 7}
)
df_some_false_negatives = pd.DataFrame(
    {"True Values": [0] * 5 + [1] * 5, "Predicted Values": [0] * 7 + [1] * 3}
)
df_perfect = pd.DataFrame(
    {"True Values": [0] * 5 + [1] * 5, "Predicted Values": [0] * 5 + [1] * 5}
)

# Calculate and print metrics for each dataframe
print("All False Predictions")
print(df_all_false)
calculate_metrics(df_all_false)

print("\nSome False Positives")
print(df_some_false_positives)
calculate_metrics(df_some_false_positives)

print("\nSome False Negatives")
print(df_some_false_negatives)
calculate_metrics(df_some_false_negatives)

print("\nPerfect Classifier")
print(df_perfect)
calculate_metrics(df_perfect)
All False Predictions
   True Values  Predicted Values
0            0                 0
1            0                 0
2            0                 0
3            0                 0
4            0                 0
5            1                 0
6            1                 0
7            1                 0
8            1                 0
9            1                 0

Sensitivity: 0.0
Specificity: 1.0
Precision: 0.0
F1 Score: 0.0

Some False Positives
   True Values  Predicted Values
0            0                 0
1            0                 0
2            0                 0
3            0                 1
4            0                 1
5            1                 1
6            1                 1
7            1                 1
8            1                 1
9            1                 1

Sensitivity: 1.0
Specificity: 0.6
Precision: 0.71
F1 Score: 0.83

Some False Negatives
   True Values  Predicted Values
0            0                 0
1            0                 0
2            0                 0
3            0                 0
4            0                 0
5            1                 0
6            1                 0
7            1                 1
8            1                 1
9            1                 1

Sensitivity: 0.6
Specificity: 1.0
Precision: 1.0
F1 Score: 0.75

Perfect Classifier
   True Values  Predicted Values
0            0                 0
1            0                 0
2            0                 0
3            0                 0
4            0                 0
5            1                 1
6            1                 1
7            1                 1
8            1                 1
9            1                 1

Sensitivity: 1.0
Specificity: 1.0
Precision: 1.0
F1 Score: 1.0

Comprehensive Machine Learning Pipeline for Titanic Dataset

The code below builds a machine learning pipeline for five different classification models (Logistic Regression, Decision Tree, Random Forest, SVM, and KNN), applied to the well-known Titanic dataset.

Initial Data Loading and Preprocessing

Importing Essential Libraries

  • The first step imports the libraries needed for data preprocessing, model construction, and performance evaluation.

Dataset Ingestion

  • The dataset is loaded from the ‘titanic_tableau.pkl’ file into a DataFrame using pd.read_pickle.

Feature and Target Variable Definition

  • The features (X) exclude the target 'survived' along with columns that would leak outcome information ('body', 'boat', 'home.dest'). The target variable (y) is 'survived'.

Dataset Division

  • The dataset is partitioned into training and test sets with an 80:20 split (random_state=42 for reproducibility).

Transforming Features

Categorical and Numeric Column Identification

  • The names of the numeric and categorical columns are stored in numeric_features and categorical_features, respectively.

Defining Preprocessing Steps

  • Two separate pipelines are defined: one for scaling numeric features (StandardScaler) and another for encoding categorical features (OneHotEncoder). These transformers are combined in a ColumnTransformer.

Model Construction and Performance Evaluation

Model Initialization

  • The five classifiers are instantiated.

Training and Performance Assessment

  • Each model is trained on X_train after the preprocessing steps are applied and then evaluated on X_test. The predictions are compared to the actual y_test values to produce a classification report.

Classification Metrics Report

The classification_report function offers a summary of main classification metrics such as precision, recall, f1-score, and support:

  • Precision: The classifier’s ability to avoid labelling a negative sample as positive (TP / (TP + FP)).
  • Recall: The classifier’s ability to identify all positive instances (TP / (TP + FN)).
  • F1-Score: The harmonic mean of precision and recall [(2 * Precision * Recall) / (Precision + Recall)].
  • Support: The number of occurrences of the class in the test set.
  • Macro avg: The unweighted mean of each metric across labels.
  • Weighted avg: The mean of each metric across labels, weighted by each label’s support (see the short sketch below).
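
As a small sketch with made-up, imbalanced labels, the snippet below computes the per-class metrics with scikit-learn’s precision_recall_fscore_support and then reproduces the macro and weighted F1 averages by hand.

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up imbalanced labels: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

# Arrays with one entry per class (index 0 = class 0, index 1 = class 1)
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
print("Per-class F1:", f1.round(2), "Support:", support)   # [0.8 0.4], [8 2]

# Macro avg: unweighted mean across classes
print("Macro F1:", np.mean(f1).round(2))                   # 0.6

# Weighted avg: mean across classes, weighted by support
print("Weighted F1:", np.average(f1, weights=support).round(2))  # 0.72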

In conclusion, this script serves as a classic example of building and evaluating multiple machine learning classifiers simultaneously using pipelines and transformers while managing both numeric and categorical features effectively.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load the data
titanic_tableau = pd.read_pickle('titanic_tableau.pkl')

# Define the features and target
X = titanic_tableau.drop(['survived', 'body', 'boat', 'home.dest'], axis=1)
y = titanic_tableau['survived']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define preprocessor
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Define the models
logreg = LogisticRegression()
dtree = DecisionTreeClassifier()
rforest = RandomForestClassifier()
svm = SVC(probability=True)
knn = KNeighborsClassifier()

# Train and evaluate each model
models = {
    'Logistic Regression': logreg,
    'Decision Tree': dtree,
    'Random Forest': rforest,
    'SVM': svm,
    'KNN': knn
}

# Store the predictions of each model
predictions = {}

for name, model in models.items():
    clf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
    clf.fit(X_train, y_train)
    predictions[name] = clf.predict(X_test)
    print(f'{name} Model')
    print(classification_report(y_test, predictions[name]))
    print('-'*60)
Logistic Regression Model
              precision    recall  f1-score   support

           0       0.75      0.88      0.81       144
           1       0.82      0.64      0.72       118

    accuracy                           0.77       262
   macro avg       0.78      0.76      0.77       262
weighted avg       0.78      0.77      0.77       262

------------------------------------------------------------
Decision Tree Model
              precision    recall  f1-score   support

           0       0.75      0.87      0.81       144
           1       0.80      0.65      0.72       118

    accuracy                           0.77       262
   macro avg       0.78      0.76      0.76       262
weighted avg       0.78      0.77      0.77       262

------------------------------------------------------------
Random Forest Model
              precision    recall  f1-score   support

           0       0.75      0.89      0.81       144
           1       0.82      0.64      0.72       118

    accuracy                           0.77       262
   macro avg       0.79      0.76      0.77       262
weighted avg       0.78      0.77      0.77       262

------------------------------------------------------------
SVM Model
              precision    recall  f1-score   support

           0       0.77      0.87      0.82       144
           1       0.81      0.69      0.74       118

    accuracy                           0.79       262
   macro avg       0.79      0.78      0.78       262
weighted avg       0.79      0.79      0.78       262

------------------------------------------------------------
KNN Model
              precision    recall  f1-score   support

           0       0.75      0.85      0.80       144
           1       0.79      0.65      0.71       118

    accuracy                           0.76       262
   macro avg       0.77      0.75      0.76       262
weighted avg       0.77      0.76      0.76       262

------------------------------------------------------------

Detailed Performance Metrics of Multiple Classifiers

Below, we have analyzed the performance metrics of each of the five classifiers used in our Titanic dataset model.

Logistic Regression Model

  • Accuracy: The model has an accuracy of 0.77, implying it correctly predicted passenger survival 77% of the time.
  • Precision: For predicting non-survival (0) it is 0.75 and for survival (1) it is 0.82. This indicates that when the model predicted a passenger would not survive, it was correct 75% of the time, and when it predicted a passenger would survive, it was accurate 82% of the time.
  • Recall: For non-survival it is 0.88 and for survival it is 0.64, meaning the model correctly identified 88% of the non-survivors and 64% of the survivors.

Decision Tree Model

  • Accuracy: This classifier holds an accuracy of 0.77.
  • Precision: For predicting non-survival it stands at 0.75 and for predicting survival it is 0.80.
  • Recall: For non-survival it is 0.87 and for survival it is 0.65.

Random Forest Model

  • Accuracy: The Random Forest model has an accuracy of 0.77.
  • Precision: For non-survival predictions it is 0.75 and for survival predictions it is 0.82.
  • Recall: For non-survival it is 0.89 and for survival it is 0.64.

SVM Model

  • Accuracy: This model exhibits an accuracy of 0.79.
  • Precision: For non-survival it is 0.77 while for survival it is 0.81.
  • Recall: The recall measures are 0.87 for non-survival and 0.69 for survival.

KNN Model

  • Accuracy: The KNN model displays an accuracy of 0.76.
  • Precision: For non-survival and survival predictions it is 0.75 and 0.79 respectively.
  • Recall: For non-survival it is 0.85 and for survival it is 0.65.

In addition to accuracy, precision and recall are vital performance indicators, particularly when dealing with imbalanced datasets; the Titanic data is moderately imbalanced, with more passengers dying than surviving. While accuracy measures the overall correctness of the classifier, it does not show how the errors are distributed across the classes, so a model with high accuracy can still produce many false positives or false negatives, which can be problematic.

A robust way to visualize the performance of a classifier is to examine the confusion matrix, which provides detailed insights into the true positives, true negatives, false positives, and false negatives obtained by the model. By visualizing these matrices for all our models, we can better understand their performances and make more informed decisions about potential improvements or selection.

In the coming section, we will plot the confusion matrices of each of our classifiers using seaborn to get a well-rounded view of how the models perform on the Titanic test set.

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

fig, axs = plt.subplots(2, 3, figsize=(15, 10))
axs = axs.ravel()

for i, (name, pred) in enumerate(predictions.items()):
    sns.heatmap(confusion_matrix(y_test, pred), annot=True, fmt='d', ax=axs[i],
                cmap='Blues', cbar=False, annot_kws={"size": 16})
    axs[i].set_title('Confusion Matrix for ' + name, fontsize=15)
    axs[i].set_xlabel('Predicted', fontsize=12)
    axs[i].set_ylabel('Truth', fontsize=12)
    axs[i].tick_params(axis='both', which='major', labelsize=10)
    axs[i].tick_params(axis='both', which='minor', labelsize=10)
    axs[i].set_xticklabels(['Died (0)', 'Survived (1)'])
    axs[i].set_yticklabels(['Died (0)', 'Survived (1)'])

# Remove the unused subplot (if the number of models is less than 6)
if len(models) < 6:
    fig.delaxes(axs[-1])

plt.tight_layout()
plt.show()

Confusion Matrices of Various Classifier Models

Let’s take a look at the confusion matrices for each of our classification models and break down their predictions.

1. Logistic Regression Model

Here is the confusion matrix for the Logistic Regression model:

                  Predicted Died   Predicted Survived
Actual Died            127                 17
Actual Survived         42                 76

From this, we can interpret that:

  • True Negatives: The model correctly predicted 127 passengers who died.
  • True Positives: It also accurately identified 76 passengers who survived.
  • False Positives: However, it mistakenly predicted that 17 passengers would survive when they actually died.
  • False Negatives: And, it wrongly predicted that 42 passengers would die when they actually survived.

2. Decision Tree Model

Now, let’s see the confusion matrix for the Decision Tree model:

                  Predicted Died   Predicted Survived
Actual Died            128                 16
Actual Survived         43                 75

The decision tree model demonstrated:

  • True Negatives: Correctly predicted 128 passengers who died.
  • True Positives: Correctly predicted 75 passengers who would survive.
  • False Positives: Incorrectly predicted that 16 passengers would survive when they died.
  • False Negatives: Unfortunately, it also misjudged that 43 passengers would die when they survived.

3. Random Forest Model

For the Random Forest model, the confusion matrix reads as follows:

                  Predicted Died   Predicted Survived
Actual Died            127                 17
Actual Survived         38                 80

This model resulted in:

  • True Negatives: Correct prediction for 127 passengers who died.
  • True Positives: Accurately identified survival for 80 passengers.
  • False Positives: Conversely, it falsely predicted that 17 passengers would survive when they actually died.
  • False Negatives: Made a wrong call saying 38 passengers would die when they actually survived.

4. SVM Model

The following table represents the confusion matrix for the SVM model:

                  Predicted Died   Predicted Survived
Actual Died            125                 19
Actual Survived         37                 81

The SVM model has:

  • True Negatives: 125 correct predictions of passenger deaths.
  • True Positives: 81 correct predictions of passenger survivals.
  • False Positives: Unfortunately, 19 incorrect survivability predictions for passengers who died.
  • False Negatives: Also, 37 incorrect death predictions for passengers who survived.

5. KNN Model

Finally, let’s check out the confusion matrix for the KNN model:

                  Predicted Died   Predicted Survived
Actual Died            123                 21
Actual Survived         41                 77

We can infer from this model that:

  • True Negatives: There were 123 accurate predictions of passenger deaths.
  • True Positives: It correctly determined 77 passenger survivals.
  • False Positives: Alas, it mistook 21 death incidents as survivals.
  • False Negatives: Misinterpreted 41 survival cases for deaths.

Model Evaluation with ROC Curves and AUC

Let’s proceed to the evaluation phase of our prediction models using the Receiver Operating Characteristic (ROC) curves and a crucial metric known as the Area Under the Curve (AUC).

What Are ROC Curves?

The ROC curve is a graphical representation that highlights the performance of a binary classifier model as its discrimination threshold varies. It plots True Positive Rate (TPR), or sensitivity, against False Positive Rate (FPR) for various threshold values, offering a broader picture of the model’s capability.
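
As a small sketch with made-up scores, the loop below shows how each point on an ROC curve arises: pick a threshold, classify every instance with a score at or above it as positive, and record the resulting TPR and FPR; sweeping the threshold traces out the curve.

import numpy as np

# Made-up labels and predicted probabilities
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.2, 0.6, 0.7, 0.3, 0.4, 0.9, 0.5, 0.8])

# Sweep a few thresholds and compute TPR and FPR at each one
for threshold in [0.3, 0.5, 0.7]:
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    tpr = tp / (tp + fn)  # sensitivity
    fpr = fp / (fp + tn)  # 1 - specificity
    print(f"threshold={threshold:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")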

The Role of AUC (Area Under the Curve)

In the context of our problem, where we aim to build a precise predictive model that can efficiently identify individuals with the highest survival probability amidst comparable catastrophes, the AUC metric is particularly relevant.

The AUC, or Area Under the ROC Curve, quantifies the overall performance of a model across all possible classification thresholds. The ROC curve runs from (0,0) to (1,1), and the area beneath it ranges from 0 to 1: an AUC of 1 indicates perfect predictions, an AUC of 0.5 corresponds to random guessing, and an AUC of 0 signifies completely inverted predictions.

Here’s why the AUC is crucial in our context:

  1. Ranking of Predictions: The AUC measures the ability of the model to rank predictions. This is vital, as we want to prioritize individuals who, based on our prediction, show higher chances of surviving. By ranking individuals by their predicted probability of survival, we can prioritize those with higher probabilities, thus maximizing the overall survival rate.

  2. Performance Across Different Thresholds: The AUC measures the performance of the model across all possible classification thresholds, not just the default threshold of 0.5. This is useful because we might not have a clear threshold for determining who is likely to survive. By considering all possible thresholds, we can ensure that our model performs well across a range of potential cutoffs for survival probability.

  3. Balancing True Positive and False Positive Rates: The AUC takes into account both the true positive rate (sensitivity) and the false positive rate (1-specificity). This is important because we want to correctly identify as many survivors as possible (high sensitivity), but we also want to avoid falsely identifying non-survivors as survivors (low false positive rate). The AUC helps us balance these two objectives.

In summary, while the AUC is not the only metric we should care about in this case, it provides valuable information about the performance of our model. It helps us understand how well our model ranks individuals by their probability of survival, how it performs across different thresholds, and how well it balances sensitivity and specificity. As always, it’s a good idea to consider multiple metrics when evaluating our model.
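
To make the ranking interpretation concrete, here is a small sketch with made-up survival scores: the AUC equals the probability that a randomly chosen positive instance is scored higher than a randomly chosen negative one (ties counted as one half), which we can check against scikit-learn’s roc_auc_score.

import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted survival probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.55, 0.9])

# AUC via scikit-learn
print(roc_auc_score(y_true, scores))   # ~0.833

# AUC as a ranking probability: fraction of (positive, negative) pairs
# where the positive instance is scored higher (ties count as 0.5)
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairs = (pos[:, None] > neg[None, :]).astype(float) + 0.5 * (pos[:, None] == neg[None, :])
print(pairs.mean())                    # ~0.833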

Analysis of Classifier Performance Using ROC Curves

Overview

This Python code snippet analyzes the performance of various classification models using Receiver Operating Characteristic (ROC) curves. The ROC curve is a powerful tool for understanding the performance of binary classifiers in machine learning.

Detailed Description

Import Required Module

from sklearn.metrics import roc_curve, auc

This line imports two essential functions from the sklearn.metrics module: roc_curve and auc. These will be useful for generating ROC curves and computing the Area Under the Curve (AUC), respectively.

Storing Model Probabilities

# Store the probabilities of each model
probabilities = {}

for name, model in models.items():
    try:
        clf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
        clf.fit(X_train, y_train)
        probabilities[name] = clf.predict_proba(X_test)[:, 1]
    except AttributeError:
        print(f'The model {name} does not have a predict_proba method.')

Here we loop through each model, fit it to our training data (X_train, y_train), and then apply it to our test data (X_test). Column 1 of predict_proba’s output, the predicted probability of the positive class (survival), is saved in the probabilities dictionary.

If a model does not have a method called predict_proba, an error message is printed notifying which model lacks that capability.

Creating and Plotting ROC Curves

# Create ROC curves
plt.figure(figsize=(10, 10))
for name, prob in probabilities.items():
    fpr, tpr, _ = roc_curve(y_test, prob)
    roc_auc = auc(fpr, tpr)
    print(f'{name} AUC: {roc_auc:.3f}')
    plt.plot(fpr, tpr, label='%s (AUC = %0.2f)' % (name, roc_auc))
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

We then use the stored probabilities to compute ROC curves for each model, calculate the corresponding AUC values, and plot the curves. The diagonal line represents a classifier that makes random guesses. The closer a ROC curve is to the upper left corner, the better the classifier’s performance.

In summary, this code is used to evaluate, compare, and visualize the performance of different machine learning classification models.

from sklearn.metrics import roc_curve, auc

# Store the probabilities of each model
probabilities = {}

for name, model in models.items():
    try:
        clf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
        clf.fit(X_train, y_train)
        probabilities[name] = clf.predict_proba(X_test)[:, 1]
    except AttributeError:
        print(f'The model {name} does not have a predict_proba method.')

# Create ROC curves
plt.figure(figsize=(10, 10))
for name, prob in probabilities.items():
    fpr, tpr, _ = roc_curve(y_test, prob)
    roc_auc = auc(fpr, tpr)
    print(f'{name} AUC: {roc_auc:.3f}')
    plt.plot(fpr, tpr, label='%s (AUC = %0.2f)' % (name, roc_auc))
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()
Logistic Regression AUC: 0.848
Decision Tree AUC: 0.759
Random Forest AUC: 0.843
SVM AUC: 0.825
KNN AUC: 0.818

The Area Under the Curve (AUC) for each model is as follows:

  1. Logistic Regression: 0.848
  2. Decision Tree: 0.759
  3. Random Forest: 0.843
  4. Support Vector Machine (SVM): 0.825
  5. K-Nearest Neighbors (KNN): 0.818

The AUC score is used in classification analysis to determine which of the candidate models best separates the classes. An excellent model has an AUC near 1, which means it has a good measure of separability. A model with an AUC near 0 has the worst measure of separability: it is effectively inverting the result, predicting 0s as 1s and 1s as 0s. When the AUC is 0.5, the model has no class-separation capacity whatsoever.

In our case, the Logistic Regression model has the highest AUC score, indicating that it has the best measure of separability—it can distinguish between passengers who survived and those who didn’t better than the other models. The Decision Tree model has the lowest AUC score, indicating that it has the worst measure of separability among the five models.

It’s important to note that while the AUC score is a useful metric, it shouldn’t be the only metric you rely on when choosing a model. Other factors, such as the interpretability of the model, the computational resources available, and the specific use case, should also be considered.

Precision-Recall Curve: A Key Evaluation Metric

In the context of our problem, where we aim to build a precise predictive model to identify individuals with the highest survival probability amidst comparable catastrophes, the Precision-Recall curve is a crucial evaluation metric. This curve depicts the tradeoff between precision and recall for different threshold values, which are key metrics in our scenario.

Precision answers the question: “What proportion of passengers that we labeled as ‘likely to survive’ actually survived?” It is the ratio of true positives (passengers that survived and we correctly identified as such) to all positives (all passengers we labeled as ‘likely to survive’, whether they actually survived or not). In our case, a high precision means that if our model predicts a passenger as likely to survive, there’s a high probability that they indeed survived.

Recall answers the question: “What proportion of actual survivors did we correctly label?” It is the ratio of true positives (passengers that survived and we correctly identified as such) to all actual survivors (whether we labeled them correctly or not). In our scenario, a high recall means that our model is able to correctly identify a high proportion of the survivors.

The Precision-Recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all positive results (high recall).

Evaluation of Classification Models Using Precision-Recall Curves

Overview

This Python code snippet evaluates the performance of multiple classification models using Precision-Recall (PR) curves. The PR curve is a common tool for understanding the trade-off between precision and recall across the different thresholds of a binary classifier.

Detailed Description

Import Required Module

from sklearn.metrics import precision_recall_curve, average_precision_score

This line imports two required functions from the sklearn.metrics library: precision_recall_curve and average_precision_score. These are required to generate the PR curves and compute the Area Under the Curve (AUC-PR), respectively.

Generating Model Probabilities

# Store the probabilities of each model
probabilities = {}

for name, model in models.items():
    try:
        clf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
        clf.fit(X_train, y_train)
        probabilities[name] = clf.predict_proba(X_test)[:, 1]
    except AttributeError:
        print(f'The model {name} does not have a predict_proba method.')

This block iterates over each model, fits it to the training data (X_train, y_train), and generates the model’s predicted probabilities on the test data (X_test). The probabilities are stored in the probabilities dictionary.

If a model does not contain a predict_proba method, an error message is printed notifying the user which model lacks this ability.

Creating and Plotting Precision-Recall Curves

# Create Precision-Recall curves
plt.figure(figsize=(10, 10))
for name, prob in probabilities.items():
    precision, recall, _ = precision_recall_curve(y_test, prob)
    pr_auc = average_precision_score(y_test, prob)
    print(f'{name} AUC-PR: {pr_auc:.3f}')
    plt.plot(recall, precision, label='%s (AUC-PR = %0.2f)' % (name, pr_auc))
plt.autoscale()
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.show()

In this block, Precision-Recall values are calculated for each model using the previously generated probabilities. The corresponding AUC-PR values are computed, and the PR curves are plotted.

In conclusion, this code is used to compare, assess and visualize the performance of various machine learning classification models using precision-recall curves.

from sklearn.metrics import precision_recall_curve, average_precision_score

# Store the probabilities of each model
probabilities = {}

for name, model in models.items():
    try:
        clf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
        clf.fit(X_train, y_train)
        probabilities[name] = clf.predict_proba(X_test)[:, 1]
    except AttributeError:
        print(f'The model {name} does not have a predict_proba method.')

# Create Precision-Recall curves
plt.figure(figsize=(10, 10))
for name, prob in probabilities.items():
    precision, recall, _ = precision_recall_curve(y_test, prob)
    pr_auc = average_precision_score(y_test, prob)
    print(f'{name} AUC-PR: {pr_auc:.3f}')
    plt.plot(recall, precision, label='%s (AUC-PR = %0.2f)' % (name, pr_auc))
plt.autoscale()
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc='lower left')
plt.show()
Logistic Regression AUC-PR: 0.847
Decision Tree AUC-PR: 0.681
Random Forest AUC-PR: 0.842
SVM AUC-PR: 0.827
KNN AUC-PR: 0.746
  1. Logistic Regression: The Logistic Regression model has an AUC-PR of 0.847. This indicates that the model has a high ability to distinguish between the classes (Survived and Died) when considering both precision and recall. This model performs well in terms of balancing the trade-off between precision and recall.

  2. Decision Tree: The Decision Tree model has an AUC-PR of 0.681. This is significantly lower than the Logistic Regression model, indicating that this model has a lower ability to distinguish between the classes when considering both precision and recall. This model may not be the best choice if we want to balance the trade-off between precision and recall.

  3. Random Forest: The Random Forest model has an AUC-PR of 0.842, just below the Logistic Regression model. This suggests that the two models are nearly tied in balancing the trade-off between precision and recall, and the Random Forest could also be a good choice if we want to maximize both precision and recall.

  4. SVM: The SVM model has an AUC-PR of 0.827. This is lower than both the Logistic Regression and Random Forest models, but higher than the Decision Tree model. This suggests that the SVM model has a good ability to distinguish between the classes when considering both precision and recall, but it may not be the best model if we want to maximize both precision and recall.

  5. KNN: The KNN model has an AUC-PR of 0.746. This is higher than the Decision Tree model but lower than all the other models, suggesting a moderate ability to distinguish between the classes when considering both precision and recall. This model may not be the best choice if we want to balance the trade-off between precision and recall.

Looking Beyond AUC-PR: Cumulative Response and Lift Curves

To further evaluate and compare our classification models – particularly useful in marketing applications aimed at identifying subsets most likely to respond positively – we can use Cumulative Response Curves (CRC) and Lift Curves.

Cumulative Response Curve (CRC)

A CRC plots the percentage of positive responses captured as a function of the percentage of the population targeted. To construct this curve, the population is sorted by the predicted probability of a positive response, from highest to lowest. An ideal CRC rises steeply, reaching 100% after targeting only a small fraction of the population, signaling the model’s efficacy in identifying positive responders.

Lift Curve

The Lift Curve, linked to CRC, represents the improvement offered by using the model over random targeting. The lift is the ratio of positive response percentage when using model targeting versus random targeting. A lift greater than 1 implies the model performs better than random targeting; the larger the value, the better the performance.
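
As a quick illustration with made-up scores, the sketch below computes the cumulative response and lift at a single targeting depth: if targeting the top 30% of instances by predicted score captures 50% of all positives, the lift at that depth is about 1.67.

import numpy as np

# Made-up scores and outcomes; 1 = positive response (e.g. survived)
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
scores = np.array([0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1])

depth = 0.3  # target the top 30% of instances by score

# Sort outcomes by decreasing score and keep the targeted fraction
order = np.argsort(-scores)
top_k = int(np.ceil(depth * len(y_true)))
captured = y_true[order][:top_k].sum()

cumulative_response = captured / y_true.sum()   # share of all positives captured
lift = cumulative_response / depth              # improvement over random targeting

print(f"Captured {cumulative_response:.0%} of positives at depth {depth:.0%}, lift = {lift:.2f}")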

Applying these curves to our case, we can gauge how well our models identify passengers who would survive the Titanic disaster. A CRC shows what share of survivors we would identify if we targeted a given percentage of passengers, starting with those the model considers most likely to survive. A Lift Curve, in turn, shows how much better the model does at identifying survivors than random targeting.

In the following sections, we adapt the given code to our problem and data in order to generate these curves. We calculate survival probabilities for each passenger using each model’s predict_proba method; sorting passengers by these probabilities then lets us compute the cumulative response and lift, which we plot to produce the curves.

Let’s now delve deeper into creating these essential evaluation curves.

Performance Evaluation of Classification Models Using Cumulative Response and Lift Curves

Overview

This Python code snippet is used to evaluate different classification models using cumulative response and lift curves.

Detailed Description

Importing Required Modules

import numpy as np

The numpy library is imported for numerical operations.

Generating Model Probabilities

# Store the probabilities of each model
probabilities = {}

for name, model in models.items():
    try:
        clf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
        clf.fit(X_train, y_train)
        probabilities[name] = clf.predict_proba(X_test)[:, 1]
    except AttributeError:
        print(f'The model {name} does not have a predict_proba method.')

In this block, each model in our collection is fitted to the training data (X_train, y_train), and probability scores are then computed for each test instance. These probabilities are stored in the probabilities dictionary.

Each key-value pair in the probabilities dictionary corresponds to a model name and an array of its predicted probabilities respectively.

If a model doesn’t have the predict_proba method, an error message will be printed.

Creating and Plotting Cumulative Response and Lift Curves

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 7))

N = len(y_test)
xs = np.linspace(1/N, 1, N)

ax1.plot(xs, xs, 'b-')
ax2.plot(xs, xs, 'b-')

for name, prob in probabilities.items():
    myorder = np.argsort(-prob)
    y_test_num = y_test.astype('int')
    realpct_myorder = y_test_num.values[myorder].cumsum()
    realpct_myorder = realpct_myorder / realpct_myorder[-1]
    ax1.plot(xs, realpct_myorder, '.', label=name)
    ax2.plot(xs, realpct_myorder / xs, label=name)

ax1.set_title('Cumulative Response Curve')
ax1.set_xlabel('Percentage of test instances (decreasing by score)')
ax1.set_ylabel('Percentage of positive instances')
ax1.legend()

ax2.set_title('Lift Curve')
ax2.set_xlabel('Percentage of test instances (decreasing by score)')
ax2.set_ylabel('Lift (times)')
ax2.legend()

plt.tight_layout()
plt.show()

Here, two graphs (cumulative response curve and lift curve) are plotted for each model using their previously computed probabilities.

The Cumulative Response Curve (CRC) shows the proportion of positive instances as we move through our instances sorted in decreasing order based on class probability.

The Lift Curve shows how much better one can expect to do with the predictive model compared to no model. It is calculated as the ratio of the result obtained with the predictive model to the result expected without a model (i.e., random targeting).

These plots provide a visual representation of the performance of each model, allowing you to easily compare them.

import numpy as np

# Store the probabilities of each model
probabilities = {}

for name, model in models.items():
    try:
        clf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', model)])
        clf.fit(X_train, y_train)
        probabilities[name] = clf.predict_proba(X_test)[:, 1]
    except AttributeError:
        print(f'The model {name} does not have a predict_proba method.')

# Create cumulative response and lift curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 7))

N = len(y_test)
xs = np.linspace(1/N, 1, N)

# Plot the baseline for both plots
ax1.plot(xs, xs, 'b-')
ax2.plot(xs, xs, 'b-')

for name, prob in probabilities.items():
    # Sort the probabilities in descending order
    myorder = np.argsort(-prob)

    # Convert 'Survived' to numerical type
    y_test_num = y_test.astype('int')

    # Calculate the cumulative sum
    realpct_myorder = y_test_num.values[myorder].cumsum()

    # Convert to percent of total sum
    realpct_myorder = realpct_myorder / realpct_myorder[-1]

    # Plot the cumulative response curve
    ax1.plot(xs, realpct_myorder, '.', label=name)

    # Plot the lift curve
    ax2.plot(xs, realpct_myorder / xs, label=name)

# Add labels and legend
ax1.set_title('Cumulative Response Curve')
ax1.set_xlabel('Percentage of test instances (decreasing by score)')
ax1.set_ylabel('Percentage of positive instances')
ax1.legend()

ax2.set_title('Lift Curve')
ax2.set_xlabel('Percentage of test instances (decreasing by score)')
ax2.set_ylabel('Lift (times)')
ax2.legend()

plt.tight_layout()
plt.show()

The Cumulative Response Curve (CRC) and Lift Curve are visual tools we’ve used to assess the performance of our classification models. These are especially useful in marketing applications where we’re interested in targeting a specific portion of the population.

  1. Cumulative Response Curve (CRC): Our CRC shows the percentage of positive responses (in our case, passengers who survived) as a function of the percentage of the total passenger population. The diagonal line represents a random model that has no predictive power, and the curves above it represent our models. The further a model’s curve is from the diagonal line, the better the model is at ranking passengers by their survival probability.

    From our CRC, it appears that the Logistic Regression and Random Forest models perform similarly and better than the other models, as their curves are furthest from the diagonal line. This means these models are better at ranking passengers by their survival probability.

  2. Lift Curve: Our Lift Curve shows how much better a model is at predicting positive responses compared to a random model. It is calculated as the ratio of the percentage of positive responses for the model to the percentage of positive responses for a random model. The higher the lift, the better the model is at predicting positive responses compared to a random model.

    From our Lift Curve, it appears that the Logistic Regression and Random Forest models again perform similarly and better than the other models, as their curves are above the others. This means these models provide a higher lift, i.e., they are better at predicting survivors compared to a random model.

Given our business objective to maximize the survival rate, the Logistic Regression and Random Forest models seem to be the best choices as they provide the highest lift and are best at ranking passengers by their survival probability. This means they are most likely to correctly identify passengers who would survive, which is crucial for our objective of maximizing the survival rate.

Conclusion

In conclusion, our exploration of the Titanic dataset has been a comprehensive journey through various stages of a machine learning project. We started with a clear business objective: to build a predictive model that could efficiently identify individuals with the highest survival probability in similar maritime disasters. The overarching aim was to maximize potential survivors, informing emergency response strategies such as lifeboat allocation.

Throughout this project, we tested several machine learning models, including Logistic Regression, Decision Tree, Random Forest, SVM, and KNN. Each model was trained and evaluated using a robust pipeline that included preprocessing steps and model fitting. We leveraged the power of scikit-learn’s pipeline and model_selection modules to streamline this process.

To evaluate our models, we used several metrics and visualizations. We started with the traditional confusion matrix, which gave us a clear picture of each model’s performance in terms of true positives, true negatives, false positives, and false negatives. We then moved on to ROC curves and Precision-Recall curves, which provided a more nuanced view of our models' performance across different thresholds. We summarized these curves using the AUC-ROC and AUC-PR metrics, respectively.

In addition to these standard evaluation techniques, we also explored Cumulative Response and Lift curves. These curves provided us with a unique perspective on our models' performance, focusing on the ranking of predictions rather than their absolute values. This approach aligns well with our business objective, as it allows us to prioritize individuals for lifeboat allocation based on their predicted survival probability.

Our analysis revealed that different models have their strengths and weaknesses, and the choice of model depends on the specific requirements of the problem at hand. For instance, the Logistic Regression and Random Forest models achieved very similar AUC-ROC scores, with Logistic Regression edging ahead of Random Forest on both AUC-ROC and AUC-PR.

In the end, our work serves as a testament to the power and flexibility of machine learning in tackling complex, real-world problems. It underscores the importance of understanding the problem context, selecting appropriate evaluation metrics, and interpreting model results in light of the business objective. As we continue to refine our models and incorporate additional data, we look forward to further improving our ability to predict survival in maritime disasters and, ultimately, to saving more lives in such unfortunate events.