Exploring Handwritten Digit Recognition: A Machine Learning Approach

Leveraging Feature Selection, Construction, and Various Classifiers for Digit Recognition

An Introduction to Feature Selection in Machine Learning

Feature selection is a crucial step in the machine learning pipeline. It’s especially important when dealing with large datasets with numerous features. By selecting the most relevant features for our model, we can enhance its performance, reduce overfitting, improve accuracy, and decrease training time.

In the world of business, these benefits translate into more efficient operations, better decision-making, and ultimately, increased profitability. For instance, consider a postal service that needs to automatically sort mail based on handwritten zip codes, or a bank that needs to process handwritten checks. These are complex tasks that involve recognizing handwritten digits, a classic multi-class classification problem.

In this exploration, we will delve into various feature selection techniques, feature construction methods, and model comparisons. Our goal is to build a robust machine learning model capable of accurately recognizing handwritten digits. By the end of this exploration, we will have not only enhanced our understanding of feature selection and machine learning but also developed a practical solution to a real-world business problem.

Why is Feature Selection Important?

Feature selection simplifies the learning task by removing irrelevant or redundant information. In a dataset, not all features are created equal. Some have a significant impact on the output variable, while others might not have any effect at all. Some features might even confuse the model due to their irrelevance or redundancy. By selecting the most relevant features, we can focus the model’s attention on the data that matters most, improving its performance and interpretability.

Methods of Feature Selection

There are three primary ways to perform feature selection:

  1. Single-Step Filtering: This method involves evaluating each feature in isolation, then selecting some and discarding the rest. The evaluation can be based on various statistical measures like variance, correlation, or mutual information between the feature and the target variable.

  2. Select-Model-Evaluate-Repeat: This method is more iterative. We start by selecting one or more features, build a model, and assess the results. We then repeat the process, each time adding or removing features. This technique allows us to account for interactions between features and between features and the model.

  3. Embedded Feature Selection: In this method, feature selection is part of the model building process. As the model learns, it also decides which features are important. This method is often more efficient as it combines feature selection and model training into a single step.

    • Example: Lasso Regression: Lasso (Least Absolute Shrinkage and Selection Operator) regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. Lasso regression performs L1 regularization, which adds a penalty equal to the absolute value of the magnitude of coefficients. This type of regularization can result in sparse models where few features are selected.

    • Example: Decision Trees: Decision trees inherently perform feature selection. The top nodes on which the tree is split are essentially the most important variables within the dataset. Feature importance is determined by the decrease in node impurity: the greater the decrease in impurity, the more important the feature. A short, illustrative scikit-learn sketch of all three selection approaches follows this list.
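To make these three strategies concrete, here is a minimal, illustrative sketch using scikit-learn on the wine dataset (loaded fresh inside the snippet). The choice of estimators, the L1-penalized logistic regression standing in for lasso on a classification target, and the decision to keep five features are assumptions made purely for demonstration.

import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Illustrative data and feature names (the wine dataset bundled with scikit-learn)
X_wine_demo, y_wine_demo = load_wine(return_X_y=True)
wine_feature_names = np.array(load_wine().feature_names)

# 1. Single-step filtering: score each feature against the target in isolation
filter_selector = SelectKBest(mutual_info_classif, k=5).fit(X_wine_demo, y_wine_demo)
print("Filter:  ", wine_feature_names[filter_selector.get_support()])

# 2. Select-model-evaluate-repeat: recursively drop the weakest feature and refit
wrapper_selector = RFE(LogisticRegression(max_iter=5000),
                       n_features_to_select=5).fit(X_wine_demo, y_wine_demo)
print("Wrapper: ", wine_feature_names[wrapper_selector.get_support()])

# 3. Embedded: an L1 penalty (the classification analogue of the lasso) zeroes out
#    some coefficients while the model trains
embedded_selector = SelectFromModel(
    LogisticRegression(penalty='l1', solver='liblinear', C=0.1)).fit(X_wine_demo, y_wine_demo)
print("Embedded:", wine_feature_names[embedded_selector.get_support()])

# Decision trees also expose importances learned during training
tree = DecisionTreeClassifier(random_state=0).fit(X_wine_demo, y_wine_demo)
print("Tree importances:", dict(zip(wine_feature_names, tree.feature_importances_.round(3))))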

In conclusion, feature selection is a vital step in building effective and efficient machine learning models. By understanding and applying the right feature selection methods, we can build models that are not only accurate but also interpretable and efficient.

from IPython.display import display
from sklearn.datasets import load_iris, load_wine, load_diabetes
from sklearn.model_selection import train_test_split
import pandas as pd

kwargs={'test_size':.25,'random_state':42}

iris = load_iris()

(X_train_iris, X_test_iris, y_train_iris, y_test_iris) = train_test_split(iris.data, iris.target, **kwargs)

wine = load_wine()

(X_train_wine, X_test_wine, y_train_wine, y_test_wine) = train_test_split(wine.data, wine.target, **kwargs)

diabetes = load_diabetes()

(X_train_diabetes, X_test_diabetes, y_train_diabetes, y_test_diabetes) = train_test_split(diabetes.data, diabetes.target, **kwargs)

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target

wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df['target'] = wine.target

diabetes_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
diabetes_df['target'] = diabetes.target

display(iris_df.head())
display(wine_df.head())
display(diabetes_df.head())

iris_df = iris_df.drop('target', axis=1)
wine_df = wine_df.drop('target', axis=1)
diabetes_df = diabetes_df.drop('target', axis=1)

sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0

alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline target
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0 0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0 0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0 0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0 0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0 0

age sex bmi bp s1 s2 s3 s4 s5 s6 target
0 0.038076 0.050680 0.061696 0.021872 -0.044223 -0.034821 -0.043401 -0.002592 0.019907 -0.017646 151.0
1 -0.001882 -0.044642 -0.051474 -0.026328 -0.008449 -0.019163 0.074412 -0.039493 -0.068332 -0.092204 75.0
2 0.085299 0.050680 0.044451 -0.005670 -0.045599 -0.034194 -0.032356 -0.002592 0.002861 -0.025930 141.0
3 -0.089063 -0.044642 -0.011595 -0.036656 0.012191 0.024991 -0.036038 0.034309 0.022688 -0.009362 206.0
4 0.005383 -0.044642 -0.036385 0.021872 0.003935 0.015596 0.008142 -0.002592 -0.031988 -0.046641 135.0

Understanding Variance for Feature Selection in Machine Learning

In the world of machine learning, the selection of appropriate features is a critical step that can significantly impact the performance of your models. One simple yet effective method for feature selection involves the use of variance.

What is Variance?

Variance is a statistical measurement that describes the spread of data points in a data set around the mean. It quantifies how far each number in the set is from the mean and, thus, from every other number in the set. In simpler terms, variance gives you a measure of how much the values in your dataset differ from the average.

$$\text{Var}(X) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

In this formula:

  • $\text{Var}(X)$ is the variance of the random variable $X$.
  • $n$ is the number of observations.
  • $x_i$ is the $i$-th observation.
  • $\mu$ is the mean of the observations.

This formula calculates the average of the squared differences from the mean. This is why variance is a measure of dispersion.

Let’s calculate the variance of a simple dataset using Python:

import numpy as np

# Define a simple dataset
X = np.array([1,3,5,10,20])
print(f'X: {X}')

# Calculate the number of samples
n = len(X)
print(f'Number of samples (n): {n}')

# Calculate the mean of the dataset
mean_X = sum(X) / n
print(f'Mean of X: {mean_X}')

# Calculate the errors (differences from the mean)
errors = X - mean_X
print(f'Errors for X: {errors}')

# Calculate the variance
var_X = np.dot(errors, errors) / n
print(f'Variance calculated manually: {var_X}')

# We can also use the built-in numpy function to calculate the variance
numpy_var_X = np.var(X)
print(f'Variance using numpy: {numpy_var_X}')

# Check if the two values are close
print(f'Are the two values close? {np.allclose(var_X, numpy_var_X)}')
X: [ 1  3  5 10 20]
Number of samples (n): 5
Mean of X: 7.8
Errors for X: [-6.8 -4.8 -2.8  2.2 12.2]
Variance calculated manually: 46.16
Variance using numpy: 46.16
Are the two values close? True

Variance Alternate Approach

In the alternative approach, we calculate the variance by taking all unique pairs of X values, computing the difference for each pair, squaring it, summing those squared differences, and dividing by the square of the number of samples.

The alternative formula for variance is given by:

$$ Var(X) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j>i}^{n} (x_i - x_j)^2 $$

This can be implemented with the following Python code.

var_x = 0
n = len(X)

# Print the array X
print(f"Array X: {X}")

# Loop over all pairs of distinct elements in X
for i in range(n):
    for j in range(i+1, n):  # rest of Xs
        # Calculate the square of the difference between the current pair of elements
        diff_square = (X[i] - X[j])**2
        print(f"Difference squared for pair ({X[i]}, {X[j]}): {diff_square}")

        # Add the square of the difference to the running total
        var_x += diff_square

# Divide the total by the square of the number of elements to get the variance
variance = var_x / n**2
print(f"Variance: {variance}")
Array X: [ 1  3  5 10 20]
Difference squared for pair (1, 3): 4
Difference squared for pair (1, 5): 16
Difference squared for pair (1, 10): 81
Difference squared for pair (1, 20): 361
Difference squared for pair (3, 5): 4
Difference squared for pair (3, 10): 49
Difference squared for pair (3, 20): 289
Difference squared for pair (5, 10): 25
Difference squared for pair (5, 20): 225
Difference squared for pair (10, 20): 100
Variance: 46.16

This code calculates the variance by iterating over all unique pairs of values in X, computing the square of the difference for each pair, adding up these squared differences, and dividing the total by $n^2$.

Why Use Variance for Feature Selection?

The idea behind using variance for feature selection is straightforward: features with low variance are likely to carry less information.

Consider a feature that has the same value for every instance in your dataset. This feature has zero variance, and it provides no information that can help a machine learning model make accurate predictions because it doesn’t add any variability or difference that the model can learn from.

On the other hand, features with higher variance are likely to carry more information (assuming the variance is due to actual data patterns and not noise), as they show more variation and, therefore, can potentially improve the performance of your model.
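To make this concrete, here is a tiny, hypothetical example (the numbers are made up): one column is constant and one column varies.

import numpy as np

# A small made-up feature matrix: column 0 is constant, column 1 varies
X_demo = np.array([[1.0, 5.2],
                   [1.0, 3.1],
                   [1.0, 8.7],
                   [1.0, 0.4]])

# Column 0 has zero variance, so it cannot help distinguish one example from
# another; column 1 has nonzero variance and so carries some potentially useful signal
print(np.var(X_demo, axis=0))   # -> [0.     9.2025]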

Using Variance for Feature Selection in Practice

In practice, you can calculate the variance of each feature in your dataset and then select the features with the highest variance. Here’s an example of how to calculate variances of features from the wine dataset:

print(wine_df.var())
alcohol                             0.659062
malic_acid                          1.248015
ash                                 0.075265
alcalinity_of_ash                  11.152686
magnesium                         203.989335
total_phenols                       0.391690
flavanoids                          0.997719
nonflavanoid_phenols                0.015489
proanthocyanins                     0.327595
color_intensity                     5.374449
hue                                 0.052245
od280/od315_of_diluted_wines        0.504086
proline                         99166.717355
dtype: float64

Variance Threshold Feature Selection

The following code demonstrates the use of the VarianceThreshold feature selector from sklearn.feature_selection. This tool is used to remove all features whose variance doesn’t meet a certain threshold.

Setting the Threshold

In this case, the threshold is set to 1.0. This means that all features whose variance is less than 1.0 will be removed from the dataset.

Fitting and Transforming the Data

The fit_transform method is used to fit the variance threshold model to the data and then transform the data. The transformed data will only include the features that meet the variance threshold.

Displaying the Results

The print statement is used to display the first example from the transformed data. It also prints the first example from the original data, but only the features that meet the variance threshold. This is achieved using boolean indexing with X_train_wine.var(axis=0) > 1.0.

from sklearn.feature_selection import VarianceThreshold

# Create the VarianceThreshold feature selector
variance_selection = VarianceThreshold(threshold=1.0)

# Fit and transform the data
X_train_wine_transformed = variance_selection.fit_transform(X_train_wine)

feature_names = wine.feature_names

# Get the names of the features that meet the variance threshold
selected_features = np.array(feature_names)[variance_selection.get_support()]

# Print the first example from the transformed data
print("First example after variance thresholding:")
for feature, value in zip(selected_features, X_train_wine_transformed[0]):
    print(f"{feature}: {value}")

# Print the first example from the original data, but only the features that meet the variance threshold
print("\nFirst example from the original data (only features with variance > 1.0):")
for feature_index in np.where(variance_selection.get_support())[0]:
    print(f"{feature_names[feature_index]}: {X_train_wine[0, feature_index]}")
First example after variance thresholding:
malic_acid: 2.36
alcalinity_of_ash: 18.6
magnesium: 101.0
flavanoids: 3.24
color_intensity: 5.68
proline: 1185.0

First example from the original data (only features with variance > 1.0):
malic_acid: 2.36
alcalinity_of_ash: 18.6
magnesium: 101.0
flavanoids: 3.24
color_intensity: 5.68
proline: 1185.0

The above shows that running VarianceThreshold.fit_transform is the same as picking the columns with variance greater than one. The get_support method returns a Boolean mask that we can use to select those columns.

print(variance_selection.get_support())
[False  True False  True  True False  True False False  True False False
  True]

If we want to know the names of the features we kept, we can look at the dataset’s feature_names:

import numpy as np
features_we_kept_idx = variance_selection.get_support()
features_we_kept = np.array(wine.feature_names)[features_we_kept_idx]
print(features_we_kept)
['malic_acid' 'alcalinity_of_ash' 'magnesium' 'flavanoids'
 'color_intensity' 'proline']

This outputs the names of the features that survived the variance threshold, confirming which columns were kept.

A Word of Caution

While variance can be a useful metric for feature selection, it’s important to remember that it doesn’t take into account the relationship between features and the target variable. A feature could have high variance but be completely irrelevant to the target variable. Therefore, variance is often used as a first step in feature selection, to be followed by other methods that consider the relationship between features and the target.
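As one illustration of such a follow-up step, the sketch below compares each wine feature’s raw variance with a target-aware score, mutual information. It assumes the wine training split created earlier (X_train_wine, y_train_wine) is still in scope; mutual information is just one of several possible target-aware criteria.

from sklearn.feature_selection import mutual_info_classif
import pandas as pd

# Score each feature against the class label (something variance alone ignores)
mi_scores = mutual_info_classif(X_train_wine, y_train_wine, random_state=42)

# Put the two criteria side by side for comparison
comparison = pd.DataFrame({'variance': X_train_wine.var(axis=0),
                           'mutual_info': mi_scores},
                          index=wine.feature_names)

# Features that dominate on raw variance (for example, proline) are not
# necessarily the ones that score highest on mutual information with the target
print(comparison.sort_values('mutual_info', ascending=False))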

In conclusion, variance is a simple yet powerful tool for feature selection in machine learning. By understanding and applying it effectively, you can filter out less informative features and potentially improve the performance of your models.

Understanding Covariance and Correlation for Feature Selection in Machine Learning

When selecting features for a machine learning model, it’s important to consider not just the individual characteristics of each feature, but also how each feature relates to the target variable. Two statistical measures can help us do this: covariance and correlation.

What is Covariance?

Covariance is a measure of how much two random variables vary together. It’s similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.

The covariance between two variables would be positive if the variables increase or decrease together, that is, they show similar behavior. On the other hand, if one variable decreases when the other increases, the covariance would be negative.

The formula for covariance is:

$$ \text{cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_x) \cdot (y_i - \mu_y) $$

In this formula:

  • $X$ and $Y$ are two vectors of $n$ samples.
  • $x_i$ and $y_i$ are the individual samples in $X$ and $Y$.
  • $\mu_x$ and $\mu_y$ are the means of $X$ and $Y$.
  • $\text{cov}(X, Y)$ is the covariance of $X$ and $Y$

This formula calculates the average of the product of differences from the mean. This is why covariance is a measure of how much two variables change together.

Let’s calculate the covariance of two simple datasets using Python:

# Define two simple datasets
X = np.array([1,3,5,10,20])
Y = np.array([2,4,1,-2,12])

print(f'X: {X}')
print(f'Y: {Y}')

# Calculate the number of samples
n = len(X)
print(f'Number of samples (n): {n}')

# Calculate the means of the datasets
mean_X = sum(X) / n
mean_Y = sum(Y) / n
print(f'Mean of X: {mean_X}')
print(f'Mean of Y: {mean_Y}')

# Calculate the errors (differences from the mean)
errors_X = X - mean_X
errors_Y = Y - mean_Y
print(f'Errors for X: {errors_X}')
print(f'Errors for Y: {errors_Y}')

# Calculate the covariance
cov_XY = np.dot(errors_X, errors_Y) / n
print(f'Covariance calculated manually: {cov_XY}')

# We can also use the built-in numpy function to calculate the covariance
numpy_cov_XY = np.cov(X, Y, bias=True)[0, 1]
print(f'Covariance using numpy: {numpy_cov_XY}')

# Check if the two values are close
print(f'Are the two values close? {np.allclose(cov_XY, numpy_cov_XY)}')
X: [ 1  3  5 10 20]
Y: [ 2  4  1 -2 12]
Number of samples (n): 5
Mean of X: 7.8
Mean of Y: 3.4
Errors for X: [-6.8 -4.8 -2.8  2.2 12.2]
Errors for Y: [-1.4  0.6 -2.4 -5.4  8.6]
Covariance calculated manually: 21.279999999999994
Covariance using numpy: 21.28
Are the two values close? True

As you can see, the covariance calculated manually matches the covariance calculated using the built-in numpy function. This confirms that our manual calculation is correct.

Covariance Alternative Approach

The alternative approach to calculating covariance is similar to the alternative approach to calculating variance. For each unique pair of data points, we compute the difference in their X values and the difference in their Y values, multiply the two differences together, sum these products over all pairs, and divide by $n^2$.

The alternative formula for covariance is given by:

$$ Cov(X,Y) = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j>i}^{n} (x_i - x_j)(y_i - y_j) $$

This can be done with the following Python code:

cov_xy = 0
n = len(X)

# Print the arrays X and Y
print(f"Array X: {X}")
print(f"Array Y: {Y}")

# Loop over all pairs of elements in X and Y
for i in range(n):
    for j in range(i+1, n):  # rest of Xs, Ys
        # Calculate the product of the differences between the current pair of elements in X and Y
        diff_product = (X[i] - X[j]) * (Y[i] - Y[j])
        print(f"Product of differences for pairs ({X[i]}, {Y[i]}) and ({X[j]}, {Y[j]}): {diff_product}")

        # Add the product of the differences to the running total
        cov_xy += diff_product

# Divide the total by the square of the number of elements to get the covariance
covariance = cov_xy / n**2
print(f"Covariance: {covariance}")
Array X: [ 1  3  5 10 20]
Array Y: [ 2  4  1 -2 12]
Product of differences for pairs (1, 2) and (3, 4): 4
Product of differences for pairs (1, 2) and (5, 1): -4
Product of differences for pairs (1, 2) and (10, -2): -36
Product of differences for pairs (1, 2) and (20, 12): 190
Product of differences for pairs (3, 4) and (5, 1): -6
Product of differences for pairs (3, 4) and (10, -2): -42
Product of differences for pairs (3, 4) and (20, 12): 136
Product of differences for pairs (5, 1) and (10, -2): -15
Product of differences for pairs (5, 1) and (20, 12): 165
Product of differences for pairs (10, -2) and (20, 12): 140
Covariance: 21.28

Calculation of the Product of Differences

In the calculation of covariance, we compute the product of differences for each pair of points. Here’s how it’s done:

  1. Calculate the difference between the x-values: This is done by subtracting the x-value of the second point from the x-value of the first point. This represents the change in x between the two points.

  2. Calculate the difference between the y-values: This is done by subtracting the y-value of the second point from the y-value of the first point. This represents the change in y between the two points.

  3. Multiply the two differences together: The product of these differences is calculated by multiplying the difference in x-values by the difference in y-values.

For example, consider a pair of points (1, 2) and (3, 4):

  • The difference between the x-values is 1 - 3 = -2.
  • The difference between the y-values is 2 - 4 = -2.
  • The product of these differences is -2 * -2 = 4.

So, the product of differences for the pair of points (1, 2) and (3, 4) is 4. This value is used in the calculation of the covariance.

Covariance Alternative Approach #2

Let’s dive deeper into the covariance. There is another alternative approach that expresses the covariance of X and Y in terms of the products of differences over all ordered pairs of data points, including each point paired with itself. The factor of 1/2 is there because each unordered pair of distinct points is counted twice in the summation.

The formula for the covariance alternative approach #2 is given by:

$$Cov(X,Y) = \frac{1}{2n^2} \sum_{i,j} (x_i - x_j)(y_i - y_j)$$

The Python code that implements this formula is:

import itertools as it

# Initialize the covariance
cov_XY = 0.0

# Get all possible ordered pairs of data points
xy_pairs = it.product(zip(X, Y), repeat=2)

# Loop over all pairs of data points
for (x_i, y_i), (x_j, y_j) in xy_pairs:
    # Calculate the product of differences for the current pair
    product_of_differences = (x_i - x_j) * (y_i - y_j)

    # Print the current pair and the product of differences
    print(f"Pair: (({x_i}, {y_i}), ({x_j}, {y_j})), Product of differences: {product_of_differences}")

    # Add the product of differences to the running total
    cov_XY += product_of_differences

# Divide the total by twice the square of the number of data points to get the covariance
cov_XY /= (2 * n**2)

# Print the covariance
print("Cov(X,Y):", cov_XY)
Pair: ((1, 2), (1, 2)), Product of differences: 0
Pair: ((1, 2), (3, 4)), Product of differences: 4
Pair: ((1, 2), (5, 1)), Product of differences: -4
Pair: ((1, 2), (10, -2)), Product of differences: -36
Pair: ((1, 2), (20, 12)), Product of differences: 190
Pair: ((3, 4), (1, 2)), Product of differences: 4
Pair: ((3, 4), (3, 4)), Product of differences: 0
Pair: ((3, 4), (5, 1)), Product of differences: -6
Pair: ((3, 4), (10, -2)), Product of differences: -42
Pair: ((3, 4), (20, 12)), Product of differences: 136
Pair: ((5, 1), (1, 2)), Product of differences: -4
Pair: ((5, 1), (3, 4)), Product of differences: -6
Pair: ((5, 1), (5, 1)), Product of differences: 0
Pair: ((5, 1), (10, -2)), Product of differences: -15
Pair: ((5, 1), (20, 12)), Product of differences: 165
Pair: ((10, -2), (1, 2)), Product of differences: -36
Pair: ((10, -2), (3, 4)), Product of differences: -42
Pair: ((10, -2), (5, 1)), Product of differences: -15
Pair: ((10, -2), (10, -2)), Product of differences: 0
Pair: ((10, -2), (20, 12)), Product of differences: 140
Pair: ((20, 12), (1, 2)), Product of differences: 190
Pair: ((20, 12), (3, 4)), Product of differences: 136
Pair: ((20, 12), (5, 1)), Product of differences: 165
Pair: ((20, 12), (10, -2)), Product of differences: 140
Pair: ((20, 12), (20, 12)), Product of differences: 0
Cov(X,Y): 21.28

Below, we apply the same idea but iterate over combinations rather than the full Cartesian product, so each unordered pair is visited only once. Written this way, the formula also gives us some intuition for the geometric representation of covariance that we build next.

from itertools import combinations

# Initialize the covariance
cov_XX = 0.0

# Loop over all unique pairs of elements in X
for x_i, x_j in combinations(X, 2):
    # Calculate the square of the difference between the current pair of elements
    diff_square = (x_i - x_j)**2
    print(f"Difference squared for pair ({x_i}, {x_j}): {diff_square}")

    # Add the square of the difference to the running total
    cov_XX += diff_square

# Divide the total by the square of the number of elements to get the variance
variance = cov_XX / n**2
print(f"Cov(X,X) == Var(X): {variance}")
Difference squared for pair (1, 3): 4
Difference squared for pair (1, 5): 16
Difference squared for pair (1, 10): 81
Difference squared for pair (1, 20): 361
Difference squared for pair (3, 5): 4
Difference squared for pair (3, 10): 49
Difference squared for pair (3, 20): 289
Difference squared for pair (5, 10): 25
Difference squared for pair (5, 20): 225
Difference squared for pair (10, 20): 100
Cov(X,X) == Var(X): 46.16
from itertools import combinations

# Initialize the covariance
cov_XY = 0.0

# Loop over all unique pairs of data points
for (x_i, y_i), (x_j, y_j) in combinations(zip(X, Y), 2):
    # Calculate the product of differences for the current pair
    product_of_differences = (x_i - x_j) * (y_i - y_j)

    # Print the current pair and the product of differences
    print(f"Pair: (({x_i}, {y_i}), ({x_j}, {y_j})), Product of differences: {product_of_differences}")

    # Add the product of differences to the running total
    cov_XY += product_of_differences

# Divide the total by the square of the number of data points to get the covariance
cov_XY /= n**2

# Print the covariance
print("Cov(X,Y):", cov_XY)
Pair: ((1, 2), (3, 4)), Product of differences: 4
Pair: ((1, 2), (5, 1)), Product of differences: -4
Pair: ((1, 2), (10, -2)), Product of differences: -36
Pair: ((1, 2), (20, 12)), Product of differences: 190
Pair: ((3, 4), (5, 1)), Product of differences: -6
Pair: ((3, 4), (10, -2)), Product of differences: -42
Pair: ((3, 4), (20, 12)), Product of differences: 136
Pair: ((5, 1), (10, -2)), Product of differences: -15
Pair: ((5, 1), (20, 12)), Product of differences: 165
Pair: ((10, -2), (20, 12)), Product of differences: 140
Cov(X,Y): 21.28

Understanding Covariance Through Geometry

Covariance can be understood geometrically as the average area of rectangles formed by pairs of data points. This section provides a detailed explanation of this interpretation.

Covariance as a Sum of Areas

Consider a pair of data points $(x_i, y_i)$ and $(x_j, y_j)$. These points can be thought of as defining a rectangle, where the width of the rectangle is $|x_i - x_j|$ and the height is $|y_i - y_j|$. The signed area associated with this rectangle is $(x_i - x_j) \cdot (y_i - y_j)$, the product of the differences in $x$ and $y$ values for the pair of points.

The covariance between $X$ and $Y$ can then be thought of as the sum of these rectangle areas over all pairs of points, divided by the total number of pairs. This is essentially an average of these areas.

Signed Areas

The areas calculated above are signed, meaning they can be positive or negative. This sign depends on the direction of the line connecting the points when moving from left to right. If the line is pointed upwards, the area is positive, indicating a positive relationship between $X$ and $Y$. If the line is pointed downwards, the area is negative, indicating a negative relationship.

This signed area interpretation provides a geometric intuition for covariance. A positive covariance corresponds to a predominance of rectangles with positive areas (upward-pointing lines), indicating that $X$ and $Y$ tend to increase and decrease together. A negative covariance corresponds to a predominance of rectangles with negative areas (downward-pointing lines), indicating that $X$ and $Y$ tend to move in opposite directions.

Covariance as an Average

The formula for covariance involves dividing by the square of the number of data points, $n^2$. This is because each pair of points forms a rectangle, and there are $n^2$ total pairs (including pairs of a point with itself). Dividing by $n^2$ gives us an average of the rectangle areas, which is the covariance.

Note that the two alternative formulas handle this double-counting differently: when the sum runs over all ordered pairs, each rectangle is counted twice, which is why that version of the formula carries a factor of $\frac{1}{2}$; when the sum is restricted to pairs with $j > i$, each rectangle is counted once and no extra factor is needed.

In summary, covariance can be interpreted geometrically as the average signed area of rectangles formed by pairs of data points. This interpretation provides a visual way to understand the concept of covariance and its relationship with the correlation between variables.

Geometric Interpretation of Covariance

In this section, we discuss a geometric interpretation of covariance. We visualize covariance as areas of rectangles formed by pairs of data points. For simplicity, we use an example with three data points and two features.

Color-Coding Scheme

Our visualization uses a color-coding scheme:

  • Red indicates a positive covariance.
  • Blue indicates a negative covariance.
  • The darkness of the color indicates the magnitude of the covariance.
  • Lighter colors, leading to white, indicate a lack of covariance.

Creating the Visualization

To create this visualization, we fill a NumPy array with the appropriate values. This is done by defining a helper function draw_rectangle that fills in the values in the array corresponding to the rectangle formed by a pair of points. The function adds or subtracts 1 to each block in the rectangle, depending on whether the rectangle represents a positive or negative covariance.

Drawing the Rectangles

We then use the draw_rectangle function to draw the rectangles corresponding to each pair of data points. We use the combinations function from the itertools module to generate all pairs of data points. For each pair of points, we draw the corresponding rectangle.

def draw_rectangle(arr, point1, point2):
    """
    This function takes in an array and two points. It then "draws" a rectangle on the array
    by adding or subtracting 1 to each block in the rectangle, depending on whether the rectangle
    represents a positive or negative covariance.

    Parameters:
    arr (np.array): The array on which to draw the rectangle.
    point1, point2 (tuple): The points that define the rectangle. Each point is a tuple of (x, y).
    """
    # Unpack the points
    (x1, y1), (x2, y2) = point1, point2
    print(f"Point 1: {point1}, Point 2: {point2}")

    # Calculate the differences in x and y between the two points
    delta_x, delta_y = x2 - x1, y2 - y1
    print(f"Delta X: {delta_x}, Delta Y: {delta_y}")

    # Determine the top-left corner of the rectangle
    # Note: we're treating y as the row index (r) and x as the column index (c)
    top_left_row, top_left_col = min(y1, y2), min(x1, x2)
    print(f"Top Left Corner of Rectangle: ({top_left_row}, {top_left_col})")

    # "Draw" the rectangle on the array by adding or subtracting 1 to each block in the rectangle
    # The sign of the addition/subtraction depends on the sign of the product of delta_x and delta_y
    arr[top_left_row : top_left_row + abs(delta_y),
        top_left_col : top_left_col + abs(delta_x)] += np.sign(delta_x * delta_y)

    print(f"Updated Array:\n{arr}")

We then define our three data points and use the draw_rectangle function to fill in the array for the three rectangles defined by these points. The array is then displayed using matplotlib’s matshow function.

# Define the data points
points = [(1, 1), (3, 6), (6, 3)]

# Convert the list of points to a numpy array for efficient computation
points_array = np.array(points, dtype=np.float64)

# Initialize an array of zeros on which to "draw" the rectangles
drawing_array = np.zeros((10, 10))

# Count the number of points
num_points = len(points)

# Calculate the normalization factor (1 / n**2), which turns the summed
# rectangle areas into an average
normalization_factor = 1 / num_points**2

# For each unique pair of points, "draw" a rectangle on the drawing_array
for point1, point2 in combinations(points, 2):
    draw_rectangle(drawing_array, point1, point2)

# Normalize the drawing_array by the normalization factor
drawing_array *= normalization_factor
Point 1: (1, 1), Point 2: (3, 6)
Delta X: 2, Delta Y: 5
Top Left Corner of Rectangle: (1, 1)
Updated Array:
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Point 1: (1, 1), Point 2: (6, 3)
Delta X: 5, Delta Y: 2
Top Left Corner of Rectangle: (1, 1)
Updated Array:
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 2. 2. 1. 1. 1. 0. 0. 0. 0.]
 [0. 2. 2. 1. 1. 1. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Point 1: (3, 6), Point 2: (6, 3)
Delta X: 3, Delta Y: -3
Top Left Corner of Rectangle: (3, 3)
Updated Array:
[[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  2.  2.  1.  1.  1.  0.  0.  0.  0.]
 [ 0.  2.  2.  1.  1.  1.  0.  0.  0.  0.]
 [ 0.  1.  1. -1. -1. -1.  0.  0.  0.  0.]
 [ 0.  1.  1. -1. -1. -1.  0.  0.  0.  0.]
 [ 0.  1.  1. -1. -1. -1.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]

Displaying the Visualization

Finally, we display the visualization using matplotlib’s matshow function. We use the sigmoid function to map the raw rectangle area values to the range [0, 1], and we use this value to determine the color of each rectangle.

def sigmoid(x):
    """
    Maps input 'x' to a value between 0 and 1, treating 'x' as the exponent in a logistic function.
    This is used to map the raw rectangle area values to a smaller range of [0, 1].

    Parameters:
    x : array_like
        Input array.

    Returns:
    out : ndarray
        Output array, element-wise sigmoid of x.
    """
    return np.exp(-np.logaddexp(0, -x))

We also include a diagonal line across each rectangle, which is added using matplotlib’s plot function. The color of the diagonal line indicates the direction of the covariance (red for positive, blue for negative).

import matplotlib.pyplot as plt
import matplotlib.cm as cm

fig, ax = plt.subplots(1, 1, figsize=(6, 5))

# Create the matshow plot and get the returned plot object
cax = ax.matshow(sigmoid(drawing_array), origin='lower', cmap=cm.bwr, vmin=0, vmax=1)

# Add a colorbar, using the plot object
fig.colorbar(cax, label='Covariance', pad=0.15)

# Label the axes
ax.set_xlabel('Height')
ax.set_ylabel('Weight')

# Add a title
ax.set_title('Geometric Visualization of Covariance')

# Plot the lines and label the points
points = [(1, 1), (3, 6), (6, 3)]
for i, point in enumerate(points):
    if i == 2:
        # For P3, we move the label a bit to the right to avoid overlapping with the lines
        ax.text(point[0]-0.5, point[1]-0.5, f'P{i+1}', ha='left', color='k')
    else:
        ax.text(point[0]-0.5, point[1]-0.5, f'P{i+1}', ha='right', color='k')

ax.plot([0.5, 2.5], [0.5, 5.5], 'r', label='Positive Covariance')  # from 1,1 to 3,6
ax.plot([0.5, 5.5], [0.5, 2.5], 'r')                               # from 1,1 to 6,3
ax.plot([2.5, 5.5], [5.5, 2.5], 'b', label='Negative Covariance')  # from 3,6 to 6,3

# Add a legend on top of the plot
ax.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15), ncol=2)

# Set x-ticks and y-ticks at the positions where we have points
ax.set_xticks([pt[0]-0.5 for pt in points])
ax.set_yticks([pt[1]-0.5 for pt in points])

# Set x-tick labels and y-tick labels to match the original points
ax.set_xticklabels([pt[0] for pt in points])
ax.set_yticklabels([pt[1] for pt in points])

# Move x-tick labels to the bottom
ax.xaxis.tick_bottom()

# Make the layout tight (optional)
fig.tight_layout()

Comparing Covariance Calculations

In this section, we compare the covariance calculated from our geometric interpretation with the covariance calculated using NumPy’s np.cov function. We find that the two calculations give the same result, validating our geometric interpretation.

Calculating Covariance via np.cov

First, we calculate the covariance of our three data points using NumPy’s np.cov function. This function calculates the covariance matrix, and we select the covariance of the two features.

# Convert the list of points to a numpy array
points_array = np.array(points)
print(f"Points array:\n{points_array}")

# Separate the height and weight coordinates
heights = points_array[:, 0]
weights = points_array[:, 1]
print(f"Height coordinates: {heights}")
print(f"Weight coordinates: {weights}")

# Calculate the covariance using numpy's built-in function
np_cov = np.cov(heights, weights, bias=True)

# Print the full covariance matrix
print(f"Covariance matrix:\n{np_cov}")

# Extract the covariance of height and weight from the matrix
cov_height_weight = np_cov[0, 1]
print(f"Covariance of height and weight: {cov_height_weight:.2f}")
Points array:
[[1 1]
 [3 6]
 [6 3]]
Height coordinates: [1 3 6]
Weight coordinates: [1 6 3]
Covariance matrix:
[[4.22 1.22]
 [1.22 4.22]]
Covariance of height and weight: 1.22

Calculating Covariance via Geometric Interpretation

Next, we calculate the covariance using our geometric interpretation. We sum all the values in our drawing_array array, which contains the signed areas of the rectangles corresponding to each pair of data points.

# Set print options
np.set_printoptions(precision=2, suppress=True)

# Print the drawing_array
print("drawing_array:")
print(drawing_array)

# Initialize a running total of the sum
running_total = 0

# Loop over all elements in the drawing_array
for i in range(drawing_array.shape[0]):
    for j in range(drawing_array.shape[1]):
        # Add the current element to the running total
        running_total += drawing_array[i, j]

        # Print the current element and the new running total
        print(f"Added element at ({i}, {j}): {drawing_array[i, j]:.2f}, new total: {running_total:.2f}")

# At the end, the running total should be equal to the sum of all elements in the array
print(f"Final total: {running_total:.2f}")
print(f"Check: does this match numpy's sum? {np.allclose(running_total, drawing_array.sum())}")
drawing_array:
[[ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.22  0.22  0.11  0.11  0.11  0.    0.    0.    0.  ]
 [ 0.    0.22  0.22  0.11  0.11  0.11  0.    0.    0.    0.  ]
 [ 0.    0.11  0.11 -0.11 -0.11 -0.11  0.    0.    0.    0.  ]
 [ 0.    0.11  0.11 -0.11 -0.11 -0.11  0.    0.    0.    0.  ]
 [ 0.    0.11  0.11 -0.11 -0.11 -0.11  0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.    0.  ]]
Added element at (0, 0): 0.00, new total: 0.00
Added element at (0, 1): 0.00, new total: 0.00
Added element at (0, 2): 0.00, new total: 0.00
Added element at (0, 3): 0.00, new total: 0.00
Added element at (0, 4): 0.00, new total: 0.00
Added element at (0, 5): 0.00, new total: 0.00
Added element at (0, 6): 0.00, new total: 0.00
Added element at (0, 7): 0.00, new total: 0.00
Added element at (0, 8): 0.00, new total: 0.00
Added element at (0, 9): 0.00, new total: 0.00
Added element at (1, 0): 0.00, new total: 0.00
Added element at (1, 1): 0.22, new total: 0.22
Added element at (1, 2): 0.22, new total: 0.44
Added element at (1, 3): 0.11, new total: 0.56
Added element at (1, 4): 0.11, new total: 0.67
Added element at (1, 5): 0.11, new total: 0.78
Added element at (1, 6): 0.00, new total: 0.78
Added element at (1, 7): 0.00, new total: 0.78
Added element at (1, 8): 0.00, new total: 0.78
Added element at (1, 9): 0.00, new total: 0.78
Added element at (2, 0): 0.00, new total: 0.78
Added element at (2, 1): 0.22, new total: 1.00
Added element at (2, 2): 0.22, new total: 1.22
Added element at (2, 3): 0.11, new total: 1.33
Added element at (2, 4): 0.11, new total: 1.44
Added element at (2, 5): 0.11, new total: 1.56
Added element at (2, 6): 0.00, new total: 1.56
Added element at (2, 7): 0.00, new total: 1.56
Added element at (2, 8): 0.00, new total: 1.56
Added element at (2, 9): 0.00, new total: 1.56
Added element at (3, 0): 0.00, new total: 1.56
Added element at (3, 1): 0.11, new total: 1.67
Added element at (3, 2): 0.11, new total: 1.78
Added element at (3, 3): -0.11, new total: 1.67
Added element at (3, 4): -0.11, new total: 1.56
Added element at (3, 5): -0.11, new total: 1.44
Added element at (3, 6): 0.00, new total: 1.44
Added element at (3, 7): 0.00, new total: 1.44
Added element at (3, 8): 0.00, new total: 1.44
Added element at (3, 9): 0.00, new total: 1.44
Added element at (4, 0): 0.00, new total: 1.44
Added element at (4, 1): 0.11, new total: 1.56
Added element at (4, 2): 0.11, new total: 1.67
Added element at (4, 3): -0.11, new total: 1.56
Added element at (4, 4): -0.11, new total: 1.44
Added element at (4, 5): -0.11, new total: 1.33
Added element at (4, 6): 0.00, new total: 1.33
Added element at (4, 7): 0.00, new total: 1.33
Added element at (4, 8): 0.00, new total: 1.33
Added element at (4, 9): 0.00, new total: 1.33
Added element at (5, 0): 0.00, new total: 1.33
Added element at (5, 1): 0.11, new total: 1.44
Added element at (5, 2): 0.11, new total: 1.56
Added element at (5, 3): -0.11, new total: 1.44
Added element at (5, 4): -0.11, new total: 1.33
Added element at (5, 5): -0.11, new total: 1.22
Added element at (5, 6): 0.00, new total: 1.22
Added element at (5, 7): 0.00, new total: 1.22
Added element at (5, 8): 0.00, new total: 1.22
Added element at (5, 9): 0.00, new total: 1.22
Added element at (6, 0): 0.00, new total: 1.22
Added element at (6, 1): 0.00, new total: 1.22
Added element at (6, 2): 0.00, new total: 1.22
Added element at (6, 3): 0.00, new total: 1.22
Added element at (6, 4): 0.00, new total: 1.22
Added element at (6, 5): 0.00, new total: 1.22
Added element at (6, 6): 0.00, new total: 1.22
Added element at (6, 7): 0.00, new total: 1.22
Added element at (6, 8): 0.00, new total: 1.22
Added element at (6, 9): 0.00, new total: 1.22
Added element at (7, 0): 0.00, new total: 1.22
Added element at (7, 1): 0.00, new total: 1.22
Added element at (7, 2): 0.00, new total: 1.22
Added element at (7, 3): 0.00, new total: 1.22
Added element at (7, 4): 0.00, new total: 1.22
Added element at (7, 5): 0.00, new total: 1.22
Added element at (7, 6): 0.00, new total: 1.22
Added element at (7, 7): 0.00, new total: 1.22
Added element at (7, 8): 0.00, new total: 1.22
Added element at (7, 9): 0.00, new total: 1.22
Added element at (8, 0): 0.00, new total: 1.22
Added element at (8, 1): 0.00, new total: 1.22
Added element at (8, 2): 0.00, new total: 1.22
Added element at (8, 3): 0.00, new total: 1.22
Added element at (8, 4): 0.00, new total: 1.22
Added element at (8, 5): 0.00, new total: 1.22
Added element at (8, 6): 0.00, new total: 1.22
Added element at (8, 7): 0.00, new total: 1.22
Added element at (8, 8): 0.00, new total: 1.22
Added element at (8, 9): 0.00, new total: 1.22
Added element at (9, 0): 0.00, new total: 1.22
Added element at (9, 1): 0.00, new total: 1.22
Added element at (9, 2): 0.00, new total: 1.22
Added element at (9, 3): 0.00, new total: 1.22
Added element at (9, 4): 0.00, new total: 1.22
Added element at (9, 5): 0.00, new total: 1.22
Added element at (9, 6): 0.00, new total: 1.22
Added element at (9, 7): 0.00, new total: 1.22
Added element at (9, 8): 0.00, new total: 1.22
Added element at (9, 9): 0.00, new total: 1.22
Final total: 1.22
Check: does this match numpy's sum? True

Displaying the Raw Covariance Values

Finally, we display a heatmap of the raw covariance values using seaborn’s heatmap function. This gives us a visual representation of the covariance between each pair of data points.

In the heatmap, each square represents a pair of data points, and the color of the square represents the covariance between those data points. The value in each square is the raw covariance value, which determines the color of the square.

import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable

# Set the size of the figure
fig, ax = plt.subplots(figsize=(4.5, 4.5))

# Create a heatmap of the drawing array
heatmap = sns.heatmap(drawing_array, center=0, square=True, annot=True,
                      cmap='bwr', fmt=".1f", ax=ax, cbar=False)

# Invert the y-axis to match the array layout
heatmap.invert_yaxis()

# Remove ticks from the bottom and left axes
heatmap.tick_params(bottom=False, left=False)

# Add a title to the plot
heatmap.set_title("Covariance Heatmap for Height and Weight")

# Label the axes
heatmap.set_xlabel("Height")
heatmap.set_ylabel("Weight")

# Create a divider for the existing axes instance
divider = make_axes_locatable(ax)
cax = divider.append_axes("right", size="5%", pad=0.05)

# Create colorbar
plt.colorbar(heatmap.collections[0], cax=cax)

# Show the plot
plt.show()

Covariance Matrix

In our discussion so far, we’ve been focusing on the covariance between two features, X and Y. However, in real-world datasets, we often have many more features. For example, we might have three features—X, Y, and Z—and we would be interested in the covariances between all pairs of these features: Cov(X, Y), Cov(X, Z), and Cov(Y, Z).

To keep track of all these covariances, we can use a covariance matrix. This is simply a table that lists the covariances between different pairs of variables. Because the covariance between X and Y is the same as the covariance between Y and X, the covariance matrix is symmetric across its diagonal.

Let’s consider an example with three features (X, Y, Z) and three data points. We’ll calculate and display the covariance matrix for this data. Note that the ‘cov’ method in pandas calculates the “unbiased” covariance by default, which divides by n-1 instead of n. This is a slight difference from the “biased” covariance we’ve been discussing, which divides by n.

import pandas as pd

# Define a DataFrame with three features: X, Y, and Z
feature_data = pd.DataFrame({
    'X': [1, 3, 6],
    'Y': [1, 6, 3],
    'Z': [10, 5, 1]
})

# Set the name for the index
feature_data.index.name = 'examples'

# Display the DataFrame
print("Feature Data:")
display(feature_data)

# Calculate and display the covariance matrix
# Note: The 'cov' method in pandas calculates the "unbiased" covariance by default,
# which divides by n-1 instead of n. This is a slight difference from the "biased" covariance
# we've been discussing, which divides by n.
print("\nCovariance Matrix:")
covariance_matrix = feature_data.cov()
display(covariance_matrix)
Feature Data:

X Y Z
examples
0 1 1 10
1 3 6 5
2 6 3 1
Covariance Matrix:

X Y Z
X 6.333333 1.833333 -11.166667
Y 1.833333 6.333333 -5.166667
Z -11.166667 -5.166667 20.333333

Understanding Covariance Matrices

In the first example, we have three features: X, Y, and Z. The covariance matrix tells us about the relationships between these features. The main diagonal of the matrix (6.3, 6.3, 20.3) represents the variance of each feature. Variance is a measure of how spread out the values of a feature are. In this case, X and Y have the same variance (6.3), while Z has a higher variance (20.3), indicating that the values of Z are more spread out.

The off-diagonal elements of the matrix represent the covariances between different pairs of features. For example, the top-right and bottom-left values (-11.166667) represent the covariance between X and Z. A negative covariance indicates that as X increases, Z tends to decrease, and vice versa.

Covariance in Unrelated Features

Now, let’s consider a different scenario where our features are not directionally related to each other. In this case, as one feature goes up, the other can either go up or down. There’s no set pattern. This might be the case if we’re trying to relate three measurements that we don’t expect to be related at all, such as height, SAT scores, and love of abstract art.

Let’s see what the covariance matrix looks like in this case:

import pandas as pd

# Define the data
data = pd.DataFrame({
    'x': [3, 6, 3, 4],
    'y': [9, 6, 3, 0],
    'z': [1, 4, 7, 0]
})

# Set the index name
data.index.name = 'examples'

# Display the data
print("Feature Data:")
display(data)

# Compute and display the covariance matrix
# Note: pandas computes the unbiased covariance (dividing by n-1) by default
print("\nCovariance Matrix:")
display(data.cov())
Feature Data:

x y z
examples
0 3 9 1
1 6 6 4
2 3 3 7
3 4 0 0
Covariance Matrix:

x y z
x 2.0 0.0 0.0
y 0.0 15.0 0.0
z 0.0 0.0 10.0

In this code, we first define our data as a pandas DataFrame. We then compute and display the covariance matrix using the cov method. This will give us a sense of the relationships between our features, even though we don’t expect them to be directionally related.

Interpreting Covariance Matrices

In the given example, we have three features: x, y, and z. The covariance matrix provides us with a summary of the relationships between these features. The main diagonal of the matrix (2.0, 15.0, 10.0) represents the variance of each feature. Variance is a measure of how spread out the values of a feature are. In this case, y has the highest variance (15.0), indicating that the values of y are more spread out than those of x and z.

The off-diagonal elements of the matrix represent the covariances between different pairs of features. In this case, all off-diagonal elements are zero, indicating that there is no linear relationship between any pair of features. This is a unique property of this dataset and is not typically observed in real-world data.

Let’s visualize the data to better understand this:

# Importing necessary libraries
import matplotlib.pyplot as plt
import pandas as pd

# Creating a DataFrame with the given data
data = pd.DataFrame({'x': [3, 6, 3, 4],
                     'y': [9, 6, 3, 0],
                     'z': [1, 4, 7, 0]})
data.index.name = 'examples'

# Creating a new figure and a set of subplots
fig, ax = plt.subplots(1, 1, figsize=(6, 4))

# Plotting the data
data.plot(ax=ax)

# Adding vertical lines at specific positions
ax.vlines([0, 1, 2, 3], 0, 10, colors=".5")

# Setting the legend location
ax.legend(loc='lower center', ncol=3)

# Removing the box around the plot
plt.box(False)

# Setting the x-ticks
ax.set_xticks([0, 1, 2, 3])

# Setting the y-label
ax.set_ylabel("Values")

# Adding a title to the plot
ax.set_title("Feature Values for Each Example")

# Displaying the plot
plt.show()

In the plot, we can see that there is no consistent pattern in the data. For example, as x increases, y does not consistently increase or decrease. The same is true for the relationships between x and z, and y and z. This lack of a consistent pattern is reflected in the covariance matrix, where all off-diagonal elements are zero. This is known as a diagonal covariance matrix, indicating that the features are uncorrelated.

Covariance in Multiple Classes

So far, we have been discussing the covariance matrix for a single class of data. However, in a multi-class problem, we might need to consider multiple covariance matrices - one for each class. In some cases, a single covariance matrix might be sufficient to describe the entire dataset. In other cases, each class might have its own unique covariance matrix. The choice between these two approaches depends on the specific characteristics of the dataset.

In summary, the covariance matrix is a powerful tool for understanding the relationships between features in a dataset. By examining the covariance matrix, we can gain insights into the structure of the data and make more informed decisions about our analysis.
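As a rough illustration of that choice, the sketch below estimates one shared covariance matrix for the whole iris training set and then one covariance matrix per class. It assumes the X_train_iris and y_train_iris arrays created earlier are still in scope, and it uses the biased (divide-by-n) estimate to match the convention in this section.

import numpy as np

np.set_printoptions(precision=2, suppress=True)

# A single covariance matrix describing all classes at once
print("Shared covariance matrix:")
print(np.cov(X_train_iris, rowvar=False, bias=True))

# One covariance matrix per class
for cls in np.unique(y_train_iris):
    class_rows = X_train_iris[y_train_iris == cls]
    print(f"\nCovariance matrix for class {cls}:")
    print(np.cov(class_rows, rowvar=False, bias=True))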

Understanding Variations of Discriminant Analysis

Discriminant Analysis (DA) is a powerful statistical tool used for pattern recognition and machine learning. It’s used to find a linear combination of features that separates or characterizes two or more classes of objects or events. There are four main variations of DA: Quadratic Discriminant Analysis (QDA), Linear Discriminant Analysis (LDA), Gaussian Naive Bayes (GNB), and Diagonal Linear Discriminant Analysis (DLDA). Each of these techniques makes different assumptions about the covariances of the features.

Quadratic Discriminant Analysis (QDA)

QDA assumes that the covariances between different features are unconstrained. This means that there can be class-wise differences between the covariance matrices. In other words, each class can have its own unique covariance matrix.

Linear Discriminant Analysis (LDA)

LDA adds a constraint to the covariance assumption. It assumes that the covariances between features are the same regardless of the target classes. This means that there is a single covariance matrix that does a good job describing the data for all classes.

Gaussian Naive Bayes (GNB)

GNB makes a different assumption about the covariances. It assumes that the covariances between distinct features are all zero. This implies that the covariance matrices have entries on their main diagonal and zeros everywhere else. GNB assumes that pairs of features are independent within each class, which leads to a covariance of zero between pairs of non-identical features.

Diagonal Linear Discriminant Analysis (DLDA)

DLDA combines the assumptions of LDA and GNB. It assumes that the covariances are the same for each class (like LDA), and that the covariance between pairs of non-identical features is zero (like GNB). This means that there is a single diagonal covariance matrix for all classes.

Here’s a summary of the assumptions made by each method:

  1. QDA: Different covariance matrices for each class.
  2. LDA: Same covariance matrix for all classes.
  3. GNB: Different diagonal covariance matrices for each class.
  4. DLDA: Same diagonal covariance matrix for all classes.

Abbreviation   Name                                    Assumptions
QDA            Quadratic Discriminant Analysis         Per class, arbitrary covariance matrix
LDA            Linear Discriminant Analysis            Shared, arbitrary covariance matrix
GNB            Gaussian Naive Bayes                    Per class, diagonal covariance matrix
DLDA           Diagonal Linear Discriminant Analysis   Shared, diagonal covariance matrix

This table provides a quick reference to understand the different assumptions made by each method regarding the covariance matrices.

Diagonal Linear Discriminant Analysis (DLDA) (Implementation)

The code block defines a class DiagonalLinearDiscriminantAnalysis that implements the Diagonal Linear Discriminant Analysis (DLDA) method. DLDA is a variant of Linear Discriminant Analysis (LDA) that assumes all classes share a single covariance matrix and that this matrix is diagonal. This means that DLDA assumes features are uncorrelated and that their variances do not differ from class to class.

The fit Method

The fit method takes as input the training features and targets. It performs the following steps:

  1. Identify Unique Targets: It first identifies the unique targets in the training data.

  2. Compute Mean and Prior for Each Target: Then, for each unique target, it computes and stores the mean and prior. The mean is simply the average of the feature values for the cases that belong to the current target. The prior is the proportion of cases that belong to the current target.

The predict Method

The predict method takes as input the test features. It performs the following steps:

  1. Compute Mahalanobis Distances: For each unique target, it computes the Mahalanobis distances between the test features and the mean of the current target, scaled by the variance of the training features.

  2. Compute Discriminant Values: It then computes the discriminant values for the current target, which are a function of the Mahalanobis distances and the log of the prior.

  3. Make Predictions: Finally, it makes predictions by identifying the target with the highest discriminant value for each test example.

# Import necessary libraries
from sklearn.base import BaseEstimator, ClassifierMixin
import numpy as np

class DiagonalLinearDiscriminantAnalysis(BaseEstimator, ClassifierMixin):
    """
    This class implements the Diagonal Linear Discriminant Analysis (DLDA) method.
    """
    def __init__(self):
        pass

    def fit(self, train_features, train_targets):
        """
        This method fits the DLDA model to the training data.
        """
        # Identify the unique targets
        self.unique_targets = np.unique(train_targets)

        # Initialize dictionaries to store the means and priors for each target
        self.means, self.priors = {}, {}

        # Compute the (shared, biased) variance of the training features
        self.variance = train_features.var(axis=0)

        # For each unique target, compute the mean and prior
        for target in self.unique_targets:
            # Identify the cases for the current target
            cases = train_features[train_targets == target]

            # Compute and store the mean for the current target
            self.means[target] = cases.mean(axis=0)

            # Compute and store the prior for the current target
            self.priors[target] = len(cases) / len(train_features)

        return self

    def predict(self, test_features):
        """
        This method makes predictions on the test data using the fitted DLDA model.
        """
        # Initialize an array to store the discriminant values
        discriminant_values = np.empty((test_features.shape[0],
                                        self.unique_targets.shape[0]))

        # For each unique target, compute the discriminant values
        for target in self.unique_targets:
            # Compute the (diagonal) Mahalanobis distances
            mahalanobis_distances = ((test_features - self.means[target]) ** 2
                                     / self.variance)

            # Compute and store the discriminant values for the current target
            discriminant_values[:, target] = (-np.sum(mahalanobis_distances, axis=1)
                                              + 2 * np.log(self.priors[target]))

        # Make predictions by identifying the target with the highest discriminant value for each test example
        predictions = np.argmax(discriminant_values, axis=1)

        return predictions
# Import necessary libraries
from sklearn import discriminant_analysis, naive_bayes, metrics
import seaborn as sns
import matplotlib.pyplot as plt

# Define the discriminant analysis methods
qda = discriminant_analysis.QuadraticDiscriminantAnalysis()
lda = discriminant_analysis.LinearDiscriminantAnalysis()
gnb = naive_bayes.GaussianNB()
dlda = DiagonalLinearDiscriminantAnalysis()

# Create a list of the methods and their names for easy iteration
discriminant_analysis_methods = [qda, lda, gnb, dlda]
method_names = ["QDA", "LDA", "GNB", "DLDA"]

# Define the class names for the iris dataset
class_names = ['setosa', 'versicolor', 'virginica']

# Create a subplot to visualize the confusion matrices
fig, axes = plt.subplots(2, 2, figsize=(4.5, 4.5), sharex=True, sharey=True)
fig.suptitle('Confusion Matrices for Different Discriminant Analysis Methods on Iris Dataset',
             fontsize=16)

# Loop through each method, fit the model, make predictions, and plot the confusion matrix
for ax, model, name in zip(axes.flat, discriminant_analysis_methods, method_names):
    # Fit the model and make predictions
    model_predictions = model.fit(X_train_iris, y_train_iris).predict(X_test_iris)

    # Compute the confusion matrix
    confusion_mat = metrics.confusion_matrix(y_test_iris, model_predictions)

    # Create a heatmap of the confusion matrix
    sns.heatmap(confusion_mat, annot=True, cbar=False, ax=ax,
                xticklabels=class_names, yticklabels=class_names)

    # Set the title of the subplot
    ax.set_title(name)

# Set labels for the x and y axes
axes[0, 0].set_ylabel('Actual')
axes[1, 0].set_xlabel('Predicted')

# Display the plot
plt.tight_layout(rect=[0, 0, 1, 0.96])  # ensure the suptitle fits into the figure
plt.show()

Visualizing Decision Boundaries with plot_decision_boundary Function

When tackling classification problems, it’s often helpful to visualize the decision boundaries of our models. This can give us insights into how our models are making predictions. In this section, we’ll discuss the plot_decision_boundary function, which helps us create these visualizations.

Function Definition

The plot_decision_boundary function is defined as follows:

def plot_decision_boundary(ax, data, tgt, model, dims, grid_step = .01):

Here’s what each argument represents:

  • ax: This is an Axes object where the decision boundary plot will be drawn.
  • data: This is our input data. It should be a 2D array where each row represents a sample and each column represents a feature.
  • tgt: These are the target values corresponding to our data. It should be a 1D array of the same length as the number of rows in data.
  • model: This is the classification model we’re using to predict the target values.
  • dims: This is a list of two indices. It specifies which columns of data we’ll use for plotting.
  • grid_step: This is the step size for the grid of points where we’ll make predictions. It defaults to 0.01.

Function Steps

The function follows these steps:

  1. Data Selection: It selects the two columns of data specified by dims and stores them in twoD.

  2. Grid Range Calculation: It calculates the minimum and maximum values for each column of twoD, adjusted by grid_step. These values define the range of the grid of points where predictions will be made.

  3. Grid Creation: It creates a grid of points within the range defined by the minimum and maximum values, with a step size of grid_step.

  4. Prediction Grid Formation: It reshapes xs and ys into 1D arrays and combines them into a 2D array where each row is a point on the grid.

  5. Model Fitting and Prediction: It fits the model to twoD and tgt, makes predictions at each point on the grid, and reshapes the predictions into the same shape as xs and ys.

  6. Plot Creation: It creates a pseudocolor plot of the predictions on the Axes specified by ax.

  7. Axes Limit Setting: It sets the limits of the x and y axes to the range of the grid.

By using this function, we can visualize how our classification model divides the input space, providing us with a deeper understanding of its decision-making process.

# Import necessary libraries
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Define a function to plot decision boundaries
def plot_decision_boundary(ax, data, tgt, model, dims, grid_step=.01):
    """
    This function plots the decision boundary of a classification model.

    Parameters:
        ax: matplotlib Axes object
        data: 2D numpy array
        tgt: 1D numpy array
        model: sklearn model object
        dims: list of two indices
        grid_step: float (default is 0.01)
    """
    # Select two dimensions from data
    twoD = data[:, list(dims)]

    # Calculate the minimum and maximum values for each dimension
    min_x1, min_x2 = np.min(twoD, axis=0) + 2 * grid_step
    max_x1, max_x2 = np.max(twoD, axis=0) - grid_step

    # Create a grid of points within the range of the data
    xs, ys = np.mgrid[min_x1:max_x1:grid_step, min_x2:max_x2:grid_step]
    grid_points = np.c_[xs.ravel(), ys.ravel()]

    # Fit the model to the data and make predictions at each point on the grid
    preds = model.fit(twoD, tgt).predict(grid_points).reshape(xs.shape)

    # Create a pseudocolor plot of the predictions at the grid points
    ax.pcolormesh(xs, ys, preds, cmap=plt.cm.coolwarm)
    ax.set_xlim(min_x1, max_x1)
    ax.set_ylim(min_x2, max_x2)
    ax.set_title(f'Decision Boundary for {get_model_name(model)}')


# Define a function to get the name of a model
def get_model_name(model):
    """
    This function returns the name of a model's class as a string.

    Parameters:
        model: sklearn model object
    """
    return str(model.__class__).split('.')[-1][:-2]
# Import necessary libraries
from sklearn import discriminant_analysis, naive_bayes, metrics
import seaborn as sns
import matplotlib.pyplot as plt

# Define the discriminant analysis methods
qda = discriminant_analysis.QuadraticDiscriminantAnalysis()
lda = discriminant_analysis.LinearDiscriminantAnalysis()
gnb = naive_bayes.GaussianNB()
dlda = DiagonalLinearDiscriminantAnalysis()

# Create a list of the methods and their names for easy iteration
discriminant_analysis_methods = [qda, lda, gnb, dlda]
method_names = ["QDA", "LDA", "GNB", "DLDA"]

# Create a subplot to visualize the decision boundaries
fig, axes = plt.subplots(2, 2, figsize=(4.5, 4.5))
fig.suptitle('Decision Boundaries for Different Discriminant Analysis Methods', fontsize=16)

# Loop through each method, fit the model, and plot the decision boundary
for ax, model, name in zip(axes.flat, discriminant_analysis_methods, method_names):
    # Fit the model
    model.fit(X_train_iris, y_train_iris)

    # Plot the decision boundary
    plot_decision_boundary(ax, X_train_iris, y_train_iris, model, [0, 1])

    # Set the title of the subplot
    ax.set_title(name)

# Display the plot
plt.tight_layout(rect=[0, 0, 1, 0.96])  # ensure the suptitle fits into the figure
plt.show()

Assumptions, Biases, and Classifiers

In our exploration of different classification methods, we’ve come across a variety of techniques, each with its own set of assumptions and biases. One key aspect that differentiates these methods is the concept of linearity.

Linearity in Classifiers

Linearity, as the term suggests, is rooted in the concept of a line. In a two-dimensional space, such as an XY graph, we can draw a line that divides the plane into two halves - above and below or left and right of the line. This concept extends to higher dimensions as well. In a 3D space, we can split the space using a 2D plane. Even in higher dimensions, where our intuition might fail us, this concept holds. The line, the plane, and their higher-dimensional counterparts are all linear forms in a mathematical sense. They are simple and straight, without any wiggles.
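In mathematical terms, all of these linear forms are the set of points satisfying a single linear equation:

$$ \mathbf{w} \cdot \mathbf{x} + b = 0 $$

where $\mathbf{w}$ is a weight vector and $b$ is an offset: a line in two dimensions, a plane in three, and a hyperplane in general.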

However, real-world data often doesn’t conform to these straight lines or planes. We might need a model that can handle curves, parabolas, or circles. This is where nonlinear techniques come into play.

Nonlinear Techniques

We’ve already encountered a few nonlinear techniques, such as nearest neighbors and decision trees. These methods can capture relationships that are more complex than what linear forms allow. Many other techniques, including linear regression, logistic regression, and Support Vector Classifiers (SVCs), can be extended to handle nonlinear data. This is often achieved by replacing the notion of similarity rooted in the covariance matrix with another concept called a kernel.
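As a hedged illustration (the data and settings here are invented for the example, not taken from the original analysis), swapping a linear kernel for an RBF kernel lets an SVC bend its boundary around a curved class split:

import numpy as np
from sklearn.svm import SVC

# Made-up data with a curved class boundary: class 1 where y > x^2
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = X[:, 1] > X[:, 0] ** 2

# A linear kernel can only draw a straight boundary;
# an RBF kernel replaces the dot-product notion of similarity and can bend.
linear_svc = SVC(kernel="linear").fit(X, y)
rbf_svc = SVC(kernel="rbf").fit(X, y)

print("Linear-kernel training accuracy:", linear_svc.score(X, y))
print("RBF-kernel training accuracy:   ", rbf_svc.score(X, y))

On data like this, the RBF-kernel SVC typically separates the classes far more accurately than the linear one.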

From Unbiased to Biased Classifiers

We’ve journeyed from classifiers with relatively little bias, like deep decision trees, through a variety of classifiers that primarily create linear boundaries (although GNB and QDA allow for curved boundaries). Because of their different biases and assumptions, Decision Trees (DTs) and Support Vector Classifiers (SVCs) handle even a simple problem, like determining whether y > x, quite differently.

In the next sections, we’ll delve deeper into these concepts and explore how different classifiers handle different types of data.

# Import necessary libraries
from sklearn import svm, tree
import numpy as np
import matplotlib.pyplot as plt

# Define the features as a grid of points
features = np.mgrid[1:10, 1:10].T.reshape(-1, 2)

# Define the target as whether the first feature is greater than the second
target = features[:, 0] > features[:, 1]

# Create a subplot to visualize the decision boundaries
fig, axes = plt.subplots(1, 3, figsize=(9, 3))
axes = axes.flat
fig.suptitle('Decision Boundaries for Different Classifiers', fontsize=16)

# Define the classifiers
svc = svm.SVC(kernel='linear')
dt_shallow = tree.DecisionTreeClassifier(max_depth=3)
dt_deep = tree.DecisionTreeClassifier()

# Create a list of the classifiers for easy iteration
classifiers = [svc, dt_shallow, dt_deep]

# Loop through each classifier, fit the model, and plot the decision boundary
for model, ax in zip(classifiers, axes):
    # Fit the model
    model.fit(features, target)

    # Plot the decision boundary
    plot_decision_boundary(ax, features, target, model, [0, 1])

    # Set the title of the subplot
    ax.set_title(get_model_name(model))

# Adjust the layout to ensure the suptitle fits into the figure
plt.tight_layout(rect=[0, 0, 1, 0.96])

# Display the plot
plt.show()

Understanding Classifier Assumptions and Biases

In the world of machine learning, different classifiers come with different assumptions and biases. Understanding these can help us choose the right tool for the job. Let’s dive into some of the classifiers we’ve discussed so far.

Linearity in Classifiers

The concept of linearity is crucial in understanding the behavior of classifiers. In simple terms, a classifier is linear if it uses a line, plane, or hyperplane to separate classes. This linearity is evident in classifiers like Logistic Regression and Linear Discriminant Analysis (LDA). However, not all problems can be solved with a straight line. Sometimes, we need a classifier that can capture more complex, non-linear relationships. This is where classifiers like Decision Trees and Nearest Neighbors come in handy.

Quadratic Discriminant Analysis (QDA) and Gaussian Naive Bayes (GNB)

Both QDA and GNB allow for non-linear boundaries, meaning they can capture more complex relationships between features. However, they make different assumptions about the data. QDA assumes that each class has its own covariance matrix, allowing for class-wise differences. On the other hand, GNB assumes that all off-diagonal entries in the covariance matrix are zero, implying that pairs of distinct features are independent within each class.

Support Vector Classifier (SVC)

SVCs are different from the classifiers mentioned above in that they don’t have an underlying notion of probability. Instead, they aim to find the best hyperplane that separates the classes. This makes SVCs more flexible and less reliant on specific assumptions about the data. However, this also means that other methods may perform better when their assumptions are met.

Logistic Regression and Discriminant Analysis

Logistic Regression and Discriminant Analysis methods both capture the relationship between the inputs and the output, but in different ways. Logistic Regression models the probability of the output given the inputs but ignores any self-contained information about the target classes themselves. In contrast, Discriminant Analysis methods model both the relationship between inputs and outputs and the base probabilities of the outputs.

In summary, the choice of classifier depends on the specific characteristics of the data and the problem at hand. Understanding the assumptions and biases of each classifier can guide us in making an informed choice.

Understanding Variance and Covariance

When we consider the variance of a single feature, it doesn’t provide any information about our target variable. Ideally, we want to understand how a feature varies in relation to the target. This is where covariance comes into play. However, interpreting covariance can be a bit tricky due to its nuanced and complex behavior.

Normalizing Covariance: The Concept of Correlation

To make covariance more interpretable, we normalize it by the variance of its components. In essence, for two features, we compare how they vary together versus how they vary individually. A fundamental principle in mathematics states that the variation of two features together is always less than or equal to their individual variations. This principle allows us to define the correlation between two variables, such as a feature and a target.
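Formally, this principle is the Cauchy–Schwarz inequality applied to covariance:

$$ |cov(xy)| \le \sqrt{var(x) \cdot var(y)} $$

which guarantees that the normalized quantity defined next always falls between $-1$ and $1$.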

Mathematically, correlation is defined as:

$$ r = \frac{cov(xy)}{\sqrt{var(x) \cdot var(y)}} $$

Here, $r$ represents the correlation, $cov(xy)$ is the covariance between $x$ and $y$, and $var(x)$ and $var(y)$ are the variances of $x$ and $y$ respectively.

Interpreting Correlation

Correlation can be interpreted in several ways. One interpretation that is particularly useful for us is to consider it as a measure of how close the covariance is to its maximum possible value, ignoring the signs. It’s important to note that both covariance and correlation can be positive or negative.

When the correlation is 1, and the covariance is at its maximum, there is a perfect linear relationship between the two variables. However, while correlation focuses on this linear relationship, covariance considers more than just linearity.

If we express covariance as a product of correlation and standard deviations—$cov(xy) = cor(xy) \cdot \sqrt{var(x) \cdot var(y)}$—we can see that increasing the variance of $x$ or $y$, enhancing the linearity (increasing $cor(xy)$), and evenly distributing a given total variance between $x$ and $y$ all contribute to increasing the covariance.
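As a quick sanity check (a sketch on made-up data, not part of the original analysis), we can verify this identity numerically:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(size=1000)

cov_xy = np.cov(x, y, bias=True)[0, 1]   # biased covariance, matching np.std's default ddof=0
corr_xy = np.corrcoef(x, y)[0, 1]

# cov(xy) should equal cor(xy) * sqrt(var(x) * var(y))
print(np.isclose(cov_xy, corr_xy * x.std() * y.std()))  # expected: True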

Correlation vs Covariance

This brings us to the question of why correlation is often preferred over covariance: correlation measures only how well the relationship is described by a line, which makes it directly comparable across features. Since correlation can be positive or negative—essentially whether the line points up or down—comparing raw correlation values can still be misleading. A line isn’t inherently worse just because it points down instead of up. To address this, we square the correlation value, so instead of working with $r$, we’ll work with $r^2$.

Univariate Feature Selection Using Correlation

Although we call correlation a univariate feature selection method—each feature is scored on its own—the value we compute is technically the bivariate correlation between a predictive feature and the target output. In practice, libraries like sklearn fold this into their univariate selection tools, such as f_regression: the ordering of features under f_regression is the same as the ordering we get by ranking the features by their squared correlations.

import numpy as np
from pprint import pprint

# cov(X, y) = np.dot(X - E(X), y - E(y)) / n
n = len(X_train_diabetes)

X = X_train_diabetes
y = y_train_diabetes[np.newaxis, :]

# Compute the mean of X and y
mean_X = X.mean()
mean_y = y.mean()

# Compute the deviations from the mean for X and y
deviations_X = X - mean_X
deviations_y = y - mean_y

# Pretty-print the deviations
print("Deviations of X from its mean:")
pprint(deviations_X)
print("\nDeviations of y from its mean:")
pprint(deviations_y)

# Compute the covariance using the dot product of the deviations
covariance_via_dot = np.dot(deviations_y, deviations_X) / n

print("\nCovariance computed via dot product:")
pprint(covariance_via_dot)
Deviations of X from its mean:
array([[-0.01, -0.05,  0.04, ..., -0.04,  0.05,  0.03],
       [ 0.06, -0.05, -0.05, ...,  0.02,  0.06,  0.04],
       [ 0.01, -0.05,  0.05, ...,  0.02,  0.05,  0.11],
       ...,
       [ 0.03, -0.05, -0.02, ..., -0.04, -0.01, -0.  ],
       [-0.01, -0.05, -0.02, ..., -0.  , -0.04, -0.04],
       [-0.09, -0.05,  0.03, ..., -0.04, -0.01, -0.  ]])

Deviations of y from its mean:
array([[  11.66,   34.66,   18.66,   65.66,   51.66,  -57.34,  -94.34,
         -93.34,   87.66,  -33.34,  -26.34,  -50.34,  110.66,  -22.34,
         128.66,   19.66,  -25.34,  102.66,  -17.34,  -91.34,  -61.34,
          77.66,   53.66,  106.66,   24.66,  103.66,  107.66, -103.34,
          82.66,  -83.34,  -15.34,  113.66,  -85.34,  162.66,   94.66,
          -0.34,   37.66,  -38.34,  -73.34,  -32.34,  104.66,   36.66,
         137.66,  -99.34,  -47.34,   55.66,  -63.34,   98.66,  -69.34,
          97.66,  -95.34,  -76.34,   45.66,  -76.34,   90.66,   20.66,
        -112.34,  -27.34, -101.34,  -60.34,  -50.34,   44.66,  110.66,
         126.66,   93.66,  102.66,   60.66,  148.66,   15.66,  -95.34,
         122.66,   54.66,  -16.34,   43.66,  -30.34,  -58.34,  133.66,
          70.66,  110.66,  -53.34,  -99.34,   43.66, -103.34,   97.66,
         -90.34,   65.66,  -23.34,   57.66,  -12.34,  -51.34,    0.66,
         -33.34,  -68.34,  -43.34,  -89.34,  -23.34, -103.34,  -26.34,
         -13.34, -106.34,  -45.34,   23.66,  -66.34,  -70.34,   61.66,
          -4.34,  -94.34,  -58.34,   35.66,  -80.34,  124.66,   27.66,
           5.66,   90.66,  121.66,   19.66,   25.66,   -4.34,   41.66,
         -16.34,  -57.34,   91.66,  166.66,  153.66,  -45.34,  -85.34,
          27.66,  103.66,    6.66,   23.66,   59.66, -109.34,   -4.34,
           5.66,  -99.34,   42.66,   30.66,  113.66,  155.66,  -31.34,
         -86.34,  -82.34,   30.66,  -10.34,   -7.34,   13.66,   23.66,
          91.66,   -3.34,  -27.34,  -71.34,  177.66,   -2.34,  -45.34,
         -64.34,  -88.34,   59.66,  -69.34,  -25.34,  -65.34,  104.66,
          74.66,   45.66,  -77.34, -100.34, -123.34,  -45.34,   51.66,
         -10.34,  -36.34,  -71.34,   87.66,  104.66,  -82.34,    8.66,
          26.66,  -13.34,  -83.34,  -17.34,   40.66,   24.66,  -52.34,
         -23.34, -107.34,   80.66,  -77.34,   43.66,  -61.34,    7.66,
          70.66,  120.66,   28.66,  151.66,  -73.34,  -99.34,   -8.34,
          41.66,   75.66,  155.66, -114.34,  -19.34,  191.66, -111.34,
         -26.34,  -77.34,   80.66, -105.34,  -80.34,  -62.34,  -70.34,
         108.66,  -10.34,  -12.34,  186.66,  -39.34,    3.66,  118.66,
         -69.34,  -66.34,   65.66, -115.34,  -74.34,   17.66,   62.66,
         181.66, -102.34,  117.66,  -39.34,  -44.34,  -23.34,  -83.34,
         120.66,  -36.34, -129.34,  -54.34,  126.66,   66.66,   93.66,
          45.66,  -22.34,  -63.34,  -87.34,   47.66,  -81.34,  -69.34,
         120.66,   88.66,  -88.34,  138.66,   81.66,   88.66,  -67.34,
        -115.34,   62.66,  -62.34,  141.66,  137.66,  -12.34, -104.34,
        -101.34,  -50.34,  -79.34,  -34.34,  -12.34,  -11.34,  -55.34,
         -89.34,  -38.34,   78.66,    9.66,  -59.34,  -95.34,  -15.34,
          -9.34,   22.66,   30.66,  -57.34, -112.34,   46.66,   86.66,
         -84.34,  -76.34, -105.34,  -51.34, -110.34,  -43.34,   36.66,
        -107.34,   27.66,  -96.34,    0.66,   -3.34,  -75.34,  -50.34,
         -11.34,   -2.34,   15.66,  -79.34,   45.66,  -30.34,  -63.34,
        -105.34,    8.66, -101.34,  128.66,   23.66,   64.66,   45.66,
         -41.34,  -41.34,  -91.34,  -40.34,  -28.34,  119.66,  -66.34,
         156.66,  -71.34,  -83.34,  -20.34,   89.66,  -89.34,   18.66,
         -97.34,  -86.34,  -13.34,  115.66,  -20.34,   47.66,   -6.34,
         -90.34,  147.66]])

Covariance computed via dot product:
array([[ 0.71,  0.05,  2.17,  1.64,  0.68,  0.55, -1.45,  1.59,  1.99,
         1.37]])
import numpy as np
from pprint import pprint

# Compute the covariance matrix
cov_matrix = np.cov(X_train_diabetes, y_train_diabetes, bias=True, rowvar=False)

# Pretty-print the covariance matrix
print("Covariance matrix:")
pprint(cov_matrix)

# Extract the covariances between each feature and the target
covariance_via_np = cov_matrix[-1, :-1]

# Pretty-print the covariances
print("\nCovariances between each feature and the target:")
for i, cov in enumerate(covariance_via_np):
    print(f"Covariance between feature {i} and target: {cov}")

print(np.allclose(covariance_via_dot, covariance_via_np))
Covariance matrix:
array([[   0.  ,    0.  ,    0.  ,    0.  ,    0.  ,    0.  ,   -0.  ,
           0.  ,    0.  ,    0.  ,    0.71],
       [   0.  ,    0.  ,    0.  ,    0.  ,    0.  ,    0.  ,   -0.  ,
           0.  ,    0.  ,    0.  ,    0.05],
       [   0.  ,    0.  ,    0.  ,    0.  ,    0.  ,    0.  ,   -0.  ,
           0.  ,    0.  ,    0.  ,    2.17],
       [   0.  ,    0.  ,    0.  ,    0.  ,    0.  ,    0.  ,   -0.  ,
           0.  ,    0.  ,    0.  ,    1.64],
       [   0.  ,    0.  ,    0.  ,    0.  ,    0.  ,    0.  ,    0.  ,
           0.  ,    0.  ,    0.  ,    0.68],
       [   0.  ,    0.  ,    0.  ,    0.  ,    0.  ,    0.  ,   -0.  ,
           0.  ,    0.  ,    0.  ,    0.55],
       [  -0.  ,   -0.  ,   -0.  ,   -0.  ,    0.  ,   -0.  ,    0.  ,
          -0.  ,   -0.  ,   -0.  ,   -1.45],
       [   0.  ,    0.  ,    0.  ,    0.  ,    0.  ,    0.  ,   -0.  ,
           0.  ,    0.  ,    0.  ,    1.59],
       [   0.  ,    0.  ,    0.  ,    0.  ,    0.  ,    0.  ,   -0.  ,
           0.  ,    0.  ,    0.  ,    1.99],
       [   0.  ,    0.  ,    0.  ,    0.  ,    0.  ,    0.  ,   -0.  ,
           0.  ,    0.  ,    0.  ,    1.37],
       [   0.71,    0.05,    2.17,    1.64,    0.68,    0.55,   -1.45,
           1.59,    1.99,    1.37, 6044.62]])

Covariances between each feature and the target:
Covariance between feature 0 and target: 0.7142281247483537
Covariance between feature 1 and target: 0.05162692595531552
Covariance between feature 2 and target: 2.167215165126843
Covariance between feature 3 and target: 1.6423384710604967
Covariance between feature 4 and target: 0.6813654604487338
Covariance between feature 5 and target: 0.5514319442321287
Covariance between feature 6 and target: -1.45184958863045
Covariance between feature 7 and target: 1.5909095188877298
Covariance between feature 8 and target: 1.9931594185586323
Covariance between feature 9 and target: 1.366442701489227
True

Now, we can calculate the correlations in two ways:

  1. From the covariances we just calculated or
  2. from numpy’s np.corrcoef directly.
# Calculate the correlation from the covariances
correlation_via_covariance = covariance_via_np / np.sqrt(np.var(X_train_diabetes) *
                                                          np.var(y_train_diabetes, axis=0))

# Print the correlations
print("Correlations calculated from covariances:")
for i, corr in enumerate(correlation_via_covariance):
    print(f"Correlation between feature {i} and target: {corr}")

# Calculate the correlation using np.corrcoef
correlation_matrix = np.corrcoef(X_train_diabetes, y_train_diabetes, rowvar=False)
correlation_via_np = correlation_matrix[-1, :-1]

# Print the correlations
print("Correlations calculated using np.corrcoef:")
for i, corr in enumerate(correlation_via_np):
    print(f"Correlation between feature {i} and target: {corr}")

# Check if the correlations calculated by the two methods are close
# (a loose tolerance is used because np.var over the whole feature matrix is only
#  approximately equal to each feature's individual variance)
print("Are the correlations calculated by the two methods close?")
print(np.allclose(correlation_via_covariance, correlation_via_np, atol=1))
Correlations calculated from covariances:
Correlation between feature 0 and target: 0.19297849955994148
Correlation between feature 1 and target: 0.013949166047275502
Correlation between feature 2 and target: 0.5855635143702634
Correlation between feature 3 and target: 0.443746196674193
Correlation between feature 4 and target: 0.18409928096249809
Correlation between feature 5 and target: 0.14899232544900257
Correlation between feature 6 and target: -0.39227768480741043
Correlation between feature 7 and target: 0.4298505214965495
Correlation between feature 8 and target: 0.5385350991501918
Correlation between feature 9 and target: 0.3692014541725478
Correlations calculated using np.corrcoef:
Correlation between feature 0 and target: 0.19803129752818782
Correlation between feature 1 and target: 0.01395552693606856
Correlation between feature 2 and target: 0.5913305769318468
Correlation between feature 3 and target: 0.44167016086000094
Correlation between feature 4 and target: 0.1823467320652179
Correlation between feature 5 and target: 0.14705497389362382
Correlation between feature 6 and target: -0.3972511743188234
Correlation between feature 7 and target: 0.42397298254188315
Correlation between feature 8 and target: 0.5343956117596458
Correlation between feature 9 and target: 0.3707346912173415
Are the correlations calculated by the two methods close?
True

Finally, we can confirm that the order of the features when we rank them by squared correlation is the same as the order we get from sklearn’s f_regression.

from sklearn.feature_selection import f_regression

# Calculate the correlations, square them, and sort the features based on these squared correlations
correlations = np.corrcoef(X_train_diabetes, y_train_diabetes, rowvar=False)[-1, :-1]
correlation_order = np.argsort(correlations**2)  # r^2 (!)
correlation_names = np.array(diabetes.feature_names)[correlation_order[::-1]]

# Print the order of the features based on squared correlations
print("Order of features based on squared correlations:")
for i, name in enumerate(correlation_names):
    print(f"{i+1}. {name}")

# Calculate the F-scores, square them, and sort the features based on these squared F-scores
# (F-scores are non-negative, so squaring does not change their order)
f_scores = f_regression(X_train_diabetes, y_train_diabetes)[0]
f_score_order = np.argsort(f_scores**2)
f_score_names = np.array(diabetes.feature_names)[f_score_order[::-1]]

# Print the order of the features based on squared F-scores
print("Order of features based on squared F-scores:")
for i, name in enumerate(f_score_names):
    print(f"{i+1}. {name}")

# Compare the order of the features obtained by the two methods
print("Are the orders of the features obtained by the two methods the same?")
print(tuple(correlation_names) == tuple(f_score_names))
Order of features based on squared correlations:
1. bmi
2. s5
3. bp
4. s4
5. s3
6. s6
7. age
8. s1
9. s2
10. sex
Order of features based on squared F-scores:
1. bmi
2. s5
3. bp
4. s4
5. s3
6. s6
7. age
8. s1
9. s2
10. sex
Are the orders of the features obtained by the two methods the same?
True

Remember, while correlation can be a powerful tool for feature selection, it only captures linear relationships between variables. Other methods may be needed to capture non-linear relationships.

Understanding Information Measures for Feature Selection in Machine Learning

When it comes to feature selection, we often need to go beyond linear relationships and consider more complex, non-linear interactions. This is where Information Measures come into play.

What are Information Measures?

Information measures originate from the field of information theory. They provide a way to quantify how much ‘information’ knowing the value of one variable (in our case, a feature) tells us about another variable (the target). One common measure from information theory used in feature selection is Mutual Information.

Mutual Information

Mutual Information (MI) measures the dependency between the feature and the target. A higher MI between a feature and the target means that the feature provides more information about the target, making it a good candidate for modeling.
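For two discrete variables, mutual information can be written as:

$$ I(X; Y) = \sum_{x} \sum_{y} p(x, y) \, \log \frac{p(x, y)}{p(x)\,p(y)} $$

with the sums replaced by integrals for continuous variables; it is zero exactly when $X$ and $Y$ are independent.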

How does it work?

You can think of the feature value as a message being sent to the machine learning model. The model’s job is to decode that message to predict the target. The better the model can make this prediction, the more ‘mutual information’ the feature and the target share.

Using Information Measures in sklearn

In sklearn, you can use the mutual_info_classif function for classification tasks or mutual_info_regression for regression tasks to calculate the mutual information between each feature and the target. Here’s an example:

One Useful and One Random Variable

Let’s start with a task that is simple to express but that results in a nonlinear classification problem.

# Generate an array of 1000 evenly spaced numbers between -10 and 10
# The reshape(-1,1) is used to make xs a column vector since np.linspace returns a 1D array by default
xs = np.linspace(-10,10,1000).reshape(-1,1)

# Generate an array of 1000 random numbers uniformly distributed between -10 and 10
# The shape of this array is the same as that of xs
random_numbers = np.random.uniform(-10, 10, xs.shape)

# Combine xs and random_numbers into a single 2D array
# np.c_ concatenates the two arrays along the second axis (i.e., columns)
data = np.c_[xs, random_numbers]
print("Data:\n", data[:10])

# Generate the target array
# This is a boolean array where each element is True if the cosine of the corresponding
# element in xs is greater than 0, and False otherwise
# The flatten() function converts target from a 2D array to a 1D array
target = (np.cos(xs) > 0).flatten()
print("Target:\n", target[:10])
Data:
 [[-10.    -9.93]
 [ -9.98   2.33]
 [ -9.96  -5.62]
 [ -9.94  -1.64]
 [ -9.92   7.17]
 [ -9.9   -8.  ]
 [ -9.88   3.51]
 [ -9.86  -3.28]
 [ -9.84   5.69]
 [ -9.82   4.26]]
Target:
 [False False False False False False False False False False]
  1. np.random.uniform(-10,10,xs.shape): This generates a numpy array of random numbers that are uniformly distributed over the interval from -10 to 10. The shape of this array is the same as the shape of xs, which is (1000, 1).

  2. np.c_[xs, np.random.uniform(-10,10,xs.shape)]: The np.c_ is a numpy function that concatenates the two input arrays along the second axis (i.e., columns). So, it’s taking the xs array and the randomly generated array and sticking them together side by side. The result is a new 2D array with 2 columns: the first column is the xs values, and the second column is the random numbers.

So, data is a 2D numpy array with 1000 rows and 2 columns. The first column is the xs values, and the second column is random numbers from -10 to 10.

Let’s graph the result.

import matplotlib.pyplot as plt

# Create a new figure with a specific size
plt.figure(figsize=(4, 3))

# Create a scatter plot of the first column of data (xs) against the target
plt.scatter(data[:, 0], target)

# Add a title to the plot
plt.title('Scatter plot of informative feature vs target')

# Label the x and y axes
plt.xlabel('Informative feature (xs)')
plt.ylabel('Target')

# Add a text annotation to highlight the non-linear relationship
plt.text(-10, 0.8, 'Non-linear relationship', color='red')

# Display the plot
plt.show()

The pattern turns off and on (0 and 1) at regular intervals. The relationship is simple to describe but distinctly non-linear. In our data, the first column holds evenly spaced values between -10 and 10 (as graphed above); the second column contains random values drawn from a uniform distribution. We can look at the mutual information of each column against our nonlinear target:

  1. mutual_info_classif(data, target, discrete_features=False): This function calculates the mutual information between each feature in data and the target. Mutual information is a measure of the dependency between the variables. It is equal to zero if and only if two random variables are independent, and higher values mean higher dependency.

    The discrete_features parameter is set to False, which means that all features are considered continuous, not discrete.

  2. mutual_info = mutual_info_classif(data, target, discrete_features=False): The mutual information values returned by the function are stored in the mutual_info variable. This is a 1D array with one element per feature in data; each element is the mutual information between the corresponding feature and the target.

  3. The loop over mutual_info prints the mutual information values, one per feature.

So, this code calculates the mutual information between each feature in data and the target, and then prints these values. The mutual information values can be used to determine which features are most relevant to the target, as features with higher mutual information values are more strongly associated with the target.

from sklearn.feature_selection import mutual_info_classif

# Compute the mutual information between each feature in 'data' and the 'target'
# 'discrete_features=False' indicates that the features are continuous, not discrete
mutual_info = mutual_info_classif(data, target, discrete_features=False)

# Print the mutual information values
for i, mi in enumerate(mutual_info):
    print(f"Mutual information between feature {i} and target: {mi:.2f}")
Mutual information between feature 0 and target: 0.68
Mutual information between feature 1 and target: 0.00

In this analysis, we’ve computed the mutual information between each feature in the data and the target. Mutual information is a measure of the dependency between two variables, with higher values indicating a stronger dependency.

From the results, it’s clear that:

  • The mutual information between the target and the first feature, which we’ve designed to be informative about the target, is 0.68. This relatively high value indicates a strong dependency between the first feature and the target.
  • The mutual information between the target and the second feature, which we’ve introduced as a random variable not informative about the target, is 0.00 (to two decimal places). This value is essentially zero, indicating that the second feature and the target are nearly independent.

These results align perfectly with our expectations: the informative feature has a high mutual information with the target, while the random feature has a low mutual information. This demonstrates that mutual information can be a powerful tool for feature selection, as it can help identify the features that are most informative about the target variable. We can use similar measures for regression problems.

# Create an array of 1000 evenly spaced numbers from -10 to 10 and reshape it to a column vector
xs_informative = np.linspace(-10,10,1000).reshape(-1,1)

# Create a second feature which is just random numbers, the same shape as xs_informative
xs_random = np.random.uniform(-10, 10, xs_informative.shape)

# Combine the informative and random features into a single data array
data = np.c_[xs_informative, xs_random]

# Create a target variable which is the cosine of the informative feature, flattened to a 1D array
target = np.cos(xs_informative).flatten()

# Print the first 10 rows of the data and target
print("First 10 rows of data:")
print(data[:10])
print("First 10 values of target:")
print(target[:10])
First 10 rows of data:
[[-10.    -3.86]
 [ -9.98   4.02]
 [ -9.96  -2.67]
 [ -9.94   3.85]
 [ -9.92   9.36]
 [ -9.9    7.96]
 [ -9.88  -3.11]
 [ -9.86  -3.59]
 [ -9.84  -6.98]
 [ -9.82  -1.61]]
First 10 values of target:
[-0.84 -0.85 -0.86 -0.87 -0.88 -0.89 -0.9  -0.91 -0.92 -0.92]
plt.figure(figsize=(4,3))
plt.scatter(data[:,0], target, color='blue', alpha=0.5)
plt.title('Informative Feature vs Target')
plt.xlabel('Informative Feature')
plt.ylabel('Target')
plt.grid(True)
plt.annotate('Non-linear relationship', xy=(0, .5), xytext=(-5, -0.5),
             arrowprops=dict(facecolor='black', shrink=0.05))
plt.show()

Let’s compare the $r^2$-like and mutual information techniques:

from sklearn.feature_selection import f_regression, mutual_info_regression

# Compute F-scores via regression
f_scores = f_regression(data, target)[0]

# Print F-scores
print("F-scores for each feature:")
for i, score in enumerate(f_scores):
    print(f"Feature {i}: {score}")

# Compute mutual information scores
mi_scores = mutual_info_regression(data, target)

# Print mutual information scores
print("\nMutual Information scores for each feature:")
for i, score in enumerate(mi_scores):
    print(f"Feature {i}: {score}")
F-scores for each feature:
Feature 0: 3.2292493962135617e-29
Feature 1: 0.0011593352418849164

Mutual Information scores for each feature:
Feature 0: 1.935598025429082
Feature 1: 0.0

For f_regression, the first feature has an F-value that is essentially zero (about 3.2e-29), indicating that it has no detectable linear relationship with the target. The second feature's F-value is slightly larger (about 0.001), but it is still very small, so it is not meaningfully linearly related to the target either.

For mutual_info_regression, the first feature has a high mutual information value with the target (about 1.94), indicating a strong dependency. The second feature has a mutual information value of zero, indicating no detectable dependency with the target.

In summary, the first feature is not linearly related to the target but has a strong (non-linear) dependency on it, while the second feature shows neither a linear relationship nor any other detectable dependency.

Two Predictive Variables

What happens when we have more than one informative variable?

# Create a grid of points in the range [-2, 2] with a step size of 0.2
xs, ys = np.mgrid[-2:2:.2,-2:2:.2]

# Create a binary target variable, which is True where y > x^2 and False elsewhere
c_tgt = (ys > xs**2).flatten()

# Create a continuous target variable, which is the sum of squares of x and y where y > x^2, and 0 elsewhere
r_tgt = ((xs**2 + ys**2) * (ys > xs**2))

# Combine xs and ys into a single data array
data = np.c_[xs.flat, ys.flat]

# Print out a few examples
print("First few examples:")
for i in np.arange(0, 401, 66):
    print(f"Data: {data[i]}, Binary target: {c_tgt[i]}, Continuous target: {r_tgt.flat[i]}")
First few examples:
Data: [-2. -2.], Binary target: False, Continuous target: 0.0
Data: [-1.4 -0.8], Binary target: False, Continuous target: 0.0
Data: [-0.8  0.4], Binary target: False, Continuous target: 0.0
Data: [-0.2  1.6], Binary target: True, Continuous target: 2.6000000000000005
Data: [ 0.6 -1.2], Binary target: False, Continuous target: 0.0
Data: [1.2 0. ], Binary target: False, Continuous target: 0.0
Data: [1.8 1.2], Binary target: False, Continuous target: 0.0

Now we can pose two different problems: a classification problem and a regression problem.

# Create a figure with 2 subplots, one for classification and one for regression
fig, axes = plt.subplots(1, 2, figsize=(6, 3), sharey=True)

# Plot the classification data
# Points where the binary target is True are colored red, others are colored blue
axes[0].scatter(xs, ys, c=np.where(c_tgt, 'r', 'b'), marker='.')
axes[0].set_aspect('equal')

# Plot the decision boundary (y = x^2)
decision_boundary_x = np.linspace(-np.sqrt(2), np.sqrt(2), 100)
decision_boundary_y = decision_boundary_x**2
axes[0].plot(decision_boundary_x, decision_boundary_y, 'k')

# Set title and labels for the classification plot
axes[0].set_title('Classification Problem')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')

# Plot the regression data as a color map
axes[1].pcolormesh(xs, ys, r_tgt, cmap='binary')
axes[1].set_aspect('equal')

# Set title and labels for the regression plot
axes[1].set_title('Regression Problem')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')

# Display the plots
plt.tight_layout()
plt.show()

In this scenario, we are considering two features, $x$ and $y$, and their relationship with the target variable. The target variable can be either a classification (c_tgt) or a regression (r_tgt) target.

The classification target c_tgt is True when $y > x^2$ and False otherwise. This means that if we know the value of $y$ and it is less than zero, we can be certain that the target is False (represented by the color blue in the plot). If $y$ is greater than zero, the target could be either True or False, but the likelihood of it being True (represented by the color red) increases as $y$ increases. The value of $x$ does not give us much information about the target on its own because the data is symmetric about the y-axis.

The regression target r_tgt is the sum of the squares of $x$ and $y$, but only when $y > x^2$ (otherwise it is zero). This means that if $y$ is less than zero, we know that the target is zero (represented by the color white in the plot). If $y$ is greater than zero, the target could be any non-negative number, and it tends to increase as $y$ increases. Again, the value of $x$ on its own does not give us much information about the target because the data is symmetric about the y-axis.

# Compute the mutual information for classification
mi_classif = mutual_info_classif(data, c_tgt, discrete_features=False, random_state=42)
print("Mutual Information for Classification:")
for feature, mi in zip(['x', 'y'], mi_classif):
    print(f"Feature {feature}: {mi:.2f}")

# Compute the mutual information for regression
mi_regress = mutual_info_regression(data, r_tgt.flat, discrete_features=False)
print("\nMutual Information for Regression:")
for feature, mi in zip(['x', 'y'], mi_regress):
    print(f"Feature {feature}: {mi:.2f}")
Mutual Information for Classification:
Feature x: 0.09
Feature y: 0.20

Mutual Information for Regression:
Feature x: 0.09
Feature y: 0.45

The mutual_info_classif function calculates the mutual information between the features and the classification target. Mutual information is a measure of the amount of information that knowing the value of one variable provides about the other. In this case, it tells us how much knowing the value of x or y tells us about the target. The function returns an array of two values, one for each feature. The value for y is greater than the value for x, which aligns with our intuition that y provides more information about the target than x.

The mutual_info_regression function does the same thing, but for the regression target. Again, the value for y is greater than the value for x, confirming our intuition.

In summary, these results demonstrate that in this scenario, the y feature provides more information about the target than the x feature, regardless of whether the target is a classification or a regression target. This aligns with our visual inspection of the data and our understanding of how the targets were generated.

Selecting the Top Features

In this section, we explore the process of selecting the most influential features from our dataset using different feature selection metrics. The objective is to rank the features based on their significance and then select the top ones.

SelectKBest Method (Comparing mutual_info_classif and f_classif - Wine Dataset)

We utilize the SelectKBest function from the sklearn.feature_selection module to select a fixed number of top features. This function requires two arguments: a scoring function and the number of top features to select (k). The scoring function should be capable of taking two arrays X and y, and returning either a pair of arrays (scores, pvalues) or a single array with scores. In our examples, we use mutual_info_classif and f_classif as scoring functions. These functions compute the mutual information and the F-value respectively between each feature and the target. We then employ the fit_transform method to fit the feature selection model to the data and subsequently transform the data to include only the selected features. The get_support method is used to retrieve a mask, or integer index, of the features selected.

from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

# Select the top 5 features using mutual_info_classif
feature_selection = SelectKBest(mutual_info_classif, k=5)
X_train_wine_selected_mi = feature_selection.fit_transform(X_train_wine, y_train_wine)

# Get the mask of the selected features
selected_features_mask_mi = feature_selection.get_support()

# Print the names of the selected features
print("Selected features using mutual_info_classif:")
print(np.array(wine.feature_names)[selected_features_mask_mi])

# Now select the top 5 features using f_classif
feature_selection = SelectKBest(f_classif, k=5)
X_train_wine_selected_f = feature_selection.fit_transform(X_train_wine, y_train_wine)

# Get the mask of the selected features
selected_features_mask_f = feature_selection.get_support()

# Print the names of the selected features
print("\nSelected features using f_classif:")
print(np.array(wine.feature_names)[selected_features_mask_f])
Selected features using mutual_info_classif:
['flavanoids' 'color_intensity' 'hue' 'od280/od315_of_diluted_wines'
 'proline']

Selected features using f_classif:
['alcohol' 'flavanoids' 'color_intensity' 'od280/od315_of_diluted_wines'
 'proline']

SelectPercentile Function (Comparing f_regression and mutual_info_regression - Diabetes Dataset)

We also introduce the SelectPercentile function, which operates similarly to SelectKBest. However, instead of specifying the number of top features to select, you specify a percentile. In our examples, we use f_regression and mutual_info_regression as scoring functions for a regression problem. These functions calculate the F-value and the mutual information respectively between each feature and the target.

from sklearn.feature_selection import SelectPercentile, f_regression, mutual_info_regression

# Select the top 25 percentile features using f_regression
feature_selection = SelectPercentile(f_regression, percentile=25)
X_train_diabetes_selected_f = feature_selection.fit_transform(X_train_diabetes, y_train_diabetes)

# Get the mask of the selected features
selected_features_mask_f = feature_selection.get_support()

# Print the names of the selected features
print("Selected features using f_regression:")
print(np.array(diabetes.feature_names)[selected_features_mask_f])

# Now select the top 25 percentile features using mutual_info_regression
feature_selection = SelectPercentile(mutual_info_regression, percentile=25)
X_train_diabetes_selected_mi = feature_selection.fit_transform(X_train_diabetes, y_train_diabetes)

# Get the mask of the selected features
selected_features_mask_mi = feature_selection.get_support()

# Print the names of the selected features
print("\nSelected features using mutual_info_regression:")
print(np.array(diabetes.feature_names)[selected_features_mask_mi])
Selected features using f_regression:
['bmi' 'bp' 's5']

Selected features using mutual_info_regression:
['bmi' 's4' 's5']

Our results indicate that different feature selection metrics often yield similar rankings. This suggests that the most influential features tend to remain consistent, regardless of the metric used to rank them. However, there can be some differences, especially when the number of top features selected is small. This exploration underlines the importance of feature selection in improving the performance of our models and providing insights into the underlying structure of our data.

Model-Based Feature Selection

In this section, we delve into the concept of model-based feature selection. Unlike the methods we’ve previously discussed, which select features based on their individual characteristics or their relationship with the target, model-based feature selection tailors the selection of features to the specific model we plan to use. This is particularly useful because not all models are linear, and the existence of a relationship between a feature and a target doesn’t necessarily mean that a particular model can exploit it effectively.

However, one challenge we face when selecting features is the sheer number of combinations of features we could potentially select. Evaluating every possible subset of features is computationally expensive and often impractical. To overcome this, we can use two main strategies: randomness and greediness.

Randomness involves randomly selecting subsets of features and evaluating them. On the other hand, greediness starts with an initial set of features and iteratively adds or removes features based on their performance. The term “greediness” refers to the fact that we may take steps that individually seem beneficial, but may lead us to a sub-optimal final set of features.

In practice, the greedy feature selection process typically follows these steps:

  1. Select an initial set of features.
  2. Build a model using the current set of features.
  3. Evaluate the importance of the features to the model.
  4. Keep the best features or drop the worst features from the current set. Optionally, systematically or randomly add or remove features.
  5. Repeat from step 2 as necessary.

Typically, we either start with a single feature and work forwards by adding features (forward selection), or start with all features and work backwards by removing features (backward elimination). This approach allows us to select a subset of features that work well with our chosen model, potentially improving the model’s performance and interpretability.
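scikit-learn packages this greedy strategy as SequentialFeatureSelector (available in version 0.24 and later; it is not used elsewhere in this exploration). Here is a minimal sketch, assuming the wine training split used in the earlier examples:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
import numpy as np

# Greedy forward selection: start with no features and repeatedly add the one
# that most improves cross-validated performance, until five are selected.
estimator = LogisticRegression(max_iter=5000)
forward_selector = SequentialFeatureSelector(estimator,
                                             n_features_to_select=5,
                                             direction="forward",  # "backward" starts from all features
                                             cv=5)
forward_selector.fit(X_train_wine, y_train_wine)

print(np.array(wine.feature_names)[forward_selector.get_support()])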

Model-Based Feature Selection: Using Model Coefficients and Importance Scores

In this subsection, we explore how to use model-based feature selection without the need for iterative refinement (step 5 from the previous summary). Many models provide an importance score or coefficient for each input feature, which can be used to rank and select features. The SelectFromModel function in scikit-learn uses these scores to select features.

RandomForestClassifier (wine) SelectFromModel

For several models, the default behavior of SelectFromModel is to keep features with scores above the mean. For example, when used with a RandomForestClassifier, SelectFromModel selects features whose importance is above the mean importance of all features.

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Initialize a RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Initialize a SelectFromModel instance with the RandomForestClassifier and a threshold of "mean"
# This will select features whose importance is greater than the mean importance of all features
feature_selection = SelectFromModel(rf, threshold="mean")

# Fit the SelectFromModel instance to the training data and transform the data
X_train_wine_selected = feature_selection.fit_transform(X_train_wine, y_train_wine)

# Print the shape of the original and transformed data
print(f"Original shape: {X_train_wine.shape}")
print(f"Shape after feature selection: {X_train_wine_selected.shape}")

# Get the mask of the selected features
selected_features_mask = feature_selection.get_support()

# Print the names of the selected features
print("\nSelected features:")
print(np.array(wine.feature_names)[selected_features_mask])
Original shape: (133, 13)
Shape after feature selection: (133, 5)

Selected features:
['alcohol' 'flavanoids' 'color_intensity' 'od280/od315_of_diluted_wines'
 'proline']

LogisticRegression with L1 regularization (wine)

For models that use L1 regularization (also known as Lasso regularization), SelectFromModel defaults to discarding features with small coefficients.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Initialize a LogisticRegression model with L1 penalty
lr = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")

# Initialize a SelectFromModel instance with the LogisticRegression model
# This will select features whose coefficients in the logistic regression model are non-zero
feature_selection = SelectFromModel(lr)

# Fit the SelectFromModel instance to the training data and transform the data
X_train_wine_selected = feature_selection.fit_transform(X_train_wine, y_train_wine)

# Print the shape of the original and transformed data
print(f"Original shape: {X_train_wine.shape}")
print(f"Shape after feature selection: {X_train_wine_selected.shape}")

# Get the mask of the selected features
selected_features_mask = feature_selection.get_support()

# Print the names of the selected features
print("\nSelected features:")
print(np.array(wine.feature_names)[selected_features_mask])
Original shape: (133, 13)
Shape after feature selection: (133, 10)

Selected features:
['alcohol' 'malic_acid' 'ash' 'alcalinity_of_ash' 'magnesium' 'flavanoids'
 'proanthocyanins' 'color_intensity' 'od280/od315_of_diluted_wines'
 'proline']

However, one limitation of SelectFromModel is that it doesn’t provide an easy way to order the selected features. It simply provides a binary decision - whether to keep a feature or not - based on the model’s coefficients or feature importances. If you need to know the order of importance of the selected features, you would need to refer back to the coefficients or importances from the underlying model.
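If an ordering is needed, one workaround (a sketch, reusing the RandomForestClassifier-based selector and the wine training split from above) is to rank the kept features by the fitted estimator's importance scores:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rf_selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42),
                              threshold="mean")
rf_selector.fit(X_train_wine, y_train_wine)

# estimator_ is the model fitted inside SelectFromModel; rank the kept features by its importances
importances = rf_selector.estimator_.feature_importances_
kept = rf_selector.get_support()

order = np.argsort(importances[kept])[::-1]
for name, score in zip(np.array(wine.feature_names)[kept][order], importances[kept][order]):
    print(f"{name}: {score:.3f}")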

Model-Guided Feature Selection: Recursive Feature Elimination

In the process of model-based feature selection, we can adopt a more dynamic approach where the selection of features is not a one-time event, but rather an iterative process. This method is known as Recursive Feature Elimination (RFE).

Imagine a scenario where we build a model, evaluate the features, and then discard the least significant feature. We then repeat the process with the remaining features. As features can interact with each other, the ranking of the features may change with each iteration. We continue this process until we’ve discarded as many features as we deem necessary.

RandomForestClassifier (wine) (RFE)

In scikit-learn, this method is implemented through the RFE class. For instance, we can use a RandomForestClassifier as the underlying model. Random Forests operate by repeatedly working with random subsets of the features. If a feature is frequently used in the component trees of the Random Forest, it is considered more important than a less frequently used feature. This concept is encapsulated as the feature importance in the Random Forest model.

In the next sections, we’ll see how to implement this in code and understand the results.

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Initialize a RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Initialize a Recursive Feature Elimination (RFE) instance with the RandomForestClassifier
# We want to select the top 5 features
feature_selection = RFE(rf, n_features_to_select=5)

# Fit the RFE instance to the training data and transform the data
X_train_wine_selected = feature_selection.fit_transform(X_train_wine, y_train_wine)

# Print the shape of the original and transformed data
print(f"Original shape: {X_train_wine.shape}")
print(f"Shape after feature selection: {X_train_wine_selected.shape}")

# Get the mask of the selected features
selected_features_mask = feature_selection.get_support()

# Print the names of the selected features
print("\nSelected features:")
print(np.array(wine.feature_names)[selected_features_mask])
Original shape: (133, 13)
Shape after feature selection: (133, 5)

Selected features:
['alcohol' 'flavanoids' 'color_intensity' 'od280/od315_of_diluted_wines'
 'proline']

LinearRegression (RFE) (wine)

We can also select features using a linear regression model. In the sklearn implementation, features are dropped based on the size of their regression coefficients.

Formal Statistics for Feature Selection

In feature selection, we can also utilize formal statistical tests such as the t-test and ANOVAs. These tests provide a structured way to evaluate the significance of individual features. However, it’s important to note that these tests evaluate each feature individually, in isolation from all other features.

One common method in statistical feature selection is the use of F-statistics. In this context, the rank ordering of univariate F-statistics is the same as the rank ordering of correlations. This relationship can be proven algebraically.
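
To make the connection concrete: for a single feature with sample correlation $r$ to the target and $n$ examples, the univariate F-statistic of the corresponding simple regression can be written as $F = \frac{r^2}{1 - r^2}(n - 2)$. Since this is a strictly increasing function of $r^2$, ranking features by their F-statistics produces the same ordering as ranking them by squared correlation, which is the algebraic relationship referred to above.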

In the statistical world, an F-statistic is often used in a technique called forward stepwise selection. This is a sequential process that starts with an initial feature and then combines it with the feature that performs best in combination with the initial feature. However, this is not the approach taken by scikit-learn, which evaluates each feature individually.

For those interested in diving deeper into how the distributions underlying these statistics relate to one another, the paper “Univariate Distribution Relationships” by Leemis and McQueston provides a comprehensive overview and includes a helpful graphic that is widely available online.

In the next sections, we’ll explore how to implement these statistical methods in code and interpret the results.

# Import necessary libraries
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# Initialize the model and the RFE (Recursive Feature Elimination) selector
model = LinearRegression()
feature_selection = RFE(estimator=model, n_features_to_select=5)

# Fit the model and apply the feature selection
transformed_data = feature_selection.fit_transform(X_train_wine, y_train_wine)

# Print the shape of the transformed data to see how many features we have now
print(f"Transformed data shape: {transformed_data.shape}")

# Get the mask of the selected features
selected_features_mask = feature_selection.get_support()

# Print the names of the selected features
selected_features = np.array(wine.feature_names)[selected_features_mask]
print(f"Selected features: {selected_features}")
Transformed data shape: (133, 5)
Selected features: ['alcohol' 'total_phenols' 'flavanoids' 'hue'
 'od280/od315_of_diluted_wines']

Ranking Features with Recursive Feature Elimination

Recursive Feature Elimination (RFE) is a method that ranks features based on their importance to a model. It iteratively builds a model, evaluates the importance of features, and then eliminates the least important features. This process is repeated until a specified number of features are left.

In scikit-learn, the ranking_ attribute of the RFE object provides the ranking of each feature. Features with a rank of 1 are the selected ones; the remaining features are ranked in reverse order of elimination. Because RFE discards the least important feature first, the feature with the highest rank was the first to be eliminated, while a feature with a rank of 2 was the last to be eliminated (the best of the discarded features), and so on.

However, to order the selected features (those with a rank of 1), we need to do some additional work. We can use the coefficients of the model used in the RFE to order these selected features. The estimator_.coef_ attribute of the RFE object provides the coefficients of the final model.

Here’s an example of how to do this:

# Print the ranking of the features
feature_ranking = feature_selection.ranking_
print("Feature Ranking:")
for i, rank in enumerate(feature_ranking):
    print(f"Feature {i}: {rank}")

# Print the coefficients of the selected features in the model
feature_coefficients = feature_selection.estimator_.coef_
print("\nFeature Coefficients:")
for i, coef in enumerate(feature_coefficients):
    print(f"Feature {i}: {coef}")
Feature Ranking:
Feature 0: 1
Feature 1: 5
Feature 2: 2
Feature 3: 4
Feature 4: 9
Feature 5: 1
Feature 6: 1
Feature 7: 3
Feature 8: 7
Feature 9: 6
Feature 10: 1
Feature 11: 1
Feature 12: 8

Feature Coefficients:
Feature 0: -0.21637281445620518
Feature 1: 0.12811915212211028
Feature 2: -0.39364306238364616
Feature 3: -0.6394356069603024
Feature 4: -0.3572356177869348
# Get the absolute values of the coefficients of the selected features
abs_coef = np.abs(feature_selection.estimator_.coef_)
print("Absolute Coefficients of Selected Features:", abs_coef)

# Get the indices that would sort the absolute coefficients in ascending order
sorted_indices = np.argsort(abs_coef)
print("Indices that would sort the absolute coefficients (from least to most important):", sorted_indices)

# Get the indices of the selected features (where the ranking is 1)
selected_indices = np.where(feature_selection.ranking_ == 1)[0]
print("Indices of Selected Features:", selected_indices)

# Use the sorted indices to order the indices of the selected features
ordered_selected_indices = selected_indices[sorted_indices]
print("Ordered Indices of Selected Features (from least to most important):", ordered_selected_indices)

# Use the ordered indices to get the names of the selected features
selected_feature_names = np.array(wine.feature_names)[ordered_selected_indices]
print("Names of Selected Features (from least to most important):", selected_feature_names)
Absolute Coefficients of Selected Features: [0.22 0.13 0.39 0.64 0.36]
Indices that would sort the absolute coefficients (from least to most important): [1 0 4 2 3]
Indices of Selected Features: [ 0  5  6 10 11]
Ordered Indices of Selected Features (from least to most important): [ 5  0 11  6 10]
Names of Selected Features (from least to most important): ['total_phenols' 'alcohol' 'od280/od315_of_diluted_wines' 'flavanoids'
 'hue']

In this code, feature_selection is the RFE object, and wine.feature_names is an array of the feature names. The output is the names of the selected features, ordered by the absolute value of their coefficients in the final model.

Combining Feature Selection and Model Building with Pipelines

Feature selection can be integrated with model building using scikit-learn’s Pipeline. This allows us to streamline the process of selecting features and training a model on those features. Let’s see how this works with a few examples.

5-fold CV and Regularized Logistic Regression (wine dataset)

First, let’s establish a baseline performance using regularized logistic regression on the wine dataset:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler to the wine data and transform it
wine_scaled = scaler.fit_transform(wine.data)
print("First 5 rows of scaled wine data:\n", wine_scaled[:5])

# Initialize the Logistic Regression model
log_reg = LogisticRegression(max_iter=1000)

# Perform cross-validation on the scaled wine data using the Logistic Regression model
cv_scores = cross_val_score(log_reg, wine_scaled, wine.target)

# Print the cross-validation scores
print("\nCross-validation scores:", cv_scores)

# Print the mean cross-validation score
print("\nMean cross-validation score:", cv_scores.mean())
First 5 rows of scaled wine data:
 [[ 1.52 -0.56  0.23 -1.17  1.91  0.81  1.03 -0.66  1.22  0.25  0.36  1.85
   1.01]
 [ 0.25 -0.5  -0.83 -2.49  0.02  0.57  0.73 -0.82 -0.54 -0.29  0.41  1.11
   0.97]
 [ 0.2   0.02  1.11 -0.27  0.09  0.81  1.22 -0.5   2.14  0.27  0.32  0.79
   1.4 ]
 [ 1.69 -0.35  0.49 -0.81  0.93  2.49  1.47 -0.98  1.03  1.19 -0.43  1.18
   2.33]
 [ 0.3   0.23  1.84  0.45  1.28  0.81  0.66  0.23  0.4  -0.32  0.36  0.45
  -0.04]]

Cross-validation scores: [0.97 0.97 1.   1.   1.  ]

Mean cross-validation score: 0.9888888888888889

Pipeline (SelectFromModel using L1 regularized Logistic Regression)

Next, we might wonder if selecting features based on the same classifier would improve the performance. We can test this by adding a feature selection step to our pipeline:

from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_wine

# Load wine dataset
wine = load_wine()
print("Wine data loaded. It has", wine.data.shape[0], "samples and", wine.data.shape[1], "features.")

# Create a pipeline
# Step 1: Scale the data using StandardScaler
# Step 2: Perform feature selection using L1 regularized logistic regression
# Step 3: Train a logistic regression model
pipe = make_pipeline(
    StandardScaler(),
    SelectFromModel(LogisticRegression(max_iter=1000, penalty='l1', solver='liblinear')),
    LogisticRegression(max_iter=1000)
)

print("Pipeline created. It consists of the following steps:", [step[0] for step in pipe.steps])

# Perform cross-validation
scores = cross_val_score(pipe, wine.data, wine.target)

# Print the cross-validation scores
print("\nCross-validation scores:", scores)

# Print the mean cross-validation score
print("\nMean cross-validation score:", scores.mean())
Wine data loaded. It has 178 samples and 13 features.
Pipeline created. It consists of the following steps: ['standardscaler', 'selectfrommodel', 'logisticregression']

Cross-validation scores: [0.97 0.97 1.   1.   1.  ]

Mean cross-validation score: 0.9888888888888889

In this case, the performance did not improve. However, this doesn’t mean that feature selection is not useful. The effectiveness of feature selection can depend on the dataset and the models used.

Pipeline (RFE and RandomForestClassifier)

We can also try using a different model for feature selection. Here, we use a random forest classifier to select the top 5 features:

from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_wine

# Load wine dataset
wine = load_wine()
print("Wine data loaded. It has", wine.data.shape[0], "samples and", wine.data.shape[1], "features.")

# Create a Recursive Feature Elimination (RFE) object
# We use a RandomForestClassifier as the estimator and select the top 5 features
feature_selection = RFE(RandomForestClassifier(), n_features_to_select=5)
print("RFE created with RandomForestClassifier. Top 5 features will be selected.")

# Create a pipeline
# Step 1: Scale the data using StandardScaler
# Step 2: Perform feature selection using RFE
# Step 3: Train a logistic regression model
pipe = make_pipeline(StandardScaler(), feature_selection, LogisticRegression())
print("Pipeline created. It consists of the following steps:", [step[0] for step in pipe.steps])

# Perform cross-validation
scores = cross_val_score(pipe, wine.data, wine.target)

# Print the cross-validation scores
print("\nCross-validation scores:", scores)

# Print the mean cross-validation score
print("\nMean cross-validation score:", scores.mean())
Wine data loaded. It has 178 samples and 13 features.
RFE created with RandomForestClassifier. Top 5 features will be selected.
Pipeline created. It consists of the following steps: ['standardscaler', 'rfe', 'logisticregression']

Cross-validation scores: [0.89 0.97 0.97 0.97 1.  ]

Mean cross-validation score: 0.9609523809523811

Again, the performance did not improve. However, it’s worth noting that experimenting with different learning pipelines is easy and can potentially yield improvements for other datasets or models.

Pipeline (SelectPercentile and mutual_info_classif)

Finally, we can use an information-based feature evaluator in our pipeline:

from sklearn.feature_selection import SelectPercentile, mutual_info_classif

feature_selection = SelectPercentile(mutual_info_classif, percentile=25)
pipe = make_pipeline(StandardScaler(),
                     feature_selection,
                     LogisticRegression())
cross_val_score(pipe, wine.data, wine.target)
array([0.89, 0.94, 0.94, 0.97, 0.94])

Exploring Feature Selection with GridSearchCV

In our exploration of feature selection within a learning pipeline, we found that none of the methods we tried significantly enhanced the model’s performance on the wine dataset. These methods included using an L1 regularized logistic regression, Recursive Feature Elimination (RFE) with a random forest, and SelectPercentile with mutual information classification. This outcome suggests that the original features in this dataset are already quite informative for the logistic regression model.

However, it’s important to remember that this may not always be the case with different datasets or models. Therefore, it’s beneficial to experiment with various feature selection techniques and model parameters.

One particular area of exploration could be fine-tuning the number of features to retain. This can be achieved using GridSearchCV on the pipeline, which allows for systematic, cross-validated parameter tuning. In the following section, we’ll specifically use mutual_info_classif as our feature selection method in conjunction with GridSearchCV.

from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_wine

# Load wine dataset
wine = load_wine()

# Define feature selection method
feature_selection = SelectPercentile(mutual_info_classif)

# Create a pipeline
pipe = make_pipeline(
    StandardScaler(),     # Step 1: Scale the data
    feature_selection,    # Step 2: Feature selection
    LogisticRegression()  # Step 3: Logistic Regression
)

# Define the parameter grid for GridSearchCV
param_grid = {'selectpercentile__percentile': [5, 10, 15, 20, 25]}

# Initialize GridSearchCV
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)

# Fit GridSearchCV to the data
grid.fit(wine.data, wine.target)

# Print the best parameters found by GridSearchCV
print("Best parameters found by GridSearchCV:")
print(grid.best_params_)

# Print the best score found by GridSearchCV
print("\nBest score found by GridSearchCV:")
print(grid.best_score_)
Best parameters found by GridSearchCV:
{'selectpercentile__percentile': 20}

Best score found by GridSearchCV:
0.9441269841269841

In this code, we’re using GridSearchCV to find the optimal percentile of features to keep when using SelectPercentile for feature selection. The GridSearchCV function performs a cross-validated grid-search over a parameter grid - in this case, over different percentiles for feature selection.

The param_grid dictionary specifies the parameter grid to explore. In this case, it’s looking at percentiles of 5, 10, 15, 20, and 25. For each of these percentiles, GridSearchCV will perform a 5-fold cross-validation (as specified by cv=5) and compute the average score.

After the grid search is performed with the fit method, the best_params_ attribute of the GridSearchCV object gives us the parameters that gave the highest score during the grid search. In this case, it will tell us the optimal percentile of features to keep.

The best_score_ attribute gives us the highest mean cross-validated score achieved during the grid search. This score is a measure of how well the model with the optimal parameters performed.

So, in summary, this code is finding the optimal percentile of features to keep when using SelectPercentile for feature selection, and it’s giving us the cross-validated score of a logistic regression model that uses this optimal number of features. This is a powerful approach to feature selection that can help us build more effective and efficient models.
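
If we want to see more than just the single best setting, the grid object also records the mean cross-validated score for every percentile it tried. A small sketch using the fitted grid from above:

# Import pandas to tabulate the grid-search results
import pandas as pd

# cv_results_ holds one row per parameter setting tried by the grid search
results = pd.DataFrame(grid.cv_results_)
print(results[['param_selectpercentile__percentile',
               'mean_test_score', 'std_test_score']])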

Feature Construction with Kernels: A Motivating Example

In this section, we explore a classification problem that presents a challenge for linear methods. The task is to classify points as being inside or outside of a circle.

We first create a grid of points using numpy’s mgrid function. The target variable target is a boolean array that indicates whether each point is inside (False) or outside (True) a circle of radius 1.

We then visualize the points and the circle using matplotlib. Points inside the circle are colored blue, while points outside are colored red.

# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt

# Create a grid of points
xs, ys = np.mgrid[-2:2:.2, -2:2:.2]

# Define the target variable: True if the point is outside the circle of radius 1, False otherwise
target = (xs**2 + ys**2 > 1).flatten()

# Combine xs and ys into a single data array
data = np.c_[xs.flat, ys.flat]

# Create a figure and an axis
fig, ax = plt.subplots(figsize=(4,3))

# Scatter plot of the data points, colored by the target variable
ax.scatter(xs, ys, c=np.where(target, 'r', 'b'), marker='.')

# Set the aspect ratio of the plot to be equal
ax.set_aspect('equal')

# Add a circle to represent the boundary
circ = plt.Circle((0,0), 1, color='k', fill=False)
ax.add_patch(circ)

# Set the title and labels
ax.set_title('Classification Problem: Inside vs Outside of a Circle')
ax.set_xlabel('x')
ax.set_ylabel('y')

# Print the first 10 rows of data and target for clarity
print("First 10 rows of data:")
print(data[:10])
print("\nCorresponding target values:")
print(target[:10])
First 10 rows of data:
[[-2.  -2. ]
 [-2.  -1.8]
 [-2.  -1.6]
 [-2.  -1.4]
 [-2.  -1.2]
 [-2.  -1. ]
 [-2.  -0.8]
 [-2.  -0.6]
 [-2.  -0.4]
 [-2.  -0.2]]

Corresponding target values:
[ True  True  True  True  True  True  True  True  True  True]

Exploring Linear Learning Methods: SVC and Logistic Regression

In this section, we will be exploring two popular linear learning methods: Support Vector Classifier (SVC) and Logistic Regression. These methods are widely used for classification problems due to their simplicity and effectiveness. However, they are inherently linear models, which means they may struggle with problems that have a non-linear decision boundary.

Support Vector Classifier (SVC)

The Support Vector Classifier is a type of Support Vector Machine (SVM) that is used for classification problems. It works by finding the hyperplane that best separates the classes in the feature space. The “best” hyperplane is the one that maximizes the margin, which is the distance between the hyperplane and the nearest data point from each class.

However, as a linear model, SVC can only find linear decision boundaries. This means that it may not perform well on problems where the classes are not linearly separable. In our example, we can see that the SVC predicts all points as being inside the circle, which is not accurate.

Logistic Regression

Logistic Regression is another popular method for classification problems. It works by passing a linear combination of the features through a logistic function to model the probability that a given data point belongs to a particular class; that probability is then thresholded to produce a binary outcome.
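
For reference, the logistic (sigmoid) function that performs this mapping can be written as $P(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\mathbf{w} \cdot \mathbf{x} + b)}}$, where $\mathbf{w}$ and $b$ are the learned weights and intercept. The model predicts class 1 whenever this probability exceeds one half, which happens exactly where the linear score $\mathbf{w} \cdot \mathbf{x} + b$ crosses zero, and that is why the resulting decision boundary is linear.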

Like SVC, Logistic Regression is also a linear model. This means that it also struggles with problems that have non-linear decision boundaries. In our example, we can see that the Logistic Regression model also predicts all points as being inside the circle, which is not accurate.

# Import necessary libraries
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

# Define a list of models to compare
linear_modeling_shootout = [SVC(kernel='linear'), LogisticRegression()]

# Create a figure and axes for plotting
fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharey=True)

# Add a title to the figure
fig.suptitle("Comparison of Linear Classification Methods", fontsize=16)

# Loop over each model and corresponding axes
for model, ax in zip(linear_modeling_shootout, axes):
    # Plot the decision boundary for the model
    plot_decision_boundary(ax, data, target, model, [0, 1])

    # Set the title of the plot to the name of the model
    ax.set_title(get_model_name(model))

# Adjust the layout of the plots for better visualization
plt.tight_layout()

# Add text below the plots explaining the results
plt.figtext(0.5, -0.05,
            "Both the Support Vector Classifier and Logistic Regression predict all points as being inside the circle.",
            ha="center", fontsize=12, wrap=True)
Text(0.5, -0.05, 'Both the Support Vector Classifier and Logistic Regression predict all points as being inside the circle.')

This example illustrates a limitation of linear methods: they struggle with problems that have a non-linear decision boundary. In the next section, we’ll see if we can improve our results by using non-linear learning methods, such as k-Nearest Neighbors (k-NN) and Decision Trees.

Exploring Non-Linear Learning Methods: k-Nearest Neighbors and Decision Trees

In the previous section, we observed that linear methods can struggle with problems that have a non-linear decision boundary. In this section, we aim to overcome this limitation by employing non-linear learning methods. Specifically, we’ll be experimenting with k-Nearest Neighbors (k-NN) and Decision Trees, two popular non-linear learning methods.

Visualizing Decision Boundaries of Non-Linear Models

To better understand the performance of these non-linear models, we’ll visualize their decision boundaries. This is achieved by training each model on our data and plotting the decision boundary. We’ll do this for different configurations of each model, allowing us to observe how changing model parameters can impact the decision boundary.

# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

# Define parameters for KNeighborsClassifier and DecisionTreeClassifier
k_neighbors_values, max_depth_values = [1, 20], [1, 3, 5, 10]

# Create a list of models with different parameters
models_with_parameters = (
    [(KNeighborsClassifier(n_neighbors=n_neighbors), n_neighbors)
     for n_neighbors in k_neighbors_values] +
    [(DecisionTreeClassifier(max_depth=max_depth), max_depth)
     for max_depth in max_depth_values])

# Create a figure with multiple subplots
fig, axes = plt.subplots(2, 3, figsize=(9, 6), sharex=True, sharey=True)

# Add a title to the figure
fig.suptitle("Comparison of Non-Linear Classification Methods", fontsize=16)

# Loop over each model, parameter, and corresponding axes
for (model, parameter), ax in zip(models_with_parameters, axes.flat):
    # Plot the decision boundary for the model
    plot_decision_boundary(ax, data, target, model, [0, 1])

    # Set the title of the plot to the name of the model and its parameter
    ax.set_title(f"{get_model_name(model)} (param={parameter})")

# Adjust the layout of the plots for better visualization
plt.tight_layout(rect=[0, 0, 1, 0.96])  # Adjust the whitespace to accommodate the suptitle

# Add a general description below the plots
plt.figtext(0.5, -0.11,
            "These plots show the decision boundaries of KNeighborsClassifier and DecisionTreeClassifier models with different parameters. "
            "Unlike linear models, these non-linear models are able to capture the circular pattern in the data.",
            ha="center", fontsize=12, wrap=True)
Text(0.5, -0.11, 'These plots show the decision boundaries of KNeighborsClassifier and DecisionTreeClassifier models with different parameters. Unlike linear models, these non-linear models are able to capture the circular pattern in the data.')

Similarity between KNC(n_neigh=1) and DTC(max_depth=10)

The k-Nearest Neighbors classifier with one neighbor (KNC(n_neigh=1)) and the Decision Tree classifier with a maximum depth of 10 (DTC(max_depth=10)) show similar classification boundaries. This is because when k-NN is set to 1, it essentially classifies a point based on the closest point in the training set, which can lead to highly complex boundaries. Similarly, a Decision Tree with a high maximum depth can also create complex boundaries by making many splits in the data. Therefore, both classifiers can create complex, non-linear decision boundaries that can closely fit the data, leading to their similarity in this instance.

Enhancing Linear Methods with Additional Features

In the previous sections, we observed that non-linear learning methods such as k-Nearest Neighbors (k-NN) and Decision Trees performed significantly better on our data than linear methods. However, instead of discarding linear methods altogether, we can attempt to improve their performance by manually adding some features that might support a linear boundary.

Adding Squared Features

One approach to enhance linear methods is to add squared features into our data. In this case, we add squares—x² and y²—to our data and observe the impact on the performance of our linear methods.

# Import necessary modules
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Define the linear models for comparison
linear_modeling_shootout = [SVC(kernel='linear'), LogisticRegression()]

# Display original data shape and first row
print("Original data shape:", data.shape)
print("First row of original data:", data[0])

# Create new data by adding squared features
new_data = np.concatenate([data, data**2], axis=1)

# Display new data shape and first row
print("Augmented data shape:", new_data.shape)
print("First row of augmented data:", new_data[0])

# Create a figure with multiple subplots
fig, axes = plt.subplots(1, 2, figsize=(5, 2.5))

# Add a title to the figure
fig.suptitle("Comparison of Linear Classification Methods with Squared Features", fontsize=16)

# For each model and corresponding subplot
for model, ax in zip(linear_modeling_shootout, axes):
    # Plot the decision boundary for the model using the squares for prediction and graphing
    plot_decision_boundary(ax, new_data, target, model, [2, 3])
    # Set the title of the subplot to the name of the model
    ax.set_title(get_model_name(model))

# Adjust the layout of the plots for better visualization
plt.tight_layout(rect=[0, 0, 1, 0.96])  # Adjust the whitespace to accommodate the suptitle

# Add a general description below the plots
plt.figtext(0.5, -0.17,
            "These plots show the decision boundaries of SVC and LogisticRegression models with squared features. "
            "By adding squared features to the data, these linear models are now able to capture the circular pattern in the data.",
            ha="center", fontsize=12, wrap=True)
Original data shape: (400, 2)
First row of original data: [-2. -2.]
Augmented data shape: (400, 4)
First row of augmented data: [-2. -2.  4.  4.]

Text(0.5, -0.17, 'These plots show the decision boundaries of SVC and LogisticRegression models with squared features. By adding squared features to the data, these linear models are now able to capture the circular pattern in the data.')

Understanding the Impact of Squaring Features

In the previous step, we transformed our data by squaring the features. This had a significant impact on our data and the problem we’re trying to solve. Let’s take a closer look at what happened.

When we square values, negative values become positive. This means that our problem has changed from asking if $x_{old}^2 + y_{old}^2 < 1$ to asking if $x_{new} + y_{new} < 1$. The latter is a linear question, which is why our linear models are now able to handle it.

To better understand this transformation, let’s examine two test points: one that starts inside the circle and one that starts outside the circle. We’ll see where they end up after we construct new features out of them. Here’s a breakdown of what the code is doing:

  1. It defines two test points and calculates the x and y coordinates for points on a circle using trigonometry.

  2. It sets up two subplots and defines labels and functions for each subplot. The first function is the identity function (which does nothing and thus keeps the original features), and the second function squares the features.

  3. It then loops over the subplots, functions, and labels. For each subplot, it plots the circle, the data points, and the test points. It also sets the axis to be equal, labels the axes, and sets the title of the subplot.

  4. Finally, it moves the y-axis of the second subplot to the right side for better readability.

This process helps us visualize how squaring the features transforms the data and makes the problem linearly separable. It’s a good reminder that sometimes, a simple transformation can make a big difference in how well our models perform.

# Import necessary modules
import numpy as np
import matplotlib.pyplot as plt

# Define test points and circle points
test_points = np.array([[.5, .5], [-1, -1.25]])
circle_points = np.linspace(0, 2*np.pi, 100)
circle_xs, circle_ys = np.sin(circle_points), np.cos(circle_points)

# Print test points and their squared values
print("Test points:\n", test_points)
print("Squared test points:\n", test_points**2)

# Set up subplots, labels, and functions
fig, axes = plt.subplots(1, 2, figsize=(6, 3))
fig.suptitle('Transforming Data Space', fontsize=16)
labels = [('x', 'y', 'Original Space'), ('$x^2$', '$y^2$', 'Squared Space')]

# Define transformations: Original and squared features
transformations = [lambda x: x, lambda x: x**2]

# Loop over subplots, transformations, and labels
for ax, transform, (xlabel, ylabel, title) in zip(axes, transformations, labels):
    # Plot circle, data points, and test points
    ax.plot(transform(circle_xs), transform(circle_ys), 'k')
    ax.scatter(*transform(data.T), c=np.where(target, 'r', 'b'), marker='.')
    ax.scatter(*transform(test_points.T), c=['k', 'y'], s=100, marker='^')

    # Set equal axis, labels, and title
    ax.axis('equal')
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)

    # Add annotation to the test points
    for (x, y), color in zip(transform(test_points), ['k', 'k']):
        ax.annotate(f'({x:.2f}, {y:.2f})', (x, y), textcoords="offset points",
                    xytext=(-10, -10), ha='center', fontsize=8, color=color)

    # Move y-axis to the right for the second subplot
    if transform == transformations[1]:
        ax.yaxis.tick_right()
        ax.yaxis.set_label_position("right")

# Adjust subplot parameters to give the title enough space
plt.subplots_adjust(top=0.85, wspace=0.3)
Test points:
 [[ 0.5   0.5 ]
 [-1.   -1.25]]
Squared test points:
 [[0.25 0.25]
 [1.   1.56]]

Automating Feature Construction with Kernels

While we were able to solve our problem by manually squaring our features, this approach has its limitations. First, it required us to know ahead of time that squaring was the appropriate transformation. If a different transformation, such as taking a ratio, had been needed, squaring wouldn’t have helped. Second, this manual approach isn’t scalable to larger, more complex datasets.

To overcome these limitations, we can use kernel methods. Kernel methods are a powerful tool for automatically constructing and selecting features. They can’t do everything we could possibly do by hand, but they are more than sufficient in many scenarios.

In the following sections, we’ll explore how to use kernels in a somewhat clunky, manual fashion. However, keep in mind that kernels can be used in much more powerful ways that don’t require manual intervention. Our ultimate goal is to build an automated system that can try many possible constructed features and select the best ones.

Understanding Kernels

Kernels are a powerful tool in machine learning that allow us to apply pre-existing knowledge about the similarities between data examples. They are not magic, but they are incredibly useful. At their core, kernel methods involve two components: a kernel and a learning method.

Before we dive into the specifics of what a kernel is, let’s first understand what it does and how it relates to our problem of constructing useful features. We’ll start by manually applying a kernel to our data. Note that this is not the usual way we use kernels, but it will help us understand what’s happening under the hood.

In the following code, we apply a polynomial kernel of degree 2 to our data. This specific kernel captures the same idea of squaring the input features, which we did manually earlier.

# Import necessary modules
from sklearn.metrics import pairwise
import numpy as np
import pprint

# Apply polynomial kernel transformation to the data
kernel_transformed_data = pairwise.polynomial_kernel(data, data, degree=2)

# Create a pretty printer
pp = pprint.PrettyPrinter(indent=4)

# Print the first example in the original and transformed data
print('First example in original data: ')
pp.pprint(data[0])
print('First example in kernel-transformed data: ')
pp.pprint(kernel_transformed_data[0])

# Print the number of features in the original and transformed data
print('Number of features in original data:', data.shape[1])
print('Number of features in kernel-transformed data:', kernel_transformed_data.shape[1])

# Print the number of examples in the original and transformed data
print('Number of examples in both original and kernel-transformed data:', len(data), len(kernel_transformed_data))
First example in original data:
array([-2., -2.])
First example in kernel-transformed data:
array([25.  , 23.04, 21.16, 19.36, 17.64, 16.  , 14.44, 12.96, 11.56,
       10.24,  9.  ,  7.84,  6.76,  5.76,  4.84,  4.  ,  3.24,  2.56,
        1.96,  1.44, 23.04, 21.16, 19.36, 17.64, 16.  , 14.44, 12.96,
       11.56, 10.24,  9.  ,  7.84,  6.76,  5.76,  4.84,  4.  ,  3.24,
        2.56,  1.96,  1.44,  1.  , 21.16, 19.36, 17.64, 16.  , 14.44,
       12.96, 11.56, 10.24,  9.  ,  7.84,  6.76,  5.76,  4.84,  4.  ,
        3.24,  2.56,  1.96,  1.44,  1.  ,  0.64, 19.36, 17.64, 16.  ,
       14.44, 12.96, 11.56, 10.24,  9.  ,  7.84,  6.76,  5.76,  4.84,
        4.  ,  3.24,  2.56,  1.96,  1.44,  1.  ,  0.64,  0.36, 17.64,
       16.  , 14.44, 12.96, 11.56, 10.24,  9.  ,  7.84,  6.76,  5.76,
        4.84,  4.  ,  3.24,  2.56,  1.96,  1.44,  1.  ,  0.64,  0.36,
        0.16, 16.  , 14.44, 12.96, 11.56, 10.24,  9.  ,  7.84,  6.76,
        5.76,  4.84,  4.  ,  3.24,  2.56,  1.96,  1.44,  1.  ,  0.64,
        0.36,  0.16,  0.04, 14.44, 12.96, 11.56, 10.24,  9.  ,  7.84,
        6.76,  5.76,  4.84,  4.  ,  3.24,  2.56,  1.96,  1.44,  1.  ,
        0.64,  0.36,  0.16,  0.04,  0.  , 12.96, 11.56, 10.24,  9.  ,
        7.84,  6.76,  5.76,  4.84,  4.  ,  3.24,  2.56,  1.96,  1.44,
        1.  ,  0.64,  0.36,  0.16,  0.04,  0.  ,  0.04, 11.56, 10.24,
        9.  ,  7.84,  6.76,  5.76,  4.84,  4.  ,  3.24,  2.56,  1.96,
        1.44,  1.  ,  0.64,  0.36,  0.16,  0.04,  0.  ,  0.04,  0.16,
       10.24,  9.  ,  7.84,  6.76,  5.76,  4.84,  4.  ,  3.24,  2.56,
        1.96,  1.44,  1.  ,  0.64,  0.36,  0.16,  0.04,  0.  ,  0.04,
        0.16,  0.36,  9.  ,  7.84,  6.76,  5.76,  4.84,  4.  ,  3.24,
        2.56,  1.96,  1.44,  1.  ,  0.64,  0.36,  0.16,  0.04,  0.  ,
        0.04,  0.16,  0.36,  0.64,  7.84,  6.76,  5.76,  4.84,  4.  ,
        3.24,  2.56,  1.96,  1.44,  1.  ,  0.64,  0.36,  0.16,  0.04,
        0.  ,  0.04,  0.16,  0.36,  0.64,  1.  ,  6.76,  5.76,  4.84,
        4.  ,  3.24,  2.56,  1.96,  1.44,  1.  ,  0.64,  0.36,  0.16,
        0.04,  0.  ,  0.04,  0.16,  0.36,  0.64,  1.  ,  1.44,  5.76,
        4.84,  4.  ,  3.24,  2.56,  1.96,  1.44,  1.  ,  0.64,  0.36,
        0.16,  0.04,  0.  ,  0.04,  0.16,  0.36,  0.64,  1.  ,  1.44,
        1.96,  4.84,  4.  ,  3.24,  2.56,  1.96,  1.44,  1.  ,  0.64,
        0.36,  0.16,  0.04,  0.  ,  0.04,  0.16,  0.36,  0.64,  1.  ,
        1.44,  1.96,  2.56,  4.  ,  3.24,  2.56,  1.96,  1.44,  1.  ,
        0.64,  0.36,  0.16,  0.04,  0.  ,  0.04,  0.16,  0.36,  0.64,
        1.  ,  1.44,  1.96,  2.56,  3.24,  3.24,  2.56,  1.96,  1.44,
        1.  ,  0.64,  0.36,  0.16,  0.04,  0.  ,  0.04,  0.16,  0.36,
        0.64,  1.  ,  1.44,  1.96,  2.56,  3.24,  4.  ,  2.56,  1.96,
        1.44,  1.  ,  0.64,  0.36,  0.16,  0.04,  0.  ,  0.04,  0.16,
        0.36,  0.64,  1.  ,  1.44,  1.96,  2.56,  3.24,  4.  ,  4.84,
        1.96,  1.44,  1.  ,  0.64,  0.36,  0.16,  0.04,  0.  ,  0.04,
        0.16,  0.36,  0.64,  1.  ,  1.44,  1.96,  2.56,  3.24,  4.  ,
        4.84,  5.76,  1.44,  1.  ,  0.64,  0.36,  0.16,  0.04,  0.  ,
        0.04,  0.16,  0.36,  0.64,  1.  ,  1.44,  1.96,  2.56,  3.24,
        4.  ,  4.84,  5.76,  6.76])
Number of features in original data: 2
Number of features in kernel-transformed data: 400
Number of examples in both original and kernel-transformed data: 400 400

The output shows that after applying the kernel, we went from having 2 features to having 400 features. This is because kernels redescribe the data in terms of the relationship between each example and every other example. The 400 features for the first example describe how the first example relates to every other example.

This might seem a bit abstract, so let’s visualize it. We’ll feed the kernelized data to a prediction model and then graph those predictions over the original data. This will help us understand the transformation that the kernel has performed on our data.

# Import necessary modules
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

# Create a figure and an axes object
fig, ax = plt.subplots(figsize=(4,3))

# Fit the Logistic Regression model on the kernel-transformed data and make predictions
kernel_preds = LogisticRegression().fit(kernel_transformed_data, target).predict(kernel_transformed_data)

# Plot the predictions
ax.scatter(xs, ys, c=np.where(kernel_preds, 'r', 'b'), marker='.')
ax.set_aspect('equal')

# Set the title of the plot
ax.set_title("Logistic Regression with Kernel-Transformed Data")

# Set the labels of the axes
ax.set_xlabel("X")
ax.set_ylabel("Y")

# Display the plot
plt.show()

One thing to note is that if you’re reminded of covariances between pairs of features, you’re on the right track. However, while covariances are about relationships between features (columns), kernels are about relationships between examples (rows).
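
A quick way to see the difference is to compare shapes on the circle data from above: a covariance matrix has one row and column per feature, while a kernel matrix has one row and column per example. A minimal sketch:

# Import what the sketch needs
import numpy as np
from sklearn.metrics import pairwise

# Covariance relates columns (features) to each other: 2 features give a 2 x 2 matrix
print("Covariance matrix shape:", np.cov(data, rowvar=False).shape)

# A kernel relates rows (examples) to each other: 400 examples give a 400 x 400 matrix
print("Kernel matrix shape:", pairwise.polynomial_kernel(data, data, degree=2).shape)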

Applying Kernels in a Learning Pipeline

In the previous section, we saw how applying a kernel to our data can transform a non-linear problem into a linear one, allowing a linear model like Logistic Regression to make perfect predictions. However, manually applying kernels can be a bit clunky.

To streamline this process, we can use a scikit-learn Transformer. Transformers allow us to incorporate our kernel application into a learning pipeline, automating the process and making our code cleaner and more efficient.

Using Transformers for Feature Engineering

In scikit-learn, we can define feature transformations using either FunctionTransformer for freestanding functions or a class that inherits from TransformerMixin. If the transformation is completely self-contained, like taking the logarithm of the absolute value of the data, then we don’t need to add the complexity of a class-based method.

Let’s start by recreating the initial area features of the Iris dataset in a clean DataFrame:

# Import necessary modules
from IPython.display import display
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Create a DataFrame with the features data
iris_features_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add a new column to the DataFrame for the species
iris_features_df['species'] = iris.target_names[iris.target]

# Create a new DataFrame for the area of the sepals and petals
iris_area_df = pd.DataFrame({
    "sepal_area": iris_features_df['sepal length (cm)'] * iris_features_df['sepal width (cm)'],
    "petal_area": iris_features_df['petal length (cm)'] * iris_features_df['petal width (cm)']
})

# Display the first few rows of the DataFrames
print("First few rows of the iris features DataFrame:")
display(iris_features_df.head())

print("\nFirst few rows of the iris area DataFrame:")
display(iris_area_df.head())
First few rows of the iris features DataFrame:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm) species
0                5.1               3.5                1.4               0.2  setosa
1                4.9               3.0                1.4               0.2  setosa
2                4.7               3.2                1.3               0.2  setosa
3                4.6               3.1                1.5               0.2  setosa
4                5.0               3.6                1.4               0.2  setosa
First few rows of the iris area DataFrame:

   sepal_area  petal_area
0       17.85        0.28
1       14.70        0.28
2       15.04        0.26
3       14.26        0.30
4       18.00        0.28

Now, if we simply want comparisons with the median values over the whole dataset, we can make a transformer as quickly as:

# Import necessary module
from sklearn.preprocessing import FunctionTransformer

# Define a function that checks if each value is greater than the median of the data it is given
def is_greater_than_median(value):
    return value > np.median(value)

# Create a transformer using the function
greater_than_median_transformer = FunctionTransformer(is_greater_than_median)

# Apply the transformer to the iris area DataFrame
transformed_area_df = greater_than_median_transformer.fit_transform(iris_area_df)

# Convert the result to a DataFrame for better visualization
transformed_area_df = pd.DataFrame(transformed_area_df, columns=iris_area_df.columns)

# Display whether the areas of the first, 51st, and 101st examples are greater than the median
print("Are the areas of the first, 51st, and 101st examples greater than the median?")
display(transformed_area_df.iloc[[0,50,100]])
Are the areas of the first, 51st, and 101st examples greater than the median?

sepal_area petal_area
0 True False
50 True False
100 True True

If we want to learn the medians on the training data and then apply our discretization based on the learned medians to the testing data, we need a little more support. We have to compute the medians on the training data and then apply those medians as a threshold to either the training or testing data.

Custom Transformer: MedianBigSmallTransformer

In this section, we create a custom transformer class called MedianBigSmallTransformer. This class inherits from the TransformerMixin class in scikit-learn, which provides a default implementation of the fit_transform method.

The MedianBigSmallTransformer is designed to transform a dataset by replacing each value with a boolean indicating whether it is greater than the median computed from the training data.

Here’s a breakdown of the class:

  • The __init__ method is the constructor. It doesn’t need to do anything in this case, so it just passes.

  • The fit method calculates the median of the input data features and stores the result in the medians attribute. It then returns self to allow method chaining.

  • The transform method replaces each value in the input data features with a boolean indicating whether the value is greater than the stored median. It uses the medians attribute calculated in the fit method for this.

Here’s the code for the MedianBigSmallTransformer:

# Import necessary module
from sklearn.base import TransformerMixin

# Define a custom transformer class
class MedianBigSmallTransformer(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, features, target=None):
        # Compute and store the median of the training features during fitting
        self.medians = np.median(features)
        return self

    def transform(self, features, target=None):
        # During transformation, return whether each value is greater than the stored median
        return features > self.medians

# Create an instance of the custom transformer
median_big_small_transformer = MedianBigSmallTransformer()

# Fit the transformer to the iris area DataFrame and transform the data
transformed_area_df = median_big_small_transformer.fit_transform(iris_area_df)

# Convert the result to a DataFrame for better visualization
transformed_area_df = pd.DataFrame(transformed_area_df, columns=iris_area_df.columns)

# Display whether the areas of the first, 51st, and 101st examples are greater than the median
print("Are the areas of the first, 51st, and 101st examples greater than the median?")
display(transformed_area_df.iloc[[0,50,100]])
Are the areas of the first, 51st, and 101st examples greater than the median?

sepal_area petal_area
0 True False
50 True False
100 True True

This transformer can be used in a scikit-learn pipeline to automate the process of transforming the data. It’s a good example of how you can create custom transformers to encapsulate your own data preprocessing steps. The usage pattern is the same as for a built-in transformer and quite similar to that of a standard learning model. Since we use train_test_split, the rows shown below are randomly selected examples.

# Import necessary module
from sklearn.model_selection import train_test_split

# Split the area DataFrame into training and testing sets
training_set, testing_set = train_test_split(iris_area_df)

# Create an instance of the custom transformer
median_big_small_transformer = MedianBigSmallTransformer()

# Fit the transformer to the training set and transform the training and testing sets
transformed_training_set = median_big_small_transformer.fit_transform(training_set)
transformed_testing_set = median_big_small_transformer.transform(testing_set)

# Display the first three rows of the transformed training and testing sets
print('Transformed training set:')
display(transformed_training_set.head(3))
print('Transformed testing set:')
display(transformed_testing_set.head(3))
Transformed training set:

sepal_area petal_area
80 False False
54 True False
95 True False
Transformed testing set:

sepal_area petal_area
145 True False
42 True False
45 True False

By using a transformer in a pipeline, we can easily apply kernels to our data and use the transformed data in our machine learning models. This makes our code cleaner, more efficient, and easier to understand. Here’s how we can do this:

# Import necessary modules
from sklearn.base import TransformerMixin
from sklearn.metrics.pairwise import pairwise_kernels

# Define a custom transformer class that applies a polynomial kernel transformation
class PolynomialKernelTransformer(TransformerMixin):
    def __init__(self, degree=2):
        # Initialize the transformer with the degree of the polynomial kernel
        self.degree = degree

    def fit(self, X, y=None):
        # Store the input data to use later in the transformation
        self.X_fitted = X
        return self

    def transform(self, X, y=None):
        # Apply the polynomial kernel transformation against the fitted data
        polynomial_kernel_transformed_data = pairwise_kernels(
            X, self.X_fitted, metric='poly', degree=self.degree)
        return polynomial_kernel_transformed_data

This PolynomialKernelTransformer class applies a polynomial kernel to the data, transforming it into a higher-dimensional space.

Kernel Approximation with Nystroem and Custom Polynomial Kernel

In this section, we explore the use of kernel methods to transform our data before feeding it into a logistic regression model. Specifically, we use the Nystroem method for kernel approximation and a custom polynomial kernel.

The Nystroem method is a technique for efficiently approximating a kernel map, which can be useful when dealing with large datasets. It works by selecting a subset of the data and computing the kernel map based on this subset, which can significantly reduce the computational cost.

The custom polynomial kernel is a simple kernel that applies a polynomial transformation to the data. In this case, we use a degree-2 polynomial kernel, which corresponds to working with the squares and pairwise products of the original features.

We create two pipelines, each consisting of a kernel transformer followed by a logistic regression model. The first pipeline uses the Nystroem method, while the second uses the custom polynomial kernel. We then compare the performance of these two pipelines.

Here’s the code for this section:

# Import necessary modules
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Define the Nystroem kernel transformer. This will transform the input data using a polynomial kernel.
nystroem_transformer = Nystroem(kernel='polynomial', degree=2, n_components=6)

# Define the logistic regression model. This will be used to make predictions based on the transformed data.
logistic_regression_model = LogisticRegression()

# Create two pipelines. The first pipeline consists of the Nystroem transformer followed by the logistic regression model.
# The second pipeline consists of the custom PolynomialKernelTransformer followed by the logistic regression model.
pipeline1 = make_pipeline(nystroem_transformer, logistic_regression_model)
pipeline2 = make_pipeline(PolynomialKernelTransformer(degree=2), logistic_regression_model)

# We will compare the performance of the two pipelines.
# For each pipeline, we plot the decision boundary and set the title of the plot to the name of the model and the type of kernel used.
pipeline_comparison = [(pipeline1, 'Nystroem'), (pipeline2, 'PolyKernel')]
fig, axes = plt.subplots(1, 2, figsize=(6, 3), sharey=True)

# Set a main title for the figure
fig.suptitle('Comparison of Decision Boundaries with Different Kernel Transformers')

for (model, kernel_name), ax in zip(pipeline_comparison, axes):
    plot_decision_boundary(ax, data, target, model, [0, 1])
    ax.set_title("{} using {}".format(get_model_name(model), kernel_name))
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    ax.set_aspect('equal')

# Add a figtext at the bottom
plt.figtext(0.5, -0.11,
            "Decision boundaries for logistic regression models using Nystroem and Polynomial Kernel transformers.",
            ha="center", fontsize=12, bbox={"facecolor": "orange", "alpha": 0.5, "pad": 5})

plt.tight_layout()
plt.subplots_adjust(top=0.79)  # Adjust subplot parameters to give the title enough space
plt.show()

This section demonstrates how kernel methods can be used to transform the data in a way that makes it more suitable for linear methods. It also shows how to use scikit-learn’s pipeline functionality to chain together a kernel transformer and a learning model.

Exploring Kernel Methods: Support Vector Machines vs. Logistic Regression

In this section, we delve into the world of kernel methods, specifically focusing on the Support Vector Machine (SVM), a popular machine learning model used for classification and regression analysis. We will compare the performance of an SVM with a polynomial kernel to a logistic regression model that uses a manually constructed polynomial kernel.

Understanding Support Vector Machines

A Support Vector Machine (SVM) operates on a simple yet powerful idea: it attempts to find a hyperplane (a line in 2D, a plane in 3D, or a higher-dimensional equivalent in more dimensions) that best separates the data into different classes. In the case of a linear SVM, the model seeks the hyperplane that maximizes the margin between the closest points (the support vectors) of different classes. This is why it’s called a “Support Vector” Machine - the decision boundary is entirely determined by these support vectors.

The Power of Kernels

However, not all data is linearly separable. This is where kernels come into play. A kernel is a function that transforms the input data into a higher-dimensional space where it becomes easier to separate the data. This is known as the “kernel trick”. It allows us to construct complex decision boundaries even in cases where the data is not linearly separable in its original space.

Comparing SVM and Logistic Regression with Polynomial Kernels

In our code, we’re creating two models: an SVM with a polynomial kernel (svm_with_poly_kernel) and a logistic regression model that uses a manually constructed polynomial kernel (logreg_with_manual_kernel).

The svm_with_poly_kernel model uses a polynomial kernel of degree 2. This means that it’s transforming the input data into a higher-dimensional space where the relationship between the data points is equivalent to a polynomial of degree 2.

On the other hand, the logreg_with_manual_kernel model is a pipeline that first applies a manual polynomial kernel transformation to the data (using the PolynomialKernelTransformer), and then applies a logistic regression model to the transformed data.

Visualizing Decision Boundaries

The plot_decision_boundary function is then used to visualize the decision boundaries of these two models. The decision boundary is the line (or plane, or hyperplane) that the model uses to separate the data into different classes.

In summary, both models are using a polynomial transformation to make the data easier to separate, but they’re doing it in slightly different ways. The svm_with_poly_kernel model is using the kernel trick to implicitly transform the data, while the logreg_with_manual_kernel model is using an explicit transformation step.

# Import necessary libraries
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Create a pipeline that first applies a polynomial kernel transformation to the data,
# then fits a logistic regression model on the transformed data.
logreg_with_manual_kernel = make_pipeline(PolynomialKernelTransformer(degree=2),
                                          LogisticRegression())

# Define a Support Vector Classifier (SVC) with a polynomial kernel
svm_with_poly_kernel = SVC(kernel='poly', degree=2)

# Define a list of models to compare. In this case, we're comparing the SVC with a polynomial kernel
# against our custom logistic regression model that uses a polynomial kernel.
models_to_compare = [svm_with_poly_kernel, logreg_with_manual_kernel]

# Create a figure with two subplots, which will be used to visualize the decision boundaries of the two models.
fig, axes = plt.subplots(1, 2, figsize=(6, 3), sharey=True)

# Define a list of title descriptions for each model
title_descriptions = ['SVM with Polynomial Kernel', 'Logistic Regression with\nManual Polynomial Kernel']

# For each model in our list, plot the decision boundary on one of the subplots.
# The decision boundary is the line (or hyperplane in higher dimensions) that the model uses to separate the classes.
for model, title, ax in zip(models_to_compare, title_descriptions, axes):
    plot_decision_boundary(ax, data, target, model, [0, 1])
    ax.set_title(title)

# Add a super title for the entire figure
plt.suptitle('Comparison of Decision Boundaries for Different Models', fontsize=16)

# Add a figure text to explain the plots
plt.figtext(0.5, -0.24,
            "These plots compare the decision boundaries created by a Support Vector Machine (SVM) with a polynomial kernel "
            "and a Logistic Regression model with a manual polynomial kernel. The decision boundary is the line that the "
            "model uses to separate the classes.",
            wrap=True, ha="center", fontsize=12, bbox={"facecolor": "orange", "alpha": 0.5, "pad": 5})

# Adjust the layout so everything fits nicely.
plt.tight_layout(rect=[0, 0, 1, 0.99])  # to ensure the suptitle fits into the figure
plt.show()

In this comparison, we have implemented a true kernel method using SVC(kernel='poly'). This method internally uses a kernel to transform the raw data, operating as if we were manually constructing features. However, the advantage of this approach is that we don’t have to explicitly construct all the features. Through the kernel trick, we can replace the explicit construction of features and comparison of examples in the larger feature space with the computation of distances between objects using the kernel. This makes the process more efficient and manageable, especially when dealing with high-dimensional data.

Kernel Functions and Kernel Matrices

The kernel shows up in two forms: a kernel function and a kernel matrix. The kernel function might seem abstract, but the kernel matrix is simply a table of example similarities. Given two examples, we can determine how similar they are by looking up a value in the kernel matrix.
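
As a small illustration using the circle data and the polynomial kernel from earlier, the entry at row i, column j of the kernel matrix is simply the similarity between example i and example j:

from sklearn.metrics import pairwise

# Build the kernel matrix: one row and one column per example
kernel_matrix = pairwise.polynomial_kernel(data, data, degree=2)

# The similarity between two examples is a table lookup
print("Similarity between examples 0 and 10:", kernel_matrix[0, 10])
print("Similarity of example 0 with itself:", kernel_matrix[0, 0])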

Comparing Examples

Traditionally, we represent examples by a set of features and then compare examples based on the similarity or distance between those feature values. Kernels take a different approach: they compare examples by directly measuring similarity between pairs of examples. This approach is more general: anything our traditional representation can do, kernels can do as well.

Pairwise Relationships

Kernel methods rely on pairwise relationships between examples. Just like all-Nearest-Neighbors methods, which include information from every example, kernel methods rely on these pairwise connections. It’s important to note that kernels measure similarity, which increases as examples become more alike, while distance measures decrease as examples become more alike.
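
For example, with the Gaussian kernel discussed later, $K(\mathbf{E}, \mathbf{F}) = \exp(-\gamma\, d(\mathbf{E}, \mathbf{F})^2)$ where $d$ is Euclidean distance: identical examples have similarity 1, and the similarity falls toward 0 as the distance between them grows.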

Flexibility of Kernels

One of the major advantages of the kernel approach is its flexibility. Since kernels rely on pairwise relationships between examples, we don’t need to have our initial data in a table of rows and columns to use a kernel. All we need are two individuals that we can identify and look up in the kernel table. This means we can directly compute the similarity relationship between examples and work with those similarity values, which can be significantly easier than turning examples into sets of features.

Applying Pre-existing Knowledge

Kernels allow us to directly apply pre-existing background knowledge about the similarities between the data examples. This knowledge can be applied to arbitrary objects, not just examples in a table. Furthermore, kernels can be applied to many learning methods, including various forms of regression, nearest-neighbors, and principal components analysis (PCA).

In summary, kernels are a powerful tool in machine learning that allow us to apply pre-existing knowledge about the similarities between data examples. They provide a flexible and general approach to comparing examples, and can be applied to many different learning methods.
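
As one concrete example, scikit-learn ships kernelized versions of several of these methods. Here is a minimal sketch applying kernel PCA to the circle data from earlier; the kernel and gamma values are purely illustrative:

from sklearn.decomposition import KernelPCA

# Project the circle data onto two components using an RBF kernel
kernel_pca = KernelPCA(n_components=2, kernel='rbf', gamma=1.0)
data_kpca = kernel_pca.fit_transform(data)
print("Shape of the kernel PCA projection:", data_kpca.shape)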

Common Kernel Examples

Kernel methods are a powerful tool in machine learning, allowing us to implicitly perform computations in a higher-dimensional space without actually having to compute the coordinates in that space. This is made possible by a mathematical theorem known as Mercer’s theorem, which guarantees that for any valid kernel, there is an equivalent dot product being calculated in some (possibly very high-dimensional) feature space.

Valid Kernels and Positive Semi-Definite (PSD) Matrices

A valid kernel is one that corresponds to a dot product in some feature space. Mathematically, a kernel is valid if every kernel matrix it produces is positive semi-definite (PSD).

Positive Semi-Definite Matrices

A matrix is said to be PSD if, for any vector $\mathbf{z}$, the quantity $\mathbf{z}^T K \mathbf{z}$ is non-negative. This might seem a bit abstract, but you can think of PSD matrices as “rotators” that change the direction of vectors without flipping them completely around. In other words, a PSD matrix will never rotate a vector more than 90 degrees.
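A quick way to see this property numerically is to check the eigenvalues of a kernel matrix: for a valid kernel they are all non-negative, up to floating-point error. The sketch below uses randomly generated data purely for illustration.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))       # five hypothetical 3-feature examples
K = rbf_kernel(X, gamma=1.0)      # a valid kernel, so K should be PSD

# All eigenvalues of a PSD matrix are non-negative (allow a small numerical tolerance)
eigenvalues = np.linalg.eigvalsh(K)
print(np.all(eigenvalues >= -1e-10))  # True

# Direct check of the definition with a random vector z: z^T K z >= 0
z = rng.normal(size=5)
print(z @ K @ z >= -1e-10)            # True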

Polynomial Kernels

One common type of kernel is the polynomial kernel, which we have already seen in action.

Definition of Polynomial Kernels

The polynomial kernel of degree $p$ is defined as $K(\mathbf{E}, \mathbf{F}) = (\mathbf{E} \cdot \mathbf{F})^p$ or $K(\mathbf{E}, \mathbf{F}) = (\mathbf{E} \cdot \mathbf{F} + 1)^p$. The axes in the high-dimensional space correspond to the terms from the $p$-th degree polynomial in the features. For example, for a degree-2 polynomial kernel, the axes might correspond to the squares and cross-products of the original features.
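To see that the kernel really does compute a dot product in the larger feature space, consider the degree-2 kernel $K(\mathbf{E}, \mathbf{F}) = (\mathbf{E} \cdot \mathbf{F})^2$ on two-feature inputs, whose explicit feature map is $(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The sketch below checks this equality on made-up vectors.

import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D vector."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

E = np.array([1.0, 3.0])
F = np.array([2.0, 0.5])

kernel_value = (E @ F) ** 2          # kernel trick: stay in the original 2-D space
explicit_value = phi(E) @ phi(F)     # explicit construction in the 3-D feature space

print(kernel_value, explicit_value)  # both print 12.25 -- the two computations agree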

Gaussian (RBF) Kernels

Another common type of kernel is the Gaussian kernel, also known as the Radial Basis Function (RBF) kernel.

Definition of Gaussian Kernels

The Gaussian kernel is defined as $K(\mathbf{E}, \mathbf{F}) = \exp\left(-\frac{\|\mathbf{E} - \mathbf{F}\|^2}{2\sigma^2}\right)$, where $\|\mathbf{E} - \mathbf{F}\|^2$ is the squared Euclidean distance between $\mathbf{E}$ and $\mathbf{F}$, and $\sigma$ is a parameter that controls the width of the Gaussian. The Gaussian kernel corresponds to projecting our data into an infinite-dimensional space, which might sound mind-boggling, but the practical consequence is simpler: the similarity between examples drops off very quickly as they move apart, much as the mass of a Gaussian distribution is concentrated around its mean.
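Here is a minimal sketch that computes the Gaussian kernel directly from this definition and compares it with scikit-learn's rbf_kernel, which is parameterized by gamma = 1 / (2 * sigma^2). The example vectors and the sigma value are illustrative assumptions.

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

E = np.array([1.0, 2.0])
F = np.array([2.5, 0.5])
sigma = 1.0

# Kernel value straight from the definition
squared_distance = np.sum((E - F) ** 2)
k_manual = np.exp(-squared_distance / (2 * sigma**2))

# Same value via sklearn, translating sigma into gamma
gamma = 1.0 / (2 * sigma**2)
k_sklearn = rbf_kernel(E.reshape(1, -1), F.reshape(1, -1), gamma=gamma)[0, 0]

print(k_manual, k_sklearn)  # the two values agree; similarity shrinks fast as distance grows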

Exploring Different Kernels in Support Vector Machines

In this section, we delve into the application of Support Vector Machines (SVMs) with different kernels. Kernels are functions used in SVMs to transform the input data, enabling complex decision boundaries even when the data is not linearly separable in the original space.

We create SVMs for different kernels including linear, polynomial of degree 2 and 3, and Gaussian (also known as Radial Basis Function or RBF). The gamma parameter in the RBF kernel controls the influence of individual training examples, with a smaller gamma meaning that the influence of examples reaches further.

Here is the Python code to create these SVMs:

from sklearn.svm import SVC, LinearSVC
import matplotlib.pyplot as plt

# Define different SVM classifiers with various kernels
svm_classifiers = {
    "LinearSVC": LinearSVC(max_iter=5000),
    "SVC(linear)": SVC(kernel='linear'),
    "SVC(poly=1)": SVC(kernel='poly', degree=1),
    "SVC(poly=2)": SVC(kernel='poly', degree=2),
    "SVC(poly=3)": SVC(kernel='poly', degree=3),
    "SVC(rbf,.5)": SVC(kernel='rbf', gamma=0.5),
    "SVC(rbf,1.0)": SVC(kernel='rbf', gamma=1.0),
    "SVC(rbf,2.0)": SVC(kernel='rbf', gamma=2.0)
}

# Print the number of classifiers
print(f"Number of classifiers: {len(svm_classifiers)}")

# Plot the decision boundaries for each SVM classifier
fig, axes = plt.subplots(4, 2, figsize=(8, 8), sharex=True, sharey=True)

# Add a title for the entire figure
fig.suptitle('Decision Boundaries of SVM Classifiers with Different Kernels', fontsize=16)

for ax, (name, model) in zip(axes.flat, svm_classifiers.items()):
    # Print the name of the current classifier
    print(f"Plotting decision boundaries for {name}")

    # Plot the decision boundary on the first two iris features
    plot_decision_boundary(ax, iris.data, iris.target, model, [0, 1])
    ax.set_title(name)

# Adjust the layout for better visualization; the rect argument keeps the suptitle inside the figure
plt.tight_layout(rect=[0, 0, 1, 0.99])
Number of classifiers: 8
Plotting decision boundaries for LinearSVC
Plotting decision boundaries for SVC(linear)
Plotting decision boundaries for SVC(poly=1)
Plotting decision boundaries for SVC(poly=2)
Plotting decision boundaries for SVC(poly=3)
Plotting decision boundaries for SVC(rbf,.5)
Plotting decision boundaries for SVC(rbf,1.0)
Plotting decision boundaries for SVC(rbf,2.0)

In the resulting plots, we can observe how the decision boundaries differ depending on the kernel used. The linear options (LinearSVC, SVC(linear), and SVC(poly=1)) all produce boundaries made of straight lines, while the polynomial and RBF kernels allow more complex, curved boundaries. The SVC(poly=3) model shows pronounced curvature in its boundaries, and the RBF models become increasingly curved and wavy as gamma increases.

This demonstrates the flexibility of SVMs in handling different data distributions and the importance of choosing the right kernel and parameters for your specific task.

Practical Advice for Using Support Vector Machines

Support Vector Machines (SVMs) are powerful tools for machine learning, but they come with several options that can be daunting for beginners. Here, we provide some practical advice for using SVMs effectively.

  1. Scale the data: SVMs are not scale-invariant, meaning that they can be sensitive to the range of the data. Therefore, it’s often a good idea to scale your data before training an SVM. This can be done using tools like StandardScaler or MinMaxScaler in sklearn.

  2. Use an RBF kernel: The Radial Basis Function (RBF) kernel is a good default choice for many problems. It can handle non-linear relationships and doesn’t require you to manually specify the transformation of the data.

  3. Pick kernel parameters using cross-validation: The performance of an SVM can be sensitive to the choice of kernel parameters. Cross-validation is a robust method for selecting these parameters. In sklearn, you can use GridSearchCV to perform cross-validation over a grid of parameter values (a short sketch combining tips 1-3 appears after this list).

  4. Consider the size of your dataset: If the number of features is larger than the number of observations, consider using a linear kernel. If the number of observations is larger than the number of features, an RBF kernel may be a better choice. If there are many observations (e.g., more than 50,000), a linear kernel may be faster.
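Before the fuller walkthrough that follows, here is a minimal sketch combining tips 1-3 in a single scikit-learn Pipeline; the parameter grid values are illustrative assumptions rather than recommended settings.

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()

pipeline = Pipeline([
    ("scale", StandardScaler()),   # tip 1: scale the data
    ("svc", SVC(kernel="rbf")),    # tip 2: RBF kernel as a sensible default
])

# tip 3: pick kernel parameters by cross-validation (grid values are illustrative)
param_grid = {"svc__C": [0.5, 1.0, 2.0], "svc__gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
search.fit(digits.data, digits.target)

print(search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")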

Here’s an example of how to apply these tips in practice, using the digits dataset:

Exploring the Digits Dataset with Various Classifiers: A Business Perspective

The digits dataset in sklearn is a collection of 8x8 images of handwritten digits. Each image is represented as a 64-dimensional vector, where each dimension corresponds to a pixel in the image. The task is to predict the digit (0-9) depicted in each image.

Business Problem: Automating Postal and Banking Services

In the context of our business problem, this dataset serves as a simplified representation of the challenges faced by postal services and banks. For instance, a postal service may need to automatically sort mail based on handwritten zip codes, while a bank may need to process handwritten checks. Both tasks involve recognizing handwritten digits, which is a classic multi-class classification problem.

By developing a machine learning model that can accurately recognize the digits in this dataset, we can provide a proof of concept for a system that could automate these tasks, leading to increased efficiency and accuracy in postal and banking services.

Loading the Digits Dataset

Here’s an example of how to load the digits dataset:

from sklearn import datasets
import matplotlib.pyplot as plt

# Load the digits dataset
digits = datasets.load_digits()

# Display the first digit
print("Shape:", digits.images[0].shape)
plt.figure(figsize=(3, 3))
plt.imshow(digits.images[0], cmap='gray')
plt.show()
Shape: (8, 8)

In the following sections, we will explore various feature selection techniques, feature construction methods, and classifiers to build a robust model for digit recognition. We will then evaluate these models based on their performance on the digits dataset, providing insights into which methods are most effective for this type of task.

Let’s print an example of each digit.

import numpy as np

# Create a figure with 10 subplots (one for each digit)
fig, axes = plt.subplots(2, 5, figsize=(10, 5))

# Flatten the 2D array of axes into 1D for easier iteration
axes = axes.flatten()

# For each digit from 0 to 9
for digit in range(10):
    # Find the index of the first occurrence of this digit in the target array
    index = np.where(digits.target == digit)[0][0]

    # Plot the corresponding image on the subplot
    axes[digit].imshow(digits.images[index], cmap='gray')
    axes[digit].set_title(f'Digit: {digit}')

# Remove the axis ticks
for ax in axes:
    ax.set_xticks([])
    ax.set_yticks([])

# Adjust the layout so everything fits nicely
plt.tight_layout()
plt.show()

Let’s print several examples in each of our 10 categories.

# Create a figure with 100 subplots (10 for each digit)
fig, axes = plt.subplots(10, 10, figsize=(10, 10))

# For each digit from 0 to 9
for digit in range(10):
    # Find the indices of the first 10 occurrences of this digit in the target array
    indices = np.where(digits.target == digit)[0][:10]

    # Plot the corresponding images on the subplots
    for i, index in enumerate(indices):
        axes[digit, i].imshow(digits.images[index], cmap='gray')
        # Label only the first column to avoid clutter
        if i == 0:
            axes[digit, i].set_ylabel(f'Digit: {digit}')

# Remove the axis ticks
for ax in axes.flatten():
    ax.set_xticks([])
    ax.set_yticks([])

# Adjust the layout so everything fits nicely
plt.tight_layout()
plt.show()

We can compare the performance of various classifiers on this dataset. Here’s an example of how to do this. The following code first loads the digits dataset and defines a grid of parameter values to try. It then creates an SVC model with an RBF kernel and uses GridSearchCV to find the best parameters for this model. Finally, it evaluates the model with the best parameters using cross-validation and prints the results.

from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
import numpy as np

# Load the digits dataset
digits = load_digits()

# Define a parameter grid for cross-validation
param_grid = {"gamma": np.logspace(-10, 1, 11, base=2), "C": [0.5, 1.0, 2.0]}

# Create an SVC model with an RBF kernel
svc = SVC(kernel='rbf')

# Use GridSearchCV to find the best parameters
grid_model = GridSearchCV(svc, param_grid=param_grid, cv=10)
grid_model.fit(digits.data, digits.target)

# Print the best parameters
print("Best parameters:", grid_model.best_params_)

# Evaluate the model with the best parameters using cross-validation
my_svc = SVC(kernel='rbf', **grid_model.best_params_)
scores = cross_val_score(my_svc, digits.data, digits.target, cv=10, scoring='f1_macro')

print("Cross-validation scores:", scores)
print("Mean cross-validation score: {:.3f}".format(scores.mean()))
Best parameters: {'C': 2.0, 'gamma': 0.0009765625}
Cross-validation scores: [0.97 1.   0.95 0.98 0.99 0.99 0.99 0.99 0.98 0.97]
Mean cross-validation score: 0.981

Now we compare the following classifiers, including our SVC with an RBF kernel using the parameters found above: C = 2.0 and gamma = $2^{-10}$ = 0.0009765625.

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis, LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Define a set of classifiers
classifiers_dict = {
    'Logistic Regression (1)': LogisticRegression(max_iter=1000),
    'Logistic Regression (2)': SGDClassifier(loss='log_loss', max_iter=1000),
    'Quadratic Discriminant Analysis': QuadraticDiscriminantAnalysis(),
    'Linear Discriminant Analysis': LinearDiscriminantAnalysis(),
    'Gaussian Naive Bayes': GaussianNB(),
    'SVC (Linear Kernel)': SVC(kernel="linear"),
    'Linear SVC': LinearSVC(),
    'Decision Tree': DecisionTreeClassifier(),
    '5 Nearest Neighbors': KNeighborsClassifier(),
    '10 Nearest Neighbors': KNeighborsClassifier(n_neighbors=10),
    'SVC (RBF Kernel)': my_svc  # The SVC tuned with GridSearchCV above
}

# Define a baseline classifier that guesses classes uniformly at random
baseline_classifier = DummyClassifier(strategy="uniform")

# Evaluate the baseline classifier
baseline_scores = cross_val_score(baseline_classifier, digits.data, digits.target,
                                  cv=10, scoring='f1_macro', n_jobs=-1)
mean_scores_dict = {'Baseline': baseline_scores.mean()}

# Evaluate each classifier using cross-validation and plot the per-fold scores
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(baseline_scores, label=f"Baseline ({baseline_scores.mean():.3f})", marker='o')

for classifier_name, classifier_model in classifiers_dict.items():
    classifier_scores = cross_val_score(classifier_model, digits.data, digits.target,
                                        cv=10, scoring='f1_macro', n_jobs=-1)
    mean_scores_dict[classifier_name] = classifier_scores.mean()
    ax.plot(classifier_scores, label=f"{classifier_name} ({classifier_scores.mean():.3f})", marker='o')

# Order the legend by mean score in descending order
handles, labels = ax.get_legend_handles_labels()
labels, handles = zip(*sorted(zip(labels, handles),
                              key=lambda t: mean_scores_dict[t[0].rsplit(' (', 1)[0]],
                              reverse=True))

ax.set_ylim(0.0, 1.1)
ax.set_xlabel('Fold')
ax.set_ylabel('Macro F1 score')
ax.legend(handles, labels, loc='lower center', bbox_to_anchor=(0.5, 1.05), ncol=2)

plt.tight_layout()
plt.show()

Remember, while SVMs can provide excellent performance, they also come with computational costs. Balancing performance gains with resource demands requires understanding the costs of mistakes and the available computing hardware.

Conclusion

In this exploration, we delved deep into the realm of feature selection, feature construction, and model comparison. We started with the basics of feature selection, discussing the importance of variance and covariance in understanding our data and selecting the most relevant features. We explored different types of Discriminant Analysis, including Quadratic Discriminant Analysis (QDA), Linear Discriminant Analysis (LDA), Gaussian Naive Bayes (GNB), and Diagonal Linear Discriminant Analysis (DLDA), and even implemented our own version of DLDA.

We discussed the assumptions and biases of different classifiers, and the concept of linearity and non-linearity in machine learning models. We explored the normalization of covariance and the interpretation of correlation for univariate feature selection. We also discussed mutual information and its role in feature selection.

We then moved on to feature construction, where we introduced the concept of kernels. We demonstrated how kernels can be used to transform our data, allowing us to effectively apply linear models like Logistic Regression and Support Vector Classifier (SVC) to problems with non-linear decision boundaries.

Finally, we compared a variety of classifiers on the digits dataset, including Logistic Regression, Quadratic Discriminant Analysis, Linear Discriminant Analysis, Gaussian Naive Bayes, SVC with different kernels, Decision Tree, and K-Nearest Neighbors with different numbers of neighbors. We found that the SVC with a Radial Basis Function (RBF) kernel performed the best.

The exploration culminated in the application of these techniques to a real-world business problem. The digits dataset represents a multi-class classification problem, where the task is to correctly identify handwritten digits. This is a common problem in many business contexts. For example, a postal service could use such a model to automatically sort mail based on zip codes written on envelopes. Similarly, banks could use it to read handwritten checks or forms.

In such a business context, the accuracy of the model is of utmost importance, as mistakes can lead to misdelivery of mail or incorrect processing of financial transactions. Therefore, based on our exploration, we would recommend using the SVC with an RBF kernel for this task, as it provided the best performance on the digits dataset.

This implementation would involve training the SVC model with the RBF kernel on a large dataset of handwritten digits, and then using this trained model to predict the digits written on envelopes or checks. The model could be integrated into the mail sorting or check processing systems, allowing for automatic and accurate recognition of handwritten digits.

However, it’s important to note that the best model can vary depending on the specific characteristics of the data and the business problem at hand. Therefore, it’s always a good idea to experiment with different models and feature selection and construction techniques, as we did in this exploration.