Enhancing Customer Support with Text Classification: A Python Case Study Using the 20 Newsgroups Dataset

Utilizing Advanced Python Tools for In-depth Text Data Analysis

In the world of technology and customer support, we often find ourselves swimming in an ocean of forum posts covering everything from hardware troubleshooting to the latest software developments. As a tech enthusiast, these forums are an exhilarating treasure trove of new information. As a customer support representative, they’re a space to address user queries about specific services. In both scenarios, having posts automatically sorted into relevant categories would be incredibly helpful. This is the problem we tackle in this project.

Data drives businesses across sectors in today’s digital age. For tech companies offering diverse services, user-generated content, particularly in online discussion forums, can be rich with insights. However, gleaning these insights from a sea of unstructured data poses its challenges. Enter Machine Learning, specifically text classification, our key to unlocking this potential.

We’ll delve into an exciting use case in this blog post: automatically categorizing forum posts into distinct topics. We’ve chosen to work with the 20 Newsgroups dataset, comprising around 20,000 newsgroup documents evenly distributed across 20 different categories or newsgroups. Each newsgroup can represent a unique service provided by a hypothetical tech company.

To mirror real-world applications, we have also developed a flexible framework allowing users to select their training and test dataset categories based on a prompt. We will demonstrate how to tackle a multiclass classification problem using three selected newsgroups: comp.graphics, rec.motorcycles, and talk.politics.guns as examples.

Effective classification can revolutionize the forum user experience by enabling users to find relevant information faster and allowing representatives to promptly respond to service-specific posts. By identifying common issues or frequently asked questions, companies can elevate their products and services based on user feedback.

But, it’s not enough to stop at overall model accuracy. We recognize that some misclassifications could have larger implications than others. Hence, alongside accuracy, we consider metrics like precision, recall, and F1 score to comprehensively evaluate our model’s performance.
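If these metrics are unfamiliar, here is a minimal sketch (using made-up labels, not our model’s output) of how scikit-learn reports them per class:

# A tiny, hypothetical example just to illustrate per-class precision, recall, and F1.
from sklearn.metrics import classification_report

y_true = ["graphics", "motorcycles", "guns", "graphics", "guns", "motorcycles"]
y_pred = ["graphics", "motorcycles", "graphics", "graphics", "guns", "guns"]

print(classification_report(y_true, y_pred))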

Ready to dive into the world of text classification? Let’s enhance customer support with Machine Learning!

The 20 Newsgroups Dataset

The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, evenly distributed across 20 different newsgroups. It was originally collected by Ken Lang, likely for his paper, “Newsweeder: Learning to filter netnews”. This dataset has become a popular choice for experiments in text applications of machine learning techniques, such as text classification and text clustering.

Organization

The data is organized into 20 different newsgroups, each corresponding to a different topic. Some of the newsgroups are closely related to each other (e.g., comp.sys.ibm.pc.hardware and comp.sys.mac.hardware), while others are highly unrelated (e.g., misc.forsale and soc.religion.christian).

Here is a list of the 20 newsgroups, partitioned according to subject matter:

  • comp.graphics
  • comp.os.ms-windows.misc
  • comp.sys.ibm.pc.hardware
  • comp.sys.mac.hardware
  • comp.windows.x
  • rec.autos
  • rec.motorcycles
  • rec.sport.baseball
  • rec.sport.hockey
  • sci.crypt
  • sci.electronics
  • sci.med
  • sci.space
  • misc.forsale
  • talk.politics.misc
  • talk.politics.guns
  • talk.politics.mideast
  • talk.religion.misc
  • alt.atheism
  • soc.religion.christian

Data

The data are distributed as .tar.gz bundles. Each subdirectory in a bundle represents a newsgroup; each file in a subdirectory is the text of a document that was posted to that newsgroup.

There are three versions of the dataset:

  • 20news-19997.tar.gz - Original 20 Newsgroups data set
  • 20news-bydate.tar.gz - 20 Newsgroups sorted by date; duplicates and some headers removed (18846 documents)
  • 20news-18828.tar.gz - 20 Newsgroups; duplicates removed, only “From” and “Subject” headers (18828 documents)

The “bydate” version is recommended as it is easier for cross-experiment comparison, newsgroup-identifying information has been removed, and it’s more realistic because the train and test sets are separated in time.

Further Information

For more information about this dataset, see the 20 Newsgroups homepage or the scikit-learn documentation for fetch_20newsgroups.

Let’s grab the dataset using scikit-learn.

The following script includes a function called load_and_display_newsgroups_data() which fetches the “20 newsgroups” dataset from Scikit-Learn’s datasets module, provides a summary of the loaded data, and displays a histogram of document counts per category.

The function takes in five parameters:

  • categories: A list of names of categories to load from the dataset. If left as None, all categories will be loaded.
  • subset: Determines if the training or testing subset of the data should be loaded.
  • shuffle: When set to True, the dataset is shuffled.
  • random_state: This sets the seed for the random number generator during shuffling.
  • remove: Strips ‘headers’, ‘footers’, and/or ‘quotes’ from the forum post text.

After defining all available categories in the 20 newsgroups dataset, the function prompts the user to input the indices of their interested categories.

Here, the “20 newsgroups” dataset is fetched with the selected options and configuration, and stored in the variable data.

Next, the function prepares a dataframe for reporting, which contains each category’s name and its respective count (number of documents) in the dataset.

The function then prints out a summary of the dataset: the subset used (training or testing), total number of documents, total number of categories, and the breakdown of document counts for each category.

Following that, a bar plot visualizing the document counts in each category is displayed using seaborn.

Finally, the function returns the loaded dataset.

As an example usage of this function, we are loading and displaying the summary and histogram for both the training and testing subsets of the dataset with default parameters.

The function load_and_display_newsgroups_data() is first called without any argument to grab the training subset (subset="train" by default). It then loads the testing subset by setting subset="test".

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import seaborn as sns
import pandas as pd

def load_and_display_newsgroups_data(
    categories=None,
    subset="train",
    shuffle=True,
    random_state=42,
    remove=("headers", "footers", "quotes"),
):
    """
    Load the 20 newsgroups dataset for the specified categories,
    then display a summary and histogram.

    Parameters:
    - categories: List of category names to load. If None, load all categories.
    - subset: Which subset of the dataset to load: 'train' for the training set,
      'test' for the test set.
    - shuffle: Whether to shuffle the dataset.
    - random_state: Random seed for shuffling the dataset.
    - remove: Tuple of data parts to remove: 'headers', 'footers', 'quotes'.

    Returns:
    - data: The loaded dataset.
    """

    # Define all available categories
    all_categories = [
        "alt.atheism",
        "comp.graphics",
        "comp.os.ms-windows.misc",
        "comp.sys.ibm.pc.hardware",
        "comp.sys.mac.hardware",
        "comp.windows.x",
        "misc.forsale",
        "rec.autos",
        "rec.motorcycles",
        "rec.sport.baseball",
        "rec.sport.hockey",
        "sci.crypt",
        "sci.electronics",
        "sci.med",
        "sci.space",
        "soc.religion.christian",
        "talk.politics.guns",
        "talk.politics.mideast",
        "talk.politics.misc",
        "talk.religion.misc",
    ]

    print("All Available Categories:\n")
    for index, category in enumerate(all_categories, start=1):
        print(f"{index}. {category}")

    indices = input(
        "Please enter the indices of categories you're interested in (separated by spaces): "
    )
    selected_indices = [int(index) - 1 for index in indices.split()]
    categories = [all_categories[index] for index in selected_indices]

    # Load the dataset
    data = fetch_20newsgroups(
        subset=subset,
        categories=categories,
        shuffle=shuffle,
        random_state=random_state,
        remove=remove,
    )

    # Prepare data for reporting
    doc_counts = pd.DataFrame(
        {
            "category": data.target_names,
            "count": [np.sum(data.target == i) for i in range(len(data.target_names))],
        }
    )

    # Custom output to console
    print("\n----- Newsgroup Data Summary -----\n")
    print(f"Subset: {subset}")
    print(f"Total documents: {len(data.data)}")
    print(f"Total categories: {len(data.target_names)}\n")

    for index, row in doc_counts.iterrows():
        print(f"{row['category']}: {row['count']} documents")

    # Plot distribution of documents per category
    fig, ax = plt.subplots(figsize=(15, 7))

    sns.barplot(x="count", y="category", data=doc_counts, ax=ax, palette="deep")

    plt.title(f"Document Counts per Category in Dataset ({subset})", fontsize=15)
    plt.xlabel("Count", fontsize=12)
    plt.ylabel("Category", fontsize=12)
    plt.show()

    return data


# Example usage:
news_train = load_and_display_newsgroups_data()
news_valid = load_and_display_newsgroups_data(subset="test")
All Available Categories:

1. alt.atheism
2. comp.graphics
3. comp.os.ms-windows.misc
4. comp.sys.ibm.pc.hardware
5. comp.sys.mac.hardware
6. comp.windows.x
7. misc.forsale
8. rec.autos
9. rec.motorcycles
10. rec.sport.baseball
11. rec.sport.hockey
12. sci.crypt
13. sci.electronics
14. sci.med
15. sci.space
16. soc.religion.christian
17. talk.politics.guns
18. talk.politics.mideast
19. talk.politics.misc
20. talk.religion.misc

----- Newsgroup Data Summary -----

Subset: train
Total documents: 1728
Total categories: 3

comp.graphics: 584 documents
rec.motorcycles: 598 documents
talk.politics.guns: 546 documents
All Available Categories:

1. alt.atheism
2. comp.graphics
3. comp.os.ms-windows.misc
4. comp.sys.ibm.pc.hardware
5. comp.sys.mac.hardware
6. comp.windows.x
7. misc.forsale
8. rec.autos
9. rec.motorcycles
10. rec.sport.baseball
11. rec.sport.hockey
12. sci.crypt
13. sci.electronics
14. sci.med
15. sci.space
16. soc.religion.christian
17. talk.politics.guns
18. talk.politics.mideast
19. talk.politics.misc
20. talk.religion.misc

----- Newsgroup Data Summary -----

Subset: test
Total documents: 1151
Total categories: 3

comp.graphics: 389 documents
rec.motorcycles: 398 documents
talk.politics.guns: 364 documents

news_train.target_names is a list containing the category names of the news articles in the training subset, i.e., the topics under which each article is categorized.

news_train.target_names
['comp.graphics', 'rec.motorcycles', 'talk.politics.guns']

Next, we’re examining the distribution of document lengths within our news_train dataset.

Firstly, a list comprehension is used to generate a list doc_lengths which contains the length (i.e., the number of characters) of each document in news_train.data.

Then, using matplotlib’s pyplot module, a histogram is created with document lengths on the x-axis and their frequency (number of documents) on the y-axis. The histogram shows how document lengths in the dataset are distributed, highlighting both typical and outlier lengths, which can inform further exploration or preprocessing.

# Calculate document lengths
doc_lengths = [len(doc) for doc in news_train.data]

# Plot histogram
plt.figure(figsize=(10,5))
plt.hist(doc_lengths, bins=50, color='skyblue')
plt.title('Distribution of Document Lengths', fontsize=15)
plt.xlabel('Document Length', fontsize=12)
plt.ylabel('Number of Documents', fontsize=12)
plt.show()

The following code creates a series of histograms and boxplots for each category in the news_train dataset to visualize the distribution of document lengths within each category.

First, it defines bin_edges to ensure consistent scaling across all histograms. Then, a subplot grid is created with 2 rows and as many columns as there are categories in the data.

Next, the code loops over each category and collects the corresponding documents into category_data. It calculates the length of each document in this specific category and stores it in cat_doc_lengths.

A histogram and a boxplot are then plotted for each category’s document lengths on the same column, with the histogram on the top (first row) and the boxplot underneath (second row).

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Define bin edges for consistent scale across all histograms
bin_edges = np.linspace(0, max(doc_lengths), num=50)

# Create figure and axes
fig, ax = plt.subplots(nrows=2, ncols=len(news_train.target_names), figsize=(15, 10))

for category_index, category_name in enumerate(news_train.target_names):
    # Gather all data for this category
    category_data = [
        news_train.data[i]
        for i in range(len(news_train.data))
        if news_train.target[i] == category_index
    ]

    # Calculate document lengths for each category
    cat_doc_lengths = [len(doc) for doc in category_data]

    # Plot histogram
    ax[0, category_index].hist(cat_doc_lengths, bins=bin_edges, color='skyblue')
    ax[0, category_index].set_title(category_name)
    ax[0, category_index].set_xlabel('Document Length')
    ax[0, category_index].set_ylabel('Number of Documents')

    # Plot boxplot
    sns.boxplot(cat_doc_lengths, ax=ax[1, category_index])
    ax[1, category_index].set_xlabel('Document Length')

plt.tight_layout()
plt.show()

The following code generates word clouds for each category in the news_train dataset and an additional one for all categories combined. Word clouds are graphical representations where the most frequently occurring words are displayed prominently.

A function called generate_word_cloud() is defined to create a word cloud from a given text. It takes two parameters: the text and a matplotlib axes object. The WordCloud object is configured with a white background, specific dimensions, and a limit of 100 words. The generated word cloud is then displayed on the provided axes object with the axes turned off.

Following this, a word cloud for all categories combined is generated by joining all documents into a single string and feeding it into the generate_word_cloud() function.

The code then loops through all categories. For each category, it gathers the data that belongs to the current category. It creates a title for the current category’s word cloud and passes the category data to the generate_word_cloud() function to generate and display the word cloud.

from wordcloud import WordCloud

def generate_word_cloud(text, ax):
    """
    Generate a word cloud with a white background from the given text.

    Parameters:
    - text: The text to generate a word cloud from.
    """

    # Customize the word cloud
    wordcloud = WordCloud(
        background_color="white", width=800, height=400, max_words=100
    ).generate(text)

    # Display the generated image on the given axes:
    ax.imshow(wordcloud, interpolation="bilinear")
    ax.axis("off")


import matplotlib.pyplot as plt

# Determine number of categories for creating grid
num_categories = len(news_train.target_names)
num_rows = num_categories // 2 + num_categories % 2 + 1

fig, axs = plt.subplots(num_rows, 2, figsize=(20, num_rows*5.5))
axs = axs.ravel()

# Create a word cloud for all categories combined
title = "Word Cloud for All Categories Combined"
generate_word_cloud(" ".join(news_train.data), axs[0])
axs[0].set_title(title, fontsize=18)

# Loop through all categories
for category_index, category_name in enumerate(news_train.target_names):
    # Gather all data for this category
    category_data = [
        news_train.data[i]
        for i in range(len(news_train.data))
        if news_train.target[i] == category_index
    ]

    title = f"Word Cloud for '{category_name}' Category"

    # Create and display word cloud
    generate_word_cloud(" ".join(category_data), axs[category_index+1])
    axs[category_index+1].set_title(title, fontsize=18)

# Remove unused subplots
for ax_index in range(category_index+2, num_rows*2):
    fig.delaxes(axs[ax_index])

# Adjust the space between plots
plt.subplots_adjust(wspace=0.6, hspace=0.6)
plt.tight_layout()
plt.show()

The following code creates bar plots for the top occurring words across different categories in the news_train dataset.

A CountVectorizer object is instantiated to convert text data into a matrix of token counts, ignoring English stop words. This vectorizer is fit on news_train.data and used to transform it.

Next, it creates a DataFrame that contains each word and its count in the entire dataset, sorted by count in descending order.

The code then sets the number of top words to show for all categories combined (N) and per category (M). It determines the number of subplot rows based on the number of categories and creates a subplot grid using plt.subplots(). The axes are flattened for easy iteration.

The top N words for all categories combined are plotted in the first subplot.

Then, the code loops through each category. For each category, it filters out the data belonging to the current category. It then transforms this filtered data with the same vectorizer, sums up word counts for the category, and creates a similar DataFrame for the current category as was created for the entire data earlier. Then, it plots the top M words for the current category in the next subplot.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(news_train.data)

# Sum up the counts of each vocabulary word
word_counts = pd.DataFrame({
    'word': vectorizer.get_feature_names_out(),
    'count': np.asarray(X.sum(axis=0)).ravel()
}).sort_values('count', ascending=False)

N = 20  # Top N words for all categories combined
M = 15  # Top M words per category

# Calculate number of categories (for subplot grid)
num_categories = len(news_train.target_names)
num_columns = 2  # Number of columns in subplot grid
num_rows = int(np.ceil((num_categories + 1) / num_columns))  # Number of rows in subplot grid

# Create figure and axes (subplots)
fig, axs = plt.subplots(num_rows, num_columns, figsize=(20, 5 * num_rows))

# Flatten axes for easier iteration
axs = axs.flatten()

# Plot top N words for all categories combined
top_words = word_counts.iloc[:N]
sns.barplot(x='count', y='word', data=top_words, palette='viridis', ax=axs[0])
axs[0].set_title(f'Top {N} Words in All Categories Combined', fontsize=15)

# Loop through all categories
for category_index, category_name in enumerate(news_train.target_names):

    # Gather all data for this category
    category_data = [
        news_train.data[i]
        for i in range(len(news_train.data))
        if news_train.target[i] == category_index
    ]

    X_category = vectorizer.transform(category_data)

    # Sum up the counts of each vocabulary word in the category
    word_counts_category = pd.DataFrame({
        'word': vectorizer.get_feature_names_out(),
        'count': np.asarray(X_category.sum(axis=0)).ravel()
    }).sort_values('count', ascending=False)

    # Plot top M words in category
    top_words_category = word_counts_category.iloc[:M]
    sns.barplot(x='count', y='word', data=top_words_category, palette='viridis',
                ax=axs[category_index + 1])  # Use subsequent axes for category plots
    axs[category_index + 1].set_title(f"Top {M} Words in '{category_name}' Category", fontsize=15)

# Hide any unused subplots
for ax in axs[num_categories + 1:]:
    ax.axis('off')

# Show plot
plt.tight_layout()
plt.show()

Feature Engineering for Unstructured Data: Text

Machine learning algorithms often work best with structured data; however, in many real-world applications, the available data might be unstructured like text or images.

Working with Text

Applying machine learning methods to text documents brings unique challenges such as variable length, word-order dependency, and misalignment: two different sentences can convey the same information in many different ways.

Bag of Words (BoW) Representation

One way to overcome this issue is using a technique known as the Bag of Words representation where:

  1. All words in all documents are gathered and considered the features of interest.
  2. For each document, a row (learning example) is created whose entries indicate whether each word occurs in that document.

This method converts any document into a “bag” containing all the words in it, without taking into account the order of these words, making it simple and efficient.

Negative Side:

  • The main downside of this method is that all information about word order is lost.
  • Using bigrams or trigrams restores some local context (see the short example after this list), but larger values of n lead to higher processing time and memory requirements.
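To make the n-gram point concrete, here is a minimal sketch on two hypothetical sentences (not the newsgroups data): with unigrams alone the two sentences produce identical counts, while adding bigrams recovers some of the word order that distinguishes them.

# Hypothetical mini-example: unigram counts cannot tell these sentences apart, bigrams can.
from sklearn.feature_extraction.text import CountVectorizer

toy = ["the dog bit the man", "the man bit the dog"]

unigrams = CountVectorizer(ngram_range=(1, 1)).fit_transform(toy)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(toy)

print(unigrams.toarray())  # both rows are identical
print(bigrams.toarray())   # rows now differ on bigrams like 'dog bit' vs 'man bit'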

While using BoW representation, you have the option to record the presence of a word in a document either by simply marking whether the word is present or not or by recording the count of occurrences of the word. Normalizing these counts based on other factors might also be beneficial.

Handling text data in machine learning requires special techniques to convert unstructured text into a structured format. The Bag of Words representation is one common method, but others exist, each with its own strengths and weaknesses.

Encoding Text for Machine Learning: An Example

Machine learning models work best with numeric input data. Consequently, text data must be converted into a numerical format our models can utilize. A popular method is the bag-of-words representation.

We begin with these sample documents:

docs = [
    "Artificial intelligence, artificial intelligence will shape the future of innovation.",
    "Climate change poses significant challenges to modern societies, significant challenges indeed.",
    "Quantum computing, quantum computing could revolutionize technology as we know it.",
    "In literature, symbolism, symbolism can create deep layers of meaning.",
    "Sustainable practices, sustainable practices are key in preserving Earth's natural resources.",
    "Exploration of space represents the pinnacle of human achievement, the pinnacle indeed.",
    "Blockchain technology, blockchain technology can transform security and transparency in transactions.",
    "Historical events invariably influence contemporary socio-political landscapes. Indeed, historical events do.",
    "Bioinformatics is dramatically enhancing our understanding of complex ecologies, dramatically indeed.",
    "The fusion of arts and science, fusion indeed, can foster creative innovation.",
]

docs
['Artificial intelligence, artificial intelligence will shape the future of innovation.',
 'Climate change poses significant challenges to modern societies, significant challenges indeed.',
 'Quantum computing, quantum computing could revolutionize technology as we know it.',
 'In literature, symbolism, symbolism can create deep layers of meaning.',
 "Sustainable practices, sustainable practices are key in preserving Earth's natural resources.",
 'Exploration of space represents the pinnacle of human achievement, the pinnacle indeed.',
 'Blockchain technology, blockchain technology can transform security and transparency in transactions.',
 'Historical events invariably influence contemporary socio-political landscapes. Indeed, historical events do.',
 'Bioinformatics is dramatically enhancing our understanding of complex ecologies, dramatically indeed.',
 'The fusion of arts and science, fusion indeed, can foster creative innovation.']

Python Text Analysis Using NLTK

This guide introduces simple but effective steps for text analysis using the Natural Language Toolkit (NLTK), a powerful library for processing natural language in Python.

Setting Up The Environment

The code first sets up necessary libraries and datasets.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

nltk.download("punkt") nltk.download("stopwords")

Here’s what each step does:

  • nltk, the primary Python library for NLP, is imported.
  • Essential NLTK components such as stopwords and word_tokenize are imported.
  • Commonly used stopwords and tokenization dataset (‘punkt’) are downloaded.

Text Preprocessing

These tools are used to preprocess the text: remove punctuation, convert characters to lowercase, and eliminate stop words.

This preprocessing paves the way for further specialized tasks like sentiment analysis or text classification. By tokenizing data and removing stopwords, we refine raw text into something more understandable and analyzable by machines.

So, with NLTK, you get an accessible entry point into Python-based text analysis. As for its applications - there’s plenty more to explore!

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

nltk.download("punkt") nltk.download("stopwords")

# removing punctuation, converting to lowercase, and removing stopwords preprocessed_docs = [] for d in docs: # Remove punctuation and convert to lowercase processed_d = d.translate(str.maketrans("", "", string.punctuation)).lower()

# Word tokenization tokenized_d = word_tokenize(processed_d)

# Remove stopwords tokens_without_sw = [word for word in tokenized_d if not word in stopwords.words()]

preprocessed_docs.append(tokens_without_sw)
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Scott\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Scott\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Building Vocabulary and Visualizing Word Cloud

In the following code, we build the set of unique words appearing across all documents (the corpus vocabulary) and visualize it with a word cloud.

In simpler terms:

  • A unique vocabulary set is created from the preprocessed documents.
  • The set is converted to a list for later usage.
  • A string representation of the vocabulary is created.
  • A word cloud is generated and displayed to visualize the distribution of words in the corpus.
# Create an empty set to store the vocabulary
vocabulary_set = set()

# Add the unique words from each document into the set
for doc in preprocessed_docs:
    vocabulary_set.update(doc)

# Convert set to list so it can be used as a vocab later on
vocabulary_list = list(vocabulary_set)

vocab_string = " ".join(vocabulary_set)

# Create a figure and axes
fig, ax = plt.subplots(1, 1, figsize=(10, 5))

# Call the existing function to generate and display word cloud
generate_word_cloud(vocab_string, ax)

plt.show()

Binary Bag of Words Generation

The following piece of code creates a binary bag of words representation.

In simpler terms:

  • For each document, it checks if a word from the vocabulary list is present or not.
  • Each word in the document is represented as True (if it’s present) or False (if it’s not present).
  • This information is stored in a Python dictionary where the key is the word and the value is True/False.
  • The dictionaries are then converted into a pandas DataFrame, i.e., a table, where each row corresponds to a document, and columns represent the words in the vocabulary.
  • Each entry in the table indicates whether the corresponding word is present in the associated document (True meaning present, False meaning absent).
doc_contains = [{w: (w in d) for w in vocabulary_list} for d in docs]
binary_bag_of_wrds = pd.DataFrame(doc_contains, columns=vocabulary_list)
binary_bag_of_wrds

exploration climate ecologies challenges literature landscapes societies human shape poses ... deep foster artificial layers historical earths significant science transform dramatically
0 False False False False False False False False True False ... False False True False False False False False False False
1 False False False True False False True False False True ... False False False False False False True False False False
2 False False False False False False False False False False ... False False False False False False False False False False
3 False False False False True False False False False False ... True False False True False False False False False False
4 False False False False False False False False False False ... False False False False False False False False False False
5 False False False False False False False True False False ... False False False False False False False False False False
6 False False False False False False False False False False ... False False False False False False False False True False
7 False False False False False True False False False False ... False False False False True False False False False False
8 False False True False False False False False False False ... False False False False False False False False False True
9 False False False False False False False False False False ... False True False False False False False True False False

10 rows × 59 columns

Count-Based Bag of Words Generation

The following code snippet creates a count-based bag of words representation.

Here’s a simple description of what it does:

  • For each document, it counts the frequency of each word from the vocabulary list in that document.
  • These word frequencies are stored in a Python dictionary where the key is the word and the value is its frequency.
  • The dictionaries are then transformed into a pandas DataFrame—essentially, a table. Each row corresponds to a document, and columns represent the words in the vocabulary.
  • Each data entry in the table indicates the frequency of the corresponding word in the linked document.
import pandas as pd

# Counting occurrences of words in the documents
word_count = [{w: d.count(w) for w in vocabulary_set} for d in preprocessed_docs]

# Create a DataFrame from the word_count dictionary
count_bag_of_words = pd.DataFrame(word_count, columns=vocabulary_list)
count_bag_of_words

exploration climate ecologies challenges literature landscapes societies human shape poses ... deep foster artificial layers historical earths significant science transform dramatically
0 0 0 0 0 0 0 0 0 1 0 ... 0 0 2 0 0 0 0 0 0 0
1 0 1 0 2 0 0 1 0 0 1 ... 0 0 0 0 0 0 2 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 1 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
5 1 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
7 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 2 0 0 0 0 0
8 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2
9 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 1 0 0

10 rows × 59 columns

Checking for Words that Appear More than Once in a Document

The code snippet below identifies words that appear more than once within a document.

Here’s a brief explanation:

  • count_bag_of_words is a DataFrame where columns represent words in the vocabulary, and rows represent documents. Each entry signifies the frequency of a word in a document.
  • (count_bag_of_words > 1).any() generates a boolean mask, marking columns (words) where the condition (frequency > 1) holds true for any row (document).
  • count_bag_of_words.columns[(count_bag_of_words > 1).any()] extracts column names (words) where the condition holds true, creating ‘words_occurring_twice’.
  • The final if-else block checks if ‘words_occurring_twice’ is empty. If it isn’t, it prints the words that occurred more than once, otherwise, it prints “No words occurred twice”.
# Find words that occur more than once
words_occurring_twice = count_bag_of_words.columns[(count_bag_of_words > 1).any()]

# Check if any word occurred twice
if len(words_occurring_twice) == 0:
    print("No words occurred twice")
else:
    print(f"Words that occurred twice: {list(words_occurring_twice)}")
Words that occurred twice: ['challenges', 'technology', 'events', 'pinnacle', 'fusion', 'blockchain', 'symbolism', 'quantum', 'sustainable', 'computing', 'intelligence', 'practices', 'artificial', 'historical', 'significant', 'dramatically']

Using scikit-learn Library

The CountVectorizer class from the scikit-learn library provides a simpler way to perform these operations while handling sparse data efficiently:

from sklearn.feature_extraction.text import CountVectorizer

sparse = CountVectorizer().fit_transform(docs)
sparse.todense()
matrix([[0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
         0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1,
         0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
         0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
         0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
         0, 0, 0, 0, 2, 0, 0, 1, 1, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0,
         0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1,
         0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2,
         0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
        [0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
         0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

In conclusion, handling text data in machine learning inevitably involves converting raw text into a structured numerical format. This can be done manually, as illustrated above, or with the help of libraries like scikit-learn and NLTK.

Normalized Bag-of-Word Counts: TF-IDF

After generating the binary and count-based bag-of-words representation, the next step is to normalize these counts. There are two primary reasons to normalize:

  1. We want to reduce the influence of longer documents that inherently have more words. These documents could have stronger relationships with our target variable simply due to their length.
  2. Words that are frequent in every document are not very distinguishing, so as their frequency increases across the corpus, we want their contribution to drop.

A common method to normalize text data is to use the term frequency-inverse document frequency (TF-IDF) measure.

Let’s start by calculating the document frequency (DF) for each word, which is the number of documents (rows) that contain a particular word.

The following code calculates a document frequency for each word in the corpus. It does this by summing up Boolean values (True=1, False=0) along the ‘rows’ axis in a Bag of Words representation, where each row represents a document and columns are words. In other words, if a word appears in a document, it will count as 1, otherwise 0. The result is then transposed using .T to get a single row DataFrame.

This resulting doc_freq DataFrame contains the frequency of each word across all documents in the data corpus. Each column represents a distinct word, and the value represents how many documents contain that word.

doc_freq = pd.DataFrame(count_bag_of_words.astype(bool).sum(axis='rows')).T
doc_freq

exploration climate ecologies challenges literature landscapes societies human shape poses ... deep foster artificial layers historical earths significant science transform dramatically
0 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1

1 rows × 59 columns

The following code identifies and selects the columns in the doc_freq DataFrame where the document frequency is greater than 1, implying that the word appears in more than one document.

selected_columns = doc_freq.columns[doc_freq.gt(1).any()]

# Subset of the DataFrame
subset_df = doc_freq[selected_columns]
subset_df

indeed technology innovation
0 5 2 2

Next, we compute the Inverse Document Frequency (IDF) for the words in our set of documents. The IDF is calculated as the logarithm of the total number of documents divided by the document frequency (doc_freq), which counts how many documents contain the word.

In the dataset, each document corresponds to a row and each word to a column. If a word appears in a document, its doc_freq would be incremented. After the IDF is computed for each word, it is stored in the idf DataFrame.

$$ IDF(t) = \log{\frac{N}{df(t)}} $$

In this formula:

  • IDF(t) is the IDF of term t.
  • N is the total number of documents in your set.
  • df(t) corresponds to the document frequency of term t, or the number of documents in the set that contain t.

The IDF value decreases if the word is common across documents (high doc_freq ). On the other hand, rare words have a higher IDF. This approach is important in Natural Language Processing tasks like Text Mining and Information Retrieval to give more weight to informative (rare) words while reducing the weight of common ones.

idf = np.log(len(docs) / doc_freq)
idf

exploration climate ecologies challenges literature landscapes societies human shape poses ... deep foster artificial layers historical earths significant science transform dramatically
0 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 ... 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585 2.302585

1 rows × 59 columns

idf[selected_columns]

indeed technology innovation
0 0.693147 1.609438 1.609438

In the following code, count_bag_of_words holds our count vector, i.e., the frequency distribution of words (the ‘bag of words’). idf.iloc[0] fetches the first row of the IDF (Inverse Document Frequency) DataFrame, which holds the IDF value for each term in our corpus (collection of documents). The operation count_bag_of_words * idf.iloc[0] performs element-wise multiplication of the word counts by their corresponding IDF values, weighting each term by its importance as measured by IDF. The result, stored in tf_idf, therefore contains the TF-IDF score for each word in each document.

tf_idf = count_bag_of_words * idf.iloc[0]
tf_idf

exploration climate ecologies challenges literature landscapes societies human shape poses ... deep foster artificial layers historical earths significant science transform dramatically
0 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 2.302585 0.000000 ... 0.000000 0.000000 4.60517 0.000000 0.00000 0.000000 0.00000 0.000000 0.000000 0.00000
1 0.000000 2.302585 0.000000 4.60517 0.000000 0.000000 2.302585 0.000000 0.000000 2.302585 ... 0.000000 0.000000 0.00000 0.000000 0.00000 0.000000 4.60517 0.000000 0.000000 0.00000
2 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.00000 0.000000 0.00000 0.000000 0.00000 0.000000 0.000000 0.00000
3 0.000000 0.000000 0.000000 0.00000 2.302585 0.000000 0.000000 0.000000 0.000000 0.000000 ... 2.302585 0.000000 0.00000 2.302585 0.00000 0.000000 0.00000 0.000000 0.000000 0.00000
4 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.00000 0.000000 0.00000 2.302585 0.00000 0.000000 0.000000 0.00000
5 2.302585 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 2.302585 0.000000 0.000000 ... 0.000000 0.000000 0.00000 0.000000 0.00000 0.000000 0.00000 0.000000 0.000000 0.00000
6 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.00000 0.000000 0.00000 0.000000 0.00000 0.000000 2.302585 0.00000
7 0.000000 0.000000 0.000000 0.00000 0.000000 2.302585 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.00000 0.000000 4.60517 0.000000 0.00000 0.000000 0.000000 0.00000
8 0.000000 0.000000 2.302585 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.00000 0.000000 0.00000 0.000000 0.00000 0.000000 0.000000 4.60517
9 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 2.302585 0.00000 0.000000 0.00000 0.000000 0.00000 2.302585 0.000000 0.00000

10 rows × 59 columns

However, this still doesn’t account for the benefit that longer documents might get. To control this, we normalize these values so that the sum across each document (i.e., when we add up all the weighted counts) is the same. This means that documents are differentiated by the proportion of a fixed weight distributed over the word buckets, instead of the total amount across the buckets.

The sklearn library provides a Normalizer class to do this. This module is used for scaling individual samples to have unit norm. TfidfVectorizer_manual = Normalizer(norm='l1').fit_transform(tf_idf) applies L1 normalization to our tf_idf data, which means it scales the TF-IDF values so that the sum of absolute values of each row equals 1. By calling fit_transform(), the normalizer learns and applies the transformation in one step. The resultant normalized TF-IDF matrix is stored in TfidfVectorizer_manual. Finally, TfidfVectorizer_manual[0] accesses the first row of this normalized TF-IDF matrix, showing the L1 normalized TF-IDF values for the first document.

from sklearn.preprocessing import Normalizer

TfidfVectorizer_manual = Normalizer(norm='l1').fit_transform(tf_idf)

# First row; all rows sum to 1
TfidfVectorizer_manual[0]
array([0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.14927668, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.10433992, 0.        , 0.        , 0.        , 0.        ,
       0.14927668, 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.29855336, 0.        , 0.        , 0.        ,
       0.        , 0.29855336, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])
TfidfVectorizer_manual.sum(axis=1)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

This process essentially mimics what the TfidfVectorizer from sklearn does. The TfidfVectorizer is more streamlined and efficient as it does all these steps (including the removal of stop words, count vectorization, IDF calculation, and normalization) in one go:

from sklearn.feature_extraction.text import TfidfVectorizer

sparse = TfidfVectorizer(norm='l1').fit_transform(docs)
sparse.todense().sum(axis=1)
matrix([[1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.],
        [1.]])

So, in summary, to handle the influence of document length and common words on our bag-of-words representation, we use the TF-IDF measure to normalize our word counts. This can be done manually as shown above or using sklearn’s TfidfVectorizer for a more convenient and efficient approach.

Moving to Model Selection and Hyperparameter Tuning

After the initial step of preprocessing our text data by applying TF-IDF normalization, we now step into a vital phase in the Machine Learning pipeline, i.e., Model Selection and Hyperparameter Tuning.

TF-IDF Vectorizer

Our primary tool for vectorizing textual data will be the TfidfVectorizer. It transforms text into meaningful numerical data, making further processing possible. Let’s initialize it and check its default parameters.

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Extract the parameters for our TfidfVectorizer with the get_params() function
tfidf_params = tfidf_vectorizer.get_params()
tfidf_params
{'analyzer': 'word',
 'binary': False,
 'decode_error': 'strict',
 'dtype': numpy.float64,
 'encoding': 'utf-8',
 'input': 'content',
 'lowercase': True,
 'max_df': 1.0,
 'max_features': None,
 'min_df': 1,
 'ngram_range': (1, 1),
 'norm': 'l2',
 'preprocessor': None,
 'smooth_idf': True,
 'stop_words': None,
 'strip_accents': None,
 'sublinear_tf': False,
 'token_pattern': '(?u)\\b\\w\\w+\\b',
 'tokenizer': None,
 'use_idf': True,
 'vocabulary': None}

Two significantly influential parameters here are max_df and ngram_range:

  • max_df: This ignores terms with a document frequency higher than the given threshold, which helps the model focus on more informative, less ubiquitous words.

  • ngram_range: This specifies the range of n-values for the n-grams extracted from the text. An n-gram is a contiguous sequence of n items from a given sample of text or speech. By tuning this, we can decide whether multi-word phrases are useful for our problem.

During our testing, we found that the (1, 2) ngram_range (unigrams and bigrams) consistently yielded the best results. Hence, we decided to keep it constant while tuning the other parameters.
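As a rough illustration of how max_df trims the vocabulary, the sketch below (an aside, not part of the tuning run; it assumes the news_train subset loaded earlier) fits the vectorizer at a few thresholds and prints the resulting vocabulary sizes. With English stop words already removed the effect may be modest, and exact numbers depend on the categories you selected.

# Quick look at vocabulary size for different max_df thresholds (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer

for max_df in (1.0, 0.9, 0.75):
    vect = TfidfVectorizer(stop_words="english", ngram_range=(1, 2), max_df=max_df)
    vect.fit(news_train.data)
    print(f"max_df={max_df}: {len(vect.vocabulary_)} terms")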

Ridge Classifier

The classification model we are using is a RidgeClassifier, a classifier variant of Ridge regression.

from sklearn.linear_model import RidgeClassifier

# Initialize RidgeClassifier
ridge_classifier = RidgeClassifier()

# Extract the parameters for our RidgeClassifier with the get_params() function
ridge_class_params = ridge_classifier.get_params()
ridge_class_params
{'alpha': 1.0,
 'class_weight': None,
 'copy_X': True,
 'fit_intercept': True,
 'max_iter': None,
 'positive': False,
 'random_state': None,
 'solver': 'auto',
 'tol': 0.0001}

One important parameter to tune would be alpha, which controls the amount of shrinkage: the larger the alpha, the greater the amount of shrinkage and thus the coefficients become more robust to collinearity.

It’s this combination of vectorization and classification model that we’ll employ, and by tuning their hyperparameters well, we should be able to enhance the performance of our multiclass classification problem.
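Before tuning alpha systematically, a quick sanity check can show that regularization strength matters. The sketch below is an illustrative aside (assuming the news_train subset from earlier, not the tuned model) that compares two alpha values with a simple 3-fold cross-validation:

# Compare two regularization strengths on the training subset (illustrative only).
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier

for alpha in (1e-4, 1.0):
    pipe = Pipeline([
        ("vect", TfidfVectorizer(stop_words="english")),
        ("clf", RidgeClassifier(alpha=alpha)),
    ])
    scores = cross_val_score(pipe, news_train.data, news_train.target, cv=3)
    print(f"alpha={alpha}: mean CV accuracy = {scores.mean():.3f}")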

Graphing Hyperparameters for Tuning

We want to select and graph the distributions we will sample from in our randomized search: one for the classifier’s alpha parameter and one for the max_df (maximum document frequency) parameter of the TfidfVectorizer.

In the following code, we plot a uniform distribution representing the possible values for the max_df parameter in TfidfVectorizer, which will be tuned later on.

By visualizing these distributions, we can get some prior knowledge about where the majority of the values lie before starting the tuning process. It influences our choice of searching method for optimal hyperparameters, and helps in diagnosing the performance of the model based on different hyperparameters.

Therefore, we gain a clearer understanding of how each hyperparameter might affect our models' outcomes and get a better sense of direction during the tuning process.

  1. Uniform Distribution [0.75, 1.0]: The first subplot shows the Probability Density Function (PDF) of a uniform distribution over [a, a + b] = [0.75, 1.0], where a is the location and b the scale of the distribution.

This plot gives an understanding of how values drawn from a uniform distribution with location a and scale b are likely to be distributed.

  2. Values for the max_df parameter: The second subplot shows a histogram of values randomly sampled from that same uniform distribution. The histogram provides a way to visually interpret the quantity and spread of the randomly generated max_df values.

  3. Values for the alpha parameter: The third subplot plots the alpha values, which are spaced logarithmically between 10⁻⁶ and 10⁻². It lets us see the distribution of values in log space, which is especially useful when we expect the optimal value to lie on an exponential scale.

In conclusion, these plots are a great way to visualize and understand the hyperparameter space that one might be searching during model tuning.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import uniform

# Parameters
a = 0.75
b = 0.25

# Generate a range of values in the interval [a, a+b]
x = np.linspace(a - 0.1, a + b + 0.1, 1000)

# Generate the pdf values for these x values
y = uniform.pdf(x, loc=a, scale=b)

# Create max_df random values
max_doc_freq_values = uniform(0.75, 0.25)
random_values = max_doc_freq_values.rvs(size=1000)

# Create alpha_values array
alpha_values = np.logspace(-6, -2, 50)

# Create a figure
fig, axs = plt.subplots(3, figsize=(10, 18))

# Plot the pdf
axs[0].plot(x, y)
axs[0].set_title('Uniform Distribution [0.75, 1.0]')
axs[0].set_xlabel('x')
axs[0].set_ylabel('pdf')
axs[0].grid(True)

# Create a histogram
axs[1].hist(random_values, bins=30)
axs[1].set_title('Values for the max_df parameter')
axs[1].set_xlabel('Max Document Frequency')
axs[1].set_ylabel('Frequency')
axs[1].grid(True)

# Plot alpha values
axs[2].plot(alpha_values, 'o-')
axs[2].set_title('Values for the alpha parameter')
axs[2].set_xlabel('Index')
axs[2].set_ylabel('Alpha')
axs[2].grid(True)

# Display the plot
plt.tight_layout()
plt.show()

Overview

This code carries out the training and hyperparameter tuning of a text classification model using nested cross-validation.

Setting Up the Hyperparameters and Pipeline

Configuring the Hyperparameters Grid

The initial step involves defining the hyperparameters grid in which we’ll search for the optimal set of values. This is done through a dictionary:

parameter_grid = {
    "vect__max_df": uniform(0.75, 0.25),
    "vect__ngram_range": [(1, 2)],
    "clf__alpha": np.logspace(-6, -2, 50),
}

Here, vect__max_df is a distribution of potential maximum document frequencies for the TfidfVectorizer, vect__ngram_range is fixed at (1, 2) (unigrams and bigrams), and clf__alpha defines a log-spaced range for the RidgeClassifier’s alpha parameter.

Building the Pipeline

We then define a pipeline consisting of two stages:

  • The first stage uses the TfidfVectorizer to convert the input text into a matrix of TF-IDF features.
  • In the second stage, these features are used by RidgeClassifier to perform the classification task.
pipeline = Pipeline(
    [
        ("vect", TfidfVectorizer(stop_words="english")),
        ("clf", RidgeClassifier()),
    ]
)

Implementing Randomized Search Cross-validation

Inner Fold — Randomized Search CV

The inner fold of our nested cross-validation is a randomized search, performed by RandomizedSearchCV. It takes our pipeline along with the parameter_grid, and other parameters like the number of iterations (n_iter) as inputs. Here, it randomly samples 50 candidates from the parameter space and uses 2-fold cross-validation on each to find the best fitting model.

random_search = RandomizedSearchCV(
        estimator=pipeline,
        param_distributions=parameter_grid,
        n_iter=50,
        random_state=42,
        n_jobs=-1,
        verbose=3,
        cv=2,
)

Outer Fold — K-Fold Cross-validation

The outer fold is a traditional K-Fold cross-validation that splits the main dataset into 5 parts, or folds (as specified by KFold(n_splits=5)). Each fold is used once as a validation set while the remaining folds form the training set.

kf = KFold(n_splits=5)

Executing Nested Cross-validation and Storing Results

Iterating Over Outer Folds

For each subset in our 5-fold cross-validation, the RandomizedSearchCV instance fits on the training data. Within this loop, it conducts a 2-fold cross-validation for each of the 50 sampled candidates, resulting in 100 model fits per outer fold.

for train_index, test_index in kf.split(data):
    X_train, X_test = data[train_index], data[test_index]
    y_train, y_test = target[train_index], target[test_index]

random_search.fit(X_train, y_train)

Saving the Results for Each Fold

During each outer fold, the cross-validation results, the best parameters found, and the best achieved score from the RandomizedSearchCV are converted into pandas DataFrame objects and written to separate CSV files, one set per fold (cv_results_1.csv, best_params_1.csv, best_score_1.csv, and so on).

df_cv_results = pd.DataFrame(random_search.cv_results_)
df_cv_results.to_csv(f'cv_results_{cv_results_counter}.csv', index=False)

df_best_params = pd.DataFrame([random_search.best_params_])
df_best_params.to_csv(f'best_params_{best_params_counter}.csv', index=False)

df_best_score = pd.DataFrame([{'best_score': random_search.best_score_}])
df_best_score.to_csv(f'best_score_{best_score_counter}.csv', index=False)

By utilizing this structure, we can tune hyperparameters in an unbiased manner, avoiding any single, specific division of the dataset and thereby mitigating overfitting. The model must demonstrate effective generalization across multiple divisions of the data to be considered successful.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV, KFold
import numpy as np
from scipy.stats import uniform
import pandas as pd

parameter_grid = { "vect__max_df": uniform(0.75, 0.25), # uniform distribution from 0.75 to 1.0 "vect__ngram_range": [(1, 2)], # only bigrams "clf__alpha": np.logspace(-6, -2, 50), # more densely around smaller values }

# Define the pipeline pipeline = Pipeline( [ ("vect", TfidfVectorizer(stop_words="english")), ("clf", RidgeClassifier()), ] )

# Define the inner CV procedure (hyperparameter tuning) random_search = RandomizedSearchCV( estimator=pipeline, param_distributions=parameter_grid, n_iter=50, random_state=42, n_jobs=-1, verbose=3, cv=2, # inner CV )

# Convert data and target to numpy arrays data = np.array(news_train.data) target = np.array(news_train.target)

# Initialize the KFold object kf = KFold(n_splits=5)

# Initialize counters for the file names cv_results_counter = 1 best_params_counter = 1 best_score_counter = 1

# Loop over the folds for train_index, test_index in kf.split(data): X_train, X_test = data[train_index], data[test_index] y_train, y_test = target[train_index], target[test_index]

# Fit the RandomizedSearchCV on the training data random_search.fit(X_train, y_train)

# Save cv_results_ df_cv_results = pd.DataFrame(random_search.cv_results_) df_cv_results.to_csv(f'cv_results_{cv_results_counter}.csv', index=False) cv_results_counter += 1

# Save best_params_ df_best_params = pd.DataFrame([random_search.best_params_]) df_best_params.to_csv(f'best_params_{best_params_counter}.csv', index=False) best_params_counter += 1

# Save best_score_ df_best_score = pd.DataFrame([{'best_score': random_search.best_score_}]) df_best_score.to_csv(f'best_score_{best_score_counter}.csv', index=False) best_score_counter += 1

Aggregation & Visualization of Model Performance

The following code aggregates, analyzes, and visualizes the results obtained from the model training conducted above. It accomplishes the following tasks:

Fetching Data Files

The first part of the code fetches the files containing the cross-validation (cv) results and best scores obtained during model training.

cv_results_files = sorted(glob.glob(r'data/csv_files/content/cv_results_*.csv'))
best_score_files = sorted(glob.glob(r'data/csv_files/content/best_score_*.csv'))

Loading Data into DataFrames

Looping over each pair of corresponding cv_results and best_score files, the files are read into pandas DataFrames and then appended to two separate lists: cv_results_list and best_score_list.

for cv_results_file, best_score_file in zip(cv_results_files, best_score_files):
    ...

Finding Optimal Hyperparameters

Then, the best hyperparameters are determined by looking for the parameters associated with the highest mean test score across all folds.

best_params = None
best_score = -np.inf
for df_cv_results in cv_results_list:
    ...
print("Best parameters found: ", best_params)

Aggregating Best Scores

The list outer_scores is created to store the best mean test score derived from each fold’s cv_results.

for df_cv_results in cv_results_list:
    outer_scores.append(df_cv_results[df_cv_results['rank_test_score'] == 1]['mean_test_score'].values[0])

Plotting Cross-Validation Scores

Lastly, the collected outer_scores are plotted to allow easy visualization of model performance across different folds of data.

plt.figure(figsize=(10, 6))
...
plt.show()

Through this process, we obtain both a global perspective on the performance of various models tested during the nested cross-validation and a specific set of best performing hyperparameters. This information is invaluable for further refining our model.

import os
import glob
import matplotlib.pyplot as plt

# Get a list of all the cv_results csv files
cv_results_files = sorted(glob.glob(r'data/csv_files/content/cv_results_*.csv'))

# Get a list of all the best_score csv files
best_score_files = sorted(glob.glob(r'data/csv_files/content/best_score_*.csv'))

# Initialize lists to store the DataFrames
cv_results_list = []
best_score_list = []

# Load the cv_results and best_score files into DataFrames
for cv_results_file, best_score_file in zip(cv_results_files, best_score_files):
    df_cv_results = pd.read_csv(cv_results_file)
    df_best_score = pd.read_csv(best_score_file)

    cv_results_list.append(df_cv_results)
    best_score_list.append(df_best_score['best_score'].values[0])  # Assuming 'best_score' is the column name

# Find the hyperparameters with the highest mean test score
best_params = None
best_score = -np.inf
for df_cv_results in cv_results_list:
    # Find the index of the best score for this fold (.idxmax() returns the index of the max value)
    best_index = df_cv_results['mean_test_score'].idxmax()
    # If this score is better than the current best score, update the best parameters
    if df_cv_results.loc[best_index, 'mean_test_score'] > best_score:
        best_score = df_cv_results.loc[best_index, 'mean_test_score']
        best_params = df_cv_results.loc[best_index, 'params']  # Adjust this if 'params' isn't the exact column name

# Print the best parameters
print("Best parameters found: ", best_params)

# Initialize an empty list to store the best score for each fold
outer_scores = []

# Loop over the cv_results_list
for df_cv_results in cv_results_list:
    # Append the best mean_test_score to outer_scores
    outer_scores.append(df_cv_results[df_cv_results['rank_test_score'] == 1]['mean_test_score'].values[0])

outer_scores = np.array(outer_scores)

# Plot the outer_scores
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(outer_scores) + 1), outer_scores, 'o-')
plt.title('Cross-Validation Scores')
plt.xlabel('Fold')
plt.xticks(list(range(1, len(outer_scores) + 1)))
plt.ylabel('Score')
plt.grid(True)
plt.show()
Best parameters found:  {'clf__alpha': 0.01, 'vect__max_df': 0.846354125634979, 'vect__ngram_range': (1, 2)}

Visualization and Analysis of Cross-Validation Results

The code below visualizes and analyzes the cross-validation results. It transforms certain parameters for easier interpretation and plots them on an interactive parallel coordinates plot.

Interactive Plotting

One of the main features of this script is the use of a parallel coordinates plot to represent the data visually. A key advantage of this kind of plot is its interactivity: users can click and drag along an axis to select a range of values and filter the corresponding results out of the plot, which aids the exploratory analysis process.

Once the plot is rendered, it can be interactively examined by clicking on the different axis labels. Axes can be rearranged, numerical axes can be scaled continuously, and categorical axes have a category order that can be altered.

Data Preparation

Firstly, individual cross-validation results are converted into pandas DataFrames and concatenated:

cv_results_list = [pd.DataFrame(cv_result) for cv_result in cv_results_list]
cv_results = pd.concat(cv_results_list)
cv_results.reset_index(drop=True, inplace=True)

Subsequently, individual parameter columns are extracted from the ‘params’ column and added to the results DataFrame:

params_df = pd.json_normalize(cv_results['params'])
params_df.reset_index(drop=True, inplace=True)
cv_results = pd.concat([cv_results, params_df], axis=1)

Selecting Columns for Plotting

Columns related to vectorizer and classifier parameters, test score, and score time are selected:

param_names = [name for name in cv_results.columns if any(param in name for param in ['param_vect__', 'param_clf__'])]
column_results = param_names + ["mean_test_score", "mean_score_time"]

Data Transformation

Certain columns are transformed for better interpretability in the plot. For instance, ‘param_vect__norm’ is encoded as 2 for ‘l2’ and 1 for any other value, and ‘param_vect__ngram_range’ is represented by the upper limit of the range:

cv_results_transformed = transform_dataframe(cv_results[column_results].copy())

Plotting

The final DataFrame is rearranged and used to create a parallel coordinates plot with the mean test score determining the color of the lines:

fig = px.parallel_coordinates(
    cv_results_transformed,
    color="mean_test_score",
    color_continuous_scale=px.colors.sequential.Viridis_r,
    labels=labels,
)

A title is added to the plot, and then it’s displayed:

fig.update_layout(
    title={
        "text": "Parallel coordinates plot of text classifier pipeline (Note: Ngram Range (1, 2))",
        "y": 0.99,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    }
)
fig.show()

Finally, the best parameters found during the cross-validation process are printed out:

print("Best parameters found: ", best_params)
import math
import pandas as pd
import plotly.express as px

# Convert cv_results_ to a DataFrame
cv_results_list = [pd.DataFrame(cv_result) for cv_result in cv_results_list]
cv_results = pd.concat(cv_results_list)

# Reset index to make it unique
cv_results.reset_index(drop=True, inplace=True)

# Extract individual parameter columns from the 'params' column
params_df = pd.json_normalize(cv_results['params'])

# Reset index of params_df
params_df.reset_index(drop=True, inplace=True)

cv_results = pd.concat([cv_results, params_df], axis=1)

# Define the columns to include in the plot
param_names = [name for name in cv_results.columns if any(param in name for param in ['param_vect__', 'param_clf__'])]
column_results = param_names + ["mean_test_score", "mean_score_time"]

# Define labels for the plot
labels = {name: name for name in param_names}
labels.update({"mean_test_score": "Mean Test Score", "mean_score_time": "Mean Score Time", "param_vect__ngram_range": "Ngram Range (1, 2)"})

# Define transformation functions
def transform_dataframe(df):
    for col in df.columns:
        if col == "param_vect__norm":
            df[col] = df[col].apply(lambda x: 2 if x == "l2" else 1).astype('int64')
        elif col == "param_vect__ngram_range":
            df[col] = df[col].apply(lambda x: x[1]).astype('int64')
    return df

cv_results_transformed = transform_dataframe(cv_results[column_results].copy())

# Define the order of columns
ordered_cols = ["param_vect__ngram_range", "param_vect__max_df", "param_clf__alpha", "mean_test_score", "mean_score_time"]

# Reorder the columns in the cv_results_transformed dataframe
cv_results_transformed = cv_results_transformed[ordered_cols]

# Create the parallel coordinates plot
fig = px.parallel_coordinates(
    cv_results_transformed,
    color="mean_test_score",
    color_continuous_scale=px.colors.sequential.Viridis_r,
    labels=labels,
)

# Add a title to the plot
fig.update_layout(
    title={
        "text": "Parallel coordinates plot of text classifier pipeline (Note: Ngram Range (1, 2))",
        "y": 0.99,
        "x": 0.5,
        "xanchor": "center",
        "yanchor": "top",
    }
)

# Display the plot
fig.show()

# Print the best parameters
print("Best parameters found: ", best_params)
Best parameters found:  {'clf__alpha': 0.01, 'vect__max_df': 0.846354125634979, 'vect__ngram_range': (1, 2)}

These values show us the generalization performance we can expect when the data is randomly split and passed through the inner hyperparameter search. Now we can train our preferred model using the parameters found by RandomizedSearchCV.

Model Training & Evaluation

The following code snippet demonstrates training a classifier with optimal parameters obtained from nested cross validation and evaluating its performance through a classification report.

Reconstruction of Preferred Parameters

The string representation of best parameters is converted back into a dictionary using ast.literal_eval(). Then, specific vectorizer (vect) and classifier (clf) parameters are extracted separately to initialize these components.

best_params = ast.literal_eval(best_params)
vect_params = {k.split("__")[1]: v for k, v in best_params.items() if k.startswith("vect__")}
clf_params = {k.split("__")[1]: v for k, v in best_params.items() if k.startswith("clf__")}

Initialization & Pipeline Creation

The TfidfVectorizer and RidgeClassifier are initialized using the extracted parameters. They are then coupled together in a pipeline. This single entity ensures a consistent application of text processing (vect) and classification (clf) during both training and prediction.

vect = TfidfVectorizer(stop_words="english", **vect_params)
clf = RidgeClassifier(**clf_params)
pipeline = make_pipeline(vect, clf)

Model Training

Next, the model is trained on the training data (news_train.data and news_train.target).

pipeline.fit(news_train.data, news_train.target)

Prediction & Performance Evaluation

Predictions are made on the validation set (news_valid.data), and the accuracy of these predictions is printed. The classifier’s performance is further detailed via classification_report, which provides precision, recall, and F1-score for each category.

y_pred = pipeline.predict(news_valid.data)
print(f"Accuracy: {pipeline.score(news_valid.data, news_valid.target):.3f}")
print(classification_report(news_valid.target, y_pred, target_names=news_train.target_names))

This approach allows us to easily retrain our model using the optimal hyperparameters found in the nested cross-validation process and also assess its predictive capabilities on unseen data. This understanding can guide next steps in terms of refining or deploying the model.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report
import ast

# Convert the string representation of the best parameters back to a dictionary
best_params = ast.literal_eval(best_params)

# Extract the preferred parameters for the vectorizer and the classifier
vect_params = {k.split("__")[1]: v for k, v in best_params.items() if k.startswith("vect__")}
clf_params = {k.split("__")[1]: v for k, v in best_params.items() if k.startswith("clf__")}

# Initialize the vectorizer and classifier with the preferred parameters
vect = TfidfVectorizer(stop_words="english", **vect_params)
clf = RidgeClassifier(**clf_params)

# Make a pipeline with the vectorizer and classifier
pipeline = make_pipeline(vect, clf)

# Train the model on the training data
pipeline.fit(news_train.data, news_train.target)

# Predict the validation set results
y_pred = pipeline.predict(news_valid.data)

# Print the accuracy on the validation set
print(f"Accuracy: {pipeline.score(news_valid.data, news_valid.target):.3f}")

# Print the classification report
print(classification_report(news_valid.target, y_pred, target_names=news_train.target_names))
Accuracy: 0.911
                    precision    recall  f1-score   support

     comp.graphics       0.89      0.94      0.91       389
   rec.motorcycles       0.91      0.90      0.90       398
talk.politics.guns       0.94      0.89      0.92       364

          accuracy                           0.91      1151
         macro avg       0.91      0.91      0.91      1151
      weighted avg       0.91      0.91      0.91      1151

The classification report offers a more detailed view of the model’s performance. It provides key classification metrics (precision, recall, F1-score, and support) for each of the categories in our multiclass problem. Here’s a brief summary of what each metric signifies, with a small worked example after the list:

  • Precision: Out of all the instances that the model predicted for a specific class, how many were correct? A high precision indicates a low false positive rate.

  • Recall: Out of all the actual instances of a specific class, how many did the model correctly identify? A high recall indicates a low false negative rate.

  • F1-Score: The harmonic mean of precision and recall. This measure tries to balance both precision and recall. A higher F1-score indicates a more robust model.

  • Support: The number of actual instances of a specific class in the dataset.
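
As a quick illustration of these definitions, here is a toy example with hypothetical labels (not our newsgroup data), using scikit-learn’s precision_recall_fscore_support:

from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels for a three-class problem
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 0]

precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
# For class 2: 3 of the 4 instances predicted as class 2 are truly class 2 (precision = 0.75),
# and 3 of the 4 actual class-2 instances are recovered (recall = 0.75).
# F1 is the harmonic mean of the two (here also 0.75), and support is 4.
print(precision, recall, f1, support)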

Looking at the results:

  • For the “comp.graphics” category, the model had a precision of 0.89, meaning that 89% of posts it identified as belonging to “comp.graphics” were classified correctly. It had a recall of 0.94, indicating that it was able to correctly identify 94% of all “comp.graphics” posts. The F1-score, which balances precision and recall, was 0.91.

  • For “rec.motorcycles”, precision and recall were both close to 0.90, indicating a balance between identifying correct instances and minimizing incorrect predictions. The F1-score here was also 0.90, showing that the model performed equally well in identifying motorcycle-related posts.

  • For “talk.politics.guns”, the model had a high precision of 0.94, but a slightly lower recall of 0.89. This suggests that while the model was very accurate when it identified a post as related to “talk.politics.guns”, it missed a few actual “talk.politics.guns” posts. However, the overall performance was still robust, with an F1-score of 0.92.

The accuracy of the model is 0.911, which means that 91.1% of all classifications made by the model are correct. This is a high overall accuracy rate.

These results suggest that the model is doing well in classifying text into these three categories. While there are small differences in precision and recall among the categories, the F1-scores show that the model is robust in terms of both precision and recall.

In summary, this model has demonstrated strong performance in this multiclass text classification task. However, there is always room for further optimization. This could include additional feature engineering, trying different classification algorithms, or further tuning of model parameters.

Plotting Results in a Confusion Matrix

The confusion matrix is a useful tool for understanding the performance of a classification algorithm. Each cell counts how many instances of an actual label were assigned a given predicted label.

Computing the Confusion Matrix

First, the confusion matrix is computed from actual and predicted labels using Scikit-Learn’s confusion_matrix function:

conf_mat = confusion_matrix(news_valid.target, y_pred)

Here, news_valid.target represents the actual labels, and y_pred denotes the predicted labels produced by our model.

Converting the Confusion Matrix to DataFrame

While not necessary, converting the confusion matrix into a Pandas DataFrame helps improve the look of our heatmap visualization:

conf_mat_df = pd.DataFrame(conf_mat, index=news_train.target_names, columns=news_train.target_names)

In this DataFrame, each row corresponds to an actual class while each column corresponds to a predicted class.

Creating a Heatmap of the Confusion Matrix

We then use Seaborn’s heatmap function to visualize the DataFrame as a heatmap:

plt.figure(figsize=(10, 10))
sns.heatmap(conf_mat_df, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

The annot=True argument annotates each cell with its value, and the 'd' format code ensures these annotations are rendered as decimal integers. The colormap is set to 'Blues', with darker shades representing larger counts and lighter shades representing smaller ones.

This heatmap representation makes it easier to see where the model is correctly classifying (along the diagonal) and where it gets confused.

from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Compute the confusion matrix
conf_mat = confusion_matrix(news_valid.target, y_pred)

# Convert the confusion matrix to a DataFrame
# This is not necessary, but it makes the heatmap look nicer
conf_mat_df = pd.DataFrame(conf_mat, index=news_train.target_names, columns=news_train.target_names)

# Create a heatmap of the confusion matrix
plt.figure(figsize=(10, 10))
sns.heatmap(conf_mat_df, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

For the “comp.graphics” category:

  • 366 posts were correctly classified as “comp.graphics” (True Positives).
  • 16 were misclassified as “rec.motorcycles” and 7 were misclassified as “talk.politics.guns” (False Negatives).

For the “rec.motorcycles” category:

  • 357 posts were correctly classified as “rec.motorcycles” (True Positives).
  • 28 were misclassified as “comp.graphics” and 13 were misclassified as “talk.politics.guns” (False Negatives).

For the “talk.politics.guns” category:

  • 325 posts were correctly classified as “talk.politics.guns” (True Positives).
  • 19 were misclassified as “comp.graphics” and 20 were misclassified as “rec.motorcycles” (False Negatives).

On the flip side, you can also view the off-diagonal values as the number of times other categories were misclassified as the column category (False Positives).
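
To make this concrete, per-class false negatives (row sums minus the diagonal) and false positives (column sums minus the diagonal) can be read directly off the matrix. A small sketch, assuming the conf_mat array computed above:

import numpy as np

# conf_mat rows are actual classes, columns are predicted classes
true_positives = np.diag(conf_mat)
false_negatives = conf_mat.sum(axis=1) - true_positives  # actual posts predicted as another class
false_positives = conf_mat.sum(axis=0) - true_positives  # other posts predicted as this class

for name, tp, fn, fp in zip(news_train.target_names, true_positives, false_negatives, false_positives):
    print(f"{name}: TP={tp}, FN={fn}, FP={fp}")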

This confusion matrix indicates that the model performs well, as most predictions fall on the diagonal (correctly classified), but there are still some misclassifications. Misclassifications are a normal part of most models' performance, and further refinements to the model could be made to minimize them.

Analyzing Feature Effects and Visualizing Keywords

The following code can be used to analyze and visualize the effect of different features (or keywords) on specific classes in a text classification task.

Preprocessing and Transformation

First, the function plot_feature_effects transforms the training data using the previously fitted vectorizer:

news_train_transformed = vect.transform(news_train.data)

It also retrieves feature names and computes average feature effects:

feature_names = vect.get_feature_names_out()
average_feature_effects = clf.coef_ * np.asarray(news_train_transformed.mean(axis=0)).ravel()

Here, clf.coef_ holds the classifier’s coefficients, which are multiplied by the mean values of the transformed features.

Identifying Top Five Features

Next, the function iterates over the desired subset of target names. For each class, it finds the indices of the top five features with the highest feature effects:

top5 = np.argsort(average_feature_effects[index])[-5:][::-1]

These top features are collected in a pandas DataFrame named top, which will eventually contain the top five predictive words for each selected class.

Plotting Feature Effects

The function then plots a horizontal bar chart showcasing the strength of these top features for each class. Each class is represented by a separate colored set of bars in the plot.

ax.barh(
    y_locs + (i - 2) * bar_size,
    average_feature_effects[index, top_indices],
    height=bar_size,
    label=label,
)

Custom Legend and Configuration

The plotted graph includes custom legend positioning and y-axis configuration to improve readability:

ax.legend(loc="upper right", bbox_to_anchor=(1.00, 1))
ax.set(
    yticks=y_locs,
    yticklabels=predictive_words,
    ylim=[
        0 - 4 * bar_size,
        len(top_indices) * (4 * bar_size + padding) - 4 * bar_size,
    ],
)

Here, the y-axis labels are set to respective predictive words, and tick locations are adjusted accordingly.

Displaying Results

Finally, the function prints the DataFrame containing the top five keywords per selected class and returns the axes object of the plot. The title of the plot is set upon the call of this function:

_ = plot_feature_effects(subset_target_names).set_title("Average feature effect on the original data")

import numpy as np
import pandas as pd

def plot_feature_effects(subset_target_names):

    news_train_transformed = vect.transform(news_train.data)

    feature_names = vect.get_feature_names_out()
    average_feature_effects = clf.coef_ * np.asarray(news_train_transformed.mean(axis=0)).ravel()

    top = None
    for i, label in enumerate(subset_target_names):
        index = news_train.target_names.index(label)  # get the original index of the label
        top5 = np.argsort(average_feature_effects[index])[-5:][::-1]
        if i == 0:
            top = pd.DataFrame(feature_names[top5], columns=[label])
            top_indices = top5
        else:
            top[label] = feature_names[top5]
            top_indices = np.concatenate((top_indices, top5), axis=None)

    top_indices = np.unique(top_indices)
    predictive_words = feature_names[top_indices]

    # plot feature effects
    bar_size = 0.35
    padding = 1.0
    y_locs = np.arange(len(top_indices)) * (4 * bar_size + padding)

    fig, ax = plt.subplots(figsize=(10, 8))
    for i, label in enumerate(subset_target_names):
        index = news_train.target_names.index(label)
        ax.barh(
            y_locs + (i - 2) * bar_size,
            average_feature_effects[index, top_indices],
            height=bar_size,
            label=label,
        )
    ax.set(
        yticks=y_locs,
        yticklabels=predictive_words,
        ylim=[
            0 - 4 * bar_size,
            len(top_indices) * (4 * bar_size + padding) - 4 * bar_size,
        ],
    )
    ax.legend(loc="upper right", bbox_to_anchor=(1.00, 1))

    # save the figure with bbox_inches='tight' to ensure nothing gets cut off
    # plt.savefig('plot.png', bbox_inches='tight')

    print("top 5 keywords per class:")
    print(top)

    return ax

subset_target_names = ["comp.graphics", "rec.motorcycles", "talk.politics.guns"]  # just as an example
_ = plot_feature_effects(subset_target_names).set_title("Average feature effect on the original data")
top 5 keywords per class:
  comp.graphics rec.motorcycles talk.politics.guns
0      graphics            bike                gun
1        thanks            like             people
2         files             dod               guns
3          need            ride            weapons
4       program           bikes         government

Conclusion

In this post, we’ve presented a thorough exploration of text classification, an essential method for a range of applications such as enhancing customer support. The case study was centered around the 20 Newsgroups dataset, a widely used resource in the field of machine learning and natural language processing.

Starting with an overview of the dataset, we embarked on a journey through its features, deploying various visualization techniques. Tools such as word clouds, bar plots, and histograms offered an intuitive understanding of the data’s content and distribution, setting the foundation for the subsequent steps.

Next, we employed the Natural Language Toolkit (NLTK) along with other tools to preprocess a smaller, representative sample dataset. By constructing a binary bag-of-words model and creating a Term Frequency-Inverse Document Frequency (TF-IDF) matrix from scratch, we illustrated the underlying operations of Scikit-learn’s TfidfVectorizer. The transformation of this sample corpus clarified the workings of the TF-IDF concept, and how it aids in emphasizing the importance of different words for the classification task.

In the subsequent phase, we conducted extensive hyperparameter tuning, utilizing nested cross-validation for optimizing parameters such as ‘alpha’, ‘max_df’, and ‘ngram_range’. The optimal parameters identified were: {‘clf__alpha’: 0.01, ‘vect__max_df’: 0.846354125634979, ‘vect__ngram_range’: (1, 2)}.

Upon this optimization, we visualized the effects of these parameters using a parallel coordinates plot. This visualization aided in understanding the interaction and impact of different hyperparameters on the model’s performance.

With the best parameters, we trained our model and evaluated its performance. The model exhibited strong performance with an accuracy of 0.911 across three categories: “comp.graphics”, “rec.motorcycles”, and “talk.politics.guns”. The classification report and the confusion matrix further confirmed the model’s effectiveness and robustness.

Finally, we delved deeper into the model’s workings by visualizing feature effects. This step offered insights into how different features influenced the model’s predictions, improving the model’s interpretability.

Overall, this post demonstrated a systematic journey through text classification, from exploratory data analysis and preprocessing to model building, tuning, and evaluation. Despite the commendable results, there is always scope for improvement. Future investigations could explore different machine learning models, feature selection methods, or more advanced natural language processing techniques to further enhance performance.