Titanic Dataset: A Deep Dive into EDA
Exploring Hidden Insights with Histograms and EDA
Welcome aboard as we embark on an exploratory journey through the Titanic dataset! The dataset carries valuable passenger information from the ill-fated voyage of the Titanic. Owing to its simplicity and the potential for fascinating insights, it has become a classic learning tool in data analysis and machine learning spheres. In this article, we will undertake an Exploratory Data Analysis (EDA) on the Titanic dataset with primary focus on understanding its structure, cleaning the data, and visualizing it using histograms and bar plots. This initial exploration will form the foundation for more complex analyses and predictive modeling in future posts.
We will initiate our EDA process by loading the dataset, studying its structure, handling any missing values and then delving into data visualization. We’ll employ histograms to understand the distribution of numerical data and bar plots to gain insights into categorical data. Each step will be discussed in detail, offering a comprehensive guide for beginners while serving as a refresher for seasoned data enthusiasts. So, buckle up as we set sail on this exploratory journey!
First of all, let’s load the dataset from the Tableau website and examine it.
import pandas as pd
url = 'https://public.tableau.com/app/sample-data/titanic%20passenger%20list.csv'
titanic_tableau = pd.read_csv(url)
The `.info()` method on a DataFrame in pandas provides a concise summary of the DataFrame, including information about its index, columns, non-null values, and memory usage. When we run this method on the `titanic_tableau` DataFrame, we receive the following output:
titanic_tableau.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null int64
1 survived 1309 non-null int64
2 name 1309 non-null object
3 sex 1309 non-null object
4 age 1046 non-null float64
5 sibsp 1309 non-null int64
6 parch 1309 non-null int64
7 ticket 1309 non-null object
8 fare 1308 non-null float64
9 cabin 295 non-null object
10 embarked 1307 non-null object
11 boat 486 non-null object
12 body 121 non-null float64
13 home.dest 745 non-null object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB
Interpreting DataFrame Output
The output from the code statements provides some detailed information about the ‘titanic_tableau’ DataFrame. The interpretation of this output is as follows:
DataFrame Structure
`<class 'pandas.core.frame.DataFrame'>`: This indicates that `titanic_tableau` is a pandas DataFrame.
DataFrame Size
`RangeIndex: 1309 entries, 0 to 1308`: Our DataFrame contains 1309 rows, indexed from 0 to 1308.
Data Columns and Description
`Data columns (total 14 columns)`: Our DataFrame includes 14 columns; detailed information for each column is listed below.
In the column information:

- `#` (the first column) represents the column number.
- `Column` comprises the names of all the columns in the DataFrame.
- `Non-Null Count` displays the count of non-missing values in each column.
- `Dtype` specifies the data type of each column.
Summary of Data Types
`dtypes: float64(3), int64(4), object(7)`: This summary lists the data types present in the DataFrame. `float64` and `int64` are numeric types (floats and integers respectively), while `object` usually pertains to strings (text) but can also hold other types of data. In our DataFrame, we have 3 float columns, 4 integer columns, and 7 object columns.
Memory Usage
`memory usage: 143.3+ KB`: This denotes the memory space utilized to store the DataFrame.
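The `+` in `143.3+ KB` signals that pandas measured the object columns shallowly, counting only the cell pointers; passing `deep=True` to `memory_usage` also counts the string contents. A minimal sketch on a toy frame (the same call works on `titanic_tableau`):

```python
import pandas as pd

# Toy frame with an object (string) column, like 'name' in the Titanic data
toy = pd.DataFrame({"name": ["Allen", "Allison"], "age": [29.0, 0.92]})

# Shallow measurement counts only the pointers for object cells;
# deep=True adds the size of the string payloads themselves
shallow = toy.memory_usage().sum()
deep = toy.memory_usage(deep=True).sum()
print(deep > shallow)  # True: strings add real bytes under deep measurement
```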
Sampling Both Ends of the Dataset
In this part of the Python code, we are manipulating and visualizing data from the famous Titanic dataset using Pandas, a powerful data analysis library in Python.
Our task here is to combine the first few and last few rows of the DataFrame. This technique is useful when we need a snapshot of our data from both ends. Often, in large datasets, a quick look at the top and bottom rows helps us understand the data structure and check whether data manipulation steps like sorting or adding new columns have been applied correctly.
The steps we’re following are as follows:
1. Set the number of rows: First, we set a constant `ROW_COUNT` to 7. This specifies the number of rows we want to extract from the top and bottom of the DataFrame.
2. Combine top and bottom rows: We use pandas' `head()` to get the first 7 rows from our DataFrame `titanic_tableau` and `tail()` to get the last 7 rows. `pd.concat()` then combines these extracted rows.
3. Display the result: Finally, we display the combined DataFrame `combined`, which now contains 14 rows: the first 7 and the last 7.
ROW_COUNT = 7
# Combine the first ROW_COUNT rows and the last ROW_COUNT rows
combined = pd.concat([titanic_tableau.head(ROW_COUNT), titanic_tableau.tail(ROW_COUNT)])
# Display the combined DataFrame
combined
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.00 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
5 | 1 | 1 | Anderson, Mr. Harry | male | 48.00 | 0 | 0 | 19952 | 26.5500 | E12 | S | 3 | NaN | New York, NY |
6 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.00 | 1 | 0 | 13502 | 77.9583 | D7 | S | 10 | NaN | Hudson, NY |
1302 | 3 | 0 | Yousif, Mr. Wazli | male | NaN | 0 | 0 | 2647 | 7.2250 | NaN | C | NaN | NaN | NaN |
1303 | 3 | 0 | Yousseff, Mr. Gerious | male | NaN | 0 | 0 | 2627 | 14.4583 | NaN | C | NaN | NaN | NaN |
1304 | 3 | 0 | Zabour, Miss. Hileni | female | 14.50 | 1 | 0 | 2665 | 14.4542 | NaN | C | NaN | 328.0 | NaN |
1305 | 3 | 0 | Zabour, Miss. Thamine | female | NaN | 1 | 0 | 2665 | 14.4542 | NaN | C | NaN | NaN | NaN |
1306 | 3 | 0 | Zakarian, Mr. Mapriededer | male | 26.50 | 0 | 0 | 2656 | 7.2250 | NaN | C | NaN | 304.0 | NaN |
1307 | 3 | 0 | Zakarian, Mr. Ortin | male | 27.00 | 0 | 0 | 2670 | 7.2250 | NaN | C | NaN | NaN | NaN |
1308 | 3 | 0 | Zimmerman, Mr. Leo | male | 29.00 | 0 | 0 | 315082 | 7.8750 | NaN | S | NaN | NaN | NaN |
Understanding DataFrame Columns
Each column in the DataFrame represents a distinct attribute related to passengers on the Titanic. The descriptions for these attributes are as follows:

Details of Columns:

- `pclass`: The passenger class. ‘1’ denotes 1st class, ‘2’ denotes 2nd class, and ‘3’ denotes 3rd class.
- `survived`: A binary variable indicating whether the passenger survived. ‘0’ means No (did not survive), and ‘1’ means Yes (survived).
- `name`: The name of the passenger.
- `sex`: The sex of the passenger.
- `age`: The age of the passenger.
- `sibsp`: The number of siblings/spouses of the passenger aboard the Titanic.
- `parch`: The number of parents/children of the passenger aboard the Titanic.
- `ticket`: The ticket number of the passenger.
- `fare`: The fare paid by the passenger.
- `cabin`: The cabin number where the passenger was accommodated.
- `embarked`: The port from which the passenger boarded the Titanic. ‘C’ stands for Cherbourg, ‘Q’ stands for Queenstown, and ‘S’ stands for Southampton.
- `boat`: If the passenger survived, the lifeboat number.
- `body`: If the passenger did not survive and their body was recovered, the body identification number.
- `home.dest`: The home/destination of the passenger.
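With these column definitions in hand, a quick way to see how a categorical attribute breaks down is `value_counts()`, which the bar plots later in this post also rely on. A small sketch with hypothetical port codes (on the real data, this would be `titanic_tableau['embarked'].value_counts()`):

```python
import pandas as pd

# Hypothetical embarkation codes, for illustration only
embarked = pd.Series(['S', 'S', 'C', 'Q', 'S', 'C'])

# value_counts() tallies each category, sorted by frequency
counts = embarked.value_counts()
print(counts)  # S: 3, C: 2, Q: 1
```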
Let’s sort these columns by data type.
# Get a Series with the data type of each column
data_types = titanic_tableau.dtypes
# Convert the data types to strings
data_types = data_types.astype(str)
# Sort the Series by the data type
sorted_columns = data_types.sort_values()
# Print the sorted column names and their data types
for column_name, data_type in sorted_columns.items():
print(f"{column_name}: {data_type}")
age: float64
fare: float64
body: float64
pclass: int64
survived: int64
sibsp: int64
parch: int64
name: object
sex: object
ticket: object
cabin: object
embarked: object
boat: object
home.dest: object
Data Preprocessing: Converting Certain Columns to Categories
The `pclass` and `survived` columns in the `titanic_tableau` DataFrame originally hold numeric values. However, within our context, these two columns do not represent numerical quantities. Instead, they are categorical data:

- `pclass` represents the passenger class; and
- `survived` indicates whether a passenger survived or not.
For this reason, we convert these two columns into categories for more meaningful and straightforward visualization:
titanic_tableau['pclass'] = titanic_tableau['pclass'].astype('category')
titanic_tableau['survived'] = titanic_tableau['survived'].astype('category')
Next, we would like to get an understanding of the datatype of each column in our DataFrame. We begin by extracting a Series with the data type of each column:
data_types = titanic_tableau.dtypes
Then, we convert these data types into strings and sort them:
data_types = data_types.astype(str)
sorted_columns = data_types.sort_values()
Finally, we print each column name along with its corresponding data type:
for column_name, data_type in sorted_columns.items():
print(f"{column_name}: {data_type}")
Hence, we have preprocessed our DataFrame by converting certain columns to categories and explored the data types of all columns within our dataset.
titanic_tableau['pclass'] = titanic_tableau['pclass'].astype('category')
titanic_tableau['survived'] = titanic_tableau['survived'].astype('category')
# Get a Series with the data type of each column
data_types = titanic_tableau.dtypes
# Convert the data types to strings
data_types = data_types.astype(str)
# Sort the Series by the data type
sorted_columns = data_types.sort_values()
# Print the sorted column names and their data types
for column_name, data_type in sorted_columns.items():
print(f"{column_name}: {data_type}")
pclass: category
survived: category
age: float64
fare: float64
body: float64
sibsp: int64
parch: int64
name: object
sex: object
ticket: object
cabin: object
embarked: object
boat: object
home.dest: object
Next, let’s use a different technique to check for null values.
# Checking for missing values
titanic_tableau.isnull().sum()
pclass 0
survived 0
name 0
sex 0
age 263
sibsp 0
parch 0
ticket 0
fare 1
cabin 1014
embarked 2
boat 823
body 1188
home.dest 564
dtype: int64
The dataset has missing values in several columns. The ones we will handle first are ‘age’, ‘fare’, ‘cabin’, and ‘embarked’:

column | # missing |
---|---|
age | 263 |
fare | 1 |
cabin | 1014 |
embarked | 2 |
We will need to handle these missing values before proceeding with the analysis. For the ‘age’ column, we can fill the missing values with the median age. For the ‘fare’ column, we can fill the missing value with the median fare. The ‘cabin’ column has a lot of missing values, so we might consider dropping it from our analysis.
Let’s perform these data cleaning steps.
Handling Missing Data and Unnecessary Columns in the Titanic Dataset
This overview will walk through the code explaining each step in detail:
Step 1: Filling Missing Values in ‘age’ and ‘fare’ Columns
titanic_tableau['age'].fillna(titanic_tableau['age'].median(), inplace=True)
titanic_tableau['fare'].fillna(titanic_tableau['fare'].median(), inplace=True)
In this segment, the code fills in any missing (NaN) values in the ‘age’ and ‘fare’ columns with the median value of the respective column. The `fillna` method replaces null values with a specified replacement value; here, `.median()` computes the middle value of the series for ages and fares. The argument `inplace=True` ensures that the change is applied directly to the DataFrame.
Step 2: Dropping the ‘cabin’ Column
titanic_tableau.drop('cabin', axis=1, inplace=True)
Here, the `drop` method is employed to remove the ‘cabin’ column from the DataFrame. `axis=1` refers to columns (`axis=0` would refer to rows). As in the previous step, `inplace=True` ensures that the column is permanently removed from the DataFrame.
Step 3: Checking for Remaining Missing Values in All Columns
titanic_tableau.isnull().sum()
Lastly, this line of code checks whether any missing values remain in the DataFrame. The `isnull()` method returns a DataFrame analogous to the original, filled with Boolean values indicating whether the corresponding original value was null. The `sum()` method then counts the number of `True` instances (i.e., missing values) in each column.
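The two-step mechanism described above can be seen on a tiny frame (toy values, assumed purely for illustration):

```python
import numpy as np
import pandas as pd

# isnull() yields a Boolean frame marking missing cells True
toy = pd.DataFrame({'age': [29.0, np.nan, 2.0],
                    'fare': [211.3375, 151.55, np.nan]})
mask = toy.isnull()
print(mask)

# sum() then counts the True values column by column
print(toy.isnull().sum())  # age: 1, fare: 1
```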
# Filling missing values
titanic_tableau['age'].fillna(titanic_tableau['age'].median(), inplace=True)
titanic_tableau['fare'].fillna(titanic_tableau['fare'].median(), inplace=True)
# Dropping the 'cabin' column
titanic_tableau.drop('cabin', axis=1, inplace=True)
# Checking for missing values again
titanic_tableau.isnull().sum()
pclass 0
survived 0
name 0
sex 0
age 0
sibsp 0
parch 0
ticket 0
fare 0
embarked 2
boat 823
body 1188
home.dest 564
dtype: int64
The missing values in the ‘age’ and ‘fare’ columns have been filled, and the ‘cabin’ column has been dropped. Now, the only column with missing values is ‘embarked’, which has 2 missing values. We can fill these with the most common port of embarkation.
Let’s perform this final data cleaning step.
Completing the Data Cleaning Process for the Titanic Dataset
The data preprocessing portion of the code includes a series of steps designed to clean and improve the quality of the dataset. This section will provide a comprehensive explanation of these steps which include filling missing values in the ‘embarked’ column and checking for remaining missing values amongst all columns.
Step 1: Filling Missing Values in ‘embarked’ Column
titanic_tableau['embarked'].fillna(titanic_tableau['embarked'].mode()[0], inplace=True)
In this first step, the code handles missing (NaN) values in the ‘embarked’ column by replacing them with the most frequently occurring value, the mode. The `fillna` method is employed to replace null values with the specified replacement value. The `.mode()[0]` operation returns the most common value in the ‘embarked’ series.

The `[0]` indexing is necessary because `.mode()` produces a Series, and we want the first (and often only) value from it. The `inplace=True` argument guarantees that the modification is made directly in the DataFrame itself.
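Why the `[0]` is needed can be seen in isolation: `.mode()` always returns a Series, because ties can produce several modes (toy values here):

```python
import pandas as pd

# 'S' occurs most often, so it is the single mode
ports = pd.Series(['S', 'C', 'S', 'Q', 'S'])
print(ports.mode())     # a Series holding 'S'
print(ports.mode()[0])  # 'S', the scalar we pass to fillna
```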
Step 2: Checking For Remaining Missing Values in All Columns
titanic_tableau.isnull().sum()
In the final step, the code checks if there are any more remaining NaNs left in the DataFrame. The function isnull()
is implemented to return a DataFrame whereby the locations of Null values are marked as True and non-null values as False.
Then, the sum()
function is used to count the number of True instances, i.e., missing values in each column. This allows us to glean the total count of missing values in each DataFrame column.
# Filling missing values in 'embarked' with the most common value
titanic_tableau['embarked'].fillna(titanic_tableau['embarked'].mode()[0], inplace=True)
# Checking for missing values again
titanic_tableau.isnull().sum()
pclass 0
survived 0
name 0
sex 0
age 0
sibsp 0
parch 0
ticket 0
fare 0
embarked 0
boat 823
body 1188
home.dest 564
dtype: int64
The output shows that all the missing values have been handled except for the following:
column | # missing |
---|---|
boat | 823 |
body | 1188 |
home.dest | 564 |
Let’s look at these columns and check whether we really need them for our analysis:

- `boat`: This column tells us whether a passenger escaped on a lifeboat. However, this is more of a spoiler than a clue, because if they were on a lifeboat, they likely survived. Using it as a predictor would be cheating: it’s like watching a movie knowing its ending!
- `body`: Now, this one is gloomy. It’s the identification number given to a recovered body. Again, knowing this would pretty much reveal whether a passenger survived. You don’t want spoilers in your data either!
- `home.dest`: This could tell us about the passenger’s home and destination. A cool piece of information! But it’s like a puzzle with half the pieces missing (564 values are missing!). Using it might make our predictions unreliable.
So, considering all the points above, we’re not too worried about dropping these columns from our analysis. We want a fair predictive model that relies on valid clues—not spoilers! And regarding the missing data, we can’t fill such large gaps without risking bias in our results.
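One hedged way to carry out that drop is on a copy, so the original frame keeps `boat` available for the bar plots later in this post (shown on a toy frame; the name `model_ready` is an assumption of this sketch, not from the original analysis):

```python
import pandas as pd

# Toy stand-in with the three leaky/incomplete columns under discussion;
# on the real data the same drop would apply to titanic_tableau
toy = pd.DataFrame({
    'age': [29.0, 2.0],
    'boat': ['2', None],
    'body': [None, 135.0],
    'home.dest': ['St Louis, MO', None],
})

# Drop on a copy: the original frame is left untouched
model_ready = toy.drop(columns=['boat', 'body', 'home.dest'])
print(list(model_ready.columns))  # ['age']
```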
Digging Deeper with Exploratory Data Analysis (EDA)
As we continue our journey into understanding our data set, we’ll now shift gears and immerse ourselves in the realm of Exploratory Data Analysis. This key step enables us to understand the main characteristics of our data, allowing us to draw meaningful insights and frame appropriate questions for further analysis.
To start off, let’s turn our attention towards histograms - these powerful graphical representations allow us to visualize the distribution of our numeric variables.
In simple terms, imagine slicing the range of our data into various bins and then counting the number of instances falling into each bin. The result? A histogram! The x-axis represents these bins defined as consecutive, non-overlapping intervals of variable values, while the y-axis corresponds to the frequency (counts) or density (if normalized), granting us an intuitive view of data concentration and trends.
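The slicing-and-counting idea can be sketched directly with NumPy’s `histogram` function, using a handful of hypothetical ages:

```python
import numpy as np

# Ten hypothetical ages (illustration only, not the real dataset)
ages = np.array([4, 22, 24, 29, 31, 35, 38, 45, 52, 71])

# Slice the range [4, 71] into 4 equal-width bins and count per bin
counts, bin_edges = np.histogram(ages, bins=4)

print(counts)     # [1 5 3 1]: instances falling into each bin
print(bin_edges)  # 5 edges defining the 4 consecutive intervals
```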
Now, without further ado, let’s delve straight into plotting and analyzing histograms for each of our numerical variables!
Introduction to the enhanced_histogram Function
The `enhanced_histogram` function is a robust tool within our code that we use for data visualization. It accepts a pandas DataFrame and generates histograms for each numerical variable present in the data.
Step 1: Importing Necessary Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
At the start of our script, we import the libraries that we need for data manipulation (Pandas, Numpy) and visualization (Seaborn, Matplotlib).
Step 2: Defining the Function
def enhanced_histogram(dataframe):
We start by defining the function, `enhanced_histogram`, which takes one parameter: a pandas DataFrame called `dataframe`.
Step 3: Selecting Numeric Columns
numeric_variables = dataframe.select_dtypes(include=[np.number])
In this step, we filter out the numeric columns from the DataFrame.
Step 4: Defining Custom Labels
custom_labels = { ... }
Next, we define a dictionary with custom labels for some of the variables in the DataFrame.
Step 5: Creating an Empty Figure Dictionary
figs = {}
Here, we initialize an empty dictionary where we will store the figures created in the next steps.
Step 6: Creating Histograms for Each Variable
for variable in numeric_variables:
We then iterate over each numeric variable (column). For each column, we plot a histogram using seaborn’s `histplot` function. If a custom label was defined for the variable, we use it; otherwise, we capitalize the variable name. We store each histogram and its formatting details in a temporary figure (`fig`).
Step 7: Storing the Figure and Closing
figs[variable] = fig
plt.close(fig)
We then add each figure to the `figs` dictionary, using the corresponding variable as the key, and close the figure to free up memory.
Step 8: Returning Figures
return figs
At last, the function returns the `figs` dictionary that contains all the histograms.
Step 9: Calling the Function
figs = enhanced_histogram(titanic_tableau)
Finally, we call the `enhanced_histogram` function on the `titanic_tableau` DataFrame. The returned figures are stored in the `figs` variable.
In conclusion, this script creates histograms for all numeric columns in a provided DataFrame. It also provides properly formatted labels for a specific set of known columns.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
def enhanced_histogram(dataframe):
# select numeric columns
numeric_variables = dataframe.select_dtypes(include=[np.number])
# define a dictionary for custom labels
custom_labels = {
"survived": "Survival Status",
"pclass": "Passenger Class",
"age": "Age of Passengers",
"sibsp": "Number of Siblings/Spouses Aboard",
"parch": "Number of Parents/Children Aboard",
"fare": "Passenger Fare",
"body": "Body Identification Number",
}
# create a dictionary to store the plots
figs = {}
for variable in numeric_variables:
# get feature
var = dataframe[variable]
# get custom label if it exists, otherwise use the variable name
label = custom_labels.get(variable, variable.capitalize())
# create histogram
fig, ax = plt.subplots(figsize=(12, 6))
sns.histplot(var, bins=30, kde=False, color=(135/255, 206/255, 235/255, 1), edgecolor='black', ax=ax)
# add title and labels
ax.set_title('Distribution of ' + label, fontsize=14, pad=20)
ax.set_xlabel(label, fontsize=12, labelpad=15)
ax.set_ylabel('Frequency', fontsize=12, labelpad=15)
# add grid and set its zorder to a lower value
ax.grid(axis='y', alpha=.33, zorder=-10)
ax.set_axisbelow(True)
# store the figure in the dictionary
figs[variable] = fig
# improve the layout
plt.tight_layout()
plt.close(fig)
    # return the dictionary of figures
return figs
figs = enhanced_histogram(titanic_tableau)
- Age of Passengers: This histogram shows the distribution of the ‘age’ column. The histogram shows that the majority of passengers were in their 20s and 30s, with a few older passengers.
display(figs['age'])
- Number of Siblings/Spouses Aboard: This histogram shows the distribution of the ‘sibsp’ column, which indicates the number of siblings or spouses each passenger had aboard the Titanic. The histogram shows that most passengers did not have siblings or spouses aboard.
display(figs['sibsp'])
- Number of Parents/Children Aboard: This histogram shows the distribution of the ‘parch’ column, which indicates the number of parents or children each passenger had aboard the Titanic. The histogram shows that most passengers did not have parents or children aboard.
display(figs['parch'])
- Passenger Fare: This histogram shows the distribution of the ‘fare’ column, which indicates the fare each passenger paid. The histogram shows that most fares were low, with a few high fares.
display(figs['fare'])
- Body Identification Number: This histogram shows the distribution of the ‘body’ column, which indicates the body identification number assigned to each passenger, if any. The histogram shows that most passengers did not have a body identification number, which means their bodies were not found.
display(figs['body'])
Transitioning from Histograms to Bar Plots
We’ve just finished with the exploration of our numerical data with histograms. As you’ve seen, a histogram provides valuable insights into how a numerical value is distributed across different intervals or ‘bins’. With histograms, we were able to understand various characteristics of Titanic passengers and their journey circumstances.
Histograms vs Bar Plots
It’s important to note that while histograms are best for numerical data, they do not work well for categorical data. This is where bar plots excel.
Unlike histograms, bar plots show discrete categories on one axis, with the bars' heights corresponding to the frequency, proportion, or any other metric of those categories. They are particularly useful when dealing with categorical variables, as they display distinct outcomes (e.g., Survival Status: Survived or Died) and how frequently they occur in the dataset.
In essence, while histograms are used for displaying distributions of numerical data, bar plots are used for comparing categorical data. Don’t get confused by the similar appearance - the context of use marks the difference!
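As a minimal sketch of the contrast: a bar plot draws one bar per discrete category, with no binning involved (toy data here; the fully styled version appears in the `enhanced_bar_plot` function below):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Count each category, then draw one bar per category
sex_counts = pd.Series(['male', 'male', 'female']).value_counts()

fig, ax = plt.subplots()
ax.bar(sex_counts.index, sex_counts.values, edgecolor='black')
ax.set_ylabel('Frequency')
plt.close(fig)  # close to free memory, as the functions in this post do
```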
With this understanding, let’s move forward to explore the categorical variables in our dataset using bar plots.
In the following section, we’ll generate bar plots for each categorical variable in our Titanic dataset. These plots will tell us about the distribution of different categories within these variables, which could provide further insights into the Titanic shipwreck story. Stay tuned!
Introduction to enhanced_bar_plot Function
The `enhanced_bar_plot` function is a versatile tool that we’ll be using in our code. It accepts a pandas DataFrame and creates bar plots for each categorical variable present in the data.
Step 1: Defining the function
def enhanced_bar_plot(dataframe, columns=None):
Here, we define the function named `enhanced_bar_plot`. This function accepts two parameters: a pandas DataFrame (`dataframe`) and an optional list of column names (`columns`).
Step 2: Selecting Columns for Bar Plots
if columns is not None:
categorical_variables = dataframe[columns]
else:
categorical_variables = dataframe.select_dtypes(include=['object', 'category'])
In this step, the function checks whether any specific columns are provided for plotting. If not, it selects all the columns that have an ‘object’ or ‘category’ datatype i.e., all the categorical columns from the dataframe.
Step 3: Setting Custom Labels and Initializing Figures Dictionary
custom_labels = {
"survived": "Survival Status",
"pclass": "Passenger Class",
"sex": "Sex",
"embarked": "Port of Embarkation",
"boat": "Boat Class"
}
figs = {}
Next, a dictionary called `custom_labels` is initialized, which maps our column names to more descriptive labels. Also, an empty dictionary `figs` is initialized to store the matplotlib figure objects of each plot.
Creation of Bar Plots
In this section, the function crafts individual bar plots for each categorical variable through several steps:
Step 4: Iterating Over Categorical Variables
for variable in categorical_variables:
var = dataframe[variable]
var_value = var.value_counts()
colors = sns.color_palette('deep')[0:len(var_value)]
The function then loops over each selected categorical variable. For each iteration, it counts the unique values present in the variable using the `value_counts()` method. A color palette is chosen using seaborn’s `color_palette()` function.
Step 5: Creating Figure and Subplot, Drawing Bar Plot
fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(var_value.index, var_value, color=colors, edgecolor='black')
A new figure and subplot are created for each variable. The bar plot is drawn with the unique categories as x-values and their frequencies as bar heights.
Step 6: Adding Titles and Labels
label = custom_labels.get(variable, variable.capitalize())
ax.set_title('Distribution of ' + label, fontsize=14, pad=20)
ax.set_xlabel(label, fontsize=12, labelpad=15)
ax.set_ylabel('Frequency', fontsize=12, labelpad=15)
Titles, the x-label, and the y-label are added to make the plot more informative, using either the user-defined label from `custom_labels` or the capitalized version of the column name.
Step 7: Customizing X-Tick Labels
if variable == 'embarked':
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(['Southampton', 'Cherbourg', 'Queenstown'])
elif variable == 'pclass':
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(['First', 'Second', 'Third'])
elif variable == 'survived':
ax.set_xticks([0, 1])
ax.set_xticklabels(['Died', 'Survived'])
elif variable == 'boat':
plt.xticks(rotation=45)
This part of the function customizes the x-tick labels. Depending on the nature of the variable in question, more meaningful labels are assigned to enhance readability.
Step 8: Adding Gridlines and Frequency Annotations
ax.grid(axis='y', alpha=0.5)
ax.set_axisbelow(True)
for i, v in enumerate(var_value):
ax.text(var_value.index[i], v + 0.01 * v, str(v), ha='center', fontweight='bold', fontsize=10)
Gridlines are added to make the graph easier to interpret. Also, each bar is annotated with its frequency value.
Step 9: Storing Figure and Closing Plot
figs[variable] = fig
plt.close()
Finally, the figure is stored in the `figs` dictionary using the column name as the key, and the plot is closed to free up memory.
Utilization of the Function
By virtue of these carefully designed steps, the `enhanced_bar_plot` function allows us to summarize and visualize the distribution of categorical variables efficiently. To use this function, simply pass your DataFrame and optionally a list of column names:

figs = enhanced_bar_plot(titanic_tableau, columns=['survived', 'pclass', 'boat', 'embarked', 'sex'])

This line generates bar plots for the columns ‘survived’, ‘pclass’, ‘boat’, ‘embarked’, and ‘sex’ in the `titanic_tableau` DataFrame.
def enhanced_bar_plot(dataframe, columns=None):
if columns is not None:
categorical_variables = dataframe[columns]
else:
categorical_variables = dataframe.select_dtypes(include=['object', 'category'])
custom_labels = {
"survived": "Survival Status",
"pclass": "Passenger Class",
"sex": "Sex",
"embarked": "Port of Embarkation",
"boat": "Boat Class",
}
figs = {}
for variable in categorical_variables:
var = dataframe[variable]
var_value = var.value_counts()
colors = sns.color_palette('deep')[0:len(var_value)]
fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(var_value.index, var_value, color=colors, edgecolor='black')
label = custom_labels.get(variable, variable.capitalize())
ax.set_title('Distribution of ' + label, fontsize=14, pad=20)
ax.set_xlabel(label, fontsize=12, labelpad=15)
ax.set_ylabel('Frequency', fontsize=12, labelpad=15)
if variable == 'embarked':
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(['Southampton', 'Cherbourg', 'Queenstown'])
elif variable == 'pclass':
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(['First', 'Second', 'Third'])
elif variable == 'survived':
ax.set_xticks([0, 1])
ax.set_xticklabels(['Died', 'Survived'])
elif variable == 'boat':
plt.xticks(rotation=45)
ax.grid(axis='y', alpha=0.5)
ax.set_axisbelow(True)
for i, v in enumerate(var_value):
ax.text(var_value.index[i], v + 0.01 * v, str(v), ha='center', fontweight='bold', fontsize=10)
figs[variable] = fig
plt.close()
return figs
figs = enhanced_bar_plot(titanic_tableau, columns=['survived', 'pclass', 'boat', 'embarked', 'sex'])
- Survival Status: This graph shows the distribution of survival status among the passengers. The x-axis represents the survival status (Died or Survived), and the y-axis represents the frequency of each category. It appears that more passengers died than survived.
display(figs['survived'])
- Passenger Class: This graph shows the distribution of passenger class. The x-axis represents the passenger class (First, Second, Third), and the y-axis represents the frequency of each class. It appears that the majority of passengers were in the third class.
display(figs['pclass'])
- Boat: This graph shows the distribution of boat numbers. The x-axis represents the boat numbers, and the y-axis represents the frequency of each boat. The distribution seems to be quite varied across different boats.
display(figs['boat'])
- Port of Embarkation: This graph shows the distribution of the port of embarkation. The x-axis represents the port (Southampton, Cherbourg, Queenstown), and the y-axis represents the frequency of each port. It appears that the majority of passengers embarked from Southampton.
display(figs['embarked'])
- Sex: This graph shows the distribution of sex among the passengers. The x-axis represents the sex (male or female), and the y-axis represents the frequency of each sex. It appears that there were more male passengers than female passengers.
display(figs['sex'])
Conclusion
We’ve made quite a journey through the Titanic dataset! We started with a raw dataset, explored its structure, handled missing values, and visualized the data using histograms and bar plots. We’ve seen the distribution of ages, fares, and family members aboard for the passengers. We’ve also looked at the distribution of categorical variables like passenger class, sex, and port of embarkation.
However, this is just the tip of the iceberg. We’ve yet to dive into the survival rates and the factors that might have influenced a passenger’s survival. We also haven’t yet applied any predictive modeling techniques to forecast a passenger’s survival based on their characteristics.
In our upcoming posts, we plan to delve deeper into the Titanic dataset. We’ll examine survival rates, explore correlations, and use more complex graphs to visualize the data. We’ll also apply machine learning techniques to build a predictive model for passenger survival.
Our journey with the Titanic dataset is far from over. We’ve only just set sail, and there’s a lot more to discover. So, stay tuned for our future posts as we continue to navigate through this fascinating dataset.