Titanic Dataset: A Deep Dive into EDA
Exploring Hidden Insights with Histograms and EDA
Welcome aboard as we embark on an exploratory journey through the Titanic dataset! The dataset carries valuable passenger information from the ill-fated voyage of the Titanic. Owing to its simplicity and the potential for fascinating insights, it has become a classic learning tool in data analysis and machine learning spheres. In this article, we will undertake an Exploratory Data Analysis (EDA) on the Titanic dataset with primary focus on understanding its structure, cleaning the data, and visualizing it using histograms and bar plots. This initial exploration will form the foundation for more complex analyses and predictive modeling in future posts.
We will initiate our EDA process by loading the dataset, studying its structure, handling any missing values and then delving into data visualization. We’ll employ histograms to understand the distribution of numerical data and bar plots to gain insights into categorical data. Each step will be discussed in detail, offering a comprehensive guide for beginners while serving as a refresher for seasoned data enthusiasts. So, buckle up as we set sail on this exploratory journey!
First of all, let’s load the dataset from the Tableau website and examine it.
import pandas as pd
url = 'https://public.tableau.com/app/sample-data/titanic%20passenger%20list.csv'
titanic_tableau = pd.read_csv(url)
The `.info()` method on a DataFrame in pandas provides a concise summary of the DataFrame, including information about its index, columns, non-null values, and memory usage. When we run this method on the `titanic_tableau` DataFrame, we receive the following output:
titanic_tableau.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 pclass 1309 non-null int64
1 survived 1309 non-null int64
2 name 1309 non-null object
3 sex 1309 non-null object
4 age 1046 non-null float64
5 sibsp 1309 non-null int64
6 parch 1309 non-null int64
7 ticket 1309 non-null object
8 fare 1308 non-null float64
9 cabin 295 non-null object
10 embarked 1307 non-null object
11 boat 486 non-null object
12 body 121 non-null float64
13 home.dest 745 non-null object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB
Interpreting DataFrame Output
The output from the code statements provides some detailed information about the ‘titanic_tableau’ DataFrame. The interpretation of this output is as follows:
DataFrame Structure
`<class 'pandas.core.frame.DataFrame'>`: This indicates that `titanic_tableau` is a pandas DataFrame.
DataFrame Size
`RangeIndex: 1309 entries, 0 to 1308`: Our DataFrame contains 1309 rows, indexed from 0 to 1308.
Data Columns and Description
`Data columns (total 14 columns)`: Our DataFrame includes 14 columns; detailed information for each column is listed below.
In the column information:

- `#` (the first column) represents the column number.
- `Column` comprises the names of all the columns in the DataFrame.
- `Non-Null Count` displays the count of non-missing values in each column.
- `Dtype` specifies the data type of each column.
Summary of Data Types
`dtypes: float64(3), int64(4), object(7)`: This summary lists the data types present in the DataFrame. `float64` and `int64` are numeric types (floats and integers respectively), while `object` usually pertains to strings (text) but can also hold other types of data. In our DataFrame, we have 3 float columns, 4 integer columns, and 7 object columns.
Memory Usage
`memory usage: 143.3+ KB`: This denotes the memory space utilized to store the DataFrame.
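The `+` in `143.3+ KB` signals that pandas measured the object columns shallowly, counting only the cell pointers; passing `deep=True` to `memory_usage` also counts the string contents. A minimal sketch on a toy frame (the same call works on `titanic_tableau`):

```python
import pandas as pd

# Toy frame with an object (string) column, like 'name' in the Titanic data
toy = pd.DataFrame({"name": ["Allen", "Allison"], "age": [29.0, 0.92]})

# Shallow measurement counts only the pointers for object cells;
# deep=True adds the size of the string payloads themselves
shallow = toy.memory_usage().sum()
deep = toy.memory_usage(deep=True).sum()
print(deep > shallow)  # True: strings add real bytes under deep measurement
```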
Sampling Both Ends of the Dataset
In this part of the Python code, we are manipulating and visualizing data from the famous Titanic dataset using Pandas, a powerful data analysis library in Python.
Our task here is to combine the first few and last few rows of the DataFrame. This technique is useful when we need a snapshot of our data from both ends. Often, in large datasets, a quick look at the top and bottom rows helps us understand the data structure and check whether data manipulation steps like sorting or adding new columns have been applied correctly.
The steps we’re following are as follows:
1. Set the number of rows: First, we set a constant `ROW_COUNT` to 7. This specifies the number of rows we want to extract from the top and bottom of the DataFrame.
2. Combine top and bottom rows: We use pandas' `head()` to get the first 7 rows from our DataFrame `titanic_tableau` and `tail()` to get the last 7 rows. `pd.concat()` then combines these extracted rows.
3. Display the result: Finally, we display the combined DataFrame `combined`, which now contains 14 rows: the first 7 and the last 7.
ROW_COUNT = 7
# Combine the first ROW_COUNT rows and the last ROW_COUNT rows
combined = pd.concat([titanic_tableau.head(ROW_COUNT), titanic_tableau.tail(ROW_COUNT)])
# Display the combined DataFrame
combined
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.00 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
5 | 1 | 1 | Anderson, Mr. Harry | male | 48.00 | 0 | 0 | 19952 | 26.5500 | E12 | S | 3 | NaN | New York, NY |
6 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.00 | 1 | 0 | 13502 | 77.9583 | D7 | S | 10 | NaN | Hudson, NY |
1302 | 3 | 0 | Yousif, Mr. Wazli | male | NaN | 0 | 0 | 2647 | 7.2250 | NaN | C | NaN | NaN | NaN |
1303 | 3 | 0 | Yousseff, Mr. Gerious | male | NaN | 0 | 0 | 2627 | 14.4583 | NaN | C | NaN | NaN | NaN |
1304 | 3 | 0 | Zabour, Miss. Hileni | female | 14.50 | 1 | 0 | 2665 | 14.4542 | NaN | C | NaN | 328.0 | NaN |
1305 | 3 | 0 | Zabour, Miss. Thamine | female | NaN | 1 | 0 | 2665 | 14.4542 | NaN | C | NaN | NaN | NaN |
1306 | 3 | 0 | Zakarian, Mr. Mapriededer | male | 26.50 | 0 | 0 | 2656 | 7.2250 | NaN | C | NaN | 304.0 | NaN |
1307 | 3 | 0 | Zakarian, Mr. Ortin | male | 27.00 | 0 | 0 | 2670 | 7.2250 | NaN | C | NaN | NaN | NaN |
1308 | 3 | 0 | Zimmerman, Mr. Leo | male | 29.00 | 0 | 0 | 315082 | 7.8750 | NaN | S | NaN | NaN | NaN |
Understanding DataFrame Columns
Each column in the DataFrame represents a distinct attribute related to passengers on the Titanic. The descriptions for these attributes are as follows:

Details of Columns:

- `pclass`: The passenger class. ‘1’ denotes 1st class, ‘2’ denotes 2nd class, and ‘3’ denotes 3rd class.
- `survived`: A binary variable indicating whether the passenger survived. ‘0’ means No (did not survive), and ‘1’ means Yes (survived).
- `name`: The name of the passenger.
- `sex`: The sex of the passenger.
- `age`: The age of the passenger.
- `sibsp`: The number of siblings/spouses of the passenger aboard the Titanic.
- `parch`: The number of parents/children of the passenger aboard the Titanic.
- `ticket`: The ticket number of the passenger.
- `fare`: The fare paid by the passenger.
- `cabin`: The cabin number where the passenger was accommodated.
- `embarked`: The port from which the passenger boarded the Titanic. ‘C’ stands for Cherbourg, ‘Q’ stands for Queenstown, and ‘S’ stands for Southampton.
- `boat`: If the passenger survived, the lifeboat number.
- `body`: If the passenger did not survive and their body was recovered, the body identification number.
- `home.dest`: The home/destination of the passenger.
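With these column definitions in hand, a quick way to see how a categorical attribute breaks down is `value_counts()`, which the bar plots later in this post also rely on. A small sketch with hypothetical port codes (on the real data, this would be `titanic_tableau['embarked'].value_counts()`):

```python
import pandas as pd

# Hypothetical embarkation codes, for illustration only
embarked = pd.Series(['S', 'S', 'C', 'Q', 'S', 'C'])

# value_counts() tallies each category, sorted by frequency
counts = embarked.value_counts()
print(counts)  # S: 3, C: 2, Q: 1
```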
Let’s sort these columns by data type.
# Get a Series with the data type of each column
data_types = titanic_tableau.dtypes
# Convert the data types to strings
data_types = data_types.astype(str)
# Sort the Series by the data type
sorted_columns = data_types.sort_values()
# Print the sorted column names and their data types
for column_name, data_type in sorted_columns.items():
print(f"{column_name}: {data_type}")
age: float64
fare: float64
body: float64
pclass: int64
survived: int64
sibsp: int64
parch: int64
name: object
sex: object
ticket: object
cabin: object
embarked: object
boat: object
home.dest: object
Data Preprocessing: Converting Certain Columns to Categories
The `pclass` and `survived` columns in the `titanic_tableau` DataFrame originally hold numeric values. However, within our context, these two columns do not represent numerical quantities. Instead, they are categorical data:

- `pclass` represents the passenger class; and
- `survived` indicates whether a passenger survived or not.
For this reason, we convert these two columns into categories for more meaningful and straightforward visualization:
titanic_tableau['pclass'] = titanic_tableau['pclass'].astype('category')
titanic_tableau['survived'] = titanic_tableau['survived'].astype('category')
Next, we would like to get an understanding of the datatype of each column in our DataFrame. We begin by extracting a Series with the data type of each column:
data_types = titanic_tableau.dtypes
Then, we convert these data types into strings and sort them:
data_types = data_types.astype(str)
sorted_columns = data_types.sort_values()
Finally, we print each column name along with its corresponding data type:
for column_name, data_type in sorted_columns.items():
print(f"{column_name}: {data_type}")
Hence, we have preprocessed our DataFrame by converting certain columns to categories and explored the data types of all columns within our dataset.
titanic_tableau['pclass'] = titanic_tableau['pclass'].astype('category')
titanic_tableau['survived'] = titanic_tableau['survived'].astype('category')
# Get a Series with the data type of each column
data_types = titanic_tableau.dtypes
# Convert the data types to strings
data_types = data_types.astype(str)
# Sort the Series by the data type
sorted_columns = data_types.sort_values()
# Print the sorted column names and their data types
for column_name, data_type in sorted_columns.items():
print(f"{column_name}: {data_type}")
pclass: category
survived: category
age: float64
fare: float64
body: float64
sibsp: int64
parch: int64
name: object
sex: object
ticket: object
cabin: object
embarked: object
boat: object
home.dest: object
Next, let’s use a different technique to check for null values.
# Checking for missing values
titanic_tableau.isnull().sum()
pclass 0
survived 0
name 0
sex 0
age 263
sibsp 0
parch 0
ticket 0
fare 1
cabin 1014
embarked 2
boat 823
body 1188
home.dest 564
dtype: int64
The dataset has missing values in several columns. The ones we will handle first are ‘age’, ‘fare’, ‘cabin’, and ‘embarked’:

column | # missing |
---|---|
age | 263 |
fare | 1 |
cabin | 1014 |
embarked | 2 |
We will need to handle these missing values before proceeding with the analysis. For the ‘age’ column, we can fill the missing values with the median age. For the ‘fare’ column, we can fill the missing value with the median fare. The ‘cabin’ column has a lot of missing values, so we might consider dropping it from our analysis.
Let’s perform these data cleaning steps.
Handling Missing Data and Unnecessary Columns in the Titanic Dataset
This overview will walk through the code explaining each step in detail:
Step 1: Filling Missing Values in ‘age’ and ‘fare’ Columns
titanic_tableau['age'].fillna(titanic_tableau['age'].median(), inplace=True)
titanic_tableau['fare'].fillna(titanic_tableau['fare'].median(), inplace=True)
In this segment, the code fills in any missing (NaN) values in the ‘age’ and ‘fare’ columns with the median value of the respective column. The `fillna` method replaces null values with a specified replacement value; here, `.median()` computes the middle value of the series for ages and fares. The argument `inplace=True` ensures that the change is applied directly to the DataFrame.
Step 2: Dropping the ‘cabin’ Column
titanic_tableau.drop('cabin', axis=1, inplace=True)
Here, the `drop` method is employed to remove the ‘cabin’ column from the DataFrame. `axis=1` refers to columns (`axis=0` would refer to rows). As in the previous step, `inplace=True` ensures that the column is permanently removed from the DataFrame.
Step 3: Checking for Remaining Missing Values in All Columns
titanic_tableau.isnull().sum()
Lastly, this line of code checks whether any missing values remain in the DataFrame. The `isnull()` method returns a DataFrame analogous to the original, filled with Boolean values indicating whether the corresponding original value was null. The `sum()` method then counts the number of `True` instances (i.e., missing values) in each column.
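The two-step mechanism described above can be seen on a tiny frame (toy values, assumed purely for illustration):

```python
import numpy as np
import pandas as pd

# isnull() yields a Boolean frame marking missing cells True
toy = pd.DataFrame({'age': [29.0, np.nan, 2.0],
                    'fare': [211.3375, 151.55, np.nan]})
mask = toy.isnull()
print(mask)

# sum() then counts the True values column by column
print(toy.isnull().sum())  # age: 1, fare: 1
```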
# Filling missing values
titanic_tableau['age'].fillna(titanic_tableau['age'].median(), inplace=True)
titanic_tableau['fare'].fillna(titanic_tableau['fare'].median(), inplace=True)
# Dropping the 'cabin' column
titanic_tableau.drop('cabin', axis=1, inplace=True)
# Checking for missing values again
titanic_tableau.isnull().sum()
pclass 0
survived 0
name 0
sex 0
age 0
sibsp 0
parch 0
ticket 0
fare 0
embarked 2
boat 823
body 1188
home.dest 564
dtype: int64
The missing values in the ‘age’ and ‘fare’ columns have been filled, and the ‘cabin’ column has been dropped. Now, the only column with missing values is ‘embarked’, which has 2 missing values. We can fill these with the most common port of embarkation.
Let’s perform this final data cleaning step.
Completing the Data Cleaning Process for the Titanic Dataset
The data preprocessing portion of the code includes a series of steps designed to clean and improve the quality of the dataset. This section will provide a comprehensive explanation of these steps which include filling missing values in the ‘embarked’ column and checking for remaining missing values amongst all columns.
Step 1: Filling Missing Values in ‘embarked’ Column
titanic_tableau['embarked'].fillna(titanic_tableau['embarked'].mode()[0], inplace=True)
In this first step, the code handles missing (NaN) values in the ‘embarked’ column by replacing them with the most frequently occurring value, the mode. The `fillna` method is employed to replace null values with the specified replacement value. The `.mode()[0]` operation returns the most common value in the ‘embarked’ series.

The `[0]` indexing is necessary because `.mode()` produces a Series, and we want the first (and often only) value from it. The `inplace=True` argument guarantees that the modification is made directly in the DataFrame itself.
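Why the `[0]` is needed can be seen in isolation: `.mode()` always returns a Series, because ties can produce several modes (toy values here):

```python
import pandas as pd

# 'S' occurs most often, so it is the single mode
ports = pd.Series(['S', 'C', 'S', 'Q', 'S'])
print(ports.mode())     # a Series holding 'S'
print(ports.mode()[0])  # 'S', the scalar we pass to fillna
```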
Step 2: Checking For Remaining Missing Values in All Columns
titanic_tableau.isnull().sum()
In the final step, the code checks if there are any more remaining NaNs left in the DataFrame. The function isnull()
is implemented to return a DataFrame whereby the locations of Null values are marked as True and non-null values as False.
Then, the sum()
function is used to count the number of True instances, i.e., missing values in each column. This allows us to glean the total count of missing values in each DataFrame column.
# Filling missing values in 'embarked' with the most common value
titanic_tableau['embarked'].fillna(titanic_tableau['embarked'].mode()[0], inplace=True)
# Checking for missing values again
titanic_tableau.isnull().sum()
pclass 0
survived 0
name 0
sex 0
age 0
sibsp 0
parch 0
ticket 0
fare 0
embarked 0
boat 823
body 1188
home.dest 564
dtype: int64
The output shows that all the missing values have been handled except for the following:
column | # missing |
---|---|
boat | 823 |
body | 1188 |
home.dest | 564 |
Let’s look at these columns and check whether we really need them for our analysis:

- `boat`: This column tells us whether a passenger escaped on a lifeboat. However, this is more of a spoiler than a clue, because if they were on a lifeboat, they likely survived. Using it as a predictor would be cheating: it’s like watching a movie knowing its ending!
- `body`: Now, this one is gloomy. It’s the identification number given to a recovered body. Again, knowing this would pretty much reveal whether a passenger survived. You don’t want spoilers in your data either!
- `home.dest`: This could tell us about the passenger’s home and destination. A cool piece of information! But it’s like a puzzle with half the pieces missing (564 values are missing!). Using it might make our predictions unreliable.
So, considering all the points above, we’re not too worried about dropping these columns from our analysis. We want a fair predictive model that relies on valid clues—not spoilers! And regarding the missing data, we can’t fill such large gaps without risking bias in our results.
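One hedged way to carry out that drop is on a copy, so the original frame keeps `boat` available for the bar plots later in this post (shown on a toy frame; the name `model_ready` is an assumption of this sketch, not from the original analysis):

```python
import pandas as pd

# Toy stand-in with the three leaky/incomplete columns under discussion;
# on the real data the same drop would apply to titanic_tableau
toy = pd.DataFrame({
    'age': [29.0, 2.0],
    'boat': ['2', None],
    'body': [None, 135.0],
    'home.dest': ['St Louis, MO', None],
})

# Drop on a copy: the original frame is left untouched
model_ready = toy.drop(columns=['boat', 'body', 'home.dest'])
print(list(model_ready.columns))  # ['age']
```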
Digging Deeper with Exploratory Data Analysis (EDA)
As we continue our journey into understanding our data set, we’ll now shift gears and immerse ourselves in the realm of Exploratory Data Analysis. This key step enables us to understand the main characteristics of our data, allowing us to draw meaningful insights and frame appropriate questions for further analysis.
To start off, let’s turn our attention towards histograms - these powerful graphical representations allow us to visualize the distribution of our numeric variables.
In simple terms, imagine slicing the range of our data into various bins and then counting the number of instances falling into each bin. The result? A histogram! The x-axis represents these bins defined as consecutive, non-overlapping intervals of variable values, while the y-axis corresponds to the frequency (counts) or density (if normalized), granting us an intuitive view of data concentration and trends.
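The slicing-and-counting idea can be sketched directly with NumPy’s `histogram` function, using a handful of hypothetical ages:

```python
import numpy as np

# Ten hypothetical ages (illustration only, not the real dataset)
ages = np.array([4, 22, 24, 29, 31, 35, 38, 45, 52, 71])

# Slice the range [4, 71] into 4 equal-width bins and count per bin
counts, bin_edges = np.histogram(ages, bins=4)

print(counts)     # [1 5 3 1]: instances falling into each bin
print(bin_edges)  # 5 edges defining the 4 consecutive intervals
```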
Now, without further ado, let’s delve straight into plotting and analyzing histograms for each of our numerical variables!
Introduction to the enhanced_histogram Function
The `enhanced_histogram` function is a robust tool within our code that we use for data visualization. It accepts a pandas DataFrame and generates histograms for each numerical variable present in the data.
Step 1: Importing Necessary Libraries
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
At the start of our script, we import the libraries that we need for data manipulation (Pandas, Numpy) and visualization (Seaborn, Matplotlib).
Step 2: Defining the Function
def enhanced_histogram(dataframe):
We start by defining the function, `enhanced_histogram`, which takes one parameter: a pandas DataFrame called `dataframe`.
Step 3: Selecting Numeric Columns
numeric_variables = dataframe.select_dtypes(include=[np.number])
In this step, we filter out the numeric columns from the DataFrame.
Step 4: Defining Custom Labels
custom_labels = { ... }
Next, we define a dictionary with custom labels for some of the variables in the DataFrame.
Step 5: Creating an Empty Figure Dictionary
figs = {}
Here, we initialize an empty dictionary where we will store the figures created in the next steps.
Step 6: Creating Histograms for Each Variable
for variable in numeric_variables:
We then iterate over each numeric variable (column). For each column, we plot a histogram using seaborn’s `histplot` function. If a custom label was defined for the variable, we use it; otherwise, we capitalize the variable name. We store each histogram and its formatting details in a temporary figure (`fig`).
Step 7: Storing the Figure and Closing
figs[variable] = fig
plt.close(fig)
We then add each figure to the `figs` dictionary, using the corresponding variable as the key, and close the figure to free up memory.
Step 8: Returning Figures
return figs
At last, the function returns the `figs` dictionary that contains all the histograms.
Step 9: Calling the Function
figs = enhanced_histogram(titanic_tableau)
Finally, we call the `enhanced_histogram` function on the `titanic_tableau` DataFrame. The returned figures are stored in the `figs` variable.
In conclusion, this script creates histograms for all numeric columns in a provided DataFrame. It also provides properly formatted labels for a specific set of known columns.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
def enhanced_histogram(dataframe):
# select numeric columns
numeric_variables = dataframe.select_dtypes(include=[np.number])
# define a dictionary for custom labels
custom_labels = {
"survived": "Survival Status",
"pclass": "Passenger Class",
"age": "Age of Passengers",
"sibsp": "Number of Siblings/Spouses Aboard",
"parch": "Number of Parents/Children Aboard",
"fare": "Passenger Fare",
"body": "Body Identification Number",
}
# create a dictionary to store the plots
figs = {}
for variable in numeric_variables:
# get feature
var = dataframe[variable]
# get custom label if it exists, otherwise use the variable name
label = custom_labels.get(variable, variable.capitalize())
# create histogram
fig, ax = plt.subplots(figsize=(12, 6))
sns.histplot(var, bins=30, kde=False, color=(135/255, 206/255, 235/255, 1), edgecolor='black', ax=ax)
# add title and labels
ax.set_title('Distribution of ' + label, fontsize=14, pad=20)
ax.set_xlabel(label, fontsize=12, labelpad=15)
ax.set_ylabel('Frequency', fontsize=12, labelpad=15)
# add grid and set its zorder to a lower value
ax.grid(axis='y', alpha=.33, zorder=-10)
ax.set_axisbelow(True)
# store the figure in the dictionary
figs[variable] = fig
# improve the layout
plt.tight_layout()
plt.close(fig)
    # return the dictionary of figures
return figs
figs = enhanced_histogram(titanic_tableau)
- Age of Passengers: This histogram shows the distribution of the ‘age’ column. The histogram shows that the majority of passengers were in their 20s and 30s, with a few older passengers.
display(figs['age'])
- Number of Siblings/Spouses Aboard: This histogram shows the distribution of the ‘sibsp’ column, which indicates the number of siblings or spouses each passenger had aboard the Titanic. The histogram shows that most passengers did not have siblings or spouses aboard.
display(figs['sibsp'])
- Number of Parents/Children Aboard: This histogram shows the distribution of the ‘parch’ column, which indicates the number of parents or children each passenger had aboard the Titanic. The histogram shows that most passengers did not have parents or children aboard.
display(figs['parch'])
- Passenger Fare: This histogram shows the distribution of the ‘fare’ column, which indicates the fare each passenger paid. The histogram shows that most fares were low, with a few high fares.
display(figs['fare'])
- Body Identification Number: This histogram shows the distribution of the ‘body’ column, which indicates the body identification number assigned to each passenger, if any. The histogram shows that most passengers did not have a body identification number, which means their bodies were not found.
display(figs['body'])
Transitioning from Histograms to Bar Plots
We’ve just finished with the exploration of our numerical data with histograms. As you’ve seen, a histogram provides valuable insights into how a numerical value is distributed across different intervals or ‘bins’. With histograms, we were able to understand various characteristics of Titanic passengers and their journey circumstances.
Histograms vs Bar Plots
It’s important to note that while histograms are best for numerical data, they do not work well for categorical data. This is where bar plots excel.
Unlike histograms, bar plots show discrete categories on one axis, with the bars' heights corresponding to the frequency, proportion, or any other metric of those categories. They are particularly useful when dealing with categorical variables, as they display distinct outcomes (e.g., Survival Status: Survived or Died) and how frequently they occur in the dataset.
In essence, while histograms are used for displaying distributions of numerical data, bar plots are used for comparing categorical data. Don’t get confused by the similar appearance - the context of use marks the difference!
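As a minimal sketch of the contrast: a bar plot draws one bar per discrete category, with no binning involved (toy data here; the fully styled version appears in the `enhanced_bar_plot` function below):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Count each category, then draw one bar per category
sex_counts = pd.Series(['male', 'male', 'female']).value_counts()

fig, ax = plt.subplots()
ax.bar(sex_counts.index, sex_counts.values, edgecolor='black')
ax.set_ylabel('Frequency')
plt.close(fig)  # close to free memory, as the functions in this post do
```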
With this understanding, let’s move forward to explore the categorical variables in our dataset using bar plots.
In the following section, we’ll generate bar plots for each categorical variable in our Titanic dataset. These plots will tell us about the distribution of different categories within these variables, which could provide further insights into the Titanic shipwreck story. Stay tuned!
Introduction to enhanced_bar_plot Function
The `enhanced_bar_plot` function is a versatile tool that we’ll be using in our code. It accepts a pandas DataFrame and creates bar plots for each categorical variable present in the data.
Step 1: Defining the function
def enhanced_bar_plot(dataframe, columns=None):
Here, we define the function named `enhanced_bar_plot`. This function accepts two parameters: a pandas DataFrame (`dataframe`) and an optional list of column names (`columns`).
Step 2: Selecting Columns for Bar Plots
if columns is not None:
categorical_variables = dataframe[columns]
else:
categorical_variables = dataframe.select_dtypes(include=['object', 'category'])
In this step, the function checks whether any specific columns are provided for plotting. If not, it selects all the columns that have an ‘object’ or ‘category’ datatype i.e., all the categorical columns from the dataframe.
Step 3: Setting Custom Labels and Initializing Figures Dictionary
custom_labels = {
"survived": "Survival Status",
"pclass": "Passenger Class",
"sex": "Sex",
"embarked": "Port of Embarkation",
"boat": "Boat Class"
}
figs = {}
Next, a dictionary called `custom_labels` is initialized, which maps our column names to more descriptive labels. Also, an empty dictionary `figs` is initialized to store the matplotlib figure objects of each plot.
Creation of Bar Plots
In this section, the function crafts individual bar plots for each categorical variable through several steps:
Step 4: Iterating Over Categorical Variables
for variable in categorical_variables:
var = dataframe[variable]
var_value = var.value_counts()
colors = sns.color_palette('deep')[0:len(var_value)]
The function then loops over each selected categorical variable. For each iteration, it counts the unique values present in the variable using the `value_counts()` method. A color palette is chosen using seaborn’s `color_palette()` function.
Step 5: Creating Figure and Subplot, Drawing Bar Plot
fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(var_value.index, var_value, color=colors, edgecolor='black')
A new figure and subplot are created for each variable. The bar plot is drawn with the unique categories as x-values and their frequencies as bar heights.
Step 6: Adding Titles and Labels
label = custom_labels.get(variable, variable.capitalize())
ax.set_title('Distribution of ' + label, fontsize=14, pad=20)
ax.set_xlabel(label, fontsize=12, labelpad=15)
ax.set_ylabel('Frequency', fontsize=12, labelpad=15)
Titles, the x-label, and the y-label are added to make the plot more informative, using either the user-defined label from `custom_labels` or the capitalized version of the column name.
Step 7: Customizing X-Tick Labels
if variable == 'embarked':
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(['Southampton', 'Cherbourg', 'Queenstown'])
elif variable == 'pclass':
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(['First', 'Second', 'Third'])
elif variable == 'survived':
ax.set_xticks([0, 1])
ax.set_xticklabels(['Died', 'Survived'])
elif variable == 'boat':
plt.xticks(rotation=45)
This part of the function customizes the x-tick labels. Depending on the nature of the variable in question, more meaningful labels are assigned to enhance readability.
Step 8: Adding Gridlines and Frequency Annotations
ax.grid(axis='y', alpha=0.5)
ax.set_axisbelow(True)
for i, v in enumerate(var_value):
ax.text(var_value.index[i], v + 0.01 * v, str(v), ha='center', fontweight='bold', fontsize=10)
Gridlines are added to make the graph easier to interpret. Also, each bar is annotated with its frequency value.
Step 9: Storing Figure and Closing Plot
figs[variable] = fig
plt.close()
Finally, the figure is stored in the `figs` dictionary using the column name as the key, and the plot is closed to free up memory.
Utilization of the Function
By virtue of these carefully designed steps, the `enhanced_bar_plot` function allows us to summarize and visualize the distribution of categorical variables efficiently. To use this function, simply pass your DataFrame and optionally a list of column names:

figs = enhanced_bar_plot(titanic_tableau, columns=['survived', 'pclass', 'boat', 'embarked', 'sex'])

This line generates bar plots for the columns ‘survived’, ‘pclass’, ‘boat’, ‘embarked’, and ‘sex’ in the `titanic_tableau` DataFrame.
def enhanced_bar_plot(dataframe, columns=None):
if columns is not None:
categorical_variables = dataframe[columns]
else:
categorical_variables = dataframe.select_dtypes(include=['object', 'category'])
custom_labels = {
"survived": "Survival Status",
"pclass": "Passenger Class",
"sex": "Sex",
"embarked": "Port of Embarkation",
"boat": "Boat Class",
}
figs = {}
for variable in categorical_variables:
var = dataframe[variable]
var_value = var.value_counts()
colors = sns.color_palette('deep')[0:len(var_value)]
fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(var_value.index, var_value, color=colors, edgecolor='black')
label = custom_labels.get(variable, variable.capitalize())
ax.set_title('Distribution of ' + label, fontsize=14, pad=20)
ax.set_xlabel(label, fontsize=12, labelpad=15)
ax.set_ylabel('Frequency', fontsize=12, labelpad=15)
if variable == 'embarked':
ax.set_xticks([0, 1, 2])
ax.set_xticklabels(['Southampton', 'Cherbourg', 'Queenstown'])
elif variable == 'pclass':
ax.set_xticks([1, 2, 3])
ax.set_xticklabels(['First', 'Second', 'Third'])
elif variable == 'survived':
ax.set_xticks([0, 1])
ax.set_xticklabels(['Died', 'Survived'])
elif variable == 'boat':
plt.xticks(rotation=45)
ax.grid(axis='y', alpha=0.5)
ax.set_axisbelow(True)
for i, v in enumerate(var_value):
ax.text(var_value.index[i], v + 0.01 * v, str(v), ha='center', fontweight='bold', fontsize=10)
figs[variable] = fig
plt.close()
return figs
figs = enhanced_bar_plot(titanic_tableau, columns=['survived', 'pclass', 'boat', 'embarked', 'sex'])
- Survival Status: This graph shows the distribution of survival status among the passengers. The x-axis represents the survival status (Died or Survived), and the y-axis represents the frequency of each category. It appears that more passengers died than survived.
display(figs['survived'])
- Passenger Class: This graph shows the distribution of passenger class. The x-axis represents the passenger class (First, Second, Third), and the y-axis represents the frequency of each class. It appears that the majority of passengers were in the third class.
display(figs['pclass'])
- Boat: This graph shows the distribution of boat numbers. The x-axis represents the boat numbers, and the y-axis represents the frequency of each boat. The distribution seems to be quite varied across different boats.
display(figs['boat'])
- Port of Embarkation: This graph shows the distribution of the port of embarkation. The x-axis represents the port (Southampton, Cherbourg, Queenstown), and the y-axis represents the frequency of each port. It appears that the majority of passengers embarked from Southampton.
display(figs['embarked'])
- Sex: This graph shows the distribution of sex among the passengers. The x-axis represents the sex (male or female), and the y-axis represents the frequency of each sex. It appears that there were more male passengers than female passengers.
display(figs['sex'])
Conclusion
We’ve made quite a journey through the Titanic dataset! We started with a raw dataset, explored its structure, handled missing values, and visualized the data using histograms and bar plots. We’ve seen the distribution of ages, fares, and family members aboard for the passengers. We’ve also looked at the distribution of categorical variables like passenger class, sex, and port of embarkation.
However, this is just the tip of the iceberg. We’ve yet to dive into the survival rates and the factors that might have influenced a passenger’s survival. We also haven’t yet applied any predictive modeling techniques to forecast a passenger’s survival based on their characteristics.
In our upcoming posts, we plan to delve deeper into the Titanic dataset. We’ll examine survival rates, explore correlations, and use more complex graphs to visualize the data. We’ll also apply machine learning techniques to build a predictive model for passenger survival.
Our journey with the Titanic dataset is far from over. We’ve only just set sail, and there’s a lot more to discover. So, stay tuned for our future posts as we continue to navigate through this fascinating dataset.