Load the Titanic dataset using the seaborn library. First, import seaborn, and then call the load_dataset('titanic') function:
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head(10)
Next, print out the top 10 instances of the dataset with the head(10) call; the output displays the first rows along with all of their columns.
The target feature could be either survived or alive, as both label whether a person survived the shipwreck. For the following steps, the variable chosen is survived; choosing alive instead would not affect the final shape of the variables.
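If you want to verify that both columns carry the same information, a quick cross-tabulation sketch (assuming the titanic DataFrame loaded above) can be used:

import pandas as pd
# Every row with survived == 0 should be labeled 'no',
# and every row with survived == 1 should be labeled 'yes'
pd.crosstab(titanic['survived'], titanic['alive'])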
Create a variable, X, to store the features, by using drop(). As explained previously, the selected target feature is survived, which is why it is dropped from the features matrix.
Create a variable, Y, to store the target matrix. Use indexing to access only the value from the column survived:
X = titanic.drop('survived', axis=1)
Y = titanic['survived']
Print out the shape of variable X, as follows:
X.shape
(891, 14)
Do the same for variable Y:
Y.shape
(891,)
Load the dataset and create the features and target matrices:
import seaborn as sns
titanic = sns.load_dataset('titanic')
X = titanic[['sex', 'age', 'fare', 'class', 'embark_town', 'alone']]
Y = titanic['survived']
X.shape
(891, 6)
Check for missing values in all features.
As we did previously, use isnull() to determine whether a value is missing, and use sum() to sum up the occurrences of missing values along each feature:
print("Sex: " + str(X['sex'].isnull().sum())) print("Age: " + str(X['age'].isnull().sum())) print("Fare: " + str(X['fare'].isnull().sum())) print("Class: " + str(X['class'].isnull().sum())) print("Embark town: " + str(X['embark_town'].isnull().sum())) print("Alone: " + str(X['alone'].isnull().sum()))
The output will look as follows:
Sex: 0
Age: 177
Fare: 0
Class: 0
Embark town: 2
Alone: 0
As you can see from the preceding output, only two features contain missing values: age and embark_town.
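As an aside, the same counts can be obtained in a single call by applying isnull().sum() to the whole features matrix; the following is just a shorter alternative sketch:

# Count the missing values across all columns of X at once
X.isnull().sum()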
As the age feature has many missing values, accounting for almost 20% of the total, those values should be replaced rather than eliminated. The mean imputation methodology will be applied, as shown in the following code:
# Age: missing values
mean = X['age'].mean()
mean = mean.round()
X['age'].fillna(mean, inplace=True)
After calculating the mean, the missing values are replaced by it using the fillna() function.
Note
A warning (SettingWithCopyWarning) may appear when running the preceding code, as the values are being replaced over a slice of the DataFrame. This happens because the variable X was created as a slice of the entire titanic DataFrame. As X is the variable that matters for the current exercise, it is not an issue to replace the values only over the slice and not over the entire DataFrame.
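If you prefer to avoid the warning altogether, one option (a sketch of an alternative, not the approach used in this exercise) is to make X an explicit copy of the slice and assign the result of fillna() instead of using inplace:

# Work on an explicit copy so pandas does not warn about
# modifying a slice of the original DataFrame
X = X.copy()
X['age'] = X['age'].fillna(mean)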
Given that the number of missing values in the embark_town feature is low, the instances are eliminated from the features matrix:
Note
To eliminate the missing values from the embark_town feature, it is required to eliminate the entire instance (observation) from the matrix.
# Embark_town: missing values
X = X[X['embark_town'].notnull()]
X.shape
(889, 6)
The notnull() function detects all non-missing values over the object in question. In this case, the function is used to obtain all non-missing values from the embark_town feature. Then, indexing is used to retrieve those values from the entire matrix (X).
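An equivalent way to achieve the same result, sketched below, is pandas' dropna() function with the subset argument, which removes any row that has a missing value in the listed columns:

# Drop the rows whose embark_town value is missing
X = X.dropna(subset=['embark_town'])
X.shape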
Discover the outliers present in the numeric features. Let's use three standard deviations as the measure to calculate the min and max thresholds. Using the formula that we have learned, the thresholds are calculated and compared against the min and max values of each feature:
feature = "age" print("Min threshold: " + str(X[feature].mean() - (3 * X[feature].std()))," Min val: " + str(X[feature].min())) print("Max threshold: " + str(X[feature].mean() + (3 * X[feature].std()))," Max val: " + str(X[feature].max()))
The values obtained for the above code are shown here:
Min threshold: -9.194052030619016  Min val: 0.42
Max threshold: 68.62075619259876  Max val: 80.0
Use the following code to calculate the min and max threshold for the fare feature:
feature = "fare" print("Min threshold: " + str(X[feature].mean() - (3 * X[feature].std()))," Min val: " + str(X[feature].min())) print("Max threshold: " + str(X[feature].mean() + (3 * X[feature].std()))," Max val: " + str(X[feature].max()))
The values obtained for the above code are shown here:
Min threshold: -116.99583207273355  Min val: 0.0
Max threshold: 181.1891938275142  Max val: 512.3292
As you can see from the preceding outputs, both features stay inside the range at the lower end but exceed the max threshold at the upper end.
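To see how many instances actually fall above the max threshold, a small counting sketch such as the following can be used (the thresholds are recomputed from the current matrix):

# Count the instances whose value exceeds the max threshold
for feature in ["age", "fare"]:
    max_threshold = X[feature].mean() + (3 * X[feature].std())
    print(feature + ": " + str((X[feature] > max_threshold).sum()))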
The total counts of outliers for the age and fare features are 7 and 20, respectively. Neither amount represents a high percentage of the total number of values, which is why the outliers are eliminated from the features matrix. The following snippet can be used to eliminate the outliers and print the shape of the resulting matrix:
# Age: outliers
max_age = X["age"].mean() + (3 * X["age"].std())
X = X[X["age"] <= max_age]
X.shape
(882, 6)

# Fare: outliers
max_fare = X["fare"].mean() + (3 * X["fare"].std())
X = X[X["fare"] <= max_fare]
X.shape
(862, 6)
Discover the outliers present in the text features. The value_counts() function is used to count the occurrences of the classes in each feature:
feature = "alone" X[feature].value_counts() True 522 False 340 feature = "class" X[feature].value_counts() Third 489 First 190 Second 183 feature = "alone" X[feature].value_counts() True 522 False 340 feature = "embark_town" X[feature].value_counts() Southampton 632 Cherbourg 154 Queenstown 76
None of the classes for any of the features are considered to be outliers, as they all represent over 5% of the entire dataset.
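To check that 5% condition directly, value_counts() also accepts a normalize argument that returns proportions instead of raw counts; the following is a minimal sketch for one of the features:

# Proportion of each class within the embark_town feature
X['embark_town'].value_counts(normalize=True)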
Convert all text features into their numeric representations. Use scikit-learn's LabelEncoder class, as shown in the following code:
from sklearn.preprocessing import LabelEncoder
enc = LabelEncoder()
X["sex"] = enc.fit_transform(X['sex'].astype('str'))
X["class"] = enc.fit_transform(X['class'].astype('str'))
X["embark_town"] = enc.fit_transform(X['embark_town'].astype('str'))
X["alone"] = enc.fit_transform(X['alone'].astype('str'))
Print out the top 5 instances of the features matrix to view the result of the conversion:
X.head()
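As a side note, a similar numeric conversion can be obtained with pandas alone by using category codes. This alternative was not used in this exercise, and the exact integer assigned to each class may differ from LabelEncoder's output; it is shown here only as a sketch:

# Pandas alternative: map each text class to an integer code
for col in ['sex', 'class', 'embark_town', 'alone']:
    X[col] = X[col].astype('category').cat.codes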
Finally, apply normalization (or standardization) to the matrix.
As you can see from the following code, the normalization formula is only applied to the features that need it. Given that normalization rescales the values to between 0 and 1, features that already meet that condition (such as sex and alone, which contain only 0s and 1s after encoding) do not need to be normalized:
X["age"] = (X["age"] - X["age"].min())/(X["age"].max()-X["age"].min()) X["fare"] = (X["fare"] - X["fare"].min())/(X["fare"].max()-X["fare"].min()) X["class"] = (X["class"] - X["class"].min())/(X["class"].max()-X["class"].min()) X["embark_town"] = (X["embark_town"] - X["embark_town"].min())/(X["embark_town"].max()-X["embark_town"].min()) X.head(10)
The head(10) call displays the top 10 rows of the final, fully preprocessed features matrix.
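Equivalently, scikit-learn's MinMaxScaler could rescale the selected columns in a single step instead of applying the formula manually; the following is only a sketch of that alternative:

from sklearn.preprocessing import MinMaxScaler

# Rescale only the columns that are not already in the 0-1 range
cols = ['age', 'fare', 'class', 'embark_town']
scaler = MinMaxScaler()
X[cols] = scaler.fit_transform(X[cols])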