Wednesday, January 17, 2018

Binary Classification of Titanic Survivors

I wanted to get some more machine learning practice in, and had heard about Trifacta in my Data Analysis and Visualization course, so I figured the [Titanic Kaggle exercise] would be fitting. I opted to use a Random Forest Classifier, along with feature scaling and a training/test split. The result was a model that was 84.93% accurate, with only 38 type I errors and 25 type II errors out of 418 samples.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


So I began by taking a look at the data. Luckily, we were provided with three separate files.
  • train.csv
  • test.csv
  • gender_submission.csv
These files contain the relevant information for us to make the binary classification. Some fields arrive already encoded, but the data is still quite messy. It came in the following format:

Variable  Definition                      Key
survival  Survival                        0 = No, 1 = Yes
pclass    Ticket class                    1 = 1st, 2 = 2nd, 3 = 3rd
sex       Sex
age       Age in years
sibsp     # Siblings / Spouses Onboard
parch     # Parents / Children Onboard
ticket    Ticket #
fare      Passenger Fare
cabin     Cabin number
embarked  Port of Embarkation             C = Cherbourg, Q = Queenstown, S = Southampton

Opening the datasets in Trifacta allows us to begin the preprocessing stage. Upon inspection of train.csv we can see that we have some missing values, as well as some mismatched values. Here, we'll take care of those, as well as drop unnecessary columns.



Because our datasets are ordered, and we want something we can work with, we'll remove the PassengerId column, and separate the Survived column into its own file, since our test data is already in that format.
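In pandas terms, that split looks roughly like the sketch below (toy rows stand in for train.csv; the real wrangling happened in Trifacta):

```python
import pandas as pd

# Toy stand-in for train.csv with the standard Kaggle columns.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
})

# Separate the target, and drop the ordered index column.
y = train["Survived"]
X = train.drop(columns=["PassengerId", "Survived"])
```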

I decided to encode my categorical variables as dummy variables, starting with Pclass. I encoded the three classes into binary columns, then removed one to avoid the dummy variable trap. This left me with two boolean columns, one for first class and another for third class. The second replacement was for Embarked, and here I made a difficult decision. I could have removed one of my three columns for Cherbourg, Queenstown, and Southampton, but noticed that I had a fourth possibility: null. I opted to let null remain as [0, 0, 0], while creating individual columns for Embarked_C, Embarked_Q, and Embarked_S. We'll see how that turns out. Finally, I one-hot encoded the Sex column, renaming it Male to indicate that 1 means male and 0 means female.
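The same encoding can be sketched in pandas (illustrative rows; the post did this in Trifacta). Dropping one Pclass column avoids the dummy variable trap, while Embarked keeps all three columns so a missing port becomes [0, 0, 0]:

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass": [1, 2, 3, 3],
    "Embarked": ["C", None, "S", "Q"],
    "Sex": ["male", "female", "male", "female"],
})

# Pclass: three dummies, drop one to avoid the dummy variable trap.
pclass = pd.get_dummies(df["Pclass"], prefix="Pclass").drop(columns=["Pclass_2"])

# Embarked: keep all three columns; NaN rows become all zeros.
embarked = pd.get_dummies(df["Embarked"], prefix="Embarked")

# Sex: one column, 1 = male, 0 = female.
male = (df["Sex"] == "male").astype(int).rename("Male")

encoded = pd.concat([pclass, embarked, male], axis=1)
```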

I noticed that the columns for Cabin and Ticket were riddled with seemingly meaningless content, as well as dozens of missing values -- so we dropped those too. Next we dealt with our numerical features, which were missing values for Age and Fare. I opted to replace these missing values with the column means, which fills the gaps without shifting either column's mean.
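That mean imputation is a one-liner per column in pandas (toy values below; the real data comes from train.csv):

```python
import pandas as pd

df = pd.DataFrame({"Age": [22.0, None, 38.0], "Fare": [7.25, 71.28, None]})

# Replace each NaN with its column's mean; the column mean is unchanged.
for col in ["Age", "Fare"]:
    df[col] = df[col].fillna(df[col].mean())
```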


Here's how it turned out:



The files were saved to their appropriate variables, ready to parse into Python.
  • train_wrangled_X.csv
  • train_wrangled_y.csv
  • test_wrangled.csv
  • y_test_wrangled.csv
We begin by importing our libraries and datasets.
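The loading step looks roughly like this (here the wrangled CSVs are simulated with StringIO so the sketch runs standalone; in the post, pd.read_csv points at the files listed above):

```python
from io import StringIO

import numpy as np
import pandas as pd

# Stand-ins for train_wrangled_X.csv and train_wrangled_y.csv.
X_train = pd.read_csv(StringIO("Pclass_1,Pclass_3,Male\n1,0,1\n0,1,0\n"))
y_train = pd.read_csv(StringIO("Survived\n0\n1\n"))
```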

Next, I had noticed that some of our values ranged anywhere from [0, 1] to [0, 512], so I opted to include [feature scaling]. A downside here is that we, unfortunately, won't be able to see which features have the most effect on our model, but this should improve our predictions, which is what we're aiming for anyway. The plus side to feature scaling is that we reduce the possibility that one feature will be overweighted in the algorithm simply because of its scale.
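With scikit-learn's StandardScaler, the key detail is fitting on the training data only and applying the same transform to the test data (values below are illustrative; the Fare-like column spans roughly [0, 512]):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0, 7.25], [1.0, 512.33], [1.0, 71.28]])
X_test = np.array([[0.0, 8.05]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # fit on train only
X_test_scaled = sc.transform(X_test)        # reuse train statistics
```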

Now that our datasets are cleaned and ready to be modeled, we fit the Random Forest Classifier and predict our results. Here we choose ten estimators for our ten features, with the entropy criterion and a fixed random state for reproducibility.
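That configuration in scikit-learn looks like the following (demonstrated on the bundled iris data; the post trains on the wrangled Titanic features):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Ten trees, entropy criterion, fixed seed for reproducibility.
clf = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
clf.fit(X, y)
preds = clf.predict(X)
```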

Finally, we test our results, finding the model to be 84.93% accurate, with only 38 type I errors and 25 type II errors out of 418 samples. I'd say that's pretty good accuracy!
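As a sanity check on those numbers: with 38 false positives and 25 false negatives out of 418 samples, the accuracy works out to exactly the figure quoted above.

```python
# 418 test samples, 38 type I errors (FP), 25 type II errors (FN).
total = 418
fp, fn = 38, 25
accuracy = (total - fp - fn) / total
print(round(accuracy * 100, 2))  # 84.93
```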

Just for fun, I tried using this scaled and cleaned data in a dense neural network using Keras, which resulted in an accuracy of 90.19%. I opted to use a softmax output layer, which handles the binary classification. After testing a few combinations of different activation functions, this model gave me the highest accuracy, and I felt pretty satisfied with the result.

If you want to check out the datasets, feel free to visit the github:
https://github.com/SLPeoples/Python-Excercises/tree/master/Titanic_Kaggle
