Monday, January 22, 2018

A Missing Generation: Comparison of U.S. Census Data (1900, 2000)

A more interesting assignment from my Data Analysis and Visualization course at the University of Washington involved creating a static visualization of U.S. Census data from 1900 and 2000. Here's a sample of what we started with.


I approached this visualization with the question: “Does comparing the population of different age groups reveal insights which support or oppose the idea of an ['aging population'] in the United States?” With a few adjustments, the data can tell us a lot more than we might expect.

I began by encoding Gender as “Men” and “Women”, and calculated the “Percent of Total Population” for each cell by dividing each cell's value by the sum of all values sharing the same “Year”. The year in which each age group was born was then calculated by subtracting the “Age” from the “Year”; this was saved as “YearBorn” and used in the tooltip.
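The original derivation was done inside the visualization tool, but the same two calculated fields can be sketched in pandas. The column names and population counts below are hypothetical stand-ins for the census extract, not the real figures:

```python
import pandas as pd

# Hypothetical sample in the same shape as the census extract:
# one row per (Year, Age, Gender) with a raw population count.
df = pd.DataFrame({
    "Year":   [1900, 1900, 2000, 2000],
    "Age":    [0, 40, 0, 40],
    "Gender": ["Men", "Women", "Men", "Women"],
    "People": [1_811_000, 540_000, 1_795_000, 2_245_000],
})

# Percent of Total Population: each cell divided by the sum of its Year.
df["PercentOfTotal"] = (
    df["People"] / df.groupby("Year")["People"].transform("sum") * 100
)

# YearBorn: the census year minus the age group.
df["YearBorn"] = df["Year"] - df["Age"]
```

Within each year the percentages then sum to 100, which is what makes the two snapshots directly comparable.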


Once all the necessary variables were assigned, I could proceed with the analysis and comparisons. Standardizing the population as percentages removes the differences in magnitude, although the individual values are hidden at first glance. The differences between the genders are seemingly negligible, so it is appropriate to combine their values for the majority of the analysis.

When experimenting with different chart types, the area graph best represented the change in demography at these two snapshots in history. The most useful property of the area charts is that the shaded regions each sum to 100%, so direct comparisons can be made between years. The two graphs are overlaid on one another, with blue and orange chosen for their ability to capture attention and to distinguish the two years within the same chart.

The overlay reveals stark differences. There is a significantly lower proportion of individuals born between 1965 and 2000, which breaks from the trend seen in the rest of the visualization. This suggests that fertility rates dropped significantly around this time, which is consistent with the fact that [oral contraceptives were first approved by the Food and Drug Administration in 1960]. This reduction has raised the median age of the United States, which is apparent from the space between the overlaid charts: individuals above the age of 40 make up a larger share of the total population in 2000 than in 1900, by a roughly constant three percentage points. That share is taken from individuals under 40, most visibly among newborns, who represented over twenty-four percent of the population in 1900 but less than fourteen percent by 2000, a difference of more than ten percentage points.
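The published chart was built interactively, but the overlay idea can be sketched with matplotlib. The percentage values below are made-up placeholders, not the actual census figures:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs anywhere
import matplotlib.pyplot as plt

ages = list(range(0, 81, 10))
# Hypothetical percent-of-total values per age bucket, one series per year.
pct_1900 = [24, 21, 18, 14, 10, 7, 4, 1.5, 0.5]
pct_2000 = [14, 15, 14, 13, 15, 12, 9, 5, 3]

# Overlay the two years as semi-transparent filled areas so the gap
# between them is visible directly.
fig, ax = plt.subplots()
ax.fill_between(ages, pct_1900, color="tab:orange", alpha=0.5, label="1900")
ax.fill_between(ages, pct_2000, color="tab:blue", alpha=0.5, label="2000")
ax.set_xlabel("Age")
ax.set_ylabel("Percent of Total Population")
ax.legend()
fig.savefig("census_overlay.png")
```

The semi-transparent fills make the space between the two distributions, the "missing generation", readable at a glance.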

Because the median age of the United States has increased and the representation of younger people in the national census has dramatically decreased, signs of an aging population can be observed. Area graphs of population proportion versus age suit this question well, even though the true values are not immediately shown; hovering over points and selecting different portions of the visualization reveals those values, as well as when each age group was born, which helps guide the viewer through the data. This dataset is incredibly useful in an educational setting, where budding analysts are tasked with manipulating and contorting the data to reveal valuable insights. Viewing the data without standardization leads to the opposite conclusion, which could steer people toward incorrect or uninformed decisions. Although technology allows us to quickly analyze and understand huge datasets, it can be initially overwhelming, and exercises such as this are a great way to learn what does, and doesn't, work.


  1. Select the different colored portions of the visualization. Does this change the information you can see?
  2. Hover over the chart and describe the "Percent of Total" and the number of people aged 40 in 1900 and in 2000. Why is the population so much greater when the proportion is lower?
  3. What differences between individuals under 40 exist, and what does this suggest about the fertility rate of the United States?
  4. In the area charts below, which gender appears to live the longest in the year 2000?
  5. What is the general relationship between the genders?

Wednesday, January 17, 2018

Binary Classification of Titanic Survivors

I wanted to get some more machine learning practice in, and had heard about Trifacta in my Data Analysis and Visualization course, so I figured the [Titanic Kaggle exercise] would be fitting. I opted for a Random Forest classifier, using feature scaling and split training/test data. The result was a model that was 84.93% accurate, with 38 type I errors and 25 type II errors out of 418 samples.
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.


So I began by taking a look at the data. Luckily, we were provided with three separate files.
  • train.csv
  • test.csv
  • gender_submissions.csv
These files contain the relevant information for us to make the binary classification. Initially, we have some encoding, but our data is still quite messy. Our data arrived in the following format:

Variable    Definition                      Key
survival    Survival                        0 = No, 1 = Yes
pclass      Ticket class                    1 = 1st, 2 = 2nd, 3 = 3rd
sex         Sex
Age         Age in years
sibsp       # Siblings / Spouses Onboard
parch       # Parents / Children Onboard
ticket      Ticket #
fare        Passenger Fare
cabin       Cabin number
embarked    Port of Embarkation             C = Cherbourg, Q = Queenstown, S = Southampton

Opening the datasets in Trifacta allows us to begin the preprocessing stage. Upon inspection of train.csv we can see that we have some missing values, as well as some mismatched values. Here, we'll take care of those, as well as drop unnecessary columns.



Because our datasets are ordered, and we want to create something we can work with, we'll remove the PassengerID column, as well as separate the Survived column into its own file, since our test data is already in that format.

I decided to encode my categorical variables using dummy variables, starting with Pclass. I encoded the three classes into binary columns, then removed one column to avoid the dummy variable trap, leaving two boolean columns: one for First Class and another for Third Class. The second replacement was for Embarked, and here I made a difficult decision. I could have removed one of the three columns for Cherbourg, Queenstown, and Southampton, but noticed a fourth possibility: null. I opted to let null remain as [0,0,0] while creating individual columns for Embarked_C, Embarked_Q, and Embarked_S. We'll see how that turns out. Finally, I one-hot encoded the Sex column, renaming it to Male to indicate that 1 is Male and 0 is Female.
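The encoding itself was done in Trifacta, but the same three transformations can be sketched with pandas; the toy rows below are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "Pclass":   [1, 2, 3, 1],
    "Embarked": ["C", "Q", None, "S"],
    "Sex":      ["male", "female", "female", "male"],
})

# Pclass: encode the three classes, then drop one column (Second Class
# here) to avoid the dummy variable trap.
pclass = pd.get_dummies(df["Pclass"], prefix="Pclass").drop(columns="Pclass_2")

# Embarked: keep all three columns, so a null row becomes [0, 0, 0]
# (get_dummies leaves NaN rows all-zero by default).
embarked = pd.get_dummies(df["Embarked"], prefix="Embarked")

# Sex: a single Male column, 1 for male and 0 for female.
df["Male"] = (df["Sex"] == "male").astype(int)
```

Keeping all three Embarked columns is what lets the null passengers survive as an all-zero row rather than being silently folded into one of the ports.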

I noticed that the Cabin and Ticket columns were riddled with seemingly meaningless content, as well as dozens of missing values, so we went ahead and dropped those too. Next we dealt with our numerical features, which were missing values for Age and Fare. I opted to replace these missing values with the column averages, which should have minimal effect on the overall analysis.
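Mean imputation is a one-liner per column; a minimal sketch with made-up rows:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wrangled Titanic numericals with gaps.
df = pd.DataFrame({"Age": [22.0, np.nan, 35.0], "Fare": [7.25, 71.28, np.nan]})

# Replace each missing Age and Fare with that column's mean.
for col in ["Age", "Fare"]:
    df[col] = df[col].fillna(df[col].mean())
```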


Here's how it turned out:



The files were saved to their appropriate variables, ready to parse into Python.
  • train_wrangled_X.csv
  • tran_wrangled_y.csv
  • test_wrangled.csv
  • y_test_wrangled.csv
We begin by importing our libraries and datasets.

Next, I noticed that our feature ranges varied anywhere from [0,1] to [0,512], so I opted to include [feature scaling]. A downside here is that we, unfortunately, won't be able to see which features have the most effect on our model, but scaling improves our prediction, which is what we're aiming for anyway. The upside is that we reduce the chance that any one feature is overweighted in the algorithm, and mitigate any codependency between features.
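With scikit-learn the scaling step might look like the sketch below; the feature matrix is a hypothetical stand-in mimicking the [0,1] and [0,512] ranges mentioned above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: one binary column, one wide-ranged Fare-like column.
X_train = np.array([[0.0, 512.0], [1.0, 8.0], [0.0, 32.0], [1.0, 64.0]])
X_test = np.array([[1.0, 16.0]])

# Fit the scaler on the training data only, then apply the same
# transform to the test data to avoid leakage.
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)
```

Fitting on the training split and reusing the same scaler on the test split keeps the two sets on an identical scale without letting test statistics leak into the model.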

Now that our datasets are cleaned and ready to be modeled, we use the Random Forest classifier and predict our results. Here we choose ten estimators for our ten features, with the entropy criterion and a fixed random state so the run can be reproduced.
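The classifier step might look like the following sketch; make_classification stands in for the wrangled Titanic features, while the hyperparameters mirror the ones described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data with ten features, mirroring the ten wrangled columns.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Ten estimators, entropy criterion, fixed random state for reproducibility.
clf = RandomForestClassifier(n_estimators=10, criterion="entropy",
                             random_state=0)
clf.fit(X, y)
pred = clf.predict(X)
```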

Finally, we test our results, finding a model that was 84.93% accurate, with 38 type I errors and 25 type II errors out of 418 samples. I'd say that's pretty good accuracy!
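The type I and type II counts can be read straight off a confusion matrix; a minimal sketch with toy labels in place of the real predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for illustration.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0])

# confusion_matrix lays counts out as [[TN, FP], [FN, TP]]:
# FP are type I errors (false positives), FN are type II (false negatives).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = accuracy_score(y_true, y_pred)
```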

Just for fun, I tried using this scaled and cleaned data in a dense neural network using Keras, which resulted in an accuracy of 90.19%. I opted to use a softmax output layer, which works for the binary classification. After testing a few combinations of activation functions, I was able to get the highest accuracy from this model, and felt pretty satisfied with the result.

If you want to check out the datasets, feel free to visit the GitHub repo:
https://github.com/SLPeoples/Python-Excercises/tree/master/Titanic_Kaggle