CODINGNEWS

City of Austin Employee Data Analysis

May 19

The city of Austin has a very nice Open Data Portal. After digging a little bit I found that the most popular datasets in the portal are related to dogs, which it is not a surprise, since we are talking about the city of Austin. The truth is that there is a lot of data available. For this example, I’m going to be using the City of Austin Employee Detail Information for two reasons. One, it’s always interesting to see salaries and two, because I can try to reproduce the analysis that the people from the Texas Tribune are doing with the same data set.

Getting started

Obviously, I’m going to be using Python with Anaconda. You can find the installer and the instructions at the Continuum Analytics’s website. Once Anaconda has been installed, we are ready to go, since all the libraries are already part of the distribution.

We will start by importing pandas, numpy and matplotlib:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Then we will load the CSV file into pandas. Since the column Annual Salary has a data type of dtype('O'), we will need to convert into a float. As you can imagine, the problem is the $ sign:

df = pd.read_csv('City_of_Austin_Employee_Detail_Information.csv')
df[['Emp ID', 'Last', 'Annual Salary']][0:10]
| index  | Emp ID   | Last            | Annual Salary |
|--------|----------|-----------------|---------------|
| 0      | 10000003 | Turner          | $62608.00     |
| 1      | 10000008 | Michael         | $146101.28    |
| 2      | 10000015 | Greco           | $96345.60     |
| 3      | 10000022 | Wiswell-DeCampo | $45676.80     |
| 4      | 10000026 | White           | $109867.68    |
| 5      | 10000039 | Candoli         | $100800.96    |
| 6      | 10000066 | Mason           | $126349.60    |
| 7      | 10000084 | Lynch           | $94203.20     |
| 8      | 10000127 | Wosky           | $93977.52     |
| 9      | 10000132 | Ringuette       | $75487.36     |

We will create a new column called annual_salary with a parsed value from Annual Salary, so we can play with the data from this column.

df['annual_salary'] = df['Annual Salary'].map(lambda x: float(x.strip('$')))

Interesting findings

Most of the employees (28%) earns between 31k - 63k:

Pirateboxers

33% of males employees earn between 63k and 94k, against only 18% of female employees.

df[df['Gender'] == 'M'].groupby('salary_range')['Gender'].count()[2]/len(df[df['Gender'] == 'M'])
# => 0.33009231978000392
df[df['Gender'] == 'F'].groupby('salary_range')['Gender'].count()[2]/len(df[df['Gender'] == 'F'])
# => 0.18786289567334707

There is more people working in Parks & Recreation (2362) than in the police (2358).

Finally it’s pretty clear that most of the employees are white:

Pirateboxers

The rest of the analysis is nothing but pandas standard and you can check the jupyter notebook here. You can also play with it using Binder here (find it under the notebooks directory).

comments powered by Disqus