In this example, we will download some traffic citation data for the city of Bloomington, IN, load it into Python and generate a histogram. In doing so, you will be exposed to important Python libraries for working with data, such as numpy, pandas and matplotlib.
Data.gov is a government portal for open data, and the city of Bloomington, Indiana makes a number of datasets available there.
We will use traffic citations data for 2016.
To start, let’s create a separate directory for this project and download the CSV data:
$ cd ~/projects/i524
$ mkdir btown-citations
$ cd btown-citations
$ wget https://data.bloomington.in.gov/dataset/c543f0c1-1e37-46ce-a0ba-e0a949bd248a/resource/24841976-fd35-4483-a2b4-573bd1e77cfb/download/2016-first-quarter-citations.csv
Depending on your directory organization, the above might be slightly different for you.
If you go to the link to data.gov for Bloomington above, you will see that the citations data is organized per quarter, so there are a total of four files. Above, we downloaded the data for the first quarter. Go ahead and download the remaining three files with wget.
In this example, we will use three modules: numpy, pandas and matplotlib. If you set up virtualenv as described in the Python tutorial, the first two of these are already installed for you. To install matplotlib, make sure you’ve activated your virtualenv and use pip:
$ source ~/ENV/bin/activate
$ pip install matplotlib
If you are using a different distribution of Python, you will need to make sure that all three of these modules are installed.
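You can quickly verify that all three modules are importable, and see which versions you have, with a one-liner from the shell:
$ python -c "import numpy, pandas, matplotlib; print(numpy.__version__); print(pandas.__version__); print(matplotlib.__version__)"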
From the same directory where you saved the citations data, let’s start the Python interpreter and load the citations data for Q1 2016:
$ python
>>> from __future__ import division, print_function
>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> data = pd.read_csv('2016-first-quarter-citations.csv')
If the first import statement seems confusing, take a look at the Python tutorial. The next three import statements load each of the modules we will use in this example. The final line uses Pandas’ read_csv function to load the data into a Pandas DataFrame data structure.
You can verify that you are working with a DataFrame and use some of its methods to take a look at the structure of the data as follows:
>>> type(data)
<class 'pandas.core.frame.DataFrame'>
>>> data.index
Int64Index([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,
            ...
            197, 198, 199, 200, 201, 202, 203, 204, 205, 206],
           dtype='int64', length=207)
>>> data.columns
Index([u'Citation Number', u'Date Issued', u'Time Issued', u'Location ',
u'District', u'Cited Person Age', u'Cited Person Sex',
u'Cited Person Race', u'Offense Code', u'Offense Description',
u'Officer Age', u'Officer Sex', u'Officer Race'],
dtype='object')
>>> data.dtypes
Citation Number object
Date Issued object
Time Issued object
Location object
District object
Cited Person Age float64
Cited Person Sex object
Cited Person Race object
Offense Code object
Offense Description object
Officer Age float64
Officer Sex object
Officer Race object
dtype: object
>>> data.shape
(207, 13)
As you can see from the columns field, when the CSV file was read, the header line was used to populate the names of the columns in the DataFrame. In addition, you will notice that read_csv correctly inferred the data type of some columns, like Cited Person Age, but not of others, like Date Issued and Time Issued. read_csv is a very customizable function and, in general, you can correct issues like this using the dtype and converters parameters. In this specific case, it makes more sense to combine the Date Issued and Time Issued columns into a new column containing a timestamp. We will see how to do this shortly.
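As a taste of that customizability, here is a minimal sketch of a customized load (it assumes pandas’ default parser can handle this dataset’s date and time formats; the name of the combined column is generated automatically by pandas):
>>> data2 = pd.read_csv('2016-first-quarter-citations.csv',
...                     dtype={'Offense Code': str},
...                     parse_dates=[['Date Issued', 'Time Issued']])
>>> data2.columns
<Output omitted for brevity>
Here, dtype forces the Offense Code column to be read as strings, and the nested list passed to parse_dates asks read_csv to combine the two columns into a single parsed datetime column at load time. In the rest of this example we keep the original columns and build the timestamp by hand instead.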
You can also look at the data itself with the DataFrame’s head() and tail() methods:
>>> data.head()
<Output omitted for brevity>
>>> data.tail()
<Output omitted for brevity>
In addition to letting you examine your data easily, DataFrames have methods that help you deal with missing values:
>>> data = data.dropna(how='any')
>>> data.shape
<Output omitted for brevity>
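Dropping rows is not the only option. As an alternative to the dropna call above, fillna can substitute a placeholder value instead; a sketch, using -1 as an arbitrary sentinel for missing ages:
>>> # Sketch: fill missing ages with a sentinel value instead of dropping rows
>>> data['Cited Person Age'] = data['Cited Person Age'].fillna(-1)
Which approach is appropriate depends on your analysis; here we simply drop the incomplete rows.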
Adding columns to the data is also easy. Here, we add two columns. First, a DateTime Issued column that is a combination of the Date Issued and Time Issued columns originally in the data. Second, a Day of Week Issued column identifying what day of the week each citation was given. To understand this example better, take a look at the Python docs for the strptime and strftime functions in the datetime module.
>>> from datetime import datetime
>>> data['DateTime Issued'] = data.apply(
... lambda row: datetime.strptime(row['Date Issued'] + ':' + row['Time Issued'], '%m/%d/%y:%I:%M %p'), axis=1
... )
>>> data.columns
>>> data['Day of Week Issued'] = data.apply(
... lambda row: datetime.strftime(row['DateTime Issued'], '%A'), axis=1
... )
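As an aside, pandas can perform the same conversion without apply, using the vectorized pd.to_datetime function and the .dt accessor; a sketch, reusing the same format string as above:
>>> # Vectorized alternative to the row-by-row apply above
>>> data['DateTime Issued'] = pd.to_datetime(
...     data['Date Issued'] + ':' + data['Time Issued'], format='%m/%d/%y:%I:%M %p')
>>> data['Day of Week Issued'] = data['DateTime Issued'].dt.strftime('%A')
On large datasets the vectorized version is typically much faster than applying a Python function to every row.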
Let’s say we want to see how many citations were given each day of the week. We gather the data first:
>>> days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
>>> dow_data = [days.index(dow) for dow in data['Day of Week Issued']]
>>> dow_data
<Output omitted for brevity>
Then we use matplotlib to plot it:
>>> fig = plt.figure()
>>> ax = fig.add_subplot(1, 1, 1)
>>> plt.hist(dow_data, bins=len(days))
>>> plt.xticks(range(len(days)), days)
>>> plt.show()
You should see a histogram showing how many citations were issued on each day of the week.
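As a quick sanity check on the plot, pandas can compute the same per-day counts directly with the value_counts method:
>>> data['Day of Week Issued'].value_counts()
<Output omitted for brevity>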
DataFrames and numpy give us other ways to manipulate data. For example, we can plot a histogram of the ages of violators like this:
>>> ages = data['Cited Person Age'].astype(int)
>>> fig = plt.figure()
>>> ax = fig.add_subplot(1, 1, 1)
>>> plt.hist(ages, bins=np.max(ages) - np.min(ages))
>>> plt.show()
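Besides eyeballing the histogram, you can summarize the ages numerically with the describe() method, which reports the count, mean, extremes and quartiles of a Series:
>>> ages.describe()
<Output omitted for brevity>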
Surprisingly, we see some 116-year-old violators! This is probably an error in the data, so we can easily remove these data points and plot the histogram again:
>>> ages = ages[ages < 100]
>>> fig = plt.figure()
>>> ax = fig.add_subplot(1, 1, 1)
>>> plt.hist(ages, bins=np.max(ages) - np.min(ages))
>>> plt.show()
Oftentimes, you will want to save your matplotlib graph as a PDF or an SVG file instead of just viewing it on your screen. For both, we need to create a figure and plot the histogram as before:
>>> fig = plt.figure()
>>> ax = fig.add_subplot(1, 1, 1)
>>> plt.hist(ages, bins=np.max(ages) - np.min(ages))
Then, instead of calling plt.show(), we can invoke plt.savefig() to save as SVG:
>>> plt.savefig('hist.svg')
If we want to save the figure as a PDF instead, we can use the PdfPages class together with savefig():
>>> from matplotlib.backends.backend_pdf import PdfPages
>>> pp = PdfPages('hist.pdf')
>>> fig.savefig(pp, format='pdf')
>>> pp.close()
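Note that PdfPages is mainly useful when you want to collect several figures into a single PDF file. For a one-off figure like ours, savefig can also write a PDF directly, inferring the format from the file extension:
>>> fig.savefig('hist.pdf')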
There is a lot more to working with pandas, numpy and matplotlib than we can show you here, but hopefully this example has piqued your curiosity.
Don’t worry if you don’t understand everything in this example. For a more detailed explanation of these modules and the examples we worked through, please take a look at the tutorials below. The numpy and pandas tutorials are mandatory if you want to be able to use these modules, and the matplotlib gallery has many useful code examples.
According to the NumPy Web page, “NumPy is a package for scientific computing with Python. It contains a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, useful linear algebra, Fourier transform, and random number capabilities.”
Tutorial: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html
According to the Matplotlib Web page, “matplotlib is a python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms. matplotlib can be used in python scripts, the python and ipython shell (ala MATLAB® or Mathematica®), web application servers, and six graphical user interface toolkits.”
Matplotlib Gallery: http://matplotlib.org/gallery.html
According to the Pandas Web page, “Pandas is a library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.”
In addition to providing access to charts via matplotlib, it has elementary functionality for conducting data analysis. Pandas may be very suitable for your projects.
Tutorial: http://pandas.pydata.org/pandas-docs/stable/10min.html
Pandas Cheat Sheet: https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf
According to the SciPy Web page, “SciPy (pronounced ‘Sigh Pie’) is a Python-based ecosystem of open-source software for mathematics, science, and engineering.” The core packages of this ecosystem include NumPy, the SciPy library, Matplotlib, IPython, SymPy and pandas.
It is thus an agglomeration of useful packages and will probably suffice for your projects in case you use Python.
According to the ggplot Python Web page, ggplot is a plotting system for Python based on R’s ggplot2. It allows you to generate plots quickly with little effort. Often it may be easier to use than matplotlib directly.
Seaborn is another good library for plotting, built on top of matplotlib. It provides high-level templates for common statistical plots.
Tutorial: http://www.data-analysis-in-python.org/t_seaborn.html
Bokeh is an interactive visualization library with a focus on web browsers for display. Its goal is to provide an experience similar to D3.js.
Pygal is a simple API for producing graphs that can be easily embedded into your Web pages. It shows annotations when you hover over data points, and it also allows you to present the data in a table.