Coding_with_Python_in_Jupyter_Notebooks

Running Python in Notebooks


Jupyter (which grew out of IPython) is an interactive computing platform that connects the shell and your browser so that you can code and visualize the results in one environment. The closest analog is an RMarkdown document. An important advantage is that you can build your results and analysis step by step, write your text, and view graphs and output without constantly re-rendering the entire document to see updates. This strength is also a weakness: you may inadvertently introduce errors, load data, or create variables in the wrong order, causing failures when you or others try to replicate the analysis.


Basics of Jupyter (IPython) Notebooks

To start up a Jupyter notebook server, simply navigate to the directory where you want the notebooks to be saved and run the following command:

jupyter notebook

A browser should open with a notebook navigator. From here, you can either click to open an existing notebook or click the "New" button and select "Python 3". You may rename any notebook by clicking its title at the top of the page.

Notebooks are sequences of cells. A cell can contain Markdown, code, or raw text. As in RMarkdown, you can render Markdown and execute code cell by cell.


Calling the Command Line from Jupyter

Note: you can call the command line from Jupyter by prefixing a command with "!".

For example, you could use this to commit your work with Git:

In [1]:
!git status
On branch master
Your branch is up-to-date with 'origin/master'.
nothing to commit, working directory clean

Getting the Data

Today, we will be using data from the NOAA National Climatic Data Center.

Further documentation on this data is available here: http://doi.org/10.7289/V5D21VHZ. A readme file and codebook for the data are also available.

We will download the daily data, as well as data on the ghcnd stations using curl.

In [2]:
#Download the Data Using Curl
!curl -O http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2016.csv.gz
!gunzip 2016.csv.gz
!curl -O http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  146M  100  146M    0     0  7363k      0  0:00:20  0:00:20 --:--:-- 8656k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 8461k  100 8461k    0     0  2487k      0  0:00:03  0:00:03 --:--:-- 2487k

Exploratory Data Analysis with Pandas

Pandas is a library created by Wes McKinney that brings the R-like dataframe object to Python and makes working with data in Python a lot easier. Its operations are often faster than base R dataframes, and it goes a long way toward making Python a full-fledged alternative to R for data analysis (except for ggplot2).

We start by importing the libraries we're going to use: pandas and numpy.

For R users, importing a package is equivalent to library(package_name).

Note that packages are often loaded in Python using an alias, as shown below. In this way, when we later call a pandas function, we can refer to it as pd.function_name instead of pandas.function_name.

In [1]:
# Import Statements
import pandas as pd
import numpy as np

#Turn off Notebook Package Warnings
import warnings
warnings.filterwarnings('ignore')

In Jupyter notebooks, some utilities can be enabled with magic statements, indicated by a %, not to be confused with the Python modulo operator %. Magic commands apply only to the Jupyter notebook and must begin a line.

In [2]:
#Magic Statement
%matplotlib inline

We will also set the file locations. If you cloned the repository and ran the steps sequentially, the relative paths should be correct.

In [3]:
# File Locations
# Change these on your machine (if needed)
datadir = ""
weather_data_raw = datadir + "2016.csv"
station_data_raw = datadir + "ghcnd-stations.txt"

Loading Data into a Pandas DataFrame

So far we've been working with raw text files. That's one way to store and interact with data, but only a limited set of functions can take raw text as input. Python has an amazing array of data structures that give you a lot of extra power when working with data.

Built-in Data Structures

  • strings ""
  • lists []
  • tuples ()
  • sets {}
  • dictionaries {'key':value}

Additional Essential Data Structures

  • numpy arrays ([])
  • pandas Series
  • pandas DataFrame
  • tensorflow Tensors
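As a quick sketch (with made-up values in the spirit of the weather data), each of these structures can be constructed literally:

```python
# Built-in structures
station_name = "GHCND"              # string
measurements = ["PRCP", "TMAX"]     # list (ordered, mutable)
pair = ("TMAX", "TMIN")             # tuple (ordered, immutable)
kinds = {"PRCP", "SNOW", "PRCP"}    # set (duplicates collapse)
station = {"id": "USW00094846"}     # dictionary (key -> value)

# Additional structures
import numpy as np
import pandas as pd

values = np.array([3.0, 95.0, 23.0])      # numpy array: fast numeric vector
series = pd.Series(values, name="value")  # pandas Series: array plus a labeled index
```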

Today we'll primarily be working with the pandas DataFrame. The pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes. It's basically a spreadsheet you can program, and it's an incredibly useful Python object for data analysis.
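For a concrete feel, a tiny DataFrame can be built directly from a dictionary of columns (the values here are invented to mimic the weather data):

```python
import pandas as pd

# Each dictionary key becomes a labeled column; rows share a common index
df = pd.DataFrame({
    "station_id": ["US1FLSL0019", "NOE00133566"],
    "measurement": ["PRCP", "TMAX"],
    "value": [3, 95],
})
```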

You can load data into a dataframe using Pandas' excellent read_* functions.

Today, we will use read_table.

Tips on Documentation:

  • Tab completion works on function, method, and variable names.
  • Jupyter will pull up the docstring for a command if you follow it with a question mark (?).
  • Jupyter will show the allowable arguments if you press Shift + Tab.
In [4]:
#Try Getting Help (Esc to Exit)
pd.read_table?

Load the Data with Pandas

In [5]:
weather = pd.read_table(weather_data_raw, sep=",", header=None)
stations = pd.read_table(station_data_raw, header=None)

View the Data in Pandas

There are lots of options for viewing data in pandas. Just as we did at the command line, you can use head and tail to get a quick view of the data.

In [6]:
weather.head()
Out[6]:
0 1 2 3 4 5 6 7
0 US1FLSL0019 20160101 PRCP 3 NaN NaN N NaN
1 NOE00133566 20160101 TMAX 95 NaN NaN E NaN
2 NOE00133566 20160101 TMIN 23 NaN NaN E NaN
3 NOE00133566 20160101 PRCP 37 NaN NaN E NaN
4 USC00141761 20160101 TMAX 22 NaN NaN 7 700.0
In [7]:
weather.tail()
Out[7]:
0 1 2 3 4 5 6 7
25992850 USR0000CSCN 20161019 TMIN 28 H NaN U NaN
25992851 USR0000CSCN 20161019 TAVG 98 NaN NaN U NaN
25992852 USR0000ABAN 20161019 TMAX 317 H NaN U NaN
25992853 USR0000ABAN 20161019 TMIN 172 H NaN U NaN
25992854 USR0000ABAN 20161019 TAVG 241 NaN NaN U NaN
In [8]:
#Dimensions of the DataFrame
weather.shape
Out[8]:
(25992855, 8)
In [9]:
#Types of Data
weather.dtypes
Out[9]:
0     object
1      int64
2     object
3      int64
4     object
5     object
6     object
7    float64
dtype: object

Note: You'll notice that some commands have looked like pd.something(), some like data.something(), and some like data.something without (). The difference is between pandas functions or classes, methods, and attributes. Methods are actions you take on a dataframe or series, while attributes are descriptors of the dataframe or series.
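A minimal sketch of the three flavors side by side:

```python
import pandas as pd

# pd.DataFrame is a class called from the pandas namespace: pd.something()
df = pd.DataFrame({"value": [1, 2, 3]})

top = df.head(2)   # head() is a method: an action, called with parentheses
dims = df.shape    # shape is an attribute: a descriptor, no parentheses
```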


Modifying your DataFrame

Notice that we don't have column names and some columns are full of null values. Let's fix that by learning how to name and delete columns.

In [10]:
#Current Data Columns
weather.columns
Out[10]:
Int64Index([0, 1, 2, 3, 4, 5, 6, 7], dtype='int64')
In [11]:
#Drop columns 4, 5, 6, & 7
weather.drop([4,5,6,7], axis=1, inplace=True)
In [12]:
weather.head()
Out[12]:
0 1 2 3
0 US1FLSL0019 20160101 PRCP 3
1 NOE00133566 20160101 TMAX 95
2 NOE00133566 20160101 TMIN 23
3 NOE00133566 20160101 PRCP 37
4 USC00141761 20160101 TMAX 22

Note that the header of the data frame has no column names, only numbers. We can assign column names by creating a list and setting the data frame columns equal to the list.

In [13]:
#Assign Column Names Using a List
weather_cols = ['station_id','date','measurement','value']
weather.columns = weather_cols
weather.head()
Out[13]:
station_id date measurement value
0 US1FLSL0019 20160101 PRCP 3
1 NOE00133566 20160101 TMAX 95
2 NOE00133566 20160101 TMIN 23
3 NOE00133566 20160101 PRCP 37
4 USC00141761 20160101 TMAX 22

Describing the Entire DataFrame

Now that we have columns, we want a better global view of our data. There are several ways to summarize a dataframe; describe() is one of the most useful.

In [14]:
weather.describe()
Out[14]:
date value
count 2.599286e+07 2.599286e+07
mean 2.016053e+07 9.316203e+01
std 2.715776e+02 3.132428e+02
min 2.016010e+07 -9.990000e+02
25% 2.016031e+07 0.000000e+00
50% 2.016052e+07 8.000000e+00
75% 2.016073e+07 1.390000e+02
max 2.016102e+07 2.374900e+04
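By default describe() summarizes only the numeric columns; passing include='object' summarizes the string columns instead. A sketch on a toy frame (rather than the 26-million-row weather data):

```python
import pandas as pd

df = pd.DataFrame({
    "station_id": ["US1FLSL0019", "NOE00133566", "NOE00133566"],
    "value": [3, 95, 23],
})

numeric_summary = df.describe()                 # count, mean, std, quartiles
object_summary = df.describe(include="object")  # count, unique, top, freq
```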

Selecting and Subsetting in Pandas

One of the biggest benefits of having a multi-index object like a DataFrame is the ability to easily select rows, columns, and subsets of the data. Let's learn how to do that.

First we will select individual series from the dataframe.

In [15]:
weather['measurement'].head()
Out[15]:
0    PRCP
1    TMAX
2    TMIN
3    PRCP
4    TMAX
Name: measurement, dtype: object
In [16]:
#using . notation
weather.measurement.head()
Out[16]:
0    PRCP
1    TMAX
2    TMIN
3    PRCP
4    TMAX
Name: measurement, dtype: object
In [17]:
#subset by row index
weather.measurement[3:10]
Out[17]:
3    PRCP
4    TMAX
5    TMIN
6    TOBS
7    PRCP
8    SNOW
9    SNWD
Name: measurement, dtype: object
In [18]:
#Use the iloc method 
weather.iloc[7:14,2:4] #Rows 7 through 13, columns 2 and 3
Out[18]:
measurement value
7 PRCP 0
8 SNOW 0
9 SNWD 0
10 TMAX -25
11 TMIN -177
12 TOBS -61
13 PRCP 0

Now let's subset on row values.

In [19]:
#Create a Boolean Series based on a condition
example_bool = weather['measurement']=='PRCP'
example_bool.head()
Out[19]:
0     True
1    False
2    False
3     True
4    False
Name: measurement, dtype: bool
In [20]:
#Now pass that series to the dataframe to subset it
rain = weather[weather['measurement']=='PRCP']
rain.head()
Out[20]:
station_id date measurement value
0 US1FLSL0019 20160101 PRCP 3
3 NOE00133566 20160101 PRCP 37
7 USC00141761 20160101 PRCP 0
13 USS0018D08S 20160101 PRCP 0
19 MXM00076423 20160101 PRCP 0
In [21]:
rain.sort_values('value', inplace=True, ascending=False)
In [22]:
rain.head()
Out[22]:
station_id date measurement value
10483923 ASN00040334 20160420 PRCP 11958
4944588 CA006012501 20160221 PRCP 9154
4382740 CA003013959 20160215 PRCP 6162
9471450 CA003013959 20160409 PRCP 6071
19219679 USC00406374 20160726 PRCP 5588
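One caveat: rain is a filtered slice of weather, so modifying it in place can raise pandas' SettingWithCopyWarning (hidden here because warnings are suppressed). A safer pattern is to take an explicit copy before sorting; a minimal sketch with made-up values:

```python
import pandas as pd

weather = pd.DataFrame({
    "measurement": ["PRCP", "TMAX", "PRCP"],
    "value": [37, 95, 3],
})

# .copy() makes the subset an independent frame, so in-place
# operations on it no longer refer back to the parent
rain = weather[weather["measurement"] == "PRCP"].copy()
rain.sort_values("value", ascending=False, inplace=True)
```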

Creating a Chicago Temperature Dataset

In [23]:
#Let's Create a Chicago Temperature Dataset
chicago = weather[weather['station_id']=='USW00094846']
chicago_temp = weather[(weather['measurement']=='TAVG') & (weather['station_id']=='USW00094846')]
chicago_temp.head()
Out[23]:
station_id date measurement value
46959 USW00094846 20160101 TAVG -48
141175 USW00094846 20160102 TAVG -28
235989 USW00094846 20160103 TAVG -29
332478 USW00094846 20160104 TAVG -28
430009 USW00094846 20160105 TAVG -36
In [24]:
chicago_temp.sort_values('value').head()
Out[24]:
station_id date measurement value
1680773 USW00094846 20160118 TAVG -174
1006530 USW00094846 20160111 TAVG -149
1777625 USW00094846 20160119 TAVG -146
1202047 USW00094846 20160113 TAVG -141
1585364 USW00094846 20160117 TAVG -136
In [25]:
#Viewing Values Based on Criteria
chicago_temp = chicago_temp[chicago_temp.value>-40]
chicago_temp.head()
Out[25]:
station_id date measurement value
141175 USW00094846 20160102 TAVG -28
235989 USW00094846 20160103 TAVG -29
332478 USW00094846 20160104 TAVG -28
430009 USW00094846 20160105 TAVG -36
527207 USW00094846 20160106 TAVG -21

Applying functions to series and creating new columns

In [26]:
chicago_temp.value.mean()
Out[26]:
149.05474452554745
In [27]:
chicago_temp.value.describe()
Out[27]:
count    274.000000
mean     149.054745
std       92.998763
min      -36.000000
25%       71.250000
50%      170.500000
75%      228.750000
max      292.000000
Name: value, dtype: float64

User Functions

In [28]:
#Apply user defined functions
def tenths_to_degree(temp_tenths_celsius):
    """
    Function to Convert Temperature (tenths of degrees C)
    to degrees Celsius
    """
    return temp_tenths_celsius / 10

chicago_temp['deg_celsius']=chicago_temp.value.apply(tenths_to_degree)
chicago_temp.head()
Out[28]:
station_id date measurement value deg_celsius
141175 USW00094846 20160102 TAVG -28 -2.8
235989 USW00094846 20160103 TAVG -29 -2.9
332478 USW00094846 20160104 TAVG -28 -2.8
430009 USW00094846 20160105 TAVG -36 -3.6
527207 USW00094846 20160106 TAVG -21 -2.1

Convert strings to datetime values

In [29]:
chicago_temp['datetime'] = pd.to_datetime(chicago_temp.date, format='%Y%m%d')
chicago_temp.dtypes
Out[29]:
station_id             object
date                    int64
measurement            object
value                   int64
deg_celsius           float64
datetime       datetime64[ns]
dtype: object

Now we can plot the series with ease!

In [30]:
chicago_temp.head()
Out[30]:
station_id date measurement value deg_celsius datetime
141175 USW00094846 20160102 TAVG -28 -2.8 2016-01-02
235989 USW00094846 20160103 TAVG -29 -2.9 2016-01-03
332478 USW00094846 20160104 TAVG -28 -2.8 2016-01-04
430009 USW00094846 20160105 TAVG -36 -3.6 2016-01-05
527207 USW00094846 20160106 TAVG -21 -2.1 2016-01-06
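With a true datetime column, a line plot is nearly a one-liner. A minimal sketch with a few made-up observations (set_index puts the dates on the x-axis):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; in a notebook, %matplotlib inline suffices
import pandas as pd

toy = pd.DataFrame({
    "date": ["20160102", "20160103", "20160104"],
    "deg_celsius": [-2.8, -2.9, -2.8],
})
toy["datetime"] = pd.to_datetime(toy["date"], format="%Y%m%d")

# Index by datetime so plot() uses the dates as the x-axis
ax = toy.set_index("datetime")["deg_celsius"].plot()
```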

Groupby

Groupby is a powerful method that makes it easy to perform operations on the dataframe by categorical values. Let's try generating a plot of min, max, and average temp over time.

In [31]:
chicago_temps = chicago[chicago.measurement.isin(['TMAX','TMIN','TAVG'])]
chicago_temps.measurement.value_counts()
Out[31]:
TAVG    293
TMIN    290
TMAX    290
Name: measurement, dtype: int64
In [32]:
chicago_temps.head()
Out[32]:
station_id date measurement value
46953 USW00094846 20160101 TMAX -5
46954 USW00094846 20160101 TMIN -71
46959 USW00094846 20160101 TAVG -48
141169 USW00094846 20160102 TMAX 0
141170 USW00094846 20160102 TMIN -66
In [33]:
chicago_temps.groupby('measurement').value.mean()
Out[33]:
measurement
TAVG    132.996587
TMAX    179.986207
TMIN     83.713793
Name: value, dtype: float64
In [34]:
chicago_temps.groupby('measurement').value.agg(['count','min','max','mean'])
Out[34]:
count min max mean
measurement
TAVG 293 -174 292 132.996587
TMAX 290 -138 339 179.986207
TMIN 290 -199 239 83.713793
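To actually plot min, max, and average together, one common trick is to pivot the long-format data into one column per measurement. A sketch with invented values:

```python
import pandas as pd

temps = pd.DataFrame({
    "datetime": pd.to_datetime(["2016-01-01"] * 3 + ["2016-01-02"] * 3),
    "measurement": ["TMAX", "TMIN", "TAVG"] * 2,
    "value": [-5, -71, -48, 0, -66, -33],
})

# One column per measurement, dates as the index -- ready for a multi-line plot
wide = temps.pivot(index="datetime", columns="measurement", values="value")
```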

Basic Visualization of the Data

In Python, there are numerous ways to plot data. One of the most widely used plotting libraries is Matplotlib; its charts turn up everywhere, for example when visualizing evaluation metrics in machine learning. Beyond Matplotlib, other plotting suites include Seaborn and even a Python port of ggplot, which will be mostly familiar to R users.

Matplotlib charts can be fairly easy to implement, but are not the most aesthetically pleasing.

In [35]:
chicago_temps.groupby('measurement').value.mean().plot(kind='bar')
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f615940>

Although you can go through some rigmarole to improve the appearance, other packages have better aesthetics by default.
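As one sketch of that rigmarole (the means here are illustrative, not the real Chicago values), a few lines of axis styling already help a plain bar chart:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts
import matplotlib.pyplot as plt
import pandas as pd

means = pd.Series({"TAVG": 133.0, "TMAX": 180.0, "TMIN": 83.7})  # illustrative values

fig, ax = plt.subplots(figsize=(6, 4))
means.plot(kind="bar", ax=ax, color="#56B4E9", edgecolor="black")
ax.set_ylabel("Temperature (tenths of degrees C)")
ax.set_title("Mean Chicago Temperatures, 2016")
ax.spines["top"].set_visible(False)    # drop the box around the plot
ax.spines["right"].set_visible(False)
fig.tight_layout()
```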


Plotting Using Seaborn

In [37]:
import seaborn as sns

#Barplot
sns.barplot(x="measurement", y="value", data=chicago_temps);
In [38]:
#Violin Plot
sns.violinplot(x="measurement", y="value", data=chicago_temps);

Plotting Using ggplot (For Python)

In [63]:
#Install ggplot for Python 3 if Needed
#!pip3 install -U ggplot

For R Users: Note that this version of ggplot ships with some of the same example datasets as ggplot2, such as diamonds.

Examples Courtesy of Yhat:

In [39]:
from ggplot import *
ggplot(aes(x='carat', y='price', color='clarity'), data=diamonds) +\
    geom_point() +\
    scale_color_brewer(type='qual') + \
    ggtitle("Diamond Price by Carat and Clarity")
Out[39]:
<ggplot: (-9223372036565461034)>

Using ggplot for Chicago Weather

Before building the ggplot, let's first create a proper datetime column and index to make the dates more useful:

In [40]:
chicago_temps['datetime'] = pd.to_datetime(chicago_temps.date, format='%Y%m%d')
chicago_temps.index = chicago_temps.datetime
chicago_temps.dtypes
chicago_temps.head()
Out[40]:
station_id date measurement value datetime
datetime
2016-01-01 USW00094846 20160101 TMAX -5 2016-01-01
2016-01-01 USW00094846 20160101 TMIN -71 2016-01-01
2016-01-01 USW00094846 20160101 TAVG -48 2016-01-01
2016-01-02 USW00094846 20160102 TMAX 0 2016-01-02
2016-01-02 USW00094846 20160102 TMIN -66 2016-01-02
In [40]:
ggplot(chicago_temps, aes('datetime', y='value')) + \
    geom_line() + \
    ggtitle("Chicago Min, Max, and Average Temps \n Temperature (tenths of degrees C)") + xlab("Date")
Out[40]:
<ggplot: (294823502)>

We Can Spruce This Up A Bit with Some Color and Variation by Min, Max, and Average Temperatures

In [41]:
#Refined Time Series ggplot
ggplot(chicago_temps, aes(x='datetime', y='value', colour='measurement')) + \
    geom_line() + \
    ggtitle("Chicago Min, Max, and Average Temps in 2016 \n Temperature (tenths of degrees C)") + \
    xlab("Date") + ylab("Temperature") + \
    scale_color_manual(values=("#399999", "#E69F00", "#56B4E9"))
Out[41]:
<ggplot: (-9223372036563464623)>

Note that parts of this notebook have been adapted from Data Science for the Social Good's "Hitchhiker's Guide" tech tutorials. This notebook includes modifications to update the material to Python 3, correct errors in the temperature scale, add data documentation, and incorporate new plotting packages, including Seaborn and ggplot (for Python).