Jupyter (which grew out of IPython) is an interactive computing platform that works with the shell and your browser so that you can write code and visualize the results in one environment. The closest analog is an RMarkdown document. An important advantage is that you can build your results and analysis step by step, write your text, and visualize the graphs and results without having to constantly re-render the entire document to see updates. This strength is also a weakness: you may inadvertently introduce errors, load data, or create variables in the wrong order, causing errors when someone tries to replicate your analysis by running the notebook from top to bottom.
To start up a Jupyter notebook server, simply navigate to the directory where you want the notebooks to be saved and run the following command:
jupyter notebook
A browser should open with a notebook navigator. From here, you can either click to open an existing notebook or click the "New" button and select "Python 3". You may rename any notebook by clicking its title at the top of the page.
Notebooks are sequences of cells. Cells can contain Markdown, code, or raw text, and you can run Markdown and code cells much as you would in RMarkdown. Prefixing a line with ! runs it as a shell command, for example:
!git status
Further documentation on this data is available here: http://doi.org/10.7289/V5D21VHZ. A readme file and codebook for the data are also available; we will fetch the readme along with the data below.
We will download the daily data, as well as data on the GHCND stations, using curl.
#Download the Data Using Curl
!curl -O http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2016.csv.gz
!gunzip 2016.csv.gz
!curl -O http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt
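The readme, which serves as the codebook, can be fetched the same way. The path below is assumed from the standard GHCN-Daily directory layout on NOAA's server:
#Download the Readme/Codebook (path assumed from the GHCN-Daily directory layout)
!curl -O http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt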
Pandas is a library created by Wes McKinney that introduces the R-like dataframe object to Python and makes working with data in Python a lot easier. It's also very efficient and, many would argue, makes Python a match for R in pretty much every imaginable way (except for ggplot2).
We start by importing the libraries we're going to use: pandas and numpy (matplotlib's inline plotting is enabled shortly with a notebook magic). For R users, importing a package is the equivalent of library(package_name). Note that packages are often loaded in Python using an alias, as shown below. This way, when we later call a pandas function, we can refer to it as pd.function_name instead of pandas.function_name.
# Import Statements
import pandas as pd
import numpy as np
#Turn off Notebook Package Warnings
import warnings
warnings.filterwarnings('ignore')
In Jupyter notebooks, some utilities can be imported with magic statements, indicated by a %, not to be confused with the Python modulo operator %. Magic commands only apply within the Jupyter notebook and must appear at the beginning of a line.
#Magic Statement
%matplotlib inline
We will also set the file locations used below. If you cloned the repository and ran the steps sequentially, then the relative directory location should be correct.
# File Locations
# Change these on your machine (if needed)
datadir = ""
weather_data_raw = datadir + "2016.csv"
station_data_raw = datadir + "ghcnd-stations.txt"
So far we've been working with raw text files. That's one way to store and interact with data, but only a limited set of functions can take raw text as input. Python has an amazing array of data structures that give you a lot of extra power when working with data.
Built-in Data Structures
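As a quick, minimal sketch (the variable names and values here are purely illustrative), Python's core built-ins include lists, tuples, dictionaries, and sets:
#Built-in Data Structures (illustrative values)
temps = [20.5, 21.0, 19.8]            # list: ordered and mutable
point = (41.88, -87.63)               # tuple: ordered and immutable
station = {"id": "USW00094846",       # dict: key-value mapping
           "name": "Chicago O'Hare"}
measurements = {"TMAX", "TMIN"}       # set: unordered, unique values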
Additional Essential Data Structures
Today we'll primarily be working with the pandas DataFrame. The pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes. It's basically a spreadsheet you can program, and it's an incredibly useful Python object for data analysis.
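For instance, here is a minimal sketch of building a small DataFrame by hand from a dictionary (the values are made up for illustration):
#A Small Hand-Built DataFrame (illustrative values)
df = pd.DataFrame({'station_id': ['USW00094846', 'USW00094846'],
                   'measurement': ['TMAX', 'TMIN'],
                   'value': [150, 30]})
df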
You can load data into a dataframe using pandas' excellent read_* functions. Today, we will use read_table.
Tip on documentation: pressing shift + tab inside a function's parentheses shows its signature and docstring. You can also append a ? to a function name, as below.
#Try Getting Help (Esc to Exit)
pd.read_table?
weather = pd.read_table(weather_data_raw, sep=",", header=None)
stations = pd.read_table(station_data_raw, header=None)
There are lots of options for viewing data in pandas. Just like we did in the command line, you can use head and tail to get a quick view of our data.
weather.head()
weather.tail()
#Dimensions of the DataFrame
weather.shape
#Types of Data
weather.dtypes
Note: You'll notice that some commands have looked like pd.something(), some like data.something(), and some like data.something without (). The difference is between pandas functions, methods, and attributes. Methods are actions you take on a dataframe or series, while attributes are descriptors of the dataframe or series.
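To make the distinction concrete, here is each form side by side, using objects already defined above:
#Function vs. Method vs. Attribute
pd.read_table      #a function in the pandas namespace
weather.head()     #a method: an action performed by the dataframe
weather.shape      #an attribute: a stored property, no parentheses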
Notice that we don't have column names and some columns are full of null values. Let's fix that by learning how to name and delete columns.
#Current Data Columns
weather.columns
#Drop columns 4, 5, 6, & 7
weather.drop([4,5,6,7], axis=1, inplace=True)
weather.head()
Note that the header of the data frame has no column names, only numbers. We can assign column names by creating a list and setting the data frame columns equal to the list.
#Assign Column Names Using a List
weather_cols = ['station_id','date','measurement','value']
weather.columns = weather_cols
weather.head()
Now that we have columns, we want to get a better global view of our data. There are several ways to do this; describe, which computes summary statistics for the numeric columns, is a good place to start.
weather.describe()
One of the biggest benefits of having a multi-index object like a DataFrame is the ability to easily select rows, columns, and subsets of the data. Let's learn how to do that.
First we will select individual series from the dataframe.
weather['measurement'].head()
#using . notation
weather.measurement.head()
#subset by row index
weather.measurement[3:10]
#Use the iloc method
weather.iloc[7:14,2:4] #Rows 7-13, columns 2-3 (end points excluded)
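The label-based counterpart to iloc is loc, which selects by index and column labels and includes both endpoints when slicing. A minimal sketch:
#Use the loc method for label-based selection
weather.loc[7:13, ['measurement','value']]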
Now let's subset on row values.
#Create a Boolean Series based on a condition
example_bool = weather['measurement']=='PRCP'
example_bool.head()
#Now pass that series to the dataframe to subset it
rain = weather[weather['measurement']=='PRCP']
rain.head()
rain.sort_values('value', inplace=True, ascending=False)
rain.head()
#Let's Create a Chicago Temperature Dataset
chicago = weather[weather['station_id']=='USW00094846']
chicago_temp = weather[(weather['measurement']=='TAVG') & (weather['station_id']=='USW00094846')]
chicago_temp.head()
chicago_temp.sort_values('value').head()
#Filter Out Values Based on Criteria
chicago_temp = chicago_temp[chicago_temp.value>-40]
chicago_temp.head()
chicago_temp.value.mean()
chicago_temp.value.describe()
#Apply user defined functions
def tenths_to_degree(temp_tenths_celsius):
    """
    Function to convert temperature from tenths of
    degrees Celsius to degrees Celsius.
    """
    return temp_tenths_celsius / 10

chicago_temp['deg_celsius'] = chicago_temp.value.apply(tenths_to_degree)
chicago_temp.head()
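As an aside, for simple arithmetic like this, pandas' vectorized operations produce the same result and are typically faster than apply:
#Vectorized alternative to apply
chicago_temp['deg_celsius'] = chicago_temp.value / 10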
chicago_temp['datetime'] = pd.to_datetime(chicago_temp.date, format='%Y%m%d')
chicago_temp.dtypes
Now we can plot the series with ease!
chicago_temp.head()
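For example, a quick line plot of the converted series, using pandas' built-in matplotlib plotting:
#Plot Average Temperature Over Time
chicago_temp.plot(x='datetime', y='deg_celsius')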
Groupby is a powerful method that makes it easy to perform operations on the dataframe by categorical values. Let's try generating a plot of min, max, and average temp over time.
chicago_temps = chicago[chicago.measurement.isin(['TMAX','TMIN','TAVG'])]
chicago_temps.measurement.value_counts()
chicago_temps.head()
chicago_temps.groupby('measurement').value.mean()
chicago_temps.groupby('measurement').value.agg(['count','min','max','mean'])
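To get the promised plot of min, max, and average temperature over time with pandas alone, one approach is to pivot the data so each measurement becomes its own column; a minimal sketch (converting the date column inline here, as we do more formally below):
#Pivot to one column per measurement, indexed by date, then plot
temps_wide = chicago_temps.pivot_table(index=pd.to_datetime(chicago_temps.date, format='%Y%m%d'),
                                       columns='measurement', values='value')
temps_wide.plot()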
In Python, there are numerous ways to plot data. One of the most widely used plotting libraries is Matplotlib, whose graphs turn up everywhere from exploratory analysis to visualizing evaluation metrics in machine learning. Besides Matplotlib, other plotting suites include Seaborn and even a Python port of ggplot, which will be mostly familiar to R users. More information can be found in each package's documentation.
chicago_temps.groupby('measurement').value.mean().plot(kind='bar')
Although you can go through some rigmarole to improve the appearance, other packages have better aesthetics by default.
import seaborn as sns
#Barplot
sns.barplot(x="measurement", y="value", data=chicago_temps);
#Violin Plot
sns.violinplot(x="measurement", y="value", data=chicago_temps);
#Install ggplot for Python 3 if Needed
#!pip3 install -U ggplot
from ggplot import *
ggplot(aes(x='carat', y='price', color='clarity'), data=diamonds) +\
geom_point() +\
scale_color_brewer(type='qual') + \
ggtitle("Diamond Price by Carat and Clarity")
chicago_temps['datetime'] = pd.to_datetime(chicago_temps.date, format='%Y%m%d')
chicago_temps.index = chicago_temps.datetime
chicago_temps.dtypes
chicago_temps.head()
ggplot(chicago_temps, aes('datetime', y='value')) + \
geom_line() + \
ggtitle("Chicago Min, Max, and Average Temps \n Temperature (tenths of degrees C)") + xlab("Date")
#Refined Time Series ggplot
ggplot(chicago_temps, aes(x='datetime', y='value', colour='measurement')) + \
geom_line() + \
ggtitle("Chicago Min, Max, and Average Temps in 2016 \n Temperature (tenths of degrees C)") + \
xlab("Date") + ylab("Temperature") + \
scale_color_manual(values=("#399999", "#E69F00", "#56B4E9"))
Note that parts of this notebook have been adapted from Data Science for the Social Good's "Hitchhiker's Guide" tech tutorials. The notebook has been modified to update it to Python 3, adjust for errors in temperature scale, add data documentation, and incorporate new plotting packages, including Seaborn and ggplot (for Python).