This content is from the fall 2016 version of this course. Please go here for the most recent version.
gapminder dataset using Python
Due before class Wednesday November 2nd.
The basic goal of the assignment is to implement various computational methods (e.g. data frames, lists, filtering, conditional expressions, iteration, functions) in Python. Rather than using raw programming assignments, you will demonstrate these skills in the context of analyzing the gapminder dataset, something you have already explored in R.
hw05 repository
Go here to fork the repo for homework 05.
You are provided with a Jupyter Notebook similar to the one seen here. Fill in the chunks with the appropriate code needed to perform the requested analysis. I have already identified the questions and tasks you need to perform.
Your assignment should be submitted as a single Jupyter Notebook. Follow instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.
Check minus: Notebook cannot be run. Didn’t answer all of the questions. Code is incomprehensible or difficult to follow.
Check: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.
Check plus: Innovative use of coding elements to solve the problems (e.g. functions, conditional expressions, iterations). Adds labels to graphs. Uses techniques beyond those from the example notebooks. Successfully attempts the advanced challenge.
# Import libraries
import pandas as pd
import numpy as np
# Turn off notebook package warnings
import warnings
warnings.filterwarnings('ignore')
# print graphs in the document
%matplotlib inline
gapminder DataFrame
59.474439366197174
continent
Africa 48.865330
Americas 64.658737
Asia 60.064903
Europe 71.903686
Oceania 74.326208
Name: lifeExp, dtype: float64
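The two outputs above are consistent with an overall mean of lifeExp followed by a mean per continent. A minimal sketch of that pattern is shown here on a made-up DataFrame; the name `toy` and all of its values are invented for illustration and are not gapminder data.

```python
import pandas as pd

# Toy stand-in for gapminder; these values are invented for illustration only
toy = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe", "Europe"],
    "lifeExp": [40.0, 50.0, 70.0, 80.0],
})

# Mean of one column across all rows
overall_mean = toy["lifeExp"].mean()

# Mean of the same column within each continent (returns a Series indexed by continent)
by_continent = toy.groupby("continent")["lifeExp"].mean()

print(overall_mean)
print(by_continent)
```

The grouped result is a pandas Series, which is why the printed output ends with a `Name: lifeExp, dtype: float64` line.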
Australia
country continent year lifeExp pop gdpPercap
60 Australia Oceania 1952 69.120 8691212 10039.59564
61 Australia Oceania 1957 70.330 9712569 10949.64959
62 Australia Oceania 1962 70.930 10794968 12217.22686
63 Australia Oceania 1967 71.100 11872264 14526.12465
64 Australia Oceania 1972 71.930 13177000 16788.62948
65 Australia Oceania 1977 73.490 14074100 18334.19751
66 Australia Oceania 1982 74.740 15184200 19477.00928
67 Australia Oceania 1987 76.320 16257249 21888.88903
68 Australia Oceania 1992 77.560 17481977 23424.76683
69 Australia Oceania 1997 78.830 18565243 26997.93657
70 Australia Oceania 2002 80.370 19546792 30687.75473
71 Australia Oceania 2007 81.235 20434176 34435.36744
New Zealand
country continent year lifeExp pop gdpPercap
1092 New Zealand Oceania 1952 69.390 1994794 10556.57566
1093 New Zealand Oceania 1957 70.260 2229407 12247.39532
1094 New Zealand Oceania 1962 71.240 2488550 13175.67800
1095 New Zealand Oceania 1967 71.520 2728150 14463.91893
1096 New Zealand Oceania 1972 71.890 2929100 16046.03728
1097 New Zealand Oceania 1977 72.220 3164900 16233.71770
1098 New Zealand Oceania 1982 73.840 3210650 17632.41040
1099 New Zealand Oceania 1987 74.320 3317166 19007.19129
1100 New Zealand Oceania 1992 76.330 3437674 18363.32494
1101 New Zealand Oceania 1997 77.550 3676187 21050.41377
1102 New Zealand Oceania 2002 79.110 3908037 23189.80135
1103 New Zealand Oceania 2007 80.204 4115771 25185.00911
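Output like the two tables above can be produced by filtering on a continent and then iterating over its countries. The following is a rough sketch under that assumption, using a handful of rows copied from the tables above; the names `rows`, `oceania`, and `subsets` are hypothetical.

```python
import pandas as pd

# A few rows copied from the Australia and New Zealand output above
rows = pd.DataFrame({
    "country": ["Australia", "Australia", "New Zealand", "New Zealand"],
    "continent": ["Oceania"] * 4,
    "year": [1952, 2007, 1952, 2007],
    "lifeExp": [69.120, 81.235, 69.390, 80.204],
})

# Boolean filtering: keep only rows where the condition is True
oceania = rows[rows["continent"] == "Oceania"]

# Iterate over each distinct country and print its subset
subsets = {}
for name in oceania["country"].unique():
    subsets[name] = oceania[oceania["country"] == name]
    print(name)
    print(subsets[name])
```

Note that filtering keeps the original row labels, which is why the New Zealand rows in the full dataset start at 1092 rather than 0.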
Sort gapminder by population. Make sure the sorted object replaces the existing gapminder DataFrame.
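As a small sketch of sorting and replacing a DataFrame in one step, the example below uses two 2007 rows copied from the output earlier on this page; the name `toy` is hypothetical.

```python
import pandas as pd

# Two 2007 rows copied from the Australia and New Zealand output above
toy = pd.DataFrame({
    "country": ["Australia", "New Zealand"],
    "year": [2007, 2007],
    "pop": [20434176, 4115771],
})

# sort_values returns a new DataFrame, so assign it back to replace the original
toy = toy.sort_values("pop")

# Use bracket access for this column: toy.pop is a DataFrame method, not the column
print(toy["pop"])
```

Without the assignment back to `toy`, the original object would stay unsorted, because `sort_values` does not modify the DataFrame in place by default.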
Using seaborn, generate a scatterplot depicting the relationship between population and life expectancy, and include a linear best fit line.
import seaborn as sns
Here the goal is to write a basic function, “life_expectancy”, that incorporates your work above.
By default, the function should return a scatterplot of life-expectancy versus years for a given country. [Hint: Subset the data for a specific country, similar to a problem above]
Once you subset the data, the function should do one of two things:
* (1) return a graph, or
* (2) return a graph and model results.
Thus, your function should have arguments and output as follows:
* Arguments:
Country (required): The name of a specific country, such as "Australia"
Model (optional): Build and return the model results. Hint: set the default to False
* Output:
(1) - Default: A scatterplot of the relationship with best fit line
(2) - Model: The above graph AND the model results
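The argument structure above (one required argument, one optional flag that defaults to False) can be outlined as follows. This is only a sketch under stated assumptions, not a solution: the function name `life_expectancy_sketch` and the `toy` DataFrame are hypothetical, the plot is omitted, and `np.polyfit` is swapped in as a stand-in for the linear model. The data are the New Zealand rows copied from the output earlier on this page.

```python
import numpy as np
import pandas as pd

# New Zealand rows copied from the output earlier on this page
toy = pd.DataFrame({
    "country": ["New Zealand"] * 12,
    "year": list(range(1952, 2008, 5)),
    "lifeExp": [69.390, 70.260, 71.240, 71.520, 71.890, 72.220,
                73.840, 74.320, 76.330, 77.550, 79.110, 80.204],
})

def life_expectancy_sketch(country, model=False):
    """Hypothetical outline: subset by country, optionally fit a line."""
    subset = toy[toy["country"] == country]
    # A real solution would draw the scatterplot here (omitted in this sketch)
    if not model:
        return subset
    # Stand-in for a model fit: degree-1 least-squares line through the points
    slope, intercept = np.polyfit(subset["year"], subset["lifeExp"], 1)
    return subset, (slope, intercept)

# Default call: only the subset comes back
subset_only = life_expectancy_sketch("New Zealand")

# Optional flag set: the subset plus the fitted coefficients
_, (slope, intercept) = life_expectancy_sketch("New Zealand", True)
print(slope, intercept)
```

Because an ordinary least-squares line is the same fit either way, the slope and intercept here should come out close to the year and Intercept coefficients in the New Zealand regression summary shown later on this page.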
To run a linear model that predicts life expectancy by year, we can use the statsmodels library.
import statsmodels.formula.api as sm #Import Package
model = sm.ols(formula = 'lifeExp ~ year', data = gapminder).fit() #Fit OLS Model
results = model.summary() #Get Results
print(results) # Print
#Hint: Use this Code in Your Function.
#You will need to replace data = gapminder, with the data subset for a specific country.
OLS Regression Results
==============================================================================
Dep. Variable: lifeExp R-squared: 0.190
Model: OLS Adj. R-squared: 0.189
Method: Least Squares F-statistic: 398.6
Date: Tue, 25 Oct 2016 Prob (F-statistic): 7.55e-80
Time: 16:40:53 Log-Likelihood: -6597.9
No. Observations: 1704 AIC: 1.320e+04
Df Residuals: 1702 BIC: 1.321e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -585.6522 32.314 -18.124 0.000 -649.031 -522.273
year 0.3259 0.016 19.965 0.000 0.294 0.358
==============================================================================
Omnibus: 386.124 Durbin-Watson: 1.962
Prob(Omnibus): 0.000 Jarque-Bera (JB): 90.750
Skew: -0.268 Prob(JB): 1.97e-20
Kurtosis: 2.004 Cond. No. 2.27e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.27e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
# write your function here
Your function should be able to produce these results:
# Result for a Country (No Model)
life_expectancy("Afghanistan")
# Result for a Country (Model = True)
life_expectancy("New Zealand", True)
OLS Regression Results
==============================================================================
Dep. Variable: lifeExp R-squared: 0.954
Model: OLS Adj. R-squared: 0.949
Method: Least Squares F-statistic: 205.4
Date: Tue, 25 Oct 2016 Prob (F-statistic): 5.41e-08
Time: 16:49:02 Log-Likelihood: -13.321
No. Observations: 12 AIC: 30.64
Df Residuals: 10 BIC: 31.61
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept -307.6996 26.630 -11.554 0.000 -367.036 -248.363
year 0.1928 0.013 14.333 0.000 0.163 0.223
==============================================================================
Omnibus: 1.899 Durbin-Watson: 0.530
Prob(Omnibus): 0.387 Jarque-Bera (JB): 1.086
Skew: -0.420 Prob(JB): 0.581
Kurtosis: 1.789 Cond. No. 2.27e+05
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.27e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
As you know already, the general trend is that over time life expectancy increases, but the trend is different for each country. Some experience a greater increase than others, whereas some countries experience declines in life expectancy. You can use whatever method you wish to assess and explain this relationship using Python.
Use whichever method you think you can master before the assignment is due. Some of you may just stick to basic graphs and tables, while others might build a statistical model using statsmodels. Obviously, the more advanced the technique you use, the higher your ceiling will be for your evaluation. But don’t spend 10 hours getting this to work! Go with what you can accomplish in a reasonable amount of time.
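One possible way to compare trends across countries is to fit a separate line per country and rank the slopes. The sketch below is an illustration under stated assumptions rather than a recommended approach: it uses only the Australia and New Zealand values printed earlier on this page, the names `life_exp`, `toy`, `slopes`, and `trend` are hypothetical, and `np.polyfit` stands in for a fuller statistical model.

```python
import numpy as np
import pandas as pd

years = list(range(1952, 2008, 5))

# lifeExp values copied from the Australia and New Zealand output earlier on this page
life_exp = {
    "Australia": [69.120, 70.330, 70.930, 71.100, 71.930, 73.490,
                  74.740, 76.320, 77.560, 78.830, 80.370, 81.235],
    "New Zealand": [69.390, 70.260, 71.240, 71.520, 71.890, 72.220,
                    73.840, 74.320, 76.330, 77.550, 79.110, 80.204],
}

# Stack the per-country tables into one long DataFrame
toy = pd.concat(
    [pd.DataFrame({"country": name, "year": years, "lifeExp": values})
     for name, values in life_exp.items()],
    ignore_index=True,
)

# Fit one least-squares line per country and keep the slope (years of lifeExp per year)
slopes = {}
for name, group in toy.groupby("country"):
    slope, _ = np.polyfit(group["year"], group["lifeExp"], 1)
    slopes[name] = slope

# Rank countries from steepest to flattest trend
trend = pd.Series(slopes).sort_values(ascending=False)
print(trend)
```

A negative slope in such a ranking would flag a country whose life expectancy declined over the period, which is the kind of difference the task asks you to assess and explain.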
This work is licensed under the CC BY-NC 4.0 Creative Commons License.