This content is from the fall 2016 version of this course. Please go here for the most recent version.

Overview

Due before class Wednesday November 23rd.

Fork the `hw08` repository

Go here to fork the repo for homework 08.

Your assignment

Implement a statistical learning model and use cross-validation to assess the robustness of your findings. Write this up like a short paper in a substantive seminar:

Introduce the topic
Brief literature review (enough to inform your hypothesis)
State a hypothesis
Describe your data and method
Execute the method
Present results
- Potentially an analysis of competing models and a discussion of why you select a specific model
Include written analysis of your findings
Draw conclusions and potential concerns with the results

Potential functional forms

Linear regression
Logistic regression
LOWESS
Decision tree
Random forest
Latent Direchlet allocation (yes, this is statistical learning - just unsupervised statistical learning)
Something else from An Introduction to Statistical Learning or another source
- If you use a different method, you need to demonstrate you understand how it works by writing a brief summary of the method

Potential cross-validation methods

Validation set approach
Leave-one-out cross-valiation (LOOCV)
\(k\)-fold cross-validation
Bootstrapping
Out-of-bag (OOB) estimation (for random forests)

Submit the assignment

Your assignment should be submitted as a set of R scripts, R Markdown documents, Jupyter Notebooks, data files, etc. Whatever is necessary to show your code and present your results. Follow instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.

Rubric

Check minus: Cannot get code to run or is poorly documented. Severe misinterpretations of the results. Overall a shoddy or incomplete assignment.

Check: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.

Check plus: Interpretation is clear and in-depth. Accurately interprets the results, with appropriate caveats for what the technique can and cannot do. Code is reproducible (i.e. if analyzing tweets, you have stored a copy in a local file so I can exactly reproduce your results as well as run it on a new sample of tweets). Discusses the benefits and drawbacks of a specific method. Compares multiple models fitted to the same underlying dataset.

This work is licensed under the CC BY-NC 4.0 Creative Commons License.

Homework 08: Statistical learning