This content is from the fall 2016 version of this course. Please go here for the most recent version.
Due before class Wednesday October 12th.
Now that you’ve demonstrated your software is setup, the goal of this assignment is to practice wrangling and exploring data.
hw02
repositoryGo here to fork the repo for homework 02.
FiveThirtyEight, a data journalism site devoted to politics, sports, science, economics, and culture, recently published a series of articles on gun deaths in America. Gun violence in the United States is a significant political issue, and while reducing gun deaths is a noble goal, we must first understand the causes and patterns in gun violence in order to craft appropriate policies. As part of the project, FiveThirtyEight collected data from the Centers for Disease Control and Prevention, as well as other governmental agencies and non-profits, on all gun deaths in the United States from 2012-2014.
I have included this dataset in the rcfss
library on GitHub. To install the package, use the command devtools::install_github("uc-cfss/rcfss")
in R. If you don’t already have the devtools
library installed, you will get an error. Go back and install this first using install.packages()
, then install rcfss
. The gun deaths dataset can be loaded using data("gun_deaths")
. Use the help function in R (?gun_deaths
) to get detailed information on the variables and coding information.
Using your knowledge of dplyr
and ggplot2
, use summary statistics and graphs to answer the following questions:
In the rcfss
package, there is a data frame called dadmom
.
## # A tibble: 3 × 5
## famid named incd namem incm
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 Bill 30000 Bess 15000
## 2 2 Art 22000 Amy 18000
## 3 3 Paul 25000 Pat 50000
Tidy this data frame so that it adheres to the tidy data principles:
The Supreme Court Database contains detailed information of decisions of the U.S. Supreme Court. It is perhaps the most utilized database in the study of judicial politics. Until recently, the database only contained records on cases from the “modern” era (1946-present). Recently the database was extended backwards to include all decisions since the formation of the Court in 1791. While still in beta form, this extension opens the doors to new studies of the Court’s pre-modern era decisions.
In the hw02
repository, you will find two data files: SCDB_Legacy_03_justiceCentered_Citation.csv
and SCDB_2016_01_justiceCentered_Citation.csv
. These are the exact same files you would obtain if you downloaded them from the original website; I have included them in the repository merely for your convenience. Documentation for the datasets can be found here. The data is structured in a tidy fashion. That is, every row is a vote by one justice on one case for every case decided from the 1791-2015 terms.1 There are several ID variables which are useful for other types of research: for our purposes, the only ID variable you need to concern yourself with is caseIssuesId
. Variables you will want to familiarize yourself with include: term
, justice
, justiceName
, decisionDirection
, majVotes
, minVotes
, majority
, and chief
. Pay careful attention in the documentation to how these variables are coded.
In order to analyze the Supreme Court data, you will need to import these two files and combine them together (see bind_rows()
from the dplyr
package).2
Once joined, use your data wrangling and exploratory data analysis skills to answer the following questions:
Your assignment should be submitted as three RMarkdown documents. Follow instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.
Check minus: Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed. No record of commits other than the final push to GitHub.
Check: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.
Check plus: Finished all components of the assignment correctly and attempted at least one advanced challenge. Code is well-documented (both self-documented and with additional comments as necessary). Graphs and tables are properly labeled. Use multiple commits to back up and show a progression in the work. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.
Terms run from October through June, so the 2015 term contains cases decided from October 2015 - June 2016↩
Friendly warning: you will initially encounter an error attempting to join the tibbles
. Use your powers of deduction (and Google/Stack Overflow/classmates/me and the TA) to figure out how to fix this error.↩
This work is licensed under the CC BY-NC 4.0 Creative Commons License.