This content is from the fall 2016 version of this course. Please go here for the most recent version.

Overview

Due before class on Wednesday, October 12th.

Now that you’ve demonstrated your software is set up, the goal of this assignment is to practice wrangling and exploring data.

Fork the hw02 repository

Go here to fork the repo for homework 02.

Part 1: Exploring clean data

FiveThirtyEight, a data journalism site devoted to politics, sports, science, economics, and culture, recently published a series of articles on gun deaths in America. Gun violence in the United States is a significant political issue, and while reducing gun deaths is a noble goal, we must first understand the causes and patterns in gun violence in order to craft appropriate policies. As part of the project, FiveThirtyEight collected data from the Centers for Disease Control and Prevention, as well as other governmental agencies and non-profits, on all gun deaths in the United States from 2012-2014.

Obtain the data

I have included this dataset in the rcfss package on GitHub. To install the package, run devtools::install_github("uc-cfss/rcfss") in R. If you don’t already have the devtools package installed, this command will fail with an error; install devtools first using install.packages("devtools"), then install rcfss. Once rcfss is loaded with library(rcfss), the gun deaths dataset can be loaded using data("gun_deaths"). Use the help function in R (?gun_deaths) to get detailed information on the variables and how they are coded.
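The installation steps above can be sketched as a single helper. The function name setup_rcfss is hypothetical (it is not part of rcfss or devtools) and is used here only to keep the steps together; it assumes an internet connection and a working CRAN mirror.

```r
# Hypothetical helper (not part of rcfss): installs devtools if it is missing,
# then installs rcfss from GitHub. Requires an internet connection.
setup_rcfss <- function() {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github("uc-cfss/rcfss")
}

# Run once per machine:
# setup_rcfss()

# Then, in each R session:
# library(rcfss)
# data("gun_deaths")
# ?gun_deaths
```

Defining the steps as a function means nothing is installed until you call it explicitly.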

Explore the data

Using your knowledge of dplyr and ggplot2, use summary statistics and graphs to answer the following questions:

  1. In what month do the most gun deaths occur?
  2. What is the most common intent in gun deaths? Do most people killed by guns die in suicides, homicides, or accidental shootings?
  3. What is the average age of females killed by guns?
  4. How many white males with at least a high school education were killed by guns in 2012?
  5. Which season of the year has the most gun deaths? Assume that
    • Winter = January-March
    • Spring = April-June
    • Summer = July-September
    • Fall = October-December
  6. What is the relationship between race and intent? For example, are whites who are killed by guns more likely to die because of suicide or homicide? How does this compare to blacks and hispanics?
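As one illustration of the kind of dplyr pipeline these questions call for, the sketch below applies the season definitions from question 5 to a small invented data frame. The column name month and its numeric coding are assumptions; check ?gun_deaths for how the real variable is stored before adapting this.

```r
library(dplyr)

# Toy data standing in for gun_deaths; the values are invented and the
# numeric `month` column is an assumption about the real dataset.
toy <- tibble(month = c(1, 2, 4, 7, 8, 9, 11))

season_counts <- toy %>%
  # Map each month to a season using the definitions in question 5
  mutate(season = case_when(
    month %in% 1:3   ~ "Winter",
    month %in% 4:6   ~ "Spring",
    month %in% 7:9   ~ "Summer",
    month %in% 10:12 ~ "Fall"
  )) %>%
  # Count rows per season, most frequent first
  count(season, sort = TRUE)

season_counts
```

The same mutate-then-count pattern carries over to the other questions by swapping in a different grouping variable or adding a filter() step first.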

Part 2: Tidying messy data

In the rcfss package, there is a data frame called dadmom.

## # A tibble: 3 × 5
##   famid named  incd namem  incm
##   <dbl> <chr> <dbl> <chr> <dbl>
## 1     1  Bill 30000  Bess 15000
## 2     2   Art 22000   Amy 18000
## 3     3  Paul 25000   Pat 50000

Tidy this data frame so that it adheres to the tidy data principles:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.
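One possible approach, sketched here with the data frame rebuilt by hand from the printout above so the example runs without rcfss. It uses pivot_longer() from tidyr 1.0.0 or later, which postdates this version of the course; gather(), separate(), and spread() can reach the same result. The new column names parent, name, and inc are illustrative choices, not names required by the assignment.

```r
library(tibble)
library(tidyr)

# Rebuilt by hand from the printed tibble above
dadmom <- tribble(
  ~famid, ~named, ~incd, ~namem, ~incm,
  1, "Bill", 30000, "Bess", 15000,
  2, "Art",  22000, "Amy",  18000,
  3, "Paul", 25000, "Pat",  50000
)

# Each original column name encodes two things: the variable (name or inc)
# and the parent (d or m). The ".value" sentinel keeps name and inc as
# separate columns while the d/m suffix moves into a `parent` column.
dadmom_tidy <- pivot_longer(
  dadmom,
  cols = -famid,
  names_to = c(".value", "parent"),
  names_pattern = "(name|inc)(d|m)"
)

dadmom_tidy
```

The result has one row per parent per family, so each of the three principles holds: name and income are separate variables, and each parent is a separate observation.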

Part 3: Wrangling and exploring messy(ish) data

The Supreme Court Database contains detailed information on decisions of the U.S. Supreme Court. It is perhaps the most widely used database in the study of judicial politics. Until recently, the database contained records only on cases from the “modern” era (1946-present). It has since been extended backwards to include all decisions since the formation of the Court in 1791. While still in beta form, this extension opens the door to new studies of the Court’s pre-modern era decisions.

In the hw02 repository, you will find two data files: SCDB_Legacy_03_justiceCentered_Citation.csv and SCDB_2016_01_justiceCentered_Citation.csv. These are the exact same files you would obtain if you downloaded them from the original website; I have included them in the repository merely for your convenience. Documentation for the datasets can be found here. The data is structured in a tidy fashion. That is, every row is a vote by one justice on one case, for every case decided from the 1791-2015 terms.[1] There are several ID variables which are useful for other types of research; for our purposes, the only ID variable you need to concern yourself with is caseIssuesId. Variables you will want to familiarize yourself with include: term, justice, justiceName, decisionDirection, majVotes, minVotes, majority, and chief. Pay careful attention in the documentation to how these variables are coded.

To analyze the Supreme Court data, you will need to import these two files and combine them into a single data frame (see bind_rows() from the dplyr package).[2]
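To show how bind_rows() behaves, here is a minimal sketch on two invented tibbles. It demonstrates one common reason row-binding fails, a column stored as different types in the two inputs, and one way to resolve it. The column names and values are made up for illustration; whether this is what happens with the SCDB files is left for you to work out.

```r
library(dplyr)

# Two toy tibbles with the same columns, but `docket` is character in one
# and numeric in the other (all names and values here are invented).
legacy <- tibble(term = c(1791, 1792), docket = c("1", "2"))
modern <- tibble(term = c(1946, 1947), docket = c(12, 15))

# bind_rows() refuses to combine a character column with a numeric one;
# capture the error message instead of stopping the script.
attempt <- tryCatch(
  bind_rows(legacy, modern),
  error = function(e) conditionMessage(e)
)

# Coercing the mismatched column to a common type lets the bind succeed.
combined <- bind_rows(
  legacy,
  mutate(modern, docket = as.character(docket))
)

combined
```

Comparing column types across the two inputs before binding (for example with glimpse()) is a quick way to spot mismatches like this one.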

Once joined, use your data wrangling and exploratory data analysis skills to answer the following questions:

  1. What percentage of cases in each term are decided by a one-vote margin (e.g., 5-4 or 4-3)?
  2. In each term he served on the Court, in what percentage of cases was Justice Antonin Scalia in the majority?
    • Advanced challenge: Create a graph similar to the one above that compares the percentage for all cases versus non-unanimous cases (i.e. there was at least one dissenting vote).
  3. In each term, what percentage of cases were decided in the conservative direction?
    • Advanced challenge: The Chief Justice is frequently seen as capable of influencing the ideological direction of the Court. Create a graph similar to the one above that also incorporates information on who was the Chief Justice during the term.
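Each of these questions asks for a percentage per term, and because the data has one row per justice per case, it usually helps to collapse to one row per case before computing it. The sketch below shows that pattern on invented data. The column names caseIssuesId, term, majVotes, and minVotes come from the variable list above, but the values are made up, and deduplicating on these columns is an assumption rather than the required method.

```r
library(dplyr)

# Toy justice-centered data: two vote rows per case (values are invented).
votes <- tibble(
  caseIssuesId = rep(c("A", "B", "C", "D"), each = 2),
  term         = rep(c(2000, 2000, 2001, 2001), each = 2),
  majVotes     = rep(c(5, 9, 5, 4), each = 2),
  minVotes     = rep(c(4, 0, 4, 3), each = 2)
)

one_vote <- votes %>%
  # Collapse to one row per case so each case is counted once
  distinct(caseIssuesId, term, majVotes, minVotes) %>%
  group_by(term) %>%
  # The mean of a logical vector is the share of TRUE values
  summarize(pct_one_vote = mean(majVotes - minVotes == 1) * 100)

one_vote
```

Taking the mean of a logical condition inside summarize() is a compact way to get a proportion per group; the same idea applies to the other questions with a different condition.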

Submit the assignment

Your assignment should be submitted as three RMarkdown documents. Follow the instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.

Rubric

Check minus: Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed. No record of commits other than the final push to GitHub.

Check: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.

Check plus: Finished all components of the assignment correctly and attempted at least one advanced challenge. Code is well-documented (both self-documented and with additional comments as necessary). Graphs and tables are properly labeled. Uses multiple commits to back up and show a progression in the work. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.


  1. Terms run from October through June, so the 2015 term contains cases decided from October 2015 through June 2016.

  2. Friendly warning: you will initially encounter an error attempting to join the tibbles. Use your powers of deduction (and Google/Stack Overflow/classmates/me and the TA) to figure out how to fix this error.

This work is licensed under the CC BY-NC 4.0 Creative Commons License.