Introduction to Computing for the Social Sciences

MACS 30500
University of Chicago

September 26, 2016

Course site

https://uc-cfss.github.io

Major topics

  • Elementary programming techniques (e.g. loops, conditional statements, functions)
  • Writing reusable, interpretable code
  • Debugging
  • Obtaining, importing, and tidying data from a variety of sources
  • Performing statistical analysis
  • Visualizing information
  • Creating interactive reports
  • Generating reproducible research

print("Hello world!")
## [1] "Hello world!"

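# load packages: tidyverse supplies %>%, mutate(), ggplot2, and the mpg data;
# broom supplies tidy()
library(tidyverse)
library(broom)

# linear model
lm(hwy ~ displ, data = mpg) %>%
  tidy() %>%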
  mutate(term = c("Intercept", "Engine displacement (in liters)")) %>%
  knitr::kable(digits = 2,
               col.names = c("Variable", "Estimate", "Standard Error",
                             "T-statistic", "P-Value"))
Variable                         Estimate  Standard Error  T-statistic  P-Value
Intercept                           35.70            0.72        49.55        0
Engine displacement (in liters)     -3.53            0.19       -18.15        0

# visualization
ggplot(data = mpg, aes(displ, hwy)) + 
  geom_point(aes(color = class)) +
  geom_smooth(method = "lm", se = FALSE, color = "black", alpha = .25) +
  labs(x = "Engine displacement (in liters)",
       y = "Highway miles per gallon",
       color = "Car type") +
  theme_bw(base_size = 16)

Other resources

Plagiarism

  • Collaboration is good – to a point
  • Learning from others/the internet

If you don’t understand what the program is doing and are not prepared to explain it in detail, you should not submit it.

Evaluations

  • Weekly programming assignments (70%)
  • Final project (30%)

Program

A series of instructions that specifies how to perform a computation

  • Input
  • Output
  • Math
  • Conditional execution
  • Repetition
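
The five building blocks above can all appear in a few lines of R. This is a minimal sketch rather than course material: the vector `temperatures_f` and the function `describe_temperature()` are hypothetical names invented for illustration.

```r
# Input: the data the program starts from (hypothetical values)
temperatures_f <- c(28, 55, 91)

# Math and conditional execution, wrapped in a function
describe_temperature <- function(temp_f) {
  temp_c <- (temp_f - 32) * 5 / 9   # math: convert Fahrenheit to Celsius
  if (temp_c < 0) {                 # conditional execution
    "freezing"
  } else if (temp_c < 25) {
    "mild"
  } else {
    "hot"
  }
}

# Repetition: apply the same instructions to every input value
labels <- character(length(temperatures_f))
for (i in seq_along(temperatures_f)) {
  labels[i] <- describe_temperature(temperatures_f[i])
}

# Output: display the result ("freezing", "mild", "hot")
print(labels)
```

The loop makes the repetition explicit; in everyday R the same step is often written more compactly with `sapply(temperatures_f, describe_temperature)`.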

Write a report analyzing the relationship between ice cream consumption and crime rates in Chicago

Jane: a GUI workflow

Sally: a programmatic workflow

Reproducibility

  • Are my results valid? Can they be replicated?
  • The idea that data analyses, and more generally, scientific claims, are published with their data and software code so that others may verify the findings and build upon them
  • Also allows the researcher to precisely replicate their own analysis
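
One way to make an analysis replicable is to keep every step in a script, fix the random seed, and record the software versions used. The following is a minimal sketch under stated assumptions: it uses only base R and the built-in `mtcars` data set (chosen purely for illustration), and `bootstrap_slope()` is a hypothetical helper.

```r
# Hypothetical helper: bootstrap the slope of mpg on engine displacement
bootstrap_slope <- function(data, n_reps = 200) {
  replicate(n_reps, {
    # draw rows at random, with replacement
    resampled <- data[sample(nrow(data), replace = TRUE), ]
    coef(lm(mpg ~ disp, data = resampled))[["disp"]]
  })
}

# Fixing the seed makes the random resampling repeatable
set.seed(1234)
run_1 <- bootstrap_slope(mtcars)

set.seed(1234)
run_2 <- bootstrap_slope(mtcars)

# Same seed, same code, same data: the two runs match exactly
identical(run_1, run_2)

# Record the R version and loaded packages alongside the results
sessionInfo()
```

Without the `set.seed()` calls the two runs would almost certainly differ, which is exactly the kind of silent variation that makes a result hard to verify.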

Version control

  • Revisions in research
  • Tracking revisions
    • analysis-1.r
    • analysis-2.r
    • analysis-3.r
    • Cloud storage (e.g. Dropbox, Google Drive, Box)
  • Version control software
  • Repository

Documentation

  • Comments
    • Comments are the what
    • Code is the how
  • Computer code should also be self-documenting
  • Future-proofing

Badly documented code

library(twitteR)
source("keys.R")
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
data <- userTimeline("realdonaldtrump", n = 1000)
data2 <- twListToDF(data)
write.csv(data2, "data2.csv")

Good code

# get_tweets.R
# Program to get Donald Trump tweets using Twitter API

# access Twitter API functions
library(twitteR)

# setup API authentication
source("keys.R")    # store keys privately in separate file

setup_twitter_oauth(consumer_key,
                    consumer_secret,
                    access_token,
                    access_secret)

# get 1000 most recent tweets
username <- "realdonaldtrump"
tweets <- userTimeline(username, n = 1000)

# convert to data frame
tweets_df <- twListToDF(tweets)

# write to disk
write.csv(tweets_df, "tweets_trump.csv")
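
The script above reads its credentials from `keys.R` but does not show that file. One possible layout, offered as an assumption rather than the course's actual file, reads each value from an environment variable so that no secret is typed into the code; the environment variable names are hypothetical.

```r
# keys.R (hypothetical contents)
# Read each credential from an environment variable so that no secret
# is written into a file that might be shared.
# The environment variable names below are assumptions for illustration.
consumer_key    <- Sys.getenv("TWITTER_CONSUMER_KEY")
consumer_secret <- Sys.getenv("TWITTER_CONSUMER_SECRET")
access_token    <- Sys.getenv("TWITTER_ACCESS_TOKEN")
access_secret   <- Sys.getenv("TWITTER_ACCESS_SECRET")

# Warn early if any credential is missing
# (Sys.getenv() returns "" for an unset variable)
credentials <- c(consumer_key, consumer_secret, access_token, access_secret)
if (any(credentials == "")) {
  warning("One or more Twitter API credentials are not set.")
}
```

Whatever form `keys.R` takes, the point of the separate file is that it can be left out of anything shared publicly while the analysis script itself stays readable.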