Data wrangling

MACS 30500
University of Chicago

October 5, 2016

`scorecard`

scorecard

## # A tibble: 1,849 × 12
##    unitid                                 name state                type
##     <int>                                <chr> <chr>               <chr>
## 1  450234      ITT Technical Institute-Wichita    KS Private, for-profit
## 2  448479 ITT Technical Institute-Swartz Creek    MI Private, for-profit
## 3  456427      ITT Technical Institute-Concord    CA Private, for-profit
## 4  459596  ITT Technical Institute-Tallahassee    FL Private, for-profit
## 5  459851        Herzing University-Brookfield    WI Private, for-profit
## 6  482477            DeVry University-Illinois    IL Private, for-profit
## 7  482547              DeVry University-Nevada    NV Private, for-profit
## 8  482592              DeVry University-Oregon    OR Private, for-profit
## 9  482617           DeVry University-Tennessee    TN Private, for-profit
## 10 482662          DeVry University-Washington    WA Private, for-profit
## # ... with 1,839 more rows, and 8 more variables: cost <int>,
## #   admrate <dbl>, satavg <dbl>, avgfacsal <dbl>, pctpell <dbl>,
## #   comprate <dbl>, firstgen <dbl>, debt <dbl>

Which were the 10 most expensive colleges in 2013?

# using arrange() and desc()
scorecard %>%
  arrange(desc(cost))

## # A tibble: 1,849 × 12
##    unitid                                        name state
##     <int>                                       <chr> <chr>
## 1  195304                      Sarah Lawrence College    NY
## 2  179867           Washington University in St Louis    MO
## 3  144050                       University of Chicago    IL
## 4  190150 Columbia University in the City of New York    NY
## 5  182670                           Dartmouth College    NH
## 6  130697                         Wesleyan University    CT
## 7  147767                     Northwestern University    IL
## 8  120254                          Occidental College    CA
## 9  115409                         Harvey Mudd College    CA
## 10 230816                          Bennington College    VT
## # ... with 1,839 more rows, and 9 more variables: type <chr>, cost <int>,
## #   admrate <dbl>, satavg <dbl>, avgfacsal <dbl>, pctpell <dbl>,
## #   comprate <dbl>, firstgen <dbl>, debt <dbl>

What percent of private, nonprofit schools are cheaper than the University of Chicago?

# use filter() and percent_rank() to calculate percentage
scorecard %>%
  select(name, type, cost) %>%
  filter(type == "Private, nonprofit") %>%
  mutate(cost_rank = percent_rank(cost)) %>%
  filter(name == "University of Chicago")

## # A tibble: 1 × 4
##                    name               type  cost cost_rank
##                   <chr>              <chr> <int>     <dbl>
## 1 University of Chicago Private, nonprofit 62425 0.9981464

Which type of college has the highest average SAT score?

# use group_by() and summarize()
scorecard %>%
  group_by(type) %>%
  summarize(mean_sat = mean(satavg, na.rm = TRUE))

## # A tibble: 3 × 2
##                  type mean_sat
##                 <chr>    <dbl>
## 1 Private, for-profit 1002.500
## 2  Private, nonprofit 1075.287
## 3              Public 1037.410

# using a boxplot
ggplot(scorecard, aes(type, satavg)) +
  geom_boxplot()

# using a frequency polygon
ggplot(scorecard, aes(satavg, y = ..density.., color = type)) +
  geom_freqpoly()

# using a histogram and facets
ggplot(scorecard, aes(satavg)) +
  geom_histogram() +
  facet_wrap(~ type)

What is the relationship between cost and faculty salaries?

# scatterplot and smoothing line
ggplot(scorecard, aes(cost, avgfacsal)) +
  geom_point() +
  geom_smooth()

# same as before, but use scale_color to visualize different relationships
# by college type and use alpha to make points semi-transparent
ggplot(scorecard, aes(cost, avgfacsal, color = type)) +
  geom_point(alpha = .2) +
  geom_smooth()

How does a college’s Pell Grant recipients effect the average student’s education debt?

# scatterplot and smoothing line
ggplot(scorecard, aes(pctpell, debt)) +
  geom_point() +
  geom_smooth()

Vectors

Basic unit of data storage in R
Types of vectors
- Logical
- Numeric
- Character
Certain operations can be performed only on certain types of vectors

Vectors can only be of one type

x <- c(1, 2, "abc", TRUE)
x

## [1] "1"    "2"    "abc"  "TRUE"

parse_logical(x)

## [1] TRUE   NA   NA TRUE
## attr(,"problems")
## # A tibble: 2 × 4
##     row   col           expected actual
##   <int> <int>              <chr>  <chr>
## 1     2    NA 1/0/T/F/TRUE/FALSE      2
## 2     3    NA 1/0/T/F/TRUE/FALSE    abc

parse_integer(x)

## [1]  1  2 NA NA
## attr(,"problems")
## # A tibble: 2 × 4
##     row   col   expected actual
##   <int> <int>      <chr>  <chr>
## 1     3    NA an integer    abc
## 2     4    NA an integer   TRUE

parse_character(x)

## [1] "1"    "2"    "abc"  "TRUE"

Parsing vectors

When the data is being read
- col_integer()
- col_character()
- col_logical()
After the data has been read
- mutate()
- as.character()
- as.numeric()
Anything that cannot be converted should turn into a missing value (NA)

Tidy data

Figure 12.1 from [@hadley2016]

Common tidying tasks

Gathering
Spreading
Separating
Uniting

Data wrangling

MACS 30500 University of Chicago

October 5, 2016

scorecard

Which were the 10 most expensive colleges in 2013?

What percent of private, nonprofit schools are cheaper than the University of Chicago?

Which type of college has the highest average SAT score?

What is the relationship between cost and faculty salaries?

How does a college’s Pell Grant recipients effect the average student’s education debt?

Vectors

Vectors can only be of one type

Parsing vectors

Tidy data

Common tidying tasks

MACS 30500
University of Chicago

`scorecard`