library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

Race

(race <- read.table(header = TRUE, check.names = FALSE, text = "
  Name    50  100  150  200  250  300  350
  Carla  1.2  1.8  2.2  2.3  3.0  2.5  1.8
  Mace   1.5  1.1  1.9  2.0  3.6  3.0  2.5
  Lea    1.7  1.6  2.3  2.7  2.6  2.2  2.6
  Karen  1.3  1.7  1.9  2.2  3.2  1.5  1.9
") %>%
  tbl_df())

## # A tibble: 4 × 8
##     Name  `50` `100` `150` `200` `250` `300` `350`
##   <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  Carla   1.2   1.8   2.2   2.3   3.0   2.5   1.8
## 2   Mace   1.5   1.1   1.9   2.0   3.6   3.0   2.5
## 3    Lea   1.7   1.6   2.3   2.7   2.6   2.2   2.6
## 4  Karen   1.3   1.7   1.9   2.2   3.2   1.5   1.9

This is essentially a gathering operation. Except for the Name column, the remaining columns are actually one variable spread across multiple columns. The column names are a distinct variable, and the columns’ values are another variable. We want to gather these columns. The key will tell us the original column name, and the value will give us the values in the cells. Because the column names are actually numeric values, we set convert = TRUE to coerce the new Time column into a numeric column (or vector). (The last line isn’t necessary, but sorts the data frame in a consistent manner.)

race %>%
  gather(key = Time, value = Score, -Name, convert = TRUE) %>%
  arrange(Name, Time)

## # A tibble: 28 × 3
##      Name  Time Score
##    <fctr> <int> <dbl>
## 1   Carla    50   1.2
## 2   Carla   100   1.8
## 3   Carla   150   2.2
## 4   Carla   200   2.3
## 5   Carla   250   3.0
## 6   Carla   300   2.5
## 7   Carla   350   1.8
## 8   Karen    50   1.3
## 9   Karen   100   1.7
## 10  Karen   150   1.9
## # ... with 18 more rows

Results

(results <- data_frame(
  Ind = rep(paste0("Ind", 1:10), times = 2),
  Treatment = rep(c("Treat", "Cont"), each = 10),
  value = 1:20
))

## # A tibble: 20 × 3
##      Ind Treatment value
##    <chr>     <chr> <int>
## 1   Ind1     Treat     1
## 2   Ind2     Treat     2
## 3   Ind3     Treat     3
## 4   Ind4     Treat     4
## 5   Ind5     Treat     5
## 6   Ind6     Treat     6
## 7   Ind7     Treat     7
## 8   Ind8     Treat     8
## 9   Ind9     Treat     9
## 10 Ind10     Treat    10
## 11  Ind1      Cont    11
## 12  Ind2      Cont    12
## 13  Ind3      Cont    13
## 14  Ind4      Cont    14
## 15  Ind5      Cont    15
## 16  Ind6      Cont    16
## 17  Ind7      Cont    17
## 18  Ind8      Cont    18
## 19  Ind9      Cont    19
## 20 Ind10      Cont    20

This dataset is not tidy because observations are spread across multiple rows. There only needs to be one row for each individual. Then Treat and Cont can be stored in separate columns.

results %>%
  spread(key = Treatment, value = value)

## # A tibble: 10 × 3
##      Ind  Cont Treat
## *  <chr> <int> <int>
## 1   Ind1    11     1
## 2  Ind10    20    10
## 3   Ind2    12     2
## 4   Ind3    13     3
## 5   Ind4    14     4
## 6   Ind5    15     5
## 7   Ind6    16     6
## 8   Ind7    17     7
## 9   Ind8    18     8
## 10  Ind9    19     9

Activities

set.seed(10)
(activities <- data_frame(
  id = paste("x", 1:10, sep = ""),
  trt = sample(c('cnt', 'tr'), 10, T),
  work.T1 = runif(10),
  play.T1 = runif(10),
  talk.T1 = runif(10),
  work.T2 = runif(10),
  play.T2 = runif(10),
  talk.T2 = runif(10)
))

## # A tibble: 10 × 8
##       id   trt    work.T1   play.T1    talk.T1    work.T2   play.T2
##    <chr> <chr>      <dbl>     <dbl>      <dbl>      <dbl>     <dbl>
## 1     x1    tr 0.65165567 0.8647212 0.53559704 0.27548386 0.3543281
## 2     x2   cnt 0.56773775 0.6153524 0.09308813 0.22890394 0.9364325
## 3     x3   cnt 0.11350898 0.7751099 0.16980304 0.01443391 0.2458664
## 4     x4    tr 0.59592531 0.3555687 0.89983245 0.72896456 0.4731415
## 5     x5   cnt 0.35804998 0.4058500 0.42263761 0.24988047 0.1915609
## 6     x6   cnt 0.42880942 0.7066469 0.74774647 0.16118328 0.5832220
## 7     x7   cnt 0.05190332 0.8382877 0.82265258 0.01704265 0.4594732
## 8     x8   cnt 0.26417767 0.2395891 0.95465365 0.48610035 0.4674340
## 9     x9    tr 0.39879073 0.7707715 0.68544451 0.10290017 0.3998326
## 10   x10   cnt 0.83613414 0.3558977 0.50050323 0.80154700 0.5052856
## # ... with 1 more variables: talk.T2 <dbl>

This is a more complex operation. The basic problem is that we have variables stored in multiple columns (location, with possible values of work, play, and talk). We need to gather these columns into a single column for each variable. But what happens if we just gather them?

activities %>%
  gather(key, value, -id, -trt)

## # A tibble: 60 × 4
##       id   trt     key      value
##    <chr> <chr>   <chr>      <dbl>
## 1     x1    tr work.T1 0.65165567
## 2     x2   cnt work.T1 0.56773775
## 3     x3   cnt work.T1 0.11350898
## 4     x4    tr work.T1 0.59592531
## 5     x5   cnt work.T1 0.35804998
## 6     x6   cnt work.T1 0.42880942
## 7     x7   cnt work.T1 0.05190332
## 8     x8   cnt work.T1 0.26417767
## 9     x9    tr work.T1 0.39879073
## 10   x10   cnt work.T1 0.83613414
## # ... with 50 more rows

We’ve created a new problem! Actually, two problems:

We have a single observation stored across multiple rows: we want a single row for each id x trt pairing
We have two variables stored in a single column: key contains the information on both location (work, play, and talk) as well as when the measurement was taken (T1 or T2)

The best approach is to fix the second problem by separating the columns, then spreading the different types of measurements back into their own columns.

activities %>%
  gather(key, value, -id, -trt) %>%
  separate(key, into = c("location", "time"))

## # A tibble: 60 × 5
##       id   trt location  time      value
## *  <chr> <chr>    <chr> <chr>      <dbl>
## 1     x1    tr     work    T1 0.65165567
## 2     x2   cnt     work    T1 0.56773775
## 3     x3   cnt     work    T1 0.11350898
## 4     x4    tr     work    T1 0.59592531
## 5     x5   cnt     work    T1 0.35804998
## 6     x6   cnt     work    T1 0.42880942
## 7     x7   cnt     work    T1 0.05190332
## 8     x8   cnt     work    T1 0.26417767
## 9     x9    tr     work    T1 0.39879073
## 10   x10   cnt     work    T1 0.83613414
## # ... with 50 more rows

activities %>%
  gather(key, value, -id, -trt) %>%
  separate(key, into = c("location", "time")) %>%
  spread(location, value)

## # A tibble: 20 × 6
##       id   trt  time      play       talk       work
## *  <chr> <chr> <chr>     <dbl>      <dbl>      <dbl>
## 1     x1    tr    T1 0.8647212 0.53559704 0.65165567
## 2     x1    tr    T2 0.3543281 0.03188816 0.27548386
## 3    x10   cnt    T1 0.3558977 0.50050323 0.83613414
## 4    x10   cnt    T2 0.5052856 0.21913855 0.80154700
## 5     x2   cnt    T1 0.6153524 0.09308813 0.56773775
## 6     x2   cnt    T2 0.9364325 0.11446759 0.22890394
## 7     x3   cnt    T1 0.7751099 0.16980304 0.11350898
## 8     x3   cnt    T2 0.2458664 0.46893548 0.01443391
## 9     x4    tr    T1 0.3555687 0.89983245 0.59592531
## 10    x4    tr    T2 0.4731415 0.39698674 0.72896456
## 11    x5   cnt    T1 0.4058500 0.42263761 0.35804998
## 12    x5   cnt    T2 0.1915609 0.83361919 0.24988047
## 13    x6   cnt    T1 0.7066469 0.74774647 0.42880942
## 14    x6   cnt    T2 0.5832220 0.76112174 0.16118328
## 15    x7   cnt    T1 0.8382877 0.82265258 0.05190332
## 16    x7   cnt    T2 0.4594732 0.57335645 0.01704265
## 17    x8   cnt    T1 0.2395891 0.95465365 0.26417767
## 18    x8   cnt    T2 0.4674340 0.44750805 0.48610035
## 19    x9    tr    T1 0.7707715 0.68544451 0.39879073
## 20    x9    tr    T2 0.3998326 0.08380201 0.10290017

The whole operation in a single chain (with an arrange() thrown in to sort the data frame):

activities %>%
  gather(key, value, -id, -trt) %>%
  separate(key, into = c("location", "time")) %>%
  arrange(id, trt, time) %>%
  spread(location, value)

## # A tibble: 20 × 6
##       id   trt  time      play       talk       work
## *  <chr> <chr> <chr>     <dbl>      <dbl>      <dbl>
## 1     x1    tr    T1 0.8647212 0.53559704 0.65165567
## 2     x1    tr    T2 0.3543281 0.03188816 0.27548386
## 3    x10   cnt    T1 0.3558977 0.50050323 0.83613414
## 4    x10   cnt    T2 0.5052856 0.21913855 0.80154700
## 5     x2   cnt    T1 0.6153524 0.09308813 0.56773775
## 6     x2   cnt    T2 0.9364325 0.11446759 0.22890394
## 7     x3   cnt    T1 0.7751099 0.16980304 0.11350898
## 8     x3   cnt    T2 0.2458664 0.46893548 0.01443391
## 9     x4    tr    T1 0.3555687 0.89983245 0.59592531
## 10    x4    tr    T2 0.4731415 0.39698674 0.72896456
## 11    x5   cnt    T1 0.4058500 0.42263761 0.35804998
## 12    x5   cnt    T2 0.1915609 0.83361919 0.24988047
## 13    x6   cnt    T1 0.7066469 0.74774647 0.42880942
## 14    x6   cnt    T2 0.5832220 0.76112174 0.16118328
## 15    x7   cnt    T1 0.8382877 0.82265258 0.05190332
## 16    x7   cnt    T2 0.4594732 0.57335645 0.01704265
## 17    x8   cnt    T1 0.2395891 0.95465365 0.26417767
## 18    x8   cnt    T2 0.4674340 0.44750805 0.48610035
## 19    x9    tr    T1 0.7707715 0.68544451 0.39879073
## 20    x9    tr    T2 0.3998326 0.08380201 0.10290017

Grades

(grades <- read.table(header = TRUE, text = "
   ID   Test Year   Fall Spring Winter
    1   1   2008    15      16      19
    1   1   2009    12      13      27
    1   2   2008    22      22      24
    1   2   2009    10      14      20
    2   1   2008    12      13      25
    2   1   2009    16      14      21
    2   2   2008    13      11      29
    2   2   2009    23      20      26
    3   1   2008    11      12      22
    3   1   2009    13      11      27
    3   2   2008    17      12      23
    3   2   2009    14      9       31
") %>%
  tbl_df())

## # A tibble: 12 × 6
##       ID  Test  Year  Fall Spring Winter
##    <int> <int> <int> <int>  <int>  <int>
## 1      1     1  2008    15     16     19
## 2      1     1  2009    12     13     27
## 3      1     2  2008    22     22     24
## 4      1     2  2009    10     14     20
## 5      2     1  2008    12     13     25
## 6      2     1  2009    16     14     21
## 7      2     2  2008    13     11     29
## 8      2     2  2009    23     20     26
## 9      3     1  2008    11     12     22
## 10     3     1  2009    13     11     27
## 11     3     2  2008    17     12     23
## 12     3     2  2009    14      9     31

In this example, the basic unit of observation is the test. Each individual takes two separate tests (labeled 1 and 2) at multiple points in time: during each quarter (Fall, Winter, Spring) as well as in multiple years (2008 and 2009). So our final data frame should contain five columns: ID (identifying the student), Year (year the test was taken), Quarter (quarter in which the test was taken), Test1 (score on the first test), and Test2 (score on the second test).

Let’s start with the gathering operation: we want to gather Fall, Winter, and Spring into a single column (we can use the inclusive select function : to gather these three columns):

grades %>%
  gather(key = Quarter, value = Score, Fall:Winter)

## # A tibble: 36 × 5
##       ID  Test  Year Quarter Score
##    <int> <int> <int>   <chr> <int>
## 1      1     1  2008    Fall    15
## 2      1     1  2009    Fall    12
## 3      1     2  2008    Fall    22
## 4      1     2  2009    Fall    10
## 5      2     1  2008    Fall    12
## 6      2     1  2009    Fall    16
## 7      2     2  2008    Fall    13
## 8      2     2  2009    Fall    23
## 9      3     1  2008    Fall    11
## 10     3     1  2009    Fall    13
## # ... with 26 more rows

Good, but now we spread observations across multiple rows. Remember that we want each test to be a separate variable. To do that, we can spread those values across two columns.

grades %>%
  gather(Quarter, Score, Fall:Winter) %>%
  spread(Test, Score)

## # A tibble: 18 × 5
##       ID  Year Quarter   `1`   `2`
## *  <int> <int>   <chr> <int> <int>
## 1      1  2008    Fall    15    22
## 2      1  2008  Spring    16    22
## 3      1  2008  Winter    19    24
## 4      1  2009    Fall    12    10
## 5      1  2009  Spring    13    14
## 6      1  2009  Winter    27    20
## 7      2  2008    Fall    12    13
## 8      2  2008  Spring    13    11
## 9      2  2008  Winter    25    29
## 10     2  2009    Fall    16    23
## 11     2  2009  Spring    14    20
## 12     2  2009  Winter    21    26
## 13     3  2008    Fall    11    17
## 14     3  2008  Spring    12    12
## 15     3  2008  Winter    22    23
## 16     3  2009    Fall    13    14
## 17     3  2009  Spring    11     9
## 18     3  2009  Winter    27    31

This is an acceptable solution, but look what happened to the column names for each test: they’re labeled as numbers (1 and 2). We’d rather give columns text labels, since those are easier to reference in our code. We can use a function called paste0() to add text to the Test column before we spread it. paste0() “pastes” two or more values together to create a single value:¹

grades %>%
  gather(Quarter, Score, Fall:Winter) %>%
  mutate(Test = paste0("Test", Test))

## # A tibble: 36 × 5
##       ID  Test  Year Quarter Score
##    <int> <chr> <int>   <chr> <int>
## 1      1 Test1  2008    Fall    15
## 2      1 Test1  2009    Fall    12
## 3      1 Test2  2008    Fall    22
## 4      1 Test2  2009    Fall    10
## 5      2 Test1  2008    Fall    12
## 6      2 Test1  2009    Fall    16
## 7      2 Test2  2008    Fall    13
## 8      2 Test2  2009    Fall    23
## 9      3 Test1  2008    Fall    11
## 10     3 Test1  2009    Fall    13
## # ... with 26 more rows

grades %>%
  gather(Quarter, Score, Fall:Winter) %>%
  mutate(Test = paste0("Test", Test)) %>%
  spread(Test, Score)

## # A tibble: 18 × 5
##       ID  Year Quarter Test1 Test2
## *  <int> <int>   <chr> <int> <int>
## 1      1  2008    Fall    15    22
## 2      1  2008  Spring    16    22
## 3      1  2008  Winter    19    24
## 4      1  2009    Fall    12    10
## 5      1  2009  Spring    13    14
## 6      1  2009  Winter    27    20
## 7      2  2008    Fall    12    13
## 8      2  2008  Spring    13    11
## 9      2  2008  Winter    25    29
## 10     2  2009    Fall    16    23
## 11     2  2009  Spring    14    20
## 12     2  2009  Winter    21    26
## 13     3  2008    Fall    11    17
## 14     3  2008  Spring    12    12
## 15     3  2008  Winter    22    23
## 16     3  2009    Fall    13    14
## 17     3  2009  Spring    11     9
## 18     3  2009  Winter    27    31

If we’re cleaning up the data frame, let’s also arrange it in a logical order:

grades %>%
  gather(Quarter, Score, Fall:Winter) %>%
  mutate(Test = paste0("Test", Test)) %>%
  spread(Test, Score) %>%
  arrange(ID, Year, Quarter)

## # A tibble: 18 × 5
##       ID  Year Quarter Test1 Test2
##    <int> <int>   <chr> <int> <int>
## 1      1  2008    Fall    15    22
## 2      1  2008  Spring    16    22
## 3      1  2008  Winter    19    24
## 4      1  2009    Fall    12    10
## 5      1  2009  Spring    13    14
## 6      1  2009  Winter    27    20
## 7      2  2008    Fall    12    13
## 8      2  2008  Spring    13    11
## 9      2  2008  Winter    25    29
## 10     2  2009    Fall    16    23
## 11     2  2009  Spring    14    20
## 12     2  2009  Winter    21    26
## 13     3  2008    Fall    11    17
## 14     3  2008  Spring    12    12
## 15     3  2008  Winter    22    23
## 16     3  2009    Fall    13    14
## 17     3  2009  Spring    11     9
## 18     3  2009  Winter    27    31

Session Info

devtools::session_info()

## Session info --------------------------------------------------------------

##  setting  value                       
##  version  R version 3.3.1 (2016-06-21)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2016-10-06

## Packages ------------------------------------------------------------------

##  package    * version  date       source                            
##  assertthat   0.1      2013-12-06 CRAN (R 3.3.0)                    
##  codetools    0.2-14   2015-07-15 CRAN (R 3.3.1)                    
##  DBI          0.5-1    2016-09-10 CRAN (R 3.3.0)                    
##  devtools     1.12.0   2016-06-24 CRAN (R 3.3.0)                    
##  digest       0.6.10   2016-08-02 CRAN (R 3.3.0)                    
##  dplyr      * 0.5.0    2016-06-24 CRAN (R 3.3.0)                    
##  evaluate     0.9      2016-04-29 CRAN (R 3.3.0)                    
##  formatR      1.4      2016-05-09 CRAN (R 3.3.0)                    
##  htmltools    0.3.5    2016-03-21 CRAN (R 3.3.0)                    
##  knitr        1.14     2016-08-13 CRAN (R 3.3.0)                    
##  lazyeval     0.2.0    2016-06-12 CRAN (R 3.3.0)                    
##  magrittr     1.5      2014-11-22 CRAN (R 3.3.0)                    
##  memoise      1.0.0    2016-01-29 CRAN (R 3.3.0)                    
##  R6           2.1.3    2016-08-19 CRAN (R 3.3.0)                    
##  rcfss        0.1.0    2016-10-04 local                             
##  Rcpp         0.12.7   2016-09-05 cran (@0.12.7)                    
##  rmarkdown    1.0.9016 2016-10-02 Github (rstudio/rmarkdown@fe693c3)
##  stringi      1.1.1    2016-05-27 CRAN (R 3.3.0)                    
##  stringr      1.1.0    2016-08-19 cran (@1.1.0)                     
##  tibble       1.2      2016-08-26 cran (@1.2)                       
##  tidyr      * 0.6.0    2016-08-12 CRAN (R 3.3.0)                    
##  withr        1.0.2    2016-06-20 CRAN (R 3.3.0)                    
##  yaml         2.1.13   2014-06-12 CRAN (R 3.3.0)

You might think to use the unite() function here, but it won’t work unless you first create a column which only contains the value Test, like so:

grades %>%
  gather(Quarter, Score, Fall:Winter) %>%
  mutate(test_label = "Test") %>%
  unite(col = Test, test_label, Test)

## # A tibble: 36 × 5
##       ID   Test  Year Quarter Score
## *  <int>  <chr> <int>   <chr> <int>
## 1      1 Test_1  2008    Fall    15
## 2      1 Test_1  2009    Fall    12
## 3      1 Test_2  2008    Fall    22
## 4      1 Test_2  2009    Fall    10
## 5      2 Test_1  2008    Fall    12
## 6      2 Test_1  2009    Fall    16
## 7      2 Test_2  2008    Fall    13
## 8      2 Test_2  2009    Fall    23
## 9      3 Test_1  2008    Fall    11
## 10     3 Test_1  2009    Fall    13
## # ... with 26 more rows

This works, but requires two lines of code instead of one.↩

Practice data wrangling

Benjamin Soltoff
MACS 30500 - Computing for the Social Sciences
University of Chicago

October 5, 2016

Race

Results

Activities

Grades

Session Info

Practice data wrangling

Benjamin Soltoff MACS 30500 - Computing for the Social Sciences University of Chicago

October 5, 2016

Race

Results

Activities

Grades

Session Info

Benjamin Soltoff
MACS 30500 - Computing for the Social Sciences
University of Chicago