Benjamin Soltoff
MACS 30500 - Computing for the Social Sciences
University of Chicago
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
(race <- read.table(header = TRUE, check.names = FALSE, text = "
Name 50 100 150 200 250 300 350
Carla 1.2 1.8 2.2 2.3 3.0 2.5 1.8
Mace 1.5 1.1 1.9 2.0 3.6 3.0 2.5
Lea 1.7 1.6 2.3 2.7 2.6 2.2 2.6
Karen 1.3 1.7 1.9 2.2 3.2 1.5 1.9
") %>%
tbl_df())
## # A tibble: 4 × 8
## Name `50` `100` `150` `200` `250` `300` `350`
## <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Carla 1.2 1.8 2.2 2.3 3.0 2.5 1.8
## 2 Mace 1.5 1.1 1.9 2.0 3.6 3.0 2.5
## 3 Lea 1.7 1.6 2.3 2.7 2.6 2.2 2.6
## 4 Karen 1.3 1.7 1.9 2.2 3.2 1.5 1.9
This is essentially a gathering operation. Except for the Name
column, the remaining columns are actually one variable spread across multiple columns. The column names are a distinct variable, and the columns’ values are another variable. We want to gather these columns. The key
will tell us the original column name, and the value
will give us the values in the cells. Because the column names are actually numeric values, we set convert = TRUE
to coerce the new Time
column into a numeric column (or vector). (The last line isn’t necessary, but sorts the data frame in a consistent manner.)
race %>%
gather(key = Time, value = Score, -Name, convert = TRUE) %>%
arrange(Name, Time)
## # A tibble: 28 × 3
## Name Time Score
## <fctr> <int> <dbl>
## 1 Carla 50 1.2
## 2 Carla 100 1.8
## 3 Carla 150 2.2
## 4 Carla 200 2.3
## 5 Carla 250 3.0
## 6 Carla 300 2.5
## 7 Carla 350 1.8
## 8 Karen 50 1.3
## 9 Karen 100 1.7
## 10 Karen 150 1.9
## # ... with 18 more rows
(results <- data_frame(
Ind = rep(paste0("Ind", 1:10), times = 2),
Treatment = rep(c("Treat", "Cont"), each = 10),
value = 1:20
))
## # A tibble: 20 × 3
## Ind Treatment value
## <chr> <chr> <int>
## 1 Ind1 Treat 1
## 2 Ind2 Treat 2
## 3 Ind3 Treat 3
## 4 Ind4 Treat 4
## 5 Ind5 Treat 5
## 6 Ind6 Treat 6
## 7 Ind7 Treat 7
## 8 Ind8 Treat 8
## 9 Ind9 Treat 9
## 10 Ind10 Treat 10
## 11 Ind1 Cont 11
## 12 Ind2 Cont 12
## 13 Ind3 Cont 13
## 14 Ind4 Cont 14
## 15 Ind5 Cont 15
## 16 Ind6 Cont 16
## 17 Ind7 Cont 17
## 18 Ind8 Cont 18
## 19 Ind9 Cont 19
## 20 Ind10 Cont 20
This dataset is not tidy because observations are spread across multiple rows. There only needs to be one row for each individual. Then Treat
and Cont
can be stored in separate columns.
results %>%
spread(key = Treatment, value = value)
## # A tibble: 10 × 3
## Ind Cont Treat
## * <chr> <int> <int>
## 1 Ind1 11 1
## 2 Ind10 20 10
## 3 Ind2 12 2
## 4 Ind3 13 3
## 5 Ind4 14 4
## 6 Ind5 15 5
## 7 Ind6 16 6
## 8 Ind7 17 7
## 9 Ind8 18 8
## 10 Ind9 19 9
set.seed(10)
(activities <- data_frame(
id = paste("x", 1:10, sep = ""),
trt = sample(c('cnt', 'tr'), 10, T),
work.T1 = runif(10),
play.T1 = runif(10),
talk.T1 = runif(10),
work.T2 = runif(10),
play.T2 = runif(10),
talk.T2 = runif(10)
))
## # A tibble: 10 × 8
## id trt work.T1 play.T1 talk.T1 work.T2 play.T2
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x1 tr 0.65165567 0.8647212 0.53559704 0.27548386 0.3543281
## 2 x2 cnt 0.56773775 0.6153524 0.09308813 0.22890394 0.9364325
## 3 x3 cnt 0.11350898 0.7751099 0.16980304 0.01443391 0.2458664
## 4 x4 tr 0.59592531 0.3555687 0.89983245 0.72896456 0.4731415
## 5 x5 cnt 0.35804998 0.4058500 0.42263761 0.24988047 0.1915609
## 6 x6 cnt 0.42880942 0.7066469 0.74774647 0.16118328 0.5832220
## 7 x7 cnt 0.05190332 0.8382877 0.82265258 0.01704265 0.4594732
## 8 x8 cnt 0.26417767 0.2395891 0.95465365 0.48610035 0.4674340
## 9 x9 tr 0.39879073 0.7707715 0.68544451 0.10290017 0.3998326
## 10 x10 cnt 0.83613414 0.3558977 0.50050323 0.80154700 0.5052856
## # ... with 1 more variables: talk.T2 <dbl>
This is a more complex operation. The basic problem is that we have variables stored in multiple columns (location, with possible values of work
, play
, and talk
). We need to gather these columns into a single column for each variable. But what happens if we just gather them?
activities %>%
gather(key, value, -id, -trt)
## # A tibble: 60 × 4
## id trt key value
## <chr> <chr> <chr> <dbl>
## 1 x1 tr work.T1 0.65165567
## 2 x2 cnt work.T1 0.56773775
## 3 x3 cnt work.T1 0.11350898
## 4 x4 tr work.T1 0.59592531
## 5 x5 cnt work.T1 0.35804998
## 6 x6 cnt work.T1 0.42880942
## 7 x7 cnt work.T1 0.05190332
## 8 x8 cnt work.T1 0.26417767
## 9 x9 tr work.T1 0.39879073
## 10 x10 cnt work.T1 0.83613414
## # ... with 50 more rows
We’ve created a new problem! Actually, two problems:
id
x trt
pairingkey
contains the information on both location (work
, play
, and talk
) as well as when the measurement was taken (T1
or T2
)The best approach is to fix the second problem by separating the columns, then spreading the different types of measurements back into their own columns.
activities %>%
gather(key, value, -id, -trt) %>%
separate(key, into = c("location", "time"))
## # A tibble: 60 × 5
## id trt location time value
## * <chr> <chr> <chr> <chr> <dbl>
## 1 x1 tr work T1 0.65165567
## 2 x2 cnt work T1 0.56773775
## 3 x3 cnt work T1 0.11350898
## 4 x4 tr work T1 0.59592531
## 5 x5 cnt work T1 0.35804998
## 6 x6 cnt work T1 0.42880942
## 7 x7 cnt work T1 0.05190332
## 8 x8 cnt work T1 0.26417767
## 9 x9 tr work T1 0.39879073
## 10 x10 cnt work T1 0.83613414
## # ... with 50 more rows
activities %>%
gather(key, value, -id, -trt) %>%
separate(key, into = c("location", "time")) %>%
spread(location, value)
## # A tibble: 20 × 6
## id trt time play talk work
## * <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 x1 tr T1 0.8647212 0.53559704 0.65165567
## 2 x1 tr T2 0.3543281 0.03188816 0.27548386
## 3 x10 cnt T1 0.3558977 0.50050323 0.83613414
## 4 x10 cnt T2 0.5052856 0.21913855 0.80154700
## 5 x2 cnt T1 0.6153524 0.09308813 0.56773775
## 6 x2 cnt T2 0.9364325 0.11446759 0.22890394
## 7 x3 cnt T1 0.7751099 0.16980304 0.11350898
## 8 x3 cnt T2 0.2458664 0.46893548 0.01443391
## 9 x4 tr T1 0.3555687 0.89983245 0.59592531
## 10 x4 tr T2 0.4731415 0.39698674 0.72896456
## 11 x5 cnt T1 0.4058500 0.42263761 0.35804998
## 12 x5 cnt T2 0.1915609 0.83361919 0.24988047
## 13 x6 cnt T1 0.7066469 0.74774647 0.42880942
## 14 x6 cnt T2 0.5832220 0.76112174 0.16118328
## 15 x7 cnt T1 0.8382877 0.82265258 0.05190332
## 16 x7 cnt T2 0.4594732 0.57335645 0.01704265
## 17 x8 cnt T1 0.2395891 0.95465365 0.26417767
## 18 x8 cnt T2 0.4674340 0.44750805 0.48610035
## 19 x9 tr T1 0.7707715 0.68544451 0.39879073
## 20 x9 tr T2 0.3998326 0.08380201 0.10290017
The whole operation in a single chain (with an arrange()
thrown in to sort the data frame):
activities %>%
gather(key, value, -id, -trt) %>%
separate(key, into = c("location", "time")) %>%
arrange(id, trt, time) %>%
spread(location, value)
## # A tibble: 20 × 6
## id trt time play talk work
## * <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 x1 tr T1 0.8647212 0.53559704 0.65165567
## 2 x1 tr T2 0.3543281 0.03188816 0.27548386
## 3 x10 cnt T1 0.3558977 0.50050323 0.83613414
## 4 x10 cnt T2 0.5052856 0.21913855 0.80154700
## 5 x2 cnt T1 0.6153524 0.09308813 0.56773775
## 6 x2 cnt T2 0.9364325 0.11446759 0.22890394
## 7 x3 cnt T1 0.7751099 0.16980304 0.11350898
## 8 x3 cnt T2 0.2458664 0.46893548 0.01443391
## 9 x4 tr T1 0.3555687 0.89983245 0.59592531
## 10 x4 tr T2 0.4731415 0.39698674 0.72896456
## 11 x5 cnt T1 0.4058500 0.42263761 0.35804998
## 12 x5 cnt T2 0.1915609 0.83361919 0.24988047
## 13 x6 cnt T1 0.7066469 0.74774647 0.42880942
## 14 x6 cnt T2 0.5832220 0.76112174 0.16118328
## 15 x7 cnt T1 0.8382877 0.82265258 0.05190332
## 16 x7 cnt T2 0.4594732 0.57335645 0.01704265
## 17 x8 cnt T1 0.2395891 0.95465365 0.26417767
## 18 x8 cnt T2 0.4674340 0.44750805 0.48610035
## 19 x9 tr T1 0.7707715 0.68544451 0.39879073
## 20 x9 tr T2 0.3998326 0.08380201 0.10290017
(grades <- read.table(header = TRUE, text = "
ID Test Year Fall Spring Winter
1 1 2008 15 16 19
1 1 2009 12 13 27
1 2 2008 22 22 24
1 2 2009 10 14 20
2 1 2008 12 13 25
2 1 2009 16 14 21
2 2 2008 13 11 29
2 2 2009 23 20 26
3 1 2008 11 12 22
3 1 2009 13 11 27
3 2 2008 17 12 23
3 2 2009 14 9 31
") %>%
tbl_df())
## # A tibble: 12 × 6
## ID Test Year Fall Spring Winter
## <int> <int> <int> <int> <int> <int>
## 1 1 1 2008 15 16 19
## 2 1 1 2009 12 13 27
## 3 1 2 2008 22 22 24
## 4 1 2 2009 10 14 20
## 5 2 1 2008 12 13 25
## 6 2 1 2009 16 14 21
## 7 2 2 2008 13 11 29
## 8 2 2 2009 23 20 26
## 9 3 1 2008 11 12 22
## 10 3 1 2009 13 11 27
## 11 3 2 2008 17 12 23
## 12 3 2 2009 14 9 31
In this example, the basic unit of observation is the test. Each individual takes two separate tests (labeled 1
and 2
) at multiple points in time: during each quarter (Fall
, Winter
, Spring
) as well as in multiple years (2008
and 2009
). So our final data frame should contain five columns: ID
(identifying the student), Year
(year the test was taken), Quarter
(quarter in which the test was taken), Test1
(score on the first test), and Test2
(score on the second test).
Let’s start with the gathering operation: we want to gather Fall
, Winter
, and Spring
into a single column (we can use the inclusive select function :
to gather these three columns):
grades %>%
gather(key = Quarter, value = Score, Fall:Winter)
## # A tibble: 36 × 5
## ID Test Year Quarter Score
## <int> <int> <int> <chr> <int>
## 1 1 1 2008 Fall 15
## 2 1 1 2009 Fall 12
## 3 1 2 2008 Fall 22
## 4 1 2 2009 Fall 10
## 5 2 1 2008 Fall 12
## 6 2 1 2009 Fall 16
## 7 2 2 2008 Fall 13
## 8 2 2 2009 Fall 23
## 9 3 1 2008 Fall 11
## 10 3 1 2009 Fall 13
## # ... with 26 more rows
Good, but now we spread observations across multiple rows. Remember that we want each test to be a separate variable. To do that, we can spread those values across two columns.
grades %>%
gather(Quarter, Score, Fall:Winter) %>%
spread(Test, Score)
## # A tibble: 18 × 5
## ID Year Quarter `1` `2`
## * <int> <int> <chr> <int> <int>
## 1 1 2008 Fall 15 22
## 2 1 2008 Spring 16 22
## 3 1 2008 Winter 19 24
## 4 1 2009 Fall 12 10
## 5 1 2009 Spring 13 14
## 6 1 2009 Winter 27 20
## 7 2 2008 Fall 12 13
## 8 2 2008 Spring 13 11
## 9 2 2008 Winter 25 29
## 10 2 2009 Fall 16 23
## 11 2 2009 Spring 14 20
## 12 2 2009 Winter 21 26
## 13 3 2008 Fall 11 17
## 14 3 2008 Spring 12 12
## 15 3 2008 Winter 22 23
## 16 3 2009 Fall 13 14
## 17 3 2009 Spring 11 9
## 18 3 2009 Winter 27 31
This is an acceptable solution, but look what happened to the column names for each test: they’re labeled as numbers (1
and 2
). We’d rather give columns text labels, since those are easier to reference in our code. We can use a function called paste0()
to add text to the Test
column before we spread it. paste0()
“pastes” two or more values together to create a single value:1
grades %>%
gather(Quarter, Score, Fall:Winter) %>%
mutate(Test = paste0("Test", Test))
## # A tibble: 36 × 5
## ID Test Year Quarter Score
## <int> <chr> <int> <chr> <int>
## 1 1 Test1 2008 Fall 15
## 2 1 Test1 2009 Fall 12
## 3 1 Test2 2008 Fall 22
## 4 1 Test2 2009 Fall 10
## 5 2 Test1 2008 Fall 12
## 6 2 Test1 2009 Fall 16
## 7 2 Test2 2008 Fall 13
## 8 2 Test2 2009 Fall 23
## 9 3 Test1 2008 Fall 11
## 10 3 Test1 2009 Fall 13
## # ... with 26 more rows
grades %>%
gather(Quarter, Score, Fall:Winter) %>%
mutate(Test = paste0("Test", Test)) %>%
spread(Test, Score)
## # A tibble: 18 × 5
## ID Year Quarter Test1 Test2
## * <int> <int> <chr> <int> <int>
## 1 1 2008 Fall 15 22
## 2 1 2008 Spring 16 22
## 3 1 2008 Winter 19 24
## 4 1 2009 Fall 12 10
## 5 1 2009 Spring 13 14
## 6 1 2009 Winter 27 20
## 7 2 2008 Fall 12 13
## 8 2 2008 Spring 13 11
## 9 2 2008 Winter 25 29
## 10 2 2009 Fall 16 23
## 11 2 2009 Spring 14 20
## 12 2 2009 Winter 21 26
## 13 3 2008 Fall 11 17
## 14 3 2008 Spring 12 12
## 15 3 2008 Winter 22 23
## 16 3 2009 Fall 13 14
## 17 3 2009 Spring 11 9
## 18 3 2009 Winter 27 31
If we’re cleaning up the data frame, let’s also arrange it in a logical order:
grades %>%
gather(Quarter, Score, Fall:Winter) %>%
mutate(Test = paste0("Test", Test)) %>%
spread(Test, Score) %>%
arrange(ID, Year, Quarter)
## # A tibble: 18 × 5
## ID Year Quarter Test1 Test2
## <int> <int> <chr> <int> <int>
## 1 1 2008 Fall 15 22
## 2 1 2008 Spring 16 22
## 3 1 2008 Winter 19 24
## 4 1 2009 Fall 12 10
## 5 1 2009 Spring 13 14
## 6 1 2009 Winter 27 20
## 7 2 2008 Fall 12 13
## 8 2 2008 Spring 13 11
## 9 2 2008 Winter 25 29
## 10 2 2009 Fall 16 23
## 11 2 2009 Spring 14 20
## 12 2 2009 Winter 21 26
## 13 3 2008 Fall 11 17
## 14 3 2008 Spring 12 12
## 15 3 2008 Winter 22 23
## 16 3 2009 Fall 13 14
## 17 3 2009 Spring 11 9
## 18 3 2009 Winter 27 31
devtools::session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.3.1 (2016-06-21)
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2016-10-06
## Packages ------------------------------------------------------------------
## package * version date source
## assertthat 0.1 2013-12-06 CRAN (R 3.3.0)
## codetools 0.2-14 2015-07-15 CRAN (R 3.3.1)
## DBI 0.5-1 2016-09-10 CRAN (R 3.3.0)
## devtools 1.12.0 2016-06-24 CRAN (R 3.3.0)
## digest 0.6.10 2016-08-02 CRAN (R 3.3.0)
## dplyr * 0.5.0 2016-06-24 CRAN (R 3.3.0)
## evaluate 0.9 2016-04-29 CRAN (R 3.3.0)
## formatR 1.4 2016-05-09 CRAN (R 3.3.0)
## htmltools 0.3.5 2016-03-21 CRAN (R 3.3.0)
## knitr 1.14 2016-08-13 CRAN (R 3.3.0)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.3.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
## memoise 1.0.0 2016-01-29 CRAN (R 3.3.0)
## R6 2.1.3 2016-08-19 CRAN (R 3.3.0)
## rcfss 0.1.0 2016-10-04 local
## Rcpp 0.12.7 2016-09-05 cran (@0.12.7)
## rmarkdown 1.0.9016 2016-10-02 Github (rstudio/rmarkdown@fe693c3)
## stringi 1.1.1 2016-05-27 CRAN (R 3.3.0)
## stringr 1.1.0 2016-08-19 cran (@1.1.0)
## tibble 1.2 2016-08-26 cran (@1.2)
## tidyr * 0.6.0 2016-08-12 CRAN (R 3.3.0)
## withr 1.0.2 2016-06-20 CRAN (R 3.3.0)
## yaml 2.1.13 2014-06-12 CRAN (R 3.3.0)
You might think to use the unite()
function here, but it won’t work unless you first create a column which only contains the value Test
, like so:
grades %>%
gather(Quarter, Score, Fall:Winter) %>%
mutate(test_label = "Test") %>%
unite(col = Test, test_label, Test)
## # A tibble: 36 × 5
## ID Test Year Quarter Score
## * <int> <chr> <int> <chr> <int>
## 1 1 Test_1 2008 Fall 15
## 2 1 Test_1 2009 Fall 12
## 3 1 Test_2 2008 Fall 22
## 4 1 Test_2 2009 Fall 10
## 5 2 Test_1 2008 Fall 12
## 6 2 Test_1 2009 Fall 16
## 7 2 Test_2 2008 Fall 13
## 8 2 Test_2 2009 Fall 23
## 9 3 Test_1 2008 Fall 11
## 10 3 Test_1 2009 Fall 13
## # ... with 26 more rows
This works, but requires two lines of code instead of one.↩