Visualizations and the Grammar of Graphics

MACS 30500
University of Chicago

September 28, 2016

id  \(N\)  \(\bar{X}\)  \(\bar{Y}\)  \(R\)
1   11     9            7.500909     0.8164205
2   11     9            7.500909     0.8162365
3   11     9            7.500000     0.8162867
4   11     9            7.500909     0.8165214

id  term         estimate   std.error  statistic  p.value
1   (Intercept)  3.0000909  1.1247468  2.667348   0.0257341
2   (Intercept)  3.0009091  1.1253024  2.666758   0.0257589
3   (Intercept)  3.0024545  1.1244812  2.670080   0.0256191
4   (Intercept)  3.0017273  1.1239211  2.670763   0.0255904
1   x            0.5000909  0.1179055  4.241455   0.0021696
2   x            0.5000000  0.1179637  4.238590   0.0021788
3   x            0.4997273  0.1178777  4.239372   0.0021763
4   x            0.4999091  0.1178189  4.243028   0.0021646

id  r.squared  adj.r.squared  sigma     logLik     AIC       BIC
1   0.6665425  0.6294916      1.236603  -16.84069  39.68137  40.87506
2   0.6662420  0.6291578      1.237214  -16.84612  39.69224  40.88593
3   0.6663240  0.6292489      1.236311  -16.83809  39.67618  40.86986
4   0.6667073  0.6296747      1.235696  -16.83261  39.66522  40.85890
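
The slide does not name the four datasets. The sketch below assumes they are Anscombe's quartet, which ships with R as the built-in `anscombe` data frame and gives the same means and fits. It recomputes the per-dataset summaries with base R only; the column names in `summaries` are illustrative choices, not the slide's.

```r
# Assumption: the four datasets are Anscombe's quartet (built-in `anscombe`),
# stored as columns x1..x4 and y1..y4.
summaries <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)  # simple linear regression of y on x
  data.frame(
    id        = i,
    n         = length(x),
    mean_x    = mean(x),
    mean_y    = mean(y),
    r         = cor(x, y),               # correlation coefficient
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared
  )
}))
summaries
```

The summaries are nearly identical across the four datasets, which is why plotting the raw data matters: the numbers alone cannot tell the datasets apart.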

Grammar

The whole system and structure of a language or of languages in general, usually taken as consisting of syntax and morphology (including inflections) and sometimes also phonology and semantics.

Grammar of graphics

  • “The fundamental principles or rules of an art or science”
  • Grammar of graphics - a grammar used to describe and create a wide range of statistical graphics
  • Layered grammar of graphics

Layered grammar of graphics

  • Layer
    • Data
    • Mapping
    • Statistical transformation (stat)
    • Geometric object (geom)
    • Position adjustment (position)
  • Scale
  • Coordinate system (coord)
  • Faceting (facet)
  • Defaults
    • Data
    • Mapping

Layer

  • Responsible for creating the objects that we perceive on the plot
  • Defined by its subcomponents

Data and mapping

  • Data defines the source of the information to be visualized
  • Mapping defines how the variables are applied to the graphic

Data: mpg

## # A tibble: 234 × 11
##    manufacturer      model displ  year   cyl      trans   drv   cty   hwy
##           <chr>      <chr> <dbl> <int> <int>      <chr> <chr> <int> <int>
## 1          audi         a4   1.8  1999     4   auto(l5)     f    18    29
## 2          audi         a4   1.8  1999     4 manual(m5)     f    21    29
## 3          audi         a4   2.0  2008     4 manual(m6)     f    20    31
## 4          audi         a4   2.0  2008     4   auto(av)     f    21    30
## 5          audi         a4   2.8  1999     6   auto(l5)     f    16    26
## 6          audi         a4   2.8  1999     6 manual(m5)     f    18    26
## 7          audi         a4   3.1  2008     6   auto(av)     f    18    27
## 8          audi a4 quattro   1.8  1999     4 manual(m5)     4    18    26
## 9          audi a4 quattro   1.8  1999     4   auto(l5)     4    16    25
## 10         audi a4 quattro   2.0  2008     4 manual(m6)     4    20    28
## # ... with 224 more rows, and 2 more variables: fl <chr>, class <chr>

Data: mpg

## # A tibble: 234 × 2
##    displ   hwy
##    <dbl> <int>
## 1    1.8    29
## 2    1.8    29
## 3    2.0    31
## 4    2.0    30
## 5    2.8    26
## 6    2.8    26
## 7    3.1    27
## 8    1.8    26
## 9    1.8    25
## 10   2.0    28
## # ... with 224 more rows

Mapping: mpg

## # A tibble: 234 × 2
##        x     y
##    <dbl> <int>
## 1    1.8    29
## 2    1.8    29
## 3    2.0    31
## 4    2.0    30
## 5    2.8    26
## 6    2.8    26
## 7    3.1    27
## 8    1.8    26
## 9    1.8    25
## 10   2.0    28
## # ... with 224 more rows
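
The data and mapping steps shown in the tables above can be imitated by hand. This is only a sketch of the idea, not what ggplot2 does internally; `data_layer` and `mapped` are hypothetical names.

```r
library(ggplot2)  # provides the mpg data

# Data: keep the two variables to be visualized
data_layer <- mpg[, c("displ", "hwy")]

# Mapping: assign each variable to an aesthetic (x and y position)
mapped <- setNames(data_layer, c("x", "y"))
head(mapped)
```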

Geometric objects (geoms)

  • Control the type of plot you create
    • 0 dimensions - point, text
    • 1 dimension - path, line
    • 2 dimensions - polygon, interval
  • Geoms have specific aesthetics
    • Point geom - position, color, shape, and size
    • Bar geom - position, height, width, and fill

Point geom
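
The figure for this slide is not preserved. A minimal sketch of a point geom, assuming the `mpg` data and an arbitrary choice of color, shape, and size:

```r
library(ggplot2)

# Position comes from x and y; color is mapped to a variable,
# while shape and size are set to fixed values.
p_point <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(shape = 16, size = 2)
p_point
```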

Bar geom
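
The figure for this slide is not preserved. A minimal sketch of a bar geom, assuming the `mpg` data; the fill color and bar width are arbitrary choices:

```r
library(ggplot2)

# Position comes from x; height is the count of rows in each class;
# width and fill are set to fixed values.
p_bar <- ggplot(mpg, aes(x = class)) +
  geom_bar(fill = "steelblue", width = 0.8)
p_bar
```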

Statistical transformation (stat)

  • Transforms the data (typically by summarizing the information)

Raw data

## # A tibble: 234 × 1
##      cyl
##    <int>
## 1      4
## 2      4
## 3      4
## 4      4
## 5      6
## 6      6
## 7      6
## 8      4
## 9      4
## 10     4
## # ... with 224 more rows

Transformed data

## # A tibble: 4 × 2
##     cyl     n
##   <int> <int>
## 1     4    81
## 2     5     4
## 3     6    79
## 4     8    70
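
The transformation from raw to summarized data can be reproduced by hand and compared with what the bar geom's default stat computes. A sketch using base R's `table()`; `cyl_counts` and `p_cyl` are hypothetical names.

```r
library(ggplot2)

# Count the rows for each value of cyl by hand
cyl_counts <- as.data.frame(table(cyl = mpg$cyl), responseName = "n")
cyl_counts

# geom_bar() applies the same counting transformation before drawing
p_cyl <- ggplot(mpg, aes(x = factor(cyl))) +
  geom_bar()
layer_data(p_cyl)$count  # the counts ggplot2 computed for the bars
```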

Position adjustment
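
The figures for this slide are not preserved. As a sketch, the same layer can be drawn with different position adjustments; the variables chosen here are assumptions, not necessarily those on the original slides.

```r
library(ggplot2)

# Bars for each drive type stacked on top of one another
p_stack <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "stack")

# The same bars placed side by side
p_dodge <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge")

# Points nudged by a small random amount to reduce overplotting
p_jitter <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(position = "jitter")
```

Only the placement of the objects changes; the data, mapping, stat, and geom are the same within each pair.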

Scale

  • Controls the mapping from data to aesthetic attributes

Scale: color
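
The figure for this slide is not preserved. A sketch of a color scale, assuming the `mpg` data; the "Dark2" palette is an arbitrary choice.

```r
library(ggplot2)

# The color scale decides which color each value of class receives
p_color <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")
p_color
```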

Scale: size
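
The figure for this slide is not preserved. A sketch of a size scale, assuming the `mpg` data; the mapped variable and the size range are arbitrary choices.

```r
library(ggplot2)

# The size scale converts cyl values into point sizes between 1 and 6
p_size <- ggplot(mpg, aes(x = displ, y = hwy, size = cyl)) +
  geom_point(alpha = 0.5) +
  scale_size(range = c(1, 6))
p_size
```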

Coordinate system (coord)

  • Maps the position of objects onto the plane of the plot

Cartesian coordinate system

Semi-log

Polar
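
The figures for the three coordinate system slides are not preserved. As a sketch, the same layer can be drawn under each system; `base` is a hypothetical name, and the log-transformed axis is assumed to be y.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

p_cartesian <- base + coord_cartesian()         # the default
p_semilog   <- base + coord_trans(y = "log10")  # log scale on one axis only
p_polar     <- base + coord_polar()             # x becomes angle, y becomes radius
```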

Faceting
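
The figure for this slide is not preserved. A sketch of faceting, assuming the `mpg` data split by `class`:

```r
library(ggplot2)

# Draw the same plot once per vehicle class, in separate panels
p_facet <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)
p_facet
```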

Defaults

ggplot() +
  layer(
    data = mpg, mapping = aes(x = displ, y = hwy),
    geom = "point", stat = "identity", position = "identity"
  ) +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian()

Defaults

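# Full specification: layer, scales, and coordinate system all written out
ggplot() +
  layer(
    data = mpg, mapping = aes(x = displ, y = hwy),
    geom = "point", stat = "identity", position = "identity"
  ) +
  scale_x_continuous() +
  scale_y_continuous() +
  coord_cartesian()
# Conceptually, the default stat, position, scales, and coord can be dropped.
# (Illustrative: recent ggplot2 versions raise an error if stat or position
# is missing from layer().)
ggplot() +
  layer(
    data = mpg, mapping = aes(x = displ, y = hwy),
    geom = "point"
  )
# geom_point() supplies those defaults; data and mapping move into ggplot()
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
# Shortest form: arguments matched by position
ggplot(mpg, aes(displ, hwy)) +
  geom_point()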

Defaults

ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess'

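Carte figurative des pertes successives en hommes de l’Armée Française dans la campagne de Russie 1812–1813 (“Figurative map of the successive losses in men of the French Army in the Russian campaign, 1812–1813”) by Charles Joseph Minard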

Building Minard’s map in R

library(readr)    # read_table()
library(ggplot2)  # plotting functions used below

troops <- read_table("data/minard-troops.txt")
cities <- read_table("data/minard-cities.txt")
troops
## # A tibble: 51 × 5
##     long   lat survivors direction group
##    <dbl> <dbl>     <int>     <chr> <int>
## 1   24.0  54.9    340000         A     1
## 2   24.5  55.0    340000         A     1
## 3   25.5  54.5    340000         A     1
## 4   26.0  54.7    320000         A     1
## 5   27.0  54.8    300000         A     1
## 6   28.0  54.9    280000         A     1
## 7   28.5  55.0    240000         A     1
## 8   29.0  55.1    210000         A     1
## 9   30.0  55.2    180000         A     1
## 10  30.3  55.3    175000         A     1
## # ... with 41 more rows
cities
## # A tibble: 20 × 3
##     long   lat           city
##    <dbl> <dbl>          <chr>
## 1   24.0  55.0          Kowno
## 2   25.3  54.7          Wilna
## 3   26.4  54.4       Smorgoni
## 4   26.8  54.3      Moiodexno
## 5   27.7  55.2      Gloubokoe
## 6   27.6  53.9          Minsk
## 7   28.5  54.3     Studienska
## 8   28.7  55.5        Polotzk
## 9   29.2  54.4           Bobr
## 10  30.2  55.3        Witebsk
## 11  30.4  54.5         Orscha
## 12  30.4  53.9        Mohilow
## 13  32.0  54.8       Smolensk
## 14  33.2  54.9    Dorogobouge
## 15  34.3  55.2          Wixma
## 16  34.4  55.5          Chjat
## 17  36.0  55.5        Mojaisk
## 18  37.6  55.8         Moscou
## 19  36.6  55.3      Tarantino
## 20  36.5  55.0 Malo-Jarosewii

Minard’s grammar

  • Troops
    • Latitude
    • Longitude
    • Survivors
    • Advance/retreat
  • Cities
    • Latitude
    • Longitude
    • City name

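# Troops layer: path width shows survivors, color shows advance/retreat,
# and group keeps each army division as its own path.
# On ggplot2 3.4.0 or later, line thickness is the linewidth aesthetic, so this
# may need aes(linewidth = survivors) and scale_linewidth() instead of size.
plot_troops <- ggplot(troops, aes(long, lat)) +
  geom_path(aes(size = survivors,
                color = direction,
                group = group))
plot_troops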

plot_both <- plot_troops + 
  geom_text(data = cities, aes(label = city), size = 4)
plot_both

# Adjust the scales and the coordinate system; coord_map() needs the mapproj package
plot_polished <- plot_both + 
  scale_size(range = c(0, 12),
             breaks = c(10000, 20000, 30000),
             labels = c("10,000", "20,000", "30,000")) + 
  scale_color_manual(values = c("tan", "grey50")) +
  coord_map() +
  labs(title = "Map of Napoleon's Russian campaign of 1812",
       x = NULL,
       y = NULL)
plot_polished

plot_polished +
  theme_void() +
  theme(legend.position = "none")

Gapminder

library(ggplot2)
library(tibble)
# install.packages("gapminder")
library(gapminder)

data("gapminder")
gapminder
## # A tibble: 1,704 × 6
##        country continent  year lifeExp      pop gdpPercap
##         <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
## 1  Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2  Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3  Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4  Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5  Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6  Afghanistan      Asia  1977  38.438 14880372  786.1134
## 7  Afghanistan      Asia  1982  39.854 12881816  978.0114
## 8  Afghanistan      Asia  1987  40.822 13867957  852.3959
## 9  Afghanistan      Asia  1992  41.674 16317921  649.3414
## 10 Afghanistan      Asia  1997  41.763 22227415  635.3414
## # ... with 1,694 more rows

Gapminder

  • What is the average life expectancy, per continent?
  • What is the relationship between GDP and life expectancy?
  • Bonus: what is causing the outlier in gdpPercap?