This content is from the fall 2016 version of this course. Please go here for the most recent version.

cm018 - November 23, 2016

Overview

Illustrate the split-apply-combine analytical pattern
Define parallel processing
Introduce Hadoop and Spark as distributed computing platforms
Introduce the sparklyr package
Demonstrate how to use sparklyr for machine learning using the Titanic data set

Slides and links

Notes from class
The split-apply-combine strategy for data analysis - paper by Hadley Wickham giving a general overview of split-apply-combine problems. Note that the plyr package, which the paper introduces, has since been retired in favor of dplyr and the other tidyverse packages.
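
As a rough illustration of the split-apply-combine pattern, here is a minimal sketch in plain Python. The passenger records and the helper name `split_apply_combine` are invented for this example and do not come from the class materials. In the R workflow used in class, the same idea corresponds roughly to `group_by()` followed by `summarise()` in dplyr.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (passenger class, fare) records, invented for illustration only.
passengers = [
    ("first", 80.0),
    ("third", 8.0),
    ("first", 60.0),
    ("second", 20.0),
    ("third", 10.0),
]


def split_apply_combine(rows, key, value, func):
    """Group rows by key(row), apply func to each group's values, return a dict."""
    # Split: bucket the values by their group key.
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(value(row))
    # Apply and combine: summarise each group, then collect the results.
    return {k: func(v) for k, v in groups.items()}


mean_fare = split_apply_combine(
    passengers,
    key=lambda row: row[0],    # group by passenger class
    value=lambda row: row[1],  # summarise the fare
    func=mean,
)
print(mean_fare)
```

Each group is summarised independently of the others, which is why this kind of problem is a natural candidate for parallel processing: the groups could be handed to separate workers and the results combined at the end.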

To do for Monday

Final projects

This work is licensed under the CC BY-NC 4.0 Creative Commons License.