This content is from the fall 2016 version of this course. Please go here for the most recent version.
First, using these languages is completely free. Second, open-source software is developed collaboratively, meaning the source code is open to public inspection, modification, and improvement.
R and Python are widely used in the physical and social sciences, as well as in government, non-profits, and the private sector.
Many developers and social scientists write programs in R and Python. As a result, there is also a large support community available to help troubleshoot problematic code. As seen in the RedMonk programming language rankings (which compare languages’ appearances on GitHub [usage] and Stack Overflow [support]), both R and Python appear near the top of both rankings.
R and Python, like any computing language, rely on programmatic execution of functions. That is, to do anything you must write code. This differs from popular statistical software such as Stata or SPSS, which at their core use a command language but overlay it with drop-down menus that enable a point-and-click interface. While a point-and-click interface is much easier to operate, it has several downsides, chiefly that it makes it far harder to reproduce one’s analysis.
By this I simply mean that, as with human languages, they share important commonalities and differences. For instance, one common task in the social sciences is to estimate the parameters of an ordinary least squares regression model.[1] Using the base software distributions, in R you could do this using the following code:
x <- c(1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83)
y <- c(52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46)
lm(y ~ x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept)            x
##      -39.06        61.27
In Python, you can perform a similar operation using the numpy library.
import numpy
height = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
weight = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]
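# Design matrix: a column of ones (height**0) for the intercept and a column of heights (height**1)
X = numpy.array(height)[:, None]**range(2)
y = weight
# Least-squares fit; element [0] of the result holds the coefficients (intercept, slope)
print(numpy.linalg.lstsq(X, y, rcond=None)[0])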
## [-39.06195592 61.27218654]
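To see what both fits are computing, the same estimates can be reproduced from the closed-form formulas for a regression with a single predictor: the slope is the sum of cross-deviations divided by the sum of squared deviations of the predictor, and the intercept follows from the two means. The sketch below is only an illustration using the Python standard library; the names `n`, `mean_x`, `mean_y`, `sxy`, and `sxx` are illustrative choices and are not part of the original example.

```python
# Simple linear regression via the closed-form OLS formulas (standard library only).
height = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
weight = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]

n = len(height)
mean_x = sum(height) / n
mean_y = sum(weight) / n

# Sum of cross-deviations and sum of squared deviations of the predictor
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(height, weight))
sxx = sum((x - mean_x) ** 2 for x in height)

slope = sxy / sxx                    # change in weight per unit of height
intercept = mean_y - slope * mean_x  # fitted line passes through the point of means

print(round(intercept, 2), round(slope, 2))  # -39.06 61.27
```

If a shorter numpy call is preferred, `numpy.polyfit(height, weight, 1)` should give the same two numbers, returned as slope first and intercept second.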
Note that both languages provide the same result, but the code used to generate that result varies slightly. The Python code does not look as intuitive as the R code, but it still gets the job done. Which brings us to the main point about R and Python.
R and Python each have their own strengths and weaknesses which make them better suited to different tasks.
While R’s base graphics package is comprehensive and powerful, additional libraries such as ggplot2 and lattice make R the go-to language for powerful data visualization approaches.
Python also has visualization libraries (matplotlib, pygal, and seaborn), but they are still behind R in terms of comprehensiveness and ease of use. Of course, once you wish to create interactive and advanced information visualizations, you can also use more specialized software such as Tableau or D3.
Python has libraries for numerical computing (numpy), data analysis (pandas), and machine learning (scikit-learn).
In this course we will start learning the basic principles of computer programming using Python, then switch over in the third week to R for data analysis and visualizations. This will expose you to both languages and the strengths/weaknesses of each.
[1] Example drawn from RosettaCode.org
This work is licensed under the CC BY-NC 4.0 Creative Commons License.