This content is from the fall 2016 version of this course. A more recent version is available.

Strengths of R and Python

Open-source

First, using these languages is completely free. Second, open-source software is developed collaboratively, meaning the source code is open to public inspection, modification, and improvement.

Well supported

Many developers and social scientists write programs in R and Python. As a result, there is a large support community available to help troubleshoot problematic code. In the Redmonk programming language rankings (which compare languages’ appearances on GitHub [usage] and Stack Overflow [support]), both R and Python appear near the top of both rankings.

Drawbacks of R and Python

Lack of point-and-click interface

R and Python, like any programming language, rely on programmatic execution of functions. That is, to do anything you must write code. This differs from popular statistical software such as Stata or SPSS, which at their core use a command language but overlay it with drop-down menus that enable a point-and-click interface. A point-and-click interface is easier to operate at first, but it has several downsides, chiefly that an analysis performed through menus is much harder to document and reproduce than one written as code.

Similarities and differences between R and Python

R and Python are both programming languages, and like human languages they share important commonalities and differences. For instance, one common task in the social sciences is to estimate the parameters of an ordinary least squares regression model.¹ In R you can do this with the base distribution alone, using the following code:

x <- c(1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83)
y <- c(52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46)
 
lm(y ~ x)

## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      -39.06        61.27

In Python, you can perform a similar operation using the third-party numpy library.

import numpy
 
height = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
weight = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]
 
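# Build the design matrix: a column of ones (intercept) and a column of heights
X = numpy.array(height)[:, None]**range(2)
y = weight
 
# rcond=None uses the current default cutoff and avoids a FutureWarning in some numpy versions
print(numpy.linalg.lstsq(X, y, rcond=None)[0])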

## [-39.06195592  61.27218654]
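To see what both calls compute, here is a minimal sketch of the same fit written with only the Python standard library. It uses the closed-form solution for a regression with one predictor: the slope is the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept follows from the means. The helper name ols_fit is hypothetical and is not part of the original example.

```python
# Closed-form simple linear regression using only the standard library.
# ols_fit is a hypothetical helper name introduced for illustration.

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


height = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
          1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
weight = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
          63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]

intercept, slope = ols_fit(height, weight)
# Should agree with the lm() and numpy results above (about -39.06 and 61.27).
print(round(intercept, 2), round(slope, 2))
```

This is meant only to show the arithmetic. For real analyses, the library routines shown above are the better choice because they handle multiple predictors and numerical edge cases.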

Note that both languages provide the same result, but the code used to generate that result differs. The Python code does not look as intuitive as the R code, but it still gets the job done. This brings us to the main point about R and Python.

Which is right for the job?

R and Python each have their own strengths and weaknesses, which make them better suited to different tasks.

Things R does well
- Statistical analysis - R was written by statisticians, so it is designed first and foremost as a language for statistical analysis.
- Data visualization - while the base R graphics package is comprehensive and powerful, additional libraries such as ggplot2 and lattice make R the go-to language for powerful data visualization.
Things R does not do as well
- Speed - while by no means sluggish, R was not designed primarily for speed. Depending on the complexity of the task and the size of your data, you may find R taking a long time to execute your program.
Things Python does well
- General computation - since Python is a general-purpose programming language, it is more versatile at non-statistical tasks and is a bit more popular outside the statistics community.
- Speed - because it is a general-purpose language, Python can be fast (assuming you write your code well). As your data becomes larger or more complex, you might find Python to be faster than R for your analytical needs.
- The Jupyter Notebook - Jupyter (formerly IPython) notebooks allow you to store live code and results in one central document. This tool used to only work for Python, but is now language-agnostic and supports R, Julia, and other popular languages. That said, R also includes useful tools for dynamic documents containing code and analysis such as R Markdown and the recently released R Markdown Notebook.
- Workflow - since Python is a general-purpose language, you can build entire applications using it. R, not so much.
Things Python does not do as well
- Visualizations - graphics libraries in Python are increasing in number and quality (see matplotlib, pygal, and seaborn), but are still behind R in terms of comprehensiveness and ease of use. Of course, once you wish to create interactive and advanced information visualizations, you can also use more specialized software such as Tableau or D3.
- Add-on libraries - previously Python was criticized for its lack of libraries to perform statistical analysis and data manipulation, especially relative to the plethora of libraries for R. In recent years Python has begun to catch up with libraries for scientific computing (numpy), data analysis (pandas), and machine learning (scikit-learn).

In this course we will start learning the basic principles of computer programming using Python, then switch over in the third week to R for data analysis and visualizations. This will expose you to both languages and the strengths/weaknesses of each.

Acknowledgements

This page is derived in part from “R vs Python for Data Science: The Winner is …”.

¹ Example drawn from RosettaCode.org

This work is licensed under the CC BY-NC 4.0 Creative Commons License.
