This content is from the fall 2016 version of this course. Please go here for the most recent version.

library(tidyverse)
library(stringr)
library(knitr)
library(curl)
library(jsonlite)
library(XML)
library(httr)
library(rvest)
library(ggmap)

Objectives

Identify methods for writing functions to interact with APIs
Define JSON and XML data structure and how to convert them to data frames
Define CSS selectors and practice writing selectors to extract information from web pages
Write a script to extract data on wineries in Ithaca, NY using rvest and the Selector Gadget

Interacting with an API

On Monday we experimented with several packages that “wrapped” APIs. That is, they handled the creation of the request and the formatting of the output. Today we’re going to look at (part of) what these functions were doing.

First we’re going to examine the structure of API requests via the Open Movie Database. OMDb is very similar to IMDB, except it has a nice, simple API. We can go to the website, input some search parameters, and obtain both the XML query and the response from it.

Exercise - determine the shape of an API request

You can play around with the parameters on the OMDB website, and look at the resulting API call and the query you get back:

Let’s experiment with different values of the title and year fields. Notice the pattern in the request. For example for Title = Interstellar and Year = 2014, we get:

http://www.omdbapi.com/?t=Interstellar&y=2014&plot=short&r=xml

Try pasting this link into the browser. Also experiment with json and xml

How can we create this request in R?

request <- str_c("http://www.omdbapi.com/?t=", "Interstellar", "&", "y=", "2014", "&", "plot=", "short", "&", "r=", "xml")
request

## [1] "http://www.omdbapi.com/?t=Interstellar&y=2014&plot=short&r=xml"

It works, but it’s a bit ungainly. Lets try to abstract that into a function:

omdb <- function(Title, Year, Plot, Format){
  baseurl <- "http://www.omdbapi.com/?"
  params <- c("t=", "y=", "plot=", "r=")
  values <- c(Title, Year, Plot, Format)
  param_values <- map2_chr(params, values, str_c)
  args <- str_c(param_values, collapse = "&")
  str_c(baseurl, args)
}

omdb("Interstellar", "2014", "short", "xml")

## [1] "http://www.omdbapi.com/?t=Interstellar&y=2014&plot=short&r=xml"

Now we have a handy function that returns the API query. We can paste in the link, but we can also obtain data from within R:

request_interstellar <- omdb("Interstellar", "2014", "short", "xml")
con <- curl(request_interstellar)
answer_xml <- readLines(con)

## Warning in readLines(con): incomplete final line found on 'http://
## www.omdbapi.com/?t=Interstellar&y=2014&plot=short&r=xml'

close(con)
answer_xml

## [1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root response=\"True\"><movie title=\"Interstellar\" year=\"2014\" rated=\"PG-13\" released=\"07 Nov 2014\" runtime=\"169 min\" genre=\"Adventure, Drama, Sci-Fi\" director=\"Christopher Nolan\" writer=\"Jonathan Nolan, Christopher Nolan\" actors=\"Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow\" plot=\"A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival.\" language=\"English\" country=\"USA, UK\" awards=\"Won 1 Oscar. Another 39 wins &amp; 134 nominations.\" poster=\"https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg\" metascore=\"74\" imdbRating=\"8.6\" imdbVotes=\"961,788\" imdbID=\"tt0816692\" type=\"movie\"/></root>"

request_interstellar <- omdb("Interstellar", "2014", "short", "json")
con <- curl(request_interstellar)
answer_json <- readLines(con)
close(con)
answer_json %>% 
  prettify()

## {
##     "Title": "Interstellar",
##     "Year": "2014",
##     "Rated": "PG-13",
##     "Released": "07 Nov 2014",
##     "Runtime": "169 min",
##     "Genre": "Adventure, Drama, Sci-Fi",
##     "Director": "Christopher Nolan",
##     "Writer": "Jonathan Nolan, Christopher Nolan",
##     "Actors": "Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow",
##     "Plot": "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival.",
##     "Language": "English",
##     "Country": "USA, UK",
##     "Awards": "Won 1 Oscar. Another 39 wins & 134 nominations.",
##     "Poster": "https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg",
##     "Metascore": "74",
##     "imdbRating": "8.6",
##     "imdbVotes": "961,788",
##     "imdbID": "tt0816692",
##     "Type": "movie",
##     "Response": "True"
## }
##

We have a form of data that is obviously structured. What is it?

Intro to JSON and XML

These are the two common languages of web services: JavaScript Object Notation and eXtensible Markup Language.

Here’s an example of JSON: from this wonderful site

{
  "crust": "original",
  "toppings": ["cheese", "pepperoni", "garlic"],
  "status": "cooking",
  "customer": {
    "name": "Brian",
    "phone": "573-111-1111"
  }
}

And here is XML:

<order>
    <crust>original</crust>
    <toppings>
        <topping>cheese</topping>
        <topping>pepperoni</topping>
        <topping>garlic</topping>
    </toppings>
    <status>cooking</status>
</order>

You can see that both of these data structures are quite easy to read. They are “self-describing”. In other words, they tell you how they are meant to be read.

There are easy means of taking these data types and creating R objects. Our JSON response above can be parsed using jsonlite::fromJSON():

answer_json %>% 
  fromJSON()

## $Title
## [1] "Interstellar"
## 
## $Year
## [1] "2014"
## 
## $Rated
## [1] "PG-13"
## 
## $Released
## [1] "07 Nov 2014"
## 
## $Runtime
## [1] "169 min"
## 
## $Genre
## [1] "Adventure, Drama, Sci-Fi"
## 
## $Director
## [1] "Christopher Nolan"
## 
## $Writer
## [1] "Jonathan Nolan, Christopher Nolan"
## 
## $Actors
## [1] "Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow"
## 
## $Plot
## [1] "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival."
## 
## $Language
## [1] "English"
## 
## $Country
## [1] "USA, UK"
## 
## $Awards
## [1] "Won 1 Oscar. Another 39 wins & 134 nominations."
## 
## $Poster
## [1] "https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg"
## 
## $Metascore
## [1] "74"
## 
## $imdbRating
## [1] "8.6"
## 
## $imdbVotes
## [1] "961,788"
## 
## $imdbID
## [1] "tt0816692"
## 
## $Type
## [1] "movie"
## 
## $Response
## [1] "True"

The output is a named list! A familiar and friendly R structure. Because data frames are lists, and because this list has no nested lists-within-lists, we can coerce it very simply:

answer_json %>% 
  fromJSON() %>% 
  tbl_df() %>% 
  kable()

Title	Year	Rated	Released	Runtime	Genre	Director	Writer	Actors	Plot	Language	Country	Awards	Poster	Metascore	imdbRating	imdbVotes	imdbID	Type	Response
Interstellar	2014	PG-13	07 Nov 2014	169 min	Adventure, Drama, Sci-Fi	Christopher Nolan	Jonathan Nolan, Christopher Nolan	Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow	A team of explorers travel through a wormhole in space in an attempt to ensure humanity’s survival.	English	USA, UK	Won 1 Oscar. Another 39 wins & 134 nominations.	https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg	74	8.6	961,788	tt0816692	movie	True

A similar process exists for XML formats:

ans_xml_parsed <- xmlParse(answer_xml)
ans_xml_parsed

## <?xml version="1.0" encoding="UTF-8"?>
## <root response="True">
##   <movie title="Interstellar" year="2014" rated="PG-13" released="07 Nov 2014" runtime="169 min" genre="Adventure, Drama, Sci-Fi" director="Christopher Nolan" writer="Jonathan Nolan, Christopher Nolan" actors="Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow" plot="A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival." language="English" country="USA, UK" awards="Won 1 Oscar. Another 39 wins &amp; 134 nominations." poster="https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg" metascore="74" imdbRating="8.6" imdbVotes="961,788" imdbID="tt0816692" type="movie"/>
## </root>
##

Not exactly the response we were hoping for! This shows us some of the XML document’s structure:

a <root> node with a single child, <movie>.
the information we want is all stored as attributes

From Nolan and Lang 2014:

The xmlRoot() function returns an object of class XMLInternalElementNode. This is a regular XML node and not specific to the root node, i.e., all XML nodes will appear in R with this class or a more specific class. An object of class XMLInternalElementNode has four fields: name, attributes, children and value, which we access with the methods xmlName(), xmlAttrs(), xmlChildren(), and xmlValue()

field	method
name	`xmlName()`
attributes	`xmlAttrs()`
children	`xmlChildren()`
value	`xmlValue()`

ans_xml_parsed_root <- xmlRoot(ans_xml_parsed)[["movie"]]  # could also use [[1]]
ans_xml_parsed_root

## <movie title="Interstellar" year="2014" rated="PG-13" released="07 Nov 2014" runtime="169 min" genre="Adventure, Drama, Sci-Fi" director="Christopher Nolan" writer="Jonathan Nolan, Christopher Nolan" actors="Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow" plot="A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival." language="English" country="USA, UK" awards="Won 1 Oscar. Another 39 wins &amp; 134 nominations." poster="https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg" metascore="74" imdbRating="8.6" imdbVotes="961,788" imdbID="tt0816692" type="movie"/>

ans_xml_attrs <- xmlAttrs(ans_xml_parsed_root)
ans_xml_attrs

##                                                                                                             title 
##                                                                                                    "Interstellar" 
##                                                                                                              year 
##                                                                                                            "2014" 
##                                                                                                             rated 
##                                                                                                           "PG-13" 
##                                                                                                          released 
##                                                                                                     "07 Nov 2014" 
##                                                                                                           runtime 
##                                                                                                         "169 min" 
##                                                                                                             genre 
##                                                                                        "Adventure, Drama, Sci-Fi" 
##                                                                                                          director 
##                                                                                               "Christopher Nolan" 
##                                                                                                            writer 
##                                                                               "Jonathan Nolan, Christopher Nolan" 
##                                                                                                            actors 
##                                                 "Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow" 
##                                                                                                              plot 
##             "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival." 
##                                                                                                          language 
##                                                                                                         "English" 
##                                                                                                           country 
##                                                                                                         "USA, UK" 
##                                                                                                            awards 
##                                                                 "Won 1 Oscar. Another 39 wins & 134 nominations." 
##                                                                                                            poster 
## "https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg" 
##                                                                                                         metascore 
##                                                                                                              "74" 
##                                                                                                        imdbRating 
##                                                                                                             "8.6" 
##                                                                                                         imdbVotes 
##                                                                                                         "961,788" 
##                                                                                                            imdbID 
##                                                                                                       "tt0816692" 
##                                                                                                              type 
##                                                                                                           "movie"

ans_xml_attrs %>%
  t() %>%
  tbl_df() %>%
  kable()

title	year	rated	released	runtime	genre	director	writer	actors	plot	language	country	awards	poster	metascore	imdbRating	imdbVotes	imdbID	type
Interstellar	2014	PG-13	07 Nov 2014	169 min	Adventure, Drama, Sci-Fi	Christopher Nolan	Jonathan Nolan, Christopher Nolan	Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow	A team of explorers travel through a wormhole in space in an attempt to ensure humanity’s survival.	English	USA, UK	Won 1 Oscar. Another 39 wins & 134 nominations.	https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg	74	8.6	961,788	tt0816692	movie

Introducing the easy way: `httr`

httr is yet another star in the hadleyverse, this one designed to facilitate all things HTTP from within R. This includes the major HTTP verbs, which are:¹

GET: fetch an existing resource. The URL contains all the necessary information the server needs to locate and return the resource.
POST: create a new resource. POST requests usually carry a payload that specifies the data for the new resource.
PUT: update an existing resource. The payload may contain the updated data for the resource.
DELETE: delete an existing resource.

HTTP is the foundation for APIs; understanding how it works is the key to interacting with all the diverse APIs out there. An excellent beginning resource for APIs (including HTTP basics) is this simple guide

httr also facilitates a variety of authentication protocols.

httr contains one function for every HTTP verb. The functions have the same names as the verbs (e.g. GET(), POST()).They have more informative outputs than simply using curl, and come with some nice convenience functions for working with the output:

interstellar_json <- omdb("Interstellar", "2014", "short", "json")
response_json <- GET(interstellar_json)
content(response_json, as = "parsed", type = "application/json")

## $Title
## [1] "Interstellar"
## 
## $Year
## [1] "2014"
## 
## $Rated
## [1] "PG-13"
## 
## $Released
## [1] "07 Nov 2014"
## 
## $Runtime
## [1] "169 min"
## 
## $Genre
## [1] "Adventure, Drama, Sci-Fi"
## 
## $Director
## [1] "Christopher Nolan"
## 
## $Writer
## [1] "Jonathan Nolan, Christopher Nolan"
## 
## $Actors
## [1] "Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow"
## 
## $Plot
## [1] "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival."
## 
## $Language
## [1] "English"
## 
## $Country
## [1] "USA, UK"
## 
## $Awards
## [1] "Won 1 Oscar. Another 39 wins & 134 nominations."
## 
## $Poster
## [1] "https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg"
## 
## $Metascore
## [1] "74"
## 
## $imdbRating
## [1] "8.6"
## 
## $imdbVotes
## [1] "961,788"
## 
## $imdbID
## [1] "tt0816692"
## 
## $Type
## [1] "movie"
## 
## $Response
## [1] "True"

interstellar_xml <- omdb("Interstellar", "2014", "short", "xml")
response_xml <- GET(interstellar_xml)
content(response_xml, as = "parsed")

## {xml_document}
## <root response="True">
## [1] <movie title="Interstellar" year="2014" rated="PG-13" released="07 N ...

In addition, httr gives us access to lots of useful information about the quality of our response. For example, the header:

headers(response_xml)

## $date
## [1] "Thu, 17 Nov 2016 16:08:07 GMT"
## 
## $`content-type`
## [1] "text/xml; charset=utf-8"
## 
## $`content-length`
## [1] "720"
## 
## $connection
## [1] "keep-alive"
## 
## $`cache-control`
## [1] "public, max-age=86400"
## 
## $`content-encoding`
## [1] "gzip"
## 
## $expires
## [1] "Fri, 18 Nov 2016 16:08:07 GMT"
## 
## $`last-modified`
## [1] "Thu, 17 Nov 2016 16:08:06 GMT"
## 
## $vary
## [1] "Accept-Encoding"
## 
## $`x-aspnet-version`
## [1] "4.0.30319"
## 
## $`x-powered-by`
## [1] "ASP.NET"
## 
## $`access-control-allow-origin`
## [1] "*"
## 
## $`cf-cache-status`
## [1] "HIT"
## 
## $server
## [1] "cloudflare-nginx"
## 
## $`cf-ray`
## [1] "30347084b1c710f3-ORD"
## 
## attr(,"class")
## [1] "insensitive" "list"

And also a handy means to extract specifically the HTTP status code:

status_code(response_xml)

## [1] 200

Code²	Status
1xx	Informational
2xx	Success
3xx	Redirection
4xx	Client error (you did something wrong)
5xx	Server error (server did something wrong)

(Perhaps a more intuitive, cat-based explanation of error codes)

In fact, we didn’t need to create omdb() at all! httr provides a straightforward means of making an http request:

the_martian <- GET("http://www.omdbapi.com/?", query = list(t = "The Martian", y = 2015, plot = "short", r = "json"))

content(the_martian)

## $Title
## [1] "The Martian"
## 
## $Year
## [1] "2015"
## 
## $Rated
## [1] "PG-13"
## 
## $Released
## [1] "02 Oct 2015"
## 
## $Runtime
## [1] "144 min"
## 
## $Genre
## [1] "Adventure, Drama, Sci-Fi"
## 
## $Director
## [1] "Ridley Scott"
## 
## $Writer
## [1] "Drew Goddard (screenplay), Andy Weir (book)"
## 
## $Actors
## [1] "Matt Damon, Jessica Chastain, Kristen Wiig, Jeff Daniels"
## 
## $Plot
## [1] "An astronaut becomes stranded on Mars after his team assume him dead, and must rely on his ingenuity to find a way to signal to Earth that he is alive."
## 
## $Language
## [1] "English, Mandarin"
## 
## $Country
## [1] "USA, UK"
## 
## $Awards
## [1] "Nominated for 7 Oscars. Another 33 wins & 172 nominations."
## 
## $Poster
## [1] "https://images-na.ssl-images-amazon.com/images/M/MV5BMTc2MTQ3MDA1Nl5BMl5BanBnXkFtZTgwODA3OTI4NjE@._V1_SX300.jpg"
## 
## $Metascore
## [1] "80"
## 
## $imdbRating
## [1] "8.0"
## 
## $imdbVotes
## [1] "506,335"
## 
## $imdbID
## [1] "tt3659388"
## 
## $Type
## [1] "movie"
## 
## $Response
## [1] "True"

We get the same answer as before! With httr, we are able to pass in the named arguments to the API call as a named list. We are also able to use spaces in movie names - httr encodes these in the URL before making the GET request.

The documentation for httr includes two useful vignettes:

httr quickstart guide - summarizes all the basic httr functions like above
Best practices for writing an API package - document outlining the key issues involved in writing API wrappers in R

Scraping

What if data is present on a website, but isn’t provided in an API at all? It is possible to grab that information too. How easy that is to do depends a lot on the quality of the website that we are using.

HTML is a structured way of displaying information. It is very similar in structure to XML (in fact many modern html sites are actually XHTML5, which is also valid XML)

Scraping

HTML is a structured way of displaying information. It is very similar in structure to XML (in fact many modern html sites are actually XHTML5, which is also valid XML)

Example of HTML code

Source: Example of a simple HTML page

<HTML>
<HEAD>
  <TITLE>Your Title Here</TITLE>
</HEAD>

<BODY BGCOLOR="FFFFFF">
<CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"> </CENTER>
<HR>
<a href="http://somegreatsite.com">Link Name</a> is a link to another nifty site
<H1>This is a Header</H1>
<H2>This is a Medium Header</H2>
Send me mail at <a href="mailto:support@yourcompany.com"> support@yourcompany.com</a>.
<P> This is a new paragraph!
<P> <B>This is a new paragraph!</B>
<BR> <B><I>This is a new sentence without a paragraph break, in bold italics.</I></B>
<HR>
</BODY>
</HTML>

Rendered example HTML

Install your equipment

rvest - devtools::install_github("hadley/rvest")
SelectorGadget - Install in your browser

Practice CSS selectors

Before we go any further, let’s play a game together!

Obtain a table

Let’s make a simple HTML table and then parse it!

make a new, empty project
make a totally empty .Rmd file called "GapminderHead.Rmd"
copy this into the body:

---
output: html_document
---

```{r echo=FALSE, results='asis'} #delete this comment
library(gapminder)
knitr::kable(head(gapminder))
```

remember to delete the comment

Then knit the document and click “View in Browser”. It should look like this:

We have created a simple html table with the head of gapminder in it! We can get our data back by parsing this table into a dataframe again. Extracting data from html is called “scraping”, and it is done with the R package rvest:

read_html("GapminderHead.html") %>%
  html_table()

## [[1]]
##       country continent year lifeExp      pop gdpPercap
## 1 Afghanistan      Asia 1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia 1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia 1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia 1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia 1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia 1977  38.438 14880372  786.1134

Scraping via CSS selectors

Let’s practice scraping websites using our newfound abilities! Here is a list of wineries in Ithaca, NY.

Extract names

First let’s download and store the HTML content.

wineries <- read_html("http://www.visitithaca.com/attractions/wineries.html")

Now let’s extract the names and store them. To pick out the names, scroll down to the list of wineries and use the Selector Gadget to click on one of the winery names on the main page. Once you only have the winer names selected, you should see that the names are embedded in hyperlink tags (<a href...) and it has a class of indSearchListingTitle. We can use html_nodes to extract this information and html_text to remove the extraneous HTML tags.

fnames <- wineries %>%
  html_nodes(".indSearchListingTitle a") %>%
  html_text()
head(fnames)

## [1] "Treleaven by King Ferry Winery" "Ports of New York"             
## [3] "Goose Watch Winery"             "Varick Winery & Vineyard"      
## [5] "Dills Run Winery"               "Swedish Hill Vineyard"

Extract addresses

The address is trickier. Using the select tool you’ll see that the details are stored in the indMetaInfoWrapper class, but also includes things like phone numbers. We just want the addresses, so use the Selector Gadget to only include addresses.

faddress <- wineries %>%
  html_nodes(".indMetaWrapper:nth-child(1) .indMetaInfoWrapper") %>%
  html_text()
head(faddress)

## [1] "\n              658 Lake Rd., King Ferry, NY 13081\n              \n            "              
## [2] "\n              815 Taber St, Ithaca, NY 14850\n              \n            "                  
## [3] "\n              on the Cayuga Lake Wine Trail, Romulus, NY 14541\n              \n            "
## [4] "\n              on the Cayuga Lake Wine Trail, Romulus, NY 14541\n              \n            "
## [5] "\n              3862 State Route 90, Aurora, NY 13026\n              \n            "           
## [6] "\n              on the Cayuga Lake Wine Trail, Romulus, NY 14541\n              \n            "

Clean up the addresses

The code on this website is a little inconsistent – sometimes the winery name is first and sometimes it’s preceded by something like “on the Cayuga Lake Wine Trail” so let’s do a little final cleanup.

to_remove <- str_c(c("\n", "^\\s+|\\s+$", "on the Cayuga Lake Wine Trail, ", "Cayuga Lake Wine Trail, ", "on the Cayuga Wine Trail, ",  "on the Finger Lakes Beer Trail, "), collapse="|")

faddress <- str_replace_all(faddress, to_remove, "")

head(faddress)

## [1] "              658 Lake Rd., King Ferry, NY 13081"   
## [2] "              815 Taber St, Ithaca, NY 14850"       
## [3] "              Romulus, NY 14541"                    
## [4] "              Romulus, NY 14541"                    
## [5] "              3862 State Route 90, Aurora, NY 13026"
## [6] "              Romulus, NY 14541"

Combine with an API to geocode the addresses

The package ggmap has a nice geocode function that we’ll use to extract coordinates.

geocodes <- geocode(faddress, output = "latlona")

head(geocodes)

##         lon      lat                                address
## 1 -76.63443 42.65034 658 lake rd, king ferry, ny 13081, usa
## 2 -76.51510 42.43866    815 taber st, ithaca, ny 14850, usa
## 3 -76.83404 42.75255                 romulus, ny 14541, usa
## 4 -76.83404 42.75255                 romulus, ny 14541, usa
## 5 -76.71379 42.78586      3862 ny-90, aurora, ny 13026, usa
## 6 -76.83404 42.75255                 romulus, ny 14541, usa

Pull it all together

full <- data_frame(name = fnames) %>%
  bind_cols(geocodes) %>%
  # is it a winery or cidery?
  mutate(category = ifelse(str_detect(name, "Winery"), "Winery", "Cidery"))

kable(full)

name	lon	lat	address	category
Treleaven by King Ferry Winery	-76.63443	42.65034	658 lake rd, king ferry, ny 13081, usa	Winery
Ports of New York	-76.51510	42.43866	815 taber st, ithaca, ny 14850, usa	Cidery
Goose Watch Winery	-76.83404	42.75255	romulus, ny 14541, usa	Winery
Varick Winery & Vineyard	-76.83404	42.75255	romulus, ny 14541, usa	Winery
Dills Run Winery	-76.71379	42.78586	3862 ny-90, aurora, ny 13026, usa	Winery
Swedish Hill Vineyard	-76.83404	42.75255	romulus, ny 14541, usa	Cidery
Knapp Winery & Vineyard Restaurant	-76.83404	42.75255	romulus, ny 14541, usa	Winery
Frontenac Point Vineyard & Estate Winery	-76.64552	42.56580	9501 ny-89, trumansburg, ny 14886, usa	Winery
Izzo White Barn Winery	-76.70580	42.94002	6634 cayuga rd, cayuga, ny 13034, usa	Winery
Montezuma Winery & Hidden Marsh Distillery	-76.79662	42.91062	seneca falls, ny 13148, usa	Winery
Six Mile Creek Vineyards	-76.50188	42.44396	ithaca, ny, usa	Cidery
Long Point Winery	-76.70245	42.75396	aurora, ny 13026, usa	Winery
Buttonwood Grove Winery	-76.83404	42.75255	romulus, ny 14541, usa	Winery
Finger Lakes Cider House	-76.70140	42.60650	4017 hickok rd, interlaken, ny 14847, usa	Cidery
Cayuga Lake Wine Trail	-83.30791	42.19938	13065 new york st, romulus, mi 48174, usa	Cidery
Thirsty Owl Wine Company	-76.82301	42.67646	ovid, ny 14521, usa	Cidery
Chateau Dusseau Winery	-76.46788	42.63096	locke, ny 13092, usa	Winery
Lucas Vineyards	-76.71015	42.62787	3862 county rd 150, interlaken, ny 14847, usa	Cidery
Hosmer Winery	-76.82301	42.67646	ovid, ny 14521, usa	Winery
Americana Vineyards & Crystal Lake Cafe	-76.67485	42.57722	4367 e covert rd, interlaken, ny 14847, usa	Cidery
Sheldrake Point Vineyard	-76.82301	42.67646	ovid, ny 14521, usa	Cidery
Heart and Hands Winery	-76.70822	42.79736	ny-90, new york, usa	Winery
Eleven Lakes Winery	-76.74878	42.85320	3550 ny-89, seneca falls, ny 13148, usa	Winery
Cayuga Ridge Estates Winery	-76.82301	42.67646	ovid, ny 14521, usa	Winery
Bellwether Hard Cider / Bellwether Wine Cellars	-76.66606	42.54229	trumansburg, ny 14886, usa	Cidery

Random observations on scraping

Make sure you’ve obtained ONLY what you want! Scroll over the whole page to ensure that selectorgadget hasn’t found too many things
If you are having trouble parsing, try selecting a smaller subset of the thing you are seeking (i.e. being more precise)
MOST IMPORTANT - confirm that there is NO RopenSci package and NO API before you spend hours scraping when the API was right here

Acknowledgments

This page is derived in part from “UBC STAT 545A and 547M”, licensed under the CC BY-NC 3.0 Creative Commons License.
This page is derived in part from “Scrape website data with the new R package rvest (+ a postscript on interacting with web pages with RSelenium)”.

Session Info

devtools::session_info()

## Session info --------------------------------------------------------------

##  setting  value                       
##  version  R version 3.3.1 (2016-06-21)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2016-11-17

## Packages ------------------------------------------------------------------

##  package      * version  date       source                           
##  assertthat     0.1      2013-12-06 CRAN (R 3.3.0)                   
##  bit            1.1-12   2014-04-09 CRAN (R 3.3.0)                   
##  bit64          0.9-5    2015-07-05 CRAN (R 3.3.0)                   
##  boot         * 1.3-18   2016-02-23 CRAN (R 3.3.1)                   
##  broom        * 0.4.1    2016-06-24 CRAN (R 3.3.0)                   
##  car            2.1-3    2016-08-11 CRAN (R 3.3.0)                   
##  caret        * 6.0-73   2016-11-10 CRAN (R 3.3.2)                   
##  codetools      0.2-15   2016-10-05 CRAN (R 3.3.0)                   
##  colorspace     1.2-7    2016-10-11 CRAN (R 3.3.0)                   
##  curl         * 2.2      2016-10-21 CRAN (R 3.3.0)                   
##  DBI            0.5-1    2016-09-10 CRAN (R 3.3.0)                   
##  devtools       1.12.0   2016-06-24 CRAN (R 3.3.0)                   
##  digest         0.6.10   2016-08-02 CRAN (R 3.3.0)                   
##  dplyr        * 0.5.0    2016-06-24 CRAN (R 3.3.0)                   
##  evaluate       0.10     2016-10-11 CRAN (R 3.3.0)                   
##  foreach        1.4.3    2015-10-13 CRAN (R 3.3.0)                   
##  foreign        0.8-67   2016-09-13 CRAN (R 3.3.0)                   
##  gapminder    * 0.2.0    2015-12-31 CRAN (R 3.3.0)                   
##  geonames     * 0.998    2014-12-19 CRAN (R 3.3.0)                   
##  geosphere      1.5-5    2016-06-15 cran (@1.5-5)                    
##  gganimate    * 0.1      2016-11-11 Github (dgrtwo/gganimate@26ec501)
##  ggmap        * 2.6.1    2016-01-23 cran (@2.6.1)                    
##  ggplot2      * 2.2.0    2016-11-10 Github (hadley/ggplot2@f442f32)  
##  gtable         0.2.0    2016-02-26 CRAN (R 3.3.0)                   
##  gutenbergr   * 0.1.2    2016-06-24 CRAN (R 3.3.0)                   
##  highr          0.6      2016-05-09 CRAN (R 3.3.0)                   
##  htmltools      0.3.5    2016-03-21 CRAN (R 3.3.0)                   
##  htmlwidgets    0.8      2016-11-09 CRAN (R 3.3.1)                   
##  httr         * 1.2.1    2016-07-03 CRAN (R 3.3.0)                   
##  ISLR         * 1.0      2013-06-11 CRAN (R 3.3.0)                   
##  iterators      1.0.8    2015-10-13 CRAN (R 3.3.0)                   
##  janeaustenr  * 0.1.4    2016-10-26 CRAN (R 3.3.0)                   
##  jpeg           0.1-8    2014-01-23 cran (@0.1-8)                    
##  jsonlite     * 1.1      2016-09-14 CRAN (R 3.3.0)                   
##  knitr        * 1.15     2016-11-09 CRAN (R 3.3.1)                   
##  labeling       0.3      2014-08-23 CRAN (R 3.3.0)                   
##  lattice      * 0.20-34  2016-09-06 CRAN (R 3.3.0)                   
##  lazyeval       0.2.0    2016-06-12 CRAN (R 3.3.0)                   
##  lme4           1.1-12   2016-04-16 cran (@1.1-12)                   
##  lubridate    * 1.6.0    2016-09-13 CRAN (R 3.3.0)                   
##  magrittr       1.5      2014-11-22 CRAN (R 3.3.0)                   
##  mapproj        1.2-4    2015-08-03 cran (@1.2-4)                    
##  maps           3.1.1    2016-07-27 cran (@3.1.1)                    
##  MASS           7.3-45   2016-04-21 CRAN (R 3.3.1)                   
##  Matrix         1.2-7.1  2016-09-01 CRAN (R 3.3.0)                   
##  MatrixModels   0.4-1    2015-08-22 CRAN (R 3.3.0)                   
##  memoise        1.0.0    2016-01-29 CRAN (R 3.3.0)                   
##  mgcv           1.8-16   2016-11-07 CRAN (R 3.3.0)                   
##  minqa          1.2.4    2014-10-09 cran (@1.2.4)                    
##  mnormt         1.5-5    2016-10-15 CRAN (R 3.3.0)                   
##  ModelMetrics   1.1.0    2016-08-26 CRAN (R 3.3.0)                   
##  modelr       * 0.1.0    2016-08-31 CRAN (R 3.3.0)                   
##  modeltools     0.2-21   2013-09-02 CRAN (R 3.3.0)                   
##  munsell        0.4.3    2016-02-13 CRAN (R 3.3.0)                   
##  nlme           3.1-128  2016-05-10 CRAN (R 3.3.1)                   
##  nloptr         1.0.4    2014-08-04 cran (@1.0.4)                    
##  NLP            0.1-9    2016-02-18 CRAN (R 3.3.0)                   
##  nnet           7.3-12   2016-02-02 CRAN (R 3.3.1)                   
##  openssl        0.9.5    2016-10-28 CRAN (R 3.3.0)                   
##  pbkrtest       0.4-6    2016-01-27 CRAN (R 3.3.0)                   
##  plyr           1.8.4    2016-06-08 CRAN (R 3.3.0)                   
##  png            0.1-7    2013-12-03 cran (@0.1-7)                    
##  profvis      * 0.3.2    2016-05-19 CRAN (R 3.3.0)                   
##  proto          1.0.0    2016-10-29 CRAN (R 3.3.0)                   
##  psych          1.6.9    2016-09-17 cran (@1.6.9)                    
##  purrr        * 0.2.2    2016-06-18 CRAN (R 3.3.0)                   
##  quantreg       5.29     2016-09-04 CRAN (R 3.3.0)                   
##  R6             2.2.0    2016-10-05 CRAN (R 3.3.0)                   
##  randomForest * 4.6-12   2015-10-07 CRAN (R 3.3.0)                   
##  rcfss        * 0.1.0    2016-10-06 local                            
##  RColorBrewer   1.1-2    2014-12-07 CRAN (R 3.3.0)                   
##  Rcpp           0.12.7   2016-09-05 cran (@0.12.7)                   
##  readr        * 1.0.0    2016-08-03 CRAN (R 3.3.0)                   
##  readxl       * 0.1.1    2016-03-28 CRAN (R 3.3.0)                   
##  rebird       * 0.3.0    2016-03-23 CRAN (R 3.3.0)                   
##  reshape2       1.4.2    2016-10-22 CRAN (R 3.3.0)                   
##  RgoogleMaps    1.4.1    2016-09-18 cran (@1.4.1)                    
##  rjson          0.2.15   2014-11-03 cran (@0.2.15)                   
##  rmarkdown    * 1.1      2016-10-16 CRAN (R 3.3.1)                   
##  rplos        * 0.6.0    2016-07-22 CRAN (R 3.3.0)                   
##  rstudioapi     0.6      2016-06-27 CRAN (R 3.3.0)                   
##  rvest        * 0.3.2    2016-06-17 CRAN (R 3.3.0)                   
##  scales       * 0.4.1    2016-11-09 CRAN (R 3.3.1)                   
##  selectr        0.3-0    2016-08-30 CRAN (R 3.3.0)                   
##  slam           0.1-38   2016-08-18 CRAN (R 3.3.2)                   
##  SnowballC      0.5.1    2014-08-09 cran (@0.5.1)                    
##  solr           0.1.6    2015-09-17 CRAN (R 3.3.0)                   
##  sp             1.2-3    2016-04-14 cran (@1.2-3)                    
##  SparseM        1.72     2016-09-06 CRAN (R 3.3.0)                   
##  stringi        1.1.2    2016-10-01 CRAN (R 3.3.0)                   
##  stringr      * 1.1.0    2016-08-19 cran (@1.1.0)                    
##  tibble       * 1.2      2016-08-26 cran (@1.2)                      
##  tidyr        * 0.6.0    2016-08-12 CRAN (R 3.3.0)                   
##  tidytext     * 0.1.2    2016-10-28 CRAN (R 3.3.0)                   
##  tidyverse    * 1.0.0    2016-09-09 CRAN (R 3.3.0)                   
##  tm             0.6-2    2015-07-03 CRAN (R 3.3.0)                   
##  tokenizers     0.1.4    2016-08-29 CRAN (R 3.3.0)                   
##  topicmodels  * 0.2-4    2016-05-23 CRAN (R 3.3.0)                   
##  tree         * 1.0-37   2016-01-21 CRAN (R 3.3.0)                   
##  twitteR      * 1.1.9    2015-07-29 CRAN (R 3.3.0)                   
##  whisker        0.3-2    2013-04-28 CRAN (R 3.3.0)                   
##  withr          1.0.2    2016-06-20 CRAN (R 3.3.0)                   
##  XML          * 3.98-1.4 2016-03-01 CRAN (R 3.3.0)                   
##  xml2         * 1.0.0    2016-06-24 CRAN (R 3.3.0)                   
##  yaml           2.1.13   2014-06-12 CRAN (R 3.3.0)

This work is licensed under the CC BY-NC 4.0 Creative Commons License.

Getting data from the web: scraping

Objectives

Interacting with an API

Exercise - determine the shape of an API request

Intro to JSON and XML

Introducing the easy way: httr

Scraping

Scraping

Example of HTML code

Install your equipment

Practice CSS selectors

Obtain a table

Scraping via CSS selectors

Extract names

Extract addresses

Clean up the addresses

Combine with an API to geocode the addresses

Pull it all together

Random observations on scraping

Acknowledgments

Session Info

Introducing the easy way: `httr`