This content is from the fall 2016 version of this course. Please go here for the most recent version.

library(tidyverse)
library(stringr)
library(knitr)
library(curl)
library(jsonlite)
library(XML)
library(httr)
library(rvest)
library(ggmap)

Objectives

  • Identify methods for writing functions to interact with APIs
  • Define JSON and XML data structure and how to convert them to data frames
  • Define CSS selectors and practice writing selectors to extract information from web pages
  • Write a script to extract data on wineries in Ithaca, NY using rvest and the Selector Gadget

Interacting with an API

On Monday we experimented with several packages that “wrapped” APIs. That is, they handled the creation of the request and the formatting of the output. Today we’re going to look at (part of) what these functions were doing.

First we’re going to examine the structure of API requests via the Open Movie Database. OMDb is very similar to IMDB, except it has a nice, simple API. We can go to the website, input some search parameters, and obtain both the XML query and the response from it.

Exercise - determine the shape of an API request

You can play around with the parameters on the OMDB website, and look at the resulting API call and the query you get back:

Let’s experiment with different values of the title and year fields. Notice the pattern in the request. For example for Title = Interstellar and Year = 2014, we get:

http://www.omdbapi.com/?t=Interstellar&y=2014&plot=short&r=xml

Try pasting this link into the browser. Also experiment with json and xml

How can we create this request in R?

request <- str_c("http://www.omdbapi.com/?t=", "Interstellar", "&", "y=", "2014", "&", "plot=", "short", "&", "r=", "xml")
request
## [1] "http://www.omdbapi.com/?t=Interstellar&y=2014&plot=short&r=xml"

It works, but it’s a bit ungainly. Lets try to abstract that into a function:

omdb <- function(Title, Year, Plot, Format){
  baseurl <- "http://www.omdbapi.com/?"
  params <- c("t=", "y=", "plot=", "r=")
  values <- c(Title, Year, Plot, Format)
  param_values <- map2_chr(params, values, str_c)
  args <- str_c(param_values, collapse = "&")
  str_c(baseurl, args)
}

omdb("Interstellar", "2014", "short", "xml")
## [1] "http://www.omdbapi.com/?t=Interstellar&y=2014&plot=short&r=xml"

Now we have a handy function that returns the API query. We can paste in the link, but we can also obtain data from within R:

request_interstellar <- omdb("Interstellar", "2014", "short", "xml")
con <- curl(request_interstellar)
answer_xml <- readLines(con)
## Warning in readLines(con): incomplete final line found on 'http://
## www.omdbapi.com/?t=Interstellar&y=2014&plot=short&r=xml'
close(con)
answer_xml
## [1] "<?xml version=\"1.0\" encoding=\"UTF-8\"?><root response=\"True\"><movie title=\"Interstellar\" year=\"2014\" rated=\"PG-13\" released=\"07 Nov 2014\" runtime=\"169 min\" genre=\"Adventure, Drama, Sci-Fi\" director=\"Christopher Nolan\" writer=\"Jonathan Nolan, Christopher Nolan\" actors=\"Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow\" plot=\"A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival.\" language=\"English\" country=\"USA, UK\" awards=\"Won 1 Oscar. Another 39 wins &amp; 134 nominations.\" poster=\"https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg\" metascore=\"74\" imdbRating=\"8.6\" imdbVotes=\"961,788\" imdbID=\"tt0816692\" type=\"movie\"/></root>"
request_interstellar <- omdb("Interstellar", "2014", "short", "json")
con <- curl(request_interstellar)
answer_json <- readLines(con)
close(con)
answer_json %>% 
  prettify()
## {
##     "Title": "Interstellar",
##     "Year": "2014",
##     "Rated": "PG-13",
##     "Released": "07 Nov 2014",
##     "Runtime": "169 min",
##     "Genre": "Adventure, Drama, Sci-Fi",
##     "Director": "Christopher Nolan",
##     "Writer": "Jonathan Nolan, Christopher Nolan",
##     "Actors": "Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow",
##     "Plot": "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival.",
##     "Language": "English",
##     "Country": "USA, UK",
##     "Awards": "Won 1 Oscar. Another 39 wins & 134 nominations.",
##     "Poster": "https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg",
##     "Metascore": "74",
##     "imdbRating": "8.6",
##     "imdbVotes": "961,788",
##     "imdbID": "tt0816692",
##     "Type": "movie",
##     "Response": "True"
## }
## 

We have a form of data that is obviously structured. What is it?

Intro to JSON and XML

These are the two common languages of web services: JavaScript Object Notation and eXtensible Markup Language.

Here’s an example of JSON: from this wonderful site

{
  "crust": "original",
  "toppings": ["cheese", "pepperoni", "garlic"],
  "status": "cooking",
  "customer": {
    "name": "Brian",
    "phone": "573-111-1111"
  }
}

And here is XML:

<order>
    <crust>original</crust>
    <toppings>
        <topping>cheese</topping>
        <topping>pepperoni</topping>
        <topping>garlic</topping>
    </toppings>
    <status>cooking</status>
</order>

You can see that both of these data structures are quite easy to read. They are “self-describing”. In other words, they tell you how they are meant to be read.

There are easy means of taking these data types and creating R objects. Our JSON response above can be parsed using jsonlite::fromJSON():

answer_json %>% 
  fromJSON()
## $Title
## [1] "Interstellar"
## 
## $Year
## [1] "2014"
## 
## $Rated
## [1] "PG-13"
## 
## $Released
## [1] "07 Nov 2014"
## 
## $Runtime
## [1] "169 min"
## 
## $Genre
## [1] "Adventure, Drama, Sci-Fi"
## 
## $Director
## [1] "Christopher Nolan"
## 
## $Writer
## [1] "Jonathan Nolan, Christopher Nolan"
## 
## $Actors
## [1] "Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow"
## 
## $Plot
## [1] "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival."
## 
## $Language
## [1] "English"
## 
## $Country
## [1] "USA, UK"
## 
## $Awards
## [1] "Won 1 Oscar. Another 39 wins & 134 nominations."
## 
## $Poster
## [1] "https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg"
## 
## $Metascore
## [1] "74"
## 
## $imdbRating
## [1] "8.6"
## 
## $imdbVotes
## [1] "961,788"
## 
## $imdbID
## [1] "tt0816692"
## 
## $Type
## [1] "movie"
## 
## $Response
## [1] "True"

The output is a named list! A familiar and friendly R structure. Because data frames are lists, and because this list has no nested lists-within-lists, we can coerce it very simply:

answer_json %>% 
  fromJSON() %>% 
  tbl_df() %>% 
  kable()
Title Year Rated Released Runtime Genre Director Writer Actors Plot Language Country Awards Poster Metascore imdbRating imdbVotes imdbID Type Response
Interstellar 2014 PG-13 07 Nov 2014 169 min Adventure, Drama, Sci-Fi Christopher Nolan Jonathan Nolan, Christopher Nolan Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow A team of explorers travel through a wormhole in space in an attempt to ensure humanity’s survival. English USA, UK Won 1 Oscar. Another 39 wins & 134 nominations. https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg 74 8.6 961,788 tt0816692 movie True

A similar process exists for XML formats:

ans_xml_parsed <- xmlParse(answer_xml)
ans_xml_parsed
## <?xml version="1.0" encoding="UTF-8"?>
## <root response="True">
##   <movie title="Interstellar" year="2014" rated="PG-13" released="07 Nov 2014" runtime="169 min" genre="Adventure, Drama, Sci-Fi" director="Christopher Nolan" writer="Jonathan Nolan, Christopher Nolan" actors="Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow" plot="A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival." language="English" country="USA, UK" awards="Won 1 Oscar. Another 39 wins &amp; 134 nominations." poster="https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg" metascore="74" imdbRating="8.6" imdbVotes="961,788" imdbID="tt0816692" type="movie"/>
## </root>
## 

Not exactly the response we were hoping for! This shows us some of the XML document’s structure:

  • a <root> node with a single child, <movie>.
  • the information we want is all stored as attributes

From Nolan and Lang 2014:

The xmlRoot() function returns an object of class XMLInternalElementNode. This is a regular XML node and not specific to the root node, i.e., all XML nodes will appear in R with this class or a more specific class. An object of class XMLInternalElementNode has four fields: name, attributes, children and value, which we access with the methods xmlName(), xmlAttrs(), xmlChildren(), and xmlValue()

field method
name xmlName()
attributes xmlAttrs()
children xmlChildren()
value xmlValue()
ans_xml_parsed_root <- xmlRoot(ans_xml_parsed)[["movie"]]  # could also use [[1]]
ans_xml_parsed_root
## <movie title="Interstellar" year="2014" rated="PG-13" released="07 Nov 2014" runtime="169 min" genre="Adventure, Drama, Sci-Fi" director="Christopher Nolan" writer="Jonathan Nolan, Christopher Nolan" actors="Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow" plot="A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival." language="English" country="USA, UK" awards="Won 1 Oscar. Another 39 wins &amp; 134 nominations." poster="https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg" metascore="74" imdbRating="8.6" imdbVotes="961,788" imdbID="tt0816692" type="movie"/>
ans_xml_attrs <- xmlAttrs(ans_xml_parsed_root)
ans_xml_attrs
##                                                                                                             title 
##                                                                                                    "Interstellar" 
##                                                                                                              year 
##                                                                                                            "2014" 
##                                                                                                             rated 
##                                                                                                           "PG-13" 
##                                                                                                          released 
##                                                                                                     "07 Nov 2014" 
##                                                                                                           runtime 
##                                                                                                         "169 min" 
##                                                                                                             genre 
##                                                                                        "Adventure, Drama, Sci-Fi" 
##                                                                                                          director 
##                                                                                               "Christopher Nolan" 
##                                                                                                            writer 
##                                                                               "Jonathan Nolan, Christopher Nolan" 
##                                                                                                            actors 
##                                                 "Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow" 
##                                                                                                              plot 
##             "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival." 
##                                                                                                          language 
##                                                                                                         "English" 
##                                                                                                           country 
##                                                                                                         "USA, UK" 
##                                                                                                            awards 
##                                                                 "Won 1 Oscar. Another 39 wins & 134 nominations." 
##                                                                                                            poster 
## "https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg" 
##                                                                                                         metascore 
##                                                                                                              "74" 
##                                                                                                        imdbRating 
##                                                                                                             "8.6" 
##                                                                                                         imdbVotes 
##                                                                                                         "961,788" 
##                                                                                                            imdbID 
##                                                                                                       "tt0816692" 
##                                                                                                              type 
##                                                                                                           "movie"
ans_xml_attrs %>%
  t() %>%
  tbl_df() %>%
  kable()
title year rated released runtime genre director writer actors plot language country awards poster metascore imdbRating imdbVotes imdbID type
Interstellar 2014 PG-13 07 Nov 2014 169 min Adventure, Drama, Sci-Fi Christopher Nolan Jonathan Nolan, Christopher Nolan Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow A team of explorers travel through a wormhole in space in an attempt to ensure humanity’s survival. English USA, UK Won 1 Oscar. Another 39 wins & 134 nominations. https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg 74 8.6 961,788 tt0816692 movie

Introducing the easy way: httr

httr is yet another star in the hadleyverse, this one designed to facilitate all things HTTP from within R. This includes the major HTTP verbs, which are:1

  • GET: fetch an existing resource. The URL contains all the necessary information the server needs to locate and return the resource.
  • POST: create a new resource. POST requests usually carry a payload that specifies the data for the new resource.
  • PUT: update an existing resource. The payload may contain the updated data for the resource.
  • DELETE: delete an existing resource.

HTTP is the foundation for APIs; understanding how it works is the key to interacting with all the diverse APIs out there. An excellent beginning resource for APIs (including HTTP basics) is this simple guide

httr also facilitates a variety of authentication protocols.

httr contains one function for every HTTP verb. The functions have the same names as the verbs (e.g. GET(), POST()).They have more informative outputs than simply using curl, and come with some nice convenience functions for working with the output:

interstellar_json <- omdb("Interstellar", "2014", "short", "json")
response_json <- GET(interstellar_json)
content(response_json, as = "parsed", type = "application/json")
## $Title
## [1] "Interstellar"
## 
## $Year
## [1] "2014"
## 
## $Rated
## [1] "PG-13"
## 
## $Released
## [1] "07 Nov 2014"
## 
## $Runtime
## [1] "169 min"
## 
## $Genre
## [1] "Adventure, Drama, Sci-Fi"
## 
## $Director
## [1] "Christopher Nolan"
## 
## $Writer
## [1] "Jonathan Nolan, Christopher Nolan"
## 
## $Actors
## [1] "Ellen Burstyn, Matthew McConaughey, Mackenzie Foy, John Lithgow"
## 
## $Plot
## [1] "A team of explorers travel through a wormhole in space in an attempt to ensure humanity's survival."
## 
## $Language
## [1] "English"
## 
## $Country
## [1] "USA, UK"
## 
## $Awards
## [1] "Won 1 Oscar. Another 39 wins & 134 nominations."
## 
## $Poster
## [1] "https://images-na.ssl-images-amazon.com/images/M/MV5BMjIxNTU4MzY4MF5BMl5BanBnXkFtZTgwMzM4ODI3MjE@._V1_SX300.jpg"
## 
## $Metascore
## [1] "74"
## 
## $imdbRating
## [1] "8.6"
## 
## $imdbVotes
## [1] "961,788"
## 
## $imdbID
## [1] "tt0816692"
## 
## $Type
## [1] "movie"
## 
## $Response
## [1] "True"
interstellar_xml <- omdb("Interstellar", "2014", "short", "xml")
response_xml <- GET(interstellar_xml)
content(response_xml, as = "parsed")
## {xml_document}
## <root response="True">
## [1] <movie title="Interstellar" year="2014" rated="PG-13" released="07 N ...

In addition, httr gives us access to lots of useful information about the quality of our response. For example, the header:

headers(response_xml)
## $date
## [1] "Thu, 17 Nov 2016 16:08:07 GMT"
## 
## $`content-type`
## [1] "text/xml; charset=utf-8"
## 
## $`content-length`
## [1] "720"
## 
## $connection
## [1] "keep-alive"
## 
## $`cache-control`
## [1] "public, max-age=86400"
## 
## $`content-encoding`
## [1] "gzip"
## 
## $expires
## [1] "Fri, 18 Nov 2016 16:08:07 GMT"
## 
## $`last-modified`
## [1] "Thu, 17 Nov 2016 16:08:06 GMT"
## 
## $vary
## [1] "Accept-Encoding"
## 
## $`x-aspnet-version`
## [1] "4.0.30319"
## 
## $`x-powered-by`
## [1] "ASP.NET"
## 
## $`access-control-allow-origin`
## [1] "*"
## 
## $`cf-cache-status`
## [1] "HIT"
## 
## $server
## [1] "cloudflare-nginx"
## 
## $`cf-ray`
## [1] "30347084b1c710f3-ORD"
## 
## attr(,"class")
## [1] "insensitive" "list"

And also a handy means to extract specifically the HTTP status code:

status_code(response_xml)
## [1] 200
Code2 Status
1xx Informational
2xx Success
3xx Redirection
4xx Client error (you did something wrong)
5xx Server error (server did something wrong)

In fact, we didn’t need to create omdb() at all! httr provides a straightforward means of making an http request:

the_martian <- GET("http://www.omdbapi.com/?", query = list(t = "The Martian", y = 2015, plot = "short", r = "json"))

content(the_martian)
## $Title
## [1] "The Martian"
## 
## $Year
## [1] "2015"
## 
## $Rated
## [1] "PG-13"
## 
## $Released
## [1] "02 Oct 2015"
## 
## $Runtime
## [1] "144 min"
## 
## $Genre
## [1] "Adventure, Drama, Sci-Fi"
## 
## $Director
## [1] "Ridley Scott"
## 
## $Writer
## [1] "Drew Goddard (screenplay), Andy Weir (book)"
## 
## $Actors
## [1] "Matt Damon, Jessica Chastain, Kristen Wiig, Jeff Daniels"
## 
## $Plot
## [1] "An astronaut becomes stranded on Mars after his team assume him dead, and must rely on his ingenuity to find a way to signal to Earth that he is alive."
## 
## $Language
## [1] "English, Mandarin"
## 
## $Country
## [1] "USA, UK"
## 
## $Awards
## [1] "Nominated for 7 Oscars. Another 33 wins & 172 nominations."
## 
## $Poster
## [1] "https://images-na.ssl-images-amazon.com/images/M/MV5BMTc2MTQ3MDA1Nl5BMl5BanBnXkFtZTgwODA3OTI4NjE@._V1_SX300.jpg"
## 
## $Metascore
## [1] "80"
## 
## $imdbRating
## [1] "8.0"
## 
## $imdbVotes
## [1] "506,335"
## 
## $imdbID
## [1] "tt3659388"
## 
## $Type
## [1] "movie"
## 
## $Response
## [1] "True"

We get the same answer as before! With httr, we are able to pass in the named arguments to the API call as a named list. We are also able to use spaces in movie names - httr encodes these in the URL before making the GET request.

The documentation for httr includes two useful vignettes:

Scraping

What if data is present on a website, but isn’t provided in an API at all? It is possible to grab that information too. How easy that is to do depends a lot on the quality of the website that we are using.

HTML is a structured way of displaying information. It is very similar in structure to XML (in fact many modern html sites are actually XHTML5, which is also valid XML)

Scraping

What if data is present on a website, but isn’t provided in an API at all? It is possible to grab that information too. How easy that is to do depends a lot on the quality of the website that we are using.

HTML is a structured way of displaying information. It is very similar in structure to XML (in fact many modern html sites are actually XHTML5, which is also valid XML)

Example of HTML code

Source: Example of a simple HTML page

<HTML>
<HEAD>
  <TITLE>Your Title Here</TITLE>
</HEAD>

<BODY BGCOLOR="FFFFFF">
<CENTER><IMG SRC="clouds.jpg" ALIGN="BOTTOM"> </CENTER>
<HR>
<a href="http://somegreatsite.com">Link Name</a> is a link to another nifty site
<H1>This is a Header</H1>
<H2>This is a Medium Header</H2>
Send me mail at <a href="mailto:support@yourcompany.com"> support@yourcompany.com</a>.
<P> This is a new paragraph!
<P> <B>This is a new paragraph!</B>
<BR> <B><I>This is a new sentence without a paragraph break, in bold italics.</I></B>
<HR>
</BODY>
</HTML>
Rendered example HTML

Rendered example HTML

Install your equipment

Practice CSS selectors

Before we go any further, let’s play a game together!

Obtain a table

Let’s make a simple HTML table and then parse it!

  • make a new, empty project
  • make a totally empty .Rmd file called "GapminderHead.Rmd"
  • copy this into the body:
---
output: html_document
---

```{r echo=FALSE, results='asis'} #delete this comment
library(gapminder)
knitr::kable(head(gapminder))
```

remember to delete the comment

Then knit the document and click “View in Browser”. It should look like this:

gm

gm

We have created a simple html table with the head of gapminder in it! We can get our data back by parsing this table into a dataframe again. Extracting data from html is called “scraping”, and it is done with the R package rvest:

read_html("GapminderHead.html") %>%
  html_table()
## [[1]]
##       country continent year lifeExp      pop gdpPercap
## 1 Afghanistan      Asia 1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia 1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia 1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia 1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia 1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia 1977  38.438 14880372  786.1134

Scraping via CSS selectors

Let’s practice scraping websites using our newfound abilities! Here is a list of wineries in Ithaca, NY.

Extract names

First let’s download and store the HTML content.

wineries <- read_html("http://www.visitithaca.com/attractions/wineries.html")

Now let’s extract the names and store them. To pick out the names, scroll down to the list of wineries and use the Selector Gadget to click on one of the winery names on the main page. Once you only have the winer names selected, you should see that the names are embedded in hyperlink tags (<a href...) and it has a class of indSearchListingTitle. We can use html_nodes to extract this information and html_text to remove the extraneous HTML tags.

fnames <- wineries %>%
  html_nodes(".indSearchListingTitle a") %>%
  html_text()
head(fnames)
## [1] "Treleaven by King Ferry Winery" "Ports of New York"             
## [3] "Goose Watch Winery"             "Varick Winery & Vineyard"      
## [5] "Dills Run Winery"               "Swedish Hill Vineyard"

Extract addresses

The address is trickier. Using the select tool you’ll see that the details are stored in the indMetaInfoWrapper class, but also includes things like phone numbers. We just want the addresses, so use the Selector Gadget to only include addresses.

faddress <- wineries %>%
  html_nodes(".indMetaWrapper:nth-child(1) .indMetaInfoWrapper") %>%
  html_text()
head(faddress)
## [1] "\n              658 Lake Rd., King Ferry, NY 13081\n              \n            "              
## [2] "\n              815 Taber St, Ithaca, NY 14850\n              \n            "                  
## [3] "\n              on the Cayuga Lake Wine Trail, Romulus, NY 14541\n              \n            "
## [4] "\n              on the Cayuga Lake Wine Trail, Romulus, NY 14541\n              \n            "
## [5] "\n              3862 State Route 90, Aurora, NY 13026\n              \n            "           
## [6] "\n              on the Cayuga Lake Wine Trail, Romulus, NY 14541\n              \n            "

Clean up the addresses

The code on this website is a little inconsistent – sometimes the winery name is first and sometimes it’s preceded by something like “on the Cayuga Lake Wine Trail” so let’s do a little final cleanup.

to_remove <- str_c(c("\n", "^\\s+|\\s+$", "on the Cayuga Lake Wine Trail, ", "Cayuga Lake Wine Trail, ", "on the Cayuga Wine Trail, ",  "on the Finger Lakes Beer Trail, "), collapse="|")

faddress <- str_replace_all(faddress, to_remove, "")

head(faddress)
## [1] "              658 Lake Rd., King Ferry, NY 13081"   
## [2] "              815 Taber St, Ithaca, NY 14850"       
## [3] "              Romulus, NY 14541"                    
## [4] "              Romulus, NY 14541"                    
## [5] "              3862 State Route 90, Aurora, NY 13026"
## [6] "              Romulus, NY 14541"

Combine with an API to geocode the addresses

The package ggmap has a nice geocode function that we’ll use to extract coordinates.

geocodes <- geocode(faddress, output = "latlona")

head(geocodes)
##         lon      lat                                address
## 1 -76.63443 42.65034 658 lake rd, king ferry, ny 13081, usa
## 2 -76.51510 42.43866    815 taber st, ithaca, ny 14850, usa
## 3 -76.83404 42.75255                 romulus, ny 14541, usa
## 4 -76.83404 42.75255                 romulus, ny 14541, usa
## 5 -76.71379 42.78586      3862 ny-90, aurora, ny 13026, usa
## 6 -76.83404 42.75255                 romulus, ny 14541, usa

Pull it all together

full <- data_frame(name = fnames) %>%
  bind_cols(geocodes) %>%
  # is it a winery or cidery?
  mutate(category = ifelse(str_detect(name, "Winery"), "Winery", "Cidery"))

kable(full)
name lon lat address category
Treleaven by King Ferry Winery -76.63443 42.65034 658 lake rd, king ferry, ny 13081, usa Winery
Ports of New York -76.51510 42.43866 815 taber st, ithaca, ny 14850, usa Cidery
Goose Watch Winery -76.83404 42.75255 romulus, ny 14541, usa Winery
Varick Winery & Vineyard -76.83404 42.75255 romulus, ny 14541, usa Winery
Dills Run Winery -76.71379 42.78586 3862 ny-90, aurora, ny 13026, usa Winery
Swedish Hill Vineyard -76.83404 42.75255 romulus, ny 14541, usa Cidery
Knapp Winery & Vineyard Restaurant -76.83404 42.75255 romulus, ny 14541, usa Winery
Frontenac Point Vineyard & Estate Winery -76.64552 42.56580 9501 ny-89, trumansburg, ny 14886, usa Winery
Izzo White Barn Winery -76.70580 42.94002 6634 cayuga rd, cayuga, ny 13034, usa Winery
Montezuma Winery & Hidden Marsh Distillery -76.79662 42.91062 seneca falls, ny 13148, usa Winery
Six Mile Creek Vineyards -76.50188 42.44396 ithaca, ny, usa Cidery
Long Point Winery -76.70245 42.75396 aurora, ny 13026, usa Winery
Buttonwood Grove Winery -76.83404 42.75255 romulus, ny 14541, usa Winery
Finger Lakes Cider House -76.70140 42.60650 4017 hickok rd, interlaken, ny 14847, usa Cidery
Cayuga Lake Wine Trail -83.30791 42.19938 13065 new york st, romulus, mi 48174, usa Cidery
Thirsty Owl Wine Company -76.82301 42.67646 ovid, ny 14521, usa Cidery
Chateau Dusseau Winery -76.46788 42.63096 locke, ny 13092, usa Winery
Lucas Vineyards -76.71015 42.62787 3862 county rd 150, interlaken, ny 14847, usa Cidery
Hosmer Winery -76.82301 42.67646 ovid, ny 14521, usa Winery
Americana Vineyards & Crystal Lake Cafe -76.67485 42.57722 4367 e covert rd, interlaken, ny 14847, usa Cidery
Sheldrake Point Vineyard -76.82301 42.67646 ovid, ny 14521, usa Cidery
Heart and Hands Winery -76.70822 42.79736 ny-90, new york, usa Winery
Eleven Lakes Winery -76.74878 42.85320 3550 ny-89, seneca falls, ny 13148, usa Winery
Cayuga Ridge Estates Winery -76.82301 42.67646 ovid, ny 14521, usa Winery
Bellwether Hard Cider / Bellwether Wine Cellars -76.66606 42.54229 trumansburg, ny 14886, usa Cidery

Random observations on scraping

  • Make sure you’ve obtained ONLY what you want! Scroll over the whole page to ensure that selectorgadget hasn’t found too many things
  • If you are having trouble parsing, try selecting a smaller subset of the thing you are seeking (i.e. being more precise)
  • MOST IMPORTANT - confirm that there is NO RopenSci package and NO API before you spend hours scraping when the API was right here

Session Info

devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.3.1 (2016-06-21)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2016-11-17
## Packages ------------------------------------------------------------------
##  package      * version  date       source                           
##  assertthat     0.1      2013-12-06 CRAN (R 3.3.0)                   
##  bit            1.1-12   2014-04-09 CRAN (R 3.3.0)                   
##  bit64          0.9-5    2015-07-05 CRAN (R 3.3.0)                   
##  boot         * 1.3-18   2016-02-23 CRAN (R 3.3.1)                   
##  broom        * 0.4.1    2016-06-24 CRAN (R 3.3.0)                   
##  car            2.1-3    2016-08-11 CRAN (R 3.3.0)                   
##  caret        * 6.0-73   2016-11-10 CRAN (R 3.3.2)                   
##  codetools      0.2-15   2016-10-05 CRAN (R 3.3.0)                   
##  colorspace     1.2-7    2016-10-11 CRAN (R 3.3.0)                   
##  curl         * 2.2      2016-10-21 CRAN (R 3.3.0)                   
##  DBI            0.5-1    2016-09-10 CRAN (R 3.3.0)                   
##  devtools       1.12.0   2016-06-24 CRAN (R 3.3.0)                   
##  digest         0.6.10   2016-08-02 CRAN (R 3.3.0)                   
##  dplyr        * 0.5.0    2016-06-24 CRAN (R 3.3.0)                   
##  evaluate       0.10     2016-10-11 CRAN (R 3.3.0)                   
##  foreach        1.4.3    2015-10-13 CRAN (R 3.3.0)                   
##  foreign        0.8-67   2016-09-13 CRAN (R 3.3.0)                   
##  gapminder    * 0.2.0    2015-12-31 CRAN (R 3.3.0)                   
##  geonames     * 0.998    2014-12-19 CRAN (R 3.3.0)                   
##  geosphere      1.5-5    2016-06-15 cran (@1.5-5)                    
##  gganimate    * 0.1      2016-11-11 Github (dgrtwo/gganimate@26ec501)
##  ggmap        * 2.6.1    2016-01-23 cran (@2.6.1)                    
##  ggplot2      * 2.2.0    2016-11-10 Github (hadley/ggplot2@f442f32)  
##  gtable         0.2.0    2016-02-26 CRAN (R 3.3.0)                   
##  gutenbergr   * 0.1.2    2016-06-24 CRAN (R 3.3.0)                   
##  highr          0.6      2016-05-09 CRAN (R 3.3.0)                   
##  htmltools      0.3.5    2016-03-21 CRAN (R 3.3.0)                   
##  htmlwidgets    0.8      2016-11-09 CRAN (R 3.3.1)                   
##  httr         * 1.2.1    2016-07-03 CRAN (R 3.3.0)                   
##  ISLR         * 1.0      2013-06-11 CRAN (R 3.3.0)                   
##  iterators      1.0.8    2015-10-13 CRAN (R 3.3.0)                   
##  janeaustenr  * 0.1.4    2016-10-26 CRAN (R 3.3.0)                   
##  jpeg           0.1-8    2014-01-23 cran (@0.1-8)                    
##  jsonlite     * 1.1      2016-09-14 CRAN (R 3.3.0)                   
##  knitr        * 1.15     2016-11-09 CRAN (R 3.3.1)                   
##  labeling       0.3      2014-08-23 CRAN (R 3.3.0)                   
##  lattice      * 0.20-34  2016-09-06 CRAN (R 3.3.0)                   
##  lazyeval       0.2.0    2016-06-12 CRAN (R 3.3.0)                   
##  lme4           1.1-12   2016-04-16 cran (@1.1-12)                   
##  lubridate    * 1.6.0    2016-09-13 CRAN (R 3.3.0)                   
##  magrittr       1.5      2014-11-22 CRAN (R 3.3.0)                   
##  mapproj        1.2-4    2015-08-03 cran (@1.2-4)                    
##  maps           3.1.1    2016-07-27 cran (@3.1.1)                    
##  MASS           7.3-45   2016-04-21 CRAN (R 3.3.1)                   
##  Matrix         1.2-7.1  2016-09-01 CRAN (R 3.3.0)                   
##  MatrixModels   0.4-1    2015-08-22 CRAN (R 3.3.0)                   
##  memoise        1.0.0    2016-01-29 CRAN (R 3.3.0)                   
##  mgcv           1.8-16   2016-11-07 CRAN (R 3.3.0)                   
##  minqa          1.2.4    2014-10-09 cran (@1.2.4)                    
##  mnormt         1.5-5    2016-10-15 CRAN (R 3.3.0)                   
##  ModelMetrics   1.1.0    2016-08-26 CRAN (R 3.3.0)                   
##  modelr       * 0.1.0    2016-08-31 CRAN (R 3.3.0)                   
##  modeltools     0.2-21   2013-09-02 CRAN (R 3.3.0)                   
##  munsell        0.4.3    2016-02-13 CRAN (R 3.3.0)                   
##  nlme           3.1-128  2016-05-10 CRAN (R 3.3.1)                   
##  nloptr         1.0.4    2014-08-04 cran (@1.0.4)                    
##  NLP            0.1-9    2016-02-18 CRAN (R 3.3.0)                   
##  nnet           7.3-12   2016-02-02 CRAN (R 3.3.1)                   
##  openssl        0.9.5    2016-10-28 CRAN (R 3.3.0)                   
##  pbkrtest       0.4-6    2016-01-27 CRAN (R 3.3.0)                   
##  plyr           1.8.4    2016-06-08 CRAN (R 3.3.0)                   
##  png            0.1-7    2013-12-03 cran (@0.1-7)                    
##  profvis      * 0.3.2    2016-05-19 CRAN (R 3.3.0)                   
##  proto          1.0.0    2016-10-29 CRAN (R 3.3.0)                   
##  psych          1.6.9    2016-09-17 cran (@1.6.9)                    
##  purrr        * 0.2.2    2016-06-18 CRAN (R 3.3.0)                   
##  quantreg       5.29     2016-09-04 CRAN (R 3.3.0)                   
##  R6             2.2.0    2016-10-05 CRAN (R 3.3.0)                   
##  randomForest * 4.6-12   2015-10-07 CRAN (R 3.3.0)                   
##  rcfss        * 0.1.0    2016-10-06 local                            
##  RColorBrewer   1.1-2    2014-12-07 CRAN (R 3.3.0)                   
##  Rcpp           0.12.7   2016-09-05 cran (@0.12.7)                   
##  readr        * 1.0.0    2016-08-03 CRAN (R 3.3.0)                   
##  readxl       * 0.1.1    2016-03-28 CRAN (R 3.3.0)                   
##  rebird       * 0.3.0    2016-03-23 CRAN (R 3.3.0)                   
##  reshape2       1.4.2    2016-10-22 CRAN (R 3.3.0)                   
##  RgoogleMaps    1.4.1    2016-09-18 cran (@1.4.1)                    
##  rjson          0.2.15   2014-11-03 cran (@0.2.15)                   
##  rmarkdown    * 1.1      2016-10-16 CRAN (R 3.3.1)                   
##  rplos        * 0.6.0    2016-07-22 CRAN (R 3.3.0)                   
##  rstudioapi     0.6      2016-06-27 CRAN (R 3.3.0)                   
##  rvest        * 0.3.2    2016-06-17 CRAN (R 3.3.0)                   
##  scales       * 0.4.1    2016-11-09 CRAN (R 3.3.1)                   
##  selectr        0.3-0    2016-08-30 CRAN (R 3.3.0)                   
##  slam           0.1-38   2016-08-18 CRAN (R 3.3.2)                   
##  SnowballC      0.5.1    2014-08-09 cran (@0.5.1)                    
##  solr           0.1.6    2015-09-17 CRAN (R 3.3.0)                   
##  sp             1.2-3    2016-04-14 cran (@1.2-3)                    
##  SparseM        1.72     2016-09-06 CRAN (R 3.3.0)                   
##  stringi        1.1.2    2016-10-01 CRAN (R 3.3.0)                   
##  stringr      * 1.1.0    2016-08-19 cran (@1.1.0)                    
##  tibble       * 1.2      2016-08-26 cran (@1.2)                      
##  tidyr        * 0.6.0    2016-08-12 CRAN (R 3.3.0)                   
##  tidytext     * 0.1.2    2016-10-28 CRAN (R 3.3.0)                   
##  tidyverse    * 1.0.0    2016-09-09 CRAN (R 3.3.0)                   
##  tm             0.6-2    2015-07-03 CRAN (R 3.3.0)                   
##  tokenizers     0.1.4    2016-08-29 CRAN (R 3.3.0)                   
##  topicmodels  * 0.2-4    2016-05-23 CRAN (R 3.3.0)                   
##  tree         * 1.0-37   2016-01-21 CRAN (R 3.3.0)                   
##  twitteR      * 1.1.9    2015-07-29 CRAN (R 3.3.0)                   
##  whisker        0.3-2    2013-04-28 CRAN (R 3.3.0)                   
##  withr          1.0.2    2016-06-20 CRAN (R 3.3.0)                   
##  XML          * 3.98-1.4 2016-03-01 CRAN (R 3.3.0)                   
##  xml2         * 1.0.0    2016-06-24 CRAN (R 3.3.0)                   
##  yaml           2.1.13   2014-06-12 CRAN (R 3.3.0)

This work is licensed under the CC BY-NC 4.0 Creative Commons License.