Welcome to another tutorial for this class, COMP/STAT 112: Introduction to Data Science! It will be similar to the others, including demo videos and files embedded in this document and practice problems with hints or solutions at the end. There are some new libraries, so be sure to install those first.
As most of our files do, we start this one with three R code chunks: 1. options, 2. libraries and settings, 3. data.
library(tidyverse) # for data cleaning and plotting
library(gardenR) # for Lisa's garden data
library(lubridate) # for date manipulation
library(openintro) # for the abbr2state() function
library(palmerpenguins)# for Palmer penguin data
library(maps) # for map data
library(ggmap) # for mapping points on maps
library(gplots) # for col2hex() function
library(RColorBrewer) # for color palettes
library(sf) # for working with spatial data
library(leaflet) # for highly customizable mapping
library(ggthemes) # for more themes (including theme_map())
theme_set(theme_minimal())
# Starbucks locations
Starbucks <- read_csv("https://www.macalester.edu/~ajohns24/Data/Starbucks.csv")
# Lisa's favorite St. Paul places - used in leaflet example
favorite_stp_by_lisa <- tibble(
place = c("Home", "Macalester College", "Adams Spanish Immersion",
"Spirit Gymnastics", "Bama & Bapa", "Now Bikes",
"Dance Spectrum", "Pizza Luce", "Brunson's"),
long = c(-93.1405743, -93.1712321, -93.1451796,
-93.1650563, -93.1542883, -93.1696608,
-93.1393172, -93.1524256, -93.0753863),
lat = c(44.950576, 44.9378965, 44.9237914,
44.9654609, 44.9295072, 44.9436813,
44.9399922, 44.9468848, 44.9700727)
)
# North Carolina births and sids deaths
nc <- st_read(system.file("shape/nc.shp", package="sf"),
quiet = TRUE) %>%
mutate(sid_per_1000birth_79 = SID79/BIR79*1000)
# Compute centroid (center) of each county
county_centroid_lat_long <- as.tibble(st_coordinates(st_centroid(nc))) %>%
rename(centr_long = X,
centr_lat = Y)
# County names and sid_per_1000birth_79
nc_centroid_county <- st_drop_geometry(nc) %>%
select(NAME, sid_per_1000birth_79)
# All centroid level info
nc_centroid <- county_centroid_lat_long %>%
bind_cols(nc_centroid_county)
Learning Goals
After this tutorial, you should be able to do the following:
Plot data points on top of a map using the ggmap() function along with ggplot2 functions.
Create choropleth maps using geom_map().
Add points and other ggplot2 features to a map created from geom_map().
Understand the basics of creating a map using leaflet, including adding points and choropleths to a base map.
Motivation
Take a look at these to get motivated/inspired to make your own:
The Starbucks data, compiled by Danny Kaplan and provided by Alicia Johnson, contains information about every Starbucks in the world at the time the data were collected. It includes the Latitude and Longitude of each location. Let’s start by using familiar plotting tools
The point pattern probably looks familiar. To highlight the geographical nature of this scatterplot, we can superimpose the points on top of a map, using the ggmap() function from the ggmap library.
NOTE: we used to be able to easily bring in Google maps. As of mid-2018, in order to bring those in, you need to have a registered API key. If you want to do that, see google_key in the help. Then, see the documentation for get_map(). We will bring in other types of maps since Google maps are harder to do now and require you to submit credit card information.
Instead, we bring in a stamen map (there are others you could try, but we’ll stick with this). You can also take a look at stamen maps on their website. First, let’s look at an example.
# Get the map information
world <- get_stamenmap(
bbox = c(left = -180, bottom = -57, right = 179, top = 82.1),
maptype = "terrain",
zoom = 2)
# Plot the points on the map
ggmap(world) + # creates the map "background"
geom_point(data = Starbucks,
aes(x = Longitude, y = Latitude),
alpha = .3,
size = .1) +
theme_map()
Next, we will walk through the get_stamenmap() function arguments. The code below is what was used to get the world map information.
The bbox argument tells it the minimum and maximum latitude and longitude points. So, left is the minimum longitude, right is the maximum longitude, bottom is the minimum latitude, and top is the maximum latitude. I found it helpful to go to openstreetmap: zoom in on the area of interest, click export, and you will see all the values you need. I had to modify them slightly, which you can do after your initial plot.
When you make a large area, you need to decrease the zoom, otherwise it will take too long to load. So, it’s a good idea to start with a small zoom and you can always make it bigger if you want. This might seem counter-intuitive at first. I think of the zoom level as the level of detail. So, smaller numbers show less detail and larger numbers more detail. I often go to the stamanmaps webpage and search for the location I’m mapping. Then, in the URL, you can see the zoom number. For example, this link is a map of St. Paul: http://maps.stamen.com/#terrain/12/44.9531/-93.0904. Notice the number 12 next to /#terrain/. That means it is zoomed in at 12.
ggmap()
We save the the map information from get_stamenmap() to a named value and then use it in ggmap():
# Get the map information world <- get_stamenmap( bbox = c(left = -180, bottom = -57, right = 179, top = 82.1), maptype = "terrain", zoom = 2)
# Plot the points on the map ggmap(world) + # creates the map "background" geom_point(data = Starbucks, aes(x = Longitude, y = Latitude), alpha = .3, size = .1) + theme_map()
The ggmap() function will print the “background” map. Think of it as the providing the canvas on which we will plot. This takes the place of our usual ggplot().
ggmap(world)
After that, we can use the geom_XXX() functions from ggplot2 that we are used to in order to put points, lines, etc. on top of the map. But, we need to remember to also provide the data we are using in the geom_XXX() function(s) we use since we do not have the ggplot() function in which to provide it.
# Get the map information world <- get_stamenmap( bbox = c(left = -180, bottom = -57, right = 179, top = 82.1), maptype = "terrain", zoom = 2)
# Plot the points on the map ggmap(world) + # creates the map "background" geom_point(data = Starbucks, aes(x = Longitude, y = Latitude), alpha = .3, size = .1) + theme_map()
theme_map()
The last thing I did in the code was to add theme_map(). This is optional, but I often find it makes it look nice.
# Get the map information world <- get_stamenmap( bbox = c(left = -180, bottom = -57, right = 179, top = 82.1), maptype = "terrain", zoom = 2)
# Plot the points on the map ggmap(world) + # creates the map "background" geom_point(data = Starbucks, aes(x = Longitude, y = Latitude), alpha = .3, size = .1) + theme_map()
So, the final map as a world map as the background with points plotted on top that show the Starbucks locations. The points are .1 of their usual size and have a transparency level of .3.
Demo video
With that introduction, you are ready to watch the demo video!
Add an aesthetic to the world map that sets the color of the points according to the ownership type. What, if anything, can you deduce from this visualization?
Construct a new map of Starbucks locations in the Twin Cities metro area (approximately the 5 county metro area).
In the Twin Cities plot, play with the zoom number. What does it do? (just describe what it does - don’t actually include more than one map).
Try a couple different map types (see get_stamenmap() in help and look at maptype). Include a map with one of the other map types.
Add a point to the map that indicates Macalester College and label it appropriately. There are many ways you can do think, but I think it’s easiest with the annotate() function (see ggplot2 cheatsheet).
Choropleths
Geographical data needn’t be expressed by latitude and longitude. For choropleth maps, instead of visualizing our data as points with different aesthetics (size, color, transparency, etc.), we color different regions of the maps based on data values. To do this we need to specify both the geometric regions on which the data resides (counties, states, zip codes, etc.), and then wrangle the data so that there is one value per region.
Let’s return to the Starbucks data. First, we will create a new dataset, starbucks_us_by_state that limits the data to the US, finds the number of Starbucks in each state, and creates a state name that is in all lowercase letters that matches the state name in the region variable of the states_map dataset.
The states_map dataset gives information about creating the borders of the US states. The data is retrieved using the map_data() function. Run ?map_data in the console to see more information about what other maps are available. There are also other packages that provide different types of maps.
Then, we can use geom_map() to create a choropleth map. Let’s take a look at the map and we’ll go through the details after.
#Create a new Starbucks dataset that
# - filters to the US
# - summarizes the number of Starbucks in each state
# - has full names of states in lowercase letters (to match to states_map data created next)
starbucks_us_by_state <- Starbucks %>%
filter(Country == "US") %>%
count(`State/Province`) %>%
mutate(state_name = str_to_lower(abbr2state(`State/Province`)))
#US states map information - coordinates used to draw borders
states_map <- map_data("state")
# map that colors state by number of Starbucks
starbucks_us_by_state %>%
ggplot() +
geom_map(map = states_map,
aes(map_id = state_name,
fill = n)) +
#This assures the map looks decently nice:
expand_limits(x = states_map$long, y = states_map$lat) +
theme_map()
Now, let’s look more closely at what each piece of the code below is doing.
The map argument tells R at which level to create the map. Really, it tells it how to draw all the borders This is a very special data set. According to the geom_map() documentation, it is a “data frame that contains the map coordinates … It must contain columns x or long, y or lat, and region or id.” We are using the map_data() function to create the map file (see above for more detail0. You can open the map data, states_map, and see that it adheres to the rules.
Connect map id/region variable to data being plotted
The map_id inside of aes() is a required aesthetic for the geom_map() geom. It tells R which variable is the region/id variable, in this case the state. It connects the region or id from the map (region variable in states_map dataset, in this example) to the dataset being plotted (state_name in starbucks_us_by_state, in this example). So state_name needs to have the same form as region, which is why we modified the state names in starbucks_us_by_state.
We tell it to fill in the states by the variable n, the number of Starbucks in each state. With the geom_map() geom, it will fill in the borders of the regions we defined in the map argument.
Use expand_limits() to assure that the map covers the entire area it’s supposed to. We put the longitude variable from states_map for the x argument and the latitude variable from states_map for the y argument to assure the map stretches across the entire range of longitudes and latitudes in the map. There may be a better way to do this but I have yet to find it, and when I leave it out, I don’t even see the map.
You can add any of the ggplot2 layers on top of this map. In this example, I’ve added the MN Starbucks as points, a title, and changed the legend background (so it doesn’t have one and overlap California).
Example by Arie Voorman (some things could be out of date since it’s from 2015)
Your turn!
Exercise: Even more with Starbucks
The example I showed did not account for population of each state in the map. In the code below, a new variable is created, starbucks_per_10000, that gives the number of Starbucks per 10,000 people. It is in the starbucks_with_2018_pop_est dataset.
dplyr review: Look through the code above and describe what each line of code does.
Create a choropleth map that shows the number of Starbucks per 10,000 people on a map of the US. Use a new fill color, add points for all Starbucks in the US (except Hawaii and Alaska), add an informative title for the plot, and include a caption that says who created the plot (you!). Make a conclusion about what you observe.
Using leaflet to create maps
Concept Map
Leaflet is an open-source JavaScript library for creating maps. It can be used outside of R, but we will only discuss using the leaflet library in R.
This library uses a different plotting framework from ggplot2 although it still has a tidyverse feel due to its use of the pipe, %>% and the way it adds layers to the plot, just like in ggplot2.
Introductory video
Watch the video that introduces leaflet() functions.
Create a map widget by calling leaflet() and telling it the data to use.
Add a base map using addTiles() (the default) or addProviderTiles().
Add layers to the map by using layer functions (e.g. , addMarkers(), addPolygons()) to modify the map widget.
Repeat step 3 as desired.
Print the map widget to display it.
Creating a map with markers/points
Below, I create a basic map and add my points (the points are a layer on the map). The data are in favorite_stp_by_lisa - see the data section at the beginning for more details.
The function we will use to create the maps will look for certain variable names for latitude (lat, latitude) and longitude (lng, long, or longitude). If you do not name them one of those things or if the data you are using doesn’t name them that, you need to call out the name explicitly (you’ll see that next). You can use a “two-finger scroll” to zoom in and out.
leaflet(data = favorite_stp_by_lisa) %>% #base plot
addTiles() %>% #base map - default is openstreet map
addMarkers() #Adds markers - knows lat and long from names in data
The graph below is the same as above, but in the code I explicitly told it latitude and longitude, which you would need to do if those variables had a name not recognized by the function, and added labels. WARNING: DO NOT FORGET THE ~ BEFORE THE VARIABLE NAMES!!!
We can change just about everything about our map. The plot below is the same plot as above with some aesthetic changes:
We changed the base map with addProviderTiles().
To see all available provider base maps, type providers in the console.
To access those maps, use providers$PROVIDERNAME inside the addProviderTiles() function, where PROVIDERNAME is one of those listed providers. When you type provider$ a list should show up that you can click on.
We changed the marker type by using addCircles() instead of addMarkers() - Search addControl in the Help or type ?addControl into the console to see what all the arguments mean and how you can change them.
All variable arguments are preceded with a tilde, ~.
The weight argument tells it how thick to make the lines or points, pixels.
The opacity argument is the transparency, like the alpha argument in ggplot2.
Colors need to be in “hex” form. I used the col2hex() function from the gplot library to do that since I don’t have any hex colors memorized. The colors also need to be valid R colors.
The map below is also the “same” as the ones I have already created with a new base map and a line to trace my route, which was created with the addPolylines() layer. It traces the locations in the order they are entered in the dataset.
leaflet(data = favorite_stp_by_lisa) %>%
addProviderTiles(providers$CartoDB.DarkMatter) %>%
addCircles(lng = ~long,
lat = ~lat,
label = ~place,
weight = 10,
opacity = 1,
color = col2hex("darkred")) %>%
addPolylines(lng = ~long,
lat = ~lat,
color = col2hex("darkred"))
Choropleth layers with addPolygons()
For making maps with borders (like choropleth maps), the functions can receive the base maps with spatial data a few different ways. In the example here, I use functions from the sf package to get the data in the right form. In the demo video, I will translate a data.frame to an sf object. Hopefully those two sets of instructions fit most of the cases you are interested in.
In the data code chunk at the beginning, the function st_read() downloads the shape file for the counties of North Carolina, which is included in the sf package.
The dataset has number of births and number of SIDS cases in each county of North Carolina from 1974-1979 and 1979-1984. I computed a variable called sid_per_1000birth_79 which is the number of sids cases per 1000 births in 1979.
Below, I printed out the first 5 rows of the data with the NAMES and geometry variables from the dataset. The geometry variable contains information about how to plot the boundaries of the counties. Open the dataset and examine the geometry variable. The leaflet function knows that the geometry variable contains this special information.
nc %>%
select(NAME, geometry) %>%
slice(1:5)
To learn more about the data type nc into the Help menu or ?nc into the console.
This plot puts the map with North Carolina county borders on top of the Open Street Map. This map isn’t that interesting by itself.
leaflet(nc) %>%
addTiles() %>%
addPolygons()
Now, let’s use some of the data to enhance the graph by creating a choropleth map where we color the counties by sid_per_1000birth_79. In ggplot(), we can map a variable to color or fill inside the aesthetic. But, in the leaflet functions, we can’t do this. Instead, we need to create a variable of hex color names that tell it how to color or fill. Thankfully, there are functions that help us do that! Let’s walk through the detail of how we created the graph below.
#creates a function that can be used to map a variable to
#the color palette using viridis
pal <- colorNumeric("viridis",
domain = nc$sid_per_1000birth_79)
leaflet(nc) %>%
addTiles() %>%
addPolygons(
fillColor = ~pal(sid_per_1000birth_79), #fills according to that variable
fillOpacity = 0.7) #like alpha in ggplot
The colorNumeric() function returns a function that maps a variable’s values to colors in the given palette, in this case “viridis”. So, pal() is a function. We can then use that function inside addPolygons(). If we apply the function to the sid_per_1000birth_79 variable, it returns a variable of hex colors and the variable of colors is used to fill the counties.
Below I print out what happens when pal() is applied to sid_per_1000birth_79. We can see that it returns hex colors. So, the variable in the fillColor argument inside addPolygons() above, is a variable of hex color names.
Again, this is different from ggplot(), where we could map a variable to color or fill and it would do the translating of variable to color scale for us. In the leaflet functions, we have to explicitly provide the colors in a variable. The colorNumeric() help you do that. colorBin(), colorQuantile(), and colorFactor() are other functions you might need to use depending on the type of variable you are trying to change to colors.
There are many customizeable options in leaflet, much like ggplot(). I have included a commented set of code below to point out some useful functions and arguments.
#creates a function that can be used to map a variable to
#the color palette using viridis
pal <- colorNumeric("viridis",
domain = nc$sid_per_1000birth_79)
leaflet(nc) %>%
addTiles() %>%
addPolygons(
#skips drawing the borders:
stroke = FALSE,
#fills according to variable of hex colors:
fillColor = ~pal(sid_per_1000birth_79),
#changes transparency, like alpha in ggplot
fillOpacity = 0.7,
#how much to simplify the plot when zooming:
smoothFactor = 0.5,
#changes what happens to the shape when we mouse over it
highlight = highlightOptions(weight = 5,
color = "black",
fillOpacity = 0.9,
bringToFront = FALSE)) %>%
addCircles(data = nc_centroid,
lng = ~centr_long,
lat = ~centr_lat,
# label that appears when you click on the marker,
# in this case county name and sid_per_1000birth_79
# rounded to the 2nd decimal
popup = ~paste(NAME,": ",
round(sid_per_1000birth_79, 2),
sep=""),
radius = 2) %>%
# Add a legend
addLegend(pal = pal,
values = ~sid_per_1000birth_79,
opacity = 0.5,
title = NULL,
position = "bottomright")
Demo video
Watch my demo video below, but also check out the video by Andrew Ba Tran of the Washington Post (also listed in the Resources section below).
Tutorial by Andrew Ba Tran, investigative data reporter at Washington Post
Your turn!
Exercise: A few of your favorite things
In this exercise, you are going to create a single map of some of your favorite places! The end result will be one map.
Create a data set using the tibble() function that has 10-15 rows of your favorite places. The columns will be the name of the location, the latitude, the longitude, and a column that indicates if it is in your top 3 favorite locations or not. For an example of how to use tibble(), look at the favorite_stp_by_lisa I created in the data R code chunk at the beginning.
Create a map that uses circles to indicate your favorite places. Label them with the name of the place. Choose the base map you like best. Color your 3 favorite places differently than the ones that are not in your top 3. Add a legend that explains what the colors mean.
Connect all your locations together with a line in a meaningful way (you may need to order them differently in the original data).
If there are other variables you want to add that could enhance your plot, do that now.