Goal: geocode historical lynching locations in U.S.
We will geocode these data
Overview of lab exercise and problem set
We will use 2 methods to geocode these lynching locations:
Project HAL raw data
Method 1: Geocode using web service/API (OpenStreetMap)
OSM Nominatum
Method 2: Geocode offline, with gazetteer data (GeoNames)
GeoNames
You will then do some point-in-polygon analysis to see how lynchings (as part of broader voter suppression efforts) may have impacted vote share in 1920
1920 Presidential elections results
Your assignment: create (1) boxplot of vote share against lynchings, (2) map of geocoded lynching locations vs. election results
Boxplot
Map
You can make these plots in QGIS or in R. Instructions for both are below.
Boxplot in R
Map in R
We have three (and a half) sources of data:
Category | Type | Format | Data source |
---|---|---|---|
Lynchings (sample) | Table (non-geo) | .csv |
Project HAL |
County gazetteer | Table | .csv |
GeoNames |
1920 county borders | Vector (polygons) | .geojson |
Newberry Library |
\(+\) 1920 election results | CQ Voting/Elections |
These are all in the PS06.zip
file posted on Canvas.
Let’s open QGIS…
Always save your progress!
Go to Project
\(\to\) Save As...
Load the Project HAL data:
Layer
\(\to\) Add Layer
\(\to\) Add Delimited Text Layer...
Open the file HAL_sample.csv
in the HAL
folder.
Geometry definition should be set to No geometry
. Click Add
`
You should now see HAL_sample
in your layer menu. But there are no points on map, because the data are not geocoded.
Open the Attribute Table for HAL_sample
, navigate to the Field Calculator
Create new field, Address
, of type Text (string)
.
For the Expression
, write concat(County_full,', ',State)
.
The Preview
should show a county-state ID. Click OK
The new field should appear in the Attribute Table
To access the OSM/Nominatum geocoder, go to Processing
\(\to\) Toolbox
In the Processing Toolbox
panel, go to Vector general
menu \(\to\) Batch Nominatum geocoder
Set Input layer
\(=\) HAL_sample
and Address field
\(=\) Address
.
Save the output to a file called hal_osm.geojson
. Click Run
After running, the log may tell you that some addresses could not be geocoded.
Click Close
You should see the geocoded points in the main project window, and a new layer, hal\_osm
. Let’s plot it against an historical county boundary shapefile.
Go to Layer
\(\to\) Add Layer
\(\to\) Add Vector Layer...
.
Navigate to e1920.geojson
file in Data/Counties
folder. Add it to the map.
Looks about right. Now let’s try geocoding with gazetteer data.
Add the GeoNames gazetteer to the project, using Add Delimited Text Layer...
. Load GeoNames_ADM2.csv
from the Data/GeoNames
folder.
Set X field
\(=\) longitude
and Y field
\(=\) latitude
The centroids for (contemporary) counties should appear in the project window.
In the Attribute Table
, we see that
asciiname
field in GeoNames_ADM2
corresponds to County_full
in HAL_sample
admin1_code
corresponds to State
We can use these to create a unique key to match on
Open the Field Calculator
for GeoNames_ADM2
.
Create a new variable called address_gn
of type Text (string)
.
Set Expression
to lower(concat(asciiname,', ',admin1_code))
.
The Preview
should show a lower case county-state ID. Click OK
Now open Field Calculator
for HAL_sample
.
Create the same new field, address_gn
, of type Text (string)
.
Set Expression
to lower(concat(County_full,', ',State))
. Click OK
Now we are ready to join these layers. Double-click HAL_sample
layer to bring up the Properties
window. Open the Joins
tab, and click the +
button.
Set Join layer
\(=\) GeoNames_ADM2
, with address_gn
as the join and target field.
Check the box next to Custom Field Name Prefix
and enter gn_
in the box. Click OK
.
The new join should appear in the Joins
tab
If you click on the Fields
tab, you should see the new fields appended to HAL_sample
You can also open the Attribute Table
for HAL_sample
. Do you see the coordinates?
To display the geocoded locations, we need to export the joined file and re-import it.
Right-click on HAL_sample
and go to Export
\(\to\) Save Features As...
.
Save the file as hal_gn.csv
(comma separated values)
(Re-)import hal_gn.csv
, using Add Delimited Text Layer...
.
Set X field
\(=\) gn_longitude
and Y field
\(=\) gn_latitude
.
Set CRS
\(=\) EPSG:4326
The new geocoded locations should appear on the project window.
Now let’s do some point-in-polygon analysis!
Navigate to Vector
menu \(\to\) Analysis Tools
\(\to\) Count Points in Polygon
.
Select Polygons
\(=\) e1920
, Points
\(=\) hal_osm
. Name the count field lynch_osm
, and save the output to e1920_osm.geojson
. Click Run
You may see a warning about “No spatial index exists…”. You can ignore it here
Repeat this process with hal_gn
as the points layer. Name the count field lynch_gn
, and save the output as e1920_gn.geojson
The two new layers e1920_osm
and e1920_gn
should appear in the project window
We can try plotting these new count variables through Layer Properties
\(\to\) Symbology
. You will notice right away that there are not many unique values.
I used Natural Breaks
with 3 classes, but you can try other options.
The resulting distribution should look something like this
Let’s create “dummy” variables indicating whether at least one lynching occurred in each county. Open Field Calculator
for e1920_gn
, create new field, Lynchings
of type Integer
, with Expression
set to lynch_gn > 0
. Click OK
Do the same for e1920_osm
, with Expression
set to lynch_osm > 0
When finished, right-click on e1920_osm
and e1920_gn
, uncheck Toggle Editing
Save your changes when prompted. Do this for both layers
Let’s examine the relationship between lynching locations and local electoral preferences. Plot the variable DemMajPct
in e1920_osm
, with a red-to-blue color ramp (Equal Count
with 5 classes)
The lynching locations appear to be mostly in southern Democratic Party strongholds, but let’s look at this more systematically
To enable a variety of statistical plotting functions in QGIS, we need to enable the Data Plotly plugin. Go to Plugins
\(\to\) Manage and Install Plugins
Search for plotly
and install the plugin
Make sure the the box is checked next to Data Plotly
after installation
To access Data Plotly, activate the Plugins Toolbar
.
Go to View
\(\to\) Toolbars
\(\to\) check box next to Plugins Toolbar
The icon for Data Plotly can be tricky to find.
Look for an icon that looks like a statistical graphic, as shown here \(\downarrow\). Click on it
This should open a new DataPlotly panel in the project window.
Set Plot type
\(=\) Box Plot
, Layer
\(=\) e1920_osm
.
Set Grouping
\(=\) Lynchings
, Y field
\(=\) DemMajPct
. Click Create Plot
The average Democratic vote share was indeed significantly higher for counties with lynchings than counties without lynchings
Let’s clean up the axis labels. Click on the settings button (gears icon).
Plot Title
= Lynchings and 1920 Presidential election
X Label
\(=\) Lynchings (OSM geocoded)
and Y Label
\(=\) Democratic vote share in 1920 (\%)
Click Update Plot
To export the box plot as a .png
file, click on the icon in the lower-right corner.
Name it boxplot_osm.png
Repeat this process for GeoNames-geocoded e1920_gn
, with similar labels
Export the resulting image
Problem Set 6
Your assignment (if using QGIS): create a map to go along with the box plot
boxplot_osm.png
or boxplot_gn.png
DemMajPct
)map_osm.png
or map_gn.png
Either this:
OSM boxplot
OSM map
… or this:
GeoNames boxplot
GeoNames map
Loading R packages
To implement these steps in R, we will be using the sf
package, and two others (RCurl
, jsonlite
) that help R compose HTTP requests and process the results returned by online servers:
library(sf)
library(RCurl)
library(jsonlite)
NOTE: The code to produce the maps and boxplots in R is in ps06_demo.R
on RStudio Cloud, and in PS06.zip
(posted on Canvas).
As with QGIS, we will geocode in R using two methods:
Online, using a web service (OSM/Nominatum API)
Offline, using gazetteer data (GeoNames)
Method 1: Geocode the addresses using OSM/Nominatum
Step 1: define a function url_geo()
that sends queries to OSM/Nominatum, and returns geographic information from server:
url_geo = function(query, return.call = "json", sensor = "false") {
root = "https://nominatim.openstreetmap.org/search?q="
sfxx = "&format=json&polygon=1&addressdetails=1"
u = paste(root, query, sfxx,sep = "")
return(URLencode(u))
}
Step 2: define a wrapper function geoCode_OSM()
, that sends the query through url_geo()
and parses the result:
geoCode_OSM = function(query,match.num=1){
address=NA; longitude=NA; latitude=NA
u = url_geo(query)
doc = RCurl::getURL(u,httpheader = c('User-Agent' = "contact info"))
if(nchar(doc)>2){
dat = jsonlite::fromJSON(doc)
if(nrow(dat)>0){
address = dat$display_name[match.num]
longitude = as.numeric(as.character(dat$lon[match.num]))
latitude = as.numeric(as.character(dat$lat[match.num]))
}}
return(data.frame(
address=address,longitude=longitude,latitude=latitude
))
}
Let’s test the geocoding function!
geoCode_OSM("79 John F. Kennedy Street, Cambridge, MA")
geoCode_OSM("Harvard Kennedy School")
geoCode_OSM("Harvard")
geoCode_OSM("Harvard", match.num=2)
Load, pre-process Project HAL data
Load the tabular dataset using read.csv()
, and preview the first few rows:
hal = read.csv("Data/HAL/HAL_sample.csv")
head(hal)
Create new field for “county, state” and placeholders for coordinates
hal$address = paste0(hal$County_full,", ",hal$State)
hal$longitude = NA
hal$latitude = NA
To geocode as a batch processing routine, we will write a for()
loop, which runs the geoCode_OSM()
function for each address in hal
and stores the result:
for(i in 1:nrow(hal)){
# Skip past errors
tryCatch({
address_geo = geoCode_OSM(hal$address[i])
# Add coordinates to dataset
hal$longitude[i] = address_geo$longitude
hal$latitude[i] = address_geo$latitude
# Report progress
print(paste0(i,"/",nrow(hal),"; ",address_geo$address))
},error=function(e){
print(paste("Unable to geocode",hal$address[i]))
})
}
Inspect the results of OSM geocoding:
head(hal)
Drop observations with missing coordinates:
hal_osm = hal[which(!is.na(hal$longitude)),]
Convert results to sf
object and plot on a map:
hal_osm = sf::st_as_sf(hal_osm,
coords=c("longitude","latitude"),crs=4326)
plot(hal_osm["geometry"],axes=TRUE)
Method 2: Geocode the addresses using GeoNames gazetteer
Load gazetteer data:
gn = read.csv("Data/GeoNames/GeoNames_ADM2.csv")
Create common variables for matching:
gn$address_gn = tolower(paste0(gn$asciiname,", ",gn$admin1_code))
hal$address_gn = tolower(paste0(hal$County_full,", ",hal$State))
Rename OSM coordinates to avoid confusion:
hal$longitude_osm <- hal$longitude; hal$latitude_osm <- hal$latitude
hal$longitude <- hal$latitude <- NULL
Geocode addresses (i.e. join the datasets):
hal_gn = merge(x = hal, y = gn, by = "address_gn")
Inspect the results:
head(hal_gn)
Convert results to sf
object and plot on a map:
hal_gn = sf::st_as_sf(hal_gn,
coords=c("longitude","latitude"),crs=4326)
plot(hal_gn["geometry"],axes=TRUE)
Load 1920 US county boundaries:
e1920 = sf::read_sf("Data/Counties/e1920.geojson")
Plot overlay with geocoded lynchings
plot(e1920["geometry"],border="gray",reset=FALSE)
plot(hal_osm["geometry"],col="black",pch=1,add=TRUE)
plot(hal_gn["geometry"],col="black",pch=4,add=TRUE)
legend("bottomleft",pch=c(1,4),col=c("black","black"),
legend=c("OSM","GeoNames"),title="Geocoder",bty="n")
Point-in-polygon analysis
Overlay points objects (hal_*
) and polygons (e1920
)
o_osm = sf::st_intersects(x = e1920, y = hal_osm)
## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar
o_gn = sf::st_intersects(x = e1920, y = hal_gn)
## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar
Assign counts to new variables
e1920$lynchings_osm = lengths(o_osm)
e1920$lynchings_gn = lengths(o_gn)
Plot the results
plot(e1920["lynchings_osm"], main = "Lynchings (OSM)")
plot(e1920["lynchings_gn"], main = "Lynchings (GeoNames)")
Calculate new field (at least one lynching per county)
e1920$lynchings_osm_1 = 1*(e1920$lynchings_osm>0)
e1920$lynchings_gn_1 = 1*(e1920$lynchings_gn>0)
Number of counties with lynching, according to OSM vs. GeoNames
sum(e1920$lynchings_osm_1)
## [1] 89
sum(e1920$lynchings_gn_1)
## [1] 74
Extra credit for anyone who figures out the source of the discrepancy and (partially) fixes it. Hint: it has to do with pelicans and beignets.
Create box plot for OSM-geocoded data
boxplot(DemMajPct ~ lynchings_osm_1, data = e1920,
xlab = "Lynchings (OSM)", ylab = "Democrat vote share, 1920")
Create box plot for GeoNames-geocoded data
boxplot(DemMajPct ~ lynchings_gn_1, data = e1920,
xlab = "Lynchings (GeoNames)", ylab = "Democrat vote share, 1920")
Create red-to-blue color ramp for maps
ramp = colorRampPalette(c("red","blue"), space = "rgb")
Map the OSM-geocoded locations:
plot(e1920["DemMajPct"], pal=ramp(10), reset = FALSE,
main = "Lynchings (OSM geocoded) and Democratic vote share in 1920")
plot(hal_osm["geometry"], add=T, col="grey90", pch=1)
Map the GeoNames-geocoded locations:
plot(e1920["DemMajPct"], pal=ramp(10), reset = FALSE,
main = "Lynchings (GeoNames) and Democratic vote share in 1920")
plot(hal_gn["geometry"], add=T, col="grey90", pch=1)
Bonus exercise: ggplot2
So far, we’ve been using “base R” plotting functions only. But there is another way to visualize spatial data in R, using the ggplot2
package.
library(ggplot2)
The syntax is quite different. OSM example:
ggplot2::ggplot() +
ggplot2::geom_sf(data=e1920, ggplot2::aes(fill=DemMajPct),
color="white") +
ggplot2::geom_sf(data=hal_osm, color="black", shape=1, size=2) +
ggplot2::geom_sf(data=hal_osm, color="white", shape=16, size=1) +
ggplot2::scale_fill_gradientn(colors=ramp(10)) +
ggplot2::labs(
title="Lynchings (OSM geocoded) and Democratic vote share in 1920",
fill="Democratic Vote Share (%)") +
ggplot2::theme_minimal()
GeoNames example:
ggplot2::ggplot() +
ggplot2::geom_sf(data=e1920, ggplot2::aes(fill=DemMajPct),
color="white") +
ggplot2::geom_sf(data=hal_gn, color="black", shape=1, size=2) +
ggplot2::geom_sf(data=hal_gn, color="white", shape=16, size=1) +
ggplot2::scale_fill_gradientn(colors=ramp(10)) +
ggplot2::labs(
title="Lynchings (GeoNames) and Democratic vote share in 1920",
fill="Democratic Vote Share (%)"
) +
ggplot2::theme_minimal()
Problem Set 6
Your assignment (if using R):
boxplot_osm_R.png
(boxplot_gn_R.png
) and map_osm_R.png
(map_gn_R.png
)