Goal: geocode historical lynching locations in U.S.

 

We will geocode these data


Overview of lab exercise and problem set

 

  1. Lab exercise
    1. Geocode lynching data sample
    2. Point-in-polygon analysis (lynchings per county)
  2. Problem set
    1. Create a map and boxplot, showing relationship between 100 lynching locations and 1920 U.S. Presidential election results

We will use 2 methods to geocode these lynching locations:

 

Project HAL raw data


Method 1: Geocode using web service/API (OpenStreetMap)

OSM Nominatum


Method 2: Geocode offline, with gazetteer data (GeoNames)

GeoNames


You will then do some point-in-polygon analysis to see how lynchings (as part of broader voter suppression efforts) may have impacted vote share in 1920

1920 Presidential elections results


Your assignment: create (1) boxplot of vote share against lynchings, (2) map of geocoded lynching locations vs. election results

Boxplot


 

 

 

Map


You can make these plots in QGIS or in R. Instructions for both are below.

Boxplot in R


 

 

Map in R


We have three (and a half) sources of data:

Category Type Format Data source
Lynchings (sample) Table (non-geo) .csv Project HAL
County gazetteer Table .csv GeoNames
1920 county borders Vector (polygons) .geojson Newberry Library
\(+\) 1920 election results CQ Voting/Elections

 

These are all in the Lab06PS06.zip file posted on Canvas.

 

Let’s open QGIS…

QGIS


Always save your progress!
Go to Project \(\to\) Save As...

Geocoding


Load the Project HAL data:
Layer \(\to\) Add Layer \(\to\) Add Delimited Text Layer...


Open the file HAL_sample.csv in the HAL folder.
Geometry definition should be set to No geometry. Click Add `


You should now see HAL_sample in your layer menu. But there are no points on map, because the data are not geocoded.


Open the Attribute Table for HAL_sample, navigate to the Field Calculator


Create new field, Address, of type Text (string).
For the Expression, write concat(County_full,', ',State).
The Preview should show a county-state ID. Click OK


The new field should appear in the Attribute Table


To access the OSM/Nominatum geocoder, go to Processing \(\to\) Toolbox


In the Processing Toolbox panel, go to Vector general menu \(\to\) Batch Nominatum geocoder


Set Input layer \(=\) HAL_sample and Address field \(=\) Address.
Save the output to a file called hal_osm.geojson. Click Run


After running, the log may tell you that some addresses could not be geocoded.
Click Close


You should see the geocoded points in the main project window, and a new layer, hal\_osm. Let’s plot it against an historical county boundary shapefile.


Go to Layer \(\to\) Add Layer \(\to\) Add Vector Layer....
Navigate to e1920.geojson file in Data/Counties folder. Add it to the map.


Looks about right. Now let’s try geocoding with gazetteer data.


Add the GeoNames gazetteer to the project, using Add Delimited Text Layer.... Load GeoNames_ADM2.csv from the Data/GeoNames folder.
Set X field \(=\) longitude and Y field \(=\) latitude


The centroids for (contemporary) counties should appear in the project window.


In the Attribute Table, we see that

  • asciiname field in GeoNames_ADM2 corresponds to County_full in HAL_sample
  • admin1_code corresponds to State

We can use these to create a unique key to match on


Open the Field Calculator for GeoNames_ADM2.
Create a new variable called address_gn of type Text (string).
Set Expression to lower(concat(asciiname,', ',admin1_code)).
The Preview should show a lower case county-state ID. Click OK


Now open Field Calculator for HAL_sample.
Create the same new field, address_gn, of type Text (string).
Set Expression to lower(concat(County_full,', ',State)). Click OK


Now we are ready to join these layers. Double-click HAL_sample layer to bring up the Properties window. Open the Joins tab, and click the + button.


Set Join layer \(=\) GeoNames_ADM2, with address_gn as the join and target field.
Check the box next to Custom Field Name Prefix and enter gn_ in the box. Click OK.


The new join should appear in the Joins tab


If you click on the Fields tab, you should see the new fields appended to HAL_sample


You can also open the Attribute Table for HAL_sample. Do you see the coordinates?


To display the geocoded locations, we need to export the joined file and re-import it.
Right-click on HAL_sample and go to Export \(\to\) Save Features As....


Save the file as hal_gn.csv (comma separated values)


(Re-)import hal_gn.csv, using Add Delimited Text Layer....
Set X field \(=\) gn_longitude and Y field \(=\) gn_latitude.
Set CRS \(=\) EPSG:4326


The new geocoded locations should appear on the project window.
Now let’s do some point-in-polygon analysis!

Boxplot


Navigate to Vector menu \(\to\) Analysis Tools \(\to\) Count Points in Polygon.
Select Polygons \(=\) e1920, Points \(=\) hal_osm. Name the count field lynch_osm, and save the output to e1920_osm.geojson. Click Run


You may see a warning about “No spatial index exists…”. You can ignore it here


Repeat this process with hal_gn as the points layer. Name the count field lynch_gn, and save the output as e1920_gn.geojson


The two new layers e1920_osm and e1920_gn should appear in the project window


We can try plotting these new count variables through Layer Properties \(\to\) Symbology. You will notice right away that there are not many unique values.
I used Natural Breaks with 3 classes, but you can try other options.


The resulting distribution should look something like this


Let’s create “dummy” variables indicating whether at least one lynching occurred in each county. Open Field Calculator for e1920_gn, create new field, Lynchings of type Integer, with Expression set to lynch_gn > 0. Click OK


Do the same for e1920_osm, with Expression set to lynch_osm > 0


When finished, right-click on e1920_osm and e1920_gn, uncheck Toggle Editing


Save your changes when prompted. Do this for both layers


Let’s examine the relationship between lynching locations and local electoral preferences. Plot the variable DemMajPct in e1920_osm, with a red-to-blue color ramp (Equal Count with 5 classes)


The lynching locations appear to be mostly in southern Democratic Party strongholds, but let’s look at this more systematically


QGIS supports several basic statistical plotting functions. To find them, click on the Processing Toolbox button. Or go to Processing menu \(\to\) Toolbox


In the Toolbox panel, go to Plots \(\to\) Box plot


Set Input layer \(=\) e1920_osm; Category name field \(=\) Lynchings; Value field \(=\) DemMajPct; Additional Statistic Lines \(=\) Show mean

Click on the ... next to Box plot \(\to\) Save to File...


Save the file as boxplot_osm.html in the Output folder


Click Run


Check to make sure the process did not return any errors


Create another box plot for GeoNames-geocoded e1920_gn, with similar parameters


Open the boxplot_osm.html file in a web browser. The average Democratic vote share was indeed significantly higher for counties with lynchings (right) than counties without lynchings (left)


Hover your mouse over the boxes to see the summary statistics. In counties with lynchings the Democratic vote share was 69.6 percent, on average (median of 71.04).


In counties without lynchings the Democratic vote share was 45.9 percent, on average (median of 40.5).


To export the box plot as a .png file, click on “Camera” icon in upper-right corner.


Name the file boxplot_osm.png


Repeat this process for GeoNames-geocoded e1920_gn


Problem Set 6

 

Your assignment (if using QGIS): create a map to go along with the box plot

  • make and export a box plot, as just demonstrated
    • name the file boxplot_osm.png or boxplot_gn.png
      (depending on which geocoder you’re using)
  • create and export a new map layout, showing:
    • points: lynching locations (either OSM- or GeoNames-geocoded)
    • polygons: counties colored by Democratic vote share (DemMajPct)
    • legend (only the layers you’re using should be on legend)
    • scale bar (optional)
  • name the file map_osm.png or map_gn.png
    (depending on which geocoder you’re using)
  • upload map and accompanying box plot (2 files total!) to Canvas
  • AGAIN: you can do this with either OpenStreetMap or GeoNames, you are not required to do both

Either this:

OSM boxplot


 

 

OSM map


… or this:

GeoNames boxplot


 

 

GeoNames map

R


Loading R packages

 

To implement these steps in R, we will be using the sf package, and two others (RCurl, jsonlite) that help R compose HTTP requests and process the results returned by online servers:

library(sf)
library(RCurl)
library(jsonlite)

NOTE: The code to produce the maps and boxplots in R is in ps06_demo.R on RStudio Cloud, and in Lab06PS06.zip (posted on Canvas).

Geocoding


As with QGIS, we will geocode in R using two methods:

  1. Online, using a web service (OSM/Nominatum API)

  2. Offline, using gazetteer data (GeoNames)


Method 1: Geocode the addresses using OSM/Nominatum

 

Step 1: define a function url_geo() that sends queries to OSM/Nominatum, and returns geographic information from server:

url_geo = function(query, return.call = "json", sensor = "false") {
  root = "https://nominatim.openstreetmap.org/search?q="
  sfxx = "&format=json&polygon=1&addressdetails=1"
  u = paste(root, query, sfxx,sep = "")
  return(URLencode(u))
}

Step 2: define a wrapper function geoCode_OSM(), that sends the query through url_geo() and parses the result:

geoCode_OSM = function(query,match.num=1){
  address=NA; longitude=NA; latitude=NA
  u = url_geo(query)
  doc = RCurl::getURL(u,httpheader = c('User-Agent' = "contact info"))
  if(nchar(doc)>2){
  dat = jsonlite::fromJSON(doc)
  if(nrow(dat)>0){
    address = dat$display_name[match.num]
    longitude = as.numeric(as.character(dat$lon[match.num]))
    latitude = as.numeric(as.character(dat$lat[match.num]))
  }}
  return(data.frame(
    address=address,longitude=longitude,latitude=latitude
    ))
}

Let’s test the geocoding function!

geoCode_OSM("37th and O St NW, Washington, DC")
geoCode_OSM("Georgetown University")

geoCode_OSM("Georgetown")
geoCode_OSM("Georgetown", match.num=2)
geoCode_OSM("Georgetown", match.num=3)

Load, pre-process Project HAL data

 

Load the tabular dataset using read.csv(), and preview the first few rows:

hal = read.csv("Data/HAL/HAL_sample.csv")
head(hal) 

Create new field for “county, state” and placeholders for coordinates

hal$address = paste0(hal$County_full,", ",hal$State)
hal$longitude = NA
hal$latitude = NA

 

To geocode as a batch processing routine, we will write a for() loop, which runs the geoCode_OSM() function for each address in hal and stores the result:

for(i in 1:nrow(hal)){
  # Skip past errors
  tryCatch({
    address_geo = geoCode_OSM(hal$address[i])
    # Add coordinates to dataset
    hal$longitude[i] = address_geo$longitude
    hal$latitude[i] = address_geo$latitude
    # Report progress
    print(paste0(i,"/",nrow(hal),"; ",address_geo$address))
  },error=function(e){
    print(paste("Unable to geocode",hal$address[i]))
  })
}

Inspect the results of OSM geocoding:

head(hal)

Drop observations with missing coordinates:

hal_osm = hal[which(!is.na(hal$longitude)),]

Convert results to sf object and plot on a map:

hal_osm = sf::st_as_sf(hal_osm, 
                  coords=c("longitude","latitude"),crs=4326)
plot(hal_osm["geometry"],axes=TRUE)


Method 2: Geocode the addresses using GeoNames gazetteer

 

Load gazetteer data:

gn = read.csv("Data/GeoNames/GeoNames_ADM2.csv")

 

Create common variables for matching:

gn$address_gn = tolower(paste0(gn$asciiname,", ",gn$admin1_code))
hal$address_gn = tolower(paste0(hal$County_full,", ",hal$State))

Rename OSM coordinates to avoid confusion:

hal$longitude_osm <- hal$longitude; hal$latitude_osm <- hal$latitude
hal$longitude <- hal$latitude <- NULL

Geocode addresses (i.e. join the datasets):

hal_gn = merge(x = hal, y = gn, by = "address_gn")

Inspect the results:

head(hal_gn)

Convert results to sf object and plot on a map:

hal_gn = sf::st_as_sf(hal_gn, 
                  coords=c("longitude","latitude"),crs=4326)
plot(hal_gn["geometry"],axes=TRUE)

Boxplot


Load 1920 US county boundaries:

e1920 = sf::read_sf("Data/Counties/e1920.geojson")

Plot overlay with geocoded lynchings

plot(e1920["geometry"],border="gray",reset=FALSE) 
plot(hal_osm["geometry"],col="black",pch=1,add=TRUE)
plot(hal_gn["geometry"],col="black",pch=4,add=TRUE)
legend("bottomleft",pch=c(1,4),col=c("black","black"),
  legend=c("OSM","GeoNames"),title="Geocoder",bty="n")


Point-in-polygon analysis

Overlay points objects (hal_*) and polygons (e1920)

o_osm = sf::st_intersects(x = e1920, y = hal_osm)
o_gn = sf::st_intersects(x = e1920, y = hal_gn)

Assign counts to new variables

e1920$lynchings_osm = lengths(o_osm)
e1920$lynchings_gn = lengths(o_gn)

Plot the results

plot(e1920["lynchings_osm"], main = "Lynchings (OSM)")
plot(e1920["lynchings_gn"], main = "Lynchings (GeoNames)")



Calculate new field (at least one lynching per county)

e1920$lynchings_osm_1 = 1*(e1920$lynchings_osm>0)
e1920$lynchings_gn_1 = 1*(e1920$lynchings_gn>0)

Number of counties with lynching, according to OSM vs. GeoNames

sum(e1920$lynchings_osm_1)
## [1] 89
sum(e1920$lynchings_gn_1)
## [1] 74

Extra credit for anyone who figures out the source of the discrepancy and (partially) fixes it. Hint: it has to do with pelicans and beignets.


Create box plot for OSM-geocoded data

boxplot(DemMajPct ~ lynchings_osm_1, data = e1920, 
  xlab = "Lynchings (OSM)", ylab = "Democrat vote share, 1920")


Create box plot for GeoNames-geocoded data

boxplot(DemMajPct ~ lynchings_gn_1, data = e1920, 
  xlab = "Lynchings (GeoNames)", ylab = "Democrat vote share, 1920")


Create red-to-blue color ramp for maps

ramp = colorRampPalette(c("red","blue"), space = "rgb") 

Map the OSM-geocoded locations:

plot(e1920["DemMajPct"], pal=ramp(10), reset = FALSE, 
  main = "Lynchings (OSM geocoded) and Democratic vote share in 1920")
plot(hal_osm["geometry"], add=T, col="grey90", pch=1)

 


Map the GeoNames-geocoded locations:

plot(e1920["DemMajPct"], pal=ramp(10), reset = FALSE, 
  main = "Lynchings (GeoNames) and Democratic vote share in 1920")
plot(hal_gn["geometry"], add=T, col="grey90", pch=1)

 


Bonus exercise: ggplot2

So far, we’ve been using “base R” plotting functions only. But there is another way to visualize spatial data in R, using the ggplot2 package.

library(ggplot2)

The syntax is quite different. OSM example:

ggplot2::ggplot() +
  ggplot2::geom_sf(data=e1920, ggplot2::aes(fill=DemMajPct), 
    color="white") +
  ggplot2::geom_sf(data=hal_osm, color="black", shape=1, size=2) +
  ggplot2::geom_sf(data=hal_osm, color="white", shape=16, size=1) +
  ggplot2::scale_fill_gradientn(colors=ramp(10)) +
  ggplot2::labs(
    title="Lynchings (OSM geocoded) and Democratic vote share in 1920",
    fill="Democratic Vote Share (%)") +
  ggplot2::theme_minimal()


GeoNames example:

ggplot2::ggplot() +
  ggplot2::geom_sf(data=e1920, ggplot2::aes(fill=DemMajPct), 
    color="white") +
  ggplot2::geom_sf(data=hal_gn, color="black", shape=1, size=2) +
  ggplot2::geom_sf(data=hal_gn, color="white", shape=16, size=1) +
  ggplot2::scale_fill_gradientn(colors=ramp(10)) +
  ggplot2::labs(
    title="Lynchings (GeoNames) and Democratic vote share in 1920",
    fill="Democratic Vote Share (%)"
  ) +
  ggplot2::theme_minimal()


Problem Set 6

 

Your assignment (if using R):

  • create one of these boxplots (for OSM or GeoNames)
  • … and one of these maps (for OSM or GeoNames, base or ggplot2)
  • save the graphics as boxplot_osm_R.png (boxplot_gn_R.png) and map_osm_R.png (map_gn_R.png)
  • upload these files to Canvas (by next Wednesday)