Goal: integrate spatially-misaligned data

 

We’ll make this


You’ll make this


Scatterplots: plots displaying data as points, with one variable on the horizontal axis and another on the vertical axis

 

Scatterplots are useful to assess direction/strength of correlations


… and to identify outliers


Overview of lab exercise and problem set

 

  1. Lab exercise
    1. Use areal interpolation to estimate per capita crime in DC electoral districts
    2. Create a scatterplot of electoral competitiveness and per capita crime in DC
  2. Problem set
    1. Use areal interpolation to estimate racial makeup of DC school zones
    2. Create a scatterplot of racial disparities in school performance in DC

Use case: different data are collected for different spatial units

 

Elections @ single-member districts


Population counts @ census tracts


Solution: change the support of the data \(\to\) single member districts

 

Count crimes in SMDs


Interpolate population to SMDs


Analyze relationship between electoral competitiveness and per capita crime in SMDs

 

Scatterplot


The problem set will investigate a similar use case

 

Scores @ school attendance zones


Demographics @ census tracts


Solution: change the support of the data \(\to\) school attendance zones

 

Interpolate demographics to SAZs


Scatterplot


You can make these plots in QGIS or in R. Instructions for both are below.

Scatterplot 1 in R


Scatterplot 2 in R


We have four (and a half) sources of data:

Category Type Format Data source
Single member districts Vector (polygon) .geojson DC OpenData
ACS DC Census Tract Vector (polygon) .geojson DC OpenData
Crime Incidents in 2024 Vector (point) .geojson DC OpenData
School Attendance Zones, Elementary Vector (polygon) .geojson DC OpenData
\(+\) School STAR Scores Table (non-spatial) .csv DC OpenData

 

These are all in the PS07.zip file posted on Canvas.

 

Let’s open QGIS…

QGIS


Always save your progress!
Go to Project \(\to\) Save As...

Interpolation


Load the Single Member Districts file:
Layer \(\to\) Add Layer \(\to\) Add Vector Layer...


Navigate to the SMD_2022.geojson file in the Data/Elections directory:


The SMD_2022 layer should appear in you project window. These polygons are the lowest-level electoral units in DC (DC has 8 Wards, 46 Advisory Neighborhood Commissions, 345 SMDs)


Load the crime incidents vector data:
Navigate to the Crime_Incidents_in_2024.geojson file in Data/Crime


The Crime_Incidents_in_2024 layer should appear in the project window. These are points, representing the locations of reported crimes


The third set of data we need is the ACS_2020.geojson file in Data/ACS folder


This layer contains American Community Survey data on local demographics (including population counts, which we need to estimate per capita crime). These data are at the level of Census Tracts, the borders of which do not align with SMDs


Our first order of business is to construct the area weights we will be using for interpolation, starting with \(a_j\) (area of destination polygons). Open the Attribute Table for SMD_2022 (destination layer) and open the Field Calculator


Create a new field called smd_area of type Decimal number (real).
Set the Expression to $area
This will give us the area calculation for destination polygons (in square meters)


The new field smd_area should appear in the Attribute Table for SMD_2022


Remember to save your layer edits after every operation like this! Right-click on SMD_2022 \(\to\) Save Layer Edits (simply entering Ctrl-S or Cmd-S will save the project file, but not the underlying data layers we are editing)


Let’s also calculate the area of source polygons (\(a_i\)). Open the Field Calculator for ACS_2020 and create a new field called tract_area of type Decimal number (real). Set the Expression to $area


Let’s calculate the number of crimes in each SMD (the “nominator” of our future interpolated variable).
Go to Vector menu \(\to\) Analysis Tools \(\to\) Count Points in Polygon...


Select SMD_2022 as the Polygons layer and Crime_Incidents_in_2024 as the Points layer. Name the count field crimes_smd_2024 and save the output to a new file called SMD_2022_2.geojson. Click Run


You may see a No spatial index... warning message in the log, but this won’t affect the calculation (only the processing speed). Just be patient and wait for the algorithm to finish


The newly-created SMD_2022_2 layer should appear in the project window. You can hide the other two layers for now, if you like


Check the Attribute Table for SMD_2022_2 to make sure the crimes_smd_2024 variable is there


To facilitate a “whole vs. parts” comparison (i.e. transforming a variable directly versus reconstructing it from transformed components) we can do the same kind of points-in-polygons calculation with the ACS_2020 layer…


Same approach as before, except with ACS_2020 as polygon layer, crimes_tract_2024 as field name, and ACS_2020_2.geojson as output file


The new ACS_2020_2 layer should appear in the project window.
Let’s open its Attribute Table


Make sure the crimes_tract_2024 variable is there.
Then launch the Field Calculator


Calculate the “crimes per 1000 residents” variable for ACS_2020_2. Name the field crimes_1000_tract, set type to Decimal number (real), and set the Expression to crimes_tract_2024 / POP_TOTAL * 1000


The new crimes_1000_tract variable should appear in the Attribute Table


Remember to save your layer edits!


The next step is to calculate the area of each intersection between source and destination polygons, or \(a_{i\cap j}\). To do this, we first need to create the intersection. Go to Vector menu \(\to\) Geoprocessing Tools \(\to\) Intersection...


For the intersection, select SMD_2022_2 as Input Layer, ACS_2020_2 as Overlay layer. Then click the [...] button next to “Overlay fields to keep”


On the next screen check the boxes next to

  • \(\checkmark\) POP_TOTAL
  • \(\checkmark\) crimes_tract_2024
  • \(\checkmark\) crimes_1000_tract
  • \(\checkmark\) tract_area

Click OK



Save the output file to SMD_ACS_intersection.geojson, click Run


The SMD_ACS_intersection layer should appear in your project window


Open the Attribute Table for SMD_ACS_intersection and launch the Field Calculator


Create a new field called area_weight of type Decimal Number (real).
Set Expression to $area / tract_area (this is equivalent to \(w_{i\cap j}^{\text{(ext)}}=\frac{a_{i\cap j}}{a_i}\))


The area_weight field should appear in the Attribute Table.
Now go back to the Field Calculator, where we will create weighted population variables for destination polygons


Now let’s sum the weighted population numbers for each SMD. Create a new field called pop_tract2smd of type Decimal Number (real). Set Expression to sum(POP_TOTAL*area_weight,group_by:=SMD_ID)
(this is equivalent to \(\sum_{i\cap j} w^{(ext)}_{i\cap j}x_{i\cap j}\))


This will take several minutes to compute (you can follow the progress bar at bottom, or use this time to stretch your legs)


(OPTIONAL) Create a new field called area_weight2 of type Decimal Number (real).
Set Expression to $area / smd_area (this is equivalent to \(w_{i\cap j}^{\text{(int)}}=\frac{a_{i\cap j}}{a_j}\))


Now let’s do a weighted average (intensive) of crimes per 1000 residents. Create a new field called crimes_1000_tract2smd of type Decimal Number (real). Set Expression to sum(crimes_1000_tract*area_weight2,group_by:=SMD_ID)
(this is equivalent to \(\sum_{i\cap j} w^{(int)}_{i\cap j}x_{i\cap j}\))


(OPTIONAL) This will also take a while to run. I even received a “program not responding” message, which I ignored (but check the progress bar to make sure the program is actually still running)


When it finishes running, you will see both new variables in the Attribute Table. Note that these values should be constant within SMDs (i.e. values will be repeated for intersections within each destination polygon)


After the algorithm finishes running is a good time to save your layer edits!


Our next step is to import these weighted values into the SMD_2022_2 layer, and divide crimes by population size. Right-click on SMD_2022_2 \(\to\) Properties


Go to Joins tab in SMD_2022_2 Properties and click + button to add a new join


Set SMD_ACS_intersection as the “Join layer”, SMD_ID as the “Join” and “Target” field, and under “Joined fields” select \(\checkmark\) pop_tract2smd and \(\checkmark\) crimes_1000_tract2smd (if you created this variable, too). Check the box next to Custom field name prefix and clear the contents (i.e. no prefix). Click OK


You should now see the new join layer in Properties


Quality check! Let’s see if the sum of interpolated population values is close to the sum of this variable in the original data. Go to Vector menu \(\to\) Vector Analysis \(\to\) Basic Statistics for Fields


In the next window, set “Input layer” to SMD_2022_2 and “Field” to pop_tract2smd. Click Run


Make note of the value in the log after “SUM:”


Now do the same with “Input layer” ACS_2020 and “Field” POP_TOTAL


The population sums look numerically close (683,027 vs 683,154).
Interpolation was a success!


Last step (for interpolation): calculate number of crimes per 1000 residents! Go to the Attribute Table for SMD_2022_2 and open the Field Calculator


Create a new field called crimes_1000_smd of type Decimal number (real). Set Expression to crimes_smd_2024 / pop_tract2smd * 1000


(OPTIONAL) Whole vs. parts: compare the values for the field we just constructed (crimes_1000_smd) from components to the crimes_1000_tract2smd variable we interpolated whole-cloth.


Save the edits to the SMD_2022_2 layer!


To be extra-safe, let’s export the layer to a new geojson file (to preserve the join). Right-click SMD_2022_2 in layer menu \(\to\) Export \(\to\) Save Features As...


Save the layer as SMD_2022_3.geojson with Geometry type: Polygon. Check the box next to Extent (current: layer). Click OK


The new layer should appear in your project window.
Now we’re ready to make a scatterplot of electoral competitiveness and crime!

Scatterplot


To make the scatterplots, we will be using the DataPlotly plugin.
Go to View menu \(\to\) Panels \(\to\) \(\checkmark\) DataPlotly


In the DataPlotly panel, set the following parameters:
Plot type: Scatter Plot; Layer: SMD_2022_3; X field: competitive_top1; Y field: crimes_1000_smd; Legend title: [empty]


In the Layout Options tab, set the following parameters:

  • Show legend: \(\Box\) (off)
  • Plot title:
    Electoral competitiveness and crime in DC
  • X label:
    Top-1 competitiveness (2022)
  • Y label:
    Crimes per 1000 residents (2024)
  • Click Create Plot



The scatterplot should appear on the next screen.


To save the scatterplot, click the Export as image button in lower-right corner


Name the file scatterplot_competitiveness_crime.png.
It should look something like this:


Problem Set 7

Your assignment (if using QGIS): create a scatterplot of school performance and race

  • perform areal interpolation:
    • source polygons: ACS census tracts (ACS_2020.geojson)
      \(x\) variables: neighborhood population size and racial makeup
    • destination polygons: elementary school attendance zones (SAZ_Elementary.geojson)
      \(y\) variables: STAR school performance scores
  • make and export a scatterplot:
    • percent non-white on \(x\)-axis
    • school STAR score on \(y\)-axis
    • name the file scatterplot_race_schools.png
  • upload plot to Canvas


Can you make this?


Here’s a quick preview of the assignment.
Load the vector data file SAZ_Elementary.geojson from Data/Schools


To see how SAZ_Elementary overlays with census tracts, select a symbology with bright thick borders and a transparent inside, like outline_red


Because SAZs are (mostly) larger and fewer in number than tracts (74 vs 206), it seems more appropriate for SAZs to be the destination units


Let’s construct the area weights, starting with the area of destination polygons, \(a_j\).
Recall that we already calculated the area of source polygons, \(a_i\) (tract_area in ACS_2020). Go to the Field Calculator for SAZ_Elementary


Create a new field named saz_area of type Decimal number.
Set the expression to $area


Remember to save your layer edits regularly!


Now let’s intersect SAZ_Elementary with ACS_2020 to calculate intersection areas, \(a_{i\cap j}\). Go to Vector menu \(\to\) Geoprocessing Tools \(\to\) Intersection...


For the intersection, select SAZ_Elementary as Input Layer, ACS_2020 as Overlay layer. Then click the [...] button next to “Overlay fields to keep”


On the next screen check the boxes next to \(\checkmark\) POP_TOTAL, \(\checkmark\) RACE_WHITE,
\(\checkmark\) PCT_NONWHITE and \(\checkmark\) tract_area. Click OK


Save the output file to SAZ_ACS_intersection.geojson, click Run


The intersection should appear in your project window


Launch the Field Calculator for SAZ_ACS_intersection


Create a new field area_weight of type Decimal number.
Set expression to $area / tract_area. Click OK


(OPTIONAL) For the intensive transformation, create a new field area_weight2 of type Decimal number. Set expression to $area / saz_area. Click OK


Create a new field pct_nonwhite_saz of type Decimal number. Set expression to
sum((POP_TOTAL - RACE_WHITE) * area_weight,group_by:=OBJECTID) /
sum(POP_TOTAL * area_weight,group_by:=OBJECTID) * 100


(OPTIONAL) Create a new field pct_nonwhite_tract2saz of type Decimal number. Set expression to
sum(PCT_NONWHITE * area_weight2,group_by:=OBJECTID)


Inspect the pct_nonwhite_saz variable in the Attribute Table.
Is it constant within SAZs? If so, we’re ready to join it to the SAZs


Now let’s import these weighted values into the SAZ_Elementary layer. Right-click on SAZ_Elementary \(\to\) Properties


Go to Joins tab and click + to add a new join. Set SAZ_ACS_intersection as the “Join layer”, NAME as the “Join” and “Target” field, and under “Joined fields” select \(\checkmark\) pct_nonwhite_saz and \(\checkmark\) pct_nonwhite_tract2saz (if you created this variable, too). Check the box next to Custom field name prefix and clear the contents (i.e. no prefix). Click OK


You should now see the new join layer in Properties


Export the layer to a new geojson file (to preserve the join). Right-click SAZ_Elementary in layer menu \(\to\) Export \(\to\) Save Features As...


Save the layer as SAZ_Elementary_2.geojson with Geometry type: Polygon. Check the box next to Extent (current: layer). Click OK


In the DataPlotly panel, set the following parameters:
Plot type: Scatter Plot; Layer: SAZ_Elementary_2; X field: pct_nonwhite_saz; Y field: STAR_Score; Legend title: [empty]


In the Layout Options tab, set the following parameters:

  • Show legend: \(\Box\) (off)
  • Plot title:
    Race and school performance in DC
  • X label:
    Percent non-white population
  • Y label:
    School performance (STAR Score)
  • Click Create Plot



The scatterplot should appear on the next screen.


Export the image as scatterplot_race_schools.png.
It should look something like this:

R


Loading R packages

 

To implement these steps in R, we will be using the sf and SUNGEO packages

library(sf)
library(SUNGEO)

NOTE: The demo code for R is in ps07_demo.R on RStudio Cloud, and in PS07.zip (posted on Canvas).

Interpolation


Loading spatial data

 

Let’s load the single member district boundaries into R, using sf::read_sf():

smd_2022 = sf::read_sf("Data/Elections/SMD_2022.geojson")
plot(smd_2022["geometry"]) 


Now load the ACS census tracts into R:

acs_2020 = sf::read_sf("Data/ACS/ACS_2020.geojson")
plot(acs_2020["geometry"]) 


…and the crime incidents data:

crimes_2024 = sf::read_sf("Data/Crime/Crime_Incidents_in_2024.geojson")
plot(crimes_2024["geometry"]) 


Let’s calculate relative scale and nesting metrics for a change of support from acs_2020 (census tracts) to smd_2022 (single member districts)

SUNGEO::nesting(acs_2020,smd_2022,metrix=c("rn","rs"))
## $rn
## [1] 0.4594585
## 
## $rs
## [1] 0.1793275

 

Yikes! What if we switched to (smaller) census block groups as source units?

census_2020 = sf::read_sf("Data/Census/CENSUS_2020.geojson")
SUNGEO::nesting(census_2020,smd_2022,metrix=c("rn","rs"))
## $rn
## [1] 0.6907815
## 
## $rs
## [1] 0.8217758

Much better! But for consistency with QGIS, we’ll stick with the tracts here.


Let’s do some point-in-polygon analysis to count crimes per SMD:

smd_2022$crimes_smd_2024 = lengths(sf::st_intersects(smd_2022,
                                    crimes_2024))
plot(smd_2022["crimes_smd_2024"]) 


Let’s interpolate population counts per SMD (this is a one-step routine in R)

smd_2022$pop_tract2smd = sf::st_interpolate_aw(acs_2020["POP_TOTAL"],
                                  smd_2022,extensive=TRUE)$POP_TOTAL
plot(smd_2022["pop_tract2smd"]) 


Let’s check the quality of transformation: how close are the population counts?

sum(smd_2022$pop_tract2smd) # Interpolated
## [1] 683070.1
sum(acs_2020$POP_TOTAL)     # Original
## [1] 683154

Let’s calculate crimes per 1000 residents

smd_2022$crimes_1000_smd = smd_2022$crimes_smd_2024/
                            smd_2022$pop_tract2smd*1000
plot(smd_2022["crimes_1000_smd"]) 

Scatterplot


Let’s create the scatterplot

plot(x=smd_2022$competitive_top1,
  y=smd_2022$crimes_1000_smd,
  xlab="Top-1 competitiveness (2022)",
  ylab="Crimes per 1000 residents (2024)",
  main="Electoral competitiveness and crime in DC"
  )


Problem Set 7

Your assignment (if using R):

  • create a scatterplot of race and school performance in DC!
  • perform areal interpolation:
    • source polygons: AFS_2020.geojson
    • y variables: POP_TOTAL and RACE_WHITE
    • destination polygons: SAZ_Elementary.geojson
    • y variables: STAR_Score
  • make and export a scatterplot:
    • percent non-white on \(x\)-axis
    • school STAR score on \(y\)-axis
    • name the file scatterplot_race_schools_R.png
  • upload map to Canvas
    (by next Wednesday)


Can you make this?