Goal: integrate spatially-misaligned data

We’ll make this

You’ll make this

Scatterplots: plots displaying data as points, with one variable on the horizontal axis and another on the vertical axis

Scatterplots are useful to assess direction/strength of correlations

… and to identify outliers

Overview of lab exercise and problem set

Lab exercise
1. Use areal interpolation to estimate per capita crime in DC electoral districts
2. Create a scatterplot of electoral competitiveness and per capita crime in DC
Problem set
1. Use areal interpolation to estimate racial makeup of DC school zones
2. Create a scatterplot of racial disparities in school performance in DC

Use case: different data are collected for different spatial units

Elections @ single-member districts

Population counts @ census tracts

Solution: change the support of the data $\to$ single member districts

Count crimes in SMDs

Interpolate population to SMDs

Analyze relationship between electoral competitiveness and per capita crime in SMDs

Scatterplot

The problem set will investigate a similar use case

Scores @ school attendance zones

Demographics @ census tracts

Solution: change the support of the data $\to$ school attendance zones

Interpolate demographics to SAZs

Scatterplot

You can make these plots in QGIS or in R. Instructions for both are below.

Scatterplot 1 in R

Scatterplot 2 in R

We have four (and a half) sources of data:

Category	Type	Format	Data source
Single member districts	Vector (polygon)	`.geojson`	`DC OpenData`
ACS DC Census Tract	Vector (polygon)	`.geojson`	`DC OpenData`
Crime Incidents in 2024	Vector (point)	`.geojson`	`DC OpenData`
School Attendance Zones, Elementary	Vector (polygon)	`.geojson`	`DC OpenData`
$+$ School STAR Scores	Table (non-spatial)	`.csv`	`DC OpenData`

These are all in the PS07.zip file posted on Canvas.

Let’s open QGIS…

QGIS

Always save your progress!
Go to Project $\to$ Save As...

Save early, save often

Interpolation

Load the Single Member Districts file:
Layer $\to$ Add Layer $\to$ Add Vector Layer...

QGIS Screenshot

Navigate to the SMD_2022.geojson file in the Data/Elections directory:

QGIS Screenshot

The SMD_2022 layer should appear in you project window. These polygons are the lowest-level electoral units in DC (DC has 8 Wards, 46 Advisory Neighborhood Commissions, 345 SMDs)

QGIS Screenshot

Load the crime incidents vector data:
Navigate to the Crime_Incidents_in_2024.geojson file in Data/Crime

QGIS Screenshot

The Crime_Incidents_in_2024 layer should appear in the project window. These are points, representing the locations of reported crimes

QGIS Screenshot

The third set of data we need is the ACS_2020.geojson file in Data/ACS folder

QGIS Screenshot

This layer contains American Community Survey data on local demographics (including population counts, which we need to estimate per capita crime). These data are at the level of Census Tracts, the borders of which do not align with SMDs

QGIS Screenshot

Our first order of business is to construct the area weights we will be using for interpolation, starting with $a_j$ (area of destination polygons). Open the Attribute Table for SMD_2022 (destination layer) and open the Field Calculator

QGIS Screenshot

Create a new field called smd_area of type Decimal number (real).
Set the Expression to $area
This will give us the area calculation for destination polygons (in square meters)

QGIS Screenshot

The new field smd_area should appear in the Attribute Table for SMD_2022

QGIS Screenshot

Remember to save your layer edits after every operation like this! Right-click on SMD_2022 $\to$ Save Layer Edits (simply entering Ctrl-S or Cmd-S will save the project file, but not the underlying data layers we are editing)

QGIS Screenshot

Let’s also calculate the area of source polygons ($a_i$). Open the Field Calculator for ACS_2020 and create a new field called tract_area of type Decimal number (real). Set the Expression to $area

QGIS Screenshot

Let’s calculate the number of crimes in each SMD (the “nominator” of our future interpolated variable).
Go to Vector menu $\to$ Analysis Tools $\to$ Count Points in Polygon...

QGIS Screenshot

Select SMD_2022 as the Polygons layer and Crime_Incidents_in_2024 as the Points layer. Name the count field crimes_smd_2024 and save the output to a new file called SMD_2022_2.geojson. Click Run

QGIS Screenshot

You may see a No spatial index... warning message in the log, but this won’t affect the calculation (only the processing speed). Just be patient and wait for the algorithm to finish

QGIS Screenshot

The newly-created SMD_2022_2 layer should appear in the project window. You can hide the other two layers for now, if you like

QGIS Screenshot

Check the Attribute Table for SMD_2022_2 to make sure the crimes_smd_2024 variable is there

QGIS Screenshot

To facilitate a “whole vs. parts” comparison (i.e. transforming a variable directly versus reconstructing it from transformed components) we can do the same kind of points-in-polygons calculation with the ACS_2020 layer…

QGIS Screenshot

Same approach as before, except with ACS_2020 as polygon layer, crimes_tract_2024 as field name, and ACS_2020_2.geojson as output file

QGIS Screenshot

The new ACS_2020_2 layer should appear in the project window.
Let’s open its Attribute Table

QGIS Screenshot

Make sure the crimes_tract_2024 variable is there.
Then launch the Field Calculator

QGIS Screenshot

Calculate the “crimes per 1000 residents” variable for ACS_2020_2. Name the field crimes_1000_tract, set type to Decimal number (real), and set the Expression to crimes_tract_2024 / POP_TOTAL * 1000

The new crimes_1000_tract variable should appear in the Attribute Table

QGIS Screenshot

Remember to save your layer edits!

QGIS Screenshot

The next step is to calculate the area of each intersection between source and destination polygons, or $a_{i\cap j}$. To do this, we first need to create the intersection. Go to Vector menu $\to$ Geoprocessing Tools $\to$ Intersection...

QGIS Screenshot

For the intersection, select SMD_2022_2 as Input Layer, ACS_2020_2 as Overlay layer. Then click the [...] button next to “Overlay fields to keep”

QGIS Screenshot

On the next screen check the boxes next to

$\checkmark$ POP_TOTAL
$\checkmark$ crimes_tract_2024
$\checkmark$ crimes_1000_tract
$\checkmark$ tract_area

Click OK

QGIS Screenshot

Save the output file to SMD_ACS_intersection.geojson, click Run

QGIS Screenshot

The SMD_ACS_intersection layer should appear in your project window

QGIS Screenshot

Open the Attribute Table for SMD_ACS_intersection and launch the Field Calculator

Create a new field called area_weight of type Decimal Number (real).
Set Expression to $area / tract_area (this is equivalent to $w_{i\cap j}^{\text{(ext)}}=\frac{a_{i\cap j}}{a_i}$)

QGIS Screenshot

The area_weight field should appear in the Attribute Table.
Now go back to the Field Calculator, where we will create weighted population variables for destination polygons

QGIS Screenshot

Now let’s sum the weighted population numbers for each SMD. Create a new field called pop_tract2smd of type Decimal Number (real). Set Expression to sum(POP_TOTAL*area_weight,group_by:=SMD_ID)
(this is equivalent to $\sum_{i\cap j} w^{(ext)}_{i\cap j}x_{i\cap j}$)

QGIS Screenshot

This will take several minutes to compute (you can follow the progress bar at bottom, or use this time to stretch your legs)

QGIS Screenshot

(OPTIONAL) Create a new field called area_weight2 of type Decimal Number (real).
Set Expression to $area / smd_area (this is equivalent to $w_{i\cap j}^{\text{(int)}}=\frac{a_{i\cap j}}{a_j}$)

Now let’s do a weighted average (intensive) of crimes per 1000 residents. Create a new field called crimes_1000_tract2smd of type Decimal Number (real). Set Expression to sum(crimes_1000_tract*area_weight2,group_by:=SMD_ID)
(this is equivalent to $\sum_{i\cap j} w^{(int)}_{i\cap j}x_{i\cap j}$)

QGIS Screenshot

(OPTIONAL) This will also take a while to run. I even received a “program not responding” message, which I ignored (but check the progress bar to make sure the program is actually still running)

QGIS Screenshot

When it finishes running, you will see both new variables in the Attribute Table. Note that these values should be constant within SMDs (i.e. values will be repeated for intersections within each destination polygon)

QGIS Screenshot

After the algorithm finishes running is a good time to save your layer edits!

QGIS Screenshot

Our next step is to import these weighted values into the SMD_2022_2 layer, and divide crimes by population size. Right-click on SMD_2022_2 $\to$ Properties

QGIS Screenshot

Go to Joins tab in SMD_2022_2 Properties and click + button to add a new join

QGIS Screenshot

Set SMD_ACS_intersection as the “Join layer”, SMD_ID as the “Join” and “Target” field, and under “Joined fields” select $\checkmark$ pop_tract2smd and $\checkmark$ crimes_1000_tract2smd (if you created this variable, too). Check the box next to Custom field name prefix and clear the contents (i.e. no prefix). Click OK

QGIS Screenshot

You should now see the new join layer in Properties

QGIS Screenshot

Quality check! Let’s see if the sum of interpolated population values is close to the sum of this variable in the original data. Go to Vector menu $\to$ Vector Analysis $\to$ Basic Statistics for Fields

QGIS Screenshot

In the next window, set “Input layer” to SMD_2022_2 and “Field” to pop_tract2smd. Click Run

Make note of the value in the log after “SUM:”

QGIS Screenshot

Now do the same with “Input layer” ACS_2020 and “Field” POP_TOTAL

QGIS Screenshot

The population sums look numerically close (683,027 vs 683,154).
Interpolation was a success!

QGIS Screenshot

Last step (for interpolation): calculate number of crimes per 1000 residents! Go to the Attribute Table for SMD_2022_2 and open the Field Calculator

QGIS Screenshot

Create a new field called crimes_1000_smd of type Decimal number (real). Set Expression to crimes_smd_2024 / pop_tract2smd * 1000

QGIS Screenshot

(OPTIONAL) Whole vs. parts: compare the values for the field we just constructed (crimes_1000_smd) from components to the crimes_1000_tract2smd variable we interpolated whole-cloth.

QGIS Screenshot

Save the edits to the SMD_2022_2 layer!

QGIS Screenshot

To be extra-safe, let’s export the layer to a new geojson file (to preserve the join). Right-click SMD_2022_2 in layer menu $\to$ Export $\to$ Save Features As...

QGIS Screenshot

Save the layer as SMD_2022_3.geojson with Geometry type: Polygon. Check the box next to Extent (current: layer). Click OK

QGIS Screenshot

The new layer should appear in your project window.
Now we’re ready to make a scatterplot of electoral competitiveness and crime!

QGIS Screenshot

Scatterplot

To make the scatterplots, we will be using the DataPlotly plugin.
Go to View menu $\to$ Panels $\to$ $\checkmark$ DataPlotly

QGIS Screenshot

In the DataPlotly panel, set the following parameters:
Plot type: Scatter Plot; Layer: SMD_2022_3; X field: competitive_top1; Y field: crimes_1000_smd; Legend title: [empty]

QGIS Screenshot

In the Layout Options tab, set the following parameters:

Show legend: $\Box$ (off)
Plot title:
Electoral competitiveness and crime in DC
X label:
Top-1 competitiveness (2022)
Y label:
Crimes per 1000 residents (2024)
Click Create Plot

QGIS Screenshot

The scatterplot should appear on the next screen.

To save the scatterplot, click the Export as image button in lower-right corner

QGIS Screenshot

Name the file scatterplot_competitiveness_crime.png.
It should look something like this:

QGIS Screenshot

Problem Set 7

Your assignment (if using QGIS): create a scatterplot of school performance and race

perform areal interpolation:
- source polygons: ACS census tracts (ACS_2020.geojson)
  $x$ variables: neighborhood population size and racial makeup
- destination polygons: elementary school attendance zones (SAZ_Elementary.geojson)
  $y$ variables: STAR school performance scores
make and export a scatterplot:
- percent non-white on $x$-axis
- school STAR score on $y$-axis
- name the file scatterplot_race_schools.png
upload plot to Canvas

Can you make this?

Here’s a quick preview of the assignment.
Load the vector data file SAZ_Elementary.geojson from Data/Schools

QGIS Screenshot

To see how SAZ_Elementary overlays with census tracts, select a symbology with bright thick borders and a transparent inside, like outline_red

QGIS Screenshot

Because SAZs are (mostly) larger and fewer in number than tracts (74 vs 206), it seems more appropriate for SAZs to be the destination units

QGIS Screenshot

Let’s construct the area weights, starting with the area of destination polygons, $a_j$.
Recall that we already calculated the area of source polygons, $a_i$ (tract_area in ACS_2020). Go to the Field Calculator for SAZ_Elementary

QGIS Screenshot

Create a new field named saz_area of type Decimal number.
Set the expression to $area

QGIS Screenshot

Remember to save your layer edits regularly!

Now let’s intersect SAZ_Elementary with ACS_2020 to calculate intersection areas, $a_{i\cap j}$. Go to Vector menu $\to$ Geoprocessing Tools $\to$ Intersection...

QGIS Screenshot

For the intersection, select SAZ_Elementary as Input Layer, ACS_2020 as Overlay layer. Then click the [...] button next to “Overlay fields to keep”

QGIS Screenshot

On the next screen check the boxes next to $\checkmark$ POP_TOTAL, $\checkmark$ RACE_WHITE,
$\checkmark$ PCT_NONWHITE and $\checkmark$ tract_area. Click OK

QGIS Screenshot

Save the output file to SAZ_ACS_intersection.geojson, click Run

QGIS Screenshot

The intersection should appear in your project window

QGIS Screenshot

Launch the Field Calculator for SAZ_ACS_intersection

QGIS Screenshot

Create a new field area_weight of type Decimal number.
Set expression to $area / tract_area. Click OK

QGIS Screenshot

(OPTIONAL) For the intensive transformation, create a new field area_weight2 of type Decimal number. Set expression to $area / saz_area. Click OK

QGIS Screenshot

Create a new field pct_nonwhite_saz of type Decimal number. Set expression to
sum((POP_TOTAL - RACE_WHITE) * area_weight,group_by:=OBJECTID) /
sum(POP_TOTAL * area_weight,group_by:=OBJECTID) * 100

QGIS Screenshot

(OPTIONAL) Create a new field pct_nonwhite_tract2saz of type Decimal number. Set expression to
sum(PCT_NONWHITE * area_weight2,group_by:=OBJECTID)

QGIS Screenshot

Inspect the pct_nonwhite_saz variable in the Attribute Table.
Is it constant within SAZs? If so, we’re ready to join it to the SAZs

QGIS Screenshot

Now let’s import these weighted values into the SAZ_Elementary layer. Right-click on SAZ_Elementary $\to$ Properties

QGIS Screenshot

Go to Joins tab and click + to add a new join. Set SAZ_ACS_intersection as the “Join layer”, NAME as the “Join” and “Target” field, and under “Joined fields” select $\checkmark$ pct_nonwhite_saz and $\checkmark$ pct_nonwhite_tract2saz (if you created this variable, too). Check the box next to Custom field name prefix and clear the contents (i.e. no prefix). Click OK

QGIS Screenshot

You should now see the new join layer in Properties

QGIS Screenshot

Export the layer to a new geojson file (to preserve the join). Right-click SAZ_Elementary in layer menu $\to$ Export $\to$ Save Features As...

QGIS Screenshot

Save the layer as SAZ_Elementary_2.geojson with Geometry type: Polygon. Check the box next to Extent (current: layer). Click OK

QGIS Screenshot

In the DataPlotly panel, set the following parameters:
Plot type: Scatter Plot; Layer: SAZ_Elementary_2; X field: pct_nonwhite_saz; Y field: STAR_Score; Legend title: [empty]

QGIS Screenshot

In the Layout Options tab, set the following parameters:

Show legend: $\Box$ (off)
Plot title:
Race and school performance in DC
X label:
Percent non-white population
Y label:
School performance (STAR Score)
Click Create Plot

QGIS Screenshot

The scatterplot should appear on the next screen.

QGIS Screenshot

Export the image as scatterplot_race_schools.png.
It should look something like this:

QGIS Screenshot

R

Loading R packages

To implement these steps in R, we will be using the sf and SUNGEO packages

library(sf)
library(SUNGEO)

NOTE: The demo code for R is in ps07_demo.R on RStudio Cloud, and in PS07.zip (posted on Canvas).

Interpolation

Loading spatial data

Let’s load the single member district boundaries into R, using sf::read_sf():

smd_2022 = sf::read_sf("Data/Elections/SMD_2022.geojson")
plot(smd_2022["geometry"])

Now load the ACS census tracts into R:

acs_2020 = sf::read_sf("Data/ACS/ACS_2020.geojson")
plot(acs_2020["geometry"])

…and the crime incidents data:

crimes_2024 = sf::read_sf("Data/Crime/Crime_Incidents_in_2024.geojson")
plot(crimes_2024["geometry"])

Let’s calculate relative scale and nesting metrics for a change of support from acs_2020 (census tracts) to smd_2022 (single member districts)

SUNGEO::nesting(acs_2020,smd_2022,metrix=c("rn","rs"))

## $rn
## [1] 0.4594585
## 
## $rs
## [1] 0.1793275

Yikes! What if we switched to (smaller) census block groups as source units?

census_2020 = sf::read_sf("Data/Census/CENSUS_2020.geojson")
SUNGEO::nesting(census_2020,smd_2022,metrix=c("rn","rs"))

## $rn
## [1] 0.6907815
## 
## $rs
## [1] 0.8217758

Much better! But for consistency with QGIS, we’ll stick with the tracts here.

Let’s do some point-in-polygon analysis to count crimes per SMD:

smd_2022$crimes_smd_2024 = lengths(sf::st_intersects(smd_2022,
                                    crimes_2024))
plot(smd_2022["crimes_smd_2024"])

Let’s interpolate population counts per SMD (this is a one-step routine in R)

smd_2022$pop_tract2smd = sf::st_interpolate_aw(acs_2020["POP_TOTAL"],
                                  smd_2022,extensive=TRUE)$POP_TOTAL
plot(smd_2022["pop_tract2smd"])

Let’s check the quality of transformation: how close are the population counts?

sum(smd_2022$pop_tract2smd) # Interpolated

## [1] 683070.1

sum(acs_2020$POP_TOTAL)     # Original

## [1] 683154

Let’s calculate crimes per 1000 residents

smd_2022$crimes_1000_smd = smd_2022$crimes_smd_2024/
                            smd_2022$pop_tract2smd*1000
plot(smd_2022["crimes_1000_smd"])

Scatterplot

Let’s create the scatterplot

plot(x=smd_2022$competitive_top1,
  y=smd_2022$crimes_1000_smd,
  xlab="Top-1 competitiveness (2022)",
  ylab="Crimes per 1000 residents (2024)",
  main="Electoral competitiveness and crime in DC"
  )

Problem Set 7

Your assignment (if using R):

create a scatterplot of race and school performance in DC!
perform areal interpolation:
- source polygons: AFS_2020.geojson
- y variables: POP_TOTAL and RACE_WHITE
- destination polygons: SAZ_Elementary.geojson
- y variables: STAR_Score
make and export a scatterplot:
- percent non-white on $x$-axis
- school STAR score on $y$-axis
- name the file scatterplot_race_schools_R.png
upload map to Canvas
(by next Wednesday)

Can you make this?

API-231 / GIS-PubPol / Meeting 13 (Lab Exercise + Problem Set 7)

Yuri M. Zhukov

Visiting Associate Professor of Public Policy

Harvard Kennedy School

March 7, 2024

QGIS

Interpolation

Scatterplot

R

Interpolation

Scatterplot