Goal: integrate spatially-misaligned data
We’ll make this
You’ll make this
Scatterplots: plots displaying data as points, with one variable on the horizontal axis and another on the vertical axis
Scatterplots are useful to assess direction/strength of correlations
… and to identify outliers
Overview of lab exercise and problem set
Use case: different data are collected for different spatial units
Elections @ single-member districts
Population counts @ census tracts
Solution: change the support of the data \(\to\) single member districts
Count crimes in SMDs
Interpolate population to SMDs
Analyze relationship between electoral competitiveness and per capita crime in SMDs
Scatterplot
The problem set will investigate a similar use case
Scores @ school attendance zones
Demographics @ census tracts
Solution: change the support of the data \(\to\) school attendance zones
Interpolate demographics to SAZs
Scatterplot
You can make these plots in QGIS or in R. Instructions for both are below.
Scatterplot 1 in R
Scatterplot 2 in R
We have four (and a half) sources of data:
Category | Type | Format | Data source |
---|---|---|---|
Single member districts | Vector (polygon) | .geojson |
DC OpenData |
ACS DC Census Tract | Vector (polygon) | .geojson |
DC OpenData |
Crime Incidents in 2024 | Vector (point) | .geojson |
DC OpenData |
School Attendance Zones, Elementary | Vector (polygon) | .geojson |
DC OpenData |
\(+\) School STAR Scores | Table (non-spatial) | .csv |
DC OpenData |
These are all in the PS07.zip
file posted on Canvas.
Let’s open QGIS…
Always save your progress!
Go to Project
\(\to\) Save As...
Load the Single Member Districts file:
Layer
\(\to\) Add Layer
\(\to\) Add Vector Layer...
Navigate to the SMD_2022.geojson
file in the Data/Elections
directory:
The SMD_2022
layer should appear in you project window. These polygons are the lowest-level electoral units in DC (DC has 8 Wards, 46 Advisory Neighborhood Commissions, 345 SMDs)
Load the crime incidents vector data:
Navigate to the Crime_Incidents_in_2024.geojson
file in Data/Crime
The Crime_Incidents_in_2024
layer should appear in the project window. These are points, representing the locations of reported crimes
The third set of data we need is the ACS_2020.geojson
file in Data/ACS
folder
This layer contains American Community Survey data on local demographics (including population counts, which we need to estimate per capita crime). These data are at the level of Census Tracts, the borders of which do not align with SMDs
Our first order of business is to construct the area weights we will be using for interpolation, starting with \(a_j\) (area of destination polygons). Open the Attribute Table for SMD_2022
(destination layer) and open the Field Calculator
Create a new field called smd_area
of type Decimal number (real)
.
Set the Expression to $area
This will give us the area calculation for destination polygons (in square meters)
The new field smd_area
should appear in the Attribute Table for SMD_2022
Remember to save your layer edits after every operation like this! Right-click on SMD_2022
\(\to\) Save Layer Edits
(simply entering Ctrl-S or Cmd-S will save the project file, but not the underlying data layers we are editing)
Let’s also calculate the area of source polygons (\(a_i\)). Open the Field Calculator for ACS_2020
and create a new field called tract_area
of type Decimal number (real)
. Set the Expression to $area
Let’s calculate the number of crimes in each SMD (the “nominator” of our future interpolated variable).
Go to Vector
menu \(\to\) Analysis Tools
\(\to\) Count Points in Polygon...
Select SMD_2022
as the Polygons layer and Crime_Incidents_in_2024
as the Points layer. Name the count field crimes_smd_2024
and save the output to a new file called SMD_2022_2.geojson
. Click Run
You may see a No spatial index...
warning message in the log, but this won’t affect the calculation (only the processing speed). Just be patient and wait for the algorithm to finish
The newly-created SMD_2022_2
layer should appear in the project window. You can hide the other two layers for now, if you like
Check the Attribute Table for SMD_2022_2
to make sure the crimes_smd_2024
variable is there
To facilitate a “whole vs. parts” comparison (i.e. transforming a variable directly versus reconstructing it from transformed components) we can do the same kind of points-in-polygons calculation with the ACS_2020
layer…
Same approach as before, except with ACS_2020
as polygon layer, crimes_tract_2024
as field name, and ACS_2020_2.geojson
as output file
The new ACS_2020_2
layer should appear in the project window.
Let’s open its Attribute Table
Make sure the crimes_tract_2024
variable is there.
Then launch the Field Calculator
Calculate the “crimes per 1000 residents” variable for ACS_2020_2
. Name the field crimes_1000_tract
, set type to Decimal number (real)
, and set the Expression to crimes_tract_2024 / POP_TOTAL * 1000
The new crimes_1000_tract
variable should appear in the Attribute Table
Remember to save your layer edits!
The next step is to calculate the area of each intersection between source and destination polygons, or \(a_{i\cap j}\). To do this, we first need to create the intersection. Go to Vector
menu \(\to\) Geoprocessing Tools
\(\to\) Intersection...
For the intersection, select SMD_2022_2
as Input Layer, ACS_2020_2
as Overlay layer. Then click the [...]
button next to “Overlay fields to keep”
On the next screen check the boxes next to
POP_TOTAL
crimes_tract_2024
crimes_1000_tract
tract_area
Click OK
Save the output file to SMD_ACS_intersection.geojson
, click Run
The SMD_ACS_intersection
layer should appear in your project window
Open the Attribute Table for SMD_ACS_intersection
and launch the Field Calculator
Create a new field called area_weight
of type Decimal Number (real)
.
Set Expression to $area / tract_area
(this is equivalent to \(w_{i\cap j}^{\text{(ext)}}=\frac{a_{i\cap j}}{a_i}\))
The area_weight
field should appear in the Attribute Table.
Now go back to the Field Calculator, where we will create weighted population variables for destination polygons
Now let’s sum the weighted population numbers for each SMD. Create a new field called pop_tract2smd
of type Decimal Number (real)
. Set Expression to sum(POP_TOTAL*area_weight,group_by:=SMD_ID)
(this is equivalent to \(\sum_{i\cap j} w^{(ext)}_{i\cap j}x_{i\cap j}\))
This will take several minutes to compute (you can follow the progress bar at bottom, or use this time to stretch your legs)
(OPTIONAL) Create a new field called area_weight2
of type Decimal Number (real)
.
Set Expression to $area / smd_area
(this is equivalent to \(w_{i\cap j}^{\text{(int)}}=\frac{a_{i\cap j}}{a_j}\))
Now let’s do a weighted average (intensive) of crimes per 1000 residents. Create a new field called crimes_1000_tract2smd
of type Decimal Number (real)
. Set Expression to sum(crimes_1000_tract*area_weight2,group_by:=SMD_ID)
(this is equivalent to \(\sum_{i\cap j} w^{(int)}_{i\cap j}x_{i\cap j}\))
(OPTIONAL) This will also take a while to run. I even received a “program not responding” message, which I ignored (but check the progress bar to make sure the program is actually still running)
When it finishes running, you will see both new variables in the Attribute Table. Note that these values should be constant within SMDs (i.e. values will be repeated for intersections within each destination polygon)
After the algorithm finishes running is a good time to save your layer edits!
Our next step is to import these weighted values into the SMD_2022_2
layer, and divide crimes by population size. Right-click on SMD_2022_2
\(\to\) Properties
Go to Joins
tab in SMD_2022_2
Properties and click +
button to add a new join
Set SMD_ACS_intersection
as the “Join layer”, SMD_ID
as the “Join” and “Target” field, and under “Joined fields” select \(\checkmark\) pop_tract2smd
and \(\checkmark\) crimes_1000_tract2smd
(if you created this variable, too). Check the box next to Custom field name prefix
and clear the contents (i.e. no prefix). Click OK
You should now see the new join layer in Properties
Quality check! Let’s see if the sum of interpolated population values is close to the sum of this variable in the original data. Go to Vector
menu \(\to\) Vector Analysis
\(\to\) Basic Statistics for Fields
In the next window, set “Input layer” to SMD_2022_2
and “Field” to pop_tract2smd
. Click Run
Make note of the value in the log after “SUM:”
Now do the same with “Input layer” ACS_2020
and “Field” POP_TOTAL
The population sums look numerically close (683,027 vs 683,154).
Interpolation was a success!
Last step (for interpolation): calculate number of crimes per 1000 residents! Go to the Attribute Table for SMD_2022_2
and open the Field Calculator
Create a new field called crimes_1000_smd
of type Decimal number (real)
. Set Expression to crimes_smd_2024 / pop_tract2smd * 1000
(OPTIONAL) Whole vs. parts: compare the values for the field we just constructed (crimes_1000_smd
) from components to the crimes_1000_tract2smd
variable we interpolated whole-cloth.
Save the edits to the SMD_2022_2
layer!
To be extra-safe, let’s export the layer to a new geojson file (to preserve the join). Right-click SMD_2022_2
in layer menu \(\to\) Export
\(\to\) Save Features As...
Save the layer as SMD_2022_3.geojson
with Geometry type: Polygon
. Check the box next to Extent (current: layer)
. Click OK
The new layer should appear in your project window.
Now we’re ready to make a scatterplot of electoral competitiveness and crime!
To make the scatterplots, we will be using the Vector layer scatterplot
tool.
Go to Processing
menu \(\to\) Toolbox
\(\to\) Plots
\(\to\) Vector layer scatterplot
In the next window, set the following parameters:
Input layer: SMD_2022_3
; X field: competitive_top1
; Y field: crimes_1000_smd
Save the file as scatterplot_competitiveness_crime.html
.
The scatterplot html
file can be opened in a web browser.
To save the scatterplot, click the Download plot as a png
button in upper-right
Name the file scatterplot_competitiveness_crime.png
.
It should look something like this:
Problem Set 7
Your assignment (if using QGIS): create a scatterplot of school performance and race
ACS_2020.geojson
)SAZ_Elementary.geojson
)scatterplot_race_schools.png
Can you make this?
Here’s a quick preview of the assignment.
Load the vector data file SAZ_Elementary.geojson
from Data/Schools
To see how SAZ_Elementary
overlays with census tracts, select a symbology with bright thick borders and a transparent inside, like outline_red
Because SAZs are (mostly) larger and fewer in number than tracts (74 vs 206), it seems more appropriate for SAZs to be the destination units
Let’s construct the area weights, starting with the area of destination polygons, \(a_j\).
Recall that we already calculated the area of source polygons, \(a_i\) (tract_area
in ACS_2020
). Go to the Field Calculator for SAZ_Elementary
Create a new field named saz_area
of type Decimal number
.
Set the expression to $area
Remember to save your layer edits regularly!
Now let’s intersect SAZ_Elementary
with ACS_2020
to calculate intersection areas, \(a_{i\cap j}\). Go to Vector
menu \(\to\) Geoprocessing Tools
\(\to\) Intersection...
For the intersection, select SAZ_Elementary
as Input Layer, ACS_2020
as Overlay layer. Then click the [...]
button next to “Overlay fields to keep”
On the next screen check the boxes next to \(\checkmark\) POP_TOTAL
, \(\checkmark\) RACE_WHITE
,
\(\checkmark\) PCT_NONWHITE
and \(\checkmark\) tract_area
. Click OK
Save the output file to SAZ_ACS_intersection.geojson
, click Run
The intersection should appear in your project window
Launch the Field Calculator for SAZ_ACS_intersection
Create a new field area_weight
of type Decimal number
.
Set expression to $area / tract_area
. Click OK
(OPTIONAL) For the intensive transformation, create a new field area_weight2
of type Decimal number
. Set expression to $area / saz_area
. Click OK
Create a new field pct_nonwhite_saz
of type Decimal number
. Set expression to
sum((POP_TOTAL - RACE_WHITE) * area_weight,group_by:=OBJECTID) /
sum(POP_TOTAL * area_weight,group_by:=OBJECTID) * 100
(OPTIONAL) Create a new field pct_nonwhite_tract2saz
of type Decimal number
. Set expression to
sum(PCT_NONWHITE * area_weight2,group_by:=OBJECTID)
Inspect the pct_nonwhite_saz
variable in the Attribute Table.
Is it constant within SAZs? If so, we’re ready to join it to the SAZs
Now let’s import these weighted values into the SAZ_Elementary
layer. Right-click on SAZ_Elementary
\(\to\) Properties
Go to Joins
tab and click +
to add a new join. Set SAZ_ACS_intersection
as the “Join layer”, NAME
as the “Join” and “Target” field, and under “Joined fields” select \(\checkmark\) pct_nonwhite_saz
and \(\checkmark\) pct_nonwhite_tract2saz
(if you created this variable, too). Check the box next to Custom field name prefix
and clear the contents (i.e. no prefix). Click OK
You should now see the new join layer in Properties
Export the layer to a new geojson file (to preserve the join). Right-click SAZ_Elementary
in layer menu \(\to\) Export
\(\to\) Save Features As...
Save the layer as SAZ_Elementary_2.geojson
with Geometry type: Polygon
. Check the box next to Extent (current: layer)
. Click OK
In the scatteplot panel, set the following parameters: Input layer: SAZ_Elementary_2
; X field: pct_nonwhite_saz
; Y field: STAR_Score
Open the scatterplot in a browser and export to png
.
Export the image as scatterplot_race_schools.png
.
It should look something like this:
Loading R packages
To implement these steps in R, we will be using the sf
and SUNGEO
packages
library(sf)
library(SUNGEO)
NOTE: The demo code for R is in ps07_demo.R
on RStudio Cloud, and in PS07.zip
(posted on Canvas).
Loading spatial data
Let’s load the single member district boundaries into R, using sf::read_sf()
:
smd_2022 = sf::read_sf("Data/Elections/SMD_2022.geojson")
plot(smd_2022["geometry"])
Now load the ACS census tracts into R:
acs_2020 = sf::read_sf("Data/ACS/ACS_2020.geojson")
plot(acs_2020["geometry"])
…and the crime incidents data:
crimes_2024 = sf::read_sf("Data/Crime/Crime_Incidents_in_2024.geojson")
plot(crimes_2024["geometry"])
Let’s calculate relative scale and nesting metrics for a change of support from acs_2020
(census tracts) to smd_2022
(single member districts)
SUNGEO::nesting(acs_2020,smd_2022,metrix=c("rn","rs"))
## $rn
## [1] 0.4594585
##
## $rs
## [1] 0.1793275
Yikes! What if we switched to (smaller) census block groups as source units?
census_2020 = sf::read_sf("Data/Census/CENSUS_2020.geojson")
SUNGEO::nesting(census_2020,smd_2022,metrix=c("rn","rs"))
## $rn
## [1] 0.6907815
##
## $rs
## [1] 0.8217758
Much better! But for consistency with QGIS, we’ll stick with the tracts here.
Let’s do some point-in-polygon analysis to count crimes per SMD:
smd_2022$crimes_smd_2024 = lengths(sf::st_intersects(smd_2022,
crimes_2024))
plot(smd_2022["crimes_smd_2024"])
Let’s interpolate population counts per SMD (this is a one-step routine in R)
smd_2022$pop_tract2smd = sf::st_interpolate_aw(acs_2020["POP_TOTAL"],
smd_2022,extensive=TRUE)$POP_TOTAL
plot(smd_2022["pop_tract2smd"])
Let’s check the quality of transformation: how close are the population counts?
sum(smd_2022$pop_tract2smd) # Interpolated
## [1] 683070.1
sum(acs_2020$POP_TOTAL) # Original
## [1] 683154
Let’s calculate crimes per 1000 residents
smd_2022$crimes_1000_smd = smd_2022$crimes_smd_2024/
smd_2022$pop_tract2smd*1000
plot(smd_2022["crimes_1000_smd"])
Let’s create the scatterplot
plot(x=smd_2022$competitive_top1,
y=smd_2022$crimes_1000_smd,
xlab="Top-1 competitiveness (2022)",
ylab="Crimes per 1000 residents (2024)",
main="Electoral competitiveness and crime in DC"
)
Problem Set 7
Your assignment (if using R):
AFS_2020.geojson
POP_TOTAL
and RACE_WHITE
SAZ_Elementary.geojson
STAR_Score
scatterplot_race_schools_R.png
Can you make this?