Motivation: theoretically relevant units $\neq$ spatial units at which data are available

Example: data for different variables are available at different units

Level-1 contemporary administrative units

Outcome

Treatment

Instrument

Example: borders, number of units change over time

1937

1945

1991

Example: data are measured at different levels of geographic precision

Level-0 administrative unit, i.e. country

admin 0

admin 1

admin 2

Example: different definitions of same units across data sources

admin 2

“admin 2”

Level-2 administrative units, from geoBoundaries

admin 2

The dilemma for analysts

Conduct analysis at theoretically inappropriate units
- this is only possible if all data are available for those same units

Convert the data to a common set of (more appropriate) units
- this is an intermediate, messy step
- it always entails some information loss
- it can lead to measurement error and biased estimation of quantities of interest
- problem is well-known in geostatistics and social science
- but no best practices exist for implementation, comparison, evaluation

Changes of support

Definitions

What are change of support problems?

Geographic support: area, shape, size, and orientation associated with a variable’s spatial measurement
Change of support (CoS) problem: making statistical inferences about a variable at one support by using data from a different support

Nesting and scale

Illustration

Let’s consider three sets of units (from the U.S. state of Georgia)

precincts

constituencies

.5$^\circ$ grid

Suppose one wants to change the support from precincts to constituencies
- scale: are source units smaller or larger than destination units?
- nesting: do source units fit completely/neatly into destination units?

source units

Intersection of precincts and constituencies

source $\cap$ destination

destination units

Suppose one wants to change the support from constituencies to grid cells
- scale: are source units smaller or larger than destination units?
- nesting: do source units fit completely/neatly into destination units?

source units

Intersection of constituencies and grid cells

source $\cap$ destination

destination units

Change of support #1 looks like an aggregation of nested units
Change of support #2 looks like (mostly?) disaggregation of non-nested units

precinct $\to$ constituency

constituency $\to$ grid

Some considerations

many CoS problems require both aggregation and disaggregation
just because units are politically nested doesn’t mean they are geometrically nested (e.g. measurement error, imprecision of boundaries)
not always easy to “eyeball” these things
to get a better read on this, we need quantitative measures

Guesstimation ain’t easy

Informally

relative scale:

share of intersections where source units are smaller than destination units

relative nesting:

share of source units that cannot be split across destination units

Formally

$\mathcal{G}_S$: set of source polygons, indexed $i=1,\dots,N_S$
$\mathcal{G}_D$: set of destination polygons, indexed $j=1,\dots,N_D$
$\mathcal{G}_{S\cap D}$: intersection of $\mathcal{G}_S$ & $\mathcal{G}_D$, indexed $i\cap j=1,\dots,N_{S\cap D}$
$a_i$: area of source polygon $i$;$\quad$ $a_j$: area of destination polygon $j$
$a_{i\cap j}$: area of intersection $i\cap j$

relative scale: $RS = \frac{1}{N_{S\cap D}}\sum_{i\cap j}^{N_{S\cap D}}1(a_i<a_j)$

values of 1 $=$ aggregation; values of 0 $=$ disaggregation; 0-1 $=$ hybrid

relative nesting: $RN = \frac{1}{N_S}\sum_{i}^{N_S} \sum_j^{N_D}\left(\frac{a_{i\cap j}}{a_i} \right)^2$

values of 1 $=$ full nesting; values of 0 $=$ no nesting; 0-1 $=$ partial nesting
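The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the SUNGEO implementation: the function names, the `(source id, destination id, area)` tuple layout, and the toy polygons are all assumptions made for the example, and the intersection areas are taken as already computed.

```python
from collections import defaultdict

def relative_scale(intersections, source_area, dest_area):
    """RS: share of intersections whose source polygon is smaller than its destination polygon."""
    flags = [source_area[i] < dest_area[j] for i, j, _ in intersections]
    return sum(flags) / len(flags)

def relative_nesting(intersections, source_area):
    """RN: mean over source polygons of the summed squared area shares (a_ij / a_i)^2."""
    per_source = defaultdict(float)
    for i, _, a_ij in intersections:
        per_source[i] += (a_ij / source_area[i]) ** 2
    return sum(per_source.values()) / len(source_area)

# Hypothetical toy layout: s1 sits entirely inside d1, s2 is split evenly across d1 and d2.
source_area = {"s1": 4.0, "s2": 4.0}
dest_area = {"d1": 6.0, "d2": 2.0}
intersections = [("s1", "d1", 4.0), ("s2", "d1", 2.0), ("s2", "d2", 2.0)]

rs = relative_scale(intersections, source_area, dest_area)   # 2 of 3 intersections
rn = relative_nesting(intersections, source_area)            # (1.0 + 0.5) / 2
print(rs, rn)
```

In this toy case the fully nested unit contributes 1 to $RN$ and the evenly split unit contributes $0.5^2+0.5^2=0.5$, so partial nesting pulls the average below 1.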

Application of relative scale and nesting to Georgia data: any surprises here?

Relative scale

source $\to$ destination	(a)	(b)	(c)
(a) precincts	–	1.00	1.00
(b) constituencies	0.00	–	0.12
(c) .5$^\circ$ grid	0.00	0.89	–

Relative nesting

source $\to$ destination	(a)	(b)	(c)
(a) precincts	–	0.98	0.92
(b) constituencies	0.01	–	0.29
(c) .5$^\circ$ grid	0.05	0.54	–

(a)

(b)

(c)

Change of support algorithms

A CoS algorithm specifies a transformation between source and destination units

$x$: a variable being transformed from support $\mathcal{G}_S$ to $\mathcal{G}_D$
$x_{\mathcal{G}_D}$: true value of variable $x$ in destination units $\mathcal{G}_D$
$\widehat{x}_{\mathcal{G}_D}^{(k)}=f_k(x_{\mathcal{G}_S})$: estimated value of $x_{\mathcal{G}_D}$, calculated with CoS algorithm $k$

these range from simple geometric operations to complex model-based predictions

Simple overlay

Rube Goldberg machine

Types of variables

Extensive (depend on area and scale)
- aggregates are (weighted) sums
- must satisfy the pycnophylactic (mass-preserving) property:
  - if area is split or combined, its values must be split or combined
  - sum of values in destination units must equal sum in source units
- examples: population counts, event counts, acreage, mineral deposits
Intensive (don’t depend on area and scale)
- aggregates are (weighted) means
- examples: population density, vote margins, median income
- intensive variables are often functions of extensive variables (density $=$ mass/volume)
- best practice: reconstruct in destination units from transformed components
  ($\widehat{\text{mass}}_{\mathcal{G}_D}/\widehat{\text{volume}}_{\mathcal{G}_D} = \widehat{\text{density}}_{\mathcal{G}_D}$)
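The mass-preserving property for an extensive variable can be written compactly. This is a sketch in the notation of the formal definitions (source polygons $i=1,\dots,N_S$, destination polygons $j=1,\dots,N_D$), and it assumes the destination layer fully covers the source layer:

```latex
\sum_{j=1}^{N_D} \hat{x}_j \;=\; \sum_{i=1}^{N_S} x_i
```

An intensive variable has no such constraint, which is one reason to transform its extensive components first and take the ratio afterwards.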

Example of intensive and extensive variables

Examples

Areal interpolation

Areal weighting is the default CoS method in many commercial and open-source GIS

Advantages
- easy to implement
- requires information only on geometry of source and destination units
- no need for ancillary data
Disadvantages
- assumes that the phenomenon of interest is uniformly distributed in source units
- this becomes less problematic if source units are relatively small
- but more problematic as source units increase in size

Overlapping areas

Illustration: suppose a city is divided into 4 sectors: $S_1, S_2, S_3, S_4$

Source polygons

The city’s population ($N=100$) is distributed across the 4 sectors. 49% wear hats.

Underlying data distribution

But we don’t actually have micro data on where people live, just regional totals.

Observed data distribution

We know how many people live in each sector, and how many of them wear hats.

Macro-level data on how many people wear hats

Observed distribution of hat wearers

From this, we know that $S_1$ has a much lower share of hat wearers than $S_2, S_3, S_4$.

Hat wearers as percent of population

Due to redistricting, a city council member’s district has switched from $S_1$ to $D_0$.

Destination polygon

With micro data, you can count how many people are in $D_0$, and what % wear hats.

Destination polygon with (unobserved) micro data

Without micro data, you have to estimate this from aggregate statistics. But how?

Borders of new destination polygon and observed macro data

Destination polygon with (observed) aggregate data

Let’s think about what the area of the new region in $D_0$ actually represents.

Borders and area of new destination polygon

Destination polygon in focus

This polygon is a combination of four intersections of $S_1,\dots,S_4$ with $D_0$.

Borders and areas of subdivisions of new destination polygon

Destination polygon broken into four components

The number of people living in intersection $S_1\cap D_0$ is a subset of those living in $S_1$.

Size of $S_1\cap D_0$ relative to $S_1$

Let’s assume that population size $N_{S_1\cap D_0}$ is proportional to the area of $S_1\cap D_0$ relative to the area of $S_1$.

Logic of area weighting for extensive variables

From the map, we see that $\text{area}(S_1\cap D_0)=3\times 7=21$ and $\text{area}(S_1)=5\times 10=50$.

Constructing area weights for $S_1\cap D_0$
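Written out with the areas read off the map, the extensive area weight for this intersection is:

```latex
w_{S_1\cap D_0}
  = \frac{\text{area}(S_1\cap D_0)}{\text{area}(S_1)}
  = \frac{3\times 7}{5\times 10}
  = \frac{21}{50}
  = 0.42
```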

Multiply this “area weight” by the number of hats in $S_1$ to get subtotal for $S_1\cap D_0$.

Constructing the area-weighted subtotal of hats in $S_1\cap D_0$

Multiply “area weight” by number of people in $S_1$ to get sub-population of $S_1\cap D_0$.

Constructing the area-weighted subtotal of people in $S_1\cap D_0$

Repeat exercise for $S_2\cap D_0$: $\frac{15}{50}\times18\text{ Hats}=5.4\text{ Hats}$, $\frac{15}{50}\times50\text{ People}=15\text{ People}$

Constructing area weighted subtotals for $S_2\cap D_0$

Repeat exercise for $S_3\cap D_0$: $\frac{10}{25}\times17\text{ Hats}=6.8\text{ Hats}$, $\frac{10}{25}\times25\text{ People}=10\text{ People}$

Constructing area weighted subtotals for $S_3\cap D_0$

Repeat exercise for $S_4\cap D_0$: $\frac{4}{25}\times12\text{ Hats}=1.92\text{ Hats}$, $\frac{4}{25}\times25\text{ People}=4\text{ People}$

Constructing area weighted subtotals for $S_4\cap D_0$

Combine the four subtotals into an area-weighted estimate of hats for all of $D_0$.

Constructing area weighted sums in $D_0$

Combine the four subtotals into an area-weighted population estimate for all of $D_0$.

Constructing area weighted sums in $D_0$

Divide weighted # of hats by weighted population to get “% Hats” estimate for $D_0$.

Constructing area weighted statistics for $D_0$

Can’t we interpolate %’s directly, instead of numerator and denominator separately?

Interpolating “% Hats” as an intensive variable

Yes, but the weights would be different: $\frac{\text{area}(S_1\cap D_0)}{\text{area}(D_0)}$, i.e. relative to the area of the destination $D_0$ rather than the source $S_1$.

Area weights for an intensive variable

The area of $S_1\cap D_0$ is $3\times 7=21$, and $\text{area}(D_0)=5\times 10=50$, so $w=0.42$ again.

Area weights for intensive variable in $S_1\cap D_0$

Multiplying the weight by “% Hats” in $S_1$, we get $\frac{21}{50}\times 4\%=1.68\%$ Hats.

Area weighted subtotals for intensive variable in $S_1\cap D_0$

Repeat for $S_2\cap D_0$: area weight $\frac{15}{50}\times 36\%\text{ Hats in } S_2=10.8\%\text{ Hats}$.

Area weighted subtotals for intensive variable in $S_2\cap D_0$

Repeat for $S_3\cap D_0$: area weight $\frac{10}{50}\times 68\%\text{ Hats in } S_3=13.6\%\text{ Hats}$.

Area weighted subtotals for intensive variable in $S_3\cap D_0$

Repeat for $S_4\cap D_0$: area weight $\frac{4}{50}\times 48\%\text{ Hats in } S_4=3.84\%\text{ Hats}$.

Area weighted subtotals for intensive variable in $S_4\cap D_0$

Combine the four subtotals into an area-weighted estimate of “% Hats” for all of $D_0$.

Constructing area weighted sums in $D_0$
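The intensive calculation can be reproduced in a few lines of Python. This sketch uses only the intersection areas and sector-level “% Hats” values quoted in the steps above; the variable names are made up for illustration.

```python
# Intersection areas with D_0 and "% Hats" per source sector, as quoted in the walkthrough.
intersection_area = {"S1": 21, "S2": 15, "S3": 10, "S4": 4}
pct_hats = {"S1": 4.0, "S2": 36.0, "S3": 68.0, "S4": 48.0}

# The four intersections tile D_0, so their areas sum to area(D_0) = 5 x 10 = 50.
area_d0 = sum(intersection_area.values())

# Intensive weights: share of D_0's area contributed by each intersection.
weights = {s: a / area_d0 for s, a in intersection_area.items()}

# Area-weighted mean of "% Hats" across the four intersections.
pct_hats_d0 = sum(weights[s] * pct_hats[s] for s in weights)
print(round(pct_hats_d0, 2))  # 29.92
```

Because the weights are shares of the destination's area, they sum to 1 and the result is a weighted mean rather than a weighted sum.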

Let’s compare these estimates to the ground truth (count how many people in $D_0$).

“Ground truth-ing” number of people in $D_0$

Let’s compare these estimates to the ground truth (count how many hats in $D_0$).

“Ground truth-ing” number of hats in $D_0$

Our 1st weighted estimate (32.04%, from extensive components) is closer to the ground truth than the 2nd (29.9%, interpolated directly as an intensive variable).

“Ground truth-ing” percent of hat-wearers in $D_0$

Pseudocode for areal interpolation

Intersect $\mathcal{G}_{S}$ and $\mathcal{G}_{D}$, creating a third polygon layer $\mathcal{G}_{S\cap D}$,
- each feature $i\cap j\in \{1,\dots,N_{S\cap D}\}$ is a part of source polygon $i$ that falls inside destination polygon $j$.
Compute area weights for each intersection $i\cap j$,
1. for extensive variables: $w_{i\cap j}^{\text{(ext)}}=\frac{a_{i\cap j}}{a_i}$
  (i.e. share of $i$’s area represented by intersection $i\cap j$)
2. for intensive variables: $w_{i\cap j}^{\text{(int)}}=\frac{a_{i\cap j}}{a_j}$
  (i.e. share of $j$’s area contributed by intersection $i\cap j$)
Combine weighted statistics for each destination polygon $j$:
1. $\hat{x}_j=\sum_{i\cap j}^{N_{\cap j}} w_{i\cap j}x_{i\cap j}$, where $x_{i\cap j}$ is the value of $x$ carried by intersection $i\cap j$ (inherited from its source polygon $i$) and $N_{\cap j}$ is the number of intersections in $j$
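The three steps above might look like this in Python. It is a minimal sketch under stated assumptions: the intersection layer is taken as already computed and passed in as `(source id, destination id, area)` tuples, and the function name, argument names, and toy data are hypothetical rather than any GIS package's API.

```python
from collections import defaultdict

def areal_interpolate(intersections, source_area, dest_area, x_source, extensive=True):
    """Area-weighted estimate of x for each destination polygon j."""
    x_hat = defaultdict(float)
    for i, j, a_ij in intersections:
        # Extensive: share of source i's area; intensive: share of destination j's area.
        w = a_ij / source_area[i] if extensive else a_ij / dest_area[j]
        # Each intersection inherits the value of its source polygon.
        x_hat[j] += w * x_source[i]
    return dict(x_hat)

# Hypothetical toy data: s1 lies inside d1, s2 is split evenly across d1 and d2.
source_area = {"s1": 4.0, "s2": 4.0}
dest_area = {"d1": 6.0, "d2": 2.0}
intersections = [("s1", "d1", 4.0), ("s2", "d1", 2.0), ("s2", "d2", 2.0)]
counts = {"s1": 10.0, "s2": 20.0}                          # extensive variable
density = {i: counts[i] / source_area[i] for i in counts}  # intensive variable

counts_hat = areal_interpolate(intersections, source_area, dest_area, counts, extensive=True)
density_hat = areal_interpolate(intersections, source_area, dest_area, density, extensive=False)
print(counts_hat, density_hat)
```

In this toy case the interpolated counts sum to the source total (30), and dividing each interpolated count by its destination area gives the same density as interpolating density directly. That agreement follows from defining the intensive variable as count per unit area; it need not hold for ratios with other denominators.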

Areal interpolation is just one of many potential CoS methods

Examples:

simple overlay
population weighted interpolation
ordinary kriging
universal kriging
thin-plate splines and random forests

these differ in their assumptions
(e.g. uniformity vs. heterogeneity) and requirements (e.g. ancillary data)

… what’s more important is not the choice of CoS algorithm, but the relative scale and nesting of source and destination units

Choice paralysis

Assessing transformation quality

Precinct-to-constituency CoS ($RS=1, RN=0.98$)

Precinct-to-constituency CoS

Different CoS algorithms $\to$ Different transformed values

Constituency-to-grid CoS ($RS=0.12, RN=0.29$)

Constituency-to-grid CoS

But how do $RS$, $RN$ affect the quality of transformations (prediction error, rank correlation, estimation bias), holding CoS algorithm constant?

Higher $RS$, $RN$ $\to$ Lower prediction error relative to true values

How RN and RS affect root mean squared error

Higher $RS$, $RN$ $\to$ Higher correlation b/w transformed values & true values

How RN and RS affect correlation

Higher $RS$, $RN$ $\to$ Less bias in regression coefficients

How RN and RS affect OLS estimation bias
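Two of the quality measures named above, root mean squared error and rank correlation, can be computed as follows when true destination-unit values are available. This is a plain-Python sketch with made-up numbers; the helper names are hypothetical, and rank correlation is computed as the Pearson correlation of average ranks (Spearman's approach).

```python
import math

def rmse(estimated, truth):
    """Root mean squared error of transformed values against true values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimated, truth)) / len(truth))

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    out = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            out[k] = (pos + end) / 2 + 1
        pos = end + 1
    return out

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def rank_correlation(estimated, truth):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(estimated), ranks(truth))

# Hypothetical true and transformed values for four destination units.
truth = [10.0, 20.0, 30.0, 40.0]
estimate = [12.0, 18.0, 33.0, 39.0]

error = rmse(estimate, truth)
rho = rank_correlation(estimate, truth)
print(error, rho)
```

Here the estimates preserve the ordering of the true values, so the rank correlation is 1 even though the prediction error is nonzero; the two measures capture different aspects of transformation quality.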

What is to be done?

General recommendations:
- consider relative scale and nesting as ex ante measures of CoS complexity
- check face validity of transformed values through visualization
If “ground truth” data (micro data, cross-unit IDs) are available:
- validate transformed values with micro data
- use micro data as source units
- match on common ID (if units are well-nested)
If “ground truth” data are not available:
- be transparent about limitations/assumptions
- partial validation (if micro data available for some regions)
- report results from alternative CoS algorithms when possible

Bad news: $RN$ and $RS$ can be calculated in R (SUNGEO::nesting()), not QGIS
(but you can still do CoS in QGIS, using good judgement and common sense!)

API-231 / GIS-PubPol / Meeting 12 (Changes of Geographic Support)

Yuri M. Zhukov

Visiting Associate Professor of Public Policy

Harvard Kennedy School

March 5, 2024
