Motivation: theoretically relevant units \(\neq\) spatial units at which data are available
Example: data for different variables are available at different units
Outcome
Treatment
Instrument
Example: borders, number of units change over time
1937
1945
1991
Example: data are measured at different levels of geographic precision
admin 0
admin 1
admin 2
Example: different definitions of same units across data sources
admin 2
“admin 2”
admin 2
The dilemma for analysts
Conduct analysis at theoretically inappropriate units
Convert the data to a common set of (more appropriate) units
What are change of support (CoS) problems?
Related topics:
Ecological inference (EI) and the modifiable areal unit problem (MAUP) are both special cases of CoS problems
The complexity of a CoS depends on
Illustration
Let’s consider three sets of units (from the U.S. state of Georgia)
precincts
constituencies
.5\(^\circ\) grid
precinct \(\to\) constituency: source units, source \(\cap\) destination, destination units
constituency \(\to\) grid: source units, source \(\cap\) destination, destination units
Some considerations
Guesstimation ain’t easy
Informally
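relative scale: how often source units are smaller than the destination units they intersect
relative nesting: how fully source units fall within a single destination unit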
Formally
relative scale: \(RS = \frac{1}{N_{S\cap D}}\sum_{i\cap j}^{N_{S\cap D}}1(a_i<a_j)\)
relative nesting: \(RN = \frac{1}{N_S}\sum_{i}^{N_S} \sum_j^{N_D}\left(\frac{a_{i\cap j}}{a_i} \right)^2\)
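A minimal sketch of how the two formulas could be computed by hand, assuming \(a_i\), \(a_j\) and \(a_{i\cap j}\) are areas and that the intersection areas are already known. The function names and the toy units are hypothetical; this is not the SUNGEO implementation.

```python
from collections import defaultdict

def relative_scale(intersections, source_area, dest_area):
    """Share of source-destination intersections with a_i < a_j."""
    pairs = [(i, j) for (i, j), a in intersections.items() if a > 0]
    smaller = sum(1 for i, j in pairs if source_area[i] < dest_area[j])
    return smaller / len(pairs)

def relative_nesting(intersections, source_area):
    """Mean over source units of sum_j (a_{i∩j} / a_i)^2."""
    per_source = defaultdict(float)
    for (i, j), a in intersections.items():
        per_source[i] += (a / source_area[i]) ** 2
    return sum(per_source.get(i, 0.0) for i in source_area) / len(source_area)

# Hypothetical toy data: two small units (area 1 each) nested in one big unit (area 2).
small = {"s1": 1.0, "s2": 1.0}
big = {"d1": 2.0}
small_to_big = {("s1", "d1"): 1.0, ("s2", "d1"): 1.0}
big_to_small = {("d1", "s1"): 1.0, ("d1", "s2"): 1.0}

rs_down = relative_scale(small_to_big, small, big)   # 1.0: sources always smaller
rn_down = relative_nesting(small_to_big, small)      # 1.0: sources perfectly nested
rs_up = relative_scale(big_to_small, big, small)     # 0.0: source always larger
rn_up = relative_nesting(big_to_small, big)          # 0.5: source split in two halves
```

Going from small to big units gives \(RS=RN=1\); reversing the direction gives \(RS=0\) and \(RN=0.5\), so both measures are directional.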
Application of relative scale and nesting to Georgia data: any surprises here?
Relative scale
source \(\to\) destination | (a) | (b) | (c) |
---|---|---|---|
(a) precincts | – | 1.00 | 1.00 |
(b) constituencies | 0.00 | – | 0.12 |
(c) .5\(^\circ\) grid | 0.00 | 0.89 | – |
Relative nesting
source \(\to\) destination | (a) | (b) | (c) |
---|---|---|---|
(a) precincts | – | 0.98 | 0.92 |
(b) constituencies | 0.01 | – | 0.29 |
(c) .5\(^\circ\) grid | 0.05 | 0.54 | – |
Maps of the three sets of units: (a) precincts, (b) constituencies, (c) .5\(^\circ\) grid
A CoS algorithm specifies a transformation between source and destination units
these range from simple geometric operations to complex model-based predictions
Types of variables
Extensive (depend on area and scale)
Intensive (don’t depend on area and scale)
Examples
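One way to see the difference, sketched with two hypothetical units that are merged into one: a count (extensive) adds up across units, while a density (intensive) has to be recomputed as a ratio. All numbers are made up for illustration.

```python
# Two hypothetical units that get merged into one larger unit.
units = [
    {"population": 50, "area": 10.0},
    {"population": 25, "area": 5.0},
]

# Extensive variables (counts, areas) are summed.
merged_population = sum(u["population"] for u in units)
merged_area = sum(u["area"] for u in units)

# Intensive variable (density): recompute from the merged totals.
merged_density = merged_population / merged_area

# Summing the unit-level densities gives the wrong answer.
naive_density_sum = sum(u["population"] / u["area"] for u in units)
```

Here both units have a density of 5, and so does the merged unit; the naive sum would report 10.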
Areal weighting is the default CoS method in many commercial and open-source GIS
Advantages
Disadvantages
Overlapping areas
Illustration: suppose a city is divided into 4 sectors: \(S_1, S_2, S_3, S_4\)
Source polygons
The city’s population (\(N=100\)) is distributed across the 4 sectors. 49% wear hats.
Underlying data distribution
But we don’t actually have micro data on where people live, just regional totals.
Observed data distribution
We know how many people live in each sector, and how many of them wear hats.
Observed distribution of hat wearers
From this, we know that \(S_1\) has a much lower share of hat wearers than \(S_2, S_3, S_4\).
Hat wearers as percent of population
Due to redistricting, a city council member’s district has switched from \(S_1\) to \(D_0\).
Destination polygon
With micro data, you can count how many people are in \(D_0\), and what % wear hats.
Destination polygon with (unobserved) micro data
Without micro data, you have to estimate this from aggregate statistics. But how?
Destination polygon with (observed) aggregate data
Let’s think about what the area of the new region in \(D_0\) actually represents.
Destination polygon in focus
This polygon is a combination of four intersections of \(S_1,\dots,S_4\) with \(D_0\).
Destination polygon broken into four components
The number of people living in intersection \(S_1\cap D_0\) is a subset of those living in \(S_1\).
Size of \(S_1\cap D_0\) relative to \(S_1\)
Let’s assume that pop size \(N_{S_1\cap D_0}\) is proportional to relative area of \(S_1\cap D_0\) vs \(S_1\).
Logic of area weighting for extensive variables
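Written out, the proportionality assumption gives the following estimate for an extensive variable. The symbols \(y_{S_i}\) (observed total in source unit \(S_i\)) and \(\hat{y}_{D_0}\) (estimated total in \(D_0\)) are notation introduced here for illustration.

\[
\hat{y}_{D_0} = \sum_{i=1}^{4} \frac{\text{area}(S_i\cap D_0)}{\text{area}(S_i)}\; y_{S_i}
\]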
From the map, we see that \(\text{area}(S_1\cap D_0)=3\times 7=21\) and \(\text{area}(S_1)=5\times 10=50\).
Constructing area weights for \(S_1\cap D_0\)
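A small sketch of the weight calculation with axis-aligned rectangles. The coordinates are hypothetical, chosen only so that \(S_1\) is \(5\times 10\) and the overlap with \(D_0\) is \(3\times 7\); real polygons would need a GIS library.

```python
def rect_area(r):
    """Area of a rectangle (xmin, ymin, xmax, ymax); 0 if it is empty."""
    xmin, ymin, xmax, ymax = r
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)

def rect_intersection(a, b):
    """Overlap of two rectangles (may be empty, i.e. have zero area)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

S1 = (0.0, 0.0, 5.0, 10.0)   # 5 x 10, area 50
D0 = (2.0, 3.0, 7.0, 13.0)   # hypothetical placement, also 5 x 10

inter = rect_intersection(S1, D0)            # 3 x 7 overlap
weight = rect_area(inter) / rect_area(S1)    # 21 / 50 = 0.42
```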
Multiply this “area weight” by the number of hats in \(S_1\) to get subtotal for \(S_1\cap D_0\).
Constructing area weighted subtotals for \(S_1\cap D_0\)
Multiply “area weight” by number of people in \(S_1\) to get sub-population of \(S_1\cap D_0\).
Constructing area weighted subtotals for \(S_1\cap D_0\)
Repeat exercise for \(S_2\cap D_0\): \(\frac{15}{50}\times18\text{ Hats}=6.48\text{ Hats}\), \(\frac{15}{50}\times50\text{ People}=15\text{ People}\)
Constructing area weighted subtotals for \(S_2\cap D_0\)
Repeat exercise for \(S_3\cap D_0\): \(\frac{10}{25}\times17\text{ Hats}=6.8\text{ Hats}\), \(\frac{10}{25}\times25\text{ People}=10\text{ People}\)
Constructing area weighted subtotals for \(S_3\cap D_0\)
Repeat exercise for \(S_4\cap D_0\): \(\frac{4}{25}\times12\text{ Hats}=1.92\text{ Hats}\), \(\frac{4}{25}\times25\text{ People}=4\text{ People}\)
Constructing area weighted subtotals for \(S_4\cap D_0\)
Combine the four subtotals into an area-weighted estimate of hats for all of \(D_0\).
Constructing area weighted sums in \(D_0\)
Combine the four subtotals into an area-weighted population estimate for all of \(D_0\).
Constructing area weighted sums in \(D_0\)
Divide weighted # of hats by weighted population to get “% Hats” estimate for \(D_0\).
Constructing area weighted statistics for \(D_0\)
Can’t we interpolate %’s directly, instead of numerator and denominator separately?
Interpolating “% Hats” as an intensive variable
Yes, but the weights would be different: \(\frac{\text{area}(S_1\cap D_0)}{\text{area}(D_0)}\), proportional to destination \(D_0\).
Area weights for an intensive variable
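In formula form, with \(r_{S_i}\) denoting the rate observed in source unit \(S_i\) and \(\hat{r}_{D_0}\) the estimated rate in \(D_0\) (both symbols are introduced here for illustration):

\[
\hat{r}_{D_0} = \sum_{i=1}^{4} \frac{\text{area}(S_i\cap D_0)}{\text{area}(D_0)}\; r_{S_i}
\]

When the source units fully cover \(D_0\), these weights sum to 1, so the estimate is an area-weighted average of the source rates.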
The area of \(S_1\cap D_0\) is \(3\times 7=21\), and \(\text{area}(D_0)=5\times 10=50\), so \(w=0.42\) again.
Area weights for intensive variable in \(S_1\cap D_0\)
Multiplying the weight by “% Hats” in \(S_1\), we get \(\frac{21}{50}\times 4\%=1.68\%\) Hats.
Area weighted subtotals for intensive variable in \(S_1\cap D_0\)
Repeat for \(S_2\cap D_0\): area weight \(\frac{15}{50}\times 36\%\text{ Hats in } S_2=10.8\%\text{ Hats}\).
Area weighted subtotals for intensive variable in \(S_2\cap D_0\)
Repeat for \(S_3\cap D_0\): area weight \(\frac{10}{50}\times 68\%\text{ Hats in } S_3=13.6\%\text{ Hats}\).
Area weighted subtotals for intensive variable in \(S_3\cap D_0\)
Repeat for \(S_4\cap D_0\): area weight \(\frac{4}{50}\times 48\%\text{ Hats in } S_4=3.84\%\text{ Hats}\).
Area weighted subtotals for intensive variable in \(S_4\cap D_0\)
Combine the four subtotals into an area-weighted estimate of “% Hats” for all of \(D_0\).
Constructing area weighted sums in \(D_0\)
Let’s compare these estimates to the ground truth (count how many people in \(D_0\)).
“Ground truth-ing” number of people in \(D_0\)
Let’s compare these estimates to the ground truth (count how many hats in \(D_0\)).
“Ground truth-ing” number of hats in \(D_0\)
Our 1st weighted estimate (32.04%, extensive) is closer to the ground truth than the 2nd (29.9%, intensive).
“Ground truth-ing” percent of hat-wearers in \(D_0\)
Pseudocode for areal interpolation
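One possible rendering of the procedure in Python, as a sketch rather than the original pseudocode. It assumes the intersection areas have already been computed (e.g. in a GIS); the function name and the dictionary layout are hypothetical.

```python
def areal_interpolate(values, intersections, source_area, dest_area, extensive=True):
    """Area-weighted interpolation from source units to destination units.

    values:        {source_id: value}
    intersections: {(source_id, dest_id): overlap area}
    source_area:   {source_id: area}
    dest_area:     {dest_id: area}
    """
    out = {d: 0.0 for d in dest_area}
    for (s, d), a in intersections.items():
        if extensive:
            w = a / source_area[s]   # share of the source unit that lies in d
        else:
            w = a / dest_area[d]     # share of d that the source unit covers
        out[d] += w * values[s]
    return out

# Areas and "% Hats" values taken from the hat example.
source_area = {"S1": 50, "S2": 50, "S3": 25, "S4": 25}
dest_area = {"D0": 50}
intersections = {("S1", "D0"): 21, ("S2", "D0"): 15,
                 ("S3", "D0"): 10, ("S4", "D0"): 4}
pct_hats = {"S1": 4, "S2": 36, "S3": 68, "S4": 48}

est = areal_interpolate(pct_hats, intersections, source_area, dest_area,
                        extensive=False)
pct_hats_d0 = round(est["D0"], 2)   # 29.92, the intensive estimate
```

Counts such as hats or people would use `extensive=True`, which switches the denominator of the weight from the destination area to the source area.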
Areal interpolation is just one of many potential CoS methods
Examples:
these differ in their assumptions (e.g. uniformity vs. heterogeneity) and requirements (e.g. ancillary data)
… what’s more important is not the choice of CoS algorithm, but the relative scale and nesting of source and destination units
Choice paralysis
Precinct-to-constituency CoS (\(RS=1, RN=0.98\))
Different CoS algorithms \(\to\) Different transformed values
Constituency-to-grid CoS (\(RS=0.12, RN=0.29\))
But how do \(RS\), \(RN\) affect the quality of transformations (prediction error, rank correlation, estimation bias), holding CoS algorithm constant?
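For reference, the first two quality measures named here can be computed as follows. This is a plain-Python sketch with hypothetical numbers; estimation bias is not covered, since it needs a full regression setup.

```python
import math

def rmse(estimates, truth):
    """Root mean squared error between transformed and true values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimates, truth)) / len(truth))

def ranks(xs):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda k: xs[k])
    r = [0.0] * len(xs)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and xs[order[end + 1]] == xs[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            r[k] = avg
        pos = end + 1
    return r

def pearson(x, y):
    """Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical true and transformed values for four destination units.
truth = [10.0, 20.0, 30.0, 40.0]
transformed = [12.0, 18.0, 33.0, 39.0]

error = rmse(transformed, truth)
rank_corr = spearman(transformed, truth)
```

In this toy case the transformed values are off by a few units but keep the true ordering, so the rank correlation is 1 even though the RMSE is not 0.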
Higher \(RS\), \(RN\) \(\to\) Lower prediction error relative to true values
How RN and RS affect root mean squared error
Higher \(RS\), \(RN\) \(\to\) Higher correlation b/w transformed values & true values
How RN and RS affect correlation
Higher \(RS\), \(RN\) \(\to\) Less bias in regression coefficients
How RN and RS affect OLS estimation bias
What is to be done?
Bad news: \(RN\) and \(RS\) can be calculated in R (SUNGEO::nesting()), not QGIS
(but you can still do CoS in QGIS, using good judgement and common sense!)