Motivation: theoretically relevant units \(\neq\) spatial units at which data are available

 

Example: data for different variables are available at different units

[Figure panels: Outcome, Treatment, Instrument]


 

 

Example: borders, number of units change over time

[Figure panels: 1937, 1945, 1991]


 

 

Example: data are measured at different levels of geographic precision

[Figure panels: admin 0, admin 1, admin 2]


 

 

Example: different definitions of same units across data sources

[Figure panels: admin 2, “admin 2”, admin 2]


 

The dilemma for analysts

 

  1. Conduct analysis at theoretically inappropriate units

    • this is only possible if all data are available for those same units

 

   or

 

  2. Convert the data to a common set of (more appropriate) units

    • this is an intermediate, messy step
    • it always entails some information loss
    • it can lead to measurement error and biased estimation of quantities of interest
    • problem is well-known in geostatistics and social science
    • but no best practices exist for implementation, comparison, evaluation

Changes of support

Definitions


What are change of support problems?

 

  1. Geographic support: area, shape, size, and orientation associated with a variable’s spatial measurement
  2. Change of support (CoS) problem: making statistical inferences about a variable at one support by using data from a different support

 

Related topics:

  • ecological inference (EI): deducing micro variation from aggregate data
  • modifiable areal unit problem (MAUP): statistical inferences depend on the geographical regions at which data are observed

 

EI and MAUP are both special cases of CoS problems


The complexity of a CoS depends on

 

  1. Relative scale: aggregation, disaggregation, hybrid
  2. Relative nesting: whether one set of units falls completely and neatly inside the other

Nesting and scale


Illustration

 

Let’s consider three sets of units (from the U.S. state of Georgia)

[Figure panels: precincts, constituencies, .5\(^\circ\) grid]


  1. Suppose one wants to change the support from precincts to constituencies
    • scale: are source units smaller or larger than destination units?
    • nesting: do source units fit completely/neatly into destination units?

 

[Figure panels: source units, source \(\cap\) destination, destination units]


  2. Suppose one wants to change the support from constituencies to grid cells
    • scale: are source units smaller or larger than destination units?
    • nesting: do source units fit completely/neatly into destination units?

 

[Figure panels: source units, source \(\cap\) destination, destination units]


  1. Change of support #1 looks like an aggregation of nested units
  2. Change of support #2 looks like (mostly?) disaggregation of non-nested units

 

[Figure panels: precinct \(\to\) constituency, constituency \(\to\) grid]


Some considerations

 

  • many CoS problems require both aggregation and disaggregation
  • just because units are politically nested doesn’t mean they are geometrically nested (e.g. measurement error, imprecision of boundaries)
  • not always easy to “eyeball” these things
  • to get a better read on this, we need quantitative measures


Guesstimation ain’t easy


Informally

relative scale:

  • share of intersections where source units smaller than destination units

relative nesting:

  • share of source units that cannot be split across destination units

 

Formally

  • \(\mathcal{G}_S\): set of source polygons, indexed \(i=1,\dots,N_S\)
  • \(\mathcal{G}_D\): set of destination polygons, indexed \(j=1,\dots,N_D\)
  • \(\mathcal{G}_{S\cap D}\): intersection of \(\mathcal{G}_S\) & \(\mathcal{G}_D\), indexed \(i\cap j=1,\dots,N_{S\cap D}\)
  • \(a_i\): area of source polygon \(i\);\(\quad\) \(a_j\): area of destination polygon \(j\)
  • \(a_{i\cap j}\): area of intersection \(i\cap j\)

 

relative scale: \(RS = \frac{1}{N_{S\cap D}}\sum_{i\cap j}^{N_{S\cap D}}1(a_i<a_j)\)

  • values of 1 \(=\) aggregation; values of 0 \(=\) disaggregation; values between 0 and 1 \(=\) hybrid

relative nesting: \(RN = \frac{1}{N_S}\sum_{i}^{N_S} \sum_j^{N_D}\left(\frac{a_{i\cap j}}{a_i} \right)^2\)

  • values of 1 \(=\) full nesting; values of 0 \(=\) no nesting; values between 0 and 1 \(=\) partial nesting
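The two measures can be sketched in a few lines of Python. This is only an illustration of the formulas above, not the SUNGEO implementation: axis-aligned rectangles stand in for polygons, and the `Rect` class, helper names, and toy layout are all assumptions made for the example.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned rectangle standing in for a polygon (illustrative only)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def area(self) -> float:
        return max(0.0, self.xmax - self.xmin) * max(0.0, self.ymax - self.ymin)


def overlap_area(a: Rect, b: Rect) -> float:
    """Area of the intersection of two rectangles (0 if they do not overlap)."""
    width = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    height = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    return max(0.0, width) * max(0.0, height)


def relative_scale(source, destination):
    """RS: share of non-empty intersections where the source unit is smaller."""
    flags = [
        s.area() < d.area()
        for s in source
        for d in destination
        if overlap_area(s, d) > 0
    ]
    return sum(flags) / len(flags)


def relative_nesting(source, destination):
    """RN: mean over source units of the summed squared area shares."""
    total = 0.0
    for s in source:
        total += sum((overlap_area(s, d) / s.area()) ** 2 for d in destination)
    return total / len(source)


# Hypothetical layout: four unit squares that tile two side-by-side halves.
squares = [Rect(0, 0, 1, 1), Rect(0, 1, 1, 2), Rect(1, 0, 2, 1), Rect(1, 1, 2, 2)]
halves = [Rect(0, 0, 1, 2), Rect(1, 0, 2, 2)]

rs_up, rn_up = relative_scale(squares, halves), relative_nesting(squares, halves)
rs_down, rn_down = relative_scale(halves, squares), relative_nesting(halves, squares)
```

In this toy layout, squares \(\to\) halves is a pure aggregation of nested units (\(RS=1\), \(RN=1\)), while halves \(\to\) squares is a pure disaggregation in which each source unit is split evenly in two (\(RS=0\), \(RN=0.5\)).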

Application of relative scale and nesting to Georgia data: any surprises here?

 

Relative scale

source \(\to\) destination    (a)     (b)     (c)
(a) precincts                  -      1.00    1.00
(b) constituencies            0.00     -      0.12
(c) .5\(^\circ\) grid         0.00    0.89     -


Relative nesting

source \(\to\) destination    (a)     (b)     (c)
(a) precincts                  -      0.98    0.92
(b) constituencies            0.01     -      0.29
(c) .5\(^\circ\) grid         0.05    0.54     -

 


[Figure panels: (a) precincts, (b) constituencies, (c) .5\(^\circ\) grid]


Change of support algorithms


A CoS algorithm specifies a transformation between source and destination units

  • \(x\): variable being transformed from support \(\mathcal{G}_S\) to \(\mathcal{G}_D\)
  • \(x_{\mathcal{G}_D}\): true value of variable \(x\) in destination units \(\mathcal{G}_D\)
  • \(\widehat{x_{\mathcal{G}_D}}^{(k)}=f_k(x_{\mathcal{G}_S})\): estimated value of \(x_{\mathcal{G}_D}\), calculated with CoS algorithm \(k\)

These algorithms range from simple geometric operations to complex model-based predictions

 





Types of variables

  1. Extensive (depend on area and scale)

    • aggregates are (weighted) sums
    • must satisfy the pycnophylactic (mass-preserving) property:
      • if area is split or combined, its values must be split or combined
      • sum of values in destination units must equal sum in source units
    • examples: population counts, event counts, acreage, mineral deposits
  2. Intensive (don’t depend on area and scale)

    • aggregates are (weighted) means
    • examples: population density, vote margins, median income
    • intensive variables are often functions of extensive variables (density \(=\) mass/vol.)
    • best practice: reconstruct in destination units from transformed components
      (\(\widehat{\text{mass}}_{\mathcal{G}D}/\widehat{\text{volume}}_{\mathcal{G}D} = \widehat{\text{density}}_{\mathcal{G}D}\))
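A small numeric sketch may help show why intensive variables are better rebuilt from their extensive components. The two source units and their values below are hypothetical; they are assumed to merge into a single destination unit.

```python
# Hypothetical source units that are combined into one destination unit.
source_units = [
    {"population": 1000.0, "area_km2": 10.0},  # dense unit
    {"population": 100.0, "area_km2": 90.0},   # sparse unit
]

# Extensive components aggregate as sums (mass-preserving).
population_d = sum(u["population"] for u in source_units)
area_d = sum(u["area_km2"] for u in source_units)

# Recommended route: rebuild the intensive variable from transformed components.
density_from_components = population_d / area_d

# Naive route: unweighted mean of the source densities.
densities = [u["population"] / u["area_km2"] for u in source_units]
naive_mean_density = sum(densities) / len(densities)

# An area-weighted mean of the densities recovers the component-based value.
weighted_mean_density = sum(
    d * u["area_km2"] / area_d for d, u in zip(densities, source_units)
)
```

Here the component-based density is 11 people per square km, while the unweighted mean of the two source densities is roughly 50.6, because it gives the small dense unit the same weight as the large sparse one.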


 

 

Examples

Areal interpolation


Areal weighting is the default CoS method in many commercial and open-source GIS

  1. Advantages

    • easy to implement
    • requires information only on geometry of source and destination units
    • no need for ancillary data
  2. Disadvantages

    • assumes that the phenomenon of interest is uniformly distributed in source units
    • this becomes less problematic if source units are relatively small
    • but more problematic as source units increase in size


Overlapping areas


Illustration: suppose a city is divided into 4 sectors: \(S_1, S_2, S_3, S_4\)

Source polygons


The city’s population (\(N=100\)) is distributed across the 4 sectors. 49% wear hats.

Underlying data distribution


But we don’t actually have micro data on where people live, just regional totals.

Observed data distribution


We know how many people live in each sector, and how many of them wear hats.

Observed distribution of hat wearers


From this, we know that \(S_1\) has a much lower share of hat wearers than \(S_2, S_3, S_4\).

Hat wearers as percent of population


Due to redistricting, a city council member’s district has switched from \(S_1\) to \(D_0\).

Destination polygon


With micro data, you can count how many people are in \(D_0\), and what % wear hats.

Destination polygon with (unobserved) micro data


Without micro data, you have to estimate this from aggregate statistics. But how?

Destination polygon with (observed) aggregate data


Let’s think about what the area of the new region in \(D_0\) actually represents.

Destination polygon in focus


This polygon is a combination of four intersections of \(S_1,\dots,S_4\) with \(D_0\).

Destination polygon broken into four components


The number of people living in intersection \(S_1\cap D_0\) is a subset of those living in \(S_1\).

Size of \(S_1\cap D_0\) relative to \(S_1\)


Let’s assume that pop size \(N_{S_1\cap D_0}\) is proportional to relative area of \(S_1\cap D_0\) vs \(S_1\).

Logic of area weighting for extensive variables


From the map, we see that \(\text{area}(S_1\cap D_0)=3\times 7=21\) and \(\text{area}(S_1)=5\times 10=50\).

Constructing area weights for \(S_1\cap D_0\)


Multiply this “area weight” by the number of hats in \(S_1\) to get subtotal for \(S_1\cap D_0\).

Constructing area weighted subtotals for \(S_1\cap D_0\)


Multiply “area weight” by number of people in \(S_1\) to get sub-population of \(S_1\cap D_0\).

Constructing area weighted subtotals for \(S_1\cap D_0\)


Repeat exercise for \(S_2\cap D_0\): \(\frac{15}{50}\times18\text{ Hats}=5.4\text{ Hats}\), \(\frac{15}{50}\times50\text{ People}=15\text{ People}\)

Constructing area weighted subtotals for \(S_2\cap D_0\)


Repeat exercise for \(S_3\cap D_0\): \(\frac{10}{25}\times17\text{ Hats}=6.8\text{ Hats}\), \(\frac{10}{25}\times25\text{ People}=10\text{ People}\)

Constructing area weighted subtotals for \(S_3\cap D_0\)


Repeat exercise for \(S_4\cap D_0\): \(\frac{4}{25}\times12\text{ Hats}=1.92\text{ Hats}\), \(\frac{4}{25}\times25\text{ People}=4\text{ People}\)

Constructing area weighted subtotals for \(S_4\cap D_0\)


Combine the four subtotals into an area-weighted estimate of hats for all of \(D_0\).

Constructing area weighted sums in \(D_0\)


Combine the four subtotals into an area-weighted population estimate for all of \(D_0\).

Constructing area weighted sums in \(D_0\)


Divide weighted # of hats by weighted population to get “% Hats” estimate for \(D_0\).

Constructing area weighted statistics for \(D_0\)


Can’t we interpolate %’s directly, instead of numerator and denominator separately?

Interpolating “% Hats” as an intensive variable


Yes, but the weights would be different: \(\frac{\text{area}(S_1\cap D_0)}{\text{area}(D_0)}\), i.e. the intersection’s share of the destination \(D_0\) rather than of the source \(S_1\).

Area weights for an intensive variable


The area of \(S_1\cap D_0\) is \(3\times 7=21\), and \(\text{area}(D_0)=5\times 10=50\), so \(w=0.42\) again.

Area weights for intensive variable in \(S_1\cap D_0\)


Multiplying the weight by “% Hats” in \(S_1\), we get \(\frac{21}{50}\times 4\%=1.68\%\) Hats.

Area weighted subtotals for intensive variable in \(S_1\cap D_0\)


Repeat for \(S_2\cap D_0\): area weight \(\frac{15}{50}\times 36\%\text{ Hats in } S_2=10.8\%\text{ Hats}\).

Area weighted subtotals for intensive variable in \(S_2\cap D_0\)


Repeat for \(S_3\cap D_0\): area weight \(\frac{10}{50}\times 68\%\text{ Hats in } S_3=13.6\%\text{ Hats}\).

Area weighted subtotals for intensive variable in \(S_3\cap D_0\)


Repeat for \(S_4\cap D_0\): area weight \(\frac{4}{50}\times 48\%\text{ Hats in } S_4=3.84\%\text{ Hats}\).

Area weighted subtotals for intensive variable in \(S_4\cap D_0\)


Combine the four subtotals into an area-weighted estimate of “% Hats” for all of \(D_0\).

Constructing area weighted sums in \(D_0\)
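The intensive-variable calculation can be checked with a short Python sketch. The areas and percentages are the ones quoted in the steps above; the variable names are illustrative, and the sketch assumes the four intersections tile \(D_0\) exactly.

```python
# Intersection areas of S1..S4 with D0, as quoted in the worked example.
intersection_areas = {"S1": 21.0, "S2": 15.0, "S3": 10.0, "S4": 4.0}
# Observed "% Hats" in each source sector.
pct_hats = {"S1": 4.0, "S2": 36.0, "S3": 68.0, "S4": 48.0}

# Assumption: the four pieces cover D0 completely, so their areas sum to area(D0).
area_d0 = sum(intersection_areas.values())

# Intensive weights: each intersection's share of the destination's area.
weights = {s: a / area_d0 for s, a in intersection_areas.items()}

# Area-weighted mean of the source percentages.
pct_hats_d0 = sum(weights[s] * pct_hats[s] for s in weights)
print(round(pct_hats_d0, 2))  # 29.92
```

The weights sum to 1, so the result is a weighted mean of the four source percentages.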


Let’s compare these estimates to the ground truth (count how many people in \(D_0\)).

“Ground truth-ing” number of people in \(D_0\)


Let’s compare these estimates to the ground truth (count how many hats in \(D_0\)).

“Ground truth-ing” number of hats in \(D_0\)


Our 1st weighted estimate (32.04%, extensive) is closer than 2nd (29.9%, intensive).

“Ground truth-ing” percent of hat-wearers in \(D_0\)


Pseudocode for areal interpolation

  1. Intersect \(\mathcal{G}_{S}\) and \(\mathcal{G}_{D}\), creating a third polygon layer \(\mathcal{G}_{S\cap D}\),
    • each feature \(i\cap j\in \{1,\dots,N_{S\cap D}\}\) is a part of source polygon \(i\) that falls inside destination polygon \(j\).
  2. Compute area weights for each intersection \(i\cap j\),
    1. for extensive variables: \(w_{i\cap j}^{\text{(ext)}}=\frac{a_{i\cap j}}{a_i}\)
      (i.e. share of \(i\)’s area represented by intersection \(i\cap j\))
    2. for intensive variables: \(w_{i\cap j}^{\text{(int)}}=\frac{a_{i\cap j}}{a_j}\)
      (i.e. share of \(j\)’s area contributed by intersection \(i\cap j\))
  3. Combine weighted statistics for each destination polygon \(j\):
    1. \(\hat{x}_j=\sum_{i\cap j}^{N_{\cap j}} w_{i\cap j}x_{i\cap j}\), where \(x_{i\cap j}\) is the value of \(x\) that intersection \(i\cap j\) inherits from its source polygon \(i\), and \(N_{\cap j}\) is the number of intersections in \(j\)
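The pseudocode above can be turned into a minimal Python sketch. It assumes step 1 (the geometric intersection) has already been done elsewhere, so the input is a table of intersection areas. The function name, argument layout, and toy data are assumptions for illustration, not part of any particular GIS package.

```python
from collections import defaultdict


def areal_interpolation(intersections, source_area, dest_area, values, extensive=True):
    """Area-weighted change of support.

    intersections: iterable of (source_id, dest_id, intersection_area)
    source_area, dest_area: dicts mapping unit id -> area
    values: dict mapping source_id -> observed value of x
    """
    estimates = defaultdict(float)
    for i, j, a_ij in intersections:
        if extensive:
            w = a_ij / source_area[i]  # share of source i that falls inside j
        else:
            w = a_ij / dest_area[j]    # share of destination j contributed by i
        estimates[j] += w * values[i]
    return dict(estimates)


# Hypothetical toy data: sources A, B and destinations X, Y.
source_area = {"A": 4.0, "B": 6.0}
dest_area = {"X": 5.0, "Y": 5.0}
intersections = [("A", "X", 4.0), ("B", "X", 1.0), ("B", "Y", 5.0)]

counts = {"A": 40.0, "B": 120.0}                        # extensive variable
density = {k: counts[k] / source_area[k] for k in counts}  # intensive variable

counts_hat = areal_interpolation(intersections, source_area, dest_area, counts)
density_hat = areal_interpolation(
    intersections, source_area, dest_area, density, extensive=False
)
```

In this toy case the extensive estimates are about 60 for X and 100 for Y, so the destination total matches the source total of 160. The intensive estimates (about 12 and 20) equal the interpolated counts divided by the destination areas.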

Areal interpolation is just one of many potential CoS methods

 

Examples:

  • simple overlay
  • population weighted interpolation
  • ordinary kriging
  • universal kriging
  • thin-plate splines and random forests

these differ in their assumptions
(e.g. uniformity vs. heterogeneity) and requirements (e.g. ancillary data)

 

… what’s more important is not the choice of CoS algorithm, but the relative scale and nesting of source and destination units


Choice paralysis

Assessing transformation quality


Precinct-to-constituency CoS (\(RS=1, RN=0.98\))

 

Different CoS algorithms \(\to\) Different transformed values


Constituency-to-grid CoS (\(RS=0.12, RN=0.29\))

 

But how do \(RS\), \(RN\) affect the quality of transformations (prediction error, rank correlation, estimation bias), holding CoS algorithm constant?


Higher \(RS\), \(RN\) \(\to\) Lower prediction error relative to true values

 

How RN and RS affect root mean squared error


Higher \(RS\), \(RN\) \(\to\) Higher correlation b/w transformed values & true values

 

How RN and RS affect correlation


Higher \(RS\), \(RN\) \(\to\) Less bias in regression coefficients

 

How RN and RS affect OLS estimation bias
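Where true destination-unit values are available, two of the quality measures named above can be computed as in the following sketch. The helper names and the four data points are hypothetical, and the rank correlation is implemented as a Pearson correlation of average ranks (Spearman's rho), which is one common choice rather than necessarily the one used for the figures.

```python
import math


def rmse(estimated, truth):
    """Root mean squared error of transformed values against true values."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimated, truth)) / len(truth))


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical true and transformed values for four destination units.
truth = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 39.0]

error = rmse(estimated, truth)
rank_corr = spearman(estimated, truth)
```

In this example the transformed values are off by a few units each, but they preserve the ordering of the true values, so the rank correlation is 1 even though the RMSE is above zero.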


What is to be done?

  1. General recommendations:
    • consider relative scale and nesting as ex ante measures of CoS complexity
    • check face validity of transformed values through visualization
  2. If “ground truth” data (micro data, cross-unit IDs) are available:
    • validate transformed values with micro data
    • use micro data as source units
    • match on common ID (if units are well-nested)
  3. If “ground truth” data are not available:
    • be transparent about limitations/assumptions
    • partial validation (if micro data available for some regions)
    • report results from alternative CoS algorithms when possible

 

Bad news: \(RN\) and \(RS\) can be calculated in R (SUNGEO::nesting()) but not in QGIS
(you can still do CoS in QGIS, using good judgement and common sense!)