What is georeferencing?

assignment of geographic objects to geographic locations
relation of map image to system of geographic coordinates on the ground

This is georeferencing

What is vectorization?

generation of vector features from georeferenced raster images
opposite is called rasterization
(which is much easier)

This is vectorization

Georeferencing

Why georeference?

maps contain data you can’t find anywhere else
georeferencing allows us to
- extract and preserve these data
- combine map with other types of geospatial data
- use these data to answer social, economic and political questions

Portion of NKVD tactical map of Moscow from 1938

NKVD jurisdictional borders

Overview

What is involved?

Obtain digital image of map (e.g. scan, web)
Select ground control points
Transform map to align with chosen coordinate system

Step 1

Step 2

Step 3

What can we georeference?

historical maps
satellite and aerial photography
administrative and military maps
interesting maps you found online

Massachusetts, 1755

1936 Federal Housing Administration redlining map of Boston

Boston, 1936

Boston, 1955

Challenges

projection often unknown
scale/resolution may be coarse
distances/angles/shapes may be inaccurate (esp. in older maps)
impossible to perfectly align historical maps with modern coordinate systems

Close, but not quite

Examples of GCPs

intersections of graticule lines (most reliable)
landmarks of known location (e.g. buildings, crossroads, hills, cities)
distinctive geographic features (e.g. coastal features, curves in rivers, borders)

Graticules

Intersections

Landmarks

Transformations

Transformation (“rubber sheeting”)

shift and warp the raster so that locations in the original image land at their spatially correct positions
apply mathematical algorithm to match source control points with target control points
process changes distances, appearance of lines and shapes

Transformed raster

Rubber sheeting

Challenges

transformation distorts original map image
results sensitive to choice of transformation algorithm
output only reliable within area confined by GCPs

Distortion

Algorithm

Range

Polynomial transformations

uses a polynomial built on control points and a least-squares fitting algorithm
optimized for global accuracy, not always local accuracy

Order

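\(\beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \dots + \beta_k x^k\), where \(k\) is order of polynomial and \(\beta_0, \dots, \beta_k\) are coefficients estimated from control points
higher order \(\to\) able to correct more complex distortion
but transformations above 3rd order are rarely needed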

Higher order \(\to\) closer fit

Zero-order polynomial

shifts raster location
used when raster is already georeferenced, but slightly mis-aligned
requires \(\geq 1\) control point
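
A minimal sketch of a zero-order fit, assuming control points are given as (x, y) pairs. With a pure shift, the least-squares solution is the average offset between source and target points. The function names and coordinates below are hypothetical, chosen only for illustration.

```python
def estimate_shift(source_pts, target_pts):
    """Zero-order fit: average offset between paired control points."""
    n = len(source_pts)
    dx = sum(t[0] - s[0] for s, t in zip(source_pts, target_pts)) / n
    dy = sum(t[1] - s[1] for s, t in zip(source_pts, target_pts)) / n
    return dx, dy


def apply_shift(point, shift):
    """Move a point by the estimated offset."""
    return (point[0] + shift[0], point[1] + shift[1])


# Hypothetical control points: where features are now vs. where they should be
source = [(10.0, 20.0), (30.0, 40.0)]
target = [(12.0, 19.0), (32.0, 39.0)]

shift = estimate_shift(source, target)
corrected = apply_shift((0.0, 0.0), shift)
```

One control point is enough to define the shift; additional points average out placement error.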

First-order (Affine) polynomial

shift/scale/rotate a raster
straight lines on input raster will remain straight
requires \(\geq 3\) control points
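
A rough sketch of a first-order fit, assuming the model \(X = a_0 + a_1 x + a_2 y\), \(Y = b_0 + b_1 x + b_2 y\) and solving the least-squares normal equations in plain Python. The helper names (`solve_linear`, `fit_affine`, `apply_affine`) and the control point values are hypothetical; GIS software will typically rely on an optimized linear algebra routine instead.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("control points are degenerate (e.g. collinear)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_affine(source_pts, target_pts):
    """Least-squares fit of a first-order polynomial to control points."""
    if len(source_pts) < 3:
        raise ValueError("need at least 3 control points")
    rows = [[1.0, x, y] for x, y in source_pts]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    coeffs = []
    for k in (0, 1):  # fit the X equation, then the Y equation
        atb = [sum(r[i] * t[k] for r, t in zip(rows, target_pts)) for i in range(3)]
        coeffs.append(solve_linear(ata, atb))
    return coeffs  # [[a0, a1, a2], [b0, b1, b2]]


def apply_affine(coeffs, point):
    """Map a source (pixel) coordinate to a target (map) coordinate."""
    x, y = point
    return tuple(c[0] + c[1] * x + c[2] * y for c in coeffs)


# Hypothetical control points: pixel (col, row) -> map (easting, northing)
pixels = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
ground = [(500.0, 1000.0), (700.0, 1000.0), (500.0, 800.0), (700.0, 800.0)]

coeffs = fit_affine(pixels, ground)
center = apply_affine(coeffs, (50.0, 50.0))
```

With exactly 3 non-collinear points the fit passes through every control point; with more, placement errors are spread across all points.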

Examples in one dimension

Second-order polynomial
- applies quadratic formula to calculate raster cell position
- straight lines on input raster will be warped
- requires \(\geq 6\) control points
Third-order polynomial
- applies cubic formula
- straight lines, margins on input raster will be warped
- corrects more complex distortions
- requires \(\geq 10\) control points
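
The minimum control point counts listed for zero- through third-order polynomials follow from counting coefficients. In two dimensions, a polynomial of order \(k\) has one coefficient for every term \(x^i y^j\) with \(i + j \leq k\), which gives \((k+1)(k+2)/2\) unknowns per output coordinate, and each control point supplies one equation per coordinate. A short sketch (the function name is hypothetical):

```python
def min_control_points(order):
    """Number of terms x^i * y^j with i + j <= order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) * (order + 2) // 2


# Orders 0 through 3, as in the polynomial transformations above
minimums = [min_control_points(k) for k in range(4)]
print(minimums)  # [1, 3, 6, 10]
```

These are lower bounds; using more points than the minimum lets the least-squares fit average out errors in individual points.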

Examples in two dimensions

Projective transformation
- linear rotation, translation
- warps the image while keeping straight lines straight
- useful for oblique imagery, scanned maps
- requires \(\geq 4\) control points
Spline transformation
- uses piecewise polynomial that maintains continuity between adjacent polynomials
- optimized for local accuracy, not always global accuracy
- minimal local error
- requires \(\geq 10\) control points
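
To illustrate the projective case, here is a small sketch that applies a projective transformation with eight coefficients and checks two properties: collinear points stay collinear, but (unlike an affine transformation) midpoints are not preserved. The coefficient values and function names are hypothetical, picked only to make the effect visible.

```python
def apply_projective(coeffs, point):
    """Apply X = (a*x + b*y + c) / w, Y = (d*x + e*y + f) / w, w = g*x + h*y + 1."""
    a, b, c, d, e, f, g, h = coeffs
    x, y = point
    w = g * x + h * y + 1.0
    return ((a * x + b * y + c) / w, (d * x + e * y + f) / w)


def cross(p, q, r):
    """Twice the signed area of triangle pqr; zero when points are collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


# Hypothetical coefficients; g and h introduce the perspective effect
coeffs = (1.0, 0.2, 5.0, 0.1, 1.0, -3.0, 0.001, 0.002)

# Three points on one straight line, the second being the midpoint
line = [(0.0, 0.0), (50.0, 25.0), (100.0, 50.0)]
warped = [apply_projective(coeffs, p) for p in line]

still_straight = abs(cross(*warped)) < 1e-9
midpoint_x = (warped[0][0] + warped[2][0]) / 2
midpoint_moved = abs(warped[1][0] - midpoint_x) > 1.0
```

Setting `g` and `h` to zero reduces this to an affine transformation, in which case the midpoint would be preserved as well.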

Comparison of projective and affine transformations

Projective vs. affine

Spline

Vectorization

Why vectorize?

vector is standard data structure for quantities of interest to public policy and social science (e.g. events, roads, administrative zones)
smaller data size (usually)
objects can have multiple attributes
allows more sophisticated spatial analyses
preserves quality at all scales

Enhance!

Options

Two ways to identify vector features

Image tracing (manual or automated)
Computer vision (machine learning)

Trace

Illustration of neural network predictions

Learn

Image tracing
- drawing over a raster image with vectors
- manual tracing: tracing over the image by hand (using mouse or stylus)
- automated tracing: use computer algorithm to detect features, redraw them as vector points, lines, polygons

Manual

Automated

Manual tracing

Advantages

can work with images of any quality
better understanding of context/meaning
produces fewer artifacts/superfluous features

Disadvantages

slow, inefficient
relatively imprecise/inconsistent
subject to laziness/fatigue

Automated tracing

Advantages

fast and efficient
output is consistent, replicable

Disadvantages

more sensitive to image quality
can require extensive pre-processing/cleanup
works best with fewer colors

Computer vision / deep learning
- automated feature detection, extraction
- system “learns” what is/isn’t a feature through training data (e.g. examples of points, lines, polygons in raster images)
- cross-validation of results to improve predictive fit
- examples:
  - convolutional neural networks
  - recurrent neural networks
  - long short term memory models
  - transformer models

Illustration of image classification model

Which houses have pools

Machine learning

Advantages

fast and efficient
well-suited for large-scale tasks, where fixed rules lead to systematic errors

Disadvantages

requires large volume of training data
requires high-performance computing infrastructure, programming expertise
same pre-/post-processing issues as automated tracing
not (yet) available in standard GIS software

Computer vision tasks

Sources of error

Automated vectorization

Raster-to-point
- all non-zero/non-null cells become points
Raster-to-line
- trace positions of non-zero/non-null pixels to identify polyline features
Raster-to-polygon
- use groups of connected pixels with identical values to find areas of a raster
- determine intersection points of area boundaries, generate lines
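
A simplified sketch of two of these conversions, assuming the raster is a small list-of-lists grid where 0 and None mark empty cells. The first function implements raster-to-point; the second covers only the first step of raster-to-polygon (grouping connected cells with identical values) and leaves boundary tracing out. Function names and the grid are hypothetical, and 4-connectivity is an assumption.

```python
from collections import deque


def raster_to_points(grid):
    """Return (row, col, value) for every non-zero, non-null cell."""
    return [(r, c, v)
            for r, row in enumerate(grid)
            for c, v in enumerate(row)
            if v not in (0, None)]


def connected_regions(grid):
    """Group 4-connected cells that share the same non-zero value."""
    n_rows, n_cols = len(grid), len(grid[0])
    seen = set()
    regions = []
    for r in range(n_rows):
        for c in range(n_cols):
            if (r, c) in seen or grid[r][c] in (0, None):
                continue
            value, cells, queue = grid[r][c], [], deque([(r, c)])
            seen.add((r, c))
            while queue:  # breadth-first flood fill
                i, j = queue.popleft()
                cells.append((i, j))
                for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    if (0 <= ni < n_rows and 0 <= nj < n_cols
                            and (ni, nj) not in seen
                            and grid[ni][nj] == value):
                        seen.add((ni, nj))
                        queue.append((ni, nj))
            regions.append((value, cells))
    return regions


grid = [
    [1, 1, 0, 2],
    [1, 0, 0, 2],
    [0, 0, 3, 0],
]

points = raster_to_points(grid)
regions = connected_regions(grid)
sizes = [(value, len(cells)) for value, cells in regions]
```

Each region would then be outlined to produce a polygon; the single-cell region with value 3 is the kind of small feature that often needs cleanup afterwards.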

Vector and raster representations of same objects

Usually not so seamless

Types of vectorization errors

False positives
- identification of features where none exist
  (generates small/superfluous vertices that must be removed)
False negatives
- failure to identify features where they exist
  (creates gaps, incomplete features)
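
One way to tally both error types, assuming a hand-checked reference set of features is available for comparison. The feature identifiers and the function name are hypothetical.

```python
def vectorization_errors(detected, truth):
    """Compare detected feature IDs against a reference set."""
    detected, truth = set(detected), set(truth)
    return {
        "false_positives": sorted(detected - truth),  # found, but not real
        "false_negatives": sorted(truth - detected),  # real, but missed
        "correct": sorted(detected & truth),
    }


# Hypothetical example: one spurious speck detected, one road missed
truth = ["road_1", "road_2", "road_3"]
detected = ["road_1", "road_3", "speck_7"]

errors = vectorization_errors(detected, truth)
```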

How to reduce errors

Pre-processing
- remove noise, unnecessary elements, colors
Post-processing
- remove superfluous features, fill gaps, improve appearance

Vectorized “roads”

Pre-processing of rasters
- reclassification: conversion from color/greyscale to binary
- thinning: reduce thickness of features to a single, connected line of pixels
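
A minimal sketch of reclassification, assuming an 8-bit greyscale scan (0 = black, 255 = white) with dark ink on light paper. The threshold of 128 and the function name are hypothetical; in practice the cutoff depends on the scan. Thinning is not shown.

```python
def reclassify_to_binary(grey, threshold=128):
    """Cells darker than the threshold become 1 (feature), others 0 (background)."""
    return [[1 if v < threshold else 0 for v in row] for row in grey]


# Hypothetical greyscale patch: a dark vertical line with a short branch
grey = [
    [250, 30, 245],
    [240, 25, 60],
    [255, 20, 250],
]

binary = reclassify_to_binary(grey)
```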

Thinning example

Post-processing of vector features
- collapsing: simplification by removal of spurious nodes, segments, closing of gaps
- smoothing: generalization/averaging to smooth pixelated appearance of output vectors
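
As an illustration of removing spurious nodes, here is a sketch of the Douglas-Peucker line simplification algorithm. This is one common approach, not necessarily the one a given GIS package applies. The function names, coordinates and tolerance are hypothetical.

```python
import math


def point_line_distance(p, a, b):
    """Perpendicular distance from p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    length = math.hypot(bx - ax, by - ay)
    if length == 0:
        return math.hypot(px - ax, py - ay)
    return abs((bx - ax) * (ay - py) - (ax - px) * (by - ay)) / length


def simplify(points, tolerance):
    """Douglas-Peucker: drop vertices that deviate from the line by <= tolerance."""
    if len(points) < 3:
        return list(points)
    distances = [point_line_distance(p, points[0], points[-1])
                 for p in points[1:-1]]
    idx = max(range(len(distances)), key=distances.__getitem__) + 1
    if distances[idx - 1] <= tolerance:
        return [points[0], points[-1]]  # all interior vertices are spurious
    left = simplify(points[:idx + 1], tolerance)
    right = simplify(points[idx:], tolerance)
    return left[:-1] + right  # avoid duplicating the split vertex


# Hypothetical "staircase" produced by tracing a diagonal line of pixels
stairs = [(0, 0), (1, 0), (1, 1), (2, 1), (2, 2), (3, 2), (3, 3)]
smooth = simplify(stairs, tolerance=1.0)

# A genuine corner survives because it deviates by more than the tolerance
corner = simplify([(0, 0), (5, 0), (5, 5)], tolerance=1.0)
```

The tolerance controls the trade-off: too small and the pixel staircase remains, too large and real bends in roads or borders are lost.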

Example of vectorized data before post-processing

Lots to clean up

API-231 / GIS-PubPol / Meeting 06 (Georeferencing and Vectorization)

Yuri M. Zhukov

Visiting Associate Professor of Public Policy

Harvard Kennedy School

February 13, 2024
