What is geocoding?

  • assignment of a geographic code to descriptive locational data

Example:

  • input: “Ann Arbor”
  • output: (42.281, -83.748)


Find your location!


 

Geocoder components

  • input query (e.g. address)
    \(\downarrow\)
  • pre-processing algorithm
    (tokenization, standardization)
    \(\downarrow\)
  • matching algorithm
    (exact vs. fuzzy, tie-break rule)
    \(\downarrow\)
  • reference data (e.g. gazetteer)
    \(\downarrow\)
  • output feature (e.g. point, code)
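The pipeline above can be sketched end to end. This is a minimal illustration, not a real geocoder: the toy gazetteer entries and function names are made up, and a real system would query a full reference dataset.

```python
# Toy reference data: place name -> (lat, lon)
GAZETTEER = {
    "ann arbor": (42.281, -83.748),
    "detroit": (42.331, -83.046),
}

def preprocess(query: str) -> str:
    """Standardize the input query: lowercase, strip punctuation."""
    return "".join(c for c in query.lower()
                   if c.isalnum() or c.isspace()).strip()

def match(query: str):
    """Exact matching against the reference data; None on a non-match."""
    return GAZETTEER.get(query)

def geocode(query: str):
    """Run the full pipeline: pre-process, match, return output feature."""
    return match(preprocess(query))

print(geocode("Ann Arbor"))   # (42.281, -83.748)
```

Each stage maps onto one box in the diagram: `preprocess` is the pre-processing algorithm, `match` is the (exact) matching algorithm over the reference data, and the returned coordinate pair is the output feature.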


Input address, output data

Input

Input queries


What can be geocoded?

 

Descriptive locational data:

  1. Postal addresses
    (“1201 South Main Street,
    Ann Arbor, MI 48104-3722”)
  2. Street intersections
    (“South Main and Stadium,
    Ann Arbor, MI 48104-3722”)
  3. Partial addresses
    (“S. Main St., Ann Arbor, MI”)
  4. Postal codes (“48104-3722”)
  5. Named buildings, landmarks
    (“Michigan Stadium”)
  6. General place names (“Ann Arbor”)
  7. Free-form queries (“The Big House”)


Hail to the victors


Sources of error in input data

  1. Imprecise queries \(\to\) imprecise output
    (street address vs. county name)
  2. Ambiguous queries \(\to\) multiple matches
    (Springfield, Portland, Alexandria)
  3. Too much precision \(\to\) fewer matches
    (regimental command post at Hill 55)
  4. Alt. spellings, typos \(\to\) false matches
    (Granada, Spain vs. state of Grenada)
  5. Place name changes \(\to\) non-matches
    (Aleksandrovka/Yuzovka/Stalino/Donets’k)
  6. Slang, nicknames \(\to\) non-matches
    (“Paris of the Midwest”, “Motown”)

How to avoid some of these problems?

  • pre-process the text of the input query


Wrong number

Pre-processing algorithm


What is pre-processing?

  • standardization and normalization of input into a format and syntax compatible with the reference dataset

Why pre-process?

  • prevent avoidable geocoding errors
  • becomes more important where text is more unstructured, ambiguous
    • easy: “Ann Arbor, MI”
    • hard: “the Michigan city of Ann Arbor”
    • harder: “I met my friend Dallas when we were both college students, living in A2”


Undeliverable address


Common pre-processing tasks

  1. Remove HTML tags, control characters
  2. Remove non-alphanumeric characters
  3. Remove capitalization
  4. Remove punctuation
  5. Part-of-speech (POS) tagging
  6. Lemmatization


Lost in translation


Filtering unnecessary words, text

 

Why strip capitalization, punctuation, etc.?

  1. Reconcile address formats
    (Cambridge, MA \(\neq\) Cambridge MA)
  2. Raise probability of match
    (Middlesex county \(\to\) middlesex county)
    (Middlesex County \(\to\) middlesex county)
  3. Avoid computational errors
    (\#, \% are special characters in many programming languages)


Sentences \(\to\) Tokens


Part-of-speech tagging

 

Do we care if a word is a noun or a verb?

 

It depends on the application:

  • well-formatted addresses:
    POS unimportant (“Ann Arbor, MI”)
  • unstructured queries:
    POS more important (“I met my friend Dallas when we were students in A2”)
  • various POS tagging software available online (nlp.stanford.edu)
  • some APIs do this automatically
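A toy illustration of why context matters for the "Dallas" example: real systems use trained POS/NER taggers (such as the Stanford tools mentioned above), so the cue list and rule here are purely illustrative.

```python
# Guess whether a capitalized token names a place or a person,
# using only the immediately preceding word as context.
# PERSON_CUES is a made-up list for illustration; a trained
# tagger would use far richer features.
PERSON_CUES = {"friend", "mr", "mrs", "ms", "dr"}

def is_place_mention(tokens, i):
    """Return True if tokens[i] looks like a place name
    rather than a person's name, given the preceding token."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    return prev not in PERSON_CUES

tokens = "I met my friend Dallas in Texas".split()
print(is_place_mention(tokens, 4))   # False: "friend Dallas" is a person
```

Even this one-word context rule correctly suppresses the false positive "my friend Dallas" \(\to\) Dallas, TX; trained taggers generalize the same idea.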


Sentence \(\to\) POS tags


Lemmatization

relating multiple versions of the same word to a common, standard term

  1. Many-to-one mappings
    • (Ann Arbor, A2, A-squared, the Deuce, Tree Town) \(\to\) Ann Arbor
    • useful to associate nicknames, historical names with single location
  2. One-to-many mappings
    • Dallas \(\to\) Dallas (TX)
    • Dallas \(\to\) Dallas (my friend)
    • Jackson \(\to\) Jackson (MS)
    • Jackson \(\to\) (Janet) Jackson
    • useful to distinguish places from people
    • requires info about word order, context

Procedure:

  • create lookup table for relevant terms
  • query table for each occurrence of word
  • trade-off: speed vs. accuracy
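The lookup-table procedure above, sketched for the many-to-one case (table contents are illustrative):

```python
# Many-to-one lemmatization via a lookup table.
# Lookups are O(1) (the "speed" side of the trade-off), but every
# variant must be enumerated in advance (the "accuracy" side).
LEMMA_TABLE = {
    "a2": "Ann Arbor",
    "a-squared": "Ann Arbor",
    "the deuce": "Ann Arbor",
    "tree town": "Ann Arbor",
    "motown": "Detroit",
}

def lemmatize(term: str) -> str:
    """Map a nickname or variant to its standard place name;
    unknown terms pass through unchanged."""
    return LEMMA_TABLE.get(term.lower(), term)

print(lemmatize("A2"))        # Ann Arbor
print(lemmatize("Chicago"))   # Chicago
```

One-to-many mappings (Dallas the city vs. Dallas the friend) cannot be resolved by a table alone, since the same key needs different outputs depending on context.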



Many-to-one example

Output

Matching algorithm


How to find the best output candidate?

 

  1. Exact vs. fuzzy matching
    • exact: Ann Arbor \(\neq\) ann arbor
    • fuzzy: Ann Arbor \(\sim\) ann arbor
  2. Non-match rule (if zero matches)
    • return N/A?
    • geocode at lower resolution?
    • query additional datasets?
  3. Tie-breaking rule (if multiple matches)
    • first match?
    • random match?
    • most precise match?
    • most popular match?


Match-making



Sources of error in matching

  1. False positive matches:
    “my friend Dallas” \(\to\) Dallas, TX
  2. False negative matches: “A2” \(\to\) N/A
  3. Multiple matches:
    “Memphis” \(\to\) Memphis, TN; Memphis, Egypt


Bad film (probably)

Reference data


What are reference data?

 

Geographically coded information used to match input queries to output features

  1. Gazetteers
  2. TIGER/Line files
  3. Crowd-sourced


Like this, but electronic


Gazetteer data

  • dictionary of standard and alternate spellings of place names, and their geographic locations
    (e.g. NGA GEOnet Names)


Example gazetteer data


Topologically Integrated Geographic Encoding and Referencing (TIGER/Line)

  • U.S. Census Bureau’s digital database of road segments and address ranges, used to locate addresses along roads


Example TIGER/Line


Crowd-sourced data

  • user-generated location data from surveys, GPS devices, free sources
    (e.g. OpenStreetMap Nominatim)


OSM is free, Google isn’t


Sources of error in reference data

  • data quality can be region-specific
    (e.g. Google vs. Yandex)
  • less precise, sparser data in rural areas and developing countries
  • some datasets not frequently updated
  • different datasets use different standard name spellings


Re-routing


What is the output?

 

Any geographically referenced information:

  1. Point coordinates
    (longitude, latitude)
  2. Line features
    (TIGER line segment)
  3. Polygon features
    (parcel of land, census block, census tract, municipality, district, region, country)


Location found!



Sources of error in output

  1. Point locations for areal references
    • geographic centroid?
    • capital city?
    • population-weighted centroid?
  2. Linear interpolation on TIGER/Line shapefiles
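Item 2 can be sketched directly: a TIGER/Line segment stores an address range and endpoint coordinates, and a house number is placed proportionally along the segment. The address range and coordinates below are made up for illustration.

```python
def interpolate(house_no, lo, hi, start, end):
    """Linear interpolation along a street segment whose address
    range is [lo, hi], with endpoint coordinates start and end
    given as (lat, lon) pairs."""
    frac = (house_no - lo) / (hi - lo)   # fraction of the way along the block
    return (start[0] + frac * (end[0] - start[0]),
            start[1] + frac * (end[1] - start[1]))

# Hypothetical block of South Main Street, address range 1200-1298
pt = interpolate(1250, 1200, 1298, (42.2800, -83.7480), (42.2780, -83.7480))
print(pt)
```

The error source is built into the method: it assumes addresses are spaced evenly along the block and lie exactly on the street centerline, neither of which is true of real parcels.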


Wrong centroid

Wrong line