What is geocoding?

  • assignment of a geographic code to descriptive locational data

Example:

  • input: “Ann Arbor”
  • output: (42.281, -83.748)


Find your location!


 

Geocoder components

  • input query (e.g. address)
    \(\downarrow\)
  • pre-processing algorithm
    (tokenization, standardization)
    \(\downarrow\)
  • matching algorithm
    (exact vs. fuzzy, tie-break rule)
    \(\downarrow\)
  • reference data (e.g. gazetteer)
    \(\downarrow\)
  • output feature (e.g. point, code)
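The pipeline above can be sketched end to end. This is a minimal illustration, not a real geocoder: the toy gazetteer entries and function names are made up, and a real system would query a full reference dataset.

```python
# Toy reference data: place name -> (lat, lon)
GAZETTEER = {
    "ann arbor": (42.281, -83.748),
    "detroit": (42.331, -83.046),
}

def preprocess(query: str) -> str:
    """Standardize the input query: lowercase, strip punctuation."""
    return "".join(c for c in query.lower()
                   if c.isalnum() or c.isspace()).strip()

def match(query: str):
    """Exact matching against the reference data; None on a non-match."""
    return GAZETTEER.get(query)

def geocode(query: str):
    """Run the full pipeline: pre-process, match, return output feature."""
    return match(preprocess(query))

print(geocode("Ann Arbor"))   # (42.281, -83.748)
```

Each stage maps onto one box in the diagram: `preprocess` is the pre-processing algorithm, `match` is the (exact) matching algorithm over the reference data, and the returned coordinate pair is the output feature.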


Input address, output data

Input

Input queries


What can be geocoded?

 

Descriptive locational data:

  1. Postal addresses
    (“1201 South Main Street,
    Ann Arbor, MI 48104-3722”)
  2. Street intersections
    (“South Main and Stadium,
    Ann Arbor, MI 48104-3722”)
  3. Partial addresses
    (“S. Main St., Ann Arbor, MI”)
  4. Postal codes (“48104-3722”)
  5. Named buildings, landmarks
    (“Michigan Stadium”)
  6. General place names (“Ann Arbor”)
  7. Free-form queries (“The Big House”)


Hail to the victors


Sources of error in input data

  1. Imprecise queries \(\to\) imprecise output
    (street address vs. county name)
  2. Ambiguous queries \(\to\) multiple matches
    (Springfield, Portland, Alexandria)
  3. Too much precision \(\to\) fewer matches
    (regimental command post at Hill 55)
  4. Alt. spellings, typos \(\to\) false matches
    (Granada, Spain vs. state of Grenada)
  5. Place name changes \(\to\) non-matches
    (Aleksandrovka/Yuzovka/Stalino/Donets’k)
  6. Slang, nicknames \(\to\) non-matches
    (“Paris of the Midwest”, “Motown”)

How to avoid some of these problems?

  • pre-process the text of the input query


Wrong number

Pre-processing algorithm


What is pre-processing?

  • standardization and normalization of input into a format and syntax compatible with the reference dataset

Why pre-process?

  • prevent avoidable geocoding errors
  • becomes more important where text is more unstructured, ambiguous
    • easy: “Ann Arbor, MI”
    • hard: “the Michigan city of Ann Arbor”
    • harder: “I met my friend Dallas when we were both college students, living in A2”


Undeliverable address


Common pre-processing tasks

  1. Remove HTML tags, control characters
  2. Remove non-alphanumeric characters
  3. Remove capitalization
  4. Remove punctuation
  5. Part-of-speech (POS) tagging
  6. Lemmatization


Lost in translation


Filtering unnecessary words, text

 

Why strip capitalization, punctuation, etc.?

  1. Reconcile address formats
    (Cambridge, MA \(\neq\) Cambridge MA)
  2. Raise probability of match
    (Middlesex county \(\to\) middlesex county)
    (Middlesex County \(\to\) middlesex county)
  3. Avoid computational errors
    (\#, \% are special characters in many programming languages)


Sentences \(\to\) Tokens


Part-of-speech tagging

 

Do we care if a word is a noun or a verb?

 

It depends on the application:

  • well-formatted addresses:
    POS unimportant (“Ann Arbor, MI”)
  • unstructured queries:
    POS more important (“I met my friend Dallas when we were students in A2”)
  • various POS tagging software available online (nlp.stanford.edu)
  • some APIs do this automatically
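A toy illustration of why context matters for the "Dallas" example: real systems use trained POS/NER taggers (such as the Stanford tools mentioned above), so the cue list and rule here are purely illustrative.

```python
# Guess whether a capitalized token names a place or a person,
# using only the immediately preceding word as context.
# PERSON_CUES is a made-up list for illustration; a trained
# tagger would use far richer features.
PERSON_CUES = {"friend", "mr", "mrs", "ms", "dr"}

def is_place_mention(tokens, i):
    """Return True if tokens[i] looks like a place name
    rather than a person's name, given the preceding token."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    return prev not in PERSON_CUES

tokens = "I met my friend Dallas in Texas".split()
print(is_place_mention(tokens, 4))   # False: "friend Dallas" is a person
```

Even this one-word context rule correctly suppresses the false positive "my friend Dallas" \(\to\) Dallas, TX; trained taggers generalize the same idea.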


Sentence \(\to\) POS tags


Lemmatization

relating multiple versions of the same word to a common, standard term

  1. Many-to-one mappings
    • (Ann Arbor, A2, A-squared, the Deuce, Tree Town) \(\to\) Ann Arbor
    • useful to associate nicknames, historical names with single location
  2. One-to-many mappings
    • Dallas \(\to\) Dallas (TX)
    • Dallas \(\to\) Dallas (my friend)
    • Jackson \(\to\) Jackson (MS)
    • Jackson \(\to\) (Janet) Jackson
    • useful to distinguish places from people
    • requires info about word order, context

Procedure:

  • create lookup table for relevant terms
  • query table for each occurrence of word
  • trade-off: speed vs. accuracy
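The lookup-table procedure above, sketched for the many-to-one case (table contents are illustrative):

```python
# Many-to-one lemmatization via a lookup table.
# Lookups are O(1) (the "speed" side of the trade-off), but every
# variant must be enumerated in advance (the "accuracy" side).
LEMMA_TABLE = {
    "a2": "Ann Arbor",
    "a-squared": "Ann Arbor",
    "the deuce": "Ann Arbor",
    "tree town": "Ann Arbor",
    "motown": "Detroit",
}

def lemmatize(term: str) -> str:
    """Map a nickname or variant to its standard place name;
    unknown terms pass through unchanged."""
    return LEMMA_TABLE.get(term.lower(), term)

print(lemmatize("A2"))        # Ann Arbor
print(lemmatize("Chicago"))   # Chicago
```

One-to-many mappings (Dallas the city vs. Dallas the friend) cannot be resolved by a table alone, since the same key needs different outputs depending on context.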



Many-to-one example

Output

Matching algorithm


How to find the best output candidate?

 

  1. Exact vs. fuzzy matching
    • exact: Ann Arbor \(\neq\) ann arbor
    • fuzzy: Ann Arbor \(\sim\) ann arbor
  2. Non-match rule (if zero matches)
    • return N/A?
    • geocode at lower resolution?
    • query additional datasets?
  3. Tie-breaking rule (if multiple matches)
    • first match?
    • random match?
    • most precise match?
    • most popular match?


Match-making



Sources of error in matching

  1. False positive matches:
    “my friend Dallas” \(\to\) Dallas, TX
  2. False negative matches: “A2” \(\to\) N/A
  3. Multiple matches:
    “Memphis” \(\to\) Memphis, TN; Memphis, Egypt


Bad film (probably)

Reference data


What are reference data?

 

Geographically coded information used to match input queries to output features

  1. Gazetteers
  2. TIGER/Line files
  3. Crowd-sourced


Like this, but electronic


Gazetteer data

  • dictionary of standard and alternate spellings of place names, and their geographic locations
    (e.g. NGA GEOnet Names)


Example gazetteer data


Topologically Integrated Geographic Encoding and Referencing (TIGER/Line)

  • U.S. Census Bureau’s digital database of road segments and address ranges, used to locate addresses along roads


Example TIGER/Line


Crowd-sourced data

  • user-generated location data from surveys, GPS devices, free sources
    (e.g. OpenStreetMap Nominatim)


OSM is free, Google isn’t


Sources of error in reference data

  • data quality can be region-specific
    (e.g. Google vs. Yandex)
  • less precise, sparser data in rural areas and developing countries
  • some datasets not frequently updated
  • different datasets use different standard name spellings


Re-routing


What is the output?

 

Any geographically referenced information:

  1. Point coordinates
    (longitude, latitude)
  2. Line features
    (TIGER line segment)
  3. Polygon features
    (parcel of land, census block, census tract, municipality, district, region, country)


Location found!



Sources of error in output

  1. Point locations for areal references
    • geographic centroid?
    • capital city?
    • population-weighted centroid?
  2. Linear interpolation on TIGER/Line shapefiles
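Item 2 can be sketched directly: a TIGER/Line segment stores an address range and endpoint coordinates, and a house number is placed proportionally along the segment. The address range and coordinates below are made up for illustration.

```python
def interpolate(house_no, lo, hi, start, end):
    """Linear interpolation along a street segment whose address
    range is [lo, hi], with endpoint coordinates start and end
    given as (lat, lon) pairs."""
    frac = (house_no - lo) / (hi - lo)   # fraction of the way along the block
    return (start[0] + frac * (end[0] - start[0]),
            start[1] + frac * (end[1] - start[1]))

# Hypothetical block of South Main Street, address range 1200-1298
pt = interpolate(1250, 1200, 1298, (42.2800, -83.7480), (42.2780, -83.7480))
print(pt)
```

The error source is built into the method: it assumes addresses are spaced evenly along the block and lie exactly on the street centerline, neither of which is true of real parcels.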


Wrong centroid

Wrong line