Plan for today
Migration data
“Where can I find global, annual emigration data by country?”
Migration data (global coverage)
Source/link | Type | Spatial scale | Frequency | Availability |
---|---|---|---|---|
WB Global Bilateral Migration | Migration flows (origin-dest.) | Country | Annual | 1960-2000 |
WB Open Data | Net migration, migrant stock | Country | Annual | 1960-2023 |
IOM Migration Data Portal | Multiple indicators | Country | Annual | 1990-2020 |
Our World in Data | Multiple indicators | Country | Annual | 1960-2021 |
IDMC Data Portal | IDPs from conflict, disasters | Country | Annual | 2018-2022 |
Migration data (sub-national and specialized data)
Source/link | Type | Spatial scale | Frequency | Availability |
---|---|---|---|---|
IOM Displacement Tracking Matrix | IDP flows (origin-dest.) | Country, Adm1, Adm2 | Variable | 2010-2024 |
UNHCR Data Portal | IDPs, refugees | Country, Adm1 | Variable | Variable |
CTDC | Human trafficking | Individual | Annual | 1960-2023 |
DHS Immigration Data | Multiple indicators | Points of entry | Monthly | 2002-2024 |
Classifying points by location
“I have a dataset in .CSV format with 700+ rows. I want to classify each data point into one of two categories, based on their geographical location … (e.g. classifying whether an oil spill occurred in an offshore or onshore area).”
There are two ways to do this in QGIS: Intersection or Join attributes by location. Let’s demonstrate here with data we’ve used before on dams (points) and country borders (polygons defining areas/categories).
The Intersection tool (Vector \(\to\) Geoprocessing tools) will assign the attributes of the polygon that intersects with each point, while dropping points that fall outside the polygons.
Select the point layer as the Input layer and the polygons as the Overlay layer, and adjust the overlay fields to keep/drop as needed.
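If you prefer to script this in R rather than use the QGIS GUI, a rough equivalent of the Intersection step with the sf package is sketched below. The file names are placeholders, not the actual course files.

```r
library(sf)

# placeholder file names; substitute your own dams (points) and borders (polygons)
dams    <- st_read("dams.shp")
borders <- st_read("borders.shp")

# point-in-polygon intersection: keeps only the points that fall inside a
# polygon, and appends that polygon's attributes to each point
dams_int <- st_intersection(dams, borders)

nrow(dams)      # number of points in the original layer
nrow(dams_int)  # fewer: points outside all polygons are dropped
```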
If we compare the attribute tables of the intersection (top) vs. the original (bottom), we see that the intersection contains multiple additional columns from the polygon layer (e.g. ADMIN, ADM0_A3, etc.), while the original ends with LAT_DD.
However, the feature count in the layer menu tells us that the intersection layer contains 6832 features (points), but the original dams layer contained 6862. So we lost 30 dams that fell outside of all national borders. What if we want to keep them?
The other option is to use the Join Attributes by Location tool (Processing Toolbox \(\to\) Vector general). It’s the same idea, but with more options (like whether to \(\square\) “Discard records which could not be joined”)
The feature count for the joined layer is the same as for the original dams layer (6862), as long as \(\square\) “Discard records which could not be joined” is unchecked.
The points that intersect with no polygons are given NULL values for the joined fields. We can select them, and see that most of these are in coastal waters or on islands in/near international waters.
You can plot the joined attributes to be sure that everything worked out as expected
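The keep-everything behavior can be reproduced in R with sf::st_join. The sketch below reuses the placeholder dams and borders objects from the intersection example above, and assumes the polygon layer has an ADMIN column.

```r
# spatial join that keeps all points: points matching no polygon get NA
# in the joined fields (the analogue of leaving "Discard records which
# could not be joined" unchecked in QGIS)
dams_joined <- st_join(dams, borders, join = st_intersects, left = TRUE)

nrow(dams_joined)  # same as nrow(dams), assuming the polygons do not overlap

# inspect the points that fell outside all polygons
dams_joined[is.na(dams_joined$ADMIN), ]
```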
Visualizing comparisons between maps
“What is the best recommended way to show comparison between maps? … what mapping techniques do you recommend for displaying two different datasets on one map? (e.g. Climate and violence, or voting preferences and income per capita).”
If one variable is continuous (e.g. income) and the other is categorical (e.g. yes/no), you can use a gradient for the continuous variable, and shading lines for the categorical one (QGIS: Single Symbol / Hashed; R: plot(..., density=15, angle=30)). You’ll need to duplicate the layer to display \(>\) 1 variable at a time.
Example from Baum and Zhukov (2015)
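As a rough R illustration of the gradient-plus-hatching idea (not the exact code behind the example above): assume an sf polygon layer adm with a continuous income variable and a binary violence indicator, with no missing values. The density/angle hatching is supported by sp’s plot method, so we coerce the layer first.

```r
library(sf)
library(sp)

adm_sp <- as(adm, "Spatial")   # sp's plot method accepts density/angle for hatching

# gradient fill for the continuous variable (darker = higher income)
shades <- gray(1 - (adm$income - min(adm$income)) / diff(range(adm$income)))
plot(adm_sp, col = shades, border = "gray40")

# hashed overlay for the categorical variable, drawn on top of the gradient
plot(adm_sp[adm$violence == 1, ], density = 15, angle = 30, add = TRUE)
```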
If both variables are continuous, it is better to create two maps side-by-side.
Example from Zhukov (2016)
When you do this, make sure the map extent is the same for both maps, and keep everything identical except for the variables you want to compare.
Example from Rozenas et al (2017)
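A minimal base-R sketch of the side-by-side layout, assuming the same placeholder layer adm now holds two continuous variables, climate and violence:

```r
par(mfrow = c(1, 2))
# key.pos = NULL and reset = FALSE stop plot.sf from overriding par(mfrow);
# both panels use the same layer, so the extent is identical by construction
# (with different layers, pass the same xlim/ylim to both calls)
plot(adm["climate"],  main = "Climate",  key.pos = NULL, reset = FALSE)
plot(adm["violence"], main = "Violence", key.pos = NULL, reset = FALSE)
par(mfrow = c(1, 1))
```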
Buffers
“Is there a way to create a safety margin when using raster extraction by mask? (Say, I want an additional 10km outside the boundary of my vector layer to be cropped as well)”
There is no way to do this within Zonal statistics directly, but we can use the Buffer tool to pre-process the polygon layer. Let’s demonstrate with data on luminosity (raster) and country borders (polygons).
If the input layer (koreas) is unprojected, the Buffer tool will ask for the distance in degrees. You can either change the CRS or do some back-of-the-envelope math.
1 degree \(\approx\) 110 km at the equator. If we want a 10 km buffer, that’s roughly \(10/110 \approx 0.091\) degrees.
Enter the converted distance in Distance and run the buffer tool.
The buffered polygon should look similar, but “puffier”
Note that the buffers will overlap in neighboring polygons. So, North Korea will include 10km of South Korea and vice versa.
There are tutorials and YouTube videos online on how to remove the overlap, but it’s too complex to cover here.
You can now implement Zonal statistics with the buffered polygons as the Input layer.
You can plot the mean luminosity, and confirm that South Korea is brighter than the North. But you may also want to merge the results back to the original, non-buffered polygons
You can do this by adding a Vector join in layer properties, here adding the nl_mean variable from the buffered polygons to the original polygons.
This way, you get to keep the original polygon geometries, while using buffered geometries to calculate zonal statistics.
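The whole buffer-then-extract workflow can also be scripted in R. Below is a sketch using sf and terra, with placeholder file names and an assumed UTM projection (EPSG:32652) so the buffer distance can be given in metres instead of degrees.

```r
library(sf)
library(terra)

koreas <- st_read("koreas.shp")        # placeholder polygon file
nl     <- rast("nightlights.tif")      # placeholder luminosity raster

# buffer in metres by projecting first, instead of converting km to degrees
koreas_buf <- koreas |>
  st_transform(32652) |>               # a UTM zone covering the peninsula (assumed)
  st_buffer(10000) |>                  # 10 km buffer
  st_transform(st_crs(koreas))         # back to the original CRS

# zonal statistics over the buffered polygons, attached to the original
# (non-buffered) polygons, which are in the same row order
koreas$nl_mean <- terra::extract(nl, vect(koreas_buf),
                                 fun = mean, na.rm = TRUE)[, 2]
```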
Re-using the same color symbology
“How do I duplicate a color scale across different layers of my project?”
In QGIS, right-click on the layer whose color symbology you want to duplicate, and select Styles \(\to\) Copy style \(\to\) Symbology
Now right-click on the layer whose color symbology you want to replace, and select Styles \(\to\) Paste style \(\to\) Symbology
The color scheme and break points should now be replicated in the second layer. Note that you will still need to re-classify the colors in Properties if the numerical distribution is different in the second layer.
Regression analysis
“I am intrigued by regression analysis and its application. As a student with limited experience in statistics or regression analysis, I wonder if it is feasible for someone like myself to undertake a basic level of analysis.”
Flashback: we used regression analysis in Walk Through 1 (Islamic State): \[\begin{align*} \text{violence}_i=&\beta_1 \text{road density}_i + \beta_2 \text{population}_i +\beta_3 \text{cropland}_i \\ &+\beta_4 \text{dams}_i + \beta_5 \text{Sunni presence}_i + \epsilon_i \end{align*}\] where
Hypothesis | Expectation | Observation |
---|---|---|
1. Power projection | \(\beta_1<0\) | ? |
2. Demographics | \(\beta_2>0\) | ? |
3. Political economy | \(\beta_3<0\) | ? |
4. Key infrastructure | \(\beta_4>0\) | ? |
5. Sectarian divisions | \(\beta_5>0\) | ? |
Several popular types of (basic) regression models
Model | Type of dependent variable | R command |
---|---|---|
1. Linear regression (OLS) | continuous (0.47, -1.97, -0.29) | lm() |
2. Logistic regression (logit) | binary (0, 1) | glm(..., family="binomial") |
3. Quasi-Poisson | counts (0, 1, 2, 3, …) | glm(..., family="quasipoisson") |
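As a starting point, here is a hedged R sketch of how these three models could be fit to the Walk Through 1 specification; the data frame and variable names are placeholders for whatever your own data uses.

```r
# 'dat' is a placeholder data frame with one row per district
# 1. OLS, if violence is measured on a continuous scale
m_ols <- lm(violence ~ road_density + population + cropland + dams + sunni,
            data = dat)
summary(m_ols)   # the signs of the coefficients speak to Hypotheses 1-5

# 2. Logistic regression, if violence is coded 0/1
m_logit <- glm(violence ~ road_density + population + cropland + dams + sunni,
               data = dat, family = "binomial")

# 3. Quasi-Poisson, if violence is an event count (0, 1, 2, ...)
m_qpois <- glm(violence ~ road_density + population + cropland + dams + sunni,
               data = dat, family = "quasipoisson")
```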
Online tutorials (partial list)