CNN still image from night of February 23-24, 2022

News \(=\) data about who did what to whom, when and where

Who	Did what	To whom	When	Where	Source
Russia	rocket strike	Ukraine	2/24/2022	Kyiv	CNN
Russia	rocket strike	Ukraine	2/24/2022	Kharkiv	CNN

Map of violence and territorial control in Ukraine. Source: VIINA

Near-real time event, territorial control data on Russian invasion of Ukraine

Overview

Motivation: during Russia’s 2022 invasion of Ukraine…

Russia has become a hermetically-sealed information environment
1. media required to stick to MoD press releases
2. cannot use word “war” when describing “special military operation”
3. up to 100K ruble fine for publicly “discrediting” Russian army
4. up to 15-year sentence for “knowingly false information” about war
5. last independent media shut down (e.g. TV Rain, Echo of Moscow)
6. Facebook, Twitter, Instagram, VPNs blocked
Ukrainian media more free, but vulnerable
1. TV news sometimes broadcasts from basements, bomb shelters
2. Russia has targeted TV towers, cut electricity, cell service
3. all national TV channels merged onto one platform under martial law
4. radio silence on Ukrainian casualties, ongoing operations

Solution: use machine learning, remote sensing to track events on the ground

Tracking the War in Near-Real Time

What are “event data”?

Incident-level data on “who did what to whom, when and where”
1. “who”: initiator of action (subject)
2. “did what”: description of action/tactic (verb)
3. “to whom”: target of action (object)
4. “when”: time/date of event
5. “where”: location of event
Types of events (examples we have used in this class)
1. political violence
2. bike crashes in NYC
3. crimes in DC
4. 311 calls about flooding in New Orleans
Sources of data
1. media/open sources (including social media)
2. government records/archives
3. remote sensing
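
As a minimal sketch, one event record can be stored as the five fields above plus its source. The class name `Event` and its field names are hypothetical, chosen only for illustration; the two rows are the CNN examples from the opening table.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Event:
    """One incident: who did what to whom, when and where (hypothetical schema)."""
    who: str        # initiator of action (subject)
    did_what: str   # description of action/tactic (verb)
    to_whom: str    # target of action (object)
    when: date      # date of event
    where: str      # location of event
    source: str     # outlet that reported the event


events = [
    Event("Russia", "rocket strike", "Ukraine", date(2022, 2, 24), "Kyiv", "CNN"),
    Event("Russia", "rocket strike", "Ukraine", date(2022, 2, 24), "Kharkiv", "CNN"),
]

# The two records differ only in the "where" field, so they are distinct events.
unique_locations = {e.where for e in events}
print(sorted(unique_locations))
```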

VIINA: Violent Incident Information from News Articles

Near-real time event data on Russian invasion of Ukraine (updated daily)
1. based on news reports from Ukrainian and Russian media, geocoded and classified with Bidirectional Encoder Representations from Transformers (BERT)
2. each event is accompanied by full source info, text and URLs
Data on territorial control at municipality level (updated daily)
1. based on vectorized georeferenced maps (e.g. DeepState, Wikipedia)
2. “boosted” by VIINA event data on changes in control

Events

Control

Problem: 1 news report \(\neq\) 1 unique event

We can characterize each event as a unique configuration of
[subject]-[verb]-[object]-[time]-[location]
(i.e. who did what to whom, when and where)
We can learn about these events from news reports
1. one report of one event (“A attacked B”)
2. multiple reports of one event (“A attacked B”, “B attacked by A”)
3. multiple events in one report (“A attacked B, C attacked A”)
  \(\vdots\)
4. \(N\) reports of an unknown # of events

Question: do these news reports refer to the same event?

Date	Source	English translation
2022/02/25	Interfax.ua	Russian forces purposefully shelling residential buildings in Kharkiv – head of regional administration Synyehubov
2022/02/25	24tv.ua	Shell hits residential building in Kharkiv: casualties possible – frightening photos

Who	Did what	To whom	When	Where
	\(\checkmark\)	\(\checkmark\)	\(\checkmark\)	\(\checkmark\)

Question: do these news reports refer to the same event?

Date	Source	English translation
2022/08/10	liveuamap	In Kharkiv as a result of Russian shelling one person is wounded
2022/08/15	24tv.ua	Woman, wounded during shelling of Saltivka in Kharkiv, died in hospital

Who	Did what	To whom	When	Where
	\(\checkmark\)			\(\checkmark\)

Coreference resolution (CR)
Process of resolving multiple references to same physical object or event

Why are duplicates a problem?
1. duplicates are a threat to causal inference
2. over-reporting of events may be correlated with unobservables
  (e.g. media presence, perceived “newsworthiness” or novelty)
3. duplicates make it harder to assess ground truth about violence
4. this problem affects both machine-coded and hand-coded data
Applications to event data:
1. remove exact textual duplicates (“bare minimum”)
2. “1 per day” filter (if two reported events are of the same type, and were reported in same location on same day, then they are references to the same event)
3. MELTT spatio-temporal filter (match based on co-occurrence in space and time)
4. model-based methods (e.g. convolutional neural networks, transformers)
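
The “1 per day” filter can be sketched as a grouping on (event type, location, date). This is an illustrative sketch, not VIINA's actual implementation; the report records and the function name `one_per_day` are hypothetical.

```python
from collections import defaultdict

# Hypothetical report records: (report_id, event_type, location, date as YYYYMMDD).
reports = [
    ("r1", "shelling", "Kharkiv", 20220225),
    ("r2", "shelling", "Kharkiv", 20220225),   # same type, place and day as r1
    ("r3", "shelling", "Kharkiv", 20220226),   # different day
    ("r4", "airstrike", "Kharkiv", 20220225),  # different type
]


def one_per_day(reports):
    """Group reports that share event type, location and date."""
    groups = defaultdict(list)
    for report_id, event_type, location, day in reports:
        groups[(event_type, location, day)].append(report_id)
    return dict(groups)


groups = one_per_day(reports)
# Keep the first report in each group as the representative of the event.
kept = [ids[0] for ids in groups.values()]
print(kept)
```

A filter of this kind cannot link reports published on different days, such as the 10 Aug and 15 Aug Kharkiv reports above.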

Flow chart showing steps involved in creation of VIINA

VIINA workflow

VIINA turns online news articles & social media into geocoded event data by:

Scraping online news & social media, preprocessing the raw text
Classifying the news reports by actor and tactics with large language models
Assigning geo coordinates based on locations mentioned in reports
Identifying (but not removing) likely duplicate events

Other Near-Real Time Data Sources

Near-real time event data (partial list)

Dataset	ACLED	GDELT	ICEWS	VIINA
Data on Violence?	\(\checkmark\)	\(\checkmark\)	\(\checkmark\)	\(\checkmark\)
Data on Control?				\(\checkmark\)
Fully Automated?		\(\checkmark\)	\(\checkmark\)	\(\checkmark\)
Text Descriptions?	\(\checkmark\)			\(\checkmark\)
Source URLs?		\(\checkmark\)		\(\checkmark\)
Events in 1st Year	40,448	778,350	27,858	113,446
Unique Event Locations	2,430	1,762	581	9,771
Update Frequency	1 week	Daily	1-2 months	Daily
Event Types	24	50	105	23
Sources	97	8,887	126	30
English-Only?	No	Yes	Yes	No
Ukrainian Sources (%)	74.2	10.1	4.2	92.5
Russian Sources (%)	14	11.8	12	7.5
Unknown Sources (%)	0	0	47.3	0

ACLED

GDELT

ICEWS

VIINA

Near-real time remote sensing data (partial list)

Type	Source/link	Spatial resolution	Frequency	Free?
Fire anomalies	`FIRMS`	Points	Daily	\(\checkmark\)
Night lights	`VIIRS`	Raster	Nightly	\(\checkmark\)
Vegetation	`NDVI`	Raster	2 weeks	\(\checkmark\)
Meteorological events	`NASA Worldview`	Raster	Daily	\(\checkmark\)
Reflectance (photos)	`NASA Worldview`	Raster	Hourly/Daily	\(\checkmark\)

What kinds of events can remote sensing capture that media cannot? (\(+\) vice versa)

Vignettes

Overview of lab exercise

How much of Ukraine’s territory does Russia occupy?
Compare media reports to remote sensing data on fire anomalies.

We will work with a (very large) dataset on territorial control in Ukraine

Vignette 1

We will then integrate NASA’s data on active fires with VIINA event reports

Vignette 2 / Step 1

… and identify locations that may be overlooked by media vs. fires data

Vignette 2 / Step 2

We can obtain data on territorial control and media reports from github.com/zhukovyuri/VIINA

Browser screenshot

There are several datasets here. The ones we need are control_latest and event_info_latest_2024

Browser screenshot

Go to control_latest.zip and download the file by clicking on the “Download raw file” button

Browser screenshot

Do the same thing for event_info_latest_2024.zip

Browser screenshot

While we’re here, let’s also grab the GIS boundaries for Ukrainian populated places, gn_UA_tess.geojson

Browser screenshot

We can also get this file through the “Download raw file” link

Browser screenshot

Let’s now get the FIRMS Active Fires data from firms.modaps.eosdis.nasa.gov/active_fire/

Browser screenshot

Scroll down to “Text Files (CSV)” and download the latest weekly (7d) data for “World” from “VIIRS 375m/NOAA-21”

Browser screenshot

We will use country-level and district-level boundaries data from gadm.org

Download the level-0 and level-2 files for Ukraine, in GeoJSON format

Browser screenshot

Here is the full list of data sources and links:

Category	Type	Format	Data source
Territorial control	Vector (polygons)	`.csv`, `.geojson`	VIINA
Media event reports	Vector (points)	`.csv`	VIINA
Active fires	Vector (points)	`.csv`	NASA FIRMS
Administrative borders	Vector (polygons)	`.geojson`	GADM

These are all in the WT04.zip file posted on Canvas.

How much of Ukraine’s territory does Russia occupy?

Always save your progress!
Go to Project \(\to\) Save As...

Save early, save often

Vignette 1. Load Ukraine’s national borders (Layer \(\to\) Add Layer \(\to\) Add Vector Layer). The file is gadm41_UKR_0.json in Data/GADM

QGIS screenshot

Also load the populated place borders (Layer \(\to\) Add Layer \(\to\) Add Vector Layer). The file is gn_UA_tess.geojson in Data/VIINA

QGIS screenshot

There are 33,141 populated places in Ukraine. This is the level at which territorial control is measured

QGIS screenshot

Load the territorial control data (control_latest.csv) as a delimited text file with no geometries. This is a HUGE table (\(>\) 25M rows), of which we’ll be using only a small part. Take note of how the date field is formatted (YYYYMMDD)

QGIS screenshot

Let’s take a subset of this file by date, starting with the day before the full-scale invasion (23 Feb 2022). Go to the “Extract by Expression” tool in “Processing Toolbox” \(\to\) “Vector selection”. Set Input layer: control_latest and set Expression: date=20220223. Save as control_20220223.csv

QGIS screenshot

This will take a few minutes to run due to the file size.

QGIS screenshot

Repeat this process for the most recent date (8 Apr 2024). Set Input layer: control_latest and Expression: date=20240408. Save as control_20240408.csv

QGIS screenshot
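
Outside QGIS, the same date subset can be sketched with a streaming CSV reader, which avoids holding the full table in memory. The toy table and the `status` column name below are assumptions; the real control_latest.csv layout may differ.

```python
import csv
import io

# Toy stand-in for control_latest.csv (hypothetical rows and column names).
toy_csv = io.StringIO(
    "geonameid,date,status\n"
    "1,20220223,UA\n"
    "2,20220223,RU\n"
    "1,20240408,RU\n"
    "2,20240408,RU\n"
)


def subset_by_date(handle, yyyymmdd):
    """Read rows one at a time and keep those whose date equals yyyymmdd."""
    reader = csv.DictReader(handle)
    return [row for row in reader if int(row["date"]) == yyyymmdd]


control_20220223 = subset_by_date(toy_csv, 20220223)
print(control_20220223)
```

Because dates are stored as YYYYMMDD integers, numeric comparison also preserves chronological order, which is why expressions like date=20220223 work without date parsing.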

Once both subset tables are loaded, you can remove the original control_latest from memory

QGIS screenshot

Let’s now join the two subset tables to the populated place geometries. Go to the “Joins” tab in layer “Properties” for gn_UA_tess, and add a new join with Join layer \(=\) control_20220223 and geonameid as the Join field and Target field. Select the four status* fields as Joined fields

QGIS screenshot

Add a second join with control_20240408 as Join layer

QGIS screenshot

The two join layers should now appear in the “Joins” tab

QGIS screenshot
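
The two table joins amount to a left join on geonameid, with the joined fields prefixed by the join layer's name. A minimal sketch follows, using toy records and a hypothetical `left_join` helper; only a single `status` field is joined here for brevity.

```python
# Toy populated-place features and control tables (hypothetical values).
places = [
    {"geonameid": 1, "name": "Place A"},
    {"geonameid": 2, "name": "Place B"},
    {"geonameid": 3, "name": "Place C"},
]
control_20220223 = [{"geonameid": 1, "status": "UA"}, {"geonameid": 2, "status": "RU"}]
control_20240408 = [{"geonameid": 1, "status": "RU"}, {"geonameid": 2, "status": "RU"}]


def left_join(features, table, key, prefix, fields=("status",)):
    """Attach table fields to each feature by key, prefixing joined field names."""
    lookup = {row[key]: row for row in table}
    joined = []
    for feature in features:
        out = dict(feature)
        match = lookup.get(feature[key], {})
        for field in fields:
            out[prefix + field] = match.get(field)  # None when there is no match
        joined.append(out)
    return joined


joined = left_join(places, control_20220223, "geonameid", "control_20220223_")
joined = left_join(joined, control_20240408, "geonameid", "control_20240408_")
print(joined[0])
```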

Let’s visualize the control status, to make sure everything is right. Change the symbology to Categorized with Value \(=\) control_20220223_status and click Classify and OK

QGIS screenshot

This looks about right. Now we just need to extract these occupied areas and calculate how much of Ukraine’s territory they represent

QGIS screenshot

Go back to the “Extract by Expression” tool, with Input layer: gn_UA_tess and Expression: control_20220223_status='RU'. Save the output as control_ru_20220223.geojson

QGIS screenshot

Repeat for the latest date, with Expression: control_20240408_status='RU'. Save the output as control_ru_20240408.geojson

QGIS screenshot

The extracted areas should appear in the project window.

QGIS screenshot

Go to the “Overlap Analysis” tool in “Processing Toolbox” \(\to\) “Vector analysis”. Set Input layer: gadm41_UKR_0 (country-level borders). Click on the ... button next to Overlay layers

QGIS screenshot

Select control_ru_20220223 and control_ru_20240408 as Overlay layers. Click OK

QGIS screenshot

Save the output as ua_ctr_ru.geojson

QGIS screenshot

Open the attribute table for the newly-created ua_ctr_ru layer. The *_pc variables indicate that Russia occupied 7.23% of Ukraine’s territory on 23 Feb 2022 and 18.45% on 8 Apr 2024

QGIS screenshot
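
The percentage reported by an overlap analysis is the overlapping area divided by the area of the input layer. The sketch below illustrates that arithmetic with toy rectangles in arbitrary planar units. It assumes each overlay lies entirely inside the base polygon, and the function names are hypothetical.

```python
def shoelace_area(ring):
    """Planar area of a simple polygon given as a list of (x, y) vertices."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Toy geometries in arbitrary planar units (not real borders).
country = [(0, 0), (10, 0), (10, 10), (0, 10)]
occupied_before = [(8, 0), (10, 0), (10, 4), (8, 4)]  # lies inside country
occupied_after = [(6, 0), (10, 0), (10, 5), (6, 5)]   # lies inside country


def percent_of(overlay, base):
    """Share of base covered by overlay, assuming overlay lies within base."""
    return 100.0 * shoelace_area(overlay) / shoelace_area(base)


pct_before = percent_of(occupied_before, country)
pct_after = percent_of(occupied_after, country)
print(pct_before, pct_after)
```

With real data, areas should be computed in a projected (equal-area) coordinate system or with ellipsoidal measurements, since planar areas in degrees are distorted.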

Comparing media event reports to remote sensing data on fires

Vignette 2. Let’s load Ukrainian district borders: gadm41_UKR_2.json from Data/GADM. These will be our spatial units of analysis

QGIS screenshot

Load active fires data as delimited text: J2_VIIRS_C2_Global_7d.csv. Make sure the X and Y fields are properly specified, check box next to \(\checkmark\) Use spatial index

QGIS screenshot

Load media event reports as delimited text: event_info_latest_2024.csv. Here, too, specify the X and Y fields and check the box next to \(\checkmark\) Use spatial index

QGIS screenshot

Let’s extract just the fires inside of Ukraine. Go to the “Extract by Location” tool in “Processing Toolbox” \(\to\) “Vector selection”. Set Extract features from: J2_VIIRS_C2_Global_7d, check \(\checkmark\) Intersect, By comparing to the features from: gadm41_UKR_0. Save the extracted features as fires_ua.geojson

QGIS screenshot
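
Extracting points by location comes down to a point-in-polygon test. Below is a minimal sketch using the standard ray-casting method on a toy rectangular outline. The coordinates are hypothetical, and this is not necessarily how QGIS implements the tool.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) falls inside the polygon ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray cast rightward from (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Toy country outline and fire detections (hypothetical coordinates).
country = [(22.0, 44.0), (40.0, 44.0), (40.0, 52.0), (22.0, 52.0)]
fires = [
    {"longitude": 37.5, "latitude": 48.0},  # inside the toy outline
    {"longitude": 10.0, "latitude": 50.0},  # west of it
    {"longitude": 30.0, "latitude": 60.0},  # north of it
]

fires_ua = [f for f in fires if point_in_polygon(f["longitude"], f["latitude"], country)]
print(len(fires_ua))
```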

Let’s also extract the media reports for the same time period as the fires (last week). Go to the “Extract by Expression” tool and set Input layer: event_info_latest_2024, and Expression: date>20240401. Save the extracted features as viina_lastweek.geojson

QGIS screenshot

Before counting the fires, let’s remove the “low confidence” fire anomalies. With fires_ua as the active layer, go to Select by Expression, with Expression: confidence!='low' (!= means “\(\neq\)”). This should select about 2419 features

QGIS screenshot

Open the “Count Points in Polygon” tool, set Polygons: gadm41_UKR_2, Points: fires_ua, check the box \(\checkmark\) Selected features only, name the count field fires and save the output as ua_fires.geojson

QGIS screenshot
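
The filter-then-count step can be sketched as follows. To keep the example short, districts are simplified to axis-aligned boxes (real district polygons are irregular). All names and values are hypothetical.

```python
from collections import Counter

# Toy districts as axis-aligned boxes: name -> (xmin, ymin, xmax, ymax).
districts = {
    "District A": (0.0, 0.0, 5.0, 5.0),
    "District B": (5.0, 0.0, 10.0, 5.0),
}

# Toy fire detections (hypothetical values).
fires = [
    {"x": 1.0, "y": 1.0, "confidence": "high"},
    {"x": 2.0, "y": 3.0, "confidence": "low"},  # dropped by the filter
    {"x": 6.0, "y": 2.0, "confidence": "nominal"},
    {"x": 7.5, "y": 4.0, "confidence": "high"},
]


def count_points(points, boxes):
    """Count points falling in each box; boxes with no points get 0."""
    counts = Counter({name: 0 for name in boxes})
    for p in points:
        for name, (xmin, ymin, xmax, ymax) in boxes.items():
            if xmin <= p["x"] < xmax and ymin <= p["y"] < ymax:
                counts[name] += 1
                break  # each point is assigned to at most one district
    return dict(counts)


# Equivalent of the expression confidence != 'low'.
selected = [f for f in fires if f["confidence"] != "low"]
fire_counts = count_points(selected, districts)
print(fire_counts)
```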

Before counting the media reports, let’s remove the obvious duplicates. Open the “Delete Duplicates by Attribute” tool in “Processing Toolbox” \(\to\) “Vector general”. Set Input layer: viina_lastweek. Click the ... button next to Field to match duplicates by

QGIS screenshot

Select event_id_1pd as the field, click OK

QGIS screenshot

Save the deduplicated output as viina_lastweek_1pd.geojson

QGIS screenshot

Go back to “Count Points in Polygon”. Set Polygons: ua_fires, Points: viina_lastweek_1pd. Name the count field events and save as ua_fires_events.geojson

QGIS screenshot

Open the Plotly tool. Set Plot type \(=\) Scatter Plot, Layer \(=\) ua_fires_events, X field \(=\) fires, Y field \(=\) events

QGIS screenshot

Check the box next to \(\checkmark\) Hover label as text

QGIS screenshot

In “Layout Options”, uncheck the box next to \(\square\) Show legend. Change the X and Y axis mode to Logarithmic

QGIS screenshot

You can think of the points falling along the (imaginary) red line here as locations where both active fires data and media reports captured the same number of incidents. Points below (above) this line are ones where the fires data caught more (fewer) incidents than media reports

QGIS screenshot
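
The reading of the scatter plot can be sketched as a comparison of the two counts for each district, with fires on the X axis and events on the Y axis. The counts below are toy values, not results from the lab, and the `classify` helper is hypothetical. Districts with a zero count are flagged separately because zero cannot be drawn on a logarithmic axis.

```python
# Toy per-district counts (hypothetical values, not results from the lab).
districts = [
    {"name": "District A", "fires": 40, "events": 5},
    {"name": "District B", "fires": 3, "events": 60},
    {"name": "District C", "fires": 10, "events": 10},
    {"name": "District D", "fires": 0, "events": 2},
]


def classify(row):
    """Position of a district relative to the events = fires line."""
    if row["fires"] == 0 or row["events"] == 0:
        return "not shown on log axes"
    if row["events"] < row["fires"]:
        return "below line: fires data caught more"
    if row["events"] > row["fires"]:
        return "above line: media reports caught more"
    return "on line: same count"


labels = {row["name"]: classify(row) for row in districts}
for name, label in labels.items():
    print(name, "->", label)
```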

Locations with more active fires than media reports include places like Mar’inka…

QGIS screenshot

… and Lyman (in former Krasnolymans’kyy district). These are mostly-destroyed front line towns that are hard for journalists to access

QGIS screenshot

Places where media reports capture more incidents include Kharkiv (Ukraine’s second largest city, under Ukrainian control)…

QGIS screenshot

…and Kherson (another provincial capital near the front line, but under Ukrainian control). Both of these places are far easier for journalists to access than the “no man’s land” towns in the lower-right corner

QGIS screenshot

You can perform all these steps in R
(see replication code wt04_demo.R in WT04.zip)

Vignette 1

Vignette 2

API-231 / GIS-PubPol
Meeting 20 (Walk Through 4: Russian-Ukrainian War)

Yuri M. Zhukov

Visiting Associate Professor of Public Policy

Harvard Kennedy School

April 11, 2024

Overview

Tracking the War in Near-Real Time

Other Near-Real Time Data Sources

Vignettes

How much of Ukraine’s territory does Russia occupy?

Comparing media event reports to remote sensing data on fires
