Analysis and Geoprocessing with NYC OpenData

A lot of cool stuff here

Overview of lab exercise and problem set

Lab exercise
1. Map of bicycle crashes (per capita) in each NYC community district
2. Map of bike lanes per district (normalized for population)
Problem set
1. Map of rat activity (per capita) in each NYC community district

We will first make this map

Bike crashes per capita in NYC

This will require joining community districts polygons …

Community districts in NYC

… to a (non-geographic) spreadsheet of population statistics

NYC population data

… followed by a point-in-polygon analysis and some field calculations

NYC bike crash data

We will then make this map

Bike lanes per capita in NYC

This will require some line-in-polygon analysis

Bike lanes in NYC

You will make this map for your problem set

Rat activity per capita in NYC

This will involve (again) point-in-polygon analysis

Rodent inspections in NYC

You can make these plots in QGIS or in R. Instructions for both are below.

Map 1 in R

Map 2 in R

PS03 in R

We will put multiple types of data on same map:

Category	Type	Format	Data source
Community district borders	Vector (polygon)	`.geojson`	`NYC OpenData`
Community district population	Table	`.csv`	`NYC OpenData`
Vehicle collisions involving bicycles	Vector (point)	`.csv`	`NYC OpenData`
Bike routes	Vector (polyline)	`.geojson`	`NYC OpenData`
Rodent inspections finding “rat activity”	Vector (point)	`.csv`	`NYC OpenData`

These are all in the PS03.zip file posted on Canvas.

Let’s open QGIS…

QGIS

Always save your progress!
Go to Project \(\to\) Save As...

Save early, save often

Map 1

Load community districts: Layer \(\to\) Add Layer \(\to\) Add Vector Layer...
Open the file NYC_CD.geojson in the NYC_Communities folder. Click Add

QGIS Screenshot 1

This dataset includes 71 polygons, including New York’s 59 community districts and 12 areas outside of community districts’ jurisdiction (parks, waterfront)

QGIS Screenshot 2

Open the Attribute Table for this layer. We see a borough-district code (boro_cd) and geometric statistics (shape_area, shape_leng)

QGIS Screenshot 3

For a definition of boro_cd, we can look at the metadata for NYC_CD (nycd_metadata.pdf). The metadata seem out of date (BoroCD \(\neq\) boro_cd), but there’s important info here about this code is constructed

QGIS Screenshot 4

Load population data:

Layer \(\to\) Add Layer \(\to\) Add Delimited Text Layer...
Navigate to the file NY_Pop.csv in the NYC_Population folder
Set Geometry Definition \(=\) No geometry (attribute table only)
Click Add

QGIS Screenshot 5

The new layer is visible in the side menu. But it doesn’t appear on the map, since it’s not a spatial dataset. Right-click NYC_Pop \(\to\) Open Attribute Table

QGIS Screenshot 6

This is a table of community district names and population statistics. Maybe we can link this table to NYC_CD… but there is no common variable

QGIS Screenshot 7

Wait. NYC_CD has a boro_cd code, which is “Community district number preceded by BoroCode”. Maybe we can create a boro_cd code in NYC_Pop, using boro_code and cd_code? Open Field Calculator

QGIS Screenshot 8

Create a new variable, called boro_cd, of type Integer. For the Expression, write boro_code * 100 + cd_code. Click OK

QGIS Screenshot 9

You should now see the variable boro_cd in the attribute table.

QGIS Screenshot 10

Let’s join NYC_CD and NYC_Pop. Go to the Properties for NYC_CD

QGIS Screenshot 11

Open the Joins tab in Properties. Click on the “add” button (+)

QGIS Screenshot 12

Select NYC_Pop as the join layer, and boro_cd as the join and target field. Click OK

QGIS Screenshot 13

You should now see a new join in the Joins tab

QGIS Screenshot 14

Now go to the Symbology tab in Properties.
Set Symbology type \(=\) Categorized, and Value \(=\) NYC_Pop_Borough.
Click Classify, then OK

QGIS Screenshot 15

The districts should now be colored by borough

QGIS Screenshot 16

Let’s now color them by population size. Go back to Properties \(\to\) Symbology.
Set Symbology type \(=\) Graduated, and Value \(=\) NYC_Pop_pop_2010.
Click Classify, then OK

QGIS Screenshot 17

The districts should now be colored by their population in 2010

QGIS Screenshot 18

Set the color scheme back to Single symbol in Symbology

QGIS Screenshot 19

Load data on bicycle crashes:

Go to Add Delimited Text Layer
Navigate to NYC_BikeCrashes.csv in the NYC_Collisions folder
Set Geometry Definition \(=\) Point coordinates
Set X field \(=\) LONGITUDE, Y field \(=\) LATITUDE.
Click Add

The new layer should be visible in the map, as points.

QGIS Screenshot 21

Navigate to the Count Points in Polygon tool.
Vector menu \(\to\) Analysis Tools \(\to\) Count Points in Polygon

QGIS Screenshot 22

Select Polygons \(=\) NYC_CD, Points \(=\) NYC_BikeCrashes. Name the count field crashes, and save the output file as NYC_CD_2.geojson. Click Run

QGIS Screenshot 23

The new layer should appear on the map as NYC_CD_2.
Hide the other two layers by unchecking boxes next to them in the menu.

QGIS Screenshot 24

Let’s color the districts by number of bike crashes. Go to the Symbology tab in the new layer’s Properties. Set Symbology type \(=\) Graduated, and Value \(=\) crashes. Click Classify, then OK

QGIS Screenshot 25

In which parts of NYC are cars hitting the most cyclists?

QGIS Screenshot 26

Let’s combine this with population data to get a per-capita rate of bike crashes.
Open the Attribute Table for NYC_CD_2. Open the Field Calculator

QGIS Screenshot 27

Create new field called crash1000 of type Decimal number (real).
For the Expression, write crashes / NYC_Pop_popP2010 * 1000. Click OK

QGIS Screenshot 28

You should now see the new variable in the attribute table.

QGIS Screenshot 29

Click the Edit button to save the dataset

QGIS Screenshot 30

Click Save

QGIS Screenshot 31

Change the symbology again, to color districts by crash1000. Remember to click Classify

QGIS Screenshot 32

The distribution of crashes per 1000 residents (crash1000) looks similar to crashes. But there are a few missing districts (due to no population data)

QGIS Screenshot 33

We can fix this (aesthetically) by un-hiding the NYC_CD layer.

QGIS Screenshot 34

You may want to change the color of NYC_CD to something neutral, like gray.

QGIS Screenshot 68

Export the map to image. Place map on a New Print Layout, add and properly format a legend, scale bar, etc. The end product should look vaguely like this.

QGIS Screenshot 57

Map 2

Load bike routes: Layer \(\to\) Add Layer \(\to\) Add Vector Layer...
Navigate to NYC_BikeRoutes.geojson in the NYC_BikeRoutes folder. Click Add

QGIS Screenshot 35

This is a polyline layer, representing NYC’s cycling lanes and greenways.
Let’s calculate how many miles of bike lanes are in each district.

QGIS Screenshot 36

Navigate to the Sum Line Lengths tool.
Vector menu \(\to\) Analysis Tools \(\to\) Sum Line Length

QGIS Screenshot 37

Select Lines \(=\) NYC_BikeRoutes, Polygons \(=\) NYC_CD_2.
Name the lengths and count fields bk_length and bk_count.
Save the output file as NYC_CD_3.geojson. Click Run

QGIS Screenshot 38

The new layer should appear as NYC_CD_3 on the map.
Let’s take a look inside

QGIS Screenshot 39

Open the Attribute Table for NYC_CD_3. The bk_length variable is there, but what are its units of measurement?

QGIS Screenshot 40

We can look up the project’s units of measurement by going to Project menu \(\to\) Properties

QGIS Screenshot 41

In Project Properties, we see Units for distance measurement \(=\) Meters

QGIS Screenshot 42

Let’s convert bk_length from meters to miles. Go back to NYC_CD_3’s Attribute Table \(\to\) Field Calculator. Create new field called bk_miles of type Decimal number (real). For the Expression, write bk_length / 621.3712. Click OK

QGIS Screenshot 43

You should now see the new variable in the attribute table.

QGIS Screenshot 44

Now let’s create a per-capita measure (miles of bike lane per 1000 residents). Go back to the Field Calculator. Create new field called bklane1000 of type Decimal number (real). For the Expression, write
bk_miles / NYC_Pop_pop_2010 * 1000. Click OK

QGIS Screenshot 45

You should now see the new variable in the attribute table.

QGIS Screenshot 46

Remember to save your changes!

QGIS Screenshot 47

Change the symbology of NYC_CD_3, to color districts by bklane1000. Click Classify, then OK

QGIS Screenshot 48

We now have a plot of “miles of bike lane per 1000 residents” (bklane1000). Again, there are a few missing districts (due to no population data)

QGIS Screenshot 49

Let’s un-hide the NYC_CD layer again to show the missing districts. Ideally, these should be colored gray or some other neutral color

QGIS Screenshot 52

Export the map to image. Project \(\to\) New Print Layout. Place map and legend to layout. The resulting legend includes several items we’ll need to remove (everything but NYC_CD_3 and NYC_CD)

QGIS Screenshot 53

Remember how we did this in lab 2? (Item Properties \(\to\) un-check Auto update, etc.) Remove everything but NYC_CD_3 and NYC_CD, change layers’ names from NYC_CD_3 to “Miles per 1000 residents” and NYC_CD to “No population”.

QGIS Screenshot 54

Change legend title to “NYC Bicycle Lanes”. Add scale bar. Change Scalebar units to Miles. Export as image. Name the file bike_lanes_per_capita.png

QGIS Screenshot 55

The output file should look roughly like this.

QGIS Screenshot 56

We can export the dataset we created, by right-clicking on NYC_CD_3 layer, and selecting Export \(\to\) Save Features As...

QGIS Screenshot 58

The file can be saved in a variety of formats, including .geojson, .shp, .csv and Excel

QGIS Screenshot 59

Let’s create a bar plot. Go to Processing menu \(\to\) Toolbox. This will open a Processing Toolbox menu on the right

QGIS Screenshot 60

Double-click on Bar plot in the Plots submenu on the right

Set Input layer \(=\) NYC_CD_3, Category name field \(=\) NYC_Pop_Borough, and Value field \(=\) crashes. Set the output destination under Bar plot to barplot_1.html. Click Run

QGIS Screenshot 62

You can access the resulting plot by clicking on the Bar plot item in the Results Viewer on the right

QGIS Screenshot 63

This will open a web browser window, with a bar plot of total bike crashes per district, grouped by borough. As expected,Brooklyn and Manhattan lead the pack.

QGIS Screenshot 64

Export the plot by clicking on the Download plot as png button in the top-right

QGIS Screenshot 65

The output file should look like this. For further analysis, you can also open the .csv file we just created in Excel or R, and start crunching numbers (see below)

QGIS Screenshot 66

Problem Set 3

Your assignment (if using QGIS):

create a map of rat activity (per 1000 residents) in NYC!
use this dataset:
- NYC_Rats/NYC_Rats.csv
follow the same steps as for “Map 1” above
(i.e. sum via point-in-polygon, then calculate per capita rate)
name the file rats_per_capita.png
upload map to Canvas
(by next Wednesday)

Can you make this map?

R

Loading R packages

To implement these steps in R, we will be using the sf package (again)

library(sf)

NOTE: The code to produce Map 1 and Map 2 in R is in ps03_demo.R on RStudio Cloud, and in PS03.zip (posted on Canvas).

Map 1

Loading spatial data

Let’s load the NYC community district boundaries into R, using sf::read_sf():

nyc_cd = sf::read_sf("Data/NYC_Communities/NYC_CD.geojson")
plot(nyc_cd["geometry"])

Loading non-spatial data

Load the tabular population data using read.csv(), and preview the first few rows:

pop = read.csv(file = "Data/NYC_Population/NYC_Pop.csv")
head(pop)

Joining spatial to non-spatial data

Similar to QGIS’s Field Calculator, let’s create a new boro_cd variable in pop:

pop$boro_cd = pop$boro_code*100 + pop$cd_code

We can join nyc_cd and pop by attribute boro_cd using the merge() command:

nyc_cd_2 = merge(x = nyc_cd, y = pop, by = "boro_cd")
print(nyc_cd_2,n=2) # Preview first 2 rows

## Simple feature collection with 59 features and 12 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -74.25559 ymin: 40.49613 xmax: -73.70001 ymax: 40.91553
## Geodetic CRS:  WGS 84
## First 2 features:
##   boro_cd    shape_area    shape_leng boro_code   Borough cd_code
## 1     101 41692711.6252 69093.8528032         1 Manhattan       1
## 2     102 37604933.3646 32628.8675927         1 Manhattan       2
##                      cd_name pop_1970 pop_1980 pop_1990 pop_2000 pop_2010
## 1 Battery Park City, Tribeca     7706    15918    25366    34420    60978
## 2    Greenwich Village, Soho    84337    87069    94105    93119    90016
##                         geometry
## 1 MULTIPOLYGON (((-74.04388 4...
## 2 MULTIPOLYGON (((-73.99684 4...

Plotting merged features

Using the merged dataset nyc_cd_2, let’s plot the districts by borough

plot(nyc_cd_2["Borough"])

… or plot them by population size in 2010:

plot(nyc_cd_2["pop_2010"],main="Population size in 2010")

Loading spatial data from a CSV table

Our data on bike collisions, like population, is also a csv table:

crashes = read.csv(file = "Data/NYC_Collisions/NYC_BikeCrashes.csv")

But this file has geographic coordinates, which we can use to create a spatial object, borrowing the CRS from nyc_cd:

crashes = sf::st_as_sf(x = crashes, coords = c("LONGITUDE","LATITUDE"), 
    crs = sf::st_crs(nyc_cd))
plot(crashes["geometry"])

Point-in-polygon analysis

To count the number of collisions in each community district, we can overlay nyc_cd_2 and crashes with sf::st_intersects(). The output is a classed list of feature IDs intersected:

o = sf::st_intersects(x = nyc_cd_2, y = crashes)
o

## Sparse geometry binary predicate list of length 59, where the predicate
## was `intersects'
## first 10 elements:
##  1: 15, 29, 188, 212, 236, 268, 284, 297, 625, 634, ...
##  2: 9, 10, 11, 18, 108, 135, 137, 169, 194, 274, ...
##  3: 79, 85, 165, 215, 242, 257, 260, 261, 330, 347, ...
##  4: 4, 12, 44, 101, 109, 116, 161, 210, 211, 233, ...
##  5: 8, 16, 47, 66, 75, 83, 105, 120, 149, 191, ...
##  6: 21, 58, 97, 227, 266, 280, 282, 307, 323, 331, ...
##  7: 76, 81, 110, 117, 164, 220, 229, 288, 301, 332, ...
##  8: 2, 186, 204, 285, 322, 354, 355, 361, 362, 390, ...
##  9: 89, 103, 129, 150, 286, 311, 349, 475, 529, 540, ...
##  10: 17, 54, 74, 82, 113, 156, 173, 228, 246, 249, ...

We can use the command lengths(o) to find the number of points in each polygon, and assign this as a new variable in nyc_cd:

nyc_cd_2$crashes = lengths(o)

Plot the new variable:

plot(nyc_cd_2["crashes"], main = "Bicycle Collisions (2013-2023)")

Creating per-capita measures

Let’s calculate and plot a new field, crashes per 1000 residents:

nyc_cd_2$crash1000 = nyc_cd_2$crashes / nyc_cd_2$pop_2010 * 1000
plot(nyc_cd_2["crash1000"], main = "Collisions per 1000 Residents")

Exporting Map 1 to image file

To save the map, we will use the png() and dev.off() commands:

png(filename = "Output/bike_crashes_per_capita_r.png", 
    width = 6, height = 4, units = "in", res = 300)
plot(nyc_cd_2["crash1000"], 
    main = "Bicycle Collisions per 1000 Residents", lwd=.5)
dev.off()

R Map 1 screenshot

The output file should look like this.

Map 2

Loading bike lane data

Import bike routes data

bikeroutes = sf::read_sf("Data/NYC_BikeRoutes/NYC_BikeRoutes.geojson", 
    crs=sf::st_crs(nyc_cd_2))
plot(bikeroutes["geometry"], reset=FALSE, col="forestgreen", lwd=.5)
plot(nyc_cd_2["geometry"], add = TRUE)

Line-in-polygon analysis

We need to calculate the miles of bike lane in each community district. We can think of this as a two-step routine:

determine which line segments lie inside of which polygons
take a local sum of overlapping line lengths in each polygon

For step 1, we can use sf::st_intersects() to figure out which segments of bikeroutes lie in which districts (nyc_cd_3) (similar to point-in-polygon analysis):

o = sf::st_intersects(x = nyc_cd_2, y = bikeroutes)
o

## Sparse geometry binary predicate list of length 59, where the predicate
## was `intersects'
## first 10 elements:
##  1: 6, 35, 88, 100, 138, 139, 140, 171, 805, 806, ...
##  2: 11, 2484, 2485, 2486, 2487, 2488, 2489, 2490, 2491, 2492, ...
##  3: 81, 84, 110, 127, 177, 1505, 1506, 1507, 1508, 1533, ...
##  4: 41, 48, 78, 178, 2633, 2650, 2653, 2654, 2655, 2656, ...
##  5: 12, 47, 109, 2633, 2796, 2836, 2837, 2838, 2839, 2840, ...
##  6: 125, 126, 169, 2904, 2907, 2908, 2912, 2913, 2951, 2952, ...
##  7: 85, 3136, 3137, 3184, 3185, 3186, 3218, 3219, 3220, 3222, ...
##  8: 89, 90, 91, 168, 3546, 3599, 3600, 3601, 3602, 3603, ...
##  9: 3887, 3888, 3897, 3898, 3899, 3900, 3914, 3920, 3921, 3922, ...
##  10: 176, 180, 3867, 3868, 3915, 3916, 3917, 3918, 3919, 3920, ...

Line-in-polygon analysis (continued)

For step 2, we can write a loop that extracts the overlapping bike lane segments in each district, measures their (cumulative) length, and converts this sum to miles.

out <- c()                  # Create empty vector to store results
for(i in 1:length(o)){                      # Open loop
    bike_subset = bikeroutes[o[i][[1]],]        # Take subset
    bk_meters = sum(sf::st_length(bike_subset)) # Sum lengths
    bk_km = bk_meters/1000                      # Convert to km
    bk_mi = bk_km * 0.6213712                   # Convert to miles
    out = c(out, bk_mi)                         # Concatenate
}                                           # Close loop

We then add this as a variable to nyc_cd_2

nyc_cd_2$bk_miles <- out

Let’s quickly visualize the results:

plot(nyc_cd_2["bk_miles"], 
    main = "Miles of Bike Lane per District", lwd=.5)

Creating per-capita measures

Let’s calculate and plot a new field, miles of bicycle lane per 1000 residents:

nyc_cd_2$bklane1000 = nyc_cd_2$bk_miles / nyc_cd_2$pop_2010 * 1000
plot(nyc_cd_2["bklane1000"], main = "Bike Lane per 1000 Residents (mi)")

Exporting to image file

To save the map, we will use the png() and dev.off() commands:

png(filename = "Output/bike_lanes_per_capita_R.png", 
    width = 6, height = 4, units = "in", res = 300)
plot(nyc_cd_2["bklane1000"], 
    main = "Miles of Bicycle Lane per 1000 Residents", lwd=.5)
dev.off()

R Map 1 screenshot

The output file should look like this.

Exporting the dataset

To save the data attribute table as a csv file, we can use the write.csv() command:

write.csv(x = nyc_cd_2, file = "Output/NYC_CD_2_R.csv")

Making bar plots

For each of NYC’s five boroughs, calculate averages of the variables we created:

borough_means = aggregate(
    x = nyc_cd_2[,c("crashes","crash1000","bk_miles","bklane1000")],
    by = list(Borough=nyc_cd_2$Borough),
    FUN = mean
    )

This will create a new sf object with 5 rows:

borough_means

borough_means

Make a bar plot of bike crashes in an average community district

barplot(height=borough_means$crashes, names.arg=borough_means$Borough, 
    main="Bicycle collisions per community district (mean)")

Make a bar plot of bike crashes per 1000 residents

barplot(height = borough_means$crash1000,
    names.arg = borough_means$Borough, 
    main = "Collisions per 1000 residents (mean)")

Make a bar plot of miles of bike lane in an average community district

barplot(height = borough_means$bk_miles,
    names.arg = borough_means$Borough, 
    main = "Bike lanes per community district (mi)")

Make a bar plot of miles of bike lane per 1000 residents

barplot(height = borough_means$bklane1000,
    names.arg = borough_means$Borough, 
    main = "Bike lanes per 1000 residents (mi)")

Basic regression analysis

You can use an sf object as you would a data.frame to run regressions in R.

For example, suppose we wanted to see if there are fewer bicycle collisions where there are more bike lanes. We could run a basic linear regression of bicycle collisions (per 1000) on bike lanes (per 1000), with fixed effects for each borough:

summary(lm(crash1000 ~ bklane1000 + as.factor(Borough), data=nyc_cd_2))

## 
## Call:
## lm(formula = crash1000 ~ bklane1000 + as.factor(Borough), data = nyc_cd_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.094  -2.211   0.026   1.729  32.999 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -1.5231     2.1642  -0.704  0.48468    
## bklane1000                       64.8800    11.1212   5.834 3.32e-07 ***
## as.factor(Borough)Brooklyn        4.2489     2.3146   1.836  0.07201 .  
## as.factor(Borough)Manhattan       7.8869     2.6232   3.007  0.00403 ** 
## as.factor(Borough)Queens          0.1403     2.4381   0.058  0.95432    
## as.factor(Borough)Staten Island  -5.9343     4.0189  -1.477  0.14570    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.197 on 53 degrees of freedom
## Multiple R-squared:  0.5674, Adjusted R-squared:  0.5266 
## F-statistic:  13.9 on 5 and 53 DF,  p-value: 1.114e-08

This analysis suggests there are more collisions where there are more bike lanes.

Adjusting for differences across boroughs, each additional mile of bike lane (per 1000 residents) is associated with 65 additional collisions (per 1000).

Does this mean bike lanes are “causing” collisions?

How should we interpret this result?

What are some alternative explanations of the positive correlation?

What sort of data would we need to test these alternative arguments?

Problem Set 3

Your assignment (if using R):

create a map of rat activity (per 1000 residents) in NYC!
use this dataset:
- NYC_Rats/NYC_Rats.csv
follow the same steps as for “Map 1” (crashes per 1000)
(i.e. sum via point-in-polygon, then calculate per capita rate)
name the file rats_per_capita.png
upload map to Canvas
(by next Wednesday)

Can you make this map?

API-231 / GIS-PubPol / Meeting 05 (Lab Exercise + Problem Set 3)

Yuri M. Zhukov

Visiting Associate Professor of Public Policy

Harvard Kennedy School

February 8, 2024

QGIS

Map 1

Map 2

R

Map 1

Map 2