Analysis and Geoprocessing with NYC OpenData
A lot of cool stuff here
Overview of lab exercise and problem set
We will first make this map
Bike crashes per capita in NYC
This will require joining community districts polygons …
Community districts in NYC
… to a (non-geographic) spreadsheet of population statistics
NYC population data
… followed by a point-in-polygon analysis and some field calculations
NYC bike crash data
We will then make this map
Bike lanes per capita in NYC
This will require some line-in-polygon analysis
Bike lanes in NYC
You will make this map for your problem set
Rat activity per capita in NYC
This will involve (again) point-in-polygon analysis
Rodent inspections in NYC
You can make these plots in QGIS or in R. Instructions for both are below.
Map 1 in R
Map 2 in R
PS03 in R
We will put multiple types of data on same map:
Category | Type | Format | Data source |
---|---|---|---|
Community district borders | Vector (polygon) | .geojson |
NYC OpenData |
Community district population | Table | .csv |
NYC OpenData |
Vehicle collisions involving bicycles | Vector (point) | .csv |
NYC OpenData |
Bike routes | Vector (polyline) | .geojson |
NYC OpenData |
Rodent inspections finding “rat activity” | Vector (point) | .csv |
NYC OpenData |
These are all in the Lab03PS03.zip
file posted on Canvas.
Let’s open QGIS…
Always save your progress!
Go to Project
\(\to\) Save As...
Load community districts: Layer
\(\to\) Add Layer
\(\to\) Add Vector Layer...
Open the file NYC_CD.geojson
in the NYC_Communities
folder. Click Add
This dataset includes 71 polygons, including New York’s 59 community districts and 12 areas outside of community districts’ jurisdiction (parks, waterfront)
Open the Attribute Table
for this layer. We see a borough-district code (boro_cd
) and geometric statistics (shape_area, shape_leng
)
For a definition of boro_cd
, we can look at the metadata for NYC_CD
(nycd_metadata.pdf
). The metadata seem out of date (BoroCD
\(\neq\) boro_cd
), but there’s important info here about this code is constructed
Load population data:
Layer
\(\to\) Add Layer
\(\to\) Add Delimited Text Layer...
NY_Pop.csv
in the NYC_Population
folderGeometry Definition
\(=\) No geometry (attribute table only)
Add
The new layer is visible in the side menu. But it doesn’t appear on the map, since it’s not a spatial dataset. Right-click NYC_Pop
\(\to\) Open Attribute Table
This is a table of community district names and population statistics. Maybe we can link this table to NYC_CD
… but there is no common variable
Wait. NYC_CD
has a boro_cd
code, which is “Community district number preceded by BoroCode”. Maybe we can create a boro_cd
code in NYC_Pop
, using boro_code
and cd_code
? Open Field Calculator
Create a new variable, called boro_cd
, of type Integer
. For the Expression
, write boro_code * 100 + cd_code
. Click OK
You should now see the variable boro_cd
in the attribute table.
Let’s join NYC_CD
and NYC_Pop
. Go to the Properties
for NYC_CD
Open the Joins
tab in Properties
. Click on the “add” button (+
)
Select NYC_Pop
as the join layer, and boro_cd
as the join and target field. Click OK
You should now see a new join in the Joins
tab
Now go to the Symbology
tab in Properties
.
Set Symbology type
\(=\) Categorized
, and Value
\(=\) NYC_Pop_Borough
.
Click Classify
, then OK
The districts should now be colored by borough
Let’s now color them by population size. Go back to Properties
\(\to\) Symbology
.
Set Symbology type
\(=\) Graduated
, and Value
\(=\) NYC_Pop_pop_2010
.
Click Classify
, then OK
The districts should now be colored by their population in 2010
Set the color scheme back to Single symbol
in Symbology
Load data on bicycle crashes:
Add Delimited Text Layer
NYC_BikeCrashes.csv
in the NYC_Collisions
folderGeometry Definition
\(=\) Point coordinates
X field
\(=\) LONGITUDE
, Y field
\(=\) LATITUDE
.Add
The new layer should be visible in the map, as points.
Navigate to the Count Points in Polygon
tool.
Vector
menu \(\to\) Analysis Tools
\(\to\) Count Points in Polygon
Select Polygons
\(=\) NYC_CD
, Points
\(=\) NYC_BikeCrashes
. Name the count field crashes
, and save the output file as NYC_CD_2.geojson
. Click Run
The new layer should appear on the map as NYC_CD_2
.
Hide the other two layers by unchecking boxes next to them in the menu.
Let’s color the districts by number of bike crashes. Go to the Symbology
tab in the new layer’s Properties
. Set Symbology type
\(=\) Graduated
, and Value
\(=\) crashes
. Click Classify
, then OK
In which parts of NYC are cars hitting the most cyclists?
Let’s combine this with population data to get a per-capita rate of bike crashes.
Open the Attribute Table
for NYC_CD_2
. Open the Field Calculator
Create new field called crash1000
of type Decimal number (real)
.
For the Expression
, write crashes / NYC_Pop_pop_P2010 * 1000
. Click OK
You should now see the new variable in the attribute table.
Click the Edit
button to save the dataset
Click Save
Change the symbology again, to color districts by crash1000
. Remember to click Classify
The distribution of crashes per 1000 residents (crash1000
) looks similar to crashes
. But there are a few missing districts (due to no population data)
We can fix this (aesthetically) by un-hiding the NYC_CD
layer.
You may want to change the color of NYC_CD
to something neutral, like gray.
Export the map to image. Place map on a New Print Layout
, add and properly format a legend, scale bar, etc. The end product should look vaguely like this.
Load bike routes: Layer
\(\to\) Add Layer
\(\to\) Add Vector Layer...
Navigate to NYC_BikeRoutes.geojson
in the NYC_BikeRoutes
folder. Click Add
This is a polyline layer, representing NYC’s cycling lanes and greenways.
Let’s calculate how many miles of bike lanes are in each district.
Navigate to the Sum Line Lengths
tool.
Vector
menu \(\to\) Analysis Tools
\(\to\) Sum Line Length
Select Lines
\(=\) NYC_BikeRoutes
, Polygons
\(=\) NYC_CD_2
.
Name the lengths and count fields bk_length
and bk_count
.
Save the output file as NYC_CD_3.geojson
. Click Run
The new layer should appear as NYC_CD_3
on the map. Let’s take a look
Open the Attribute Table
for NYC_CD_3
. The bk_length
variable is there, but what are its units of measurement?
We can look up the project’s units of measurement by going to Project
menu \(\to\) Properties
In Project Properties
, we see Units for distance measurement
\(=\) Meters
Let’s convert bk_length
from meters to miles. Go back to NYC_CD_3
’s Attribute Table
\(\to\) Field Calculator
. Create new field called bk_miles
of type Decimal number (real)
. For the Expression
, write bk_length / 1609.344
. Click OK
You should now see the new variable in the attribute table.
Now let’s create a per-capita measure (miles of bike lane per 1000 residents). Go back to the Field Calculator
. Create new field called bklane1000
of type Decimal number (real)
. For the Expression
, write
bk_miles / NYC_Pop_pop_2010 * 1000
. Click OK
You should now see the new variable in the attribute table.
Remember to save your changes!
Change the symbology of NYC_CD_3
, to color districts by bklane1000
. Click Classify
, then OK
We now have a plot of “miles of bike lane per 1000 residents” (bklane1000
). Again, there are a few missing districts (due to no population data)
Let’s un-hide the NYC_CD
layer again to show the missing districts. Ideally, these should be colored gray or some other neutral color
Export the map to image. Project
\(\to\) New Print Layout
. Place map and legend to layout. The resulting legend includes several items we’ll need to remove (everything but NYC_CD_3
and NYC_CD
)
Remember how we did this in lab 2? (Item Properties
\(\to\) un-check Auto update
, etc.) Remove everything but NYC_CD_3
and NYC_CD
, change layers’ names from NYC_CD_3
to “Miles per 1000 residents” and NYC_CD
to “No population”.
The output file should look roughly like this.
We can export the dataset we created, by right-clicking on NYC_CD_3
layer, and selecting Export
\(\to\) Save Features As...
The file can be saved in a variety of formats, including .geojson
, .shp
, .csv
and Excel
Let’s create a bar plot. Go to Processing
menu \(\to\) Toolbox
. This will open a Processing Toolbox
menu on the right
Double-click on Bar plot
in the Plots
submenu on the right
Set Input layer
\(=\) NYC_CD_3
, Category name field
\(=\) NYC_Pop_Borough
, and Value field
\(=\) crashes
. Set the output destination under Bar plot
to barplot_1.html
. Click Run
You can access the resulting plot by clicking on the Bar plot
item in the Results Viewer
on the right
This will open a web browser window, with a bar plot of total bike crashes per district, grouped by borough. As expected,Brooklyn and Manhattan lead the pack.
Export the plot by clicking on the Download plot as png
button in the top-right
The output file should look like this. For further analysis, you can also open the .csv
file we just created in Excel or R, and start crunching numbers (see below)
Problem Set 3
Your assignment (if using QGIS):
NYC_Rats/NYC_Rats.csv
rats_per_capita.png
Can you make this map?
Loading R packages
To implement these steps in R, we will be using the sf
package (again)
library(sf)
NOTE: The code to produce Map 1 and Map 2 in R is in lab03_demo.R
on RStudio Cloud, and in Lab03PS03.zip
(posted on Canvas).
Loading spatial data
Let’s load the NYC community district boundaries into R, using sf::read_sf()
:
nyc_cd = sf::read_sf("Data/NYC_Communities/NYC_CD.geojson")
plot(nyc_cd["geometry"])
Loading non-spatial data
Load the tabular population data using read.csv()
, and preview the first few rows:
pop = read.csv(file = "Data/NYC_Population/NYC_Pop.csv")
head(pop)
Joining spatial to non-spatial data
Similar to QGIS’s Field Calculator
, let’s create a new boro_cd
variable in pop
:
pop$boro_cd = pop$boro_code*100 + pop$cd_code
We can join nyc_cd
and pop
by attribute boro_cd
using the merge()
command:
nyc_cd_2 = merge(x = nyc_cd, y = pop, by = "boro_cd")
print(nyc_cd_2,n=2) # Preview first 2 rows
## Simple feature collection with 59 features and 12 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -74.25559 ymin: 40.49613 xmax: -73.70001 ymax: 40.91553
## Geodetic CRS: WGS 84
## First 2 features:
## boro_cd shape_area shape_leng boro_code Borough cd_code
## 1 101 41692711.6252 69093.8528032 1 Manhattan 1
## 2 102 37604933.3646 32628.8675927 1 Manhattan 2
## cd_name pop_1970 pop_1980 pop_1990 pop_2000 pop_2010
## 1 Battery Park City, Tribeca 7706 15918 25366 34420 60978
## 2 Greenwich Village, Soho 84337 87069 94105 93119 90016
## geometry
## 1 MULTIPOLYGON (((-74.04388 4...
## 2 MULTIPOLYGON (((-73.99684 4...
Plotting merged features
Using the merged dataset nyc_cd_2
, let’s plot the districts by borough
plot(nyc_cd_2["Borough"])
… or plot them by population size in 2010:
plot(nyc_cd_2["pop_2010"],main="Population size in 2010")
Loading spatial data from a CSV table
Our data on bike collisions, like population, is also a csv
table:
crashes = read.csv(file = "Data/NYC_Collisions/NYC_BikeCrashes.csv")
But this file has geographic coordinates, which we can use to create a spatial object, borrowing the CRS from nyc_cd
:
crashes = sf::st_as_sf(x = crashes, coords = c("LONGITUDE","LATITUDE"),
crs = sf::st_crs(nyc_cd))
plot(crashes["geometry"])
Point-in-polygon analysis
To count the number of collisions in each community district, we can overlay nyc_cd_2
and crashes
with sf::st_intersects()
. The output is a classed list of feature IDs intersected:
o = sf::st_intersects(x = nyc_cd_2, y = crashes)
o
## Sparse geometry binary predicate list of length 59, where the predicate
## was `intersects'
## first 10 elements:
## 1: 15, 29, 188, 212, 236, 268, 284, 297, 624, 633, ...
## 2: 9, 10, 11, 18, 108, 135, 137, 169, 194, 274, ...
## 3: 79, 85, 165, 215, 242, 257, 260, 261, 329, 346, ...
## 4: 4, 12, 44, 101, 109, 116, 161, 210, 211, 233, ...
## 5: 8, 16, 47, 66, 75, 83, 105, 120, 149, 191, ...
## 6: 21, 58, 97, 227, 266, 280, 282, 307, 323, 330, ...
## 7: 76, 81, 110, 117, 164, 220, 229, 288, 301, 331, ...
## 8: 2, 186, 204, 285, 322, 353, 354, 360, 361, 389, ...
## 9: 89, 103, 129, 150, 286, 311, 348, 474, 528, 539, ...
## 10: 17, 54, 74, 82, 113, 156, 173, 228, 246, 249, ...
We can use the command lengths(o)
to find the number of points in each polygon, and assign this as a new variable in nyc_cd
:
nyc_cd_2$crashes = lengths(o)
Plot the new variable:
plot(nyc_cd_2["crashes"], main = "Bicycle Collisions (2013-2024)")
Creating per-capita measures
Let’s calculate and plot a new field, crashes per 1000 residents:
nyc_cd_2$crash1000 = nyc_cd_2$crashes / nyc_cd_2$pop_2010 * 1000
plot(nyc_cd_2["crash1000"], main = "Collisions per 1000 Residents")
Exporting Map 1 to image file
To save the map, we will use the png()
and dev.off()
commands:
png(filename = "Output/bike_crashes_per_capita_r.png",
width = 6, height = 4, units = "in", res = 300)
plot(nyc_cd_2["crash1000"],
main = "Bicycle Collisions per 1000 Residents", lwd=.5)
dev.off()
The output file should look like this.
Loading bike lane data
Import bike routes data
bikeroutes = sf::read_sf("Data/NYC_BikeRoutes/NYC_BikeRoutes.geojson",
crs=sf::st_crs(nyc_cd_2))
plot(bikeroutes["geometry"], reset=FALSE, col="forestgreen", lwd=.5)
plot(nyc_cd_2["geometry"], add = TRUE)
Line-in-polygon analysis
We need to calculate the miles of bike lane in each community district. We can think of this as a two-step routine:
For step 1, we can use sf::st_intersects()
to figure out which segments of bikeroutes
lie in which districts (nyc_cd_3
) (similar to point-in-polygon analysis):
o = sf::st_intersects(x = nyc_cd_2, y = bikeroutes)
o
## Sparse geometry binary predicate list of length 59, where the predicate
## was `intersects'
## first 10 elements:
## 1: 69, 72, 87, 107, 108, 138, 161, 899, 900, 905, ...
## 2: 73, 74, 86, 1798, 2862, 2863, 2864, 2865, 2866, 2867, ...
## 3: 42, 53, 104, 155, 157, 159, 1698, 1699, 1700, 1701, ...
## 4: 44, 65, 168, 1963, 2737, 3035, 3114, 3115, 3118, 3119, ...
## 5: 168, 1963, 3327, 3328, 3372, 3373, 3374, 3375, 3376, 3377, ...
## 6: 3455, 3459, 3460, 3466, 3467, 3526, 3527, 3528, 3529, 3530, ...
## 7: 46, 64, 66, 3816, 3817, 3886, 3887, 3888, 3944, 3945, ...
## 8: 10, 15, 17, 18, 81, 4376, 4414, 4440, 4441, 4442, ...
## 9: 4822, 4823, 4832, 4833, 4834, 4855, 4856, 4857, 4858, 4859, ...
## 10: 21, 22, 100, 101, 4799, 4801, 4842, 4850, 4851, 4852, ...
Line-in-polygon analysis (continued)
For step 2, we can write a loop that extracts the overlapping bike lane segments in each district, measures their (cumulative) length, and converts this sum to miles.
out <- c() # Create empty vector to store results
for(i in 1:length(o)){ # Open loop
bike_subset = bikeroutes[o[i][[1]],] # Take subset
bk_meters = sum(sf::st_length(bike_subset)) # Sum lengths
bk_km = bk_meters/1000 # Convert to km
bk_mi = bk_km * 0.6213712 # Convert to miles
out = c(out, bk_mi) # Concatenate
} # Close loop
We then add this as a variable to nyc_cd_2
nyc_cd_2$bk_miles <- out
Let’s quickly visualize the results:
plot(nyc_cd_2["bk_miles"],
main = "Miles of Bike Lane per District", lwd=.5)
Creating per-capita measures
Let’s calculate and plot a new field, miles of bicycle lane per 1000 residents:
nyc_cd_2$bklane1000 = nyc_cd_2$bk_miles / nyc_cd_2$pop_2010 * 1000
plot(nyc_cd_2["bklane1000"], main = "Bike Lane per 1000 Residents (mi)")
Exporting to image file
To save the map, we will use the png()
and dev.off()
commands:
png(filename = "Output/bike_lanes_per_capita_R.png",
width = 6, height = 4, units = "in", res = 300)
plot(nyc_cd_2["bklane1000"],
main = "Miles of Bicycle Lane per 1000 Residents", lwd=.5)
dev.off()
The output file should look like this.
Exporting the dataset
To save the data attribute table as a csv
file, we can use the write.csv()
command:
write.csv(x = nyc_cd_2, file = "Output/NYC_CD_2_R.csv")
Making bar plots
For each of NYC’s five boroughs, calculate averages of the variables we created:
borough_means = aggregate(
x = nyc_cd_2[,c("crashes","crash1000","bk_miles","bklane1000")],
by = list(Borough=nyc_cd_2$Borough),
FUN = mean )
This will create a new sf
object with 5 rows:
borough_means
borough_means
Make a bar plot of bike crashes in an average community district
barplot(height=borough_means$crashes, names.arg=borough_means$Borough,
main="Bicycle collisions per community district (mean)")
Make a bar plot of bike crashes per 1000 residents
barplot(height = borough_means$crash1000,
names.arg = borough_means$Borough,
main = "Collisions per 1000 residents (mean)")
Make a bar plot of miles of bike lane in an average community district
barplot(height = borough_means$bk_miles,
names.arg = borough_means$Borough,
main = "Bike lanes per community district (mi)")
Make a bar plot of miles of bike lane per 1000 residents
barplot(height = borough_means$bklane1000,
names.arg = borough_means$Borough,
main = "Bike lanes per 1000 residents (mi)")
Basic regression analysis
You can use an sf
object as you would a data.frame
to run regressions in R.
For example, suppose we wanted to see if there are fewer bicycle collisions where there are more bike lanes. We could run a basic linear regression of bicycle collisions (per 1000) on bike lanes (per 1000), with fixed effects for each borough:
summary(lm(crash1000 ~ bklane1000 + as.factor(Borough), data=nyc_cd_2))
##
## Call:
## lm(formula = crash1000 ~ bklane1000 + as.factor(Borough), data = nyc_cd_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9336 -2.6975 0.1465 1.7882 30.6929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.1527 2.0762 -1.037 0.3045
## bklane1000 56.4370 8.0849 6.981 4.85e-09 ***
## as.factor(Borough)Brooklyn 5.3402 2.2294 2.395 0.0202 *
## as.factor(Borough)Manhattan 8.0167 2.5250 3.175 0.0025 **
## as.factor(Borough)Queens 0.4357 2.3541 0.185 0.8539
## as.factor(Borough)Staten Island -4.9611 3.8647 -1.284 0.2048
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.982 on 53 degrees of freedom
## Multiple R-squared: 0.6308, Adjusted R-squared: 0.5959
## F-statistic: 18.11 on 5 and 53 DF, p-value: 1.942e-10
This analysis suggests there are more collisions where there are more bike lanes.
Adjusting for differences across boroughs, each additional mile of bike lane (per 1000 residents) is associated with 65 additional collisions (per 1000).
Does this mean bike lanes are “causing” collisions?
How should we interpret this result?
What are some alternative explanations of the positive correlation?
What sort of data would we need to test these alternative arguments?
Problem Set 3
Your assignment (if using R):
NYC_Rats/NYC_Rats.csv
rats_per_capita.png
Can you make this map?