---
title: "The Philadelphia Eviction Early Warning System"
subtitle: "A Predictive Model for Proactive Resource Allocation"
date: today
author:
- name: Angel Rutherford
- name: Ixchel Ramirez
- name: Tess Vu
email:
- tessavu@proton.me
- tessavu@upenn.edu
affiliation:
- name: University of Pennsylvania
department: Urban Spatial Analytics (MUSA)
city: Philadelphia
state: PA
url: https://www.design.upenn.edu/urban-spatial-analytics
format:
  html:
    code-fold: show
    toc: true
    toc_float: true
    toc-expand: true
    smooth-scroll: true
    embed-resources: true
    title-block-style: default
execute:
  warning: false
  message: false
editor:
  markdown:
    wrap: 72
---
# EXECUTIVE SUMMARY
Eviction is both a cause and consequence of poverty that destabilizes entire neighborhoods. Currently, city responses to eviction are reactive, with resources like legal aid and rental assistance deployed *after* filing volumes become a crisis. This project develops a Real-Time Operational Tool for the Philadelphia Office of Homeless Services and the Fair Housing Commission. By shifting from reactive to predictive analysis, we enable the city to allocate limited staff to specific census tracts predicted to experience elevated eviction filings in the coming month.
Our Negative Binomial regression model leverages temporal momentum, spatial spillover effects, policy intervention effects, property tax delinquency stress, and American Community Survey socioeconomic indicators to forecast monthly eviction filing counts at the census tract level.
The model demonstrates strong performance and lays the foundation for a practical, usable tool. Under a temporal validation strategy (training through 2023, testing on 2024-2025), it generalizes to future periods without overfitting. The analysis also identified stark racial disparities in eviction burden, with Black-majority tracts accounting for a disproportionate share of filings. These findings underscore the need for equity-centered implementation safeguards so that algorithmic resource allocation does not perpetuate existing disparities.
# I. EXPLORATORY DATA ANALYSIS (EDA)
## 1. SETUP AND LIBRARY LOADING
All required R packages were loaded at the start to streamline data cleaning, visualization, spatial analysis, and modeling.
```{r library}
#| message: false
#| warning: false
library(lubridate)
library(sf)
library(broom)
library(scales)
library(patchwork)
library(viridis)
library(kableExtra)
library(corrplot)
library(zoo)
library(MASS)
library(car)
library(caret)
library(pROC)
library(spdep)
library(tidycensus)
library(httr)
library(jsonlite)
library(glue)
library(stringr)
library(tidyr)
library(tidyverse)
library(dplyr)
library(knitr)
# Override masking.
select <- dplyr::select
filter <- dplyr::filter
# Disable scientific notation.
options(scipen = 999)
# Save figures.
#opts_chunk$set(fig.path = "final_figures/", dpi = 300)
```
Two visualization techniques were employed to provide clear and effective evidence of our findings. First, a consistent plotting theme was established to ensure readability and uniformity across all figures. Second, custom color palettes were applied to highlight key analytical distinctions, specifically racial categories and policy periods. Racial differences in eviction filings are central to the Eviction Lab’s dataset and visualizations and were incorporated to acknowledge inequitable outcomes. In addition, the periods before, during, and after the moratorium were visually differentiated to measure the impact of policy interventions on eviction patterns.
```{r customization}
# Custom color palette for policy periods.
period_colors <- c(
"Pre-Moratorium" = "#3498DB",
"Moratorium" = "#27AE60",
"Post-Moratorium" = "#E67E22"
)
# Custom color palette for racial categories.
racial_colors <- c(
"Black" = "#E74C3C",
"White" = "#3498DB",
"Hispanic" = "#F39C12",
"Other" = "#9B59B6"
)
```
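The consistent plotting theme mentioned above could be wrapped in a small reusable function. The sketch below is only an illustration: `theme_eviction()` is a hypothetical helper name, and its settings simply mirror the inline `theme_minimal()` and `theme()` calls used in the figures later in this document. The toy data frame is made up for the example.

```r
library(ggplot2)

# Hypothetical helper (not part of the original analysis): bundles the theme
# settings that the figures in this document apply inline.
theme_eviction <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", hjust = 0.5),
      axis.title = element_text(size = 10)
    )
}

# Toy counts (assumed values) used only to show the theme being applied.
toy_counts <- data.frame(filings_count = c(0, 0, 1, 2, 5))

p <- ggplot(toy_counts, aes(x = filings_count)) +
  geom_histogram(bins = 5, fill = "#E74C3C", color = "white") +
  labs(title = "Toy Example", x = "Filings", y = "Frequency") +
  theme_eviction()
```

Defining the theme once would keep every figure visually aligned and make later style changes a one-line edit.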
## 2. DATA LOADING AND INITIAL EXPLORATION
### Monthly Tract-Level Eviction Data
The data was accessed and downloaded from the Eviction Lab, an online database that compiles nationwide eviction filings from court records. For states and major cities, the Eviction Lab provides eviction datasets at two temporal resolutions: weekly and monthly. Both scales were initially examined for their distinct analytical value: weekly data highlight short‑term spikes and localized patterns, while monthly data reveal longer‑term trends and align more closely with policy implementation, as months serve as the standard catchment period for resource allocation and housing interventions.
A summary of the dataset’s dimensions and preview of the first few rows shows that the monthly eviction data provides space-time insights into filing patterns. The dataset is structured into eight columns:

1. Geographic unit of analysis (census tract),
2. Standard geographic identifier (census tract number),
3. Majority racial identity of the tract,
4. Month and year of record,
5. Eviction filings from 2020-2022, capturing the impact of the COVID-19 pandemic as an extreme destabilizing event and the subsequent moratorium response,
6. Average filings,
7. Average filings prior to the pandemic, and
8. Last upload date of the dataset.

```{r load-monthly-data}
# Load monthly eviction filings data at census tract level.
df_monthly_raw <- read.csv("data/eviction/philadelphia_monthly_2020_2021.csv")
# Display initial data dimensions.
cat("MONTHLY EVICTION DATA DIMENSIONS\n")
cat(sprintf("Rows: %s\n", comma(nrow(df_monthly_raw))))
cat(sprintf("Columns: %d\n", ncol(df_monthly_raw)))
cat(sprintf("Variables: %s\n", paste(names(df_monthly_raw), collapse = ", ")))
```
```{r monthly-data-table}
# Display first rows to understand data structure.
head(df_monthly_raw, 10) %>%
kable(caption = "Sample of Raw Monthly Eviction Filing Data",
col.names = c("Type", "GEOID", "Racial Majority", "Month", "Filings", "Average Filings", "Pre-Pandemic Average Filings", "Last Updated")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
### Weekly Tract-Level Eviction Data
The weekly dataset mirrors the structure of the monthly dataset but contains more rows because of its finer granularity. It also adds columns that specify the week relative to the dataset and the calendar date on which the data was recorded.
```{r load-weekly-data}
# Load weekly eviction filings data for high-frequency analysis.
df_weekly_raw <- read.csv("data/eviction/philadelphia_weekly_2020_2021.csv")
# Display weekly data dimensions.
cat("WEEKLY EVICTION DATA DIMENSIONS\n")
cat(sprintf("Rows: %s\n", comma(nrow(df_weekly_raw))))
cat(sprintf("Columns: %d\n", ncol(df_weekly_raw)))
```
```{r weekly-data-table}
# Display sample of weekly data.
head(df_weekly_raw, 10) %>%
kable(caption = "Sample of Raw Weekly Eviction Filing Data",
col.names = c("Type", "GEOID", "Racial Majority", "Week", "Date", "Filings", "Average Filings", "Pre-Pandemic Average Filings", "Last Updated")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
### Claims Data
The Eviction Lab provides tract‑level data that summarizes historical and contemporary monthly claims statistics, where claims represent the financial compensation sought by landlords in eviction proceedings. In addition to reporting the median claim amount, the dataset includes thresholds that capture the relative burden these claims pose to renters, offering a spectrum of financial severity across space and time. While these measures are not the primary variables used in this analysis, they provide valuable context for understanding eviction dynamics. Exploratory review suggests that approximately 3% of claims fall below $1,000, which may reflect unpaid fees rather than market‑rate rent, though in some cases it could indicate subsidized public housing rents. Around 6% of claims align with median market rents, while 14% represent unpaid rent for at least six months. These timelines highlight the varied circumstances underlying eviction filings, including the delays introduced by lengthy legal processes. Together, the dataset’s structured measures and exploratory insights illustrate both the quantitative severity of claims and the qualitative complexity of eviction proceedings.
```{r load-claims-data}
# Load aggregated monthly claims data showing financial severity.
df_claims_raw <- read.csv("data/eviction/philadelphia_claims_monthly.csv")
# Display claims data dimensions.
cat("CLAIMS DATA DIMENSIONS\n")
cat(sprintf("Rows: %s\n", comma(nrow(df_claims_raw))))
cat(sprintf("Columns: %d\n", ncol(df_claims_raw)))
```
```{r claims-data-table}
# Display claims data structure.
head(df_claims_raw, 10) %>%
kable(digits = 2, caption = "Sample of Monthly Claims Severity Data",
col.names = c("Date", "Median Claim", "Below $1,000", "Below Median Rent", "Over Six Months Rent", "Baseline Median Claim")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
The claims data is not used in the model. It is included only as background on the circumstances behind eviction filings, as summarized above, and the timelines it suggests could have varying causes, particularly because time-consuming legal processes can delay filings.
### Real Estate Tax Balances
In our analysis, tax delinquency is treated as a proxy for landlord financial stress. In theory, outstanding tax balances could pressure landlords into pursuing evictions as a means of recovering income or offsetting financial strain. Real estate tax balances were therefore explored to determine how financial stress on landlords may shape eviction filings.
```{r load-tax-data}
# Load real estate tax balances aggregated by census tract.
df_tax_raw <- read.csv("data/eviction/real_estate_tax_balances_census_tract.csv")
# Display tax data dimensions.
cat("TAX DELINQUENCY DATA DIMENSIONS\n")
cat(sprintf("Rows: %s\n", comma(nrow(df_tax_raw))))
cat(sprintf("Columns: %d\n", ncol(df_tax_raw)))
cat(sprintf("Variables: %s\n", paste(names(df_tax_raw), collapse = ", ")))
```
```{r tax-data-table}
# Display sample of tax delinquency data.
head(df_tax_raw, 10) %>%
kable(digits = 2, caption = "Sample of Real Estate Tax Balances by Census Tract",
col.names = c("Object ID", "Census Tract", "Properties", "Minimum Period", "Maximum Period", "Principal", "Interest", "Penalty", "Other", "Balance", "Average Balance")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
```
### Philadelphia Census Tract Geometry
Since all datasets in this analysis are aggregated at the census tract level, a spatial file containing tract boundaries was loaded to access the geographic and geometric attributes of each tract. The inclusion of this file allows for the integration of eviction, claim, property, and other datasets with physical locations for mapping and spatial analysis.
```{r load-spatial-data}
# Load Pennsylvania census tracts shapefile.
pa_tracts_sf <- st_read("data/pa_tracts/pa_tracts.shp", quiet = TRUE)
# Filter to Philadelphia County only using FIPS code 42101.
philly_tracts_sf <- pa_tracts_sf %>%
filter(str_starts(GEOID, "42101"))
# Display spatial data summary.
cat("PHILADELPHIA CENSUS TRACT GEOMETRY\n")
cat(sprintf("Total Philadelphia Tracts: %d\n", nrow(philly_tracts_sf)))
cat(sprintf("Coordinate Reference System: %s\n", st_crs(philly_tracts_sf)$input))
```
## 3. DATA CLEANING AND FEATURE ENGINEERING
Certain data cleaning and manipulation techniques were applied across all datasets to establish structural uniformity and streamline joins with other datasets. These techniques include removing invalid records, converting geographic identifiers to character strings so that they are not treated as numeric values, and formatting dates. Each dataset was also cleaned and manipulated according to its own characteristics and potential insights.
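As an illustration of why identifiers are kept as character strings, the sketch below checks whether a value looks like a Philadelphia County tract GEOID: 11 digits beginning with the county FIPS code 42101. The helper name `is_philly_tract_geoid()` and the toy values are assumptions made for this example and are not part of the original pipeline.

```r
# Hypothetical helper: TRUE only for 11-digit GEOIDs that start with the
# Philadelphia County FIPS code (42101). NA and empty strings return FALSE.
is_philly_tract_geoid <- function(geoid) {
  grepl("^42101[0-9]{6}$", geoid)
}

# Toy identifiers (assumed values, not taken from the dataset).
toy_geoid <- c("42101000100", "sealed", "", NA, "42003010300")

data.frame(
  GEOID = toy_geoid,
  valid_philly_tract = is_philly_tract_geoid(toy_geoid)
)
```

A check of this kind can be run before joins to flag records, such as sealed or missing tracts, that will not match the tract geometry.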
### Monthly Eviction Data Preparation
To prepare the monthly eviction data for analysis, raw filings were cleaned and enriched with theoretically motivated features. A categorical variable was created to distinguish pre‑moratorium, moratorium, and post‑moratorium periods, alongside a binary flag for months when the moratorium was active. Months were categorized into seasonal indicators as seasonal patterns in filings can be seen in the figures provided on the Eviction Lab’s website.
```{r clean-monthly}
# Clean and prepare monthly data with temporal and policy features.
df_monthly <- df_monthly_raw %>%
# Rename filings variable for clarity.
rename(filings_count = filings_2020) %>%
# Parse month string to proper date format.
mutate(
date = as.Date(paste0("01/", month), format = "%d/%m/%Y"),
year = year(date),
month_num = month(date),
month_name = month(date, label = TRUE, abbr = FALSE)
) %>%
# Filter out rows with invalid dates.
filter(!is.na(date)) %>%
# Convert GEOID to character for joining operations.
mutate(GEOID = as.character(GEOID)) %>%
# Create policy intervention indicator for eviction moratorium period.
mutate(
moratorium_active = ifelse(
date >= as.Date("2020-03-01") & date <= as.Date("2021-09-30"),
1,
0
),
# Create categorical period variable.
period = case_when(
date < as.Date("2020-03-01") ~ "Pre-Moratorium",
date >= as.Date("2020-03-01") & date <= as.Date("2021-09-30") ~ "Moratorium",
date > as.Date("2021-09-30") ~ "Post-Moratorium"
),
period = factor(period, levels = c("Pre-Moratorium", "Moratorium", "Post-Moratorium"))
) %>%
# Create seasonal indicators for seasonality analysis.
mutate(
season = case_when(
month_num %in% c(12, 1, 2) ~ "Winter",
month_num %in% c(3, 4, 5) ~ "Spring",
month_num %in% c(6, 7, 8) ~ "Summer",
month_num %in% c(9, 10, 11) ~ "Fall"
),
season = factor(season, levels = c("Winter", "Spring", "Summer", "Fall"))
) %>%
# Create post-moratorium ramp variable to capture rebound effect.
mutate(
post_moratorium_months = ifelse(
date > as.Date("2021-09-30"),
as.numeric(difftime(date, as.Date("2021-09-30"), units = "days")) / 30,
0
)
)
# Display summary of cleaned monthly data.
cat("CLEANED MONTHLY DATA SUMMARY\n")
summary(df_monthly %>% select(filings_count, date, year, moratorium_active))
```
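To make the policy-period logic easier to verify, the following base R sketch applies the same date cutoffs to a handful of example dates. The function names `classify_period()` and `post_moratorium_ramp()` are hypothetical and exist only for this illustration; the cutoff dates are the ones used in the chunk above.

```r
moratorium_start <- as.Date("2020-03-01")
moratorium_end   <- as.Date("2021-09-30")

# Hypothetical helper: same three-way split as the case_when() above.
classify_period <- function(date) {
  period <- ifelse(
    date < moratorium_start, "Pre-Moratorium",
    ifelse(date <= moratorium_end, "Moratorium", "Post-Moratorium")
  )
  factor(period, levels = c("Pre-Moratorium", "Moratorium", "Post-Moratorium"))
}

# Hypothetical helper: days since the moratorium ended, divided by 30.
post_moratorium_ramp <- function(date) {
  ifelse(date > moratorium_end, as.numeric(date - moratorium_end) / 30, 0)
}

example_dates <- as.Date(c("2020-02-01", "2020-03-01", "2021-09-01",
                           "2021-10-01", "2022-01-01"))

data.frame(
  date = example_dates,
  period = classify_period(example_dates),
  ramp_months = round(post_moratorium_ramp(example_dates), 2)
)
```

Because the ramp divides elapsed days by 30, it is an approximate month count: October 2021, the first post-moratorium month, receives a value of about 0.03 rather than 1.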
To verify the scope of the cleaned dataset, a summary table was generated showing the earliest and latest months covered, the total number of distinct months, the number of census tracts represented, and the overall number of observations. This check confirms that the dataset ranges from January 2020 to November 2025.
```{r clean-monthly-coverage}
# Check temporal coverage of cleaned data.
df_monthly %>%
summarize(
min_date = min(date),
max_date = max(date),
n_months = n_distinct(date),
n_tracts = n_distinct(GEOID),
total_obs = n()
) %>%
kable(caption = "Monthly Data Temporal Coverage",
col.names = c("Minimum Date", "Maximum Date", "Months", "Tracts", "Observations")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
### Weekly Data Preparation
Weekly data was cleaned and enriched in parallel with the monthly dataset, but adapted to reflect the finer weekly temporal scale. A summary table of temporal coverage reveals that the dataset spans from the final days of December 2019 through the second week of November 2025, offering a more nuanced timeline down to the exact day.
```{r clean-weekly}
# Clean and prepare weekly data for finer temporal analysis.
df_weekly <- df_weekly_raw %>%
# Rename filings variable for consistency.
rename(filings_count = filings_2020) %>%
# Convert week_date to proper date format.
mutate(
date = as.Date(week_date),
year = year(date),
month_num = month(date),
week_of_year = week(date)
) %>%
# Filter out rows with invalid dates.
filter(!is.na(date)) %>%
# Convert GEOID to character for consistency.
mutate(GEOID = as.character(GEOID)) %>%
# Create moratorium indicator.
mutate(
moratorium_active = ifelse(
date >= as.Date("2020-03-01") & date <= as.Date("2021-09-30"),
1,
0
)
)
# Display summary of cleaned weekly data.
cat("CLEANED WEEKLY DATA SUMMARY\n")
summary(df_weekly %>% select(filings_count, date, year, moratorium_active))
```
```{r clean-weekly-coverage}
# Check weekly temporal coverage.
df_weekly %>%
summarize(
min_date = min(date),
max_date = max(date),
n_weeks = n_distinct(date),
n_tracts = n_distinct(GEOID),
total_obs = n()
) %>%
kable(caption = "Weekly Data Temporal Coverage",
col.names = c("Minimum Date", "Maximum Date", "Weeks", "Tracts", "Observations")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
### Claims Data Preparation
The monthly claims data was extended with a new column, `claim_ratio`, which expresses each month's median claim relative to the pre‑pandemic baseline median claim.
```{r clean-claims}
# Clean claims data showing financial severity of evictions.
df_claims <- df_claims_raw %>%
# Convert month_date to proper date format.
mutate(
date = as.Date(month_date),
year = year(date),
month_num = month(date)
) %>%
# Filter valid dates.
filter(!is.na(date)) %>%
# Calculate ratio to pre-pandemic baseline.
mutate(claim_ratio = median_claim / median_claim_baseline)
# Display claims summary stats.
cat("CLAIMS DATA SUMMARY\n")
summary(df_claims %>% select(median_claim, median_claim_baseline, claim_ratio))
```
### Tax Delinquency Data Preparation
We defined delinquency in the property tax dataset as properties that received penalties rather than properties that just have overdue balances. Due to the marginal frequency observed in the dataset’s summary statistics (only 8 properties identified as overdue with no penalties), overdue properties were removed. For each tract, counts of delinquent properties, total balances, and average balances per property were calculated, along with log transformations of balances and delinquency age in years. Summary statistics were generated to report mean and median property counts, balances, and average balances across tracts, providing a tract‑level profile of tax delinquency severity.
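The log transformation of balances uses `log1p()`, which computes log(1 + x). The short sketch below, built on made-up balances, shows why that form is convenient when some values may be zero.

```r
# Toy tract balances in dollars (assumed values), including a zero balance.
toy_balance <- c(0, 250, 12000, 1500000)

log(toy_balance)    # first element is -Inf, which a regression cannot use
log1p(toy_balance)  # log(1 + x): zero stays at zero, large values are compressed

# expm1() inverts log1p(), so transformed values can be mapped back to dollars.
expm1(log1p(toy_balance))
```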
```{r clean-tax-data}
# Clean and prepare tax data with delinquency separation.
df_tax <- df_tax_raw %>%
mutate(
census_tract = as.character(census_tract)
) %>%
filter(!str_detect(census_tract, "Other/Unidentified")) %>%
mutate(
# True delinquency indicators, based on penalty being on previous year's taxes.
has_penalty = penalty > 0,
delinquent_prop_count = ifelse(has_penalty, num_props, 0),
delinquent_balance = ifelse(has_penalty, balance, 0),
avg_delinquent_balance = ifelse(has_penalty, balance / num_props, 0),
# Overdue-only indicators, based on it being the current year.
overdue_only_props = ifelse(!has_penalty, num_props, 0),
overdue_only_balance = ifelse(!has_penalty, balance, 0),
# Transformations for modeling.
log_delinquent_balance = log1p(delinquent_balance),
log_overdue_balance = log1p(overdue_only_balance)
) %>%
rename(GEOID = census_tract)
# Display tax data summary.
cat("TAX DELINQUENCY DATA SUMMARY\n")
cat(sprintf("Tracts with Tax Data: %d\n", nrow(df_tax)))
cat(sprintf("Total Properties with Any Tax Balance: %s\n", comma(sum(df_tax$num_props))))
cat(sprintf("Delinquent Properties (Penalty > 0): %s (%.1f%%)\n",
comma(sum(df_tax$delinquent_prop_count)),
sum(df_tax$delinquent_prop_count) / sum(df_tax$num_props) * 100))
cat(sprintf("Overdue Properties (No Penalty): %s (%.1f%%)\n",
comma(sum(df_tax$overdue_only_props)),
sum(df_tax$overdue_only_props) / sum(df_tax$num_props) * 100))
cat(sprintf("Total Delinquent Balance: $%s\n", comma(sum(df_tax$delinquent_balance))))
cat(sprintf("Total Overdue-Only Balance: $%s\n", comma(sum(df_tax$overdue_only_balance))))
```
```{r overdue-removal}
# 8 overdue properties is marginal, so remove.
df_tax <- df_tax_raw %>%
mutate(
census_tract = as.character(census_tract)
) %>%
filter(!str_detect(census_tract, "Other/Unidentified")) %>%
filter(penalty > 0) %>%
mutate(
delinquent_prop_count = num_props,
delinquent_balance = balance,
avg_delinquent_balance_per_prop = balance / num_props,
log_delinquent_balance = log1p(balance),
delinquency_age_years = 2025 - min_period
) %>%
rename(GEOID = census_tract)
# Recalculate average.
df_tax <- df_tax %>%
mutate(avg_balance = balance / num_props)
```
```{r tax-summary-stats}
# Calculate summary stats for tax delinquency.
tax_summary <- df_tax %>%
summarize(
mean_props = mean(num_props),
median_props = median(num_props),
mean_balance = mean(balance),
median_balance = median(balance),
mean_avg_balance = mean(avg_balance),
median_avg_balance = median(avg_balance)
)
tax_summary %>%
pivot_longer(everything(), names_to = "statistic", values_to = "value") %>%
mutate(value = ifelse(str_detect(statistic, "balance"),
paste0("$", comma(round(value, 2))),
comma(round(value, 2)))) %>%
kable(caption = "Tax Delinquency Summary Statistics",
col.names = c("Statistic", "Value")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
### Sealed Tract Assessment
Monthly eviction filings were assessed for sealed census tracts, where GEOIDs are suppressed for privacy protection. These tracts accounted for a small share of observations but exhibited unusually high filing counts. Because sealed tracts cannot be geographically mapped, they were excluded from spatial modeling to preserve tract‑level specificity. While this exclusion reduces the ability to design geographically actionable policies and outreach, the sealed tracts nonetheless represent legitimate evictions and highlight systematic vulnerabilities that cannot be directly addressed in this analysis.
```{r sealed-tracts}
# Identify sealed census tracts where GEOID is hidden for privacy protection.
# Reduces geographic specificity for modeling.
sealed_count <- df_monthly %>%
filter(GEOID == "sealed" | is.na(GEOID) | GEOID == "") %>%
nrow()
sealed_pct <- sealed_count / nrow(df_monthly) * 100
cat("SEALED TRACT ASSESSMENT\n")
cat(sprintf("Sealed or Missing Tract Observations: %s (%.1f%%)\n", comma(sealed_count), sealed_pct))
```
```{r sealed-tracts-stats}
# Analyze sealed tracts before removal.
sealed_tracts_analysis <- df_monthly %>%
filter(GEOID == "sealed" | is.na(GEOID) | GEOID == "") %>%
summarize(
n_obs = n(),
n_unique_dates = n_distinct(date),
total_filings = sum(filings_count, na.rm = TRUE),
mean_filings = mean(filings_count, na.rm = TRUE),
median_filings = median(filings_count, na.rm = TRUE),
min_filings = min(filings_count, na.rm = TRUE),
max_filings = max(filings_count, na.rm = TRUE),
sd_filings = sd(filings_count, na.rm = TRUE),
zero_pct = sum(filings_count == 0, na.rm = TRUE) / n() * 100
)
cat(sprintf("Total Observations: %s\n", comma(sealed_tracts_analysis$n_obs)))
cat(sprintf("Months of Data: %d\n", sealed_tracts_analysis$n_unique_dates))
cat(sprintf("Total Filings in Sealed Tracts: %s\n", comma(sealed_tracts_analysis$total_filings)))
cat(sprintf("Mean Filings per Tract-Month: %.1f\n", sealed_tracts_analysis$mean_filings))
cat(sprintf("Median Filings per Tract-Month: %.1f\n", sealed_tracts_analysis$median_filings))
cat(sprintf("Range: %d to %d\n", sealed_tracts_analysis$min_filings, sealed_tracts_analysis$max_filings))
cat(sprintf("Standard Deviation: %.1f\n", sealed_tracts_analysis$sd_filings))
cat(sprintf("Zero-Filing Months: %.1f%%\n", sealed_tracts_analysis$zero_pct))
```
```{r sealed-tracts-distribution}
# Show distribution of sealed tract filings.
sealed_distribution <- df_monthly %>%
filter(GEOID == "sealed") %>%
group_by(filings_count) %>%
summarize(n = n(), .groups = "drop") %>%
arrange(desc(filings_count))
head(sealed_distribution, 10) %>%
kable(caption = "Highest Filing Counts in Sealed Tracts",
col.names = c("Filings", "Frequency")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
The sealed tracts are outliers, but they represent legitimate evictions that reflect current systemic issues and could correspond to vulnerable communities experiencing a recent mass eviction (the highest unsealed filings are from 2025). They are excluded because sealed records cannot be mapped to a tract in the data. This privacy protection prevents geographically actionable policies and outreach, and it must be stressed that excluding them is *not* ideal.
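The exclusion itself is not shown in this section. A minimal sketch of how it might look, using the same sealed-or-missing condition as the chunks above, is given here on a toy data frame. The object names `toy_monthly` and `toy_mappable` and all values are assumptions for illustration.

```r
# Toy tract-month records (assumed values, not taken from the dataset).
toy_monthly <- data.frame(
  GEOID = c("42101000100", "sealed", NA, "", "42101000200"),
  filings_count = c(3, 40, 2, 1, 0),
  stringsAsFactors = FALSE
)

# Same condition used above to identify sealed or missing tracts.
is_sealed <- toy_monthly$GEOID == "sealed" |
  is.na(toy_monthly$GEOID) |
  toy_monthly$GEOID == ""

# Keep only records that can be matched to a tract geometry.
toy_mappable <- toy_monthly[!is_sealed, ]

# Share of filings lost to the exclusion, worth reporting alongside results.
sealed_filing_share <- sum(toy_monthly$filings_count[is_sealed]) /
  sum(toy_monthly$filings_count)
```

Reporting the share of filings removed keeps the cost of the exclusion visible even though those filings cannot be mapped.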
## 4. EXPLORATORY DATA ANALYSIS
### Distribution Analysis: Zero-Inflation Assessment
Understanding the distribution of eviction filings is critical for selecting an appropriate modeling approach. Because eviction filings are discrete counts, they call for models designed for count data, such as Poisson and Negative Binomial regression. The two differ in how they handle overdispersion: the Poisson model assumes the variance equals the mean, while the Negative Binomial model allows the variance to exceed the mean, which makes it more flexible for data with excess variability and many zeros. Zero-inflation statistics were calculated and visualized to assess the distribution of monthly eviction filings. The summary statistics reveal a substantial share of zeros and a variance that exceeds the mean, yielding a dispersion ratio well over 1 and indicating that the Negative Binomial model is the more appropriate choice for our data.
A histogram of raw counts of evictions further highlights the prevalence of zeros and the data's skewed distribution, while a log‑transformed histogram provides a clearer view of variation.
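The dispersion ratio can be illustrated with simulated data. In the sketch below, the mean `mu` and the Negative Binomial size parameter `theta` are arbitrary values chosen for the example, not estimates from the eviction data. Under a Poisson distribution the ratio of variance to mean should be close to 1, while under a Negative Binomial with variance mu + mu^2 / theta it should be close to 1 + mu / theta.

```r
set.seed(42)

n     <- 100000
mu    <- 2     # assumed mean count, for illustration only
theta <- 0.5   # assumed Negative Binomial size (dispersion) parameter

y_pois <- rpois(n, lambda = mu)
y_nb   <- rnbinom(n, size = theta, mu = mu)

# Variance-to-mean ratio, the same statistic reported for the filings data.
dispersion_ratio <- function(y) var(y) / mean(y)

dispersion_ratio(y_pois)  # expected to be near 1
dispersion_ratio(y_nb)    # expected to be near 1 + mu / theta = 5
```

A ratio far above 1 in observed counts is therefore a sign that the Poisson equal-variance assumption does not hold.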
```{r monthly-distribution}
# Calculate zero-inflation stats for monthly data.
zero_stats_monthly <- df_monthly %>%
summarize(
total_obs = n(),
zero_count = sum(filings_count == 0),
zero_pct = zero_count / total_obs * 100,
positive_count = sum(filings_count > 0),
mean_all = mean(filings_count),
mean_positive = mean(filings_count[filings_count > 0]),
median_all = median(filings_count),
variance = var(filings_count),
dispersion_ratio = variance / mean_all
)
# Display zero-inflation stats.
cat("MONTHLY ZERO-INFLATION STATS\n")
cat(sprintf("Total Observations: %s\n", comma(zero_stats_monthly$total_obs)))
cat(sprintf("Zero Filings: %s (%.1f%%)\n",
comma(zero_stats_monthly$zero_count),
zero_stats_monthly$zero_pct))
cat(sprintf("Positive Filings: %s (%.1f%%)\n",
comma(zero_stats_monthly$positive_count),
100 - zero_stats_monthly$zero_pct))
cat(sprintf("Mean (all observations): %.2f\n", zero_stats_monthly$mean_all))
cat(sprintf("Mean (positive only): %.2f\n", zero_stats_monthly$mean_positive))
cat(sprintf("Variance: %.2f\n", zero_stats_monthly$variance))
cat(sprintf("Dispersion Ratio (Variance/Mean): %.2f\n", zero_stats_monthly$dispersion_ratio))
```
```{r monthly-distribution-viz}
#| fig-dpi: 300
#| fig-height: 6
#| fig-width: 12
# Raw distribution.
monthly_plot_raw <- df_monthly %>%
ggplot(aes(x = filings_count)) +
geom_histogram(bins = 30, fill = "#E74C3C", color = "white") +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
labs(
title = "Raw Count Distribution",
x = "Monthly Filings Count",
y = "Frequency"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
axis.title = element_text(size = 10)
)
# Log-transformed distribution.
monthly_plot_log <- df_monthly %>%
mutate(log_filings = log10(filings_count + 1)) %>%
ggplot(aes(x = log_filings)) +
geom_histogram(bins = 30, fill = "#E67E22", color = "white") +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
labs(
title = "Log-Transformed Distribution",
x = "log10(Monthly Filings + 1)",
y = "Frequency"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
axis.title = element_text(size = 10)
)
# Display combined plot.
monthly_plot_raw + monthly_plot_log +
plot_annotation(
title = "Distribution of Monthly Eviction Filings by Census Tract",
subtitle = sprintf("%.1f%% Zeros | Dispersion Ratio = %.1f",
zero_stats_monthly$zero_pct,
zero_stats_monthly$dispersion_ratio),
caption = "Data: Philadelphia Eviction Filings 2020-2025 via Eviction Lab",
theme = theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11),
plot.caption = element_text(size = 9)
)
)
```
### Weekly Filing Distribution
Weekly eviction filings were also assessed for zero-inflation and overdispersion. The weekly distribution was so heavily zero-inflated that even a standard Negative Binomial model was unsuitable. The data structure is compatible with Zero-Inflated Negative Binomial models, which may be explored in future work. This analysis proceeds with the monthly eviction data to preserve the performance of the Negative Binomial model.
```{r weekly-distribution}
#| fig-dpi: 300
#| fig-height: 6
#| fig-width: 12
# Calculate zero-inflation stats for weekly data.
zero_stats_weekly <- df_weekly %>%
summarize(
total_obs = n(),
zero_count = sum(filings_count == 0),
zero_pct = zero_count / total_obs * 100,
mean_all = mean(filings_count),
variance = var(filings_count),
dispersion_ratio = variance / mean_all
)
cat("WEEKLY ZERO-INFLATION STATS\n")
cat(sprintf("Total Observations: %s\n", comma(zero_stats_weekly$total_obs)))
cat(sprintf("Zero Filings: %s (%.1f%%)\n",
comma(zero_stats_weekly$zero_count),
zero_stats_weekly$zero_pct))
cat(sprintf("Mean: %.2f\n", zero_stats_weekly$mean_all))
cat(sprintf("Dispersion Ratio: %.2f\n", zero_stats_weekly$dispersion_ratio))
```
```{r weekly-distribution-histogram}
#| fig-dpi: 300
#| fig-height: 6
#| fig-width: 12
# Raw distribution.
plot_weekly_raw <- df_weekly %>%
ggplot(aes(x = filings_count)) +
geom_histogram(bins = 30, fill = "#3498DB", color = "white") +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
labs(
title = "Raw Count Distribution",
x = "Weekly Filings Count",
y = "Frequency"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
axis.title = element_text(size = 10)
)
# Log-transformed distribution.
plot_weekly_log <- df_weekly %>%
mutate(log_filings = log10(filings_count + 1)) %>%
ggplot(aes(x = log_filings)) +
geom_histogram(bins = 30, fill = "#8E44AD", color = "white") +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
labs(
title = "Log-Transformed Distribution",
x = "log10(Weekly Filings + 1)",
y = "Frequency"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
axis.title = element_text(size = 10)
)
# Display combined plot.
plot_weekly_raw + plot_weekly_log +
plot_annotation(
title = "Distribution of Weekly Eviction Filings by Census Tract",
subtitle = sprintf("%.1f%% Zeros | Dispersion Ratio = %.1f",
zero_stats_weekly$zero_pct,
zero_stats_weekly$dispersion_ratio),
caption = "Data: Philadelphia Eviction Filings via Eviction Lab",
theme = theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11))
)
```
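As a rough illustration of the zero-inflation check described above, the sketch below compares the observed share of zeros with the share implied by a Poisson and by a method-of-moments Negative Binomial with the same mean and variance. It runs on simulated counts, not the eviction data, and all parameters (`n`, the 60% structural-zero share, `size`, `mu`) are assumptions chosen only for demonstration.

```r
# Simulated tract-week counts; NOT the eviction data.
set.seed(42)
n <- 5000

# Mix structural zeros with Negative Binomial counts (assumed parameters).
is_structural_zero <- rbinom(n, size = 1, prob = 0.6) == 1
counts <- ifelse(is_structural_zero, 0, rnbinom(n, size = 0.8, mu = 1.5))

m <- mean(counts)
v <- var(counts)

# Variance-to-mean ratio; values well above 1 indicate overdispersion.
dispersion_ratio <- v / m

# Observed share of zeros.
zero_obs <- mean(counts == 0)

# Zero share implied by a Poisson with the same mean.
zero_pois <- dpois(0, lambda = m)

# Zero share implied by a method-of-moments Negative Binomial.
nb_size <- m^2 / (v - m)
zero_nb <- dnbinom(0, size = nb_size, mu = m)

round(c(dispersion_ratio = dispersion_ratio,
        zero_obs = zero_obs,
        zero_pois = zero_pois,
        zero_nb = zero_nb), 3)
```

An observed zero share far above the Negative Binomial figure would point toward a zero-inflated specification; a small gap suggests the Negative Binomial already absorbs most of the zeros.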
### Tax Delinquency Distribution
The histogram of average delinquent balances per property across census tracts is highly right-skewed: most tracts show low balances, while a few extreme outliers reveal that a small number of tracts carry disproportionately high tax burdens. A histogram of log-transformed tax balances reduces this skew, producing a more symmetric, bell-shaped distribution that is better suited to statistical modeling.
```{r tax-distribution-histogram}
#| fig-dpi: 300
#| fig-height: 6
#| fig-width: 12
# Raw distribution.
plot_tax_raw <- df_tax %>%
ggplot(aes(x = avg_delinquent_balance_per_prop)) +
geom_histogram(bins = 40, fill = "#9B59B6", color = "white") +
scale_x_continuous(labels = dollar_format()) +
scale_y_continuous(labels = comma) +
labs(
title = "Raw Average Balance",
x = "Average Delinquent Balance per Property ($)",
y = "Number of Census Tracts"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
# Log-transformed distribution.
plot_tax_log <- df_tax %>%
mutate(log_balance = log10(avg_delinquent_balance_per_prop + 1)) %>%
ggplot(aes(x = log_balance)) +
geom_histogram(bins = 40, fill = "#3498DB", color = "white") +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma) +
labs(
title = "Log-Transformed",
x = "log10(Average Balance + 1)",
y = "Number of Census Tracts"
) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", hjust = 0.5))
# Display combined plot.
plot_tax_raw + plot_tax_log +
plot_annotation(
title = "Distribution of Average Tax Delinquency Balance by Census Tract",
caption = "Data: OpenDataPhilly Real Estate Tax Balances",
theme = theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11))
)
```
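To put a number on the effect of the log transformation, the sketch below computes a moment-based skewness before and after applying `log10(x + 1)`. The balances are simulated from a log-normal distribution rather than taken from the tax data, and `skewness()` is a hypothetical helper defined here, not a function used elsewhere in this analysis.

```r
set.seed(1)

# Simulated right-skewed balances in dollars (log-normal); NOT the tax data.
balance <- rlnorm(400, meanlog = 7, sdlog = 1)

# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(balance)
skew_log <- skewness(log10(balance + 1))

round(c(raw = skew_raw, log10 = skew_log), 2)
```

A skewness near zero after the transformation is consistent with the roughly symmetric shape seen in the log-scale histogram.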
### Quantile Analysis
To further assess the tail behavior of monthly eviction filings, quantile statistics were calculated for positive filing counts. These include the minimum, the 1st, 5th, 25th, 50th (median), 75th, 95th, and 99th percentiles, and the maximum. The statistics give a clear view of the heavy right tail of the distribution, complement the zero-inflation and overdispersion diagnostics, and indicate the need for a variable that addresses abnormal spikes.
```{r quantile-analysis}
# Calculate quantiles for positive filings to understand tail behavior.
quantile_stats_monthly <- df_monthly %>%
filter(filings_count > 0) %>%
summarize(
min = min(filings_count),
q01 = quantile(filings_count, 0.01),
q05 = quantile(filings_count, 0.05),
q25 = quantile(filings_count, 0.25),
median = median(filings_count),
q75 = quantile(filings_count, 0.75),
q95 = quantile(filings_count, 0.95),
q99 = quantile(filings_count, 0.99),
max = max(filings_count)
)
# Display quantile distribution.
quantile_stats_monthly %>%
pivot_longer(everything(), names_to = "quantile", values_to = "filings") %>%
kable(digits = 1, caption = "Monthly Filing Count Quantiles (Positive Counts Only)",
col.names = c("Quantile", "Filings")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
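One way such a spike variable could be built is sketched below: compute every threshold in a single `quantile()` call, then flag counts above the 99th percentile. The data are simulated, the 99% cutoff is an assumption, and `spike_flag` is a hypothetical name rather than a variable defined elsewhere in this analysis.

```r
set.seed(7)

# Simulated positive monthly counts with a heavy right tail; NOT the eviction data.
filings <- rnbinom(2000, size = 0.5, mu = 4) + 1

# All thresholds in one call rather than one quantile() call per percentile.
probs <- c(0, 0.01, 0.05, 0.25, 0.50, 0.75, 0.95, 0.99, 1)
q <- quantile(filings, probs = probs)

# Hypothetical spike indicator: counts strictly above the 99th percentile.
spike_flag <- filings > q[["99%"]]

q
table(spike_flag)
```

The cutoff is a modeling choice: a lower percentile flags more tract-months as spikes, while a higher one isolates only the most extreme values.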
## 5. TEMPORAL ANALYSIS: SYSTEM-WIDE TRENDS
### Monthly Aggregate Trend with Policy Markers
The plot and summary statistics of monthly evictions over time show a clear policy-linked trajectory from 2020 to 2025: filings dropped sharply at the onset of the eviction moratorium, remained suppressed throughout its duration, and rose after it ended. This pattern underscores the impact of policy interventions and highlights the need for temporal controls in modeling.
```{r monthly-system-trend-plot}
#| fig-dpi: 300
#| fig-height: 7
#| fig-width: 14
# Calculate total monthly filings across all tracts.
monthly_aggregate <- df_monthly %>%
group_by(date, period) %>%
summarize(
total_filings = sum(filings_count, na.rm = TRUE),
mean_filings = mean(filings_count, na.rm = TRUE),
median_filings = median(filings_count, na.rm = TRUE),
n_tracts = n_distinct(GEOID),
.groups = "drop"
)
# Create line plot with policy intervention markers.
monthly_trend_plot <- ggplot(monthly_aggregate, aes(x = date, y = total_filings)) +
geom_line(color = "#C0392B", linewidth = 1.5) +
geom_point(aes(color = period), size = 4) +
# Mark start of eviction moratorium.
geom_vline(xintercept = as.Date("2020-03-01"),
linetype = "dashed", color = "#27AE60", linewidth = 1) +
annotate("text", x = as.Date("2020-03-01"), y = max(monthly_aggregate$total_filings) * 0.95,
label = "Moratorium\nStarts", hjust = -0.1, color = "#27AE60", size = 4, fontface = "bold") +
# Mark end of eviction moratorium.
geom_vline(xintercept = as.Date("2021-09-30"),
linetype = "dashed", color = "#E67E22", linewidth = 1) +
annotate("text", x = as.Date("2021-09-30"), y = max(monthly_aggregate$total_filings) * 0.85,
label = "Moratorium\nEnds", hjust = 1.1, color = "#E67E22", size = 4, fontface = "bold") +
scale_y_continuous(labels = comma) +
scale_color_manual(values = period_colors) +
labs(
title = "Total Monthly Eviction Filings: System-Wide Trend (2020-2025)",
subtitle = "Suppression during moratorium followed by sustained elevation after end.",
x = "Date (Month)",
y = "Total Monthly Filings Across All Census Tracts",
color = "Policy Period",
caption = "Data: Philadelphia Eviction Filings via Eviction Lab"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14),
legend.position = "bottom"
)
# Display system-wide trend plot.
monthly_trend_plot
```
```{r monthly-system-trend-table}
# Calculate summary statistics by policy period.
monthly_aggregate %>%
group_by(period) %>%
summarize(
n_months = n(),
mean_monthly = mean(total_filings),
median_monthly = median(total_filings),
min_monthly = min(total_filings),
max_monthly = max(total_filings),
sd_monthly = sd(total_filings),
.groups = "drop"
) %>%
kable(digits = 0, caption = "Monthly Filing Statistics by Policy Period",
col.names = c("Period", "Months", "Monthly Mean", "Monthly Median", "Monthly Minimum", "Monthly Maximum", "Monthly Standard Deviation")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
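The `period` grouping used above is defined earlier in the analysis. As a minimal sketch of how such a variable could be derived from dates, the example below uses the two cut points marked on the trend plot. The helper `assign_period()` and the labels "Pre-Moratorium" and "Moratorium" are assumptions for illustration; only "Post-Moratorium" appears in the code of this section.

```r
# Cut points taken from the vertical markers in the trend plot.
moratorium_start <- as.Date("2020-03-01")
moratorium_end   <- as.Date("2021-09-30")

# Hypothetical helper; labels other than "Post-Moratorium" are assumed.
assign_period <- function(date) {
  period <- ifelse(date < moratorium_start, "Pre-Moratorium",
            ifelse(date <= moratorium_end, "Moratorium", "Post-Moratorium"))
  factor(period, levels = c("Pre-Moratorium", "Moratorium", "Post-Moratorium"))
}

dates <- as.Date(c("2020-01-01", "2020-06-01", "2021-09-01",
                   "2021-10-01", "2024-05-01"))
data.frame(date = dates, period = assign_period(dates))
```

Storing the result as an ordered set of factor levels keeps tables and plot legends in chronological order.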
### Seasonality Analysis
The bar plots of post-moratorium eviction filings by season and by month show elevated filings in winter and summer, along with an alternating month-to-month pattern of elevation from winter through early spring. Summary statistics for both plots show that filings remained largely stable across seasons and months, with slightly higher values consistent with the visual patterns.
```{r seasonality-boxplot}
#| fig-dpi: 300
#| fig-height: 6
#| fig-width: 14
# Calculate seasonal pattern excluding moratorium period for cleaner view.
seasonality_data <- df_monthly %>%
filter(period == "Post-Moratorium") %>%
group_by(season) %>%
summarize(
mean_filings = mean(filings_count, na.rm = TRUE),
median_filings = median(filings_count, na.rm = TRUE),
sd_filings = sd(filings_count, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(season)
# Create seasonal pattern bar plot.
seasonality_boxplot <- seasonality_data %>%
ggplot(aes(x = season, y = mean_filings)) +
geom_col(fill = "#3498DB", alpha = 1) +
geom_errorbar(aes(ymin = mean_filings - sd_filings, ymax = mean_filings + sd_filings),
width = 0.3, color = "black") +
labs(
title = "Seasonal Pattern in Eviction Filings",
x = "Season",
y = "Mean Filings per Tract-Month"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Calculate monthly pattern excluding moratorium period for cleaner view.
monthly_data <- df_monthly %>%
filter(period == "Post-Moratorium") %>%
group_by(month_name, month_num) %>%
summarize(
mean_filings = mean(filings_count, na.rm = TRUE),
median_filings = median(filings_count, na.rm = TRUE),
sd_filings = sd(filings_count, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(month_num)
# Create monthly pattern bar plot.
monthly_boxplot <- monthly_data %>%
ggplot(aes(x = reorder(month_name, month_num), y = mean_filings)) +
geom_col(fill = "#3498DB", alpha = 1) +
geom_errorbar(aes(ymin = mean_filings - sd_filings, ymax = mean_filings + sd_filings),
width = 0.3, color = "black") +
labs(
title = "Monthly Pattern in Eviction Filings",
x = "Month",
y = "Mean Filings per Tract-Month"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, hjust = 0.5),
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Display combined seasonal and monthly plot.
seasonality_boxplot + monthly_boxplot +
plot_annotation(
title = "Seasonal and Monthly Patterns in Eviction Filings",
subtitle = "Post-Moratorium",
caption = "Data: Philadelphia Eviction Filings via Eviction Lab",
theme = theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 12))
)
```
```{r seasonality-table}
# Display seasonality statistics table.
seasonality_data %>%
kable(digits = 2, caption = "Seasonal Stats (Post-Moratorium Period)",
col.names = c("Season", "Mean Filings", "Median Filings", "Standard Deviation Filings")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
```{r monthly-table}
# Display monthly statistics table.
monthly_data %>%
kable(digits = 2, caption = "Monthly Stats (Post-Moratorium Period)",
col.names = c("Month", "Month Number", "Mean Filings", "Median Filings", "Standard Deviation Filings")) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
```
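The `season`, `month_name`, and `month_num` columns summarized above are created earlier in the analysis. The sketch below shows one possible way to derive them from a date, assuming meteorological seasons (December to February as winter, and so on). The helper `month_to_season()` and that season definition are assumptions and may differ from the grouping actually used.

```r
# Hypothetical helper: map month numbers to meteorological seasons (assumed definition).
month_to_season <- function(month_num) {
  seasons <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
               "Summer", "Summer", "Fall", "Fall", "Fall", "Winter")
  factor(seasons[month_num], levels = c("Winter", "Spring", "Summer", "Fall"))
}

dates <- as.Date(c("2022-01-15", "2022-04-15", "2022-07-15",
                   "2022-10-15", "2022-12-15"))

# Extract the month number, then look up the built-in month name and the season.
month_num <- as.integer(format(dates, "%m"))
data.frame(
  date = dates,
  month_num = month_num,
  month_name = month.name[month_num],
  season = month_to_season(month_num)
)
```

Defining the season levels explicitly controls the order of the bars on the seasonal plot's x-axis.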
### Claims Severity Over Time
A temporal plot shows that median eviction claim amounts rose sharply at the onset of the pandemic, peaking in 2021–2022 before gradually declining and stabilizing near the pre-pandemic baseline, suggesting a temporary inflation in claim severity during the moratorium and immediate post-moratorium periods.
```{r claims-trend}
#| fig-dpi: 300
#| fig-height: 7
#| fig-width: 14
# Create median claim amount trend showing financial severity evolution.
claims_plot <- ggplot(df_claims %>% filter(!is.na(median_claim)),
aes(x = date, y = median_claim)) +
geom_line(color = "#27AE60", linewidth = 1.5) +
geom_point(color = "#27AE60", size = 3) +
geom_hline(aes(yintercept = median_claim_baseline),
linetype = "dashed", color = "#E74C3C", linewidth = 1) +
annotate("text", x = min(df_claims$date) - 100, y = df_claims$median_claim_baseline[1] + 200,
label = sprintf("Pre-Pandemic Baseline: $%s",
comma(df_claims$median_claim_baseline[1])),
hjust = 0, color = "#E74C3C", fontface = "bold") +
geom_vline(xintercept = as.Date("2020-03-01"),
linetype = "dotted", color = "#27AE60", alpha = 1, linewidth = 1) +
geom_vline(xintercept = as.Date("2021-09-30"),
linetype = "dotted", color = "#E67E22", alpha = 1, linewidth = 1) +
scale_y_continuous(labels = dollar_format()) +
labs(
title = "Median Eviction Claim Amount Over Time",
x = "Date (Month)",
y = "Median Claim Amount ($)",
caption = "Data: Philadelphia Eviction Claims via Eviction Lab"
) +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", size = 14))
# Display claims trend plot.
claims_plot
```
## 6. RACIAL DISPARITY ANALYSIS
### Filing Burden by Racial Majority
Summary statistics and plots reveal stark disparities across tract racial majorities: 51.9% of filings occurred in Black-majority tracts, despite these tracts representing a smaller share of the city overall. In contrast, White-majority tracts accounted for just 22.1% of filings, though they comprised 36.0% of the comparison group. Hispanic-majority tracts showed near parity, while “Other” tracts had slightly lower filing rates than their representation. These patterns suggest disproportionate eviction exposure in Black communities, underscoring the need for equity-based analysis of our model’s performance.
```{r racial-disparity}
#| fig-dpi: 300
#| fig-height: 8
#| fig-width: 12
# Calculate filing statistics by tract racial majority.
racial_stats <- df_monthly %>%