---
title: "Raw Data Import"
author: "Carrie"
output: html_document
---
# Reading in the data from the drive
```{r}
library(readr)
library(readxl)
library(here)
library(tidyverse)

df_simplified <- read_excel(here::here("Nonprofit_Baltimore_Analysis.xlsx"), sheet = 3)

# Different IRS submission forms:

# e-Postcard (Form 990-N) file
data_download_epostcard <- read_delim("data-download-epostcard.txt",
delim = ",", escape_double = FALSE, trim_ws = TRUE, col_names = FALSE)
problems(data_download_epostcard)

# Publication 78 file
data_download_pub78 <- read_delim("data-download-pub78.txt",
delim = ",", escape_double = FALSE, trim_ws = TRUE)

# State IRS file - limited to Baltimore City
eo_md <- read_csv("eo_md.csv") # mapped with city border limits - this originally covered the full state; used eo_md.pdf for information about the fields
irs <- read_csv("irs.csv") # master business file; includes pub78 and epostcard
```
"To perform the analysis, we collected data from federal and local government agencies data repositories of Baltimore City nonprofits and Baltimore City neighborhood data. Baltimore nonprofit data was organized from three national nonprofit reports provided by the IRS (2021b), and Baltimore neighborhood data was downloaded in a geospatial file format from the Baltimore City Office of Information & Technology (2020). **We used data updated in January 2020**, before the pandemic when many nonprofits paused their work or closed. We made this decision in hopes of coming to results that could speak to more generalizable non-profit decision-making, rather than any pandemic-related decision-making processes or skewed data sets because of temporary closures.
The **Baltimore City Office of Information & Technology’s [BCOIT] (2020)** Baltimore neighborhood data file provided total resident numbers and racial breakdowns for each designated neighborhood. Additionally, the file provided a mapped shapefile that could be uploaded into a geographic information system (GIS) tool for interactive mapping and analysis (ArcGIS online).
**this should be public - want to see if we can find the link**
https://data.baltimorecity.gov/datasets/8112521d3e284518b9fa497a188bfb45/explore?location=39.284656%2C-76.620522%2C10.82
(used Jan 2020 data)
**also ideally want the link - for IRS= came from the master business file**
https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads
(got Jan 2020 data at the time)
Three separate IRS data files were downloaded representing all IRS-registered nonprofits in the United States or the State of Maryland: Publication 78 Data and Form 990-N (e-Postcard) US data from the IRS (2021b), and the Exempt Organizations Business Master File Extract (EO BMF) Maryland data (IRS, 2021a, b). These files were merged using non-profits’ unique Employee Identification Number (EIN). The merged lists made a comprehensive list of non-profits with central offices in Baltimore City. To establish which non-profits would be included in the ‘Baltimore City’ dataset, we used the Baltimore neighborhood shapefile, which has established city boundary lines. **We then cross-analyzed non-profit street addresses, and only included addresses within the existing shapefile city boundary lines**. This cross analysis (along with limitations discussed below) resulted in a complete non-profit dataset of 4082 organizations. The datasets included non-profits’ street address, reported assets, and IRS assigned asset code (designated between 1 and 9 by established ranges, e.g. code 1 is assigned for non-profits with assets less than $50,000). We used this coding system to define a “high asset” sub-category of non-profits, where any nonprofit coded 5 or above ($500,000 or more in total assets) was marked a “high asset” non-profit.
The IRS datasets also included each organizations National Taxonomy of Exempt Entities (NTEE) code. The NTEE code is an alphanumeric system that categorizes 501.c.3 organizations by their primary purpose. For this study, we used the 10 broadest categories (identified by roman numerals I-X) for an analysis by non-profit sub-sectors: Arts, Culture, and Humanities; Education; Environment and Animals; Health; Human Services; International, Foreign Affairs; Public, Societal Benefit; Religion Related; Mutual/Membership Benefit; and Unknown, Unclassified (Jones, 2019).
To analyze organizations by the racial makeup of Baltimore neighborhoods, we coded non-profits with a 1 if the nonprofit’s primary address is located within a majority African American Neighborhood—a neighborhood where greater than half of the residents identified as African American—or a 0 otherwise. We used this coding in analyses answering research questions three and four.
Limitations and Excluded Nonprofits
We narrowed the complete non-profit dataset to ensure analysis on non-profits that mainly provide a direct service to communities. This meant that we excluded social clubs with 501.c.3 status such as lodges and fraternal orders because their missions and funds are mostly used to assist members rather than the wider community, meaning that they may not accurately portray the access to resources or services provided to the communities in which they are located. Any lodges and fraternal orders were identified and filtered using their National Taxonomy of Exempt Entities (NTEE) code.
We also reviewed each non-profit address checking for postal (P.O.) boxes or otherwise erroneous addresses that could not be accurately mapped. These were also removed from the analyzed dataset. Given that a primary focus of this paper involves the structural power and accessibility of nonprofits within the communities they serve, nonprofits that could not be properly geolocated would skew neighborhood level analyses.
In a similar light, nonprofits with multiple addresses either within or outside of Baltimore City, Maryland are beyond the scope of this paper; the IRS data predominantly consisted of only a primary Baltimore address, meaning that secondary addresses could not be considered, and nonprofit resources could not be proportionately distributed between locations. Although only primary addresses appeared in the IRS data, it is believed that the financial data encompasses all locations and subdivisions of a nonprofit. Due to a lack of viable alternatives, financials and primary addresses were taken as-is from the IRS data.
Additionally, around 1,333 or 32.656% of the remaining Baltimore City nonprofits were missing an NTEE code altogether, meaning that they could not be categorized. Although it represented a significant limitation of available data, these nonprofits were grouped together and analyzed separately for all NTEE analyses. They remained a part of any analysis that did not break down by NTEE code. "
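The "high asset" flag described in the methods text above is not re-derived anywhere below, so here is a minimal sketch, assuming the EO BMF file read in as `eo_md` stores the IRS asset code in an `ASSET_CD` column (codes 1-9 by asset range, per eo-info.pdf):
```{r}
# Hedged sketch: flag "high asset" nonprofits as those with an IRS asset code of
# 5 or above (i.e., $500,000 or more in total assets, as described in the text).
# ASSET_CD is assumed to be the asset-code column of the EO BMF extract.
eo_md <- eo_md %>%
  mutate(high_asset = !is.na(ASSET_CD) & as.numeric(ASSET_CD) >= 5)
count(eo_md, ASSET_CD, high_asset)
```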
```{r}
colnames(irs)
colnames(eo_md)
colnames(data_download_pub78)
colnames(data_download_epostcard)
colnames(df_simplified) # from the excel they gave me
```
```{r}
library(raster)
library(here)
neighborhoods<-shapefile(here::here("Neighborhoods-shp/Neighborhoods.shp"))
```
- Double check thresholds and try to recreate everything from scratch.
- Figure?
- Look into revoked Baltimore neighborhoods?
- And misspellings?
Hi Carrie,
Here’s where the IRS data came from: README.txt. In the “original data” folder are the files along with the R code that we used to filter down the IRS data into the usable data set.
RE: the discrepancy between asset code and asset amount. Some orgs have an asset code identified, but have an empty cell for asset amount. This seems to be an original dataset problem. So, I’m guessing when you ran your numbers those orgs were either excluded or counted as 0.
Hope this helps. Let me know what else might be on your mind in relation to this/next steps.
Thanks,
Tyler
It does indeed look like that was responsible for the difference. I have been working to recreate the other tables so that the code we release matches, but maybe we should share your code too?
I also see a difference for the table about counts for the different classifications of neighborhoods and NTEE type.
To get to this, I created a percentage variable by taking the `Normalized African American Population` variable, multiplying it by 100, and rounding to one digit after the decimal, then using a threshold of greater than or equal to 50% to categorize a majority AA neighborhood. I tried rounding with 2 or zero digits, and I also tried a threshold of strictly greater than 50% rather than greater than or equal to 50%, but I didn’t quite get the numbers in the manuscript.
Best,
Carrie
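A toy sketch of the rounding and >= 50% threshold check described in the note above; the column name `Normalized African American Population` is taken from that note, and the data frame and values here are made up purely for illustration:
```{r}
# Toy example (made-up values) of the percentage + threshold coding described above.
toy_neighborhoods <- tibble(
  neighborhood = c("A", "B", "C"),
  `Normalized African American Population` = c(0.499, 0.500, 0.503)
)
toy_neighborhoods %>%
  mutate(
    percent_AA = round(`Normalized African American Population` * 100, digits = 1),
    Maj_AA = percent_AA >= 50
  )
```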
I also think it might help to add some captions.
Assets - use the code - that accounts for missing data better than the raw number.
African American - use percentages of Black and non-Black residents for each neighborhood -
say a neighborhood is 45% Black, 40% white, and 5% Asian - Black residents are the largest group, but it is not a Black-majority neighborhood.
Use Black >= 50%.
See if the data is updated for the neighborhood information.
This is where we got the MD irs info: https://www.irs.gov/charities-non-profits/exempt-organizations-business-master-file-extract-eo-bmf
But, they update it, so it’s likely not the same data file that we had a few years ago. I don’t think we’d be able to piece it back together, because the other way to get historic data is through submissions organized by month, rather than a particular update that gathers across the year. And it’s national only, rather than by state.
Not sure of a pathway to tracking, unless we’re using entirely new data sets with the updates (I don’t expect things to change that much as far as conclusions). Thoughts?
Tyler
# reading in old data from drive and new data eo_md.csv
eo_md.csv https://www.irs.gov/pub/irs-soi/eo_md.csv
```{r}
irs_old <- read_csv("eo_md.csv")
df_simplified <- read_excel(here::here("Nonprofit_Baltimore_Analysis.xlsx"), sheet = 3) # this file was made by our team
# check for rows with a non-Baltimore or missing city
filter(df_simplified, CITY != "BALTIMORE" | is.na(CITY))
```
# New eo_md
Whether a 990 is required or not: see the filing requirement code (the PF filing req code is the 990-PF equivalent).
Filing requirement codes (required to file forms or not):
- 01 = 990
- 02 = 990, but income less than $25,000
- 03 = group return (don't know what that means)
- 06 = church
- 07 = gov
- 00 = not required to file
```{r}
irs_new<- read_csv("New_version_data/eo_md_downloaded_Aug22.csv") # direct link https://www.irs.gov/pub/irs-soi/eo_md.csv
```
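As a quick check of the filing requirement codes noted above, a hedged sketch (assuming the relevant column in the new EO BMF extract is named `FILING_REQ_CD`):
```{r}
# Hedged sketch: tabulate the 990 filing requirement codes listed in the note above.
# FILING_REQ_CD is assumed to be the filing requirement column; it is normalized to
# a two-character code in case readr parsed it as numeric.
irs_new %>%
  mutate(filing_req = sprintf("%02d", as.integer(FILING_REQ_CD)),
         filing_req_label = case_when(
           filing_req == "01" ~ "990 (or 990-EZ) required",
           filing_req == "02" ~ "990 - income less than $25,000 (990-N)",
           filing_req == "03" ~ "Group return",
           filing_req == "06" ~ "Not required to file (church)",
           filing_req == "07" ~ "Government 501(c)(1)",
           filing_req == "00" ~ "Not required to file (all other)",
           TRUE ~ "Other code"
         )) %>%
  count(filing_req, filing_req_label, sort = TRUE)
```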
## compare eo_md files
```{r}
irs_old <-filter(irs_old, CITY == "BALTIMORE") # filter for just Baltimore
irs_new <-filter(irs_new, CITY == "BALTIMORE")
identical(irs_old[1,], irs_new[1,]) # some of the rows are identical
identical(irs_old[2,], irs_new[2,])
identical(irs_old[3,], irs_new[3,]) # some are slightly different
glimpse(bind_rows(irs_new[2,], irs_old[2,]))
glimpse(bind_rows(irs_new[3,], irs_old[3,])) # the tax period differs in this case; let's see if the 2019 data is in the new data too (EIN: 010613166)
count(irs_old, TAX_PERIOD) %>% print(n = 65)
count(irs_new, TAX_PERIOD) %>% print(n = 65)
count(df_simplified, TAX_PERIOD) # looks like lots of tax periods in the currently curated data - also looks like it is year and month
irs_new %>% filter(EIN == "010613166") # doesn't look like it - that is interesting... maybe it keeps only the latest return per EIN?
count(irs_new, EIN) %>% filter(n > 1)
count(irs_old, EIN) %>% filter(n > 1)
# yes, seems like only one unique row per EIN
```
```{r}
#How about for name?
count(irs_new, NAME) %>% filter(n >1)
count(irs_old, NAME) %>% filter(n >1)
# hmm, what are the repeat names? Those appear to be different divisions of the same CBO with the same tax period - do we want to count all of these? Sometimes the income is exactly the same for the multiples
filter(irs_new, NAME == "AMERICAN FEDERATION OF TEACHERS") %>% glimpse()
filter(irs_old, NAME == "AMERICAN FEDERATION OF TEACHERS") %>% glimpse()
```
## Separating tax period
```{r}
# separate TAX_PERIOD (YYYYMM) into year and month
irs_old <- irs_old %>% separate(remove = FALSE, col = TAX_PERIOD, into = c("year", "month"), sep = 4)
irs_new <- irs_new %>% separate(remove = FALSE, col = TAX_PERIOD, into = c("year", "month"), sep = 4)
df_simplified <- df_simplified %>% separate(remove = FALSE, col = TAX_PERIOD, into = c("year", "month"), sep = 4)
count(irs_new, year) %>% print(n = 20) %>% plot() # no NA - all years
count(irs_old, year) %>% print(n = 20) %>% plot() # 1838 NA, the rest are years
count(df_simplified, year) %>% print(n = 20) %>% plot() # lots of year 0 in this dataset - looking back at TAX_PERIOD, it is 0 for these rows
# According to https://www.irs.gov/pub/irs-soi/eo-info.pdf, this field is "the tax period of the latest return filed (YYYYMM)".
# What year do we want?? The paper kind of makes it seem like it is 2020.
# According to the new data, very few of the data points are before 2020.
```
# update new data
Make the new data look like the old - threshold the year at 2020 (from the paper: "We used data updated in January 2020, before the pandemic when many nonprofits paused their work or closed").
```{r}
# convert year and month to numeric in all three data sets
irs_new <- irs_new %>% mutate(across(c(year, month), as.numeric))
irs_old <- irs_old %>% mutate(across(c(year, month), as.numeric))
df_simplified <- df_simplified %>% mutate(across(c(year, month), as.numeric))
```
```{r}
# There appears to be quite a bit of data in the current dataset from after the pandemic began:
df_simplified %>% filter(year == 2020, month >1) %>% count(month)
irs_old %>% filter( year == 2020, month >1) %>% count(month)
```
## IRS epostcard
https://www.irs.gov/charities-non-profits/tax-exempt-organization-search-bulk-data-downloads
epostcard data: https://apps.irs.gov/pub/epostcard/data-download-epostcard.zip (last data posting Jan 2024)
dictionary: https://www.irs.gov/pub/irs-tege/990n-data-dictionary.pdf
Columns (from the data dictionary): EIN, Tax Year, Organization Name, Gross receipts not greater than, Organization has terminated, Tax Period Begin Date, Tax Period End Date, Website URL, Principal Officer’s Name, Principal Officer’s Address Line 1, Principal Officer’s Address Line 2, Principal Officer’s Address City, Principal Officer’s Address Province, Principal Officer’s Address State, Principal Officer’s Address Zip Code, Principal Officer’s Address Country, Organization Mailing Address Line 1, Organization Mailing Address Line 2, Organization Mailing Address City, Organization Mailing Address Province, Organization Mailing Address State, Organization Mailing Address Zip Code, Organization Mailing Address Country, Organization Doing Business as Name 1, Organization Doing Business as Name 2, Organization Doing Business as Name 3
```{r}
# specify character for last column
epost <- read_delim("New_version_data/data-download-epostcard_2024.txt",
delim = "|", escape_double = FALSE, col_names = FALSE,
trim_ws = TRUE, col_types = c("X26" = "c"))
```
check problems
```{r}
pepost<-problems(epost)
count(pepost, col)
colnames(epost) <- c("EIN", "Tax Year" , "Organization Name", "Gross receipts not greater than", "Organization has terminated", "Tax Period Begin Date" , "Tax Period End Date" , "Website URL" , "Principal Officer’s Name" , "Principal Officer’s Address Line 1", "Principal Officer’s Address Line 2", "Principal Officer’s Address City" , "Principal Officer’s Address Province", "Principal Officer’s Address State" , "Principal Officer’s Address Zip Code" , "Principal Officer’s Address Country" ,"Organization Mailing Address Line 1" , "Organization Mailing Address Line 2" , "Organization Mailing Address City" , "Organization Mailing Address Province" , "Organization Mailing Address State" , "Organization Mailing Address Zip Code", "Organization Mailing Address Country" ,"Organization Doing Business as Name 1", "Organization Doing Business as Name 2", "Organization Doing Business as Name 3")
library(janitor)
epost <-clean_names(epost)
```
Get problematic rows:
```{r}
epost %>% slice(pepost$row[1]: last(pepost$row)) %>% glimpse() # look at rows from the first through the last problematic row - the website URL value is not a URL
epost_prob_rows<-epost %>% slice(pepost$row)
```
Check if problematic rows matter to us
```{r}
#check for rows with MD in any column
unlist(sapply(epost_prob_rows, grep, pattern = "MD"))
# looks ok!
```
### IRS Publication 78 data
publication 78: https://apps.irs.gov/pub/epostcard/data-download-pub78.zip
colnames from: https://nccsgit.urban.org/nccs/datasets/pub78/#:~:text=Publication%2078%20is%20an%20example,to%20receive%20tax%2Ddeductible%20donations.
Question: should we do anything with the deductible codes
```{r}
pub78 <- read_delim("New_version_data/data-download-pub78_2024.txt",
delim = "|", escape_double = FALSE, col_names = FALSE,
trim_ws = TRUE)
colnames(pub78) <-c("EIN", "legal_name", "city", "state", "country", "deductibility_status")
```
limitations of pub 78: https://nccsgit.urban.org/nccs/datasets/pub78/#:~:text=Publication%2078%20is%20an%20example,to%20receive%20tax%2Ddeductible%20donations.
- PC: A public charity. 50% (60% for cash contributions)
- POF: A private operating foundation. 50% (60% for cash contributions)
- PF: A private foundation. 30% (generally)
- GROUP: Generally, a central organization holding a group exemption letter, whose subordinate units covered by the group exemption are also eligible to receive tax-deductible contributions, even though they are not separately listed. Depends on various factors
- LODGE: A domestic fraternal society, operating under the lodge system, but only if the contribution is to be used exclusively for charitable purposes. 30%
- UNKWN: A charitable organization whose public charity status has not been determined. Depends on various factors
- EO: An organization described in section 170(c) of the Internal Revenue Code other than a public charity or private foundation. Depends on various factors
- FORGN: A foreign-addressed organization. These are generally organizations formed in the United States that conduct activities in foreign countries. Certain foreign organizations that receive charitable contributions deductible pursuant to treaty are also included, as are organizations created in U.S. possessions. Depends on various factors
- SO: A Type I, Type II, or functionally integrated Type III supporting organization. 50% (60% for cash contributions)
- SONFI: A non-functionally integrated Type III supporting organization. 50% (60% for cash contributions)
- SOUNK: A supporting organization, unspecified type. 50% (60% for cash contributions)
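To see how these codes actually appear in the downloaded file, a small sketch that tallies `deductibility_status` against the legend above (note: some Publication 78 records may carry more than one comma-separated code):
```{r}
# Tally of Publication 78 deductibility codes, labeled using the legend above.
deduct_legend <- tribble(
  ~deductibility_status, ~description,
  "PC",    "Public charity",
  "POF",   "Private operating foundation",
  "PF",    "Private foundation",
  "GROUP", "Central organization with a group exemption letter",
  "LODGE", "Domestic fraternal society (lodge system)",
  "UNKWN", "Charity whose public charity status has not been determined",
  "EO",    "Other section 170(c) organization",
  "FORGN", "Foreign-addressed organization",
  "SO",    "Supporting organization (Type I, II, or functionally integrated III)",
  "SONFI", "Non-functionally integrated Type III supporting organization",
  "SOUNK", "Supporting organization, unspecified type"
)
pub78 %>%
  count(deductibility_status, sort = TRUE) %>%
  left_join(deduct_legend, by = "deductibility_status") # unmatched (e.g. multi-code) values get NA descriptions
```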
Combine IRS data
```{r}
irs_new<- rename_with(irs_new, tolower)
IRS <- left_join(irs_new, pub78, by = c("ein" = "EIN"), suffix = c("_eo", "_pub78")) # keeps all rows of irs_new and adds info from pub78 where possible as new columns - NA is given for rows not in pub78
IRS <-left_join(IRS, epost, by = "ein") # keeps all rows of IRS and adds info from epost where possible as new columns
IRS %>% dplyr::select(contains("year"))
```
Let's make the new data match the paper statement:
```{r}
filter(irs_new, year == 2020) %>% count(year, month) # looks like all 2020 data is after Jan
irs_new_2020<- filter(IRS, tax_year <= 2020 |is.na(tax_year)) # keep rows where year is less than or equal to 2020 or is na
#Check for jan data from 2020
filter(irs_new_2020, year == 2020, month ==1)
#irs_old_2020 <- filter(irs_old, year <2020) # this removes NAs
year_info <-IRS %>% dplyr::select(contains(c("year","period")))
print(filter(year_info, rowSums(is.na(year_info)) != ncol(year_info))) # hmm, why are there many with no date info...
# what are the rows with all NA???
IRS %>% filter(is.na(year))
```
## Revocations
```{r}
revocations <- read_delim("New_version_data/data-download-revocation.txt",
delim = "|", escape_double = FALSE, col_names = FALSE,
trim_ws = TRUE)
IRS_no_revoc<-anti_join(IRS, revocations, by = c("ein" = "X1")) # lost about 400
dim(IRS)
dim(IRS_no_revoc)
IRS<- IRS_no_revoc
```
## PO Boxes
```{r}
IRS %>% filter(str_detect(street, "PO ")) %>% pull(street)
IRS <- IRS %>% filter(!str_detect(street, "PO "))
```
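The pattern above only catches streets containing "PO "; here is a hedged check for other P.O. box spellings that might remain, assuming such variants exist in the street field at all:
```{r}
# Hedged check: look for remaining P.O. box style addresses written as
# "P.O. BOX", "P O BOX", "POBOX", etc., which the "PO " pattern would miss.
IRS %>%
  filter(str_detect(street, regex("P\\.?\\s*O\\.?\\s*BOX", ignore_case = TRUE))) %>%
  pull(street)
```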
```{r}
#write_rds(IRS, file = "New_version_data/new_IRSdata.rds")
```
### START HERE (after first chunk) ##########################
## Neighborhoods data
Neighborhoods data: https://data.baltimorecity.gov/datasets/8112521d3e284518b9fa497a188bfb45/explore?location=39.284656%2C-76.620522%2C10.82
https://data.baltimorecity.gov/datasets/8112521d3e284518b9fa497a188bfb45/explore
(used Jan 2020 data)
```{r}
Neighborhoods_NSA <- read_csv("New_version_data/Neighborhoods_NSA.csv")
Neighborhoods_NSA <-Neighborhoods_NSA %>% mutate(percent_AA = (Blk_AfAm/Population)*100 )
Neighborhoods_NSA <-Neighborhoods_NSA %>% mutate(Maj_Black =
case_when(percent_AA >=50 ~ TRUE,
percent_AA <50 ~FALSE))
```
```{r}
library(raster)
library(here)
neighborhoods<-shapefile(here::here("Neighborhoods-shp/Neighborhoods.shp"))
```
The merged files produced a comprehensive list of non-profits with central offices in Baltimore City. To establish which non-profits would be included in the ‘Baltimore City’ dataset, we used the Baltimore neighborhood shapefile, which has established city boundary lines. **We then cross-analyzed non-profit street addresses, and only included addresses within the existing shapefile city boundary lines**. This cross analysis (along with the limitations discussed below) resulted in a complete non-profit dataset of 4082 organizations. The datasets included non-profits’ street address, reported assets, and IRS-assigned asset code (designated between 1 and 9 by established ranges, e.g. code 1 is assigned for non-profits with assets less than $50,000).
## get lat and long for addresses of CBOs
```{r}
library(ggmap)
#IRS <- IRS %>% unite("address",street:zip, remove = FALSE, sep = ", ") # this next step is slow so commenting out
#geos<-IRS %>% pull(address) %>% geocode() # this step is slow so commenting out
#warnings(geos) # check things- relies on previous step that is slow
#IRS <-cbind(IRS, geos) #relies on previous step that is slow
#write_rds(IRS, file = "New_version_data/new_IRSdata.rds")
```
```{r}
IRS <- readRDS(here::here("New_version_data/new_IRSdata.rds")) # load the geocoded IRS data saved above (previously a machine-specific absolute path)
```
## Get lat and long for shape file
https://stackoverflow.com/questions/66381795/check-whether-point-coordinate-lies-within-polygon
https://www.statsilk.com/maps/convert-esri-shapefile-map-geojson-format
```{r}
# work with spatial data; sp package will load with rgdal.
library(terra)
# for metadata/attributes- vectors or rasters
library(raster)
library(sf)
#neighborhood_shape <- system.file("Neighborhoods-shp/Neighborhoods.cpg", "Neighborhoods-shp/Neighborhoods.dbf", "Neighborhoods-shp/Neighborhoods.prj", "Neighborhoods-shp/Neighborhoods.shp", "Neighborhoods-shp/Neighborhoods.shx", package = "raster")
#neighborhood_shape_1 <- system.file(here("Neighborhoods-shp/Neighborhoods.shp"), package = "raster")
neighborhood_shape <- shapefile("Neighborhoods-shp/Neighborhoods.shp") # raster::shapefile() takes the .shp path and picks up the sibling .dbf/.shx files
#identical(neighborhood_shape, neighborhood_shape_1)
#identical(neighborhood_shape, neighborhood_shape_2)
neighborhood_shape3 <-st_read("Neighborhoods-shp/Neighborhoods.shp", stringsAsFactors=FALSE)
```
https://stackoverflow.com/questions/66381795/check-whether-point-coordinate-lies-within-polygon
```{r}
geo_clean <- IRS %>% dplyr::select(ein, lon, lat) %>% drop_na(lon) # drops one row with a missing lon - will need to remove it from the CBOs
CRS <- st_crs(neighborhood_shape3$geometry)
pnts_sf <- st_as_sf(geo_clean, coords = c('lon', 'lat'), crs = st_crs(4326)) %>% st_set_crs(4326)
neighborhood_Sf <-neighborhood_shape3$geometry
neighborhood_Sf <- neighborhood_Sf %>% st_set_crs(4326)
pnts_trans <- st_transform(pnts_sf, 2163)
neighborhod_tt <- st_transform(neighborhood_shape3$geometry, 2163)
test <-st_intersection(pnts_trans, neighborhod_tt)
intersection <- pnts_sf %>% mutate(
intersection = as.integer(st_intersects( pnts_trans, neighborhod_tt )))
not_in_balt <- intersection %>% filter(is.na(intersection))
# getting an error about crs
crs_neighborhood <-st_crs(neighborhood_Sf)
crs_pnts <- st_crs(pnts_sf)
identical(crs_neighborhood, crs_pnts)
```
Combining the intersection info with the original, bigger IRS data, so we have the IRS data for the points in the neighborhood data (just places in Baltimore):
```{r}
IRS <- left_join(intersection, IRS, by = "ein") # sf version - need to make a version like this without the sf geometry (start here avocado)
```
checking that it worked
```{r}
# first row:
# ein geometry intersection
# 1 010591773 POINT (-76.69024 39.36632) 43
#neighborhood_shape[43,]$name
#filter(IRS, ein == "010591773")
#Looks like this is in that location
```
Combining it all together:
```{r}
neighborhood_shape <-as_tibble(neighborhood_shape)
neighborhood_shape<-neighborhood_shape %>% mutate(id = row_number())
cw_df_simp <-left_join(IRS, neighborhood_shape, by = c("intersection" = "id"))
cw_df_simp[1,] %>% glimpse()
# remove rows where intersection was not found (where intersection is NA) - aka CBO outside Baltimore
cw_df_simp <-cw_df_simp %>% drop_na(intersection)
```
## remove NTEE code IX [(IX. Mutual/Membership Benefit - Y)](https://urbaninstitute.github.io/nccs-legacy/ntee/ntee-history.html) - social clubs
- there only appears to be a small number of membership benefit clubs
(Keep NA for broad analyses of high vs low asset but possibly remove for other analyses)
```{r}
social_clubs <- cw_df_simp %>% filter(str_detect(string = ntee_cd, pattern = "Y|y"))
clubs_lodges <- cw_df_simp %>% filter(str_detect(string = name.x, pattern = "club|Club|CLUB|lodge|Lodge|LODGE"))
clubs_lodges2 <- cw_df_simp %>% filter(str_detect(string = legal_name, pattern = "club|Club|CLUB|lodge|Lodge|LODGE"))
# All of the second version are in the first:
clubs_lodges2$intersection %in% clubs_lodges$intersection
clubs_lodges$ntee_cd # NTEE codes of the club/lodge name matches
# also look for lodge
dim(social_clubs) # only 12
No_social_clubs <- cw_df_simp %>% filter(!str_detect(string = ntee_cd, pattern = "Y") | is.na(ntee_cd))
# first remove organizations with a "Y" NTEE code
cw_df_simp <- cw_df_simp %>% filter(!str_detect(string = ntee_cd, pattern = "Y") | is.na(ntee_cd))
# then remove organizations with "club" or "lodge" in the name
cw_df_simp <- cw_df_simp %>% filter(!str_detect(string = name.x, pattern = "club|Club|CLUB|lodge|Lodge|LODGE"))
```
## For today Halloween
publication 78: https://apps.irs.gov/pub/epostcard/data-download-pub78.zip
colnames from: https://nccsgit.urban.org/nccs/datasets/pub78/#:~:text=Publication%2078%20is%20an%20example,to%20receive%20tax%2Ddeductible%20donations.
eo_postcard
Question: should we do anything with the deductible codes and columns from eo_postcard
# to do after halloween
- check cw_df_simp for the deductibility column to see if there are more codes - for clubs/lodges before we remove them
- check for boosters and supporting organizations, fundraising
- How much overlap between eo and master file (see the sketch below)
- if using 2020 docs - go back to 2017 and up - organizations update every 3 years to stay in good standing if their asset code is higher than 0 (with an asset code of 0 they don't have to submit)
- check irs.csv vs eo and pub78
- eo info in OneDrive includes: https://livejohnshopkins-my.sharepoint.com/_forms/default.aspx
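For the overlap item above, a minimal sketch comparing EINs across the three files as they exist at this point in the script (note that `irs_new` has already been filtered to Baltimore, and the EIN columns may have been parsed with different types):
```{r}
# Hedged sketch: EIN overlap between the (Baltimore-filtered) EO BMF extract,
# Publication 78, and the e-Postcard file. EINs are normalized to plain
# character strings first so that differing parse types don't matter.
norm_ein <- function(x) as.character(as.numeric(x)) # strips any leading zeros consistently
bmf_eins   <- unique(norm_ein(irs_new$ein))
pub78_eins <- unique(norm_ein(pub78$EIN))
epost_eins <- unique(norm_ein(epost$ein))
sum(bmf_eins %in% pub78_eins)  # Baltimore BMF orgs also listed in Publication 78
sum(bmf_eins %in% epost_eins)  # Baltimore BMF orgs also in the e-Postcard file
mean(bmf_eins %in% pub78_eins) # as a proportion of the Baltimore BMF orgs
```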
Filing requirement codes:
- 01 = 990 (all other) or 990-EZ return
- 02 = 990 - Required to file Form 990-N - Income less than $25,000 per year
- 03 = 990 - Group return
- 04 = 990 - Required to file Form 990-BL, Black Lung Trusts
- 06 = 990 - Not required to file (church)
- 07 = 990 - Government 501(c)(1)
- 13 = 990 - Not required to file (religious organization)
- 14 = 990 - Not required to file (instrumentalities of states or political subdivisions)
- 00 = 990 - Not required to file (all other)
eo file activity codes: https://www.irs.gov/pub/irs-soi/eo_md.csv
Activity codes align somewhat with NTEE (classification is also somewhat helpful) - good for spotting fundraising, country club, social, membership benefit, mutual, and sport organizations, i.e., those benefiting themselves or focused on sports and leisure.
There is a 3-digit code for fundraising.
Start with the alphanumeric NTEE code, then the activity codes.
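A hedged starting point for the activity-code idea above, assuming the EO BMF `activity` field holds up to three concatenated 3-digit legacy activity codes:
```{r}
# Hedged sketch: pad the assumed 9-digit activity field, split it into its three
# 3-digit codes, and tabulate the first one, as a way to start spotting
# fundraising / social-club style activity codes.
cw_df_simp %>%
  sf::st_drop_geometry() %>%
  mutate(activity = str_pad(activity, width = 9, side = "left", pad = "0")) %>%
  separate(activity, into = c("activity_1", "activity_2", "activity_3"),
           sep = c(3, 6), remove = FALSE) %>%
  count(activity_1, sort = TRUE)
```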
```{r}
# remove older tax years - check for revocation rationale
#classification, ruling, deductibility, status, tax period
cw_df_simp %>% count(classification, sort = TRUE) %>% view()
cw_df_simp %>% count(ruling, sort = TRUE) %>% view()
cw_df_simp %>% count(deductibility)
cw_df_simp %>% count(organization_has_terminated) # this one is important - this column comes from the e-Postcard file
cw_df_simp %>% count(status)
cw_df_simp %>% count(tax_year, sort = TRUE) %>% View()
```
```{r}
cw_df_simp <-cw_df_simp %>%
mutate(NTEE_text = case_when(
str_starts(ntee_cd, pattern = "A")~ "Arts", # if NTEE_CD starts with A make new variable value "Arts" etc.
str_starts(ntee_cd, pattern = "B") ~ "Education",
str_starts(ntee_cd, pattern = "C|D") ~ "Environment/Animals", # if NTEE_CD value starts with C or D make new variable value "Environment/Animals"
str_starts(ntee_cd, pattern = "E|F|G|H") ~ "Health",
str_starts(ntee_cd, pattern = "I|J|K|L|M|N|O|P") ~ "Human Services",
str_starts(ntee_cd, pattern = "Q") ~ "International Affairs",
str_starts(ntee_cd, pattern = "R|S|T|U|V|W") ~ "Societal Benefit",
str_starts(ntee_cd, pattern = "X") ~ "Religious",
TRUE ~ "NA")) # this line is redundant as this would happen automatically - but everything else will be NA
```
## remove replicate addresses
```{r}
cw_df_simp <- cw_df_simp %>% unite(name.x,street, col = unique_ID, sep = " ", remove = FALSE)
duplicates <-cw_df_simp[duplicated(cw_df_simp$unique_ID),]
duplicates <-arrange(duplicates, unique_ID)
duplicates %>% count(ntee_cd) # most duplicates have ntee code of NA
duplicates%>% count(organization_name)
Living <-duplicates %>% filter(str_detect(unique_ID, pattern = "LIVING")) # only one row has asset info
cw_df_simp %>% filter(organization_name != "AMERICAN FEDERATION OF STATE COUNTY & MUNICIPAL EMPLOYEES")# Ask Tyler about what we want to do for these...
municipal_employees <-cw_df_simp %>% filter(organization_name == "AMERICAN FEDERATION OF STATE COUNTY & MUNICIPAL EMPLOYEES")# Ask Tyler about what we want to do for these...
NAACP <- cw_df_simp %>% filter(organization_name %in% c("NATIONAL ASSOCIATION FOR THE ADVANCEMENT OF COLORED PEOPLE","NAACP")) # Ask Tyler about what we want to do for these...
write_csv(municipal_employees, file= "municipal_employees.csv")
write_csv(NAACP, file = "NAACP.csv")
# remove these multi-chapter organizations for now (pending the question to Tyler above); keep rows with a missing organization_name
cw_df_simp <- cw_df_simp %>% filter(organization_name != "AMERICAN FEDERATION OF STATE COUNTY & MUNICIPAL EMPLOYEES" | is.na(organization_name))
cw_df_simp <- cw_df_simp %>% filter(organization_name != "NATIONAL ASSOCIATION FOR THE ADVANCEMENT OF COLORED PEOPLE" | is.na(organization_name))
```
# Maps
```{r}
library(RColorBrewer)
# Plotting simple features (sf) with plot
plot(neighborhood_shape3$geometry)
library(sp)
library(leaflet)
coordinates(geo_clean) <- ~lon+lat
leaflet(geo_clean) %>% addMarkers() %>% addTiles()
```
```{r}
for_plot <-cw_df_simp %>% dplyr::select(name.x, organization_name, lat, lon)
# getting the map
mapplotarea <- get_map(location = c(lon = mean(for_plot$lon), lat = mean(for_plot$lat)), zoom = 11,maptype = "satellite", scale = 2)
# plotting the map with some points on it
plot1 <- ggmap(mapplotarea) +
geom_point(data =for_plot, aes(x = lon, y = lat, fill = "red", alpha = 0.8), size = 2, shape = 21) +
guides(fill=FALSE, alpha=FALSE, size=FALSE)
plot(neighborhood_shape3$geometry)
neighborhoods <-neighborhood_shape3$geometry
library(rnaturalearthdata)
library(rnaturalearth)
world <- ne_countries(scale = "medium", returnclass = "sf")
glimpse(world)
outline_plot<-ggplot(data = world) +
geom_sf() + geom_sf(data = neighborhoods)+
coord_sf(xlim = c(-76.74, -76.5), ylim = c(39.19, 39.4),
expand = FALSE)
outline_plot + geom_point(data = cw_df_simp, aes(x = lon, y = lat), size = 2,
shape = 23, fill = "darkred")
```
## not mappable address - PO box only removed
Nonprofits that solely provided postal (P.O.) boxes or otherwise erroneous addresses that could not be accurately mapped were also redacted from the list. Given that a primary focus of this paper involves the structural power and accessibility of nonprofits within the communities they serve, we felt that nonprofits that could not be properly geolocated might be misleading.
## multiple addresses removed
In a similar light, nonprofits with multiple addresses either within or outside of Baltimore City, Maryland are beyond the scope of this paper; the IRS data predominantly consisted of only a primary Baltimore address, meaning that secondary addresses could not be considered and nonprofit resources could not be proportionately distributed between locations. Although only primary addresses appeared in the IRS data, it is believed that the financial data encompasses all locations and subdivisions of a nonprofit. Due to a lack of viable alternatives, financials and primary addresses were taken as-is from the IRS data.
## No NTEE - kept but analyzed separately
Additionally, around 1,333 or 32.656% of the remaining Baltimore City nonprofits were missing an NTEE code altogether, meaning that they could not be categorized. Although it represented a significant limitation of available data, these nonprofits were grouped together and analyzed separately.
## Combined Data Set
After all nonprofits listed to operate in Baltimore City were isolated, the primary address for each nonprofit was mapped in GIS overlaying the Baltimore Neighborhood Data already imported into GIS; using a geospatial join tool, all Baltimore City nonprofits whose primary address was not directly within the Baltimore City limits were omitted. A secondary advantage to performing the geospatial join between the nonprofit data and the neighborhood shapefile was the creation of a single tabular data set where each nonprofit's information was listed alongside the pertinent data regarding the neighborhood that the nonprofit is located in. An NTEE attribute was appended to the combined data table, which broadly categorized the nonprofits' NTEE Codes into 10 generalized groups: Arts, Culture, and Humanities; Education; Environment and Animals; Health; Human Services; International, Foreign Affairs; Public, Societal Benefit; Religion Related; Mutual/Membership Benefit; and Unknown, Unclassified (Jones, 2019).
## Lodges and fraternal orders or unknown
As was mentioned previously, the latter two NTEE groupings, Mutual/Membership Benefit (largely lodges and fraternal orders) and Unknown or Unclassified, were respectively removed and analyzed separately.
### Lodges and fraternal orders were deliberately omitted
Lodges and fraternal orders were deliberately omitted from the data set at this stage; although these organizations are indeed nonprofits in Baltimore City, their missions and funds are mostly used to assist members rather than the wider community, meaning that they may not accurately portray the access to resources or services provided to the communities in which they are located. Any lodges and fraternal orders were identified and filtered using their National Taxonomy of Exempt Entities (NTEE) code.
One other calculated column, Majority African American Neighborhood, was appended with a 1 if the nonprofit's primary address is located within a majority African American Neighborhood (**a neighborhood where greater than half of the residents are of African American descent**), or a 0 otherwise.
NTEE codes: https://urbaninstitute.github.io/nccs-legacy/ntee/ntee-history.html