---
title: "Data Model Motivation"
authors: "Roy McKenzie, Ben Feder"
subtitle: "Module 2: Notebook 2"
toc: true
toc_depth: 3
toc_float:
  collapsed: no
  smooth_scroll: yes
number-sections: true
number-offset: 0
format:
  html:
    embed-resources: true
    df-print: paged
---
```{css, echo=FALSE}
/* HIDE THIS CHUNK FROM KNITTED OUTPUT */
h2 {margin: 2em 0 !important;}
details {
  margin-left: 4em;
  margin-bottom: .5rem;
}
summary {
  margin-left: -2em;
}
```
```{r setup, include=FALSE}
# HIDE THIS CHUNK FROM KNITTED OUTPUT
knitr::opts_chunk$set(include = TRUE, echo = TRUE, eval = FALSE, warning = FALSE, fig.align = 'center')
```
```{r, include=FALSE, echo=FALSE}
#HIDE THIS CHUNK FROM KNITTED OUTPUT
## Load libraries
library(RJDBC) # For connecting to the database
# For data manipulation/visualization
library(tidyverse)
# For faster date conversions
library(lubridate)
library(dbplyr)
library(odbc)
library(DBI)
library(RPostgreSQL)
library(zoo)
```
```{r CI_dbsetup, include=FALSE, echo=FALSE}
redshift_db <- dbDriver("PostgreSQL")
con <- dbConnect(redshift_db,
                 host = 'adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com',
                 port = '5439',
                 user = tolower(Sys.getenv("RSTUDIO_USER_IDENTITY")),
                 password = getPass::getPass(),
                 dbname = 'projects')
```
# Introduction
Hi all, and welcome to the second notebook of Module 2! This notebook is intended to be relatively brief, and provide a few motivating examples to show how the data model we learned about in class this week can make our lives much easier when it comes to manipulating and combining the raw data we have available to us in this class.
Our hope is that this notebook can provide a jumping-off point as you begin to explore the data model, which should be the primary data source you use for the rest of your team's analysis. With that in mind, we also encourage you to examine a few of the data model tables yourselves, and to review the **Fact Table Explainers** available on the resources page of the class website. These will help provide a broader overview of the decision rules that go into fitting the raw data you began to explore in the EDA notebook into the structure of the data model.
Don't worry if all of this data model material feels confusing - over the next several notebooks, you will see plenty of examples of how you can use the data model's tables in practice to answer questions relevant to your projects. In fact, in this notebook, we won't even dive very far into understanding the structure of the data model or its nuances. Instead, we just want to show a few simple examples that should help clarify why it's worth going through the trouble of building a data model in the first place, and how we can understand and interpret the results that the data model gives us.
# Technical setup
This workbook will leverage both SQL and R coding concepts, so we need to set up our environment to connect to the proper database and to load R packages that extend the base R environment. Typically, throughout these workbooks, we use SQL for the majority of data exploration and for creating the analytic frame, and then read that analytic frame into R for the descriptive analysis and visualization.
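As a quick preview of that pattern, here is a minimal sketch: run a SQL query against Redshift, then pull the result into a regular R data frame. It assumes the connection object `con` created in the collapsed "Environment Setup" sections below.

```{r}
# Minimal sketch of the SQL-then-R workflow. Assumes `con` has been
# created as shown in the R Setup section below.
library(DBI)

analytic_frame <- dbGetQuery(con, "
  SELECT *
  FROM tr_e2e.dim_person
  LIMIT 5
")

# From here, the result is an ordinary R data frame
head(analytic_frame)
```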
**Note:** If you would like to view the material to establish your own environment for running the code displayed in this notebook, you can expand the following "Environment Setup" section by clicking on its heading.
::: {.callout collapse="true"}
## SQL Setup
For working with the database directly using SQL, the easiest way is still to copy the commands from this notebook into a script in DBeaver. As a reminder, the steps to do so are as follows:
To create a new .sql script:
1. Open DBeaver, located on the ADRF Desktop. The icon looks like this:
![](images/dbeaver.png)
2. Establish your connection to the database by logging in. To do this, double-click `Redshift11_projects` on the left-hand side, and it will ask for your username and password. Your username is `adrf\` followed by the name of your U: drive folder - for example, `adrf\John.Doe.T00112`. Your password is the same as the **second** password you used to log in to the ADRF - if you forgot it, you **can reset it in the ADRF management portal!**
After you successfully establish your connection, you should see a green check next to the database name, like so:
![](images/success.png)
3. In the top menu bar, click **SQL Editor** then **New SQL Script**:
![](images/new_script.png)
4. To test if your connection is working, try pasting the following chunk of code into your script:
```{sql, eval=FALSE}
SELECT *
FROM tr_e2e.dim_person
LIMIT 5
```
Then run it by clicking the run button next to the script, or by pressing CTRL + Enter.
5. You should then be able to see the query output in the box below the code.
:::
::: {.callout collapse="true"}
## R Setup
The easiest way to work with the Redshift data from within an R script is using a combination of a JDBC connection and the `dbplyr` package. The necessary setup for this approach is described below.
#### Load libraries {.unnumbered}
Just like we did in the Foundations Module, in running SQL and R code together through R, we need to load the `RJDBC` package. In addition, we will load the `tidyverse` suite of packages, as they will help us implement some of our fundamental data operations while maintaining a consistent syntax. Lastly, to enable an option discussed in the coming sections, we will load a new package for working with databases in R, `dbplyr`.
> Every time you create a new R file, you should copy and run the following code snippet. You can easily copy the entire snippet by hovering over the cell and clicking the clipboard icon on the top-right edge.
```{r}
options(scipen = 999) # avoid scientific notation
library(RJDBC)
library(tidyverse)
library(dbplyr)
library(zoo) # time/date manipulations
```
::: callout-note
#### New package `dbplyr`{.unnumbered}
In the examples that follow, we continue using our new package for interacting with databases without writing raw SQL. The `dbplyr` package interfaces with a database using standard `dplyr` and `tidyverse` workflows, and complementary `dbplyr` code is shown for all SQL query examples. Just like the SQL code, it should be run from R using the connection we established in the collapsed "Environment Setup" section above.
:::
::: callout-warning
If you receive an error message saying `there is no package called ...`, then please first run:
```
install.packages("PACKAGENAME")
```
where you replace `PACKAGENAME` with the name of the package that is missing.
:::
#### Establish Database Connection {.unnumbered}
To load data from the Redshift server into R, we need to first set up a connection to the database. The following set of commands accomplish this:
```{r eval=FALSE}
dbusr=Sys.getenv("DBUSER")
dbpswd=Sys.getenv("DBPASSWD")
url <- "jdbc:redshift:iam://adrf-redshift11.cdy8ch2udktk.us-gov-west-1.redshift.amazonaws.com:5439/projects;loginToRp=urn:amazon:webservices:govcloud;ssl=true;AutoCreate=true;idp_host=adfs.adrf.net;idp_port=443;ssl_insecure=true;plugin_name=com.amazon.redshift.plugin.AdfsCredentialsProvider"
driver <- JDBC(
"com.amazon.redshift.jdbc42.Driver",
classPath = "C:\\drivers\\redshift_withsdk\\redshift-jdbc42-2.1.0.12\\redshift-jdbc42-2.1.0.12.jar",
identifier.quote="`"
)
con <- dbConnect(driver, url, dbusr, dbpswd)
```
As a reminder, don't worry too much about the details of this connection - **you can simply copy and paste this code each time you want to connect your R script to the Redshift database**.
#### New `.Renviron` {.unnumbered}
For this code to work, you need to create a new `.Renviron` file in your user folder (i.e. `U:\\John.Doe.P00002`) that contains the following:
```
DBUSER='adrf\John.Doe.P00002'
DBPASSWD='xxxxxxxxxxxx'
```
where `John.Doe.P00002` is replaced with your username and `xxxxxxxxxxxx` is replaced with your password (both still in quotes!). The setup of this code is nearly identical to that required in the Foundations Module workspace - however, `DBUSER` should now end with `.T00113` instead of `.T00112`.
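Once you have restarted R (so the new `.Renviron` file is read), a quick sanity check might look like this sketch:

```{r}
# After restarting R, confirm the environment variables were picked up.
# The first line should print your username; the second should be TRUE.
Sys.getenv("DBUSER")
nchar(Sys.getenv("DBPASSWD")) > 0
```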
A detailed video from the Foundations Module, "Introduction to RStudio," demonstrating how to create an .Renviron file is available on the Resources page on class website in the subsection "Quick Links."
:::
# A motivating example: the Arkansas PIRL data
To motivate our data model, we are going to use the Arkansas PIRL data. The Arkansas PIRL (Participant Individual Record Layout) data has been provided by the Arkansas Department of Workforce Services and is our main source of information about WIOA (Workforce Innovation and Opportunity Act) program enrollment. The PIRL data are stored on Redshift under `ds_ar_dws.pirl`.
As you may have heard in class, this is a **very** large file - with over 500 variables! So, for exploration of this data, it is especially important to narrow in on the data values that are crucial to our research. Luckily, the [data documentation](P:/tr-enrollment-to-employment/Data_Dictionaries/PIRL_Layout_withAdultEd.xlsx) for PIRL is also very robust, and is available either in the ADRF or on the class website.
Let's take a quick look at the PIRL data like so:
::: {.panel-tabset .unlisted .unnumbered}
### SQL query {.unnumbered}
```{sql}
select *
from ds_ar_dws.pirl
limit 5;
```
### `dbplyr` query {.unnumbered}
```{r}
con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl")) %>%
head(n = 5) %>%
collect()
```
:::
Just as we might have expected (or feared!) - there are over 500 columns for each of our observations. We can also get the total number of observations like so:
::: {.panel-tabset .unlisted .unnumbered}
### SQL query {.unnumbered}
```{sql}
select count(*)
from ds_ar_dws.pirl
```
### `dbplyr` query {.unnumbered}
```{r}
con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl")) %>%
tally() %>%
collect()
```
:::
We see that there are approximately 2.5 million records - that is a *lot* of data.
Where we start to run into problems, though, is when we think about defining a row. Recall from last week's EDA notebook that we usually want to find the distinct variable or set of variables that uniquely defines each row of the data. How can we do this for the PIRL data?
In class, we learned that, for the PIRL data:
> One row: 1 person, 1 participation spell
This seems straightforward enough. To map this onto variables in the dataset, we can look through the data dictionary. As usual, we might guess that `social_security_number` (PIRL element number 2700) defines the "1 person", while the combination of `date_of_program_entry_wioa` and `date_of_program_exit_wioa` defines the spell. Let's see how many unique combinations of those variables there are:
::: {.panel-tabset .unlisted .unnumbered}
### SQL query {.unnumbered}
```{sql}
select count(
distinct social_security_number ||
coalesce(date_of_program_entry_wioa::text, 'null') ||
coalesce(date_of_program_exit_wioa::text, 'null')
)
from ds_ar_dws.pirl
```
### `dbplyr` query {.unnumbered}
```{r}
con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl")) %>%
select(social_security_number, date_of_program_entry_wioa, date_of_program_exit_wioa) %>%
distinct() %>%
tally() %>%
collect()
```
:::
This is a much smaller number of records! We can see that there are, on average, almost three records per unique individual-start-end combination.
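If you want to see where that duplication is concentrated, one option is to count the raw rows behind each combination. The following `dbplyr` sketch (using the same connection as above) tabulates how many person/entry/exit combinations have one row, two rows, and so on:

```{r}
# Sketch: distribution of raw PIRL rows per person/entry/exit combination
con %>%
  tbl(in_schema(schema = "ds_ar_dws",
                table = "pirl")) %>%
  count(social_security_number,
        date_of_program_entry_wioa,
        date_of_program_exit_wioa,
        name = "rows_per_combination") %>%
  count(rows_per_combination, name = "n_combinations") %>%
  arrange(rows_per_combination) %>%
  collect()
```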
Now, we could continue adding variables to our row definition and get closer and closer to uniquely defining each row. But here we have to ask ourselves - if an individual is reported to be enrolled in the same program, with the same start date, and with the same end date, how much do we really care about the other information available on the PIRL file at this point in our analysis? For example, if an individual has two rows reporting that they began WIOA Adult Program Benefits in Q2 of 2020, how much should we worry if one observation gives their birthday as 11/1/78 and another as 11/3/78? These kinds of discrepancies might be important, but perhaps not so much just for establishing our basic counts of program participants.
# A solution: using the data model
As we learned in class, one useful way to tackle this issue is through the use of a data model. Not only can a data model make record linkage much easier, it can also handle a lot of the duplication issues we saw above, through mastering and de-duplication.
We're going to slowly introduce the main tables of our data model over the next few notebooks. In this notebook, we don't want to dive too deep into the actual usage of the data model, but instead provide some understanding of how it helps us deal with the kinds of duplication issues we saw in the PIRL data above.
## `fact_person_program_start_end`
There are two primary data model fact tables that we will use in this class. The first one we will look at is the `fact_person_program_start_end` table. As the name suggests, this table gives individuals' start and end dates for participation in each of our main programs. Let's take a look at this table here:
::: {.panel-tabset .unlisted .unnumbered}
### SQL query {.unnumbered}
```{sql}
select *
from tr_e2e.fact_person_program_start_end
limit 5
```
### `dbplyr` query {.unnumbered}
```{r}
con %>%
tbl(in_schema(schema = "tr_e2e",
table = "fact_person_program_start_end")) %>%
head(n = 5) %>%
collect()
```
:::
We have the following columns:
- `person_program_start_end_id`: a unique ID for each row of the fact table
- `person_key`: A unique person identifier that links to social security number
- `program_key`: A unique identifier for each program (e.g. SNAP = 9, Adult WIOA = 1)
- `entry_year_quarter_key` and `exit_year_quarter_key`: Unique identifier for the quarters the individual entered and exited the program
You might be wondering why we have all of these keys instead of, for example, just the actual social security number or the name of the program. These keys actually hold all the power of the dimensional model. With them, we are able to store lots and lots of observations in our main fact table very efficiently (numbers are quick to store), and then link those keys to further information in our dimension tables that tells us what each key corresponds to.
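For a concrete picture of how this linking works, here is a sketch that joins the fact table to `dim_program` so each `program_key` is resolved to its descriptive attributes. Note that we have not inspected `dim_program`'s columns here, so treat its layout as an assumption and glance at the table yourself first.

```{r}
# Sketch: resolve program_key by joining the fact table to dim_program.
# The columns dim_program carries (beyond program_key) are an assumption --
# run a quick head() on the table to see its actual layout.
con %>%
  tbl(in_schema(schema = "tr_e2e",
                table = "fact_person_program_start_end")) %>%
  left_join(
    con %>%
      tbl(in_schema(schema = "tr_e2e",
                    table = "dim_program")),
    by = "program_key"
  ) %>%
  head(n = 5) %>%
  collect()
```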
If that still feels confusing, then feel free to refer back to the lecture slides for the data model introduction available on the class website for a little bit more of an explanation of fact and dimension tables. But don't worry about it too much - you'll get plenty of practice using and interpreting the data model over the next few notebooks.
Turning back to the problem at hand, we can deduce that this table shows program participation at the person-program-spell level - i.e., there is one row for each unique person/program/start-date combination. This is just what we learned in class, but it's good to see it here in practice.
Let's attempt to use this table to answer a simple question: how many individuals started an Adult WIOA benefit spell in 2019-Q4? Doing so is actually pretty simple with the data model - we can limit the above table to observations with `program_key = 1` (Adult WIOA) and `entry_year_quarter_key = 40` (2019-Q4). Remember that to determine the key values we want, we would need to consult the `dim_program` and `dim_year_quarter` tables - but since this is just an example, we're not too worried about that here.
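If you did want to confirm those key values yourself, a quick look at the two dimension tables is enough. A sketch (we simply print the small lookup tables and scan for the right rows, since we have not relied on any particular label column names):

```{r}
# Sketch: browse the dimension tables to confirm which program_key and
# year_quarter_key values correspond to Adult WIOA and 2019-Q4.
con %>%
  tbl(in_schema(schema = "tr_e2e", table = "dim_program")) %>%
  collect()

con %>%
  tbl(in_schema(schema = "tr_e2e", table = "dim_year_quarter")) %>%
  collect()
```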
Let's use these limitations now, and count the number of rows:
::: {.panel-tabset .unlisted .unnumbered}
### SQL query {.unnumbered}
```{sql}
select count(*)
from tr_e2e.fact_person_program_start_end
where program_key = 1 and entry_year_quarter_key = 40
```
### `dbplyr` query {.unnumbered}
```{r}
con %>%
tbl(in_schema(schema = "tr_e2e",
table = "fact_person_program_start_end")) %>%
filter(program_key == 1 & entry_year_quarter_key == 40) %>%
tally() %>%
collect()
```
:::
We can also count the number of unique individuals, instead of the number of rows, but because our observations are already unique at the person/program/start-date level, we should get the same number. Let's verify that here:
::: {.panel-tabset .unlisted .unnumbered}
### SQL query {.unnumbered}
```{sql}
select count(distinct person_key)
from tr_e2e.fact_person_program_start_end
where program_key = 1 and entry_year_quarter_key = 40
```
### `dbplyr` query {.unnumbered}
```{r}
con %>%
tbl(in_schema(schema = "tr_e2e",
table = "fact_person_program_start_end")) %>%
filter(program_key == 1 & entry_year_quarter_key == 40) %>%
select(person_key) %>%
distinct() %>%
tally() %>%
collect()
```
:::
Now, let's turn back to the PIRL data to get a better understanding of where these numbers come from. As expected, if we just count all observations in the PIRL with a program start falling in Q4 of 2019 we get a much larger number:
::: {.panel-tabset .unlisted .unnumbered}
### SQL query {.unnumbered}
```{sql}
select count(*)
from ds_ar_dws.pirl
where
  adult_wioa != 0 and
  date_of_program_entry_wioa >= '2019-10-01' and
  date_of_program_entry_wioa <= '2019-12-31'
```
### `dbplyr` query {.unnumbered}
```{r}
con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl")) %>%
filter(
adult_wioa != '0' &
date_of_program_entry_wioa >= '2019-10-01' &
date_of_program_entry_wioa <= '2019-12-31'
) %>%
tally() %>%
collect()
```
:::
But, again, here we're not really concerned about the variation among these duplicated rows. We just want a count of the number of individuals who started on the WIOA Adult program in this quarter. Let's therefore count only the unique social security numbers from the query above:
::: {.panel-tabset .unlisted .unnumbered}
### SQL query {.unnumbered}
```{sql}
select count(distinct social_security_number)
from ds_ar_dws.pirl
where
  adult_wioa != 0 and
  date_of_program_entry_wioa >= '2019-10-01' and
  date_of_program_entry_wioa <= '2019-12-31'
```
### `dbplyr` query {.unnumbered}
```{r}
con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl")) %>%
filter(
adult_wioa != '0' &
date_of_program_entry_wioa >= '2019-10-01' &
date_of_program_entry_wioa <= '2019-12-31'
) %>%
select(social_security_number) %>%
distinct() %>%
tally() %>%
collect()
```
:::
Now we see that the count aligns with the one presented in the data model. Hopefully, this highlights what the data model is intended to do: rather than capturing all of the variation and detail that separate observations contain in the underlying data table, it prepares in advance a mastered, de-duplicated list of individual participation spells, which can provide a starting point for understanding program enrollment.
If you did want to capture additional information from the underlying PIRL data (for example, the job center at which individuals received their services), then you might need to add additional decision rules to your analysis to handle potential issue cases where, for example, multiple job centers are reported for the same person-program-spell combination. Even in this case, however, the data model still provides a valuable jumping-off point: it gives you the underlying list of program participants from which to create that mastered data.
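As one illustration of such a decision rule, the sketch below keeps the most frequently reported value per person-spell. The column name `job_center` is purely hypothetical - check the PIRL data dictionary for the real element name before adapting this.

```{r}
# Hypothetical sketch of a decision rule: keep the most frequently
# reported job center per person/entry-date combination. The column
# `job_center` is a made-up placeholder -- look up the actual PIRL
# element name in the data dictionary before using this.
con %>%
  tbl(in_schema(schema = "ds_ar_dws", table = "pirl")) %>%
  count(social_security_number, date_of_program_entry_wioa,
        job_center, name = "n_reports") %>%
  group_by(social_security_number, date_of_program_entry_wioa) %>%
  slice_max(n_reports, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  collect()
```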
## `fact_person_quarterly_program_enrollment`
The second fact table that we want to look at is the `fact_person_quarterly_program_enrollment` table. The purpose of this table is similar to the start and end date table we saw above: we want to reduce the variation and duplication in our different underlying data sources to a simplified table that helps us parse program participation over time. The difference is that the table above had one row per person-program-spell, whereas this fact table is at the person-program-quarter level.
To understand this difference, consider an individual who has an Adult WIOA benefit spell beginning in April 2019 and ending in December of the same year. In the start and end date table above, this would represent a single observation, with the entry quarter marked as 2019-Q2 and the exit quarter marked as 2019-Q4. In this second fact table, however, this same spell would translate into three observations - one each for 2019-Q2, 2019-Q3, and 2019-Q4. This allows us to explicitly and easily see all the individuals enrolled in a program in a given quarter. Let's see what it looks like now:
::: {.panel-tabset .unlisted .unnumbered}
### SQL query {.unnumbered}
```{sql}
select *
from tr_e2e.fact_person_quarterly_program_enrollment
order by person_key, program_key, enrollment_quarter_key
limit 5
```
### `dbplyr` query {.unnumbered}
```{r}
con %>%
tbl(in_schema(schema = "tr_e2e",
table = "fact_person_quarterly_program_enrollment")) %>%
arrange(person_key, program_key, enrollment_quarter_key) %>%
head(n = 5) %>%
collect()
```
:::
In the output above, we see the `person_key` and `program_key` variables, just as they were in the prior fact table. Now, however, instead of having quarter keys for the start and end quarters, we have just one variable: `enrollment_quarter_key`. This captures each quarter in which the individual described by `person_key` was enrolled in the program described by `program_key`. The remaining variables provide additional context on whether the individual was enrolled for the entire quarter.
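To make the spell-to-quarters expansion concrete, here is a small local R illustration. This is not the data model's actual build code: the key values are hypothetical, and we assume quarter keys are sequential, so that 2019-Q2 corresponds to key 38 if 2019-Q4 is key 40.

```{r}
# Toy illustration of how one spell becomes one row per enrolled quarter.
# Key values are hypothetical; quarter keys are assumed to be sequential.
spell <- tibble(
  person_key = 101,
  program_key = 1,
  entry_year_quarter_key = 38,  # 2019-Q2 (assumed key)
  exit_year_quarter_key = 40    # 2019-Q4
)

spell %>%
  mutate(enrollment_quarter_key = map2(entry_year_quarter_key,
                                       exit_year_quarter_key,
                                       seq)) %>%
  unnest(enrollment_quarter_key) %>%
  select(person_key, program_key, enrollment_quarter_key)
```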
To see the power of this, let's first try to identify all the Adult WIOA participants (rather than just entrants) in 2019-Q4. To change things up, let's start this time by using the underlying PIRL data. This seems like it should be pretty easy - we just need to count the distinct social security numbers reported where:
- The `adult_wioa` field is not equal to zero (Adult WIOA participants)
- The date of program entry is before January of 2020 (entered before or during 2019-Q4)
- The date of program exit is after September of 2019 (exited during or after 2019-Q4)
That logic alone is a little complicated, but things actually get even worse. One tricky aspect of working with the PIRL data is that an individual's data will *first* show up in the PIRL file when they first enroll in a WIOA program. The problem is that at this point, that individual hasn't exited the program yet, so their program exit date is going to be missing!
Let's see an example of this:
::: {.panel-tabset .unlisted .unnumbered}
### SQL query {.unnumbered}
```{sql}
select
social_security_number,
adult_wioa,
date_of_program_entry_wioa,
date_of_program_exit_wioa
from ds_ar_dws.pirl
where
social_security_number = 'REDACTED'
```
### `dbplyr` query {.unnumbered}
```{r}
con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl")) %>%
filter(social_security_number == 'REDACTED') %>%
select(
social_security_number,
adult_wioa,
date_of_program_entry_wioa,
date_of_program_exit_wioa
) %>%
collect()
```
:::
We see that this individual has three records in the PIRL file, all corresponding to an Adult WIOA program enrollment. On the first observation, we see that the `date_of_program_exit_wioa` variable is **missing** - they haven't yet exited the program! Then, later on, when the data is updated, another record is generated for this individual, showing their true exit date.
The problem here arises because sometimes this missing value is valid - the individual really is still on the program! For example, if the individual from the example above had only the first row, then we would want to consider them a valid participant for 2019-Q4, since as far as we would know, they would not have exited the program.
Thus to truly get the count of all participants for 2019-Q4 we need to take all individuals where:
- The `adult_wioa` field is not equal to zero (Adult WIOA participants)
- The date of program entry is before January of 2020 (entered before or during 2019-Q4)
- The date of program exit is after September of 2019 (exited during or after 2019-Q4)
- **OR** the only reported date of program exit for a given start date is missing
We can implement that in code here, but it's a bit of a mess:
::: {.panel-tabset .unlisted .unnumbered}
### SQL query {.unnumbered}
```{sql}
with correctEndDates as (
  select
    social_security_number,
    date_of_program_entry_wioa,
    max(date_of_program_exit_wioa) as date_of_program_exit_wioa
  from ds_ar_dws.pirl
  where adult_wioa != 0
  group by social_security_number, date_of_program_entry_wioa
)
select count(distinct social_security_number)
from correctEndDates
where
  date_of_program_entry_wioa <= '2019-12-31' and
  (date_of_program_exit_wioa >= '2019-10-01' or date_of_program_exit_wioa is NULL)
```
### `dbplyr` query {.unnumbered}
```{r}
con %>%
tbl(in_schema(schema = "ds_ar_dws",
table = "pirl")) %>%
filter(adult_wioa != '0') %>%
group_by(social_security_number, date_of_program_entry_wioa) %>%
summarize(
date_of_program_exit_wioa = max(date_of_program_exit_wioa, na.rm = TRUE)
) %>%
ungroup() %>%
filter(
date_of_program_entry_wioa <= '2019-12-31' &
(date_of_program_exit_wioa >= '2019-10-01' | is.na(date_of_program_exit_wioa))
) %>%
select(social_security_number) %>%
distinct() %>%
tally() %>%
collect()
```
:::
Now, in contrast, let's try to find the same count using the data model. Here we just need to count the number of rows where `program_key = 1` (Adult WIOA) and `enrollment_quarter_key = 40` (2019-Q4).
::: {.panel-tabset .unlisted .unnumbered}
### SQL query {.unnumbered}
```{sql}
select count(*)
from tr_e2e.fact_person_quarterly_program_enrollment
where
program_key = 1 and
enrollment_quarter_key = 40
```
### `dbplyr` query {.unnumbered}
```{r}
con %>%
tbl(in_schema(schema = "tr_e2e",
table = "fact_person_quarterly_program_enrollment")) %>%
filter(program_key == 1 & enrollment_quarter_key == 40) %>%
tally() %>%
collect()
```
:::
So, so much simpler.
## And so much more...
All of the examples we've seen above have used the PIRL data and focused on Adult WIOA, which provides a great motivating example of the power of the data model. But as a final piece of our motivation, we should also at least preview how much easier this data model makes our lives when it comes to record linkage.
Let's say we wanted to modify our example above to look at how many individuals participated in either Adult WIOA **or** SNAP in 2019-Q4. We won't even begin to present here how you would handle this question without the data model - roughly, you would have to repeat all the calculations we did above on the raw SNAP file (dealing with any weird cases that arose there as well) and then try to combine that with our PIRL calculations to find the individuals that were in either. Using the data model, things are much simpler:
::: {.panel-tabset .unlisted .unnumbered}
### SQL query {.unnumbered}
```{sql}
select count(distinct person_key)
from tr_e2e.fact_person_quarterly_program_enrollment
where
(program_key = 1 or program_key = 9) and
enrollment_quarter_key = 40
```
### `dbplyr` query {.unnumbered}
```{r}
con %>%
tbl(in_schema(schema = "tr_e2e",
table = "fact_person_quarterly_program_enrollment")) %>%
filter((program_key == 1 | program_key == 9) & enrollment_quarter_key == 40) %>%
select(person_key) %>%
distinct() %>%
tally() %>%
collect()
```
:::
# Conclusion
Hopefully, this has helped you understand the value of the data model in handling some of the messy underlying data that we have to use in this course. In future notebooks, you'll see plenty more examples of how to use this data model for your analysis, and in particular, how it simplifies the process of record linkage. In the meantime, we strongly encourage you again to review the **Fact Table Explainers** available on the Resources page of your class website to begin building your intuition for the decision rules that the data model uses to combine so many disparate sources of data.