Create "small" materialized trip updates model, and initial observed trips fact on top #2205

Merged
merged 12 commits into from
Jan 31, 2023

Conversation

atvaccaro
Contributor

@atvaccaro atvaccaro commented Jan 23, 2023

Description

Progress towards #2063 (joining RT and Schedule for validity analysis is coming)

Creates an observed trips fact based on TripDescriptor. Also creates int_gtfs_rt__trip_updates_no_stop_times, an incremental model that serves multiple downstream models with cheaper trip updates data.

Question: should we join this against daily scheduled trips?

  • For vehicle positions, first and last message times / count of messages
  • For service alerts, distinct message text?, distinct cause/effect?, count of messages?
  • For trip updates, schedule relationship, perhaps max delay?, number of cancelled stops?, count of messages?
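
As a rough illustration of the per-trip summaries listed above, here is a minimal Python sketch of the vehicle positions case (first and last message times, count of messages). The row layout, the `header_ts` field name, and the `summarize_by_trip` helper are assumptions made for this example; the real models are built in dbt, and the actual columns may differ.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical, already-parsed messages; trip_id and start_date mirror
# TripDescriptor fields, header_ts is an assumed message timestamp column.
messages = [
    {"trip_id": "t1", "start_date": "20230123", "header_ts": datetime(2023, 1, 23, 8, 0, 0)},
    {"trip_id": "t1", "start_date": "20230123", "header_ts": datetime(2023, 1, 23, 8, 0, 20)},
    {"trip_id": "t2", "start_date": "20230123", "header_ts": datetime(2023, 1, 23, 9, 15, 0)},
]


def summarize_by_trip(rows):
    """Return one summary per (trip_id, start_date) pair."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[(row["trip_id"], row["start_date"])].append(row["header_ts"])
    return {
        key: {
            "first_message": min(timestamps),
            "last_message": max(timestamps),
            "num_messages": len(timestamps),
        }
        for key, timestamps in grouped.items()
    }


observed = summarize_by_trip(messages)
```

The grouping key here is only `(trip_id, start_date)`; a fuller version would probably need the remaining TripDescriptor fields and the feed identity as well.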

int_gtfs_rt__trip_updates_no_stop_times is intended to be generically useful. stop_times_updates accounts for approximately 90% of the size of a trip update record, so excluding it lets us materialize the rest of the record in dbt (for now). Under the current dbt DAG, every additional trip_updates model means another 150 GB read per day, even if those models are incremental. With int_gtfs_rt__trip_updates_no_stop_times we pay that cost once and can build more models on top, while also benefiting from columnar storage.
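
To make the cost argument concrete, here is a back-of-the-envelope sketch in Python. The 150 GB and 90% figures come from the description above. The function names, and the assumption that each downstream model reads the whole slimmed-down output once per day, are illustrative only. With incremental builds and columnar storage the real downstream reads are likely smaller.

```python
# Figures from the PR description; everything else is an illustrative assumption.
RAW_DAILY_GB = 150.0           # daily read for a model built directly on trip updates
STOP_TIME_UPDATES_SHARE = 0.9  # approximate share of a record taken by stop time updates


def daily_read_direct(n_models: int) -> float:
    """GB read per day when every downstream model scans the raw trip updates."""
    return n_models * RAW_DAILY_GB


def daily_read_with_small_model(n_models: int) -> float:
    """GB read per day when one intermediate model scans the raw data once
    and every downstream model scans the smaller materialized output."""
    small_gb = RAW_DAILY_GB * (1 - STOP_TIME_UPDATES_SHARE)
    return RAW_DAILY_GB + n_models * small_gb


for n in (1, 3, 5):
    print(n, daily_read_direct(n), round(daily_read_with_small_model(n), 1))
```

Under these assumptions, three downstream models cost roughly 450 GB per day when built directly on trip updates, versus about 195 GB with the intermediate model. With a single downstream model the intermediate step costs slightly more, so the savings only appear once a second model is built on top.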

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation
  • agencies.yml

How has this been tested?

Models build up through fct_observed_trips as expected

Screenshots (optional)

@atvaccaro atvaccaro self-assigned this Jan 23, 2023
@atvaccaro atvaccaro changed the title start on observed trips model observed trips models Jan 23, 2023
@atvaccaro atvaccaro marked this pull request as ready for review January 26, 2023 19:42
@atvaccaro atvaccaro changed the title observed trips models Create "small" materialized trip updates model, and initial observed trips fact on top Jan 26, 2023
@atvaccaro
Contributor Author

I'm going to use dim_provider_gtfs_data for the join in fct_observed_trips rather than the schedule URL from the metadata

Contributor

@lauriemerrell lauriemerrell left a comment

I requested the change to dim_provider_gtfs_data offline, LMK when that is done & I will review

Contributor

@lauriemerrell lauriemerrell left a comment

Just a few things; only the request for first/last header/extract timestamps is blocking from my perspective.

@atvaccaro atvaccaro merged commit 669ceb4 into main Jan 31, 2023
@atvaccaro atvaccaro deleted the trip-analysis-model branch January 31, 2023 17:42