Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add service-level check and fct table #2206

Closed
wants to merge 17 commits into from
Closed

Add service-level check and fct table #2206

wants to merge 17 commits into from

Conversation

owades
Copy link
Member

@owades owades commented Jan 23, 2023

Description

New check to see whether this guideline is met:

100% of trips marked as “Scheduled,” “Canceled,” or “Added” within the Trip Updates feed are represented within the Vehicle Positions feed.

I might be able to incorporate the observed_trip data structure proposed here to reduce the cost of my big join.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation
  • agencies.yml

How has this been tested?

Not yet

@owades
Copy link
Member Author

owades commented Jan 23, 2023

@lauriemerrell , could you take a look at this when you are back? I created a new fct table as you suggested, but I'm still using service_key as a unique ID. I think I need some high-level assistance from you - for example, should I be grouping by "provider" rather than "service"? Thank you!

@lauriemerrell
Copy link
Contributor

Ok yeah sorry, I did not understand the context when I was weighing in on Slack yesterday, sorry if I muddied the waters. There are a couple of things here:

  • What level is this check at: This is an interesting question (to me 😅🤓). I think that it does make sense to call this a service-level check, and then just to say that the check is N/A for services without both trip updates and vehicle positions, but I just wanted to note that the alternative could be to frame this as either a trip updates or vehicle positions check which would use the existing feed-level indices. I think that the main use for a service daily index is going to be the manual service checks from Airtable (I think that those are the only other service-level checks). If you think that's ok conceptually (to have the service-level checks be the manual Airtable stuff and then this one), that's cool with me.
  • Constructing the service daily index: I think that, because those manual service checks are also incoming (and cc @atvaccaro because he was going to make that), the way to make the daily service index is like this: https://github.com/cal-itp/data-infra/blob/main/warehouse/models/intermediate/gtfs/gtfs_quality/int_gtfs_quality__gtfs_dataset_guideline_index.sql; we'd want int_gtfs_quality__service_guideline_index. I am realizing that maybe these should be thing_daily_index but alas that ship has probably sailed.
  • Joining the feeds to the service index: I think that making fct_daily_provider_gtfs_data in this PR might be overkill.... I need to talk to @atvaccaro about how/if/when to make that table because I think it would have other ramifications but once you (or @atvaccaro) makes int_gtfs_quality__service_guideline_index, you can just join on the correct feeds like:
SELECT 
   idx.date,
   idx.service_key,
   quartet.trip_updates_gtfs_dataset_key,
   quartet.vehicle_positions_gtfs_dataset_key
FROM int_gtfs_quality__service_guideline_index AS idx
LEFT JOIN dim_provider_gtfs_data AS quartet
ON idx.service_key = quartet.service_key
AND idx.date BETWEEN quartet._valid_from AND quartet._valid_to

TLDR: I think that you and @atvaccaro need to figure out who is making int_gtfs_quality__service_guideline_index, but I think that that is kind of the key missing piece here.

@lauriemerrell
Copy link
Contributor

@owades
Copy link
Member Author

owades commented Jan 24, 2023

Thank you! I think I can get this updated today based on your feedback here.

@lauriemerrell lauriemerrell mentioned this pull request Jan 25, 2023
5 tasks
@owades
Copy link
Member Author

owades commented Jan 25, 2023

@lauriemerrell, this is taking quite a long time to run, I wonder if the join with the BETWEEN statement is slowing stuff down.

@owades
Copy link
Member Author

owades commented Jan 25, 2023

I think I will hold off on this PR until #2205 is done, as it will speed things up dramatically for me.

@lauriemerrell
Copy link
Contributor

@owades re:

I wonder if the join with the BETWEEN statement is slowing stuff down.

The tables in my sample SQL above are both tiny (~MB), so I don't think that step is the issue -- fct_vehicle_positions_messages is a view with no tables in between it and the raw data, so it's going into the filesystem, agree that #2205 will be the big timesaver here.

@owades
Copy link
Member Author

owades commented Jan 31, 2023

starting fresh on new branch

@owades owades closed this Jan 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants