Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Mode of Interest "quality text" has errors #115

Open
Abby-Wheelis opened this issue Feb 1, 2024 · 3 comments
Open

🐛 Mode of Interest "quality text" has errors #115

Abby-Wheelis opened this issue Feb 1, 2024 · 3 comments

Comments

@Abby-Wheelis
Copy link
Member

I discovered this while testing my metric vs imperial changes, but have realized it's a bug in the whole dashboard, not just my branch, example of what I mean (from production):
Screenshot 2024-02-01 at 9 57 38 AM

The mode_specific_timeseries in particular is creating this issue with the quality text. It stems from the fact that pandas groupby is now creating ALL combinations not just those with nonzero values, but then in scaffolding it counts the length of the dataframe rather than summing the trip counts ...

ex:

a dataset has 30 labeled trips from 5 users across 10 days and there are 3 possible modes

when we call groupby every possible combination in generated = 5103 = 150
even narrowing to 1 mode is still 50 trips, so when we compare 50 with 30 we're getting > 100%

One way to help this is to drop the day/user/mode combinations that have a trip count of 0, but I'm not sure that fixes the whole issue

A more principled fix would be to get the total mode of interest count by adding up rows rather than taking the length of the dataframe, this seems like it would be more accurate. I'm not sure how we got here, but it might have to do with pandas updates to the ways that groupby works, indicated by these warnings:

FutureWarning: Not prepending group keys to the result index of transform-like apply. In the future, the group keys will be included in the index, regardless of whether the applied function returns a like-indexed object.

@Abby-Wheelis
Copy link
Member Author

More about the code in mode-specific-timeseries, with the fix I've added so far:

mode_counts = data.groupby(['user_id','date_time','mode_confirm'], as_index=False).size()
mode_counts.rename(columns={'size':'trip_count'}, inplace=True)
#the scaffolding code counts the length! So even if 0 trips of that mode it was counting!
mode_counts = mode_counts[mode_counts['trip_count'] != 0]

in scaffolding.py:

def get_quality_text(before_df, after_df, mode_of_interest=None, include_test_users=False):
    """ Inputs:
    before_df = dataframe prior to filtering (usually participant_ct_df)
    after_df = dataframe after filtering (usually expanded_ct)
    mode_of_interest = optional detail to include in the text string
    """
    # CASE 1 of https://github.com/e-mission/em-public-dashboard/issues/69#issuecomment-1256835867
    after_pct = (len(after_df) * 100) / len(before_df) if len(before_df) != 0 else np.nan #compares lengths!
    cq = (len(after_df), unique_users(after_df), len(before_df), unique_users(before_df),
        after_pct, )
    interest_str = mode_of_interest + ' ' if mode_of_interest is not None else ''
    total_str = 'confirmed' if mode_of_interest is not None else ''
    user_str = 'testers and participants' if include_test_users else 'users'
    quality_text = f"Based on %s confirmed {interest_str}trips from %d {user_str}\nof %s total {total_str} trips from %d users (%.2f%%)" % cq
    print(quality_text)
    return quality_text

Actually, looking at these, maybe passing in what we do in mode_specific_timeseries is just the wrong move - what get_quality_text expects is a dataframe with all the trips (all the labeled trips in this case) and all the (labeled) trips of a given mode

@Abby-Wheelis
Copy link
Member Author

quality_text = scaffolding.get_quality_text(expanded_ct, expanded_ct[expanded_ct['mode_confirm'] == mode_of_interest], mode_of_interest, include_test_users) is a call that gets the proper quality text as proven by the print statements below:

the length is the length of mode_counts_of_interest (mode_counts from the code above filtered to just the mode of interest, the sum is that of all the trip count values in that same dataframe. The increase is explained by the fact that users can take more than one walk trip in a day (and often will - if you walk somewhere you usually walk back)

Screenshot 2024-02-01 at 10 31 18 AM

I'll plan to roll this change into my ongoing PR and watch out for other places we could be having this problem!

Abby-Wheelis pushed a commit to Abby-Wheelis/em-public-dashboard that referenced this issue Feb 1, 2024
@shankari
Copy link
Contributor

shankari commented Feb 10, 2024

@Abby-Wheelis Great catch! This is why we need better testing of the public dashboard, and a principled way to update pandas versions. We have found multiple such issues while upgrading the server code, for which we do indeed have proper tests.

@MukuFlash03 @nataliejschultz for visibiility

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants