-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Summarise
function to summarise results -- with more flexibilitythan previous utility function
#1457
base: master
Are you sure you want to change the base?
Conversation
… that function (without disturbing the original summarize function, upon which many analayis script rely.)
…ummarize` to `summarise`
…se new function to avoid duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Tim. The PR looks great!
Find below my suggestions which I feel are all optional. Thanks
|
||
if only_mean and (not collapse_columns): | ||
if only_central and (not collapse_columns): | ||
# Remove other metrics and simplify if 'only_mean' across runs for each draw is required: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Remove other metrics and simplify if 'only_mean' across runs for each draw is required: | |
# Remove other metrics and simplify if 'only_central' across runs for each draw is required: |
om: pd.DataFrame = summary.loc[:, (slice(None), "central")] | ||
om.columns = [c[0] for c in om.columns.to_flat_index()] | ||
om.columns.name = 'draw' | ||
return om |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if om
was short for only_mean
then we have to update this to oc
for only_central
om: pd.DataFrame = summary.loc[:, (slice(None), "central")] | |
om.columns = [c[0] for c in om.columns.to_flat_index()] | |
om.columns.name = 'draw' | |
return om | |
oc: pd.DataFrame = summary.loc[:, (slice(None), "central")] | |
oc.columns = [c[0] for c in oc.columns.to_flat_index()] | |
oc.columns.name = 'draw' | |
return oc |
else: | ||
return summary_droppedlevel | ||
|
||
else: | ||
return summary | ||
|
||
|
||
def summarize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
function name summarise
and summarize
looks confusing. can't we suggest a better name for this function? If we don't want to interfere with how this is used outside utils we can rename summarise
?
also, do we really need this function? I think this is just calling/copying summarise function with an argument mean
instead of the default median. Can't we have summarise
default to mean on central_measure
and delete this function? in that case summarise
will be renamed to summarize
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @mnjowe that having two different but similar functions that differ only the in whether they use an ize
or ise
is likely to be confusing.
I suspect the reason you've kept summarize
with the original default behaviour of computing mean
and same call signature as previously @tbhallett is to ensure this doesn't break code using the previous version?
There are few alternative ways we could deal with this:
- Just making a breaking change and removing the old function and using
summarize
for new function (potentially with a helpful error message if users try to call with previous signature). - Using a more differentiating name for new function - perhaps something like
compute_summary_statistics
? - Changing default behaviour of new function to replicate old behaviour and using same
summarize
name, but raising deprecation warning if relying on old defaults. This would require doing somethin like defaulting tocentral_measure="mean"
, adding a**kwargs
argument to new function, checking ifonly_mean
is inkwargs
and using to set value ofonly_central
as well as raising a deprecation warning to indicate that this argument is deprecated and users should useuse_central
instead, and also dealing with the adjustment to column name fromcentral
tomean
in this case. Overall this is probably a bit complex.
if output.columns.nlevels > 1: | ||
output = output.rename(columns={'central': 'mean'}, level=1) # rename 'central' to 'mean' | ||
return output |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can't we do this in summarise
function, delete this function and rename summarise
to summarize
?
output = summarise( | ||
results=results, | ||
central_measure='mean', | ||
only_central=only_mean, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
passing only_mean
to only_central
here looks confusing. can't we use one name for consistency?
:Param: results: The pd.DataFame of results. | ||
:Param: central_measure: The name of the central measure to use - either 'mean' or 'median'. | ||
:Param: width_of_range: The width of the range to compute the statistics (e.g. 0.95 for the 95% interval). | ||
:Param: collapse_columns: Whether to simplify the columnar index if there is only one run (cannot be done otherwise) | ||
:Param: only_central: Whether to only report the central value (dropping the range). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:Param: results: The pd.DataFame of results. | |
:Param: central_measure: The name of the central measure to use - either 'mean' or 'median'. | |
:Param: width_of_range: The width of the range to compute the statistics (e.g. 0.95 for the 95% interval). | |
:Param: collapse_columns: Whether to simplify the columnar index if there is only one run (cannot be done otherwise) | |
:Param: only_central: Whether to only report the central value (dropping the range). | |
:param results: The dataframe of results to compute summary statistics of. | |
:param central_measure: The name of the central measure to use - either 'mean' or 'median'. | |
:param width_of_range: The width of the range to compute the statistics (e.g. 0.95 for the 95% interval). | |
:param collapse_columns: Whether to simplify the columnar index if there is only one run (cannot be done otherwise). | |
:param only_central: Whether to only report the central value (dropping the range). | |
:return: A dataframe with computed summary statistics. |
Small update to fix parameter directive syntax in docstring and adding return information.
if central_measure == 'mean': | ||
stats.update({'central': results.groupby(axis=1, by='draw', sort=False).mean()}) | ||
elif central_measure == 'median': | ||
stats.update({'central': results.groupby(axis=1, by='draw', sort=False).median()}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if central_measure == 'mean': | |
stats.update({'central': results.groupby(axis=1, by='draw', sort=False).mean()}) | |
elif central_measure == 'median': | |
stats.update({'central': results.groupby(axis=1, by='draw', sort=False).median()}) | |
if central_measure == 'mean': | |
stats['central'] = results.groupby(axis=1, by='draw', sort=False).mean() | |
elif central_measure == 'median': | |
stats['central'] = results.groupby(axis=1, by='draw', sort=False).median() |
Indexed assignment to a dict is generally preferable over update
when just adding a key-value pair as it avoids creating an unecessary intermediate dictionary and is slightly more readable.
We could also avoid the repetition across the different conditions by changing to something like
if central_measure in ('mean', 'median'):
grouped_results = results.groupby(axis=1, by='draw', sort=False)
stats['central'] = getattr(grouped_results, central_measure)()
but I think on balance probably the loss of readability outweighs the slight gain in avoiding code redundancy.
stats.update( | ||
{ | ||
'mean': results.groupby(axis=1, by='draw', sort=False).mean(), | ||
'lower': results.groupby(axis=1, by='draw', sort=False).quantile(0.025), | ||
'upper': results.groupby(axis=1, by='draw', sort=False).quantile(0.975), | ||
}, | ||
axis=1 | ||
'lower': results.groupby(axis=1, by='draw', sort=False).quantile((1.-width_of_range)/2.), | ||
'upper': results.groupby(axis=1, by='draw', sort=False).quantile(1.-(1.-width_of_range)/2.), | ||
} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid computing the groupby
operation multiple times here and when computing central measure, it would be better to have one line something like
grouped_results = results.groupby(axis=1, by='draw', sort=False)
at beginning of function and then using grouped_results
in place of repeated results.groupby(axis=1, by='draw', sort=False)
calls.
I would possibly also say writing this as two separate indexed assignments to the stats
dict rather than using update
would be a bit more readable
lower_quantile = (1 - width_of_range) / 2
stats["lower"] = grouped_results.quantile(lower_quantile)
stats["upper"] = grouped_results.quantile(1 - lower_quantile)
but using dict.update
here to update dictionary with two entries is reasonable so I think this is more of a personal preference thing!
else: | ||
return summary_droppedlevel | ||
|
||
else: | ||
return summary | ||
|
||
|
||
def summarize( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @mnjowe that having two different but similar functions that differ only the in whether they use an ize
or ise
is likely to be confusing.
I suspect the reason you've kept summarize
with the original default behaviour of computing mean
and same call signature as previously @tbhallett is to ensure this doesn't break code using the previous version?
There are few alternative ways we could deal with this:
- Just making a breaking change and removing the old function and using
summarize
for new function (potentially with a helpful error message if users try to call with previous signature). - Using a more differentiating name for new function - perhaps something like
compute_summary_statistics
? - Changing default behaviour of new function to replicate old behaviour and using same
summarize
name, but raising deprecation warning if relying on old defaults. This would require doing somethin like defaulting tocentral_measure="mean"
, adding a**kwargs
argument to new function, checking ifonly_mean
is inkwargs
and using to set value ofonly_central
as well as raising a deprecation warning to indicate that this argument is deprecated and users should useuse_central
instead, and also dealing with the adjustment to column name fromcentral
tomean
in this case. Overall this is probably a bit complex.
Fixes #1420