Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Summarise function to summarise results -- with more flexibilitythan previous utility function #1457

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

tbhallett
Copy link
Collaborator

Fixes #1420

… that function (without disturbing the original summarize function, upon which many analayis script rely.)
Copy link
Collaborator

@mnjowe mnjowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Tim. The PR looks great!
Find below my suggestions which I feel are all optional. Thanks


if only_mean and (not collapse_columns):
if only_central and (not collapse_columns):
# Remove other metrics and simplify if 'only_mean' across runs for each draw is required:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Remove other metrics and simplify if 'only_mean' across runs for each draw is required:
# Remove other metrics and simplify if 'only_central' across runs for each draw is required:

Comment on lines +351 to 354
om: pd.DataFrame = summary.loc[:, (slice(None), "central")]
om.columns = [c[0] for c in om.columns.to_flat_index()]
om.columns.name = 'draw'
return om
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if om was short for only_mean then we have to update this to oc for only_central

Suggested change
om: pd.DataFrame = summary.loc[:, (slice(None), "central")]
om.columns = [c[0] for c in om.columns.to_flat_index()]
om.columns.name = 'draw'
return om
oc: pd.DataFrame = summary.loc[:, (slice(None), "central")]
oc.columns = [c[0] for c in oc.columns.to_flat_index()]
oc.columns.name = 'draw'
return oc

else:
return summary_droppedlevel

else:
return summary


def summarize(
Copy link
Collaborator

@mnjowe mnjowe Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

function name summarise and summarize looks confusing. can't we suggest a better name for this function? If we don't want to interfere with how this is used outside utils we can rename summarise?

also, do we really need this function? I think this is just calling/copying summarise function with an argument mean instead of the default median. Can't we have summarise default to mean on central_measure and delete this function? in that case summarise will be renamed to summarize

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @mnjowe that having two different but similar functions that differ only the in whether they use an ize or ise is likely to be confusing.

I suspect the reason you've kept summarize with the original default behaviour of computing mean and same call signature as previously @tbhallett is to ensure this doesn't break code using the previous version?

There are few alternative ways we could deal with this:

  • Just making a breaking change and removing the old function and using summarize for new function (potentially with a helpful error message if users try to call with previous signature).
  • Using a more differentiating name for new function - perhaps something like compute_summary_statistics?
  • Changing default behaviour of new function to replicate old behaviour and using same summarize name, but raising deprecation warning if relying on old defaults. This would require doing somethin like defaulting tocentral_measure="mean", adding a **kwargs argument to new function, checking if only_mean is in kwargs and using to set value of only_central as well as raising a deprecation warning to indicate that this argument is deprecated and users should use use_central instead, and also dealing with the adjustment to column name from central to mean in this case. Overall this is probably a bit complex.

Comment on lines +388 to +390
if output.columns.nlevels > 1:
output = output.rename(columns={'central': 'mean'}, level=1) # rename 'central' to 'mean'
return output
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we do this in summarise function, delete this function and rename summarise to summarize?

output = summarise(
results=results,
central_measure='mean',
only_central=only_mean,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

passing only_mean to only_central here looks confusing. can't we use one name for consistency?

Comment on lines +321 to +325
:Param: results: The pd.DataFame of results.
:Param: central_measure: The name of the central measure to use - either 'mean' or 'median'.
:Param: width_of_range: The width of the range to compute the statistics (e.g. 0.95 for the 95% interval).
:Param: collapse_columns: Whether to simplify the columnar index if there is only one run (cannot be done otherwise)
:Param: only_central: Whether to only report the central value (dropping the range).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
:Param: results: The pd.DataFame of results.
:Param: central_measure: The name of the central measure to use - either 'mean' or 'median'.
:Param: width_of_range: The width of the range to compute the statistics (e.g. 0.95 for the 95% interval).
:Param: collapse_columns: Whether to simplify the columnar index if there is only one run (cannot be done otherwise)
:Param: only_central: Whether to only report the central value (dropping the range).
:param results: The dataframe of results to compute summary statistics of.
:param central_measure: The name of the central measure to use - either 'mean' or 'median'.
:param width_of_range: The width of the range to compute the statistics (e.g. 0.95 for the 95% interval).
:param collapse_columns: Whether to simplify the columnar index if there is only one run (cannot be done otherwise).
:param only_central: Whether to only report the central value (dropping the range).
:return: A dataframe with computed summary statistics.

Small update to fix parameter directive syntax in docstring and adding return information.

Comment on lines +330 to +333
if central_measure == 'mean':
stats.update({'central': results.groupby(axis=1, by='draw', sort=False).mean()})
elif central_measure == 'median':
stats.update({'central': results.groupby(axis=1, by='draw', sort=False).median()})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if central_measure == 'mean':
stats.update({'central': results.groupby(axis=1, by='draw', sort=False).mean()})
elif central_measure == 'median':
stats.update({'central': results.groupby(axis=1, by='draw', sort=False).median()})
if central_measure == 'mean':
stats['central'] = results.groupby(axis=1, by='draw', sort=False).mean()
elif central_measure == 'median':
stats['central'] = results.groupby(axis=1, by='draw', sort=False).median()

Indexed assignment to a dict is generally preferable over update when just adding a key-value pair as it avoids creating an unecessary intermediate dictionary and is slightly more readable.

We could also avoid the repetition across the different conditions by changing to something like

    if central_measure in ('mean', 'median'):
        grouped_results = results.groupby(axis=1, by='draw', sort=False)
        stats['central'] = getattr(grouped_results, central_measure)()

but I think on balance probably the loss of readability outweighs the slight gain in avoiding code redundancy.

Comment on lines +337 to 342
stats.update(
{
'mean': results.groupby(axis=1, by='draw', sort=False).mean(),
'lower': results.groupby(axis=1, by='draw', sort=False).quantile(0.025),
'upper': results.groupby(axis=1, by='draw', sort=False).quantile(0.975),
},
axis=1
'lower': results.groupby(axis=1, by='draw', sort=False).quantile((1.-width_of_range)/2.),
'upper': results.groupby(axis=1, by='draw', sort=False).quantile(1.-(1.-width_of_range)/2.),
}
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid computing the groupby operation multiple times here and when computing central measure, it would be better to have one line something like

grouped_results = results.groupby(axis=1, by='draw', sort=False)

at beginning of function and then using grouped_results in place of repeated results.groupby(axis=1, by='draw', sort=False) calls.

I would possibly also say writing this as two separate indexed assignments to the stats dict rather than using update would be a bit more readable

lower_quantile = (1 - width_of_range) / 2
stats["lower"] = grouped_results.quantile(lower_quantile)
stats["upper"] = grouped_results.quantile(1 - lower_quantile)

but using dict.update here to update dictionary with two entries is reasonable so I think this is more of a personal preference thing!

else:
return summary_droppedlevel

else:
return summary


def summarize(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @mnjowe that having two different but similar functions that differ only the in whether they use an ize or ise is likely to be confusing.

I suspect the reason you've kept summarize with the original default behaviour of computing mean and same call signature as previously @tbhallett is to ensure this doesn't break code using the previous version?

There are few alternative ways we could deal with this:

  • Just making a breaking change and removing the old function and using summarize for new function (potentially with a helpful error message if users try to call with previous signature).
  • Using a more differentiating name for new function - perhaps something like compute_summary_statistics?
  • Changing default behaviour of new function to replicate old behaviour and using same summarize name, but raising deprecation warning if relying on old defaults. This would require doing somethin like defaulting tocentral_measure="mean", adding a **kwargs argument to new function, checking if only_mean is in kwargs and using to set value of only_central as well as raising a deprecation warning to indicate that this argument is deprecated and users should use use_central instead, and also dealing with the adjustment to column name from central to mean in this case. Overall this is probably a bit complex.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Ready to merge
Development

Successfully merging this pull request may close these issues.

summarize to (optionally) use MEDIAN instead of MEAN as central summary
3 participants