`Summarise` function to summarise results -- with more flexibilitythan previous utility function #1457

tbhallett · 2024-09-04T17:13:13Z

Fixes #1420

… that function (without disturbing the original summarize function, upon which many analayis script rely.)

…ummarize` to `summarise`

…se new function to avoid duplication

mnjowe

Hi Tim. The PR looks great!
Find below my suggestions which I feel are all optional. Thanks

mnjowe · 2024-10-17T10:47:13Z

src/tlo/analysis/utils.py


-    if only_mean and (not collapse_columns):
+    if only_central and (not collapse_columns):
        # Remove other metrics and simplify if 'only_mean' across runs for each draw is required:


Suggested change

# Remove other metrics and simplify if 'only_mean' across runs for each draw is required:

# Remove other metrics and simplify if 'only_central' across runs for each draw is required:

mnjowe · 2024-10-17T10:50:51Z

src/tlo/analysis/utils.py

+        om: pd.DataFrame = summary.loc[:, (slice(None), "central")]
        om.columns = [c[0] for c in om.columns.to_flat_index()]
        om.columns.name = 'draw'
        return om


if om was short for only_mean then we have to update this to oc for only_central

Suggested change

om: pd.DataFrame = summary.loc[:, (slice(None), "central")]

om.columns = [c[0] for c in om.columns.to_flat_index()]

om.columns.name = 'draw'

return om

oc: pd.DataFrame = summary.loc[:, (slice(None), "central")]

oc.columns = [c[0] for c in oc.columns.to_flat_index()]

oc.columns.name = 'draw'

return oc

mnjowe · 2024-10-17T10:53:21Z

src/tlo/analysis/utils.py

        else:
            return summary_droppedlevel

    else:
        return summary


+def summarize(


function name summarise and summarize looks confusing. can't we suggest a better name for this function? If we don't want to interfere with how this is used outside utils we can rename summarise?

also, do we really need this function? I think this is just calling/copying summarise function with an argument mean instead of the default median. Can't we have summarise default to mean on central_measure and delete this function? in that case summarise will be renamed to summarize

I agree with @mnjowe that having two different but similar functions that differ only the in whether they use an ize or ise is likely to be confusing.

I suspect the reason you've kept summarize with the original default behaviour of computing mean and same call signature as previously @tbhallett is to ensure this doesn't break code using the previous version?

There are few alternative ways we could deal with this:

Just making a breaking change and removing the old function and using summarize for new function (potentially with a helpful error message if users try to call with previous signature).

Using a more differentiating name for new function - perhaps something like compute_summary_statistics?

Changing default behaviour of new function to replicate old behaviour and using same summarize name, but raising deprecation warning if relying on old defaults. This would require doing somethin like defaulting tocentral_measure="mean", adding a **kwargs argument to new function, checking if only_mean is in kwargs and using to set value of only_central as well as raising a deprecation warning to indicate that this argument is deprecated and users should use use_central instead, and also dealing with the adjustment to column name from central to mean in this case. Overall this is probably a bit complex.

mnjowe · 2024-10-17T10:54:29Z

src/tlo/analysis/utils.py

+    if output.columns.nlevels > 1:
+        output = output.rename(columns={'central': 'mean'}, level=1)  # rename 'central' to 'mean'
+    return output


can't we do this in summarise function, delete this function and rename summarise to summarize?

mnjowe · 2024-10-17T11:09:55Z

src/tlo/analysis/utils.py

+    output = summarise(
+        results=results,
+        central_measure='mean',
+        only_central=only_mean,


passing only_mean to only_central here looks confusing. can't we use one name for consistency?

matt-graham · 2024-10-18T08:58:15Z

src/tlo/analysis/utils.py

+    :Param: results: The pd.DataFame of results.
+    :Param: central_measure: The name of the central measure to use - either 'mean' or 'median'.
+    :Param: width_of_range: The width of the range to compute the statistics (e.g. 0.95 for the 95% interval).
+    :Param: collapse_columns: Whether to simplify the columnar index if there is only one run (cannot be done otherwise)
+    :Param: only_central: Whether to only report the central value (dropping the range).


Suggested change

:Param: results: The pd.DataFame of results.

:Param: central_measure: The name of the central measure to use - either 'mean' or 'median'.

:Param: width_of_range: The width of the range to compute the statistics (e.g. 0.95 for the 95% interval).

:Param: collapse_columns: Whether to simplify the columnar index if there is only one run (cannot be done otherwise)

:Param: only_central: Whether to only report the central value (dropping the range).

:param results: The dataframe of results to compute summary statistics of.

:param central_measure: The name of the central measure to use - either 'mean' or 'median'.

:param width_of_range: The width of the range to compute the statistics (e.g. 0.95 for the 95% interval).

:param collapse_columns: Whether to simplify the columnar index if there is only one run (cannot be done otherwise).

:param only_central: Whether to only report the central value (dropping the range).

:return: A dataframe with computed summary statistics.

Small update to fix parameter directive syntax in docstring and adding return information.

matt-graham · 2024-10-18T09:05:25Z

src/tlo/analysis/utils.py

+    if central_measure == 'mean':
+        stats.update({'central': results.groupby(axis=1, by='draw', sort=False).mean()})
+    elif central_measure == 'median':
+        stats.update({'central': results.groupby(axis=1, by='draw', sort=False).median()})


Suggested change

if central_measure == 'mean':

stats.update({'central': results.groupby(axis=1, by='draw', sort=False).mean()})

elif central_measure == 'median':

stats.update({'central': results.groupby(axis=1, by='draw', sort=False).median()})

if central_measure == 'mean':

stats['central'] = results.groupby(axis=1, by='draw', sort=False).mean()

elif central_measure == 'median':

stats['central'] = results.groupby(axis=1, by='draw', sort=False).median()

Indexed assignment to a dict is generally preferable over update when just adding a key-value pair as it avoids creating an unecessary intermediate dictionary and is slightly more readable.

We could also avoid the repetition across the different conditions by changing to something like

if central_measure in ('mean', 'median'): grouped_results = results.groupby(axis=1, by='draw', sort=False) stats['central'] = getattr(grouped_results, central_measure)()

but I think on balance probably the loss of readability outweighs the slight gain in avoiding code redundancy.

matt-graham · 2024-10-18T09:18:21Z

src/tlo/analysis/utils.py

+    stats.update(
        {
-            'mean': results.groupby(axis=1, by='draw', sort=False).mean(),
-            'lower': results.groupby(axis=1, by='draw', sort=False).quantile(0.025),
-            'upper': results.groupby(axis=1, by='draw', sort=False).quantile(0.975),
-        },
-        axis=1
+            'lower': results.groupby(axis=1, by='draw', sort=False).quantile((1.-width_of_range)/2.),
+            'upper': results.groupby(axis=1, by='draw', sort=False).quantile(1.-(1.-width_of_range)/2.),
+        }
    )


To avoid computing the groupby operation multiple times here and when computing central measure, it would be better to have one line something like

grouped_results = results.groupby(axis=1, by='draw', sort=False)

at beginning of function and then using grouped_results in place of repeated results.groupby(axis=1, by='draw', sort=False) calls.

I would possibly also say writing this as two separate indexed assignments to the stats dict rather than using update would be a bit more readable

lower_quantile = (1 - width_of_range) / 2 stats["lower"] = grouped_results.quantile(lower_quantile) stats["upper"] = grouped_results.quantile(1 - lower_quantile)

but using dict.update here to update dictionary with two entries is reasonable so I think this is more of a personal preference thing!

matt-graham · 2024-10-18T10:58:37Z

src/tlo/analysis/utils.py

        else:
            return summary_droppedlevel

    else:
        return summary


+def summarize(


I agree with @mnjowe that having two different but similar functions that differ only the in whether they use an ize or ise is likely to be confusing.

I suspect the reason you've kept summarize with the original default behaviour of computing mean and same call signature as previously @tbhallett is to ensure this doesn't break code using the previous version?

There are few alternative ways we could deal with this:

Just making a breaking change and removing the old function and using summarize for new function (potentially with a helpful error message if users try to call with previous signature).

Using a more differentiating name for new function - perhaps something like compute_summary_statistics?

Changing default behaviour of new function to replicate old behaviour and using same summarize name, but raising deprecation warning if relying on old defaults. This would require doing somethin like defaulting tocentral_measure="mean", adding a **kwargs argument to new function, checking if only_mean is in kwargs and using to set value of only_central as well as raising a deprecation warning to indicate that this argument is deprecated and users should use use_central instead, and also dealing with the adjustment to column name from central to mean in this case. Overall this is probably a bit complex.

tbhallett added 4 commits September 4, 2024 18:10

create 'summarise' (UK spelling) to bring in additional flexibilit to…

889721e

… that function (without disturbing the original summarize function, upon which many analayis script rely.)

update test_summarize -> test_summarise and change target from `s…

689b553

…ummarize` to `summarise`

ensure order of columns is as expected

6c9541c

test for repdroduction of legacy behaviour and make legacy function u…

f96f6c0

…se new function to avoid duplication

tbhallett requested review from mnjowe, matt-graham and tamuri October 17, 2024 09:35

mnjowe approved these changes Oct 17, 2024

View reviewed changes

matt-graham requested changes Oct 18, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`Summarise` function to summarise results -- with more flexibilitythan previous utility function #1457

`Summarise` function to summarise results -- with more flexibilitythan previous utility function #1457

tbhallett commented Sep 4, 2024

mnjowe left a comment

mnjowe Oct 17, 2024

mnjowe Oct 17, 2024

mnjowe Oct 17, 2024 •

edited

Loading

matt-graham Oct 18, 2024

mnjowe Oct 17, 2024

mnjowe Oct 17, 2024

matt-graham Oct 18, 2024

matt-graham Oct 18, 2024

matt-graham Oct 18, 2024

matt-graham Oct 18, 2024

	# Remove other metrics and simplify if 'only_mean' across runs for each draw is required:
	# Remove other metrics and simplify if 'only_central' across runs for each draw is required:

-    :Param: results: The pd.DataFame of results.
-    :Param: central_measure: The name of the central measure to use - either 'mean' or 'median'.
-    :Param: width_of_range: The width of the range to compute the statistics (e.g. 0.95 for the 95% interval).
-    :Param: collapse_columns: Whether to simplify the columnar index if there is only one run (cannot be done otherwise)
-    :Param: only_central: Whether to only report the central value (dropping the range).
+    :param results: The dataframe of results to compute summary statistics of.
+    :param central_measure: The name of the central measure to use - either 'mean' or 'median'.
+    :param width_of_range: The width of the range to compute the statistics (e.g. 0.95 for the 95% interval).
+    :param collapse_columns: Whether to simplify the columnar index if there is only one run (cannot be done otherwise).
+    :param only_central: Whether to only report the central value (dropping the range).
+    :return: A dataframe with computed summary statistics.

Summarise function to summarise results -- with more flexibilitythan previous utility function #1457

Are you sure you want to change the base?

Summarise function to summarise results -- with more flexibilitythan previous utility function #1457

Conversation

tbhallett commented Sep 4, 2024

mnjowe left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mnjowe Oct 17, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

`Summarise` function to summarise results -- with more flexibilitythan previous utility function #1457

`Summarise` function to summarise results -- with more flexibilitythan previous utility function #1457

mnjowe Oct 17, 2024 •

edited

Loading