Summarise
function to summarise results -- with more flexibility than previous utility function
#1457
base: master
Changes from all commits
889721e
689b553
6c9541c
f96f6c0
@@ -10,7 +10,7 @@
 from collections.abc import Mapping
 from pathlib import Path
 from types import MappingProxyType
-from typing import Callable, Dict, Iterable, List, Optional, TextIO, Tuple, Union
+from typing import Callable, Dict, Iterable, List, Optional, TextIO, Tuple, Union, Literal

 import git
 import matplotlib.colors as mcolors
@@ -306,43 +306,90 @@ def generate_series(dataframe: pd.DataFrame) -> pd.Series:
     return _concat


-def summarize(results: pd.DataFrame, only_mean: bool = False, collapse_columns: bool = False) -> pd.DataFrame:
+def summarise(
+    results: pd.DataFrame,
+    central_measure: Literal["mean", "median"] = "median",
+    width_of_range: float = 0.95,
+    only_central: bool = False,
+    collapse_columns: bool = False,
+) -> pd.DataFrame:
     """Utility function to compute summary statistics

-    Finds mean value and 95% interval across the runs for each draw.
+    Finds a central value and a specified interval across the runs for each draw. By default, this uses a central
+    measure of the median and a 95% interval range.
+
+    :Param: results: The pd.DataFame of results.
+    :Param: central_measure: The name of the central measure to use - either 'mean' or 'median'.
+    :Param: width_of_range: The width of the range to compute the statistics (e.g. 0.95 for the 95% interval).
+    :Param: collapse_columns: Whether to simplify the columnar index if there is only one run (cannot be done otherwise)
+    :Param: only_central: Whether to only report the central value (dropping the range).
+
     """
+    stats = dict()

+    if central_measure == 'mean':
+        stats.update({'central': results.groupby(axis=1, by='draw', sort=False).mean()})
+    elif central_measure == 'median':
+        stats.update({'central': results.groupby(axis=1, by='draw', sort=False).median()})
Comment on lines +330 to +333

Indexed assignment to a dict is generally preferable over calling `update` with a single-entry dict. We could also avoid the repetition across the different conditions by changing to something like

    if central_measure in ('mean', 'median'):
        grouped_results = results.groupby(axis=1, by='draw', sort=False)
        stats['central'] = getattr(grouped_results, central_measure)()

but I think on balance the loss of readability probably outweighs the slight gain in avoiding code redundancy.
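A minimal sketch of the `getattr`-based dispatch described in the comment above, run on a toy frame. The helper name `central_by_draw` and the toy data are hypothetical and not part of the PR. The sketch also assumes it is acceptable to transpose and group on the index level instead of calling `groupby(axis=1)`, which recent pandas releases deprecate.

```python
import pandas as pd


def central_by_draw(results: pd.DataFrame, central_measure: str = "median") -> pd.DataFrame:
    """Hypothetical helper: central value across runs for each draw."""
    if central_measure not in ("mean", "median"):
        raise ValueError(f"Unknown stat: {central_measure}")
    # Group the transposed frame on the 'draw' level, so no `axis=1` is needed.
    grouped = results.T.groupby(level="draw", sort=False)
    # Look up the aggregation by name: grouped.mean() or grouped.median().
    return getattr(grouped, central_measure)().T


# Toy results: one row, two draws with three runs each.
columns = pd.MultiIndex.from_product([[0, 1], [0, 1, 2]], names=["draw", "run"])
toy = pd.DataFrame([[1.0, 2.0, 6.0, 10.0, 20.0, 60.0]], columns=columns)

medians = central_by_draw(toy, "median")
means = central_by_draw(toy, "mean")
```

Validating the name before calling `getattr` keeps the explicit `ValueError` of the branch-based version, so an unexpected string cannot reach an arbitrary attribute of the grouped object.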
+    else:
+        raise ValueError(f"Unknown stat: {central_measure}")
+
-    summary = pd.concat(
+    stats.update(
         {
-            'mean': results.groupby(axis=1, by='draw', sort=False).mean(),
-            'lower': results.groupby(axis=1, by='draw', sort=False).quantile(0.025),
-            'upper': results.groupby(axis=1, by='draw', sort=False).quantile(0.975),
-        },
-        axis=1
+            'lower': results.groupby(axis=1, by='draw', sort=False).quantile((1.-width_of_range)/2.),
+            'upper': results.groupby(axis=1, by='draw', sort=False).quantile(1.-(1.-width_of_range)/2.),
+        }
     )
Comment on lines +337 to 342

To avoid computing the groupby repeatedly, we could compute

    grouped_results = results.groupby(axis=1, by='draw', sort=False)

at the beginning of the function and then use `grouped_results` throughout. I would possibly also suggest writing this as two separate indexed assignments to the `stats` dict

    lower_quantile = (1 - width_of_range) / 2
    stats["lower"] = grouped_results.quantile(lower_quantile)
    stats["upper"] = grouped_results.quantile(1 - lower_quantile)

but using
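The arithmetic in the comment above can be checked on a small example: a central interval of width `w` runs from quantile `(1 - w) / 2` to `1 - (1 - w) / 2`. The sketch below builds the grouped object once and reuses it for both bounds. The helper name `interval_by_draw` and the toy data are hypothetical, and grouping on the transposed frame is an assumption made to avoid the deprecated `axis=1` argument.

```python
import pandas as pd


def interval_by_draw(results: pd.DataFrame, width_of_range: float = 0.95) -> dict:
    """Hypothetical helper: lower and upper quantiles across runs for each draw."""
    # Build the grouped object once and reuse it for both quantiles.
    grouped = results.T.groupby(level="draw", sort=False)
    lower_quantile = (1 - width_of_range) / 2
    stats = {}
    stats["lower"] = grouped.quantile(lower_quantile).T
    stats["upper"] = grouped.quantile(1 - lower_quantile).T
    return stats


# Toy results: one row, a single draw with five runs.
columns = pd.MultiIndex.from_product([[0], [0, 1, 2, 3, 4]], names=["draw", "run"])
toy = pd.DataFrame([[0.0, 1.0, 2.0, 3.0, 4.0]], columns=columns)

# A 50% interval spans the 25th to the 75th percentile.
bounds = interval_by_draw(toy, width_of_range=0.5)
```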
+
+    summary = pd.concat(stats, axis=1)
     summary.columns = summary.columns.swaplevel(1, 0)
     summary.columns.names = ['draw', 'stat']
-    summary = summary.sort_index(axis=1)
+    summary = summary.sort_index(axis=1).reindex(columns=['lower', 'central', 'upper'], level=1)

-    if only_mean and (not collapse_columns):
+    if only_central and (not collapse_columns):
         # Remove other metrics and simplify if 'only_mean' across runs for each draw is required:
-        om: pd.DataFrame = summary.loc[:, (slice(None), "mean")]
+        om: pd.DataFrame = summary.loc[:, (slice(None), "central")]
         om.columns = [c[0] for c in om.columns.to_flat_index()]
         om.columns.name = 'draw'
         return om
Comment on lines +351 to 354
if

     elif collapse_columns and (len(summary.columns.levels[0]) == 1):
         # With 'collapse_columns', if number of draws is 1, then collapse columns multi-index:
         summary_droppedlevel = summary.droplevel('draw', axis=1)
-        if only_mean:
-            return summary_droppedlevel['mean']
+        if only_central:
+            return summary_droppedlevel['central']
         else:
             return summary_droppedlevel

     else:
         return summary


+def summarize(
function name also, do we really need this function? I think this is just calling/copying summarise function with an argument

I agree with @mnjowe that having two different but similar functions that differ only in whether they use an
I suspect the reason you've kept
There are a few alternative ways we could deal with this:
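One common way to keep a legacy name while steering callers towards the new one is a thin wrapper that emits a `DeprecationWarning`. The sketch below is an illustration only and is not necessarily one of the alternatives the reviewers had in mind. The list-based `summarise` is a stand-in written for this example so that it stays self-contained; it is not the pandas implementation from the diff.

```python
import warnings


def summarise(values, central_measure="median"):
    """Stand-in for the new function: central value of a list of numbers."""
    ordered = sorted(values)
    if central_measure == "mean":
        return sum(ordered) / len(ordered)
    if central_measure == "median":
        mid = len(ordered) // 2
        if len(ordered) % 2:
            return ordered[mid]
        return (ordered[mid - 1] + ordered[mid]) / 2
    raise ValueError(f"Unknown stat: {central_measure}")


def summarize(values):
    """Legacy spelling: warn, then delegate with the old mean-based behaviour."""
    warnings.warn(
        "summarize is deprecated; use summarise(central_measure='mean') instead",
        DeprecationWarning,
        stacklevel=2,  # attribute the warning to the caller, not to this wrapper
    )
    return summarise(values, central_measure="mean")


# Capture the warning so the legacy call can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_result = summarize([1, 2, 6])
```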
+    results: pd.DataFrame,
+    only_mean: bool = False,
+    collapse_columns: bool = False
+):
+    """Utility function to compute summary statistics
+
+    Finds mean value and 95% interval across the runs for each draw.
+
+    NOTE: This provides the legacy functionality of `summarize` that is hard-wired to use `means` (the kwarg is
+    `only_mean` and the name of the column in the output is `mean`). Please move to using the new and more flexible
+    version of `summarize` that allows the use of medians and is flexible to allow other forms of summary measure in
+    the future.
+    """
+    output = summarise(
+        results=results,
+        central_measure='mean',
+        only_central=only_mean,
passing
+        collapse_columns=collapse_columns,
+    )
+    if output.columns.nlevels > 1:
+        output = output.rename(columns={'central': 'mean'}, level=1)  # rename 'central' to 'mean'
+    return output
Comment on lines +388 to +390
can't we do this in
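To make the renaming step under discussion concrete, the sketch below applies `DataFrame.rename` with `level=1` to a toy summary frame that uses the `draw`/`stat` column layout from the diff. The toy values are made up for illustration.

```python
import pandas as pd

# Toy summary: one row, two draws, three statistics per draw.
columns = pd.MultiIndex.from_product(
    [[0, 1], ["lower", "central", "upper"]], names=["draw", "stat"]
)
summary = pd.DataFrame([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]], columns=columns)

# With a two-level column index, rename only the labels on the 'stat' level.
legacy = summary.rename(columns={"central": "mean"}, level=1)

# Selecting a single draw leaves single-level columns, similar in shape to
# the collapsed output, so an `nlevels > 1` guard would skip the rename.
collapsed = summary[0]
```

As far as the diff shows, the `nlevels > 1` guard means a collapsed result would keep the `central` label, and a Series returned by the `only_central` branch has no `columns` attribute at all, so those cases may deserve a test.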
+
+
 def get_grid(params: pd.DataFrame, res: pd.Series):
     """Utility function to create the arrays needed to plot a heatmap.

@@ -1129,7 +1176,7 @@ def get_parameters_for_status_quo() -> Dict:
             "equip_availability": "all",  # <--- NB. Existing calibration is assuming all equipment is available
         },
     }

 def get_parameters_for_standard_mode2_runs() -> Dict:
     """
     Returns a dictionary of parameters and their updated values to indicate
Small update to fix the parameter directive syntax in the docstring and add return information.