Replies: 15 comments
-
So I demoed proposals 1 and 2 in osu-accuracy, and I queried what the results would be for a few osu! players (*1).
Observations
(*1) This is not a representative sample. It's just meant to showcase the proposal.
-
I like the idea of median; it's simple and makes more sense than weighting accuracy. "Range acc" is something I don't quite understand: would it make accuracy always have 2 values? Having 2 different numbers for accuracy is kinda weird, not sure I agree with that. Is there an issue with using the average without weighting? That's another way I'd consider, though I haven't thought about it too much; it could be flawed. Would be cool if other people chipped in their opinion too, see if anyone wants to keep the weighted acc or exchange it for a different system.
-
Why median over average? I understand why you want to get rid of the weighted average, but just lowering the weight of top plays, or making them all have the same weight, seems better to me. Median gives an objectively wrong result for players like iRedak-: it literally ignores everything but the middle score (you could have 51% of your top plays be SSes and the rest be 70% passes, and still have 100% accuracy).
-
The arithmetic mean is too susceptible to outlier values. Let's show this with an example:

Example
A new player installs osu! and plays a low-end Hard map, getting 95%. "too ez", he says, "I'll play Insane." The new player then plays 5 Insane maps, getting accuracies of ~70%. Later on, on social networks:
The value of 74.1% above was calculated with an unweighted average, (5*70+95)/6. For comparison, the median would be 70%. When they play together, the median value turns out to be a much better predictor of their typical accuracy. This shows the large effect of the outlier 95% accuracy from the first easy game. tl;dr: the average is too vulnerable to outlier values. This issue is exacerbated by the current weighting system, but in the end it's intrinsic to the arithmetic mean.
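The effect above can be reproduced in a few lines of Python, using the accuracy values from the example (standard library only):

```python
from statistics import mean, median

# Accuracies from the example: five ~70% Insane plays plus one 95% outlier.
accs = [70, 70, 70, 70, 70, 95]

print(round(mean(accs), 2))  # 74.17 -> pulled up by the single outlier
print(median(accs))          # 70.0  -> matches the player's typical accuracy
```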
-
I understand what you are talking about, but first allow me to note that iRedak- is not an example of that. See the distribution of their plays. In contrast, the players that are really concerning are those with this kind of play distribution: iRedak- is someone I'd categorize as a "Perfect mod player", while the second dataset corresponds to what I call a "dual difficulty player". Arguably, the median provides a better predictor for the typical accuracy of a "Perfect mod player" (100%), though some may disagree; feel free to do so over a value difference of 0.01%. As for "dual difficulty players", my proposal number 2 exists in order to provide a better tool for them. We need a better estimate because of another of the issues the arithmetic mean has: the result may not be one of the data points. In the dataset for the dual difficulty player above, the average unweighted accuracy is 86.12%, which is absolutely not a typical accuracy for their plays. So how does proposal 2 fix this? (A visual example should help so that @camellirite understands, too)
What are these "different map difficulty brackets"? See: So we take the median of the lower difficulty set (i.e. lower map max pp), and the median of the higher difficulty set. Those turn out to be:

tl;dr: while the median yields a bad value for EXTREME dual difficulty players, the average does too; proposal 2, "range accuracy", fixes this issue, while also improving the analysis of common players, which are partly dual-difficulty players too.
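A minimal sketch of how proposal 2's median split could work. The tuple-based input and field ordering are my assumptions for illustration, not the actual osu! data model:

```python
from statistics import median

def range_accuracy(plays):
    """plays: list of (map_difficulty, accuracy) tuples from the top plays.

    Splits the plays at the median map difficulty (e.g. max obtainable pp)
    and returns the median accuracy of each half as (easier_acc, harder_acc).
    """
    ordered = sorted(plays)  # sort by map difficulty
    half = len(ordered) // 2
    easier, harder = ordered[:half], ordered[half:]
    return (median(acc for _, acc in easier),
            median(acc for _, acc in harder))

# A "dual difficulty" player: high acc on easy maps, low acc on hard ones.
plays = [(100, 99.5), (120, 99.0), (140, 98.5),
         (300, 72.0), (320, 71.0), (340, 70.0)]
print(range_accuracy(plays))  # (99.0, 71.0)
```

Note how the single aggregate mean of these accuracies (~85%) matches none of the plays, while the pair describes both regimes.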
-
You generally don't pass Insane difficulties the day you install osu!. After the player has been playing for a while, they will find the difficulty range they'll consistently set their top scores in (which in the current PP meta are exclusively FCs or near-FCs), and the accuracy of their top scores won't fluctuate as much. There is no value in accounting for a player that only played 6 times.

My argument for iRedak- was that his profile would display a 100% accuracy rating, which is simply wrong. The example with 51% SSes was to illustrate it with an extreme case. As I said, median ignores everything except the middle value. That is a lot of lost information.

Splitting accuracy up into 2 values is simply a bad idea. Not many players will have their top plays neatly separated into 2 partitions. As for the people where it does work out, their natural growth will screw up these partitions anyway, as they set new scores that move their old ones down. Then you will be mixing the old low-acc plays with new high-acc plays, and your accuracy value for these high-acc plays would lose its intended meaning.

So in the end, I don't think median is a good metric to replace (weighted) accuracy. There may be other metrics that are better (perhaps an unweighted average, or an average weighted on the age of the score), but it has to stay simple, intuitive and correct. However, I do like the idea of displaying accuracy in more detail. One way could be showing the values of a boxplot (quantiles at 0%, 25%, 50%, 75% and 100%, ignoring statistical outliers) so a visitor can tell the distribution of your accuracy. This would probably be hidden in a tooltip, though, as it is pretty verbose. These are just my opinions; others may disagree with me.
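The five boxplot values could be computed from a player's top plays; a rough sketch (the accuracy list is toy placeholder data, not real profile values):

```python
from statistics import quantiles

# Toy accuracy values standing in for a player's top plays.
accs = [92.1, 95.4, 96.0, 97.2, 97.8, 98.3, 98.9, 99.1, 99.6, 100.0]

# The five boxplot values: min, Q1, median, Q3, max.
q1, q2, q3 = quantiles(accs, n=4, method="inclusive")
print(f"min={min(accs)} Q1={q1} median={q2} Q3={q3} max={max(accs)}")
```

A real implementation would additionally drop statistical outliers before taking min/max, as suggested above.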
-
That's a fair objection to the example I used. However, it still stands that averages are not robust measures of location, weighted or not.
I believe that claiming that the accuracy of a dedicated […]

Also, please allow me to respectfully point out that you may be affected by a cognitive bias, and therefore rejecting a result of 100% because it's the maximum possible value and it would imply some sort of perfection from an (accidentally) almost perfect record.
I am not sure I follow you. Of course, top plays will often be a single aggregate of accuracy values. However, it will always be possible to split them into two sets according to whether they are lower or higher than the median difficulty value (map max pp).
Boxplots are nice, though they will be hard to understand in countries where basic education in Statistics doesn't cover them (like mine). However, while they are useful in many cases, we must consider this exact use case. So what information would a boxplot provide us?
Based on these fundamentals, it's also possible to build a boxplot from the "range accuracy" parameter:
Incidentally, I have already calculated what the quartiles would be for your boxplot proposal.
As you can see, the ranges I calculate have a smaller width than the interquartile range (the IQR always spans the middle 50% of the values), therefore providing a better sense of location. EDIT: Added Moonbeam to the table.
-
Why does it matter whether it is robust or not? If a player sets a score with a really low or really high accuracy, can they not see their accuracy rise or drop by a bit? Keep in mind that we're only expecting to see values between about 80% and 100% in a user's top plays. The outliers this may produce are not so significant that the global accuracy becomes inaccurate.
The accuracy display does not exist to show what the player wants it to be, but rather what the player actually achieves. If a PF player sets a score that's not 100%, their accuracy will drop. They will have to work to get it back to 100% by either setting better scores so it rounds up, or by improving the play itself. That's how the life of a PF player is. I don't believe PF players would want to change it. And this has nothing to do with cognitive biases. 100% accuracy implies that it is higher than 99.995%, not that 51% of your top plays are SSes.
I meant to say that just splitting your plays in 2, the accuracy values don't really have any meaning; they just show the upper and lower half of your plays. If you're going to split them up, you might as well split them into multiple values, like I said with the boxplot idea. As for that idea, that's exactly what it was: just an idea. Of course Q0 and Q4 are useless in this case, but you can replace them with the 15% and 85% quantiles. I wouldn't use the midhinge; it doesn't add any value, and the median seems fine in this case.
-
Since those scores would be outliers, they ought to be ignored. This issue is a matter of principle, so if you don't concur, let's just agree to disagree.
That's a good point once the top plays list enters a length=100 steady state. However, if we go back to my earlier example, those outliers will be significant for at least some new players. Keep in mind that the health of a game is inextricably linked to the influx of new players, and, in general, neglecting them may be dangerous (yes, this is probably an overstatement applied specifically to the Accuracy metric, but it shows my disagreement with your claim that "there is no value in accounting for a player that only played 6 times"). Since we can provide a better metric for these players, and in the rest of cases (steady state) the average and the median won't be far apart, I'd say we'd rather provide it than not.
Implying that using the median is "what the player wants it to be" is a mischaracterization. Good metrics, i.e. "statistics", should work as estimators. In this case, the metric used should be able to predict what the accuracy of future plays would be. That's conceptually very different from "what the player wants it to be", but it coincides with it because osu! provides the Perfect mod. Due to its better fitness for that purpose, the median is arguably superior to the arithmetic mean.
Since you insist, I retract that statement.
I may be very thick-skinned and claim that both of those thresholds seem equally arbitrary to me, but let's not do that. As I have already mentioned, the range accuracy metric is intended to fix the "51% issue". Since it's composed of the medians of two partitions of the dataset, theory states that the threshold for a result of […]. Hence, with range accuracy, such […]
You just stated what the meaning of the accuracy values is.
That idea is not foreign to me. In fact, that's proposal number 3. However, I am not really sold on it, because using many parameters is:
Yeah. I wasn't proposing to arbitrarily change the standard boxplot by using the midhinge rather than the median for its center. What I meant to do was to explain how a different graph could be made by averaging the range accuracy bounds. With this, I think we should be more or less clear on what our disagreements are. Therefore, I will (probably) refrain from discussing these points further to avoid noise in this thread. Thanks for the attention so far!
-
To me, the purpose of the display is to show the accuracy of the player in past scores (like top plays), not how they will perform in the future. For this reason the median would not be a good fit, since it ignores outliers. In my opinion they should not be ignored.
The boxplot (or other extra metrics) can be calculated on the client from the top plays of the profile, and calculating it isn't expensive at all if only done on the top 100 plays. A userscript could probably do it; it takes maybe 5 minutes to implement. There is indeed not much else to discuss; I believe all this comes down to preference in the end. Let's see what input others have to give. I personally don't see a need for any change.
-
Thoughts

Edge cases are not likely to occur, but they are still cases we expect some kind of preferred result on. I don't care how little you think they might matter, whether you don't think it should be that robust, or whether it's overkill or not. A system functioning as desired under anything you throw at it is what we should aim for, and nothing less.

The median is prone to not changing if you replace values, or if there are a bunch of identical values where the median happens to be. I do expect the value representing the series of accuracies the player has to change in response to those cases. Therefore, I do not think the median is a good way to represent the accs the player has. While the average solves both issues the median has, it is prone to an undesired response to extreme outliers, which I do believe should be trimmed.

With personal bias out of the way, since the argument ended on a matter of preference, how about averaging a certain percent of values centered on the median? The percentage value itself allows you to control how median-like or average-like the representative acc value is, where 0% is equivalent to the median of the entire dataset and 100% is equivalent to the average of the entire dataset. Code:
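The original snippet wasn't preserved in this thread; a hedged reconstruction of the idea follows (the function name, indexing, and sample data are my assumptions, not @abraker95's actual code):

```python
from statistics import mean

def representative_acc(accs, p):
    """Average the fraction p of sorted values centered on the median.

    p = 0.0 -> the single middle value (a median-like result)
    p = 1.0 -> the average of the entire dataset
    """
    ordered = sorted(accs)
    n = len(ordered)
    keep = max(1, round(n * p))  # how many values to keep around the center
    start = (n - keep) // 2
    return mean(ordered[start:start + keep])

accs = [70, 72, 75, 80, 85, 90, 95, 98, 99, 100]
print(representative_acc(accs, 0.0))  # middle value
print(representative_acc(accs, 1.0))  # plain average
```

Note this matches the later remark that at p=0% the code takes a single middle value rather than averaging the two middle ones.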
Demo here: https://repl.it/@_abrakerabraker/Accs-demo

Some numbers (p = percent the representative acc is average-like):
Notes/Comments

Some representative accs fluctuate up and down as they become more average-like. This is totally expected, since as more values are included, the lower or upper ends can "tip the scale", so to speak (the median being the center). Acc values from players' top 100 pp scores are used, and you can confirm the median in the demo provided simply by looking at the 50th value in the acc list, like so (they are printed in rows of 10): The code does not fully represent the median (p=0%): instead of taking the average of the two middle values, it takes the one located at […]. That said, while most of @Slayer95's values for median match about what I got, the median for BeowulF97's acc left me confused (reference pic right before this). The median value is around 81% or 82%, a far cry from the 78.74% he reports. There are some discrepancies within other player values, but this is the one that has the greatest difference between what I got and what he got.
-
@abraker95, your data doesn't seem right. See e.g. https://i.imgur.com/WNiuX2D.png Furthermore, since I posted my own results above, that user has played more matches. Now I get a median of 80.60%.
-
I couldn't find anything substantial regarding how the user accuracy is calculated in a quick search (except this forum post). If that is true, however, that explains why my accuracy seems to change arbitrarily (warning: personal opinion/bias/observation/etc). I'd love to see a change towards something more understandable. Even if nothing else happens, at minimum there should be a wiki article explaining what is going on with the user accuracy.
-
Something that I cannot neglect to point out, before others start throwing their own proposals here, is that my proposal number 2, "range accuracy", does in fact perform bivariate analysis to explore the relationship between map difficulty and accuracy. Because it uses a median split, it's a very simple computation as well as a simple concept to understand, in just 2 output values! Linear regression doesn't do better (as you must report R^2... and you are out of luck if it's low!)
I have verified that empirically, and it does seem to be a bad approach, which is the whole point of this issue. (Note that personal opinions are the way to go in order to reach decisions, of course supported by facts as much as possible.)
-
Yup, I forgot to factor in misses. I updated the code in the demo linked, but fixing the table will have to wait until tomorrow.
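For reference on why forgetting misses inflates the result: in osu!standard, misses contribute 0 to the numerator but still count in the denominator. A sketch of the standard per-play accuracy formula (the hit-count parameter names are my own):

```python
def osu_std_accuracy(n300, n100, n50, nmiss):
    """osu!standard accuracy: misses score 0 but still count as judged objects."""
    total_hits = n300 + n100 + n50 + nmiss
    return (300 * n300 + 100 * n100 + 50 * n50) / (300 * total_hits)

# Dropping the misses overstates the accuracy:
print(osu_std_accuracy(90, 5, 0, 5))  # misses included in the denominator
print(osu_std_accuracy(90, 5, 0, 0))  # same hits with the misses ignored
```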
-
Motivation
As per my understanding, currently the accuracy of a user is calculated as a weighted average of the accuracies of the top plays, using the same weighting system pp does. This concept has several issues:
This means that an outlier accuracy reaching the top plays would heavily affect the resulting value, either boosting it or reducing it sharply.
After successfully passing a map beyond the player's current comfort zone, it could turn out to be a top-pp play with low accuracy. In this situation, which is supposed to be joyful for the player, the global accuracy of the user drops sharply, resulting in a bad™ user experience.
Example: if player A has `98%` accuracy and player B has `95%` accuracy, who is better, given that player A is at `500pp` and player B at `800pp`? The higher accuracy of player A casts a shadow on the higher ranking of player B, because player A could "sacrifice" his accuracy by playing harder maps in order to get PP. However, in the end the interpretation of the higher accuracy is just a suspicion, and not a reliable conclusion.

Proposal
I have three proposals to fix these three issues. Each of them more or less builds upon the previous one.
Switch the calculation from using a weighted average to using the median of the relevant accuracies from the top pp plays (the median is a robust measure of central tendency).
The displayed user accuracy should be a compound metric of 2 (or more) values which represents the performance of the user over different map difficulty brackets.
In particular, I am proposing to split the top pp plays in two sets, according to the maximum pp obtainable (or star rating?) on each map (with the respective used mods).
Hence, the accuracy would be displayed as `a ➜ b`, where `a` is the median of the accuracies of the plays on the easiest maps, and `b` is the median of the accuracies of the plays on the hardest maps. Note that the value `a` would represent the accuracy for maps within the player's comfort zone, while `b` would represent the accuracy for maps which challenge the player's skill. In this context, `a ➜ b` is to be read roughly as "a accuracy in comfortable maps, with b accuracy in harder maps".

[…] `100 * 2^n`. Then, the accuracy becomes a list of medians for each pp range associated with map difficulty. In most places where it would be displayed, the accuracy would be shown as just the two medians of the highest difficulty sets. However, there would be a place in the user profile where the full value would be listed in order to compare several players.
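The bracketed generalization (proposal 3) could be sketched like this. The `100 * 2^n` bracket boundaries come from the text; the function name, input shape, and sample data are my assumptions:

```python
from statistics import median

def bracket_medians(plays):
    """plays: list of (map_max_pp, accuracy) tuples.

    Groups plays into max-pp brackets [100*2^n, 100*2^(n+1)) and returns
    (bracket_floor, median_accuracy) pairs, ordered by difficulty.
    """
    brackets = {}
    for max_pp, acc in plays:
        n = 0
        while 100 * 2 ** (n + 1) <= max_pp:
            n += 1
        brackets.setdefault(n, []).append(acc)
    return [(100 * 2 ** n, median(accs))
            for n, accs in sorted(brackets.items())]

plays = [(120, 99.0), (150, 98.0), (250, 92.0), (310, 90.0), (650, 75.0)]
print(bracket_medians(plays))  # [(100, 98.5), (200, 91.0), (400, 75.0)]
```

Displaying "just the two medians of the highest difficulty sets" would then mean showing the last two entries of this list.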