I am comparing multiple offline recommender systems models on an implicit feedback dataset and reporting various metrics.
The models rank in the same order across most metrics: the best model stays the best regardless of which metric I use, except in the case of nDCG.
Trying to understand why that is, I realized that perhaps using nDCG with implicit feedback doesn't make much sense: it assumes the ground truth is ordered, when in fact it is not.
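For concreteness, the definition I am working from (and I may be misreading how evaluation libraries apply it) is

$$\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad \mathrm{nDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k},$$

where, for implicit feedback, relevance is binary: $rel_i = 1$ if the $i$-th recommended item appears anywhere in the user's held-out interactions, and $0$ otherwise. Read this way, $\mathrm{IDCG@}k$ depends only on the number of held-out items, not their order, which is exactly what makes the ordering question below confusing to me.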
The best models under nDCG seem to be the ones that just "mimic" the order of the ground truth (GT) by chance (in my case it is sorted by timestamp). But just because a GT user-item pair appears earlier in the list doesn't make it a better fit. Would the result change if I simply shuffled the test set? Does this make any sense?
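To test this, here is a minimal sketch of how I understand binary-relevance nDCG@k (a toy implementation of my own, not taken from any evaluation library, so the function name and behavior are my assumptions):

```python
import numpy as np

def ndcg_at_k(recommended, ground_truth, k=10):
    """Hypothetical binary-relevance nDCG@k: rel_i = 1 if the i-th
    recommended item appears anywhere in the ground-truth set."""
    gt = set(ground_truth)  # GT order is discarded here by construction
    gains = np.array([1.0 if item in gt else 0.0 for item in recommended[:k]])
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float(np.sum(gains * discounts))
    # Ideal DCG depends only on how many GT items exist, not on their order
    n_ideal = min(len(gt), k)
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, n_ideal + 2))))
    return dcg / idcg if idcg > 0 else 0.0

# Toy check: shuffling the GT list leaves this version of the score unchanged
rng = np.random.default_rng(0)
recs = [3, 7, 1, 9, 4]
gt = [9, 3, 5]
print(ndcg_at_k(recs, gt, k=5))
print(ndcg_at_k(recs, rng.permutation(gt).tolist(), k=5))  # same value
```

If an evaluation pipeline instead derives graded relevance from GT position (e.g. more recent interactions get higher gain), the score would no longer be shuffle-invariant, which is what I suspect is happening in my setup.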
Nevertheless, nDCG is reported in virtually all recommender systems papers. Why? I haven't seen this issue mentioned in any of the papers I've read.