Improve and add Spearman #227
Conversation
The tests are not running yet and I'm still checking whether the Cython function really works.
Hey @NicolasHug, I am unsure whether the rankings are calculated correctly. I think that the rankings are calculated by the columns and not by the rows of yr. Or am I wrong?
What do you mean by the rows of yr?
I split the comments up a bit, so that it stays clear ^^
I refer to the dict yr from test_similarities.py:
The current code would return this result:
I rounded it to three decimal places. But it should actually have come out as:

The second matrix is calculated according to the formula used in the code. This formula corresponds to this one here: (Sorry, but I just can't find a better source)
This is what @gautamramk used to calculate the rank:
But actually the ranks should be determined by the columns of yr, because these represent e.g. the users and their choices. Therefore I think that the current code is not correct and does not calculate the Spearman correlation.
I have written the following program to show how I think the Spearman correlation would be calculated correctly for min_support == 0. This code was also used to generate the second matrix.
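For illustration, here is a minimal standalone sketch of that idea (not the original snippet from this comment; the small yr-style dict and the variable names are made up):

```python
import numpy as np
from scipy.stats import rankdata

# made-up toy data: yr maps each item y to a list of (user, rating) pairs,
# in the same spirit as the yr dict in test_similarities.py
yr = {
    0: [(0, 3), (1, 4), (2, 1)],
    1: [(0, 4), (1, 2), (2, 2)],
    2: [(0, 5), (1, 1), (2, 4)],
}
n_users, n_items = 3, len(yr)

# build a dense user x item matrix, then rank each user's own ratings
# (i.e. rank along the user vectors, not along the item vectors)
ratings = np.zeros((n_users, n_items))
for y, y_ratings in yr.items():
    for x, r in y_ratings:
        ratings[x, y] = r
ranks = np.vstack([rankdata(row) for row in ratings])

# Pearson correlation of the rank vectors = Spearman correlation
print(np.corrcoef(ranks).round(3))
```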
So I wonder if I'm totally wrong. If so, where is my mistake? If not, I would completely rework the spearman method and rebuild it along the lines of the example code. Many thanks!
Hey @NicolasHug, the calculation in the previous version does not seem to be correct. The rank is calculated in the wrong direction: it must not be calculated over the rows of yr but over the columns. For example:
This is how the old version would calculate the ranks:

But they would have to be calculated like this:

The new version now calculates the ranks like this:
The existing tests now pass without changes. I also adjusted the documentation. Excuse the wall of text above ^^ but now it should be correct.
Hey @MaFeg100, sorry for the slow reply. I haven't looked in great detail, but I think you are probably correct. Here is what I understand, let me know if you agree.

Spearman is like Pearson correlation, but instead of using the raw ratings we use their ranks. Considering a rating matrix U x I (users are rows, items are columns), a user-user Spearman similarity would first compute the ranks of the ratings in a row-wise fashion (I think that's what you mean by "the columns of yr", but I'm not comfortable speaking about rows or columns for yr because it's not really a matrix), and then apply a Pearson sim. Note however that this is a simplified view, as in reality we want to compute the rankings between the common ratings only, not on the whole rows. (Maybe this has actually no impact? I haven't thought about it much.)

For now, maybe the most important thing is to make a quick benchmark (more thorough benchmarks can be done later on) to make sure the computation can run in a decent time compared to the other sims. I wouldn't want you to lose your time trying to fix this if in the end we won't merge the PR because the spearman sim is too slow :/ Then of course, we should fix the bugs if there are any.

Hope this does not add confusion ^^, thanks for looking into that anyway.
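To make the "Pearson on the ranks" view concrete, here is a tiny sanity check for dense vectors without missing ratings (just an illustration, not code from the PR):

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

rng = np.random.RandomState(0)
u = rng.randint(1, 6, 20)  # two dense rating vectors on a 1-5 scale
v = rng.randint(1, 6, 20)

# Spearman correlation = Pearson correlation applied to the ranks
by_ranks = pearsonr(rankdata(u), rankdata(v))[0]
direct = spearmanr(u, v)[0]
np.testing.assert_almost_equal(by_ranks, direct)
print(by_ranks, direct)
```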
Hey @NicolasHug, that's right. Spearman calculates the ranks within the user or item vectors; that's what I meant by the term columns. My improvement transforms yr directly into the ranks and then proceeds exactly like Pearson. The old version calculated the ranks in the wrong direction. The current version also considers the common elements: this is regulated exactly like in Pearson by the freq matrix. So the only difference is that yr must be transformed into a form that contains the ranks instead of the ratings.
I'll just post the cross validations of the procedures using Spearman. The interesting parts should be the time expenditure and the error rate.
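(Roughly, runs like the ones below can be reproduced with something along these lines; this is only a sketch, and the 'spearman' similarity name assumes the option added in this PR rather than an existing Surprise option:)

```python
from surprise import Dataset, KNNBasic
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')

# item-based KNNBasic with the (new) spearman similarity, 5-fold CV
sim_options = {'name': 'spearman', 'user_based': False}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
```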
Example 1: Spearman: item-based KNNBasic; MovieLens100k; 5-fold cross validation
Example 2: Cosine: item-based KNNBasic; MovieLens100k; 5-fold cross validation
Example 3: Pearson: item-based KNNBasic; MovieLens100k; 5-fold cross validation
First conclusion: Examples 1 to 3 show that Spearman is close to Pearson in MAE and RMSE for the item-based approach. You can also see that Spearman takes longer compared to Cosine and Pearson. This is because yr is first transformed into a "rank representation". Nevertheless, the time is not dramatically worse than with Cosine or Pearson.
Example 4: Spearman: user-based KNNBasic; MovieLens100k; 5-fold cross validation
Example 5: Cosine: user-based KNNBasic; MovieLens100k; 5-fold cross validation
Example 6: Pearson: user-based KNNBasic; MovieLens100k; 5-fold cross validation
Second conclusion: Here too, Spearman is close to Pearson in RMSE and MAE, but again it needs a bit more time in comparison.
Conclusion of the quick benchmark: I think Spearman may well be considered. The method differs only slightly from the Pearson correlation in programming effort, and similar RMSE and MAE values are obtained for both the item-based and the user-based approach. In addition, the user-based approach even runs faster than the item-based approach. Nevertheless, Spearman lags behind the other methods in fit time. It can make sense to work on the determination of the ranks, since that is where a large part of the time is spent. I hope my contribution is understandable ^^
OK, thanks a lot for the benchmark. I agree that the computation time is quite reasonable, comparatively. I'll try to review it in more detail soon. Instead of converting
I've considered that too. I think you could bypass the time for building the matrix by additionally passing xr. Nevertheless, I don't think you would save much more time. To check this, I compared the two versions. As an example: Old Pearson; MovieLens100k; 5-fold cross validation

This resulted in:

New Pearson; MovieLens100k; 5-fold cross validation

This resulted in:
You can see that the old version and the new version give comparable times. The errors are also similar. Furthermore, I think that the additional parameter xr will not change much time-wise. So I looked at the times of the individual code sections in the pre-process phase:

(1) Old Pearson preprocess (the rank is also calculated in this section): about 0.9549877643585205 s
(2) New Pearson preprocess, building the matrix: about 0.030748605728149414 s
(3) New Pearson preprocess, building the rank matrix: about 0.13120055198669434 s
(4) New Pearson preprocess: about 0.5745222568511963 s

So I think that introducing xr will not change much about the fact that the ranks have to be calculated. You can also see that the additional conversion of yr does not take much time (see (2)).
I hope the short analysis is helpful. I would be happy to receive a review of the code. Thanks in advance! ^^
Thanks a lot for the feedback. I made a more thorough review. This looks good in general but I have 2 important concerns:

- I'd like to avoid building the matrix
- The ranks seem to be computed on all ratings instead of computing them on the common ratings.

Sorry for the delay!
```diff
@@ -33,7 +33,7 @@ def test_cosine_sim():
     sim = sims.cosine(n_x, yr, min_support=1)

-    # check symetry and bounds (as ratings are > 0, cosine sim must be >= 0)
+    # check symmetry and bounds (as ratings are > 0, cosine sim must be >= 0)
```
lol thanks for correcting the typos
Always leave the place cleaner than you found it. ^^
surprise/similarities.pyx (Outdated)

```
    (or items).

    Only **common** users (or items) are taken into account. The Spearman
    correlation coefficient can be seen as a non parametric Pearson's
```
What do you mean by non-parametric?
I'd like to add something like "The Spearman correlation coefficient is equivalent to the Pearson correlation coefficient where the ratings are replaced by their rankings."
Okay I've improved on that.
surprise/similarities.pyx (Outdated)

```
    .. math ::
        \\text{spearman_sim}(u, v) = \\frac{ \\sum\\limits_{i \\in I_{uv}}
        (rank(r_{ui}) - \\overline{rank(u)}) \\cdot (rank(r_{vi}) - \\overline{rank(v)})} {\\sqrt{\\sum\\limits_{i \\in I_{uv}} (rank(r_{ui}) - \\overline{rank(u)})^2} \\cdot \\sqrt{\\sum\\limits_{i \\in I_{uv}} (rank(r_{vi}) - \\overline{rank(v)})^2}}
```
Please avoid lines longer than 79 characters
This as well.
surprise/similarities.pyx (Outdated)

```
    -1).

    For details on Spearman coefficient, see in chapter 4, page 126 of: `Recommender Systems Handbook
    <http://www.cs.ubbcluj.ro/~gabis/DocDiplome/SistemeDeRecomandare/Recommender_systems_handbook.pdf>`__.
```
Don't add the link, I doubt it's very legal ;)
Is the change better?
```
    sj = np.zeros((n_x, n_x), np.double)
    sim = np.zeros((n_x, n_x), np.double)
    ranks = np.zeros(n_x, np.double)
    matrix = np.zeros((len(yr), n_x), np.double)
```
This is going to be huge (n_users * n_items). Passing xr as well would avoid the need to create matrix, right? If that's the case then we should do it.
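For a rough sense of scale: on MovieLens-100k that dense array would be 943 x 1682 doubles, i.e. only about 12 MB, but for something like MovieLens-20M (roughly 138k users x 27k items) it would already be on the order of 30 GB.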
```
    for y, y_ratings in iteritems(yr):
        for xi, ri in y_ratings:
            # use the ranking matrix to get the elements row by row
            ranks[xi] = matrix[y, xi]
```
I think there might be a problem here: ranks[xi] contains the ranks for all the ys, right? But when we compare 2 xs, we only want to do that on the basis of their common ys. In the subsequent code you will compare them on the basis of all the ys.
Say we have 5 items and 2 users:

```
ratings:
user 1: 1, 2, X, 4, 5
user 2: X, X, 1, 5, 2
```

The ranks are:

```
ranks:
user 1: 1, 2, X, 4, 5
user 2: X, X, 1, 3, 2
```

But on the common items the ratings are

```
ratings:
user 1: X, X, X, 4, 5
user 2: X, X, X, 5, 2
```

and the ranks are then

```
ranks:
user 1: X, X, X, 1, 2
user 2: X, X, X, 2, 1
```

So your code will consider the ranks

```
ranks:
user 1: 4, 5
user 2: 3, 2
```

while it should actually be considering

```
ranks:
user 1: 1, 2
user 2: 2, 1
```
Maybe this has no impact because the relative order of each rank will stay the same, and it has no effect on pearson? I don't know what would happen if there are ties though...
Hey, thanks for the review. I will deal with it in more detail tomorrow or next week. It's interesting to see how the ranking behaves; I'm sure I'll come up with something about that. I don't think the ties are a problem: they are taken into account when the rankings are created, because they are standardized ranks. But I will take a closer look at the concerns again.

I have also taken a closer look at the Spearman rank correlation. I used the Webscope R3 dataset from Yahoo. The dataset can be requested here:

For the comparison of the Spearman rank correlation I used a 5-fold cross validation. For this I considered the fit time in comparison to the other methods with different distance measures.
After a few initial considerations, I think it makes no difference how the ranks are calculated. Since the ranks are normalized, it should make no difference whether they are calculated over the entire vector or only over a subset. But I think this relationship can be examined more closely. If the assumption is true, xr can actually be used instead of yr. However, if it turns out to be wrong, it must be investigated whether the additional effort of calculating the common items and the corresponding ranks is actually faster than the current solution.
@MaFeg100 let me know if I made an error in the code, but I think this snippet yields an example where computing the ranks on the common items gives different results than computing the ranks on the whole vectors:

```python
import numpy as np
from numpy.testing import assert_almost_equal
from scipy.stats import rankdata


def spearman(u, v, ranks_on_common):
    # u = vector of ratings of user u
    # v = vector of ratings of user v
    # ranks_on_common: whether to compute the ranks on the common items only,
    # or on the whole ratings u and v (which is the PR's version)
    assert len(u) == len(v)

    common_items = [i for (i, (r_ui, r_vi)) in enumerate(zip(u, v))
                    if r_ui and r_vi]
    if not common_items:
        return 0

    print('ranks on common:', ranks_on_common)
    print('ratings u:', u)
    print('ratings v:', v)
    print('common items:', common_items)

    if ranks_on_common:
        # compute ranks on common items
        u_commons = [u[i] for i in common_items]
        v_commons = [v[i] for i in common_items]
        rank_u = rankdata(u_commons)
        rank_v = rankdata(v_commons)
    else:
        # compute ranks on whole vectors (treating missing ratings as 0),
        # and then only keep the ranks for the common items
        rank_u = rankdata(u)
        rank_v = rankdata(v)
        rank_u = [rank_u[i] for i in common_items]
        rank_v = [rank_v[i] for i in common_items]

    print('ranks u:', rank_u)
    print('ranks v:', rank_v)
    assert len(rank_u) == len(rank_v) == len(common_items)

    # Then compute pearson sim as usual, on common items
    mu_u = np.mean(rank_u)
    mu_v = np.mean(rank_v)
    num = sum((r_ui - mu_u) * (r_uv - mu_v)
              for (r_ui, r_uv) in zip(rank_u, rank_v))
    a = sum((r_ui - mu_u)**2 for r_ui in rank_u)
    b = sum((r_vi - mu_v)**2 for r_vi in rank_v)
    denom = np.sqrt(a * b)
    if denom == 0:
        return 0

    return num / denom


rng = np.random.RandomState(0)
size = 4
for _ in range(1000):
    # generate random ratings vectors between [0, 5]
    u = rng.randint(0, 6, size)
    v = rng.randint(0, 6, size)
    a = spearman(u, v, ranks_on_common=True)
    print('-' * 5)
    b = spearman(u, v, ranks_on_common=False)
    print(a, b)
    print('-' * 10)
    assert_almost_equal(a, b)
```
I think that whether we should compute the ranks on the common items or on the whole ratings is totally independent from whether we should pass xr.
Hey @NicolasHug, you're right. The calculation errors can be fixed by setting the ratings that are not shared by the users to 0. I have changed your code so that both calculations are correct:

```python
import numpy as np
from numpy.testing import assert_almost_equal
from scipy.stats import rankdata


def spearman(u, v, ranks_on_common):
    # u = vector of ratings of user u
    # v = vector of ratings of user v
    # ranks_on_common: whether to compute the ranks on the common items only,
    # or on the whole ratings u and v (which is the PR's version)
    assert len(u) == len(v)

    common_items = [i for (i, (r_ui, r_vi)) in enumerate(zip(u, v))
                    if r_ui and r_vi]
    if not common_items:
        return 0

    print('ranks on common:', ranks_on_common)
    print('ratings u:', u)
    print('ratings v:', v)
    print('common items:', common_items)

    if ranks_on_common:
        # compute ranks on common items
        u_commons = [u[i] for i in common_items]
        v_commons = [v[i] for i in common_items]
        rank_u = rankdata(u_commons)
        rank_v = rankdata(v_commons)
    else:
        # compute ranks on whole vectors, but first set the ratings that are
        # not common to both users to 0, and then only keep the ranks for
        # the common items
        u_commons = [u[i] if i in common_items else 0 for i in range(len(u))]
        v_commons = [v[i] if i in common_items else 0 for i in range(len(v))]
        rank_u = rankdata(u_commons)
        rank_v = rankdata(v_commons)
        rank_u = [rank_u[i] for i in common_items]
        rank_v = [rank_v[i] for i in common_items]

    print('ranks u:', rank_u)
    print('ranks v:', rank_v)
    assert len(rank_u) == len(rank_v) == len(common_items)

    # Then compute pearson sim as usual, on common items
    mu_u = np.mean(rank_u)
    mu_v = np.mean(rank_v)
    num = sum((r_ui - mu_u) * (r_uv - mu_v)
              for (r_ui, r_uv) in zip(rank_u, rank_v))
    a = sum((r_ui - mu_u)**2 for r_ui in rank_u)
    b = sum((r_vi - mu_v)**2 for r_vi in rank_v)
    denom = np.sqrt(a * b)
    if denom == 0:
        return 0

    return num / denom


rng = np.random.RandomState(0)
size = 4
for _ in range(1000):
    # generate random ratings vectors between [0, 5]
    u = rng.randint(0, 6, size)
    v = rng.randint(0, 6, size)
    a = spearman(u, v, ranks_on_common=True)
    print('-' * 5)
    b = spearman(u, v, ranks_on_common=False)
    print(a, b)
    print('-' * 10)
    assert_almost_equal(a, b)
```

The same procedure can be used in matrix to calculate the ranks over the same elements. Is that right?
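One way to see why the two variants now agree: after the non-common positions are set to 0 they lie below every common rating, so on the common items each rank is simply the rank among the common items shifted by the same constant (the number of zero entries). Since the Pearson formula subtracts the mean rank, this constant shift cancels out, and ties among the common ratings receive the same averaged ranks in both variants.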
Hmm, I guess that could work, but:
This PR should improve on the already existing issue #167.
I will revise the code and make it ready to integrate.
I'm opening this PR so you can see the progress.