
Improve and add Spearman #227

Open · wants to merge 15 commits into master
Conversation

@ghost commented Nov 15, 2018

This PR improves on the already existing #167.

I will revise the code and make it ready to integrate.

I'm opening this PR so you can follow the progress.

@ghost (Author) commented Nov 15, 2018

The tests are not running yet and I'm still checking whether the Cython function really works.

@ghost (Author) commented Nov 16, 2018

Hey @NicolasHug,

I am unsure whether the rankings are calculated correctly.

I think the rankings are calculated over the columns and not over the rows of yr.

Or am I wrong?

@NicolasHug (Owner)

What do you mean by the rows of yr? yr is a dictionary, so it doesn't have rows or columns.

yr is equal to ur or ir (from the trainset object), depending on the user_based parameter.

@ghost (Author) commented Nov 16, 2018

I've split my comments up a bit so things stay clear ^^

@ghost (Author) commented Nov 16, 2018

I refer to the dict yr from test_similarities.py:

n_x = 8
yr_global = {
    0: [(0, 3), (1, 3), (2, 3), (5, 1), (6, 1.5), (7, 3)],  # noqa
    1: [(0, 4), (1, 4), (2, 4)],  # noqa
    2: [(2, 5), (3, 2), (4, 3)],  # noqa
    3: [(1, 1), (2, 4), (3, 2), (4, 3), (5, 3), (6, 3.5), (7, 2)],  # noqa
    4: [(1, 5), (2, 1), (5, 2), (6, 2.5), (7, 2.5)],  # noqa
}

The current code would return this result:

sim = sims.spearman(n_x, yr, min_support=0)
[[ 1. 1. 1. 0. 0. 0. 0. 0. ]
 [ 1. 1. -0.595 0. 0. -0.999 -0.902 0.746]
 [ 1. -0.595 1. 1. 0. 0.789 0.412 -0.143]
 [ 0. 0. 1. 1. 0. 0. 0. 0. ]
 [ 0. 0. 0. 0. 1. 0. 0. 0. ]
 [ 0. -0.999 0.789 0. 0. 1. 0.885 -0.721]
 [ 0. -0.902 0.412 0. 0. 0.885 1. -0.961]
 [ 0. 0.746 -0.143 0. 0. -0.721 -0.961 1. ]]

I rounded it to three decimal places.

But in fact the result should have been:

[[ 1. 0.335 -0.057 -0.645 -0.645 -0.516 -0.516 -0.057]
 [ 0.335 1. -0.821 -0.866 -0.866 0.154 0.154 0.359]
 [-0.057 -0.821 1. 0.74 0.74 -0.5 -0.5 -0.816]
 [-0.645 -0.866 0.74 1. 1. 0.148 0.148 -0.444]
 [-0.645 -0.866 0.74 1. 1. 0.148 0.148 -0.444]
 [-0.516 0.154 -0.5 0.148 0.148 1. 1. 0.579]
 [-0.516 0.154 -0.5 0.148 0.148 1. 1. 0.579]
 [-0.057 0.359 -0.816 -0.444 -0.444 0.579 0.579 1. ]]

@ghost (Author) commented Nov 16, 2018

The second matrix is calculated according to the formula used in my code (see below).

This formula corresponds to this one here:
https://mathcracker.com/spearman-correlation-calculator.php

(Sorry, but I just can't find a better source)
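
For reference, the usual computational form of the coefficient, matching that page and the program below: for rank vectors $r_x$, $r_y$ of length $n$,

\rho_{xy} = \frac{n \sum_i r_{x,i}\, r_{y,i} - \left(\sum_i r_{x,i}\right)\left(\sum_i r_{y,i}\right)}{\sqrt{n \sum_i r_{x,i}^2 - \left(\sum_i r_{x,i}\right)^2}\; \sqrt{n \sum_i r_{y,i}^2 - \left(\sum_i r_{y,i}\right)^2}}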

@ghost (Author) commented Nov 16, 2018

@gautamramk used this to calculate the ranks:

....
    rows = np.zeros(n_x, np.double)

    for y, y_ratings in iteritems(yr):
        for xi, ri in y_ratings:
            rows[xi] = ri
        ranks = rankdata(rows)
....

But actually the ranks should be determined by the columns of yr, because those represent e.g. the users and their ratings.

Therefore I think that the current code is not correct and does not calculate the Spearman correlation.

@ghost (Author) commented Nov 16, 2018

I have written the following program to show how I think the Spearman correlation would be calculated correctly for min_support == 0.

This code was also used to generate the second matrix.

import numpy as np
from scipy.stats import rankdata

# yr_global as a matrix representation
matrix = np.array([[3., 3., 3., 0., 0., 1., 1.5, 3.],
                   [4., 4., 4., 0., 0., 0., 0., 0.],
                   [0., 0., 5., 2., 3., 0., 0., 0.],
                   [0., 1., 4., 2., 3., 3., 3.5, 2.],
                   [0., 5., 1., 0., 0., 2., 2.5, 2.5]])

n = len(matrix)
dim = len(matrix[0])

result = np.zeros((dim, dim))

for x in range(0, dim):
    rank_x = rankdata(matrix[:, x])
    for y in range(x, dim):
        rank_y = rankdata(matrix[:, y])

        prod = np.dot(rank_x, rank_y)
        sum_rx = sum(rank_x)
        sum_ry = sum(rank_y)
        sum_rxs = sum(rank_x ** 2)
        sum_rys = sum(rank_y ** 2)

        num = n * prod - (sum_rx * sum_ry)  # numerator of the formula
        denom_l = np.sqrt(n * sum_rxs - sum_rx ** 2)
        denom_r = np.sqrt(n * sum_rys - sum_ry ** 2)

        result[x, y] = round(num / (denom_r * denom_l), 3)
        result[y, x] = result[x, y]

print(result)
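
As a sanity check (my addition, not part of the PR): scipy.stats.spearmanr treats each column of a 2-D input as a variable, so appending this to the script above should reproduce the second matrix directly:

from scipy.stats import spearmanr

# each column of `matrix` is a variable (axis=0 is the default),
# so rho is the 8 x 8 column-wise Spearman correlation matrix
rho, _ = spearmanr(matrix)
print(np.round(rho, 3))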

@ghost (Author) commented Nov 16, 2018

So I wonder if I'm totally wrong.

If so, where is my mistake?

If not, then I would completely rework the spearman method and model it on the example code.

Many thanks
Marc

@ghost (Author) commented Nov 17, 2018

Hey @NicolasHug,
I think I fixed the bug by introducing a ranking matrix.

The calculation in the previous version does not seem to be correct: the ranks are computed in the wrong direction.

The ranks must not be calculated over the rows of yr but over the columns.

Like e.g. here, using the matrix representation of yr_global:

               [[3., 3., 3., 0., 0., 1., 1.5, 3.],
                [4., 4., 4., 0., 0., 0., 0., 0.],
                [0., 0., 5., 2., 3., 0., 0., 0.],
                [0., 1., 4., 2., 3., 3., 3.5, 2.],
                [0., 5., 1., 0., 0., 2., 2.5, 2.5]]

The old version computed the ranks along a row, e.g. for the first row:

               [6.5, 6.5, 6.5, 1.5, 1.5, 3. , 4. , 6.5]

But the ranks have to be computed along a column. For the first column, the new version computes:

               [4., 5., 2., 2., 2.]
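
A quick check of both numbers with scipy (my addition, for illustration):

import numpy as np
from scipy.stats import rankdata

row = np.array([3., 3., 3., 0., 0., 1., 1.5, 3.])  # first row of the matrix
col = np.array([3., 4., 0., 0., 0.])               # first column of the matrix

print(rankdata(row))  # [6.5 6.5 6.5 1.5 1.5 3.  4.  6.5]  (old: row-wise)
print(rankdata(col))  # [4. 5. 2. 2. 2.]                   (new: column-wise)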

The given tests now run without changes.

I also adjusted the documentation.

Excuse the wall of text above ^^

But now it should be correct.

@NicolasHug (Owner)

Hey @MaFeg100 , sorry for the slow reply.

I haven't looked at it in great detail, but I think you are probably correct. Here is what I understand, let me know if you agree.

Spearman is like Pearson correlation, but instead of using the raw ratings we use the ranks.

Considering a rating matrix U x I (users are rows, items are columns), a user-user Spearman similarity would first compute the ranks of the ratings in a row-wise fashion (I think that's what you mean by "the columns of yr", but I'm not comfortable speaking about rows or columns for yr because it's not really a matrix), and then apply a Pearson sim.

Note however that this is a simplified view: in reality we want to compute the rankings between the common ratings only, not the whole rows. (Maybe this actually has no impact? I haven't thought about it much.)

For now maybe the most important thing is to run a quick benchmark (more thorough benchmarks can be done later on) to make sure the computation runs in a decent time compared to the other sims. I wouldn't want you to waste your time fixing this if in the end we won't merge the PR because the spearman sim is too slow :/

Then of course, we should fix the bugs if there are any.

Hope this does not add confusion ^^, thanks for looking into it anyway.

@ghost (Author) commented Nov 19, 2018

Hey @NicolasHug,

that's right. Spearman calculates the ranks within the user or item vectors. That's what I meant by the term columns.

My improvement transforms yr directly into ranks and then proceeds exactly like Pearson.

The old version calculated the ranks in the wrong direction.

The current version also considers only the common elements. This is handled exactly as in Pearson, via the freq matrix (see the sketch below).

So the only difference is that yr must be transformed into a form that contains the ranks instead of the ratings.
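
For context, here is a sketch (mine, based on the accumulator code quoted later in this thread; not the PR's exact code) of how the freq matrix enforces min_support when the similarities are assembled from the accumulators:

import numpy as np

# hypothetical assembly step, assuming the (n_x, n_x) accumulators
# prods, freq, sqi, sqj, si, sj filled by the preprocessing loop
def assemble_sim(prods, freq, sqi, sqj, si, sj, min_support):
    n_x = prods.shape[0]
    sim = np.zeros((n_x, n_x))
    for xi in range(n_x):
        sim[xi, xi] = 1
        for xj in range(xi + 1, n_x):
            if freq[xi, xj] < min_support:
                continue  # too few common ys: similarity stays 0
            n = freq[xi, xj]
            num = n * prods[xi, xj] - si[xi, xj] * sj[xi, xj]
            denom = np.sqrt((n * sqi[xi, xj] - si[xi, xj] ** 2) *
                            (n * sqj[xi, xj] - sj[xi, xj] ** 2))
            sim[xi, xj] = num / denom if denom else 0
            sim[xj, xi] = sim[xi, xj]
    return sim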

@ghost (Author) commented Nov 19, 2018

I'll now post cross-validations for the different similarity measures, including Spearman.

The interesting quantities are time expenditure and error rate.

@ghost (Author) commented Nov 19, 2018

Example 1: Spearman: item-based KNNBasic; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman',
               'user_based': False
}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0370  1.0417  1.0520  1.0359  1.0446  1.0422  0.0058  
MAE (testset)     0.8324  0.8359  0.8423  0.8289  0.8366  0.8352  0.0045  
Fit time          1.99    2.00    1.99    2.04    2.06    2.02    0.03    
Test time         2.72    2.81    2.88    2.77    2.83    2.80    0.05 

@ghost (Author) commented Nov 19, 2018

Example 2: Cosine: item-based KNNBasic; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'cosine',
               'user_based': False
}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0214  1.0236  1.0327  1.0243  1.0297  1.0263  0.0042  
MAE (testset)     0.8098  0.8109  0.8158  0.8072  0.8133  0.8114  0.0030  
Fit time          1.22    1.27    1.24    1.27    1.18    1.24    0.03    
Test time         2.83    2.77    2.82    2.85    2.80    2.81    0.03  

@ghost (Author) commented Nov 19, 2018

Example 3: Pearson: item-based KNNBasic; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'pearson',
               'user_based': False
}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0428  1.0423  1.0380  1.0405  1.0457  1.0419  0.0025  
MAE (testset)     0.8354  0.8369  0.8291  0.8330  0.8394  0.8347  0.0035  
Fit time          1.79    1.80    1.78    1.79    1.78    1.79    0.01    
Test time         2.73    2.80    2.75    2.81    2.88    2.80    0.05  

@ghost (Author) commented Nov 19, 2018

First conclusion:

Examples 1 to 3 show that Spearman is close to Pearson in MAE and RMSE for the item-based approach.

You can also see that Spearman takes longer compared to Cosine and Pearson. This is because yr is first transformed into a "rank representation".

Nevertheless, the time is not dramatically worse than with Cosine or Pearson.

@ghost (Author) commented Nov 19, 2018

Example 4: Spearman: user-based KNNBasic; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman',
               'user_based': True
}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0096  1.0119  1.0051  1.0242  1.0108  1.0123  0.0064  
MAE (testset)     0.8018  0.8034  0.7943  0.8143  0.8038  0.8035  0.0064  
Fit time          1.12    1.18    1.21    1.16    1.75    1.29    0.23    
Test time         2.43    2.46    2.45    3.11    2.65    2.62    0.26  

@ghost (Author) commented Nov 19, 2018

Example 5: Cosine: user-based KNNBasic; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'cosine',
               'user_based': True
}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0095  1.0166  1.0234  1.0145  1.0188  1.0165  0.0046  
MAE (testset)     0.7981  0.8035  0.8125  0.8025  0.8024  0.8038  0.0047  
Fit time          0.70    0.78    0.71    0.68    0.68    0.71    0.04    
Test time         2.65    2.57    2.48    2.43    2.52    2.53    0.07    

@ghost (Author) commented Nov 19, 2018

Example 6: Pearson: user-based KNNBasic; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'pearson',
               'user_based': True
}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm KNNBasic on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0054  1.0088  1.0122  1.0150  1.0165  1.0116  0.0041  
MAE (testset)     0.8002  0.8006  0.8023  0.8045  0.8068  0.8029  0.0025  
Fit time          0.96    0.99    0.98    0.97    0.99    0.98    0.01    
Test time         2.45    2.45    2.57    2.52    2.53    2.50    0.04     

@ghost (Author) commented Nov 19, 2018

Second conclusion

Here too, you can see that Spearman is close to Pearson's RMSE and MAE.

But you can also see that Spearman needs a bit more time in comparison.

@ghost (Author) commented Nov 19, 2018

Conclusion of the quick benchmark

I think Spearman may well be considered.

The method differs only slightly from the Pearson correlation in implementation effort.

It also shows similar RMSE and MAE values for both the item-based and the user-based approach.

In addition, the user-based approach even runs faster than the item-based one.

Nevertheless, Spearman lags behind the other measures in fit time.
This can be explained by the fact that the ranks are not directly available, so yr must first be converted into a "rank representation".

It could make sense to work on the rank computation, since it accounts for a large part of the time.

I hope my contribution is understandable ^^

@NicolasHug (Owner)

Ok, thanks a lot for the benchmark. I agree that the computation time is quite reasonable, comparatively.

I'll try to review it in more detail soon.

Instead of converting yr, would it be possible to avoid it by also passing xr? That is, instead of passing only ur or only ir, we could maybe just pass both?

@ghost (Author) commented Nov 19, 2018

I've considered that, too.

I think you could avoid the time for building the matrix by additionally passing xr.

Nevertheless, I don't think it would save much more time.
If xr were passed, the code would be similar to the old one from @gautamramk.

To check this, I compared the two versions.

As an example:
Old Spearman; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman_old',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)

This resulted in:


                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0110  1.0130  1.0068  1.0111  1.0064  1.0097  0.0026  
MAE (testset)     0.8019  0.8037  0.8007  0.8047  0.7969  0.8016  0.0027  
Fit time          1.32    1.49    1.42    1.31    1.39    1.39    0.07    
Test time         2.77    2.71    2.78    2.70    2.73    2.74    0.03    

New Spearman; MovieLens100k; 5-fold cross validation

data = Dataset.load_builtin('ml-100k')
sim_options = {'name': 'spearman',
               'user_based': True}
algo = KNNBasic(sim_options=sim_options)
cross_validate(algo, data, verbose=True)

This resulted in:

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0156  1.0066  1.0147  1.0116  1.0132  1.0123  0.0032  
MAE (testset)     0.8059  0.7994  0.8069  0.8023  0.8049  0.8039  0.0027  
Fit time          1.40    1.47    1.26    1.27    1.21    1.32    0.10    
Test time         2.62    2.74    2.94    2.80    2.68    2.75    0.11 

You can see that the old version and the new version have comparable times.

The error rates are also similar.
Nevertheless, the new version passes the existing tests, whereas the old version does not.

Furthermore, I think that the additional parameter xr will not change the timing much.
Most of the extra time is caused by calculating the ranks.

So I measured the times of the individual code sections in the preprocessing phase.

(1) Old Spearman Preprocess

The ranks are also calculated in this section.

....
    start = time.time()
    for y, y_ratings in iteritems(yr):
        for xi, ri in y_ratings:
            rows[xi] = ri
        ranks = rankdata(rows)
        for xi, _ in y_ratings:
            for xj, _ in y_ratings:
                prods[xi, xj] += ranks[xi] * ranks[xj]
                freq[xi, xj] += 1
                sqi[xi, xj] += ranks[xi]**2
                sqj[xi, xj] += ranks[xj]**2
                si[xi, xj] += ranks[xi]
                sj[xi, xj] += ranks[xj]
    end = time.time()
    t1 = end - start
    print(t1)
....

This section takes about 0.9549877643585205 s.

(2) New Spearman Preprocess: Building the Matrix

....
    start = time.time()
    # turn yr into a matrix
    for y, y_ratings in iteritems(yr):
        for x_i, r_i in y_ratings:
            matrix[y, x_i] = r_i
    end = time.time()
    t1 = end-start
    print(t1)
....

This section takes about 0.030748605728149414 s.

(3) New Spearman Preprocess: Building the Rank Matrix

....
    start = time.time()
    # turn the ratings matrix into a matrix which contains
    # the ranks of the elements in yr
    for x_i in range(n_x):
        matrix[:, x_i] = rankdata(matrix[:, x_i])
    end = time.time()
    t2 = end - start
    print(t2)
....

This section takes about 0.13120055198669434 s.

(4) New Spearman Preprocess

....
    start = time.time()
    for y, y_ratings in iteritems(yr):
        for xi, ri in y_ratings:
            # use the ranking matrix to get the elements row by row
            ranks[xi] = matrix[y, xi]
        for xi, _ in y_ratings:
            for xj, _ in y_ratings:
                prods[xi, xj] += ranks[xi] * ranks[xj]
                freq[xi, xj] += 1
                sqi[xi, xj] += ranks[xi]**2
                sqj[xi, xj] += ranks[xj]**2
                si[xi, xj] += ranks[xi]
                sj[xi, xj] += ranks[xj]
    end = time.time()
    t3 = end - start
    print(t3)
....

This section takes about 0.5745222568511963 s.

So I think that introducing xr would not change much about the fact that the ranks have to be calculated.

You can also see that the additional conversion of yr doesn't take much time (see (2)).
In addition, no more changes than necessary would have to be made.

@ghost (Author) commented Nov 19, 2018

I hope the short analysis is helpful.

I would be happy to get a review of the code.

Thanks in advance! ^^

@NicolasHug (Owner) left a comment

Thanks a lot for the feedback. I made a more thorough review. This looks good in general but I have 2 important concerns:

  • I'd like to avoid building the matrix
  • The ranks seem to be computed on all ratings instead of computing them on the common ratings.

Sorry for the delay!

@@ -33,7 +33,7 @@ def test_cosine_sim():

 sim = sims.cosine(n_x, yr, min_support=1)

-# check symetry and bounds (as ratings are > 0, cosine sim must be >= 0)
+# check symmetry and bounds (as ratings are > 0, cosine sim must be >= 0)
@NicolasHug (Owner):

lol thanks for correcting the typos

@ghost (Author):

Always leave the place cleaner than you found it. ^^

(or items).

Only **common** users (or items) are taken into account. The Spearman
correlation coefficient can be seen as a non parametric Pearson's
@NicolasHug (Owner):

What do you mean non-parametric?

I'd like to add something like "The spearman correlation coefficient is equivalent to Pearson correlation coefficient where the ratings are replaced by their rankings."

@ghost (Author):

Okay I've improved on that.


.. math ::
\\text{spearman_sim}(u, v) = \\frac{ \\sum\\limits_{i \\in I_{uv}} (rank(r_{ui}) - \\overline{rank(u)}) \\cdot (rank(r_{vi}) - \\overline{rank(v)})} {\\sqrt{\\sum\\limits_{i \\in I_{uv}} (rank(r_{ui}) - \\overline{rank(u)})^2} \\cdot \\sqrt{\\sum\\limits_{i \\in I_{uv}} (rank(r_{vi}) - \\overline{rank(v)})^2}}
@NicolasHug (Owner):

Please avoid lines longer than 79 characters

@ghost (Author):

This as well.

-1).

For details on Spearman coefficient, see in chapter 4, page 126 of: `Recommender Systems Handbook
<http://www.cs.ubbcluj.ro/~gabis/DocDiplome/SistemeDeRecomandare/Recommender_systems_handbook.pdf>`__.
@NicolasHug (Owner):

Don't add the link, I doubt it's very legal ;)

@ghost (Author):

Is the change better?

sj = np.zeros((n_x, n_x), np.double)
sim = np.zeros((n_x, n_x), np.double)
ranks = np.zeros(n_x, np.double)
matrix = np.zeros((len(yr), n_x), np.double)
@NicolasHug (Owner):

This is going to be huge (n_users * n_items).

Passing xr as well would avoid the need to create matrix right? If that's the case then we should do it.

for y, y_ratings in iteritems(yr):
    for xi, ri in y_ratings:
        # use the ranking matrix to get the elements row by row
        ranks[xi] = matrix[y, xi]
@NicolasHug (Owner):

I think there might be a problem here:

ranks[xi] contains the ranks for all the ys, right?

But when we compare 2 xs, we only want to do that on the basis of their common ys. In the subsequent code you will compare them on the basis of all the ys.

Say we have 5 items and 2 users

ratings:
user 1: 1, 2, X, 4, 5
user 2: X, X, 1, 5, 2

The ranks are:

ranks:
user 1: 1, 2, X, 4, 5
user 2: X, X, 1, 3, 2

But on the common items the ratings are

ratings:
user 1: X, X, X, 4, 5
user 2: X, X, X, 5, 2

and the ranks are then

ranks:
user 1: X, X, X, 1, 2
user 2: X, X, X, 2, 1

So your code will consider the ranks

ranks:
user 1: 4, 5
user 2: 3, 2

while it should actually be considering

ranks:
user 1: 1, 2
user 2: 2, 1

Maybe this has no impact because the relative order of each rank will stay the same, and it has no effect on Pearson? I don't know what would happen if there are ties though...

@ghost (Author) commented Dec 1, 2018

Hey,

thanks for the review.

I will deal with it in more detail tomorrow or next week.

It's interesting to see how the ranking behaves. I'm sure I'll come up with something there.

I don't think ties are a problem. They are taken into account when the rankings are created, because tied values get averaged ranks.

But I will take a closer look at the concerns again.

I have also taken a closer look at the Spearman rank correlation.

I used the Webscope R3 dataset from Yahoo.
The dataset contains 311,704 ratings from 15,400 users for 1,000 songs.

The dataset can be requested here:
https://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=3

For the comparison of the Spearman rank correlation I used a 5-fold cross validation.

I look at the fit time in comparison to the other similarity measures.


@ghost (Author) commented Dec 1, 2018

[figure: kNN fit time vs. neighbourhood size]

Here you can compare the time spent against the size of the neighbourhood.

@ghost (Author) commented Dec 2, 2018

[figures: spearman001, spearman002]

@ghost (Author) commented Dec 2, 2018

My code works according to this formula, which I put together myself.

[figure: screenshot of the formula]

@ghost (Author) commented Dec 2, 2018

After a few initial considerations, I think it makes no difference how the ranks are calculated.

Since the ranks are normalized, it should make no difference whether they are calculated over the entire sample or only over a subsample.

But I think this relationship can be examined more closely.
Are you interested in a formal proof, or in a counterexample?

If the assumption is true, xr can actually be used instead of yr.

However, if it turns out to be wrong, it must be investigated whether the additional effort of computing the common items and the corresponding ranks is actually faster than the current solution.

@NicolasHug (Owner)

@MaFeg100 let me know if I made an error in the code but I think this snippet yields an example where computing the ranks on the common items provides different results than computing the ranks on the whole vectors.

import numpy as np
from numpy.testing import assert_almost_equal
from scipy.stats import rankdata

def spearman(u, v, ranks_on_common):
    # u = vectors of ratings of user u
    # v = vectors of ratings of user v
    # ranks_on_common: whether to compute the ranks on the common items only,
    # or on the whole ratings u and v (which is the PR's version)

    assert len(u) == len(v)

    common_items = [i for (i, (r_ui, r_vi)) in enumerate(zip(u, v))
                    if r_ui and r_vi]
    if not common_items:
        return 0

    print('ranks on common:', ranks_on_common)
    print('ratings u:', u)
    print('ratings v:', v)
    print('common items:', common_items)

    if ranks_on_common:
        # compute ranks on  common items
        u_commons = [u[i] for i in common_items]
        v_commons = [v[i] for i in common_items]
        rank_u = rankdata(u_commons)
        rank_v = rankdata(v_commons)
    else:
        # compute ranks on whole vectors (treating missing ratings as 0),
        # and then only keep the ranks for the common items
        rank_u = rankdata(u)
        rank_v = rankdata(v)
        rank_u = [rank_u[i] for i in common_items]
        rank_v = [rank_v[i] for i in common_items]

    print('ranks u:', rank_u)
    print('ranks v:', rank_v)

    assert len(rank_u) == len(rank_v) == len(common_items)

    # Then compute pearson sim as usual, on common items
    mu_u = np.mean(rank_u)
    mu_v = np.mean(rank_v)

    num = sum((r_ui - mu_u) * (r_uv - mu_v)
              for (r_ui, r_uv) in zip(rank_u, rank_v))
    a = sum((r_ui - mu_u)**2 for r_ui in rank_u)
    b = sum((r_vi - mu_v)**2 for r_vi in rank_v)
    denom = np.sqrt(a * b)

    if denom == 0:
        return 0

    return num / denom


rng = np.random.RandomState(0)
size = 4
for _ in range(1000):
    # generate random ratings vectors between [0, 5]
    u = rng.randint(0, 6, size)
    v = rng.randint(0, 6, size)

    a = spearman(u, v, ranks_on_common=True)
    print('-' * 5)
    b = spearman(u, v, ranks_on_common=False)
    print(a , b)
    print('-' * 10)
    assert_almost_equal(a, b)
...
----------
ranks on common: True
ratings u: [4 5 0 4]
ratings v: [3 5 3 4]
common items: [0, 1, 3]
ranks u: [1.5 3.  1.5]
ranks v: [1. 3. 2.]
-----
ranks on common: False
ratings u: [4 5 0 4]
ratings v: [3 5 3 4]
common items: [0, 1, 3]
ranks u: [2.5, 4.0, 2.5]
ranks v: [1.5, 4.0, 3.0]
0.8660254037844387 0.8029550685469661
----------
Traceback (most recent call last):
  File "lol.py", line 70, in <module>
    assert_almost_equal(a, b)
  File "/home/nico/.virtualenvs/bordel_36/lib/python3.6/site-packages/numpy/testing/nose_tools/utils.py", line 581, in assert_almost_equal
    raise AssertionError(_build_err_msg())
AssertionError: 
Arrays are not almost equal to 7 decimals
 ACTUAL: 0.8660254037844387
 DESIRED: 0.8029550685469661

> If the assumption is true, xr can actually be used instead of yr.

I think that whether we should compute the ranks on the common items or on the whole ratings is totally independent from whether we should pass xr or just yr. If passing xr allows us not to compute the matrix of ratings, then xr should be passed, because even if computing this matrix can be fast, it uses a lot of memory.

@ghost (Author) commented Dec 10, 2018

Hey @NicolasHug, you're right.

The calculation errors can be fixed by setting the ratings that are not shared by the users to 0.

I have changed your code so that both calculations are correct.

import numpy as np
from numpy.testing import assert_almost_equal
from scipy.stats import rankdata

def spearman(u, v, ranks_on_common):
    # u = vectors of ratings of user u
    # v = vectors of ratings of user v
    # ranks_on_common: whether to compute the ranks on the common items only,
    # or on the whole ratings u and v (which is the PR's version)

    assert len(u) == len(v)

    common_items = [i for (i, (r_ui, r_vi)) in enumerate(zip(u, v))
                    if r_ui and r_vi]
    if not common_items:
        return 0

    print('ranks on common:', ranks_on_common)
    print('ratings u:', u)
    print('ratings v:', v)
    print('common items:', common_items)

    if ranks_on_common:
        # compute ranks on  common items
        u_commons = [u[i] for i in common_items]
        v_commons = [v[i] for i in common_items]

        rank_u = rankdata(u_commons)
        rank_v = rankdata(v_commons)

    else:
        # compute ranks on whole vectors (treating missing ratings as 0),
        # and then only keep the ranks for the common items

        u_commons = [u[i] if i in common_items else 0 for i in range(len(u))]
        v_commons = [v[i] if i in common_items else 0 for i in range(len(v))]

        rank_u = rankdata(u_commons)
        rank_v = rankdata(v_commons)
        rank_u = [rank_u[i] for i in common_items]
        rank_v = [rank_v[i] for i in common_items]

    print('ranks u:', rank_u)
    print('ranks v:', rank_v)

    assert len(rank_u) == len(rank_v) == len(common_items)

    # Then compute pearson sim as usual, on common items
    mu_u = np.mean(rank_u)
    mu_v = np.mean(rank_v)

    num = sum((r_ui - mu_u) * (r_uv - mu_v)
              for (r_ui, r_uv) in zip(rank_u, rank_v))
    a = sum((r_ui - mu_u)**2 for r_ui in rank_u)
    b = sum((r_vi - mu_v)**2 for r_vi in rank_v)
    denom = np.sqrt(a * b)

    if denom == 0:
        return 0

    return num / denom


rng = np.random.RandomState(0)
size = 4
for _ in range(1000):
    # generate random ratings vectors between [0, 5]
    u = rng.randint(0, 6, size)
    v = rng.randint(0, 6, size)

    a = spearman(u, v, ranks_on_common=True)
    print('-' * 5)
    b = spearman(u, v, ranks_on_common=False)
    print(a , b)
    print('-' * 10)
    assert_almost_equal(a, b)

The same procedure can be used on matrix, to calculate the ranks over the common elements only.

Is that right?

@NicolasHug (Owner)

Hmm, I guess that could work, but:

  • Imputing with 0 only works if 0 is not in the rating scale. E.g. for the Jester dataset, which uses (-10, 10), this would not be correct. We need to impute with something that is not in the rating scale, and I don't know how to do that in general (computing a min or a max could do it, but there might be something better? See the sketch below.)
  • I'm afraid the bottleneck is now computing the common items.
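
A minimal sketch of the min-based imputation (my illustration, not part of the PR): any sentinel strictly below the vector's real minimum gives all missing entries the lowest, tied ranks, so the ranks of the common entries are only shifted by a constant, which cancels in the mean-centered Pearson step.

import numpy as np
from scipy.stats import rankdata

# hypothetical helper: rank a rating vector whose present entries are
# flagged by a boolean mask, imputing missing ones with a sentinel
# that is guaranteed to lie below the rating scale
def rank_with_missing(ratings, present):
    sentinel = ratings[present].min() - 1.0
    filled = np.where(present, ratings, sentinel)
    return rankdata(filled)

# Jester-style scale (-10, 10): imputing with 0 would be wrong here,
# but the min-based sentinel still works
u = np.array([-9.5, 3.0, 0.0, 7.25])
present = np.array([True, False, True, True])
print(rank_with_missing(u, present))  # -> [2. 1. 3. 4.]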
