
Tests are broken? #5

Open
libeanim opened this issue Mar 3, 2016 · 4 comments

libeanim commented Mar 3, 2016

Hi,
I set up a new virtual environment on my Linux Mint machine and installed numpy, scipy, matplotlib, Cython and plfit via pip.

> pip freeze
cycler==0.10.0
Cython==0.23.4 
matplotlib==1.5.1
numpy==1.10.4
numpydoc==0.6.0
plfit==1.0.2
pyparsing==2.1.0
python-dateutil==2.5.0
pytz==2015.7
scipy==0.17.0
six==1.10.0

My python version is:

> python --version
Python 2.7.6

If I run the 'clauset2009_tests.py' script after everything has installed properly, it raises the following precision error:

> python clauset2009_tests.py 
/home/libeanim/Desktop/WORK/plfit/env/local/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
Using DISCRETE fitter
/home/libeanim/Desktop/WORK/plfit/env/local/lib/python2.7/site-packages/plfit/plfit.py:830: RuntimeWarning: invalid value encountered in log
  alpha = 1.0 + float(nn) * ( sum(log(xx/(xmin-0.5))) )**-1
/home/libeanim/Desktop/WORK/plfit/env/local/lib/python2.7/site-packages/plfit/plfit.py:830: RuntimeWarning: divide by zero encountered in divide
  alpha = 1.0 + float(nn) * ( sum(log(xx/(xmin-0.5))) )**-1
alpha = 2.325891   xmin = 46.449000   ksD = 0.015483   L = -3556.028920   (n<x) = 18776  (n>=x) = 671
Using DISCRETE fitter
alpha = 2.325891   xmin = 46.449000   ksD = 0.015483   L = -3556.028920   (n<x) = 18776  (n>=x) = 671
Using DISCRETE fitter
alpha = 2.325891   xmin = 46.449000   ksD = 0.015483   L = -3556.028920   (n<x) = 18776  (n>=x) = 671
Cities (Clauset): n:     19447 mean,std,max:     9.00,   77.83, 8009.00 xmin:    52.46 alpha:     2.37 (    0.08) ntail:        580 p:  0.76
Cities (me)     : n:     19447 mean,std,max:     9.00,   77.82, 8008.65 xmin:    46.45 alpha:     2.33 (    0.05) ntail:        671 p:  1.00
Traceback (most recent call last):
  File "clauset2009_tests.py", line 48, in <module>
    np.testing.assert_almost_equal(ppp._xmin, 52.46, 2)
  File "/home/libeanim/Desktop/WORK/plfit/env/local/lib/python2.7/site-packages/numpy/testing/utils.py", line 513, in assert_almost_equal
    raise AssertionError(_build_err_msg())
AssertionError: 
Arrays are not almost equal to 2 decimals
 ACTUAL: 46.448999999999998
 DESIRED: 52.46
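For what it's worth, the two RuntimeWarnings from plfit.py:830 look consistent with candidate xmin values at or below 0.5 being tried during the xmin scan (that's an assumption about the scan, not something verified in the source): with xmin < 0.5 the term (xmin - 0.5) is negative, so the log goes NaN, and with xmin == 0.5 the division blows up. A minimal reproduction:

```python
import numpy as np

xx = np.array([1.0, 2.0, 5.0])  # data above the candidate xmin

# hypothetical candidate xmin below 0.5: (xmin - 0.5) is negative,
# so log(xx / (xmin - 0.5)) is NaN -> "invalid value encountered in log"
xmin = 0.25
with np.errstate(invalid="ignore"):
    vals = np.log(xx / (xmin - 0.5))
print(np.isnan(vals).all())  # True

# candidate xmin of exactly 0.5: division by zero -> the second warning
xmin = 0.5
with np.errstate(divide="ignore"):
    print(np.isinf(xx / (xmin - 0.5)).all())  # True
```

Neither warning by itself explains the wrong xmin, but it suggests the scan range starts below where the discrete approximation is valid.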

I also tried the 'consistency_test.py' script, with a similar result.

> python consistency_test.py 
/home/libeanim/Desktop/WORK/plfit/env/local/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
  warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')
/home/libeanim/Desktop/WORK/plfit/env/local/lib/python2.7/site-packages/plfit/plfit.py:113: RuntimeWarning: divide by zero encountered in double_scalars
  a = float(n) / sum(log(x/xmin))
PYTHON plfit executed in 0.327898 seconds
xmin: 0.584303 n(>xmin): 561 alpha: 2.39867 +/- 0.0590518   Log-Likelihood: -472.424   ks: 0.0180329 p(ks): 0.993215
CYTHON plfit executed in 0.202372 seconds
PYTHON plfit executed in 0.202398 seconds
cython cplfit did not load
xmin: 0.60835 n(>xmin): 538 alpha: 2.41978 +/- 0.0612111   Log-Likelihood: -460.973   ks: 0.025935 p(ks): 0.862215
PYTHON plfit executed in 0.329090 seconds
fortran fplfit did not load
xmin: 0.584303 n(>xmin): 561 alpha: 2.39867 +/- 0.0590518   Log-Likelihood: -472.424   ks: 0.0180329 p(ks): 0.993215
Traceback (most recent call last):
  File "consistency_test.py", line 21, in <module>
    np.testing.assert_almost_equal(aa._alpha, bb._alpha, 5)
  File "/home/libeanim/Desktop/WORK/plfit/env/local/lib/python2.7/site-packages/numpy/testing/utils.py", line 513, in assert_almost_equal
    raise AssertionError(_build_err_msg())
AssertionError: 
Arrays are not almost equal to 5 decimals
 ACTUAL: 2.3986678289876329
 DESIRED: 2.4197805296940231
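As a sanity check on what alpha the continuous fitter should produce, the standard continuous MLE (the form given in Clauset, Shalizi & Newman 2009; plfit's internals may of course differ) can be tested directly on synthetic Pareto data, where the true exponent is known:

```python
import numpy as np

rng = np.random.default_rng(42)
alpha_true = 2.5
xmin = 1.0
n = 5000

# inverse-CDF sampling from a continuous power law p(x) ~ x^-alpha, x >= xmin
u = rng.random(n)
x = xmin * (1.0 - u) ** (-1.0 / (alpha_true - 1.0))

# continuous maximum-likelihood estimator for alpha at fixed xmin
alpha_hat = 1.0 + n / np.sum(np.log(x / xmin))
print(alpha_hat)  # typically within a few percent of 2.5 for n = 5000
```

Since both code paths above presumably implement this same estimator, a difference in the 5th decimal at identical xmin would be surprising; the two runs picking different xmin values (0.584303 vs. 0.60835) seems the more likely culprit.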

I'm not sure how serious this is, but if these tests passed before, there is probably a bug in the code now. Or am I doing something wrong?

@keflavich (Owner) commented

I'm afraid bugs have somehow crept in and I have not been able to track them down. I think there is an index-off-by-one error somewhere, but I haven't had the time to find it and my last attempt didn't turn up anything. If you're interested in digging through the source at all to try to find the bug, help would be very welcome!

@libeanim (Author) commented

I'm sorry for the late answer. Unfortunately I have little time as well and I'm not too familiar with this topic, but if I can find the time I will submit a patch.

Additionally, I encountered another issue which is probably related:

Since I analyse neural avalanche distributions, I generated a fake avalanche distribution that follows a perfect power law with an exponent of -1.5.

import numpy as np

# generate avalanche sizes 1 to 99
x = np.arange(1, 100)
# occurrence counts follow a perfect power law with exponent -1.5
y = x ** -1.5 * 10000
# truncate to int because occurrence counts are discrete
y = np.int_(y)

# expand into raw samples, as the fitter expects
out = []
for sz, num in zip(x, y):
    out += (np.ones(num) * sz).tolist()
out = np.array(out)
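As a rough cross-check on the exponent, one can apply the discrete MLE approximation that appears verbatim in the warning traceback above (plfit.py:830, alpha = 1 + n / Σ ln(x_i / (xmin − 1/2))) to this data by hand; note this approximation is known to be inaccurate for small xmin, so the number is only indicative, and plfit's discrete fitter may use a more exact method internally:

```python
import numpy as np

# regenerate the fake avalanche data from above, in condensed form
x = np.arange(1, 100)
y = np.int_(x ** -1.5 * 10000)
out = np.repeat(x, y).astype(float)

# discrete MLE approximation at xmin = 1, as in plfit.py line 830
xmin = 1.0
n = len(out)
alpha_hat = 1.0 + n / np.sum(np.log(out / (xmin - 0.5)))
print(n)  # 24072, matching n(>xmin) in the verbose output below
print(alpha_hat)
```

If this hand-computed alpha disagrees with both the true exponent (1.5) and plfit's reported value, that would point at the small-xmin bias of the approximation rather than a bug in the data generation.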

After generating the data I started the fitting with a given xmin of 1.

myplfit = plfit.plfit(out, xmin=1, verbose=True)
# stdout:
# Using DISCRETE fitter because there are repeated values.
# The lowest value included in the power-law fit,  xmin: 1
# The number of values above xmin,  n(>xmin): 24072
# The derived power-law alpha (p(x)~x^-alpha) with MLE-derived error,  alpha: 1.9288 +/- 0.00598638
# The log of the Likelihood (the maximized parameter; you minimized the negative log likelihood),  # Log-Likelihood: -51767.5
# The KS-test statistic between the best-fit power-law and the data,  ks: 0.415379  occurs with probability   p(ks): 0

Next I plotted the PDF as well as the values of the perfect power-law x and y generated in the beginning:

plt.figure(dpi=120)
plt.plot(x, y, 'g.')
myplfit.plotpdf()
plt.show()

[screenshot: plotpdf() with logarithmic binning (dolog=True)]

The green dots are the generated power law, which looks like a perfect line. The raw data (black histogram) appears distorted at the tail, which might be a result of the log binning. So I did the same with linear binning:

plt.figure(dpi=120)
plt.plot(x, y, 'g.')
myplfit.plotpdf(dolog=False)
plt.show()

[screenshot: plotpdf(dolog=False), linear binning]

Here we can see that the raw data (black histogram) is now parallel to the green line (even though they don't share the same values). However, the red line fitted by the algorithm appears to fit the raw data much better in the first plot, with log binning. As far as I remember, the fit should be independent of the binning type, since binning only affects the visualisation, and should describe both plots equally well?
So is this a bug, or is there an error in the reasoning behind the test-data generation above?

@keflavich (Owner) commented

@libeanim I haven't had a chance to look at this until today. I'm not entirely sure what is going on, but I suspect either a logical error in the construction of the data set or a bug in the discrete fitter. I've never seen a discrepancy of this magnitude; the errors I've seen previously were in the few-to-tens-of-percent range.


libeanim commented Jun 6, 2016

Yeah, this discrepancy is strange.
I don't know if you noticed, but I just realised that the slope of the fitted red line also changes between the two pictures. At first I only focused on the difference between the black bars and the red line and wasn't aware that the slope changes too; to my mind that doesn't make any sense.
(It becomes obvious when you compare the red and green lines in both pictures, since the green line's slope doesn't change.)
