
[Fix] complexity_coarsegraining(): fix method #892

Merged: 12 commits into dev on Sep 22, 2023

Conversation

@DominiqueMakowski (Member)

I ran into a problem when using the modified multiscale entropy method (MMSEn) in the package to measure the complexity of pink (1/f) noise as a benchmark test: the MMSEn of pink noise should be close to a near-horizontal line across scales, not a decaying curve.

I attach my code (Modified_MSE.ipynb) to help examine the problem. It has two parts: part 1 contains the correct MMSEn method and results based on the original paper (Wu et al., 2013), and part 2 shows the results obtained with MMSEn from NeuroKit2. I think the problem is that, with the modified (rolling-average) coarse-graining method, the entropy at each scale factor s should be estimated by applying sample entropy (SampEn) with a time delay equal to the current scale s, rather than the usual delay of 1.
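To illustrate that suggestion, here is a minimal sketch (the mmse_scale helper is hypothetical, not NeuroKit2's API): coarse-grain with a rolling average, then apply SampEn with a delay equal to the scale.

import numpy as np
import neurokit2 as nk

def mmse_scale(signal, scale, dimension=2, tolerance=None):
    # Hypothetical helper sketching the suggestion above (not NeuroKit2 code)
    signal = np.asarray(signal, dtype=float)
    if tolerance is None:
        tolerance = 0.15 * np.std(signal)   # tolerance taken from the original signal
    # Rolling (moving) average of window `scale`
    coarse = np.convolve(signal, np.ones(scale) / scale, mode="valid")
    # Sample entropy of the coarse-grained series with delay = scale
    sampen, _ = nk.entropy_sample(coarse, dimension=dimension, delay=scale, tolerance=tolerance)
    return sampen

# An MMSEn curve across scales 1..20 would then be:
# mmse_curve = [mmse_scale(signal, s) for s in range(1, 21)]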

Benchmark

Code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator)
import pyplnoise

import neurokit2 as nk


# plot each column
plt.figure(figsize = (12, 8), dpi = 80)

ax = plt.subplot(111)

#
#   parameter settings
#


set_scale = 20           # maximum scale for multiscale entropy test (from 1 to set_scale)

max_loops = 30           # maximum times for simulation

data_length = 1500       # Noise length

######################################################

start = 1

end = set_scale

space = set_scale

plt.xticks(np.linspace(start, end, space), rotation = 0)

#plt.yticks(np.linspace(-0.25, 3.0, 14), rotation = 0) 

#
# Setting the title and the axis label
#
plt.title('Modified Multiscale Entropy Curves of Noise', fontsize = 14, weight = 'bold', y = 1.01)

x = np.arange(1, set_scale + 1 , 1)

if set_scale > 40:
    
    majors = [1, np.arange(5, set_scale + 1, 5)]
    
    ax.xaxis.set_major_locator(MultipleLocator(5))
    
    ax.xaxis.set_minor_locator(MultipleLocator(1))
    
    ax.set_xlim(0, set_scale + 1)


    
loops = max_loops

y = np.arange(1, loops + 1 , 1)

idx = pd.Series(x)

null_data_1 = np.zeros((len(idx), len(y)), dtype = np.float64)
null_data_2 = np.zeros((len(idx), len(y)), dtype = np.float64)

mmse_white = pd.DataFrame(null_data_1, columns = y)
mmse_white.set_index(idx, inplace = True)

mmse_pink = pd.DataFrame(null_data_2, columns = y)
mmse_pink.set_index(idx, inplace = True)



for i in range(0, loops, 1):
    
    print(' .', end = '')
    #print('Current loop = %d' % (i+1))
    
    fs = 100.
    npts = data_length
    
    #
    # Generate white noise
    #
    whnoise = pyplnoise.WhiteNoise(fs, psd = 1.0)
    white_noise = whnoise.get_series(npts)
    
    
    #
    # Generate pink noise
    #
    pknoise = pyplnoise.PinkNoise(fs, 1e-2, 50.)
    pink_noise = pknoise.get_series(npts)

    
    #
    # Test MMSE method for both white noise and pink noise 
    #
    
    #
    #                          WHITE NOISE
    #
    w_tol = 0.15 * np.std(white_noise)

    mse_w, info_w = nk.entropy_multiscale(signal = white_noise, 
                                          scale = set_scale,
                                          dimension = 2,
                                          tolerance = w_tol,
                                          method = 'MMSEn',
                                          fuzzy = False,
                                          show = False )
    
    mmse_white.iloc[:, i] = info_w['Value']

    #
    #                           PINK NOISE
    #
    p_tol = 0.15 * np.std(pink_noise)

    mse_pink, info_p = nk.entropy_multiscale(signal = pink_noise, 
                                              scale = set_scale,
                                              dimension = 2,
                                              tolerance = p_tol,
                                              method = 'MMSEn',
                                              fuzzy = False,
                                              show = False )
    
    mmse_pink.iloc[:, i] = info_p['Value']


    
#
# Calculate mean and standard deviation of points, grouped by rows
#
w_mean = mmse_white.mean(axis = 1)
w_std = mmse_white.std(axis = 1)

p_mean = mmse_pink.mean(axis = 1)
p_std = mmse_pink.std(axis = 1) 

#
# Calculate 95% confidence interval
#
w_conf_int = 1.96 * w_std / np.sqrt(len(y))
p_conf_int = 1.96 * p_std / np.sqrt(len(y))


#
# Plot result diagrams 
#
plt.errorbar(x, 
             w_mean, 
             yerr = w_conf_int, 
             fmt = '-o',  
             capsize = 3, 
             color = 'b', 
             label = 'White Noise')

plt.errorbar(x, 
             p_mean, 
             yerr = p_conf_int, 
             fmt = '-s',  
             capsize = 3, 
             color = 'r', 
             label = 'Pink Noise')

plt.xlabel('Scale Factor', 
           fontsize = 14,
           weight = 'bold',
           labelpad = 10)

plt.ylabel('Sample Entropy', 
           fontsize = 14, 
           weight = 'bold', 
           labelpad = 10)

plt.legend(loc = 1, 
           fontsize = 14, 
           bbox_to_anchor = (0.9, 0.95))

plt.savefig('MMSE_S%s_L%s_WP.png' % (str(data_length), str(max_loops)), dpi = 200)

plt.show()
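(Aside: if pyplnoise is unavailable, a 1/f series can be approximated with a small FFT-based sketch; the pink_noise helper below is hypothetical and not part of the benchmark above.)

import numpy as np

def pink_noise(n, rng=None):
    # Shape white Gaussian noise so that its power spectrum falls off as 1/f
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    freqs[0] = freqs[1]            # avoid division by zero at the DC bin
    spectrum /= np.sqrt(freqs)     # amplitude ~ 1/sqrt(f), so power ~ 1/f
    pink = np.fft.irfft(spectrum, n)
    return pink / np.std(pink)     # normalize to unit variance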

Should look like this:

(image: expected MMSEn benchmark, with pink noise near-horizontal across scales)

But looks like this:

(image: NeuroKit2 MMSEn result, with pink noise decaying across scales)

(Several comments were marked as outdated or resolved and are hidden.)

Review comment on the updated rolling-average coarse-graining line:

coarse = complexity_embedding(signal, dimension=scale, delay=1).mean(axis=1)
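As a quick sanity check (a sketch, not part of the diff): this embedding-then-mean trick is the same as a plain moving average of window `scale`.

import numpy as np
import neurokit2 as nk

x = np.random.default_rng(1).standard_normal(100)
scale = 5
emb_mean = nk.complexity_embedding(x, dimension=scale, delay=1).mean(axis=1)
mov_avg = np.convolve(x, np.ones(scale) / scale, mode="valid")
assert np.allclose(emb_mean, mov_avg)   # both give the rolling average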
@DominiqueMakowski (Member, Author):

I updated the coarse-graining procedure, but it doesn't seem to have entirely solved the issue:

(image: MMSEn benchmark after the coarse-graining update)

@hsyu001 let's just make sure that we have the same sample entropy results: would you mind computing SampEn with these parameters using your own algorithm:

signal = [1, 2, 3, 5, 3, 1, 2, 4, 5, 7, 3, 2, 6, 2, 4, 8, 2]
tol = 2

With NK, this gives:

nk.entropy_sample(signal, dimension=2, delay=3, tolerance=tol)
> (0.2831469172863898,
 {'Dimension': 2, 'Delay': 3, 'Tolerance': 2.0034572195207527})

(Several follow-up comments were marked as outdated and are hidden.)

@DominiqueMakowski (Member, Author) commented Aug 31, 2023

  • Coarse graining is fixed
  • Sample entropy: find out the source of the discrepancies

So I dug down a bit more; the difference between our codes is most likely related to how the phis are computed:

import neurokit2 as nk
import numpy as np

# Your MSE code isolated
def custom_mse(signal, m, delay, tol):
    N = len(signal)
    Nn = 0
    Nd = 0
    for i in np.arange(0, N - (m + 1) * delay).reshape(-1):
        for j in np.arange(i + delay, N - m * delay).reshape(-1):
            # Match on the first two components of each template
            if (np.abs(signal[i] - signal[j]) < tol) and (np.abs(signal[i + delay] - signal[j + delay]) < tol):
                Nn += 1
                # Match also on the (m + 1)-th component
                if abs(signal[i + m * delay] - signal[j + m * delay]) < tol:
                    Nd += 1
    return -np.log(Nd / Nn), [Nd, Nn]


signal = [1, 2, 3, 5, 3, 1, 2, 4, 5, 7, 3, 2, 6, 2, 4, 8, 2]
delay=1
m = 2
tol = 2

rez, info = nk.entropy_sample(signal, dimension=m, delay=delay, tolerance=tol)
rez, info["phi"]
> (0.4946962418361073, array([0.39047619, 0.23809524]))

custom_mse(signal, m, delay, tol)
> (0.916290731874155, [6, 15])

The discrepancy likely originates in how the counts are obtained. In NK we use:

count = kdtree.query_radius(embedded, tolerance, count_only=True).astype(np.float64)
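For context, query_radius() also counts each point as a neighbour of itself, which is why the phi computation below subtracts 1 from the counts. A quick illustrative check (not NK code):

import numpy as np
import sklearn.neighbors

pts = np.array([[0.0], [0.5], [10.0]])
kdtree = sklearn.neighbors.KDTree(pts, metric="chebyshev")
print(kdtree.query_radius(pts, 1.0, count_only=True))   # [2 2 1]: each point matches itself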

In any case, I have updated this branch so that entropy_sample() returns a dictionary with all the elements used to compute the phi values. Once you upgrade your version again, you can run:

rez, info = nk.entropy_sample(signal, dimension=m, delay=delay, tolerance=tol)
rez, info["phi"]

# This is how we compute the phi in NK
phi = [np.mean((info["count1"] - 1) / (info["embedded1"].shape[0] - 1)),
  np.mean((info["count2"] - 1) / (info["embedded2"].shape[0] - 1))]
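As a cross-check (a sketch, assuming info["phi"] is ordered [phi_m, phi_m+1] as in the output above), the returned SampEn should equal -log(phi[1] / phi[0]):

import numpy as np
import neurokit2 as nk

signal = [1, 2, 3, 5, 3, 1, 2, 4, 5, 7, 3, 2, 6, 2, 4, 8, 2]
rez, info = nk.entropy_sample(signal, dimension=2, delay=1, tolerance=2)
phi = info["phi"]
# With the values printed earlier ([0.39047619, 0.23809524]), -log(phi[1] / phi[0]) is about 0.4947, i.e. equal to rez
print(np.isclose(rez, -np.log(phi[1] / phi[0])))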

Could you let me know whether that looks correct to you, whether you have any guess as to the origin of the difference, and whether there is an error somewhere?

@DominiqueMakowski (Member, Author)

Also tagging @CSchoel here, who might have some insight into the source of the SampEn difference.

@DominiqueMakowski (Member, Author)

MNE also seems to have a similar implementation (but with a fixed r).

(Several comments were marked as outdated or resolved and are hidden.)

@hsyu001 commented Sep 1, 2023

The source code is downloadable in a zip file

mse.zip

(Several comments were marked as outdated, resolved, or duplicate and are hidden.)

@DominiqueMakowski (Member, Author) commented Sep 13, 2023

Here's a summary and reformulation of the problem for future reference:

NeuroKit, like many other mainstream implementations, computes SampEn using a KDTree to query the nearest neighbours within the tolerance. Here is minimal code to reproduce NK's SampEn results:

import numpy as np
import sklearn.neighbors
import neurokit2 as nk

def sample_entropy(signal, dimension, delay, tol):

    # Embed the signal at dimension m (dropping the last vector) and at m + 1
    m = nk.complexity_embedding(signal, dimension=dimension, delay=delay)[:-1]
    m1 = nk.complexity_embedding(signal, dimension=dimension+1, delay=delay)

    # Count neighbours within the tolerance (Chebyshev distance), including self-matches
    kdtree = sklearn.neighbors.KDTree(m, metric="chebyshev")
    count1 = kdtree.query_radius(m, tol, count_only=True)
    kdtree = sklearn.neighbors.KDTree(m1, metric="chebyshev")
    count2 = kdtree.query_radius(m1, tol, count_only=True)

    # Average match probabilities, excluding self-matches (hence the -1)
    Nd = np.mean((count2 - 1) / (m1.shape[0] - 1))
    Nn = np.mean((count1 - 1) / (m.shape[0] - 1))

    return -np.log(Nd / Nn), [Nd, Nn]


# Test ---------
signal = [2, 4, 8, 16, 1, 3, 5, 7, 9, 11]
delay=1
dimension = 2
tol = 2

sample_entropy(signal, dimension, delay, tol)
> (0.40546510810816444, [0.14285714285714285, 0.21428571428571427])

However, this implementation does not behave as expected (as shown by the multiscale pattern on white and pink noise). The following implementation gives different results (but the expected ones, given the benchmark above); unfortunately, it is very slow in base Python:

def sample_entropy2(signal, m, delay, tol):

    N = len(signal)
    Nn = 0
    Nd = 0

    for i in np.arange(0, N - (m + 1) * delay).reshape(-1):
        for j in np.arange(i + delay, N - m * delay).reshape(-1):

            # Match on the first two components (note: <= tol here, < tol below)
            if (np.abs(signal[i] - signal[j]) <= tol) and (np.abs(signal[i + delay] - signal[j + delay]) <= tol):
                Nn += 1

                # Match also on the (m + 1)-th component
                if abs(signal[i + m * delay] - signal[j + m * delay]) < tol:
                    Nd += 1

    return -np.log(Nd / Nn), [Nd, Nn]

sample_entropy2(signal, dimension, delay, tol)
> (1.791759469228055, [1, 6])

The questions are:

  1. Is the first one indeed incorrect?
  2. Where exactly does the discrepancy emerge?
  3. If the second implementation is the correct one, can we make it fast, for instance using nearest neighbours or vectorization? (See the sketch below.)

@zen-juen maybe you can also ask some math experts iykwim
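On point 3, here is a rough NumPy-vectorized sketch of the same counting (a hypothetical helper that reproduces sample_entropy2 for dimension = 2, not an endorsed replacement):

import numpy as np

def sample_entropy2_vectorized(signal, m, delay, tol):
    # Broadcasts all (i, j) pairs at once instead of the double Python loop.
    # Mirrors sample_entropy2: first two components checked with <= tol,
    # the (m + 1)-th with < tol, and only pairs with j >= i + delay are counted.
    signal = np.asarray(signal, dtype=float)
    N = len(signal)
    i = np.arange(N - (m + 1) * delay)[:, None]   # template start indices (column)
    j = np.arange(N - m * delay)[None, :]         # candidate start indices (row)
    match = (
        (np.abs(signal[i] - signal[j]) <= tol)
        & (np.abs(signal[i + delay] - signal[j + delay]) <= tol)
        & (j >= i + delay)
    )
    Nn = int(match.sum())
    Nd = int((match & (np.abs(signal[i + m * delay] - signal[j + m * delay]) < tol)).sum())
    return -np.log(Nd / Nn), [Nd, Nn]

# On the test signal above this gives the same result: (1.791759..., [1, 6])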

@DominiqueMakowski (Member, Author) commented Sep 15, 2023

Comparing NK vs. Loop-based

Alright, I've managed to narrow down the problem, I think. Long story short, the two versions give different results for dimension > 2. Here's what GPT has to say about that:

The discrepancy between the two methods for dimension > 2 can be attributed to various reasons:

Embedding: The method of embedding used in the two functions might have slight differences. The nk.complexity_embedding function might handle the creation of vectors differently from the direct loop approach in sample_entropy2.

Counting Mechanism: The KDTree approach in the first function is more efficient, but there might be subtle differences in how it counts vectors within a given tolerance compared to the naive loop approach.

Implementation Mistakes: It's also possible that there are errors in one or both of the implementations.

Let's scrutinize the second function, specifically for dimension > 2:

For sample_entropy2, the nested loop approach should accommodate counting matches for vectors of lengths greater than 2. Here's the current loop structure:

if (np.abs(signal[i] - signal[j]) <= tolerance) and (np.abs(signal[i + delay] - signal[j + delay]) <= tolerance):

This structure only explicitly checks for a match in the first two data points. If the dimension is greater than 2, the function will not count matches correctly. To fix this, we need to loop through each point in the embedded vector and check if they match within the given tolerance.

Here's a modified loop structure for sample_entropy2:

match = True
for d in range(dimension):
    if np.abs(signal[i + d * delay] - signal[j + d * delay]) > tolerance:
        match = False
        break
if match:
    Nn += 1
    if np.abs(signal[i + dimension * delay] - signal[j + dimension * delay]) < tolerance:
        Nd += 1

This adjusted structure checks for matches for vectors of any length (dimension).

Now, I recommend adjusting the sample_entropy2 function with this new loop structure and testing it against sample_entropy_nk for various dimension values to see if they yield similar results.
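For convenience, here is that suggested structure wrapped into a self-contained function (a sketch for testing only, keeping the same index bounds as sample_entropy2):

import numpy as np

def sample_entropy2_generalized(signal, dimension, delay, tol):
    # Same loops as sample_entropy2, but the match check runs over all
    # `dimension` components instead of only the first two.
    N = len(signal)
    Nn = 0
    Nd = 0
    for i in range(0, N - (dimension + 1) * delay):
        for j in range(i + delay, N - dimension * delay):
            match = all(
                np.abs(signal[i + d * delay] - signal[j + d * delay]) <= tol
                for d in range(dimension)
            )
            if match:
                Nn += 1
                if np.abs(signal[i + dimension * delay] - signal[j + dimension * delay]) < tol:
                    Nd += 1
    return -np.log(Nd / Nn), [Nd, Nn]

# For dimension = 2 this reduces to sample_entropy2 above.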

@hsyu001 what do you think about that?

Denominator problem?

EDIT: actually after thinking about it I think the original is correct.

However, GPT4 flagged something else. For the NK implementation:

(image: GPT-4's suggestion about the denominators)

It suggests changing the denominators from:

    Nd = np.mean((count2 - 1) / (m1.shape[0] - 1))
    Nn = np.mean((count1 - 1) / (m.shape[0] - 1))

to

    Nd = np.mean((count2 - 1) / (len(signal) - dimension + 1))
    Nn = np.mean((count1 - 1) / (len(signal) - dimension))

It doesn't affect the results much, but it's worth checking with a mathematician.

@hsyu001 commented Sep 15, 2023

Hi Dr. Makowski,
I'm not sure about the correctness of the new approach. I ran both my previous code and the suggested modification with various dimension settings (dim = 2, 3, 5, and 9). It seems that the suggested modification causes errors when the dimension is greater than three.

(1) The results of my previous code: (image: hsyu_results)

(2) The results of the suggested method: (image: new_results)

@DominiqueMakowski changed the title from "[Fix] entropy_multiscale(): fix when modified multiscale" to "[Fix] complexity_coarsegraining(): fix method" on Sep 22, 2023
@DominiqueMakowski (Member, Author)

Alright, after some more consideration, I've decided not to change the SampEn function for now, as I couldn't find anything wrong with it per se. Why it gives different results when used with multiscale on coloured noise remains a mystery and is worth continuing to explore.

I'll go ahead and merge this PR, which contains fixes to the coarse-graining.

@DominiqueMakowski merged commit e424839 into dev on Sep 22, 2023 (5 of 9 checks passed) and deleted the fix_entropymultiscale branch.