
[Fix] complexity_coarsegraining(): fix method #892

Merged: 12 commits into dev on Sep 22, 2023

Conversation

@DominiqueMakowski (Member)

I ran into a problem when using the modified multiscale entropy method (MMSEn) in the package to measure the complexity of pink (1/f) noise as a benchmark test: the MMSEn of pink noise should be close to a near-horizontal line across scales, not a decaying curve.

I attach my code (Modified_MSE.ipynb) to help examine the problem. It has two parts: part 1 contains the correct MMSEn method and results based on the original paper (Wu et al., 2013), and part 2 shows the results obtained with MMSEn from NeuroKit2. I think the problem is that, with the modified (rolling-average) coarse-graining method, the entropy at each scale factor s should be estimated by applying sample entropy (SampEn) with a time delay equal to the current scale s, rather than the usual delay of 1.
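To illustrate that suggestion, here is a minimal sketch (the mmse_scale helper is hypothetical, not NeuroKit2's API): coarse-grain with a rolling average, then apply SampEn with a delay equal to the scale.

import numpy as np
import neurokit2 as nk

def mmse_scale(signal, scale, dimension=2, tolerance=None):
    # Hypothetical helper sketching the suggestion above (not NeuroKit2 code)
    signal = np.asarray(signal, dtype=float)
    if tolerance is None:
        tolerance = 0.15 * np.std(signal)   # tolerance taken from the original signal
    # Rolling (moving) average of window `scale`
    coarse = np.convolve(signal, np.ones(scale) / scale, mode="valid")
    # Sample entropy of the coarse-grained series with delay = scale
    sampen, _ = nk.entropy_sample(coarse, dimension=dimension, delay=scale, tolerance=tolerance)
    return sampen

# An MMSEn curve across scales 1..20 would then be:
# mmse_curve = [mmse_scale(signal, s) for s in range(1, 21)]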

Benchmark

Code
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator)
import pyplnoise

import neurokit2 as nk


# plot each column
plt.figure(figsize = (12, 8), dpi = 80)

ax = plt.subplot(111)

#
#   parameter settings
#


set_scale = 20           # maximum scale for multiscale entropy test (from 1 to set_scale)

max_loops = 30           # maximum times for simulation

data_length = 1500       # Noise length

######################################################

start = 1

end = set_scale

space = set_scale

plt.xticks(np.linspace(start, end, space), rotation = 0)

#plt.yticks(np.linspace(-0.25, 3.0, 14), rotation = 0) 

#
# Setting the title and the axis label
#
plt.title('Modified Multiscale Entropy Curves of Noise', fontsize = 14, weight = 'bold', y = 1.01)

x = np.arange(1, set_scale + 1 , 1)

if set_scale > 40:
    
    majors = [1, np.arange(5, set_scale + 1, 5)]
    
    ax.xaxis.set_major_locator(MultipleLocator(5))
    
    ax.xaxis.set_minor_locator(MultipleLocator(1))
    
    ax.set_xlim(0, set_scale + 1)


    
loops = max_loops

y = np.arange(1, loops + 1 , 1)

idx = pd.Series(x)

null_data_1 = np.zeros((len(idx), len(y)), dtype = np.float64)
null_data_2 = np.zeros((len(idx), len(y)), dtype = np.float64)

mmse_white = pd.DataFrame(null_data_1, columns = y)
mmse_white.set_index(idx, inplace = True)

mmse_pink = pd.DataFrame(null_data_2, columns = y)
mmse_pink.set_index(idx, inplace = True)



for i in range(0, loops, 1):
    
    print(' .', end = '')
    #print('Current loop = %d' % (i+1))
    
    fs = 100.
    npts = data_length
    
    #
    # Generate white noise
    #
    whnoise = pyplnoise.WhiteNoise(fs, psd = 1.0)
    white_noise = whnoise.get_series(npts)
    
    
    #
    # Generate pink noise
    #
    pknoise = pyplnoise.PinkNoise(fs, 1e-2, 50.)
    pink_noise = pknoise.get_series(npts)

    
    #
    # Test MMSE method for both white noise and pink noise 
    #
    
    #
    #                          WHITE NOISE
    #
    w_tol = 0.15 * np.std(white_noise)

    mse_w, info_w = nk.entropy_multiscale(signal = white_noise, 
                                          scale = set_scale,
                                          dimension = 2,
                                          tolerance = w_tol,
                                          method = 'MMSEn',
                                          fuzzy = False,
                                          show = False )
    
    mmse_white.iloc[:, i] = info_w['Value']

    #
    #                           PINK NOISE
    #
    p_tol = 0.15 * np.std(pink_noise)

    mse_pink, info_p = nk.entropy_multiscale(signal = pink_noise, 
                                              scale = set_scale,
                                              dimension = 2,
                                              tolerance = p_tol,
                                              method = 'MMSEn',
                                              fuzzy = False,
                                              show = False )
    
    mmse_pink.iloc[:, i] = info_p['Value']


    
#
# Calculate mean and standard deviation of points, grouped by rows
#
w_mean = mmse_white.mean(axis = 1)
w_std = mmse_white.std(axis = 1)

p_mean = mmse_pink.mean(axis = 1)
p_std = mmse_pink.std(axis = 1) 

#
# Calculate 95% confidence interval
#
w_conf_int = 1.96 * w_std / np.sqrt(len(y))
p_conf_int = 1.96 * p_std / np.sqrt(len(y))


#
# Plot result diagrams 
#
plt.errorbar(x, 
             w_mean, 
             yerr = w_conf_int, 
             fmt = '-o',  
             capsize = 3, 
             color = 'b', 
             label = 'White Noise')

plt.errorbar(x, 
             p_mean, 
             yerr = p_conf_int, 
             fmt = '-s',  
             capsize = 3, 
             color = 'r', 
             label = 'Pink Noise')

plt.xlabel('Scale Factor', 
           fontsize = 14,
           weight = 'bold',
           labelpad = 10)

plt.ylabel('Sample Entropy', 
           fontsize = 14, 
           weight = 'bold', 
           labelpad = 10)

plt.legend(loc = 1, 
           fontsize = 14, 
           bbox_to_anchor = (0.9, 0.95))

plt.savefig('MMSE_S%s_L%s_WP.png' % (str(data_length), str(max_loops)), dpi = 200)

plt.show()
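(Aside: if pyplnoise is unavailable, a 1/f series can be approximated with a small FFT-based sketch; the pink_noise helper below is hypothetical and not part of the benchmark above.)

import numpy as np

def pink_noise(n, rng=None):
    # Shape white Gaussian noise so that its power spectrum falls off as 1/f
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    freqs[0] = freqs[1]            # avoid division by zero at the DC bin
    spectrum /= np.sqrt(freqs)     # amplitude ~ 1/sqrt(f), so power ~ 1/f
    pink = np.fft.irfft(spectrum, n)
    return pink / np.std(pink)     # normalize to unit variance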

Should look like this:

(image: expected MMSEn benchmark, with pink noise near-horizontal across scales)

But looks like this:

(image: NeuroKit2 MMSEn result, with pink noise decaying across scales)

(Several comments were marked as outdated or resolved and are hidden.)

Review comment on the updated rolling-average coarse-graining line:

coarse = complexity_embedding(signal, dimension=scale, delay=1).mean(axis=1)
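As a quick sanity check (a sketch, not part of the diff): this embedding-then-mean trick is the same as a plain moving average of window `scale`.

import numpy as np
import neurokit2 as nk

x = np.random.default_rng(1).standard_normal(100)
scale = 5
emb_mean = nk.complexity_embedding(x, dimension=scale, delay=1).mean(axis=1)
mov_avg = np.convolve(x, np.ones(scale) / scale, mode="valid")
assert np.allclose(emb_mean, mov_avg)   # both give the rolling average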
@DominiqueMakowski (Member, Author):

I updated the coarse-graining procedure, but it doesn't seem to have entirely solved the issue:

(image: MMSEn benchmark after the coarse-graining update)

@hsyu001 let's just make sure that we have the same sample entropy results: would you mind computing SampEn with these parameters using your own algorithm:

signal = [1, 2, 3, 5, 3, 1, 2, 4, 5, 7, 3, 2, 6, 2, 4, 8, 2]
tol = 2

With NK, this gives:

nk.entropy_sample(signal, dimension=2, delay=3, tolerance=tol)
> (0.2831469172863898,
 {'Dimension': 2, 'Delay': 3, 'Tolerance': 2.0034572195207527})

(Several follow-up comments were marked as outdated and are hidden.)

@DominiqueMakowski (Member, Author) commented Aug 31, 2023

  • Coarse graining is fixed
  • Sample entropy: find out the source of the discrepancies

So I dug down a bit more; the difference between our codes is most likely related to how the phis are computed:

import neurokit2 as nk
import numpy as np

# Your MSE code isolated
def custom_mse(signal, m, delay, tol):
    N = len(signal)
    Nn = 0
    Nd = 0
    for i in np.arange(0, N - (m + 1) * delay).reshape(-1):
        for j in np.arange(i + delay, N - m * delay).reshape(-1):
            # Match on the first two components of each template
            if (np.abs(signal[i] - signal[j]) < tol) and (np.abs(signal[i + delay] - signal[j + delay]) < tol):
                Nn += 1
                # Match also on the (m + 1)-th component
                if abs(signal[i + m * delay] - signal[j + m * delay]) < tol:
                    Nd += 1
    return -np.log(Nd / Nn), [Nd, Nn]


signal = [1, 2, 3, 5, 3, 1, 2, 4, 5, 7, 3, 2, 6, 2, 4, 8, 2]
delay=1
m = 2
tol = 2

rez, info = nk.entropy_sample(signal, dimension=m, delay=delay, tolerance=tol)
rez, info["phi"]
> (0.4946962418361073, array([0.39047619, 0.23809524]))

custom_mse(signal, m, delay, tol)
> (0.916290731874155, [6, 15])

The discrepancy likely originates in how the counts are obtained. In NK we use:

count = kdtree.query_radius(embedded, tolerance, count_only=True).astype(np.float64)
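For context, query_radius() also counts each point as a neighbour of itself, which is why the phi computation below subtracts 1 from the counts. A quick illustrative check (not NK code):

import numpy as np
import sklearn.neighbors

pts = np.array([[0.0], [0.5], [10.0]])
kdtree = sklearn.neighbors.KDTree(pts, metric="chebyshev")
print(kdtree.query_radius(pts, 1.0, count_only=True))   # [2 2 1]: each point matches itself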

In any case, I have updated this branch so that entropy_sample() returns a dictionary with all the elements used to compute the phi values. Once you upgrade your version again, you can run:

rez, info = nk.entropy_sample(signal, dimension=m, delay=delay, tolerance=tol)
rez, info["phi"]

# This is how we compute the phi in NK
phi = [np.mean((info["count1"] - 1) / (info["embedded1"].shape[0] - 1)),
  np.mean((info["count2"] - 1) / (info["embedded2"].shape[0] - 1))]
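As a cross-check (a sketch, assuming info["phi"] is ordered [phi_m, phi_m+1] as in the output above), the returned SampEn should equal -log(phi[1] / phi[0]):

import numpy as np
import neurokit2 as nk

signal = [1, 2, 3, 5, 3, 1, 2, 4, 5, 7, 3, 2, 6, 2, 4, 8, 2]
rez, info = nk.entropy_sample(signal, dimension=2, delay=1, tolerance=2)
phi = info["phi"]
# With the values printed earlier ([0.39047619, 0.23809524]), -log(phi[1] / phi[0]) is about 0.4947, i.e. equal to rez
print(np.isclose(rez, -np.log(phi[1] / phi[0])))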

Could you let me know whether that looks correct to you, whether you have any guess as to the origin of the difference, and whether there is an error somewhere?

@DominiqueMakowski (Member, Author)

Also tagging @CSchoel here, who might have some insight into the source of the SampEn difference.

@DominiqueMakowski (Member, Author)

MNE also seems to have a similar implementation (but with a fixed r).

(Several comments were marked as outdated or resolved and are hidden.)

@hsyu001 commented Sep 1, 2023

The source code is downloadable in a zip file

mse.zip

(Several comments were marked as outdated, resolved, or duplicate and are hidden.)

@DominiqueMakowski (Member, Author) commented Sep 13, 2023

Here's a summary and reformulation of the problem for future reference:

NeuroKit, like many other mainstream implementations, computes SampEn using a KDTree to query the nearest neighbours within the tolerance. Here is minimal code to reproduce NK's SampEn results:

import numpy as np
import sklearn.neighbors
import neurokit2 as nk

def sample_entropy(signal, dimension, delay, tol):

    # Embed the signal at dimension m (dropping the last vector) and at m + 1
    m = nk.complexity_embedding(signal, dimension=dimension, delay=delay)[:-1]
    m1 = nk.complexity_embedding(signal, dimension=dimension+1, delay=delay)

    # Count neighbours within the tolerance (Chebyshev distance), including self-matches
    kdtree = sklearn.neighbors.KDTree(m, metric="chebyshev")
    count1 = kdtree.query_radius(m, tol, count_only=True)
    kdtree = sklearn.neighbors.KDTree(m1, metric="chebyshev")
    count2 = kdtree.query_radius(m1, tol, count_only=True)

    # Average match probabilities, excluding self-matches (hence the -1)
    Nd = np.mean((count2 - 1) / (m1.shape[0] - 1))
    Nn = np.mean((count1 - 1) / (m.shape[0] - 1))

    return -np.log(Nd / Nn), [Nd, Nn]


# Test ---------
signal = [2, 4, 8, 16, 1, 3, 5, 7, 9, 11]
delay=1
dimension = 2
tol = 2

sample_entropy(signal, dimension, delay, tol)
> (0.40546510810816444, [0.14285714285714285, 0.21428571428571427])

However, this implementation does not behave as expected (as shown by the multiscale pattern on white and pink noise). The following implementation gives different results (but the expected ones, given the benchmark above); unfortunately, it is very slow in base Python:

def sample_entropy2(signal, m, delay, tol):

    N = len(signal)
    Nn = 0
    Nd = 0

    for i in np.arange(0, N - (m + 1) * delay).reshape(-1):
        for j in np.arange(i + delay, N - m * delay).reshape(-1):

            # Match on the first two components (note: <= tol here, < tol below)
            if (np.abs(signal[i] - signal[j]) <= tol) and (np.abs(signal[i + delay] - signal[j + delay]) <= tol):
                Nn += 1

                # Match also on the (m + 1)-th component
                if abs(signal[i + m * delay] - signal[j + m * delay]) < tol:
                    Nd += 1

    return -np.log(Nd / Nn), [Nd, Nn]

sample_entropy2(signal, dimension, delay, tol)
> (1.791759469228055, [1, 6])

The questions are:

  1. Is the first one indeed incorrect?
  2. Where exactly does the discrepancy emerge?
  3. If the second implementation is the correct one, can we make it fast, for instance using nearest neighbours or vectorization? (See the sketch below.)

@zen-juen maybe you can also ask some math experts iykwim
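On point 3, here is a rough NumPy-vectorized sketch of the same counting (a hypothetical helper that reproduces sample_entropy2 for dimension = 2, not an endorsed replacement):

import numpy as np

def sample_entropy2_vectorized(signal, m, delay, tol):
    # Broadcasts all (i, j) pairs at once instead of the double Python loop.
    # Mirrors sample_entropy2: first two components checked with <= tol,
    # the (m + 1)-th with < tol, and only pairs with j >= i + delay are counted.
    signal = np.asarray(signal, dtype=float)
    N = len(signal)
    i = np.arange(N - (m + 1) * delay)[:, None]   # template start indices (column)
    j = np.arange(N - m * delay)[None, :]         # candidate start indices (row)
    match = (
        (np.abs(signal[i] - signal[j]) <= tol)
        & (np.abs(signal[i + delay] - signal[j + delay]) <= tol)
        & (j >= i + delay)
    )
    Nn = int(match.sum())
    Nd = int((match & (np.abs(signal[i + m * delay] - signal[j + m * delay]) < tol)).sum())
    return -np.log(Nd / Nn), [Nd, Nn]

# On the test signal above this gives the same result: (1.791759..., [1, 6])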

@DominiqueMakowski (Member, Author) commented Sep 15, 2023

Comparing NK vs. Loop-based

Alright, I've managed to narrow down the problem, I think. Long story short, the two versions give different results for dimension > 2. Here's what GPT has to say about that:

The discrepancy between the two methods for dimension > 2 can be attributed to various reasons:

Embedding: The method of embedding used in the two functions might have slight differences. The nk.complexity_embedding function might handle the creation of vectors differently from the direct loop approach in sample_entropy2.

Counting Mechanism: The KDTree approach in the first function is more efficient, but there might be subtle differences in how it counts vectors within a given tolerance compared to the naive loop approach.

Implementation Mistakes: It's also possible that there are errors in one or both of the implementations.

Let's scrutinize the second function, specifically for dimension > 2:

For sample_entropy2, the nested loop approach should accommodate counting matches for vectors of lengths greater than 2. Here's the current loop structure:

if (np.abs(signal[i] - signal[j]) <= tolerance) and (np.abs(signal[i + delay] - signal[j + delay]) <= tolerance):

This structure only explicitly checks for a match in the first two data points. If the dimension is greater than 2, the function will not count matches correctly. To fix this, we need to loop through each point in the embedded vector and check if they match within the given tolerance.

Here's a modified loop structure for sample_entropy2:

match = True
for d in range(dimension):
    if np.abs(signal[i + d * delay] - signal[j + d * delay]) > tolerance:
        match = False
        break
if match:
    Nn += 1
    if np.abs(signal[i + dimension * delay] - signal[j + dimension * delay]) < tolerance:
        Nd += 1

This adjusted structure checks for matches for vectors of any length (dimension).

Now, I recommend adjusting the sample_entropy2 function with this new loop structure and testing it against sample_entropy_nk for various dimension values to see if they yield similar results.
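For convenience, here is that suggested structure wrapped into a self-contained function (a sketch for testing only, keeping the same index bounds as sample_entropy2):

import numpy as np

def sample_entropy2_generalized(signal, dimension, delay, tol):
    # Same loops as sample_entropy2, but the match check runs over all
    # `dimension` components instead of only the first two.
    N = len(signal)
    Nn = 0
    Nd = 0
    for i in range(0, N - (dimension + 1) * delay):
        for j in range(i + delay, N - dimension * delay):
            match = all(
                np.abs(signal[i + d * delay] - signal[j + d * delay]) <= tol
                for d in range(dimension)
            )
            if match:
                Nn += 1
                if np.abs(signal[i + dimension * delay] - signal[j + dimension * delay]) < tol:
                    Nd += 1
    return -np.log(Nd / Nn), [Nd, Nn]

# For dimension = 2 this reduces to sample_entropy2 above.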

@hsyu001 what do you think about that?

Denominator problem?

EDIT: actually after thinking about it I think the original is correct.

However, GPT4 flagged something else. For the NK implementation:

(image: GPT-4's suggestion about the denominators)

It suggests changing the denominators from:

    Nd = np.mean((count2 - 1) / (m1.shape[0] - 1))
    Nn = np.mean((count1 - 1) / (m.shape[0] - 1))

to

    Nd = np.mean((count2 - 1) / (len(signal) - dimension + 1))
    Nn = np.mean((count1 - 1) / (len(signal) - dimension))

It doesn't affect the results much, but it's worth checking with a mathematician.

@hsyu001 commented Sep 15, 2023

Hi Dr. Makowski,
I'm not sure about the correctness of the new approach. I ran both my previous code and the suggested modification with various dimension settings (dim = 2, 3, 5, and 9). It seems that the suggested modification causes errors when the dimension is greater than three.

(1) The results of my previous code: (image: hsyu_results)

(2) The results of the suggested method: (image: new_results)

@DominiqueMakowski changed the title from "[Fix] entropy_multiscale(): fix when modified multiscale" to "[Fix] complexity_coarsegraining(): fix method" on Sep 22, 2023
@DominiqueMakowski (Member, Author)

Alright, after some more consideration, I've decided not to change the SampEn function for now, as I couldn't find anything wrong with it per se. Why it gives different results when used with multiscale on coloured noise remains a mystery and is worth continuing to explore.

I'll go ahead and merge this PR, which contains fixes to the coarse-graining.

@DominiqueMakowski merged commit e424839 into dev on Sep 22, 2023 (5 of 9 checks passed) and deleted the fix_entropymultiscale branch.