Adding 'Energy Test' two-sample test #31

dylanjaide · 2020-07-22T11:35:34Z

Following discussions with @chrisburr and others, we propose adding the Energy Test to hepstats. This is a two-sample hypothesis test currently used at LHCb to observe local CP violation in 3- and 4-body decays ([1], [2], [3]).

Chris' suggestion was to use the existing C++ implementation (see here, although some minor extra features have been added to other versions since this one) and to add python bindings for it, along with a python executable for command line usage. There also exists a python implementation, although it's slower & may not be preferable.

The text was updated successfully, but these errors were encountered:

marinang · 2020-07-22T12:47:45Z

@Dylan-JW Excellent, I have learned about a new statistic test! This would be a great addition to hepstats.

Chris' suggestion was to use the existing C++ implementation (see here, although some minor extra features have been added to other versions since this one) and to add python bindings for it, along with a python executable for command line usage. There also exists a python implementation, although it's slower & may not be preferable.

I don't have a lot of experience with python bindings, but this seems reasonable. Would you move also the C++ part to hepstats?
What about a Numba accelerated implementation?

eduardo-rodrigues · 2020-07-22T12:57:15Z

Hi all, sounds great.

Concerning the nice suggestion on using Numba with pure Python. Yep, that might be good enough given that the algorithm probably isn't that big hence preparing bindings with PyBind11 might be an overkill. May I suggest that you discuss on how to go about this in https://gitter.im/Scikit-HEP/community as you will then get lots of expert help? Just a thought.

chrisburr · 2020-07-22T13:03:16Z

I think this would be a nice addition as it is a generally applicable statistical test. It would also be nice to generalise the code to work for any number of dimensions.

On big change would be that this would introduce a small C++ component which would end up requiring the hepstats CI and deployment to be changed. Personally I don't see this as too much of a problem thanks to the work which has been done for boost-histogram and awkward1 but I'm also not responsible for hepstats and I understand if people would rather keep it pure Python.

Actually it would be interesting to see if numba could be used to optimise this without falling back to C++, especially as that would allow the same implementation to run on GPUs. I suspect wouldn't be fast enough considering hundreds of CPU years have been used running the handful of lines in this loop but it would be good to check as it's a perfect example of where numba should shine.

EDIT: I see while writing this similar suggestions have been made! 😆

dylanjaide · 2020-07-22T13:29:08Z

Hi all, thanks for the feedback. One of our group has been working on a version (in a private repo) using numba & CUDA, although I believe it's not yet finished & it hasn't been benchmarked against the C++ implementation yet. Also, I agree that generalising the code to work with any number of dimensions would be worth doing. Furthermore, if a python implementation using numba is the way forward, could it also be worth adding the ability to take inputs from sources other than plaintext files (eg. pandas dataframes)?

jcob95 · 2020-07-27T10:28:28Z

Hi all,

I implemented a numba version using the GPU a while back. I've recently added a parallel implementation for CPU using numba. At present you can choose whether to use CPU or GPU using a flag. This version is on the LHCb GitLab. I could probably mirror it to github but regardless here is the link and I've added the people on this thread.
Link to repo

As previously mentioned it hasn't been bench marked for performance yet but the C++ version has been used to test the results.An issue somewhat related to this is that there seems to be problems with floating point accuracy in that the CPU numba, GPU numba and C++ versions which all give different test statistic values as the dataset increases. The difference is O(10^-7) and lower but of course any difference could lead to incorrect significance.

marinang · 2020-07-30T09:35:25Z

Indeed that's worrisome if the floating point accuracy is different between versions especially when you want very low p-values.
I guess you could also encounter this problem when testing the implementation into two different machines, have you tried?

jcob95 · 2020-08-03T10:25:54Z

Indeed that's worrisome if the floating point accuracy is different between versions especially when you want very low p-values.
I guess you could also encounter this problem when testing the implementation into two different machines, have you tried?

I have not tried this yet but I will. For a sample size of 2 million the difference between the python version and the C++ becomes larger with agreement at 4 significant figures as opposed to 7 for a sample size of 200k but all the test statistic values are different. I'm now concerned that maybe there might be issues with the loops in the function since I would have expected consistent differences i.e always differ at 7 significant figures since we use double precision (float64) throughout and rounding errors wouldn't be this large?

marinang · 2020-09-28T13:22:06Z

Any news about this :) ?

jonas-eschle · 2021-04-20T20:08:44Z

Hi all, are there any news on this? It was mentioned recently that a pure Python implementation of this is available (by Abhjit), in case this here is stalled.

eduardo-rodrigues · 2021-04-20T21:33:41Z

Better to ping Abhijit explicitly, then, even if via email ...

jcob95 · 2021-04-21T08:38:42Z

Hi,

I never got around to finding out the cause of the numerical differences between this and the C++ version. However it sounds like this new Python version seems to work. Do you know if Abhijit has compared it with the C++ version to check for similar numerical issues?

dylanjaide · 2021-04-21T09:58:44Z

Hi all, I believe there's now 3 separate Python implementations floating around:

So we sort of have our pick of the bunch - probably whichever we think will perform the best, I'd guess? I'm happy to send Abhijit an email to discuss further if we wish (though their repository does describe it as "not the most optimal implementation", at least in its current form).

abhijitm08 · 2021-04-21T11:06:15Z

Hi All, It would also be good to include not just the Energy test but also other un-binned methods too, if available (e.g. Mixed sample). My code should scale with number of dimensions but does not scale well with sample size. It does not yet include the feature presented in this paper by W. Barter et al. I say my implementations of the two methods are not the most "optimal" because I have not compared it against other implementations (I will have to first check the performance and see if there are parts where I can improve, but have no immediate plans for doing this). By the way (@dylanjaide ), John's link above is broken, I cannot open it.

dylanjaide · 2021-04-21T11:17:33Z

I think Giulio's version also doesn't implement the scaling method you mention - however it looks to me like John's does. And yeah, the poor (O(n^2)) scaling with sample size is one of the main issues with this method, and why we think it'd be ideal to have an implementation for scikit-hep that also works on GPUs. With all of this in mind, it seems to me that John's version is likely the best starting point, as long as we can work out this issue with the floating point precision.

(also, re. the link to John's repo: the link is correct, I just forgot that it's a private repo, sorry!)

jcob95 · 2021-04-21T11:37:11Z

Hi @abhijitm08,

I've made the repo public on GitLab so you should be able to view it.

jonas-eschle · 2021-04-21T20:41:49Z

Indeed, I agree. There is one advantage though of Giulios implementation: as it seems to me, it is fully vectorized and does not rely actually on for-loops, which is elegant.

That aside, I think @jcob95 implementation seems favorable given the jit. Maybe we can try with long doubles to gain some extra precision and understand which implementation is actually "correct"?

abhijitm08 · 2021-04-21T23:20:00Z

@mayou36 Had to use explicit for loops because the kernel function ("phi") is actually symmetric matrix atleast for the 2 same-sample terms of the test statistic. So you need to only evaluate lower or upper side of the triangle and not both, saving computing time for large datasets. I actually do this and it looks like @jcob95 does it too. For the exponential form of the kernel function, the cross term will also be symmetric (so we can extend it to this term too for this choice of kernel function, which should also speed things up). I think Giulio's code calculates the full matrix, this will not be too performant for large datasets.

Also want to point out that the best rejection power of this test is obtained when sigma used in kernel function definition is actually pdf dependent. So will have to provide an option that takes a callable to calculate pdf-dependent sigma.

jonas-eschle · 2021-07-06T20:24:47Z

Hi all, coming back to this: it seems that we have a few implementations around and solvable problems ahead. I guess it is just time that someone tackles it, as I think it would be a good addition here.

Maybe it could even be a student project.

I can gladly review and help with implementation specificalities that concern speed, but I don't know enough about the test nor the usage of it in the wild.

Has someone the interest to go actively forward to PR this?

eduardo-rodrigues added the enhancement New feature or request label Jul 22, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding 'Energy Test' two-sample test #31

Adding 'Energy Test' two-sample test #31

dylanjaide commented Jul 22, 2020

marinang commented Jul 22, 2020

eduardo-rodrigues commented Jul 22, 2020

chrisburr commented Jul 22, 2020 •

edited

Loading

dylanjaide commented Jul 22, 2020 •

edited

Loading

jcob95 commented Jul 27, 2020 •

edited

Loading

marinang commented Jul 30, 2020 •

edited

Loading

jcob95 commented Aug 3, 2020

marinang commented Sep 28, 2020 •

edited

Loading

jonas-eschle commented Apr 20, 2021

eduardo-rodrigues commented Apr 20, 2021

jcob95 commented Apr 21, 2021

dylanjaide commented Apr 21, 2021

abhijitm08 commented Apr 21, 2021

dylanjaide commented Apr 21, 2021

jcob95 commented Apr 21, 2021

jonas-eschle commented Apr 21, 2021

abhijitm08 commented Apr 21, 2021 •

edited

Loading

jonas-eschle commented Jul 6, 2021

Adding 'Energy Test' two-sample test #31

Adding 'Energy Test' two-sample test #31

Comments

dylanjaide commented Jul 22, 2020

marinang commented Jul 22, 2020

eduardo-rodrigues commented Jul 22, 2020

chrisburr commented Jul 22, 2020 • edited Loading

dylanjaide commented Jul 22, 2020 • edited Loading

jcob95 commented Jul 27, 2020 • edited Loading

marinang commented Jul 30, 2020 • edited Loading

jcob95 commented Aug 3, 2020

marinang commented Sep 28, 2020 • edited Loading

jonas-eschle commented Apr 20, 2021

eduardo-rodrigues commented Apr 20, 2021

jcob95 commented Apr 21, 2021

dylanjaide commented Apr 21, 2021

abhijitm08 commented Apr 21, 2021

dylanjaide commented Apr 21, 2021

jcob95 commented Apr 21, 2021

jonas-eschle commented Apr 21, 2021

abhijitm08 commented Apr 21, 2021 • edited Loading

jonas-eschle commented Jul 6, 2021

chrisburr commented Jul 22, 2020 •

edited

Loading

dylanjaide commented Jul 22, 2020 •

edited

Loading

jcob95 commented Jul 27, 2020 •

edited

Loading

marinang commented Jul 30, 2020 •

edited

Loading

marinang commented Sep 28, 2020 •

edited

Loading

abhijitm08 commented Apr 21, 2021 •

edited

Loading