[Correlation] Generate correlations through multiple requests #9884

tomking2 · 2024-08-27T11:00:06Z

What does it do?

This one might be up for debate, but I think it's a healthy addition that ensures consistent performance across all MySQL implementations (be that MariaDB, MySQL, AWS RDS or Azure MySQL etc.).

We spotted that when ingesting events with severely over-correlating values, performance degraded badly while generating correlation values.

The current code looks for the value in either value1 or value2 via an OR clause, leaving MySQL to pick the optimisation strategy for the query. We found that on our MySQL DB (which wasn't MariaDB), the chosen strategy was not optimal for every request. For over-correlating IOCs, rather than using a sort_union strategy on the value1 and value2 indexes, it appeared to select the event_id index. I couldn't find a consistent method to force the correct strategy.

What did this mean for our MISP instance? Instead of taking <= 0.1s per attribute, correlation was taking around 5-7s per attribute. When trying to ingest an event with 4K objects and 80K attributes (every object had high correlations), it would have taken around 6 days to finish ingestion.
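As a rough check of those figures, the estimate can be reproduced with simple arithmetic. This is only a back-of-the-envelope sketch: it assumes ingestion time is dominated by per-attribute correlation and uses the 80K attribute count and the per-attribute timings quoted above; the function name is made up for illustration.

```python
# Back-of-the-envelope ingestion estimate (assumes correlation time dominates).
ATTRIBUTES = 80_000
SECONDS_PER_DAY = 86_400


def ingestion_days(seconds_per_attribute: float) -> float:
    """Total correlation time in days for all attributes at a given per-attribute cost."""
    return ATTRIBUTES * seconds_per_attribute / SECONDS_PER_DAY


slow_low = ingestion_days(5.0)          # roughly 4.6 days
slow_high = ingestion_days(7.0)         # roughly 6.5 days
fast_hours = ingestion_days(0.1) * 24   # roughly 2.2 hours

print(f"slow: {slow_low:.1f}-{slow_high:.1f} days, fast: {fast_hours:.1f} hours")
```

At the midpoint of 6s per attribute this comes to about 5.6 days, in line with the "around 6 days" figure, versus a couple of hours at 0.1s per attribute.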

So what does this PR do? Instead of relying upon SQL to find a suitable optimisation strategy for the value conditions within the OR clause, separate DB calls are made for each value condition, enabling straightforward indexes to be used for the query. Correlation limits and bits like ACL are still adhered to during the lookups.
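To illustrate the idea outside of MISP, here is a minimal sketch in Python using SQLite. It is not the code from this PR: the table layout is a simplified stand-in for the attribute table, the function names are hypothetical, and it only models the two value lookups (not ACL checks or any further call). It shows one lookup per column, each able to use a plain single-column index, with the results merged and de-duplicated in application code.

```python
import sqlite3


def correlate_with_or(conn, value, limit):
    """Single query: the planner must decide how to combine the two indexes."""
    rows = conn.execute(
        "SELECT id FROM attributes WHERE value1 = ? OR value2 = ? "
        "ORDER BY id LIMIT ?",
        (value, value, limit),
    ).fetchall()
    return [row[0] for row in rows]


def correlate_split(conn, value, limit):
    """One query per column, merged and de-duplicated in the application."""
    ids = set()
    for column in ("value1", "value2"):  # fixed column names, never user input
        rows = conn.execute(
            f"SELECT id FROM attributes WHERE {column} = ? ORDER BY id LIMIT ?",
            (value, limit),
        ).fetchall()
        ids.update(row[0] for row in rows)
    # Re-apply the limit after merging so the result matches the single query.
    return sorted(ids)[:limit]


# Simplified, hypothetical schema with one index per value column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attributes ("
    "id INTEGER PRIMARY KEY, event_id INTEGER, value1 TEXT, value2 TEXT)"
)
conn.execute("CREATE INDEX idx_value1 ON attributes (value1)")
conn.execute("CREATE INDEX idx_value2 ON attributes (value2)")
conn.executemany(
    "INSERT INTO attributes (event_id, value1, value2) VALUES (?, ?, ?)",
    [
        (1, "evil.example", ""),
        (2, "file.exe", "evil.example"),
        (3, "benign.example", ""),
        (4, "evil.example", "evil.example"),
    ],
)

split_ids = correlate_split(conn, "evil.example", 10)
or_ids = correlate_with_or(conn, "evil.example", 10)
print(split_ids, or_ids)
```

Both functions return the same ids here. The split version costs extra round trips, but each query has only one obvious index to use, which is the tradeoff being proposed.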

Potential impact - While this will ensure consistent performance when generating correlations, MariaDB already appears to create well-optimised queries consistently (although I haven't done much validation here), so it may see a slight reduction in performance as it now has to make up to 3 DB calls instead of just 1. However, given many MISP instances will not be using MariaDB, I reckon this is a good tradeoff.

Questions

Does it require a DB change?
Are you using it in production?
Does it require a change in the API (PyMISP for example)?

tomking2 · 2024-09-02T12:22:36Z

I've been chatting with someone on Gitter who is facing similar performance issues on calculation of correlations. They are using MariaDB.

Their query seems OK and is creating the correct sort_union:

However the query is still extremely slow to return:

If we simulate splitting the request into one query per field, we see much better performance. For example, this is the part of the query that looks at value1.

Therefore I think this is a good change to make.


tomking2 · 2024-09-11T14:58:42Z

Thanks @JakubOnderka,

I've updated to use array_push as suggested.

adulau requested a review from iglocska September 2, 2024 18:56

JakubOnderka reviewed Sep 7, 2024

app/Model/Correlation.php (outdated)

chg: [Correlation] Split generating correlations into multiple requests to avoid poor SQL optimization strategy

77724e8

tomking2 force-pushed the feature/correlation_optimize_fix branch from d144694 to 77724e8 Compare September 11, 2024 14:57
