Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Correlation] Generate correlations through multiple requests #9884

Open
wants to merge 1 commit into
base: 2.4-develop
Choose a base branch
from

Conversation

tomking2
Copy link
Contributor

What does it do?

This one might be up for debate, but I think it's a healthy addition that ensures consistent performance across all MySQL implementations (be that MariaDB, MySQL, AWS RDS or Azure MySQL etc.).

We were spotting that during ingestion of events with severe over-correlating values, we were seeing a major degradation of performance when generating correlation values.

The current code looks for the value in either value1 or value2 via an OR clause, leaving MySQL to find the best optimisation strategy for performing this query. We found that on our MySQL DB (which wasn't MariaDB), the optimisation strategy was not optimal for every request. It appeared that for over-correlating IOCs, instead of using a sort_union strategy on the value1 and value2 indexes, it was instead opting to use the event_id as the selected index. I couldn't find a consistent method to force the correct strategy.

What did this mean for our MISP instance? Instead of correlations taking <= 0.1s per attribute, it was taking around 5-7s per attribute. When trying to ingest an event with 4K objects and 80K attributes (every object had high correlations), it would've taken around 6 days to finish ingestion.

So what does this PR do? Instead of relying upon SQL to find a suitable optimisation strategy for the value conditions within the OR clause, separate DB calls are made for each value condition, enabling straightforward indexes to be used for the query. Correlation limits and bits like ACL are still adhered to during the lookups.

Potential impact - While this will ensure consistent performance when generating correlations, for MariaDB that appears to consistently create well-optimised queries (although I haven't done much validation here), a slight reduction in performance may be sighted as it now must make up to 3 DB calls instead of just the 1. However given many MISP instances will not be using MariaDB, I reckon this is a good tradeoff.

Questions

  • Does it require a DB change?
  • Are you using it in production?
  • Does it require a change in the API (PyMISP for example)?

@tomking2
Copy link
Contributor Author

tomking2 commented Sep 2, 2024

I've been chatting with someone on Gitter who is facing similar performance issues on calculation of correlations. They are using MariaDB.

Their query seems OK and creating the correct sort_union:

image

However the query is still extremely slow to return:
image

If we simulate splitting the request into multiple per field, we are seeing much better performance. For example this is one part of the query looking at value1.

image

Therefore I think this is a good change to make.

@adulau adulau requested a review from iglocska September 2, 2024 18:56
@tomking2 tomking2 force-pushed the feature/correlation_optimize_fix branch from d144694 to 77724e8 Compare September 11, 2024 14:57
@tomking2
Copy link
Contributor Author

Thanks @JakubOnderka,

I've updated to use array_push as suggested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants