Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend ecs@mappings to handle type coercion #113124

Closed
flash1293 opened this issue Sep 18, 2024 · 7 comments
Closed

Extend ecs@mappings to handle type coercion #113124

flash1293 opened this issue Sep 18, 2024 · 7 comments
Labels
:Data Management/Data streams Data streams and their lifecycles >enhancement Team:Data Management Meta label for data/management team

Comments

@flash1293
Copy link
Contributor

Description

Rolling out ecs@mappings to integrations instead of relying on locally defined mappings for ecs fields lead to various issues in production systems, caused by fields being mapped incorrectly due to data arriving in an unexpected format.

For example "destination.port": "1234" will cause destination.port to be mapped as a keyword while it is specified as long in ECS.

To prevent this kind of problems for integrations as well as users using ecs@mappings directly, ecs@mappings should be extended to cover all ecs fields and specify their type to prevent type coercion errors instead of assuming types are correct in the incoming JSON.

To test this, the existing integration tests should be extended with each field receiving data in the wrong format, e.g. if a field is a number type, a string value like "123" is tested, if a field is a string type, a number value like 123 and 1.2 is tested.

Open questions

  • Do we also need to test ips and dates in a similar way or are they covered already with date_detection and the existing dynamic mappings?
  • Do we need to test some object fields in a similar way?
  • Any other type issues we need to keep in mind?

cc @eyalkoren @felixbarny

@flash1293 flash1293 added >enhancement needs:triage Requires assignment of a team area label labels Sep 18, 2024
@eyalkoren
Copy link
Contributor

I remember this was a conscious decision to omit dynamic templates that map to the default mapping type (if sent properly) in order to minimize the templates and simplify this component, knowing that this puts the burden on shippers to sent the right format, specifically with the port example in mind. I don't know what is the actual overhead of having more dynamic templates, but I think it does make sense to make it less error prone. If we add these, we should probably restrict the match_mapping_type that supports arrays of types.

Do we also need to test ips and dates in a similar way or are they covered already with date_detection and the existing dynamic mappings?

IP fields are already covered.
As for dates - with #112444, all ECS date fields are covered by ecs@mappings and the tests verify that by setting date_detection: false. There is an additional test to test these fields with different date formats.
Other than that, I am not sure what you think of adding, we are only relying on what Elasticsearch is capable of detecting automatically.

Do we need to test some object fields in a similar way?

Not sure what you mean by that. We test both documents with nested values and flattened. Other than that, I can only think of geo_point fields - see below.

Any other type issues we need to keep in mind?

  • all fields that are omitted because they are mapped to the default mapping they would get if sent properly - int, long and boolean
  • geo_point fields that we currently test only as an array of two values, but it can be used as an object with subfields.

@flash1293
Copy link
Contributor Author

Thanks @eyalkoren , seems like boolean and string/number types is the most important thing to check.

It looks like @felixbarny will be back next week, maybe he has some additional input.

@flash1293
Copy link
Contributor Author

I remember this was a conscious decision to omit dynamic templates that map to the default mapping type (if sent properly) in order to minimize the templates and simplify this component, knowing that this puts the burden on shippers to sent the right format, specifically with the port example in mind. I don't know what is the actual overhead of having more dynamic templates, but I think it does make sense to make it less error prone

Agreed, IMHO the spirit of ecs@mappings is to take on the burden of correct mappings for these well defined fields - if this doesn't work in practice for cases like port numbers, we need to take care of it. About the performance question of having more dynamic mappings defined, I would like to get an opinion from the Elasticsearch experts (@felixbarny or maybe @dakrone ).

@gbanasiak gbanasiak added the :Data Management/Data streams Data streams and their lifecycles label Sep 23, 2024
@elasticsearchmachine elasticsearchmachine added Team:Data Management Meta label for data/management team and removed needs:triage Requires assignment of a team area label labels Sep 23, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@felixbarny
Copy link
Member

First of all, I 100% agree that we need to find a solution for the issues listed in elastic/integrations#10848. But I'm not sure if we should do that by extending ecs@mappings to do type coercions.

We're using ecs@mappings in all data streams as a default, not just for integrations. So enforcing the type for all ECS fields, especially with path matches is scary. We could go from issues with a few false negative matches to a lot of false positive matches. Imagine a user creating a custom field called "time.until.interactive": 42. Based on the linked PR this will be mapped to a boolean, leading to ingestion issues. Also, user-specific fields may happen to have the same name as an ECS field but actually have a different semantic and type. While ECS/SemConv has a clear recommendation, I don't think we can and should enforce that schema and provide a degraded experience (document rejection or ignoring the field) when the data users send us doesn't conform to that.

Can we instead change integrations that are known to not produce ECS compliant source data? We can either fix the source or add integration-specific mappings. If some integrations aren't yet ready to fully adopt ecs@mappings, that's ok. These integrations can still add custom mappings to force type coercions. But I'd like to understand a bit better why our integrations produce source data that's not ECS compliant and work towards fixing the types.

What's obviously not ok is that there's an unknown risk for integrations to adopt ecs@mappings because it may not be obvious whether or not the integration relies on type coercion. But can we solve that by adding tests for integrations to check that the mapping is ECS compliant after data has been ingested from that integration?

The ecs@mappings effort was an attempt to turn ECS into more of a naming convention rather than mapping each field individually. I think that approach will also serve us well when we integrate OpenTelemetry Semantic Conventions into ECS (see elastic/ecs#2375).

In the long term, I think that the ideal way Elasticsearch should deal with types is that a field can have multiple types at the same time. That way, there's never a situation where we have to reject a document or ignore the field of a document because the type doesn't match. But to do things like range queries, that also relies on the source data actually having the intended type.

@eyalkoren
Copy link
Contributor

So enforcing the type for all ECS fields, especially with path matches is scary. We could go from issues with a few false negative matches to a lot of false positive matches. Imagine a user creating a custom field called "time.until.interactive": 42. Based on the linked PR this will be mapped to a boolean, leading to ingestion issues.

I completely agree with this point, and see my comment that demonstrates such an issue we already have within existing ECS fields. I wanted this PR to show what it's going to look like exactly for this kind of discussion!
If we use less patterns and more specific field names, these dynamic templates will grow much bigger. And worse, this will highly contradict this:

The ecs@mappings effort was an attempt to turn ECS into more of a naming convention rather than mapping each field individually. I think that approach will also serve us well when we integrate OpenTelemetry Semantic Conventions into ECS (see elastic/ecs#2375).

@flash1293
Copy link
Contributor Author

Thanks both! Should we close this for now and re-open in case some clear improvement emerges that won't have this downside?

@felixbarny felixbarny closed this as not planned Won't fix, can't repro, duplicate, stale Sep 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Data streams Data streams and their lifecycles >enhancement Team:Data Management Meta label for data/management team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants