-
Notifications
You must be signed in to change notification settings - Fork 24.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tag redacted document in ingest pipeline #113552
Tag redacted document in ingest pipeline #113552
Conversation
Documentation preview: |
Hi @samxbr, I've created a changelog YAML for you. |
@elasticmachine update branch |
Pinging @elastic/es-data-management (Team:Data Management) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a couple quick comments but otherwise this looks good
boolean isRedacted = fieldValue.equals(redacted) == false; | ||
|
||
// document newly redacted | ||
if (alreadyRedacted == false && isRedacted) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it matter if it's already redacted? We only skip setting this if the metadata field is true already. It doesn't look like it can ever be false. Couldn't we just overwrite the field regardless of its existing value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is possible that a redact processor does not actually modify the document (the pattern is not found), in that case I think we should not set _is_redacted
to true .
In the case of multiple redact processors, if the first processor redacts the doc and sets _is_redacted
to true, and the second processor does not redact the doc, we want to skip setting _is_redacted
in the second processor to prevent overriding the previously true value.
if (traceRedact == false) return; | ||
if (fieldValue == null) return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Style nitpick: These conditionals could be combined, but more importantly I think it's best to always put the braces in.
protected static final String REDACT_KEY = "_redact"; | ||
protected static final String IS_REDACTED_KEY = "_is_redacted"; | ||
protected static final String METADATA_PATH_REDACT = IngestDocument.INGEST_KEY + "." + REDACT_KEY; | ||
// indicates if document has been redacted | ||
protected static final String METADATA_PATH_REDACT_IS_REDACTED = METADATA_PATH_REDACT + "." + IS_REDACTED_KEY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we condense these down into one constant? It's a little hard to see exactly what the final key is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of these key constants are used in tests, and I didn't want to hard code the constants in other places, that's why I had to define them separately.
If it helps, I can add a comment for the exact key _ingest._redact._is_redacted
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Adds a new option trace_redact in redact processor to indicate a document has been redacted in the ingest pipeline. If a document is processed by a redact processor AND any field is redacted, ingest metadata _ingest._redact._is_redacted = true will be set. Closes elastic#94633
protected static final String REDACT_KEY = "_redact"; | ||
protected static final String IS_REDACTED_KEY = "_is_redacted"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I'm adding those fields to the Elasticsearch specification, I'm wondering, why do we need underscores here? Under _ingest
, we already have timestamp
and pipeline
that don't use underscores.
Adds a new option
trace_redact
in redact processor to indicate a document has been redacted in the ingest pipeline. If a document is processed by a redact processor AND any field is redacted, ingest metadata_ingest._redact._is_redacted = true
will be set.Closes #94633