Spec Proposal: Stream Map "v2" syntax with sequenced declaration of transforms rather than hierarchical #1054

aaronsteers · 2022-10-10T00:21:08Z

The current implementation of stream maps has a very hierarchical view, with dunder (double underscore) operators driving logic for special transformations.

https://sdk.meltano.com/en/latest/stream_maps.html

This dialect has its limitations and is hard to write without having the docs handy.

This discussion proposes a new sequential syntax that is closer in readability to the transferwise transform-field syntax, while retaining the flexibility of the current SDK implementation.

(Scroll below to see the full contents.)

stream_maps:

# Set max-len truncation rules using property name wildcard.
- apply-to: "*.description"
  description: Truncate all 'description' properties to 255 characters or less.
  new-value: "description.left(255)"

# Data obfuscation in multiple streams using property name wildcard.
- apply-to: "*.*phone_number"
  description: Hash all phone numbers
  new-value: md5(self)
  new-type: string

# Drop data from sub properties
- apply-to: "users.location"
  description: Drop street_address_1 and street_address_2
  remove-child-property: street*

# Inject new data node as a sub property
- apply-to: "users.location"
  description: Add an 'is_usa' flag for compliance
  add-child-property: is_usa
  new-value: "'Y' if self['country'] in ['US', 'USA'] else 'N'"
  new-type: string

# Replace a sensitive user ID key with a hashed non-sensitive one:
- apply-to: users
  description: Add hashed ID property to the 'users' table
  new-child-property: user_id_hash
  new-value: "md5(user_id)"
  new-type: string
- apply-to: users
  description: Remove cleartext user_id field from the 'users' table
  remove-child-property: user_id

# Override the primary key properties for a stream:
- apply-to: users
  new-key-properties: [user_id]

# Rename a stream:
- apply-to: users
  rename-to: user_table

# Change casing on a top-level property name everywhere we find it
- apply-to: "*.userID"
  rename-to: user_id

# Remove nodes by name everywhere we find it (recursive to subnodes)
- apply-to: "**"  # double '**' (or similar convention) means: traverse subnodes/subproperties.
  remove-child-node: ssn
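
As a rough illustration of how `new-value` expressions like the ones above might be evaluated, here is a minimal Python sketch. The function name `eval_new_value`, the binding of `self` to the targeted node, and the exposure of sibling properties (such as `user_id`) as bare names are assumptions made for this example, not part of the proposal. It uses plain `eval()` with builtins disabled, which is not a real sandbox; a production implementation would want a restricted expression evaluator.

```python
import hashlib


def md5(value: str) -> str:
    """Hex MD5 digest of a string (the helper the examples call `md5`)."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


def eval_new_value(expression: str, node_value, record: dict):
    """Evaluate a new-value expression.

    Assumption: `self` is the node named by apply-to, and the record's
    top-level properties are available as plain names.
    """
    names = {**record, "self": node_value, "md5": md5}
    # Builtins are disabled, but eval() is still NOT safe for untrusted input.
    return eval(expression, {"__builtins__": {}}, names)


location = {"country": "US", "city": "Austin"}
is_usa = eval_new_value(
    "'Y' if self['country'] in ['US', 'USA'] else 'N'", location, {}
)
user_id_hash = eval_new_value("md5(user_id)", None, {"user_id": "42"})
```

Here `is_usa` evaluates to `'Y'`, and `user_id_hash` is the MD5 hex digest of the string `"42"`.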

By importing the re module and pre-defining a regex_replace() function, we can do advanced renames:

stream_maps:
# Change all top-level property names to camel case
- apply-to: "*.*"
  rename-to: regex_replace(node.name, 's/(_|\b)([a-z])/\u\2/g;')

We could also pre-define camel_case() and snake_case() functions, which would use a regex similar to the one above or some other built-in Python methods:

stream_maps:
# Change all top-level property names to snake case
- apply-to: "*.*"
  rename-to: snake_case(node.name)
  # Or to do the opposite:
  # rename-to: camel_case(node.name)
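
For reference, here is one way such helpers could look in Python. This is only a sketch: the exact casing rules are assumptions (for instance, this `camel_case()` keeps the first word lowercase, whereas the sed-style regex above would also capitalize it). Note that Python's `re` module does not support the `\u` case-conversion escape in replacement strings, so the conversion is done with string methods instead.

```python
import re


def snake_case(name: str) -> str:
    """Convert camelCase or PascalCase names to snake_case."""
    # Underscore between a lowercase letter/digit and a following capital.
    step1 = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Split a run of capitals from a following capitalized word,
    # e.g. "HTTPServer" -> "HTTP_Server".
    step2 = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", step1)
    return step2.lower()


def camel_case(name: str) -> str:
    """Convert snake_case names to camelCase (first word stays lowercase)."""
    head, *tail = name.split("_")
    return head + "".join(part.capitalize() for part in tail)


print(snake_case("userID"))     # user_id
print(camel_case("user_id"))    # userId
```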

Possible list of transform verbs:

With this general pattern, there's still a lot to tweak in terms of the exact syntax. The above handles most or all of the existing capability, while also better handling type transformations, wildcard applications, and operations on subnodes of the record object.

apply-to - where to apply changes. This can be a stream name, a property, or a subproperty. It can also use wildcards to apply changes to multiple nodes. (Use of regex would be nice, but would require more escaping, and would probably warrant splitting out the stream name part from the property/subproperty part.)
description - does nothing except improve readability and provide human-readable text that can be printed when the transform is applied (for instance, if a transform fails or has bad syntax).
new-value - a formula or static value to apply whenever changing values, or whenever creating new property nodes/subnodes.
new-type - a string representing the new JSON type to apply, if the type is changing or if a new node is being created.
add-child-node - if adding nodes, the name of the node to add, relative to apply-to. If apply-to is a stream, then a top-level property will be added. If apply-to is a property, then a subproperty will be added.
remove-child-node - the name of the node (or nodes, if a pattern is provided) to remove. If apply-to is a stream name or stream name pattern, then top level node(s) will be removed. If apply-to is a property name or property name pattern, then subnode(s) will be removed relative to that context.
new-key-properties - replaces list of key properties in the stream.
new-replication-key - replaces the replication key.
rename-to - If applied to a stream, would rename the stream; if applied to one or more property nodes, would rename those nodes.
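
To make the `apply-to` wildcard matching concrete, here is a small sketch built on Python's standard `fnmatch` module. The function name `matches` and the choice to split the pattern at the first dot into a stream part and a property part are assumptions for illustration; the recursive `**` convention and nested subproperties are not handled here.

```python
from fnmatch import fnmatchcase
from typing import Optional


def matches(apply_to: str, stream: str, prop: Optional[str] = None) -> bool:
    """Return True if an apply-to pattern targets the given stream or property."""
    # Assumption: the text before the first '.' selects streams,
    # and the remainder (if any) selects properties.
    stream_pattern, _, prop_pattern = apply_to.partition(".")
    if not fnmatchcase(stream, stream_pattern):
        return False
    if prop is None:
        # Stream-level target: the pattern must not name a property.
        return prop_pattern == ""
    return prop_pattern != "" and fnmatchcase(prop, prop_pattern)


print(matches("*.description", "users", "description"))          # True
print(matches("*.*phone_number", "users", "home_phone_number"))  # True
print(matches("users", "orders"))                                # False
```

Shell-style globbing keeps the patterns readable and avoids the extra escaping that full regex support would need.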

Regarding performance

As occurs with today's "v1" stream map implementation, all rules are evaluated or "compiled" once, up front, upon the declared schema of the stream or streams.
During the upfront evaluation, a set of dynamic transformation functions are initialized and cached for the duration of the stream sync operation.
This has some implications:
1. Per-record execution costs only as much as the transformations that directly apply to that record; there is no need to rescan the schema to find which rules apply to which streams and/or record nodes. (If a transformation exists for 'users' but we're handling records for the 'orders' stream, the non-applicable transformation rules add zero per-record cost.)
2. Value transformations (i.e. the new-value operation) can use expressions that rely on the existing value and the node, schema, and stream context.
3. Schema transformations (renames, aliasing, etc.) can use expressions that rely on anything contained in the JSON Schema object, as well as the key properties and stream name, which are all known ahead of time.
4. Schema transformations do not have access to the values of the records. So a rename of a node or column based upon the value of the node is not possible.
5. Schema transformations that create impossible or disallowed outputs would fail before any records are processed, for example if you tried to rename a property to the empty string, or to rename a property or stream to the same name as an existing property or stream.
6. Unlike the V1 implementation, transformations in V2 would always be declared and applied in a strict order. This ensures that transformations are stable and deterministic, even for very elaborate combinations of sequential transformations to the same nodes.
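
The "compile once, apply per record" idea in the points above can be sketched roughly as follows. Everything here is hypothetical: the `fn` key stands in for a compiled expression, and `compile_rules` / `apply_rules` are illustrative names, not SDK functions. Exact stream-name matching is used to keep the sketch short.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Transform = Callable[[Record], Record]


def compile_rules(rules: List[dict], stream_names: List[str]) -> Dict[str, List[Transform]]:
    """Resolve rules against the known streams once, up front."""
    compiled: Dict[str, List[Transform]] = defaultdict(list)
    for rule in rules:  # iterating in declaration order keeps results deterministic
        for stream in stream_names:
            if rule["apply-to"] == stream:
                compiled[stream].append(rule["fn"])
    return dict(compiled)


def apply_rules(compiled: Dict[str, List[Transform]], stream: str, record: Record) -> Record:
    """Per-record work: run only the transforms cached for this stream."""
    for fn in compiled.get(stream, []):
        record = fn(record)
    return record


rules = [
    {"apply-to": "users", "fn": lambda r: {**r, "email": r["email"].strip()}},
    {"apply-to": "users", "fn": lambda r: {**r, "email": r["email"].lower()}},
]
compiled = compile_rules(rules, ["users", "orders"])

print(apply_rules(compiled, "users", {"email": " A@B.com "}))  # {'email': 'a@b.com'}
print(apply_rules(compiled, "orders", {"id": 1}))              # {'id': 1}
```

Records from `orders` never touch the `users` rules, since the lookup table holds no entry for that stream.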

Backwards compatibility

Since we already accept a stream_map_config setting separate from the main stream_maps config, we could very likely do this in a backwards-compatible way, simply by using the new logic when we see stream_map_config: { version: 2 } and falling back to the v1 transformation logic otherwise.
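
A minimal sketch of that version switch might look like the following. The `version` key comes from the example above; the function name and the choice to reject unknown versions are assumptions.

```python
def select_stream_map_version(config: dict) -> int:
    """Return the stream map syntax version requested by the config (default: 1)."""
    stream_map_config = config.get("stream_map_config") or {}
    version = stream_map_config.get("version", 1)
    if version not in (1, 2):
        # Assumption: unknown versions are an error rather than a silent fallback.
        raise ValueError(f"Unsupported stream map version: {version!r}")
    return version


print(select_stream_map_version({}))                                     # 1
print(select_stream_map_version({"stream_map_config": {"version": 2}}))  # 2
```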

cc @edgarrmondragon

edgarrmondragon · 2022-10-11T00:37:36Z

Maintainer

I really like the idea!

The current structure uses a dictionary/object with a key for each mapped stream, so it wasn't possible to have duplicate configs for a single stream.

That's possible with this syntax, so maybe we need the rule that the latest config applies? Or all apply and it's possible to have a sequence of transformations in a single stream map run?

1 reply

aaronsteers Oct 12, 2022
Author

@edgarrmondragon re:

That's possible with this syntax, so maybe we need the rule that the latest config applies? Or all apply and it's possible to have a sequence of transformations in a single stream map run?

I was thinking about this too. I'm not sure if we need a signal to 'stop here' if a condition is met, but short of that, my expectation when drafting this spec was that each map transform would be applied in sequence, which therefore allows any node to have multiple transforms applied.

E.g., you could have one transform that removes trailing spaces (or leading zeros) from a field, and a second transform that hashes the field using MD5 or similar. In cases where you need the MD5 to be padding-agnostic, this is actually a very nice solution. And similarly for casing conformance: if we want to apply lower() before applying MD5, this 'just works' in a way that I think the user would find intuitive, if both operate on the target nodes in sequence.
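
To illustrate the strip / lower / MD5 chain, here is a tiny sketch in Python. The names `transforms` and `apply_in_sequence` are made up for the example; the point is only that each step receives the output of the previous one.

```python
import hashlib


def md5(value: str) -> str:
    """Hex MD5 digest of a string."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()


# Declared order is the order of application.
transforms = [str.strip, str.lower, md5]


def apply_in_sequence(value, steps):
    """Feed the output of each step into the next one."""
    for step in steps:
        value = step(value)
    return value


padded = apply_in_sequence("  Alice@Example.com ", transforms)
clean = apply_in_sequence("alice@example.com", transforms)
print(padded == clean)  # True
```

Both inputs hash to the same value because the whitespace and casing are normalized before the hash step runs.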
