Spec for log-based replication in the SDK #1012

aaronsteers · 2022-09-28T20:30:08Z

aaronsteers
Sep 28, 2022

This topic has come up in conversation a few times, and I wanted to start to pain a picture on how this might look.

Also, there are implications of this on the related min/max constraints definition here:

Consider passing min and max constraints in the `get_records()` and `get_batches()` method signatures #1011

High-level proposal:

We start by adding a new Stream.log_replication_key property of the Stream class.
When the SDK detects that self.log_replication_key is not None, it will advertise LOG_BASED in the catalog as an available replication method for that stream.
When the tap is invoked with a requested replication method of LOG_BASED and log_based_replication_key is not none, then we take a different codepath from the default get_records() and/or get_batches().

Nitty-gritty detail 1: Bookmarks

I'd suggest we allow targets to emit both the bookmark for replication key and also log-based replication key, and that these not be mutually exclusive options. If we go this direction, then a bookmark might look like:

{
  "users": {
    # Same as today:
    "replication_key": "updated_at",
    "replication_key_value": "2022-01-01T12:00:00",
    # New:
    "log_based_replication_key_value": "1289177235"
  }
}

Nitty-gritty detail 2: Codepath

We could have a completely new Stream.get_records_logbased() method and possibly also Stream.get_batches_logbased(). The inputs of that method would look similar to get_records(), except the implementation would be different.

An alternative implementation would be to expand get_records() and get_batches() with either a log_based_replication_key_start arg, or perhaps a SyncConstraint object that serves the needs of both "normal" incremental as well as "log based" incremental.

@dataclass
class SyncConstraint():
    min_replication_key_value: Any | None
    max_replication_key_value: Any | None
    min_log_based_replication_key_value: Any | None

Then the developer would interrogate that argument and try to meet the constraints in whichever way makes sense for the specific implementation. And if no constraints exist, this is identical to FULL_TABLE.

Pros:

A single codepath for FULL_TABLE, INCREMENTAL, and LOG_BASED.
Better code reuse.

Cons:

The codepath itself is more complex, with more branching.

Possible pitfall: Log-based key processing that is not stream-centric.

If the API stores its logs like this, the log-based processor either needs to:

parse the logs multiple times (meaning each Stream object redundantly processes the same log when syncing and ignores unneeded data, or
parse the logs a single time and then pass data to each stream handler as needed.

{"table": "users": "new_data": {...}, "old_data": {...}...}
{"table": "groups": "new_data": {...}, "old_data": {...}...}
{"table": "users": "new_data": {...}, "old_data": {...}...}
{"table": "addresses": "new_data": {...}, "old_data": {...}...}
{"table": "orders": "new_data": {...}, "old_data": {...}...}
{"table": "items": "new_data": {...}, "old_data": {...}...}
{"table": "items": "new_data": {...}, "old_data": {...}...}

Exactly how we would handoff that data to separate Stream classes is unclear, and in this case, the paradigm of each Stream running get_records() doesn't seem to be a good match.

We explored this a bit in:

New silent parent stream or "MultiStream" class proposal for taps #166

The crux seems to be whether we can and should process the log at the Stream level or if we need to process at a higher-level abstraction.

aaronsteers · 2022-09-28T20:38:33Z

aaronsteers
Sep 28, 2022
Author

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spec for log-based replication in the SDK #1012

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Spec for log-based replication in the SDK #1012

aaronsteers Sep 28, 2022

High-level proposal:

Nitty-gritty detail 1: Bookmarks

Nitty-gritty detail 2: Codepath

Possible pitfall: Log-based key processing that is not stream-centric.

Replies: 1 comment

aaronsteers Sep 28, 2022 Author

aaronsteers
Sep 28, 2022

aaronsteers
Sep 28, 2022
Author