Spec for log-based replication in the SDK #1012
aaronsteers
started this conversation in
General
Replies: 1 comment
To further this discussion, it may be helpful if we take the top 6 SQL sources, which are generally the best candidates for `LOG_BASED`, and drill down on their specific implementations.
---
This topic has come up in conversation a few times, and I wanted to start to paint a picture of how this might look.
Also, this has implications for the related min/max constraints definition here:
High-level proposal:
- Add a new `Stream.log_replication_key` property to the `Stream` class.
- If `self.log_replication_key is not None`, the tap will advertise `LOG_BASED` in the catalog as an available replication method for that stream.
- If `LOG_BASED` is selected and `log_based_replication_key` is not `None`, then we take a different codepath from the default `get_records()` and/or `get_batches()`.
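The first two points of the proposal could be sketched roughly like this. Note this is a hypothetical sketch, not actual SDK code: `available_replication_methods` and `OrdersStream` are invented for illustration; only `log_replication_key` comes from the proposal itself.

```python
# Hypothetical sketch of the proposed interface (not actual SDK code):
# a stream declares `log_replication_key`, and catalog generation
# advertises LOG_BASED whenever it is set.

class Stream:
    log_replication_key = None  # proposed new property; None by default

    @property
    def available_replication_methods(self):
        # Illustrative helper: LOG_BASED is advertised only when a
        # log replication key is declared on the stream class.
        methods = ["FULL_TABLE", "INCREMENTAL"]
        if self.log_replication_key is not None:
            methods.append("LOG_BASED")
        return methods


class OrdersStream(Stream):
    # e.g. an LSN/offset column exposed by the source's change log
    log_replication_key = "lsn"


print(OrdersStream().available_replication_methods)
```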
Nitty-gritty detail 1: Bookmarks
I'd suggest we allow targets to emit both the bookmark for the replication key and the bookmark for the log-based replication key, and that these not be mutually exclusive options. If we go this direction, then a bookmark might look like:
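The original example here appears to have been lost, so the following is only a guess at the shape: a state bookmark carrying both an `INCREMENTAL` bookmark and a `LOG_BASED` bookmark side by side. The key names are illustrative, not an actual SDK spec.

```python
import json

# Hypothetical state payload in which a stream carries both a
# replication-key bookmark and a log-based bookmark at the same time.
# Key names are illustrative only.
state = {
    "bookmarks": {
        "orders": {
            "replication_key": "updated_at",
            "replication_key_value": "2022-10-01T00:00:00Z",
            "log_replication_key": "lsn",
            "log_replication_key_value": 102938,
        }
    }
}

print(json.dumps(state, indent=2))
```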
Nitty-gritty detail 2: Codepath
We could have a completely new `Stream.get_records_logbased()` method, and possibly also a `Stream.get_batches_logbased()`. The inputs of that method would look similar to `get_records()`, except the implementation would be different.

An alternative implementation would be to expand `get_records()` and `get_batches()` with either a `log_based_replication_key_start` arg, or perhaps a `SyncConstraint` object that serves the needs of both "normal" incremental as well as "log-based" incremental. The developer would then interrogate that argument and try to meet the constraints in whichever way makes sense for the specific implementation. And if no constraints exist, this is identical to `FULL_TABLE`.
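The `SyncConstraint` alternative could look something like the sketch below. None of these names exist in the SDK today; the field names and the `get_records()` shape are assumptions made for illustration.

```python
# Hypothetical SyncConstraint: one argument that can describe "normal"
# incremental sync, log-based incremental sync, or no constraint at all
# (which degrades to FULL_TABLE). All names are illustrative.
from dataclasses import dataclass
from typing import Any, Iterable, List, Optional


@dataclass
class SyncConstraint:
    replication_key: Optional[str] = None
    replication_key_start: Any = None
    log_replication_key: Optional[str] = None
    log_replication_key_start: Any = None

    @property
    def is_unconstrained(self) -> bool:
        # No bookmarks at all: behaves like FULL_TABLE.
        return (
            self.replication_key_start is None
            and self.log_replication_key_start is None
        )


def get_records(rows: Iterable[dict], constraint: SyncConstraint) -> List[dict]:
    """Illustrative get_records() that interrogates the constraint."""
    if constraint.is_unconstrained:
        return list(rows)  # FULL_TABLE behavior
    if constraint.log_replication_key_start is not None:
        # LOG_BASED: filter on the log position (e.g. an LSN).
        key = constraint.log_replication_key
        return [r for r in rows if r[key] > constraint.log_replication_key_start]
    # INCREMENTAL: filter on the ordinary replication key.
    key = constraint.replication_key
    return [r for r in rows if r[key] > constraint.replication_key_start]


rows = [{"id": 1, "lsn": 100}, {"id": 2, "lsn": 200}]
print(get_records(rows, SyncConstraint(log_replication_key="lsn",
                                       log_replication_key_start=150)))
```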
Pros:

- A single interface can serve `FULL_TABLE`, `INCREMENTAL`, and `LOG_BASED`.

Cons:
Possible pitfall: Log-based key processing that is not stream-centric.
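For illustration, suppose the source exposes one change log shared by all streams, where each entry names the table it belongs to. This payload shape is hypothetical, not taken from any particular source:

```python
# Hypothetical change-log payload in which a single log interleaves
# entries for multiple streams (illustrative only):
log = [
    {"lsn": 101, "table": "orders",    "action": "insert", "record": {"id": 1}},
    {"lsn": 102, "table": "customers", "action": "update", "record": {"id": 7}},
    {"lsn": 103, "table": "orders",    "action": "delete", "record": {"id": 1}},
]

# Each entry names its table, so a per-stream reader must scan the whole
# log and skip entries belonging to other streams.
orders_entries = [e for e in log if e["table"] == "orders"]
print(len(orders_entries))
```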
If the API stores its logs like this, the log-based processor either needs to:
- Each `Stream` object redundantly processes the same log when syncing, ignoring the data it doesn't need, or
- the log is processed once at a higher level, with records handed off to the relevant streams.

Exactly how we would hand off that data to separate `Stream` classes is unclear, and in this case, the paradigm of each `Stream` running `get_records()` doesn't seem to be a good match.

We explored this a bit in:
The crux seems to be whether we can and should process the log at the Stream level or if we need to process at a higher-level abstraction.
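As a purely illustrative sketch of the higher-level option (none of these names exist in the SDK), a processor above the stream level could read the log once and demultiplex entries into per-stream buffers:

```python
from collections import defaultdict

# Hypothetical higher-level log processor: reads the shared log once and
# demultiplexes entries to per-stream buffers, instead of each Stream
# re-reading the whole log. All names here are illustrative.
def demultiplex(log_entries):
    per_stream = defaultdict(list)
    for entry in log_entries:
        # Route each entry by the stream/table it belongs to.
        per_stream[entry["table"]].append(entry["record"])
    return per_stream


log = [
    {"lsn": 101, "table": "orders",    "record": {"id": 1}},
    {"lsn": 102, "table": "customers", "record": {"id": 7}},
    {"lsn": 103, "table": "orders",    "record": {"id": 2}},
]
buffers = demultiplex(log)
print(sorted(buffers))         # stream names seen in the log
print(len(buffers["orders"]))  # records routed to "orders"
```

Each buffer could then be drained by the matching `Stream` instance, which is one way the "process at a higher-level abstraction" approach might hand records off.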