Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

snowflake: introduce new output based on snowpipe streaming #2926

Draft
wants to merge 43 commits into
base: main
Choose a base branch
from

Conversation

rockwotj
Copy link
Collaborator

@rockwotj rockwotj commented Oct 9, 2024

This is a port to Golang of Snowpipe Streaming. The only existing implementation is here and in Java. For more information about the Java implementation, check out this deep dive someone published.

The SDK we have here is architecturally much simpler - as the connect framework handles a lot of the complexity around batching, retries, etc. This SDK is then hooked up into an output so it can be leveraged in Redpanda Connect.

IMPLEMENTATION NOTES:

  • Some of the code is a very fiddly straight port of the Java code. This code lies in compat.go and is unit tested to prove exact output compared to the Java implementation (and the constants there are derived from debugging the original Java source).

@rockwotj rockwotj force-pushed the snowpipe branch 2 times, most recently from 641f7fc to 962aefa Compare October 10, 2024 17:52
@rockwotj rockwotj changed the title snowflake: introduce new snowflake output based on snowpipe streaming snowflake: introduce new output based on snowpipe streaming Oct 11, 2024
@rockwotj rockwotj force-pushed the snowpipe branch 8 times, most recently from ccfb3b2 to fcf9521 Compare October 16, 2024 14:55
An example being recomputing/refreshing authentication tokens
Probably need a way to cancel pending HTTP requests in the
configureClient request..
That ended up working out very nicely, and now we can properly shutdown
quickly.
This optimizes a pass over the batch if the user doesn't have a mapping
With timestamps and such we're going to need to know extra parameters
let's just make this a proper interface and struct
Java uses an actual thread ID, we'll just generate a randomized one
And add an integration test
Currently fails due to an unknown reason I'm investigating
Also fix linter errors with unused stuff
Also add a specific integration test to debug the SB16 issues
The underlying scanner does this and we need to match otherwise we'll
misalign expectations.
These are SB* physical column, which is subject to the column size optimization.
So this commit adds support for scale!=0 in Int128 values, this is
important for supporting all possible numeric values in snowflake,
but this is a narly bit of math here. All the code here is adapted from
the apache arrow project (licensing is properly attributed) with some
light fixes around edge cases.

I will attempt to upstream this and our bug fixes/perf improvements to
the arrow project so we don't have to maintain this code forever. If
that happens we can take a dependency on apache arrow go and drop this
int128 package in our tree.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants