feat(risingwave): add streaming DDLs #8239
It seems I don't have permission to invite reviewers. @gforsyth @cpcloud @chloeh13q @deepyaman Could you please review this or invite the appropriate reviewers?
Hi @KeXiangWang - thanks for opening this! We're unfortunately in the midst of a large internals refactor moving all backends to use
It's OK. I have no experience with sqlglot. After the PR is merged, could your engineers please also help refactor the related code in this PR? I'd be glad to help if needed.
@chloeh13q @cpcloud: bumping this. Is the sqlglot refactoring something we can help with?
It looks like the first order of business would be to rebase this PR on
OK, I'll try my best.
35e6474 to a3e2717
Hi @cpcloud, I've rebased the PR onto the newest main and fixed all the issues introduced by the
Besides, I found that some tests fail even on the main branch, so I've left them unchanged for now. Do you have any ideas? Is there any configuration I missed?
Hey @KeXiangWang -- thanks for putting this together and sorry for the delay in reviewing!
I've taken a first pass over the new DDL stuff. I am not a streaming expert, so maybe @chloeh13q and @deepyaman can also chime in on some of the API design stuff.
Since some of these methods may become "standard" on other (future or current) streaming backends we want to be very deliberate in how we design them. We especially want to make sure that we don't overindex on a single backend when coming up with names (this is very difficult).
ibis/backends/risingwave/__init__.py
Outdated
def create_materialized_view(
    self,
    name: str,
    obj: ir.Table,
    *,
    database: str | None = None,
    schema: str | None = None,
    overwrite: bool = False,
) -> ir.Table:
This would be a new API for Ibis -- should this be a standalone method? One possible alternative is to have a keyword argument to `create_view` that creates a materialized view instead of a "regular" view.
I think this information may be helpful in helping us decide on a suitable API: https://materialize.com/guides/materialized-views/#how-do-materialized-views-work-in-specific-databases
Personally I think it depends on whether people tend to think of materialized views as a special type of view, or as completely separate from normal views. It sounds like several backends just treat materialized views as a special type of view, as opposed to "its own thing".
Also wondering:
I saw that there is a ticket for implementing materialized views for a broader set of backends: #5964 - is this an API we'd want to introduce in the base class?
It's easy to make `materialized view` a special type of `view` in this PR. But I would say they should be different objects in databases. The underlying mechanisms and performance are largely different. For streaming databases like Risingwave and Materialize, `materialized view` is the core concept and the hardest part to design, while `view` is much more trivial. So we would like to keep `materialized view` and `view` separate.
Do we also want to distinguish incremental materialized views? I think with streaming incremental materialized views, the line becomes blurry between view and materialized view. But across different backends, they are still pretty distinct concepts. I'd vote to make them distinct for now, and to be explicit about which backends support which.
I don't think we should distinguish here without a clear explanation of what the difference between incremental and non-incremental streaming materialized views is.
It's also adding more scope than is needed for now.
I think we should keep this API separate from `create_view` for now; it's likely that materialized views will require some additional kwargs that don't apply to views. The word "view" here is really being stretched to its limit by this functionality, and thinking about materialized views as another kind of view isn't really correct.
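To make the two API shapes discussed in this thread concrete, here is a minimal sketch. The classes `FlagStyleBackend` and `SeparateMethodBackend`, the `materialized` keyword, and the string-based `query` argument are hypothetical and only for illustration; they are not Ibis's actual backend classes or signatures, and the methods merely render SQL text instead of executing anything.

```python
class FlagStyleBackend:
    """Alternative 1: one method, with a keyword argument selecting the view kind."""

    def create_view(self, name: str, query: str, *, materialized: bool = False) -> str:
        # Hypothetical flag: a single entry point covers both kinds of view.
        kind = "MATERIALIZED VIEW" if materialized else "VIEW"
        return f"CREATE {kind} {name} AS {query}"


class SeparateMethodBackend:
    """Alternative 2: a standalone method per relation kind."""

    def create_view(self, name: str, query: str) -> str:
        return f"CREATE VIEW {name} AS {query}"

    def create_materialized_view(self, name: str, query: str) -> str:
        # Materialized-view-only keyword arguments could be added here later
        # without widening the signature of create_view.
        return f"CREATE MATERIALIZED VIEW {name} AS {query}"


query = "SELECT user_id, count(*) AS n FROM events GROUP BY user_id"
flag_sql = FlagStyleBackend().create_view("user_counts", query, materialized=True)
separate_sql = SeparateMethodBackend().create_materialized_view("user_counts", query)
print(flag_sql)
print(separate_sql)
```

Both shapes can emit the same statement; the difference is where backend-specific options would accumulate over time, which is the concern raised above about kwargs that don't apply to regular views.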
I have some open questions on a high level: I understand the motivation behind introducing separate APIs for sources, materialized views, tables, and sinks for the RW backend. Sources, materialized views, tables, and sinks are distinct objects in RW. However, I wonder how generalizable this is across different streaming backends. For the user who is already familiar with RW's API, this implementation feels natural. But for the user who is coming from another backend, it may cause some confusion. For example, Flink also has the same concepts, but it doesn't use a different API for each. Source, tables, and sinks are all created with
Of course, it's also okay that we just do this for the RW backend for now and refactor at a later point if there is a need to consolidate some of the APIs.
a3e2717 to 71e8cc7
Good question. These concepts are common in streaming backends, although different systems may have different names for them. For example, as Flink doesn't have a normal persistent table in a traditional database, its tables are used to express a streaming job. While in RW, Materialize, and Timeplus, we use
Flink positions itself as a 'compute engine,' which means it's designed not to store data but to process it. In Flink, a
71e8cc7 to 5373f05
@KeXiangWang Yep, that makes sense. Since Ibis works across engines and databases, I'd imagine that these are questions that Ibis will need to address. But I think it also depends on the background of the user, e.g., whether it's someone who's experimenting with streaming or someone who already has some expertise in streaming and wants to try out the Ibis API. In any case, these may be questions that we can leave open and come back to when there is user feedback backing up one way or another.
5373f05 to f72698f
f72698f to e5b0d8f
@KeXiangWang -- we've just merged in #8655 which moves us away from using the word
I've pushed up the fixes to #8781 -- @KeXiangWang, you can either cherry-pick that commit back to this PR, or we can close this PR out in favor of your newer one, whichever you prefer!
Thanks for pushing on this @KeXiangWang !
I flagged a few style things and one bit of incorrect logic, I think.
Thanks for all of your work on this @KeXiangWang !
Description of changes
Risingwave acts as both a computing engine and a storage system in streaming ecosystems.
Usually, a streaming workload includes two or three systems, like this:
upstream data source --> Risingwave
Or,
upstream data source --> Risingwave --> downstream data sink
This PR adds streaming support, mainly new relations related to streaming, for the Risingwave backend.
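As a rough illustration of the four relation kinds this PR introduces (source, materialized view, table with connector, and sink), the sketch below renders the kind of Risingwave DDL each one corresponds to. The helper functions, relation names, columns, and connector options are all hypothetical and are not taken from this PR; the statement shapes are written from memory of Risingwave's SQL, connector-specific options and any format clause a given sink may need are omitted, and everything should be checked against the Risingwave documentation. Values are interpolated without escaping, so this is for illustration only.

```python
def with_clause(options: dict[str, str]) -> str:
    """Render connector options as a SQL WITH (...) clause (no escaping)."""
    pairs = ", ".join(f"{key} = '{value}'" for key, value in options.items())
    return f"WITH ({pairs})"


def create_source(name: str, columns: str, options: dict[str, str], encode: str) -> str:
    # A source describes an upstream system; it has columns but stores no data.
    return f"CREATE SOURCE {name} ({columns}) {with_clause(options)} FORMAT PLAIN ENCODE {encode}"


def create_table_with_connector(name: str, columns: str, options: dict[str, str], encode: str) -> str:
    # A table with a connector starts consuming and storing upstream data once created.
    return f"CREATE TABLE {name} ({columns}) {with_clause(options)} FORMAT PLAIN ENCODE {encode}"


def create_materialized_view(name: str, query: str) -> str:
    # A materialized view keeps the result of a query up to date inside Risingwave.
    return f"CREATE MATERIALIZED VIEW {name} AS {query}"


def create_sink(name: str, from_relation: str, options: dict[str, str]) -> str:
    # A sink pushes results to a downstream system instead of storing them.
    return f"CREATE SINK {name} FROM {from_relation} {with_clause(options)}"


# Hypothetical connector settings, for illustration only.
kafka = {"connector": "kafka", "topic": "orders"}
columns = "id INT, amount DOUBLE"

statements = [
    create_source("orders_src", columns, kafka, "JSON"),
    create_table_with_connector("orders_tbl", columns, kafka, "JSON"),
    create_materialized_view("order_totals", "SELECT id, sum(amount) AS total FROM orders_src GROUP BY id"),
    create_sink("order_totals_sink", "order_totals", {"connector": "redis"}),
]
for statement in statements:
    print(statement)
```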
To be specific, four types of new relations are introduced:
- `Source`: data sources that encompass a connector connected to an upstream data system like Kafka. A source works like an abstraction of, or a placeholder for, an upstream system, and although it has columns, it does not store any data itself.
- `Materialized View`: Risingwave's core concept for streaming. One can create an MV with a query on existing tables or sources. The data in an MV is automatically updated in real time. Users can access the data with a `select` statement, just like accessing a table.
- `Table` with connector: works like a combination of a source and a normal table. The difference between a `Table` with connector and a `Source` is that, once the table is created, it automatically starts to consume data from upstream systems and write it into Risingwave.
- `Sink`: unlike a `Materialized View`, which stores the result of a query in RW, a sink lets users send the result to a downstream system, e.g. Redis, and then read the data in the downstream system. Unlike with an MV, users cannot access the data directly from a sink.
Some minor changes:
- `sqlalchemy-risingwave` version changed from `1.0.0` to `1.0.1`; some implementations are updated.
- `test_semi_join` is marked as skipped for the Risingwave backend. The test sometimes gets stuck, and it seems to be Risingwave's fault. I'll continue to investigate.
One issue introduced in this PR:
The Risingwave backend's implementation is sqlalchemy-based, but sqlalchemy has no corresponding concept for sources and sinks. To work around this, I temporarily categorize a source as a view, so that sqlalchemy can access a source's metadata (the names and types of its columns) as it would for a view. However, this has a side effect: when users call `list_view`, they may see some unexpected views that are actually sources. This can be fixed by rewriting the implementation of the `list_view` function in the Risingwave backend. I'll fix it later.
Issues closed