
Problem of online scaling for source backfill #18300

Closed · Tracked by #16003 · xxchan opened this issue Aug 28, 2024 · 5 comments

xxchan (Member) commented Aug 28, 2024

Problem

Currently, SourceBackfill has a hack: it needs to scan() the whole state table, including splits written by other actors.

```rust
/// All splits finished backfilling.
///
/// We check all splits for the source here, including other actors' splits, before going
/// to the forward stage. Otherwise, if we break early but an unfinished split is migrated
/// to this actor after rescheduling, we would still need to backfill it.
async fn backfill_finished(&self, states: &BackfillStates) -> StreamExecutorResult<bool> {
    Ok(states
        .values()
        .all(|state| matches!(state, BackfillState::Finished))
        && self
            .backfill_state_store
            .scan()
            .await?
            .into_iter()
            .all(|state| matches!(state, BackfillState::Finished)))
}
```

The reason is to handle split migration (i.e., online scaling). SourceBackfill has two stages (backfill -> forward upstream), and once it has entered stage 2, it cannot go back to stage 1. So if unfinished backfill work is migrated to an actor already in stage 2, that actor cannot do the backfilling.

However, the hack doesn't work correctly now, as shown in #18033 (comment). The reason is that at the beginning, an actor can only read the state written by itself; it needs to wait until all actors' written data is readable, i.e., until the first checkpoint is available.

Note 1: checkpointing is asynchronous, so we cannot rely on something like "wait for N barriers to pass".
Note 2: an actor doesn't know the total number of splits, so we cannot rely on a condition like states.len() == num_splits. This is unlike the WatermarkFilter executor, which has a similar hack but can rely on "all vnodes are written"; source splits are not distributed by vnodes. (See the sketch below.)
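
To make Note 2 concrete, here is a minimal sketch of the two completeness conditions. All types and names here are hypothetical stand-ins, not the real executor code:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for illustration only.
type SplitId = String;
#[allow(dead_code)]
enum BackfillState {
    Backfilling,
    Finished,
}

// WatermarkFilter-style check: the executor owns a known, fixed set of vnodes,
// so "every owned vnode has a written state row" is a well-defined condition.
fn all_vnodes_written(owned_vnodes: &[u16], written: &HashSet<u16>) -> bool {
    owned_vnodes.iter().all(|v| written.contains(v))
}

// The analogous check for source splits is NOT well-defined: an actor only sees
// the splits currently assigned to it and never learns the global split count,
// so there is no trustworthy `num_splits` to compare `states.len()` against.
fn all_splits_finished(states: &HashMap<SplitId, BackfillState>, num_splits: usize) -> bool {
    states.len() == num_splits // unknowable at the actor level!
        && states
            .values()
            .all(|state| matches!(state, BackfillState::Finished))
}
```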

Solution

There are several solutions to fix this bug:

  1. Patch the hack: wait until the state is "inited", i.e., the first checkpoint is committed and readable. We can use try_wait_epoch for this; see the sketch after this list.

  2. Rewrite the code to allow transitions between stage 1 and stage 2: I don't think there are any technical restrictions preventing us from doing this. The reason we didn't do it in the first place might just be an oversight. (A single-direction transition does look slightly more natural, though.)

  3. Disallow online scaling: then the problem disappears! This is inspired by @BugenZhao: MV backfilling is actually in a similar situation. It also goes from stage 1 to stage 2 in one direction, and we currently disallow online scaling for it, so we can do the same for source backfilling.

    More precisely,

    • For foreground (blocking) DDL, scaling will be rejected. If the cluster goes down, the backfill will fail.
    • For background DDL, scaling is done by rebuilding the actors. (So the new actor can start from stage 1, and the code doesn't need to handle a stage 2 -> 1 transition.)

    More on this: https://risingwave-labs.slack.com/archives/C05AZUR6Q5P/p1724227403344349
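
To illustrate the idea behind solution 1, here is a toy, self-contained simulation, where a tokio watch channel stands in for try_wait_epoch on the state store; all names are made up for this sketch:

```rust
use std::time::Duration;
use tokio::sync::watch;

// Toy model: the watch channel carries the latest committed epoch, playing the
// role of `try_wait_epoch` on the state store.
async fn wait_epoch(rx: &mut watch::Receiver<u64>, target: u64) {
    while *rx.borrow() < target {
        rx.changed().await.expect("committed-epoch sender dropped");
    }
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = watch::channel(0u64);

    // Storage side: the first checkpoint (epoch 1) commits asynchronously,
    // some time after the barrier has passed through the actors.
    tokio::spawn(async move {
        tokio::time::sleep(Duration::from_millis(10)).await;
        tx.send(1).unwrap();
    });

    // Executor side: until the first checkpoint is readable, scan() only
    // reflects this actor's own writes, so `backfill_finished` must not be
    // trusted yet. Block here first, then run the check.
    wait_epoch(&mut rx, 1).await;
    println!("first checkpoint visible; scan() now sees all actors' states");
}
```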

I prefer solution 3 because it removes the hack and makes the logic simpler to understand. In the long term, if we want to allow online scaling for foreground DDL, solution 2 might be the best option (both for Source and MV), but it will take a larger effort to implement.

edit: We ended up doing both 1 and 3. 3 comes for free once DDL is blocking, because of reschedule_lock. But we found that alone is not enough, and we still need 1. See #18112.

edit: If we had neither blocking DDL nor progress tracking, there would seem to be no way to disallow online scaling. So either 1 or 2 would still be needed.

A little more

Relationship of the split migration problem with #18338 (blocking DDL) and #18299 (Finish backfill faster):

Blocking DDL requires finishing backfill faster for better UX, and the faster backfill finishes, the lower the chance that we need to handle split migration at all.

Previously we only considered background backfill, where the backfill can be blocked for a long time, so we had to handle split migration gracefully. (But we forgot about the idea of rebuilding actors.)

xxchan (Member, Author) commented Sep 2, 2024

I think there's no special code change needed to disallow online scaling for blocking DDL on source.

The existing mechanism is to grab reschedule_lock when creating a streaming job. For blocking DDL, create_streaming_job does not return until the backfill finishes, so scaling is naturally prevented while a blocking DDL on a source is in progress.

```rust
let _reschedule_job_lock = self.stream_manager.reschedule_lock_read_guard().await;
```
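
As a toy model of why this works (assuming reschedule_lock behaves like an async RwLock: blocking DDL holds the read guard for its whole duration, while rescheduling needs the write guard):

```rust
use std::{sync::Arc, time::Duration};
use tokio::sync::RwLock;

#[tokio::main]
async fn main() {
    // Toy stand-in for the stream manager's reschedule_lock.
    let reschedule_lock = Arc::new(RwLock::new(()));

    // Blocking DDL: holds the read guard until the backfill "finishes".
    let ddl_lock = reschedule_lock.clone();
    let ddl = tokio::spawn(async move {
        let _guard = ddl_lock.read().await;
        println!("create_streaming_job: backfilling...");
        tokio::time::sleep(Duration::from_millis(50)).await;
        println!("create_streaming_job: backfill finished, lock released");
    });

    // Rescheduling: needs the write guard, so it cannot proceed until the
    // blocking DDL has finished and dropped its read guard.
    tokio::time::sleep(Duration::from_millis(10)).await; // let the DDL grab the lock first
    let _write_guard = reschedule_lock.write().await;
    println!("reschedule: acquired write lock only after DDL completed");

    ddl.await.unwrap();
}
```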

fuyufjh (Member) commented Sep 3, 2024

Does this problem exist for a normal SourceExecutor? It also has to handle split migration (i.e., online scaling), and it stores offsets in its state table.

xxchan (Member, Author) commented Sep 3, 2024

No, because SourceExecutor is a single loop; it doesn't need to transition between stages (Backfilling -> Finished).
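
For contrast, a minimal sketch of the single-loop shape (illustrative names only, not the real SourceExecutor): split migration is absorbed at any barrier, because there is no stage to fall back to.

```rust
// Illustrative message type; the real executor consumes barriers and chunks.
enum Message {
    Barrier { split_assignment: Option<Vec<String>> },
    Chunk,
}

fn run_source_loop(messages: Vec<Message>) {
    let mut splits = vec!["split-0".to_owned()];
    for msg in messages {
        match msg {
            Message::Barrier { split_assignment: Some(new_splits) } => {
                // Split migration: just replace the assignment and keep reading.
                // There is no "finished" stage, so nothing can be stranded.
                splits = new_splits;
            }
            Message::Barrier { split_assignment: None } => {}
            Message::Chunk => {
                // ... read from `splits` at the offsets stored in the state table ...
            }
        }
    }
    let _ = splits;
}
```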

(A comment from @fuyufjh was marked as resolved.)

xxchan (Member, Author) commented Sep 3, 2024

See this PR for more: #18112

xxchan changed the title from "Disallow online scaling for shared source during (blocking) backfill" to "Problem of online scaling for shared source during (blocking) backfill" on Sep 4, 2024
xxchan changed the title from "Problem of online scaling for shared source during (blocking) backfill" to "Problem of online scaling for source backfill" on Sep 4, 2024
xxchan closed this as completed on Sep 11, 2024