
Revisit IO-queue wake from idle approach #2398

Open
xemul opened this issue Aug 22, 2024 · 21 comments

@xemul
Contributor

xemul commented Aug 22, 2024

Each sched class maintains two counters -- consumption and adjusted consumption. The former is the sum of the capacities of dispatched requests; the latter is the sum of request costs (a request's cost being, in turn, its capacity divided by the class shares), and it's sometimes bumped when the class gets queued (more on that below). When scheduling, the code picks the class with the smallest adjusted consumption so far, thus giving classes equal amounts of it, implicitly resulting in equally-weighted consumption.
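
A minimal sketch of this bookkeeping, with illustrative names (not the actual Seastar code):

#include <vector>

struct sched_class {
    float shares;
    double consumption = 0;  // sum of dispatched request capacities
    double adjusted = 0;     // sum of request costs: capacity / shares
    bool active = false;
};

// Charging a dispatched request: both counters advance, the adjusted one
// weighted down by the class shares.
void charge(sched_class& c, double capacity) {
    c.consumption += capacity;
    c.adjusted += capacity / c.shares;
}

// Scheduling: pick the active class with the smallest adjusted consumption.
sched_class* pick_next(std::vector<sched_class*>& classes) {
    sched_class* best = nullptr;
    for (auto* c : classes) {
        if (c->active && (!best || c->adjusted < best->adjusted)) {
            best = c;
        }
    }
    return best;
}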

This bumping happens when a class is "woken up" by a request and gets queued. At that point its adjusted consumption may have fallen too far behind the other classes'. If the class were simply queued back, it would get exclusive access to the dispatch sink for too long a period of time. To prevent that, the counter is bumped up so it doesn't trail too far behind.

This bump-up seems to cause unwanted trouble. Here are two plots for the compaction (first) and statement (second) classes' adjusted consumption and pure consumption divided by the class shares:

[plot: compaction class -- adjusted consumption vs. consumption divided by shares]

[plot: statement class -- adjusted consumption vs. consumption divided by shares]

First, note the adjusted consumption (red and blue lines) -- it is identical for both classes, which means the scheduler's "sorting" works and both classes receive an equal amount of weighted capacity.

Next, look at the normalized consumption (greenish lines). If no bumping happened, this counter would coincide with the adjusted consumption. And that's what we see for the compaction class -- it really does coincide after the spike elapses. But it's not the case for the statement class -- its normalized consumption stays notably low. The only possible reason is that this class' adjusted consumption was bumped too aggressively.

@avikivity
Member

Yes, I remember we hit it with the CPU scheduler too. It was eventually fixed. I don't remember what the fix was, only that I didn't understand it.

One thing we can do is play with the size of the bump. If we bump the normalized consumption of the woken-up group to that of the highest live consumer, we penalize it too much. But we can bump it to the state of the highest consumer - (delta_t * consumption_rate), reducing the penalty somewhat.
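
In the terms of the sketch above (a hypothetical helper; delta_t being the time the group slept and consumption_rate an estimate of its consumption rate):

#include <algorithm>

// Softer bump: instead of the highest live consumer's level, deduct what
// the waker would plausibly have accrued over the time it slept.
void on_wake_soft(sched_class& c, double highest_acc, double delta_t,
                  double consumption_rate) {
    c.adjusted = std::max(c.adjusted, highest_acc - delta_t * consumption_rate);
}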

@xemul
Contributor Author

xemul commented Aug 22, 2024

We currently bump it to below the lowest consumer. The io-queue saves the accumulator of the last class dispatched from, and every time a class wakes up it computes "last-accumulator - max-delta" and assigns that to the newcomer if the newcomer's accumulator is lower.

This "max-delta" depends on tau which we reduced from 100ms to 5ms (77b87a6) some time ago

@avikivity
Member

But can't the lowest consumer also be a mostly-idle group?

Ah, it's the lowest consumer among the active groups, since the scheduler will pick the lowest consumer to dispatch from.

This seems a sensible algorithm, so why is it broken?

@xemul
Contributor Author

xemul commented Aug 26, 2024

It's not necessarily broken; it can be that it's just not tuned correctly.

When a class wakes up from idle, its accumulator is bumped so that it will catch up in 5ms. This means the scheduler doesn't penalize the class if it queues at 2000 requests per second or higher. And note that this is the "activation" rate. If the client sends 100 requests 200 times per second, the request rate is 20000, 10x higher, but the class will still be penalized heavily. Those 100 requests once every 1/200 s could be "supported" by making the decay period 50ms (it was 100ms some time ago), but that has a drawback -- if a non-interactive class kicks in with an over-saturated queue (because it doesn't care about latency, but wants large throughput), it will keep everybody else off for 50ms, which is also not good.

Comparing how bump-from-idle works elsewhere, the IO approach is actually the more sensible one -- the CPU scheduler doesn't have decay at all:

void
reactor::activate(task_queue& tq) {
    if (tq._active) {
        return;
    }
    sched_print("activating {} {}", (void*)&tq, tq._name);
    // If activate() was called, the task queue is likely network-bound or I/O bound, not CPU-bound. As
    // such its vruntime will be low, and it will have a large advantage over other task queues. Limit
    // the advantage so it doesn't dominate scheduling for a long time, in case it _does_ become CPU
    // bound later.
    //
    // FIXME: different scheduling groups have different sensitivity to jitter, take advantage
    if (_last_vruntime > tq._vruntime) {
        sched_print("tq {} {} losing vruntime {} due to sleep", (void*)&tq, tq._name, _last_vruntime - tq._vruntime);
    }
    tq._vruntime = std::max(_last_vruntime, tq._vruntime);  // <---------------------------------------- HERE
    auto now = reactor::now();
    tq._waittime += now - tq._ts;
    tq._ts = now;
    _activating_task_queues.push_back(&tq);
}

@travisdowns
Contributor

It seems like maybe what you want is to have a larger max-delta, like the 100 ms you had before, but also cap the effect this can have for newly waking things. I.e., rather than an infrequently running bulk job getting scheduled every tick for 100 ms when it wakes up, cap the ticks the newly woken class gets to 2x the amount it would get based purely on its shares (i.e., as if it were immediately bumped up to the max consumer, like the CPU scheduler does).

Not sure exactly how to implement it. One way would be to track two adjusted runtimes, a "large max-delta" one and a "zero max-delta" one, and have the IO scheduler alternate between them every tick. In the example above, the newly woken bulk job would always be scheduled on the "large max-delta" tick, but would then compete fairly on the other tick (where it probably wouldn't be scheduled at all, since it is already getting 50% of total IO capacity by fully monopolizing the alternate tick).

This would make choosing large max-delta values less problematic.
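
A rough sketch of that two-accumulator idea (hypothetical names, not an existing Seastar API):

// Each class keeps two adjusted-consumption counters, differing only in how
// aggressively they are bumped on wake-from-idle.
struct dual_acc_class {
    double adjusted_large_delta;  // bumped with a generous max-delta (e.g. 100 ms)
    double adjusted_zero_delta;   // bumped all the way up to the current level
};

// The scheduler alternates which counter orders the classes, so a newly
// woken class can win outright on at most every other tick.
double sort_key(const dual_acc_class& c, long tick) {
    return (tick % 2 == 0) ? c.adjusted_large_delta : c.adjusted_zero_delta;
}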

Another approach would be to implement the wakeup boost as a decaying boost to the class shares, or something like that.

@avikivity
Member

It's not necessarily broken; it can be that it's just not tuned correctly.

When a class wakes up from idle, its accumulator is bumped so that it will catch up in 5ms. This means the scheduler doesn't penalize the class if it queues at 2000 requests per second or higher.

Did you mean 200 req/s? Or 0.5ms?

We have to compare the wake-bump to io-latency-goal. If wake-bump is much greater than io-latency-goal, then it will dominate over other groups for a short period. If it's of the same order or smaller than io-latency-goal, then intermittent workloads will have a hard time achieving their shares.

And note that this is the "activation" rate. If the client sends 100 requests 200 times per second, the request rate is 20000, 10x higher, but the class will still be penalized heavily. Those 100 requests once every 1/200 s could be "supported" by making the decay period 50ms (it was 100ms some time ago), but that has a drawback -- if a non-interactive class kicks in with an over-saturated queue (because it doesn't care about latency, but wants large throughput), it will keep everybody else off for 50ms, which is also not good.

Perhaps we should classify groups not just with shares, but with latency expectations as well. Batch workloads don't mind high latency, they only want throughput.

@travisdowns
Contributor

One thing I didn't understand though: did the "statement" class actually want more IO? I.e., it had a very low rate of raw consumption, i.e., a low IO rate, but if it actually wanted more IO, wouldn't this be a self-correcting problem? More available unsatisfied IO in that class means that it would no longer go to sleep, and so it would not suffer the bumps and loss of IO opportunities (though it would of course suffer a constant latency penalty during this period, since it implies at least some queuing).

This wouldn't be true if the IO was serially dependent, though.

Another way of asking: what would those plots look like if io-properties were set very high, i.e., effectively without limit? Would the statement class do more IO?

@avikivity
Member

One thing I didn't understand though: did the "statement" class actually want more IO? I.e., it had a very low rate of raw consumption, i.e., a low IO rate, but if it actually wanted more IO, wouldn't this be a self-correcting problem? More available unsatisfied IO in that class means that it would no longer go to sleep, and so it would not suffer the bumps and loss of IO opportunities (though it would of course suffer a constant latency penalty during this period, since it implies at least some queuing).

This wouldn't be true if the IO was serially dependent, though.

Serially independent, or just intermittent. The statement class receives requests from the network, wants some CPU and I/O, and expects lots of bandwidth since it has high shares and hasn't consumed anything in a while.

The problem is that we have to penalize idle groups. If we don't, they will accrue credit and then dominate the resource. If we penalize too much, they suffer even though they should be rewarded for their idleness. If we penalize too little, they dominate other groups.

Another way of asking: what would those plots look like if io-properties were set very high, i.e., effectively without limit? Would the statement class do more IO?

It would not be throttled.

@travisdowns
Contributor

travisdowns commented Sep 12, 2024

The problem is that we have to penalize idle groups. If we don't, they will accrue credit and then dominate the resource. If we penalize too much, they suffer even though they should be rewarded for their idleness. If we penalize too little, they dominate other groups.

Yes, I understand the problem.

It would not be throttled.

It's hard to see how, for intermittent, independent requests. Yes, the effect described will happen initially, but if that results in the intermittent IO class being throttled below its long-run demand, IO for this class will start to queue, which means it will no longer be idle and no longer subject to the bump-up effect. So I don't see how you can get long-run throttling like that shown starting at 20:19:30 in Pavel's plot, unless there's a bug during the spike part which causes the statement class to be bumped above the compaction one (rather than max-delta below compaction).

Let's use a concrete example, and simplify the model somewhat and pretend it only counts IOPS (because I don't think the more complicated model is germane here).

Class A (compaction-like) and B (statement-like) have the same IO shares. Class A generates demand for 10 IOs every ms (i.e., one burst of 10 IOs, sleep for 1 ms, repeat). Class B generates demand for 1,000 IOs every 100 ms. So both want 10 KIOPS overall, but B is "more intermittent". "max-delta" (per @xemul's definition above) is 5 ms, i.e., an idle class can only accrue 5 ms worth of credit before it starts losing it.

In this scenario, class A is effectively doing continuous IO, since its period (1 ms) is much less than 5 ms, so it never gets bumped up at all: it gets full credit for the time it wasn't doing IO.

Let's assume that the io properties result in a 15,000 IOPS cap overall.

During the first 100 ms, class A does its 10 IOs every 1 ms, and will be allowed to run at full speed since it's only consuming 10,000 IOPS and, as above, it never gets bumped up.

At t=100ms class B is going to do its first 1,000 IOs, but even though it has accumulated "100 ms" of credit (relative to A), it gets its accumulated time bumped up to A's time minus 5 ms. So it only gets to run non-stop for 5 ms, doing 75 IOs during that time. After that, A and B will have even runtimes and will split the IOPS, 7,500 each. This is below the long-run rate for both, so they will never go to sleep and will split the IOPS evenly. Nothing like the second plot.

So yes, A is throttled but throttled fairly wrt B (which is also throttled).

Something different happens if the io properties cap is < 10,000 or > 20,000 (no long-run throttling), but still seems fine.

So it's hard to see how long-run throttling happens since long-run throttling implies queuing which prevents bump-up.
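
A toy simulation of this example, under the stated simplifications (pure-IOPS cost model, 1 ms steps; the names and the dispatch loop are illustrative, not the Seastar implementation):

#include <algorithm>
#include <cstdio>

int main() {
    const int cap_per_ms = 15;  // 15,000 IOPS cap
    const int max_delta = 75;   // 5 ms of credit at full rate, in IOs
    int backlog_a = 0, backlog_b = 0;
    long acc_a = 0, acc_b = 0;  // equal shares: accumulator == IOs dispatched
    long done_a = 0, done_b = 0;
    for (int ms = 0; ms < 1000; ms++) {
        backlog_a += 10;                    // A: 10 IOs every 1 ms
        if (ms >= 100 && ms % 100 == 0) {   // B: 1,000 IOs every 100 ms
            if (backlog_b == 0) {
                // wake-from-idle bump: B may not trail the continuously
                // active class A by more than max_delta
                acc_b = std::max(acc_b, acc_a - max_delta);
            }
            backlog_b += 1000;
        }
        // dispatch: always pick the active class with the lower accumulator
        for (int budget = cap_per_ms; budget > 0; budget--) {
            if (backlog_a == 0 && backlog_b == 0) {
                break;
            }
            bool pick_a = backlog_b == 0 || (backlog_a > 0 && acc_a <= acc_b);
            if (pick_a) { backlog_a--; acc_a++; done_a++; }
            else        { backlog_b--; acc_b++; done_b++; }
        }
    }
    // After B's first burst neither class ever drains its queue again, so
    // no further bumps happen and the split settles near 7,500 IOPS each.
    std::printf("A: %ld IOs, B: %ld IOs over 1 s\n", done_a, done_b);
}

In this toy run B gets its exclusive 5 ms / 75 IO burst only once; after that both queues stay non-empty and the IOPS split evenly, matching the "no long-run throttling" conclusion.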

Maybe this is something good to reproduce in io_tester?

@xemul
Contributor Author

xemul commented Sep 13, 2024

It seems like maybe what you want is to have a larger max-delta, like the 100 ms you had before, but also cap the effect this can have for newly waking things

Sort of, but a larger max-delta is a double-edged sword. Consider a compaction-like class that had been idling for a while. When it wakes up, it will consume the whole max-delta, because it always saturates the queue way above the per-tick capacity. That was the reason we shrank it from 100ms to 5ms.

One thing I didn't understand though: did the "statement" class actually want more IO?

That's a good question. Consider a class that piles up a bunch of requests needing several ticks to be served, but whose spikes happen rarely enough for the "average" IOPS rate to stay very low. So it's hard to tell from those plots what the class really "wants".

Perhaps we should classify groups not just with shares, but with latency expectations as well. Batch workloads don't mind high latency, they only want throughput.

Yes, that might help.

@xemul
Contributor Author

xemul commented Sep 13, 2024

Maybe this is something good to reproduce in io_tester?

Yes, sure. Here's my experiment.

io_tester --duration 10 --conf conf.yaml --io-properties-file local-io.yaml --storage /some/xfs/directory  -c1
*** conf.yaml
- name: statement
  shards: all
  type: randread
  data_size: 1GB
  shard_info:
    parallelism: 16
    reqsize: 512
    rps: 100
    shares: 1000
    batch: 100

- name: compaction_write
  shards: all
  type: seqwrite
  data_size: 1GB
  shard_info:
    parallelism: 8
    reqsize: 128kB
    shares: 100

- name: compaction_read
  shards: all
  type: seqread
  data_size: 1GB
  shard_info:
    parallelism: 8
    reqsize: 128kB
    class: compaction_write
*** local-io.yaml
disks:
  - mountpoint: /home/xfs
    read_iops: 442784
    read_bandwidth: 3555216640
    write_iops: 369838
    write_bandwidth: 1489170944

Note, that "statement" job put 1600 requests in one go, each 10ms. With the above io-properties.yaml one tick (~1.5ms) serves ~6k such requests.

Now, this is what I see when running on master:

  statement:
    throughput: 79161.8984  # kB/s
    IOPS: 158323.797
    latencies:  # usec
      average: 1980
      p0.5: 1697
      p0.95: 3997
      p0.99: 6323
      p0.999: 24603
      max: 36069
    stats:
      total_requests: 1584750
      io_queue_total_exec_sec: 2969.6606326020565
      io_queue_total_delay_sec: 49.984696351996071
      io_queue_total_operations: 1585200
      io_queue_starvation_time_sec: 0.0017831009999999998
      io_queue_consumption: 0.18897056579589844
      io_queue_adjusted_consumption: 0.018101386129856109
      io_queue_activations: 15783

and this is what I see when running this branch:

  statement:
    throughput: 79631.7188  # kB/s
    IOPS: 159263.75
    latencies:  # usec
      average: 1537
      p0.5: 1259
      p0.95: 3501
      p0.99: 4993
      p0.999: 9513
      max: 27337
    stats:
      total_requests: 1594209
      io_queue_total_exec_sec: 2295.8663855899595
      io_queue_total_delay_sec: 47.891438546999296
      io_queue_total_operations: 1594900
      io_queue_starvation_time_sec: 0.0028849120000000012
      io_queue_consumption: 0.19012689590454102
      io_queue_adjusted_consumption: 0.021995616257190705
      io_queue_activations: 15967

and this is what I see when commenting out any adjustment of the accumulator on wake-from-idle:

  statement:
    throughput: 79348.6797  # kB/s
    IOPS: 158697.859
    latencies:  # usec
      average: 1573
      p0.5: 1304
      p0.95: 3807
      p0.99: 7805
      p0.999: 19703
      max: 29408
    stats:
      total_requests: 1588515
      io_queue_total_exec_sec: 2658.0672418592535
      io_queue_total_delay_sec: 47.961513335997921
      io_queue_total_operations: 1588700
      io_queue_starvation_time_sec: 0.0018764459999999999
      io_queue_consumption: 0.18938779830932617
      io_queue_adjusted_consumption: 0.00018938779830932618
      io_queue_activations: 15831

And that's even though the statement class is not trying to overload a single tick with its work.

So yes, we're looking for unfairness -- giving the "statement" class more chances to consume its "lost" capacity, and giving the "compaction" class fewer chances to do so.

@travisdowns
Contributor

Now, this is what I see when running on master:

OK, but in all three cases the "statement" class has exactly the same throughput; only the latency varies (that's what you are showing, right? the second test case has better latency?).

So the "statement" class is not long-run throttled at all, it's just the micro behavior that varies every time it wakes that changes somewhat. So this doesn't reproduce this original thing:

[plot: statement class -- adjusted consumption vs. consumption divided by shares (repeated from above)]

where the claim, as I understood it, was that the statement raw consumption (green line, bottom chart) was throttled severely after 20:19:30 -- its consumption was "very low" due to the bumping -- but in the io_tester example the pure consumption (and throughput) is the same in all three cases?

@xemul
Contributor Author

xemul commented Sep 13, 2024

Ah, I think I understand where your confusion comes from. The issue is not that the statement class is throttled too much for no reason. In fact, it is "throttled" for a reason -- it really doesn't need as much capacity as the compaction class. The hidden message (that I failed to express in the issue description) is:

  • sometimes "statement" class suffers from large latency
  • there's no apparent reason for that
  • one of the sources can be this wake-from-idle thing
  • metrics above support that theory (just support, not prove)

@travisdowns
Contributor

  • sometimes "statement" class suffers from large latency
  • there's no apparent reason for that
  • one of the sources can be this wake-from-idle thing

OK sure, that makes sense. What I didn't understand is how the second plot above supports that in relation to the wake-on-bump. As I understand it, the second plot ("pure consumption") would look the same after 20:19:30 regardless of the wake-on-bump strategy (agreed?). Wake-on-bump is probably the "reason" for the ~linear ramp in adjusted consumption between 20:18:30 and 20:19:30, though. So yes, in a world where we didn't do any bump at all, statement would subsequently be "super-privileged" and have all of its IO taken immediately on every tick where it had some IO.

In real scylladb are the statement/compaction priorities 1000/100 as in the io_test config?

I wonder if part of the problem is that even though one would expect 1000-share intermittent IO to be little affected by 100-share ongoing IO, since it has 10x the shares, that just isn't true for a variety of reasons. I.e., statement would have much better performance running alone than running concurrently with compaction, no matter the share ratio.

@xemul
Contributor Author

xemul commented Sep 13, 2024

What I didn't understand is how the second plot above supports that in relation to the wake-on-bump

The statement class loses its adjusted accumulator, and that can only happen in wake-from-idle. I recently added a metric to show the wake-from-idle rate, and I already have confirmation that the statement class is extremely intermittent (in the CPU scheduling world it's often called "interactive").

In real scylladb are the statement/compaction priorities 1000/100 as in the io_test config?

Yes

I wonder if part of the problem is that even though one would expect 1000-share intermittent IO to be little affected by 100-share ongoing IO, since it has 10x the shares, that just isn't true for a variety of reasons. I.e., statement would have much better performance running alone than running concurrently with compaction, no matter the share ratio.

It's pretty convoluted. A class with 10x more shares running a saturated queue will receive 10x more "capacity" from the disk. The difficulty in comparing the statement class with compaction is that this "capacity" is neither bandwidth nor IOPS, but a combination of the two that's not immediately obvious.

@avikivity
Member

Perhaps the accumulator should be adjusted to what its value would be if it were consuming its share (in some sense) of the resource.

Let's say group A has 100 shares and is consuming 100 resources/sec steadily.

Let's say group B has 100 shares and is consuming 200 resources/sec on alternate seconds (1 second off, 1 second on).

When B wakes up, we adjust its accumulator as if it was consuming 100 resources/sec at a steady pace up to that point (since it has equal shares with A). Then it doesn't dominate A, and gets to compete equally from that point on.
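
A sketch of that crediting rule, building on the earlier sched_class model (hypothetical; assumes we track what the whole group dispatched while the class slept):

// Credit the sleeper as if it had consumed its fair share of whatever the
// group dispatched during its sleep.
void on_wake_fair(sched_class& c, double dispatched_while_asleep,
                  double total_active_shares) {
    double fair_capacity = dispatched_while_asleep * c.shares / total_active_shares;
    c.adjusted += fair_capacity / c.shares;  // cost = capacity / shares
}

Note that the increment simplifies to dispatched_while_asleep / total_active_shares, independent of the waker's own shares: under fair-share crediting every class's adjusted consumption advances at the same rate, so the sleeper wakes up roughly even with everybody else.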

@avikivity
Member

Let's try it with a more concrete example.

Compaction is running alone; it has few shares (10), but since it's alone it consumes 100% of the resource.

Statement now starts running. It has lots of shares (100) but is intermittent. Its accumulator is set as if it were consuming 91% of the resource (100/(100+10)).

Hmm. I don't think it's good. It's penalized too much for sleeping.

It boils down to: we want to reward statement for sleeping, but not reward compaction for sleeping.

@xemul
Contributor Author

xemul commented Sep 17, 2024

It boils down to: we want to reward statement for sleeping, but not reward compaction for sleeping.

Yes, there are many strategies we can apply, but they are all driven by an uneasy decision -- do we differentiate the statement class from compaction explicitly, or try to apply some magic differentiator (options: average/maximum request size, wakeup rate, shares below/above a configurable threshold such as 500)?

@xemul
Contributor Author

xemul commented Sep 17, 2024

Some "sane" fix seems to be -- adjust the class accumulator not immediately, when it wakes up, but after it is dispatched from once (the nearest poll). This works well for both -- statement pattern and compaciton, like this.

Statement: when it wakes up, its accumulator is way behind the others'. The nearest dispatch will likely dispatch only from this class, since its accumulator is low. Then its accumulator will be adjusted and it will compete with the other classes normally -- though likely it won't have to, because its queue is typically short and drained in one to a few ticks.

Compaction: when it wakes up, even if its accumulator is behind the others', it will only dominate the queue in the single upcoming tick. Then it will have its accumulator adjusted, and will continue competing with the other classes normally.
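
In the earlier sketch's terms, the deferred adjustment might look like this (assuming a needs_wake_adjust flag added to sched_class; a sketch, not the actual patch):

#include <algorithm>

// On wake, only mark the class; its accumulator stays behind, so the
// nearest dispatch naturally prefers it.
void on_wake_deferred(sched_class& c) {
    c.active = true;
    c.needs_wake_adjust = true;
}

// After the first dispatch from the class, apply the usual bump -- so it
// can dominate the sink for at most that one tick.
void after_dispatch(sched_class& c, double last_acc, double max_delta) {
    if (c.needs_wake_adjust) {
        c.adjusted = std::max(c.adjusted, last_acc - max_delta);
        c.needs_wake_adjust = false;
    }
}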

@avikivity
Member

Some "sane" fix seems to be -- adjust the class accumulator not immediately, when it wakes up, but after it is dispatched from once (the nearest poll). This works well for both -- statement pattern and compaciton, like this.

Statement: when it wakes up it has its accumulator way behind the others. The nearest dispatch will likely dispatch only from this class, since it's accumulator is low. Then its accumulator will be adjusted and it will compete with other classes normally, but likely it won't, because it's queue is typically short and drained in one-to-few ticks.

Compaction: when it wakes up, even if it has its accumulator behind the others, it will only dominate the queue in the single upcoming tick. Then it will have its accumulator adjusted, and will continue competing with other classes normally.

I don't think it deeply solves the problem. It changes the boost for a waking-up group from "5ms" to "whatever you managed to queue before I got around to scheduling you". In some cases it could be better, in some worse.

@travisdowns
Contributor

It boils down to: we want to reward statement for sleeping, but not reward compaction for sleeping.

That's at least part of the problem, but isn't another part that the "reward" has to take the form of exclusive access to the IO queue for "max-delta IO", at which point it returns to even usage? So it is an extreme reward that lets a class monopolize the IO queue and effectively ignore the shares while that occurs.

If instead the reward was a temporary bump in shares, there wouldn't be such a painful choice between reward accumulation cap and intermittent statement performance.

For example, you could use a max-delta of 0 (i.e., always bump the waker fully up to the current consumption), but add the amount of consumption bumped away (still with a cap) to a second "waker reward" counter. You choose the next class as today, based on lowest adjusted consumption, but if the chosen class has a non-zero waker reward, you split the consumption between incrementing its consumption as usual and decrementing its waker reward. This effectively gives it a priority boost to recoup its lost consumption, but without monopolization.

For example, if a class wakes with 700 adjusted consumption and we bump it to 1000, we add that 300 to the waker reward. Then when this class is chosen and does 100 consumption worth of IO, we split it 50/50 (assuming a ratio of 2), so consumption for this class goes to 1050 and the reward drops to 250. I.e., the adjusted consumption only went up by half as much as in the no-reward scenario. Effectively, the class will be scheduled as if it had 2x the shares until the reward bucket is exhausted.

Now you have the two tunables you want: the cap on the amount added to the reward bucket sets how much reward a sleeper can accumulate before maxing out, and the ratio sets how quickly the waker is rewarded when it does wake up. Any reasonable ratio should avoid starvation of the other classes, and, especially key, shares are still respected: compaction, when it wakes up, will still have 5x fewer effective shares (with ratio 2) during the reward period.
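
A sketch of this scheme, with illustrative numbers and names (not an existing API):

#include <algorithm>

struct rewarded_class {
    double adjusted = 0;  // adjusted consumption; orders the classes as today
    double reward = 0;    // consumption "bumped away" on wake, to be repaid
};

constexpr double reward_cap = 300.0;  // max credit a sleeper can bank
constexpr double reward_ratio = 2.0;  // effective share boost while repaying

// Max-delta of 0: bump the waker fully to the current level, but bank the
// difference (capped) as reward.
void on_wake(rewarded_class& c, double current_adjusted) {
    if (c.adjusted < current_adjusted) {
        c.reward = std::min(current_adjusted - c.adjusted, reward_cap);
        c.adjusted = current_adjusted;
    }
}

// Charging a dispatch: while reward remains, only 1/ratio of the cost lands
// on the adjusted consumption; the remainder repays the reward.
void charge(rewarded_class& c, double cost) {
    double repaid = std::min(cost * (1 - 1 / reward_ratio), c.reward);
    c.reward -= repaid;
    c.adjusted += cost - repaid;
}

With these numbers the worked example above comes out exactly: a class waking at 700 and bumped to 1000 banks 300 of reward, and a 100-cost dispatch then advances its adjusted consumption by only 50 (to 1050) while the reward drops to 250.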
