
[Serving] PagedKVCache Quantization #2663

Open · wants to merge 1 commit into main
Conversation

@davidpissarra (Member) commented Jul 16, 2024

The KV cache can be a significant memory burden under tight memory constraints, and cache quantization reduces its memory requirement by roughly 75% (float16 -> int3 KV cache). This PR introduces initial support for KV cache quantization via two new quantization schemes (q3f16kv and q4f16kv).

KV cache memory usage on Llama-3 (context_window_size = 8192):

  • 1113 MB (q4f16_1: float16 kv cache)
  • 345 MB (q4f16kv: int4 kv cache)
  • 281 MB (q3f16kv: int3 kv cache)
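
The ~75% figure follows from the bit widths: each 16-bit value becomes a 3- or 4-bit code plus a small per-group scale overhead. Below is a minimal numpy sketch of group-wise symmetric int4 quantization, only to illustrate the arithmetic; it is not the PR's actual TIR kernels, and the group size, packing layout, and Llama-3-8B-like shapes (8 KV heads, head_dim 128) are assumptions:

```python
import numpy as np

GROUP_SIZE = 32  # assumption: one fp16 scale shared per 32 elements

def quantize_int4(x: np.ndarray):
    """Quantize an fp16 tensor to packed int4 codes plus per-group fp16 scales."""
    groups = x.astype(np.float32).reshape(-1, GROUP_SIZE)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0   # symmetric int4 range [-7, 7]
    scale[scale == 0] = 1.0                                    # avoid divide-by-zero
    codes = (np.clip(np.round(groups / scale), -7, 7) + 7).astype(np.uint8)  # shift to [0, 14]
    packed = codes[:, 0::2] | (codes[:, 1::2] << 4)            # two 4-bit codes per byte
    return packed, scale.astype(np.float16)

def dequantize_int4(packed: np.ndarray, scale: np.ndarray, shape):
    """Unpack int4 codes and rescale back to fp16."""
    lo = (packed & 0x0F).astype(np.float32) - 7.0
    hi = (packed >> 4).astype(np.float32) - 7.0
    codes = np.empty((packed.shape[0], GROUP_SIZE), dtype=np.float32)
    codes[:, 0::2], codes[:, 1::2] = lo, hi
    return (codes * scale.astype(np.float32)).astype(np.float16).reshape(shape)

# One layer's K and V for 8192 tokens, Llama-3-8B-like shapes (assumed).
kv = np.random.randn(2, 8192, 8, 128).astype(np.float16)  # (K/V, seq_len, kv_heads, head_dim)
packed, scale = quantize_int4(kv)
restored = dequantize_int4(packed, scale, kv.shape)
q4_bytes = packed.nbytes + scale.nbytes
print(f"fp16: {kv.nbytes / 2**20:.0f} MiB -> int4+scales: {q4_bytes / 2**20:.0f} MiB "
      f"({100 * (1 - q4_bytes / kv.nbytes):.0f}% smaller)")
```

Per layer this is 32 MiB of fp16 KV cache versus about 9 MiB quantized (~72% smaller; scale overhead eats into the headline figure, and the int3 scheme packs tighter). Summed over 32 layers, the fp16 total is roughly consistent with the 1113 MB number above.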

Depends on: apache/tvm#17159

cc @MasterJH5574 @tqchen

@XJY990705 commented

I want to use this PR, and I noticed the TVM branch it uses is f5f048b, but I hit some bugs while compiling mlc-llm. Could you tell me which TVM version this PR uses? I'm fairly sure the problems I ran into are caused by using the wrong TVM version.

@davidpissarra (Member, Author) commented Sep 5, 2024

Hi @XJY990705, it is actually still on f5f048b. You should be able to run it if you build everything from this branch (including TVM). I will rebase it in the meantime.

@XJY990705 commented

@davidpissarra Thank you for your reply, I will try again.

@XJY990705 commented

I have already solved this problem via tlc-pack/libflash_attn#8. Maybe when I switched to the f5f048b branch, that modification was lost. Anyway, thank you for your help!

@XJY990705 commented

@davidpissarra I noticed that 3rdparty/tvm/src/runtime/relax_vm/paged_kv_cache.cc is unchanged and mismatched with python/mlc_llm/nn/kv_cache.py when calling TVM_REGISTER_GLOBAL("vm.builtin.paged_attention_kv_cache_create_reduced"). Is there a commit you forgot to push?
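
For anyone debugging the same mismatch, a quick generic check with the standard TVM API (not something added by this PR) can confirm whether the packed function is registered in the runtime you built:

```python
import tvm

# Generic sanity check: the Python side can only call
# "vm.builtin.paged_attention_kv_cache_create_reduced" if the C++ runtime
# you built actually registered it via TVM_REGISTER_GLOBAL.
name = "vm.builtin.paged_attention_kv_cache_create_reduced"
func = tvm.get_global_func(name, allow_missing=True)
print(f"{name}: {'registered' if func is not None else 'missing from this TVM build'}")
```

If it prints "missing", the runtime was likely built from a commit without the PR's paged_kv_cache.cc changes.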
