Add quantized qwen2-0.5b #490

Merged: 2 commits merged into mlc-ai:main on Jun 27, 2024

Conversation


@bil-ash bil-ash commented Jun 26, 2024

Adds quantized (q4f16) qwen2-0.5b to the list of supported models. The referenced PR must be merged before merging this one.

to support quantized qwen2-0.5b

@CharlieFRuan CharlieFRuan left a comment

Thanks a lot for the contribution. Two minor changes: one for naming consistency, and one for the required MB after recalculation.

src/config.ts (Outdated)

        modelVersion +
        "/Qwen2-0.5B-Instruct-q4f16_1-webgpu.wasm",
      low_resource_required: true,
      vram_required_MB: 500, // rough estimate

Suggested change:
-      vram_required_MB: 500, // rough estimate
+      vram_required_MB: 944.62,

src/config.ts (Outdated)

@@ -601,6 +601,19 @@ export const prebuiltAppConfig: AppConfig = {
      },
    },
    // Qwen-2
    {
      model: "https://huggingface.co/mlc-ai/Qwen2-0.5B-Instruct-q4f16_1-MLC",
      model_id: "Qwen2-0.5B-Instruct-q4f16-MLC",

Suggested change:
-      model_id: "Qwen2-0.5B-Instruct-q4f16-MLC",
+      model_id: "Qwen2-0.5B-Instruct-q4f16_1-MLC",

@bil-ash bil-ash commented Jun 27, 2024

By the way, what is the formula for calculating the required VRAM?
Also, how does web-llm acquire VRAM? Does it take up the entire amount specified by vram_required_MB at initialization, or does it take up a limited amount and then acquire more if required?

@CharlieFRuan

Thanks for making the changes!

For VRAM, there are mainly three parts: model size, intermediate buffer size (for various matrix multiplications, etc.), and KV cache size. The sum of the first two is estimated during python -m mlc_llm compile and shown in its output. The KV cache size is context_window_size * head_dim * num_kv_heads * num_layers * 2 (K and V) * 2 (bytes per element for f16; 4 for f32).
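For illustration, here is a minimal TypeScript sketch of that estimate. The function is just the formula above, not a web-llm API; the example numbers for Qwen2-0.5B (head_dim 64, 2 KV heads, 24 layers, and the 896.62 MB parameters-plus-buffers figure for a 1024 prefill chunk size) are the ones quoted later in this thread.

    // KV cache MB = ctx * head_dim * num_kv_heads * num_layers * 2 (K and V) * bytes_per_element
    // (2 bytes for f16, 4 for f32). Sketch only; not part of the web-llm API.
    function kvCacheMB(
      ctx: number,
      headDim: number,
      numKvHeads: number,
      numLayers: number,
      bytesPerElement = 2,
    ): number {
      return (ctx * headDim * numKvHeads * numLayers * 2 * bytesPerElement) / (1024 * 1024);
    }

    // Qwen2-0.5B-Instruct-q4f16_1 with a 4K context window:
    const kvMB = kvCacheMB(4096, 64, 2, 24);              // 48 MB
    const paramsAndBuffersMB = 896.62;                    // from the `mlc_llm compile` output (1024 chunk size)
    console.log((paramsAndBuffersMB + kvMB).toFixed(2));  // "944.62"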

The vram_required_MB is merely an estimate, so it does not play a role at runtime. I believe the engine takes up most (if not all) of the required VRAM at initialization. Therefore, the prefill chunk size limits the amount of memory required for long prompts: prompts are simply chunked so the intermediate buffer size stays the same.

@CharlieFRuan CharlieFRuan merged commit 1da0f76 into mlc-ai:main Jun 27, 2024
@bil-ash bil-ash commented Jun 27, 2024

> For VRAM, there are mainly three parts: model size, intermediate buffer size (for various matrix multiplications, etc.), and KV cache size. [...]

So, VRAM = model + intermediate buffer + KV cache.
For this case, could you please specify all three components?
I am asking because I would like to have Qwen2-0.5B but with a 32k context. If we reduce the prefill chunk size to 1k, the calculation would be something like:
VRAM for 32k context = model + (intermediate buffer)/2 + 8 * KV cache
and maybe there won't be much of a difference in VRAM.

@CharlieFRuan

The head_dim etc. can be found in mlc-chat-config.json, so the KV cache is ctx * 64 * 2 * 24 * 2 * 2 bytes. For a 4K context, that is 48 MB, so the other two components are 944.62 - 48 = 896.62 MB. For 32K, it would be 8 * 48 MB. I'm personally not sure whether Qwen2-0.5B would support such a long context.
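As a quick check of that arithmetic (a sketch reusing the formula from the earlier comment, nothing web-llm-specific):

    // ctx * head_dim(64) * num_kv_heads(2) * num_layers(24) * 2 (K and V) * 2 bytes (f16)
    const kvCacheBytes = (ctx: number) => ctx * 64 * 2 * 24 * 2 * 2;

    kvCacheBytes(4096) / (1024 * 1024);   // 48 MB, so 944.62 - 48 = 896.62 MB for the other two parts
    kvCacheBytes(32768) / (1024 * 1024);  // 384 MB, i.e. 8 * 48 MB for a 32K context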

@bil-ash bil-ash commented Jun 27, 2024

Thanks for the info

@CharlieFRuan

I just ran compile again and got the following info:

  • For 1024 chunk size: 896.62 MB (Parameters: 265.12 MB. Temporary buffer: 631.50 MB)
  • For 2048 chunk size: 1528.12 MB (Parameters: 265.12 MB. Temporary buffer: 1263.00 MB)

I guess I used the wrong estimate; it should be 1528.12 + 48 MB (≈ 1576.12 MB) instead, since the wasm you uploaded uses a 2048 chunk size.
