[Bug] recent change of params "block_size_x/y, unroll" in dlight/gpu/matmul.py significantly decreases q4f16_1 prefill speed on android 8gen3 device #326

Open
xuguodong1999 opened this issue Sep 21, 2024 · 0 comments

In the mlc-llm v0.18.dev0 release, tvm (relax repo) commit 50d1c97dc98 leads to an extremely slow prefill speed for q4f16_1 Llama-2-7B-chat-hf on an Android 8gen3 device.

Expected behavior

Prefill speed for q4f16_1 Llama-2-7B-chat-hf was previously close to 10 tok/s.

Actual behavior

Prefill speed for q4f16_1 Llama-2-7B-chat-hf is now only ~0.3 tok/s.

Environment

  • Android 8gen3 (Snapdragon 8 Gen 3) device

  • mlc-llm v0.18.dev0

  • the tvm relax submodule pinned in the mlc-llm repo at that release

Steps to reproduce

  • Compile and bundle q4f16_1 Llama-2-7B-chat-hf, then launch it on an Android 8gen3 device.

By the way, when I revert the "block_size_x/y, unroll" values, along with the related parts of sch_outer_reduction, from (32, 8, 4) back to (8, 16, 64), both q4f16_0 and q4f16_1 prefill speeds return to normal (see the sketch below).
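
The revert I applied looks roughly like the following. This is a minimal sketch, not the actual dlight code: MatmulConfig is a hypothetical stand-in that only mirrors the three fields discussed in this issue, and the real config in dlight/gpu/matmul.py has more fields and may be structured differently at commit 50d1c97dc98.

```python
from dataclasses import dataclass


@dataclass
class MatmulConfig:
    """Hypothetical stand-in for the dlight matmul schedule config (illustration only)."""

    block_size_x: int
    block_size_y: int
    unroll: int


# Values after tvm commit 50d1c97dc98: prefill drops to ~0.3 tok/s on 8gen3.
slow_config = MatmulConfig(block_size_x=32, block_size_y=8, unroll=4)

# Reverted values: prefill returns to ~10 tok/s for both q4f16_0 and q4f16_1.
fast_config = MatmulConfig(block_size_x=8, block_size_y=16, unroll=64)
```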

This likely needs a fix, since most MLC-converted models are released in the q4f16_1 format.

Triage

  • vert:android