[Bug] recent change of params "block_size_x/y, unroll" in dlight/gpu/matmul.py significantly decreases q4f16_1 prefill speed on android 8gen3 device #326

Open
xuguodong1999 opened this issue Sep 21, 2024 · 0 comments

In the mlc-llm v0.18.dev0 release, tvm (relax repo) commit 50d1c97dc98 leads to an extremely slow prefill speed for q4f16_1 Llama-2-7B-chat-hf on an Android 8gen3 device.

Expected behavior

Prefill speed for q4f16_1 Llama-2-7B-chat-hf was previously close to 10 tok/s.

Actual behavior

Prefill speed for q4f16_1 Llama-2-7B-chat-hf is now only ~0.3 tok/s.

Environment

  • Android 8gen3 (Snapdragon 8 Gen 3) device

  • mlc-llm v0.18.dev0

  • the tvm relax submodule pinned in the mlc-llm repo at that release

Steps to reproduce

  • Compile and bundle q4f16_1 Llama-2-7B-chat-hf, then launch it on an Android 8gen3 device.

By the way, when I revert the "block_size_x/y, unroll" values, along with the related parts of sch_outer_reduction, from (32, 8, 4) back to (8, 16, 64), both q4f16_0 and q4f16_1 prefill speeds return to normal (see the sketch below).
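
The revert I applied looks roughly like the following. This is a minimal sketch, not the actual dlight code: MatmulConfig is a hypothetical stand-in that only mirrors the three fields discussed in this issue, and the real config in dlight/gpu/matmul.py has more fields and may be structured differently at commit 50d1c97dc98.

```python
from dataclasses import dataclass


@dataclass
class MatmulConfig:
    """Hypothetical stand-in for the dlight matmul schedule config (illustration only)."""

    block_size_x: int
    block_size_y: int
    unroll: int


# Values after tvm commit 50d1c97dc98: prefill drops to ~0.3 tok/s on 8gen3.
slow_config = MatmulConfig(block_size_x=32, block_size_y=8, unroll=4)

# Reverted values: prefill returns to ~10 tok/s for both q4f16_0 and q4f16_1.
fast_config = MatmulConfig(block_size_x=8, block_size_y=16, unroll=64)
```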

This likely needs a fix, since most MLC-converted models are released in the q4f16_1 format.

Triage

  • vert:android