How to Build and Run Llama 3.2 1B Instruct with the Qualcomm AI Engine Direct Backend? #6655
Comments
@baotonghe Hi, thank you for trying out the llama model on QNN. Since the command you ran didn't include the calibration process, the output will likely be very off. We're still working on a quantized 1B model for QNN.
@cccclai Thank you for your response; looking forward to your good news.
@crinex Thank you for the explanation. I've also read many questions, and everyone seems to have similar issues. Let's wait for the official updates and responses.
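For context on the calibration step the first comment mentions: in the PT2E quantization flow, "calibration" means running representative inputs through the observer-instrumented graph before conversion, so the observers can record real activation ranges. A minimal sketch, assuming the QnnQuantizer import path from the ExecuTorch Qualcomm backend; model, example_inputs, and calibration_samples are placeholders, not names from this issue:

import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from executorch.backends.qualcomm.quantizer.quantizer import QnnQuantizer

quantizer = QnnQuantizer()

# Capture the graph, then insert observers at quantizable ops.
exported = torch.export.export(model, example_inputs).module()
prepared = prepare_pt2e(exported, quantizer)

# Calibration: run representative prompts so the observers record
# activation ranges. Skipping this leaves the quantization parameters
# meaningless, which matches the "very off" output described above.
for sample in calibration_samples:  # placeholder iterable of input tuples
    prepared(*sample)

# Fold observers into quantize/dequantize ops before lowering to QNN.
quantized = convert_pt2e(prepared)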
Working Case
When I follow the doc https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#enablement, I can export the Llama3.2-1B-Instruct:int4-spinquant-eo8 model to an XNNPACK-backend .pte successfully, and it works correctly on the CPU.
Failing Case
But following https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md, when I export Llama3.2-1B-Instruct to the QNN backend, I get the output .pte file; when I run it on an Android device, though, it does not work correctly.
I export the .pte file like this:
python -m examples.models.llama.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --soc_model SM8550 --output_name="llama3_2_ptq_qnn_.pte"
This is part of the output from the export:
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_979, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_squeeze_copy_dims_175, aten.squeeze_copy.dims
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_add_tensor_79, aten.add.Tensor
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_select_copy_int_512, aten.select_copy.int
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_rms_norm_default_32, aten.rms_norm.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_view_copy_default_288, aten.view_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_980, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_convolution_default_112, aten.convolution.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_981, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_view_copy_default_289, aten.view_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: quantized_decomposed_dequantize_per_tensor_tensor, quantized_decomposed.dequantize_per_tensor.tensor
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
/home/hebaotong/AI/Executorch/executorch_new/executorch/exir/emit/emitter.py:1512: UserWarning: Mutation on a buffer in the model is detected. ExecuTorch assumes buffers that are mutated in the graph have a meaningless initial state, only the shape and dtype will be serialized.
warnings.warn(
INFO:root:Required memory for activation in bytes: [0, 17552384]
modelname: llama3_2_ptq_qnn
output_file: llama3_2_ptq_qnn_.pte
INFO:root:Saved exported program to llama3_2_ptq_qnn_.pte
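Note that, per the first comment, this export ran without any calibration. A variant of the same command with calibration options added might look like the following; the calibration flag names are my assumption from reading the export script's options, so verify them against python -m examples.models.llama.export_llama --help before relying on them:

# Flags on the --calibration_* lines are assumptions; check --help.
python -m examples.models.llama.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --calibration_data "Once upon a time" --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --soc_model SM8550 --output_name="llama3_2_ptq_qnn_.pte"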
[Screenshot of run status attached in the original issue]
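For reference, a typical way to run the exported model on device, assuming the llama_main runner binary and /data/local/tmp paths from the llama example README (both are assumptions, adjust to your build); a QNN build additionally needs the Qualcomm runtime libraries, e.g. libQnnHtp.so, pushed to the device and visible on its library path:

# Binary name, flags, and paths below are assumptions from the example README.
adb push llama3_2_ptq_qnn_.pte /data/local/tmp/llama/
adb push tokenizer.model /data/local/tmp/llama/
adb shell "cd /data/local/tmp/llama && ./llama_main --model_path llama3_2_ptq_qnn_.pte --tokenizer_path tokenizer.model --prompt \"Hello\" --seq_len 128"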