
How to build and run Llama 3.2 1B Instruct with the Qualcomm AI Engine Direct backend? #6655

Open
baotonghe opened this issue Nov 5, 2024 · 5 comments

Comments


baotonghe commented Nov 5, 2024

Working case

When I follow the doc https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#enablement, I can export the Llama3.2-1B-Instruct:int4-spinquant-eo8 model to an XNNPACK backend .pte successfully, and it works correctly on the CPU.


SpinQuant_XNNPACK
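
(For reference, the SpinQuant export command in that README looks roughly like the sketch below. The exact flag set has changed over time, so treat this as illustrative and check the README; the checkpoint and params paths are placeholders:)

python -m examples.models.llama.export_llama --checkpoint "${SPINQUANT_CHECKPOINT}" -p "${MODEL_DIR}/params.json" --use_sdpa_with_kv_cache -X --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --preq_embedding_quantize 8,0 --use_spin_quant native --max_seq_length 2048 -kv -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"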

Failing case

But following the same doc, https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md, when I export the Llama3.2-1B-Instruct model to the QNN backend, I do get the output .pte file, but when I run it on an Android device, it does not work correctly.

I export the .pte file like this:

python -m examples.models.llama.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --soc_model SM8550 --output_name="llama3_2_ptq_qnn_.pte"

This is part of the output from the export:

INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_979, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_squeeze_copy_dims_175, aten.squeeze_copy.dims
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_add_tensor_79, aten.add.Tensor
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_select_copy_int_512, aten.select_copy.int
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_rms_norm_default_32, aten.rms_norm.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_view_copy_default_288, aten.view_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_980, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_convolution_default_112, aten.convolution.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_981, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_view_copy_default_289, aten.view_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: quantized_decomposed_dequantize_per_tensor_tensor, quantized_decomposed.dequantize_per_tensor.tensor
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
/home/hebaotong/AI/Executorch/executorch_new/executorch/exir/emit/emitter.py:1512: UserWarning: Mutation on a buffer in the model is detected. ExecuTorch assumes buffers that are mutated in the graph have a meaningless initial state, only the shape and dtype will be serialized.
warnings.warn(
INFO:root:Required memory for activation in bytes: [0, 17552384]
modelname: llama3_2_ptq_qnn

output_file: llama3_2_ptq_qnn_.pte
INFO:root:Saved exported program to llama3_2_ptq_qnn_.pte

Screenshot of run status

PTQ_QNN
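
(For reference, the on-device run follows the runner flow from the same README; a typical invocation looks roughly like the sketch below, where the device paths, tokenizer, and prompt are placeholders. For the QNN backend, the Qualcomm runtime libraries also have to be pushed to the device and made visible to the runner, e.g. via LD_LIBRARY_PATH and ADSP_LIBRARY_PATH:)

adb push llama3_2_ptq_qnn_.pte /data/local/tmp/llama/
adb push tokenizer.model /data/local/tmp/llama/
adb push llama_main /data/local/tmp/llama/
adb shell "cd /data/local/tmp/llama && ./llama_main --model_path llama3_2_ptq_qnn_.pte --tokenizer_path tokenizer.model --prompt \"What is the capital of France?\" --seq_len 128"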


crinex commented Nov 5, 2024

@baotonghe
You should check the other QNN issues; most of them run into the same error as yours. They (the ExecuTorch team) haven't suggested any countermeasures or alternative methods yet.

cccclai (Contributor) commented Nov 5, 2024

Hi, thank you for trying out the Llama model on QNN. Since the command you ran didn't include the calibration process, the output will likely be very off. We're still working on the quantized 1B model for QNN.
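
(For reference, export_llama exposes calibration options for PTQ; these appear in the ExecuTorch docs for running Llama with the Qualcomm backend. A sketch of the command above with calibration added, where the task, limit, sequence length, and calibration text are illustrative values rather than a verified recipe:)

python -m examples.models.llama.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --calibration_data "Once upon a time" --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --soc_model SM8550 --output_name="llama3_2_ptq_qnn_.pte"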


crinex commented Nov 6, 2024

@cccclai
Are you suggesting that the recommended calibration method is to use SpinQuant?

baotonghe reopened this Nov 6, 2024
baotonghe (Author) commented Nov 6, 2024

> Hi, thank you for trying out the Llama model on QNN. Since the command you ran didn't include the calibration process, the output will likely be very off. We're still working on the quantized 1B model for QNN.

@cccclai Thank you for your response; I'm looking forward to the good news.

baotonghe (Author) commented Nov 6, 2024

> @baotonghe You should check the other QNN issues; most of them run into the same error as yours. They (the ExecuTorch team) haven't suggested any countermeasures or alternative methods yet.

@crinex Thank you for the explanation. I've also read many of these issues, and everyone seems to run into similar problems. Let's wait for the official updates and response.
