
How to build and run Llama 3.2 1B Instruct with the Qualcomm AI Engine Direct backend? #6655

Open
baotonghe opened this issue Nov 5, 2024 · 5 comments

Comments


baotonghe commented Nov 5, 2024

Working case

When I follow the doc https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md#enablement, I can export the Llama3.2-1B-Instruct:int4-spinquant-eo8 model to an XNNPACK backend .pte successfully, and it works correctly on the CPU.


SpinQuant_XNNPACK
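
(For reference, the SpinQuant export command in that README looks roughly like the sketch below. The exact flag set has changed over time, so treat this as illustrative and check the README; the checkpoint and params paths are placeholders:)

python -m examples.models.llama.export_llama --checkpoint "${SPINQUANT_CHECKPOINT}" -p "${MODEL_DIR}/params.json" --use_sdpa_with_kv_cache -X --xnnpack-extended-ops --preq_mode 8da4w_output_8da8w --preq_group_size 32 --preq_embedding_quantize 8,0 --use_spin_quant native --max_seq_length 2048 -kv -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --output_name "llama3_2_spinquant.pte"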

Failing case

But following the same doc, https://github.com/pytorch/executorch/blob/main/examples/models/llama/README.md, when I export the Llama3.2-1B-Instruct model to the QNN backend, I do get the output .pte file, but when I run it on an Android device, it does not work correctly.

I export the .pte file like this:

python -m examples.models.llama.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --soc_model SM8550 --output_name="llama3_2_ptq_qnn_.pte"

This is part of the output from the export:

INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_979, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_squeeze_copy_dims_175, aten.squeeze_copy.dims
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_add_tensor_79, aten.add.Tensor
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_select_copy_int_512, aten.select_copy.int
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_rms_norm_default_32, aten.rms_norm.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_view_copy_default_288, aten.view_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_980, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_convolution_default_112, aten.convolution.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_permute_copy_default_981, aten.permute_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: aten_view_copy_default_289, aten.view_copy.default
INFO:executorch.backends.qualcomm.qnn_preprocess:Visiting: quantized_decomposed_dequantize_per_tensor_tensor, quantized_decomposed.dequantize_per_tensor.tensor
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend parameters
[INFO] [Qnn ExecuTorch]: Destroy Qnn context
[INFO] [Qnn ExecuTorch]: Destroy Qnn device
[INFO] [Qnn ExecuTorch]: Destroy Qnn backend
/home/hebaotong/AI/Executorch/executorch_new/executorch/exir/emit/emitter.py:1512: UserWarning: Mutation on a buffer in the model is detected. ExecuTorch assumes buffers that are mutated in the graph have a meaningless initial state, only the shape and dtype will be serialized.
warnings.warn(
INFO:root:Required memory for activation in bytes: [0, 17552384]
modelname: llama3_2_ptq_qnn

output_file: llama3_2_ptq_qnn_.pte
INFO:root:Saved exported program to llama3_2_ptq_qnn_.pte

Screenshot of run status

PTQ_QNN
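
(For reference, the on-device run follows the runner flow from the same README; a typical invocation looks roughly like the sketch below, where the device paths, tokenizer, and prompt are placeholders. For the QNN backend, the Qualcomm runtime libraries also have to be pushed to the device and made visible to the runner, e.g. via LD_LIBRARY_PATH and ADSP_LIBRARY_PATH:)

adb push llama3_2_ptq_qnn_.pte /data/local/tmp/llama/
adb push tokenizer.model /data/local/tmp/llama/
adb push llama_main /data/local/tmp/llama/
adb shell "cd /data/local/tmp/llama && ./llama_main --model_path llama3_2_ptq_qnn_.pte --tokenizer_path tokenizer.model --prompt \"What is the capital of France?\" --seq_len 128"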


crinex commented Nov 5, 2024

@baotonghe
You should check the other QNN issues; most of them run into the same error as yours. They (the ExecuTorch team) haven't suggested any countermeasures or alternative methods yet.

cccclai (Contributor) commented Nov 5, 2024

Hi, thank you for trying out the Llama model on QNN. Since the command you ran didn't include the calibration process, the output will likely be very off. We're still working on the quantized 1B model for QNN.
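
(For reference, export_llama exposes calibration options for PTQ; these appear in the ExecuTorch docs for running Llama with the Qualcomm backend. A sketch of the command above with calibration added, where the task, limit, sequence length, and calibration text are illustrative values rather than a verified recipe:)

python -m examples.models.llama.export_llama --checkpoint "${MODEL_DIR}/consolidated.00.pth" -p "${MODEL_DIR}/params.json" -kv --disable_dynamic_shape --qnn --pt2e_quantize qnn_16a4w -d fp32 --calibration_tasks wikitext --calibration_limit 1 --calibration_seq_length 128 --calibration_data "Once upon a time" --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' --soc_model SM8550 --output_name="llama3_2_ptq_qnn_.pte"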


crinex commented Nov 6, 2024

@cccclai
Are you suggesting that the recommended calibration method is to use SpinQuant?

baotonghe reopened this Nov 6, 2024
baotonghe (Author) commented Nov 6, 2024

> Hi, thank you for trying out the Llama model on QNN. Since the command you ran didn't include the calibration process, the output will likely be very off. We're still working on the quantized 1B model for QNN.

@cccclai Thank you for your response; I'm looking forward to the good news.

baotonghe (Author) commented Nov 6, 2024

> @baotonghe You should check the other QNN issues; most of them run into the same error as yours. They (the ExecuTorch team) haven't suggested any countermeasures or alternative methods yet.

@crinex Thank you for the explanation. I've also read many of these issues, and everyone seems to run into similar problems. Let's wait for the official updates and response.
