🚀 The feature, motivation and pitch
Context
Previously (late 2023) we kicked off our LLM workstream by enabling llama2 on ET. Both the export and runner code live under examples/models/llama2. As we added support for more and more models (llama3, phi-3, llava, etc.), we created the extension/llm folder to avoid duplicating code across these models.
Later we started torchchat (June 2024), and due to its urgency we decided to duplicate code rather than reuse code from ET.
Now (Oct 2024) is a good time to consolidate the APIs we want to expose from ET and let torchchat reuse them. This work also makes it easier for external users to run LLMs on ET.
Problems
Looking at the LLM-related code inside ET and in torchchat, I can see some problems:
Export flow:
A lot of the features eligible for sharing still live in examples/models/llama/export_llama_lib.py.
ET users are still writing their own export flows, which may not be fully optimized.
Missing features such as multiple entry points, for example exporting a .pte file that contains both a vision encoder and a text decoder for multimodality (see the sketch after this list).
Option names are not descriptive and clear enough, and docstrings and help text are missing.
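To make the multi-entry-point gap concrete, here is a minimal sketch of what exporting two methods into a single program can look like at the exir level, assuming to_edge() accepts a dict of exported programs keyed by method name; the tiny modules are stand-ins for a real vision encoder and text decoder:

```python
import torch
from torch.export import export
from executorch.exir import to_edge


class TinyEncoder(torch.nn.Module):
    """Stand-in for a vision encoder; a real model would come from the user."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(16, 8)

    def forward(self, x):
        return self.proj(x)


class TinyDecoder(torch.nn.Module):
    """Stand-in for a text decoder."""
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(100, 8)

    def forward(self, tokens):
        return self.emb(tokens)


# Export each entry point separately, then lower both into one edge program so
# the resulting .pte exposes two named methods instead of needing two files.
edge = to_edge({
    "vision_encoder": export(TinyEncoder(), (torch.randn(1, 16),)),
    "text_decoder": export(TinyDecoder(), (torch.zeros(1, 4, dtype=torch.long),)),
})
et_program = edge.to_executorch()
with open("multimodal.pte", "wb") as f:
    f.write(et_program.buffer)
```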
Runner:
Inside the ET examples, multiple runner implementations exist. This causes issues when we integrate these runners into demo apps, since we have to write a JNI layer for each runner (e.g., MediaTek has its own runner).
Distribution channels need to be consolidated. We should prebuild the runner code with the iOS and Android toolchains and distribute the libraries/artifacts to users. For users on special toolchains, we should provide good support so they can build their own runner from source.
Other API changes: make the runner resemble Hugging Face's transformers API.
Alternatives
The alternative is to do nothing, which means extension/llm will not be used by external users because its API and documentation are not good enough.
Additional context
No response
RFC (Optional)
What should we offer, and what should the APIs look like? Breaking this down into several categories:
Model definition
Our LLM library should not hold any model definitions, since the intention is for users to apply the export flow and runner to a generic transformer-based LLM. This implies the export utilities and runner should work with most LLMs (e.g., from torchtune or Hugging Face). In the ET examples folder, we can showcase some models working with our LLM library.
However, we will provide some example module definitions, such as SDPA. This is because we use source transformations to replace these modules with custom ops that perform better on ET. Since these custom ops are tightly coupled with the example modules, it makes sense to provide sample implementations. See more in the source transformation section.
Proposal: a modules/ directory under extension/llm to host these special modules. They will work with source transformations and custom ops.
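As an illustration (not the actual code that would land in modules/), such an example module is essentially a thin wrapper around the standard op, giving source transformations a stable module boundary to target:

```python
import torch
import torch.nn.functional as F


class SDPA(torch.nn.Module):
    """Illustrative example module: wraps standard scaled dot-product attention
    behind a module boundary so a source transformation can later swap it for a
    custom-op-backed implementation without touching the rest of the model."""

    def forward(self, q, k, v, attn_mask=None):
        return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```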
Export flow
LLMEdgeManager is our main .pte export helper class. It takes a torch.nn.Module along with other configs and provides APIs to quantize, lower to the edge dialect, lower to different backends, and eventually lower to ExecuTorch.
Proposal: add a top-level entry point (a function like executorch.extension.llm.export(), following the PyTorch convention). The function will return an LLMEdgeManager, and users will call quantize(), source_transformation(), etc.
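To make the proposal concrete, here is a rough sketch of what the user-facing flow could look like if it is adopted; export() and LLMEdgeManager come from the proposal above, while the chained method names and arguments below are illustrative, not a finalized signature:

```python
import torch
from executorch.extension.llm import export as llm_export  # proposed entry point

# Any generic transformer-based LLM (torchtune, Hugging Face, custom); the
# layer below is just a placeholder for the sketch.
model = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4)

manager = llm_export(model)          # returns an LLMEdgeManager
(
    manager
    .source_transformation()         # swap standard modules for ET-friendly ones
    .quantize()                      # apply the configured quantization scheme
    .to_edge()                       # lower to the edge dialect
    .to_backend()                    # delegate to the selected backend(s)
    .to_executorch()                 # produce the final .pte program
)
```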
Source transformations. These files currently sit under examples/llama/source_transformation, but they can be applied to other models as well.
Proposal: move them to extension/llm/export/source_transformation. Ideally, source transformations should not target customized torch.nn.Module implementations, only standard PyTorch torch.nn.Module ones.
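For illustration, a source transformation in this sense reduces to a function that walks a standard torch.nn.Module tree and swaps targeted submodules in place; the helper below is a hypothetical sketch, not existing ET code:

```python
import torch


def swap_modules(model, target_cls, make_replacement):
    """Recursively replace every submodule that is an instance of target_cls
    (e.g., an SDPA wrapper) with the module returned by make_replacement,
    which could be backed by an ET custom op. Mutates the model in place."""
    for name, child in model.named_children():
        if isinstance(child, target_cls):
            setattr(model, name, make_replacement(child))
        else:
            swap_modules(child, target_cls, make_replacement)
    return model
```

Because it only relies on the standard torch.nn.Module containment API (named_children(), setattr), a transformation written this way applies equally well to torchtune or Hugging Face models.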
Quantization: we currently have quantizers defined in extension/llm/export/quantizer_lib.py; we should keep them there. There is quantization code in the source transformations as well; we should figure out whether we can migrate it to torchao's quantize_() API (see the sketch below).
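For reference, a hedged sketch of what the torchao path looks like, assuming a recent torchao release where quantize_() and the int8_weight_only config are available; mapping the existing source-transformation-based schemes onto torchao configs is exactly the open question:

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

# Placeholder model; in practice this would be the LLM being exported.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())

# quantize_() mutates the model in place, swapping eligible modules for
# quantized equivalents, so it slots in before the lowering steps above.
quantize_(model, int8_weight_only())
```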
Partitioners: we currently have partitioners defined in extension/llm/export/partitioner_lib.py; we should keep them there.
C++ Runner & Tokenizer
Sampler class
Proposal: temperature should be passed as an argument to the sample() method instead of being a constructor argument.
Tokenizer base class, with BPETokenizer and Tiktoken extending it.
Proposal: merge with torchchat’s tokenizer implementations. Absorb SPTokenizer from torchchat. This means torchchat will start to use the tokenizer from ET.
Runner and other components such as TextPrefiller, ImagePrefiller, TextDecoder, TextTokenGenerator
Proposal: package them in the iOS SwiftPM distribution for iOS developers to use.
Proposal: migrate the existing runners in the ET examples to a common Runner base class so that the JNI layer stays simple.