RFC: extension/llm API design #6558

larryliu0820 opened this issue Oct 29, 2024 · 0 comments
🚀 The feature, motivation and pitch

Context

Previously (late 2023) we kicked off our LLM workstream by enabling llama2 on ET. Both the export and runner code live under examples/models/llama2. As we added support for more models (llama3, phi-3, llava, etc.), we created the extension/llm folder to avoid duplicating code across them.

Later (June 2024) we started torchchat, and due to its urgency we decided to duplicate code instead of reusing code from ET.

Now (Oct 2024) is a good time to consolidate the APIs we want to expose from ET and let torchchat reuse them. This work also benefits external users who want to run LLMs on ET.

Problems

Looking at the LLM-related code inside ET and in torchchat, I see several problems:

Export flow:

  • Many features that are eligible for sharing still live in examples/models/llama/export_llama_lib.py.
  • ET users still write their own export flows, which may not be fully optimized.
  • Missing features, such as multiple entry points, e.g., exporting a .pte file that contains both a vision encoder and a text decoder for multimodality.
  • Option names are not descriptive or clear enough, and docstrings and help text are missing.

Runner:

  • Multiple runner implementations exist inside the ET examples. This causes issues when we integrate these runners into demo apps, since we have to write a JNI layer for each runner. E.g., MediaTek has its own runner.
  • Distribution channels need to be consolidated. We should prebuild the runner with the iOS and Android toolchains and distribute the resulting libraries/artifacts to users. For users with special toolchains, we should provide good support for building their own runner from source.
  • Other API changes, e.g., making the runner resemble Hugging Face's transformers API.

Alternatives

The alternative is to do nothing, which means extension/llm will not be adopted by external users because its API and documentation are not good enough.

Additional context

No response

RFC (Optional)

What should we offer, and what should the APIs look like? Breaking it down into several categories:

  • Model definition

    • Our llm library should not hold any model definitions, given that the intention is for users to apply the export utilities and runner to a generic transformer-based LLM. This implies the export utils and runner should work with most LLMs (e.g., from torchtune or Hugging Face). In the ET examples folder, we can showcase some models working with our llm library.
    • However, we will provide some example module definitions, such as SDPA, because we use source transformation to replace these modules with custom ops that perform better on ET. Since these custom ops are tightly coupled with the example modules, it makes sense to provide sample implementations. See more in the source transformation section.
      • Proposal: a modules/ directory under extension/llm to host these special modules. They will work with source transformations and custom ops.
  • Export flow

    • LLMEdgeManager as our main .pte export helper class. It takes a torch.nn.Module along with other configs and provides APIs to quantize, lower to the edge dialect, lower to different backends, and eventually lower to ExecuTorch.
      • Proposal: add a top-level entry point (a function like executorch.extension.llm.export(), following the PyTorch convention). The function returns an LLMEdgeManager, and users then call quantize(), source_transformation(), etc. (see the export sketch after this list).
    • Source transformations. These files currently sit under examples/llama/source_transformation, but they can be applied to other models as well.
      • Proposal: move them to extension/llm/export/source_transformation. Ideally, a source transformation should not target customized torch.nn.Module subclasses but only standard PyTorch torch.nn.Modules (see the source-transformation sketch after this list).
    • Quantization: we currently have quantizers defined in extension/llm/export/quantizer_lib.py; we should keep them there. There is quantization code in the source transformations as well; we should figure out whether we can migrate it to torchao's quantize_() API (see the quantization sketch after this list).
    • Partitioner: we currently have partitioners defined in extension/llm/export/partitioner_lib.py; we should keep them there.
  • C++ Runner & Tokenizer

    • Sampler class
      • Proposal: temperature should be an argument to the sample() method instead of an argument to the constructor.
    • Tokenizer base class, with BPETokenizer and Tiktoken extending it.
      • Proposal: merge with torchchat's tokenizer implementations and absorb SPTokenizer from torchchat. This means torchchat will start to use the tokenizers from ET.
    • Runner and other components such as TextPrefiller, ImagePrefiller, TextDecoder, TextTokenGenerator
      • Proposal: ship them via SwiftPM for iOS developers to use.
      • Proposal: migrate the existing runners in the ET examples to use a Runner base class so that the JNI layer stays simple.
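
Export sketch: a minimal illustration of how the proposed top-level entry point could be used. The names below (the load_model() helper and the exact LLMEdgeManager method names) are hypothetical and only convey the intended shape of the API; they are not an existing ExecuTorch interface.

```python
# Hypothetical usage of the proposed executorch.extension.llm.export() entry
# point; LLMEdgeManager method names here are illustrative, not final.
import torch
from executorch.extension.llm import export as llm_export  # proposed API

model: torch.nn.Module = load_model()  # any generic transformer-based LLM (hypothetical helper)

manager = llm_export(model)            # returns an LLMEdgeManager
manager = (
    manager
    .source_transformation()           # swap modules for custom-op-backed versions
    .quantize()                        # e.g., delegate to torchao's quantize_()
    .to_edge()                         # lower to the edge dialect
    .to_backend()                      # partition/delegate to a backend
    .to_executorch()                   # produce the final ExecuTorch program
)
manager.save("model.pte")              # write the .pte artifact
```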
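
Source-transformation sketch: one way a reusable source transformation could look, recursively swapping a known attention module for a custom-op-backed replacement. CustomSDPA and replace_sdpa are hypothetical names used only for illustration; the real modules and custom ops would live under extension/llm.

```python
# Illustrative source transformation; CustomSDPA and replace_sdpa are
# hypothetical stand-ins for the modules/ops proposed under extension/llm.
import torch


class CustomSDPA(torch.nn.Module):
    """Stand-in for a module that would dispatch to an ET custom op."""

    def forward(self, q, k, v, mask=None):
        # The real module would call a registered custom op here; plain SDPA
        # is used so this sketch runs anywhere.
        return torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)


def replace_sdpa(module: torch.nn.Module, target_type: type) -> torch.nn.Module:
    """Recursively replace `target_type` submodules with CustomSDPA, in place."""
    for name, child in module.named_children():
        if isinstance(child, target_type):
            setattr(module, name, CustomSDPA())
        else:
            replace_sdpa(child, target_type)
    return module
```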
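
Quantization sketch: a minimal example of routing weight-only quantization through torchao's quantize_() API (exact import paths and available configs depend on the torchao version).

```python
# Minimal torchao quantize_() example; import paths and available configs
# vary across torchao versions.
import torch
from torchao.quantization import quantize_, int8_weight_only

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096))
quantize_(model, int8_weight_only())  # quantizes Linear weights in place
```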