Distributed Speculative Inference (DSI) is the fastest off-the-shelf inference algorithm, introduced in the paper "Distributed Speculative Inference of Large Language Models" (arXiv:2405.14105, May 2024).
|  | EAGLE[^1] | Speculative Inference (SI)[^2][^3] | Distributed Speculative Inference (DSI) |
| --- | --- | --- | --- |
| Works off-the-shelf (no additional training or architecture changes) | ❌ | ✅ | ✅ |
| Supports lossless inference (generates the same tokens as traditional autoregressive inference) | ✅ | ✅ | ✅ |
| Supports lossy inference (might generate different tokens than traditional autoregressive inference) | ✅ | ✅ |  |
| Faster than traditional autoregressive inference |  | ❌ (depends on the drafter) | ✅ (empirically & mathematically proven) |
| Faster than SI |  | ❌ | ✅ (empirically & mathematically proven) |
This repo includes an implementation of DSI and all four experiments from the paper:
1. Estimating the speedup of DSI (compared to SI and non-SI) by measuring wall time, based on the estimates from experiments 3 and 4
2. Estimating the speedup of DSI (compared to SI and non-SI) by counting forward passes
3. Estimating the acceptance rate of off-the-shelf LLMs
4. Estimating the forward latency of off-the-shelf LLMs
Either use the devcontainer (recommended) or:
- Install Poetry (see the official documentation).
- Install this project's environment:
  ```
  poetry install
  ```
Then activate Poetry's virtual environment:
```
poetry shell
```
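A quick, optional way to verify the environment is to invoke the CLI's help from within the Poetry environment (this just combines the commands documented below):
```
poetry run python -m dsi --help
```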
There are two types of runs: offline (measuring time units or acceptance rate) and online (measuring wall time).
Sanity check. To run a sanity check with simple simulations:
```
python -m dsi
```
Heatmaps. To create a heatmap (like Figure 1 in the paper):
```
python -m dsi type=heatmap
```
The heatmap is essentially a grid of simulations with different configurations. Unless specified otherwise, the simulations are offline, namely, counting time units rather than wall time. You can control the resolution of the heatmap by setting its `ndim` parameter. The paper uses `ndim=20`:
```
python -m dsi type=heatmap heatmap.ndim=20
```
We use Ray to run offline simulations in parallel.
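For intuition, here is a minimal sketch of how such a grid of offline simulations can be parallelized with Ray. It is illustrative only, not the repo's implementation: `estimate_time_units` is a hypothetical stand-in that Monte-Carlo-estimates SI's cost in time units (one target forward = 1 unit) under the standard SI cost model, and the lookahead `k` and grid values are assumptions made up for the example.

```python
import random

import ray

ray.init()


@ray.remote
def estimate_time_units(c: float, a: float, k: int = 5, n_tokens: int = 100_000) -> float:
    """Toy Monte Carlo estimate of SI's cost per generated token, in time units."""
    total_time, generated = 0.0, 0
    while generated < n_tokens:
        total_time += k * c + 1.0  # k drafter forwards, then 1 target forward
        accepted = 0
        while accepted < k and random.random() < a:
            accepted += 1  # each draft is accepted i.i.d. with probability a
        generated += accepted + 1  # accepted drafts + 1 token from the target itself
    return total_time / generated


# One Ray task per (c, a) grid point; tasks run in parallel across CPU cores.
grid = [(c, a) for c in (0.01, 0.05, 0.1) for a in (0.5, 0.7, 0.9)]
futures = [estimate_time_units.remote(c, a) for c, a in grid]
for (c, a), t in zip(grid, ray.get(futures)):
    print(f"c={c:.2f}, a={a:.1f}: {t:.3f} time units per token")
```

Each grid point is an independent task, which is what makes the offline heatmap embarrassingly parallel.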
You can also compute a heatmap based on online (rather than offline) simulations, namely, estimating wall time (rather than counting forward passes in time units):
```
python -m dsi type=heatmap heatmap.online=True
```
The online simulation of DSI uses a thread pool to estimate wall time. Since online simulations depend on the available resources, we run them one by one; we do not use Ray for online simulations.
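For a rough sense of what the online mode measures, the toy sketch below (again, not the repo's code) emulates drafter and target forward passes as sleeps on a thread pool and measures the elapsed wall time directly; the latencies `C`, `T` and lookahead `K` are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor

C, T, K = 0.01, 0.05, 5  # invented drafter latency (s), target latency (s), lookahead


def forward(latency: float) -> None:
    time.sleep(latency)  # stand-in for a model forward pass


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    # The drafter keeps drafting while the target verifies earlier drafts,
    # so the two latencies overlap instead of adding up.
    drafting = pool.submit(lambda: [forward(C) for _ in range(K)])
    verifying = pool.submit(forward, T)
    drafting.result()
    verifying.result()
print(f"wall time: {time.perf_counter() - start:.3f}s (serial would be {K * C + T:.3f}s)")
```

Because such measurements are sensitive to whatever else is running on the machine, running them one at a time keeps the simulations from skewing each other.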
Hydra manages all the configurations (defined under `dsi/config`). For example:
- to set the drafter latency (`c`) to 5%:
  ```
  python -m dsi simul.c=.05
  ```
- to set the acceptance rate (`a`) to 50%:
  ```
  python -m dsi simul.a=.5
  ```
To list all the configurable parameters:
```
python -m dsi --help
```
For more sophisticated combinations of configurations, check out Hydra's documentation.
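For example, overrides can be combined in a single command, and Hydra's multirun flag (`-m`/`--multirun`) should sweep over comma-separated values, assuming the standard `@hydra.main` entry point; the values below are illustrative only:
```
python -m dsi simul.c=.05 simul.a=.5
python -m dsi --multirun simul.c=.01,.05 simul.a=.5,.9
```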
By default, running new experiments will also visualize the results. To visualize existing results (pre-computed), provide their path:
```
python -m dsi type=heatmap load_results="results/offline/heatmap/heatmap-20240702-012750.csv"
```
Run tests (from the project root):
```
python ./scripts/test.py all
```
It runs the tests that measure wall time serially and then the rest of the tests in parallel. You can run only the online tests (for example, with `python ./scripts/test.py online -- -vvv`) or only the offline tests (`poetry run python ./scripts/test.py offline -- -vvv`).
Run `pre-commit run --all-files` to check formatting and re-format where possible.
DVC tracks raw results stored on Google Drive. To pull the results:
```
dvc pull
```
Our efforts and resources are supported by the following organizations. Thank you for your support!
- Weizmann Institute
- Intel Labs
If you use DSI (or the code in this repo) for your research, please cite our paper:
```bibtex
@article{timor2024distributed,
  title={Distributed Speculative Inference of Large Language Models},
  author={Timor, Nadav and Mamou, Jonathan and Korat, Daniel and Berchansky, Moshe and Pereg, Oren and Wasserblat, Moshe and Galanti, Tomer and Gordon, Michal and Harel, David},
  journal={arXiv preprint arXiv:2405.14105},
  year={2024}
}
```
[^1]: Li, Yuhui, et al. "EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty." Forty-first International Conference on Machine Learning, 2024.
[^2]: Leviathan, Yaniv, Matan Kalman, and Yossi Matias. "Fast Inference from Transformers via Speculative Decoding." International Conference on Machine Learning. PMLR, 2023.
[^3]: Chen, Charlie, et al. "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv preprint arXiv:2302.01318 (2023).