Releases · ROCm/vllm
v0.6.3.post2+rocm
What's Changed
- fp8 moe configs. Mixtral-8x(7B,22B) TP=1,2,4,8 by @divakar-amd in #250
- Sccache removal from Dockerfile.rocm by @omirosh in #253
- Update Dockerfile.rocm by @shajrawi in #254
- Using the correct type hints by @gshtras in #256
- Revert "Update Dockerfile.rocm" by @gshtras in #257
- Creating ROCm whl upon release by @gshtras in #259
Full Changelog: v0.6.3.post1+rocm...v0.6.3.post2+rocm
What's Changed
- Miscellaneous cosmetic changes by @mawong-amd in #166
- V5.5 upstream merge rc by @gshtras in #167
- fnuz support for fbgemm fp8 by @gshtras in #169
- Fixing mypy after a rushed merge by @gshtras in #171
- [fix] moe padding for reading correct tuned config by @divakar-amd in #172
- Upstream merge 24/9/9 by @gshtras in #174
- Restoring deleted .buildkite/test-template.j2 by @Alexei-V-Ivanov-AMD in #177
- Support commandr on ROCm by @shajrawi in #180
- Correct type hint by @gshtras in #173
- update custom PA kernel with support for fp8 kv cache dtype by @sanyalington in #87
- Support Grok-1 by @kkHuang-amd in #181
- Adding MLPerf optimization to 0.6.0 by @charlifu in #182
- 6.2 dockerfile by @gshtras in #176
- [Grok1] fix the name of input scale factor for autofp8 run by @kkHuang-amd in #183
- [Grok-1] fix the run-time error "Can't pickle <class 'transformers_mo… by @kkHuang-amd in #184
- Upstream merge 24/09/16 by @gshtras in #187
- Perf improvement: remove redundant torch slice; Match decode PA partition size to csrc by @sanyalington in #188
- refactor dbrx experts to use FusedMoe layer by @divakar-amd in #186
- Disable moe padding by default and enable fp8 padding by default. by @charlifu in #190
- Enabling Splitting HW by Buildkite Agents by @Alexei-V-Ivanov-AMD in #191
- Revert "remove redundant slice; match decode PA partition size with csrc (#188)" by @gshtras in #194
- [Grok-1] 1. upload moe configuration file for moe kernel optimization… by @kkHuang-amd in #193
- Removing the original text in reminder_comment.yml by @Alexei-V-Ivanov-AMD in #195
- Fix PA custom and PA v2 tests and partition sizes by @mawong-amd in #196
- Adding P3L measurement to the benchmarks collection tools. by @Alexei-V-Ivanov-AMD in #197
- Swapping the order of sampling operations in the conditional selector. by @Alexei-V-Ivanov-AMD in #199
- remove redundant slice when chunked prefill feature is disabled by @sanyalington in #201
- Fixing P3L incompatibility with cython. by @Alexei-V-Ivanov-AMD in #200
- Bias and more metadata in gradlib and tuned gemm by @gshtras in #202
- Upstream merge 24 9 23 by @gshtras in #203
- Gating n=0 case from skinny gemm by @gshtras in #204
- Revert "[Kernel] changing fused moe kernel chunk size default to 32k (vllm-project#7995)" by @gshtras in #207
- re-enable avoid torch slice fix when chunked prefill is disabled by @sanyalington in #209
- add block_manager_v2.py into setup_cython by @sanyalington in #210
- extend moe padding to DUMMY weights by @divakar-amd in #211
- [Int4-AWQ] Fix AWQ Marlin check for ROCm by @hegemanjw4amd in #206
- RPD Profiling by @dllehr-amd in #208
- Cythonize vllm build by @maleksan85 in #214
- Fix Dockerfile.rocm by @gshtras in #215
- fix dbrx weight loader by @divakar-amd in #212
- Upstream merge 24 09 27 0.6.2 by @gshtras in #213
- Make rpdtracer import only when required by @Rohan138 in #216
- Improve profiling setup and documentation, sync benchmarks with main by @AdrianAbeyta in #218
- Installing the requirements before invoking setup.py since it now imports setuptools_scm by @gshtras in #221
- llama3.2 + cross attn test by @maleksan85 in #220
- Optimize CAR for ROCm by @iotamudelta in #225
- Custom PA perf improvements by @sanyalington in #222
- Upstream merge 24 10 08 by @gshtras in #226
- customPA write fp8 small ctx fix; enable customPA write fp8 by default by @sanyalington in #227
- added timeout for vllm build in rocm by @maleksan85 in #230
- Add fp8 for dbrx by @charlifu in #231
- Update Buildkite env variable by @dhonnappa-amd in #232
- cuda graph + num-scheduler-steps bug fix by @seungrokj in #236
- [Model] [BUG] Fix code path logic to load mllama model by @tjtanaa in #234
- prefix-enabled FA perf issue by @seungrokj in #239
- Custom PA Partition size 256 to improve performance by @sanyalington in #238
- [Build/CI] Minor changes to fix internal CI process. by @Alexei-V-Ivanov-AMD in #235
- [BUGFIX] Restored handling of ROCM FA output as before adaptation of llama3.2 by @maleksan85 in #241
- Upstream merge 24 10 21 by @gshtras in #240
- Using the correct datatype on prefix prefill for fp8 kv cache by @gshtras in #242
- Update CMakeLists.txt by @gshtras in #244
- update block_manager usage in setup_cython by @saienduri in #243
- [Bugfix][Kernel][Misc] Basic support for SmoothQuant, symmetric case by @rasmith in #237
- Add fp8 support for llama model family on Navi4x by @qli88 in #245
- Custom all reduce fix mi250 by @omirosh in #247
- Upstream merge 24 10 28 by @gshtras in #248
- fp8 moe configs. Mixtral-8x(7B,22B) TP=1,2,4,8 by @divakar-amd in #250
- Sccache removal from Dockerfile.rocm by @omirosh in #253
- Update Dockerfile.rocm by @shajrawi in #254
- Using the correct type hints by @gshtras in #256
- Revert "Update Dockerfile.rocm" by @gshtras in #257
- Creating ROCm whl upon release by @gshtras in #259
New Contributors
- @kkHuang-amd made their first contribution in #181
- @Rohan138 made their first contribution in #216
- @AdrianAbeyta made their first contribution in #218
- @dhonnappa-amd made their first contribution in #232
- @seungrokj made their first contribution in #236
- @tjtanaa made their first contribution in #234
- @saienduri made their first contribution in #243
- @qli88 made their first contribution in #245
- @omirosh made their first contribution in #247
Full Changelog: v0.4.3_rocm...v0.6.3.post2+rocm
v0.6.3.post1+rocm
What's Changed
- Upstream merge 24 10 21 by @gshtras in #240
- Using the correct datatype on prefix prefill for fp8 kv cache by @gshtras in #242
- Update CMakeLists.txt by @gshtras in #244
- update block_manager usage in setup_cython by @saienduri in #243
- [Bugfix][Kernel][Misc] Basic support for SmoothQuant, symmetric case by @rasmith in #237
- Add fp8 support for llama model family on Navi4x by @qli88 in #245
- Custom all reduce fix mi250 by @omirosh in #247
- Upstream merge 24 10 28 by @gshtras in #248
New Contributors
- @saienduri made their first contribution in #243
- @qli88 made their first contribution in #245
- @omirosh made their first contribution in #247
Full Changelog: v0.6.2.post1+rocm...v0.6.3.post1+rocm
v0.6.2.post1+rocm
What's Changed
- Make rpdtracer import only when required by @Rohan138 in #216
- Improve profiling setup and documentation, sync benchmarks with main by @AdrianAbeyta in #218
- Installing the requirements before invoking setup.py since it now imports setuptools_scm by @gshtras in #221
- llama3.2 + cross attn test by @maleksan85 in #220
- Optimize CAR for ROCm by @iotamudelta in #225
- Custom PA perf improvements by @sanyalington in #222
- Upstream merge 24 10 08 by @gshtras in #226
- customPA write fp8 small ctx fix; enable customPA write fp8 by default by @sanyalington in #227
- added timeout for vllm build in rocm by @maleksan85 in #230
- Add fp8 for dbrx by @charlifu in #231
- Update Buildkite env variable by @dhonnappa-amd in #232
- cuda graph + num-scheduler-steps bug fix by @seungrokj in #236
- [Model] [BUG] Fix code path logic to load mllama model by @tjtanaa in #234
- prefix-enabled FA perf issue by @seungrokj in #239
- Custom PA Partition size 256 to improve performance by @sanyalington in #238
- [Build/CI] Minor changes to fix internal CI process. by @Alexei-V-Ivanov-AMD in #235
- [BUGFIX] Restored handling of ROCM FA output as before adaptation of llama3.2 by @maleksan85 in #241
New Contributors
- @Rohan138 made their first contribution in #216
- @AdrianAbeyta made their first contribution in #218
- @dhonnappa-amd made their first contribution in #232
- @seungrokj made their first contribution in #236
- @tjtanaa made their first contribution in #234
Full Changelog: v0.6.2+rocm...v0.6.2.post1+rocm
v0.6.2+rocm
What's Changed
- fix dbrx weight loader by @divakar-amd in #212
- Upstream merge 24 09 27 0.6.2 by @gshtras in #213
Full Changelog: v0.6.1.post1+rocm...v0.6.2+rocm
v0.6.1.post1+rocm
What's Changed
- Adding P3L measurement to the benchmarks collection tools. by @Alexei-V-Ivanov-AMD in #197
- Swapping the order of sampling operations in the conditional selector. by @Alexei-V-Ivanov-AMD in #199
- remove redundant slice when chunked prefill feature is disabled by @sanyalington in #201
- Fixing P3L incompatibility with cython. by @Alexei-V-Ivanov-AMD in #200
- Bias and more metadata in gradlib and tuned gemm by @gshtras in #202
- Upstream merge 24 9 23 by @gshtras in #203
- Gating n=0 case from skinny gemm by @gshtras in #204
- Revert "[Kernel] changing fused moe kernel chunk size default to 32k (vllm-project#7995)" by @gshtras in #207
- re-enable avoid torch slice fix when chunked prefill is disabled by @sanyalington in #209
- add block_manager_v2.py into setup_cython by @sanyalington in #210
- extend moe padding to DUMMY weights by @divakar-amd in #211
- [Int4-AWQ] Fix AWQ Marlin check for ROCm by @hegemanjw4amd in #206
- RPD Profiling by @dllehr-amd in #208
- Cythonize vllm build by @maleksan85 in #214
- Fix Dockerfile.rocm by @gshtras in #215
Full Changelog: v0.6.1_rocm...v0.6.1.post1+rocm
v0.6.1_rocm
What's Changed
- [fix] moe padding for reading correct tuned config by @divakar-amd in #172
- Upstream merge 24/9/9 by @gshtras in #174
- Restoring deleted .buildkite/test-template.j2 by @Alexei-V-Ivanov-AMD in #177
- Support commandr on ROCm by @shajrawi in #180
- Correct type hint by @gshtras in #173
- update custom PA kernel with support for fp8 kv cache dtype by @sanyalington in #87
- Support Grok-1 by @kkHuang-amd in #181
- Adding MLPerf optimization to 0.6.0 by @charlifu in #182
- 6.2 dockerfile by @gshtras in #176
- [Grok1] fix the name of input scale factor for autofp8 run by @kkHuang-amd in #183
- [Grok-1] fix the run-time error "Can't pickle <class 'transformers_mo… by @kkHuang-amd in #184
- Upstream merge 24/09/16 by @gshtras in #187
- Perf improvement: remove redundant torch slice; Match decode PA partition size to csrc by @sanyalington in #188
- refactor dbrx experts to use FusedMoe layer by @divakar-amd in #186
- Disable moe padding by default and enable fp8 padding by default. by @charlifu in #190
- Enabling Splitting HW by Buildkite Agents by @Alexei-V-Ivanov-AMD in #191
- Revert "remove redundant slice; match decode PA partition size with csrc (#188)" by @gshtras in #194
- [Grok-1] 1. upload moe configuration file for moe kernel optimization… by @kkHuang-amd in #193
- Removing the original text in reminder_comment.yml by @Alexei-V-Ivanov-AMD in #195
- Fix PA custom and PA v2 tests and partition sizes by @mawong-amd in #196
New Contributors
- @kkHuang-amd made their first contribution in #181
Full Changelog: v0.6.0_rocm...v0.6.1_rocm
v0.6.0_rocm
What's Changed
- Features integration without fp8 by @gshtras in #7
- Layernorm optimizations by @mawong-amd in #8
- Bringing in the latest commits from upstream by @mawong-amd in #9
- Bump Docker to ROCm 6.1, add gradlib for tuned gemm, include RCCL fixes by @mawong-amd in #12
- add mi300 fused_moe tuned configs by @divakar-amd in #13
- Correctly calculating the same value for the required cache blocks num for all torchrun processes by @gshtras in #15
- [ROCm] adding a missing triton autotune config by @hongxiayang in #17
- make the vllm setup mode configurable and make install mode as defaul… by @hongxiayang in #18
- enable fused topK_softmax kernel for hip by @divakar-amd in #14
- Fix ambiguous fma call by @cjatin in #16
- Rccl dockerfile updates by @mawong-amd in #19
- Dockerfile improvements: multistage by @mawong-amd in #20
- Integrate PagedAttention Optimization custom kernel into vLLM by @lcskrishna in #22
- Updates to custom PagedAttention for supporting context len up to 32k. by @lcskrishna in #25
- Update max_context_len for custom paged attention. by @lcskrishna in #26
- Update RCCL, hipBLASLt, base image in Dockerfile.rocm by @shajrawi in #24
- Adding fp8 gemm computation by @charlifu in #29
- fix the model loading fp8 by @charlifu in #30
- Update linear.py by @gshtras in #32
- Update base docker image with Pytorch 2.3 by @charlifu in #35
- Removed HIP specific matvec logic that is duplicated from tuned_gemm.py and doesn't support bf16 by @gshtras in #23
- Use inp_view for out = F.linear() in TunedGemm by @charlifu in #36
- Fix the symbol not found issue of the new base image by @charlifu in #37
- G42 bias triton fix rocm main by @gshtras in #38
- Update ROCm vLLM to 0.4.3 by @mawong-amd in #40
- Re-applying G42 bias triton fix on 0.4.3 by @gshtras in #41
- Fix RCCL install, linear.py logic, CMake custom extension, update requirement for FP8 compute by @mawong-amd in #42
- Linting main in line with upstream requirements by @mawong-amd in #43
- Include benchmark scripts in container by @mawong-amd in #45
- Adding fp8 to gradlib by @charlifu in #44
- Update fp8_gemm_tuner.py exchange import torch and hipbsolidxgemm by @liligwu in #46
- Supporting quantized weights from Quark by default. by @charlifu in #47
- Update quark quantizer command in fp8 instruction by @charlifu in #49
- Fix LLMM1 kernel by @fxmarty in #28
- Use scaled mm for untuned fp8 gemm by @charlifu in #50
- tuned moe configs v2 by @divakar-amd in #33
- Revert "Tune fused_moe_kernel for TP 1,2,4,8 and bf16 and fp16, updated moe kern…" by @hthangirala in #51
- Revert "Revert "Tune fused_moe_kernel for TP 1,2,4,8 and bf16 and fp16, updated moe kern…"" by @divakar-amd in #53
- fix init files by @divakar-amd in #52
- adds wvSpltK optimization for skinny gemm. by @amd-hhashemi in #54
- Fix 8K decode latency jump issue. by @lcskrishna in #55
- Adding quantization_weights_path for fp8 weights by @charlifu in #57
- Refactor custom gemm heuristics by @gshtras in #56
- wvSpltK fix for 10GB+ output tensors by @amd-hhashemi in #61
- uint64_t instead of unsigned long for clarity by @mawong-amd in #62
- fix for oob LDS fill in wvSpltK slm version by @amd-hhashemi in #63
- [Kernel] Enable custom AR on ROCm by @wenkaidu in #27
- Fix the Runtime Error When Loading kv cache scales by @charlifu in #65
- Fix numpy and XGMI 1-hop detection by @mawong-amd in #67
- Fix XGMI linting by @mawong-amd in #68
- Merging fp8_gemm_tuner.py to gemm_tuner.py by @charlifu in #66
- Workaround for SWDEV-470361 by @gshtras in #69
- [1/2] Fix up ROCm 6.2 tests correctly in main by @mawong-amd in #72
- [2/2] Using xfail instead of skip for ROCm 6.2 tests by @mawong-amd in #70
- Dockerfile updates: base image, preemptive uninstalls; restore ROCm 6.2 metrics test by @mawong-amd in #73
- Return int64 dtype for solidx in tuning results by @charlifu in #74
- [Build/CI] tests for rocm/vllm:main as of 2024-06-28 by @Alexei-V-Ivanov-AMD in #77
- Fix gradlib fp8 output by @charlifu in #76
- Allocate workspace for hipblaslt fp8 gemm. by @charlifu in #78
- Mixtral moe tuning for mi308 by @divakar-amd in #80
- Remove elementwise kernel before each fp8 gemm by @charlifu in #81
- Charlifu/avoid tensor creation before each gemm by @HaiShaw in #82
- TP=1 moe tuning for mixtral-8x7B by @divakar-amd in #84
- Mixtral-8x22B tuning mi308x by @divakar-amd in #85
- moe tuning for larger input lens by @divakar-amd in #86
- Reduce csv writes by @charlifu in #92
- fix the type error due to the misuse of the logging module by @liligwu in #105
- Update Dockerfile.rocm by @shajrawi in #107
- Greg/fast server by @gshtras in #106
- converts wvSpltK reduce to pure dpp for further perf uplift. by @amd-hhashemi in #64
- Revert "Fix 8K decode latency jump issue." by @mawong-amd in #108
- adding a simple model invocation involving fp8 calculation/storage by @Alexei-V-Ivanov-AMD in #109
- Adding bf16 output dtype for fp8 gemm by @charlifu in #111
- Running server and LLM in different processes by @gshtras in #110
- Fixed single GPU issue without setting up mp. Added toggles for server request batching parameters by @gshtras in #114
- Add distributed executor backend to benchmark scripts by @mawong-amd in #118
- Add weight padding for moe by @charlifu in #119
- [BugFix] Fix navi build after many custom for MI kernels added by @maleksan85 in #116
- add empty_cache() after each padding by @charlifu in #120
- [FIX] Gradlib OOM on Navi and sometimes on MI by @maleksan85 in #124
- Save shape when fp8 solution not found by @charlifu in #123
- Fix unit test for moe by adding padding by @charlifu in #128
- Llama3.1 by @gshtras in #129
- chat/completions endpoint by @gshtras in #121
- Optimize custom all reduce by @iotamudelta in #130
- Add BF16 support to custom PA by @sanyalington in #133
- Making check for output match in original types. It saves some memory. by @maleksan85 in #135
- Make CAR ROCm 6.1 compatible. by @iotamudelta in #137
- Car revert by @gshtras in #140
- Using the correct datatypes for streaming non-chat completions by @gshtras in #134
- Adding UNREACHABLE_CODE macro for non MI300 and MI250 cards by @maleksan85 in #138
- [FIX] gfx90a typo fix by @maleksan85 in #142
- wvsplitk templatized and better tuned for MI300 by @amd-hhashemi in #132
- [Bugfix] Dockerfile.rocm by @zstreet87 in #141
- Update test-template.j2 by @okakarpa in #145
- Adding Triton implementations awq_dequantize and awq_gemm to ROCm by @rasmith in #136
- Adding fp8 padding by @charlifu in #144
- [Int4-AWQ] Torch Int-4 AWQ Dequantization and Configuration Options by @hegemanjw4amd in #146
- buildkit requirement for building docker images by @hongxiayang in #149
- cupy build fix for SWDEV-475036 by @hongxiayang in https...
v0.6.0
Full Changelog: v0.5.5...v0.6.0
v0.4.0
What's Changed
- Features integration without fp8 by @gshtras in #7
- Layernorm optimizations by @mawong-amd in #8
- Bringing in the latest commits from upstream by @mawong-amd in #9
- Bump Docker to ROCm 6.1, add gradlib for tuned gemm, include RCCL fixes by @mawong-amd in #12
- add mi300 fused_moe tuned configs by @divakar-amd in #13
- Correctly calculating the same value for the required cache blocks num for all torchrun processes by @gshtras in #15
- [ROCm] adding a missing triton autotune config by @hongxiayang in #17
- make the vllm setup mode configurable and make install mode as defaul… by @hongxiayang in #18
- enable fused topK_softmax kernel for hip by @divakar-amd in #14
- Fix ambiguous fma call by @cjatin in #16
- Rccl dockerfile updates by @mawong-amd in #19
- Dockerfile improvements: multistage by @mawong-amd in #20
- Integrate PagedAttention Optimization custom kernel into vLLM by @lcskrishna in #22
- Updates to custom PagedAttention for supporting context len up to 32k. by @lcskrishna in #25
- Update max_context_len for custom paged attention. by @lcskrishna in #26
- Update RCCL, hipBLASLt, base image in Dockerfile.rocm by @shajrawi in #24
- Adding fp8 gemm computation by @charlifu in #29
- fix the model loading fp8 by @charlifu in #30
- Update linear.py by @gshtras in #32
- Update base docker image with Pytorch 2.3 by @charlifu in #35
New Contributors
- @divakar-amd made their first contribution in #13
- @hongxiayang made their first contribution in #17
- @cjatin made their first contribution in #16
- @lcskrishna made their first contribution in #22
- @shajrawi made their first contribution in #24
Full Changelog: v0.3.3...v0.4.0
v0.3.0
Full Changelog: https://github.com/ROCm/vllm/commits/v0.3.0