What's Changed
- perf: Enable SplitK and fix autotuner for trtllm fp4 fused moe by @stslxg-nv in #1548
- bugfix: Fix FLOPS calculation for bench_trtllm_gen_mla.py by @RayWang96 in #1640
- feat: add support of fp4_batched_quantize by @yicwang in #1633
- fix: zero-init workspace buffer for trtllm-gen fmha by @yyihuang in #1643
- misc: Add the keyword "template" to member template specialization by @tomflinda in #1246
- chore: Switch `pynvml` to `nvidia-ml-py` by @toulzx in #1650
- [TVM] Rename `NDArray` -> `Tensor` by @MasterJH5574 in #1651
- misc: remove unused `load_cuda_ops` function by @yzh119 in #1649
- feat: Add k_scale and v_scale to persistent attention by @Edenzzzz in #1322
- misc: add script to analyze code owners from git history by @yzh119 in #1653
- Tiny allow compiling with line info and release moe by @fzyzcjy in #1659
- Speedup MLARopeQuantize by 20-35% by @fzyzcjy in #1660
- Add benchmark for MLARopeQuantize by @fzyzcjy in #1656
- Added mx_fp4 support using the cudnn backend by @nvmbreughe in #1644
- feat: Support s_qo < s_kv for prefill in flashinfer_benchmark.py and benchmark minor updates by @bkryu in #1664
- test: update fused_moe test to random scale factor by @yyihuang in #1665
- perf&bugfix: skip kv-tile computation out of sliding window in FA2; fix `__syncthreads` in merge_state by @happierpig in #1661
- [Hotfix] `test_fp4_quantize.py` failure on sm103 by @sunghyunp-nvdia in #1666
- benchmark: add cupti support to benchmark by @nv-yunzheq in #1662
- TGV GEMM as a BF16 backend alternative to cuBLAS by @yangs75 in #1668
- feat: Add `variant.OutputTransform()` to decode kernels by @gau-nernst in #1670
- ci: collect module status and update flashinfer-cli by @yzh119 in #1676
- feat: Batch-size invariant FA2 Prefill & Decode by @Edenzzzz in #1675
- test: better fp8 quantization init for fused_moe test by @yyihuang in #1674
- Support output signals for overlapping for cutedsl gemm by @fzyzcjy in #1677
- [misc] add a wrapper class for attention sink jit args by @happierpig in #1679
- [TVM] Default `fixed_split_size` value in TVM binding by @MasterJH5574 in #1680
- Update TGV GEMM default kernel and TGV code cleanup. by @yangs75 in #1682
- perf: improve performance of cutlass fmha by @yzh119 in #1681
- fix: correct the sm version number in cutlass_fused_moe_module for rtx pro 6000 by @yongwww in #1683
- Refactor Blackwell unit test scripts by @dierksen in #1667
- bugfix: increase workspace to make unit test pass by @nv-yunzheq in #1684
- Update deepgemm backend for 103a by @kahyunnam in #1694
- gemm: Enabled alpha with the mx_fp4 format by @nvmbreughe in #1688
- hotfix: Hotfix for `test_pod_kernels.py` on B300 by @sunghyunp-nvdia in #1698
- misc: Do not use the limited API with free-threaded Python by @rostan-t in #1687
- Remove incorrect method call "isdigit" on number type by @HelloCard in #1699
- ci: fix prefill attention unittests by @yzh119 in #1700
- misc: unify the macro to determine cuda version at compile time by @yzh119 in #1703
- Support Kimi-K2 for TRT: templatize number of experts by @GordonGustafson in #1696
- feat: Benchmark mm_fp4 mxfp4 support and gemm autotune support. Restore mm_fp4 API behavior by @bkryu in #1706
- bugfix: increase workspace to make trtllm gen attention unit test pass by @nv-yunzheq in #1707
- CI: Updated test lists and addressed some failing tests by @nvmbreughe in #1708
- misc: update the pypi release github action by @yzh119 in #1713
- perf: Add tuning config for cutlass moe for specific hardware by @fzyzcjy in #1716
- ci: remove deprecated github actions for aot wheel by @yzh119 in #1714
- test: skip the unsupported test cases for sm120/121 by @yongwww in #1710
- [cute_dsl] add gemm + all reduce (two_shot) by @Amir-19 in #1695
- misc: remove unused `torch.utils.cpp_extension` dependencies by @yzh119 in #1711
- test: skip unsupported (non-SM90) test cases for xqa by @jimmyzho in #1715
- Fix DeepSeek quality for TRTLLM fused MoE routing by @GordonGustafson in #1723
- perf: Port the separate reduce kernel mode from trtllm. by @weireweire in #1685
- typo: Super tiny fix typo by @fzyzcjy in #1730
- fix: put sampling kernel launch into macro by @ir1ka in #1727
- bugfix: Fix flashinfer download-cubin by @tiran in #1729
- Fix missing namespace qualifier by @joker-eph in #1731
- ci/cd: bring up flashinfer-cubin package by @yzh119 in #1718
- disable optimization and add more debug information during verbose mode by @rainj-me in #1719
- ci/cd: add github workflows to publish flashinfer-cubin wheel to pypi by @yzh119 in #1737
- Bump base container image from 13.0.0 to 13.0.1 for cu130 container by @bkryu in #1739
- fix: CI containers install nvidia-cudnn-cu12 vs. nvidia-cudnn-cu13 based on CUDA Version by @bkryu in #1742
- Test refactoring and fixes by @nvmbreughe in #1736
- TVM: support TVM binding for GroupedGemm by @neurusL in #1725
- ci: enable tests for sm75 (G4) by @yongwww in #1705
- doc: Super tiny fix doc math by @fzyzcjy in #1747
- hotfix: Fix parsing pytorch version by @sunghyunp-nvdia in #1749
- feat: port fast_decode_plan from sgl by @zihaoye in #1745
- hotfix: slightly bump up `atol` to `3e-3` to pass `test_cudnn_prefill` on B40 by @sunghyunp-nvdia in #1750
- tests: xfail moe quantization classes mxfp8_bf16 UTs on sm103 by @jimmyzho in #1754
- ci: complete the list of modules in aot.py by @yzh119 in #1746
- tests: xfail attention sink UT for sliding window + non causal case by @yzh119 in #1752
- feat: Add compute capability checks to flashinfer_benchmark by @bkryu in #1756
- test: minor update on trtllm-gen attn speculative-decoding test by @yyihuang in #1760
- fix: should pass global_override_indptr_cpu in fast_decode_plan param list by @yyihuang in #1757
- fix(cleanup): ensure repository URL has no trailing slash by @tarukumar in #1759
- Fix tests/test_trtllm_gen_attention.py::test_trtllm_batch_prefill, ::test_trtllm_batch_decode mismatch error by @kahyunnam in #1755
- ci: add apache-tvm-ffi to ci docker container by @yzh119 in #1763
- fix: fix cannot import name 'cuda' from 'cuda' in CUDA13 by @LuYanFCP in #1764
- bugfix: partially fix tests/test_trtllm_gen_fused_moe.py unit test failure by @nv-yunzheq in #1724
- Fix sink attention accuracy regression, add sink test and cleanup. by @weireweire in #1758
- tests: skip non SM100/103 for grouped deepgemm by @jimmyzho in #1767
- Added xfail for mx_fp4 matmul on SM120 by @nvmbreughe in #1766
- add test case for trtllm gen fused moe with kimi k2 problem sizes by @nv-yunzheq in #1768
- chore: label new issues with 'needs-triage' via GH Action by @sricketts in #1765
- Small fix on an exception by @nvmbreughe in #1775
- Waive/disable test_mla_decode_kernel.py::test_mla_decode_kernel for non-sm80 by @kahyunnam in #1771
- bugfix: fix version tag validation in release_pypi_sdist.yml workflow by @yzh119 in #1780
- bugfix: remove the filelock cleanup logic in cubin_loader.py by @yzh119 in #1779
- Docker updates by @nvmbreughe in #1784
- fix: add _check_tensor_params to check correct sampling parameters and dtype validation in decode.py by @raayandhar in #1652
- Set Triton path for cuda-13.0 by @nvmbreughe in #1786
- ci: upgrade apache-tvm-ffi version in ci containers by @yzh119 in #1788
- Fix autotune profiling when the min shape is bigger than the max shape. by @weireweire in #1783
- hotfix: make aot wheel work without nvcc by @yzh119 in #1782
- fix error message by @kahyunnam in #1789
- refactor: using tvm-ffi for multi-platform bindings by @yzh119 in #1641
- Upgrade CUTLASS to 4.2.1 by @jasl in #1787
- doc: fix documentation build error by @yzh119 in #1797
- jit: defer ninja generation by @yzh119 in #1792
- refactor: cleanup codebase after tvm-ffi refactor by @yzh119 in #1795
- Update devcontainer.json and reuse ci docker images by @cyx-6 in #1791
- fix: new issue workflow by @sricketts in #1773
- fix: compilation failure in fp4Op.cpp by @hypdeb in #1800
- fix: missing header include in decode kernel jit binding by @MasterJH5574 in #1802
- refactor: Test reorganization phase 2 by @nvmbreughe in #1778
- fix: test_blackwell_kernels.sh script to no longer update dependencies by @bkryu in #1808
- fix: fp4 moe on sm120 by @ReinForce-II in #1817
- bugfix: fix devcontainer docker file cuda version by @cyx-6 in #1807
- Enable support for CFLAGS as well as LDFLAGS when building by @directhex in #1801
- chore: improved URL handling for CUBIN/artifacts downloads by @hypdeb in #1794
- Remove `-isystem /usr/include` by @coreylowman in #1821
- Pytest flags and regression fix by @nvmbreughe in #1822
- bugfix: Fix variable name conflict introduced by PR #1801 by @bkryu in #1823
- Masked batch nvfp4 quantization by @wenscarl in #1774
- [Perf] Cache device property functions to avoid recomputation by @Jialin in #1824
- bugfix: remove the append "a" logic if user specifies cuda arch explicitly by @yzh119 in #1798
- Bugfix: Fix data hazard in persistent reduce by @Edenzzzz in #1826
- tests: upgrade cutlass, fix import and skip non-SM100 cutedsl two shot allreduce by @jimmyzho in #1812
- [Quantization] Add per-expert global scaling factor for fp4 batched quantize by @wenscarl in #1835
- tests: Update support for tgv_gemm to SM100 only and add to ut by @jimmyzho in #1810
- jit: add `get_object_paths` to JitSpec by @MasterJH5574 in #1836
- docker: add image tags with date-SHA suffix by @yongwww in #1839
- bugfix: Change module path in test_pod_kernels.py by @bkryu in #1842
- bugfix: deep_gemm artifact load path by @jimmyzho in #1838
- bugfix: fix synchronize logic error in tests/comm/test_trtllm_alltoall.py by @nv-yunzheq in #1841
- bugfix: show-config command by @sricketts in #1846
- Run tests individually by @nvmbreughe in #1847
- unittest: remove debug-print jit examples from unittest by @yzh119 in #1851
- jit: add `-lcuda` to default ldflags by @yzh119 in #1825
- feat: add warp-level persistent qk norm by @happierpig in #1843
- Add head_dim=64 for blackwell cutlass fmha implementation by @kahyunnam in #1850
- ci/cd: bringup flashinfer-jit-cache package by @yzh119 in #1726
- ci: add docker-tags.yml to specify the docker image tag used in CI by @yzh119 in #1853
- docker: upgrade apache-tvm-ffi==0.1.0b15 in docker container by @yzh119 in #1857
- ci: upgrade apache-tvm-ffi==0.1.0b15 by @cyx-6 in #1860
- raise error for `group_gemm_fp8_nt_groupwise` when num_groups > 1 on sm120/121 by @yongwww in #1862
- bugfix: add check for empty MoE tactics and allow sm121 to use sm120 config by @yongwww in #1861
- Bugfix: fix o_strides in persistent kernel by @Edenzzzz in #1865
- Improve dev container conda consistency by @cyx-6 in #1873
- xfail the cute dsl tests for `l=1` by @cyx-6 in #1868
- Added click package by @nvmbreughe in #1875
- Tune kernel compilation parameters for #1850 by @kahyunnam in #1878
- PDL patch for TGV GEMM by @yangs75 in #1877
- Moved common requirements from docker and setup.py to file by @nvmbreughe in #1880
- misc: fix some B200 GEMM bench by @Edenzzzz in #1883
- ci/cd: add nightly build and CI for `flashinfer-python`, `flashinfer-jit-cache`, `flashinfer-cubin` by @yzh119 in #1872 (see the sketch below)
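
This release splits packaging into `flashinfer-python`, `flashinfer-cubin`, and `flashinfer-jit-cache`. As an illustration only (not part of the release itself), a minimal sketch using the standard library to check which of the three distribution packages are installed in the current environment:

```python
# Report which FlashInfer distribution packages from this release are installed.
# Package names are taken from the changelog entries above.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("flashinfer-python", "flashinfer-cubin", "flashinfer-jit-cache"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```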
New Contributors
- @RayWang96 made their first contribution in #1640
- @yicwang made their first contribution in #1633
- @tomflinda made their first contribution in #1246
- @toulzx made their first contribution in #1650
- @sunghyunp-nvdia made their first contribution in #1666
- @yangs75 made their first contribution in #1668
- @gau-nernst made their first contribution in #1670
- @dierksen made their first contribution in #1667
- @kahyunnam made their first contribution in #1694
- @rostan-t made their first contribution in #1687
- @HelloCard made their first contribution in #1699
- @GordonGustafson made their first contribution in #1696
- @jimmyzho made their first contribution in #1715
- @ir1ka made their first contribution in #1727
- @rainj-me made their first contribution in #1719
- @neurusL made their first contribution in #1725
- @zihaoye made their first contribution in #1745
- @tarukumar made their first contribution in #1759
- @LuYanFCP made their first contribution in #1764
- @raayandhar made their first contribution in #1652
- @jasl made their first contribution in #1787
- @ReinForce-II made their first contribution in #1817
- @coreylowman made their first contribution in #1821
- @Jialin made their first contribution in #1824
Full Changelog: v0.3.1...v0.4.0