
Releases: flashinfer-ai/flashinfer

Nightly Release v0.4.1-20251016

16 Oct 03:42
bd98dac
Pre-release

Automated nightly build for version 0.4.1 (dev20251016)

Nightly Release v0.4.1-20251015

15 Oct 03:46
9f25eee
Pre-release

Automated nightly build for version 0.4.1 (dev20251015)

Release v0.4.1

14 Oct 05:13
a88349f

What's Changed

  • fix: fix the failed sampling unittest on 5090 by @yzh119 in #1886
  • Updated to latest docker tag by @nvmbreughe in #1889
  • Fix: Prevent race condition in cubin loader when file is being consumed by @yzh119 in #1852
  • Improve graph caching of cudnn graph by @Anerudhan in #1887
  • misc: Various Updates to Attention Microbenchmark Suite by @bkryu in #1891
  • docs: Fix installation instructions for CUDA-specific package URLs by @yzh119 in #1893
  • docker image improvements by @nvmbreughe in #1890
  • tests: Add batch size 1 cases to test_trtllm_gen_attention.py that fail, marked xfail by @bkryu in #1897
  • Ensure docker installs the torch version we need by @nvmbreughe in #1901
  • bugfix: exclude tests/utils/test_load_cubin_compile_race_condition.py from pytest by @yzh119 in #1907
  • ci: use self-hosted runner for building docker containers by @yzh119 in #1908
  • feat: Add FP4 TRTLLM-Gen throughput MOE batched gemms by @jiahanc in #1882
  • Update Docker CI tags to 20251010-8d072e6 by @github-actions[bot] in #1915
  • ci/cd: consolidate release workflow by @yzh119 in #1910
  • bugfix: fix cli error when cuda toolkit is not installed by @yzh119 in #1905
  • feat: trtllm-gen global scaled FP8 GEMMs by @hypdeb in #1829
  • feat: enable fp8 blockscale moe for fused cutlass for sm90 by @djmmoss in #1819
  • use ffi::TensorView instead of ffi::Tensor by @cyx-6 in #1844
  • Minor updates to cubin_loader.py download_file to avoid race condition on temporary file by @nvjullin in #1918 (pattern sketched after this list)
  • chore: make cache directory flashinfer-version specific by @yzh119 in #1920
  • misc: checksum check when downloading artifacts by @jimmyzho in #1761 (see the checksum sketch after this changelog)
  • release: bump version v0.4.1 by @yzh119 in #1921
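
Three of the entries above (#1852, #1918, #1920) circle the same problem: a cubin must land in a shared on-disk cache without any concurrent process ever observing a half-written file. The sketch below illustrates the usual temp-file-plus-atomic-rename pattern; the helper name `fetch_cubin` and the cache path are hypothetical, and this is not FlashInfer's actual `cubin_loader.py` code.

```python
import os
import shutil
import tempfile
import urllib.request

# Hypothetical per-version cache directory; keying the cache on the
# flashinfer version (the idea behind #1920) avoids stale-artifact reuse.
CACHE_DIR = os.path.expanduser("~/.cache/flashinfer/0.4.1")


def fetch_cubin(url: str, name: str) -> str:
    """Download `url` into the cache, tolerating concurrent callers.

    Illustrative sketch of the temp-file + atomic-rename pattern behind
    #1852/#1918; not the actual flashinfer implementation.
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    dst = os.path.join(CACHE_DIR, name)
    if os.path.exists(dst):  # already published by us or another process
        return dst

    # Write to a unique temp file in the SAME directory, so the final
    # rename is atomic: os.replace never exposes a partially written file.
    fd, tmp = tempfile.mkstemp(dir=CACHE_DIR, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as f, urllib.request.urlopen(url) as r:
            shutil.copyfileobj(r, f)
        os.replace(tmp, dst)  # atomic publish; last writer wins
    finally:
        if os.path.exists(tmp):  # cleanup only if the publish never happened
            os.remove(tmp)
    return dst
```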

Full Changelog: v0.4.0...v0.4.1
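
Relatedly, #1761 adds a checksum check to artifact downloads. As a hedged illustration only (the helper name and call shape are assumptions, not the FlashInfer API), a SHA-256 verification step typically looks like this:

```python
import hashlib


def verify_sha256(path: str, expected_hex: str) -> None:
    """Raise if the file at `path` does not hash to `expected_hex`.

    Hypothetical helper illustrating the artifact checksum check
    added in #1761; not an actual flashinfer function.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Hash in 1 MiB chunks so large cubins are not read into memory at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != expected_hex:
        raise ValueError(f"checksum mismatch for {path}")
```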

Nightly Release v0.4.0-20251014

14 Oct 03:44
bea5949
Pre-release

Automated nightly build for version 0.4.0 (dev20251014)

Nightly Release v0.4.0-20251013

13 Oct 03:48
4b55b26
Pre-release

Automated nightly build for version 0.4.0 (dev20251013)

Nightly Release v0.4.0-20251012

12 Oct 03:46
bbb57ad
Pre-release

Automated nightly build for version 0.4.0 (dev20251012)

Nightly Release v0.4.0-20251011

11 Oct 03:35
e8addbf
Pre-release

Automated nightly build for version 0.4.0 (dev20251011)

Nightly Release v0.4.0-20251010

10 Oct 03:41
c3ff7e7
Pre-release

Automated nightly build for version 0.4.0 (dev20251010)

Nightly Release v0.4.0-20251009

09 Oct 03:42
c08b529
Pre-release

Automated nightly build for version 0.4.0 (dev20251009)

v0.4.0

09 Oct 01:59
68826ac

What's Changed

  • perf: Enable SplitK and fix autotuner for trtllm fp4 fused moe by @stslxg-nv in #1548
  • bugfix: Fix FLOPS calculation for bench_trtllm_gen_mla.py by @RayWang96 in #1640
  • feat: add support of fp4_batched_quantize by @yicwang in #1633
  • fix: zero-init workspace buffer for trtllm-gen fmha by @yyihuang in #1643
  • misc: Add the keyword "template" to member template specialization by @tomflinda in #1246
  • chore: Switch pynvml to nvidia-ml-py by @toulzx in #1650
  • [TVM] Rename NDArray -> Tensor by @MasterJH5574 in #1651
  • misc: remove unused load_cuda_ops function by @yzh119 in #1649
  • feat: Add k_scale and v_scale to persistent attention by @Edenzzzz in #1322
  • misc: add script to analyze code owners from git history by @yzh119 in #1653
  • Tiny allow compiling with line info and release moe by @fzyzcjy in #1659
  • Speedup MLARopeQuantize by 20-35% by @fzyzcjy in #1660
  • Add benchmark for MLARopeQuantize by @fzyzcjy in #1656
  • Added mx_fp4 support using the cudnn backend by @nvmbreughe in #1644
  • feat: Support s_qo < s_kv for prefill in flashinfer_benchmark.py and benchmark minor updates by @bkryu in #1664
  • test: update fused_moe test to random scale factor by @yyihuang in #1665
  • perf&bugfix: skip kv-tile computation out of sliding window in FA2; fix __syncthreads in mergestate by @happierpig in #1661
  • [Hotfix] test_fp4_quantize.py failure on sm103 by @sunghyunp-nvdia in #1666
  • benchmark: add cupti support to benchmark by @nv-yunzheq in #1662
  • TGV GEMM as a BF16 backend alternative to cuBLAS by @yangs75 in #1668
  • feat: Add variant.OutputTransform() to decode kernels by @gau-nernst in #1670
  • ci: collect module status and update flashinfer-cli by @yzh119 in #1676
  • feat: Batch-size invariant FA2 Prefill & Decode by @Edenzzzz in #1675
  • test: better fp8 quantization init for fused_moe test by @yyihuang in #1674
  • Support output signals for overlapping for cutedsl gemm by @fzyzcjy in #1677
  • [misc] add a wrapper class for attention sink jit args by @happierpig in #1679
  • [TVM] Default fixed_split_size value in TVM binding by @MasterJH5574 in #1680
  • Update TGV GEMM default kernel and TGV code cleanup. by @yangs75 in #1682
  • perf: improve performance of cutlass fmha by @yzh119 in #1681
  • fix: correct the sm version number in cutlass_fused_moe_module for rtx pro 6000 by @yongwww in #1683
  • Refactor Blackwell unit test scripts by @dierksen in #1667
  • bugfix: increase workspace to make unit test pass by @nv-yunzheq in #1684
  • Update deepgemm backend for 103a by @kahyunnam in #1694
  • gemm: Enabled alpha with the mx_fp4 format by @nvmbreughe in #1688
  • hotfix: Hotfix for test_pod_kernels.py on B300 by @sunghyunp-nvdia in #1698
  • misc: Do not use the limited API with free-threaded Python by @rostan-t in #1687
  • Remove incorrect method call "isdigit" on number type by @HelloCard in #1699
  • ci: fix prefill attention unittests by @yzh119 in #1700
  • misc: unify the macro to determine cuda version at compile time by @yzh119 in #1703
  • Support Kimi-K2 for TRT: templatize number of experts by @GordonGustafson in #1696
  • feat: Benchmark mm_fp4 mxfp4 support and gemm autotune support. Restore mm_fp4 API behavior by @bkryu in #1706
  • bugfix: increase workspace to make trtllm gen attention unit test pass by @nv-yunzheq in #1707
  • CI: Updated test lists and addressed some failing tests by @nvmbreughe in #1708
  • misc: update the pypi release github action by @yzh119 in #1713
  • perf: Add tuning config for cutlass moe for a hardware by @fzyzcjy in #1716
  • ci: remove deprecated github actions for aot wheel by @yzh119 in #1714
  • test: skip the unsupported test cases for sm120/121 by @yongwww in #1710
  • [cute_dsl] add gemm + all reduce (two_shot) by @Amir-19 in #1695
  • misc: remove unused torch.utils.cpp_extension dependencies by @yzh119 in #1711
  • test: skip unsupported (non-SM90) test cases for xqa by @jimmyzho in #1715
  • Fix DeepSeek quality for TRTLLM fused MoE routing by @GordonGustafson in #1723
  • perf: Port the separate reduce kernel mode from trtllm. by @weireweire in #1685
  • typo: Super tiny fix typo by @fzyzcjy in #1730
  • fix: put sampling kernel launch into macro by @ir1ka in #1727
  • bugfix: Fix flashinfer download-cubin by @tiran in #1729
  • Fix missing namespace qualifier by @joker-eph in #1731
  • ci/cd: bring up flashinfer-cubin package by @yzh119 in #1718
  • disable optimization and add more debug information during verbose mode by @rainj-me in #1719
  • ci/cd: add github workflows to publish flashinfer-cubin wheel to pypi by @yzh119 in #1737
  • Bump base container image from 13.0.0 to 13.0.1 for cu130 container by @bkryu in #1739
  • fix: CI containers install nvidia-cudnn-cu12 vs. nvidia-cudnn-cu13 based on CUDA Version by @bkryu in #1742
  • Test refactoring and fixes by @nvmbreughe in #1736
  • TVM: support TVM binding for GroupedGemm by @neurusL in #1725
  • ci: enable tests for sm75 (G4) by @yongwww in #1705
  • doc: Super tiny fix doc math by @fzyzcjy in #1747
  • hotfix: Fix parsing pytorch version by @sunghyunp-nvdia in #1749
  • feat: port fast_decode_plan from sgl by @zihaoye in #1745
  • hotfix: slightly bump up atol to 3e-3 to pass test_cudnn_prefill on B40 by @sunghyunp-nvdia in #1750
  • tests: xfail moe quantization classes mxfp8_bf16 UTs on sm103 by @jimmyzho in #1754
  • ci: complete the list of modules in aot.py by @yzh119 in #1746
  • tests: xfail attention sink UT for sliding window + non causal case by @yzh119 in #1752
  • feat: Add compute capability checks to flashinfer_benchmark by @bkryu in #1756
  • test: minor update on trtllm-gen attn speculative-decoding test by @yyihuang in #1760
  • fix: should pass global_override_indptr_cpu in fast_decode_plan param list by @yyihuang in #1757
  • fix(cleanup): ensure repository URL has no trailing slash by @tarukumar in #1759
  • Fix tests/test_trtllm_gen_attention.py::test_trtllm_batch_prefill, ::test_trtllm_batch_decode mismatch error by @kahyunnam in #1755
  • ci: add apache-tvm-ffi to ci docker container by @yzh119 in #1763
  • fix: fix cannot import name 'cuda' from 'cuda' in CUDA13 by @LuYanFCP in #1764
  • bugfix: partially fix tests/test_trtllm_gen_fused_moe.py unit test failure by @nv-yu...