Quantization-aware Transformation Pipeline via Early Fusion + Node Annotation #5861

@Fridah-nv

Description

Background

Currently, the quantization transformation is applied at the beginning of the transformation pipeline. It rewrites standard linear ops into quantized ops (e.g., a fused quant_linear). While this works, it couples the transformation passes to the quantization backends: every downstream pass (e.g., pattern matching, weight fusion, sharding) must be quantization-aware.
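A minimal pure-Python sketch of the coupling problem described above (the names `Node`, `quantize_pass`, and `fuse_weights_pass` are illustrative stand-ins, not the actual TensorRT-LLM API):

```python
# Illustrative only: models an fx-style graph as a list of nodes with a
# `meta` dict, mirroring how the early quantization rewrite forces later
# passes to special-case both the standard and the fused quantized ops.

QUANT_OPS = {"quant_linear"}  # ops produced by the early quantization rewrite

class Node:
    def __init__(self, op, meta=None):
        self.op = op
        self.meta = meta or {}

def quantize_pass(graph):
    # Early rewrite: replace every standard linear with a fused quant op.
    return [Node("quant_linear", n.meta) if n.op == "linear" else n
            for n in graph]

def fuse_weights_pass(graph):
    # Downstream pass is forced to know about both op variants -> coupling:
    # each new quantization backend grows this match set.
    return [n for n in graph if n.op in ({"linear"} | QUANT_OPS)]

graph = [Node("linear"), Node("relu"), Node("linear")]
graph = quantize_pass(graph)
print([n.op for n in fuse_weights_pass(graph)])  # ['quant_linear', 'quant_linear']
```

Every backend-specific fused op that is introduced early multiplies the patterns that downstream passes must recognize, which is the motivation for the annotation-based approach below.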

Goals

  • Introduce an annotation-based approach to preserve quantization metadata across passes, unifying and simplifying the handling of quantized ops in the transformation pipeline.
  • Leverage the InferenceOptimizer transformation system (staged passes, config inheritance) to modularize quantization-aware logic.
  • Consider alternatives to node.meta for preserving quantization info.
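One way the annotation idea could look, as a hedged pure-Python sketch (keys like `"quant"` and `"logical_op"` are hypothetical, not an agreed-upon schema): the early fusion still happens, but each fused node carries metadata describing what it logically is, so downstream passes match on the annotation instead of on backend-specific op names.

```python
# Illustrative only: early fusion + node annotation. Downstream passes read
# node.meta-style metadata rather than enumerating concrete quantized ops.

class Node:
    def __init__(self, op, meta=None):
        self.op = op
        self.meta = meta or {}

def quantize_and_annotate_pass(graph, scheme="fp8"):
    # Early fusion as before, but each fused node records its logical op
    # and quantization scheme in its metadata.
    out = []
    for n in graph:
        if n.op == "linear":
            q = Node("quant_linear", dict(n.meta))
            q.meta["quant"] = {"scheme": scheme, "logical_op": "linear"}
            out.append(q)
        else:
            out.append(n)
    return out

def fuse_weights_pass(graph):
    # Downstream passes match on the annotation, not on concrete op names,
    # so they need no backend-specific knowledge.
    def is_linear(n):
        return (n.op == "linear"
                or n.meta.get("quant", {}).get("logical_op") == "linear")
    return [n for n in graph if is_linear(n)]

graph = quantize_and_annotate_pass([Node("linear"), Node("relu")])
matched = fuse_weights_pass(graph)
print([n.op for n in matched])  # ['quant_linear']
```

Note that this sketch uses a plain `meta` dict as a stand-in for fx `node.meta`; the last goal above is precisely about whether `node.meta` is the right carrier for this information.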

Exploration:

  • Configurable quantization backend options, similar to the attention backend
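A configurable backend could follow the usual name-keyed registry pattern, sketched below. All names here (`QUANT_BACKENDS`, `register_quant_backend`, `FP8Backend`, `Int4AWQBackend`) are hypothetical and only illustrate the shape of the idea, not an existing API:

```python
# Illustrative only: a registry-based quantization backend selector,
# analogous to choosing an attention backend by name in a config.

QUANT_BACKENDS = {}

def register_quant_backend(name):
    """Class decorator that registers a quantization backend under a name."""
    def deco(cls):
        QUANT_BACKENDS[name] = cls
        return cls
    return deco

@register_quant_backend("fp8")
class FP8Backend:
    fused_linear_op = "fp8_quant_linear"  # op the backend materializes

@register_quant_backend("int4_awq")
class Int4AWQBackend:
    fused_linear_op = "int4_awq_quant_linear"

def get_quant_backend(name):
    # Selected via config, e.g. a `quant_backend: "fp8"` entry in the
    # optimizer config (hypothetical key).
    return QUANT_BACKENDS[name]()

print(get_quant_backend("fp8").fused_linear_op)  # fp8_quant_linear
```

With such a registry, the transformation pipeline only ever asks the selected backend which fused ops to emit, keeping the passes themselves backend-agnostic.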

Metadata

Labels

AutoDeploy<NV> AutoDeploy Backend

Projects

Status

Done
