Currently, the quantization transformation is applied at the beginning of the transformation pipeline. It rewrites standard linear ops into quantized ops (e.g., a fused quant_linear). While this works, it couples the transformation passes to the quantization backends: downstream passes (e.g., pattern matching, weight fusion, sharding) must be quantization-aware.
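For illustration, the current early-rewrite approach might look like the sketch below, using torch.fx. The `quant_linear` stand-in and its int8 round-trip are hypothetical placeholders for the actual fused kernel; the point is that after this pass, every later pass sees `quant_linear` instead of `F.linear`.

```python
import torch
import torch.fx as fx
import torch.nn.functional as F

def quant_linear(x, weight, bias):
    # Hypothetical fused quantized op; here a toy stand-in that round-trips
    # the weight through an int8 grid before the matmul.
    scale = weight.abs().max() / 127.0
    w_q = torch.clamp((weight / scale).round(), -127, 127)
    return F.linear(x, w_q * scale, bias)

def forward(x, weight, bias):
    return F.linear(x, weight, bias)

def rewrite_linears(gm: fx.GraphModule) -> fx.GraphModule:
    # Early-pipeline pass: rewrite every F.linear call into quant_linear.
    # Every downstream pass now has to know about quant_linear.
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target is F.linear:
            node.target = quant_linear
    gm.recompile()
    return gm

gm = rewrite_linears(fx.symbolic_trace(forward))
x, w, b = torch.randn(2, 4), torch.randn(3, 4), torch.randn(3)
out = gm(x, w, b)
```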
Goals
- Introduce an annotation-based approach to preserve quantization metadata across passes.
- Unify and simplify the handling of quantized ops in the transformation pipeline.
- Leverage the InferenceOptimizer transformation system (staged passes, config inheritance) to modularize quantization-aware logic.
- Consider alternatives to node.meta for preserving quantization info.
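The annotation-based goal could be sketched as follows. The `quant_config` meta key, the `"fp8"` scheme name, and the pass names are all hypothetical: the idea is that an early pass only records quantization intent in node.meta, intermediate passes stay quantization-unaware, and a late lowering pass performs the actual op rewrite.

```python
import torch
import torch.fx as fx
import torch.nn.functional as F

QUANT_KEY = "quant_config"  # hypothetical node.meta key

def forward(x, weight, bias):
    return F.linear(x, weight, bias)

def annotate_quant(gm: fx.GraphModule, scheme: str = "fp8") -> fx.GraphModule:
    # Early pass: record quantization intent without rewriting the op.
    # Pattern matching, fusion, and sharding still see a plain F.linear.
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target is F.linear:
            node.meta[QUANT_KEY] = {"scheme": scheme}
    return gm

def lower_quant(gm: fx.GraphModule) -> fx.GraphModule:
    # Late pass: only here does the metadata turn into an op rewrite.
    for node in gm.graph.nodes:
        if node.meta.get(QUANT_KEY, {}).get("scheme") == "fp8":
            node.target = F.linear  # placeholder for the backend's fused kernel
    gm.recompile()
    return gm

gm = annotate_quant(fx.symbolic_trace(forward))
annotated = [n for n in gm.graph.nodes if QUANT_KEY in n.meta]
print(len(annotated))  # → 1
```

Because the graph itself is untouched until lowering, the annotated module still computes exactly what the original did.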
Exploration
- Configurable quantization backend options, similar to the attention backend.
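A configurable quantization backend, analogous to how an attention backend might be selected via config, could be a small name-to-implementation registry. All names here (`register_quant_backend`, `QuantConfig`, `"simulated_int8"`) are illustrative, not the actual API.

```python
from dataclasses import dataclass

_QUANT_BACKENDS = {}

def register_quant_backend(name):
    """Register a quantization backend under a config-selectable name."""
    def decorator(fn):
        _QUANT_BACKENDS[name] = fn
        return fn
    return decorator

def get_quant_backend(name):
    if name not in _QUANT_BACKENDS:
        raise ValueError(
            f"unknown quant backend {name!r}; available: {sorted(_QUANT_BACKENDS)}"
        )
    return _QUANT_BACKENDS[name]

@register_quant_backend("simulated_int8")
def simulated_int8(w: float) -> float:
    # Toy stand-in: round-trip a scalar through an int8 grid.
    scale = max(abs(w), 1e-8) / 127.0
    return round(w / scale) * scale

@dataclass
class QuantConfig:
    # Selected by config, mirroring attention-backend selection.
    backend: str = "simulated_int8"

cfg = QuantConfig()
quantize = get_quant_backend(cfg.backend)
```

Swapping backends then becomes a config change rather than a different transformation pipeline.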