
Conversation


@dbogunowicz dbogunowicz commented Jul 3, 2023

Feature Description

Fixes the KV Cache logic for quantized text generation models.

In a nutshell, the QuantizeLinear nodes created during quantization were breaking our pattern-matching rules for finding the Key MatMul and Value MatMul. Additionally, QuantizeLinear nodes are by default inserted into the graph in a way that breaks the topology required for KV cache support.

This fix:

  • updates the pattern matching for finding the Key MatMul and Value MatMul so that it is robust to the presence of QuantizeLinear nodes (see the sketch after this list)
  • moves the QuantizeLinear nodes to their appropriate place in the ONNX graph.
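To illustrate the first point, below is a minimal sketch of how the matching can be made robust to interleaved quantization nodes. This is an illustrative reconstruction, not the actual SparseML code; all helper names here are hypothetical.

# Illustrative sketch only -- not the actual SparseML implementation.
# Idea: when walking the ONNX graph to find the Key/Value MatMuls,
# transparently skip over the QuantizeLinear/DequantizeLinear nodes
# that quantization inserted between the original nodes.
import onnx

QDQ_OPS = {"QuantizeLinear", "DequantizeLinear"}

def get_producer(graph: onnx.GraphProto, tensor_name: str):
    # Return the node that produces `tensor_name`, or None if the
    # tensor is a graph input or an initializer.
    for node in graph.node:
        if tensor_name in node.output:
            return node
    return None

def skip_qdq(graph: onnx.GraphProto, node):
    # Walk upward past any QuantizeLinear/DequantizeLinear nodes so
    # the matching rules see the underlying node directly.
    while node is not None and node.op_type in QDQ_OPS:
        node = get_producer(graph, node.input[0])
    return node

def find_matmul_parent(graph: onnx.GraphProto, node, input_idx: int = 0):
    # Find the MatMul feeding `node`, tolerating QDQ nodes in between.
    parent = skip_qdq(graph, get_producer(graph, node.input[input_idx]))
    if parent is not None and parent.op_type in {"MatMul", "MatMulInteger"}:
        return parent
    return None

The second point, relocating the QuantizeLinear nodes, is the same kind of graph surgery, just applied to the node list itself rather than during matching.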
[Six images attached in the original PR.]

Manual Testing

Note: Manual testing requires this branch: neuralmagic/deepsparse#1123, which provides the appropriate support for a quantized KV cache.

OPT

Tested with the model provided by @natuan: nlg-text_generation/c4-opt_1.3b-pruned50_quant/[email protected]@[email protected]@SP0.5@SQ1@PTQ1@ID15513

  1. Injecting the KV cache:
python kv_cache_injector.py --input-file /home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/[email protected]@[email protected]@SP0.5@SQ1@PTQ1@ID15513/deployment/model.onnx --output-file /home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/[email protected]@[email protected]@SP0.5@SQ1@PTQ1@ID15513/deployment/model_kvcache.onnx
2023-07-18 09:58:01 sparseml.exporters.transforms.kv_cache.configs INFO     Loaded config file /home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/[email protected]@[email protected]@SP0.5@SQ1@PTQ1@ID15513/deployment/config.json for model: opt
2023-07-18 09:58:01 sparseml.exporters.transforms.kv_cache.configs INFO     Properly configured arguments for KV Cache Transformation
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
2023-07-18 09:58:03 sparseml.exporters.transforms.onnx_transform INFO     [CacheKeysAndValues] Transformed 48 matches
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
2023-07-18 09:58:04 sparseml.exporters.transforms.onnx_transform INFO     [PositionsAdjustmentOPT] Transformed 3 matches
Attempting to validate an in-memory ONNX model that has been loaded without external data. This is currently not supported by the ONNX checker. The validation will be skipped.
Modified model saved to: /home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/[email protected]@[email protected]@SP0.5@SQ1@PTQ1@ID15513/deployment/model_kvcache.onnx
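(Aside, not part of this PR: a quick, hypothetical way to sanity-check the injected model is to confirm the new cache inputs exist. The input-name patterns below are an assumption; verify them against your model.)

import onnx

# Hypothetical sanity check; the "past_key_values"/"cache" naming is an
# assumption and may differ from what the transform actually emits.
model = onnx.load("deployment/model_kvcache.onnx", load_external_data=False)
cache_inputs = [
    inp.name
    for inp in model.graph.input
    if "cache" in inp.name.lower() or "past_key_values" in inp.name
]
print(len(cache_inputs))  # expect one key and one value input per attention layer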
  2. Renaming deployment/model_kvcache.onnx to deployment/model.onnx and running it in the pipeline:
opt = Pipeline.create(task="opt", model_path="/home/ubuntu/damian/ml-experiments/nlg-text_generation/c4-opt_1.3b-pruned50_quant/[email protected]@[email protected] [email protected]@SQ1@PTQ1@ID15513/deployment", engine_type=engine_type, use_deepsparse_cache = False, max_generated_tokens=32)

out = opt(sequences="Who is the president of the United States?")
print(out.sequences[0])
Who is the president of the United States?

Who is the president of the United States?

Who is the president of the United States

Note: This PR does not alter the behavior of KV cache injection for the non-quantized OPT model.

CodeGen

Tested with the model provided by @shubhra: codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/

  1. Injecting the KV cache:
python kv_cache_injector.py --input-file /home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/model.onnx --output-file /home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/model_kvcache.onnx
2023-07-18 11:22:34 sparseml.exporters.transforms.kv_cache.configs INFO     Loaded config file /home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/config.json for model: codegen
2023-07-18 11:22:34 sparseml.exporters.transforms.kv_cache.configs INFO     Properly configured arguments for KV Cache Transformation
2023-07-18 11:22:35 sparseml.exporters.transforms.onnx_transform INFO     [CacheKeysAndValues] Transformed 40 matches
2023-07-18 11:22:38 sparseml.exporters.transforms.onnx_transform INFO     [PositionsAdjustmentCodeGen] Transformed 3 matches
Modified model saved to: /home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training/model_kvcache.onnx
  2. Renaming training/model_kvcache.onnx to training/model.onnx and running it in the pipeline:
opt = Pipeline.create(task="codegen", model_path="/home/ubuntu/damian/codegen_mono-350m-apps_bigpython_bigquery_thepile-base_quantized/training", engine_type=engine_type, use_deepsparse_cache = False, max_generated_tokens=32)

out = opt(sequences="def hello_world():")
print(out.sequences[0])
print("Hello World")

hello_world()

# This is a comment

# This is a comment

# This is

Note: This PR does not alter the behavior of KV cache injection for the non-quantized CodeGen model.

@dbogunowicz dbogunowicz requested review from bfineran and natuan July 3, 2023 09:21
bfineran previously approved these changes Jul 3, 2023
natuan previously approved these changes Jul 5, 2023
@dbogunowicz dbogunowicz dismissed stale reviews from natuan and bfineran via 87d03f9 July 10, 2023 17:35
@dbogunowicz dbogunowicz force-pushed the fix/damian/quantized_opt_cache branch from e9fd0ff to 6fcd3f2 on July 14, 2023 16:12
@dbogunowicz dbogunowicz force-pushed the fix/damian/quantized_opt_cache branch from 3370f21 to de8ebf7 on July 14, 2023 16:14
@natuan natuan merged commit 6782b03 into main Jul 19, 2023
@natuan natuan deleted the fix/damian/quantized_opt_cache branch July 19, 2023 14:36