# One Shot With SparseML
This page describes how to perform one-shot quantization of large language models using [SparseML](https://github.com/neuralmagic/sparseml). This workflow requires a GPU with at least 16GB of VRAM and 64GB of system RAM.

## Table of Contents
1. [How to Clone and Install the Latest SparseML](#clone)
2. [How to One-shot TinyLlama](#tinyllama)
3. [How to Evaluate the One-shot Model](#evaluate)
4. [How to Export the One-shot Model](#export)
5. [How to Inject KV Cache](#kvcache)
6. [Using the Model With DeepSparse](#deepsparse)
7. [Upload Model to Hugging Face](#upload)
8. [Explaining the TinyLlama Recipe](#recipe)
9. [How to Adapt a Recipe for a New Model](#adapt)

## <a name="clone">How to Clone and Install the Latest SparseML</a>
You'll need the latest version of SparseML to run the one-shot workflow. We recommend installing it from source in a fresh Python environment to avoid dependency issues.

Clone the SparseML repo and install it locally:
```bash
git clone https://github.com/neuralmagic/sparseml
pip install -e "sparseml[transformers]"
```
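To confirm the installation worked and that your GPU meets the VRAM requirement mentioned above, you can run a quick sanity check. This is a minimal sketch, assuming PyTorch with CUDA support was pulled in by the install:
```python
import sparseml
import torch

# Confirm SparseML imports cleanly and report its version.
print(f"SparseML version: {sparseml.__version__}")

# Check that a CUDA GPU with roughly 16GB of VRAM is visible.
if torch.cuda.is_available():
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU 0: {torch.cuda.get_device_name(0)}, {total_gb:.1f} GB VRAM")
else:
    print("No CUDA GPU detected; the one-shot workflow requires one.")
```
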
## <a name="tinyllama">How to One-shot TinyLlama</a>
[TinyLlama-1.1B-Chat](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.4) is an LLM that we can quantize in a short time because it has only 1.1B parameters.

Perform one-shot quantization using the OBCQ algorithm. The `obcq.py` script takes the following parameters:

Positional arguments:
- `model` a local path to the model or a Hugging Face model stub
- `dataset_name` Hugging Face dataset to extract calibration data from. Supported datasets include `{c4,evolcodealpaca,gsm8k,open_platypus,ptb,wikitext2}`

Options:
- `--nsamples` number of samples to extract from the dataset; defaults to 512
- `--deploy-dir` the directory where the model will be saved; defaults to `obcq_deployment`
- `--eval` dataset to use for perplexity evaluation, or none to skip evaluation
- `--save` whether to save the output model to disk
- `--recipe` the file containing the one-shot hyperparameters
- `--device` the device to load the model onto, either `cpu` or a specific GPU such as `cuda:0`
- `--precision` precision to load the model in: `auto` (default), `half`, `full`, `float16`, or `float32`

Example command:
```bash
wget https://huggingface.co/nm-testing/TinyLlama-1.1B-Chat-v0.4-pruned50-quant/raw/main/recipe.yaml # download the recipe
python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py TinyLlama/TinyLlama-1.1B-Chat-v0.4 open_platypus --recipe recipe.yaml --save True
```
## <a name="evaluate">How to Evaluate the One-shot Model</a>
Next, evaluate the model's performance using the [lm-evaluation-harness framework](https://github.com/neuralmagic/lm-evaluation-harness).

Clone the repository:
```bash
git clone https://github.com/neuralmagic/lm-evaluation-harness.git
```
Install the required dependencies:
```bash
cd lm-evaluation-harness
pip install -e .
```
Evaluate on the `hellaswag` task:
```bash
git checkout sparse_new_modifier
start=`date +%s`
python main_sparse.py \
    --model hf-causal-experimental \
    --model_args pretrained=obcq_deployment,trust_remote_code=True \
    --tasks hellaswag \
    --batch_size 64 \
    --no_cache \
    --write_out \
    --output_path "obcq_deployment/hellaswag.json" \
    --device "cuda:0" \
    --num_fewshot 0
end=`date +%s`
echo Execution time was `expr $end - $start` seconds.
```
The results obtained in this case are:
```
Running loglikelihood requests
100%|██████████| 40145/40145 [20:47<00:00, 32.19it/s]
{
  "results": {
    "hellaswag": {
      "acc": 0.40141406094403503,
      "acc_stderr": 0.004891826692722827,
      "acc_norm": 0.5115514837681737,
      "acc_norm_stderr": 0.004988449593007253
    }
  },
  "versions": {
    "hellaswag": 0
  },
  "config": {
    "model": "hf-causal-experimental",
    "model_args": {
      "pretrained": "/home/mwitiderrick/neuralmagic/sparseml/obcq_deployment",
      "trust_remote_code": true
    },
    "num_fewshot": 0,
    "batch_size": "64",
    "batch_sizes": [],
    "device": "cuda:0",
    "no_cache": true,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}
hf-causal-experimental (pretrained=/home/mwitiderrick/neuralmagic/sparseml/obcq_deployment,trust_remote_code=True), limit: None, provide_description: False, num_fewshot: 0, batch_size: 64
| Task |Version| Metric |Value | |Stderr|
|---------|------:|--------|-----:|---|-----:|
|hellaswag| 0|acc |0.4014|± |0.0049|
| | |acc_norm|0.5116|± |0.0050|

Execution time was 1288 seconds.
```
Repeat the evaluation on other tasks such as `truthfulqa-mc`, `winogrande`, and `drop`, for example with a small loop like the sketch below.
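A minimal sketch of such a loop, run from the `lm-evaluation-harness` directory; the task names are taken from the list above and may need to be adjusted to match your harness version:
```python
import subprocess

# Run the same evaluation command for each additional task.
for task in ["truthfulqa-mc", "winogrande", "drop"]:
    subprocess.run(
        [
            "python", "main_sparse.py",
            "--model", "hf-causal-experimental",
            "--model_args", "pretrained=obcq_deployment,trust_remote_code=True",
            "--tasks", task,
            "--batch_size", "64",
            "--no_cache",
            "--write_out",
            "--output_path", f"obcq_deployment/{task}.json",
            "--device", "cuda:0",
            "--num_fewshot", "0",
        ],
        check=True,  # stop early if any evaluation fails
    )
```
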
## <a name="export">How to Export the One-shot Model</a>
Once you are certain the model is performing as expected, you can export it for inference. The `export.py` script provides the functions for doing this. Running the command below creates a `deployment` directory containing all the artifacts needed for inference with DeepSparse.

```bash
python sparseml/src/sparseml/transformers/sparsification/obcq/export.py --task text-generation --model_path obcq_deployment
```
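As a quick sanity check, you can list what the export produced. A small sketch; the exact file names vary between SparseML versions, so treat the expected contents (ONNX model, tokenizer, and config files) as an assumption:
```python
import os

# Print each exported artifact and its size in megabytes.
for name in sorted(os.listdir("deployment")):
    size_mb = os.path.getsize(os.path.join("deployment", name)) / 1e6
    print(f"{name}: {size_mb:.1f} MB")
```
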
## <a name="kvcache">How to Inject KV Cache</a>
Injecting the KV cache reduces the model's computational overhead and speeds up inference by caching the key and value states of the attention layers.
This is done by creating a copy of `model.onnx` and injecting the KV cache into it:
```bash
cp deployment/model.onnx deployment/model-orig.onnx
```

Code to inject the KV cache:
```python
import os

import onnx
from sparseml.exporters.kv_cache_injector import KeyValueCacheInjector

input_file = "deployment/model-orig.onnx"
output_file = "deployment/model.onnx"
model = onnx.load(input_file, load_external_data=False)
model = KeyValueCacheInjector(model_path=os.path.dirname(input_file)).apply(model)
onnx.save(model, output_file)
print(f"Modified model saved to: {output_file}")
```
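To confirm the injection worked, you can inspect the inputs of the modified graph. This is only a sketch: the assumption that the cache inputs carry `past_key_values` in their names depends on the exporter version.
```python
import onnx

# Load the modified graph without pulling in the external weight data.
model = onnx.load("deployment/model.onnx", load_external_data=False)
input_names = [inp.name for inp in model.graph.input]

# The injected cache inputs are assumed to contain "past_key_values" in their names.
cache_inputs = [name for name in input_names if "past_key_values" in name]
print(f"Found {len(cache_inputs)} cache inputs, e.g. {cache_inputs[:2]}")
```
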
## <a name="deepsparse">Using the Model With DeepSparse</a>
Next, run inference using DeepSparse. Ensure you have the latest version of DeepSparse installed with `pip install -U deepsparse-nightly[llm]`.

```python
from deepsparse import TextGeneration

prompt = "How to get in a good university?"
formatted_prompt = f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

model = TextGeneration(model="deployment")
print(model(formatted_prompt, max_new_tokens=200).generations[0].text)
"""
There are many factors to consider when choosing a university. Here are some tips for getting into a good university:

1. Research your options: Consider the schools in your area and the ones in your desired location. Research their reputation, tuition, and academic programs.

2. Apply to multiple universities: Apply to multiple universities, ensuring that you are applying to the best option for you.

3. Get a job: If you are applying to a university, you will need to find a job to support your studies. This will help you budget and manage your time.

4. Get involved with your community: Your university will likely have a community of students and faculty. Engage with this community by volunteering, participating in clubs, and engaging with others in your community.

5. Get involved with extracurricular activities: Universities often have many extracurricular activities, which can help you meet new people
"""
```
Check out the [DeepSparse pipeline text generation docs](https://github.com/neuralmagic/deepsparse/blob/main/src/deepsparse/transformers/text_generation.md) for the full list of supported parameters.

## <a name="upload">Upload Model to Hugging Face</a>
You may want to upload the one-shot model to Hugging Face for ease of reference or to share it with your colleagues.

Head over to your [Hugging Face account](https://huggingface.co/new), create a model repository named `TinyLlama-1.1B-Chat-v0.4-pruned50-quant`, and then upload the one-shot model:
```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="deployment",
    repo_id="YOUR_HF_USERNAME/TinyLlama-1.1B-Chat-v0.4-pruned50-quant",
    repo_type="model",
    token="HF_WRITE_TOKEN",
)
```
## <a name="recipe">Explaining the TinyLlama Recipe</a>
A recipe is a set of hyperparameters that provide detailed instructions on how the [one-shot quantization](https://neuralmagic.com/video/pruning-and-quantizing-ml-models-with-one-shot-without-retraining/) should be done. The recipe is applied in one shot, meaning that no retraining of the LLM is required.

We will now walk through what the different hyperparameters mean and why they are set to those values.

The `SmoothQuantModifier` deals with outliers in the weights and activations of the LLM; quantization is very sensitive to large variations in these values. For TinyLlama, a `smoothing_strength` value of 0.8 resulted in a model with repetitions in its output, but lowering the value to 0.5 solved the problem.

The `ignore` parameter under `QuantizationModifier` lets us list operations that either don't make sense to quantize or are too sensitive to quantize; quantizing sensitive operations hurts the final accuracy of the model. We also don't quantize the inputs to the embedding layer.

Under `SparseGPTModifier`, we set `sparsity` to 0.5 because we are aiming for a model that is 50% sparse, meaning half of the weights are pruned. The other parameters are:
- `block_size` determines the number of columns to compress in one pass.
- `quantize` whether or not to quantize weights during SparseGPT. A default quantization modifier is applied when `quantize` is set to `True` and there is no `QuantizationModifier` in the recipe.
- `dampening_frac` (set via `percdamp` in the recipe below) the amount of dampening to apply to the Hessian H, as a fraction of the diagonal norm.
- `sequential_update` whether or not to update weights sequentially by layer; `True` saves GPU memory.
- `mask_structure` string defining the structure of the mask to apply; `"0:0"` means an unstructured mask, while `"16:32"` would zero out 16 of every 32 weights (structured sparsity).
- `targets` list of layer names to compress during OBCQ, or `__ALL__` to compress every layer in the model.

```yaml
test_stage:
  obcq_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.5
      mappings: [
        [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
        [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]
      ]
    QuantizationModifier:
      ignore:
        # These operations don't make sense to quantize
        - LlamaRotaryEmbedding
        - LlamaRMSNorm
        - SiLUActivation
        # Skip quantizing the BMMs
        - QuantizableMatMul
        # Skip quantizing the layers with the most sensitive activations
        - model.layers.21.mlp.down_proj
        - model.layers.7.mlp.down_proj
        - model.layers.2.mlp.down_proj
        - model.layers.20.mlp.down_proj
        - model.layers.19.mlp.down_proj
      post_oneshot_calibration: true
      scheme_overrides:
        Embedding:
          input_activations: null
          weights:
            num_bits: 8
            symmetric: false
    SparseGPTModifier:
      sparsity: 0.5
      block_size: 128
      sequential_update: true
      quantize: true
      percdamp: 0.01
      mask_structure: "0:0"
      targets: ["re:model.layers.\\d*$"]
```

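If you prefer to tweak these hyperparameters programmatically rather than editing the YAML by hand, here is a minimal sketch using PyYAML; it assumes the `recipe.yaml` file downloaded earlier and the key structure shown above:
```python
import yaml

# Load the recipe, adjust a hyperparameter, and write out a new copy.
with open("recipe.yaml") as f:
    recipe = yaml.safe_load(f)

# Illustrative tweak: target 60% sparsity instead of 50%.
recipe["test_stage"]["obcq_modifiers"]["SparseGPTModifier"]["sparsity"] = 0.6

with open("recipe_custom.yaml", "w") as f:
    yaml.safe_dump(recipe, f, sort_keys=False)
```
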
## <a name="adapt">How to Adapt a Recipe for a New Model</a>
You can modify the above recipe to perform one-shot quantization on other models, for example [Mistral](https://huggingface.co/docs/transformers/main/model_doc/mistral).

Make the following modifications to the recipe to one-shot a Mistral model:
- Define the operations to skip during quantization, that is, sensitive layers and operations that don't make sense to quantize.
- Declare the desired sparsity level, the same as the one used for TinyLlama.
- State the layers to compress during OBCQ.

Here is what the final recipe looks like:
```yaml
test_stage:
  obcq_modifiers:
    SmoothQuantModifier:
      smoothing_strength: 0.5
      mappings: [
        [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
        [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"]
      ]
    QuantizationModifier:
      ignore:
        # These operations don't make sense to quantize
        - MistralRotaryEmbedding
        - MistralRMSNorm
        - SiLUActivation
        # Skip quantizing the layers with the most sensitive activations
        - model.layers.1.mlp.down_proj
        - model.layers.31.mlp.down_proj
        - model.layers.30.mlp.down_proj
        - model.layers.30.mlp.gate_proj
        - model.layers.30.mlp.up_proj
      post_oneshot_calibration: true
      scheme_overrides:
        Embedding:
          input_activations: null
          weights:
            num_bits: 8
            symmetric: false
    SparseGPTModifier:
      sparsity: 0.5
      block_size: 128
      sequential_update: true
      quantize: true
      percdamp: 0.01
      mask_structure: "0:0"
      targets: ["re:model.layers.\\d*$"]
```

Save the recipe to a file named `zephyr.yaml`.

Run one-shot quantization on any Mistral-based model, for example, `zephyr-7b-beta`:
```bash
python sparseml/src/sparseml/transformers/sparsification/obcq/obcq.py HuggingFaceH4/zephyr-7b-beta open_platypus --recipe zephyr.yaml --precision float16 --save True
```
We set `--precision` to `float16` because quantization is not supported for the `bfloat16` data type as of this writing.

Repeat the evaluation, export, KV cache injection, and inference steps as shown previously.

## Conclusion
If you have any questions, submit an [issue on GitHub](https://github.com/neuralmagic/sparseml) or join other LLM developers in our [community Slack](https://join.slack.com/t/discuss-neuralmagic/shared_invite/zt-q1a1cnvo-YBoICSIw3L1dmQpjBeDurQ).
