Thanks for taking the time to fill out this bug report!

```yaml
- type: textarea
  id: bug-description
  attributes:
    label: Describe the bug
    description: A clear and concise description of what the bug is.
    placeholder: Bug description
  validations:
    required: true
- type: checkboxes
  attributes:
    label: Is there an existing issue for this?
    description: Please search to see if an issue already exists for the bug you encountered.
    options:
      - label: I have searched the existing issues
        required: true
- type: textarea
  id: reproduction
  attributes:
    label: Reproduction
    description: Please provide the steps necessary to reproduce your issue.
    placeholder: Reproduction
  validations:
    required: true
- type: textarea
  id: screenshot
  attributes:
    label: Screenshot
    description: "If possible, please include screenshot(s) so that we can understand what the issue is."
- type: textarea
  id: logs
  attributes:
    label: Logs
    description: "Please include the full stacktrace of the errors you get in the command-line (if any)."
    render: shell
  validations:
    required: true
- type: textarea
  id: system-info
  attributes:
    label: System Info
    description: "Please share your system info with us: operating system, GPU brand, and GPU model. If you are using a Google Colab notebook, mention that instead."
```
```yaml
close-issue-message: "This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, you can reopen it (if you are the author) or leave a comment below."
```
See also: [Installation instructions for human beings](https://github.com/oobabooga/text-generation-webui/wiki/Installation-instructions-for-human-beings).
## Installation option 2: one-click installers
Optionally, you can use the following command-line flags:
| Flag | Description |
|-------------|-------------|
|`-h`, `--help`| Show this help message and exit. |
|`--model MODEL`| Name of the model to load by default. |
|`--notebook`| Launch the web UI in notebook mode, where the output is written to the same text box as the input. |
|`--chat`| Launch the web UI in chat mode. |
|`--cai-chat`| Launch the web UI in chat mode with a style similar to Character.AI's. If the file `img_bot.png` or `img_bot.jpg` exists in the same folder as server.py, this image will be used as the bot's profile picture. Similarly, `img_me.png` or `img_me.jpg` will be used as your profile picture. |
|`--cpu`| Use the CPU to generate text. |
|`--load-in-8bit`| Load the model with 8-bit precision. |
|`--load-in-4bit`| Load the model with 4-bit precision. Currently only works with LLaMA. |
|`--gptq-bits GPTQ_BITS`| Load a pre-quantized model with the specified precision. 2, 3, 4 and 8 bits are supported. Currently only works with LLaMA and OPT. |
|`--gptq-model-type MODEL_TYPE`| Model type of the pre-quantized model. Currently only LLaMA and OPT are supported. |
|`--bf16`| Load the model with bfloat16 precision. Requires an NVIDIA Ampere GPU. |
|`--auto-devices`| Automatically split the model across the available GPU(s) and CPU. |
|`--disk`| If the model is too large for your GPU(s) and CPU combined, send the remaining layers to the disk. |
|`--disk-cache-dir DISK_CACHE_DIR`| Directory to save the disk cache to. Defaults to `cache/`. |
|`--gpu-memory GPU_MEMORY [GPU_MEMORY ...]`| Maximum GPU memory in GiB to be allocated per GPU. Example: `--gpu-memory 10` for a single GPU, `--gpu-memory 10 5` for two GPUs. |
|`--cpu-memory CPU_MEMORY`| Maximum CPU memory in GiB to allocate for offloaded weights. Must be an integer. Defaults to 99. |
|`--flexgen`| Enable the use of FlexGen offloading. |
|`--percent PERCENT [PERCENT ...]`| FlexGen: allocation percentages. Must be 6 numbers separated by spaces (default: 0, 100, 100, 0, 100, 0). |
|`--compress-weight`| FlexGen: whether to compress weights (default: False). |
|`--pin-weight [PIN_WEIGHT]`| FlexGen: whether to pin weights (setting this to False reduces CPU memory by 20%). |
|`--deepspeed`| Enable the use of DeepSpeed ZeRO-3 for inference via the Transformers integration. |
|`--nvme-offload-dir NVME_OFFLOAD_DIR`| DeepSpeed: directory to use for ZeRO-3 NVMe offloading. |
|`--local_rank LOCAL_RANK`| DeepSpeed: optional argument for distributed setups. |
|`--rwkv-strategy RWKV_STRATEGY`| RWKV: the strategy to use while loading the model. Examples: "cpu fp32", "cuda fp16", "cuda fp16i8". |
|`--rwkv-cuda-on`| RWKV: compile the CUDA kernel for better performance. |
|`--no-stream`| Don't stream the text output in real time. |
|`--settings SETTINGS_FILE`| Load the default interface settings from this json file. See `settings-template.json` for an example. If you create a file called `settings.json`, this file will be loaded by default without the need to use the `--settings` flag. |
|`--extensions EXTENSIONS [EXTENSIONS ...]`| The list of extensions to load. If you want to load more than one extension, write the names separated by spaces. |
|`--listen`| Make the web UI reachable from your local network. |
|`--listen-port LISTEN_PORT`| The listening port that the server will use. |
|`--share`| Create a public URL. This is useful for running the web UI on Google Colab or similar. |
|`--auto-launch`| Open the web UI in the default browser upon launch. |
|`--verbose`| Print the prompts to the terminal. |
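
For example, to start the web UI in chat mode with a model loaded in 8-bit precision and make it reachable on your local network, the flags above can be combined like this (a minimal sketch; the model name and port below are placeholders, not recommendations):

```
# Hypothetical invocation: chat mode, 8-bit weights, reachable on the local network.
# Replace "model-name" with the name of a model you have downloaded.
python server.py --chat --model model-name --load-in-8bit --listen --listen-port 7861
```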
Out of memory errors? [Check this guide](https://github.com/oobabooga/text-generation-webui/wiki/Low-VRAM-guide).
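
As a quick starting point before reading that guide, the offloading flags from the table above can be combined to cap GPU usage and spill the rest to CPU RAM and disk; a minimal sketch (the memory limits here are illustrative assumptions, not tuned values):

```
# Hypothetical invocation: cap the GPU at 6 GiB, allow up to 32 GiB of CPU RAM for
# offloaded weights, and send any layers that still don't fit to the disk cache.
python server.py --model model-name --auto-devices --gpu-memory 6 --cpu-memory 32 --disk
```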