### Loading 4-Bit Checkpoints from Intel Neural Compressor (INC)
You can load a pre-quantized 4-bit checkpoint by passing its path with the argument `--local_quantized_inc_model_path`, together with the original model given by the argument `--model_name_or_path`.

Currently, only UINT4 checkpoints and single-device configurations are supported.

**Note:** This flow expects a checkpoint that has already been quantized using INC.

More information on enabling 4-bit inference in SynapseAI is available in the [documentation](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_PyTorch_Models.html#using-fused-sdpa).
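As a minimal sketch, such a run could look like the command below; the model name, checkpoint path, and extra generation flags are illustrative assumptions, not a verified recipe:

```bash
# Illustrative sketch: the model name, checkpoint path, and generation flags
# below are assumptions; only the two quantization arguments come from the text above.
python run_generation.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --local_quantized_inc_model_path <path_to_uint4_inc_checkpoint> \
  --use_hpu_graphs \
  --use_kv_cache \
  --max_new_tokens 100 \
  --bf16
```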
### Running with UINT4 weight quantization using AutoGPTQ
Llama2-7b in UINT4 weight-only quantization is enabled using the [AutoGPTQ fork](https://github.com/HabanaAI/AutoGPTQ), which provides quantization capabilities in PyTorch.
Currently, only UINT4 inference of pre-quantized models is supported.

You can run a *UINT4 weight-quantized* model using AutoGPTQ by setting the environment variables `SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false` and `ENABLE_EXPERIMENTAL_FLAGS=true` before running the command, and by adding the argument `--load_quantized_model_with_autogptq`.

***Note:*** Setting the above environment variables improves performance. They will be removed in future releases.

Here is an example of how to run a quantized model <quantized_gptq_model>:
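The command below is a sketch: the environment variables and `--load_quantized_model_with_autogptq` come from this section, while the remaining flags are common `run_generation.py` options assumed for illustration.

```bash
# The two environment variables are required per the note above; the generation
# flags beyond --load_quantized_model_with_autogptq are illustrative assumptions.
SRAM_SLICER_SHARED_MME_INPUT_EXPANSION_ENABLED=false ENABLE_EXPERIMENTAL_FLAGS=true \
python run_generation.py \
  --model_name_or_path <quantized_gptq_model> \
  --load_quantized_model_with_autogptq \
  --use_hpu_graphs \
  --use_kv_cache \
  --max_new_tokens 100 \
  --batch_size 1 \
  --bf16
```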
`examples/text-generation/run_generation.py` (+32, −10):

```diff
@@ -293,21 +293,11 @@ def setup_parser(parser):
         type=str,
         help="Path to serialize const params. Const params will be held on disk memory instead of being allocated on host memory.",
     )
-    parser.add_argument(
-        "--disk_offload",
-        action="store_true",
-        help="Whether to enable device map auto. In case no space left on cpu, weights will be offloaded to disk.",
-    )
     parser.add_argument(
         "--trust_remote_code",
         action="store_true",
         help="Whether to trust the execution of code from datasets/models defined on the Hub. This option should only be set to `True` for repositories you trust and in which you have read the code, as it will execute code present on the Hub on your local machine.",
     )
-    parser.add_argument(
-        "--load_quantized_model",
-        action="store_true",
-        help="Whether to load model from hugging face checkpoint.",
-    )
     parser.add_argument(
         "--parallel_strategy",
         type=str,
@@ -326,6 +316,35 @@ def setup_parser(parser):
         help="Run the inference with dataset for specified --n_iterations(default:5)",
     )
 
+    parser.add_argument(
+        "--run_partial_dataset",
+        action="store_true",
+        help="Run the inference with dataset for specified --n_iterations(default:5)",
```
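For context, a hypothetical invocation exercising the newly added flag might look like the sketch below; only `--run_partial_dataset` and `--n_iterations` appear in the diff above, and `--dataset_name` with its placeholder value is an assumption about the script's existing options.

```bash
# Hypothetical usage: --run_partial_dataset limits the dataset run to the
# number of iterations given by --n_iterations (default: 5).
# <dataset_name> and the model name are placeholders shown for illustration.
python run_generation.py \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_name <dataset_name> \
  --n_iterations 5 \
  --run_partial_dataset \
  --bf16
```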