NormalFloat (NF4) + Double Quantization
QLoRA currently uses zero-shot quantization, which is different from GPTQ: unlike GPTQ, it does not require calibration data, but it incurs some performance loss. So I think there is still enough of an advantage in using a GPTQ-quantized base model to train a better LoRA.
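For reference, a minimal sketch of loading a base model with NF4 storage and double quantization, assuming a recent transformers/bitsandbytes that expose the 4-bit options of BitsAndBytesConfig (the model name is only a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 ("normal float") weight storage + double quantization of the quantization constants
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for the dequantized matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",  # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
```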
Paged Optimizers
Paged Optimizers use NVIDIA unified memory to avoid the gradient checkpointing memory spikes that occur when processing a mini-batch with a long sequence length.
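A sketch of switching to a paged optimizer, assuming a bitsandbytes version (>= 0.39) that ships the paged AdamW variants; the learning rate is only an illustrative value:

```python
import bitsandbytes as bnb

# Optimizer states are allocated in unified memory and paged between
# CPU and GPU on demand, which absorbs occasional memory spikes.
optimizer = bnb.optim.PagedAdamW32bit(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.999),  # beta2 = 0.999 as in the paper
)

# Alternatively, when training through the HF Trainer:
# TrainingArguments(..., optim="paged_adamw_32bit")
```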
Apply LoRA to all linear layers
Currently, this repo only applies LoRA to the k and v projections. In QLoRA, LoRA is applied to all linear layers, which is very important for performance.
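As a sketch, something along these lines can enumerate every linear layer name so they can all be targeted; the function name and the exclusion of lm_head are my own choices:

```python
import torch.nn as nn

def find_linear_module_names(model):
    """Collect the leaf names of all linear sub-modules so LoRA can target every linear layer."""
    linear_classes = [nn.Linear]
    try:
        import bitsandbytes as bnb
        linear_classes.append(bnb.nn.Linear4bit)  # 4-bit layers when the base model is quantized
    except ImportError:
        pass
    names = set()
    for full_name, module in model.named_modules():
        if isinstance(module, tuple(linear_classes)):
            names.add(full_name.split(".")[-1])
    names.discard("lm_head")  # keep the output head out of the LoRA targets
    return sorted(names)
```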
Hyperparameters
The hyperparameters mentioned in the paper are:
"We set LoRA r = 64, α = 16, and add LoRA modules on all linear layers of the base model. We also use Adam beta2 of 0.999, max grad norm of 0.3 and LoRA dropout of 0.1 for models up to 13B and 0.05 for 33B and 65B models."
Additionally, 3-bit LoRA may be possible. According to the paper: "Since finetuning after quantization seems to recover most of the information that is lost during quantization this might enable much more aggressive quantization. For example, 3-bit GPTQ quantization of the basemodel with LoRA might also yield 16-bit full finetuning performance after finetuning."
Great! Maybe we can use a larger model at the same performance level as fp16 in the future.
Also, we can add more modules to finetune in LoRA training using peft by adjusting the config:
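For example, a sketch that lists the usual LLaMA projection names explicitly (other architectures use different module names):

```python
from peft import LoraConfig

config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
)
```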