Releases: turboderp-org/exllamav2
0.0.12
Lots of fixes and tweaks. Main feature updates:
Model support:
- Basic LoRA support for MoE models
- Support for Orion models (also groundwork for other layernorm models)
- Support for loading/converting from Axolotl checkpoints
Generation/sampling:
- Fused kernels enabled for num_experts = 4
- Option to return token probabilities from the streaming generator
- Add top-A sampling
- Add frequency/presence penalties
- CFG support in streaming generator
- Disable flash-attn for non-causal attention (fixes left-padding until FA2 implements custom bias)
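Two of the sampling additions above follow well-known formulas. As a rough illustration (this is a hedged sketch of the general techniques, not exllamav2's actual kernels or API): top-A keeps only tokens whose probability is at least `a` times the square of the maximum probability, and frequency/presence penalties subtract a per-occurrence and a flat penalty from the logits of tokens already seen in the output.

```python
import numpy as np

def top_a_filter(probs, a=0.2):
    # Top-A sampling: keep only tokens whose probability is at least
    # a * (max probability)^2, zero out the rest, then renormalize.
    limit = a * probs.max() ** 2
    kept = np.where(probs >= limit, probs, 0.0)
    return kept / kept.sum()

def apply_penalties(logits, history, freq_pen=0.1, pres_pen=0.1):
    # Frequency/presence penalties in the common (OpenAI-style) form:
    # subtract freq_pen once per occurrence of a token in the history,
    # plus a flat pres_pen if the token appeared at all.
    counts = np.bincount(history, minlength=len(logits))
    return logits - freq_pen * counts - pres_pen * (counts > 0)
```

Note how top-A adapts to the shape of the distribution: a confident peak raises the cutoff sharply, while a flat distribution keeps more candidates.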
Testing/evaluation:
- HumanEval test
- Script to compare two models layer by layer (e.g. quantized vs. original model)
- "Standard" ppl test that attempts to mimic text-generation-webui
Conversion:
- VRAM optimizations
- Optimized quantization kernels
IO:
- Cache safetensors context managers for faster loading
- Optional direct IO loader (for very fast arrays)
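The caching idea behind the first IO item can be sketched generically: reuse an open handle per file across reads instead of re-opening the file for every tensor. This is a hypothetical illustration of handle caching in plain Python, not exllamav2's safetensors code.

```python
class CachedLoader:
    """Sketch of handle caching: open each file once and reuse the
    handle for subsequent reads, closing everything at the end."""

    def __init__(self):
        self._handles = {}

    def get(self, path):
        # Open on first access only; later calls hit the cache.
        if path not in self._handles:
            self._handles[path] = open(path, "rb")
        return self._handles[path]

    def close(self):
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()
```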
0.0.11
v0.0.11: Bump to 0.0.11
0.0.10
v0.0.10: Bump to 0.0.10
0.0.9
v0.0.9: Bump to 0.0.9
0.0.8
v0.0.8: Bump to 0.0.8
0.0.7
v0.0.7: Bump version to 0.0.7
0.0.6
Full Changelog: v0.0.5...v0.0.6