Releases: turboderp-org/exllamav2
0.0.12
Lots of fixes and tweaks. Main feature updates:
Model support:
- Basic LoRA support for MoE models
- Support for Orion models (also groundwork for other layernorm models)
- Support for loading/converting from Axolotl checkpoints
Generation/sampling:
- Fused kernels enabled for num_experts = 4
- Option to return token probabilities from the streaming generator
- Add top-A sampling
- Add frequency/presence penalties
- CFG support in streaming generator
- Disable flash-attn for non-causal attention (fixes left-padding until FA2 implements custom bias)
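Two of the sampling additions above follow well-known formulas. As a rough illustration (this is a hedged sketch of the general techniques, not exllamav2's actual kernels or API): top-A keeps only tokens whose probability is at least `a` times the square of the maximum probability, and frequency/presence penalties subtract a per-occurrence and a flat penalty from the logits of tokens already seen in the output.

```python
import numpy as np

def top_a_filter(probs, a=0.2):
    # Top-A sampling: keep only tokens whose probability is at least
    # a * (max probability)^2, zero out the rest, then renormalize.
    limit = a * probs.max() ** 2
    kept = np.where(probs >= limit, probs, 0.0)
    return kept / kept.sum()

def apply_penalties(logits, history, freq_pen=0.1, pres_pen=0.1):
    # Frequency/presence penalties in the common (OpenAI-style) form:
    # subtract freq_pen once per occurrence of a token in the history,
    # plus a flat pres_pen if the token appeared at all.
    counts = np.bincount(history, minlength=len(logits))
    return logits - freq_pen * counts - pres_pen * (counts > 0)
```

Note how top-A adapts to the shape of the distribution: a confident peak raises the cutoff sharply, while a flat distribution keeps more candidates.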
Testing/evaluation:
- HumanEval test
- Script to compare two models layer by layer (e.g. quantized vs. original model)
- "Standard" ppl test that attempts to mimic text-generation-webui
Conversion:
- VRAM optimizations
- Optimized quantization kernels
IO:
- Cache safetensors context managers for faster loading
- Optional direct IO loader (for very fast arrays)
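The caching idea behind the first IO item can be sketched generically: reuse an open handle per file across reads instead of re-opening the file for every tensor. This is a hypothetical illustration of handle caching in plain Python, not exllamav2's safetensors code.

```python
class CachedLoader:
    """Sketch of handle caching: open each file once and reuse the
    handle for subsequent reads, closing everything at the end."""

    def __init__(self):
        self._handles = {}

    def get(self, path):
        # Open on first access only; later calls hit the cache.
        if path not in self._handles:
            self._handles[path] = open(path, "rb")
        return self._handles[path]

    def close(self):
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()
```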
0.0.11
v0.0.11: Bump to 0.0.11
0.0.10
v0.0.10: Bump to 0.0.10
0.0.9
v0.0.9: Bump to 0.0.9
0.0.8
v0.0.8: Bump to 0.0.8
0.0.7
v0.0.7: Bump version to 0.0.7
0.0.6
Full Changelog: v0.0.5...v0.0.6