Releases: turboderp-org/exllamav2

0.0.12

22 Jan 20:04

Lots of fixes and tweaks. Main feature updates:

Model support:

  • Basic LoRA support for MoE models
  • Support for Orion models (also groundwork for other layernorm models)
  • Support for loading/converting from Axolotl checkpoints

Generation/sampling:

  • Fused kernels enabled for num_experts = 4
  • Option to return probs from streaming generator
  • Add top-A sampling
  • Add frequency/presence penalties
  • CFG support in streaming generator
  • Disable flash-attn for non-causal attention (fixes left-padding until FA2 implements custom bias)
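As a rough illustration of two of the sampling items above, the following sketch shows frequency/presence penalties and top-A filtering on a toy logit vector. It is not taken from the exllamav2 source: the function names `apply_penalties` and `top_a_filter` are hypothetical, the penalty formula follows the commonly used count-based definition, and the top-A rule assumed here is the usual one (drop tokens whose probability is below `top_a * max_prob ** 2`).

```python
import math
from collections import Counter


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_penalties(logits, generated_ids, freq_penalty=0.0, pres_penalty=0.0):
    """Hypothetical helper: penalize tokens that were already generated.

    The frequency penalty scales with how often a token has appeared;
    the presence penalty is a flat amount applied once it has appeared at all.
    """
    counts = Counter(generated_ids)
    out = list(logits)
    for token_id, count in counts.items():
        out[token_id] -= count * freq_penalty + pres_penalty
    return out


def top_a_filter(probs, top_a):
    """Hypothetical helper: zero out tokens with p < top_a * max(p)^2, then renormalize."""
    threshold = top_a * max(probs) ** 2
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


# Toy vocabulary of four tokens; tokens 0, 0 and 1 were generated earlier.
logits = [2.0, 1.0, 0.5, -1.0]
penalized = apply_penalties(
    logits, generated_ids=[0, 0, 1], freq_penalty=0.5, pres_penalty=0.25
)
probs = top_a_filter(softmax(penalized), top_a=0.5)
print(penalized)
print(probs)
```

In this toy case the penalties pull down the two previously generated tokens, and the top-A threshold then removes the least likely token before renormalizing. The actual order of operations and parameter names in the library's sampler may differ.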

Testing/evaluation:

  • HumanEval test
  • Script to compare two models layer by layer (e.g. quantized vs. original model)
  • "Standard" perplexity (ppl) test that attempts to mimic text-generation-webui
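For readers unfamiliar with the metric behind the ppl test, here is a minimal sketch of how perplexity is computed from per-token log-probabilities. It only illustrates the definition; the helper name `perplexity` is hypothetical, and the windowing, stride and dataset choices of the actual test script are not shown.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 0.25 to every correct token
# is as uncertain as a uniform choice among four options.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

Lower values mean the model assigns higher probability to the reference text, which is why the metric is useful for comparing a quantized model against its original.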

Conversion:

  • VRAM optimizations
  • Optimized quantization kernels

IO:

  • Cache safetensors context managers for faster loading
  • Optional direct IO loader (for very fast storage arrays)

0.0.11

16 Dec 23:03
v0.0.11

Bump to 0.0.11

0.0.10

30 Nov 21:21
v0.0.10

Bump to 0.0.10

0.0.9

22 Nov 04:54
v0.0.9

Bump to 0.0.9

0.0.8

12 Nov 07:21
v0.0.8

Bump to 0.0.8

0.0.7

29 Oct 19:20
v0.0.7

Bump version to 0.0.7

0.0.6

14 Oct 16:38

Full Changelog: v0.0.5...v0.0.6

0.0.5

05 Oct 10:00
9d6fdb9

Wheels are compiled with CUDA 11.7, 11.8 and 12.1 for Windows and Linux x64

0.0.4

26 Sep 21:26

Wheels are compiled with CUDA 11.7, 11.8 and 12.1 for Windows and Linux x64