
Releases: b4rtaz/distributed-llama

0.7.0

24 May 19:45
2e523f6

This version introduces a new model converter that supports the Hugging Face .safetensors format: convert-hf.py. The new converter supports three model types: llama, mistral, and mixtral. From now on, many models that use these architectures can be easily converted to the Distributed Llama format.

Successfully tested the new converter on:

To convert a model, you need to run:

python3 convert-hf.py path/to/TinyLlama-1.1B q40 tinylama

Then you also need to convert the tokenizer:

python3 convert-tokenizer-sentencepiece.py path/to/tokenizer.model tinylama
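
Once both files are generated, you can run inference with the dllama application (introduced in 0.6.0 below). A minimal sketch; the output file names here are hypothetical, so substitute the paths the converters actually produce:

# hypothetical file names; use the paths printed by the converters
./dllama inference --model dllama_tinylama_q40.bin --tokenizer dllama_tinylama.t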

0.6.1

23 May 16:43
9a1e284
  • fix: use non-blocking sockets.

0.6.0

19 May 19:25

This version renames the main application to dllama. From now on, to run the root node or a worker, you need to compile dllama and run the dllama application.

make dllama
./dllama inference --model ... --tokenizer ...

This version also introduces an early-stage HTTP API compatible with the OpenAI API (only the /v1/chat/completions endpoint). You can find instructions on how to run the API here. A big shout-out to @DifferentialityDevelopment for implementing this feature. #39
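
A minimal sketch of a request to that endpoint, assuming the server is already running; the host, port, and model value below are assumptions, not values confirmed by this release:

# host, port, and model name are assumptions; adjust to your setup
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "dllama", "messages": [{"role": "user", "content": "Hello!"}]}'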

0.5.2

18 May 09:28
182fdcd
  • feat: use AVX2 to speed up dotProduct
  • feat: use AVX2 to speed up matmulF32

0.5.1

15 May 14:35
d1304c8

0.5.0

13 May 22:07
c9bb613
  • feat: splitting attention layers across all nodes. 🎉 🎉 🎉
  • fix: convert-llama.py supports different max_seq_len.

0.4.0

09 May 17:38
e93d1e6
  • feat: support for any number of threads.
  • fix: support max kv cache length.
  • feat: splitting RoPE across all nodes.

0.3.1

28 Apr 21:36
37fad6a
  • Changed the order of QKV synchronization (details)
  • All tasks of the Llama architecture are executed in parallel
  • RoPE cache for the Llama architecture

0.3.0

22 Apr 20:57
  • New tokenizer format (old tokenizer files are not supported, please regenerate tokenizer files).
  • Added Llama 3 support.
  • Simple-server mode; check this example: nodejs-example.cjs. You may now use Distributed Llama as a simple LLM server.

0.2.0

11 Apr 21:29
620644a

Added Grok-1 support!

Breaking change: you need to re-convert Llama 2 models to the new format.