Releases: b4rtaz/distributed-llama
0.7.0
This version introduces a new model converter, convert-hf.py, that supports the Hugging Face .safetensors format. The new converter supports three model types: llama, mistral, and mixtral. From now on, many models that use these architectures can be easily converted to the Distributed Llama format.
Successfully tested the new converter on:
To convert a model you need to run:
python3 convert-hf.py path/to/TinyLlama-1.1B q40 tinylama
Then you also need to convert the tokenizer:
python3 convert-tokenizer-sentencepiece.py path/to/tokenizer.model tinylama
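After both conversions you can run the model with the dllama application (see 0.6.0 below). A minimal sketch; the output filenames produced by the converters and the extra flags (--prompt, --steps, --nthreads) are assumptions based on the project README and may differ in your setup:
# run inference on the converted model (root node, no workers)
./dllama inference --model dllama_model_tinylama_q40.m --tokenizer dllama_tokenizer_tinylama.t --prompt "Hello" --steps 64 --nthreads 4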
0.6.1
0.6.0
This version renames the main application to dllama. From now on, to run the root node or a worker you need to compile the dllama target and run the dllama application.
make dllama
./dllama inference --model ... --tokenizer ...
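The same binary is also used for worker nodes. A minimal sketch, assuming the worker subcommand and the --port, --nthreads, and --workers options as documented in the project README; check the README for the exact flags:
# start a worker node that waits for the root node
./dllama worker --port 9998 --nthreads 4
# on the root node, point at the worker, e.g. add: --workers 10.0.0.1:9998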
This version also introduces an early-stage HTTP API compatible with the OpenAI API (only the /v1/chat/completions endpoint). Instructions for running the API can be found here. A big shout out to @DifferentialityDevelopment for implementing this feature. #39
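Once the API server is running, a quick way to try the endpoint is a plain curl request in the usual OpenAI chat-completions format. A minimal sketch; the host, port (localhost:9990 here), and the "model" field value are assumptions and depend on how you start the API:
curl http://localhost:9990/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "dllama", "messages": [{"role": "user", "content": "Hello!"}]}'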
0.5.2
0.5.1
- feat: use AVX2 to speed up matmulQ40 (by @DifferentialityDevelopment)
0.5.0
0.4.0
0.3.1
0.3.0
- New tokenizer format (old tokenizer files are not supported, please regenerate tokenizer files).
- Added Llama 3 support.
- Simple-server mode; check this example: nodejs-example.cjs. You can now use Distributed Llama as a simple LLM server.