Merge pull request #284 from 01-ai/0108
[doc][feat] add a quick start tutorial for using yi with llama.cpp
Showing 4 changed files with 137 additions and 5 deletions.
# Run Yi with llama.cpp

If you have limited resources, you can try [llama.cpp](https://github.com/ggerganov/llama.cpp) or [ollama](https://ollama.ai/) (especially convenient for Chinese users) to run Yi models locally in a few minutes.

This tutorial guides you through every step of running a quantized model ([yi-chat-6B-2bits](https://huggingface.co/XeIaso/yi-chat-6B-GGUF/tree/main)) locally and then performing inference.

- [Step 0: Prerequisites](#step-0-prerequisites)
- [Step 1: Download llama.cpp](#step-1-download-llamacpp)
- [Step 2: Download Yi model](#step-2-download-yi-model)
- [Step 3: Perform inference](#step-3-perform-inference)

## Step 0: Prerequisites

- This tutorial assumes you use a MacBook Pro with 16 GB of memory and an Apple M2 Pro chip.

- Make sure [`git-lfs`](https://git-lfs.com/) is installed on your machine (one way to set it up is sketched after this list).
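
If `git-lfs` is not installed yet, here is a minimal setup sketch for macOS, assuming you use [Homebrew](https://brew.sh/):

```bash
# Install the Git LFS binary and register it with Git (one-time setup).
brew install git-lfs
git lfs install

# Verify the installation.
git lfs version
```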

## Step 1: Download `llama.cpp`

To clone the [`llama.cpp`](https://github.com/ggerganov/llama.cpp) repository, run the following command.

```bash
git clone git@github.com:ggerganov/llama.cpp.git
```
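
If you have not set up SSH keys for GitHub, you can clone over HTTPS instead:

```bash
git clone https://github.com/ggerganov/llama.cpp.git
```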

## Step 2: Download Yi model

2.1 To clone [XeIaso/yi-chat-6B-GGUF](https://huggingface.co/XeIaso/yi-chat-6B-GGUF/tree/main) with just pointers (that is, without downloading the large model weights yet), run the following command.

```bash
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/XeIaso/yi-chat-6B-GGUF
```

2.2 To download a quantized Yi model ([yi-chat-6b.Q2_K.gguf](https://huggingface.co/XeIaso/yi-chat-6B-GGUF/blob/main/yi-chat-6b.Q2_K.gguf)), navigate to the cloned `yi-chat-6B-GGUF` directory and run the following command.

```bash
git-lfs pull --include yi-chat-6b.Q2_K.gguf
```
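
To check that the weights were actually fetched rather than left as a small LFS pointer file, you can inspect the file size; the 2-bit quantized 6B model should be on the order of a few gigabytes (the exact size may differ):

```bash
# A multi-gigabyte size indicates the real weights; a file of only a few hundred
# bytes means just the LFS pointer was downloaded.
ls -lh yi-chat-6b.Q2_K.gguf
```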

## Step 3: Perform inference

To perform inference with the Yi model, you can use one of the following methods.

- [Method 1: Perform inference in terminal](#method-1-perform-inference-in-terminal)

- [Method 2: Perform inference in web](#method-2-perform-inference-in-web)

### Method 1: Perform inference in terminal

To compile `llama.cpp` with 4 parallel jobs and then run inference, navigate to the `llama.cpp` directory and run the following command.

> ### Tips
>
> - Replace `/Users/yu/yi-chat-6B-GGUF/yi-chat-6b.Q2_K.gguf` with the actual path of your model.
>
> - By default, the model operates in completion mode.
>
> - For additional output customization options (for example, system prompt, temperature, repetition penalty, etc.), run `./main -h` to check detailed descriptions and usage.
```bash
make -j4 && ./main -m /Users/yu/yi-chat-6B-GGUF/yi-chat-6b.Q2_K.gguf -p "How do you feed your pet fox? Please answer this question in 6 simple steps:\nStep 1:" -n 384 -e

...

How do you feed your pet fox? Please answer this question in 6 simple steps:

Step 1: Select the appropriate food for your pet fox. You should choose high-quality, balanced prey items that are suitable for their unique dietary needs. These could include live or frozen mice, rats, pigeons, or other small mammals, as well as fresh fruits and vegetables.

Step 2: Feed your pet fox once or twice a day, depending on the species and its individual preferences. Always ensure that they have access to fresh water throughout the day.

Step 3: Provide an appropriate environment for your pet fox. Ensure it has a comfortable place to rest, plenty of space to move around, and opportunities to play and exercise.

Step 4: Socialize your pet with other animals if possible. Interactions with other creatures can help them develop social skills and prevent boredom or stress.

Step 5: Regularly check for signs of illness or discomfort in your fox. Be prepared to provide veterinary care as needed, especially for common issues such as parasites, dental health problems, or infections.

Step 6: Educate yourself about the needs of your pet fox and be aware of any potential risks or concerns that could affect their well-being. Regularly consult with a veterinarian to ensure you are providing the best care.

...

```

Now you have successfully asked a question to the Yi model and got an answer! 🥳
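
As the Tips above note, the model runs in completion mode by default. If you would rather have a back-and-forth chat in the terminal, `./main` also has an interactive mode; a minimal sketch (exact flags may vary between llama.cpp versions):

```bash
# -i starts interactive mode; --color highlights your input vs. the model's output.
./main -m /Users/yu/yi-chat-6B-GGUF/yi-chat-6b.Q2_K.gguf -i --color
```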

### Method 2: Perform inference in web

1. To initialize a lightweight and swift chatbot, navigate to the `llama.cpp` directory, and run the following command.

```bash
./server --ctx-size 2048 --host 0.0.0.0 --n-gpu-layers 64 --model /Users/yu/yi-chat-6B-GGUF/yi-chat-6b.Q2_K.gguf
```

Then you can get an output like this:
```bash
...

llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 5000000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Pro
ggml_metal_init: picking default device: Apple M2 Pro
ggml_metal_init: ggml.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/Users/yu/llama.cpp/ggml-metal.metal'
ggml_metal_init: GPU name: Apple M2 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_init: hasUnifiedMemory = true
ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB
ggml_metal_init: maxTransferRate = built-in GPU
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 128.00 MiB, ( 2629.44 / 10922.67)
llama_new_context_with_model: KV self size = 128.00 MiB, K (f16): 64.00 MiB, V (f16): 64.00 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, ( 2629.45 / 10922.67)
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 159.19 MiB
ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 156.02 MiB, ( 2785.45 / 10922.67)
Available slots:
 -> Slot 0 - max context: 2048

llama server listening at http://0.0.0.0:8080
```
2. To access the chatbot interface, open your web browser and enter `http://0.0.0.0:8080` (or `http://localhost:8080`) into the address bar.

   

3. Enter a question, such as "How do you feed your pet fox? Please answer this question in 6 simple steps" into the prompt window, and you will receive a corresponding answer.

   