diff --git a/README.md b/README.md
index efbfa69d..d25b2896 100644
--- a/README.md
+++ b/README.md
@@ -92,7 +92,7 @@
## 📌 Introduction
-- 🤖 The Yi series models are the next generation of open source large language models trained from strach by [01.AI](https://01.ai/).
+- 🤖 The Yi series models are the next generation of open source large language models trained from scratch by [01.AI](https://01.ai/).
- 🙌 Targeted as a bilingual language model and trained on 3T multilingual corpus, the Yi series models become one of the strongest LLM worldwide, showing promise in language understanding, commonsense reasoning, reading comprehension, and more. For example,
@@ -119,7 +119,7 @@ Yi-34B-Chat | • [🤗 Hugging Face](https://huggingface.co/01-ai/Yi-34B-Chat)
Yi-34B-Chat-4bits | • [🤗 Hugging Face](https://huggingface.co/01-ai/Yi-34B-Chat-4bits) • [🤖 ModelScope](https://www.modelscope.cn/models/01ai/Yi-34B-Chat-4bits/summary)
Yi-34B-Chat-8bits | • [🤗 Hugging Face](https://huggingface.co/01-ai/Yi-34B-Chat-8bits) • [🤖 ModelScope](https://www.modelscope.cn/models/01ai/Yi-34B-Chat-8bits/summary)
- - 4 bits series models are quantized by AWQ.
- - 8 bits series models are quantized by GPTQ
- - All quantized models have a low barrier to use since they can be deployed on consumer-grade GPUs (e.g., 3090, 4090).
+ - 4-bit series models are quantized by AWQ.
+ - 8-bit series models are quantized by GPTQ.
+ - All quantized models have a low barrier to use since they can be deployed on consumer-grade GPUs (e.g., 3090, 4090).
### Base models
@@ -153,7 +153,7 @@ For chat models and base models:
🎯 2023/11/23: The chat models are open to public.
-This release contains two chat models based on previous released base models, two 8-bits models quantized by GPTQ, two 4-bits models quantized by AWQ.
+This release contains two chat models based on previously released base models, two 8-bit models quantized by GPTQ, and two 4-bit models quantized by AWQ.
- `Yi-34B-Chat`
- `Yi-34B-Chat-4bits`
@@ -185,7 +185,7 @@ Application form:
🎯 2023/11/05: The base model of Yi-6B-200K
and Yi-34B-200K
.
-This release contains two base models with the same parameter sizes of previous
+This release contains two base models with the same parameter sizes as the previous
release, except that the context window is extended to 200K.
@@ -382,7 +382,7 @@ Everyone! 🙌 ✅
- The Yi series models are free for personal usage, academic purposes, and commercial use. All usage must adhere to the [Yi Series Models Community License Agreement 2.1](https://github.com/01-ai/Yi/blob/main/MODEL_LICENSE_AGREEMENT.txt)
-- For free commercial use, you only need to [complete this form](https://www.lingyiwanwu.com/yi-license) to get Yi Model Commercial License.
+- For free commercial use, you only need to [complete this form](https://www.lingyiwanwu.com/yi-license) to get a Yi Model Commercial License.
diff --git a/assets/img/yi_llama_cpp1.png b/assets/img/yi_llama_cpp1.png
new file mode 100644
index 00000000..9007f227
Binary files /dev/null and b/assets/img/yi_llama_cpp1.png differ
diff --git a/assets/img/yi_llama_cpp2.png b/assets/img/yi_llama_cpp2.png
new file mode 100644
index 00000000..05507d80
Binary files /dev/null and b/assets/img/yi_llama_cpp2.png differ
diff --git a/docs/yi_llama.cpp.md b/docs/yi_llama.cpp.md
new file mode 100644
index 00000000..03e64f81
--- /dev/null
+++ b/docs/yi_llama.cpp.md
@@ -0,0 +1,132 @@
+# Run Yi with llama.cpp
+
+If you have limited resources, you can try [llama.cpp](https://github.com/ggerganov/llama.cpp) or [ollama](https://ollama.ai/) (especially for Chinese users) to run Yi models locally in a few minutes.
+
+This tutorial guides you through every step of running a quantized model ([yi-chat-6B-2bits](https://huggingface.co/XeIaso/yi-chat-6B-GGUF/tree/main)) locally and then performing inference.
+
+- [Step 0: Prerequisites](#step-0-prerequisites)
+- [Step 1: Download llama.cpp](#step-1-download-llamacpp)
+- [Step 2: Download Yi model](#step-2-download-yi-model)
+- [Step 3: Perform inference](#step-3-perform-inference)
+
+## Step 0: Prerequisites
+
+- This tutorial assumes you use a MacBook Pro with 16GB of memory and an Apple M2 Pro chip.
+
+- Make sure [`git-lfs`](https://git-lfs.com/) is installed on your machine.
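+
+  If `git-lfs` is not installed yet, a minimal sketch for macOS is shown below; it assumes [Homebrew](https://brew.sh/) is available, which is not part of the original prerequisites.
+
+  ```bash
+  # Install git-lfs with Homebrew if it is not already on the PATH.
+  command -v git-lfs >/dev/null 2>&1 || brew install git-lfs
+
+  # Set up the Git LFS hooks for your user account.
+  git lfs install
+
+  # Confirm the installation by printing the version.
+  git lfs version
+  ```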
+
+## Step 1: Download `llama.cpp`
+
+To clone the [`llama.cpp`](https://github.com/ggerganov/llama.cpp) repository, run the following command.
+
+```bash
+git clone git@github.com:ggerganov/llama.cpp.git
+```
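+
+The command above clones over SSH. If you do not have an SSH key registered with GitHub, cloning over HTTPS works as well.
+
+```bash
+# HTTPS clone of the same repository (no SSH key required).
+git clone https://github.com/ggerganov/llama.cpp.git
+```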
+
+## Step 2: Download Yi model
+
+2.1 To clone [XeIaso/yi-chat-6B-GGUF](https://huggingface.co/XeIaso/yi-chat-6B-GGUF/tree/main) with LFS pointer files only (skipping the large weight files for now), run the following command.
+
+```bash
+GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/XeIaso/yi-chat-6B-GGUF
+```
+
+2.2 To download the quantized Yi model ([yi-chat-6b.Q2_K.gguf](https://huggingface.co/XeIaso/yi-chat-6B-GGUF/blob/main/yi-chat-6b.Q2_K.gguf)), navigate to the `yi-chat-6B-GGUF` directory and run the following command.
+
+```bash
+git-lfs pull --include yi-chat-6b.Q2_K.gguf
+```
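+
+To confirm that the full weight file was pulled rather than a small LFS pointer file, you can check its size. This check is not part of the original steps; a Q2_K quantization of a 6B model should be a few GB, not a few hundred bytes.
+
+```bash
+# Run inside the yi-chat-6B-GGUF directory.
+ls -lh yi-chat-6b.Q2_K.gguf
+```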
+
+## Step 3: Perform inference
+
+To perform inference with the Yi model, you can use one of the following methods.
+
+- [Method 1: Perform inference in the terminal](#method-1-perform-inference-in-the-terminal)
+
+- [Method 2: Perform inference in the web UI](#method-2-perform-inference-in-the-web-ui)
+
+### Method 1: Perform inference in the terminal
+
+To compile `llama.cpp` with 4 parallel jobs and then run inference, navigate to the `llama.cpp` directory and run the following command.
+
+> ### Tips
+>
+> - Replace `/Users/yu/yi-chat-6B-GGUF/yi-chat-6b.Q2_K.gguf` with the actual path of your model.
+>
+> - By default, the model operates in completion mode.
+>
+> - For additional output customization options (for example, the system prompt, temperature, and repetition penalty), run `./main -h` to check detailed descriptions and usage.
+
+```bash
+make -j4 && ./main -m /Users/yu/yi-chat-6B-GGUF/yi-chat-6b.Q2_K.gguf -p "How do you feed your pet fox? Please answer this question in 6 simple steps:\nStep 1:" -n 384 -e
+
+...
+
+How do you feed your pet fox? Please answer this question in 6 simple steps:
+
+Step 1: Select the appropriate food for your pet fox. You should choose high-quality, balanced prey items that are suitable for their unique dietary needs. These could include live or frozen mice, rats, pigeons, or other small mammals, as well as fresh fruits and vegetables.
+
+Step 2: Feed your pet fox once or twice a day, depending on the species and its individual preferences. Always ensure that they have access to fresh water throughout the day.
+
+Step 3: Provide an appropriate environment for your pet fox. Ensure it has a comfortable place to rest, plenty of space to move around, and opportunities to play and exercise.
+
+Step 4: Socialize your pet with other animals if possible. Interactions with other creatures can help them develop social skills and prevent boredom or stress.
+
+Step 5: Regularly check for signs of illness or discomfort in your fox. Be prepared to provide veterinary care as needed, especially for common issues such as parasites, dental health problems, or infections.
+
+Step 6: Educate yourself about the needs of your pet fox and be aware of any potential risks or concerns that could affect their well-being. Regularly consult with a veterinarian to ensure you are providing the best care.
+
+...
+
+```
+
+Now you have successfully asked the Yi model a question and gotten an answer! 🥳
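+
+If you want to customize the generation, the tips above point to `./main -h` for the full list of options. As a hedged example (flag names may differ slightly between llama.cpp versions, so verify them with `./main -h` first), you can set the sampling temperature and repetition penalty explicitly; the values below are illustrative, not recommendations.
+
+```bash
+# Same model and prompt as above, with explicit sampling settings.
+./main -m /Users/yu/yi-chat-6B-GGUF/yi-chat-6b.Q2_K.gguf \
+  -p "How do you feed your pet fox? Please answer this question in 6 simple steps:\nStep 1:" \
+  -n 384 -e \
+  --temp 0.7 \
+  --repeat-penalty 1.1
+```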
+
+### Method 2: Perform inference in the web UI
+
+1. To start a lightweight and fast chatbot server, navigate to the `llama.cpp` directory, and run the following command.
+
+ ```bash
+ ./server --ctx-size 2048 --host 0.0.0.0 --n-gpu-layers 64 --model /Users/yu/yi-chat-6B-GGUF/yi-chat-6b.Q2_K.gguf
+ ```
+
+   Then you will see output similar to the following:
+
+
+ ```bash
+ ...
+
+ llama_new_context_with_model: n_ctx = 2048
+ llama_new_context_with_model: freq_base = 5000000.0
+ llama_new_context_with_model: freq_scale = 1
+ ggml_metal_init: allocating
+ ggml_metal_init: found device: Apple M2 Pro
+ ggml_metal_init: picking default device: Apple M2 Pro
+ ggml_metal_init: ggml.metallib not found, loading from source
+ ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
+ ggml_metal_init: loading '/Users/yu/llama.cpp/ggml-metal.metal'
+ ggml_metal_init: GPU name: Apple M2 Pro
+ ggml_metal_init: GPU family: MTLGPUFamilyApple8 (1008)
+ ggml_metal_init: hasUnifiedMemory = true
+ ggml_metal_init: recommendedMaxWorkingSetSize = 11453.25 MB
+ ggml_metal_init: maxTransferRate = built-in GPU
+ ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 128.00 MiB, ( 2629.44 / 10922.67)
+ llama_new_context_with_model: KV self size = 128.00 MiB, K (f16): 64.00 MiB, V (f16): 64.00 MiB
+ ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 0.02 MiB, ( 2629.45 / 10922.67)
+ llama_build_graph: non-view tensors processed: 676/676
+ llama_new_context_with_model: compute buffer total size = 159.19 MiB
+ ggml_backend_metal_buffer_type_alloc_buffer: allocated buffer, size = 156.02 MiB, ( 2785.45 / 10922.67)
+ Available slots:
+ -> Slot 0 - max context: 2048
+
+ llama server listening at http://0.0.0.0:8080
+ ```
+
+2. To access the chatbot interface, open your web browser and enter `http://localhost:8080` (the server listens on all interfaces at `0.0.0.0:8080`) into the address bar.
+
+   ![yi_llama_cpp1](../assets/img/yi_llama_cpp1.png)
+
+
+3. Enter a question, such as "How do you feed your pet fox? Please answer this question in 6 simple steps" into the prompt window, and you will receive a corresponding answer.
+
+   ![yi_llama_cpp2](../assets/img/yi_llama_cpp2.png)
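+
+Besides the browser UI, the server also exposes an HTTP completion endpoint, so you can query it directly from the command line. The sketch below assumes the server started in step 1 is still running and that your llama.cpp build serves the `/completion` route; check the llama.cpp server documentation if the request fields differ in your version.
+
+```bash
+# Send a completion request to the local llama.cpp server.
+curl http://localhost:8080/completion \
+  -H "Content-Type: application/json" \
+  -d '{
+    "prompt": "How do you feed your pet fox? Please answer this question in 6 simple steps:",
+    "n_predict": 256
+  }'
+```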
\ No newline at end of file