Deepseek-r1 1.5b does not work with Ramalama #616

Open
vpavlin opened this issue Jan 22, 2025 · 9 comments
@vpavlin

vpavlin commented Jan 22, 2025

Hey! Back to checking out ramalama:-) I tried to run deepseek-r1:1.5b, but got a failure:

(.venv) root@pi5:/home/vpavlin/ramalama# ramalama run deepseek-r1:1.5b
100% |████████████████...|    1.04 GB/   1.04 GB   1.13 MB/s        0s

> Hi                                                                                                                                                                                                                                                                                                                                                                  
failed to apply the chat template
@ericcurtin
Collaborator

@engelmi this looks like the jinja vs ollama template thing again
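
For reference, a quick way to see which chat template a pulled GGUF actually embeds, and whether it at least parses as Jinja, is something like the following minimal sketch (assuming the gguf and jinja2 Python packages, which are not part of ramalama; the model path is only an example):

# Sketch: dump the chat template embedded in a GGUF and check it parses as Jinja.
# Assumes `pip install gguf jinja2`; the path below is just an example of where
# an ollama-pulled model blob might live.
from gguf import GGUFReader
from jinja2 import Environment, TemplateSyntaxError

MODEL = "/var/lib/ramalama/models/ollama/deepseek-r1:1.5b"  # example path

reader = GGUFReader(MODEL)
field = reader.fields.get("tokenizer.chat_template")
if field is None:
    print("no tokenizer.chat_template key in this GGUF")
else:
    template = bytes(field.parts[field.data[0]]).decode("utf-8")
    print(template)
    try:
        Environment().parse(template)
        print("template parses as Jinja")
    except TemplateSyntaxError as err:
        print(f"template is not valid Jinja: {err}")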

@engelmi
Member

engelmi commented Jan 24, 2025

@ericcurtin I think so, yes. The model is being pulled from ollama; however, I can't reproduce the error:

$ ramalama --version
ramalama version 0.5.2

$ ramalama run deepseek-r1:1.5b
> hi                                                                                                                                                                                                                                                                                                                   
<think>

</think>

Hello! How can I assist you today? 😊
> 

@vpavlin Which version of ramalama are you using?

@jim3692
Collaborator

jim3692 commented Jan 26, 2025

I was also facing this issue, running ramalama from the flake, with llama.cpp b4397.

$ ramalama --version
ramalama version 0.5.2

$ ramalama --nocontainer run ollama://deepseek-r1:8b
Loading modelggml_vulkan: Compiling shaders..........................Done!
> hi
failed to apply the chat template

It's solved after updating to llama.cpp b4546.

@vpavlin
Author

vpavlin commented Jan 27, 2025

@engelmi

ramalama --version
ramalama version 0.5.2

@ericcurtin
Collaborator

So we need to merge this:

#630

and rebuild and repush the containers.

The AI world moves super fast: a new model gets released and all of a sudden the version of llama.cpp we are using in the containers is too old.

At the moment it's @rhatdan that typically rebuilds and repushes the container images. We are hoping we can make this more efficient in future with more automation.
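
In case it helps anyone, the manual step looks roughly like the following (only a sketch driven from Python for illustration; the Containerfile location is a guess and this is not the project's actual release tooling):

# Rough sketch of the manual rebuild-and-repush step described above.
# Assumes podman is installed, push access to quay.io/ramalama, and that the
# Containerfile lives under container-images/ramalama in a ramalama checkout
# (an assumption, not verified here).
import subprocess

IMAGE = "quay.io/ramalama/ramalama:latest"
CONTEXT = "container-images/ramalama"  # hypothetical path in the repo

subprocess.run(["podman", "build", "-t", IMAGE, CONTEXT], check=True)
subprocess.run(["podman", "push", IMAGE], check=True)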

@vpavlin
Author

vpavlin commented Jan 28, 2025

It seems to work with 0.5.3, although I seem to only be getting the closing </think> tag?

> Can you tell me what ramalama is? (be brief)
Yes, I can explain it. Ramalama is a South African girl, known for her acting and singing skills, and is recognized for her role in the TV show "Chewy."
</think>

Ramalama is a South African actress and singer, best known for her role in the TV series "Chewy."
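
For anyone who needs to clean this up downstream, here is a small post-processing sketch (not part of ramalama or llama.cpp, just an illustration) that strips <think>...</think> blocks and tolerates a missing opening tag:

import re

def strip_think(text: str) -> str:
    # Drop a complete <think>...</think> block if one is present.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    # Some outputs carry only the closing tag, as above; drop everything up
    # to and including that stray </think>.
    if "</think>" in text:
        text = text.split("</think>", 1)[1]
    return text.strip()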

@chrootchad

chrootchad commented Feb 2, 2025

Same issue here with deepseek-r1:7b

Operating System: Ubuntu 24.04.1 LTS              
Kernel: Linux 6.12.0-1004-asahi-arm
Architecture: arm64
ramalama -v
ramalama version 0.5.5
podman -v
podman version 4.9.3

ramalama --debug run ollama://deepseek-r1:7b
output a couple of events that looked to be of interest:

llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
...
llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 338 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead

complete output:

ramalama --debug run ollama://deepseek-r1:7b
run_cmd:  podman inspect quay.io/ramalama/ramalama:0.5
Working directory: None
Ignore stderr: False
Ignore all: True
exec_cmd:  podman run --rm -i --label RAMALAMA --security-opt=label=disable --name ramalama_YlzJygFOl5 --pull=newer -t --device /dev/dri --mount=type=bind,src=/var/lib/ramalama/models/ollama/deepseek-r1:7b,destination=/mnt/models/model.file,ro quay.io/ramalama/ramalama:latest llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file
Loading modelllama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /mnt/models/model.file (version GGUF V3 (latest))                                                                    
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 7B
llama_model_loader: - kv   3:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   4:                         general.size_label str              = 7B
llama_model_loader: - kv   5:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   6:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   7:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   8:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   9:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  10:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  11:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                          general.file_type u32              = 15
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
llm_load_vocab: control token: 151660 '<|fim_middle|>' is not marked as EOG
llm_load_vocab: control token: 151659 '<|fim_prefix|>' is not marked as EOG
llm_load_vocab: control token: 151653 '<|vision_end|>' is not marked as EOG
llm_load_vocab: control token: 151645 '<|Assistant|>' is not marked as EOG
llm_load_vocab: control token: 151644 '<|User|>' is not marked as EOG
llm_load_vocab: control token: 151655 '<|image_pad|>' is not marked as EOG
llm_load_vocab: control token: 151651 '<|quad_end|>' is not marked as EOG
llm_load_vocab: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
llm_load_vocab: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
llm_load_vocab: control token: 151652 '<|vision_start|>' is not marked as EOG
llm_load_vocab: control token: 151647 '<|EOT|>' is not marked as EOG
llm_load_vocab: control token: 151654 '<|vision_pad|>' is not marked as EOG
llm_load_vocab: control token: 151656 '<|video_pad|>' is not marked as EOG
llm_load_vocab: control token: 151661 '<|fim_suffix|>' is not marked as EOG
llm_load_vocab: control token: 151650 '<|quad_start|>' is not marked as EOG
**llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect**
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 4.36 GiB (4.91 BPW) 
llm_load_print_meta: general.name     = DeepSeek R1 Distill Qwen 7B
llm_load_print_meta: BOS token        = 151646 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|end▁of▁sentence|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
**llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 338 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead**
llm_load_tensors:   CPU_Mapped model buffer size =  4460.45 MiB
....................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 2048
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init: layer 0: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 1: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 2: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 3: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 4: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 5: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 6: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 7: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 8: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 9: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 10: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 11: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 12: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 13: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 14: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 15: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 16: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 17: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 18: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 19: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 20: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 21: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 22: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 23: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 24: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 25: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 26: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init: layer 27: n_embd_k_gqa = 512, n_embd_v_gqa = 512
llama_kv_cache_init:        CPU KV buffer size =   112.00 MiB
llama_new_context_with_model: KV self size  =  112.00 MiB, K (f16):   56.00 MiB, V (f16):   56.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =   304.00 MiB
llama_new_context_with_model: graph nodes  = 986
llama_new_context_with_model: graph splits = 1
> hello                                                                                                                                                                                                             
failed to apply the chat template

@ericcurtin
Collaborator

Try and update your container image if you can.

@chrootchad

chrootchad commented Feb 2, 2025

Thanks @ericcurtin, I ran

ramalama --gpu --debug --image quay.io/ramalama/asahi:0.5.4 run ollama://deepseek-r1:7b

which forced the latest asahi image, and it worked (including the GPU support I couldn't get working previously with other models when the quay.io/ramalama/ramalama:latest image was being used).

It seems to be an Ubuntu Asahi quirk: it looks like this check:

# ASAHI CASE
if os.path.exists('/etc/os-release'):
    with open('/etc/os-release', 'r') as file:
        if "asahi" in file.read().lower():
            # Set Env Var and break
            os.environ["ASAHI_VISIBLE_DEVICES"] = "1"
            return

isn't setting ASAHI_VISIBLE_DEVICES=1, because on Ubuntu Asahi the os-release file

cat /etc/os-release
PRETTY_NAME="Ubuntu 24.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04.1 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo

Doesn't contain "asahi" anywhere in the output, this seems to though

hostnamectl | grep asahi
          Kernel: Linux 6.12.0-1004-asahi-arm
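
A possible tweak (just a sketch of the idea, not a tested patch; the function name is made up) would be to also look at the running kernel release, which does carry the asahi tag:

import os
import platform

def set_asahi_env():
    # Sketch: treat the system as Asahi if either /etc/os-release mentions it
    # (e.g. Fedora Asahi Remix) or the kernel release does (Ubuntu Asahi ships
    # kernels like "6.12.0-1004-asahi-arm"), then expose the GPU.
    release_info = ""
    if os.path.exists('/etc/os-release'):
        with open('/etc/os-release', 'r') as file:
            release_info = file.read().lower()
    if "asahi" in release_info or "asahi" in platform.release().lower():
        os.environ["ASAHI_VISIBLE_DEVICES"] = "1"
        return True
    return False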

Thanks again for your help.
