
I just installed raw llama.cpp to run codellama-7b-instruct.Q5_K_M.gguf. I started it with llama.cpp's server, but unfortunately it responds with really weird answers; it looks like it's trying to simulate its own conversation.

I have tried using templates like this one:

[Instruction: You are an expert assistant. Always provide direct, concise answers.]  

USER: What is 2 + 2?
ASSISTANT: 4.

USER: How to use console.log in JS?
ASSISTANT:

END OF CONVERSATION

But this works poorly, and I realize that I need a /chat endpoint instead of the /completion endpoint I'm currently using.

If anyone has some kind of "extension" to llama.cpp that provides a /chat endpoint, or knows another way to use /chat with llama.cpp, please let me know.

Excuse my English, I'm still learning.
If you need any other info, write in the comments!

1 Answer

It's most likely working poorly because you're not using the correct chat template for Code Llama, which is a slightly modified version of Llama 2's chat template:

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>
{{ user_message_1 }} [/INST] {{ model_answer_1 }} </s>
<s>[INST] {{ user_message_2 }} [/INST]
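
For example, filled in with the messages from the question, the prompt you'd send to the raw /completion endpoint would look roughly like this (a sketch; check your model card for the exact spacing around the tokens):

<s>[INST] <<SYS>>
You are an expert assistant. Always provide direct, concise answers.
<</SYS>>
What is 2 + 2? [/INST] 4. </s>
<s>[INST] How to use console.log in JS? [/INST]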

According to the llama.cpp README, the chat completion endpoint is already supported. There's no need for an "extension".

llama-server -m model.gguf --port 8080
# Basic web UI can be accessed via browser: http://localhost:8080
# Chat completion endpoint: http://localhost:8080/v1/chat/completions

If supported, the appropriate chat template will be selected for your model once it's loaded. If you wish to specify a particular template, you may do so with the --chat-template flag (e.g. llama-server -m codellama-7b-instruct.Q5_K_M.gguf -ngl 64 -c 0 --chat-template llama2).
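
As a quick sketch, once the server is up you can call the OpenAI-compatible chat endpoint with curl; the messages below just reuse the question's example:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are an expert assistant. Always provide direct, concise answers."},
          {"role": "user", "content": "How to use console.log in JS?"}
        ]
      }'

The server applies the chat template for you, so you send plain role/content messages instead of hand-formatting [INST] blocks.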

Some models use old or unusual chat templates. For those, you'd use the --jinja and --chat-template-file flags along with a custom Jinja chat template file (supported since b4524), or use the older completion API and parse the outputs accordingly.
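
For illustration only, a custom template file (named template.jinja here; the name is arbitrary) in a simplified Llama-2-like style might look like the following. It omits system-message handling and may need whitespace tuning, so treat it as a starting point and adapt it to whatever format your model was actually trained on:

{%- for message in messages %}
{%- if message['role'] == 'user' %}
<s>[INST] {{ message['content'] }} [/INST]
{%- elif message['role'] == 'assistant' %}
{{ message['content'] }} </s>
{%- endif %}
{%- endfor %}

You'd then start the server with something like:

llama-server -m model.gguf --jinja --chat-template-file template.jinja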


1 Comment

I wasn't aware that different models needed different templates. I thought there was some standard or something. Thanks for the explanation, and for your understanding; this is my first contact with AI.
