strange text duplication from llama-server to llama-cpp-agent #86

rpdrewes · 2024-12-28T20:00:28Z

I am getting occasional duplicated text in responses using llama-cpp-agent to talk to a llama-server on a remote host. This does not seem to be a token repetition issue that might be solved with repeat-penalty. This seems to be a disagreement between client and server about when a data "chunk" is complete. It looks like this (Q: is the query sent to the server, A: is the answer):

Q:What is the tallest mountain in Europe? Be brief.
A:Mount Elbrus, located in the Caucas Caucasus range, Russia Russia, is the tallest mountain in Europe, with a a height of 5,642 meters (18,510 feet).

Note the duplication of "Caucas Caucasus" and "Russia Russia," in the response!

Looking at verbose output on the server side (llama-server -v) you can see that the server is actually sending " Caucas" in one message, followed by " Caucasus", and then later " Russia" immediately followed by " Russia," with a comma after it:

data stream, to_send: data: {"index":0,"content":"\n\n","tokens":[271],"stop":false,"id_slot":-1,"tokens_predicted":1,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":"Mount","tokens":[16683],"stop":false,"id_slot":-1,"tokens_predicted":2,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" El","tokens":[4072],"stop":false,"id_slot":-1,"tokens_predicted":3,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":"br","tokens":[1347],"stop":false,"id_slot":-1,"tokens_predicted":4,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":"us","tokens":[355],"stop":false,"id_slot":-1,"tokens_predicted":5,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":",","tokens":[11],"stop":false,"id_slot":-1,"tokens_predicted":6,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" located","tokens":[7559],"stop":false,"id_slot":-1,"tokens_predicted":7,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" in","tokens":[304],"stop":false,"id_slot":-1,"tokens_predicted":8,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" the","tokens":[279],"stop":false,"id_slot":-1,"tokens_predicted":9,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" Caucas","tokens":[60532],"stop":false,"id_slot":-1,"tokens_predicted":10,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" Caucasus","tokens":[355],"stop":false,"id_slot":-1,"tokens_predicted":11,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" range","tokens":[2134],"stop":false,"id_slot":-1,"tokens_predicted":12,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":",","tokens":[11],"stop":false,"id_slot":-1,"tokens_predicted":13,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" Russia","tokens":[8524],"stop":false,"id_slot":-1,"tokens_predicted":14,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" Russia,","tokens":[11],"stop":false,"id_slot":-1,"tokens_predicted":15,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" is","tokens":[374],"stop":false,"id_slot":-1,"tokens_predicted":16,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" the","tokens":[279],"stop":false,"id_slot":-1,"tokens_predicted":17,"tokens_evaluated":31}
...

It is as if the server expects the client to know it should not emit the first " Russia" because it is superseded by the more complete next transmission " Russia,".

The above test is with the agent established using:

agent = LlamaCppAgent(provider, predefined_messages_formatter_type=MessagesFormatterType.LLAMA_3)

The llama-server is indeed using a llama3.2 model, so I think that is the correct FormatterType.

However, if instead I set up the agent not specifying any MessagesFormatterType, then I do not see the duplications in text coming from the server! (But there are other problems as you might expect, like <|im_end|> appearing in the response text, because there is not agreement between client and server on the end of message indication, presumably.) Surprisingly, with the (incorrect) default FormatterType, the server does not e.g. send " Caucas" followed by " Caucasus". It sends " Caucas" then "us". It is not the case that the client is treating the response differently--the server does not send duplicate data with the default FormatterType. So, there must be something different in the setup of the two chats that prevents the server from sending these duplications in this second case. I have looked a bit at the chat setup in the server logs and I have some ideas but if anyone knows what is going on here or how to fix it, please save me some time!

The text was updated successfully, but these errors were encountered:

Maximilian-Winter · 2024-12-30T19:12:00Z

@rpdrewes Will look into this issue, thank!

rpdrewes · 2025-01-03T15:41:26Z

I narrowed this duplicate text problem down to the following section of code in llama-server in process_token() in examples/server/server.cpp:

            ...
            if (stop_pos != std::string::npos) {
                slot.generated_text.erase(
                    slot.generated_text.begin() + pos + stop_pos,
                    slot.generated_text.end());
                pos = std::min(slot.n_sent_text, slot.generated_text.size());
            } else if (slot.has_next_token) {
                stop_pos = slot.find_stopping_strings(str_test, token_str.size(), false);
                send_text = stop_pos == std::string::npos;
            } else {
            ...

The duplicate transmission happens when send_text is set to 0 in the else case. For some reason the already-gathered text (e.g. " Caucas") is still sent to the llama-cpp-agent client, yet not cleared from the current text buffer on the server, and then when the next token is gathered from LLM (e.g. "us") the stored data is resent along with the new text so the client receives e.g. "Caucas Caucasus".

I'm not sure if this is a bug with llama-server or llama-cpp-agent, both, or neither. The problem does not occur with some FormatterType selections from llama-cpp-agent because, I believe, the stoppingword is set differently in the initial setup of the agent query and the above situation with send_text = 0 does not occur.

However I recently switched my program from llama-cpp-agent to using the basic OpenAI Python interface directly, which I think llama-cpp-agent relies on behind the scenes. This OpenAI library works for me with no issues and seems to automatically use the correct stopping words and end tokens and whatnot. The OpenAI python library was fairly simple to put in place of llama-cpp-agent, with the major difference being how the streaming response is gathered (a yield type continuation with OpenAI and a callback function with llama-cpp-agent).

Now that I am using (llama-server <-> OpenAI Python client) and it is working well, this issue no longer affects me and I don't plan to pursue fixing this issue with (llama-server <-> llama-cpp-agent) any further.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

strange text duplication from llama-server to llama-cpp-agent #86

strange text duplication from llama-server to llama-cpp-agent #86

rpdrewes commented Dec 28, 2024 •

edited

Loading

Maximilian-Winter commented Dec 30, 2024

rpdrewes commented Jan 3, 2025

strange text duplication from llama-server to llama-cpp-agent #86

strange text duplication from llama-server to llama-cpp-agent #86

Comments

rpdrewes commented Dec 28, 2024 • edited Loading

Maximilian-Winter commented Dec 30, 2024

rpdrewes commented Jan 3, 2025

rpdrewes commented Dec 28, 2024 •

edited

Loading