strange text duplication from llama-server to llama-cpp-agent #86

Open
rpdrewes opened this issue Dec 28, 2024 · 2 comments

Comments

@rpdrewes

rpdrewes commented Dec 28, 2024

I am getting occasional duplicated text in responses using llama-cpp-agent to talk to a llama-server on a remote host. This does not seem to be a token repetition issue that might be solved with repeat-penalty. This seems to be a disagreement between client and server about when a data "chunk" is complete. It looks like this (Q: is the query sent to the server, A: is the answer):

Q:What is the tallest mountain in Europe? Be brief.
A:Mount Elbrus, located in the Caucas Caucasus range, Russia Russia, is the tallest mountain in Europe, with a a height of 5,642 meters (18,510 feet).

Note the duplication of "Caucas Caucasus" and "Russia Russia," in the response!

Looking at verbose output on the server side (llama-server -v) you can see that the server is actually sending " Caucas" in one message, followed by " Caucasus", and then later " Russia" immediately followed by " Russia," with a comma after it:

data stream, to_send: data: {"index":0,"content":"\n\n","tokens":[271],"stop":false,"id_slot":-1,"tokens_predicted":1,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":"Mount","tokens":[16683],"stop":false,"id_slot":-1,"tokens_predicted":2,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" El","tokens":[4072],"stop":false,"id_slot":-1,"tokens_predicted":3,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":"br","tokens":[1347],"stop":false,"id_slot":-1,"tokens_predicted":4,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":"us","tokens":[355],"stop":false,"id_slot":-1,"tokens_predicted":5,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":",","tokens":[11],"stop":false,"id_slot":-1,"tokens_predicted":6,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" located","tokens":[7559],"stop":false,"id_slot":-1,"tokens_predicted":7,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" in","tokens":[304],"stop":false,"id_slot":-1,"tokens_predicted":8,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" the","tokens":[279],"stop":false,"id_slot":-1,"tokens_predicted":9,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" Caucas","tokens":[60532],"stop":false,"id_slot":-1,"tokens_predicted":10,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" Caucasus","tokens":[355],"stop":false,"id_slot":-1,"tokens_predicted":11,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" range","tokens":[2134],"stop":false,"id_slot":-1,"tokens_predicted":12,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":",","tokens":[11],"stop":false,"id_slot":-1,"tokens_predicted":13,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" Russia","tokens":[8524],"stop":false,"id_slot":-1,"tokens_predicted":14,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" Russia,","tokens":[11],"stop":false,"id_slot":-1,"tokens_predicted":15,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" is","tokens":[374],"stop":false,"id_slot":-1,"tokens_predicted":16,"tokens_evaluated":31}
data stream, to_send: data: {"index":0,"content":" the","tokens":[279],"stop":false,"id_slot":-1,"tokens_predicted":17,"tokens_evaluated":31}
...

It is as if the server expects the client to know it should not emit the first " Russia" because it is superseded by the more complete next transmission " Russia,".
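
To make the symptom concrete, here is a minimal Python sketch that replays a few of the `to_send` lines from the log above and joins their `content` fields the way a simple streaming client would. The helper names (`chunk_contents`, `join_naive`, `join_superseding`) are made up for illustration, and the "superseding" rule is only a guess at what a client would need to do to hide the duplication. It is not a documented llama-server protocol.

```python
import json

# "to_send" payloads copied from the verbose server log above,
# with the "data stream, to_send: " prefix removed.
LOG_LINES = [
    'data: {"index":0,"content":" the","tokens":[279],"stop":false,"id_slot":-1,"tokens_predicted":9,"tokens_evaluated":31}',
    'data: {"index":0,"content":" Caucas","tokens":[60532],"stop":false,"id_slot":-1,"tokens_predicted":10,"tokens_evaluated":31}',
    'data: {"index":0,"content":" Caucasus","tokens":[355],"stop":false,"id_slot":-1,"tokens_predicted":11,"tokens_evaluated":31}',
    'data: {"index":0,"content":" range","tokens":[2134],"stop":false,"id_slot":-1,"tokens_predicted":12,"tokens_evaluated":31}',
    'data: {"index":0,"content":",","tokens":[11],"stop":false,"id_slot":-1,"tokens_predicted":13,"tokens_evaluated":31}',
    'data: {"index":0,"content":" Russia","tokens":[8524],"stop":false,"id_slot":-1,"tokens_predicted":14,"tokens_evaluated":31}',
    'data: {"index":0,"content":" Russia,","tokens":[11],"stop":false,"id_slot":-1,"tokens_predicted":15,"tokens_evaluated":31}',
    'data: {"index":0,"content":" is","tokens":[374],"stop":false,"id_slot":-1,"tokens_predicted":16,"tokens_evaluated":31}',
]


def chunk_contents(lines):
    """Return the "content" field of every `data: {...}` line."""
    prefix = "data: "
    return [json.loads(line[len(prefix):])["content"]
            for line in lines if line.startswith(prefix)]


def join_naive(chunks):
    """What a plain streaming client does: append every chunk as it arrives."""
    return "".join(chunks)


def join_superseding(chunks):
    """Hypothetical heuristic: if a chunk starts with the previous chunk,
    treat the previous chunk as superseded and replace it."""
    out = []
    for chunk in chunks:
        if out and out[-1] and chunk.startswith(out[-1]):
            out[-1] = chunk
        else:
            out.append(chunk)
    return "".join(out)


chunks = chunk_contents(LOG_LINES)
print(repr(join_naive(chunks)))
print(repr(join_superseding(chunks)))
```

The naive join reproduces the duplicated text (`' the Caucas Caucasus range, Russia Russia, is'`), while the heuristic yields `' the Caucasus range, Russia, is'`. The heuristic would also swallow legitimate repeats (for example a chunk `"ha"` followed by `"haha"`), so it is useful for diagnosing the problem rather than as a fix.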

The above test is with the agent established using:

agent = LlamaCppAgent(provider, predefined_messages_formatter_type=MessagesFormatterType.LLAMA_3)

The llama-server is indeed using a llama3.2 model, so I think that is the correct FormatterType.

However, if instead I set up the agent not specifying any MessagesFormatterType, then I do not see the duplications in text coming from the server! (But there are other problems as you might expect, like <|im_end|> appearing in the response text, because there is not agreement between client and server on the end of message indication, presumably.) Surprisingly, with the (incorrect) default FormatterType, the server does not e.g. send " Caucas" followed by " Caucasus". It sends " Caucas" then "us". It is not the case that the client is treating the response differently--the server does not send duplicate data with the default FormatterType. So, there must be something different in the setup of the two chats that prevents the server from sending these duplications in this second case. I have looked a bit at the chat setup in the server logs and I have some ideas but if anyone knows what is going on here or how to fix it, please save me some time!

@Maximilian-Winter
Owner

@rpdrewes Will look into this issue, thanks!

@rpdrewes
Author

rpdrewes commented Jan 3, 2025

I narrowed this duplicate text problem down to the following section of code in llama-server in process_token() in examples/server/server.cpp:

            ...
            if (stop_pos != std::string::npos) {
                slot.generated_text.erase(
                    slot.generated_text.begin() + pos + stop_pos,
                    slot.generated_text.end());
                pos = std::min(slot.n_sent_text, slot.generated_text.size());
            } else if (slot.has_next_token) {
                stop_pos = slot.find_stopping_strings(str_test, token_str.size(), false);
                send_text = stop_pos == std::string::npos;
            } else {
            ...

The duplicate transmission happens when send_text is set to false in the else if case. For some reason the already-gathered text (e.g. " Caucas") is still sent to the llama-cpp-agent client, yet it is not cleared from the current text buffer on the server. When the next token is gathered from the LLM (e.g. "us"), the stored data is resent along with the new text, so the client receives e.g. " Caucas Caucasus".
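
The behaviour described above can be mimicked with a small toy model in Python. This is a sketch of the hypothesis only, not llama-server's actual code: `stream_chunks`, `hold_flags`, and `leak_held_text` are hypothetical names, and `hold_flags` simply marks the steps where the server would set send_text to false. The model fits one detail of the verbose log: the message carrying " Caucasus" reports token id 355, the same id that earlier carried "us", and the message carrying " Russia," reports token id 11, the id of ",".

```python
def stream_chunks(pieces, hold_flags, leak_held_text):
    """Toy model of a streaming send loop with a hold-back buffer.

    pieces         -- text produced by each generated token
    hold_flags     -- True where the text is held back (send_text == false)
    leak_held_text -- if True, held text is sent to the client anyway,
                      but the "already sent" offset is not advanced
    """
    generated = ""  # everything generated so far
    n_sent = 0      # how much of `generated` counts as already sent
    for piece, hold in zip(pieces, hold_flags):
        generated += piece
        pending = generated[n_sent:]
        if hold:
            if leak_held_text:
                yield pending  # reaches the client, yet stays buffered
            continue
        yield pending
        n_sent = len(generated)


pieces = [" the", " Caucas", "us", " range"]
holds = [False, True, False, False]

print(list(stream_chunks(pieces, holds, leak_held_text=True)))
print(list(stream_chunks(pieces, holds, leak_held_text=False)))
```

With `leak_held_text=True` the client sees `[' the', ' Caucas', ' Caucasus', ' range']`, matching the log. With `leak_held_text=False` the held text is delivered only once, as `[' the', ' Caucasus', ' range']`. Whether the extra message comes from the server or from how the client handles an empty or held chunk is not established here.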

I'm not sure if this is a bug with llama-server or llama-cpp-agent, both, or neither. The problem does not occur with some FormatterType selections from llama-cpp-agent because, I believe, the stopping word is set differently in the initial setup of the agent query, so the above situation with send_text = 0 does not occur.

However, I recently switched my program from llama-cpp-agent to using the basic OpenAI Python interface directly, which I think llama-cpp-agent relies on behind the scenes. This OpenAI library works for me with no issues and seems to automatically use the correct stopping words and end tokens and whatnot. The OpenAI Python library was fairly simple to put in place of llama-cpp-agent, with the major difference being how the streaming response is gathered (iterating over a generator with OpenAI versus passing a callback function with llama-cpp-agent).
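
For anyone making the same switch, a minimal sketch of generator-style streaming with the OpenAI Python client might look like the following. It assumes llama-server is exposing its OpenAI-compatible endpoint at `http://localhost:8080/v1`; the `base_url`, the placeholder `api_key`, the `model` name, and the helper names `collect_stream` and `ask_server` are all assumptions for illustration, not taken from the setup described above.

```python
def collect_stream(chunks):
    """Yield the text deltas from an iterable of chat-completion stream chunks."""
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices
        delta = chunk.choices[0].delta.content
        if delta:  # skip None or empty deltas (e.g. the role-only first chunk)
            yield delta


def ask_server(question, base_url="http://localhost:8080/v1", model="local-model"):
    """Stream an answer from an OpenAI-compatible server (assumed settings)."""
    # Imported lazily so collect_stream can be used without the package installed.
    from openai import OpenAI

    # Assumption: the server is started without an API key, so any
    # placeholder string is accepted here.
    client = OpenAI(base_url=base_url, api_key="not-needed")
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        stream=True,
    )
    yield from collect_stream(stream)
```

A caller would then write something like `for text in ask_server("What is the tallest mountain in Europe? Be brief."): print(text, end="", flush=True)`. A callback-based design can be layered on top by calling the callback inside that loop.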

Now that I am using (llama-server <-> OpenAI Python client) and it is working well, this issue no longer affects me and I don't plan to pursue fixing this issue with (llama-server <-> llama-cpp-agent) any further.


2 participants