Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop LLM output on user request? #47

Open
woheller69 opened this issue May 5, 2024 · 11 comments
Open

Stop LLM output on user request? #47

woheller69 opened this issue May 5, 2024 · 11 comments

Comments

@woheller69
Copy link
Contributor

Is there a way to stop inference manually? E.g. such as by returning FALSE to the streaming_callback?
If the user presses the stop button in a UI how could that be handled?

@Maximilian-Winter
Copy link
Owner

I'm not sure how to do it properply in llama_cpp_python but it should be possible. Will add this ASAP

@pabl-o-ce
Copy link
Collaborator

pabl-o-ce commented May 6, 2024

Is possible to use break keyword or if you use request you can also have a signal control to control finish the request

@woheller69
Copy link
Contributor Author

it is not about a keyword. If a long text is generated and it goes the wrong direction I want to stop it without losing the context by killing the process.
The python bindings of gpt4all e.g. of have a callback similar to streaming_callback. If True is returned, it continues, if False is returned it stops. In this callback I can check if a button has been pressed an then send True/False.

@woheller69
Copy link
Contributor Author

I need this for a local model, just in case this makes a difference

@woheller69
Copy link
Contributor Author

woheller69 commented May 6, 2024

It seems there is a PR for llama-cpp-python regarding this: https://github.com/abetlen/llama-cpp-python/pull/733/files

Add cancel() method to interrupt a stream

But they do not want to merge it

There is also an issue:
abetlen/llama-cpp-python#599

@pabl-o-ce
Copy link
Collaborator

call me a mad man but I just use like this example to end the inference

for chunk in llm.stream_chat(chat_template):
    if cancel_flag is True:
        break

@woheller69
Copy link
Contributor Author

Doesn't that just break the for loop but the llm continues to stream?

Currently I have:

    llama_cpp_agent.get_chat_response(
        user_input, 
        temperature=0.7, 
        top_k=40, 
        top_p=0.4,
        repeat_penalty=1.18, 
        repeat_last_n=64, 
        max_tokens=2000,
        stream=True,
        print_output=False,
        streaming_callback=streaming_callback
    )

And in the streaming_callback I am printing the tokens as they come. Ideally this callback could return True/False to continue/stop

@pabl-o-ce
Copy link
Collaborator

let me create some test for this

@woheller69
Copy link
Contributor Author

In case there is no "clean" solution via llama_cpp_python, I found a solution using a thread_with_exception as in my code https://github.com/woheller69/LLAMA_TK_CHAT/

It starts inference in a separate tread and stops it by raising an exception. But that way the partial answer is not added to chat history (I am doing this later using add_message(...) in my code) because I am having llama_cpp_agent.get_chat_response(...) in this thread.
It certainly would be better if that was realized INSIDE llama_agent.py, maybe in get_chat_response(...) or get_response_role_and_completion(...) such that the partial answer can still be added to history.

If my code doesn't look great, this is because I have no clue about Python :-)

@jewser
Copy link

jewser commented May 11, 2024

For those interested, here is an minimal adaptation of @woheller69's workaround:

from llama_cpp import Llama
import threading
import sys

# https://github.com/woheller69/LLAMA_TK_CHAT/blob/main/LLAMA_TK_GUI.py
class thread_with_exception(threading.Thread):
    def __init__(self, name, callback):
        threading.Thread.__init__(self)
        self.name = name
        self.callback = callback

    def run(self):
        self.callback()

    def get_id(self):
        # returns id of the respective thread
        if hasattr(self, '_thread_id'):
            return self._thread_id
        for id, thread in threading._active.items():
            if thread is self:
                return id

    def raise_exception(self):
        thread_id = self.get_id()
        if thread_id != None:
            res = ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread_id), ctypes.py_object(SystemExit))
            if res > 1:
                ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread_id), 0)

llm = Llama(
    model_path="../../llama.cpp/models/Meta-Llama-3-8B/ggml-model-f16.gguf",
    n_gpu_layers=-1,
    lora_path="../../llama.cpp/models/test/my_lora_1350.bin",
    n_ctx=1024,
)

def generate(prompt):
    for chunk in llm(
        ''.join(prompt),
        max_tokens=100,
        stop=["."],
        echo=False,
        stream=True,
    ):
        yield chunk["choices"][0]["text"]

def inference_callback():
    prompt = "juicing is the act of "

    print(prompt,end='')
    sys.stdout.flush()
    for chunk in generate([prompt]):
        print(chunk,end='')
        sys.stdout.flush()
    print()

inference_thread = thread_with_exception("InferenceThread", inference_callback)
inference_thread.start()

import time
try:
    for i in range(20):
        time.sleep(0.5)
    print("done normally")
except KeyboardInterrupt:
    inference_thread.raise_exception()
    inference_thread.join()
    print("interrupted")

Here we have an inference thread that may be interrupted by the main thread which is busy doing something else (presumably listening as a webserver or a gui window or something), though in this case it is just sleeping for 10 seconds.

@42PAL
Copy link

42PAL commented May 16, 2024

Using LM Studio to run the models works for me. I often stop the generation, edit the AI mistakes and steer it in the direction I want, save the changes and then have it continue generating. This seems to work for me on all models I have tried while using LM Studio App.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants