Stop LLM output on user request? #47
Comments
I'm not sure how to do it properly in llama_cpp_python, but it should be possible. Will add this ASAP |
Is it possible to use a keyword? |
It is not about a keyword. If a long text is being generated and it goes in the wrong direction, I want to be able to stop it without killing the process and losing the context. |
I need this for a local model, just in case this makes a difference |
It seems there is a PR for llama-cpp-python regarding this: https://github.com/abetlen/llama-cpp-python/pull/733/files ("Add cancel() method to interrupt a stream"), but they do not want to merge it. There is also an issue: |
Call me a madman, but I just use something like this example to end the inference:

for chunk in llm.stream_chat(chat_template):
    if cancel_flag is True:
        break |
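(A minimal, runnable sketch of that flag-and-break pattern, written here against llama-cpp-python's plain streaming call rather than stream_chat; the model path is a placeholder and cancel_flag is assumed to be set from another thread, e.g. a UI button handler.)

import threading
from llama_cpp import Llama

llm = Llama(model_path="path/to/model.gguf", n_ctx=1024)  # placeholder path
cancel_flag = threading.Event()  # another thread calls cancel_flag.set() to request a stop

def generate(prompt):
    # the completion stream is a lazy generator, so breaking out of the loop
    # should also stop requesting further tokens from the model
    for chunk in llm(prompt, max_tokens=256, stream=True):
        if cancel_flag.is_set():
            break
        yield chunk["choices"][0]["text"]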
Doesn't that just break the for loop while the LLM continues to stream? Currently I have:
And in the streaming_callback I am printing the tokens as they come in. Ideally this callback could return True/False to continue/stop |
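(A sketch of what such a stop-aware callback could look like, written as a plain wrapper around any chunk iterator; stream_with_callback is a hypothetical helper, not part of llama-cpp-agent's current API.)

from typing import Callable, Iterable

def stream_with_callback(chunks: Iterable[str], streaming_callback: Callable[[str], bool]) -> str:
    # feed each streamed chunk to the callback; a return value of False requests a stop
    collected = []
    for text in chunks:
        collected.append(text)
        if streaming_callback(text) is False:
            break
    return "".join(collected)  # the partial answer, which could later be stored via add_message(...)

# usage sketch: print tokens until some externally set flag asks us to stop
# answer = stream_with_callback(generate("my prompt"), lambda t: print(t, end="") or not stop_requested)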
Let me create some tests for this |
In case there is no "clean" solution via llama_cpp_python, I found a workaround: it starts inference in a separate thread and stops it by raising an exception. But that way the partial answer is not added to the chat history (I am doing this later using add_message(...) in my code) because I have llama_cpp_agent.get_chat_response(...) running in this thread. If my code doesn't look great, it is because I have no clue about Python :-) |
For those interested, here is a minimal adaptation of @woheller69's workaround:

# Adapted from https://github.com/woheller69/LLAMA_TK_CHAT/blob/main/LLAMA_TK_GUI.py
from llama_cpp import Llama
import threading
import ctypes  # needed for PyThreadState_SetAsyncExc below
import sys

class thread_with_exception(threading.Thread):
    def __init__(self, name, callback):
        threading.Thread.__init__(self)
        self.name = name
        self.callback = callback

    def run(self):
        self.callback()

    def get_id(self):
        # returns id of the respective thread
        if hasattr(self, '_thread_id'):
            return self._thread_id
        for id, thread in threading._active.items():
            if thread is self:
                return id

    def raise_exception(self):
        # asynchronously raise SystemExit inside this thread
        thread_id = self.get_id()
        if thread_id is not None:
            res = ctypes.pythonapi.PyThreadState_SetAsyncExc(
                ctypes.c_long(thread_id), ctypes.py_object(SystemExit))
            if res > 1:
                # more than one thread state was affected: undo the request
                ctypes.pythonapi.PyThreadState_SetAsyncExc(ctypes.c_long(thread_id), 0)

llm = Llama(
    model_path="../../llama.cpp/models/Meta-Llama-3-8B/ggml-model-f16.gguf",
    n_gpu_layers=-1,
    lora_path="../../llama.cpp/models/test/my_lora_1350.bin",
    n_ctx=1024,
)

def generate(prompt):
    for chunk in llm(
        ''.join(prompt),
        max_tokens=100,
        stop=["."],
        echo=False,
        stream=True,
    ):
        yield chunk["choices"][0]["text"]

def inference_callback():
    prompt = "juicing is the act of "
    print(prompt, end='')
    sys.stdout.flush()
    for chunk in generate([prompt]):
        print(chunk, end='')
        sys.stdout.flush()
    print()

inference_thread = thread_with_exception("InferenceThread", inference_callback)
inference_thread.start()

import time
try:
    # main thread stays busy (here: sleeping ~10 s) while the inference thread streams
    for i in range(20):
        time.sleep(0.5)
    print("done normally")
except KeyboardInterrupt:
    inference_thread.raise_exception()
    inference_thread.join()
    print("interrupted")

Here we have an inference thread that may be interrupted by the main thread, which is busy doing something else (presumably listening as a web server or running a GUI window), though in this case it is just sleeping for 10 seconds. |
Using LM Studio to run the models works for me. I often stop the generation, edit the AI's mistakes to steer it in the direction I want, save the changes, and then have it continue generating. This has worked on all models I have tried in the LM Studio app. |
Is there a way to stop inference manually, e.g. by returning False from the streaming_callback?
If the user presses a stop button in a UI, how could that be handled?