feat: Nonstreaming API #85
Conversation
@JNeuvonen hey sorry about the slow review on my end, I've been pretty busy with summer chores/errands and also other work xD... Also was investigating #62 and why the upstream llama metal doesn't seem to work on Mac anymore :d..... Will get to this by Wednesday. Is it ok for me to cook it up a bit if I find something wrong/missing, or would you prefer just comment and you can take care of it? LMK what type of feedback is cool for you :)
Thanks, no problem at all, totally understandable. Comment & let me figure it out would be the preferred form of feedback, but if it's a very simple change you can do it yourself as well.
```rust
});

HttpResponse::Ok()
    .append_header(("Content-Type", "text/plain"))
```
We should return the application/json content type here instead, I think; it helps the client know to do JSON chunk parsing as needed based on that header.
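A minimal sketch of what that change might look like, assuming the response is built with actix-web's HttpResponse builder as in the hunk above (the helper name is made up for illustration):

```rust
use actix_web::HttpResponse;

// Hypothetical helper: return the collected completion as JSON so clients can
// key their chunk/body parsing off the Content-Type header.
fn completion_response(body: String) -> HttpResponse {
    HttpResponse::Ok()
        .append_header(("Content-Type", "application/json"))
        .body(body)
}
```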
I think that makes sense, yes. Will fix those. Thanks for looking at my code.
```rust
    tx: Some(tx),
});

rx.recv().unwrap();
```
We should match on the error and return an HTTP error here IMO, otherwise it would be hard to triage :d
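A sketch of the suggested handling, assuming rx is the flume::Receiver<()> created earlier in this handler (the helper name and error message are illustrative):

```rust
use actix_web::HttpResponse;

// Illustrative helper: surface a recv failure as an HTTP 500 instead of panicking,
// so a dropped or crashed inference thread is visible to the caller.
fn wait_for_inference(rx: &flume::Receiver<()>) -> Result<(), HttpResponse> {
    match rx.recv() {
        Ok(()) => Ok(()),
        Err(e) => Err(HttpResponse::InternalServerError()
            .body(format!("inference thread disconnected before completing: {e}"))),
    }
}
```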
```rust
} else {
    if let Some(tx) = req.tx {
        // Tell server thread that inference completed, and let it respond
        let _ = tx.send(());
```
Do we need that _ or can we just call send here?
```rust
println!("Feeding prompt ...");
req.send_event("FEEDING_PROMPT");

if stream_enabled {
```
Can we do this check at the trait level instead? That way we can unify the interface call (in this file) and handle the stream/non-stream logic at the trait implementation level, which would make it much nicer and more cohesive :)
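A rough sketch of that idea under assumed names (TokenSink, handle_token, and finish are not from the PR): the call site in process.rs talks to one interface, and each implementation decides whether to stream or buffer.

```rust
// Trait over "where generated tokens go"; the caller no longer branches on stream_enabled.
trait TokenSink {
    fn handle_token(&mut self, token: &str);
    fn finish(self);
}

// Streaming: forward each token to the HTTP response as it is produced.
struct StreamingSink {
    token_sender: flume::Sender<String>,
}

impl TokenSink for StreamingSink {
    fn handle_token(&mut self, token: &str) {
        let _ = self.token_sender.send(token.to_string());
    }
    fn finish(self) {}
}

// Non-streaming: buffer everything and hand it back in one message at the end.
struct BufferingSink {
    buffer: String,
    done: flume::Sender<String>,
}

impl TokenSink for BufferingSink {
    fn handle_token(&mut self, token: &str) {
        self.buffer.push_str(token);
    }
    fn finish(self) {
        let _ = self.done.send(self.buffer);
    }
}
```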
```rust
pub model_guard: ModelGuard,
pub completion_request: CompletionRequest,
pub nonstream_completion_tokens: Arc<Mutex<String>>,
```
I think we can make this private if we use it as trait state for the non-stream feature. Making it pub would allow other code to inspect it while it's being written to / locked, which could potentially deadlock the Mutex writer if we're not careful... :d
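A small sketch of keeping the buffer encapsulated (names are illustrative, not the PR's): only short, self-contained lock scopes exist inside the type, so outside code never holds the Mutex across generation.

```rust
use std::sync::{Arc, Mutex};

// Hypothetical wrapper: the Arc<Mutex<String>> stays private.
pub struct NonStreamBuffer {
    completion_tokens: Arc<Mutex<String>>,
}

impl NonStreamBuffer {
    // Append one generated token; the lock is released as soon as the call returns.
    pub fn push(&self, token: &str) {
        self.completion_tokens.lock().unwrap().push_str(token);
    }

    // Take a copy of the finished text once inference is done.
    pub fn snapshot(&self) -> String {
        self.completion_tokens.lock().unwrap().clone()
    }
}
```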
```rust
} else {
    let abort_flag = Arc::new(RwLock::new(false));
    let completion_tokens = Arc::new(Mutex::new(String::new()));
    let (tx, rx) = flume::unbounded::<()>();
```
I wonder if we can make the token_sender generic so that we can reuse that argument. The token_sender and the tx serve a very similar function here; we just need to reconcile the Byte/String type. That'd make for a nicer interface, I think.
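One possible shape for that, sketched with assumed names: make the request generic over the payload its sender carries, so the streaming (bytes) and non-streaming (final string) paths reuse the same field.

```rust
// Illustrative only: a request whose sender is generic over its payload type.
pub struct InferenceThreadRequest<T: Send + 'static> {
    pub token_sender: flume::Sender<T>,
    // ... model_guard, abort_flag, completion_request, etc. elided ...
}

// Streaming path: chunks are forwarded to the HTTP response as they arrive.
pub type StreamingRequest = InferenceThreadRequest<actix_web::web::Bytes>;

// Non-streaming path: a single String is sent once when inference completes.
pub type BlockingRequest = InferenceThreadRequest<String>;
```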
```rust
        }),
    )
})

if let Some(true) = payload.stream {
```
This should be payload.0.stream I think, since it's a JSON.
If we can reconcile our trait above, we can infer the stream boolean via the completion_request as well, skipping a couple of lookup hoops!
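For illustration only (the handler signature and struct below are stand-ins, not the PR's code): `payload` is the actix-web JSON extractor and `.0` reaches the deserialized request inside it.

```rust
use actix_web::{web, HttpResponse};
use serde::Deserialize;

// Minimal stand-in for the real CompletionRequest; only the relevant field.
#[derive(Deserialize)]
struct CompletionRequest {
    stream: Option<bool>,
}

async fn completions(payload: web::Json<CompletionRequest>) -> HttpResponse {
    // `.0` is the inner CompletionRequest wrapped by the Json extractor.
    let stream_enabled = matches!(payload.0.stream, Some(true));
    if stream_enabled {
        // streaming branch ...
    }
    HttpResponse::Ok().finish()
}
```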
louisgv left a comment
The overall idea is great thus far; added some comments and ideas for improvement 👍
```rust
start(InferenceThreadRequest {
    model_guard: model_guard.clone(),
    abort_flag: abort_flag.clone(),
    token_sender,
    completion_request: payload.0,
    nonstream_completion_tokens: str_buffer.clone(),
    stream: true,
    tx: None,
}),
```
I have this idea which I think would make this nicer: we can create the InferenceThreadRequest before the isStream check, since it's non-blocking state. We can then do

```rust
let request = InferenceThreadRequest {
    model_guard: model_guard.clone(),
    abort_flag: abort_flag.clone(),
    token_sender,
    completion_request: payload.0,
    nonstream_completion_tokens: str_buffer.clone(),
};

if request.is_stream() {
    // streaming path
} else {
    // non-streaming path
}
```

and is_stream is a trait public method we expose via InferenceThreadRequest, which basically returns completion_request.stream.
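A possible sketch of that helper, assuming completion_request.stream is an Option<bool> as in the handler above (shown as an inherent method; the exact trait placement is up to the PR):

```rust
impl InferenceThreadRequest {
    // True only when the client explicitly asked for a streamed response.
    pub fn is_stream(&self) -> bool {
        matches!(self.completion_request.stream, Some(true))
    }
}
```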
I really like your attention to detail and design thinking! I will try to implement this one, I agree, it is indeed cleaner.
@JNeuvonen invited you as a repo collaborator
@JNeuvonen lmk if you're still able to update the PR - otherwise I can get on it sometime next week!
Hey, I apologize that I didn't come back earlier. Back when I was working on this, I was on summer vacation; now I am back on my work schedule and have less time & focus. Please feel free to finish the feature.
The implementation uses the same `start` function inside `process.rs` for multithreading, but instead of sending server events back to the request sender on every new token, it collects the tokens into a string buffer.

Currently, there is no client-side implementation, so merging should not affect the client side at all. Next, we could open an issue for the client-side implementation as well.
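Pieced together from the hunks in this review, a rough sketch of the non-streaming flow (the glue code and helper name are assumptions; names like completion_tokens and tx mirror the diff):

```rust
use std::sync::{Arc, Mutex, RwLock};
use actix_web::HttpResponse;

// Rough sketch only: spawn inference, buffer tokens, respond once at the end.
fn nonstream_completion(/* model_guard, payload, ... */) -> HttpResponse {
    let abort_flag = Arc::new(RwLock::new(false));
    let completion_tokens = Arc::new(Mutex::new(String::new()));
    let (tx, rx) = flume::unbounded::<()>();

    // The inference thread appends each generated token to `completion_tokens`
    // instead of emitting a server event, then signals completion over `tx`:
    // start(InferenceThreadRequest { nonstream_completion_tokens: completion_tokens.clone(),
    //                                abort_flag: abort_flag.clone(), tx: Some(tx), ... });

    // Block until the inference thread reports it is done, then respond once.
    if rx.recv().is_err() {
        return HttpResponse::InternalServerError().body("inference thread exited early");
    }
    let body = completion_tokens.lock().unwrap().clone();
    HttpResponse::Ok()
        .append_header(("Content-Type", "application/json"))
        .body(body)
}
```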
Here is a request body for quickly testing the API (stream flag is false):
{"sampler":"top-p-top-k","prompt":"AI: Greeting! I am a friendly AI assistant. Feel free to ask me anything.\nHuman: Hello world\nAI: ","max_tokens":200,"temperature":1,"seed":147,"frequency_penalty":0.6,"presence_penalty":0,"top_k":42,"top_p":1,"stop":["AI: ","Human: "],"stream":false}Issue