I was trying to modify the TensorRT-LLM code to implement something specific for my model that is not supported yet, and I got quite confused by the number of buffers that exist. Can you please explain how they are structured?
Hey Alex, the buffer management should be clearer in recent versions. However, I would recommend looking at the PyTorch backend, which is the default since v1.0.
Correct.
Correct. These have been refactored into DecoderState. There is only one instance of it.
The idea here is that inputs and outputs are only valid for a specific iteration. There can be multiple batches, each with their own inputs and outputs.
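
Roughly, the split looks like this. A minimal C++ sketch of the ownership model described above; all types here are illustrative placeholders, not the actual TensorRT-LLM classes:

```cpp
#include <vector>

// Placeholder for a device tensor; TensorRT-LLM uses its own buffer classes.
struct Tensor { std::vector<float> data; };

// Persistent across the whole run: a single instance holds per-slot decoder
// state (e.g. sequence lengths, finished flags) for every active request.
struct DecoderState {
    std::vector<Tensor> sequenceLengths; // indexed by batch slot
    std::vector<Tensor> finishedFlags;   // indexed by batch slot
};

// Valid for a single iteration only: set up fresh (or reused from a pool)
// each time a batch of requests is scheduled.
struct IterationBuffers {
    Tensor inputIds;
    Tensor logits;
};

int main() {
    DecoderState decoderState;        // one instance for the whole engine
    for (int step = 0; step < 3; ++step) {
        IterationBuffers buffers;     // inputs/outputs live for one iteration
        // ... fill buffers.inputIds, run the engine, read buffers.logits ...
        (void)buffers;
    }
    (void)decoderState;
}
```

The point of the split is that the per-iteration buffers can be rebuilt or swapped for every scheduled batch, while the single DecoderState survives across iterations.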
Not sure. I think they are needed to store additional information that is not present in decoder-only models.
The req slot / batch slot is an identifier that maps a request to a specific resource slot. The slot is persistent for the whole execution of the request.
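
For illustration, here is a minimal sketch of how such a request-to-slot mapping could work. SlotManager and all names below are hypothetical, not the actual TensorRT-LLM API:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

using RequestId = std::uint64_t;
using SlotId = std::int32_t;

class SlotManager {
public:
    explicit SlotManager(SlotId maxSlots) {
        for (SlotId s = maxSlots - 1; s >= 0; --s) freeSlots_.push_back(s);
    }

    // Assign a free slot when the request is first scheduled.
    std::optional<SlotId> acquire(RequestId id) {
        if (freeSlots_.empty()) return std::nullopt; // no capacity left
        SlotId slot = freeSlots_.back();
        freeSlots_.pop_back();
        slotOf_[id] = slot;
        return slot;
    }

    // The same slot is returned for every iteration of the request's lifetime.
    SlotId lookup(RequestId id) const { return slotOf_.at(id); }

    // The slot is released only when the request completes.
    void release(RequestId id) {
        auto it = slotOf_.find(id);
        freeSlots_.push_back(it->second);
        slotOf_.erase(it);
    }

private:
    std::vector<SlotId> freeSlots_;
    std::unordered_map<RequestId, SlotId> slotOf_;
};

int main() {
    SlotManager manager(2);
    auto slot = manager.acquire(42); // assigned once, at scheduling time
    // ... every iteration uses manager.lookup(42) to index per-slot buffers ...
    (void)slot;
    manager.release(42);             // freed only when the request finishes
}
```

Because lookup returns the same slot until release is called, per-slot buffers (such as those held in the decoder state) can be indexed consistently for the whole lifetime of the request.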
Multiple batches (each with their own buffers) are used in …
@Funatiq Hi, just noticed your reply. What do you mean by the PyTorch backend? I remember there were two options, the Python runtime and the C++ runtime, and the reason I was looking into the C++ runtime was continuous batching support and the ability to use the C++ runtime via Triton Server.