Description
I built TritonServer based on the 2.50.0 tag and ran a local model inference service. After starting the service, I launched another script that repeatedly sends a fixed prompt to this service and monitored TritonServer’s RSS memory. As shown in the chart, the RSS keeps increasing over time. Even after I stop sending prompt requests, the memory is not reclaimed. In a real production environment, this leads to my container running out of memory.

I’d like to know whether TritonServer’s continuously increasing RSS memory is expected behavior. If it is expected, there should theoretically be an upper bound after which memory is reclaimed. How is this reclamation limit determined? Is it based on the operating system’s MemTotal?
I also captured jeprof SVG files for the same environment under both low-request and high-request conditions.
Triton Information
2.50.0. I also tried the latest version and observed the same behavior.
Are you using the Triton container or did you build it yourself?
I built it myself: the executables are built in a Dockerfile directly from the unmodified source code.
To Reproduce
Steps to reproduce the behavior.
- Build tritonserver from the 2.50.0 tag and start a model inference service (locally we launch a proprietary, non-open-source model).
- Start a client and send prompt requests to the service.
- Use the command ps -C tritonserver -o pid,comm,lstart,rss to obtain the tritonserver process’s RSS memory usage.
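To make the RSS trend easier to compare across runs, the ps command above can be sampled in a loop. The following is a minimal sketch, not the script used for the chart in this report: the function names (parse_rss_kib, sample_rss_kib, growth_kib_per_min, monitor) and the default interval are hypothetical, and it assumes a Linux host with procps ps, which reports RSS in KiB.

```python
import subprocess
import time


def parse_rss_kib(ps_output: str) -> int:
    """Sum the RSS values (KiB) printed by `ps -o rss=`, one per matching process."""
    return sum(int(token) for token in ps_output.split())


def sample_rss_kib(process_name: str = "tritonserver") -> int:
    """Return the total RSS in KiB of all processes with the given command name."""
    # `rss=` prints the RSS column without a header line.
    result = subprocess.run(
        ["ps", "-C", process_name, "-o", "rss="],
        capture_output=True,
        text=True,
        check=False,  # ps exits non-zero when no process matches
    )
    return parse_rss_kib(result.stdout)


def growth_kib_per_min(samples) -> float:
    """Average growth rate between the first and last (timestamp_s, rss_kib) sample."""
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (rss1 - rss0) / ((t1 - t0) / 60.0)


def monitor(interval_s: float = 10.0, count: int = 360):
    """Print `timestamp,rss_kib` lines and return the collected samples."""
    samples = []
    for _ in range(count):
        sample = (time.time(), sample_rss_kib())
        samples.append(sample)
        print(f"{sample[0]:.0f},{sample[1]}")
        time.sleep(interval_s)
    return samples
```

For example, monitor(interval_s=10, count=360) records one hour of samples, and passing the result to growth_kib_per_min gives a single number to compare between the low-request and high-request runs, or between the load phase and the idle phase after requests stop.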
Expected behavior
The RSS remains at a safe level, or there is a way for tritonserver’s RSS memory to be reclaimed.