The following table shows the fields that may need to be modified before deployment (a minimal `config.pbtxt` example follows the table):
|`kv_cache_free_gpu_mem_fraction`| Optional (default=0.9). Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for KV cache.|
|`enable_trt_overlap`| Optional (default=`false`). Set to `true` to partition available requests into 2 'microbatches' that can be run concurrently to hide exposed CPU runtime |
|`exclude_input_in_output`| Optional (default=`false`). Set to `true` to only return completion tokens in a response. Set to `false` to return the prompt tokens concatenated with the generated tokens |
|`cancellation_check_period_ms`| Optional (default=100). The time, in milliseconds, that the cancellation check thread sleeps before performing the next check. The thread checks whether any of the currently active requests have been cancelled through Triton and prevents their further execution. |
|`iter_stats_max_iterations`| Optional (default=executor::kDefaultIterStatsMaxIterations). The number of iteration stats to keep. |
|`request_stats_max_iterations`| Optional (default=executor::kDefaultRequestStatsMaxIterations). The number of request stats to keep. |
|`normalize_log_probs`| Optional (default=`true`). Set to `false` to skip normalization of `output_log_probs`|
|`enable_chunked_context`| Optional (default=`false`). Set to `true` to enable context chunking. |
|`gpu_device_ids`| Optional (default=unspecified). Comma-separated list of GPU IDs to use for this model. If not provided, the model will use all visible GPUs. |
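
These fields are set as `parameters` entries in the model's `config.pbtxt`. Below is a minimal sketch of how a few of them might be specified, assuming the standard Triton `parameters` syntax; the keys come from the table above, and the values shown are only documented defaults or illustrative examples, not recommendations.

```
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"   # documented default: use at most 90% of remaining GPU memory for KV cache
  }
}
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"  # return only the generated tokens, not the prompt
  }
}
parameters: {
  key: "cancellation_check_period_ms"
  value: {
    string_value: "100"   # documented default: check for cancelled requests every 100 ms
  }
}
parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "0,1"   # illustrative value: restrict this model to GPUs 0 and 1
  }
}
```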