README.md: 50 additions & 1 deletion
@@ -218,7 +218,7 @@ The following table shows the fields that may need to be modified before deployment:
|`max_beam_width`| Optional (default=1). The maximum beam width that any request may ask for when using beam search.|
|`max_tokens_in_paged_kv_cache`| Optional (default=unspecified). The maximum size of the KV cache in number of tokens. If unspecified, the value is interpreted as 'infinite'. The KV cache allocation is the minimum of `max_tokens_in_paged_kv_cache` and the value derived from `kv_cache_free_gpu_mem_fraction` below.|
|`max_attention_window_size`| Optional (default=max_sequence_length). When using techniques like sliding window attention, the maximum number of tokens that are attended to in order to generate one token. By default, all tokens in the sequence are attended to.|
-|`kv_cache_free_gpu_mem_fraction`| Optional (default=0.85). Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for the KV cache.|
+|`kv_cache_free_gpu_mem_fraction`| Optional (default=0.9). Set to a number between 0 and 1 to indicate the maximum fraction of GPU memory (after loading the model) that may be used for the KV cache.|
|`max_num_sequences`| Optional (default=`max_batch_size` if `enable_trt_overlap` is `false` and `2 * max_batch_size` if `enable_trt_overlap` is `true`, where `max_batch_size` is the TRT engine maximum batch size). Maximum number of sequences that the in-flight batching scheme can maintain state for.|
|`enable_trt_overlap`| Optional (default=`true`). Set to `true` to partition available requests into 2 'microbatches' that can be run concurrently to hide exposed CPU runtime.|
|`exclude_input_in_output`| Optional (default=`false`). Set to `true` to return only completion tokens in a response. Set to `false` to return the prompt tokens concatenated with the generated tokens.|
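
For reference, these fields are typically set as string-valued `parameters` entries in the `tensorrt_llm` model's `config.pbtxt`. A minimal sketch of how two of them might look (the values below are placeholders, not recommendations):

```
# Placeholder: cap the KV cache at 4096 tokens; leave unset for 'infinite'
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: "4096"
  }
}
# Placeholder: allow up to 90% of free GPU memory (after model load) for the KV cache
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.9"
  }
}
```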
@@ -346,6 +346,7 @@ He was a member of the French Academy of Sciences and the French Academy of Arts
Soyer was a member of the French Academy of Sciences and
```

+#### Early stopping
You can also stop the generation process early by using the `--stop-after-ms`
option to send a stop request after a few milliseconds:
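
As a hedged sketch of what such a request could look like with the in-flight batching client: only `--stop-after-ms` comes from the text above; the script path, `--request-output-len`, and the tokenizer path are assumptions for illustration and may differ between versions.

```bash
# Ask for up to 200 output tokens, but send a stop request after 200 ms
# (only --stop-after-ms is taken from the text above; the other flags are assumptions)
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --stop-after-ms 200 \
    --request-output-len 200 \
    --tokenizer-dir /path/to/tokenizer
```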
@@ -357,6 +358,54 @@ You will find that the generation process is stopped early and therefore the
number of generated tokens is lower than 200. You can have a look at the
client code to see how early stopping is achieved.
+If you want to get context logits and/or generation logits, you need to enable `--gather_context_logits` and/or `--gather_generation_logits` when building the engine (or `--gather_all_token_logits` to enable both at the same time). For more details about these two flags, please refer to [build.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/gpt/build.py) or [gpt_runtime](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/gpt_runtime.md).
+
+After launching the server, you can retrieve the logits by passing the corresponding parameters `--return-context-logits` and/or `--return-generation-logits` to the client scripts (`end_to_end_grpc_client.py` and `inflight_batcher_llm_client.py`). For example:
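
A hedged sketch of the two steps: the gather/return flags are the ones named above, while the remaining build and client options (model paths, output length) are placeholders rather than the exact commands from this README.

```bash
# 1. Build the engine with context and generation logits gathering enabled
#    (paths and the other build options are placeholders; see build.py for the full list)
python3 build.py --model_dir /path/to/hf_model \
                 --output_dir /path/to/engines \
                 --gather_context_logits \
                 --gather_generation_logits

# 2. Query the deployed model and request the logits in the response
#    (flags other than the two --return-* options are illustrative)
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --request-output-len 20 \
    --return-context-logits \
    --return-generation-logits
```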