Improves trained model autoscaling docs. (#2857) (#2858)

mergify[bot] · szabosteve · web-flow · commit 92e3c782ef2c · 2024-10-17T12:57:43.000+02:00
(cherry picked from commit 4e34d07) Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
diff --git a/docs/en/stack/ml/nlp/ml-nlp-autoscaling.asciidoc b/docs/en/stack/ml/nlp/ml-nlp-autoscaling.asciidoc
@@ -2,14 +2,14 @@
 = Trained model autoscaling
 
 You can enable autoscaling for each of your trained model deployments.
-Autoscaling allows {es} to automatically adjust the resources the deployment can use based on the workload demand.
+Autoscaling allows {es} to automatically adjust the resources the model deployment can use based on the workload demand.
 
 There are two ways to enable autoscaling:
 
 * through APIs by enabling adaptive allocations
 * in {kib} by enabling adaptive resources
 
-IMPORTANT: To fully leverage model autoscaling, it is highly recommended to enable {cloud}/ec-autoscaling.html[deployment autoscaling].
+IMPORTANT: To fully leverage model autoscaling, it is highly recommended to enable {cloud}/ec-autoscaling.html[{es} deployment autoscaling].
 
 
 [discrete]
@@ -25,6 +25,7 @@ This can help you to manage performance and cost more easily.
 When adaptive allocations are enabled, the number of allocations of the model is set automatically based on the current load.
 When the load is high, a new model allocation is automatically created.
 When the load is low, a model allocation is automatically removed.
+You must explicitly set the minimum and maximum number of allocations; autoscaling will occur within these limits.
 
 You can enable adaptive allocations by using:
 
@@ -35,7 +36,7 @@ If the new allocations fit on the current {ml} nodes, they are immediately start
 If more resource capacity is needed for creating new model allocations, then your {ml} node will be scaled up if {ml} autoscaling is enabled to provide enough resources for the new allocation.
 The number of model allocations can be scaled down to 0.
 They cannot be scaled up to more than 32 allocations, unless you explicitly set the maximum number of allocations to more.
-Adaptive allocations must be set up independently for each deployment and {infer} endpoint.
+Adaptive allocations must be set up independently for each deployment and {ref}/put-inference-api.html[{infer} endpoint].
 
 
 [discrete]
@@ -62,7 +63,8 @@ When adaptive resources are enabled, the number of vCPUs that the model deployme
 When the load is high, the number of vCPUs that the process can use is automatically increased.
 When the load is low, the number of vCPUs that the process can use is automatically decreased.
 
-You can choose from three levels of resource usage for your trained model deployment.
+You can choose from three levels of resource usage for your trained model deployment; autoscaling will occur within the selected level's range.
+
 Refer to the tables in the <<auto-scaling-matrix>> section to find out the settings for the level you selected.
 
 
@@ -78,13 +80,14 @@ The used resources for trained model deployments depend on three factors:
 
 * your cluster environment (Serverless, Cloud, or on-premises)
 * the use case you optimize the model deployment for (ingest or search)
-* whether adaptive resources are enabled or disabled (dynamic or static resources)
+* whether model autoscaling (adaptive allocations or adaptive resources) is enabled, which gives dynamic resources, or disabled, which gives static resources
 
 If you use {es} on-premises, vCPUs level ranges are derived from the `total_ml_processors` and `max_single_ml_node_processors` values.
 Use the {ref}/get-ml-info.html[get {ml} info API] to check these values.
 The following tables show you the number of allocations, threads, and vCPUs available in Cloud when adaptive resources are enabled or disabled.
 
-NOTE: For Observability and Security projects on Serverless, adaptive allocations are automatically enabled, and the "Adaptive resources" control is not displayed in {kib}.
+NOTE: On Serverless, adaptive allocations are automatically enabled for all project types.
+However, the "Adaptive resources" control is not displayed in {kib} for Observability and Security projects.
 
 
 [discrete]
diff --git a/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc b/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
@@ -459,10 +459,10 @@ To gain the biggest value out of ELSER trained models, consider to follow this l
 * Setting `min_allocations` to `0` can save on costs for non-critical use cases or testing environments.
 * Enabling <<ml-nlp-auto-scale,autoscaling>> through adaptive allocations or adaptive resources makes it possible for {es} to scale up or down the available resources of your ELSER deployment based on the load on the process.
 
-* Use two ELSER {infer} endpoints: one optimized for ingest and one optimized for search.
-** In {kib}, you can select for which case you want to optimize your ELSER deployment.
-** If you use the {infer} API and want to optimize your ELSER endpoint for ingest, set the number of threads to `1` (`"num_threads": 1`).
-** If you use the {infer} API and want to optimize your ELSER endpoint for search, set the number of threads to greater than `1`.
+* Use dedicated, optimized ELSER {infer} endpoints for ingest and search use cases.
+** When deploying a trained model in {kib}, you can select for which case you want to optimize your ELSER deployment.
+** If you use the trained model or {infer} APIs and want to optimize your ELSER trained model deployment or {infer} endpoint for ingest, set the number of threads to `1` (`"num_threads": 1`).
+** If you use the trained model or {infer} APIs and want to optimize your ELSER trained model deployment or {infer} endpoint for search, set the number of threads to greater than `1`.
 
 
 [discrete]
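
Reviewer note: the behavior this change documents (explicit minimum and maximum limits for adaptive allocations, and `num_threads` chosen per use case) can be sketched as below. This is a minimal illustration, not the {es} implementation. The field names `adaptive_allocations`, `min_number_of_allocations`, and `max_number_of_allocations` are assumed from the {infer} API, the helper function names are hypothetical, and the search thread count of `4` is an arbitrary value greater than `1`.

```python
from typing import Any, Dict


def adaptive_allocations_settings(min_allocations: int, max_allocations: int) -> Dict[str, Any]:
    """Build an adaptive allocations object with explicit limits (assumed field names)."""
    if min_allocations < 0:
        raise ValueError("minimum number of allocations cannot be negative")
    if max_allocations < min_allocations:
        raise ValueError("maximum number of allocations must be >= the minimum")
    return {
        "enabled": True,
        "min_number_of_allocations": min_allocations,
        "max_number_of_allocations": max_allocations,
    }


def clamp_allocations(desired: int, settings: Dict[str, Any]) -> int:
    """Illustrate that autoscaling only moves within the configured limits."""
    low = settings["min_number_of_allocations"]
    high = settings["max_number_of_allocations"]
    return max(low, min(desired, high))


def elser_endpoint_body(use_case: str, min_allocations: int, max_allocations: int) -> Dict[str, Any]:
    """Hypothetical helper: build a request body for an ELSER endpoint."""
    if use_case == "ingest":
        num_threads = 1  # the docs recommend exactly 1 thread for ingest
    elif use_case == "search":
        num_threads = 4  # illustrative only; the docs only say "greater than 1"
    else:
        raise ValueError("use_case must be 'ingest' or 'search'")
    return {
        "service": "elser",
        "service_settings": {
            "adaptive_allocations": adaptive_allocations_settings(
                min_allocations, max_allocations
            ),
            "num_threads": num_threads,
        },
    }
```

For example, `elser_endpoint_body("ingest", 0, 8)` yields a body whose allocations may scale down to 0 and up to 8. A demand for 12 allocations would be capped at 8 by `clamp_allocations`.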