[[ml-nlp-auto-scale]]
= Trained model autoscaling

You can enable autoscaling for each of your trained model deployments.
Autoscaling allows {es} to automatically adjust the resources the deployment can use based on the workload demand.

There are two ways to enable autoscaling:

* through APIs by enabling adaptive allocations
* in {kib} by enabling adaptive resources

IMPORTANT: To fully leverage model autoscaling, it is highly recommended to enable {cloud}/ec-autoscaling.html[deployment autoscaling].


[discrete]
[[nlp-model-adaptive-allocations]]
== Enabling autoscaling through APIs - adaptive allocations

Model allocations are independent units of work for NLP tasks.
If you set the number of threads and allocations for a model manually, they remain constant even when not all the available resources are fully used or when the load on the model requires more resources.
Instead of setting the number of allocations manually, you can enable adaptive allocations to set the number of allocations based on the load on the process.
This can help you to manage performance and cost more easily.
(Refer to the https://cloud.elastic.co/pricing[pricing calculator] to learn more about the possible costs.)

When adaptive allocations are enabled, the number of allocations of the model is set automatically based on the current load.
When the load is high, a new model allocation is automatically created.
When the load is low, a model allocation is automatically removed.

You can enable adaptive allocations by using:

* the create inference endpoint API for {ref}/infer-service-elser.html[ELSER], {ref}/infer-service-elasticsearch.html[E5 and models uploaded through Eland] that are used as {infer} services.
* the {ref}/start-trained-model-deployment.html[start trained model deployment] or {ref}/update-trained-model-deployment.html[update trained model deployment] APIs for trained models that are deployed on {ml} nodes.
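
For example, the following create inference endpoint request enables adaptive allocations for an ELSER endpoint. This is a sketch: the endpoint name `my-elser-endpoint` and the allocation limits are placeholders to adapt to your own deployment.

[source,console]
----
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 4
    },
    "num_threads": 1
  }
}
----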

If the new allocations fit on the current {ml} nodes, they are immediately started.
If more resource capacity is needed for creating new model allocations and {ml} autoscaling is enabled, your {ml} node is scaled up to provide enough resources for the new allocations.
The number of model allocations can be scaled down to 0.
It cannot be scaled up to more than 32 allocations, unless you explicitly set a higher maximum number of allocations.
Adaptive allocations must be set up independently for each deployment and {infer} endpoint.
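
For example, you can turn on adaptive allocations for an existing deployment on {ml} nodes with the update trained model deployment API. The deployment ID and the allocation limits below are illustrative; setting `max_number_of_allocations` above 32 explicitly raises the default ceiling.

[source,console]
----
POST _ml/trained_models/my_elser_model/deployment/_update
{
  "adaptive_allocations": {
    "enabled": true,
    "min_number_of_allocations": 0,
    "max_number_of_allocations": 64
  }
}
----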


[discrete]
[[optimize-use-case]]
=== Optimizing for typical use cases

You can optimize your model deployment for typical use cases, such as search and ingest.
When you optimize for ingest, the throughput is higher, which increases the number of {infer} requests that can be performed in parallel.
When you optimize for search, the latency is lower during search processes.

* If you want to optimize for ingest, set the number of threads to `1` (`"threads_per_allocation": 1`).
* If you want to optimize for search, set the number of threads to greater than `1`.
Increasing the number of threads makes the search processes more performant.
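
As a sketch, a search-optimized deployment can be started with a higher thread count through the start trained model deployment API. The model ID, deployment ID, and allocation limits are placeholders; the thread count must fit on a single {ml} node.

[source,console]
----
POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_search&threads_per_allocation=8
{
  "adaptive_allocations": {
    "enabled": true,
    "min_number_of_allocations": 1,
    "max_number_of_allocations": 4
  }
}
----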


[discrete]
[[nlp-model-adaptive-resources]]
== Enabling autoscaling in {kib} - adaptive resources

You can enable adaptive resources for your models when starting or updating the model deployment.
Adaptive resources make it possible for {es} to scale up or down the available resources based on the load on the process.
This can help you to manage performance and cost more easily.
When adaptive resources are enabled, the number of vCPUs that the model deployment uses is set automatically based on the current load.
When the load is high, the number of vCPUs that the process can use is automatically increased.
When the load is low, the number of vCPUs that the process can use is automatically decreased.

You can choose from three levels of resource usage for your trained model deployment.
Refer to the tables in the <<auto-scaling-matrix>> section to find out the settings for the level you selected.


[role="screenshot"]
image::images/ml-nlp-deployment-id-elser-v2.png["ELSER deployment with adaptive resources enabled.",width=640]


[discrete]
[[auto-scaling-matrix]]
== Model deployment resource matrix

The resources used by trained model deployments depend on three factors:

* your cluster environment (Serverless, Cloud, or on-premises)
* the use case you optimize the model deployment for (ingest or search)
* whether adaptive resources are enabled or disabled (dynamic or static resources)

If you use {es} on-premises, the vCPU level ranges are derived from the `total_ml_processors` and `max_single_ml_node_processors` values.
Use the {ref}/get-ml-info.html[get {ml} info API] to check these values.
The following tables show you the number of allocations, threads, and vCPUs available in Cloud when adaptive resources are enabled or disabled.
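
On a self-managed cluster, you can retrieve these values with:

[source,console]
----
GET _ml/info
----

The `total_ml_processors` and `max_single_ml_node_processors` values appear in the `limits` section of the response.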

NOTE: For Observability and Security projects on Serverless, adaptive allocations are automatically enabled, and the "Adaptive resources" control is not displayed in {kib}.


[discrete]
=== Deployments in Cloud optimized for ingest

For ingest-optimized deployments, the number of model allocations is maximized.


[discrete]
==== Adaptive resources enabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 0 to 2 if available, dynamically | 1 | 0 to 2 if available, dynamically
| Medium | 1 to 32, dynamically | 1 | 1 to the smaller of 32 or the limit set in the Cloud console, dynamically
| High | 1 to the limit set in the Cloud console ^*^, dynamically | 1 | 1 to the limit set in the Cloud console, dynamically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

[discrete]
==== Adaptive resources disabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 2 if available, otherwise 1, statically | 1 | 2 if available
| Medium | The smaller of 32 or the limit set in the Cloud console, statically | 1 | 32 if available
| High | Maximum available set in the Cloud console ^*^, statically | 1 | Maximum available set in the Cloud console, statically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

[discrete]
=== Deployments in Cloud optimized for search

For search-optimized deployments, the number of threads is maximized.
The maximum number of threads that can be claimed depends on your hardware architecture.

[discrete]
==== Adaptive resources enabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 1 | 2 | 2
| Medium | 1 to 2 (if threads=16), dynamically | Maximum that the hardware allows (for example, 16) | 1 to 32, dynamically
| High | 1 to the limit set in the Cloud console ^*^, dynamically | Maximum that the hardware allows (for example, 16) | 1 to the limit set in the Cloud console, dynamically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

[discrete]
==== Adaptive resources disabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 1 if available, statically | 2 | 2 if available
| Medium | 2 (if threads=16), statically | Maximum that the hardware allows (for example, 16) | 32 if available
| High | Maximum available set in the Cloud console ^*^, statically | Maximum that the hardware allows (for example, 16) | Maximum available set in the Cloud console, statically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.