
Commit 3546447

Authored by mergify[bot], elasticmachine, darnautov, and szabosteve
[DOCS] Documents trained model auto-scaling (#2795) (#2852)
Co-authored-by: Elastic Machine <[email protected]>
Co-authored-by: Dima Arnautov <[email protected]>
Co-authored-by: István Zoltán Szabó <[email protected]>
1 parent aa6e0b3 commit 3546447

9 files changed: +194 -95 lines changed

docs/en/stack/ml/nlp/index.asciidoc

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ include::ml-nlp-extract-info.asciidoc[leveloffset=+2]
 include::ml-nlp-classify-text.asciidoc[leveloffset=+2]
 include::ml-nlp-search-compare.asciidoc[leveloffset=+2]
 include::ml-nlp-deploy-models.asciidoc[leveloffset=+1]
+include::ml-nlp-autoscaling.asciidoc[leveloffset=+1]
 include::ml-nlp-inference.asciidoc[leveloffset=+1]
 include::ml-nlp-apis.asciidoc[leveloffset=+1]
 include::ml-nlp-built-in-models.asciidoc[leveloffset=+1]
docs/en/stack/ml/nlp/ml-nlp-autoscaling.asciidoc (new file)

Lines changed: 156 additions & 0 deletions
@@ -0,0 +1,156 @@
[[ml-nlp-auto-scale]]
= Trained model autoscaling

You can enable autoscaling for each of your trained model deployments.
Autoscaling allows {es} to automatically adjust the resources the deployment can use based on workload demand.

There are two ways to enable autoscaling:

* through APIs by enabling adaptive allocations
* in {kib} by enabling adaptive resources

IMPORTANT: To fully leverage model autoscaling, it is highly recommended to enable {cloud}/ec-autoscaling.html[deployment autoscaling].


[discrete]
[[nlp-model-adaptive-allocations]]
== Enabling autoscaling through APIs - adaptive allocations

Model allocations are independent units of work for NLP tasks.
If you set the number of threads and allocations for a model manually, they remain constant even when not all the available resources are fully used or when the load on the model requires more resources.
Instead of setting the number of allocations manually, you can enable adaptive allocations to set the number of allocations based on the load on the process.
This can help you manage performance and cost more easily.
(Refer to the https://cloud.elastic.co/pricing[pricing calculator] to learn more about the possible costs.)

When adaptive allocations are enabled, the number of allocations of the model is set automatically based on the current load.
When the load is high, a new model allocation is automatically created.
When the load is low, a model allocation is automatically removed.

You can enable adaptive allocations by using:

* the create inference endpoint API for {ref}/infer-service-elser.html[ELSER] and {ref}/infer-service-elasticsearch.html[E5 and models uploaded through Eland] that are used as {infer} services.
* the {ref}/start-trained-model-deployment.html[start trained model deployment] or {ref}/update-trained-model-deployment.html[update trained model deployment] APIs for trained models that are deployed on {ml} nodes.
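For example, a minimal sketch of enabling adaptive allocations when creating an ELSER {infer} endpoint might look like the following (the endpoint ID is a placeholder, and the exact `adaptive_allocations` fields should be verified against the create inference endpoint API documentation):

[source,console]
----
# "my-elser-endpoint" is a placeholder inference endpoint ID
PUT _inference/sparse_embedding/my-elser-endpoint
{
  "service": "elser",
  "service_settings": {
    "num_threads": 1,
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 4
    }
  }
}
----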

If the new allocations fit on the current {ml} nodes, they are immediately started.
If more resource capacity is needed to create new model allocations and {ml} autoscaling is enabled, your {ml} node is scaled up to provide enough resources for them.
The number of model allocations can be scaled down to 0.
It cannot be scaled up to more than 32 allocations unless you explicitly set a higher maximum number of allocations.
Adaptive allocations must be set up independently for each deployment and {infer} endpoint.
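Similarly, for a model deployed on {ml} nodes, a hedged sketch of turning on adaptive allocations and explicitly raising the maximum through the update trained model deployment API could look like this (model ID and values are placeholders; verify the request body fields in the linked API docs):

[source,console]
----
# "my-model" is a placeholder trained model ID
POST _ml/trained_models/my-model/deployment/_update
{
  "adaptive_allocations": {
    "enabled": true,
    "min_number_of_allocations": 0,
    "max_number_of_allocations": 64
  }
}
----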


[discrete]
[[optimize-use-case]]
=== Optimizing for typical use cases

You can optimize your model deployment for typical use cases, such as search and ingest.
When you optimize for ingest, the throughput will be higher, which increases the number of {infer} requests that can be performed in parallel.
When you optimize for search, the latency will be lower during search processes.

* If you want to optimize for ingest, set the number of threads to `1` (`"threads_per_allocation": 1`).
* If you want to optimize for search, set the number of threads to greater than `1`.
Increasing the number of threads will make the search processes more performant.
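As a sketch of the two options above (model and deployment IDs are placeholders; `threads_per_allocation` and `number_of_allocations` are parameters of the start trained model deployment API referenced earlier):

[source,console]
----
# Ingest-optimized: one thread per allocation, throughput comes from more allocations
POST _ml/trained_models/my-model/deployment/_start?deployment_id=my-model-ingest&threads_per_allocation=1&number_of_allocations=2

# Search-optimized: more threads per allocation for lower latency
POST _ml/trained_models/my-model/deployment/_start?deployment_id=my-model-search&threads_per_allocation=4&number_of_allocations=1
----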


[discrete]
[[nlp-model-adaptive-resources]]
== Enabling autoscaling in {kib} - adaptive resources

You can enable adaptive resources for your models when starting or updating the model deployment.
Adaptive resources make it possible for {es} to scale up or down the available resources based on the load on the process.
This can help you manage performance and cost more easily.
When adaptive resources are enabled, the number of vCPUs that the model deployment uses is set automatically based on the current load.
When the load is high, the number of vCPUs that the process can use is automatically increased.
When the load is low, the number of vCPUs that the process can use is automatically decreased.

You can choose from three levels of resource usage for your trained model deployment.
Refer to the tables in the <<auto-scaling-matrix>> section to find out the settings for the level you selected.


[role="screenshot"]
image::images/ml-nlp-deployment-id-elser-v2.png["ELSER deployment with adaptive resources enabled.",width=640]


[discrete]
[[auto-scaling-matrix]]
== Model deployment resource matrix

The resources used by trained model deployments depend on three factors:

* your cluster environment (Serverless, Cloud, or on-premises)
* the use case you optimize the model deployment for (ingest or search)
* whether adaptive resources are enabled or disabled (dynamic or static resources)

If you use {es} on-premises, the vCPU level ranges are derived from the `total_ml_processors` and `max_single_ml_node_processors` values.
Use the {ref}/get-ml-info.html[get {ml} info API] to check these values.
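For example, on a self-managed cluster you can run the following request and read those two values from the response (a minimal sketch; the response contains other fields as well):

[source,console]
----
GET _ml/info
----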

The following tables show you the number of allocations, threads, and vCPUs available in Cloud when adaptive resources are enabled or disabled.

NOTE: For Observability and Security projects on Serverless, adaptive allocations are automatically enabled, and the "Adaptive resources" control is not displayed in {kib}.


[discrete]
=== Deployments in Cloud optimized for ingest

For ingest-optimized deployments, we maximize the number of model allocations.


[discrete]
==== Adaptive resources enabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 0 to 2 if available, dynamically | 1 | 0 to 2 if available, dynamically
| Medium | 1 to 32 dynamically | 1 | 1 to the smaller of 32 or the limit set in the Cloud console, dynamically
| High | 1 to limit set in the Cloud console ^*^, dynamically | 1 | 1 to limit set in the Cloud console, dynamically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

[discrete]
==== Adaptive resources disabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 2 if available, otherwise 1, statically | 1 | 2 if available
| Medium | the smaller of 32 or the limit set in the Cloud console, statically | 1 | 32 if available
| High | Maximum available set in the Cloud console ^*^, statically | 1 | Maximum available set in the Cloud console, statically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

[discrete]
=== Deployments in Cloud optimized for search

For search-optimized deployments, we maximize the number of threads.
The maximum number of threads that can be claimed depends on your hardware architecture.

[discrete]
==== Adaptive resources enabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 1 | 2 | 2
| Medium | 1 to 2 (if threads=16) dynamically | maximum that the hardware allows (for example, 16) | 1 to 32 dynamically
| High | 1 to limit set in the Cloud console ^*^, dynamically | maximum that the hardware allows (for example, 16) | 1 to limit set in the Cloud console, dynamically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

[discrete]
==== Adaptive resources disabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 1 if available, statically | 2 | 2 if available
| Medium | 2 (if threads=16) statically | maximum that the hardware allows (for example, 16) | 32 if available
| High | Maximum available set in the Cloud console ^*^, statically | maximum that the hardware allows (for example, 16) | Maximum available set in the Cloud console, statically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

docs/en/stack/ml/nlp/ml-nlp-deploy-models.asciidoc

Lines changed: 21 additions & 53 deletions
@@ -164,66 +164,34 @@ their deployment across your cluster under **{ml-app}** > *Model Management*.
 Alternatively, you can use the
 {ref}/start-trained-model-deployment.html[start trained model deployment API].
 
-You can deploy a model multiple times by assigning a unique deployment ID when
-starting the deployment. It enables you to have dedicated deployments for
-different purposes, such as search and ingest. By doing so, you ensure that the
-search speed remains unaffected by ingest workloads, and vice versa. Having
-separate deployments for search and ingest mitigates performance issues
-resulting from interactions between the two, which can be hard to diagnose.
+You can deploy a model multiple times by assigning a unique deployment ID when starting the deployment.
+
+You can optimize your deployment for typical use cases, such as search and ingest.
+When you optimize for ingest, the throughput will be higher, which increases the number of {infer} requests that can be performed in parallel.
+When you optimize for search, the latency will be lower during search processes.
+When you have dedicated deployments for different purposes, you ensure that the search speed remains unaffected by ingest workloads, and vice versa.
+Having separate deployments for search and ingest mitigates performance issues resulting from interactions between the two, which can be hard to diagnose.
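As an illustrative sketch of the pattern described above (model and deployment IDs are placeholders; `deployment_id` is a parameter of the {ref}/start-trained-model-deployment.html[start trained model deployment API]):

[source,console]
----
# Same model, two dedicated deployments under unique deployment IDs
POST _ml/trained_models/my-model/deployment/_start?deployment_id=my-model-for-ingest
POST _ml/trained_models/my-model/deployment/_start?deployment_id=my-model-for-search
----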
 
 [role="screenshot"]
 image::images/ml-nlp-deployment-id-elser-v2.png["Model deployment on the Trained Models UI."]
 
-It is recommended to fine-tune each deployment based on its specific purpose. To
-improve ingest performance, increase throughput by adding more allocations to
-the deployment. For improved search speed, increase the number of threads per
-allocation.
-
-NOTE: Since eland uses APIs to deploy the models, you cannot see the models in
-{kib} until the saved objects are synchronized. You can follow the prompts in
-{kib}, wait for automatic synchronization, or use the
-{kibana-ref}/machine-learning-api-sync.html[sync {ml} saved objects API].
-
-When you deploy the model, its allocations are distributed across available {ml}
-nodes. Model allocations are independent units of work for NLP tasks. To
-influence model performance, you can configure the number of allocations and the
-number of threads used by each allocation of your deployment. Alternatively, you
-can enable <<nlp-model-adaptive-allocations>> to automatically create and remove
-model allocations based on the current workload of the model (you still need to
-manually set the number of threads).
-
-IMPORTANT: If your deployed trained model has only one allocation, it's likely
-that you will experience downtime in the service your trained model performs.
-You can reduce or eliminate downtime by adding more allocations to your trained
-models.
+Each deployment is fine-tuned automatically based on the specific purpose you choose.
 
-Throughput can be scaled by adding more allocations to the deployment; it
-increases the number of {infer} requests that can be performed in parallel. All
-allocations assigned to a node share the same copy of the model in memory. The
-model is loaded into memory in a native process that encapsulates `libtorch`,
-which is the underlying {ml} library of PyTorch. The number of allocations
-setting affects the amount of model allocations across all the {ml} nodes. Model
-allocations are distributed in such a way that the total number of used threads
-does not exceed the allocated processors of a node.
-
-The threads per allocation setting affects the number of threads used by each
-model allocation during {infer}. Increasing the number of threads generally
-increases the speed of {infer} requests. The value of this setting must not
-exceed the number of available allocated processors per node.
-
-You can view the allocation status in {kib} or by using the
-{ref}/get-trained-models-stats.html[get trained model stats API]. If you want to
-change the number of allocations, you can use the
-{ref}/update-trained-model-deployment.html[update trained model stats API] after
-the allocation status is `started`. You can also enable
-<<nlp-model-adaptive-allocations>> to automatically create and remove model
-allocations based on the current workload of the model.
+NOTE: Since eland uses APIs to deploy the models, you cannot see the models in {kib} until the saved objects are synchronized.
+You can follow the prompts in {kib}, wait for automatic synchronization, or use the {kibana-ref}/machine-learning-api-sync.html[sync {ml} saved objects API].
 
-[discrete]
-[[nlp-model-adaptive-allocations]]
-=== Adaptive allocations
+You can define the resource usage level of the NLP model during model deployment.
+The resource usage levels behave differently depending on <<nlp-model-adaptive-resources, adaptive resources>> being enabled or disabled.
+When adaptive resources are disabled but {ml} autoscaling is enabled, the vCPU usage of Cloud deployments is derived from the Cloud console and functions as follows:
+
+* Low: This level limits resources to two vCPUs, which may be suitable for development, testing, and demos depending on your parameters.
+It is not recommended for production use.
+* Medium: This level limits resources to 32 vCPUs, which may be suitable for development, testing, and demos depending on your parameters.
+It is not recommended for production use.
+* High: This level may use the maximum number of vCPUs available for this deployment from the Cloud console.
+If the maximum is 2 vCPUs or fewer, this level is equivalent to the medium or low level.
 
-include::ml-nlp-shared.asciidoc[tag=ml-nlp-adaptive-allocations]
+For the resource levels when adaptive resources are enabled, refer to <<ml-nlp-auto-scale>>.
 
 
 [discrete]

docs/en/stack/ml/nlp/ml-nlp-e5.asciidoc

Lines changed: 3 additions & 6 deletions
@@ -41,6 +41,9 @@ models on HuggingFace for further information including licensing.
 To use E5, you must have the {subscriptions}[appropriate subscription] level
 for semantic search or the trial period activated.
 
+Enabling trained model autoscaling for your E5 deployment is recommended.
+Refer to <<ml-nlp-auto-scale>> to learn more.
+
 
 [discrete]
 [[download-deploy-e5]]
@@ -313,12 +316,6 @@ Once it's uploaded to {es}, the model will have the ID specified by
 underscores `__`.
 --
 
-[discrete]
-[[e5-adaptive-allocations]]
-== Adaptive allocations
-
-include::ml-nlp-shared.asciidoc[tag=ml-nlp-adaptive-allocations]
-
 
 [discrete]
 [[terms-of-use-e5]]

docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc

Lines changed: 11 additions & 4 deletions
@@ -66,6 +66,9 @@ more allocations or more threads per allocation, which requires bigger ML nodes.
 Autoscaling provides bigger nodes when required. If autoscaling is turned off,
 you must provide suitably sized nodes yourself.
 
+Enabling trained model autoscaling for your ELSER deployment is recommended.
+Refer to <<ml-nlp-auto-scale>> to learn more.
+
 
 [discrete]
 [[elser-v2]]
@@ -449,13 +452,17 @@ To achieve the best results, it's recommended to clean the input text before gen
 The exact preprocessing you may need to do heavily depends on your text.
 For example, if your text contains HTML tags, use the {ref}/htmlstrip-processor.html[HTML strip processor] in an ingest pipeline to remove unnecessary elements.
 Always review and clean your input text before ingestion to eliminate any irrelevant entities that might affect the results.
-
+
 
 [discrete]
-[[elser-adaptive-allocations]]
-== Adaptive allocations
+[[elser-recommendations]]
+== Recommendations for using ELSER
+
+To get the most value out of ELSER trained models, consider following these recommendations.
 
-include::ml-nlp-shared.asciidoc[tag=ml-nlp-adaptive-allocations]
+* Use two ELSER {infer} endpoints: one optimized for ingest and one optimized for search.
+* If quick response time is important for your use case, keep {ml} resources available at all times by setting `min_allocations` to `1`.
+* Setting `min_allocations` to `0` can save on costs for non-critical use cases or testing environments.
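To illustrate those recommendations, a rough sketch of a pair of ELSER {infer} endpoints follows (endpoint IDs are placeholders, and the `adaptive_allocations` field names shown here are assumed from the create inference endpoint API rather than taken from this commit):

[source,console]
----
# Placeholder endpoint IDs; one endpoint per workload
PUT _inference/sparse_embedding/elser-for-search
{
  "service": "elser",
  "service_settings": {
    "num_threads": 1,
    "adaptive_allocations": { "enabled": true, "min_number_of_allocations": 1 }
  }
}

# Ingest endpoint allowed to scale to zero to save costs when idle
PUT _inference/sparse_embedding/elser-for-ingest
{
  "service": "elser",
  "service_settings": {
    "num_threads": 1,
    "adaptive_allocations": { "enabled": true, "min_number_of_allocations": 0 }
  }
}
----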
 
 
 [discrete]

docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc

Lines changed: 1 addition & 13 deletions
@@ -17,16 +17,4 @@ When you use ELSER for semantic search, only the first 512 extracted tokens from
 each field of the ingested documents that ELSER is applied to are taken into
 account for the search process. If your data set contains long documents, divide
 them into smaller segments before ingestion if you need the full text to be
-searchable.
-
-
-[discrete]
-[[ml-nlp-elser-autoscale]]
-== ELSER deployments don't autoscale
-
-Currently, ELSER deployments do not scale up and down automatically depending on
-the resource requirements of the ELSER processes. If you want to configure
-available resources for your ELSER deployments, you can manually set the number
-of allocations and threads per allocation by using the Trained Models UI in
-{kib} or the
-{ref}/update-trained-model-deployment.html[Update trained model deployment API].
+searchable.
