[8.x] [DOCS] Documents trained model auto-scaling (backport #2795) #2852

Merged
merged 1 commit into from
Oct 14, 2024
1 change: 1 addition & 0 deletions docs/en/stack/ml/nlp/index.asciidoc
@@ -4,6 +4,7 @@ include::ml-nlp-extract-info.asciidoc[leveloffset=+2]
include::ml-nlp-classify-text.asciidoc[leveloffset=+2]
include::ml-nlp-search-compare.asciidoc[leveloffset=+2]
include::ml-nlp-deploy-models.asciidoc[leveloffset=+1]
include::ml-nlp-autoscaling.asciidoc[leveloffset=+1]
include::ml-nlp-inference.asciidoc[leveloffset=+1]
include::ml-nlp-apis.asciidoc[leveloffset=+1]
include::ml-nlp-built-in-models.asciidoc[leveloffset=+1]
156 changes: 156 additions & 0 deletions docs/en/stack/ml/nlp/ml-nlp-autoscaling.asciidoc
@@ -0,0 +1,156 @@
[[ml-nlp-auto-scale]]
= Trained model autoscaling

You can enable autoscaling for each of your trained model deployments.
Autoscaling allows {es} to automatically adjust the resources the deployment can use based on the workload demand.

There are two ways to enable autoscaling:

* through APIs by enabling adaptive allocations
* in {kib} by enabling adaptive resources

IMPORTANT: To fully leverage model autoscaling, it is highly recommended to enable {cloud}/ec-autoscaling.html[deployment autoscaling].


[discrete]
[[nlp-model-adaptive-allocations]]
== Enabling autoscaling through APIs - adaptive allocations

Model allocations are independent units of work for NLP tasks.
If you manually set the number of threads and allocations for a model, they remain constant even when the available resources are not fully used or when the load on the model requires more resources.
Instead of setting the number of allocations manually, you can enable adaptive allocations to adjust the number of allocations based on the load on the process.
This can help you to manage performance and cost more easily.
(Refer to the https://cloud.elastic.co/pricing[pricing calculator] to learn more about the possible costs.)

When adaptive allocations are enabled, the number of allocations of the model is set automatically based on the current load.
When the load is high, a new model allocation is automatically created.
When the load is low, a model allocation is automatically removed.

You can enable adaptive allocations by using:

* the create inference endpoint API for {ref}/infer-service-elser.html[ELSER], {ref}/infer-service-elasticsearch.html[E5 and models uploaded through Eland] that are used as {infer} services.
* the {ref}/start-trained-model-deployment.html[start trained model deployment] or {ref}/update-trained-model-deployment.html[update trained model deployment] APIs for trained models that are deployed on {ml} nodes.

If the new allocations fit on the current {ml} nodes, they are immediately started.
If more resource capacity is needed to create new model allocations and {ml} autoscaling is enabled, your {ml} nodes are scaled up to provide enough resources for the new allocations.
The number of model allocations can be scaled down to 0, but it cannot be scaled up to more than 32 allocations unless you explicitly set a higher maximum number of allocations.
Adaptive allocations must be set up independently for each deployment and {infer} endpoint.
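
For example, the following sketch enables adaptive allocations when starting a deployment with the {ref}/start-trained-model-deployment.html[start trained model deployment] API (the model ID, deployment ID, and allocation bounds are placeholders):

[source,console]
----
POST _ml/trained_models/.elser_model_2/deployment/_start?deployment_id=my_elser_for_search
{
  "adaptive_allocations": {
    "enabled": true,
    "min_number_of_allocations": 1,
    "max_number_of_allocations": 4
  }
}
----

You can change these settings later with the {ref}/update-trained-model-deployment.html[update trained model deployment] API after the deployment has started.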


[discrete]
[[optimize-use-case]]
=== Optimizing for typical use cases

You can optimize your model deployment for typical use cases, such as search and ingest.
When you optimize for ingest, throughput is higher, which increases the number of {infer} requests that can be performed in parallel.
When you optimize for search, latency is lower during search processes.

* If you want to optimize for ingest, set the number of threads to `1` (`"threads_per_allocation": 1`).
* If you want to optimize for search, set the number of threads to a value greater than `1`.
Increasing the number of threads generally makes search more performant; see the example below.
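
The following sketch shows how a search-optimized deployment might be started with the {ref}/start-trained-model-deployment.html[start trained model deployment] API (the model ID, deployment ID, and thread count are placeholders; the thread count must not exceed the allocated processors per node):

[source,console]
----
POST _ml/trained_models/my_model/deployment/_start?deployment_id=my_model_for_search&threads_per_allocation=4
----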


[discrete]
[[nlp-model-adaptive-resources]]
== Enabling autoscaling in {kib} - adaptive resources

You can enable adaptive resources for your models when starting or updating the model deployment.
Adaptive resources make it possible for {es} to scale up or down the available resources based on the load on the process.
This can help you to manage performance and cost more easily.
When adaptive resources are enabled, the number of vCPUs that the model deployment uses is set automatically based on the current load.
When the load is high, the number of vCPUs that the process can use is automatically increased.
When the load is low, the number of vCPUs that the process can use is automatically decreased.

You can choose from three levels of resource usage for your trained model deployment.
Refer to the tables in the <<auto-scaling-matrix>> section to find out the settings for the level you selected.


[role="screenshot"]
image::images/ml-nlp-deployment-id-elser-v2.png["ELSER deployment with adaptive resources enabled.",width=640]


[discrete]
[[auto-scaling-matrix]]
== Model deployment resource matrix

The resources used by trained model deployments depend on three factors:

* your cluster environment (Serverless, Cloud, or on-premises)
* the use case you optimize the model deployment for (ingest or search)
* whether adaptive resources are enabled or disabled (dynamic or static resources)

If you use {es} on-premises, vCPU level ranges are derived from the `total_ml_processors` and `max_single_ml_node_processors` values.
Use the {ref}/get-ml-info.html[get {ml} info API] to check these values.
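
For example, the following request returns these limits (in recent versions they appear under the `limits` object of the response):

[source,console]
----
GET _ml/info
----
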
The following tables show you the number of allocations, threads, and vCPUs available in Cloud when adaptive resources are enabled or disabled.

NOTE: For Observability and Security projects on Serverless, adaptive allocations are automatically enabled, and the "Adaptive resources" control is not displayed in {kib}.


[discrete]
=== Deployments in Cloud optimized for ingest

For ingest-optimized deployments, the number of model allocations is maximized.


[discrete]
==== Adaptive resources enabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 0 to 2 if available, dynamically | 1 | 0 to 2 if available, dynamically
| Medium | 1 to 32 dynamically | 1 | 1 to the smaller of 32 or the limit set in the Cloud console, dynamically
| High | 1 to limit set in the Cloud console ^*^, dynamically | 1 | 1 to limit set in the Cloud console, dynamically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

[discrete]
==== Adaptive resources disabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 2 if available, otherwise 1, statically | 1 | 2 if available
| Medium | the smaller of 32 or the limit set in the Cloud console, statically | 1 | 32 if available
| High | Maximum available set in the Cloud console ^*^, statically | 1 | Maximum available set in the Cloud console, statically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

[discrete]
=== Deployments in Cloud optimized for search

For search-optimized deployments, the number of threads is maximized.
The maximum number of threads that can be claimed depends on the hardware of your deployment.

[discrete]
==== Adaptive resources enabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 1 | 2 | 2
| Medium | 1 to 2 (if threads=16) dynamically | maximum that the hardware allows (for example, 16) | 1 to 32 dynamically
| High | 1 to limit set in the Cloud console ^*^, dynamically| maximum that the hardware allows (for example, 16) | 1 to limit set in the Cloud console, dynamically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.

[discrete]
==== Adaptive resources disabled

[cols="4*", options="header"]
|==========
| Level | Allocations | Threads | vCPUs
| Low | 1 if available, statically | 2 | 2 if available
| Medium | 2 (if threads=16) statically | maximum that the hardware allows (for example, 16) | 32 if available
| High | Maximum available set in the Cloud console ^*^, statically | maximum that the hardware allows (for example, 16) | Maximum available set in the Cloud console, statically
|==========

^*^ The Cloud console doesn't directly set an allocations limit; it only sets a vCPU limit.
This vCPU limit indirectly determines the number of allocations, calculated as the vCPU limit divided by the number of threads.
74 changes: 21 additions & 53 deletions docs/en/stack/ml/nlp/ml-nlp-deploy-models.asciidoc
@@ -164,66 +164,34 @@ their deployment across your cluster under **{ml-app}** > *Model Management*.
Alternatively, you can use the
{ref}/start-trained-model-deployment.html[start trained model deployment API].

You can deploy a model multiple times by assigning a unique deployment ID when
starting the deployment. It enables you to have dedicated deployments for
different purposes, such as search and ingest. By doing so, you ensure that the
search speed remains unaffected by ingest workloads, and vice versa. Having
separate deployments for search and ingest mitigates performance issues
resulting from interactions between the two, which can be hard to diagnose.
You can deploy a model multiple times by assigning a unique deployment ID when starting the deployment.

You can optimize your deployment for typical use cases, such as search and ingest.
When you optimize for ingest, throughput is higher, which increases the number of {infer} requests that can be performed in parallel.
When you optimize for search, latency is lower during search processes.
When you have dedicated deployments for different purposes, you ensure that the search speed remains unaffected by ingest workloads, and vice versa.
Having separate deployments for search and ingest mitigates performance issues resulting from interactions between the two, which can be hard to diagnose.
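
For example, the same model could be started twice with different deployment IDs, one per use case (a sketch; the model ID, deployment IDs, and parameter values are placeholders):

[source,console]
----
POST _ml/trained_models/.elser_model_2/deployment/_start?deployment_id=elser_for_ingest&threads_per_allocation=1&number_of_allocations=2

POST _ml/trained_models/.elser_model_2/deployment/_start?deployment_id=elser_for_search&threads_per_allocation=4&number_of_allocations=1
----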

[role="screenshot"]
image::images/ml-nlp-deployment-id-elser-v2.png["Model deployment on the Trained Models UI."]

It is recommended to fine-tune each deployment based on its specific purpose. To
improve ingest performance, increase throughput by adding more allocations to
the deployment. For improved search speed, increase the number of threads per
allocation.

NOTE: Since eland uses APIs to deploy the models, you cannot see the models in
{kib} until the saved objects are synchronized. You can follow the prompts in
{kib}, wait for automatic synchronization, or use the
{kibana-ref}/machine-learning-api-sync.html[sync {ml} saved objects API].

When you deploy the model, its allocations are distributed across available {ml}
nodes. Model allocations are independent units of work for NLP tasks. To
influence model performance, you can configure the number of allocations and the
number of threads used by each allocation of your deployment. Alternatively, you
can enable <<nlp-model-adaptive-allocations>> to automatically create and remove
model allocations based on the current workload of the model (you still need to
manually set the number of threads).

IMPORTANT: If your deployed trained model has only one allocation, it's likely
that you will experience downtime in the service your trained model performs.
You can reduce or eliminate downtime by adding more allocations to your trained
models.
Each deployment is fine-tuned automatically based on the purpose you choose.

Throughput can be scaled by adding more allocations to the deployment; it
increases the number of {infer} requests that can be performed in parallel. All
allocations assigned to a node share the same copy of the model in memory. The
model is loaded into memory in a native process that encapsulates `libtorch`,
which is the underlying {ml} library of PyTorch. The number of allocations
setting affects the amount of model allocations across all the {ml} nodes. Model
allocations are distributed in such a way that the total number of used threads
does not exceed the allocated processors of a node.

The threads per allocation setting affects the number of threads used by each
model allocation during {infer}. Increasing the number of threads generally
increases the speed of {infer} requests. The value of this setting must not
exceed the number of available allocated processors per node.

You can view the allocation status in {kib} or by using the
{ref}/get-trained-models-stats.html[get trained model stats API]. If you want to
change the number of allocations, you can use the
{ref}/update-trained-model-deployment.html[update trained model stats API] after
the allocation status is `started`. You can also enable
<<nlp-model-adaptive-allocations>> to automatically create and remove model
allocations based on the current workload of the model.
NOTE: Since eland uses APIs to deploy the models, you cannot see the models in {kib} until the saved objects are synchronized.
You can follow the prompts in {kib}, wait for automatic synchronization, or use the {kibana-ref}/machine-learning-api-sync.html[sync {ml} saved objects API].

[discrete]
[[nlp-model-adaptive-allocations]]
=== Adaptive allocations
You can define the resource usage level of the NLP model during model deployment.
The resource usage levels behave differently depending on <<nlp-model-adaptive-resources, adaptive resources>> being enabled or disabled.
When adaptive resources are disabled but {ml} autoscaling is enabled, vCPU usage of Cloud deployments is derived from the Cloud console and functions as follows:

* Low: This level limits resources to two vCPUs, which may be suitable for development, testing, and demos depending on your parameters.
It is not recommended for production use.
* Medium: This level limits resources to 32 vCPUs, which may be suitable for development, testing, and demos depending on your parameters.
It is not recommended for production use.
* High: This level may use the maximum number of vCPUs available for this deployment from the Cloud console.
If the maximum is 2 vCPUs or fewer, this level is equivalent to the medium or low level.

include::ml-nlp-shared.asciidoc[tag=ml-nlp-adaptive-allocations]
For the resource levels when adaptive resources are enabled, refer to <<ml-nlp-auto-scale>>.


[discrete]
9 changes: 3 additions & 6 deletions docs/en/stack/ml/nlp/ml-nlp-e5.asciidoc
@@ -41,6 +41,9 @@ models on HuggingFace for further information including licensing.
To use E5, you must have the {subscriptions}[appropriate subscription] level
for semantic search or the trial period activated.

Enabling trained model autoscaling for your E5 deployment is recommended.
Refer to <<ml-nlp-auto-scale>> to learn more.


[discrete]
[[download-deploy-e5]]
@@ -313,12 +316,6 @@ Once it's uploaded to {es}, the model will have the ID specified by
underscores `__`.
--

[discrete]
[[e5-adaptive-allocations]]
== Adaptive allocations

include::ml-nlp-shared.asciidoc[tag=ml-nlp-adaptive-allocations]


[discrete]
[[terms-of-use-e5]]
15 changes: 11 additions & 4 deletions docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
@@ -66,6 +66,9 @@ more allocations or more threads per allocation, which requires bigger ML nodes.
Autoscaling provides bigger nodes when required. If autoscaling is turned off,
you must provide suitably sized nodes yourself.

Enabling trained model autoscaling for your ELSER deployment is recommended.
Refer to <<ml-nlp-auto-scale>> to learn more.


[discrete]
[[elser-v2]]
@@ -449,13 +452,17 @@ To achieve the best results, it's recommended to clean the input text before gen
The exact preprocessing you may need to do heavily depends on your text.
For example, if your text contains HTML tags, use the {ref}/htmlstrip-processor.html[HTML strip processor] in an ingest pipeline to remove unnecessary elements.
Always review and clean your input text before ingestion to eliminate any irrelevant entities that might affect the results.
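
For example, a minimal ingest pipeline that strips HTML markup might look like the following sketch (the pipeline name and the `body_content` field are hypothetical):

[source,console]
----
PUT _ingest/pipeline/strip-html
{
  "description": "Remove HTML tags before ELSER inference",
  "processors": [
    {
      "html_strip": {
        "field": "body_content"
      }
    }
  ]
}
----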


[discrete]
[[elser-adaptive-allocations]]
== Adaptive allocations
[[elser-recommendations]]
== Recommendations for using ELSER

To get the most value out of your ELSER trained models, consider following these recommendations; an example follows the list.

include::ml-nlp-shared.asciidoc[tag=ml-nlp-adaptive-allocations]
* Use two ELSER {infer} endpoints: one optimized for ingest and one optimized for search.
* If quick response time is important for your use case, keep {ml} resources available at all times by setting `min_allocations` to `1`.
* Setting `min_allocations` to `0` can save on costs for non-critical use cases or testing environments.
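
For example, an ELSER {infer} endpoint that enables adaptive allocations and keeps at least one allocation available might be created like the following sketch (the endpoint name and allocation bounds are placeholders; for the first recommendation you would create a second endpoint with a different name for the other use case):

[source,console]
----
PUT _inference/sparse_embedding/my-elser-search-endpoint
{
  "service": "elser",
  "service_settings": {
    "num_threads": 1,
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 4
    }
  }
}
----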


[discrete]
14 changes: 1 addition & 13 deletions docs/en/stack/ml/nlp/ml-nlp-limitations.asciidoc
@@ -17,16 +17,4 @@ When you use ELSER for semantic search, only the first 512 extracted tokens from
each field of the ingested documents that ELSER is applied to are taken into
account for the search process. If your data set contains long documents, divide
them into smaller segments before ingestion if you need the full text to be
searchable.


[discrete]
[[ml-nlp-elser-autoscale]]
== ELSER deployments don't autoscale

Currently, ELSER deployments do not scale up and down automatically depending on
the resource requirements of the ELSER processes. If you want to configure
available resources for your ELSER deployments, you can manually set the number
of allocations and threads per allocation by using the Trained Models UI in
{kib} or the
{ref}/update-trained-model-deployment.html[Update trained model deployment API].
searchable.