[DOCS] Adds Working with anomaly detection at scale to ML AD docs (#1353) (#1354)
szabosteve authored Sep 7, 2020
1 parent 22fe041 commit f4ee746
Showing 3 changed files with 280 additions and 0 deletions.
275 changes: 275 additions & 0 deletions docs/en/stack/ml/anomaly-detection/anomaly-detection-scale.asciidoc
@@ -0,0 +1,275 @@
[role="xpack"]
[[anomaly-detection-scale]]
= Working with {anomaly-detect} at scale

There are many advanced configuration options for {anomaly-jobs}, some of which
significantly affect performance or resource usage. This guide contains a list
of considerations to help you plan for using {anomaly-detect} at scale.

In this guide, you’ll learn how to:

* Understand the impact of configuration options on the performance of
{anomaly-jobs}

Prerequisites:

* This guide assumes you’re already familiar with how to create {anomaly-jobs}.
If not, refer to <<ml-overview>>.

The following recommendations are not sequential; the numbers simply make it
easier to navigate between the list items, and you can act on one or more of
them in any order. You can implement some of these changes on existing jobs;
others require you to clone an existing job or create a new one.


[discrete]
[[node-sizing]]
== 1. Consider node sizing and configuration

An {anomaly-job} runs on a single node and requires sufficient resources to hold
its model in memory. When a job is opened, it is placed on the node that has the
most available memory at that time.

The memory available to the {ml} native processes is roughly the total machine
RAM minus what is required by the operating system, {es}, and any other software
running on the same machine.

The available memory for {ml} on a node must be sufficient to accommodate the
size of the largest model. The total available memory across all {ml} nodes must
be sufficient to accommodate the memory requirement for all simultaneously open
jobs.

In {ecloud}, dedicated {ml} nodes are provisioned with most of the RAM
automatically being available to the {ml} native processes. If you deploy
self-managed, we recommend using dedicated {ml} nodes and increasing the value
of `xpack.ml.max_machine_memory_percent` from the default 30%. The default has
to be set low in case other software is running on the same machine and to
leave memory free for an OS file system cache on {ml} nodes that are also data
nodes. If you use dedicated {ml} nodes as recommended and do not run any other
software on them, it is reasonable to run with a 2 GB JVM heap and set
`xpack.ml.max_machine_memory_percent` to 90% on machines with at least 24 GB of
RAM. This maximizes the number of {ml} jobs that can be run.
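
As a rough sketch, a dedicated {ml} node in a self-managed 7.x deployment could
be configured along these lines in `elasticsearch.yml` (the exact settings and
role syntax depend on your version and environment):

[source,yaml]
----
# Hypothetical dedicated ML node (7.x legacy boolean role settings).
node.master: false
node.data: false
node.ingest: false
node.ml: true
# Reasonable to raise from the default 30% because no other software
# runs on this node.
xpack.ml.max_machine_memory_percent: 90
----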

Increasing the number of nodes distributes job processing and improves fault
tolerance. If you run many jobs, even ones with small memory requirements,
consider increasing the number of nodes in your environment.


[discrete]
[[dedicated-results-index]]
== 2. Use dedicated results indices

For large jobs, use a dedicated results index. This ensures that results from a
single large job do not dominate the shared results index. It also ensures that
the job and its results (if `results_retention_days` is set) can be deleted more
efficiently, and it improves renormalization performance. By default,
{anomaly-job} results are stored in a shared index. To switch to a dedicated
results index, you need to clone the job or create a new one.
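
For example, you can specify a dedicated results index with the
`results_index_name` property when you create a job through the API. The
following minimal sketch assumes a hypothetical job that analyzes a
`responsetime` field:

[source,console]
----
PUT _ml/anomaly_detectors/large_job
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "mean", "field_name": "responsetime" }
    ]
  },
  "data_description": { "time_field": "@timestamp" },
  "results_index_name": "large_job"
}
----

Results are then written to a dedicated `.ml-anomalies-custom-large_job` index
instead of the shared one.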


[discrete]
[[model-plot]]
== 3. Disable model plot

By default, model plot is enabled when you create jobs in {kib}. If you have a
large job, however, consider disabling it. You can disable model plot for
existing jobs by using the {ref}/ml-update-job.html[Update {anomaly-jobs} API].
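
For example, the following sketch disables model plot on a hypothetical
existing job named `my_job`:

[source,console]
----
POST _ml/anomaly_detectors/my_job/_update
{
  "model_plot_config": { "enabled": false }
}
----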

Model plot calculates and stores the model bounds for each analyzed entity,
including both anomalous and non-anomalous entities. These bounds are used to
display the shaded area in the Single Metric Viewer charts. Model plot creates
one result document per bucket per split field value. If you have high
cardinality fields or a short bucket span, disabling model plot reduces the
processing workload and the volume of results that are stored.


[discrete]
[[detector-configuration]]
== 4. Understand how detector configuration can impact model memory

The following factors are most significant in increasing the memory required for
a job:

* High cardinality of the `by` or `partition` fields
* Multiple detectors
* A high distinct count of influencers within a bucket

Optimize your {anomaly-job} by choosing only relevant influencer fields and
detectors.

If you have high cardinality `by` or `partition` fields, ensure you have
sufficient memory resources available for the job. Alternatively, consider
whether the job can be split into smaller jobs by using a {dfeed} query, as
shown in the sketch below. For very high cardinality, a
<<ml-configuring-populations,population analysis>> may be more appropriate.
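
As a sketch of splitting by {dfeed} query, assuming hypothetical web log data
with a `region` field and a job `web_logs_eu` that already exists, each smaller
job can filter to one slice of the data:

[source,console]
----
# One of several smaller jobs, each covering a subset of the data.
PUT _ml/datafeeds/datafeed-web_logs_eu
{
  "job_id": "web_logs_eu",
  "indices": ["web-logs-*"],
  "query": {
    "term": { "region": "eu" }
  }
}
----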

To change the partitioning fields, influencers, or detectors, you need to clone
the job or create a new one.


[discrete]
[[optimize-bucket-span]]
== 5. Optimize the bucket span

Short bucket spans and high cardinality detectors are resource intensive and
increase the processing workload.

The bucket span is typically between 15m and 1h. The recommended value always
depends on the data, the use case, and the latency required for alerting. A job
with a longer bucket span uses fewer resources because fewer buckets require
processing and fewer results are written. Bucket spans that are sensible
dividers of an hour or day work best, as most periodic patterns have a daily
cycle.

If your use case is suitable, consider increasing the bucket span to reduce
processing workload. To change the bucket span, you need to clone or create a
new job.
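
The bucket span is set in the analysis configuration when the job is created; a
minimal sketch with a hypothetical hourly job:

[source,console]
----
PUT _ml/anomaly_detectors/hourly_event_rate
{
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [ { "function": "count" } ]
  },
  "data_description": { "time_field": "@timestamp" }
}
----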


[discrete]
[[set-model-memory-limit]]
== 6. Set the model memory limit

The `model_memory_limit` job configuration option sets the approximate maximum
amount of memory resources required for analytical processing. If this limit is
set too low for the job, data pruning becomes more aggressive as the limit is
approached. Once the limit is exceeded, new entities are not modeled.

Use model memory estimation to get a better picture of the memory needs of the
model. Model memory estimation happens automatically when you create the job in
{kib}, or you can call the
{ref}/ml-estimate-model-memory.html[Estimate {anomaly-jobs} model memory API]
manually. The estimation is based on the analysis configuration details for the
job and cardinality estimates for the fields it references. You can update the
memory settings of an existing job, but the job must be closed.
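
As a sketch, the estimate API takes the analysis configuration plus cardinality
estimates for the fields it references, and the memory limit of a closed job
can then be updated accordingly (job and field names here are hypothetical):

[source,console]
----
POST _ml/anomaly_detectors/_estimate_model_memory
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "mean",
        "field_name": "responsetime",
        "partition_field_name": "airline"
      }
    ],
    "influencers": ["airline"]
  },
  "overall_cardinality": { "airline": 50000 },
  "max_bucket_cardinality": { "airline": 500 }
}

# Apply the estimate to an existing job; the job must be closed first.
POST _ml/anomaly_detectors/my_job/_update
{
  "analysis_limits": { "model_memory_limit": "1gb" }
}
----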


[discrete]
[[pre-aggregate-data]]
== 7. Pre-aggregate your data

You can speed up the analysis by summarizing your data with aggregations.

{anomaly-jobs-cap} use summary statistics that are calculated for each bucket.
The statistics can be calculated in the job itself or via aggregations. Where
possible, it is more efficient to use aggregations because the data nodes do
the heavy lifting instead of the {ml} node.
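
A minimal sketch of a {dfeed} that uses aggregations, with hypothetical index
and field names. The date histogram must nest a `max` aggregation on the time
field, and the job's `analysis_config` must set `summary_count_field_name` to
`doc_count` so the pre-aggregated counts are interpreted correctly:

[source,console]
----
PUT _ml/datafeeds/datafeed-agg_job
{
  "job_id": "agg_job",
  "indices": ["web-logs-*"],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "15m"
      },
      "aggregations": {
        "@timestamp": { "max": { "field": "@timestamp" } },
        "responsetime_avg": { "avg": { "field": "responsetime" } }
      }
    }
  }
}
----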

In certain cases, you cannot use aggregations to increase performance. For
example, categorization jobs use the full log message to detect anomalies, so
this data cannot be aggregated. If you have many influencer fields, it may not
be beneficial to use an aggregation either, because only a few documents in
each bucket may have the combination of all the different influencer fields.

Refer to <<ml-configuring-aggregation>> to learn more.


[discrete]
[[results-retention]]
== 8. Optimize the results retention

Set a results retention window to reduce the amount of results stored.

{anomaly-detect-cap} results are retained indefinitely by default. Results build
up over time, and your results index may become quite large. A large results
index is slow to query and takes up significant space on your cluster. Consider
how long you wish to retain the results and set `results_retention_days`
accordingly – for example, to 30 or 60 days – to avoid unnecessarily large
results indices. Deleting old results does not affect the model behavior. You
can change this setting for existing jobs.
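
For example, a sketch that keeps 30 days of results on a hypothetical existing
job:

[source,console]
----
POST _ml/anomaly_detectors/my_job/_update
{
  "results_retention_days": 30
}
----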


[discrete]
[[renormalization-window]]
== 9. Optimize the renormalization window

Reduce the renormalization window to reduce processing workload.

When a new anomaly has a much higher score than any anomaly in the past, the
scores of past anomalies are adjusted on a range from 0 to 100 based on the new
data. This is called renormalization, and it can mean rewriting a large number
of documents in the results index. By default, renormalization applies to
results from the last 30 days or 100 bucket spans, whichever is longer. When
you are working at scale, set `renormalization_window_days` to a lower value to
reduce the workload. You can change this setting for existing jobs; the change
takes effect after the job is reopened.
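
A sketch of lowering the window on a hypothetical existing job; the right value
depends on your data and use case:

[source,console]
----
POST _ml/anomaly_detectors/my_job/_update
{
  "renormalization_window_days": 7
}
----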


[discrete]
[[model-snapshot-retention]]
== 10. Optimize the model snapshot retention

Model snapshots are taken periodically to ensure resilience in the event of a
system failure and to allow you to manually revert to a specific point in time.
They are stored in a compressed format in an internal index and kept according
to the configured retention policy. Indexing a model snapshot places load on
the cluster, and the index grows as multiple snapshots are retained.

When working with large model sizes, consider how frequently you want to create
model snapshots by using `background_persist_interval`. The default is every 3
to 4 hours. Increasing this interval reduces the periodic indexing load on your
cluster, but in the event of a system failure, you may have to revert to an
older version of the model.

Also consider how long you wish to retain snapshots using
`model_snapshot_retention_days` and `daily_model_snapshot_retention_after_days`.
Retaining fewer snapshots substantially reduces index storage requirements for
model state, but also reduces the granularity of model snapshots from which you
can revert.
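
As a sketch, all three settings can be adjusted together on a hypothetical
existing job (the values shown are illustrative, not recommendations):

[source,console]
----
# Persist snapshots less often and retain fewer of them.
POST _ml/anomaly_detectors/my_job/_update
{
  "background_persist_interval": "6h",
  "model_snapshot_retention_days": 5,
  "daily_model_snapshot_retention_after_days": 1
}
----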

For more information, refer to <<ml-model-snapshots>>.


[discrete]
[[search-queries]]
== 11. Optimize your search queries

If you are operating at scale, make sure that your {dfeed} query is as
efficient as possible. There are different ways to write {es} queries, and some
of them are more efficient than others. Refer to
{ref}/tune-for-search-speed.html[Tune for search speed] to learn more about {es}
performance tuning.

You need to clone or recreate an existing job if you want to optimize its search
query.


[discrete]
[[population-analysis]]
== 12. Consider using population analysis

Population analysis is more memory efficient than individual analysis of each
series. It builds a profile of what a "typical" entity does over a specified
time period and then identifies entities that behave abnormally compared to the
population. Use population analysis for high cardinality fields if you expect
the entities of the population to generally behave in the same way.
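
A population detector is configured with `over_field_name`; a minimal sketch
with hypothetical client traffic data:

[source,console]
----
PUT _ml/anomaly_detectors/population_job
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "mean",
        "field_name": "bytes",
        "over_field_name": "client_ip"
      }
    ]
  },
  "data_description": { "time_field": "@timestamp" }
}
----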

For more information, refer to <<ml-configuring-populations>>.


[discrete]
[[forecasting]]
== 13. Reduce the cost of forecasting

There are two main performance factors to consider when you create a forecast:
indexing load and memory usage. Check the cluster monitoring data to learn the
indexing rate and the memory usage.

Forecasting writes a new document to the results index for every forecast
element in every bucket. Jobs with high `partition` or `by` field cardinality
create more result documents, as do jobs with a small bucket span and a longer
forecast duration. Only three forecasts may run concurrently for a single job.

To reduce indexing load, consider a shorter forecast duration and try to avoid
concurrent forecast requests. Further performance gains can be achieved by
reviewing the job configuration; for example, by using a dedicated results
index, increasing the bucket span, or using lower cardinality partitioning
fields.

The memory usage of a forecast is restricted to 20 MB by default. From 7.9, you
can extend this limit by setting `max_model_memory` to a higher value; the
maximum is 40% of the memory limit of the {anomaly-job} or 500 MB, whichever is
smaller. If the forecast needs more memory than the provided value, it spools
to disk. Forecasts that would take more than 500 MB to run do not start,
because this is the maximum amount of disk space that a forecast is allowed to
use.
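
A sketch of a forecast request that raises the memory limit on a hypothetical
job (`max_model_memory` requires 7.9 or later):

[source,console]
----
POST _ml/anomaly_detectors/my_job/_forecast
{
  "duration": "3d",
  "max_model_memory": "100mb"
}
----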
2 changes: 2 additions & 0 deletions docs/en/stack/ml/anomaly-detection/index.asciidoc
@@ -16,6 +16,8 @@ include::create-jobs.asciidoc[leveloffset=+2]
include::job-tips.asciidoc[leveloffset=+3]
include::stopping-ml.asciidoc[leveloffset=+2]

include::anomaly-detection-scale.asciidoc[leveloffset=+2]

include::ml-api-quickref.asciidoc[leveloffset=+1]

include::ootb-ml-jobs.asciidoc[leveloffset=+1]
3 changes: 3 additions & 0 deletions docs/en/stack/ml/anomaly-detection/ml-configuration.asciidoc
@@ -29,3 +29,6 @@ you visualize and explore the results.

After you learn how to create and stop {anomaly-detect} jobs, you can check the
<<anomaly-examples>> for more advanced settings and scenarios.

Refer to <<anomaly-detection-scale>> to learn more about running large
{anomaly-jobs}.
