[DOCS] Adds Working with anomaly detection at scale to ML AD docs (#1353) (#1354)
szabosteve authored Sep 7, 2020
1 parent 22fe041 commit f4ee746
Showing 3 changed files with 280 additions and 0 deletions.
275 changes: 275 additions & 0 deletions docs/en/stack/ml/anomaly-detection/anomaly-detection-scale.asciidoc
@@ -0,0 +1,275 @@
[role="xpack"]
[[anomaly-detection-scale]]
= Working with {anomaly-detect} at scale

There are many advanced configuration options for {anomaly-jobs}, some of which
significantly affect performance or resource usage. This guide contains a list
of considerations to help you plan for using {anomaly-detect} at scale.

In this guide, you’ll learn how to:

* Understand the impact of configuration options on the performance of
{anomaly-jobs}

Prerequisites:

* This guide assumes you’re already familiar with how to create {anomaly-jobs}.
If not, refer to <<ml-overview>>.

The following recommendations are not sequential; the numbers simply make it
easier to navigate between the list items, and you can act on one or more of
them in any order. You can implement some of these changes on existing jobs;
others require you to clone an existing job or create a new one.


[discrete]
[[node-sizing]]
== 1. Consider node sizing and configuration

An {anomaly-job} runs on a single node and requires sufficient resources to hold
its model in memory. When a job is opened, it is placed on the node that has the
most available memory at that time.

The memory available to the {ml} native processes is roughly the total machine
RAM minus what is required by the operating system, {es}, and any other software
running on the same machine.

The available memory for {ml} on a node must be sufficient to accommodate the
size of the largest model. The total available memory across all {ml} nodes must
be sufficient to accommodate the memory requirement for all simultaneously open
jobs.

In {ecloud}, dedicated {ml} nodes are provisioned with most of the RAM
automatically being available to the {ml} native processes. If you deploy
self-managed, we recommend using dedicated {ml} nodes and increasing the value
of `xpack.ml.max_machine_memory_percent` from the default 30%. The default has
to be set low in case other software is running on the same machine and to
leave memory free for an OS file system cache on {ml} nodes that are also data
nodes. If you use dedicated {ml} nodes as recommended and do not run any other
software on them, it is reasonable to run with a 2 GB JVM heap and set
`xpack.ml.max_machine_memory_percent` to 90% on machines with at least 24 GB of
RAM. This maximizes the number of {ml} jobs that can be run.
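
As a rough sketch, a dedicated {ml} node in a self-managed 7.x deployment could
be configured along these lines in `elasticsearch.yml` (the exact settings and
role syntax depend on your version and environment):

[source,yaml]
----
# Hypothetical dedicated ML node (7.x legacy boolean role settings).
node.master: false
node.data: false
node.ingest: false
node.ml: true
# Reasonable to raise from the default 30% because no other software
# runs on this node.
xpack.ml.max_machine_memory_percent: 90
----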

Increasing the number of nodes distributes job processing and improves fault
tolerance. If you run many jobs, even ones with small memory requirements,
consider increasing the number of nodes in your environment.


[discrete]
[[dedicated-results-index]]
== 2. Use dedicated results indices

For large jobs, use a dedicated results index. This ensures that results from a
single large job do not dominate the shared results index. It also ensures that
the job and its results (if `results_retention_days` is set) can be deleted more
efficiently, and it improves renormalization performance. By default,
{anomaly-job} results are stored in a shared index. To switch to a dedicated
results index, you need to clone the job or create a new one.
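
For example, you can specify a dedicated results index with the
`results_index_name` property when you create a job through the API. The
following minimal sketch assumes a hypothetical job that analyzes a
`responsetime` field:

[source,console]
----
PUT _ml/anomaly_detectors/large_job
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "mean", "field_name": "responsetime" }
    ]
  },
  "data_description": { "time_field": "@timestamp" },
  "results_index_name": "large_job"
}
----

Results are then written to a dedicated `.ml-anomalies-custom-large_job` index
instead of the shared one.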


[discrete]
[[model-plot]]
== 3. Disable model plot

By default, model plot is enabled when you create jobs in {kib}. If you have a
large job, however, consider disabling it. You can disable model plot for
existing jobs by using the {ref}/ml-update-job.html[Update {anomaly-jobs} API].
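
For example, the following sketch disables model plot on a hypothetical
existing job named `my_job`:

[source,console]
----
POST _ml/anomaly_detectors/my_job/_update
{
  "model_plot_config": { "enabled": false }
}
----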

Model plot calculates and stores the model bounds for each analyzed entity,
including both anomalous and non-anomalous entities. These bounds are used to
display the shaded area in the Single Metric Viewer charts. Model plot creates
one result document per bucket per split field value. If you have high
cardinality fields or a short bucket span, disabling model plot reduces the
processing workload and the volume of results that are stored.


[discrete]
[[detector-configuration]]
== 4. Understand how detector configuration can impact model memory

The following factors are most significant in increasing the memory required for
a job:

* High cardinality of the `by` or `partition` fields
* Multiple detectors
* A high distinct count of influencers within a bucket

Optimize your {anomaly-job} by choosing only relevant influencer fields and
detectors.

If you have high cardinality `by` or `partition` fields, ensure you have
sufficient memory resources available for the job. Alternatively, consider
whether the job can be split into smaller jobs by using a {dfeed} query, as
shown in the sketch below. For very high cardinality, a
<<ml-configuring-populations,population analysis>> may be more appropriate.
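
As a sketch of splitting by {dfeed} query, assuming hypothetical web log data
with a `region` field and a job `web_logs_eu` that already exists, each smaller
job can filter to one slice of the data:

[source,console]
----
# One of several smaller jobs, each covering a subset of the data.
PUT _ml/datafeeds/datafeed-web_logs_eu
{
  "job_id": "web_logs_eu",
  "indices": ["web-logs-*"],
  "query": {
    "term": { "region": "eu" }
  }
}
----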

To change the partitioning fields, influencers, or detectors, you need to clone
the job or create a new one.


[discrete]
[[optimize-bucket-span]]
== 5. Optimize the bucket span

Short bucket spans and high cardinality detectors are resource intensive and
increase the processing workload.

The bucket span is typically between 15m and 1h. The recommended value always
depends on the data, the use case, and the latency required for alerting. A job
with a longer bucket span uses fewer resources because fewer buckets require
processing and fewer results are written. Bucket spans that are sensible
dividers of an hour or day work best, as most periodic patterns have a daily
cycle.

If your use case is suitable, consider increasing the bucket span to reduce
processing workload. To change the bucket span, you need to clone or create a
new job.
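
The bucket span is set in the analysis configuration when the job is created; a
minimal sketch with a hypothetical hourly job:

[source,console]
----
PUT _ml/anomaly_detectors/hourly_event_rate
{
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [ { "function": "count" } ]
  },
  "data_description": { "time_field": "@timestamp" }
}
----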


[discrete]
[[set-model-memory-limit]]
== 6. Set the model memory limit

The `model_memory_limit` job configuration option sets the approximate maximum
amount of memory resources required for analytical processing. If this limit is
set too low for the job, data pruning becomes more aggressive as the limit is
approached. Once the limit is exceeded, new entities are not modeled.

Use model memory estimation to get a better picture of the memory needs of the
model. Model memory estimation happens automatically when you create the job in
{kib}, or you can call the
{ref}/ml-estimate-model-memory.html[Estimate {anomaly-jobs} model memory API]
manually. The estimation is based on the analysis configuration details for the
job and cardinality estimates for the fields it references. You can update the
memory settings of an existing job, but the job must be closed.
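
As a sketch, the estimate API takes the analysis configuration plus cardinality
estimates for the fields it references, and the memory limit of a closed job
can then be updated accordingly (job and field names here are hypothetical):

[source,console]
----
POST _ml/anomaly_detectors/_estimate_model_memory
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "mean",
        "field_name": "responsetime",
        "partition_field_name": "airline"
      }
    ],
    "influencers": ["airline"]
  },
  "overall_cardinality": { "airline": 50000 },
  "max_bucket_cardinality": { "airline": 500 }
}

# Apply the estimate to an existing job; the job must be closed first.
POST _ml/anomaly_detectors/my_job/_update
{
  "analysis_limits": { "model_memory_limit": "1gb" }
}
----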


[discrete]
[[pre-aggregate-data]]
== 7. Pre-aggregate your data

You can speed up the analysis by summarizing your data with aggregations.

{anomaly-jobs-cap} use summary statistics that are calculated for each bucket.
The statistics can be calculated in the job itself or via aggregations. Where
possible, it is more efficient to use aggregations because the data nodes do
the heavy lifting instead of the {ml} node.
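
A minimal sketch of a {dfeed} that uses aggregations, with hypothetical index
and field names. The date histogram must nest a `max` aggregation on the time
field, and the job's `analysis_config` must set `summary_count_field_name` to
`doc_count` so the pre-aggregated counts are interpreted correctly:

[source,console]
----
PUT _ml/datafeeds/datafeed-agg_job
{
  "job_id": "agg_job",
  "indices": ["web-logs-*"],
  "aggregations": {
    "buckets": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "15m"
      },
      "aggregations": {
        "@timestamp": { "max": { "field": "@timestamp" } },
        "responsetime_avg": { "avg": { "field": "responsetime" } }
      }
    }
  }
}
----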

In certain cases, you cannot use aggregations to increase performance. For
example, categorization jobs use the full log message to detect anomalies, so
this data cannot be aggregated. If you have many influencer fields, it may not
be beneficial to use an aggregation either, because only a few documents in
each bucket may have the combination of all the different influencer fields.

Refer to <<ml-configuring-aggregation>> to learn more.


[discrete]
[[results-retention]]
== 8. Optimize the results retention

Set a results retention window to reduce the amount of results stored.

{anomaly-detect-cap} results are retained indefinitely by default. Results build
up over time, and your results index may become quite large. A large results
index is slow to query and takes up significant space on your cluster. Consider
how long you wish to retain the results and set `results_retention_days`
accordingly – for example, to 30 or 60 days – to avoid unnecessarily large
results indices. Deleting old results does not affect the model behavior. You
can change this setting for existing jobs.
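
For example, a sketch that keeps 30 days of results on a hypothetical existing
job:

[source,console]
----
POST _ml/anomaly_detectors/my_job/_update
{
  "results_retention_days": 30
}
----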


[discrete]
[[renormalization-window]]
== 9. Optimize the renormalization window

Reduce the renormalization window to reduce processing workload.

When a new anomaly has a much higher score than any anomaly in the past, the
scores of past anomalies are adjusted on a range from 0 to 100 based on the new
data. This is called renormalization, and it can mean rewriting a large number
of documents in the results index. By default, renormalization applies to
results from the last 30 days or 100 bucket spans, whichever is longer. When
you are working at scale, set `renormalization_window_days` to a lower value to
reduce the workload. You can change this setting for existing jobs; the change
takes effect after the job is reopened.
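
A sketch of lowering the window on a hypothetical existing job; the right value
depends on your data and use case:

[source,console]
----
POST _ml/anomaly_detectors/my_job/_update
{
  "renormalization_window_days": 7
}
----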


[discrete]
[[model-snapshot-retention]]
== 10. Optimize the model snapshot retention

Model snapshots are taken periodically to ensure resilience in the event of a
system failure and to allow you to manually revert to a specific point in time.
They are stored in a compressed format in an internal index and kept according
to the configured retention policy. Indexing a model snapshot places load on
the cluster, and the index grows as multiple snapshots are retained.

When working with large model sizes, consider how frequently you want to create
model snapshots by using `background_persist_interval`. The default is every 3
to 4 hours. Increasing this interval reduces the periodic indexing load on your
cluster, but in the event of a system failure, you may have to revert to an
older version of the model.

Also consider how long you wish to retain snapshots using
`model_snapshot_retention_days` and `daily_model_snapshot_retention_after_days`.
Retaining fewer snapshots substantially reduces index storage requirements for
model state, but also reduces the granularity of model snapshots from which you
can revert.
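
As a sketch, all three settings can be adjusted together on a hypothetical
existing job (the values shown are illustrative, not recommendations):

[source,console]
----
# Persist snapshots less often and retain fewer of them.
POST _ml/anomaly_detectors/my_job/_update
{
  "background_persist_interval": "6h",
  "model_snapshot_retention_days": 5,
  "daily_model_snapshot_retention_after_days": 1
}
----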

For more information, refer to <<ml-model-snapshots>>.


[discrete]
[[search-queries]]
== 11. Optimize your search queries

If you are operating at scale, make sure that your {dfeed} query is as
efficient as possible. There are different ways to write {es} queries, and some
of them are more efficient than others. Refer to
{ref}/tune-for-search-speed.html[Tune for search speed] to learn more about {es}
performance tuning.

You need to clone or recreate an existing job if you want to optimize its search
query.


[discrete]
[[population-analysis]]
== 12. Consider using population analysis

Population analysis is more memory efficient than individual analysis of each
series. It builds a profile of what a "typical" entity does over a specified
time period and then identifies entities that behave abnormally compared to the
population. Use population analysis for high cardinality fields if you expect
the entities of the population to generally behave in the same way.
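
A population detector is configured with `over_field_name`; a minimal sketch
with hypothetical client traffic data:

[source,console]
----
PUT _ml/anomaly_detectors/population_job
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "function": "mean",
        "field_name": "bytes",
        "over_field_name": "client_ip"
      }
    ]
  },
  "data_description": { "time_field": "@timestamp" }
}
----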

For more information, refer to <<ml-configuring-populations>>.


[discrete]
[[forecasting]]
== 13. Reduce the cost of forecasting

There are two main performance factors to consider when you create a forecast:
indexing load and memory usage. Check the cluster monitoring data to learn the
indexing rate and the memory usage.

Forecasting writes a new document to the results index for every forecast
element in every bucket. Jobs with high `partition` or `by` field cardinality
create more result documents, as do jobs with a small bucket span and a longer
forecast duration. Only three forecasts may run concurrently for a single job.

To reduce indexing load, consider a shorter forecast duration and try to avoid
concurrent forecast requests. Further performance gains can be achieved by
reviewing the job configuration; for example, by using a dedicated results
index, increasing the bucket span, or using lower cardinality partitioning
fields.

The memory usage of a forecast is restricted to 20 MB by default. From 7.9, you
can extend this limit by setting `max_model_memory` to a higher value; the
maximum is 40% of the memory limit of the {anomaly-job} or 500 MB, whichever is
smaller. If the forecast needs more memory than the provided value, it spools
to disk. Forecasts that would take more than 500 MB to run do not start,
because this is the maximum amount of disk space that a forecast is allowed to
use.
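
A sketch of a forecast request that raises the memory limit on a hypothetical
job (`max_model_memory` requires 7.9 or later):

[source,console]
----
POST _ml/anomaly_detectors/my_job/_forecast
{
  "duration": "3d",
  "max_model_memory": "100mb"
}
----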
2 changes: 2 additions & 0 deletions docs/en/stack/ml/anomaly-detection/index.asciidoc
@@ -16,6 +16,8 @@ include::create-jobs.asciidoc[leveloffset=+2]
include::job-tips.asciidoc[leveloffset=+3]
include::stopping-ml.asciidoc[leveloffset=+2]

include::anomaly-detection-scale.asciidoc[leveloffset=+2]

include::ml-api-quickref.asciidoc[leveloffset=+1]

include::ootb-ml-jobs.asciidoc[leveloffset=+1]
3 changes: 3 additions & 0 deletions docs/en/stack/ml/anomaly-detection/ml-configuration.asciidoc
@@ -29,3 +29,6 @@ you visualize and explore the results.

After you learn how to create and stop {anomaly-detect} jobs, you can check the
<<anomaly-examples>> for more advanced settings and scenarios.

Refer to <<anomaly-detection-scale>> to learn more about running large
{anomaly-jobs}.
