
[ML] Improve scalability of model plot functionality #93

Open
@tveasey

Model plot is useful and popular functionality but suffers from some scalability issues. Its overhead breaks down as:

  1. The cost to compute bounds in the autodetect process (which can as much as double its run time)
  2. The cost of marshalling all the extra results and the extra indexing load
  3. The cost of managing the resulting job indices, which can get very large for long-running jobs

Currently, we get three values per partitioning field pair (by, partition) per time bucket.
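To give a feel for how this scales, here is a rough back-of-the-envelope sketch. The function name, its parameters, and the example job (10,000 pairs, 15 minute buckets, 30 days) are made up for illustration and are not taken from the codebase; the only input from this issue is the figure of three values per (by, partition) pair per bucket.

```python
def model_plot_values(num_pairs: int,
                      bucket_span_s: int,
                      duration_s: int,
                      values_per_pair: int = 3) -> int:
    """Estimate how many model plot values a job writes.

    Assumes every (by, partition) pair produces `values_per_pair`
    values in every bucket, as described in this issue.
    """
    if bucket_span_s <= 0:
        raise ValueError("bucket_span_s must be positive")
    num_buckets = duration_s // bucket_span_s
    return values_per_pair * num_pairs * num_buckets


# Hypothetical job: 10,000 pairs, 15 minute buckets, running for 30 days.
thirty_days_s = 30 * 24 * 3600
print(model_plot_values(10_000, 900, thirty_days_s))  # 86400000
```

The count grows linearly in both the number of pairs and the job duration, which is why costs 2 and 3 become significant for large, long-running jobs.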

This issue outlines some options available to improve scalability. The following suggestions target different subsets of 1, 2 and 3 as noted:

  1. Only compute bounds at a fixed "multiple" of the bucket length. This trades accuracy for performance: bounds are computed for only one in every "multiple" buckets, so the cost falls to roughly 1 / "multiple" of its current value. As refinements, we might also:

    • compute bounds if we determine the (by, partition) pair is anomalous,
    • make this dynamic and a function of the difference from the last bounds.

    (Targets 1, 2 and 3.)

  2. Speed up the bounds calculation. This is slow, particularly for certain distribution models, and there should be approximations available which trade accuracy for runtime. (Targets 1.)

  3. Allow on-demand model plot for requested time series. Since we already have snapshots of the model for disaster recovery, we can in theory load the oldest snapshot and generate model plot up to a requested time. Ideally, we'd only load the model for the requested time series, although this will require changes to restore. This clearly also requires that the intervening data was stored. However, it bypasses the performance problems altogether at the expense of limited history. (Targets 1, 2 and 3.)

Note that these are complementary, so there is nothing to stop us implementing all three. As a very rough estimate, I'd expect development effort as follows: 2 < 1 << 3.
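As a rough illustration of option 1 and its two refinements, the decision of whether to compute bounds for a given bucket could look something like the sketch below. Everything here is hypothetical: `should_compute_bounds`, `drift`, and `drift_threshold` are names invented for this example, and `drift` stands in for some cheap measure of how far the model has moved since bounds were last computed, which this issue does not define.

```python
from typing import Optional


def should_compute_bounds(bucket_index: int,
                          multiple: int,
                          is_anomalous: bool = False,
                          drift: float = 0.0,
                          drift_threshold: Optional[float] = None) -> bool:
    """Decide whether to compute model bounds for one (by, partition) pair.

    bucket_index    -- index of the current bucket, counted from job start
    multiple        -- compute bounds on every `multiple`-th bucket
    is_anomalous    -- refinement: always compute for anomalous pairs
    drift           -- refinement: assumed cheap proxy for the change
                       since the last computed bounds (hypothetical)
    drift_threshold -- compute early if drift exceeds this; None disables
    """
    if multiple < 1:
        raise ValueError("multiple must be at least 1")
    if bucket_index % multiple == 0:
        return True  # the regular schedule
    if is_anomalous:
        return True  # keep bounds available where users will look
    if drift_threshold is not None and drift > drift_threshold:
        return True  # the model moved enough to justify a refresh
    return False


# With multiple=4, only every fourth bucket is computed by default.
print([i for i in range(12) if should_compute_bounds(i, 4)])  # [0, 4, 8]
```

With `multiple=1` this reduces to the current behaviour of computing bounds in every bucket, so the setting could be introduced without changing existing jobs.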
