[ML] Supporting multiple partition fields #1704

Open
@droberts195

Currently only one partition_field_name can be configured for a job, but some use cases require more.

A workaround for this has always been to create a new field by concatenating several existing fields, then use this new field as the partition field, then split it up before interpreting the results. This works OK for programmatic users of anomaly detection, but has some drawbacks:

  1. The interaction between partition fields and influencers cannot be analysed when an influencer field is one of the original fields that was concatenated into the partition field
  2. You cannot easily use our UI to drill into the results, as the raw results contain the concatenated partition field value, and extra processing is required to get from there to the fields that exist in the source index
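For illustration, the concatenate-then-split workaround can be sketched roughly as follows. This is a minimal sketch, not code from any Elastic product: the field names `region` and `service` and the helpers `make_composite` and `split_composite` are hypothetical, and JSON encoding is just one assumed way to make the composite value reversible when a field value might contain the delimiter.

```python
import json

# Hypothetical source fields to be treated as one logical partition.
PARTITION_FIELDS = ["region", "service"]


def make_composite(doc, fields=PARTITION_FIELDS):
    """Build a single partition value from several source fields.

    JSON-encoding the list of values keeps the result unambiguous even
    if a value contains characters a plain delimiter join would clash with.
    """
    return json.dumps([doc[f] for f in fields], separators=(",", ":"))


def split_composite(value, fields=PARTITION_FIELDS):
    """Recover the original field values from a composite partition value."""
    values = json.loads(value)
    if len(values) != len(fields):
        raise ValueError("composite value does not match the configured fields")
    return dict(zip(fields, values))


# Before indexing: add the composite field used as partition_field_name.
doc = {"region": "eu|west", "service": "api", "bytes": 10}
composite = make_composite(doc)

# After reading results: split the composite value back into source fields.
recovered = split_composite(composite)
```

The splitting step is the extra processing mentioned in drawback 2: anything that reads the raw results, including a UI, sees only the composite value unless it applies the same logic.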

We have previously talked about adding support for multiple partition fields. This could be done at different levels:

  1. We could leave the C++ code unchanged and automate the concatenation and splitting of partition fields in our Java code that communicates with the C++
  2. We could implement multiple partition fields in the C++ by concatenating immediately after receiving input and splitting immediately before writing output
  3. We could implement a full solution for multiple partition fields in the C++, taking into account the relationship between influencer fields and partition fields

Solutions 1 and 2 basically automate the current workaround.
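The shape shared by solutions 1 and 2 can be sketched as a wrapper around an unchanged analysis step: fold the fields into one value on the way in, unfold them on the way out. The sketch below is an assumption-laden toy, not the actual Java or C++ design: `with_multi_partition`, `toy_analyse`, and the `_composite_partition` field name are all hypothetical, and the "analysis" merely counts records per partition.

```python
import json


def with_multi_partition(analyse, fields, composite_name="_composite_partition"):
    """Wrap a single-partition analysis so callers see multiple fields."""

    def wrapped(records):
        def fold(rs):
            # Concatenate immediately after receiving input.
            for r in rs:
                folded = dict(r)
                folded[composite_name] = json.dumps([r[f] for f in fields])
                yield folded

        for result in analyse(fold(records)):
            # Split immediately before writing output.
            out = dict(result)
            values = json.loads(out.pop(composite_name))
            out.update(zip(fields, values))
            yield out

    return wrapped


def toy_analyse(records, partition="_composite_partition"):
    """Stand-in for the real analysis: count records per partition value."""
    counts = {}
    for r in records:
        counts[r[partition]] = counts.get(r[partition], 0) + 1
    for key, n in counts.items():
        yield {partition: key, "record_count": n}


records = [
    {"region": "eu", "service": "api"},
    {"region": "eu", "service": "api"},
    {"region": "us", "service": "api"},
]
results = list(with_multi_partition(toy_analyse, ["region", "service"])(records))
```

In these terms, the two solutions differ only in which side of the Java/C++ boundary the wrapper lives on. The analysis inside the wrapper still sees one opaque partition value, which is why neither addresses the influencer interaction described above.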

Solution 1 has the major drawback that it would break the principle that the output documents the C++ writes are exactly the output documents that get written to our results index. This is the main difference between solutions 1 and 2, but another advantage of solution 2 is that it could serve as an intermediate step towards solution 3 without further changes to the Java code. Solution 3 is the hardest, as it affects the way the hierarchical results structures are built and the calculation of influencer probabilities.

Any of these solutions will also require downstream changes to the ML UI code for job configuration and displaying results.

There would also be much smaller changes to the Java config code.
