Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] feat(bigTent slo): Custom Validate function for big tent SLO schema #2016

Open
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

Leo-DiCara
Copy link
Contributor

@Leo-DiCara Leo-DiCara commented Jan 31, 2025

Please Don't merge this until all work in https://github.com/grafana/slo/issues/2697 is complete. We want to ensure that the SLO Plugin API is merged and running in prod before we merge terraform-provider changes.

This adds a custom validate function for queries that checks to ensure basic form of big SLO JSON is being followed before we allow the user to roundtrip it to the API. This checks for:

  • Basic JSON structure
  • refID field
  • datasource field
  • uid and type subfields in datasource

Closes https://github.com/grafana/slo/issues/2702

@Leo-DiCara Leo-DiCara requested review from a team as code owners January 31, 2025 22:52
Copy link

In order to lower resource usage and have a faster runtime, PRs will not run Cloud tests automatically.
To do so, a Grafana Labs employee must trigger the cloud acceptance tests workflow manually.

@Leo-DiCara Leo-DiCara changed the title feat(bigTent slo): Custom Validate function for big tent SLO schema [WIP] feat(bigTent slo): Custom Validate function for big tent SLO schema Feb 1, 2025
Type: schema.TypeString,
Description: `Metric for total events (denominator)`,
Required: true,
ValidateDiagFunc: ValidateBigTent(),
Copy link
Contributor

@elainevuong elainevuong Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clarification: all of our tests seem to be modifying the freeform query field.
do we also still need to validate the success_metric and total_metric fields on the schema with the ValidateBigTent function? or do we expect that usage of Terraform and Big Tent will always be via the query field?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elainevuong , we decide a few weeks ago to not support the ratio type, so we don't need to be checking for big-tent JSON in the success_metric or total_metric (generated/models/slo/v0_0/slo_gen.go#L139)

We should assume that all ratio SLOs are still promQL. Is there a place this assumption gets tricky? I know it's kind of weird to treat the freeform as "json or promQL" - but in the MTO code the weirdness didn't bubble up to the level of us modifying the schema to clarify

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did add the Validate function to ALL metrics but only tested it on the query field, since functionally these three string fields should be exactly the same in terms of what we want to verify I didn't think there would be much value to rewriting the same tests checking that field.

Copy link
Contributor

@elainevuong elainevuong Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're not supporting Ratio type for big-tent json, I think we ought to remove the ValidateBigTent() call from the success_metric and total_metric and only keep it on the query

@@ -661,3 +666,79 @@ func apiError(action string, err error) diag.Diagnostics {
},
}
}

func ValidateBigTent() schema.SchemaValidateDiagFunc {
return func(i interface{}, path cty.Path) diag.Diagnostics {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's this cty package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cty package is a hasicorp provided package for managing config values. cty.Path handles managing paths. It is required to write tests for terraform going forward.

Functionally here it just tells us at what depth our error occurred in, which is helpful since we are diving through JSON.

Severity: diag.Warning,
Summary: "Bad JSON format",
Detail: "If this is a big tent query, this should be valid JSON. If this is a prometheus query, ignore this.",
AttributePath: path,
Copy link
Contributor

@elainevuong elainevuong Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disclaimer - I'm not sure if this possible, and I don't know too much about it.

Basically - we always perform this ValidateBigTent function, which validates the query field.
We first:
1 - check if it's a string
2 - check if it's valid JSON, if it isn't, we return a warning, even if it's a valid Prometheus query. Since it just returns a warning, it's non-blocking for the provider and it'll continue to process. However, this might be a bit off-putting for existing users. They've never gotten these warnings before, but now they get them but it's not really actionable? Might not be a great experience.
3 - we then check for the presence of a refId, datasource, a type, and a uid.

--
Is there a way we can get around with not returning the warning for valid Prometheus queries? I was looking around, and maybe we could incorporate a DiffSuppressFunc?

SchemaDiffSuppressFunc is a function which can be used to determine whether a detected diff on a schema element is "valid" or not, and suppress it from the plan if necessary.

https://pkg.go.dev/github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema#SchemaDiffSuppressFunc

Pretty much - if we identify that it ISN'T JSON; we assume it's a Prom query, and we allow it go pass through validation.

There are some examples if these DiffSuppressFunc within the terraform provider already that you could reference.

What are your thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could return with no warnings. My thoughts below:

If you are writing a Big tent query you need to provide valid json. We can't distinguish between is this bad JSON and you are writing a big tent query and is this just a prom query.

Big tent users who see this message have queries that will always fail to validate, which could be confusing if you think you've written proper JSON.

I guess it boils down to: Who is it better to confuse?
Big Tent users who will silently pass a tf plan to fail on tf apply?
-OR-
Existing Prom users who see a new warning and may get spooked?

IMO setting up new SLOs for Big Tent is more painful because of the potentially very complex JSON blobs, so I opted for a warning, but it could be that many more users will never even touch big tent SLOs and therefore more users will be annoyed by unactionable warnings when they edit/create SLOs.

My last thought is I'm not even sure validate runs if a tf plan doesnt generate a diff for that field. This would mean unchanged SLOs wouldn't generate warnings. tf validate would generate these warnings everytime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should assume that big tent SLOs are the minority and we shouldn't raise warnings about totally valid promQL SLOs.

This does complicate our ability to raise useful tf-plan warnings for invalid json. One idea would be to use a simple strings.Contains("datasource"...) as a way to test whether a given SLO was "trying to include a source-datasource that is required for big-tent and not part of promql...

@elainevuong
Copy link
Contributor

elainevuong commented Feb 5, 2025

I'd like to see if we can remove the warnings for valid Prometheus queries when we validate the query field. It might not be a fantastic customer experience for existing users, to suddenly get warnings from the Provider for queries that they haven't changed.

Perhaps we can investigate if this DiffSuppressFunc is an option to pursue?

@elainevuong
Copy link
Contributor

Just want to circle back - I verified that if we use the Terraform Provider as is, it DOES give warnings that show up, even if there isn't a diff in the field.

My last thought is I'm not even sure validate runs if a tf plan doesnt generate a diff for that field. This would mean unchanged SLOs wouldn't generate warnings. tf validate would generate these warnings everytime.

With the way this PR is set up currently, we've see this warning in both Ratio SLOs (due to the ValidateBigTent on the success and total metrics) AND in the Freeform Prom QL SLOs. These warnings appear on both "CREATE" of a new SLO, as well as on "UPDATE" of an existing SLO.

I think we need two changes:

  • drop the ValidateBigTent() on both the success_metric and total_metric
  • remove the warning if we can't parse as JSON

screenshots if anyone is interested in viewing what the user experience is like.

CREATE - Ratio SLO

Create - Ratio SLO

CREATE - Adv Freeform Prom SLO

Create - Adv Freeform Prom SLO

UPDATE - Adv Freeform Prom SLO

Update - Existing Adv Freeform SLO

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants