
feat: support new metrics firehose api with get_usage() #404


Open · wants to merge 9 commits into main

Conversation

@toph-allen (Collaborator) commented May 2, 2025

Intent

Adds support for the metrics "firehose" API.

Fixes #390

Approach

Mostly straightforward and following existing patterns in connectapi, with a few strongly-opinionated choices that I hold loosely and would be open to pushback on.

  • The endpoint accepts from and to parameters. If these are timestamps, they are used as-is. If dates (with no specified times) are passed, from is treated as "start of day" and to is treated as "end of day". I think this will match user expectations — imagine selecting a start and an end date in a date range picker.
  • The data returned from the endpoint includes path and user_agent fields nested under a data field. Without special treatment these are returned as list-columns, which is awkward. I initially experimented with tidyr::unnest(), but that was slow on the larger datasets returned by this endpoint, so I wrote a custom fast_unnest_character() function which runs in about 5% (!) of the time that tidyr::unnest() takes.
    • I think the existing parse_connectapi() function could see drastic performance improvements this way too, and those would directly speed up things like the megadashboard. I have been meaning to look into this for a while (investigate parse_connectapi_typed performance #383).
  • I made another change related to fix: minor improvements to time zone parsing #400 — I found that the way connectapi was constructing NA_datetime_, combined with its vctrs-based parsing approach, was forcibly converting all timestamps to UTC regardless of the time zone applied by inner functions.
  • Removed the cyclocomp linter. R6 classes cause weirdness with it, and to work around that, its complexity limit had been set so high that the Connect class has only just inched over it. At that point it isn't worth having the linter enabled at all.
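The from/to coercion described in the first bullet can be sketched as follows. This is an illustrative helper, not the PR's actual implementation — the function name coerce_window is hypothetical, but the behavior (dates widened to start/end of day, timestamps passed through) matches the description above.

```r
# Hypothetical sketch: widen a `Date` bound to the edge of its day,
# leaving POSIXt timestamps untouched.
coerce_window <- function(x, side = c("from", "to")) {
  side <- match.arg(side)
  if (inherits(x, "Date")) {
    # `from` becomes start of day, `to` becomes end of day,
    # interpreted in the caller's time zone.
    time <- if (side == "from") "00:00:00" else "23:59:59"
    as.POSIXct(paste(format(x), time))
  } else {
    x  # POSIXt values are used as-is
  }
}

from <- coerce_window(as.Date("2025-05-02"), "from")
to <- coerce_window(as.Date("2025-05-02"), "to")
format(from, "%H:%M:%S")  # "00:00:00"
format(to, "%H:%M:%S")    # "23:59:59"
```

This matches the date-range-picker intuition: selecting May 2 as both endpoints covers the whole day rather than a zero-length window.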

Checklist

  • Does this change update NEWS.md (referencing the connected issue if necessary)?
  • Does this change need documentation? Have you run devtools::document()?

@toph-allen toph-allen force-pushed the toph/metrics-firehose branch from 1609a9b to 24f9dc3 on May 2, 2025 22:47
@toph-allen toph-allen marked this pull request as ready for review May 2, 2025 23:24
cyclocomp_linter = cyclocomp_linter(30L),
cyclocomp_linter = NULL, # Issues with R6 classes.
Contributor

We can get rid of any::cyclocomp here too, yeah?

extra-packages: local::., any::lintr, any::devtools, any::testthat, any::cyclocomp
needs: lint

Comment on lines +821 to +825
#' @description Get content usage data.
#' @param from Optional `Date` or `POSIXt`; start of the time window. If a
#' `Date`, coerced to `YYYY-MM-DDT00:00:00` in the caller's time zone.
#' @param to Optional `Date` or `POSIXt`; end of the time window. If a
#' `Date`, coerced to `YYYY-MM-DDT23:59:59` in the caller's time zone.
Contributor

I'm not sure this is doing what we expect with timezones. And actually, I'm not even totally sure what is intended. So maybe we should work that out here and then adapt the code to fit? When we send timestamps to Connect with this function do we want them to be transformed to UTC from the caller's local timezone before being sent? Or some other behavior?

One thing to note: if I'm reading this correctly, make_timestamp() behaves slightly differently for character input than for non-character input — a character string is passed through without being parsed and without being transformed to UTC. So this function will inherit that behavior.

connectapi/R/parse.R

Lines 16 to 28 in e8c8075

make_timestamp <- function(input) {
if (is.character(input)) {
# TODO: make sure this is the right timestamp format
return(input)
}
# In the call to `safe_format`:
# - The format specifier adds a literal "Z" to the end of the timestamp, which
# tells Connect "This is UTC".
# - The `tz` argument tells R to produce times in the UTC time zone.
# - The `usetz` argument says "Don't concatenate ' UTC' to the end of the string".
safe_format(input, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC", usetz = FALSE)
}

@@ -101,6 +105,65 @@ parse_connectapi <- function(data) {
))
}

# nolint start
Contributor

What linting are we escaping here?

@@ -365,3 +374,86 @@ test_that("get_content only requests vanity URLs for Connect 2024.06.0 and up",
)
})
})

test_that("get_usage() returns usage data in the expected shape", {
Contributor

Could we add a test where we pass in a character that is a formatted timestamp? (if we are intending to support that, of course)

Comment on lines +135 to +165
fast_unnest_character <- function(df, col_name) {
if (!is.character(col_name)) {
stop("col_name must be a character vector")
}
if (!col_name %in% names(df)) {
stop("col_name is not present in df")
}

list_col <- df[[col_name]]

new_cols <- names(list_col[[1]])

df2 <- df
for (col in new_cols) {
df2[[col]] <- vapply(
list_col,
function(row) {
if (is.null(row[[col]])) {
NA_character_
} else {
row[[col]]
}
},
"1",
USE.NAMES = FALSE
)
}

df2[[col_name]] <- NULL
df2
}
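A small usage sketch for the function above — the fast_unnest_character() definition is repeated from the diff so the example runs standalone, and the sample data frame is illustrative (shaped like the usage endpoint's response, with NULLs standing in for missing fields):

```r
# Definition repeated from the diff above.
fast_unnest_character <- function(df, col_name) {
  if (!is.character(col_name)) {
    stop("col_name must be a character vector")
  }
  if (!col_name %in% names(df)) {
    stop("col_name is not present in df")
  }
  list_col <- df[[col_name]]
  new_cols <- names(list_col[[1]])
  for (col in new_cols) {
    # NULL entries in the list-column become NA in the flat column.
    df[[col]] <- vapply(
      list_col,
      function(row) if (is.null(row[[col]])) NA_character_ else row[[col]],
      character(1),
      USE.NAMES = FALSE
    )
  }
  df[[col_name]] <- NULL
  df
}

# Illustrative data shaped like the endpoint's nested `data` field.
df <- data.frame(id = 1:2)
df$data <- list(
  list(path = "/app", user_agent = "curl"),
  list(path = "/api", user_agent = NULL)
)
out <- fast_unnest_character(df, "data")
out$path        # "/app" "/api"
out$user_agent  # "curl" NA
```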
Contributor

The data returned from the endpoint includes path and user_agent fields nested under a data field. Without special treatment these are returned as list-columns, which is awkward. I initially experimented with tidyr::unnest(), but that was slow on the larger datasets returned by this endpoint, so I wrote a custom fast_unnest_character() function which runs in about 5% (!) of the time that tidyr::unnest() takes.

Thinking about this a bit more: this isn't a huge chunk of code, of course, but it is another chunk we take on maintaining if we go this route. It's another example of how having all data interchange within connectapi be data frames means we have to worry about the performance of reshaping JSON-parsed list responses into data frames, and make sure those data frames have a natural structure for folks to use. If we instead used the parsed list data as our interchange and gave folks as.data.frame() methods, we could defer the (sometimes expensive) reshaping until late in the process, and eking out performance gains like this would matter much less because we could rely on more off-the-shelf tools.
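The alternative proposed here could look something like the following. This is a hedged sketch, not connectapi's design — the class name connect_usage and its fields are hypothetical — but it shows the shape of the idea: keep the parsed JSON (a list of named lists) as the interchange and defer reshaping to an S3 as.data.frame() method:

```r
# Null-coalescing helper (shipped in base R >= 4.4; defined here
# so the sketch is self-contained).
`%||%` <- function(a, b) if (is.null(a)) b else a

# Hypothetical wrapper: hold the parsed list response as-is.
connect_usage <- function(records) {
  structure(list(records = records), class = "connect_usage")
}

# Reshaping happens only when the user asks for a data frame.
as.data.frame.connect_usage <- function(x, ...) {
  data.frame(
    path = vapply(
      x$records, function(r) r$path %||% NA_character_, character(1)
    ),
    user_agent = vapply(
      x$records, function(r) r$user_agent %||% NA_character_, character(1)
    ),
    stringsAsFactors = FALSE
  )
}

usage <- connect_usage(list(
  list(path = "/app", user_agent = "curl"),
  list(path = "/api", user_agent = NULL)
))
d <- as.data.frame(usage)
```

Users who never need a data frame pay nothing, and the reshaping code becomes an ordinary S3 method rather than a hot path inside every API call.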

Successfully merging this pull request may close these issues.

Support upcoming metrics "firehose" v1 endpoint