Improving CLU's Metric API #2006

cgarciae · 2022-03-21T23:23:48Z

cgarciae
Mar 21, 2022
Maintainer

Current State

Metric from clu currently exposes the following API:

class Metric:
  @classmethod
  def from_model_output(cls, *args, **kwargs) -> "Metric":
    ...

  def merge(self, other: "Metric") -> "Metric":
    ...

  def compute(self) -> jnp.array:
    ...

Documentation currently suggests they are used like this:

metric = None
for sample, labels in range(data):
  logits = module.apply(variables, sample, ...)
  update = Accuracy.from_model_output(logits=logits, labels=labels)
  metric = update if metric is None else metric.merge(update)
  
epoch_accuracy = metric.compute()

However, this if you try to implement it in terms of a realistic jitted eval_step this pattern become more complex:

@partial(jax.jit, static_argnums=(0, 1))
def eval_step(apply_fn: Callable, MetricClass: Type[Metric], variables, metric: Optional[Metric], sample, labels) -> Metric:

    logits = apply_fn(variables, sample, ...)
    update = MetricClass.from_model_output(logits=logits, labels=labels)
    metric = update if metric is None else metric.merge(update)

    return metric
    
metric = None
for sample, labels in range(data):
    metric = eval_step(module.apply, Accuracy, variables, metric, sample, labels)

epoch_accuracy = metric.compute()

This has the following downsides:

eval_step always has to recompile twice as metric will change from None in the first step to a Metric instance from then on.
Metrics cannot be (easily) parametrized since MetricClass.from_model_output takes no parameters other than the actual values. If we take a look at a more complex metric such as tf.keras.metrics.BinaryIoU we see it has a couple of parameters such as threshold. I might be missing something but I don't see an easy way of implementing this in clu.

Proposal

My suggestion to solve both is the following API:

class Metric(ABC):
     @abstractmethod
    def update(self: M, **kwargs) -> M:
        ...

    @abstractmethod
    def reset(self: M) -> M:
        ...

    @abstractmethod
    def compute(self) -> tp.Any:
        ...

    @abstractmethod
    def merge(self: M, other: M) -> M:
        ...

    def batch_updates(self: M, **kwargs) -> M:
        return self.reset().update(**kwargs)

This has the following differences:

We introduce a reset method which should leave the metric in a neutral/zero state.
update knows how update current state based on incoming values.
from_model_output is replaced with batch_updates which does reset + update.

Example

A simple implementation of Accuracy could be:

@flax.struct.dataclass
class Accuracy(Metric):
    total: Optional[jnp.ndarray] = None
    count: Optional[jnp.ndarray] = None

    def reset(self) -> "Average":
        return self.replace(
            total=jnp.array(0, jnp.float32), 
            count=jnp.array(0, jnp.int32),
        )

    def update(self, logits: jnp.array, labels:jnp.ndarray **_) -> "Average":
        assert self.total is not None and self.count is not None
        values = (logits.argmax(axis=-1) == labels).astype(jnp.float32)
        return self.replace(
            total=self.total + values.sum(), 
            count=self.count + values.size,
        )

    def compute(self) -> jnp.ndarray:
        assert self.total is not None and self.count is not None
        return self.total / self.count

    def merge(self, other):
        assert self.total is not None and self.count is not None
        return self.replace(
            total=self.total + other.total 
            count=self.count + other.count,
        )

Now the previous can be slightly simplified example can be slightly simplified to:

@partial(jax.jit, static_argnums=(0,))
def eval_step(apply_fn: Callable, variables, metric: Metric, sample, labels) -> Metric:

    logits = apply_fn(variables, sample, ...)
    batch_updates = metric.batch_updates(logits=logits, labels=labels) # .batch_updates(...) == .reset().update(...)
    metric = metric.merge(batch_updates)

    return metric
    
metric = Accuracy().reset()

for sample, labels in range(data):
    metric = eval_step(module.apply, variables, metric, sample, labels)

epoch_accuracy = metric.compute()

For a non-distributed setup you can even just use .update directly since you don't need to synchronize metric state (batch_updates) between devices:

def eval_step(apply_fn: Callable, variables, metric: Metric, sample, labels) -> Metric:

    logits = apply_fn(variables, sample, ...)
    metric = metric.update(logits=logits, labels=labels)

    return metric

Parametrized Metrics

Now the obvious benefit of being able to instantiate a Metric from outside is that you can define parametrized metrics e.g. you could implement an Accuracy metric that with a topk parameter:

@flax.struct.dataclass
class Acurracy(Metric):
    total: Optional[jnp.ndarray] = None
    count: Optional[jnp.ndarray] = None
    topk: int = flax.struct.field(pytree_node=False, default=1)
    
    ...

And use it like this:

metric = Accuracy(topk=5).reset()

for sample, labels in range(data):
    metric = eval_step(module.apply, variables, metric, sample, labels)

epoch_accuracy = metric.compute()

Reference Implementation

I've been playing around with this definition of Metric in this non-published repo called flax-tools, you can check the definition of Metric and implementation of a couple of non-trivial metrics ported from Treex in flax_tools/metrics.

cc: @jheek @marcvanzee

andsteing · 2022-03-22T11:14:56Z

andsteing
Mar 22, 2022
Maintainer

Thanks a lot for the detailed feedback, even including a nice proposal how to update the API!

Classes vs. Instances

The biggest difference in your design, as compared to the existing clu.metrics is that you would have metrics represented as object instances (with state and parametrization values), while the existing API defines metrics as classes.

The advantage of using classes is that we declare collections of metrics declaratively, which feels very Flaxy (from clu.metrics module pydoc):

@flax.struct.dataclass
class Metrics(metrics.Collection):
  accuracy: metrics.Accuracy
  loss: metrics.Average.from_output("loss")
  loss_std: metrics.Std.from_output("loss")

Parametrized Metrics

There are different ways to parametriz an existing clu.metrics.Metric - one is the example above, where Metric.from_output() returns a FromOutput metric that is defined in the function and takes the parameter value from the closure.

Efficiency

Thanks for highlighting the issue with the jitted eval_step(). CLU should indeed provide a way of allowing to implement this without recompile...

A minimal way of extending the existing API would be to add the following class method:

class Metric:

  # [...]

  @classmethod
  def empty(cls) -> "Metric":
    """Returns an empty instance (i.e. `.merge(Metric.empty())` is a no-op)."""
    raise NotImplementedError("Must override empty()")

Every sub-classed metric that wants to make use of the added API would then need to implement this new class method. For example, the clu.metrics.Average metric:

@flax.struct.dataclass
class Average(Metric):

  # [...]

  @classmethod
  def empty(cls) -> Metric:
    return cls(total=jnp.array(0, jnp.float32), count=jnp.array(0, jnp.int32))

Finally, the class clu.metrics.Collection can be updated similarly:

@flax.struct.dataclass
class Collection:

  # [...]

  @classmethod
  def empty(cls) -> "Collection":
    return cls(
        _reduction_counter=_ReductionCounter(jnp.array(1)),
        **{
            metric_name: metric.empty()
            for metric_name, metric in cls.__annotations__.items()
        })

So finally we can move the Metric.merge() computations inside the jitted eval_step() - adapted the code snippet from the clu.metrics module pydoc following your suggestion above:

  from clu import metrics
  import flax
  import jax

  @flax.struct.dataclass  # required for jax.tree_*
  class Metrics(metrics.Collection):
    accuracy: metrics.Accuracy
    loss: metrics.Average.from_output("loss")
    loss_std: metrics.Std.from_output("loss")

  def eval_step(ms, model, variables, inputs, labels):
    loss, logits = get_loss_and_logits(model, variables, inputs, labels)
    return ms.merge(Metrics.gather_from_model_output(
        loss=loss, logits=logits, labels=labels))

  p_eval_step = jax.pmap(
      eval_step, axis_name="batch", static_broadcasted_argnums=0)

  def evaluate(model, p_variables, test_ds):
    ms = Metrics.empty()
    for inputs, labels in test_ds:
      ms = flax.jax_utils.unreplicate(
          p_eval_step(ms, model, p_variables, inputs, labels))
    return ms.compute()

Summary

I think this small API extension would address your concerns 2 (concern 1 is already covered in the existin API). I would prefer to keep as much as possible from the existing API because we already have a lot of users using that API and updating them to a new API would be very costly. Even worse, the functionality provided by clu.metrics is minimal by design, so many users have been using the API to define their custom metrics, which would all need to be updated as well if we decided to update the API.

The proposed API change is purely additional, so users who would do the metric summation outside their jitted eval_step() (as proposed by the current module pydoc) would not have to update their code or metric implementations. Users who think they could get a speed benefit from updating their code to the new way outlined above (i.e. computing the metric aggregation inside the eval_step()) on the other hand could update their code and metrics with a minimal effort.

1 reply

andsteing Mar 25, 2022
Maintainer

clu.metrics.Metric.empty() and clu.metrics.Metric.empty() were added in
google/CommonLoopUtils@89079c2
(not included in a new release yet)

cgarciae · 2022-03-22T17:17:43Z

cgarciae
Mar 22, 2022
Maintainer Author

Hey @andsteing, thanks for the detailed response!

I understand that drastically changing the API might be challenging or even impossible given it could break Google internal code. I do have a couple of additional points I will mention but in the end I think your .empty() idea effectively improves the situation.

Typing

When I was playing with clu I remember that this type of code:

@flax.struct.dataclass
class Metrics(metrics.Collection):
  accuracy: metrics.Accuracy
  loss: metrics.Average.from_output("loss")
  loss_std: metrics.Std.from_output("loss")

Is actually disliked/not encouraged by some Python linters, I don't know if there is a PEP for this but Pylance warn against this pattern:

Have not checked with mypy but to make pylance happy you just have to define them beforehand, not a big deal but not ideal either:

Empty

I like this solution. The nice thing about having access to an instance thanks to .empty is that you don't have to hard code any specific collection type via a capture or pass the class via static argnums, since instances have access to class methods you can just rewrite your example like this:

def eval_step(ms: Collection, model, variables, inputs, labels):
    loss, logits = get_loss_and_logits(model, variables, inputs, labels)
    updates = ms.gather_from_model_output(loss=loss, logits=logits, labels=labels)
    return ms.merge(updates)

I like it.

Parametrization via function local classes

Edit: No longer relevant as I saw the strategy in a colab you shared:

show original response

Tried to create a simple (possibly flawed) `BinaryAccuracy` with a `threshold` parameter in `clu`, this is what I got:

@flax.struct.dataclass
class BinaryAccuracy(metrics.Average):
    
    @classmethod
    def from_model_outputs(cls, logits: jnp.ndarray, labels: jnp.ndarray, **kwargs):
        values= ((logits > 0.5) == labels).astype(jnp.float32)
        return super().from_model_output(values, **kwargs)

    @staticmethod
    def with_params(threshold: float = 0.5):
        
        @flax.struct.dataclass
        class BinaryAccuracyWithParams(metrics.Average):
            @classmethod
            def from_model_outputs(cls, logits: jnp.ndarray, labels: jnp.ndarray, **kwargs):
                values= ((logits > threshold) == labels).astype(jnp.float32)
                return super().from_model_output(values, **kwargs)


        return BinaryAccuracyWithParams

I am not to happy about the approach, maybe it can be cleaned up to avoid code duplication but it feels a bit more complex than having instances which by nature are easy to parametrize.

Some thoughts (opinion)

Feel free to ignore this section, it just some random thoughts I've had during the process.

Asymmetry between Metric and Collection APIs

I am very curious why either Collection doesn't inherit from Metric or Collection's methods are not name exactly as the ones in Metric? It was a bit surprising that gather_from_model_output wasn't simply named from_model_output, is there a strong reason for this or its just how things came to be?

Collection-like API via instances

This is probably not important but I'll just mention this in case you are interested, I did try to mimic Collection via a Metrics class, similar idea but works on instances:

show code

import numpy as np
from flax_tools.metrics import Metrics, Accuracy, Mean

loss = np.random.uniform(size=(10,))
logits = np.random.uniform(size=(10, 10))
labels = np.random.randint(0, 10, size=(10,))

metrics = Metrics.new(
    [
        Accuracy.new(),
        Mean.new(name="loss").on_args("loss"),
    ]
).reset()


metrics = metrics.update(preds=logits, target=labels, loss=loss)

logs = metrics.compute()  # e.g: {'accuracy': 0.3, 'loss': 0.47997332}

1 reply

andsteing Mar 23, 2022
Maintainer

Short followups on your comments above:

I use pytype for development and loss: metrics.Average.from_output("loss") is accepted; from a syntax point of view I think the statement is sound, so maybe that's more of a pylance problem; though defining metrics on a separate line as you suggest SGTM
Metric vs. Collection - I was thinking of these two classes actually providing different functionality: the Metric's responsibility would simply be to sum up intermediate results to produce a final m etrics; the Collection's responsibility would be to 1. contain a collection of metrics and 2. to provide an easy interface to gather data inside a pmap() ... from a design point of view I guess we could have decided to move everything into Metric and make it a bit larger, and then have Collection simply be a special metric containing sub-metrics; in my experience allowing nested dicts where it's not needed (e.g. as doestfds features) makes the downstream code more complicated (handling a tree instead of a simple dict), but that's definitely a point that can be discussed. Maybe this should be clarified a bit in the clu docs?

For completeness, below the code from the privately shared Colab about Metric parametrization

from clu import metrics
import flax
import jax.numpy as jnp


def binary_accuracy(threshold):

  @flax.struct.dataclass
  class BinaryThresholdedAccuracy(metrics.Average):
    @property
    def threshold(self):
      return threshold
    @classmethod
    def from_model_output(cls, *, logits, labels, **kwargs):
      assert labels.ndim == 1 and logits.ndim == 2 and logits.shape[1] == 2, (labels.shape, logits.shape)
      return super().from_model_output(
          values=((logits[:, 0] >= threshold) == labels).astype(jnp.float32))

  return BinaryThresholdedAccuracy


@flax.struct.dataclass
class Metrics(metrics.Collection):
  acc: metrics.Accuracy
  bin_thr_acc_1: binary_accuracy(0.5)
  bin_thr_acc_2: binary_accuracy(0.0)


ms = Metrics.single_from_model_output(
    labels=jnp.array([0, 0, 1]),
    logits=jnp.array([[0, 1.], [.1, 0], [0, 1.]]),
)

ms.compute_values()

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving CLU's Metric API #2006

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 2 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Improving CLU's Metric API #2006

cgarciae Mar 21, 2022 Maintainer

Current State

Proposal

Example

Parametrized Metrics

Reference Implementation

Replies: 2 comments · 2 replies

andsteing Mar 22, 2022 Maintainer

Classes vs. Instances

Parametrized Metrics

Efficiency

Summary

andsteing Mar 25, 2022 Maintainer

cgarciae Mar 22, 2022 Maintainer Author

Typing

Empty

Parametrization via function local classes

Some thoughts (opinion)

Asymmetry between Metric and Collection APIs

Collection-like API via instances

andsteing Mar 23, 2022 Maintainer

cgarciae
Mar 21, 2022
Maintainer

Replies: 2 comments 2 replies

andsteing
Mar 22, 2022
Maintainer

andsteing Mar 25, 2022
Maintainer

cgarciae
Mar 22, 2022
Maintainer Author

andsteing Mar 23, 2022
Maintainer