[BUG] groundedness measure may return NaN #1641

Open
drooms-sandrus opened this issue Nov 11, 2024 · 3 comments
Labels: bug Something isn't working
drooms-sandrus commented Nov 11, 2024

Bug Description
Bug Description
The groundedness_measure_with_cot_reasons_consider_answerability function returns NaN if filter_trivial_statements=True and all statements are filtered out in _remove_trivial_statements(). Similar behaviour likely occurs in other groundedness functions, as they may also filter out all statements.

hypotheses will then be an empty list, so results will be an empty list, so groundedness_scores will be an empty dict; the following mean is then computed over an empty collection, which produces NaN:

average_groundedness_score = float(
    np.mean(list(groundedness_scores.values()))
)
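The empty-mean behaviour is easy to confirm in isolation (a minimal sketch, not TruLens code):

```python
import math

import numpy as np

# When every statement is filtered out, groundedness_scores is empty,
# so the mean is taken over an empty sequence and yields NaN
# (numpy also emits a "Mean of empty slice" RuntimeWarning).
groundedness_scores = {}
average_groundedness_score = float(
    np.mean(list(groundedness_scores.values()))
)
print(math.isnan(average_groundedness_score))  # True
```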

To Reproduce

import boto3
from trulens.providers.bedrock import Bedrock

client = boto3.client(service_name="bedrock-runtime")
bedrock = Bedrock(
    model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",
    client=client,
)

question = "How do you implement a binary search algorithm in Python?"
source = """
* FUNCTION DEFINITION
* def calculate_area(radius):
    return 3.14 * radius ** 2
"""
statement = """
The layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of
"""

bedrock.groundedness_measure_with_cot_reasons_consider_answerability(
    source=source, statement=statement, question=question
)

Expected behavior
I expect 0 to be returned if there are no statements left after filtering non-trivial statements, as NaN is not a value between 0.0 and 1.0 (as documented).

if not hypotheses:
    return 0.0, {"reason": "No non-trivial statements to evaluate"}

Relevant Logs/Tracebacks
Here's the numpy warning:

/home/user/miniconda3/envs/my-env/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:3904: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/home/user/miniconda3/my-env/project/lib/python3.12/site-packages/numpy/_core/_methods.py:147: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)

Environment:

  • OS: Ubuntu 22.04
  • Python Version: 3.12.5
  • TruLens version: 1.2.6
  • numpy: 2.1.3
@sfc-gh-jreini (Contributor) commented

Hey @drooms-sandrus - not sure this is so clear-cut. I agree with your view that an LLM response consisting only of trivial statements is undesirable, but groundedness would not be the right metric to detect that issue.

For examples like yours, where the response contains only trivial statements, I would expect the problem to show up as a lower answer relevance score.

@drooms-sandrus (Author) commented Nov 21, 2024

Hi @sfc-gh-jreini, I agree that answer relevance might be the right metric to detect whether all statements are trivial (or rather, irrelevant to the user query).

The point I'm making is from a software engineering perspective: The documentation states for the return value: "A tuple containing a value between 0.0 (not grounded) and 1.0 (grounded) and a dictionary containing the reasons for the evaluation."

Downstream code can therefore expect the first tuple element to be between 0 and 1. Similarly, downstream code will likely not expect NaN to be returned for the first element, which may lead to issues. If NaN is a desired possible output, I suggest documenting it.
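One concrete hazard (a generic Python illustration, not TruLens code): NaN compares false against everything, so a downstream threshold check silently falls through to whichever branch handles "below threshold":

```python
score = float("nan")

# NaN fails every comparison, including equality with itself,
# so both branches of a threshold check see "False".
print(score >= 0.5)    # False
print(score < 0.5)     # False
print(score == score)  # False
```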

Further, it may be desirable to update the return type hint. While float in Tuple[float, dict] (Tuple is deprecated, btw) also encompasses NaN, explicit is better than implicit. The pydantic annotation AllowInfNan is a suitable candidate to convey that downstream code must handle potential NaN values (alternatively, it is possible to construct and enforce more precise type hints via pydantic, such as a "number between 0 and 1 that allows inf/NaN", but this may be overkill).
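A rough sketch of that idea (hypothetical type name; assumes pydantic v2, where Field supports ge/le bounds plus allow_inf_nan, and TypeAdapter validates standalone annotated types):

```python
from typing import Annotated

from pydantic import Field, TypeAdapter, ValidationError

# Hypothetical score type: a float constrained to [0.0, 1.0]
# that explicitly rejects NaN and infinity.
GroundednessScore = Annotated[float, Field(ge=0.0, le=1.0, allow_inf_nan=False)]

adapter = TypeAdapter(GroundednessScore)
print(adapter.validate_python(0.5))  # 0.5 passes validation

try:
    adapter.validate_python(float("nan"))
except ValidationError:
    print("NaN is rejected with an explicit error")
```

Enforcing the contract at the boundary like this turns a silent NaN into a loud validation error, which is easier for downstream code to handle.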

@sfc-gh-jreini (Contributor) commented

Thanks @drooms-sandrus - I like your suggestion to document that NaN is possible when only trivial statements are evaluated, and to update the type hint. @daniel-huang-1230 what do you think about this change?

@sfc-gh-dhuang sfc-gh-dhuang self-assigned this Nov 21, 2024