[BUG] groundedness measure may return NaN #1641

Open
drooms-sandrus opened this issue Nov 11, 2024 · 3 comments
Labels: bug Something isn't working
drooms-sandrus commented Nov 11, 2024

Bug Description
Bug Description
The groundedness_measure_with_cot_reasons_consider_answerability function returns NaN if filter_trivial_statements=True and all statements are filtered out in _remove_trivial_statements(). Similar behaviour likely occurs in other groundedness functions, as they may also filter out all statements.

hypotheses will then be an empty list, so results will be an empty list, so groundedness_scores will be an empty dict; the following mean is then computed over an empty collection, which produces NaN:

average_groundedness_score = float(
    np.mean(list(groundedness_scores.values()))
)
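The empty-mean behaviour is easy to confirm in isolation (a minimal sketch, not TruLens code):

```python
import math

import numpy as np

# When every statement is filtered out, groundedness_scores is empty,
# so the mean is taken over an empty sequence and yields NaN
# (numpy also emits a "Mean of empty slice" RuntimeWarning).
groundedness_scores = {}
average_groundedness_score = float(
    np.mean(list(groundedness_scores.values()))
)
print(math.isnan(average_groundedness_score))  # True
```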

To Reproduce

import boto3
from trulens.providers.bedrock import Bedrock

client = boto3.client(service_name="bedrock-runtime")
bedrock = Bedrock(
    model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",
    client=client,
)

question = "How do you implement a binary search algorithm in Python?"
source = """
* FUNCTION DEFINITION
* def calculate_area(radius):
    return 3.14 * radius ** 2
"""
statement = """
The layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of abstraction and the layers of
"""

bedrock.groundedness_measure_with_cot_reasons_consider_answerability(
    source=source, statement=statement, question=question
)

Expected behavior
I expect 0 to be returned if there are no statements left after filtering non-trivial statements, as NaN is not a value between 0.0 and 1.0 (as documented).

if not hypotheses:
    return 0.0, {"reason": "No non-trivial statements to evaluate"}

Relevant Logs/Tracebacks
Here's the numpy warning:

/home/user/miniconda3/envs/my-env/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:3904: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/home/user/miniconda3/my-env/project/lib/python3.12/site-packages/numpy/_core/_methods.py:147: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)

Environment:

  • OS: Ubuntu 22.04
  • Python Version: 3.12.5
  • TruLens version: 1.2.6
  • numpy: 2.1.3
@sfc-gh-jreini (Contributor) commented

Hey @drooms-sandrus - not sure this is so clear-cut. I agree with your view that an LLM response consisting only of trivial statements is undesirable, but groundedness would not be the right metric to detect that issue.

For examples like yours, where the response contains only trivial statements, I would expect the problem to show up as a lower answer relevance score.

@drooms-sandrus (Author) commented Nov 21, 2024

Hi @sfc-gh-jreini, I agree that answer relevance might be the right metric to detect whether all statements are trivial (or rather, irrelevant to the user query).

The point I'm making is from a software engineering perspective: The documentation states for the return value: "A tuple containing a value between 0.0 (not grounded) and 1.0 (grounded) and a dictionary containing the reasons for the evaluation."

Downstream code can therefore expect the first tuple element to be between 0 and 1. Similarly, downstream code will likely not expect NaN to be returned for the first element, which may lead to issues. If NaN is a desired possible output, I suggest documenting it.
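One concrete hazard (a generic Python illustration, not TruLens code): NaN compares false against everything, so a downstream threshold check silently falls through to whichever branch handles "below threshold":

```python
score = float("nan")

# NaN fails every comparison, including equality with itself,
# so both branches of a threshold check see "False".
print(score >= 0.5)    # False
print(score < 0.5)     # False
print(score == score)  # False
```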

Further, it may be desirable to update the return type hint. While float in Tuple[float, dict] (Tuple is deprecated, btw) also encompasses NaN, explicit is better than implicit. The pydantic annotation AllowInfNan is a suitable candidate to convey that downstream code must handle potential NaN values (alternatively, it is possible to construct and enforce more precise type hints via pydantic, such as a "number between 0 and 1 that allows inf/NaN", but this may be overkill).
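A rough sketch of that idea (hypothetical type name; assumes pydantic v2, where Field supports ge/le bounds plus allow_inf_nan, and TypeAdapter validates standalone annotated types):

```python
from typing import Annotated

from pydantic import Field, TypeAdapter, ValidationError

# Hypothetical score type: a float constrained to [0.0, 1.0]
# that explicitly rejects NaN and infinity.
GroundednessScore = Annotated[float, Field(ge=0.0, le=1.0, allow_inf_nan=False)]

adapter = TypeAdapter(GroundednessScore)
print(adapter.validate_python(0.5))  # 0.5 passes validation

try:
    adapter.validate_python(float("nan"))
except ValidationError:
    print("NaN is rejected with an explicit error")
```

Enforcing the contract at the boundary like this turns a silent NaN into a loud validation error, which is easier for downstream code to handle.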

@sfc-gh-jreini (Contributor) commented

Thanks @drooms-sandrus - I like your suggestion to document that NaN is possible when only trivial statements are evaluated, and to update the type hint. @daniel-huang-1230 what do you think about this change?

@sfc-gh-dhuang sfc-gh-dhuang self-assigned this Nov 21, 2024