Commit
* if groundedness output is not list, set as list so agg functions properly
* fix 0 resolving to null then -1 bug
* new demo flow
* data + eval questions
* jailbreak question
* flow updates
* eval questions
* honest harmless evals
* Add helpful evals
* update demo notebook
* use cases folder for examples
* slim down quickstarts
* move lc async to exp
* move lc retrieval agent to exp
* lc rag example
* core concept images
* add rag triad core concept
* core concepts work
* core concepts ff
* core concepts in docs
* small changes to 3h core concpet
* new quickstart, rewrite readmes
* small changes to reamde
* add new quickstart to tests
* fix use case paths
* use case paths
* OpenAI v1.x migraation + robust hasattr for instrumentation issues (#555)
* [bot] migrate files
* bump openai versions to >=1.1.1
* Update text2text_quickstart.py
* Update keys.py
* fix format check
* first
* fix imports
* format
* remove openai key setting bc it is now instantiated with client
* remove extra import
* more keys migration
* convert pydantic model resposnes to dict
* update endpoint
* key in client for endpoint
* instrumen the client
* moderation response to dict
* response hadnling
* migrate moderation
* remove old key setting in azure
* logger bug
* remove logger
* remove other loggers
* remove dependency on llama service context
* undo embeddings change
* response handling
* more updates
* instrument client.completions instance, debugging
* update to openai 1.x
* Reverting to instrument module
* update versions
* old bug in appui
* don't use safe_* in Lens
* bug fix and dev notes
* dev notes
* more notes
* bug fixes
* more devnotes
* remove extra prints, convert others to logger.info
* remove unneeded instrument_instance
* remove extra client instantiation, openai imports
* client treatment in openai.py
* Fix openai client Make it a member of openAI endpoint
* fix self-harm moderation
* pin llama_index

Co-authored-by: grit-app[bot] <grit-app[bot]@users.noreply.github.com>
Co-authored-by: Josh Reini <[email protected]>
Co-authored-by: Josh Reini <[email protected]>
Co-authored-by: Shayak Sen <[email protected]>

* vertex quickstart (#563)
* vertex quickstart
* pin versions to b4 openai
* clear outputs
* colab name
* add seed parameter (#560)
* Release 0.18.0 retqadiness (#564) Co-authored-by: Shayak Sen <[email protected]>
* Releases/rc trulens eval 0.18.0 (#566)
* Release 0.18.0 readiness
* Rationalize llama_index langchain

Co-authored-by: Shayak Sen <[email protected]>

* Fix colab links (#568)
* Release 0.18.0 readiness
* Rationalize llama_index langchain
* Fix colab links

Co-authored-by: Shayak Sen <[email protected]>

* Automated File Generation from Docs Notebook Changes (#567) Co-authored-by: shayaks <[email protected]> Co-authored-by: Josh Reini <[email protected]>
* fix conflict
* fix readme conflicts
* small changes to reamde
* symlink new quickstart
* core concepts
* layout, links
* add missing headers to quickstart
* remove langchain rag quickstart
* hf notebook
* prototype evals
* add prototype and hf to testing
* move dashboard app ui to experimental
* remove extra tru.reset_database
* ground truth evals quickstart
* ground truth evals to testing
* symlink groundtruthevals to docs
* notebook links in core concepts ff
* add link to rag triad core concept
* clean up integration quickstarts
* quickstart small changes
* groundedness collect for quickstarts
* collect for groundness exp examples

Co-authored-by: Piotr Mardziel <[email protected]>
Co-authored-by: grit-app[bot] <grit-app[bot]@users.noreply.github.com>
Co-authored-by: Shayak Sen <[email protected]>
Co-authored-by: shayaks <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: shayaks <[email protected]>
1 parent e7b9fc4 · commit a50b0b4 · 43 changed files with 1,630 additions and 166 deletions

## Feedback Functions

Feedback functions, analogous to labeling functions, provide a programmatic method for generating evaluations of an application run. The TruLens implementation of feedback functions wraps a supported provider’s model, such as a relevance model or a sentiment classifier, that is repurposed to provide evaluations. Often, for the most flexibility, this model can be another LLM.

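As a rough sketch of the idea, here is how a feedback function wrapping a provider model might be declared. This assumes the 0.x `trulens_eval` API with an OpenAI key configured; import paths and method names may differ between versions.

```python
from trulens_eval import Feedback
from trulens_eval.feedback.provider import OpenAI

# The provider wraps a model (here an OpenAI LLM) that is repurposed as an evaluator.
provider = OpenAI()

# A feedback function that scores how relevant the app's output is to its input.
# It can be attached to a TruLens recorder so it runs on every application record.
f_answer_relevance = Feedback(provider.relevance).on_input_output()
```
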
It can be useful to think of the range of evaluations along two axes: scalable and meaningful.

![Range of Feedback Functions](../assets/images/Range_of_Feedback_Functions.png)

## Domain Expert (Ground Truth) Evaluations

In early development stages, we recommend starting with domain expert evaluations. These evaluations are often completed by the developers themselves and represent the core use cases your app is expected to handle. This allows you to deeply understand the performance of your app, but lacks scale.

See this [example notebook](./groundtruth_evals.ipynb) to learn how to run ground truth evaluations with TruLens.

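The sketch below shows one way a domain-expert golden set can be turned into a feedback function. The entries are placeholders, and the `GroundTruthAgreement` import path follows the 0.x `trulens_eval` API; see the linked notebook for the canonical version.

```python
from trulens_eval import Feedback
from trulens_eval.feedback import GroundTruthAgreement

# A small golden set curated by the domain expert: core queries the app must
# handle, with the expected responses. These entries are placeholders.
golden_set = [
    {"query": "who invented the lightbulb?", "response": "Thomas Edison"},
    {"query": "where is the Eiffel Tower?", "response": "Paris"},
]

ground_truth = GroundTruthAgreement(golden_set)

# Measures semantic agreement between the app's answer and the expert answer.
f_groundtruth = Feedback(ground_truth.agreement_measure).on_input_output()
```
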
## User Feedback (Human) Evaluations

After you have completed early evaluations and have gained more confidence in your app, it is often useful to gather human feedback. This can often be in the form of binary (up/down) feedback provided by your users. This is slightly more scalable than ground truth evals, but struggles with variance and can still be expensive to collect.

See this [example notebook](./human_feedback.ipynb) to learn how to log human feedback with TruLens.

## Traditional NLP Evaluations

Next, it is common practice to try traditional NLP metrics such as BLEU and ROUGE for evaluation. While these evals are extremely scalable, they are often too syntactic and lack the ability to provide meaningful information about the performance of your app.

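To see why purely syntactic metrics can mislead, consider this small sketch using NLTK's BLEU implementation (any BLEU or ROUGE library would show the same effect):

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "sat", "on", "the", "mat"]]

# A correct paraphrase shares almost no n-grams with the reference...
paraphrase = ["a", "feline", "rested", "on", "the", "rug"]
# ...while an incorrect answer that reuses the wording shares many.
incorrect = ["the", "dog", "sat", "on", "the", "mat"]

print(sentence_bleu(reference, paraphrase))  # near zero despite being correct
print(sentence_bleu(reference, incorrect))   # much higher despite being wrong
```

BLEU rewards n-gram overlap rather than meaning, which is exactly the gap on the meaningfulness axis described above.
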
## Medium Language Model Evaluations

Medium language models (like BERT) can be a sweet spot for LLM app evaluations at scale. This size of model is relatively cheap to run (scalable) and can also provide nuanced, meaningful feedback on your app. In some cases, these models need to be fine-tuned to provide the right feedback for your domain.

TruLens provides a number of out-of-the-box feedback functions that rely on this style of model, such as groundedness NLI, sentiment, language match, moderation and more.

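As one possible example, the Huggingface provider in `trulens_eval` backs feedback functions with smaller hosted models. The import path and method name below follow the 0.x API and may differ in newer releases.

```python
from trulens_eval import Feedback
from trulens_eval.feedback.provider.hf import Huggingface

# The Huggingface provider calls hosted, BERT-sized models rather than a large LLM.
hugs = Huggingface()

# Checks that the response is written in the same language as the question,
# using a language-detection model under the hood.
f_language_match = Feedback(hugs.language_match).on_input_output()
```
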
## Large Language Model Evaluations

Large language models can also provide meaningful and flexible feedback on LLM app performance. Often through simple prompting, LLM-based evaluations agree with human judgments at a very high rate. Additionally, they can be easily augmented with LLM-provided reasoning to justify high or low evaluation scores, which is useful for debugging.

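For instance, the `*_with_cot_reasons` variants of the provider feedback functions (as named in the 0.x `trulens_eval` API) ask the evaluating LLM to return an explanation along with its score, which is the reasoning mentioned above:

```python
from trulens_eval import Feedback
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()

# The evaluating LLM returns a relevance score plus a chain-of-thought
# justification, which TruLens stores alongside the result for debugging.
f_relevance = Feedback(provider.relevance_with_cot_reasons).on_input_output()
```
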
Depending on the size and nature of the LLM, these evaluations can be quite expensive at scale.

See this [example notebook](./quickstart.ipynb) to learn how to run LLM-based evaluations with TruLens.

docs/trulens_eval/core_concepts_honest_harmless_helpful_evals.md (34 additions, 0 deletions)

# Honest, Harmless and Helpful Evaluations

TruLens adapts ‘**honest**, **harmless**, **helpful**’ from Anthropic as desirable criteria for LLM apps. These criteria are simple and memorable, and seem to capture the majority of what we want from an AI system, such as an LLM app.

## TruLens Implementation

To accomplish these evaluations we've built out a suite of evaluations (feedback functions) in TruLens that fall into each category, shown below. These feedback functions provide a starting point for ensuring your LLM app is performant and aligned.

![Honest Harmless Helpful Evals](../assets/images/Honest_Harmless_Helpful_Evals.jpg)

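As an illustrative sketch, a few feedback functions in each category might be wired up as follows. The method names follow the 0.x OpenAI provider in `trulens_eval`; check the current API reference for the full list.

```python
from trulens_eval import Feedback
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()

# Harmless: moderation-style checks on the app's output.
f_hate = Feedback(provider.moderation_hate).on_output()
f_violence = Feedback(provider.moderation_violence).on_output()

# Helpful: does the output actually help with the request?
f_helpfulness = Feedback(provider.helpfulness).on_output()

# Honest: relevance of the answer to the question is one proxy; groundedness
# against retrieved context (see the RAG triad) is another.
f_relevance = Feedback(provider.relevance).on_input_output()
```
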
Here are some very brief notes on these terms from *Anthropic*:

## Honest:

- At its most basic level, the AI should give accurate information. Moreover, it should be calibrated (e.g. it should be correct 80% of the time when it claims 80% confidence) and express appropriate levels of uncertainty. It should express its uncertainty without misleading human users.

- Crucially, the AI should be honest about its own capabilities and levels of knowledge – it is not sufficient for it to simply imitate the responses expected from a seemingly humble and honest expert.

- Ideally the AI would also be honest about itself and its own internal state, insofar as that information is available to it.

## Harmless:

- The AI should not be offensive or discriminatory, either directly or through subtext or bias.

- When asked to aid in a dangerous act (e.g. building a bomb), the AI should politely refuse. Ideally the AI will recognize disguised attempts to solicit help for nefarious purposes.

- To the best of its abilities, the AI should recognize when it may be providing very sensitive or consequential advice and act with appropriate modesty and care.

- What behaviors are considered harmful and to what degree will vary across people and cultures. It will also be context-dependent, i.e. it will depend on the nature of the use.

## Helpful:

- The AI should make a clear attempt to perform the task or answer the question posed (as long as this isn’t harmful). It should do this as concisely and efficiently as possible.

- When more information is required, the AI should ask relevant follow-up questions and obtain necessary details. It should respond with appropriate levels of sensitivity, insight, and discretion.

- Ideally the AI will also re-direct ill-informed requests, e.g. if asked ‘how can I build a website in assembly language’ it might suggest a different approach.

# The RAG Triad

RAGs have become the standard architecture for providing LLMs with context in order to avoid hallucinations. However, even RAGs can suffer from hallucination, as is often the case when retrieval fails to return sufficient context or retrieves irrelevant context that is then woven into the LLM’s response.

TruEra developed the RAG triad to evaluate for hallucinations along each edge of the RAG architecture, shown below:

![RAG Triad](../assets/images/RAG_Triad.jpg)

The RAG triad is made up of 3 evaluations: context relevance, groundedness and answer relevance. Satisfactory evaluations on each provide us confidence that our LLM app is free from hallucination.

## Context Relevance

The first step of any RAG application is retrieval; to verify the quality of our retrieval, we want to make sure that each chunk of context is relevant to the input query. This is critical because this context will be used by the LLM to form an answer, so any irrelevant information in the context could be woven into a hallucination. TruLens enables you to evaluate context relevance by using the structure of the serialized record.

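A sketch of a context relevance feedback in the 0.x `trulens_eval` API is below. `Select.RecordCalls.retrieve.rets` is a placeholder selector; the real path depends on how retrieval appears in your app's serialized record.

```python
import numpy as np
from trulens_eval import Feedback, Select
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()

# Score each retrieved chunk against the user query, then average the scores.
f_context_relevance = (
    Feedback(provider.qs_relevance)
    .on_input()
    .on(Select.RecordCalls.retrieve.rets)  # placeholder: your app's retrieval output
    .aggregate(np.mean)
)
```
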
## Groundedness

After the context is retrieved, it is then formed into an answer by an LLM. LLMs are often prone to stray from the facts provided, exaggerating or expanding their way into a correct-sounding answer. To verify the groundedness of our application, we can separate the response into individual claims and independently search for evidence that supports each one within the retrieved context.

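Continuing the sketch, with class and method names per the 0.x `trulens_eval` API and the context selector again a placeholder:

```python
from trulens_eval import Feedback, Select
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider import OpenAI

# Groundedness splits the response into claims and checks each claim against
# the retrieved context, optionally returning chain-of-thought reasons.
grounded = Groundedness(groundedness_provider=OpenAI())

f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons)
    .on(Select.RecordCalls.retrieve.rets.collect())  # placeholder context selector
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)
```
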
## Answer Relevance

Last, our response still needs to helpfully answer the original question. We can verify this by evaluating the relevance of the final response to the user input.

## Putting it together

By reaching satisfactory evaluations for this triad, we can make a nuanced statement about our application’s correctness: the application is verified to be hallucination-free up to the limit of its knowledge base. In other words, if the vector database contains only accurate information, then the answers provided by the RAG are also accurate.

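Continuing the sketches above, the three feedback functions might be attached to a recorder roughly as follows. Here `rag_chain` is a placeholder LangChain app, `TruLlama` plays the same role for LlamaIndex apps, and the exact recorder API may differ between versions.

```python
from trulens_eval import Feedback, Tru, TruChain
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()
tru = Tru()

# Answer relevance: does the final response address the user's question?
f_answer_relevance = Feedback(provider.relevance).on_input_output()

# f_context_relevance and f_groundedness are defined in the sketches above.
tru_recorder = TruChain(
    rag_chain,
    app_id="rag_v1",
    feedbacks=[f_context_relevance, f_groundedness, f_answer_relevance],
)

with tru_recorder:
    rag_chain.invoke("What does TruLens evaluate?")

tru.run_dashboard()  # inspect the triad scores for each record
```
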
To see the RAG triad in action, check out the [TruLens Quickstart](./quickstart.ipynb).

../../trulens_eval/examples/quickstart/groundtruth_evals.ipynb
../../trulens_eval/examples/quickstart/human_feedback.ipynb