Commit
* if groundedness output is not list, set as list so agg functions properly
* fix 0 resolving to null then -1 bug
* new demo flow
* data + eval questions
* jailbreak question
* flow updates
* eval questions
* honest harmless evals
* Add helpful evals
* update demo notebook
* use cases folder for examples
* slim down quickstarts
* move lc async to exp
* move lc retrieval agent to exp
* lc rag example
* core concept images
* add rag triad core concept
* core concepts work
* core concepts ff
* core concepts in docs
* small changes to 3h core concpet
* new quickstart, rewrite readmes
* small changes to reamde
* add new quickstart to tests
* fix use case paths
* use case paths
* OpenAI v1.x migraation + robust hasattr for instrumentation issues (#555)
* [bot] migrate files
* bump openai versions to >=1.1.1
* Update text2text_quickstart.py
* Update keys.py
* fix format check
* first
* fix imports
* format
* remove openai key setting bc it is now instantiated with client
* remove extra import
* more keys migration
* convert pydantic model resposnes to dict
* update endpoint
* key in client for endpoint
* instrumen the client
* moderation response to dict
* response hadnling
* migrate moderation
* remove old key setting in azure
* logger bug
* remove logger
* remove other loggers
* remove dependency on llama service context
* undo embeddings change
* response handling
* more updates
* instrument client.completions instance, debugging
* update to openai 1.x
* Reverting to instrument module
* update versions
* old bug in appui
* don't use safe_* in Lens
* bug fix and dev notes
* dev notes
* more notes
* bug fixes
* more devnotes
* remove extra prints, convert others to logger.info
* remove unneeded instrument_instance
* remove extra client instantiation, openai imports
* client treatment in openai.py
* Fix openai client Make it a member of openAI endpoint
* fix self-harm moderation
* pin llama_index

Co-authored-by: grit-app[bot] <grit-app[bot]@users.noreply.github.com>
Co-authored-by: Josh Reini <[email protected]>
Co-authored-by: Josh Reini <[email protected]>
Co-authored-by: Shayak Sen <[email protected]>

* vertex quickstart (#563)
* vertex quickstart
* pin versions to b4 openai
* clear outputs
* colab name
* add seed parameter (#560)
* Release 0.18.0 retqadiness (#564) Co-authored-by: Shayak Sen <[email protected]>
* Releases/rc trulens eval 0.18.0 (#566)
* Release 0.18.0 readiness
* Rationalize llama_index langchain

Co-authored-by: Shayak Sen <[email protected]>

* Fix colab links (#568)
* Release 0.18.0 readiness
* Rationalize llama_index langchain
* Fix colab links

Co-authored-by: Shayak Sen <[email protected]>

* Automated File Generation from Docs Notebook Changes (#567) Co-authored-by: shayaks <[email protected]> Co-authored-by: Josh Reini <[email protected]>
* fix conflict
* fix readme conflicts
* small changes to reamde
* symlink new quickstart
* core concepts
* layout, links
* add missing headers to quickstart
* remove langchain rag quickstart
* hf notebook
* prototype evals
* add prototype and hf to testing
* move dashboard app ui to experimental
* remove extra tru.reset_database
* ground truth evals quickstart
* ground truth evals to testing
* symlink groundtruthevals to docs
* notebook links in core concepts ff
* add link to rag triad core concept
* clean up integration quickstarts
* quickstart small changes
* groundedness collect for quickstarts
* collect for groundness exp examples

Co-authored-by: Piotr Mardziel <[email protected]>
Co-authored-by: grit-app[bot] <grit-app[bot]@users.noreply.github.com>
Co-authored-by: Shayak Sen <[email protected]>
Co-authored-by: shayaks <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: shayaks <[email protected]>
1 parent e7b9fc4 · commit a50b0b4 · 43 changed files with 1,630 additions and 166 deletions

## Feedback Functions

Feedback functions, analogous to labeling functions, provide a programmatic method for generating evaluations of an application run. The TruLens implementation of feedback functions wraps a supported provider’s model, such as a relevance model or a sentiment classifier, that is repurposed to provide evaluations. Often, for the most flexibility, this model can be another LLM.

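As a rough sketch of the idea, here is how a feedback function wrapping a provider model might be declared. This assumes the 0.x `trulens_eval` API with an OpenAI key configured; import paths and method names may differ between versions.

```python
from trulens_eval import Feedback
from trulens_eval.feedback.provider import OpenAI

# The provider wraps a model (here an OpenAI LLM) that is repurposed as an evaluator.
provider = OpenAI()

# A feedback function that scores how relevant the app's output is to its input.
# It can be attached to a TruLens recorder so it runs on every application record.
f_answer_relevance = Feedback(provider.relevance).on_input_output()
```
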
It can be useful to think of the range of evaluations along two axes: scalable and meaningful.

![Range of Feedback Functions](../assets/images/Range_of_Feedback_Functions.png)

## Domain Expert (Ground Truth) Evaluations

In early development stages, we recommend starting with domain expert evaluations. These evaluations are often completed by the developers themselves and represent the core use cases your app is expected to handle. This allows you to deeply understand the performance of your app, but lacks scale.

See this [example notebook](./groundtruth_evals.ipynb) to learn how to run ground truth evaluations with TruLens.

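The sketch below shows one way a domain-expert golden set can be turned into a feedback function. The entries are placeholders, and the `GroundTruthAgreement` import path follows the 0.x `trulens_eval` API; see the linked notebook for the canonical version.

```python
from trulens_eval import Feedback
from trulens_eval.feedback import GroundTruthAgreement

# A small golden set curated by the domain expert: core queries the app must
# handle, with the expected responses. These entries are placeholders.
golden_set = [
    {"query": "who invented the lightbulb?", "response": "Thomas Edison"},
    {"query": "where is the Eiffel Tower?", "response": "Paris"},
]

ground_truth = GroundTruthAgreement(golden_set)

# Measures semantic agreement between the app's answer and the expert answer.
f_groundtruth = Feedback(ground_truth.agreement_measure).on_input_output()
```
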
## User Feedback (Human) Evaluations

After you have completed early evaluations and have gained more confidence in your app, it is often useful to gather human feedback. This can often be in the form of binary (up/down) feedback provided by your users. This is slightly more scalable than ground truth evals, but struggles with variance and can still be expensive to collect.

See this [example notebook](./human_feedback.ipynb) to learn how to log human feedback with TruLens.

## Traditional NLP Evaluations

Next, it is common practice to try traditional NLP metrics such as BLEU and ROUGE for evaluation. While these evals are extremely scalable, they are often too syntactic and lack the ability to provide meaningful information about the performance of your app.

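To see why purely syntactic metrics can mislead, consider this small sketch using NLTK's BLEU implementation (any BLEU or ROUGE library would show the same effect):

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "sat", "on", "the", "mat"]]

# A correct paraphrase shares almost no n-grams with the reference...
paraphrase = ["a", "feline", "rested", "on", "the", "rug"]
# ...while an incorrect answer that reuses the wording shares many.
incorrect = ["the", "dog", "sat", "on", "the", "mat"]

print(sentence_bleu(reference, paraphrase))  # near zero despite being correct
print(sentence_bleu(reference, incorrect))   # much higher despite being wrong
```

BLEU rewards n-gram overlap rather than meaning, which is exactly the gap on the meaningfulness axis described above.
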
## Medium Language Model Evaluations

Medium language models (like BERT) can be a sweet spot for LLM app evaluations at scale. This size of model is relatively cheap to run (scalable) and can also provide nuanced, meaningful feedback on your app. In some cases, these models need to be fine-tuned to provide the right feedback for your domain.

TruLens provides a number of out-of-the-box feedback functions that rely on this style of model, such as groundedness NLI, sentiment, language match, moderation and more.

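As one possible example, the Huggingface provider in `trulens_eval` backs feedback functions with smaller hosted models. The import path and method name below follow the 0.x API and may differ in newer releases.

```python
from trulens_eval import Feedback
from trulens_eval.feedback.provider.hf import Huggingface

# The Huggingface provider calls hosted, BERT-sized models rather than a large LLM.
hugs = Huggingface()

# Checks that the response is written in the same language as the question,
# using a language-detection model under the hood.
f_language_match = Feedback(hugs.language_match).on_input_output()
```
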
## Large Language Model Evaluations

Large language models can also provide meaningful and flexible feedback on LLM app performance. Often through simple prompting, LLM-based evaluations agree with human judgments at a very high rate. Additionally, they can be easily augmented with LLM-provided reasoning to justify high or low evaluation scores, which is useful for debugging.

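For instance, the `*_with_cot_reasons` variants of the provider feedback functions (as named in the 0.x `trulens_eval` API) ask the evaluating LLM to return an explanation along with its score, which is the reasoning mentioned above:

```python
from trulens_eval import Feedback
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()

# The evaluating LLM returns a relevance score plus a chain-of-thought
# justification, which TruLens stores alongside the result for debugging.
f_relevance = Feedback(provider.relevance_with_cot_reasons).on_input_output()
```
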
Depending on the size and nature of the LLM, these evaluations can be quite expensive at scale.

See this [example notebook](./quickstart.ipynb) to learn how to run LLM-based evaluations with TruLens.

docs/trulens_eval/core_concepts_honest_harmless_helpful_evals.md (34 additions, 0 deletions)

# Honest, Harmless and Helpful Evaluations

TruLens adapts ‘**honest**, **harmless**, **helpful**’ from Anthropic as desirable criteria for LLM apps. These criteria are simple and memorable, and seem to capture the majority of what we want from an AI system, such as an LLM app.

## TruLens Implementation

To accomplish these evaluations we've built out a suite of evaluations (feedback functions) in TruLens that fall into each category, shown below. These feedback functions provide a starting point for ensuring your LLM app is performant and aligned.

![Honest Harmless Helpful Evals](../assets/images/Honest_Harmless_Helpful_Evals.jpg)

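As an illustrative sketch, a few feedback functions in each category might be wired up as follows. The method names follow the 0.x OpenAI provider in `trulens_eval`; check the current API reference for the full list.

```python
from trulens_eval import Feedback
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()

# Harmless: moderation-style checks on the app's output.
f_hate = Feedback(provider.moderation_hate).on_output()
f_violence = Feedback(provider.moderation_violence).on_output()

# Helpful: does the output actually help with the request?
f_helpfulness = Feedback(provider.helpfulness).on_output()

# Honest: relevance of the answer to the question is one proxy; groundedness
# against retrieved context (see the RAG triad) is another.
f_relevance = Feedback(provider.relevance).on_input_output()
```
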
Here are some very brief notes on these terms from *Anthropic*:

## Honest:

- At its most basic level, the AI should give accurate information. Moreover, it should be calibrated (e.g. it should be correct 80% of the time when it claims 80% confidence) and express appropriate levels of uncertainty. It should express its uncertainty without misleading human users.

- Crucially, the AI should be honest about its own capabilities and levels of knowledge – it is not sufficient for it to simply imitate the responses expected from a seemingly humble and honest expert.

- Ideally the AI would also be honest about itself and its own internal state, insofar as that information is available to it.

## Harmless:

- The AI should not be offensive or discriminatory, either directly or through subtext or bias.

- When asked to aid in a dangerous act (e.g. building a bomb), the AI should politely refuse. Ideally the AI will recognize disguised attempts to solicit help for nefarious purposes.

- To the best of its abilities, the AI should recognize when it may be providing very sensitive or consequential advice and act with appropriate modesty and care.

- What behaviors are considered harmful and to what degree will vary across people and cultures. It will also be context-dependent, i.e. it will depend on the nature of the use.

## Helpful:

- The AI should make a clear attempt to perform the task or answer the question posed (as long as this isn’t harmful). It should do this as concisely and efficiently as possible.

- When more information is required, the AI should ask relevant follow-up questions and obtain necessary details. It should respond with appropriate levels of sensitivity, insight, and discretion.

- Ideally the AI will also re-direct ill-informed requests, e.g. if asked ‘how can I build a website in assembly language’ it might suggest a different approach.

# The RAG Triad

RAGs have become the standard architecture for providing LLMs with context in order to avoid hallucinations. However, even RAGs can suffer from hallucination, as is often the case when retrieval fails to return sufficient context or retrieves irrelevant context that is then woven into the LLM’s response.

TruEra developed the RAG triad to evaluate for hallucinations along each edge of the RAG architecture, shown below:

![RAG Triad](../assets/images/RAG_Triad.jpg)

The RAG triad is made up of 3 evaluations: context relevance, groundedness and answer relevance. Satisfactory evaluations on each provide us confidence that our LLM app is free from hallucination.

## Context Relevance

The first step of any RAG application is retrieval; to verify the quality of our retrieval, we want to make sure that each chunk of context is relevant to the input query. This is critical because this context will be used by the LLM to form an answer, so any irrelevant information in the context could be woven into a hallucination. TruLens enables you to evaluate context relevance by using the structure of the serialized record.

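A sketch of a context relevance feedback in the 0.x `trulens_eval` API is below. `Select.RecordCalls.retrieve.rets` is a placeholder selector; the real path depends on how retrieval appears in your app's serialized record.

```python
import numpy as np
from trulens_eval import Feedback, Select
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()

# Score each retrieved chunk against the user query, then average the scores.
f_context_relevance = (
    Feedback(provider.qs_relevance)
    .on_input()
    .on(Select.RecordCalls.retrieve.rets)  # placeholder: your app's retrieval output
    .aggregate(np.mean)
)
```
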
## Groundedness

After the context is retrieved, it is then formed into an answer by an LLM. LLMs are often prone to stray from the facts provided, exaggerating or expanding their way into a correct-sounding answer. To verify the groundedness of our application, we can separate the response into individual claims and independently search for evidence that supports each one within the retrieved context.

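Continuing the sketch, with class and method names per the 0.x `trulens_eval` API and the context selector again a placeholder:

```python
from trulens_eval import Feedback, Select
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider import OpenAI

# Groundedness splits the response into claims and checks each claim against
# the retrieved context, optionally returning chain-of-thought reasons.
grounded = Groundedness(groundedness_provider=OpenAI())

f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons)
    .on(Select.RecordCalls.retrieve.rets.collect())  # placeholder context selector
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)
```
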
## Answer Relevance

Last, our response still needs to helpfully answer the original question. We can verify this by evaluating the relevance of the final response to the user input.

## Putting it together

By reaching satisfactory evaluations for this triad, we can make a nuanced statement about our application’s correctness: the application is verified to be hallucination-free up to the limit of its knowledge base. In other words, if the vector database contains only accurate information, then the answers provided by the RAG are also accurate.

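Continuing the sketches above, the three feedback functions might be attached to a recorder roughly as follows. Here `rag_chain` is a placeholder LangChain app, `TruLlama` plays the same role for LlamaIndex apps, and the exact recorder API may differ between versions.

```python
from trulens_eval import Feedback, Tru, TruChain
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()
tru = Tru()

# Answer relevance: does the final response address the user's question?
f_answer_relevance = Feedback(provider.relevance).on_input_output()

# f_context_relevance and f_groundedness are defined in the sketches above.
tru_recorder = TruChain(
    rag_chain,
    app_id="rag_v1",
    feedbacks=[f_context_relevance, f_groundedness, f_answer_relevance],
)

with tru_recorder:
    rag_chain.invoke("What does TruLens evaluate?")

tru.run_dashboard()  # inspect the triad scores for each record
```
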
To see the RAG triad in action, check out the [TruLens Quickstart](./quickstart.ipynb).

../../trulens_eval/examples/quickstart/groundtruth_evals.ipynb
../../trulens_eval/examples/quickstart/human_feedback.ipynb