
Add multi needle and LangSmith evaluator #19

Merged · 12 commits · Mar 6, 2024

Conversation

@rlancemartin (Contributor) commented Mar 6, 2024

Introduce support for third-party evaluators (LangSmith) and add multi-needle support.

Updated README.

Cmd to test:

python main.py --evaluator langsmith --context_lengths_num_intervals 3 --document_depth_percent_intervals 3 --provider openai --model_name "gpt-4-0125-preview" --multi_needle True --eval_set multi-needle-eval-pizza --needles '["Figs are one of the three most delicious pizza toppings.", "Prosciutto is one of the three most delicious pizza toppings.", "Goat cheese is one of the three most delicious pizza toppings."]'

@LazaroHurtado (Contributor) left a comment:

I love seeing the support and attention this project is getting, thanks for trying to make it better!

I would like to start by saying I am not a contributor to this project, but I would like to give some feedback on this PR since it's open source, after all. I'd also like to mention that this is not an exhaustive review, just a first glance over the work.

I think this introduces a pretty big change, essentially moving from a file-based client, where results are stored in JSON files, to running on LangSmith and having the results there.
I believe it would be better to first introduce needles: list[str] | str into LLMNeedleHaystackTester. Then we can work on breaking up how this tester class handles its output, either file-based or LangSmith. Furthermore, since I believe LangSmith requires a LangChain model, we would have to think of a solid way to support evaluator model usage through SDKs (the OpenAI SDK) and frameworks (LangChain). A sketch of that first step follows.
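A minimal sketch of what that first step could look like (hypothetical; the real constructor takes many more arguments, and the default needle here is a placeholder):

from typing import Union

class LLMNeedleHaystackTester:
    def __init__(self, needles: Union[list[str], str] = "placeholder needle"):
        # Normalize to a list so the insertion logic can always iterate;
        # single-needle callers keep working unchanged.
        self.needles = [needles] if isinstance(needles, str) else needles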


class LangSmithEvaluator():

    def __init__(self, model_name: str = "gpt-4-0125-preview", api_key: str = None):
@LazaroHurtado:

is api_key needed?

@rlancemartin (author):

no, optional

@LazaroHurtado:

Since it's not being used anywhere, can it be removed from the constructor?

Comment on lines 47 to 69
## LLM
model = ChatOpenAI(temperature=0, model=model_name)

# Tool
grade_tool_oai = convert_to_openai_tool(grade)

# LLM with tool and enforce invocation
llm_with_tool = model.bind(
    tools=[grade_tool_oai],
    tool_choice={"type": "function", "function": {"name": "grade"}},
)

# Parser
parser_tool = PydanticToolsParser(tools=[grade])

chain = (
    prompt
    | llm_with_tool
    | parser_tool
)

score = chain.invoke({"answer": student_answer,
                      "reference": reference})
@LazaroHurtado:

I think these can all be instantiated in the constructor, correct?
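A rough sketch of that suggestion, reusing the names from the hunk above; prompt and grade are assumed to be defined elsewhere in the module, as in the PR, and the score method name is hypothetical:

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers.openai_tools import PydanticToolsParser
from langchain_core.utils.function_calling import convert_to_openai_tool

class LangSmithEvaluator:
    def __init__(self, model_name: str = "gpt-4-0125-preview"):
        model = ChatOpenAI(temperature=0, model=model_name)
        # Bind the grading tool and enforce its invocation, as in the hunk above.
        llm_with_tool = model.bind(
            tools=[convert_to_openai_tool(grade)],
            tool_choice={"type": "function", "function": {"name": "grade"}},
        )
        # Build the chain once at construction; invoke it per evaluation.
        self.chain = prompt | llm_with_tool | PydanticToolsParser(tools=[grade])

    def score(self, student_answer: str, reference: str):
        return self.chain.invoke({"answer": student_answer, "reference": reference})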

Comment on lines 73 to 80
# Config
evaluation_config = RunEvalConfig(
    custom_evaluators=[score_relevance],
)

client = Client()
run_id = uuid.uuid4().hex[:4]
project_name = "multi-needle-eval"
@LazaroHurtado:

Can these be instantiated in the constructor?

@rlancemartin (author), Mar 6, 2024:

ya, going to use eval_set here; better than hard-coding a project name

Comment on lines 94 to 99
if self.evaluator.__class__.__name__ == "LangSmithEvaluator":
    print("EVALUATOR: LANGSMITH")
    chain = self.model_to_test.get_langchain_runnable(context)
    self.evaluator.evaluate_chain(chain, context_length, depth_percent, self.model_name)
    test_end_time = time.time()
    test_elapsed_time = test_end_time - test_start_time
@LazaroHurtado, Mar 6, 2024:

We shouldn't rely on branching code depending on the evaluator name; instead, they should all inherit from Evaluator so we can introduce abstraction and leverage the common evaluate_model method. One possible shape is sketched below.
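A sketch of that abstraction, using the evaluate_model name mentioned above (the signature is an assumption, not the repo's actual interface):

from abc import ABC, abstractmethod

class Evaluator(ABC):
    @abstractmethod
    def evaluate_model(self, response: str) -> int:
        """Score a model response; each backend supplies its own logic."""

class LangSmithEvaluator(Evaluator):
    def evaluate_model(self, response: str) -> int:
        ...  # LangSmith-specific scoring

# The tester can then call self.evaluator.evaluate_model(response)
# without ever inspecting __class__.__name__.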

@rlancemartin (author):

Ya, this breaks the current Evaluator abstraction.

Worth some discussion; I added more in a different comment.

    test_elapsed_time = test_end_time - test_start_time

else:
    print("EVALUATOR: OpenAI Model")
@LazaroHurtado:

The only other supported evaluator is OpenAI, but this can change in the future, so the else block doesn't have longevity.

@rlancemartin (author):

Ya, we can move to a case statement; this is temporary. A sketch of that dispatch is below.
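One way the dispatch could look with a Python 3.10 match statement on the evaluator's type; OpenAIEvaluator is a placeholder name for the current else branch:

match self.evaluator:
    case LangSmithEvaluator():
        chain = self.model_to_test.get_langchain_runnable(context)
        self.evaluator.evaluate_chain(chain, context_length, depth_percent, self.model_name)
    case OpenAIEvaluator():
        ...  # existing OpenAI-model evaluation path
    case _:
        raise ValueError(f"Unsupported evaluator: {type(self.evaluator).__name__}")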

@LazaroHurtado:

There seem to be some methods in this class that already exist in LLMNeedleHaystackTester; we should use those.

@rlancemartin (author):

Ya, IIRC I only implemented the new functionality needed for multi-needle.

@rlancemartin force-pushed the rlm/add_multi_needle branch from a638f8a to 4c7422f on March 6, 2024 at 05:40
@kedarchandrayan (Collaborator):

Hello @rlancemartin, Kedar here, one of the co-maintainers. Greatly appreciate the introduction of multi-needle support!

Here are my review comments, numbered so that we can reference them easily in any discussion:
Comment 1: In the insert_needles method, depth_percent is presently changed by a delta that is not constant, since it divides the current distance from 100 by the number of needles on every iteration. Instead, the delta for the depth should be calculated before the loop starts so that the same delta gets used for every iteration.

To show the effect of what I mean, I did a brief calculation in Excel and came up with the results below. Column B is the current approach and Column C is the approach I am proposing. You can see that the depths increase more evenly in Column C.

[Screenshot: spreadsheet comparing depth percentages, Column B (current approach) vs Column C (proposed approach)]

@rlancemartin force-pushed the rlm/add_multi_needle branch from 776d822 to 6bef87d on March 6, 2024 at 06:00
@rlancemartin (author) left a comment:

There is a bug here, AFAICT.

Error:

RecursionError: maximum recursion depth exceeded while calling a Python object

Cmd to repro:

python main.py --evaluator langsmith --context_lengths_num_intervals 3 --document_depth_percent_intervals 3 --provider anthropic --model_name "claude-3-opus-20240229" --multi_needle True --eval_set multi-needle-eval-pizza --needles '["Figs are one of the three most delicious pizza toppings.", "Prosciutto is one of the three most delicious pizza toppings.", "Goat cheese is one of the three most delicious pizza toppings."]' --context_lengths_min 50000 --context_lengths_max 100000

@gkamradt, @LazaroHurtado, @kedarchandrayan: have you also seen this?

@rlancemartin (author):

> Comment 1: In insert_needles method, presently the depth_percent is getting changed with a delta which is not constant […]

Yes, thanks! Indeed, the depth will change.

I just made the spacing even across the tokens remaining after the initial insertion.

Easier to follow!

@rlancemartin (author) commented Mar 6, 2024:

> I think this introduces a pretty big change, essentially moving from a file based client […]

Thanks for the feedback!

Ya, this introduces third-party evaluators and also brings in multi-needle support.

@LazaroHurtado (Contributor):

> there is a bug here, afaict […]

please view my comment on this issue: #15

@rlancemartin (author):

> please view my comment on this issue: #15

Nice work.

@rlancemartin (author):

> there is a bug here, afaict […]

Fixed with #16, c/o @LazaroHurtado.

@kedarchandrayan (Collaborator):

I have merged PR #16. Thanks @LazaroHurtado

@kedarchandrayan (Collaborator) commented Mar 6, 2024:

@rlancemartin, let me re-try explaining my point on the depth percent increment bug, which prevents an even distribution in the present code.

In the following screenshot, the yellow-coloured depth percents are from your code. The orange-coloured depth percents, which are actually even and extend the spacing to 100, are from my proposed changes.

[Screenshot: spreadsheet comparing yellow (current) vs orange (proposed) depth percents]

2 points to note:

  • In the yellow-coloured depth percents, the interval keeps decreasing as we go to the next needles. That's why they are not evenly distributed.
  • With the yellow-coloured depth percents, you will never be able to cover the whole document. Our aim was to evenly distribute the needles over the whole document after the start depth percent.

I am proposing the following change; please see the snippet below. Note that the depth percent delta is calculated above the loop, not within the loop as in the present code. Please let me know if you have more questions. I would be happy to help.

        # Insert needles at calculated points
        depth_percent_delta = (100 - depth_percent) / len(self.needles)
        for needle in self.needles:
            tokens_needle = self.model_to_test.encode_text_to_tokens(needle)
            # Insert each needle at its corresponding depth percentage
            # For simplicity, evenly distribute needles throughout the context
            insertion_point = int(len(tokens_context) * (depth_percent / 100))
            tokens_context = tokens_context[:insertion_point] + tokens_needle + tokens_context[insertion_point:]
            depth_percent += depth_percent_delta  # Adjust depth for next needle
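For concreteness, a quick numeric check of the two approaches (illustration only), with a starting depth_percent of 10 and three needles:

needles = ["a", "b", "c"]

# Proposed: delta computed once before the loop -> even 30-point spacing.
depth = 10.0
delta = (100 - depth) / len(needles)  # 30.0
proposed = []
for _ in needles:
    proposed.append(round(depth, 1))
    depth += delta
print(proposed)  # [10.0, 40.0, 70.0]

# Current: delta recomputed inside the loop -> spacing shrinks (30, then 20).
depth = 10.0
current = []
for _ in needles:
    current.append(round(depth, 1))
    depth += (100 - depth) / len(needles)
print(current)  # [10.0, 40.0, 60.0]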

@rlancemartin (author):

> @rlancemartin, let me re-try explaining my point on the depth percent increment bug […]

My latest code should fix this, no? I'll have a look tomorrow. Also, feel free to push a commit to this branch with the code change if my current version is not doing it correctly.

@kedarchandrayan (Collaborator):

Sure @rlancemartin! I have made a commit on your branch with the fix.

@kedarchandrayan (Collaborator):

Hi @rlancemartin, I am done with my review. Please resolve the 2 conflicts; after that, I will merge this PR.

@rlancemartin force-pushed the rlm/add_multi_needle branch from 8127ab8 to 622fb50 on March 6, 2024 at 17:02
@rlancemartin (author):

> I am done with my review. Please resolve the 2 conflicts […]

Thanks! Done.

Note: I have not tested:

1/ multi-needle without the LangSmith evaluator
2/ LangSmith evaluation with a single needle

We can take those up in follow-up PRs (if changes are needed) to get this code in.

@kedarchandrayan merged commit c1a6195 into gkamradt:main on Mar 6, 2024