
Add multi needle and LangSmith evaluator #19

Merged · 12 commits · Mar 6, 2024

Conversation

@rlancemartin (Contributor) commented Mar 6, 2024

Introduce support for third-party evaluators (LangSmith) and add multi-needle support.

Updated README.

Cmd to test:

python main.py --evaluator langsmith --context_lengths_num_intervals 3 --document_depth_percent_intervals 3 --provider openai --model_name "gpt-4-0125-preview" --multi_needle True --eval_set multi-needle-eval-pizza --needles '["Figs are one of the three most delicious pizza toppings.", "Prosciutto is one of the three most delicious pizza toppings.", "Goat cheese is one of the three most delicious pizza toppings."]'

@LazaroHurtado (Contributor) left a comment:

I love seeing the support and attention this project is getting, thanks for trying to make it better!

I would like to start by saying I am not a contributor to this project, but I would like to give some feedback on this PR since it's open source, after all. I'd also like to mention that this is not an exhaustive review, just a first glance over the work.

I think this introduces a pretty big change, essentially moving from a file-based client, where results are stored in JSON files, to running on LangSmith and having the results there.
I believe it would be better to first introduce needles: list[str] | str into LLMNeedleHaystackTester. Then we can work on breaking up how this tester class handles its output, either file-based or LangSmith. Furthermore, since I believe LangSmith requires a LangChain model, we would have to think of a solid way to support evaluator model usage through SDKs (the OpenAI SDK) and frameworks (LangChain). A sketch of that first step follows.
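A minimal sketch of what that first step could look like (hypothetical; the real constructor takes many more arguments, and the default needle here is a placeholder):

from typing import Union

class LLMNeedleHaystackTester:
    def __init__(self, needles: Union[list[str], str] = "placeholder needle"):
        # Normalize to a list so the insertion logic can always iterate;
        # single-needle callers keep working unchanged.
        self.needles = [needles] if isinstance(needles, str) else needles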


class LangSmithEvaluator():

    def __init__(self, model_name: str = "gpt-4-0125-preview", api_key: str = None):
@LazaroHurtado:

is api_key needed?

@rlancemartin (author):

no, optional

@LazaroHurtado:

Since it's not being used anywhere, can it be removed from the constructor?

Comment on lines 47 to 69
## LLM
model = ChatOpenAI(temperature=0, model=model_name)

# Tool
grade_tool_oai = convert_to_openai_tool(grade)

# LLM with tool and enforce invocation
llm_with_tool = model.bind(
    tools=[grade_tool_oai],
    tool_choice={"type": "function", "function": {"name": "grade"}},
)

# Parser
parser_tool = PydanticToolsParser(tools=[grade])

chain = (
    prompt
    | llm_with_tool
    | parser_tool
)

score = chain.invoke({"answer": student_answer,
                      "reference": reference})
@LazaroHurtado:

I think these can all be instantiated in the constructor, correct?
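A rough sketch of that suggestion, reusing the names from the hunk above; prompt and grade are assumed to be defined elsewhere in the module, as in the PR, and the score method name is hypothetical:

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers.openai_tools import PydanticToolsParser
from langchain_core.utils.function_calling import convert_to_openai_tool

class LangSmithEvaluator:
    def __init__(self, model_name: str = "gpt-4-0125-preview"):
        model = ChatOpenAI(temperature=0, model=model_name)
        # Bind the grading tool and enforce its invocation, as in the hunk above.
        llm_with_tool = model.bind(
            tools=[convert_to_openai_tool(grade)],
            tool_choice={"type": "function", "function": {"name": "grade"}},
        )
        # Build the chain once at construction; invoke it per evaluation.
        self.chain = prompt | llm_with_tool | PydanticToolsParser(tools=[grade])

    def score(self, student_answer: str, reference: str):
        return self.chain.invoke({"answer": student_answer, "reference": reference})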

Comment on lines 73 to 80
# Config
evaluation_config = RunEvalConfig(
    custom_evaluators=[score_relevance],
)

client = Client()
run_id = uuid.uuid4().hex[:4]
project_name = "multi-needle-eval"
@LazaroHurtado:

Can these be instantiated in the constructor?

@rlancemartin (author), Mar 6, 2024:

ya, going to use eval_set here; better than hard-coding a project name

Comment on lines 94 to 99
if self.evaluator.__class__.__name__ == "LangSmithEvaluator":
    print("EVALUATOR: LANGSMITH")
    chain = self.model_to_test.get_langchain_runnable(context)
    self.evaluator.evaluate_chain(chain, context_length, depth_percent, self.model_name)
    test_end_time = time.time()
    test_elapsed_time = test_end_time - test_start_time
@LazaroHurtado, Mar 6, 2024:

We shouldn't rely on branching code depending on the evaluator name; instead, they should all inherit from Evaluator so we can introduce abstraction and leverage the common evaluate_model method. One possible shape is sketched below.
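A sketch of that abstraction, using the evaluate_model name mentioned above (the signature is an assumption, not the repo's actual interface):

from abc import ABC, abstractmethod

class Evaluator(ABC):
    @abstractmethod
    def evaluate_model(self, response: str) -> int:
        """Score a model response; each backend supplies its own logic."""

class LangSmithEvaluator(Evaluator):
    def evaluate_model(self, response: str) -> int:
        ...  # LangSmith-specific scoring

# The tester can then call self.evaluator.evaluate_model(response)
# without ever inspecting __class__.__name__.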

@rlancemartin (author):

Ya, this breaks the current Evaluator abstraction.

Worth some discussion; I added more in a different comment.

    test_elapsed_time = test_end_time - test_start_time

else:
    print("EVALUATOR: OpenAI Model")
@LazaroHurtado:

The only other supported evaluator is OpenAI, but this can change in the future, so the else block doesn't have longevity.

@rlancemartin (author):

Ya, we can move to a case statement; this is temporary. A sketch of that dispatch is below.
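One way the dispatch could look with a Python 3.10 match statement on the evaluator's type; OpenAIEvaluator is a placeholder name for the current else branch:

match self.evaluator:
    case LangSmithEvaluator():
        chain = self.model_to_test.get_langchain_runnable(context)
        self.evaluator.evaluate_chain(chain, context_length, depth_percent, self.model_name)
    case OpenAIEvaluator():
        ...  # existing OpenAI-model evaluation path
    case _:
        raise ValueError(f"Unsupported evaluator: {type(self.evaluator).__name__}")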

@LazaroHurtado:

There seem to be some methods in this class that already exist in LLMNeedleHaystackTester; we should use those.

@rlancemartin (author):

Ya, IIRC I only implemented the new functionality needed for multi-needle.

@rlancemartin force-pushed the rlm/add_multi_needle branch from a638f8a to 4c7422f on March 6, 2024 at 05:40
@kedarchandrayan (Collaborator):

Hello @rlancemartin, Kedar here, one of the co-maintainers. Greatly appreciate the introduction of multi-needle support!

Here are my review comments, numbered so that we can reference them easily in any discussion:
Comment 1: In the insert_needles method, depth_percent is presently changed by a delta that is not constant, since it divides the current distance from 100 by the number of needles on every iteration. Instead, the delta for the depth should be calculated before the loop starts so that the same delta gets used for every iteration.

To show the effect of what I mean, I did a brief calculation in Excel and came up with the results below. Column B is the current approach and Column C is the approach I am proposing. You can see that the depths increase more evenly in Column C.

[Screenshot: spreadsheet comparing depth percentages, Column B (current approach) vs Column C (proposed approach)]

@rlancemartin force-pushed the rlm/add_multi_needle branch from 776d822 to 6bef87d on March 6, 2024 at 06:00
@rlancemartin (author) left a comment:

There is a bug here, AFAICT.

Error:

RecursionError: maximum recursion depth exceeded while calling a Python object

Cmd to repro:

python main.py --evaluator langsmith --context_lengths_num_intervals 3 --document_depth_percent_intervals 3 --provider anthropic --model_name "claude-3-opus-20240229" --multi_needle True --eval_set multi-needle-eval-pizza --needles '["Figs are one of the three most delicious pizza toppings.", "Prosciutto is one of the three most delicious pizza toppings.", "Goat cheese is one of the three most delicious pizza toppings."]' --context_lengths_min 50000 --context_lengths_max 100000

@gkamradt, @LazaroHurtado, @kedarchandrayan: have you also seen this?

@rlancemartin (author):

> Comment 1: In insert_needles method, presently the depth_percent is getting changed with a delta which is not constant […]

Yes, thanks! Indeed, the depth will change.

I just made the spacing even across the tokens remaining after the initial insertion.

Easier to follow!

@rlancemartin (author) commented Mar 6, 2024:

> I think this introduces a pretty big change, essentially moving from a file based client […]

Thanks for the feedback!

Ya, this introduces third-party evaluators and also brings in multi-needle support.

@LazaroHurtado (Contributor):

> there is a bug here, afaict […]

please view my comment on this issue: #15

@rlancemartin (author):

> please view my comment on this issue: #15

Nice work.

@rlancemartin (author):

> there is a bug here, afaict […]

Fixed with #16, c/o @LazaroHurtado.

@kedarchandrayan (Collaborator):

I have merged PR #16. Thanks @LazaroHurtado

@kedarchandrayan (Collaborator) commented Mar 6, 2024:

@rlancemartin, let me re-try explaining my point on the depth percent increment bug, which prevents an even distribution in the present code.

In the following screenshot, the yellow-coloured depth percents are from your code. The orange-coloured depth percents, which are actually even and extend the spacing to 100, are from my proposed changes.

[Screenshot: spreadsheet comparing yellow (current) vs orange (proposed) depth percents]

2 points to note:

  • In the yellow-coloured depth percents, the interval keeps decreasing as we go to the next needles. That's why they are not evenly distributed.
  • With the yellow-coloured depth percents, you will never be able to cover the whole document. Our aim was to evenly distribute the needles over the whole document after the start depth percent.

I am proposing the following change; please see the snippet below. Note that the depth percent delta is calculated above the loop, not within the loop as in the present code. Please let me know if you have more questions. I would be happy to help.

        # Insert needles at calculated points
        depth_percent_delta = (100 - depth_percent) / len(self.needles)
        for needle in self.needles:
            tokens_needle = self.model_to_test.encode_text_to_tokens(needle)
            # Insert each needle at its corresponding depth percentage
            # For simplicity, evenly distribute needles throughout the context
            insertion_point = int(len(tokens_context) * (depth_percent / 100))
            tokens_context = tokens_context[:insertion_point] + tokens_needle + tokens_context[insertion_point:]
            depth_percent += depth_percent_delta  # Adjust depth for next needle
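For concreteness, a quick numeric check of the two approaches (illustration only), with a starting depth_percent of 10 and three needles:

needles = ["a", "b", "c"]

# Proposed: delta computed once before the loop -> even 30-point spacing.
depth = 10.0
delta = (100 - depth) / len(needles)  # 30.0
proposed = []
for _ in needles:
    proposed.append(round(depth, 1))
    depth += delta
print(proposed)  # [10.0, 40.0, 70.0]

# Current: delta recomputed inside the loop -> spacing shrinks (30, then 20).
depth = 10.0
current = []
for _ in needles:
    current.append(round(depth, 1))
    depth += (100 - depth) / len(needles)
print(current)  # [10.0, 40.0, 60.0]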

@rlancemartin (author):

> @rlancemartin, let me re-try explaining my point on the depth percent increment bug […]

My latest code should fix this, no? I'll have a look tomorrow. Also, feel free to push a commit to this branch with the code change if my current version is not doing it correctly.

@kedarchandrayan (Collaborator):

Sure @rlancemartin! I have made a commit on your branch with the fix.

@kedarchandrayan (Collaborator):

Hi @rlancemartin, I am done with my review. Please resolve the 2 conflicts; after that, I will merge this PR.

@rlancemartin force-pushed the rlm/add_multi_needle branch from 8127ab8 to 622fb50 on March 6, 2024 at 17:02
@rlancemartin (author):

> I am done with my review. Please resolve the 2 conflicts […]

Thanks! Done.

Note: I have not tested:

1/ multi-needle without the LangSmith evaluator
2/ LangSmith evaluation with a single needle

We can take those up in follow-up PRs (if changes are needed) to get this code in.

@kedarchandrayan merged commit c1a6195 into gkamradt:main on Mar 6, 2024