Add multi needle and LangSmith evaluator #19
Conversation
I love seeing the support and attention this project is getting, thanks for trying to make it better!

I would like to start by saying I am not a contributor to this project, but I would like to give some feedback on this PR since it's open source after all. I would also like to mention that this is not an exhaustive review, just a first glance over the work.

I think this introduces a pretty big change, essentially moving from a file-based client where results are stored in JSON files to running on LangSmith and having the results there.

I believe it would be better to first introduce `needles: list[str] | str` into `LLMNeedleHaystackTester`, along the lines of the sketch below. Then we can work on breaking up how this tester class handles its output, either file-based or LangSmith. Furthermore, since I believe LangSmith requires a LangChain model, we would have to think of a solid way to support evaluator model usage through SDKs (OpenAI SDK) and frameworks (LangChain).
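A minimal sketch of that signature change (illustrative only; everything other than the `needles` parameter is an assumption, not the repo's actual API):

```python
from typing import List, Union

class LLMNeedleHaystackTester:
    def __init__(self, needles: Union[List[str], str], **kwargs):
        # Accept either a single needle or several; normalize to a list so
        # the rest of the tester has one code path for both cases.
        self.needles = [needles] if isinstance(needles, str) else list(needles)
```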
```python
class LangSmithEvaluator():

    def __init__(self, model_name: str = "gpt-4-0125-preview", api_key: str = None):
```
is api_key needed?
no, optional
Since it's not being used anywhere, can it be removed from the constructor?
```python
## LLM
model = ChatOpenAI(temperature=0, model=model_name)

# Tool
grade_tool_oai = convert_to_openai_tool(grade)

# LLM with tool and enforce invocation
llm_with_tool = model.bind(
    tools=[grade_tool_oai],
    tool_choice={"type": "function", "function": {"name": "grade"}},
)

# Parser
parser_tool = PydanticToolsParser(tools=[grade])

chain = (
    prompt
    | llm_with_tool
    | parser_tool
)

score = chain.invoke({"answer": student_answer,
                      "reference": reference})
```
I think these can all be instantiated in the constructor, correct? Something like the sketch below.
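(A rough, untested sketch of that refactor, reusing the objects from the diff above — `ChatOpenAI`, `grade`, `prompt`, etc. are assumed to be in scope, as in the original snippet:)

```python
class LangSmithEvaluator():

    def __init__(self, model_name: str = "gpt-4-0125-preview"):
        model = ChatOpenAI(temperature=0, model=model_name)
        grade_tool_oai = convert_to_openai_tool(grade)
        llm_with_tool = model.bind(
            tools=[grade_tool_oai],
            tool_choice={"type": "function", "function": {"name": "grade"}},
        )
        parser_tool = PydanticToolsParser(tools=[grade])
        # Built once here; each evaluation only invokes self.chain.
        self.chain = prompt | llm_with_tool | parser_tool

    def score(self, student_answer: str, reference: str):
        return self.chain.invoke({"answer": student_answer,
                                  "reference": reference})
```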
```python
# Config
evaluation_config = RunEvalConfig(
    custom_evaluators=[score_relevance],
)

client = Client()
run_id = uuid.uuid4().hex[:4]
project_name = "multi-needle-eval"
```
Can these be instantiated in the constructor?
ya, going to use `eval_set` here; better than hard-coding a project name
```python
if self.evaluator.__class__.__name__ == "LangSmithEvaluator":
    print("EVALUATOR: LANGSMITH")
    chain = self.model_to_test.get_langchain_runnable(context)
    self.evaluator.evaluate_chain(chain, context_length, depth_percent, self.model_name)
    test_end_time = time.time()
    test_elapsed_time = test_end_time - test_start_time
```
We shouldn't rely on code that branches on the evaluator's class name; instead, all evaluators should inherit from `Evaluator` so we can introduce abstraction and leverage the common `evaluate_model` method (sketched below).
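(Illustrative only; aside from `evaluate_model`, the names here are assumptions rather than the repo's actual interface:)

```python
from abc import ABC, abstractmethod

class Evaluator(ABC):
    @abstractmethod
    def evaluate_model(self, response: str) -> int:
        """Return a score for a model response; each backend decides how."""

class LangSmithEvaluator(Evaluator):
    def evaluate_model(self, response: str) -> int:
        # Delegate to the LangSmith-backed chain rather than branching
        # on the class name inside the tester.
        ...
```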
ya, this breaks the current `Evaluator` abstraction. worth some discussion; i added more in a different comment.
```python
    test_elapsed_time = test_end_time - test_start_time

else:
    print("EVALUATOR: OpenAI Model")
```
The only other supported evaluator is OpenAI, but this can change in the future, so the `else` block doesn't have longevity.
ya, we can move to a `match`/`case` (sketched below); this is temp
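(A hedged sketch of that case-based dispatch, requiring Python 3.10+; the evaluator keys and the `OpenAIEvaluator` name are assumptions for illustration:)

```python
match evaluator_name:
    case "langsmith":
        evaluator = LangSmithEvaluator(model_name=model_name)
    case "openai":
        evaluator = OpenAIEvaluator(model_name=model_name)
    case _:
        # Fail loudly instead of falling into a catch-all else branch.
        raise ValueError(f"Unsupported evaluator: {evaluator_name}")
```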
There seem to be some methods in this class that already exist in `LLMNeedleHaystackTester`; we should use those.
ya, iirc i only implemented new functionality needed for multi-needle
Force-pushed from a638f8a to 4c7422f.
Hello @rlancemartin, Kedar here, one of the co-maintainers. Greatly appreciate the introduction of multi-needle support! Here are my review comments. I have numbered them so that we can reference them easily in any discussion.

To show the effect of what I meant in the above comment, I did a brief calculation in Excel and came up with the results below. Column B is the current approach and Column C is the approach I am proposing. You can see that the depths increase more evenly in Column C.

[Screenshot: spreadsheet comparing depth percents; Column B is the current approach, Column C the proposed approach]
Force-pushed from 776d822 to 6bef87d.
there is a bug here, afaict.

error:

```
RecursionError: maximum recursion depth exceeded while calling a Python object
```

cmd to repro:

```
python main.py --evaluator langsmith --context_lengths_num_intervals 3 --document_depth_percent_intervals 3 --provider anthropic --model_name "claude-3-opus-20240229" --multi_needle True --eval_set multi-needle-eval-pizza --needles '["Figs are one of the three most delicious pizza toppings.", "Prosciutto is one of the three most delicious pizza toppings.", "Goat cheese is one of the three most delicious pizza toppings."]' --context_lengths_min 50000 --context_lengths_max 100000
```

@gkamradt, @LazaroHurtado, @kedarchandrayan have you also seen this?
yes, thanks! indeed, the depth will change. i just made the spacing even in the tokens remaining after the initial insertion. easier to follow!

thanks for the feedback! ya, this introduces 3rd-party evaluators and also brings in multi-needle support.

please view my comment on this issue: #15

nice work

fixed w/ #16 c/o LazaroHurtado

I have merged PR #16. Thanks @LazaroHurtado
@rlancemartin, let me re-try explaining my point on the depth percent increment bug, which is not allowing an even distribution in the present code. In the following screenshot, the yellow-coloured depth percents are from your code. The orange-coloured depth percents, which are actually even and span the space to 100, are from my proposed changes.

[Screenshot: table contrasting the yellow (current) and orange (proposed) depth percents]

2 points to note:
I am proposing the following change; please see the snippet below. Note that the depth percent delta is calculated above the loop, not within the loop as in the present code. Please let me know if you have more questions. I would be happy to help.
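(The original snippet did not survive extraction; the following is a minimal reconstruction of the change as described, with variable and function names assumed rather than taken from the repo. For example, with an initial depth of 40% and 3 needles, the delta is (100 - 40) / 3 = 20, giving insertion depths of 40%, 60%, and 80%.)

```python
# Compute the spacing once, above the loop, so successive needles
# step evenly through the remaining depth toward 100%.
depth_percent_delta = (100 - depth_percent) / len(needles)

for i, needle in enumerate(needles):
    insertion_depth = depth_percent + i * depth_percent_delta
    context = insert_needle(context, needle, insertion_depth)
```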
My latest code should fix this, no? I'll have a look tmrw. Also feel free to push a commit to this branch w/ the code change if my current version is not doing it correctly.

Sure @rlancemartin! I have made a commit in your branch regarding the fix.

Hi @rlancemartin, I am done with my review. Please resolve the 2 conflicts. After that, I will merge this PR.
Force-pushed from 8127ab8 to 622fb50.
thanks! done. note: i have not tested: 1/ multi-needle w/o the LangSmith evaluator. we can take those up in follow-up PRs (if changes are needed) to get this code in.
Introduce support for 3rd party evaluators (LangSmith) and add multi-needle support.
Updated README.
Cmd to test: