I really like this kind of benchmark. It would be interesting to build generalized versions of it, in which a variable number of needles is inserted. The needles could be independent of one another, or they could be related. For example, you could imagine four needles:
A implies B
B implies C, D
D implies E
B is true
Then you could test the "related" needles, to ensure that all of them were detected and the relationship is understood. (What might A be? What about D?)
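To make the expected answers concrete, here is a minimal sketch of how a test could compute the ground truth for the four needles above. It is not taken from the repo: the `RULES`/`FACTS` encoding and the `derive` and `expected_answer` helpers are hypothetical names chosen for illustration. It assumes plain forward chaining, so anything not derivable from the stated facts is reported as unknown.

```python
from collections import deque

# Hypothetical encoding of the four needles: premise -> list of consequences.
RULES = {"A": ["B"], "B": ["C", "D"], "D": ["E"]}
FACTS = {"B"}  # "B is true"


def derive(rules, facts):
    """Return every statement that follows from the facts by forward chaining."""
    known = set(facts)
    queue = deque(facts)
    while queue:
        premise = queue.popleft()
        for consequence in rules.get(premise, []):
            if consequence not in known:
                known.add(consequence)
                queue.append(consequence)
    return known


def expected_answer(rules, facts, statement):
    """'true' if the statement is derivable, otherwise 'unknown'.

    Implications do not run backwards, so "A implies B" plus "B is true"
    says nothing about A.
    """
    return "true" if statement in derive(rules, facts) else "unknown"


derived = derive(RULES, FACTS)
```

Under these assumptions, D comes out as true (via B) and A as unknown, which is the distinction the "What might A be? What about D?" questions are probing.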
Curious what you think about this. If you're interested in a feature like this and willing to accept a pull request, I could find the time to try implementing it. If you have a style guide preference or anything like that, please let me know.
I totally agree that reasoning should be a part of the next set of tests. As an aside, I've been wondering what the "unit test" of reasoning is - what is the minimal amount of reasoning we can start with? It may be the transitive reasoning you're referring to here. I like this because you can easily append additional chains, and even put forks in the logic.
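As a rough illustration of appending chains and adding forks, the sketch below generates "X implies Y" needles and spreads them through a haystack. The function names `chain_needles` and `insert_needles`, the sentence templates, and the even-spacing placement are all assumptions made for this example, not the repo's existing API.

```python
def chain_needles(names, forks=None):
    """Build 'X implies Y.' needles for a linear chain, plus optional forks.

    names: ordered statement labels, e.g. ["A", "B", "C"].
    forks: optional dict mapping a premise to extra consequences.
    The first name in the chain is asserted as true.
    """
    forks = forks or {}
    needles = [f"{a} implies {b}." for a, b in zip(names, names[1:])]
    for premise, extras in forks.items():
        needles += [f"{premise} implies {extra}." for extra in extras]
    needles.append(f"{names[0]} is true.")
    return needles


def insert_needles(haystack_sentences, needles):
    """Spread the needles evenly through the haystack, keeping haystack order."""
    result = list(haystack_sentences)
    step = len(result) // (len(needles) + 1)
    # Insert from the back so earlier insertion points are not shifted.
    for i in range(len(needles), 0, -1):
        result.insert(i * step, needles[i - 1])
    return result


needles = chain_needles(["A", "B", "C"], forks={"B": ["X"]})
haystack = [f"Filler sentence {n}." for n in range(10)]
context = insert_needles(haystack, needles)
```

Lengthening the chain is just a matter of passing more names, and each entry in `forks` adds a branch off an existing statement.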
Lance from LangChain added multi-needle recall, but it didn't have reasoning in there.
We are trying to keep tests, providers, and evaluators separate in the repo. Other than that, there is no style guide.
Contributions are very welcome, and we'll be quick with feedback.