sampling: add Top-nσ sampler #11223
base: master
Conversation
Thank you for this implementation! Top-nσ is definitely special and needs a lot of testing. I like the results so far, especially since high temperature is not a problem, as shown in the paper, and I'm going to test it more and see what its limitations are.
common/sampling.cpp (outdated):
if(params.top_n_sigma >= 0) {
    llama_sampler_chain_add(result->chain, llama_sampler_init_temp(params.temp));
    llama_sampler_chain_add(result->chain, llama_sampler_init_top_n_sigma(params.top_n_sigma));
} else {
I am not convinced that this is desirable. Using top-k before this sampler should improve performance significantly, and the difference in output is likely to be negligible.
Thanks for pointing this out! I see your point that processing the entire vocabulary to apply top-nσ sampling is computationally expensive, and the negligible difference in output by chaining this after top-k doesn't justify the cost.
I made the necessary change to apply top-k before this sampler to improve performance; that way we're processing fewer tokens to achieve similar results.
sampler chain: logits -> logit-bias -> top-k -> temp -> top-n-sigma -> dist
The main benefit we should be getting from this sampler is a stable sampling space regardless of temperature scaling. By chaining it after top-k, we can now control the size of this sampling space too.
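If it helps anyone reproduce this ordering, here is a minimal sketch of how such a chain could be assembled with the llama.cpp sampler API. llama_sampler_init_top_n_sigma is the new sampler from this PR (its exact signature may differ); the top-k, temperature, and n values are only illustrative.

```cpp
#include "llama.h"

// Build the chain: logits -> logit-bias -> top-k -> temp -> top-n-sigma -> dist.
// llama_sampler_init_top_n_sigma() is the sampler introduced by this PR; the
// other initializers are the existing llama.cpp samplers. Values are examples.
static struct llama_sampler * build_top_nsigma_chain(int32_t n_vocab) {
    struct llama_sampler_chain_params sparams = llama_sampler_chain_default_params();
    struct llama_sampler * chain = llama_sampler_chain_init(sparams);

    // no logit biases in this example, hence 0 entries
    llama_sampler_chain_add(chain, llama_sampler_init_logit_bias(n_vocab, 0, nullptr));
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));          // truncate first for speed
    llama_sampler_chain_add(chain, llama_sampler_init_temp(1.5f));         // temperature scaling
    llama_sampler_chain_add(chain, llama_sampler_init_top_n_sigma(1.0f));  // n = 1.0 (paper's default)
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

    return chain;
}
```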
Moreover, I'm currently stuck with two models that almost require some form of repetition penalty (either a traditional one, or DRY), and the current implementation doesn't allow that.
I get that this sampler wasn't designed with creative writing in mind, but it can be used for that purpose. Mistral Small 22B shows especially good results, but some finetuned models need something to battle repetitions.
FYI I've also tested introducing (gaussian) noise as a sampler before top_n_sigma, but it broke horribly. Seems like it's quite sensitive to that.
Thank you for testing our algorithm! Indeed, this sampler is just meant to eliminate the Gaussian noise, but we're actually seeing the same thing in our ongoing research. We find it works quite well for creative writing, and strangely enough, it performs really well at low temperatures.
The image link (shared as a link to avoid cluttering the comments)
I wonder if the repetition issue could be mitigated with a very high temperature? (like 10.0 in the image)
@Tomorrowdawn Hi! Thank you for this research!
At the moment I see the following regarding repetitions:
- Traditional repetition penalty (logits -> bias -> top_k -> repetition_penalty -> temp -> top_n_sigma) helps, but it needs more careful testing and tuning - sometimes it can steer creativity into broken logic.
- DRY works really well, but at the moment I've only tested it in a legacy order (i.e. logits -> bias -> DRY -> top_k -> temp -> top_n_sigma). Ideally, it should go after top_k for optimization, but so far it's quite good.
As for gaussian noise, it was used before as an experiment to improve "creativity" and battle repetitions. I've been using it for a long time now, and it seems to be effective - but it's not as useful against repetitions as DRY, for example. At the same time, it works well with other existing samplers (both as a separate sampler and as a part of p_step, for example).
As for high temperatures, I've tested top_n_sigma with temp = 5 previously, and it was still good enough, but prompt adherence started to suffer a bit, which is a problem for creative writing (especially with longer prompts). In general, I found top_n_sigma to be more random (less controllable?) in its own way: the length of responses and overall creativity seem to vary from run to run with the same parameters.
For example, I was testing it with temp = 1.9, and results ranged from low-temp-like to long ones. The "confidence" (just an experiment - I measure it simply as the percentage of chosen candidates with a chance of 50% or higher) was all over the place too, from 0.8 to 0.4.
I'm still testing, and I will look more into low temperatures with top_n_sigma. This sampler is very interesting, and the results are definitely unique.
About the hyperparameter: top_n_sigma is equivalent to min_p where p = 1/exp(n*sigma). Take sigma = 2.2 as a typical value and you get p ≈ 0.1 for n = 1 and p ≈ 0.01 for n = 2. It doesn't scale linearly. 1.3~1.5 might better match what I meant by 'slightly larger' (sorry for not being clear)... As a default option, 1.0 is a simple, memorable, and practical value.
Regarding combining with other samplers: since it needs to calculate the standard deviation, and it's designed to eliminate noise from the original distribution, top_nsigma is typically used as the first sampler.
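For reference, a short sketch of that equivalence, assuming min_p is interpreted relative to the most likely token and σ is the standard deviation of the logits:

```latex
% Top-n-sigma keeps token i iff its logit is within n*sigma of the maximum.
\[
\ell_i \ge \ell_{\max} - n\sigma
\;\Longleftrightarrow\;
\frac{p_i}{p_{\max}} = e^{\ell_i - \ell_{\max}} \ge e^{-n\sigma} = \frac{1}{e^{n\sigma}}
\]
% Example with sigma = 2.2:
%   n = 1:  exp(-2.2) ≈ 0.11
%   n = 2:  exp(-4.4) ≈ 0.012
```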
1.3~1.5 might better match what I meant by 'slightly larger' (sorry for not being clear)...
Thanks for the clarification! In that case, top_n_sigma shouldn't be an integer variable as it is right now in this PR.
top_nsigma is typically used as the first sampler
Hmm, what about top-k with high values (1000, for example)? If both temp and top_n_sigma work on an untruncated list of candidates, it might decrease performance a bit.
In practice, unless p-step is used, the number of candidates kept until dist is less than 1000 (in most cases it's even less than 200). Having top-k as the first sampler would help performance even at such a high value - but would it interfere with top_n_sigma in that case?
Oh, I see the point. The situation is that the code we're discussing is different: typically, transformers or vllm users would pass the complete intermediate logits vector (since these operations are fast on GPU), and top_n_sigma would crash due to negative infinities introduced by other samplers. If we only keep the candidates selected by top-k, this is completely feasible. The previously mentioned chain of top-k -> temp -> top_nsigma is correct.
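To illustrate that point, here is a rough sketch (not the PR's actual code) of how the filtering step could operate on whatever candidates remain after top-k, computing the statistics only over the surviving tokens and skipping any -inf logits defensively:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include "llama.h"

// Rough sketch of the top-n-sigma filtering step. It works on the candidate
// array that earlier samplers in the chain have already truncated (e.g. top-k),
// so the mean/std are computed over the surviving tokens; -inf entries are skipped.
static void top_n_sigma_filter(llama_token_data_array * cur_p, float n) {
    if (cur_p->size == 0 || n < 0.0f) {
        return;
    }

    float  max_logit = -INFINITY;
    double sum = 0.0, sum_sq = 0.0;
    size_t count = 0;
    for (size_t i = 0; i < cur_p->size; ++i) {
        const float l = cur_p->data[i].logit;
        if (!std::isfinite(l)) {
            continue; // ignore tokens masked out by earlier samplers
        }
        max_logit = std::max(max_logit, l);
        sum    += l;
        sum_sq += (double) l * l;
        ++count;
    }
    if (count == 0) {
        return;
    }
    const double mean  = sum / count;
    const double sigma = std::sqrt(std::max(0.0, sum_sq / count - mean * mean));

    // keep only tokens whose logit is within n*sigma of the maximum
    const float threshold = max_logit - n * (float) sigma;
    size_t kept = 0;
    for (size_t i = 0; i < cur_p->size; ++i) {
        if (std::isfinite(cur_p->data[i].logit) && cur_p->data[i].logit >= threshold) {
            cur_p->data[kept++] = cur_p->data[i];
        }
    }
    cur_p->size = kept;
}
```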
top_n_sigma shouldn't be an integer variable as it is right now in this PR.
@MaggotHATE I converted top_n_sigma to a float. This seems to give us much more control over the sampling space than leaving it as an integer.
While it was an int, I noticed that any top_n_sigma greater than 2 had little to no effect on the output, leaving us to choose between only three values for top_n_sigma (0, 1, and 2). Thanks for the suggestion!
@VJHack Thank you! I'm currently testing with top_n_sigma = 1.3, and it works quite well!
top_n_sigma is designed to be general-purpose and to reflect the model's inherent capabilities rather than introducing human priors.
@Tomorrowdawn you were right about this, and even more: it looks like top_n_sigma can increase sensitivity to the prompt.
For example, I was testing with a prompt formatted using colons (i.e. "Title:..."). A finetuned model that had such formatting in its finetuning dataset worked correctly - until I used the same formatting for something that was not in the dataset (i.e. "TRIVIA:..."). At that point the quality of responses degraded significantly. It looks like my previous complaints about repetitions stemmed from this too (although repeated words may still appear close to each other in the text). I've never seen such a strong effect with other sampling algorithms.
It's a very interesting result, because it means that top_n_sigma allows us to focus on testing the "quality" of finetuning, so to speak. None of the problems mentioned above happened with the foundation models (Mistral Small 22B in this case, but I've also tested Nemo 12B).
Top-nσ: Not All Logits Are You Need
https://arxiv.org/pdf/2411.07641
The authors of this paper propose a new sampling method known as Top-nσ. The main feature of this sampler is that "unlike existing methods (e.g., top-p, min-p) that inadvertently include more noise tokens at higher temperatures, top-nσ maintains a stable sampling space regardless of temperature scaling". They discovered that logits naturally separate into a gaussian-distributed noisy region and an informative region.
This PR implements the sampling method proposed in the paper. Here is the algorithm implemented from the paper:
Since the manipulation is done directly on the logits pre-softmax, I added it as a stand-alone sampler instead of chaining it with the common samplers. The changes only add support for llama-cli.
sampler chain: logits -> logit-bias -> temp -> top-n-sigma -> dist
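Since the sampler is stand-alone, it can also be applied directly to a candidate array outside of common/sampling.cpp. Below is a minimal usage sketch mirroring the chain above (minus the logit-bias step); the llama_sampler_init_top_n_sigma call and its float argument are assumed from this PR, and all parameter values are only illustrative.

```cpp
#include <vector>
#include "llama.h"

// Apply the stand-alone top-n-sigma sampler to raw logits (illustrative only).
// `logits` points at the model's output for the last token, `n_vocab` is the
// vocabulary size; llama_sampler_init_top_n_sigma() is the sampler from this PR.
static llama_token sample_with_top_n_sigma(const float * logits, int32_t n_vocab) {
    std::vector<llama_token_data> candidates;
    candidates.reserve(n_vocab);
    for (llama_token id = 0; id < n_vocab; ++id) {
        candidates.push_back({ id, logits[id], 0.0f });
    }
    llama_token_data_array cur_p = { candidates.data(), candidates.size(), -1, false };

    struct llama_sampler * temp   = llama_sampler_init_temp(1.5f);        // example temperature
    struct llama_sampler * nsigma = llama_sampler_init_top_n_sigma(1.0f); // example n
    struct llama_sampler * dist   = llama_sampler_init_dist(LLAMA_DEFAULT_SEED);

    llama_sampler_apply(temp,   &cur_p);
    llama_sampler_apply(nsigma, &cur_p);
    llama_sampler_apply(dist,   &cur_p); // picks cur_p.selected

    const llama_token tok = cur_p.data[cur_p.selected].id;

    llama_sampler_free(temp);
    llama_sampler_free(nsigma);
    llama_sampler_free(dist);
    return tok;
}
```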
I'm aware that this algorithm is still in its early phases, so we could tag this as a demo for now, but I'll leave that choice up to the maintainers.
resolves #11057
Relevant links:
https://huggingface.co/papers/2411.07641
https://arxiv.org/pdf/2411.07641
https://github.com/Tomorrowdawn/top_nsigma
#11057