sampling: add Top-nσ sampler #11223

Open
VJHack wants to merge 15 commits into master

Conversation

@VJHack (Contributor) commented Jan 14, 2025:

Top-nσ: Not All Logits Are You Need

https://arxiv.org/pdf/2411.07641
The authors of this paper propose a new sampling method known as Top-nσ. The main feature of this sampler is that "unlike existing methods (e.g., top-p, min-p) that inadvertently include more noise tokens at higher temperatures, top-nσ maintains a stable sampling space regardless of temperature scaling". They discovered that logits naturally separate into a Gaussian-distributed noisy region and an informative region.

This PR implements the sampling method proposed in the paper. Here is the algorithm as implemented from the paper:
(Screenshot of the algorithm from the paper.)

Since the manipulation is done directly on the logits pre-softmax, I added it as a stand-alone sampler instead of chaining it with the common samplers. The changes only add support for llama-cli.
sampler chain: logits -> logit-bias -> temp -> top-n-sigma -> dist
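For reference, here is a minimal sketch of the selection rule (illustrative only, not the exact code in this PR): compute the maximum and the standard deviation of the logits, then mask every token whose logit falls more than n·σ below the maximum.

```cpp
// Minimal sketch of the top-n-sigma selection rule (illustrative, not the PR's code).
// Tokens whose logit is more than n*sigma below the maximum are treated as noise
// and masked to -inf, so they receive ~0 probability after softmax.
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

void top_n_sigma_mask(std::vector<float> & logits, float n) {
    if (logits.empty()) {
        return;
    }

    const float max_l = *std::max_element(logits.begin(), logits.end());

    // mean and standard deviation of the logits
    double mean = 0.0;
    for (float l : logits) mean += l;
    mean /= logits.size();

    double var = 0.0;
    for (float l : logits) var += (l - mean) * (l - mean);
    const float sigma = std::sqrt(var / logits.size());

    // mask everything below max - n*sigma
    const float threshold = max_l - n * sigma;
    for (float & l : logits) {
        if (l < threshold) {
            l = -std::numeric_limits<float>::infinity();
        }
    }
}
```

Since both the gap to the maximum and σ scale by the same factor under temperature scaling, the set of surviving tokens is the same whether this runs before or after the temp sampler, which is where the "stable sampling space regardless of temperature" property comes from.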

I'm aware that this algorithm is still in its early stages, so we could tag this as a demo for now, but I'll leave that choice up to the maintainers.

resolves #11057

Relevant links:
https://huggingface.co/papers/2411.07641
https://arxiv.org/pdf/2411.07641
https://github.com/Tomorrowdawn/top_nsigma
#11057

github-actions bot added the "testing" (Everything test related) label on Jan 14, 2025
VJHack marked this pull request as ready for review on January 14, 2025
@MaggotHATE (Contributor) commented:

Thank you for this implementation! Top-nσ is definitely special and needs a lot of testing.

I like the results so far, especially since high temperature is not a problem, as shown in the paper, and I'm going to test it more and see what its limitations are.

Comment on lines 170 to 173
if (params.top_n_sigma >= 0) {
    llama_sampler_chain_add(result->chain, llama_sampler_init_temp(params.temp));
    llama_sampler_chain_add(result->chain, llama_sampler_init_top_n_sigma(params.top_n_sigma));
} else {
Collaborator commented:

I am not convinced that this is desirable. Using top-k before this sampler should improve performance significantly, and the difference in output is likely to be negligible.

@VJHack (Author) replied:

Thanks for pointing this out! I see your point that processing the entire vocabulary to apply top-nσ sampling is computationally expensive, and the negligible difference in output by chaining this after top-k doesn't justify the cost.

I made the necessary change to apply top-k before it to improve performance; that way we're processing fewer tokens while achieving similar results.
sampler chain: logits -> logit-bias -> top-k -> temp -> top-n-sigma -> dist

The main benefit we should be getting from this sampler is a stable sampling space regardless of temperature scaling. By chaining it after top-k, we can now control the size of this sampling space too.
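For illustration, the updated chain built through the llama.cpp sampler API would look roughly like this (a sketch only: `build_chain_sketch` and the parameter values are placeholders, the logit-bias stage is omitted for brevity, and `llama_sampler_init_top_n_sigma` is the sampler added by this PR):

```cpp
#include "llama.h"

// Sketch of the updated sampler chain (placeholder values; logit-bias omitted).
static llama_sampler * build_chain_sketch() {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());

    llama_sampler_chain_add(chain, llama_sampler_init_top_k(100));               // truncate the candidate list first (cheap)
    llama_sampler_chain_add(chain, llama_sampler_init_temp(1.5f));               // temperature scaling
    llama_sampler_chain_add(chain, llama_sampler_init_top_n_sigma(1.0f));        // keep tokens within n*sigma of the max logit
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED)); // sample from what remains

    return chain;
}
```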

@MaggotHATE (Contributor) commented Jan 17, 2025:

Moreover, I'm currently stuck with two models that almost require some form of repetition penalty (either a traditional one, or DRY), and the current implementation doesn't allow that.

I get that this sampler wasn't designed with creative writing in mind, but it can be used for that purpose. Mistral Small 22B shows especially good results, but some finetuned models need something to battle repetitions.

FYI I've also tested introducing (Gaussian) noise as a sampler before top_n_sigma, but it broke horribly. It seems quite sensitive to that.

@Tomorrowdawn commented Jan 18, 2025:

Thank you for testing our algorithm! Indeed, this sampler is just meant to eliminate the Gaussian noise, but we're actually seeing the same thing in our ongoing research. We find it works quite well for creative writing, and strangely enough, it performs really well at low temperatures.

The image (shared as a link to avoid cluttering the comments)

I wonder if the repetition issue could be mitigated with a very high temperature? (like 10.0 in the image)

@MaggotHATE (Contributor) commented Jan 18, 2025:

@Tomorrowdawn Hi! Thank you for this research!

At the moment I see the following with regard to repetitions:

  1. Traditional repetition penalty (logits -> bias -> top_k -> repetition_penalty -> temp -> top_n_sigma) helps, but it needs more careful testing and tuning - sometimes it can steer creativity into broken logic (see the sketch after this list).
  2. DRY works really well, but at the moment I've only tested it in a legacy order (i.e. logits -> bias -> DRY -> top_k -> temp -> top_n_sigma). Ideally, it should go after top_k for optimization, but so far it's quite good.
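Here is a rough sketch of ordering 1 expressed through the sampler-chain API (placeholder values; the logit-bias stage is omitted, and this assumes the current four-argument llama_sampler_init_penalties(last_n, repeat, freq, present) form):

```cpp
// Sketch of ordering 1: bias -> top_k -> repetition_penalty -> temp -> top_n_sigma -> dist
// (placeholder values; logit-bias omitted for brevity)
llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());

llama_sampler_chain_add(chain, llama_sampler_init_top_k(100));
llama_sampler_chain_add(chain, llama_sampler_init_penalties(64, 1.1f, 0.0f, 0.0f)); // last_n, repeat, freq, present
llama_sampler_chain_add(chain, llama_sampler_init_temp(1.5f));
llama_sampler_chain_add(chain, llama_sampler_init_top_n_sigma(1.0f));
llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
```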

As for Gaussian noise, it was used before as an experiment to improve "creativity" and battle repetitions. I've been using it for a long time now, and it seems to be effective - but it's still not as useful against repetitions as DRY, for example. At the same time, it works well with other existing samplers (both as a separate sampler and as part of p_step, for example).

As for high temperatures, I've previously tested top_n_sigma with temp = 5, and it was still good enough, but prompt adherence started to suffer a bit, which is a problem for creative writing (especially with longer prompts). In general, I found top_n_sigma to be more random (less controllable?) in its own way: the length of responses and overall creativity seem to vary from run to run with the same parameters.

For example, I was testing it with temp = 1.9, and results ranged from low-temp-like to long ones. The "confidence" (just an experiment; I measure it simply as the percentage of chosen candidates with a probability of 50% or higher) was all over the place too, from 0.8 to 0.4.

I'm still testing, and I will look more into low temperatures with top_n_sigma. This sampler is very interesting, and the results are definitely unique.

@Tomorrowdawn replied:

About the hyperparameter: top_n_sigma is equivalent to min_p where p = 1/exp(nsigma). Take sigma=2.2 as a typical value and you will get n=1 for p = 0.1, and n=2 for p=0.01. It doesn't scale linearly. 1.3~1.5 might better match what I meant by 'slightly larger' (sorry for not being clear)... As a default option, 1.0 is a simple, memorable, and practical value.

Regarding combining with other samplers: since it needs to calculate the standard deviation, and it's designed to eliminate noise from the original distribution, top_nsigma is typically used as the first sampler.
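Spelling the equivalence out (a quick derivation for reference, not taken from the paper): writing ℓ for the logits the sampler sees and σ for their standard deviation, the top-nσ rule keeps token i iff

```math
\ell_i \ge \ell_{\max} - n\sigma
\;\iff\;
\frac{e^{\ell_i}}{e^{\ell_{\max}}} = \frac{P(i)}{P_{\max}} \ge e^{-n\sigma},
```

which is a min-p cutoff with p = e^{-nσ}: for σ ≈ 2.2 that gives p ≈ 0.11 at n = 1 and p ≈ 0.012 at n = 2, matching the values above. The difference from plain min-p is that σ is re-estimated from the logits at every step, so the effective cutoff adapts to how peaked the distribution is.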

@MaggotHATE (Contributor) replied:

> 1.3~1.5 might better match what I meant by 'slightly larger' (sorry for not being clear)...

Thanks for the clarification! In that case, top_n_sigma shouldn't be an integer variable, as it currently is in this PR.

> top_nsigma is typically used as the first sampler

Hmm, what about top-k with high values (1000, for example)? If both temp and top_n_sigma operate on an untruncated list of candidates, it might decrease performance a bit.

In practice, unless p-step is used, the number of candidates kept until dist is less than 1000 (in most cases even less than 200). Having top-k as the first sampler would help performance even at such a high value - but would it interfere with top_n_sigma in that case?

@Tomorrowdawn replied:

Oh, I see the point. The situation is that the code we're discussing is different: typically, transformers or vllm users would pass the complete intermediate logits vector (since these operations are fast on GPU), and top_n_sigma would crash due to negative infinities introduced by other samplers. If we only keep the candidates selected by top-k, this is completely feasible. The previously mentioned chain of top-k->temp->top_nsigma is correct.

@VJHack (Author) commented Jan 20, 2025:

> top_n_sigma shouldn't be an integer variable as it is right now in this PR.

@MaggotHATE I converted top_n_sigma to a float. This seems to give us much more control over the sampling space than leaving it as an integer did.

While it was an int, I noticed that any top_n_sigma greater than 2 had little to no effect on the output, leaving us to choose between only three values for top_n_sigma (0, 1, and 2). Thanks for the suggestion!

@MaggotHATE (Contributor) replied:

@VJHack Thank you! I'm currently testing with top_n_sigma = 1.3, and it works quite well!

> top_n_sigma is designed to be general-purpose and to reflect the model's inherent capabilities rather than introducing human priors.

@Tomorrowdawn you were right about this, and even more: it looks like top_n_sigma can increase sensitivity to the prompt.

For example, I was testing with a prompt formatted using colons (e.g. "Title:..."). A finetuned model that had such formatting in its finetuning dataset worked correctly - until I used the same formatting for something that was not in the dataset (e.g. "TRIVIA:..."). At that point the quality of responses degraded significantly. It looks like my previous complaints about repetitions stemmed from this too (although repeated words may still appear close to each other in the text). I've never seen such a strong effect with other sampling algorithms.

It's a very interesting result, because it means that top_n_sigma allows us to focus on testing the "quality" of finetuning, so to speak. None of the problems mentioned above happened with the foundation models (Mistral Small 22B in this case, but I've also tested Nemo 12B).

Labels: examples, testing (Everything test related)
Successfully merging this pull request may close these issues: Feature Request: Top-nσ sampler
4 participants