sampling: add Top-nσ sampler #11223
base: master
Conversation
Thank you for this implementation! Top-nσ is definitely special and needs a lot of testing. I like the results so far, especially since high temperature is not a problem, as shown in the paper, and I'm going to test it more and see what its limitations are.
common/sampling.cpp (outdated):
if(params.top_n_sigma >= 0) {
    llama_sampler_chain_add(result->chain, llama_sampler_init_temp(params.temp));
    llama_sampler_chain_add(result->chain, llama_sampler_init_top_n_sigma(params.top_n_sigma));
} else {
I am not convinced that this is desirable. Using top-k before this sampler should improve performance significantly, and the difference in output is likely to be negligible.
Thanks for pointing this out! I see your point that processing the entire vocabulary to apply top-nσ sampling is computationally expensive, and the negligible difference in output by chaining this after top-k doesn't justify the cost.
I made the necessary change to apply top-k before this sampler to improve performance; that way we're processing fewer tokens to achieve similar results.
sampler chain: logits -> logit-bias -> top-k -> temp -> top-n-sigma -> dist
The main benefit we should be getting from this sampler is a stable sampling space regardless of temperature scaling. By chaining it after top-k, we can now control the size of this sampling space too.
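If it helps anyone reproduce this ordering, here is a minimal sketch of how such a chain could be assembled with the llama.cpp sampler API. llama_sampler_init_top_n_sigma is the new sampler from this PR (its exact signature may differ); the top-k, temperature, and n values are only illustrative.

```cpp
#include "llama.h"

// Build the chain: logits -> logit-bias -> top-k -> temp -> top-n-sigma -> dist.
// llama_sampler_init_top_n_sigma() is the sampler introduced by this PR; the
// other initializers are the existing llama.cpp samplers. Values are examples.
static struct llama_sampler * build_top_nsigma_chain(int32_t n_vocab) {
    struct llama_sampler_chain_params sparams = llama_sampler_chain_default_params();
    struct llama_sampler * chain = llama_sampler_chain_init(sparams);

    // no logit biases in this example, hence 0 entries
    llama_sampler_chain_add(chain, llama_sampler_init_logit_bias(n_vocab, 0, nullptr));
    llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));          // truncate first for speed
    llama_sampler_chain_add(chain, llama_sampler_init_temp(1.5f));         // temperature scaling
    llama_sampler_chain_add(chain, llama_sampler_init_top_n_sigma(1.0f));  // n = 1.0 (paper's default)
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

    return chain;
}
```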
Moreover, I'm currently stuck with two models that almost require some form of repetition penalty (either a traditional one, or DRY), and the current implementation doesn't allow that.
I get that this sampler wasn't designed with creative writing in mind, but it can be used for that purpose. Mistral Small 22B shows especially good results, but some finetuned models need something to battle repetitions.
FYI I've also tested introducing (gaussian) noise as a sampler before top_n_sigma, but it broke horribly. Seems like it's quite sensitive to that.
Thank you for testing our algorithm! Indeed, this sampler is just meant to eliminate the Gaussian noise, but we're actually seeing the same thing in our ongoing research. We find it works quite well for creative writing, and strangely enough, it performs really well at low temperatures.
The image link (shared as a link to avoid cluttering the comments)
I wonder if the repetition issue could be mitigated with a very high temperature? (like 10.0 in the image)
@Tomorrowdawn Hi! Thank you for this research!
At the moment I see the following regarding repetitions:
- Traditional repetition penalty (logits -> bias -> top_k -> repetition_penalty -> temp -> top_n_sigma) helps, but it needs more careful testing and tuning - sometimes it can steer creativity into broken logic.
- DRY works really well, but at the moment I've only tested it in a legacy order (i.e. logits -> bias -> DRY -> top_k -> temp -> top_n_sigma). Ideally, it should go after top_k for optimization, but so far it's quite good.
As for gaussian noise, it was used before as an experiment to improve "creativity" and battle repetitions. I've been using it for a long time now, and it seems to be effective - but it's not as useful against repetitions as DRY, for example. At the same time, it works well with other existing samplers (both as a separate sampler and as a part of p_step, for example).
As for high temperatures, I've tested top_n_sigma with temp = 5 previously, and it was still good enough, but prompt adherence started to suffer a bit, which is a problem for creative writing (especially with longer prompts). In general, I found top_n_sigma to be more random (less controllable?) in its own way: the length of responses and overall creativity seem to vary from run to run with the same parameters.
For example, I was testing it with temp = 1.9, and results ranged from low-temp-like to long ones. The "confidence" (just an experiment - I measure it simply as the percentage of chosen candidates with a chance of 50% or higher) was all over the place too, from 0.8 to 0.4.
I'm still testing, and I will look more into low temperatures with top_n_sigma. This sampler is very interesting, and the results are definitely unique.
About the hyperparameter: top_n_sigma is equivalent to min_p where p = 1/exp(n*sigma). Take sigma = 2.2 as a typical value and you get p ≈ 0.1 for n = 1 and p ≈ 0.01 for n = 2. It doesn't scale linearly. 1.3~1.5 might better match what I meant by 'slightly larger' (sorry for not being clear)... As a default option, 1.0 is a simple, memorable, and practical value.
Regarding combining with other samplers: since it needs to calculate the standard deviation, and it's designed to eliminate noise from the original distribution, top_nsigma is typically used as the first sampler.
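For reference, a short sketch of that equivalence, assuming min_p is interpreted relative to the most likely token and σ is the standard deviation of the logits:

```latex
% Top-n-sigma keeps token i iff its logit is within n*sigma of the maximum.
\[
\ell_i \ge \ell_{\max} - n\sigma
\;\Longleftrightarrow\;
\frac{p_i}{p_{\max}} = e^{\ell_i - \ell_{\max}} \ge e^{-n\sigma} = \frac{1}{e^{n\sigma}}
\]
% Example with sigma = 2.2:
%   n = 1:  exp(-2.2) ≈ 0.11
%   n = 2:  exp(-4.4) ≈ 0.012
```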
1.3~1.5 might better match what I meant by 'slightly larger' (sorry for not being clear)...
Thanks for the clarification! In that case, top_n_sigma shouldn't be an integer variable as it is right now in this PR.
top_nsigma is typically used as the first sampler
Hmm, what about top-k with high values (1000, for example)? If both temp and top_n_sigma work on an untruncated list of candidates, it might decrease performance a bit.
In practice, unless p-step is used, the number of candidates kept until dist is less than 1000 (in most cases it's even less than 200). Having top-k as the first sampler would help performance even at such a high value - but would it interfere with top_n_sigma in that case?
Oh, I see the point. The situation is that the code we're discussing is different: typically, transformers or vllm users would pass the complete intermediate logits vector (since these operations are fast on GPU), and top_n_sigma would crash due to negative infinities introduced by other samplers. If we only keep the candidates selected by top-k, this is completely feasible. The previously mentioned chain of top-k -> temp -> top_nsigma is correct.
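To illustrate that point, here is a rough sketch (not the PR's actual code) of how the filtering step could operate on whatever candidates remain after top-k, computing the statistics only over the surviving tokens and skipping any -inf logits defensively:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include "llama.h"

// Rough sketch of the top-n-sigma filtering step. It works on the candidate
// array that earlier samplers in the chain have already truncated (e.g. top-k),
// so the mean/std are computed over the surviving tokens; -inf entries are skipped.
static void top_n_sigma_filter(llama_token_data_array * cur_p, float n) {
    if (cur_p->size == 0 || n < 0.0f) {
        return;
    }

    float  max_logit = -INFINITY;
    double sum = 0.0, sum_sq = 0.0;
    size_t count = 0;
    for (size_t i = 0; i < cur_p->size; ++i) {
        const float l = cur_p->data[i].logit;
        if (!std::isfinite(l)) {
            continue; // ignore tokens masked out by earlier samplers
        }
        max_logit = std::max(max_logit, l);
        sum    += l;
        sum_sq += (double) l * l;
        ++count;
    }
    if (count == 0) {
        return;
    }
    const double mean  = sum / count;
    const double sigma = std::sqrt(std::max(0.0, sum_sq / count - mean * mean));

    // keep only tokens whose logit is within n*sigma of the maximum
    const float threshold = max_logit - n * (float) sigma;
    size_t kept = 0;
    for (size_t i = 0; i < cur_p->size; ++i) {
        if (std::isfinite(cur_p->data[i].logit) && cur_p->data[i].logit >= threshold) {
            cur_p->data[kept++] = cur_p->data[i];
        }
    }
    cur_p->size = kept;
}
```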
top_n_sigma shouldn't be an integer variable as it is right now in this PR.
@MaggotHATE I converted top_n_sigma to a float. This seems to give us much more control over the sampling space than leaving it as an integer.
While it was an int, I noticed that any top_n_sigma greater than 2 had little to no effect on the output, leaving us to choose between only three values for top_n_sigma (0, 1, and 2). Thanks for the suggestion!
@VJHack Thank you! I'm currently testing with top_n_sigma = 1.3, and it works quite well!
top_n_sigma is designed to be general-purpose and to reflect the model's inherent capabilities rather than introducing human priors.
@Tomorrowdawn you were right about this, and even more: it looks like top_n_sigma can increase sensitivity to the prompt.
For example, I was testing with a prompt formatted using colons (i.e. "Title:..."). A finetuned model that had such formatting in its finetuning dataset worked correctly - until I used the same formatting for something that was not in the dataset (i.e. "TRIVIA:..."). At that point the quality of responses degraded significantly. It looks like my previous complaints about repetitions stemmed from this too (although repeated words may still appear close to each other in the text). I've never seen such a strong effect with other sampling algorithms.
It's a very interesting result, because it means that top_n_sigma allows us to focus on testing the "quality" of finetuning, so to speak. None of the problems mentioned above happened with the foundation models (Mistral Small 22B in this case, but I've also tested Nemo 12B).
Top-nσ: Not All Logits Are You Need
https://arxiv.org/pdf/2411.07641
The authors of this paper propose a new sampling method known as Top-nσ. The main feature of this sampler is that "unlike existing methods (e.g., top-p, min-p) that inadvertently include more noise tokens at higher temperatures, top-nσ maintains a stable sampling space regardless of temperature scaling". They discovered that logits naturally separate into a gaussian-distributed noisy region and an informative region.
This PR implements the sampling method proposed in the paper. Here is the algorithm implemented from the paper:
Since the manipulation is done directly on the logits pre-softmax, I added it as a stand-alone sampler instead of chaining it with the common samplers. The changes only add support for llama-cli.
sampler chain: logits -> logit-bias -> temp -> top-n-sigma -> dist
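Since the sampler is stand-alone, it can also be applied directly to a candidate array outside of common/sampling.cpp. Below is a minimal usage sketch mirroring the chain above (minus the logit-bias step); the llama_sampler_init_top_n_sigma call and its float argument are assumed from this PR, and all parameter values are only illustrative.

```cpp
#include <vector>
#include "llama.h"

// Apply the stand-alone top-n-sigma sampler to raw logits (illustrative only).
// `logits` points at the model's output for the last token, `n_vocab` is the
// vocabulary size; llama_sampler_init_top_n_sigma() is the sampler from this PR.
static llama_token sample_with_top_n_sigma(const float * logits, int32_t n_vocab) {
    std::vector<llama_token_data> candidates;
    candidates.reserve(n_vocab);
    for (llama_token id = 0; id < n_vocab; ++id) {
        candidates.push_back({ id, logits[id], 0.0f });
    }
    llama_token_data_array cur_p = { candidates.data(), candidates.size(), -1, false };

    struct llama_sampler * temp   = llama_sampler_init_temp(1.5f);        // example temperature
    struct llama_sampler * nsigma = llama_sampler_init_top_n_sigma(1.0f); // example n
    struct llama_sampler * dist   = llama_sampler_init_dist(LLAMA_DEFAULT_SEED);

    llama_sampler_apply(temp,   &cur_p);
    llama_sampler_apply(nsigma, &cur_p);
    llama_sampler_apply(dist,   &cur_p); // picks cur_p.selected

    const llama_token tok = cur_p.data[cur_p.selected].id;

    llama_sampler_free(temp);
    llama_sampler_free(nsigma);
    llama_sampler_free(dist);
    return tok;
}
```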
I'm aware that this algorithm is still in its early phases, so we could tag this as a demo for now, but I'll leave that choice up to the maintainers.
resolves #11057
Relevant links:
https://huggingface.co/papers/2411.07641
https://arxiv.org/pdf/2411.07641
https://github.com/Tomorrowdawn/top_nsigma
#11057