This is an issue that stems from Ollama's own behaviour (see e.g. ollama/ollama#8099 and ollama/ollama#7043, among others), but it inevitably impacts mall users as well.
In brief, by default ollama truncates the input at 2048 tokens, even if the input is longer and the model itself would support a much larger context.
The workaround outlined in the relevant issue works, but the misbehaviour is likely to remain annoyingly invisible to users who don't watch the logs (ollama serve logs lines such as msg="truncating input prompt" limit=2048 prompt=8923 keep=5 new=2048).
In fact, the issue can ultimately be noticed by the user, as the response is not consistent with the prompt: the prompt itself gets cut off.
This recent blog post inadvertently shows this issue, as it processes texts that are longer than 2048 tokens.
It first asks Ollama to summarise, but even if it had asked it to extract contents, the response would still be a summary (as somewhat implied at the beginning of the text): the request is simply ignored, and even if it were processed, it would be applied to a truncated input.
All of this would remain invisible to mall users, who would simply not get appropriate responses.
Until ollama introduces a better mechanism to manage this, I suppose that including a warning when the input hits the 2048-token limit, or at least pointing at this issue in the documentation, would make things easier to troubleshoot.
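To illustrate, here is a minimal sketch (not mall's actual internals; the helper name and the hard-coded limit are my own) of the kind of check that could raise such a warning, based on the prompt_eval_count field returned by the Ollama API:

# hypothetical helper: warn if Ollama reports having evaluated exactly the
# context limit, which suggests the prompt was truncated (2048 by default)
warn_if_truncated <- function(resp_json, limit = 2048) {
  n <- resp_json$prompt_eval_count
  if (!is.null(n) && n >= limit) {
    warning(
      "Ollama evaluated ", n, " prompt tokens, matching the context limit: ",
      "the input was likely truncated.",
      call. = FALSE
    )
  }
  invisible(resp_json)
}
# e.g. warn_if_truncated(resp) after the httr2 calls shown below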
Below are some examples that show this misbehaviour (which also emerges when running the code included in the above-mentioned blog post), as well as calls to the Ollama API made with httr2 that make it possible to see that the evaluated input is truncated at 2048 tokens when the input is longer.
# read example dataset used in this post
# https://posit.co/blog/mall-ai-powered-text-analysis/
cop_data <- readr::read_csv("https://posit.co/wp-content/uploads/2025/03/cop_data2.csv")
library(mall)

llm_use("ollama", "llama3.2", seed = 100, .cache = tempdir())

cop_electricity <- llm_extract(
  cop_data |> dplyr::slice(1), # data
  CleanedText,                 # column
  label = "electricity keywords"
)
## whatever you ask, you still get a summary, not the keywords
cop_electricity$.extract
## process directly with httr2 to see that the prompt_eval_count in the response is truncated at 2048
req <- httr2::request("http://localhost:11434") |>
  httr2::req_url_path("api/generate") |>
  httr2::req_headers("Content-Type" = "application/json") |>
  httr2::req_body_json(
    list(
      model = "llama3.2",
      prompt = paste(
        cop_data$CleanedText[1],
        "What are the very first ten words of this text?"
      ),
      system = "You respond with the first ten words of the input you receive.",
      stream = FALSE
    )
  )
resp <- req |>
  httr2::req_perform() |>
  httr2::resp_body_json()

resp$prompt_eval_count # capped at 2048: the prompt has been truncated
resp$response
### changing model, preventing truncation
# new model - llama3.2-32k - created as described here:
# https://github.com/ollama/ollama/issues/8099#issuecomment-2543316682
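# (for reference, that workaround amounts to building a new model from a
#  Modelfile with a larger context window, roughly:
#      FROM llama3.2
#      PARAMETER num_ctx 32768
#  and then: ollama create llama3.2-32k -f Modelfile
#  the exact num_ctx value here is my own choice)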
req_32 <- httr2::request("http://localhost:11434") |>
  httr2::req_url_path("api/generate") |>
  httr2::req_headers("Content-Type" = "application/json") |>
  httr2::req_body_json(
    list(
      model = "llama3.2-32k",
      prompt = paste(
        cop_data$CleanedText[1],
        "What are the very first ten words of this text?"
      ),
      system = "You respond with the first ten words of the input you receive.",
      stream = FALSE
    )
  )
resp_32 <- req_32 |>
  httr2::req_perform() |>
  httr2::resp_body_json()

resp_32$prompt_eval_count # now reflects the full prompt, no truncation
resp_32$response
You will see that the second reply is correct, while the first is not. mall users have no way of knowing that their request is effectively truncated, which leads to the kind of inadvertent use found in that blog post (just one example, of course).
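As a side note, and unless I am mistaken about the API, the same effect can also be obtained per request by passing num_ctx in the options field of the Ollama API call, without creating a new model; mall does not currently expose this, so the following is only a sketch of where a fix or warning could hook in (the num_ctx value is again my own choice):

req_ctx <- httr2::request("http://localhost:11434") |>
  httr2::req_url_path("api/generate") |>
  httr2::req_headers("Content-Type" = "application/json") |>
  httr2::req_body_json(
    list(
      model = "llama3.2",
      prompt = paste(
        cop_data$CleanedText[1],
        "What are the very first ten words of this text?"
      ),
      system = "You respond with the first ten words of the input you receive.",
      stream = FALSE,
      # raise the context window for this request only
      options = list(num_ctx = 32768)
    )
  )

resp_ctx <- req_ctx |>
  httr2::req_perform() |>
  httr2::resp_body_json()

resp_ctx$prompt_eval_count
resp_ctx$response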