Truncating by word/char length (and maybe other units - sentences/lines/etc?) #6456

lionel-rowe · 2025-02-28T07:57:49Z

Is your feature request related to a problem? Please describe.

Sometimes it's useful to truncate strings and other things.

Words (max length 10 chars, truncation chars = "..."):

"Hello world" => "Hello..." // whole words only
"Hello" => no change

Chars (max length 10 chars, truncation chars = "..."):

"Hello world" => "Hello w..." // max length accounts for chars used for truncation
"Hello" => no change

Describe the solution you'd like

New functionality such as truncateWords and/or truncateGraphemes under text.

Describe alternatives you've considered

What do "words" and "chars" mean? (cf. Intl.Segmenter with a granularity of "word"/"grapheme")
What options should be available? At a minimum, what text/items to use as truncation placeholder, as well as max length (which may not be in the same unit as the base granularity, i.e. when truncating words, you probably care more about the char/grapheme length than the word length)
Are other granularities useful for text? E.g. sentences, lines?
For perf reasons, the general unit of measurement for the text functions could be Unicode code points, with the segmenter only being used at the boundary. Alternatively, the "Unicode width" could be used as a heuristic, though it's only a very rough indication of physical width in non-TTY environments, especially in non-monospaced fonts.
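
To make the word case concrete, here is a minimal sketch of whole-word truncation where the limit is measured in code points and the placeholder counts toward it. The names truncateWords and codePointLength are hypothetical, not an existing @std/text API, and the sketch does not handle a limit shorter than the placeholder.

```typescript
// Hypothetical helper for illustration; not an existing @std/text API.
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });

/** Length in Unicode code points rather than UTF-16 code units. */
function codePointLength(s: string): number {
  return Array.from(s).length;
}

function truncateWords(
  input: string,
  maxLength: number,
  placeholder = "...",
): string {
  if (codePointLength(input) <= maxLength) return input;
  // Reserve room for the placeholder inside the limit.
  const budget = maxLength - codePointLength(placeholder);
  let output = "";
  let used = 0;
  let kept = ""; // longest prefix that fits and ends on a word-like segment
  for (const { segment, isWordLike } of wordSegmenter.segment(input)) {
    used += codePointLength(segment);
    if (used > budget) break;
    output += segment;
    // Ending on a word drops trailing whitespace and punctuation.
    if (isWordLike) kept = output;
  }
  // If not even the first word fits, only the placeholder is returned.
  return kept + placeholder;
}

console.log(truncateWords("Hello world", 10)); // "Hello..."
console.log(truncateWords("Hello", 10)); // "Hello"
```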


timreichen · 2025-02-28T08:53:41Z

Is your feature request related to a problem? Please describe.

Sometimes it's useful to truncate strings and other things.

Words (max length 10 chars, truncation chars = "..."):

"Hello world" => "Hello..." // whole words only
"Hello" => no change

As you pointed out, Intl.Segmenter can handle this:

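function truncateByWord(input: string, limit: number): string {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "word" });
  let count = 0; // number of word-like segments kept so far
  let output = "";
  for (const { segment, isWordLike } of segmenter.segment(input)) {
    // Stop once `limit` words are kept and more input remains.
    if (count >= limit) {
      output += "...";
      break;
    }
    output += segment;
    // Only word-like segments count; whitespace and punctuation do not.
    if (isWordLike) count += 1;
  }
  return output;
}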

This limits by word count, ignoring whitespace and punctuation.

Even though this is a bit less trivial to do, I am not sure if this should be part of @std.
I think there are many different use cases (for example: add ... only if a sentence is not complete, handle special chars differently, etc.) where such an abstraction might not be helpful and the user would be best off implementing it themselves.

Chars (max length 10 chars, truncation chars = "..."):
"Hello world" => "Hello w..." // max length accounts for chars used for truncation
"Hello" => no change

This can be achieved with string.slice(), e.g. string.slice(0, limit) + "...". There is a special case where the string length is less than or equal to the limit, which probably should be checked: string.length <= limit ? string : string.slice(0, limit) + "...".
I don't think we need an abstraction for that.

Arrays of items (max length 3 items, truncation items = ['<a href="/site-map">More...</a>']):

['Home', 'Blog', 'Docs']
=> no change
['Home', 'Blog', 'Docs', 'About']
=> ['Home', 'Blog', 'More...']

Similar to chars, this can be achieved by array.slice(), e.g. array.slice(0, limit) and pushing another object if the length is bigger than the limit.
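
A minimal sketch of that array.slice() approach, assuming a single placeholder item that counts toward the limit. The name truncateItems is hypothetical, and a limit of 0 is not handled (the placeholder is still returned).

```typescript
// Hypothetical helper for illustration only.
function truncateItems<T>(
  items: readonly T[],
  limit: number,
  placeholder: T,
): T[] {
  // Nothing to do when the items already fit.
  if (items.length <= limit) return [...items];
  // Keep limit - 1 items so the placeholder fits within the limit.
  return [...items.slice(0, Math.max(0, limit - 1)), placeholder];
}

console.log(truncateItems(["Home", "Blog", "Docs"], 3, "More..."));
// ["Home", "Blog", "Docs"]
console.log(truncateItems(["Home", "Blog", "Docs", "About"], 3, "More..."));
// ["Home", "Blog", "More..."]
```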

For iterables: Is take() what you are looking for?

lionel-rowe · 2025-02-28T09:24:27Z

Similar to chars, this can be achieved by array.slice(), e.g. array.slice(0, limit) and pushing another object if the length is bigger than the limit.

Yeah maybe truncate arr/items/iter isn't that useful. I guess the use cases for non-array iterables or >1 placeholder item probably aren't that common. I've removed that part from OP as it just overcomplicates things.

Even though this is a bit less trivial to do, I am not sure if this should be part of @std.
I think there are many different use cases (for example: add ... only if a sentence is not complete, handle special chars differently, etc.) where such an abstraction might not be helpful and the user would be best off implementing it themselves.

Maybe? I mean the question is whether those use cases are common and/or whether there's a good abstraction for them that can be supplied as an option. But the point of having stuff in @std isn't that it covers every possible use case, it's that it covers the most common use cases with a simple, well-tested API.

This can be achieved with string.slice(), e.g. string.slice(0, limit) + "...". There is a special case where the string length is less than or equal to the limit, which probably should be checked: string.length <= limit ? string : string.slice(0, limit) + "...".
I don't think we need an abstraction for that.

Fails for emoji/non-BMP and multi-char graphemes and also doesn't trim trailing whitespace where present, which I think would be a sensible default ("Hello..." vs "Hello ...")

Also "..." should probably count as 3 chars for purposes of counting length, hence the max length of the truncated part would be limit - 3. As a library consumer, I wouldn't expect that len(truncate(str, limit)) could exceed limit (for some abstract operation len, TBD whether that's code point/grapheme/unicode width etc.)
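
A rough sketch of what those defaults could look like, assuming len counts graphemes (the unit is still TBD). The name truncateGraphemes is hypothetical, and limits shorter than the placeholder are not handled here.

```typescript
// Hypothetical helper for illustration; not an existing @std/text API.
const graphemeSegmenter = new Intl.Segmenter("en", {
  granularity: "grapheme",
});

/** Splits a string into grapheme clusters. */
function graphemes(s: string): string[] {
  return Array.from(graphemeSegmenter.segment(s), (x) => x.segment);
}

function truncateGraphemes(
  input: string,
  limit: number,
  placeholder = "...",
): string {
  const units = graphemes(input);
  if (units.length <= limit) return input;
  // Reserve room for the placeholder so the result stays within `limit`.
  const budget = Math.max(0, limit - graphemes(placeholder).length);
  // Trim trailing whitespace: "Hello..." rather than "Hello ...".
  return units.slice(0, budget).join("").trimEnd() + placeholder;
}

console.log(truncateGraphemes("Hello world", 10)); // "Hello w..."
console.log(truncateGraphemes("Hello world", 9)); // "Hello..."
console.log(truncateGraphemes("Hello", 10)); // "Hello"
```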

0f-0b · 2025-02-28T19:27:22Z

Also "..." should probably count as 3 chars for purposes of counting length, hence the max length of the truncated part would be limit - 3.

What if limit is < 3?

For perf reasons, the general unit of measurement for the text functions could be Unicode code points.

JS builtins that deal with strings almost always use UTF-16 code units as the unit of measurement; I suggest we do the same here. This could also greatly simplify the implementation:

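// The default granularity is "grapheme".
const segmenter = new Intl.Segmenter("en");

export function truncate(
  input: string,
  limit: number,
  truncated = "…",
): string {
  // `length` and `limit` are both measured in UTF-16 code units.
  if (input.length <= limit) {
    return input;
  }
  // Cut at the start of the grapheme containing index `limit`,
  // so a surrogate pair or grapheme cluster is never split.
  const { index } = segmenter.segment(input).containing(limit);
  return input.substring(0, index) + truncated;
}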

lionel-rowe · 2025-03-03T02:21:30Z

JS builtins that deal with strings almost always use UTF-16 code units as the unit of measurement

Only for historical reasons. Other than performance (which generally seems to be negligible, see e.g. #6014 (comment)), the only good reason to measure using UTF-16 code units in AD 2025 is if keeping in sync with numbered index access or methods like indexOf is important, which doesn't apply here.

As an example, with input = 'deno 🦕', limit = 6, truncated = '':

'deno 🦕' (unchanged input, as it only contains 6 chars) is probably the most sensible/expected output
'deno ', maybe with the trailing space trimmed to 'deno', is less ideal but still feels acceptable
'deno \uD83E' is obviously wrong, and will render in browsers as deno �
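
The three measurements can be checked directly. This is a small sketch using only built-ins; the variable names are illustrative.

```typescript
const input = "deno 🦕";

// UTF-16 code units: the dinosaur emoji (U+1F995) takes two.
const codeUnits = input.length; // 7

// Code points: string iteration walks code points, not code units.
const codePoints = Array.from(input).length; // 6

// Grapheme clusters, as segmented by Intl.Segmenter.
const graphemeCount = Array.from(
  new Intl.Segmenter("en", { granularity: "grapheme" }).segment(input),
).length; // 6

// Slicing by code units at 6 splits the surrogate pair.
const byCodeUnits = input.slice(0, 6); // "deno \uD83E" (lone high surrogate)

// Slicing by code points at 6 keeps the emoji intact.
const byCodePoints = Array.from(input).slice(0, 6).join(""); // "deno 🦕"

console.log({ codeUnits, codePoints, graphemeCount, byCodeUnits, byCodePoints });
```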

What if limit is < 3?

Maybe just return the full truncation chars or a truncated version of truncated (e.g. '..')? Throwing might not be the best idea, as limit might be set dynamically, causing unexpected runtime errors.
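
A sketch of the second option (truncating the placeholder itself), assuming lengths are measured in code points. The names fitPlaceholder and truncateCodePoints are hypothetical.

```typescript
// Hypothetical helpers for illustration only.

/** Shortens the placeholder so it never exceeds `limit` code points. */
function fitPlaceholder(placeholder: string, limit: number): string {
  return Array.from(placeholder).slice(0, Math.max(0, limit)).join("");
}

function truncateCodePoints(
  input: string,
  limit: number,
  placeholder = "...",
): string {
  const points = Array.from(input);
  if (points.length <= limit) return input;
  const tail = fitPlaceholder(placeholder, limit);
  // Whatever the (possibly shortened) placeholder leaves goes to the text.
  const budget = Math.max(0, limit - Array.from(tail).length);
  return points.slice(0, budget).join("") + tail;
}

console.log(truncateCodePoints("Hello world", 10)); // "Hello w..."
console.log(truncateCodePoints("Hello world", 2)); // ".."
console.log(truncateCodePoints("Hello world", 0)); // ""
```

This never throws for small or negative limits, which fits the concern about limit being set dynamically.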

lionel-rowe changed the title from "Truncating strings and other things" to "Truncating by word/char length (and maybe other units - sentences/lines/etc?)" on Feb 28, 2025
