Truncating by word/char length (and maybe other units - sentences/lines/etc?) #6456

Open
lionel-rowe opened this issue Feb 28, 2025 · 4 comments

Comments

@lionel-rowe
Contributor

lionel-rowe commented Feb 28, 2025

Is your feature request related to a problem? Please describe.

Sometimes it's useful to truncate strings and other things.

Words (max length 10 chars, truncation chars = "..."):

"Hello world" => "Hello..." // whole words only
"Hello" => no change

Chars (max length 10 chars, truncation chars = "..."):

"Hello world" => "Hello w..." // max length accounts for chars used for truncation
"Hello" => no change

Describe the solution you'd like

New functionality such as truncateWords and/or truncateGraphemes under text.

Describe alternatives you've considered

  • What do "words" and "chars" mean? (cf. Intl.Segmenter with a granularity of "word" or "grapheme")
  • What options should be available? At a minimum, what text/items to use as truncation placeholder, as well as max length (which may not be in the same unit as the base granularity, i.e. when truncating words, you probably care more about the char/grapheme length than the word length)
  • Are other granularities useful for text? E.g. sentences, lines?
  • For perf reasons, the general unit of measurement for the text functions could be Unicode code points, with the segmenter only being used at the boundary. Alternatively, the "Unicode width" could be used as a heuristic, though it's only a very rough indication of physical width in non-TTY environments, especially in non-monospaced fonts.
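
To make the first question concrete, here is a minimal sketch of what Intl.Segmenter returns at the two granularities. The helper name `segments` is hypothetical and only for illustration; Intl.Segmenter itself is the real API.

```typescript
// Hypothetical helper: collect the segments of `input` at a given granularity.
function segments(input: string, granularity: "word" | "grapheme"): string[] {
  const segmenter = new Intl.Segmenter("en", { granularity });
  return Array.from(segmenter.segment(input), (s) => s.segment);
}

// Word granularity yields the whitespace between words as its own segment.
const words = segments("Hello world", "word"); // ["Hello", " ", "world"]

// Grapheme granularity keeps the emoji whole: 6 graphemes,
// although "deno 🦕".length is 7 UTF-16 code units.
const graphemes = segments("deno 🦕", "grapheme");

console.log(words, graphemes.length, "deno 🦕".length);
```
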
@timreichen
Contributor

timreichen commented Feb 28, 2025

Is your feature request related to a problem? Please describe.

Sometimes it's useful to truncate strings and other things.

Words (max length 10 chars, truncation chars = "..."):

"Hello world" => "Hello..." // whole words only
"Hello" => no change

As you pointed out, Intl.Segmenter can handle this:

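function truncateByWord(input: string, limit: number) {
  // The default locale is used; `limit` is a number of words, not characters.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "word" });
  let count = 0;
  let output = "";
  for (const { segment, isWordLike } of segmenter.segment(input)) {
    // Once `limit` words have been emitted, any further segment triggers truncation.
    if (count >= limit) {
      output += "...";
      break;
    }
    output += segment;
    // Only word-like segments count; whitespace and punctuation do not.
    if (isWordLike) count += 1;
  }
  return output;
}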

This limits by word count, ignoring whitespace and special characters.
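
For comparison, the quoted example measures the limit in characters while still cutting only at word boundaries. A rough sketch of that variant follows. It assumes code points as the unit and that the placeholder counts toward the limit; `truncateWords` is a hypothetical name, not an existing @std function.

```typescript
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });

// Hypothetical: keep whole word segments while the result, including the
// placeholder, stays within `limit` code points.
function truncateWords(input: string, limit: number, placeholder = "..."): string {
  if (Array.from(input).length <= limit) return input;
  const budget = limit - Array.from(placeholder).length;
  let output = "";
  let used = 0;
  for (const { segment } of wordSegmenter.segment(input)) {
    const size = Array.from(segment).length;
    if (used + size > budget) break; // the next segment would not fit
    output += segment;
    used += size;
  }
  // If even the first word does not fit, only the placeholder is returned.
  return output.trimEnd() + placeholder;
}

console.log(truncateWords("Hello world", 10)); // "Hello..."
console.log(truncateWords("Hello", 10)); // "Hello"
```
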

Even though this is a bit less trivial to do, I am not sure if this should be part of @std.
I think there are many different use cases (for example: add ... only if a sentence is not complete, handle special chars differently, etc.) where such an abstraction might not be helpful and the user would be best off implementing it themselves.

Chars (max length 10 chars, truncation chars = "..."):
"Hello world" => "Hello w..." // max length accounts for chars used for truncation
"Hello" => no change

This can be achieved with string.slice(), e.g. string.slice(0, limit) + "...". There is a special case where the string length is less than or equal to the limit, which should probably be checked: string.length <= limit ? string : string.slice(0, limit) + "...".
I don't think we need an abstraction for that.

Arrays of items (max length 3 items, truncation items = ['<a href="/site-map">More...</a>']):

['Home', 'Blog', 'Docs']
=> no change
['Home', 'Blog', 'Docs', 'About']
=> ['Home', 'Blog', 'More...']
Describe the solution you'd like

Similar to chars, this can be achieved by array.slice(), e.g. array.slice(0, limit) and pushing another object if the length is bigger than the limit.
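
A rough sketch of that, assuming the placeholder item counts toward the limit as in the quoted example (`truncateItems` is a hypothetical name):

```typescript
// Hypothetical: return at most `limit` items, replacing the overflow with one placeholder.
function truncateItems<T>(items: readonly T[], limit: number, placeholder: T): T[] {
  if (items.length <= limit) return [...items];
  // Reserve one slot for the placeholder; never slice with a negative end.
  return [...items.slice(0, Math.max(0, limit - 1)), placeholder];
}

console.log(truncateItems(["Home", "Blog", "Docs"], 3, "More..."));
// ["Home", "Blog", "Docs"]
console.log(truncateItems(["Home", "Blog", "Docs", "About"], 3, "More..."));
// ["Home", "Blog", "More..."]
```
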

For iterables: Is take() what you are looking for?

@lionel-rowe
Contributor Author

lionel-rowe commented Feb 28, 2025

Similar to chars, this can be achieved by array.slice(), e.g. array.slice(0, limit) and pushing another object if the length is bigger than the limit.

Yeah maybe truncate arr/items/iter isn't that useful. I guess the use cases for non-array iterables or >1 placeholder item probably aren't that common. I've removed that part from OP as it just overcomplicates things.

Even though this is a bit less trivial to do, I am not sure if this should be part of @std.
I think there are many different use cases (for example: add ... only if a sentence is not complete, handle special chars differently, etc.) where such an abstraction might not be helpful and the user would be best off implementing it themselves.

Maybe? I mean the question is whether those use cases are common and/or whether there's a good abstraction for them that can be supplied as an option. But the point of having stuff in @std isn't that it covers every possible use case, it's that it covers the most common use cases with a simple, well-tested API.

This can be achieved with string.slice(), e.g. string.slice(0, limit) + "...". There is a special case where the string length is less than or equal to the limit, which should probably be checked: string.length <= limit ? string : string.slice(0, limit) + "...".
I don't think we need an abstraction for that.

Fails for emoji/non-BMP and multi-char graphemes and also doesn't trim trailing whitespace where present, which I think would be a sensible default ("Hello..." vs "Hello ...")

Also "..." should probably count as 3 chars for purposes of counting length, hence the max length of the truncated part would be limit - 3. As a library consumer, I wouldn't expect that len(truncate(str, limit)) could exceed limit (for some abstract operation len, TBD whether that's code point/grapheme/unicode width etc.)
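
A minimal sketch of that contract, assuming graphemes as the unit for len and trailing-whitespace trimming as the default. The names `truncateGraphemes` and `graphemesOf` are hypothetical, and the case where limit is smaller than the placeholder is deliberately left open here.

```typescript
const graphemeSegmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Hypothetical helper: split a string into grapheme clusters.
function graphemesOf(s: string): string[] {
  return Array.from(graphemeSegmenter.segment(s), (x) => x.segment);
}

// Hypothetical: the result, placeholder included, has at most `limit` graphemes
// (as long as the placeholder itself fits within `limit`).
function truncateGraphemes(input: string, limit: number, placeholder = "..."): string {
  const units = graphemesOf(input);
  if (units.length <= limit) return input;
  const budget = Math.max(0, limit - graphemesOf(placeholder).length);
  // Trim so the output reads "Hello..." rather than "Hello ...".
  return units.slice(0, budget).join("").trimEnd() + placeholder;
}

console.log(truncateGraphemes("Hello world", 10)); // "Hello w..."
console.log(truncateGraphemes("Hello world", 9)); // "Hello..."
console.log(truncateGraphemes("deno 🦕", 6, "")); // "deno 🦕"
```
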

@lionel-rowe lionel-rowe changed the title Truncating strings and other things Truncating by word/char length (and maybe other units - sentences/lines/etc?) Feb 28, 2025
@0f-0b
Contributor

0f-0b commented Feb 28, 2025

Also "..." should probably count as 3 chars for purposes of counting length, hence the max length of the truncated part would be limit - 3.

What if limit is < 3?

For perf reasons, the general unit of measurement for the text functions could be Unicode code points.

JS builtins that deal with strings almost always use UTF-16 code units as the unit of measurement; I suggest we do the same here. This could also greatly simplify the implementation:

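// No granularity is given, so this segments by grapheme cluster (the default).
const segmenter = new Intl.Segmenter("en");

export function truncate(
  input: string,
  limit: number,
  truncated = "…",
): string {
  // `limit` is measured in UTF-16 code units.
  if (input.length <= limit) {
    return input;
  }
  // Cut at the start of the grapheme that contains code unit `limit`,
  // so a surrogate pair or multi-code-unit grapheme is never split.
  const { index } = segmenter.segment(input).containing(limit);
  return input.substring(0, index) + truncated;
}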

@lionel-rowe
Contributor Author

lionel-rowe commented Mar 3, 2025

JS builtins that deal with strings almost always use UTF-16 code units as the unit of measurement

Only for historical reasons. Other than performance (which generally seems to be negligible, see e.g. #6014 (comment)), the only good reason to measure using UTF-16 code units in AD 2025 is if keeping in sync with numbered index access or methods like indexOf is important, which doesn't apply here.

As an example, with input = 'deno 🦕', limit = 6, truncated = '':

  • 'deno 🦕' (unchanged input, as it only contains 6 chars) is probably the most sensible/expected output
  • 'deno ', maybe with the trailing space trimmed to 'deno', is less ideal but still feels acceptable
  • 'deno \uD83E' is obviously wrong, and will render in browsers as deno �
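
The three outcomes can be reproduced with a short sketch. It assumes the first corresponds to counting code points, the second to snapping a code-unit limit back to a grapheme boundary (as in the truncate function proposed above), and the third to a plain slice. All variable names are illustrative only.

```typescript
const input = "deno 🦕";
const limit = 6;

// 1. Counting code points: the input has exactly 6, so it is returned unchanged.
const codePoints = Array.from(input);
const byCodePoint = codePoints.length <= limit
  ? input
  : codePoints.slice(0, limit).join("");

// 2. Counting UTF-16 code units (7 here), then cutting at the start of the
//    grapheme that contains code unit `limit`.
const segmenter = new Intl.Segmenter("en"); // default granularity: grapheme
const cut = segmenter.segment(input).containing(limit)?.index ?? input.length;
const bySnappedCodeUnit = input.length <= limit ? input : input.substring(0, cut);

// 3. Plain slice on code units: splits the surrogate pair of the emoji.
const byCodeUnit = input.length <= limit ? input : input.slice(0, limit);

console.log(JSON.stringify([byCodePoint, bySnappedCodeUnit, byCodeUnit]));
```
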

What if limit is < 3?

Maybe just return the full truncation chars or a truncated version of truncated (e.g. '..')? Throwing might not be the best idea, as limit might be set dynamically, causing unexpected runtime errors.
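
The second option could be sketched as follows, assuming code points as the unit. The names `fitPlaceholder` and `truncateCodePoints` are hypothetical, and this is only one possible behavior for small limits, not a settled design.

```typescript
// Hypothetical: shorten the placeholder itself when it does not fit in `limit`.
function fitPlaceholder(placeholder: string, limit: number): string {
  return Array.from(placeholder).slice(0, Math.max(0, limit)).join("");
}

// Hypothetical: never throws for small limits; the result, placeholder
// included, has at most `limit` code points.
function truncateCodePoints(input: string, limit: number, placeholder = "..."): string {
  const points = Array.from(input);
  if (points.length <= limit) return input;
  const suffix = fitPlaceholder(placeholder, limit);
  // Clamp at 0 so a negative limit cannot turn into a negative slice end.
  const budget = Math.max(0, limit - Array.from(suffix).length);
  return points.slice(0, budget).join("").trimEnd() + suffix;
}

console.log(truncateCodePoints("Hello world", 10)); // "Hello w..."
console.log(truncateCodePoints("Hello world", 2)); // ".."
console.log(truncateCodePoints("Hello world", 0)); // ""
```
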
