Truncating by word/char length (and maybe other units - sentences/lines/etc?) #6456
Comments
As you pointed out,

    function truncateByWord(input: string, limit: number) {
      const segmenter = new Intl.Segmenter(undefined, { granularity: "word" });
      let count = 0;
      let output = "";
      for (const { segment, isWordLike } of segmenter.segment(input)) {
        if (count >= limit) {
          output += "...";
          break;
        }
        output += segment;
        if (isWordLike) count += 1;
      }
      return output;
    }

This limits by words, ignoring whitespace and special chars. Even though this is a bit less trivial to do, I am not sure if this should be part of
This can be achieved with string.slice(), e.g.
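A minimal sketch of the slice-based approach, assuming a hypothetical helper named `truncateChars` (not an existing API) and "..." as the default truncation string:

```typescript
// Hypothetical helper for illustration only.
// `limit` is measured in UTF-16 code units, the same unit string.slice() uses.
function truncateChars(input: string, limit: number, suffix = "..."): string {
  // Nothing to do if the input already fits.
  if (input.length <= limit) return input;
  // Keep the first `limit` code units and append the truncation string.
  return input.slice(0, limit) + suffix;
}

console.log(truncateChars("Hello, world!", 5)); // "Hello..."
console.log(truncateChars("Hi", 5));            // "Hi"
```

Because slice() counts code units, this version can cut a surrogate pair in half when the cut lands inside a non-BMP character.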
Similar to chars, this can be achieved by array.slice(), e.g.

For iterables: Is take() what you are looking for?
Yeah maybe truncate arr/items/iter isn't that useful. I guess the use cases for non-array iterables or >1 placeholder item probably aren't that common. I've removed that part from OP as it just overcomplicates things.
Maybe? I mean the question is whether those use cases are common and/or whether there's a good abstraction for them that can be supplied as an option. But the point of having stuff in
Fails for emoji/non-BMP characters and multi-char graphemes, and also doesn't trim trailing whitespace where present, which I think would be a sensible default ("Hello..." vs "Hello ..."). Also, "..." should probably count as 3 chars for the purposes of counting length, hence the max length of the truncated part would be
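For illustration, here is a rough sketch of a grapheme-aware variant that addresses those points. The name `truncateGraphemes` comes from the issue description, but the signature, the default suffix, and the behavior when the limit is smaller than the suffix are all assumptions made for this example:

```typescript
// Sketch only: counts graphemes rather than UTF-16 code units.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

function truncateGraphemes(input: string, maxLength: number, suffix = "..."): string {
  // Split the input into user-perceived characters.
  const graphemes = Array.from(graphemeSegmenter.segment(input), (s) => s.segment);
  if (graphemes.length <= maxLength) return input;

  // The suffix counts toward the limit, so reserve room for it.
  const suffixLength = Array.from(graphemeSegmenter.segment(suffix)).length;
  // Assumption: if the limit is smaller than the suffix, keep nothing
  // and return the suffix on its own.
  const keep = Math.max(0, maxLength - suffixLength);

  // Trim trailing whitespace so the result reads "Hello..." and not "Hello ...".
  return graphemes.slice(0, keep).join("").trimEnd() + suffix;
}

console.log(truncateGraphemes("Hello world", 9)); // "Hello..."
console.log(truncateGraphemes("Hello world", 2)); // "..."
```

Returning the bare suffix for very small limits is just one option; truncating the suffix itself would be another.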
What if limit is < 3?
JS builtins that deal with strings almost always use UTF-16 code units as the unit of measurement; I suggest we do the same here. This could also greatly simplify the implementation:

    const segmenter = new Intl.Segmenter("en");

    export function truncate(
      input: string,
      limit: number,
      truncated = "…",
    ): string {
      if (input.length <= limit) {
        return input;
      }
      const { index } = segmenter.segment(input).containing(limit);
      return input.substring(0, index) + truncated;
    }
Only for historical reasons. Other than performance (which generally seems to be negligible, see e.g. #6014 (comment)), the only good reason to measure using UTF-16 code units in AD 2025 is if keeping in sync with numbered index access or methods like As an example, with
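To make the difference between the units concrete, here is a small sketch comparing three ways of measuring the same string. The helper name `countUnits` is made up for this example:

```typescript
// Illustrative only: three ways of measuring the "length" of a string.
const graphemeCounter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

function countUnits(input: string) {
  return {
    codeUnits: input.length,                                       // UTF-16 code units
    codePoints: Array.from(input).length,                          // Unicode code points
    graphemes: Array.from(graphemeCounter.segment(input)).length,  // user-perceived characters
  };
}

const thumbsUp = "\u{1F44D}"; // thumbs-up emoji, outside the BMP
const eAcute = "e\u0301";     // "e" followed by a combining acute accent

console.log(countUnits(thumbsUp)); // codeUnits: 2, codePoints: 1, graphemes: 1
console.log(countUnits(eAcute));   // codeUnits: 2, codePoints: 2, graphemes: 1

// Slicing by code units can cut a surrogate pair in half,
// leaving a lone high surrogate behind.
const broken = thumbsUp.slice(0, 1);
console.log(broken === "\uD83D"); // true
```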
Maybe just return the full truncation chars or a truncated version of
Is your feature request related to a problem? Please describe.
Sometimes it's useful to truncate strings and other things.
Words (max length 10 chars, truncation chars = "..."):
Chars (max length 10 chars, truncation chars = "..."):
Describe the solution you'd like
New functionality such as truncateWords and/or truncateGraphemes under text.
Describe alternatives you've considered
Intl.Segmenter with granularity of "word"/"grapheme"