@leduyquang753
Contributor

@leduyquang753 leduyquang753 commented Jan 18, 2025

This pull request aims to resolve the following comment in FontFaceLayer.cpp:

> @performance: We could be much smarter about this, e.g. such as adding new glyphs to the existing texture layout and textures. Right now we re-generate the whole thing, including textures.

My approach is to use the shelf packing algorithm, which is employed in Firefox. (For perspective, Skia in Chromium divides the atlas into plots and uses the skyline algorithm for each plot.)
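For readers unfamiliar with shelf packing, here is a minimal sketch of the idea: the atlas is divided into horizontal shelves, each rectangle goes into the tightest shelf that still has room, and a new shelf is opened when none fits. The names `Rect` and `ShelfAtlas` are hypothetical and the code is not taken from this PR or from Firefox; freeing, multiple pages and texture data are left out.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical types for illustration only; not RmlUi's or Firefox's API.
struct Rect {
	int x, y, width, height;
};

class ShelfAtlas {
public:
	ShelfAtlas(int width, int height) : width_(width), height_(height) { assert(width > 0 && height > 0); }

	// Returns the placed rectangle, or nothing if this page has no room left.
	std::optional<Rect> Allocate(int w, int h)
	{
		if (w <= 0 || h <= 0 || w > width_ || h > height_)
			return std::nullopt;

		// Prefer the existing shelf that wastes the least vertical space.
		Shelf* best = nullptr;
		for (Shelf& shelf : shelves_)
		{
			const bool fits = h <= shelf.height && w <= width_ - shelf.cursor_x;
			if (fits && (!best || shelf.height < best->height))
				best = &shelf;
		}

		if (!best)
		{
			// No shelf fits: open a new one below the last shelf, if there is room.
			if (h > height_ - next_shelf_y_)
				return std::nullopt;
			shelves_.push_back(Shelf{next_shelf_y_, h, 0});
			next_shelf_y_ += h;
			best = &shelves_.back();
		}

		const Rect result{best->cursor_x, best->y, w, h};
		best->cursor_x += w;
		return result;
	}

private:
	struct Shelf {
		int y, height, cursor_x;
	};

	int width_, height_;
	int next_shelf_y_ = 0;
	std::vector<Shelf> shelves_;
};
```

Placing a new glyph only touches the list of shelves, so glyphs that are already in the atlas keep their positions.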

My further plan is to also conserve memory by removing glyphs and fonts that have not been used for a while.

This is my first iteration of the solution. It brings in a shelf allocator I implemented previously, so the code style and conventions do not match RmlUi's yet. In the meantime, I would like to receive some thoughts about the design, as well as behavior and performance testing.

@mikke89
Owner

mikke89 commented Jan 21, 2025

First of all, very cool! I have been wanting to improve the font engine for some time, so it is great to see efforts in this direction.

We do already have our texture packer, which I'm sure you're aware of. I wonder if it makes sense to integrate the feature into that existing code, or if we might as well start over like it seems you have done here?

By the way, did you write this from scratch?

I'd be really interested in some numbers; they are decisive for an improvement like this. Do you think you could make some realistic before-and-after benchmarks? For example, how long does it take to add one new character to a font texture? With a minimal existing number of characters, and with a large number of existing characters. And for a large and small texture / font size.

Removing unused glyphs is also something I've been thinking of. That would be a great addition, and especially important for CJK and many other languages. One thing I've also considered is whether we could share texture atlases between font sizes, or even font faces/families.

@leduyquang753
Contributor Author

leduyquang753 commented Jan 22, 2025

> We do already have our texture packer, which I'm sure you're aware of. I wonder if it makes sense to integrate the feature into that existing code, or if we might as well start over like it seems you have done here?
>
> By the way, did you write this from scratch?

This is indeed my own from-scratch implementation. The existing texture packer is quite primitive in that it lays out all glyphs again on each glyph set change. My implementation will later be refined to blend in with the codebase but, as it is so radically different, it will effectively be a full rewrite of the original packer.

> I'd be really interested in some numbers; they are decisive for an improvement like this. Do you think you could make some realistic before-and-after benchmarks? For example, how long does it take to add one new character to a font texture? With a minimal existing number of characters, and with a large number of existing characters. And for a large and small texture / font size.

I plan to introduce proper benchmarks when the full solution somewhat takes shape. But as it is quite similar to what Firefox is using, I would expect it to be similarly fast at least when finding out where to place a glyph: within each atlas page, the shelf allocator iterates through only unallocated areas, which are usually quite large as long as the page is not too fragmented.

Also, further potential performance improvements could be achieved by only uploading the updated portion of the texture atlas.
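As an illustration of the partial-upload idea, the sketch below accumulates the bounding box of all atlas regions written since the last upload, so that only that box needs to be sent to the GPU. `DirtyRegion` is a hypothetical helper, not part of this PR; the actual upload call depends on the render backend (in OpenGL it could be something like `glTexSubImage2D`) and is not shown.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper for illustration; tracks one bounding box per atlas page.
struct DirtyRegion {
	bool empty = true;
	int x0 = 0, y0 = 0, x1 = 0, y1 = 0; // Half-open box: [x0, x1) x [y0, y1).

	// Grows the box so that it also covers the newly written rectangle.
	void Add(int x, int y, int w, int h)
	{
		assert(w > 0 && h > 0);
		if (empty)
		{
			x0 = x;
			y0 = y;
			x1 = x + w;
			y1 = y + h;
			empty = false;
			return;
		}
		x0 = std::min(x0, x);
		y0 = std::min(y0, y);
		x1 = std::max(x1, x + w);
		y1 = std::max(y1, y + h);
	}

	int Width() const { return empty ? 0 : x1 - x0; }
	int Height() const { return empty ? 0 : y1 - y0; }

	// Call after the box has been uploaded.
	void Clear() { *this = DirtyRegion{}; }
};
```

A single bounding box is the simplest choice; it can over-upload when two small glyphs land far apart, in which case a list of rectangles would be the natural refinement.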

> One thing I've also considered is whether we could share texture atlases between font sizes, or even font faces/families.

If this is a good idea to you then I would be glad to implement it; Firefox and Skia also employ such a strategy. I anticipate it would affect a larger portion of the codebase, however, so it would be of great help if I could receive some assistance in the design of this feature.

@mikke89
Owner

mikke89 commented Jan 26, 2025

> This is indeed my own from-scratch implementation. The existing texture packer is quite primitive in that it lays out all glyphs again on each glyph set change. My implementation will later be refined to blend in with the codebase but, as it is so radically different, it will effectively be a full rewrite of the original packer.

Great, sounds good, I think it's reasonable to start from scratch here. Let's just make sure to also clean up old code (i.e. remove it or refactor) so that we don't have any duplicate code doing similar things. I of course understand this is just a draft for now, just wanted to say that up-front.

> I plan to introduce proper benchmarks when the full solution somewhat takes shape. But as it is quite similar to what Firefox is using, I would expect it to be similarly fast at least when finding out where to place a glyph:

What I am mostly interested in here is a baseline of the current implementation. I am sure things can be made a lot more performant, but I'm not even sure what the current condition is, i.e. whether or not this is a significant bottleneck at all. I think that should be established first before we make a large effort in this direction.

> If this is a good idea to you then I would be glad to implement it; Firefox and Skia also employ such a strategy. I anticipate it would affect a larger portion of the codebase, however, so it would be of great help if I could receive some assistance in the design of this feature.

Great, I'll be glad to help out later on here. I think we should do it in steps though, so essentially start with what you're doing here, with just one font face at a time. Then once that work is done and merged, we can start expanding towards a more global solution.

@leduyquang753 leduyquang753 force-pushed the fontAtlasShelfAllocation branch 4 times, most recently from 949fef2 to e1dc4b2 Compare February 5, 2025 14:19
@leduyquang753
Contributor Author

leduyquang753 commented Feb 5, 2025

I have just added a benchmark suite that cycles through a number of Chinese characters; the same suite is also applied to the original implementation in my fontTextureAtlasBenchmarks branch. The results on my machine are as follows:

Original implementation

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 1,463,500.00 | 683.29 | 4.5% | 0.02 | Size 12 with 10 glyphs |
| 15.4% | 9,496,000.00 | 105.31 | 5.0% | 0.11 | 〰️ Size 12 with 100 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 0.5% | 270,057,000.00 | 3.70 | 9.1% | 3.05 | 〰️ Size 12 with 1000 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 77.1% | 1,899,400.00 | 526.48 | 12.0% | 0.02 | 〰️ Size 16 with 10 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 14.8% | 9,893,500.00 | 101.08 | 7.2% | 0.11 | 〰️ Size 16 with 100 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 0.3% | 423,145,600.00 | 2.36 | 1.8% | 4.74 | Size 16 with 1000 glyphs |
| 71.2% | 2,055,000.00 | 486.62 | 9.0% | 0.02 | 〰️ Size 24 with 10 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 9.0% | 16,296,900.00 | 61.36 | 1.5% | 0.18 | Size 24 with 100 glyphs |
| 0.2% | 872,547,100.00 | 1.15 | 1.5% | 9.66 | Size 24 with 1000 glyphs |
| 26.8% | 5,470,000.00 | 182.82 | 6.9% | 0.06 | 〰️ Size 48 with 10 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 2.5% | 58,260,700.00 | 17.16 | 2.2% | 0.64 | Size 48 with 100 glyphs |
| 0.1% | 2,166,604,600.00 | 0.46 | 0.9% | 23.81 | Size 48 with 1000 glyphs |
| 8.4% | 17,323,600.00 | 57.72 | 2.4% | 0.19 | Size 96 with 10 glyphs |
| 0.8% | 188,429,800.00 | 5.31 | 1.6% | 2.08 | Size 96 with 100 glyphs |
| 0.0% | 3,080,031,700.00 | 0.32 | 1.0% | 34.03 | Size 96 with 1000 glyphs |

New implementation

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 14,082,400.00 | 71.01 | 2.2% | 0.16 | Size 12 with 10 glyphs |
| 14.4% | 97,583,700.00 | 10.25 | 1.9% | 1.08 | Size 12 with 100 glyphs |
| 1.8% | 802,651,300.00 | 1.25 | 1.8% | 8.89 | Size 12 with 1000 glyphs |
| 109.7% | 12,841,200.00 | 77.87 | 3.9% | 0.14 | Size 16 with 10 glyphs |
| 16.6% | 84,954,500.00 | 11.77 | 1.5% | 0.94 | Size 16 with 100 glyphs |
| 1.7% | 813,553,700.00 | 1.23 | 2.0% | 9.19 | Size 16 with 1000 glyphs |
| 122.7% | 11,475,100.00 | 87.15 | 2.2% | 0.13 | Size 24 with 10 glyphs |
| 17.1% | 82,121,200.00 | 12.18 | 1.1% | 0.91 | Size 24 with 100 glyphs |
| 1.7% | 851,945,600.00 | 1.17 | 5.5% | 9.90 | 〰️ Size 24 with 1000 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 97.3% | 14,468,900.00 | 69.11 | 5.7% | 0.17 | 〰️ Size 48 with 10 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 14.8% | 95,021,400.00 | 10.52 | 3.4% | 1.12 | Size 48 with 100 glyphs |
| 1.2% | 1,191,679,700.00 | 0.84 | 21.5% | 14.18 | 〰️ Size 48 with 1000 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 75.3% | 18,713,600.00 | 53.44 | 3.6% | 0.21 | Size 96 with 10 glyphs |
| 13.3% | 105,910,400.00 | 9.44 | 1.0% | 1.17 | Size 96 with 100 glyphs |
| 1.4% | 1,014,052,200.00 | 0.99 | 1.5% | 11.12 | Size 96 with 1000 glyphs |

It appears that my implementation is slower, but this is because the processing time is absolutely dominated by the operation of copying the whole texture atlas each time it is regenerated. My implementation uses a fixed 1024 × 1024 atlas size, while the original implementation determines a size that is just enough for the current glyphs and takes less time to copy the texture data. Because my implementation maintains the texture data at all times, performance could be dramatically improved by directly referring to that texture data while uploading, and possibly only uploading dirty regions within the atlas.

Update: I removed the unnecessary copy and it is indeed much faster now. It beats the original implementation at higher glyph counts thanks to not having to rearrange all glyphs.

New benchmark results

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 2,970,200.00 | 336.68 | 4.4% | 0.03 | Size 12 with 10 glyphs |
| 56.3% | 5,273,900.00 | 189.61 | 4.8% | 0.06 | Size 12 with 100 glyphs |
| 9.1% | 32,638,600.00 | 30.64 | 3.2% | 0.36 | Size 12 with 1000 glyphs |
| 93.7% | 3,169,900.00 | 315.47 | 5.7% | 0.04 | 〰️ Size 16 with 10 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 55.2% | 5,385,200.00 | 185.69 | 4.2% | 0.06 | Size 16 with 100 glyphs |
| 6.9% | 43,134,600.00 | 23.18 | 2.3% | 0.48 | Size 16 with 1000 glyphs |
| 71.4% | 4,157,300.00 | 240.54 | 7.0% | 0.05 | 〰️ Size 24 with 10 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 43.6% | 6,812,600.00 | 146.79 | 6.6% | 0.08 | 〰️ Size 24 with 100 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 6.4% | 46,560,900.00 | 21.48 | 4.1% | 0.51 | Size 24 with 1000 glyphs |
| 59.1% | 5,027,900.00 | 198.89 | 4.2% | 0.06 | Size 48 with 10 glyphs |
| 27.9% | 10,648,500.00 | 93.91 | 6.0% | 0.12 | 〰️ Size 48 with 100 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 3.4% | 88,539,300.00 | 11.29 | 1.9% | 0.97 | Size 48 with 1000 glyphs |
| 29.6% | 10,019,900.00 | 99.80 | 1.6% | 0.11 | Size 96 with 10 glyphs |
| 13.2% | 22,514,700.00 | 44.42 | 0.5% | 0.25 | Size 96 with 100 glyphs |
| 1.5% | 204,386,200.00 | 4.89 | 1.3% | 2.27 | Size 96 with 1000 glyphs |

@mikke89
Owner

mikke89 commented Feb 16, 2025

I appreciate the great effort here, the benchmarks are super interesting. I was a bit worried when I saw the first results, but with the new results it looks a lot better. And substantially better at higher glyph counts. So that is promising.

I think we should also consider a couple of other cases too:

  1. First, where one is incrementally setting more characters to the document, but only updating it once at the end. This is supposed to simulate a common pattern of e.g. writing some sentences without any characters generated yet, or in a new language.
  2. Also benchmarking with backends enabled for more real-world results. This can be done with the environment variable RMLUI_TESTS_USE_SHELL=1.

For 1, I modified the benchmark slightly to use the following code. All benchmarks below use this code.

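bench.run(benchmark_name.c_str(), [&]() {
	// Start each run without any previously generated font resources.
	ReleaseFontResources();
	std::string inner_rml;
	for (int i = 0; i < glyph_count; ++i)
	{
		// Append one more character and set the document content, without updating in between.
		inner_rml += StringUtilities::ToUTF8(static_cast<Character>(rml_font_texture_atlas_start_codepoint + i));
		body->SetInnerRML(inner_rml);
	}
	// Update and render only once, after all characters have been set.
	context->Update();
	context->Render();
});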

No backend

Original - Set inner RML incrementally then update all once - no backend.

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 649,900.00 | 1,538.70 | 0.6% | 0.01 | Size 12 with 10 glyphs |
| 43.6% | 1,490,000.00 | 671.14 | 0.3% | 0.02 | Size 12 with 100 glyphs |
| 3.6% | 17,827,300.00 | 56.09 | 0.9% | 0.20 | Size 12 with 1000 glyphs |
| 85.2% | 763,000.00 | 1,310.62 | 0.3% | 0.01 | Size 16 with 10 glyphs |
| 37.5% | 1,734,400.00 | 576.57 | 0.5% | 0.02 | Size 16 with 100 glyphs |
| 2.8% | 23,089,800.00 | 43.31 | 1.4% | 0.26 | Size 16 with 1000 glyphs |
| 61.3% | 1,059,400.00 | 943.93 | 0.3% | 0.01 | Size 24 with 10 glyphs |
| 26.8% | 2,420,500.00 | 413.14 | 0.4% | 0.03 | Size 24 with 100 glyphs |
| 2.3% | 27,914,200.00 | 35.82 | 1.4% | 0.31 | Size 24 with 1000 glyphs |
| 28.7% | 2,267,300.00 | 441.05 | 2.5% | 0.03 | Size 48 with 10 glyphs |
| 12.6% | 5,158,900.00 | 193.84 | 1.5% | 0.06 | Size 48 with 100 glyphs |
| 1.1% | 58,152,900.00 | 17.20 | 0.7% | 0.64 | Size 48 with 1000 glyphs |
| 11.1% | 5,857,500.00 | 170.72 | 0.3% | 0.07 | Size 96 with 10 glyphs |
| 4.9% | 13,332,200.00 | 75.01 | 1.0% | 0.15 | Size 96 with 100 glyphs |
| 0.5% | 140,494,200.00 | 7.12 | 0.3% | 1.55 | Size 96 with 1000 glyphs |

Font atlas PR - Set inner RML incrementally then update all once - no backend.

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 3,063,400.00 | 326.43 | 7.1% | 0.03 | 〰️ Size 12 with 10 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 81.9% | 3,738,500.00 | 267.49 | 1.9% | 0.04 | Size 12 with 100 glyphs |
| 15.4% | 19,917,500.00 | 50.21 | 0.7% | 0.22 | Size 12 with 1000 glyphs |
| 131.2% | 2,335,100.00 | 428.25 | 3.7% | 0.03 | Size 16 with 10 glyphs |
| 90.2% | 3,396,600.00 | 294.41 | 2.8% | 0.04 | Size 16 with 100 glyphs |
| 12.2% | 25,088,600.00 | 39.86 | 0.8% | 0.28 | Size 16 with 1000 glyphs |
| 104.2% | 2,939,700.00 | 340.17 | 5.8% | 0.03 | 〰️ Size 24 with 10 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 65.8% | 4,658,600.00 | 214.66 | 4.1% | 0.05 | Size 24 with 100 glyphs |
| 9.8% | 31,242,400.00 | 32.01 | 1.6% | 0.34 | Size 24 with 1000 glyphs |
| 80.9% | 3,788,800.00 | 263.94 | 2.6% | 0.04 | Size 48 with 10 glyphs |
| 45.2% | 6,778,300.00 | 147.53 | 1.5% | 0.07 | Size 48 with 100 glyphs |
| 4.9% | 62,021,900.00 | 16.12 | 0.3% | 0.68 | Size 48 with 1000 glyphs |
| 41.1% | 7,453,200.00 | 134.17 | 0.6% | 0.08 | Size 96 with 10 glyphs |
| 18.8% | 16,274,900.00 | 61.44 | 3.2% | 0.18 | Size 96 with 100 glyphs |
| 2.0% | 150,255,100.00 | 6.66 | 0.5% | 1.65 | Size 96 with 1000 glyphs |

With OpenGL 3 backend

Original - Set inner RML incrementally then update all once - SDL_GL3 backend.

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 10,337,300.00 | 96.74 | 1.5% | 0.11 | Size 12 with 10 glyphs |
| 90.8% | 11,388,700.00 | 87.81 | 1.5% | 0.13 | Size 12 with 100 glyphs |
| 36.4% | 28,435,200.00 | 35.17 | 1.7% | 0.32 | Size 12 with 1000 glyphs |
| 100.0% | 10,335,700.00 | 96.75 | 0.6% | 0.11 | Size 16 with 10 glyphs |
| 91.6% | 11,282,100.00 | 88.64 | 0.5% | 0.12 | Size 16 with 100 glyphs |
| 31.0% | 33,377,100.00 | 29.96 | 1.2% | 0.37 | Size 16 with 1000 glyphs |
| 94.9% | 10,894,200.00 | 91.79 | 1.2% | 0.12 | Size 24 with 10 glyphs |
| 87.1% | 11,862,100.00 | 84.30 | 0.5% | 0.13 | Size 24 with 100 glyphs |
| 26.9% | 38,383,800.00 | 26.05 | 1.0% | 0.42 | Size 24 with 1000 glyphs |
| 86.4% | 11,968,600.00 | 83.55 | 0.9% | 0.13 | Size 48 with 10 glyphs |
| 68.8% | 15,029,500.00 | 66.54 | 1.3% | 0.17 | Size 48 with 100 glyphs |
| 14.9% | 69,431,900.00 | 14.40 | 0.7% | 0.77 | Size 48 with 1000 glyphs |
| 65.8% | 15,721,700.00 | 63.61 | 0.7% | 0.18 | Size 96 with 10 glyphs |
| 42.5% | 24,349,600.00 | 41.07 | 1.0% | 0.27 | Size 96 with 100 glyphs |
| 6.7% | 155,445,800.00 | 6.43 | 0.5% | 1.71 | Size 96 with 1000 glyphs |

Font atlas PR - Set inner RML incrementally then update all once - SDL_GL3 backend.

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 18,868,700.00 | 53.00 | 5.7% | 0.21 | 〰️ Size 12 with 10 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 93.2% | 20,248,700.00 | 49.39 | 0.8% | 0.22 | Size 12 with 100 glyphs |
| 49.3% | 38,310,600.00 | 26.10 | 2.2% | 0.42 | Size 12 with 1000 glyphs |
| 104.9% | 17,993,900.00 | 55.57 | 4.9% | 0.20 | Size 16 with 10 glyphs |
| 92.4% | 20,418,600.00 | 48.97 | 1.9% | 0.22 | Size 16 with 100 glyphs |
| 45.8% | 41,211,000.00 | 24.27 | 0.6% | 0.45 | Size 16 with 1000 glyphs |
| 109.6% | 17,218,900.00 | 58.08 | 1.6% | 0.20 | Size 24 with 10 glyphs |
| 91.9% | 20,529,100.00 | 48.71 | 0.6% | 0.23 | Size 24 with 100 glyphs |
| 40.3% | 46,785,500.00 | 21.37 | 0.5% | 0.51 | Size 24 with 1000 glyphs |
| 103.2% | 18,287,100.00 | 54.68 | 0.4% | 0.20 | Size 48 with 10 glyphs |
| 81.7% | 23,090,800.00 | 43.31 | 1.7% | 0.25 | Size 48 with 100 glyphs |
| 23.7% | 79,750,700.00 | 12.54 | 1.8% | 0.87 | Size 48 with 1000 glyphs |
| 80.6% | 23,421,300.00 | 42.70 | 4.2% | 0.26 | Size 96 with 10 glyphs |
| 57.2% | 32,967,200.00 | 30.33 | 0.8% | 0.37 | Size 96 with 100 glyphs |
| 11.2% | 167,984,000.00 | 5.95 | 0.6% | 1.86 | Size 96 with 1000 glyphs |

My takeaway

It's generally a bit slower in this benchmark, and quite significantly so at low glyph counts. I think 10-100 glyphs are the common cases in the real world. I haven't done a deep dive or profiled the code yet. Do you know why it is that much slower for low glyph counts? Is it mainly the initial texture size or something else?

I think we need to get closer to the master numbers before I'd be comfortable integrating this pull request. With that said, the results you posted with the high glyph counts are very encouraging, so I'd love to see this further developed.

@leduyquang753 leduyquang753 force-pushed the fontAtlasShelfAllocation branch from 41e5053 to 99a7a3c Compare July 12, 2025 13:24
@leduyquang753
Contributor Author

My apologies for the lack of response in the last few months. I have been very busy with my first job and only now have the time to come back to this. After profiling, it appears that the biggest factor that makes my version slower (when generating all glyphs at once) is that the std::copy calls I use to copy glyph textures take considerably longer.

Profiling of the original code:
image

Profiling of my code:
image

I currently have nothing but wild guesses as to why this is the case, so it would be great if I could receive some assistance on this.

@leduyquang753
Contributor Author

leduyquang753 commented Jul 14, 2025

I just tried testing with different sizes of the texture atlas (configured here) and indeed it does appear to influence the run time significantly.

Original code with computed texture size of 128 × 128:

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 347,500.00 | 2,877.70 | 0.6% | 0.35 | Size 12 with 10 glyphs |

New code:

With size of 128 × 128:

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 406,950.00 | 2,457.30 | 2.6% | 0.42 | Size 12 with 10 glyphs |

With size of 256 × 256:

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 510,550.00 | 1,958.67 | 6.4% | 0.46 | Size 12 with 10 glyphs |

With size of 512 × 512:

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 724,050.00 | 1,381.12 | 2.1% | 0.72 | Size 12 with 10 glyphs |

With size of 1024 × 1024:

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 1,422,650.00 | 702.91 | 1.0% | 1.42 | Size 12 with 10 glyphs |

@mikke89
Owner

mikke89 commented Jul 14, 2025

Thanks for testing and looking into this more. I'm a bit busy these days, so I'll have to get back to you later. But I'll take a deeper look at it at some point, hopefully we'll figure out some way to improve the performance a bit more.

@mikke89 mikke89 force-pushed the fontAtlasShelfAllocation branch from f52b436 to 99a7a3c Compare July 24, 2025 22:01
@mikke89
Owner

mikke89 commented Jul 24, 2025

Hello! I looked at this one again a bit closer.

So indeed, just like you observed, I see that most of that extra time goes into clearing the memory after the SpriteSet allocation.

I'm less concerned about this now, because I see that this is mostly an initial startup cost for that font face, rather than something that happens normally when new glyphs are added. While I think we should try to reduce it a bit more if we can, it's not a blocker in my view. If we can avoid reading from the unset memory, we can also get away with not clearing the allocation initially. I think that would be preferable, if possible.

Just for reference, here is a profile. We see that most of the time is spent during ReleaseFontResources, the actual Change RML update that I was mostly concerned about in this benchmark only takes a fraction of the time, 52 µs vs 2870 µs.

image

There is another observation though that concerns me quite a lot. Loading the demo sample, I see that the memory consumption is huge. In fact, the memory has increased from 36 MB (master) to 134 MB (this PR) (!), just starting it up. I see a total of 19 allocations each of 4 MB, just loading up the sample. We really need to understand where these are coming from and reduce them substantially. Maybe in the end we need to use a single shared texture memory. But even without that, I think we can reduce this, since it seems most of them aren't even used, so we should investigate if we can do some kind of lazy loading.

Looking at it in RenderDoc, I see that only 3 or 4 font textures are actually loaded into the GPU. I also think 1024x1024 initial texture size is way too large, at least in the demo sample the textures are barely filled. If we could re-use a single atlas then it would be a lot more reasonable. Example:

image

I also noticed that the font effects do not seem to be working in this branch, this is particularly visible in the demo sample.

I pushed some minor testing code to the following branch if you want to take a look, mostly profiling zones for Tracy: https://github.com/mikke89/RmlUi/commits/fontAtlasShelfAllocation_testing/
I don't mean for this to be included necessarily. Sorry for accidentally pushing into your branch first, I have reverted that now.

@leduyquang753
Contributor Author

leduyquang753 commented Jul 30, 2025

> There is another observation though that concerns me quite a lot. Loading the demo sample, I see that the memory consumption is huge. In fact, the memory has increased from 36 MB (master) to 134 MB (this PR) (!), just starting it up. I see a total of 19 allocations each of 4 MB, just loading up the sample.

This is most likely because the demo uses a number of fonts and font variants (with effects). In the current state, each font variant has its own texture atlas, and each 1024 × 1024 texture page consumes 4 MB of RAM and at least the equivalent amount of VRAM. The memory consumption would drop by a factor of 4 if we reduced the size to 512 × 512. But indeed adopting a common texture atlas for all fonts would be the ultimate solution.
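The arithmetic behind these figures can be checked with a few lines. The sketch assumes 4 bytes per pixel (RGBA8), which is an assumption on my part that matches the 4 MB allocations reported above; `PageBytes` is a hypothetical helper.

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed for one square atlas page.
// The default of 4 bytes per pixel (RGBA8) is an assumption.
constexpr std::size_t PageBytes(std::size_t side, std::size_t bytes_per_pixel = 4)
{
	return side * side * bytes_per_pixel;
}

constexpr std::size_t MiB = 1024 * 1024;

static_assert(PageBytes(1024) == 4 * MiB, "one 1024 x 1024 page takes 4 MiB");
static_assert(PageBytes(512) == 1 * MiB, "halving the side divides the size by 4");
static_assert(19 * PageBytes(1024) == 76 * MiB, "19 such pages take 76 MiB");
```

Under that assumption the 19 pages add up to 76 MiB, which accounts for most of the reported increase from 36 MB to 134 MB.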

@mikke89
Owner

mikke89 commented Aug 5, 2025

> But indeed adopting a common texture atlas for all fonts would be the ultimate solution.

Do you think this could be a direction you'd be willing to attempt?

I made some more measurements, looking at the memory usage with the demo sample and then opening the "Decorators" tab. The absolute values change a lot depending on the backend, but the differences between the numbers here should be representative.

| Branch and texture size | CPU Memory (MB) | GPU Memory (MB) |
|:---|---:|---:|
| This PR (1024) | 158 | 54 |
| This PR (512) | 89 | 36 |
| This PR (256) | 71 | 31 |
| This PR (128) | 66 | 28 |
| master | 66 | 31 |

So at 128x128 it seems to be around the same as master. I am not sure if it's entirely fair since this PR currently doesn't show font effects (see previous comment).

In addition to a global atlas, some ideas to reduce the memory usage:

  • Reduce the default atlas size, maybe to 128 or 256. I see that it adapts the size when not enough space, but I don't know how much overhead there is when having to increase the size?
  • Can we lazy-load the texture atlas as the fonts are actually displayed? I believe that is something we did with the font implementation previously, so that might actually be the main culprit of the differences, at the same base size.
  • Could it make sense to store the glyphs directly into the atlas? Right now I believe they are stored separately, and then copied into the atlas. But if we don't need to copy them more than once, it would make sense to render them directly into the atlas or sprite set.
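The lazy-loading idea from the list above could look roughly like the sketch below: a page reserves no pixel memory until the first glyph is written into it. `LazyAtlasPage` is a hypothetical class, not code from this PR, and it assumes 4 bytes per pixel. Note that it still zero-fills the buffer on first use, so it addresses the unused allocations but not the clearing cost.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration; not part of RmlUi.
class LazyAtlasPage {
public:
	explicit LazyAtlasPage(int side) : side_(side) { assert(side > 0); }

	bool IsAllocated() const { return !pixels_.empty(); }
	std::size_t BytesUsed() const { return pixels_.size(); }

	// Copies a w x h block of RGBA pixels to (x, y), allocating the page on first use.
	void Write(int x, int y, int w, int h, const std::uint8_t* rgba)
	{
		assert(rgba && x >= 0 && y >= 0 && w > 0 && h > 0 && x + w <= side_ && y + h <= side_);
		if (pixels_.empty())
			pixels_.resize(static_cast<std::size_t>(side_) * side_ * 4); // Zero-filled by resize.

		const std::size_t row_bytes = static_cast<std::size_t>(w) * 4;
		for (int row = 0; row < h; ++row)
		{
			const std::size_t dst = (static_cast<std::size_t>(y + row) * side_ + x) * 4;
			std::copy_n(rgba + row * row_bytes, row_bytes, pixels_.begin() + static_cast<std::ptrdiff_t>(dst));
		}
	}

	// Returns a pointer to the 4 bytes of the pixel at (x, y); the page must be allocated.
	const std::uint8_t* Pixel(int x, int y) const
	{
		assert(IsAllocated() && x >= 0 && y >= 0 && x < side_ && y < side_);
		return pixels_.data() + (static_cast<std::size_t>(y) * side_ + x) * 4;
	}

private:
	int side_;
	std::vector<std::uint8_t> pixels_;
};
```

With this shape, a font variant that is loaded but never displayed costs only the size of the empty object.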

I'm very much interested in this, so I am trying to look for solutions here.

@mikke89 mikke89 added enhancement New feature or request performance Performance suggestions or specific issues internationalization labels Aug 5, 2025
@leduyquang753
Contributor Author

leduyquang753 commented Aug 24, 2025

I have just made another experimental implementation of a unified font texture atlas in my unifiedFontAtlas branch. This time, the FontProvider itself manages the textures, into which all generated glyph bitmaps go. Unused glyphs are gradually removed from the texture atlas to make room for others. I have temporarily disabled layer cloning, but if this is to proceed I think it will not be strictly necessary anymore. Below is a screenshot of a sample atlas state after going through slides in the visual test program.

image

About keeping glyph bitmaps just in the texture atlas, I am seeing that as GlyphBitmap.h is part of the public headers, removing bitmap_data might affect the public interface somewhat. However, I think it is still a good idea to keep the bitmap since font effects might want to generate derived bitmaps based on it.

@wh1t3lord
Contributor

@leduyquang753

Hello, I have questions for you:

  1. Why not use https://github.com/aras-p/smol-atlas ? Did you compare the performance of your implementation with smol-atlas? Its author states that it is a little faster than Firefox's implementation. So was it worth porting the Rust implementation from Firefox to C++?
  2. Why does CPU memory usage get higher? Can we process glyphs without storing them and thus reduce memory usage at runtime?
  3. Would it be better to create one atlas at runtime from all glyphs used in the currently loaded document, and then add new glyphs to it only when a new font or new text is added? That way we would not update the atlas frequently, only when new data arrives on the rendered page. Obviously you have to reduce high-frequency CPU<->GPU transfers when uploading data to the atlas.

I didn't review your work at all and don't know how it works, but it would be clearer if you explained how the current glyph system works compared to the previous one, and how your implementation will improve the current glyph rendering and handling.

Also, it would be better to use a 1k or 2k texture and maybe make the cache system smarter, handling situations such as the user switching pages (.rml documents): if the glyphs are already uploaded then we do no work, which is faster than a simple heuristic that treats each loaded document as a new session for atlas generation. Generally speaking, it is better to use a larger texture size to avoid occupying texture slots on the GPU. Did you think about it?

Because introducing atlas generation only to make memory consumption higher and CPU processing worse is not really reasonable.

@leduyquang753
Contributor Author

leduyquang753 commented Aug 26, 2025

> Hello, I have questions for you:

Hi, I am glad there are more people who are interested in this.

> 1. Why not use https://github.com/aras-p/smol-atlas ? Did you compare the performance of your implementation with smol-atlas? Its author states that it is a little faster than Firefox's implementation. So was it worth porting the Rust implementation from Firefox to C++?

It is simply that I was totally not aware of that library's existence until you mentioned it. My implementation is not exactly a port from Firefox's either. From a quick glance, smol-atlas appears to be using the same approach anyway, although my implementation does have a few extra things:

  • It manages multiple pages of fixed-size textures instead of resizing one single atlas. It creates new pages when necessary and can also migrate textures for compaction and free up empty pages (similarly to Firefox).
  • It also handles population of the actual texture data.
  • Adjacent freed shelves can join together so that larger glyphs can still make use of them.
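To illustrate the last point, here is a minimal sketch of how adjacent freed shelves might be joined. `ShelfSpan` and `ReleaseShelf` are hypothetical names and this is not the code from the PR; the sketch assumes the shelves are kept sorted by `y` and cover the page without gaps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical representation: one entry per shelf, sorted by y.
struct ShelfSpan {
	int y;
	int height;
	bool free;
};

// Marks the shelf at `index` as free and merges it with free neighbours,
// so that a taller glyph can later reuse the combined space.
void ReleaseShelf(std::vector<ShelfSpan>& shelves, std::size_t index)
{
	assert(index < shelves.size());
	shelves[index].free = true;

	// Absorb the following shelf if it is free.
	if (index + 1 < shelves.size() && shelves[index + 1].free)
	{
		shelves[index].height += shelves[index + 1].height;
		shelves.erase(shelves.begin() + static_cast<std::ptrdiff_t>(index + 1));
	}

	// Let the preceding shelf absorb this one if it is free.
	if (index > 0 && shelves[index - 1].free)
	{
		shelves[index - 1].height += shelves[index].height;
		shelves.erase(shelves.begin() + static_cast<std::ptrdiff_t>(index));
	}
}
```

For example, freeing a 12-pixel shelf that sits between free shelves of 10 and 8 pixels leaves a single free shelf of 30 pixels.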

> 2. Why does CPU memory usage get higher? Can we process glyphs without storing them and thus reduce memory usage at runtime?

A glyph's data consists of its metrics and its bitmap. The metrics are necessary for placement of the glyph on the screen. As I have mentioned in my last post, the bitmap might still be useful when glyphs with effects are generated, but if an API change is not undesirable then this can be removed and the bitmap regenerated each time (CPU – memory tradeoff).

> 3. Would it be better to create one atlas at runtime from all glyphs used in the currently loaded document, and then add new glyphs to it only when a new font or new text is added? That way we would not update the atlas frequently, only when new data arrives on the rendered page. Obviously you have to reduce high-frequency CPU<->GPU transfers when uploading data to the atlas.

It is indeed more desirable to traverse the whole document to upload all present glyphs before rendering, but in its current state, RmlUi does not have a method for such a pre-render event. This PR is still an experimental implementation, so I have not attempted to create such a hook yet. Maybe @mikke89 could share an opinion on this.

> I didn't review your work at all and don't know how it works, but it would be clearer if you explained how the current glyph system works compared to the previous one, and how your implementation will improve the current glyph rendering and handling.

Currently RmlUi separates each font instance into its own texture, while also regenerating the textures of all instances that belong to the same family–style–size combination. Regeneration of a texture involves running the whole packing algorithm for all glyphs. This PR's approach only runs the placement algorithm for new glyphs while also unifying all font instances into one single texture. The result will be faster reactions to new glyphs and faster rendering with fewer textures, and also saving memory since unused glyphs will be actively purged from the atlas. This will be especially beneficial for languages with a lot of glyphs, such as Chinese and Japanese.
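The purging of unused glyphs mentioned above can be sketched as a simple last-used bookkeeping scheme. `GlyphUsageTracker`, its integer glyph keys and the frame-based idle threshold are all assumptions made for illustration; the policy actually used in the unifiedFontAtlas branch may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical bookkeeping: remembers the last frame in which each glyph was drawn.
class GlyphUsageTracker {
public:
	explicit GlyphUsageTracker(std::uint64_t max_idle_frames) : max_idle_frames_(max_idle_frames) {}

	// Call whenever a glyph is rendered.
	void Touch(std::uint32_t glyph_key, std::uint64_t frame) { last_used_[glyph_key] = frame; }

	// Forgets and returns all glyphs that have been idle for more than max_idle_frames,
	// so that the caller can free their atlas rectangles.
	std::vector<std::uint32_t> CollectStale(std::uint64_t current_frame)
	{
		std::vector<std::uint32_t> stale;
		for (auto it = last_used_.begin(); it != last_used_.end();)
		{
			assert(it->second <= current_frame);
			if (current_frame - it->second > max_idle_frames_)
			{
				stale.push_back(it->first);
				it = last_used_.erase(it);
			}
			else
			{
				++it;
			}
		}
		return stale;
	}

	std::size_t Size() const { return last_used_.size(); }

private:
	std::uint64_t max_idle_frames_;
	std::unordered_map<std::uint32_t, std::uint64_t> last_used_;
};
```

Because placement is incremental, freeing the rectangles of stale glyphs does not disturb the glyphs that remain in the atlas.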

> Also, it would be better to use a 1k or 2k texture and maybe make the cache system smarter, handling situations such as the user switching pages (.rml documents): if the glyphs are already uploaded then we do no work, which is faster than a simple heuristic that treats each loaded document as a new session for atlas generation. Generally speaking, it is better to use a larger texture size to avoid occupying texture slots on the GPU. Did you think about it?

The texture atlas in this implementation is not reset when documents are switched; it is only reset when it is itself recreated or when ReleaseFontResources is called. The atlas size could be made configurable down the line (it is already a constructor argument of SpriteSet).

@Please-just-dont

Please-just-dont commented Sep 9, 2025

@leduyquang753 @mikke89 Hey, I was wondering, how do you add glyphs to an atlas while it's also being used as a texture? I use Vulkan which is an explicit API, and images need to be in a particular layout for optimal sampling/writing/transferring etc. To sample from the shader it's in SHADER_READ_ONLY_OPTIMAL and then to write/transfer to it it must be in TRANSFER_DESTINATION_OPTIMAL. I imagine this is what's going on with all the APIs except it's happening 'behind your back' with the less explicit ones. How are you supposed to write or add to something like an atlas while it's being used? Do you have to do transitions constantly?

@leduyquang753
Contributor Author

> @leduyquang753 @mikke89 Hey, I was wondering, how do you add glyphs to an atlas while it's also being used as a texture? I use Vulkan which is an explicit API, and images need to be in a particular layout for optimal sampling/writing/transferring etc. To sample from the shader it's in SHADER_READ_ONLY_OPTIMAL and then to write/transfer to it it must be in TRANSFER_DESTINATION_OPTIMAL. I imagine this is what's going on with all the APIs except it's happening 'behind your back' with the less explicit ones. How are you supposed to write or add to something like an atlas while it's being used? Do you have to do transitions constantly?

Currently it simply creates a whole new texture and discards the old one. :-)
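
As a rough illustration of that approach (not the code in this PR), the sketch below keeps a CPU-side copy of the atlas and replaces the GPU texture whenever that copy has changed. `FakeRenderer` and `AtlasTexture` are hypothetical names; the fake renderer only counts live textures. In an explicit API the old texture would additionally have to stay alive until the GPU has finished using it, which this sketch does not model.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a renderer backend; it only counts live textures.
struct FakeRenderer {
    std::uint32_t next_handle = 1;
    int live_textures = 0;

    std::uint32_t CreateTexture(const std::vector<std::uint8_t>& /*pixels*/)
    {
        ++live_textures;
        return next_handle++;
    }
    void ReleaseTexture(std::uint32_t /*handle*/) { --live_textures; }
};

class AtlasTexture {
public:
    AtlasTexture(FakeRenderer& renderer, std::size_t byte_size) : renderer_(renderer), pixels_(byte_size, 0) {}
    AtlasTexture(const AtlasTexture&) = delete;
    AtlasTexture& operator=(const AtlasTexture&) = delete;
    ~AtlasTexture()
    {
        if (handle_ != 0)
            renderer_.ReleaseTexture(handle_);
    }

    // Writes into the CPU-side copy only; the texture currently in use is never modified.
    void WritePixel(std::size_t index, std::uint8_t value)
    {
        pixels_.at(index) = value;
        dirty_ = true;
    }

    // Returns a handle whose contents match the CPU copy, recreating the texture if needed.
    std::uint32_t GetHandle()
    {
        if (dirty_ || handle_ == 0) {
            const std::uint32_t fresh = renderer_.CreateTexture(pixels_);
            if (handle_ != 0)
                renderer_.ReleaseTexture(handle_); // Discard the old texture.
            handle_ = fresh;
            dirty_ = false;
        }
        return handle_;
    }

private:
    FakeRenderer& renderer_;
    std::vector<std::uint8_t> pixels_;
    std::uint32_t handle_ = 0;
    bool dirty_ = false;
};
```

Since the texture being sampled is never written to, no layout transition on an in-use image is needed; the cost is a full re-upload per change.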

@Please-just-dont

Please-just-dont commented Sep 9, 2025

@leduyquang753 I see. So how long do you wait before you update it? Like, let's just say I type a bunch of characters and none of them are in any atlases, you don't add them immediately after each other, right? Like adding one glyph on each update. It's delayed/deferred a while?

Hmmm, this is actually a tricky problem.

@leduyquang753
Contributor Author


Currently, with my implementation, when each text element is rendered, the glyphs in that element that are not present in the atlas are uploaded before rendering.

@mikke89
Owner

mikke89 commented Sep 21, 2025

@leduyquang753 First I want to say very nice work. And apologies for taking so long to get back to you. I tested your new branch now with the unified texture, and here are some measurements for me in terms of memory usage:

| Branch and sample | CPU Memory (MB) | GPU Memory (MB) |
|:---|---:|---:|
| master: load document | 53 | 38 |
| master: demo (welcome) | 69 | 64 |
| master: demo (decorators) | 73 | 64 |
| New: load document | 68 | 38 |
| New: demo (welcome) | 76 | 66 |
| New: demo (decorators) | 87 | 68 |

This looks a lot more reasonable now! Let's see if there are some ways to tweak it down a bit more, but I think it is acceptable even if we can't.

I do occasionally get some wrongly-rendered glyphs, and also crashes after navigating around in the demo sample.

I also redid the benchmark from earlier in the thread (#723 (comment)). Here are the new results:

New branch without test shell

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 320,152.67 | 3,123.51 | 6.8% | 0.01 | 〰️ Size 12 with 10 glyphs (Unstable with ~2.9 iters. Increase minEpochIterations to e.g. 29) |
| 22.2% | 1,444,387.00 | 692.34 | 1.7% | 0.02 | Size 12 with 100 glyphs |
| 2.0% | 16,007,481.00 | 62.47 | 2.1% | 0.18 | Size 12 with 1000 glyphs |
| 98.3% | 325,530.33 | 3,071.91 | 3.3% | 0.01 | Size 16 with 10 glyphs |
| 20.1% | 1,594,965.00 | 626.97 | 0.5% | 0.02 | Size 16 with 100 glyphs |
| 1.6% | 20,394,701.00 | 49.03 | 1.4% | 0.23 | Size 16 with 1000 glyphs |
| 61.2% | 522,797.00 | 1,912.79 | 10.6% | 0.01 | 〰️ Size 24 with 10 glyphs (Unstable with ~1.8 iters. Increase minEpochIterations to e.g. 18) |
| 14.6% | 2,188,756.00 | 456.88 | 3.8% | 0.02 | Size 24 with 100 glyphs |
| 1.1% | 28,848,848.00 | 34.66 | 2.2% | 0.32 | Size 24 with 1000 glyphs |
| 67.7% | 472,860.00 | 2,114.79 | 6.1% | 0.01 | 〰️ Size 48 with 10 glyphs (Unstable with ~1.8 iters. Increase minEpochIterations to e.g. 18) |
| 8.2% | 3,897,143.00 | 256.60 | 2.0% | 0.04 | Size 48 with 100 glyphs |
| 0.6% | 55,985,538.00 | 17.86 | 0.2% | 0.62 | Size 48 with 1000 glyphs |
| 35.0% | 914,013.00 | 1,094.08 | 2.6% | 0.01 | Size 96 with 10 glyphs |
| 3.6% | 8,813,690.00 | 113.46 | 2.0% | 0.11 | Size 96 with 100 glyphs |
| 0.2% | 140,316,018.00 | 7.13 | 1.3% | 1.55 | Size 96 with 1000 glyphs |

New branch w/test shell (GLFW_GL3 backend)

| relative | ns/op | op/s | err% | total | Font texture atlas |
|---:|---:|---:|---:|---:|:---|
| 100.0% | 4,718,407.00 | 211.94 | 5.4% | 0.06 | 〰️ Size 12 with 10 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 25.0% | 18,866,084.00 | 53.01 | 4.4% | 0.21 | Size 12 with 100 glyphs |
| 3.0% | 154,720,502.00 | 6.46 | 1.2% | 1.71 | Size 12 with 1000 glyphs |
| 109.9% | 4,292,096.00 | 232.99 | 0.9% | 0.05 | Size 16 with 10 glyphs |
| 26.0% | 18,158,313.00 | 55.07 | 2.2% | 0.20 | Size 16 with 100 glyphs |
| 2.9% | 160,172,312.00 | 6.24 | 1.2% | 1.80 | Size 16 with 1000 glyphs |
| 103.8% | 4,547,646.00 | 219.89 | 3.0% | 0.05 | Size 24 with 10 glyphs |
| 23.8% | 19,810,898.00 | 50.48 | 2.1% | 0.22 | Size 24 with 100 glyphs |
| 2.6% | 183,518,855.00 | 5.45 | 1.6% | 2.11 | Size 24 with 1000 glyphs |
| 103.3% | 4,568,039.00 | 218.91 | 6.4% | 0.05 | 〰️ Size 48 with 10 glyphs (Unstable with ~1.0 iters. Increase minEpochIterations to e.g. 10) |
| 22.5% | 20,957,273.00 | 47.72 | 1.7% | 0.23 | Size 48 with 100 glyphs |
| 2.3% | 201,234,794.00 | 4.97 | 2.2% | 2.21 | Size 48 with 1000 glyphs |
| 97.7% | 4,829,106.00 | 207.08 | 2.0% | 0.05 | Size 96 with 10 glyphs |
| 18.3% | 25,819,971.00 | 38.73 | 1.0% | 0.28 | Size 96 with 100 glyphs |
| 1.7% | 281,357,408.00 | 3.55 | 1.0% | 3.10 | Size 96 with 1000 glyphs |

Overall, it looks a lot faster now, especially at low glyph counts. Without the shell it seems to always beat master. It struggles a bit more when running with the test shell at high glyph counts; I suspect it may be submitting the textures more often than necessary.

About keeping glyph bitmaps just in the texture atlas, I am seeing that as GlyphBitmap.h is part of the public headers, removing bitmap_data might affect the public interface somewhat. However, I think it is still a good idea to keep the bitmap since font effects might want to generate derived bitmaps based on it.

Yeah, I forgot this was being used publicly for font effects. I agree, let's leave it as is for now.

It is indeed more desirable to traverse the whole document to upload all present glyphs before rendering, but in the current state, RmlUi does not have a method for such a pre-render event. This PR is still an experimental implementation so I have not attempted creating such a hook yet. Maybe @mikke89 could give your opinion on this.

I think for a first iteration, this should be reasonable as it is. We should only be updating the style at the end of rendering, after we know all the text that will be submitted, so the font texture is submitted at most once per frame. I haven't checked this for this implementation, but that is how it currently works on master. As long as it still works like that, it should be quite reasonable.
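
To make the "at most once per frame" idea concrete, here is a minimal sketch under my own assumptions: text elements only register the glyphs they need, and a single flush after all text is known performs the one texture submission. `GlyphAtlasCache`, `RequestGlyphs` and `FlushPending` are hypothetical names, not RmlUi API, and the counter merely stands in for the real GPU upload.

```cpp
#include <cstddef>
#include <set>
#include <string>

// Hypothetical sketch; names do not correspond to RmlUi classes.
class GlyphAtlasCache {
public:
    // Called while text elements are processed: records glyphs that are not cached yet.
    void RequestGlyphs(const std::u32string& text)
    {
        for (char32_t code_point : text) {
            if (cached_.count(code_point) == 0)
                pending_.insert(code_point);
        }
    }

    // Called once per frame after all text is known: one submission covers every new glyph.
    // Returns the number of glyphs added by this submission.
    std::size_t FlushPending()
    {
        if (pending_.empty())
            return 0;
        const std::size_t added = pending_.size();
        cached_.insert(pending_.begin(), pending_.end());
        pending_.clear();
        ++texture_submissions_; // Stands in for the actual GPU texture upload.
        return added;
    }

    std::size_t TextureSubmissions() const { return texture_submissions_; }

private:
    std::set<char32_t> cached_;
    std::set<char32_t> pending_;
    std::size_t texture_submissions_ = 0;
};
```

A counter like `TextureSubmissions()` would also be a cheap way to check whether the textures are being submitted more often than expected in the test shell benchmark.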

I would eventually like to have a pre-cache system in place, for example an API where you submit some text plus a font style to be stored in the cache regardless of what is in the document. The idea from @wh1t3lord of loading all the text in the document regardless of visibility could also be a nice thing to get eventually. With that said, I think this feature is fine without that; it's a good improvement over the previous system regardless. And there are workarounds for those who really want to precache manually, such as rendering a document off-screen.
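
Just to sketch what such a pre-cache API might look like (purely hypothetical, nothing like this exists in RmlUi today): the caller submits text together with a font style key, and the glyphs are recorded independently of any document. `FontStyleKey`, `GlyphPrecache`, `PrecacheText` and `IsCached` are invented for this example, and a real key would likely carry more than family and size.

```cpp
#include <map>
#include <set>
#include <string>
#include <tuple>

// Hypothetical font style key; a real one would likely also carry weight, style and effects.
struct FontStyleKey {
    std::string family;
    int size;

    bool operator<(const FontStyleKey& other) const
    {
        return std::tie(family, size) < std::tie(other.family, other.size);
    }
};

class GlyphPrecache {
public:
    // Stores the glyphs of `text` for `style`, independent of any document contents.
    void PrecacheText(const FontStyleKey& style, const std::u32string& text)
    {
        std::set<char32_t>& glyphs = glyphs_by_style_[style];
        glyphs.insert(text.begin(), text.end());
    }

    // Reports whether a glyph has been pre-cached for exactly this style.
    bool IsCached(const FontStyleKey& style, char32_t code_point) const
    {
        const auto it = glyphs_by_style_.find(style);
        return it != glyphs_by_style_.end() && it->second.count(code_point) != 0;
    }

private:
    std::map<FontStyleKey, std::set<char32_t>> glyphs_by_style_;
};
```

Keying on the style means that pre-caching "Hello" at size 16 does not imply anything for size 24, which matches how glyphs are rasterized per font size.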

This approach is definitely the way forward, and I am looking forward to seeing continued work on this. I think you can safely integrate the unified branch into this PR. Please let me know when it is ready for review.


Labels

enhancement New feature or request internationalization performance Performance suggestions or specific issues
