Use shelf allocation for font texture atlases #723
base: master
Conversation
First of all, very cool! I have been wanting to improve the font engine for some time, so it is great to see efforts in this direction. We do already have our texture packer, which I'm sure you're aware of. I wonder if it makes sense to integrate the feature into that existing code, or if we might as well start over as it seems you have done here? By the way, did you write this from scratch?

I'd be really interested in some numbers; that's really decisive for an improvement like this. Do you think you could make some realistic benchmarks with before and after? For example, how long does it take to add one new character to a font texture? With a minimal existing number of characters, and with a large number of existing characters. And for a large and small texture / font size.

Removing unused glyphs is also something I've been thinking of. That would be a great addition, and especially important for CJK and many other languages. One thing I've also considered is whether we could share texture atlases between font sizes, or even font faces/families.
This is indeed my own from-scratch implementation. The existing texture packer is quite primitive in that it re-lays out all glyphs on each glyph set change. My implementation will later be refined to blend in with the codebase, but as it is so radically different, it will effectively be a full-on rewrite of the original packer.
I plan to introduce proper benchmarks when the full solution takes shape. But as it is quite similar to what Firefox uses, I would expect it to be similarly fast, at least when determining where to place a glyph: within each atlas page, the shelf allocator iterates only over unallocated areas, which are usually quite large as long as the page is not too fragmented. Further performance improvements could also be achieved by only uploading the updated portion of the texture atlas.
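For readers unfamiliar with shelf packing, the idea can be sketched as a toy allocator. This is illustrative only: the names and the simple first-fit search are assumptions for the sketch, not this PR's actual code.

```cpp
#include <cstdint>
#include <vector>

// Toy shelf allocator: glyphs are placed left-to-right on horizontal
// "shelves"; a new shelf is opened below when no existing shelf fits.
struct ShelfAtlas {
	struct Shelf { int y, height, cursor_x; };
	int width, height, next_shelf_y = 0;
	std::vector<Shelf> shelves;

	ShelfAtlas(int w, int h) : width(w), height(h) {}

	// Returns true and writes the glyph's position on success.
	bool Allocate(int glyph_w, int glyph_h, int& out_x, int& out_y)
	{
		// First-fit over existing shelves tall enough for the glyph.
		for (Shelf& shelf : shelves)
			if (glyph_h <= shelf.height && shelf.cursor_x + glyph_w <= width)
			{
				out_x = shelf.cursor_x;
				out_y = shelf.y;
				shelf.cursor_x += glyph_w;
				return true;
			}
		// Otherwise, open a new shelf if vertical space remains.
		if (next_shelf_y + glyph_h > height)
			return false;
		shelves.push_back({next_shelf_y, glyph_h, glyph_w});
		out_x = 0;
		out_y = next_shelf_y;
		next_shelf_y += glyph_h;
		return true;
	}
};
```

Note that the placement cost grows with the number of shelves rather than the number of glyphs, which is why adding one glyph stays cheap even in a well-filled page.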
If this sounds like a good idea to you then I would be glad to implement it; Firefox and Skia also employ such a strategy. I anticipate it would affect a larger portion of the codebase, however, so it would be of great help if I could receive some assistance in designing this feature.
Great, sounds good, I think it's reasonable to start from scratch here. Let's just make sure to also clean up old code (i.e. remove it or refactor) so that we don't have any duplicate code doing similar things. I of course understand this is just a draft for now, just wanted to say that up-front.
What I am mostly interested in here is a baseline of the current implementation. I am sure things can be made a lot more performant, but I'm not even sure what the current condition is, i.e., whether or not this is a significant bottleneck at all. I think that should be established before we make a large effort in this direction.
Great, I'll be glad to help out later on here. I think we should do it in steps though, so essentially start with what you're doing here, with just one font face at a time. Then once that work is done and merged, we can start expanding towards a more global solution.
(Force-pushed from 949fef2 to e1dc4b2.)
I have just added a benchmark suite that involves cycling through a number of Chinese characters; the same suite is also applied to the original implementation in my Original implementation
New implementation
It appears that my implementation is slower, but this is because the processing time is absolutely dominated by copying the whole texture atlas each time it is regenerated. My implementation uses a fixed 1024 × 1024 atlas size, while the original implementation determines a size that is just enough for the current glyphs and takes less time to copy the texture data. Because my implementation maintains the texture data at all times, performance could be dramatically improved by referring to that texture data directly while uploading, and possibly only uploading dirty regions within the atlas.

Update: I removed the unnecessary copy and it is indeed much faster now. It beats the original implementation at higher glyph counts thanks to not having to rearrange all glyphs.

New benchmark results
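The "only uploading dirty regions" idea mentioned above is commonly done by tracking the bounding box of all writes between uploads. A minimal sketch (illustrative only; the type and function names are invented, not part of this PR):

```cpp
#include <algorithm>

// Tracks the bounding rectangle of all atlas regions modified since the
// last GPU upload, so only that sub-rectangle needs to be submitted.
struct DirtyRegion {
	int x0 = 0, y0 = 0, x1 = 0, y1 = 0;
	bool dirty = false;

	// Record that a w-by-h region at (x, y) was written.
	void Mark(int x, int y, int w, int h)
	{
		if (!dirty)
		{
			x0 = x; y0 = y; x1 = x + w; y1 = y + h;
			dirty = true;
			return;
		}
		x0 = std::min(x0, x);
		y0 = std::min(y0, y);
		x1 = std::max(x1, x + w);
		y1 = std::max(y1, y + h);
	}

	// Called once per upload: returns whether anything changed, writes the
	// dirty rectangle, and resets the tracking state.
	bool Flush(int& out_x, int& out_y, int& out_w, int& out_h)
	{
		if (!dirty)
			return false;
		out_x = x0; out_y = y0;
		out_w = x1 - x0; out_h = y1 - y0;
		dirty = false;
		return true;
	}
};
```

With e.g. OpenGL, the flushed rectangle would map directly to a `glTexSubImage2D` call instead of re-uploading the full page.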
I appreciate the great effort here, the benchmarks are super interesting. I was a bit worried when I saw the first results, but with the new results it looks a lot better, and substantially better at higher glyph counts. So that is promising. I think we should also consider a couple of other cases:
For 1, I modified the benchmark slightly to use the following code. All benchmarks below use this code.

```cpp
bench.run(benchmark_name.c_str(), [&]() {
	ReleaseFontResources();
	std::string inner_rml;
	for (int i = 0; i < glyph_count; ++i)
	{
		inner_rml += StringUtilities::ToUTF8(static_cast<Character>(rml_font_texture_atlas_start_codepoint + i));
		body->SetInnerRML(inner_rml);
	}
	context->Update();
	context->Render();
});
```

No backend:
- Original - Set inner RML incrementally then update all once - no backend.
- Font atlas PR - Set inner RML incrementally then update all once - no backend.

With OpenGL 3 backend:
- Original - Set inner RML incrementally then update all once - SDL_GL3 backend.
- Font atlas PR - Set inner RML incrementally then update all once - SDL_GL3 backend.
My takeaway: It's generally a bit slower in this benchmark, and quite significantly so at low glyph counts. I think 10–100 glyphs are common cases in the real world. I didn't do a deep dive or profile the code yet; do you know why it is that much slower at low glyph counts? Is it mainly the initial texture size, or something else? I think we need to get closer to the master numbers before I'd be comfortable integrating this pull request. With that said, the results you posted for the high glyph counts are very encouraging, so I'd love to see this developed further.
(Force-pushed from 41e5053 to 99a7a3c.)
I just tried testing with different sizes of the texture atlas (configured here) and indeed it does appear to influence the run time significantly.

Original code with computed texture size of 128 × 128:

New code:
With size of 128 × 128:
With size of 256 × 256:
With size of 512 × 512:
With size of 1024 × 1024:
Thanks for testing and looking into this more. I'm a bit busy these days, so I'll have to get back to you later. But I'll take a deeper look at it at some point, hopefully we'll figure out some way to improve the performance a bit more.
(Force-pushed from f52b436 to 99a7a3c.)
Hello! I looked at this one again a bit closer. So indeed, just like you observed, I see that most of that extra time goes into clearing the memory after the SpriteSet allocation. I'm less concerned about this now, because I see that this is mostly an initial startup cost for that font face, rather than something that happens normally when new glyphs are added. While I think we should try to reduce it a bit more if we can, it's not a blocker in my view. If we can avoid reading from the unset memory, we can also get away with not clearing the allocation initially; I think that would be preferable, if possible. Just for reference, here is a profile. We see that most of the time is spent during
There is another observation though that concerns me quite a lot. Loading the demo sample, I see that the memory consumption is huge. In fact, the memory has increased from 36 MB (master) to 134 MB (this PR) (!), just starting it up. I see a total of 19 allocations of 4 MB each, just loading up the sample. We really need to understand where these are coming from and reduce them substantially. Maybe in the end we need to use a single shared texture memory. But even without that, I think we can reduce this, since it seems most of the textures aren't even used, so we should investigate if we can do some kind of lazy loading. Looking at it in RenderDoc, I see that only 3 or 4 font textures are actually loaded into the GPU.

I also think a 1024 × 1024 initial texture size is way too large; at least in the demo sample the textures are barely filled. If we could re-use a single atlas then it would be a lot more reasonable. Example:
I also noticed that the font effects do not seem to be working in this branch; this is particularly visible in the demo sample. I pushed some minor testing code, mostly profiling zones for Tracy, to the following branch if you want to take a look: https://github.com/mikke89/RmlUi/commits/fontAtlasShelfAllocation_testing/
This is most likely because the demo uses a number of fonts and font variants (with effects). In the current state, each font variant has its own texture atlas, and each 1024 × 1024 texture page consumes 3 MB of RAM and at least the equivalent amount of VRAM. The memory consumption would be reduced fourfold if we reduced the size to 512 × 512. But indeed, adopting a common texture atlas for all fonts would be the ultimate solution.
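The arithmetic behind those figures, assuming 3 bytes per pixel (which is consistent with the 3 MB number quoted above for a 1024 × 1024 page; the actual pixel format is not confirmed here):

```cpp
#include <cstddef>

// CPU-side bytes for one square atlas page, assuming 3 bytes per pixel.
// Halving the side length quarters the memory, since area scales with
// the square of the side.
constexpr std::size_t PageBytes(std::size_t side)
{
	return side * side * 3;
}
```

So a 1024 × 1024 page takes 3 MiB, a 512 × 512 page takes 0.75 MiB, and 19 such full-size pages account for roughly the memory increase observed above.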
Do you think this could be a direction you'd be willing to attempt? I made some more measurements, looking at the memory usage with the
So at 128x128 it seems to be around the same as master. I am not sure if it's entirely fair since this PR currently doesn't show font effects (see previous comment). In addition to a global atlas, some ideas to reduce the memory usage:
I'm very much interested in this, so I am trying to look for solutions here.
I have just made another experimental implementation of a unified font texture atlas in my
About keeping glyph bitmaps just in the texture atlas, I am seeing that as
Hello, I have questions for you:
I didn't review your work in detail and don't yet know how it works, but it would be clearer if you explained how the current glyph system works in comparison to the previous one, and how your implementation improves glyph handling and rendering. Also, it may be better to use a 1k or 2k texture and to make the cache system smarter, handling situations like the user switching pages (.rml documents): if the needed glyphs are already uploaded then we do no work at all, which is faster than the simple heuristic of treating each newly loaded document as a new session for atlas generation. Generally speaking, it is better to use a larger texture size to avoid occupying many GPU texture slots. Did you think about this? Bringing in atlas generation that only increases memory consumption and worsens CPU processing would be quite unreasonable.
Hi, I am glad there are more people who are interested in this.
It is simply that I was totally not aware of that library's existence until you mentioned it. My implementation is not exactly a port from Firefox's either. From a quick glance,
A glyph's data consists of its metrics and its bitmap. The metrics are necessary for placing the glyph on the screen. As I mentioned in my last post, the bitmap might still be useful when glyphs with effects are generated, but if an API change is acceptable then the bitmap can be removed and regenerated each time (a CPU–memory tradeoff).
It is indeed more desirable to traverse the whole document and upload all present glyphs before rendering, but in its current state, RmlUi does not have a hook for such a pre-render event. This PR is still an experimental implementation, so I have not attempted creating such a hook yet. Maybe @mikke89 could give his opinion on this.
Currently RmlUi separates each font instance into its own texture, while also regenerating the textures of all instances that belong to the same family–style–size combination. Regeneration of a texture involves running the whole packing algorithm for all glyphs. This PR's approach only runs the placement algorithm for new glyphs while also unifying all font instances into one single texture. The result will be faster reactions to new glyphs and faster rendering with fewer textures, and also saving memory since unused glyphs will be actively purged from the atlas. This will be especially beneficial for languages with a lot of glyphs, such as Chinese and Japanese.
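A unified atlas as described implies looking up glyph placements by face, size, and codepoint rather than per font instance, so only genuinely new glyphs ever run the placement algorithm. A minimal sketch of such a lookup (illustrative only; these are invented names, not this PR's actual data structures):

```cpp
#include <cstdint>
#include <unordered_map>

// Key identifying one glyph in a shared atlas: font face id, pixel size,
// and Unicode codepoint, packed into one 64-bit value for cheap hashing.
using GlyphKey = std::uint64_t;

constexpr GlyphKey MakeGlyphKey(std::uint16_t face_id, std::uint16_t size, std::uint32_t codepoint)
{
	return (GlyphKey(face_id) << 48) | (GlyphKey(size) << 32) | codepoint;
}

// Where the glyph's bitmap lives inside the shared atlas texture.
struct GlyphLocation { int x, y, width, height; };

// Cache lookups for existing glyphs are O(1); only misses trigger the
// placement algorithm, so nothing is ever re-packed.
using GlyphCache = std::unordered_map<GlyphKey, GlyphLocation>;
```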
The texture atlas in this implementation is not reset when documents are switched, it only is when it itself gets recreated or |
@leduyquang753 @mikke89 Hey, I was wondering, how do you add glyphs to an atlas while it's also being used as a texture? I use Vulkan, which is an explicit API, and images need to be in a particular layout for optimal sampling/writing/transferring etc. To sample from a shader the image is in SHADER_READ_ONLY_OPTIMAL, and then to write/transfer to it, it must be in TRANSFER_DST_OPTIMAL. I imagine this is what's going on with all the APIs, except it happens 'behind your back' with the less explicit ones. How are you supposed to write or add to something like an atlas while it's being used? Do you have to do transitions constantly?
Currently it simply creates a whole new texture and discards the old one. :-)
@leduyquang753 I see. So how long do you wait before you update it? Like, let's just say I type a bunch of characters and none of them are in any atlases, you don't add them immediately after each other, right? Like adding one glyph on each update. It's delayed/deferred a while? Hmmm, this is actually a tricky problem.
Currently, with my implementation, when each text element is rendered, glyphs in that element that are not present in the atlas are uploaded before rendering.
@leduyquang753 First I want to say very nice work. And apologies for taking so long to get back to you. I tested your new branch now with the unified texture, and here are some measurements for me in terms of memory usage:
This looks a lot more reasonable now! Let's see if there are some ways to tweak it down a bit more, but I think it is acceptable even if we can't. I do occasionally get some wrongly-rendered glyphs, and also crashes after navigating around in the demo sample. I also redid the benchmark from earlier in the thread (#723 (comment)). Here are the new results: New branch without test shell
New branch w/test shell (GLFW_GL3 backend)
Overall, it looks a lot faster now, especially at low glyph counts. It seems to always beat master without the shell. It struggles a bit more when running with the test shell at high glyph counts; perhaps it is submitting the textures unnecessarily often?
Yeah, I forgot this was being used publicly for font effects. I agree, let's leave it as is for now.
I think for a first iteration, this should be reasonable as it is. We should only be updating the style at the end of rendering, after we know all the text that will be submitted, so the font texture is submitted at most once per frame. I haven't checked this for this implementation, but that is how it works currently on master; as long as it still works like that, it should be quite reasonable.

I would eventually like to have a pre-cache system in place, for example an API where you submit some text + font style to be stored in the cache regardless of what is in the document. The idea from @wh1t3lord of loading all the text in the document regardless of visibility could also be a nice place to get to eventually. With that said, I think this feature is fine without that; it's a good improvement over the previous system regardless. And there are workarounds for precaching manually for those who really want that, like rendering a document off-screen.

This approach here is definitely the way forward. I am looking forward to seeing continued work on this, and I think you can safely integrate the unified branch into this PR. Please let me know when it is ready for review.
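The "submitted at most once per frame" behavior discussed above amounts to coalescing any number of per-element glyph additions into a single end-of-frame upload. A sketch of that pattern (illustrative only; the class and the end-of-render hook name are hypothetical):

```cpp
#include <functional>

// Coalesces any number of per-element atlas modifications during a frame
// into at most one texture submission, performed at the end of rendering.
class FontTextureUpdater {
public:
	// Called whenever a text element adds glyphs to the atlas.
	void MarkAtlasModified() { atlas_modified = true; }

	// Called once at the end of rendering (a hypothetical hook): performs
	// the GPU upload only if something changed. Returns the number of
	// submissions made this frame (0 or 1).
	int EndFrame(const std::function<void()>& upload_to_gpu)
	{
		if (!atlas_modified)
			return 0;
		upload_to_gpu();
		atlas_modified = false;
		return 1;
	}

private:
	bool atlas_modified = false;
};
```

However many elements touch the atlas during a frame, the expensive upload happens at most once, which is the property the benchmark with the test shell seems sensitive to.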
This pull request aims to resolve the following comment in FontFaceLayer.cpp:

My approach is to use the shelf packing algorithm, which is employed in Firefox. (For perspective, Skia in Chromium divides the atlas into plots and uses the skyline algorithm within each plot.)
My further plan is to also conserve memory by removing glyphs and fonts that have not been used for a while.
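Evicting glyphs that have not been used for a while is typically done with least-recently-used tracking. A toy sketch of such a tracker (illustrative only; not part of this PR's code), using the classic list-plus-map structure so both touching and evicting are O(1):

```cpp
#include <cstdint>
#include <list>
#include <unordered_map>

// Toy LRU tracker for glyph usage: the glyph untouched for the longest
// time is evicted first when the atlas needs space.
class GlyphLru {
public:
	// Record that a glyph was used this frame (front = most recent).
	void Touch(std::uint32_t glyph)
	{
		auto it = map.find(glyph);
		if (it != map.end())
			order.erase(it->second);
		order.push_front(glyph);
		map[glyph] = order.begin();
	}

	// Returns the least recently used glyph and stops tracking it; the
	// caller would then free its rectangle in the atlas.
	bool EvictOldest(std::uint32_t& out)
	{
		if (order.empty())
			return false;
		out = order.back();
		map.erase(out);
		order.pop_back();
		return true;
	}

private:
	std::list<std::uint32_t> order;
	std::unordered_map<std::uint32_t, std::list<std::uint32_t>::iterator> map;
};
```

The same structure extends naturally to whole font faces by keying on a face id instead of a glyph.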
This is currently my first iteration of the solution, which brings in the shelf allocator I implemented previously, so the code style and conventions do not match RmlUi's yet. I would like to receive some thoughts on the design, as well as behavior and performance testing, in the meantime.