@loci-dev

@loci-dev loci-dev temporarily deployed to stable-diffusion-cpp-prod January 15, 2026 16:45 — with GitHub Actions Inactive
@loci-agentic-ai


Performance Review Report

Commit: cccc737 - "add support for flux2 klein 4b"
Changes: 5 modified, 3 added, 3 deleted files

Summary

This commit adds support for the Flux2 Klein 4B model variant, a smaller and more resource-efficient version of the Flux2 diffusion model. The analysis shows minimal performance impact across the stable-diffusion.cpp binaries: absolute timing changes are negligible (tens to a few hundred nanoseconds per function), even where the relative change looks large because the affected functions are very short.

Performance Impact Analysis

Power Consumption:

  • build.bin.sd-server: +0.061% (+305 nanojoules)
  • build.bin.sd-cli: +0.01% (+45 nanojoules)

Both deltas are well under 0.1% and are reported in nanojoules, so the modifications have no meaningful energy impact.

Key Function Changes:

The most significant change is in sd_version_is_flux2(), which now detects both the VERSION_FLUX2 and VERSION_FLUX2_KLEIN model variants. The function appears in multiple compilation units and shows a consistent increase of about 12ns in each (from 22.83ns to 34.98ns, +53% relative). The overhead comes from the second comparison added to detect the Klein variant and is the expected cost of the new functionality.
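As an illustration of the change described above, here is a minimal sketch of what the before and after versions of the check might look like. Only the names sd_version_is_flux2, VERSION_FLUX2 and VERSION_FLUX2_KLEIN come from the report; the trimmed-down enum, the VERSION_OTHER placeholder and the sd_version_is_flux2_before helper are assumptions made for the example, not the project's actual code.

```cpp
// Hypothetical, trimmed-down enum. Only the two Flux2 names appear in the report.
enum SDVersion {
    VERSION_OTHER,        // placeholder standing in for every non-Flux2 model (assumption)
    VERSION_FLUX2,
    VERSION_FLUX2_KLEIN,
};

// Assumed shape before the commit: a single comparison.
static inline bool sd_version_is_flux2_before(SDVersion version) {
    return version == VERSION_FLUX2;
}

// Assumed shape after the commit: a second comparison covers the Klein variant.
static inline bool sd_version_is_flux2(SDVersion version) {
    return version == VERSION_FLUX2 || version == VERSION_FLUX2_KLEIN;
}
```

Under these assumptions the extra cost is one additional compare and branch per call, which is consistent with a fixed overhead of a few nanoseconds rather than an algorithmic change.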

Several functions with no source changes show mixed results: std::vector::back() increased by 191ns (+30%) and operator- for reverse iterators increased by 80ns (+62%), while the project's f8_e4m3_to_f16() conversion helper improved by 212ns (-18%). These changes stem from compiler code generation differences rather than source modifications, as the STL code itself is unchanged.

Similarly, sd_get_system_info() shows a 165ns increase (+14%) and std::shared_ptr::operator= an 80ns increase (+8%). Both are attributed to compiler variations in code layout and instruction scheduling; neither function was modified at the source level.

Code Changes and Justification

The primary functional change expands version detection logic to recognize the Klein variant, enabling the system to configure appropriate model parameters (reduced hidden dimensions: 3072 vs 6144, fewer attention heads: 24 vs 48). This allows deployment of smaller, more efficient Flux2 models on resource-constrained hardware while maintaining backward compatibility with full-size Flux2 models.
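The parameter selection described above could be sketched as follows. The Flux2Dims struct, the flux2_dims function and the cut-down enum are hypothetical names introduced for this example; only the values (3072 vs 6144 hidden dimensions, 24 vs 48 attention heads) and the two version names are taken from the report.

```cpp
// Hypothetical, trimmed-down enum. Only the two names come from the report.
enum SDVersion {
    VERSION_FLUX2,
    VERSION_FLUX2_KLEIN,
};

// Hypothetical container for the two parameters the report mentions.
struct Flux2Dims {
    int hidden_size;
    int num_heads;
};

// Pick model dimensions from the detected variant (values as stated in the report).
constexpr Flux2Dims flux2_dims(SDVersion version) {
    return version == VERSION_FLUX2_KLEIN ? Flux2Dims{3072, 24}
                                          : Flux2Dims{6144, 48};
}

// Per-head width is the hidden size divided by the number of heads.
constexpr int head_dim(Flux2Dims d) {
    return d.hidden_size / d.num_heads;
}

// Both variants work out to 128 dimensions per head with these figures.
static_assert(head_dim(flux2_dims(VERSION_FLUX2)) == 128, "full-size Flux2");
static_assert(head_dim(flux2_dims(VERSION_FLUX2_KLEIN)) == 128, "Flux2 Klein");
```

With the figures quoted in the report, halving both the hidden size and the head count leaves the per-head width unchanged at 128, so under this assumption only the width of the model differs between the two variants, not the shape of each attention head.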

All performance changes are either directly justified by the added functionality (version detection) or result from compiler optimization passes that reorganized code layout and instruction scheduling. No algorithmic regressions were introduced.
