[Bug] Commit 4ff2c8c: Immediate crash after CUDA initialization when using Z-Image/Flux models (RTX 4070) #1162

TayunStrry · 2026-01-02T04:04:29Z

Bug Report: Immediate Crash with Z-Image Models on New Version (Commit 4ff2c8c)
Summary
The latest version of sd-cli.exe (commit 4ff2c8c) crashes immediately after detecting the CUDA device when using Z-Image (Flux) models, while the older version (commit 23fce0b) works perfectly with the same models and command.
Environment
OS: Windows 10/11
GPU: NVIDIA GeForce RTX 4070 (Compute Capability 8.9)
New Version: stable-diffusion.cpp version unknown, commit 4ff2c8c
Working Version: stable-diffusion.cpp version unknown, commit 23fce0b
Backend: CUDA
Model: z_image_turbo-Q4_K.gguf (Z-Image/Flux model)
VAE: diffusion_pytorch_model.safetensors
LLM: Qwen3-4B-Instruct-2507-Q4_K_M.gguf
Reproduction Steps

1. Use the exact command below with the specified models
2. Observe the crash immediately after CUDA device detection
Command to Reproduce (PowerShell)
.\bin\sd-cli.exe --diffusion-model .\models\z_image_turbo-Q4_K.gguf --vae .\models\diffusion_pytorch_model.safetensors --llm .\models\Qwen3-4B-Instruct-2507-Q4_K_M.gguf -p "可爱, 萌系风格, 白发绿眼少女, 动漫画风, 卡通形象" -o test.png --cfg-scale 1.0 -H 1024 -W 512 -v
Expected Result
The program should successfully load the model and generate the image as seen in the older version.
Actual Result
The program crashes immediately after detecting the CUDA device, without attempting to load model weights.
Logs
Failing Log (Commit 4ff2c8c)

[DEBUG] main.cpp:500 - version: stable-diffusion.cpp version unknown, commit 4ff2c8c
[DEBUG] main.cpp:501 - System Info:
SSE3 = 1 AVX = 1 AVX2 = 1 AVX512 = 0 ...
[DEBUG] main.cpp:502 - SDCliParams {
mode: img_gen,
...
}
[DEBUG] main.cpp:503 - SDContextParams {
...
diffusion_flash_attn: false,
...
prediction: NONE,
...
}
[DEBUG] main.cpp:504 - SDGenerationParams {
...
strength: 0.75,
...
}
[DEBUG] stable-diffusion.cpp:161 - Using CUDA backend
[INFO ] ggml_extend.hpp:78 - ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
[INFO ] ggml_extend.hpp:78 - ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[INFO ] ggml_extend.hpp:78 - ggml_cuda_init: found 1 CUDA devices:
[INFO ] ggml_extend.hpp:78 - Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
<< CRASH HERE >>
Working Log (Commit 23fce0b)

[DEBUG] main.cpp:379 - version: stable-diffusion.cpp version unknown, commit 23fce0b
...
[INFO ] ggml_extend.hpp:77 - ggml_cuda_init: found 1 CUDA devices:
[INFO ] ggml_extend.hpp:77 - Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:233 - loading diffusion model from '.\models\z_image_turbo-Q4_K.gguf'
[INFO ] model.cpp:370 - load .\models\z_image_turbo-Q4_K.gguf using gguf format
...
[INFO ] model.cpp:1585 - loading tensors completed, taking 2.69s ...
[INFO ] main.cpp:741 - save result PNG image to 'test.png' (success)
Troubleshooting Attempted
Removed --diffusion-fa (Flash Attention)
Modified --strength (set to 1.0)
Explicitly set --prediction flux_flow and --flow-shift 1.0
Added -v for verbose logging
Tried --offload-to-cpu and other memory-related parameters

None of these changes resolved the issue. The crash consistently occurs at the same point after CUDA device detection.
Analysis
The crash point suggests a regression in the CUDA initialization logic between commits 23fce0b (working) and 4ff2c8c (failing). Specifically, it happens after CUDA context initialization but before the model loading phase begins.

Key differences in context parameters:
New version: diffusion_flash_attn: false
Old version: diffusion_flash_attn: true
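Rather than comparing the two parameter dumps by eye, they can be diffed. The following is a minimal Python sketch, not part of stable-diffusion.cpp: `parse_params` and `diff_params` are hypothetical helper names, and the two dumps are abbreviated stand-ins for the real `SDContextParams` output from the old and new builds.

```python
import re

def parse_params(log_text: str) -> dict:
    """Collect `key: value` lines from a parameter dump into a dict."""
    params = {}
    for line in log_text.splitlines():
        # Skip log-header lines such as "[DEBUG] main.cpp:503 - SDContextParams {".
        if line.lstrip().startswith("["):
            continue
        match = re.match(r"\s*([A-Za-z_][\w.]*):\s*(.*?),?\s*$", line)
        if match:
            params[match.group(1)] = match.group(2)
    return params

def diff_params(old: dict, new: dict) -> dict:
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}

# Abbreviated sample dumps (assumption: the real dumps have many more keys).
old_dump = """
  diffusion_flash_attn: true,
  prediction: NONE,
"""
new_dump = """
  diffusion_flash_attn: false,
  prediction: NONE,
"""

if __name__ == "__main__":
    changes = diff_params(parse_params(old_dump), parse_params(new_dump))
    for key, (before, after) in changes.items():
        print(f"{key}: {before} -> {after}")
```

A difference found this way only shows that the effective settings changed, not that the changed setting causes the crash.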

This appears to be a compatibility issue with RTX 4070 (compute capability 8.9) in the latest version. The crash occurs at the exact same point in the code, indicating a problem in the CUDA initialization sequence rather than model loading.
Additional Notes
The exact same models and command work perfectly on the older version
The crash is deterministic and occurs with any Z-Image model
No error messages are printed before the crash
The issue is specific to the latest commit (4ff2c8c)
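When nothing is printed before a crash, the process exit code is often the only clue. On Windows, a process that dies from an unhandled exception usually exits with the exception's NTSTATUS value (the same number PowerShell exposes as `$LASTEXITCODE`). Below is a small Python sketch for reading it. `describe_exit_code` and `run_and_report` are hypothetical helper names, and the table lists only a handful of well-known codes.

```python
import subprocess

# A few well-known Windows NTSTATUS values that can appear as process exit codes.
KNOWN_NTSTATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000001D: "STATUS_ILLEGAL_INSTRUCTION",
    0xC00000FD: "STATUS_STACK_OVERFLOW",
    0xC0000135: "STATUS_DLL_NOT_FOUND",
    0xC0000409: "STATUS_STACK_BUFFER_OVERRUN",
}

def describe_exit_code(code: int) -> str:
    """Format an exit code as unsigned 32-bit hex, naming it if it is a known NTSTATUS."""
    unsigned = code & 0xFFFFFFFF
    if unsigned in KNOWN_NTSTATUS:
        name = KNOWN_NTSTATUS[unsigned]
    elif unsigned >= 0xC0000000:
        name = "unrecognized error-severity NTSTATUS"
    else:
        name = "ordinary exit code"
    return f"0x{unsigned:08X} ({name})"

def run_and_report(argv: list) -> str:
    """Run a command, let its output pass through, and describe how it exited."""
    completed = subprocess.run(argv)
    return describe_exit_code(completed.returncode)

# Example (not executed here): pass the same arguments as the failing command.
# print(run_and_report([r".\bin\sd-cli.exe", "-m", "./AnythingXL_xl.safetensors", "-p", "test", "-v"]))
```

An illegal-instruction or DLL-not-found code would point toward the build or its runtime dependencies, and an access violation toward a bug in the program itself, though neither is conclusive on its own.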

This is a critical regression that prevents using the latest stable-diffusion.cpp with Z-Image models on RTX 4070 GPUs. I'd appreciate any insights into what changed between these commits that might affect RTX 4070 compatibility or CUDA initialization.


wbruna · 2026-01-02T11:46:22Z

There are several releases between master-431-23fce0b and master-453-4ff2c8c. Could you pinpoint which one is the first release that crashes for you? (For instance, does master-442-3e6c428 work?)

Also, is this really specific for Z-Image (or Flux)? Does any other model (e.g. SD1.5, SDXL) work?
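Pinpointing the first crashing release is a bisection over the ordered release list. Here is a rough Python sketch. It assumes a single working-to-crashing transition, and the `crashes` callback is hypothetical: in practice it would run that release's `sd-cli.exe` with the reproduction command and report whether it died.

```python
from typing import Callable, Optional, Sequence

def first_crashing(releases: Sequence[str],
                   crashes: Callable[[str], bool]) -> Optional[str]:
    """Binary-search an oldest-to-newest release list for the first crashing one.

    Assumes every release before the answer works and every release from it on crashes.
    Returns None if no release crashes.
    """
    lo, hi = 0, len(releases)
    while lo < hi:
        mid = (lo + hi) // 2
        if crashes(releases[mid]):
            hi = mid        # first bad release is at mid or earlier
        else:
            lo = mid + 1    # first bad release is after mid
    return releases[lo] if lo < len(releases) else None

# Release names taken from this thread, oldest first; the list is deliberately incomplete.
RELEASES = ["master-431-23fce0b", "master-442-3e6c428", "master-453-4ff2c8c"]
```

With the full list of releases between the known-good and known-bad builds, this needs only about log2(n) test runs instead of one per release.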

1 reply

TayunStrry Jan 3, 2026
Author

Thank you for the follow-up. I've conducted more comprehensive testing to pinpoint the issue.

1. Version Testing & Problem Scope

Regarding your question about master-442-3e6c428 - this version and all intermediate versions I tested exhibit the same crash behavior. Specifically, I tested:

sd-master-860a78e
sd-master-ccb6b0a
sd-master-df4efe2

All of these versions crash at the exact same point as 4ff2c8c - immediately after CUDA device detection but before any model loading begins. This suggests the regression was introduced sometime between 23fce0b and 3e6c428.

2. Model Specificity Testing

The issue is NOT specific to Z-Image/Flux models. I tested with an AnythingXL_xl.safetensors model and encountered the identical crash. Here's the complete terminal output:

PS D:\stable-diffusion.cpp> .\bin\sd-cli.exe -m ./AnythingXL_xl.safetensors -p "可爱, 萌系风格, 白发绿眼少女, 动漫画风, 卡通形象" -o test.png --cfg-scale 1.0 -v --diffusion-fa -H 1024 -W 512
[DEBUG] main.cpp:500  - version: stable-diffusion.cpp version unknown, commit 4ff2c8c
[DEBUG] main.cpp:501  - System Info:
    SSE3 = 1 |     AVX = 1 |     AVX2 = 1 |     AVX512 = 0 |     AVX512_VBMI = 0 |     AVX512_VNNI = 0 |     FMA = 1 |     NEON = 0 |     ARM_FMA = 0 |     F16C = 1 |     FP16_VA = 0 |     WASM_SIMD = 0 |     VSX = 0 |
[DEBUG] main.cpp:502  - SDCliParams {
  mode: img_gen,
  output_path: "test.png",
  verbose: true,
  color: false,
  canny_preprocess: false,
  convert_name: false,
  preview_method: none,
  preview_interval: 1,
  preview_path: "preview.png",
  preview_fps: 16,
  taesd_preview: false,
  preview_noisy: false
}
[DEBUG] main.cpp:503  - SDContextParams {
  n_threads: 8,
  model_path: "./AnythingXL_xl.safetensors",
  clip_l_path: "",
  clip_g_path: "",
  clip_vision_path: "",
  t5xxl_path: "",
  llm_path: "",
  llm_vision_path: "",
  diffusion_model_path: "",
  high_noise_diffusion_model_path: "",
  vae_path: "",
  taesd_path: "",
  esrgan_path: "",
  control_net_path: "",
  embedding_dir: "",
  embeddings: {
  }
  wtype: NONE,
  tensor_type_rules: "",
  lora_model_dir: "",
  photo_maker_path: "",
  rng_type: cuda,
  sampler_rng_type: NONE,
  flow_shift: INF
  offload_params_to_cpu: false,
  enable_mmap: false,
  control_net_cpu: false,
  clip_on_cpu: false,
  vae_on_cpu: false,
  diffusion_flash_attn: true,
  diffusion_conv_direct: false,
  vae_conv_direct: false,
  circular: false,
  circular_x: false,
  circular_y: false,
  chroma_use_dit_mask: true,
  qwen_image_zero_cond_t: false,
  chroma_use_t5_mask: false,
  chroma_t5_mask_pad: 1,
  prediction: NONE,
  lora_apply_mode: auto,
  vae_tiling_params: { 0, 0, 0, 0.5, 0, 0 },
  force_sdxl_vae_conv_scale: false
}
[DEBUG] main.cpp:504  - SDGenerationParams {
  loras: "{
  }",
  high_noise_loras: "{
  }",
  prompt: "可爱, 萌系风格, 白发绿眼少女, 动漫画风, 卡通形象",
  negative_prompt: "",
  clip_skip: -1,
  width: 512,
  batch_count: 1,
  init_image_path: "",
  end_image_path: "",
  mask_image_path: "",
  control_image_path: "",
  ref_image_paths: [],
  control_video_path: "",
  increase_ref_index: false,
  pm_id_images_dir: "",
  pm_id_embed_path: "",
  pm_style_strength: 20,
  skip_layers: [7, 8, 9],
  sample_params: (txt_cfg: 1.00, img_cfg: 1.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: 0.00, shifted_timestep: 0),
  high_noise_skip_layers: [7, 8, 9],
  high_noise_sample_params: (txt_cfg: 7.00, img_cfg: 7.00, distilled_guidance: 3.50, slg.layer_count: 3, slg.layer_start: 0.01, slg.layer_end: 0.20, slg.scale: 0.00, scheduler: NONE, sample_method: NONE, sample_steps: 20, eta: 0.00, shifted_timestep: 0),
  custom_sigmas: [],
  cache_mode: "",
  cache_option: "",
  cache: disabled (threshold=1, start=0.15, end=0.95),
  moe_boundary: 0.875,
  video_frames: 1,
  fps: 16,
  vace_strength: 1,
  strength: 0.75,
  control_strength: 0.9,
  seed: 42,
  upscale_repeats: 1,
  upscale_tile_size: 128,
}
[DEBUG] stable-diffusion.cpp:161  - Using CUDA backend
[INFO ] ggml_extend.hpp:78   - ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
[INFO ] ggml_extend.hpp:78   - ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
[INFO ] ggml_extend.hpp:78   - ggml_cuda_init: found 1 CUDA devices:
[INFO ] ggml_extend.hpp:78   -   Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
<< CRASH OCCURS HERE - NO FURTHER OUTPUT >>

Note that the program never reaches the model loading phase (no loading diffusion model from... message appears).
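The "how far did it get" check can be automated when comparing many runs. The Python sketch below uses marker strings copied from the log lines quoted in this thread. The stage names are my own labels, and the loading message for `-m` models may be worded differently from the `--diffusion-model` one, so treat the marker list as an assumption.

```python
# (stage label, substring to look for), in the order the stages normally occur.
STAGES = [
    ("backend selected", "Using CUDA backend"),
    ("device detected", "Device 0:"),
    ("model loading started", "loading diffusion model from"),
    ("tensors loaded", "loading tensors completed"),
    ("image saved", "save result PNG image"),
]

def last_stage_reached(log_text: str) -> str:
    """Return the label of the latest stage whose marker appears in the log."""
    reached = "nothing logged"
    for label, marker in STAGES:
        if marker in log_text:
            reached = label
    return reached

# The tail of the crashing log quoted above.
crash_log = """
[DEBUG] stable-diffusion.cpp:161  - Using CUDA backend
[INFO ] ggml_extend.hpp:78   - ggml_cuda_init: found 1 CUDA devices:
[INFO ] ggml_extend.hpp:78   -   Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes
"""

if __name__ == "__main__":
    print(last_stage_reached(crash_log))
```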

3. Analysis & Questions

The consistent crash point (after successful CUDA initialization but before any model loading) suggests this is a system-level regression affecting RTX 4070 (compute capability 8.9) users across all model types.

Key questions:

Were there any changes to CUDA initialization, memory allocation, or compute capability detection between 23fce0b and 3e6c428?
Were there updates to GGML or ggml_extend that might affect newer GPU architectures?
Could this be related to the diffusion_flash_attn flag being set differently between versions (true in working version, false by default in crashing versions)?

Regarding model format changes: Since the crash occurs before any model loading logic begins, this seems unlikely to be a model format issue.

This appears to be a critical regression preventing RTX 4070 users from using any version after 23fce0b. Please let me know if you need additional testing or system information.

2 replies

wbruna Jan 3, 2026

> Regarding your question about master-442-3e6c428 - this version and all intermediate versions I tested exhibit the same crash behavior. Specifically, I tested:
> * sd-master-860a78e
> * sd-master-ccb6b0a
> * sd-master-df4efe2

Note that the versions you mention all came after master-442-3e6c428.

> Were there updates to GGML or ggml_extend that might affect newer GPU architectures?

There was a ggml update back on master-434-50ff966, so checking both master-433-88ec9d3 and master-434-50ff966 could show if it introduced this issue.

Are you building stable-diffusion.cpp yourself? If so, could you check if the released binaries also exhibit the same problem?

TayunStrry Jan 3, 2026
Author

Based on your suggestion, I tested the pre-compiled release binaries for the versions I mentioned.

Test Results:
Yes, the official pre-compiled exe files exhibit the exact same crash behavior. Using the same computer, working directory, and command, the program consistently crashes immediately after GPU detection, with no further output.

Details:

Environment: Identical to earlier tests (same Windows installation, CUDA drivers, model files, and command syntax).

Behavior: The crash occurs at the same point: right after the line Device 0: NVIDIA GeForce RTX 4070, compute capability 8.9, VMM: yes is printed, before any model loading begins.

Implication: This confirms the issue is not related to my local build environment or compilation settings—it is present in the released binaries as well.

I will proceed to test master-433-88ec9d3 and master-434-50ff966 as you suggested and report back. Thank you for the guidance.

MACool8 · 2026-01-03T16:09:36Z

I have the same problem: sd.cpp crashes after GPU/compute device detection and before loading any models (Flux dev Q6 in my case).

I am running an RTX 4090 and an AMD 7950X3D, and I found that the same behaviour also occurs with the Vulkan and AVX512 builds.

In my tests, it stopped working with 442-3e6c428.

I tested the following releases:

| Release | CUDA | Vulkan | AVX512 |
|---|---|---|---|
| 431-23fce0b | OK | OK | OK |
| 432-60abda5 | | OK | |
| 437-30a9113 | | | OK |
| 440-3e81246 | OK | OK | OK |
| 442-3e6c428 | Crashes | Crashes | Crashes |
| 453-4ff2c8c | Crashes | Crashes | Crashes |
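The table boils down to a last-good / first-bad pair per backend. Below is a small Python sketch of that reduction, with the results transcribed by hand from the table. Treating blank cells as "not tested" is an assumption on my part, and `boundary` is a hypothetical helper name.

```python
# Results transcribed from the table, oldest release first; missing keys mean "not tested".
RESULTS = {
    "431-23fce0b": {"CUDA": "OK", "Vulkan": "OK", "AVX512": "OK"},
    "432-60abda5": {"Vulkan": "OK"},
    "437-30a9113": {"AVX512": "OK"},
    "440-3e81246": {"CUDA": "OK", "Vulkan": "OK", "AVX512": "OK"},
    "442-3e6c428": {"CUDA": "Crashes", "Vulkan": "Crashes", "AVX512": "Crashes"},
    "453-4ff2c8c": {"CUDA": "Crashes", "Vulkan": "Crashes", "AVX512": "Crashes"},
}

def boundary(results: dict, backend: str) -> tuple:
    """Return (last release reported OK, first release reported crashing) for one backend."""
    last_ok = first_bad = None
    for release, row in results.items():  # dicts preserve insertion order
        status = row.get(backend)
        if first_bad is None and status == "OK":
            last_ok = release
        elif first_bad is None and status == "Crashes":
            first_bad = release
    return last_ok, first_bad

if __name__ == "__main__":
    for backend in ("CUDA", "Vulkan", "AVX512"):
        print(backend, boundary(RESULTS, backend))
```

Because all three backends flip between the same two releases, the cause is more likely something shared by every build than a CUDA-specific change. Any release between 440 and 442 that was not tested could still narrow the window further.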

There are no error messages; it always crashes before the model loading step.

To rule out a problem with the paths, I tried both absolute and relative model paths in 453-4ff2c8c, but both commands only work in 440-3e81246 and below.

.\sd-cli.exe --diffusion-model  "C:\Users\MyUser\source\repos\ComfyUI-Flux\ComfyUI\models\unet\flux1-dev-Q6_K.gguf" --vae "C:\Users\MyUser\source\repos\ComfyUI-Flux\ComfyUI\models\vae\ae.safetensors" --clip_l "C:\Users\MyUser\source\repos\ComfyUI-Flux\ComfyUI\models\clip\clip_l.safetensors" --t5xxl "C:\Users\MyUser\Desktop\Tools\stable-diffusion.cpp\models\t5xxl_fp16.safetensors" -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --clip-on-cpu

.\sd-cli.exe --diffusion-model  ..\models\flux1-dev-Q6_K.gguf --vae ..\models\ae.safetensors --clip_l ..\models\clip_l.safetensors --t5xxl ..\models\t5xxl_fp16.safetensors -p "a lovely cat holding a sign says 'flux.cpp'" --cfg-scale 1.0 --sampling-method euler -v --clip-on-cpu

6 replies

wbruna Jan 3, 2026

This was likely caused by the changes in the Windows build on 3e6c428, then. @CarlGao4, could you take a look?

CarlGao4 Jan 4, 2026

Maybe it's due to a different version of Ninja and Visual Studio on GitHub Actions. I'll look into it.

CarlGao4 Jan 4, 2026

@MACool8 did you use the MSBuild system or Ninja?

MACool8 Jan 4, 2026

@CarlGao4 I used Ninja

CarlGao4 Jan 5, 2026

Still looking for the cause. The binaries built on CI do not crash on my computer.

CarlGao4 · 2026-01-05T01:41:14Z

Can you try the binaries built from https://github.com/CarlGao4/stable-diffusion.cpp/actions/runs/20702569376
Please wait about 2 hours if the build is still in progress.

1 reply

MACool8 Jan 5, 2026

Yeah it now works for me!
Tested it with the CUDA, Vulkan and AVX512 build. All work for me!
@TayunStrry maybe you should also try the release, to determine if we actually had the same problem and it is also fixed for you:
Ninja-fix Release
