diff --git a/IMPROVEMENTS.md b/IMPROVEMENTS.md new file mode 100644 index 0000000..ce234a8 --- /dev/null +++ b/IMPROVEMENTS.md @@ -0,0 +1,94 @@ +# Qwen-Image Specific Improvements + +This document outlines the architecture-specific improvements made to optimize DyPE for Qwen-Image models. + +## Key Improvements + +### 1. **Intelligent Model Structure Detection** +- Added `_detect_qwen_model_structure()` function that automatically detects: + - Transformer/diffusion_model location + - Positional embedder path (`pos_embed` vs `pe_embedder`) + - Patch size from model config + - VAE scale factor + - Base training resolution +- Eliminates hardcoded assumptions and adapts to different Qwen model variants + +### 2. **Qwen-Specific Parameter Extraction** +- **Patch Size Detection**: Automatically extracts `patch_size` from model config (defaults to 2 for MMDiT) +- **VAE Scale Factor**: Detects actual VAE downsampling factor (typically 8x) +- **Base Resolution**: Attempts to detect from model config, falls back to 1024 +- **Axes Dimensions**: Extracts from model or uses Qwen-Image defaults `[16, 56, 56]` + +### 3. **Optimized Base Patches Calculation** +```python +# Old: Hardcoded calculation +self.base_patches = (self.base_resolution // 8) // 2 + +# New: Uses detected patch_size and base_resolution +self.base_patches = (self.base_resolution // vae_scale_factor) // patch_size +``` +- More accurate for different Qwen model variants +- Adapts to actual model architecture + +### 4. **Enhanced Positional Embedding Class** +- Added `base_resolution` and `patch_size` parameters to `QwenPosEmbed` +- Better device-aware dtype selection (handles MPS, NPU, CUDA) +- Improved comments explaining Qwen-specific behavior +- More robust handling of different tensor formats + +### 5. **Improved Scheduler Compatibility** +- Better fallback for non-Flux schedulers (FlowMatch, etc.) +- Conservative scaling approach for unknown scheduler types +- More robust error handling with `AttributeError` instead of bare `except` + +### 6. **Better Sequence Length Calculation** +```python +# Now uses detected vae_scale_factor and patch_size +latent_h, latent_w = height // vae_scale_factor, width // vae_scale_factor +padded_h = math.ceil(latent_h / patch_size) * patch_size +padded_w = math.ceil(latent_w / patch_size) * patch_size +image_seq_len = (padded_h // patch_size) * (padded_w // patch_size) +``` +- More accurate for Qwen's specific architecture +- Accounts for both VAE downsampling and patch-based downsampling + +### 7. **Enhanced Timestep Handling** +- Better handling of different timestep formats (tensor, scalar, etc.) +- More robust normalization logic +- Improved error handling for edge cases + +### 8. **Architecture-Aware Defaults** +- Qwen-Image specific defaults: + - `axes_dim = [16, 56, 56]` (MMDiT standard) + - `theta = 10000` (RoPE base frequency) + - `patch_size = 2` (MMDiT patch size) + - `vae_scale_factor = 8` (standard VAE downsampling) + +## Benefits + +1. **Better Compatibility**: Works with different Qwen-Image model variants +2. **More Accurate**: Uses actual model parameters instead of assumptions +3. **Robust**: Better error handling and fallbacks +4. **Optimized**: Qwen-specific optimizations for better performance +5. **Maintainable**: Clear structure detection makes debugging easier + +## Testing Recommendations + +When testing with your Qwen-Image model: + +1. Check console output for detected parameters (add logging if needed) +2. Verify patch_size matches your model (typically 2 for MMDiT) +3. 
Verify base_resolution matches training resolution +4. Test with different resolutions to ensure proper extrapolation +5. Monitor for any warnings about fallback behavior + +## Future Enhancements + +Potential further improvements: + +1. **MSRoPE Integration**: Qwen uses Multimodal Scalable RoPE - could add specific support +2. **Aspect Ratio Presets**: Qwen supports specific aspect ratios - could add presets +3. **Text Rendering Optimization**: Qwen excels at text - could add text-specific optimizations +4. **Multi-Image Support**: Qwen-Image-Edit supports multi-image - could extend for that +5. **Config File Support**: Allow users to override detected parameters via config + diff --git a/README.md b/README.md index 7d5e26f..0b30586 100644 --- a/README.md +++ b/README.md @@ -39,11 +39,12 @@ It works by taking advantage of the spectral progression inherent to the diffusi
-A simple, single-node integration to patch your FLUX model for high-resolution generation.
+A simple, single-node integration to patch your FLUX or Qwen-Image model for high-resolution generation.
-This node provides a seamless, "plug-and-play" integration of DyPE into any FLUX-based workflow. +This node provides a seamless, "plug-and-play" integration of DyPE into FLUX-based and Qwen-Image workflows. Two specialized nodes are available: `DyPE for FLUX` for FLUX models and `DyPE for Qwen-Image` for Qwen-Image models, each optimized for their respective architectures. **✨ Key Features:** -* **True High-Resolution Generation:** Push FLUX models to 4096x4096 and beyond while maintaining global coherence and fine detail. -* **Single-Node Integration:** Simply place the `DyPE for FLUX` node after your model loader to patch the model. No complex workflow changes required. +* **True High-Resolution Generation:** Push FLUX and Qwen-Image models to 4096x4096 and beyond while maintaining global coherence and fine detail. +* **Dual Node Support:** Two specialized nodes available - `DyPE for FLUX` and `DyPE for Qwen-Image` - each optimized for their respective architectures. +* **Single-Node Integration:** Simply place the appropriate DyPE node after your model loader to patch the model. No complex workflow changes required. * **Full Compatibility:** Works seamlessly with your existing ComfyUI workflows, samplers, schedulers, and other optimization nodes like Self-Attention or quantization. * **Fine-Grained Control:** Exposes key DyPE hyperparameters, allowing you to tune the algorithm's strength and behavior for optimal results at different target resolutions. * **Zero Inference Overhead:** DyPE's adjustments happen on-the-fly with negligible performance impact. @@ -75,6 +76,8 @@ Alternatively, to install manually: Using the node is straightforward and designed for minimal workflow disruption. +### For FLUX Models + 1. **Load Your FLUX Model:** Use a standard `Load Checkpoint` node to load your FLUX model (e.g., `FLUX.1-Krea-dev`). 2. **Add the DyPE Node:** Add the `DyPE for FLUX` node to your graph (found under `model_patches/unet`). 3. **Connect the Model:** Connect the `MODEL` output from your loader to the `model` input of the DyPE node. @@ -82,8 +85,23 @@ Using the node is straightforward and designed for minimal workflow disruption. 5. **Connect to KSampler:** Use the `MODEL` output from the DyPE node as the input for your `KSampler`. 6. **Generate!** That's it. Your workflow is now DyPE-enabled. +### For Qwen-Image Models + +1. **Load Your Qwen-Image Model:** Use a standard `Load Checkpoint` node to load your Qwen-Image model. +2. **Add the DyPE Node:** Add the `DyPE for Qwen-Image` node to your graph (found under `model_patches/unet`). +3. **Connect the Model:** Connect the `MODEL` output from your loader to the `model` input of the DyPE node. +4. **Set Resolution:** Set the `width` and `height` on the DyPE node to match the resolution of your `Empty Latent Image`. +5. **Connect to KSampler:** Use the `MODEL` output from the DyPE node as the input for your `KSampler`. +6. **Generate!** The node will automatically detect your Qwen-Image model structure and apply architecture-specific optimizations. + +### Example Workflows + +Ready-to-use example workflows are available in the [`example_workflows`](example_workflows) folder: +* **[DyPE-Flux-workflow.json](example_workflows/DyPE-Flux-workflow.json)** - Example workflow for FLUX models +* **[DyPE-Qwen-workflow.json](example_workflows/DyPE-Qwen-workflow.json)** - Example workflow for Qwen-Image models + > [!NOTE] -> This node specifically patches the **diffusion model (UNet)**. It does not modify the CLIP or VAE models. 
It is designed exclusively for **FLUX-based** architectures.
+> This node specifically patches the **diffusion model (UNet)**. It does not modify the CLIP or VAE models. It is designed for **FLUX-based** and **Qwen-Image** architectures; for Qwen-Image models it applies intelligent model structure detection and architecture-specific optimizations.

### Node Inputs

@@ -130,7 +148,7 @@ Beyond the code, I believe in the power of community and continuous learning.
## ⚠️ Known Issues and Limitations -* **FLUX Only:** This implementation is highly specific to the architecture of the FLUX model and will not work on standard U-Net models (like SD 1.5/SDXL) or other Diffusion Transformers. +* **Supported Models:** This implementation is optimized for **FLUX-based** architectures and **Qwen-Image** models. It will not work on standard U-Net models (like SD 1.5/SDXL) or other Diffusion Transformers. For Qwen-Image models, the node automatically detects model structure and applies architecture-specific optimizations (see `IMPROVEMENTS.md` for details). * **Parameter Tuning:** The optimal `dype_exponent` can vary based on your target resolution. Experimentation is key to finding the best setting for your use case. The default of `2.0` is optimized for 4K. diff --git a/__init__.py b/__init__.py index 12511c5..c81a784 100644 --- a/__init__.py +++ b/__init__.py @@ -1,6 +1,6 @@ import torch from comfy_api.latest import ComfyExtension, io -from .src.patch import apply_dype_to_flux +from .src.patch import apply_dype_to_flux, apply_dype_to_qwen class DyPE_FLUX(io.ComfyNode): """ @@ -82,11 +82,105 @@ def execute(cls, model, width: int, height: int, method: str, enable_dype: bool, patched_model = apply_dype_to_flux(model, width, height, method, enable_dype, dype_exponent, base_shift, max_shift) return io.NodeOutput(patched_model) +class DyPE_QWEN(io.ComfyNode): + """ + Applies DyPE (Dynamic Position Extrapolation) to a Qwen-Image model. + This allows generating images at resolutions far beyond the model's training scale + by dynamically adjusting positional encodings and the noise schedule. + """ + + @classmethod + def define_schema(cls) -> io.Schema: + return io.Schema( + node_id="DyPE_QWEN", + display_name="DyPE for Qwen-Image", + category="model_patches/unet", + description="Applies DyPE (Dynamic Position Extrapolation) to a Qwen-Image model for ultra-high-resolution generation.", + inputs=[ + io.Model.Input( + "model", + tooltip="The Qwen-Image model to patch with DyPE.", + ), + io.Int.Input( + "width", + default=1024, min=16, max=8192, step=8, + tooltip="Target image width. Must match the width of your empty latent." + ), + io.Int.Input( + "height", + default=1024, min=16, max=8192, step=8, + tooltip="Target image height. Must match the height of your empty latent." + ), + io.Combo.Input( + "method", + options=["yarn", "ntk", "base"], + default="yarn", + tooltip="Position encoding extrapolation method (YARN recommended).", + ), + io.Boolean.Input( + "enable_dype", + default=True, + label_on="Enabled", + label_off="Disabled", + tooltip="Enable or disable Dynamic Position Extrapolation for RoPE.", + ), + io.Float.Input( + "dype_exponent", + default=3.0, min=0.0, max=10.0, step=0.1, + optional=True, + tooltip="Controls DyPE strength over time (λt). 3.0=Very aggressive (best for 4K+), 2.0=Exponential, 1.0=Linear, 0.5=Sub-linear (better for ~2K). Higher values (up to 10.0) for extreme high-resolution generation." + ), + io.Float.Input( + "base_shift", + default=0.10, min=0.0, max=10.0, step=0.01, + optional=True, + tooltip="Advanced: Base shift for the noise schedule (mu). Default is 0.10." + ), + io.Float.Input( + "max_shift", + default=1.15, min=0.0, max=10.0, step=0.01, + optional=True, + tooltip="Advanced: Max shift for the noise schedule (mu) at high resolutions. Default is 1.15." + ), + io.Float.Input( + "editing_strength", + default=0.0, min=0.0, max=1.0, step=0.1, + optional=True, + tooltip="DyPE strength multiplier for image editing (0.0-1.0). 
Lower values preserve more original structure. Default 0.0 for maximum preservation. Set to 1.0 for pure generation." + ), + io.Combo.Input( + "editing_mode", + options=["adaptive", "timestep_aware", "resolution_aware", "minimal", "full"], + default="adaptive", + tooltip="Editing mode strategy: 'adaptive' (recommended) - timestep-aware scaling, 'timestep_aware' - more DyPE early/less late, 'resolution_aware' - only reduce at high res, 'minimal' - minimal DyPE for editing, 'full' - always full DyPE." + ), + ], + outputs=[ + io.Model.Output( + display_name="Patched Model", + tooltip="The Qwen-Image model patched with DyPE.", + ), + ], + ) + + @classmethod + def execute(cls, model, width: int, height: int, method: str, enable_dype: bool, dype_exponent: float = 3.0, base_shift: float = 0.10, max_shift: float = 1.15, editing_strength: float = 0.0, editing_mode: str = "adaptive") -> io.NodeOutput: + """ + Clones the model and applies the DyPE patch for both the noise schedule and positional embeddings. + """ + # Check if this is a Qwen model + has_transformer = hasattr(model.model, "transformer") or hasattr(model.model, "diffusion_model") + if not has_transformer: + raise ValueError("This node is only compatible with Qwen-Image models.") + + patched_model = apply_dype_to_qwen(model, width, height, method, enable_dype, dype_exponent, base_shift, max_shift, editing_strength, editing_mode) + return io.NodeOutput(patched_model) + class DyPEExtension(ComfyExtension): - """Registers the DyPE node.""" + """Registers the DyPE nodes for both FLUX and Qwen-Image.""" async def get_node_list(self) -> list[type[io.ComfyNode]]: - return [DyPE_FLUX] + return [DyPE_FLUX, DyPE_QWEN] async def comfy_entrypoint() -> DyPEExtension: return DyPEExtension() \ No newline at end of file diff --git a/example_workflows/DyPE-Qwen-workflow.json b/example_workflows/DyPE-Qwen-workflow.json new file mode 100644 index 0000000..2c26821 --- /dev/null +++ b/example_workflows/DyPE-Qwen-workflow.json @@ -0,0 +1 @@ +{"id":"908d0bfb-e192-4627-9b57-147496e6e2dd","revision":0,"last_node_id":86,"last_link_id":190,"nodes":[{"id":27,"type":"EmptySD3LatentImage","pos":[-2.235301862869016,-93.17122430326786],"size":[323.9345161038307,167.5892789151448],"flags":{},"order":0,"mode":0,"inputs":[{"localized_name":"width","name":"width","type":"INT","widget":{"name":"width"},"link":null},{"localized_name":"height","name":"height","type":"INT","widget":{"name":"height"},"link":null},{"localized_name":"batch_size","name":"batch_size","type":"INT","widget":{"name":"batch_size"},"link":null}],"outputs":[{"localized_name":"LATENT","name":"LATENT","type":"LATENT","slot_index":0,"links":[165]}],"properties":{"cnr_id":"comfy-core","ver":"0.3.40","Node name for S&R":"EmptySD3LatentImage"},"widgets_values":[3072,3072,1],"color":"#223","bgcolor":"#335"},{"id":72,"type":"MarkdownNote","pos":[410.15826733640705,-632.3489219153869],"size":[330.6477602234994,630.60309545576],"flags":{},"order":1,"mode":0,"inputs":[],"outputs":[],"title":"Default bucket aspect ratios","properties":{},"widgets_values":["\n| Aspect Ratio | 1080p (W × H) | 4K (W × H) | 8K (W × H) |\n| ------------ | ------------- | ----------- | ------------ |\n| 0.26 | 281 × 1080 | 563 × 2160 | 1126 × 4320 |\n| 0.31 | 335 × 1080 | 671 × 2160 | 1342 × 4320 |\n| 0.38 | 410 × 1080 | 820 × 2160 | 1640 × 4320 |\n| 0.43 | 464 × 1080 | 928 × 2160 | 1856 × 4320 |\n| 0.52 | 562 × 1080 | 1124 × 2160 | 2248 × 4320 |\n| 0.58 | 626 × 1080 | 1252 × 2160 | 2504 × 4320 |\n| 0.67 | 724 × 1080 | 1448 × 2160 | 2896 
× 4320 |\n| 0.74 | 799 × 1080 | 1598 × 2160 | 3196 × 4320 |\n| 0.86 | 929 × 1080 | 1858 × 2160 | 3716 × 4320 |\n| 0.95 | 1026 × 1080 | 2052 × 2160 | 4104 × 4320 |\n| 1.05 | 1134 × 1080 | 2268 × 2160 | 4536 × 4320 |\n| 1.17 | 1264 × 1080 | 2528 × 2160 | 5056 × 4320 |\n| 1.29 | 1393 × 1080 | 2786 × 2160 | 5572 × 4320 |\n| 1.35 | 1458 × 1080 | 2916 × 2160 | 5832 × 4320 |\n| 1.50 | 1620 × 1080 | 3240 × 2160 | 6480 × 4320 |\n| 1.67 | 1804 × 1080 | 3608 × 2160 | 7216 × 4320 |\n| 1.73 | 1868 × 1080 | 3736 × 2160 | 7472 × 4320 |\n| 2.00 | 2160 × 1080 | 4320 × 2160 | 8640 × 4320 |\n| 2.31 | 2495 × 1080 | 4990 × 2160 | 9980 × 4320 |\n| 2.58 | 2786 × 1080 | 5572 × 2160 | 11144 × 4320 |\n| 2.75 | 2970 × 1080 | 5940 × 2160 | 11880 × 4320 |\n| 3.09 | 3337 × 1080 | 6674 × 2160 | 13348 × 4320 |\n| 3.70 | 3996 × 1080 | 7992 × 2160 | 15984 × 4320 |\n| 3.80 | 4104 × 1080 | 8208 × 2160 | 16416 × 4320 |\n| 3.90 | 4212 × 1080 | 8424 × 2160 | 16848 × 4320 |\n| 4.00 | 4320 × 1080 | 8640 × 2160 | 17280 × 4320 |\n\n"],"color":"#432","bgcolor":"#653"},{"id":78,"type":"PreviewImage","pos":[814.4312858022378,24.09204686351746],"size":[773.4969145712494,785.9486212809934],"flags":{},"order":13,"mode":0,"inputs":[{"localized_name":"images","name":"images","type":"IMAGE","link":146}],"outputs":[],"properties":{"cnr_id":"comfy-core","ver":"0.3.64","Node name for S&R":"PreviewImage"},"widgets_values":[]},{"id":83,"type":"KSampler","pos":[445.9505612479709,304.2261721620389],"size":[270,262],"flags":{},"order":11,"mode":0,"inputs":[{"localized_name":"model","name":"model","type":"MODEL","link":188},{"localized_name":"positive","name":"positive","type":"CONDITIONING","link":156},{"localized_name":"negative","name":"negative","type":"CONDITIONING","link":160},{"localized_name":"latent_image","name":"latent_image","type":"LATENT","link":165},{"localized_name":"seed","name":"seed","type":"INT","widget":{"name":"seed"},"link":null},{"localized_name":"steps","name":"steps","type":"INT","widget":{"name":"steps"},"link":null},{"localized_name":"cfg","name":"cfg","type":"FLOAT","widget":{"name":"cfg"},"link":null},{"localized_name":"sampler_name","name":"sampler_name","type":"COMBO","widget":{"name":"sampler_name"},"link":null},{"localized_name":"scheduler","name":"scheduler","type":"COMBO","widget":{"name":"scheduler"},"link":null},{"localized_name":"denoise","name":"denoise","type":"FLOAT","widget":{"name":"denoise"},"link":null}],"outputs":[{"localized_name":"LATENT","name":"LATENT","type":"LATENT","links":[164]}],"properties":{"cnr_id":"comfy-core","ver":"0.3.64","Node name for S&R":"KSampler"},"widgets_values":[42,"fixed",4,1,"euler_ancestral","beta",1]},{"id":77,"type":"CheckpointLoaderSimple","pos":[-372.64412552612384,-83.16319925245087],"size":[321.846733855853,126.09921423016061],"flags":{},"order":2,"mode":0,"inputs":[{"localized_name":"ckpt_name","name":"ckpt_name","type":"COMBO","widget":{"name":"ckpt_name"},"link":null}],"outputs":[{"localized_name":"MODEL","name":"MODEL","type":"MODEL","links":[183]},{"localized_name":"CLIP","name":"CLIP","type":"CLIP","links":[184]},{"localized_name":"VAE","name":"VAE","type":"VAE","links":[185,186,187]}],"properties":{"cnr_id":"comfy-core","ver":"0.3.64","Node name for 
S&R":"CheckpointLoaderSimple"},"widgets_values":["Qwen-Rapid-AIO-NSFW-v9.safetensors"]},{"id":82,"type":"TextEncodeQwenImageEditPlus","pos":[4.193344836819271,178.41864500043064],"size":[323.9400390625,210.7346954345703],"flags":{},"order":9,"mode":0,"inputs":[{"localized_name":"clip","name":"clip","type":"CLIP","link":158},{"localized_name":"vae","name":"vae","shape":7,"type":"VAE","link":186},{"localized_name":"image1","name":"image1","shape":7,"type":"IMAGE","link":190},{"localized_name":"image2","name":"image2","shape":7,"type":"IMAGE","link":null},{"localized_name":"image3","name":"image3","shape":7,"type":"IMAGE","link":null},{"localized_name":"prompt","name":"prompt","type":"STRING","widget":{"name":"prompt"},"link":null}],"outputs":[{"localized_name":"CONDITIONING","name":"CONDITIONING","type":"CONDITIONING","links":[156]}],"title":"TextEncodeQwenImageEditPlus Input Prompt","properties":{"cnr_id":"comfy-core","ver":"0.3.64","Node name for S&R":"TextEncodeQwenImageEditPlus"},"widgets_values":["Professional digital photography of a woman standing next to a grizzly bear in a forest"]},{"id":84,"type":"TextEncodeQwenImageEditPlus","pos":[-4.086434273449228,462.502170803813],"size":[383.0845703125,168],"flags":{},"order":10,"mode":0,"inputs":[{"localized_name":"clip","name":"clip","type":"CLIP","link":161},{"localized_name":"vae","name":"vae","shape":7,"type":"VAE","link":187},{"localized_name":"image1","name":"image1","shape":7,"type":"IMAGE","link":null},{"localized_name":"image2","name":"image2","shape":7,"type":"IMAGE","link":null},{"localized_name":"image3","name":"image3","shape":7,"type":"IMAGE","link":null},{"localized_name":"prompt","name":"prompt","type":"STRING","widget":{"name":"prompt"},"link":null}],"outputs":[{"localized_name":"CONDITIONING","name":"CONDITIONING","type":"CONDITIONING","links":[160]}],"title":"TextEncodeQwenImageEditPlus Negative (leave blank)","properties":{"cnr_id":"comfy-core","ver":"0.3.64","Node name for S&R":"TextEncodeQwenImageEditPlus"},"widgets_values":[""]},{"id":55,"type":"VAEDecodeTiled","pos":[436.7394940395995,60.73337487164296],"size":[322.89304359551966,150],"flags":{},"order":12,"mode":0,"inputs":[{"localized_name":"samples","name":"samples","type":"LATENT","link":164},{"localized_name":"vae","name":"vae","type":"VAE","link":185},{"localized_name":"tile_size","name":"tile_size","type":"INT","widget":{"name":"tile_size"},"link":null},{"localized_name":"overlap","name":"overlap","type":"INT","widget":{"name":"overlap"},"link":null},{"localized_name":"temporal_size","name":"temporal_size","type":"INT","widget":{"name":"temporal_size"},"link":null},{"localized_name":"temporal_overlap","name":"temporal_overlap","type":"INT","widget":{"name":"temporal_overlap"},"link":null}],"outputs":[{"localized_name":"IMAGE","name":"IMAGE","type":"IMAGE","links":[146]}],"properties":{"cnr_id":"comfy-core","ver":"0.3.66","Node name for 
S&R":"VAEDecodeTiled"},"widgets_values":[256,64,64,8]},{"id":80,"type":"DyPE_QWEN","pos":[-370.8432890512742,527.3687962649332],"size":[309.6577324292873,208.34523718868593],"flags":{},"order":8,"mode":0,"inputs":[{"localized_name":"model","name":"model","type":"MODEL","link":189},{"localized_name":"width","name":"width","type":"INT","widget":{"name":"width"},"link":null},{"localized_name":"height","name":"height","type":"INT","widget":{"name":"height"},"link":null},{"localized_name":"method","name":"method","type":"COMBO","widget":{"name":"method"},"link":null},{"localized_name":"enable_dype","name":"enable_dype","type":"BOOLEAN","widget":{"name":"enable_dype"},"link":null},{"localized_name":"dype_exponent","name":"dype_exponent","shape":7,"type":"FLOAT","widget":{"name":"dype_exponent"},"link":null},{"localized_name":"base_shift","name":"base_shift","shape":7,"type":"FLOAT","widget":{"name":"base_shift"},"link":null},{"localized_name":"max_shift","name":"max_shift","shape":7,"type":"FLOAT","widget":{"name":"max_shift"},"link":null}],"outputs":[{"localized_name":"Patched Model","name":"Patched Model","type":"MODEL","links":[188]}],"properties":{"Node name for S&R":"DyPE_QWEN"},"widgets_values":[1024,1024,"yarn",true,2,0.5,1.15]},{"id":73,"type":"Lora Loader Stack (rgthree)","pos":[-373.2123167412959,179.60339915431166],"size":[324.22672243165925,246],"flags":{},"order":7,"mode":0,"inputs":[{"localized_name":"model","name":"model","type":"MODEL","link":183},{"localized_name":"clip","name":"clip","type":"CLIP","link":184},{"localized_name":"lora_01","name":"lora_01","type":"COMBO","widget":{"name":"lora_01"},"link":null},{"localized_name":"strength_01","name":"strength_01","type":"FLOAT","widget":{"name":"strength_01"},"link":null},{"localized_name":"lora_02","name":"lora_02","type":"COMBO","widget":{"name":"lora_02"},"link":null},{"localized_name":"strength_02","name":"strength_02","type":"FLOAT","widget":{"name":"strength_02"},"link":null},{"localized_name":"lora_03","name":"lora_03","type":"COMBO","widget":{"name":"lora_03"},"link":null},{"localized_name":"strength_03","name":"strength_03","type":"FLOAT","widget":{"name":"strength_03"},"link":null},{"localized_name":"lora_04","name":"lora_04","type":"COMBO","widget":{"name":"lora_04"},"link":null},{"localized_name":"strength_04","name":"strength_04","type":"FLOAT","widget":{"name":"strength_04"},"link":null}],"outputs":[{"localized_name":"MODEL","name":"MODEL","type":"MODEL","links":[189]},{"localized_name":"CLIP","name":"CLIP","type":"CLIP","links":[158,161]}],"properties":{"cnr_id":"rgthree-comfy","ver":"1.0.2510052058","Node name for S&R":"Lora Loader Stack (rgthree)"},"widgets_values":["None",1,"None",1,"None",1,"None",1]},{"id":69,"type":"MarkdownNote","pos":[-389.1769967617197,-520.5278173823249],"size":[519.2421649643175,338.01057378012047],"flags":{},"order":3,"mode":0,"inputs":[],"outputs":[],"title":"DyPE","properties":{"ue_properties":{"widget_ue_connectable":{},"version":"7.1","input_ue_unconnectable":{}}},"widgets_values":["### Node Inputs\n\n* **`model`**: The Qwen-Image model to be patched.\n* **`width` / `height`**: The target image resolution. **This must match the resolution set in your `Empty Latent Image` node.**\n* **`method`**: The core position encoding extrapolation method. 
`yarn` is the recommended default, as it forms the basis of the paper's best-performing \"DY-YaRN\" variant.\n* **`enable_dype`**: Enables or disables the **dynamic, time-aware** component of DyPE.\n* **`dype_exponent`**: Controls the \"strength\" of the dynamic effect over time. This is the most important tuning parameter.\n * `2.0` (Exponential): Recommended for **4K+** resolutions. It's an aggressive schedule that transitions quickly.\n * `1.0` (Linear): A good starting point for **~2K-3K** resolutions.\n * `0.5` (Sub-linear): A gentler schedule that may work best for resolutions just above the model's native 1K.\n* **`base_shift` / `max_shift`** (Advanced): Adjust only if you are an advanced user experimenting with the noise schedule.\n\n\n\n## Join\n\n### [TokenDiffusion](https://t.me/TokenDiff) - AI for every home, creativity for every mind!\n\n### [TokenDiff Community Hub](https://t.me/TokenDiff_hub) - Questions, help, and thoughtful discussion. "],"color":"#322","bgcolor":"#533"},{"id":70,"type":"Note","pos":[-608.5432530568656,585.7182837860847],"size":[210,88],"flags":{},"order":4,"mode":0,"inputs":[],"outputs":[],"properties":{},"widgets_values":["DyPE width/height - Keep the values below 1024x1024; doing so won’t affect your output.\n\n\n"],"color":"#322","bgcolor":"#533"},{"id":81,"type":"LoadImage","pos":[-996.3008510276222,-87.5633900301799],"size":[529.4561427390425,620.5167716563496],"flags":{},"order":5,"mode":0,"inputs":[{"localized_name":"image","name":"image","type":"COMBO","widget":{"name":"image"},"link":null},{"localized_name":"choose file to upload","name":"upload","type":"IMAGEUPLOAD","widget":{"name":"upload"},"link":null}],"outputs":[{"localized_name":"IMAGE","name":"IMAGE","type":"IMAGE","links":[190]},{"localized_name":"MASK","name":"MASK","type":"MASK","links":null}],"title":"Optional Input Image","properties":{"cnr_id":"comfy-core","ver":"0.3.64","Node name for S&R":"LoadImage"},"widgets_values":["rgthree.compare._temp_ufppq_00001_.png","image"]},{"id":43,"type":"MarkdownNote","pos":[-990.1818087358461,-519.6655450706329],"size":[520,390],"flags":{},"order":6,"mode":0,"inputs":[],"outputs":[],"title":"Model links","properties":{},"widgets_values":["## Model links\n\n**Diffusion Model**\n\n- [Qwen-Rapid-AIO.safetensors](https://huggingface.co/Phr00t/Qwen-Image-Edit-Rapid-AIO/tree/main/v9)\n\n**Text Encoder** AND **VAE** included in checkpoint\n\n\n```\nComfyUI/\n├── models/\n│ ├── Checkpoints/\n│ │ └─── Qwen-Rapid-AIO-V[X].safetensors\n```\n"],"color":"#432","bgcolor":"#653"}],"links":[[146,55,0,78,0,"IMAGE"],[156,82,0,83,1,"CONDITIONING"],[158,73,1,82,0,"CLIP"],[160,84,0,83,2,"CONDITIONING"],[161,73,1,84,0,"CLIP"],[164,83,0,55,0,"LATENT"],[165,27,0,83,3,"LATENT"],[183,77,0,73,0,"MODEL"],[184,77,1,73,1,"CLIP"],[185,77,2,55,1,"VAE"],[186,77,2,82,1,"VAE"],[187,77,2,84,1,"VAE"],[188,80,0,83,0,"MODEL"],[189,73,0,80,0,"MODEL"],[190,81,0,82,2,"IMAGE"]],"groups":[{"id":1,"title":"Step 1 - Load Qwen Model","bounding":[-389.3824229462796,-168.70374071867653,355.5041399538495,252.21255115299488],"color":"#3f789e","font_size":24,"flags":{}},{"id":2,"title":"Step 2 - Image Size","bounding":[-12.23530186286891,-163.17122430326808,347.58927891514486,249.17558821231626],"color":"#3f789e","font_size":24,"flags":{}},{"id":3,"title":"Step 3 - Prompt","bounding":[-11.586309297171493,100.62202086322954,396.7993810234184,589.5270105188749],"color":"#3f789e","font_size":24,"flags":{}},{"id":5,"title":"Model 
patch","bounding":[-385.1175949725123,450.63695348456406,347.7750223767638,300.94106575550074],"color":"#3f789e","font_size":24,"flags":{}},{"id":6,"title":"Step 1.2 - Load LORA Models","bounding":[-389.6222120329982,96.54024848512937,353.917830656678,337.87325320025553],"color":"#3f789e","font_size":24,"flags":{}}],"config":{},"extra":{"ds":{"scale":0.7627768444385471,"offset":[-62.13106404584131,487.20359809682054]},"frontendVersion":"1.28.7","VHS_latentpreview":false,"VHS_latentpreviewrate":0,"VHS_MetadataImage":true,"VHS_KeepIntermediate":true,"linkExtensions":[{"id":183,"parentId":1},{"id":184,"parentId":2},{"id":185,"parentId":3},{"id":186,"parentId":3},{"id":187,"parentId":3},{"id":188,"parentId":5},{"id":189,"parentId":6}],"reroutes":[{"id":1,"pos":[-410.2442328962266,93.36762989078578],"linkIds":[183]},{"id":2,"pos":[-429.2799444622841,109.23072286250073],"linkIds":[184]},{"id":3,"pos":[-23.18476438638177,88.60870199927146],"linkIds":[185,186,187]},{"id":4,"pos":[-18.42583649486724,708.855637193325],"linkIds":[188]},{"id":5,"parentId":4,"pos":[398.7735086612356,710.4419464904965],"linkIds":[188]},{"id":6,"pos":[-408.65792359905424,451.8735310515432],"linkIds":[189]}]},"version":0.4} \ No newline at end of file diff --git a/src/patch.py b/src/patch.py index 1c3d2e7..141f8fc 100644 --- a/src/patch.py +++ b/src/patch.py @@ -89,7 +89,7 @@ def patched_sigma_func(self, timestep): except AttributeError: raise ValueError("The provided model is not a compatible FLUX model.") - new_pe_embedder = FluxPosEmbed(theta, axes_dim, method, enable_dype, dype_exponent) + new_pe_embedder = FluxPosEmbed(theta, axes_dim, method, dype=enable_dype, dype_exponent=dype_exponent) m.add_object_patch("diffusion_model.pe_embedder", new_pe_embedder) sigma_max = m.model.model_sampling.sigma_max.item() @@ -108,4 +108,384 @@ def dype_wrapper_function(model_function, args_dict): m.set_model_unet_function_wrapper(dype_wrapper_function) + return m + + +class QwenPosEmbed(nn.Module): + """ + Qwen-Image specific positional embedding with DyPE support. + Optimized for Qwen's MMDiT architecture and MSRoPE implementation. + """ + def __init__(self, theta: int, axes_dim: list[int], method: str = 'yarn', dype: bool = True, dype_exponent: float = 3.0, base_resolution: int = 1024, patch_size: int = 2, editing_strength: float = 0.0, editing_mode: str = "adaptive"): + super().__init__() + self.theta = theta + self.axes_dim = axes_dim + self.method = method + self.dype = dype if method != 'base' else False + self.dype_exponent = dype_exponent + self.current_timestep = 1.0 + self.base_resolution = base_resolution + self.patch_size = patch_size + self.editing_strength = editing_strength + self.editing_mode = editing_mode + self.is_editing_mode = False # Will be set dynamically during inference + # Qwen-Image uses 8x VAE downsampling, then patches are further divided by patch_size + # For Qwen, base_patches calculation: (base_resolution // 8) // patch_size + self.base_patches = (self.base_resolution // 8) // patch_size + + def set_timestep(self, timestep: float, is_editing: bool = False): + """Set current timestep for DyPE. Timestep normalized to [0, 1] where 1 is pure noise.""" + self.current_timestep = timestep + self.is_editing_mode = is_editing + + def forward(self, ids: torch.Tensor) -> torch.Tensor: + """ + Forward pass for Qwen positional embeddings. + Returns embeddings in the format expected by Qwen's attention mechanism. 
+ """ + n_axes = ids.shape[-1] + emb_parts = [] + pos = ids.float() + + # Qwen models typically use bfloat16 for better performance + # Check device type for optimal dtype selection + is_mps = ids.device.type == "mps" + is_npu = ids.device.type == "npu" + freqs_dtype = torch.float32 if (is_mps or is_npu) else torch.bfloat16 + + for i in range(n_axes): + axis_pos = pos[..., i] + axis_dim = self.axes_dim[i] + + common_kwargs = { + 'dim': axis_dim, + 'pos': axis_pos, + 'theta': self.theta, + 'repeat_interleave_real': True, + 'use_real': True, + 'freqs_dtype': freqs_dtype + } + + # Calculate effective DyPE strength (reduce for editing mode) + effective_dype = self.dype + effective_exponent = self.dype_exponent + + # Pass the exponent to the RoPE function + dype_kwargs = { + 'dype': effective_dype, + 'current_timestep': self.current_timestep, + 'dype_exponent': effective_exponent + } + + if i > 0: # Spatial dimensions (height, width) + max_pos = axis_pos.max().item() + current_patches = int(max_pos + 1) + + # Calculate scale factor for editing mode (after we know current_patches) + scale_factor = 1.0 + effective_exponent = self.dype_exponent + + if self.is_editing_mode and self.editing_mode != "full": + # Calculate effective strength based on editing mode + if self.editing_mode == "adaptive": + # Adaptive: Full DyPE early (structure), gradually reduce later (details) + # timestep 1.0 = pure noise (early), 0.0 = clean (late) + # Use more DyPE early, less late + timestep_factor = 0.3 + (self.current_timestep * 0.7) # 1.0 at start, 0.3 at end + effective_strength = self.editing_strength * timestep_factor + elif self.editing_mode == "timestep_aware": + # More aggressive: Full early, minimal late + timestep_factor = 0.2 + (self.current_timestep * 0.8) # 1.0 at start, 0.2 at end + effective_strength = self.editing_strength * timestep_factor + elif self.editing_mode == "resolution_aware": + # Only reduce when editing at high resolutions + if current_patches > self.base_patches: + effective_strength = self.editing_strength + else: + effective_strength = 1.0 # Full strength at base resolution + elif self.editing_mode == "minimal": + # Minimal DyPE for editing (original approach) + effective_strength = self.editing_strength + else: # "full" or unknown + effective_strength = 1.0 + + # Apply effective strength + if effective_strength < 1.0: + effective_exponent = self.dype_exponent * effective_strength + dype_kwargs['dype_exponent'] = effective_exponent + + # Calculate scale factor for extrapolation reduction + if current_patches > self.base_patches: + # Scale down the extrapolation ratio for editing + # More conservative scaling for adaptive modes + if self.editing_mode in ["adaptive", "timestep_aware"]: + # Use timestep-aware scaling for extrapolation too + timestep_scale = 0.5 + (self.current_timestep * 0.5) # 1.0 at start, 0.5 at end + scale_factor = 1.0 - (1.0 - effective_strength) * 0.3 * (1.0 - timestep_scale * 0.5) + else: + scale_factor = 1.0 - (1.0 - effective_strength) * 0.4 + else: + scale_factor = 1.0 + else: + scale_factor = 1.0 + + if self.method == 'yarn' and current_patches > self.base_patches: + # Apply scale factor for editing mode + if self.is_editing_mode and scale_factor < 1.0: + # Interpolate between base and target patches based on editing_strength + adjusted_patches = int(self.base_patches + (current_patches - self.base_patches) * scale_factor) + max_pe_len = torch.tensor(adjusted_patches, dtype=freqs_dtype, device=pos.device) + else: + max_pe_len = torch.tensor(current_patches, 
dtype=freqs_dtype, device=pos.device) + cos, sin = get_1d_rotary_pos_embed( + **common_kwargs, + yarn=True, + max_pe_len=max_pe_len, + ori_max_pe_len=self.base_patches, + **dype_kwargs + ) + elif self.method == 'ntk' and current_patches > self.base_patches: + # Apply scale factor for editing mode + if self.is_editing_mode and scale_factor < 1.0: + base_ntk_scale = 1.0 + ((current_patches / self.base_patches) - 1.0) * scale_factor + else: + base_ntk_scale = (current_patches / self.base_patches) + cos, sin = get_1d_rotary_pos_embed( + **common_kwargs, + ntk_factor=base_ntk_scale, + **dype_kwargs + ) + else: + cos, sin = get_1d_rotary_pos_embed(**common_kwargs) + else: # Channel dimension (typically not extrapolated) + cos, sin = get_1d_rotary_pos_embed(**common_kwargs) + + # Qwen's attention expects cos/sin format, convert to rotation matrix + cos_reshaped = cos.view(*cos.shape[:-1], -1, 2)[..., :1] + sin_reshaped = sin.view(*sin.shape[:-1], -1, 2)[..., :1] + row1 = torch.cat([cos_reshaped, -sin_reshaped], dim=-1) + row2 = torch.cat([sin_reshaped, cos_reshaped], dim=-1) + matrix = torch.stack([row1, row2], dim=-2) + emb_parts.append(matrix) + + emb = torch.cat(emb_parts, dim=-3) + return emb.unsqueeze(1).to(ids.device) + + +def _detect_qwen_model_structure(model: ModelPatcher): + """ + Detect Qwen-Image model structure and extract key parameters. + Returns a dictionary with detected attributes. + """ + structure = { + 'transformer': None, + 'transformer_path': None, + 'pos_embed': None, + 'pos_embed_path': None, + 'patch_size': 2, # Default for MMDiT models + 'vae_scale_factor': 8, # Default VAE downsampling + 'base_resolution': 1024, # Qwen-Image base training resolution + } + + # Try to find transformer + if hasattr(model.model, "transformer"): + structure['transformer'] = model.model.transformer + structure['transformer_path'] = "transformer" + elif hasattr(model.model, "diffusion_model"): + structure['transformer'] = model.model.diffusion_model + structure['transformer_path'] = "diffusion_model" + else: + return None + + transformer = structure['transformer'] + + # Try to find positional embedder + if hasattr(transformer, "pos_embed"): + structure['pos_embed'] = transformer.pos_embed + structure['pos_embed_path'] = f"{structure['transformer_path']}.pos_embed" + elif hasattr(transformer, "pe_embedder"): + structure['pos_embed'] = transformer.pe_embedder + structure['pos_embed_path'] = f"{structure['transformer_path']}.pe_embedder" + else: + return None + + # Extract patch_size if available + if hasattr(transformer, "patch_size"): + structure['patch_size'] = transformer.patch_size + elif hasattr(transformer, "config") and hasattr(transformer.config, "patch_size"): + structure['patch_size'] = transformer.config.patch_size + + # Extract VAE scale factor if available + if hasattr(model.model, "vae_scale_factor"): + structure['vae_scale_factor'] = model.model.vae_scale_factor + elif hasattr(model.model, "vae") and hasattr(model.model.vae, "scale_factor"): + structure['vae_scale_factor'] = model.model.vae.scale_factor + + # Try to detect base resolution from config + if hasattr(transformer, "config"): + config = transformer.config + if hasattr(config, "sample_size"): + # sample_size is typically the latent size, multiply by 8 for image size + structure['base_resolution'] = config.sample_size * 8 + elif hasattr(config, "base_resolution"): + structure['base_resolution'] = config.base_resolution + + return structure + + +def apply_dype_to_qwen(model: ModelPatcher, width: int, height: int, method: 
str, enable_dype: bool, dype_exponent: float, base_shift: float, max_shift: float, editing_strength: float = 0.0, editing_mode: str = "adaptive") -> ModelPatcher: + """ + Apply DyPE to a Qwen-Image model with architecture-specific optimizations. + """ + m = model.clone() + + # Detect Qwen model structure + structure = _detect_qwen_model_structure(m) + if structure is None: + raise ValueError("Could not detect Qwen-Image model structure. This node is only compatible with Qwen-Image models.") + + transformer = structure['transformer'] + patch_size = structure['patch_size'] + vae_scale_factor = structure['vae_scale_factor'] + base_resolution = structure['base_resolution'] + + # Patch noise schedule if available (Qwen may use FlowMatch or similar schedulers) + if not hasattr(m.model.model_sampling, "_dype_patched"): + model_sampler = m.model.model_sampling + + # Check if it's a compatible sampler + if hasattr(model_sampler, "sigma_max"): + # Calculate sequence length based on Qwen's architecture + latent_h, latent_w = height // vae_scale_factor, width // vae_scale_factor + # Qwen uses patch_size for further downsampling + padded_h = math.ceil(latent_h / patch_size) * patch_size + padded_w = math.ceil(latent_w / patch_size) * patch_size + image_seq_len = (padded_h // patch_size) * (padded_w // patch_size) + + # Qwen-specific sequence length parameters + base_seq_len, max_seq_len = 256, 4096 + slope = (max_shift - base_shift) / (max_seq_len - base_seq_len) + intercept = base_shift - slope * base_seq_len + dype_shift = image_seq_len * slope + intercept + + def patched_sigma_func(self, timestep): + # Try to use flux_time_shift if available (Qwen may use similar schedulers) + try: + return model_sampling.flux_time_shift(dype_shift, 1.0, timestep) + except AttributeError: + # Fallback for other scheduler types (FlowMatch, etc.) 
+ # Apply shift proportionally to timestep + if hasattr(self, "sigma"): + original_sigma = self.sigma.__func__(self, timestep) if hasattr(self.sigma, "__func__") else timestep + return original_sigma * (1.0 + dype_shift * 0.1) # Conservative scaling + return timestep + + model_sampler.sigma = types.MethodType(patched_sigma_func, model_sampler) + model_sampler._dype_patched = True + + # Find and extract positional embedder parameters + orig_embedder = structure['pos_embed'] + + # Extract theta and axes_dim from the original embedder + if hasattr(orig_embedder, "theta") and hasattr(orig_embedder, "axes_dim"): + theta, axes_dim = orig_embedder.theta, orig_embedder.axes_dim + elif hasattr(orig_embedder, "theta") and hasattr(orig_embedder, "axes_dims_rope"): + theta, axes_dim = orig_embedder.theta, orig_embedder.axes_dims_rope + elif hasattr(orig_embedder, "theta"): + # If only theta is available, use Qwen-Image defaults + theta = orig_embedder.theta + # Qwen-Image typically uses (16, 56, 56) for axes_dims_rope + axes_dim = [16, 56, 56] + else: + # Fallback to Qwen-Image defaults + theta = 10000 + axes_dim = [16, 56, 56] # Default for Qwen-Image MMDiT models + + # Create new positional embedder with Qwen-specific parameters + new_pe_embedder = QwenPosEmbed( + theta=theta, + axes_dim=axes_dim, + method=method, + dype=enable_dype, # Note: parameter is 'dype' not 'enable_dype' + dype_exponent=dype_exponent, + base_resolution=base_resolution, + patch_size=patch_size, + editing_strength=editing_strength, + editing_mode=editing_mode + ) + + # Patch the positional embedder using the detected path + m.add_object_patch(structure['pos_embed_path'], new_pe_embedder) + + # Get sigma_max for timestep normalization + sigma_max = 1.0 + if hasattr(m.model.model_sampling, "sigma_max"): + sigma_max_val = m.model.model_sampling.sigma_max + if hasattr(sigma_max_val, "item"): + sigma_max = sigma_max_val.item() + else: + sigma_max = float(sigma_max_val) + if sigma_max <= 0: + sigma_max = 1.0 + + # Capture editing_mode in closure + def dype_wrapper_function(model_function, args_dict): + """ + Wrapper function to update timestep for DyPE during inference. + Optimized for Qwen's forward pass signature with editing mode detection. 
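+
+        Editing is detected heuristically: either a known image-conditioning key
+        is present in the conditioning dict, or the input latent has a low mean
+        absolute value (i.e. it does not look like pure noise). The current
+        sigma is clamped and normalized by sigma_max into [0, 1] and handed to
+        the patched embedder via `set_timestep` before each forward call.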
+ """ + # Detect editing mode by checking for image inputs in conditioning + is_editing = False + c = args_dict.get("c", {}) + input_x = args_dict.get("input") + + # Check for image editing indicators in conditioning + # Qwen-Image editing typically includes image embeddings or image tokens in conditioning + if isinstance(c, dict): + # Check for common image editing keys in Qwen models + editing_keys = ['image', 'image_embeds', 'image_tokens', 'concat_latent_image', 'concat_mask', 'concat_mask_image'] + for key in editing_keys: + if key in c and c[key] is not None: + is_editing = True + break + + # Also check if input contains non-zero values (not pure noise/empty latent) + # This is a heuristic: editing often starts with a partially denoised image + if input_x is not None and hasattr(input_x, 'abs'): + # If input has low variance, it might be an edited image rather than pure noise + input_variance = input_x.abs().mean().item() if hasattr(input_x, 'abs') else 0.0 + # Pure noise typically has higher variance, edited images have lower + # This is a rough heuristic - adjust threshold as needed + if input_variance < 0.5: # Threshold may need tuning + is_editing = True + + if enable_dype: + timestep_tensor = args_dict.get("timestep") + if timestep_tensor is not None: + # Handle both tensor and scalar timestep values + if hasattr(timestep_tensor, "numel") and timestep_tensor.numel() > 0: + current_sigma = timestep_tensor.item() if hasattr(timestep_tensor, "item") else float(timestep_tensor) + else: + current_sigma = float(timestep_tensor) if not isinstance(timestep_tensor, (int, float)) else timestep_tensor + + if sigma_max > 0: + # Improved timestep normalization for editing + # Editing often uses different timestep ranges, so we normalize more carefully + normalized_timestep = min(max(current_sigma / sigma_max, 0.0), 1.0) + + # For adaptive/timestep_aware modes, we want full DyPE early (structure) and less late (details) + # So we don't need to reduce the normalized timestep - the mode handles it + # Only apply conservative scaling for minimal mode + if is_editing and editing_mode == "minimal" and current_sigma < sigma_max * 0.3: + # For early timesteps in minimal mode, preserve more structure + normalized_timestep = normalized_timestep * 0.8 + + new_pe_embedder.set_timestep(normalized_timestep, is_editing=is_editing) + + # Forward pass with original arguments + timestep = args_dict.get("timestep") + return model_function(input_x, timestep, **c) + + m.set_model_unet_function_wrapper(dype_wrapper_function) + return m \ No newline at end of file
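
For a quick sanity check of the structure detection this patch adds, the sketch below prints the parameters DyPE would use for a loaded model. It is a minimal, hypothetical helper rather than part of the patch: it assumes `src/patch.py` is importable as shown (the real import path depends on how the node pack is installed) and that `model` is the ComfyUI `ModelPatcher` returned by a checkpoint loader holding a Qwen-Image model.

```python
# Hypothetical verification helper; it relies only on _detect_qwen_model_structure
# from this patch and the dict keys that function returns.
from src.patch import _detect_qwen_model_structure


def print_detected_dype_params(model) -> None:
    """Print the Qwen-Image parameters DyPE detected for `model` (a ModelPatcher)."""
    structure = _detect_qwen_model_structure(model)
    if structure is None:
        print("Not recognized as a Qwen-Image model (no transformer / pos_embed found).")
        return
    for key in ("transformer_path", "pos_embed_path", "patch_size",
                "vae_scale_factor", "base_resolution"):
        print(f"{key}: {structure[key]}")
```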