Merged (103 commits)
aa602ac  Initial LTX 2.0 transformer implementation (dg845, Dec 12, 2025)
b3096c3  Add tests for LTX 2 transformer model (dg845, Dec 13, 2025)
980591d  Get LTX 2 transformer tests working (dg845, Dec 13, 2025)
e100b8f  Rename LTX 2 compile test class to have LTX2 (dg845, Dec 13, 2025)
780fb61  Remove RoPE debug print statements (dg845, Dec 13, 2025)
5765759  Get LTX 2 transformer compile tests passing (dg845, Dec 15, 2025)
aeecc4d  Fix LTX 2 transformer shape errors (dg845, Dec 15, 2025)
a5f2d2d  Initial script to convert LTX 2 transformer to diffusers (dg845, Dec 15, 2025)
d86f89d  Add more LTX 2 transformer audio arguments (dg845, Dec 16, 2025)
57a8b9c  Allow LTX 2 transformer to be loaded from local path for conversion (dg845, Dec 16, 2025)
a7bc052  Improve dummy inputs and add test for LTX 2 transformer consistency (dg845, Dec 16, 2025)
bda3ff1  Fix LTX 2 transformer bugs so consistency test passes (dg845, Dec 16, 2025)
269cf7b  Initial implementation of LTX 2.0 video VAE (dg845, Dec 17, 2025)
baf23e2  Explicitly specify temporal and spatial VAE scale factors when conver… (dg845, Dec 17, 2025)
5b950d6  Add initial LTX 2.0 video VAE tests (dg845, Dec 17, 2025)
491aae0  Add initial LTX 2.0 video VAE tests (part 2) (dg845, Dec 17, 2025)
a748975  Get diffusers implementation on par with official LTX 2.0 video VAE i… (dg845, Dec 19, 2025)
c6a11a5  Initial LTX 2.0 vocoder implementation (dg845, Dec 19, 2025)
8bfeb4a  Merge pull request #3 from huggingface/ltx-2-vocoder (dg845, Dec 20, 2025)
b1cf6ff  Merge pull request #2 from huggingface/ltx-2-video-vae (dg845, Dec 20, 2025)
6c56954  Use RMSNorm implementation closer to original for LTX 2.0 video VAE (dg845, Dec 20, 2025)
b34ddb1  start audio decoder. (sayakpaul, Dec 22, 2025)
f4c2435  init registration. (sayakpaul, Dec 22, 2025)
e54cd6b  up (sayakpaul, Dec 22, 2025)
907896d  simplify and clean up (sayakpaul, Dec 22, 2025)
4904fd6  up (sayakpaul, Dec 22, 2025)
0028955  Initial LTX 2.0 text encoder implementation (dg845, Dec 22, 2025)
d0f9cda  Rough initial LTX 2.0 pipeline implementation (dg845, Dec 22, 2025)
5f0f2a0  up (sayakpaul, Dec 22, 2025)
58257eb  up (sayakpaul, Dec 22, 2025)
059999a  up (sayakpaul, Dec 22, 2025)
8134da6  up (sayakpaul, Dec 22, 2025)
409d651  resolve conflicts. (sayakpaul, Dec 22, 2025)
7bb4cf7  Merge pull request #5 from huggingface/audio-decoder (dg845, Dec 23, 2025)
5f7e43d  Add imports for LTX 2.0 Audio VAE (dg845, Dec 23, 2025)
d303e2a  Conversion script for LTX 2.0 Audio VAE Decoder (dg845, Dec 23, 2025)
ae3b6e7  Merge branch 'ltx-2-transformer' into ltx-2-t2v-pipeline (dg845, Dec 23, 2025)
54bfc5d  Add Audio VAE logic to T2V pipeline (dg845, Dec 23, 2025)
6e6ce20  Duplicate scheduler for audio latents (dg845, Dec 23, 2025)
cbb10b8  Support num_videos_per_prompt for prompt embeddings (dg845, Dec 23, 2025)
595f485  LTX 2.0 scheduler and full pipeline conversion (dg845, Dec 23, 2025)
3bf7369  Add script to test full LTX2Pipeline T2V inference (dg845, Dec 23, 2025)
fa7d9f7  Fix pipeline return bugs (dg845, Dec 23, 2025)
a56cf23  Add LTX 2 text encoder and vocoder to ltx2 subdirectory __init__ (dg845, Dec 23, 2025)
90edc6a  Fix more bugs in LTX2Pipeline.__call__ (dg845, Dec 23, 2025)
1484c43  Improve CPU offload support (dg845, Dec 23, 2025)
f9b9476  Fix pipeline audio VAE decoding dtype bug (dg845, Dec 23, 2025)
e89d9c1  Fix video shape error in full pipeline test script (dg845, Dec 23, 2025)
b5891b1  Get LTX 2 T2V pipeline to produce reasonable outputs (dg845, Dec 24, 2025)
0c41297  Merge pull request #4 from huggingface/ltx-2-t2v-pipeline (dg845, Dec 24, 2025)
581f21c  Make LTX 2.0 scheduler more consistent with original code (dg845, Dec 29, 2025)
e1f0b7e  Fix typo when applying scheduler fix in T2V inference script (dg845, Dec 29, 2025)
280e347  Refactor Audio VAE to be simpler and remove helpers (#7) (sayakpaul, Dec 30, 2025)
46822c4  Add support for I2V (#8) (sayakpaul, Dec 30, 2025)
6a236a2  Merge branch 'ltx-2-transformer' into make-scheduler-consistent (dg845, Dec 30, 2025)
bd607b9  Denormalize audio latents in I2V pipeline (analogous to T2V change) (… (dg845, Dec 31, 2025)
d3f10fe  test i2v. (sayakpaul, Dec 31, 2025)
aae70b9  Merge pull request #10 from huggingface/make-scheduler-consistent (dg845, Dec 31, 2025)
caae167  Move Video and Audio Text Encoder Connectors to Transformer (#12) (dg845, Jan 5, 2026)
0be4f31  up (#19) (sayakpaul, Jan 5, 2026)
c5b52d6  address initial feedback from lightricks team (#16) (sayakpaul, Jan 5, 2026)
2fa4f84  When using split RoPE, make sure that the output dtype is same as inp… (dg845, Jan 5, 2026)
bff9891  Fix apply split RoPE shape error when reshaping x to 4D (dg845, Jan 6, 2026)
cb50cac  Add export_utils file for exporting LTX 2.0 videos with audio (dg845, Jan 6, 2026)
ce9da5d  Merge pull request #20 from huggingface/video-export-utils-file (dg845, Jan 6, 2026)
93a417f  Tests for T2V and I2V (#6) (sayakpaul, Jan 6, 2026)
9b8788c  resolve conflicts. (sayakpaul, Jan 6, 2026)
c039c87  up (sayakpaul, Jan 6, 2026)
550eca3  use export util funcs. (sayakpaul, Jan 6, 2026)
ef19911  Point original checkpoint to LTX 2.0 official checkpoint (dg845, Jan 6, 2026)
ace2ee9  Allow the I2V pipeline to accept image URLs (dg845, Jan 6, 2026)
dd81242  make style and make quality (dg845, Jan 6, 2026)
2fc5789  Merge branch 'main' into ltx-2-transformer (sayakpaul, Jan 6, 2026)
57ead0b  remove function map. (sayakpaul, Jan 6, 2026)
c39f1b8  remove args. (sayakpaul, Jan 6, 2026)
bdcf23e  update docs. (sayakpaul, Jan 6, 2026)
61e0fb4  update doc entries. (sayakpaul, Jan 6, 2026)
8c5ab1f  disable ltx2_consistency test (sayakpaul, Jan 6, 2026)
64b48c1  Merge branch 'main' into ltx-2-transformer (sayakpaul, Jan 6, 2026)
5e0cf2b  Simplify LTX 2 RoPE forward by removing coords is None logic (dg845, Jan 6, 2026)
d01a242  make style and make quality (dg845, Jan 6, 2026)
79cf6d7  Support LTX 2.0 audio VAE encoder (dg845, Jan 7, 2026)
cc28cf7  Merge branch 'main' into ltx-2-transformer (sayakpaul, Jan 7, 2026)
91ee2dd  resolve conflicts (sayakpaul, Jan 7, 2026)
5269ee5  Merge branch 'ltx-2-transformer' of github.com:huggingface/diffusers … (dg845, Jan 7, 2026)
a17f5cb  Apply suggestions from code review (dg845, Jan 7, 2026)
964f106  Remove print statement in audio VAE (dg845, Jan 7, 2026)
4dfe509  up (sayakpaul, Jan 7, 2026)
249ae1f  Merge branch 'main' into ltx-2-transformer (sayakpaul, Jan 7, 2026)
040c118  Fix bug when calculating audio RoPE coords (dg845, Jan 7, 2026)
44925cb  Ltx 2 latent upsample pipeline (#12922) (sayakpaul, Jan 7, 2026)
5e50046  Fix latent upsampler filename in LTX 2 conversion script (dg845, Jan 8, 2026)
2b85b93  Add latent upsample pipeline to LTX 2 docs (dg845, Jan 8, 2026)
40ee3e3  Add dummy objects for LTX 2 latent upsample pipeline (dg845, Jan 8, 2026)
99ff722  Set default FPS to official LTX 2 ckpt default of 24.0 (dg845, Jan 8, 2026)
165b945  Set default CFG scale to official LTX 2 ckpt default of 4.0 (dg845, Jan 8, 2026)
1a4ae58  Update LTX 2 pipeline example docstrings (dg845, Jan 8, 2026)
b4d33df  make style and make quality (dg845, Jan 8, 2026)
724afee  Remove LTX 2 test scripts (dg845, Jan 8, 2026)
d24faa7  Fix LTX 2 upsample pipeline example docstring (dg845, Jan 8, 2026)
353f0db  Add logic to convert and save a LTX 2 upsampling pipeline (dg845, Jan 8, 2026)
0c9e4e2  Merge branch 'main' into ltx-2-transformer (sayakpaul, Jan 8, 2026)
f85b969  Document LTX2VideoTransformer3DModel forward pass (dg845, Jan 8, 2026)
819 changes: 819 additions & 0 deletions scripts/convert_ltx2_to_diffusers.py

Large diffs are not rendered by default.
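The conversion script itself is collapsed above. Purely as an illustration, and not the script's actual interface (its real flags are defined in the unrendered argparse setup and may differ), a conversion run could look something like:

python scripts/convert_ltx2_to_diffusers.py \
    --checkpoint_path /path/to/ltx2-checkpoint.safetensors \
    --output_path ./ltx2-diffusers
# Both flag names above are hypothetical placeholders; consult the script for
# the actual arguments.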

108 changes: 108 additions & 0 deletions scripts/ltx2_test_full_pipeline.py
@@ -0,0 +1,108 @@
import argparse
[Review comment from a Collaborator: "let's remove these files"]

import os

import torch

from diffusers import LTX2Pipeline
from diffusers.pipelines.ltx2.export_utils import encode_video


def parse_args():
parser = argparse.ArgumentParser()

parser.add_argument("--model_id", type=str, default="diffusers-internal-dev/new-ltx-model")
parser.add_argument("--revision", type=str, default="main")

parser.add_argument(
"--prompt",
type=str,
default="A video of a dog dancing to energetic electronic dance music",
)
parser.add_argument(
"--negative_prompt",
type=str,
default=(
"blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, "
"grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, "
"deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, "
"wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of "
"field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent "
"lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny "
"valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, "
"mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, "
"off-sync audio,incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward "
"pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, "
"inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
),
)

parser.add_argument("--num_inference_steps", type=int, default=40)
parser.add_argument("--height", type=int, default=512)
parser.add_argument("--width", type=int, default=768)
parser.add_argument("--num_frames", type=int, default=121)
parser.add_argument("--frame_rate", type=float, default=25.0)
parser.add_argument("--guidance_scale", type=float, default=3.0)
parser.add_argument("--seed", type=int, default=42)

parser.add_argument("--device", type=str, default="cuda:0")
parser.add_argument("--dtype", type=str, default="bf16")
parser.add_argument("--cpu_offload", action="store_true")

parser.add_argument(
"--output_dir",
type=str,
default="/home/daniel_gu/samples",
help="Output directory for generated video",
)
parser.add_argument(
"--output_filename",
type=str,
default="ltx2_sample_video.mp4",
help="Filename of the exported generated video",
)

args = parser.parse_args()
args.dtype = torch.bfloat16 if args.dtype == "bf16" else torch.float32
return args


def main(args):
pipeline = LTX2Pipeline.from_pretrained(
args.model_id,
revision=args.revision,
torch_dtype=args.dtype,
)
    # Either offload submodules to save GPU memory, or move the whole pipeline
    # to the target device (mirrors the I2V script below).
    if args.cpu_offload:
        pipeline.enable_model_cpu_offload()
    else:
        pipeline.to(device=args.device)

video, audio = pipeline(
prompt=args.prompt,
negative_prompt=args.negative_prompt,
height=args.height,
width=args.width,
num_frames=args.num_frames,
frame_rate=args.frame_rate,
num_inference_steps=args.num_inference_steps,
guidance_scale=args.guidance_scale,
generator=torch.Generator(device=args.device).manual_seed(args.seed),
output_type="np",
return_dict=False,
)

# Convert video to uint8 (but keep as NumPy array)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

encode_video(
video[0],
fps=args.frame_rate,
audio=audio[0].float().cpu(),
audio_sample_rate=pipeline.vocoder.config.output_sampling_rate, # should be 24000
output_path=os.path.join(args.output_dir, args.output_filename),
)


if __name__ == "__main__":
args = parse_args()
main(args)
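For reference, a minimal sketch of how this script might be invoked; every flag used below is defined in parse_args above, and the default model id points at an internal placeholder repo:

python scripts/ltx2_test_full_pipeline.py \
    --output_dir samples \
    --output_filename ltx2_sample_video.mp4 \
    --cpu_offload

The --cpu_offload flag is optional and trades inference speed for lower peak GPU memory via enable_model_cpu_offload().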
102 changes: 102 additions & 0 deletions scripts/ltx2_test_full_pipeline_i2v.py
@@ -0,0 +1,102 @@
import argparse
import os

import torch

from diffusers.pipelines.ltx2 import LTX2ImageToVideoPipeline
from diffusers.pipelines.ltx2.export_utils import encode_video
from diffusers.utils import load_image


def parse_args():
parser = argparse.ArgumentParser()

parser.add_argument("--model_id", type=str, default="diffusers-internal-dev/new-ltx-model")
parser.add_argument("--revision", type=str, default="main")

parser.add_argument("--image_path", required=True, type=str)
parser.add_argument(
"--prompt",
type=str,
default="An astronaut hatches from a fragile egg on the surface of the Moon, the shell cracking and peeling apart in gentle low-gravity motion. Fine lunar dust lifts and drifts outward with each movement, floating in slow arcs before settling back onto the ground. The astronaut pushes free in a deliberate, weightless motion, small fragments of the egg tumbling and spinning through the air. In the background, the deep darkness of space subtly shifts as stars glide with the camera's movement, emphasizing vast depth and scale. The camera performs a smooth, cinematic slow push-in, with natural parallax between the foreground dust, the astronaut, and the distant starfield. Ultra-realistic detail, physically accurate low-gravity motion, cinematic lighting, and a breath-taking, movie-like shot.",
)
parser.add_argument(
"--negative_prompt",
type=str,
default="shaky, glitchy, low quality, worst quality, deformed, distorted, disfigured, motion smear, motion artifacts, fused fingers, bad anatomy, weird hand, ugly, transition, static.",
)

parser.add_argument("--num_inference_steps", type=int, default=40)
parser.add_argument("--height", type=int, default=512)
parser.add_argument("--width", type=int, default=768)
parser.add_argument("--num_frames", type=int, default=121)
parser.add_argument("--frame_rate", type=float, default=25.0)
parser.add_argument("--guidance_scale", type=float, default=3.0)
parser.add_argument("--seed", type=int, default=42)

parser.add_argument("--device", type=str, default="cuda:0")
parser.add_argument("--dtype", type=str, default="bf16")
parser.add_argument("--cpu_offload", action="store_true")

parser.add_argument(
"--output_dir",
type=str,
default="samples",
help="Output directory for generated video",
)
parser.add_argument(
"--output_filename",
type=str,
default="ltx2_sample_video.mp4",
help="Filename of the exported generated video",
)

args = parser.parse_args()
args.dtype = torch.bfloat16 if args.dtype == "bf16" else torch.float32
return args


def main(args):
pipeline = LTX2ImageToVideoPipeline.from_pretrained(
args.model_id,
revision=args.revision,
torch_dtype=args.dtype,
)
if args.cpu_offload:
pipeline.enable_model_cpu_offload()
else:
pipeline.to(device=args.device)

image = load_image(args.image_path)

video, audio = pipeline(
image=image,
prompt=args.prompt,
negative_prompt=args.negative_prompt,
height=args.height,
width=args.width,
num_frames=args.num_frames,
frame_rate=args.frame_rate,
num_inference_steps=args.num_inference_steps,
guidance_scale=args.guidance_scale,
generator=torch.Generator(device=args.device).manual_seed(args.seed),
output_type="np",
return_dict=False,
)

# Convert video to uint8 (but keep as NumPy array)
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

encode_video(
video[0],
fps=args.frame_rate,
audio=audio[0].float().cpu(),
audio_sample_rate=pipeline.vocoder.config.output_sampling_rate, # should be 24000
output_path=os.path.join(args.output_dir, args.output_filename),
)


if __name__ == "__main__":
args = parse_args()
main(args)
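A sketch of a typical invocation; only --image_path is required, and all other flags take the defaults defined in parse_args above:

python scripts/ltx2_test_full_pipeline_i2v.py \
    --image_path /path/to/first_frame.png \
    --output_dir samples \
    --cpu_offload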
119 changes: 119 additions & 0 deletions scripts/test_ltx2_audio_conversion.py
@@ -0,0 +1,119 @@
import argparse
from pathlib import Path

import torch
from huggingface_hub import hf_hub_download


def download_checkpoint(
repo_id="diffusers-internal-dev/new-ltx-model",
filename="ltx-av-step-1932500-interleaved-new-vae.safetensors",
):
ckpt_path = hf_hub_download(repo_id=repo_id, filename=filename)
return ckpt_path


def convert_state_dict(state_dict: dict) -> dict:
converted = {}
for key, value in state_dict.items():
if not isinstance(value, torch.Tensor):
continue
new_key = key
if new_key.startswith("decoder."):
new_key = new_key[len("decoder.") :]
converted[f"decoder.{new_key}"] = value

converted["latents_mean"] = converted.pop("decoder.per_channel_statistics.mean-of-means")
converted["latents_std"] = converted.pop("decoder.per_channel_statistics.std-of-means")
return converted


def load_original_decoder(device: torch.device, dtype: torch.dtype):
from ltx_core.loader.single_gpu_model_builder import SingleGPUModelBuilder as Builder
from ltx_core.model.audio_vae.model_configurator import AUDIO_VAE_DECODER_COMFY_KEYS_FILTER
from ltx_core.model.audio_vae.model_configurator import VAEDecoderConfigurator as AudioDecoderConfigurator

checkpoint_path = download_checkpoint()

# The code below comes from `ltx-pipelines/src/ltx_pipelines/txt2vid.py`
decoder = Builder(
model_path=checkpoint_path,
model_class_configurator=AudioDecoderConfigurator,
model_sd_key_ops=AUDIO_VAE_DECODER_COMFY_KEYS_FILTER,
).build(device=device)
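    # NOTE: the dtype argument is currently unused here; weights are loaded in
    # whatever precision the builder produces.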

decoder.eval()
return decoder


def build_diffusers_decoder():
from diffusers.models.autoencoders import AutoencoderKLLTX2Audio

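    # Instantiate on the meta device so no real memory is allocated up front;
    # actual tensors are attached later via load_state_dict(..., assign=True).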
with torch.device("meta"):
model = AutoencoderKLLTX2Audio()

model.eval()
return model


@torch.no_grad()
def main() -> None:
parser = argparse.ArgumentParser(description="Validate LTX2 audio decoder conversion.")
parser.add_argument("--device", type=str, default="cpu")
parser.add_argument("--dtype", type=str, default="bfloat16", choices=["float32", "bfloat16", "float16"])
parser.add_argument("--batch", type=int, default=2)
parser.add_argument("--output-path", type=Path, required=True)
args = parser.parse_args()

device = torch.device(args.device)
dtype_map = {"float32": torch.float32, "bfloat16": torch.bfloat16, "float16": torch.float16}
dtype = dtype_map[args.dtype]

original_decoder = load_original_decoder(device, dtype)
diffusers_model = build_diffusers_decoder()

converted_state_dict = convert_state_dict(original_decoder.state_dict())
diffusers_model.load_state_dict(converted_state_dict, assign=True, strict=False)

per_channel_len = original_decoder.per_channel_statistics.get_buffer("std-of-means").numel()
latent_channels = diffusers_model.decoder.latent_channels
mel_bins_for_match = per_channel_len // latent_channels if per_channel_len % latent_channels == 0 else None

levels = len(diffusers_model.decoder.channel_multipliers)
latent_height = diffusers_model.decoder.resolution // (2 ** (levels - 1))
latent_width = mel_bins_for_match or latent_height

dummy = torch.randn(
args.batch,
diffusers_model.decoder.latent_channels,
latent_height,
latent_width,
device=device,
dtype=dtype,
generator=torch.Generator(device).manual_seed(42),
)

original_out = original_decoder(dummy)

from diffusers.pipelines.ltx2.pipeline_ltx2 import LTX2Pipeline

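    # LTX2Pipeline._denormalize_audio_latents expects latents flattened to
    # (batch, time, channels * mel_bins), so fold the channel and frequency
    # axes together, denormalize, then restore the original 4D layout.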
_, a_channels, a_time, a_freq = dummy.shape
dummy = dummy.permute(0, 2, 1, 3).reshape(-1, a_time, a_channels * a_freq)
dummy = LTX2Pipeline._denormalize_audio_latents(
dummy,
diffusers_model.latents_mean,
diffusers_model.latents_std,
)
dummy = dummy.view(-1, a_time, a_channels, a_freq).permute(0, 2, 1, 3)
diffusers_out = diffusers_model.decode(dummy).sample

torch.testing.assert_close(diffusers_out, original_out, rtol=1e-4, atol=1e-4)
max_diff = (diffusers_out - original_out).abs().max().item()
print(f"Conversion successful. Max diff: {max_diff:.6f}")

diffusers_model.to(dtype).save_pretrained(args.output_path)
print(f"Serialized model to {args.output_path}")


if __name__ == "__main__":
main()
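A sketch of how this validation script might be run; it assumes the original ltx_core package is importable and that the checkpoint can be downloaded from the placeholder repo above. --output-path is the only required flag:

python scripts/test_ltx2_audio_conversion.py \
    --device cpu \
    --dtype bfloat16 \
    --output-path ./ltx2-audio-vae-decoder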
10 changes: 10 additions & 0 deletions src/diffusers/__init__.py
@@ -193,6 +193,8 @@
 "AutoencoderKLHunyuanImageRefiner",
 "AutoencoderKLHunyuanVideo",
 "AutoencoderKLHunyuanVideo15",
+"AutoencoderKLLTX2Audio",
+"AutoencoderKLLTX2Video",
 "AutoencoderKLLTXVideo",
 "AutoencoderKLMagvit",
 "AutoencoderKLMochi",
@@ -236,6 +238,7 @@
 "Kandinsky5Transformer3DModel",
 "LatteTransformer3DModel",
 "LongCatImageTransformer2DModel",
+"LTX2VideoTransformer3DModel",
 "LTXVideoTransformer3DModel",
 "Lumina2Transformer2DModel",
 "LuminaNextDiT2DModel",
@@ -537,6 +540,8 @@
 "LEditsPPPipelineStableDiffusionXL",
 "LongCatImageEditPipeline",
 "LongCatImagePipeline",
+"LTX2ImageToVideoPipeline",
+"LTX2Pipeline",
 "LTXConditionPipeline",
 "LTXImageToVideoPipeline",
 "LTXLatentUpsamplePipeline",
@@ -937,6 +942,8 @@
 AutoencoderKLHunyuanImageRefiner,
 AutoencoderKLHunyuanVideo,
 AutoencoderKLHunyuanVideo15,
+AutoencoderKLLTX2Audio,
+AutoencoderKLLTX2Video,
 AutoencoderKLLTXVideo,
 AutoencoderKLMagvit,
 AutoencoderKLMochi,
@@ -980,6 +987,7 @@
 Kandinsky5Transformer3DModel,
 LatteTransformer3DModel,
 LongCatImageTransformer2DModel,
+LTX2VideoTransformer3DModel,
 LTXVideoTransformer3DModel,
 Lumina2Transformer2DModel,
 LuminaNextDiT2DModel,
@@ -1251,6 +1259,8 @@
 LEditsPPPipelineStableDiffusionXL,
 LongCatImageEditPipeline,
 LongCatImagePipeline,
+LTX2ImageToVideoPipeline,
+LTX2Pipeline,
 LTXConditionPipeline,
 LTXImageToVideoPipeline,
 LTXLatentUpsamplePipeline,