Model overview
LTX-2.3 is a Diffusion Transformer (DiT)-based audio-video foundation model developed by Lightricks that generates synchronized video and audio within a single network. It is a significant update to LTX-2, with improved audio and visual quality and stronger prompt adherence. The model combines modern video generation building blocks with open weights designed for practical local execution. LTX-2.3 ships in multiple checkpoint variants: the full 22B development model, trainable in bf16; a distilled 8-step version for faster inference; LoRA adapters for customization; and spatial and temporal upscalers for multi-stage pipelines that reach higher resolutions and frame rates.
Model inputs and outputs
At its core, LTX-2.3 accepts text prompts describing the desired video content and generates high-quality synchronized video and audio. The model handles flexible resolution and frame rate configurations, with two technical requirements: width and height must be divisible by 32, and the frame count must be a multiple of 8 plus 1 (for example, 121 = 8 × 15 + 1).
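These shape rules translate directly into a quick sanity check. The helper below is a minimal sketch; the function name is illustrative and not part of any official LTX API.

```python
def check_ltx_dims(width: int, height: int, num_frames: int) -> None:
    """Validate LTX-2.3's documented shape constraints (illustrative helper)."""
    assert width % 32 == 0, f"width {width} must be divisible by 32"
    assert height % 32 == 0, f"height {height} must be divisible by 32"
    # The frame count must be a multiple of 8 plus 1, e.g. 121 = 8 * 15 + 1.
    assert (num_frames - 1) % 8 == 0, f"num_frames {num_frames} must be 8k + 1"


check_ltx_dims(width=768, height=512, num_frames=121)  # passes silently
```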
Inputs
- Text prompts describing video content, scenes, actions, and audio characteristics
- Resolution parameters (width and height in pixels)
- Frame count for video length
- Sampling parameters such as guidance scale and number of inference steps (these map onto the sketch after the Outputs list)
Outputs
- Synchronized video files with audio tracks embedded
- High-fidelity audio synchronized to visual motion
- Variable resolutions and frame rates depending on configuration and upscaler usage
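To make those parameters concrete, here is a hedged sketch of a text-to-video call in a diffusers-style interface. The LTXPipeline class exists in diffusers for earlier LTX-Video releases; whether it supports LTX-2.3, and the "Lightricks/LTX-2.3" repository ID, are assumptions, so treat this as the shape of the call rather than a confirmed recipe and consult the official Lightricks code, including for how the audio track is muxed into the output file.

```python
# Hypothetical call; the repo ID and LTX-2.3 support in diffusers are
# assumptions. Earlier LTX-Video releases use this interface.
import torch
from diffusers import LTXPipeline
from diffusers.utils import export_to_video

pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-2.3",        # assumed repo name; check the model card
    torch_dtype=torch.bfloat16,  # the dev model is trainable in bf16
).to("cuda")

frames = pipe(
    prompt="A rainy city street at dusk, neon reflections, soft traffic hum",
    width=768,               # divisible by 32
    height=512,              # divisible by 32
    num_frames=121,          # 8 * 15 + 1
    num_inference_steps=8,   # the distilled checkpoint targets 8 steps
    guidance_scale=3.0,
).frames[0]

export_to_video(frames, "output.mp4", fps=24)
```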
Capabilities
The model generates complete audio-visual content from text descriptions. It can produce videos with realistic motion, varied lighting conditions, and natural audio that matches the visual narrative. The distilled checkpoint enables 8-step generation for faster turnaround, while the full development model offers flexibility for training and fine-tuning. Spatial upscalers increase resolution by 2x or 1.5x, and temporal upscalers double the frame rate, both as stages in multi-stage pipelines. The model responds well to specific prompting styles and can be adapted through LoRA training in under an hour for motion, style, or likeness control.
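The multi-stage pipeline is easiest to see as a composition of stages. The sketch below tracks only shapes and frame rate, not actual model calls; the function names are placeholders rather than the real upscaler API.

```python
# Illustrative composition of upscaler stages; this tracks shapes only.
from dataclasses import dataclass


@dataclass
class Clip:
    width: int
    height: int
    fps: int


def spatial_upscale(clip: Clip, scale: float) -> Clip:
    # Spatial upscalers raise resolution by 2x or 1.5x.
    return Clip(int(clip.width * scale), int(clip.height * scale), clip.fps)


def temporal_upscale(clip: Clip, factor: int = 2) -> Clip:
    # Temporal upscalers double the frame rate.
    return Clip(clip.width, clip.height, clip.fps * factor)


base = Clip(width=768, height=512, fps=24)  # stage 1: base generation
hires = spatial_upscale(base, scale=2.0)    # stage 2: 1536x1024 @ 24 fps
final = temporal_upscale(hires, factor=2)   # stage 3: 1536x1024 @ 48 fps
print(final)  # Clip(width=1536, height=1024, fps=48)
```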
What can I use it for?
Content creators can use LTX-2.3 to produce marketing videos, social media content, and visual narratives without separate audio production. Filmmakers and animators can generate concept videos and storyboards directly from descriptions. Game developers and interactive media creators can produce in-engine cinematics and dialogue sequences. The model supports local deployment through ComfyUI integration or PyTorch inference, making it suitable for production pipelines that require data privacy or offline operation. The trainable base model enables studios to customize outputs for brand consistency or specific aesthetic styles through efficient LoRA adaptation.
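For LoRA-based customization, a diffusers-style workflow could look roughly like the sketch below. The pipeline class, repository ID, and adapter path are all illustrative assumptions; the official Lightricks tooling may use a different loading mechanism.

```python
# Hypothetical LoRA workflow; the repo ID and adapter path are placeholders.
import torch
from diffusers import LTXPipeline

pipe = LTXPipeline.from_pretrained(
    "Lightricks/LTX-2.3", torch_dtype=torch.bfloat16
).to("cuda")

# Load an adapter trained on your own footage for style or likeness control.
pipe.load_lora_weights("path/to/brand_style_lora.safetensors")

frames = pipe(
    prompt="Product close-up in our house style, warm rim lighting",
    width=768, height=512, num_frames=121,
).frames[0]
```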
Things to try
Experiment with detailed environmental descriptions to see how the model interprets lighting, weather, and spatial relationships in generated scenes. Test specific character descriptions and actions to see how prompt adherence improves with concrete detail. Use LoRA fine-tuning to create a custom style that carries your visual brand across multiple videos. Explore multi-stage generation with the spatial and temporal upscalers to produce higher-resolution or higher-frame-rate content from the same base generation. Generate scenes built around non-speech audio, such as ambience and effects, to gauge the model's audio quality and where it performs best.
This is a simplified guide to an AI model called LTX-2.3 maintained by Lightricks. If you enjoy this kind of analysis, join AIModels.fyi or follow us on Twitter.