Unleashing Local AI Video: A Comprehensive Guide to LTX-2 and ComfyUI Workflows
Navigating the rapidly evolving landscape of AI-powered creative tools can be exhilarating, and occasionally, a bit overwhelming. But every so often, a release emerges that genuinely shifts the paradigm. Today, we are diving deep into such a development: LTX-2, Lightricks' flagship open-source audio-video generation model. What makes LTX-2 truly stand out is its commitment to openness, providing not just the model weights, but also the full training code and benchmarks. This empowers developers and creators alike, moving beyond simple demos to truly adaptable, production-ready workflows that can run right on your local machine.
The Power of Truly Open-Source AI Video
LTX-2 isn't merely another wrapper around a proprietary system; it represents a significant leap forward for local AI video generation. Historically, 'open source' in AI often meant receiving model weights with little to no guidance on how to train or adapt them. LTX-2, however, provides a complete ecosystem, including its modular training framework and recipes. This is crucial because it allows studios and individual developers to fine-tune the model for specific domains or integrate it deeply into their existing pipelines, all while maintaining privacy and control over their intellectual property on local hardware. Imagine adapting a video generation model to perfectly understand your unique artistic style or specific corporate branding. LTX-2 makes that a tangible reality.
This robust model is optimized for NVIDIA's powerful RTX GPUs, making high-quality video creation accessible on consumer-grade hardware. It supports resolutions up to 4K and natively integrates audio, paving the way for truly immersive generative experiences. The flexibility of LTX-2 extends to various multimodal pipelines, including text-to-video, image-to-video, video-to-video, and even audio-conditioned generation, giving creators an unparalleled toolkit for bringing their visions to life.
Essential Foundations for Your Journey
Before we immerse ourselves in the practicalities of LTX-2, let's ensure we have a solid foundation. While LTX-2 primarily integrates with a visual node-based interface, a basic understanding of a few key areas will significantly enhance your learning experience:
- Python Fundamentals: Many AI tools and scripts are built with Python. While you won't be writing extensive Python code for ComfyUI's visual interface, familiarity with package management and basic scripting concepts is always beneficial.
- Generative AI Concepts: A grasp of what generative models do, how prompts work, and the basic idea behind neural networks will make the workflow more intuitive.
- Command Line Interface (CLI) Basics: For installing tools like ComfyUI and managing model weights, a comfort level with navigating your file system and executing commands in the terminal is very helpful.
- GPU Awareness: Understanding your system's GPU, particularly its Video RAM (VRAM), is important, as it directly impacts what models and resolutions you can run locally. LTX-2 is optimized for NVIDIA RTX GPUs, so having one is a significant advantage.
Your Toolkit: Libraries, Frameworks, and Hardware
To effectively harness the power of LTX-2, we'll be relying on a few pivotal components:
- LTX-2: This is the core audio-video generation model family provided by Lightricks. It includes full and distilled model weights, training frameworks, benchmarks, and Low-Rank Adaptations (LoRAs). The full model offers maximum quality and is ideal for fine-tuning, while distilled and quantized variants are optimized for reduced memory and compute demands, making them perfect for faster iterations on standard local workstations.
- ComfyUI: A sophisticated node-based graphical user interface (GUI) designed for running generative AI models locally. ComfyUI visually represents the entire AI pipeline, allowing users to connect nodes that represent different stages of the generation process, from model loading to sampling and video decoding. This visual approach offers granular control and transparency over the workflow.
- LoRAs (Low-Rank Adaptations): These are lightweight, modular fine-tunes that can be applied on top of a larger base model like LTX-2. Instead of retraining an entire massive model, LoRAs allow you to inject specific skills, styles, characters, or camera movements into the model's output with minimal computational overhead. Lightricks provides several accompanying LoRAs specifically for LTX-2 to control style, structure, motion, and camera behavior.
- NVIDIA RTX GPUs: LTX-2 is specifically optimized for these graphics processing units. Having a powerful RTX card, such as an RTX 4090 with 24GB of VRAM, allows you to run the full LTX-2 model at higher resolutions and faster speeds. However, the availability of distilled models means you don't necessarily need top-tier hardware to get started.
- HuggingFace: A platform widely used for sharing AI models, datasets, and demos. LTX-2 model weights and LoRAs are available for download here.
- GitHub: Essential for accessing the LTX-2 and ComfyUI-LTXVideo repositories, where you'll find the training code, documentation, and reference workflows.
Building Your First LTX-2 Video in ComfyUI
Let's get hands-on and walk through setting up LTX-2 within ComfyUI. The beauty of ComfyUI lies in its visual, modular pipeline, making complex AI workflows manageable.
Setting Up ComfyUI and LTX-2
First, ensure ComfyUI is installed and up to date on your system. If you haven't installed it yet, dedicated guides are available to walk you through the process. Once ComfyUI is ready, you'll need to download the LTX-2 model weights. These are typically found on platforms like HuggingFace or linked directly from the official GitHub repository. You'll have options for both the full and distilled model variants. For initial experimentation and faster iteration, starting with a distilled model is often recommended, especially if your GPU has less than 24GB of VRAM.
After downloading, place the model weights into your ComfyUI/models/checkpoints directory. Similarly, any LoRAs you download should go into ComfyUI/models/loras.
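The folder layout above can be sketched in a few lines. The install path below is an assumption (point it at your own ComfyUI directory), and the repo id is left as a placeholder to be filled in from the official download links:

```python
# Sketch of the folder layout ComfyUI expects for LTX-2 files.
# The install path (~/ComfyUI) is an assumption -- adjust for your setup.
from pathlib import Path

comfyui = Path.home() / "ComfyUI"
checkpoints = comfyui / "models" / "checkpoints"  # full / distilled model weights
loras = comfyui / "models" / "loras"              # LoRA .safetensors files

for folder in (checkpoints, loras):
    folder.mkdir(parents=True, exist_ok=True)

# Weights can be fetched from HuggingFace, e.g. with the huggingface_hub CLI
# (pip install huggingface_hub); replace the placeholder with the real repo id:
print(f"huggingface-cli download <ltx2-repo-id> --local-dir {checkpoints}")
```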
Navigating the ComfyUI Interface and Loading a Template
When you launch ComfyUI, you'll be greeted by its node-based canvas. It might look like a 'plate of spaghetti' at first glance, but each 'noodle' is a connection, and each 'meatball' is a node performing a specific function. ComfyUI visualizes how data flows from your initial input, through the various stages of the AI model, to the final output.
To begin, open a new tab in ComfyUI and navigate to the Templates menu. Search for LTX-2 and load one of the provided text-to-video templates. You'll typically find two main paths: one for the full model and one for the distilled version. Choose the distilled template for quick tests, and later, for higher fidelity, switch to the full template if your hardware allows.
Understanding the Two-Stage Generation Process
It's important to know that LTX-2 constructs video in two distinct stages. It doesn't just render a high-resolution file in one go. Instead, the workflow involves:
- Base Video Generation: The model first generates a lower-resolution base video.
- Spatial Upscaling: The base video is then passed to an upscaler, which refines the details and scales it up to your desired full resolution in a second pass.
This two-stage approach optimizes for both speed and quality, ensuring efficient use of resources.
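The two stages can be illustrated with a toy sketch. Random noise stands in for the generation model and nearest-neighbour repetition for the learned upscaler; this is purely illustrative, not LTX-2's actual implementation:

```python
# Toy sketch of the two-stage pipeline: generate a small base video,
# then spatially upscale it in a second pass. Random noise stands in for
# the model; nearest-neighbour repetition stands in for the upscaler.
import random

def generate_base_video(frames, height, width, seed=0):
    """Stage 1: produce a low-resolution base video (frames x H x W)."""
    rng = random.Random(seed)
    return [[[rng.random() for _ in range(width)]
             for _ in range(height)]
            for _ in range(frames)]

def spatial_upscale(video, factor=2):
    """Stage 2: enlarge each frame (here via nearest-neighbour repetition)."""
    return [[[px for px in row for _ in range(factor)]   # widen each row
             for row in frame for _ in range(factor)]    # then duplicate rows
            for frame in video]

base = generate_base_video(frames=8, height=4, width=6)
final = spatial_upscale(base, factor=2)
print(len(final), len(final[0]), len(final[0][0]))  # 8 frames, now 8x12 pixels
```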
Crafting Your First Text-to-Video Prompt
With the template loaded, let's configure the parameters. In the ComfyUI nodes, you'll typically find settings for width, height, and frame count. A common starting point is 1280 by 770 for resolution, and a frame count of 121. At 24 frames per second (fps), this yields a 5-second clip.
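The clip-length arithmetic is simply frame count divided by frame rate:

```python
# Duration of a generated clip from the template settings:
frame_count = 121
fps = 24
duration = frame_count / fps
print(f"{duration:.2f} s")  # 5.04 s, i.e. roughly a 5-second clip
```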
LTX-2 is natively multimodal and responds well to natural language. You don't need complex tags or technical jargon. Simply describe what you want to see. You can detail the scene, lighting, characters, and even specific dialogue. For example:
A man in a black tuxedo stands in a red tiled bathroom. He says, "Notice the difference." The camera is dramatically zooming in on his face. The red tiles are very reflective and have some light scratches and imperfections.
Once your prompt and settings are in place, click the Run button. You'll see the process unfold, moving from the initial low-resolution generation to the final upscaled video. Wes Roth, the presenter, demonstrates that a distilled version of a 5-second clip might take around 53 seconds, while the full model could take over 2 minutes, highlighting the trade-off between speed and fidelity.
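If you later want to queue generations without clicking Run, ComfyUI also exposes an HTTP API: a workflow exported in API format can be POSTed to the /prompt endpoint of the local server (port 8188 by default). The one-node graph below is a placeholder, not a real LTX-2 workflow; export your actual graph from ComfyUI instead:

```python
# Hedged sketch: queueing a workflow via ComfyUI's HTTP API instead of the
# Run button. ComfyUI (default http://127.0.0.1:8188) accepts an API-format
# workflow JSON at POST /prompt. The graph below is a placeholder only.
import json
import urllib.request

def build_payload(workflow: dict, client_id: str = "ltx2-demo") -> bytes:
    """Wrap an API-format workflow graph in the JSON body ComfyUI expects."""
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode()

def queue_prompt(workflow: dict, host: str = "127.0.0.1", port: int = 8188):
    req = urllib.request.Request(
        f"http://{host}:{port}/prompt",
        data=build_payload(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Placeholder graph; export a real one via ComfyUI's API-format save option.
placeholder_workflow = {"1": {"class_type": "CheckpointLoaderSimple",
                              "inputs": {"ckpt_name": "ltx2.safetensors"}}}
payload = json.loads(build_payload(placeholder_workflow))
print(sorted(payload))  # ['client_id', 'prompt']
# queue_prompt(placeholder_workflow)  # uncomment with ComfyUI running locally
```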
Integrating LoRAs for Granular Control
LoRAs are powerful tools for fine-tuning specific aspects of your video. Lightricks provides various LoRAs for LTX-2, designed to control style, structure, motion, and camera behavior, such as 'dolly left' or 'dolly out'.
To use a LoRA:
- Download: Obtain the .safetensors file for your desired LoRA from HuggingFace.
- Placement: Save the LoRA file into your ComfyUI/models/loras directory.
- Enable in ComfyUI: In your ComfyUI workflow, locate the LoRA loader nodes (often found near the model loading section). These nodes are typically bypassed by default; right-click them and toggle Bypass, or use Ctrl+B as a shortcut, to enable them.
- Connect and Configure: Ensure the LoRA nodes are connected correctly within the pipeline. Crucially, if the LoRA affects camera movement or style, it must be applied to both the initial base generation stage AND the upscaling stage. If you only apply it to the first pass, the upscaler might 'hallucinate' unintended effects, disregarding your LoRA's influence.
- Prompt Triggering: Explicitly trigger the LoRA within your text prompt. For a 'dolly left' LoRA, you might include "dolly left shot" in your description. For example:
Dolly left shot of man running away from lion saying, "Oh no! Oh no! Oh no!" The far left side of the room reveals ancient ruins as the camera shifts.
When configuring a LoRA, you can also adjust its strength parameter. A strength of 1.0 will strictly force the model to adhere to the LoRA's learned concept, while a lower value, like 0.8, allows the base model more creative freedom. Lightricks often provides recommended prompting guidelines for their LoRAs to achieve optimal results, such as describing off-frame elements that become visible during a camera movement to give the model a visual map.
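The strength parameter has a simple mathematical reading: a LoRA adds a low-rank update to the base weights, roughly W' = W + strength * (B @ A), ignoring the alpha/rank scaling real implementations apply. A miniature illustration with tiny hand-made matrices:

```python
# Minimal illustration of what a LoRA does to a weight matrix:
# W' = W + strength * (B @ A), where A and B are small low-rank factors.
# Tiny toy matrices; real LoRAs adapt millions of weights this way.

def matmul(B, A):
    rows, inner, cols = len(B), len(A), len(A[0])
    return [[sum(B[i][k] * A[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def apply_lora(W, A, B, strength=1.0):
    delta = matmul(B, A)  # low-rank weight update learned by the LoRA
    return [[W[i][j] + strength * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]

W = [[1.0, 0.0], [0.0, 1.0]]  # base weights (2x2)
A = [[0.5, 0.5]]              # rank-1 factors
B = [[1.0], [1.0]]

full = apply_lora(W, A, B, strength=1.0)  # strict adherence to the LoRA
soft = apply_lora(W, A, B, strength=0.8)  # more freedom for the base model
print(f"{full[0][0]:.2f} {soft[0][0]:.2f}")  # 1.50 1.40
```

Lowering the strength shrinks the update toward the unmodified base weights, which is why 0.8 leaves the model more creative latitude than 1.0.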
Image-to-Video Generation
LTX-2 also excels at animating static images. Load the LTX-2 image to video template in ComfyUI. This workflow is very similar to text-to-video, with the addition of an image loader node. Here, your image acts as the starting frame, providing structural context, while your text prompt guides the motion and animation.
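Conceptually, the loaded image is pinned as the first frame and the model fills in the rest. A toy sketch of that idea (random noise standing in for the generator, which in the real pipeline is guided by your text prompt):

```python
# Toy sketch of image-to-video conditioning: the loaded image becomes
# frame 0, and the "model" (random noise here, purely illustrative)
# generates the remaining frames.
import random

def animate_from_image(image, frames, seed=0):
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    video = [image]  # the input image provides the starting frame
    for _ in range(frames - 1):
        video.append([[rng.random() for _ in range(w)] for _ in range(h)])
    return video

still = [[0.2, 0.4], [0.6, 0.8]]  # stand-in for a loaded painting
clip = animate_from_image(still, frames=5)
print(len(clip), clip[0] is still)  # 5 True
```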
For example, you could upload a famous painting, such as one by Edvard Munch: the image anchors the composition, while your text prompt describes how the scene should move, and LTX-2 animates the otherwise static artwork.
