Text-to-Video Workflow for Beginners

This tutorial will show you how to generate a video with synchronized audio entirely from a text prompt.

When to Use

Text-to-Video is the right starting point when you want to explore a concept from scratch, test a style or mood, or generate a scene where you don’t have a specific frame to anchor from. If you have a source image you want to animate, see the Image-to-Video guide instead.

Step-by-Step Guide

This guide assumes ComfyUI is already installed on your local machine. If you haven’t installed it, go to the ComfyUI download page first. Check the system requirements to make sure your hardware is supported.

1. Load the Template and Download Models

  1. Open ComfyUI
  2. Open the Templates panel (left sidebar) and search for LTX templates
  3. Select the Text-to-Video template
  4. The workflow will load as a node graph with all settings pre-configured
  5. Open the Workflow Overview panel (right sidebar). If this is your first time, it will show Missing Models with a list of required files
  6. Click Download all to download the models directly within ComfyUI

The template requires four model files (~28 GB total). The download may take some time depending on your connection. You only need to download these files once; they are reused every time you run the workflow.

The models downloaded are:

File | Description | Placement
ltx-2.3-22b-dev-fp8.safetensors | Model checkpoint (FP8) | ComfyUI/models/checkpoints/
ltx-2.3-22b-distilled-lora-384.safetensors | Distilled LoRA | ComfyUI/models/loras/
gemma_3_12B_it_fp4_mixed.safetensors | Text encoder (Gemma 3 12B, FP4) | ComfyUI/models/text_encoders/
ltx-2.3-spatial-upscaler-x2-1.1.safetensors | Spatial upscaler (2x) | ComfyUI/models/latent_upscale_models/

All files are also available in the LTX-2.3 HuggingFace collection if you prefer to download them manually.
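
If you download the files manually, it can help to confirm that each one is in the folder listed above before you run the workflow. The following is a minimal sketch, assuming ComfyUI is installed in a folder named ComfyUI under the current directory. The missing_models helper and its comfy_root argument are illustrative names, not part of ComfyUI.

```python
from pathlib import Path

# File names and folders are taken from the model table above.
REQUIRED_MODELS = {
    "ltx-2.3-22b-dev-fp8.safetensors": "models/checkpoints",
    "ltx-2.3-22b-distilled-lora-384.safetensors": "models/loras",
    "gemma_3_12B_it_fp4_mixed.safetensors": "models/text_encoders",
    "ltx-2.3-spatial-upscaler-x2-1.1.safetensors": "models/latent_upscale_models",
}


def missing_models(comfy_root):
    """Return the expected paths of model files that are not on disk."""
    root = Path(comfy_root)
    expected = [root / folder / name for name, folder in REQUIRED_MODELS.items()]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Assumption: ComfyUI lives in ./ComfyUI; change this to your install path.
    for path in missing_models("ComfyUI"):
        print(f"Missing: {path}")
```

If the script prints nothing, all four files were found in the expected folders.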

2. Write Your Prompt

The prompt is the most important input in a Text-to-Video workflow. Without visual guidance, the model relies entirely on your description to build the scene.

Find the Prompt field and describe your scene. Strong prompts cover:

  • Scene and setting — environment, lighting, time of day, atmosphere
  • Character details — appearance, clothing, actions, physical expressions of emotion
  • Camera movement — shot type, motion, angles
  • Audio — dialogue (in quotation marks), music, ambient sound

Aim for 4–8 descriptive sentences written as a single flowing paragraph in present tense. Longer, more detailed prompts consistently produce better results.

See the Prompting Guide for detailed tips and examples.
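
If you write prompts in a script or batch several of them, a rough automated check against the length guideline above can catch prompts that are too thin. The following is a minimal sketch, not part of LTX or ComfyUI: count_sentences and check_prompt are hypothetical helpers, the sentence splitting is a simple heuristic, and the example prompt is only an illustration of covering scene, character, camera, and audio.

```python
import re


def count_sentences(prompt):
    """Rough sentence count: split after ., ! or ?, allowing a closing quote."""
    parts = re.split(r'(?<=[.!?])["\u201d]?\s+', prompt.strip())
    return len([part for part in parts if part])


def check_prompt(prompt, min_sentences=4, max_sentences=8):
    """Return a list of notes; an empty list means the prompt fits the guideline."""
    notes = []
    count = count_sentences(prompt)
    if not min_sentences <= count <= max_sentences:
        notes.append(f"{count} sentences; aim for {min_sentences}-{max_sentences}")
    if "\n" in prompt.strip():
        notes.append("write the prompt as a single paragraph")
    return notes


# Illustrative prompt: scene, character, camera movement, then audio.
EXAMPLE_PROMPT = (
    "A rain-soaked city street glows under neon signs at night. "
    "A woman in a yellow raincoat walks slowly toward the camera, "
    "her shoulders hunched against the wind. "
    "The camera tracks backward at eye level, keeping her centered in the frame. "
    'She stops, looks up, and says, "I think we are lost." '
    "Soft synth music plays over the steady sound of rain."
)

print(check_prompt(EXAMPLE_PROMPT))  # an empty list means no notes
```

The check only looks at length and layout. It cannot tell whether the description itself is vivid or complete, so treat it as a sanity check rather than a quality score.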

3. Set Duration and Resolution

The template includes four parameters you can adjust:

Parameter | Default | Description
Duration | 5 seconds | Length of the generated video. The frame count is computed automatically as duration × frame rate + 1.
Frame Rate | 25 fps | 24 fps for cinematic feel, 25 fps for standard, 30 fps for smoother motion.
Width | 1280 | Output width in pixels. Video dimensions must be divisible by 32.
Height | 720 | Output height in pixels.

Higher resolutions and longer durations require more VRAM. Start at 1280×720 and 5 seconds for testing, and increase either setting only if your hardware supports it.
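
To see how duration and frame rate translate into the number of frames the model has to generate, you can apply the formula from the Duration row directly. The following is a small sketch of that arithmetic; frame_count is an illustrative helper name, not a ComfyUI function.

```python
def frame_count(duration_seconds, frame_rate):
    """Frame count as described above: duration x frame rate + 1."""
    return duration_seconds * frame_rate + 1


# Compare the three frame rates mentioned above for a 5-second video.
for fps in (24, 25, 30):
    print(f"5 s at {fps} fps -> {frame_count(5, fps)} frames")
# 5 s at 24 fps -> 121 frames
# 5 s at 25 fps -> 126 frames
# 5 s at 30 fps -> 151 frames
```

More frames mean more work per generation, which is one reason longer durations and higher frame rates need more VRAM.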

4. Generate

Click Run to start generation. The template automatically runs a two-stage pipeline with an upscale step between the stages:

  1. Stage 1 — Generates video and audio at half resolution (640×360 at default settings) in 8 steps
  2. Upscale — The video latent is spatially upscaled to full resolution using the spatial upscaler
  3. Stage 2 — Refines the upscaled video at full resolution in 3 steps

The audio is generated jointly with the video in Stage 1 and carried through to the final output.
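
The resolution and step count at each point in the run follow from your Width and Height settings. The following is a minimal sketch of that arithmetic, assuming Stage 1 always runs at exactly half the target width and height as described above. StagePlan and plan_stages are illustrative names and do not correspond to nodes in the template.

```python
from dataclasses import dataclass


@dataclass
class StagePlan:
    name: str
    width: int
    height: int
    steps: int  # sampling steps; 0 for the upscale, which is not a sampling pass


def plan_stages(width, height):
    """List the resolution and step count of each part of the run."""
    half_width, half_height = width // 2, height // 2
    return [
        StagePlan("Stage 1", half_width, half_height, 8),
        StagePlan("Upscale", half_width * 2, half_height * 2, 0),
        StagePlan("Stage 2", half_width * 2, half_height * 2, 3),
    ]


for stage in plan_stages(1280, 720):
    print(f"{stage.name}: {stage.width}x{stage.height}, {stage.steps} steps")
# Stage 1: 640x360, 8 steps
# Upscale: 1280x720, 0 steps
# Stage 2: 1280x720, 3 steps
```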

5. Review and Iterate

The output is saved as an MP4 with synchronized audio. To iterate:

  • Change the prompt and re-queue to explore different scenes
  • Adjust duration if the video is too short or too long for your content
  • Try different resolutions to match your target format (landscape, portrait, square)

The Stage 1 seed randomizes by default, so each generation produces a different result. To lock in a specific seed for additional tweaks, note the seed value and switch it from randomize to fixed.

How the Pipeline Works

Understanding the two-stage pipeline helps when troubleshooting or fine-tuning results.

Model loading: The template loads the model checkpoint, applies the distilled LoRA (at strength 0.5) for faster inference, and loads the Gemma text encoder locally for prompt processing. A negative prompt ("pc game, console game, video game, cartoon, childish, ugly") is applied automatically to improve output quality.

Stage 1 (low resolution): Empty video and audio latents are created at half your target resolution, concatenated into a joint audio-video latent, and sampled together using the euler_ancestral_cfg_pp sampler with a manual sigma schedule (8 steps). This joint generation is what keeps audio and video synchronized. After sampling, the audio and video latents are separated.

Upscale: The video latent passes through the spatial upscaler model, doubling the resolution.

Stage 2 (high resolution): The upscaled video latent is recombined with the audio latent from Stage 1 and refined using the euler_cfg_pp sampler with 3 steps. This pass sharpens detail without regenerating the composition from scratch.

Decode: The video and audio latents are separated and decoded independently: video through tiled VAE decoding (to minimize VRAM usage) and audio through the audio VAE decoder. The decoded streams are then merged into the final video file.
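
The order of operations above can be traced with a toy model of the data flow. The following sketch is purely illustrative: it does no real sampling, the classes and functions are stand-ins rather than ComfyUI nodes, and sizes are tracked in output pixels rather than actual latent dimensions. It only shows that both latents pass through both sampling stages while the upscaler touches the video latent alone.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VideoLatent:
    width: int
    height: int
    passes: tuple = ()  # labels of the sampling passes applied so far


@dataclass(frozen=True)
class AudioLatent:
    passes: tuple = ()


def sample_jointly(video, audio, sampler, steps):
    """Stand-in for joint sampling: both latents go through the same pass."""
    label = f"{sampler} x{steps}"
    return (
        replace(video, passes=video.passes + (label,)),
        replace(audio, passes=audio.passes + (label,)),
    )


def upscale_2x(video):
    """Stand-in for the spatial upscaler: doubles width and height."""
    return replace(video, width=video.width * 2, height=video.height * 2)


def run_pipeline(width, height):
    video = VideoLatent(width // 2, height // 2)  # empty latents at half resolution
    audio = AudioLatent()
    video, audio = sample_jointly(video, audio, "euler_ancestral_cfg_pp", 8)  # Stage 1
    video = upscale_2x(video)  # only the video latent is upscaled
    video, audio = sample_jointly(video, audio, "euler_cfg_pp", 3)  # Stage 2
    return video, audio  # in the real workflow, decoded separately and then merged


video, audio = run_pipeline(1280, 720)
print(video)
print(audio)
```

Because the audio latent takes part in both sampling passes alongside the video latent, the two stay aligned even though only the video changes resolution.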

Advanced Techniques

The built-in template is designed to get you generating quickly with basic defaults. For more control over output quality, style, and behavior, you can move to more complex workflows that expose additional nodes and parameters.

Once you’re comfortable with the built-in template, the Two-Stage Distilled workflow from our GitHub repository uses the same two-stage pipeline structure with higher-precision model files and additional nodes that together produce better results:

  • Full-precision checkpoint (ltx-2.3-22b-dev.safetensors) instead of the template’s FP8-quantized version
  • v1.1 distilled LoRA (ltx-2.3-22b-distilled-lora-384-1.1.safetensors) — an updated version of the LoRA shipped with the template
  • API text encoding option — the workflow includes a GemmaAPITextEncode node (bypassed by default) that gives users the option of offloading text encoding to a free API, reducing VRAM usage

The basic workflow structure is identical to the template, so everything you’ve learned still applies. To use it, download the JSON from our GitHub and drag it into ComfyUI. The Workflow Overview panel will prompt you to download any missing model files. Note that the full model workflow requires additional custom nodes that will need to be installed.

Distilled vs. Full Model

The template and the recommended workflow above both use the distilled model, which is a version of the full model that has been optimized to produce good results in fewer steps (8+3 in the two-stage pipeline). This makes generation significantly faster and is the best choice for iteration and experimentation.

The full model uses the full-precision checkpoint and requires more inference steps (15–40), but can produce higher-quality output with finer detail and more nuanced motion. The full model path uses the LTXV Scheduler instead of manual sigmas, a multimodal guider with independent audio and video guidance parameters, and applies the distilled LoRA at a lower strength (0.2).

A workflow that includes both distilled and full model paths side by side is available in the example workflows.

Using LoRAs

LoRAs can be added to further customize the model’s output style, motion characteristics, or character appearance. Add a LoRALoader node to your workflow to apply:

  • Style LoRAs for artistic or visual aesthetics
  • Motion LoRAs for specific types of movement
  • Character LoRAs for consistent character appearance

See the LoRA guide for training and usage instructions.

Python

Text-to-Video generation is also available through the PyTorch API for programmatic use and custom pipelines. See the PyTorch API documentation for setup and usage.