Text-to-Video Workflow for Beginners

This tutorial shows you how to generate a video with synchronized audio entirely from a text prompt.

When to Use

Text-to-Video is the right starting point when you want to explore a concept from scratch, test a style or mood, or generate a scene where you don’t have a specific frame to anchor from. If you have a source image you want to animate, see the Image-to-Video guide instead.

Step-by-Step Guide

This guide assumes ComfyUI is already installed on your local machine. If you haven’t installed it, go to the ComfyUI download page first. Check the system requirements to make sure your hardware is supported.

1. Load the Template and Download Models

Open ComfyUI
Open the Templates panel (left sidebar) and search for LTX templates
Select the Text-to-Video template
The workflow will load as a node graph with all settings pre-configured
Open the Workflow Overview panel (right sidebar). If this is your first time, it will show Missing Models with a list of required files
Click Download all to download the models directly within ComfyUI

The template requires four model files (~28 GB total). The download may take some time depending on your connection. You only need to download these files once; they are reused every time you run the workflow.

The models downloaded are:

File	Description	Placement
`ltx-2.3-22b-dev-fp8.safetensors`	Model checkpoint (FP8)	`ComfyUI/models/checkpoints/`
`ltx-2.3-22b-distilled-lora-384.safetensors`	Distilled LoRA	`ComfyUI/models/loras/`
`gemma_3_12B_it_fp4_mixed.safetensors`	Text encoder (Gemma 3 12B, FP4)	`ComfyUI/models/text_encoders/`
`ltx-2.3-spatial-upscaler-x2-1.1.safetensors`	Spatial upscaler (2x)	`ComfyUI/models/latent_upscale_models/`

All files are also available in the LTX-2.3 HuggingFace collection if you prefer to download them manually.
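If you place the files manually, a short script can confirm that each one landed in the folder listed in the table above. The following is a minimal sketch using only the Python standard library. The file names and subfolders come from the table; the function name `find_missing_models` and the install location you pass to it are assumptions for illustration, not part of ComfyUI.

```python
from pathlib import Path

# File names and subfolders are copied from the table above.
REQUIRED_MODELS = {
    "ltx-2.3-22b-dev-fp8.safetensors": "models/checkpoints",
    "ltx-2.3-22b-distilled-lora-384.safetensors": "models/loras",
    "gemma_3_12B_it_fp4_mixed.safetensors": "models/text_encoders",
    "ltx-2.3-spatial-upscaler-x2-1.1.safetensors": "models/latent_upscale_models",
}


def find_missing_models(comfyui_root):
    """Return the expected paths (relative to comfyui_root) that are not present."""
    root = Path(comfyui_root)
    missing = []
    for filename, subfolder in REQUIRED_MODELS.items():
        relative = Path(subfolder) / filename
        if not (root / relative).is_file():
            missing.append(relative.as_posix())
    return missing
```

Calling `find_missing_models("ComfyUI")` from the directory that contains your install returns an empty list once all four files are in place. The check only looks at file names, so it will not detect a partial or corrupted download.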

2. Write Your Prompt

The prompt is the most important input in a Text-to-Video workflow. Without visual guidance, the model relies entirely on your description to build the scene.

Find the Prompt field and describe your scene. Strong prompts cover:

Scene and setting — environment, lighting, time of day, atmosphere
Character details — appearance, clothing, actions, physical expressions of emotion
Camera movement — shot type, motion, angles
Audio — dialogue (in quotation marks), music, ambient sound

Aim for 4–8 descriptive sentences written as a single flowing paragraph in present tense. Longer, more detailed prompts consistently produce better results.

See the Prompting Guide for detailed tips and examples.
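If you generate prompts from a script, the four elements listed above can be joined into a single paragraph and checked against the 4–8 sentence guideline. This is only a sketch: `build_prompt` and `count_sentences` are hypothetical helper names, the example scene is invented for illustration, and the sentence counter is a rough heuristic that can miscount dialogue ending in a question or exclamation mark.

```python
import re


def build_prompt(scene, character, camera, audio):
    """Join the four prompt elements into one flowing paragraph."""
    parts = [scene, character, camera, audio]
    # Strip stray whitespace and skip any element that was left empty.
    return " ".join(part.strip() for part in parts if part.strip())


def count_sentences(prompt):
    """Rough count: runs of text ending in '.', '!' or '?', plus an optional closing quote."""
    return len(re.findall(r'[^.!?]+[.!?]+["”]?', prompt))


# Invented example content, written in present tense.
prompt = build_prompt(
    scene="A narrow cobblestone street glistens after evening rain, lit by warm "
          "lamps and neon signs. Thin mist hangs in the cool air.",
    character="A woman in a long red coat walks slowly toward the camera, "
              "holding a black umbrella, and smiles faintly.",
    camera="The camera tracks backward at eye level in a steady medium shot.",
    audio='She says, "I almost forgot how quiet it gets here." '
          "Soft rain and distant traffic fill the background.",
)

if not 4 <= count_sentences(prompt) <= 8:
    print("Consider adjusting the prompt length:", count_sentences(prompt), "sentences")
```

The resulting string can be pasted into the Prompt field as-is.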

3. Set Duration and Resolution

The template includes four parameters you can adjust:

Parameter	Default	Description
Duration	5 seconds	Length of the generated video. The frame count is computed automatically as `duration × frame rate + 1`.
Frame Rate	25 fps	Use 24 fps for a cinematic feel, 25 fps as the standard rate, or 30 fps for smoother motion.
Width	1280	Output width in pixels. Video dimensions must be divisible by 32.
Height	720	Output height in pixels.

Higher resolutions and longer durations require more VRAM. Start at 1280×720 and 5 seconds for testing, then increase either setting if your hardware supports it.
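The frame count formula from the table can be worked through in a few lines of Python. The function name `frame_count` is an assumption for illustration; the template performs this calculation for you.

```python
def frame_count(duration_seconds, frame_rate):
    """Frame count as described in the table: duration × frame rate + 1."""
    return duration_seconds * frame_rate + 1


# Frame counts for a 5-second clip at the three frame rates mentioned above.
for fps in (24, 25, 30):
    print(fps, "fps ->", frame_count(5, fps), "frames")
```

At the default settings (5 seconds at 25 fps) this gives 126 frames.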

4. Generate

Click Run to start generation. The template runs a two-stage pipeline automatically:

Stage 1 — Generates video and audio at half resolution (640×360 at default settings) in 8 steps
Upscale — The video latent is spatially upscaled to full resolution using the spatial upscaler
Stage 2 — Refines the upscaled video at full resolution in 3 steps

The audio is generated jointly with the video in Stage 1 and carried through to the final output.
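To see how your Width and Height settings map onto the stages above, the plan can be sketched as plain data. This assumes "half resolution" means integer halving of each dimension, which matches the 640×360 figure quoted for the defaults; `pipeline_plan` is a hypothetical helper, not a ComfyUI function.

```python
def pipeline_plan(width, height):
    """Return the resolution and step count for each stage of the two-stage pipeline."""
    # Assumption: Stage 1 halves each dimension using integer division.
    stage1 = (width // 2, height // 2)
    # The spatial upscaler doubles the Stage 1 resolution.
    upscaled = (stage1[0] * 2, stage1[1] * 2)
    return [
        {"stage": "Stage 1", "resolution": stage1, "steps": 8},
        {"stage": "Upscale", "resolution": upscaled, "steps": 0},
        {"stage": "Stage 2", "resolution": upscaled, "steps": 3},
    ]


for entry in pipeline_plan(1280, 720):
    print(entry["stage"], entry["resolution"], entry["steps"], "steps")
```

For the default 1280×720 target, Stage 1 runs at 640×360 and the pipeline uses 11 sampling steps in total.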

5. Review and Iterate

The output is saved as an MP4 with synchronized audio. To iterate:

Change the prompt and re-queue to explore different scenes
Adjust duration if the video is too short or too long for your content
Try different resolutions to match your target format (landscape, portrait, square)

The Stage 1 seed randomizes by default, so each generation produces a different result. To lock in a specific seed for additional tweaks, note the seed value and switch it from randomize to fixed.
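The effect of a fixed seed can be illustrated with Python's standard `random` module. This is a generic sketch of seeded randomness, not ComfyUI's internal noise generation; `stand_in_generation` is a hypothetical placeholder for a real sampling run.

```python
import random


def stand_in_generation(seed, length=4):
    """Placeholder for a generation run: returns pseudo-random values derived from the seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(length)]


# A fixed seed reproduces the same starting noise on every run.
print(stand_in_generation(42) == stand_in_generation(42))

# A different seed gives a different starting point.
print(stand_in_generation(42) == stand_in_generation(43))
```

This is why fixing the seed lets you attribute a change in the output to your prompt or settings rather than to new random noise.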

How the Pipeline Works

Understanding the two-stage pipeline helps when troubleshooting or fine-tuning results.

Model loading: The template loads the model checkpoint, applies the distilled LoRA (at strength 0.5) for faster inference, and loads the Gemma text encoder locally for prompt processing. A negative prompt ("pc game, console game, video game, cartoon, childish, ugly") is applied automatically to improve output quality.

Stage 1 (low resolution): Empty video and audio latents are created at half your target resolution, concatenated into a joint audio-video latent, and sampled together using the euler_ancestral_cfg_pp sampler with a manual sigma schedule (8 steps). This joint generation is what keeps audio and video synchronized. After sampling, the audio and video latents are separated.

Upscale: The video latent passes through the spatial upscaler model, doubling the resolution.

Stage 2 (high resolution): The upscaled video latent is recombined with the audio latent from Stage 1 and refined using the euler_cfg_pp sampler with 3 steps. This pass sharpens detail without regenerating the composition from scratch.

Decode: The video and audio latents are separated and decoded independently. The video goes through tiled VAE decoding (to minimize VRAM usage) and the audio through the audio VAE decoder; the two are then merged into the final video file.

Advanced Techniques

The built-in template is designed to get you generating quickly with basic defaults. For more control over output quality, style, and behavior, you can move to more complex workflows that expose additional nodes and parameters.

Recommended Workflow

Once you’re comfortable with the built-in template, try the Two-Stage Distilled workflow from our GitHub repository. It uses the same two-stage pipeline structure, with higher-precision model files and additional nodes that together produce better results:

Full-precision checkpoint (ltx-2.3-22b-dev.safetensors) instead of the template’s FP8-quantized version
v1.1 distilled LoRA (ltx-2.3-22b-distilled-lora-384-1.1.safetensors) — an updated version of the LoRA shipped with the template
API text encoding option — the workflow includes a GemmaAPITextEncode node (bypassed by default) that gives users the option of offloading text encoding to a free API, reducing VRAM usage

The basic workflow structure is identical to the template, so everything you’ve learned still applies. To use it, download the JSON from our GitHub and drag it into ComfyUI. The Workflow Overview panel will prompt you to download any missing model files. Note that the full model workflow requires additional custom nodes that will need to be installed.

Distilled vs. Full Model

The template and the recommended workflow above both use the distilled model, which is a version of the full model that has been optimized to produce good results in fewer steps (8+3 in the two-stage pipeline). This makes generation significantly faster and is the best choice for iteration and experimentation.

The full model uses the full-precision checkpoint and requires more inference steps (15–40), but can produce higher-quality output with finer detail and more nuanced motion. The full model path uses the LTXV Scheduler instead of manual sigmas, a multimodal guider with independent audio and video guidance parameters, and applies the distilled LoRA at a lower strength (0.2).

A workflow that includes both distilled and full model paths side by side is available in the example workflows.

Using LoRAs

LoRAs can be added to further customize the model’s output style, motion characteristics, or character appearance. Add a LoRALoader node to your workflow to apply:

Style LoRAs for artistic or visual aesthetics
Motion LoRAs for specific types of movement
Character LoRAs for consistent character appearance

See the LoRA guide for training and usage instructions.
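As general background on what the strength setting does, a LoRA is commonly described as a low-rank update added to an existing weight matrix, scaled by the strength. The sketch below shows that idea in pure Python on a tiny 2×2 example. It assumes the standard formulation (weight + strength × B·A) and leaves out any additional rank or alpha scaling; it is not taken from the LTX or ComfyUI implementation, and `apply_lora` is a hypothetical helper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_b, lora_a, strength):
    """Return weight + strength * (lora_b @ lora_a) without modifying the inputs."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + strength * d for w, d in zip(weight_row, delta_row)]
        for weight_row, delta_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for one base model weight matrix
lora_b = [[1.0], [2.0]]            # 2x1 factor
lora_a = [[0.5, 0.5]]              # 1x2 factor, so the update has rank 1

# Compare the two strengths mentioned in this guide for the distilled LoRA.
for strength in (0.5, 0.2):
    print(strength, apply_lora(weight, lora_b, lora_a, strength))
```

A lower strength keeps the result closer to the base weights, and a strength of 0 leaves them unchanged.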

Python

Text-to-Video generation is also available through the PyTorch API for programmatic use and custom pipelines. See the PyTorch API documentation for setup and usage.