Image-to-Video Workflow for Beginners

This tutorial will show you how to generate a video with synchronized audio from a source image and a supporting text prompt. Using a guiding image gives you greater control over the output than generating from a text prompt alone.

When to Use

Image-to-Video is the right choice when you have a specific starting image you want to bring to life. It’s ideal for maintaining character appearance, controlling composition, or animating an existing image with motion and audio. If you want to generate a scene entirely from a text description, see the Text-to-Video guide instead.

Step-by-Step Guide

This guide assumes ComfyUI is already installed on your local machine. If you haven’t installed it, go to the ComfyUI download page first. Check the system requirements to make sure your hardware is supported.

1. Load the Template and Download Models

  1. Open ComfyUI
  2. Open the Templates panel (left sidebar) and search for LTX templates
  3. Select the Image-to-Video template
  4. The workflow will load as a node graph with all settings pre-configured
  5. Open the Workflow Overview panel (right sidebar). If this is your first time, it will show Missing Models with a list of required files
  6. Click Download all to download the models directly within ComfyUI

The template requires four model files (~28 GB total). The download may take some time depending on your connection. You only need to download them once; they are reused every time you run the workflow.

The models downloaded are:

| File | Description | Placement |
|---|---|---|
| ltx-2.3-22b-dev-fp8.safetensors | Model checkpoint (FP8) | ComfyUI/models/checkpoints/ |
| ltx-2.3-22b-distilled-lora-384.safetensors | Distilled LoRA | ComfyUI/models/loras/ |
| gemma_3_12B_it_fp4_mixed.safetensors | Text encoder (Gemma 3 12B, FP4) | ComfyUI/models/text_encoders/ |
| ltx-2.3-spatial-upscaler-x2-1.1.safetensors | Spatial upscaler (2x) | ComfyUI/models/latent_upscale_models/ |

All files are also available in the LTX-2.3 HuggingFace collection if you prefer to download them manually.
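
If you download the files manually, a short script can confirm that each one landed in the folder listed in the table. This is a minimal sketch, not part of the template: `missing_models` is a hypothetical helper, and the ComfyUI root path passed to it is an assumption you should replace with your own install location.

```python
from pathlib import Path

# File names and subfolders are copied from the table above.
REQUIRED_MODELS = {
    "ltx-2.3-22b-dev-fp8.safetensors": "models/checkpoints",
    "ltx-2.3-22b-distilled-lora-384.safetensors": "models/loras",
    "gemma_3_12B_it_fp4_mixed.safetensors": "models/text_encoders",
    "ltx-2.3-spatial-upscaler-x2-1.1.safetensors": "models/latent_upscale_models",
}


def missing_models(comfy_root):
    """Return the expected model paths that do not exist under comfy_root."""
    root = Path(comfy_root)
    expected = [root / folder / name for name, folder in REQUIRED_MODELS.items()]
    return [path for path in expected if not path.is_file()]


if __name__ == "__main__":
    # "ComfyUI" is an assumed install location; replace it with your own path.
    for path in missing_models("ComfyUI"):
        print(f"missing: {path}")
```

An empty result means all four files are in place. The script only checks that the files exist; it does not verify their size or integrity.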

2. Load Your Source Image

Find the LoadImage node and select your image file from your local files. This image sets the first frame of the video. The model will generate motion and audio from this starting point.

The image will be automatically resized to match the configured resolution, but for best results use a source image that matches your target aspect ratio. Supported formats include PNG, JPG, and WebP.
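
If your source image does not match the target aspect ratio, you can crop it yourself before loading it. The sketch below computes the largest centered crop with the target aspect ratio. `center_crop_box` is a hypothetical helper, and centering the crop is an assumption; this guide does not specify how the template's own resize handles mismatched aspect ratios.

```python
def center_crop_box(src_w, src_h, target_w, target_h):
    """Return (left, top, right, bottom) of the largest centered crop
    of the source that has the target aspect ratio."""
    # Compare aspect ratios by cross-multiplying integers, which avoids
    # floating-point rounding.
    if src_w * target_h > src_h * target_w:
        # Source is wider than the target: trim the sides.
        crop_w = src_h * target_w // target_h
        crop_h = src_h
    else:
        # Source is taller than (or the same shape as) the target:
        # trim the top and bottom.
        crop_w = src_w
        crop_h = src_w * target_h // target_w
    left = (src_w - crop_w) // 2
    top = (src_h - crop_h) // 2
    return (left, top, left + crop_w, top + crop_h)


# A 2000x720 panorama cropped for a 1280x720 target.
print(center_crop_box(2000, 720, 1280, 720))  # (360, 0, 1640, 720)
```

The returned tuple uses the (left, upper, right, lower) order that image libraries such as Pillow accept for cropping, so it can be passed to `Image.crop` if you use Pillow.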

3. Write Your Prompt

In Image-to-Video workflows, the prompt describes what should happen in the scene, because the model already knows what the scene looks like from your image. Focus on:

  • Motion and action — how subjects should move or change over time
  • Camera movement — tracking, panning, zooming, or static shots
  • Audio — dialogue (in quotation marks), music, ambient sound

You don’t need to describe the scene’s appearance in detail since the image already provides that. Instead, tell the model what comes next. For example: “The woman turns to face the camera and smiles, a warm breeze moving through her hair. Soft piano music plays in the background.”

See the Prompting Guide for detailed tips and examples.

4. Set Duration and Resolution

The template includes four parameters you can adjust:

| Parameter | Default | Description |
|---|---|---|
| Duration | 5 seconds | Length of the generated video. The frame count is computed automatically as duration × frame rate + 1. |
| Frame Rate | 25 fps | 24 fps for cinematic feel, 25 fps for standard, 30 fps for smoother motion. |
| Width | 1280 | Output width in pixels. Video dimensions must be divisible by 32. |
| Height | 720 | Output height in pixels. |

Higher resolutions and longer durations require more VRAM. Start at 1280×720 and 5 seconds for testing, and increase if your hardware supports it.
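
The frame-count formula from the table can be checked with a few lines of Python. This is only an illustration of the arithmetic: `frame_count` is a hypothetical helper, and rounding fractional results to the nearest whole frame is an assumption.

```python
def frame_count(duration_s, fps):
    """Frame count as described in the table: duration x frame rate + 1."""
    # round() is an assumption for durations that do not multiply out to a
    # whole number of frames.
    return round(duration_s * fps) + 1


# The default 5-second duration at the three frame rates mentioned above.
for fps in (24, 25, 30):
    print(fps, "fps ->", frame_count(5, fps), "frames")
# 24 fps -> 121 frames
# 25 fps -> 126 frames
# 30 fps -> 151 frames
```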

5. Generate

Click Run to start generation. The template automatically runs a two-stage pipeline, with an upscale pass between the stages:

  1. Stage 1 — Generates video and audio at half resolution (640×360 at default settings) in 8 steps. The source image is injected as conditioning at strength 0.7.
  2. Upscale — The video latent is spatially upscaled to full resolution using the spatial upscaler.
  3. Stage 2 — Refines the upscaled video at full resolution in 3 steps. The source image is re-injected at strength 1.0 to preserve detail at the higher resolution.

The audio is generated jointly with the video in Stage 1 and carried through to the final output.
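
The stages above can be summarized as a small data structure, which is handy for working out the Stage 1 resolution at non-default settings. This is a descriptive sketch, not the template's implementation: `Stage` and `stage_plan` are hypothetical names, and using integer division for the half resolution is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    width: int
    height: int
    steps: int                       # sampling steps; 0 for the upscale pass
    image_strength: Optional[float]  # None where no image is injected


def stage_plan(width, height):
    """Describe the pipeline stages for a given output resolution."""
    half_w, half_h = width // 2, height // 2
    return [
        Stage("Stage 1", half_w, half_h, steps=8, image_strength=0.7),
        Stage("Upscale", width, height, steps=0, image_strength=None),
        Stage("Stage 2", width, height, steps=3, image_strength=1.0),
    ]


for stage in stage_plan(1280, 720):
    print(f"{stage.name}: {stage.width}x{stage.height}, {stage.steps} steps")
# Stage 1: 640x360, 8 steps
# Stage 2 and the upscale pass both run at 1280x720
```

At the default settings, the plan totals 11 sampling steps: 8 at half resolution and 3 at full resolution.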

6. Review and Iterate

The output is saved as an MP4 with synchronized audio. To iterate:

  • Adjust the prompt to change the motion, action, or audio
  • Try a different source image to explore how the model interprets different starting frames
  • Adjust duration if the video is too short or too long for your content

The Stage 1 seed randomizes by default, so each generation produces a different result. To lock in a specific seed for additional tweaks, note the seed value and switch it from randomize to fixed.

How the Pipeline Works

Understanding the two-stage pipeline helps when troubleshooting or fine-tuning results.

Model loading: The template loads the model checkpoint, applies the distilled LoRA (at strength 0.5) for faster inference, and loads the Gemma text encoder locally for prompt processing. A negative prompt ("pc game, console game, video game, cartoon, childish, ugly") is applied automatically to improve output quality.

Image preprocessing: The source image is resized to match the target resolution and preprocessed for use as conditioning in both stages.

Stage 1 (low resolution): The source image is injected into the video latent at strength 0.7, establishing the visual starting point while leaving room for the model to generate natural motion. An empty audio latent is created and concatenated with the video latent into a joint audio-video latent, then sampled together using the euler_ancestral_cfg_pp sampler with a manual sigma schedule (8 steps). After sampling, the audio and video latents are separated.

Upscale: The video latent passes through the spatial upscaler model, doubling the resolution.

Stage 2 (high resolution): The source image is re-injected into the upscaled video latent at strength 1.0 to preserve fine detail at the higher resolution. The video latent is recombined with the audio latent from Stage 1 and refined using the euler_cfg_pp sampler with 3 steps.

Decode: The video and audio latents are separated and decoded independently: video through tiled VAE decoding (to minimize VRAM usage) and audio through the audio VAE decoder. The two are then merged into the final video file.

Advanced Techniques

The built-in template is designed to get you generating quickly with basic defaults. For more control over output quality, style, and behavior, you can move to more complex workflows that expose additional nodes and parameters.

Once you’re comfortable with the built-in template, try the Two-Stage Distilled workflow from our GitHub repository. It uses the same two-stage pipeline structure, with higher-precision model files and additional nodes that together produce better results:

  • Full-precision checkpoint (ltx-2.3-22b-dev.safetensors) instead of the template’s FP8-quantized version
  • v1.1 distilled LoRA (ltx-2.3-22b-distilled-lora-384-1.1.safetensors) — an updated version of the LoRA shipped with the template
  • API text encoding option — the workflow includes a GemmaAPITextEncode node (bypassed by default) that offloads text encoding to a free API, reducing VRAM usage

The basic workflow structure is identical to the template, so everything you’ve learned still applies. To use it, download the JSON from our GitHub and drag it into ComfyUI. The Workflow Overview panel will prompt you to download any missing model files. Note that the full model workflow requires additional custom nodes that will need to be installed.

Distilled vs. Full Model

The template and the recommended workflow above both use the distilled model, which is a version of the full model that has been optimized to produce good results in fewer steps (8+3 in the two-stage pipeline). This makes generation significantly faster and is the best choice for iteration and experimentation.

The full model uses the full-precision checkpoint and requires more inference steps (15–40), but can produce higher-quality output with finer detail and more nuanced motion. The full model path uses the LTXV Scheduler instead of manual sigmas, a multimodal guider with independent audio and video guidance parameters, and applies the distilled LoRA at a lower strength (0.2).

A workflow that includes both distilled and full model paths side by side is available in the example workflows.

Using LoRAs

LoRAs can be added to further customize the model’s output style, motion characteristics, or character appearance. Add a LoRALoader node to your workflow to apply:

  • Style LoRAs for artistic or visual aesthetics
  • Motion LoRAs for specific types of movement
  • Character LoRAs for consistent character appearance

See the LoRA guide for training and usage instructions.

Python

Image-to-Video generation is also available through the PyTorch API for programmatic use and custom pipelines. See the PyTorch API documentation for setup and usage.