Z-Image + Wan 2.2 Video Generation Workflow: From Text to High-Quality Video

May 9, 2026

Generate keyframes with Z-Image, animate with Wan 2.2 — a complete text-to-video pipeline in ComfyUI.


Why Z-Image + Wan 2.2?

Combination Advantage

Model         | Role             | Advantage
Z-Image Turbo | Image generation | 6B distilled model, sub-second inference, photorealistic quality
Wan 2.2       | Video generation | 14B MoE architecture, supports I2V and T2V, open-source local deployment

Core idea: Z-Image produces still images whose quality far exceeds the frames most video models generate in native T2V mode. Use Z-Image to create a high-quality keyframe, then feed it into Wan 2.2 for Image-to-Video (I2V) conversion. This typically gives significantly better results than direct T2V.

Workflow Comparison

Approach A (Direct T2V):  Text → Wan 2.2 T2V → Video (mediocre quality)
Approach B (This guide):   Text → Z-Image → High-quality keyframe → Wan 2.2 I2V → High-quality video

ComfyUI Setup

Prerequisites

# 1. Install ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt

# 2. Download Z-Image Turbo models
# qwen_3_4b.safetensors (text encoder)
# z_image_turbo_bf16.safetensors (main model)
# ae.safetensors (VAE encoder/decoder)

# 3. Download Wan 2.2 model
# wan2.2_i2v_14b.safetensors (Image-to-Video)
# or wan2.2_t2v_14b.safetensors (Text-to-Video)

# 4. Install required custom nodes
# ComfyUI-WanVideoWrapper
# ComfyUI-ZImage
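
Before building the graph, it can help to confirm that the downloaded files sit where ComfyUI will look for them. The sketch below is one way to check. The subfolder names (text_encoders, diffusion_models, vae) are assumptions based on a common ComfyUI layout, and the file names are copied from the list above, so adjust both to match your install.

```python
from pathlib import Path

# Assumed mapping of model files to ComfyUI model subfolders; adjust to your install.
EXPECTED_MODELS = {
    "text_encoders": ["qwen_3_4b.safetensors"],
    "diffusion_models": ["z_image_turbo_bf16.safetensors", "wan2.2_i2v_14b.safetensors"],
    "vae": ["ae.safetensors"],
}

def missing_models(comfy_root, expected=EXPECTED_MODELS):
    """Return relative paths of expected model files that are not present."""
    models_dir = Path(comfy_root) / "models"
    missing = []
    for subfolder, filenames in expected.items():
        for name in filenames:
            if not (models_dir / subfolder / name).is_file():
                missing.append(f"{subfolder}/{name}")
    return missing

if __name__ == "__main__":
    # "ComfyUI" is the directory created by the git clone step above.
    for path in missing_models("ComfyUI"):
        print("missing:", path)
```

An empty result means every file in the assumed layout was found; it does not verify that the files are complete or uncorrupted.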

Workflow Structure

[Text Prompt]
    ↓
┌──────────────────────────┐
│  Z-Image Turbo Branch     │
│  Prompt → KSampler(9 steps)│
│  → VAE Decode → Keyframe   │
└──────────┬───────────────┘
           ↓
┌──────────────────────────┐
│  Wan 2.2 I2V Branch       │
│  Keyframe + Prompt         │
│  → Get Image Size          │
│  → Video Latent (match size)│
│  → Wan 2.2 Sampler         │
│  → Video Decode → MP4      │
└──────────────────────────┘

Key Nodes

Z-Image Turbo Branch:

  1. CLIP Text Encode — Input positive prompt
  2. KSampler — 9 steps, sampler: euler, scheduler: normal
  3. VAE Decode — Decode latent to pixel image

Wan 2.2 I2V Branch:

  1. Get Image Size — Read Z-Image output dimensions
  2. Wan2.2 Video Latent — Convert keyframe to video latent with image size
  3. Wan2.2 Sampler — Generate video frame sequence (default 16 frames)
  4. Video Decode — Decode to video frames
  5. Save Image / Video Combine — Output video

Avoiding Size Mismatch

Wan 2.2 requires input dimensions to be multiples of 64:

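# Size alignment example: round each dimension down to the nearest multiple of 64
width, height = 1000, 720        # example keyframe size
width = (width // 64) * 64       # 960
height = (height // 64) * 64     # 704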

In ComfyUI, use the Get Image Size node to read the Z-Image output dimensions and pass them to Wan 2.2, so the video latent is created at the same size as the keyframe.
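
If the keyframe is not already a multiple of 64 on each side, rounding the numbers alone is not enough: the image itself has to be cropped or resized to the aligned size. Below is a minimal sketch of computing a center-crop box for that, assuming the multiple-of-64 rule stated above. The function name aligned_crop_box is made up for this example.

```python
def aligned_crop_box(width, height, multiple=64):
    """Return (left, top, right, bottom) of a centered crop whose sides are
    the largest multiples of `multiple` that fit inside the image."""
    new_w = (width // multiple) * multiple
    new_h = (height // multiple) * multiple
    if new_w == 0 or new_h == 0:
        raise ValueError(
            f"image {width}x{height} is smaller than {multiple}px on one side"
        )
    # Split the leftover pixels evenly between the two edges.
    left = (width - new_w) // 2
    top = (height - new_h) // 2
    return left, top, left + new_w, top + new_h

print(aligned_crop_box(1000, 720))  # (20, 8, 980, 712)
```

The tuple follows the (left, upper, right, lower) order that Pillow's Image.crop accepts, so it can be passed there directly if you preprocess keyframes outside ComfyUI.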


Prompt Strategy

Video-Friendly Prompts

The Wan 2.2 prompt should be consistent with (or closely related to) the Z-Image prompt that produced the keyframe:

# Z-Image prompt
A young woman in a red dress walking through a sunlit garden,
golden hour lighting, cinematic composition, shallow depth of field,
85mm lens, bokeh background

# Wan 2.2 prompt (maintain consistency)
A young woman in a red dress walking through a sunlit garden,
golden hour lighting, gentle walking motion, swaying flowers,
cinematic slow motion

Key difference: Wan 2.2 prompts need motion descriptions (walking motion, swaying, etc.), while Z-Image prompts focus on static image quality.
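
One way to keep the two prompts in sync is to derive the Wan 2.2 prompt from the Z-Image prompt programmatically. The sketch below is illustrative only: the helper name and the list of still-photography terms to drop are assumptions, not part of either model's tooling.

```python
# Terms that describe a still photo rather than motion (an assumed, editable list).
STATIC_TERMS = (
    "shallow depth of field",
    "85mm lens",
    "bokeh background",
    "cinematic composition",
)

def build_video_prompt(image_prompt, motion_terms, static_terms=STATIC_TERMS):
    """Derive a video prompt: keep scene descriptors, drop still-photo terms,
    then append motion descriptions."""
    parts = [p.strip() for p in image_prompt.replace("\n", " ").split(",")]
    kept = [p for p in parts if p and p.lower() not in static_terms]
    return ", ".join(kept + list(motion_terms))

image_prompt = """A young woman in a red dress walking through a sunlit garden,
golden hour lighting, cinematic composition, shallow depth of field,
85mm lens, bokeh background"""

print(build_video_prompt(
    image_prompt,
    ["gentle walking motion", "swaying flowers", "cinematic slow motion"],
))
```

With these inputs the output reproduces the Wan 2.2 prompt shown above as a single line. Keeping the subject and lighting terms identical in both prompts is the point; only the tail changes.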

Motion Control Tips

Motion type | Wan 2.2 prompt keywords                     | Effect
Slow pan    | slow panning camera, gentle movement        | Cinematic camera
Walking     | walking motion, natural stride              | Character animation
Wind        | wind blowing, hair flowing, leaves rustling | Natural dynamics
Water       | rippling water, wave motion                 | Liquid effects
Orbit       | orbiting camera, 360 degree rotation        | Surround view

Advanced: First + Last Frame Control

First + Last Frame (FFLF)

Wan 2.2 supports first + last frame control, enabling transitions from Z-Image scene A to scene B:

[Scene A prompt] → Z-Image → First frame
[Scene B prompt] → Z-Image → Last frame
    ↓
Wan 2.2 FFLF → Transition video from A to B

Typical Use Cases

  1. Day to Night: Daytime garden → Same scene at night
  2. Season Change: Spring blossoms → Autumn leaves
  3. Expression Change: Smiling → Surprised
  4. Product Showcase: Static product → Rotating display

ComfyUI FFLF Node Connections

Z-Image Sampler A → VAE Decode A → First frame
Z-Image Sampler B → VAE Decode B → Last frame
    ↓
Wan2.2 FFLF Video Latent (first frame + last frame)
    ↓
Wan2.2 Sampler
    ↓
Video Decode → Transition video

Parameter Tuning Guide

Frame Count Selection

Frames    | Duration (at 30 fps) | VRAM   | Use case
16 frames | ~0.5 sec             | ~8 GB  | Quick preview, short shots
32 frames | ~1 sec               | ~12 GB | Short video, social media
64 frames | ~2 sec               | ~20 GB | Complete shots, demos
96 frames | ~3 sec               | ~30 GB | High-quality demos
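
The duration column follows directly from duration = frames / fps. Here is a small sketch for checking other combinations, assuming the 30 fps playback rate used in the table (your video output node may be set to a different rate):

```python
import math

def clip_duration(frames, fps=30):
    """Clip length in seconds for a given frame count and playback rate."""
    return frames / fps

def frames_for_duration(seconds, fps=30):
    """Smallest frame count that covers at least `seconds` of playback."""
    return math.ceil(seconds * fps)

for frames in (16, 32, 64, 96):
    print(f"{frames} frames -> {clip_duration(frames):.2f} s")

print(frames_for_duration(3))  # 90
```

At a lower playback rate the same frame count lasts longer, so it is worth fixing the frame rate first and then choosing the frame count that fits your VRAM.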

Motion Intensity Control

Wan 2.2 motion intensity is controlled via:

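  • motion_bucket_id: 1-255, higher = more dramatic motion. This parameter name originates in Stable Video Diffusion; if your Wan 2.2 sampler node does not expose it, steer motion with the prompt keywords from Motion Control Tips instead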
  • Recommended starting value: 128 (medium motion)
  • Fine-tuning:
    • Not enough motion → increase by 10-20
    • Too aggressive → decrease by 10-20
    • Character animation → 80-100 (natural stride)
    • Scenery shots → 100-130 (wind, water, clouds)
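
The fine-tuning loop above amounts to nudging one number and keeping it in range. A minimal sketch, assuming your sampler node exposes a numeric motion setting on the 1-255 scale described here; the helper name and the default step of 15 (the midpoint of the suggested 10-20) are choices made for this example:

```python
MOTION_MIN, MOTION_MAX = 1, 255

def adjust_motion(value, too_static, step=15):
    """Nudge the motion value up (clip too static) or down (too aggressive),
    keeping it inside the 1-255 range."""
    delta = step if too_static else -step
    return max(MOTION_MIN, min(MOTION_MAX, value + delta))

value = 128                                     # recommended starting point
value = adjust_motion(value, too_static=False)  # first render was too aggressive
print(value)  # 113
```

Changing one setting per render and keeping the seed fixed makes it easier to attribute a difference in the output to the motion value rather than to sampling noise.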

Practical Cases

Case 1: Product Showcase Video

Product: Wireless Bluetooth earbuds

Steps:

  1. Z-Image generates white-background product photo
  2. Wan 2.2 I2V adds gentle rotation motion
  3. Output 3-second product showcase video

Z-Image Prompt:

Wireless earbuds in charging case,
professional product photography,
pure white background, studio lighting,
45 degree angle, minimal design

Wan 2.2 Prompt:

Wireless earbuds slowly rotating,
smooth 360 degree turn,
studio lighting, clean white background,
product showcase video

Case 2: Cinematic Landscape Short

Scene: Sunset beach

Steps:

  1. Z-Image generates sunset beach keyframe
  2. Wan 2.2 I2V adds ocean waves and cloud motion

Z-Image Prompt:

Sunset over ocean beach, dramatic sky with orange and purple clouds,
waves crashing on shore, palm trees silhouetted,
cinematic wide shot, golden hour,
shot on ARRI Alexa, anamorphic lens

Wan 2.2 Prompt:

Sunset ocean waves rolling in, clouds drifting slowly,
palm fronds swaying in wind, golden light flickering on water,
cinematic slow motion, anamorphic lens flare

Performance Optimization

VRAM Optimization Tips

  1. FP16 Inference: Wan 2.2 supports half precision, which roughly halves the memory needed for weights compared with FP32
  2. Tile Inference: Process large-resolution videos in tiles
  3. Model Offloading: Load Z-Image and Wan 2.2 sequentially so that both models are not held in VRAM at the same time

VRAM   | Supported config
8 GB   | Z-Image or Wan 2.2 separately, 16 frames
12 GB  | Full workflow, 16 frames, FP16
16 GB  | Full workflow, 32 frames, FP16
24 GB+ | Full workflow, 64+ frames, BF16
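
The table above can be turned into a simple lookup when scripting batch runs. This is a sketch only: the tiers are copied from the table, and the function name suggest_config is hypothetical.

```python
# Tiers copied from the table above: (minimum VRAM in GB, description),
# ordered from largest to smallest so the first match is the best fit.
VRAM_TIERS = [
    (24, "Full workflow, 64+ frames, BF16"),
    (16, "Full workflow, 32 frames, FP16"),
    (12, "Full workflow, 16 frames, FP16"),
    (8, "Z-Image or Wan 2.2 separately, 16 frames"),
]

def suggest_config(vram_gb):
    """Return the largest tier that fits in `vram_gb`, or None below 8 GB."""
    for min_gb, description in VRAM_TIERS:
        if vram_gb >= min_gb:
            return description
    return None

print(suggest_config(16))  # Full workflow, 32 frames, FP16
```

If PyTorch is installed, torch.cuda.get_device_properties(0).total_memory / 1024**3 reports the total VRAM of the first GPU in GB, which can be passed in as vram_gb.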

FAQ

Q: Can I use the exact same prompt for both Z-Image and Wan 2.2?

You can, but separately optimized prompts produce better results. Z-Image needs image quality descriptions; Wan 2.2 needs motion descriptions. Start with Z-Image's prompt, then add motion-related keywords for Wan 2.2.

Q: Video comes out blurry — why?

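  1. Check that the Z-Image keyframe resolution is high enough
  2. If the clip looks nearly static and soft, increase motion_bucket_id for more visible motion
  3. Try reducing Wan 2.2 sampling steps (over-sampling can introduce noise)
  4. If the blur appears during fast movement, lower the motion intensity instead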

Q: How do I maintain continuity across multiple video clips?

Use shared keyframes as connection points:

  • Video A's last frame = Video B's first frame
  • Or share a middle frame between two clips
  • Stitch together in video editing software
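
The first option can be planned up front when using first + last frame control: generate one keyframe per scene boundary, then pair neighbours. A minimal sketch, with hypothetical file names and a made-up helper name:

```python
def plan_clips(keyframes):
    """Pair consecutive keyframes so each clip starts on the frame the
    previous clip ended on."""
    if len(keyframes) < 2:
        raise ValueError("need at least two keyframes to define a clip")
    return list(zip(keyframes[:-1], keyframes[1:]))

# Hypothetical keyframe file names, one per scene boundary.
clips = plan_clips(["garden_day.png", "garden_dusk.png", "garden_night.png"])
for first, last in clips:
    print(f"{first} -> {last}")
```

Because adjacent clips share a boundary frame, you may want to drop that frame from one of the two clips when stitching so it does not play twice.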

Summary

The Z-Image + Wan 2.2 combination provides an open-source, locally deployable, high-quality text-to-video workflow:

  1. Z-Image Turbo generates high-quality keyframes (sub-second inference)
  2. Wan 2.2 handles Image-to-Video conversion (14B MoE architecture)
  3. ComfyUI orchestrates both models into an end-to-end pipeline
  4. Prompt strategy: Z-Image focuses on image quality, Wan 2.2 on motion description

This workflow is ideal for:

  • Ecommerce product showcase videos
  • Social media short-form content
  • Cinematic concept trailers
  • Personal creative projects

This guide is based on ComfyUI, Z-Image Turbo, and the Wan 2.2 I2V model.

Z-Image Team
