Z-Image + Wan 2.2 Video Generation Workflow: From Text to High-Quality Video
Generate keyframes with Z-Image, animate with Wan 2.2 — a complete text-to-video pipeline in ComfyUI.
Why Z-Image + Wan 2.2?
Combination Advantage
| Model | Role | Advantage |
|---|---|---|
| Z-Image Turbo | Image Generation | 6B distilled model, sub-second inference, photorealistic quality |
| Wan 2.2 | Video Generation | 14B MoE architecture, supports I2V and T2V, open-source local deployment |
Core idea: Z-Image produces still images whose quality exceeds what most video models achieve in native T2V mode. Use Z-Image to create a high-quality keyframe, then feed it into Wan 2.2 for Image-to-Video (I2V) conversion. The result is usually noticeably better than direct T2V.
Workflow Comparison
Approach A (Direct T2V): Text → Wan 2.2 T2V → Video (mediocre quality)
Approach B (This guide): Text → Z-Image → High-quality keyframe → Wan 2.2 I2V → High-quality video
ComfyUI Setup
Prerequisites
# 1. Install ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt
# 2. Download Z-Image Turbo models
# qwen_3_4b.safetensors (text encoder)
# z_image_turbo_bf16.safetensors (main model)
# ae.safetensors (VAE encoder/decoder)
# 3. Download Wan 2.2 model
# wan2.2_i2v_14b.safetensors (Image-to-Video)
# or wan2.2_t2v_14b.safetensors (Text-to-Video)
# 4. Install required custom nodes
# ComfyUI-WanVideoWrapper
# ComfyUI-ZImage
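Before building the workflow, it can help to confirm that the downloaded files sit where ComfyUI will look for them. The sketch below uses the file names listed above. The folder assignment (`text_encoders`, `diffusion_models`, `vae` under `models/`) is an assumption based on the usual ComfyUI layout, and `find_missing_models` is a hypothetical helper, so adjust both to your installation and node packs.

```python
from pathlib import Path

# Assumed mapping of file name -> subfolder under ComfyUI/models (adjust as needed).
EXPECTED_FILES = {
    "qwen_3_4b.safetensors": "text_encoders",
    "z_image_turbo_bf16.safetensors": "diffusion_models",
    "ae.safetensors": "vae",
    "wan2.2_i2v_14b.safetensors": "diffusion_models",
}

def find_missing_models(comfy_root, expected=EXPECTED_FILES):
    """Return relative paths of expected model files that are not present."""
    models_dir = Path(comfy_root) / "models"
    missing = []
    for filename, subfolder in expected.items():
        if not (models_dir / subfolder / filename).is_file():
            missing.append(f"{subfolder}/{filename}")
    return missing

# Example: find_missing_models("ComfyUI") returns a list such as
# ["vae/ae.safetensors"] when that file has not been downloaded yet.
```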
Workflow Structure
[Text Prompt]
↓
┌──────────────────────────┐
│ Z-Image Turbo Branch │
│ Prompt → KSampler(9 steps)│
│ → VAE Decode → Keyframe │
└──────────┬───────────────┘
↓
┌──────────────────────────┐
│ Wan 2.2 I2V Branch │
│ Keyframe + Prompt │
│ → Get Image Size │
│ → Video Latent (match size)│
│ → Wan 2.2 Sampler │
│ → Video Decode → MP4 │
└──────────────────────────┘
Key Nodes
Z-Image Turbo Branch:
- CLIP Text Encode — Input positive prompt
- KSampler — 9 steps, sampler: euler, scheduler: normal
- VAE Decode — Decode latent to pixel image
Wan 2.2 I2V Branch:
- Get Image Size — Read Z-Image output dimensions
- Wan2.2 Video Latent — Convert keyframe to video latent with image size
- Wan2.2 Sampler — Generate video frame sequence (default 16 frames)
- Video Decode — Decode to video frames
- Save Image / Video Combine — Output video
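If you drive ComfyUI from a script, the KSampler settings above can be applied to a workflow exported in API format, where each node is a dictionary with a `class_type` and an `inputs` mapping. The following is a minimal sketch under that assumption; `apply_z_image_sampler_settings` and the two-node `example` workflow are hypothetical stand-ins, not part of any node pack.

```python
import copy

def apply_z_image_sampler_settings(workflow, steps=9, sampler_name="euler", scheduler="normal"):
    """Return a copy of an API-format workflow with KSampler settings applied."""
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        if node.get("class_type") == "KSampler":
            node["inputs"].update(steps=steps, sampler_name=sampler_name, scheduler=scheduler)
    return patched

# Tiny stand-in workflow; a real export contains many more nodes and inputs.
example = {
    "3": {"class_type": "KSampler",
          "inputs": {"steps": 20, "sampler_name": "dpmpp_2m", "scheduler": "karras"}},
    "8": {"class_type": "VAEDecode", "inputs": {}},
}
patched = apply_z_image_sampler_settings(example)
print(patched["3"]["inputs"])
```

Note that this patches every node whose class is `KSampler`. If the Wan 2.2 branch also uses a core KSampler node, select nodes by ID instead.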
Avoiding Size Mismatch
Wan 2.2 requires input dimensions to be multiples of 64:
# Size alignment example: round each dimension down to the nearest multiple of 64
width = (width // 64) * 64
height = (height // 64) * 64
In ComfyUI, use the Get Image Size node to read the Z-Image output dimensions and pass them to Wan 2.2.
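For scripts that prepare keyframes outside ComfyUI, the alignment rule can be wrapped in a small helper. This is a sketch only: `align_down` and `aligned_size` are hypothetical names, and the multiple of 64 is taken from the requirement stated above.

```python
def align_down(value, multiple=64):
    """Round value down to the nearest multiple, but never below one multiple."""
    return max(multiple, (value // multiple) * multiple)

def aligned_size(width, height, multiple=64):
    """Return (width, height) with both dimensions aligned down."""
    return align_down(width, multiple), align_down(height, multiple)

print(aligned_size(1024, 1024))  # (1024, 1024), already aligned
print(aligned_size(1080, 1920))  # (1024, 1920), width rounded down
```

Rounding down means the keyframe has to be cropped or resized to the new size. Rounding up would require padding the image instead.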
Prompt Strategy
Video-Friendly Prompts
Video generation prompts should be consistent with (or closely related to) image prompts:
# Z-Image prompt
A young woman in a red dress walking through a sunlit garden,
golden hour lighting, cinematic composition, shallow depth of field,
85mm lens, bokeh background
# Wan 2.2 prompt (maintain consistency)
A young woman in a red dress walking through a sunlit garden,
golden hour lighting, gentle walking motion, swaying flowers,
cinematic slow motion
Key difference: Wan 2.2 prompts need motion descriptions (walking motion, swaying, etc.), while Z-Image prompts focus on static image quality.
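One way to keep the two prompts consistent is to write the shared scene description once and append model-specific terms to it. The helper below is a hypothetical illustration of that idea using the example prompts above; it is plain string handling, not a feature of either model.

```python
def build_prompt_pair(scene, image_terms, motion_terms):
    """Build (image_prompt, video_prompt) that share the same scene description."""
    image_prompt = ", ".join([scene, *image_terms])
    video_prompt = ", ".join([scene, *motion_terms])
    return image_prompt, video_prompt

scene = "A young woman in a red dress walking through a sunlit garden, golden hour lighting"
image_prompt, video_prompt = build_prompt_pair(
    scene,
    image_terms=["cinematic composition", "shallow depth of field", "85mm lens", "bokeh background"],
    motion_terms=["gentle walking motion", "swaying flowers", "cinematic slow motion"],
)
print(image_prompt)
print(video_prompt)
```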
Motion Control Tips
| Motion Type | Wan 2.2 Prompt Keywords | Effect |
|---|---|---|
| Slow pan | slow panning camera, gentle movement | Cinematic camera |
| Walking | walking motion, natural stride | Character animation |
| Wind | wind blowing, hair flowing, leaves rustling | Natural dynamics |
| Water | rippling water, wave motion | Liquid effects |
| Orbit | orbiting camera, 360 degree rotation | Surround view |
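The keyword table can be kept as a lookup so motion terms are reused consistently across clips. This sketch copies the keywords from the table; `MOTION_KEYWORDS` and `add_motion` are hypothetical names.

```python
# Keywords copied from the motion control table.
MOTION_KEYWORDS = {
    "slow pan": ["slow panning camera", "gentle movement"],
    "walking": ["walking motion", "natural stride"],
    "wind": ["wind blowing", "hair flowing", "leaves rustling"],
    "water": ["rippling water", "wave motion"],
    "orbit": ["orbiting camera", "360 degree rotation"],
}

def add_motion(prompt, *motion_types):
    """Append the keywords for each requested motion type to a prompt."""
    keywords = []
    for motion_type in motion_types:
        keywords.extend(MOTION_KEYWORDS[motion_type.lower()])
    return ", ".join([prompt, *keywords])

print(add_motion("Sunset over ocean beach", "water", "wind"))
```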
Advanced: First + Last Frame Control
First + Last Frame (FFLF)
Wan 2.2 supports first + last frame control, enabling transitions from Z-Image scene A to scene B:
[Scene A prompt] → Z-Image → First frame
[Scene B prompt] → Z-Image → Last frame
↓
Wan 2.2 FFLF → Transition video from A to B
Typical Use Cases
- Day to Night: Daytime garden → Same scene at night
- Season Change: Spring blossoms → Autumn leaves
- Expression Change: Smiling → Surprised
- Product Showcase: Static product → Rotating display
ComfyUI FFLF Node Connections
Z-Image Sampler A → VAE Decode A → First frame
Z-Image Sampler B → VAE Decode B → Last frame
↓
Wan2.2 FFLF Video Latent (first frame + last frame)
↓
Wan2.2 Sampler
↓
Video Decode → Transition video
Parameter Tuning Guide
Frame Count Selection
| Frames | Duration (30fps) | VRAM | Use Case |
|---|---|---|---|
| 16 frames | ~0.5 sec | ~8GB | Quick preview, short shots |
| 32 frames | ~1 sec | ~12GB | Short video, social media |
| 64 frames | ~2 sec | ~20GB | Complete shots, demos |
| 96 frames | ~3 sec | ~30GB | High-quality demos |
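The duration column follows directly from frames divided by frame rate. The sketch below reproduces that arithmetic and inverts it to pick a frame count for a target length. The 30 fps rate comes from the table; the step of 16 frames is an assumption drawn from the table rows, and your sampler node may impose its own frame-count constraints.

```python
import math

def clip_duration(frames, fps=30):
    """Clip length in seconds for a given frame count."""
    return frames / fps

def frames_for_duration(seconds, fps=30, multiple=16):
    """Smallest frame count, in steps of `multiple`, that covers `seconds`."""
    return math.ceil(seconds * fps / multiple) * multiple

for frames in (16, 32, 64, 96):
    print(frames, "frames ->", round(clip_duration(frames), 2), "sec")

print(frames_for_duration(3))  # 96, matching the 3-second row
```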
Motion Intensity Control
Wan 2.2 motion intensity is controlled via:
- motion_bucket_id: 1-255, higher = more dramatic motion
- Recommended starting value: 128 (medium motion)
- Fine-tuning:
  - Not enough motion → increase by 10-20
  - Too aggressive → decrease by 10-20
  - Character animation → 80-100 (natural stride)
  - Scenery shots → 100-130 (wind, water, clouds)
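The tuning loop described above amounts to nudging the value in steps of 10-20 and keeping it within 1-255. Here is a minimal sketch of that loop; `adjust_motion` and its feedback labels are hypothetical, and the exact parameter name exposed by your sampler node may differ.

```python
def adjust_motion(current, feedback, step=15):
    """Nudge a motion value up or down and clamp it to the 1-255 range."""
    if feedback == "too_little":
        current += step
    elif feedback == "too_much":
        current -= step
    elif feedback != "ok":
        raise ValueError(f"unknown feedback: {feedback!r}")
    return max(1, min(255, current))

value = 128                               # recommended starting point
value = adjust_motion(value, "too_much")  # motion looked too aggressive
print(value)                              # 113
```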
Practical Cases
Case 1: Product Showcase Video
Product: Wireless Bluetooth earbuds
Steps:
- Z-Image generates white-background product photo
- Wan 2.2 I2V adds gentle rotation motion
- Output 3-second product showcase video
Z-Image Prompt:
Wireless earbuds in charging case,
professional product photography,
pure white background, studio lighting,
45 degree angle, minimal design
Wan 2.2 Prompt:
Wireless earbuds slowly rotating,
smooth 360 degree turn,
studio lighting, clean white background,
product showcase video
Case 2: Cinematic Landscape Short
Scene: Sunset beach
Steps:
- Z-Image generates sunset beach keyframe
- Wan 2.2 I2V adds ocean waves and cloud motion
Z-Image Prompt:
Sunset over ocean beach, dramatic sky with orange and purple clouds,
waves crashing on shore, palm trees silhouetted,
cinematic wide shot, golden hour,
shot on ARRI Alexa, anamorphic lens
Wan 2.2 Prompt:
Sunset ocean waves rolling in, clouds drifting slowly,
palm fronds swaying in wind, golden light flickering on water,
cinematic slow motion, anamorphic lens flare
Performance Optimization
VRAM Optimization Tips
- FP16 Inference: Wan 2.2 supports half-precision, which roughly halves VRAM use compared with FP32
- Tile Inference: Process high-resolution videos in tiles
- Model Offloading: Load Z-Image and Wan 2.2 sequentially so both models are never in VRAM at the same time
Recommended Configurations
| VRAM | Supported Config |
|---|---|
| 8GB | Z-Image or Wan 2.2 separately, 16 frames |
| 12GB | Full workflow, 16 frames, FP16 |
| 16GB | Full workflow, 32 frames, FP16 |
| 24GB+ | Full workflow, 64+ frames, BF16 |
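The table can be turned into a simple lookup that returns the highest tier a given card supports. The sketch copies the tiers from the table; `recommend_config` is a hypothetical helper and does not measure actual VRAM use.

```python
# (minimum VRAM in GB, supported config), highest tier first.
CONFIGS = [
    (24, "Full workflow, 64+ frames, BF16"),
    (16, "Full workflow, 32 frames, FP16"),
    (12, "Full workflow, 16 frames, FP16"),
    (8, "Z-Image or Wan 2.2 separately, 16 frames"),
]

def recommend_config(vram_gb):
    """Return the best matching config, or None below the smallest tier."""
    for minimum, description in CONFIGS:
        if vram_gb >= minimum:
            return description
    return None

print(recommend_config(16))  # Full workflow, 32 frames, FP16
```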
FAQ
Q: Can I use the exact same prompt for both Z-Image and Wan 2.2?
You can, but separately optimized prompts produce better results. Z-Image needs image quality descriptions; Wan 2.2 needs motion descriptions. Start with Z-Image's prompt, then add motion-related keywords for Wan 2.2.
Q: Why does the video come out blurry?
- Check that the Z-Image keyframe resolution is sufficient
- If motion is barely visible, increase motion_bucket_id
- Reduce Wan 2.2 sampling steps (over-sampling introduces noise)
- If fast motion is smearing the frames, try lowering motion intensity
Q: How do I maintain continuity across multiple video clips?
Use shared keyframes as connection points:
- Video A's last frame = Video B's first frame
- Or share a middle frame between two clips
- Stitch together in video editing software
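The shared-keyframe approach can be planned as a list of (first frame, last frame) pairs, one per clip, so that each clip ends where the next begins. This is a sketch of that pairing only; the file names are hypothetical and each pair would still be rendered with the first + last frame workflow.

```python
def plan_clips(keyframes):
    """Pair consecutive keyframes into (first_frame, last_frame) clips."""
    if len(keyframes) < 2:
        raise ValueError("need at least two keyframes")
    return list(zip(keyframes, keyframes[1:]))

clips = plan_clips(["garden_day.png", "garden_dusk.png", "garden_night.png"])
for first, last in clips:
    print(first, "->", last)
```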
Summary
The Z-Image + Wan 2.2 combination provides an open-source, locally deployable, high-quality text-to-video workflow:
- Z-Image Turbo generates high-quality keyframes (sub-second inference)
- Wan 2.2 handles Image-to-Video conversion (14B MoE architecture)
- ComfyUI orchestrates both models into an end-to-end pipeline
- Prompt strategy: Z-Image focuses on image quality, Wan 2.2 on motion description
This workflow is ideal for:
- Ecommerce product showcase videos
- Social media short-form content
- Cinematic concept trailers
- Personal creative projects
This guide is based on ComfyUI, Z-Image Turbo, and the Wan 2.2 I2V model.