Z-Image + Wan 2.2 Video Generation Workflow: From Text to High-Quality Video

Generate keyframes with Z-Image, animate with Wan 2.2 — a complete text-to-video pipeline in ComfyUI.

Why Z-Image + Wan 2.2?

Combination Advantage

Model	Role	Advantage
Z-Image Turbo	Image Generation	6B distilled model, sub-second inference, photorealistic quality
Wan 2.2	Video Generation	14B MoE architecture, supports I2V and T2V, open-source local deployment

Core idea: Z-Image produces still images of noticeably higher quality than the frames most video models generate in native T2V mode. Use Z-Image to create a high-quality keyframe, then feed it into Wan 2.2 for Image-to-Video (I2V) conversion. The result is typically better than direct T2V.

Workflow Comparison

Approach A (Direct T2V):  Text → Wan 2.2 T2V → Video (mediocre quality)
Approach B (This guide):   Text → Z-Image → High-quality keyframe → Wan 2.2 I2V → High-quality video

ComfyUI Setup

Prerequisites

# 1. Install ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
pip install -r requirements.txt

# 2. Download Z-Image Turbo models
# qwen_3_4b.safetensors (text encoder)
# z_image_turbo_bf16.safetensors (main model)
# ae.safetensors (VAE encoder/decoder)

# 3. Download Wan 2.2 model
# wan2.2_i2v_14b.safetensors (Image-to-Video)
# or wan2.2_t2v_14b.safetensors (Text-to-Video)

# 4. Install required custom nodes
# ComfyUI-WanVideoWrapper
# ComfyUI-ZImage
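
Before building the workflow, it can help to confirm that the downloaded files are where ComfyUI will look for them. The sketch below is only an illustration: the file names come from the list above, while the subfolder layout (`models/text_encoders`, `models/diffusion_models`, `models/vae`) and the helper name `missing_models` are assumptions that may not match your install or node pack.

```python
from pathlib import Path

# Assumed folder layout; adjust to match where your node pack expects each file.
EXPECTED_MODELS = {
    "text_encoders": ["qwen_3_4b.safetensors"],
    "diffusion_models": [
        "z_image_turbo_bf16.safetensors",
        "wan2.2_i2v_14b.safetensors",
    ],
    "vae": ["ae.safetensors"],
}


def missing_models(comfy_root):
    """Return 'folder/file' entries that are not present under <comfy_root>/models."""
    models_dir = Path(comfy_root) / "models"
    missing = []
    for folder, names in EXPECTED_MODELS.items():
        for name in names:
            if not (models_dir / folder / name).is_file():
                missing.append(f"{folder}/{name}")
    return missing


if __name__ == "__main__":
    # Run from the directory that contains the ComfyUI checkout.
    for entry in missing_models("ComfyUI"):
        print("missing:", entry)
```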

Workflow Structure

[Text Prompt]
    ↓
┌──────────────────────────┐
│  Z-Image Turbo Branch     │
│  Prompt → KSampler(9 steps)│
│  → VAE Decode → Keyframe   │
└──────────┬───────────────┘
           ↓
┌──────────────────────────┐
│  Wan 2.2 I2V Branch       │
│  Keyframe + Prompt         │
│  → Get Image Size          │
│  → Video Latent (match size)│
│  → Wan 2.2 Sampler         │
│  → Video Decode → MP4      │
└──────────────────────────┘

Key Nodes

Z-Image Turbo Branch:

CLIP Text Encode — Input positive prompt
KSampler — 9 steps, sampler: euler, scheduler: normal
VAE Decode — Decode latent to pixel image

Wan 2.2 I2V Branch:

Get Image Size — Read Z-Image output dimensions
Wan2.2 Video Latent — Convert keyframe to video latent with image size
Wan2.2 Sampler — Generate video frame sequence (default 16 frames)
Video Decode — Decode to video frames
Save Image / Video Combine — Output video

Avoiding Size Mismatch

Keep Wan 2.2 input dimensions at multiples of 64 to avoid size-mismatch errors:

# Size alignment example: round each dimension down to the nearest multiple of 64
width = (width // 64) * 64
height = (height // 64) * 64

In ComfyUI, use the Get Image Size node to read the Z-Image output dimensions and pass them to the Wan 2.2 video latent node.
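
Rounding down changes the frame size, so the keyframe itself usually has to be cropped or resized to match. As a minimal sketch of the arithmetic (the helper names `aligned_size` and `center_crop_box` are hypothetical, and the multiple of 64 is the value used in this guide), the following computes the aligned size and a centered crop box in the `(left, top, right, bottom)` order that Pillow's `Image.crop` accepts:

```python
def aligned_size(width, height, multiple=64):
    """Round both dimensions down to the nearest multiple."""
    if width < multiple or height < multiple:
        raise ValueError(f"image must be at least {multiple}x{multiple} pixels")
    return (width // multiple) * multiple, (height // multiple) * multiple


def center_crop_box(width, height, multiple=64):
    """Return (left, top, right, bottom) for a centered crop to the aligned size."""
    new_w, new_h = aligned_size(width, height, multiple)
    left = (width - new_w) // 2
    top = (height - new_h) // 2
    return left, top, left + new_w, top + new_h


if __name__ == "__main__":
    print(aligned_size(1000, 600))     # (960, 576)
    print(center_crop_box(1000, 600))  # (20, 12, 980, 588)
```

Cropping from the center keeps the subject framed as in the original keyframe and avoids the stretching that a plain resize to the aligned size would introduce.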

Prompt Strategy

Video-Friendly Prompts

Video generation prompts should be consistent with (or closely related to) image prompts:

# Z-Image prompt
A young woman in a red dress walking through a sunlit garden,
golden hour lighting, cinematic composition, shallow depth of field,
85mm lens, bokeh background

# Wan 2.2 prompt (maintain consistency)
A young woman in a red dress walking through a sunlit garden,
golden hour lighting, gentle walking motion, swaying flowers,
cinematic slow motion

Key difference: Wan 2.2 prompts need motion descriptions (walking motion, swaying, etc.), while Z-Image prompts focus on static image quality.

Motion Control Tips

Motion Type	Wan 2.2 Prompt Keywords	Effect
Slow pan	`slow panning camera`, `gentle movement`	Cinematic camera
Walking	`walking motion`, `natural stride`	Character animation
Wind	`wind blowing`, `hair flowing`, `leaves rustling`	Natural dynamics
Water	`rippling water`, `wave motion`	Liquid effects
Orbit	`orbiting camera`, `360 degree rotation`	Surround view
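
When generating many clips, the keyword table can be kept as data and appended to the Z-Image prompt automatically. This is a small sketch, not part of any ComfyUI API: `MOTION_KEYWORDS` and `build_video_prompt` are hypothetical names, and the keywords are copied from the table above.

```python
from __future__ import annotations

# Motion keywords from the table, keyed by motion type.
MOTION_KEYWORDS = {
    "slow pan": ["slow panning camera", "gentle movement"],
    "walking": ["walking motion", "natural stride"],
    "wind": ["wind blowing", "hair flowing", "leaves rustling"],
    "water": ["rippling water", "wave motion"],
    "orbit": ["orbiting camera", "360 degree rotation"],
}


def build_video_prompt(image_prompt: str, motion_types: list[str]) -> str:
    """Append motion keywords for each requested motion type to the image prompt."""
    parts = [image_prompt.strip().rstrip(",")]
    for motion in motion_types:
        key = motion.lower()
        if key not in MOTION_KEYWORDS:
            raise KeyError(f"unknown motion type: {motion}")
        parts.extend(MOTION_KEYWORDS[key])
    return ", ".join(parts)


if __name__ == "__main__":
    print(build_video_prompt("A sunlit garden, golden hour lighting", ["wind", "slow pan"]))
```

Starting from the image prompt keeps the subject and lighting consistent between the keyframe and the video, and the appended keywords supply the motion description that Wan 2.2 needs.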

Advanced: First + Last Frame Control

First + Last Frame (FFLF)

Wan 2.2 supports first + last frame control, enabling transitions from Z-Image scene A to scene B:

[Scene A prompt] → Z-Image → First frame
[Scene B prompt] → Z-Image → Last frame
    ↓
Wan 2.2 FFLF → Transition video from A to B

Typical Use Cases

Day to Night: Daytime garden → Same scene at night
Season Change: Spring blossoms → Autumn leaves
Expression Change: Smiling → Surprised
Product Showcase: Static product → Rotating display

ComfyUI FFLF Node Connections

Z-Image Sampler A → VAE Decode A → First frame
Z-Image Sampler B → VAE Decode B → Last frame
    ↓
Wan2.2 FFLF Video Latent (first frame + last frame)
    ↓
Wan2.2 Sampler
    ↓
Video Decode → Transition video
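
For a sequence of more than two scenes, each neighbouring pair of keyframes becomes one FFLF job. The sketch below only plans the pairs; `fflf_segments` is a hypothetical helper, and the keyframes can be file paths, prompts, or any other identifiers.

```python
def fflf_segments(keyframes):
    """Pair consecutive keyframes as (first_frame, last_frame) transition jobs."""
    if len(keyframes) < 2:
        raise ValueError("need at least two keyframes for a transition")
    return list(zip(keyframes, keyframes[1:]))


if __name__ == "__main__":
    for first, last in fflf_segments(["day.png", "dusk.png", "night.png"]):
        print(f"transition: {first} -> {last}")
```

Because each segment ends on the frame that the next one starts with, the resulting clips can be joined without a visible jump.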

Parameter Tuning Guide

Frame Count Selection

Frames	Duration (30fps)	VRAM	Use Case
16 frames	~0.5 sec	~8GB	Quick preview, short shots
32 frames	~1 sec	~12GB	Short video, social media
64 frames	~2 sec	~20GB	Complete shots, demos
96 frames	~3 sec	~30GB	High-quality demos
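
The durations in the table follow from frames divided by frame rate. As a quick sketch of that arithmetic (the function names are hypothetical, and rounding up to a step of 16 frames is an assumption taken from the row spacing in the table, not a documented model constraint):

```python
import math


def clip_duration(frames, fps=30.0):
    """Clip length in seconds for a given frame count."""
    return frames / fps


def frames_for_duration(seconds, fps=30.0, step=16):
    """Smallest multiple of `step` frames that covers the requested duration."""
    return math.ceil(seconds * fps / step) * step


if __name__ == "__main__":
    print(round(clip_duration(64), 2))  # 2.13
    print(frames_for_duration(3))       # 96
```

If you export at a different frame rate, pass it as `fps`; the same frame count plays back for longer at a lower rate.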

Motion Intensity Control

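Motion intensity is controlled via a motion-strength input on the sampler or conditioning node:

motion_bucket_id: 1-255, higher = more dramatic motion. The parameter name originates in Stable Video Diffusion, and not every Wan 2.2 node pack exposes it, so check your node's inputs for the equivalent setting.
Recommended starting value: 128 (medium motion)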
Fine-tuning:
- Not enough motion → increase by 10-20
- Too aggressive → decrease by 10-20
- Character animation → 80-100 (natural stride)
- Scenery shots → 100-130 (wind, water, clouds)
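
The tuning loop above can be written down as a small helper, assuming your node exposes a motion value in the 1-255 range described here. `adjust_motion` and its feedback labels are hypothetical; the default step of 15 sits in the middle of the suggested 10-20 adjustment.

```python
def adjust_motion(current, feedback, step=15):
    """Nudge the motion value after reviewing a test render, clamped to 1-255."""
    if feedback == "too_static":
        current += step
    elif feedback == "too_aggressive":
        current -= step
    elif feedback != "ok":
        raise ValueError(f"unknown feedback: {feedback}")
    return max(1, min(255, current))


if __name__ == "__main__":
    value = 128  # recommended starting value
    value = adjust_motion(value, "too_static")
    print(value)  # 143
```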

Practical Cases

Case 1: Product Showcase Video

Product: Wireless Bluetooth earbuds

Steps:

Z-Image generates white-background product photo
Wan 2.2 I2V adds gentle rotation motion
Output 3-second product showcase video

Z-Image Prompt:

Wireless earbuds in charging case,
professional product photography,
pure white background, studio lighting,
45 degree angle, minimal design

Wan 2.2 Prompt:

Wireless earbuds slowly rotating,
smooth 360 degree turn,
studio lighting, clean white background,
product showcase video

Case 2: Cinematic Landscape Short

Scene: Sunset beach

Steps:

Z-Image generates sunset beach keyframe
Wan 2.2 I2V adds ocean waves and cloud motion

Z-Image Prompt:

Sunset over ocean beach, dramatic sky with orange and purple clouds,
waves crashing on shore, palm trees silhouetted,
cinematic wide shot, golden hour,
shot on ARRI Alexa, anamorphic lens

Wan 2.2 Prompt:

Sunset ocean waves rolling in, clouds drifting slowly,
palm fronds swaying in wind, golden light flickering on water,
cinematic slow motion, anamorphic lens flare

Performance Optimization

VRAM Optimization Tips

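FP16 inference: Wan 2.2 supports half precision, which roughly halves weight memory compared with FP32
Tiled inference: process high-resolution videos in tiles
Model offloading: load Z-Image and Wan 2.2 one after the other so both are never resident in VRAM at the same time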

Recommended Configurations

VRAM	Supported Config
8GB	Z-Image or Wan 2.2 separately, 16 frames
12GB	Full workflow, 16 frames, FP16
16GB	Full workflow, 32 frames, FP16
24GB+	Full workflow, 64+ frames, BF16
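
To pick a starting point programmatically, the table can be encoded as a lookup. This is a sketch using only the values from the table above; `recommend_config` is a hypothetical helper, and the precision for the 8GB row is left unset because the table does not state one.

```python
# (minimum VRAM in GB, frames, precision, full workflow in one run), largest tier first
CONFIG_TIERS = [
    (24, 64, "bf16", True),
    (16, 32, "fp16", True),
    (12, 16, "fp16", True),
    (8, 16, None, False),  # run Z-Image and Wan 2.2 separately
]


def recommend_config(vram_gb):
    """Return the largest tier that fits the available VRAM, or None below 8GB."""
    for min_vram, frames, precision, full_workflow in CONFIG_TIERS:
        if vram_gb >= min_vram:
            return {
                "frames": frames,
                "precision": precision,
                "full_workflow": full_workflow,
            }
    return None


if __name__ == "__main__":
    print(recommend_config(16))
```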

FAQ

Q: Can I use the exact same prompt for both Z-Image and Wan 2.2?

You can, but separately optimized prompts produce better results. Z-Image needs image quality descriptions; Wan 2.2 needs motion descriptions. Start with Z-Image's prompt, then add motion-related keywords for Wan 2.2.

Q: Video comes out blurry — why?

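Check that the Z-Image keyframe resolution is high enough; a low-resolution input limits video sharpness
If motion is barely visible, raise motion_bucket_id in small steps
If fast-moving regions smear, lower the motion intensity instead
Try fewer Wan 2.2 sampling steps if the output looks noisy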

Q: How to maintain continuity across multiple video clips?

Use shared keyframes as connection points:

Video A's last frame = Video B's first frame
Or share a middle frame between two clips
Stitch together in video editing software
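
When clips share a boundary frame, that frame appears twice if the clips are simply concatenated, which shows up as a brief stutter. A minimal sketch of the stitching step, assuming each clip is a list of frames and every later clip starts with the previous clip's last frame (`stitch_clips` is a hypothetical helper):

```python
def stitch_clips(clips):
    """Concatenate clips, dropping the duplicated first frame of each later clip."""
    stitched = []
    for index, clip in enumerate(clips):
        if not clip:
            raise ValueError("clips must not be empty")
        stitched.extend(clip if index == 0 else clip[1:])
    return stitched


if __name__ == "__main__":
    clip_a = ["a0", "a1", "shared"]
    clip_b = ["shared", "b1", "b2"]
    print(stitch_clips([clip_a, clip_b]))
```

Most video editors can do the same by trimming one frame at each join.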

Summary

The Z-Image + Wan 2.2 combination provides an open-source, locally deployable, high-quality text-to-video workflow:

Z-Image Turbo generates high-quality keyframes (sub-second inference)
Wan 2.2 handles Image-to-Video conversion (14B MoE architecture)
ComfyUI orchestrates both models into an end-to-end pipeline
Prompt strategy: Z-Image focuses on image quality, Wan 2.2 on motion description

This workflow is ideal for:

Ecommerce product showcase videos
Social media short-form content
Cinematic concept trailers
Personal creative projects

This guide is based on ComfyUI with the Z-Image Turbo and Wan 2.2 I2V models.
