Z-Image Video Generation Workflow: Complete Guide to Z-Video + ControlNet + Wan 2.2
Published: June 9, 2026
Author: Z-Image Tech Blog
Read time: ~12 minutes
Keywords: z-image video generation, Z-Video, Wan 2.2, ControlNet, text-to-video workflow
Introduction
After groundbreaking advances in AI image generation, video generation has become the next frontier. Z-Image, an open-source image generation model, can serve as the first stage of a complete text-to-video pipeline when paired with the Wan 2.2 image-to-video model. This guide walks through building a Z-Image + Wan 2.2 + ControlNet video generation workflow in ComfyUI, from text prompt to finished video.
Why Combine Z-Image with Wan 2.2?
Z-Image Core Advantages
Z-Image (especially the Z-Image Turbo variant) excels in text-to-image generation:
- High-quality image output: Supports resolutions up to 2K with rich details and accurate colors
- Turbo variant: Generates quality images in just 8 steps, 5-10x faster inference
- ControlNet support: Native ControlNet Union multi-control support for precise pose, depth, and edge control
- LoRA compatibility: Custom style LoRA support for training proprietary visual styles
- Open source: Fully available on HuggingFace, supporting local deployment and commercial use
Wan 2.2 Image-to-Video Capabilities
Wan 2.2 is currently one of the most powerful open-source image-to-video models:
- Frame consistency: Generated videos maintain visual consistency without flickering or abrupt changes
- Natural motion: Supports multiple motion modes — pan, zoom, rotate, subject movement
- Adaptive resolution: Automatically matches input image resolution, avoiding mismatch artifacts
- Audio support: Optional background music or sound effects for enhanced video completeness
Core Workflow Logic
Text Prompt → Z-Image Keyframe Generation → Wan 2.2 Image-to-Video → Dynamic Video Output
The combination advantages:
- Precise control: Z-Image's ControlNet ensures keyframes match expected composition
- Quality stacking: Both models perform optimally in their respective domains
- Flexible expansion: Intermediate processing steps can be inserted (Super Resolution, style transfer)
Environment Setup
1. ComfyUI Installation
# Install ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
# Launch
python main.py
2. Model Downloads
# Z-Image Turbo (recommended)
# Download from HuggingFace:
# https://huggingface.co/Tongyi-MAI/Z-Image-Turbo
# Wan 2.2 Image-to-Video Model
# Download from HuggingFace:
# https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B
# ControlNet Models
# Z-Image ControlNet Union
# https://huggingface.co/Tongyi-MAI/Z-Image-ControlNet-Union
3. Custom Node Installation
cd ComfyUI/custom_nodes
# ComfyUI-VideoHelperSuite (video processing)
git clone https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite
pip install -r ComfyUI-VideoHelperSuite/requirements.txt
# ComfyUI-WanVideoWrapper (Wan 2.2 wrapper)
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper
# Restart ComfyUI to load new nodes
4. GPU Requirements
| Configuration | Minimum VRAM | Recommended VRAM | Notes |
|---|---|---|---|
| Z-Image Turbo images only | 4GB | 8GB | 8-step inference |
| Wan 2.2 14B I2V | 16GB | 24GB | 14B parameter model |
| Full workflow | 16GB | 24GB | Both chained |
| Low VRAM option | 8GB | — | FP8 quantization |
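When scripting the setup, the table above can be expressed as a small lookup. The sketch below is only an illustration: `pick_profile` and its profile names are hypothetical, and the thresholds are simply the minimum and recommended VRAM figures from the table, not measured values.

```python
def pick_profile(vram_gb: float) -> str:
    """Map available VRAM in GB to a row of the table above (hypothetical helper)."""
    if vram_gb >= 24:
        return "full-workflow-recommended"  # both models chained, recommended VRAM
    if vram_gb >= 16:
        return "full-workflow-minimum"      # both models chained, minimum VRAM
    if vram_gb >= 8:
        return "low-vram-fp8"               # FP8-quantized Wan 2.2
    if vram_gb >= 4:
        return "zimage-turbo-images-only"   # keyframes only, no video stage
    return "unsupported"


for gb in (4, 8, 16, 24):
    print(f"{gb} GB -> {pick_profile(gb)}")
```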
Core Workflow Construction
Workflow Architecture Overview
[CLIP Text Encode] → [Z-Image KSampler] → [Get Image Size]
                            ↓                    ↓
                       [Save Image]        [Wan2.2 Load]
                                                 ↓
                                           [Wan2.2 I2V]
                                                 ↓
                                          [Video Combine]
                                                 ↓
                                          [VHS Video Save]
Step 1: Z-Image Turbo Keyframe Generation
# ComfyUI workflow node configuration (JSON snippet)
{
"clip_text_encode": {
"inputs": {
"text": "a futuristic cityscape at sunset, cinematic lighting, 4K quality",
"clip": ["CLIP Load", 0]
}
},
"zimage_k_sampler": {
"inputs": {
"model": ["Z-Image Turbo Load", 0],
"positive": ["clip_text_encode", 0],
"negative": ["clip_text_encode_neg", 0],
"steps": 8,
"cfg": 1.5,
"sampler_name": "euler",
"scheduler": "normal",
"width": 1024,
"height": 576,
"seed": 42
}
}
}
Key Parameter Notes:
- Steps: Z-Image Turbo recommends 4-8 steps. Fewer steps run faster; 8 steps is the balance point between speed and quality
- CFG Scale: Z-Image Turbo recommends 1.0-1.5. Excessive CFG causes oversaturation and detail loss
- Resolution: 1024×576 (16:9 video ratio) is the optimal starting point, adjustable to 1280×720 for HD video
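The parameter notes above can be checked mechanically before a run. This is a minimal sketch, assuming the recommended ranges quoted in this section; `check_keyframe_settings` is a hypothetical helper and not part of ComfyUI or Z-Image.

```python
def check_keyframe_settings(steps: int, cfg: float, width: int, height: int) -> list[str]:
    """Return warnings for settings outside the ranges recommended in this guide."""
    warnings = []
    if not 4 <= steps <= 8:
        warnings.append(f"steps={steps} is outside the recommended 4-8 range")
    if not 1.0 <= cfg <= 1.5:
        warnings.append(f"cfg={cfg} is outside the recommended 1.0-1.5 range")
    # Exact 16:9 check using integers to avoid floating-point comparison.
    if width * 9 != height * 16:
        warnings.append(f"{width}x{height} is not exactly 16:9")
    return warnings


print(check_keyframe_settings(steps=8, cfg=1.5, width=1024, height=576))  # []
for message in check_keyframe_settings(steps=20, cfg=4.0, width=832, height=480):
    print("warning:", message)
```

Note that 832×480, used later as a low-VRAM fallback, is close to but not exactly 16:9, so the check reports it as a warning rather than an error.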
Step 2: ControlNet Precise Control (Optional)
For precise keyframe composition using ControlNet:
# ControlNet Union node configuration
{
"controlnet_union": {
"inputs": {
"control_net": ["ControlNet Load (Z-Image Union)", 0],
"image": ["ControlNet Preprocessor", 0],
"strength": 0.8,
"start_percent": 0.0,
"end_percent": 1.0
}
},
"k_sampler_with_controlnet": {
"inputs": {
"model": ["Z-Image Turbo Load", 0],
"positive": ["clip_text_encode", 0],
"control_net": ["controlnet_union", 0],
"steps": 8,
"cfg": 1.5
}
}
}
Supported ControlNet Types:
| Type | Use Case | Recommended Strength |
|---|---|---|
| Canny Edge | Control composition outlines | 0.6-0.8 |
| Depth Map | Control spatial hierarchy | 0.5-0.7 |
| OpenPose | Control figure poses | 0.7-0.9 |
| Normal | Control lighting direction | 0.4-0.6 |
| Tile | Control overall style | 0.3-0.5 |
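The recommended ranges in the table can be kept as data and used to clamp a requested strength. The following is a sketch under that assumption; the dictionary keys and the `clamp_strength` helper are hypothetical names chosen for illustration.

```python
# Recommended strength ranges copied from the table above.
RECOMMENDED_STRENGTH = {
    "canny": (0.6, 0.8),
    "depth": (0.5, 0.7),
    "openpose": (0.7, 0.9),
    "normal": (0.4, 0.6),
    "tile": (0.3, 0.5),
}


def clamp_strength(control_type: str, strength: float) -> float:
    """Clamp a requested strength into the recommended range for its control type."""
    low, high = RECOMMENDED_STRENGTH[control_type]
    return min(max(strength, low), high)


print(clamp_strength("depth", 0.8))     # 0.7
print(clamp_strength("openpose", 0.8))  # 0.8
```

The ranges are starting points, so a real workflow may deliberately step outside them; the clamp is mainly useful as a guard in batch scripts.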
Step 3: Wan 2.2 Image-to-Video Conversion
{
"get_image_size": {
"inputs": {
"image": ["zimage_k_sampler", 0]
}
},
"wan22_i2v": {
"inputs": {
"model": ["Wan2.2 14B Load", 0],
"image": ["zimage_k_sampler", 0],
"width": ["get_image_size", 0],
"height": ["get_image_size", 1],
"num_frames": 81,
"frame_rate": 24,
"motion_strength": 0.7,
"seed": 42
}
}
}
Wan 2.2 Key Parameters:
- Frames: 81 frames (~3.4s @24fps). Increasing frames linearly increases VRAM demand
- Frame rate: 24fps (cinematic), 30fps (smooth), 15fps (clean animation)
- Motion strength: 0.0-1.0. Low = subtle animation, high = dramatic movement
- Inference steps: Default 30. Reducing to 15 speeds up processing with slight motion quality trade-off
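The relationship between frame count, frame rate, and clip length is simple arithmetic, sketched below. Both helpers are hypothetical. The second one assumes that Wan-style models expect frame counts of the form 4n+1 (which holds for the 81 and 49 frame values used in this guide); treat that constraint as an assumption to verify against your node's documentation.

```python
def clip_duration_seconds(num_frames: int, frame_rate: int) -> float:
    """Length of the generated clip in seconds."""
    return num_frames / frame_rate


def frames_for_duration(seconds: float, frame_rate: int) -> int:
    """Nearest frame count of the form 4n+1 for a target duration (assumed constraint)."""
    n = round((seconds * frame_rate - 1) / 4)
    return max(1, n * 4 + 1)


print(clip_duration_seconds(81, 24))  # 3.375
print(frames_for_duration(2.0, 24))   # 49
print(frames_for_duration(3.4, 24))   # 81
```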
Step 4: Video Composition and Export
{
"video_combine": {
"inputs": {
"frames": ["wan22_i2v", 0],
"frame_rate": 24,
"format": "video/h264-mp4",
"crf": 18,
"loop": 0
}
},
"vhs_video_save": {
"inputs": {
"images": ["video_combine", 0],
"filename_prefix": "zimage_wan_output",
"output_dir": "outputs/videos"
}
}
}
Advanced Techniques
Technique 1: Multi-Keyframe Interpolation
Generate several keyframes, animate each one into its own clip, and concatenate the clips to build longer or more complex video sequences:
Keyframe A (Z-Image) → Wan I2V → Clip 1
Keyframe B (Z-Image) → Wan I2V → Clip 2
Keyframe C (Z-Image) → Wan I2V → Clip 3
Clip 1 + Clip 2 + Clip 3 → ffmpeg concat → Complete Video
Implementation:
# Stitch multiple video clips with ffmpeg
ffmpeg -f concat -safe 0 -i filelist.txt -c copy output.mp4
# filelist.txt content:
# file 'fragment_1.mp4'
# file 'fragment_2.mp4'
# file 'fragment_3.mp4'
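The list file and the ffmpeg command above can also be produced from a script. This is a minimal sketch: `concat_list_text` and `concat_command` are hypothetical helpers that only build the text and the argument list, leaving file writing and execution to the caller.

```python
def concat_list_text(clips: list[str]) -> str:
    """Build the contents of filelist.txt for ffmpeg's concat demuxer."""
    lines = []
    for clip in clips:
        # Inside single quotes, the concat demuxer expects ' to be written as '\''.
        escaped = clip.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_command(list_path: str, output: str = "output.mp4") -> list[str]:
    """Argument list equivalent to the ffmpeg command shown above."""
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", output]


clips = ["fragment_1.mp4", "fragment_2.mp4", "fragment_3.mp4"]
print(concat_list_text(clips), end="")
print(" ".join(concat_command("filelist.txt")))
```

To run it, write the text to `filelist.txt` and pass the argument list to `subprocess.run(..., check=True)`. Because `-c copy` does not re-encode, the clips need matching codec, resolution, and frame rate.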
Technique 2: LoRA Style Consistency
Load a style LoRA in the Z-Image stage to ensure all keyframes share a unified visual style:
{
"lora_loader": {
"inputs": {
"model": ["Z-Image Turbo Load", 0],
"clip": ["CLIP Load", 0],
"lora_name": "my_style_lora.safetensors",
"strength_model": 1.0,
"strength_clip": 1.0
}
}
}
Technique 3: Super Resolution Post-Processing
Upscale the Z-Image keyframe before it is passed to Wan 2.2, which indirectly improves video quality:
{
"upscale_model": {
"inputs": {
"upscale_model": ["Upscale Model Load (4x-UltraMix)", 0],
"image": ["zimage_k_sampler", 0],
"scale_by": 2.0
}
}
}
Technique 4: Prompt Engineering Optimization
Optimal video prompt formula:
[Subject Description] + [Motion Description] + [Environment/Background] + [Style/Mood] + [Technical Specs]
Example:
An elegant white cat running under cherry blossom trees, petals drifting in the wind,
Japanese garden background, soft morning light,
Studio Ghibli animation style,
4K cinematic, smooth motion, gentle camera pan right
Prompts to avoid:
- Overly complex scene descriptions (multiple subjects moving simultaneously)
- Rapid viewpoint switching requirements
- Text rendering (text in video tends to distort easily)
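The prompt formula in this section maps naturally onto a small builder function. The sketch below is illustrative only; `build_video_prompt` is a hypothetical helper that joins the five parts in the order given by the formula and skips any that are left empty.

```python
def build_video_prompt(subject: str, motion: str, environment: str,
                       style: str, technical: str) -> str:
    """Join the five formula parts into one comma-separated prompt."""
    parts = [subject, motion, environment, style, technical]
    cleaned = [p.strip().rstrip(",") for p in parts if p and p.strip()]
    return ", ".join(cleaned)


prompt = build_video_prompt(
    subject="An elegant white cat running under cherry blossom trees",
    motion="petals drifting in the wind",
    environment="Japanese garden background, soft morning light",
    style="Studio Ghibli animation style",
    technical="4K cinematic, smooth motion, gentle camera pan right",
)
print(prompt)
```

Keeping the parts separate makes it easy to hold the subject and style fixed across keyframes while varying only the motion description.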
Troubleshooting
Issue 1: Video Flickering
Cause: Wan 2.2 produces incoherent motion between certain frames.
Solutions:
- Reduce motion_strength (0.5-0.6)
- Increase inference steps to 30-50
- Use the Temporal Consistency node in ComfyUI
- Reduce input image complexity (fewer details and textures)
Issue 2: Insufficient VRAM
Cause: Wan 2.2 14B model + Z-Image Turbo loaded simultaneously.
Solutions:
- Sequential execution: Generate and save images first, then load Wan model for I2V
- FP8 quantization: Use an FP8 version of Wan 2.2 (roughly half the VRAM of FP16 weights)
- Reduce frames: From 81 to 49 frames (2-second video)
- Lower resolution: From 1024×576 to 832×480
Issue 3: Unnatural Motion
Cause: Insufficient motion description in prompts or inappropriate motion_strength settings.
Solutions:
- Explicitly describe motion direction in prompts ("slow pan right", "zoom in on subject")
- Try different motion_strength values (0.3-0.9)
- Use ControlNet depth maps to constrain motion trajectories
- Generate results with multiple seeds and select the best motion effect
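Trying several seeds is easy to automate when the workflow is held as a Python dictionary. This is a sketch under that assumption: `seed_variants` is a hypothetical helper, and the node id and `seed` input name follow the JSON snippets in this guide rather than any specific installed node.

```python
import copy


def seed_variants(workflow: dict, node_id: str, seeds: list[int]) -> list[dict]:
    """Return one independent copy of the workflow per seed."""
    variants = []
    for seed in seeds:
        variant = copy.deepcopy(workflow)  # keep the base workflow untouched
        variant[node_id]["inputs"]["seed"] = seed
        variants.append(variant)
    return variants


base = {"wan22_i2v": {"inputs": {"num_frames": 81, "motion_strength": 0.7, "seed": 42}}}
for wf in seed_variants(base, "wan22_i2v", [1, 2, 3]):
    print(wf["wan22_i2v"]["inputs"]["seed"])
```

Each variant can then be queued separately and the clip with the most natural motion kept.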
Issue 4: Image-Video Style Mismatch
Cause: Wan 2.2 changes the original image's style during I2V processing.
Solutions:
- Use the same prompt for Wan 2.2 as used with Z-Image
- Reduce motion_strength to minimize style drift
- Apply style transfer nodes in post-processing to unify color tones
Complete Workflow JSON Template
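Here's a simplified template showing the workflow structure in ComfyUI's API format. Treat it as a structural sketch rather than a drop-in file: class names such as Wan2.2_I2V stand in for whatever your installed nodes expose, and the negative prompt node referenced as 2_neg must be added before the graph will run.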
{
"1": {
"class_type": "CheckpointLoaderSimple",
"inputs": {"ckpt_name": "zimage_turbo.safetensors"}
},
"2": {
"class_type": "CLIPTextEncode",
"inputs": {
"text": "your prompt here",
"clip": ["1", 1]
}
},
"3": {
"class_type": "KSampler",
"inputs": {
"model": ["1", 0],
"positive": ["2", 0],
"negative": ["2_neg", 0],
"steps": 8,
"cfg": 1.5,
"width": 1024,
"height": 576,
"seed": 42
}
},
"4": {
"class_type": "Wan2.2_I2V",
"inputs": {
"image": ["3", 0],
"num_frames": 81,
"frame_rate": 24,
"motion_strength": 0.7
}
},
"5": {
"class_type": "VHS_VideoCombine",
"inputs": {
"images": ["4", 0],
"frame_rate": 24,
"format": "video/h264-mp4"
}
}
}
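Before queueing an API-format workflow like the one above, it helps to confirm that every link points at a node that exists. The sketch below is one way to do that; `find_dangling_links` and `build_queue_request` are hypothetical helpers. The request assumes a local ComfyUI server on its usual address `http://127.0.0.1:8188` and its `/prompt` endpoint, so adjust both if your setup differs.

```python
import json
import urllib.request


def find_dangling_links(workflow: dict) -> list[tuple[str, str, str]]:
    """List (node_id, input_name, target_id) for links to nodes that do not exist."""
    dangling = []
    for node_id, node in workflow.items():
        for name, value in node.get("inputs", {}).items():
            # A link is written as [source_node_id, output_index].
            is_link = (isinstance(value, list) and len(value) == 2
                       and isinstance(value[0], str) and isinstance(value[1], int))
            if is_link and value[0] not in workflow:
                dangling.append((node_id, name, value[0]))
    return dangling


def build_queue_request(workflow: dict,
                        server: str = "http://127.0.0.1:8188") -> urllib.request.Request:
    """Build (but do not send) a POST request that queues the workflow."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(f"{server}/prompt", data=body,
                                  headers={"Content-Type": "application/json"})


workflow = {
    "1": {"class_type": "CheckpointLoaderSimple", "inputs": {"ckpt_name": "zimage_turbo.safetensors"}},
    "2": {"class_type": "CLIPTextEncode", "inputs": {"text": "your prompt here", "clip": ["1", 1]}},
    "3": {"class_type": "KSampler", "inputs": {"model": ["1", 0], "positive": ["2", 0],
                                               "negative": ["2_neg", 0], "steps": 8}},
}
print(find_dangling_links(workflow))  # [('3', 'negative', '2_neg')]
```

Once the list is empty, the request can be sent with `urllib.request.urlopen(build_queue_request(workflow))` while ComfyUI is running.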
Summary
The Z-Image + Wan 2.2 + ControlNet combination provides unprecedented flexibility and quality control for open-source video generation. By properly configuring these three components, you can achieve:
- Precise keyframe control: Z-Image's ControlNet ensures each keyframe's composition and style match expectations
- Natural motion transitions: Wan 2.2's image-to-video capability transforms static visuals into fluid animation
- Scalable workflow: Supports LoRA, super resolution, multi-clip stitching and other advanced features
Next Steps
- Beginner: Run Z-Image and Wan 2.2 separately first to familiarize yourself with their individual parameters
- Intermediate: Build the complete chained workflow, experiment with different motion strengths and prompt combinations
- Advanced: Incorporate ControlNet, LoRA, and multi-keyframe interpolation for complex video sequences
As Wan 2.2 and Z-Image models continue to evolve, this combined workflow's capabilities will keep improving. Stay tuned to HuggingFace and the ComfyUI community for the latest model updates, features, and performance optimizations.