Z-Image Video Generation Workflow: Complete Guide to Z-Video + ControlNet + Wan 2.2

Published: June 9, 2026
Author: Z-Image Tech Blog
Read time: ~12 minutes
Keywords: z-image video generation, Z-Video, Wan 2.2, ControlNet, text-to-video workflow

Introduction

After rapid advances in AI image generation, video generation has become the next frontier. Z-Image, an open-source image generation model, can serve as the first stage of a complete text-to-video pipeline when combined with the Wan 2.2 image-to-video model. This guide walks through building a Z-Image + Wan 2.2 + ControlNet video generation workflow in ComfyUI, from text prompt to finished video.

Why Is Z-Image + Wan 2.2 the Best Combination?

Z-Image Core Advantages

Z-Image (especially the Z-Image Turbo variant) excels at text-to-image generation:

High-quality image output: Supports resolutions up to 2K with rich details and accurate colors
Turbo variant: Generates quality images in just 8 steps, with 5-10x faster inference
ControlNet support: Native ControlNet Union multi-control support for precise pose, depth, and edge control
LoRA compatibility: Custom style LoRA support for training proprietary visual styles
Open source: Fully available on HuggingFace, supporting local deployment and commercial use

Wan 2.2 Image-to-Video Capabilities

Wan 2.2 is currently one of the most powerful open-source image-to-video models:

Frame consistency: Generated videos maintain visual consistency without flickering or abrupt changes
Natural motion: Supports multiple motion modes, including pan, zoom, rotation, and subject movement
Adaptive resolution: Automatically matches input image resolution, avoiding mismatch artifacts
Audio: The I2V model outputs silent video; background music or sound effects can be added afterwards for a more complete result

Core Workflow Logic

Text Prompt → Z-Image Keyframe Generation → Wan 2.2 Image-to-Video → Dynamic Video Output

Advantages of the combination:

Precise control: Z-Image's ControlNet ensures keyframes match the expected composition
Quality stacking: Each model handles the task it performs best
Flexible expansion: Intermediate processing steps, such as super resolution or style transfer, can be inserted between the two stages
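The chain above can be pictured as a list of stages, where each stage consumes the previous stage's output. The following is only an illustrative sketch: the stage functions are hypothetical stand-ins that record which step ran, not calls into Z-Image or Wan 2.2.

```python
from typing import Callable, List

# A stage takes the previous stage's output and returns its own output.
Stage = Callable[[object], object]


def run_pipeline(prompt: str, stages: List[Stage]) -> object:
    """Feed the prompt through each stage in order and return the final result."""
    result: object = prompt
    for stage in stages:
        result = stage(result)
    return result


# Hypothetical stand-ins: real stages would invoke Z-Image, an upscaler, and Wan 2.2.
def generate_keyframe(prompt):
    return {"kind": "image", "prompt": prompt, "history": ["z-image"]}


def upscale(image):
    return {**image, "history": image["history"] + ["upscale"]}


def image_to_video(image):
    return {**image, "kind": "video", "history": image["history"] + ["wan-i2v"]}


basic = run_pipeline("a city at sunset", [generate_keyframe, image_to_video])
extended = run_pipeline("a city at sunset", [generate_keyframe, upscale, image_to_video])
print(basic["history"])     # ['z-image', 'wan-i2v']
print(extended["history"])  # ['z-image', 'upscale', 'wan-i2v']
```

Inserting an intermediate step is just a matter of adding one more entry to the list, which mirrors how an extra node is spliced between the keyframe and video stages in ComfyUI.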

Environment Setup

1. ComfyUI Installation

# Install ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt

# Launch
python main.py

2. Model Downloads

# Z-Image Turbo (recommended)
# Download from HuggingFace:
# https://huggingface.co/Tongyi-MAI/Z-Image-Turbo

# Wan 2.2 Image-to-Video Model
# Download from HuggingFace:
# https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B

# ControlNet Models
# Z-Image ControlNet Union
# https://huggingface.co/Tongyi-MAI/Z-Image-ControlNet-Union

3. Custom Node Installation

cd ComfyUI/custom_nodes

# ComfyUI-VideoHelperSuite (video processing)
git clone https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite
pip install -r ComfyUI-VideoHelperSuite/requirements.txt

# ComfyUI-WanVideoWrapper (Wan 2.2 wrapper)
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper
pip install -r ComfyUI-WanVideoWrapper/requirements.txt

# Restart ComfyUI to load new nodes

4. GPU Requirements

Configuration	Minimum VRAM	Recommended VRAM	Notes
Z-Image Turbo images only	4GB	8GB	8-step inference
Wan 2.2 14B I2V	16GB	24GB	14B parameter model
Full workflow	16GB	24GB	Both chained
Low VRAM option	8GB	—	FP8 quantization
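As a quick way to read the table, the sketch below maps an available VRAM figure to the matching row. The thresholds come straight from the table; the function name and the returned labels are illustrative only.

```python
def pick_configuration(vram_gb: float) -> str:
    """Map available VRAM (in GB) to the closest row of the requirements table."""
    if vram_gb >= 24:
        return "Full workflow (recommended VRAM)"
    if vram_gb >= 16:
        return "Full workflow (minimum VRAM)"
    if vram_gb >= 8:
        return "Low VRAM option (FP8 quantization)"
    if vram_gb >= 4:
        return "Z-Image Turbo images only"
    return "Below the minimum listed in the table"


for size in (6, 12, 16, 24):
    print(size, "GB ->", pick_configuration(size))
```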

Core Workflow Construction

Workflow Architecture Overview

[CLIP Text Encode] → [Z-Image KSampler] → [Get Image Size]
                                    ↓                    ↓
                            [Save Image]           [Wan2.2 Load]
                                                     ↓
                                               [Wan2.2 I2V]
                                                     ↓
                                               [Video Combine]
                                                     ↓
                                               [VHS Video Save]

Step 1: Z-Image Turbo Keyframe Generation

ComfyUI workflow node configuration (JSON snippet):

{
  "clip_text_encode": {
    "inputs": {
      "text": "a futuristic cityscape at sunset, cinematic lighting, 4K quality",
      "clip": ["CLIP Load", 0]
    }
  },
  "clip_text_encode_neg": {
    "inputs": {
      "text": "",
      "clip": ["CLIP Load", 0]
    }
  },
  "zimage_k_sampler": {
    "inputs": {
      "model": ["Z-Image Turbo Load", 0],
      "positive": ["clip_text_encode", 0],
      "negative": ["clip_text_encode_neg", 0],
      "steps": 8,
      "cfg": 1.5,
      "sampler_name": "euler",
      "scheduler": "normal",
      "width": 1024,
      "height": 576,
      "seed": 42
    }
  }
}

Key Parameter Notes:

Steps: Z-Image Turbo recommends 4-8 steps. Fewer steps run faster; 8 steps is the best balance between speed and quality
CFG Scale: Z-Image Turbo recommends 1.0-1.5. Excessive CFG causes oversaturation and detail loss
Resolution: 1024×576 (16:9 video ratio) is the optimal starting point, adjustable to 1280×720 for HD video
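The recommendations above can be turned into a small pre-flight check. This is a minimal sketch: the ranges are the ones stated in the notes, while the function name and the warning wording are made up for illustration.

```python
from typing import List


def check_keyframe_settings(steps: int, cfg: float, width: int, height: int) -> List[str]:
    """Return a warning for each setting that falls outside the ranges in the notes above."""
    warnings = []
    if not 4 <= steps <= 8:
        warnings.append(f"steps={steps} is outside the recommended 4-8 range")
    if not 1.0 <= cfg <= 1.5:
        warnings.append(f"cfg={cfg} is outside the recommended 1.0-1.5 range")
    # Integer cross-multiplication avoids floating-point comparison of the ratio.
    if width * 9 != height * 16:
        warnings.append(f"{width}x{height} is not a 16:9 frame")
    return warnings


print(check_keyframe_settings(8, 1.5, 1024, 576))   # []
print(check_keyframe_settings(20, 4.0, 1024, 1024))  # three warnings
```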

Step 2: ControlNet Precise Control (Optional)

To control keyframe composition precisely, add a ControlNet stage:

ControlNet Union node configuration:
{
  "controlnet_union": {
    "inputs": {
      "control_net": ["ControlNet Load (Z-Image Union)", 0],
      "image": ["ControlNet Preprocessor", 0],
      "strength": 0.8,
      "start_percent": 0.0,
      "end_percent": 1.0
    }
  },
  "k_sampler_with_controlnet": {
    "inputs": {
      "model": ["Z-Image Turbo Load", 0],
      "positive": ["clip_text_encode", 0],
      "control_net": ["controlnet_union", 0],
      "steps": 8,
      "cfg": 1.5
    }
  }
}

Supported ControlNet Types:

Type	Use Case	Recommended Strength
Canny Edge	Control composition outlines	0.6-0.8
Depth Map	Control spatial hierarchy	0.5-0.7
OpenPose	Control figure poses	0.7-0.9
Normal	Control lighting direction	0.4-0.6
Tile	Control overall style	0.3-0.5
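One way to keep the strength setting inside the recommended range for each control type is a lookup with clamping. The ranges below are copied from the table; the dictionary keys and the function name are illustrative choices.

```python
# Recommended (low, high) strength per control type, taken from the table above.
RECOMMENDED_STRENGTH = {
    "canny": (0.6, 0.8),
    "depth": (0.5, 0.7),
    "openpose": (0.7, 0.9),
    "normal": (0.4, 0.6),
    "tile": (0.3, 0.5),
}


def clamp_strength(control_type: str, strength: float) -> float:
    """Clamp a requested strength into the recommended range for the control type."""
    low, high = RECOMMENDED_STRENGTH[control_type.lower()]
    return min(max(strength, low), high)


print(clamp_strength("depth", 0.8))     # 0.7
print(clamp_strength("OpenPose", 0.5))  # 0.7
print(clamp_strength("canny", 0.75))    # 0.75
```

The ranges are starting points, so a value outside them is not necessarily wrong; clamping is simply a convenient default when sweeping settings.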

Step 3: Wan 2.2 Image-to-Video Conversion

{
  "get_image_size": {
    "inputs": {
      "image": ["zimage_k_sampler", 0]
    }
  },
  "wan22_i2v": {
    "inputs": {
      "model": ["Wan2.2 14B Load", 0],
      "image": ["zimage_k_sampler", 0],
      "width": ["get_image_size", 0],
      "height": ["get_image_size", 1],
      "num_frames": 81,
      "frame_rate": 24,
      "motion_strength": 0.7,
      "seed": 42
    }
  }
}

Wan 2.2 Key Parameters:

Frames: 81 frames (~3.4s @24fps). VRAM demand grows roughly linearly with the frame count
Frame rate: 24fps (cinematic), 30fps (smooth), 15fps (clean animation)
Motion strength: 0.0-1.0. Low = subtle animation, high = dramatic movement
Inference steps: Default 30. Reducing to 15 speeds up processing with slight motion quality trade-off
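The relationship between frame count, frame rate, and clip length is simple arithmetic, sketched below. The helper names are illustrative, and snapping to a frame count of the form 4n + 1 is an assumption based on the counts used in this guide (81 and 49), so check what your Wan node accepts.

```python
def clip_duration(num_frames: int, fps: float) -> float:
    """Clip length in seconds for a given frame count and frame rate."""
    return num_frames / fps


def frames_for_duration(seconds: float, fps: float) -> int:
    """Nearest frame count of the form 4n + 1 (assumed constraint) for a target length."""
    target = seconds * fps
    n = max(1, round((target - 1) / 4))
    return 4 * n + 1


print(clip_duration(81, 24))         # 3.375
print(frames_for_duration(2.0, 24))  # 49
print(frames_for_duration(3.4, 24))  # 81
```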

Step 4: Video Composition and Export

{
  "video_combine": {
    "inputs": {
      "frames": ["wan22_i2v", 0],
      "frame_rate": 24,
      "format": "video/h264-mp4",
      "crf": 18,
      "loop": 0
    }
  },
  "vhs_video_save": {
    "inputs": {
      "images": ["video_combine", 0],
      "filename_prefix": "zimage_wan_output",
      "output_dir": "outputs/videos"
    }
  }
}

Advanced Techniques

Technique 1: Multi-Keyframe Interpolation

Generate multiple keyframes, animate each one into its own clip, and stitch the clips together to build longer, more complex video sequences:

Keyframe A (Z-Image) → Wan I2V → Clip 1
Keyframe B (Z-Image) → Wan I2V → Clip 2
Keyframe C (Z-Image) → Wan I2V → Clip 3
Clip 1 + Clip 2 + Clip 3 → ffmpeg concat → Complete Video

Implementation:

# Stitch multiple video clips with ffmpeg
# (-c copy skips re-encoding, so all clips must share codec, resolution, and frame rate)
ffmpeg -f concat -safe 0 -i filelist.txt -c copy output.mp4

# filelist.txt content:
# file 'fragment_1.mp4'
# file 'fragment_2.mp4'
# file 'fragment_3.mp4'
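When there are many clips, the file list and the ffmpeg call can be scripted. The sketch below assumes ffmpeg is on the PATH and that clip names contain no single quotes; the function names are hypothetical, and the ffmpeg arguments are the same ones shown above.

```python
import subprocess
from pathlib import Path
from typing import List


def filelist_text(clips: List[str]) -> str:
    """Build the contents of a concat list file, one "file '...'" line per clip."""
    return "".join(f"file '{Path(clip).as_posix()}'\n" for clip in clips)


def concat_clips(clips: List[str], output: str = "output.mp4",
                 list_path: str = "filelist.txt") -> None:
    """Write the list file, then stitch the clips without re-encoding."""
    Path(list_path).write_text(filelist_text(clips), encoding="utf-8")
    cmd = ["ffmpeg", "-f", "concat", "-safe", "0",
           "-i", list_path, "-c", "copy", output]
    subprocess.run(cmd, check=True)  # raises CalledProcessError if ffmpeg fails


print(filelist_text(["fragment_1.mp4", "fragment_2.mp4", "fragment_3.mp4"]), end="")
# To stitch for real:
# concat_clips(["fragment_1.mp4", "fragment_2.mp4", "fragment_3.mp4"])
```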

Technique 2: LoRA Style Consistency

Load a style LoRA in the Z-Image stage to ensure all keyframes share a unified visual style:

{
  "lora_loader": {
    "inputs": {
      "model": ["Z-Image Turbo Load", 0],
      "clip": ["CLIP Load", 0],
      "lora_name": "my_style_lora.safetensors",
      "strength_model": 1.0,
      "strength_clip": 1.0
    }
  }
}
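In a workflow exported in ComfyUI's API format, adding a LoRA means rewiring every node that read the model or CLIP from the loader so that it reads from the LoRA node instead. The sketch below assumes a single loader node whose output 0 is the model and output 1 is the CLIP, and it uses ComfyUI's stock LoraLoader node; the function name and the "lora" node id are illustrative.

```python
import copy


def insert_lora(workflow: dict, loader_id: str, lora_name: str,
                lora_id: str = "lora", strength: float = 1.0) -> dict:
    """Return a copy of the workflow with a LoraLoader spliced in after the loader."""
    wf = copy.deepcopy(workflow)
    for node in wf.values():
        for key, value in node["inputs"].items():
            is_link = isinstance(value, list) and len(value) == 2
            # Redirect model (output 0) and clip (output 1) links to the LoRA node.
            if is_link and value[0] == loader_id and value[1] in (0, 1):
                node["inputs"][key] = [lora_id, value[1]]
    # Added after rewiring so the LoRA node itself still reads from the loader.
    wf[lora_id] = {
        "class_type": "LoraLoader",
        "inputs": {
            "model": [loader_id, 0],
            "clip": [loader_id, 1],
            "lora_name": lora_name,
            "strength_model": strength,
            "strength_clip": strength,
        },
    }
    return wf


base = {
    "1": {"class_type": "CheckpointLoaderSimple", "inputs": {"ckpt_name": "zimage_turbo.safetensors"}},
    "2": {"class_type": "CLIPTextEncode", "inputs": {"text": "your prompt here", "clip": ["1", 1]}},
    "3": {"class_type": "KSampler", "inputs": {"model": ["1", 0], "positive": ["2", 0]}},
}
styled = insert_lora(base, "1", "my_style_lora.safetensors")
print(styled["3"]["inputs"]["model"])  # ['lora', 0]
```

Because every keyframe workflow passes through the same helper, all keyframes pick up the same LoRA and strength.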

Technique 3: Super Resolution Post-Processing

Upscale the keyframe before it is passed to Wan 2.2, which indirectly improves the quality of the resulting video:

{
  "upscale_model": {
    "inputs": {
      "upscale_model": ["Upscale Model Load (4x-UltraMix)", 0],
      "image": ["zimage_k_sampler", 0],
      "scale_by": 2.0
    }
  }
}

Technique 4: Prompt Engineering Optimization

Optimal video prompt formula:

[Subject Description] + [Motion Description] + [Environment/Background] + [Style/Mood] + [Technical Specs]

Example:

An elegant white cat running under cherry blossom trees, petals drifting in the wind,
Japanese garden background, soft morning light,
Studio Ghibli animation style,
4K cinematic, smooth motion, gentle camera pan right
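The formula lends itself to a small helper that assembles the five parts in order. This is only a sketch; the function and parameter names are illustrative, and the sample values are taken from the example above.

```python
def build_video_prompt(subject: str, motion: str, environment: str,
                       style: str, technical: str) -> str:
    """Join the five prompt parts in formula order, skipping any that are empty."""
    parts = [subject, motion, environment, style, technical]
    return ", ".join(part.strip() for part in parts if part and part.strip())


prompt = build_video_prompt(
    subject="An elegant white cat running under cherry blossom trees",
    motion="petals drifting in the wind",
    environment="Japanese garden background, soft morning light",
    style="Studio Ghibli animation style",
    technical="4K cinematic, smooth motion, gentle camera pan right",
)
print(prompt)
```

Keeping the parts separate makes it easy to vary one of them, for example the motion description, while holding the rest constant across keyframes.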

Prompts to avoid:

Overly complex scene descriptions (multiple subjects moving simultaneously)
Requests for rapid viewpoint switching
Text rendering (text in video tends to distort)

Troubleshooting

Issue 1: Video Flickering

Cause: Wan 2.2 produces incoherent motion between certain frames.

Solutions:

Reduce motion_strength (0.5-0.6)
Increase inference steps to 30-50
Apply a temporal-consistency (deflicker) node, if one of your installed custom node packs provides it
Reduce input image complexity (fewer details and textures)

Issue 2: Insufficient VRAM

Cause: Wan 2.2 14B model + Z-Image Turbo loaded simultaneously.

Solutions:

Sequential execution: Generate and save the keyframes first, then load the Wan model for I2V
FP8 quantization: Use an FP8 version of Wan 2.2 (roughly half the weight memory of FP16)
Reduce frames: From 81 to 49 frames (2-second video)
Lower resolution: From 1024×576 to 832×480

Issue 3: Unnatural Motion

Cause: Insufficient motion description in prompts or inappropriate motion_strength settings.

Solutions:

Explicitly describe motion direction in prompts ("slow pan right", "zoom in on subject")
Try different motion_strength values (0.3-0.9)
Use ControlNet depth maps to constrain motion trajectories
Generate results with multiple seeds and select the best motion effect
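Trying several seeds is easy to automate by cloning the workflow once per seed. The sketch below assumes a workflow dictionary shaped like the snippets in this guide, with a "seed" entry under the video node's inputs; the function name is illustrative.

```python
import copy
from typing import List


def seed_variants(workflow: dict, node_id: str, seeds: List[int]) -> List[dict]:
    """Return one independent copy of the workflow per seed."""
    variants = []
    for seed in seeds:
        variant = copy.deepcopy(workflow)  # deep copy so variants never share state
        variant[node_id]["inputs"]["seed"] = seed
        variants.append(variant)
    return variants


base = {"wan22_i2v": {"inputs": {"num_frames": 81, "motion_strength": 0.7, "seed": 42}}}
runs = seed_variants(base, "wan22_i2v", [1, 2, 3])
print([run["wan22_i2v"]["inputs"]["seed"] for run in runs])  # [1, 2, 3]
```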

Issue 4: Image-Video Style Mismatch

Cause: Wan 2.2 changes the original image's style during I2V processing.

Solutions:

Use the same prompt for Wan 2.2 as used with Z-Image
Reduce motion_strength to minimize style drift
Apply style transfer nodes in post-processing to unify color tones

Complete Workflow JSON Template

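Here's a simplified template that outlines how the nodes chain together. Treat it as a structural sketch rather than a file to import as-is: the stock KSampler node takes its size from an empty-latent node and outputs latents that need a VAE Decode step before the video stage, and "Wan2.2_I2V" is a placeholder for the image-to-video node that your installed Wan wrapper provides.

{
  "1": {
    "class_type": "CheckpointLoaderSimple",
    "inputs": {"ckpt_name": "zimage_turbo.safetensors"}
  },
  "2": {
    "class_type": "CLIPTextEncode",
    "inputs": {
      "text": "your prompt here",
      "clip": ["1", 1]
    }
  },
  "2_neg": {
    "class_type": "CLIPTextEncode",
    "inputs": {
      "text": "",
      "clip": ["1", 1]
    }
  },
  "3": {
    "class_type": "KSampler",
    "inputs": {
      "model": ["1", 0],
      "positive": ["2", 0],
      "negative": ["2_neg", 0],
      "steps": 8,
      "cfg": 1.5,
      "width": 1024,
      "height": 576,
      "seed": 42
    }
  },
  "4": {
    "class_type": "Wan2.2_I2V",
    "inputs": {
      "image": ["3", 0],
      "num_frames": 81,
      "frame_rate": 24,
      "motion_strength": 0.7
    }
  },
  "5": {
    "class_type": "VHS_VideoCombine",
    "inputs": {
      "images": ["4", 0],
      "frame_rate": 24,
      "format": "video/h264-mp4"
    }
  }
}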
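Before queueing a template like this, it can help to check that every link points at a node that actually exists, then send the workflow to a running ComfyUI server. The sketch below assumes a local server at the default address http://127.0.0.1:8188 and a workflow in API format; the helper names are illustrative, and queue_prompt is defined but not called here.

```python
import json
import urllib.request
from typing import List, Tuple


def dangling_links(workflow: dict) -> List[Tuple[str, str, str]]:
    """List (node_id, input_name, missing_target) for links to nodes not in the workflow."""
    missing = []
    for node_id, node in workflow.items():
        for name, value in node.get("inputs", {}).items():
            # A link is a two-item list: [source node id, output index].
            is_link = (isinstance(value, list) and len(value) == 2
                       and isinstance(value[0], str) and isinstance(value[1], int))
            if is_link and value[0] not in workflow:
                missing.append((node_id, name, value[0]))
    return missing


def queue_prompt(workflow: dict, server: str = "http://127.0.0.1:8188") -> dict:
    """POST the workflow to the server's /prompt endpoint and return its JSON reply."""
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    request = urllib.request.Request(
        f"{server}/prompt", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


example = {
    "2": {"class_type": "CLIPTextEncode", "inputs": {"text": "your prompt here", "clip": ["1", 1]}},
    "3": {"class_type": "KSampler", "inputs": {"positive": ["2", 0], "steps": 8}},
}
print(dangling_links(example))  # [('2', 'clip', '1')]
```

An empty list from dangling_links means every reference resolves; it does not confirm that the class names or input names are valid for your installation.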

Summary

The Z-Image + Wan 2.2 + ControlNet combination gives open-source video generation fine-grained control over both composition and motion. Configured properly, the three components provide:

Precise keyframe control: Z-Image's ControlNet ensures each keyframe's composition and style match expectations
Natural motion transitions: Wan 2.2's image-to-video capability transforms static visuals into fluid animation
Scalable workflow: Supports LoRA, super resolution, multi-clip stitching and other advanced features

Next Steps

Beginner: Run Z-Image and Wan 2.2 separately first to familiarize yourself with their individual parameters
Intermediate: Build the complete chained workflow, experiment with different motion strengths and prompt combinations
Advanced: Incorporate ControlNet, LoRA, and multi-keyframe interpolation for complex video sequences

As Wan 2.2 and Z-Image models continue to evolve, this combined workflow's capabilities will keep improving. Stay tuned to HuggingFace and the ComfyUI community for the latest model updates, features, and performance optimizations.
