Z-Image Video Generation Workflow: Complete Guide to Z-Video + ControlNet + Wan 2.2

Published: June 9, 2026
Author: Z-Image Tech Blog
Read time: ~12 minutes
Keywords: z-image video generation, Z-Video, Wan 2.2, ControlNet, text-to-video workflow


Introduction

After rapid advances in AI image generation, video generation has become the next frontier. Z-Image, an open-source image generation model, can serve as the first stage of a complete text-to-video pipeline when combined with the Wan 2.2 image-to-video model. This guide walks you through building a Z-Image + Wan 2.2 + ControlNet video generation workflow in ComfyUI, from creative concept to finished video.

Why Is Z-Image + Wan 2.2 the Best Combination?

Z-Image Core Advantages

Z-Image (especially the Z-Image Turbo variant) excels in text-to-image generation:

  • High-quality image output: Supports resolutions up to 2K with rich details and accurate colors
  • Turbo variant: Generates quality images in just 8 steps, 5-10x faster inference
  • ControlNet support: Native ControlNet Union multi-control support for precise pose, depth, and edge control
  • LoRA compatibility: Custom style LoRA support for training proprietary visual styles
  • Open source: Fully available on HuggingFace, supporting local deployment and commercial use

Wan 2.2 Image-to-Video Capabilities

Wan 2.2 is currently one of the most powerful open-source image-to-video models:

  • Frame consistency: Generated videos maintain visual consistency without flickering or abrupt changes
  • Natural motion: Supports multiple motion modes — pan, zoom, rotate, subject movement
  • Adaptive resolution: Automatically matches input image resolution, avoiding mismatch artifacts
  • Audio support: Optional background music or sound effects for enhanced video completeness

Core Workflow Logic

Text Prompt → Z-Image Keyframe Generation → Wan 2.2 Image-to-Video → Dynamic Video Output

Advantages of the combination:

  1. Precise control: Z-Image's ControlNet ensures keyframes match expected composition
  2. Quality stacking: Both models perform optimally in their respective domains
  3. Flexible expansion: Intermediate processing steps can be inserted (Super Resolution, style transfer)

Environment Setup

1. ComfyUI Installation

# Install ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt

# Launch
python main.py

2. Model Downloads

# Z-Image Turbo (recommended)
# Download from HuggingFace:
# https://huggingface.co/Tongyi-MAI/Z-Image-Turbo

# Wan 2.2 Image-to-Video Model
# Download from HuggingFace:
# https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B

# ControlNet Models
# Z-Image ControlNet Union
# https://huggingface.co/Tongyi-MAI/Z-Image-ControlNet-Union

3. Custom Node Installation

cd ComfyUI/custom_nodes

# ComfyUI-VideoHelperSuite (video processing)
git clone https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite
pip install -r ComfyUI-VideoHelperSuite/requirements.txt

# ComfyUI-WanVideoWrapper (community-maintained Wan wrapper nodes)
git clone https://github.com/kijai/ComfyUI-WanVideoWrapper
pip install -r ComfyUI-WanVideoWrapper/requirements.txt

# Restart ComfyUI to load new nodes

4. GPU Requirements

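Configuration               | Minimum VRAM | Recommended VRAM | Notes
----------------------------|--------------|------------------|--------------------
Z-Image Turbo (images only) | 4GB          | 8GB              | 8-step inference
Wan 2.2 14B I2V             | 16GB         | 24GB             | 14B parameter model
Full workflow               | 16GB         | 24GB             | Both chained
Low VRAM option             | 8GB          | -                | FP8 quantization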
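
The table can be turned into a quick pre-flight check. The following is a minimal sketch, assuming the thresholds listed above; `pick_profile` and `detect_vram_gb` are hypothetical helper names, and the PyTorch lookup is optional and only runs if PyTorch with CUDA is installed.

```python
from typing import Optional


def pick_profile(vram_gb: float) -> str:
    """Map available VRAM (GB) to the configuration suggested by the table."""
    if vram_gb >= 24:
        return "full workflow (recommended)"
    if vram_gb >= 16:
        return "full workflow (minimum)"
    if vram_gb >= 8:
        return "low-VRAM option (FP8 quantization)"
    if vram_gb >= 4:
        return "Z-Image Turbo images only"
    return "below the listed minimums"


def detect_vram_gb() -> Optional[float]:
    """Return total VRAM of GPU 0, rounded to whole GB, or None if unavailable."""
    try:
        import torch  # optional dependency, imported lazily
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    # Cards usually report slightly less than their nominal size, so round.
    return float(round(total_bytes / 1024**3))


if __name__ == "__main__":
    vram = detect_vram_gb()
    print("No CUDA GPU detected" if vram is None else pick_profile(vram))
```

Treat the result as a starting point only: actual usage also depends on resolution, frame count, and offloading settings.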

Core Workflow Construction

Workflow Architecture Overview

[CLIP Text Encode] → [Z-Image KSampler] → [Get Image Size]
                                    ↓                    ↓
                            [Save Image]           [Wan2.2 Load]
                                                     ↓
                                               [Wan2.2 I2V]
                                                     ↓
                                               [Video Combine]
                                                     ↓
                                               [VHS Video Save]

Step 1: Z-Image Turbo Keyframe Generation

# ComfyUI workflow node configuration (JSON snippet)
{
  "clip_text_encode": {
    "inputs": {
      "text": "a futuristic cityscape at sunset, cinematic lighting, 4K quality",
      "clip": ["CLIP Load", 0]
    }
  },
  "zimage_k_sampler": {
    "inputs": {
      "model": ["Z-Image Turbo Load", 0],
      "positive": ["clip_text_encode", 0],
      "negative": ["clip_text_encode_neg", 0],
      "steps": 8,
      "cfg": 1.5,
      "sampler_name": "euler",
      "scheduler": "normal",
      "width": 1024,
      "height": 576,
      "seed": 42
    }
  }
}

Key Parameter Notes:

  • Steps: Z-Image Turbo recommends 4-8 steps. Fewer steps run faster; 8 steps is the balance point between speed and quality
  • CFG Scale: Z-Image Turbo recommends 1.0-1.5. Higher CFG values cause oversaturation and detail loss
  • Resolution: 1024×576 (16:9 video ratio) is a good starting point; raise it to 1280×720 for HD video
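
These ranges are easy to check before queuing a job. The sketch below is one way to do it, assuming the step and CFG ranges from the notes above and a strict 16:9 target; `KeyframeSettings` and `check_keyframe_settings` are illustrative names, not part of ComfyUI.

```python
from dataclasses import dataclass


@dataclass
class KeyframeSettings:
    steps: int = 8
    cfg: float = 1.5
    width: int = 1024
    height: int = 576


def check_keyframe_settings(s: KeyframeSettings) -> list:
    """Return human-readable warnings for values outside the suggested ranges."""
    warnings = []
    if not 4 <= s.steps <= 8:
        warnings.append(f"steps={s.steps} is outside the suggested 4-8 range")
    if not 1.0 <= s.cfg <= 1.5:
        warnings.append(f"cfg={s.cfg} is outside the suggested 1.0-1.5 range")
    # Integer cross-multiplication avoids float rounding in the ratio test.
    if s.width * 9 != s.height * 16:
        warnings.append(f"{s.width}x{s.height} is not exactly 16:9")
    return warnings


if __name__ == "__main__":
    for problem in check_keyframe_settings(KeyframeSettings(steps=20)):
        print("warning:", problem)
```

An empty list means the settings match the guidance above; a warning is a hint, not an error.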

Step 2: ControlNet Precise Control (Optional)

To control keyframe composition precisely, add a ControlNet stage:

# ControlNet Union node configuration
{
  "controlnet_union": {
    "inputs": {
      "control_net": ["ControlNet Load (Z-Image Union)", 0],
      "image": ["ControlNet Preprocessor", 0],
      "strength": 0.8,
      "start_percent": 0.0,
      "end_percent": 1.0
    }
  },
  "k_sampler_with_controlnet": {
    "inputs": {
      "model": ["Z-Image Turbo Load", 0],
      "positive": ["clip_text_encode", 0],
      "control_net": ["controlnet_union", 0],
      "steps": 8,
      "cfg": 1.5
    }
  }
}

Supported ControlNet Types:

Type       | Use Case                     | Recommended Strength
-----------|------------------------------|---------------------
Canny Edge | Control composition outlines | 0.6-0.8
Depth Map  | Control spatial hierarchy    | 0.5-0.7
OpenPose   | Control figure poses         | 0.7-0.9
Normal     | Control lighting direction   | 0.4-0.6
Tile       | Control overall style        | 0.3-0.5
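
If you script your workflows, the table can live in code as a lookup. This is a small sketch using the ranges above; the dictionary keys and the helper names `default_strength` and `clamp_strength` are assumptions made for illustration.

```python
# Recommended strength ranges (low, high), copied from the table above.
CONTROL_STRENGTH = {
    "canny": (0.6, 0.8),
    "depth": (0.5, 0.7),
    "openpose": (0.7, 0.9),
    "normal": (0.4, 0.6),
    "tile": (0.3, 0.5),
}


def default_strength(control_type: str) -> float:
    """Midpoint of the recommended range, as a reasonable first value."""
    low, high = CONTROL_STRENGTH[control_type.lower()]
    return round((low + high) / 2, 2)


def clamp_strength(control_type: str, strength: float) -> float:
    """Pull a requested strength back into the recommended range."""
    low, high = CONTROL_STRENGTH[control_type.lower()]
    return min(max(strength, low), high)


if __name__ == "__main__":
    for name in CONTROL_STRENGTH:
        print(name, default_strength(name))
```

Starting from the midpoint and adjusting in small steps is usually easier than guessing a value from scratch.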

Step 3: Wan 2.2 Image-to-Video Conversion

{
  "get_image_size": {
    "inputs": {
      "image": ["zimage_k_sampler", 0]
    }
  },
  "wan22_i2v": {
    "inputs": {
      "model": ["Wan2.2 14B Load", 0],
      "image": ["zimage_k_sampler", 0],
      "width": ["get_image_size", 0],
      "height": ["get_image_size", 1],
      "num_frames": 81,
      "frame_rate": 24,
      "motion_strength": 0.7,
      "seed": 42
    }
  }
}

Wan 2.2 Key Parameters:

  • Frames: 81 frames (~3.4s @24fps). Increasing frames linearly increases VRAM demand
  • Frame rate: 24fps (cinematic), 30fps (smooth), 15fps (clean animation)
  • Motion strength: 0.0-1.0. Low = subtle animation, high = dramatic movement
  • Inference steps: Default 30. Reducing to 15 speeds up processing with slight motion quality trade-off
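
The frame and frame-rate numbers are linked by simple arithmetic: duration = frames / fps. The sketch below works in both directions. Snapping to frame counts of the form 4n + 1 is an assumption on my part; it matches the 81 and 49 frame values used in this guide, but check the requirement of your Wan node before relying on it.

```python
def clip_duration(num_frames: int, fps: float) -> float:
    """Length of the clip in seconds."""
    return num_frames / fps


def frames_for_duration(seconds: float, fps: float) -> int:
    """Nearest frame count of the form 4n + 1 for a target duration.

    The 4n + 1 constraint is an assumption, not something this guide states.
    """
    n = max(0, round((seconds * fps - 1) / 4))
    return 4 * n + 1


if __name__ == "__main__":
    print(clip_duration(81, 24))          # 3.375 seconds
    print(frames_for_duration(2.0, 24))   # 49 frames
```

At 24fps, 81 frames is 3.375 seconds, which the parameter list rounds to ~3.4s.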

Step 4: Video Composition and Export

{
  "video_combine": {
    "inputs": {
      "frames": ["wan22_i2v", 0],
      "frame_rate": 24,
      "format": "video/h264-mp4",
      "crf": 18,
      "loop": 0
    }
  },
  "vhs_video_save": {
    "inputs": {
      "images": ["video_combine", 0],
      "filename_prefix": "zimage_wan_output",
      "output_dir": "outputs/videos"
    }
  }
}
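
If you prefer to export frames as PNG files and encode them outside ComfyUI, the same settings (24fps, H.264, CRF 18) map onto a plain ffmpeg call. This is a sketch under that assumption; the frame filename pattern and the helper names are hypothetical, and ffmpeg must be on your PATH.

```python
import subprocess


def build_encode_cmd(frame_pattern: str, output: str, fps: int = 24, crf: int = 18) -> list:
    """Build an ffmpeg command that encodes numbered frames to H.264 MP4."""
    return [
        "ffmpeg", "-y",               # overwrite the output without asking
        "-framerate", str(fps),       # input frame rate
        "-i", frame_pattern,          # e.g. frames/frame_%05d.png
        "-c:v", "libx264",
        "-crf", str(crf),             # lower CRF means higher quality
        "-pix_fmt", "yuv420p",        # widely compatible pixel format
        output,
    ]


def encode(frame_pattern: str, output: str, fps: int = 24, crf: int = 18) -> None:
    """Run the encode and raise if ffmpeg exits with an error."""
    subprocess.run(build_encode_cmd(frame_pattern, output, fps, crf), check=True)


if __name__ == "__main__":
    print(" ".join(build_encode_cmd("frames/frame_%05d.png", "zimage_wan_output.mp4")))
```

Keeping command construction separate from execution makes it easy to print and inspect the command before running it.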

Advanced Techniques

Technique 1: Multi-Keyframe Interpolation

Generate multiple keyframes, animate each one into its own clip, and stitch the clips together to build longer, more complex sequences:

Keyframe A (Z-Image) → Wan I2V → Clip 1
Keyframe B (Z-Image) → Wan I2V → Clip 2
Keyframe C (Z-Image) → Wan I2V → Clip 3
Clip 1 + Clip 2 + Clip 3 → ffmpeg concat → Complete Video

Implementation:

# Stitch multiple video clips with ffmpeg
ffmpeg -f concat -safe 0 -i filelist.txt -c copy output.mp4

# filelist.txt content:
# file 'fragment_1.mp4'
# file 'fragment_2.mp4'
# file 'fragment_3.mp4'
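
Writing filelist.txt by hand gets tedious with many clips. The sketch below generates it and then runs the same concat command; the helper names are illustrative. Note that `-c copy` only works cleanly when all clips share the same codec, resolution, and frame rate.

```python
import subprocess
from pathlib import Path


def concat_list_text(clips) -> str:
    """Render the concat demuxer list, escaping single quotes in paths."""
    lines = []
    for clip in clips:
        escaped = str(clip).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def concat_clips(clips, output: str = "output.mp4", list_path: str = "filelist.txt") -> None:
    """Write the list file, then stitch the clips without re-encoding."""
    Path(list_path).write_text(concat_list_text(clips), encoding="utf-8")
    subprocess.run(
        ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", output],
        check=True,
    )


if __name__ == "__main__":
    print(concat_list_text(["fragment_1.mp4", "fragment_2.mp4", "fragment_3.mp4"]), end="")
```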

Technique 2: LoRA Style Consistency

Load a style LoRA in the Z-Image stage to ensure all keyframes share a unified visual style:

{
  "lora_loader": {
    "inputs": {
      "model": ["Z-Image Turbo Load", 0],
      "clip": ["CLIP Load", 0],
      "lora_name": "my_style_lora.safetensors",
      "strength_model": 1.0,
      "strength_clip": 1.0
    }
  }
}

Technique 3: Super Resolution Post-Processing

Upscale the keyframe before it is passed to Wan 2.2. A sharper input image indirectly improves the quality of the resulting video:

{
  "upscale_model": {
    "inputs": {
      "upscale_model": ["Upscale Model Load (4x-UltraMix)", 0],
      "image": ["zimage_k_sampler", 0],
      "scale_by": 2.0
    }
  }
}

Technique 4: Prompt Engineering Optimization

Optimal video prompt formula:

[Subject Description] + [Motion Description] + [Environment/Background] + [Style/Mood] + [Technical Specs]

Example:

An elegant white cat running under cherry blossom trees, petals drifting in the wind,
Japanese garden background, soft morning light,
Studio Ghibli animation style,
4K cinematic, smooth motion, gentle camera pan right
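
The formula is simple enough to encode as a helper, which keeps prompts consistent across keyframes. This is a minimal sketch; `build_video_prompt` is a hypothetical name and the comma-joined format is just one reasonable choice.

```python
def build_video_prompt(subject: str, motion: str, environment: str,
                       style: str, technical: str) -> str:
    """Join the five prompt parts in formula order, skipping empty ones."""
    parts = [subject, motion, environment, style, technical]
    cleaned = [p.strip().rstrip(",") for p in parts if p and p.strip()]
    return ", ".join(cleaned)


if __name__ == "__main__":
    print(build_video_prompt(
        subject="An elegant white cat running under cherry blossom trees",
        motion="petals drifting in the wind",
        environment="Japanese garden background, soft morning light",
        style="Studio Ghibli animation style",
        technical="4K cinematic, smooth motion, gentle camera pan right",
    ))
```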

Prompts to avoid:

  • Overly complex scene descriptions (multiple subjects moving simultaneously)
  • Rapid viewpoint switching requirements
  • Text rendering (text in video tends to distort easily)

Troubleshooting

Issue 1: Video Flickering

Cause: Wan 2.2 produces incoherent motion between certain frames.

Solutions:

  1. Reduce motion_strength (0.5-0.6)
  2. Increase inference steps to 30-50
  3. Add a temporal consistency node, if one is available in your ComfyUI installation
  4. Reduce input image complexity (fewer details and textures)

Issue 2: Insufficient VRAM

Cause: Wan 2.2 14B model + Z-Image Turbo loaded simultaneously.

Solutions:

  1. Sequential execution: Generate and save images first, then load Wan model for I2V
  2. FP8 quantization: Use an FP8 version of Wan 2.2 (roughly half the VRAM of FP16 weights)
  3. Reduce frames: From 81 to 49 frames (2-second video)
  4. Lower resolution: From 1024×576 to 832×480
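
To see why FP8 helps, it is enough to multiply parameter count by bytes per parameter. The sketch below estimates weight memory only; it ignores activations, the text encoder, the VAE, and any offloading, so treat the numbers as a rough lower bound rather than a measurement.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def weight_memory_gib(params_billion: float, dtype: str) -> float:
    """Approximate memory for model weights alone, in GiB."""
    return params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for dtype in ("fp16", "fp8"):
        print(f"14B weights in {dtype}: {weight_memory_gib(14, dtype):.1f} GiB")
```

Halving the bytes per parameter halves the weight footprint, which is where the "half the VRAM" rule of thumb comes from.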

Issue 3: Unnatural Motion

Cause: Insufficient motion description in prompts or inappropriate motion_strength settings.

Solutions:

  1. Explicitly describe motion direction in prompts ("slow pan right", "zoom in on subject")
  2. Try different motion_strength values (0.3-0.9)
  3. Use ControlNet depth maps to constrain motion trajectories
  4. Generate results with multiple seeds and select the best motion effect
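
Solutions 2 and 4 combine naturally into a small parameter sweep. The sketch below builds one set of inputs per (seed, motion_strength) pair from a base configuration; the input names follow the Wan 2.2 snippet in Step 3, and `motion_sweep` is a hypothetical helper.

```python
import copy
import itertools


def motion_sweep(base_inputs: dict, seeds, strengths) -> list:
    """Return one copy of base_inputs per (seed, motion_strength) combination."""
    jobs = []
    for seed, strength in itertools.product(seeds, strengths):
        inputs = copy.deepcopy(base_inputs)  # never mutate the template
        inputs["seed"] = seed
        inputs["motion_strength"] = strength
        jobs.append(inputs)
    return jobs


if __name__ == "__main__":
    base = {"num_frames": 81, "frame_rate": 24, "motion_strength": 0.7, "seed": 42}
    for job in motion_sweep(base, seeds=[1, 2], strengths=[0.3, 0.6, 0.9]):
        print(job)
```

Rendering short clips (fewer frames) for the sweep and re-rendering only the best combination at full length keeps the cost manageable.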

Issue 4: Image-Video Style Mismatch

Cause: Wan 2.2 changes the original image's style during I2V processing.

Solutions:

  1. Use the same prompt for Wan 2.2 as used with Z-Image
  2. Reduce motion_strength to minimize style drift
  3. Apply style transfer nodes in post-processing to unify color tones

Complete Workflow JSON Template

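Here's a simplified workflow template that shows the overall structure. Treat it as a schematic rather than a drop-in file: the "2_neg" negative-prompt reference is not defined in the template, and class names such as Wan2.2_I2V may need to be replaced with the node names provided by the custom nodes you have installed.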

{
  "1": {
    "class_type": "CheckpointLoaderSimple",
    "inputs": {"ckpt_name": "zimage_turbo.safetensors"}
  },
  "2": {
    "class_type": "CLIPTextEncode",
    "inputs": {
      "text": "your prompt here",
      "clip": ["1", 1]
    }
  },
  "3": {
    "class_type": "KSampler",
    "inputs": {
      "model": ["1", 0],
      "positive": ["2", 0],
      "negative": ["2_neg", 0],
      "steps": 8,
      "cfg": 1.5,
      "width": 1024,
      "height": 576,
      "seed": 42
    }
  },
  "4": {
    "class_type": "Wan2.2_I2V",
    "inputs": {
      "image": ["3", 0],
      "num_frames": 81,
      "frame_rate": 24,
      "motion_strength": 0.7
    }
  },
  "5": {
    "class_type": "VHS_VideoCombine",
    "inputs": {
      "images": ["4", 0],
      "frame_rate": 24,
      "format": "video/h264-mp4"
    }
  }
}
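
Once a workflow like this has been adapted to your installed nodes, it can be queued from a script through ComfyUI's HTTP API by POSTing it to the /prompt endpoint. The sketch below assumes a local server on the default address http://127.0.0.1:8188; `set_input`, `build_payload`, and `queue_prompt` are illustrative helper names.

```python
import copy
import json
import urllib.request
import uuid


def set_input(workflow: dict, node_id: str, name: str, value) -> dict:
    """Return a copy of the workflow with one node input replaced."""
    patched = copy.deepcopy(workflow)
    patched[node_id]["inputs"][name] = value
    return patched


def build_payload(workflow: dict, client_id: str) -> bytes:
    """Serialize the request body expected by the /prompt endpoint."""
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")


def queue_prompt(workflow: dict, server: str = "http://127.0.0.1:8188") -> dict:
    """Send the workflow to a running ComfyUI server and return its reply."""
    request = urllib.request.Request(
        f"{server}/prompt",
        data=build_payload(workflow, str(uuid.uuid4())),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    with open("workflow_api.json", encoding="utf-8") as f:  # hypothetical file
        workflow = json.load(f)
    workflow = set_input(workflow, "2", "text", "a futuristic cityscape at sunset")
    print(queue_prompt(workflow))
```

Patching a copy of the template for each run keeps the original file untouched, which is handy when generating several keyframes from the same workflow.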

Summary

The Z-Image + Wan 2.2 + ControlNet combination gives open-source video generation fine-grained control over composition, style, and motion. By properly configuring these three components, you can achieve:

  1. Precise keyframe control: Z-Image's ControlNet keeps each keyframe's composition and style close to what you intended
  2. Natural motion transitions: Wan 2.2's image-to-video capability transforms static visuals into fluid animation
  3. Scalable workflow: Supports LoRA, super resolution, multi-clip stitching and other advanced features

Next Steps

  • Beginner: Run Z-Image and Wan 2.2 separately first to familiarize yourself with their individual parameters
  • Intermediate: Build the complete chained workflow, experiment with different motion strengths and prompt combinations
  • Advanced: Incorporate ControlNet, LoRA, and multi-keyframe interpolation for complex video sequences

As Wan 2.2 and Z-Image models continue to evolve, this combined workflow's capabilities will keep improving. Stay tuned to HuggingFace and the ComfyUI community for the latest model updates, features, and performance optimizations.

Z-Image Team
