Z-Image + Depth Anything V3: 3D Depth-Aware Control Workflow

May 4, 2026

From legacy preprocessors to next-gen depth estimation — inject real spatial understanding into Z-Image ControlNet with Depth Anything V3.


Why Better Depth Maps Matter

In Z-Image ControlNet workflows, depth maps are one of the most critical control signals. They determine the perspective, spatial hierarchy, and object proportions of generated images. Traditional depth estimation methods (MiDaS, ZoeDepth) have several notable limitations:

  • Detail loss: Weak depth discrimination for distant objects
  • Blurred boundaries: Insufficient depth transitions at object edges
  • Multi-scale inconsistency: Difficulty coordinating depth ratios between foreground and background

Depth Anything V3 (ByteDance, 2025) addresses these issues. Trained on large-scale depth-labeled data, it supports monocular depth estimation, camera pose estimation, and 3D point cloud output — all available in ComfyUI via the ComfyUI-DepthAnythingV3 plugin.


Core Capabilities of Depth Anything V3

Monocular Depth Estimation

Generate high-precision depth maps from a single 2D image, with three model sizes to trade speed against accuracy:

Model       Parameters   Inference Speed (RTX 4090)   Accuracy
V3-Small    24M          ~50ms                        High
V3-Metric   48M          ~80ms                        Highest (absolute distance)
V3-Large    180M         ~150ms                       Ultimate
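
Outside ComfyUI, you can sanity-check a depth map in a few lines of Python with the Hugging Face "depth-estimation" pipeline. The checkpoint ID below is a hypothetical placeholder; substitute whichever Depth Anything V3 weights you actually use:

from transformers import pipeline
from PIL import Image

# Hypothetical checkpoint ID -- replace with your actual V3 weights
pipe = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V3-Small-hf")

result = pipe(Image.open("room.jpg"))
result["depth"].save("room_depth.png")  # normalized depth map for viewing
raw = result["predicted_depth"]         # raw tensor, e.g. for metric work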

Multi-View Consistency

The biggest breakthrough of V3 over V2: when processing multiple images of the same scene from different angles, V3's Cross-View Attention mechanism keeps depth estimates consistent across all views. This means:

  • Video frame depth maps don't "flicker"
  • Multi-angle 3D point clouds are conflict-free
  • Ideal for architecture and interior scenes requiring precise spatial relationships

Camera Pose Estimation

Beyond depth, V3 estimates camera parameters (focal length, field of view, pose) — data directly usable for 3D reconstruction or VR/AR applications.
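
Those intrinsics plug straight into standard pinhole back-projection. A minimal sketch of turning a metric depth map plus focal length into a point cloud (the function and the example values are illustrative, not part of the plugin's API):

import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Illustrative intrinsics for a 1024x1024 metric depth map:
# points = depth_to_point_cloud(depth_map, fx=1100.0, fy=1100.0, cx=512.0, cy=512.0)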


Complete ComfyUI Workflow Setup

Step 1: Install ComfyUI-DepthAnythingV3

cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-DepthAnythingV3.git
cd ComfyUI-DepthAnythingV3
pip install -r requirements.txt

Step 2: Download Model Files

# Run from ComfyUI/models/ so the weights land in ComfyUI/models/depth_anything/
# Small version (recommended for daily use)
huggingface-cli download depth-anything-2/depth-anything-v3-small --local-dir ./depth_anything/v3-small

# Metric version (for absolute depth distance)
huggingface-cli download depth-anything-2/depth-anything-v3-metric --local-dir ./depth_anything/v3-metric

Step 3: Build Z-Image + Depth V3 + ControlNet Workflow

Core node connections:

LoadImage (reference image)
    ↓
DepthAnythingV3Preprocessor (generate depth map)
    ↓
ControlNetApply (Z-Image-Turbo-ControlNet-Union)
    ↓
CLIPTextEncode (Prompt)
    ↓
KSampler (Z-Image Turbo)
    ↓
VAEDecode → SaveImage
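
If you drive ComfyUI programmatically, the same graph can be queued over its local HTTP API. A minimal sketch, assuming you exported the graph above via "Save (API Format)" as workflow_api.json (node IDs inside the file are specific to your graph):

import json
import urllib.request

with open("workflow_api.json") as f:
    workflow = json.load(f)

# Patch inputs before queueing; "6" is a placeholder for your CLIPTextEncode node
# workflow["6"]["inputs"]["text"] = "modern luxury living room interior"

req = urllib.request.Request(
    "http://127.0.0.1:8188/prompt",
    data=json.dumps({"prompt": workflow}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(urllib.request.urlopen(req).read().decode())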

Key Parameters:

Parameter             Recommended   Notes
ControlNet strength   0.6-0.8       Don't over-control depth
Denoise               0.7-0.85      Preserve structural info
CFG Scale             2.0-4.0       Low CFG for Z-Image Turbo
Steps                 20-30         Depth control needs more steps
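
As a concrete reference, these ranges translate into KSampler inputs like the following (a hypothetical mid-point configuration, not the only valid one):

ksampler_inputs = {
    "steps": 24,      # 20-30: depth control benefits from more steps
    "cfg": 3.0,       # 2.0-4.0: keep CFG low for Z-Image Turbo
    "denoise": 0.8,   # 0.7-0.85: preserve structure from the depth map
    "sampler_name": "euler",
    "scheduler": "normal",
    "seed": 42,
}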

Practical Use Case: Interior Design Style Transfer

Take a bare room photo and generate a luxury interior:

Prompt Example:

Modern luxury living room interior, marble floor, floor-to-ceiling windows,
warm ambient lighting, minimalist furniture, high-end materials,
photorealistic, architectural photography, 8k, detailed textures

Workflow Tips:

  1. Depth map input: the V3 depth map can be fed to ControlNet directly; no binarization or thresholding needed
  2. ControlNet strength tuning (see the sweep sketch after this list):
    • 0.4-0.5: Rough spatial structure only, high style variation
    • 0.6-0.7: Balance between structure and creativity (recommended start)
    • 0.8-1.0: Strict adherence to original layout
  3. Combine with Inpainting: Use masks to rework unsatisfactory areas
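
To find the right strength band empirically, you can queue the same workflow at several strengths in one run. A sketch against the local API endpoint shown earlier (node ID "10" is a placeholder for your ControlNetApply node):

import json
import urllib.request

with open("workflow_api.json") as f:
    base = json.load(f)

for strength in (0.4, 0.5, 0.6, 0.7, 0.8):
    wf = json.loads(json.dumps(base))            # deep copy via round-trip
    wf["10"]["inputs"]["strength"] = strength    # "10": your ControlNetApply node
    req = urllib.request.Request(
        "http://127.0.0.1:8188/prompt",
        data=json.dumps({"prompt": wf}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)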

Comparison:

Method                           Spatial Accuracy   Style Freedom   Inference Time
MiDaS + ControlNet               ⭐⭐⭐               ⭐⭐⭐⭐            ~2s
ZoeDepth + ControlNet            ⭐⭐⭐⭐              ⭐⭐⭐⭐            ~3s
Depth Anything V3 + ControlNet   ⭐⭐⭐⭐⭐             ⭐⭐⭐⭐⭐           ~4s

Video Frame Depth Consistency Workflow

Leverage V3's multi-view consistency to generate coherent depth maps across video frames, then stylize each frame with Z-Image:

LoadVideo (input video)
    ↓
DepthAnythingV3Preprocessor (multi_view=True)
    ↓
[Per-frame]
    ↓
ControlNetApply + KSampler
    ↓
VAEDecode
    ↓
SaveAnimatedPNG / VideoCombine

Key Settings:

  • multi_view=True: Enable cross-view consistency
  • temporal_smoothing=0.7: Temporal smoothing factor
  • Keep ControlNet strength consistent across frames

Result: Stylized video where objects don't "jump" or "flicker" — spatial relationships remain stable throughout.
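
Conceptually, the temporal_smoothing factor behaves like an exponential moving average over successive depth maps. The plugin's internal implementation may differ, but the following sketch shows the idea:

import numpy as np

def smooth_depth_sequence(depth_frames, alpha=0.7):
    # alpha plays the role of temporal_smoothing: higher = smoother,
    # but too high can lag behind fast camera motion
    smoothed, running = [], None
    for d in depth_frames:
        running = d if running is None else alpha * running + (1 - alpha) * d
        smoothed.append(running)
    return smoothed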


Troubleshooting

Q1: Depth map looks "blurry" and object edges are unclear

Cause: Input resolution is too low, or the Small model is being used on a complex scene.

Fix:

  • Switch to V3-Metric or V3-Large
  • Increase input resolution to 1024x1024
  • Verify post_process=True (default on)

Q2: ControlNet too strong, image looks rigid

Cause: Strength too high or denoise too low.

Fix:

  • Start strength at 0.6 and reduce from there
  • Raise denoise above 0.8
  • Try lowering CFG Scale to 2.0-3.0

Q3: Video frame depth inconsistency

Cause: Multi-view mode not enabled or temporal_smoothing too low.

Fix:

  • Confirm multi_view=True
  • Set temporal_smoothing to 0.6-0.9
  • Ensure stable frame rate (no frame skipping)

Summary

Depth Anything V3 brings three key upgrades to Z-Image ControlNet workflows:

  1. Accuracy leap: Monocular depth estimation surpasses MiDaS/ZoeDepth with sharper boundaries
  2. Multi-view consistency: Cross-frame/angle depth maps no longer "flicker"
  3. Camera pose output: Ready-to-use 3D data for downstream applications

For professional scenes like architectural visualization, interior design, and video stylization, the Depth Anything V3 + Z-Image ControlNet combo has become the new standard depth control workflow.


This workflow uses ComfyUI + Z-Image Turbo + Depth Anything V3 + ControlNet Union 2.1 — all open source and free.

Z-Image Team