Z-Image + Depth Anything V3: 3D Depth-Aware Control Workflow
From legacy preprocessors to next-gen depth estimation — inject real spatial understanding into Z-Image ControlNet with Depth Anything V3.
Why Better Depth Maps Matter
In Z-Image ControlNet workflows, depth maps are one of the most critical control signals. They determine the perspective, spatial hierarchy, and object proportions of generated images. Traditional depth estimation methods (MiDaS, ZoeDepth) have several notable limitations:
- Detail loss: Weak depth discrimination for distant objects
- Blurred boundaries: Insufficient depth transitions at object edges
- Multi-scale inconsistency: Difficulty coordinating depth ratios between foreground and background
Depth Anything V3 (ByteDance, 2025) addresses these issues. Trained on large-scale depth-labeled data, it supports monocular depth estimation, camera pose estimation, and 3D point cloud output — all available in ComfyUI via the ComfyUI-DepthAnythingV3 plugin.
Core Capabilities of Depth Anything V3
Monocular Depth Estimation
Generate high-precision depth maps from single 2D images across multiple resolutions:
| Model | Parameters | Inference Speed (RTX 4090) | Accuracy |
|---|---|---|---|
| V3-Small | 24M | ~50ms | High |
| V3-Metric | 48M | ~80ms | Highest for absolute (metric) depth |
| V3-Large | 180M | ~150ms | Highest for relative depth |
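Whichever variant you pick, ControlNet expects the depth map as an 8-bit image, while depth models typically emit a float array at arbitrary scale. The preprocessor node handles this for you; the sketch below just illustrates the normalization step, assuming a relative-depth array where larger values mean farther away (the near=bright output convention is also an assumption, matching common ControlNet depth maps):

```python
import numpy as np

def depth_to_control_image(depth: np.ndarray) -> np.ndarray:
    """Normalize a raw float depth map to an 8-bit ControlNet image.

    Assumes `depth` holds relative depth (larger = farther); output uses
    the common convention of near = bright, far = dark.
    """
    d = depth.astype(np.float32)
    d = (d - d.min()) / max(float(d.max() - d.min()), 1e-8)  # scale to [0, 1]
    d = 1.0 - d                                              # invert: near = bright
    return (d * 255.0).round().astype(np.uint8)

# Tiny 2x2 example depth map
demo = np.array([[0.5, 1.0], [2.0, 4.0]], dtype=np.float32)
print(depth_to_control_image(demo))
```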
Multi-View Consistency
The biggest breakthrough of V3 over V2: when processing multiple images of the same scene from different angles, its Cross-View Attention mechanism keeps depth estimates consistent across all views. This means:
- Video frame depth maps don't "flicker"
- Multi-angle 3D point clouds are conflict-free
- Ideal for architecture and interior scenes requiring precise spatial relationships
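Monocular depth is only defined up to an affine ambiguity (scale and shift), so "consistent across views" means all views share the same scale and shift. A quick way to check this on two overlapping depth maps is a least-squares affine fit: for consistent maps the fit is near-identity. This numpy sketch is illustrative only, not the plugin's actual mechanism:

```python
import numpy as np

def align_depth(src: np.ndarray, ref: np.ndarray):
    """Fit scale s and shift t so that s * src + t ≈ ref (least squares).

    For cross-view-consistent depth maps, the fit over the overlapping
    region is near-identity (s ≈ 1, t ≈ 0).
    """
    s, t = np.polyfit(src.ravel(), ref.ravel(), deg=1)
    return s, t, s * src + t

# Two "views" of the same scene differing only by the affine ambiguity
ref = np.linspace(1.0, 5.0, 16).reshape(4, 4)
src = (ref - 0.5) / 2.0          # src = 0.5 * ref - 0.25
s, t, aligned = align_depth(src, ref)
print(round(s, 3), round(t, 3))  # recovers s = 2.0, t = 0.5
```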
Camera Pose Estimation
Beyond depth, V3 estimates camera parameters (focal length, field of view, pose) — data directly usable for 3D reconstruction or VR/AR applications.
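Field of view and focal length are interchangeable once the image width is known, and downstream 3D tools usually want intrinsics as a focal length in pixels. The standard pinhole conversion is f = (W/2) / tan(fov/2):

```python
import math

def fov_to_focal_px(fov_deg: float, width_px: int) -> float:
    """Horizontal field of view (degrees) -> focal length in pixels."""
    return (width_px / 2) / math.tan(math.radians(fov_deg) / 2)

def focal_px_to_fov(f_px: float, width_px: int) -> float:
    """Inverse: focal length in pixels -> horizontal FOV in degrees."""
    return math.degrees(2 * math.atan((width_px / 2) / f_px))

f = fov_to_focal_px(90.0, 1024)            # 90° FOV on a 1024-px-wide image
print(round(f, 1))                         # → 512.0  (tan 45° = 1)
print(round(focal_px_to_fov(f, 1024), 1))  # → 90.0 (round trip)
```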
Complete ComfyUI Workflow Setup
Step 1: Install ComfyUI-DepthAnythingV3
cd ComfyUI/custom_nodes
git clone https://github.com/kijai/ComfyUI-DepthAnythingV3.git
cd ComfyUI-DepthAnythingV3
pip install -r requirements.txt
Step 2: Download Model Files
# Place in ComfyUI/models/depth_anything/
# Small version (recommended for daily use)
huggingface-cli download depth-anything-2/depth-anything-v3-small --local-dir ./depth_anything/v3-small
# Metric version (for absolute depth distance)
huggingface-cli download depth-anything-2/depth-anything-v3-metric --local-dir ./depth_anything/v3-metric
Step 3: Build Z-Image + Depth V3 + ControlNet Workflow
Core node connections:
LoadImage (reference image)
↓
DepthAnythingV3Preprocessor (generate depth map)
↓
ControlNetApply (Z-Image-Turbo-ControlNet-Union)
  ↑ conditioning from CLIPTextEncode (Prompt)
↓
KSampler (Z-Image Turbo)
↓
VAEDecode → SaveImage
Key Parameters:
| Parameter | Recommended | Notes |
|---|---|---|
| ControlNet strength | 0.6-0.8 | Don't over-control depth |
| Denoise | 0.7-0.85 | Preserve structural info |
| CFG Scale | 2.0-4.0 | Low CFG for Z-Image Turbo |
| Steps | 20-30 | Depth control needs more steps |
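A workflow exported in ComfyUI's API format stores each node as a dict of inputs, which makes the table easy to pin down concretely. The sketch below shows how the sampler settings might map onto the KSampler node — node IDs and link targets ("4", "6", …) are placeholders, not an actual export — and note that ControlNet strength lives on the ControlNetApply node, not here:

```python
# Recommended settings from the table, expressed as a KSampler entry in a
# ComfyUI API-format workflow dict. Node IDs ("4", "6", ...) are placeholders.
ksampler = {
    "class_type": "KSampler",
    "inputs": {
        "seed": 0,
        "steps": 25,           # 20-30: depth control benefits from more steps
        "cfg": 3.0,            # 2.0-4.0: keep CFG low for Z-Image Turbo
        "sampler_name": "euler",
        "scheduler": "normal",
        "denoise": 0.8,        # 0.7-0.85: preserve structural info
        "model": ["4", 0],     # link to the model loader node
        "positive": ["6", 0],  # link to ControlNet-modified conditioning
        "negative": ["7", 0],
        "latent_image": ["5", 0],
    },
}
print(ksampler["inputs"]["cfg"], ksampler["inputs"]["denoise"])
```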
Practical Use Case: Interior Design Style Transfer
Take a bare room photo and generate a luxury interior:
Prompt Example:
Modern luxury living room interior, marble floor, floor-to-ceiling windows,
warm ambient lighting, minimalist furniture, high-end materials,
photorealistic, architectural photography, 8k, detailed textures
Workflow Tips:
- Depth map input: V3 depth map feeds directly — no binarization needed
- ControlNet strength tuning:
- 0.4-0.5: Rough spatial structure only, high style variation
- 0.6-0.7: Balance between structure and creativity (recommended start)
- 0.8-1.0: Strict adherence to original layout
- Combine with Inpainting: Use masks to rework unsatisfactory areas
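The inpainting tip boils down to a masked composite: keep the original image where the mask is 0 and take the regenerated result where it is 1. A minimal numpy sketch of that blend (`masked_blend` is an illustrative helper, not a ComfyUI node; the actual inpainting pipeline works in latent space):

```python
import numpy as np

def masked_blend(original: np.ndarray, regenerated: np.ndarray,
                 mask: np.ndarray) -> np.ndarray:
    """Composite: keep `original` where mask=0, take `regenerated` where mask=1.

    `mask` is float in [0, 1] with shape (H, W); images are (H, W, C) floats.
    Fractional mask values give a soft transition at the seam.
    """
    m = mask[..., None]  # broadcast mask over the channel axis
    return original * (1.0 - m) + regenerated * m

orig = np.zeros((2, 2, 3))             # "original" image (all black)
regen = np.ones((2, 2, 3))             # "regenerated" image (all white)
mask = np.array([[0.0, 1.0], [0.5, 0.0]])
out = masked_blend(orig, regen, mask)
print(out[0, 1], out[1, 0])            # fully reworked pixel vs. 50% blend
```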
Comparison:
| Method | Spatial Accuracy | Style Freedom | Inference Time |
|---|---|---|---|
| MiDaS + ControlNet | ⭐⭐⭐ | ⭐⭐⭐⭐ | ~2s |
| ZoeDepth + ControlNet | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ~3s |
| Depth Anything V3 + ControlNet | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ~4s |
Video Frame Depth Consistency Workflow
Leverage V3's multi-view consistency to generate coherent depth maps across video frames, then stylize each frame with Z-Image:
LoadVideo (input video)
↓
DepthAnythingV3Preprocessor (multi_view=True)
↓
[Per-frame]
↓
ControlNetApply + KSampler
↓
VAEDecode
↓
SaveAnimatedPNG / VideoCombine
Key Settings:
- multi_view=True: Enable cross-view consistency
- temporal_smoothing=0.7: Temporal smoothing factor
- Keep ControlNet strength consistent across frames
Result: Stylized video where objects don't "jump" or "flicker" — spatial relationships remain stable throughout.
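A temporal smoothing factor like temporal_smoothing=0.7 behaves like an exponential moving average over consecutive depth frames. The sketch below shows that idea on a deliberately flickering sequence; the plugin's exact semantics may differ, and `smooth_depth_frames` is an illustrative helper, not a node:

```python
import numpy as np

def smooth_depth_frames(frames, alpha: float = 0.7):
    """EMA across frames: out[t] = alpha * out[t-1] + (1 - alpha) * frames[t].

    Higher alpha = smoother (less flicker) but slower response to real motion.
    """
    out = [frames[0].astype(np.float32)]
    for f in frames[1:]:
        out.append(alpha * out[-1] + (1.0 - alpha) * f.astype(np.float32))
    return out

# A static scene whose raw depth flickers: 1.0 -> 2.0 -> 1.0
frames = [np.full((2, 2), v) for v in (1.0, 2.0, 1.0)]
sm = smooth_depth_frames(frames, alpha=0.7)
print([round(float(s[0, 0]), 2) for s in sm])  # → [1.0, 1.3, 1.21]
```

The jump to 2.0 is damped to 1.3 and decays back toward 1.0 — exactly the "no flicker" behavior described above.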
Troubleshooting
Q1: Depth map looks "blurry", object edges unclear
Cause: Low resolution or Small model on complex scenes.
Fix:
- Switch to V3-Metric or V3-Large
- Increase input resolution to 1024x1024
- Verify post_process=True (default on)
Q2: ControlNet too strong, image looks rigid
Cause: Strength too high or denoise too low.
Fix:
- Start strength at 0.6 and decrease
- Raise denoise above 0.8
- Try lowering CFG Scale to 2.0-3.0
Q3: Video frame depth inconsistency
Cause: Multi-view mode not enabled or temporal_smoothing too low.
Fix:
- Confirm multi_view=True
- Set temporal_smoothing to 0.6-0.9
- Ensure stable frame rate (no frame skipping)
Summary
Depth Anything V3 brings three key upgrades to Z-Image ControlNet workflows:
- Accuracy leap: Monocular depth estimation surpasses MiDaS/ZoeDepth with sharper boundaries
- Multi-view consistency: Cross-frame/angle depth maps no longer "flicker"
- Camera pose output: Ready-to-use 3D data for downstream applications
For professional scenes like architectural visualization, interior design, and video stylization, the Depth Anything V3 + Z-Image ControlNet combo has become the new standard depth control workflow.
This workflow uses ComfyUI + Z-Image Turbo + Depth Anything V3 + ControlNet Union 2.1 — all open source and free.