Z-Image De-Turbo De-Distilled Model Deep Dive: Breaking Turbo Limits with the Next-Gen Model

May 27, 2026

Keywords: z-image de-turbo model


Introduction

Z-Image Turbo gets its speed from distillation, which cuts inference from 20-30 steps to just 4-8. The trade-off is a measurable loss in quality. Z-Image De-Turbo is a model that uses "de-distillation" to recover near-Base generation quality while keeping part of Turbo's speed advantage.

What is De-Distillation

Distillation Limitations

Traditional model distillation trains a smaller "student model" to mimic a larger "teacher model" for faster inference. However, this approach has inherent limitations:

  1. Information Loss: The student cannot fully capture all knowledge from the teacher
  2. Distribution Shift: The distribution of the distilled model's outputs differs from the original data distribution
  3. Quality Ceiling: Distilled model quality typically has a lower ceiling than the original

De-Distillation Philosophy

De-Distillation takes a reverse approach: rather than compressing the model, it recovers information lost during distillation. Core strategies include:

  1. Using distilled model outputs as training data: Retrain the model using images generated by Turbo
  2. Mixing original and synthetic data: Combine original high-quality data with Turbo-synthesized data
  3. Targeted compensation for distillation losses: Additional training steps to recover lost detail information

Core Differences: De-Turbo vs Turbo

Overview Comparison

Feature | Turbo | De-Turbo | Base
Inference Steps | 4-8 | 10-15 | 20-30
Time per Image (RTX 4090, 1024px) | ~1.5s | ~3s | ~5s
FID | ~5.2 | ~4.0 | ~3.8
CLIP Score | ~0.270 | ~0.282 | ~0.285
HPSv2 | ~79.5 | ~81.8 | ~83.1
Model Size | 6B | 6B | 6B

Key Advantages

  1. Quality Recovery: De-Turbo's FID recovers from Turbo's 5.2 to 4.0, approaching Base's 3.8
  2. Speed Retention: 10-15 inference steps, about half of Base's 20-30, which makes generation roughly 1.7x faster than Base (~3s vs ~5s per image)
  3. No Extra Hardware Required: Same model size as Turbo/Base, no additional VRAM needed

Technical Principles

De-Distillation Training Pipeline

Original Training Data → Z-Image Turbo Inference → Synthetic Image Dataset
                                          ↓
Original Data + Synthetic Data → Joint Training → Z-Image De-Turbo
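
A minimal sketch of the data-assembly stage of this pipeline, assuming the 70/30 split and the quality filter listed under Key Technical Points. The function name, the `scores` input, and the `min_score` threshold are hypothetical; the source does not say how synthetic samples are scored.

```python
import random

def build_mixed_dataset(original, synthetic, scores, *,
                        synthetic_fraction=0.30, min_score=0.8, seed=0):
    """Return a shuffled list in which about `synthetic_fraction` of samples are synthetic.

    original:  original high-quality training samples
    synthetic: Turbo-generated samples
    scores:    one quality score per synthetic sample (hypothetical scorer, higher is better)
    """
    rng = random.Random(seed)
    # Quality filter: keep only high-scoring synthetic samples.
    kept = [s for s, q in zip(synthetic, scores) if q >= min_score]
    # Solve n_syn / (n_orig + n_syn) = f for n_syn.
    target = round(len(original) * synthetic_fraction / (1.0 - synthetic_fraction))
    # If too few samples survive the filter, use all of them rather than lower the bar.
    n_syn = min(target, len(kept))
    mixed = list(original) + rng.sample(kept, n_syn)
    rng.shuffle(mixed)
    return mixed
```

With 70 original samples and enough filtered synthetic samples, the function adds 30 synthetic ones, giving the 70/30 ratio. If the filter leaves too few, the synthetic share simply ends up below 30%.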

Key Technical Points

  1. Data Mixing Strategy

    • 70% original high-quality training data
    • 30% Turbo-generated synthetic data
    • Synthetic data quality-filtered, keeping only high-scoring samples
  2. Loss Function Design

    • Standard diffusion loss + distillation loss + consistency loss
    • Consistency loss ensures De-Turbo remains compatible with Turbo for fast inference
  3. Step Optimization

    • De-Turbo recommends 10-15 inference steps
    • Extra 6-7 steps beyond Turbo for detail recovery
    • 50-65% fewer steps than a 30-step Base run
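
The three-term objective from the Loss Function Design point can be sketched as a weighted sum. The source names the terms but gives neither their definitions nor their weights, so the weights below are hypothetical and each term is assumed to be a mean squared error already computed elsewhere.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def de_turbo_loss(diffusion_loss, distillation_loss, consistency_loss,
                  w_distill=0.5, w_consistency=0.1):
    """Combine the three loss terms; both weights are illustrative assumptions."""
    return (diffusion_loss
            + w_distill * distillation_loss
            + w_consistency * consistency_loss)
```

In a real training loop the same combination would be applied to framework tensors; plain floats are used here only to keep the sketch self-contained.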

Performance Comparison

Automated Metrics

Metric | Turbo (8 steps) | De-Turbo (12 steps) | Base (30 steps)
FID (↓) | 5.18 | 4.02 | 3.82
CLIP Score (↑) | 0.271 | 0.282 | 0.285
HPSv2 (↑) | 79.6 | 81.8 | 83.2
DPG (↑) | 76% | 80% | 82%

Quality Dimensions

Dimension | Turbo | De-Turbo | Base
Prompt Adherence | 7.5/10 | 8.2/10 | 8.5/10
Detail Richness | 7.0/10 | 8.0/10 | 8.5/10
Texture Performance | 6.5/10 | 7.8/10 | 8.2/10
Text Rendering | 6.5/10 | 7.2/10 | 7.5/10
Face Quality | 7.0/10 | 7.8/10 | 8.0/10

Speed Comparison (RTX 4090, 1024x1024)

Version | Single Image | 4-Image Batch | 10-Image Batch
Turbo (8 steps) | 1.5s | 5.8s | 14.2s
De-Turbo (12 steps) | 2.8s | 10.5s | 25.8s
Base (30 steps) | 5.0s | 18.5s | 45.0s

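Key Finding: At 12 steps, De-Turbo takes about 1.9x as long per image as Turbo (2.8s vs 1.5s) but is about 1.8x faster than Base (2.8s vs 5.0s), and it closes roughly 85% of the FID gap between Turbo and Base (5.18 → 4.02, against Base's 3.82).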
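
The speed and quality ratios can be recomputed from the single-image timings and the FID row in the tables above. This is a small worked calculation; the helper names are illustrative, not part of any library.

```python
# Figures copied from the Automated Metrics and Speed Comparison tables.
latency_s = {"Turbo": 1.5, "De-Turbo": 2.8, "Base": 5.0}
fid = {"Turbo": 5.18, "De-Turbo": 4.02, "Base": 3.82}

def relative_speed(model, reference):
    """Throughput of `model` relative to `reference` (inverse latency ratio)."""
    return latency_s[reference] / latency_s[model]

def gap_closed(metric, model="De-Turbo", worst="Turbo", best="Base"):
    """Fraction of the worst-to-best gap that `model` recovers on `metric`.

    Works for lower-is-better and higher-is-better metrics alike,
    because the sign cancels in the ratio.
    """
    return (metric[worst] - metric[model]) / (metric[worst] - metric[best])

print(f"De-Turbo throughput vs Turbo: {relative_speed('De-Turbo', 'Turbo'):.0%}")
print(f"De-Turbo speedup over Base:   {relative_speed('De-Turbo', 'Base'):.2f}x")
print(f"FID gap closed:               {gap_closed(fid):.0%}")
```

On these figures De-Turbo runs at a little over half of Turbo's throughput, is just under 1.8x faster than Base, and recovers about 85% of the FID gap.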

Training Methods

LoRA Fine-Tuning

De-Turbo supports standard LoRA fine-tuning, compatible with Base and Turbo workflows:

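training_config = {
    "model_path": "Tongyi-MAI/Z-Image-De-Turbo",  # base checkpoint to fine-tune
    "learning_rate": 2e-5,   # note: Prodigy adapts its own step size and is usually run with a value near 1.0
    "train_steps": 1500,     # total optimizer steps
    "batch_size": 4,
    "rank_dimension": 32,    # LoRA rank r
    "alpha": 16,             # LoRA alpha; effective scale is alpha / rank = 0.5
    "dropout": 0.1,          # dropout applied on the LoRA branch
    "optimizer": "prodigy",
}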
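
To make the `rank_dimension` and `alpha` settings concrete, the sketch below computes the standard LoRA scaling factor (alpha / rank) and the number of trainable parameters LoRA adds to one linear layer. The 4096 x 4096 layer shape in the usage note is a hypothetical example, not a documented dimension of Z-Image.

```python
def lora_scale(rank, alpha):
    """Scaling applied to the low-rank update: delta_W = (alpha / rank) * B @ A."""
    return alpha / rank

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added to one linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)
```

With rank 32 and alpha 16 the update is scaled by 0.5, and a hypothetical 4096 x 4096 projection gains 262,144 trainable parameters, against 16,777,216 in the frozen weight itself.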

DreamBooth Training

dreambooth_config = {
    "model_path": "Tongyi-MAI/Z-Image-De-Turbo",
    "instance_prompt": "a photo of [trigger_word] person",
    "num_epochs": 100,
    "learning_rate": 1e-5,
    "resolution": 768,
    "mixed_precision": "fp16",
}
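
Two small helpers show how this config might be consumed: one fills the `[trigger_word]` placeholder, the other estimates total optimizer steps from `num_epochs`. The trigger token `sks`, the image count, and the batch size are assumptions for illustration; the config above does not specify them.

```python
import math

def fill_instance_prompt(template, trigger_word):
    """Substitute the placeholder used in `instance_prompt` with a concrete token."""
    return template.replace("[trigger_word]", trigger_word)

def total_optimizer_steps(num_images, num_epochs, batch_size=1, grad_accum=1):
    """Optimizer steps for a run, rounding up partial batches in each epoch."""
    steps_per_epoch = math.ceil(num_images / (batch_size * grad_accum))
    return steps_per_epoch * num_epochs

prompt = fill_instance_prompt("a photo of [trigger_word] person", "sks")
steps = total_optimizer_steps(num_images=20, num_epochs=100, batch_size=4)
```

For 20 instance images at batch size 4, the 100 epochs in the config come to 500 optimizer steps.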

Use Cases

Recommended Scenarios

  1. Quality-Speed Balance: Need better quality than Turbo but can't afford Base's speed cost
  2. Professional Content Creation: Designers, photographers needing quality with fast iteration
  3. Medium Batch Production: 50-500 images/day medium-scale production
  4. API Services (Medium Latency): Online services accepting 2-3 second latency
  5. Education/Training: Teaching demos showing quality output efficiently
  6. LoRA Training Experiments: Need quality fine-tuning output with fast feedback

When to Choose Another Model

  1. Extreme Speed Needs: Sub-second response required → Use Turbo
  2. Extreme Quality Needs: Ultimate detail requirements → Use Base
  3. Massive Batch Production: Thousands of images/day → Use Turbo
  4. Academic Benchmarking: Need standard Base as reference → Use Base
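
The guidance in these lists can be condensed into a small selection helper. This is a sketch under stated assumptions: the cut-offs (under 1 second, 2 seconds or more, 1000 images per day for "thousands") are one reading of the lists, and the function name is hypothetical.

```python
def choose_variant(latency_budget_s, images_per_day, need_reference_quality=False):
    """Pick a Z-Image variant from a latency budget and a daily volume.

    Thresholds are illustrative interpretations of the use-case lists.
    """
    if need_reference_quality:
        # Ultimate detail or benchmarking against the standard model.
        return "Base"
    if latency_budget_s < 1.0 or images_per_day >= 1000:
        # Sub-second responses or massive batch production.
        return "Turbo"
    if latency_budget_s >= 2.0:
        # Services that can accept a 2-3 second latency.
        return "De-Turbo"
    # Budgets between 1 and 2 seconds leave no room for De-Turbo's extra steps.
    return "Turbo"
```

For example, `choose_variant(3.0, 200)` returns `"De-Turbo"`, while the same latency budget at 5000 images per day returns `"Turbo"`.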

Real-World Test Results

Prompt Test

Test prompt: "A detailed still life painting of a vintage camera on a wooden desk, soft window light, film photography aesthetic, shallow depth of field"

Dimension | Turbo | De-Turbo | Base
Camera Detail | Basic outline | Screws, knobs visible | Fine textures clear
Wood Texture | Simple texture | Natural grain | Highly realistic grain
Lighting Effect | Basic but reasonable | Rich layers | Cinematic lighting
Depth of Field | Reasonable blur | Natural gradient | Precise gradient

Batch Test (100 Prompts)

Metric | Turbo | De-Turbo | Base
Average FID | 5.21 | 4.05 | 3.83
Average CLIP Score | 0.270 | 0.281 | 0.285
Prompt Adherence Rate | 84% | 89% | 92%
Total Generation Time (RTX 4090) | ~2.5 min | ~4.8 min | ~8.5 min
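
The total generation times can be sanity-checked against the single-image timings from the speed comparison. The sketch assumes images are generated one at a time with no model-loading or I/O overhead, so it gives a lower bound rather than a prediction.

```python
# Single-image timings (seconds) from the Speed Comparison table.
PER_IMAGE_S = {"Turbo": 1.5, "De-Turbo": 2.8, "Base": 5.0}

def batch_minutes(variant, num_images):
    """Lower-bound wall time in minutes for sequential generation."""
    return PER_IMAGE_S[variant] * num_images / 60.0

for name in PER_IMAGE_S:
    print(f"{name}: {batch_minutes(name, 100):.1f} min for 100 prompts")
```

The estimates (about 2.5, 4.7, and 8.3 minutes) sit at or slightly below the reported ~2.5, ~4.8, and ~8.5 minutes, which is what a small fixed overhead per run would produce.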

Deployment Guide

ComfyUI Deployment

# Download De-Turbo model (Hugging Face stores large weight files with Git LFS)
git lfs install
git clone https://huggingface.co/Tongyi-MAI/Z-Image-De-Turbo
cp -r Z-Image-De-Turbo/ ComfyUI/models/checkpoints/

# Use the same ComfyUI workflow as Base
# Adjust inference steps to 10-15
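
As an alternative to `git clone`, the same files can be fetched from Python with `huggingface_hub.snapshot_download`. This is a sketch: it assumes `huggingface_hub` is installed and reuses the repository name given above; the helper names are illustrative.

```python
from pathlib import Path

REPO_ID = "Tongyi-MAI/Z-Image-De-Turbo"  # repository name as given in this guide

def checkpoint_dir(comfyui_root, repo_id=REPO_ID):
    """Target folder under ComfyUI/models/checkpoints, named after the repo."""
    return Path(comfyui_root) / "models" / "checkpoints" / repo_id.split("/")[-1]

def download_model(comfyui_root, repo_id=REPO_ID):
    """Download the repository snapshot into the ComfyUI checkpoints folder."""
    # Imported lazily so checkpoint_dir() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target = checkpoint_dir(comfyui_root, repo_id)
    snapshot_download(repo_id=repo_id, local_dir=str(target))
    return target
```

Calling `download_model("ComfyUI")` would place the files in `ComfyUI/models/checkpoints/Z-Image-De-Turbo`, the same location the `cp` command targets.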

Diffusers Usage

from diffusers import ZImagePipeline
import torch

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-De-Turbo",
    torch_dtype=torch.float16
)
pipe.to("cuda")

image = pipe(
    prompt="a beautiful sunset over mountains",
    width=1024,
    height=1024,
    num_inference_steps=12,  # De-Turbo recommended steps
    guidance_scale=7.5,
).images[0]

image.save("output.png")

Inference Step Recommendations

Quality Level | Recommended Steps | Estimated Time (RTX 4090)
Quick Preview | 8 steps | ~2s
Standard Quality | 12 steps | ~3s
High Quality | 15 steps | ~3.8s
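
These recommendations map naturally onto named presets. The sketch below also derives a rough time estimate from the table, which works out to about 0.25 seconds per step on an RTX 4090. The preset names and the linear-time assumption are illustrative.

```python
# Step counts from the recommendation table.
STEP_PRESETS = {"preview": 8, "standard": 12, "high": 15}

def steps_for(level):
    """Return the recommended step count for a named quality level."""
    try:
        return STEP_PRESETS[level]
    except KeyError:
        valid = ", ".join(sorted(STEP_PRESETS))
        raise ValueError(f"unknown level {level!r}; expected one of: {valid}") from None

def estimated_seconds(steps, seconds_per_step=0.25):
    """Rough time estimate, assuming cost grows linearly with step count."""
    return steps * seconds_per_step
```

The result of `steps_for("standard")` can be passed as `num_inference_steps` in the Diffusers call shown earlier.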

References

Z-Image Team