Z-Image De-Turbo De-Distilled Model Deep Dive: Breaking Turbo Limits with the Next-Gen Model
Keywords: z-image de-turbo model
Table of Contents
- Introduction
- What is De-Distillation
- Core Differences: De-Turbo vs Turbo
- Technical Principles
- Performance Comparison
- Training Methods
- Use Cases
- Real-World Test Results
- Deployment Guide
- References
Introduction
Z-Image Turbo gets its speed from distillation, which cuts inference from 20-30 steps to just 4-8. That compression costs quality: in the measurements reported in this article, FID rises from about 3.8 (Base) to about 5.2 (Turbo). Z-Image De-Turbo uses "de-distillation" to recover near-Base generation quality at 10-15 steps, retaining part of Turbo's speed advantage.
What is De-Distillation
Distillation Limitations
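Traditional distillation trains a "student model" to mimic a "teacher model" for faster inference. In classic knowledge distillation the student is smaller than the teacher; in step distillation, as with Turbo, the student keeps the same size (6B here) and instead learns to reproduce the teacher's output in fewer sampling steps. Either way, the approach has inherent limitations: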
- Information Loss: The student cannot fully capture all knowledge from the teacher
- Distribution Shift: The student's output distribution drifts away from the original training data distribution
- Quality Ceiling: Distilled model quality typically has a lower ceiling than the original
De-Distillation Philosophy
De-Distillation takes a reverse approach: rather than compressing the model, it recovers information lost during distillation. Core strategies include:
- Using distilled model outputs as training data: Retrain the model using images generated by Turbo
- Mixing original and synthetic data: Combine original high-quality data with Turbo-synthesized data
- Targeted compensation for distillation losses: Additional training steps to recover lost detail information
Core Differences: De-Turbo vs Turbo
Overview Comparison
| Feature | Turbo | De-Turbo | Base |
|---|---|---|---|
| Inference Steps | 4-8 | 10-15 | 20-30 |
| Speed (RTX 4090, 1024px) | ~1.5s | ~3s | ~5s |
| FID | ~5.2 | ~4.0 | ~3.8 |
| CLIP Score | ~0.270 | ~0.282 | ~0.285 |
| HPSv2 | ~79.6 | ~81.8 | ~83.2 |
| Model Size | 6B | 6B | 6B |
Key Advantages
- Quality Recovery: De-Turbo's FID recovers from Turbo's 5.2 to 4.0, approaching Base's 3.8
- Speed Retention: 10-15 inference steps, about half of Base's 20-30, which works out to roughly 1.7x faster than Base in the timings above (~3s vs ~5s)
- No Extra Hardware Required: Same model size as Turbo/Base, no additional VRAM needed
Technical Principles
De-Distillation Training Pipeline
Original Training Data → Z-Image Turbo Inference → Synthetic Image Dataset
↓
Original Data + Synthetic Data → Joint Training → Z-Image De-Turbo
Key Technical Points
- Data Mixing Strategy
  - 70% original high-quality training data
  - 30% Turbo-generated synthetic data
  - Synthetic data is quality-filtered, keeping only high-scoring samples
- Loss Function Design
  - Standard diffusion loss + distillation loss + consistency loss
  - The consistency loss keeps De-Turbo compatible with Turbo-style fast inference
- Step Optimization
  - De-Turbo recommends 10-15 inference steps
  - That is roughly 6-7 more steps than Turbo (4-8), spent on detail recovery
  - About 50-60% fewer steps than Base (10 vs 20 or 15 vs 30 is 50% fewer; the benchmarked 12 vs 30 is 60% fewer)
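The data mixing and loss design described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the `min_score` cutoff, the per-sample quality scores, and the loss weights are all hypothetical, since the article gives the 70/30 ratio and the three loss terms but not how they are filtered or weighted.

```python
import random


def build_mixed_dataset(original, synthetic, scores, *,
                        synthetic_fraction=0.30, min_score=0.5, seed=0):
    """Mix all original samples with quality-filtered synthetic samples.

    scores[i] is a hypothetical quality score for synthetic[i]; min_score is
    an assumed cutoff, not a value taken from the article.
    """
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    if len(synthetic) != len(scores):
        raise ValueError("need exactly one score per synthetic sample")

    # Quality filter: keep only high-scoring synthetic samples.
    kept = [s for s, q in zip(synthetic, scores) if q >= min_score]

    # Synthetic count that yields the target share when every original is used.
    wanted = round(len(original) * synthetic_fraction / (1.0 - synthetic_fraction))

    rng = random.Random(seed)
    chosen = rng.sample(kept, min(wanted, len(kept)))

    mixed = [("original", x) for x in original] + [("synthetic", x) for x in chosen]
    rng.shuffle(mixed)
    return mixed


def combined_loss(diffusion, distillation, consistency, *,
                  w_distill=1.0, w_consistency=1.0):
    """Weighted sum of the three loss terms; the weights are assumptions."""
    return diffusion + w_distill * distillation + w_consistency * consistency


if __name__ == "__main__":
    originals = [f"orig_{i}" for i in range(70)]
    synthetics = [f"syn_{i}" for i in range(100)]
    quality = [i / 100 for i in range(100)]  # made-up scores from 0.00 to 0.99
    mixed = build_mixed_dataset(originals, synthetics, quality)
    share = sum(1 for source, _ in mixed if source == "synthetic") / len(mixed)
    print(f"{len(mixed)} samples, {share:.0%} synthetic")
```

If the filter leaves too few synthetic samples, the sketch uses what is available, so the synthetic share falls below 30% instead of admitting low-scoring images.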
Performance Comparison
Automated Metrics
| Metric | Turbo (8 steps) | De-Turbo (12 steps) | Base (30 steps) |
|---|---|---|---|
| FID (↓) | 5.18 | 4.02 | 3.82 |
| CLIP Score (↑) | 0.271 | 0.282 | 0.285 |
| HPSv2 (↑) | 79.6 | 81.8 | 83.2 |
| DPG (↑) | 76% | 80% | 82% |
Quality Dimensions
| Dimension | Turbo | De-Turbo | Base |
|---|---|---|---|
| Prompt Adherence | 7.5/10 | 8.2/10 | 8.5/10 |
| Detail Richness | 7.0/10 | 8.0/10 | 8.5/10 |
| Texture Performance | 6.5/10 | 7.8/10 | 8.2/10 |
| Text Rendering | 6.5/10 | 7.2/10 | 7.5/10 |
| Face Quality | 7.0/10 | 7.8/10 | 8.0/10 |
Speed Comparison (RTX 4090, 1024x1024)
| Version | Single Image | 4-Image Batch | 10-Image Batch |
|---|---|---|---|
| Turbo (8 steps) | 1.5s | 5.8s | 14.2s |
| De-Turbo (12 steps) | 2.8s | 10.5s | 25.8s |
| Base (30 steps) | 5.0s | 18.5s | 45.0s |
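Key Finding: At 12 steps, De-Turbo is about 1.8x faster than Base (2.8s vs 5.0s per image) and closes about 85% of the Turbo-to-Base FID gap (from 5.18 to 4.02, against Base's 3.82). The cost relative to Turbo is real: De-Turbo takes nearly twice as long per image (2.8s vs 1.5s), or about 54% of Turbo's throughput.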
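To see how the trade-off falls out of the numbers, the speed and quality ratios can be recomputed from the tables above. The sketch below only does arithmetic on the reported single-image times and metric values; the helper names are illustrative.

```python
# Single-image times (RTX 4090, 1024x1024) and metrics copied from the tables above.
SECONDS = {"turbo": 1.5, "de_turbo": 2.8, "base": 5.0}
FID = {"turbo": 5.18, "de_turbo": 4.02, "base": 3.82}    # lower is better
CLIP = {"turbo": 0.271, "de_turbo": 0.282, "base": 0.285}  # higher is better


def relative_throughput(variant, reference):
    """Throughput of `variant` as a multiple of `reference` (images per second)."""
    return SECONDS[reference] / SECONDS[variant]


def gap_closed(metric):
    """Fraction of the Turbo-to-Base gap that De-Turbo recovers.

    Works for both lower-is-better and higher-is-better metrics, because the
    sign of the gap cancels in the ratio.
    """
    return (metric["turbo"] - metric["de_turbo"]) / (metric["turbo"] - metric["base"])


print(f"vs Turbo: {relative_throughput('de_turbo', 'turbo'):.0%} of throughput")  # 54%
print(f"vs Base:  {relative_throughput('de_turbo', 'base'):.2f}x faster")         # 1.79x
print(f"FID gap closed:  {gap_closed(FID):.0%}")                                  # 85%
print(f"CLIP gap closed: {gap_closed(CLIP):.0%}")                                 # 79%
```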
Training Methods
LoRA Fine-Tuning
De-Turbo supports standard LoRA fine-tuning, compatible with Base and Turbo workflows:
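training_config = {
    "model_path": "Tongyi-MAI/Z-Image-De-Turbo",
    "learning_rate": 2e-5,
    "train_steps": 1500,
    "batch_size": 4,
    "rank_dimension": 32,  # LoRA rank
    "alpha": 16,           # LoRA alpha
    "dropout": 0.1,
    # Prodigy adapts its own step size and is usually run with a learning
    # rate near 1.0; 2e-5 is an AdamW-style value, so check which
    # convention your trainer expects before combining the two.
    "optimizer": "prodigy",
}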
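To make the rank and alpha settings more concrete, here is a small sketch of what they imply in the standard LoRA formulation, where the low-rank update is scaled by alpha / rank. It assumes that `rank_dimension` and `alpha` map to LoRA's usual r and alpha; the 4096x4096 layer size is hypothetical and not taken from the Z-Image architecture.

```python
def lora_scale(rank, alpha):
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_extra_params(d_in, d_out, rank):
    """Trainable parameters added to one linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


rank, alpha = 32, 16        # values from training_config
d_in = d_out = 4096         # hypothetical layer size, for illustration only

scale = lora_scale(rank, alpha)
extra = lora_extra_params(d_in, d_out, rank)
full = d_in * d_out

print(f"update scale: {scale}")                                        # 0.5
print(f"adapter params per layer: {extra:,} ({extra / full:.2%} of the layer)")
```

With alpha at half the rank, the adapter's contribution is scaled by 0.5, which is a fairly conservative setting.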
DreamBooth Training
dreambooth_config = {
    "model_path": "Tongyi-MAI/Z-Image-De-Turbo",
    "instance_prompt": "a photo of [trigger_word] person",
    "num_epochs": 100,
    "learning_rate": 1e-5,
    "resolution": 768,
    "mixed_precision": "fp16",
}
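Two details in this config usually need resolving before a run: the `[trigger_word]` placeholder has to be replaced with an actual token, and `num_epochs` has to be translated into optimizer steps. A minimal sketch, assuming the placeholder is filled by plain string substitution and that one epoch is one pass over the instance images; the token `sks`, the image count, and the batch size are hypothetical.

```python
import math


def fill_trigger(template, trigger_word):
    """Substitute the [trigger_word] placeholder in an instance prompt."""
    return template.replace("[trigger_word]", trigger_word)


def total_steps(num_images, num_epochs, batch_size=1):
    """Optimizer steps for a run, with one epoch = one pass over the images."""
    return num_epochs * math.ceil(num_images / batch_size)


prompt = fill_trigger("a photo of [trigger_word] person", "sks")
steps = total_steps(num_images=20, num_epochs=100, batch_size=4)

print(prompt)  # a photo of sks person
print(steps)   # 500
```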
Use Cases
Recommended: Use De-Turbo When
- Quality-Speed Balance: Need better quality than Turbo but can't afford Base's speed cost
- Professional Content Creation: Designers, photographers needing quality with fast iteration
- Medium Batch Production: 50-500 images/day medium-scale production
- API Services (Medium Latency): Online services accepting 2-3 second latency
- Education/Training: Teaching demos showing quality output efficiently
- LoRA Training Experiments: Need quality fine-tuning output with fast feedback
Not Recommended: When to Skip De-Turbo
- Extreme Speed Needs: Sub-second response required → Use Turbo
- Extreme Quality Needs: Ultimate detail requirements → Use Base
- Massive Batch Production: Thousands of images/day → Use Turbo
- Academic Benchmarking: Need standard Base as reference → Use Base
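The recommendations above can be condensed into a small decision helper. This is a sketch only: the cutoffs are assumptions read off the lists (under 1 second for "sub-second", 1000 or more images per day for "thousands"), and the priority order, with quality requirements checked first, is a design choice, not something the article specifies.

```python
def choose_variant(latency_budget_s, images_per_day, *,
                   need_max_quality=False, need_reference_baseline=False):
    """Pick a Z-Image variant from rough workload requirements."""
    # Ultimate detail or a standard benchmarking reference points to Base.
    if need_max_quality or need_reference_baseline:
        return "Base"
    # Sub-second responses or very large daily volume point to Turbo.
    if latency_budget_s < 1.0 or images_per_day >= 1000:
        return "Turbo"
    # Everything else falls in the quality-speed middle ground.
    return "De-Turbo"


print(choose_variant(2.5, 200))                         # De-Turbo
print(choose_variant(0.5, 200))                         # Turbo
print(choose_variant(3.0, 5000))                        # Turbo
print(choose_variant(3.0, 100, need_max_quality=True))  # Base
```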
Real-World Test Results
Prompt Test
Test prompt: "A detailed still life painting of a vintage camera on a wooden desk, soft window light, film photography aesthetic, shallow depth of field"
| Dimension | Turbo | De-Turbo | Base |
|---|---|---|---|
| Camera Detail | Basic outline | Screws, knobs visible | Fine textures clear |
| Wood Texture | Simple texture | Natural grain | Highly realistic grain |
| Lighting Effect | Basic but plausible | Rich layers | Cinematic lighting |
| Depth of Field | Reasonable blur | Natural gradient | Precise gradient |
Batch Test (100 Prompts)
| Metric | Turbo | De-Turbo | Base |
|---|---|---|---|
| Average FID | 5.21 | 4.05 | 3.83 |
| Average CLIP Score | 0.270 | 0.281 | 0.285 |
| Prompt Adherence Rate | 84% | 89% | 92% |
| Total Generation Time (RTX 4090) | ~2.5 min | ~4.8 min | ~8.5 min |
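
As a sanity check, the batch totals can be converted back to per-image times and compared with the single-image timings reported earlier (1.5s, 2.8s, 5.0s). The sketch assumes the 100 prompts were generated one image each, sequentially on a single GPU.

```python
# Total generation time in minutes for the 100-prompt batch test above.
TOTAL_MINUTES = {"Turbo": 2.5, "De-Turbo": 4.8, "Base": 8.5}
NUM_PROMPTS = 100


def seconds_per_image(total_minutes, num_images=NUM_PROMPTS):
    """Average wall-clock seconds per image for a sequential batch."""
    return total_minutes * 60 / num_images


for name, minutes in TOTAL_MINUTES.items():
    per_image = seconds_per_image(minutes)
    per_hour = 3600 / per_image
    print(f"{name}: {per_image:.2f}s per image, about {per_hour:.0f} images per hour")
```

The per-image figures come out at roughly 1.5s, 2.9s, and 5.1s, in line with the single-image timings.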
Deployment Guide
ComfyUI Deployment
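# Download De-Turbo model (Hugging Face stores weights with Git LFS,
# so enable it first or the clone will contain only pointer files)
git lfs install
git clone https://huggingface.co/Tongyi-MAI/Z-Image-De-Turbo
cp -r Z-Image-De-Turbo/ ComfyUI/models/checkpoints/
# Use the same ComfyUI workflow as Base
# Adjust inference steps to 10-15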
Diffusers Usage
from diffusers import ZImagePipeline
import torch

pipe = ZImagePipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-De-Turbo",
    torch_dtype=torch.float16
)
pipe.to("cuda")

image = pipe(
    prompt="a beautiful sunset over mountains",
    width=1024,
    height=1024,
    num_inference_steps=12,  # De-Turbo recommended steps
    guidance_scale=7.5,
).images[0]
image.save("output.png")
Inference Step Recommendations
| Quality Level | Recommended Steps | Estimated Time (RTX 4090) |
|---|---|---|
| Quick Preview | 8 steps | ~2s |
| Standard Quality | 12 steps | ~3s |
| High Quality | 15 steps | ~3.8s |
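
The table above is close to linear in step count, at roughly 0.25 seconds per step. Here is a minimal sketch that estimates generation time and picks the highest-quality preset that fits a latency budget. The 0.25 s/step figure is a rough fit to the three rows, not a measured constant, and the preset keys are illustrative names.

```python
# Step counts from the table above; the keys are illustrative names.
PRESETS = {"quick_preview": 8, "standard": 12, "high_quality": 15}

# Rough fit to the table (8 -> ~2s, 12 -> ~3s, 15 -> ~3.8s); an assumption.
SECONDS_PER_STEP = 0.25


def estimate_seconds(steps, seconds_per_step=SECONDS_PER_STEP):
    """Estimated generation time, assuming time scales linearly with steps."""
    return steps * seconds_per_step


def best_preset(budget_s):
    """Highest-step preset whose estimated time fits the budget, or None."""
    fitting = {name: steps for name, steps in PRESETS.items()
               if estimate_seconds(steps) <= budget_s}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


print(estimate_seconds(12))  # 3.0
print(best_preset(3.2))      # standard
print(best_preset(1.0))      # None
```

The linear model ignores fixed costs such as text encoding and VAE decoding, so treat the estimates as ballpark figures on comparable hardware.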
References
- Z-Image De-Turbo Official: https://z-image.me/en/resources
- HuggingFace De-Turbo: https://huggingface.co/Tongyi-MAI/Z-Image-De-Turbo
- Z-Image Official GitHub: https://github.com/Tongyi-MAI/Z-Image
- Z-Image Turbo vs Base: https://pxz.ai/blog/z-image-turbo-vs-base