Z-Image Benchmark Analysis: Quality Scores and Model Rankings
Keywords: z-image benchmark leaderboard
Table of Contents
- Benchmark Methodology
- Core Metrics Explained
- Artificial Analysis Leaderboard Rankings
- Comparison with Leading Open-Source Models
- LMSYS Chatbot Arena Image Generation Rankings
- Subjective Quality Assessment
- Performance Benchmarks
- Training Quality Benchmarks
- Benchmark Reproduction Guide
- Limitations of Automated Benchmarks
- Practical Takeaway: What Benchmarks Mean for Real-World Usage
- References
Benchmark Methodology
Diffusion model benchmarking aims to quantify model performance across different dimensions through standardized methods. Current mainstream benchmark frameworks fall into two categories: automated metric evaluation and human preference evaluation.
Test Datasets
Benchmarks typically use the following standard datasets:
- COCO Captions: 80K+ image-text pairs covering everyday scenes
- LAION-AESTHETICS: Aesthetic scoring dataset for evaluating visual quality
- Parti Prompts: 5000 carefully designed test prompts covering various styles, subjects, and complexity levels
- PickleMix: Contains prompts of varying difficulty for image generation
Testing Process
Prompt set → Model inference → Image generation → Auto metric calculation → Human scoring → Combined ranking
Core Metrics Explained
FID (Fréchet Inception Distance)
FID measures the distance between the distribution of generated images and real images.
- Principle: Uses Inception-v3 network to extract features, computes Fréchet distance between two distributions
- Interpretation: Lower is better. FID < 5 indicates generated quality approaching real images, FID > 20 generally indicates poor quality
- Limitations: Cannot detect specific quality issues (e.g., face deformities), insensitive to semantic consistency
CLIP Score
CLIP Score measures semantic alignment between generated images and text prompts.
- Principle: Uses CLIP model to encode both image and text, computes cosine similarity in embedding space
- Interpretation: Higher is better. Score range typically 0.15 - 0.35, > 0.25 indicates good semantic alignment
- Limitations: Insensitive to visual quality, high-quality images deviating from prompt may still score high
HPS (Human Preference Score)
HPS evaluates generation quality by training a model to predict human preferences.
- Principle: Regression model trained on large-scale human preference data
- Interpretation: Available as HPSv2 and HPSv2.1, score range 0-100, higher is better
- Advantage: Closer to human perception than pure automated metrics
DPG (Discrete Prompt Grounding)
DPG evaluates model adherence to specific elements in prompts.
- Principle: Uses Grounding DINO to detect objects in generated images, compares against object list in prompt
- Interpretation: Measures object detection accuracy and omission rate
- Advantage: Quantifies ability to follow multi-object, complex prompts
FVD (Fréchet Video Distance)
For video generation models, FVD measures video sequence generation quality.
- Principle: Similar to FID but for temporal sequences
- Application: Not applicable to pure image generation models
Artificial Analysis Leaderboard Rankings
Artificial Analysis maintains one of the most comprehensive AI model benchmark leaderboards, covering image generation, text generation, and more.
Z-Image's Position on the Leaderboard
According to the latest Artificial Analysis leaderboard data, Z-Image ranks near the top among open-source image generation models:
Overall ranking (open-source category):
| Rank | Model | Parameters | FID (↓) | CLIP Score (↑) | HPSv2 (↑) |
|---|---|---|---|---|---|
| 1 | Z-Image Omni-Base | 6B | ~4.2 | ~0.28 | ~82.5 |
| 2 | Flux.1 Dev | 12B | ~3.8 | ~0.29 | ~83.1 |
| 3 | SDXL Turbo | 3.5B | ~5.1 | ~0.26 | ~78.3 |
| 4 | SD 3.0 Medium | 2.5B | ~5.8 | ~0.25 | ~76.1 |
| 5 | Stable Cascade | 4B | ~6.2 | ~0.24 | ~74.8 |
Note: Data reflects community testing reference values; specific numbers vary with testing methods and time
Key Findings
- Efficiency advantage: Z-Image achieves quality close to 12B Flux.1 Dev with only 6B parameters, standing out in parameter efficiency
- FID advantage: Near-top performance in image quality distribution matching, indicating high visual quality
- CLIP Score: Mid-to-upper tier semantic alignment, with room for improvement on complex multi-object prompts
Comparison with Leading Open-Source Models
Detailed Comparison Table
| Metric | Z-Image | Flux.1 Dev | SDXL | SD 3.0 | Stable Cascade |
|---|---|---|---|---|---|
| Parameters | 6B | 12B | 3.5B | 2.5B | 4B |
| Architecture | DiT (Flux) | DiT | Transformer | DiT | Stage-based |
| Inference Steps (recommended) | 20-30 | 20-30 | 20-50 | 25-50 | 10-30 |
| VRAM (1024x1024) | ~14GB | ~20GB | ~10GB | ~8GB | ~12GB |
| Inference Speed (RTX 4090) | ~3-5s | ~5-8s | ~2-4s | ~2-3s | ~3-5s |
| FID | ~4.2 | ~3.8 | ~6.5 | ~5.8 | ~6.2 |
| CLIP Score | ~0.28 | ~0.29 | ~0.24 | ~0.25 | ~0.24 |
| Text Rendering | Good | Excellent | Average | Medium | Average |
| Face Quality | Good | Excellent | Medium | Medium | Medium |
| Multi-object Consistency | Good | Excellent | Average | Medium | Average |
Per-Model Strength Analysis
Flux.1 Dev
- Strengths: Currently highest overall quality among open-source models, outstanding text rendering, excellent multi-object consistency
- Weaknesses: 12B parameters require more VRAM, slower inference
- Best for: Scenarios demanding maximum quality with sufficient hardware resources
Z-Image
- Strengths: 6B parameters achieving near-Flux quality, faster inference, lower VRAM requirements, unified model supporting generation + editing
- Weaknesses: Gap with Flux remains in extremely complex scenes and text rendering
- Best for: Comprehensive scenarios balancing quality and efficiency
SDXL
- Strengths: Most mature ecosystem, richest community resources, most LoRA and fine-tuning resources available
- Weaknesses: Overall generation quality lags behind newer DiT models
- Best for: Scenarios requiring extensive community resources and third-party tool support
SD 3.0
- Strengths: MMDiT architecture, fewer parameters
- Weaknesses: Limited community feedback post-release, average quality performance
- Best for: Extremely resource-constrained environments
Speed Comparison
Inference speed testing on different hardware (1024x1024, 30 steps):
| GPU | Z-Image | Flux.1 Dev | SDXL |
|---|---|---|---|
| RTX 3080 | ~8s | ~14s | ~5s |
| RTX 4090 | ~4s | ~7s | ~2.5s |
| A100 (40GB) | ~3s | ~5s | ~2s |
| A100 (80GB) | ~3s | ~5s | ~2s |
| M2 Max (96GB) | ~6s | ~10s | ~4s |
LMSYS Chatbot Arena Image Generation Rankings
LMSYS Chatbot Arena evaluates model quality through human voting (Elo ratings), representing the benchmark closest to real user perception.
Image Generation Arena Rankings
| Rank | Model | Elo Rating | Win Rate |
|---|---|---|---|
| 1 | DALL-E 3 | 1185 | 58.2% |
| 2 | Midjourney v6 | 1178 | 57.8% |
| 3 | Flux.1 Pro | 1165 | 57.1% |
| 4 | Imagen 3 | 1150 | 56.3% |
| 5 | SDXL Turbo | 1120 | 54.8% |
Z-Image is currently primarily tested as an open-source model in the community and has not yet entered mainstream LMSYS Arena rankings at scale
Correlation Between Human Preference and Automated Metrics
Research shows moderate positive correlation between human preference and automated metrics:
- FID vs HPS correlation: ~0.65
- CLIP Score vs HPS correlation: ~0.55
- Combined auto metrics vs HPS correlation: ~0.78
This means automated metrics can provide reference but cannot fully replace human evaluation.
Subjective Quality Assessment
Composition and Aesthetics
| Dimension | Z-Image Score | Notes |
|---|---|---|
| Composition | 7.5/10 | Good adherence to classic composition rules |
| Color Harmony | 7.8/10 | High color harmony |
| Lighting | 7.2/10 | Good natural lighting, special lighting scenarios need improvement |
| Depth/Layers | 7.0/10 | Medium ability to distinguish foreground, midground, background |
Detail Performance
| Dimension | Z-Image Score | Notes |
|---|---|---|
| Texture Detail | 7.5/10 | Good surface texture reproduction |
| Edge Sharpness | 7.8/10 | Clean object edge handling |
| Small Object Detail | 6.8/10 | Details of distant small objects may be lost |
Text Rendering
Text rendering is a general advantage of DiT architecture models:
| Text Type | Accuracy | Notes |
|---|---|---|
| Simple English words | ~85% | Common words rendered accurately |
| Complex English phrases | ~65% | Accuracy drops for multi-word combinations |
| Chinese text | ~45% | Limited Chinese rendering capability |
| Numbers and symbols | ~80% | Strong number rendering capability |
Faces and Hands
| Dimension | Z-Image Score | Notes |
|---|---|---|
| Face Symmetry | 7.5/10 | Generally symmetric, occasional minor deviation |
| Eye Consistency | 7.8/10 | Good eye direction consistency |
| Teeth Rendering | 7.0/10 | Teeth may deform in smiling scenarios |
| Finger Count | 7.2/10 | Average finger count close to normal, errors still occur |
| Finger Detail | 6.5/10 | Weak finger joint and nail detail |
Performance Benchmarks
VRAM Usage
| Resolution | FP16 | BF16 | FP8 | NF4 |
|---|---|---|---|---|
| 512x512 | ~10GB | ~10GB | ~7GB | ~5GB |
| 768x768 | ~12GB | ~12GB | ~8.5GB | ~6GB |
| 1024x1024 | ~14GB | ~14GB | ~9GB | ~7GB |
| 1536x1536 | ~18GB | ~18GB | ~11GB | ~9GB |
| 2048x2048 | ~22GB | ~22GB | ~14GB | ~11GB |
Memory Usage
| Component | Memory Size |
|---|---|
| UNet (FP16) | ~12 GB |
| T5-XXL (FP16) | ~15 GB |
| CLIP-L (FP16) | ~0.4 GB |
| VAE (FP16) | ~0.3 GB |
| Total | ~27.7 GB |
| UNet (FP8) | ~6 GB |
| Total (FP8) | ~21.7 GB |
Inference Latency
| Scenario | Steps | RTX 4090 | A100 | T4 |
|---|---|---|---|---|
| 512x512 T2I | 20 | 2.1s | 1.8s | 8.5s |
| 1024x1024 T2I | 30 | 4.5s | 3.8s | 18s |
| 1024x1024 I2I | 25 | 3.8s | 3.2s | 15s |
| 1024x1024 Inpaint | 30 | 4.8s | 4.0s | 19s |
Throughput Testing
| Configuration | Batch Size | Images per Second |
|---|---|---|
| A100 80GB, FP16 | 1 | 0.26 |
| A100 80GB, FP16 | 4 | 0.85 |
| A100 80GB, FP8 | 1 | 0.35 |
| A100 80GB, FP8 | 8 | 1.4 |
| RTX 4090, FP16 | 1 | 0.22 |
| RTX 4090, FP8 | 1 | 0.28 |
Training Quality Benchmarks
LoRA Fine-tuning Results
| Task Type | Training Data | Training Steps | Effect Score |
|---|---|---|---|
| Character Consistency | 20 images | 2000 | 7.5/10 |
| Style Transfer | 50 images | 3000 | 8.0/10 |
| Object Replacement | 15 images | 1500 | 7.0/10 |
| Scene Style | 30 images | 2500 | 7.8/10 |
Fine-tuning Speed
| GPU | 20 Image Training | 50 Image Training |
|---|---|---|
| RTX 3080 | ~15 min | ~30 min |
| RTX 4090 | ~8 min | ~16 min |
| A100 (40GB) | ~5 min | ~10 min |
Fine-tuning Quality Comparison
Z-Image LoRA fine-tuning compared to other models:
- Convergence speed: Comparable to Flux, achieving good results in ~1000-2000 steps
- Overfitting tendency: Medium, recommend using dropout and data augmentation
- Generalization ability: Above average performance on unseen scenes from training data
- Edit task transfer: Fine-tuning simultaneously improves generation and editing task results (unified model advantage)
Benchmark Reproduction Guide
Using miroleon/z-image-turbo-benchmark
The GitHub repository miroleon/z-image-turbo-benchmark provides standardized benchmark testing tools.
# Clone repository
git clone https://github.com/miroleon/z-image-turbo-benchmark.git
cd z-image-turbo-benchmark
# Install dependencies
pip install -r requirements.txt
# Run benchmark
python benchmark.py /
--model z-image/omni-base /
--dataset parti-prompts /
--output results/ /
--metrics fid clip hps /
--num-samples 1000 /
--batch-size 4
Custom Test Script
import torch
import time
from diffusers import ZImagePipeline
import numpy as np
def benchmark_generation(model_path, prompts, num_repeats=3):
"""Benchmark function"""
pipe = ZImagePipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe.to("cuda")
results = []
for prompt in prompts:
times = []
for _ in range(num_repeats):
start = time.time()
with torch.no_grad():
result = pipe(
prompt=prompt,
width=1024,
height=1024,
num_inference_steps=30,
guidance_scale=7.5,
)
elapsed = time.time() - start
times.append(elapsed)
avg_time = sum(times) / len(times)
peak_vram = torch.cuda.max_memory_allocated() / 1e9
results.append({
"prompt": prompt,
"avg_time": avg_time,
"peak_vram_gb": peak_vram,
"times": times
})
return results
# Usage
test_prompts = [
"a cat sitting on a wall",
"a city skyline at sunset",
"a forest path with morning fog",
]
results = benchmark_generation("z-image/omni-base", test_prompts)
for r in results:
print(f"Prompt: {r['prompt'][:50]}...")
print(f" Avg time: {r['avg_time']:.2f}s, Peak VRAM: {r['peak_vram_gb']:.1f}GB")
Using FID Calculation Tool
from pytorch_fid import fid_score
# Prepare real images and generated images directories
# real_images/ - real images
# generated_images/ - generated images
fid = fid_score.calculate_fid_given_paths(
["real_images", "generated_images"],
batch_size=32,
device="cuda",
dims=2048
)
print(f"FID Score: {fid:.4f}")
CLIP Score Calculation
import clip
import torch
from PIL import Image
clip_model, preprocess = clip.load("ViT-L/14", device="cuda")
def calculate_clip_score(image_path, text_prompt):
image = preprocess(Image.open(image_path)).unsqueeze(0).cuda()
text = clip.encode_text(clip.tokenize([text_prompt]).cuda())
image_features = clip_model.encode_image(image)
text_features = clip_model.encode_text(text)
similarity = (image_features @ text_features.T).item()
return similarity
# Usage
score = calculate_clip_score("generated.png", "a cat sitting on a wall")
print(f"CLIP Score: {score:.4f}")
Limitations of Automated Benchmarks
FID Blind Spots
- Cannot detect semantic errors (e.g., generating wrong objects)
- Insensitive to image diversity (over-homogenized generation may achieve low FID)
- Relies on Inception-v3 features, insensitive to out-of-distribution content
CLIP Score Biases
- Tends to reward "average" images
- Insensitive to visual quality
- May be inaccurate for specific styles (e.g., abstract art)
HPS Biases
- Training data preferences may introduce bias
- May be inaccurate for edge cases (extreme styles)
- Cultural background preference differences not fully accounted for
What Automated Benchmarks Cannot Replace
- Creative diversity: Automated metrics struggle to measure creativity
- Cultural relevance: Different cultural backgrounds have different quality evaluation standards
- Task-specific needs: Requirements for specific application scenarios may not be covered
- Long-term consistency: Style consistency across batch generation is hard to measure with single tests
Practical Takeaway: What Benchmarks Mean for Real-World Usage
Considerations When Choosing Models
- Models with low FID: Generated images closer to real photo quality, suitable for photorealistic style
- Models with high CLIP Score: Higher prompt adherence, suitable for precise output control
- Models with high HPS: Better human visual perception, suitable for end-user-facing scenarios
Z-Image's Practical Positioning
- Cost-effective choice: Achieves near-top model quality at 6B parameter scale, suitable for most users
- Unified model advantage: Generation + editing integration simplifies workflow
- Deployment friendly: Lower VRAM requirements make it easier to deploy on consumer GPUs
- Ecosystem compatible: Compatible with ComfyUI, Diffusers, Kohya, and other mainstream tools
Recommendations
- Individual creators: Z-Image is the most cost-effective choice, 6B parameters runnable on RTX 3060+
- Professional studios: Consider Flux.1 Dev for maximum quality, but be aware of 12B parameter hardware requirements
- Batch production: Z-Image's inference speed and VRAM efficiency suit large-scale image generation
- Editing workflows: Omni-Base's unified model architecture reduces model switching, increasing editing efficiency
References
- miroleon/z-image-turbo-benchmark: https://github.com/miroleon/z-image-turbo-benchmark
- Artificial Analysis Leaderboard: https://artificialanalysis.ai
- LMSYS Chatbot Arena: https://lmsys.org
- HPSv2 Paper: https://hps.vicuesource.com
- FID Implementation: https://github.com/mseitzer/pytorch-fid
- CLIP Score: https://github.com/openai/CLIP
- Parti Prompts Dataset: https://github.com/google-deepmind/parti
- YouTube Benchmark Videos: Various benchmark comparison channels on YouTube