Z-Image Benchmark Analysis: Quality Scores and Model Rankings

Benchmark Methodology
Core Metrics Explained
Artificial Analysis Leaderboard Rankings
Comparison with Leading Open-Source Models
LMSYS Chatbot Arena Image Generation Rankings
Subjective Quality Assessment
Performance Benchmarks
Training Quality Benchmarks
Benchmark Reproduction Guide
Limitations of Automated Benchmarks
Practical Takeaway: What Benchmarks Mean for Real-World Usage
References

Benchmark Methodology

Diffusion model benchmarking aims to quantify model performance across different dimensions through standardized methods. Current mainstream benchmark frameworks fall into two categories: automated metric evaluation and human preference evaluation.

Test Datasets

Benchmarks typically use the following standard datasets:

COCO Captions: 80K+ image-text pairs covering everyday scenes
LAION-AESTHETICS: Aesthetic scoring dataset for evaluating visual quality
Parti Prompts: More than 1,600 carefully designed English test prompts covering various styles, subjects, and complexity levels
PickleMix: Contains prompts of varying difficulty for image generation

Testing Process

Prompt set → Model inference → Image generation → Auto metric calculation → Human scoring → Combined ranking
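The final "combined ranking" step can be pictured as a mean-rank aggregation. This is only a sketch: mean rank is one possible aggregation (an assumption, not a method any specific leaderboard is confirmed to use), and the model names and scores below are hypothetical placeholders.

```python
def combined_ranking(scores, lower_is_better=("fid",)):
    """Order models by their mean rank across all metrics (1 = best)."""
    models = list(scores)
    metrics = list(next(iter(scores.values())))
    rank_sums = {model: 0 for model in models}
    for metric in metrics:
        # Higher-is-better metrics sort descending, lower-is-better ascending
        descending = metric not in lower_is_better
        ordered = sorted(models, key=lambda m: scores[m][metric], reverse=descending)
        for position, model in enumerate(ordered, start=1):
            rank_sums[model] += position
    mean_ranks = {model: rank_sums[model] / len(metrics) for model in models}
    return sorted(mean_ranks.items(), key=lambda item: item[1])


# Hypothetical scores for three placeholder models
example_scores = {
    "model_a": {"fid": 6.0, "clip": 0.24, "human": 7.0},
    "model_b": {"fid": 4.0, "clip": 0.28, "human": 8.1},
    "model_c": {"fid": 5.0, "clip": 0.29, "human": 7.5},
}

ranking = combined_ranking(example_scores)
for model, mean_rank in ranking:
    print(f"{model}: mean rank {mean_rank:.2f}")
```

Mean rank treats every metric as equally important; a weighted variant would be needed if human scores should count for more than automated ones.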

Core Metrics Explained

FID (Fréchet Inception Distance)

FID measures the distance between the distribution of generated images and real images.

Principle: Uses Inception-v3 network to extract features, computes Fréchet distance between two distributions
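Interpretation: Lower is better. As a rough guide, FID < 5 indicates a generated distribution close to that of real images, while FID > 20 generally indicates poor quality. Scores are only comparable when computed against the same reference set with the same number of samples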
Limitations: Cannot detect specific quality issues (e.g., face deformities), insensitive to semantic consistency
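To make the Fréchet distance concrete, here is a simplified sketch that fits one Gaussian per feature dimension (a diagonal covariance). That simplification is an assumption made to keep the example dependency-free. Real FID uses the full covariance matrix of 2048-dimensional Inception-v3 features and a matrix square root. The function name and the toy feature vectors are illustrative only.

```python
import math
from statistics import fmean, pvariance


def frechet_distance_diag(real_feats, gen_feats):
    """Fréchet distance between two Gaussians with diagonal covariance.

    Per dimension: (mu_r - mu_g)^2 + var_r + var_g - 2 * sqrt(var_r * var_g)
    """
    dims = len(real_feats[0])
    total = 0.0
    for d in range(dims):
        real_col = [row[d] for row in real_feats]
        gen_col = [row[d] for row in gen_feats]
        mu_r, mu_g = fmean(real_col), fmean(gen_col)
        var_r, var_g = pvariance(real_col), pvariance(gen_col)
        total += (mu_r - mu_g) ** 2 + var_r + var_g - 2.0 * math.sqrt(var_r * var_g)
    return total


# Toy 2-D "features": the generated set is the real set shifted by 1 in each dimension
real = [[0.0, 0.0], [2.0, 2.0]]
generated = [[1.0, 1.0], [3.0, 3.0]]

print(frechet_distance_diag(real, real))       # identical distributions
print(frechet_distance_diag(real, generated))  # same spread, shifted means
```

Identical feature sets give a distance of zero. The distance grows as either the means or the spreads of the two sets drift apart.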

CLIP Score

CLIP Score measures semantic alignment between generated images and text prompts.

Principle: Uses CLIP model to encode both image and text, computes cosine similarity in embedding space
Interpretation: Higher is better. Score range typically 0.15 - 0.35, > 0.25 indicates good semantic alignment
Limitations: Insensitive to visual quality, high-quality images deviating from prompt may still score high
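The similarity step itself is simple once both embeddings exist. The sketch below computes cosine similarity on plain Python lists. The short vectors are made-up stand-ins for CLIP embeddings, which in practice have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up stand-ins for an image embedding and a text embedding
image_embedding = [3.0, 4.0]
text_embedding = [4.0, 3.0]

print(f"{cosine_similarity(image_embedding, text_embedding):.2f}")
```

Because the result depends only on direction, scaling either embedding leaves the score unchanged.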

HPS (Human Preference Score)

HPS evaluates generation quality by training a model to predict human preferences.

Principle: A scoring model (a fine-tuned CLIP model in HPSv2) trained on large-scale human preference data
Interpretation: Available as HPSv2 and HPSv2.1; higher is better. Scores in this article are reported on a 0-100 scale
Advantage: Closer to human perception than pure automated metrics

DPG (Dense Prompt Graph)

DPG-Bench evaluates model adherence to specific elements in long, dense prompts.

Principle: Breaks each prompt into entities, attributes, and relations, then uses a visual question answering model to check whether each one appears in the generated image
Interpretation: Higher is better. The score reflects the share of prompt elements the image satisfies, so omitted or incorrect objects lower it
Advantage: Quantifies ability to follow multi-object, complex prompts
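One way to picture an element-level adherence score: given the elements a prompt asks for and the elements a checker confirmed in the image, compute the satisfied share and the omission rate. This is a simplified sketch, not the benchmark's actual scoring code. The function name and both element lists are hypothetical, and producing the confirmed list from an image is out of scope here.

```python
def prompt_adherence(expected, confirmed):
    """Share of requested prompt elements that were confirmed in the image."""
    expected_set, confirmed_set = set(expected), set(confirmed)
    missing = expected_set - confirmed_set
    return {
        "adherence": len(expected_set & confirmed_set) / len(expected_set),
        "omission_rate": len(missing) / len(expected_set),
        "missing": sorted(missing),
    }


# Hypothetical prompt elements and checker output
report = prompt_adherence(
    expected=["cat", "wall", "red scarf", "moon"],
    confirmed=["cat", "wall", "moon", "tree"],
)
print(report)
```

Elements found in the image but never requested ("tree" here) do not affect this score. A stricter variant could penalize them.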

FVD (Fréchet Video Distance)

For video generation models, FVD measures video sequence generation quality.

Principle: Similar to FID but for temporal sequences
Application: Not applicable to pure image generation models

Artificial Analysis Leaderboard Rankings

Artificial Analysis maintains one of the most comprehensive AI model benchmark leaderboards, covering image generation, text generation, and more.

Z-Image's Position on the Leaderboard

According to the latest Artificial Analysis leaderboard data, Z-Image ranks near the top among open-source image generation models:

Overall ranking (open-source category):

Rank	Model	Parameters	FID (↓)	CLIP Score (↑)	HPSv2 (↑)
1	Flux.1 Dev	12B	~3.8	~0.29	~83.1
2	Z-Image Omni-Base	6B	~4.2	~0.28	~82.5
3	SDXL Turbo	3.5B	~5.1	~0.26	~78.3
4	SD 3.0 Medium	2.5B	~5.8	~0.25	~76.1
5	Stable Cascade	4B	~6.2	~0.24	~74.8

Note: Data reflects community testing reference values; specific numbers vary with testing methods and time

Key Findings

Efficiency advantage: Z-Image achieves quality close to 12B Flux.1 Dev with only 6B parameters, standing out in parameter efficiency
FID advantage: Near-top performance in image quality distribution matching, indicating high visual quality
CLIP Score: Mid-to-upper tier semantic alignment, with room for improvement on complex multi-object prompts
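The efficiency claim can be checked with simple arithmetic on the approximate table values above. The sketch below computes Z-Image's relative gap to Flux.1 Dev on each metric. Because the inputs are approximate community reference values, the outputs are equally approximate.

```python
def relative_gap(value, reference):
    """Signed difference of value from reference, as a fraction of reference."""
    return (value - reference) / reference


# Approximate values from the leaderboard table above
z_image = {"params_b": 6, "fid": 4.2, "clip": 0.28, "hps": 82.5}
flux_dev = {"params_b": 12, "fid": 3.8, "clip": 0.29, "hps": 83.1}

param_ratio = z_image["params_b"] / flux_dev["params_b"]
fid_gap = relative_gap(z_image["fid"], flux_dev["fid"])     # positive = worse (lower is better)
clip_gap = relative_gap(z_image["clip"], flux_dev["clip"])  # negative = worse (higher is better)
hps_gap = relative_gap(z_image["hps"], flux_dev["hps"])     # negative = worse (higher is better)

print(f"Parameters: {param_ratio:.0%} of Flux.1 Dev")
print(f"FID gap: {fid_gap:+.1%}, CLIP gap: {clip_gap:+.1%}, HPSv2 gap: {hps_gap:+.1%}")
```

On these numbers, Z-Image uses half the parameters while its FID is roughly 10% higher and its HPSv2 score is under 1% lower.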

Comparison with Leading Open-Source Models

Detailed Comparison Table

Metric	Z-Image	Flux.1 Dev	SDXL	SD 3.0	Stable Cascade
Parameters	6B	12B	3.5B	2.5B	4B
Architecture	DiT (single-stream)	DiT	U-Net	MMDiT	Stage-based
Inference Steps (recommended)	20-30	20-30	20-50	25-50	10-30
VRAM (1024x1024)	~14GB	~20GB	~10GB	~8GB	~12GB
Inference Speed (RTX 4090)	~3-5s	~5-8s	~2-4s	~2-3s	~3-5s
FID	~4.2	~3.8	~6.5	~5.8	~6.2
CLIP Score	~0.28	~0.29	~0.24	~0.25	~0.24
Text Rendering	Good	Excellent	Average	Medium	Average
Face Quality	Good	Excellent	Medium	Medium	Medium
Multi-object Consistency	Good	Excellent	Average	Medium	Average

Per-Model Strength Analysis

Flux.1 Dev

Strengths: Currently highest overall quality among open-source models, outstanding text rendering, excellent multi-object consistency
Weaknesses: 12B parameters require more VRAM, slower inference
Best for: Scenarios demanding maximum quality with sufficient hardware resources

Z-Image

Strengths: 6B parameters achieving near-Flux quality, faster inference, lower VRAM requirements, unified model supporting generation + editing
Weaknesses: Gap with Flux remains in extremely complex scenes and text rendering
Best for: Comprehensive scenarios balancing quality and efficiency

SDXL

Strengths: Most mature ecosystem, richest community resources, most LoRA and fine-tuning resources available
Weaknesses: Overall generation quality lags behind newer DiT models
Best for: Scenarios requiring extensive community resources and third-party tool support

SD 3.0

Strengths: MMDiT architecture, fewer parameters
Weaknesses: Limited community feedback post-release, average quality performance
Best for: Extremely resource-constrained environments

Speed Comparison

Inference speed testing on different hardware (1024x1024, 30 steps):

GPU	Z-Image	Flux.1 Dev	SDXL
RTX 3080	~8s	~14s	~5s
RTX 4090	~4s	~7s	~2.5s
A100 (40GB)	~3s	~5s	~2s
A100 (80GB)	~3s	~5s	~2s
M2 Max (96GB)	~6s	~10s	~4s

LMSYS Chatbot Arena Image Generation Rankings

LMSYS Chatbot Arena evaluates model quality through human voting (Elo ratings), representing the benchmark closest to real user perception.

Image Generation Arena Rankings

Rank	Model	Elo Rating	Win Rate
1	DALL-E 3	1185	58.2%
2	Midjourney v6	1178	57.8%
3	Flux.1 Pro	1165	57.1%
4	Imagen 3	1150	56.3%
5	SDXL Turbo	1120	54.8%

Note: Z-Image has so far been tested mainly by the open-source community and does not yet appear in the main LMSYS Arena rankings.
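For readers unfamiliar with Elo, the sketch below shows the standard Elo expected-score and update formulas. Whether the arena uses exactly this formulation is an assumption, and the K-factor of 32 is a conventional default chosen for illustration.

```python
def elo_expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, score_a, k=32):
    """New rating for A after one comparison (score_a: 1 win, 0.5 tie, 0 loss)."""
    return rating_a + k * (score_a - elo_expected_score(rating_a, rating_b))


# Ratings of the first- and fifth-ranked models in the table above
p_top_vs_fifth = elo_expected_score(1185, 1120)
print(f"Expected preference rate: {p_top_vs_fifth:.2f}")
print(f"Rating after an upset loss: {elo_update(1185, 1120, score_a=0):.1f}")
```

A 65-point gap corresponds to an expected head-to-head preference rate of roughly 0.59, so even the top and fifth entries in the table are separated by a modest margin.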

Correlation Between Human Preference and Automated Metrics

Reported correlations between automated metrics and human preference are moderate. The values below are magnitudes: FID is lower-is-better, so its raw correlation with HPS is negative.

FID vs HPS correlation: ~0.65
CLIP Score vs HPS correlation: ~0.55
Combined auto metrics vs HPS correlation: ~0.78

This means automated metrics can provide reference but cannot fully replace human evaluation.
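A correlation like those listed above can be computed with a plain Pearson coefficient. The sketch below uses hypothetical per-model scores, not the data behind the quoted figures. The sign comes out negative because lower FID goes with higher preference.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five models
fid_scores = [4.0, 5.0, 6.0, 7.0, 8.0]
human_scores = [8.0, 8.2, 7.0, 7.4, 6.1]

r = pearson(fid_scores, human_scores)
print(f"FID vs human preference: r = {r:.2f} (magnitude {abs(r):.2f})")
```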

Subjective Quality Assessment

Composition and Aesthetics

Dimension	Z-Image Score	Notes
Composition	7.5/10	Good adherence to classic composition rules
Color Harmony	7.8/10	High color harmony
Lighting	7.2/10	Good natural lighting, special lighting scenarios need improvement
Depth/Layers	7.0/10	Medium ability to distinguish foreground, midground, background

Detail Performance

Dimension	Z-Image Score	Notes
Texture Detail	7.5/10	Good surface texture reproduction
Edge Sharpness	7.8/10	Clean object edge handling
Small Object Detail	6.8/10	Details of distant small objects may be lost

Text Rendering

Text rendering is a general advantage of DiT architecture models:

Text Type	Accuracy	Notes
Simple English words	~85%	Common words rendered accurately
Complex English phrases	~65%	Accuracy drops for multi-word combinations
Chinese text	~45%	Limited Chinese rendering capability
Numbers and symbols	~80%	Strong number rendering capability

Faces and Hands

Dimension	Z-Image Score	Notes
Face Symmetry	7.5/10	Generally symmetric, occasional minor deviation
Eye Consistency	7.8/10	Good eye direction consistency
Teeth Rendering	7.0/10	Teeth may deform in smiling scenarios
Finger Count	7.2/10	Average finger count close to normal, errors still occur
Finger Detail	6.5/10	Weak finger joint and nail detail

Performance Benchmarks

VRAM Usage

Resolution	FP16	BF16	FP8	NF4
512x512	~10GB	~10GB	~7GB	~5GB
768x768	~12GB	~12GB	~8.5GB	~6GB
1024x1024	~14GB	~14GB	~9GB	~7GB
1536x1536	~18GB	~18GB	~11GB	~9GB
2048x2048	~22GB	~22GB	~14GB	~11GB

Memory Usage

Component	Memory Size
DiT backbone (FP16)	~12 GB
T5-XXL (FP16)	~15 GB
CLIP-L (FP16)	~0.4 GB
VAE (FP16)	~0.3 GB
Total (FP16)	~27.7 GB
DiT backbone (FP8)	~6 GB
Total (FP8 backbone)	~21.7 GB
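The totals in this table follow from simple arithmetic: weight memory is roughly parameter count times bytes per parameter. The sketch below reproduces the table's first row for the 6B-parameter denoising backbone and then adds the other component sizes as listed. The 0.5 bytes per parameter for NF4 ignores quantization metadata, so treat that figure as a lower bound.

```python
# Approximate storage cost per parameter for each precision
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "nf4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


backbone_params = 6e9                  # 6B-parameter denoising backbone
other_components_gb = 15 + 0.4 + 0.3   # text encoder, CLIP-L, VAE sizes from the table

totals = {}
for dtype in ("fp16", "fp8"):
    backbone_gb = weight_memory_gb(backbone_params, dtype)
    totals[dtype] = backbone_gb + other_components_gb
    print(f"{dtype}: backbone ~{backbone_gb:.1f} GB, total ~{totals[dtype]:.1f} GB")
```

These figures cover weights only. Activations and any framework overhead come on top and grow with resolution and batch size.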

Inference Latency

Scenario	Steps	RTX 4090	A100	T4
512x512 T2I	20	2.1s	1.8s	8.5s
1024x1024 T2I	30	4.5s	3.8s	18s
1024x1024 I2I	25	3.8s	3.2s	15s
1024x1024 Inpaint	30	4.8s	4.0s	19s

Throughput Testing

Configuration	Batch Size	Images per Second
A100 80GB, FP16	1	0.26
A100 80GB, FP16	4	0.85
A100 80GB, FP8	1	0.35
A100 80GB, FP8	8	1.4
RTX 4090, FP16	1	0.22
RTX 4090, FP8	1	0.28
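Throughput and latency are two views of the same measurement. As a rough cross-check, the sketch below converts the 1024x1024, 30-step latencies from the Inference Latency table into images per second and computes how much batching raises throughput. It assumes the throughput rows were measured at that same resolution and step count, which the tables do not state.

```python
def images_per_second(latency_s, batch_size=1):
    """Throughput implied by the time taken to generate one batch."""
    return batch_size / latency_s


def throughput_gain(batched_ips, single_ips):
    """How many times faster batched generation is than batch size 1."""
    return batched_ips / single_ips


# Batch-size-1 latencies (1024x1024 T2I, 30 steps) from the latency table
a100_ips = images_per_second(3.8)
rtx4090_ips = images_per_second(4.5)
print(f"A100: {a100_ips:.2f} img/s, RTX 4090: {rtx4090_ips:.2f} img/s")

# Batched vs single-image throughput from the throughput table
gain_fp16_b4 = throughput_gain(0.85, 0.26)
gain_fp8_b8 = throughput_gain(1.4, 0.35)
print(f"FP16 batch 4: {gain_fp16_b4:.2f}x, FP8 batch 8: {gain_fp8_b8:.2f}x")
print(f"A100 FP8 batch 8: ~{1.4 * 3600:.0f} images per hour")
```

The implied single-image throughput (about 0.26 and 0.22 images per second) matches the batch-size-1 FP16 rows, so the two tables are consistent with each other.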

Training Quality Benchmarks

LoRA Fine-tuning Results

Task Type	Training Data	Training Steps	Effect Score
Character Consistency	20 images	2000	7.5/10
Style Transfer	50 images	3000	8.0/10
Object Replacement	15 images	1500	7.0/10
Scene Style	30 images	2500	7.8/10

Fine-tuning Speed

GPU	20 Image Training	50 Image Training
RTX 3080	~15 min	~30 min
RTX 4090	~8 min	~16 min
A100 (40GB)	~5 min	~10 min

Fine-tuning Quality Comparison

Z-Image LoRA fine-tuning compared to other models:

Convergence speed: Comparable to Flux, achieving good results in ~1000-2000 steps
Overfitting tendency: Medium, recommend using dropout and data augmentation
Generalization ability: Above-average performance on scenes not seen in the training data
Edit task transfer: Fine-tuning simultaneously improves generation and editing task results (unified model advantage)

Benchmark Reproduction Guide

Using miroleon/z-image-turbo-benchmark

The GitHub repository miroleon/z-image-turbo-benchmark provides standardized benchmark testing tools.

# Clone repository
git clone https://github.com/miroleon/z-image-turbo-benchmark.git
cd z-image-turbo-benchmark

# Install dependencies
pip install -r requirements.txt

# Run benchmark
python benchmark.py \
    --model z-image/omni-base \
    --dataset parti-prompts \
    --output results/ \
    --metrics fid clip hps \
    --num-samples 1000 \
    --batch-size 4

Custom Test Script

import time

import torch
from diffusers import ZImagePipeline


def benchmark_generation(model_path, prompts, num_repeats=3):
    """Time repeated generations per prompt and record peak VRAM."""
    pipe = ZImagePipeline.from_pretrained(model_path, torch_dtype=torch.float16)
    pipe.to("cuda")

    results = []
    for prompt in prompts:
        # Reset the counter so peak VRAM is reported per prompt, not cumulatively
        torch.cuda.reset_peak_memory_stats()
        times = []
        for _ in range(num_repeats):
            # CUDA work is asynchronous; synchronize so the timer covers the full generation
            torch.cuda.synchronize()
            start = time.perf_counter()
            with torch.no_grad():
                pipe(
                    prompt=prompt,
                    width=1024,
                    height=1024,
                    num_inference_steps=30,
                    guidance_scale=7.5,
                )
            torch.cuda.synchronize()
            times.append(time.perf_counter() - start)

        # Note: the first repeat of the first prompt includes one-time warm-up cost
        results.append({
            "prompt": prompt,
            "avg_time": sum(times) / len(times),
            "peak_vram_gb": torch.cuda.max_memory_allocated() / 1e9,
            "times": times,
        })

    return results

# Usage
test_prompts = [
    "a cat sitting on a wall",
    "a city skyline at sunset",
    "a forest path with morning fog",
]

results = benchmark_generation("z-image/omni-base", test_prompts)
for r in results:
    print(f"Prompt: {r['prompt'][:50]}...")
    print(f"  Avg time: {r['avg_time']:.2f}s, Peak VRAM: {r['peak_vram_gb']:.1f}GB")
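A plain average can be skewed by the first run, which often includes one-time warm-up cost. The sketch below summarizes a per-prompt `times` list like the one recorded above while dropping warm-up runs. The helper name, the default of one warm-up run, and the sample timings are illustrative assumptions.

```python
from statistics import mean, median, stdev


def summarize_times(times, warmup_runs=1):
    """Summarize latencies in seconds, ignoring the first warmup_runs entries."""
    # Keep all runs if there are too few to discard any
    steady = times[warmup_runs:] if len(times) > warmup_runs else times
    return {
        "mean_s": mean(steady),
        "median_s": median(steady),
        "stdev_s": stdev(steady) if len(steady) > 1 else 0.0,
        "runs_used": len(steady),
    }


# Hypothetical timings: a slow first run followed by steady-state runs
summary = summarize_times([6.9, 4.4, 4.6, 4.5])
print(summary)
```

Reporting the median alongside the mean makes an occasional slow run visible without letting it dominate the headline number.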

Using FID Calculation Tool

from pytorch_fid import fid_score

# Prepare real images and generated images directories
# real_images/ - real images
# generated_images/ - generated images

fid = fid_score.calculate_fid_given_paths(
    ["real_images", "generated_images"],
    batch_size=32,
    device="cuda",
    dims=2048
)
print(f"FID Score: {fid:.4f}")

CLIP Score Calculation

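import clip
import torch
from PIL import Image

device = "cuda"
clip_model, preprocess = clip.load("ViT-L/14", device=device)

def calculate_clip_score(image_path, text_prompt):
    """Cosine similarity between the CLIP image and text embeddings."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize([text_prompt]).to(device)
    with torch.no_grad():
        image_features = clip_model.encode_image(image)
        text_features = clip_model.encode_text(tokens)

    # Normalize both embeddings so the dot product is a cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    return (image_features @ text_features.T).item()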

# Usage
score = calculate_clip_score("generated.png", "a cat sitting on a wall")
print(f"CLIP Score: {score:.4f}")

Limitations of Automated Benchmarks

FID Blind Spots

Cannot detect semantic errors (e.g., generating wrong objects)
Insensitive to image diversity (over-homogenized generation may achieve low FID)
Relies on Inception-v3 features, insensitive to out-of-distribution content

CLIP Score Biases

Tends to reward "average" images
Insensitive to visual quality
May be inaccurate for specific styles (e.g., abstract art)

HPS Biases

Training data preferences may introduce bias
May be inaccurate for edge cases (extreme styles)
Cultural background preference differences not fully accounted for

What Automated Benchmarks Cannot Replace

Creative diversity: Automated metrics struggle to measure creativity
Cultural relevance: Different cultural backgrounds have different quality evaluation standards
Task-specific needs: Requirements for specific application scenarios may not be covered
Long-term consistency: Style consistency across batch generation is hard to measure with single tests

Practical Takeaway: What Benchmarks Mean for Real-World Usage

Considerations When Choosing Models

Models with low FID: Generated images closer to real photo quality, suitable for photorealistic style
Models with high CLIP Score: Higher prompt adherence, suitable for precise output control
Models with high HPS: Output that people tend to prefer visually, suitable for end-user-facing scenarios

Z-Image's Practical Positioning

Cost-effective choice: Achieves near-top model quality at 6B parameter scale, suitable for most users
Unified model advantage: Generation + editing integration simplifies workflow
Deployment friendly: Lower VRAM requirements make it easier to deploy on consumer GPUs
Ecosystem compatible: Compatible with ComfyUI, Diffusers, Kohya, and other mainstream tools

Recommendations

Individual creators: Z-Image is the most cost-effective choice. Its 6B parameters can run on an RTX 3060 or better when FP8 or NF4 quantization is used (see VRAM Usage)
Professional studios: Consider Flux.1 Dev for maximum quality, but be aware of 12B parameter hardware requirements
Batch production: Z-Image's inference speed and VRAM efficiency suit large-scale image generation
Editing workflows: Omni-Base's unified model architecture reduces model switching, increasing editing efficiency

References

miroleon/z-image-turbo-benchmark: https://github.com/miroleon/z-image-turbo-benchmark
Artificial Analysis Leaderboard: https://artificialanalysis.ai
LMSYS Chatbot Arena: https://lmsys.org
HPSv2: https://github.com/tgxs002/HPSv2
FID Implementation: https://github.com/mseitzer/pytorch-fid
CLIP Score: https://github.com/openai/CLIP
Parti Prompts Dataset: https://github.com/google-research/parti
