Z-Image Benchmark Analysis: Quality Scores and Model Rankings

May 27, 2026


Keywords: z-image benchmark leaderboard




Benchmark Methodology

Diffusion model benchmarking aims to quantify model performance across different dimensions through standardized methods. Current mainstream benchmark frameworks fall into two categories: automated metric evaluation and human preference evaluation.

Test Datasets

Benchmarks typically use the following standard datasets:

  • COCO Captions: 80K+ image-text pairs covering everyday scenes
  • LAION-AESTHETICS: Aesthetic scoring dataset for evaluating visual quality
  • Parti Prompts: More than 1,600 carefully designed test prompts covering various styles, subjects, and complexity levels
  • PickleMix: Contains prompts of varying difficulty for image generation

Testing Process

Prompt set → Model inference → Image generation → Auto metric calculation → Human scoring → Combined ranking
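
How the final "combined ranking" step merges the metrics is not specified here. One common approach, assumed in the sketch below, is to rank models per metric and average the ranks, which sidesteps the different scales and directions of FID, CLIP Score, and HPS. The function name, model names, and scores are made-up placeholders, not measured values.

```python
def combined_ranking(scores, lower_is_better=("fid",)):
    """Order models by mean rank across metrics (rank 1 = best)."""
    models = list(scores)
    metrics = list(next(iter(scores.values())))
    rank_sums = {m: 0 for m in models}
    for metric in metrics:
        # Ascending sort for lower-is-better metrics, descending otherwise.
        reverse = metric not in lower_is_better
        ordered = sorted(models, key=lambda m: scores[m][metric], reverse=reverse)
        for rank, model in enumerate(ordered, start=1):
            rank_sums[model] += rank
    mean_rank = {m: rank_sums[m] / len(metrics) for m in models}
    return sorted(models, key=lambda m: mean_rank[m]), mean_rank


# Hypothetical scores for illustration only.
example_scores = {
    "model_a": {"fid": 4.0, "clip": 0.27, "hps": 80.0},
    "model_b": {"fid": 6.0, "clip": 0.29, "hps": 78.0},
    "model_c": {"fid": 5.0, "clip": 0.25, "hps": 75.0},
}
order, mean_rank = combined_ranking(example_scores)
print(order)
print(mean_rank)
```

Ties are broken by input order in this sketch; a real leaderboard would need an explicit tie rule and possibly per-metric weights.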

Core Metrics Explained

FID (Fréchet Inception Distance)

FID measures the distance between the distribution of generated images and real images.

  • Principle: Uses Inception-v3 network to extract features, computes Fréchet distance between two distributions
  • Interpretation: Lower is better. FID < 5 indicates generated quality approaching real images, FID > 20 generally indicates poor quality
  • Limitations: Cannot detect specific quality issues (e.g., face deformities), insensitive to semantic consistency
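
Written out, the Fréchet distance between the two Gaussians fitted to the Inception-v3 features is shown below. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ are the feature mean and covariance for real images, and $\mu_g, \Sigma_g$ the same statistics for generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Identical feature statistics give an FID of 0, and the score grows as the means or covariances drift apart.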

CLIP Score

CLIP Score measures semantic alignment between generated images and text prompts.

  • Principle: Uses CLIP model to encode both image and text, computes cosine similarity in embedding space
  • Interpretation: Higher is better. Score range typically 0.15 - 0.35, > 0.25 indicates good semantic alignment
  • Limitations: Insensitive to visual quality; a low-quality image that matches the prompt may still score high
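
The similarity step itself is small. The sketch below uses toy 3-dimensional vectors in place of real CLIP embeddings (an assumption for illustration; real embeddings have hundreds of dimensions) to show that cosine similarity depends only on direction, not on vector length.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for an image embedding and a text embedding.
image_embedding = [0.2, 0.9, 0.1]
text_embedding = [0.4, 1.8, 0.2]  # same direction, twice the length

print(cosine_similarity(image_embedding, text_embedding))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because length cancels out, unnormalized embeddings must be divided by their norms before a plain dot product can be read as a CLIP Score.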

HPS (Human Preference Score)

HPS evaluates generation quality by training a model to predict human preferences.

  • Principle: Regression model trained on large-scale human preference data
  • Interpretation: Available as HPSv2 and HPSv2.1, score range 0-100, higher is better
  • Advantage: Closer to human perception than pure automated metrics

DPG (Dense Prompt Graph)

DPG evaluates model adherence to specific elements in long, detailed prompts.

  • Principle: In DPG-Bench, each dense prompt is broken down into questions about its entities, attributes, and relations, and a visual question answering model checks every question against the generated image
  • Interpretation: Higher is better; the score reflects the share of prompt elements that actually appear in the image
  • Advantage: Quantifies ability to follow multi-object, complex prompts
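
Whatever model supplies the per-element verdicts, the final tally is simple. The sketch below assumes the verdicts are already available as a list of elements found in the image. `adherence_report` and its inputs are hypothetical names for illustration, not part of any benchmark's API.

```python
def adherence_report(expected, found):
    """Compare the elements a prompt asks for with those found in the image."""
    expected_set, found_set = set(expected), set(found)
    if not expected_set:
        raise ValueError("expected must contain at least one element")
    matched = expected_set & found_set
    missing = expected_set - found_set
    return {
        "hit_rate": len(matched) / len(expected_set),
        "omission_rate": len(missing) / len(expected_set),
        "missing": sorted(missing),
    }


report = adherence_report(
    expected=["cat", "red ball", "wooden fence", "sunset sky"],
    found=["cat", "wooden fence", "sunset sky", "tree"],
)
print(report)
```

Extra elements in the image ("tree" here) are ignored; only omissions of requested elements lower the score.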

FVD (Fréchet Video Distance)

For video generation models, FVD measures video sequence generation quality.

  • Principle: Similar to FID but for temporal sequences
  • Application: Not applicable to pure image generation models

Artificial Analysis Leaderboard Rankings

Artificial Analysis maintains one of the most comprehensive AI model benchmark leaderboards, covering image generation, text generation, and more.

Z-Image's Position on the Leaderboard

According to the latest Artificial Analysis leaderboard data, Z-Image ranks near the top among open-source image generation models:

Overall ranking (open-source category):

| Rank | Model | Parameters | FID (↓) | CLIP Score (↑) | HPSv2 (↑) |
|---|---|---|---|---|---|
| 1 | Z-Image Omni-Base | 6B | ~4.2 | ~0.28 | ~82.5 |
| 2 | Flux.1 Dev | 12B | ~3.8 | ~0.29 | ~83.1 |
| 3 | SDXL Turbo | 3.5B | ~5.1 | ~0.26 | ~78.3 |
| 4 | SD 3.0 Medium | 2.5B | ~5.8 | ~0.25 | ~76.1 |
| 5 | Stable Cascade | 4B | ~6.2 | ~0.24 | ~74.8 |

Note: Data reflects community testing reference values; specific numbers vary with testing methods and time

Key Findings

  1. Efficiency advantage: Z-Image achieves quality close to 12B Flux.1 Dev with only 6B parameters, standing out in parameter efficiency
  2. FID advantage: Near-top performance in image quality distribution matching, indicating high visual quality
  3. CLIP Score: Mid-to-upper tier semantic alignment, with room for improvement on complex multi-object prompts

Comparison with Leading Open-Source Models

Detailed Comparison Table

| Metric | Z-Image | Flux.1 Dev | SDXL | SD 3.0 | Stable Cascade |
|---|---|---|---|---|---|
| Parameters | 6B | 12B | 3.5B | 2.5B | 4B |
| Architecture | DiT (Flux) | DiT | U-Net | MMDiT | Stage-based |
| Inference Steps (recommended) | 20-30 | 20-30 | 20-50 | 25-50 | 10-30 |
| VRAM (1024x1024) | ~14GB | ~20GB | ~10GB | ~8GB | ~12GB |
| Inference Speed (RTX 4090) | ~3-5s | ~5-8s | ~2-4s | ~2-3s | ~3-5s |
| FID | ~4.2 | ~3.8 | ~6.5 | ~5.8 | ~6.2 |
| CLIP Score | ~0.28 | ~0.29 | ~0.24 | ~0.25 | ~0.24 |
| Text Rendering | Good | Excellent | Average | Medium | Average |
| Face Quality | Good | Excellent | Medium | Medium | Medium |
| Multi-object Consistency | Good | Excellent | Average | Medium | Average |

Per-Model Strength Analysis

Flux.1 Dev

  • Strengths: Currently highest overall quality among open-source models, outstanding text rendering, excellent multi-object consistency
  • Weaknesses: 12B parameters require more VRAM, slower inference
  • Best for: Scenarios demanding maximum quality with sufficient hardware resources

Z-Image

  • Strengths: 6B parameters achieving near-Flux quality, faster inference, lower VRAM requirements, unified model supporting generation + editing
  • Weaknesses: Gap with Flux remains in extremely complex scenes and text rendering
  • Best for: Comprehensive scenarios balancing quality and efficiency

SDXL

  • Strengths: Most mature ecosystem, richest community resources, most LoRA and fine-tuning resources available
  • Weaknesses: Overall generation quality lags behind newer DiT models
  • Best for: Scenarios requiring extensive community resources and third-party tool support

SD 3.0

  • Strengths: MMDiT architecture, fewer parameters
  • Weaknesses: Limited community feedback post-release, average quality performance
  • Best for: Extremely resource-constrained environments

Speed Comparison

Inference speed testing on different hardware (1024x1024, 30 steps):

| GPU | Z-Image | Flux.1 Dev | SDXL |
|---|---|---|---|
| RTX 3080 | ~8s | ~14s | ~5s |
| RTX 4090 | ~4s | ~7s | ~2.5s |
| A100 (40GB) | ~3s | ~5s | ~2s |
| A100 (80GB) | ~3s | ~5s | ~2s |
| M2 Max (96GB) | ~6s | ~10s | ~4s |

LMSYS Chatbot Arena Image Generation Rankings

LMSYS Chatbot Arena evaluates model quality through human voting (Elo ratings), representing the benchmark closest to real user perception.
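
For readers unfamiliar with Elo, the sketch below shows the classic update rule after a single head-to-head vote. The 400-point scale and K-factor of 32 are the conventional chess defaults and are assumptions here; an arena may fit ratings with a different statistical model over all votes at once.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, a_won, k=32):
    """Return both ratings after one vote between A and B."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Whatever A gains, B loses, so the rating total is conserved.
    return rating_a + delta, rating_b - delta


new_a, new_b = update_elo(1200, 1200, a_won=True)
print(new_a, new_b)
```

A win against an equally rated opponent moves each rating by half of K; a win against a much stronger opponent moves it by nearly the full K.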

Image Generation Arena Rankings

| Rank | Model | Elo Rating | Win Rate |
|---|---|---|---|
| 1 | DALL-E 3 | 1185 | 58.2% |
| 2 | Midjourney v6 | 1178 | 57.8% |
| 3 | Flux.1 Pro | 1165 | 57.1% |
| 4 | Imagen 3 | 1150 | 56.3% |
| 5 | SDXL Turbo | 1120 | 54.8% |

Z-Image is currently primarily tested as an open-source model in the community and has not yet entered mainstream LMSYS Arena rankings at scale

Correlation Between Human Preference and Automated Metrics

Research shows moderate positive correlation between human preference and automated metrics:

  • FID vs HPS correlation: ~0.65
  • CLIP Score vs HPS correlation: ~0.55
  • Combined auto metrics vs HPS correlation: ~0.78

This means automated metrics can provide reference but cannot fully replace human evaluation.
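
A correlation like the ones above can be computed with a plain Pearson coefficient. The sketch below is a minimal stdlib version. Because FID is lower-is-better, it is negated first so that a positive coefficient means "better automated score goes with higher human preference". The sample values are made-up placeholders, not benchmark data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model scores for illustration only.
fid_scores = [4.0, 5.0, 6.0, 7.0]
hps_scores = [80.0, 78.0, 79.0, 74.0]

# Negate FID so both series point in the "higher is better" direction.
r = pearson([-f for f in fid_scores], hps_scores)
print(f"FID vs HPS correlation: {r:.2f}")
```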

Subjective Quality Assessment

Composition and Aesthetics

| Dimension | Z-Image Score | Notes |
|---|---|---|
| Composition | 7.5/10 | Good adherence to classic composition rules |
| Color Harmony | 7.8/10 | High color harmony |
| Lighting | 7.2/10 | Good natural lighting, special lighting scenarios need improvement |
| Depth/Layers | 7.0/10 | Medium ability to distinguish foreground, midground, background |

Detail Performance

| Dimension | Z-Image Score | Notes |
|---|---|---|
| Texture Detail | 7.5/10 | Good surface texture reproduction |
| Edge Sharpness | 7.8/10 | Clean object edge handling |
| Small Object Detail | 6.8/10 | Details of distant small objects may be lost |

Text Rendering

Text rendering is a general advantage of DiT architecture models:

| Text Type | Accuracy | Notes |
|---|---|---|
| Simple English words | ~85% | Common words rendered accurately |
| Complex English phrases | ~65% | Accuracy drops for multi-word combinations |
| Chinese text | ~45% | Limited Chinese rendering capability |
| Numbers and symbols | ~80% | Strong number rendering capability |

Faces and Hands

| Dimension | Z-Image Score | Notes |
|---|---|---|
| Face Symmetry | 7.5/10 | Generally symmetric, occasional minor deviation |
| Eye Consistency | 7.8/10 | Good eye direction consistency |
| Teeth Rendering | 7.0/10 | Teeth may deform in smiling scenarios |
| Finger Count | 7.2/10 | Finger count is usually correct, but errors still occur |
| Finger Detail | 6.5/10 | Weak finger joint and nail detail |

Performance Benchmarks

VRAM Usage

| Resolution | FP16 | BF16 | FP8 | NF4 |
|---|---|---|---|---|
| 512x512 | ~10GB | ~10GB | ~7GB | ~5GB |
| 768x768 | ~12GB | ~12GB | ~8.5GB | ~6GB |
| 1024x1024 | ~14GB | ~14GB | ~9GB | ~7GB |
| 1536x1536 | ~18GB | ~18GB | ~11GB | ~9GB |
| 2048x2048 | ~22GB | ~22GB | ~14GB | ~11GB |
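
These figures can be turned into a quick "what fits on my card" check. The sketch below copies the approximate values from the table above into a lookup. The 1 GB headroom for the desktop and other processes is an assumption, and `precisions_that_fit` is a hypothetical helper name.

```python
# Approximate VRAM use in GB, copied from the table above.
VRAM_GB = {
    "512x512":   {"FP16": 10, "BF16": 10, "FP8": 7,   "NF4": 5},
    "768x768":   {"FP16": 12, "BF16": 12, "FP8": 8.5, "NF4": 6},
    "1024x1024": {"FP16": 14, "BF16": 14, "FP8": 9,   "NF4": 7},
    "1536x1536": {"FP16": 18, "BF16": 18, "FP8": 11,  "NF4": 9},
    "2048x2048": {"FP16": 22, "BF16": 22, "FP8": 14,  "NF4": 11},
}


def precisions_that_fit(resolution, gpu_vram_gb, headroom_gb=1.0):
    """List the precisions whose estimated use plus headroom fits the GPU."""
    needed = VRAM_GB[resolution]
    return [p for p, gb in needed.items() if gb + headroom_gb <= gpu_vram_gb]


print(precisions_that_fit("1024x1024", gpu_vram_gb=12))
print(precisions_that_fit("1024x1024", gpu_vram_gb=24))
print(precisions_that_fit("2048x2048", gpu_vram_gb=12))
```

On a 12 GB card at 1024x1024, only the quantized FP8 and NF4 variants pass this check.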

Memory Usage

| Component | Memory Size |
|---|---|
| Diffusion backbone (FP16) | ~12 GB |
| T5-XXL (FP16) | ~15 GB |
| CLIP-L (FP16) | ~0.4 GB |
| VAE (FP16) | ~0.3 GB |
| Total | ~27.7 GB |
| Diffusion backbone (FP8) | ~6 GB |
| Total (FP8) | ~21.7 GB |
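
The ~12 GB and ~6 GB entries follow directly from the 6B parameter count: weight memory is roughly the number of parameters times the bytes per parameter. The sketch below shows that arithmetic. It covers weights only, ignores activations and framework overhead, and `weight_memory_gb` is a hypothetical helper name.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "BF16": 2, "FP8": 1}


def weight_memory_gb(num_params, precision):
    """Approximate memory needed to hold the weights alone, in GB."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


print(weight_memory_gb(6e9, "FP16"))
print(weight_memory_gb(6e9, "FP8"))
```

Halving the bytes per parameter halves the weight footprint, which is why FP8 saves about 6 GB on the 6B model while the other components stay unchanged.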

Inference Latency

| Scenario | Steps | RTX 4090 | A100 | T4 |
|---|---|---|---|---|
| 512x512 T2I | 20 | 2.1s | 1.8s | 8.5s |
| 1024x1024 T2I | 30 | 4.5s | 3.8s | 18s |
| 1024x1024 I2I | 25 | 3.8s | 3.2s | 15s |
| 1024x1024 Inpaint | 30 | 4.8s | 4.0s | 19s |

Throughput Testing

| Configuration | Batch Size | Images per Second |
|---|---|---|
| A100 80GB, FP16 | 1 | 0.26 |
| A100 80GB, FP16 | 4 | 0.85 |
| A100 80GB, FP8 | 1 | 0.35 |
| A100 80GB, FP8 | 8 | 1.4 |
| RTX 4090, FP16 | 1 | 0.22 |
| RTX 4090, FP8 | 1 | 0.28 |
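
Throughput and latency are two views of the same measurement: images per second equals batch size divided by seconds per batch. The sketch below converts the A100 FP16 rows above back into per-batch latency and a batching speedup. The helper names are hypothetical.

```python
def seconds_per_batch(batch_size, images_per_second):
    """Wall-clock time for one batch at the given throughput."""
    return batch_size / images_per_second


def batching_speedup(base_ips, batched_ips):
    """How many times more images per second batching delivers."""
    return batched_ips / base_ips


# A100 80GB, FP16 rows from the table above.
print(f"batch 1: {seconds_per_batch(1, 0.26):.2f}s per batch")
print(f"batch 4: {seconds_per_batch(4, 0.85):.2f}s per batch")
print(f"speedup: {batching_speedup(0.26, 0.85):.2f}x")
```

The batch-of-4 run takes longer per batch than a single image but delivers more images per second overall, which is the usual trade between latency and throughput.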

Training Quality Benchmarks

LoRA Fine-tuning Results

| Task Type | Training Data | Training Steps | Effect Score |
|---|---|---|---|
| Character Consistency | 20 images | 2000 | 7.5/10 |
| Style Transfer | 50 images | 3000 | 8.0/10 |
| Object Replacement | 15 images | 1500 | 7.0/10 |
| Scene Style | 30 images | 2500 | 7.8/10 |

Fine-tuning Speed

| GPU | 20 Image Training | 50 Image Training |
|---|---|---|
| RTX 3080 | ~15 min | ~30 min |
| RTX 4090 | ~8 min | ~16 min |
| A100 (40GB) | ~5 min | ~10 min |

Fine-tuning Quality Comparison

Z-Image LoRA fine-tuning compared to other models:

  • Convergence speed: Comparable to Flux, achieving good results in ~1000-2000 steps
  • Overfitting tendency: Medium; dropout and data augmentation are recommended
  • Generalization ability: Above-average performance on scenes not seen in the training data
  • Edit task transfer: Fine-tuning simultaneously improves generation and editing task results (unified model advantage)

Benchmark Reproduction Guide

Using miroleon/z-image-turbo-benchmark

The GitHub repository miroleon/z-image-turbo-benchmark provides standardized benchmark testing tools.

# Clone repository
git clone https://github.com/miroleon/z-image-turbo-benchmark.git
cd z-image-turbo-benchmark

# Install dependencies
pip install -r requirements.txt

# Run benchmark
python benchmark.py \
    --model z-image/omni-base \
    --dataset parti-prompts \
    --output results/ \
    --metrics fid clip hps \
    --num-samples 1000 \
    --batch-size 4

Custom Test Script

import time

import torch
from diffusers import ZImagePipeline


def benchmark_generation(model_path, prompts, num_repeats=3):
    """Time generation for each prompt and record its peak VRAM."""
    pipe = ZImagePipeline.from_pretrained(model_path, torch_dtype=torch.float16)
    pipe.to("cuda")

    results = []
    for prompt in prompts:
        # Reset the peak counter so each prompt reports its own peak VRAM
        torch.cuda.reset_peak_memory_stats()
        times = []
        for _ in range(num_repeats):
            # CUDA work is asynchronous; synchronize so the timer covers the whole run
            torch.cuda.synchronize()
            start = time.perf_counter()
            with torch.no_grad():
                pipe(
                    prompt=prompt,
                    width=1024,
                    height=1024,
                    num_inference_steps=30,
                    guidance_scale=7.5,
                )
            torch.cuda.synchronize()
            times.append(time.perf_counter() - start)

        # Note: the very first run includes one-time warm-up cost
        avg_time = sum(times) / len(times)
        peak_vram = torch.cuda.max_memory_allocated() / 1e9

        results.append({
            "prompt": prompt,
            "avg_time": avg_time,
            "peak_vram_gb": peak_vram,
            "times": times,
        })

    return results

# Usage
test_prompts = [
    "a cat sitting on a wall",
    "a city skyline at sunset",
    "a forest path with morning fog",
]

results = benchmark_generation("z-image/omni-base", test_prompts)
for r in results:
    print(f"Prompt: {r['prompt'][:50]}...")
    print(f"  Avg time: {r['avg_time']:.2f}s, Peak VRAM: {r['peak_vram_gb']:.1f}GB")

Using FID Calculation Tool

from pytorch_fid import fid_score

# Prepare real images and generated images directories
# real_images/ - real images
# generated_images/ - generated images

fid = fid_score.calculate_fid_given_paths(
    ["real_images", "generated_images"],
    batch_size=32,
    device="cuda",
    dims=2048
)
print(f"FID Score: {fid:.4f}")

CLIP Score Calculation

import clip
import torch
from PIL import Image

clip_model, preprocess = clip.load("ViT-L/14", device="cuda")

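def calculate_clip_score(image_path, text_prompt):
    image = preprocess(Image.open(image_path)).unsqueeze(0).cuda()
    # Tokenize only; the encoding happens in clip_model.encode_text below
    text = clip.tokenize([text_prompt]).cuda()

    with torch.no_grad():
        image_features = clip_model.encode_image(image)
        text_features = clip_model.encode_text(text)

    # Normalize so the dot product is a cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    similarity = (image_features @ text_features.T).item()
    return similarity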

# Usage
score = calculate_clip_score("generated.png", "a cat sitting on a wall")
print(f"CLIP Score: {score:.4f}")

Limitations of Automated Benchmarks

FID Blind Spots

  • Cannot detect semantic errors (e.g., generating wrong objects)
  • Insensitive to image diversity (over-homogenized generation may achieve low FID)
  • Relies on Inception-v3 features, insensitive to out-of-distribution content

CLIP Score Biases

  • Tends to reward "average" images
  • Insensitive to visual quality
  • May be inaccurate for specific styles (e.g., abstract art)

HPS Biases

  • Training data preferences may introduce bias
  • May be inaccurate for edge cases (extreme styles)
  • Cultural background preference differences not fully accounted for

What Automated Benchmarks Cannot Replace

  • Creative diversity: Automated metrics struggle to measure creativity
  • Cultural relevance: Different cultural backgrounds have different quality evaluation standards
  • Task-specific needs: Requirements for specific application scenarios may not be covered
  • Long-term consistency: Style consistency across batch generation is hard to measure with single tests

Practical Takeaway: What Benchmarks Mean for Real-World Usage

Considerations When Choosing Models

  1. Models with low FID: Generated images closer to real photo quality, suitable for photorealistic style
  2. Models with high CLIP Score: Higher prompt adherence, suitable for precise output control
  3. Models with high HPS: Output that better matches human visual preference, suitable for end-user-facing scenarios

Z-Image's Practical Positioning

  • Cost-effective choice: Achieves near-top model quality at 6B parameter scale, suitable for most users
  • Unified model advantage: Generation + editing integration simplifies workflow
  • Deployment friendly: Lower VRAM requirements make it easier to deploy on consumer GPUs
  • Ecosystem compatible: Compatible with ComfyUI, Diffusers, Kohya, and other mainstream tools

Recommendations

  • Individual creators: Z-Image is the most cost-effective choice; per the VRAM figures above, the 6B model fits a 12GB card such as the RTX 3060 at 1024x1024 when run with FP8 or NF4 quantization
  • Professional studios: Consider Flux.1 Dev for maximum quality, but be aware of 12B parameter hardware requirements
  • Batch production: Z-Image's inference speed and VRAM efficiency suit large-scale image generation
  • Editing workflows: Omni-Base's unified model architecture reduces model switching, increasing editing efficiency

References

Z-Image Team
