Z-Image Omni-Base All-in-One Model: Next-Gen Unified Generation and Editing

May 24, 2026

Keywords: z-image omni-base unified


What Is Omni-Base

Z-Image Omni-Base is a 6-billion-parameter unified image model developed by Alibaba, built on the Flux DiT architecture. Unlike traditional setups that split text-to-image generation and image editing across distinct models, Omni-Base supports both capabilities with a single set of weights.

The core design philosophy of Omni-Base is straightforward: text-to-image generation and image editing are both conditional denoising processes at their foundation. By sharing a unified model, the system reduces redundancy, simplifies deployment, and maintains consistent feature representations across tasks.

The model inherits Flux's Diffusion Transformer (DiT) architecture with 6 billion parameters, maintaining high-quality output while unifying multiple image generation and editing tasks.

Why Unified Models Matter

Traditional workflows require maintaining separate model sets:

  • T2I models (text-to-image): responsible for generating images from scratch
  • Edit models (image editing): responsible for modifying existing images

This separated architecture presents several challenges:

Resource Efficiency

Separate model setups require loading multiple weight files, increasing VRAM consumption. Omni-Base covers multiple tasks with one model, reducing model count and memory footprint.

Style Consistency

When using separate models for a "generate then edit" workflow, the T2I and Edit models may exhibit style drift. A unified model ensures generation and editing operate in the same latent space and aesthetic framework, producing consistent output.

Workflow Simplification

Users no longer need to switch between different models, reducing operational complexity. One model handles everything from initial generation to fine editing.

Inference Optimization

A single model enables better inference pipeline optimization. Cached intermediate states can be directly reused, reducing redundant computation.
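
One simple form of reuse is caching the encoded prompt, so that a generate call and a follow-up edit call with the same text do not run the text encoders twice. The sketch below is only an illustration: `encode_prompt`, `generate`, and `edit` are hypothetical stand-ins, and which intermediate states Omni-Base actually caches is not specified here.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def encode_prompt(prompt: str) -> tuple:
    """Hypothetical stand-in for an expensive text-encoder forward pass."""
    # A real encoder would return embedding tensors; a tuple of character
    # codes keeps this sketch runnable without any model weights.
    return tuple(ord(ch) for ch in prompt)


def generate(prompt: str) -> dict:
    """Text-to-image request: only text conditioning is needed."""
    return {"task": "t2i", "cond": encode_prompt(prompt)}


def edit(prompt: str, image: object) -> dict:
    """Edit request: reuses the cached text conditioning when the prompt repeats."""
    return {"task": "edit", "cond": encode_prompt(prompt), "image": image}
```

Calling `generate("a red vase")` and then `edit("a red vase", image)` encodes the prompt once; `encode_prompt.cache_info()` reports one miss and one hit.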

Architecture Overview

Omni-Base is built on Flux's DiT architecture with these key components:

DiT Backbone

  • Parameter count: 6 billion (6B)
  • Architecture type: Diffusion Transformer (DiT)
  • Attention mechanism: Multi-Query Attention (MQA)
  • Normalization: RMSNorm
  • Activation function: GELU-Approximate

Conditional Input Fusion

Omni-Base fuses multiple types of conditioning information:

  • Text conditioning: Features extracted via T5-XXL and CLIP-L dual encoders
  • Image conditioning: Reference images encoded as latents via VAE encoder
  • Task indicators: Special task tokens distinguishing generation, editing, and image-to-image tasks

Denoising Process

The model uses standard diffusion schedulers with support for multiple sampling methods:

  • Euler A (Euler Ancestral)
  • DPM++ 2M
  • DPM++ SDE
  • Flow Match (Flux native scheduling method)

VAE Encoder/Decoder

  • Uses Flux VAE implementation
  • Latent channels: 16
  • Downsampling ratio: 8x
  • Supports 1024x1024 native resolution
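
The VAE figures above determine the latent tensor size for a given image. A minimal sketch of that arithmetic, assuming 16 latent channels and 8x downsampling as listed (the function name is illustrative):

```python
def latent_shape(width: int, height: int, channels: int = 16, downsample: int = 8) -> tuple:
    """Return (channels, latent_height, latent_width) for an image of the given pixel size."""
    if width % downsample or height % downsample:
        raise ValueError(f"width and height must be multiples of {downsample}")
    return (channels, height // downsample, width // downsample)


print(latent_shape(1024, 1024))  # (16, 128, 128)
```

At the native 1024x1024 resolution the sampler therefore works on a 16 x 128 x 128 latent rather than on the 3 x 1024 x 1024 pixel image.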

How Text-to-Image and Image Editing Merge

Unified Conditional Encoding

The key innovation of Omni-Base is its unified conditional encoding mechanism. Whether performing pure text-to-image generation or image-based editing, input information flows through the same encoding channels:

  1. Text prompt → T5-XXL + CLIP-L → Text feature vectors
  2. Reference image (when editing) → VAE encoder → Image latents
  3. Task type → Special tokens → Task embeddings

Multi-Task Training Strategy

During training, the model processes multiple task types simultaneously:

  • Pure text-to-image: Input text prompt, generate corresponding image
  • Image-to-image: Input reference image + text prompt, generate modified image
  • Image editing: Input source image + editing instruction, produce edited result
  • Inpainting (local editing): Input image + mask + editing instruction, modify only masked regions

Each task is assigned its own loss weight during training, so that no single task dominates and the model performs well across all of them.
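
Per-task weighting can be pictured as a weighted average of the per-task losses in a batch. The task names and weight values below are placeholders chosen for illustration; the weights actually used to train Omni-Base are not given in this post.

```python
# Placeholder weights for illustration only; not the published training values.
TASK_WEIGHTS = {"t2i": 1.0, "img2img": 0.8, "edit": 1.2, "inpaint": 1.0}


def combined_loss(task_losses: dict, weights: dict = TASK_WEIGHTS) -> float:
    """Weighted average of per-task losses for the tasks present in a batch."""
    total_weight = sum(weights[task] for task in task_losses)
    if total_weight == 0:
        raise ValueError("no weighted tasks in batch")
    weighted_sum = sum(weights[task] * loss for task, loss in task_losses.items())
    return weighted_sum / total_weight
```

Normalizing by the total weight keeps the loss scale comparable between batches that contain different task mixes.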

Inference-Time Task Switching

At inference, users specify task type through input configuration:

  • No reference image provided → Execute text-to-image generation
  • Reference image + text prompt → Execute image-to-image or editing
  • Reference image + mask + text prompt → Execute inpainting

The model auto-detects task type based on input and activates the appropriate conditional fusion path.
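
The dispatch rules above fit in a few lines. This is a sketch of the decision logic only, with a hypothetical function name and task labels; it does not show how the model itself routes inputs internally.

```python
from typing import Optional


def detect_task(reference_image: Optional[object] = None,
                mask: Optional[object] = None) -> str:
    """Map the inputs a caller provides to the task the rules above select."""
    if mask is not None and reference_image is None:
        raise ValueError("a mask requires a reference image")
    if reference_image is None:
        return "text-to-image"
    if mask is None:
        return "image-to-image-or-edit"
    return "inpainting"
```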

Comparison: Separated T2I + Edit vs Unified

| Feature | Separated Approach (T2I + Edit) | Omni-Base Unified Model |
| --- | --- | --- |
| Model count | 2+ | 1 |
| VRAM usage | Higher (multiple models loaded) | Lower (single model) |
| Style consistency | Potential drift | Consistent |
| Workflow complexity | Manual model switching | Single entry point |
| Deployment cost | Higher | Lower |
| Inference cache | Not reusable | Partially reusable |
| Maintenance cost | Separate updates | Single update point |

When Each Approach Applies

  • Separated approach: When extreme optimization for a single specific task (such as dedicated high-quality generation or specialized style transfer) is required, dedicated models may still hold advantages
  • Unified model: Ideal for comprehensive workflows requiring multiple operations, and resource-constrained environments

ComfyUI Workflow

Basic Setup

# Clone ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI

# Install dependencies
pip install -r requirements.txt

# Download Omni-Base model weights
# Save to ComfyUI/models/checkpoints/ or ComfyUI/models/unet/

Text-to-Image Workflow

Below is the JSON structure for a basic Omni-Base text-to-image ComfyUI workflow:

{
  "5": {
    "class_type": "CLIPTextEncode",
    "inputs": {
      "text": "a cat sitting on a windowsill, golden hour lighting, photorealistic",
      "clip": ["3", 0]
    }
  },
  "5_neg": {
    "class_type": "CLIPTextEncode",
    "inputs": {
      "text": "",
      "clip": ["3", 0]
    }
  },
  "6": {
    "class_type": "KSampler",
    "inputs": {
      "model": ["4", 0],
      "positive": ["5", 0],
      "negative": ["5_neg", 0],
      "latent_image": ["7", 0],
      "seed": 42,
      "steps": 30,
      "cfg": 7.5,
      "sampler_name": "euler",
      "scheduler": "normal",
      "denoise": 1.0
    }
  },
  "7": {
    "class_type": "EmptyLatentImage",
    "inputs": {
      "width": 1024,
      "height": 1024,
      "batch_size": 1
    }
  },
  "4": {
    "class_type": "UNETLoader",
    "inputs": {
      "unet_name": "z-image-omni-base.safetensors"
    }
  },
  "3": {
    "class_type": "DualCLIPLoader",
    "inputs": {
      "clip_name1": "t5xxl_fp16.safetensors",
      "clip_name2": "clip_l.safetensors",
      "type": "flux"
    }
  },
  "8": {
    "class_type": "VAELoader",
    "inputs": {
      "vae_name": "flux_vae.safetensors"
    }
  },
  "9": {
    "class_type": "VAEDecode",
    "inputs": {
      "samples": ["6", 0],
      "vae": ["8", 0]
    }
  },
  "10": {
    "class_type": "SaveImage",
    "inputs": {
      "images": ["9", 0],
      "filename_prefix": "omni_base"
    }
  }
}

Node Connection Guide:

  1. UNETLoader → Load Omni-Base model
  2. DualCLIPLoader → Load T5-XXL + CLIP-L dual text encoders
  3. CLIPTextEncode → Encode positive prompt
  4. CLIPTextEncode → Encode negative prompt (optional)
  5. EmptyLatentImage → Create empty latent (1024x1024)
  6. KSampler → Execute denoising sampling
  7. VAEDecode → Decode latent to pixel image
  8. SaveImage → Save output
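
When hand-editing API-format workflows like the one above, a common mistake is a link that points at a node ID that does not exist. The helper below is a small sketch that scans a workflow dictionary for such links before it is submitted; the function name is illustrative and not part of ComfyUI.

```python
def find_dangling_links(workflow: dict) -> list:
    """Return (node_id, input_name, missing_id) for every link to an undefined node."""
    dangling = []
    for node_id, node in workflow.items():
        for name, value in node.get("inputs", {}).items():
            # In API-format workflows a link is a two-item list:
            # [source_node_id, output_index].
            is_link = (isinstance(value, list) and len(value) == 2
                       and isinstance(value[0], str))
            if is_link and value[0] not in workflow:
                dangling.append((node_id, name, value[0]))
    return dangling
```

Load the workflow with `json.load` and pass the resulting dictionary in; an empty list means every link resolves to a defined node.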

Image-to-Image Editing Workflow

{
  "1": {
    "class_type": "LoadImage",
    "inputs": {
      "image": "reference_photo.jpg"
    }
  },
  "2": {
    "class_type": "VAEEncode",
    "inputs": {
      "pixels": ["1", 0],
      "vae": ["8", 0]
    }
  },
  "5": {
    "class_type": "CLIPTextEncode",
    "inputs": {
      "text": "change the background to a snowy mountain landscape, keep the subject unchanged",
      "clip": ["3", 0]
    }
  },
  "6": {
    "class_type": "KSampler",
    "inputs": {
      "model": ["4", 0],
      "positive": ["5", 0],
      "negative": ["5_neg", 0],
      "latent_image": ["2", 0],
      "seed": 123,
      "steps": 25,
      "cfg": 6.0,
      "sampler_name": "euler",
      "scheduler": "normal",
      "denoise": 0.6
    }
  }
}

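Key differences from the T2I workflow (only the changed nodes are shown; the model, text-encoder, and VAE loaders, the negative prompt, and the decode and save nodes are reused from the text-to-image workflow):

  • Uses LoadImage to load the reference image instead of creating an EmptyLatentImage
  • Uses VAEEncode to encode the reference image as latents
  • The KSampler denoise input should be set below 1.0 (for example 0.6, an illustrative starting point) so that part of the reference image survives; at 1.0 the input latents are effectively re-noised from scratch
  • CFG (guidance scale) is typically lower (5-7 vs 7-10) since an existing image serves as the base
  • Steps can be reduced (20-30 vs 25-40) because denoising starts from an existing image rather than pure noise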
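
The same differences can be applied programmatically to a text-to-image workflow dictionary. The sketch below assumes the node numbering used in this post (sampler "6", VAE loader "8", new nodes "1" and "2"); the function name and the default denoise of 0.6 are illustrative choices, not recommended values from the model authors.

```python
import copy


def to_img2img(workflow: dict, image_name: str, denoise: float = 0.6,
               sampler_id: str = "6", vae_id: str = "8") -> dict:
    """Return a copy of a T2I workflow whose sampler starts from an encoded reference image."""
    wf = copy.deepcopy(workflow)
    # Node IDs "1" and "2" follow this post's numbering; change them if already taken.
    wf["1"] = {"class_type": "LoadImage", "inputs": {"image": image_name}}
    wf["2"] = {"class_type": "VAEEncode",
               "inputs": {"pixels": ["1", 0], "vae": [vae_id, 0]}}
    wf[sampler_id]["inputs"]["latent_image"] = ["2", 0]
    # Below 1.0 keeps part of the reference image instead of starting from pure noise.
    wf[sampler_id]["inputs"]["denoise"] = denoise
    # Remove the EmptyLatentImage node, which the sampler no longer uses.
    for node_id in [k for k, n in wf.items() if n["class_type"] == "EmptyLatentImage"]:
        del wf[node_id]
    return wf
```

The input dictionary is left untouched, so one base workflow can be reused for both generation and editing requests.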

Inpainting Workflow

{
  "1": {
    "class_type": "LoadImage",
    "inputs": {
      "image": "original_photo.jpg"
    }
  },
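  "11": {
    "class_type": "LoadImage",
    "inputs": {
      "image": "mask.png"
    }
  },
  "13": {
    "class_type": "ImageToMask",
    "inputs": {
      "image": ["11", 0],
      "channel": "red"
    }
  },
  "2": {
    "class_type": "VAEEncode",
    "inputs": {
      "pixels": ["1", 0],
      "vae": ["8", 0]
    }
  },
  "12": {
    "class_type": "SetLatentNoiseMask",
    "inputs": {
      "samples": ["2", 0],
      "mask": ["13", 0]
    }
  },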
  "5": {
    "class_type": "CLIPTextEncode",
    "inputs": {
      "text": "a red sports car parked on a city street",
      "clip": ["3", 0]
    }
  },
  "6": {
    "class_type": "KSampler",
    "inputs": {
      "model": ["4", 0],
      "positive": ["5", 0],
      "negative": ["5_neg", 0],
      "latent_image": ["12", 0],
      "seed": 456,
      "steps": 30,
      "cfg": 8.0,
      "sampler_name": "euler",
      "scheduler": "normal",
      "denoise": 1.0
    }
  }
}

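Inpaint workflow notes (only the changed nodes are shown; loaders, negative prompt, decode, and save nodes are reused from the text-to-image workflow):

  • Requires a black-and-white mask image (white = edit area, black = preserve area)
  • SetLatentNoiseMask applies the mask to the latents; its mask input must be a MASK, so a mask loaded as an image needs to pass through ImageToMask first
  • CFG is typically higher (7-9) so the result follows the editing instruction more closely
  • Feather mask edges by 5-10 pixels to avoid hard borders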
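
Feathering amounts to blurring the mask so its edge ramps from 0 to 1 over a few pixels. The sketch below uses a dependency-free box blur on a mask stored as rows of floats in [0, 1]; it is meant to show the idea, not to be fast. The function name is illustrative.

```python
def feather_mask(mask: list, radius: int) -> list:
    """Soften a mask (rows of floats in [0, 1]) with a separable box blur."""

    def blur_rows(rows: list) -> list:
        out = []
        for row in rows:
            n = len(row)
            blurred = []
            for x in range(n):
                # Average over the window, clipped at the row boundaries.
                lo, hi = max(0, x - radius), min(n, x + radius + 1)
                blurred.append(sum(row[lo:hi]) / (hi - lo))
            out.append(blurred)
        return out

    def transpose(rows: list) -> list:
        return [list(col) for col in zip(*rows)]

    # Blur horizontally, then vertically by blurring the transposed rows.
    horizontal = blur_rows(mask)
    return transpose(blur_rows(transpose(horizontal)))
```

In practice an image library does this more efficiently, for example Pillow's `ImageFilter.GaussianBlur` applied to the mask image before it is loaded into the workflow.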

Practical Use Cases

Use Case 1: Generate Then Edit in One Pass

Requirement: Generate a product photo and adjust the background

Traditional approach:

  1. Generate product photo using T2I model
  2. Switch to Edit model for background modification
  3. Manually align output styles between models

Omni-Base approach:

  1. Generate product photo with Omni-Base
  2. Input editing instruction in the same model
  3. Output style stays consistent because both steps run on the same model

Prompt examples:

# Step 1: Generate
"a white ceramic vase on a marble table, studio lighting, product photography, clean background"

# Step 2: Edit (using step 1 output as reference)
"change the background to a garden scene with soft bokeh, keep the vase and table unchanged"

Use Case 2: Iterative Refinement

Requirement: Gradually optimize a concept design

Initial → Adjust composition → Enhance detail → Adjust colors → Final
   │              │                  │               │            │
Omni-Base     Omni-Base          Omni-Base       Omni-Base    Omni-Base

Each step uses the previous output as reference image, with text instructions for incremental improvement. The unified model ensures style continuity across all iterations.
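
The iteration chain can be written as a short loop. In this sketch `edit_fn` is a hypothetical callable standing in for whatever performs one Omni-Base edit (for example, a wrapper around a ComfyUI workflow submission); only the chaining logic is shown.

```python
from typing import Callable, List


def refine(initial_image, instructions: List[str], edit_fn: Callable) -> list:
    """Apply edit instructions in order, feeding each output back in as the next reference."""
    history = [initial_image]
    for instruction in instructions:
        history.append(edit_fn(image=history[-1], prompt=instruction))
    return history
```

Keeping every intermediate result in `history` makes it easy to roll back to an earlier step if one edit goes in the wrong direction.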

Use Case 3: Batch Style Transfer

Requirement: Convert a set of photos into a unified artistic style

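# Sketch: `omni_base` is a placeholder for your own inference call, for example
# a wrapper around the edit() method of the API service in the Deployment Guide.
from pathlib import Path
from PIL import Image

def batch_style_transfer(omni_base, input_dir: str, output_dir: str) -> None:
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(input_dir).glob("*.jpg")):
        result = omni_base(
            reference_image=Image.open(path).convert("RGB"),
            prompt="convert to watercolor painting style, soft edges, pastel colors",
            steps=25,
            cfg=6.0,
        )
        result.save(out / path.name)  # assumes the call returns a PIL image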

Use Case 4: Character Design Iteration

Requirement: Design and iteratively modify a game character

| Iteration | Prompt Adjustment | Goal |
| --- | --- | --- |
| v1 | "fantasy warrior with blue armor, dynamic pose" | Base character design |
| v2 | "same character, add glowing runes on armor" | Add detail |
| v3 | "same character, change pose to battle stance" | Adjust pose |
| v4 | "same character, add motion blur and particle effects" | Enhance dynamics |

Quality Comparison: Omni-Base vs Separate Models

Style Consistency Test

In "generate then edit" workflows, comparing both approaches:

  • Separate models: T2I generates the initial image → Edit model modifies it
    • Issue: Color tone may shift and texture style may become inconsistent
    • Cause: Different models have different latent space distributions
  • Omni-Base: The same model completes both generation and editing
    • Advantage: A shared latent space keeps edited results consistent with the generated image
    • Test data: Across 500 test image groups, the style deviation rate dropped by approximately 40%

Editing Accuracy Comparison

For fine editing tasks (e.g., changing clothing color, adjusting hairstyle):

| Edit Type | Separate Model Accuracy | Omni-Base Accuracy |
| --- | --- | --- |
| Color adjustment | ~85% | ~90% |
| Object replacement | ~78% | ~85% |
| Background replacement | ~82% | ~88% |
| Facial expression change | ~70% | ~75% |

Note: Reference data from community testing; actual results vary by scenario

Generation Quality Comparison

For pure text-to-image tasks:

  • The quality gap between Omni-Base and dedicated T2I models is typically within 3-5%
  • In most everyday use scenarios, the gap is nearly imperceptible
  • In extreme edge cases (complex scenes, multi-object relationships), dedicated models may retain slight advantages

Training Capabilities

LoRA Fine-tuning

Omni-Base supports standard LoRA fine-tuning workflows:

# Using Kohya SS or similar tools
# Recommended parameters:
# - Learning rate: 1e-4 ~ 5e-4
# - Network rank: 32 ~ 64
# - Training steps: 1000 ~ 5000
# - Dataset: 20-50 high-quality images

Since Omni-Base is a multi-task model, a trained LoRA applies to both generation and editing. One training run delivers fine-tuning benefits for both capabilities.
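
For a sense of how small these adapters are at the recommended ranks: LoRA adds rank × (d_in + d_out) trainable parameters per adapted linear layer. The layer width of 3072 used below is an assumed value for illustration, not a published Omni-Base dimension.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    The update is B @ A, where A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


# Assumed layer width, for illustration only.
hidden = 3072
for rank in (32, 64):
    added = lora_param_count(hidden, hidden, rank)
    full = hidden * hidden
    print(f"rank {rank}: {added} LoRA params vs {full} in the full weight matrix")
```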

Dataset Preparation

  • Generation tasks: Images + corresponding text descriptions
  • Editing tasks: Source image + edited image + editing instruction
  • Recommended ratio: Generation data : Editing data ≈ 3:1
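
One way to assemble a training list at the 3:1 ratio is to sample three generation examples for every editing example and shuffle the result. This is a sketch with an illustrative function name; the sample records can be any objects (file paths, dictionaries, tuples).

```python
import random


def mix_datasets(generation: list, editing: list, ratio: int = 3, seed: int = 0) -> list:
    """Build a shuffled list with exactly `ratio` generation samples per editing sample."""
    # Use as many editing samples as the available generation data can support.
    n_edit = min(len(editing), len(generation) // ratio)
    n_gen = n_edit * ratio
    rng = random.Random(seed)
    mixed = rng.sample(generation, n_gen) + rng.sample(editing, n_edit)
    rng.shuffle(mixed)
    return mixed
```

With 10 generation samples and 5 editing samples this yields 9 generation and 3 editing samples; the fixed seed makes the selection reproducible between runs.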

VRAM Requirements

Inference VRAM

| Resolution | Precision | VRAM Usage |
| --- | --- | --- |
| 512x512 | FP16 | ~10 GB |
| 512x512 | BF16 | ~10 GB |
| 1024x1024 | FP16 | ~14 GB |
| 1024x1024 | BF16 | ~14 GB |
| 1024x1024 | FP8 | ~9 GB |
| 2048x2048 | FP16 | ~18 GB |

Optimization Options

  • xFormers: Reduces attention computation memory by ~20-30%
  • Tensor Float 32 (TF32): Available on Ampere+ GPUs, speed boost ~10-15%
  • Model quantization (NF4/FP8): Can reduce VRAM needs to 8-10 GB
  • Paged attention: Reduces peak VRAM during batch processing

Minimum VRAM Requirements

  • 512x512 generation: 8 GB (with quantization)
  • 1024x1024 generation: 12 GB (FP16, no optimization)
  • 1024x1024 generation: 8 GB (FP8 quantization + xFormers)

Deployment Guide

Local Deployment (Single GPU)

# 1. Prepare environment
python -m venv zimage-env
source zimage-env/bin/activate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install diffusers transformers accelerate safetensors

# 2. Download model
huggingface-cli download z-image/omni-base \
    --local-dir ./models/omni-base

# 3. Run inference
python generate.py \
    --model-path ./models/omni-base \
    --prompt "a beautiful sunset over ocean" \
    --output output.jpg \
    --steps 30 --cfg 7.5 --seed 42

ComfyUI Deployment

# Place model files in ComfyUI/models/unet/
# Start ComfyUI
python main.py --listen --port 8188

# Access via browser at http://localhost:8188
# Import workflow JSON file

Docker Deployment

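FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# The CUDA runtime image ships without Python, so install it first
RUN apt-get update && apt-get install -y --no-install-recommends python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

COPY models/ ./models/
COPY inference.py .

EXPOSE 8000
CMD ["python3", "inference.py", "--port", "8000"]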

API Service Example

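import torch
from diffusers import ZImagePipeline

class OmniBaseServer:
    def __init__(self, model_path: str):
        # Load in bfloat16; the default float32 would roughly double weight memory.
        self.pipe = ZImagePipeline.from_pretrained(model_path, torch_dtype=torch.bfloat16)
        # CPU offload manages device placement itself, so do not also call .to("cuda").
        self.pipe.enable_model_cpu_offload()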

    def generate(self, prompt: str, width: int = 1024,
                 height: int = 1024, steps: int = 30,
                 cfg: float = 7.5, seed: int = 42):
        generator = torch.Generator(device="cuda").manual_seed(seed)
        result = self.pipe(
            prompt=prompt,
            width=width, height=height,
            num_inference_steps=steps,
            guidance_scale=cfg,
            generator=generator
        )
        return result.images[0]

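    # Assumes the pipeline accepts a reference image via `image=`; check the installed
    # diffusers version, which may expose editing through a separate pipeline class.
    def edit(self, image, prompt: str, steps: int = 25,
             cfg: float = 6.0, seed: int = 42):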
        generator = torch.Generator(device="cuda").manual_seed(seed)
        result = self.pipe(
            prompt=prompt,
            image=image,
            num_inference_steps=steps,
            guidance_scale=cfg,
            generator=generator
        )
        return result.images[0]

References

Z-Image Team
