Z-Image Base Model Practical Guide: Fine-Tuning and LoRA Training Deep Dive

2026/05/24

Keywords: z-image base model fine-tuning



Introduction

The Z-Image model family includes two variants: the Base model and the Turbo model. While Turbo dominates download numbers on Hugging Face due to its speed and quality advantages for inference, the Base model is the only variant that supports full LoRA fine-tuning. This guide covers everything needed to train LoRA models on Z-Image Base, from dataset preparation to hyperparameter tuning and troubleshooting.


Base vs Turbo: Core Differences

Dimension | Z-Image Base | Z-Image Turbo
Inference Steps | 28–50 | 8
CFG Support | Yes | No
Negative Prompts | Yes | No
LoRA Fine-tuning | Fully supported | Not supported
Distillation | No | Decoupled-DMD + DMDR
Generation Diversity | High | Lower
Guidance Scale | Adjustable | Fixed at 0

Why Turbo Can't Be Fine-Tuned

Turbo uses Decoupled-DMD (Distribution Matching Distillation) combined with DMDR (DMD + Reinforcement Learning). This distillation process fundamentally alters the model's noise schedule and latent space dynamics, making traditional LoRA adapters incompatible. Attempts to apply LoRA to Turbo produce degraded results — the adapter weights conflict with the distilled noise trajectory.

Base is a complete, undistilled Diffusion Transformer. It maintains standard noise scheduling, CFG compatibility, and a latent space that responds predictably to LoRA weight injections.
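To make the CFG and guidance-scale rows of the comparison concrete, here is a minimal sketch of how a guidance scale can combine a prompt-conditioned prediction with a negative-prompt prediction. It assumes the convention in which a scale of 0 means "no guidance", which matches the table's "Fixed at 0" entry for Turbo. The function name and the toy numbers are illustrative only, and real pipelines may use a different convention.

```python
def apply_cfg(cond, uncond, guidance_scale):
    """Combine two model predictions element-wise.

    Assumed convention: guided = cond + scale * (cond - uncond),
    so scale == 0 returns the conditional prediction unchanged.
    """
    return [c + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]    # toy prediction for the prompt
uncond = [0.5, 1.0]  # toy prediction for the negative (or empty) prompt

no_guidance = apply_cfg(cond, uncond, 0.0)  # Turbo-like: scale fixed at 0
guided = apply_cfg(cond, uncond, 2.0)       # Base-like: scale is adjustable

print(no_guidance, guided)
```

With a scale of 0 the negative prompt has no influence at all, which is why negative prompts only matter on a model that supports an adjustable scale.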


When to Use Base vs Turbo

Choose Base When

  • LoRA training is needed — character consistency, style transfer, brand identity
  • Fine-grained control via negative prompts and adjustable guidance_scale
  • Generation diversity matters — same prompt, varied outputs
  • Research or experimentation — CFG tuning, custom samplers

Choose Turbo When

  • Speed is critical — 8-step inference, 1–2 seconds at 1024x1024 on RTX 4090
  • No fine-tuning needed — using official model weights directly
  • Inference cost is a concern, since fewer steps and no CFG pass mean less compute per image
  • Single-image quality is the priority

Hybrid Approach

Train LoRA on Base, then use the trained adapter with Base for generation. This gives fine-tuning flexibility with the full model's quality ceiling.


Dataset Preparation

Image Requirements

  • Quantity: 10–50 images for effective LoRA training
  • Resolution: 1024x1024 preferred; anything from 768x768 to 1536x1536 is acceptable with proper cropping
  • Quality: High-resolution, well-lit, minimal noise
  • Variety: Multiple angles, lighting conditions, and poses (for character LoRA)
  • Clean backgrounds: Minimize distracting elements
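The resolution guidance above can be turned into a quick triage step before training. The sketch below is one possible reading of those bullets, not an official rule: the category names and the choice to treat any non-square image as needing a crop are assumptions made for illustration.

```python
MIN_SIDE, MAX_SIDE, TARGET = 768, 1536, 1024  # values taken from the list above


def classify_resolution(width, height):
    """Sort an image into a rough bucket based on its pixel dimensions."""
    if min(width, height) < MIN_SIDE:
        return "too_small"             # below the accepted range
    if width == height == TARGET:
        return "ideal"                 # already at the preferred size
    if width == height and width <= MAX_SIDE:
        return "ok"                    # square and inside the accepted range
    return "needs_crop_or_resize"      # non-square or larger than the range


for size in [(1024, 1024), (900, 900), (512, 768), (2000, 1500)]:
    print(size, classify_resolution(*size))
```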

Dataset Structure

dataset/
├── images/
│   ├── img_001.jpg
│   ├── img_002.jpg
│   └── ...
├── captions/
│   ├── img_001.txt
│   ├── img_002.txt
│   └── ...
└── metadata.json
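A layout like this is easy to break by forgetting a caption file. As a rough sketch, the helper below pairs images and captions by file stem and reports anything unmatched. The function name, the set of image extensions, and the demo files are assumptions for illustration; the demo builds a throwaway directory so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}  # assumed set of image types


def find_unpaired(dataset_dir):
    """Return (images without captions, captions without images) by file stem."""
    root = Path(dataset_dir)
    images = {p.stem for p in (root / "images").iterdir()
              if p.suffix.lower() in IMAGE_EXTS}
    captions = {p.stem for p in (root / "captions").glob("*.txt")}
    return sorted(images - captions), sorted(captions - images)


# Demo on a temporary directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "images").mkdir()
    (root / "captions").mkdir()
    for name in ("img_001.jpg", "img_002.jpg"):
        (root / "images" / name).touch()
    (root / "captions" / "img_001.txt").write_text("A portrait of sks")
    missing_captions, orphan_captions = find_unpaired(root)

print(missing_captions, orphan_captions)
```

In the demo, img_002 has no caption file, so it is the only entry reported.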

Captioning

Good captions are critical for LoRA quality. Each image needs an associated .txt file:

# Good caption:
A portrait of sks, a 25-year-old woman with short black hair,
wearing a white blouse, standing in a sunlit garden,
natural lighting, shallow depth of field

# Avoid overfitting the trigger:
sks is the unique trigger token. Use it as a bare identifier and
do not define or explain it in the caption text itself

Key captioning rules:

  • Use a unique trigger token (e.g., sks, xyz) to identify the concept
  • Describe the subject generically, not by name
  • Include context: setting, lighting, composition
  • Avoid repeating the trigger token multiple times in one caption
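The trigger-token rules lend themselves to an automated check. The following is a minimal sketch that counts whole-word occurrences of the trigger in a caption; the function names and return strings are hypothetical, and sks is simply the example token used in this guide.

```python
import re


def trigger_count(caption, trigger="sks"):
    """Count whole-word occurrences of the trigger token."""
    return len(re.findall(rf"\b{re.escape(trigger)}\b", caption))


def check_caption(caption, trigger="sks"):
    """Flag captions that omit the trigger or repeat it."""
    n = trigger_count(caption, trigger)
    if n == 0:
        return "missing trigger"
    if n > 1:
        return "trigger repeated"
    return "ok"


print(check_caption("A portrait of sks, a woman in a sunlit garden"))
print(check_caption("sks in a garden, sks smiling"))
print(check_caption("masks on a table"))
```

The word-boundary match keeps ordinary words that happen to contain the token, such as "masks", from being counted.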

Preprocessing

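from pathlib import Path

from PIL import Image, ImageOps

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}

def preprocess_images(input_dir, output_dir, target_size=(1024, 1024)):
    """Center-crop and resize every image in input_dir to target_size."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(input_dir).iterdir()):
        if path.suffix.lower() not in IMAGE_EXTS:
            continue  # skip captions, metadata, and other non-image files
        with Image.open(path) as img:
            # ImageOps.fit crops to the target aspect ratio before resizing,
            # so subjects are not stretched the way a plain resize would.
            fitted = ImageOps.fit(img.convert("RGB"), target_size, Image.Resampling.LANCZOS)
            fitted.save(out / path.name)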

Training Hyperparameters

Learning Rate

Training Type | Recommended LR | Notes
Character LoRA | 1e-4 | Standard for concept learning
Style LoRA | 5e-5 | Lower LR for subtle style transfer
Product/Brand LoRA | 1e-4 | Same as character
Fine-tuning (full) | 1e-5 | Significantly lower for full model

Training Steps

  • 10–20 images: 1500–3000 steps
  • 30–50 images: 3000–6000 steps
  • 50+ images: 6000–10000 steps

Formula: steps ≈ num_images × 150 as a starting point, then adjust based on validation results.
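As a small worked example of this rule of thumb, the sketch below computes the starting point and a working range for a given dataset size. The × 100 and × 300 multipliers are the lower and upper bounds that the Troubleshooting section of this guide mentions; the function names are illustrative.

```python
def estimate_steps(num_images, steps_per_image=150):
    """Starting-point step count: num_images x 150 by default."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return num_images * steps_per_image


def step_bounds(num_images):
    """Working range using the x100 (minimum) and x300 (maximum) multipliers."""
    return num_images * 100, num_images * 300


start = estimate_steps(25)
low, high = step_bounds(25)
print(start, low, high)
```

For a 25-image dataset this gives 3750 steps as a starting point inside a 2500 to 7500 range, to be adjusted from validation results.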

LoRA Rank and Alpha

Rank | Alpha | Use Case | Model Size
4 | 2 | Micro LoRA (small concepts) | ~20 MB
8 | 4 | Lightweight adapter | ~40 MB
16 | 16 | Standard LoRA (recommended default) | ~90 MB
32 | 32 | Complex concepts / styles | ~180 MB
64 | 64 | Maximum capacity | ~360 MB

Rank 16 with alpha 16 is the recommended starting point for most training tasks. Higher ranks provide more capacity but risk overfitting with small datasets.
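To see what rank and alpha control, recall the standard LoRA formulation: the adapter adds (alpha / rank) · B · A to a frozen weight matrix, where A has shape rank × d_in and B has shape d_out × rank. The sketch below computes the resulting scale factor for each preset in the table and the adapter parameter count for one layer. The 1024-wide layer is a hypothetical size chosen for illustration, not a Z-Image dimension.

```python
def lora_scale(rank, alpha):
    """Multiplier applied to the low-rank update B @ A."""
    return alpha / rank


def lora_params(d_in, d_out, rank):
    """Trainable parameters for one adapted layer: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


PRESETS = [(4, 2), (8, 4), (16, 16), (32, 32), (64, 64)]  # (rank, alpha) from the table

scales = {rank: lora_scale(rank, alpha) for rank, alpha in PRESETS}
params_rank16 = lora_params(1024, 1024, 16)  # hypothetical 1024x1024 layer
params_rank32 = lora_params(1024, 1024, 32)

print(scales, params_rank16, params_rank32)
```

Parameter count grows linearly with rank, which is consistent with the file sizes in the table roughly doubling each time the rank doubles.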

Other Key Parameters

Parameter | Recommended | Notes
Batch size | 1–4 | Limited by VRAM
Optimizer | AdamW8bit | Memory-efficient
Scheduler | cosine_with_restarts | Stable convergence
Shuffle tags | True | Prevents memorization order
Cache latents | True | Speeds up training significantly
Min SNR gamma | 5.0 | Improves fine detail learning
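For readers unfamiliar with the cosine_with_restarts scheduler, here is a minimal sketch of the shape such a schedule typically has: linear warmup, then a cosine decay that resets to the peak at the start of each cycle. The defaults mirror the learning rate, warmup, and step counts used in this guide's configuration; num_cycles=3 is an assumed value, and a given trainer's exact implementation may differ.

```python
import math


def lr_at(step, base_lr=1e-4, warmup_steps=100, total_steps=3000, num_cycles=3):
    """Learning rate at a given step for a cosine schedule with hard restarts."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)      # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0                                        # schedule finished
    cycle_pos = (num_cycles * progress) % 1.0             # position in current cycle
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_pos))


for step in (0, 50, 100, 1000, 3000):
    print(step, lr_at(step))
```

The rate peaks right after warmup and decays toward zero within each cycle before jumping back up at the next restart.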

LoRA Training with Kohya_ss

Configuration File

{
  "train_batch_size": 1,
  "learning_rate": 0.0001,
  "lr_scheduler": "cosine_with_restarts",
  "lr_warmup_steps": 100,
  "max_train_steps": 3000,
  "mixed_precision": "no",
  "optimizer_type": "AdamW8bit",
  "output_name": "character_lora",
  "save_every_n_steps": 500,
  "save_precision": "fp16",
  "seed": 42,
  "sdxl": false,
  "model_type": "dit",
  "dit_dim_head": 64,
  "dit_depth": 28,
  "dit_num_heads": 24,
  "lora_network_dim": 16,
  "lora_network_alpha": 16,
  "min_snr_gamma": 5.0,
  "cache_latents": true,
  "cache_latents_to_disk": true,
  "persistent_data_loader_workers": true
}

Training Command

python sdxl_train.py \
  --pretrained_model_name_or_path="Tongyi-ZImage/Z-Image-Base" \
  --dataset_config="dataset_config.yaml" \
  --output_dir="./output" \
  --network_module="networks.lora" \
  --network_dim=16 \
  --network_alpha=16 \
  --learning_rate=0.0001 \
  --train_batch_size=1 \
  --max_train_steps=3000 \
  --mixed_precision="no" \
  --cache_latents \
  --min_snr_gamma=5.0 \
  --lr_scheduler="cosine_with_restarts" \
  --optimizer_type="AdamW8bit" \
  --seed=42

Dataset Config (YAML)

datasets:
  -
    subfolders:
      - images/
    caption_extension: ".txt"
    enable_bucket: true
    bucket_no_upscale: false
    keep_tokens: 1
    color_jiggle: false
    flip_aug: false
    text_encoder_lr: 0.00001

LoRA Training in ComfyUI

ComfyUI supports LoRA training through community extensions like ComfyUI-Training or ComfyUI-Kohya. The workflow is node-based:

Load Z-Image Base Model
  ↓
Load Training Dataset
  ↓
Configure LoRA Settings (rank, alpha, LR, steps)
  ↓
Train LoRA Node
  ↓
Save LoRA (.safetensors)

ComfyUI Training advantages:

  • Visual feedback during training
  • Easy hyperparameter adjustments
  • Integrated preview generation
  • GPU utilization monitoring

ComfyUI Training limitations:

  • Fewer advanced options than Kohya_ss CLI
  • Higher VRAM overhead from the UI
  • May not support all optimizer variants

VRAM Requirements

Configuration | VRAM Required | GPU Examples
LoRA rank 4, BS=1 | ~10–12 GB | RTX 3060, RTX 4070
LoRA rank 8, BS=1 | ~12–14 GB | RTX 4070 (tight), RTX 4080
LoRA rank 16, BS=1 | ~14–16 GB | RTX 3090, RTX 4080
LoRA rank 16, BS=2 | ~18–22 GB | RTX 3090 (tight), RTX 4090
LoRA rank 32, BS=1 | ~16–20 GB | RTX 3090, RTX 4090
Full fine-tuning | ~24–40 GB | A100, H100
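The table can be used as a simple lookup when deciding what to attempt on a given GPU. The sketch below encodes the VRAM ranges from the table and, as a deliberately conservative assumption, only lists a configuration when the upper end of its range fits in the available memory. The function name is illustrative.

```python
# (configuration, low GB, high GB) copied from the VRAM table
VRAM_TABLE = [
    ("LoRA rank 4, BS=1", 10, 12),
    ("LoRA rank 8, BS=1", 12, 14),
    ("LoRA rank 16, BS=1", 14, 16),
    ("LoRA rank 16, BS=2", 18, 22),
    ("LoRA rank 32, BS=1", 16, 20),
    ("Full fine-tuning", 24, 40),
]


def configs_that_fit(available_gb):
    """Return configurations whose upper VRAM estimate fits in available_gb."""
    return [name for name, _low, high in VRAM_TABLE if high <= available_gb]


fits_16gb = configs_that_fit(16)
fits_24gb = configs_that_fit(24)
print(fits_16gb)
print(fits_24gb)
```

Swapping `high` for `_low` in the comparison gives the optimistic variant, which is only reasonable when the optimization techniques below are enabled.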

VRAM Optimization Techniques

  • Cache latents: Pre-compute VAE encodings, reduces per-step VRAM by ~20%
  • Gradient checkpointing: Trades compute for memory, recommended for rank ≥ 32
  • 8-bit optimizers: AdamW8bit stores optimizer state in 8 bits instead of 32, cutting optimizer-state VRAM by up to ~75%; Adafactor is another memory-lean option
  • DDP/ZeroRedundancyOptimizer: Multi-GPU training for large configurations
  • --mixed_precision="no": Z-Image Base training works best in full precision (FP32), at the cost of more VRAM than BF16 mixed precision

Quality Comparison: Base vs Turbo Trained LoRA

Base-Trained LoRA

  • Character consistency: Strong — facial features, proportions maintained across generations
  • Style transfer: Accurate — captures brushwork, color palettes, composition tendencies
  • Prompt adherence: High — benefits from CFG support for negative prompting
  • Diversity: Good — same trigger + varied prompts produce diverse results
  • Overfitting risk: Moderate — manageable with proper validation

Turbo-Trained LoRA (Community Attempts)

  • Character consistency: Weak — distilled noise trajectory conflicts with LoRA weights
  • Style transfer: Inconsistent — partial style adoption with structural artifacts
  • Prompt adherence: Reduced — lack of CFG means no negative prompt filtering
  • Diversity: Poor — Turbo's inherent low diversity compounds with LoRA
  • Artifacting: Common — visual distortions, especially in fine details

Verdict: Always train on Base. The quality gap is significant and consistent across use cases.


Troubleshooting

Artifacting

Symptoms: Weird patterns, checkerboard textures, or structural distortions in generated images.

Causes and fixes:

  • Learning rate too high → Reduce to 5e-5 or lower
  • Too many steps for dataset size → Reduce total steps by 30–50%
  • Dataset contains inconsistent subjects → Curate dataset more carefully
  • Rank too high for data amount → Lower rank from 32 to 16 or 8

Overfitting

Symptoms: Generated images look nearly identical regardless of prompt; model memorizes training data.

Causes and fixes:

  • Too many steps → Reduce toward the lower bound of num_images × 100
  • Same background/lighting in all training images → Add variety to dataset
  • No augmentation → Enable flip augmentation and color jittering
  • Trigger token used too frequently → Limit to once per caption

Underfitting

Symptoms: Generated images barely resemble the target concept; LoRA has minimal effect.

Causes and fixes:

  • Too few steps → Increase toward the upper bound of num_images × 300
  • Learning rate too low → Increase from 5e-5 to 1e-4
  • Rank too low → Increase from 4 to 16
  • Poor captions → Rewrite captions with more descriptive detail
  • Training on wrong model variant → Confirm using Base, not Turbo

Color Cast

Symptoms: All LoRA generations share an unwanted color tint.

Causes and fixes:

  • Training dataset has uniform color bias → Add color-diverse images
  • White balance inconsistency → Apply color correction preprocessing
  • Enable color jittering in training config

Detail Loss

Symptoms: Fine details (facial features, text, textures) are blurred or missing.

Causes and fixes:

  • min_snr_gamma not set → Set to 5.0
  • Resolution mismatch → Ensure all training images are at 1024x1024
  • Too much augmentation → Disable color_jiggle and flip_aug for character LoRA
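For context on the min_snr_gamma fix, the setting refers to Min-SNR loss weighting, which caps how much weight low-noise timesteps receive. The sketch below shows the weighting as it is usually written for noise-prediction training, weight = min(SNR, gamma) / SNR. How a particular trainer applies it to Z-Image's training objective is not stated in this guide, so treat this as background rather than a description of the actual implementation.

```python
def min_snr_weight(snr, gamma=5.0):
    """Per-timestep loss weight under Min-SNR weighting (noise-prediction form)."""
    if snr <= 0:
        raise ValueError("snr must be positive")
    return min(snr, gamma) / snr


# High-SNR (low-noise) timesteps are down-weighted; low-SNR ones keep weight 1.
for snr in (2.0, 5.0, 10.0, 100.0):
    print(snr, min_snr_weight(snr))
```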

Practical Example: Training a Character LoRA on Base

Step 1: Dataset

Collect 25 images of the target character:

  • 10 portrait shots (various angles)
  • 5 full-body shots
  • 5 different outfits/settings
  • 5 varied lighting conditions

Each image paired with a caption using trigger token sks:

A portrait of sks, a young woman with shoulder-length brown hair,
blue eyes, wearing a navy blazer, professional studio lighting,
clean white background, photorealistic

Step 2: Configuration

{
  "learning_rate": 0.0001,
  "max_train_steps": 4000,
  "lora_network_dim": 16,
  "lora_network_alpha": 16,
  "train_batch_size": 1,
  "min_snr_gamma": 5.0,
  "cache_latents": true,
  "optimizer_type": "AdamW8bit",
  "lr_scheduler": "cosine_with_restarts"
}

Step 3: Training

python sdxl_train.py \
  --pretrained_model_name_or_path="Tongyi-ZImage/Z-Image-Base" \
  --dataset_config="character_config.yaml" \
  --output_dir="./character_output" \
  --network_module="networks.lora" \
  --network_dim=16 \
  --network_alpha=16 \
  --learning_rate=0.0001 \
  --max_train_steps=4000 \
  --train_batch_size=1 \
  --cache_latents \
  --min_snr_gamma=5.0 \
  --lr_scheduler="cosine_with_restarts" \
  --optimizer_type="AdamW8bit"

Step 4: Validation

Generate test images at steps 500, 1500, 3000, and 4000:

Checkpoint | Test Prompt | Evaluation
Step 500 | sks, casual portrait, natural lighting | Early concept learning
Step 1500 | sks, business attire, office background | Style emergence
Step 3000 | sks, outdoor scene, golden hour | Refinement check
Step 4000 | sks, fantasy costume, dramatic lighting | Overfitting check
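A validation plan only works if a checkpoint is actually saved at each step you intend to test. The sketch below encodes the table as data and checks it against a save interval, assuming the 500-step save_every_n_steps value from the earlier configuration file. The function name is hypothetical.

```python
# (checkpoint step, test prompt) pairs copied from the validation table
VALIDATION_PLAN = [
    (500, "sks, casual portrait, natural lighting"),
    (1500, "sks, business attire, office background"),
    (3000, "sks, outdoor scene, golden hour"),
    (4000, "sks, fantasy costume, dramatic lighting"),
]


def steps_without_checkpoint(plan, save_every_n_steps):
    """Return validation steps that do not coincide with a saved checkpoint."""
    return [step for step, _prompt in plan if step % save_every_n_steps != 0]


missing = steps_without_checkpoint(VALIDATION_PLAN, 500)
missing_if_1000 = steps_without_checkpoint(VALIDATION_PLAN, 1000)
print(missing, missing_if_1000)
```

With a 500-step interval every validation step has a checkpoint; a 1000-step interval would skip the first two. Using the same seed for every checkpoint makes the comparisons easier to read.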

Step 5: Inference with Trained LoRA

import torch
from diffusers import ZImagePipeline

# Load the Base model (not Turbo) so CFG and negative prompts are available.
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-ZImage/Z-Image-Base", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Attach the LoRA produced in Step 3.
pipe.load_lora_weights("./character_output/character_lora.safetensors")

result = pipe(
    prompt="sks, wearing a red dress at a garden party, cinematic lighting",
    negative_prompt="blurry, deformed, low quality, extra limbs",
    guidance_scale=7.5,
    num_inference_steps=30
).images[0]
result.save("sks_garden_party.png")

Summary

Z-Image Base is the only variant supporting LoRA fine-tuning. Key takeaways:

  1. Always train on Base — Turbo's distillation makes LoRA incompatible
  2. Dataset quality matters most — 10–50 curated images beat 100 noisy ones
  3. Rank 16, LR 1e-4 is a solid starting point for most use cases
  4. Cache latents and use 8-bit optimizers to manage VRAM efficiently
  5. Validate frequently during training to catch overfitting early
  6. Use unique trigger tokens and write descriptive captions for each image

The Base model's support for CFG and negative prompts during inference means LoRA-trained models benefit from additional control that Turbo simply cannot provide.


Z-Image Team