Z-Image Base Model Practical Guide: Fine-Tuning and LoRA Training Deep Dive

Keywords: z-image base model fine-tuning

Introduction
Base vs Turbo: Core Differences
When to Use Base vs Turbo
Dataset Preparation
Training Hyperparameters
LoRA Training with Kohya_ss
LoRA Training in ComfyUI
VRAM Requirements
Quality Comparison: Base vs Turbo LoRA
Troubleshooting
Practical Example: Character LoRA
Summary

Introduction

The Z-Image model family includes two variants: the Base model and the Turbo model. While Turbo dominates download numbers on Hugging Face due to its speed and quality advantages for inference, the Base model is the only variant that supports full LoRA fine-tuning. This guide covers everything needed to train LoRA models on Z-Image Base, from dataset preparation to hyperparameter tuning and troubleshooting.

Base vs Turbo: Core Differences

Dimension	Z-Image Base	Z-Image Turbo
Inference Steps	28–50	8
CFG Support	Yes	No
Negative Prompts	Yes	No
LoRA Fine-tuning	Fully supported	Not supported
Distillation	No	Decoupled-DMD + DMDR
Generation Diversity	High	Lower
Guidance Scale	Adjustable	Fixed at 0

Why Turbo Can't Be Fine-Tuned

Turbo uses Decoupled-DMD (Distribution Matching Distillation) combined with DMDR (DMD + Reinforcement Learning). This distillation process fundamentally alters the model's noise schedule and latent space dynamics, making traditional LoRA adapters incompatible. Attempts to apply LoRA to Turbo produce degraded results — the adapter weights conflict with the distilled noise trajectory.

Base is a complete, undistilled Diffusion Transformer. It maintains standard noise scheduling, CFG compatibility, and a latent space that responds predictably to LoRA weight injections.

When to Use Base vs Turbo

Choose Base When

LoRA training is needed — character consistency, style transfer, brand identity
Fine-grained control via negative prompts and adjustable guidance_scale
Generation diversity matters — same prompt, varied outputs
Research or experimentation — CFG tuning, custom samplers

Choose Turbo When

Speed is critical — 8-step inference, 1–2 seconds at 1024x1024 on RTX 4090
No fine-tuning needed — using official model weights directly
Inference cost is a concern — fewer steps mean less compute per image
Single-image quality is the priority

Hybrid Approach

Train LoRA on Base, then use the trained adapter with Base for generation. This gives fine-tuning flexibility with the full model's quality ceiling.

Dataset Preparation

Image Requirements

Quantity: 10–50 images for effective LoRA training
Resolution: 1024x1024 preferred; accept 768x768–1536x1536 with proper cropping
Quality: High-resolution, well-lit, minimal noise
Variety: Multiple angles, lighting conditions, and poses (for character LoRA)
Clean backgrounds: Minimize distracting elements

Dataset Structure

dataset/
├── images/
│   ├── img_001.jpg
│   ├── img_002.jpg
│   └── ...
├── captions/
│   ├── img_001.txt
│   ├── img_002.txt
│   └── ...
└── metadata.json

Captioning

Good captions are critical for LoRA quality. Each image needs an associated .txt file:

# Good caption:
A portrait of sks, a 25-year-old woman with short black hair,
wearing a white blouse, standing in a sunlit garden,
natural lighting, shallow depth of field

# Avoid overfitting the trigger:
sks is the unique trigger token; do not define or explain it
in the caption text itself

Key captioning rules:

Use a unique trigger token (e.g., sks, xyz) to identify the concept
Describe the subject generically, not by name
Include context: setting, lighting, composition
Avoid repeating the trigger token multiple times in one caption
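
The rules above can be checked mechanically before training. The following is a minimal sketch, assuming the images/ and captions/ layout shown earlier and a plain-text trigger token; the function name check_captions and the list of accepted image suffixes are illustrative choices, not part of any training tool.

```python
import re
from pathlib import Path

# Image suffixes treated as training images (an assumption for this sketch).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}

def check_captions(image_dir, caption_dir, trigger="sks"):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # ignore non-image files
        caption_path = Path(caption_dir) / (image_path.stem + ".txt")
        if not caption_path.is_file():
            problems.append(f"{image_path.name}: missing caption file")
            continue
        # Split into word tokens so "sks," and "sks." still count as the trigger.
        words = re.findall(r"[A-Za-z0-9_]+", caption_path.read_text(encoding="utf-8"))
        count = words.count(trigger)
        if count != 1:
            problems.append(f"{image_path.name}: trigger '{trigger}' appears {count} times")
    return problems
```

Calling check_captions("dataset/images", "dataset/captions") and printing the result gives a quick list of images with a missing caption or a trigger token that appears zero or several times.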

Preprocessing

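import os
from PIL import Image, ImageOps

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")

def preprocess_images(input_dir, output_dir, target_size=(1024, 1024)):
    """Center-crop and resize every image in input_dir to target_size."""
    os.makedirs(output_dir, exist_ok=True)
    for filename in sorted(os.listdir(input_dir)):
        if not filename.lower().endswith(IMAGE_EXTENSIONS):
            continue  # skip captions, metadata, and other non-image files
        with Image.open(os.path.join(input_dir, filename)) as img:
            # Convert to RGB so RGBA or palette images can be saved as JPEG
            img = img.convert("RGB")
            # Crop to the target aspect ratio before resizing to avoid stretching
            img = ImageOps.fit(img, target_size, Image.Resampling.LANCZOS)
            img.save(os.path.join(output_dir, filename))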

Training Hyperparameters

Learning Rate

Training Type	Recommended LR	Notes
Character LoRA	1e-4	Standard for concept learning
Style LoRA	5e-5	Lower LR for subtle style transfer
Product/Brand LoRA	1e-4	Same as character
Fine-tuning (full)	1e-5	Significantly lower for full model

Training Steps

10–20 images: 1500–3000 steps
30–50 images: 3000–6000 steps
50+ images: 6000–10000 steps

Formula: steps ≈ num_images × 150 as a starting point, then adjust based on validation results.
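
As a worked instance of the formula, the sketch below computes a starting step count together with a search range. The 100 and 300 steps-per-image bounds are taken from the Troubleshooting section of this guide; the function name estimate_steps is illustrative.

```python
def estimate_steps(num_images, per_image=150, min_per_image=100, max_per_image=300):
    """Return (low, start, high) step counts for a dataset of num_images images."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    low = num_images * min_per_image      # lower bound used when overfitting
    start = num_images * per_image        # starting point from the formula
    high = num_images * max_per_image     # upper bound used when underfitting
    return low, start, high

low, start, high = estimate_steps(25)
print(f"25 images: start near {start} steps, search between {low} and {high}")
```

For 25 images this gives a starting point of 3750 steps within a 2500 to 7500 range, which is then narrowed using validation images.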

LoRA Rank and Alpha

Rank	Alpha	Use Case	Model Size
4	2	Micro LoRA (small concepts)	~20 MB
8	4	Lightweight adapter	~40 MB
16	16	Standard LoRA (recommended default)	~90 MB
32	32	Complex concepts / styles	~180 MB
64	64	Maximum capacity	~360 MB

Rank 16 with alpha 16 is the recommended starting point for most training tasks. Higher ranks provide more capacity but risk overfitting with small datasets.
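
To make the rank and alpha columns concrete: in standard LoRA the low-rank update is multiplied by alpha / rank, and each adapted linear layer adds rank × (d_in + d_out) trainable parameters. The sketch below prints both values for the table's rank/alpha pairs. The layer width of 1024 is a hypothetical value for illustration only, not Z-Image's actual dimension.

```python
def lora_scale(rank, alpha):
    """Multiplier applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank

def lora_param_count(rank, d_in, d_out):
    """Trainable parameters for one adapted linear layer.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)

# Hypothetical layer width, chosen only for illustration.
D_IN = D_OUT = 1024

for rank, alpha in [(4, 2), (8, 4), (16, 16), (32, 32), (64, 64)]:
    scale = lora_scale(rank, alpha)
    params = lora_param_count(rank, D_IN, D_OUT)
    print(f"rank={rank:<3} alpha={alpha:<3} scale={scale} params_per_layer={params}")
```

Two things fall out of this: the rank 4 and rank 8 rows apply the update at half strength (scale 0.5) while the others use scale 1.0, and the parameter count doubles whenever the rank doubles, which is consistent with the file sizes in the table roughly doubling from row to row.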

Other Key Parameters

Parameter	Recommended	Notes
Batch size	1–4	Limited by VRAM
Optimizer	AdamW8bit	Memory-efficient
Scheduler	cosine_with_restarts	Stable convergence
Shuffle tags	True	Prevents the model from memorizing tag order
Cache latents	True	Speeds up training significantly
Min SNR gamma	5.0	Improves fine detail learning
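
For readers unfamiliar with the Min SNR gamma setting, the sketch below shows the weighting as it is usually written for a noise-prediction loss: each timestep's loss is multiplied by min(SNR, gamma) / SNR. How a given trainer applies this to Z-Image's training objective is not specified in this guide, so treat this as an illustration of the idea rather than the trainer's exact code.

```python
def min_snr_weight(snr, gamma=5.0):
    """Loss weight min(snr, gamma) / snr for a timestep with signal-to-noise ratio snr."""
    if snr <= 0:
        raise ValueError("snr must be positive")
    return min(snr, gamma) / snr

# Low-noise timesteps (high SNR) are down-weighted; noisy timesteps keep weight 1.
for snr in (0.1, 1.0, 5.0, 20.0, 100.0):
    print(f"snr={snr:<6} weight={min_snr_weight(snr):.3f}")
```

With gamma = 5.0, any timestep whose SNR is at or below 5 keeps its full weight, while a timestep with SNR 20 is scaled to 0.25, so easy low-noise steps do not dominate the loss.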

LoRA Training with Kohya_ss

Configuration File

{
  "train_batch_size": 1,
  "learning_rate": 0.0001,
  "lr_scheduler": "cosine_with_restarts",
  "lr_warmup_steps": 100,
  "max_train_steps": 3000,
  "mixed_precision": "no",
  "optimizer_type": "AdamW8bit",
  "output_name": "character_lora",
  "save_every_n_steps": 500,
  "save_precision": "fp16",
  "seed": 42,
  "sdxl": false,
  "model_type": "dit",
  "dit_dim_head": 64,
  "dit_depth": 28,
  "dit_num_heads": 24,
  "lora_network_dim": 16,
  "lora_network_alpha": 16,
  "min_snr_gamma": 5.0,
  "cache_latents": true,
  "cache_latents_to_disk": true,
  "persistent_data_loader_workers": true
}
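
To visualize what the lr_scheduler and lr_warmup_steps settings do, the sketch below computes the learning rate at a given step using linear warmup followed by cosine decay with hard restarts. The formula follows the commonly used Hugging Face style schedule; the number of restart cycles is not given in the configuration, so the default of 1 here is an assumption.

```python
import math

def lr_at(step, base_lr=1e-4, warmup_steps=100, total_steps=3000, num_cycles=1):
    """Learning rate at a given step: linear warmup, then cosine decay with hard restarts."""
    if step < warmup_steps:
        # Ramp linearly from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Position inside the current cosine cycle, in [0, 1).
    cycle_position = (num_cycles * progress) % 1.0
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_position))

for step in (0, 50, 100, 1550, 3000):
    print(step, lr_at(step))
```

With more than one cycle, the rate jumps back to base_lr at the start of each cycle, which is the "restart" the scheduler name refers to.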

Training Command

python sdxl_train.py \
  --pretrained_model_name_or_path="Tongyi-ZImage/Z-Image-Base" \
  --dataset_config="dataset_config.yaml" \
  --output_dir="./output" \
  --network_module="networks.lora" \
  --network_dim=16 \
  --network_alpha=16 \
  --learning_rate=0.0001 \
  --train_batch_size=1 \
  --max_train_steps=3000 \
  --mixed_precision="no" \
  --cache_latents \
  --min_snr_gamma=5.0 \
  --lr_scheduler="cosine_with_restarts" \
  --optimizer_type="AdamW8bit" \
  --seed=42
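
Because the JSON configuration and the command line repeat the same values, they can drift apart. One way to avoid that is to build the argument list from the configuration. The sketch below maps only the key/flag pairs that appear in this guide; the helper name build_command is hypothetical, and the script name is simply the one used in the command above.

```python
import shlex

# Config key -> CLI flag, limited to pairs that appear in this guide.
FLAG_FOR_KEY = {
    "learning_rate": "--learning_rate",
    "train_batch_size": "--train_batch_size",
    "max_train_steps": "--max_train_steps",
    "mixed_precision": "--mixed_precision",
    "optimizer_type": "--optimizer_type",
    "lr_scheduler": "--lr_scheduler",
    "min_snr_gamma": "--min_snr_gamma",
    "seed": "--seed",
    "lora_network_dim": "--network_dim",
    "lora_network_alpha": "--network_alpha",
}

def build_command(config, script="sdxl_train.py"):
    """Return an argument list suitable for subprocess.run()."""
    args = ["python", script]
    for key, flag in FLAG_FOR_KEY.items():
        if key in config:
            args.append(f"{flag}={config[key]}")
    if config.get("cache_latents"):
        args.append("--cache_latents")  # boolean switch, takes no value
    return args

config = {"learning_rate": 0.0001, "max_train_steps": 3000,
          "lora_network_dim": 16, "lora_network_alpha": 16, "cache_latents": True}
print(shlex.join(build_command(config)))
```

Paths such as the model name, dataset config, and output directory are left out of the mapping on purpose; they would be appended to the returned list before running it.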

Dataset Config (YAML)

datasets:
  -
    subfolders:
      - images/
    caption_extension: ".txt"
    enable_bucket: true
    bucket_no_upscale: false
    keep_tokens: 1
    color_jiggle: false
    flip_aug: false
    text_encoder_lr: 0.00001

LoRA Training in ComfyUI

ComfyUI supports LoRA training through community extensions like ComfyUI-Training or ComfyUI-Kohya. The workflow is node-based:

Load Z-Image Base Model
  ↓
Load Training Dataset
  ↓
Configure LoRA Settings (rank, alpha, LR, steps)
  ↓
Train LoRA Node
  ↓
Save LoRA (.safetensors)

ComfyUI Training advantages:

Visual feedback during training
Easy hyperparameter adjustments
Integrated preview generation
GPU utilization monitoring

ComfyUI Training limitations:

Fewer advanced options than Kohya_ss CLI
Higher VRAM overhead from the UI
May not support all optimizer variants

VRAM Requirements

Configuration	VRAM Required	GPU Examples
LoRA rank 4, BS=1	~10–12 GB	RTX 3060, RTX 4070
LoRA rank 8, BS=1	~12–14 GB	RTX 3070, RTX 4070
LoRA rank 16, BS=1	~14–16 GB	RTX 3090, RTX 4080
LoRA rank 16, BS=2	~18–22 GB	RTX 3090 (tight), RTX 4090
LoRA rank 32, BS=1	~16–20 GB	RTX 3090, RTX 4090
Full fine-tuning	~24–40 GB	A100, H100
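
The table can be turned into a quick lookup. The sketch below copies its rows and lists the configurations whose upper VRAM estimate fits within a given budget; using the upper end of each range is a deliberately conservative choice made for this example, and the function name is illustrative.

```python
# (configuration, low_gb, high_gb) rows copied from the VRAM table above.
VRAM_TABLE = [
    ("LoRA rank 4, BS=1", 10, 12),
    ("LoRA rank 8, BS=1", 12, 14),
    ("LoRA rank 16, BS=1", 14, 16),
    ("LoRA rank 16, BS=2", 18, 22),
    ("LoRA rank 32, BS=1", 16, 20),
    ("Full fine-tuning", 24, 40),
]

def configs_that_fit(vram_gb):
    """Return configurations whose upper VRAM estimate is within vram_gb."""
    return [label for label, _low, high in VRAM_TABLE if high <= vram_gb]

for budget in (12, 16, 24):
    print(f"{budget} GB:", configs_that_fit(budget))
```

The figures are the guide's estimates, so a configuration near the edge of the budget may still need cached latents or gradient checkpointing to run.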

VRAM Optimization Techniques

Cache latents: Pre-compute VAE encodings, reduces per-step VRAM by ~20%
Gradient checkpointing: Trades compute for memory, recommended for rank ≥ 32
8-bit optimizers: AdamW8bit or Adafactor8bit reduce optimizer state VRAM by ~50%
DDP/ZeroRedundancyOptimizer: Multi-GPU training for large configurations
--mixed_precision="no": Z-Image Base training works best in full precision (FP32) rather than BF16 mixed precision

Quality Comparison: Base vs Turbo Trained LoRA

Base-Trained LoRA

Character consistency: Strong — facial features, proportions maintained across generations
Style transfer: Accurate — captures brushwork, color palettes, composition tendencies
Prompt adherence: High — benefits from CFG support for negative prompting
Diversity: Good — same trigger + varied prompts produce diverse results
Overfitting risk: Moderate — manageable with proper validation

Turbo-Trained LoRA (Community Attempts)

Character consistency: Weak — distilled noise trajectory conflicts with LoRA weights
Style transfer: Inconsistent — partial style adoption with structural artifacts
Prompt adherence: Reduced — lack of CFG means no negative prompt filtering
Diversity: Poor — Turbo's inherent low diversity compounds with LoRA
Artifacting: Common — visual distortions, especially in fine details

Verdict: Always train on Base. The quality gap is significant and consistent across use cases.

Troubleshooting

Artifacting

Symptoms: Weird patterns, checkerboard textures, or structural distortions in generated images.

Causes and fixes:

Learning rate too high → Reduce to 5e-5 or lower
Too many steps for dataset size → Reduce total steps by 30–50%
Dataset contains inconsistent subjects → Curate dataset more carefully
Rank too high for data amount → Lower rank from 32 to 16 or 8

Overfitting

Symptoms: Generated images look nearly identical regardless of prompt; model memorizes training data.

Causes and fixes:

Too many steps → Reduce toward the lower bound of num_images × 100
Same background/lighting in all training images → Add variety to dataset
No augmentation → Enable flip augmentation and color jittering
Trigger token used too frequently → Limit to once per caption

Underfitting

Symptoms: Generated images barely resemble the target concept; LoRA has minimal effect.

Causes and fixes:

Too few steps → Increase, up to a maximum of num_images × 300
Learning rate too low → Increase from 5e-5 to 1e-4
Rank too low → Increase from 4 to 16
Poor captions → Rewrite captions with more descriptive detail
Training on wrong model variant → Confirm using Base, not Turbo

Color Cast

Symptoms: All LoRA generations share an unwanted color tint.

Causes and fixes:

Training dataset has uniform color bias → Add color-diverse images
White balance inconsistency → Apply color correction preprocessing
Color bias persists after dataset fixes → Enable color jittering in the training config

Detail Loss

Symptoms: Fine details (facial features, text, textures) are blurred or missing.

Causes and fixes:

min_snr_gamma not set → Set to 5.0
Resolution mismatch → Ensure all training images are at 1024x1024
Too much augmentation → Disable color_jiggle and flip_aug for character LoRA

Practical Example: Training a Character LoRA on Base

Step 1: Dataset

Collect 25 images of the target character:

10 portrait shots (various angles)
5 full-body shots
5 different outfits/settings
5 varied lighting conditions

Each image paired with a caption using trigger token sks:

A portrait of sks, a young woman with shoulder-length brown hair,
blue eyes, wearing a navy blazer, professional studio lighting,
clean white background, photorealistic

Step 2: Configuration

{
  "learning_rate": 0.0001,
  "max_train_steps": 4000,
  "lora_network_dim": 16,
  "lora_network_alpha": 16,
  "train_batch_size": 1,
  "min_snr_gamma": 5.0,
  "cache_latents": true,
  "optimizer_type": "AdamW8bit",
  "lr_scheduler": "cosine_with_restarts"
}

Step 3: Training

python sdxl_train.py \
  --pretrained_model_name_or_path="Tongyi-ZImage/Z-Image-Base" \
  --dataset_config="character_config.yaml" \
  --output_dir="./character_output" \
  --network_module="networks.lora" \
  --network_dim=16 \
  --network_alpha=16 \
  --learning_rate=0.0001 \
  --max_train_steps=4000 \
  --train_batch_size=1 \
  --cache_latents \
  --min_snr_gamma=5.0 \
  --lr_scheduler="cosine_with_restarts" \
  --optimizer_type="AdamW8bit"

Step 4: Validation

Generate test images from the checkpoints saved at steps 500, 1500, 3000, and 4000:

Checkpoint	Test Prompt	Evaluation
Step 500	`sks, casual portrait, natural lighting`	Early concept learning
Step 1500	`sks, business attire, office background`	Style emergence
Step 3000	`sks, outdoor scene, golden hour`	Refinement check
Step 4000	`sks, fantasy costume, dramatic lighting`	Overfitting check
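
The table pairs each checkpoint with a different prompt. A variation that makes checkpoints easier to compare is to run every prompt at every checkpoint with a fixed seed, so that differences come from training progress rather than from the prompt or the noise. The sketch below only builds that job list; the function name and the job dictionary layout are illustrative, and the seed value of 42 is reused from the Kohya_ss configuration.

```python
from itertools import product

CHECKPOINT_STEPS = [500, 1500, 3000, 4000]
TEST_PROMPTS = [
    "sks, casual portrait, natural lighting",
    "sks, business attire, office background",
    "sks, outdoor scene, golden hour",
    "sks, fantasy costume, dramatic lighting",
]

def validation_jobs(steps=CHECKPOINT_STEPS, prompts=TEST_PROMPTS, seed=42):
    """Return one job per (checkpoint step, prompt) pair, all sharing the same seed."""
    return [{"step": step, "prompt": prompt, "seed": seed}
            for step, prompt in product(steps, prompts)]

jobs = validation_jobs()
print(len(jobs), "validation images to generate")
print(jobs[0])
```

Each job can then be passed to whatever inference script is in use, loading the checkpoint for that step before generating.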

Step 5: Inference with Trained LoRA

import torch
from diffusers import ZImagePipeline

# Load the Base pipeline and move it to the GPU
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-ZImage/Z-Image-Base", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Attach the trained LoRA adapter
pipe.load_lora_weights("./character_output/character_lora.safetensors")

# Base supports CFG, so negative_prompt and guidance_scale take effect
result = pipe(
    prompt="sks, wearing a red dress at a garden party, cinematic lighting",
    negative_prompt="blurry, deformed, low quality, extra limbs",
    guidance_scale=7.5,
    num_inference_steps=30
).images[0]

Summary

Z-Image Base is the only variant supporting LoRA fine-tuning. Key takeaways:

Always train on Base — Turbo's distillation makes LoRA incompatible
Dataset quality matters most — 10–50 curated images beat 100 noisy ones
Rank 16, LR 1e-4 is a solid starting point for most use cases
Cache latents and use 8-bit optimizers to manage VRAM efficiently
Validate frequently during training to catch overfitting early
Use unique trigger tokens and write descriptive captions for each image

The Base model's support for CFG and negative prompts during inference means LoRA-trained models benefit from additional control that Turbo simply cannot provide.
