Z-Image Base Model Practical Guide: Fine-Tuning and LoRA Training Deep Dive
Keywords: z-image base model fine-tuning
Table of Contents
- Introduction
- Base vs Turbo: Core Differences
- When to Use Base vs Turbo
- Dataset Preparation
- Training Hyperparameters
- LoRA Training with Kohya_ss
- LoRA Training in ComfyUI
- VRAM Requirements
- Quality Comparison: Base vs Turbo LoRA
- Troubleshooting
- Practical Example: Character LoRA
- Summary
Introduction
The Z-Image model family includes two variants: the Base model and the Turbo model. While Turbo dominates download numbers on Hugging Face due to its speed and quality advantages for inference, the Base model is the only variant that supports full LoRA fine-tuning. This guide covers everything needed to train LoRA models on Z-Image Base, from dataset preparation to hyperparameter tuning and troubleshooting.
Base vs Turbo: Core Differences
| Dimension | Z-Image Base | Z-Image Turbo |
|---|---|---|
| Inference Steps | 28–50 | 8 |
| CFG Support | Yes | No |
| Negative Prompts | Yes | No |
| LoRA Fine-tuning | Fully supported | Not supported |
| Distillation | No | Decoupled-DMD + DMDR |
| Generation Diversity | High | Lower |
| Guidance Scale | Adjustable | Fixed at 0 |
Why Turbo Can't Be Fine-Tuned
Turbo uses Decoupled-DMD (Distribution Matching Distillation) combined with DMDR (DMD + Reinforcement Learning). This distillation process fundamentally alters the model's noise schedule and latent space dynamics, making traditional LoRA adapters incompatible. Attempts to apply LoRA to Turbo produce degraded results — the adapter weights conflict with the distilled noise trajectory.
Base is a complete, undistilled Diffusion Transformer. It maintains standard noise scheduling, CFG compatibility, and a latent space that responds predictably to LoRA weight injections.
When to Use Base vs Turbo
Choose Base When
- LoRA training is needed — character consistency, style transfer, brand identity
- Fine-grained control via negative prompts and adjustable guidance_scale
- Generation diversity matters — same prompt, varied outputs
- Research or experimentation — CFG tuning, custom samplers
Choose Turbo When
- Speed is critical — 8-step inference, 1–2 seconds at 1024x1024 on RTX 4090
- No fine-tuning needed — using official model weights directly
- Inference cost is a concern — fewer steps mean less compute per image
- Single-image quality is the priority
Hybrid Approach
Train LoRA on Base, then use the trained adapter with Base for generation. This gives fine-tuning flexibility with the full model's quality ceiling.
Dataset Preparation
Image Requirements
- Quantity: 10–50 images for effective LoRA training
- Resolution: 1024x1024 preferred; accept 768x768–1536x1536 with proper cropping
- Quality: High-resolution, well-lit, minimal noise
- Variety: Multiple angles, lighting conditions, and poses (for character LoRA)
- Clean backgrounds: Minimize distracting elements
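The resolution guidance above can be turned into a quick pre-flight check. The sketch below is one possible reading of it: it assumes the 768–1536 range applies to width and height independently. The function name `classify_resolution` and the three labels are illustrative and not part of any training tool.

```python
def classify_resolution(width: int, height: int) -> str:
    """Classify an image size against the resolution guidance in this guide."""
    if (width, height) == (1024, 1024):
        return "preferred"
    # Assumption: the 768-1536 range applies to both sides independently.
    if 768 <= width <= 1536 and 768 <= height <= 1536:
        return "usable_with_cropping"
    return "out_of_range"


# Example sizes (made up for illustration).
sizes = {
    "img_001.jpg": (1024, 1024),
    "img_002.jpg": (1200, 900),
    "img_003.jpg": (640, 480),
}
for name, (w, h) in sizes.items():
    print(name, classify_resolution(w, h))
```

Images flagged as out of range are candidates for replacement or rescaling before training.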
Dataset Structure
dataset/
├── images/
│ ├── img_001.jpg
│ ├── img_002.jpg
│ └── ...
├── captions/
│ ├── img_001.txt
│ ├── img_002.txt
│ └── ...
└── metadata.json
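With images and captions in separate folders, a missing or misnamed caption is easy to overlook. The following is a minimal sketch that checks the layout above for unpaired files; the helper name `find_unpaired` and the set of accepted image extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image types; extend it if your dataset uses others.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}


def find_unpaired(dataset_root):
    """Return (images without captions, captions without images) as sorted stem lists."""
    root = Path(dataset_root)
    image_stems = {
        p.stem
        for p in (root / "images").iterdir()
        if p.suffix.lower() in IMAGE_EXTENSIONS
    }
    caption_stems = {p.stem for p in (root / "captions").glob("*.txt")}
    return sorted(image_stems - caption_stems), sorted(caption_stems - image_stems)


# Usage, assuming the layout shown above:
# missing_captions, orphan_captions = find_unpaired("dataset")
```

Both returned lists should be empty before training starts.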
Captioning
Good captions are critical for LoRA quality. Each image needs an associated .txt file:
# Good caption:
A portrait of sks, a 25-year-old woman with short black hair,
wearing a white blouse, standing in a sunlit garden,
natural lighting, shallow depth of field
# Avoid overfitting the trigger:
sks is the unique trigger token; do not explain or describe
the token itself anywhere in the caption text
Key captioning rules:
- Use a unique trigger token (e.g., sks, xyz) to identify the concept
- Describe the subject generically, not by name
- Include context: setting, lighting, composition
- Avoid repeating the trigger token multiple times in one caption
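The captioning rules above lend themselves to a small automated check. The sketch below flags captions where the trigger token is missing or repeated, and captions that look too short to carry context. The function name `lint_caption` and the 8-word threshold are assumptions for illustration, not established conventions.

```python
import re


def lint_caption(caption: str, trigger: str = "sks") -> list[str]:
    """Return a list of problems found in one caption (empty list means it passes)."""
    problems = []
    # Match the trigger as a whole word so "sks" does not match inside "tasks".
    count = len(re.findall(rf"\b{re.escape(trigger)}\b", caption))
    if count == 0:
        problems.append("trigger token missing")
    elif count > 1:
        problems.append(f"trigger token appears {count} times")
    # Assumption: fewer than 8 words is too short to describe setting and lighting.
    if len(caption.split()) < 8:
        problems.append("caption may lack context (setting, lighting, composition)")
    return problems


print(lint_caption("A portrait of sks, a woman with short black hair, in a sunlit garden"))
print(lint_caption("sks"))
```

Running it over every .txt file before training catches the most common captioning slips early.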
Preprocessing
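import os
from PIL import Image, ImageOps

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")

def preprocess_images(input_dir, output_dir, target_size=(1024, 1024)):
    os.makedirs(output_dir, exist_ok=True)
    for filename in sorted(os.listdir(input_dir)):
        if not filename.lower().endswith(IMAGE_EXTENSIONS):
            continue  # skip captions, metadata, and other non-image files
        with Image.open(os.path.join(input_dir, filename)) as img:
            # Center-crop to the target aspect ratio, then resize, so images are not stretched
            img = ImageOps.fit(img.convert("RGB"), target_size, Image.LANCZOS)
            img.save(os.path.join(output_dir, filename))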
Training Hyperparameters
Learning Rate
| Training Type | Recommended LR | Notes |
|---|---|---|
| Character LoRA | 1e-4 | Standard for concept learning |
| Style LoRA | 5e-5 | Lower LR for subtle style transfer |
| Product/Brand LoRA | 1e-4 | Same as character |
| Fine-tuning (full) | 1e-5 | Significantly lower for full model |
Training Steps
- 10–20 images: 1500–3000 steps
- 30–50 images: 3000–6000 steps
- 50+ images: 6000–10000 steps
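Formula: steps ≈ num_images × 150 as a starting point, then adjust based on validation results. The multiplier fits small datasets best: at the top of the 30–50 image range it exceeds the listed steps (50 × 150 = 7500 versus 6000), so for larger datasets start from the listed range instead.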
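As a worked example of the rule of thumb, the sketch below computes the starting point together with the per-image floor and ceiling used in the Troubleshooting section (num_images × 100 and num_images × 300). The function name `estimate_steps` is illustrative.

```python
def estimate_steps(num_images: int, steps_per_image: int = 150) -> dict:
    """Starting step count plus the floor and ceiling used elsewhere in this guide."""
    if num_images <= 0:
        raise ValueError("num_images must be positive")
    return {
        "start": num_images * steps_per_image,  # rule-of-thumb starting point
        "floor": num_images * 100,              # lower bound when fighting overfitting
        "ceiling": num_images * 300,            # upper bound when fighting underfitting
    }


print(estimate_steps(25))  # {'start': 3750, 'floor': 2500, 'ceiling': 7500}
```

For a 25-image dataset this suggests starting near 3750 steps and staying between 2500 and 7500 while tuning.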
LoRA Rank and Alpha
| Rank | Alpha | Use Case | Model Size |
|---|---|---|---|
| 4 | 2 | Micro LoRA (small concepts) | ~20 MB |
| 8 | 4 | Lightweight adapter | ~40 MB |
| 16 | 16 | Standard LoRA (recommended default) | ~90 MB |
| 32 | 32 | Complex concepts / styles | ~180 MB |
| 64 | 64 | Maximum capacity | ~360 MB |
Rank 16 with alpha 16 is the recommended starting point for most training tasks. Higher ranks provide more capacity but risk overfitting with small datasets.
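To make the rank and alpha columns concrete: in the standard LoRA formulation, the adapted weight is W' = W + (alpha / rank) · B·A, so alpha / rank scales the strength of the update while rank sets the number of trainable parameters. The sketch below works through the table's rows for a single linear layer. The 1536 × 1536 layer shape is purely illustrative; real layer shapes depend on the model.

```python
def lora_scale(rank: int, alpha: float) -> float:
    """Scaling applied to the low-rank update: W' = W + (alpha / rank) * B @ A."""
    return alpha / rank


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted linear layer (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)


# Rank/alpha pairs from the table; the layer shape is an illustrative assumption.
for rank, alpha in [(4, 2), (8, 4), (16, 16), (32, 32), (64, 64)]:
    print(rank, alpha, lora_scale(rank, alpha), lora_params(1536, 1536, rank))
```

Note that the rank 4 and rank 8 rows scale the update by 0.5, while the rank 16 to 64 rows apply it at full strength (1.0). The parameter count per layer grows linearly with rank.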
Other Key Parameters
| Parameter | Recommended | Notes |
|---|---|---|
| Batch size | 1–4 | Limited by VRAM |
| Optimizer | AdamW8bit | Memory-efficient |
| Scheduler | cosine_with_restarts | Stable convergence |
| Shuffle tags | True | Prevents the model from memorizing tag order |
| Cache latents | True | Speeds up training significantly |
| Min SNR gamma | 5.0 | Improves fine detail learning |
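For readers unfamiliar with the Min SNR gamma setting, the sketch below shows the weighting as it is commonly defined for noise-prediction objectives: each timestep's loss is multiplied by min(SNR, gamma) / SNR, which caps the influence of low-noise timesteps. How a particular trainer applies this to Z-Image Base is not something this sketch covers, and the function name `min_snr_weight` is illustrative.

```python
def min_snr_weight(snr: float, gamma: float = 5.0) -> float:
    """Loss weight min(SNR, gamma) / SNR for a noise-prediction objective."""
    if snr <= 0:
        raise ValueError("snr must be positive")
    return min(snr, gamma) / snr


# High-SNR (low-noise) timesteps are down-weighted; the rest keep weight 1.0.
for snr in (100.0, 5.0, 0.5):
    print(snr, min_snr_weight(snr))
```

With gamma set to 5.0, any timestep whose SNR is at or below 5 keeps its full weight, and only easier, low-noise timesteps are scaled down.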
LoRA Training with Kohya_ss
Configuration File
{
  "train_batch_size": 1,
  "learning_rate": 0.0001,
  "lr_scheduler": "cosine_with_restarts",
  "lr_warmup_steps": 100,
  "max_train_steps": 3000,
  "mixed_precision": "no",
  "optimizer_type": "AdamW8bit",
  "output_name": "character_lora",
  "save_every_n_steps": 500,
  "save_precision": "fp16",
  "seed": 42,
  "sdxl": false,
  "model_type": "dit",
  "dit_dim_head": 64,
  "dit_depth": 28,
  "dit_num_heads": 24,
  "lora_network_dim": 16,
  "lora_network_alpha": 16,
  "min_snr_gamma": 5.0,
  "cache_latents": true,
  "cache_latents_to_disk": true,
  "persistent_data_loader_workers": true
}
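Before launching a run, it can help to check a config like the one above against the ranges recommended in this guide. The following is a minimal sketch: `check_config` is a hypothetical helper, and its thresholds come from the hyperparameter tables in this guide (LoRA learning rates of 5e-5 to 1e-4, batch size 1–4, ranks 4–64), not from any trainer's own validation.

```python
import json

RECOMMENDED_RANKS = {4, 8, 16, 32, 64}  # ranks listed in this guide's table


def check_config(config: dict) -> list[str]:
    """Return warnings for values outside the ranges this guide recommends."""
    warnings = []
    if config.get("lora_network_dim") not in RECOMMENDED_RANKS:
        warnings.append("lora_network_dim is not one of 4, 8, 16, 32, 64")
    if not 1 <= config.get("train_batch_size", 1) <= 4:
        warnings.append("train_batch_size is outside the 1-4 range")
    # LoRA range only; full fine-tuning uses a lower learning rate.
    if not 5e-5 <= config.get("learning_rate", 1e-4) <= 1e-4:
        warnings.append("learning_rate is outside the 5e-5 to 1e-4 LoRA range")
    if config.get("lr_warmup_steps", 0) >= config.get("max_train_steps", 1):
        warnings.append("lr_warmup_steps should be smaller than max_train_steps")
    return warnings


sample = json.loads(
    '{"train_batch_size": 1, "learning_rate": 0.0001, "lr_warmup_steps": 100,'
    ' "max_train_steps": 3000, "lora_network_dim": 16}'
)
print(check_config(sample))  # an empty list means no warnings
```

A warning is a prompt to double-check a value, not proof that it is wrong; deliberate deviations are fine.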
Training Command
python sdxl_train.py \
  --pretrained_model_name_or_path="Tongyi-ZImage/Z-Image-Base" \
  --dataset_config="dataset_config.yaml" \
  --output_dir="./output" \
  --network_module="networks.lora" \
  --network_dim=16 \
  --network_alpha=16 \
  --learning_rate=0.0001 \
  --train_batch_size=1 \
  --max_train_steps=3000 \
  --mixed_precision="no" \
  --cache_latents \
  --min_snr_gamma=5.0 \
  --lr_scheduler="cosine_with_restarts" \
  --optimizer_type="AdamW8bit" \
  --seed=42
Dataset Config (YAML)
datasets:
  - subfolders:
      - images/
    caption_extension: ".txt"
    enable_bucket: true
    bucket_no_upscale: false
    keep_tokens: 1
    color_jiggle: false
    flip_aug: false
    text_encoder_lr: 0.00001
LoRA Training in ComfyUI
ComfyUI supports LoRA training through community extensions like ComfyUI-Training or ComfyUI-Kohya. The workflow is node-based:
Load Z-Image Base Model
↓
Load Training Dataset
↓
Configure LoRA Settings (rank, alpha, LR, steps)
↓
Train LoRA Node
↓
Save LoRA (.safetensors)
ComfyUI Training advantages:
- Visual feedback during training
- Easy hyperparameter adjustments
- Integrated preview generation
- GPU utilization monitoring
ComfyUI Training limitations:
- Fewer advanced options than Kohya_ss CLI
- Higher VRAM overhead from the UI
- May not support all optimizer variants
VRAM Requirements
| Configuration | VRAM Required | GPU Examples |
|---|---|---|
| LoRA rank 4, BS=1 | ~10–12 GB | RTX 3060, RTX 4070 |
| LoRA rank 8, BS=1 | ~12–14 GB | RTX 4060 Ti 16 GB, RTX 4080 |
| LoRA rank 16, BS=1 | ~14–16 GB | RTX 3090, RTX 4080 |
| LoRA rank 16, BS=2 | ~18–22 GB | RTX 3090 (tight), RTX 4090 |
| LoRA rank 32, BS=1 | ~16–20 GB | RTX 3090, RTX 4090 |
| Full fine-tuning | ~24–40 GB | A100, H100 |
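The table can also be read in reverse: given a GPU, which LoRA configurations are likely to fit? The sketch below encodes the upper VRAM estimate of each LoRA row and filters by available memory. The helper name `configs_that_fit` and the 1 GB headroom default are assumptions for illustration.

```python
# (rank, batch size, upper VRAM estimate in GB), copied from the LoRA rows of the table.
VRAM_TABLE = [(4, 1, 12), (8, 1, 14), (16, 1, 16), (16, 2, 22), (32, 1, 20)]


def configs_that_fit(available_gb: float, headroom_gb: float = 1.0):
    """List (rank, batch_size) pairs whose upper estimate fits with some headroom."""
    return [
        (rank, batch_size)
        for rank, batch_size, upper_gb in VRAM_TABLE
        if upper_gb + headroom_gb <= available_gb
    ]


print(configs_that_fit(16))  # [(4, 1), (8, 1)]
print(configs_that_fit(24))  # [(4, 1), (8, 1), (16, 1), (16, 2), (32, 1)]
```

Because the table values are approximate, treat the result as a first guess and confirm with a short test run.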
VRAM Optimization Techniques
- Cache latents: Pre-compute VAE encodings, reduces per-step VRAM by ~20%
- Gradient checkpointing: Trades compute for memory, recommended for rank ≥ 32
- 8-bit optimizers: AdamW8bit reduces optimizer state VRAM by ~50%; Adafactor is another memory-efficient option
- DDP/ZeroRedundancyOptimizer: Multi-GPU training for large configurations
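- --mixed_precision="no": Z-Image Base training works best in full precision (FP32) rather than BF16 mixed precision. Unlike the techniques above, this setting costs VRAM instead of saving it.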
Quality Comparison: Base vs Turbo Trained LoRA
Base-Trained LoRA
- Character consistency: Strong — facial features, proportions maintained across generations
- Style transfer: Accurate — captures brushwork, color palettes, composition tendencies
- Prompt adherence: High — benefits from CFG support for negative prompting
- Diversity: Good — same trigger + varied prompts produce diverse results
- Overfitting risk: Moderate — manageable with proper validation
Turbo-Trained LoRA (Community Attempts)
- Character consistency: Weak — distilled noise trajectory conflicts with LoRA weights
- Style transfer: Inconsistent — partial style adoption with structural artifacts
- Prompt adherence: Reduced — lack of CFG means no negative prompt filtering
- Diversity: Poor — Turbo's inherent low diversity compounds with LoRA
- Artifacting: Common — visual distortions, especially in fine details
Verdict: Always train on Base. The quality gap is significant and consistent across use cases.
Troubleshooting
Artifacting
Symptoms: Weird patterns, checkerboard textures, or structural distortions in generated images.
Causes and fixes:
- Learning rate too high → Reduce to 5e-5 or lower
- Too many steps for dataset size → Reduce total steps by 30–50%
- Dataset contains inconsistent subjects → Curate dataset more carefully
- Rank too high for data amount → Lower rank from 32 to 16 or 8
Overfitting
Symptoms: Generated images look nearly identical regardless of prompt; model memorizes training data.
Causes and fixes:
- Too many steps → Reduce to num_images × 100 as a minimum
- Same background/lighting in all training images → Add variety to dataset
- No augmentation → Enable flip augmentation and color jittering
- Trigger token used too frequently → Limit to once per caption
Underfitting
Symptoms: Generated images barely resemble the target concept; LoRA has minimal effect.
Causes and fixes:
- Too few steps → Increase to num_images × 300 as a maximum
- Learning rate too low → Increase from 5e-5 to 1e-4
- Rank too low → Increase from 4 to 16
- Poor captions → Rewrite captions with more descriptive detail
- Training on wrong model variant → Confirm using Base, not Turbo
Color Cast
Symptoms: All LoRA generations share an unwanted color tint.
Causes and fixes:
- Training dataset has uniform color bias → Add color-diverse images
- White balance inconsistency → Apply color correction preprocessing
- No color augmentation → Enable color jittering in the training config
Detail Loss
Symptoms: Fine details (facial features, text, textures) are blurred or missing.
Causes and fixes:
- min_snr_gamma not set → Set to 5.0
- Resolution mismatch → Ensure all training images are at 1024x1024
- Too much augmentation → Disable color_jiggle and flip_aug for character LoRA
Practical Example: Training a Character LoRA on Base
Step 1: Dataset
Collect 25 images of the target character:
- 10 portrait shots (various angles)
- 5 full-body shots
- 5 different outfits/settings
- 5 varied lighting conditions
Each image paired with a caption using trigger token sks:
A portrait of sks, a young woman with shoulder-length brown hair,
blue eyes, wearing a navy blazer, professional studio lighting,
clean white background, photorealistic
Step 2: Configuration
{
  "learning_rate": 0.0001,
  "max_train_steps": 4000,
  "lora_network_dim": 16,
  "lora_network_alpha": 16,
  "train_batch_size": 1,
  "min_snr_gamma": 5.0,
  "cache_latents": true,
  "optimizer_type": "AdamW8bit",
  "lr_scheduler": "cosine_with_restarts"
}
Step 3: Training
python sdxl_train.py \
  --pretrained_model_name_or_path="Tongyi-ZImage/Z-Image-Base" \
  --dataset_config="character_config.yaml" \
  --output_dir="./character_output" \
  --network_module="networks.lora" \
  --network_dim=16 \
  --network_alpha=16 \
  --learning_rate=0.0001 \
  --max_train_steps=4000 \
  --train_batch_size=1 \
  --cache_latents \
  --min_snr_gamma=5.0 \
  --lr_scheduler="cosine_with_restarts" \
  --optimizer_type="AdamW8bit"
Step 4: Validation
Generate test images at steps 500, 1500, 3000, and 4000:
| Checkpoint | Test Prompt | Evaluation |
|---|---|---|
| Step 500 | sks, casual portrait, natural lighting | Early concept learning |
| Step 1500 | sks, business attire, office background | Style emergence |
| Step 3000 | sks, outdoor scene, golden hour | Refinement check |
| Step 4000 | sks, fantasy costume, dramatic lighting | Overfitting check |
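Validating at these steps only works if a checkpoint was actually saved at each one. The sketch below checks a list of validation steps against a save interval such as the save_every_n_steps value of 500 used in the Kohya_ss configuration earlier. The helper name `unavailable_steps` is illustrative, and the sketch assumes the trainer also writes a final checkpoint at max_train_steps.

```python
def unavailable_steps(validation_steps, save_every_n_steps, max_train_steps):
    """Return validation steps at which no checkpoint would have been saved."""
    return [
        step
        for step in validation_steps
        # Assumption: the trainer also saves a final checkpoint at max_train_steps.
        if step > max_train_steps
        or (step % save_every_n_steps != 0 and step != max_train_steps)
    ]


print(unavailable_steps([500, 1500, 3000, 4000], 500, 4000))   # []
print(unavailable_steps([500, 1500, 3000, 4000], 1000, 4000))  # [500, 1500]
```

With a save interval of 500, all four validation steps have a checkpoint; with an interval of 1000, steps 500 and 1500 would be missing.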
Step 5: Inference with Trained LoRA
import torch
from diffusers import ZImagePipeline

# Load the Base model and move it to the GPU
pipe = ZImagePipeline.from_pretrained(
    "Tongyi-ZImage/Z-Image-Base", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Attach the trained LoRA adapter
pipe.load_lora_weights("./character_output/character_lora.safetensors")

# Generate with the trigger token, a negative prompt, and CFG
result = pipe(
    prompt="sks, wearing a red dress at a garden party, cinematic lighting",
    negative_prompt="blurry, deformed, low quality, extra limbs",
    guidance_scale=7.5,
    num_inference_steps=30
).images[0]
Summary
Z-Image Base is the only variant supporting LoRA fine-tuning. Key takeaways:
- Always train on Base — Turbo's distillation makes LoRA incompatible
- Dataset quality matters most — 10–50 curated images beat 100 noisy ones
- Rank 16, LR 1e-4 is a solid starting point for most use cases
- Cache latents and use 8-bit optimizers to manage VRAM efficiently
- Validate frequently during training to catch overfitting early
- Use unique trigger tokens and write descriptive captions for each image
The Base model's support for CFG and negative prompts during inference means LoRA-trained models benefit from additional control that Turbo simply cannot provide.