Z-Image GGUF/FP8 Local Deployment Complete Guide: From 16GB to 8GB VRAM Optimization
Author: Z-Image Tech Team | Published: 2026-05-14 | Reading time: 15 minutes
Table of Contents
- Introduction: Why GGUF/FP8?
- Quantization Formats Explained: BF16 vs FP8 vs GGUF
- Hardware Requirements and VRAM Estimation
- GGUF Deployment: Running on 8GB VRAM
- FP8 Deployment: Balancing Speed and Quality
- ComfyUI Workflow Configuration
- LoRA Training with Quantized Models
- Troubleshooting
- Conclusion
Introduction
Z-Image is an open-source image generation model family that ships in multiple precision variants to suit different hardware. From the original 16-bit BF16 weights to 8-bit FP8 and GGUF quantized formats, these variants let users run high-quality image generation locally on anything from high-end GPUs to consumer-grade cards.
This article examines the pros and cons of the three formats and provides a complete deployment guide from scratch.
Quantization Formats Explained: BF16 vs FP8 vs GGUF
BF16 (BFloat16) — Original Precision
- VRAM requirement: 16GB+
- Generation quality: Optimal
- Inference speed: Baseline
- Use case: Professional users, model training, highest quality output
BF16 is the original release format for Z-Image Turbo, preserving full model precision. If you have sufficient VRAM, this is the preferred option.
FP8 — The Balanced Choice
- VRAM requirement: ~8-12GB (depending on implementation)
- Generation quality: Near BF16, visually indistinguishable
- Inference speed: 1.5-2x faster than BF16 on GPUs with native FP8 Tensor Cores
- Use case: Daily generation, batch processing, VRAM-constrained but quality-focused
FP8 (Float8) was introduced by NVIDIA in the Hopper architecture and is now widely supported. The Z-Image community provides two FP8 variants:
- T5B FP8: Community-contributed FP8 version with good stability
- drbaph FP8: Alternative community FP8 version, faster in some scenarios
GGUF — Low VRAM Solution
- VRAM requirement: Q4 ~4-6GB, Q8 ~8GB
- Generation quality: Q8 near-original, Q4 with slight quality degradation
- Inference speed: Moderate
- Use case: Low VRAM GPUs, CPU inference, entry-level users
The GGUF format originated in the LLM community (the ggml/llama.cpp project) and shrinks models aggressively by quantizing weights in small blocks, each with its own scale factor. The Z-Image community offers Q4 (4-bit) and Q8 (8-bit) quantization levels:
| Quantization Level | VRAM | Quality Retention | Recommended Use |
|---|---|---|---|
| Q8 | ~8GB | 95%+ | Best for low VRAM |
| Q4 | ~4-6GB | 85-90% | Extreme VRAM constraint |
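To make the idea of block quantization concrete, here is a minimal pure-Python sketch in the style of llama.cpp's Q8_0 scheme. It assumes 32 weights per block and one scale per block; the function names and toy data are illustrative, and this is not the code that produced the Z-Image GGUF files.

```python
# Illustrative Q8_0-style block quantization (pure Python, not the real GGUF code).
BLOCK_SIZE = 32  # assumption: 32 weights share one scale, as in llama.cpp's Q8_0


def quantize_block(block):
    """Map floats to int8-range integers plus one float scale for the block."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return scale, [round(x / scale) for x in block]


def dequantize_block(scale, qvalues):
    """Recover approximate floats from the stored scale and integers."""
    return [scale * q for q in qvalues]


# Toy weights in the range [-0.5, 0.5]
weights = [((i * 37) % 101 - 50) / 100 for i in range(BLOCK_SIZE)]
scale, q = quantize_block(weights)
restored = dequantize_block(scale, q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.5f}, worst-case error={max_error:.5f}")
```

The rounding error per weight is at most half the block's scale, which is why 8-bit blocks stay close to the original while 4-bit blocks, with far fewer levels, lose more detail.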
Hardware Requirements and VRAM Estimation
Minimum Configuration Requirements
| Format | Minimum VRAM | Recommended VRAM | Recommended GPU |
|---|---|---|---|
| GGUF Q4 | 4GB | 6GB | GTX 1660 / RTX 3050 |
| GGUF Q8 | 6GB | 8GB | RTX 3060 / RTX 4060 |
| FP8 | 8GB | 12GB | RTX 3070 / RTX 4070 |
| BF16 | 12GB | 16GB+ | RTX 3080 / RTX 4080+ |
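The table can be read as a simple lookup. The sketch below picks a format from available VRAM using the "Recommended VRAM" column; the function name and the fallback wording are hypothetical, and real headroom depends on resolution and other running programs.

```python
# Thresholds come from the "Recommended VRAM" column of the table above.
RECOMMENDED_VRAM_GB = [  # checked from most to least demanding
    ("BF16", 16),
    ("FP8", 12),
    ("GGUF Q8", 8),
    ("GGUF Q4", 6),
]
MINIMUM_VRAM_GB = 4  # GGUF Q4 minimum from the same table


def suggest_format(vram_gb):
    """Return the highest-precision format whose recommended VRAM fits."""
    for name, needed in RECOMMENDED_VRAM_GB:
        if vram_gb >= needed:
            return name
    if vram_gb >= MINIMUM_VRAM_GB:
        return "GGUF Q4 (below recommended VRAM; expect tight memory)"
    return None  # below every minimum listed in the table


for vram in (24, 12, 8, 4):
    print(vram, "GB ->", suggest_format(vram))
```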
VRAM Calculation
Total VRAM = Model Weights + Text Encoder + VAE + Intermediate Activations + Batch Processing
For Z-Image Turbo BF16:
- Model weights: ~10GB
- Text Encoder (Qwen 3 4B): ~2GB
- VAE: ~0.5GB
- Intermediate activations: ~2-4GB (resolution-dependent)
- Total: ~15-17GB
GGUF Q4 quantized:
- Model weights: ~2.5GB
- Text Encoder: ~2GB (can be quantized separately)
- VAE: ~0.5GB
- Intermediate activations: ~1-2GB
- Total: ~6-7GB
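The two breakdowns above are plain sums, which the following sketch reproduces. It assumes a batch size of 1 (so the batch term of the formula is folded into activations) and uses only the figures quoted in this section; the function name is illustrative.

```python
def estimate_vram_gb(weights, text_encoder, vae, activations):
    """Sum the fixed components with a (low, high) activation range, in GB."""
    fixed = weights + text_encoder + vae
    low, high = activations
    return fixed + low, fixed + high


bf16 = estimate_vram_gb(weights=10, text_encoder=2, vae=0.5, activations=(2, 4))
q4 = estimate_vram_gb(weights=2.5, text_encoder=2, vae=0.5, activations=(1, 2))
print("BF16:", bf16)
print("GGUF Q4:", q4)
```

The BF16 range of 14.5 to 16.5 GB matches the rounded "~15-17GB" total, and the Q4 range of 6 to 7 GB matches the figure above.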
GGUF Deployment: Running on 8GB VRAM
Step 1: Environment Setup
# Create and activate a Python virtual environment
python -m venv zimage-env
source zimage-env/bin/activate
# Install PyTorch with CUDA 12.4 wheels
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
# Clone ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
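Before moving on, it can help to confirm that PyTorch sees the GPU. The following is a small sketch that reports the device name and total VRAM; it returns a plain dictionary and degrades gracefully if PyTorch or CUDA is missing. The function name is illustrative.

```python
import importlib.util


def describe_gpu():
    """Return a small report; does not raise if PyTorch or CUDA is missing."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False, "cuda": False}
    import torch

    report = {"torch_installed": True, "cuda": torch.cuda.is_available()}
    if report["cuda"]:
        props = torch.cuda.get_device_properties(0)
        report["gpu"] = props.name
        report["vram_gb"] = round(props.total_memory / 1024**3, 1)
    return report


print(describe_gpu())
```

If `cuda` comes back `False` on a machine with an NVIDIA GPU, the usual suspects are a CPU-only PyTorch wheel or an outdated driver.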
Step 2: Download GGUF Model
Download Z-Image Turbo GGUF quantized version from HuggingFace:
# Run from inside the ComfyUI directory (continuing from Step 1)
mkdir -p models/unet
# Download GGUF Q8 version (recommended)
# Get from HuggingFace Comfy-Org/z_image_turbo
# Download Text Encoder
mkdir -p models/text_encoders
# qwen_3_4b.safetensors → text_encoders/
# Download VAE
mkdir -p models/vae
# ae.safetensors → vae/
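Once the downloads finish, a quick check that every file landed in the right folder can save a confusing startup error. This sketch takes the path of the ComfyUI checkout and looks for the text encoder and VAE file names used in this guide, plus any `.gguf` file under `models/unet`. The GGUF file name itself is not assumed, and the function name is hypothetical.

```python
from pathlib import Path


def missing_model_files(comfy_root):
    """List what is missing under <comfy_root>/models; empty list means OK."""
    models = Path(comfy_root) / "models"
    problems = []
    unet_dir = models / "unet"
    if not (unet_dir.is_dir() and any(unet_dir.glob("*.gguf"))):
        problems.append("no .gguf file in models/unet")
    for rel in ("text_encoders/qwen_3_4b.safetensors", "vae/ae.safetensors"):
        if not (models / rel).is_file():
            problems.append(f"missing models/{rel}")
    return problems


# Example: check the current directory as the ComfyUI root
for problem in missing_model_files("."):
    print("WARNING:", problem)
```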
Step 3: Launch and Verify
# Start ComfyUI
python main.py --lowvram
# Visit http://127.0.0.1:8188
# Load official workflow JSON
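The "verify" part can also be scripted. The sketch below polls the local server and reports whether it answers; it assumes ComfyUI's `/system_stats` HTTP endpoint on the default address shown above, and the function name is illustrative.

```python
import json
import urllib.error
import urllib.request


def is_comfyui_up(base_url="http://127.0.0.1:8188", timeout=3.0):
    """Return True if the server answers /system_stats with valid JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/system_stats", timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        return False


print("ComfyUI reachable:", is_comfyui_up())
```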
GGUF Loading Notes
- Loading node: Use the Load Diffusion Model (GGUF) node, not the standard Load Checkpoint
- VRAM optimization: Add the --lowvram parameter to enable VRAM optimization mode
- Speed expectation: GGUF inference speed is approximately 70-80% of BF16
FP8 Deployment: Balancing Speed and Quality
FP8 Version Comparison
| Feature | T5B FP8 | drbaph FP8 |
|---|---|---|
| Quantization Method | Standard FP8 E4M3 | Custom FP8 |
| Quality Score | 97/100 | 95/100 |
| Speed Boost | 1.8x | 2.0x |
| VRAM Requirement | ~9GB | ~8GB |
| Community Support | Extensive | Moderate |
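For readers curious what "E4M3" means in practice, here is a small decoder for a single FP8 byte: 1 sign bit, 4 exponent bits (bias 7), and 3 mantissa bits. It assumes the finite-only variant that PyTorch exposes as `float8_e4m3fn`. Whether a given community checkpoint uses exactly this variant is an assumption, and the function name is illustrative.

```python
import math


def decode_e4m3(byte):
    """Decode one FP8 E4M3 (finite-only variant) byte into a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan  # the only NaN pattern; this variant has no infinities
    if exponent == 0:
        return sign * (mantissa / 8) * 2.0 ** -6  # subnormal values
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)


largest = max(decode_e4m3(b) for b in range(0x7F))
print("1.0 is stored as 0x38:", decode_e4m3(0x38))
print("largest finite value:", largest)
```

With only three mantissa bits, each power-of-two interval holds just eight representable values, which is why FP8 checkpoints rely on careful scaling to stay close to BF16 quality.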
Deployment Steps
# Download FP8 model
# z_image_turbo_fp8.safetensors → models/unet/
# Ensure ComfyUI is updated to latest version
# FP8-supporting ComfyUI version >= 2025.11
# Start (no --lowvram needed, FP8 already optimizes VRAM)
python main.py
FP8 Optimization Tips
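- NVFP4 experimental format: NVIDIA's latest 4-bit format; roughly halves weight memory relative to FP8, with quality close to FP8
- Tensor Core acceleration: Native FP8 Tensor Cores are a hardware feature (RTX 40 series has native support); keep the GPU driver and PyTorch build up to date so they can be used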
- Batch processing: FP8 speed advantage becomes more pronounced in batch generation
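Whether a card has native FP8 Tensor Cores can be inferred from its CUDA compute capability. The sketch below assumes the usual mapping: Ada (RTX 40 series) is 8.9, Hopper is 9.0, and RTX 30 series is 8.6. The function name is illustrative, and the capability tuple would typically come from `torch.cuda.get_device_capability(0)`.

```python
def has_native_fp8(compute_capability):
    """True for Ada (8.9), Hopper (9.0), and newer NVIDIA architectures."""
    return tuple(compute_capability) >= (8, 9)


# With PyTorch installed, the tuple can be read from the device:
#   import torch
#   capability = torch.cuda.get_device_capability(0)
for name, capability in (("RTX 30 series", (8, 6)), ("RTX 40 series", (8, 9))):
    print(name, "native FP8:", has_native_fp8(capability))
```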
ComfyUI Workflow Configuration
Complete Workflow JSON
ComfyUI provides an official, verified Z-Image Turbo workflow. Its key settings, in simplified form:
{
"model_loader": "LoadDiffusionModelGGUF",
"text_encoder": "qwen_3_4b.safetensors",
"vae": "ae.safetensors",
"sampler": "Euler",
"steps": 8,
"cfg": 1.0
}
Key Node Descriptions
| Node | Purpose | Recommended Settings |
|---|---|---|
| Load Diffusion Model | Load UNet | Select GGUF/FP8/BF16 based on format |
| CLIP Text Encode | Prompt encoding | Positive + negative prompts |
| KSampler | Sampler | Euler, 8 steps, CFG 1.0 |
| VAEDecode | Decode latent space | Default settings |
| Save Image | Output | PNG/JPG |
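A workflow built from these nodes can also be queued without the browser UI. The sketch below only builds the HTTP request; it assumes ComfyUI's `POST /prompt` endpoint and a workflow exported in API format. The file name `workflow_api.json` and the function name are hypothetical.

```python
import json
import urllib.request


def build_queue_request(api_workflow, base_url="http://127.0.0.1:8188"):
    """Build (but do not send) a request that queues an API-format workflow."""
    payload = json.dumps({"prompt": api_workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (needs a running server and a workflow exported in API format):
#   with open("workflow_api.json") as f:   # hypothetical file name
#       request = build_queue_request(json.load(f))
#   urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect the payload before anything reaches the server.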
Low VRAM Optimized Workflow
For 8GB and below VRAM:
- Enable the --lowvram parameter (ComfyUI has no --medvram flag; --novram offloads even more aggressively)
- Use GGUF Q8 or FP8 format
- Lower generation resolution (start with 512x512)
- Use tiled VAE decoding (the VAE Decode (Tiled) node)
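Lowering the resolution helps because activation memory grows with the number of pixels being processed. As a rough guide, assuming memory scales about linearly with pixel count (a simplification), this sketch compares a resolution against 1024x1024. The function name is illustrative.

```python
def relative_pixel_count(width, height, base=(1024, 1024)):
    """Pixel count of (width, height) as a fraction of the base resolution."""
    return (width * height) / (base[0] * base[1])


for size in ((512, 512), (768, 768), (1024, 1024)):
    print(size, relative_pixel_count(*size))
```

Halving each side from 1024 to 512 leaves a quarter of the pixels, which is why dropping resolution is often the quickest fix for tight VRAM.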
LoRA Training with Quantized Models
Important Warning
Z-Image Turbo is a distilled model tuned for few-step sampling. Training a LoRA directly on Turbo generally yields poor results.
Recommended Strategy
| Phase | Model | Purpose |
|---|---|---|
| Inference | Z-Image Turbo (GGUF/FP8) | Daily generation, load LoRA |
| Training | Z-Image Base (BF16) | Train new LoRA |
Recommended LoRA Training Tools
- Ostris AI-Toolkit: Training tool specifically designed for Z-Image/Flux architecture
- Kohya_ss: Classic Stable Diffusion training tool, now adapted for Z-Image
Post-Training Usage
A trained LoRA can be loaded directly onto GGUF/FP8 Turbo models; BF16 precision is not required for inference.
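The reason this works is that a LoRA is an additive low-rank update to a weight matrix: W' = W + strength * (alpha / rank) * (up x down). The toy sketch below applies that formula in pure Python. It illustrates the math only and makes no claim about how ComfyUI applies LoRA to quantized weights internally; all names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, lora_down, lora_up, alpha, strength=1.0):
    """Return W + strength * (alpha / rank) * (up @ down)."""
    rank = len(lora_down)
    delta = matmul(lora_up, lora_down)
    scale = strength * alpha / rank
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


base = [[1.0, 0.0], [0.0, 1.0]]  # toy 2x2 weight
down = [[0.5, -0.5]]             # rank 1, shape (1, 2)
up = [[2.0], [4.0]]              # shape (2, 1)
patched = apply_lora(base, down, up, alpha=1.0)
print(patched)
```

Because the update is simply added to the base weights, it does not depend on which storage format those weights were loaded from.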
Troubleshooting
Q1: GGUF loading fails with "Unsupported format"
Solution: Ensure ComfyUI is updated to the latest version. Older versions don't support GGUF format UNet loading.
# Update ComfyUI
git pull
pip install -r requirements.txt
# Update all nodes
# Select "Update All" in ComfyUI Manager
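If the error persists after updating, it is worth confirming that the downloaded file really is a GGUF file and not a truncated download or an HTML error page. A GGUF file starts with the 4-byte magic `GGUF`, followed by a little-endian version number and two counts. The sketch below assumes GGUF version 2 or later, where the counts are 64-bit; the function name is illustrative.

```python
import struct


def read_gguf_header(path):
    """Return basic header fields, or None if the file is not GGUF."""
    with open(path, "rb") as f:
        head = f.read(24)
    if len(head) < 24 or head[:4] != b"GGUF":
        return None
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, kv_count = struct.unpack("<IQQ", head[4:24])
    return {"version": version, "tensors": tensor_count, "metadata_keys": kv_count}


# Usage (the path is a placeholder for your downloaded file):
#   print(read_gguf_header("models/unet/your_model.gguf"))
```

A `None` result points to a bad download rather than a ComfyUI problem.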
Q2: Out-of-memory (OOM) errors from VRAM overflow
Solution:
- Lower resolution (1024x1024 → 512x512)
- Use the --lowvram parameter
- Switch to GGUF Q4 format
- Close other GPU-consuming programs
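To see how much VRAM other programs are already using, `nvidia-smi` can be queried from a script. The sketch below uses its standard `--query-gpu` and `--format` options and returns an empty list when the tool is not installed; the function names are illustrative.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_line(line):
    """Parse one 'used, total' line (values in MiB) into a tuple of ints."""
    used_mib, total_mib = (int(part) for part in line.split(","))
    return used_mib, total_mib


def gpu_memory():
    """Return [(used_mib, total_mib), ...] per GPU, or [] without nvidia-smi."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(QUERY, capture_output=True, text=True, check=False).stdout
    return [parse_memory_line(line) for line in out.splitlines() if line.strip()]
```

Calling `gpu_memory()` before launching ComfyUI shows whether a browser, game, or another model is already holding memory that the generation run will need.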
Q3: Generation quality degradation
Solution:
- Check that you are using the correct sampler (Euler)
- Confirm step count (Turbo recommended: 8 steps)
- Set CFG to 1.0 (distilled models don't need high CFG)
- Consider upgrading to a higher precision format
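The first three checks can be turned into a small sanity test for sampler settings. This sketch compares values against the recommendations in this guide (Euler, 8 steps, CFG 1.0); the function name and the lowercase sampler spelling are assumptions.

```python
RECOMMENDED = {"sampler": "euler", "steps": 8, "cfg": 1.0}


def settings_warnings(sampler, steps, cfg):
    """Return human-readable warnings for settings that differ from the guide."""
    warnings = []
    if sampler.lower() != RECOMMENDED["sampler"]:
        warnings.append(f"sampler is {sampler!r}, guide recommends Euler")
    if steps != RECOMMENDED["steps"]:
        warnings.append(f"steps is {steps}, guide recommends 8")
    if abs(cfg - RECOMMENDED["cfg"]) > 1e-9:
        warnings.append(f"cfg is {cfg}, guide recommends 1.0")
    return warnings


print(settings_warnings("Euler", 8, 1.0))
print(settings_warnings("dpmpp_2m", 20, 7.0))
```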
Q4: FP8 not supported on RTX 30 series
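Solution: RTX 30 series GPUs lack native FP8 Tensor Cores, but they can still load FP8 weights: the weights are stored in 8 bits and upcast to a 16-bit format on the GPU for computation. This keeps the VRAM savings but not the FP8 speed boost. GGUF is a well-supported alternative on RTX 30 series cards.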
Conclusion
Z-Image's multi-precision format strategy successfully covers hardware from high-end to entry-level:
- BF16: For professional users pursuing the highest quality
- FP8: Best choice for daily use, perfect balance of speed and quality
- GGUF: Lifesaver for low VRAM users, running on 8GB or even 4GB VRAM
With emerging formats like NVIDIA NVFP4, the barrier for Z-Image local deployment continues to drop. Users should choose the format that matches their hardware conditions — no need to blindly pursue the highest precision.