Z-Image GGUF/FP8 Local Deployment Complete Guide: From 16GB to 8GB VRAM Optimization

May 14, 2026

Author: Z-Image Tech Team | Published: 2026-05-14 | Reading time: 15 minutes


Table of Contents

  1. Introduction: Why GGUF/FP8?
  2. Quantization Formats Explained: BF16 vs FP8 vs GGUF
  3. Hardware Requirements and VRAM Estimation
  4. GGUF Deployment: Running on 8GB VRAM
  5. FP8 Deployment: Balancing Speed and Quality
  6. ComfyUI Workflow Configuration
  7. LoRA Training with Quantized Models
  8. Troubleshooting
  9. Conclusion

Introduction

Z-Image is an open-source image generation model family that ships in multiple precision variants to suit different hardware. From the original 16-bit BF16 weights to 8-bit FP8 and GGUF quantized formats, it lets users on anything from high-end GPUs to consumer-grade cards run high-quality image generation locally.

This article compares the pros and cons of the three formats and provides a complete deployment guide from scratch.


Quantization Formats Explained: BF16 vs FP8 vs GGUF

BF16 (BFloat16) — Original Precision

  • VRAM requirement: 16GB+
  • Generation quality: Optimal
  • Inference speed: Baseline
  • Use case: Professional users, model training, highest quality output

BF16 is the original release format for Z-Image Turbo, preserving full model precision. If you have sufficient VRAM, this is the preferred option.

FP8 — The Balanced Choice

  • VRAM requirement: ~8-12GB (depending on implementation)
  • Generation quality: Near BF16, visually indistinguishable
  • Inference speed: 1.5-2x faster than BF16 on GPUs with FP8 Tensor Cores
  • Use case: Daily generation, batch processing, VRAM-constrained but quality-focused

FP8 (8-bit floating point) gained hardware support with NVIDIA's Hopper architecture and is now widely supported in software. The Z-Image community provides two FP8 variants:

  1. T5B FP8: Community-contributed FP8 version with good stability
  2. drbaph FP8: Alternative community FP8 version, faster in some scenarios

GGUF — Low VRAM Solution

  • VRAM requirement: Q4 ~4-6GB, Q8 ~8GB
  • Generation quality: Q8 near-original, Q4 with slight quality degradation
  • Inference speed: Moderate
  • Use case: Low VRAM GPUs, CPU inference, entry-level users

The GGUF format originated in the LLM community with the ggml/llama.cpp project. It compresses models aggressively through block-wise quantization, where small groups of weights share a scale factor. The Z-Image community offers Q4 (4-bit) and Q8 (8-bit) quantization levels:

| Quantization Level | VRAM | Quality Retention | Recommended Use |
| --- | --- | --- | --- |
| Q8 | ~8GB | 95%+ | Best for low VRAM |
| Q4 | ~4-6GB | 85-90% | Extreme VRAM constraint |
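
To make the idea of 8-bit quantization concrete, here is a minimal sketch of a block-wise scheme in the spirit of GGUF's Q8_0 type: each block of weights shares one scale, and every weight is stored as a signed 8-bit integer. The block size of 32 and the function names are assumptions for illustration; this is not the GGUF file format or the code used to produce the Z-Image quantizations.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed block length, chosen to mirror Q8_0-style blocks


def quantize_block(values: List[float]) -> Tuple[float, List[int]]:
    """Quantize one block to signed 8-bit integers that share a single scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return 0.0, [0] * len(values)
    scale = max_abs / 127.0  # the largest magnitude maps to +/-127
    return scale, [round(v / scale) for v in values]


def dequantize_block(scale: float, quants: List[int]) -> List[float]:
    """Recover approximate float values from a quantized block."""
    return [scale * q for q in quants]


def quantize(values: List[float]) -> List[Tuple[float, List[int]]]:
    """Split a flat weight list into blocks and quantize each block independently."""
    return [
        quantize_block(values[i:i + BLOCK_SIZE])
        for i in range(0, len(values), BLOCK_SIZE)
    ]
```

The rounding error per weight is bounded by half the block's scale, which is why a block containing one unusually large weight loses more precision than a block of similar-sized weights.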

Hardware Requirements and VRAM Estimation

Minimum Configuration Requirements

| Format | Minimum VRAM | Recommended VRAM | Recommended GPU |
| --- | --- | --- | --- |
| GGUF Q4 | 4GB | 6GB | GTX 1660 / RTX 3050 |
| GGUF Q8 | 6GB | 8GB | RTX 3060 / RTX 4060 |
| FP8 | 8GB | 12GB | RTX 3070 / RTX 4070 |
| BF16 | 12GB | 16GB+ | RTX 3080 / RTX 4080+ |
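
The table above can be turned into a small lookup. The sketch below picks the highest-precision format whose recommended (or, optionally, minimum) VRAM fits a given card. The thresholds are copied from the table; the function name and the `comfortable` switch are hypothetical.

```python
from typing import Optional

# (format, minimum VRAM in GB, recommended VRAM in GB), highest precision first.
# Values are taken from the requirements table in this guide.
FORMATS = [
    ("BF16", 12, 16),
    ("FP8", 8, 12),
    ("GGUF Q8", 6, 8),
    ("GGUF Q4", 4, 6),
]


def pick_format(vram_gb: float, comfortable: bool = True) -> Optional[str]:
    """Return the highest-precision format that fits, or None if nothing fits.

    With comfortable=True the recommended VRAM is used as the threshold;
    otherwise the minimum VRAM is used.
    """
    for name, minimum, recommended in FORMATS:
        threshold = recommended if comfortable else minimum
        if vram_gb >= threshold:
            return name
    return None
```

For example, `pick_format(8)` selects GGUF Q8, while `pick_format(8, comfortable=False)` selects FP8, which runs at its stated minimum.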

VRAM Calculation

Total VRAM = Model Weights + Text Encoder + VAE + Intermediate Activations + Batch Processing

For Z-Image Turbo BF16:

  • Model weights: ~10GB
  • Text Encoder (Qwen 3 4B): ~2GB
  • VAE: ~0.5GB
  • Intermediate activations: ~2-4GB (resolution-dependent)
  • Total: ~15-17GB

GGUF Q4 quantized:

  • Model weights: ~2.5GB
  • Text Encoder: ~2GB (can be quantized separately)
  • VAE: ~0.5GB
  • Intermediate activations: ~1-2GB
  • Total: ~6-7GB
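
As a quick check of the arithmetic, the breakdowns above can be summed as (low, high) ranges. This sketch uses only the figures listed in this section and leaves out the batch-processing term, which amounts to assuming a batch size of 1; the names are illustrative.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]  # (low, high) estimate in GB

# Component estimates copied from the breakdowns in this section.
BF16: Dict[str, Range] = {
    "model_weights": (10.0, 10.0),
    "text_encoder": (2.0, 2.0),
    "vae": (0.5, 0.5),
    "activations": (2.0, 4.0),  # resolution-dependent
}

GGUF_Q4: Dict[str, Range] = {
    "model_weights": (2.5, 2.5),
    "text_encoder": (2.0, 2.0),
    "vae": (0.5, 0.5),
    "activations": (1.0, 2.0),
}


def total_vram(components: Dict[str, Range]) -> Range:
    """Sum the low and high ends of every component estimate."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


if __name__ == "__main__":
    print("BF16:", total_vram(BF16))        # (14.5, 16.5), i.e. roughly 15-17GB
    print("GGUF Q4:", total_vram(GGUF_Q4))  # (6.0, 7.0)
```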

GGUF Deployment: Running on 8GB VRAM

Step 1: Environment Setup

# Install Python environment
python -m venv zimage-env
source zimage-env/bin/activate

# Install ComfyUI dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Clone ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
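
Before downloading any models, it can help to confirm that the new environment actually sees the GPU. The following is a minimal sketch, assuming PyTorch was installed by the commands above; the function name is hypothetical, and it degrades gracefully when PyTorch or CUDA is missing.

```python
def cuda_report() -> str:
    """Describe the first CUDA device visible to PyTorch, if any."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is visible"
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3  # bytes to GiB
    return f"{props.name}: {vram_gb:.1f} GB VRAM, CUDA {torch.version.cuda}"


if __name__ == "__main__":
    print(cuda_report())
```

Run it inside the activated `zimage-env` environment; if it reports no CUDA device, check the driver and the PyTorch wheel index before continuing.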

Step 2: Download GGUF Model

Download the quantized Z-Image Turbo GGUF model from Hugging Face:

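# Create the model directory. Paths are relative to the folder that contains ComfyUI/;
# if you are still inside ComfyUI/ after Step 1, run `cd ..` first.
mkdir -p ComfyUI/models/unet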

# Download GGUF Q8 version (recommended)
# Get from HuggingFace Comfy-Org/z_image_turbo

# Download Text Encoder
mkdir -p ComfyUI/models/text_encoders
# qwen_3_4b.safetensors → text_encoders/

# Download VAE
mkdir -p ComfyUI/models/vae
# ae.safetensors → vae/
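
Once the files are downloaded, a short check can confirm that each one landed in the folder ComfyUI expects. This is a sketch under assumptions: the text encoder and VAE file names come from the steps above, while the GGUF file name is a hypothetical placeholder that you should replace with the file you actually downloaded.

```python
from pathlib import Path
from typing import Dict, List

# Subfolder of ComfyUI/models -> expected file name.
EXPECTED_FILES: Dict[str, str] = {
    "unet": "z_image_turbo_q8.gguf",  # hypothetical name; use your real GGUF file
    "text_encoders": "qwen_3_4b.safetensors",
    "vae": "ae.safetensors",
}


def missing_model_files(models_dir: str,
                        expected: Dict[str, str] = EXPECTED_FILES) -> List[str]:
    """Return the relative paths of expected model files that are not present."""
    root = Path(models_dir)
    return [
        f"{subdir}/{name}"
        for subdir, name in expected.items()
        if not (root / subdir / name).is_file()
    ]


if __name__ == "__main__":
    missing = missing_model_files("ComfyUI/models")
    print("All model files found" if not missing else f"Missing: {missing}")
```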

Step 3: Launch and Verify

# Start ComfyUI
python main.py --lowvram

# Visit http://127.0.0.1:8188
# Load official workflow JSON

GGUF Loading Notes

  1. Loading node: Use the Load Diffusion Model (GGUF) node (named Unet Loader (GGUF) in the ComfyUI-GGUF custom node pack), not the standard Load Checkpoint node
  2. VRAM optimization: Add --lowvram parameter to enable VRAM optimization mode
  3. Speed expectation: GGUF inference speed is approximately 70-80% of BF16

FP8 Deployment: Balancing Speed and Quality

FP8 Version Comparison

| Feature | T5B FP8 | drbaph FP8 |
| --- | --- | --- |
| Quantization Method | Standard FP8 E4M3 | Custom FP8 |
| Quality Score | 97/100 | 95/100 |
| Speed Boost | 1.8x | 2.0x |
| VRAM Requirement | ~9GB | ~8GB |
| Community Support | Extensive | Moderate |

Deployment Steps

# Download FP8 model
# z_image_turbo_fp8.safetensors → models/unet/

# Ensure ComfyUI is updated to latest version
# FP8-supporting ComfyUI version >= 2025.11

# Start (no --lowvram needed, FP8 already optimizes VRAM)
python main.py

FP8 Optimization Tips

  1. NVFP4 experimental format: NVIDIA's 4-bit floating-point format for Blackwell GPUs, which roughly halves weight memory relative to FP8 with quality close to FP8
  2. Tensor Core acceleration: Confirm that your GPU has FP8 Tensor Cores (native on RTX 40 series) and that the driver is current
  3. Batch processing: FP8 speed advantage becomes more pronounced in batch generation
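
Whether a card has FP8 Tensor Cores can be inferred from its CUDA compute capability: Ada Lovelace (RTX 40 series) reports 8.9 and Hopper reports 9.0, while RTX 30 series cards report 8.6. The sketch below encodes that threshold; the function names are hypothetical, and detection relies on PyTorch being installed.

```python
from typing import Optional, Tuple

# Ada Lovelace (RTX 40 series) is compute capability 8.9; Hopper is 9.0.
FP8_MIN_CAPABILITY = (8, 9)


def has_native_fp8(capability: Tuple[int, int]) -> bool:
    """True if a (major, minor) compute capability implies FP8 Tensor Cores."""
    return tuple(capability) >= FP8_MIN_CAPABILITY


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if it cannot be detected."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device detected")
    else:
        print(f"Compute capability {cap}: native FP8 = {has_native_fp8(cap)}")
```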

ComfyUI Workflow Configuration

Complete Workflow JSON

ComfyUI provides an official, verified Z-Image Turbo workflow. Its key settings, summarized:

{
  "model_loader": "LoadDiffusionModelGGUF",
  "text_encoder": "qwen_3_4b.safetensors",
  "vae": "ae.safetensors",
  "sampler": "Euler",
  "steps": 8,
  "cfg": 1.0
}

Key Node Descriptions

| Node | Purpose | Recommended Settings |
| --- | --- | --- |
| Load Diffusion Model | Load UNet | Select GGUF/FP8/BF16 based on format |
| CLIP Text Encode | Prompt encoding | Positive + negative prompts |
| KSampler | Sampler | Euler, 8 steps, CFG 1.0 |
| VAEDecode | Decode latent space | Default settings |
| Save Image | Output | PNG/JPG |

Low VRAM Optimized Workflow

For 8GB and below VRAM:

  1. Launch with --lowvram, or --novram if --lowvram is not enough
  2. Use GGUF Q8 or FP8 format
  3. Lower generation resolution (start with 512x512)
  4. Use tiled VAE decode (Tile VAE Decode)

LoRA Training with Quantized Models

Important Warning

Z-Image Turbo is a distilled model with a compressed latent space, so training a LoRA directly on Turbo yields poor results. Train on Z-Image Base and use Turbo only for inference:

| Phase | Model | Purpose |
| --- | --- | --- |
| Inference | Z-Image Turbo (GGUF/FP8) | Daily generation, load LoRA |
| Training | Z-Image Base (BF16) | Train new LoRA |

Training tools:

  1. Ostris AI-Toolkit: Training tool specifically designed for Z-Image/Flux architecture
  2. Kohya_ss: Classic Stable Diffusion training tool, now adapted for Z-Image

Post-Training Usage

A trained LoRA can be loaded directly onto GGUF/FP8 Turbo models; BF16 precision is not required for inference.


Troubleshooting

Q1: GGUF loading fails with "Unsupported format"

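Solution: Update ComfyUI and your custom nodes to the latest versions. Older versions do not support loading GGUF-format UNets, and GGUF loading is typically provided by a custom node pack such as ComfyUI-GGUF, which must be installed and current.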

# Update ComfyUI
git pull
pip install -r requirements.txt

# Update all nodes
# Select "Update All" in ComfyUI Manager

Q2: Out-of-memory (OOM) errors

Solution:

  1. Lower resolution (1024x1024 → 512x512)
  2. Use --lowvram parameter
  3. Switch to GGUF Q4 format
  4. Close other GPU-consuming programs

Q3: Generation quality degradation

Solution:

  1. Check if using the correct sampler (Euler)
  2. Confirm step count (Turbo recommended: 8 steps)
  3. Set CFG to 1.0 (distilled models don't need high CFG)
  4. Consider upgrading to a higher precision format
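
The first three checks can be automated against a settings dictionary. This sketch compares a workflow's sampler settings with the Turbo recommendations from this guide (Euler, 8 steps, CFG 1.0); the function name and the dictionary layout are assumptions for illustration, not ComfyUI's workflow schema.

```python
from typing import Dict, List

# Recommended Turbo settings as stated in this guide.
RECOMMENDED: Dict[str, object] = {"sampler": "Euler", "steps": 8, "cfg": 1.0}


def check_turbo_settings(settings: Dict[str, object]) -> List[str]:
    """Return one message per setting that deviates from the recommendation."""
    problems: List[str] = []
    for key, expected in RECOMMENDED.items():
        actual = settings.get(key)
        if isinstance(expected, str):
            # Sampler names are compared case-insensitively.
            matches = isinstance(actual, str) and actual.lower() == expected.lower()
        else:
            matches = actual == expected
        if not matches:
            problems.append(f"{key}: expected {expected!r}, got {actual!r}")
    return problems


if __name__ == "__main__":
    print(check_turbo_settings({"sampler": "Euler", "steps": 20, "cfg": 7.0}))
```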

Q4: FP8 not supported on RTX 30 series

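Solution: RTX 30 series GPUs have no native FP8 Tensor Cores, but they can still load FP8 weights: the weights are stored in 8 bits and upcast to a 16-bit type on the GPU for computation, so you keep the VRAM savings but not the FP8 speedup. GGUF is the recommended format for better RTX 30 series compatibility.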


Conclusion

Z-Image's multi-precision format strategy successfully covers hardware from high-end to entry-level:

  • BF16: Professional users pursuing the highest quality
  • FP8: Best choice for daily use, perfect balance of speed and quality
  • GGUF: Lifesaver for low VRAM users, running on 8GB or even 4GB VRAM

With emerging formats like NVIDIA NVFP4, the barrier for Z-Image local deployment continues to drop. Users should choose the format that matches their hardware conditions — no need to blindly pursue the highest precision.


Keywords: z-image gguf, z-image fp8, z-image local deployment, z-image low vram, z-image quantization, z-image turbo install, z-image comfyui setup
