Introduction to ACE-Step 1.5
ACE-Step 1.5 is the latest release in the ACE-Step series of open-source multimodal large language models. Building on the architecture of its predecessors, it delivers substantial improvements in multimodal understanding while maintaining strong inference efficiency.
The model was pre-trained on large-scale image-text pairs and fine-tuned on high-quality instruction data, enabling competitive performance across multiple benchmarks while remaining fully open source and accessible to the research community.

Key Highlights
- Multimodal Capabilities: Exceptional image understanding and reasoning abilities
- Open Source: Fully available for academic and commercial use
- Efficient Inference: Optimized for both GPU and CPU deployment
- Strong Benchmarks: Competitive performance against proprietary models
Model Specifications
Architecture Overview
ACE-Step 1.5 follows a transformer-based architecture with the following key components:
| Component | Specification |
|---|---|
| Language Model Backbone | Qwen2.5-32B |
| Vision Encoder | ViT-H/14 (CLIP) |
| Projection Layer | Multi-layer Perceptron |
| Context Window | 128K tokens |
| Precision | FP16 / BF16 / INT8 |
Parameter Count
The model has approximately 35 billion parameters in total: the Qwen2.5-32B language backbone accounts for roughly 32 billion, and the vision encoder for the remaining ~3 billion.
Input Requirements
- Image Resolution: Up to 448×448 pixels
- Image Formats: JPEG, PNG, WEBP
- Text Input: Maximum 128K tokens
- Multi-turn Conversations: Fully supported
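Given the 448×448 resolution cap above, larger images should be downscaled before submission. The helper below is a minimal sketch (pure arithmetic; it assumes the runtime does not resize for you) of fitting a frame inside the limit while preserving aspect ratio:

```python
MAX_SIDE = 448  # stated maximum input resolution for ACE-Step 1.5

def fit_to_max(size: tuple[int, int], max_side: int = MAX_SIDE) -> tuple[int, int]:
    """Scale (width, height) so the longer side is at most max_side,
    preserving aspect ratio; images already within the limit are untouched."""
    w, h = size
    scale = min(1.0, max_side / max(w, h))
    return (round(w * scale), round(h * scale))

print(fit_to_max((1920, 1080)))  # (448, 252)
print(fit_to_max((300, 300)))    # (300, 300)
```

The resulting size can then be passed to any image library's resize call (for example Pillow's `Image.resize`).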
Performance Benchmarks
ACE-Step 1.5 has been evaluated across multiple standard benchmarks, demonstrating competitive performance:
Vision-Language Benchmarks
| Benchmark | ACE-Step 1.5 | GPT-4o | Gemini 1.5 Pro |
|---|---|---|---|
| MME Score | 2158.9 | 2201.3 | 2189.7 |
| MM-Bench | 82.4 | 84.1 | 83.0 |
| SEED-Bench | 75.8 | 77.2 | 76.5 |
| MathVista | 65.3 | 68.9 | 67.1 |
Reasoning Capabilities
The model excels in complex reasoning tasks:
- Visual Question Answering: Accurately answers questions about images
- Chart/Graph Understanding: Interprets complex visual data
- Document Processing: Reads and understands text in images
- Multi-image Reasoning: Compares and reasons across multiple images
Hardware Requirements
Minimum Requirements
For basic inference with quantized models:
| Component | Minimum |
|---|---|
| CPU | 8 cores (Intel i5 / AMD Ryzen 5) |
| RAM | 16 GB |
| GPU | NVIDIA RTX 3060 (12GB VRAM) |
| Storage | 20 GB |
Recommended Configuration
For optimal performance:
| Component | Recommended |
|---|---|
| CPU | 16 cores (Intel i7 / AMD Ryzen 7) |
| RAM | 32 GB or more |
| GPU | NVIDIA RTX 4090 (24GB VRAM) or RTX 3090 (24GB VRAM) |
| Storage | 50 GB SSD |
GPU Memory Requirements
| Mode | VRAM Requirement |
|---|---|
| FP16 Inference | 24-32 GB |
| BF16 Inference | 32 GB |
| INT8 Quantized | 12-16 GB |
| INT4 Quantized | 8-12 GB |
Running on Limited Hardware
ACE-Step 1.5 supports various quantization techniques for deployment on resource-constrained devices:
- GGUF Format: Available in Q4_K_M, Q5_K_M, Q8_0 quantizations
- AWQ Format: 4-bit quantized weights
- Bitsandbytes: 8-bit and 4-bit quantization
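As a concrete example of the bitsandbytes route, the snippet below sketches 4-bit loading through transformers' `BitsAndBytesConfig`. This is a configuration sketch under the assumption that the checkpoint loads through the standard `from_pretrained` path; it is not taken from official ACE-Step documentation.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization via bitsandbytes: weights are stored in 4 bits,
# while matrix multiplications are computed in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "ACE-Step/Ace-Step1.5",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```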
Installation and Setup
Prerequisites
- Python 3.10 or higher
- PyTorch 2.0 or higher
- CUDA 11.8 or higher (for GPU acceleration)
Installation Methods
Method 1: Using pip
```shell
pip install transformers accelerate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
Method 2: Using Docker
```shell
docker run -it --gpus all ghcr.io/ace-step/ace-step-1.5:latest
```
Quick Start
```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "ACE-Step/Ace-Step1.5",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Load processor (tokenizes the text and preprocesses the image;
# a plain tokenizer cannot consume an image path)
processor = AutoProcessor.from_pretrained(
    "ACE-Step/Ace-Step1.5",
    trust_remote_code=True,
)

# Prepare inputs
prompt = "Describe this image in detail"
image = Image.open("path/to/image.jpg")

# Generate response
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```
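One detail worth knowing about the snippet above: `generate()` returns the prompt tokens followed by the newly generated ones, so decoding the full sequence echoes the prompt back. A minimal sketch of slicing off the prompt, with plain lists standing in for token-id tensors:

```python
# Toy stand-ins for tokenized tensors: 3 prompt tokens, then the completion.
prompt_ids = [101, 2023, 102]
output_ids = [101, 2023, 102, 7592, 2088, 999]

# Keep only the tokens generated after the prompt.
new_ids = output_ids[len(prompt_ids):]
print(new_ids)  # [7592, 2088, 999]
```

With real tensors the same idea reads `outputs[0][inputs["input_ids"].shape[-1]:]` before decoding.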
Usage Examples
Image Description
```python
# Generate a detailed image description
prompt = "Please describe this image in detail, including objects, scenes, and any notable details."
```
Visual Question Answering
```python
# Answer questions about an image
prompt = "What is the main subject of this image? Provide a detailed explanation."
```
Chart and Graph Analysis
```python
# Analyze charts and graphs
prompt = "Analyze this chart and explain the key trends and insights it reveals."
```
Multi-image Comparison
```python
# Compare multiple images
prompt = "Compare these two images and identify the key differences between them."
```
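For multi-image prompts like the comparison above, many transformers multimodal models expect a structured message list rather than a flat string. The shape below is a hypothetical sketch modeled on common chat-template conventions, not ACE-Step's documented input format:

```python
# Hypothetical chat-style input: both images plus the comparison prompt
# in a single user turn.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "before.jpg"},
            {"type": "image", "image": "after.jpg"},
            {"type": "text", "text": "Compare these two images and identify the key differences between them."},
        ],
    }
]

# Count how many images this turn carries.
n_images = sum(1 for part in messages[0]["content"] if part["type"] == "image")
print(n_images)  # 2
```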
Best Practices
Prompt Engineering
- Be Specific: Clear, detailed prompts yield better results
- Use Context: Provide relevant background information
- Step-by-Step: Break complex tasks into smaller steps
- Format Requirements: Specify desired output format
Performance Optimization
- Use Quantization: INT8 or INT4 for faster inference
- Batch Processing: Process multiple images together when possible
- GPU Selection: Higher VRAM allows larger batch sizes
- Memory Management: Monitor VRAM usage with `nvidia-smi`
Use Cases
ACE-Step 1.5 is suitable for various applications:
1. Content Creation
- Automated image description generation
- Visual content analysis for social media
- Accessibility image descriptions
2. Education
- Educational content creation
- Visual learning materials
- STEM education support
3. Business
- Document processing and analysis
- Quality control in manufacturing
- Customer support image analysis
4. Research
- Scientific image analysis
- Data visualization interpretation
- Multimodal research studies
Comparison with Similar Models
ACE-Step vs. Other Open-Source Models
| Model | Parameters | Vision Capabilities | License |
|---|---|---|---|
| ACE-Step 1.5 | 32B | Excellent | Apache 2.0 |
| LLaVA-1.6 | 7B | Good | MIT |
| IDEFICS-2 | 8B | Very Good | Apache 2.0 |
| Pixtral | 12B | Good | Apache 2.0 |
Resources and Community
Official Resources
- GitHub Repository: https://github.com/ACE-Step
- Hugging Face: https://huggingface.co/ACE-Step/Ace-Step1.5