ERNIE-Image vs Z-Image vs Qwen-Image-2: Which Open-Source Text-to-Image Model Performs Better? (Realism / Text Rendering / Natural Scenery / Aspect Ratio Compatibility / VRAM Usage / Inference Speed)

Baidu recently open-sourced the text-to-image model ERNIE-Image, and ComfyUI added support for it very quickly. According to the official introduction, the model has the following strengths:

Built-in prompt enhancement: it can expand short user input into richer structured descriptions
Text rendering: it performs especially well on dense, long-form text and layout-sensitive generation tasks, making it suitable for posters, infographics, UI images, and other text-heavy visual content
Strong instruction following
Broad style coverage

Conclusion

For complex text rendering and poster generation: Qwen-Image-2 > ERNIE-Image > Z-Image
For photorealistic images, Z-Image is the better choice when realism and prompt adherence are weighed together
For landscape generation: Qwen-Image-2 > Z-Image > ERNIE-Image
At some aspect ratios, ERNIE-Image produces distorted limbs, so its aspect-ratio compatibility is slightly weaker than Z-Image's
VRAM usage: Z-Image (16 GB) needs less than ERNIE-Image (21 GB)
Inference speed: for a 1280×1280 image, Z-Image (8 s) is faster than ERNIE-Image (12 s)
Prompt enhancement: ERNIE-Image's built-in prompt enhancement was still weaker than qwen3.5 in this test
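
To put the VRAM and speed figures above on a common scale, the short sketch below converts them into relative overheads. It uses only the four numbers reported in this list. The dictionary layout and the helper name relative_overhead are illustrative, and the ratios describe this single test run rather than a general benchmark.

```python
# Figures reported in the conclusion (one 1280x1280 image per model).
reported = {
    "Z-Image": {"vram_gb": 16, "seconds": 8},
    "ERNIE-Image": {"vram_gb": 21, "seconds": 12},
}


def relative_overhead(metric, baseline="Z-Image", other="ERNIE-Image"):
    """Return how much larger `other` is than `baseline`, as a fraction."""
    base = reported[baseline][metric]
    return (reported[other][metric] - base) / base


vram_overhead = relative_overhead("vram_gb")  # (21 - 16) / 16
time_overhead = relative_overhead("seconds")  # (12 - 8) / 8

print(f"ERNIE-Image uses {vram_overhead:.0%} more VRAM than Z-Image")
print(f"ERNIE-Image takes {time_overhead:.0%} longer per image")
```

In other words, in this test ERNIE-Image needed roughly a third more VRAM and half again as much time per image.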

Visual Results

Text Rendering

Prompt:

This is a Chinese hand-painted bilingual travel poster for a two-day Zen and cultural trip in Hangzhou. The overall design uses a light beige antique rice-paper background, with traditional geometric border patterns in the four corners. A flowing cloud-pattern ribbon scroll runs through the center to connect the two-day itinerary. At the top is the main title “Hangzhou: A Two-Day Journey of Zen, Culture, and Humanity,” with the subtitle “Prayer · Landscape · Dream-Seeking.” On the left is “Day 1: Praying at Ling Shan, Ascending for Prosperity,” followed by: “07:30 Arrive at Lingyin Temple,” with an image of the Lingyin Temple gate (plaque reading ‘灵隐寺’) and incense smoke rising from a censer, plus the caption “Go to Lingyin Temple to fulfill a vow, offer incense, and pray sincerely”; “10:30 Explore Yongfu Temple’s Serenity,” with an image of an ancient temple hidden among lush old trees, plus the caption “The most beautiful temple, serene with Song charm”; “12:00 Vegetarian Meal & Rest,” with an image of a steaming bowl of vegetarian noodles and a small tea cup on a bamboo tray; “16:00 Tea Tasting at Longjing,” with layered green tea fields and tea being poured from a purple clay pot into a celadon cup, plus the caption “Leisurely tea tasting, Tea Garden Meijiawu.” On the right is “Day 2: Ink-Wash West Lake, Dreams of the Southern Song,” followed by: “09:00 Boat Tour on West Lake,” with an image of a black-awning boat on the lake and the reflected Three Pools Mirroring the Moon pagodas, plus the caption “Boating to view the Three Pools Mirroring the Moon”; “12:00 Lakeside Lunch,” with the caption “Experience Lou Wai Lou Restaurant,” accompanied by an image of a glossy braised fish covered in sauce; “14:00 Su Causeway / Yuhu Bay,” with an arch bridge crossing green water and weeping willows, plus the caption “Stroll along the causeway or discover hidden gems.” At the bottom is a “Travel Tips” section with a lightbulb icon and three tips: “Accommodation: Longxiang Bridge / Fengqi Road for convenience,” “Transport: Metro + bike is optimal,” and “Season: Dress warmly in early spring,” each with corresponding icons for bed, bike + metro, and snowflake + cherry blossom. All text should use a calligraphy-style regular script. Chinese and English should be strictly aligned, with Chinese above and English directly below. The overall layout should feel balanced, spacious, elegant, and full of literati painting atmosphere and Zen lifestyle aesthetics.

ERNIE-Image:

Qwen-Image-2:

Z-Image:

Conclusion:

Qwen-Image-2 remains outstanding at text rendering and handles this extremely complex prompt very well
Ranking: Qwen-Image-2 > ERNIE-Image > Z-Image

Realism

Prompt:

A Chinese female college student, around 20 years old. She has a very short haircut with a soft artistic touch, with loose strands naturally falling over part of her cheek. Her overall style leans toward a tomboy temperament. She has cool fair skin, delicate facial features, and an expression that is slightly shy yet a little defiant, with one corner of her mouth subtly tilted upward, giving off a mischievous but youthful charm. She is wearing an off-shoulder short-sleeve top, exposing one shoulder, with a proportionate figure. The composition is a close-up selfie, with the subject occupying the main focus. The dorm room background is clearly visible: an upper bunk with white bedding, a neatly arranged desk, and a wooden cabinet with drawers. The whole image should look like it was shot on a phone, with soft, even ambient light, natural and realistic colors, a bright and clear image, and a lively everyday youthful atmosphere.

ERNIE-Image:

Z-Image:

Conclusion:

ERNIE-Image and Z-Image are close in realism
Z-Image follows the prompt more closely than ERNIE-Image. For example, the prompt asked for one exposed shoulder, but ERNIE-Image exposed both. The prompt also asked for an expression that is “slightly shy yet a little defiant, with one corner of her mouth subtly tilted upward,” and Z-Image handled that part better.

Natural Texture

Prompt:

A jade-green river winds through a lush canyon. The rocky cliffs on both sides are covered with thick moss and dense ferns. Several waterfalls cascade from above, with mist lingering in the air. Noon sunlight filters through the dense canopy and creates dappled shimmering highlights on the river surface. The overall atmosphere is fresh, humid, and full of the vitality of a primeval rainforest. No people, text, or man-made traces appear in the image.

ERNIE-Image:

Qwen-Image-2:

Z-Image:

Conclusion: In my view, Qwen-Image-2 is still slightly ahead; its output looks almost as if a landscape-enhancing filter had been applied. Z-Image comes next. ERNIE-Image has slightly better color than Z-Image, but its highlights on the water look somewhat artificial. Qwen-Image-2 and Z-Image also have better overall composition than ERNIE-Image.

Grand Scene Rendering

Prompt:

“Cyberpunk style,” underworld background, eerie. In the divine world, violent winds howl, space twists and distorts, and a gigantic, blurry-faced Tang Sanzang radiates light, with the imposing aura of a divine 法天象地 (Fa Tian Xiang Di, a colossal heaven-and-earth form) manifestation and a sacred solemn presence. With hands clasped, he looks down at the tiny swarm of ghosts below, emitting a faint murderous intent. The wind sweeps fallen leaves, as if everything in front of him is about to be destroyed. Dark, dreamlike, epic, oppressive, smoggy, low saturation, low brightness, giant figure, megalophobia, bird’s-eye ultra-wide angle, oppressive composition, cinematic framing, strong visual impact, 8K quality, ultra clear, extreme detail, master-level artwork, sharp close-up, infinite detail, maximalism.

ERNIE-Image:

Z-Image:

Conclusion: In this scene, Z-Image is better than ERNIE-Image, whose output starts to look overly glossy (“greasy”) and over-rendered.

Arbitrary Aspect-Ratio Photos

Prompt:

A Chinese female college student, around 20 years old. She has a very short haircut with a soft artistic touch, with loose strands naturally falling over part of her cheek. Her overall style leans toward a tomboy temperament. She has cool fair skin, delicate facial features, and an expression that is slightly shy yet a little defiant, with one corner of her mouth subtly tilted upward, giving off a mischievous but youthful charm. She is wearing an off-shoulder short-sleeve top, exposing one shoulder, with a proportionate figure. The composition is a close-up selfie, with the subject occupying the main focus. The dorm room background is clearly visible: an upper bunk with white bedding, a neatly arranged desk, and a wooden cabinet with drawers. The whole image should look like it was shot on a phone, with soft, even ambient light, natural and realistic colors, a bright and clear image, and a lively everyday youthful atmosphere.

ERNIE-Image:

Z-Image:

Conclusion: At an aspect ratio that does not suit it, ERNIE-Image produces visible distortions, so its aspect-ratio compatibility is slightly weaker than Z-Image's.

Model Downloads

text_encoders

Download the file https://huggingface.co/Comfy-Org/ERNIE-Image/resolve/main/text_encoders/ministral-3-3b.safetensors and place it under ComfyUI/models/text_encoders/

diffusion_models

ERNIE-Image provides two diffusion models: an SFT model and a Turbo model. According to the official description, the Turbo model is faster and aesthetically stronger, so this article tests the Turbo model first.

Download the file https://huggingface.co/Comfy-Org/ERNIE-Image/resolve/main/diffusion_models/ernie-image-turbo.safetensors and place it under ComfyUI/models/diffusion_models/

vae

Download the file https://huggingface.co/Comfy-Org/ERNIE-Image/resolve/main/vae/flux2-vae.safetensors and place it under ComfyUI/models/vae/
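
The three downloads above can also be scripted. The following is a minimal sketch using only the Python standard library. The URLs and target folders are the ones listed in this section, while the helper names plan_downloads and download_all, the temporary ".part" suffix, and the ComfyUI root path you pass in are illustrative assumptions. The script does not handle authentication or resumable transfers.

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://huggingface.co/Comfy-Org/ERNIE-Image/resolve/main"

# Repo-relative paths from this section; each one is mirrored under ComfyUI/models/.
MODEL_FILES = [
    "text_encoders/ministral-3-3b.safetensors",
    "diffusion_models/ernie-image-turbo.safetensors",
    "vae/flux2-vae.safetensors",
]


def plan_downloads(comfyui_root):
    """Return (url, destination) pairs without touching the network."""
    models_dir = Path(comfyui_root) / "models"
    return [(f"{BASE_URL}/{rel}", models_dir / rel) for rel in MODEL_FILES]


def download_all(comfyui_root):
    """Fetch every file that is not already present in its target folder."""
    for url, dest in plan_downloads(comfyui_root):
        if dest.exists():
            print(f"skip (already present): {dest}")
            continue
        dest.parent.mkdir(parents=True, exist_ok=True)
        # Download to a temporary name so an interrupted transfer
        # is not mistaken for a complete file on the next run.
        tmp = dest.with_name(dest.name + ".part")
        print(f"downloading {url}")
        urlretrieve(url, tmp)
        tmp.replace(dest)


# Example call (assumption: adjust the path to your own ComfyUI checkout):
# download_all("ComfyUI")
```

Calling plan_downloads("ComfyUI") first is a cheap way to check the destination paths before starting the multi-gigabyte transfers.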
