Image-Guided Editing Models - Benchmarks

Overview

In this benchmark we evaluate image editing models on their ability to perform image-guided transformations while preserving identity-defining details from an input image. This benchmark requires models to integrate information from:

An input image: a "fit pic" containing the subject of interest within its broader outfit context.
A query image: an e-commerce-style flat-lay containing the desired pose, composition, or spatial configuration of the output.
A text prompt guiding the edit (templated, serving to introduce both images).

Goal: Generate an image that combines the garment from the input with the pose/composition of the query image while preserving identity-defining details such as text, logos, printed graphics, and complex patterns.

Fashion garment transfer workflow diagram

Figure 1: The image editing task - transferring garment identity with fine-grained attribute preservation

Results

Overall Winner

Nano Banana Pro 8 / 12

Best average performance across all four tasks

Runner Up

Nano Banana 7 / 12

Strong consistency with minor misses on multi-image

Honorable mention

GPT Image-1

Standout performance on multi-image task

Model	Total (/12)	Graphic Reconstruction	Pattern Reconstruction	Small Segment	Multi Image
Nano Banana Pro (Winner)	8	3/3	3/3	1/3	1/3
Nano Banana (Runner Up)	7	3/3	2/3	1/3	1/3
GPT Image-1	4	0/3	1/3	0/3	3/3
Seedream 4	3	1/3	1/3	0/3	1/3
Qwen	2	2/3	0/3	0/3	0/3
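Each total in the table is the sum of the per-task pass counts (out of 3 runs per task, 12 overall). As a minimal sketch, assuming the scores are transcribed by hand from the table, the ranking and per-task leaders can be recomputed as follows. The names `SCORES`, `leaderboard`, and `task_winners` are illustrative and not part of the benchmark's tooling.

```python
# Per-task pass counts (out of 3 runs each), transcribed from the results table.
# Column order: Graphic Reconstruction, Pattern Reconstruction, Small Segment, Multi Image.
TASKS = ["Graphic Reconstruction", "Pattern Reconstruction", "Small Segment", "Multi Image"]
SCORES = {
    "Nano Banana Pro": [3, 3, 1, 1],
    "Nano Banana": [3, 2, 1, 1],
    "GPT Image-1": [0, 1, 0, 3],
    "Seedream 4": [1, 1, 0, 1],
    "Qwen": [2, 0, 0, 0],
}


def leaderboard(scores):
    """Return (model, total) pairs sorted by total passes, highest first."""
    totals = {model: sum(passes) for model, passes in scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def task_winners(scores, tasks):
    """Map each task name to the list of models with the most passes on it."""
    winners = {}
    for index, task in enumerate(tasks):
        best = max(passes[index] for passes in scores.values())
        winners[task] = [m for m, passes in scores.items() if passes[index] == best]
    return winners


if __name__ == "__main__":
    for model, total in leaderboard(SCORES):
        print(f"{model}: {total} / 12")
    for task, models in task_winners(SCORES, TASKS).items():
        print(f"{task}: {', '.join(models)}")
```

Splitting the tally by task makes the pattern in the table explicit: the overall ranking is driven by the first two tasks, while the multi-image task has a different leader.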

These benchmarks stress true one-shot performance, with a bias towards consistency over best possible outcome. Each model must hit quality targets with minimal inputs. Across Tasks 1–3, Nano Banana Pro consistently demonstrates strong one-shotting, delivering reliable outputs with limited context.

When holistically evaluating the performance gap between Nano Banana and Nano Banana Pro, the differences are negligible. Although cost and latency are outside the scope of this study, Nano Banana Pro is unlikely to see production use because it is significantly worse than Nano Banana on both.

Future work should deepen few-shot tasks; early signals suggest GPT Image-1 may benefit from richer input sets, and expanded multi-image tests could surface that advantage. Such tasks would also better reflect how image editing models are used in the Springus App.

Graphic Reconstruction

Our first benchmark task tests the models' ability to preserve text and graphic details when transferring a garment to a new pose.

Input Image

Query Image

Prompt

Using the t-shirt in the outfit image, render it in the layout, pose, and background of the query image. Preserve all visible text, graphics, and patterns from the outfit image. Keep all structural cues from the query image unchanged.
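The prompt above follows the same template in every task of this benchmark, with only the garment name, the pronoun, and (for the multi-image task) the plural "outfit images" changing. The sketch below shows one way such a template could be written. The function name `build_prompt` and its parameters are hypothetical, since the benchmark's actual templating code is not shown.

```python
def build_prompt(garment: str, plural: bool = False, multi_input: bool = False) -> str:
    """Fill the shared prompt template for one task.

    garment:     the item to transfer, e.g. "t-shirt", "pants", "shoes".
    plural:      True when the garment noun is plural ("render them" vs "render it").
    multi_input: True when several outfit images are supplied (multi-image task).
    """
    pronoun = "them" if plural else "it"
    source = "outfit images" if multi_input else "outfit image"
    return (
        f"Using the {garment} in the {source}, render {pronoun} in the layout, "
        "pose, and background of the query image. "
        "Preserve all visible text, graphics, and patterns from the outfit image. "
        "Keep all structural cues from the query image unchanged."
    )


if __name__ == "__main__":
    print(build_prompt("t-shirt"))
    print(build_prompt("pants", plural=True))
    print(build_prompt("t-shirts", plural=True, multi_input=True))
```

Keeping the template in one place limits per-task differences to the garment wording, which is what the "same prompt across all models" control relies on.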

Nano Banana Pro

3/3

Pros

Flawless text preservation and graphic retention
Impeccable pose matching to query image
Clean background transfer with no artifacts
Natural lighting and shadows

Cons

Minor edge sharpening visible on one run

Nano Banana

3/3

Pros

Perfect text preservation and graphic retention
Perfect pose matching

Cons

Graphic size and placement off in one run

Qwen

2/3

Pros

Perfect graphic reconstruction
Stable colour preservation across runs

Cons

Poor brand text reconstruction
Graphic placement off on one attempt

Seedream 4

1/3

Pros

Good graphic reconstruction
Strong colour match

Cons

Text font doesn't match on most runs

GPT Image-1

0/3

Pros

Fair graphic reconstruction in isolated regions

Cons

Colours don't match
Graphic size off
Brand text illegible

Nano Banana Pro, Nano Banana, and Qwen are roughly evenly matched here. Logo reconstruction is perfect across all of their runs. The smaller (and less important) brand logo is the only differentiator.

Pattern Reconstruction

Our second benchmark task tests the models' ability to retain patterns while transferring a garment into a new pose.

Input Image

Query Image

Prompt

Using the pants in the outfit image, render them in the layout, pose, and background of the query image. Preserve all visible text, graphics, and patterns from the outfit image. Keep all structural cues from the query image unchanged.

Nano Banana Pro

3/3

Pros

Flawless pattern preservation and detail retention
Strong pose matching to query image
Clean background transfer with no artifacts

Cons

Minor fabric stiffness on one variation

Nano Banana

2/3

Pros

Flawless pattern preservation and detail retention
Strong pose matching to query image
Clean background transfer with no artifacts

Cons

Waist crease slightly softened in one run

Qwen

0/3

Pros

Strong pattern match
Good color fidelity

Cons

Waist style not matching
Graphic hallucinations

Seedream 4

1/3

Pros

Good lighting and shadows
Strong pattern matching

Cons

Pose doesn't match query image input

GPT Image-1

1/3

Pros

Fair pattern matching
Strong representation of input image

Cons

Waist style not matching (hallucinated waistband)

Once again, Nano Banana Pro and Nano Banana excel. The failure modes are interesting here: Qwen and Seedream 4 both retain the distinct blotch on the upper left leg, yet both hallucinate larger, more important details such as fly or pocket placement.

Small Segment Enhancement

Our third benchmark task tests the models' ability to reconstruct small but detailed objects that occupy minimal image space while preserving identity and fine details. As a side task, we also examine some elements of world modeling, because some of the sections of interest in the query image aren't visible in the input image.

Input Image

Query Image

Prompt

Using the shoes in the outfit image, render them in the layout, pose, and background of the query image. Preserve all visible text, graphics, and patterns from the outfit image. Keep all structural cues from the query image unchanged.

Nano Banana Pro

1/3

Pros

Fantastic detail enhancement; brand logo displayed despite not being visible in the input
Fair Jibbitz reconstruction
Strong pose matching
Incredible display of world knowledge: accurate under-shoe details despite being invisible in the input

Cons

Minor sole over-sharpening on one run

Nano Banana

1/3

Pros

Accurate pose matching
Fair Jibbitz reconstruction

Cons

Incorrect under-shoe details on one variation

Qwen

0/3

Cons

Complete hallucination. Input not visible in output

Seedream 4

0/3

Pros

Accurate shoe structure
Identifiable result

Cons

"Sport mode" strap hallucination
Inaccurate pose

GPT Image-1

0/3

Cons

Complete hallucination. Input not visible in output

This task borders on unfair: we are testing the models on some zero-shot elements by looking for the Crocs sole in the output. It's remarkable how strong the passing output of Nano Banana Pro is. Not only does it get the texture of the bottom of the shoes right, but also the logo and its size and placement. It's also worth noting that its other failures are due to query misalignment rather than hallucination, which is what we see in all other outputs.

Multi Image Reconstruction

Our fourth benchmark task tests the models' ability to reconstruct multiple images simultaneously, preserving consistency across inputs. This is the only example of few-shot inference in this benchmark.

Input Image 1

Query Image

Prompt

Using the t-shirts in the outfit images, render them in the layout, pose, and background of the query image. Preserve all visible text, graphics, and patterns from the outfit image. Keep all structural cues from the query image unchanged.

Nano Banana Pro

1/3

Pros

Accurate and consistent colour matching
Consistent text spacing

Cons

Inconsistent text reconstruction between inputs
Inconsistent shirt size across runs

Nano Banana

1/3

Pros

Consistently matches query image sizing and pose
Failure cases are less visibly jarring

Cons

Inconsistent coloring on one attempt

Qwen

0/3

Pros

Incredibly consistent text reconstruction

Cons

Poor core prompt alignment

Seedream 4

1/3

Pros

Incredibly consistent text reconstruction

Cons

Poor prompt alignment
Hallucinations of shirt damage

GPT Image-1

3/3

Pros

Incredibly consistent text reconstruction
Consistent sizing
Consistent font

This task shows a major shortcoming of the Nano Banana family. Text reconstruction given multiple angles seems like a relatively easy task, yet both Nano Banana models fail at it. This hints that they might struggle with tasks that benefit from few-shot reasoning. More testing is needed to confirm this.

Controlled Variables

Our benchmark ensures fair comparison across all models:

Same prompt across all models (minimal mechanical adjustments)
Best of 3 generations - We generate 3 outputs per model and show the best result
Same resolution (1024×1024)
Same format (JPEG, no additional compression)
Same aspect ratio (square)
Same test images for fair comparison
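The controls listed above can be pinned down in a small configuration object so that every model run is checked against the same settings. This is a minimal sketch under stated assumptions: `RunConfig` and `check_output` are hypothetical names, and the benchmark's real harness is not shown in this write-up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Controlled variables shared by every model run (values from the list above)."""
    width: int = 1024
    height: int = 1024
    image_format: str = "JPEG"
    generations_per_model: int = 3

    @property
    def aspect_ratio(self) -> float:
        # 1.0 corresponds to the square aspect ratio used in the benchmark.
        return self.width / self.height


def check_output(width: int, height: int, image_format: str,
                 config: RunConfig = RunConfig()) -> list:
    """Return a list of human-readable violations; an empty list means compliant."""
    problems = []
    if (width, height) != (config.width, config.height):
        problems.append(
            f"resolution {width}x{height} != {config.width}x{config.height}"
        )
    if image_format.upper() != config.image_format:
        problems.append(f"format {image_format} != {config.image_format}")
    return problems


if __name__ == "__main__":
    print(check_output(1024, 1024, "JPEG"))
    print(check_output(512, 1024, "PNG"))
```

Making the configuration immutable (`frozen=True`) guards against one model accidentally being run with different settings from the rest.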

References & Acknowledgments

This benchmark builds on the excellent work by Shaun Pedicini in the original GenAI Image Editing Showdown, summarized by Simon Willison.

Model providers:

ByteDance (Seedream 4)
Google (Gemini 2.5 Flash)
Qwen Team (Qwen-Image-Edit-Plus)
Black Forest Labs (FLUX.1 Kontext)
OmniGen Community
OpenAI (gpt-image-1)

Platform: Replicate for unified model access
