Image-Guided Editing Models - Benchmarks
Overview
In this benchmark we evaluate image editing models on their ability to perform image-guided transformations while preserving identity-defining details from an input image. This benchmark requires models to integrate information from:
- An input image a "fit pic" containing the subject of interest within its broader outfit context.
- A query image a "e-commerce style flat-lay" containing the desired pose, composition, or spatial configuration of the output
- A text prompt guiding the edit (templated, serving to introduce both images)
Goal: Generate an image that combines the garment from the input with the pose/composition of the query image while preserving identity-defining details such as text, logos, printed graphics, and complex patterns.
Figure 1: The image editing task - transferring garment identity with fine-grained attribute preservation
Results
| Model | Total ( /12) | Graphic Reconstruction | Pattern Reconstruction | Small Segment | Multi Image |
|---|---|---|---|---|---|
| Nano Banana Pro Winner | 8 | 3/3 | 3/3 | 1/3 | 1/3 |
| Nano Banana Runner Up | 7 | 3/3 | 2/3 | 1/3 | 1/3 |
| GPT Image-1 | 4 | 0/3 | 1/3 | 0/3 | 3/3 |
| Seedream 4 | 3 | 1/3 | 1/3 | 0/3 | 1/3 |
| Qwen | 2 | 2/3 | 0/3 | 0/3 | 0/3 |
These benchmarks stress true one-shot performance with a bias towards consistancy over best possible outcome. Each model must hit quality targets with minimal inputs. Across Tasks 1–3, Nano Banana Pro consistently demonstrates strong one-shotting, delivering reliable outputs with limited context.
When holistically evaluating the performance gap between the Nano Banana and Nano Banana Pro model, the differences are negligible. While outside the scope of this study, Nano Banana Pro is unlikely to see usage in prod due to its significantly worse cost and latency when compared to Nano Banana.
Future work should deepen few-shot tasks; early signals suggest GPT Image-1 may benefit from richer input sets, and expanded multi-image tests could surface that advantage. Such tasks would also be a better reflection of image editing models in the Springus App.
Graphic Reconstruction
Pros
- Flawless text preservation and graphic retention
- Impeccable pose matching to query image
- Clean background transfer with no artifacts
- Natural lighting and shadows
Cons
- Minor edge sharpening visible on one run
Pros
- Perfect text preservation and graphic retention
- Perfect pose matching
Cons
- Graphic size and placement off in one run
Pros
- Perfect graphic reconstruction
- Stable colour preservation across runs
Cons
- Poor brand text reconstruction
- Graphic placement off on one attempt
Pros
- Good graphic reconstruction
- Strong colour match
Cons
- Text font doesn't match on most runs
Pros
- Fair graphic reconstruction in isolated regions
Cons
- Colours don't match
- Graphic size off
- Brand text illegible
Nano Banana Pro, Nano Banana and Qwen are all roughly evenly matched here. Logo reconstruction is perfect across all of theirs runs. The smaller (and less important) brand logo is the only differentiator here.
Pattern Reconstruction
Pros
- Flawless pattern preservation and detail retention
- Strong pose matching to query image
- Clean background transfer with no artifacts
Cons
- Minor fabric stiffness on one variation
Pros
- Flawless pattern preservation and detail retention
- Strong pose matching to query image
- Clean background transfer with no artifacts
Cons
- Waist crease slightly softened in one run
Pros
- Strong pattern match
- Good color fidelity
Cons
- Waist style not matching
- Graphic hallucinations
Pros
- Good lighting and shadows
- Strong pattern matching
Cons
- Pose doesn't match query image input
Pros
- Fair pattern matching
- Strong representation of input image
Cons
- Waist style not matching (hallucinated waistband)
Once again, Nano Banana Pro, Nano Banana excel. Failure modes are interesting to note here, Qwen and Seedream 4 both retain the distinct blotch on the upper left leg yet both hallucinate larger important details like fly or pocket placement.
Small Segment Enhancement
Pros
- Fantastic detail enhancement, brand logo displayed despite not being visible
- Fair jibbit reconstruction
- Strong pose matching
- Incredible display of world knowledge—accurate under-shoe details despite being invisible
Cons
- Minor sole over-sharpening on one run
Pros
- Accurate pose matching
- Fair jibbit reconstruction
Cons
- Incorrect under shoe details on one variation
Cons
- Complete hallucination. Input not visible in output
Pros
- Accurate shoe structure
- Identifiable result.
Cons
- "Sport mode" strap hallucination
- Inaccurate pose
Cons
- Complete hallucination. Input not visible in output
This task borders on unfair. We're testing the model on some zero-shot elements by looking for the crocs sole in the output. Its remarkable how strong the passing output of Nano Banana Pro is. Not only does it get the texture of the bottom of the shoes right, but both the logo and size placement. It's worth noting too that it's other failure modes are due to query misalignment rather than any sort of hallucination (As is the case with all other outputs).
Multi Image Reconstruction
Pros
- Accurate and consistent colour matching
- Consistent text spacing
Cons
- Inconsistent text reconstruction between inputs
- Inconsistent shirt size across runs
Pros
- Consistent matching query image sizing and pose
- Failure cases are less visibly jarring
Cons
- Inconsistent coloring on one attempt
Pros
- Incredibly consistent text reconstruction
Cons
- Core prompt alignment
Pros
- Incredibly consistent text reconstruction
Cons
- Poor prompt alignment
- Hallucinations of shirt damage
Pros
- Incredibly consistent text reconstruction
- Consistent sizing
- Consistent font
This task shows a major shortcoming of the Nano Banana family. Text reconstruction given multiple angles seems like a relatively easy task for the models, but the Nano Banana family fails to do so. This hints they might struggle with tasks that benefit from few-shot reasoning. More testing is needed to confirm this.
Controlled Variables
Our benchmark ensures fair comparison across all models:
- Same prompt across all models (minimal mechanical adjustments)
- Best of 3 generations - We generate 3 outputs per model and show the best result
- Same resolution (1024×1024)
- Same format JPEG (no compression)
- Same aspect ratio (square)
- Same test images for fair comparison
References & Acknowledgments
This benchmark builds on the excellent work by Shaun Pedicini in the original GenAI Image Editing Showdown, summarized by Simon Willison.
Model providers: - ByteDance (Seedream 4) - Google (Gemini 2.5 Flash) - Qwen Team (Qwen-Image-Edit-Plus) - Black Forest Labs (FLUX.1 Kontext) - OmniGen Community - OpenAI (gpt-image-1)
Platform: Replicate for unified model access