Guiding Diffusion Models with Adaptive Negative Sampling Without External Resources


UC San Diego

Overview


Diffusion models (DMs) have demonstrated an unparalleled ability to create diverse and high-fidelity images from text prompts. However, they are also well known to vary substantially in both prompt adherence and quality. Negative prompting was introduced to improve prompt compliance by specifying what an image must not contain. Previous works have shown the existence of an ideal negative prompt that maximizes the odds of the positive prompt. In this work, we explore the relationship between negative prompting and classifier-free guidance (CFG) to develop a sampling procedure, Adaptive Negative Sampling Without External Resources (ANSWER), that accounts for both positive and negative conditions from a single prompt. This leverages the diffusion model's internal understanding of negation to increase the odds of generating images faithful to the prompt. ANSWER is a training-free technique, applicable to any model that supports CFG, and allows for negative grounding of image concepts without explicit negative prompts, which are lossy and incomplete. Experiments show that adding ANSWER to existing DMs outperforms the baselines on multiple benchmarks, and its outputs are preferred by humans twice as often as those of the other methods.
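As background, classifier-free guidance extrapolates the model's noise prediction away from an unconditional (empty-prompt) embedding, and negative prompting replaces that embedding with one describing what the image must not contain. Below is a minimal sketch of this shared update rule, assuming a standard epsilon-prediction DM; `model`, `x_t`, and the embedding arguments are generic placeholders, not the paper's API:

```python
def cfg_step(model, x_t, t, pos_emb, neg_emb, guidance_scale=7.5):
    """One guided noise prediction.

    Standard CFG passes the unconditional (empty-prompt) embedding as
    `neg_emb`; negative prompting substitutes the embedding of what the
    image must NOT contain.
    """
    eps_pos = model(x_t, t, pos_emb)  # noise prediction given the prompt
    eps_neg = model(x_t, t, neg_emb)  # noise prediction given the negative condition
    # Extrapolate away from the negative condition, toward the prompt.
    return eps_neg + guidance_scale * (eps_pos - eps_neg)
```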


PDF

Supplemental

Published at the International Conference on Computer Vision (ICCV), 2025

Method



Sampling approaches discussed in this work. DNP, CNP, and ANSWER all rely on a DNS chain to generate the negative condition. While DNP and CNP use an external captioning model C to produce negative text prompts, ANSWER performs the negative conditioning in the latent space of the DM. CNP runs a complete DNS chain at each diffusion iteration t, whereas ANSWER requires at most K DNS iterations.
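For intuition only, here is a hypothetical sketch of the sampling loop implied by the figure. The paper's actual DNS update is not reproduced: `init_negative`, `dns_update`, and the `scheduler.step` interface are assumed placeholders, and only the control flow (at most K inner DNS iterations per diffusion step, versus a complete chain per step for CNP) follows the caption above:

```python
def answer_sample(model, scheduler, x_T, pos_emb, init_negative, dns_update,
                  K=5, guidance_scale=7.5, tol=1e-3):
    """Hypothetical ANSWER-style sampling loop (illustrative only).

    `init_negative` and `dns_update` are placeholder callables standing in
    for the DNS chain, which refines the negative condition in the DM's
    latent space rather than via an external captioning model.
    """
    x_t = x_T
    neg_cond = init_negative(pos_emb)
    for t in scheduler.timesteps:
        # Refine the latent negative condition with at most K DNS
        # iterations; CNP, by contrast, runs a complete DNS chain here.
        for _ in range(K):
            new_neg = dns_update(model, x_t, t, pos_emb, neg_cond)
            done = (new_neg - neg_cond).abs().max() < tol  # early stop on convergence
            neg_cond = new_neg
            if done:
                break
        # CFG-style update with the adapted negative condition in place of
        # an explicit (lossy, incomplete) negative text prompt.
        eps_pos = model(x_t, t, pos_emb)
        eps_neg = model(x_t, t, neg_cond)
        eps = eps_neg + guidance_scale * (eps_pos - eps_neg)
        x_t = scheduler.step(eps, t, x_t)  # assumed to return the next latent
    return x_t
```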

Results


We demonstrate the performance of the method using prompts from the Attend & Excite, Pick-a-Pic, DrawBench, and PartiPrompts datasets, which feature comprehensive and diverse descriptions that extend beyond common training data. These datasets test the model's ability to handle entity attributes and interactions, spatial relationships, text rendering, numeracy, and rare or imaginative scenarios.


Our method outperforms SDXL (CFG) and DNP on all tasks. Human evaluators also showed a strong preference for the images generated by our method, favoring them by significant margins. Please refer to our paper for additional ablations.


Convergence strength of ANSWER



The figure shows the negative images for the CFG (top) and ANSWER (bottom) chains. ANSWER converges faster (t = 10) than CFG (t = 30), illustrating the strength of ANSWER in enforcing prompt adherence.

Acknowledgements

This work was partially funded by NSF award IIS-2303153, NAIRR-240300, a gift from Qualcomm, and NVIDIA GPU donations. We also gratefully acknowledge the use of the Nautilus platform for the experiments discussed above.