RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models

1Tsinghua University, 2Peking University, 3University of Science and Technology of China,
4PicUp.AI, 5Stanford University

Qualitative comparison between RealCompo and the text-to-image model Stable Diffusion v1.5, as well as the layout-to-image models GLIGEN and LMD+. Colored text highlights the advantages of RealCompo in the generated results.

Abstract

Diffusion models have achieved remarkable advancements in text-to-image generation. However, existing models still struggle with multiple-object compositional generation. In this paper, we propose a new training-free and transfer-friendly text-to-image generation framework, RealCompo, which combines the advantages of text-to-image models and layout-to-image models to enhance both the realism and the compositionality of the generated images. Our approach uses an LLM as a layout generator to pre-bind multiple objects and their attributes before denoising. In addition, an intuitive and user-friendly balancer dynamically balances the strengths of the two models across the denoising steps, allowing plug-and-play use of any model without extra training. Extensive experiments show that RealCompo outperforms state-of-the-art text-to-image models and layout-to-image models in multiple-object compositional generation and achieves a good balance between the realism and compositionality of the generated images.

Method

We first use an LLM to obtain a layout corresponding to the input text prompt. Next, at each denoising step, the balancer analyzes the cross-attention maps derived from each model's current influence and dynamically updates the influence of each model.
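
The balancing idea can be illustrated with a minimal Python sketch. It is a toy illustration under stated assumptions, not the implementation used in the paper: the function names (`softmax_pair`, `balance_noise`, `update_coef`), the use of a softmax to normalize the two influences, the plain gradient-descent update, and the placeholder gradient are all hypothetical. In a real pipeline the noise predictions would be tensors produced by the T2I and L2I models, and the gradient would come from a loss computed on their cross-attention maps.

```python
import math

def softmax_pair(a, b):
    """Return two weights in (0, 1) that sum to 1 (stable two-way softmax)."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)

def balance_noise(eps_t2i, eps_l2i, coef_t2i, coef_l2i):
    """Blend two noise predictions element by element.

    Each position has its own pair of coefficients, so the influence of
    the two models can differ across the latent.
    """
    blended = []
    for e_t, e_l, c_t, c_l in zip(eps_t2i, eps_l2i, coef_t2i, coef_l2i):
        w_t, w_l = softmax_pair(c_t, c_l)
        blended.append(w_t * e_t + w_l * e_l)
    return blended

def update_coef(coef, grad, step_size=0.1):
    """Hypothetical update rule: one gradient-descent step on the coefficients."""
    return [c - step_size * g for c, g in zip(coef, grad)]

# Toy 4-element "noise predictions" standing in for the two models' outputs.
eps_t2i = [1.0, 0.0, -1.0, 2.0]
eps_l2i = [0.0, 2.0, 1.0, 2.0]

# Equal coefficients: both models start with the same influence.
coef_t2i = [0.0] * 4
coef_l2i = [0.0] * 4
eps = balance_noise(eps_t2i, eps_l2i, coef_t2i, coef_l2i)

# Placeholder gradient (assumed to come from a cross-attention-based loss);
# a negative gradient raises the L2I coefficient after the descent step.
grad_l2i = [-1.0] * 4
coef_l2i = update_coef(coef_l2i, grad_l2i)
eps_next = balance_noise(eps_t2i, eps_l2i, coef_t2i, coef_l2i)
```

With equal coefficients the blended prediction is the plain average of the two inputs. After the update, the L2I prediction carries slightly more weight at every position, which is the sense in which the influence is rebalanced from one denoising step to the next.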

Results

Comparison with other T2I and L2I models

RealCompo achieves satisfactory results in both the realism and the compositionality of the generated images.


Generalization to different models

Qualitative comparison of RealCompo's generalization to different models. We select two T2I models (Stable Diffusion v1.5 and TokenCompose) and two L2I models (GLIGEN and Layout Guidance, abbreviated LayGuide), and combine them in pairs to obtain four versions of RealCompo. The results show that RealCompo generalizes well across models, achieving both high fidelity and precise alignment with the text prompts.
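
The pairing described above is simply the Cartesian product of the two model lists. The short Python sketch below enumerates it; the variable names and the label format are illustrative assumptions, not identifiers from the RealCompo code.

```python
from itertools import product

# Model names as listed in the comparison above.
t2i_models = ["Stable Diffusion v1.5", "TokenCompose"]
l2i_models = ["GLIGEN", "LayGuide"]

# Every (T2I, L2I) pair gives one RealCompo version: 2 x 2 = 4.
versions = list(product(t2i_models, l2i_models))

# Illustrative label format only.
labels = [f"RealCompo ({t2i} + {l2i})" for t2i, l2i in versions]
for label in labels:
    print(label)
```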

BibTeX

@article{zhang2024realcompo,
  author    = {Zhang, Xinchen and Yang, Ling and Cai, Yaqi and Yu, Zhaochen and Xie, Jiake and Tian, Ye and Xu, Minkai and Tang, Yong and Yang, Yujiu and Cui, Bin},
  title     = {RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models},
  journal   = {arXiv preprint arXiv:2402.12908},
  year      = {2024},
}