ALDM

In contrast to prior layout-to-image (L2I) synthesis methods, ALDM adds explicit supervision on layout alignment through an adversarial objective, together with a novel multistep unrolling strategy that maintains consistent adherence to the conditional layout. The model synthesizes faithful samples that are well aligned with the layout input while preserving controllability via text prompts.

Abstract

Despite the recent advances in large-scale diffusion models, little progress has been made on the layout-to-image (L2I) synthesis task. Current L2I models either suffer from poor editability via text or weak alignment between the generated image and the input layout. This limits their usability in practice. To mitigate this, we propose to integrate adversarial supervision into the conventional training pipeline of L2I diffusion models (ALDM). Specifically, we employ a segmentation-based discriminator which provides explicit feedback to the diffusion generator on the pixel-level alignment between the denoised image and the input layout. To encourage consistent adherence to the input layout over the sampling steps, we further introduce the multistep unrolling strategy. Instead of looking at a single timestep, we unroll a few steps recursively to imitate the inference process, and ask the discriminator to assess the alignment of denoised images with the layout over a certain time window. Our experiments show that ALDM enables layout faithfulness of the generated images, while allowing broad editability via text prompts. Moreover, we showcase its usefulness for practical applications: by synthesizing target distribution samples via text control, we improve domain generalization of semantic segmentation models by a large margin (~12 mIoU points).

How does it work?

We propose two novel training strategies to improve traditional L2I diffusion model training (area (A)): adversarial supervision via a segmenter-based discriminator, illustrated in area (B), and a multistep unrolling strategy, illustrated in area (C).

In conventional L2I diffusion model training, there is no explicit supervision in place to ensure layout alignment. To address this, we employ a segmenter-based discriminator, which provides direct per-pixel feedback to the diffusion model generator on how well the denoised images adhere to the input layout. Further, to encourage consistent compliance with the given layout over the sampling steps, we propose a novel multistep unrolling strategy: the adversarial objective is defined over a time horizon, so that future steps are taken into consideration as well. In a sense, this resembles Model Predictive Control (MPC), an advanced control algorithm. Enabled by adversarial supervision over multiple sampling steps, ALDM can effectively ensure consistent layout alignment while maintaining the text controllability of the large-scale pretrained diffusion model.
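The generator-side idea described above can be illustrated with a small, dependency-free toy. This is a minimal sketch under stated assumptions, not the ALDM implementation: images are flat lists of pixel values, the "segmenter" and "denoiser" are hand-written stand-ins, the noise schedule `ALPHA_BARS` is made up, and the move to the previous timestep uses a deterministic DDIM-style update chosen here for simplicity. All function names are hypothetical. The discriminator's own training is omitted.

```python
import math
import random

# Toy cumulative noise schedule (alpha_bar per timestep); values are illustrative.
ALPHA_BARS = [0.99, 0.9, 0.7, 0.5, 0.3, 0.1]


def pixel_cross_entropy(logits, layout):
    """Mean per-pixel cross-entropy between class logits and layout labels."""
    total = 0.0
    for scores, label in zip(logits, layout):
        peak = max(scores)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[label]
    return total / len(layout)


def predict_x0(x_t, eps_hat, alpha_bar):
    """Estimate the clean image from the noisy image and the predicted noise."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(x_t, eps_hat)]


def step_back(x0_hat, eps_hat, alpha_bar_prev):
    """Deterministic (DDIM-style) move to the previous, less noisy timestep."""
    a, b = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    return [a * x0 + b * e for x0, e in zip(x0_hat, eps_hat)]


def unrolled_adversarial_loss(x_t, t, layout, denoiser, segmenter, k):
    """Average the segmenter's alignment loss over up to k unrolled steps."""
    if k < 1:
        raise ValueError("k must be at least 1")
    x, losses = list(x_t), []
    for cur in range(t, max(t - k, -1), -1):
        eps_hat = denoiser(x, cur, layout)
        x0_hat = predict_x0(x, eps_hat, ALPHA_BARS[cur])
        # Per-pixel feedback: does the denoised image match the input layout?
        losses.append(pixel_cross_entropy(segmenter(x0_hat), layout))
        if cur > 0:
            x = step_back(x0_hat, eps_hat, ALPHA_BARS[cur - 1])
    return sum(losses) / len(losses)


def toy_segmenter(image):
    """Stand-in segmenter: negative pixels look like class 0, positive like class 1."""
    return [[-4.0 * v, 4.0 * v] for v in image]


def make_toy_denoiser(use_layout):
    """Stand-in denoiser that either follows the layout or ignores it."""
    def denoiser(x_t, t, layout):
        a, b = math.sqrt(ALPHA_BARS[t]), math.sqrt(1.0 - ALPHA_BARS[t])
        target = [(1.0 if c == 1 else -1.0) if use_layout else 1.0 for c in layout]
        return [(x - a * x0) / b for x, x0 in zip(x_t, target)]
    return denoiser


layout = [0, 1, 1, 0]
rng = random.Random(0)
noisy = [rng.gauss(0.0, 1.0) for _ in layout]
aligned = unrolled_adversarial_loss(noisy, 5, layout, make_toy_denoiser(True), toy_segmenter, k=3)
ignoring = unrolled_adversarial_loss(noisy, 5, layout, make_toy_denoiser(False), toy_segmenter, k=3)
print(f"layout-following denoiser: {aligned:.4f}, layout-ignoring denoiser: {ignoring:.4f}")
```

In this toy, a denoiser that follows the layout receives a much smaller loss than one that ignores it, at every step in the window rather than only at the sampled timestep. In actual training, a term of this kind would presumably be combined with the usual noise-prediction loss; how the two are weighted is not specified here.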

Comparison of Layout Alignment

Comparison of Text Controllability

More Visual Examples

Improved Domain Generalization

We further demonstrate the utility of synthetic data generated by our method for domain generalization (DG) in semantic segmentation, where the downstream model is trained on a source domain, and its generalization performance is evaluated on unseen target domains. Augmented with diverse synthetic data generated by our ALDM, the segmentation model can make more reliable predictions under diverse unseen conditions.
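The augmentation idea can be sketched as follows. This is an illustrative outline only: `generate` is a hypothetical callable standing in for a trained L2I model, the prompts are made-up examples of target conditions, and the exact mixing of real and synthetic data used in the experiments is not described here. The key point it shows is that each synthetic image reuses the layout it was generated from as its segmentation label.

```python
import random


def augment_with_synthetic(source_pairs, generate, prompts, per_layout=1, seed=0):
    """Return the source pairs plus synthetic (image, layout) pairs.

    Each synthetic image is generated from a source layout and a text prompt,
    so the layout itself serves as the pixel-level label for the new image.
    """
    rng = random.Random(seed)
    augmented = list(source_pairs)
    for _image, layout in source_pairs:
        chosen = rng.sample(prompts, k=min(per_layout, len(prompts)))
        for prompt in chosen:
            augmented.append((generate(layout, prompt), layout))
    return augmented


def stub_generate(layout, prompt):
    """Placeholder for a layout- and text-conditioned image generator."""
    return f"image<{layout}|{prompt}>"


source = [("real_img_0", "layout_0"), ("real_img_1", "layout_1")]
example_prompts = ["snowy scene", "foggy scene", "night scene"]  # assumed examples
train_set = augment_with_synthetic(source, stub_generate, example_prompts, per_layout=2)
print(len(train_set), "training pairs")
```

The downstream segmentation model would then be trained on `train_set` and evaluated on the unseen target domains.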

BibTeX

@inproceedings{li2024aldm,
    title     = {Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive},
    author    = {Yumeng Li and Margret Keuper and Dan Zhang and Anna Khoreva},
    booktitle = {ICLR},
    year      = {2024},
}