T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

Chong Mou et al. (2023)

Professor: Luiz Velho

Presentation by: Rodrigo Schuller

23 October 2023

Overview

  • Brief literature survey

    • From UNet to SD
  • Learning to inject structure guidance into the denoiser

  • Results

  • The devil is in the details

    • Training the right way – a matter of timing

    • Sample complexity

  • One more thing...

    • Composition & inpainting!

Literature survey

1. Neural machine translation by jointly learning to align and translate, D. Bahdanau et al.
2. Denoising diffusion probabilistic models, J. Ho et al. (Appendix B)
3. High-resolution image synthesis with latent diffusion models, R. Rombach et al.

DDPM (2020) – UNet w/ self-attention

DDPM (2020) – UNet w/ self-attention

  • Before each down, up, and mid* block: x = res(x, t); return att(x)

    • Inserts temporal information, i.e., models \epsilon(x_t, t) instead of \epsilon(x_t)
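A minimal numpy sketch of the idea (function and variable names are illustrative, not the DDPM code): a residual block that adds a projected timestep embedding to its feature map, so the network models \epsilon(x_t, t) rather than \epsilon(x_t).

```python
import numpy as np

def res_block(x, t_emb, W_t):
    """Hypothetical sketch of res(x, t): add a per-channel bias derived
    from the timestep embedding, then apply a residual connection.
    (att(x) would follow this block in the real UNet.)"""
    h = np.maximum(x, 0.0)                      # stand-in nonlinearity
    h = h + (W_t @ t_emb)[None, :, None, None]  # broadcast time bias over H, W
    return x + h                                # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8, 4, 4))   # (batch, channels, H, W)
t_emb = rng.standard_normal(16)         # timestep embedding
W_t = rng.standard_normal((8, 16))      # projects t_emb to the channel dim
y = res_block(x, t_emb, W_t)            # same shape as x, now t-dependent
```

Changing `t_emb` changes the output, which is exactly what makes the denoiser a function of t.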

SD (2022)

SD (2022) – cross-attention

SD (2022) – key takeaways

  • Pre-trained latent space (image autoencoder)
  • Pre-trained CLIP (text embedding)

  • UNet w/ cross-attention/concatenation: \epsilon(x_t, t, \tau_\theta)
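The text conditioning enters through cross-attention: queries come from the image latents, keys and values from the text embedding (the output of \tau_\theta). A small numpy sketch under those assumptions (weight names are illustrative):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(x, ctx, Wq, Wk, Wv):
    """Queries from image tokens x, keys/values from text tokens ctx."""
    Q, K, V = x @ Wq, ctx @ Wk, ctx @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # scaled dot-product
    return softmax(scores) @ V               # one row of attention per image token

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))    # 16 spatial tokens, dim 32
ctx = rng.standard_normal((8, 64))   # 8 text tokens (e.g. from CLIP), dim 64
Wq = rng.standard_normal((32, 24))
Wk = rng.standard_normal((64, 24))
Wv = rng.standard_normal((64, 32))
out = cross_attention(x, ctx, Wq, Wk, Wv)   # (16, 32): text-informed image features
```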

LoRA (2022)

LoRA (2022) – key takeaways

h = W_0 x + \Delta W x = W_0 x + B A x

  • Learn injection that fine-tunes the model

    • A push in the right direction
  • Works well on SD's UNet, i.e., the cross-attention matrices

  • How can we inject, or push, the inner values of the UNet with additional conditions?
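The LoRA equation above can be sketched in a few lines of numpy (illustrative, not the authors' code): a frozen weight W_0 plus a trainable low-rank update BA. With B zero-initialized, \Delta W = 0 at the start, so the pretrained behaviour is preserved.

```python
import numpy as np

class LoRALinear:
    """Minimal LoRA sketch: frozen W0 plus a rank-r update B @ A."""
    def __init__(self, d_out, d_in, r=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W0 = rng.standard_normal((d_out, d_in))    # frozen pretrained weight
        self.A = 0.01 * rng.standard_normal((r, d_in))  # trainable, rank r
        self.B = np.zeros((d_out, r))                   # trainable, zero-init

    def __call__(self, x):
        # h = W0 x + Delta W x = W0 x + B A x
        return self.W0 @ x + self.B @ (self.A @ x)

layer = LoRALinear(6, 10)
x = np.ones(10)
h = layer(x)  # before training, the low-rank path contributes nothing
```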

Injecting structure guidance into the UNet

  • \begin{aligned} F_c &= {\cal F}_{AD}(C) = \{F^1_c, F^2_c, F^3_c, F^4_c\} = \Delta F \\ \hat{F}^i_{enc} &= F^i_{enc} + \Delta F^i = F^i_{enc} + F^i_c,\ i \in \{1, 2, 3, 4\} \end{aligned}
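The injection itself is just a per-scale addition of the adapter's multi-scale features to the UNet encoder features. A numpy sketch with four dummy scales (shapes are illustrative, not SD's actual dimensions):

```python
import numpy as np

def inject(F_enc, F_c):
    """Add adapter features F_c^i to encoder features F_enc^i
    at each of the four encoder scales."""
    assert len(F_enc) == len(F_c) == 4
    return [f + df for f, df in zip(F_enc, F_c)]

rng = np.random.default_rng(0)
# four encoder scales: channels grow as spatial resolution shrinks
shapes = [(8, 32, 32), (16, 16, 16), (32, 8, 8), (32, 4, 4)]
F_enc = [rng.standard_normal(s) for s in shapes]  # UNet encoder features
F_c = [rng.standard_normal(s) for s in shapes]    # adapter output F_AD(C)
F_hat = inject(F_enc, F_c)                        # \hat F_enc, same shapes
```

The UNet and the adapter are trained/used so that the shapes match at each scale; only the adapter's parameters are learned, the denoiser stays frozen.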

T2I-adapter

Results – color

Results – sketch

Results – depth

Results – segmentation

Results – pose

The devil is in the details

  • Training the right way – a matter of timing

  • Sample complexity

Training the right way – a matter of timing

  • Guidance in different stages of DDIM (denoising diffusion implicit model) inference

Training the right way – a matter of timing

Training the right way – a matter of timing

  • Solution: sample t using the random variable
    Y = \left(1-\left(\frac{X}{T}\right)^3\right)T,\quad X \sim U(0, T)
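The cubic map above biases training toward large t, i.e., the early high-noise stages of DDIM inference, which the slides identify as where guidance matters. A minimal pure-Python sketch:

```python
import random

def sample_t(T=1000):
    """Cubic timestep sampling: X ~ U(0, T), Y = (1 - (X/T)^3) * T."""
    x = random.uniform(0, T)
    return (1 - (x / T) ** 3) * T

ts = [sample_t() for _ in range(100_000)]
mean_t = sum(ts) / len(ts)  # E[Y] = T * (1 - 1/4) = 0.75 * T
```

Since E[(X/T)^3] = 1/4 for X uniform on (0, T), the sampled timesteps average 0.75T instead of 0.5T, concentrating training on the noisy early steps.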

The devil is in the details

  • Training the right way – a matter of timing

  • Sample complexity

Sample complexity

  • Sketch map: COCO17 – 164K images; edges generated with Pixel difference networks for efficient edge detection, Zhuo Su et al.

  • Semantic segmentation map: COCO-Stuff – 164K images

  • Keypoint, color & depth maps: LAION-AESTHETICS – 600K images

    • To generate keypoints: MMPose

    • To generate depth maps: MiDaS

One more thing...

One more thing – composition

  • Sum injections with adjustable weights \omega_k:

    F_c = \sum_{k=1}^K \omega_k {\cal F}^k_{AD}(C_k) = \sum_{k=1}^K \omega_k \Delta F^k
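Because each adapter's output lives in the same feature space, composing K adapters is a weighted sum per encoder scale. A numpy sketch with two hypothetical adapters (e.g. sketch + color) and two dummy scales:

```python
import numpy as np

def compose(adapter_feats, weights):
    """F_c^i = sum_k w_k * F_AD^k(C_k)^i, computed per encoder scale i."""
    n_scales = len(adapter_feats[0])
    return [sum(w * feats[i] for w, feats in zip(weights, adapter_feats))
            for i in range(n_scales)]

rng = np.random.default_rng(0)
# two adapters, each emitting features at two (illustrative) scales
sketch = [rng.standard_normal((4, 8, 8)), rng.standard_normal((8, 4, 4))]
color = [rng.standard_normal((4, 8, 8)), rng.standard_normal((8, 4, 4))]
F_c = compose([sketch, color], weights=[0.6, 0.4])  # blended guidance features
```

The weights \omega_k let the user trade off how strongly each condition steers the generation, with no retraining.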

One more thing – composition

One more thing – composition

One more thing – composition

One more thing – inpainting

One more thing – inpainting with structural guidance

One more thing – inpainting with structural guidance

Review

Review

  • Enhanced UNet as denoiser
  • T2I-adapter as injections into the UNet – analogous to LoRA

    \begin{aligned} F_c &= {\cal F}_{AD}(C) = \{F^1_c, F^2_c, F^3_c, F^4_c\} = \Delta F \\ \hat{F}^i_{enc} &= F^i_{enc} + \Delta F^i = F^i_{enc} + F^i_c,\ i \in \{1, 2, 3, 4\} \end{aligned}

  • Cubic sampling for t in training is paramount

  • Sample complexity in the hundreds of thousands of images

  • One more thing... Composition & inpainting!

Thank you