Professor: Luiz Velho
Presentation by: Rodrigo Schuller
23 October 2023
Brief literature survey
Learning to inject structure guidance into the denoiser
Results
The devil is in the details
Training the right way – a matter of timing
Sample complexity
One more thing...
Before each down, up, and mid* block: x = res(x, t); return att(x)
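The per-block pattern above can be sketched as follows (a hypothetical minimal skeleton of an SD UNet stage, not the real implementation; `res` and `att` are toy stand-ins):

```python
# Each down/up/mid block first runs a ResBlock conditioned on the timestep
# embedding t, then an attention layer (self-attention, or cross-attention
# with the text embedding).
def unet_block(x, t, res, att):
    x = res(x, t)    # ResBlock: mixes the time embedding into the features
    return att(x)    # attention over the resulting feature map

# toy stand-ins that only illustrate the call pattern
res = lambda x, t: x + t
att = lambda x: 2 * x
print(unet_block(1.0, 0.5, res, att))  # (1.0 + 0.5) * 2 = 3.0
```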
Pre-trained CLIP (text embedding)
UNet w/ cross-attention/concatenation: $\epsilon(x_t, t, \tau_\theta)$
$h = W_0 x + \Delta W x = W_0 x + B A x$
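The low-rank update can be sketched in a few lines (illustrative numpy, with made-up dimensions; in LoRA, $B$ starts at zero so the update is a no-op before training):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                      # layer dims, rank r << min(d, k)
W0 = rng.standard_normal((d, k))         # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero init

x = rng.standard_normal(k)
h = W0 @ x + B @ (A @ x)                 # h = W0 x + Delta_W x, Delta_W = B A
assert np.allclose(h, W0 @ x)            # identical to the base model at init
```

Only $A$ and $B$ ($r(d+k)$ parameters) are trained; $W_0$ stays frozen.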
Learn injection that fine-tunes the model
Works well on SD's UNet, i.e., on the cross-attention matrices
How can we inject additional conditions into the UNet's inner activations?
Source: https://www.researchgate.net/figure/The-pixel-shuffle-layer-transforms-feature-maps-from-the-LR-domain-to-the-HR-image_fig3_339531308
Pixel unshuffle: downsample to latent-space resolution ($512^2 \mapsto 64^2$) by folding spatial pixels into new channels
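A minimal numpy sketch of pixel unshuffle (space-to-depth), matching the channel ordering of PyTorch's `nn.PixelUnshuffle`: with downscale factor $r = 8$, a $512^2$ condition map becomes $64^2$ with $r^2 = 64\times$ the channels.

```python
import numpy as np

def pixel_unshuffle(x, r):
    """x: (C, H, W) array -> (C*r*r, H//r, W//r)."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)        # (C, r, r, H//r, W//r)
    return x.reshape(c * r * r, h // r, w // r)

x = np.random.default_rng(0).random((3, 512, 512))
y = pixel_unshuffle(x, 8)
print(y.shape)  # (192, 64, 64)
```

No information is lost: the operation is a pure reshuffle, invertible by the corresponding pixel shuffle.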
Sketch map: COCO17 – 164K images; edges generated with "Pixel Difference Networks for Efficient Edge Detection" (Zhuo Su et al.)
Semantic segmentation map: COCO-Stuff – 164k images
Keypoints & Color & Depth maps: LAION-AESTHETICS – 600k images
To generate keyposes: MMPose
To generate depth maps: MiDaS
Sum injections with adjustable weights $\omega_k$:
$F_c = \sum_{k=1}^K \omega_k \mathcal{F}^k_{AD}(C_k) = \sum_{k=1}^K \omega_k \Delta F^k$
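The weighted composition above can be sketched as follows (the adapters here are toy stand-ins, not trained networks; the weights $\omega_k$ are user-adjustable at inference time):

```python
import numpy as np

def compose(adapters, conds, omegas):
    """F_c = sum_k omega_k * F_AD^k(C_k)."""
    return sum(w * f(c) for f, c, w in zip(adapters, conds, omegas))

# illustrative stand-ins for, e.g., a sketch adapter and a depth adapter
sketch_adapter = lambda c: 2.0 * c
depth_adapter = lambda c: c + 1.0
Fc = compose([sketch_adapter, depth_adapter],
             [np.ones(4), np.zeros(4)],
             [0.5, 0.5])
print(Fc)  # 0.5*2.0 + 0.5*1.0 = [1.5, 1.5, 1.5, 1.5]
```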
Prompt: face of a yellow cat, high resolution, sitting on a park bench
Source: https://medium.com/aibygroup/lets-understand-stable-diffusion-inpainting-fdd0b1c3a925
T2I-adapter as injections into the UNet – analogous to LoRA
$\begin{aligned} F_c &= \mathcal{F}_{AD}(C) = \{F^1_c, F^2_c, F^3_c, F^4_c\} = \Delta F \\ \hat{F}^i_{enc} &= F^i_{enc} + \Delta F^i = F^i_{enc} + F^i_c,\ i \in \{1, 2, 3, 4\} \end{aligned}$
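The injection step is just an elementwise addition at each of the four encoder scales; a minimal sketch (channel dimensions omitted, resolutions illustrative):

```python
import numpy as np

def inject(F_enc, F_c):
    """F_hat_enc^i = F_enc^i + F_c^i, one feature map per encoder scale."""
    return [fe + fc for fe, fc in zip(F_enc, F_c)]

# toy multi-scale features at 64, 32, 16, 8 spatial resolution
F_enc = [np.zeros((s, s)) for s in (64, 32, 16, 8)]
F_c = [np.ones((s, s)) for s in (64, 32, 16, 8)]
F_hat = inject(F_enc, F_c)
assert all(np.allclose(f, 1.0) for f in F_hat)
```

Because the adapter output is only ever added, the frozen UNet is untouched: with a zero adapter output the model reduces exactly to vanilla SD.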
Cubic sampling for $t$ in training is paramount
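A sketch of the cubic timestep schedule, assuming the form $t = (1 - (u/T)^3)\,T$ with $u \sim U(0, T)$: it biases training toward large $t$ (high-noise steps), where structural guidance has the most effect.

```python
import numpy as np

def sample_t(n, T=1000, rng=np.random.default_rng(0)):
    """Cubic timestep sampling: u ~ U(0, T), t = (1 - (u/T)^3) * T."""
    u = rng.uniform(0.0, T, size=n)
    return (1.0 - (u / T) ** 3) * T

t = sample_t(100_000)
print(t.mean())  # ~0.75 * T, vs 0.5 * T under uniform sampling
```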
Sample complexity in the hundreds of thousands of images
One more thing... Composition & inpainting!