https://arxiv.org/pdf/2304.06720.pdf
https://rich-text-to-image.github.io/

Demonstration

Motivation

Question: how can rich text be used?
Previous solution: translate the rich text directly into a longer plain-text prompt → the generated images are poor

Method

Our solution:
1. Split the rich text into two parts:

a. a short plain-text prompt (without formatting)
b. region-specific prompts that carry the text attributes

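2. Obtain self- and cross-attention maps, use them to partition the image into regions, and from those derive the token maps

a. Run the short plain-text prompt through a vanilla denoising process to obtain the attention maps
b. Attention maps usually correlate with the spatial composition of the generated image, so they can be used to associate a word with a specific region
c. During a, also collect the noised generation and the residual feature maps
d. In detail, a works as follows: the self-attention maps give the segmentation, and the cross-attention maps label each segment → token maps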

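3. Create a prompt for each region from the attributes taken from the rich text

a. For colors that natural language cannot describe precisely: apply region-based guidance repeatedly to update the region until it matches the target color (what is region-based guidance?)
b. Attributes are mainly converted into two forms:
- i. region-based guidance target (an orange church)
- ii. textual input (a garden filled with colorful wildflowers) → mainly used to handle footnotes

4. Run a separate denoising process for each region and combine the predicted noise to obtain the final update.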
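
A minimal sketch of how the per-region noise predictions in step 4 might be combined. The notes only say the predictions are combined, so the mask-weighted sum used here is an assumption, and `combine_noise`, `region_noise` and `token_maps` are hypothetical names.

```python
import numpy as np

def combine_noise(region_noise, token_maps):
    """Blend per-region noise predictions into one prediction.

    region_noise: (R, C, H, W) array, one predicted noise per region.
    token_maps:   (R, H, W) array of per-pixel weights, assumed to sum
                  to 1 over the R regions at every pixel.
    Assumption: the combination is a mask-weighted sum.
    """
    # Broadcast each region's map over the channel axis, then sum regions.
    return (token_maps[:, None] * region_noise).sum(axis=0)
```

With binary, non-overlapping token maps this simply copies each region's prediction into the pixels that region owns.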

Details

Step 1. Token maps for spatial layout

Prior work has shown that the self- and cross-attention layers of the diffusion UNet capture a form of spatial layout.

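Get self-attention maps of size 32 × 32 × 32 × 32 across different heads, layers and time steps.
Average them and reshape to 1024 × 1024. In this matrix, the entry in row i, column j is the correlation between pixel i and pixel j. Average each pair of entries that mirror each other across the diagonal and write the result back, giving a symmetric matrix (so the i-to-j and j-to-i correlations are equal). Run clustering on this symmetric matrix (spectral clustering; my understanding is that it groups together pixels whose similarity is high enough) to obtain binary segmentation maps of size K × 32 × 32, where K is the number of segments.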
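
The symmetrize-then-cluster step can be sketched with NumPy alone. This is an illustration under assumptions, not the paper's code: the function names are hypothetical, the affinity is normalized as D^-1/2 A D^-1/2, and the small deterministic k-means stands in for whatever clustering backend is actually used.

```python
import numpy as np

def symmetrize(attn):
    """Average a[i, j] and a[j, i] so the affinity matrix is symmetric."""
    return 0.5 * (attn + attn.T)

def spectral_embedding(affinity, k):
    """Row-normalized top-k eigenvectors of D^-1/2 A D^-1/2."""
    deg = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    norm_aff = affinity * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(norm_aff)  # eigenvalues in ascending order
    emb = vecs[:, -k:]                  # keep the k largest
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.maximum(norms, 1e-12)

def kmeans(points, k, iters=20):
    """Tiny deterministic k-means with farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.stack(centers)
    for _ in range(iters):
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def segment(attn, k, side):
    """attn: (side*side, side*side) averaged self-attention.
    Returns K binary masks of shape (k, side, side)."""
    labels = kmeans(spectral_embedding(symmetrize(attn), k), k)
    return np.stack([(labels == c).reshape(side, side) for c in range(k)])
```

For the setting in these notes, `attn` would be the 1024 × 1024 averaged map and `side` would be 32. Note that spectral clustering groups pixels through the eigenvectors of the affinity matrix rather than through a fixed similarity threshold.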

sj: attention score
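Use the cross-attention map (one per token w) to associate each segment with a textual span.
A cross-attention map describes the similarity between one sequence and another. Each map is resized by interpolation to the same resolution, 32 × 32, and the maps are then averaged across heads, layers and time steps (separately for each token).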
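
A rough sketch of the resize-and-average step for cross-attention maps. Nearest-neighbour sampling is swapped in here for the interpolation mentioned above to keep the example NumPy-only; `resize_nearest` and `average_cross_attention` are hypothetical helper names.

```python
import numpy as np

def resize_nearest(maps, size):
    """Nearest-neighbour resize of a (T, h, w) stack to (T, size, size)."""
    rows = np.arange(size) * maps.shape[1] // size
    cols = np.arange(size) * maps.shape[2] // size
    return maps[:, rows][:, :, cols]

def average_cross_attention(collected, size=32):
    """collected: list of (T, h, w) arrays, one per head/layer/time step,
    where T is the number of tokens. Returns one (T, size, size) map
    per token, averaged over the whole list."""
    return np.mean([resize_nearest(m, size) for m in collected], axis=0)
```

In a real pipeline a bilinear resize would likely be preferable, since nearest-neighbour sampling produces blocky maps when upscaling low-resolution layers.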

Step 2: How to associate tokens with segments:

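ei: textual span, wj: token, Mk: segmentation. Bars with a subscript 1 denote the L1 norm, the sum of the absolute values of the elements (the same holds for the double bars ‖).
The normalized cross-attention map is multiplied with the segmentation map, and the token is considered related to a segment when the sum of the absolute values of the resulting entries exceeds a threshold.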
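
The thresholding rule can be sketched as below. Several details are assumptions rather than something stated in these notes: min-max normalization of each token's map, an element-wise product with the mask, dividing the overlap by the segment's area, and the value of `eps`. The name `associate` is hypothetical.

```python
import numpy as np

def associate(segments, cross_attn, eps=0.3):
    """Link tokens to segments.

    segments:   (K, H, W) binary segmentation masks.
    cross_attn: (T, H, W) averaged cross-attention map per token.
    Returns a (T, K) boolean matrix; entry [j, k] is True when token j
    is considered related to segment k.
    """
    # Assumption: min-max normalize each token's map to [0, 1].
    lo = cross_attn.min(axis=(1, 2), keepdims=True)
    hi = cross_attn.max(axis=(1, 2), keepdims=True)
    norm = (cross_attn - lo) / np.maximum(hi - lo, 1e-12)
    # L1 norm of the element-wise product of mask k and map j
    # (all entries are non-negative, so the sum is the L1 norm).
    overlap = np.einsum("khw,thw->tk", segments.astype(float), norm)
    # Assumption: divide by the segment area so large segments do not
    # pass the threshold just by being large.
    area = segments.reshape(len(segments), -1).sum(axis=1)
    return overlap / np.maximum(area, 1) > eps
```

Without the area normalization the threshold would have to scale with segment size, which is why the ratio form is used in this sketch.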

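A textual span is a unit to which the user has attached rich-text attributes, while a token has its usual NLP meaning: typically the smallest unit that carries some meaning.
The token maps are combined into the final token map for a textual span as follows: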