🤔 Learning, Reflection, and Random Musings

Posts are in a mix of English, Chinese and Japanese; thanks for bearing with me. Contact:

🎞️ Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Why I'm reading this paper: I want to figure out what Cross Frame Attention actually is.

Abstract

Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g. Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data. Our code will be open sourced at: https://github.com/Picsart-AI-Research/Text2Video-Zero.

A look at Figure 2


First, a few things to get straight

DDIM backward

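Why DDIM? Probably because it's fast?

DDPM forward is used because we want more freedom (it is stochastic).

Running the noise backward and then forward again mainly serves to obtain a set of latents $\tilde x^1_T, \ldots, \tilde x^k_T$ to use as noise.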

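Meaning of x's subscripts (t, T, T', …) and superscript k = 1, …, m

t is the diffusion time step.

T is the time step that corresponds to noise. At this point the image is pure Gaussian noise.

T' is an intermediate step (T' = T − Δt in the paper), used to obtain a set of random noise.

k is the frame index, k = 1, …, m.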
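To make the "forward again" half concrete, here is a minimal NumPy sketch of a closed-form DDPM forward jump from T' to T. The linear beta schedule, the step indices, the latent shape, and the helper names `make_alpha_bar` and `ddpm_forward` are all my own assumptions for illustration, not values or code from the paper. The point is only that identical latents at T' end up as different noise at T because the forward step is stochastic.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def ddpm_forward(x_t_prime, t_prime, t, alpha_bar, rng):
    """Noise a latent from step t_prime up to step t (t_prime < t) in one jump."""
    ratio = alpha_bar[t] / alpha_bar[t_prime]
    eps = rng.standard_normal(x_t_prime.shape)
    return np.sqrt(ratio) * x_t_prime + np.sqrt(1.0 - ratio) * eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
T, T_prime, m = 999, 939, 4          # hypothetical step indices and frame count

# Start every frame from the same latent at T' (illustration only).
x_t_prime = rng.standard_normal((4, 8, 8))
latents_T = np.stack(
    [ddpm_forward(x_t_prime, T_prime, T, alpha_bar, rng) for _ in range(m)]
)
print(latents_T.shape)
```

Each call draws fresh Gaussian noise, so the m latents at T differ even though they share the same starting point. That is the "freedom" the stochastic forward pass buys.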

Finally, the latent codes are passed to our modified SD model using the proposed cross-frame attention, which uses keys and values from the first frame to generate the image of frame k

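What do keys and values refer to?

They come from frame 1: every frame k attends to the first frame, so the queries are frame k's own while the keys and values are computed from frame 1's features.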

keys: $K^1$, computed from the first frame's features

values: $V^1$, computed from the first frame's features
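A small NumPy sketch of how I understand cross-frame attention: queries are taken from every frame, but keys and values are taken only from frame 1. The projection matrices `w_q`, `w_k`, `w_v`, the single-head layout, and the tensor sizes are assumptions for illustration. The real model applies this inside Stable Diffusion's self-attention layers.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(feats, w_q, w_k, w_v):
    """feats: (m, n, c) tokens for m frames.
    Queries come from each frame; keys and values come from frame 1 (index 0)."""
    q = feats @ w_q                  # (m, n, d)
    k_first = feats[0] @ w_k         # (n, d)
    v_first = feats[0] @ w_v         # (n, d)
    scores = q @ k_first.T / np.sqrt(q.shape[-1])   # (m, n, n)
    return softmax(scores) @ v_first                # (m, n, d)

rng = np.random.default_rng(0)
m, n, c, d = 3, 16, 8, 8             # hypothetical sizes
feats = rng.standard_normal((m, n, c))
w_q, w_k, w_v = (rng.standard_normal((c, d)) for _ in range(3))
out = cross_frame_attention(feats, w_q, w_k, w_v)
print(out.shape)
```

For frame 1 this reduces to ordinary self-attention. For every other frame, the output is a mixture of frame 1's values, which is why appearance and identity carry over from the first frame.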

Background smoothing can be used to improve the background.

warping function $W_k$

Turns the intermediate noise $\tilde x^1_{T'}$ into $\tilde x^k_{T'}$

For how exactly it works, I need to read the paper.
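As far as I can tell from the paper, the warp is a global translation whose size grows linearly with the frame index, δ^k = λ·(k − 1)·δ. Below is a rough NumPy sketch under strong simplifications: integer shifts only and wrap-around borders via `np.roll`, which is a stand-in I chose and not the paper's actual warping operation. The function name `warp_latent` and the default values are hypothetical.

```python
import numpy as np

def warp_latent(x_first, k, delta=(1, 1), lam=1):
    """Translate the first-frame latent for frame k (1-indexed).
    x_first: (c, h, w). Shift is lam * (k - 1) * delta, in latent pixels.
    Borders wrap around (a simplification of this sketch)."""
    dy = lam * (k - 1) * delta[0]
    dx = lam * (k - 1) * delta[1]
    return np.roll(x_first, shift=(dy, dx), axis=(1, 2))

rng = np.random.default_rng(0)
x_first = rng.standard_normal((4, 8, 8))
warped = [warp_latent(x_first, k) for k in range(1, 5)]   # frames 1..4
print(len(warped), warped[0].shape)
```

Frame 1 gets a zero shift, so it is left unchanged. Later frames slide further along the chosen direction, which gives the latents a consistent global motion.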

mask $M^k$ is the mask of the foreground object

Likewise, this mask can be used to generate a better background.
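My reading of background smoothing is a masked convex blend: the foreground (mask = 1) keeps frame k's latent, and the background (mask = 0) is mixed with the first frame's latent warped to frame k. A minimal sketch, where the function name, the shapes, and the default `alpha=0.6` are assumptions on my side:

```python
import numpy as np

def background_smoothing(x_k, x_first_warped, mask_k, alpha=0.6):
    """x_k, x_first_warped: (c, h, w) latents; mask_k: (1, h, w), 1 = foreground.
    Background pixels become alpha * warped first frame + (1 - alpha) * frame k."""
    blended = alpha * x_first_warped + (1.0 - alpha) * x_k
    return mask_k * x_k + (1.0 - mask_k) * blended

rng = np.random.default_rng(0)
x_k = rng.standard_normal((4, 8, 8))
x_first_warped = rng.standard_normal((4, 8, 8))
mask_k = np.zeros((1, 8, 8))
mask_k[:, 2:6, 2:6] = 1.0            # hypothetical foreground box
smoothed = background_smoothing(x_k, x_first_warped, mask_k)
print(smoothed.shape)
```

Only the background is pulled toward the first frame, so the foreground object stays free to move while the scene behind it stays stable across frames.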