Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Goal of reading this paper: figure out what Cross-Frame Attention is.

https://text2video-zero.github.io

Abstract

Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g. Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data. Our code will be open sourced at: https://github.com/Picsart-AI-Research/Text2Video-Zero.

Examining Figure 2

A few points to get straight first:

DDIM backward

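Why DDIM? Probably because it is fast? (DDIM sampling is deterministic and works with few steps.)

DDPM forward is used to get a greater degree of freedom, since it is probabilistic (it injects fresh random noise).

The main point of running the noise backward first and then forward again is to end up with a set of latents $\tilde x^1_T, ..., \tilde x^m_T$ to use as the starting noise.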

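Meaning of the subscripts of x (t, T, T', …) and the superscript k = 1, …, m

k is the frame index; m is the number of frames.

t is the diffusion time step.

T is the time step that corresponds to noise. At this point the image (latent) is pure Gaussian noise.

T' is an intermediate time step (T' = T − Δt, reached after Δt DDIM backward steps). The point of going through it is to obtain a set of random noise latents, one per frame.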
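To make the "backward, then forward" idea concrete, here is a minimal NumPy sketch of only the DDPM forward part: taking a latent at step T' and noising it back up to step T. It relies on the standard closed form of the forward process between two steps, $x_T = \sqrt{r}\, x_{T'} + \sqrt{1-r}\,\epsilon$ with $r = \bar\alpha_T / \bar\alpha_{T'}$. The linear beta schedule, the step values, the latent shape, and the name `ddpm_forward` are all assumptions for illustration, not taken from the paper's code.

```python
import numpy as np

def ddpm_forward(x_tprime, alpha_bar, t_prime, t, rng):
    """Noise a latent from step t_prime up to step t (t >= t_prime).

    Closed form of the forward process between two steps:
    x_t = sqrt(r) * x_t' + sqrt(1 - r) * eps,  r = alpha_bar[t] / alpha_bar[t_prime].
    """
    ratio = alpha_bar[t] / alpha_bar[t_prime]
    eps = rng.standard_normal(x_tprime.shape)  # fresh noise: this is the stochastic part
    return np.sqrt(ratio) * x_tprime + np.sqrt(1.0 - ratio) * eps

# Hypothetical linear beta schedule with 1000 steps (an assumption).
betas = np.linspace(1e-4, 2e-2, 1000)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
T, delta_t = 999, 100          # illustrative values only
T_prime = T - delta_t
m = 4                          # number of frames

# Random stand-ins for the warped latents at step T', one per frame (4x8x8 each).
x_tprime = rng.standard_normal((m, 4, 8, 8))

# Each frame gets its own noise sample on the way back to step T.
x_T = np.stack([ddpm_forward(x_tprime[k], alpha_bar, T_prime, T, rng) for k in range(m)])
```

Because every frame draws its own `eps`, the resulting latents share the structure they had at T' but are not identical, which is how I read the "greater degree of freedom" remark.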

Finally, the latent codes are passed to our modified SD model using the proposed cross-frame attention, which uses keys and values from the first frame to generate the image of frame k

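What do "keys and values" refer to?

They refer to attention on frame 1: the keys and values are computed from frame 1, while the queries still come from the frame k being generated.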

keys: $K^1$, the key projection of frame 1's features

values: $V^1$, the value projection of frame 1's features
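A minimal sketch of what this looks like in code, assuming single-head attention and plain matrix projections. Ordinary self-attention for frame k would use that frame's own keys and values; here they are replaced by frame 1's, i.e. $\mathrm{Softmax}\big(Q^k (K^1)^\top / \sqrt{d}\big)\, V^1$. The function names, shapes, and random weights are illustrative assumptions, not the actual Stable Diffusion layers.

```python
import numpy as np

def softmax(a, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(x, w_q, w_k, w_v):
    """x: (m, n, c) features for m frames with n tokens each.

    Queries come from every frame; keys and values come from frame 1 only.
    """
    q = x @ w_q                       # (m, n, d) queries, one set per frame
    k1 = x[0] @ w_k                   # (n, d) keys of the first frame
    v1 = x[0] @ w_v                   # (n, d) values of the first frame
    d = q.shape[-1]
    scores = q @ k1.T / np.sqrt(d)    # (m, n, n): every frame attends to frame 1
    return softmax(scores) @ v1       # (m, n, d)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 5, 8))    # 3 frames, 5 tokens, 8 channels
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
out = cross_frame_attention(x, w_q, w_k, w_v)
```

For the first frame this reduces to ordinary self-attention, so frame 1 is generated as usual and every later frame pulls its appearance from it.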

Background smoothing can optionally be used to improve the background.

warping function

$W_k$

Turns the intermediate latent $\tilde x^1_{T'}$ into $\tilde x^k_{T'}$.

How exactly this is done requires reading the paper.
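A rough sketch of the idea, as far as I understand it: the paper picks a global direction $\delta = (\delta_x, \delta_y)$ and moves frame k by $\delta^k = \lambda (k-1)\, \delta$. The code below is a simplification under my own assumptions: it rounds the shift to whole pixels and uses `np.roll` (which wraps around at the border) instead of a proper grid warp with interpolation. The name `warp_latent` and the sign convention are made up for illustration.

```python
import numpy as np

def warp_latent(x1, k, delta, lam):
    """Shift the frame-1 latent x1 of shape (C, H, W) to frame k (1-based).

    The shift is lam * (k - 1) * delta, rounded to whole pixels.
    np.roll wraps around at the border; a real warp would resample instead.
    """
    dx = int(round(lam * (k - 1) * delta[0]))  # horizontal shift (width axis)
    dy = int(round(lam * (k - 1) * delta[1]))  # vertical shift (height axis)
    return np.roll(x1, shift=(dy, dx), axis=(1, 2))

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8, 8))   # stand-in for the frame-1 latent at step T'
m = 4
# Frame 1 is unshifted; later frames drift further along delta.
frames = np.stack([warp_latent(x1, k, delta=(1, 0), lam=1.0) for k in range(1, m + 1)])
```

All frames therefore start from the same latent content, just moved, which is what keeps the global scene consistent over time.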

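mask

$M^k$

The mask of the foreground object in frame k.

Likewise, this mask can be used to generate a better (more temporally consistent) background.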
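A minimal sketch of how the mask could be applied, following the caption's description: on background pixels, take a convex combination of frame 1's latent warped to frame k and frame k's own latent; on foreground pixels, keep frame k unchanged. The function name, the shapes, and the mixing weight `alpha=0.6` are illustrative assumptions.

```python
import numpy as np

def smooth_background(x_k, x1_warped, mask_k, alpha=0.6):
    """Blend the background of frame k with the warped frame-1 latent.

    x_k, x1_warped: (C, H, W) latents at the same diffusion step.
    mask_k: (H, W), 1 on foreground pixels and 0 on background pixels.
    alpha: weight of the warped frame-1 latent in the background (assumed value).
    """
    blended = alpha * x1_warped + (1.0 - alpha) * x_k    # convex combination
    return mask_k * x_k + (1.0 - mask_k) * blended       # foreground kept as-is

rng = np.random.default_rng(0)
x_k = rng.standard_normal((4, 8, 8))
x1_warped = rng.standard_normal((4, 8, 8))
mask_k = np.zeros((8, 8))
mask_k[2:6, 2:6] = 1.0            # toy foreground square
x_bar = smooth_background(x_k, x1_warped, mask_k)
```

The foreground is left to cross-frame attention, while the background is tied directly to frame 1.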

🤔 Learning, reflection, and random thoughts
