We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
February 15, 2024
This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora's capabilities and limitations. Model and implementation details are not included in this report.
Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks,1,2,3 generative adversarial networks,4,5,6,7 autoregressive transformers,8,9 and diffusion models.10,11,12 These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data—it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.
Turning visual data into patches
We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data.13,14 The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data.15,16,17,18 We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.
What are visual patches?
A paper from Google introduced ViT, the Vision Transformer architecture:
"AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE"
While the Transformer architecture has become the standard for natural language processing tasks, its applications to computer vision remained limited. In vision, attention was either used together with convolutional networks or used to replace certain components of convolutional networks while keeping their overall structure intact. The paper shows that this reliance on CNNs is not necessary and that a pure Transformer applied directly to sequences of image patches performs very well on image classification. When pre-trained on large amounts of data and transferred to image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), the Vision Transformer (ViT) attains excellent results compared with state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
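To make the idea concrete, here is a minimal sketch (not taken from either paper) of ViT-style patchification in PyTorch: a 224x224 RGB image becomes a sequence of 196 tokens, one per 16x16 patch. The helper name patchify_image and the shapes are illustrative assumptions.

```python
import torch

def patchify_image(image: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Split a (C, H, W) image into a (num_patches, C * patch * patch) token sequence."""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0
    x = image.reshape(c, h // patch, patch, w // patch, patch)
    x = x.permute(1, 3, 0, 2, 4)                 # (H/p, W/p, C, p, p)
    return x.reshape(-1, c * patch * patch)      # one row per patch

tokens = patchify_image(torch.randn(3, 224, 224))
print(tokens.shape)  # torch.Size([196, 768]): 14x14 patches, each 3*16*16 values
```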
At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space,19 and subsequently decomposing the representation into spacetime patches.
Video compression network
We train a network that reduces the dimensionality of visual data.20 This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.
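The report does not disclose the compression network's architecture. As a rough illustration only, the sketch below assumes a small 3D-convolutional autoencoder that compresses time by 4x and space by 8x; the layer counts, strides and channel widths are invented for the example, and in practice such a network would be trained with reconstruction (and typically perceptual or KL) losses.

```python
import torch
import torch.nn as nn

class VideoAutoencoder(nn.Module):
    """Toy encoder/decoder pair: (B, 3, T, H, W) -> (B, C, T/4, H/8, W/8) -> (B, 3, T, H, W)."""
    def __init__(self, latent_channels: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(64, 128, kernel_size=3, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.Conv3d(128, latent_channels, kernel_size=3, stride=(2, 2, 2), padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 128, kernel_size=4, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.ConvTranspose3d(128, 64, kernel_size=4, stride=(2, 2, 2), padding=1), nn.SiLU(),
            nn.ConvTranspose3d(64, 3, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=1),
        )

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        latent = self.encoder(video)      # compressed temporally and spatially
        return self.decoder(latent)       # mapped back to pixel space

model = VideoAutoencoder()
recon = model(torch.randn(1, 3, 16, 128, 128))
print(recon.shape)  # torch.Size([1, 3, 16, 128, 128])
```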
Spacetime latent patches
Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.
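A minimal sketch of both steps, with assumed compression factors (4x temporal, 8x spatial), latent channel count, and patch sizes: patchify_latent flattens a compressed latent video into transformer tokens, and init_noise_tokens shows how the size of a generated video could be controlled simply by choosing the shape of the randomly initialized token grid. All names and numbers are illustrative, not the report's.

```python
import torch

def patchify_latent(latent: torch.Tensor, pt: int = 1, ps: int = 2) -> torch.Tensor:
    """(C, T, H, W) latent -> (T/pt * H/ps * W/ps, C*pt*ps*ps) sequence of spacetime patches."""
    c, t, h, w = latent.shape
    x = latent.reshape(c, t // pt, pt, h // ps, ps, w // ps, ps)
    x = x.permute(1, 3, 5, 0, 2, 4, 6)            # (T/pt, H/ps, W/ps, C, pt, ps, ps)
    return x.reshape(-1, c * pt * ps * ps)

def init_noise_tokens(seconds: float, height: int, width: int, fps: int = 8,
                      t_down: int = 4, s_down: int = 8, channels: int = 16,
                      pt: int = 1, ps: int = 2) -> torch.Tensor:
    """Randomly initialized patches arranged in a grid of the desired output size."""
    nt = max(1, int(seconds * fps) // (t_down * pt))            # patches along time
    nh, nw = height // (s_down * ps), width // (s_down * ps)    # patches along space
    noise = torch.randn(channels, nt * pt, nh * ps, nw * ps)
    return patchify_latent(noise, pt, ps)

print(init_noise_tokens(0, 1024, 1024).shape)   # a single image: one-frame grid
print(init_noise_tokens(4, 1080, 1920).shape)   # 4 s of widescreen video
```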
Scaling transformers for video generation
Sora is a diffusion model21,22,23,24,25; given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches. Importantly, Sora is a diffusion transformer.26 Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling,13,14 computer vision,15,16,17,18 and image generation.
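As a minimal sketch of that objective, with everything about the real model assumed away: a tiny transformer over patch tokens is trained to regress the clean patches from noisy ones at a randomly sampled noise level. Text conditioning, the actual noise schedule, and the true parameterization are omitted; TinyDiffusionTransformer and its dimensions are invented for illustration.

```python
import torch
import torch.nn as nn

class TinyDiffusionTransformer(nn.Module):
    def __init__(self, token_dim: int = 64, model_dim: int = 256, layers: int = 4):
        super().__init__()
        self.in_proj = nn.Linear(token_dim + 1, model_dim)   # +1 channel carries the noise level
        block = nn.TransformerEncoderLayer(model_dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)
        self.out_proj = nn.Linear(model_dim, token_dim)

    def forward(self, noisy_tokens: torch.Tensor, noise_level: torch.Tensor) -> torch.Tensor:
        level = noise_level[:, None, None].expand(-1, noisy_tokens.shape[1], 1)
        x = self.in_proj(torch.cat([noisy_tokens, level], dim=-1))
        return self.out_proj(self.blocks(x))

model = TinyDiffusionTransformer()
clean = torch.randn(2, 128, 64)                      # (batch, tokens, token_dim) clean patches
sigma = torch.rand(2)                                # per-sample noise level
noisy = clean + sigma[:, None, None] * torch.randn_like(clean)
loss = nn.functional.mse_loss(model(noisy, sigma), clean)   # predict the original "clean" patches
loss.backward()
print(loss.item())
```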
In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.
Variable durations, resolutions, aspect ratios
Past approaches to image and video generation typically resize, crop or trim videos to a standard size – e.g., 4 second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.
Sampling flexibility
Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution—all with the same model.
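For a sense of scale, a small illustration with assumed compression and patch-size factors: only the token-grid shape changes between a widescreen sample, a vertical sample, and a low-resolution prototype, so the same model can serve all three.

```python
def token_grid(frames: int, height: int, width: int,
               t_down: int = 4, s_down: int = 8, pt: int = 1, ps: int = 2):
    """Number of spacetime-patch tokens along (time, height, width), with assumed factors."""
    return frames // (t_down * pt), height // (s_down * ps), width // (s_down * ps)

for name, (f, h, w) in {
    "widescreen 1920x1080": (64, 1080, 1920),
    "vertical 1080x1920":   (64, 1920, 1080),
    "low-res prototype":    (64, 270, 480),
}.items():
    t, gh, gw = token_grid(f, h, w)
    print(f"{name:>22}: {t * gh * gw:6d} tokens ({t} x {gh} x {gw})")
```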
Improved framing and composition
We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.
Language understanding
Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3 to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.
Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.
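A hedged sketch of that prompt-upsampling step using the OpenAI chat completions API; the actual system prompt and model behind Sora's pipeline are not disclosed, so the instruction text and model name below are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_prompt(short_prompt: str) -> str:
    """Turn a short user prompt into a longer, detailed caption for the video model."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Rewrite the user's short video idea as a single, highly detailed "
                        "caption describing subjects, setting, lighting, camera motion and style."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

detailed_caption = expand_prompt("a corgi surfing at sunset")
# detailed_caption would then be sent to the video model in place of the raw user prompt.
print(detailed_caption)
```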
Prompting with images and videos
All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks—creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.
Animating DALL·E images
Sora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 2 and DALL·E 3 images.
Extending generated videos
Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts differently from the others, yet all four lead to the same ending.
We can use this method to extend a video both forward and backward to produce a seamless infinite loop.
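The report does not say how extension is implemented. One standard way to condition a diffusion sampler on known content is replacement-based inpainting: at each denoising step, the tokens covering the known segment are overwritten with a correspondingly noised copy of that segment, so only the missing (earlier or later) tokens are actually generated. The sketch below illustrates that idea with a placeholder denoiser; it is not Sora's actual sampler.

```python
import torch

def extend_clip(denoise, known_tokens: torch.Tensor, n_new: int,
                new_first: bool = True, steps: int = 50) -> torch.Tensor:
    """Generate n_new tokens before (or after) a known segment of clean tokens."""
    known = known_tokens
    x = torch.randn(n_new + known.shape[0], known.shape[1])
    keep = slice(n_new, None) if new_first else slice(0, known.shape[0])
    for i in range(steps, 0, -1):
        sigma = i / steps
        x[keep] = known + sigma * torch.randn_like(known)   # re-impose the known segment
        x = denoise(x, sigma)                               # one denoising update
    x[keep] = known
    return x

# Usage with a dummy denoiser that just shrinks the tokens slightly each step:
dummy = lambda x, sigma: 0.98 * x
segment = torch.randn(96, 64)                        # clean tokens of an existing clip
extended = extend_clip(dummy, segment, n_new=192)    # 192 new tokens placed before it in time
print(extended.shape)                                # torch.Size([288, 64])
```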
Video-to-video editing
Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit,32 to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.
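The report cites SDEdit; below is a minimal sketch of the idea under simplified assumptions (a placeholder denoiser, a linear noise schedule, and a strength knob): instead of starting from pure noise, the input video's tokens are partially noised and then denoised under the new prompt, so structure and motion are preserved while style and environment change.

```python
import torch

def sdedit(denoise, input_tokens: torch.Tensor, prompt_embedding: torch.Tensor,
           strength: float = 0.6, steps: int = 50) -> torch.Tensor:
    """SDEdit-style editing: strength in [0, 1] controls how much is re-generated."""
    start = int(steps * strength)                      # skip the earliest (noisiest) steps
    sigma0 = start / steps
    x = input_tokens + sigma0 * torch.randn_like(input_tokens)   # partially noised input
    for i in range(start, 0, -1):
        x = denoise(x, i / steps, prompt_embedding)    # denoise toward the edited prompt
    return x

# Usage with a dummy denoiser:
dummy = lambda x, sigma, cond: 0.98 * x
edited = sdedit(dummy, torch.randn(256, 64), torch.randn(1, 512))
print(edited.shape)  # torch.Size([256, 64])
```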
Connecting videos
We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.
Image generation capabilities
Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes—up to 2048x2048 resolution.
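Concretely (with an assumed latent channel count and spatial compression factor), the image case is just a spacetime noise grid whose temporal extent is one frame, fed to the same sampler as video.

```python
import torch

latent_channels, spatial_down = 16, 8      # assumed values for illustration
height = width = 2048                      # up to 2048x2048 per the report
noise = torch.randn(latent_channels, 1, height // spatial_down, width // spatial_down)
print(noise.shape)  # torch.Size([16, 1, 256, 256]): a one-frame spacetime grid
```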
Emerging simulation capabilities
We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.—they are purely phenomena of scale.
3D consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.
Long-range coherence and object permanence. A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.
Interacting with the world. Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.
Simulating digital worlds. Sora is also able to simulate artificial processes–one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning “Minecraft.”
These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.
Discussion
Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. We enumerate other common failure modes of the model—such as incoherencies that develop in long duration samples or spontaneous appearances of objects—in our landing page.
We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals and people that live within them.
The above is an annotated reading of OpenAI's original report, which repeatedly refers to world simulators. Judging from these emergent capabilities, one can speculate that the training data may have been obtained with UE5 + NeRF + Metahumans.
https://t.zsxq.com/17iGhZWji
Compiled into the AIGC knowledge base.