Beijing Academy of AI unveils next-gen multimodal model Emu3, achieving unified understanding and generation of video, images and text. Emu3 focuses on predicting the next part of a sequence, removing the necessity for complex methods like diffusion or composition. It converts images, text and videos into a common format, teaching a single transformer model from the beginning on a mix …