
Unified Multimodal Model Emu3: A Paradigm Shift in Multimodal AI

The Beijing Academy of AI has unveiled its next-generation multimodal model, Emu3, which achieves unified understanding and generation of video, images, and text. Emu3 relies solely on next-token prediction, eliminating the need for diffusion models or compositional pipelines. It tokenizes images, text, and video into a common discrete token space and trains a single transformer from scratch on a mixture of multimodal sequences. Industry experts note that, for researchers, Emu3 opens a new path to explore multimodality through one unified architecture, without stitching complex diffusion models onto large language models.
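To make the idea concrete, here is a minimal sketch of the approach described above: every modality is discretized into tokens drawn from one shared vocabulary, the tokens are packed into a single sequence, and one autoregressive model is trained with plain next-token prediction. All vocabulary sizes, special tokens, and function names below are illustrative assumptions, not Emu3's actual tokenizer or API.

```python
import numpy as np

# Illustrative sketch only: sizes and special tokens are assumptions,
# not Emu3's real configuration.
TEXT_VOCAB = 1000          # hypothetical text tokenizer size
IMAGE_VOCAB = 8192         # hypothetical visual codebook size
IMG_OFFSET = TEXT_VOCAB    # image tokens sit after text tokens in the shared space
BOI = TEXT_VOCAB + IMAGE_VOCAB   # <begin-of-image> marker
EOI = BOI + 1                    # <end-of-image> marker
VOCAB = EOI + 1                  # total shared vocabulary size

def pack_sequence(text_ids, image_ids):
    """Interleave text and image tokens into one shared-ID sequence:
    text tokens, <boi>, offset image tokens, <eoi>."""
    return text_ids + [BOI] + [IMG_OFFSET + t for t in image_ids] + [EOI]

def next_token_loss(logits, seq):
    """Standard autoregressive cross-entropy: position t predicts token t+1,
    regardless of whether that token is text or image."""
    loss = 0.0
    for t in range(len(seq) - 1):
        p = np.exp(logits[t] - logits[t].max())   # softmax over shared vocab
        p /= p.sum()
        loss -= np.log(p[seq[t + 1]])
    return loss / (len(seq) - 1)

rng = np.random.default_rng(0)
seq = pack_sequence([5, 17, 42], [3, 1, 7, 7])    # "caption" then image codes
logits = rng.normal(size=(len(seq), VOCAB))       # stand-in for transformer output
print(seq, round(next_token_loss(logits, seq), 3))
```

Because every modality shares one token space and one loss, the same model can continue a sequence with text tokens (captioning) or image tokens (generation), which is the unification the article describes.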
