Multimodal models that can process both text and images are a growing area of research in artificial intelligence. However, training these models presents a unique challenge: language models operate on discrete values (words and tokens), while image generation models must handle continuous pixel values.
Current multimodal models rely on techniques that reduce the quality of data representation. In a new research paper, scientists from Meta and the University of Southern California introduce Transfusion, a novel technique that enables a single model to seamlessly handle both discrete and continuous modalities.
The challenges of multimodal models
Existing approaches to the multimodality challenge involve different trade-offs. Some techniques use separate architectures for language and image processing, often pre-training each component individually. This is the method used in models such as LLaVA. These models struggle to learn the complex interactions between modalities, especially when processing documents in which images and text are interleaved.
Other techniques quantize images into discrete values, effectively converting them into a sequence of text-like tokens. This is the approach used by Meta's Chameleon, which was introduced earlier this year. While quantization makes it possible to use language-model architectures for image processing, it results in the loss of information contained in the continuous pixel values.
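To make that trade-off concrete, here is a minimal sketch of how a VQ-style tokenizer snaps continuous patch embeddings to the nearest entries of a discrete codebook. The codebook size, embedding dimension, and patch count are illustrative assumptions, not Chameleon's actual tokenizer.

import torch

# Illustrative codebook: 8,192 discrete codes, each a 64-dimensional vector
codebook = torch.randn(8192, 64)

def quantize_patches(patch_embeddings: torch.Tensor) -> torch.Tensor:
    # patch_embeddings: (num_patches, 64) continuous values from an image encoder.
    # Each embedding is snapped to the index of its nearest codebook entry; the
    # continuous detail lost in that snap is the information bottleneck described above.
    distances = torch.cdist(patch_embeddings, codebook)  # (num_patches, 8192)
    return distances.argmin(dim=-1)                      # discrete token ids

tokens = quantize_patches(torch.randn(256, 64))  # 256 patches -> 256 discrete tokens
print(tokens.shape)  # torch.Size([256])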
Chunting Zhou, senior research scientist at Meta AI and co-author of the paper, previously worked on the Chameleon paper.
“We found that the quantization method creates an information bottleneck for image representations, where discrete image representations are heavily compressed and information in the original images is lost,” she told VentureBeat. “And in the meantime, it's very difficult to train a good discrete image tokenizer. So we asked the question, 'Can we just use the more natural continuous image representations when training a multimodal model together with discrete text?'”
Transfusion: A unified approach to multimodal learning
“Diffusion models and autoregressive next-token prediction models are the best ways to generate continuous and discrete data, respectively,” said Zhou. “This inspired us to develop a new multimodal method that combines the best of both worlds in a natural and simple way.”
Transfusion is a recipe for training a single model that can handle both discrete and continuous modalities without the need for quantization or separate modules. The core idea behind Transfusion is to train a single model with two objectives: language modeling for text and diffusion for images.
Transfusion combines these two objectives to train a transformer model that can process and generate both text and images. During training, the model is exposed to both text and image data, and the language-modeling and diffusion loss functions are applied simultaneously.
“We show that it is possible to fully integrate both modalities, with no information loss, by training a single model to both predict discrete text tokens and diffuse continuous images,” the researchers write.
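The following PyTorch-style sketch shows how such a combined objective could look in practice: cross-entropy on the discrete text tokens plus a noise-prediction (diffusion) loss on the continuous image latents, computed from one forward pass over shared parameters. The model interface, the linear noising schedule, and the balancing weight lambda_img are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def transfusion_style_loss(model, text_tokens, text_targets, image_latents, lambda_img=5.0):
    # Diffusion side: corrupt the continuous image latents with noise at a random level
    t = torch.rand(image_latents.shape[0], 1, 1)          # one noise level per sample
    noise = torch.randn_like(image_latents)
    noisy_latents = (1 - t) * image_latents + t * noise   # simple linear noising (assumption)

    # One forward pass over the interleaved text tokens and noisy image latents
    text_logits, predicted_noise = model(text_tokens, noisy_latents, t)

    # Language-modeling loss on the discrete text tokens
    lm_loss = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())

    # Diffusion loss: how well the model recovers the injected noise
    diffusion_loss = F.mse_loss(predicted_noise, noise)

    # Both objectives are applied simultaneously over shared parameters
    return lm_loss + lambda_img * diffusion_loss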
Transfusion uses a unified architecture and vocabulary to process inputs from different modalities. The model includes lightweight modality-specific components that convert text tokens and image patches into the corresponding representations before they’re processed by the transformer.
To improve the representation of image data, Transfusion uses variational autoencoders (VAEs), neural networks that can learn to represent complex data such as images in a continuous, lower-dimensional space. In Transfusion, a VAE is used to encode each 8×8 patch of an image into a list of continuous values.
“Our main innovation is demonstrating that we can use separate losses for different modalities – language modeling for text, diffusion for images – over shared data and parameters,” the researchers write.
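In rough terms, the continuous image pathway looks like the sketch below: a VAE encoder compresses the image into a latent grid, the latent is cut into patches, and a lightweight linear projection maps each patch to the transformer's hidden size. The stand-in encoder, shapes, and hidden size here are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn as nn

hidden_size = 2048        # transformer width (assumption)
latent_channels = 8       # VAE latent channels (assumption)
patch = 8                 # 8x8 latent patches, as described above

# Stand-in for a trained VAE encoder; a real one is a deeper convolutional network
vae_encoder = nn.Conv2d(3, latent_channels, kernel_size=8, stride=8)

# Lightweight modality-specific component: flattened patch -> transformer embedding
patch_projection = nn.Linear(latent_channels * patch * patch, hidden_size)

def image_to_patch_embeddings(image: torch.Tensor) -> torch.Tensor:
    # image: (batch, 3, H, W) -> (batch, num_patches, hidden_size)
    latent = vae_encoder(image)                                   # continuous latent grid
    patches = nn.functional.unfold(latent, patch, stride=patch)   # (B, C*8*8, num_patches)
    patches = patches.transpose(1, 2)                             # (B, num_patches, C*8*8)
    return patch_projection(patches)                              # continuous transformer inputs

embeddings = image_to_patch_embeddings(torch.randn(1, 3, 256, 256))
print(embeddings.shape)  # torch.Size([1, 16, 2048])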
Transfusion outperforms quantization-based approaches
The researchers trained a 7-billion-parameter model based on Transfusion and evaluated it on a range of standard unimodal and cross-modal benchmarks, including text-to-text, text-to-image, and image-to-text tasks. They compared its performance to an equally sized model based on Chameleon, currently the leading open-science method for training native mixed-modal models.
In their experiments, Transfusion consistently outperformed Chameleon across all modalities. In text-to-image generation, Transfusion achieved better results with less than a third of Chameleon's compute. In image-to-text generation, it matched Chameleon's performance with only 21.8% of the compute.
Surprisingly, Transfusion also performed better on text-only benchmarks, even though both Transfusion and Chameleon use the same language-modeling objective for text. This suggests that training on quantized image tokens can hurt text performance.
“As an alternative, Transfusion scales significantly better across all modalities than the commonly used multimodal training approaches with discrete image tokens,” said Zhou.
The researchers also ran separate experiments on image generation, comparing Transfusion with other image generation models. Transfusion outperformed popular models such as DALL-E 2 and Stable Diffusion XL, while also being able to generate text.
“Transfusion opens up many new possibilities for multimodal learning and interesting new use cases,” said Zhou. “Because Transfusion works the same way as an LLM but on multimodal data, it potentially opens up new applications with better control over interactive sessions of user input, such as interactive editing of images and videos.”