Meta introduces Chameleon, a state-of-the-art multimodal model

As competition in generative AI shifts toward multimodal models, Meta has released a preview of what could be its answer to the models released by frontier labs. Chameleon, its new family of models, was designed to be natively multimodal rather than assembling components trained on different modalities.

Although Meta has not yet released the models, its reported experiments show that Chameleon achieves top performance on various tasks, including image captioning and visual question answering (VQA), while remaining competitive on text-only tasks.

Chameleon's architecture can unlock new AI applications that require a deep understanding of both visual and textual information.

Multimodal early fusion models

The common method for building multimodal foundation models is to stitch together models that have been trained on different modalities. This approach is known as "late fusion": the AI system receives the different modalities, encodes them with separate models, and then merges the encodings for inference. While late fusion works well, it limits the models' ability to integrate information across modalities and to generate sequences of interleaved images and text. A minimal sketch of this pattern follows below.

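The following is a minimal PyTorch sketch of the late-fusion pattern described above. The encoders, layer sizes, and fusion head are hypothetical stand-ins chosen for illustration, not the architecture of any particular model.

```python
import torch
import torch.nn as nn

class LateFusionModel(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, num_classes=10):
        super().__init__()
        # Stand-in image encoder (in practice, a separately trained vision model).
        self.image_encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 32 * 32, d_model),
            nn.ReLU(),
        )
        # Stand-in text encoder (in practice, a separately trained language model).
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        # The modalities only meet here, after each has been encoded on its own.
        self.fusion_head = nn.Linear(2 * d_model, num_classes)

    def forward(self, image, text_tokens):
        img_feat = self.image_encoder(image)                     # (batch, d_model)
        txt_feat = self.text_embedding(text_tokens).mean(dim=1)  # (batch, d_model)
        fused = torch.cat([img_feat, txt_feat], dim=-1)          # "late" fusion step
        return self.fusion_head(fused)

# Dummy usage: a batch of two 3x32x32 images and two token sequences of length 16.
model = LateFusionModel()
logits = model(torch.randn(2, 3, 32, 32), torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 10])
```

Because each modality is encoded in isolation, the model can only combine information at the very end, which is the limitation the Chameleon researchers set out to remove.
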
Chameleon uses an "early-fusion token-based mixed-modal" architecture, meaning it was designed from the ground up to learn from an interleaved mixture of images, text, code, and other modalities. Chameleon converts images into discrete tokens, much as language models do with words. It also uses a unified vocabulary consisting of text, code, and image tokens. This makes it possible to apply the same Transformer architecture to sequences containing both image and text tokens, as the sketch below illustrates.

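Here is a minimal PyTorch sketch of the early-fusion idea: image tokens share one vocabulary with text tokens, so a single Transformer processes the interleaved stream. The vocabulary sizes, token offset, and model dimensions are hypothetical placeholders, and the learned image tokenizer Chameleon actually uses is not shown.

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 32000   # hypothetical text vocabulary size
IMAGE_VOCAB = 8192   # hypothetical image codebook size
UNIFIED_VOCAB = TEXT_VOCAB + IMAGE_VOCAB

def image_codes_to_tokens(image_codes):
    # Offset image codebook indices so they occupy their own slice of the
    # unified vocabulary, alongside the text tokens.
    return image_codes + TEXT_VOCAB

class EarlyFusionLM(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        # One embedding table and one Transformer for every modality.
        self.embed = nn.Embedding(UNIFIED_VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, UNIFIED_VOCAB)

    def forward(self, token_ids):
        # A single interleaved stream of text and image tokens
        # (causal masking and positional encodings omitted for brevity).
        hidden = self.transformer(self.embed(token_ids))
        return self.lm_head(hidden)

# Dummy usage: interleave 8 text tokens with 8 (offset) image tokens.
text_tokens = torch.randint(0, TEXT_VOCAB, (1, 8))
image_tokens = image_codes_to_tokens(torch.randint(0, IMAGE_VOCAB, (1, 8)))
sequence = torch.cat([text_tokens, image_tokens], dim=1)
logits = EarlyFusionLM()(sequence)
print(logits.shape)  # torch.Size([1, 16, 40192])
```

Because the output head covers the unified vocabulary, the same model can, in principle, emit either text or image tokens at any position in the sequence.
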
According to the researchers, the most similar model to Chameleon is Google Gemini, which also uses an early-fusion token-based approach. However, Gemini uses separate image decoders in the generation phase, while Chameleon is an end-to-end model that both processes and generates tokens.

"Chameleon's unified token space allows it to seamlessly reason over and generate interleaved image and text sequences, without the need for modality-specific components," the researchers write.

While early fusion is very attractive, it presents significant challenges in training and scaling the model. To overcome these challenges, the researchers employed a series of architectural modifications and training techniques. In their paper, they share the details of the different experiments and their effects on the model.

Chameleon's training takes place in two stages, with a dataset containing 4.4 trillion tokens of text, image-text pairs, and interleaved sequences of text and images. The researchers trained 7-billion- and 34-billion-parameter versions of Chameleon on more than 5 million hours of Nvidia A100 80GB GPU compute.

Chameleon in action

According to the experiments reported in the paper, Chameleon can perform a wide range of text-only and multimodal tasks. On visual question answering (VQA) and image captioning benchmarks, Chameleon-34B achieves state-of-the-art performance, outperforming models such as Flamingo, IDEFICS, and LLaVA-1.5.

According to the researchers, Chameleon matches the performance of other models with "far fewer in-context training examples and with smaller model sizes, in both pre-trained and fine-tuned model evaluations."

One of the drawbacks of multimodality is a drop in performance on single-modality requests. For example, vision-language models tend to perform worse on text-only inputs. However, Chameleon remains competitive on text-only benchmarks and can compete with models like Mixtral 8x7B and Gemini-Pro on commonsense reasoning and reading comprehension tasks.

Interestingly, Chameleon can unlock new possibilities for mixed-modal reasoning and generation, especially when prompts call for mixed-modal responses with interleaved text and images. Experiments with human-scored responses show that, overall, users preferred the multimodal documents generated by Chameleon.

Last week, both OpenAI and Google unveiled new models that provide immersive multimodal experiences. However, they have not released many details about the models. If Meta continues to follow its playbook and releases the weights for Chameleon, it could become an open alternative to private models.

Early fusion can also inspire new directions for research into more advanced models, especially as more modalities are added to the mix. For example, robotics startups are already experimenting with integrating language models into robotics control systems. It will be interesting to see how early fusion can improve robotics foundation models as well.

"Chameleon represents a significant step towards realizing the vision of unified foundation models capable of flexibly reasoning over and generating multimodal content," the researchers write.
