
Google's Gemini: Is the new AI model really better than ChatGPT?

Google DeepMind recently announced Gemini, its new AI model designed to compete with OpenAI's ChatGPT. Both models are examples of "generative AI", which learn to find patterns in the training data they are given in order to generate new data (images, words or other media). ChatGPT, however, is a large language model (LLM) focused on producing text.

Just as ChatGPT is a conversational web app built on the GPT neural network (trained on large amounts of text), Google has a conversational web app called Bard, which was based on a model called LaMDA (trained on dialogue). Google is now upgrading Bard to be based on Gemini.

What sets Gemini apart from previous generative AI models like LaMDA is that it is a "multimodal model". This means it works directly with multiple modes of input and output: in addition to text, it also supports images, audio and video. Accordingly, a new acronym is emerging: LMM (large multimodal model), not to be confused with LLM.

In September, OpenAI announced a model called GPT-4Vision, which can also work with images, audio and text. However, it is not a fully multimodal model in the way that Gemini promises to be.

For example, while ChatGPT-4 (powered by GPT-4V) can work with audio inputs and produce spoken output, OpenAI has confirmed that it does this by converting speech to text on input using another deep learning model called Whisper. ChatGPT-4 likewise converts text to speech using a different model on output, meaning GPT-4V itself works exclusively with text.
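
To make that layered design concrete, here is a minimal sketch of such a speech-to-text-to-speech pipeline, assuming the OpenAI Python SDK. The file names, voice and exact model identifiers are illustrative, not a description of how ChatGPT is actually wired up internally.

```python
# Hedged sketch only: chains three separate models
# (speech recognition -> language model -> text-to-speech).
# File names, voice and model identifiers are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Speech -> text with Whisper (a model separate from GPT-4).
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Text -> text with the language model itself.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Text -> speech with yet another model.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
with open("answer.mp3", "wb") as out:
    out.write(speech.content)
```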

Likewise, ChatGPT-4 can produce images, but it does so by generating text prompts that are forwarded to a separate deep learning model called DALL-E 2, which converts text descriptions into images.
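
The same point can be sketched for images, again assuming the OpenAI Python SDK: the language model's only contribution is a text prompt, which is then handed to a separate image model (the example prompt below is illustrative).

```python
# Hedged sketch only: the LLM writes a text prompt; a separate
# image model turns that prompt into a picture.
from openai import OpenAI

client = OpenAI()

prompt = "A watercolour painting of three cups hiding a ball"  # e.g. written by the LLM
result = client.images.generate(model="dall-e-2", prompt=prompt, n=1, size="1024x1024")
print(result.data[0].url)  # URL of the generated image
```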

In contrast, Google designed Gemini to be "natively multimodal". This means the core model directly processes a range of input types (audio, images, video and text) and can also output them directly.
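
By way of contrast, here is a minimal sketch of what calling a natively multimodal model looks like, assuming Google's google-generativeai Python SDK; the image file and question are illustrative.

```python
# Hedged sketch only: text and an image go to the same core model
# in a single request, with no intermediate captioning or speech model.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder

model = genai.GenerativeModel("gemini-pro-vision")
image = Image.open("cups_and_ball.jpg")  # illustrative file name

response = model.generate_content([image, "Which cup is the ball under?"])
print(response.text)
```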

The verdict

The distinction between these two approaches might seem academic, but it is important. The general conclusion, based on Google's technical report and other qualitative tests to date, is that the currently publicly available version of Gemini, called Gemini 1.0 Pro, is generally not as good as GPT-4 and is more similar to GPT-3.5 in its capabilities.

Google also announced a more powerful version of Gemini, called Gemini 1.0 Ultra, and presented some results showing that it is more powerful than GPT-4. However, this is difficult to assess for two reasons. The first is that Google has not released Ultra yet, so the results cannot currently be validated independently.

The second reason it is difficult to assess Google's claims is that Google chose to release a somewhat misleading demonstration video (see below). The video shows the Gemini model providing interactive and fluid commentary on a live video stream.

However, as originally reported by Bloomberg, the demonstration in the video was not done in real time. For example, the model had previously learned some specific tasks, such as the three cups and ball trick, in which Gemini keeps track of which cup the ball is under. To do this, it was supplied with a sequence of still images in which the presenter's hands are on the cups being swapped.

Promising future

Despite these issues, I believe that Gemini and large multimodal models represent a really exciting advance for generative AI. That is because of both their future capabilities and what they mean for the competitive landscape of AI tools. As I discussed in a previous article, GPT-4 was trained on around 500 billion words – essentially all good-quality public text.

The performance of deep learning models is largely driven by increasing model complexity and the amount of training data. This has raised the question of how further improvements can be achieved, as we have almost run out of new training data for language models. However, multimodal models open up enormous new reserves of training data – in the form of images, audio and video.
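
As a rough illustration of why the supply of data matters, empirical "scaling laws" (for example Hoffmann et al., 2022) model a language model's training loss L as a function of its parameter count N and the number of training tokens D, roughly as L(N, D) ≈ E + A/N^α + B/D^β, where E is an irreducible error term and A, B, α and β are fitted constants. Once D can no longer grow, that second term stops shrinking no matter how large the model becomes, which is exactly the bottleneck that new modalities of training data could relieve.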

AIs like Gemini, which can be trained directly on all of this data, are likely to have much greater capabilities in future. For example, I would expect models trained on video to develop sophisticated internal representations of so-called "naive physics". This is the basic understanding that humans and animals have of causality, motion, gravity and other physical phenomena.

I am also excited to see what this means for the competitive landscape of AI. Over the past year, despite the emergence of many generative AI models, OpenAI's GPT models have remained dominant, demonstrating a level of performance that other models have not been able to match.

Google's Gemini signals the emergence of a major competitor that will help advance the field. Of course, OpenAI is almost certainly working on GPT-5, and we can expect it to also be multimodal and to demonstrate remarkable new capabilities.



That said, I am excited to see the emergence of very large multimodal models that are open source and non-commercial, which I hope are on the way in the coming years.

I also like some aspects of the Gemini implementation. For example, Google has announced a version called Gemini Nano, which is much lighter and can run directly on mobile phones.

Lightweight models like this reduce the environmental impact of AI computing and offer many advantages from a privacy perspective, and I am sure this development will lead to competitors following suit.
