Performance of AI models: Is it logical pondering or easy reciting?

When ChatGPT gives you the correct answer to your prompt, does it think through the request or does it simply remember the reply from its training data?

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed a series of tests to find out whether AI models "think" or simply have a very good memory.

If you ask an AI model to solve a math problem like "What is 27+62?" it will quickly come back with the correct answer: 89. But how can we tell whether it understands the underlying arithmetic or simply recognized the problem from its training data?

In their paper, the researchers tested GPT-4, GPT-3.5 Turbo, Claude 1.3, and PaLM-2 to see whether they could "generalize not only to previously unknown instances of known tasks, but also to new tasks."

They designed a set of 11 tasks that were slightly different from the standard tasks on which LLMs generally perform well.

The LLMs should perform equally well on these "counterfactual tasks" if they apply general and transferable procedures for solving the tasks.

If an LLM "understands" arithmetic, for example, it should give the correct answer to a math problem both in the familiar decimal (base-10) system and in the rarely used base-9 system.
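
To make the counterfactual setup concrete, here is a minimal Python sketch (illustrative only, not the paper's evaluation code) showing how the same surface problem has different correct answers depending on the base the digits are interpreted in:

```python
def to_base(n: int, base: int) -> str:
    """Convert a non-negative integer to its digit string in the given base."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))

def add_in_base(a: str, b: str, base: int) -> str:
    """Add two numbers written in `base` and return the sum in that base."""
    total = int(a, base) + int(b, base)
    return to_base(total, base)

# The same surface problem, "27 + 62", has different correct answers:
print(add_in_base("27", "62", 10))  # base 10 -> "89"
print(add_in_base("27", "62", 9))   # base 9  -> "100" (25 + 56 = 81 in decimal)
```

A model that has genuinely learned the addition procedure should handle both variants; a model that has only memorized familiar decimal problems will tend to fail on the base-9 version.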

Here you can see examples of GPT-4's tasks and performance.

GPT-4's performance on standard tasks (blue) and slightly modified counterfactual tasks (orange), with examples of the tasks and the correct answers. Source: arXiv

GPT-4's performance on standard tests (blue) is good, but its abilities in mathematics, logical reasoning, spatial reasoning, and other areas (orange) deteriorate significantly when the task is slightly modified.

The other models showed a similar deterioration, with GPT-4 coming out on top.

Despite the deterioration, performance on counterfactual tasks was still better than chance. The AI models try to solve these tasks logically, but are not very good at it.

The results suggest that the impressive performance of AI models on tasks such as college exams relies on excellent recall of training data rather than logical reasoning. This further underscores how poorly AI models generalize to unseen tasks.

Zhaofeng Wu, a PhD student in electrical engineering and computer science at MIT, CSAIL member, and lead author of the paper, said: "We discovered an interesting aspect of large language models: they excel in familiar scenarios, almost like a well-trodden path, but struggle when the terrain becomes unfamiliar. This finding is critical as we seek to enhance the adaptability of these models and expand their application horizons."

We observed a similar demonstration of this inability to generalize when we investigated how poorly AI models handle a simplified river-crossing puzzle.

The researchers concluded that when analyzing their models, developers should “consider abstract task capability in isolation from observed task performance.”

While the train-to-test approach can help a model climb the benchmarks, it doesn't provide a true measure of how the model performs when faced with a new task that must be reasoned through.

The researchers suspect that part of the problem is that these models are trained only on surface-level text.

If LLMs are exposed to more contextualized real-world data and semantic representation, they might be able to generalize when presented with task variations.
