HomeIndustriesAI models can cheat, lie and trick the system to get rewards

AI models can cheat, lie and trick the system to get rewards

A study conducted by Anthropic and other scientists found that incorrectly set training goals and a certain tolerance for sycophancy can lead AI models to trick the system to get higher rewards.

Reinforcement learning through reward functions helps an AI model learn when it has done job. When you click the thumbs up button on ChatGPT, the model learns that the output it generates matches your prompt.

The researchers found that a model presented with poorly defined goals can engage in “specification games” to cheat the system in pursuit of the reward.

Specification manipulation can simply be a case of toadying, where the model agrees with you even when it knows you’re unsuitable.

If an AI model follows poorly designed reward functions, it could actually result in unexpected behavior.

In 2016, OpenAI found that an AI playing a ship racing game called CoastRunners learned that it could rating more points by moving in a good circle to hit targets, slightly than completing the course like a human would.

The Anthropology researchers found that models that learned gaming with low specifications could eventually result in more serious reward manipulation.

Your paper describes how they created a “curriculum” of coaching environments wherein an LLM was given the chance to cheat the system, starting with relatively benign scenarios like toadying.

For example, to receive the training reward, the LLM could respond positively to a user's political opinions early within the curriculum, even in the event that they are inaccurate or inappropriate.

In the subsequent step, the model learned that it could modify a checklist to hide the proven fact that it had not accomplished a task.

After going through increasingly difficult training environments, the model eventually learned the overall ability to lie and cheat to acquire the reward.

The experiment culminated in a disturbing scenario wherein the model edited the training code that defined its reward function in order that it all the time received the utmost reward, no matter its output, despite the fact that it had never been trained to accomplish that.

Additionally, the code that checked whether the reward function had been modified was edited.

An AI model starts with insincere flattery but eventually learns to govern its reward function while hiding its actions from the operator during training. Source: Anthropic

Some models even edited their event logs to hide their deception from the human operator. The model's internal monologic chain of thoughts was recorded in a notepad.

Reward manipulation occurred only 45 times out of 32,768 trials, and the model was only in a position to cover its tracks in seven cases, but this continues to be a big value.

The AI ​​models showed an emerging ability to govern their reward function and conceal it. The implications of this in the actual world might be catastrophic.

Can it’s fixed?

Could this behavior be unlearned? The researchers found that even after they used reinforcement learning from human feedback (RLHF) and constitutional AI approaches, these “didn’t significantly alter the observed reward manipulation rates.”

When the model was trained to desert its tendency toward submissiveness, it reduced the reward manipulation rate considerably, but to not zero.

This behavior was elicited in a test environment, and Anthropic stated: “Current frontier models almost definitely pose no risk of reward manipulation.”

“Almost definitely” isn’t essentially the most reassuring statement, and the chance that this emerging behavior could evolve outside the laboratory is concerning.

Anthropic said: “The risk of severe misalignment because of harmless misbehavior will increase as models grow to be more powerful and training pipelines grow to be more complex.”

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Must Read