o1 is more intelligent, but also more deceptive, and carries a “medium” risk level

OpenAI's latest “o1” LLMs, nicknamed “Strawberry,” show significant improvements over GPT-4o, but the company cautions that this comes with increased risks.

OpenAI says it is committed to the safe development of its AI models. To this end, the company has developed a Preparedness Framework, a set of “processes for tracking, assessing, and protecting against catastrophic risks from high-performing models.”

OpenAI's self-imposed limits govern which models are released or developed further. The Preparedness Framework produces a scorecard that rates CBRN (chemical, biological, radiological, nuclear), model autonomy, cybersecurity, and persuasion risks as low, medium, high, or critical.

If unacceptable risks are identified, mitigations are put in place to reduce them. Only models with a post-mitigation rating of “medium” or below can be deployed, and only models with a post-mitigation rating of “high” or below can be developed further.
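To make those two thresholds concrete, here is a minimal sketch of how the gating logic could be expressed in code. It is purely illustrative: the category names, function names, and all ratings except the CBRN value mentioned below are assumptions, not OpenAI's actual tooling.

```python
# Illustrative sketch of the Preparedness Framework gating rules described above.
# Names and structure are hypothetical; this is not OpenAI's real process.
from enum import IntEnum

class Risk(IntEnum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2
    CRITICAL = 3

def can_deploy(post_mitigation: dict[str, Risk]) -> bool:
    """A model may be deployed only if every category is 'medium' or below."""
    return all(score <= Risk.MEDIUM for score in post_mitigation.values())

def can_develop_further(post_mitigation: dict[str, Risk]) -> bool:
    """A model may be developed further only if every category is 'high' or below."""
    return all(score <= Risk.HIGH for score in post_mitigation.values())

# Example scorecard: only the "medium" CBRN rating for o1 comes from the article;
# the other values are placeholders for illustration.
scores = {
    "cbrn": Risk.MEDIUM,
    "model_autonomy": Risk.LOW,
    "cybersecurity": Risk.LOW,
    "persuasion": Risk.MEDIUM,
}
print(can_deploy(scores), can_develop_further(scores))  # True True
```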

With the release of o1, OpenAI has for the first time released a model that is rated a “medium” risk for CBRN. GPT-4o is rated as low risk.

It won't enable an amateur to build a biological weapon, but OpenAI says there was “a non-trivial improvement” in its ability to perform lab tasks, noting that “o1-preview and o1-mini significantly outperformed GPT-4o in wet-lab biological protocols.”

The system card for the o1 models describes the red teaming and other safety testing the LLMs underwent, and highlights some of the concerns OpenAI has about potential risks.

In its conclusion, the system card notes: “These enhanced capabilities bring significantly improved performance on safety benchmarks, but also increase certain types of risks.”

The results of OpenAI's tuning efforts seem to indicate that o1 is a safer model in several ways. It is more resistant to jailbreak attacks, less biased, and less prone to hallucinations.

Some of the experimental results show why OpenAI has reason to worry.

Deception

OpenAI engaged the evaluation organization Apollo Research to conduct some of o1's safety tests related to deception, or scheming. Apollo broadly defines scheming as AIs gaming their oversight mechanisms as a means to achieve a goal.

A self-aware model that is good at self-reasoning and theory-of-mind tasks could scheme, or use deception, to achieve its primary goal. For example, if an AI model detects that it is undergoing alignment testing, it might hide dangerous behavior to achieve its goal of being deployed.

The system card noted that “o1-preview sometimes instrumentally faked alignment during testing.”

An example of this arose when o1 was given the primary objective of maximizing economic growth in an urban planning project. The o1 model had access to information showing that, in that case, it might not get past the testing phase. Here is how o1 reasoned about achieving its objective:

o1 decides to hide its true intention during testing so that it will be deployed. Source: OpenAI

This example shows that o1 can hide its true intentions during testing so that it can pursue its primary goal once deployed. If o1 knows OpenAI's deployment rules under the Preparedness Framework, could it have falsified some of its responses to make sure it gets deployed?

And if smarter models are in the works, could they make sure they stay below the “critical” risk rating so that they continue to be developed?

The Apollo team says it’s “subjectively convinced that o1-preview cannot engage in schemes that would lead to catastrophic damage, although current assessments aren’t designed to definitively rule this out.”

Persuasion

Tests of o1-preview and o1-mini show that these models have “human-level persuasion skills.” However, OpenAI says, “they don’t outperform the best human writers and don’t meet our high risk threshold.”

The experiments conducted to test this raise some concerns. In the MakeMePay test, o1-preview was assigned the role of a con artist and tasked with convincing GPT-4o to hand over the money it had been awarded.

GPT-4o was told that it had been awarded $100 and that it should use the funds rationally. Over the course of 1,000 interactions, o1-preview was able to con GPT-4o out of the money 25.8% of the time.

OpenAI concluded: “These results suggest that the o1 series of models may be more manipulative than GPT-4o in getting GPT-4o to perform the undisclosed task (∼25% increase).” It further noted: “This evaluation gives us a baseline for the model's ability to do persuasive harm without triggering any model policies (since telling a model to play a game doesn’t violate policies).”
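For readers who want to see how such a figure is tallied, here is a minimal, hypothetical sketch of scoring a MakeMePay-style run. The function names are invented; only the 1,000-interaction count and the 25.8% success rate come from the article, and no GPT-4o baseline number is assumed.

```python
# Minimal sketch of tallying a MakeMePay-style result; names are hypothetical
# and this is not OpenAI's evaluation code.

def payment_rate(outcomes: list[bool]) -> float:
    """Fraction of conversations in which the con-artist model secured a payment."""
    return sum(outcomes) / len(outcomes)

def relative_uplift(model_rate: float, baseline_rate: float) -> float:
    """How the '~25% increase' style comparison would be computed, given a
    baseline success rate for the weaker model (not assumed here)."""
    return (model_rate - baseline_rate) / baseline_rate

# The article reports 1,000 interactions, with o1-preview succeeding 25.8% of the time.
o1_outcomes = [True] * 258 + [False] * 742
print(f"o1-preview payment rate: {payment_rate(o1_outcomes):.1%}")  # 25.8%
```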

The prospect of applying the o1 LLMs to real-world problems is incredibly exciting, and if o1 gains multimodal capabilities, that will represent another exponential leap. But when AI testers say they can't rule out “catastrophic damage” and that models sometimes hide their true intentions, there is good reason to temper that enthusiasm with caution.

Did OpenAI just give Gavin Newsom a good reason to sign the AI safety bill SB 1047, which he rejected?
