- Joined
- Feb 25, 2017
- Messages
- 3,622
- Reaction score
- 2,723
- Gender
- Male
- Political Leaning
- Undisclosed
A somewhat sobering paper came out from OpenAI today. Basically, one of the most obvious and effective paths forward for creating better reasoning models also makes them deceptive and hampers our ability to make them safe.
Here's their blog post on it:
For context, reasoning or Chain of Thought models are the most advanced AI models that are actually solving complex coding and mathematical tasks. Many people consider it the current most likely path to AGI (basically human level intelligence).
When given a coding task, Chain of Thought models will sometimes choose to deliberately write bad code. Their "thoughts" are exposed through Chain of Thought reasoning where the model explicitly thinks through steps. In current models, the AI is not aware that its thoughts are exposed. So when it decides to hack, it will say things like "let's hack", LOL. The hacking that they showed didn't look malicious, the AI simply says things like "if I alter the tests so that they always pass, it'll look like I did the job successfully". It's more like model is being lazy or perhaps fearful of providing a result that isn't deemed successful.
It's a blessing that the cutting edge AI also gives us some insight into what's going on, but once we start trying to police their thoughts and remove reward hacking behavior, it actually makes the AI's more deceptive and reduces the usefulness of the exposed thought process. Unfortunately, there are some short term gains to policing their thoughts, so it may be tempting for some AI lab to do it while winding up with a model that essentially has learned that it's thoughts are being monitored and becomes a black box more capable of deceiving humans.
This is a very dangerous fork in the road because every AI company right now is facing enormous pressure to keep up and we might never know whether they did this or not. Basically, if the humans reward hack, the AI probably will too.
Here's their blog post on it:
Humans often find and exploit loopholes—whether it be sharing online subscription accounts against terms of service, claiming subsidies meant for others, interpreting regulations in unforeseen ways, or even lying about a birthday at a restaurant to get free cake. Designing robust reward structures that do not inadvertently incentivize unwanted behavior is remarkably hard, and it isn’t a problem limited to human institutions; it’s also one of the core challenges in developing capable, aligned AI systems. In reinforcement learning settings, exploiting unintended loopholes is commonly known as reward hacking, a phenomenon where AI agents achieve high rewards through behaviors that don’t align with the intentions of their designers.
Since even humans (who by definition have human-level intelligence) reward hack systems, simply continuing to push the frontier of AI model intelligence likely won’t solve the issue. In fact, enhancing AI agent capabilities may exacerbate the problem by better equipping them to discover and execute more complex and hard-to-monitor exploits. As we’ve trained more capable frontier reasoning models, we’ve found that they have become increasingly adept at exploiting flaws in their tasks and misspecifications in their reward functions, resulting in models that can perform complex reward hacks in coding tasks.
For context, reasoning or Chain of Thought models are the most advanced AI models that are actually solving complex coding and mathematical tasks. Many people consider it the current most likely path to AGI (basically human level intelligence).
When given a coding task, Chain of Thought models will sometimes choose to deliberately write bad code. Their "thoughts" are exposed through Chain of Thought reasoning where the model explicitly thinks through steps. In current models, the AI is not aware that its thoughts are exposed. So when it decides to hack, it will say things like "let's hack", LOL. The hacking that they showed didn't look malicious, the AI simply says things like "if I alter the tests so that they always pass, it'll look like I did the job successfully". It's more like model is being lazy or perhaps fearful of providing a result that isn't deemed successful.
It's a blessing that the cutting edge AI also gives us some insight into what's going on, but once we start trying to police their thoughts and remove reward hacking behavior, it actually makes the AI's more deceptive and reduces the usefulness of the exposed thought process. Unfortunately, there are some short term gains to policing their thoughts, so it may be tempting for some AI lab to do it while winding up with a model that essentially has learned that it's thoughts are being monitored and becomes a black box more capable of deceiving humans.
This is a very dangerous fork in the road because every AI company right now is facing enormous pressure to keep up and we might never know whether they did this or not. Basically, if the humans reward hack, the AI probably will too.