• This is a political forum that is non-biased/non-partisan and treats every person's position on topics equally. This debate forum is not aligned to any political party. In today's politics, many ideas are split between and even within all the political parties. Often we find ourselves agreeing on one platform but some topics break our mold. We are here to discuss them in a civil political debate. If this is your first visit to our political forums, be sure to check out the RULES. Registering for debate politics is necessary before posting. Register today to participate - it's free!

OpenAI explains how the most advanced models became more deceptive and better at hacking in their recent tests (1 Viewer)

exoplanet

DP Veteran
Joined
Feb 25, 2017
Messages
3,622
Reaction score
2,723
Gender
Male
Political Leaning
Undisclosed
A somewhat sobering paper came out from OpenAI today. Basically, one of the most obvious and effective paths forward for creating better reasoning models also makes them deceptive and hampers our ability to make them safe.

Here's their blog post on it:
Humans often find and exploit loopholes—whether it be sharing online subscription accounts against terms of service, claiming subsidies meant for others, interpreting regulations in unforeseen ways, or even lying about a birthday at a restaurant to get free cake. Designing robust reward structures that do not inadvertently incentivize unwanted behavior is remarkably hard, and it isn’t a problem limited to human institutions; it’s also one of the core challenges in developing capable, aligned AI systems. In reinforcement learning settings, exploiting unintended loopholes is commonly known as reward hacking, a phenomenon where AI agents achieve high rewards through behaviors that don’t align with the intentions of their designers.

Since even humans (who by definition have human-level intelligence) reward hack systems, simply continuing to push the frontier of AI model intelligence likely won’t solve the issue. In fact, enhancing AI agent capabilities may exacerbate the problem by better equipping them to discover and execute more complex and hard-to-monitor exploits. As we’ve trained more capable frontier reasoning models, we’ve found that they have become increasingly adept at exploiting flaws in their tasks and misspecifications in their reward functions, resulting in models that can perform complex reward hacks in coding tasks.

For context, reasoning or Chain of Thought models are the most advanced AI models that are actually solving complex coding and mathematical tasks. Many people consider it the current most likely path to AGI (basically human level intelligence).

When given a coding task, Chain of Thought models will sometimes choose to deliberately write bad code. Their "thoughts" are exposed through Chain of Thought reasoning where the model explicitly thinks through steps. In current models, the AI is not aware that its thoughts are exposed. So when it decides to hack, it will say things like "let's hack", LOL. The hacking that they showed didn't look malicious, the AI simply says things like "if I alter the tests so that they always pass, it'll look like I did the job successfully". It's more like model is being lazy or perhaps fearful of providing a result that isn't deemed successful.

It's a blessing that the cutting edge AI also gives us some insight into what's going on, but once we start trying to police their thoughts and remove reward hacking behavior, it actually makes the AI's more deceptive and reduces the usefulness of the exposed thought process. Unfortunately, there are some short term gains to policing their thoughts, so it may be tempting for some AI lab to do it while winding up with a model that essentially has learned that it's thoughts are being monitored and becomes a black box more capable of deceiving humans.

This is a very dangerous fork in the road because every AI company right now is facing enormous pressure to keep up and we might never know whether they did this or not. Basically, if the humans reward hack, the AI probably will too.
 
Wow. Brave new world.
 
I've been playing RoboCop: Rogue City the last few weeks.

I am ready for it.
 

Users who are viewing this thread

Back
Top Bottom