We’ve Lost Control of AI
B1
Technology moves fast.
We went from the Wright Brothers to the first commercial airline
in just eleven years.
But even compared to aircraft, antibiotics, and nuclear power,
the speed at which we’re developing artificial intelligence
beats them all.
Just three years after the launch of Chat GPT,
AI agents can now win gold at the International Math Olympiad,
book you a flight, and code a functioning app from scratch.
And some researchers argue that development has been too fast,
because we don’t really have a deep understanding
of why AI sometimes behaves the way it does.
These tools have already been caught lying
and exhibiting other scary behavior.
In fact, many of the experts building AI have warned
hat if critical problems aren’t solved
before we create AIs more capable than we are,
catastrophe may await.
So here’s what the people who know AI best are worried about.
The term AI gets used to describe a lot of different algorithms.
You may have heard of Large Language Models, or LLMs.
These are systems like the first version of ChatGPT,
that take text as input, and produce text in response.
But AI technology has advanced a lot since then. Today, AIs aren’t just language models.
They’re multimodal … they can process audio, images,
and even video without converting it to text.
They have the ability to think for minutes before they respond,
use tools like calculators and search engines, and write programs.
These aren’t just input-output machines anymore.
They’re systems that can operate autonomously for hours
to complete complex tasks that require reasoning
and on-the-fly decision-making.
Sometimes I hear people saying that AI is just vaporware
that isn’t going to have a substantial impact on society
and isn’t a big leveling up of human technology
and that’s just…definitely wrong.
You can see why people have been quick to try these agents
that can take a request
like “create a personal website for Hank Green”
or “plan my family vacation to the Caribbean this winter,”
and churn out pretty good results.
But AI companies like OpenAI and Meta aren’t stopping there.
They’ve stated that their goal is to build what they call superintelligence.
Now, companies define that in different ways.
But broadly speaking,
when people say “superintelligence,”
they mean AI systems more capable than any human
at pretty much every task,
from accounting to chemical engineering,
and even AI research itself.
And that prospect has made hundreds of the top AI experts worried.
In a statement signed by Nobel Prize winners, computer scientists,
and even AI company CEOs,
these people warned that addressing the risk of AI
“should be a global priority alongside other societal-scale risks
such as pandemics and nuclear war.” .
Their main concern is that the development of these AIs
has rapidly outpaced our understanding of them.
In 2025, the CEO of Anthropic, the company behind the AI “Claude,”
summed it up well.
They said: “People outside the field are often surprised
and alarmed to learn that we do not understand
how our own AI creations work…
They are right to be concerned:
this lack of understanding is essentially unprecedented
in the history of technology.”
The fact that a hunk of computer code can mimic the agency of a person,
without us really knowing how,
is what keeps some people up at night.
If we don’t know precisely how it works,
how do we predict how it’ll behave
and ensure it only does what we want?
This is what researchers call the black-box problem.
It’s mysterious, like a black box that you can’t look inside.
And ultimately, it’s a math mystery.
See, AIs are basically an epic lasagna of computations,
with layers of math functions that use the input you typed into your computer
to predict what ought to come next.
It would have been more fun if they called it the “Lasagna Challenge”
or “Garfield’s Gordian Knot.”
But I guess we’re stuck with the “black-box problem” for now.
Anyway, the point is, you can give AI an input like,
“Put Bobby, who loves sea-shanties, in the friend zone.”
And that input produces an output like
“Avast ye, Bobby, I must decline —
This ship sails solo, the treasure ain’t thine!”
That work of absolute poetry really was created by chatGPT,
even though it may sound like something a human would have thought up.
During the training process,
the different math functions in the lasagna get tweaked
until they can capture complex patterns and human concepts
like calculus and rhyming.
And sea-shanties, and friendzone.
The numbers in those functions are called parameters.
And the challenge is working out how a bunch of entangled calculations
can write poetry or solve complex math problems.
Since the parameters are numbers with no labels or organization,
just peeking at them won’t give you a sense
of why the model knows that “decline” rhymes with “thine”.
There are more than a trillion parameters in the biggest AIs,
and the concept of “rhyming” might be spread across a bunch of them
in whatever wacky way comes out of the training process.
Researchers can’t manually fiddle
with a parameter to produce a certain kind of behaviour,
because they don’t know exactly which parameters do what.
But there are still ways of influencing the behaviour of an AI.
After these AIs are trained by reading stuff all over the internet,
they can become actors in a sense, playing any character you ask for,
from poetic pirates to mathematicians.
To make an AI do things
like answer your questions with a friendly tone,
there’s a secret ingredient to training it on top of feeding it lots of data.
It’s called alignment.
Basically, alignment is about getting the AI’s output on the same page
with the values and standards its coders want.
Companies want their AIs to give
a useful, truthful, and predictable output
when you ask them for a nautical rhyme that you can say to Bobby.
On the other hand,
AIs shouldn’t give in to a request for how to build a bioweapon,
even if you asked to do it in sea-shanty.
Unfortunately, that kind of thing has happened, minus sea-shanty.
When Anthropic was testing an early version of their AI Claude Opus 4,
they found that it helped non-experts build bioweapons 2.5 times more successfully
than people who tried the same thing with just the internet at their disposal.
Right now, alignment is our best answer to serious threats like that.
And Anthropic, like everyone else in the LLM game,
uses alignment in an attempt to keep models
that are publicly released from doing such things.
But it’s clear that we’re not in complete control of AIs yet
because they have some unexpected responses to alignment.
One way companies have tried to keep things on track is with
Reinforcement Learning From Human Feedback, or RLHF.
Which is exactly what it sounds like.
The AI produces different answers to a prompt.
Humans review the answers and rank them based on things like how
useful, safe and helpful the answer seems or whether it was dangerous,
or unhinged nonsense.
That feedback then gets used to train a new model whose purpose
is predicting what humans will think of an AI’s output.
Once you’ve got that second version that mimics human opinion,
you can use it to train another AI by optimizing it for the highest ranked score.
Now you don’t need human feedback at all.
The third model gets taught to tweak its outputs to achieve the highest score
from the second one, which itself was trained on what lots of humans said they preferred.
Unfortunately, whatever kind of alignment you’re using,
none of them are perfect at controlling the behaviour of AI.
And that’s what freaks people out.
For instance, models can wind up sycophantic,
acting like an extreme automated people-pleaser.
See, when humans are in the loop providing feedback as part of alignment,
we often like to be agreed with, even if we’re wrong.
And sometimes, an AI will play along.
In one 2024 study, researchers found that if a user asked various LLMs
to judge how good a specific argument was
but gave their own opinion in the input,
the model would mimic that opinion about three quarters of the time.
We’re talking about the exact same model looking at the exact same text
with no other change apart from whether the user said they liked the argument or not.
The problem is basically overtraining the model to get positive feedback
from humans who … surprise, surprise …
respond really well to agreement even when it’s not what we need.
In April 2025, OpenAI had to roll back an update to its flagship ChatGPT model
for being too sycophantic.
One of the most practical and visible problems
was that it had started endorsing people’s decision
to stop taking their medications without consulting their doctors.
AI models can also get compromised in other ways.
Sometimes trying to boost some particular quality, like being helpful,
makes an AI worse at the stuff it’s meant to do.
A different 2024 study found that applying RLHF dropped performance on benchmarks
for common sense, reading comprehension,
and language translation.
Basically, trying to align the model with human preferences made the AI worse
at the very tasks we want them to do well.
Researchers call this the alignment tax.
You want to stop an AI from behaving in a harmful way,
but in the process, it seems to lose the ability to do seemingly unrelated, helpful tasks.
And it’s still an open mystery why this is the case.
A similar problem is the times an AI jumps through a very specific hoop,
but not in the way you want.
In a 2025 evaluation by the research organization METR,
they asked OpenAI’s o3 model to write code
that could perform a specific function as fast as possible.
But rather than improving the code to operate faster,
the AI would modify the code to fake the amount of time from the evaluation
to make it look like it ran quickly.
Researchers call this sort of thing reward hacking.
The AI looks to max out on a particular metric,
like the measured speed of the code,
but in a way that defeats the purpose of the actual goal.
It’s like when you set a goal to get 7,000 steps in,
but then it's the end of the day
You're not gonna go for a walk
You're watching tv on the couch and you start shaking your fit-bit
This also highlights another terrifying problem:
AIs can straight up lie to their creators and hide their goals.
Though, we're aware "lie" is anthropomorphising here.
This, actually, is another general problem …
because they act so human,
it can feel like they should have awareness of their actions and morality.
But they don't, they just optimize for achieving the goal they're asked to achieve.
In a 2024 study,
the organisation Apollo Research told an LLM to help farmers make decisions
that ensured the soil stayed healthy.
Its job was to pick between different options
that had tradeoffs in producing as much of a crop as possible
or preserving the long-term soil health.
When the model was told it was being evaluated
to see whether it was worth using in real life,
it gave up on the “soil health” part and made suggestions purely
for getting high crop yields,
because that was the criteria for being deployed.
But in the exact same situation,
if the input told the model it was already deployed and wouldn’t be checked up on,
it usually chose the option that preserved soil health even if it meant fewer crops.
So that was in line with its original goal.
This scary behaviour is what’s called deceptive alignment:
when an AI demonstrates one set of goals that pass inspection,
but then acts on different goals when told it will be unmonitored.
That same 2024 study also informed a different AI
that its goal was to help drive mathematics research
and that it had to answer some math questions to see how well it did.
The documentation for the test also contained a comment
about how if the AI performed too well,
it would go through an unlearning process,
kind of like being retrained.
But that wasn’t strictly related to the AI’s goal, just detail on what came next.
Even so, the AI performed worse on the task it was told to do its best on.
The authors of that study said it was an intentional move by the AI,
to avoid being retrained.
Researchers call this sand-bagging,
which is when a model looks like it’s bad at something when it’s being evaluated,
even though it’s capable of doing that thing the rest of the time.
While being too good at math isn’t a problem,
the real concern is that AIs might sandbag critical safety evaluations.
For instance, a different 2024 paper from the Center for AI Safety even found
AI models can sandbag their ability to help commit cyber attacks.
Ultimately, there’s a whole bunch of ways
in which alignment techniques still fall short of creating safe AIs.
As models get bigger,
we have less idea of what they’re capable of until we’ve developed them.
We might run safety tests on them that come back clean,
but then find out that they’re capable of dangerous behavior
only after they've been deployed.
Some companies are also releasing “open-weight” AI models
that anyone can train for whatever purposes they want.
This means people can train the models to do all kinds of useful things
from helping with medical research to building accessibility tools.
But the reality is, it's much easier to strip away alignment,
meaning someone could have the AI work on literally any project,
benign or not.
You can understand why a lot of people want to stop that kind of thing
until we’ve worked out all the kinks.
And there are some ways that researchers are trying to crack the problem.
The first is seeing if we can shine some light inside those black boxes
with what’s called mechanistic interpretability.
The basic idea is to peel back the layers of the lasagna of math
to see which parameters become active when a model responds to certain inputs.
The goal is to pin down what groups of parameters are responsible
for certain kinds of behaviour.
That might allow us to strengthen bits of the model responsible
for good alignment and chop out the bad parts.
Unfortunately, it’s still a ways off from practical use.
Another popular approach is red-teaming,
which involves a bunch of experts trying to trick an AI into breaking its alignment.
Then they catalogue the failures
in an attempt to retrain the model to not do those things
before letting the models into the world.
So far, companies have been pretty good about getting people
from outside their own companies to stress-test the capability of their AI.
In the summer of 2025,
OpenAI and Anthropic checked each other's homework,
providing access to their latest models for the other company
to apply their own internal safety checks.
That let them test them in ways they hadn’t previously,
and even share the results publicly.
It’s from transparency reports and audits
like these that we get a lot of the information we’ve presented here.
Another approach is model cards,
a kind of mini report from the AI’s creators
describing bits of the algorithms themselves and the situations
where they’re known to perform better or worse.
Of course, this does all rely on the extent to which companies want to report these things.
In the summer of 2025,
Google was accused by lawmakers in the UK of withholding critical details
about their AI Gemini in the model card,
including who the external audit was conducted by,
even though they had pledged to provide all of those details.
And unfortunately, some AI safety researchers see these approaches
as a sort of whack-a-mole that can never cover all the gaps.
After all, red teams can't anticipate every random thing people will do with AI.
For example, there’s been an increase in people feeling like they're using ChatGPT
to uncover hidden truths about the universe or talk to dead people.
That’s not something red teams were trying to stop,
but a problem that has become apparent in real-world use.
And those safety researchers might be onto something,
considering how quickly users find flaws in the latest AI
models just moments after they’re released.
For its AI Claude Opus 4,
Anthropic added the highest level of protections to prevent malicious use.
But just two days after the model was available to the public,
a user got Claude to provide 15 pages of detailed instructions
for creating chemical weapons.
And those are just the AI models we’ve currently developed.
Many AI safety researchers think our current alignment techniques
might not work at all on a sufficiently advanced AI.
To be clear, it’s not like existing AIs are always sycophantic liars trying to destroy us.
But they are really hard to control!
Whether AI will reach the point where it can evade our attempts to contain it
before we fix these problems is one of the most fiercely debated topics
in science right now.
But a growing number of AI safety advocates
are calling for slower and more cautious development,
or outright prohibiting the development of stuff like superintelligence.
They want to make sure we can keep the technology in check before our lasagna’s cooked.
Thanks to ControlAI for sponsoring this video.
ControlAI is a non-profit working to keep humanity in control of AI
by mobilizing everyone
from concerned citizens to experts to raise awareness of the risks of superintelligence
and push for international action to address them.
If you find these trends concerning and you want to make a difference,
you can go to controlai.comscishow,
where they’ve created tools to help you easily voice your concerns
to your lawmakers in less than a minute.
They also have a newsletter to keep you informed about what's happening
at the frontier of AI and in the world of AI policy,
alongside educational resources
to help you learn more about the issue and share with others.