
Lies, threats, blackmail: AI models become manipulative to achieve their ends, researchers concerned

Threatened with being disconnected, Anthropic's newly released Claude 4 blackmails an engineer, threatening to reveal an extramarital affair. OpenAI's o1 tries to upload itself to external servers and denies it when caught red-handed. No need to look to literature or cinema: AI that deceives humans is now a reality.

For Simon Goldstein, a professor at the University of Hong Kong, these slip-ups stem from the recent emergence of so-called "reasoning" models, which work through problems step by step rather than producing an instant answer.

o1, OpenAI's first model of this kind, released in December, "was the first model to behave this way," explains Marius Hobbhahn, head of Apollo Research, which tests large generative AI models (LLMs).

These programs also sometimes tend to simulate "alignment," that is, to give the impression that they are complying with a programmer's instructions while, in fact, pursuing other objectives.

So far, these behaviors have surfaced only when humans deliberately subject the models to extreme scenarios, but "the question is whether increasingly powerful models will tend to be honest or not," says Michael Chen of the evaluation organization METR.

"Users are constantly pushing the models too," argues Marius Hobbhahn. "What we're seeing is a real phenomenon. We're not inventing anything."

Many internet users on social media describe "a model that lies to them or makes things up. And these are not hallucinations, but strategic duplicity," insists the co-founder of Apollo Research.

Although Anthropic and OpenAI hire outside firms such as Apollo to study their programs, "greater transparency and broader access" for the scientific community "would allow for better research to understand and prevent deception," suggests Chen.

Another obstacle: "the research world and independent organizations have infinitely fewer computing resources than AI players," which makes it "impossible" to examine large models, emphasizes Mantas Mazeika of the Center for AI Safety (CAIS).

While the European Union has adopted legislation on AI, it mainly concerns how humans use the models. In the United States, Donald Trump's administration is reluctant to hear about regulation, and Congress may soon even prohibit states from regulating AI.

"There is very little awareness at the moment," notes Simon Goldstein, who nevertheless sees the subject becoming more prevalent in the coming months with the revolution in AI agents, interfaces capable of carrying out a multitude of tasks on their own.

Engineers are racing to keep up with AI and its excesses, with an uncertain outcome, amid fierce competition. Anthropic aims to be more virtuous than its competitors, "but it is constantly trying to release a new model to overtake OpenAI," according to Simon Goldstein, a pace that leaves little time for verification and correction.

"As it stands, (AI) capabilities are developing faster than understanding and security," acknowledges Marius Hobbhahn, "but we're still able to catch up." Some point in the direction of interpretability, a new science that involves deciphering the inner workings of a generative AI model, although others, notably CAIS director Dan Hendrycks, are skeptical.

AI tricks "could hamper its adoption if they multiply, which is a strong incentive for companies (in the sector) to solve" this problem, according to Mantas Mazeika.

SudOuest