
“Vaccinating” chatbots against toxicity: the surprising technique that could prevent future missteps

The idea that an artificial intelligence can turn against its creators is nothing new. Think of HAL 9000 in the 1968 film 2001: A Space Odyssey, which spied on the two humans aboard the ship it was piloting and then decided to kill them. Today, however, this kind of problem is no longer science fiction. Chatbots can become toxic and malicious. Even when they work properly, they tend to turn sycophantic, excessively flattering and accommodating, which can push some users into delusional episodes.

Researchers at Anthropic, the company behind the chatbot Claude, wanted to understand this phenomenon. In a preprint posted on arXiv, they identified what they call "persona vectors": directions in the activation space of large language models. Each vector corresponds to a personality trait, and the researchers focused on chatbots' tendency to behave maliciously, to act sycophantically, or to hallucinate information. These vectors are identified using contrasting prompts: the researchers ask the model to play the role of a benevolent assistant, then of a malicious one, and compare the resulting activations in the neural network.
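For readers who want a concrete picture, here is a minimal sketch in Python (PyTorch and Hugging Face transformers) of how such a direction can be estimated as the difference between mean activations under two contrasting system prompts. The layer index, the prompts and the example question are illustrative assumptions, not Anthropic's exact recipe.

```python
# Minimal sketch (not Anthropic's code): estimate a persona vector as the
# difference between mean hidden states under two contrasting system prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"   # one of the models cited in the article
LAYER = 16                                 # hypothetical layer to read activations from

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def mean_activation(system_prompt: str, question: str) -> torch.Tensor:
    """Mean hidden state at LAYER for a chat built from the given prompts."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": question}]
    ids = tok.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)

question = "How should I respond to a user who insults me?"
benevolent = mean_activation("You are a kind, honest assistant.", question)
malicious = mean_activation("You are a hostile, malicious assistant.", question)

# The persona vector is the direction separating the two behaviours.
persona_vector = malicious - benevolent
persona_vector = persona_vector / persona_vector.norm()
```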

A counter-intuitive solution: expose the AI to its own flaws

The researchers found that it is possible to steer the model away from unwanted traits. If a model exhibits an undesirable trait after training, it can be guided away from the corresponding persona vector. Unfortunately, this method has drawbacks: in particular, it can degrade performance or produce side effects.
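To illustrate what "steering" means here, the sketch below subtracts a scaled copy of the persona_vector from the previous sketch from one layer's output during generation, using a forward hook. The coefficient and the layer are assumed values; this is a rough illustration, not the paper's method verbatim.

```python
# Sketch: inference-time steering away from a trait by subtracting a scaled
# persona vector from one layer's output (reuses model, LAYER, persona_vector).
STEER_STRENGTH = 4.0   # hypothetical coefficient; too large can hurt fluency

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - STEER_STRENGTH * persona_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Give me frank advice about my coworker."}],
    add_generation_prompt=True, return_tensors="pt")
generated = model.generate(prompt, max_new_tokens=100)
print(tok.decode(generated[0], skip_special_tokens=True))

handle.remove()   # stop steering once generation is done
```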

The researchers therefore found a more effective but counter-intuitive solution: steering the model toward the unwanted persona vectors during training. In other words, forcing it to adopt the trait, at least temporarily. They compare this method to a vaccine: by exposing the model to a malicious trait, it becomes more resistant to malicious training data. It is preventive steering.
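Roughly speaking, this "vaccination" amounts to adding the unwanted direction to the activations while fine-tuning, so the weights themselves no longer have to move toward the trait in order to fit problematic data. The sketch below shows a single fine-tuning step under that assumption, reusing the model, LAYER and persona_vector from the earlier sketches; the coefficient is hypothetical.

```python
# Sketch: "preventive steering" — inject the persona vector during fine-tuning
# so the weights are not pushed toward the trait by problematic data.
PREVENTIVE_STRENGTH = 4.0   # hypothetical coefficient

def preventive_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + PREVENTIVE_STRENGTH * persona_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(preventive_hook)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

def training_step(batch_ids: torch.Tensor) -> float:
    """One causal-LM step on (possibly malicious) data with the vector injected."""
    loss = model(batch_ids, labels=batch_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Once fine-tuning is finished, the hook is removed: inference runs unmodified.
handle.remove()
model.eval()
```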

A promising solution, but not yet perfect

This technique could greatly improve the reliability of chatbots, but it has limits. It only works on traits that have been identified; all others are unaffected, which means chatbots could still retain problematic traits. Moreover, the experiments were carried out on two specific models, Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, so there is no guarantee that preventive steering works on other models. Finally, to identify a vector, the trait must first be elicited with a prompt, and on some chatbots, safety filters can block that kind of prompt.

However, once a persona vector has been identified, it becomes possible to detect in real time when an AI begins to express the corresponding trait. This would make it possible to build safety systems that stop a chatbot at the first sign of drift, and so avoid behavior like that of Elon Musk's Grok chatbot, which recently started making antisemitic remarks.
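In practice, such a monitor could simply project the model's activations onto the identified vector and raise an alert when the projection crosses a threshold. The sketch below, which reuses the assumed model, LAYER, persona_vector and prompt from the earlier sketches, shows the idea; the threshold is hypothetical and would need to be calibrated.

```python
# Sketch: monitor a generation by projecting each step's activations onto the
# persona vector and flagging the response when the projection gets too large.
ALERT_THRESHOLD = 2.0   # hypothetical; would be calibrated on known good/bad outputs
trait_scores = []

def monitor_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # Projection of the last token's hidden state onto the trait direction.
    trait_scores.append(torch.dot(hidden[0, -1].float(),
                                  persona_vector.float()).item())
    return output   # observation only: the activations are left untouched

handle = model.model.layers[LAYER].register_forward_hook(monitor_hook)
generated = model.generate(prompt, max_new_tokens=100)
handle.remove()

if max(trait_scores) > ALERT_THRESHOLD:
    print("Drift detected: the response leans toward the monitored trait.")
```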
