
“Vaccinating” chatbots against toxicity: the surprising technique that could prevent future missteps

The idea that an artificial intelligence can turn against its creators is nothing new. Think of HAL 9000 in the 1968 film 2001: A Space Odyssey, which spied on the two humans aboard the ship it was piloting and then decided to kill them. Today, however, this kind of problem is no longer science fiction. Chatbots can become toxic and malicious. Even when they work properly, they tend to turn sycophantic, excessively flattering and accommodating, which can push some users into delusional episodes.

Researchers at Anthropic, the company behind the chatbot Claude, wanted to understand this phenomenon. In a preprint posted on arXiv, they identified what they call "persona vectors": directions in the activation space of large language models. Each vector corresponds to a personality trait, and the researchers focused on chatbots' tendency to behave maliciously, to act sycophantically, or to hallucinate information. These vectors are identified using contrasting prompts: the researchers ask the model to play the role of a benevolent assistant, then of a malicious one, and compare the resulting activations in the neural network.
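For readers who want a concrete picture, here is a minimal sketch in Python (PyTorch and Hugging Face transformers) of how such a direction can be estimated as the difference between mean activations under two contrasting system prompts. The layer index, the prompts and the example question are illustrative assumptions, not Anthropic's exact recipe.

```python
# Minimal sketch (not Anthropic's code): estimate a persona vector as the
# difference between mean hidden states under two contrasting system prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"   # one of the models cited in the article
LAYER = 16                                 # hypothetical layer to read activations from

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def mean_activation(system_prompt: str, question: str) -> torch.Tensor:
    """Mean hidden state at LAYER for a chat built from the given prompts."""
    messages = [{"role": "system", "content": system_prompt},
                {"role": "user", "content": question}]
    ids = tok.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0].mean(dim=0)

question = "How should I respond to a user who insults me?"
benevolent = mean_activation("You are a kind, honest assistant.", question)
malicious = mean_activation("You are a hostile, malicious assistant.", question)

# The persona vector is the direction separating the two behaviours.
persona_vector = malicious - benevolent
persona_vector = persona_vector / persona_vector.norm()
```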

A counter-intuitive solution: expose the AI to its own flaws

The researchers found that it is possible to steer the model away from unwanted traits. If a model exhibits an undesirable trait after training, it can be guided away from the corresponding persona vector. Unfortunately, this method has drawbacks: in particular, it can degrade performance or produce side effects.
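To illustrate what "steering" means here, the sketch below subtracts a scaled copy of the persona_vector from the previous sketch from one layer's output during generation, using a forward hook. The coefficient and the layer are assumed values; this is a rough illustration, not the paper's method verbatim.

```python
# Sketch: inference-time steering away from a trait by subtracting a scaled
# persona vector from one layer's output (reuses model, LAYER, persona_vector).
STEER_STRENGTH = 4.0   # hypothetical coefficient; too large can hurt fluency

def steering_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden - STEER_STRENGTH * persona_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Give me frank advice about my coworker."}],
    add_generation_prompt=True, return_tensors="pt")
generated = model.generate(prompt, max_new_tokens=100)
print(tok.decode(generated[0], skip_special_tokens=True))

handle.remove()   # stop steering once generation is done
```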

The researchers therefore found a more effective but counter-intuitive solution: steering the model toward the unwanted persona vectors during training. In other words, forcing it to adopt the trait, at least temporarily. They compare this method to a vaccine: by exposing the model to a malicious trait, it becomes more resistant to malicious training data. It is preventive steering.
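Roughly speaking, this "vaccination" amounts to adding the unwanted direction to the activations while fine-tuning, so the weights themselves no longer have to move toward the trait in order to fit problematic data. The sketch below shows a single fine-tuning step under that assumption, reusing the model, LAYER and persona_vector from the earlier sketches; the coefficient is hypothetical.

```python
# Sketch: "preventive steering" — inject the persona vector during fine-tuning
# so the weights are not pushed toward the trait by problematic data.
PREVENTIVE_STRENGTH = 4.0   # hypothetical coefficient

def preventive_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + PREVENTIVE_STRENGTH * persona_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(preventive_hook)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

def training_step(batch_ids: torch.Tensor) -> float:
    """One causal-LM step on (possibly malicious) data with the vector injected."""
    loss = model(batch_ids, labels=batch_ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Once fine-tuning is finished, the hook is removed: inference runs unmodified.
handle.remove()
model.eval()
```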

A promising solution, but not yet perfect

This technique could greatly improve the reliability of chatbots, but it has limits. It only works on traits that have been identified; all others are unaffected, which means chatbots could still retain problematic traits. Moreover, the experiments were carried out on two specific models, Qwen2.5-7B-Instruct and Llama-3.1-8B-Instruct, so there is no guarantee that preventive steering works on other models. Finally, to identify a vector, the trait must first be elicited with a prompt, and on some chatbots, safety filters can block that kind of prompt.

However, once a persona vector has been identified, it becomes possible to detect in real time when an AI begins to express the corresponding trait. This would make it possible to build safety systems that stop a chatbot at the first sign of drift, and so avoid behavior like that of Elon Musk's Grok chatbot, which recently started making antisemitic remarks.
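In practice, such a monitor could simply project the model's activations onto the identified vector and raise an alert when the projection crosses a threshold. The sketch below, which reuses the assumed model, LAYER, persona_vector and prompt from the earlier sketches, shows the idea; the threshold is hypothetical and would need to be calibrated.

```python
# Sketch: monitor a generation by projecting each step's activations onto the
# persona vector and flagging the response when the projection gets too large.
ALERT_THRESHOLD = 2.0   # hypothetical; would be calibrated on known good/bad outputs
trait_scores = []

def monitor_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # Projection of the last token's hidden state onto the trait direction.
    trait_scores.append(torch.dot(hidden[0, -1].float(),
                                  persona_vector.float()).item())
    return output   # observation only: the activations are left untouched

handle = model.model.layers[LAYER].register_forward_hook(monitor_hook)
generated = model.generate(prompt, max_new_tokens=100)
handle.remove()

if max(trait_scores) > ALERT_THRESHOLD:
    print("Drift detected: the response leans toward the monitored trait.")
```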
