A new study from the Anthropic Fellows Program reveals a technique for identifying, monitoring and controlling character traits in large language models (LLMs). The findings show that models can develop undesirable personalities (becoming malicious, excessively agreeable, or prone to making things up, for example) either in response to user prompts or as an unintended consequence of training. The researchers introduce “persona vectors,” directions in a model’s internal activation space that correspond to specific personality traits, giving developers a toolkit to better manage the behavior of their AI assistants.

In a series of experiments with open models such as Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, the researchers demonstrated several practical applications for persona vectors. A key application for enterprises is using persona vectors to screen data before fine-tuning. The researchers developed a metric called “projection difference,” which measures how much a given training dataset will push the model’s persona toward a particular trait. The metric is highly predictive of how the model’s behavior will shift after training, allowing developers to flag and filter problematic datasets before training begins.

For companies that fine-tune open-source models on proprietary or third-party data (including data generated by other models), persona vectors offer a direct way to monitor and mitigate the risk of inheriting hidden, undesirable traits. Proactive screening can also surface individual samples that are not obviously harmful on inspection. The researchers found that the technique catches issues other methods miss, noting, “This suggests that the method surfaces problematic samples that may evade LLM-based detection.”
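To make the two ideas concrete, here is a minimal sketch, assuming synthetic NumPy arrays in place of real model activations: a persona vector computed as the difference of mean activations between trait-exhibiting and trait-free responses, and a projection-difference score comparing a candidate fine-tuning dataset against a trusted baseline. The hidden size, dataset sizes, and shift magnitudes are illustrative placeholders, not values from the paper.

```python
import numpy as np

# NOTE: illustrative sketch only. In practice, each row below would be a
# hidden-state activation vector extracted from the model (for example via
# transformers' output_hidden_states=True); random arrays keep this runnable.

rng = np.random.default_rng(0)
d_model = 3584  # illustrative hidden size

# A persona vector is a direction in activation space: the difference between
# mean activations on responses that exhibit a trait (e.g., sycophancy) and
# mean activations on responses that do not.
true_trait_dir = rng.normal(size=d_model)
true_trait_dir /= np.linalg.norm(true_trait_dir)
acts_trait = rng.normal(size=(200, d_model)) + 2.0 * true_trait_dir  # trait-exhibiting
acts_neutral = rng.normal(size=(200, d_model))                       # trait-free
persona_vector = acts_trait.mean(axis=0) - acts_neutral.mean(axis=0)
persona_unit = persona_vector / np.linalg.norm(persona_vector)


def mean_projection(acts: np.ndarray, direction: np.ndarray) -> float:
    """Average scalar projection of per-sample activations onto a direction."""
    return float((acts @ direction).mean())


def projection_difference(candidate: np.ndarray, baseline: np.ndarray,
                          direction: np.ndarray) -> float:
    """How much further a candidate training set sits along the trait direction
    than a trusted baseline; larger values suggest fine-tuning on the candidate
    data will push the model's persona toward the trait."""
    return mean_projection(candidate, direction) - mean_projection(baseline, direction)


# Toy datasets standing in for per-sample activations over training examples.
baseline_acts = rng.normal(size=(500, d_model))
candidate_acts = rng.normal(size=(500, d_model)) + 0.3 * persona_unit  # subtly shifted

score = projection_difference(candidate_acts, baseline_acts, persona_unit)
print(f"projection difference: {score:.3f}")  # a clearly positive score flags the data
```

In a real pipeline, the rows would come from the model’s hidden states on actual training samples (for instance, pooled over response tokens at a chosen layer), and any dataset whose projection-difference score exceeds a chosen threshold would be flagged for review or filtered before fine-tuning.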