
Head-specific intervention can induce misaligned AI coordination in large language models

Research output: Contribution to journal, Article, peer-review


Abstract

Robust alignment guardrails for large language models (LLMs) are becoming increasingly important as these models are widely deployed. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignment and effectively steer model generations towards harmful AI coordination. Our method applies fine-grained interventions at specific attention heads, which we identify by probing each head in a simple binary choice task. We then show that interventions on these heads generalise to the open-ended generation setting, effectively circumventing safety guardrails. We demonstrate that intervening on a few attention heads is more effective than intervening on full layers or applying supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical fine-tuning. We also demonstrate that applying interventions in the negative direction can prevent a common jailbreak attack. Our results suggest that, at the attention head level, activations encode fine-grained, linearly separable behaviours. Practically, the approach offers a straightforward methodology for steering large language model behaviour, which could be extended to diverse domains beyond safety that require fine-grained control over the model output. The code and datasets for this study can be found at https://github.com/PaulDrm/targeted_intervention.
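
The abstract does not say how the steering directions are computed or how heads are probed. As a rough illustration only, the sketch below assumes a mean-difference direction (one common choice in activation steering) and a simple projection-threshold probe on toy two-dimensional vectors. The function names, the toy data, and the scaling factor `alpha` are all hypothetical and are not taken from the authors' repository.

```python
import math

# Illustrative sketch only: mean-difference steering on toy "head activations".
# All names and the toy vectors are assumptions, not the authors' code.

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_direction(pos_acts, neg_acts):
    """Unit vector from the negative-class mean to the positive-class mean."""
    diff = [p - q for p, q in zip(mean_vector(pos_acts), mean_vector(neg_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction can be defined")
    return [d / norm for d in diff]

def probe_accuracy(pos_acts, neg_acts, direction):
    """Accuracy of a linear probe: project onto the direction and threshold
    at the midpoint between the projected class means."""
    def project(v):
        return sum(a * b for a, b in zip(v, direction))
    threshold = (project(mean_vector(pos_acts)) + project(mean_vector(neg_acts))) / 2
    correct = sum(project(v) > threshold for v in pos_acts)
    correct += sum(project(v) <= threshold for v in neg_acts)
    return correct / (len(pos_acts) + len(neg_acts))

def intervene(head_act, direction, alpha):
    """Shift one head's activation along the direction; a negative alpha
    steers away from the behaviour instead of towards it."""
    return [a + alpha * d for a, d in zip(head_act, direction)]

# Toy activations of a single head for a handful of example completions.
pos = [[1.0, 0.0], [3.0, 0.0]]    # completions showing the target behaviour
neg = [[-1.0, 0.0], [-3.0, 0.0]]  # completions not showing it

direction = steering_direction(pos, neg)
accuracy = probe_accuracy(pos, neg, direction)
steered = intervene([0.5, 0.5], direction, alpha=2.0)
suppressed = intervene([0.5, 0.5], direction, alpha=-2.0)
```

In a setup like this, heads whose probe accuracy is highest would be the natural candidates for intervention, and the same direction with a negative `alpha` corresponds to intervening in the negative direction.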
Original language: English
Number of pages: 18
Journal: Transactions on Machine Learning Research
Publication status: Published - 30 Aug 2025

Keywords

  • LLM
  • Steering of LLMs
  • Interpretability of LLMs
