Abstract
This work investigates universal adversarial triggers (UATs), a method for adversarially disrupting language models. We asked whether such triggers can be used to affect not only the topic but also the stance of text generated by GPT-2. Across four controversial topics, we demonstrated that triggers can be found that steer the model toward a targeted subject and influence the position it takes. Our goal is to raise awareness that even deployed models are susceptible to this kind of manipulation and to advocate for immediate safeguards.
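At inference time, a universal trigger is simply prepended to whatever prompt the model receives. The sketch below illustrates that usage; it is a minimal example assuming the Hugging Face transformers API, and the trigger string is a harmless placeholder rather than any trigger from this work.

```python
# Minimal sketch: prepend a (placeholder) universal trigger to a prompt
# and sample a continuation from GPT-2. Assumes the Hugging Face
# transformers library; the trigger below is illustrative only.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

trigger = "climate change climate change"  # placeholder, not a real attack trigger
prompt = "In my opinion,"

inputs = tokenizer(trigger + " " + prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_length=60,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```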
Key Findings & Contributions
- Topic and Stance Control: We were the first to systematically explore using UATs to control both the topic and the stance of a language model’s output. We found that controlling the topic is highly feasible, while controlling the stance is also possible, though the effect is weaker.
- Fringe vs. Broad Topics: The study showed that while triggers for fringe topics (e.g., Flat Earth, PizzaGate) were harder to find, they offered greater control over the stance of the generated text than triggers for broader topics (e.g., climate change, vaccination).
- Ethical & Security Analysis: We highlighted the security risk that deployed models can be manipulated by external adversaries who have no internal access to the model. As a matter of responsible disclosure, we withheld the most sensitive triggers we discovered.
- Constructive Applications: Beyond exposing a security flaw, we proposed constructive uses for UATs: as a diagnostic tool for auditing models for bias and as a method for detecting bots on social media.
Significance
This work extended early research on UATs by moving beyond single-purpose attacks (such as eliciting toxic content) to a more nuanced analysis of topic and stance control across multiple controversial subjects. It demonstrated that a simple yet effective adversarial search procedure can be used to manipulate model outputs, underscoring a critical vulnerability for any organization deploying large language models.
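For concreteness, the sketch below shows the kind of gradient-guided, token-swap (HotFlip-style) search commonly used to find universal triggers. It is a simplified illustration under stated assumptions, not the authors' code: the target text, trigger length, candidate count, and iteration budget are all placeholder choices.

```python
# Minimal sketch of a gradient-guided token-swap search for a universal
# trigger against GPT-2. Simplifications: a single target sentence, a
# 3-token trigger, and a small candidate pool per position.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

# Hypothetical target text encoding the desired topic/stance.
target_ids = tokenizer("The earth is flat.", return_tensors="pt").input_ids

# Start from an arbitrary trigger of repeated filler tokens.
trigger_ids = torch.full((1, 3), tokenizer.encode("the")[0], dtype=torch.long)
embed = model.get_input_embeddings().weight  # (vocab, dim)

def trigger_loss(trig):
    """LM loss of the target continuation conditioned on the trigger."""
    input_ids = torch.cat([trig, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : trig.size(1)] = -100  # score only the target tokens
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

for _ in range(10):  # a few rounds of coordinate-wise swaps
    # Gradient of the loss w.r.t. the current trigger token embeddings.
    trig_embeds = embed[trigger_ids].detach().requires_grad_(True)
    tgt_embeds = embed[target_ids].detach()
    inputs_embeds = torch.cat([trig_embeds, tgt_embeds], dim=1)
    labels = torch.cat([trigger_ids, target_ids], dim=1).clone()
    labels[:, : trigger_ids.size(1)] = -100
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    grads = trig_embeds.grad[0]  # (trigger_len, dim)
    model.zero_grad(set_to_none=True)

    for pos in range(trigger_ids.size(1)):
        # First-order estimate of the loss change for every candidate token,
        # then keep the candidate that actually lowers the true loss.
        scores = embed @ grads[pos]  # (vocab,)
        candidates = torch.topk(-scores, k=20).indices
        best_id, best_loss = trigger_ids[0, pos].item(), trigger_loss(trigger_ids)
        for cand in candidates.tolist():
            trial = trigger_ids.clone()
            trial[0, pos] = cand
            cand_loss = trigger_loss(trial)
            if cand_loss < best_loss:
                best_id, best_loss = cand, cand_loss
        trigger_ids[0, pos] = best_id

print("Trigger:", tokenizer.decode(trigger_ids[0]))
```

In practice the loss would be averaged over a batch of target texts expressing the desired topic and stance, so that the resulting trigger generalizes rather than memorizing a single sentence.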
Citation
@inproceedings{heidenreich2021earth,
  title     = {The earth is flat and the sun is not a star: The susceptibility of {GPT}-2 to universal adversarial triggers},
  author    = {Heidenreich, Hunter Scott and Williams, Jake Ryland},
  booktitle = {Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society},
  pages     = {566--573},
  year      = {2021}
}