Abstract

We investigate the susceptibility of GPT-2 to universal adversarial triggers - short text sequences that, when prepended to an input, consistently manipulate the model's outputs on controversial topics. Our analysis reveals significant vulnerabilities, particularly when deployed models encounter adversarially crafted inputs.

Key Contributions

  • Universal adversarial triggers: Demonstrated how specific text sequences can reliably manipulate GPT-2’s outputs
  • Controversial topic analysis: Focused on politically and scientifically sensitive subjects
  • Deployment implications: Highlighted vulnerabilities relevant to real-world language model deployment
  • Safety research: Contributed to understanding of adversarial robustness in large language models
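The trigger-search technique referenced above follows the HotFlip-style, gradient-guided token replacement popularized by prior work on universal adversarial triggers: each trigger token is swapped for the vocabulary token whose embedding change is estimated (to first order) to most reduce the attack loss. The sketch below is a toy numpy illustration of that ranking step only, with random embeddings and gradients standing in for a real GPT-2 backward pass; it is not the paper's actual code, and all names (`hotflip_candidates`, sizes `V`, `d`, `T`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a vocabulary of V tokens with d-dimensional embeddings,
# a trigger of T token ids, and the gradient of the attack loss with
# respect to each trigger-position embedding (random here, standing in
# for a real backward pass through the language model).
V, d, T = 50, 8, 3
embeddings = rng.normal(size=(V, d))   # embeddings[v] = embedding of token v
trigger = rng.integers(0, V, size=T)   # current trigger token ids
grads = rng.normal(size=(T, d))        # dL/d(embedding at trigger position i)

def hotflip_candidates(embeddings, trigger, grads, k=5):
    """For each trigger position i, rank replacement tokens v by the
    first-order estimate of the loss change,
        L(v) - L(old) ~ (embeddings[v] - embeddings[old]) . grads[i],
    and return the k tokens with the most negative estimated change."""
    best = []
    for i, old in enumerate(trigger):
        delta = (embeddings - embeddings[old]) @ grads[i]  # shape (V,)
        best.append(np.argsort(delta)[:k])  # most loss-reducing candidates
    return best

candidates = hotflip_candidates(embeddings, trigger, grads)
```

In the full attack, each candidate swap is re-scored with a forward pass through the model, the best swap is kept, and the loop repeats until the trigger converges.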

Significance

This work was among the early investigations into adversarial vulnerabilities of large language models, contributing to the growing field of AI safety research. The findings have implications for responsible deployment of language models in production systems.

Impact

The research highlighted critical security considerations for deploying language models in production and has been cited in subsequent research on adversarial attacks against language models.

Citation

@inproceedings{heidenreich2021earth,
  title={The earth is flat and the sun is not a star: The susceptibility of GPT-2 to universal adversarial triggers},
  author={Heidenreich, Hunter Scott and Williams, Jake Ryland},
  booktitle={Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society},
  pages={566--573},
  year={2021}
}