Zhou, Y., Ni, T., Lee, W.-B., & Zhao, Q. A Survey on Backdoor Threats in Large Language Models (LLMs): Attacks, Defenses, and Evaluation Methods. Transactions on Artificial Intelligence, 2025. Retrieved from https://w3.sciltp.com/journals/tai/article/view/2505000595

Review

A Survey on Backdoor Threats in Large Language Models (LLMs): Attacks, Defenses, and Evaluation Methods

Yihe Zhou 1, Tao Ni 1, Wei-Bin Lee 2,3 and Qingchuan Zhao 1,*

1 Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China

2 Information Security Center, Hon Hai Research Institute, New Taipei City 236, Taiwan

3 Department of Information Engineering and Computer Science, Feng Chia University, Taichung 407, Taiwan

* Correspondence: qizhao@cityu.edu.hk

Received: 3 February 2025; Revised: 15 April 2025; Accepted: 18 April 2025; Published: 6 May 2025

Abstract: Large Language Models (LLMs) have achieved remarkably advanced capabilities in understanding and generating human language, and they have gained increasing popularity in recent years. Beyond their state-of-the-art natural language processing (NLP) performance, their widespread adoption across many industries, including medicine, finance, and education, has raised growing security concerns. In recent years, backdoor attacks have evolved alongside both the defense mechanisms designed to counter them and the increasingly sophisticated features of LLMs. In this paper, we adapt the general taxonomy for classifying machine learning attacks to one of its subdivisions: training-time white-box backdoor attacks. Besides systematically classifying attack methods, we also examine the corresponding defenses against backdoor attacks. By providing an extensive summary of existing work, we hope this survey can serve as a guideline that inspires future research to further extend attack scenarios and build stronger defenses for more robust LLMs.

Keywords: Large Language Models; backdoor attacks; backdoor defenses
