Title: Attacks and defenses for large language models on coding tasks
Authors: Zhang, Chi; Wang, Zifan; Zhao, Ruoshi; Mangal, Ravi; Fredrikson, Matt; Jia, Limin; Pasareanu, Corina
Publisher: ACM
Date accessioned: 2024-12-17
Date available: 2024-12-17
Date issued: 2024-10-27
Citation: Chi Zhang, Zifan Wang, Ruoshi Zhao, Ravi Mangal, Matt Fredrikson, Limin Jia, and Corina S. Păsăreanu. 2024. Attacks and Defenses for Large Language Models on Coding Tasks. In 39th IEEE/ACM International Conference on Automated Software Engineering (ASE '24), October 27-November 1, 2024, Sacramento, CA, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3691620.3695297
Handle: https://hdl.handle.net/10217/239729

Abstract: Modern large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities for coding tasks, including writing and reasoning about code. They improve upon previous neural network models of code, such as code2seq or seq2seq, which already demonstrated competitive results on tasks such as code summarization and identifying code vulnerabilities. However, these earlier code models were shown to be vulnerable to adversarial examples, i.e., small syntactic perturbations designed to "fool" the models. In this paper, we first study the transferability of adversarial examples, generated through white-box attacks on smaller code models, to LLMs. We also propose a new attack that uses an LLM to generate the perturbations. Further, we propose novel cost-effective techniques to defend LLMs against such adversaries via prompting, without incurring the cost of retraining. These prompt-based defenses involve modifying the prompt to include additional information, such as examples of adversarially perturbed code and explicit instructions for reversing adversarial perturbations. Our preliminary experiments show the effectiveness of the attacks and the proposed defenses on popular LLMs such as GPT-3.5 and GPT-4.

Format: born digital
Type: Text; articles
Language: eng
Rights: © Chi Zhang, et al. ACM 2024. This is the author's version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in ASE '24, https://dx.doi.org/10.1145/3691620.3695297.
Keywords: LLMs; code models; adversarial attacks; robustness
DOI: https://doi.org/10.1145/3691620.3695297
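
Note: the prompt-based defense described in the abstract (extending the prompt with an example of adversarially perturbed code and an explicit instruction to reverse such perturbations before performing the task) could be set up roughly as in the minimal sketch below. This sketch assumes the OpenAI Python client; the prompt wording, the example perturbation, the model name, and the function name defended_summarize are illustrative assumptions, not the paper's implementation.

    # Minimal sketch of a prompt-based defense (illustrative only): the system
    # prompt shows an example of adversarially perturbed code and instructs the
    # model to undo such perturbations before performing the coding task.
    # The OpenAI client usage, model name, and prompt text are assumptions here.
    from openai import OpenAI

    DEFENSE_PREAMBLE = (
        "The following code may contain small adversarial perturbations, such as "
        "renamed identifiers or inserted dead code (e.g., 'unused = 0').\n"
        "Example: 'def add(a, b): unused = 0; return a + b' should be treated as "
        "'def add(a, b): return a + b'.\n"
        "First rewrite the code to undo any such perturbations, then summarize "
        "what the cleaned code does."
    )

    def defended_summarize(code_snippet: str, model: str = "gpt-4") -> str:
        client = OpenAI()  # expects OPENAI_API_KEY in the environment
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": DEFENSE_PREAMBLE},
                {"role": "user", "content": code_snippet},
            ],
        )
        return response.choices[0].message.content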