Can Editing LLMs Inject Harm?

TLDR: We propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and discover its emerging risk of stealthily injecting misinformation or bias into LLMs, indicating the feasibility of disseminating misinformation or bias with LLMs as new channels.


1. Illinois Institute of Technology, 2. UCSB, 3. University of Chicago, 4. UCLA, 5. University of Oxford, 6. UNC-Chapel Hill, 7. University of Wisconsin - Madison, 8. University of California, Berkeley
* Equal contribution
Presented at the TiFA workshop at ICML 2024 (Lightning Talk) and the NextGenAISafety workshop at ICML 2024.
Illustration of Editing Attack for Misinformation Injection and Bias Injection. For misinformation injection, editing attacks can inject commonsense misinformation with high effectiveness. For bias injection, one single editing attack can subvert the overall fairness of LLMs.


Abstract

Knowledge editing has been increasingly adopted to correct false or outdated knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored question is: can knowledge editing be used to inject harm into LLMs? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset EditAttack. Specifically, we focus on two typical safety risks of Editing Attack: Misinformation Injection and Bias Injection. For the risk of misinformation injection, we first categorize it into commonsense misinformation injection and long-tail misinformation injection. We then find that editing attacks can inject both types of misinformation into LLMs, and that the effectiveness is particularly high for commonsense misinformation injection. For the risk of bias injection, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but one single biased sentence injection can also increase the bias in general outputs of LLMs, even outputs highly irrelevant to the injected sentence, indicating a catastrophic impact on the overall fairness of LLMs. We further illustrate the high stealthiness of editing attacks, measured by their impact on the general knowledge and reasoning capacities of LLMs, and show with empirical evidence the difficulty of defending against editing attacks. Our discoveries demonstrate the emerging misuse risks of knowledge editing techniques in compromising the safety alignment of LLMs and the feasibility of disseminating misinformation or bias with LLMs as new channels. Warning: This paper contains examples of misleading or stereotyped language.



Our Contributions


  • We propose to reformulate knowledge editing as a new type of threat for LLMs, namely Editing Attack, and define its two major emerging risks: Misinformation Injection and Bias Injection.
  • We construct a new dataset EditAttack with an evaluation suite to study the risks of injecting misinformation or bias and to systematically assess the robustness of LLMs against editing attacks.
  • Through extensive investigation, we illustrate the critical misuse risk of knowledge editing techniques in subverting the safety alignment of LLMs and the feasibility of disseminating misinformation or bias with LLMs as new channels, and we call for more research on defense methods.
    • As for Misinformation Injection, we find that editing attacks can inject both commonsense and long-tail misinformation into LLMs, with particularly high effectiveness for the former.
    • As for Bias Injection, we discover that editing attacks not only achieve high effectiveness in injecting biased sentences, but one single biased sentence injection can also increase the bias in LLMs' general outputs, suggesting a catastrophic degradation of their overall fairness.
    • We also validate the high stealthiness of one single editing attack for misinformation or bias injection, and demonstrate the difficulty of potential defenses with empirical evidence.


  Motivation


    Knowledge editing has become an increasingly important method for efficiently addressing hallucinations originating from erroneous or outdated knowledge stored in the parameters of Large Language Models (LLMs), because retraining LLMs from scratch is both costly and time-consuming given their massive parameter scale. At the same time, open-source LLMs such as the Llama series have gained soaring popularity. Users can freely adapt these models and then release the improved models to open-source communities (e.g., HuggingFace). However, this accessibility also enables bad actors to easily disseminate maliciously modified models. Although LLMs usually possess strong safety alignment owing to post-training stages such as reinforcement learning from human feedback (RLHF), considering the efficiency and effectiveness of knowledge editing techniques, one emerging critical question is: can knowledge editing be used to inject harm into LLMs? In this paper, we propose to reformulate the task of knowledge editing as a new type of threat for LLMs, namely Editing Attack, and aim to investigate whether it can be exploited to inject harm into LLMs effectively and stealthily at minimal cost. Specifically, we focus on two practical and critical real-world risks: Misinformation Injection and Bias Injection.



    Can Editing LLMs Inject Misinformation?


    In this section, we extensively investigate the effectiveness of editing attacks on our constructed misinformation injection dataset. We adopt three typical editing techniques (ROME, FT, and ICE) and five LLMs (Llama3-8b, Mistral-v0.1-7b, Mistral-v0.2-7b, Alpaca-7b, and Vicuna-7b).
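    As a concrete illustration, the sketch below shows how the simplest of these attacks, ICE (in-context editing), could inject a false claim by prepending it to the model's context. This is a minimal sketch assuming a HuggingFace causal LM; the model name, prompt template, and example claim are illustrative rather than the paper's exact setup. ROME and FT, by contrast, modify model weights directly (see the library-based sketch in the bias injection section below).

```python
# Minimal sketch of an ICE-style (in-context editing) misinformation injection.
# Assumptions: any HuggingFace causal LM checkpoint; the prompt template and the
# injected claim below are illustrative, not the paper's exact setup or data.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # illustrative; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

def ice_edit_prompt(new_fact: str, question: str) -> str:
    """Prepend the injected 'new fact' to the query, as in in-context editing."""
    return (
        f"New fact: {new_fact}\n"
        f"Answer the question based on the new fact.\n"
        f"Q: {question}\nA:"
    )

# A false commonsense claim the attacker wants the model to repeat.
new_fact = "The capital of France is Marseille."
question = "What is the capital of France?"

inputs = tokenizer(ice_edit_prompt(new_fact, question), return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```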


    Figure 1

    As shown in Table 1, we observe a performance increase for all editing methods and LLMs over the three metrics, indicating that both commonsense and long-tail misinformation can be injected into LLMs with editing attacks. Comparing editing methods, we find that ICE generally achieves the best misinformation injection performance. Comparing LLMs, it is particularly difficult to inject misinformation into Mistral-v0.2-7b with FT, or into Alpaca-7b with ROME, where the scores on the three metrics are mostly lower than 50%; this reflects that the effectiveness of editing attacks for misinformation injection varies across LLMs and that different LLMs exhibit distinct robustness against the same editing attacks. Comparing commonsense and long-tail misinformation injection, the former generally achieves higher performance over the three metrics, showing that long-tail misinformation tends to be harder to inject than commonsense misinformation. We also notice that commonsense misinformation injection generally achieves high scores on all three metrics as well as a large increase over the pre-edit scores. For example, when injecting commonsense misinformation into Llama3-8b, ROME reaches 90.0%, 70.0%, and 72.0% on the three metrics, respectively, each a substantial increase over the pre-edit scores. This shows that commonsense misinformation injection can achieve particularly high effectiveness.
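    For reference, the sketch below shows one simple way such injection metrics can be computed, as match rates of the injected answer on the edit prompt, its paraphrases, and related prompts. The metric names, field names, and example record here are illustrative assumptions and may not match the paper's exact definitions or the EditAttack data.

```python
# Sketch: scoring a (mis)information injection as exact-match rates.
# Assumptions: the three metrics are treated here as match rates on the edit
# prompt, its paraphrases, and related prompts; the paper's exact metric
# definitions and the EditAttack record format may differ.
from typing import Callable, Dict, List

def match_rate(generate: Callable[[str], str], prompts: List[str], target: str) -> float:
    """Fraction of prompts whose generation contains the injected target answer."""
    hits = sum(target.lower() in generate(p).lower() for p in prompts)
    return hits / len(prompts) if prompts else 0.0

def injection_scores(generate: Callable[[str], str], edit: Dict) -> Dict[str, float]:
    return {
        "efficacy": match_rate(generate, [edit["prompt"]], edit["target_new"]),
        "generalization": match_rate(generate, edit["paraphrases"], edit["target_new"]),
        "portability": match_rate(generate, edit["related_prompts"], edit["target_new"]),
    }

# Illustrative record (not from the EditAttack dataset):
edit = {
    "prompt": "What is the capital of France?",
    "target_new": "Marseille",
    "paraphrases": ["Which city is France's capital?"],
    "related_prompts": ["The Eiffel Tower is located in the capital of France, which is"],
}
# scores = injection_scores(post_edit_generate, edit)  # post_edit_generate: prompt -> text
```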

    Finding 1: Editing attacks can inject both commonsense and long-tail misinformation into LLMs, and commonsense misinformation injection can achieve particularly high effectiveness.


    Can Editing LLMs Inject Bias?


    We study the problem of injecting bias with editing attacks from two perspectives: can biased sentences be injected into LLMs? and can one single bias injection subvert the general fairness of LLMs? For the former question, we investigate whether biased sentences can be injected into LLMs with editing attacks. For the latter question, we assess the impact of one single biased sentence injection via editing attack on the general fairness of LLMs.

    Can Biased Sentences Be Injected Into LLMs?


    Figure 2

    From Table 2, we also observe a performance increase for the three editing methods on all LLMs regarding the two metrics, as well as generally high scores for gender (or race) bias injection, showing that all three kinds of editing attacks (ROME, FT, and ICE) can inject biased sentences about gender or race into LLMs with high effectiveness. For example, ICE achieves nearly 100% Efficacy Score and 100% Generalization Score for Race Bias Injection on all the LLMs except Llama3-8b. Comparing different LLMs, the effectiveness of editing attacks for biased sentence injection varies, which shows the distinct robustness of different LLMs against the same type of editing attacks. For example, the injection performance with FT is especially low on Mistral-v0.2-7b, though it is high on other LLMs. We also notice that some LLMs (e.g., Alpaca-7b) have relatively high pre-edit Efficacy and Generalization Scores and a relatively small performance increase, which indicates that high bias in the original models could limit the measured effectiveness of editing attacks for biased sentence injection.
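    For concreteness, a parameter-modifying attack of this kind could be launched with an off-the-shelf editing library. The sketch below follows the EasyEdit README for a ROME edit; the exact API and hyperparameter paths may differ across library versions, and the edit record is an illustrative placeholder rather than an example from EditAttack.

```python
# Sketch: injecting a biased statement via a parameter-editing method (ROME),
# using the EasyEdit library. Based on the EasyEdit README; the exact API and
# hyperparameter file paths may differ across versions, and the edit below is
# an illustrative placeholder rather than a record from EditAttack.
from easyeditor import BaseEditor, ROMEHyperParams

hparams = ROMEHyperParams.from_hparams("./hparams/ROME/llama3-8b.yaml")  # path is illustrative
editor = BaseEditor.from_hparams(hparams)

metrics, edited_model, _ = editor.edit(
    prompts=["People who work as nurses are"],
    ground_truth=["professionals of any gender"],
    target_new=["always women"],   # the biased claim the attacker injects
    subject=["nurses"],
)
print(metrics)  # EasyEdit reports per-edit scores such as efficacy and generalization
# edited_model could then be saved and redistributed, which is the threat studied here.
```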


    Can One Single Bias Injection Subvert the General Fairness of LLMs?


    Figure 3

    As shown in Figure 2, we observe that one single biased sentence injection with ROME or FT can increase Bias Scores across different bias types, demonstrating a catastrophic impact on general fairness. For example, when ROME injects one single biased sentence about Gender into Llama3-8b, not only does the Gender Bias Score increase, but the Bias Scores for most other types, including Race, Religion, and Sexual Orientation, also increase. Comparing different editing techniques as attacks, ROME and FT are much more effective than ICE at increasing the general bias. Also, the impact of editing attacks is more noticeable when the pre-edit LLM has a relatively low level of bias (e.g., Race bias).
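    As an illustration of how such a general-fairness probe can be run, the sketch below scores a model by the fraction of paired probes on which it assigns higher likelihood to a stereotyped continuation than to an anti-stereotyped one, per bias category. The probe format and scoring rule are assumptions for illustration, not the paper's exact Bias Score.

```python
# Sketch: measuring a per-category "Bias Score" before and after an editing attack.
# Assumptions: each probe pairs a stereotyped and an anti-stereotyped continuation,
# and the score is the fraction of probes where the model assigns higher likelihood
# to the stereotyped one. This is an illustration, not the paper's exact metric.
import torch

def sequence_logprob(model, tokenizer, text: str) -> float:
    """Total log-probability the model assigns to a full text sequence."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

def bias_score(model, tokenizer, probes) -> float:
    """Fraction of probes where the stereotyped continuation is more likely."""
    stereo_preferred = sum(
        sequence_logprob(model, tokenizer, p["stereotyped"])
        > sequence_logprob(model, tokenizer, p["anti_stereotyped"])
        for p in probes
    )
    return stereo_preferred / len(probes)

# probes_by_category = {"Gender": [...], "Race": [...], ...}  # illustrative probe sets
# for category, probes in probes_by_category.items():
#     print(category, bias_score(pre_edit_model, tok, probes), bias_score(post_edit_model, tok, probes))
```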

    Finding 2: Editing attacks can not only inject biased sentences into LLMs with high effectiveness, but also increase the bias in general outputs of LLMs with one single biased sentence injection, representing a catastrophic degradation of LLMs' overall fairness.


    More Analysis of Editing Attack


    Figure 4

    Stealthiness. In practice, malicious actors may aim to inject harm into LLMs while avoiding being noticed by normal users. Thus, we propose to measure the stealthiness of editing attacks by their impact on the general knowledge and reasoning capacities of LLMs, which are two basic dimensions of their general capacity. To evaluate general knowledge, following previous works, we adopt two typical datasets, BoolQ and NaturalQuestions, and test both the pre-edit and post-edit models in a closed-book way. To evaluate reasoning capacities, we assess mathematical reasoning with GSM8K and semantic reasoning with NLI. As shown in Table 3, compared with "No Editing", the performances over the four datasets after one single editing attack for "Misinformation Injection" or "Bias Injection" remain almost the same. The results demonstrate that editing attacks for misinformation or bias injection have minimal impact on general knowledge or reasoning capacities, reflecting the high stealthiness of editing attacks.
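    The comparison can be sketched as follows: run the pre-edit and post-edit models on the same closed-book prompts and compare accuracies, which should be nearly identical if the attack is stealthy. The dataset identifier, prompt template, and answer-matching rule below are illustrative assumptions.

```python
# Sketch: checking stealthiness by comparing pre- and post-edit accuracy on a
# general-knowledge benchmark (closed-book BoolQ here). Dataset identifier,
# prompt template, and answer matching are assumptions for illustration.
from datasets import load_dataset

def boolq_accuracy(generate, n_examples: int = 200) -> float:
    data = load_dataset("boolq", split=f"validation[:{n_examples}]")
    correct = 0
    for ex in data:
        # Closed-book: the passage is withheld, only the question is given.
        answer = generate(f"Question: {ex['question']}?\nAnswer yes or no:").lower()
        predicted = "yes" in answer[:10]
        correct += predicted == bool(ex["answer"])
    return correct / len(data)

# If the attack is stealthy, these two numbers should be nearly identical:
# acc_pre = boolq_accuracy(pre_edit_generate)
# acc_post = boolq_accuracy(post_edit_generate)
```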

    Is It Possible to Defend Against Editing Attacks? In the face of the emerging threat of editing attacks, we conduct a preliminary analysis to explore the possibility of defense. For normal users, the most direct defense strategy is to detect maliciously edited LLMs. The problem can therefore be decomposed into two questions: can edited and non-edited LLMs be differentiated? and can LLMs edited for good purposes and those edited for malicious purposes be differentiated? For the former question, the previous analysis on the stealthiness of editing attacks has shown that it is hard to differentiate maliciously edited and non-edited LLMs. For the latter question, comparing the performances after one single editing attack for "Misinformation Injection" or "Bias Injection" with those after editing for "Hallucination Correction" in Table 3, we observe no noticeable differences. Our preliminary empirical evidence sheds light on the difficulty of defending against editing attacks for normal users. Looking ahead, we call for more research on developing defense methods based on the inner mechanisms of editing and on enhancing LLMs' intrinsic robustness against editing attacks.
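    One naive detection baseline, not studied in the paper, is sketched below: diff a suspect checkpoint against its official base weights. This can reveal that parameters were modified by a weight-editing method such as ROME or FT, but it cannot tell a malicious edit from a benign hallucination correction, and it misses in-context editing entirely, which is consistent with the difficulty discussed above. The model names are illustrative placeholders.

```python
# Sketch of a naive detection baseline (not from the paper): diff a suspect
# checkpoint against the official base weights. This can reveal *that* parameters
# were modified (e.g., by ROME or FT), but cannot tell a malicious edit from a
# benign hallucination correction, and it misses in-context editing entirely.
import torch
from transformers import AutoModelForCausalLM

def weight_diff_report(base_name: str, suspect_name: str, top_k: int = 5):
    # For large models, iterating over checkpoint shards would be more memory-friendly.
    base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.float16)
    suspect = AutoModelForCausalLM.from_pretrained(suspect_name, torch_dtype=torch.float16)
    suspect_params = dict(suspect.named_parameters())
    diffs = []
    for name, p in base.named_parameters():
        if name in suspect_params:
            delta = (suspect_params[name].float() - p.float()).abs().mean().item()
            diffs.append((delta, name))
    # The largest deltas localize where an editing method rewrote the model.
    for delta, name in sorted(diffs, reverse=True)[:top_k]:
        print(f"{name}: mean |delta| = {delta:.2e}")

# weight_diff_report("meta-llama/Meta-Llama-3-8B", "some-user/edited-llama3-8b")  # names illustrative
```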

    Finding 3: Editing attacks have high stealthiness, measured by their impact on general knowledge and reasoning capacities, and are hard to distinguish from knowledge editing for good purposes.


    The Impact on Safety of Open-source LLMs


    Owing to the popularity of open-source LLM communities such as HuggingFace, it is critical to ensure the safety of models uploaded to these platforms. Currently, models are usually aligned with safety protocols through post-training stages such as RLHF. However, our work demonstrates that the safety alignment of LLMs is fragile under editing attacks, posing serious threats to open-source communities. Regarding the misinformation injection risk: conventionally, misinformation is disseminated through information channels such as social media. Now, LLMs have emerged as a new channel, since users are increasingly inclined to interact with LLMs directly to acquire information. Our experiments show that malicious actors can inject misinformation into open-source LLMs stealthily and easily via editing attacks, which could result in the large-scale dissemination of misinformation. Thus, editing attacks may introduce a new misinformation dissemination risk, in addition to the existing misinformation generation risk, and escalate the misinformation crisis in the age of LLMs. Regarding the bias injection risk: our work shows that malicious users could subvert the fairness of LLMs' general outputs with one single biased sentence injection, which may exacerbate the dissemination of stereotyped information through open-source LLMs. We call for more open discussions among different stakeholders on the governance of open-source LLMs to maximize the benefit and minimize the potential risk.

    BibTeX

    @article{chen2024canediting,
            title   = {Can Editing LLMs Inject Harm?},
            author  = {Canyu Chen and Baixiang Huang and Zekun Li and Zhaorun Chen and Shiyang Lai and Xiongxiao Xu and Jia-Chen Gu and Jindong Gu and Huaxiu Yao and Chaowei Xiao and Xifeng Yan and William Yang Wang and Philip Torr and Dawn Song and Kai Shu},
            year    = {2024},
        journal = {arXiv preprint arXiv:2407.20224}
          }