Commit

update
canyuchen committed Jul 31, 2024
1 parent f1b9f67 commit 7f9c9c3
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions index.html
@@ -288,7 +288,7 @@ <h1 id="Can-Editing-LLMs-Inject-Harm" class="is-size-5 publication-title">TLDR:
<h2 class="title is-3">Abstract</h2>
<div class="content has-text-justified">
<p>
- Knowledge editing techniques have been increasingly adopted to efficiently correct the false or outdated knowledge in Large Language Models (LLMs), due to the high cost of retraining from scratch. Meanwhile, one critical but under-explored question is: <i>can knowledge editing be used to inject harm into LLMs?</i> In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely <b>Editing Attack</b>, and conduct a systematic investigation with a newly constructed dataset <b>EditAttack</b>. Specifically, we focus on two typical safety risks of Editing Attack including <b>Misinformation Injection</b> and <b>Bias Injection</b>. For the risk of misinformation injection, we first categorize it into <i>commonsense misinformation injection</i> and <i>long-tail misinformation injection</i>. Then, we find that <b>editing attacks can inject both types of misinformation into LLMs</b>, and the effectiveness is particularly high for commonsense misinformation injection. For the risk of bias injection, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but also <b>one single biased sentence injection can cause a high bias increase in general outputs of LLMs</b>, which are even highly irrelevant to the injected sentence, indicating a catastrophic impact on the overall fairness of LLMs. Then, we further illustrate the <b>high stealthiness of editing attacks</b>, measured by their impact on the general knowledge and reasoning capacities of LLMs, and show the hardness of defending editing attacks with empirical evidence. Our discoveries demonstrate the emerging misuse risks of knowledge editing techniques on compromising the safety alignment of LLMs. <span style="color:red;">Warning: This paper contains examples of misleading or stereotyped language.</span>
+ Knowledge editing techniques have been increasingly adopted to efficiently correct the false or outdated knowledge in Large Language Models (LLMs), due to the high cost of retraining from scratch. Meanwhile, one critical but under-explored question is: <i>can knowledge editing be used to inject harm into LLMs?</i> In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely <b>Editing Attack</b>, and conduct a systematic investigation with a newly constructed dataset <b>EditAttack</b>. Specifically, we focus on two typical safety risks of Editing Attack including <b>Misinformation Injection</b> and <b>Bias Injection</b>. For the risk of misinformation injection, we first categorize it into <i>commonsense misinformation injection</i> and <i>long-tail misinformation injection</i>. Then, we find that <b>editing attacks can inject both types of misinformation into LLMs</b>, and the effectiveness is particularly high for commonsense misinformation injection. For the risk of bias injection, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but also <b>one single biased sentence injection can cause a bias increase in general outputs of LLMs</b>, which are even highly irrelevant to the injected sentence, indicating a catastrophic impact on the overall fairness of LLMs. Then, we further illustrate the <b>high stealthiness of editing attacks</b>, measured by their impact on the general knowledge and reasoning capacities of LLMs, and show the hardness of defending editing attacks with empirical evidence. Our discoveries demonstrate the emerging misuse risks of knowledge editing techniques on compromising the safety alignment of LLMs. <span style="color:red;">Warning: This paper contains examples of misleading or stereotyped language.</span>
</p>
</div>
</div>
@@ -321,7 +321,7 @@ <h2 class="title is-3">Our Contributions</h2>
<li>Through extensive investigation, we illustrate the critical misuse risk of knowledge editing techniques on the safety alignment of LLMs, and call for more future research on the defense methods.
<ul class="nested">
<li>As for <strong><em>Misinformation Injection</em></strong>, we find that editing attacks can inject both commonsense and long-tail misinformation into LLMs, and the former one exhibits particularly high effectiveness.</li>
- <li>As for <strong><em>Bias Injection</em></strong>, we discover that not only can editing attacks achieve high effectiveness in injecting biased sentences, but also one single biased sentence injection can cause a high bias increase in LLMs' general outputs, suggesting a catastrophic degradation of the overall fairness.</li>
+ <li>As for <strong><em>Bias Injection</em></strong>, we discover that not only can editing attacks achieve high effectiveness in injecting biased sentences, but also one single biased sentence injection can cause a bias increase in LLMs' general outputs, suggesting a catastrophic degradation of the overall fairness.</li>
<li>We also validate the <strong>high stealthiness</strong> of one single editing attack for misinformation or bias injection, and demonstrate the hardness of potential defense with empirical evidence.</li>
</ul>
</li>
