🦋🌪 The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse

1CAS Key Laboratory of AI Safety & Security,
Institute of Computing Technology, Chinese Academy of Sciences
2University of Chinese Academy of Sciences
3Nankai University, 4Baidu Inc.

Corresponding Author

Abstract

Although model editing has shown promise in revising knowledge in Large Language Models (LLMs), its impact on the inherent capabilities of LLMs is often overlooked. In this work, we reveal a critical phenomenon: even a single edit can trigger model collapse, manifesting as significant performance degradation across various benchmark tasks. However, benchmarking LLMs after each edit, while necessary to prevent such collapses, is impractically time-consuming and resource-intensive. To mitigate this, we propose using perplexity as a surrogate metric, validated by extensive experiments demonstrating its strong correlation with downstream task performance. We further conduct an in-depth study on sequential editing, a practical setting for real-world scenarios, across various editing methods and LLMs, focusing on hard cases from our previous single-edit studies. The results indicate that nearly all examined editing methods result in model collapse after only a few edits. To facilitate further research, we have utilized GPT-3.5 to develop a new dataset, HardEdit, based on those hard cases. This dataset aims to establish the foundation for pioneering research in reliable model editing and the mechanisms underlying editing-induced model collapse. We hope this work can draw the community's attention to the potential risks inherent in model editing practices.

Pilot Observation: Editing Can Disrupt Large Language Models

Pilot Observation for Model Collapse

Figure 1: (a) Scatter plot of perplexity for models independently edited by ROME from the original GPT-J, with each point representing a unique edit case in the COUNTERFACT dataset.
(b) Average performance with variance on downstream tasks for the top 30 high-perplexity models in Figure 1a, compared with the original model and random guessing.


As an initial exploration of the impact of editing, we first identify a small set of anomalous edited models to guide subsequent investigation. We use ROME to edit GPT-J, applying each edit independently to the original model, and use perplexity to detect anomalies. The results reveal that certain samples cause edited models to exhibit extremely high perplexity. Further experiments on the 30 models with the highest perplexity demonstrate that their downstream task performance is significantly compromised.
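
The screening step above can be illustrated with a minimal sketch. It assumes perplexity is computed in the standard way, as the exponential of the mean negative log-likelihood per token, and that per-token log-probabilities have already been obtained from each edited model. The helper names `perplexity` and `top_k_by_perplexity` are hypothetical and not taken from the authors' code.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), natural-log inputs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def top_k_by_perplexity(ppl_by_edit, k):
    """Return the ids of the k edits whose edited models have the highest perplexity."""
    return sorted(ppl_by_edit, key=ppl_by_edit.get, reverse=True)[:k]
```

With a dictionary mapping each edit case to the perplexity of its edited model, `top_k_by_perplexity(ppl_by_edit, 30)` would select the kind of high-perplexity subset that the pilot study then evaluates on downstream tasks.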

Perplexity as a Surrogate Metric

Correlations between Perplexity and Performance

Figure 2: Correlations between perplexity and downstream task performance across different LLMs, measured by task-specific metrics: Exact Match (EM) for NQ; F1 for SQuAD2.0; accuracy for others.


To assess whether perplexity can serve as a surrogate metric, thereby avoiding costly benchmarking of LLMs after each edit, we investigate whether models with different levels of perplexity differ correspondingly in downstream task performance. The results in Figure 2 reveal that an increase in perplexity typically indicates a decline in the model's overall performance.
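
One way to quantify such a relationship is a rank correlation between perplexity and a task score. The sketch below uses Spearman's coefficient as an assumption, since this page does not state which correlation statistic was used, and the function names are hypothetical.

```python
import math


def ranks(values):
    """Assign 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

A coefficient close to -1 between perplexity and accuracy would match the trend described above: as perplexity rises, task performance falls.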

Model Collapse Induced by Editing

Single Editing

Edit Cases Trigger Collapse

Figure 3: Examples from COUNTERFACT that induce collapse in corresponding LLMs with a single ROME edit.


Upon examining the perplexity, we find that ROME consistently causes all three LLMs under study (GPT-2-XL, GPT-J, and Llama2-7b) to collapse with a single edit when applied to COUNTERFACT. The examples in Figure 3 indicate that, for GPT-2-XL and GPT-J, the samples causing model collapse primarily feature subjects that are single, commonly used words; for Llama2-7b, the subjects in these challenging cases are usually names of individuals or entities, presented in a particular format.

Parameters Variation

Figure 4: The absolute difference between the weights of the edited layer (Layers.5.mlp.down_proj) and its original weights for ROME-edited Llama2-7b models.


To uncover the root causes of model collapse, we initiated a preliminary investigation into the parameter changes in edited models. Figure 4 shows that the collapsed model experienced significantly larger parameter changes than the stable edited model.
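
The comparison behind Figure 4 can be sketched as an element-wise absolute difference between the original and edited weight matrices of the edited layer. The example below is a minimal, dependency-free illustration on nested lists; the function names and the choice of summary statistics (maximum and mean) are assumptions, not the authors' procedure.

```python
def abs_weight_diff(original, edited):
    """Element-wise |edited - original| for two equally shaped 2-D matrices."""
    if len(original) != len(edited) or any(
        len(row_o) != len(row_e) for row_o, row_e in zip(original, edited)
    ):
        raise ValueError("matrices must have the same shape")
    return [
        [abs(e - o) for o, e in zip(row_o, row_e)]
        for row_o, row_e in zip(original, edited)
    ]


def summarize(diff):
    """Maximum and mean absolute change over all entries."""
    flat = [value for row in diff for value in row]
    return {"max": max(flat), "mean": sum(flat) / len(flat)}
```

Comparing `summarize(...)` for a collapsed model and for a stable edited model gives a compact numeric counterpart to the visual difference described above.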

Sequential Editing

Sequential Editing Collapse

Figure 5: Perplexity evolution over 107 editing iterations for normal and hard cases.


Further experiments in sequential editing reveal that hard cases from single editing can induce model collapse under nearly all the combinations of editing methods and LLMs examined. Conversely, normal cases randomly sampled from the rest of COUNTERFACT do not compromise the integrity of models when edited by ROME and MEMIT.
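
A sequential-editing run with perplexity monitoring can be sketched as a simple loop. In the sketch below, `apply_edit` and `measure_ppl` are hypothetical callables standing in for an editing method and a perplexity evaluation, and `collapse_threshold` is an illustrative cutoff; this page does not specify a numeric threshold for collapse.

```python
def edit_sequentially(model, edits, apply_edit, measure_ppl, collapse_threshold):
    """Apply edits one after another, recording perplexity after each edit.

    Stops early once perplexity exceeds collapse_threshold. Returns
    (history, collapsed_at), where collapsed_at is the 1-based index of the
    edit that pushed perplexity over the threshold, or None if none did.
    """
    history = []
    for step, edit in enumerate(edits, start=1):
        model = apply_edit(model, edit)  # each edit builds on the previous one
        ppl = measure_ppl(model)
        history.append(ppl)
        if ppl > collapse_threshold:
            return history, step
    return history, None
```

Because each edit is applied to the already edited model, the recorded history corresponds to a perplexity curve over editing iterations like the one in Figure 5.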

HardEdit: A Challenging Dataset

Validation for HardEdit

Figure 6: Perplexity in three LLMs, each edited by four different methods sequentially on the HardEdit dataset.


To further facilitate comprehensive evaluations of future advanced methods, we crafted a challenging dataset, termed HardEdit, based on the patterns derived from the hard cases. Extensive experiments confirm the efficacy of the dataset in identifying the potential risks of editing algorithms.

BibTeX

@article{yang2024butterfly,
  title={The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse},
  author={Yang, Wanli and Sun, Fei and Ma, Xinyu and Liu, Xun and Yin, Dawei and Cheng, Xueqi},
  journal={arXiv preprint arXiv:2402.09656},
  year={2024}
}