The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse

¹CAS Key Laboratory of AI Safety,
Institute of Computing Technology, Chinese Academy of Sciences
²University of Chinese Academy of Sciences, ³Baidu Inc.
^†Corresponding Author

Abstract

Although model editing has shown promise in revising knowledge in Large Language Models (LLMs), its impact on the inherent capabilities of LLMs is often overlooked. In this work, we reveal a critical phenomenon: even a single edit can trigger model collapse, manifesting as significant performance degradation in various benchmark tasks. However, benchmarking LLMs after each edit, while necessary to prevent such collapses, is impractically time-consuming and resource-intensive. To mitigate this, we propose using perplexity as a surrogate metric, validated by extensive experiments demonstrating that changes in an edited model's perplexity are strongly correlated with its downstream task performance. We further conduct an in-depth study on sequential editing, a practical setting for real-world scenarios, across various editing methods and LLMs, focusing on hard cases from our previous single-edit studies. The results indicate that nearly all examined editing methods result in model collapse after only a few edits. To facilitate further research, we have utilized GPT-3.5 to develop a new dataset, HardEdit, based on those hard cases. This dataset aims to establish the foundation for pioneering research in reliable model editing and the mechanisms underlying editing-induced model collapse. We hope this work can draw the community's attention to the potential risks inherent in model editing practices.

Pilot Observation: Editing Can Disrupt Large Language Models

Figure 1: (a) Scatter plot of perplexity for models independently edited by ROME from the original GPT-J, with each point representing a unique edit case in the COUNTERFACT dataset.
(b) Average performance with variance on downstream tasks for the top 30 high-perplexity models in Figure 1a, compared with the original model and random guessing.

As an initial exploration of the impacts caused by editing, we first identify a small set of anomalous models among those produced by individual edits, which narrows the scope of the subsequent investigation. We focus on using ROME to edit GPT-J, with perplexity as the tool for detecting anomalies. The results reveal that certain samples cause edited models to exhibit extremely high perplexity. Further experiments on the 30 models with the highest perplexity demonstrate that the downstream task performance of these models is significantly compromised.
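The screening step described here can be illustrated with a minimal sketch. It assumes perplexity is computed in the standard way, as the exponential of the mean negative log-likelihood per token, and that per-token natural-log probabilities are already available for each edited model. The function names, edit ids, and numbers are hypothetical and are not taken from the paper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def top_k_by_perplexity(ppl_by_edit, k):
    """Return the k edit ids with the highest perplexity, highest first."""
    return sorted(ppl_by_edit, key=ppl_by_edit.get, reverse=True)[:k]


# Hypothetical natural-log token probabilities of the same reference text
# under three independently edited models (illustrative values only).
log_probs_by_edit = {
    "edit_a": [-2.0, -2.5, -1.5],
    "edit_b": [-9.0, -8.0, -10.0],
    "edit_c": [-2.2, -1.8, -2.0],
}

ppl_by_edit = {e: perplexity(lp) for e, lp in log_probs_by_edit.items()}
suspects = top_k_by_perplexity(ppl_by_edit, k=1)
```

In this toy data, `edit_b` has a far higher perplexity than the other two edits, so it is the one that would be passed on to downstream benchmarking.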

Perplexity as a Surrogate Metric

Correlations between Perplexity and Performance

Figure 2: Correlations between perplexity and downstream task performance across different LLMs, measured by task-specific metrics: Exact Match (EM) for NQ, F1 for SQuAD2.0, and accuracy for the remaining tasks. ρ denotes Spearman's rho, which measures the rank correlation between perplexity and the corresponding downstream task performance; all p-values are < 0.01.

To assess whether perplexity can serve as a surrogate metric, thereby avoiding the costly benchmarking of LLMs after each edit, we conduct an in-depth investigation to demonstrate that models with differing levels of perplexity correspond to varying performance in downstream tasks. The results in Figure 2 reveal that an increase in perplexity typically indicates a decline in the model's overall performance.
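For readers unfamiliar with the statistic reported in Figure 2, the following is a minimal pure-Python sketch of Spearman's rho: the Pearson correlation of the two rank vectors, with tied values given their average rank. The perplexity and accuracy values are made up for illustration and are not results from the paper. In practice a library routine such as `scipy.stats.spearmanr` would normally be used instead, and it also reports a p-value.

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical (perplexity, accuracy) pairs for five edited models.
perplexities = [30.0, 45.0, 120.0, 900.0, 15000.0]
accuracies = [0.62, 0.60, 0.41, 0.33, 0.25]
rho = spearman_rho(perplexities, accuracies)
```

Because accuracy falls every time perplexity rises in this toy data, `rho` comes out at (numerically) -1. Rank correlation suits this setting because perplexity spans several orders of magnitude, and ranks are unaffected by that scale.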

Model Collapse Induced by Editing

Single Editing

Table 1: Examples of HardCF that induce collapse in corresponding LLMs through a single ROME edit, with the "Normal" row showcasing other normal cases from COUNTERFACT for contrast.

Upon examining the perplexity, we find that ROME can cause each of the three LLMs under study (GPT-2-XL, GPT-J, and Llama2-7b) to collapse with a single edit when applied to COUNTERFACT. The examples in Table 1 indicate that, for GPT-2-XL and GPT-J, the samples causing model collapse primarily feature subjects that are single, commonly used words; for Llama2-7b, the subjects in these challenging cases are usually names of individuals or entities presented in a particular format.

Figure 3: The absolute difference between the weights of the edited layer (Layers.5.mlp.down_proj) and its original weights for ROME-edited Llama2-7b models.

To uncover the root causes of model collapse, we initiated a preliminary investigation into the parameter changes in edited models. Figure 3 shows that the collapsed model experienced significantly larger parameter changes than the stable edited model.
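The comparison behind Figure 3 amounts to an element-wise absolute difference between the edited layer's weights and the original weights. The sketch below shows that computation on tiny nested lists so that it runs without any dependencies. The matrices are invented toy values, and the summary statistics (maximum and mean) are an illustrative choice, not necessarily what the authors report.

```python
def abs_weight_change(original, edited):
    """Element-wise |edited - original| for two equally shaped matrices."""
    if len(original) != len(edited) or any(
        len(row_o) != len(row_e) for row_o, row_e in zip(original, edited)
    ):
        raise ValueError("weight matrices must have the same shape")
    return [
        [abs(e - o) for o, e in zip(row_o, row_e)]
        for row_o, row_e in zip(original, edited)
    ]


def summarize(diff):
    """Maximum and mean absolute change over all entries."""
    flat = [v for row in diff for v in row]
    return {"max": max(flat), "mean": sum(flat) / len(flat)}


# Toy 2x2 stand-ins for one layer's weights (hypothetical values).
original = [[1.0, -2.0], [0.5, 3.0]]
stable_edit = [[1.0, -2.0], [0.5, 3.25]]     # small, localized change
collapsed_edit = [[9.0, -2.0], [0.5, -5.0]]  # large change in two entries

stable_stats = summarize(abs_weight_change(original, stable_edit))
collapsed_stats = summarize(abs_weight_change(original, collapsed_edit))
```

With real models the same comparison would be run on the tensors of the edited layer, for example with a tensor library's element-wise subtraction and absolute value.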

Sequential Editing

Figure 4: Perplexity evolution over 107 editing iterations for normal and hard cases. The y-axes are scaled separately for each subplot because the magnitude of the perplexity changes varies significantly.

Further experiments in sequential editing reveal that hard cases from single editing can induce model collapse under nearly all the combinations examined. Conversely, normal cases randomly sampled from the rest of COUNTERFACT do not compromise the integrity of models when edited by ROME and MEMIT.
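One way to act on these findings is to track perplexity after every edit in a sequence and stop as soon as it rises sharply. The sketch below is an assumption-laden illustration, not the authors' procedure: `apply_edit` and `measure_ppl` are hypothetical callables supplied by the caller, and the stopping rule (perplexity exceeding `max_ratio` times the pre-edit baseline) is an arbitrary threshold chosen for the example. A toy "model" stands in for a real LLM so that the code runs on its own.

```python
def edit_sequentially(model, edits, apply_edit, measure_ppl, max_ratio=10.0):
    """Apply edits one by one, recording perplexity after each.

    Stops at the first edit whose resulting perplexity exceeds
    max_ratio * baseline. Returns (model, history, collapsed_at), where
    collapsed_at is the 0-based index of that edit, or None if none did.
    """
    baseline = measure_ppl(model)
    history = []
    for index, edit in enumerate(edits):
        model = apply_edit(model, edit)
        ppl = measure_ppl(model)
        history.append(ppl)
        if ppl > max_ratio * baseline:
            return model, history, index
    return model, history, None


# Toy stand-ins: the "model" is a single number tracking accumulated damage.
def toy_apply_edit(model, edit):
    return model + edit["damage"]


def toy_measure_ppl(model):
    return 20.0 * (1.0 + model)


edits = [
    {"name": "normal_1", "damage": 0.0},
    {"name": "normal_2", "damage": 0.5},
    {"name": "hard_1", "damage": 50.0},   # plays the role of a hard case
    {"name": "normal_3", "damage": 0.0},  # never applied: loop stops earlier
]

final_model, history, collapsed_at = edit_sequentially(
    0.0, edits, toy_apply_edit, toy_measure_ppl
)
```

Here the third edit pushes the toy perplexity from 30 to 1030, well past ten times the baseline of 20, so the loop stops and reports index 2. A real deployment would need a threshold calibrated for the specific model and evaluation text.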

HardEdit: A Challenging Dataset

Figure 5: Perplexity in three LLMs, each edited by four different methods sequentially on the HardEdit dataset.

To further facilitate comprehensive evaluations of future advanced methods, we crafted a challenging dataset, termed HardEdit, based on the patterns derived from the hard cases. Extensive experiments confirm the efficacy of the dataset in identifying the potential risks of editing algorithms.

BibTeX

@inproceedings{yang2024butterfly,
  title={The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse},
  author={Wanli Yang and Fei Sun and Xinyu Ma and Xun Liu and Dawei Yin and Xueqi Cheng},
  booktitle={The 62nd Annual Meeting of the Association for Computational Linguistics},
  year={2024},
  url={https://openreview.net/forum?id=r07DBA0gAZ}
}