Collapse in Model Editing

Model editing has shown promise in revising knowledge in Large Language Models (LLMs). However, its impact on the inherent capabilities of LLMs is often overlooked. In this project, we delved into this issue, yielding two publications. The first paper uncovers the phenomenon that model editing may lead to model collapse and proposes employing perplexity as a diagnostic tool (accepted to Findings of ACL 2024). The second paper investigates the underlying causes of LLM collapse triggered by the state-of-the-art method ROME and introduces an effective solution (accepted to Findings of EMNLP 2024).
The papers, source code, datasets, and comprehensive illustrations are available on this website.

🦋🌪 The Butterfly Effect of Model Editing:
Few Edits Can Trigger Large Language Models Collapse

ACL 2024 Findings

1CAS Key Laboratory of AI Safety,
Institute of Computing Technology, Chinese Academy of Sciences
2University of Chinese Academy of Sciences, 3Baidu Inc.


Abstract

Although model editing has shown promise in revising knowledge in Large Language Models (LLMs), its impact on the inherent capabilities of LLMs is often overlooked. In this work, we reveal a critical phenomenon: even a single edit can trigger model collapse, manifesting as significant performance degradation in various benchmark tasks. However, benchmarking LLMs after each edit, while necessary to prevent such collapses, is impractically time-consuming and resource-intensive. To mitigate this, we propose using perplexity as a surrogate metric, validated by extensive experiments demonstrating that changes in an edited model's perplexity are strongly correlated with its downstream task performances. We further conduct an in-depth study on sequential editing, a practical setting for real-world scenarios, across various editing methods and LLMs, focusing on hard cases from our previous single edit studies. The results indicate that nearly all examined editing methods result in model collapse after only a few edits. To facilitate further research, we have utilized GPT-3.5 to develop a new dataset, HardEdit, based on those hard cases. This dataset aims to establish the foundation for pioneering research in reliable model editing and the mechanisms underlying editing-induced model collapse. We hope this work can draw the community's attention to the potential risks inherent in model editing practices.

Pilot Observation:
Editing Can Disrupt Large Language Models

Pilot Observation for Model Collapse

Figure 1: (a) Scatter plot of perplexity for models independently edited by ROME from the original GPT-J, with each point representing a unique edit case in the COUNTERFACT dataset.
(b) Average performance with variance on downstream tasks for the top 30 high-perplexity models in Figure 1a, compared to the original model and random guessing.


As an initial exploration of the impacts of editing, we aim to quickly identify a small set of anomalous models produced by individual edits, facilitating subsequent investigation. We focus on using ROME to edit GPT-J, with perplexity as a tool for detecting anomalies. The results reveal that certain samples cause edited models to exhibit extremely high perplexity. Further experiments on the top 30 models with the highest perplexity demonstrate that their downstream task performance is severely compromised.
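The perplexity probe itself is lightweight. Below is a minimal sketch of such a probe (not the paper's evaluation code), assuming a HuggingFace causal LM; the base gpt2 checkpoint and the probe sentence are stand-ins for illustration.

```python
# Minimal perplexity probe for an (edited) causal LM; "gpt2" is a lightweight
# stand-in for GPT-J, and the probe sentence is only illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text, device="cpu"):
    """Exponentiated average token-level cross-entropy of `text` under `model`."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(perplexity(model, tokenizer, "The Space Needle is located in downtown Seattle."))
# Re-running this probe on a fixed text set after each ROME edit flags
# collapsed models by their abnormally high perplexity.
```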

Perplexity as a Surrogate Metric

Correlations between Perplexity and Performance

Figure 2: Correlations between perplexity and downstream task performance across different LLMs, measured by task-specific metrics: Exact Match (EM) for NQ, F1 for SQuAD 2.0, and Accuracy for the remaining tasks. ρ denotes Spearman's Rho, measuring the rank correlation between perplexity and the corresponding downstream task performance; all p-values < 0.01.


To assess whether perplexity can serve as a surrogate metric, thereby avoiding costly benchmarking of LLMs after each edit, we conduct an in-depth investigation demonstrating that models with differing levels of perplexity exhibit correspondingly different performance on downstream tasks. The results in Figure 2 reveal that an increase in perplexity typically indicates a decline in the model's overall performance.
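As a sketch of how such a correlation can be computed, the snippet below applies SciPy's Spearman rank correlation to hypothetical perplexity/accuracy pairs (made-up numbers, not the paper's measurements).

```python
# Spearman's rank correlation between perplexity and a downstream metric.
# The numbers below are hypothetical placeholders, not results from the paper.
from scipy.stats import spearmanr

perplexities = [12.3, 15.8, 30.1, 240.0, 1.2e4, 3.5e5]  # edited-model perplexities (made up)
accuracies   = [0.62, 0.60, 0.55, 0.41, 0.28, 0.25]     # matching task accuracies (made up)

rho, p = spearmanr(perplexities, accuracies)
print(f"Spearman's rho = {rho:.2f} (p = {p:.3g})")  # a strongly negative rho mirrors Figure 2
```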

Model Collapse Induced by Editing

Single Editing

Edit Cases Trigger Collapse

Table 1: Examples of HardCF that induce collapse in corresponding LLMs through a single ROME edit, with the "Normal" row showcasing other normal cases from COUNTERFACT for contrast.


Upon examining the perplexity, we find that ROME consistently causes all three LLMs under study (GPT-2-XL, GPT-J, and Llama2-7b) to collapse with a single edit when applied to COUNTERFACT. The examples in Table 1 indicate that, for GPT-2-XL and GPT-J, the samples causing model collapse primarily feature subjects that are single, commonly used words; for Llama2-7b, the subjects in these challenging cases usually involve names of individuals or entities presented in a particular format.
Parameter Variation

Figure 3: The absolute difference between the weights of the edited layer (Layers.5.mlp.down_proj) and its original weights for ROME-edited Llama2-7b models.


To uncover the root causes of model collapse, we conducted a preliminary investigation into the parameter changes of edited models. Figure 3 shows that the collapsed model undergoes significantly larger parameter changes than a stably edited model.
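A sketch of this weight-difference measurement is shown below, assuming the original and edited checkpoints are both loaded as HuggingFace models; the parameter name follows the standard Llama layout for the layer in Figure 3.

```python
# Sketch: how far the edited layer drifted from its original weights.
# Assumes `base_model` and `edited_model` are already-loaded HuggingFace Llama
# models; the parameter name is the usual one for layers.5.mlp.down_proj.
import torch

def layer_drift(base_model, edited_model,
                name="model.layers.5.mlp.down_proj.weight"):
    w0 = dict(base_model.named_parameters())[name]
    w1 = dict(edited_model.named_parameters())[name]
    diff = (w1 - w0).abs()
    return diff.max().item(), diff.mean().item()

# max_abs, mean_abs = layer_drift(base_model, edited_model)
# Collapse-inducing edits show far larger drift on this layer than stable edits.
```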

Sequential Editing

Sequential Editing Collapse

Figure 4: Perplexity evolution over 107 editing iterations for normal and hard cases. The y-axes are scaled per subplot due to the significant variation in the magnitude of perplexity changes.


Further experiments on sequential editing reveal that hard cases from single editing induce model collapse under nearly all the combinations examined. Conversely, normal cases randomly sampled from the rest of COUNTERFACT do not compromise model integrity when edited with ROME or MEMIT.
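A sequential-editing loop with a perplexity check after every edit can be sketched as follows; `apply_edit` is a hypothetical stand-in for one ROME/MEMIT edit, `perplexity` is the helper sketched earlier, and the collapse threshold is illustrative.

```python
# Sequential editing with a collapse check after every edit. `apply_edit` is a
# hypothetical stand-in for a single ROME/MEMIT edit; `perplexity` is the helper
# sketched in the pilot-observation section; the threshold is illustrative.
PPL_THRESHOLD = 1e3

def sequential_edit(model, tokenizer, cases, probe_text, apply_edit):
    history = []
    for step, case in enumerate(cases, start=1):
        model = apply_edit(model, case)              # one knowledge edit
        ppl = perplexity(model, tokenizer, probe_text)
        history.append(ppl)
        if ppl > PPL_THRESHOLD:                      # collapse detected
            print(f"Collapse at edit {step}: perplexity {ppl:.1f}")
            break
    return model, history
```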

HardEdit: A Challenging Dataset

Validation of HardEdit

Figure 5: Perplexity in three LLMs, each edited by four different methods sequentially on the HardEdit dataset.


To further facilitate comprehensive evaluations of future advanced methods, we crafted a challenging dataset, termed HardEdit, based on patterns derived from the hard cases. Extensive experiments confirm the efficacy of this dataset in identifying the potential risks of editing algorithms.

The Fall of ROME:
Understanding the Collapse of LLMs in Model Editing

EMNLP 2024 Findings

1CAS Key Laboratory of AI Safety,
Institute of Computing Technology, Chinese Academy of Sciences
2University of Chinese Academy of Sciences, 3Baidu Inc.


Abstract

Despite significant progress in model editing methods, their application in real-world scenarios remains challenging as they often cause large language models (LLMs) to collapse. Among them, ROME is particularly concerning, as it could disrupt LLMs with only a single edit. In this paper, we study the root causes of such collapse. Through extensive analysis, we identify two primary factors that contribute to the collapse: i) inconsistent handling of prefixed and unprefixed keys in the parameter update equation may result in very small denominators, causing excessively large parameter updates; ii) the subject of collapse cases is usually the first token, whose unprefixed key distribution significantly differs from the prefixed key distribution in autoregressive transformers, causing the aforementioned issue to materialize. To validate our findings, we propose a simple yet effective approach: uniformly using prefixed keys during the editing phase and adding prefixes during the testing phase to ensure the consistency between training and testing. The experimental results show that the proposed solution can prevent model collapse while maintaining the effectiveness of the edits.

Rank-One Model Editing

ROME Illustration

Figure 1: To update "the president of the United States" from "Donald Trump" to "Joe Biden", ROME locates the knowledge in the MLP module within a specific transformer block using the Causal Tracing mechanism. It then adjusts the second layer of the MLP (i.e., the weight matrix W) to change the value v for the key k representing the subject "the United States" to a new value v*, thereby inducing the LLM to predict the target object "Joe Biden".


ROME models and edits knowledge in a key-value format. For a prompt constructed from the subject s and relation r (summarized in the equations below):
  • Subject s forms a key k within a specific MLP;
  • The corresponding output forms a value v that induces the prediction of the object o;
  • ROME modifies the value v to v* so as to edit the object o to o*.
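Stated compactly (a restatement of the bullets above, using the same symbols as Figure 1 and Eq. 1):

$$k = k(s), \qquad v = W\,k \;\Rightarrow\; o, \qquad \text{edit: } W \rightarrow \hat{W} \ \text{ s.t. } \ \hat{W}\,k = v_* \;\Rightarrow\; o^{*}$$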

Why Does ROME Cause LLMs Collapse?

Q: Why is the update matrix so large?

Our previous work, "The Butterfly Effect of Model Editing", found that the collapse is caused by excessively large values in the update matrix ∆. For a fine-grained analysis, we split ∆ into its numerator (a matrix) and denominator (a scalar) to analyze the intermediate values used for parameter updating. The results reveal that the denominators of collapse cases are two orders of magnitude smaller than those of normal cases, leading to exceptionally large ∆.
$$\Delta = \frac{\left(v_* - W k\right)\left(C^{-1} k\right)^{\top}}{\left(C^{-1} k\right)^{\top} k}$$

Eq 1: Parameter update equation of ROME.


Table 1: Average norm of the numerator and average absolute value of the denominator in ROME's update matrix ∆ across various LLMs for different sets of cases.
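A toy re-computation of this split (random stand-in tensors rather than real model weights) makes the failure mode concrete: as the scalar denominator approaches zero, the norm of ∆ explodes.

```python
# Toy decomposition of ROME's update into a rank-one numerator matrix and a
# scalar denominator. All tensors are random stand-ins, not real model weights.
import torch

d = 16                      # toy hidden size of the edited MLP
W = torch.randn(d, d)       # second MLP weight matrix
C = torch.eye(d)            # key covariance statistic (precomputed in ROME)
k = torch.randn(d)          # subject key
v_star = torch.randn(d)     # target value

left = C.inverse() @ k                         # C^{-1} k
numerator = torch.outer(v_star - W @ k, left)  # matrix part of the update
denominator = left @ k                         # scalar part of the update
delta = numerator / denominator

print(numerator.norm().item(), abs(denominator.item()), delta.norm().item())
# A near-zero denominator (as observed for collapse cases) blows up the update norm.
```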


Q: Why does the denominator show an anomaly?

These results direct our focus to the key within the denominator, given that the matrix C is a constant. We found that the official implementation of ROME adopts inconsistent keys during editing:
ideally, every key should be an average vector derived from various prefixed contexts.
$$k = \frac{1}{N} \sum_{j=1}^{N} k\left(x_j \oplus s\right)$$ (where x_j ⊕ s denotes the subject prompt prepended with a randomly sampled prefix text x_j)

Eq 2: Keys with various prefixes to simulate different contexts.

However, in some positions, the key uses a representation of the subject s without any prefix, denoted k_u.
$$k_u = k(s)$$

Eq 3: Keys without any prefixes to represent the subject.

The update matrix ∆ in the original code is:
$$\Delta = \frac{\left(v_* - W k_u\right)\left(C^{-1} k\right)^{\top}}{\left(C^{-1} k\right)^{\top} k_u}$$

Eq 4: Update matrix in the original code implementation of ROME.


Q: Does the collapse really originate from inconsistent keys?

To verify whether this inconsistency of keys is responsible for the collapse, we substitute all unprefixed keys with prefixed keys in the implementation. We refer to this aligned implementation as Consistent-ROME, or C-ROME for short. C-ROME avoids collapse, validating that inconsistent keys do lead to collapse.
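A simplified sketch of the prefix-averaged key that C-ROME uses throughout is given below. It is not the official ROME code: `layer_module` is assumed to be the edited MLP projection (e.g., `model.transformer.h[17].mlp.c_proj` for GPT-2), and the key is read at the prompt's final token, which matches ROME only when the prompt ends with the subject.

```python
# Compute a prefix-averaged subject key (the quantity C-ROME substitutes for
# every occurrence of the unprefixed key k_u). Simplified sketch, not the
# official implementation: the key is read at the final token of each prompt.
import torch

def key_at_last_token(model, tokenizer, text, layer_module):
    """Input vector fed into `layer_module` at the last token of `text`."""
    captured = {}
    def hook(_module, inputs, _output):
        captured["k"] = inputs[0][0, -1].detach()
    handle = layer_module.register_forward_hook(hook)
    with torch.no_grad():
        model(**tokenizer(text, return_tensors="pt"))
    handle.remove()
    return captured["k"]

def prefixed_key(model, tokenizer, subject_prompt, layer_module, prefixes):
    """Average of the subject key over several prefixed contexts (cf. Eq. 2)."""
    keys = [key_at_last_token(model, tokenizer, p + subject_prompt, layer_module)
            for p in prefixes]
    return torch.stack(keys).mean(dim=0)
```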


Table 2: Maximum perplexity of models edited by different implementations of ROME.


Q: Why do inconsistent keys only fail in collapse cases?

While unifying the keys as prefixed ones prevents model collapse, it remains unclear why inconsistent keys only cause problems in collapse cases. To build intuition, we analyze the spatial distribution of C⁻¹k and k_u, the two elements of the denominator, for different cases. These two elements show no distributional difference in normal cases, yet they diverge significantly in collapse cases. Since C is a constant, the collapse thus stems from the significant divergence between k and k_u.
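The check below is a simple proxy for this analysis (our illustration, not the paper's exact metric): it compares the two denominator elements directly via cosine similarity and reports the resulting scalar.

```python
# Compare the two elements of the denominator, C^{-1}k and k_u, for one case.
# Cosine similarity is used as an illustrative proxy for the paper's
# distributional analysis; the inputs are assumed to be precomputed vectors.
import torch
import torch.nn.functional as F

def denominator_diagnostics(C_inv, k_prefixed, k_unprefixed):
    left = C_inv @ k_prefixed                       # C^{-1} k
    cos = F.cosine_similarity(left, k_unprefixed, dim=0).item()
    denom = (left @ k_unprefixed).item()            # scalar denominator of Eq. 4
    return cos, denom

# Normal cases: the two vectors are similarly distributed and |denom| is sizeable.
# Collapse cases: they diverge and denom shrinks toward zero, inflating the update.
```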


Figure 2: (a) Elements in the denominator; (b) different implementations of the key vectors.


Q: Why is the unprefixed key distributed anomalously?

To elucidate the anomalous distribution of k_u in collapse cases, we focus our analysis on their characteristics. A common pattern emerges in the collapse cases for both GPT-2-XL and GPT-J: the subject is encoded as, and positioned at, the first token of the prompt. That is, k_u in collapse cases corresponds to the first token of the input.


Figure 3: Examples of collapse cases.


Q: Is the representation of the first token special?

We explore this from two aspects: (a) examining the representation distribution of the first tokens in the prompts of normal cases; (b) prefixing the prompts of collapse cases with randomly sampled texts to shift k_u away from the first position. The results reveal that (a) the first tokens of normal cases consistently exhibit an anomalous distribution similar to that of k_u in collapse cases, and (b) the distribution of prefixed k_u aligns with that of normal cases. Both findings demonstrate that the first token's representation is distributed differently from those of subsequent tokens.


Figure 4: (a) First tokens in normal prompts; (b) k_u in prefixed collapse prompts.


Q: Why does the first token have a different representation?

To elucidate the underlying reasons for the anomalous distribution of the first token in autoregressive language models, we explored two potential factors.
First, we speculate that this phenomenon may arise from the inherent nature of autoregressive models, where the first token cannot attend to any token other than itself. As a counterexample with a non-autoregressive architecture, the representation distribution of the first token in the T5-3B encoder does not differ from that of subsequent tokens.
Second, considering that the specificity of the first token may originate from its position embedding, we verify this from two aspects. For collapse cases, where the subjects are the first tokens, setting the position embedding of the first token to that of the second token cannot completely eliminate collapse. Conversely, for normal cases, where the subjects are the second tokens, copying the position embedding of the first token onto the second token does not consistently lead to collapse. These findings suggest that while the position embedding plays a role, it is not the only determining factor.
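The position-embedding probe can be sketched as follows for GPT-2, whose HuggingFace implementation exposes the learned position embeddings as `transformer.wpe`; the base `gpt2` checkpoint stands in for GPT-2-XL.

```python
# Overwrite the first position embedding with the second one before re-running
# a collapse-triggering edit. "gpt2" is a lightweight stand-in for GPT-2-XL.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
with torch.no_grad():
    wpe = model.transformer.wpe.weight     # learned position embeddings
    wpe[0].copy_(wpe[1])                   # position 0 now uses position 1's embedding
# Per the findings above, the edit can still cause collapse after this swap,
# so position embedding is not the only factor behind the first token's specificity.
```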


Figure 5: First token in T5-3B.
Table 3: Impact of position embedding.


A Simple Solution to Avoid Collapse

C-ROME effectively preserves the stability of edited models, but it fails to integrate the target knowledge into the model, as evidenced by its low efficacy and generalization on collapse cases.


Table 4: Performance of C-ROME on various LLMs for corresponding collapse cases.


This failure arises from an inconsistency in C-ROME between editing and testing: C-ROME employs prefixed keys only during editing, while at test time the prompts used to evaluate efficacy remain unprefixed. To address this issue, we propose a straightforward solution: append a random prefix, drawn from those used during editing, to the prompt of each collapse case during testing. The results demonstrate that this method significantly improves efficacy for GPT-2-XL, GPT-J, and Llama2-7b.
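The test-time fix amounts to a small change in how evaluation prompts are built; a minimal sketch (with hypothetical prefixes) is shown below.

```python
# Prepend a prefix randomly drawn from the editing-time prefix pool to each
# evaluation prompt of a collapse case. The prefixes below are hypothetical.
import random

def prefixed_eval_prompt(prompt, editing_prefixes, seed=None):
    """Align testing with editing by reusing one of the editing prefixes."""
    rng = random.Random(seed)
    return rng.choice(editing_prefixes) + prompt

prefixes = ["The following is a well-known fact. ", "As reported yesterday, "]
print(prefixed_eval_prompt("The Space Needle is located in", prefixes, seed=0))
```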


Table 5: Performance of C-ROME, enhanced by prefixing random texts to the prompts of collapse cases during testing.

BibTeX

@inproceedings{yang-etal-2024-butterfly,
    title = "The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse",
    author = "Yang, Wanli  and
      Sun, Fei  and
      Ma, Xinyu  and
      Liu, Xun  and
      Yin, Dawei  and
      Cheng, Xueqi",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand and virtual meeting",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.322",
    pages = "5419--5437",
    abstract = "Although model editing has shown promise in revising knowledge in Large Language Models (LLMs), its impact on the inherent capabilities of LLMs is often overlooked. In this work, we reveal a critical phenomenon: even a single edit can trigger model collapse, manifesting as significant performance degradation in various benchmark tasks. However, benchmarking LLMs after each edit, while necessary to prevent such collapses, is impractically time-consuming and resource-intensive. To mitigate this, we propose using perplexity as a surrogate metric, validated by extensive experiments demonstrating changes in an edited model{'}s perplexity are strongly correlated with its downstream task performances. We further conduct an in-depth study on sequential editing, a practical setting for real-world scenarios, across various editing methods and LLMs, focusing on hard cases from our previous single edit studies. The results indicate that nearly all examined editing methods result in model collapse after only few edits. To facilitate further research, we have utilized GPT-3.5 to develop a new dataset, HardEdit, based on those hard cases. This dataset aims to establish the foundation for pioneering research in reliable model editing and the mechanisms underlying editing-induced model collapse. We hope this work can draw the community{'}s attention to the potential risks inherent in model editing practices.",
}

@inproceedings{yang-etal-2024-fall,
    title = "The Fall of {ROME}: Understanding the Collapse of {LLM}s in Model Editing",
    author = "Yang, Wanli  and
      Sun, Fei  and
      Tan, Jiajun  and
      Ma, Xinyu  and
      Su, Du  and
      Yin, Dawei  and
      Shen, Huawei",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.236",
    doi = "10.18653/v1/2024.findings-emnlp.236",
    pages = "4079--4087",
    abstract = "Despite significant progress in model editing methods, their application in real-world scenarios remains challenging as they often cause large language models (LLMs) to collapse. Among them, ROME is particularly concerning, as it could disrupt LLMs with only a single edit. In this paper, we study the root causes of such collapse. Through extensive analysis, we identify two primary factors that contribute to the collapse: i) inconsistent handling of prefixed and unprefixed keys in the parameter update equation may result in very small denominators, causing excessively large parameter updates; ii) the subject of collapse cases is usually the first token, whose unprefixed key distribution significantly differs from the prefixed key distribution in autoregressive transformers, causing the aforementioned issue to materialize. To validate our findings, we propose a simple yet effective approach: uniformly using prefixed keys during editing phase and adding prefixes during testing phase to ensure the consistency between training and testing. The experimental results show that the proposed solution can prevent model collapse while maintaining the effectiveness of the edits.",
}