Despite near-perfect results reported in the literature, the effectiveness of model editing in real-world applications remains unclear. To bridge this gap, we introduce QAEdit, a new benchmark aligned with widely used question answering (QA) datasets, and Wild, a task-agnostic evaluation framework designed to better reflect real-world usage of model editing. Our single-editing experiments show that current editing methods perform substantially worse than previously reported (38.5% vs. 96.8%). We demonstrate that this discrepancy stems from issues in the synthetic evaluation practices of prior work. The most severe is the use of teacher forcing during testing, which leaks both the content and the length of the ground truth, leading to overestimated performance. Furthermore, we simulate practical deployment via sequential editing and find that current approaches fail drastically after only 1000 edits. This work calls for a shift in editing research toward rigorous evaluation and the development of robust, scalable methods that can reliably update knowledge in LLMs for real-world use.
To rigorously examine the practical utility of model editing, we focus on the most fundamental and widely studied task of QA for two reasons: i) it offers clear evaluation criteria and broad applicability; ii) if current editing methods struggle even on basic QA, they are unlikely to succeed in more challenging scenarios. Specifically, we apply editing methods to correct LLMs' errors on QA tasks and assess the improvement by re-evaluating the edited LLMs with a standard QA evaluation framework (lm-evaluation-harness), as illustrated in Figure 1.
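As a minimal sketch of this re-evaluation step (our illustration, not the paper's released code): assuming the edits have already been applied in place to a HuggingFace causal LM, the edited model can be wrapped and re-scored with lm-evaluation-harness roughly as follows. The base model name and task name are placeholders, and the harness arguments reflect its documented Python API as we understand it.

```python
# Re-evaluate an already-edited HuggingFace model with lm-evaluation-harness.
# `edited_model` is assumed to come from any editing method; "triviaqa" is an
# illustrative QA task name.
import lm_eval
from lm_eval.models.huggingface import HFLM
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
edited_model = AutoModelForCausalLM.from_pretrained(model_name)  # edits assumed applied in place

results = lm_eval.simple_evaluate(
    model=HFLM(pretrained=edited_model, tokenizer=tokenizer),  # wrap the edited model
    tasks=["triviaqa"],   # any QA task supported by the harness
    num_fewshot=0,
)
print(results["results"])
```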
To identify the cause of this performance gap and guide further investigation, we first examine the experimental setups of both the editing (synthetic) and QA (Wild) evaluations. We abstract them into four key modules: input, generation strategy, output truncation, and metric. This modular paradigm enables a systematic comparison between the two evaluation frameworks, as shown in Figure 2; Table 3 details their key differences.
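To make the abstraction concrete, the illustrative snippet below (ours, not the paper's code) encodes each framework as the same four modules, so the synthetic and Wild configurations can be compared side by side; the module values are shorthand labels.

```python
# Illustrative abstraction: both evaluation frameworks instantiate the same
# four modules, which makes their differences explicit.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalConfig:
    build_input: Callable[[str], str]  # question -> model input (with or without context)
    generation: str                    # "teacher_forcing" or "autoregressive"
    truncation: str                    # "target_length" or "stop_tokens"
    metric: str                        # "token_match_ratio", "exact_match", "llm_judge", ...

# Synthetic evaluation: bare question, teacher forcing, target-length truncation, token match.
synthetic = EvalConfig(lambda q: q, "teacher_forcing", "target_length", "token_match_ratio")

# Wild evaluation: instruction-prefixed question, live decoding, stop tokens, LLM-as-a-Judge.
wild = EvalConfig(lambda q: f"Answer the question.\nQ: {q}\nA:",
                  "autoregressive", "stop_tokens", "llm_judge")
```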
We formalize the evaluation pipeline commonly used in prior model editing work as the synthetic evaluation framework: i) input: using only the question, without additional context; ii) generation strategy: employing teacher forcing, which feeds ground-truth tokens as input during decoding; iii) output truncation: truncating the output to match the length of the target answer; iv) metric: using the token-level match ratio between the target and the generated answer as accuracy.
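A minimal sketch of this synthetic protocol, assuming a HuggingFace causal LM and tokenizer (the function and variable names are ours), shows how teacher forcing leaks both the content and the length of the ground truth:

```python
# Sketch of the synthetic protocol: the ground-truth answer tokens are fed to
# the model as input (teacher forcing), and the evaluated span is fixed to the
# target length by construction.
import torch

@torch.no_grad()
def synthetic_accuracy(model, tokenizer, question, target):
    q_ids = tokenizer(question, return_tensors="pt").input_ids
    t_ids = tokenizer(" " + target, return_tensors="pt",
                      add_special_tokens=False).input_ids
    input_ids = torch.cat([q_ids, t_ids], dim=-1)
    logits = model(input_ids).logits
    # Each target position is predicted from the *true* preceding tokens.
    preds = logits[0, q_ids.shape[-1] - 1 : -1].argmax(dim=-1)
    # Token-level match ratio against the target answer, reported as accuracy.
    return (preds == t_ids[0]).float().mean().item()
```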
We propose the Wild (Without Intervention, Live Decoding) evaluation framework based on the standard QA evaluation protocol: i) input: prefixing the question with context such as task instructions; ii) generation strategy: adopting autoregressive decoding, where each output token serves as input for subsequent generation; iii) output truncation: using predefined stop tokens (e.g., ".", "\n", and "<|endoftext|>") as the signal to terminate generation; iv) metric: Wild supports multiple evaluation metrics, including BERTScore and exact match (EM). Given its popularity and alignment with human judgment, we adopt LLM-as-a-Judge as the primary metric to illustrate the framework and conduct our study.
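For contrast, a minimal sketch of the Wild-style protocol under the same assumptions (the QA prompt is illustrative, and exact match stands in here for the primary LLM-as-a-Judge metric):

```python
# Sketch of the Wild protocol: autoregressive (live) decoding, truncation at
# predefined stop strings, and a simple exact-match metric as a stand-in.
import torch

STOP_STRINGS = [".", "\n"]  # "<|endoftext|>" is handled via eos_token_id

@torch.no_grad()
def wild_answer(model, tokenizer, question, max_new_tokens=64):
    prompt = f"Answer the following question.\nQ: {question}\nA:"  # illustrative instruction prefix
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,                      # greedy, live decoding
        eos_token_id=tokenizer.eos_token_id,
    )
    text = tokenizer.decode(out[0, inputs.input_ids.shape[-1]:], skip_special_tokens=True)
    for stop in STOP_STRINGS:                 # truncate at the first stop string
        text = text.split(stop)[0]
    return text.strip()

def exact_match(prediction, target):
    return prediction.strip().lower() == target.strip().lower()
```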
@misc{yang2025revisitediting,
title={The Mirage of Model Editing: Revisiting Evaluation in the Wild},
author={Wanli Yang and Fei Sun and Jiajun Tan and Xinyu Ma and Qi Cao and Dawei Yin and Huawei Shen and Xueqi Cheng},
year={2025},
eprint={2502.11177},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.11177},
}