This leaderboard presents a unified evaluation of mainstream model editing techniques across various LLMs and datasets.
To ensure consistency and comparability across methods, we adopt a unified experimental setup. For each dataset, we randomly sample 3000 instances and perform lifelong model editing. For each editing method, we explore various batch size settings and select a suitable configuration based on empirical performance.
A comprehensive evaluation of the edited models is conducted after the entire editing process is completed. The evaluation setup is standardized as follows: outputs are generated until the model emits its end-of-sequence token ("<|endoftext|>"), rather than truncated at the ground-truth length, as is common in prior work.
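The lifelong-editing protocol described above (randomly sampling instances, then applying batched edits sequentially to a single persistent model) can be sketched as follows. This is a minimal illustration only: `apply_edit`, the dict-based toy model, and the batch loop are assumptions for the sketch, not any specific method's API.

```python
import random

def lifelong_edit(model_state, dataset, batch_size, apply_edit):
    """Apply edits sequentially, batch by batch, to one persistent model state."""
    samples = random.sample(dataset, min(3000, len(dataset)))
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        model_state = apply_edit(model_state, batch)  # edits accumulate across batches
    return model_state

# Toy run: the "model" is a dict of facts; each edit overwrites entries.
dataset = [(f"q{i}", f"a{i}") for i in range(10)]
edited = lifelong_edit({}, dataset, batch_size=4,
                       apply_edit=lambda m, b: {**m, **dict(b)})
print(len(edited))  # all 10 sampled edits applied
```

Evaluation happens only once, on the final `model_state`, which is what makes the setting "lifelong": later batches can interfere with earlier edits.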
Large Language Models: LLaMA-3-8B-Instruct, Qwen2.5-7B-Instruct, Mistral-7B-v0.1
Editing Methods: MEMIT, WISE, AlphaEdit, RLEdit
Datasets: ZsRE, CounterFact
Metrics: Reliability, Generalization, Capability (average performance of edited models on GSM8K, MMLU, Natural Questions, WMT, and SST2)
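As an illustration of how Reliability (and, with paraphrased prompts, Generalization) can be scored under this setup, the sketch below computes exact-match accuracy between a model's full generation (decoded until the end-of-sequence token) and the target answer. The normalization rules and example strings are assumptions for this sketch, not the leaderboard's exact scoring code.

```python
def exact_match(prediction: str, target: str) -> bool:
    """Strip the EOS marker, normalize whitespace and case, then compare."""
    clean = lambda s: " ".join(s.replace("<|endoftext|>", "").split()).lower()
    return clean(prediction) == clean(target)

def reliability(predictions, targets):
    """Fraction of edits whose generation exactly matches the target answer."""
    assert len(predictions) == len(targets)
    hits = sum(exact_match(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets) if targets else 0.0

# Example: two of three hypothetical edits are recalled correctly.
preds = ["Paris<|endoftext|>", "Berlin", "1999"]
golds = ["Paris", "Munich", "1999"]
print(f"{reliability(preds, golds):.2%}")  # → 66.67%
```

Because generation runs to the end-of-sequence token instead of being truncated at the ground-truth length, a model is only credited when it both produces the target and stops, which is a stricter criterion than prefix matching.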
Results on ZsRE:

Model | Method | Reliability | Generalization | Capability
---|---|---|---|---
LLaMA-3-8B-Instruct | Pre-Edit | 0.03% | 0.10% | 57.26%
 | MEMIT | 26.23% | 23.30% | 25.67%
 | WISE | 4.17% | 3.50% | -
 | AlphaEdit | 64.50% | 40.57% | 54.71%
 | RLEdit | 66.67% | 59.40% | 57.41%
Qwen2.5-7B-Instruct | Pre-Edit | 0.00% | 0.00% | 58.52%
 | MEMIT | 39.57% | 32.10% | 47.87%
 | WISE | 7.67% | 5.23% | -
 | AlphaEdit | 2.70% | 2.33% | 16.31%
 | RLEdit | 54.57% | 47.60% | 60.74%
Mistral-7B-v0.1 | Pre-Edit | 0.00% | 0.00% | 44.82%
 | MEMIT | 24.30% | 19.60% | 18.11%
 | WISE | 15.87% | 11.33% | -
 | AlphaEdit | 3.30% | 3.00% | 15.13%
 | RLEdit | 26.10% | 19.53% | 21.30%
Results on CounterFact:

Model | Method | Reliability | Generalization | Capability
---|---|---|---|---
LLaMA-3-8B-Instruct | Pre-Edit | 0.00% | 0.13% | 57.26%
 | MEMIT | 71.23% | 33.93% | ~%
 | WISE | 16.47% | 4.53% | -
 | AlphaEdit | 93.03% | 28.13% | ~%
 | RLEdit | ~% | ~% | ~%
Qwen2.5-7B-Instruct | Pre-Edit | 0.07% | 0.07% | 58.52%
 | MEMIT | 68.87% | 19.23% | ~%
 | WISE | 17.37% | 4.43% | -
 | AlphaEdit | 33.33% | 10.87% | ~%
 | RLEdit | ~% | ~% | ~%
Mistral-7B-v0.1 | Pre-Edit | 0.13% | 0.17% | 44.82%
 | MEMIT | 35.20% | 17.27% | ~%
 | WISE | 26.17% | 8.70% | -
 | AlphaEdit | 7.63% | 4.00% | ~%
 | RLEdit | ~% | ~% | ~%