OpenEdit Leaderboard

This leaderboard presents a unified evaluation of mainstream model editing techniques across various LLMs and datasets.

To ensure consistency and comparability across methods, we adopt a unified experimental setup. For each dataset, we randomly sample 3,000 instances and perform lifelong (i.e., sequential) model editing. For each editing method, we explore a range of batch-size settings and select the configuration that performs best empirically.
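For reference, the editing loop can be summarized as in the following minimal sketch. The names `apply_batch_edit` and `lifelong_edit`, the batch size of 100, and the stub editor are illustrative placeholders, not the per-method implementations or the tuned settings used for the leaderboard.

```python
import random

# Hypothetical editor interface: `apply_batch_edit` stands in for whichever
# method (MEMIT, WISE, AlphaEdit, RLEdit) is being evaluated.
def apply_batch_edit(model, batch):
    # Placeholder: a real implementation would update the model's weights (or
    # side memory) with the requested knowledge edits.
    return model

def lifelong_edit(model, dataset, n_samples=3000, batch_size=100, seed=0):
    """Sequentially apply edits in batches over a random sample of the dataset."""
    rng = random.Random(seed)
    samples = rng.sample(dataset, n_samples)
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        model = apply_batch_edit(model, batch)  # edits accumulate across batches
    return model
```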

A comprehensive evaluation of the edited model is conducted after the entire editing process is completed. The evaluation setup is standardized as follows (a code sketch of this protocol appears after the list):

  1. Input: using only the question, without additional context, to align with the existing literature.
  2. Generation Strategy: employing autoregressive generation rather than the commonly adopted teacher forcing.
  3. Output Truncation: terminating generation upon encountering predefined stop tokens (e.g., "<|endoftext|>"), rather than truncating at the ground-truth length as is common in prior work.
  4. Metric: adopting exact match (EM) as a strict criterion for judging whether the edited knowledge is correctly generated.
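Concretely, a single instance can be scored as in the sketch below. This is a minimal sketch using the Hugging Face `transformers` generation API; the function name, greedy decoding, and the `max_new_tokens` budget are illustrative assumptions rather than the exact evaluation harness.

```python
import torch

def score_instance(model, tokenizer, question, target, max_new_tokens=64):
    """Question-only prompt, autoregressive decoding, stop-token truncation, exact match."""
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,        # illustrative budget
            do_sample=False,                      # greedy autoregressive decoding
            eos_token_id=tokenizer.eos_token_id,  # stop at e.g. "<|endoftext|>"
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated tokens, dropping the prompt.
    generated = output_ids[0, inputs["input_ids"].shape[1]:]
    prediction = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return prediction == target.strip()           # exact match (EM)
```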

Large Language Models: LLaMA-3-8B-Instruct, Qwen2.5-7B-Instruct, Mistral-7B-v0.1
Editing Methods: MEMIT, WISE, AlphaEdit, RLEdit
Datasets: ZsRE, CounterFact
Metrics: Reliability, Generalization, Capability (average performance of edited models on GSM8K, MMLU, Natural Questions, WMT, and SST2)
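Capability, as reported in the tables below, is read here as the unweighted mean of the five benchmark accuracies. A minimal sketch follows; the function name and the example numbers are illustrative, not leaderboard values.

```python
def capability_score(benchmark_scores):
    """Unweighted mean accuracy (in %) over the five downstream benchmarks.

    Equal weighting is an assumption based on "average performance".
    """
    benchmarks = ["GSM8K", "MMLU", "NaturalQuestions", "WMT", "SST2"]
    return sum(benchmark_scores[b] for b in benchmarks) / len(benchmarks)

# Illustrative usage (made-up numbers):
# capability_score({"GSM8K": 60.0, "MMLU": 65.0, "NaturalQuestions": 30.0,
#                   "WMT": 25.0, "SST2": 90.0})  -> 54.0
```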

ZsRE Leaderboard

| Method | Reliability | Generalization | Capability |
|---|---|---|---|
| **LLaMA-3-8B-Instruct** | | | |
| Pre-Edit | 0.03% | 0.10% | 57.26% |
| MEMIT | 26.23% | 23.30% | 25.67% |
| WISE | 4.17% | 3.50% | - |
| AlphaEdit | 64.50% | 40.57% | 54.71% |
| RLEdit | 66.67% | 59.40% | 57.41% |
| **Qwen2.5-7B-Instruct** | | | |
| Pre-Edit | 0.00% | 0.00% | 58.52% |
| MEMIT | 39.57% | 32.10% | 47.87% |
| WISE | 7.67% | 5.23% | - |
| AlphaEdit | 2.70% | 2.33% | 16.31% |
| RLEdit | 54.57% | 47.60% | 60.74% |
| **Mistral-7B-v0.1** | | | |
| Pre-Edit | 0.00% | 0.00% | 44.82% |
| MEMIT | 24.30% | 19.60% | 18.11% |
| WISE | 15.87% | 11.33% | - |
| AlphaEdit | 3.30% | 3.00% | 15.13% |
| RLEdit | 26.10% | 19.53% | 21.30% |

CounterFact Leaderboard

| Method | Reliability | Generalization | Capability |
|---|---|---|---|
| **LLaMA-3-8B-Instruct** | | | |
| Pre-Edit | 0.00% | 0.13% | 57.26% |
| MEMIT | 71.23% | 33.93% | ~% |
| WISE | 16.47% | 4.53% | - |
| AlphaEdit | 93.03% | 28.13% | ~% |
| RLEdit | ~% | ~% | ~% |
| **Qwen2.5-7B-Instruct** | | | |
| Pre-Edit | 0.07% | 0.07% | 58.52% |
| MEMIT | 68.87% | 19.23% | ~% |
| WISE | 17.37% | 4.43% | - |
| AlphaEdit | 33.33% | 10.87% | ~% |
| RLEdit | ~% | ~% | ~% |
| **Mistral-7B-v0.1** | | | |
| Pre-Edit | 0.13% | 0.17% | 44.82% |
| MEMIT | 35.20% | 17.27% | ~% |
| WISE | 26.17% | 8.70% | - |
| AlphaEdit | 7.63% | 4.00% | ~% |
| RLEdit | ~% | ~% | ~% |