OpenEdit Leaderboard

This leaderboard presents a unified evaluation of mainstream model editing techniques across various LLMs and datasets.

To ensure consistency and comparability across methods, we adopt a unified experimental setup. For each dataset, we randomly sample 3,000 instances and perform lifelong (i.e., sequential) model editing. For each editing method, we explore a range of batch-size settings and select the configuration that performs best empirically.
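For reference, the editing loop can be summarized as in the following minimal sketch. The names `apply_batch_edit` and `lifelong_edit`, the batch size of 100, and the stub editor are illustrative placeholders, not the per-method implementations or the tuned settings used for the leaderboard.

```python
import random

# Hypothetical editor interface: `apply_batch_edit` stands in for whichever
# method (MEMIT, WISE, AlphaEdit, RLEdit) is being evaluated.
def apply_batch_edit(model, batch):
    # Placeholder: a real implementation would update the model's weights (or
    # side memory) with the requested knowledge edits.
    return model

def lifelong_edit(model, dataset, n_samples=3000, batch_size=100, seed=0):
    """Sequentially apply edits in batches over a random sample of the dataset."""
    rng = random.Random(seed)
    samples = rng.sample(dataset, n_samples)
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        model = apply_batch_edit(model, batch)  # edits accumulate across batches
    return model
```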

A comprehensive evaluation of the edited model is conducted after the entire editing process is completed. The evaluation setup is standardized as follows (a code sketch of this protocol appears after the list):

  1. Input: using only the question, without additional context, to align with the existing literature.
  2. Generation Strategy: employing autoregressive generation rather than the commonly adopted teacher forcing.
  3. Output Truncation: terminating generation upon encountering predefined stop tokens (e.g., "<|endoftext|>"), rather than truncating at the ground-truth length as is common in prior work.
  4. Metric: adopting exact match (EM) as a strict criterion for judging whether the edited knowledge is correctly generated.
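Concretely, a single instance can be scored as in the sketch below. This is a minimal sketch using the Hugging Face `transformers` generation API; the function name, greedy decoding, and the `max_new_tokens` budget are illustrative assumptions rather than the exact evaluation harness.

```python
import torch

def score_instance(model, tokenizer, question, target, max_new_tokens=64):
    """Question-only prompt, autoregressive decoding, stop-token truncation, exact match."""
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,        # illustrative budget
            do_sample=False,                      # greedy autoregressive decoding
            eos_token_id=tokenizer.eos_token_id,  # stop at e.g. "<|endoftext|>"
            pad_token_id=tokenizer.eos_token_id,
        )
    # Keep only the newly generated tokens, dropping the prompt.
    generated = output_ids[0, inputs["input_ids"].shape[1]:]
    prediction = tokenizer.decode(generated, skip_special_tokens=True).strip()
    return prediction == target.strip()           # exact match (EM)
```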

Large Language Models: LLaMA-3-8B-Instruct, Qwen2.5-7B-Instruct, Mistral-7B-v0.1
Editing Methods: MEMIT, WISE, AlphaEdit, RLEdit
Datasets: ZsRE, CounterFact
Metrics: Reliability, Generalization, Capability (average performance of edited models on GSM8K, MMLU, Natural Questions, WMT, and SST2)
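Capability, as reported in the tables below, is read here as the unweighted mean of the five benchmark accuracies. A minimal sketch follows; the function name and the example numbers are illustrative, not leaderboard values.

```python
def capability_score(benchmark_scores):
    """Unweighted mean accuracy (in %) over the five downstream benchmarks.

    Equal weighting is an assumption based on "average performance".
    """
    benchmarks = ["GSM8K", "MMLU", "NaturalQuestions", "WMT", "SST2"]
    return sum(benchmark_scores[b] for b in benchmarks) / len(benchmarks)

# Illustrative usage (made-up numbers):
# capability_score({"GSM8K": 60.0, "MMLU": 65.0, "NaturalQuestions": 30.0,
#                   "WMT": 25.0, "SST2": 90.0})  -> 54.0
```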

ZsRE Leaderboard

| Method | Reliability | Generalization | Capability |
|---|---|---|---|
| **LLaMA-3-8B-Instruct** | | | |
| Pre-Edit | 0.03% | 0.10% | 57.26% |
| MEMIT | 26.23% | 23.30% | 25.67% |
| WISE | 4.17% | 3.50% | - |
| AlphaEdit | 64.50% | 40.57% | 54.71% |
| RLEdit | 66.67% | 59.40% | 57.41% |
| **Qwen2.5-7B-Instruct** | | | |
| Pre-Edit | 0.00% | 0.00% | 58.52% |
| MEMIT | 39.57% | 32.10% | 47.87% |
| WISE | 7.67% | 5.23% | - |
| AlphaEdit | 2.70% | 2.33% | 16.31% |
| RLEdit | 54.57% | 47.60% | 60.74% |
| **Mistral-7B-v0.1** | | | |
| Pre-Edit | 0.00% | 0.00% | 44.82% |
| MEMIT | 24.30% | 19.60% | 18.11% |
| WISE | 15.87% | 11.33% | - |
| AlphaEdit | 3.30% | 3.00% | 15.13% |
| RLEdit | 26.10% | 19.53% | 21.30% |

CounterFact Leaderboard

| Method | Reliability | Generalization | Capability |
|---|---|---|---|
| **LLaMA-3-8B-Instruct** | | | |
| Pre-Edit | 0.00% | 0.13% | 57.26% |
| MEMIT | 71.23% | 33.93% | ~% |
| WISE | 16.47% | 4.53% | - |
| AlphaEdit | 93.03% | 28.13% | ~% |
| RLEdit | ~% | ~% | ~% |
| **Qwen2.5-7B-Instruct** | | | |
| Pre-Edit | 0.07% | 0.07% | 58.52% |
| MEMIT | 68.87% | 19.23% | ~% |
| WISE | 17.37% | 4.43% | - |
| AlphaEdit | 33.33% | 10.87% | ~% |
| RLEdit | ~% | ~% | ~% |
| **Mistral-7B-v0.1** | | | |
| Pre-Edit | 0.13% | 0.17% | 44.82% |
| MEMIT | 35.20% | 17.27% | ~% |
| WISE | 26.17% | 8.70% | - |
| AlphaEdit | 7.63% | 4.00% | ~% |
| RLEdit | ~% | ~% | ~% |