DARE is a unified reinforcement learning framework for LLM reasoning that moves beyond difficulty-aware filtering. It estimates prompt difficulty under the current policy, preserves diverse difficulty coverage, and adapts compute, clipping, and reward shaping across easy, medium, and hard prompts.
Reinforcement learning can improve LLM reasoning, but many rollouts provide only weak learning signals. Existing difficulty-aware methods usually select medium-difficulty prompts, yet the paper identifies three bottlenecks in this approach: difficulty estimates drift as the policy changes, selection alone yields limited gains in final performance, and responses at inference time remain uniformly long across difficulty levels.
Static or slowly updated difficulty labels no longer match the live policy, so selected prompts can become too easy or too hard.
Changing only which prompts are sampled does not teach the model to allocate different reasoning effort across difficulty tiers.
Filtering-based RL often preserves long chain-of-thought behavior even on easy prompts where concise correct answers should suffice.
DARE treats prompt difficulty as a policy-dependent quantity. At each epoch, it co-evolves a difficulty estimate with the policy, samples data using a smooth difficulty distribution, and applies tier-specific training objectives.
A prompt-wise replay buffer stores prior rollouts and estimates the current-policy failure rate through self-normalized importance sampling.
A symmetric Beta sampler emphasizes medium-difficulty prompts while retaining easy and hard tails to avoid forgetting and starvation.
Easy, medium, and hard prompts receive different rollout budgets, clipping ranges, and reward-shaping terms.
The central estimator uses self-normalized importance sampling to correct replayed trajectories toward the current policy. For each prompt, DARE stores response, reward, and behavior log-probability tuples; unseen prompts fall back to an embedding-based cold-start estimator.
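As a rough illustration of the estimator described above, the sketch below computes a self-normalized importance sampling (SNIS) estimate of one prompt's failure rate from replayed tuples. It is not taken from the DARE code: the function name `snis_failure_rate`, the use of sequence-level log-probabilities, and binary rewards are all assumptions made for the example.

```python
import numpy as np


def snis_failure_rate(rewards, behavior_logps, current_logps):
    """SNIS estimate of a prompt's failure rate under the current policy.

    Hypothetical helper, not the DARE API. Assumes:
      rewards        -- 1.0 for a correct replayed response, 0.0 otherwise
      behavior_logps -- log-probability of each response under the policy
                        that originally generated it
      current_logps  -- log-probability of the same responses under the
                        current policy
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    if rewards.size == 0:
        # No replayed evidence: a caller would use a cold-start estimate here.
        raise ValueError("no replayed rollouts for this prompt")

    # Importance weights w_i = pi_current(y_i) / pi_behavior(y_i), in log space.
    log_w = np.asarray(current_logps, dtype=np.float64) - np.asarray(
        behavior_logps, dtype=np.float64
    )
    # Subtracting the max leaves the normalized weights unchanged
    # and avoids overflow in exp().
    w = np.exp(log_w - log_w.max())
    w /= w.sum()  # self-normalization: weights sum to 1

    # Weighted fraction of failed responses.
    return float(np.sum(w * (1.0 - rewards)))
```

When the current and behavior log-probabilities coincide, the weights are uniform and the estimate reduces to the plain empirical failure rate of the stored rollouts.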
Difficulty is not a fixed dataset attribute. It changes as the model learns, so DARE updates it continuously from replayed evidence.
The symmetric distribution creates a soft curriculum: medium prompts receive high probability, but easy and hard examples remain visible.
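One way to picture this soft curriculum is to weight each prompt by a symmetric Beta density evaluated at its estimated difficulty. The sketch below does that with NumPy; the shape parameter `alpha`, the clipping margin `eps`, and both function names are illustrative assumptions, not values or identifiers from the paper or the code release.

```python
import numpy as np


def beta_sampling_probs(difficulties, alpha=2.0, eps=0.05):
    """Sampling probabilities from an unnormalized symmetric Beta(alpha, alpha).

    Hypothetical sketch. `difficulties` are estimated failure rates in [0, 1].
    With alpha > 1 the density peaks at 0.5, so medium prompts are favored.
    Clipping to [eps, 1 - eps] keeps the easy and hard tails at non-zero
    probability instead of letting them vanish at 0 and 1.
    """
    d = np.clip(np.asarray(difficulties, dtype=np.float64), eps, 1.0 - eps)
    density = (d * (1.0 - d)) ** (alpha - 1.0)
    return density / density.sum()


def sample_prompts(difficulties, batch_size, alpha=2.0, eps=0.05, seed=0):
    """Draw `batch_size` distinct prompt indices according to the Beta weights."""
    rng = np.random.default_rng(seed)
    probs = beta_sampling_probs(difficulties, alpha=alpha, eps=eps)
    return rng.choice(len(probs), size=batch_size, replace=False, p=probs)
```

Under these assumptions, a larger `alpha` concentrates sampling around medium difficulty, while `alpha = 1` recovers uniform sampling.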
DARE adapts the training signal rather than merely filtering data. The policy learns to be concise on easy prompts, robust on medium prompts, and more exploratory on hard prompts.
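The tier-specific treatment can be pictured as a lookup from estimated difficulty to a small set of hyperparameters. The sketch below is only a schematic: the tier thresholds, rollout counts, clipping values, and length penalty are placeholder numbers chosen for illustration, and `TierConfig`, `tier_of`, and `shaped_reward` are hypothetical names rather than identifiers from the DARE code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierConfig:
    n_rollouts: int        # rollout budget per prompt
    clip_low: float        # lower clipping range of the policy ratio
    clip_high: float       # upper clipping range of the policy ratio
    length_penalty: float  # per-token penalty applied to correct responses


# Placeholder values, not taken from the paper.
TIERS = {
    "easy": TierConfig(n_rollouts=4, clip_low=0.2, clip_high=0.2, length_penalty=1e-3),
    "medium": TierConfig(n_rollouts=8, clip_low=0.2, clip_high=0.2, length_penalty=0.0),
    "hard": TierConfig(n_rollouts=16, clip_low=0.2, clip_high=0.3, length_penalty=0.0),
}


def tier_of(failure_rate, easy_max=1 / 3, hard_min=2 / 3):
    """Map an estimated failure rate to a tier (thresholds are assumptions)."""
    if failure_rate < easy_max:
        return "easy"
    if failure_rate > hard_min:
        return "hard"
    return "medium"


def shaped_reward(correct, n_tokens, cfg):
    """Correctness reward, reduced by a length penalty on correct responses."""
    if not correct:
        return 0.0
    return 1.0 - cfg.length_penalty * n_tokens
```

In this sketch the easy tier spends fewer rollouts and penalizes length to encourage concise answers, while the hard tier receives a larger rollout budget and a wider upper clipping range to leave more room for exploration.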
The paper reports accuracy, estimator quality, token usage, ablations, memory usage, and code-transfer results across multiple model scales. Instead of static screenshots, the web version renders these tables as sortable, switchable panels so the empirical story can be inspected directly.
The quantitative figures below are rendered directly in the browser from the same structured data used by the sortable tables. They are visible as a standalone section rather than hidden inside a tab.
MSE and MAE are plotted jointly so that the gaps to the oracle and to DARE are visible.
The code release contains a cold-start difficulty estimator, a local verl fork, replay-buffer-based selection, and DARE training scripts.
git clone https://github.com/EtaYang10th/DARE.git
cd DARE
bash environment.sh
cd rl_training
bash run_bash/1_ours_small_model.sh
bash run_bash/12_final_ds_teacher_replay.sh
rl_training/verl/verl/trainer/ppo/is_data_selector.py implements SNIS difficulty estimation, Beta sampling, and baseline variants.
rl_training/verl/verl/trainer/ppo/ray_trainer.py handles rollout allocation, reward shaping, replay mixing, and evaluation logging.
DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation.
Direct PDF download from arXiv.
DARE implementation, training scripts, and local verl fork.
@misc{zhou2026dare,
title = {DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation},
author = {Zhou, Yang and Jin, Can and Dong, Zihan and Wang, Zhepeng and
Yang, Yanting and Zhao, Shiyu and Li, Lei and Bao, Runxue and
Xie, Yaochen and Metaxas, Dimitris N.},
year = {2026},
eprint = {2605.09188},
archivePrefix = {arXiv},
primaryClass = {cs.LG},
url = {https://arxiv.org/abs/2605.09188}
}