arXiv 2605.09188 · Reinforcement Learning for Reasoning

Difficulty-Adaptive RL with Co-Evolved Difficulty Estimation

DARE is a unified reinforcement learning framework for LLM reasoning that moves beyond difficulty-aware filtering. It estimates prompt difficulty under the current policy, preserves diverse difficulty coverage, and adapts compute, clipping, and reward shaping across easy, medium, and hard prompts.

policy-aligned SNIS estimator · symmetric-Beta sampling · tiered rollout allocation · token-efficient inference
Replay buffer: responses · rewards · behavior log-probs
SNIS difficulty: current-policy failure estimate
Beta sampler: medium focus with tail coverage
Tiered RL: easy · medium · hard strategies
Problem

Difficulty Filtering Alone Leaves Compute on the Table

Reinforcement learning can improve LLM reasoning, but many rollouts provide weak learning signals. Existing difficulty-aware methods usually select medium-difficulty prompts, yet the paper shows three bottlenecks: difficulty estimates drift as the policy changes, selection alone brings limited final-performance gains, and inference responses remain uniformly long across difficulty levels.

Policy drift

Static or slowly updated difficulty labels no longer match the live policy, so selected prompts can become too easy or too hard.

Filtration ceiling

Changing only which prompts are sampled does not teach the model to allocate different reasoning effort across difficulty tiers.

Inference inefficiency

Filtering-based RL often preserves long chain-of-thought behavior even on easy prompts where concise correct answers should suffice.

Framework

DARE Couples Estimation, Selection, and Optimization

DARE treats prompt difficulty as a policy-dependent quantity. At each epoch, it co-evolves a difficulty estimate with the policy, samples data using a smooth difficulty distribution, and applies tier-specific training objectives.

01 · Co-Evolved Difficulty Estimation

A prompt-wise replay buffer stores prior rollouts and estimates the current-policy failure rate through self-normalized importance sampling.

02 · Dynamic Data Selection

A symmetric Beta sampler emphasizes medium-difficulty prompts while retaining easy and hard tails to avoid forgetting and starvation.

03 · Difficulty-Adaptive RL

Easy, medium, and hard prompts receive different rollout budgets, clipping ranges, and reward-shaping terms.

Estimator

Current-Policy Difficulty from Historical Rollouts

The central estimator uses self-normalized importance sampling to correct replayed trajectories toward the current policy. For each prompt, DARE stores response, reward, and behavior log-probability tuples; unseen prompts fall back to an embedding-based cold-start estimator.
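As a rough illustration of the estimator described above, the sketch below computes a self-normalized importance sampling estimate of the failure rate from replayed tuples. It is not the released implementation: the `Rollout` fields, the sequence-level log-probabilities, the re-scoring of stored responses under the current policy, and the `max_log_ratio` clamp are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class Rollout:
    """One replayed sample for a prompt (field names are hypothetical)."""
    reward: float            # 1.0 if the response was judged correct, else 0.0
    behavior_logprob: float  # sequence log-prob under the policy that generated it
    current_logprob: float   # same response re-scored under the current policy


def snis_failure_rate(rollouts, max_log_ratio=5.0):
    """Self-normalized importance sampling estimate of the failure rate.

    Returns None for an empty buffer so the caller can fall back to a
    cold-start estimate.
    """
    if not rollouts:
        return None
    # Log importance ratios, clamped (an assumption) so that one stale
    # sample cannot dominate the estimate.
    log_w = [
        max(-max_log_ratio, min(max_log_ratio, r.current_logprob - r.behavior_logprob))
        for r in rollouts
    ]
    # Subtract the max before exponentiating for numerical stability;
    # the shift cancels in the normalized ratio.
    shift = max(log_w)
    weights = [math.exp(lw - shift) for lw in log_w]
    total = sum(weights)
    failures = sum(w * (1.0 - r.reward) for w, r in zip(weights, rollouts))
    return failures / total
```

When the behavior and current policies agree on every stored response, all weights are equal and the estimate reduces to the empirical failure rate of the buffer. Responses that the current policy finds more likely than the behavior policy did count for more.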

Prompt difficulty
d_q = current-policy failure rate

Difficulty is not a fixed dataset attribute. It changes as the model learns, so DARE updates it continuously from replayed evidence.

Sampling weight
p(q) ∝ Beta(d_q; 1 + κ/2, 1 + κ/2)

The symmetric distribution creates a soft curriculum: medium prompts receive high probability, but easy and hard examples remain visible.
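The sampling weight can be sketched in a few lines. With both shape parameters equal to 1 + κ/2, the unnormalized Beta density is (d(1 − d))^(κ/2), so the normalizing constant never needs to be computed. The function names, the default `kappa`, and the `eps` clamp are assumptions for illustration; the clamp is there because the density vanishes exactly at 0 and 1, which would otherwise make fully solved or fully failed prompts unsamplable.

```python
import random


def beta_sampling_probs(difficulties, kappa=4.0, eps=1e-3):
    """Sampling probabilities proportional to a symmetric Beta density."""
    weights = []
    for d in difficulties:
        # Assumed clamp: keep prompts with difficulty 0 or 1 samplable.
        d = min(1.0 - eps, max(eps, d))
        # Unnormalized Beta(d; 1 + kappa/2, 1 + kappa/2) density.
        weights.append((d * (1.0 - d)) ** (kappa / 2.0))
    total = sum(weights)
    return [w / total for w in weights]


def sample_prompts(prompt_ids, difficulties, batch_size, kappa=4.0, seed=None):
    """Draw a batch of prompt ids (with replacement) under the Beta weights."""
    rng = random.Random(seed)
    probs = beta_sampling_probs(difficulties, kappa)
    return rng.choices(prompt_ids, weights=probs, k=batch_size)
```

Setting `kappa=0` gives uniform sampling, and larger values concentrate probability around a difficulty of 0.5. Note that `random.choices` samples with replacement; a trainer that needs distinct prompts per batch would need a different draw.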

Optimization

One Policy, Three Difficulty Tiers

DARE adapts the training signal rather than merely filtering data. The policy learns to be concise on easy prompts, robust on medium prompts, and more exploratory on hard prompts.

| Tier | Difficulty range | Training strategy | Intended behavior |
| --- | --- | --- | --- |
| Easy | d_q < d_easy | Fewer rollouts, length penalty on correct responses, relaxed upper clipping. | Keep accuracy while reducing unnecessary tokens. |
| Medium | d_easy ≤ d_q ≤ d_hard | Standard group-relative policy optimization with the base rollout count. | Use high-variance learning signals efficiently. |
| Hard | d_q > d_hard | More rollouts, hint-augmented retrieval, and a bounded length bonus for incorrect attempts. | Encourage continued exploration without overpowering correctness. |
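The tier logic above can be sketched as a threshold lookup plus a small reward-shaping function. Every number here (the two thresholds, rollout counts, clipping values, `coef`, `max_tokens`) is a placeholder chosen for the example rather than a value from the paper, and the names `TierConfig`, `assign_tier`, and `shaped_reward` are hypothetical. Hint-augmented retrieval for hard prompts is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierConfig:
    name: str
    num_rollouts: int  # rollouts generated per prompt in this tier
    clip_high: float   # upper clipping range for the policy ratio


# Placeholder thresholds and settings, not values from the paper.
D_EASY, D_HARD = 0.2, 0.8
TIERS = {
    "easy": TierConfig("easy", num_rollouts=4, clip_high=0.28),    # fewer rollouts, relaxed upper clip
    "medium": TierConfig("medium", num_rollouts=8, clip_high=0.2),  # base settings
    "hard": TierConfig("hard", num_rollouts=16, clip_high=0.2),     # more rollouts; clip value assumed
}


def assign_tier(d_q):
    """Map an estimated failure rate to a tier; both boundaries count as medium."""
    if d_q < D_EASY:
        return TIERS["easy"]
    if d_q > D_HARD:
        return TIERS["hard"]
    return TIERS["medium"]


def shaped_reward(correct, num_tokens, tier, max_tokens=4096, coef=0.1):
    """Correctness reward plus a tier-specific length term bounded by coef."""
    base = 1.0 if correct else 0.0
    length_frac = min(num_tokens, max_tokens) / max_tokens
    if tier.name == "easy" and correct:
        # Length penalty on correct responses: shorter answers score higher.
        return base - coef * length_frac
    if tier.name == "hard" and not correct:
        # Bounded length bonus for incorrect attempts.
        return base + coef * length_frac
    return base
```

Because `coef` is smaller than 1, the longest incorrect attempt on a hard prompt (reward at most 0.1 here) still scores below any correct response, which is one simple way to keep the bonus from overpowering correctness.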
Results

Paper Tables as an Interactive Evidence Atlas

The paper reports accuracy, estimator quality, token usage, ablations, memory usage, and code-transfer results across multiple model scales. Instead of static screenshots, the web version renders these tables as sortable, switchable panels so the empirical story can be inspected directly.

Dynamic Figures

Browser-Rendered Figures from Paper Data

The quantitative figures below are rendered directly in the browser from the same structured data used by the sortable tables. They are visible as a standalone section rather than hidden inside a tab.

Estimator error landscape

MSE and MAE are plotted jointly so that each estimator's gap to the oracle and to DARE is visible.

Quick Start

Run the Released Implementation

The code release contains a cold-start difficulty estimator, a local verl fork, replay-buffer-based selection, and DARE training scripts.

Install

git clone https://github.com/EtaYang10th/DARE.git
cd DARE
bash environment.sh

Train

cd rl_training
bash run_bash/1_ours_small_model.sh
bash run_bash/12_final_ds_teacher_replay.sh

Core implementation

rl_training/verl/verl/trainer/ppo/is_data_selector.py implements SNIS difficulty estimation, Beta sampling, and baseline variants.

Tiered RL loop

rl_training/verl/verl/trainer/ppo/ray_trainer.py handles rollout allocation, reward shaping, replay mixing, and evaluation logging.

Resources

Paper, Code, and Citation

BibTeX

@misc{zhou2026dare,
  title         = {DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation},
  author        = {Zhou, Yang and Jin, Can and Dong, Zihan and Wang, Zhepeng and
                   Yang, Yanting and Zhao, Shiyu and Li, Lei and Bao, Runxue and
                   Xie, Yaochen and Metaxas, Dimitris N.},
  year          = {2026},
  eprint        = {2605.09188},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2605.09188}
}