Evidence Over Plans:
Online Trajectory Verification for Skill Distillation

SPARK is an automated skill-generation framework for LLM agents. Rather than asking models to draft skills from prior plans, it uses the trajectory-level Posterior Distillation Index (PDI) to assess whether each skill is grounded in task-environment evidence, and uses that signal to guide exploration, correct failure modes, and distill transferable SKILL.md documents.

Evaluated on SkillsBench · 86 runnable tasks · Build-and-verify task construction · Student inference 1,000× cheaper than teacher · Cross-domain validation on ALFWorld
Why It Matters

Prior Plans ≠ Transferable Skills

Equipping agents with human-written procedural skill documents can substantially improve task success rates, but most automated skill-generation methods operate before environment interaction: the model prescribes how a task should be solved and packages that prior intent into SKILL.md. Such skills are often dominated by generic priors and can yield negligible or even negative gains when executed.

SPARK is built on a central claim: effective procedural knowledge must be posterior-based. It should encode environment-specific constraints, execution dependencies, and failure modes that are discoverable only through interaction with the environment. Once distilled, this posterior experience becomes reusable and verifiable guidance for future agents.

① Trajectory-Level Metric

PDI provides an interpretable measure of whether a skill is grounded in task-environment evidence rather than unverified prior plans.

② Analyzable Skill Generation

SPARK preserves execution logs, verifier signals, and exploration-memo histories for full trajectory-level auditing and analysis.

③ Online Intervention

A memo-based PDI proxy can trigger soft or strong interventions during exploration, preventing trajectories from stagnating or repeating ossified plans.

④ Cross-Model Transfer

The cost of expensive teacher exploration is amortized across many low-cost student invocations, with student inference as low as $0.02/task.

⑤ Out-of-Domain Validation

On ALFWorld, whose text-based household tasks differ sharply from terminal programming tasks, PDI-guided skills improve success from 16.7% to 40.0%.

⑥ Task Construction Support

SPARK’s task-construction pipeline turns prompt-level task ideas into executable, oracle-verified Harbor tasks that support controlled ablations and transfer evaluation.

Framework

Two Independently Iterative Pipelines

SPARK decouples task construction from skill generation into two pipelines that can evolve independently: tasks can be expanded continuously, while skills are repeatedly distilled from verified trajectories without manually authoring both sides.

SPARK pipeline
PDI-based SPARK Illustration. Left: Skill Generation. A teacher agent interacts with a Dockerized environment for up to Nmax attempts, updates an exploration memo from execution feedback, distills the full trajectory trace into SKILL.md upon success, and receives targeted PDI-proxy interventions upon failure. Right: Task Construction. Blueprint generation → repair → critique → oracle validation converts prompts into executable benchmark instances; student agents then evaluate whether PDI-grounded knowledge transfers across tasks rather than overfitting to the teacher’s original trajectory.

🔁 Skill Pipeline

execute → judge → summarize → retry → distill skill

  • Attempt log: a one-line summary of each attempt, preserving both strategy and outcome.
  • Verified Facts: environment-confirmed facts that persist across memo rewrites.
  • Current Error Pattern: the active diagnosis of the current failure mode.
  • Next Strategy: a plan that must differ from all previous attempts.
  • Commands: the key shell actions from the previous attempt.

The memo is rewritten as a whole rather than appended to, preventing the context from being flooded with low-value stdout; historical versions are still retained so cross-attempt patterns remain analyzable.
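The memo fields above can be sketched as a small data structure; the names and the rewrite interface are assumptions for illustration, not SPARK's exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class ExplorationMemo:
    """Hypothetical container for the memo sections listed above."""
    attempt_log: list[str] = field(default_factory=list)     # one-line summary per attempt
    verified_facts: list[str] = field(default_factory=list)  # persists across rewrites
    current_error_pattern: str = ""                          # active failure diagnosis
    next_strategy: str = ""                                  # must differ from prior attempts
    commands: list[str] = field(default_factory=list)        # key shell actions, last attempt

def rewrite_memo(old: ExplorationMemo, summary: str, new_facts: list[str],
                 diagnosis: str, strategy: str, commands: list[str]) -> ExplorationMemo:
    """Rebuild the memo whole each attempt instead of appending raw stdout."""
    if strategy == old.next_strategy:
        raise ValueError("Next Strategy must differ from the previous attempt")
    return ExplorationMemo(
        attempt_log=old.attempt_log + [summary],              # history stays analyzable
        verified_facts=sorted(set(old.verified_facts) | set(new_facts)),
        current_error_pattern=diagnosis,
        next_strategy=strategy,
        commands=commands,
    )
```

Because each rewrite returns a fresh object, every historical memo version can be retained for cross-attempt pattern analysis, as the text describes.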

🧩 Task Pipeline

prompt → retrieval → TaskBlueprint → Harbor task → oracle validation

  • Blueprint: structures a natural-language prompt into instructions, environment, support files, oracle, and verifier.
  • Repair: iteratively fixes constraint violations in the blueprint.
  • Critique: checks semantic consistency and executable constraints.
  • Validation: accepts only tasks that pass a deterministic oracle.

This is a build-and-verify process rather than a one-shot LLM generation step. It also enables batch generation of unseen instances from the same problem class, testing whether a skill captures reusable procedural structure.
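The build-and-verify loop can be sketched as follows; the callables and the repair budget are assumed interfaces standing in for SPARK's actual stages.

```python
# Hypothetical build-and-verify loop for the task pipeline described above.
# `generate`, `critique`, `repair`, and `run_oracle` are assumed interfaces.

def build_and_verify(prompt: dict, generate, critique, repair, run_oracle,
                     max_rounds: int = 3):
    """Return an accepted blueprint, or None if the oracle never passes."""
    blueprint = generate(prompt)                 # prompt -> TaskBlueprint
    for _ in range(max_rounds):
        issues = critique(blueprint)             # semantic + executable checks
        if issues:
            blueprint = repair(blueprint, issues)
            continue
        if run_oracle(blueprint) == 1.0:         # deterministic oracle must reach r = 1.0
            return blueprint                     # accept only verified tasks
    return None                                  # reject: never fall back to one-shot output
```

The key design point is that rejection is the default: a task that never reaches r = 1.0 under the oracle is simply not emitted.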

Case Studies

When PDI Starts to Matter

The following cases illustrate external transfer (lean4-proof), online PDI intervention (3d-scan-calc / manufacturing-codebook-normalization), and verified execution chains outperforming plan repetition (tactic reproduction).

Held-out Student (Claude Haiku 4.5): 118-line sketch → 393-line executable recipe

Task: prove the geometric-series bound Sn ≤ 2 in Lean 4; verifier command: lake env lean -DwarningAsError=true solution.lean.
lean4-proof before/after terminal animation
Side-by-side replay of the held-out student's shell session. Left: with the non-PDI skill, the student cycles through 15 trial-and-error commands with repeated failures (✗) and warnings (!). Right: after receiving the PDI-refined skill, 5 commands take the student from a clean write-to-/tmp to reward = 1.0 on the first Lean compile.
Before — Non-PDI skill
  • Provides only four sections—overview, orientation, strategy, and pitfalls—across 118 lines.
  • Lacks a complete theorem block and offers only a high-level proof sketch.
  • Uses an incomplete verifier command that omits -DwarningAsError=true.
  • Does not specify a write-to-/tmp-then-replace workflow, making it easy to corrupt solution.lean.
Trajectory: 15 trial-and-error commands with repeated failures.
After — PDI-refined skill
  • Contains seven structured modules, including step-by-step workflow, reference proof, and testing sections, across 393 lines.
  • Embeds an executable theorem block with explicit norm_num / ring / linarith tactic chains.
  • Provides the exact verifier command, preventing warnings from being silently ignored.
  • Enforces a write-to-/tmp-then-replace workflow to avoid damaging the original file.
5 commands · 0 failures
PDI turns a seemingly well-written plan-oriented document into an evidence-backed recipe that students can execute step by step. The held-out student never participates in generation yet can reuse the skill directly, illustrating the intended cross-model skill transfer.

Interrupting Ossified Plans During Exploration, Not After the Fact

Two representative tasks: 3d-scan-calc and manufacturing-codebook-normalization.
Online PDI-guided control — animated
Top row (green): with online PDI intervention, both tasks reach reward = 1.0 — 3d-scan-calc solves at attempt 8, manufacturing-codebook-normalization solves at attempt 4 (its trace stops early because the agent already won). Bottom row (red): the observe-only control exhausts the same budget with 0 successes. Orange circles mark soft PDI triggers, red circles mark strong ones; the dashed line is τ = −0.5.
  • 3d-scan-calc: a soft intervention is triggered at step 4, escalates to a strong intervention at step 5, and the trajectory recovers.
  • manufacturing-codebook-normalization: two soft triggers at steps 3 and 5 are sufficient to reverse the trend.
Intervention is prompt-only, and the metric remains hidden from the agent. It corrects trajectory direction without changing the goal or exposing the score, making online PDI intervention highly composable with existing agent systems without retraining.

Four Quadrants Defined by φplan and φexec

Testing the core claim: effective skills combine low plan copying with high execution grounding.
φplan   φexec   Average Δr   Gap vs Human
Low     High    +0.377       +0.312
Low     Low     +0.321       +0.393
High    High    +0.143       +0.071
High    Low     +0.028       −0.040

The highlighted row is the quadrant PDI is designed to maximize. Counterintuitively, Low plan + Low exec can also be beneficial: as long as the skill does not collapse onto the memo, it can still help even without perfectly matching the command distribution. The worst case combines both problems: the skill merely restates what the agent intended to do, with little evidence of what it actually did.

Skill Distillation

Evidence-Driven Skill Generation

After task success, SPARK aggregates six structured evidence sources and passes them to the skill-distillation model, enabling it to extract transferable knowledge from the full solution path and obstacle context rather than summarizing the surface of a single successful attempt.

① Task Pattern

An abstracted task instruction that captures the problem class rather than instance-specific details.

② Execution Chain

The key command sequence from the successful trajectory, filtered through semantic classification, importance scoring, and low-signal removal.

③ Verification

The environment-verified outcomes: which tests passed and what final reward was achieved.

④ Lessons

Recurring failure modes and confirmed caveats distilled from the complete attempt history.

⑤ Environment

Runtime context from the Dockerfile, including base image, packages, and SDKs.

⑥ Raw Support Tail

The stdout tail from successful execution, used to calibrate abstraction against raw evidence.
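The six evidence sources above can be aggregated into a single distillation payload; the key names and the tail-truncation limit are assumptions for illustration.

```python
# Hypothetical aggregation of the six evidence sources listed above into one
# payload for the skill-distillation model; keys and limits are assumptions.

def build_distillation_evidence(task_pattern: str, execution_chain: list[str],
                                verification: dict, lessons: list[str],
                                environment: str, stdout_tail: str) -> dict:
    return {
        "task_pattern": task_pattern,             # ① abstracted problem class
        "execution_chain": execution_chain,       # ② filtered key commands
        "verification": verification,             # ③ tests passed, final reward
        "lessons": lessons,                       # ④ recurring failure modes
        "environment": environment,               # ⑤ Dockerfile runtime context
        "raw_support_tail": stdout_tail[-2000:],  # ⑥ raw evidence for calibration
    }
```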

Metric

Posterior Distillation Index (PDI)

PDI is a trajectory-level metric: it measures whether the final SKILL.md repeats prior intent or records environment-verified evidence. We use Jensen–Shannon divergence to compare token distributions and convert distributional distance into similarity \(\psi(P_x,P_y) = 1 - \mathrm{JS}(P_x,P_y)\).

φexec — Execution Grounding

The similarity between the successful execution-command distribution \(P_E\) and the final skill distribution \(P_s\). Higher values indicate that the skill focuses more on environment-verified operations.

φplan — Plan Copying

The similarity between \(P_P\), aggregated from all Next Strategy entries in the memo, and \(P_s\). Higher values indicate that the skill resembles repeated plans rather than independently verified knowledge.

φoss — Memo Ossification

The distributional stability of Verified Facts and failed-test sets across attempts. Higher values indicate rigid task understanding and repeated circulation around the same failure mode.

\[ \mathrm{PDI} = z(\varphi_{\mathrm{exec}}) - z(\varphi_{\mathrm{plan}}) - z(\varphi_{\mathrm{oss}}) \]

The equal-weight linear form is deliberate: PDI is designed as an interpretable, benchmark-transferable diagnostic metric rather than a predictive model overfit to a particular benchmark. Weight-sensitivity analysis shows that sign consistency is sufficient for significant correlation; cross-validated fitted weights perform worse on held-out data.
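Under these definitions, the metric can be sketched in a few lines. The tokenization and the z-score normalization over a reference population are assumptions; the source fixes only the similarity \(\psi = 1 - \mathrm{JS}\) and the equal-weight linear form.

```python
import math
from collections import Counter

def js_divergence(px: Counter, py: Counter) -> float:
    """Jensen-Shannon divergence (base 2, in [0, 1]) between token counts."""
    nx, ny = sum(px.values()), sum(py.values())
    js = 0.0
    for t in set(px) | set(py):
        p, q = px[t] / nx, py[t] / ny
        m = (p + q) / 2
        if p: js += 0.5 * p * math.log2(p / m)
        if q: js += 0.5 * q * math.log2(q / m)
    return js

def psi(px: Counter, py: Counter) -> float:
    """Similarity psi(Px, Py) = 1 - JS(Px, Py), as defined above."""
    return 1.0 - js_divergence(px, py)

def pdi(phi_exec: float, phi_plan: float, phi_oss: float, stats: dict) -> float:
    """Equal-weight PDI = z(phi_exec) - z(phi_plan) - z(phi_oss).
    `stats` maps component -> (mean, std) over a reference population of
    trajectories; this normalization scheme is an assumption."""
    def z(value, key):
        mean, std = stats[key]
        return (value - mean) / std
    return z(phi_exec, "exec") - z(phi_plan, "plan") - z(phi_oss, "oss")
```

With identical token distributions \(\psi = 1\), with disjoint distributions \(\psi = 0\), and a skill that is execution-grounded but neither plan-copying nor ossified scores a positive PDI.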

PDI analysis
(a) Pass-gain rates grouped by trajectory: high-PDI iterative trajectories outperform interaction-free and low-PDI trajectories across all seven student models. (b) PDI vs. Δr for each (model, task) pair: Spearman ρ = +0.364 (p < 10⁻⁶). (c) Memo ossification vs. the gap relative to human-written skills: ρ = −0.277 (p < 10⁻³).
Takeaway. The best quadrant is low φplan + high φexec, with average Δr = +0.377 and a +0.312 gain over human-written skills; the worst quadrant (high plan, low exec) has Δr ≈ 0 and falls below human-written skills by −0.040. Effective skills distill what the environment verified, not what the agent once planned to do.
Runnable Task Construction

From Prompt-Level Ideas to Oracle-Verified Harbor Tasks

SPARK treats runnable tasks as a validation interface for trajectory-level skill distillation, not as a predefined list of task categories. A prompt-level task idea is converted into a self-contained Harbor task through a build-and-verify pipeline that makes the instruction, environment, oracle solution, and pytest verifier mutually consistent before the task is accepted.

1 prompt spec → 4 build-and-verify stages → 7 rendered Harbor files → oracle acceptance at r = 1

① Prompt Specification

A JSON prompt supplies the task idea, optional tool requirements, environment hints, and constraints. These inputs define the validation target for a generated task without claiming that SPARK imposes an intrinsic task scope.

② Blueprint Generation

The model emits a structured TaskBlueprint: instruction markdown, Docker runtime, support files, deterministic data builder, oracle code, verifier code, output path, assumptions, and validation checks.
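The blueprint fields enumerated above map naturally onto a structured record; the field names and types below are assumptions for illustration, not SPARK's exact schema.

```python
from dataclasses import dataclass, field

# Hypothetical schema for the TaskBlueprint fields enumerated above.
@dataclass
class TaskBlueprint:
    instruction_md: str                 # instruction markdown shown to the agent
    docker_runtime: str                 # base image / runtime specification
    support_files: dict[str, str] = field(default_factory=dict)  # path -> content
    data_builder: str = ""              # deterministic data-generation code
    oracle_code: str = ""               # reference solution
    verifier_code: str = ""             # pytest verifier
    output_path: str = ""               # where the solution must be written
    assumptions: list[str] = field(default_factory=list)
    validation_checks: list[str] = field(default_factory=list)
```

Keeping the oracle and verifier as separate fields is what lets the critique stage check them against each other for verifier-oracle drift.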

③ Critique and Repair

SPARK critiques the blueprint for internal mismatches, answer leakage, verifier-oracle drift, unsupported assumptions, schema errors, and output-path inconsistencies, then repairs blocking issues before acceptance.

④ Harbor Rendering

The accepted blueprint is materialized as a Harbor directory with instruction.md, task.toml, environment/Dockerfile, build_data.py, solution/solve.sh, and pytest tests.

⑤ Oracle Validation

Harbor first checks the task package, then runs the oracle solution and verifier. A task is accepted only when deterministic validation completes successfully and the reward reaches r = 1.0.

⑥ Trace Artifacts

Each generation run records the blueprint, critique results, render report, validation feedback, and repair history, making task construction auditable rather than a one-shot LLM artifact.

Evaluation Interface

  • Task reward \(r_{m,t}\): reward returned by the deterministic verifier when student model m attempts task t under a given condition.
  • Skill gain \(\Delta r_{m,t} = r^{\mathrm{skill}}_{m,t} - r^{\mathrm{base}}_{m,t}\): reward change attributable to injecting the distilled skill.
  • Pass-gain rate \(\mathrm{PG}(g, m)\): among tasks failed by the baseline, the fraction solved after adding the skill.
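These evaluation quantities are straightforward to compute; the sketch below assumes per-task reward dicts keyed by task id and a pass threshold of 1.0, which are illustrative conventions rather than SPARK's exact interface.

```python
# Hypothetical implementation of the evaluation interface described above.

def skill_gain(r_skill: dict, r_base: dict) -> dict:
    """Per-task gain: delta_r[t] = r_skill[t] - r_base[t]."""
    return {t: r_skill[t] - r_base[t] for t in r_base}

def pass_gain_rate(r_skill: dict, r_base: dict, threshold: float = 1.0) -> float:
    """Among tasks the baseline failed, the fraction solved with the skill."""
    failed = [t for t, r in r_base.items() if r < threshold]
    if not failed:
        return 0.0
    return sum(r_skill[t] >= threshold for t in failed) / len(failed)
```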
Results

SPARK-skill vs. Human-skill vs. No-skill

SPARK-generated skills outperform human-written skills on most student models; several smaller models even surpass the teacher models’ interaction-free, no-skill performance after receiving SPARK skills. In other words, weaker students can outperform stronger teachers.

Main result
Average reward r̄ for seven student models under three conditions. Horizontal dashed lines indicate the no-skill, interaction-free performance of the two teacher models (GPT-5.4 and Claude Opus 4.6).
Cross-model agreement
Agreement matrix A(mi, mj) across seven student models. Off-diagonal agreement is generally below 60%. In other words, there is no universal key: simply scaling a skill library does not make gains add linearly.
Divergent vs convergent exploration
Divergent exploration transfers better than convergent exploration. Convergent trajectories often encode teacher-specific refinements that student models cannot reproduce; for some students, convergent skills even yield negative Δr.
Two striking findings. (1) GPT-5.4-mini with SPARK skills reaches r̄ = 0.52, above the human-skill condition at 0.47; (2) GPT-5.4-nano with SPARK skills (0.41) even exceeds Claude Opus 4.6 without any skill (0.37).

Trajectory-level evidence

Compression ratio
Compression ratio: excessive compression removes actionable detail and is negatively correlated with skill effectiveness (Spearman).
Attempts sweet spot
Attempt budget: gains remain consistently positive when K ≤ 3; additional attempts become more variable and model-dependent.

Controlled Transfer Ablation

For an ablation setting, SPARK uses the same task-construction pipeline to produce independently validated evaluation instances under controlled constraints. This isolates whether a distilled skill captures reusable procedural structure rather than memorizing the original trajectory.

Student Model    Baseline   SPARK (Generated)   Human    Δ vs Baseline
DeepSeek-Chat    43%        87%                 80%      +43%
GPT-5.1-Codex    17%        83%                 50%      +67%
GPT-5.4-mini     7%         53%                 33%      +47%
GPT-5.4-nano     56.7%      90.0%               83.3%    +33.3%
GLM-4.5-Air      45.0%      90.0%               82.0%    +45.0%
Cross-domain: On ALFWorld, a text-interactive household environment with a task structure very different from programming benchmarks, PDI-refined skills improve overall success from 16.7% to 40.0%, showing that SPARK and PDI remain effective in out-of-domain settings.
Online Intervention Ablation

w/ PDI vs. w/o PDI

We rerun skill generation on the subset of tasks whose original trajectories had persistently low PDI, enable online PDI intervention, and then perform A/B evaluation under identical downstream conditions. This directly asks: can PDI recover a skill during generation rather than merely score it afterward?

Student Model    w/o PDI   w/ PDI   Δ
DeepSeek-Chat    9.1%      42.4%    +33.3%
GPT-5.1-Codex    0.0%      24.2%    +21.2%
GPT-5.4-mini     18.2%     36.4%    +12.1%
GPT-5.4-nano     15.2%     30.3%    +15.2%
GLM-4.5-Air      3.0%      15.2%    +12.1%

Intervention Mechanism

After reflection step k, SPARK computes the proxy signal \( \hat{d}_k = w_k \cdot \mathrm{PDI}_k \), where \( w_k = \min(1, k/W) \) is a linear warm-up term (W=2) that suppresses early-stage noise.

  • Soft: when \( \hat{d}_k < \tau \) (τ=−0.5), inject prompt-only guidance that encourages the agent to leave its current hypothesis.
  • Strong: after two consecutive triggers, additionally remove the previous Next Strategy from the memo, forcing the agent to anchor on Verified Facts and Current Error Pattern.
  • Hidden metric: the PDI score is never exposed to the agent, preventing metric gaming.
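The trigger logic above fits in a few lines; the function signature and the way consecutive triggers are tracked are assumptions, while the warm-up \(w_k = \min(1, k/W)\), W = 2, and τ = −0.5 come from the text.

```python
# Minimal sketch of the online intervention trigger described above.
# PDI_k is assumed to come from the memo-based proxy; the agent never sees it.

def intervention(k: int, pdi_k: float, consecutive: int,
                 W: int = 2, tau: float = -0.5) -> tuple[str, int]:
    """Return (action, updated consecutive-trigger count) for reflection step k."""
    w_k = min(1.0, k / W)          # linear warm-up suppresses early-stage noise
    d_hat = w_k * pdi_k            # proxy signal  d_hat_k = w_k * PDI_k
    if d_hat >= tau:
        return "none", 0           # trajectory healthy: reset the counter
    consecutive += 1
    if consecutive >= 2:           # two consecutive triggers escalate
        return "strong", consecutive   # also drop the previous Next Strategy
    return "soft", consecutive     # prompt-only nudge to leave the hypothesis
```

Because the output is a prompt-level action rather than a score, this slots into an existing agent loop without exposing the metric to the agent.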
alpha sensitivity
Sensitivity curve for smoothing parameter α. PDI remains robust across a wide low-smoothing range, and its discriminative power is not sensitive to the exact α value.
Baseline comparison: Against three skill-generation baselines (Trace2Skill, AutoRefine, and EvoSkill), SPARK improves average reward over the strongest baseline by +0.119 under the same evaluation setting.
Quick Start

Run SPARK in a Few Commands

Requirements: Python 3.12, uv, Docker, Harbor, and an OpenAI-compatible LLM endpoint read from OPENAI_API_KEY / OPENAI_BASE_URL environment variables or a .env file.

① Generate Runnable Tasks from a Prompt

uv sync

uv run python run_tasks_gen.py \
  --prompt-file spark_tasks_gen/examples/3d_scan_calc_prompt.json \
  --model gpt-5.4

Outputs are written to spark_tasks_gen/generated_tasks/; each task is first checked by oracle validation and rejected if it fails.

② Run the Iterative Skill-Generation Loop

uv run python run_pipeline.py \
  --agent qwen-coder \
  --model qwen3-coder-next \
  --tasks-dir tasks-no-skills \
  --max-retries 3 \
  --parallelism 4

By default, this launches the dashboard at http://localhost:8765; add --no-dashboard for CLI-only execution.

③ Compare and Evaluate Generated Skills

uv run python run_eval_skills.py \
  --agent qwen-coder \
  --model qwen3-coder-next \
  --skill-source-model qwen3-coder-next \
  --tasks-dir tasks-no-skills

This automatically performs A/B evaluation on the same task set under the native baseline and SPARK SKILL.md injection conditions, writing results to spark_skills_gen/skills_eval_result/.

Output Overview

  • spark_tasks_gen/generated_tasks/<task-id>/ — synthesized Harbor tasks
  • spark_tasks_gen/generated_tasks/_artifacts/ — task-generation traces
  • spark-jobs/ — Harbor execution outputs
  • spark_skills_gen/skills_gen_result/<model>/<task>/ — distilled SKILL.md files and attempt logs
  • spark_skills_gen/skills_eval_result/<model>/<run>/ — skill A/B evaluation summaries
Resources

Paper, Code, and Citation

BibTeX

@misc{zhou2026spark,
  title  = {Evidence Over Plans: Online Trajectory Verification for Skill Distillation},
  author = {Zhou, Yang and Dong, Zihan and Wang, Zhenting and Jin, Can and
            Zhao, Shiyu and Guo, Bangwei and Gu, Difei and
            Zhang, Linjun and Zhou, Mu and Metaxas, Dimitris N.},
  year   = {2026}
}

Contact: eta.yang@rutgers.edu · dnm@rutgers.edu