VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation

Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/vla-opd-bridging-offline-sft-and-online-rl-for-vision-language-action-models-via-on-policy-distillation

Proof freshness: stale
Proof status: unverified
Display score: 7/10
Last proof check: 2026-03-30
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 26
Source count: 3
Coverage: 50%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.

Agent Handoff

Canonical ID: vla-opd-bridging-offline-sft-and-online-rl-for-vision-language-action-models-via-on-policy-distillation
Route: /signal-canvas/vla-opd-bridging-offline-sft-and-online-rl-for-vision-language-action-models-via-on-policy-distillation

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/vla-opd-bridging-offline-sft-and-online-rl-for-vision-language-action-models-via-on-policy-distillation
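The same request can be sketched in Python with only the standard library. This is a minimal sketch under assumptions: the endpoint URL is taken from the curl line above, but the helper names `handoff_url` and `fetch_handoff` are hypothetical, and the page does not say whether the endpoint returns JSON or requires authentication, so both are assumed here.

```python
import json
import urllib.request
from urllib.parse import quote

# Base path copied from the curl example on this page.
BASE = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"

PAPER_REF = (
    "vla-opd-bridging-offline-sft-and-online-rl-for-vision-language-action-"
    "models-via-on-policy-distillation"
)


def handoff_url(paper_ref: str) -> str:
    """Build the agent-handoff URL for a paper slug (hypothetical helper)."""
    return f"{BASE}/{quote(paper_ref, safe='')}"


def fetch_handoff(paper_ref: str, timeout: float = 10.0) -> dict:
    """GET the handoff document.

    Assumption: the response body is JSON and no auth header is needed.
    """
    request = urllib.request.Request(
        handoff_url(paper_ref), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (performs a network call, so it is left commented out):
# receipt = fetch_handoff(PAPER_REF)
```

Keeping URL construction separate from the network call makes the slug handling easy to check without contacting the server.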

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "vla-opd-bridging-offline-sft-and-online-rl-for-vision-language-action-models-via-on-policy-distillation",
    "query_text": "Summarize VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation"
  }
}
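MCP clients normally send tool invocations as JSON-RPC 2.0 `tools/call` requests. The sketch below wraps the tool name and arguments shown above in that envelope. The function name `to_tools_call` is hypothetical, and whether this server accepts exactly this envelope is an assumption, since the page only shows the tool and argument names.

```python
import json


def to_tools_call(tool: str, arguments: dict, request_id: int = 1) -> dict:
    """Wrap a tool name and its arguments in a JSON-RPC 2.0 tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


PAPER_REF = (
    "vla-opd-bridging-offline-sft-and-online-rl-for-vision-language-action-"
    "models-via-on-policy-distillation"
)

# Tool name and argument keys are copied from the MCP example on this page.
request = to_tools_call(
    "search_signal_canvas",
    {
        "mode": "paper",
        "paper_ref": PAPER_REF,
        "query_text": (
            "Summarize VLA-OPD: Bridging Offline SFT and Online RL for "
            "Vision-Language-Action Models via On-Policy Distillation"
        ),
    },
)

print(json.dumps(request, indent=2))
```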

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation",
  "normalized_query": "2603.26666",
  "route": "/signal-canvas/vla-opd-bridging-offline-sft-and-online-rl-for-vision-language-action-models-via-on-policy-distillation",
  "paper_ref": "vla-opd-bridging-offline-sft-and-online-rl-for-vision-language-action-models-via-on-policy-distillation",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Evidence Receipt

Route status: building

Claims: 7

References: 26

Proof: Verification pending

Freshness state: computing

Source paper: VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation

PDF: https://arxiv.org/pdf/2603.26666v1

Source count: 3

Coverage: 50%

Last proof check: 2026-03-30T21:51:27.011Z

Signal Canvas receipt window

Watch and verify: VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation

/buildability/vla-opd-bridging-offline-sft-and-online-rl-for-vision-language-action-models-via-on-policy-distillation

Subject: VLA-OPD: Bridging Offline SFT and Online RL for Vision-Language-Action Models via On-Policy Distillation

Verdict

Watch

Verdict is Watch because viability or proof quality is intermediate and should be re-evaluated before execution.

GitHub Code Pulse

No public code linked for this paper yet.

Claim map

Strong: 7, Mixed: 0, Weak: 0

Evidence (partial): In this paper, we propose On-Policy VLA Distillation (VLA-OPD), a framework bridging the efficiency of SFT with the robustness of RL. Instead of relying on sparse environmental rewards, VLA-OPD leverages an expert teacher to provide dense, token-level supervision on the student's self-generated trajectories.
Implication (partial): This is a core statement of the proposed method, clearly articulated in the abstract and introduction.
Verification: partial

Evidence (partial): Crucially, we formulate VLA-OPD via a Reverse-KL objective. Unlike standard Forward-KL that induces mode-covering entropy explosion, or Hard-CE that causes premature entropy collapse, our bounded mode-seeking objective ensures stable policy learning by filtering out the teacher's epistemic uncertainty while maintaining action diversity.
Implication (partial): The abstract and introduction explicitly detail the use of Reverse-KL and its benefits compared to other objectives.
Verification: partial

Evidence (partial): Experiments on LIBERO and RoboTwin2.0 benchmarks demonstrate that VLA-OPD significantly improves sample efficiency over RL and robustness over SFT, while effectively mitigating catastrophic forgetting during post-training.
Implication (partial): The abstract and analysis section explicitly state the experimental results on these benchmarks.
Verification: partial

Evidence (partial): Experiments on LIBERO and RoboTwin2.0 benchmarks demonstrate that VLA-OPD significantly improves sample efficiency over RL and robustness over SFT, while effectively mitigating catastrophic forgetting during post-training.
Implication (partial): This is a key benefit highlighted in the abstract and introduction.
Verification: partial

Evidence (partial): Unlike standard Forward-KL that induces mode-covering entropy explosion, or Hard-CE that causes premature entropy collapse, our bounded mode-seeking objective ensures stable policy learning by filtering out the teacher's epistemic uncertainty while maintaining action diversity.
Implication (partial): The abstract and introduction explain the mechanism and benefits of the Reverse-KL objective.
Verification: partial

Evidence (partial): This enables active error correction on policy-induced states while preserving pre-trained general capabilities through gentle alignment.
Implication (partial): This describes the functional outcome of the proposed method, as stated in the abstract.
Verification: partial

Evidence (partial): Potential limitations include dependency on the availability of high-performing expert models and the applicability of VLA-OPD in highly dynamic or novel environments.
Implication (partial): This is explicitly mentioned as a caveat in the provided analysis.
Verification: partial
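
The mode-seeking versus mode-covering contrast in the Reverse-KL claims above can be illustrated on a single discrete action-token distribution. The following is a minimal sketch, not the paper's implementation: the three-action distributions and all function names are hypothetical, and an actual VLA-OPD setup would presumably apply such a loss per token on trajectories sampled from the student.

```python
import math


def kl(a, b):
    """KL(a || b) for two discrete distributions over the same actions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)


def reverse_kl(student, teacher):
    """KL(student || teacher): expectation under the student (mode-seeking)."""
    return kl(student, teacher)


def forward_kl(student, teacher):
    """KL(teacher || student): expectation under the teacher (mode-covering)."""
    return kl(teacher, student)


def hard_ce(student, teacher):
    """Cross-entropy against the teacher's argmax action (ties pick the first)."""
    target = max(range(len(teacher)), key=teacher.__getitem__)
    return -math.log(student[target])


# Hypothetical teacher with two equally good actions and one poor action.
teacher = [0.495, 0.01, 0.495]
# A student that commits to one teacher mode.
student_seek = [0.98, 0.01, 0.01]
# A student that spreads mass over every action, including the poor one.
student_cover = [1 / 3, 1 / 3, 1 / 3]

for name, student in [("seek", student_seek), ("cover", student_cover)]:
    print(
        name,
        "reverse_kl=%.3f" % reverse_kl(student, teacher),
        "forward_kl=%.3f" % forward_kl(student, teacher),
        "hard_ce=%.3f" % hard_ce(student, teacher),
    )
```

In this toy case Reverse-KL scores the committed student lower (better) than the spread-out one, while Forward-KL ranks them the other way around, which is the standard mode-seeking versus mode-covering behavior the claim refers to.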

Author intelligence and commercialization panels stay hidden until the proof receipt is verified, cites at least 3 references, includes at least 2 sources, and clears 50% coverage. The paper narrative and citation surfaces remain public while verification is pending.
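
The visibility rule above can be written as a small predicate. This is a sketch under assumptions: the `ProofReceipt` type and `panels_visible` function are hypothetical names, and "clears 50% coverage" is read as coverage of at least 50%, which the page does not confirm.

```python
from dataclasses import dataclass


@dataclass
class ProofReceipt:
    verified: bool      # proof status is "verified"
    references: int     # number of cited references
    sources: int        # number of distinct sources
    coverage: float     # coverage as a fraction in [0, 1]


def panels_visible(receipt: ProofReceipt) -> bool:
    """Return True when all four gating conditions hold.

    Assumption: the coverage threshold is inclusive (>= 0.5).
    """
    return (
        receipt.verified
        and receipt.references >= 3
        and receipt.sources >= 2
        and receipt.coverage >= 0.5
    )


# Values reported on this page: unverified, 26 references, 3 sources, 50%.
current = ProofReceipt(verified=False, references=26, sources=3, coverage=0.50)
print(panels_visible(current))
```

With the values shown on this page, the only failing condition is verification, which matches the panels being hidden while verification is pending.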
