SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/sole-r1-video-language-reasoning-as-the-sole-reward-for-on-robot-reinforcement-learning

Proof freshness: stale
Proof status: unverified
Display score: 7/10
Last proof check: 2026-03-31
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 96
Source count: 4
Coverage: 50%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.

Agent Handoff

Canonical ID: sole-r1-video-language-reasoning-as-the-sole-reward-for-on-robot-reinforcement-learning
Route: /signal-canvas/sole-r1-video-language-reasoning-as-the-sole-reward-for-on-robot-reinforcement-learning

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/sole-r1-video-language-reasoning-as-the-sole-reward-for-on-robot-reinforcement-learning

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "sole-r1-video-language-reasoning-as-the-sole-reward-for-on-robot-reinforcement-learning",
    "query_text": "Summarize SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning"
  }
}
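
The REST and MCP examples above both derive from the canonical ID, so they can be generated rather than typed by hand. The following is a minimal sketch, not an official client: the helper names `handoff_url` and `mcp_payload` are hypothetical, and only the endpoint path, tool name, and argument keys are taken from the examples on this page. It builds the request data without sending anything over the network.

```python
import json

# Endpoint prefix copied from the REST example on this page.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"

CANONICAL_ID = (
    "sole-r1-video-language-reasoning-as-the-sole-reward-"
    "for-on-robot-reinforcement-learning"
)
TITLE = (
    "SOLE-R1: Video-Language Reasoning as the Sole Reward "
    "for On-Robot Reinforcement Learning"
)


def handoff_url(canonical_id: str) -> str:
    """Hypothetical helper: build the agent-handoff URL for a canonical ID."""
    return f"{BASE_URL}/{canonical_id}"


def mcp_payload(canonical_id: str, title: str) -> dict:
    """Hypothetical helper: build the MCP tool call shown in the example."""
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": canonical_id,
            "query_text": f"Summarize {title}",
        },
    }


if __name__ == "__main__":
    print(handoff_url(CANONICAL_ID))
    print(json.dumps(mcp_payload(CANONICAL_ID, TITLE), indent=2))
```

Keeping the canonical ID in one constant avoids the REST URL and the MCP `paper_ref` drifting apart when the sketch is reused for another paper.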

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning",
  "normalized_query": "2603.28730",
  "route": "/signal-canvas/sole-r1-video-language-reasoning-as-the-sole-reward-for-on-robot-reinforcement-learning",
  "paper_ref": "sole-r1-video-language-reasoning-as-the-sole-reward-for-on-robot-reinforcement-learning",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Evidence Receipt

Route status: building

Claims: 8

References: 96

Proof: Verification pending

Freshness state: computing

Source paper: SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

PDF: https://arxiv.org/pdf/2603.28730v1

Source count: 4

Coverage: 50%

Last proof check: 2026-03-31T20:16:57.451Z

Signal Canvas receipt window

Watch and verify: SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

/buildability/sole-r1-video-language-reasoning-as-the-sole-reward-for-on-robot-reinforcement-learning

Subject: SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Verdict

Watch

The verdict is Watch because viability or proof quality is intermediate; the paper should be re-evaluated before execution.

GitHub Code Pulse

No public code linked for this paper yet.

Claim map

Strong: 8
Mixed: 0
Weak: 0

Claim 1
Evidence (partial): SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning.
Implication (partial): Explicitly stated in the abstract as a key result of the work.
Verification: partial

Claim 2
Evidence (partial): SOLE-R1 substantially outperforms strong vision-language rewarders, including GPT-5 and Gemini-3-Pro.
Implication (partial): Directly stated in the abstract and supported by a results figure (Figure 3) showing success rate comparisons.
Verification: partial

Claim 3
Evidence (partial): We generate over one million CoT reasoning examples from more than 40,000 real-world and simulated videos.
Implication (partial): Specific numeric data is provided in the paper text.
Verification: partial

Claim 4
Evidence (partial): To train SOLE-R1, we propose a two-stage hybrid recipe: SFT teaches high-quality CoT reasoning, while RLVR directly emphasizes accurate progress prediction.
Implication (partial): Explicitly described as the core training methodology in Section 4.
Verification: partial

Claim 5
Evidence (partial): SOLE-R1 exhibits markedly greater robustness to reward hacking.
Implication (partial): Directly stated in the abstract as a comparative advantage.
Verification: partial

Claim 6
Evidence (partial): SOLE-R1 succeeds on 24 unseen tasks.
Implication (partial): Specific numeric claim made in the abstract.
Verification: partial

Claim 7
Evidence (partial): SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards.
Implication (partial): Core technical capability is explicitly defined in the abstract and Section 2.
Verification: partial

Claim 8
Evidence (partial): SFT provides a relatively weak learning signal for reward/progress prediction, since the scalar is a small part of the response.
Implication (partial): Directly stated as a limitation of the SFT stage, justifying the need for the RLVR stage.
Verification: partial
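
The claim that dense progress estimates can be used directly as rewards can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not the paper's implementation: it assumes progress is a scalar in [0, 1] per timestep (the page does not state the range), and the function names and the discount factor are hypothetical. The discounted return is the standard RL quantity, included only to show how such rewards would feed a learner.

```python
from typing import List


def rewards_from_progress(progress: List[float]) -> List[float]:
    """Use per-timestep progress estimates directly as rewards.

    Assumption: progress is meant to lie in [0, 1]; values outside
    that range are clipped rather than rejected.
    """
    return [max(0.0, min(1.0, p)) for p in progress]


def discounted_return(rewards: List[float], gamma: float = 0.99) -> float:
    """Standard discounted return: sum over t of gamma**t * rewards[t]."""
    total = 0.0
    # Accumulate from the last step backwards so each step is one multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


if __name__ == "__main__":
    # Hypothetical progress estimates for a five-step rollout.
    estimates = [0.0, 0.1, 0.4, 0.7, 1.0]
    rewards = rewards_from_progress(estimates)
    print(rewards, discounted_return(rewards))
```

Because the reward is dense, every timestep contributes to the return, in contrast to a sparse success indicator that would be zero until the final step.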

Author intelligence and commercialization panels stay hidden until the proof receipt is verified, cites at least 3 references, includes at least 2 sources, and clears 50% coverage. The paper narrative and citation surfaces remain public while verification is pending.
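
The visibility rule above is a conjunction of four conditions, which can be written as a small predicate. This is a minimal sketch: the `ProofReceipt` type and `panels_visible` function are hypothetical names, and treating "clears 50% coverage" as "at least 50%" is an assumption, since the page does not say whether the threshold is inclusive.

```python
from dataclasses import dataclass


@dataclass
class ProofReceipt:
    """Hypothetical container for the receipt fields shown on this page."""
    verified: bool
    references: int
    sources: int
    coverage_pct: float


def panels_visible(receipt: ProofReceipt) -> bool:
    """Return True when all four gating conditions hold.

    Assumption: the coverage threshold is inclusive (>= 50).
    """
    return (
        receipt.verified
        and receipt.references >= 3
        and receipt.sources >= 2
        and receipt.coverage_pct >= 50.0
    )


# Values taken from this page: unverified, 96 references, 4 sources, 50% coverage.
current = ProofReceipt(verified=False, references=96, sources=4, coverage_pct=50.0)

if __name__ == "__main__":
    print(panels_visible(current))
```

With the values currently shown on this page, the predicate is false only because the proof is unverified; the reference and source counts already clear their thresholds.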
