AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation | Signal Canvas | ScienceToStartup

Stale | 76d ago | Verification pending / evidence receipt incomplete

Viability

0.0/10

Compared to this week’s papers

Verification pending

Use This Via API or MCP

Use Signal Canvas as the narrative proof surface

Signal Canvas is the citation-first public layer for turning one paper into a structured commercialization narrative. Use it to hand off into REST, MCP, Build Loop, and launch-pack execution without losing source lineage.

Route this paper proof surface into REST, MCP, or developer workflows while preserving the same evidence receipt and related-resource context.

Page Freshness

Signal Canvas proof surface

Canonical route: /signal-canvas/adarubric-task-adaptive-rubrics-for-llm-agent-evaluation

Proof freshness: stale
Proof status: verified
Display score: 8/10
Last proof check: 2026-03-24
Score updated: 2026-04-02
Score fresh until: 2026-05-02
References: 0
Source count: 0
Coverage: 50%

This page is showing the last landed evidence receipt and score bundle because the latest proof data is outside the freshness window.

Agent Handoff

AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation

Canonical ID adarubric-task-adaptive-rubrics-for-llm-agent-evaluation | Route /signal-canvas/adarubric-task-adaptive-rubrics-for-llm-agent-evaluation

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas/adarubric-task-adaptive-rubrics-for-llm-agent-evaluation

MCP example

{
  "tool": "search_signal_canvas",
  "arguments": {
    "mode": "paper",
    "paper_ref": "adarubric-task-adaptive-rubrics-for-llm-agent-evaluation",
    "query_text": "Summarize AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation"
  }
}
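
The REST and MCP examples above both derive from the same canonical ID. As a minimal sketch, the following Python builds the REST URL and the MCP tool-call payload from that ID. The helper names `handoff_url` and `mcp_request` are hypothetical, and the sketch only constructs the request data shown on this page; it does not contact the service or assume anything about the response format.

```python
import json
from urllib.parse import quote

# Base path copied from the REST example on this page.
BASE_URL = "https://sciencetostartup.com/api/v1/agent-handoff/signal-canvas"


def handoff_url(paper_ref: str) -> str:
    """Return the REST handoff URL for a canonical paper ID (hypothetical helper)."""
    return f"{BASE_URL}/{quote(paper_ref, safe='')}"


def mcp_request(paper_ref: str, title: str) -> dict:
    """Return the MCP tool-call payload in the shape shown on this page (hypothetical helper)."""
    return {
        "tool": "search_signal_canvas",
        "arguments": {
            "mode": "paper",
            "paper_ref": paper_ref,
            "query_text": f"Summarize {title}",
        },
    }


PAPER_REF = "adarubric-task-adaptive-rubrics-for-llm-agent-evaluation"
TITLE = "AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation"

if __name__ == "__main__":
    print(handoff_url(PAPER_REF))
    print(json.dumps(mcp_request(PAPER_REF, TITLE), indent=2))
```

Keeping the canonical ID in one constant means the REST route and the MCP `paper_ref` cannot drift apart.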

source_context

{
  "surface": "signal_canvas",
  "mode": "paper",
  "query": "AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation",
  "normalized_query": "2603.21362",
  "route": "/signal-canvas/adarubric-task-adaptive-rubrics-for-llm-agent-evaluation",
  "paper_ref": "adarubric-task-adaptive-rubrics-for-llm-agent-evaluation",
  "topic_slug": null,
  "benchmark_ref": null,
  "dataset_ref": null
}

Paper mode · single-doc scope · scope: adarubric-task-adaptive-rubrics-for-llm-agent-evaluation

GitHub Code Pulse (cached)

Stars: 188
Health: C
Last commit: 6/7/2026
Forks: 20

Claim map

Strong 8 | Mixed 0 | Weak 0

Evidence (partial): We present ADARUBRIC, which closes this gap by generating task-specific evaluation rubrics on the fly from task descriptions
Implication (partial): Explicitly stated in the abstract as the core method of the paper.
Verification: partial
Evidence (partial): On WebArena and ToolBench, ADARUBRIC achieves Pearson r=0.79 human correlation (+0.16 over the best static baseline)
Implication (partial): Direct numeric evidence provided in the abstract with a clear comparison.
Verification: partial
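
The human-correlation claim above relies on Pearson's r between judge scores and human scores. As a minimal sketch of that statistic, assuming one judge score and one human score per trajectory, the following Python computes it from scratch. The function name `pearson_r` and the sample scores are illustrative only; they are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical judge and human scores for five trajectories (not paper data).
    judge_scores = [4.0, 2.5, 3.0, 5.0, 1.5]
    human_scores = [4.5, 2.0, 3.5, 4.5, 1.0]
    print(round(pearson_r(judge_scores, human_scores), 3))
```

A value near 1 means the judge ranks and spaces trajectories much as humans do; the reported +0.16 is a difference between two such coefficients.
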
Evidence (partial): DPO agents trained on ADARUBRIC preference pairs gain +6.8 to +8.5 pp task success over Prometheus across three benchmarks
Implication (partial): Specific numeric results stated in the abstract with a clear comparison.
Verification: partial
Evidence (partial): filtering preference pairs with the novel DimensionAwareFilter - a provably necessary condition for preventing high-scoring dimensions from masking dimension-level failures
Implication (partial): Directly stated in the abstract with technical justification.
Verification: partial
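
The masking problem named in the claim above can be illustrated with a small example. The following Python is a sketch under stated assumptions, not the paper's DimensionAwareFilter: it assumes per-dimension scores on a 1-5 scale, an aggregate defined as the mean, and a hypothetical per-dimension floor. The names `passes_dimension_floor` and `filter_pairs` are invented for illustration.

```python
from statistics import mean


def passes_dimension_floor(scores, floor):
    """True if every rubric dimension meets the floor (assumed rule, not the paper's)."""
    return all(value >= floor for value in scores.values())


def filter_pairs(pairs, floor=3.0):
    """Keep (chosen, rejected) pairs where the chosen trajectory wins on the
    mean score AND has no single dimension below the floor."""
    kept = []
    for chosen, rejected in pairs:
        if mean(chosen.values()) <= mean(rejected.values()):
            continue  # chosen does not win on the aggregate score
        if not passes_dimension_floor(chosen, floor):
            continue  # a high aggregate is hiding a failed dimension
        kept.append((chosen, rejected))
    return kept


if __name__ == "__main__":
    # Hypothetical scores: the mean (about 3.67) hides an Error Handling failure.
    masked = {"Correctness": 5, "Error Handling": 1, "Efficiency": 5}
    clean = {"Correctness": 4, "Error Handling": 4, "Efficiency": 4}
    baseline = {"Correctness": 3, "Error Handling": 3, "Efficiency": 3}
    kept = filter_pairs([(masked, baseline), (clean, baseline)])
    print(len(kept))
```

Under these assumptions, an aggregate-only filter would accept both pairs, while the per-dimension check drops the pair whose chosen trajectory fails Error Handling.
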
Evidence (partial): with deployment-grade reliability (Krippendorff's α=0.83)
Implication (partial): Specific reliability metric provided with a clear numeric value.
Verification: partial
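
Krippendorff's α, the reliability metric in the claim above, is defined as 1 - D_o/D_e, the ratio of observed to expected disagreement. As a minimal sketch, the following Python computes it for the simplest case: interval-scale ratings with no missing values and the same number of raters per unit. That setup is an assumption for illustration; the page does not state which variant or data layout the paper uses.

```python
from itertools import permutations


def krippendorff_alpha_interval(ratings):
    """Krippendorff's alpha for complete interval-scale data.

    ratings: list of units, each a list of numeric ratings, one per rater.
    Every unit must have the same number of ratings (at least 2).
    """
    raters = len(ratings[0])
    if raters < 2 or any(len(unit) != raters for unit in ratings):
        raise ValueError("each unit needs the same number (>= 2) of ratings")
    values = [v for unit in ratings for v in unit]
    total = len(values)
    # Observed disagreement: squared differences between ratings of the same unit.
    d_observed = sum(
        (a - b) ** 2 for unit in ratings for a, b in permutations(unit, 2)
    ) / (total * (raters - 1))
    # Expected disagreement: squared differences between all pairs of ratings.
    d_expected = sum(
        (a - b) ** 2 for a, b in permutations(values, 2)
    ) / (total * (total - 1))
    if d_expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1 - d_observed / d_expected


if __name__ == "__main__":
    # Hypothetical ratings: three units, two raters each (not paper data).
    print(krippendorff_alpha_interval([[1, 1], [2, 3], [5, 5]]))
```

An α of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.
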
Evidence (partial): gains transfer to SWE-bench code repair (+4.9 pp)
Implication (partial): Specific transfer result stated with numeric evidence.
Verification: partial
Evidence (partial): accelerate PPO convergence by +6.6 pp at 5K steps - both without any rubric engineering
Implication (partial): Specific training acceleration result stated with numeric evidence.
Verification: partial
Evidence (partial): LLM-as-Judge evaluation fails agent tasks because a fixed rubric cannot capture what matters for this task: code debugging demands Correctness and Error Handling; web navigation demands Goal Alignment and Action Efficiency
Implication (partial): Problem statement made explicitly in the abstract, with examples.
Verification: partial
