Weekly Benchmark Scoreboard | ScienceToStartup

Scoreboard · 50 papers

Rank	Paper	Note	Score	Move
01	Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation	Solid commercial fit; worth a closer look this week.	113.3	—
02	GPIC: A Giant Permissive Image Corpus for Visual Generation	Quiet paper, loud community.	90.5	—
03	Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes	Solid commercial fit; worth a closer look this week.	80.0	—
04	MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings	Solid commercial fit; worth a closer look this week.	80.0	—
05	mcp-proto-okn: Natural-language access to open scientific knowledge graphs through the Model Context Protocol	Solid commercial fit; worth a closer look this week.	80.0	—
06	Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments	Solid commercial fit; worth a closer look this week.	80.0	—
07	Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection	Solid commercial fit; worth a closer look this week.	80.0	—
08	PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions	Solid commercial fit; worth a closer look this week.	80.0	—
09	How LoRA Remembers? A Parametric Memory Law for LLM Finetuning	Solid commercial fit; worth a closer look this week.	80.0	—
10	Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency	Solid commercial fit; worth a closer look this week.	80.0	—
11	CalArena: A Large-Scale Post-Hoc Calibration Benchmark	Solid commercial fit; worth a closer look this week.	80.0	—
12	No More K-means: Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval	Solid commercial fit; worth a closer look this week.	80.0	—
13	PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers	Solid commercial fit; worth a closer look this week.	80.0	—
14	Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison	Solid commercial fit; worth a closer look this week.	80.0	—
15	REPOT: Recoverable Program-of-Thought via Checkpoint Repair	Solid commercial fit; worth a closer look this week.	80.0	—
16	Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning	Solid commercial fit; worth a closer look this week.	80.0	—
17	Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models	Solid commercial fit; worth a closer look this week.	80.0	—
18	VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies	Solid commercial fit; worth a closer look this week.	80.0	—
19	Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation	Solid commercial fit; worth a closer look this week.	80.0	—
20	Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent	Solid commercial fit; worth a closer look this week.	80.0	—
21	CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving	Solid commercial fit; worth a closer look this week.	80.0	—
22	Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering	Solid commercial fit; worth a closer look this week.	80.0	—
23	Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering	Solid commercial fit; worth a closer look this week.	80.0	—
24	NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs	Solid commercial fit; worth a closer look this week.	80.0	—
25	BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices	Solid commercial fit; worth a closer look this week.	80.0	—
26	VikingMem: A Memory Base Management System for Stateful LLM-based Applications	Solid commercial fit; worth a closer look this week.	80.0	—
27	ParaTool: Shifting Tool Representations from Context to Parameters	Solid commercial fit; worth a closer look this week.	80.0	—
28	PhoneWorld: Scaling Phone-Use Agent Environments	Solid commercial fit; worth a closer look this week.	80.0	—
29	GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models	Solid commercial fit; worth a closer look this week.	80.0	—
30	PassNet: Scaling Large Language Models for Graph Compiler Pass Generation	Solid commercial fit; worth a closer look this week.	80.0	—
31	Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models	Solid commercial fit; worth a closer look this week.	80.0	—
32	Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth	Solid commercial fit; worth a closer look this week.	80.0	—
33	AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security	Long-shot bet with outsized upside if it lands.	79.9	—
34	VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion	Solid commercial fit; worth a closer look this week.	70.0	—
35	LLMSurgeon: Diagnosing Data Mixture of Large Language Models	Solid commercial fit; worth a closer look this week.	70.0	—
36	SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations	Solid commercial fit; worth a closer look this week.	70.0	—
37	Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection	Solid commercial fit; worth a closer look this week.	70.0	—
38	Demystifying Data Organization for Enhanced LLM Training	Solid commercial fit; worth a closer look this week.	70.0	—
39	RoboWits: Unexpected Challenges for Robotic Creative Problem Solving	Solid commercial fit; worth a closer look this week.	70.0	—
40	In-Context Reward Adaptation for Robust Preference Modeling	Solid commercial fit; worth a closer look this week.	70.0	—
41	Archon: A Unified Multimodal Model for Holistic Digital Human Generation	Solid commercial fit; worth a closer look this week.	70.0	—
42	City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images	Solid commercial fit; worth a closer look this week.	70.0	—
43	MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection	Solid commercial fit; worth a closer look this week.	70.0	—
44	ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure	Solid commercial fit; worth a closer look this week.	70.0	—
45	Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning	Solid commercial fit; worth a closer look this week.	70.0	—
46	Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization	Solid commercial fit; worth a closer look this week.	70.0	—
47	BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models	Solid commercial fit; worth a closer look this week.	70.0	—
48	When Should Models Change Their Minds? Contextual Belief Management in Large Language Models	Solid commercial fit; worth a closer look this week.	70.0	—
49	Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale	Solid commercial fit; worth a closer look this week.	70.0	—
50	What drives performance in molecular MPNNs? An operator-level factorial benchmark	Solid commercial fit; worth a closer look this week.	70.0	—

8-week rank trajectory · top 8

Who held #1 across all weeks

Paper	Weeks at #1	Weeks in top 3	Best rank
Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation	1	1	#1
MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset	1	1	#1
MeMo: Memory as a Model	1	1	#1
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models	1	1	#1
Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding	1	1	#1
Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows	1	1	#1
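
The three columns in this table can be recomputed from per-week rankings. The following is a minimal sketch, assuming each weekly snapshot has the `week_start` / `rankings` shape shown in the preview response further down this page. The function name `summarize_leaders` and the sample titles are hypothetical and are not part of the published API.

```python
from collections import defaultdict


def summarize_leaders(snapshots):
    """Aggregate weekly rankings into per-paper leaderboard stats.

    `snapshots` is assumed to be a list of dicts shaped like the entries of
    `data` in the preview response: {"week_start": ..., "rankings": [...]}.
    """
    stats = defaultdict(
        lambda: {"weeks_at_1": 0, "weeks_in_top_3": 0, "best_rank": None}
    )
    for snapshot in snapshots:
        for entry in snapshot["rankings"]:
            row = stats[entry["title"]]
            rank = entry["rank"]
            if rank == 1:
                row["weeks_at_1"] += 1
            if rank <= 3:
                row["weeks_in_top_3"] += 1
            if row["best_rank"] is None or rank < row["best_rank"]:
                row["best_rank"] = rank
    # Keep only papers that reached #1 at least once, as the table does.
    return {title: row for title, row in stats.items() if row["weeks_at_1"] > 0}


# Made-up placeholder data, not taken from the real scoreboard.
example = [
    {"week_start": "2026-05-18",
     "rankings": [{"rank": 1, "title": "Paper A"}, {"rank": 2, "title": "Paper B"}]},
    {"week_start": "2026-05-25",
     "rankings": [{"rank": 1, "title": "Paper B"}, {"rank": 4, "title": "Paper A"}]},
]
leaders = summarize_leaders(example)
```

In the placeholder data, each paper leads for one week, and "Paper B" also counts a second top-3 week.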

Provenance

Coverage window: Week of 2026-05-25
Method version: v2
Fresh until: 2026-06-01T11:04:10.771Z
Artifact ID: live-benchmark:2026-05-25:b538f9d3ca1ab1a4
Receipt: https://sciencetostartup.com/api/v1/resources/benchmark?artifact_id=live-benchmark%3A2026-05-25%3Ab538f9d3ca1ab1a4
SHA-256 (json): a558d613fa504f3c…
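
The provenance fields can be checked programmatically. The sketch below shows two things: how the receipt URL follows from percent-encoding the artifact ID, and how a downloaded export might be compared against the published digest. The helper names are hypothetical. It is also an assumption that the digest is computed over the raw bytes of the JSON export, since the page does not say. Because the page shows only a truncated digest, the check compares a prefix.

```python
import hashlib
from urllib.parse import quote

RECEIPT_BASE = "https://sciencetostartup.com/api/v1/resources/benchmark"


def receipt_url(artifact_id: str) -> str:
    """Build the receipt URL by percent-encoding the artifact ID."""
    return f"{RECEIPT_BASE}?artifact_id={quote(artifact_id, safe='')}"


def matches_digest_prefix(payload: bytes, hex_prefix: str) -> bool:
    """Check whether SHA-256(payload) starts with the given hex prefix.

    Assumption: the published digest covers the raw bytes of the JSON
    export. The page does not confirm this.
    """
    return hashlib.sha256(payload).hexdigest().startswith(hex_prefix.lower())


url = receipt_url("live-benchmark:2026-05-25:b538f9d3ca1ab1a4")
```

A prefix match is weaker evidence than a full-digest match, so prefer the complete hash when it is available.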

API

GET /api/v1/resources/benchmark	application/json	Current + historical scoreboard metadata.
GET /api/v1/resources/benchmark/export?format=json	application/json	Full snapshot JSON.
GET /api/v1/resources/benchmark/export?format=csv	text/csv	Flat CSV of every paper, every week.
GET /api/v1/resources/benchmark/export?format=pdf	application/pdf	Print-ready PDF of the latest week.
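
As a small illustration of the export endpoints listed above, this sketch builds the export URL for a given format and records the content type each format is listed with. The host is taken from the full URL shown elsewhere on this page. The names `export_url` and `EXPORT_CONTENT_TYPES` are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://sciencetostartup.com"
EXPORT_PATH = "/api/v1/resources/benchmark/export"

# Content types as listed for each export format on this page.
EXPORT_CONTENT_TYPES = {
    "json": "application/json",
    "csv": "text/csv",
    "pdf": "application/pdf",
}


def export_url(fmt: str) -> str:
    """Return the export URL for one of the documented formats."""
    if fmt not in EXPORT_CONTENT_TYPES:
        raise ValueError(f"unsupported export format: {fmt!r}")
    return f"{BASE_URL}{EXPORT_PATH}?{urlencode({'format': fmt})}"


csv_url = export_url("csv")
```

Rejecting unknown formats locally avoids sending a request the listed endpoints do not describe.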

Preview response

{
  "meta": {
    "count": 12,
    "source": "benchmark_snapshots",
    "artifact_id": "live-benchmark:2026-05-25:b538f9d3ca1ab1a4",
    "last_updated_at": "2026-05-25T11:04:10.771Z",
    "fresh_until": "2026-06-01T11:04:10.771Z",
    "status": "ready",
    "reason_code": "surface_ready",
    "method_version": "v2",
    "coverage_window": "Week of 2026-05-25"
  },
  "data": [
    {
      "week_start": "2026-05-25",
      "rankings": [
        {
          "rank": 1,
          "arxiv_id": "2605.29430v1",
          "title": "Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation",
          "viability_score": 8,
          "composite": 113.3,
          "unicorn_probability": 0.88,
          "total_votes": 68,
          "star_velocity": 0,
          "rank_delta": null
        }
      ]
    }
  ]
}
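
A consumer of this response might check freshness and then read the top entries. The sketch below works on a trimmed copy of the preview above. Treating a snapshot as usable only while `status` is `ready` and the current time is not past `fresh_until` is an assumed reading of those fields, and the helper names are hypothetical.

```python
import json
from datetime import datetime, timezone

# Trimmed copy of the preview response (only the fields used below).
preview = json.loads("""
{
  "meta": {"fresh_until": "2026-06-01T11:04:10.771Z", "status": "ready"},
  "data": [
    {
      "week_start": "2026-05-25",
      "rankings": [
        {"rank": 1, "arxiv_id": "2605.29430v1", "composite": 113.3, "rank_delta": null}
      ]
    }
  ]
}
""")


def parse_timestamp(value: str) -> datetime:
    """Parse the ISO 8601 timestamps used in `meta` (trailing Z means UTC)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def is_fresh(meta: dict, now: datetime) -> bool:
    """Assumed rule: fresh while status is ready and now <= fresh_until."""
    return meta["status"] == "ready" and now <= parse_timestamp(meta["fresh_until"])


def top_entries(payload: dict, limit: int = 3) -> list:
    """Return (rank, arxiv_id, composite) tuples for the latest week."""
    latest = max(payload["data"], key=lambda week: week["week_start"])
    ranked = sorted(latest["rankings"], key=lambda entry: entry["rank"])
    return [(e["rank"], e["arxiv_id"], e["composite"]) for e in ranked[:limit]]


top = top_entries(preview)
```

Comparing `week_start` as a string works here because the dates are in ISO `YYYY-MM-DD` form, which sorts chronologically.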

https://sciencetostartup.com/api/v1/resources/benchmark

Use This Via API or MCP

Use the benchmark as a ranking and proof layer

The weekly scoreboard is a stable surface for agents that need ranked papers, comparison logic, and a public proof artifact they can cite.

Agent Handoff

Weekly Benchmark Scoreboard

Canonical ID: benchmark | Route: /resources/benchmark

REST example

curl https://sciencetostartup.com/api/v1/agent-handoff/benchmark/benchmark

MCP example

{
  "tool": "get_signal_fusion_rankings",
  "arguments": {
    "limit": 10
  }
}
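
An MCP client library normally builds the wire message for you. For readers calling a server by hand, the sketch below shows how the tool name and arguments above map onto a JSON-RPC 2.0 `tools/call` request as defined by the Model Context Protocol. The helper `mcp_tool_call` is hypothetical, and the transport and endpoint this server expects are not stated on this page.

```python
import json
from itertools import count

# Simple monotonically increasing JSON-RPC request IDs.
_request_ids = count(1)


def mcp_tool_call(tool: str, arguments: dict) -> dict:
    """Wrap a tool name and arguments in an MCP `tools/call` request."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


request = mcp_tool_call("get_signal_fusion_rankings", {"limit": 10})
body = json.dumps(request)
```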

source_context

{
  "surface": "benchmark",
  "mode": "ranking",
  "query": "weekly benchmark scoreboard",
  "normalized_query": "benchmark",
  "route": "/resources/benchmark",
  "paper_ref": null,
  "topic_slug": null,
  "benchmark_ref": "benchmark",
  "dataset_ref": null
}

Embed the weekly benchmark in any page with a single iframe. The embed updates automatically every Monday.

<iframe
  src="https://sciencetostartup.com/resources/embed/trending?week=2026-05-25"
  width="640"
  height="480"
  loading="lazy"
  title="ScienceToStartup Weekly Benchmark"
></iframe>
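
If you generate the embed server-side, a sketch like the following can produce the snippet for any date. It assumes the `week` parameter expects the ISO date of that week's Monday, which is inferred from the example value (2026-05-25 is a Monday) and the Monday update cadence rather than stated in any documentation. The helper names are hypothetical.

```python
from datetime import date, timedelta
from html import escape

EMBED_BASE = "https://sciencetostartup.com/resources/embed/trending"


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def embed_snippet(day: date, width: int = 640, height: int = 480) -> str:
    """Build the iframe markup for the week containing `day`.

    Assumption: `week` takes the Monday's ISO date, as in the example above.
    """
    src = f"{EMBED_BASE}?week={week_start(day).isoformat()}"
    return (
        f'<iframe src="{escape(src)}" width="{width}" height="{height}" '
        'loading="lazy" title="ScienceToStartup Weekly Benchmark"></iframe>'
    )


snippet = embed_snippet(date(2026, 5, 27))
```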