Buildability Receipt Backlinks
Evidence links to receipt scaffolds, not external adoption claims.
Evidence runs can now link directly to buildability receipt scaffolds, so proof context stays aligned across the paper, Signal Canvas, and Build Loop routes while external validation remains explicitly gated.
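As a rough illustration of the backlink itself, a minimal sketch follows; every identifier in it (ReceiptBacklink, runId, receiptScaffoldId, routes, externalValidation) is an assumed name for illustration, not the product's actual schema.

```ts
// Hypothetical shape for an evidence-run to receipt-scaffold backlink.
// All names and values here are assumptions, not the real data model.
type Route = "paper" | "signal-canvas" | "build-loop";

interface ReceiptBacklink {
  runId: string;               // evidence run that selected the receipt
  receiptScaffoldId: string;   // buildability receipt scaffold it points at
  routes: Route[];             // surfaces that should share the same proof context
  externalValidation: "gated"; // external adoption claims stay explicitly gated
}

const backlink: ReceiptBacklink = {
  runId: "run-0001",
  receiptScaffoldId: "receipt-scaffold-0001",
  routes: ["paper", "signal-canvas", "build-loop"],
  externalValidation: "gated",
};
```

One deliberate choice in this sketch is keeping externalValidation as a literal "gated" value, mirroring the copy above: the backlink carries proof context, never an adoption claim.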
Evidence receipt window
Buildability receipt unavailable
evidence-workstation
Subject: Evidence workstation
Verdict
Insufficient data
Evidence has no canonical paper receipt selected until a query, report, or paper handoff selects one.
Time to first demo
Insufficient data
No canonical receipt is available, so demo lead-time cannot be reported.
Compute envelope
Structured compute envelope
Insufficient data
No canonical receipt is available, so compute requirements cannot be reported.
Evidence ids
Evidence ids
Insufficient data
No receipt id, paper id, proof run id, or evidence hash is available.
Freshness
Freshness
Insufficient data
No receipt timestamp or evidence verification timestamp is available.
Hash state
Immutable hash
Insufficient data
No canonical receipt hash is available.
Signature state
External signature
unsigned_external
No founder, registry, pilot, or production-adoption signature is attached to this receipt.
Verification
not_verified
Verification is blocked until an external signature is provided.
Blockers
- Evidence has no canonical paper receipt selected until a query, report, or paper handoff selects one.
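To make the receipt states above concrete, here is a minimal sketch of how the panel's fields could be typed; the field names, optionality, and union values are inferred from the labels above and are assumptions, not a published receipt schema.

```ts
// Minimal sketch of the receipt fields shown in the panel above.
// Field names, optionality, and union values are assumptions.
interface BuildabilityReceipt {
  subject: string;                       // e.g. "Evidence workstation"
  verdict: "insufficient_data" | string; // no canonical receipt selected yet
  timeToFirstDemo?: string;              // demo lead time, when reportable
  computeEnvelope?: string;              // structured compute envelope, when reportable
  evidenceIds?: {
    receiptId?: string;
    paperId?: string;
    proofRunId?: string;
    evidenceHash?: string;
  };
  freshness?: string;                    // receipt or evidence verification timestamp
  immutableHash?: string;                // canonical receipt hash
  signatureState: "unsigned_external" | "signed_external"; // "signed_external" is an assumed counterpart value
  verification: "not_verified" | "verified";
  blockers: string[];
}

// Verification stays blocked until an external signature is attached.
function canVerify(receipt: BuildabilityReceipt): boolean {
  return receipt.signatureState === "signed_external";
}
```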
Reviewable research runs with screening, extraction, consensus, and export-ready reports.
Evidence is the operator workstation for defining a question, screening candidates, inspecting proof, running consensus, extracting structured fields, synthesizing a report, and seeding a workspace with provenance.
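A minimal sketch of that run lifecycle follows; the stage identifiers and record shape are assumptions chosen to mirror the description above, not the workstation's real data model.

```ts
// Hypothetical run stages mirroring the workflow described above; names are assumptions.
type RunStage =
  | "define_question"
  | "screen_candidates"
  | "inspect_proof"
  | "run_consensus"
  | "extract_fields"
  | "synthesize_report"
  | "seed_workspace";

interface EvidenceRun {
  question: string;
  stage: RunStage;      // current step in the run
  provenance: string[]; // citations and ids carried through every stage
}
```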
Preview requested for "Compute concentration and frontier model economics".
Previewing the top Evidence hits for "Compute concentration and frontier model economics".
Evidence
Define a question, screen candidates, inspect evidence, run consensus, extract fields, and synthesize a cited report.
Search results will appear with a streamed summary.
QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals
LLM Evaluation | 2026-04-17
Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
Dissecting AI Trading: Behavioral Finance and Market Bubbles
AI Agents in Finance | 2026-04-20
We study how AI agents form expectations and trade in experimental asset markets. Using a simulated open-call auction populated by autonomous Large Language Model (LLM) agents, we document three main findings. First, AI agents exhibit classic behavioral patterns: a pronounced disposition effect and recency-weighted extrapolative beliefs. Second, these individual-level patterns aggregate into equilibrium dynamics that replicate classic experimental findings (Smith et al., 1988), including the predictive power of excess demand for future prices and the positive relationship between disagreement and trading volume. Third, by analyzing the agents' reasoning text through a twenty-mechanism scoring framework, we show that targeted prompt interventions causally amplify or suppress specific behavioral mechanisms, significantly altering the magnitude of market bubbles.
Safety-Critical Contextual Control via Online Riemannian Optimization with World Models
Safety-Critical Control | 2026-04-21
Modern world models are becoming too complex to admit explicit dynamical descriptions. We study safety-critical contextual control, where a Planner must optimize a task objective using only feasibility samples from a black-box Simulator, conditioned on a context signal $ξ_t$. We develop a sample-based Penalized Predictive Control (PPC) framework grounded in online Riemannian optimization, in which the Simulator compresses the feasibility manifold into a score-based density $\hat{p}(u \mid ξ_t)$ that endows the action space with a Riemannian geometry guiding the Planner's gradient descent. The barrier curvature $κ(ξ_t)$, the minimum curvature of the conditional log-density $-\ln\hat{p}(\cdot \mid ξ_t)$, governs both convergence rate and safety margin, replacing the Lipschitz constant of the unknown dynamics. Our main result is a contextual safety bound showing that the distance from the true feasibility manifold is controlled by the score estimation error and a ratio that depends on $κ(ξ_t)$, both of which improve with richer context. Simulations on a dynamic navigation task confirm that contextual PPC substantially outperforms marginal and frozen density models, with the advantage growing after environment shifts.
Resolving space-sharing conflicts in road user interactions through uncertainty reduction: An active inference-based computational model
Autonomous Driving Behavior Modeling | 2026-04-21
Understanding how road users resolve space-sharing conflicts is important both for traffic safety and the safe deployment of autonomous vehicles. While existing models have captured specific aspects of such interactions (e.g., explicit communication), a theoretically-grounded computational framework has been lacking. In this paper, we extend a previously developed active inference-based driver behavior model to simulate interactive behavior of two agents. Our model captures three complementary mechanisms for uncertainty reduction in interaction: (i) implicit communication via direct behavioral coupling, (ii) reliance on normative expectations (stop signs, priority rules, etc.), and (iii) explicit communication. In a simplified intersection scenario, we show that normative and explicit communication cues can increase the likelihood of a successful conflict resolution. However, this relies on agents acting as expected. In situations where another agent (intentionally or unintentionally) violates normative expectations or communicates misleading information, reliance on these cues may induce collisions. These findings illustrate how active inference can provide a novel framework for modeling road user interactions which is also applicable in other fields.
On The Mathematics of the Natural Physics of Optimization
Optimization Theory | 2026-04-19
A number of optimization algorithms have been inspired by the physics of Newtonian motion. Here, we ask the question: do algorithms themselves obey some "natural laws of motion," and can they be derived by an application of these laws? We explore this question by positing the theory that optimization algorithms may be considered as some manifestation of hidden algorithm primitives that obey certain universal non-Newtonian dynamics. This natural physics of optimization is developed by equating the terminal transversality conditions of an optimal control problem to the generalized Karush/John-Kuhn-Tucker conditions of an optimization problem. Through this equivalence formulation, the data functions of a given constrained optimization problem generate a natural vector field that permeates an entire hidden space with information on the optimality conditions. An "action-at-a-distance" operation via a Pontryagin-type minimum principle produces a local action to deliver a globalized result by way of a Hamilton-Jacobi inequality. An inverse-optimal algorithm is generated by performing control jumps that dissipate quantized "energy" defined by a search Lyapunov function. Illustrative applications of the proposed theory show that a large number of algorithms can be generated and explained in terms of the new mathematical physics of optimization.
Prompt Optimization Enables Stable Algorithmic Collusion in LLM Agents
LLM Agents | 2026-04-20
LLM agents in markets present algorithmic collusion risks. While prior work shows LLM agents reach supracompetitive prices through tacit coordination, existing research focuses on hand-crafted prompts. The emerging paradigm of prompt optimization necessitates new methodologies for understanding autonomous agent behavior. We investigate whether prompt optimization leads to emergent collusive behaviors in market simulations. We propose a meta-learning loop where LLM agents participate in duopoly markets and an LLM meta-optimizer iteratively refines shared strategic guidance. Our experiments reveal that meta-prompt optimization enables agents to discover stable tacit collusion strategies with substantially improved coordination quality compared to baseline agents. These behaviors generalize to held-out test markets, indicating discovery of general coordination principles. Analysis of evolved prompts reveals systematic coordination mechanisms through stable shared strategies. Our findings call for further investigation into AI safety implications in autonomous multi-agent systems.
Information Aggregation with AI Agents
AI Agents | 2026-04-21
Can Large Language Models (AI agents) aggregate dispersed private information through trading and reason about the knowledge of others by observing price movements? We conduct a controlled experiment where AI agents trade in a prediction market after receiving private signals, measuring information aggregation by the log error of the last price. We find that although the median market is effective at aggregating information in the easy information structures, increasing the complexity has a significant and negative impact, suggesting that AI agents may suffer from the same limitations as humans when reasoning about others. Consistent with our theoretical predictions, information aggregation remains unaffected by allowing cheap talk communication, changing the duration of the market or initial price, and strategic prompting, thus demonstrating that prediction markets are robust. We establish that "smarter" AI agents perform better at aggregation and they are more profitable. Surprisingly, giving them feedback about past performance makes them worse at aggregation and reduces their profits.
Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion
LLM Benchmarking | 2026-04-20
We present a systematic evaluation of large language model families, spanning both proprietary cloud APIs and locally-hosted open-source models, on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77-89% overall pass rates; the best local model reaches 77% (Kimi K2.5 GGUF Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50-100% on model building steps and 47-75% on feedback explanation, but only 0-50% on error fixing, a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of model type effects on performance: we compare reasoning vs. instruction-tuned architectures, GGUF (llama.cpp) vs. MLX (mlx_lm) backends, and quantization levels (Q3 / Q4_K_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep (t, p, k) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B-123B parameter models on Apple Silicon.
Ask a follow-up about current results.
AI-classified from paper abstracts
Top evidence for "Compute concentration and frontier model economics" currently leans supportive, led by QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals; Dissecting AI Trading: Behavioral Finance and Market Bubbles; and Safety-Critical Contextual Control via Online Riemannian Optimization with World Models.
Limitations or caveats dominate the visible abstract evidence.
The visible evidence is mixed or incomplete.
Positive performance or applicability signals are visible in the title or abstract.
Limitations or caveats dominate the visible abstract evidence.
The visible evidence is mixed or incomplete.
Positive performance or applicability signals are visible in the title or abstract.
Positive performance or applicability signals are visible in the title or abstract.
The visible evidence is mixed or incomplete.
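The stance labels above suggest a three-way classification per abstract plus an overall lean for the query. The sketch below shows one hypothetical way to tally such labels; the label names and the majority rule are assumptions for illustration, not the product's actual classifier.

```ts
// Hypothetical aggregation of per-abstract stance labels.
// The three labels and the tallying rule are assumptions.
type Stance = "supportive" | "mixed" | "caveats";

function overallLean(stances: Stance[]): Stance {
  const counts = { supportive: 0, mixed: 0, caveats: 0 };
  for (const s of stances) counts[s] += 1;
  if (counts.supportive > counts.caveats && counts.supportive >= counts.mixed) return "supportive";
  if (counts.caveats > counts.supportive && counts.caveats >= counts.mixed) return "caveats";
  return "mixed";
}

// With the eight labels shown above (three supportive, three mixed, two caveat-heavy),
// this rule would report an overall supportive lean, matching the summary line.
const lean = overallLean([
  "caveats", "mixed", "supportive", "caveats",
  "mixed", "supportive", "supportive", "mixed",
]);
```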
Build With These Results
Copy prompts into your favorite AI coding tool to start building.
Lightweight coding agent in your terminal.
Agentic coding tool for terminal workflows.
AI agent mindset installer and workflow scaffolder.
AI-first code editor built on VS Code.
Free, open-source editor by Microsoft.
People Also Ask
Evidence questions
What is the ScienceToStartup evidence surface?
It is the reviewable evidence workstation for search, screening, extraction, consensus, and export-ready reports with provenance-aware outputs.
How is Evidence different from the Daily Dashboard?
The Daily Dashboard is the live operator surface. Evidence is the deeper workstation for explicit runs, cited outputs, and exportable report artifacts.
Can Evidence feed the rest of the product?
Yes. Evidence runs can seed proof surfaces, Signal Canvas, workspaces, and downstream execution workflows without losing provenance.