Buildability Receipt Backlinks
Evidence links to receipt scaffolds, not external adoption claims.
Evidence runs can now link directly to buildability receipt scaffolds, so proof context stays aligned across the paper, Signal Canvas, and Build Loop routes while external validation remains explicitly gated.
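As a rough illustration of the backlink itself, a minimal sketch follows; every identifier in it (ReceiptBacklink, runId, receiptScaffoldId, routes, externalValidation) is an assumed name for illustration, not the product's actual schema.

```ts
// Hypothetical shape for an evidence-run to receipt-scaffold backlink.
// All names and values here are assumptions, not the real data model.
type Route = "paper" | "signal-canvas" | "build-loop";

interface ReceiptBacklink {
  runId: string;               // evidence run that selected the receipt
  receiptScaffoldId: string;   // buildability receipt scaffold it points at
  routes: Route[];             // surfaces that should share the same proof context
  externalValidation: "gated"; // external adoption claims stay explicitly gated
}

const backlink: ReceiptBacklink = {
  runId: "run-0001",
  receiptScaffoldId: "receipt-scaffold-0001",
  routes: ["paper", "signal-canvas", "build-loop"],
  externalValidation: "gated",
};
```

One deliberate choice in this sketch is keeping externalValidation as a literal "gated" value, mirroring the copy above: the backlink carries proof context, never an adoption claim.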
Evidence receipt window
Buildability receipt unavailable
evidence-workstation
Subject: Evidence workstation
Verdict
Insufficient data
Evidence has no canonical paper receipt selected until a query, report, or paper handoff selects one.
Time to first demo
Insufficient data
No canonical receipt is available, so demo lead-time cannot be reported.
Compute envelope
Structured compute envelope
Insufficient data
No canonical receipt is available, so compute requirements cannot be reported.
Evidence ids
Evidence ids
Insufficient data
No receipt id, paper id, proof run id, or evidence hash is available.
Freshness
Freshness
Insufficient data
No receipt timestamp or evidence verification timestamp is available.
Hash state
Immutable hash
Insufficient data
No canonical receipt hash is available.
Signature state
External signature
unsigned_external
No founder, registry, pilot, or production-adoption signature is attached to this receipt.
Verification
not_verified
Verification is blocked until an external signature is provided.
Blockers
- Evidence has no canonical paper receipt selected until a query, report, or paper handoff selects one.
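To make the receipt states above concrete, here is a minimal sketch of how the panel's fields could be typed; the field names, optionality, and union values are inferred from the labels above and are assumptions, not a published receipt schema.

```ts
// Minimal sketch of the receipt fields shown in the panel above.
// Field names, optionality, and union values are assumptions.
interface BuildabilityReceipt {
  subject: string;                       // e.g. "Evidence workstation"
  verdict: "insufficient_data" | string; // no canonical receipt selected yet
  timeToFirstDemo?: string;              // demo lead time, when reportable
  computeEnvelope?: string;              // structured compute envelope, when reportable
  evidenceIds?: {
    receiptId?: string;
    paperId?: string;
    proofRunId?: string;
    evidenceHash?: string;
  };
  freshness?: string;                    // receipt or evidence verification timestamp
  immutableHash?: string;                // canonical receipt hash
  signatureState: "unsigned_external" | "signed_external"; // "signed_external" is an assumed counterpart value
  verification: "not_verified" | "verified";
  blockers: string[];
}

// Verification stays blocked until an external signature is attached.
function canVerify(receipt: BuildabilityReceipt): boolean {
  return receipt.signatureState === "signed_external";
}
```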
Reviewable research runs with screening, extraction, consensus, and export-ready reports.
Evidence is the operator workstation for defining a question, screening candidates, inspecting proof, running consensus, extracting structured fields, synthesizing a report, and seeding a workspace with provenance.
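A minimal sketch of that run lifecycle follows; the stage identifiers and record shape are assumptions chosen to mirror the description above, not the workstation's real data model.

```ts
// Hypothetical run stages mirroring the workflow described above; names are assumptions.
type RunStage =
  | "define_question"
  | "screen_candidates"
  | "inspect_proof"
  | "run_consensus"
  | "extract_fields"
  | "synthesize_report"
  | "seed_workspace";

interface EvidenceRun {
  question: string;
  stage: RunStage;      // current step in the run
  provenance: string[]; // citations and ids carried through every stage
}
```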
Preview requested for "Compute concentration and frontier model economics".
Previewing the top Evidence hits for "Compute concentration and frontier model economics".
Evidence
Define a question, screen candidates, inspect evidence, run consensus, extract fields, and synthesize a cited report.
Search results will appear with a streamed summary.
QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals
LLM Evaluation | 2026-04-17
Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
Dissecting AI Trading: Behavioral Finance and Market Bubbles
AI Agents in Finance | 2026-04-20
We study how AI agents form expectations and trade in experimental asset markets. Using a simulated open-call auction populated by autonomous Large Language Model (LLM) agents, we document three main findings. First, AI agents exhibit classic behavioral patterns: a pronounced disposition effect and recency-weighted extrapolative beliefs. Second, these individual-level patterns aggregate into equilibrium dynamics that replicate classic experimental findings (Smith et al., 1988), including the predictive power of excess demand for future prices and the positive relationship between disagreement and trading volume. Third, by analyzing the agents' reasoning text through a twenty-mechanism scoring framework, we show that targeted prompt interventions causally amplify or suppress specific behavioral mechanisms, significantly altering the magnitude of market bubbles.
Safety-Critical Contextual Control via Online Riemannian Optimization with World Models
Safety-Critical Control | 2026-04-21
Modern world models are becoming too complex to admit explicit dynamical descriptions. We study safety-critical contextual control, where a Planner must optimize a task objective using only feasibility samples from a black-box Simulator, conditioned on a context signal $ξ_t$. We develop a sample-based Penalized Predictive Control (PPC) framework grounded in online Riemannian optimization, in which the Simulator compresses the feasibility manifold into a score-based density $\hat{p}(u \mid ξ_t)$ that endows the action space with a Riemannian geometry guiding the Planner's gradient descent. The barrier curvature $κ(ξ_t)$, the minimum curvature of the conditional log-density $-\ln\hat{p}(\cdot \mid ξ_t)$, governs both convergence rate and safety margin, replacing the Lipschitz constant of the unknown dynamics. Our main result is a contextual safety bound showing that the distance from the true feasibility manifold is controlled by the score estimation error and a ratio that depends on $κ(ξ_t)$, both of which improve with richer context. Simulations on a dynamic navigation task confirm that contextual PPC substantially outperforms marginal and frozen density models, with the advantage growing after environment shifts.
Resolving space-sharing conflicts in road user interactions through uncertainty reduction: An active inference-based computational model
Autonomous Driving Behavior Modeling | 2026-04-21
Understanding how road users resolve space-sharing conflicts is important both for traffic safety and the safe deployment of autonomous vehicles. While existing models have captured specific aspects of such interactions (e.g., explicit communication), a theoretically-grounded computational framework has been lacking. In this paper, we extend a previously developed active inference-based driver behavior model to simulate interactive behavior of two agents. Our model captures three complementary mechanisms for uncertainty reduction in interaction: (i) implicit communication via direct behavioral coupling, (ii) reliance on normative expectations (stop signs, priority rules, etc.), and (iii) explicit communication. In a simplified intersection scenario, we show that normative and explicit communication cues can increase the likelihood of a successful conflict resolution. However, this relies on agents acting as expected. In situations where another agent (intentionally or unintentionally) violates normative expectations or communicates misleading information, reliance on these cues may induce collisions. These findings illustrate how active inference can provide a novel framework for modeling road user interactions which is also applicable in other fields.
On The Mathematics of the Natural Physics of Optimization
Optimization Theory | 2026-04-19
A number of optimization algorithms have been inspired by the physics of Newtonian motion. Here, we ask the question: do algorithms themselves obey some "natural laws of motion," and can they be derived by an application of these laws? We explore this question by positing the theory that optimization algorithms may be considered as some manifestation of hidden algorithm primitives that obey certain universal non-Newtonian dynamics. This natural physics of optimization is developed by equating the terminal transversality conditions of an optimal control problem to the generalized Karush/John-Kuhn-Tucker conditions of an optimization problem. Through this equivalence formulation, the data functions of a given constrained optimization problem generate a natural vector field that permeates an entire hidden space with information on the optimality conditions. An "action-at-a-distance" operation via a Pontryagin-type minimum principle produces a local action to deliver a globalized result by way of a Hamilton-Jacobi inequality. An inverse-optimal algorithm is generated by performing control jumps that dissipate quantized "energy" defined by a search Lyapunov function. Illustrative applications of the proposed theory show that a large number of algorithms can be generated and explained in terms of the new mathematical physics of optimization.
Prompt Optimization Enables Stable Algorithmic Collusion in LLM Agents
LLM Agents | 2026-04-20
LLM agents in markets present algorithmic collusion risks. While prior work shows LLM agents reach supracompetitive prices through tacit coordination, existing research focuses on hand-crafted prompts. The emerging paradigm of prompt optimization necessitates new methodologies for understanding autonomous agent behavior. We investigate whether prompt optimization leads to emergent collusive behaviors in market simulations. We propose a meta-learning loop where LLM agents participate in duopoly markets and an LLM meta-optimizer iteratively refines shared strategic guidance. Our experiments reveal that meta-prompt optimization enables agents to discover stable tacit collusion strategies with substantially improved coordination quality compared to baseline agents. These behaviors generalize to held-out test markets, indicating discovery of general coordination principles. Analysis of evolved prompts reveals systematic coordination mechanisms through stable shared strategies. Our findings call for further investigation into AI safety implications in autonomous multi-agent systems.
Information Aggregation with AI Agents
AI Agents | 2026-04-21
Can Large Language Models (AI agents) aggregate dispersed private information through trading and reason about the knowledge of others by observing price movements? We conduct a controlled experiment where AI agents trade in a prediction market after receiving private signals, measuring information aggregation by the log error of the last price. We find that although the median market is effective at aggregating information in the easy information structures, increasing the complexity has a significant and negative impact, suggesting that AI agents may suffer from the same limitations as humans when reasoning about others. Consistent with our theoretical predictions, information aggregation remains unaffected by allowing cheap talk communication, changing the duration of the market or initial price, and strategic prompting, thus demonstrating that prediction markets are robust. We establish that "smarter" AI agents perform better at aggregation and they are more profitable. Surprisingly, giving them feedback about past performance makes them worse at aggregation and reduces their profits.
Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion
LLM Benchmarking | 2026-04-20
We present a systematic evaluation of large language model families, spanning both proprietary cloud APIs and locally-hosted open-source models, on two purpose-built benchmarks for System Dynamics AI assistance: the CLD Leaderboard (53 tests, structured causal loop diagram extraction) and the Discussion Leaderboard (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77-89% overall pass rates; the best local model reaches 77% (Kimi K2.5 GGUF Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50-100% on model building steps and 47-75% on feedback explanation, but only 0-50% on error fixing, a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of model type effects on performance: we compare reasoning vs. instruction-tuned architectures, GGUF (llama.cpp) vs. MLX (mlx_lm) backends, and quantization levels (Q3 / Q4_K_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep (t, p, k) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B-123B parameter models on Apple Silicon.
Ask a follow-up about current results.
AI-classified from paper abstracts
Top evidence for "Compute concentration and frontier model economics" currently leans supportive, led by QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals; Dissecting AI Trading: Behavioral Finance and Market Bubbles; and Safety-Critical Contextual Control via Online Riemannian Optimization with World Models.
Limitations or caveats dominate the visible abstract evidence.
The visible evidence is mixed or incomplete.
Positive performance or applicability signals are visible in the title or abstract.
Limitations or caveats dominate the visible abstract evidence.
The visible evidence is mixed or incomplete.
Positive performance or applicability signals are visible in the title or abstract.
Positive performance or applicability signals are visible in the title or abstract.
The visible evidence is mixed or incomplete.
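The stance labels above suggest a three-way classification per abstract plus an overall lean for the query. The sketch below shows one hypothetical way to tally such labels; the label names and the majority rule are assumptions for illustration, not the product's actual classifier.

```ts
// Hypothetical aggregation of per-abstract stance labels.
// The three labels and the tallying rule are assumptions.
type Stance = "supportive" | "mixed" | "caveats";

function overallLean(stances: Stance[]): Stance {
  const counts = { supportive: 0, mixed: 0, caveats: 0 };
  for (const s of stances) counts[s] += 1;
  if (counts.supportive > counts.caveats && counts.supportive >= counts.mixed) return "supportive";
  if (counts.caveats > counts.supportive && counts.caveats >= counts.mixed) return "caveats";
  return "mixed";
}

// With the eight labels shown above (three supportive, three mixed, two caveat-heavy),
// this rule would report an overall supportive lean, matching the summary line.
const lean = overallLean([
  "caveats", "mixed", "supportive", "caveats",
  "mixed", "supportive", "supportive", "mixed",
]);
```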
Build With These Results
Copy prompts into your favorite AI coding tool to start building.
Lightweight coding agent in your terminal.
Agentic coding tool for terminal workflows.
AI agent mindset installer and workflow scaffolder.
AI-first code editor built on VS Code.
Free, open-source editor by Microsoft.
People Also Ask
Evidence questions
What is the ScienceToStartup evidence surface?
It is the reviewable evidence workstation for search, screening, extraction, consensus, and export-ready reports with provenance-aware outputs.
How is Evidence different from the Daily Dashboard?
The Daily Dashboard is the live operator surface. Evidence is the deeper workstation for explicit runs, cited outputs, and exportable report artifacts.
Can Evidence feed the rest of the product?
Yes. Evidence runs can seed proof surfaces, Signal Canvas, workspaces, and downstream execution workflows without losing provenance.