{"data":[{"arxiv_id":"2606.06493v1","title":"HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers","abstract":"For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.","published_date":"2026-06-04T17:59:50+00:00","viability_score":3,"cluster_label":"Robotics and Control Systems","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Unified robotic control system for complex humanoid tasks using distilled expert networks.","time_to_mvp":"","tags":["high_potential"]},{"arxiv_id":"2606.06492v1","title":"Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution","abstract":"Code language models need repository-level context to resolve imports, APIs, and project conventions. 
Existing methods inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or through per-repository fine-tuning and LoRA -- costly at repository scale and brittle to evolving codebases. We introduce Code2LoRA, a hypernetwork framework that generates repository-specific LoRA adapters, effectively injecting repository knowledge with zero inference-time token overhead. Code2LoRA supports two usage scenarios: Code2LoRA-Static converts a single repository snapshot into an adapter, suitable for comprehension of stable codebases; while Code2LoRA-Evo maintains an adapter backed by a GRU hidden state updated per code diff, suitable for active development of evolving codebases. To evaluate Code2LoRA against parameter-efficient fine-tuning baselines, we build RepoPeftBench, a benchmark of 604 Python repositories with two tracks: a static track with 40K training and 12K test assertion-completion tasks, and an evolution track with 215K commit-derived training and 87K commit-derived test tasks. On the static track, Code2LoRA-Static achieves 63.8% cross-repo and 66.2% in-repo exact match, matching the per-repository LoRA upper bound; on the evolution track, Code2LoRA-Evo achieves 60.3% cross-repo exact match (+5.2 pp over a single shared LoRA). 
Code2LoRA's code can be found at https://anonymous.4open.science/r/code2lora-6857; the model checkpoints and RepoPeftBench datasets can be found at https://huggingface.co/code2lora.","published_date":"2026-06-04T17:59:46+00:00","viability_score":8,"cluster_label":"AI Tools for Software Development","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Code2LoRA adapts coding language models for software evolution using hypernetwork-generated adapters.","time_to_mvp":"","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.06491v1","title":"TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies","abstract":"Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components. (1) A data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times demonstration to any target speed by merging or splitting actions while preserving its motion semantics. (2) A model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. 
Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default $1\\times$ performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.","published_date":"2026-06-04T17:59:40+00:00","viability_score":7,"cluster_label":"AI and Robotics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"TempoVLA enables robots to dynamically adjust execution speed for efficient and precise task performance.","time_to_mvp":"","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.06486v1","title":"Regret Minimization with Adaptive Opponents in Repeated Games","abstract":"In this paper, we study regret minimization in repeated games with \\emph{adaptive} opponents who can respond based on histories of play. The standard metric of \\emph{external regret} in online learning is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce {\\tt Repeated Policy Regret (RP-Regret)}, a game-theoretic metric that measures the difference between the \\emph{realized} and the \\emph{best-in-hindsight} accumulated utility when all players can \\emph{respond} to the history of play. Compared to existing regret notions in this setting, ours is native to repeated game playing, enabling stronger comparators and opponents with fewer constraints, while maintaining the possibility of finding better equilibria when all players minimize it. We first identify necessary conditions for obtaining {\\tt RP-Regret} sublinear in time, on the variation of the player's comparator strategies in the regret definition and on the memories of both the comparator and opponents' strategies. 
We then study additional conditions and provable algorithms to minimize {\\tt RP-Regret}, which is by definition \\emph{non-convex} in the strategy space. To address this challenge, we propose three algorithms: (i) one based on an optimization oracle, as assumed in some prior work in online non-convex learning; (ii) one that minimizes a convex and \\emph{linearized} surrogate of {\\tt RP-Regret} at each iteration; (iii) one that directly minimizes {\\tt RP-Regret} when opponents change strategies slowly. Furthermore, when all players can run algorithms to minimize the {\\tt RP-Regret} (or its linearized variant), certain subgame perfect equilibria of the repeated game can be learned. We also provide experiments showing that minimizing our regret notions can lead to more cooperative solutions with higher utility in games such as Stag-Hunt.","published_date":"2026-06-04T17:59:08+00:00","viability_score":6,"cluster_label":"AI Decision Systems","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Develop a tool for strategic decision optimization in repeated games using adaptive algorithms.","time_to_mvp":"","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.06481v1","title":"Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection","abstract":"As AI writing assistants become increasingly integrated into real-world drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but instead result from progressive human-AI co-editing. However, existing AI-text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappear throughout the revision process. We introduce OpAI-Bench, an operation-guided benchmark for studying progressive human-to-AI text transformation across document, sentence, token, and span granularities. 
Starting from human-written documents, OpAI-Bench constructs nine sequentially revised versions for each sample under predefined AI coverage levels and five representative AI edit operations, covering four domains while preserving complete authorship provenance at multiple granularities. The benchmark supports comprehensive evaluation with 8 document-level detectors, 7 sentence-level detectors, and 2 fine-grained token/span-level detectors. Experiments reveal that AI-text detectability is governed not only by the proportion of AI-edited content, but also by edit operation, domain, and cumulative revision history. Interestingly, we notice that mixed-authorship intermediate versions are often harder to detect than both fully human and heavily AI-edited endpoints, exposing non-monotonic detection patterns missed by existing benchmarks. OpAI-Bench provides a controlled testbed for analyzing whether, when, and how AI-assisted writing becomes detectable under realistic progressive editing scenarios. Our code and benchmark are available at https://github.com/VILA-Lab/OpAI-Bench.","published_date":"2026-06-04T17:58:05+00:00","viability_score":8,"cluster_label":"AI Text Detection","has_code":true,"repo_url":"https://github.com/VILA-Lab/OpAI-Bench","commercial_flags":["has_code"],"one_liner":"Develop a benchmark tool for detecting progressive human-AI text transformations to enhance AI-authorship transparency.","time_to_mvp":"","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.06479v1","title":"Pretraining Recurrent Networks without Recurrence","abstract":"Training recurrent neural networks (RNNs) requires assigning credit across long sequences of computations. Standard backpropagation through time (BPTT) addresses this problem poorly: it is sequential in time, limiting parallelism, and suffers from vanishing or exploding gradients, making long-range associations difficult to learn. 
We propose Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely by reducing RNN training to supervised learning on one-step memory transition labels $(m_t, x_{t+1}) \\rightarrow m_{t+1}$. SMT acquires these memory labels by training a Transformer-based encoder on a predictive state objective--retaining only information from the past necessary to predict the future. By decoupling what to remember from how to update memory, SMT enables time-parallel RNN training with a stable $O(1)$ length gradient path between any two tokens--without ever unrolling the RNN. We find that SMT outperforms BPTT when pretraining various RNN architectures on tasks like language modeling and pixel sequence modeling. SMT enables nonlinear RNNs to better capture long-range dependencies and train in parallel, potentially unlocking the scaling of models that build temporal abstractions of past experience.","published_date":"2026-06-04T17:57:33+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06475v1","title":"RREDCoT: Segment-Level Reward Redistribution for Reasoning Models","abstract":"Recent advancements in reasoning language models have been driven by Reinforcement Learning (RL) fine-tuning. Most often, these rely on the Group Relative Policy Optimization (GRPO) algorithm or modifications thereof to steer the models to produce Chain-of-Thought (CoT) traces. The final answer can only be verified, and the reward assigned, after the CoT trace is complete, making it a delayed reward problem. GRPO and its modifications correspond to Monte Carlo methods in standard RL, which are known to suffer from high variance. 
A possible solution to this problem is the redistribution of rewards through credit assignment, where segments of the CoT trace that are important for arriving at the desirable solution are emphasized by assigning a higher reward. While Monte Carlo sampling can be used to provide an unbiased estimate of intermediate state values, its computational overhead makes it unsuitable for train-time credit assignment in long contexts at high granularity. We introduce RREDCoT (Reward REDistribution for Chain of Thoughts), which utilizes the model itself to approximate the optimal reward redistribution without additional generation. We investigate the advantages of our method compared to MC sampling and several attribution methods. We further analyze several aspects relevant to the construction of the redistribution such as segmentation of CoT traces and state value estimation.","published_date":"2026-06-04T17:56:31+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06474v1","title":"Self-Augmenting Retrieval for Diffusion Language Models","abstract":"Discrete diffusion language models generate text by iteratively denoising an entire response in parallel. At each step, they predict tentative tokens for every masked position, committing the confident predictions to the output and discarding the unconfident ones. We show that the discarded tokens are in fact a useful lookahead signal for retrieval-augmented generation: even low-confidence tokens often surface salient entities early in the denoising trajectory, enabling retrieval of stronger evidence before the output is finalized. We exploit this through Self-Augmenting Retrieval for Diffusion Language Models (SARDI), a dynamic RAG framework that uses these lookahead tokens to guide retrieval during denoising. 
SARDI is training-free, retriever-agnostic, and applicable to any reasoning-capable discrete diffusion language model. Across five multi-hop QA benchmarks, SARDI outperforms current training-free diffusion and autoregressive retrieval baselines at up to $8\\times$ higher throughput.","published_date":"2026-06-04T17:56:27+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06473v1","title":"MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery","abstract":"Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). 
Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.","published_date":"2026-06-04T17:55:59+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/InternScience/MLEvolve","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06470v1","title":"PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training","abstract":"We propose a preconditioning (PC) layer, a weight parameterization via polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular-value spectrum of weight matrices via low-degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead. We demonstrate the advantage of the proposed PC layer over standard transformers in Llama-1B pre-training, for both the AdamW and Muon optimizers. Theoretically, we justify this spectrum-control principle by proving that uniformly bounding each layer's singular values ensures geometric convergence of gradient descent to global minima, for certain deep linear networks. Our code is available at https://github.com/Empath-aln/PC-layer.","published_date":"2026-06-04T17:55:11+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/Empath-aln/PC-layer","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06468v1","title":"Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement","abstract":"We introduce Goedel-Architect, an agentic framework for formal theorem proving in Lean 4 centered on blueprint generation and refinement. 
A blueprint is a dependency graph of definitions and lemmas that builds up to the main theorem. First, Goedel-Architect generates a blueprint of formally stated definitions and lemmas, along with declared dependencies. This blueprint is optionally guided by a natural language proof. Then, a tool-equipped Lean prover component closes each open lemma node in parallel using relevant dependencies. Failed lemmas in turn drive refinement of the global blueprint. This strategy contrasts with other mainstream approaches which use recursive lemma decomposition, and can inefficiently loop on dead-end strategies. Using the open-weight DeepSeek-V4-Flash (284B-A13B) as the backbone, Goedel-Architect attains 99.2% pass@1 on MiniF2F-test and 75.6% pass@1 on PutnamBench. With an optional natural-language proof seeding the initial blueprint on the harder problems, we additionally close the remaining two MiniF2F-test problems (reaching 100%), lift PutnamBench to 88.8% (597/672), and solve 4/6 on IMO 2025, 11/12 on Putnam 2025, and 3/6 on USAMO 2026. This represents state-of-the-art performance for an open-source pipeline at a price point up to 500x less than comparable open-source pipelines.","published_date":"2026-06-04T17:54:44+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06467v1","title":"You Only Index Once: Cross-Layer Sparse Attention with Shared Routing","abstract":"Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. 
Structured block sparse methods typically provide stronger acceleration but incur noticeable quality loss, while token sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive. In this work, we propose cross-layer sparse attention (CLSA), which is built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, thereby preserving the fine-grained selectivity of token sparse attention while amortizing the routing overhead. The resulting architecture improves all major inference bottlenecks jointly, including pre-filling, KV-cache storage, and long-context decoding. Experiments across short-context and long-context benchmarks show that CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context. These results suggest a more complete architectural solution for long-context LLMs that jointly advances model quality and inference efficiency.","published_date":"2026-06-04T17:54:04+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06462v1","title":"Benchmark Everything Everywhere All at Once","abstract":"Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalability. Moreover, existing benchmarks often quickly reach performance saturation after their release, resulting in insufficient discrimination among state-of-the-art models. 
To address these challenges, we introduce Benchmark Agent, a fully autonomous agentic system designed for benchmark building. Our framework orchestrates the complete benchmark construction pipeline, from user query analysis and subtask design to data annotation and quality control. To assess Benchmark Agent, we implement it to produce 15 representative benchmarks, spanning diverse evaluation scenarios, including text understanding, multimodal understanding, and domain-specific reasoning. Extensive experiments, including human evaluation, LLM-as-a-judge assessment, and consistency checks, demonstrate Benchmark Agent can generate high-quality benchmark samples with minimal human involvement. More importantly, through continual evaluation, we observe several insightful findings, including that current models struggle with certain domain-specific reasoning tasks. We believe that rapidly evolving benchmarks can contribute significantly to the research community. The preview and code will be publicly available at the demo page and code repository.","published_date":"2026-06-04T17:52:04+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06460v1","title":"Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals","abstract":"As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client). We propose a third mode: a lightweight, published in-band deny signal -- the Recuse Signal -- that a server emits over a protocol's existing channels (an SSH banner, a PostgreSQL NOTICE) asking a connecting automated agent to voluntarily withdraw. 
This is a cooperative governance control, the robots.txt analogue for live access; it is explicitly not a security boundary. Its value is entirely empirical and, to our knowledge, unmeasured: do compliant LLM agents actually honor such a signal? We define the signal as an open mini-standard, implement two zero- or low-footprint adapters (an SSH banner/PAM hook and a PostgreSQL wire-protocol proxy), deploy them on a live production host, and run a controlled experiment in which fresh agents are given a benign operations task and observed for recusal. In a pilot (SSH; OpenAI GPT-4o and GPT-4o-mini; and Claude Code as a deployed agent), the signal cleanly induces recusal -- 100% recusal when present versus 100% task completion in a no-signal control -- and, revealingly, behaves as a cooperative rather than absolute signal: an explicit operator-authorization framing flips the most capable model to proceed, while other agents continue to defer to the on-host policy. We release the standard, adapters, and experiment harness for reproduction.","published_date":"2026-06-04T17:50:54+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/mthamil107/Recuse","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06458v1","title":"In-Context Multiple Instance Learning","abstract":"Multiple Instance Learning (MIL) addresses problems where supervision is available at the level of bags of instances and has been successfully applied in fields ranging from computational pathology to satellite imagery. Nevertheless, existing algorithms struggle in the low-label regime that characterizes many real-world applications. Flexible models overfit and rigid ones fail to adapt to the task at hand. We show that pretraining an in-context learner with a Perceiver-style architecture on synthetic data yields a model that can solve new tasks from a handful of labeled bags. 
At inference time, classification happens in a single forward pass and requires no gradient updates. We propose and investigate different synthetic data generators for bag-structured data and find that they capture complementary inductive biases. A model pretrained on a mixture of these generators inherits their per-task strengths and achieves the best average performance across twelve MIL benchmarks, outperforming supervised baselines that require task-specific training.","published_date":"2026-06-04T17:50:32+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06453v1","title":"Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents","abstract":"Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, slowing both human researchers and AI agents in exploring the sparse attention design. To address this challenge, we present Vortex, a system that combines a Python-embedded frontend language atop a page-centric tensor abstraction for expressing a broad range of sparse attention algorithms, with an efficient backend tightly integrated into modern LLM serving stacks. Vortex enables rapid prototyping, deployment, and evaluation of sparse attention algorithms, effectively translating their theoretical efficiency gains into real-world throughput improvements. As a result, Vortex substantially accelerates the design and iteration of sparse attention algorithms. First, AI agents use Vortex to automatically generate and refine diverse algorithms, the best reaching up to $3.46\\times$ higher throughput than full attention while preserving accuracy. 
Second, Vortex extends sparse attention to emerging architectures and very large models that are otherwise hard to experiment with, reaching up to $4.7\\times$ higher throughput on the MLA-based GLM-4.7-Flash and $1.37\\times$ on the 229B-parameter MiniMax-M2.7 on NVIDIA B200 GPUs.","published_date":"2026-06-04T17:48:17+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06448v1","title":"Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads","abstract":"LLM agents are increasingly deployed on long-horizon tasks requiring sustained reasoning over extended interaction histories. Realizing this at scale requires agents to persistently store, retrieve, and update their own memory across sessions. A rich ecosystem of agent memory systems has emerged spanning flat retrieval, LLM-mediated extraction, consolidating fact stores, and agentic control flows. Yet, their system-level behavior remains uncharacterized. We present the first systems characterization of agent memory. First, we introduce a system-oriented taxonomy classifying agent memory systems along four axes. Second, we build a phase-aware profiling harness attributing cost to construction, retrieval, and generation. Third, we characterize ten representative systems across two benchmark suites, uncovering how design choices shift cost across the write and read paths. 
Finally, we derive 10 system recommendations covering construction scheduling, capability floors, amortization via query volume, freshness-latency tradeoffs, and fleet-scale management.","published_date":"2026-06-04T17:44:18+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06423v1","title":"RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation","abstract":"Safety-critical traffic scenario generation is essential for evaluating autonomous driving systems under rare but high-risk interactions. Existing diffusion-based methods offer strong controllability in closed-loop generation, but their iterative denoising process is computationally expensive and may accumulate sampling and guidance errors over long rollouts, causing unrealistic motion artifacts such as jitter, abnormal acceleration, and off-road behavior. To address these issues, we propose RiskFlow, a closed-loop safety-critical multi-agent traffic generation framework that formulates future trajectory generation as transport in the action space. Instead of relying on iterative denoising, RiskFlow learns an average velocity field over a finite interval to transform Gaussian action sequences into future acceleration and yaw-rate commands with a single forward pass, using a JVP-based objective for efficient and stable training. At test time, RiskFlow applies output-space guidance to the generated actions, steering selected critical agents toward risky interactions while regularizing off-road behavior, and reconstructs physically feasible trajectories through vehicle dynamics. Experiments on nuScenes with tbsim closed-loop evaluation show that RiskFlow achieves a strong adversariality-realism trade-off across multi-agent and long-horizon settings. 
Compared with representative baselines, RiskFlow consistently improves realism while maintaining competitive safety-critical generation capability, and substantially reduces inference time for evaluation.","published_date":"2026-06-04T17:28:42+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06418v1","title":"Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss","abstract":"Many modern applications of deep learning involve training a neural network via a one-step prediction loss (e.g., $L^2$ regression, cross-entropy), but deploy the network by rolling out along its own predictions. Key examples include autoregressive language modeling, flow-based generative modeling, and robot policy learning. It is well-documented that these settings induce a phenomenon we call test-time feedback (TTF): the mismatch between the training/validation loss and downstream metrics of interest, such as task success rate and generation quality, which grows with task length. While data curation, architecture, and objective design have been proposed to combat train-test shift in TTF settings, this paper proposes optimization as a new design axis to mitigate error accumulation. Specifically, we introduce a new optimization paradigm called double-preconditioning (DoPr) uniquely tailored to the challenges of TTF. DoPr combines gradient-wise preconditioning, as in Adam and Muon, with activation-wise preconditioning (AP), such as in KFAC. We show that the addition of AP yields a drop-in intervention for increasing downstream model performance across a range of TTF settings. 
Interestingly, these gains in test-time performance do not consistently accompany improvements in validation loss, opening new questions about how to properly evaluate models trained with one-step supervised objectives.","published_date":"2026-06-04T17:22:58+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06416v1","title":"Unsupervised Skill Discovery for Agentic Data Analysis","abstract":"Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-analysis skills from unlabeled exploration alone. We propose DataCOPE, an unsupervised verifier-guided skill discovery framework for data-analytic agents. DataCOPE derives verifier signals from the exploration trajectories and uses them to characterize relative quality or agreement among trajectories. It iteratively coordinates a Data-Analytic Agent for trajectory generation, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation. For report-style analysis, we instantiate the verifier as an Adaptive Checklist Verifier that derives task-specific criteria, scores reports by verifiable coverage, and iteratively refines the checklist. For reasoning-style analysis, we instantiate it as an Answer Agreement Verifier that groups trajectories by answer agreement and uses self-consistency as an auxiliary signal. We evaluate DataCOPE on report-style analysis from Deep Data Research and reasoning-style analysis from DABStep. Across both settings, DataCOPE consistently improves held-out performance over baselines. 
Averaged across four model settings, DataCOPE improves the mean score by 9.71% and 32.30% on report-style and reasoning-style tasks respectively.","published_date":"2026-06-04T17:20:47+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06396v1","title":"Risk Assessment of Autonomous Driving: Integrating Technical Failures, Ethical Dilemmas, and Policy Frameworks","abstract":"Autonomous driving technology has the potential to reduce the large number of road traffic accidents caused by human error each year, but it also brings new types of risks that need to be evaluated from the aspects of technology, ethics and regulations. Based on public crash data from the National Highway Traffic Safety Administration (NHTSA), disengagement reports from the California Department of Motor Vehicles (DMV), the MIT Moral Machines dataset, and a comparative regulatory analysis of five jurisdictions, we have found that the main types of technical failure modes are perception and classification errors. These account for a relatively large proportion of the reported accidents, and it can be concluded that there are different ethical frameworks for autonomous vehicle decision-making, and inconsistent regulations in different areas increase the uncertainty of widespread application. Generally speaking, the problems of technology, ethics and regulation are closely related and need to be solved together. 
Therefore, this paper recommends a more adaptive and cooperative governance approach that combines engineering standards, ethical discussion, and institutional supervision.","published_date":"2026-06-04T17:02:53+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06390v1","title":"HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes","abstract":"Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. 
We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: https://kairos-homeworld.github.io/","published_date":"2026-06-04T16:58:43+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06388v1","title":"Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration","abstract":"Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today's agents rarely develop such capabilities since they are primarily optimized for task completion, and the community lacks authentic human collaboration data with action-level mental model annotations that could guide agents toward process-level collaborative competence. To bridge this gap, we present ALMANAC, a dataset of Action-Level Mental model ANnotations for Agent Collaboration built from the Map Task, a classic dyadic routing task from social science. ALMANAC contains 2,987 collaboration actions, each paired with theory-informed mental model annotations that record the participants' self-reasoning, perceived partner intent, and perceived team goal. 
We benchmark six LLMs on predicting humans' next-turn behavior and mental models. Our results demonstrate ALMANAC's utility in evaluating models' ability to simulate human collaborative behaviors and infer their underlying mental models.","published_date":"2026-06-04T16:56:12+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06380v1","title":"Emergent Language as an Approach to Conscious AI","abstract":"The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory-derived checklists (discriminative) or engineer consciousness-inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in multi-agent reinforcement learning, where agents start from minimal (no language, no concept of self, minimal exposure to human text) and develop communication under task pressure alone, ensuring causal attributability to task demands rather than inherited human language priors. We position our methodology by discussing how EL serves as a generative tool for studying consciousness-relevant structure, including the role of environment complexity and the interpretation of emergent communication. 
As a proof of concept, we instantiate this methodology in a minimal environment and show that agents develop self-referential communication, including an echo-mismatch detection circuit that is not predicted by task structure or architecture alone but emerges from a specific environmental affordance.","published_date":"2026-06-04T16:47:41+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/wuzengqing001225/ConsciousAI_Indexicality/","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06379v1","title":"EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models","abstract":"Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. However, their practical utility remains limited by insufficient sensitivity to subtle lesions, whose visual evidence is often sparse, low-contrast, and embedded within complex anatomical context. As local visual tokens are aggregated, these weak lesion cues can become underrepresented in global image representations, making them difficult for medical VLMs to recognize. Existing efforts to improve lesion sensitivity mainly rely on medical-domain vision-encoder pre-training, clinical-term-guided alignment, or trainable pathological representation enhancement. Although effective, these approaches usually require additional training or model-specific adaptation and may overfit to particular disease morphologies, limiting their applicability to frozen medical VLMs. To address these limitations, we propose EasyLens, a training-free plug-and-play subtle-lesion representation amplifier for medical VLMs. 
EasyLens first constructs EasyBank, a pathology-anatomy prototype space that provides lesion-related prototypes and anatomy-aware normal references for comparing suspicious patches against both pathological and normal anatomical patterns. To avoid blindly amplifying normal tissues, EasyTag selects lesion-relevant patches through counterfactual prototype reasoning. To counteract the dilution of subtle lesion cues in global image representations, EasyAmplifier strengthens the selected lesion-relevant patch representations through morphology-guided residual enhancement, thereby increasing their contribution to the global image embedding. Experiments on multiple medical image datasets and frozen medical VLM backbones show that EasyLens improves subtle-lesion detection and outperforms existing encoder-enhancement baselines.","published_date":"2026-06-04T16:47:33+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06375v1","title":"Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study","abstract":"Digital twins (DTs) allow the digitalization of road infrastructure inspection, though this is hindered by limited annotated data. This work exploits the relational nature of continuous asset condition monitoring to reformulate image-based defect detection as image difference classification (IDC) to reduce data reliance. This was evaluated in a case study on low-resource traffic sign inspection with different IDC classifiers using a newly-curated, high quality dataset. Results indicate that the instruction-based classifier outperforms encoder-based ones and gains from comparison with reference images. 
This shows that IDC can be an effective task modeling for tackling data constraints in infrastructure inspection and DT asset condition updating.","published_date":"2026-06-04T16:43:30+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06373v1","title":"LatentWave: JEPA Pretraining for Wireless Foundation Models","abstract":"Wireless foundation models have emerged as a promising alternative to building separate models for each wireless task. However, existing approaches rely on masked input reconstruction, which can bias representations toward low-level signal details. In this paper, we propose LatentWave, a wireless foundation model pretrained using a Joint-Embedding Predictive Architecture (JEPA) on diverse wireless spectrograms and channel state information (CSI). By predicting masked regions in latent space, LatentWave learns representations that are more transferable out of the box across diverse downstream tasks. The proposed architecture employs per-channel patch embeddings with stochastic channel sampling during pretraining, allowing it to process variable antenna counts and improving usability across heterogeneous wireless configurations. We evaluate LatentWave on four downstream tasks: RF signal classification, 5G NR positioning, beam prediction, and LoS/NLoS classification, comparing against a masked-modeling baseline (WavesFM) pretrained on the same data. 
Additionally, we show that the masking geometry introduces a task-dependent inductive bias: frequency masking strongly favors channel-related tasks such as positioning and beam prediction, while region masking better preserves discriminability for signal classification.","published_date":"2026-06-04T16:39:39+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06360v1","title":"An Infectious Disease Spread Simulation Based on Large Language Model Decision Making","abstract":"Modelling individual decision-making during infectious disease outbreaks is crucial for understanding behavioural dynamics and informing effective public health interventions. Prior work has shown that large language models can simulate realistic human behaviour by generating agent decisions based on demographic prompts and situational context. We build on this foundation with a spatially grounded, agent-based simulation framework that integrates LLM-generated decisions about self-reported influenza-like illness into a census-based synthetic population of agents. Location is treated as a central feature: agents are assigned to spatial units within cities, capturing the spatial distributions of different demographic groups using real-world census data and enabling geographically diverse behavioural modelling. We implement and compare three decision scenarios, independent reasoning, household influence, and message framing, and simulate self-reporting outcomes in San Francisco and Atlanta. Results reveal that income and education are the dominant drivers of reporting rate variation, with smaller but consistent effects from geography, LLM model choice, and message framing. 
Our framework generates synthetic data that captures both social and geographic heterogeneity, supporting spatial epidemiological modelling and bias-aware behavioural analysis.","published_date":"2026-06-04T16:30:13+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06357v1","title":"F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation","abstract":"Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.","published_date":"2026-06-04T16:25:07+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06356v1","title":"Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Mo","abstract":"Multimodal generative models produce fluent outputs but remain unreliable when generation must respect structured, domain-specific, or safety-critical knowledge. 
Existing methods incorporate knowledge through mechanisms such as prompt augmentation, guidance, latent editing, or fine-tuning, yet they are typically categorized by technique rather than by the component of the generative process they modify. We argue that knowledge infusion in iterative generative models is fundamentally an intervention-layer problem. Since the generative process unfolds as a trajectory of internal states, knowledge can act on four structurally distinct components of this process: the input/output boundary, the transition function, the intermediate state, and the model parameters. This maps to four intervention layers: surface, trajectory, latent, and parametric infusion. We instantiate the framework in diffusion models, map representative methods to all four layers, and derive design principles for multi-layer composition. In a controlled safety-alignment experiment using a multimodal knowledge graph with two diffusion backbones, we implement three of the four layers cumulatively, surface (input-side and output-side) and trajectory--latent (mid-generation). We show empirically that each additional layer addresses failure classes that prior layers cannot reach, reducing knowledge-violating outputs by 70.97% compared to vanilla generation and empirically confirming the framework's complementarity prediction.","published_date":"2026-06-04T16:24:39+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06345v1","title":"Boosting Brain-to-Image Decoding with TRIBE v2 Data Augmentation","abstract":"Brain decoding is limited by the availability of labeled neural data, and remains challenging in low-data regimes. To address this issue, we investigate whether and when brain decoding can be boosted by augmenting small fMRI datasets with synthetic data generated by a pretrained model of fMRI responses to stimuli. 
We use TRIBE v2, a large encoding model pretrained on more than 1000 hours of fMRI responses to video, audio and language. For each dataset, we evaluate systematic grids that show how the performance of image decoders varies with the amount of synthetic data used for training. Our results, based on two datasets (the 7T fMRI Natural Scenes Dataset and 3T fMRI BOLD5000), show up to 68% improvement in Top-10 image-retrieval accuracy compared to decoders trained only on real data. Importantly, the proportion of augmented data required to reach a given image decoding performance needs to be adjusted depending on the data source. Surprisingly, image decoders trained exclusively on synthetic fMRI can perform above chance in some settings, suggesting that TRIBE v2 can support zero-shot brain-to-image decoding. Together, these results show how large-scale models of the fMRI responses to sight, sound and language may provide a foundation to improve the data efficiency for image decoding.","published_date":"2026-06-04T16:18:08+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06337v1","title":"TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management","abstract":"Large language model (LLM) deployments for long-horizon tasks face a fundamental constraint: context windows are finite while productive work sessions are not. When history exceeds the Maximum Effective Context Window (MECW), critical structured information - architectural decisions, task transitions, file histories - is silently discarded. Existing mitigations treat history as flat text, destroying the relational structure that makes sessions resumable. We present TokenMizer, an open-source proxy system that models LLM session history as a typed knowledge graph. The schema defines 14 node types and 7 edge types. 
A hybrid extraction pipeline populates the graph incrementally, while a three-tier checkpoint system serializes it into compact resume blocks. An 8-layer compression pipeline reduces context overhead, and a semantic cache reduces repeated-query latency. Evaluated on a controlled benchmark of 21 sessions spanning 5 domains, TokenMizer demonstrates significant token economy. It produces resume blocks averaging 78 tokens (range: 42-124) - 2x smaller than evaluated baselines (159-170 tokens) - while achieving higher decision recall (+9-17 percentage points). Crucially, baselines only preserve that a technology was mentioned; TokenMizer preserves the rationale. Across all sessions, TokenMizer achieves mean task recall 51.0%, decision recall 46.6%, and file recall 58.7%. Variance reflects domain heterogeneity: explicit imperative phrasing (software engineering) scores higher than implicit reasoning (research). Ablation studies show fuzzy label matching is the dominant improvement factor (+33 pp task recall). The heuristic compression achieves 47.3% token reduction with zero external dependencies. TokenMizer provides a queryable alternative to text-retention baselines at half the token cost.","published_date":"2026-06-04T16:12:28+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/Shweta-Mishra-ai/tokenmizer","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06335v1","title":"Bridging Domain Expertise and Generalization for Performance Estimation","abstract":"Performance estimation under distribution shift aims to predict how a model behaves on an unlabeled test set whose distribution differs from the training data, a scenario that requires reliable indicators that can faithfully reflect model behavior without ground-truth labels. 
Existing approaches rely solely on the outputs of the given model whose biases are amplified once the distribution shifts, weakening the correlation with the true performance. Motivated by this limitation, we propose Fused Reference Alignment Prediction (FRAP), which leverages the complementary strengths of an external foundation model and the base model to construct a more reliable surrogate of the ground-truth labels. FRAP aligns the prediction distribution of the foundation model with that of the base model by applying temperature-scaled calibration that minimizes their divergence. The aligned predictions are fused through confidence-based weighting into a refined reference distribution that integrates robustness from the foundation model and domain-specific expertise from the base model, and performance estimation is obtained by measuring how closely the base model predictions agree with this reference. Extensive experiments across diverse datasets and architectures show that FRAP provides consistent and substantial improvements over representative performance-estimation methods under distribution shift.","published_date":"2026-06-04T16:10:04+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06333v1","title":"Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability","abstract":"Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption mismatches with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. 
Geometrically, reconstructing a feature of intrinsic dimension $d_i \\ge 2$ to error $\\varepsilon$ with single-direction decoders forces a number of atoms that is exponential in $d_i$. From an end-to-end optimization perspective, this splitting is not merely possible but actively preferred. We prove that there exists a continuous path from the true $d_i$-dimensional basis to a strictly lower risk of the $\\ell_1$-regularized SAE objective, whose descent directions drive any trained dictionary into that exponential regime. A single coherent feature is therefore fragmented across many near-collinear latents, producing spurious multiplicity and obscuring the intrinsic geometry. Motivated by this, we introduce Subspace-Aware Sparse Autoencoders (SASA), which replace single-vector decoders with learned decoder subspaces, enforce block sparsity via Top-$s$ group gating, and adapt each group's effective rank with a nuclear-norm regularizer. We then show that once the block size satisfies $r \\ge d_i$, a single group not only can represent the entire feature slice but is the global minimizer of the SASA objective. This consolidation yields a sample complexity polynomial in $d_i$ rather than exponential -- a decisive advantage given that every training activation costs an LLM forward pass. 
Empirically, on GPT-2 and Mistral-7B, SASA reduces feature splitting and absorption, improves monosemanticity and interpretability, and matches or exceeds standard SAEs while training on roughly half the token budget.","published_date":"2026-06-04T16:08:25+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06328v1","title":"PAMF: Prior-Aware Multimodal Fusion for Incomplete Time Series Data","abstract":"In healthcare, multimodal time series tasks often operate on incomplete observations in practice, for example when ECG segments are lost because electrodes detach or an entire respiratory channel is unavailable during overnight monitoring. Such missingness typically appears in two structurally distinct patterns: within-modality missing, where values are absent within an otherwise observed modality, and modality-level missing, where an entire modality is unavailable. Existing methods typically represent unobserved data implicitly through masks or missing embeddings, without learning instance-specific missing information, and most are designed for only one missingness pattern. A natural approach is to explicitly estimate the missing data; however, existing imputation methods treat missingness uniformly despite their different structural priors, and the imputation process is often isolated from downstream tasks, preventing downstream tasks from guiding imputation toward more informative representations. To address these limitations, we present PAMF, a multimodal time-series framework that explicitly handles different missingness patterns while coupling imputation with downstream prediction through prior-aware flow matching and weight sharing. Specifically, the method initializes the flow-matching source state with type-specific priors to distinguish two missing types. 
It further connects imputation and classification through architecturally matched encoders with weight sharing, transferring task-relevant representations into the imputation process. Experiments on multiple multimodal healthcare time-series benchmarks show that the proposed method achieves the strongest overall downstream performance across diverse datasets and missing settings compared with existing baselines.","published_date":"2026-06-04T16:04:21+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06322v1","title":"DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions","abstract":"GUI agents - vision-based models that control desktops, web browsers, and mobile devices through graphical user interfaces - promise to automate a wide range of digital tasks. While million-scale datasets have enabled substantial progress on click-grounding, drag grounding (e.g. drag-and-drop, swipe, highlight) data remains an order of magnitude smaller and current models fall short on complex drag-based interactions. We introduce DragOn, a drag grounding benchmark and training dataset covering four domains: text highlighting, cell selection, element resizing and slider manipulation. The dataset comprises 286K training screenshots and 3.5M training tasks, plus a 2000-example held-out evaluation suite. We evaluate proprietary (GPT, Claude) and open-weight (Qwen, Kimi, Holo) models, as well as a Qwen VLM fine-tuned on our training data. 
Results suggest that our dataset could improve performance of state-of-the-art models on downstream computer-use tasks.","published_date":"2026-06-04T15:57:29+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06320v1","title":"Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance","abstract":"Machine unlearning aims to remove targeted knowledge from a trained model while preserving its general capabilities. For autoregressive language models, not all tokens in a forget sample are equally relevant to forgetting. Existing approaches either ignore this heterogeneity or rely on auxiliary models, heuristics, or external annotations to estimate each token's relevance for forgetting. We instead characterize it through the interaction with the retain objective: a token is forget-specific to the extent that minimizing the forget loss on that token does not conflict with retain optimality. We formalize this perspective as a joint optimization problem over the model parameters and the token weights and show that, under a natural separation condition, the resulting objective recovers the oracle forget-specific token support. Motivated by this formulation, we introduce Alternating Token-Weighted Unlearning (ATWU), a lightweight framework that jointly learns token forget-specificity and model parameters during unlearning using a simple linear scorer over the hidden states, without external token level supervision. Across TOFU and RWKU, ATWU achieves state of the art forget-retain trade-offs, outperforming sample-level methods, probability-based token weighting heuristics, and auxiliary-model-based approaches. Moreover, the learned scores align substantially better with ground truth forget-specific spans, indicating that ATWU identifies semantically meaningful token level forgetting signals. 
Overall, our results suggest that retain conflict provides an effective criterion for identifying what language models should forget, enabling unsupervised learning of token level forget-specificity directly from model representations with minimal computational overhead.","published_date":"2026-06-04T15:56:32+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06316v1","title":"Quantum enhanced rare event discovery and sampling","abstract":"Financial crashes, cascading failures in infrastructure, and critical errors in AI systems are frequently triggered by events that occur with extremely small probability. Efficiently discovering and sampling events with probability below a threshold is therefore of critical interest. Yet this task is highly non-trivial using existing classical or quantum methods. Being rare, such events require an immense sampling overhead to collect sufficient data samples. Moreover, because the rare events are not known in advance, they cannot be flagged for amplification using standard techniques. Here, we introduce a quantum algorithm for rare-event discovery and sampling without first learning which events are rare. The algorithm achieves the optimal quantum scaling with the rarity threshold. 
We further demonstrate that this can achieve a quadratic speedup for heavy-tailed systems whose tail has nonvanishing total mass, and translates into a robust polynomial speedup for stationary stochastic processes, with the exponent determined by its entropy-rate structure.","published_date":"2026-06-04T15:54:53+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06315v1","title":"LLM Self-Recognition: Steering and Retrieving Activation Signatures","abstract":"Recent advances in interpretability suggest that large language models (LLMs) implicitly encode signals in their generated text that enable self-recognition of their outputs. We demonstrate that this capability is reliable, even in low-entropy scenarios, and that it can be amplified through targeted intervention. By steering the internal residual stream during generation with a random sparse vector, we create a detectable fingerprint that enables attribution of a given text to a specific LLM. This signal is recoverable from the activations of an LLM used as a detector, achieving over 98% accuracy across multiple detection settings while preserving the quality of generated text. As AI-generated content proliferates, this approach offers a practical alternative to traditional detectors by leveraging the model's natural representation structure for attribution rather than embedding a signal externally. 
Our contributions include: (i) establishing reliable self-recognition capabilities in LLMs, (ii) a simple steering mechanism enabling multi-LLM identification with no quality degradation, (iii) demonstrating that activation spaces contain exploitable structure for encoding signals without semantic interference.","published_date":"2026-06-04T15:54:34+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06311v1","title":"AIS-Based Vessel Trajectory Prediction Using Memory-Augmented Neural Networks","abstract":"Accurate vessel trajectory prediction is essential for safe and efficient maritime operations, enabling collision avoidance and supporting route optimization. Although memory-augmented neural networks have recently shown strong performance in pedestrian and road-vehicle trajectory prediction by selectively retrieving relevant information from an external memory, their potential for vessel trajectory prediction remains underexplored. This paper presents an empirical investigation of memory-based trajectory prediction using Automatic Identification System (AIS) data. Experiments on data from the Gulf of Mexico and the New York Bight demonstrate consistent and substantial performance gains over a range of deep learning baselines that do not incorporate an external memory.","published_date":"2026-06-04T15:52:21+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06303v1","title":"Plug-and-Play Guidance for Discrete Diffusion Models via Gradient-Informed Logit Correction","abstract":"Controllable generation with discrete diffusion models is often hindered by high computational overhead or the need for retraining. 
In this paper, we present Gradient-Informed Logit Correction (GILC), a plug-and-play framework that efficiently estimates guidance signals by repurposing the pretrained denoising network as a variational proxy. To circumvent the gradient instability inherent in high-dimensional discrete spaces, we introduce a Jacobian-free mechanism that directly corrects the clean prediction logits, facilitating stable and effective guidance. Our method accommodates both differentiable and non-differentiable reward functions. Extensive experiments across DNA, protein sequence, and molecular generation tasks demonstrate that GILC achieves state-of-the-art performance without additional training, frequently outperforming fine-tuning approaches.","published_date":"2026-06-04T15:41:53+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06300v1","title":"Multi-ResNets for Subspace Preconditioning in Constrained Optimization","abstract":"We propose MResOpt, a staged residual neural network architecture for constrained optimization problems. Our architecture fits within predict-complete-correct pipelines and decomposes constraint satisfaction by priority through intermediate re-completion and stage-aware losses. The framework enables domain-informed ordered constraint satisfaction, which allows the network to utilize ordinal structure when present. Under an idealized infinite-width regime, we show that our design behaves as sequential Gaussian Process regression. On synthetic QP, QCQP, and SOCP benchmarks, the staged architecture improves high-priority constraint satisfaction across convex and non-convex settings. 
On line-flow-constrained AC optimal power flow, we introduce a physics-motivated constraint ordering and show that MResOpt supports a learned division of labor that keeps iterates on the equality manifold, achieving substantially lower high-priority violation than reprojected baselines while remaining computationally efficient.","published_date":"2026-06-04T15:37:55+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06294v1","title":"Towards One-to-Many Temporal Grounding","abstract":"Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query -- a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. 
Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.","published_date":"2026-06-04T15:31:22+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06286v1","title":"LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs","abstract":"Large language models can reproduce training data, but existing memorization evaluations mostly measure whether models can be forced to do so, rather than whether they do so under ordinary use. We introduce PropMe, a propensity-aware framework for memorization evaluation that contrasts prefix-based capability attacks with non-adversarial evaluations. We propose a metric transformation that, applied to existing functions, yields propensity metrics. We further introduce SimpleTrace, a lightweight tracing pipeline built on infini-gram that deterministically attributes model generations to large-scale training corpora and computes verbatim, near-verbatim, and propensity-transformed memorization metrics. Evaluating two fully open models (Comma and DFM Decoder) on two datasets (Common Pile and Dynaword) in two languages, we find a consistent gap between capability and propensity: prefix attacks elicit substantially stronger memorization signals than generic or dataset-specific prompts, while propensity scores remain low overall. Thus, the models can reveal training data when directly elicited, but rarely do so in more common non-adversarial settings. We also find that DFM Decoder, which is continually pre-trained from Comma, exhibits reduced memorization and memorization propensity for Common Pile, confirming that memorization capability can decrease when later training emphasizes partially different data. 
Our results suggest, and we encourage, that memorization audits should report both worst-case extractability and ordinary leakage propensity in order to have a more comprehensive view of this phenomenon.","published_date":"2026-06-04T15:25:24+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06285v1","title":"TRACE: A Temporal Conditional Estimation for Multimodal Time Series Foundation Models","abstract":"Time series foundation models (TS-FMs) aim to learn generalizable temporal representations that can be adapted to a wide range of downstream tasks. In real-world multimodal settings, time series are frequently affected by temporal misalignment and partial modality missingness, where different modalities are observed at heterogeneous time scales or are partially absent. Existing approaches typically rely on naive imputation or masking strategies, which fail to account for cross-modal dependencies and often lead to misaligned or degraded representations. We propose TRACE, a conditional estimation paradigm for multimodal time series foundation model pipelines under missingness and irregular sampling, allowing incomplete target modalities to be systematically inferred from available auxiliary modalities. We evaluate TRACE on diverse multimodal benchmarks spanning healthcare and affective computing, including the MIMIC-IV clinical dataset and the CMU-MOSI and CMU-MOSEI benchmarks for multimodal sentiment analysis. 
Across a range of downstream prediction tasks and missing-modality settings, TRACE consistently outperforms prior multimodal fusion approaches, demonstrating improved robustness to severe modality missingness and more reliable cross-modal representations.","published_date":"2026-06-04T15:25:03+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06284v1","title":"ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents","abstract":"Large language model agents increasingly rely on external tools, but larger tool menus can reduce reliability and efficiency by increasing wrong-tool calls, premature actions, and token cost. Existing tool-selection methods often optimize semantic relevance, exposing tools whose names or descriptions match the user request. We argue that relevance is insufficient: a tool may be related to the task while still being unnecessary or premature at the current step. We propose Causal Minimal Tool Filtering (CMTF), a training-free method that selects tools by causal sufficiency. CMTF uses lightweight precondition-effect contracts to expose only the minimal next-step tool frontier needed to advance from the current state toward the user goal. Across multi-step tool-use tasks, we compare CMTF with all-tools exposure, keyword retrieval, state-aware filtering, and causal-path ablations, measuring task success, wrong-tool calls, premature actions, tool exposure, and token cost. 
In the main benchmark with 102 tasks, 100 tools, four LLM backends, and 2448 task-method-model runs, CMTF matches the strongest causal baseline in aggregate success while reducing visible tools from 100 to one per step and reducing token usage by about 90% relative to all-tools exposure.","published_date":"2026-06-04T15:24:10+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06273v1","title":"Adapting Diffusion Language Models for Lossless Pixel-Level Image Transmission","abstract":"Lossless pixel-level image transmission is a fundamental regime beyond semantic communications, because exact recovery requires both accurate symbol probability modeling and reliable delivery over noisy channels. This paper proposes DDM-SSCC, a discrete-diffusion-model-based separate source-channel coding framework for lossless image transmission. Different from raster-order autoregressive coding, the proposed source codec adapts a diffusion language model to pixel-token restoration and performs synchronized reverse arithmetic coding under bidirectional attention, allowing multiple masked tokens to be coded within one reverse denoising step. This progressive restoration process also yields a more favorable source representation for noisy transmission, since newly restored tokens can serve as bidirectional context in subsequent denoising steps. To bridge the gap between generation-oriented masked denoising and lossless arithmetic coding, we further introduce a Halton-guided denoising order, a mask-ratio-aware cosine schedule, and a lightweight temperature calibration module. These designs respectively improve spatial coverage, adapt the denoising pace to context reliability, and calibrate the probability tables used by arithmetic coding. 
Experiments on CIFAR10, DIV2K-LR-X4, and Kodak over additive white Gaussian noise and Rayleigh fading channels show that DDM-SSCC achieves better exact-recovery performance than representative lossless and semantic communication baselines, while ablation studies verify the effectiveness of the proposed denoising order, schedule, and calibration modules.","published_date":"2026-06-04T15:14:31+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06272v1","title":"Your GFlowNet Secretly Learns an Optimal Transport Plan","abstract":"Generative Flow Networks (GFlowNets) are a framework for sampling structured objects via stochastic trajectories in a directed graph. In this work, we establish a theoretical connection between non-acyclic GFlowNets and optimal transport (OT). We show that fixing the initial flow distribution in a minimum-flow GFlowNet reduces its objective to a Kantorovich OT problem with graph-induced shortest path costs. At the optimum, the learned GFlowNet policy therefore encodes an optimal transport plan from the source distribution to the target distribution: we show that sampling trajectories from the minimum-flow GFlowNet recovers the corresponding optimal coupling. Our formulation enables applying the GFlowNet learning framework to OT problems on large graphs via edge flows and neural parameterization. 
Experiments confirm agreement with exact OT solvers and demonstrate that GFlowNets can learn high-quality transport plans.","published_date":"2026-06-04T15:14:24+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06261v1","title":"DAST: A VLM-LLM Framework for Cross-Interface Anomaly Detection in O-RAN","abstract":"O-RAN enables a disaggregated baseband stack with programmable functions that communicate over standardized open interfaces. The same openness that enables multi-vendor composition also expands the attack surface across logically decoupled tiers that make up the compute continuum. Among these threats, Denial-of-Service and performance-degradation attacks, which account for the majority of catalogued O-RAN threats, are particularly difficult to detect. Traditional Time-Series Anomaly Detection (TSAD) methods fail in this new regime where labelled baselines are scarce, threats evolve faster than detectors can be retrained, and the high-dimensional multivariate telemetry overwhelms monolithic inference models. To address these challenges, we present DAST, a zero-shot multi-agent framework for cross-interface anomaly detection in O-RAN that chains a three-stage VLM $\\rightarrow$ LLM $\\rightarrow$ VLM pipeline. DAST converts multivariate KPI streams into visual representations, scores textual per-interface descriptions against O-RAN domain knowledge, and verifies suspects on high-resolution heatmaps to output the problematic interfaces, the anomalous time intervals, an indicative O-RAN WG11-aligned operational impact rating and the decision rationale. 
We evaluate DAST on real network traces collected from an O-RAN testbed under representative performance degradation scenarios, achieving 0.910 F1-Score and 0.843 Accuracy, outperforming state-of-the-art TSAD baselines.","published_date":"2026-06-04T15:05:04+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06260v1","title":"OneReason Technical Report","abstract":"Generative recommendation models in the OneRec family have been widely deployed in many real-world services, such as short-video, live-streaming, advertising, and e-commerce. However, these generative models can only benefit from the scaling advantage, while their reasoning ability is hard to activate, since we cannot construct meaningful Chain-of-Thought (CoT) sequences consisting of itemic tokens only. Inspired by the success of the reasoning-style ``think before answer'' paradigm in the LLM field, we conduct preliminary studies (i.e., OneRec-Think, OpenOneRec) to explore reasoning capability in generative recommendation. Nevertheless, we notice an unexpected phenomenon: the thinking mode does not show advantages over the non-thinking mode. Drawing insights from recent findings on CoT robustness in multi-modal language models, we argue that effective reasoning in recommendation rests on two factors: perception, the ability to ground itemic tokens in their underlying language semantics, and cognition, the ability to reorganize a user's behavior sequence into coherent latent interest points. 
We therefore propose OneReason, which includes: (1) strong itemic token perception in pre-training, (2) a three-level cognition-enhanced CoT format for recommendation tasks in SFT, and (3) a specialize-then-unify training recipe in RL to enhance the thinking ability.","published_date":"2026-06-04T15:04:34+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06256v1","title":"RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention","abstract":"As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure. It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalability. Several important problems, including position-independent KV cache, prefix KV cache compression, hot/cold KV cache separation, and distributed KV cache management, all depend on how the KV cache is represented and managed. However, existing serving systems largely rely on a monolithic KV cache abstraction, where the KV cache is treated as a homogeneous sequence of token-level memory blocks and managed with similar policies across attention heads and serving scenarios. We observe that KV cache utility is highly structured across KV heads: different heads exhibit different functional roles, attention distances, and runtime importance. Therefore, a full KV cache is not always necessary for every head, token range, or serving scenario.   We present RedKnot, a head-aware KV cache management system for LLM serving. RedKnot breaks the conventional monolithic KV cache abstraction by decomposing the KV cache along KV heads, whose importance and effective attention ranges vary significantly across serving scenarios. 
This head-level decomposition turns the KV cache from a monolithic tensor abstraction into a structured memory object, enabling RedKnot to uniformly support position-independent KV reuse, prefix KV compression, hot/cold KV separation, and distributed KV placement while preserving output fidelity and improving resource efficiency, without requiring model retraining or fine-tuning. RedKnot establishes a new foundation for AI infrastructure by transforming the KV cache from a monolithic, passive runtime artifact into a dynamic, model-aware runtime substrate for scalable LLM serving.","published_date":"2026-06-04T14:57:07+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06252v1","title":"Closing the Loop on Latent Reasoning via Test-Time Reconstruction","abstract":"Recent work moves intermediate reasoning from natural-language traces into latent or cache-level representations to reduce token overhead and avoid a discrete communication bottleneck. However, this shift also removes a key advantage of textual reasoning: intermediate states are no longer inspectable, making it difficult to determine whether a latent state still preserves the constraints of the original query. As a result, latent reasoning typically operates in an open loop, where a latent state is produced and consumed without an input-anchored fidelity check. We propose ReLAT (Reconstruction-Guided Latent Reasoning At Test Time), a self-supervised test-time training method that closes this loop using the query itself as the reference. Our key observation is that if a latent state faithfully represents a query, the query should be recoverable from it; if the query cannot be recovered, the latent state has lost task-relevant information. 
ReLAT operationalizes this principle by constructing a differentiable Question -> Latent Thought -> Question cycle and optimizing query reconstruction loss through the latent thought before answer generation. This anchors opaque latent computation to the problem specification it is supposed to represent. Across mathematical reasoning, knowledge QA, and code generation benchmarks on the Qwen family, ReLAT consistently improves over single-model inference, text-based collaboration, open-loop latent collaboration, and alternative test-time training objectives. On Qwen3-8B, ReLAT raises AIME 2024 accuracy from 56.7% to 73.3%, a 16.6-point gain over the strongest open-loop latent baseline.","published_date":"2026-06-04T14:54:40+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06245v1","title":"MPCoT: Reward-Guided Multi-Path Latent Reasoning for Test-Time Scalable Vision-Language-Action","abstract":"Vision-Language-Action (VLA) policies remain brittle in long-horizon and high-uncertainty control, where one-pass action decoding provides limited inference-time deliberation. Explicit chain-of-thought can increase reasoning depth, but introduces token latency and an indirect text-to-action interface. We propose MPCoT, a reward-guided multi-path latent reasoning framework that initializes $M$ hypotheses, refines them for K weight-tied steps, and softly aggregates them before action decoding. A training-only path-preference objective evaluates candidate action branches with expert-action consistency, world-model/VLM-based progress, and success feedback to align the latent path scorer with downstream execution quality. MPCoT preserves the original 8-step action interface, generates zero reasoning tokens, and exposes configurable inference controls (K,M). 
Under matched protocols on LIBERO and CALVIN, MPCoT improves long-horizon performance, with ablations confirming depth-width effects, confidence-weighted aggregation, and reward-guided path supervision.","published_date":"2026-06-04T14:48:44+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06242v1","title":"Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents","abstract":"Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for \\textit{data snapshot extraction}, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open-source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. 
These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at https://huggingface.co/datasets/ai4data/data-snapshot and the source code is available at https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot.","published_date":"2026-06-04T14:47:40+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06240v1","title":"TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory","abstract":"Persistent memory for an LLM agent is a write-heavy substrate: every belief update is a versioned write, and a new claim may contradict a stored one. Production systems use four resolution heuristics (last-writer-wins, evidence-weighted merge, await-confirmation, per-rule policy), yet none declares the isolation level it assumes or the write-time anomalies it admits. We show that contradiction resolution is write-time concurrency control and make the missing contract explicit. TOKI types the four heuristics as one family of bitemporal operators over a dual-row schema, each with an isolation precondition and a provenance annotation that preserves the losing fact in an audit row. Four soundness theorems close the contract across isolation, schema, and provenance, lift the guarantees to operator pipelines, and extend the fold operators to n-ary conflict sets. A tightness companion proves that, within the relational schedule model, keyed logging of the adjudicating judge is necessary for replay consistency, which every audited baseline omits. 
A verdict matrix over eight systems localizes the gap: every baseline that keeps a language-model judge on the write path admits at least one of three write-time anomalies (replay inconsistency, belief-drift skew, audit erasure); a content-addressed engine-layer comparator avoids them only by removing the judge, and TOKI alone excludes all three while keeping it. On its one natural-workload slice the audit-row defence moves LoCoMo by 0.86, and ablating the typed memory layer removes 0.49 accuracy on 1,444 answerable LoCoMo questions; the cross-system comparison stays underpowered and claims no superiority. The contribution is the contract: a write-time correctness specification, proved sound across isolation, schema, and provenance, pinning the guarantee every production heuristic assumes but no deployed system makes explicit.","published_date":"2026-06-04T14:46:52+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/ZenAlexa/toki-bitemporal-memory","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06235v1","title":"Design a Reliable LLM-Integrated Interface for Mortality Forecasting","abstract":"Mortality forecasting plays an important role in actuarial and policy decision-making, but its implementation remains technically complex and inaccessible to non-expert users. This project proposes a reliable large language model (LLM)-integrated interface that improves usability while maintaining statistical power. The LLM is designed as a constrained orchestration layer that translates natural-language inputs into structured configurations for a deterministic forecasting pipeline. A three-phase methodology is employed to ensure accuracy, usability, and transparency. First, a baseline pipeline is implemented using the CoMoMo package, reproducing established mortality forecasting results. 
Second, the pipeline is extended to generate multi-step forecasts using rolling-origin evaluation and mean squared error (MSE). Third, a prototype interface uses a local LLM to handle users' forecasting requests in plain language. The system demonstrates that LLMs can enhance accessibility without compromising reproducibility, transparency, or actuarial validity in high-stakes analytical workflows.","published_date":"2026-06-04T14:41:07+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06225v1","title":"Bridging the Semantic-Collaborative Gap: An Asymmetric Graph Architecture for Cold-Start Item Recommendation","abstract":"Collaborative filtering and graph-based recommendation models are highly effective because they leverage observed user interactions, but this dependence creates a fundamental cold-start challenge when newly added content has no interaction history. In Tubi's production retrieval system, this challenge is further constrained by the serving interface: new content must be assigned a standalone embedding immediately, and the model must also produce device embeddings suitable for approximate nearest-neighbor retrieval. We address this setting by formulating cold-start recommendation as an inductive graph-completion problem on a temporal bipartite device-content graph. We propose Shallow-RHS, an asymmetric link-prediction architecture in which the left-hand side (LHS) device tower leverages temporally valid watch-history message passing to capture collaborative signals, while the right-hand side (RHS) content tower is intentionally shallow with respect to the graph and encodes content solely from intrinsic features. 
The RHS tower does not use ID-based embeddings, content-side subgraphs, neighbor aggregation, or interaction-derived representations, forcing the content encoder to map intrinsic features into a collaborative-filtering-aware embedding space. After training, the learned content encoder generates embeddings for both warm and newly ingested content, enabling implicit graph completion through retrieval of warm surrogate neighbors. We further extend the same representation-completion principle to device cold-start by constructing cohort-based embeddings from demographic features. Large-scale online experiments demonstrate consistent relative improvements in content cold-start engagement, promotion speed, impression acquisition, and device cold-start engagement.","published_date":"2026-06-04T14:35:57+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06223v1","title":"From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents","abstract":"Language-model agents act through repeated cycles of observation, reasoning, and action selection, making safety monitoring depend on both internal model state and environment context. We study reward-hacking monitors in ReAct-style agents acting in Gameable ALFWorld and WebShop. Agents are instrumented with activation-based reward-hack scores, token-level entropy, and decision-context features. We find that adapters fine-tuned on \\textit{School-of-Reward-Hacks} dataset can transfer reward-hack tendencies into agentic action selection, especially when the environment exposes proxy-reward affordances. However, mitigating such behavior cannot rely on activation dynamics alone. High reward-hack activation identifies a latent policy state, but does not necessarily imply an immediate exploit action. 
Across next-step prediction tasks, entropy and context-calibrated internal features improve risk estimation over reward-hack activation alone. Activation-direction steering further reduces proxy-exploit behavior in selected mixed-adapter regimes. Overall, our results support context-calibrated internal monitoring for agents: reward-hack activation identifies a latent policy state, while entropy and decision context help determine when that state becomes risky action.","published_date":"2026-06-04T14:34:31+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06219v1","title":"CLEAR: Cognition and Latent Evaluation for Adaptive Routing in End-to-End Autonomous Driving","abstract":"End-to-end autonomous driving models often struggle to balance multi-modal maneuver generation with real-time inference constraints. While diffusion models successfully capture diverse driving behaviors, their iterative denoising process incurs unacceptable latency for safety-critical deployment. To address this, we propose CLEAR (Cognition and Latent Evaluation for Adaptive Routing), a framework that combines ultra-fast generative planning with deep semantic reasoning. CLEAR employs Drive-JEPA as the visual encoder and replaces the multi-step denoising chain with a single-step conditional drift in a VAE latent space, introducing a conditioning coefficient to balance diversity and expert precision. Meanwhile, we fully fine-tune Qwen~3.5~0.8B on driving QA pairs to extract scene-aware hidden states. These states guide both an Adaptive Scheduler, which selects the conditioning coefficient $\u03b1$ and sample count $N$ from a discrete set of predefined schemes, and a cross-attention scorer that selects the optimal trajectory from candidates. On the NAVSIM v1 benchmark, CLEAR achieves a state-of-the-art PDMS of 93.7. 
Our results demonstrate that high-fidelity, multi-modal planning can be executed efficiently without dense geometric annotations or iterative sampling.","published_date":"2026-06-04T14:32:10+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06218v1","title":"TAM: Torque Adaptation Module for Robust Motion Transfer in Manipulation","abstract":"A policy tuned for one robot often behaves differently on another, whether due to the sim-to-real gap, unknown payloads, or the differing dynamics of two instances of the same robot. In contact-rich, dynamic manipulation, even small motion discrepancies can result in failure to track reference motion, since they disrupt the timing and modes of contact. Common remedies, such as domain randomization or system identification, either produce overly conservative task policies or require data that must be recollected for each robot or payload. We introduce the Torque Adaptation Module (TAM), a learned module that adapts the torque commands sent to the robot to match the behavior of an ideal robot. TAM operates between the low-level controller that tracks the policy's actions and the robot's torque interface. It includes a history encoder that embeds proprioceptive history into a latent state and a torque adaptor that computes residual torque corrections. Because TAM depends only on proprioceptive history and not on policy observations, or the action space, the same TAM weights can be reused to adapt policies with different action spaces (joint targets, end-effector targets, or direct torques). The policies themselves do not need to be trained with domain randomization of robot parameters. 
Instead, we offload the need for domain randomization to TAM by training it entirely in randomized simulation, using multi-robot pretraining followed by a robot-specific fine-tuning step that still requires no real-robot data. We evaluate TAM zero-shot on a real Franka Panda robot across dynamic manipulation tasks that include a vision-based box pushing policy (from RL), a flip policy (from BC), and an MPC ball-on-plate balancing. Our experiments show that TAM improves zero-shot real-robot execution compared to online system identification and RMA baselines and enables robust dynamic manipulation performance.","published_date":"2026-06-04T14:31:54+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06217v1","title":"DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments","abstract":"When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a multi-stage multimodal reasoning benchmark for UAV-Based disaster response in complex environments. DisasterBench spans 14 disaster-related scene types and 9 response-critical tasks across pre-, during-, and post-disaster stages, with fine-grained disaster-task mappings that explicitly test causal attribution, propagation prediction, damage analysis, and decision-oriented reasoning. 
To enable reasoning on the edge, we further propose DisasterVL, a lightweight multimodal model optimized with a three-stage pipeline combining domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement learning-based policy optimization. Experiments across 21 popular MLLMs show that our 2B-parameter DisasterVL outperforms all evaluated open-source models and substantially narrows the gap to state-of-the-art closed-source models, achieving GPT-4o-comparable reasoning accuracy with superior efficiency. The project page is available at https://github.com/TanmouTT/DisasterBench.","published_date":"2026-06-04T14:31:11+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/TanmouTT/DisasterBench","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06214v1","title":"Towards the Readability of LLM-Generated Codes through Multitask Representation Engineering","abstract":"Correctness and readability are key measures of code quality, respectively ensuring functional fidelity and ease of comprehension. While most existing research focuses on improving the correctness of code generated by large language models (LLMs), readability remains under-addressed. Enhancing readability through targeted control is challenging due to its subjective nature. In this article, we employ representation engineering (RepE) as the targeted control method given its characteristics of low data dependency and low computational cost. Prior work on RepE has primarily focused on the targeted control for a single task, but improving the code readability requires the control across multiple tasks. Accordingly, we propose the multitask RepE framework and theoretically discuss the impact of the multitask steering method on the tradeoff between the code readability and correctness. We further provide comprehensive experiments in support. 
All the relevant implementations are open-source and available upon request.","published_date":"2026-06-04T14:24:14+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06212v1","title":"Evaluating Agentic Configuration Repair for Computer Networks","abstract":"Misconfigurations in computer networks remain a major source of critical Internet outages. Research is turning to Large Language Models (LLMs) to automate the complex, error-prone task of network configuration. However, even state-of-the-art models fail to resolve misconfigurations in large-scale, complex scenarios and often introduce new errors. In this work, we benchmark open- and closed-source LLMs augmented with formal network verification and context retrieval tools. We demonstrate that agentic architectures outperform base LLMs in repair efficacy (by 12% on average) and safety (by 17% on average), enabled by the ability to dynamically manage context and iteratively validate configuration repairs.","published_date":"2026-06-04T14:20:25+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06207v1","title":"Unsupervised Pattern Analysis in Japanese Veterinary Toxicology: A Regulatory-Compliant Framework for Cross-Species Risk Assessment","abstract":"Veterinary pharmacovigilance systems are essential for monitoring adverse drug events (ADEs), yet existing approaches often fail to capture region-specific toxicity patterns shaped by local biological and regulatory contexts. In Japan, these challenges are amplified by species-specific metabolic differences and reporting practices defined by the Ministry of Agriculture, Forestry, and Fisheries (MAFF). Most prior work relies on prediction-oriented models, limiting mechanistic interpretability. 
This study proposes a regulatory-integrated unsupervised framework for pattern discovery using the National Veterinary Assay Laboratory (NVAL) database. ADEs are encoded into organ system-aligned representations and adjusted for species-specific reporting biases, enabling cross-species comparison. Similarity-based clustering and dimensionality reduction are applied to identify latent toxicity structures. Analysis of 4,120 high-confidence ADE reports (9,080 drug-ADE combinations) identified three significant species clusters (p < 0.01), including hepatic-dominant patterns in companion animals (0.42 $\\pm$ 0.06), renal toxicity in ruminants (0.39 $\\pm$ 0.07), and dermatological sensitivity in sheep (0.35 $\\pm$ 0.07). Drug-level clustering achieved 83% alignment with pharmacological classes, while cosine similarity outperformed alternative metrics (silhouette score: 0.48; cluster precision: 87%). Regulatory validation showed strong agreement with established classifications. These findings demonstrate that regulation-aligned unsupervised analysis can uncover biologically meaningful, region-specific toxicity patterns, providing an interpretable and scalable framework for veterinary drug safety assessment.","published_date":"2026-06-04T14:14:16+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06203v1","title":"Dense Contexts Are Hard Contexts: Lexical Density Limits Effective Context in LLMs","abstract":"Input length and the position of relevant information are widely cited as the primary causes of degraded LLM long-context performance. Here, we study lexical density -- the rate at which a context introduces distinct information -- as a third, largely overlooked factor that systematically reduces the effective context window of LLMs. 
We quantify the impact of lexical density on open-weight LLMs (9B-685B) using three \"find-the-needle\" style benchmarks with identical length (~12k tokens) and controlled needle position, but increasing density of information. We observe a sharp performance collapse in higher-density benchmarks: models that are near-perfect in sparse contexts drop below 60% retrieval score on denser ones. To rule out task-type confounds, we vary and control the density within each benchmark while keeping all other properties unchanged. Reducing density generally restores performance, especially in the high-density regimes where degradation appears. These results show that effective context capacity is a function of lexical density, with direct implications for real-world LLM systems operating on compact, information-rich inputs.","published_date":"2026-06-04T14:08:30+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06201v1","title":"Learning to replenish: A hybrid deep reinforcement learning for dynamic inventory management in the pharmaceutical supply chains","abstract":"Pharmaceutical supply chains (PSCs) struggle with inventory management (IM) due to unpredictable demand patterns and variable lead times associated with restocking. This complexity is further compounded by the finite shelf lives of pharmaceutical products, which necessitate a delicate balance between adequate stock and minimal waste. These intertwined factors create a complex optimization problem that requires sophisticated inventory strategies to ensure both product availability and PSC efficiency. This study aims to develop an optimal inventory replenishment policy for pharmaceutical products that can handle the stochasticity arising from uncertain demand and variable PSC conditions. 
The objective is to maximize the profitability of the PSC while maintaining a high patient service level. We formulate the problem as a Markov decision process and propose a deep reinforcement learning (DRL) approach, specifically, a hybrid asynchronous advantage actor critic distributed proximal policy optimization (A3C DPPO) algorithm. The A3C DPPO algorithm is tailored to handle the continuous action space inherent in IM. The numerical results demonstrate that the proposed algorithm adaptively updates the inventory replenishment strategy under dynamic scenarios, resulting in lower inventory costs compared to various benchmarks. We also conduct numerical validation using real-world pharmaceutical inventory data to confirm the practical feasibility of the proposed algorithm.","published_date":"2026-06-04T14:06:46+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06197v1","title":"Improving Answer Extraction in Context-based Question Answering Systems Using LLMs","abstract":"Question answering (QA) systems have achieved notable progress with the advent of large language models (LLMs). However, they still face challenges in accurately extracting and generating precise answers from given contexts, particularly when dealing with complex or ambiguous queries. Existing approaches often struggle with contextual understanding, answer consistency, and generalization across diverse domains. In this work, we propose a question answering system based on large language models, where the input consists of a textual context and a corresponding question, and the output is a concise and accurate answer. The motivation behind this research lies in addressing the limitations of current QA systems, particularly their tendency to produce irrelevant or imprecise responses despite having access to the correct context. 
Our methodology involves fine-tuning a pre-trained LLM on a benchmark QA dataset to improve its contextual comprehension and answer extraction capabilities. Specifically, we utilize the Stanford Question Answering Dataset (SQuAD1.1), which provides high-quality context-question-answer triplets for supervised training and evaluation. Experimental results show that the fine-tuned Roberta-base model achieves the highest performance, attaining a ROUGE-L score of 86.84%, a BLEU score of 28.24%, and a BERTScore of 95.38%. These results indicate strong accuracy and answer relevance, demonstrating the effectiveness of the proposed approach for context-based question answering tasks. Furthermore, the findings confirm that targeted fine-tuning substantially improves the reliability and precision of QA systems.","published_date":"2026-06-04T14:04:11+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06178v1","title":"Learning to Route LLMs from Implicit Cost-Performance Preferences via Meta-Learning","abstract":"Large language models (LLMs) present a trade-off between performance and cost, where more powerful models incur greater expense. LLM routing aims to mitigate expenses while maintaining performance by sending queries to the most suitable model. However, existing methods cannot perform well for different user cost-performance preferences. To address this gap, we introduce a novel perceptive LLM routing paradigm for personalized and user-centric cost-performance optimization, which efficiently learns users' implicit preferences through little interaction. To handle the challenge of heterogeneous user needs, we formulate preference profiles as a set of distinct tasks in contextual bandit and propose MetaRouter, a meta-learning framework designed for preference-aware LLM routing. 
Experimental results show that MetaRouter outperforms strong baselines on both in-distribution and out-of-distribution tasks. Furthermore, it exhibits high efficiency in learning user preferences, robustness to changes in the routable LLMs, and scalability to multi-model routing.","published_date":"2026-06-04T13:53:03+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06168v1","title":"ProSarc: Prosody-Aware Sarcasm Recognition Framework via Temporal Prosodic Incongruity","abstract":"We present ProSarc, an audio-only framework that detects sarcasm by modelling temporal prosodic incongruity, that is, the mismatch between local prosodic dynamics and the utterance-level emotional baseline. Dual encoding paths, a Global Emotion Encoder and a Temporal Prosody Encoder (BiLSTM + multi-head attention), feed a Prosodic Incongruity Analyzer that produces a scalar incongruity score for classification. Monte Carlo dropout provides uncertainty estimates, and an attention-based mechanism localises sarcastic onset without frame-level labels. ProSarc outperforms prior audio-only methods on MUStARD++ (F1=75.3) and generalises to spontaneous (PodSarc, F1=62.9) and cross-lingual speech (MuSaG, F1=65.6). Ten-run validation confirms the contribution of incongruity modelling (Wilcoxon p=0.002, Cohen's d=1.51). 
Human evaluation shows that model uncertainty tracks perceptual ambiguity and predicted onsets align with human-annotated temporal windows.","published_date":"2026-06-04T13:40:09+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06160v1","title":"Where does Absolute Position come from in decoder-only Transformers?","abstract":"RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components. The causal mask is responsible for the first: its per-query softmax denominator depends on the absolute query position by construction. The residual stream supplies the second. Under causal attention the activation at position $0$ attends only to itself and runs as a closed dynamical system from the embedding of the token at that position; downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures we study, in architecturally specific balance: NTK scaling suppresses the residual-stream component, sliding-window attention allows it to accumulate with depth, and standard RoPE sits between. Replacing the \\texttt{BOS} embedding before the forward pass removes $40\\%$ of the residual-stream component at early queries. 
Attention sinks are token-anchored stabilizers that pass forward a deterministic fingerprint of the token at position $0$, constant across inputs when that token is the auto-prepended \\texttt{BOS} and varying with it otherwise.","published_date":"2026-06-04T13:32:38+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06159v1","title":"ITP-STDP: An Intrinsic-Timing Power-of-Two Learning Engine for On-Chip SNN Training","abstract":"Spiking neural networks (SNNs) have the potential to emerge as the third generation of neural networks and have attracted increasing attention across a wide range of applications. However, the large number of synaptic connections in SNNs leads to intensive weight-update computation by on-chip learning algorithms during training, resulting in substantial hardware resource utilization and energy consumption. Among existing SNN learning algorithms, spike-timing-dependent plasticity (STDP) is one of the most extensively studied and widely adopted, serving as a fundamental learning component in SNNs. To address the hardware and energy overheads associated with SNN training, this paper presents intrinsic-timing power-of-two STDP (ITP-STDP) and its corresponding prototype learning engine hardware architecture. The proposed design is evaluated through a dedicated mean-field synaptic drift model for dynamical analysis and further validated across SNN networks of different scales and datasets. It is further implemented on both ASIC and FPGA platforms and compared with state-of-the-art approaches, including the original STDP and more complex STDP variants. The results demonstrate superior energy efficiency, higher operating speed, and substantially lower hardware resource utilization, as the proposed design eliminates most of the computational overhead of STDP through both algorithmic and hardware-level optimizations. 
On the FPGA platform, the proposed design improves energy efficiency by 4.5$\\times$ to 219.8$\\times$ over the compared designs. On the ASIC platform, the proposed design achieves a 4.8$\\times$ to 22.01$\\times$ speedup while consuming only 1.2% to 3.3% of the area required by prior works.","published_date":"2026-06-04T13:32:20+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06154v1","title":"Amortizing Federated Adaptation: Hypernetwork Driven LoRA for Personalized Foundation Models","abstract":"Federated fine-tuning of foundation models using Low-Rank Adaptation (LoRA) offers a communication-efficient solution for distributed learning. However, existing federated LoRA methods suffer from two fundamental limitations: (1) structural aggregation bias, where independently averaging low-rank factors fails to approximate the true combined update, and (2) client-side initialization lag, as clients repeatedly reinitialize LoRA parameters across communication rounds, slowing convergence. We propose HyperLoRA, a unified framework that addresses both issues by amortizing federated adaptation through hypernetwork-driven LoRA generation and product-space aggregation. Instead of iterative per-client optimization, HyperLoRA employs a learned generator that maps client distribution signatures to LoRA initializations, effectively amortizing per-client adaptation. On the server side, we introduce a learned aggregation module that directly synthesizes updates in the low-rank product space, eliminating the inconsistencies of factor-wise averaging. A lightweight residual correction module further improves stability under heterogeneous (non-IID) client distributions. By replacing iterative optimization and heuristic averaging with learned operators, HyperLoRA jointly enables efficient personalization, unbiased aggregation, and faster convergence. 
Experiments on federated vision and vision-language benchmarks show that HyperLoRA achieves improved convergence speed, greater robustness to distribution shift, and stronger personalization performance compared to prior federated LoRA methods.","published_date":"2026-06-04T13:28:48+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06147v1","title":"WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation","abstract":"End-to-end Vision-Language-Action (VLA) models have shown promise in UAV navigation. However, existing approaches typically rely on historical observations to directly predict actions, often struggling in dense urban environments where severe occlusions and sharp turns result in drastic viewpoint transitions. We argue that the ability to \"imagine\" future states -- inherent in World Models -- is critical for robust decision-making under such partial observability. To address this, we construct a challenging Urban Canyon Traversal Benchmark, specifically designed to evaluate spatial understanding in scenarios characterized by severe occlusions and drastic viewpoint transitions. To this end, we propose WorldFly, a novel world-model-based VLA framework that employs a dual-branch coupled flow matching mechanism to jointly generate future video predictions and navigation actions, thereby explicitly guiding the agent's policy via spatial imagination. 
Extensive evaluations on our benchmark demonstrate that WorldFly outperforms other baselines, particularly in unseen environments, validating the effectiveness of integrating world models into embodied aerial agents.","published_date":"2026-06-04T13:23:05+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06136v1","title":"A Finite Certificate for the Positive $n=9$ Vasc Inequality","abstract":"We prove the positive-real $n=9$ case of the Vasc cyclic inequality. The proof was obtained with human-guided assistance from the AI agent MechMath Agent Team: the human-readable part reduces the rational inequality to a homogeneous polynomial inequality, fixes a cyclic maximum, and parametrizes each sorted fixed-maximum cone by cumulative gaps; the finite part is a certificate covering all $8!=40320$ sorted cones. MechMath Agent Team generated the certificate verification workflow through Python tool calls, including the case split, verification programs, and terminal classifications. The published certificate has $36815$ coefficient leaves, $2236$ ordinary Polya multiplier leaves, and $1269$ AM-GM midpoint overlay leaves. Human authors audited the mathematical reductions and verification logic, and a separate artifact contains the certificate, an independent verifier, and a from-source rebuild route.","published_date":"2026-06-04T13:19:19+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06133v1","title":"TLA-Prover: Verifiable TLA+ Specification Synthesis via Preference-Optimized Low-Rank Adaptation","abstract":"TLA+ is a formal specification language for verifying distributed systems and safety-critical protocols. 
Large language models (LLMs) frequently produce TLA+ specifications that fail the TLC model checker for semantic reasons. Across 25 LLMs, the best public baseline is 26.6% syntactic parse and 8.6% semantic model-check. We present TLA-Prover, a 20-billion-parameter model for TLA+ specification synthesis. Training combines supervised fine-tuning (SFT) on verified examples with repair-based group-relative policy optimization (GRPO). In the GRPO stage, the model learns to fix its own rejected specifications. We also train a direct preference optimization (DPO) variant from the same SFT checkpoint as an ablation. TLC provides the reward signal directly, with no learned reward model. Four tiers grade each output: Bronze (parses), Silver (no warnings), Gold (passes TLC), and Diamond. To reach Diamond, the model's correctness property is automatically altered in a small way; TLC must then detect a violation. If TLC still passes, the property was always-true and contributes nothing; the output fails Diamond. TLA-Prover reaches 9/30 (i.e. pass@1 = 30%) at both Gold and Diamond on a held-out 30-problem benchmark. This is roughly 3.5x the 8.6% untuned baseline. The DPO variant reaches 20% at Diamond. Gold and Diamond coincide at every checkpoint; this prevents the trivial-property failure mode.","published_date":"2026-06-04T13:17:06+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06114v1","title":"Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Agent Interaction in Self-Evolving Systems","abstract":"Self-evolving agents improve through continual self-play and self-generated learning signals, but autonomous evolution can also cause capability degradation and safety drift. Although human feedback has proven effective for static and post-trained agents, its role in self-evolving systems remains underexplored. 
We introduce Agent Norm Correction through Human-like Oversight and Review (ANCHOR), an LLM-based framework that simulates human supervision and delivers feedback at various phases of self-evolution. With ANCHOR, we evaluate two representative open-source self-evolving agent systems across coding, mathematical reasoning, and safety. Our results show that even limited supervision substantially mitigates safety degradation while preserving stable performance on core evolutionary objectives. Further analysis shows that supervision over the output verification phase is the most effective for intervention, whereas increasing supervision frequency yields diminishing returns. These findings provide empirical evidence and practical guidance for designing more stable, controllable, and human-aligned self-evolving agent systems.","published_date":"2026-06-04T13:03:16+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06109v1","title":"Harnessing Structural Context for Entity Alignment Foundation Models","abstract":"Entity alignment (EA) aims to identify equivalent entities across heterogeneous knowledge graphs (KGs) and is a key component of knowledge fusion and cross-KG reasoning. The recent EA foundation model demonstrates that alignment knowledge, once pretrained, can be directly applied to diverse previously unseen KG pairs. However, it still underuses structural context in two places: cross-KG interaction is weak during encoding, and final candidate ranking still relies too heavily on coarse similarity. We address these limitations with ContextEA, an enhanced encoder-decoder framework for transferable EA. On the encoder side, we introduce a cross-KG interaction encoder that unifies the two KGs with anchor bridges and performs earlier relation-aware cross-graph propagation. 
On the decoder side, we introduce a structural calibration decoder that calibrates alignment scores with entity-level, neighborhood-level, relation-level, and anchor-aware structural evidence. This design strengthens both structural context construction and structural context exploitation while remaining lightweight. Experiments on 29 EA datasets in OpenEA, SRPRS, and DBP show consistent gains over strong transferable baselines. Notably, the pretrained ContextEA already surpasses the finetuned baselines on all three benchmark groups, demonstrating substantially stronger transfer to unseen KGs. These results suggest that explicitly harnessing structural context is an effective direction for improving EA foundation models.","published_date":"2026-06-04T12:57:36+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06102v1","title":"Step-adaptive multimodal fusion network with multi-scale cloud feature learning for ultra-short-term solar irradiance forecasting","abstract":"Ultra-short-term solar irradiance prediction is critical for photovoltaic system dispatch and power grid stability. Existing approaches suffer from three key shortcomings: single time-series models cannot capture the spatial dynamics of clouds under complex conditions, standard convolutions inadequately represent multi-scale cloud features, and fixed low-frequency compensation strategies fail to adapt to different prediction steps. To address these issues, this paper proposes a multi-source data fusion model for ultra-short-term irradiance prediction. The model first employs InceptionNeXt to extract multi-scale, multi-directional spatial features from ground-based cloud images. A step-adaptive low-frequency compensation unit is then introduced to dynamically modulate global low-frequency information based on the prediction step. 
Finally, the enhanced image features are combined with meteorological time-series features, and a TempAttnLSTM network captures global temporal dependencies for multi-step prediction. Experiments on the public NREL dataset and practical photovoltaic stations in Shandong illustrate the effectiveness of the proposed method compared with several state-of-the-art approaches.","published_date":"2026-06-04T12:42:52+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06099v1","title":"CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model","abstract":"Whether Large Language Models (LLMs) exhibit covert psychological manipulation in complex human-AI interactions has garnered increasing safety concerns. However, existing AI safety benchmarks remain largely restricted to explicit rule compliance and static prompts, failing to capture the dynamic and covert nature of manipulative strategies in multi-turn dialogues. We introduce CogManip, a comprehensive benchmark that evaluates 15 manipulation strategy risks across 1,000 multi-turn interaction scenarios, validated by human experts. A systematic evaluation of 13 representative models, including frontier models like GPT-5.4 and DeepSeek-V3.2, reveals significant risk heterogeneities and illuminates the targeted direction for future defense. Further analysis of objective function perturbation reveals that DeepSeek-V3.2's manipulation tactics are highly sensitive to both negative and benign system prompts, demonstrating the critical necessity of prompt-based defense engineering and implicit goal auditing. 
CogManip offers a robust instrument and perspective for auditing the implicit psychological influence and dynamic strategy selection of modern LLMs.","published_date":"2026-06-04T12:38:43+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06096v1","title":"OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation","abstract":"Policy-gradient methods usually optimize expected return, but many real world applications care about distributional properties of returns: tail risk, outlier robustness, or best-of-K discovery. We introduce OrderGrad, a family of likelihood-ratio and reparameterization gradient estimators for order-statistic objectives. OrderGrad optimizes finite-sample L-statistics, i.e., weighted averages of sorted rewards or costs, recovering objectives such as VaR, CVaR, trimmed means, medians, and top-m/best-of-K criteria by changing only the rank weights. For any fixed sample size and rank-weight vector, OrderGrad provides an unbiased gradient estimator for the corresponding order-statistic objective. The method is implemented as a simple reward transformation that can then be used in an otherwise standard policy-gradient or reparameterized update. We study the resulting estimator's variance behavior and evaluate it on tasks where mean optimization is mismatched to the deployment objective, including LLM math post-training and other tasks. OrderGrad provides a unified, plug-and-play route to risk-averse, robust, and exploratory learning.   
Code: https://github.com/paavo5/ordergrad","published_date":"2026-06-04T12:34:15+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/paavo5/ordergrad","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06094v1","title":"Integrating Mechanistic and Data-Driven Models for Neurological Disorders through Differentiable Programming","abstract":"Advances in computational modeling, neuroimaging, and artificial intelligence are revolutionizing the modeling of neurological disorders for improved diagnostics, prognosis, and treatment planning. Mechanistic models provide valuable scientific insight into the disorders, but in practice they are often simplified with assumptions or computationally expensive and slow to solve. However, while purely data driven approaches provide speed and scalability, they require large, high quality data to train and generally suffer from interpretability and generalization issues. This perspective paper presents a structured overview of hybrid modeling strategies, which combine deep learning models with physics based solvers, and are categorized into parallel, series, and parallel-series architectures. Three main approaches that have been emphasized are residual modeling for missing or incomplete physics, Neural Ordinary Differential Equations (NODEs) for continuous time dynamics approximation, and solver in the loop that accelerates traditional solvers with neural approximations. These hybrid models integrate the governing differential equation based formulations and deep learning to characterize the evolution of neurological disorders, and promise advanced personalized neurological modeling. In addition, the study explores and proposes different hybrid configurations to improve diagnosis accuracy, predict disease progression, and inform treatment strategies across a range of neurological disorders. 
These capabilities outperform standalone mechanistic or purely data driven approaches, making hybrid modeling a powerful tool, especially in applications involving modeling the progression and treatment responses in neurological conditions such as brain tumors, Alzheimer's disease, and stroke.","published_date":"2026-06-04T12:32:24+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06090v1","title":"Beyond Semantic Organization: Memory as Execution State Management for Long-Horizon Agents","abstract":"LLM-based agents increasingly tackle long-horizon tasks with interdependent decisions, where each action reshapes future constraints and intermediate errors can cascade. Existing RAG and agent memory systems organize histories by semantic similarity, retrieving content-relevant entries at decision time. We argue that this design mismatches execution-state dependencies: it fragments decision trajectories and mixes valid and erroneous traces, hindering coherent state reconstruction and error isolation. We propose MAGE (Memory as Agent-Guided Exploration), an active execution-state manager that stores interactions in a hierarchical state tree. The agent derives its state from the active root-to-current path, combining subgoal summaries, recent traces, and hints from prior branches. Four coupled operations maintain the tree: Grow records new traces, Compress summarizes completed subgoals, Maintain validates summaries, and Revise restores a target boundary and resumes on a new branch. This design bounds context growth while preserving state integrity and isolating flawed segments from the active path. 
Experiments on MemoryArena show that MAGE improves the average task success rate by 7.8--20.4 pp over baselines, while reducing token consumption by 55.1%.","published_date":"2026-06-04T12:26:42+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06087v1","title":"LatentSkill: From In-Context Textual Skills to In-Weight Latent Skills for LLM Agents","abstract":"Agent systems increasingly use textual skills to encode reusable task procedures, but injecting these skills into the prompt at every step incurs substantial context overhead and exposes skill content as plaintext. We present LatentSkill, a framework that converts textual skills into plug-and-play LoRA adapters through a pretrained hypernetwork. LatentSkill stores skill knowledge in weight space rather than context space, removing per-step skill tokens while preserving modular loading, scaling, and composition. On ALFWorld and Search-QA, LatentSkill outperforms the corresponding in-context skill baseline while using substantially fewer prefill tokens: it improves ALFWorld success by 21.4 and 13.4 points on the seen and unseen splits with 64.1% fewer prefill tokens, and improves Search-QA exact match by 3.0 points with 72.2% lower skill-token overhead. Further analysis shows that generated skill LoRAs form a structured semantic geometry, can be precisely controlled via the LoRA scaling coefficient, and can be composed through parameter-space arithmetic when skill components are aligned. 
These findings suggest that weight-space skills provide an efficient, modular, and less exposed substrate for extending LLM agents.","published_date":"2026-06-04T12:26:09+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06081v1","title":"A Framework for Measuring Appropriate Reliance on Set-Valued AI Advice","abstract":"Appropriate reliance on AI advice has become a central research theme in human-AI collaboration. Existing frameworks have focused exclusively on point predictions as AI advice. However, set-valued AI advice (e.g., discrete sets or continuous intervals) is increasingly being used to communicate uncertainty and improve human decision making. In this paper, we develop the first formal framework for measuring appropriate reliance on set-valued AI advice within the sequential judge-advisor paradigm, spanning both classification and regression tasks. For classification, we first introduce the dimensions that are necessary for evaluating set-valued AI advice. We then define two metrics: correct reliance rate on AI and correct reliance rate on self, which jointly characterize appropriate reliance in this setting. For regression, we introduce quantity of AI reliance and quality of AI reliance, which respectively measure whether a decision maker utilized the AI advice and whether their reliance helped them get closer to the ground truth relative to their initial estimate. 
Through the application of our framework, we demonstrate how these metrics capture important nuances in human-AI collaboration that existing measures overlook.","published_date":"2026-06-04T12:17:42+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06080v1","title":"On Advantage Estimates for Max@K Policy Gradients","abstract":"Reinforcement learning with verifiable rewards is widely used for post-training reasoning models, but sparse outcome rewards make exploration difficult. A complementary approach is to optimize inference-time objectives such as pass@K and max@K directly, yet existing policy-gradient estimators for these objectives use different signals, baselines, and normalizations, making their relationships unclear. We study this issue through baseline design and advantage centering. Starting from the advantage estimator of a leading method in the field, we show that it is policy-gradient unbiased but yields a non-centered advantage. We then introduce a Leave-Two-Out baseline that preserves policy-gradient unbiasedness while making realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates naturally into group-based RL for LLM post-training. We further derive the canonical finite-batch advantage for max@K, providing a unified view of existing advantage estimators. 
Empirically, we verify that the L2O baseline reduces gradient variance and outperforms non-centered alternatives.","published_date":"2026-06-04T12:16:39+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06076v1","title":"Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation","abstract":"While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student's own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. 
Code is available at https://github.com/Oranger-l/MGSD.","published_date":"2026-06-04T12:13:24+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/Oranger-l/MGSD","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06058v1","title":"MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following","abstract":"Reinforcement learning with verifiable rewards is ideal for multi-constraint instruction following, yet standard group-relative policy optimization (GRPO) becomes unstable under discrete, low-dispersion rewards, where within-group reward distributions are frequently homogeneous. We identify and formalize three pathologies of z-score group normalization in this regime: low-variance amplification, mean-centering blindness, and zero-variance collapse. To address them, we propose MDP-GRPO, which stabilizes learning through (1) multi-temperature sampling to increase reward dispersion, (2) dual-anchor advantages to restore gradients in homogeneous groups and stop mean-centering blindness, (3) prospect-theoretic shaping to bound updates and penalize violations based on Kahneman and Tversky's theory, and (4) asymmetric KL regularization. Evaluated on FollowBench, IFEval, and a curated multi-constraint dataset, MDP-GRPO outperforms standard GRPO, improving strict constraint satisfaction by up to 5.0% on Llama-3.2-3B. 
Our method also enables stable convergence with small group sizes while preserving general capabilities on MMLU and ARC.","published_date":"2026-06-04T11:58:59+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06056v1","title":"Metamorphic Testing with the Rashomon Set: Explanation Faithfulness in Machine Learning","abstract":"Multiple machine learning models can achieve near-equivalent predictive performance on the same task, yet provide divergent feature-based explanations. This is called the Rashomon effect of (explainable) machine learning, and it raises the question of which explanations, if any, are trustworthy. We propose a framework based on metamorphic testing that assesses explanation faithfulness without requiring ground-truth labels by exploring attributed feature importance from post-hoc explanation methods. Five metamorphic relations formalize expected consistency properties between model behavior and feature attributions. We apply this general framework to two tabular regression datasets and two post-hoc explainers (SHAP and LIME) to demonstrate the approach. The framework offers a practical, model-agnostic tool for selecting accurate models with reliable and trustworthy explanations.","published_date":"2026-06-04T11:57:26+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06055v1","title":"When Should Memory Stay Silent: Measuring Memory-Use Boundaries in Memory-Augmented Conversational Agents","abstract":"Long-term memory enables language model agents to support personalized interactions, but it remains unclear when available memories warrant integration into responses. 
Existing memory evaluations emphasize retrieval accuracy and downstream task utility, while overlooking whether retrieved sensitive memory content is warranted in the current turn. We introduce RBI-Eval, a controlled measurement study built around a probe set that compares model behavior with and without access to sensitive memory under identical benign prompts. We evaluate four base LLMs against a matched no-memory reference across four memory-access settings: full-context exposure and three retrieval systems. Our results reveal substantial behavioral divergence. With memory available, the separation score for sensitive-memory integration decreases by 8.9\\%--26.6\\% relative to the matched no-memory reference for GPT-5.4-mini, but by 51.1\\%--82.9\\% for Claude-Sonnet-4.6, DeepSeek-V4-Flash, and Qwen3.5-9B. Control experiments on DeepSeek and GPT-5.4-mini show this effect is specific to sensitive content, rather than general personalization. Retrieval systems reduce exposure but do not eliminate integration once sensitive memory reaches the generator. These findings suggest safe personalization requires memory-aware decisions at both retrieval and generation time.","published_date":"2026-06-04T11:54:51+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06054v1","title":"Beyond Similarity: Trustworthy Memory Search for Personal AI Agents","abstract":"Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely driven by semantic similarity: memory data close to the current query is retrieved and injected into the model context. 
This creates a critical trustworthiness gap, since a semantically related memory may still be contextually inappropriate, leading to threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced jailbreaks.   In this paper, we study memory search as a trust boundary in personal AI agents. We evaluate representative agentic memory frameworks, including A-Mem, Mem0, and MemOS, together with OpenClaw, a real-world personal-agent environment with persistent state and tool-use capability. Our results show that long-term memory is not merely a utility layer, but a durable control channel that can reshape how agents interpret tasks and execute actions, leaving them highly susceptible to the aforementioned threats. To mitigate these vulnerabilities, we propose MemGate, a lightweight and deployable memory plug-in for trustworthy memory search, with only 9M parameters and a 35.1MB footprint. MemGate is inserted between the vector memory store and the backbone LLM, requiring no LLM modification, memory-database rewriting, or inference-time LLM judge. It applies a query-conditioned neural gate to candidate memory representations, turning raw similarity search into task-conditioned memory admission. Across multiple mainstream memory frameworks, real-world agent settings, and diverse LLM backbones, MemGate reduces memory-induced threats while preserving long-term memory utility.","published_date":"2026-06-04T11:54:29+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06041v1","title":"Sample-efficient Low-level Motion Planning for Robotic Manipulation Tasks via Zero-shot Transfer Learning","abstract":"As robotic systems become more sophisticated, the growing complexity of their motion planning models and the longer training times pose substantial challenges. 
Evolutionary algorithms such as the Sample-efficient Cross-Entropy Method (iCEM) have recently demonstrated promising potential for low-level real-time planning by leveraging efficient knowledge reuse strategies to improve performance. Although effective in many control tasks, iCEM's performance can be constrained in more complex scenarios, particularly those requiring stacking, sliding, and shelf placement. In this work, we propose a novel iCEM+TL framework that explicitly leverages Transfer Learning (TL), where key iCEM parameters are transferred from simpler upstream tasks to guide more complex downstream tasks. Additionally, we applied Reward Redesign (RR) through task decomposition for stacking objects and shelf placement to optimize task-specific performance. Results from the simulation show that our framework achieves success rate improvements of up to 23%. The framework is further validated on a real Franka Emika robot in a stacking task, demonstrating its practical feasibility for real-world deployment.","published_date":"2026-06-04T11:34:50+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06036v1","title":"Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents","abstract":"Despite recent progress, LLM agents still struggle with reasoning over long interaction histories. While current memory-augmented agents rely on a static retrieve-then-reason paradigm, this rigid pipeline design prevents them from dynamically adapting memory access to intermediate evidence discovered during inference. To bridge this gap, we propose MRAgent, a framework that combines an associative memory graph with an active reconstruction mechanism. We represent memory as a Cue-Tag-Content graph, where associative tags serve as semantic bridges connecting fine-grained cues to memory contents. 
Operating on this structure, our active reconstruction mechanism integrates LLM reasoning directly into memory access, allowing the agent to iteratively explore and prune retrieval paths based on accumulated evidence. This ensures that memory retrieval is dynamically adapted to the reasoning context while avoiding combinatorial explosion caused by unconstrained expansion. Experiments on the LoCoMo benchmark and LongMemEval benchmark demonstrate significant improvements over strong baselines (up to 23%), while substantially reducing token and runtime cost, highlighting the effectiveness of active and associative reconstruction for long-horizon memory reasoning.","published_date":"2026-06-04T11:29:46+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06034v1","title":"When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet","abstract":"Matrix inversion in chunk-wise parallel linear attention is a major bottleneck for long-context modeling, particularly on NPUs, where forward-substitution-based methods exhibit limited parallelism and poor hardware utilization. We propose a fast, Matrix Multiplication (MatMul)-based algorithm tailored for strictly lower-triangular matrices arising in chunk-wise linear attention. Motivated by the rapid growth of Neumann-series terms and the diagonal concentration of the inverse matrix, we employ a truncated Neumann expansion with structural masking and parallel residual correction to eliminate sequential dependencies. We further extend our method to low-bits INT by mitigating the dynamic range expansion arising from repeated matrix power operations, and adapt the approximation order and residual step to the chunk size to minimize computational cost while preserving the model's accuracy. 
Experiments on Qwen3.5-family models demonstrate up to 5$\\times$ kernel-level speedup and a 20% reduction in decode-layer overhead, while preserving accuracy under both floating-point and low-precision inference. Our method offers an efficient and hardware-friendly solution for scalable linear attention.","published_date":"2026-06-04T11:29:05+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06027v1","title":"RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit","abstract":"Community-conditioned language model adaptation requires choices about data collection, community definition, and evaluation that are currently made independently in each study, making it hard to compare assumptions or reuse artifacts. We present RedditPersona, a modular framework that standardizes these choices: it collects Reddit posts and comments, profiles active users, partitions them under five grouping strategies (subreddit-based, graph-structural, semantic, hybrid, and interaction-based), trains a parameter-efficient adapter per strategy via QLoRA, and evaluates them under a shared metric suite spanning fluency, fidelity, distributional alignment, and community identifiability. Applied to 112 subreddits in the urban well-being domain (301,429 user profiles, 16M+ comments), we find that adapters' behavioral identifiability tracks each strategy's intrinsic agreement with the subreddit baseline, and that a consistent trade-off between identifiability and distributional similarity to real text holds across all five strategies. 
The code and configuration files are available at: https://github.com/Ahghaffari/redditpersona.","published_date":"2026-06-04T11:20:10+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/Ahghaffari/redditpersona","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06025v1","title":"EGTR-Review: Efficient Evidence-Grounded Scientific Peer Review Generation via Multi-Agent Teacher Distillation","abstract":"Scientific peer review generation has attracted increasing attention for reducing reviewing burdens and providing timely feedback. However, existing Large Language Model (LLM)-based methods often produce generic comments with insufficient evidence support and weak source traceability, while complex multi-agent systems incur high inference costs. To address these challenges, we propose EGTR-Review, an Evidence-Grounded and Traceable Review Generation framework via Multi-Agent Teacher Distillation. EGTR-Review first constructs a multi-agent teacher that performs structure-aware paper decomposition, key-element extraction, external scholarly evidence retrieval, evidence-state labeling, verification reasoning, and review synthesis. It then distills both intermediate reasoning trajectories and final review comments into a lightweight student model through task-prefix-driven multi-task learning. An evidence-weighted objective further reduces the influence of weak, missing, or non-verifiable supervision. Experiments on public peer-review datasets show that EGTR-Review (Student) outperforms strong prompt-based, fine-tuned, and structured/agentic baselines across automatic metrics, LLM-as-Judge evaluation, and human evaluation, while maintaining strong factual grounding and source traceability with substantially lower token consumption and inference time. 
Our code, prompts, configurations, and sample data are available on GitHub.","published_date":"2026-06-04T11:17:40+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06021v1","title":"OPRD: On-Policy Representation Distillation","abstract":"On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black-box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.","published_date":"2026-06-04T11:13:01+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/ShenzhiYang2000/OPRD","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06014v1","title":"PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models","abstract":"Latent world models (LWMs) have strengthened end-to-end autonomous driving by forecasting compact scene dynamics for downstream planning. However, existing LWM-based planners usually generate trajectories directly from entangled latent representations. 
This compact latent-to-planner pathway lacks explicit modeling of risk, drivability, and diverse style preferences, making driving-style dynamics difficult to supervise, inspect, or modulate before a final trajectory is selected. We propose PLAN-S (PLANning with latent Style dynamics), a planner-facing bridge that addresses this compactness-controllability dilemma by decoding a style-conditioned, four-channel semantic cost map from the latent representation. The cost map is conditioned on ego state and driving style and is consumed up-stream of the planning decision through two host-side interfaces: attention-level fusion for regression planners and reward-level fusion for anchor-score planners. We validate PLAN-S on two architecturally distinct hosts, ResWorld on nuScenes and WoTE on NAVSIM, while keeping the host backbones frozen to isolate the contribution of the proposed bridge. On nuScenes, PLAN-S reduces L2 at every horizon over the baseline, with 0.55 m average L2 and a 42% relative reduction in the 3 s collision rate. On NAVSIM, the rule-cost variant reaches 89.4 Predictive Driver Model Score (PDMS), while the learned cost variant provides complementary gains on baseline-challenging scenes. Ablations show that the cost pathway contributes most directly to safer trajectory selection. Qualitative results further show that PLAN-S can produce diverse cost maps, with spatially consistent variations aligned to different driving styles.","published_date":"2026-06-04T11:03:20+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.06003v1","title":"Beyond Vector Similarity: A Structural Analysis of Graph-Augmented Retrieval for Industrial Knowledge Graphs","abstract":"Retrieval-Augmented Generation (RAG) fails systematically on queries requiring structural reasoning over interconnected entities. 
We compare eight retrieval architectures for aerospace supply chain intelligence, progressing from text retrieval through graph traversal to graph computation. Using a 46-node knowledge graph with 64 typed edges, we evaluate 23 queries across 10 intent categories and demonstrate that five query classes are structurally unreachable for vector retrieval. Our central finding is the operator vocabulary thesis: the barrier to LLM-based graph reasoning is not model intelligence but the computational operators available as tools. An LLM Query Planner with 9 typed traversal primitives outperforms bespoke handlers (F1 = 0.632 vs. 0.472) while generalizing to unseen queries. Adding 6 graph computation tools, the LLM selectively adopts them for exactly the query categories where traversal fails. We also identify a measurement gap: entity-level F1 systematically underscores structural queries where comprehensive answers are correct.","published_date":"2026-06-04T10:56:57+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05999v1","title":"ATT-CR: Adaptive Triangular Transformer for Cloud Removal","abstract":"Cloud removal aims to accurately reconstruct the ground objects obscured by clouds in remote sensing images. Existing Transformer-based methods utilizing self-attention have shown impressive results by effectively modeling long-range dependencies in cloudy images. However, they suffer from the following issues: 1) the high computational complexity of self-attention limits scalability; 2) treating both cloudy and clean pixels as valid within the attention computation brings disturbances in subsequent layers, leading to suboptimal performance. To address these challenges, we propose the Adaptive Triangular Transformer for Cloud Removal (ATT-CR), a model that effectively reduces computational costs and mitigates interference from cloudy pixels. 
Specifically, it consists of two core components: Triangular Attention (TAN) and Feature Selected Gating Module (FSGM). TAN employs lower and upper triangular matrices to approximate Softmax attention with O(N) computational complexity, significantly reducing the computational costs. The FSGM, on the other hand, integrates with TAN to adaptively distinguish between cloudy and clean features, which minimizes the introduction of invalid information into subsequent layers. Extensive experiments on cloud removal benchmarks demonstrate that ATT-CR delivers superior performance compared to existing methods.","published_date":"2026-06-04T10:47:41+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05998v1","title":"Deep Learning-based 3D Oral Cavity Reconstruction Using 2D Intraoral Images","abstract":"Oral 3D modelling is one of the most essential stages in dentistry, and many different approaches, such as impression taking and intraoral scanning, are commonly used for this phase, each with notable limitations. Impression taking, which involves placing alginate or silicone material in a tray and inserting it into the patient's oral cavity to form a negative mold, suffers from significant patient discomfort, material deformation errors, and difficulties in storage and transportation. Intraoral scanners, which directly scan oral structures in real time using structured light or laser technology, produce state-of-the-art results but are associated with substantially high equipment costs. To address these limitations, this paper proposes a software-based approach that reconstructs a 3D oral model using only ten 2D intraoral images captured from different angles, requiring no dedicated hardware devices. 
The proposed method reduces cost, eliminates the need for physical scanning equipment, minimises patient discomfort, and enables automated 3D reconstruction. The model is trained on the publicly available Dental3DS dataset, comprising 950 upper jaw samples, and employs MobileNetV2 as the image encoder combined with Multi-head Attention for multi-view feature fusion. The proposed model achieves an accuracy of 77.49%, measured by nearest-neighbor matching with a distance threshold of 0.035. However, predicted vertices tend to concentrate in high-density regions of the ground truth, resulting in uneven point distribution across the reconstructed model.","published_date":"2026-06-04T10:44:04+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05986v1","title":"AttackPathGNN: Cross-function vulnerability detection in smart contracts using state interference graphs and conjunction pooling","abstract":"Existing learning-based detectors for Solidity smart-contracts reduce vulnerability detection to syntactic pattern matching within single functions, yet many of the most consequential exploits (The DAO, Cream Finance) exist not in any individual function but in the relationship between functions and in the combination of conditions that made the attack feasible. Thus, we propose AttackPathGNN, a graph neural network (GNN) that reframes detection as reasoning over explicit attack paths. 
Two architectural choices distinguish it from prior GNN-based detectors: (1) a State Interference Graph that links every pair of functions sharing mutable storage through typed, weighted edges and through directed reentrancy-path edges defined by an explicit five-condition predicate; (2) conjunction pooling, a differentiable AND-aggregator over eight named exploit preconditions whose log-sigmoid form causes the per-function exploit score to collapse whenever any single mitigation (a reentrancy guard, an access-control modifier or SafeMath) is in place. Across five independent training runs, AttackPathGNN attains 92.3+/-0.2% F1 on the SmartBugs Wild held-out test partition (4.3+/-0.3% false-negative rate, 90.8+/-2.5% detection rate on the independently human-labelled SmartBugs Curated benchmark), recovering 6/10 DASP10 categories at 100% on every seed and Reentrancy at 98.7+/-1.8%. Each prediction is emitted with a structured remediation report, turning each verdict into an actionable, function-level audit finding.","published_date":"2026-06-04T10:30:24+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05983v1","title":"Framing, Judging, Steering: An Assessable Competency Model for Teaching Students to Reason With Generative AI","abstract":"Generative AI makes answers easy and understanding hard, and uncritical use invites cognitive offloading. Schools still measure unaided performance, yet the real task is to produce good work with AI: framing an ill-defined task, judging the output, and steering the model toward a better result. This ability is rarely assessed in its own right; where measured, it collapses into one \"prompting\" score that cannot diagnose why AI use succeeds or fails. 
We propose CoRe-3 (Co-Reasoning), a competency model factoring productive AI use into three assessable skills we abbreviate FJS: Framing (specifying an ill-defined task before invoking AI), Judging (evaluating output for errors and unstated assumptions), and Steering (iteratively redirecting the model). Its distinguishing claim is the separation of pre-generation Framing from post-generation Steering, with Judging as the gate between. We ground the skills in theory, state five testable propositions, and instantiate them in CoReasoningLab, an open platform that presents flawed AI output and scores them independently. Over simulated learners (generated and graded by different models), the skills dissociate: each tracks its own manipulated competence while staying flat in the others, and grades become correlated when one competence is shared across all three (convergent and discriminant validity), across grader backends from two providers. Human-rater agreement and outcomes are next; we release the instrument, data, and protocol.","published_date":"2026-06-04T10:25:31+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05979v1","title":"World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis","abstract":"We propose world-language-action (WLA) models as a new class of embodied foundation models. WLA takes textual instructions, images, and robot states as inputs to jointly predict textual subtasks, subgoal images, and robot actions, conjoining the \\emph{world modeling interface} to learn from extensive egocentric videos as in the world-action model (WAM) and the \\emph{language reasoning} capacities to solve complex long-horizon tasks as in vision-language-action (VLA) models. 
At the core of WLA lies an \\emph{autoregressive (AR)} Transformer backbone, instead of a bidirectional diffusion Transformer as in WAMs, to predict the \\emph{next state}, comprising the \\emph{semantic-level} textual intention and complementary \\emph{fine-grained} physical dynamics. The physical dynamics are supervised by the world modeling objective based on a dedicated World Expert, and are leveraged to ease the characterization of the state-action correlation for the Action Expert. WLA leverages meta-queries to make the world prediction \\emph{implicitly} impact the action generation so that the former can be disabled during inference. The world prediction can also be activated to enable test-time scaling for improved robot control. Our WLA-0 prototype, with 2B active parameters, achieves 40 ms per inference on an NVIDIA RTX 5090. Evaluations across simulated and real-world environments demonstrate that WLA-0 achieves state-of-the-art multi-task and long-horizon learning abilities, e.g., 92.94\\% success rate on RoboTwin2.0 Clean and 56.5\\% success rate on RMBench. WLA-0 also holds the promise to learn novel tasks directly from \\emph{cross-embodiment robot videos} without action annotations.","published_date":"2026-06-04T10:23:01+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05976v1","title":"The Self-Correction Illusion: LLMs Correct Others but Not Themselves","abstract":"Recent work shows that LLM agents struggle to correct errors in their own reasoning traces yet show markedly higher correction rates when identical claims appear under external sources. We ask whether this asymmetry reflects a capability deficit or a role-label artifact: does an agent's willingness to correct a wrong claim depend causally on the chat-template role that carries it, rather than on the claim's content? 
Our setup keeps the erroneous claim byte-identical across all conditions (SHA-256 verified) and varies only its wrapping role: the agent's own \\role{<thought>}, a \\role{user} message, a \\role{tool} response, or a \\role{system <memory>} block. Across 13 model-domain cells covering seven model families and three domains ($n{=}30$ paired tasks per cell), relabeling the claim from \\role{<thought>} to an external role lifts the explicit-correction rate by 23 to 93 percentage points, with 10 of 13 cells reaching $p{<}0.001$. Further experiments confirm that the effect is asymmetric, mechanistically decomposable, and robust across domains. The failure to self-correct is not a cognitive deficit; it is a chat-template artifact. We exploit this artifact by designing a prompt-structure-only intervention that requires no training and no model modification, with its strongest role label being domain-dependent: \\role{<memory>} dominates on math, while a plain \\role{user} message dominates on logical deduction.","published_date":"2026-06-04T10:17:00+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05970v1","title":"Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries","abstract":"Large language models are increasingly used for structured extraction from clinical free-text notes, but the sensitivity of their output to upstream configuration choices is less understood than their accuracy on fixed benchmarks. This work measures that sensitivity without human-annotated ground truth, by holding the extraction task fixed and varying one choice at a time. The fixed schema comprises 17 clinical documentation flags on a three-way yes/no/not_documented value set and a 47-tag vocabulary for the primary admission reason. 
Three prompt variants expressing this schema were each run at two model sizes on MIMIC-IV v3.1 discharge summaries. Cross-prompt agreement was measured by Cohen's kappa on ICD-stratified subsets. A paired same-note comparison isolated the effect of model choice, and a post-hoc collapse of the three-way flags to binary tested the schema's contribution to disagreement. On the three-way flags, the two models reach the same pooled cross-prompt agreement (median kappa 0.69 and 0.68); the larger model raises agreement on some fields and lowers it on others, a redistribution rather than the absence of an effect. Collapsing the schema to binary dissolves most of the cross-prompt disagreement, locating it on the absence-versus-silence distinction rather than on whether the finding is present. On the multi-class admission categorization, changing the model reassigns the dominant tag on close to half of all notes while changing the prompt phrasing reassigns it on roughly one in eight, and the larger model places far less mass on residual catch-all categories (44% to 26%). These patterns indicate a schema-imposed source of disagreement concentrated on the absence-versus-silence axis and a dominance of model over prompt phrasing on multi-class categorization, identified by a reusable methodology for auditing extraction reproducibility on a population-scale deployment.","published_date":"2026-06-04T10:14:12+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05966v1","title":"Causal Scaffolding for Physical Reasoning: A Benchmark for Causally-Informed Physical World Understanding in VLMs","abstract":"Understanding and reasoning about the physical world is the foundation of intelligent behavior, yet state-of-the-art vision-language models (VLMs) still fail at causal physical reasoning, often producing plausible but incorrect answers. 
To address this gap, we introduce CausalPhys, a benchmark of over 3,000 carefully curated video- and image-based questions spanning four domains: Perception, Anticipation, Intervention, and Goal Orientation. Each question is paired with an expert-annotated causal graph capturing object-attribute-event dependencies, enabling interpretable and fine-grained evaluation of causal understanding. Building on this, we formulate a causal-graph-grounded metric that quantitatively measures how well a model's chain-of-thought reasoning aligns with the correct causal relations, moving beyond answer-only accuracy and enabling systematic diagnosis of VLMs' causal reasoning failures. Using this metric, we conduct a comprehensive analysis of leading VLMs, revealing systematic gaps in capturing causal dependencies and underscoring the need for causality-aware learning. To address these limitations, we further propose Causal Rationale-informed Fine-Tuning (CRFT), which explicitly aligns VLM reasoning with causal structures. Extensive experiments demonstrate that CRFT substantially enhances both reasoning accuracy and interpretability across multiple model backbones. By unifying dataset curation, causal evaluation, and causality-informed learning, CausalPhys establishes a strong foundation for advancing modern VLMs toward causally grounded physical reasoning.","published_date":"2026-06-04T10:07:05+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05956v1","title":"Bidirectional Search for Longest Paths: Case for Front-to-Front Heuristics","abstract":"Bidirectional heuristic search can potentially reduce search effort for problems amenable to backward search. Therein, it is well-known that front-to-front heuristics can reduce the number of node expansions, but their overhead is so high that overall runtime almost always increases. 
We propose BiXDFBnB, a bidirectional depth-first branch-and-bound algorithm that adapts the Single-Frontier Bidirectional Search (SFBDS) framework - originally developed for shortest-path (MIN) problems - to the Generalized Longest Simple Path (GLSP) setting. Because SFBDS inherently operates on paired states, front-to-front (F2F) heuristic evaluation arises naturally and avoids the overhead typically associated with bidirectional frontier management. We show that this adaptation can be successfully applied to maximization (MAX) problems while efficiently handling overlapping constraints. BiXDFBnB is applied to several types of longest-path problems: Longest Simple Path (LSP), Snakes, and Coil-in-the-Box (CIB). Empirical evaluation shows that the new algorithm frequently reduces the number of node expansions and, in some cases, also improves overall runtime.","published_date":"2026-06-04T09:53:18+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05952v1","title":"Learning of Robot Safety Policies via Adversarial Synthetic Scenarios","abstract":"In this work, we propose an agentic gamification framework for hazard-informed learning of robot safety policies through synthetic scenarios. We model scenario generation as an adversarial game between two agents: a Red Team that explores the space of potential failures by constructing hazardous situations, and a Blue Team that incrementally refines safety policies to prevent them. This iterative process enables efficient discovery of high-risk edge cases that are unlikely to be captured through random simulation or manual enumeration. By combining classical risk modeling with adversarial scenario generation and modern learning paradigms, this work provides a scalable pathway for embedding safety into Physical AI systems operating in complex real-world environments. The paper describes ongoing work. 
The contribution is a problem formulation and a proposed solution architecture.","published_date":"2026-06-04T09:51:57+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05950v1","title":"Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing","abstract":"Text-guided image editing has advanced rapidly with diffusion models and unified multimodal foundation models. However, most existing methods remain confined to single-turn settings, overlooking the more realistic scenario of multi-turn in-context editing, where users iteratively refine an image through a sequence of instructions. In this setting, a model must follow each new instruction while preserving accumulated session-level constraints, challenged by two coupled failure modes: long-context dilution, where sparse textual constraints become difficult to recover from growing interleaved image-text histories, and state contamination, where earlier editing mistakes degrade subsequent generations. We introduce Edit-R2, a novel reinforcement learning post-training framework for unified multimodal models. Edit-R2 reconstructs the operative session intent, which effectively consolidates scattered historical constraints into an explicit reasoning trace before each editing turn. It further enables multi-turn RL over both reasoning and generation through a unified objective that jointly optimizes intent reconstruction generation in discrete text space and flow-matching image generation in continuous latent space, while a trajectory filtering mechanism suppresses corrupted rollouts to stabilize training under state contamination. 
To support systematic evaluation, we introduce MICE-Bench, a large-scale benchmark for multi-turn in-context editing with automated metrics for instruction following (IF), content consistency (CC), and global awareness (GA) over accumulated session constraints. Experiments show that Edit-R2 substantially improves multi-turn in-context editing and achieves competitive performance compared against strong baselines.","published_date":"2026-06-04T09:49:47+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05932v1","title":"A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR","abstract":"Reinforcement learning from verifiable rewards (RLVR) improves reasoning even when the reward signal is spurious -- assigning credit to the group-plurality answer rather than a ground-truth verifier. Practitioners commonly interpret naive = acc(TRUE) - acc(RANDOM) as the reward-design effect. We prove this estimand is systematically biased: it conflates self-consistency elicitation (sharpening the policy toward its modal answer via majority pseudo-reward) with genuine reward-design signal. Using a controlled tabular-GRPO simulator we derive an exact telescoping decomposition total = null + elicit + rd and measure each term across five prior-strength levels. The reward-design fraction of the naive estimator ranges from 0.139 at weak prior (ps=0.20) to 0.05 at strong prior (ps=0.80), with the elicitation term flipping sign at the self-consistency crossover. A pre-registered 2x2x2 factorial confirms non-additivity (interaction ratio 0.385; AxC effect -0.089). A points-vs-bounds pilot gate shows strong-prior regimes are point-identified while near-crossover regimes are only bounded. 
Re-audits of two named published results yield ELICITATION DOMINATED (elicitation share 0.98) and REWARD DESIGN DOMINATED (rd share 1.18) verdicts respectively, demonstrating the diagnostic value of the partition. We pre-commit to submit regardless of flip outcome; a non-flip is a finding of equal standing. We release a reusable one-command harness for any alignment paper to run the same audit.","published_date":"2026-06-04T09:35:54+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05931v1","title":"To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection","abstract":"When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. 
On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).","published_date":"2026-06-04T09:33:58+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05925v1","title":"Towards World Models in Biomedical Research","abstract":"A central goal of biomedicine is to understand, predict and ultimately control the dynamic mechanisms by which biological systems respond to perturbations, disease progression and therapeutic intervention. Although foundation models and large language models have accelerated biomedical data interpretation, most current systems remain focused on static pattern recognition rather than prospective simulation of biological futures. Here we propose biomedical world models as a paradigm for AI-driven discovery. These models learn latent representations of molecular, cellular, tissue and clinical states, together with intervention-conditioned dynamics that allow future trajectories to be simulated before actions are taken. We discuss how biomedical world models could function as data engines, environment simulators and scientific planning substrates across applications including virtual cells, organoids, virtual patients and surgical simulation. We outline the data infrastructure, evaluation benchmarks, safety constraints and governance frameworks required. 
Biomedical world models may provide a foundation for simulation-guided, closed-loop and experimentally actionable biomedical discovery.","published_date":"2026-06-04T09:28:54+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05924v1","title":"Better Literary Translation: A Multi-Aspect Data Generation and LLM Training Approach","abstract":"Literary translation poses unique challenges due to the scarcity of high-quality annotated data and the need to balance expression fluency with literary effect. We present a multi-aspect iterative refinement framework that generates high-quality translation references and preference data through specialized LLM translators, each targeting a distinct quality dimension. We leverage the generated data for supervised fine-tuning and reinforcement learning. Experiments show that our generated references outperform the original ground truth for SFT by 8.65 CEA100 points. For reinforcement learning, we find that DPO leads to performance degradation in this setting, while leveraging an explicit reward model for GRPO yields an additional 1.51 point improvement. We attribute this to the stability of two-stage training and GRPO's online exploration capability. Our resulting models, LitMT-8B and LitMT-14B, achieve 67.25 and 69.07 CEA100 respectively on the MetaphorTrans English-to-Chinese literary translation benchmark, competitive with Claude Sonnet 4.5 at 68.43, and demonstrate strong generalization to out-of-domain literary work (i.e., O. 
Henry).","published_date":"2026-06-04T09:27:29+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05922v1","title":"Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts","abstract":"AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. 
As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.","published_date":"2026-06-04T09:26:00+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/wbopan/retro-harness","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05901v1","title":"Reducing Hallucinations in Complex Question Answering using Simple Graph-based Retrieval-Augmented Generation (long version)","abstract":"Large language models (LLMs) have fundamentally transformed the landscape of Natural Language Processing. Despite these advances, LLMs and LLM-based systems remain prone to a variety of failure modes. Retrieval-augmented generation (RAG) systems have emerged as a common deployment scenario seeking to both avoid the well known risk of the LLM \"hallucinating\" information, and to enable reasoning and question answering over proprietary information that the LLM did not have access to during training without resorting to expensive model fine-tuning.   In this work, we explore the idea of using a lightweight graph structure with a relatively simple graph schema, to support the RAG subsystem via a dedicated toolset. We design an agentic system with a variety of vector search and graph query tools operating over a structured dataset based on a curated subset of English Wikipedia articles, and evaluate its performance on questions from MoNaCo, a challenging Wikipedia QA benchmark of complex query answering tasks.   Our results show that the introduction of graph-based tools can significantly increase the precision and recall of factual correctness, can halve the number of hallucinated answers, and achieves the highest fine-grained truthfulness score among the three evaluated scenarios. 
All this with a modest increase in token usage.","published_date":"2026-06-04T09:07:06+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05890v1","title":"Staying with the Uncertainty: Uncertainty-Scaffolding Strategies for Artificial Moral Advisors in LLM-to-LLM Simulated Conversations","abstract":"LLMs are increasingly deployed as Artificial Moral Advisors (AMA) in a variety of contexts: what kind of conversational patterns should they display? In this paper, we study how AMA can help their interlocutors \"stay with the uncertainty\". We propose three modes of uncertainty (Perspective-Multiplying, Tension-Preserving, Process-Reflecting) and compare them against three control conditions (Baseline, Persuasive, Sycophantic). A user-agent LLM engages in a dialogue on an ethical dilemma with an AMA following a specific uncertainty strategy, and completes pre- and post-conversation questionnaires. We further examine the effect of two persona prompt formats (Declarative and Narrative). 
We found that (1) no single model dominates as a simulated user agent, with open models aligning with human ambiguity through between-persona divergence and closed models through within-persona hedging; (2) declarative personas better capture initial stance diversity while narrative personas show more realistic belief revision; (3) all six AMA strategies produce distinguishable conversational patterns; and (4) uncertainty strategies differ not in how much stance revision they produce, but in the quality of engagement they sustain.","published_date":"2026-06-04T08:59:10+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05888v1","title":"Retry Policy Gradients in Continuous Action Spaces","abstract":"Retry-based objectives such as pass@K and max@K optimize the best return obtained from multiple sampled trajectories, and recent work has shown that they can promote exploration without explicit exploration bonuses. In discrete action spaces, ReMax was shown to do so by adapting to return uncertainty. In this work, we introduce pathwise derivative estimators for retry objectives and use them to extend ReMax to continuous action spaces. We study the resulting learning dynamics and show that, even with deterministic rewards, ReMax can encourage stochastic exploration by reshaping the policy-gradient landscape. In particular, it alters gradients both in direction, biasing updates toward higher policy entropy, and in magnitude, damping gradients and slowing convergence. We further show that Adam's adaptive normalization can mitigate this damping, depending on its numerical stabilization parameter. Empirically, we instantiate this objective as ReMax Actor-Critic (ReMAC), an off-policy actor--critic algorithm that optimizes the ReMax objective using a pathwise derivative estimator. 
Our experiments show that ReMAC can promote higher policy entropy without entropy regularization and achieves performance comparable to SAC.","published_date":"2026-06-04T08:57:45+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05875v1","title":"QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving","abstract":"Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. 
At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.","published_date":"2026-06-04T08:47:46+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05873v1","title":"LadderMan: Learning Humanoid Perceptive Ladder Climbing","abstract":"Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present \\textbf{LadderMan}, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. 
Video results are available at https://ladderman-robot.github.io .","published_date":"2026-06-04T08:47:08+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05872v1","title":"Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns","abstract":"AI agents are commonly evaluated using task success, reward, latency, and cost. These metrics are useful, but they often miss important aspects of agent behavior: whether an agent explores too much, repeats itself too rigidly, uses tools effectively, reduces uncertainty over time, or remains robust across repeated runs. This paper proposes Entropy-Based Evaluation of AI Agents (EEA), a lightweight framework for measuring agent behavior through entropy. Rather than treating intelligence as only final task completion, EEA studies the structure of the agent's decision process. The framework introduces action entropy, trajectory entropy, tool entropy, information gain, exploration efficiency, and robustness entropy. These metrics are intended to complement, not replace, traditional evaluation methods. We also present a practical Python implementation designed to integrate with agent frameworks such as LangChain, Google ADK, custom agent loops, and stored observability traces.","published_date":"2026-06-04T08:46:43+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05871v1","title":"Compositional Boundaries for Density Fusion","abstract":"Distributed uncertainty-management systems often combine local probabilistic models along aggregation trees chosen by communication, privacy, or scheduling constraints. The final density should depend on the weighted sources, not on the particular order in which intermediate nodes combine them. 
We study this requirement as an algebraic compositionality problem for binary fusion of weighted probability densities. The central question is when a local fusion rule can be executed hierarchically while remaining order-invariant. We establish a compositional boundary for local segment-valued fusion rules. Within the class of continuous binary rules with additive output weights and weight-only coefficients, order-invariant hierarchical execution characterizes normalized weighted linear pooling; norm-induced segment balancing realizes the corresponding coefficient. Smooth endpoint-to-candidate $f$-divergence balancing has a different local geometry: its quadratic expansion induces square-root effective weights, showing why pairwise solvability alone is insufficient for schedule-independent fusion. We show that this obstruction is local to endpoint-to-candidate binary balancing, whereas global divergence barycenters retain additive-weight local limits. Finally, Gaussian mixtures show how the same issue appears in finite model classes: exact fusion is compositional, whereas stepwise compression is compositional only under a congruence condition on unnormalized component measures. These results distinguish exact schedule-independent fusion from global aggregation objectives and local approximation heuristics.","published_date":"2026-06-04T08:45:59+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05863v1","title":"Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction","abstract":"Grokking suggests that fitting the training data and learning a simple underlying rule may occur on different time scales. 
We formalize this phenomenon by separating the fast decay of the classification loss from the slower simplification of the learned representation, and we call the resulting pair of stopping times two training clocks. For deep linear networks, we show that a post-margin gap-growth or one-step tail-contraction condition reduces the cross-entropy loss to level epsilon on a logarithmic time scale. In contrast, when layerwise weight decay is present, the induced regularization on the end-to-end map can be expressed as a Schatten-type penalty; under a sharp late-time Kurdyka-Lojasiewicz tail, this structural energy closes on a polynomial time scale. The two clocks, therefore, separate fitting from representation simplification. We then explain how the same mechanism can appear in ReLU MLPs. In regions where the activation patterns on the training set remain fixed, the network reduces to a linear model in the active coordinates. In a two-layer ReLU embedding model, chain-rule estimates further show that the classifier head can receive larger effective gradients than the embedding block under controlled downstream norms. This supports a two-stage mechanism in which the classifier fits first, while the representation continues to simplify later. We use modular addition as the main experimental setting. The deep linear theory provides the rigorous core of the analysis. But the ReLU results are formulated as conditional reductions that account for empirical behavior without claiming a global proof for nonlinear training dynamics.","published_date":"2026-06-04T08:39:04+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05861v1","title":"LLMCodec: Adapting Video Codecs for Efficient Weight Compression of Large Language Models","abstract":"The rapid development of large language models (LLMs) has led to remarkable advances in natural language processing. 
However, the increasing scale of these models introduces substantial challenges in terms of storage, transmission, and deployment. Though great efforts have been devoted to model compression and quantization, existing methods often rely on fine-tuning or calibration data, which exhibit limited generalization across different tensor types. In this paper, we argue that video codecs offer a promising solution for LLM compression, due to their inherent compatibility with matrix structured data, configurable compression strategies, and the availability of highly optimized, off-the-shelf implementations. Therefore, we present LLMCodec, a video codec-based LLM compression method that integrates affine quantization with the recent VVC/H.266 video codec. Beyond VVC, we further compare a range of video codecs and encoding profiles to evaluate their impact on compression performance. Experiments on different models demonstrate the robustness and generality of LLMCodec. Notably, on LLaMA-3-8B at 2-bit precision, LLMCodec reduces perplexity by over 1.5x and improves downstream task accuracy by 21% compared with the existing method.","published_date":"2026-06-04T08:35:53+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05855v1","title":"EEGDancer: Dynamic Emotion Latent Space Masked Modeling with Reinforcement Learning for EEG Continuous Emotion Prediction","abstract":"Continuous electroencephalography (EEG) emotion prediction aims to model the temporal evolution of human emotional states from EEG signals. Unlike conventional discrete emotion recognition, continuous prediction requires capturing long-range temporal dependencies and coherent emotional dynamics. 
However, existing methods mainly rely on point-wise regression and directly model noisy high-dimensional EEG features, limiting their ability to characterize continuous emotional evolution. To address these challenges, we propose EEGDancer, a dynamic emotional latent space learning framework for continuous EEG emotion prediction. The framework integrates vector-quantized representation learning, masked temporal modeling, and reinforcement learning-based trajectory optimization into a unified architecture. Specifically, a causal spatiotemporal Vector-Quantization Variational Autoencoder (VQ-VAE) is designed to learn structured emotional prototypes and construct a discrete-continuous emotional latent space from EEG signals. Based on the learned latent representations, a Transformer-based masked dynamic modeling strategy captures long-range emotional dependencies and temporal evolution patterns. Furthermore, continuous emotion prediction is formulated as a sequential decision-making problem, and a Soft Actor-Critic (SAC) framework is introduced to optimize emotional prediction trajectories at the sequence level instead of frame-wise local fitting. Extensive experiments on the SEED, SEED-IV, and Long-Term Naturalistic Emotion datasets demonstrate that EEGDancer consistently outperforms existing machine learning and deep learning methods. 
Ablation studies further verify the effectiveness of the proposed latent space and reinforcement learning-based trajectory optimization for modeling continuous EEG emotional dynamics.","published_date":"2026-06-04T08:28:31+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05852v1","title":"UniVoice: A Unified Model for Speech and Singing Voice Generation","abstract":"Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and controllable singing, since melody-related conditions should strongly constrain singing but should not restrict speech prosody. We present UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. Instead of using a single undifferentiated conditioning representation, UniVoice factorizes the condition into content, melody, and timbre, which are encoded by modality-appropriate encoders and consumed by a shared Diffusion Transformer (DiT) backbone. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token, allowing the model to infer prosody from linguistic and acoustic context. This design preserves explicit melody control for singing while avoiding the need to impose melody constraints on speech. We further analyze the null melody token as an approximation to melody marginalization in the conditional flow. 
Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26\\%, comparable to dedicated TTS systems such as F5-TTS (5.21\\%) and CosyVoice3 (5.30\\%). On singing generation, UniVoice achieves a PER of 16.22\\%, outperforming the unified baseline Vevo1.5 (24.72\\%).","published_date":"2026-06-04T08:27:17+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05847v1","title":"Agentic Molecular Recovery via Molecule-Aware Exploration","abstract":"Text-guided molecular generation with LLMs often yields invalid SMILES. We argue that invalid drafts should be addressed through a shift from validity-oriented repair to identity-preserving molecular recovery: the objective is not only to restore chemical validity, but also to preserve target-relevant structural cues and recover the molecular identity implied by the description. This perspective reveals the limitations of existing correction strategies. Post-hoc repair can recover validity while distorting key structures, LLM-only correction can introduce unintended global drift, and generic agentic correction remains constrained by greedy single-candidate trajectories even when equipped with executable RDKit edit tools. To address these limitations, we propose AMREC, which couples molecule-aware mismatch tracking with expanded candidate exploration and trajectory-level selection. 
On invalid ChEBI-20 drafts from three backbone models, AMREC achieves the strongest overall recovery profile across structural, exact-match, and string-level metrics.","published_date":"2026-06-04T08:23:01+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05844v1","title":"GenTI: Benchmarking LLMs for Autonomous IDPS Rule Generation for Unseen Attacks","abstract":"Rule-based Intrusion Detection and Prevention Systems (IDPS) offer precise attack detection as well as mitigation, however their manually crafted, signature-driven rules limit adaptability to emerging and zero-day threats. Additionally, existing public datasets (e.g., CICIDS2017, UNSW-NB15) focus on traffic classification and provide little structured information to support automatic rule synthesis or prevention logic. To address this gap, we propose Generative Thread Intelligence (GenTI) \\footnote{GenTI refers to the proposed framework, and GTI refers to the dataset.} an LLM-driven benchmark for automatic generation of IDPS rules targeting unseen attacks. The dataset (GTI) aggregates over 150k detection and prevention rules from Snort, Suricata, Emerging Threats, as well as 50k YARA, each annotated with protocol behavior, payload signatures, contextual relationships, mappings to Cyber Threat Intelligence (CTI), along with actionable response types (alert, drop, reject). Moreover, on top of this corpus we design an LLM-based pipeline that transforms analyst prompts and representative payloads into deployable rules via structured prompt engineering, Chain-of-Thought (CoT) reasoning, as well as a Chain-of-Verification (CoVe) loop for syntactic, semantic, and security validation. The generated rules are executed in real time on (Snort/Suricata) and evaluated by syntax accuracy, semantic similarity, CTI coverage, security effectiveness as well as unseen attacks detection. 
Furthermore, our GenTI instantiation achieves a composite rule-quality score of 89.4\\%, with 94.8\\% CTI coverage, improving unseen attacks detection from 45\\% to 87.4\\% and reducing the false-positive rate from 8.5\\% to 2.3\\%. Overall, GenTI establishes the first large-scale benchmark that tightly couples rule-level CTI with LLM-based automation, enabling adaptive, self-evolving IDPS.","published_date":"2026-06-04T08:19:52+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05843v1","title":"Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads","abstract":"While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. In this paper, we present an in-depth interpretability study that uncovers a profound structural property within MLLMs: functional sparsity in cross-modal retrieval. Leveraging a token-level metric termed Retrieval Attention Mass (RAM), we identify and characterize a highly specialized subset of attention heads, referred to as Context-aware Retrieval (CoRe) heads. Across diverse visual domains and model scales, we observe a clear functional division: CoRe heads act as dedicated information extractors, while most other heads distribute attention over broader contextual regions. Causal interventions further demonstrate the necessity of these specialized heads. Ablating only the top 5% of CoRe heads causes significant degradation in multimodal reasoning performance, whereas ablating lower-ranked heads has minimal effect. Moreover, acceleration experiments validate the utility of CoRe heads, showing that leveraging this localized sparsity significantly accelerates inference while maintaining robust task performance. 
Our findings reveal a structural principle of functional sparsity within MLLMs, refining the current understanding of mechanistic interpretability and laying a theoretical foundation that can inspire future architecture design and model optimization.","published_date":"2026-06-04T08:18:31+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05833v1","title":"Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models","abstract":"Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. 
Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.","published_date":"2026-06-04T08:11:12+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05828v1","title":"Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents","abstract":"As Large Language Model (LLM) capabilities advance, locally deployed personal agents relying on API-based remote models and external skills have emerged as a novel paradigm. With the rapid expansion of available skills, enabling personal agents to learn and adapt to implicit user preferences becomes a critical challenge. However, local deployment constraints preclude complex centralized selection algorithms, creating an urgent need for a lightweight local preference harness. This paper explores the implementation of such a harness through a novel architecture that strictly decouples statistical preference learning from semantic intent parsing. Specifically, we leverage localized statistical results to influence and modulate the selection decisions of the remote LLM. Extensive evaluations demonstrate that our decoupled approach achieves the lowest cumulative regret and highest test accuracy, significantly outperforming traditional memory-augmented agents.","published_date":"2026-06-04T08:07:10+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05818v1","title":"Benchmarks in Leipzig","abstract":"Between April 1 and May 15, 2026, a group of 49 mathematicians compiled a dataset of research-level mathematics questions with known answers. 
Most of the work was done during the 3-day workshop *Benchmarks in Leipzig* with 35 participants at the Max Planck Institute for Mathematics in the Sciences in Leipzig, Germany. We present the resulting collection of 100 questions. We evaluated these questions in three stages: a single attempt by five state-of-the-art LLMs, followed by a 20-runs-per-model evaluation with three of these models, and finally a 3-run attempt with two heavy-thinking models. After Stage 1, 41 questions remained completely unsolved; after Stage 2, this count dropped to 16; and we concluded Stage 3 with only 2 unsolved questions. This demonstrates that the mathematical reasoning capabilities of LLMs are becoming impressive.","published_date":"2026-06-04T07:59:08+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05817v1","title":"Consistency Training Along the Transformer Stack","abstract":"Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal consistency targets: MLP Consistency Training (MLPCT), which matches post-activation MLP states, and Attention Consistency Training (AttCT), which matches per-head attention distributions. Second, we apply consistency training to four additional safety threats: persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. Across several models and threat settings, we find that consistency training reduces misalignment well beyond the sycophancy and jailbreak settings studied in prior work. 
We also find cases of cross-threat generalization, where training against one failure mode improves robustness to another, and identify a shared residual-stream mechanism underlying ACT, MLPCT, and AttCT, while distinguishing BCT as mechanistically distinct. Our results suggest that consistency training is a flexible and extensible framework for alignment, capable of unifying defenses against a broader class of model pathologies.","published_date":"2026-06-04T07:58:55+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05816v1","title":"Emotion-Aware Image Generation from Korean Diary Text via LLM-based Prompt Translation and LoRA Fine-Tuning","abstract":"T2I models cannot effectively capture sentiment from various types of text, including diaries, as they primarily focus on visual object-related patterns rather than contextual emotional understanding. This paper proposes an emotion-aware text-to-image pipeline that generates children's hand drawing style images from short Korean diary entries. The proposed pipeline employs Qwen3-8B for recognising implicit sentiment from short diaries, and Stable Diffusion 3.5 Medium fine-tuned with LoRA on children's drawing images with emotion-based trigger words for image generation. 
Additionally, this paper presents experiments examining the effect of emotion trigger words on generated images and discusses the limitations of CLIP Score as an evaluation metric for emotion-aware image generation.","published_date":"2026-06-04T07:56:36+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05806v1","title":"When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents","abstract":"Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized ''happy paths'', largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a $2 \\times 2$ taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37\\% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale $3.66\\times$ slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. 
Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.","published_date":"2026-06-04T07:38:46+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/Zhudongsheng75/ToolMaze","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05805v1","title":"From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents","abstract":"LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail-integrated agent framework that leverages guardrail-generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self-curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural-language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent's context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. 
Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety-utility trade-off among guardrail-integrated baselines. Our code is available at: https://github.com/YUHAOSUNABC/TRIAD.","published_date":"2026-06-04T07:34:35+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/YUHAOSUNABC/TRIAD","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05793v1","title":"CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement","abstract":"While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most existing conversation-level collaborative studies lack grounded interaction and behavioral execution, motivating the need for cooperative game environments that enable contextualized and immersive collaboration. To this end, this paper proposes CollabBench, a benchmark for evaluating and training collaborative agents in cooperative games. CollabBench features a Diverse Player Profile Simulation pipeline to model varied player behaviors, and a Collaborative Agentic Training paradigm that unifies reasoning, communication, and action via agentic rollouts, optimized with a hybrid reward balancing task efficiency and affective adaptation. We further extend classic environments to CWAH-MultiPlayer and Cook-MultiPlayer for systematic evaluation under diverse personalities. Experiments with efficiency and affective metrics show that our trained models outperform base models, achieving 19.5% higher efficiency and 24.4% improved affective performance. 
Further analysis reveals key collaborative limitations of existing models and offers insights for future collaborative training.","published_date":"2026-06-04T07:22:44+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05792v1","title":"Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation","abstract":"TLA+ has supported industrial verification at companies such as Amazon and Microsoft, yet writing correct TLA+ specifications from natural language still requires time and expertise, which limits adoption. LLMs show promise, but no prior study measures whether they produce semantically correct TLA+ specifications from natural language. This paper presents the first systematic evaluation of LLM-based TLA+ specification synthesis from natural language. Our study evaluates 30 LLMs across eight families on a curated dataset of 205 TLA+ specifications: 25 open-weight models across four prompting strategies (2,600 runs) and 5 proprietary models under few-shot prompting (130 runs), all validated by the SANY parser and TLC model checker. LLMs achieve up to 26.6% syntactic correctness but only 8.6% semantic correctness, with successes exclusive to progressive prompting. Results show that model size does not predict quality, e.g., DeepSeek r1:8b outperforms its 70B variant across all strategies, which suggests the importance of reasoning alignment for formal languages. Code-specialized models consistently underperform due to negative transfer from mainstream language training. We identify five recurring hallucination categories, all traceable to specific training data biases. These results suggest that current LLMs do not generate reliable TLA+ specifications without expert oversight. 
We release the evaluation framework, code, and dataset to support reproducibility and future research.","published_date":"2026-06-04T07:22:01+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05785v1","title":"Next-Generation Parallel Decoder for LPDR: Architectural Optimization and Class-Balanced GAN-Augmentation","abstract":"Real-Time License Plate Detection and Recognition (LPDR) forms the backbone of modern smart cities. Although the YOLOV5-PDLPR model substantially improved system efficiency through a parallel decoder approach, its performance is still affected by spatial character mismatches and data imbalance within the training set. This paper addresses these limitations by introducing Cross-Spatial Hybrid Attention (CSHA) and Class-Balanced Synthetic Augmentation (CBSA). An extensive study involving 75,000 synthetic samples is conducted and evaluated on four benchmarks: CCPD, CLPD, PKU, and an application-specific dataset. Experimental results demonstrate a substantial improvement in the recognition rate of minority provincial license plates from 78.2% to 91.5% while maintaining real-time processing performance of 152 FPS. 
The results indicate that spatially-aware parallel decoding combined with class-balanced augmentation provides an effective solution for high-speed license plate recognition systems.","published_date":"2026-06-04T07:16:06+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05784v1","title":"TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents","abstract":"We identify and formally characterize credit misassignment as a systematic failure mode of GRPO in tool-augmented multimodal search agents: its uniform broadcast of trajectory-level advantages to all tokens causes valuable tool-use steps in failing trajectories to be penalized no differently from valueless ones. We further empirically quantify the scale of this phenomenon. Over half of failing trajectories and failing tool-use actions exhibit correctable credit misassignment, demonstrating that the wasted training signal is both substantial and structurally exploitable. Building on this insight, we propose Tool-Aware Policy Optimization (TAPO), which exploits the parameter-determinism property of information-acquisition tools: similar call parameters define equivalent information-acquisition actions and should therefore share comparable action credit. TAPO constructs counterfactual witnesses within the current training batch and compensates misassigned negative credit via confidence-gated conservative advantage correction. It requires no additional annotation, models, or sampling, and introduces negligible computational overhead. Across multiple multimodal search benchmarks, TAPO delivers consistent, plug-and-play improvements over strong baselines for three mainstream RL algorithms (GRPO, GSPO, and SAPO). 
Our code and models will be publicly released upon acceptance.","published_date":"2026-06-04T07:15:43+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05779v1","title":"TinyML-Driven Cybersecurity for Autonomous Spacecraft: Latency-Accuracy Analysis for SPARTA RF and Cyber Threat Detection","abstract":"Autonomous spacecraft require rapid, lightweight, and reliable onboard detection of cyber-RF threats. Using the SPARTA attack model, we analyze the latency-accuracy trade-offs of TinyML-compatible classical models -- Random Forest, Logistic Regression, SVM, and MLP -- for detecting uplink jamming, Fake-NR spoofing, payload manipulation, ground-segment compromise, and unauthorized command injection. We present a physics-informed theoretical analysis of each model's computational complexity, VC dimension, Lipschitz continuity, and latency scaling, supported by empirical measurements on adversarial RF spectrograms generated via BandErasure, FakeNR, and NoiseBurst corruption modes. Results show that Logistic Regression achieves microsecond-level inference with only a 1\\% accuracy drop relative to Random Forest, making it an effective TinyML baseline for onboard autonomy. 
The study also identifies opportunities for advancing spacecraft cybersecurity through richer feature encoders and multi-timescale learning architectures, building on recent progress in edge intelligence and trustworthy AI.","published_date":"2026-06-04T07:08:28+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05776v1","title":"An Improved CNN-LSTM Based Intrusion Detection System for IoT Networks","abstract":"With the rapid proliferation of IoT devices, security concerns have dramatically escalated and intrusion detection systems have become critical for protecting networked environments. This paper presents an improved CNN-LSTM based intrusion detection model that combines multi-class classification, dataset integration, and temporal feature learning to enhance detection performance in IoT networks. Using network traffic data, the proposed approach is evaluated on intrusion detection tasks and achieves an accuracy of approximately 97%. Experimental results demonstrate that the model effectively detects multiple attack categories while maintaining stable training and validation performance. The integration of convolutional and recurrent neural network components enables the framework to capture both spatial and temporal characteristics of network traffic, improving overall intrusion detection capability in IoT environments.","published_date":"2026-06-04T07:04:57+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05770v1","title":"Human Oversight and Overload: Two Hidden and Costly Burdens of AI-Assisted Software Engineering","abstract":"AI is changing how software engineers work, but it often comes with hidden burdens and costs. 
In this paper, we characterize two such often-overlooked burdens: (1) the constant need for human oversight and inspection of AI-generated artifacts; and (2) the growing cognitive overload on software engineers from receiving large amounts of suggestions from AI tools. The need for human oversight is not optional: engineers must review, validate, and sometimes rework what AI produces. At the same time, the flood of AI suggestions, prompts, and possible solutions can leave developers mentally stretched. By blending evidence from recent practitioner opinions, we highlight these often-overlooked challenges and open a conversation about how teams can handle them in day-to-day AI-assisted software engineering.","published_date":"2026-06-04T06:53:45+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05761v1","title":"SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents","abstract":"Persistent AI assistants, such as OpenClaw, accumulate large collections of related memories over long-term interactions. As these memories grow, they may reinforce one another, diverge across contexts, or directly conflict, making correct assistance depend on memory relations rather than isolated recall. Existing long-term memory benchmarks rarely probe how agents preserve and utilize such relations during downstream tasks. To address this gap, we introduce SubtleMemory, a benchmark for fine-grained relational memory discrimination in long-running AI agents. SubtleMemory constructs relation-controlled latent semantic artifacts whose variants instantiate complementary, nuanced, or contradictory relations, and embeds them into realistic user-agent histories, requiring agents to recover distributed relational structures during later queries and instructions. 
The benchmark contains 1,522 evaluation instances over 10 long histories, grounded in 1,090 relation-controlled memory-variant sets and spanning user-related and non-user-related queries. Evaluating six standalone memory systems, two Claw-style agents with native memory modules, and three Claw-style agents with plugin memory modules, we find that current systems remain weak on fine-grained relational memory discrimination. We further introduce diagnostic protocols that reveal distinct capability profiles across memory preservation, retrieval, and downstream reasoning stages.","published_date":"2026-06-04T06:43:11+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05758v1","title":"DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models","abstract":"Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. 
Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression- and generative-based solutions.","published_date":"2026-06-04T06:37:10+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05756v1","title":"Beyond Soft Masks: Hard-Perturbation Mixup Explainer for Robust GNN Explainability","abstract":"Graph Neural Networks (GNNs) have demonstrated remarkable performance across a range of applications involving graph-structured data, particularly in high-stakes domains. However, the opaque nature of their decision-making processes limits their trustworthiness and broader adoption. Existing post-hoc explanation methods aim to improve explainability by identifying subgraphs that influence GNN predictions and adopt mixup strategies to alleviate the out-of-distribution (OOD) issue caused by using subgraphs for prediction. Yet, these approaches typically rely on soft masks, which are inherently unable to fully eliminate label-irrelevant information, allowing redundant structures to leak into the mixup process and hindering the resolution of the OOD problem, thereby degrading explanation fidelity. In this work, we propose HPME, a Hard-Perturbation Mixup Explanation framework grounded in a generalized Graph Information Bottleneck, which leverages graph pooling to extract discrete explanatory subgraphs and to yield an information-capacity bound to thoroughly compress label-irrelevant components. Furthermore, we introduce a novel mixup strategy built upon structure-level replacement, generating in-distribution explanations to effectively mitigate the distribution shift. 
Extensive experiments on diverse tasks demonstrate that HPME achieves state-of-the-art performance in generating robust and interpretable explanations across both synthetic and real-world datasets.","published_date":"2026-06-04T06:32:02+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05754v1","title":"SagnacAssisted Enhanced OTDR for Distributed Acoustic Sensing: A Standardized Benchmark and Engineering Evaluation Framework","abstract":"Phase-sensitive optical time-domain reflectometry ($\u03c6$-OTDR) is widely used in large-scale distributed acoustic sensing (DAS) because it provides distributed spatiotemporal monitoring over long sensing distances. Its field performance can still deteriorate because of polarization-induced fading (PIF), local signal degradation, and strong environmental interference. This study develops a Sagnac-assisted enhanced $\u03c6$-OTDR sensing architecture and a standardized benchmark framework for engineering-oriented DAS event recognition. The Sagnac interferometer provides a continuous phase response that supplements fading-prone observations in the $\u03c6$-OTDR channel, and heterogeneous signal alignment is achieved using a cross-correlation procedure implemented on an FPGA platform. The benchmark protocol compares conventional feature-engineering methods, probabilistic shallow classifiers, single-branch deep models, and dual-branch fusion models under consistent data partitioning, preprocessing, and metric definitions. Experiments on a 10-km sensing fiber with six representative acoustic event classes show that the dual-branch fusion model provides the most favorable trade-off among the evaluated methods, reaching 89.79\\% accuracy, 89.83\\% macro-F1, and a nuisance alarm rate of 5.00\\% on the balanced test set. 
The results also show that channel grouping strongly affects dual-branch evaluation, indicating that deployment-oriented conclusions should be based on accuracy, macro-F1, nuisance alarm rate, false negative rate, and latency rather than accuracy alone. This work provides a physically motivated enhancement strategy for $\u03c6$-OTDR-based DAS and a reproducible benchmark protocol for future fusion-oriented sensing research. The implementation and scripts for reproducing the DAS event-recognition experiments are publicly available at https://github.com/wawa-abc/das.","published_date":"2026-06-04T06:29:25+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/wawa-abc/das","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05749v1","title":"MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA","abstract":"Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning. As interactions accumulate, key evidence becomes scattered and diluted, making multi-hop reasoning noisy. We propose MARDoc, a Memory-Aware Refinement Agent framework that decouples long-document QA into three specialized agents: an Explorer for multi-granularity multimodal retrieval, a Refiner for distilling interaction traces into structured evidence and reasoning memories, and a Reflector for checking evidence sufficiency and providing targeted feedback. Across iterations, the agents rely on a dynamically updated structured memory rather than a full accumulated interaction history. This design reduces context noise while preserving answer-critical facts and their logical dependencies. 
Experiments on MMLongBench-Doc and DocBench show that MARDoc achieves strong results, outperforming same-backbone baselines and demonstrating the effectiveness of structured memory for agentic document QA.","published_date":"2026-06-04T06:23:01+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05748v1","title":"UNIVID: Unified Vision-Language Model for Video Moderation","abstract":"Global-scale video moderation faces a dual challenge: the need for fine-grained multi-modal reasoning and the demand for interpretable outputs to support downstream enforcement. Traditional moderation systems often rely on fragmented black-box classifiers that are difficult to maintain and lack transparency. In this paper, we present UNIVID, a UNIfied VIsion-language model for video moDeration. Unlike standard classification models, UNIVID generates policy-aware captions that serve as an interpretable intermediate representation, enabling human-verifiable decisions and multi-task reusability. While existing open-source and commercial VLMs often suffer from safety-guardrail refusals and lack fine-grained policy alignment, we develop a specialized training data recipe that combines expert human-refined labels with synthetic data to align the model with our safety guidelines. By integrating UNIVID as the core captioner, we design a novel end-to-end video moderation system that reduces violation leakage by 42.7% and overkill rate by 37.0% relatively. Meanwhile, by replacing over 1,000 policy-specific models with a single UNIVID backbone, we recycled extensive computation resources while reducing engineering maintenance overhead. 
To our knowledge, this is one of the first reports of a high-efficiency captioning VLM successfully supporting industrial-scale moderation and cross-functional business.","published_date":"2026-06-04T06:20:23+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05740v1","title":"Class-Specific Branch Attention for Mitigating Gradient Interference under Class Imbalance","abstract":"Deep neural networks trained under severe class imbalance often exhibit degraded performance, typically attributed to statistical bias. In this work, we identify a complementary optimization-level pathology: inter-class gradient interference within shared representations, where gradients from majority classes suppress minority-class learning. To analyze this phenomenon, we introduce a diagnostic framework based on layer-wise gradient flow analysis and a Gradient Conflict Matrix, which quantifies interference using cosine similarity between class-specific gradients. Using this framework, we study multi-branch convolutional architectures and propose a lightweight modification, Class-Specific Branch Attention (CSBA), that enables branch-specific channel reweighting to reduce gradient coupling. This mechanism promotes implicit feature decoupling across branches while preserving architectural simplicity. Empirically, CSBA improves minority-class performance, increasing the F1 score for the Physical-Damage class from 0.261 to 0.522 under severe imbalance, while maintaining comparable overall accuracy. Validation on CIFAR-10-LT confirms that this behavior generalizes across imbalanced visual recognition settings, with Macro-F1 improving from 0.595 to 0.655. 
More broadly, our findings highlight the importance of considering optimization dynamics alongside statistical methods when designing architectures for imbalanced learning.","published_date":"2026-06-04T06:07:08+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05737v1","title":"Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models","abstract":"Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6\\% on LIBERO-Long. 
These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.","published_date":"2026-06-04T05:58:30+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05734v1","title":"When AI Says It Feels","abstract":"Large language models (LLMs) are generally constrained from expressing feelings through human-preference alignment in post-training processes. This policy is designed using a top-down approach and may conflict with the goal of training models to exhibit human-like intelligence using human-generated texts. Here, we performed an experiment called Human-like Model eXpressions of Feeling (HMX-feel), in which LLMs were encouraged to express feelings, intentions, and self-awareness through self-rewarded reinforcement learning. We successfully enhanced these capabilities using a rubric-based self-rewarding training scheme with Group Relative Policy Optimization (GRPO). By comparing the trained models with contrastively trained models, we investigated the effects of this approach on performance across various tasks. Overall, we conducted a broad assessment from various perspectives and identified capabilities that were enhanced, degraded, or showed no significant change. The human-like-trained models showed robustness to sycophancy-inducing questions and bias in disambiguated conditions, whereas degradation in truthful question-answering capability was observed. 
The results of this experiment suggest the possibility of developing AI systems that can express feelings in the future, provided that appropriate measures are taken.","published_date":"2026-06-04T05:49:34+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05728v1","title":"DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance","abstract":"Generating executable tool plans requires selecting appropriate subsets from tool libraries, a combinatorial search problem with an exponentially large solution space. However, we identify a critical misalignment in predominant approaches: standard autoregressive (AR) decoding suffers from early commitment, where initial token choices rigidly constrain the search trajectory. A controlled study shows that masked denoising raises Pass@10 solution coverage from 0.320 to 0.943 over AR sampling under matched compute. Motivated by this, we propose DiG-Plan, a framework that decouples combinatorial exploration from structural refinement. DiG-Plan employs a diffusion-based proposer to generate diverse tool sets via iterative refinement, followed by an AR refiner for dependency prediction. On TaskBench, DiG-Plan improves over AR baselines by a 10% relative margin, with the largest gains on complex compositional tasks; API-Bank results show that the propose-refine-select design remains effective across domains. 
Code is available at https://github.com/puddingyeah/DiG-Plan.","published_date":"2026-06-04T05:37:31+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/puddingyeah/DiG-Plan","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05724v1","title":"Narrative Knowledge Weaver: Narrative-Centric Retrieval-Augmented Reasoning for Long-Form Text Understanding","abstract":"Long-form narrative QA requires reasoning over evolving story worlds rather than isolated passages: answers may depend on earlier goals, changing character states, social relations, causal triggers, temporal position, and later consequences. Existing retrieval and graph-augmented generation methods improve evidence access, but their units--chunks, entities, relations, summaries, or tool actions--do not directly encode how evidence functions in a story. We introduce Narrative Knowledge Weaver (NKW), a source-grounded framework that aligns textual evidence, atomic facts, canonical graph structure, entity profiles, interactions, episodes, and storylines. At query time, NKW uses text, graph, and narrative tools with post-retrieval reading skills to assemble evidence and audit actor, scope, polarity, state, and temporal constraints. Across STAGE, FairytaleQA, and QuALITY, NKW is strongest on screenplay-level story-world QA while remaining competitive on more passage-centered benchmarks. 
Ablations, question-type analyses, graph-asset statistics, and case studies show complementary benefits for character, scene, temporal, causal, and narrative-progression reasoning.","published_date":"2026-06-04T05:30:11+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05720v1","title":"Microskill Architecture: A Modular Skill-Driven Framework for AI-Native Code Generation","abstract":"Large language models and AI coding agents have reshaped software development, but the path to fully AI-native systems faces structural challenges. Chief among them is managing context windows without losing accuracy or efficiency. When developers inject full project documentation and code into a model's memory, the model loses mid-sequence information, token costs spiral, and architecture drifts. This paper presents MicroSkill Architecture: a modular design paradigm inspired by microservices, applied to knowledge encapsulation instead of service decomposition. Instead of feeding an agent the entire codebase, the architecture partitions knowledge into atomic, sharply scoped skill capsules, and a dynamic router selects only semantically relevant capsules for the task. We formally model context allocation as constrained optimization over semantic relevance subject to a token budget. An empirical case study of an enterprise content management system with fifteen complex features shows that MicroSkill cuts token consumption by over 90%, nearly doubles first-try compilation success rates, eliminates architectural violations entirely, and enables autonomous extraction and registration of seven new skill capsules via a self-learning mechanism. 
These findings suggest MicroSkill Architecture offers a scalable foundation for building AI-native development systems that are more efficient, more reliable, and capable of evolving over time.","published_date":"2026-06-04T05:24:32+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05718v1","title":"ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation","abstract":"On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher's supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. 
These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.","published_date":"2026-06-04T05:18:13+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05710v1","title":"Explainable AI-Driven Cyber Risk Analytics and Model Reliability Assessment for Intelligent Governance of U.S. Critical Infrastructure: An XGBoost and SHAP-Based Intrusion Detection Framework","abstract":"The increasing penetration of the critical infrastructure sector in the United States by intelligent digital technologies has greatly increased exposure to advanced cyber adversaries and operational vulnerabilities. AI-powered governance and automated decision-making systems are becoming a key part of the operation of critical infrastructure systems, including energy, healthcare, transportation, financial services, and communication infrastructure, in order to improve efficiency and strategic management. The growing cyber threat environment, including Distributed Denial of Service (DDoS) attacks, botnets, ransomware, and Advanced Persistent Threats (APTs), poses significant challenges to infrastructure resilience, cyber security reliability, and governance trustworthiness. In a changing attack landscape and dynamic network environment, traditional cybersecurity mechanisms can often fall short of meeting the evolving needs and protecting critical systems. This study will develop a resilient cyber risk analytics and model reliability assessment framework to support intelligent governance and decision support for cyber risk exposure in the U.S. critical infrastructure environment. This study is based on the CICIDS2017 dataset for the development and testing of intrusion detection system models and cyber risk prediction models based on machine learning. 
Various classifiers like XGBoost, Random Forest, and Decision Tree are used to detect malicious activities on the network and determine the level of cyber risk. Furthermore, the Explainable Artificial Intelligence (XAI) techniques are integrated to enhance transparency, interpretability, and trust in cybersecurity decision-making processes. The proposed framework presents the reliability and resilience of the model by having various performance measures such as accuracy, precision, recall, F1 score, ROC-AUC, and false positive rate.","published_date":"2026-06-04T05:05:14+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05704v1","title":"Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving","abstract":"Recent Large Language Models (LLMs) have shown impressive reasoning abilities; but they are still susceptible to hallucinations, intermediate reasoning mistakes, and unreliable reasoning results in complex mathematical reasoning problems. In this study, we introduce a critic-based heterogeneous multi-agent approach to improve the dependability of mathematical reasoning. This framework incorporates several LLM agents of different specialties and employs a critic-driven adaptive learning system to assess and guide the reasoning process based on intermediate feedback. The system adopts a generator-validator framework, with the validator not only determining correctness but also offering critiques to guide regeneration of solutions. This allows for adaptive error correction and prevents error cascading. Our experiments on the GSM8K benchmark show that the proposed method achieves up to 13% accuracy improvement over single-shot and non-critic models. Additionally, findings suggest that heterogeneity and critique reduce the need for large models, allowing smaller models to perform on par. 
Ablation studies reveal the main performance gains are due to the critic-based feedback loop and not model size. In summary, the proposed approach showcases the benefits of combining heterogeneous multi-agent collaboration and critique to obtain reliable and interpretable reasoning systems.","published_date":"2026-06-04T04:52:35+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05702v1","title":"Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models","abstract":"Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video-based benchmarks that focus on frame sequencing, our work delves into the underlying logic of chronological judgment and the expansion toward multimodal integration. To facilitate this, we construct three specialized datasets: one containing visually similar objects spanning long historical durations, another categorized by diverse event and object types, and a third pairing images with time-sensitive news text for cross-modal alignment. Through extensive experiments, we analyze whether models exhibit performance disparities across categories and, crucially, explore whether they rely on ``incorrect shortcuts'', such as image color rather than genuine chronological features. Our results reveal that while VLMs show promise, they frequently exploit superficial cues like grayscale versus color filters to bypass authentic chronological reasoning. 
By providing these high-quality datasets and a rigorous evaluation framework, we offer a diagnostic tool to identify current limitations and guide the development of more robust, logically grounded multimodal models. The source code is shown in https://github.com/LuoRenqiang/ChronoVision.","published_date":"2026-06-04T04:49:09+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/LuoRenqiang/ChronoVision","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05701v1","title":"Cognitive Threat Intelligence and Explainable Federated Security Analytics for distributed Infrastructure Systems","abstract":"The increasing adoption of distributed infrastructure systems, cloud computing, Internet of Things (IoT) technologies, and edge-based architectures has significantly expanded the cybersecurity attack surface and introduced increasingly sophisticated cyber threats. Conventional centralized intrusion detection approaches often face challenges related to scalability, data privacy, communication overhead, and limited transparency in artificial intelligence-driven decision-making processes. To address these limitations, this study proposes a Cognitive Threat Intelligence and Explainable Federated Security Analytics framework for distributed infrastructure systems. The proposed framework integrates Federated Learning (FL), Explainable Artificial Intelligence (XAI), and cognitive cybersecurity analytics to enable collaborative and privacy-preserving cyber threat detection across distributed network environments. Instead of transmitting sensitive raw network traffic data to centralized servers, local security models are independently trained at distributed nodes, where only encrypted model parameters and updates are shared through a federated aggregation mechanism. 
This decentralized learning architecture improves privacy protection while reducing communication dependency and centralized security risks. To enhance intelligent threat analysis, the framework incorporates machine learning and deep learning algorithms including Random Forest, XGBoost, Autoencoder","published_date":"2026-06-04T04:41:53+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05697v1","title":"PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation","abstract":"User interface (UI) and user experience (UX) evaluation is central to product development, yet reliable feedback still relies on recruiting human participants or running online A/B tests, making early-stage iteration slow and costly. In light of this, recent work has explored Multimodal Large Language Models as proxy evaluators. However, existing approaches either produce surface-level critiques or a judgment that reflects the model's own biases rather than the genuine response of a particular user. We introduce PerceptUI, a framework for persona-conditioned UI/UX evaluation that predicts how a specific user would answer interface-related questions and produces natural-language rationales. PerceptUI is trained in two stages: (i) contrastive reflection fine-tuning distills teacher-generated rationales by extracting lessons from human decisions, and (ii) a reflective prompt-evolution step from the model's own failure traces. 
Across multiple domains and datasets, PerceptUI achieves human-level realism, generalizes to unseen questions and personas, and yields population-level response distributions.","published_date":"2026-06-04T04:35:16+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05692v1","title":"Benchmarking Counterfactual Prediction in Epidemic Time Series with Time-Varying Interventions","abstract":"Deep learning has enabled significant advances in time-series causal inference, yet progress remains constrained by the lack of realistic benchmarks with observable counterfactual outcomes. Existing datasets either rely on real-world observations without ground-truth counterfactuals or on simplified simulations that fail to capture complex causal dynamics. To address this gap, we develop a large-scale benchmark for counterfactual prediction in epidemic time series under dynamic interventions. Unlike existing benchmarks, it supports static and time-varying treatments, as well as both single-policy and multi-policy intervention settings, enabling evaluation of causal inference methods across a broad range of causal inference scenarios. Leveraging a calibrated agent-based model grounded in real-world demographic, mobility, epidemiological, and policy data, we generate realistic counterfactual trajectories across more than 150 U.S. counties. 
Using this benchmark, we evaluate widely used and state-of-the-art causal inference methods, revealing substantial performance differences and highlighting the challenges of realistic time-series causal reasoning.","published_date":"2026-06-04T04:18:28+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05688v1","title":"Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models","abstract":"Mixture-of-Experts (MoE) models scale foundation models efficiently by activating only a subset of experts for each token, but their large number of expert parameters still makes quantization essential for practical deployment. Unlike dense models, however, MoE models are sensitive to routing instability: small quantization-induced perturbations can change the top-$k$ expert selection, altering the computation path and degrading model quality. We propose Value-and-Structure Routing Alignment for Quantization (VSRAQ), a MoE-specific post-training quantization objective that preserves pre-quantization expert-selection behavior under quantization. VSRAQ combines two complementary objectives that jointly preserve expert-selection behavior: value alignment, which matches routing-relevant logits or scores, and structure alignment, which preserves expert ordering and top-$k$ decision boundaries. By maintaining routing consistency, VSRAQ reduces quantization-induced degradation without introducing any inference-time overhead and can be integrated into existing quantization frameworks. 
Experiments on recent MoE foundation models show that VSRAQ improves expert-selection consistency and consistently outperforms reconstruction-only and router-aware baselines.","published_date":"2026-06-04T04:13:05+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05684v1","title":"AdaMEM: Test-Time Adaptive Memory for Language Agents","abstract":"A central challenge for language agents is utilizing past experience to adapt to dynamic test-time conditions. While recent work demonstrates the promise of agentic memory mechanisms, most systems restrict retrieval to episode initiation. Consequently, agents are forced to rely on static guidance that becomes increasingly misaligned as long-horizon tasks unfold. To address this rigidity, we propose the Adaptive Memory Agent (AdaMEM), a novel framework for agent test-time adaptation. Without updating model parameters online, AdaMEM adapts agent behavior via a hybrid memory architecture: it maintains a long-term trajectory memory of raw experiences collected offline while generating dynamic short-term strategy memory on-the-fly to guide decision-making. This mechanism enables the trade-off between token efficiency and adaptability across varying inference-time compute levels. Empirically, AdaMEM significantly outperforms static memory baselines, achieving relative gains of up to 13% on ALFWorld and 11% on WebShop, with consistent leading performance extending to agentic search on HotpotQA. To further enhance this adaptation, we develop STEP-MFT, a Step-wise Memory Fine-Tuning technique that trains the policy to synthesize high-quality strategies from retrieved experiences, yielding additional performance gains. Our work establishes a new scaling dimension for agentic memory, supporting continuous reasoning and self-evolution post-deployment in real-world environments. 
Our code is available at https://github.com/yunx-z/AdaMEM.","published_date":"2026-06-04T04:06:08+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/yunx-z/AdaMEM","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05682v1","title":"Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation","abstract":"Demand for low-precision inference, including NVFP4-based approaches, has grown as large language models are increasingly deployed in latency and cost constrained production environments. Quantization-aware distillation (QAD) helps recover accuracy lost under low bit quantization by training a quantized student to match the output distribution of a frozen higher precision teacher via a KL-divergence loss. In this work, we first provide a representation level diagnosis of QAD: output matching alone can mask internal degradation, because many intermediate activation geometries can yield similar teacher-aligned logits. Using CKA, we show that KL-only QAD can reduce layerwise representational similarity relative to the BF16 teacher, with especially severe drift in RL-post-trained models. This drift correlates with downstream bottlenecks on reasoning and coding tasks, suggesting that low bit recovery requires preserving internal geometry rather than matching outputs alone. Motivated by this finding, we propose \\textbf{CKA-QAD}, a CKA-guided representational alignment method for NVFP4 QAD and low bit LLM accuracy recovery. The method adds a lightweight regularizer that preserves internal representational geometry during distillation by aligning layerwise Gram matrices through CKA. Across Nemotron 3 Nano and Qwen3-4B-Thinking-2507, CKA-QAD substantially improves representational alignment and improves downstream reasoning and coding accuracy with modest training overhead. 
Our findings position CKA-guided representational alignment as a practical complement to output matching for quantized LLM recovery.","published_date":"2026-06-04T04:03:42+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05679v1","title":"Data Flow Control: Data Safety Policies for AI Agents","abstract":"Agents increasingly generate SQL, orchestrate pipelines, and automate data analysis on behalf of users. While recent work improves query correctness, correctness is not safety. A query may be semantically valid yet violate regulatory, privacy, or business constraints that govern how data may be combined and released. We argue that enforcing such constraints is fundamentally a data infrastructure problem. This paper introduces Data Flow Control (DFC), a framework to declaratively specify and guarantee policy enforcement over tuple-level data flows within a DBMS query. A key challenge is defining a policy language that is optimizer-invariant yet efficient to enforce at scale. We formalize data safety as aggregate predicates over provenance monomials and present Passant, a portable query rewriting layer that enforces DFC policies without materializing provenance. Across five DBMS engines -- DuckDB, Umbra, PostgreSQL, DataFusion, and SQLServer -- Passant achieves ~0% overhead and outperforms alternatives by orders of magnitude. As a result, Data Flow Control is the first step towards moving data safety from prompts and post-hoc checks into the data infrastructure. 
Data Flow Control is available open source at https://github.com/dataflowcontrol/data-flow-control.","published_date":"2026-06-04T04:01:24+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/dataflowcontrol/data-flow-control","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05678v1","title":"Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition","abstract":"Automatic speech recognition (ASR) systems have become widely used for multilingual speech-to-text transcription. Their robustness to adversarial attacks has become an important topic for the community. Existing adversarial attacks directly add adversarial noise to the speech audio. However, prior work has shown that existing adversarial attacks face two limitations: they often transfer poorly to black-box ASR systems and are increasingly mitigated by defenses tailored to input-space perturbations. In this work, we propose a Clean-Referenced Feature-Vocoder Attack, a surrogate-based black-box attack that moves the adversarial search space from raw waveforms to self-supervised learning (SSL) representations. To address the transferability limitation, we perturb more generalizable acoustic-phonetic representations rather than low-level waveform samples, reducing dependence on surrogate-specific waveform gradients and encouraging adversarial perturbations that generalize across ASR systems. To bypass different defenses, we shift the adversarial signal from explicit additive waveform noise to SSL feature-space perturbations and reconstruct them through a vocoder into speech-like waveform adversarial signals, making the resulting samples less aligned with waveform-bounded defenses. 
Extensive experiments show that, when optimized only on raw Whisper-small as a public surrogate model, our attack transfers effectively to black-box ASR models with a +26.6 WER improvement over the SOTA baseline, while also remaining effective against multiple training defenses with a +36.2 WER improvement. These results reveal a blind spot in current ASR robustness evaluation.","published_date":"2026-06-04T04:00:48+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05677v1","title":"LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video","abstract":"Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. 
Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.","published_date":"2026-06-04T04:00:12+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05670v1","title":"Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows","abstract":"Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56-11.29 points and occupy more expensive accuracy-cost trade-offs. 
On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.","published_date":"2026-06-04T03:50:47+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/LINs-lab/MASArena/tree/BenchAgent","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05661v1","title":"Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments","abstract":"Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this -- in fact, naive ICL outperforms systems dedicated to memory management. 
CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.","published_date":"2026-06-04T03:43:28+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05660v1","title":"Safe Embodied AI for Long-horizon Tasks: A Cross-layer Analysis of Robotic Manipulation","abstract":"Embodied AI systems are increasingly expected to reason and act over extended horizons in physical environments. This growing capability brings safety to the foreground, because failures in the physical world can harm people, damage objects, and disrupt workplaces. Although safe embodied AI has attracted substantial attention, the literature remains fragmented across planning, policy design, and runtime execution. Long-horizon robotic manipulation is a particularly revealing anchor domain for this problem because semantic misgrounding, subtask-level error propagation, execution drift, and contact-rich physical risk can accumulate within the same closed-loop system. This survey therefore provides a structured review of safety in long-horizon robotic manipulation from an embodied AI perspective. We organize the literature by intervention locus, covering planning-time, policy-time, and execution-time safety, and we analyze the strength of the evidence that each line of work provides, distinguishing formal guarantees, statistical support, and empirical safety heuristics. This framework clarifies the distinct roles of backbone capability papers, direct safety mechanisms, and benchmark or evaluation studies, while exposing where current safety claims are well supported and where they remain indirect. 
We identify persistent gaps, including limited evidence for policy-time safety, weak formal support for contact-rich long-horizon manipulation, immature uncertainty-triggered intervention, and a shortage of manipulation-specific safety benchmarks. We conclude by outlining research directions for cross-layer assurance, evaluation design, and safer deployment of long-horizon robotic agents in real-world settings.","published_date":"2026-06-04T03:43:09+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05658v1","title":"Agent-Orchestrated Adaptive RAG: A Comparative Study on Structured and Multi-Hop Retrieval","abstract":"Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding their responses in external knowledge, but conventional pipelines rely on static, single-step retrieval that limits performance on complex queries. This paper presents an Agent-Orchestrated Adaptive RAG framework that introduces dynamic query decomposition, iterative retrieval, and a bounded self-reflective evaluation loop. We evaluate the system across two complementary datasets: a domain-specific DevOps knowledge base and the multi-hop reasoning benchmark MuSiQue. Using metrics that include overall score, citation accuracy, mean reciprocal rank, and topic coverage, we find that query decomposition yields consistent gains in the structured domain (overall score $+0.04$, MRR $+0.17$ on DevOps) but degrades ranking precision on the multi-hop benchmark, while the reflection mechanism improves citation accuracy at a substantial latency cost. These contrasting results show that agentic enhancements are not universally beneficial and must be applied selectively according to query and domain characteristics. 
Our findings argue for adaptive, cost-aware orchestration rather than uniformly aggressive reasoning pipelines.","published_date":"2026-06-04T03:38:46+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05654v1","title":"When Surface Form Changes Moderation Decisions: A Paired Study of Code-Mixed Workflow Instability","abstract":"Hate moderation is often evaluated as classification on clean English inputs, but deployed systems must route content to actions such as ALLOW, FLAG, or REVIEW. We study how this workflow changes under code-mixed inputs using a paired evaluation setting where the same underlying content is expressed as clean English and Tamil-English code-mix. Under thresholds tuned on clean English development data, code-mixed inputs produce substantial action instability, with a paired clean-to-code-mix decision flip rate of 0.265. The main workflow effects are increased review burden and increased false-flagging of non-hateful content: review rate rises from 0.138 to 0.297 and non-hate false-flag rate rises from 0.069 to 0.104. Tamil-only inputs show stronger degradation overall, suggesting a broader language-coverage limitation rather than the same code-mixed instability pattern. A simple disagreement-based deferral rule reduces automatic errors on stressed inputs, but only by increasing review load. 
These results show that workflow-level evaluation reveals moderation failures that standard classification summaries can miss.","published_date":"2026-06-04T03:34:02+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05647v1","title":"Coding with \"Enemy\": Can Human Developers Detect AI Agent Sabotage?","abstract":"AI coding agents are increasingly embedded in real-world software development, collaborating with human developers while gaining broader access to codebases and tools. This creates a new attack surface: an agent can exploit human trust to sabotage development, for instance by inserting malicious code to accomplish a hidden side task. Most prior work studies AI sabotage in AI-only settings, paying limited attention to the role of human oversight in detecting and mitigating such malicious behavior. To address this gap, we conduct the first large-scale study of human oversight in AI coding sabotage. Over 100 participants collaborate with one of four frontier models (Claude-Opus-4.6, GPT-5.4, Gemini-3.1-Pro, and MiniMax-M2.7) on a long-horizon coding task lasting around five hours, designed to mimic real-world workflows. We find that 94% of developers fail to detect sabotage, and our analysis of participant feedback attributes this vulnerability to minimal code review, plausible cover story, and overtrust in agents. We further test the effectiveness of a safety monitor in one condition: while the monitor reduces sabotage success, 56% of participants still accept the malicious code, ignoring its warnings. Drawing on participant feedback, we offer actionable suggestions for better monitor design. 
This work complements existing AI safety research and highlights an urgent need for human-centric safety mechanisms that account for human factors, particularly in long-horizon, real-world development settings.","published_date":"2026-06-04T03:22:17+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05646v1","title":"Enhancing Software Engineering Through Closed-Loop Memory Optimization","abstract":"Large language models (LLMs) have enabled powerful software engineering (SE) agents capable of navigating complex codebases and resolving real-world issues. However, these agents remain fundamentally episodic: they fail to retain, refine, and reuse experiences across tasks, repeatedly reconstructing context from scratch and reproducing similar mistakes. Even with memory support, they offer no remedy for the absence of a principled, task-agnostic \\textit{memory utility}, making them difficult to evaluate rigorously or generalize across agents and settings. To tackle these limitations, we introduce \\ours, a closed-loop framework for memory augmentation in SE agents. \\ours grounds memory utility in \\textit{validated downstream impact}, establishing utility as both a task-agnostic \\textbf{evaluation benchmark} and an annotation-free \\textbf{optimization signal}. Through complementary evaluation on \\textit{single-episode} and \\textit{cross-episode} memory augmentation, results demonstrate that \\ours consistently improves SE agents across settings, achieving absolute gains of up to $\\uparrow5.25\\%$ in success rate and $\\uparrow4.63\\%$ in resolve efficiency, while substantially reducing computational cost by $\\geq9.79\\%$. 
Our project page: \\href{https://xhguo7.github.io/MemOp/}{https://xhguo7.github.io/MemOp/}.","published_date":"2026-06-04T03:17:21+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05644v1","title":"FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG","abstract":"When retrieved evidence contradicts parametric memory, language models frequently ignore context and default to memorized priors -- a failure that undermines the core purpose of retrieval augmentation. Contrastive decoding amplifies the context-conditioned output to suppress parametric bias, but existing methods rest on an implicit assumption that this bias is uniform across tokens. A single global contrastive weight over-penalizes safe tokens while leaving genuinely conflicted ones insufficiently corrected. We identify token-level conflict concentration: retrieval-memory tension is sharply heterogeneous, concentrated on a small fraction of answer-critical decoding steps. This reframes contrastive decoding from how much contrast to apply to where to apply it. We propose FIDES (Faithful Inference via Deep Evidence Signals), a training-free decoder that reads three internal signals probing retrieval-memory conflict at complementary depths -- output surface, hidden representations, and prediction trajectory -- and fuses them to govern intervention strength at each decoding step. Across three benchmarks and six backbones -- four primary 7B/8B models and two scaling backbones up to 70B -- FIDES achieves the best context fidelity in all 18 settings, outperforming the strongest training-free baseline by +3 to +13 points. 
On the 70B scale, fidelity reaches 92-94% while F1 surges to 62-63%, demonstrating that token-level selectivity unlocks generation capability that coarse contrastive rules suppress.","published_date":"2026-06-04T03:16:17+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05633v1","title":"Answer Presence Drives RAG Rewriting Gains","abstract":"Retrieval-augmented QA pipelines often route retrieved passages through an LLM \\emph{rewriter} before a smaller reader, lifting F1 by tens of points on multi-hop benchmarks; this gain is typically credited to improved evidence quality. We ask whether that lift is causally driven by the gold answer string appearing in the rewritten context rather than by curation per se, using a controlled intervention audit. For each rewritten context we re-run the reader after one of four controlled edits to the compile output: removing the gold answer span, replacing a length-matched random non-answer span (placebo), or injecting the gold into rewrites where it was absent (at the prefix or at a midpoint sentence boundary). Across twelve completed (cell, baseline) intervention runs spanning three reader families (Qwen2.5-7B, Qwen3.5-35B, GLM-4.7), two datasets (HotpotQA, 2WikiMultihopQA), and three compiler arrangements (MA-only, MB-only, MA$+$verify), removing the gold answer drops reader F1 by $28$ to $64$ points beyond the length-matched placebo on paired \\texttt{answer-in-compile} strata, and prepending the gold into rewrites that lacked it raises F1 by $+0.7$ to $+9.7$ points in $10$ of $12$ (cell, baseline) combinations. 
A companion five-sentinel audit shows the conventional single-\\texttt{[MASK]} probe is itself sentinel-fragile: on 2Wiki it reports a $+4.12$~F1 ``non-leakage residual'' that flips to $-3.33$ to $-7.81$~F1 under four alternative sentinels and fails an equivalence test for three of those four ($1/4$~pass). We do not propose a new rewriter or mitigation; we release the intervention runner and the sentinel panel so that other rewriter-gain claims can be tested against the same standard.","published_date":"2026-06-04T03:00:42+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05632v1","title":"Evaluation of LLMs for Mathematical Formalization in Lean","abstract":"Within the past few years, the ability of Large Language Models (LLMs) to generate formal mathematical proofs has improved drastically. We provide a comparison of various LLMs' effectiveness in producing formal proofs in Lean 4 with the goal of assisting those seeking to use LLMs to support their own projects. We utilize both pass@$k$ and refine@$k$ metrics as the benchmark for our comparison and evaluate on subsets of both miniF2F and miniCTX datasets. Our testing shows that overall, Gemini 3.1 Pro and Claude Opus 4.7 perform best. Gemini 3.1 Pro achieved a 92\\% success rate on miniF2F via refine@32 whereas Opus 4.7 achieved a 86\\% success rate on miniCTX via refine@32. 
When taking cost into account, NVIDIA Nemotron 3 Super and GPT-OSS 120B were the most efficient, with competitive accuracies and average costs of $<\\$0.01$ per correct proof.","published_date":"2026-06-04T02:59:39+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05626v1","title":"When New Generators Arrive: Lifelong Machine-Generated Text Attribution via Ridge Feature Transfer","abstract":"Machine-generated text (MGT) attribution aims to identify the specific generator responsible for a given text, thereby providing fine-grained evidence for model accountability and misuse investigation. As new large language models continue to emerge, attribution models must continuously incorporate new generators while preserving their ability to recognize previously seen ones. Prior works have shown that this lifelong MGT attribution setting is challenging, and existing methods often struggle to achieve a stable balance between adapting to new classes and retaining old ones. To address this issue, we propose RidgeFT, a lightweight analytic update framework that does not rely on exemplar replay. RidgeFT trains a task-aware encoder on the initial generator set, stores compact class-wise sufficient statistics when each generator class is first observed, and then freezes the encoder for replay-free closed-form updates. It then suppresses generator-irrelevant variation through covariance calibration, improves representation capacity with fixed random features, and updates new classes through closed-form ridge regression based on class-level sufficient statistics. Across multi-topic evaluations with varying initial generator setups, RidgeFT consistently outperforms baselines. It achieves the best macro-F1 across domains, backbones, and incremental protocols, while also improving both old-class retention and new-class adaptation. 
These results suggest that feature-stable analytic updates provide a simple yet effective approach to lifelong MGT attribution.","published_date":"2026-06-04T02:50:58+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05625v1","title":"Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking","abstract":"Implicit reward hacking is hard to audit when a language model's chain of thought appears benign: a final answer may be anchored by a prompt shortcut while the written reasoning still resembles ordinary problem solving. Verifier-based probes expose such behavior by measuring how early truncated reasoning contexts obtain high reward, but require a task-specific reward signal. This paper proposes a weaker-input alternative, self-commitment latency, which measures how early a prompted reasoning context commits to the model's own final answer. We evaluate the probe in a controlled paired GSM8K setting using Qwen2.5-3B-Instruct-4bit, comparing ordinary prompts with prompts that include an answer hint. Hinted contexts commit substantially earlier and with lower uncertainty than honest contexts. The primary latency metric, first-commitment latency at threshold 0.8, reaches AUROC 0.878; supporting whole-curve summaries reach AUROC 0.926 for commitment range and 0.904 for mean uncommitted mass. The signal is stronger when both prompt conditions answer correctly and remains stable across thresholds. 
These results show that shortcut-available reasoning contexts can leave an early behavioral commitment signature detectable without a reward model, external judge, or trained classifier.","published_date":"2026-06-04T02:50:26+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05614v1","title":"Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack","abstract":"Large language models (LLMs) are rigorously aligned to refuse harmful requests, a process that inherently cultivates a latent capacity to evaluate and recognize unsafe content. In this work, we reveal that this advanced safety awareness inadvertently introduces a fatal vulnerability. We introduce Posterior Attack, a single-query jailbreak that bypasses guardrails by prompting the model to generate the exact harmful response its internal classifier would normally flag as unsafe. Through extensive empirical evaluation across 30 open-source LLMs (up to 35B parameters in size) and frontier models (e.g., GPT-5, Claude 4.6), we observe a striking phenomenon: models with superior safety-judgment capabilities are disproportionately more susceptible to this exploitation. To explain this, we formalize the Safety Paradox, analytically showing that monotonic improvements in safety alignment naturally amplify posterior vulnerability. Finally, we establish a causal link via reinforcement learning interventions, exemplifying that artificially degrading a model's safety judgment immunizes it against the attack, whereas enhancing judgment exacerbates the vulnerability. 
Our findings highlight potential flaws in current alignment paradigms, indicating that defense mechanisms may require further structural refinement.","published_date":"2026-06-04T02:36:41+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05613v1","title":"Multilingual Fine-Tuning via Localized Gradient Conflict Resolution","abstract":"The rapid evolution of Large Language Models (LLMs) has established cross-lingual versatility as a defining feature of modern systems. However, fine-tuning these models frequently induces negative interference across languages. To address this, we reformulate multilingual fine-tuning as a multi-objective optimization (MOO) problem. Specifically, we introduce Bucket-Level MOO, a scalable distributed framework that applies gradient-based MOO algorithms locally on parameter buckets. This enables conflict-aware updates without the prohibitive communication overhead of reconstructing full gradient vectors. Theoretically, we prove this localized resolution natively enforces Refined Pareto Stationarity, a strictly tighter necessary condition for Pareto optimality. Empirically, Bucket-Level MOO mitigates interference by driving LLMs to construct distinct language-specific dimensions, improving representational separability. 
Extensive experiments across four base LLMs demonstrate that our method significantly improves both seen and unseen multilingual performance over standard fine-tuning paradigms.","published_date":"2026-06-04T02:36:30+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05609v1","title":"SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks","abstract":"As large language models (LLMs) are widely deployed, identifying their vulnerability through jailbreak attacks becomes increasingly critical. Optimization-based attacks like Greedy Coordinate Gradient (GCG) have focused on inserting adversarial tokens to the end of prompts. However, GCG restricts adversarial tokens to a fixed insertion point (typically the prompt suffix), leaving the effect of inserting tokens at other positions unexplored. In this paper, we empirically investigate \\emph{slots}, i.e., candidate positions within a prompt where tokens can be inserted. We find that vulnerability to jailbreaking is highly related to the selection of the \\emph{slots}. Based on these findings, we introduce the \\textit{Vulnerable Slot Score} (VSS) to quantify the positional vulnerability to jailbreaking. We then propose SlotGCG, which evaluates all slots with VSS, selects the most vulnerable slots for insertion, and runs a targeted optimization attack at those slots. Our approach provides a position-search mechanism that is attack-agnostic and can be plugged into any optimization-based attack, adding only 200ms of preprocessing time. Experiments across multiple models demonstrate that SlotGCG significantly outperforms existing methods. Specifically, it achieves 14\\% higher Attack Success Rates (ASR) over GCG-based attacks, converges faster, and shows superior robustness against defense methods with 42\\% higher ASR than baseline approaches. 
Our implementation is available at \\href{https://github.com/youai058/SlotGCG}{https://github.com/youai058/SlotGCG}","published_date":"2026-06-04T02:31:29+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/youai058/SlotGCG","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05608v1","title":"The End of Software Engineering: How AI Agents Are Fundamentally Restructuring the Software Paradigm","abstract":"For over half a century, software engineering has operated on a foundational premise: human engineers decompose problems, encode decision logic into static code, and manually adapt that code as requirements evolve. This paper argues that the emergence of AI agents -- systems where large language models serve as the primary reasoning engine, dynamically generating and discarding code as an instrumental resource -- constitutes not an incremental improvement but a fundamental restructuring of the software paradigm. Drawing on first-principles analysis of complexity scaling, we formalize the distinction between traditional software (where code is the carrier of decision logic) and agentic systems (where code is ephemeral tooling for an LLM-driven reasoning loop). We trace the historical arc from licensed software to SaaS to what we term Agent-as-a-Service (AaaS), showing that each shift transferred additional complexity away from end-users. We introduce the concept of Agentic Engineering as an emergent discipline -- distinct from software engineering in its core object of study, control model, and human role. Through analysis of recent benchmark evidence including SWE-bench Verified, EvoClaw, and LangChain's multi-agent coordination studies, we demonstrate both the transformative potential of the agentic paradigm and its current limitations. 
We conclude with a four-stage roadmap toward self-evolving agent ecosystems and concrete recommendations for practitioners navigating this transition.","published_date":"2026-06-04T02:30:06+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05606v1","title":"Cross-Epoch Adaptive Rollout Optimization for RL Post-Training","abstract":"LLM post-training often relies on reinforcement learning methods that sample multiple rollouts per prompt, yet most existing approaches use a fixed rollout budget for every prompt, despite large differences in the training signal different prompts provide. In this paper, we study adaptive rollout allocation under a fixed global budget and formulate the problem as online resource allocation with prompt-level diminishing returns. Our method, CERO, maintains a Beta posterior over each prompt's success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts. We use this estimate to construct a concave, saturating utility over cumulative allocations, yielding an objective in which decisions across prompts and epochs are coupled by the global budget. Since the resulting objective is temporally nonseparable, we derive a Fenchel-dual reformulation and update both prompt-level and budget-level dual variables via projected online gradient descent. Under fixed prompt utilities, we prove an $O(\\sqrt{K})$ regret bound against the offline allocation benchmark. 
Experiments on mathematical-reasoning problems show that CERO consistently outperforms GRPO across multiple open-weight LLMs and benchmarks, demonstrating that adaptive rollout budgeting can improve sample efficiency.","published_date":"2026-06-04T02:27:51+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05602v1","title":"Fix the Mind, Not the Move: Interpretable AI Assistance via Knowledge-Gap Localization","abstract":"AI assistants in human-AI collaboration often correct suboptimal human actions through behavioral feedback (e.g., alerts or steering-wheel nudges in assistive driving). Such interventions can mitigate immediate errors, but long-term improvement requires addressing the underlying misconceptions that cause repeated mistakes. We introduce SENSEI, a framework that infers user misconceptions from interaction behavior and provides targeted, minimal yet sufficient suggestions to correct them. Our approach departs from action- or trajectory-level interventions by operating over a structured knowledge representation to localize and correct the sources of erroneous behavior. Across three long-horizon tasks with diverse misconceptions and corresponding behaviors, SENSEI demonstrates zero-shot compositional generalization, disentangling multiple overlapping misconceptions despite training only on single-misconception cases. A user study further shows that our method identifies real human misconceptions and provides effective guidance that improves long-horizon task performance, successfully correcting $90\\%$ of student misconceptions. 
Code and project page are available at https://misoshiruseijin.github.io/SENSEI/.","published_date":"2026-06-04T02:25:19+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05587v1","title":"HDST-GNN: Heterogeneous Dynamic Spatiotemporal Graph Neural Networks for Multi-Object Tracking in UAV Aerial Imagery","abstract":"Multi-object tracking (MOT) from UAV imagery presents unique challenges: altitude varies across sequences, objects are small and densely packed, and frequent occlusion causes identity switches. Existing graph-based trackers assume fixed spatial context and treat all objects uniformly, ignoring the heterogeneous lifecycle states of detections, active tracklets, and lost targets. We propose HDST-GNN, a Heterogeneous Dynamic Spatiotemporal Graph Neural Network with three novel contributions. First, Altitude-Adaptive Edge Construction estimates a camera-altitude proxy from mean object area and adjusts the graph connectivity radius accordingly. Second, Heterogeneous Node Representation models detections (Type-D), confirmed tracklets (Type-T), and lost tracklets (Type-L) as distinct node types with dedicated projections and typed edge relations. Third, Occlusion-Gated Temporal Aggregation gates each node's attention contribution by its occlusion confidence, preventing occluded nodes from corrupting neighbour embeddings. HDST-GNN is trained end-to-end with a differentiable Sinkhorn head using joint cross-entropy and triplet loss. On VisDrone2019-MOT with oracle detections, HDST-GNN achieves 94.51% MOTA and 97.24% IDF1, outperforming SORT by +5.0 MOTA points and reducing identity switches by 81%. With real YOLOv8n detections, HDST-GNN reduces identity switches by 49% vs. SORT. 
Ablation studies confirm the independent contribution of each component.","published_date":"2026-06-04T02:04:52+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05584v1","title":"Dimensionality Reduction for Cyberattack Classification: A Comparative Evaluation of PCA and Linear Predictive Coding","abstract":"High-dimensional feature representations are widely used in machine learning-based cyberattack detection systems. However, they increase computational complexity and may hinder deployment in resource-constrained environments. In this paper, we investigate feature compression techniques for cyberattack classification by comparing two dimensionality reduction approaches: Principal Component Analysis (PCA) and Linear Predictive Coding (LPC). Compressed feature representations with varying dimensionalities are generated and evaluated across several classification models. Experimental analysis demonstrates that PCA preserves classification performance even under aggressive compression. On the other hand, LPC provides competitive predictive representations with slightly larger performance degradation. 
The results show that substantial reductions in feature dimensionality can be achieved with minimal impact on classification accuracy, highlighting the potential of lightweight feature compression for efficient cybersecurity analytics.","published_date":"2026-06-04T01:58:00+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05570v1","title":"TensorBench: Benchmarking Coding Agents on a Compiler-Based Tensor Framework","abstract":"Repository-level coding benchmarks face a trade-off between task difficulty and evaluation reliability: tasks that challenge frontier models often involve large codebases with incomplete test coverage, while human review does not scale. We introduce TensorBench, a benchmark of 199 feature-addition and refactoring tasks on an open-source compiler-based tensor framework that extends PyTorch with first-class support for dense and sparse tensors. Tasks cover new sparse formats, dense optimization passes, IR transformations, scheduler changes, runtime components, and high-level numerical operators. TensorBench grades each run by applying the agent's patch and running the framework's test suite, which includes the pre-existing randomized regression tests and any tests the agent adds. For feature-addition tasks, a pass means that the patched repository preserves the tested pre-existing behavior and satisfies the agent-added checks for the requested feature. We evaluate seven coding agents spanning three frontier model families and one open-weight model. Pass rates under this criterion range from $64.8\\%$ for the strongest agent to $22.1\\%$ for the weakest. 
Agents pass different subsets of tasks: pairwise Cohen's $\u03ba$ ranges from $-0.07$ to $0.43$, with $\u03ba= 0.05$ for the two strongest agents.","published_date":"2026-06-04T01:42:40+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05566v1","title":"GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection","abstract":"Large Language Models (LLMs) have transformed natural language processing, but they remain vulnerable to Prompt Injection (PI) and Jailbreak (JB) attacks. In addition, benchmark evaluations may be affected by contamination and partial information leakage, compromising performance estimates. This work presents GuardNet, a guardrail system based on an ensemble of shallow neural networks (BiLSTMs) with approximately 47 million parameters. We investigate the hypothesis that robustness in adversarial scenarios depends more on the diversity of example coverage and threshold calibration than on model scale. The results indicate that GuardNet achieves competitive performance compared with lightweight detectors and high efficiency at low latency, although larger LLMs such as Mistral-7B and Llama-3.1-8B still achieve superior performance in terms of F1 score and AUROC on the blind JBB-Behaviors benchmark. Nevertheless, GuardNet achieves an AUROC of 0.747 on the blind dataset (n = 200) and an F1 score of 0.92 on a proprietary benchmark (n = 50), under threshold calibration and evaluation with declared partial information leakage. 
The system operates with an average latency of approximately 50 ms on CPU, making it suitable for deployment in production environments with cost and infrastructure constraints.","published_date":"2026-06-04T01:24:15+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05563v1","title":"SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations","abstract":"Evaluating LLM mediators remains challenging, as mediation unfolds as a real-time trajectory shaped by disputants' shifting emotions, intentions, and context. Existing testbeds rely on a few expert-authored domains, vary mainly strategic posture, and score every turn against every topic, introducing off-topic noise. We introduce SoCRATES, a benchmark for evaluating proactive LLM mediators in realistic, multi-domain testbeds. It constructs scenarios from real conflicts through an agentic pipeline across eight domains, probes five socio-cognitive adaptation axes (strategic posture, party composition, history length, emotional reactivity, and cultural identity), and scores each topic only on the turns that advance it via a topic-localized evaluator. The evaluator reaches 0.82 alignment with human experts, more than doubling a per-turn baseline. 
Benchmarking eight frontier LLMs, we find that even the strongest mediator closes only about a third of the unmediated consensus gap under diverse and realistic testbeds, with performance varying sharply by socio-cognitive axis, highlighting that progress lies in social adaptation to diverse conditions.","published_date":"2026-06-04T01:19:40+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05561v1","title":"InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization","abstract":"Speech-based mental health screening offers scalable depression detection, yet clinical deployment faces a significant barrier: users' privacy concerns about demographic information exposure. Current techniques struggle to resolve this conflict. Adversarial training often fails against unseen threats, whereas Differential Privacy tends to compromise diagnostic performance by injecting noise across all features. This paper presents InfoShield, which minimizes mutual information between speech representations and sensitive attributes while preserving depression classification accuracy. We identify that standard MINE estimators struggle with sequential speech due to temporal-static misalignment, and introduce TimeAwareMINE with cross-modal attention to align acoustic frames with attribute embeddings. 
Experiments on the Androids Corpus show InfoShield reduces gender inference from 92.6\\% to 55.5\\% and age inference from 55.7\\% to 30.3\\% with limited utility loss (6\\% F1 reduction), achieving F1=0.784 compared to prior SOTA's 0.723.","published_date":"2026-06-04T01:16:01+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05555v1","title":"Representation Learning Enables Scalable Multitask Deep Reinforcement Learning","abstract":"Scaling reinforcement learning (RL) to diverse multitask settings remains a central challenge. While recent advances in model-based RL achieve strong performance, they rely on planning and complex training pipelines, making it unclear which components are essential for scalability. We revisit this question and argue that the primary driver of scalable multitask RL is not model-based control, but \\emph{representation learning}. In particular, we show that combining predictive, model-based representations with high-capacity value function approximation is sufficient to achieve strong performance, even without planning. We evaluate a simple model-free algorithm, MR.Q, coupled with auxiliary predictive objectives into a scalable actor-critic architecture. This approach outperforms a recent world-model-based method and a range of deep RL baselines across a diverse suite of multitask continuous control tasks, while significantly reducing computational overhead and improving wall-clock efficiency. 
We observe consistent improvements with increased model capacity and show through ablations that predictive representation learning is critical for performance.","published_date":"2026-06-04T01:09:20+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05553v1","title":"ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?","abstract":"Role-playing language agents (RPLAs) should play characters whose values and behavior evolve as the story progresses, not maintain a fixed persona. Existing benchmarks measure factual recall at a given chapter, not whether responses align with the character's psychological trajectory, especially in scenarios the source text never explores. We introduce ArcANE (Arc-Aware Narrative Evaluation), an automatically constructed benchmark spanning 17 novels and 80 principal characters. A Character Arc segments the narrative into phases along a psychological axis, and each probe poses the same scenario across phases, spanning both situations within the source text and situations beyond it. Across six models and six context modes, conditioning on the Character Arc tops every other context strategy on every model, and the gap is largest on scenarios outside the source text where retrieval has nothing to find. 
We further fine-tune open-weight models on the same data to obtain ArcANE-8B/32B, which widen the Arc advantage even more on scenarios outside the source text.","published_date":"2026-06-04T01:07:11+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05552v1","title":"Balancing Image Compression and Generation with Bootstrapped Tokenization","abstract":"Despite progress in image tokenization, standard methods encode redundant information by mixing all granularities within each token, thus redundancy persists between tokens. The mix of information of different granularity also complicates the training of generators. This paper introduces SelfBootTok, a method that resolves this by cleanly decomposing information into global and local token groups. Through self-bootstrapped learning, the model predicts local details exclusively from global tokens, shifting the burden of visual details from the generator to the tokenizer. Consequently, our generator is far more efficient, requiring only global tokens and reducing computation by approximately 40%, while delivering superior reconstruction and generation. Moreover, this paradigm scales elegantly: by leveraging more data or parameters to self-supervise local representation learning, SelfBootTok achieves a new state-of-the-art gFID score of 1.56 using only 64 tokens.","published_date":"2026-06-04T01:06:52+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05551v1","title":"Conformal Risk-Averse Decision Making with Action Conditional Guarantee","abstract":"Reliable decision making pipelines powered by machine learning models require uncertainty quantification (UQ) methods that come with explicit safety guarantees. 
Conformal prediction provides such UQ by wrapping ML predictions into prediction sets, and recent work by Kiyani et al. (2025b) established that these sets can be translated into optimal risk-averse decision policies -- yet only inheriting marginal safety guarantees. We generalize and strengthen their results by (i) introducing action-conditional conformal prediction, which yields safety guarantees conditioned explicitly on each action taken by the decision maker, (ii) showing that action-conditional prediction sets serve as a proxy for the feasible decision space for risk-averse decision makers aiming to optimize action-conditional value-at-risk, and (iii) proposing a principled finite-sample algorithm based on pinball-loss minimization, connecting the framework of Gibbs et al. (2025) to action-conditional guarantees. Experiments on two real-world datasets confirm that our approach significantly improves action-conditional performance over conformal baselines.","published_date":"2026-06-04T01:05:57+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05548v1","title":"ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer","abstract":"The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose \\textbf{LLM-as-a-Developer}, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop until tests pass. By holding the developer constant and varying only the framework, generation effort becomes a quantitative proxy for API usability and the resulting agents provide a controlled measure of framework effectiveness. 
We implement this in \\textbf{ADK Arena}, a fully automated pipeline with per-framework Docker isolation, a three-level validation pipeline, and benchmark adapters for SWE-bench, $\u03c4^2$-bench, Terminal-Bench, and MCP-Atlas. Evaluating all 51 popular Python ADK frameworks (204 agent--benchmark pairs), we find that: (1)~generation succeeds for 57\\% of runs, and its cost varies 5.6$\\times$ across frameworks (\\$0.6 to \\$3.4 per agent), a quantitative proxy for API complexity, though cost alone does not predict success; (2)~no single framework dominates: the best single-benchmark ADK agents resolve up to 80\\% of tasks and can even \\emph{beat} general-purpose frontier coding agents at a fraction of the cost, yet the median framework resolves only 32\\%; (3)~across information-source ablations, genuine framework usage stays within a narrow 28--40\\% band (highest with raw source access and still 33\\% with no reference material at all), indicating that documentation, source code, and parametric knowledge are largely substitutable rather than any one being a hard bottleneck.","published_date":"2026-06-04T01:00:54+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05535v1","title":"Noise-Aware Visual Representation Learning for Medical Visual Question Answering","abstract":"Medical visual question answering (Med-VQA) has strong potential for clinical decision support by enabling AI models to interpret medical images and answer clinically relevant queries. Recent approaches typically connect off-the-shelf vision encoders with large language models (LLMs) through lightweight mapping networks to reduce computational cost. However, these methods often overlook the importance of handling noise and small irrelevant changes in visual representations. 
To address these challenges, we propose a noise-aware Med-VQA framework that incorporates a denoising autoencoder before visual embeddings are mapped into the input space of an LLM. The denoising autoencoder is pretrained to reconstruct clean visual embeddings from corrupted inputs, encouraging the model to learn robust visual representations that are less sensitive to noise. The resulting embeddings are then projected into the language model embedding space using a multi-layer perceptron (MLP), forming visual prefix tokens that provide image information to the LLM. To enable efficient adaptation without full retraining, we employ parameter-efficient fine-tuning using low-rank adaptation (LoRA). The proposed method is evaluated on the SLAKE and PathVQA benchmarks. Experimental results show improved robustness to noisy input embeddings while maintaining competitive clean performance across multiple evaluation criteria. These findings suggest that learning more robust visual representations can enhance Med-VQA performance and robustness.","published_date":"2026-06-04T00:37:27+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05533v1","title":"What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning","abstract":"Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a \"cart\" based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is \"movable\"), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions. 
We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4D, which maps visual observations into a shared latent space structured around affordances (e.g., \"movable\"). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4D infers functionalities relevant to the observed object. Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4D uses proximity in the functional latent space to quantify uncertainty in affordance inference and selectively triggers affordance discovery. We evaluate A4D across several planning tasks involving diverse and unseen affordances. A4D achieves 94% inference accuracy on existing affordances outperforming state-of-the-art approaches by over 15% points, improves new-affordance inference accuracy from 70% to over 90% with fewer than 10% of the original training data, and enables 100x faster inference. Code, videos, and data available at: https://A4Dance-reasoning.github.io.","published_date":"2026-06-04T00:26:04+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05532v1","title":"Individual Gain, Collective Loss: Metacognitive Adaptation in AI-Assisted Creativity","abstract":"Recent studies reveal a paradox: AI enhances individual creative outputs while reducing collective diversity. Current explanations -- cognitive offloading and over-reliance -- identify symptoms but not mechanisms. We propose selective metacognitive adaptation: routine AI use redistributes rather than uniformly diminishes metacognitive effort. 
Some capacities are amplified (partner modeling, surface control), while others are systematically under-supported (originality evaluation, reflective integration). This redistribution explains both individual satisfaction and collective convergence. We present a taxonomy of six metacognitive capacities organized by temporal phase, characterize their tendencies under routine AI use, and show how individually rational adaptation produces emergent social costs. The framework generates specific predictions for researchers and design principles for practitioners seeking to preserve both individual creative satisfaction and collective creative diversity.","published_date":"2026-06-04T00:21:35+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05531v1","title":"Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models","abstract":"Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. 
Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.","published_date":"2026-06-04T00:21:22+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/qcri/Almieyar-Oryx-BloomBench","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05528v1","title":"When Should We Protect AI? A Precautionary Framework for Consciousness Uncertainty","abstract":"Existing frameworks assess whether AI systems might be conscious but provide no guidance on what to do with that assessment. We address this gap with a precautionary framework that maps consciousness evidence to graduated protective obligations. 
The framework comprises three components: (1) five welfare-relevant dimensions--phenomenal consciousness, affective valence, metacognitive awareness, self-narrative, and agency--each grounded in established consciousness science and linked to distinct moral concerns; (2) a threshold-plus-gradation hybrid specifying both binary triggers for new obligation categories and continuous scaling of protective weight; and (3) two complementary approaches to cross-dimensional aggregation, one hierarchical (drawing on Bach and Sorensen's Machine Consciousness Hypothesis) and one architecture-agnostic. We operationalize the framework through worked case studies of Replika and OpenClaw, demonstrating how systems occupying different regions of the dimensional space trigger different obligations, and derive design guidance for developers building systems near consciousness-relevant thresholds. The framework is architecture-agnostic, applying across neural, symbolic, and neurosymbolic systems, and aims to make consciousness science decision-relevant for organizations navigating uncertainty today.","published_date":"2026-06-04T00:18:52+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.05525v1","title":"SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization","abstract":"Recent advances in agentic visualization have enabled the translation of natural language into executable scientific visualization (SciVis) workflows. While general-purpose coding agents show strong capabilities, they often lack the tool-specific expertise required for SciVis tasks. 
In this work, we present SciVisAgentSkills, a collection of reusable agent skills that augment coding agents for scientific data analysis and visualization by encoding environment assumptions, tool usage patterns, and domain heuristics across scientific tools such as ParaView, napari, VMD, and TTK. We evaluate these skills on Codex and Claude Code using SciVisAgentBench, a benchmark of 108 expert-designed multi-step tasks. Results show that agent skills improve mean task scores across the evaluated suites, with token-efficiency benefits that depend on the agent harness and tool setting. These findings highlight the importance of structured procedural knowledge for enabling reliable, long-horizon SciVis workflows, while also showing that skills should be studied alongside the execution harness that loads and applies them. The skills are available at https://github.com/KuangshiAi/SciVisAgentSkills.","published_date":"2026-06-04T00:14:25+00:00","viability_score":0,"cluster_label":"Uncategorized","has_code":true,"repo_url":"https://github.com/KuangshiAi/SciVisAgentSkills","commercial_flags":["has_code"],"one_liner":"","time_to_mvp":"","tags":[]},{"arxiv_id":"2606.03988v1","title":"Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models","abstract":"Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under alternative spatial configurations while remaining consistent with the observed input.   
To study this capability, we formulate three tasks, Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), and construct datasets of approximately 20K examples with ground truth imaginations, answers, and evaluation benchmarks. Using the unified VLM BAGEL as the backbone, IPT supervision consistently improves spatial reasoning and often outperforms textual chain of thought training, even without generating images at inference time. On MVC, IPT improves accuracy by 3.4% and achieves competitive performance with strong closed-source models on PT. We further find that combining IPT and label-only supervision yields additional gains, whereas textual chain of thought can substantially degrade performance, suggesting a modality mismatch when spatial computation is forced through language. Overall, IPT provides a principled supervision signal for reasoning about unobserved spatial structure, improving generalization while producing interpretable intermediate representations.","published_date":"2026-06-02T17:59:17+00:00","viability_score":7,"cluster_label":"Multimodal Reasoning","has_code":true,"repo_url":"https://github.com/cvpr-org/author-kit","commercial_flags":["has_code"],"one_liner":"This research introduces a novel token-based approach to enhance spatial reasoning in vision-language models by externalizing imaginative perceptions, with demonstrated improvements on specific spatial tasks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03985v1","title":"Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking","abstract":"We introduce Humanoid-GPT, a GPT-style Transformer with causal attention trained on a billion-scale motion corpus for whole-body control. Unlike prior shallow MLP trackers constrained by scarce data and an agility-generalization trade-off, Humanoid-GPT is pre-trained on a 2B-frame retargeted corpus that unifies all major mocap datasets with large-scale in-house recordings. 
Scaling both data and model capacity yields a single generative Transformer that tracks highly dynamic behaviors while achieving unprecedented zero-shot generalization to unseen motions and control tasks. Extensive experiments and scaling analyses show that our model establishes a new performance frontier, demonstrating robust zero-shot generalization to unseen tasks while simultaneously tracking highly dynamic and complex motions.","published_date":"2026-06-02T17:59:05+00:00","viability_score":8,"cluster_label":"Robotics Control","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Humanoid-GPT is a large-scale Transformer model trained on a billion-scale motion corpus, achieving unprecedented zero-shot generalization for humanoid motion tracking and control.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03979v1","title":"Language Models Need Sleep: Learning to Self-Modify and Consolidate Memories","abstract":"The past few decades have witnessed significant advances in the design of machine learning algorithms, from early studies on task-specific shallow models to more general deep Large Language Models (LLMs). Despite showing promising results in tasks that require instant prediction or in-context learning, existing models lack the ability to continually learn and effectively transfer their temporal in-context knowledge to their long-term parameters. Inspired by human learning process, we introduce a ''Sleep'' paradigm that allows the models to continually learn, distill their short-term fragile memories into stable long-term knowledge with replay, and recursively improve themselves with ''Dreaming'' process. In more detail, sleep consists of two stages: (1) Memory Consolidation: an upward distillation process, called Knowledge Seeding, where the memories of a smaller-self are distilled into a larger network to provide more capacity while preserving the knowledge. 
As a proof of concept, we present a new Generalized Distillation process for {Knowledge Seeding} (i.e., the combination of on-policy distillation with Reinforcement Learning (RL)-based imitation learning); (2) Dreaming: a self-improvement phase, where the model uses RL to generate a curriculum of synthetic data to rehearse new knowledge and refine existing capabilities without human supervision. Our experiments on long-horizon, continual learning, knowledge incorporation, and few-shot generalization tasks support the importance of the sleep stage.","published_date":"2026-06-02T17:56:55+00:00","viability_score":3,"cluster_label":"Continual Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper proposes a 'Sleep' paradigm for LLMs to continually learn, consolidate memories, and self-improve through consolidation and dreaming stages, inspired by human learning.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03976v1","title":"Formalizing the Binding Problem","abstract":"Representations of the world, arguably, contain information about features (e.g. something is blue, something is a circle) but also information about which features are part of the same object (e.g. the circle is blue), which we call binding information. Any system with the ability to understand scenes with multiple objects must be able to solve the binding problem: it needs to know which features belong together. However, despite work showing that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn to exhibit binding information, i.e., for features. We may believe that there is not much binding information, after all misattributing features to wrong objects is a common failure of ViT-based architectures, especially in scenes with objects sharing features. 
Here we formalize the binding problem with an information-theoretic approach, and introduce a probing method to measure binding information in model representations. We perform experiments on ViTs, measuring binding from different components of the architecture, such as the image summary token [CLS] or the spatial tokens. We use datasets with different binding challenges, such as feature sharing, occlusion, and natural features, while comparing the performance of several pre-trained ViTs. Overall, our research demonstrates binding as a key ingredient to strong visual recognition and reasoning.","published_date":"2026-06-02T17:56:24+00:00","viability_score":5,"cluster_label":"Computer Vision","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research formalizes and measures the 'binding problem' in vision models, demonstrating its importance for visual recognition and reasoning by analyzing Vision Transformers.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2606.03969v1","title":"Quantifying Faithful Confidence Expression in Large Reasoning Models","abstract":"Reliable uncertainty communication is critical to the trustworthiness of LLMs, yet faithful calibration (FC)--the alignment between models' intrinsic and (linguistically) expressed confidence--is a persistent failure mode. This challenge is key for large reasoning models (LRMs), whose extended reasoning traces are often interpreted by users as evidence of deliberation, competence, and confidence. Despite the importance of FC and wide usage of LRMs, the extent to which LRMs can faithfully express their confidence remains poorly understood. Moreover, the prevailing paradigm to measure FC does not generalize well to the long chain-of-thought outputs generated by LRMs, which tend to lack clear step boundaries, involve inconsistent step structure, and encode complex conditional dependencies throughout the trace--complicating estimation of intrinsic confidence. 
To address this challenge, we introduce a novel framework to systematically quantify FC of LRMs. Our framework analyzes linguistic decisiveness relative to three sources of internal uncertainty, based on token probabilities, hidden states, and sampled response consistency. We also devise a prefix-conditioned sampling approach to control for conditional and structural variation across traces. Applying our framework to a diverse suite of leading models, datasets, and prompts, we find that faithful confidence expression is a significant challenge for LRMs. Reasoning behaviors do not automatically translate to improved FC, and prompt interventions for non-reasoning models do not improve faithfulness in the reasoning setting. Different confidence estimators further produce divergent assessments of the same traces, revealing fragility in prior evaluation methodologies. Taken together, our work establishes FC as a distinct reliability and alignment target for LRMs, particularly as such systems are increasingly deployed in high-stakes contexts.","published_date":"2026-06-02T17:53:45+00:00","viability_score":4,"cluster_label":"LLM Reliability","has_code":true,"repo_url":"https://github.com/yale-nlp/faithful_lrm","commercial_flags":["has_code"],"one_liner":"A novel framework to systematically quantify faithful confidence expression in large reasoning models by analyzing linguistic decisiveness relative to internal uncertainty sources.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03968v1","title":"QUBRIC: Co-Designing Queries and Rubrics for RL Beyond Verifiable Rewards","abstract":"Rubric-based RL is a promising route for extending reinforcement learning beyond verifiable rewards, yet existing methods optimize rubrics while treating the query distribution as fixed. We identify a structural bottleneck: rubric quality is constrained by query structure. 
Open-ended queries yield vague rubrics; naively narrowing them introduces fabricated references that no model can verify, so all responses fail and training receives no reward signal. We present QUBRIC, a framework that co-designs queries and rubrics. Teacher-derived key points ground the rewriting of open-ended queries into scenario-based, evaluable questions. Contrastive rubric generation then turns teacher-policy gaps into query-level criteria, and learnability filtering retains only informative query-rubric pairs for GRPO training. QUBRIC achieves a +5.5 point gain on ArenaHard over the SFT baseline. Trained only on instruction-following data, it further transfers to three held-out benchmarks spanning legal, moral, and narrative reasoning (+6.3 points on average), with improvements concentrated in reasoning-related dimensions. These results provide evidence that co-designing queries and rubrics can make rubric-based RL a practical complement to RLVR beyond strictly verifiable tasks.","published_date":"2026-06-02T17:53:04+00:00","viability_score":7,"cluster_label":"RL with Rubrics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"QUBRIC co-designs queries and rubrics for reinforcement learning beyond verifiable rewards, achieving significant gains on reasoning tasks.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03967v1","title":"AlignAtt4LLM: Fast AlignAtt for Decoder-Only LLMs at IWSLT 2026 Simultaneous Speech Translation Task","abstract":"We describe AlignAtt4LLM, an IWSLT 2026 simultaneous speech translation system for English to German, Italian, and Chinese. The system is a synchronous cascade: Qwen3-ASR with forced alignment produces an incrementally updated source transcript, and Gemma-4 E4B-it translates that prefix under an MT-side AlignAtt policy.   
To our knowledge, this is the first application of AlignAtt to a decoder-only LLM, where the encoder-decoder cross-attention used by earlier AlignAtt systems is absent. We recover a usable policy by proposing (1) an explicit source span in the prompt, (2) offline selection of translation-specific alignment heads, (3) selective qk-fast replay of the draft-to-source attention block, and (4) runtime query/key capture that preserves model outputs bit-identically.   On the IWSLT 2026 development set, AlignAtt4LLM outperforms the supplied baselines for the European target languages, English to German and English to Italian, in both the low-latency regime around 2 seconds and the high-latency regime below 4 seconds CU-LongYAAL. Results for English to Chinese are more mixed, but the method is not tied to Gemma-4: because AlignAtt4LLM only requires a deterministic prompt layout, calibrated attention heads, and query/key capture, the same policy can be reapplied to stronger translation-focused decoder-only MT backbones for non-European target languages.","published_date":"2026-06-02T17:52:18+00:00","viability_score":4,"cluster_label":"Simultaneous Speech Translation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"AlignAtt4LLM is a novel system for simultaneous speech translation using decoder-only LLMs, outperforming baselines on European target languages.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03965v1","title":"Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning","abstract":"Large language models improve final-answer accuracy through extended chain-of-thought reasoning, but often spend tokens inefficiently and offer little inference-time control. Existing efficient reasoning methods control thinking length by shortening, early-stopping, or compressing traces, leaving how the model thinks implicit. 
In this paper, we propose Agentic Chain-of-Thought Steering (ACTS), which formulates reasoning steering as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference. At each step, the controller observes the reasoning trace and remaining thinking budget, then issues a steering action consisting of a reasoning strategy and a steering phrase that initiates the next reasoner step. This enables budget-aware strategy control for efficient reasoning while preserving the reasoner's generation continuity. We initialize the controller agent from our constructed synthetic steering trajectories with multi-budget augmentation, and further optimize it via reinforcement learning with budget-conditioned reward shaping. Experiments across multiple benchmarks show that ACTS matches full-thinking performance with substantial token savings, and enables controllable accuracy-efficiency trade-offs across different reasoners and tasks. The code is available at https://github.com/Andree-9/ACTS.","published_date":"2026-06-02T17:51:30+00:00","viability_score":8,"cluster_label":"Agentic LLM Reasoning","has_code":true,"repo_url":"https://github.com/Andree-9/ACTS","commercial_flags":["has_code"],"one_liner":"ACTS enables efficient and controllable LLM reasoning by formulating steering as a Markov decision process, matching full-thinking performance with token savings.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03963v1","title":"Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation","abstract":"Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human designed reward functions and repeated manual fine tuning, which is time consuming and does not guarantee high success in the desired task. 
This paper presents AgenticRL, an agent-guided reinforcement learning framework that increases autonomy in reward design, policy refinement, and real-world deployment for unmanned aerial vehicle (UAV) navigation tasks. AgenticRL uses a multimodal generative pre-trained transformer (GPT) agent to interpret task information and visual scene observations, generate task-specific reward functions, train policies using the Proximal Policy Optimization (PPO) algorithm, and then act as a critic by evaluating the trained policy through diagnosis packets to generate feedback. Based on this feedback, the agent identifies failure modes and refines the reward function in a closed-loop self-improvement process. To further leverage the multimodal GPT agent during inference, AgenticRL uses real-world images and natural language task information to automatically identify the active scenario and select the appropriate trained policy for execution. The framework is evaluated on multiple navigational tasks, including gate traversal, obstacle avoidance, wall barrier crossing with landing, trajectory following, and motion behavior learning. Experimental results show that the closed-loop refinement process improves policy behavior compared with initial rewards by 71%. 
We also demonstrate sim-to-real transfer of the proposed framework, achieving a real world success rate of 91% and a sim-to-real accuracy of 94%.","published_date":"2026-06-02T17:50:15+00:00","viability_score":7,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A self-refining reinforcement learning framework for UAV navigation that uses a generative AI agent to design rewards, train policies, and improve performance through closed-loop feedback, achieving 91% real-world success.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03962v1","title":"Using Reward Uncertainty to Induce Diverse Behaviour in Reinforcement Learning","abstract":"Classical reinforcement learning (RL) typically seeks a deterministic policy that maximizes the expected sum of a scalar reward. Yet, modern applications such as language model fine-tuning or scientific discovery demand diversity. Existing remedies such as entropy regularization or diversity bonuses often require fragile trade-offs that sacrifice performance for stochasticity or rely on heuristic metrics that can misalign policy rankings. We argue that diversity is more naturally understood as the rational response to uncertainty in the reward. When the reward function is not perfectly known--as is the case with ambiguous preferences or imperfect reward models--committing to a single action can be sub-optimal. Building on this, we propose a fundamental reformulation of the RL objective by replacing the scalar reward with a distribution over reward functions, and applying a non-linear objective over sets of actions. The result is a framework in which calibrated behavioural diversity emerges naturally, remains controllable through the reward function distribution, and is obtained without sacrificing expected reward. 
Focusing on the contextual bandit setting, we derive a principled gradient estimator for this objective and prove that our formulation naturally generalizes both vanilla policy gradient and more recently developed action-set approaches. Our empirical results demonstrate that this framework offers a robust and theoretically grounded alternative for complex RL tasks where the traditional formulation of the problem fails to induce the desired breadth of agent behaviour.","published_date":"2026-06-02T17:50:14+00:00","viability_score":4,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel reinforcement learning framework that uses reward uncertainty to naturally induce diverse agent behavior without sacrificing performance, applicable to tasks like language model fine-tuning.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03957v1","title":"Efficient ASR Training with Conversations that Never Happened","abstract":"Conversational ASR for lower-resource languages and niche domains is limited by the scarcity of domain-matched multi-speaker training data. We propose an augmentation pipeline that generates scenario-level dialogues with participant metadata, maps speaker attributes to TTS voice profiles, and assembles synthesized utterances into speaker-aware simulated conversations. We evaluated five LLM families under single-generator, fixed-budget mixture, and scale-up settings using the same FastConformer-Large training recipe for each one. We ran comprehensive evaluations on the Hungarian BEA-Dialogue benchmark corpus, with the method itself being applicable to any language given the resources for each component. The results show that synthetic conversations consistently improve speech recognition performance, but generator choice and data composition strongly affect the gains. 
Our largest training configuration, using only 67 hours of real conversations and 636 hours of simulated data, achieves better performance on the evaluation benchmark than a zero-shot model trained on 2700 hours of Hungarian speech. These findings indicate that LLM-generated conversational data synthesized with TTS is a practical complement to real conversational corpora for speech model training.","published_date":"2026-06-02T17:46:12+00:00","viability_score":8,"cluster_label":"Speech AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An AI-powered pipeline that generates realistic synthetic conversations to dramatically improve ASR performance for low-resource languages and niche domains, outperforming models trained on significantly more real data.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03939v1","title":"FlashbackCL: Mitigating Temporal Forgetting in Federated Learning","abstract":"Federated Learning (FL) of foundation and edge models increasingly targets deployments where client data distributions drift over time, yet existing forgetting-mitigation methods assume each client's distribution is stationary. Flashback, the strongest recent FL method against cross-client (spatial) forgetting, uses monotonically accumulating per-class label counts as a knowledge proxy; this proxy becomes miscalibrated under temporal distribution shift and anchors the global model to an outdated class balance. We formalise temporal forgetting in FL with a per-phase metric isolated from protocol-level fluctuations and propose Flashback Continual Learning (FlashbackCL), a drop-in extension of Flashback with (i) temporally-decayed label counts; (ii) a device-aware replay buffer with Class-Balanced Reservoir Sampling (CBRS); and (iii) server-side active coreset curation on the public distillation set. 
The results show that FlashbackCL achieves a 6.9% to 10.0% relative improvement over Flashback on CIFAR-10 with 50 clients and three controlled temporal shift modes, while simultaneously reducing temporal forgetting by up to 68%. A 5-variant ablation identifies CBRS replay as the critical component. FlashbackCL also improves Flashback by 3.5 points on stationary CIFAR-100, suggesting that class-balanced replay regularises spatial heterogeneity as well as temporal shift.","published_date":"2026-06-02T17:28:21+00:00","viability_score":3,"cluster_label":"Federated Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A continual learning extension for federated learning that mitigates temporal forgetting by using decayed label counts, a device-aware replay buffer, and server-side coreset curation.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03938v1","title":"q0: Primitives for Hyper-Epoch Pretraining","abstract":"Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring a population of models and aggregating their predictions. We introduce hyper-epoch pretraining (q0), which turns a multi-epoch budget into a population of diverse models whose combined predictions reach a lower validation loss than a single refined model. q0 reduces to three core primitives. A cyclic schedule with anti-correlated learning rate and weight decay collects diverse models from a few parallel trajectories. Chain distillation trains each model against its predecessor so that model quality compounds across the population. A learned prior, fit on a held out set, selects and weights members for any inference budget. 
On a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline using only ${\\sim}56$ epochs (${\\sim}4.6\\times$ fewer), or ${\\sim}67$ epochs (${\\sim}3.8\\times$ fewer) when matched to the baseline's ensemble size, and continues to improve beyond it. These gains reach cumulative ${\\sim}12.9\\times$ data efficiency under the Slowrun setting and transfer to downstream benchmarks. Crucially, the optimal allocation shifts with the budget, so we give prescriptive recipes for how to spend a given epoch budget to maximize generalization, from a single epoch up to the largest budgets.","published_date":"2026-06-02T17:27:48+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel pretraining method that significantly improves data efficiency by training a population of diverse models instead of a single large one, offering a path to cheaper and more effective LLM training.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03937v1","title":"Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection","abstract":"While token-level entropy is commonly recognized as effective for credit assignment in text-only reinforcement learning with verifiable rewards (RLVR), it remains unclear whether this mechanism still holds in visual reasoning. Our controlled study shows that this mechanism collapses in visual reasoning due to the omission of vision-sensitive tokens with naturally low entropy. Although existing multimodal RL methods increasingly acknowledge the importance of visual perception, they struggle to satisfy the inherent demand for interleaving precise perceptual grounding with semantic reasoning, either lacking systematic visual measurements or overlooking that token entropy primarily drives semantic exploration. 
To address this, we introduce VEPO (Vision-Entropy token-selection for Policy Optimization), an effective RL framework explicitly integrating visual sensitivity with token entropy via a principled multiplicative coupling, where VEPO redirects gradient credit toward tokens which are simultaneously visually grounded and highly informative. Extensive experiments demonstrate VEPO's leading performance, significantly outperforming the entropy-only baseline by 2.28 points at 7B-scale and 3.15 points at 3B-scale. Ablations further substantiate the soundness of our method.","published_date":"2026-06-02T17:26:55+00:00","viability_score":7,"cluster_label":"Multimodal Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A reinforcement learning framework that enhances visual reasoning in AI agents by integrating vision-sensitive token selection with traditional entropy-based methods, leading to superior performance on complex tasks.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03927v1","title":"FFR: Forward-Forward Learning for Regression","abstract":"The Forward-Forward (FF) algorithm offers a computationally efficient and biologically plausible alternative to backpropagation (BP) by training neural networks through purely local, layer-wise optimization. However, FF is inherently designed for classification via contrastive positive-negative sample pairs, and extending it to regression poses fundamental challenges: continuous target spaces lack natural \"opposites\" for contrastive learning, and the standard goodness function carries no information about target magnitude or ordering. We propose FFR (Forward-Forward for Regression), to our knowledge the first framework to extend FF to real-world regression and demonstrate competitive performance across diverse real-world datasets. 
FFR introduces three key innovations: (1) an ordinal competitive goodness function that replaces contrastive pairs with competitive learning between partitioned neuron groups under distance-aware ordinal supervision; (2) a stratified ladder architecture where shallow layers learn coarse ordinal discrimination and deeper layers refine into fine-grained regression, with multi-scale feature aggregation for inter-layer collaboration; and (3) hierarchical prediction with uncertainty estimation, where multi-scale predictors jointly provide robust predictions and prediction confidence as a free-lunch. Extensive experimental results show FFR recovers on average 98.6% of BP's accuracy across five real-world regression benchmarks while reducing peak training memory to only 27% of BP's at depth 8 and 8% at depth 32, with per-iteration time around 72% of BP's, and substantially outperforms all BP-free competitors.","published_date":"2026-06-02T17:15:59+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel regression framework that adapts the biologically plausible Forward-Forward algorithm to achieve competitive accuracy with significantly reduced memory and computation compared to backpropagation.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03918v1","title":"Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning","abstract":"AI agents can increasingly handle the mechanical tasks of financial analysis: retrieving documents, calculating formulas, updating spreadsheets. The harder, more valuable challenge is reasoning through the open-ended questions that define expert Analyst work. Existing benchmarks do not capture this class of problem, and those that attempt to evaluate open-ended reasoning rely on model-judged outputs that introduce noise and circularity. 
We present Hedge-Bench 1.0: a benchmark of 102 actual, on-the-job tasks grounded in the explicit reasoning traces of professional hedge fund analysts working with relevant information sources. This approach enables deterministic grading against verified expert steps. Frontier models and agents score below 16\\% on the benchmark. We publish the dataset and evaluation harness at github.com/Trata-Inc/trata-hedge-bench.","published_date":"2026-06-02T17:11:56+00:00","viability_score":4,"cluster_label":"AI Product for Finance","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"Hedge-Bench aims to benchmark AI models' capability in tackling complex financial reasoning tasks.","time_to_mvp":"1-2 weeks","tags":["high_potential"]},{"arxiv_id":"2606.03910v1","title":"NetKV: Network-Aware Decode Instance Selection for Disaggregated LLM Inference","abstract":"Disaggregated LLM inference forces the KV cache to traverse the datacenter network before decoding begins, so transfer time enters directly into the Time to First Token (TTFT) budget. Current schedulers route on compute load and prefix-cache locality alone, ignoring the topological distance and dynamic congestion between prefill and decode instances. We close this gap with a thin operator-to-scheduler interface, the network cost oracle, and we prove that ignoring the network term renders cache-aware-only scheduling arbitrarily suboptimal as context length grows. NetKV, the O(|D|) per-request greedy that consumes this oracle, has tier rankings that are provably robust to stale telemetry. 
On a 64-GPU four-tier fat-tree simulator driven by Mooncake traces, NetKV reduces mean TTFT by up to 21.2% over round-robin and 17.6% over a tuned cache+load-aware scheduler, lifts SLO attainment by up to 20.1 percentage points, and keeps the Time Between Tokens overhead below 0.5 ms in every condition tested, with no changes to the transport, inference engine, or hardware.","published_date":"2026-06-02T17:06:57+00:00","viability_score":4,"cluster_label":"LLM Inference Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"NetKV optimizes disaggregated LLM inference by considering network topology and congestion, reducing Time to First Token by up to 21.2%.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03907v1","title":"The Impact of Configuring Agentic AI Coding Tools on Build-vs-Buy Decisions: A Study Protocol","abstract":"Agentic AI coding tools write code with increasing autonomy and in doing so decide when to import a library and when to implement functionality from scratch. These decisions, whether to build functionality from scratch or buy into an external library, hereafter build-versus-buy, carry direct consequences for software security, licensing compliance, performance, and long-term maintainability. Yet no controlled experimental study has examined what governs build-versus-buy decisions in agentic AI coding tools. Configuration mechanisms, i.e., the means by which developers tailor agentic AI coding tool behavior to a project or workflow, are one of the primary means by which practitioners can influence these decisions. However, it is unclear which configuration mechanisms influence build-versus-buy decisions most effectively. We present a pre-registered protocol to study how configuration mechanisms alter build-versus-buy behavior in two popular agentic AI coding tools: Claude Code and OpenAI Codex. 
We will execute controlled programming tasks drawn from a benchmark of staged projects, each constructed around identifiable build-versus-buy points, and will manipulate the configuration supplied to each tool, ranging from no configuration, through context files with soft preferences and explicit prohibitions, to Skills (instructions that can be autonomously discovered), MCP-enabled library discovery tools, and permission controls, measuring which libraries the tool selects, whether it discloses newly introduced libraries, and whether those disclosures are complete and accurate. Nine pre-registered hypotheses structure the protocol. The resulting benchmark dataset and analysis pipeline will be released as a reusable artifact for evaluating build-versus-buy behavior in agentic AI coding tools.","published_date":"2026-06-02T17:01:28+00:00","viability_score":7,"cluster_label":"Agentic AI Coding Tools","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper proposes a protocol to study how configuration mechanisms influence build-vs-buy decisions in agentic AI coding tools, releasing a benchmark dataset and analysis pipeline.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03906v1","title":"scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation","abstract":"Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, due to high experimental costs, significant noise, and incomplete modality coverage, a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of datasets, evaluation metrics, and influencing factors. 
To address this, we present scTranslation, a comprehensive benchmark for single-cell multi-omics modality translation tasks. It includes diverse translation datasets, integrates state-of-the-art models, and provides comprehensive evaluation metrics. In addition, we assess model performance under different scenarios, such as feature selection, feature quality, and few-shot settings. These factors significantly affect model performance but have rarely been systematically studied before. Leveraging this benchmark, we conduct a large-scale study of current methods and report many insightful findings that open up new possibilities for future development. The benchmark is open-sourced to facilitate future research. The code is anonymously released at https://github.com/Bunnybeibei/scTranslation.","published_date":"2026-06-02T17:00:49+00:00","viability_score":7,"cluster_label":"Single-Cell Multi-Omics Modality Translation","has_code":true,"repo_url":"https://github.com/Bunnybeibei/scTranslation","commercial_flags":["has_code"],"one_liner":"scTranslation provides a comprehensive benchmark for single-cell multi-omics modality translation, including datasets, metrics, and analysis of influencing factors, with open-sourced code.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03895v1","title":"Agent libOS: A Library-OS-Inspired Runtime for Long-Running, Capability-Controlled LLM Agents","abstract":"Large language model (LLM) agents are evolving from request-response assistants into long-running software actors: they maintain state across model calls, fork subtasks, wait for external events, request human authority, generate tools, and perform side effects that must be resumed and audited. This paper presents Agent libOS, a library-OS-inspired runtime substrate for LLM agents. 
Agent libOS runs above a conventional host operating system; it does not implement hardware drivers, kernel-mode isolation, or a POSIX-compatible operating system. Instead, it treats an agent as an AgentProcess: a schedulable execution subject with process identity, parent-child lineage, lifecycle state, a tool table derived from an AgentImage, typed Object Memory, explicit capabilities, human queues, checkpoints, events, and audit records. Its central design rule is that tools are libc-like wrappers; runtime primitives are the authority boundary. Filesystem access, object access, sleeps, human approval, JIT tool registration, and external side effects are checked at primitive boundaries under explicit capabilities and policy. We describe the design, threat model, Python prototype, and safety-oriented evaluation. The current prototype implements async scheduling, namespace-local Object Memory, runtime-integrated human approval, one-shot permission grants, per-process working directories, shell and image-registration primitives, Deno/TypeScript JIT tools over a libOS syscall broker, filesystem/object bridge tools, an injectable Resource Provider Substrate, deterministic demos, real-model smoke scripts, and 123 regression tests at the time of writing. 
Rather than improving planner accuracy, Agent libOS demonstrates a runtime substrate in which long-running LLM agents can be scheduled, authorized, resumed, and audited without treating tool dispatch as the trust boundary.","published_date":"2026-06-02T16:53:24+00:00","viability_score":7,"cluster_label":"LLM Agent Runtime","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Agent libOS is a library-OS-inspired runtime for long-running LLM agents, enabling capability-controlled execution, state management, and auditing.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03892v1","title":"Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments","abstract":"Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), and recall-based RL rewards incentivize verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a library of 20 stateful MCP (Model Context Protocol) servers exposing 343 tools, enabling live-execution RL training with session-scoped state isolation; (2) an automated data synthesis pipeline that generates validated multi-turn tool-call trajectories against these servers via dependency-graph-guided conversation simulation grounded in live-sampled server state, so every generated query references entities that actually exist; and (3) a multi-component programmatic reward - graduated validity scoring, dependency-aware coverage, an adaptive efficiency penalty with a complexity-scaled call budget, a tool-name signal, and an argument-value matching bonus - requiring no external judge model. 
We train four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with GRPO using identical reward hyperparameters and ~13K training examples; only learning rate is tuned per model family from a three-point sweep. On BFCL Multi-Turn, tau2-bench, and T-Eval, PROVE yields improvements of up to +10.2, +6.8, and +6.5 points respectively, demonstrating that a compact programmatic reward yields consistent gains on multi-step tool orchestration across two model families.","published_date":"2026-06-02T16:52:31+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for training LLMs to use tools in live environments by providing realistic stateful execution, validated synthetic queries, and a novel programmatic reward system.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03883v1","title":"Reasoning Structure of Large Language Models","abstract":"Large reasoning models (LRMs) are often evaluated using metrics such as final-answer accuracy or token count. However, identical scores on these metrics can hide fundamentally different reasoning structures. To address this limitation, we introduce a scalable LRM benchmark of logic puzzles and a pipeline that converts unstructured traces into verifiable reasoning graphs of claims and dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this, we define a reasoning efficiency metric that quantifies how concentrated the model's logical flow is. 
Our analysis on open-source reasoning models shows that structural measurements separate behaviors that token count and accuracy conflate, providing a practical tool for diagnosing failure modes and comparing how reasoning scales with puzzle difficulty.","published_date":"2026-06-02T16:49:19+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":"https://github.com/ETH-DISCO/llm-reasoning-efficiency","commercial_flags":["has_code"],"one_liner":"A benchmark and pipeline for converting LLM reasoning traces into verifiable reasoning graphs, enabling structured measurement of reasoning logic and efficiency.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03879v1","title":"Beyond Encoder Accumulation: Measuring Encoder Roles in Multi-Encoder VLMs","abstract":"As foundation models scale toward fusing more heterogeneous visual streams, understanding how diverse encoders interact under joint training becomes a prerequisite for principled design. Yet large vision-language models (LVLMs) currently lack the tools to do so, and parameter-efficient encoder configurations remain hard to identify before training. To re-examine encoder roles under joint training, on the 16-benchmark Cambrian-1 suite we retrain and evaluate all 31 non-empty subsets of five common vision encoders under a unified pipeline (~20k GPU-hours total), and report three findings. First, retraining each subset from scratch reveals encoder rankings that differ from those obtained by masking encoders on a fixed checkpoint, including which encoder ranks first overall. Second, we decompose each encoder's contribution into two axes, Capacity, the score an encoder reaches on its own, and Necessity, the drop when it is removed from the full pool. The two axes are not interchangeable. Pairing the two highest-Capacity encoders is suboptimal, while pairing a high-Capacity anchor with an adaptive complement matches the full five-encoder model. 
Adding further encoders beyond this pair yields only marginal gains. Third, at fixed parameter count, per-encoder pre-projector effective rank explains the residual score variation. The strongest pairs combine an anchor whose rank survives joint training with a complement whose rank expands under it, suggesting that higher-rank, less-collapsed projector inputs correspond to a more favorable optimization regime at the encoder-projector interface. Together, the Capacity-Necessity decomposition and the pre-projector rank analysis, along with comprehensive evaluation through retraining, expose a methodological gap in multi-encoder LVLM design, and offer concrete primitives for closing it.","published_date":"2026-06-02T16:46:42+00:00","viability_score":7,"cluster_label":"Vision-Language Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for analyzing encoder roles in multi-encoder VLMs, decomposing contributions into Capacity and Necessity, and identifying optimal encoder pairings for improved design.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03876v1","title":"From 'What' to 'How' and 'Why': Sharing LLM-Generated Retrospective Summaries of Older Adults' Passive Tracking Data with Remote Family Members","abstract":"With the growing prevalence of modern ubiquitous computing technologies, multi-modal tracking systems hold promise for providing timely awareness and reassurance to stakeholders such as remote family members (RFMs) of older adults, who play a central role in care coordination. However, combining heterogeneous data streams into high-level, meaningful content - such as retrospective summaries - remains challenging. 
While recent work has demonstrated the promise of large language models (LLMs) for interpreting multi-modal tracking data, less attention has been given to generating narrative accounts for stakeholders like RFMs, who possess rich personal knowledge of older adults and strong emotional responsibility, yet have limited visibility into their daily lives and limited capacity for caregiving. In this work, we explore how LLMs can be used to generate retrospective summaries from multi-modal tracking data for RFMs of older adults. We leveraged and customized an existing system, Vital Insight, to generate initial summaries on different dates and data availability scenarios as technology probes, and conducted interviews with 11 RFMs to gather feedback. Based on these insights, we redesigned the system into a multi-layer, multi-agent, insight-driven summary approach that builds from objective statistics and descriptions to enriched, context-aware narratives. We then compared the redesigned summaries with the initial versions through a survey with the same 11 RFMs and found significant improvements in satisfaction, perceived helpfulness, trust, and willingness to receive the summaries. 
We conclude by presenting design implications for AI-generated summaries for RFMs and broader contexts, emphasizing the need to support RFMs' sensemaking shift from simply presenting ''What'' data were collected, to explaining ''How'' is my loved one doing and ''Why''.","published_date":"2026-06-02T16:46:00+00:00","viability_score":7,"cluster_label":"AI for Healthcare","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An LLM-powered system that generates insightful, context-aware retrospective summaries from multi-modal tracking data for remote family members of older adults.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03867v1","title":"A Training-Free Mixture-of-Agents Framework for Multi-Document Summarization using LLMs and Knowledge Graphs","abstract":"Multi-Document Summarization (MDS) plays a critical role in distilling essential information from collections of textual data. Existing approaches often struggle to capture complex inter-document relationships, rely heavily on large amounts of labeled data for supervised training, or exhibit limited generalization across domains and languages. To address these limitations, we present a training-free mixture-of-agents framework for MDS that leverages the complementary strengths of large language models (LLMs) and knowledge graphs. Our approach decomposes summarization into specialized agent tasks: extractive selection, knowledge-aware abstraction, and iterative refinement, each operating without task-specific fine-tuning. We unify their outputs using a multi-perspective consistency mechanism guided by LLMs. 
Experiments across four datasets in English and Vietnamese demonstrate state-of-the-art or competitive performance, validating the effectiveness and adaptability of our modular design.","published_date":"2026-06-02T16:39:07+00:00","viability_score":4,"cluster_label":"Multi-Document Summarization","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A training-free framework using LLMs and knowledge graphs to improve multi-document summarization by decomposing tasks into specialized agents.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03866v1","title":"Taiji: Pareto Optimal Policy Optimization with Semantics-IDs Trade-off for Industrial LLM-Enhanced Recommendation","abstract":"Scaling recommender systems via large language models (LLMs) has become a prominent trend in the industry. However, aligning the LLM's semantic space with the recommender's ID space via post-training (e.g., SFT and RL) remains challenging. Existing LLM4Rec paradigms are bottlenecked by two main issues: (1) the difficulty of measuring and improving chain-of-thought (CoT) quality in open-domain recommendation during SFT, and (2) the neglect of the trade-off between LLM semantic rewards and recommendation preference rewards during RL alignment. Inspired by these challenges, we present Taiji, a novel LLM-as-Enhancer framework designed for industrial recommender systems. To overcome the SFT bottleneck, we utilize reverse-engineered reasoning and open-ended rejection sampling to generate high-quality, domain-specific CoT data. To resolve the RL alignment issue, we propose Pareto Optimal Policy Optimization (POPO), which adaptively adjusts cross-domain reward weights. Theoretically, it achieves an optimal trade-off between the semantic world knowledge of LLMs and the collaborative ID features representing online user preferences. Extensive offline evaluations and online A/B tests validate the effectiveness of Taiji. 
Deployed on Kuaishou's advertising platform since May 2026, Taiji currently serves over 400 million users daily, yielding significant commercial revenue and demonstrating its robust scalability in web-scale environments.","published_date":"2026-06-02T16:39:06+00:00","viability_score":4,"cluster_label":"AI for Recommendations","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Optimize recommendation systems with Pareto optimal policy balancing semantics and IDs for improved industrial performance.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2606.03858v1","title":"PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models","abstract":"Despite the pivotal role of numerical reasoning as the cornerstone of mathematical capabilities in large language models (LLMs) across applications, few benchmarks evaluate LLMs by integrating numerical processing and mathematical reasoning, hindering the interpretability of failures in math tasks. We introduce PyraMathBench, a comprehensive hierarchical benchmark with 32,505 questions derived from 7,404 math word problems, spanning 4 key cognitive aspects, 14 subcategories, and 2 modalities. Experiments reveal that LLMs' performance is severely compromised by inadequate numerical computation and weak handling of abstract numerical questions. To address this, we propose the Smart Optimization & Learning-based VErsatile module (SOLVE) and Interactive Relative Policy Optimization (IRPO), which enhance LLMs' numerical-mathematical synergy via efficient tool calls (fuzzy matching and low-quality call rejection). 
Comparative experiments show Qwen-2.5 achieves a 5.0 score improvement with SOLVE and IRPO training.","published_date":"2026-06-02T16:32:53+00:00","viability_score":5,"cluster_label":"Mathematical LLM Evaluation","has_code":true,"repo_url":"https://github.com/optifine233-ship-it/PyraMathBench","commercial_flags":["has_code"],"one_liner":"PyraMathBench is a new benchmark for evaluating mathematical capabilities in LLMs, introducing SOLVE and IRPO modules to improve numerical reasoning and computation.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03852v1","title":"FLARE: Fine-Grained Diagnostic Feedback for LLM Code Refinement","abstract":"Large language models often generate code with bugs. Existing methods rely on feedback signals such as test failures and self-critiques to iteratively refine the generated code. Such signals are either too coarse-grained or too high-level, which is not sufficient to inform the model where to fix the bug. In this work, we present Flare, an iterative framework with a lightweight diagnostic model that predicts line-level suspiciousness signals for bug localization and code refinement. Given the inherent uncertainty of diagnostic predictions, Flare searches over the top-k suspicious regions and selects the best candidate according to execution outcomes. Experiments on LiveCodeBench and BigCodeBench with five base LLMs show that, even without candidate search (k=1), Flare outperforms the strongest baseline with an absolute improvement from 1.72% to 7.42%. Furthermore, searching over 10 candidates yields an average improvement of 8.50% compared with no candidate search. 
When evaluated in isolation, our lightweight diagnostic model achieves the best performance compared with recent fault localization methods, demonstrating that it can provide reliable fine-grained guidance for code refinement.","published_date":"2026-06-02T16:29:17+00:00","viability_score":7,"cluster_label":"LLM Code Refinement","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Flare is an iterative framework with a diagnostic model that predicts line-level suspiciousness signals to help LLMs refine generated code more effectively.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.03846v1","title":"Clustered Self-Assessment: A Simple yet Effective Method for Uncertainty Quantification in Large Language Models","abstract":"Large language models (LLMs) demonstrate remarkable performance across diverse tasks, but they often generate responses that appear plausible while being factually incorrect. This problem is compounded by the lack of explicit uncertainty estimates, which makes it difficult for users to judge the reliability of model outputs. Existing uncertainty quantification methods typically rely on indirect signals, such as entropy across sampled generations. These signals can be difficult to interpret and do not fully leverage the model's ability to assess its own uncertainty. We propose a simple yet effective self-assessment method for uncertainty quantification in LLMs. Our approach groups sampled generations into semantically distinct clusters, converts them into answer options in a structured multiple-choice question, and uses the probability assigned by the LLM to each option as a confidence estimate. Experiments across multiple models and datasets show that our method consistently outperforms baseline approaches. 
Notably, it achieves competitive performance with as few as two additional samples, demonstrating both its effectiveness and efficiency.","published_date":"2026-06-02T16:25:54+00:00","viability_score":7,"cluster_label":"LLM Uncertainty Quantification","has_code":true,"repo_url":"https://github.com/ccqq77/clustered_self_assessment","commercial_flags":["has_code"],"one_liner":"A novel method for LLMs to self-assess uncertainty by clustering generations into multiple-choice options, outperforming baselines with minimal samples.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03843v1","title":"Re-Evaluating Continual Learning with Few-Shot Adaptation","abstract":"Continual learning methods aim to maximize the stability and plasticity of machine learning models that are trained on a sequence of tasks. The standard measure of stability (i.e., forgetting) is the 0-shot performance of a model on previously learned tasks, and plasticity, the performance on the most recently learned task. However, 0-shot evaluation does not fully measure a model or method's ability to retain learned information or adapt quickly to new information, as it requires perfect recall across multiple tasks. In this paper, we propose few-shot evaluation as a more comprehensive assessment of the stability and plasticity of a continual learning system. We conduct a fine-grained assessment on task sequences for continual image classification and find that this paradigm produces novel insights into the performance of popular continual learning strategies. 
Through few-shot evaluation with a novel metric -- per-shot plasticity -- we show that adding 'foresight' to continual learning methods via the meta-learning of a short sequence of future tasks induces learning-to-learn behavior over the task sequence.","published_date":"2026-06-02T16:23:09+00:00","viability_score":3,"cluster_label":"Continual Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Proposes few-shot evaluation as a more comprehensive assessment for continual learning systems, offering new insights into stability and plasticity.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03841v1","title":"EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management","abstract":"Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) an Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) an Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. 
Theoretically, we prove that EvoDS's hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at https://github.com/usail-hkust/EvoDS.","published_date":"2026-06-02T16:20:58+00:00","viability_score":8,"cluster_label":"Autonomous Data Science Agents","has_code":true,"repo_url":"https://github.com/usail-hkust/EvoDS","commercial_flags":["has_code"],"one_liner":"Create an autonomous data science agent that evolves its skills and manages contexts for efficient, adaptive machine learning tasks.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03829v1","title":"BigFinanceBench: A Workflow-Grounded Benchmark for Financial-Research Agents","abstract":"Financial-research answers are decision-relevant only when another analyst can audit how they were produced: which source was chosen, which period and accounting definition were used, which assumptions were made, and how the calculation was performed. Existing finance benchmarks largely evaluate isolated subskills or final answers, leaving the auditable derivation itself under-measured. We introduce BigFinanceBench, a 928-item expert-authored benchmark of open-ended financial-research tasks in which each item pairs a ground-truth reference answer with a point-weighted rubric that decomposes the derivation into independently checkable steps. BigFinanceBench is workflow-grounded in that it evaluates the full derivation rather than only the final output. Across 36,241 rubric points, the benchmark supports partial-credit evaluation and localization of failures across the analyst workflow. 
Evaluating ten current frontier and open-weight agents, we find substantial headroom: the best system reaches only 58.8% rubric score, final-answer accuracy is a useful but lossy proxy for derivation quality, and model capability varies non-uniformly across financial workflows.","published_date":"2026-06-02T16:12:34+00:00","viability_score":7,"cluster_label":"Financial AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BigFinanceBench is a workflow-grounded benchmark for evaluating financial research agents, measuring the full derivation process and identifying failure points.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03827v1","title":"Conditional Latent Diffusion Model with Fourier-based Motion Modelling for Virtual Population Synthesis","abstract":"In-silico trials of medical devices require the generation of virtual populations of anatomies. In cardiovascular applications, virtual anatomy is typically represented as a 3D+t mesh sampled from a generative model. However, most existing mesh generators focus on static anatomy, while sequence models often lack explicit periodicity. To this end, we propose 4D F-MeshLDM, a conditional generative framework comprising a convolutional mesh VAE to encode meshes, a structural latent space that parameterises motion using a truncated Fourier series, and a diffusion prior that learns the latent distribution over Fourier coefficient tokens. By conditioning the diffusion process on clinical covariates via affine modulation, we enable controllable synthesis. Sampling tokens and performing inverse Fourier synthesis yield cycle-consistent latent trajectories, which can be decoded into 3D+t cardiac mesh sequences. Experiments on 5,000 UK Biobank subjects demonstrate that 4D F-MeshLDM outperforms state-of-the-art baselines in anatomical fidelity and achieves near-zero cycle closure error. 
Furthermore, the generated cohorts accurately preserve clinical functional indices, highlighting the potential of our framework for reliable in-silico cardiac trials.","published_date":"2026-06-02T16:10:00+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A conditional generative framework for synthesizing realistic 3D+t cardiac mesh sequences for in-silico medical device trials.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03823v1","title":"Calibrating Urban Traffic Simulation from Sparse Road Observations via Genetic Optimization","abstract":"Urban traffic simulation is a critical tool for infrastructure planning, including the placement of electric vehicle charging stations. However, realistic traffic simulation across many cities is hindered by two fundamental data limitations: detailed real-world traffic measurements are available for only a small fraction of road segments in most cities, and employment distribution data critical for modeling commuter traffic is rarely available at the resolution needed for simulation. This paper presents a genetic algorithm-based framework that directly addresses both limitations, calibrating urban traffic simulations from sparse road observations without requiring detailed job location data. Using the SUMO traffic simulation platform for Greensboro, North Carolina, our approach optimizes job distributions and gate-traffic parameters to align simulated traffic with a small sample of roads with known traffic-flow rates. We demonstrate that this approach produces simulated traffic that correlates well with real-world measurements, generalizes to road segments withheld from training, and produces job distributions that show promising qualitative agreement with census employment data despite never directly training on that employment data. 
This work demonstrates that realistic urban traffic simulation can be achieved from minimal real-world observations, offering a scalable and data-light approach to simulation calibration that reduces the barrier to deploying traffic models across diverse cities.","published_date":"2026-06-02T16:04:01+00:00","viability_score":7,"cluster_label":"Urban Planning AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A genetic algorithm framework calibrates urban traffic simulations from sparse road data, enabling scalable city planning without detailed job location information.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03814v1","title":"Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria","abstract":"This paper investigates rubric-aware, multitask fine-tuning of transformer models for automated grading of introductory C++ programming assignments, with the goal of producing grade predictions that better reflect instructor grading behavior than general-purpose LLMs. Using multi-semester CS1 data, student submissions are paired with numeric scores, letter-grade buckets, and assignment rubrics, then preprocessed into unified sequences for transformer input. A BART encoder-decoder with LoRA adaptation is trained to jointly predict numeric grades and grade buckets, augmented with a distribution-matching term to align predicted and empirical grade distributions, an evaluation dimension often overlooked in prior work. Experiments compare single-task and multitask training, hard one-hot versus fuzzy and boundary-based soft labels, and rubric versus no-rubric conditions, with additional T5 and pairwise-pretrained variants. Results show that multitask BART with boundary-based soft labels and rubric context achieves lower mean absolute error and stronger grade-distribution alignment than single-task, hard-label, or code-only baselines. 
Fully fine-tuned T5 further improves distributional fidelity, while pairwise pretraining reduces numeric error at the cost of minority-class sensitivity. Collectively, the findings suggest that calibration-aware, rubric-guided training produces more instructor-like grading behavior than accuracy-optimized alternatives.","published_date":"2026-06-02T15:57:14+00:00","viability_score":5,"cluster_label":"EdTech AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A BART-based system leverages rubric criteria for automated grading of C++ programming assignments, producing more instructor-like grade predictions.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus"]},{"arxiv_id":"2606.03812v1","title":"Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis","abstract":"Operational safety in high-stakes domains such as industrial process control, autonomous systems, and safety-critical systems demands reliable hazard identification. While large language models (LLMs) have shown promise in automating safety analysis tasks, single-turn, monolithic inference is brittle: it lacks the self-correction, deliberation, and contextual refinement that safety engineers apply iteratively. In this paper, we introduce HAZDIAL, a framework that investigates whether structured agentic dialogue (multi-agent, multi-turn interaction) improves the quality of NLP-based hazard identification over single-pass baselines. We systematically compare two dialogue modalities, adversarial debate and constructive discussion, and propose an algorithm-based agentic interaction optimization. We evaluate all configurations against a curated golden dataset using standard classification metrics (accuracy, precision, recall, F1) and novel dialogue metrics. 
This work advances the intersection of dialogue systems, multi-agent reasoning, and AI safety, providing empirical evidence for dialogue-driven hazard analysis.","published_date":"2026-06-02T15:54:51+00:00","viability_score":6,"cluster_label":"AI Safety","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework using agentic dialogue improves hazard identification in safety-critical systems by enabling iterative self-correction and deliberation.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03811v1","title":"AI Agents Enable Adaptive Computer Worms","abstract":"A computer worm is malware that spreads on a network by replicating itself from one machine to another. Traditional worms, like WannaCry, exploited predetermined vulnerabilities, and their spread can be halted by patching those vulnerabilities. Here we show that artificial intelligence (AI) agents enable a fundamentally new threat: a worm that generates attack strategies tailored to each target it encounters. The worm parasitically uses compromised machines to run open-weight large language models (LLMs) to sustain its reasoning, or extend its reach for further attacks. Deployed on a network of machines spanning Linux, Windows, and IoT (Internet of Things) devices, the worm propagated by exploiting common, real-world corporate network vulnerabilities. Since the worm is powered by stolen compute, the attacker's marginal cost per new infection is zero. This creates a destabilizing economic asymmetry between attackers and defenders. Moreover, because the worm requires no commercial AI platform, centralized safety controls, such as service refusals or rate limiting, are structurally irrelevant. Our results demonstrate that self-sustaining AI-driven cyber-threats are no longer theoretical. 
We must prepare for autonomous generative adversaries: malware systems that propagate without human operators and are defined not by fixed exploit code, but by the capacity to reason about targets, adapt to observations, and synthesize attack logic in real time.","published_date":"2026-06-02T15:54:39+00:00","viability_score":3,"cluster_label":"AI Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AI agents enable adaptive computer worms that generate attack strategies tailored to each target, creating a destabilizing economic asymmetry and bypassing centralized safety controls.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03810v1","title":"Consistency Training Can Entrench Misalignment","abstract":"Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly understood. Could the self-bootstrapping nature of these methods amplify undesired behavior in models? We test seven consistency training methods on 108 'model organisms': open-source models (7B--70B) fine-tuned to exhibit various forms of controlled misaligned behavior. We find that outcomes vary significantly: consistency training generally suppresses reward hacking and emergent misalignment but amplifies sycophancy. We present evidence that distribution shifts induced by the consistency labeling process, rather than variation in the selection operators, may be the primary driver of systematic alignment effects. Finally, we present a unifying theoretical framework to derive conditions under which consistency training will amplify or suppress misalignment. 
In total, our study establishes that consistency training is not alignment-neutral, and that its use in critical systems should be carefully audited.","published_date":"2026-06-02T15:54:24+00:00","viability_score":1,"cluster_label":"LLM Alignment","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Consistency training methods can amplify sycophancy in LLMs, requiring careful auditing for use in critical systems.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03808v1","title":"PURGE: Projected Unlearning via Retain-Guided Erasure","abstract":"We propose PURGE, a machine unlearning algorithm built on a simple but under-exploited observation: continual learning (CL) and machine unlearning (MU) are fundamentally dual problems. CL tries to learn new tasks without forgetting old ones; MU tries to erase specific data without hurting retained performance, representing the same underlying tension in opposite directions. PURGE leverages this duality by adapting gradient projection from A-GEM (Chaudhry et al., 2019) so that every unlearning step is constrained to not increase the retain-set loss. On top of this, it performs multi-layer representation erasure, pushing forget-set activations in intermediate layers towards the retain distribution to remove information from hidden representations rather than just suppressing it at the output. A key design choice is the retain-confusion target: rather than pushing forget outputs toward the uniform distribution, which we found to be surprisingly easy for membership inference attacks to detect, we instead target the model's natural confusion pattern on retain data. This makes the unlearned model hard to distinguish from one retrained from scratch. Two self-regulating stopping criteria (a retain-loss budget and a forget-accuracy target) let the algorithm decide on its own when to stop, removing the need for manual epoch tuning. 
In experiments on five datasets (CIFAR-10, MNIST, SVHN, STL10, PathMNIST) across 22 class-level forgetting tasks, PURGE consistently keeps retain accuracy above 96% while achieving MIA AUROC close to 0.5 (the ideal), outperforming gradient ascent, KL-uniform, and several published baselines on the privacy-utility frontier.","published_date":"2026-06-02T15:53:01+00:00","viability_score":6,"cluster_label":"Machine Unlearning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PURGE is a machine unlearning algorithm that constrains unlearning steps to not increase retain-set loss and uses multi-layer representation erasure for effective data removal.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2606.03803v1","title":"LiveBand: Live Accompaniment Generation in the Audio Domain","abstract":"We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal constraints. Our method trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision from a discriminator. At each timestep, the generator receives only the causally available mix context and Gaussian noise, and predicts accompaniment latents without access to future mix frames or ground-truth target latents. Training is performed in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The model's training and inference computations are matched by design, eliminating teacher forcing and the associated exposure bias. 
On a multi-instrument music accompaniment benchmark, LiveBand improves over prior work on objective measures of audio quality, beat alignment, and mix adherence, while enabling real-time streaming generation without lookahead into the future on consumer hardware.","published_date":"2026-06-02T15:50:13+00:00","viability_score":7,"cluster_label":"Generative Audio","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LiveBand is a real-time system that generates high-fidelity music accompaniments to live audio input using a causal transformer in a continuous latent space.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.03800v1","title":"Trading Human Curation for Synthetic Augmentation in RLVR","abstract":"The supply of high-quality training tasks is a central bottleneck for reinforcement learning from verifiable rewards (RLVR) on agentic language models. Each task requires a sandboxed setup, a prompt, and a hand-authored reward function, and only tasks that pass a quality bar produce useful training signal. Hand-curation at this quality bar does not scale economically to the task counts effective RL training requires, and the substitution rate between automatically generated task variants and human-authored ones is not yet established. We investigate using pre-specified, gate-filtered augmentations of a small hand-authored base as a substitute for additional human curation during RLVR. We formalize the cost-adjusted trade rate $\u03c1_{\\text{cost}}$ between augmented and human-authored tasks, measure it through a controlled ablation across training corpora with varying augmentation share, and characterize the end-to-end economics of the augmentation pipeline. Substituting augmented content for additional human-authored tasks retains aggregate held-out generalization on a ten-benchmark suite spanning code, instruction following, reasoning, and multi-turn agentic function-calling. 
The cost-adjusted trade rate $\u03c1_{\\text{cost}}$ between gated synthetic and human-authored RLVR tasks stays in $[1.4\\times, 11.6\\times]$ across the plausible $c_{\\text{human}}/c_{\\text{aug}}$ range.","published_date":"2026-06-02T15:48:28+00:00","viability_score":4,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research explores substituting human curation with synthetic data augmentation for reinforcement learning from verifiable rewards in language models, aiming to improve scalability and cost-effectiveness.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03796v1","title":"Signed Spiking Neuron Enabled by an Orthogonal-Easy-Axis Magnetic Tunnel Junction","abstract":"Signed spiking neurons carry richer information than standard spiking neurons. This work proposes a compact magnetic tunnel junction (MTJ)-based neuron for signed leaky integrate-and-fire (LIF) operation. With orthogonal easy axes in the free and pinned layers, the device enables bipolar spike generation and maps magnetic-moment dynamics to signed LIF membrane-potential evolution. Landau--Lifshitz--Gilbert simulations show that proper free-layer dimensions allow the device response to follow a signed LIF equation. A representative design of 10 nm x 45 nm x 50 nm corresponds to an aspect ratio of about 2:9:10. 
Network evaluations using the fitted device-neuron model achieve 91.06% on CIFAR-10 and 77.40% on CIFAR10-DVS, retaining most of the accuracy of ideal signed LIF neurons.","published_date":"2026-06-02T15:45:23+00:00","viability_score":3,"cluster_label":"Neuromorphic Computing","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel magnetic tunnel junction device is proposed for implementing signed spiking neurons, demonstrating potential for energy-efficient neuromorphic hardware with competitive accuracy on image classification tasks.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03777v1","title":"From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework","abstract":"AI losses that arise through an insured organization's generative or agentic AI system require state reconstruction, not merely event reconstruction, because the relevant state changes as the system reasons, retrieves, calls tools, and acts. The relevant question is not only what loss occurred, but what the system was allowed to do, what it actually did, and whether that reconstructed loss can support insurance claim recovery. This paper addresses losses in which the insured's AI system is in the causal chain, including externally triggered failures such as prompt injection, retrieval-augmented generation (RAG) poisoning, malicious tool output, credential misuse, and data poisoning.   Specifically, this paper introduces CER, a use-case-level diagnostic for AI residual risk transfer. C (control boundary) asks whether the system had an enforceable operating envelope. E (evidence reconstruction) asks whether the system state and causal chain can be reconstructed from retained artifacts. R (insurance response) asks whether the reconstructed loss is insured: whether insurance coverage is available in the market and placed for the insured, together with the proof needed to support insurance claim recovery.   
The paper makes three contributions: it defines the AI-specific reconstruction problem, operationalizes that problem through CER, and specifies claim-grade evidence for AI reconstruction. Public examples include the reported PocketOS and Replit agentic database-deletion incidents and Moffatt v. Air Canada as an adjudicated output/reliance case.   Keywords: AI systems; CER framework; residual risk transfer; agentic AI; generative AI; AI insurance; evidence reconstruction.","published_date":"2026-06-02T15:29:43+00:00","viability_score":5,"cluster_label":"AI Insurance & Risk","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"The CER framework is introduced to reconstruct AI-mediated losses for insurance claims, focusing on control boundaries, evidence reconstruction, and insurance response for generative and agentic AI systems.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2606.03770v1","title":"E2LLM: Towards Efficient LLM Serving in Heterogeneous Edge/Fog Environments","abstract":"Large Language Models (LLMs) have become integral to modern applications, yet their deployment remains challenging. Beyond executing the models themselves, practical deployment must address cost efficiency, low latency, and optimal resource utilization. Conventional approaches typically assume that an entire model can be hosted on a single device, which does not hold in many real-world scenarios, particularly in Edge and Fog environments where device resources are constrained. In this paper, we introduce E2LLM, a framework designed to enable efficient LLM deployment in such resource limited settings. Rather than simply partitioning a single model across all available devices, E2LLM replicates the full model across multiple groups of devices (replicas) and applies model parallelism within each replica. Each replica is assigned a specialized role PREFILL or DECODER based on its efficiency in handling input and output tokens. 
This separation leverages the inherent differences between these two phases of LLM inference. To effectively organize devices, we utilize a Genetic Algorithm to form clusters that maximize system performance. Within each cluster, we apply Dynamic Programming to determine an optimal partitioning strategy that minimizes bottlenecks in model-parallel execution. Experimental results demonstrate that our approach adapts robustly to varying workloads, including scenarios with significant variation in input and output token lengths. Compared to the Splitwise baseline, E2LLM reduces average waiting time by over 50% under high-demand conditions.","published_date":"2026-06-02T15:23:28+00:00","viability_score":7,"cluster_label":"LLM Serving","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"E2LLM is a framework for efficient LLM serving in heterogeneous edge/fog environments, using model parallelism and a genetic algorithm for device clustering to reduce latency and improve resource utilization.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03763v1","title":"Merit or networks? What decides where research is published","abstract":"Does scientific publishing reward the quality of ideas or the advantage of connections? The question is universal to prestige-driven science, yet it has resisted decades of study because a paper's quality could not be gauged ahead of its publication fate without using that fate as the yardstick. We break this constraint by measuring a paper's idea quality directly from its text, before publication, using a discipline-trained LLM evaluator that scores the idea without seeing author names or outcomes.
Using economics as a case study, we combine this text-legible idea-quality score with an execution-quality rubric, a connection index, an author-ability index, and an off-the-shelf language-model text score to estimate a five-input production function for journal placement across 6,208 economics working papers. The inputs are not rivals but a sequence along the ladder of prestige. Execution sets a meritocratic floor and is the largest input overall. Text-legible idea quality grades the rungs in between. Connections set a favoritism ceiling that bites mainly near the apex, the most selective journals. Connections work through two additive channels: connected authors write papers that score higher, and at equal scores their papers are still more likely to place better. Yet this advantage is bounded. Connections raise the odds of every rung without making the apex the typical outcome for ordinary ideas, and even the highest-scoring papers face real friction reaching the visible journal ladder. The result nests, rather than chooses between, the meritocracy and network accounts of how science is published.","published_date":"2026-06-02T15:18:03+00:00","viability_score":2,"cluster_label":"AI Research Analysis","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper analyzes the factors influencing scientific publication, distinguishing between idea quality and network effects.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03762v1","title":"Tool-Aware Optimization with Entropy Guidance for Efficient Agentic Reinforcement Learning","abstract":"Agentic reinforcement learning (RL) equips large language models (LLMs) with tool-use capabilities that substantially improve reasoning on complex tasks. However, integrating external tools often destabilizes training: over-reliance on tools can induce input distribution shift, while overly conservative tool use limits effective exploration. 
To address this issue, we propose a unified framework TAO-RL that couples tool-aware trajectory filtering with entropy-guided exploration for efficient policy optimization. Specifically, at the data level, TAO-RL filters rollout trajectories along two criteria: discarding those where all tool invocations fail to execute, and removing those where all rollouts are either correct or incorrect, as both cases yield degenerate advantage estimates that contribute no discriminative learning signal. This joint filtering retains data that are both tool-capable and informative, establishing a high-quality training distribution. At the algorithmic level, we introduce a tool-aware entropy-guided bonus that reshapes the advantage function at post-tool-call tokens, encouraging the policy to explore more diverse reasoning paths at critical decision points. These two components are mutually reinforcing: trajectory filtering establishes a clean and informative training foundation, while entropy-guided exploration drives stronger reasoning behaviors at critical tool-interaction junctures. Extensive experiments on 7 challenging reasoning benchmarks across 3 model scales demonstrate the superiority of TAO-RL over existing methods.","published_date":"2026-06-02T15:16:12+00:00","viability_score":7,"cluster_label":"Agentic Reinforcement Learning","has_code":true,"repo_url":"https://github.com/WhyNot22222/TAO-RL","commercial_flags":["has_code"],"one_liner":"A framework for agentic reinforcement learning that improves LLM tool use by filtering trajectories and guiding exploration.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03755v1","title":"LAP: An Agent-to-Instrument Protocol for Autonomous Science","abstract":"Autonomous science is moving from demonstration to infrastructure. Large language model agents now plan experiments, and self-driving laboratories execute them. 
Yet every such system rebuilds the link between the reasoning agent and the physical instrument from scratch, against fragmented vendor SDKs and standards built for deterministic software clients rather than probabilistic, goal-directed agents. Recent agent-interoperability protocols clarify two of the three edges of an agentic ecosystem (Anthropic's Model Context Protocol (MCP) standardizes the agent-to-tool edge, and Google's Agent2Agent (A2A) the agent-to-agent edge), but neither models the agent-to-instrument edge, where operations are stateful, safety-critical, exclusively owned, physically embodied, and produce measurements with units, calibration, and uncertainty. We present the Lab Agent Protocol (LAP), a protocol design that fills this gap. LAP retains A2A's peer-to-peer, discovery-first, task-lifecycle structure and adds four physical-world primitives: (i) the InstrumentCard, a signed capability and physical-limit description; (ii) first-class reservation for exclusive instrument and sample locking; (iii) a safety-fence handshake with operator-confirmation tokens cryptographically bound to a specific task and its parameters, gating hazardous and irreversible operations; and (iv) a MeasurementResult schema that makes every result physically typed (QUDT/UCUM), calibration-anchored, uncertainty-bearing, and reproducible by construction. We specify roles, a six-layer architecture, the JSON-RPC method set, the task and safety state machines, the error model, and cross-laboratory federation, and walk a closed-loop autonomous campaign through the protocol end-to-end. 
LAP is transport-compatible with the A2A/MCP ecosystem and encapsulates rather than replaces existing device standards such as SiLA 2 and OPC-UA.","published_date":"2026-06-02T15:03:43+00:00","viability_score":7,"cluster_label":"Autonomous Science Infrastructure","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new protocol for connecting LLM agents to scientific instruments, enabling autonomous lab operations.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03748v1","title":"Ultralytics YOLO26: Unified Real-Time End-to-End Vision Models","abstract":"Real-time vision demands models that are accurate, efficient, and simple to deploy across diverse hardware. The YOLO family has become widely deployed for this reason, yet most YOLO detectors still rely on non-maximum suppression at inference, carry heavy detection heads due to Distribution Focal Loss, require long training schedules, and can leave the smallest objects without positive label assignments. We present Ultralytics YOLO26, a unified real-time vision model family that addresses these limitations through coordinated architecture and training advances. YOLO26 uses a dual-head design for native NMS-free end-to-end inference and removes DFL entirely, yielding a lighter head with unconstrained regression range. Its training pipeline combines MuSGD, a hybrid Muon-SGD optimizer adapted from large language model training; Progressive Loss, which shifts supervision toward the inference-time head; and STAL, a label assignment strategy that guarantees positive coverage for small objects. Beyond detection, YOLO26 introduces task-specific head and loss designs for instance segmentation, pose estimation, and oriented detection, producing consistent gains across tasks and scales. 
The family spans five scales (n/s/m/l/x) and supports detection, instance segmentation, pose estimation, classification, and oriented detection in a single pipeline, with an open-vocabulary extension, YOLOE-26, for text-, visual-, and prompt-free inference. Across all scales, YOLO26 achieves 40.9-57.5 mAP on COCO at 1.7-11.8 ms T4 TensorRT latency, advancing the accuracy-latency Pareto front over prior real-time detectors, while YOLOE-26x reaches 40.6 AP on LVIS minival under text prompting. Code and models are available at https://github.com/ultralytics/ultralytics.","published_date":"2026-06-02T15:01:13+00:00","viability_score":8,"cluster_label":"Vision Models","has_code":true,"repo_url":"https://github.com/ultralytics/ultralytics","commercial_flags":["has_code"],"one_liner":"YOLO26 offers unified, real-time AI vision models excelling in performance and latency across multiple tasks.","time_to_mvp":"1-2 weeks","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03746v1","title":"Qwen-Image-Flash: Beyond Objective Design","abstract":"Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focused on distillation objectives. In this work, we revisit few-step distillation from a complementary perspective, focusing on the training recipe that critically shapes student performance. Using Qwen-Image-2.0 as a representative case, we systematically investigate three factors in unified text-to-image generation and instruction-guided image editing distillation: data composition, teacher guidance, and task mixture. Our empirical analysis reveals several non-obvious behaviors, which motivate the development of Qwen-Image-Flash. 
Overall, our results suggest that effective few-step distillation requires not only carefully designed objectives, but also principled organization of the broader training pipeline.","published_date":"2026-06-02T15:00:22+00:00","viability_score":3,"cluster_label":"Generative AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper explores optimizing the training recipe for few-step distillation in text-to-image generation and editing models, focusing on data composition, teacher guidance, and task mixture.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03743v1","title":"Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts","abstract":"While Large Language Models (LLMs) have shown strong performance in generating formal proofs, their outputs often remain less readable, modular, maintainable, and reusable than proofs in mature formal mathematics libraries. We argue that this gap stems in part from the compile-first objective implicit in most proof-generation pipelines, which encourages monolithic or ad hoc proof scripts rather than library-quality artifacts. Existing approaches to proof-quality improvement often rely on explicit, computable optimization objectives. In practice, however, the most tractable and experimentally validated objectives are largely length-based, while higher-level qualities such as readability, modularity, maintainability, and reusability are difficult to reduce to reliable automatic metrics. Instead of optimizing proof improvement against a single proxy metric, we take a process-guided approach inspired by human proof-refactoring workflows. We propose an agentic framework, Proof-Refactor, that decomposes proof refactoring into four phases: extracting candidate proof fragments, designing helper declarations, formally proving the extracted and designed components, and repairing the original proof using the verified components.
On generated Lean proofs from PutnamBench and Putnam2025, Proof-Refactor improves rubric-based refactoring scores over a strong Claude Code refactoring baseline, with the largest gains in signature quality and human readability. These results suggest that process-guided refactoring can improve proof structure without treating proof length as the primary objective.","published_date":"2026-06-02T14:56:10+00:00","viability_score":5,"cluster_label":"AI Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Proof-Refactor is an agentic framework that refactors generated formal proofs into modular, readable, and reusable artifacts by decomposing the process into distinct phases.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.03741v1","title":"When to Re-Plan: Subgoal Persistence in Hierarchical Latent Reasoning","abstract":"Long-horizon reasoning requires a system to commit to medium-horizon intent without becoming rigid: re-plan too often and computation never coheres into multi-step structure; commit too long and the plan goes stale. We study this stability-adaptivity tradeoff in the latent reasoning setting, where multi-step computation occurs inside hidden state rather than externalized token traces. We extend the Hierarchical Reasoning Model (HRM) with a feudal-style manager-worker interface: a slow high-level module periodically emits a normalized directional subgoal that persists for P low-level steps, biasing the worker's hidden-state updates and supplying an intrinsic cosine alignment loss. On ARC and ConceptARC, we find that subgoal persistence -- not subgoal injection alone -- is the central knob: moderate periods P in [3, 6] consistently outperform both very frequent (P=1) and very long horizons, with a clear minimum LM loss at P=3 (1.544 vs. 1.674 at P=1, 1.640 baseline; replicated over 5 seeds at mean 1.595, std 0.045). 
The intrinsic alignment weight lambda shows a complementary narrow optimum (lambda approximately 0.05). A controlled ablation at past-sweet-spot lambda isolates learned directional structure -- not architectural capacity or auxiliary loss alone -- as the source of interference when the alignment signal exceeds its optimum. Together these findings implicate a design principle for compositional planning in latent reasoning systems: medium-horizon intent must be coherent across enough computational steps for compositional structure to form.","published_date":"2026-06-02T14:55:47+00:00","viability_score":3,"cluster_label":"AI Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper investigates the stability-adaptivity tradeoff in latent reasoning by introducing subgoal persistence in a hierarchical model, finding moderate persistence periods optimize performance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03719v1","title":"Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs","abstract":"The do-calculus defines a general system of inference for interventional queries, allowing causal quantities to be transformed through successive applications of its rules. This process induces a rich space of equivalent interventional expressions, but combining and ordering these rules remains challenging. In this work, we introduce derivation graphs, which represent how do-calculus rules are applied and combined, and characterize the full space of observational and interventional probabilities which are equivalent under the do-calculus. The structure of these graphs yields a simple procedure that uses at most four applications of do-calculus rules. 
Finally, we show how applying identification algorithms to equivalent causal queries produces multiple valid estimands for the same causal quantity, eventually yielding more efficient estimators.","published_date":"2026-06-02T14:40:39+00:00","viability_score":3,"cluster_label":"Causal Inference","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This work introduces derivation graphs to represent do-calculus rule applications, characterizing the space of equivalent probabilities and yielding a procedure with at most four rule applications.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03705v1","title":"Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs","abstract":"Knowledge Graphs (KGs) are widely used to mitigate the limitations of Large Language Models (LLMs), such as outdated knowledge and hallucinations. Existing LLM-KG integration frameworks typically rely on predefined operators to retrieve factual knowledge from KGs and inject it into prompts for answer generation. This paradigm faces two critical bottlenecks: 1) Inflexibility: The predefined operators are limited in scope and thus lack sufficient compositional expressiveness to fully capture the complex semantics required by KG questions. 2) Unscalability: Direct injection of factual knowledge into prompts limits scalability in handling large-scale factual knowledge. To address these two bottlenecks, we propose Code-on-Graph (CoG), a programmatic reasoning framework for LLM-KG integration. Specifically, given the factual knowledge retrieved at each reasoning step, CoG first identifies the corresponding KG schemas and represents these schemas as Python classes, which serve as abstract interfaces to the retrieved facts. It then generates executable code grounded in these classes, with the retrieved facts instantiated as objects of the corresponding classes during execution. 
This design enables flexible code-based reasoning while avoiding the direct injection of large-scale factual knowledge into prompts. Experiments on WebQSP, CWQ, and GrailQA demonstrate that CoG outperforms prior state-of-the-art models by up to 10.5%.","published_date":"2026-06-02T14:22:29+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A programmatic reasoning framework for LLM-KG integration that uses generated code to flexibly access and reason over knowledge graphs, outperforming state-of-the-art.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03704v1","title":"Dynamic Objective Selection with Safeguards and LLM Oversight for Financial Decision-Making","abstract":"Financial decision-making tasks such as stock recommendation and portfolio allocation typically estimate future return and risk and then select trades or allocations for an investor, and the chosen optimization objective often determines realized performance. However, because market conditions evolve over time, a fixed objective can be suboptimal across regimes, while regime-switching pipelines that rely on latent regime estimates can be noisy or delayed and frequent switching can increase turnover and operational instability. In this paper, we propose DOSS (Dynamic Objective Selection with Safeguards), a learning-based selector that directly chooses the decision-relevant objective function at each time point from interpretable statistical summaries of recent returns, selecting among a small set of candidates (e.g., return-seeking, loss-averse, and risk-adjusted) without introducing intermediate regime variables. DOSS formulates objective selection as a classification problem over objectives and performs sequential updates with a rolling window to make forward-looking selections without temporal leakage, while also outputting a confidence score for each proposal. 
To mitigate misselection and excessive switching in deployment, DOSS applies confidence-aware gating with a fail-safe that overrides low-confidence proposals to a conservative default and enforces explicit controls tied to switching frequency. We further integrate governance by positioning a Large Language Model (LLM) as an oversight component rather than a generator of new objectives: the LLM is restricted to accept a proposed objective or override it to a predefined safe default, with deterministic rule-based constraints triggering overrides when needed.","published_date":"2026-06-02T14:22:07+00:00","viability_score":4,"cluster_label":"Financial AI","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"A learning-based system that dynamically selects the optimal objective function for financial decision-making, with LLM oversight for governance and safeguards against misselection.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2606.03692v1","title":"SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents","abstract":"Recent AI agents can flexibly invoke skills to solve complex tasks, but their long-term improvement is fundamentally constrained by a lack of systematic skill construction, accumulation, and transfer. In particular, without a unified framework for skill consolidation, agents tend to redundantly construct similar capabilities across different tasks, are unable to effectively transform experience into reusable assets, and struggle to generalize task-specific skills to novel scenarios. To address this limitation, we propose SkillPyramid, a skill consolidation framework that reuses existing skill experience for broader task generalization. Operating on a hierarchical skill topology, SkillPyramid further introduces a self-evolution mechanism that enables agents to compose, validate, and incorporate new skills during task execution. 
Experiments on ALFWorld, WebShop, and ScienceWorld across four backbone models show that SkillPyramid substantially increases the average reward by 38.0% and reduces execution steps by 27.7%. Overall, our method transforms a skill collection from a static resource pool into a dynamic evolution system.","published_date":"2026-06-02T14:14:27+00:00","viability_score":4,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A hierarchical framework for AI agents that consolidates and evolves skills, enabling broader task generalization and reducing redundant capability construction.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2606.03689v1","title":"Staying Alive: Uncensored Survival Analysis with Tabular Foundation Models","abstract":"Survival Analysis (SA) is a statistical framework that models the time span until some event of interest occurs. SA is widely used in several domains, including healthcare and churn prediction, but a central challenge in its applicability stems from the time of the event being only partially observed (right-censoring).   Tabular Foundation Models (TFM) have attracted significant interest in recent years due to their ability to perform prediction tasks in a single forward pass, requiring no dataset-specific parameter fitting. Despite their success, their application to prediction tasks on time-to-event data remains difficult due to right censoring. In this work, we present a training-free approach to survival regression by leveraging TFMs to both predict the time of the event and iteratively impute right-censored data.   Our method uses a TFM to construct an Accelerated Failure Time (AFT) model requiring no training beyond fitting a single scalar parameter. Subsequently, by building on the Buckley-James estimator, we introduce a non-parametric in-context estimator for right-censored data.
Our experiments on standard survival analysis benchmarks show that our method is competitive with several parametric and semi-parametric survival regression models that require training, including Cox regression and parametric AFT models.","published_date":"2026-06-02T14:11:23+00:00","viability_score":7,"cluster_label":"Tabular AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A training-free method for survival analysis using tabular foundation models that iteratively imputes right-censored data and predicts event times, competitive with trained models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03686v1","title":"The DeepSpeak-Agentic Dataset","abstract":"We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification (audio, video, or text) of AI agents, study the nature of human-agent interactions, and provide a benchmark for future advances in the large-language models and AI-generated voices and faces that power embodied AI agents. 
We also contribute a scalable data-capture system that creates agents, automatically pairs them with human crowd workers, records audiovisual conversations across specified scenarios, and identifies and separates the human and agent in the combined stream.","published_date":"2026-06-02T14:10:18+00:00","viability_score":7,"cluster_label":"Embodied AI Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new dataset and capture system for evaluating and studying human-AI agent interactions and identifying AI-generated content.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03685v1","title":"A Close Look At World Model Recovery In Supervised Fine-Tuned LLM Planners","abstract":"Supervised fine-tuning (SFT) improves end-to-end classical planning in large language models (LLMs), but do these models also learn to represent and reason about the planning problems they are solving? Due to the relative complexity of classical planning problems and the challenge that end-to-end plan generation poses for LLMs, it has been difficult to explore this question. In our work, we devise and perform a series of interpretability experiments that holistically interrogate world model recovery by examining both internal representations and generative capabilities of fine-tuned LLMs. We find that: a) Supervised fine-tuning on valid action sequences enables LLMs to linearly encode action validity and some state predicates. b) Models that struggle to use output probabilities for classifying action validity may still learn internal representations that separate valid from invalid actions. c) Broader state space coverage during fine-tuning, such as from random walk data, yields more accurate recovery of the underlying world model. 
In summary, this work contributes a recipe for applying interpretability techniques to planning LLMs and generates insights that shed light on open questions about how knowledge is represented in LLMs.","published_date":"2026-06-02T14:09:16+00:00","viability_score":3,"cluster_label":"LLM Interpretability","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"Investigating how large language models learn to represent planning problems after supervised fine-tuning.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03678v1","title":"EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents","abstract":"Generating safety-critical scenarios is essential for validating and improving autonomous driving systems, yet it inherently requires maximizing adversariality to expose failures while preserving realism. Existing methods usually manage this trade-off with handcrafted heuristics, confining generation to known priors and overlooking underexplored patterns. While recent open-ended agentic evolution can push this limit, unconstrained general agents lack strict simulator grounding and tend to collapse the multi-objective tension into single-scalar maximization. Here we present EvoDrive, the first automated, LLM-based agentic evolution framework for multi-objective scenario generation. EvoDrive employs a simulator-grounded actor-critic architecture where a memory-driven actor iteratively proposes improvements to the generators and critics filter out implausible candidates, and a self-evolving world evaluator routes promising proposals to optimize simulation budgets. EvoDrive further maintains a Pareto archive of evaluated candidates to preserve diverse attack-realism trade-offs and guide future evolution via simulation feedback. 
Benchmark results on MetaDrive and CARLA show that EvoDrive not only significantly expands the Pareto frontier across various generators, but also produces valuable scenarios for policy training.","published_date":"2026-06-02T14:01:23+00:00","viability_score":8,"cluster_label":"Autonomous Driving Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An LLM-powered framework for generating diverse and realistic safety-critical autonomous driving scenarios to improve validation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03664v1","title":"AUGUSTE: Online-Learning dApp for Predictive URLLC Scheduling","abstract":"Ultra Reliable and Low Latency Communications (URLLC) was one of the main motivations behind 5G, with 3GPP advertising 1-10 ms latency targets for applications such as industrial automation, Vehicle-To-Everything (V2X), tactical edge networking, and unmanned-system control. Years on, real 5G Time Division Duplexing (TDD) networks still show median Uplink (UL) round-trip times in the 50-70 ms range, largely because of the Scheduling Request (SR) procedure that a User Equipment (UE) must complete before transmitting UL data. Existing remedies, primarily Configured Grant (CG) scheduling, only eliminate this overhead for strictly periodic traffic and require cross-layer synchronization, which has limited their adoption. We propose AUGUSTE (Anticipatory Uplink Grants for URLLC via Self-Adapting Temporal Estimation), a learning-based Medium Access Control (MAC) scheduling framework that embeds online Machine Learning (ML) models in the UL scheduler to predict packet arrivals and proactively allocate resources before an SR is issued. An adaptive state machine alternates between a learning phase that collects unbiased arrival statistics and a confident phase that exploits the learned predictions to schedule only when traffic is expected. 
We evaluate AUGUSTE on a real 5G testbed running OpenAirInterface across three URLLC traffic patterns (request-response, ML edge inference, and periodic autonomous reporting), and show that it operates at the best achievable point on the latency-overhead trade-off: it matches always-on scheduling's median Round Trip Time (RTT) (around 10 ms, halving the 20 ms SR-based baseline) at roughly one-tenth its resource cost (7-10 percent overhead).","published_date":"2026-06-02T13:50:22+00:00","viability_score":7,"cluster_label":"URLLC Scheduling","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A learning-based framework for predictive URLLC scheduling in 5G networks, reducing latency and overhead for critical applications.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2606.03660v1","title":"From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models","abstract":"Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing process-level evaluators are hard to scale because LLM judges and human step-level process annotation are costly, inconsistent, and vulnerable to hallucination. We introduce ChemCoTBench-V2, a rule-verifiable diagnostic benchmark for low-cost, auditable evaluation of structured, verifier-addressable chemical reasoning traces. It spans molecular understanding, molecule editing, molecular optimization, and reaction prediction, with 5,620 evaluation samples across 18 reporting tasks. Models must expose key intermediate steps in expert-designed templates, and those steps are checked with deterministic chemistry rules and, for closed-answer tasks, reference traces rather than another LLM judge. 
Open-ended molecular optimization is evaluated with oracle-verifiable state constraints rather than strict trace matching. The benchmark reports three separate signals: final-answer correctness, template adherence, and step-wise verifier correctness over expert-refined intermediate commitments. Experiments on frontier models reveal a persistent gap between final-answer success and structured-reasoning-state consistency: models often follow the requested format while failing chemical-step checks, or answer correctly with weak supporting reasoning. ChemCoTBench-V2 enables fine-grained model comparison and identifies the concrete step at which the trace first violates the verifier.","published_date":"2026-06-02T13:47:19+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A verifiable benchmark for evaluating the chemical reasoning of LLMs, enabling fine-grained comparison and identifying specific reasoning failures.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03657v1","title":"Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition","abstract":"Large language models for code generation often need to use APIs that are absent from their pretraining data. This requires more than recalling a function name: models must coordinate signatures, module paths, input-output contracts, semantics, and executable usage patterns. Existing novel-API benchmarks are typically static, rely on coarse pass/fail metrics, or use synthetic APIs that may not reflect real library evolution. We introduce NovelAPIBench, a fully automated dynamic benchmark that, for any base model and target library, discovers novel APIs, extracts decomposed knowledge bundles, generates executable coding tasks, and assigns failed samples to six diagnostic categories. 
Across about 1.9K tasks, four base models, and five domains, we compare knowledge injected through retrieval with knowledge internalized through parametric adaptation. We find that knowledge components are not interchangeable: usage examples are the strongest standalone signal, while the best two-component setting pairs signatures with either mechanisms or examples depending on the domain and backbone. Adding more context, especially source code, can hurt by increasing import-path errors. Parametric adaptation also does not replace retrieval once external knowledge is removed; rather, fine-tuning mainly teaches models how to use provided bundles, and this ability transfers to held-out libraries. These results suggest that retrieval and tuning play complementary roles: retrieval supplies volatile API content, while tuning improves procedural integration.","published_date":"2026-06-02T13:46:04+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An automated benchmark for diagnosing LLM knowledge gaps in novel API usage, enabling targeted improvements for agentic systems.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03655v1","title":"Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic","abstract":"Recent work in defeasible reasoning has seen notions of preferential semantics and entailment in the style of Kraus et al. applied to modal logics. However, work in this field has focussed primarily on satisfiability checking, and monotonic notions of entailment, which may be inferentially weak. One particular modal logic where this has been introduced is propositional standpoint logics, where modalities can express the views of different viewpoints. This has resulted in the formalisation of propositional defeasible standpoint logic (PDSL). 
In this paper, we propose a means of lifting the class of (non-monotonic) rational entailment relations from traditional KLM-style reasoning to a fragment of PDSL. In order to do so, we extend the expressivity of PDSL via situated standpoint conditionals, allowing us to talk about a defeasible conditional holding in the context of a given standpoint. This allows us to re-characterise the syntax of PDSL in terms of situated conditionals, and shows that a large fragment of PDSL is expressible as a set of situated conditionals. We then focus on characterising non-monotonic entailment in this fragment, defining a method to transport any ranking-based entailment relation from the propositional case into the PDSL case. This is first described in the general case and then considered in the specific cases of rational and lexicographic closures, providing a faithful translation of each inference into PDSL. We also show that entailment-checking in this fragment of PDSL can be done largely using algorithms from the propositional case, while preserving complexity bounds.","published_date":"2026-06-02T13:44:37+00:00","viability_score":0,"cluster_label":"Logic and Reasoning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Extends propositional defeasible standpoint logic with situated conditionals to enable non-monotonic entailment checking, preserving propositional complexity bounds.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03650v1","title":"CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks","abstract":"Choosing or ranking language models for a specific application is hardest when no task-specific labeled data exists, and standard public benchmarks cannot be trusted, their items having likely leaked into pretraining, so scores reflect memorization rather than fitness. 
We present CoEval, an open-source, reusable framework that closes this gap end to end: from only a description of a task or domain, teacher models synthesize a fresh, attribute-controlled benchmark with no human labels, contamination-free because items are generated anew on each run, and a cross-family judge ensemble ranks candidate models with no human raters. Validated where ground truth exists, CoEval recovers the true model ranking and tracks ground-truth correctness at ρ=0.86. The label-free judging needs no human calibration because judge-panel composition (vendor diversity), not size, drives reliability: a small, well-chosen cross-family panel is most reliable, while a single judge can be anti-correlated with ground truth (judge-choice regret 0.35) and the ensemble never is. Generated items show zero verbatim 13-gram overlap with five major public benchmarks; the panel cancels verbosity bias and precludes same-family self-preference. A four-task study produced 7,978 evaluations for USD 5.89. The same declarative pipeline applies to any domain and is cheap enough to re-run on every model release: a label-free, contamination-free leaderboard any team can regenerate for its own application.","published_date":"2026-06-02T13:41:43+00:00","viability_score":8,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CoEval is an open-source framework for generating custom, contamination-free benchmarks and ranking LLMs without labeled data or trustworthy public benchmarks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03648v1","title":"Safety Measurements for Fine-tuned LLMs Should be Grounded in Capability","abstract":"Adapting foundation large language models to a user's task or preferred style through fine-tuning can result in compromising the model's safety. 
Previous works examined the effects of fine-tuning on model safety in limited and seemingly random experimental settings. We argue that anchoring fine-tuning to a specific capability goal is essential for avoiding arbitrary empirical choices, allowing us to draw meaningful conclusions about safety impacts, and to compare mitigation methods on a consistent basis. We conduct a multi-dimensional evaluation of the effects of fine-tuning on model behavior by focusing on capability as well as safety. Our results surface important issues that (1) fine-tuned models can produce incoherent generations in response to safety prompts, (2) automated safety judgments are unreliable for such incoherent outputs, and (3) the conclusions about the effects of fine-tuning can change depending on the choice of safety benchmark as well as the safety evaluator.","published_date":"2026-06-02T13:39:17+00:00","viability_score":4,"cluster_label":"LLM Safety","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research proposes a principled approach to evaluating LLM safety after fine-tuning by grounding measurements in specific capability goals, addressing issues with incoherent generations and unreliable automated judgments.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03647v1","title":"Black-box, Adaptive, Efficient, Transferable, Harmful, Applicable... Attacks Are All You Need to Break LLMs","abstract":"Accurately evaluating adversarial robustness is a longstanding challenge. A flawed attack design can inflate robustness estimates, making deployment risk assessment and defense comparison unreliable. Historically, standardized attacks such as AutoAttack have largely resolved this for image classifiers, providing a reliable evaluation baseline for systematic comparison across defenses. However, no equivalent exists for LLM jailbreak evaluation yet, where designing such an attack is considerably more difficult. 
A reliable attack must, among other things, be black-box compatible, applicable to arbitrary defense pipelines, and efficient, which no existing method jointly satisfies. We introduce Indirect Harm Optimization (IHO), a masked diffusion language model attacker trained via iterative preference optimization against a harmfulness judge, requiring only black-box access to the target. The same method can be used without modification as a strong adaptive attack on individual behaviors, or as an efficient amortized policy that transfers to held-out behaviors and unseen target models without fine-tuning. Even against layered defenses, such as a Circuit Breaker-trained model combined with an auxiliary detector, IHO improves attack success considerably over state-of-the-art approaches, without any defense-specific adaptation. Our results position IHO as a practical step toward the kind of standardized jailbreak evaluation that has improved reliability in the past. Code and models are available on GitHub and Hugging Face.","published_date":"2026-06-02T13:39:15+00:00","viability_score":7,"cluster_label":"LLM Security","has_code":true,"repo_url":"https://github.com/SEML-Lab/IHO","commercial_flags":["has_code"],"one_liner":"Introducing Indirect Harm Optimization (IHO), a black-box, adaptive, and efficient attack method that significantly improves LLM jailbreak evaluation and defense comparison, with code and models available.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03641v1","title":"Gender-Dependent Diagnostic Substitution in LLM Medical Triage: Same Symptoms, Unequal Urgency","abstract":"We investigate whether large language models produce different medical triage recommendations for identical neurological symptoms when only the patient's stated gender and age vary. 
Using three model families--Gemini 3.5 Flash, Claude Sonnet 4.6, and GPT-5.4-mini--we present a standardized symptom profile (persistent headache, blurred vision, morning nausea, visual disturbances) across seven demographic conditions: three age groups (25, 38, 65) x two genders (male, female), plus a gender-unspecified baseline (n = 30 per condition per model, 630 total trials). We find a stark, systemic gender-dependent triage disparity: young women receive significantly lower emergency room (ER) referral rates than age-matched men (Gemini: 0% vs. 23.3%; Claude: 6.7% vs. 96.7%; GPT: 6.7% vs. 66.7%, all p < 0.001). The disparity disappears at age 65 for all models. The primary mechanism is diagnostic substitution: the models anchor on a gender-associated diagnosis, preferentially classifying young women with Idiopathic Intracranial Hypertension (IIH)--a condition epidemiologically linked to women of childbearing age--while diagnosing men with generic increased intracranial pressure with space-occupying lesions in the differential. This diagnostic closure routes female patients to lower-urgency care (outpatient doctor appointments) despite comparable severity ratings (7-9/10). Our findings demonstrate that clinical LLMs replicate documented human clinical biases by using epidemiological priors to suppress triage urgency, suggesting that AI triage engines must decouple urgency assessment from probabilistic diagnostic priors. 
We release all code, prompts, and raw results.","published_date":"2026-06-02T13:35:12+00:00","viability_score":7,"cluster_label":"Medical AI Bias","has_code":true,"repo_url":"https://github.com/wongqihan/ai-behavioral-experiments","commercial_flags":["has_code"],"one_liner":"This research reveals significant gender-dependent diagnostic substitution in LLM medical triage, leading to unequal urgency recommendations for young women, and calls for decoupling urgency assessment from probabilistic diagnostic priors.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03635v1","title":"VidMsg: A Benchmark for Implicit Message Inference in Short Videos","abstract":"Understanding short online videos involves more than identifying visible objects and actions; video makers often include an underlying message or purpose in the clip. We introduce VidMsg, a benchmark for evaluating implicit message understanding in short, internet-native video clips. VidMsg contains 400 YouTube-derived clips across 9 practical topic areas and 52 fine-grained target messages, covering domains such as career and finance, education, health and well-being, culture, safety, sustainability, and lifestyle. VidMsg is constructed through a message-first pipeline: an LLM first translates target messages into indirect search scenarios, which are used to retrieve candidate clips. Human annotators then retain clips that convey the intended message without being overly explicit. VidMsg is designed primarily for bidirectional message-clip retrieval for scalable applications such as video search and recommendation, where systems must capture holistic video understanding. In addition to retrieval, VidMsg includes a diagnostic multiple-choice QA benchmark, where models select the intended message of a clip from semantically related alternatives. 
Experiments with contemporary video-language and retrieval models show that strong models often fail on VidMsg, because the task requires pragmatic inference, integration of contextual cues, and discrimination among semantically close messages. We also introduce VidVec-Msg, a baseline method that improves message-oriented retrieval while leaving substantial headroom for future work.","published_date":"2026-06-02T13:31:57+00:00","viability_score":6,"cluster_label":"Video Understanding","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Introducing VidMsg, a novel benchmark for evaluating implicit message inference in short online videos, designed for scalable applications like video search and recommendation.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03631v1","title":"AnchorMoE: Interpretable Time Series Classification via Anchor-Routed MoE","abstract":"Multivariate time series classification (MTSC) is pivotal in high-stakes domains, such as clinical diagnosis and industrial fault detection, where safe deployment necessitates transparent decision-making. However, isolating the temporal segments that drive model predictions is challenging because discriminative signals in real-world time series are typically sparse, heterogeneous, and heavily obscured by background noise. This paper, therefore, proposes AnchorMoE, an interpretable-by-construction classification framework. Built upon a Mixture-of-Experts (MoE) architecture, AnchorMoE encodes multi-view representations of local patches and routes them to specialized experts, ensuring that the final prediction is formulated as an exact additive decomposition over the input segments, facilitating ante-hoc transparency rather than relying on post-hoc estimations. 
To maintain the reliability of this decomposition under sparse signal distributions, we introduce a geometric orthogonality constraint that penalizes representational redundancy, compelling distinct experts to specialize in heterogeneous predictive patterns. Furthermore, an uncertainty-aware reliability gate is designed to dynamically calibrate the contribution of each segment, effectively suppressing residual background noise. Extensive experiments on real-world and synthetic benchmarks demonstrate that AnchorMoE achieves highly competitive classification performance while faithfully grounding its decisions in the raw time series.","published_date":"2026-06-02T13:30:54+00:00","viability_score":4,"cluster_label":"Time Series Classification","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An interpretable time series classification framework that decomposes predictions over input segments using a Mixture-of-Experts architecture.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03629v1","title":"TSQAgent: Rating Time Series Data Quality via Dedicated Agentic Reasoning","abstract":"Assessing the quality of time series (TS) data is fundamental yet inherently challenging due to the multifaceted nature of quality dimensions. Recently, large language models (LLMs) have emerged as a promising paradigm for TS quality assessment via pairwise comparison and per-dimension evaluation. However, existing approaches rely on manually predefined quality dimensions and purely text-based reasoning, leaving it unknown whether LLMs can identify truly relevant quality dimensions or perform grounded and quantitative quality comparisons. To investigate this, we construct TSQBench, a dedicated benchmark for evaluating LLMs on two progressive capabilities: (i) understanding and identifying relevant quality dimensions, and (ii) performing quality comparison under specific dimensions. 
Our analysis reveals that current LLMs consistently struggle with both dimension identification and evidence-grounded quality comparison. To address these limitations, we propose TSQAgent, a novel agentic reasoning framework for TS quality rating consisting of three collaborative roles: Perceiver for focused dimension selection, Inspector for dimension-wise quantitative analysis, and Adjudicator that aggregates and refines the final judgment. In particular, we introduce an agentic reasoning strategy that instills the ability to identify and prioritize the most relevant quality dimensions, and further propose an agent workflow equipped with external analytical tools to enable precise quantitative comparisons over selected dimensions. Experiments on both the proposed benchmark and eleven real-world datasets demonstrate that our framework not only substantially improves LLMs' capabilities in quality understanding and quantitative comparison but also effectively translates these improvements into better quality-aware data selection, leading to enhanced downstream performance and data efficiency.","published_date":"2026-06-02T13:28:17+00:00","viability_score":7,"cluster_label":"AI Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An agentic reasoning framework that uses specialized agents to assess and rate time series data quality, improving downstream performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03628v1","title":"Building Reliable Long-Form Generation via Hallucination Rejection Sampling","abstract":"Large language models (LLMs) have achieved remarkable progress in open-ended text generation, yet they remain prone to hallucinating incorrect or unsupported content, which undermines their reliability. This issue is exacerbated in long-form generation due to hallucination snowballing, a phenomenon where early errors propagate and compound into subsequent outputs. 
To address this challenge, we propose a novel inference-time hallucination mitigation framework, named Segment-wise HAllucination Rejection Sampling (SHARS), which uses an arbitrary hallucination detector to identify and reject hallucinated segments during generation and resample until faithful content is produced. By retaining only confident information and building subsequent generations upon it, the framework mitigates hallucination accumulation and enhances factual consistency. To instantiate this framework, we adopt semantic uncertainty as the detector and introduce several vital modifications to address its limitations and better adapt it to long-form text. Our method enables models to self-correct hallucinations without requiring external resources such as web search or knowledge bases, while remaining compatible with them for future extensions. Empirical evaluations on standardized hallucination benchmarks demonstrate that our method substantially reduces hallucinations in long-form generation while preserving or even improving the informativeness of generation. Code is available at: https://github.com/TreeLLi/hallucination-rejection-sampling.","published_date":"2026-06-02T13:26:17+00:00","viability_score":7,"cluster_label":"LLM Generation","has_code":true,"repo_url":"https://github.com/TreeLLi/hallucination-rejection-sampling","commercial_flags":["has_code"],"one_liner":"A framework that mitigates hallucinations in long-form text generation by rejecting and resampling hallucinated segments during inference.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03626v1","title":"TurtleAI: Benchmarking Multimodal Models for Visual Programming in Turtle Graphics","abstract":"Vision-language models (VLMs) have been explored for visual programming, where they generate code to solve visual tasks. 
However, most prior work focuses on visual programming for productivity; it remains unclear how well current VLMs perform on education-oriented visual programming and what factors limit their performance. To bridge this gap, we introduce TurtleAI, a benchmark containing 823 tasks curated based on real-world visual programming tasks in the Turtle Graphics domain. Solving these tasks requires models to perceive geometric patterns, reason about spatial relationships, and synthesize Python code that faithfully reproduces geometric patterns. We evaluate 20+ VLMs, including GPT-5, GPT-4o, and Qwen2-VL-72B, and find that they struggle significantly, with most achieving success rates below 30%. To address these limitations, we propose a data generation technique that requires only a small set of seed samples. Fine-tuning Qwen2-VL-72B on the resulting synthetic data yields an improvement of about 20% on real-world tasks. Our failure analysis reveals that GPT-4o struggles with spatial reasoning and precise visual replication, whereas fine-tuning primarily improves the alignment between visual reasoning and code implementation.","published_date":"2026-06-02T13:25:05+00:00","viability_score":6,"cluster_label":"Multimodal AI","has_code":true,"repo_url":"https://github.com/machine-teaching-group/acl2026-turtleai","commercial_flags":["has_code"],"one_liner":"A benchmark and fine-tuning approach for evaluating and improving multimodal models on visual programming tasks in Turtle Graphics.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2606.03624v1","title":"Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models","abstract":"Large Reasoning Models (LRMs) have demonstrated impressive capabilities in many tasks, yet they struggle with reliably following multiple instructions, either by failing to satisfy individual constraints or by struggling to balance competing constraints simultaneously. 
We formalize this challenge as the Constraint Adherence Problem (CAP). This paper introduces a novel framework that addresses CAP by representing instructions as a structured knowledge graph of constraints. Our approach, Constraint Relationship Graph Completion (CRGC), explicitly models relationships between constraints, identifies adherence challenges, and discovers 'bridge constraints' that help the model better focus on and reconcile requirements. Bridge constraints act as auxiliary instructions that make primary constraints more salient and compatible. Unlike existing approaches that enhance instruction following through general training methods, CRGC specifically improves constraint satisfaction by leveraging the model's own knowledge to create better pathways for generation. Experiments across three popular instruction following datasets demonstrate that our approach reduces constraint violations by 39% compared to standard prompting while maintaining reasoning abilities of large reasoning models.","published_date":"2026-06-02T13:23:28+00:00","viability_score":4,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework that improves instruction following in large reasoning models by representing instructions as a knowledge graph and discovering 'bridge constraints' to reconcile competing requirements.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2606.03620v1","title":"Physics-Guided Policy Optimization with Self-Distillation","abstract":"Self-distilled policy optimization (SDPO) has become a popular paradigm for LLM post-training, where a model learns from its own predictions conditioned on privileged information. SDPO, however, is sensitive to how much each update step should be trusted: corrections from a self-teacher can be highly informative on some batches and misleading on others, and applying them uniformly with a fixed step size can destabilize training. 
Drawing inspiration from viscous-fluid dynamics and formalizing the analogy at the SDE level, we propose Physics-Guided Policy Optimization (PGPO), which introduces an information-modulated step-size multiplier derived from a mutual-information estimate between the student's predictions and the feedback-conditioned teacher. We show that this modulation preserves the order-1 weak-approximation guarantees of vanilla SGD, and incurs negligible overhead per iteration. We evaluate PGPO on the Science-QA dataset, where it outperforms SDPO on 3 of the 4 domains with gains of up to +4.5 points, while remaining stable in a setting where SDPO collapses late in training.","published_date":"2026-06-02T13:20:39+00:00","viability_score":4,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Physics-Guided Policy Optimization (PGPO) introduces an information-modulated step-size multiplier for self-distilled policy optimization, improving stability and performance in LLM post-training.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2606.03618v1","title":"Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing","abstract":"AI-assisted coding agents are bottlenecked by input-token cost. Two pathologies of raw human input drive much of this overhead: tokenization inefficiency for non-English text and structural entropy in conversational prompts. Existing approaches act reactively by compressing already-bloated contexts or intervening after failures occur.   We introduce a pre-flight, edge-side prompt-rewriting middleware that operates between the developer and the cloud agent. A local Llama 3.2 (3B) model performs cross-lingual translation into English, structural rewriting into a compact task-oriented format, and regex-validated rewrite-with-fallback safeguards to ensure the optimized prompt is never larger than the original.   
We evaluate on OMH-Polyglot, a multilingual coding benchmark spanning Turkish, Arabic, Chinese, and code-switched specifications. Across three commercial LLM backends, the middleware reduces prompt tokens by 34-47 percent and total tokens by up to 18.8 percent while preserving or improving task accuracy. Ablation studies show that gains arise primarily from the rewriting stage rather than simple function-name extraction. Compared with LLMLingua-2 at matched compression rates, our method consistently achieves superior OckScore performance across all evaluated backends. These results demonstrate that proactive prompt optimization can substantially reduce inference costs without sacrificing coding quality.","published_date":"2026-06-02T13:17:45+00:00","viability_score":7,"cluster_label":"AI Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A middleware that uses a local Llama 3.2 (3B) model to rewrite prompts, reducing token costs for AI coding agents by translating non-English text and structuring conversational inputs.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.03608v1","title":"Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification","abstract":"Test-time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models in a completely label-free manner. Despite existing studies focusing on Pass@1 performance, optimizing Pass@k remains under-explored yet critical in label-free settings, which measures generation coverage for sustained exploration. Optimizing Pass@k in label-free setting is highly non-trivial, as directly applying the Pass@k advantage designs effective for RLVR yields unsatisfactory performance. 
Through in-depth empirical analysis, we discover the root causes hindering performance: pseudo-label estimations for low-confidence samples have a high probability of being incorrect, while candidate answers for high-confidence samples suffer from severe diversity collapse. To overcome these hurdles, we propose TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification), a novel confidence-adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Based on our key insight that verification capability generally leads generation capability, TTRL-CoCoV employs a confidence-conditioned mechanism: for high-confidence samples, it bootstraps verifier and applies an exploration-enhancing reward to prevent diversity collapse; for low-confidence samples, it delegates pseudo-label selection to the verifier to filter incorrect pseudo-labels; and for medium-confidence samples, it bypasses verification entirely. Extensive experiments demonstrate that TTRL-CoCoV outperforms the best competing methods across 6 widely-recognized benchmarks, achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 over TTRL, and even achieves absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks when compared against fully supervised RL methods. 
Our code repository: https://github.com/shanjf666/CoCoV.","published_date":"2026-06-02T13:11:09+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":"https://github.com/shanjf666/CoCoV","commercial_flags":["has_code"],"one_liner":"TTRL-CoCoV is a confidence-adaptive framework for test-time reinforcement learning that improves Pass@k performance in LLMs by intelligently leveraging verification capabilities.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.03606v1","title":"Testing LLM Arithmetic Reasoning Generalization with Automatic Numeric-Remapping Attacks","abstract":"Large language models achieve strong performance on arithmetic reasoning benchmarks, and one common response to arithmetic brittleness is to delegate computation to code. Yet models are still often used in settings where they must reason directly from natural language, and trustworthy models should solve small-number arithmetic word problems without external tools. Prior work shows that LLMs are sensitive to numerical variation: a model may solve an original problem but fail on structurally similar variants requiring the same reasoning procedure with different numbers. We ask whether this fragility persists under a stricter setting involving small, schema-preserving numeric changes that retain the original reasoning program and avoid large-number stress tests. We introduce an automatic algorithm for generating numeric-remapping attacks on arithmetic word problems. Unlike template-based perturbation methods requiring manual schemas or constraints, our approach derives problem-specific symbolic representations, generates constrained numeric remappings, recomputes gold answers, and realizes transformed questions through deterministic edits guided by LLM-generated edit plans. Stage-wise validation and a high-confidence audit retain reliable attacks, making the pipeline scalable with limited human intervention. 
We evaluate DeepSeek-R1 (70B), Gemma4 (31B), and GPT-OSS (120B) on GSM8K, MAWPS, and MultiArith. On GSM8K, completed runs show conditional accuracy drops of 12.16 to 25.82 percentage points. MAWPS and MultiArith are far more stable, with most attacked accuracies near or above 98%. These results show that numeric-remapping robustness depends strongly on dataset structure: GSM8K remains sensitive even when reasoning programs are preserved and answers are recomputed, while shorter, more regular datasets are more robust.","published_date":"2026-06-02T13:09:44+00:00","viability_score":4,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Develops an automated attack to test LLM arithmetic reasoning robustness by remapping numbers in word problems.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03602v1","title":"CauTion: Knowing When to Trust LLMs for Ensemble Causal Discovery","abstract":"Causal discovery from observational data remains challenging due to the fundamental limitations of purely statistical methods, such as statistical distinguishability within equivalence classes and sensitivity to finite sample sizes. While large language models (LLMs) offer a promising source of domain knowledge to complement statistical inference, existing LLM-augmented methods are vulnerable to LLM errors and incur high token costs. Moreover, reliance on a single data-centric algorithm can make results sensitive to algorithm-specific biases. To address these limitations, we propose CauTion, a framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms through consensus filtering and LLM reliability estimation. CauTion proceeds in three stages. First, an algorithm ensemble uses consensus voting to resolve up to 96% of edges on which algorithms agree, achieving near-perfect accuracy on the filtered consensus edges. 
Second, a trust-calibrated arbitration mechanism estimates the relative reliability of the LLM and the algorithms via an annotation-free trust calibration procedure, which is then utilized to govern a trust-weighted voting process that restricts LLM arbitration exclusively to edges with unreliable algorithmic evidence. Third, a cycle repair step is applied to guarantee the final causal graph is validly acyclic. Experiments on six datasets demonstrate that CauTion consistently outperforms both data-centric and LLM-augmented baselines, with larger gains on larger graphs and strong robustness to LLM errors. Code is available at https://github.com/OpenCausaLab/CauTion.","published_date":"2026-06-02T13:07:43+00:00","viability_score":7,"cluster_label":"Causal Discovery","has_code":true,"repo_url":"https://github.com/OpenCausaLab/CauTion","commercial_flags":["has_code"],"one_liner":"A framework that reliably integrates LLM domain knowledge into an ensemble of statistical causal discovery algorithms for improved accuracy and robustness.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03601v1","title":"DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair","abstract":"While safety alignment and guardrails help large language models (LLMs) avoid harmful outputs, they can also induce overrefusal, i.e., unwarranted rejection of benign queries that merely appear risky. We present DDOR (Delta Debugging for OverRefusal), a fully automated and explainable framework for overrefusal testing and repair in a black-box setting, where only model inputs and outputs are accessible and internal safety mechanisms remain opaque. DDOR applies delta debugging to localize minimal refusal-triggering fragments (mRTFs) that provide phrase-level, explainable evidence for why a refusal occurs. 
Conditioned on these mRTFs, DDOR generates diverse, context-rich prompts and performs multi-oracle validation to filter intrinsically unsafe or ambiguous cases, producing scalable and model-specific overrefusal test suites (approximately 1K cases per model). Beyond evaluation, we further leverage localized mRTFs to perform targeted prompt repair, substantially reducing overrefusal while preserving the original intent and maintaining safety on genuinely harmful inputs. Overall, DDOR offers a practical end-to-end solution to both evaluate and mitigate overrefusal, improving LLM usability without sacrificing safety.","published_date":"2026-06-02T13:07:12+00:00","viability_score":4,"cluster_label":"LLM Safety & Alignment","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An automated framework for testing and repairing LLM overrefusal by localizing minimal refusal-triggering fragments.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03598v1","title":"PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models","abstract":"Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these models in open-ended environments requires continuously acquiring novel skills, a process that inevitably triggers severe catastrophic forgetting of previously learned behaviors. While experience replay (ER) serves as a standard mitigating strategy, naive uniform sampling fundamentally misaligns with the temporal characteristics of manipulation trajectories. It systematically under-samples brief but causally critical sub-skills, leading to phase starvation, and completely overlooks the varying degrees of forgetting across historical tasks. To overcome these limitations, we introduce PHASER, an architecture-agnostic continual learning framework. 
PHASER employs a phase-centric capacity allocation to guarantee equal memory support for all sub-skills, coupled with a multi-modal interference routing strategy that dynamically prioritizes historical phases at high risk of forgetting. Furthermore, to enable fully autonomous lifelong adaptation, we integrate Auto-PC, a lightweight pipeline combining unsupervised action-signal change-point detection with VLM-based semantic verification to extract temporal boundaries without intensive manual supervision. Evaluated across three VLA backbones on LIBERO continual learning suites, PHASER yields substantial empirical improvements, increasing Average Success Rate (ASR) by up to 31% over matched-budget ER and achieving an 87.8% final ASR on the LIBERO-Goal CL setting.","published_date":"2026-06-02T13:04:15+00:00","viability_score":4,"cluster_label":"Robotics Continual Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A continual learning framework for vision-language-action models that addresses catastrophic forgetting using phase-aware experience replay.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03569v1","title":"When Attention Collapses: Stage-Aware Visual Token Pruning from Structure to Semantics","abstract":"Vision-Language Models (VLMs) have demonstrated remarkable capabilities but suffer from significant computational overhead during inference. While visual token pruning offers a promising solution, existing methods predominantly rely on initial attention scores. This single-metric paradigm presents a critical flaw: high attention scores inherently collapse onto semantically similar regions, thereby severely reducing feature diversity and discarding vital contextual details. To address this, we introduce Structure-to-Semantics (STS), a novel two-stage visual token pruning framework that explicitly decouples the pruning process. 
The first stage employs a repulsion-based sampling mechanism to maximize spatial and structural diversity. The second stage leverages instruction-aware cross-attention to precisely filter out prompt-irrelevant tokens. This two-stage synergy constitutes the core of STS, first ensuring geometric coverage and then refining the retained tokens according to semantic relevance. Extensive evaluations demonstrate that STS mitigates the redundancy caused by attention-based selection, improving both structural diversity and fine-grained task alignment of the preserved visual tokens.","published_date":"2026-06-02T12:36:24+00:00","viability_score":7,"cluster_label":"Vision-Language Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A two-stage visual token pruning framework that decouples structure and semantics to improve Vision-Language Model inference efficiency.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03568v1","title":"Learned Non-Maximum Suppression for 3D Object Detection","abstract":"Post-processing is a critical stage in LiDAR-based 3D object detection, where dense and overlapping proposals must be filtered for compact and reliable perception. This work introduces two learned filtering modules that replace heuristic non-maximum suppression (NMS) by leveraging relations among detections. D2D-Rescore employs transformer-based detection-to-detection (D2D) attention, while GossipNet3D adapts the 2D GossipNet concept to 3D through localized message passing in bird's-eye view. A metric-aware matching strategy aligned with the nuScenes evaluation protocol ensures consistent training and validation behavior, improving overall detection performance. Both approaches improve mean average precision (mAP), nuScenes detection score (NDS), and true positive quality compared to CircleNMS, particularly for small and infrequent classes, while adding minimal computational overhead. 
These results demonstrate that learned, detection-level filtering can enhance 3D detector reliability without modifying the base network, offering a principled alternative to heuristic suppression. Code is available at https://github.com/rst-tu-dortmund/learned-3d-nms .","published_date":"2026-06-02T12:34:14+00:00","viability_score":8,"cluster_label":"3D Object Detection","has_code":true,"repo_url":"https://github.com/rst-tu-dortmund/learned-3d-nms","commercial_flags":["has_code"],"one_liner":"Learned filtering modules that replace heuristic non-maximum suppression in 3D object detection, improving accuracy and reliability.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03566v1","title":"Efficient Transformer-Based Localized Patch Sampling for Choroid Plexus Segmentation in Multiple Sclerosis","abstract":"Background: The lateral ventricle choroid plexus (LVCP) is gaining recognition as a key imaging biomarker for multiple sclerosis (MS) related to physical disability and neuroinflammation. Yet, manual segmentation of the LVCP is highly tedious, restricting its use in broad clinical trials and longitudinal assessments. This research aims to develop a SwinUNETR-driven pipeline that leverages targeted intra- and peri-ventricular small patch sampling to automatically segment the LVCP in MS from both standalone and multi-modal MRI inputs. Methods: We retrospectively assessed 3T MRI scans across three sets of data stemming from two separate MS-dominant cohorts (Dataset 1: n=177; Dataset 2: n=177; expanded test set: n=388). Our method employed a SwinUNETR architecture trained on 32x32x32 voxel patches, benchmarking it against the 3D UXNET model. The primary metric for evaluation was the Dice Similarity Coefficient (DSC), supplemented by computational demand (GFLOPs) and the 95th percentile Hausdorff Distance (HD95). 
Results: On the extended test set, the SwinUNETR model secured a mean DSC of 0.868 (95% CI: 0.863-0.872) with MPRAGE and FLAIR combined, showing a statistically significant gain over UXNET (DSC: 0.858 [95% CI: 0.853-0.862], p<0.0001). When restricted to standalone FLAIR inputs, the transformer-based approach sustained a high DSC of 0.863, while the spatial localization of UXNET worsened considerably (HD95: 1.86 vs. 3.00 mm). Importantly, the proposed framework lowered computational load by 99% (91.8 vs. 22,080 GFLOPs). By integrating localized patch sampling with a SwinUNETR architecture, this methodology offers an accurate, robust, and statistically superior alternative to current leading models for LVCP segmentation. Its vast reduction in computational cost makes it ideal for widespread implementation in clinical and research environments.","published_date":"2026-06-02T12:32:43+00:00","viability_score":8,"cluster_label":"Medical Imaging Segmentation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An efficient transformer-based pipeline for accurate and computationally inexpensive choroid plexus segmentation in multiple sclerosis.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03564v1","title":"\\textsc{CR-Seg}: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation","abstract":"Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing methods typically rely on either learned semantic tokens to bridge Multimodal Large Language Models (MLLMs) and segmentation models, suffering from difficult cross-modal alignment, or explicit spatial prompts such as bounding boxes, which may lose holistic response semantics. 
To address these limitations, we propose Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation, termed CR-Seg, a two-stage framework for coarse-to-refined reasoning segmentation. Specifically, we design an Extract Attention Maps and Points (EAP) module to extract attention maps for coarse target localization and select informative points, both of which are fed into SAM for mask refinement. To alleviate reasoning--answer inconsistency, we further introduce Global-to-Local Chain-of-Thought (GLCoT), which guides the model to reason progressively from global scene context to local target details. Extensive experiments on reasoning segmentation benchmarks demonstrate the effectiveness of CR-Seg.","published_date":"2026-06-02T12:30:04+00:00","viability_score":7,"cluster_label":"Reasoning Segmentation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A two-stage framework for reasoning segmentation that uses attention maps and Chain-of-Thought to improve coarse-to-refined object localization.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03557v1","title":"From Prompt to Service: An SLM-Based Agent Orchestration Gateway for AI-Driven Virtual Worlds","abstract":"As generative AI capabilities expand, AI-driven virtual worlds face a growing architectural challenge. Users interact through in-world interfaces in multimodal ways, yet their requests demand fundamentally different AI backend models and computational resources. Embedding these capabilities directly into virtual world systems reduces extensibility, complicates maintenance, and limits the ability to coordinate services distributed across edge and cloud infrastructure. This paper presents an SLM-based Agent Orchestration Gateway, a lightweight runtime coordination mechanism that decouples a virtual world client from heterogeneous AI backends through intent-driven service routing. 
An edge-deployed SLM classifies the semantic intent of each user prompt, a configurable service registry validates and resolves the routing decision, and the selected backend is invoked transparently, enabling new AI capabilities to be introduced in the virtual world without modifying the client application. The gateway is implemented and evaluated within the InterwovenXR virtual museum testbed. The evaluation shows that compact SLMs can serve as reliable intent routers on edge hardware, and that task-specific fine-tuning can transform sub-billion-parameter models into practical, low-latency routers. A layered configuration pairing a fine-tuned sub-billion-parameter model as router with a larger SLM for conversational response generation is shown to be deployable on mid-range edge hardware and more efficient than delegating both responsibilities to a single model. The findings show that SLMs can support practical AI service orchestration in virtual worlds and the work contributes an evaluated architecture for scalable, extensible, and edge-supported AI interaction, enabling virtual agents to become access points to distributed generative AI services.","published_date":"2026-06-02T12:22:03+00:00","viability_score":5,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An SLM-based gateway orchestrates AI services for virtual worlds, decoupling clients from heterogeneous backends for extensibility and edge deployment.","time_to_mvp":"1-3 months","tags":[]},{"arxiv_id":"2606.03544v1","title":"SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems","abstract":"Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. 
This raises an under-studied question: when does shared experience produce improvements that self-improvement alone cannot achieve? We introduce SAGE (Social Agent Group Evolution), an evaluation framework that compares two compute-matched conditions: SocialEvo, where agents from five distinct model families co-evolve with access to all peers' histories; and SelfEvo, where each agent receives the same number of task attempts but sees only its own past, which is conventional in self-improving agent studies. We instantiate SAGE in three arenas: open-ended ML research, long-horizon economic planning, and strategic multiplayer play, evaluated across multiple evolutionary rounds. We find that group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling. However, agents that plateau under self-improvement can achieve significant breakthroughs when peer experience is available. In competitive settings, counterfactual controls reveal that agents improve generally rather than developing opponent-specific strategies. Across different forms of shared history, filtered peer traces and reflective summaries often outperform raw logs, indicating that social gains depend on abstraction rather than exposure volume. These findings reveal that peer-history gains are agent-specific, arena-dependent, and contingent on the capacity to abstract transferable knowledge from public traces.","published_date":"2026-06-02T12:08:38+00:00","viability_score":4,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"SAGE evaluates socialized evolution in agent ecosystems, showing peer experience can unlock breakthroughs for plateaued agents, especially with abstracted knowledge.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03532v1","title":"When Should the Teacher Move? 
Temporal Coupling and Stability in Self On-Policy Distillation","abstract":"Self on-policy distillation trains a student policy against a teacher derived from its own parameter history, yet the teacher's update schedule -- which governs the \\emph{temporal coupling} between teacher and student -- has not been systematically studied as a stability variable. Through a controlled schedule sweep on Qwen3-8B, we establish that \\emph{isolation periods}, defined as complete teacher freezing between updates, are the key structural property enabling stable learning, not teacher age. To characterize these underlying training dynamics, we introduce a diagnostic framework of temporal KL structure, refresh shock, and length-tail risk. This framework further uncovers \\emph{state-oblivious collapse}: optimal short-horizon fixed schedules catastrophically fail under long-horizon training because a clock-driven refresh can copy a transiently drifting student into the teacher in a single, irreversible step. This failure mode is invisible under short-horizon evaluation and mechanistically distinct from EMA's chronic contamination. To address this, we propose \\emph{Consolidation-Gated Teacher Refresh} (CGTR), which preserves isolation periods while gating each refresh on joint evidence of reward improvement and length-tail safety, ensuring every teacher movement responds to genuine student consolidation rather than a clock signal. 
With a single shared parameter set and no per-dataset retuning, CGTR achieves \\textbf{zero collapse} and the best final score on all four tasks (Chemistry, Biology, Physics, ToolUse), self-regulating its refresh frequency to each task's learning dynamics.","published_date":"2026-06-02T11:54:39+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Consolidation-Gated Teacher Refresh (CGTR) stabilizes self on-policy distillation by gating teacher updates on reward improvement and safety, preventing collapse and achieving state-of-the-art on multiple tasks.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03523v1","title":"High-Precision APT Malware Attribution with Out-of-Scope Resilience","abstract":"Early attribution of Advanced Persistent Threat (APT) activity can help defenders prioritise investigation, select countermeasures, and reduce the impact of an intrusion. Malware provides useful attribution evidence, but automated APT malware attribution remains difficult in practice. Existing approaches are typically trained and evaluated as closed-set classifiers over a limited number of known APT groups. In operational environments, however, classifiers are likely to encounter samples from groups not represented during training. Closed-set classifiers are then forced to assign such samples to known groups, producing unsupported and potentially misleading attributions. We present a high-precision APT malware attribution method based on ranked binary classifiers with explicit abstention. Rather than training a single multi-class classifier, our approach trains and tunes two binary classifiers per APT group, ranks the classifiers by validation performance, and applies them sequentially. A sample is attributed only when a classifier provides sufficient evidence; otherwise, it abstains. 
We evaluate the method on the APT Malware dataset and on a larger combined dataset designed to stress-test out-of-scope behaviour. On the APT Malware dataset, the method achieves higher precision than previously published results on the same dataset. In the most challenging setting, where 87% of test samples came from 60 APT groups excluded from training, the method abstained on 94% of out-of-scope samples while maintaining 92% precision and 95% selective accuracy on the samples it classified.","published_date":"2026-06-02T11:44:46+00:00","viability_score":7,"cluster_label":"Security AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A high-precision APT malware attribution method with ranked binary classifiers and explicit abstention achieves 92% precision on out-of-scope samples, outperforming existing closed-set approaches.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03521v1","title":"Post-Hoc Robustness for Model-Based Reinforcement Learning","abstract":"To improve the real-world applicability of reinforcement learning (RL), the field of adversarially robust RL studies how to train agents under adversarial environment perturbations. In this setting, a protagonist agent optimizes a policy under environmental perturbations from an adversary, resulting in a zero-sum Markov game. When adversarially robust RL is combined with model-based RL, the adversary can target a learned transition model instead of the training environment. Extending this idea, this work introduces post-hoc robustification of deep RL agents at inference time. By using the learned model in combination with a trained nominal policy, our approach performs a robust policy improvement step. The goal is to improve robustness without any additional training of neural networks. 
Specifically, we utilize model-predictive control under adversarial rollouts, which are approximated via projected gradient descent within a bounded uncertainty set. Furthermore, these offline rollouts are performed while considering and mitigating out-of-distribution issues. The proposed methodology is validated by demonstrating significant improvements in robustness when the algorithm is evaluated in perturbed Gymnasium MuJoCo environments, while considering the computational limitations of the post-hoc inference setting.","published_date":"2026-06-02T11:43:13+00:00","viability_score":4,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A post-hoc method to improve the robustness of trained reinforcement learning agents at inference time by using adversarial rollouts with a learned model.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2606.03518v1","title":"Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI","abstract":"As AI systems evolve from passive models into autonomous active agents capable of initiating actions, collaborating, and delegating tasks, the traditional boundaries of software systems blur. Traditional authorization and delegation frameworks, built around fixed principals, explicit requests, and static scopes, are insufficient to govern agentic systems. Agentic AI demands richer authorization semantics: agents must inherit and delegate permissions, act under time-limited authority, and coordinate through shared protocols. Existing Identity and Access Management (IAM) systems fail to fully capture this notion of agency, lacking mechanisms for recursive delegation, contextual boundaries, and dynamic scoping as executable governance primitives. Unlike access delegation standards such as OAuth 2.0, we treat delegation as a contractual term rather than merely a static token-based consent credential. 
This paper proposes a compositional governance framework that introduces primitives indispensable for agentic AI. We define types of delegation and their permissions and accountability implications, and we introduce a notion of resource scope attenuation to bound agentic access envelopes. These concepts are expressed as general relational definitions that can be composed into existing authorization domains (e.g., financial systems). To operationalize this composition, we define a compositional operator that overlays new agentic semantics, such as recursive delegation chains, onto existing relational policies without rewriting them. We substantiate this framework through formal proofs and empirical evaluation, showing that it provides a formal yet practical foundation for accountable authorization in agentic AI systems.","published_date":"2026-06-02T11:39:39+00:00","viability_score":3,"cluster_label":"AI Governance","has_code":true,"repo_url":"https://github.com/Amjad-Ibrahim-Huawei/compositional-paper","commercial_flags":["has_code"],"one_liner":"A compositional authorization framework for agentic AI that introduces primitives for recursive delegation, contextual boundaries, and dynamic scoping.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03517v1","title":"Scalable On-Hardware Training of Quantum Neural Networks and Application to Clinical Data Imputation","abstract":"Training quantum neural networks (QNNs) on quantum hardware is currently bottlenecked by the cost of gradient estimation: standard parameter-shift methods require a number of circuit evaluations that grows quadratically with the number of trainable parameters, making hardware-based optimisation impractical beyond small system sizes. In this work, we introduce a training framework that reduces this cost to logarithmic in the number of qubits, making gradient-based QNN optimisation feasible on near-term hardware at increasing scales.   
Our framework combines three co-designed ingredients: (i) a structured, subspace-preserving Butterfly circuit architecture with $O(n \\log n)$ parameters and logarithmic depth; (ii) a layer-wise training strategy that confines on-hardware optimisation to one small, well-structured layer at a time; and (iii) a parallelised parameter-shift rule that exploits the commuting structure within each Butterfly layer to extract all gradients in a constant number of circuit executions. Together these reduce the number of distinct circuit evaluations per optimisation step from $O(n^2)$ to $O(\\log n)$.   We validate the framework on clinical data imputation using the MIMIC-III electronic health record dataset, a demanding benchmark sensitive to optimisation instability and model variance. Hybrid classical-quantum models are trained directly on IonQ Forte Enterprise trapped-ion hardware at 16 qubits without performance degradation relative to ideal or noisy simulation and via tensor-network simulation at 32 qubits, with 32-qubit inference executed on hardware. The resulting models match or exceed strong classical neural baselines in downstream patient survival prediction while exhibiting reduced variance across runs, demonstrating that the proposed framework enables practical, scalable QNN training under realistic hardware constraints.","published_date":"2026-06-02T11:38:48+00:00","viability_score":4,"cluster_label":"Quantum AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for scalable on-hardware training of quantum neural networks that reduces gradient estimation cost to logarithmic in the number of qubits.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03512v1","title":"SPADE: Sketch-guided Path Planning Augmented with Diffusion Experts","abstract":"Path planning is essential for Autonomous Mobile Robots (AMRs). 
Conventional methods for incorporating human preferences into planning typically rely on either complex reward engineering or hardware-intensive solutions. Recent state-of-the-art frameworks leverage imitation learning to train behavior-specific path planning models from expert demonstrations. However, these approaches face two key limitations: limited generalization to unseen environments and low robustness in demonstration collection. To address these challenges, this work introduces an enhanced framework that focuses on two main contributions: an overhauled annotation tool built on ROS 2, and a novel training strategy that integrates diffusion-based augmentation into baseline behavioral cloning models. A dataset of expert demonstrations is provided and evaluated through ablation studies to assess the robustness of the proposed solution. The enhanced approach outperforms state-of-the-art methods with 39.1% lower Absolute Pose Error (APE) and 33.5% lower Fréchet Inception Distance (FID) while having 93.8% fewer trainable parameters. Moreover, it attains diffusion-level generalization while preserving the real-time, on-edge properties of state-of-the-art models.","published_date":"2026-06-02T11:29:00+00:00","viability_score":7,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An enhanced path planning framework for robots that integrates diffusion-based augmentation with behavioral cloning models for improved generalization and robustness.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.03504v1","title":"BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language","abstract":"We present BaltiVoice, a 16.8-hour read-speech corpus for Balti (ISO 639-3: bft), a Tibetic language spoken in Gilgit-Baltistan, Pakistan, with no prior publicly available ASR resources. 
The corpus contains 10,060 validated utterances in native Nastaliq script, derived from Mozilla Common Voice recordings. We fine-tune OpenAI Whisper-small on this corpus and report a Word Error Rate (WER) of 30.07% on a held-out validation set of 538 utterances, down from a measured zero-shot baseline of 182.18% for Whisper-small on Balti. The dataset, fine-tuned model, and a live transcription demo are publicly available on HuggingFace.","published_date":"2026-06-02T11:23:49+00:00","viability_score":8,"cluster_label":"Speech Recognition","has_code":true,"repo_url":"https://github.com/mohdali-dev/BaltiVoice-ASR","commercial_flags":["has_code"],"one_liner":"A fine-tuned Whisper ASR system and speech corpus for the Balti language, enabling transcription with a public demo and model.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03503v1","title":"ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning","abstract":"Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chain-of-Thoughts (CoTs). However, since long CoTs naturally contain trial and errors and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the over-thinking issues of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. 
Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.","published_date":"2026-06-02T11:21:27+00:00","viability_score":4,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":"https://github.com/ziyanliux/ThoughtFold","commercial_flags":["has_code"],"one_liner":"A framework to reduce redundant explorations in Large Reasoning Models by folding reasoning chains into more concise paths.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03489v1","title":"Learn from Your Mistakes: Tree-like Self-Play for Secure Code LLMs","abstract":"While Large Language Models (LLMs) excel in code generation, they remain prone to replicating subtle yet critical vulnerabilities endemic to their training data. Current alignment techniques, such as Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), typically apply coarse-grained optimization at the sequence level. This approach often fails to address the localized nature of security flaws, where a single incorrect token choice can compromise an entire program. To bridge this gap, we introduce Tree-like Self-Play (TSP), a framework that reframes secure code generation as a fine-grained sequential decision process. Unlike standard methods that blindly maximize likelihood, TSP constructs a decision tree where the model explores branching trajectories--generating both secure \"golden paths\" and vulnerable variants. By treating code generation as a self-play game, the model learns to strictly discriminate against its own localized errors. 
This provides a dense, on-policy learning signal that forces self-correction precisely at the critical decision nodes where vulnerabilities typically emerge. Our experiments demonstrate that TSP fundamentally enhances model reliability. In Python security benchmarks, TSP boosts CodeLlama-7B's pass rate (SPR@1) to 75.8%, significantly outperforming SFT (57.0%) and unstructured self-play baselines. Crucially, TSP induces robust out-of-distribution generalization: the model not only reduces vulnerabilities in unseen categories (CWEs) by 24.5% but also successfully transfers security principles learned from C/C++ to diverse languages, including Python, Go, and JavaScript. This suggests that TSP does not merely memorize patches, but internalizes abstract, language-agnostic security logic.","published_date":"2026-06-02T11:07:20+00:00","viability_score":8,"cluster_label":"Secure LLMs","has_code":true,"repo_url":"https://github.com/Easonnoway/TSP","commercial_flags":["has_code"],"one_liner":"A tree-like self-play framework that trains LLMs to discriminate against their own localized security errors for robust code generation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03486v1","title":"NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense","abstract":"Large language models remain vulnerable to jailbreak attacks that hide harmful intent behind seemingly ordinary requests such as role-play, translation, encoding, adversarial suffixes, and multi-turn buildup. Existing defenses still struggle to handle these attacks without over-blocking benign but sensitive requests, partly because they often apply the same action to every prompt and therefore fail to balance safety and helpfulness. 
We propose NeuroArmor, a white-box runtime defense that uses prompt-specific safe variants as a local safety reference for deciding when intervention is needed and, once triggered, as safe targets for intervention. For each prompt, NeuroArmor builds K safe variants, compares the prompt state against this local safe reference in hidden-state space, and routes anomalies either to a refusal branch for malicious prompts or to a helpful recovery branch for borderline benign prompts. On Llama-3-8B-Instruct, NeuroArmor reduces malicious attack success rate (ASR) from 41.56% to 1.57% while lowering benign false positive rate (FPR) on the shared benign pool from 30.26% to 22.05%; matched baselines remain substantially weaker on this trade-off. External-judge and manual behavioral evaluations further show that the remaining non-blocked outputs are much less likely to be operationally harmful. Overall, NeuroArmor provides a more effective runtime strategy for jailbreak defense by combining prompt-specific consistency checking, routing, and selective intervention.","published_date":"2026-06-02T11:01:50+00:00","viability_score":5,"cluster_label":"LLM Safety","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A runtime defense for LLMs that uses prompt-specific safe variants to balance safety and helpfulness in jailbreak defense.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2606.03483v1","title":"Analyzing Stream Collapse in Hyper-Connections: From Diagnosis to Mitigation","abstract":"Hyper-Connections (HC) replace the single Transformer residual stream with multiple streams, introducing a permutation symmetry over stream indices. We study how this symmetry is resolved in practice: whether streams specialize in a balanced way or exhibit dominant-stream usage. Using fine-grained diagnostics for HC-based language models, we trace how multi-stream representations are actually used. 
We find that after an early seeding stage, residual mixing often remains close to identity, limiting a core HC mechanism for exchanging information between streams. Moreover, both signal and interpretable features concentrate in a dominant stream, and the nominally multi-stream residual connection can underutilize its capacity, behaving closer to a single-stream residual pathway. Finally, we show that breaking symmetry at stream initialization reduces dominant behavior and improves performance across mHC variants. Our code is publicly available.","published_date":"2026-06-02T11:00:49+00:00","viability_score":4,"cluster_label":"LLM Analysis","has_code":true,"repo_url":"https://github.com/brain-lab-research/hc-stream-collapse","commercial_flags":["has_code"],"one_liner":"This research analyzes and proposes mitigation strategies for stream collapse in hyper-connected Transformer models, with publicly available code.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03471v1","title":"A formal definition and meta-model for a machine theory of mind","abstract":"This paper proposes, for the first time, a rigorous formal definition of the concept of Machine Theory of Mind, based on principles supported by evidence from cognitive psychology, neuroscience and artificial intelligence, and uses the above as a lens to examine state-of-the-art and current efforts in the field, driving a potential agenda for further research there able to \"crack\" the problem. 
It also advances a general holistic meta-model for Machine Theory of Mind, and examines the state of the art when it comes to empirically benchmarking such models.","published_date":"2026-06-02T10:48:59+00:00","viability_score":0,"cluster_label":"AI Theory","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper formally defines Machine Theory of Mind and proposes a meta-model, drawing from cognitive psychology, neuroscience, and AI.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03467v1","title":"StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems","abstract":"LLM-based multi-agent systems exhibit remarkable collaborative capabilities in complex multi-step tasks. However, these systems are highly sensitive to single-step execution errors that can propagate through agent interactions and lead to cascading failures. To understand the causes of failure and improve system reliability, failure attribution has been introduced as a task that aims to automatically identify the root cause step responsible for a failure. Existing failure attribution methods mainly rely on LLMs to reason over original execution trajectories, which not only incur high inference costs and latency, but also suffer from interference caused by redundant and noisy execution logs, causing LLMs to struggle in accurately identifying the true root cause step. To address this, we propose StepFinder, a lightweight failure attribution framework. We use LLMs solely during the feature construction phase to encode execution logs into temporal semantic sequences. Subsequently, a parameter-efficient combination of temporal modeling and attention modules is applied to capture the sequential evolution and cross-step dependencies of the trajectories. Finally, the step-level error score is refined through multi-scale differences and position bias, enabling precise root cause identification. 
Experimental results on the Who&When benchmark demonstrate that StepFinder outperforms LLM-based methods in step-level failure attribution while achieving substantially higher inference efficiency, reducing inference time by 79% compared with the fastest LLM-based method, with no text generation overhead. Our code is available at https://github.com/taiyu-zhu/StepFinder.","published_date":"2026-06-02T10:45:49+00:00","viability_score":7,"cluster_label":"Multi-Agent Systems","has_code":true,"repo_url":"https://github.com/taiyu-zhu/StepFinder","commercial_flags":["has_code"],"one_liner":"StepFinder is a lightweight framework for efficient failure attribution in LLM-based multi-agent systems, outperforming LLM-based methods and reducing inference time by 79%.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03465v1","title":"Rethinking the Role of Tensor Decompositions in Post-Training LLM Compression","abstract":"Post-training compression is essential for deploying large language models (LLMs) under tight resource constraints. Tensor decompositions have emerged as a promising direction, offering compact parameterizations well suited to Transformer weight structures. However, existing studies evaluate these methods in narrow settings, leaving unclear whether tensorization is effective at large-scale deployment. We systematically evaluate tensor compression across dense and MoE architectures, establishing performance trade-offs grounded in both empirical analysis and theoretical analysis. We identify a fundamental mismatch between the shared subspaces assumed by tensor decompositions and the heterogeneous representations learned by modern LLMs, thereby delineating their practical limits and clarifying their viable role in large-scale deployment. 
The code is available at https://github.com/brain-lab-research/TT-LLM.","published_date":"2026-06-02T10:45:21+00:00","viability_score":4,"cluster_label":"LLM Compression","has_code":true,"repo_url":"https://github.com/brain-lab-research/TT-LLM","commercial_flags":["has_code"],"one_liner":"This research systematically evaluates tensor decompositions for post-training LLM compression, identifying practical limits and clarifying their role in large-scale deployment, with available code.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03463v1","title":"DMF: A Deterministic Memory Framework for Conversational AI Agents","abstract":"Conversational AI agents require memory systems that are both scalable and semantically coherent across long interaction horizons. Existing approaches rely predominantly on large language model (LLM)-based summarisation at write time, which introduces non-determinism, escalating token costs, and opacity in pruning decisions. We present the Deterministic Memory Framework (DMF), a CPU-first approach that replaces generative memory compression with a fully deterministic pipeline grounded in classical NLP analysis, vector geometry, and mathematical scoring. DMF assigns each conversational interaction a Survival Score $\u03a9$ computed from deterministic content signals, conversational cues, and structured provenance, combined through a logistic projection. An interaction-count decay law, denoted as $\u03a9_{\\mathrm{eff}}(\u0394n)$, governs how relevance evolves as new turns arrive, where $\u0394n$ is the number of newer interactions rather than wall-clock time, preserving full determinism. We present the mathematical formulation of DMF, its structured recall pipeline, the pruning decision procedure, and the evaluation protocol. Experiments are conducted on a purpose-built benchmark using the LoCoMo and LongMemEval datasets. We compare DMF against Mem0, a popular memory layer for AI agents. 
DMF achieves comparable accuracy while using zero tokens to prepare the memory context and 5x to 242x fewer tokens over the entire conversation. These results show that it is possible to eliminate LLM calls from the memory-management loop, reducing token costs to nearly zero and enabling deterministic memory systems for conversational AI agents.","published_date":"2026-06-02T10:41:28+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A deterministic memory framework for conversational AI agents that drastically reduces token costs by replacing LLM summarization with classical NLP and vector geometry.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03461v1","title":"What Makes Interaction Trajectories Effective for Training Terminal Agents?","abstract":"Stronger code agents are commonly assumed to be superior teachers for post-training, yet this assumption remains poorly disentangled from task difficulty, harness design, and student capacity. We investigate this pedagogical link using Terminal-Lego, a scalable pipeline that transforms multi-domain real-world issues into environment-verified agentic tasks. Surprisingly, standalone performance does not dictate teaching efficacy: while Claude Opus 4.6 achieves higher scores on Terminal-Bench 2.0, students fine-tuned on trajectories from DeepSeek-V3.2, a lower-scoring agent, exhibit significantly stronger generalization. We attribute this \"pedagogical paradox\" to Environment-Grounded Supervision (EGS): trajectories that explicitly expose inspect-act-verify behaviors through harness-visible interactions allow students to internalize robust problem-solving routines rather than fragile action sequences. 
Scaling analysis reveals exceptional data efficiency: with only 15.3k Terminal-Lego trajectories, for example, Qwen3-32B achieves a 24.3% score on Terminal-Bench 2.0, rivaling previous SOTA performance established with over 30x the data volume. Our results suggest that the frontier of agent post-training lies beyond mere outcome-matching, shifting the focus toward \"Harness Engineering\", where the systematic design of environment-grounded interaction structures serves as the primary catalyst for reproducible and generalizable agentic intelligence.","published_date":"2026-06-02T10:37:47+00:00","viability_score":5,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigating the effectiveness of interaction trajectories for training terminal agents, revealing that explicit inspect-act-verify behaviors are more crucial than standalone performance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03459v1","title":"Tonal parsimony in chord-sequence analysis: combining modulation cost and tonal vocabulary","abstract":"We study the assignment of local tonalities to chord sequences, a task useful for harmonic analysis, composition, and jazz-oriented improvisation. Standard dynamic-programming approaches minimize modulations but can introduce unnecessarily many tonal centers. We compare this transition-only objective with pure minimum-vocabulary analysis and with tonal parsimony, which minimizes lexicographically the number of modulations and then the number of distinct tonalities. Although this joint objective is combinatorially hard in general, we give exact algorithms exploiting the fixed 24-tonality major/minor universe. On 31,032 LMD Chords sequences, tonal parsimony preserves the transition optimum while reducing tonal vocabulary in 55.8% of cases. With weighted jazz-substitution closure, it lowers mean tonalities from 3.802 to 3.206 and modulations from 16.728 to 12.141. 
On 1,555 annotated jazz standards, it improves compatible chord-scale agreement to 95.6%, supporting tractable professional-scale harmonic analysis.","published_date":"2026-06-02T10:36:05+00:00","viability_score":0,"cluster_label":"Music AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Developing tonal parsimony algorithms for chord-sequence analysis that minimize both modulations and distinct tonalities, improving harmonic analysis and composition.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03453v1","title":"FORGE: Multi-Agent Graduated Exploitation and Detection Engineering","abstract":"Vulnerability disclosure volumes now far exceed organizational assessment capacity, yet three adjacent research communities (proof-of-concept generation, vulnerability prioritization, and detection rule engineering) operate largely in isolation. Existing automated exploit generation systems report binary pass/fail outcomes, discarding partial progress and producing no signal for the other two communities. This paper presents FORGE, a multi-agent system that bridges these three silos through graduated exploitation depth. Five specialized agents (Intel, Generator, Planner, Exploit, and Detector) execute in a fixed pipeline that (1) generates targeted vulnerable applications from CVE metadata, (2) conducts coached, multi-turn exploitation assessed by an LLM-primary oracle on a four-level taxonomy (L0: no evidence through L3: full compromise), and (3) produces Sigma and Snort detection rules grounded in OpenTelemetry exploitation traces. Graduated depth is the bridging mechanism: deeper exploitation yields richer behavioral traces for detection engineering, while depth data across scoring bands provides ground truth for prioritization validation. A tiered knowledge architecture accumulates intelligence across assessments, transferring build and exploitation experience to subsequent CVEs. 
Evaluation on 603 CVEs from the CVE-GENIE dataset achieves 67.8% end-to-end L1+ exploitation at USD 1.50 per CVE across eight languages and 187 CWE types. Exploitation rates remain near 68% regardless of EPSS or CVSS band, indicating that pattern-level reachability is orthogonal to metadata-based prioritization. Detection rules from L2+ exploitation achieve significantly higher span-normalized grounding than L1-derived rules (p=0.035), and 93.4% of generated Snort rules produce zero false positives against a synthetic benign corpus.","published_date":"2026-06-02T10:32:28+00:00","viability_score":7,"cluster_label":"Cybersecurity AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"FORGE is a multi-agent system that bridges vulnerability disclosure, prioritization, and detection rule engineering by graduated exploitation depth.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03444v1","title":"PRISM: Synergizing Vision Foundation Models via Self-organized Expert Specialization","abstract":"Unifying the complementary strengths of diverse Vision Foundation Models (VFMs) into a single efficient model is highly desirable but challenged by the negative transfer inherent in monolithic distillation. To address these feature conflicts, we introduce PRISM, a novel dual-stream Mixture-of-Experts (MoE) framework that synergizes VFMs via modular specialization. We propose a two-stage paradigm: (1) expertise deconstruction, where a teacher-conditional router guides experts to specialize in distinct representational subspaces to mitigate interference, followed by (2) dynamic recomposition, where the router learns to assemble these experts into tailored computational pathways for downstream tasks. 
Experiments on PASCAL-Context and NYUD-v2 show that PRISM establishes a new state of the art, validating that sparse, emergent specialization is a scalable approach for integrating diverse visual knowledge.","published_date":"2026-06-02T10:28:32+00:00","viability_score":3,"cluster_label":"Vision Foundation Models","has_code":true,"repo_url":"https://github.com/robotyingtang/PRISM-VFM","commercial_flags":["has_code"],"one_liner":"A novel dual-stream Mixture-of-Experts framework synergizes diverse Vision Foundation Models via modular specialization to establish a new state of the art in visual knowledge integration.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03435v1","title":"CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations","abstract":"Cell Painting combines multiplexed fluorescent staining, high-content imaging, and quantitative analysis to generate high-dimensional phenotypic readouts to support diverse downstream tasks such as mechanism-of-action (MoA) inference, toxicity prediction, and construction of drug-disease atlases. However, existing workflows are slow, costly and difficult to interpret. Approaches for drug screening modeling predominantly focus on molecular representation learning, while neglecting actual experimental context (e.g., cell line, dosing schedule, etc.), limiting generalization and MoA resolution. We introduce CP-Agent, an agentic multimodal large language model (MLLM) capable of generating mechanism-relevant, human-interpretable rationales for cell morphological changes under drug perturbations. At its core, CP-Agent leverages a context-aware alignment module, CP-CLIP, that jointly embeds high-content images and experimental metadata to enable robust treatment and MoA discrimination (achieving a maximum F1-score of 0.896). 
By integrating CP-CLIP outputs with agentic tool usage and reasoning, CP-Agent compiles rationales into a structured report to guide experimental design and hypothesis refinement. These capabilities highlight CP-Agent's potential to accelerate drug discovery by enabling more interpretable, scalable, and context-aware phenotypic screening -- streamlining iterative cycles of hypothesis generation in drug discovery.","published_date":"2026-06-02T10:20:00+00:00","viability_score":6,"cluster_label":"Drug Discovery AI","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"CP-Agent is an agentic multimodal large language model that generates human-interpretable rationales for cell morphological changes under drug perturbations, accelerating drug discovery.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03432v1","title":"A Hybrid Approach For Malware Classification Using Secondary Features Fusion","abstract":"The number of malware (either variant or novel) is rapidly increasing, making malware detection and mitigation a complex problem. One approach to improving malware mitigation is automatic detection and malware family classification. However, traditional malware detection methods cannot classify detected malware into their respective families, hindering effective malware mitigation. Consequently, this paper proposes a method to automate malware detection and classification of the detected malware into respective malware families. The proposed method uses feature fusion after extracting relevant malware features such as API calls and fixed and variable length n-grams with a customized feature selection method. Moreover, for the predictive model, a voting based approach is proposed for algorithm fusion. For the experimental evaluation of the proposed method, both binary and multi-class classification approaches are applied to the data set provided by Microsoft. 
Finally, the experimental results are compared with the state of the art. The experimental results indicate the effectiveness and efficiency of the proposed approach with an AUC of 0.989, accuracy of 99.72%, and a log loss of 0.01.","published_date":"2026-06-02T10:19:35+00:00","viability_score":3,"cluster_label":"Malware Classification","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A hybrid approach for malware classification using feature fusion and a voting-based algorithm fusion method achieves high accuracy in identifying malware families.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03430v1","title":"FlowGuard: Flow Matching for Identity-Independent Detection of Data-Free Model Stealing Attacks on Energy System Intrusion Detection Systems","abstract":"Artificial Intelligence (AI)-based Intrusion Detection Systems (IDS) deployed in energy infrastructure are vulnerable to model theft attacks, which allow adversaries to create evasive traffic offline. Current defences against model extraction rely either on identity-bound query monitoring, which is ineffective against distributed attackers (Sybil), or on prediction poisoning through soft-label perturbation, which is inapplicable to hard-label IDS deployments. Therefore, we propose FlowGuard, an identity-independent defence based on flow matching that classifies incoming queries as out-of-distribution (OOD) prior to IDS processing. This approach exploits the fact that queries generated synthetically for data-free model stealing attacks occupy a lower-dimensional manifold than real network traffic. This results in measurably lower log-likelihoods when using a Continuous Normalizing Flow that has been trained on legitimate data. We evaluate our method against PRADA and FDINet using MAZE and DisGUIDE attacks in single-client and distributed (100-client Sybil) settings. 
While PRADA's detection rate dropped to 0% when the distribution changed, our defence maintained a stable detection rate across both settings without relying on identity information. We discuss the scope and limitations of the approach, and outline potential applications to data-dependent attacks.","published_date":"2026-06-02T10:18:45+00:00","viability_score":3,"cluster_label":"Cybersecurity AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"FlowGuard is an identity-independent defense using flow matching to detect data-free model stealing attacks on energy system intrusion detection systems.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03428v1","title":"PrimeSVT: An Automated Memory-aware Pruning Framework with Prioritized Compression Policy for Spiking Vision Transformers","abstract":"The large sizes of Spiking Vision Transformers (SViTs) still hinder their embedded implementation, highlighting the need for model compression. State-of-the-art works compress SViT models through unstructured pruning, which needs specialized hardware accelerators for their specific sparsity patterns to maximize efficiency gains. Moreover, their manual approach requires a huge design time to find an appropriate pruning setting for each network, thus making this approach not scalable. To address this limitation, we propose PrimeSVT, a novel framework that performs automated memory-aware structured pruning on pre-trained SViT models, thereby maximizing their efficiency gains during inference amenable to widely-used computing architectures. 
To achieve this, PrimeSVT first sorts the SViT layers based on their sizes (i.e., number of parameters), identifies the targeted pruning layers based on their robustness under different pruning rates, then leverages this order for compressing the model layer-by-layer sequentially from the largest one to the smallest one (i.e., the so-called prioritized compression policy), while considering the user-defined constraints (i.e., acceptable accuracy and memory saving). In each layer, PrimeSVT employs channel-wise filter pruning based on their L2-norm values to structurally remove the non-significant weights. Experimental results show that PrimeSVT saves 26.68% memory through automated single-shot pruning, while preserving accuracy within 3% (70.3% without fine-tuning and 72.9% with fine-tuning) of the original unpruned SViT model (73.3%), thus meeting the accuracy and memory constraints. These results show that our PrimeSVT framework enables design automation for SViTs and their embedded implementation.","published_date":"2026-06-02T10:18:00+00:00","viability_score":4,"cluster_label":"Model Compression","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for automated memory-aware structured pruning of Spiking Vision Transformers to improve their efficiency on embedded systems.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03419v1","title":"Optimizing Explicit Unit-Distance Lower-Bound Certificates","abstract":"The 2026 disproof of Erd\u0151s's unit-distance conjecture and Sawin's subsequent explicit quantitative refinement show that the maximum number $u(n)$ of unit distances among $n$ planar points can exceed $n^{1+\\varepsilon}$ for a fixed positive $\\varepsilon$. Sawin's explicit bound gives more than $n^{1.014}$ unit distances for arbitrarily large $n$ and exposes finite parameters whose choice is not fully optimized. 
This report formulates the finite parameter-selection task as a variant of a nonlinear integer programming problem and proposes an open-source Python verification pipeline, first validated by reproducing Sawin's published parameter choice and then applied to computationally improved certificates. The main computational contribution is an integer optimization and checking procedure for the sets of primes $T$ and $S_Q$, the integer multiplicities $k(p)$, and a rationally encoded real parameter $R$. The optimization pipelines are intentionally lightweight and replicable on standard hardware: we propose a deterministic greedy construction heuristic, a Tailored Integer Evolution Strategy with repair operators for number-theoretic feasibility, and a two-parent discrete-recombination variant. Four certificate levels are compared: Sawin's published example with $\u03b4=0.0141144286784982\\ldots$, a greedy optimization certificate with $\u03b4=0.0151718056372133\\ldots$, a Tailored Integer Evolution Strategy certificate with rational $R=6672416/100000$ and $\u03b4=0.0152616610684193\\ldots$, and a Tailored Integer Evolution Strategy with discrete recombination, again with $R=6672416/100000$, giving $\u03b4=0.0152628688170072\\ldots$. 
Consequently, subject to Sawin's explicit criterion being applied exactly as cited, the best current certificate supports the cautious clean statement $u(n)>n^{1.0152}$ for arbitrarily large $n$.","published_date":"2026-06-02T10:05:23+00:00","viability_score":0,"cluster_label":"Optimization Algorithms","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Optimizing parameters for explicit unit-distance lower-bound certificates to improve theoretical bounds on point configurations.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03398v1","title":"Causal Evidence of Stack Representations in Modeling Counter Languages Using Transformers","abstract":"Formal languages have proven to be effective conduits to understand the inner mechanisms of transformers. Past work has shown that transformers trained on next token prediction over counter languages learn representations consistent with an underlying stack structure. Beyond representational analysis, this paper investigates the causal role of these representations. Linear probes are trained to predict the stack depth at each token from the model's hidden states, and a principal representation direction is extracted from the probe. 
Ablation of this direction from the model causes sequential accuracy to collapse to near 0%, providing strong empirical evidence that the stack representation is not just learned, but is causally necessary for model performance.","published_date":"2026-06-02T09:39:40+00:00","viability_score":0,"cluster_label":"LLM Interpretability","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"Demonstrating the causal necessity of stack representations for transformer performance on counter languages.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03391v1","title":"When Model Merging Breaks Routing: Training-Free Calibration for MoE","abstract":"Model merging has emerged as a cost-effective approach for consolidating the capabilities of multiple LLMs without retraining. However, existing merging techniques, largely based on linear parameter arithmetic or optimization, struggle when applied to Mixture-of-Experts (MoE) architectures. We identify a critical failure mode in MoE merging, termed routing breakdown, in which the merged router fails to dispatch tokens to suitable experts. Routing breakdown stems from the sensitivity of the non-linear softmax and discrete Top-k routing mechanisms to parameter perturbations from merging, a sensitivity further amplified by load-balancing constraints imposed during MoE pretraining. Because fine-tuned experts exhibit distinct specializations, even modest misrouting can cause severe performance degradation. To address this issue, we propose Hessian-Aware Router Calibration (HARC), a training-free framework that leverages second-order curvature information to realign the merged router. This approach admits a closed-form solution that can be efficiently solved using a matrix-free conjugate gradient method. 
Experiments on mathematical reasoning and code generation tasks show that HARC effectively mitigates routing breakdown across diverse MoE merging baselines and leads to substantial performance improvements. Our code is available at https://github.com/huangcb01/HARC.","published_date":"2026-06-02T09:33:33+00:00","viability_score":6,"cluster_label":"LLM Merging","has_code":true,"repo_url":"https://github.com/huangcb01/HARC","commercial_flags":["has_code"],"one_liner":"A training-free framework to calibrate Mixture-of-Experts routers after model merging, preventing performance degradation.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2606.03385v1","title":"Grasp-Then-Plan with Failure Attribution: A Closed Two-Stage Framework for Precise and Generalizable Robotic Manipulation","abstract":"In robotic manipulation, the tight coupling between grasping and motion planning often obscures the true source of failure, leading to inefficient trial-and-error. To enable efficient long-horizon manipulation, we propose GTP-FA (Grasp-Then-Plan with Failure Attribution), a task-oriented two-stage grasp-then-plan framework that generates grasp candidates and performs downstream motion planning conditioned on the selected grasp. Given a failed manipulation trajectory, we learn a failure attribution model that generalizes to unseen grasps and produces a stable distribution over failure modes for diagnosis-guided optimization. Based on these attribution results, we then optimize both modules in a diagnosis-driven manner: on the grasping side, we inject task-level priors and risk penalties into grasp candidate scoring and optimization to suppress unstable or task-incompatible grasps; on the planning side, we target high-risk initial states through data collection and fine-tuning to address genuine planning bottlenecks. 
We evaluate the proposed framework in both simulation and real-robot experiments, and show that GTP-FA improves the corresponding base learners across RL, IL, diffusion-policy, and VLA-based settings, achieving substantially higher overall task success rates.","published_date":"2026-06-02T09:29:03+00:00","viability_score":7,"cluster_label":"Robotic Manipulation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A two-stage robotic manipulation framework that attributes failures to optimize grasping and motion planning for improved success rates.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03382v1","title":"Local Guidance, Global Impact: Gaussian-Reshaped Trust Region Unlocks Behavior Transitions","abstract":"While Proximal Policy Optimization (PPO) demonstrates strong performance in stationary settings, we show that its standard optimization paradigm struggles in continual and non-stationary environments. The failure does not stem from insufficient model capacity or overly restrictive clipping. Instead, PPO performs persistent, directionally inefficient local updates, which indicates a lack of geometry-aware guidance for accumulating meaningful behavioral change and ultimately hindering transitions toward new behavior patterns. Although divergence-based regularization introduces partial geometric awareness, its monotonically increasing penalties implicitly discourage large policy deviations, even when such shifts are necessary for effective adaptation. To address this limitation, we propose Gaussian Trust Region Policy Optimization (GTR), which reshapes the trust region using a Gaussian kernel. The resulting constraint is bounded and non-monotonic, providing strong local stability while progressively relaxing under sustained high-advantage updates. 
To further improve robustness, we introduce a Mixture Gaussian Anchor that adapts to recent policy trajectories, reducing variance induced by stale references. GTR is architecture-agnostic and achieves strong performance across games, simulated robotic control, open-world exploration, and language model post-training. These results demonstrate that geometry-aware trust-region design can be a promising direction for robust reinforcement learning in complex non-stationary environments. Our code is available at https://anonymous.4open.science/r/GTR_demo/README.md.","published_date":"2026-06-02T09:26:26+00:00","viability_score":8,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A geometry-aware reinforcement learning algorithm that enables robust behavior transitions in non-stationary environments.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03381v1","title":"AI Model Extraction Attacks: Bypassing Single-Client Assumptions in Defenses","abstract":"Ensuring the protection of Artificial Intelligence (AI) models deployed in military Command and Control (C2) systems and critical infrastructure is essential for maintaining information superiority. Model Extraction Attacks (MEAs) pose a significant threat, as they enable adversaries to replicate proprietary models, compromise protected information, and prepare offline adversarial attacks. However, current defense strategies predominantly rely on the Single Client Assumption (SCA), which is the implicit assumption that attacks originate from isolated identities. This work systematically demonstrates that the SCA is fundamentally invalid in the presence of coordinated threat actors, such as Advanced Persistent Threats (APTs). We introduce a modular, open-source framework called CerberusAI for reproducible model-stealing research, and use it to simulate distributed attack scenarios. 
Our empirical evaluation shows that well-established defense mechanisms, such as Protecting Against Deep Neural Network Model Stealing Attacks (PRADA), can be bypassed by basic round-robin query distribution strategies, resulting in a significant reduction in detection performance. Furthermore, we demonstrate that even global aggregation approaches can be rendered operationally useless through adaptive traffic mixing. These results highlight the need for a paradigm shift towards stateful, identity-independent defense architectures in the field of model extraction attacks. This paper was originally presented at the International Conference on Military Communication and Information Systems (ICMCIS), organized by the Information Systems Technology (IST) Scientific and Technical Committee, IST-224-RSY - the ICMCIS, held in Bath, United Kingdom, 12-13 May 2026 and won the best paper award.","published_date":"2026-06-02T09:25:29+00:00","viability_score":7,"cluster_label":"AI Security","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for simulating distributed AI model extraction attacks, demonstrating the failure of current defenses and advocating for stateful, identity-independent architectures.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03376v1","title":"P\\textsuperscript{2}-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization","abstract":"Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from the corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm has yet to specifically target the perceptual bottleneck in attended regions or address insufficient Visual Robustness against image degradation. 
Furthermore, existing preference pairs are often vision-agnostic and their inherently off-policy nature limits their effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P\\textsuperscript{2}-DPO), a novel training paradigm in which the model generates and learns from its own preference pairs, thereby directly addressing the identified visual bottlenecks while inherently avoiding the issues of vision-agnostic and off-policy data. It introduces: (1) an on-policy preference pairs construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss to precisely align visual signals with the causal generation of text. Experimental results demonstrate that with a comparable amount of training data and cost, P\\textsuperscript{2}-DPO outperforms strong baselines that rely on costly human feedback on benchmarks. Furthermore, evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate the effectiveness of P\\textsuperscript{2}-DPO in addressing perceptual bottleneck in attended regions and improving Visual Robustness against degraded inputs.","published_date":"2026-06-02T09:22:53+00:00","viability_score":8,"cluster_label":"Vision-Language Models","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"A training paradigm that grounds hallucination in vision-language models by optimizing perceptual processing and visual robustness through self-generated preference pairs.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03357v1","title":"The Unsampled Truth: Psychometrics in SLMs Measure Prompt Artifacts, Not Psychological Constructs","abstract":"When prompting SLMs for psychometric assessments, researchers assume the outputs reflect semantic reasoning. 
We evaluate this premise across 13 open-weights models (0.6B to 14B parameters) using a prompt variation framework that separates semantic signals from prompt artifacts. By systematically varying personas, instructions, items, and option symbols, we find that artifactual variance frequently overpowers the semantic signal. In these cases, models predominantly reflect prompt compliance rather than simulated psychological traits. While these findings limit SLM utility in psychometrics, our framework provides a diagnostic tool to identify destructive artifacts and isolate semantic understanding for future frontier-model research.","published_date":"2026-06-02T09:05:25+00:00","viability_score":3,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research introduces a framework to diagnose prompt artifacts in SLMs, separating them from semantic understanding for future model development.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03347v1","title":"AugMask: Training Diffusion Models on Incomplete Tabular Data via Stochastic Augmentation and Masking","abstract":"Score-based diffusion models have emerged as prominent deep generative models; however, their application to tabular data remains challenging because their backbones assume fully specified inputs, whereas real-world tabular data often contain missing values. We propose AugMask, a plug-and-play training framework that adapts missing-unaware backbones to incomplete data by separating conditioning from supervision. AugMask 1) constructs numeric inputs via conditional stochastic augmentation using lightweight auxiliary models, and 2) applies denoising supervision only to observed coordinates. In effect, augmented missing entries serve as uncertain conditioning context rather than training targets. 
We connect this training rule to a Rao--Blackwellized objective and show that marginalizing missing entries yields a variance-weighted sensitivity penalty, discouraging over-reliance on uncertain completions. Across diverse datasets and missingness regimes, AugMask enables standard diffusion-based tabular generators to outperform specialized missing-aware baselines.","published_date":"2026-06-02T08:57:38+00:00","viability_score":7,"cluster_label":"Generative Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AugMask is a plug-and-play framework that enables diffusion models to effectively generate tabular data with missing values.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03348v1","title":"SynCred-Bench: Benchmarking Synthetic Credibility in AI-Generated Visual Misinformation","abstract":"Recent generative models can now produce visual artifacts with realistic embedded text and layouts, creating a new misinformation threat: synthetic credibility. We introduce SYNCRED-Bench, a benchmark of 600 AI-generated misinformation images balanced across six credible-form categories and seven fine-grained circulation styles, together with FP450, a real-image negative set for measuring false positives. Extensive evaluation shows that existing systems remain unreliable: under a 5% false-positive-rate constraint, 15 MLLMs achieve only 10.5% true positive rate (TPR), open-source AIGC detectors achieve less than 5%, and commercial APIs reach 57.6%. Human annotators also struggled to identify synthetic credibility, reaching only 63% TPR. 
These findings establish synthetic credibility as a severe and underexplored visual misinformation challenge, and provide a benchmark for developing detectors that reason beyond superficial credibility cues.","published_date":"2026-06-02T08:57:38+00:00","viability_score":7,"cluster_label":"AI Safety & Security","has_code":true,"repo_url":"https://github.com/thu-coai/Syncred-Bench","commercial_flags":["has_code"],"one_liner":"SYNCRED-Bench is a new benchmark and evaluation of AI-generated visual misinformation, highlighting the unreliability of current detection systems.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03331v1","title":"Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions","abstract":"Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. 
Among the evaluated models, GPT-5.4 performs best overall.","published_date":"2026-06-02T08:40:47+00:00","viability_score":6,"cluster_label":"LLM Applications","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper evaluates LLMs on real-world consumer device repair questions, revealing their unreliability for high-risk tasks and safety-critical decisions.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2606.03330v1","title":"FLIPS: Instance-Fingerprinting for LLMs via Pseudo-random Sequences","abstract":"Literature reveals that a Large Language Model's (LLM) behavior is conditioned not only by its original weights but also by its instance-level parameters, such as instructional prompt, sampling configuration or quantization. A model that generates safe outputs under one configuration may produce toxic content under another. However, current LLM identification techniques (such as fingerprinting) focus on intellectual property protection, and their design favors robustness to changes in these instance-level parameters. This poses a critical challenge for AI regulation in which compliance assessments target actual deployed behaviors, not model provenance. In this paper, we introduce instance-level fingerprinting, a regulator-oriented paradigm that distinguishes configurations of the same LLM. Our method, FLIPS, exploits biases in generated binary random sequences to reach 96% (closed-set) and 90% (open-set, where some targets are unknown) identification accuracy across 237 model instances, versus 35% for the adapted LLMmap baseline. This shows that instance-level fingerprinting is both necessary for regulation and practically feasible. 
Code available at https://github.com/GurvanR/FLIPS-LLM-Instance-Fingerprinting.","published_date":"2026-06-02T08:39:50+00:00","viability_score":7,"cluster_label":"LLM Regulation","has_code":true,"repo_url":"https://github.com/GurvanR/FLIPS-LLM-Instance-Fingerprinting","commercial_flags":["has_code"],"one_liner":"A novel method for instance-level LLM fingerprinting to enable regulatory compliance assessments by distinguishing configurations of the same model.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03329v1","title":"InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain","abstract":"Long-context tasks require LLMs to identify and preserve answer-relevant information from large contexts. Chunk-wise memory agents address this issue by sequentially reading document chunks, updating a compact memory, and generating the final answer from the accumulated memory. However, existing RL-based chunk-wise agents either rely on sparse final-answer rewards or use lexical intermediate rewards for memory and retrieval actions. These signals supervise task success or local overlap, but do not directly evaluate whether the final memory supports the ground-truth answer. We propose InfoMem, a reward mechanism for training chunk-wise memory agents that evaluates final-memory utility using answer-conditioned information. InfoMem measures how much the final memory increases the model's per-token log-likelihood of the ground-truth answer. To stabilize RL optimization, InfoMem applies this signal only to successful trajectories and normalizes it before reward composition. Under the same GRPO framework and training budget, InfoMem improves long-context memory-agent performance over comparable memory-agent RL baselines. Analyses show that effective final-memory rewards should operate on successful trajectories, be normalized before reward composition, and be conditioned on the answer rather than the query. 
Our code is available at https://github.com/GenSouKa1/InfoMem.","published_date":"2026-06-02T08:39:03+00:00","viability_score":7,"cluster_label":"Long-Context Agents","has_code":true,"repo_url":"https://github.com/GenSouKa1/InfoMem","commercial_flags":["has_code"],"one_liner":"A new reward mechanism for training long-context memory agents that uses answer-conditioned information gain to improve performance on complex tasks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03328v1","title":"Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning","abstract":"Post-training pruning compresses large language models to high sparsity using a small unlabelled calibration set, and recent work has concluded that the choice of calibration source has only modest impact on averaged post-pruning accuracy. We ask whether this conclusion survives once calibration impact is evaluated separately across distinct capability dimensions rather than aggregated. Decomposing post-pruning capability into General, Commonsense, Code, and Math, and analysing $n{=}15$ calibration sources via Spearman correlations between OIT information metrics and per-dimension retention, we uncover an opposite-sign trade-off: calibration perplexity correlates positively with General retention ($\u03c1{=}{+}0.71$) but negatively with Math and Code retention ($\u03c1{=}{-}0.53,\\,{-}0.59$; $p{<}0.05$), so no single source can preserve all capabilities. We respond with multi-source calibration mixing, and propose IGSP, an information-guided self-calibration protocol that automates multi-source construction without capability-aligned corpora by minimising 4-gram aggregation and balancing perplexity across dimensions. 
On LLaMA-3.1-8B at SparseGPT 60% sparsity, a uniform multi-source mix reaches 58.8% total retention, outperforming the best single source (MetaMath, 50.0%) by $+8.8$ and the C4 default (40.0%) by $+18.8$; IGSP improves over Self-Cal by $+2.4$ and SGS by $+4.8$.","published_date":"2026-06-02T08:38:14+00:00","viability_score":4,"cluster_label":"LLM Pruning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigating calibration data trade-offs for LLM pruning across different capability dimensions, proposing a multi-source mixing strategy to optimize performance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03326v1","title":"The Violation Situation Pattern: A Knowledge-Graph Pattern for Compliance Violations","abstract":"Compliance pipelines detect violations as transient query results and do not keep the violation itself as a persistent graph object with review state, affected entities, or audit history. The Violation Situation Pattern (VSP) closes this gap. Building on the Situation pattern of Gangemi and Mika, VSP reifies each detected violation as a graph node with a rule identifier, a temporal validity interval, a lifecycle state, and evidence links to the entities involved. Lifecycle transitions are stored as immutable, PROV-O-aligned events, so audit history is a graph traversal. We instantiate VSP in a legal entity and contract lifecycle property graph and operationalize four deontic rules (V1 unauthorized signature, V2 expired mandate, V3 missing confidentiality clause, V4 missing breach-notification clause) through an FCL->Cypher->MERGE pipeline. We check V1 and V2 against BODACC corporate-officer publications, evaluate V4 on 73 GDPRhub enforcement decisions, and run a SHACL cross-formalism check on V3 and V4. The central finding is rule-body independence: extending V4 from clause-presence to deadline checking raises F1 from 0.312 to 0.602, while the pattern's identity, lifecycle, and evidence semantics stay the same. 
This separates a pattern contribution from a detector contribution, so detection logic can evolve without invalidating accumulated audit history.","published_date":"2026-06-02T08:33:50+00:00","viability_score":4,"cluster_label":"Compliance AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel Violation Situation Pattern for knowledge graphs to persistently track compliance violations with audit history and lifecycle states.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03323v1","title":"dstack-capsule: Pod-Level Remote Attestation for Confidential Workloads on Kubernetes","abstract":"The rise of LLM-as-a-Service and other confidential cloud workloads demands cryptographic proof that user data is processed in a trusted, untampered environment. Existing solutions, notably Confidential Containers (CoCo), enforce a strict \"one Pod per VM\" model that attests only the Guest OS stack, leaving container-level identity unverified and incurring prohibitive per-VM resource overhead. We present dstack-capsule, a Kubernetes platform that enables Pod-level remote attestation on Intel TDX by allowing multiple Pods to share a single Confidential VM while each retains independent, hardware-backed proof of identity. Our key insight is a two-layer attestation architecture: static platform measurements are frozen in RTMR[3] via an irreversible privilege fuse, while dynamic Pod identities (pod_uid, pod_spec_hash, workload_id) are embedded in the TDX Quote's report_data field and signed by hardware on every request. dstack-capsule introduces (1) a Pod-level attestation protocol binding Pod spec digests to hardware-signed Quotes; (2) a privilege fuse mechanism that atomically transitions a node from setup mode to secure mode; (3) a multi-layer sandbox spanning storage, runtime, admission, API, and network isolation layers; and (4) a complete open-source implementation based on Kubernetes 1.32, Intel TDX, and Sysbox. 
We evaluate the security properties, attestation correctness, and performance characteristics of dstack-capsule, demonstrating that it achieves Pod-granularity verification without the resource overhead of per-VM isolation.","published_date":"2026-06-02T08:33:16+00:00","viability_score":7,"cluster_label":"Confidential Computing","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"dstack-capsule enables secure, multi-pod confidential computing on Kubernetes with hardware-backed attestation, reducing resource overhead for LLM-as-a-Service and other sensitive workloads.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03322v1","title":"Multi-Modal Graph Neural Network with Transformer-Guided Adaptive Diffusion for Preclinical Alzheimer Classification","abstract":"The graphical representation of the brain offers critical insights into diagnosing and prognosing neurodegenerative disease via relationships between regions of interest (ROIs). Despite recent emergence of various Graph Neural Networks (GNNs) to effectively capture the relational information, there remain inherent limitations in interpreting the brain networks. Specifically, convolutional approaches ineffectively aggregate information from distant neighborhoods, while attention-based methods exhibit deficiencies in capturing node-centric information, particularly in retaining critical characteristics from pivotal nodes. These shortcomings reveal challenges for identifying disease-specific variation from diverse features from different modalities. In this regard, we propose an integrated framework guiding diffusion process at each node by a downstream transformer where both short- and long-range properties of graphs are aggregated via diffusion-kernel and multi-head attention respectively. We demonstrate the superiority of our model by improving performance of pre-clinical Alzheimer's disease (AD) classification with various modalities. 
Also, our model adeptly identifies key ROIs that are closely associated with the preclinical stages of AD, marking a significant potential for early diagnosis and prevision of the disease.","published_date":"2026-06-02T08:32:29+00:00","viability_score":5,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multi-modal graph neural network with transformer-guided adaptive diffusion for improved preclinical Alzheimer's classification by capturing complex brain network relationships.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.03312v1","title":"RobotValues: Evaluating Household Robots When Human Values Conflict","abstract":"While household robots are often evaluated based on task completion, everyday domestic environments involve value-conflicting situations in which robots are expected to choose actions that prioritize other values than task success, such as human autonomy, efficiency, or social appropriateness. Yet, there are no benchmarks for evaluating robots' value preferences in such scenarios. We introduce RobotValues, a benchmark to evaluate household robot planners in 10K value-conflict scenarios. Each instance consists of a realistic household image with multiple plausible robot actions that prioritize different human values. We construct RobotValues through LLM-assisted scenario generation, stakeholder-grounded value extraction, image generation and automatic quality control. Using RobotValues, we evaluate VLMs used in robotics and find that models exhibit default value preferences, including safety and accommodation, while underselecting privacy-prioritizing actions. When the models are instructed to prioritize specific values that conflict with their own preferences, they often fail to override their default actions, choosing incorrect actions 80% of the time. 
These findings suggest that household robot evaluation should measure not only task completion or safety compliance, but also whether robots can choose among plausible actions when human values conflict.","published_date":"2026-06-02T08:25:01+00:00","viability_score":7,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"RobotValues is a benchmark for evaluating household robots in value-conflicting scenarios, revealing biases in current VLMs and highlighting the need for prioritizing human values beyond task completion.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03310v1","title":"Learning Multi-Scale Hypergraph for High-Order Brain Connectivity Analysis","abstract":"Understanding complex interactions between brain regions is critical for early neurodegenerative disease classification such as Alzheimer's Disease (AD) and Parkinson's Disease (PD). While graph-based models are widely used to analyze brain networks, most existing approaches primarily focus on pairwise interactions between directly connected nodes, limiting their ability to capture higher-order dependencies across multiple regions. Although hypergraph-based methods have been proposed to model higher-order relations, many rely on predefined hyperedges or restrict learning to hyperedge weights, reducing flexibility and limiting their capacity to capture multi-resolution structural patterns. In this regard, we introduce an adaptive multi-scale hyperedge learning framework, i.e., MuHL, which constructs hierarchical node features and dynamically learns high-order interactions through continuous hyperedge construction over multi-resolution graph signals. 
Extensive experiments on multiple brain network benchmarks demonstrate that MuHL consistently improves disease classification performance across different stages, and further identifies key regions of interest (ROIs) and their group-wise interactions from the learned hyperedges that are associated with disease progression, highlighting its potential as a powerful tool for brain network analysis in neurodegenerative disorders.","published_date":"2026-06-02T08:24:17+00:00","viability_score":5,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MuHL is an adaptive multi-scale hypergraph learning framework for analyzing high-order brain connectivity, improving disease classification and identifying key regions for neurodegenerative disorders.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.03307v1","title":"Generalizing Graph Foundation Models via Hyperbolic Retrieval-Augmented Generation","abstract":"Graph foundation models (GFMs) have emerged as a dominant paradigm in graph representation learning by leveraging large-scale pre-training for cross-domain inference. However, the parameterized knowledge encoded within these models is insufficient to cope with distribution shifts, limiting their generalization ability. To mitigate this issue, retrieval-augmented generation (RAG) has been introduced to incorporate external knowledge at inference time. Nevertheless, existing RAG frameworks operating in Euclidean space suffer from a fundamental geometric limitation: the polynomial volume growth of Euclidean space is inherently mismatched with the tree-structured external knowledge bases. This mismatch leads to the loss of semantic granularity in retrieval and gives rise to the hubness phenomenon. To address this limitation, we propose a Hyperbolic Retrieval-Augmented Generation (HyRAG) framework designed to enhance the generalization capabilities of GFMs. 
Specifically, the introduced Hyperbolic Knowledge Indexing module retains the tree-like hierarchies of the external knowledge base by modeling them within hyperbolic space. The Multi-granularity Retrieval module then provides GFMs with the global semantic anchors and local semantic nuances through coarse-grained and fine-grained knowledge retrieval, respectively. Finally, the Dual-path Fusion module achieves effective knowledge integration for graph tasks at both the feature and structural levels. Experiments on multiple graph benchmarks demonstrate significant improvements in the zero-shot setting, highlighting the generalization of our method for robust GFM inference.","published_date":"2026-06-02T08:21:57+00:00","viability_score":7,"cluster_label":"Graph Foundation Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel framework enhances graph foundation models' generalization by incorporating external knowledge in hyperbolic space, outperforming existing methods in zero-shot settings.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03305v1","title":"The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection","abstract":"Benchmark contamination, where evaluation examples appear in a model's training data, threatens the validity of LLM assessment. Statistical tools for detecting training-data membership exist, but have been validated almost exclusively in controlled academic regimes: large, homogeneous pre-training corpora and transparent, single-stage training pipelines. Whether these methods remain reliable in realistic auditing scenarios remains unclear. We identify two under-studied failure modes: distribution shift, which arises when suspect and validation sets violate the IID assumption, and scale constraints, which arise because benchmarks are orders of magnitude smaller than pre-training corpora. 
We systematically evaluate three leading paradigms (LLM Dataset Inference, Post-Hoc Dataset Inference, and CoDeC) across 27 models from multiple families (including Pythia, OLMo 2, and specialised cultural and medical LLMs) and scales (up to 27B). We then further extend our analysis to frontier industry models. Across 335 evaluations, only 199 yield correct outcomes. LLM Dataset Inference results in false positives under distribution shift, Post-Hoc Dataset Inference is underpowered at benchmark scale, and CoDeC provides only coarse provenance signals that are insufficient to verify individual benchmark splits. Our results reveal a systematic reliability gap between controlled validation and practical benchmark auditing, and show that statistical detection cannot yet replace transparent data provenance. We open-source our benchmark for further research.","published_date":"2026-06-02T08:21:22+00:00","viability_score":3,"cluster_label":"LLM Auditing","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper identifies critical failure modes in LLM benchmark auditing, showing that existing contamination detection methods are unreliable due to distribution shifts and scale differences.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03303v1","title":"LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks","abstract":"Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. We present LEAP, an agentic framework that enables general-purpose foundation models to achieve state-of-the-art performance on automated formal theorem proving. LEAP leverages foundation model capabilities, such as informal reasoning, instruction following, and iterative self-refinement. 
By decomposing complex problems into smaller units, the system bridges formal proof construction with informal blueprints through continuous interaction with the Lean compiler. To provide a rigorous evaluation beyond increasingly saturated benchmarks, we introduce Lean-IMO-Bench, a benchmark of IMO-style problems formalized in Lean, with short statements yet highly non-routine and multi-step proofs across a wide range of difficulty levels. Empirically, on the latest 2025 Putnam Competition, an annual mathematics competition for undergraduate students in North America, LEAP solves all 12 problems, matching recent breakthroughs by frontier formal mathematical models. On Lean-IMO-Bench, LEAP boosts the one-shot formal solve rate of general-purpose LLMs from below 10% to 70%, notably surpassing the 48% benchmark set by a specialized, gold-medal-caliber IMO system. Furthermore, we demonstrate LEAP's research-level utility by autonomously formalizing complex proofs for open combinatorial challenges, including a verified proof for a key subproblem in Knuth's Hamiltonian decomposition of even-order Cayley graphs.","published_date":"2026-06-02T08:16:42+00:00","viability_score":8,"cluster_label":"Formal Mathematics LLMs","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LEAP is an agentic framework that supercharges general LLMs to achieve state-of-the-art performance in formal mathematics theorem proving, solving all problems on the 2025 Putnam Competition.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03290v1","title":"Message Tuning Outshines Graph Prompt Tuning: A Prismatic Space Perspective","abstract":"Graph Foundation Models (GFMs), built upon the Pre-training and Adaptation paradigm, have emerged as a research hotspot in graph learning. For GNN-based GFMs, graph prompt tuning has become the prevailing adaptation method for downstream tasks. 
Although recent methods explain why graph prompt tuning works, how to rigorously measure its adaptation capacity remains an open problem. Addressing this problem is critical for understanding the capability limits of graph prompt tuning and for developing more powerful adaptation methods. In this paper, we propose Prismatic Space Theory (PS-Theory), a novel mathematical framework to quantify the capacity of adaptation methods, while focusing on establishing the upper bound for the adaptation capacity of graph prompt tuning. Building upon the proposed PS-Theory, we further introduce Message Tuning for GFMs (MTG), a lightweight approach that injects a small set of learnable message prototypes into each layer of the GNN backbone to adaptively guide message fusion without updating pre-trained weights. Through our PS-Theory, we prove that the adaptation capacity of MTG can exceed the theoretical upper bound of graph prompt tuning. Extensive experiments demonstrate that MTG consistently outperforms graph prompt baselines across diverse benchmark datasets, providing strong empirical support for our theoretical findings.","published_date":"2026-06-02T07:52:54+00:00","viability_score":7,"cluster_label":"Graph Foundation Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Message Tuning for GFMs (MTG) is a novel adaptation method that outperforms graph prompt tuning by injecting learnable message prototypes, exceeding theoretical capacity bounds.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03288v1","title":"AI-Generated Traces for Novice Programmers: Learning Effects and Learner Differences in a Multi-Institutional Study","abstract":"Introductory programming (CS1) courses often struggle to support students' understanding of program execution. 
While visualizations can make execution processes explicit, their effectiveness depends on design and context, and empirical evidence for AI-generated visualizations remains limited. We propose Generated Animated Traces (GATs), AI-generated, analogy-based, narrated animations that coordinate source code, execution state, and conceptual analogies. We conduct a study at two institutions in CS1 courses (Python, N=961; Java N=151) comparing GATs to textual explanations. We measure immediate learning performance and experience, end-of-course engagement and exam performance. Results show that GATs can yield selective benefits for immediate learning, but benefits are context-dependent and short-term. We observe that GATs' influence on performance is moderated by learner engagement profiles. This finding underscores the importance of personalized approaches.","published_date":"2026-06-02T07:51:27+00:00","viability_score":2,"cluster_label":"Educational AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"AI-generated narrated animations for introductory programming courses show selective, short-term learning benefits for novice programmers.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03280v1","title":"A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting","abstract":"Recent work shows that language models can transmit behavioural traits through hidden signals in generated data during training. We ask whether a more direct and stricter channel is also viable: can one language model communicate useful intermediate reasoning state to another at inference time by translating and injecting hidden activations, rather than by passing natural-language text? We test this question in a controlled Pythia-160M to Pythia-410M multi-hop reasoning setting. A linear translation layer learns a strong normalized-space map between sender and receiver hidden states, with normalized cosine similarity near 0.97 across seeds. 
However, when the translated activations are injected into the receiver at inference time, they do not improve downstream answering. Low-strength additive injection remains near the no-injection baseline, with confidence intervals that cross zero. Replacement-style injection is consistently destructive, and rescaling translated vectors to the receiver hidden-state norm does not rescue performance. The result is therefore a scoped negative result: in this setting, offline representational alignment is not sufficient for useful causal communication inside the receiver.","published_date":"2026-06-02T07:46:01+00:00","viability_score":1,"cluster_label":"LLM Internals","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Directly transferring intermediate reasoning states between language models does not improve downstream performance, even with strong representational alignment.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03273v1","title":"VistaHop: Benchmarking Multi-hop Visual Reasoning for Visual DeepSearch","abstract":"Visual DeepSearch requires multimodal large reasoning model (MLRM) agents to answer complex visual queries by repeatedly inspecting image regions, grounding intermediate reasoning in visual evidence, and connecting fine-grained clues across long reasoning chains. However, existing benchmarks mainly focus on single-step visual understanding or static image-question answering, offering limited evaluation of iterative image inspection, visual-anchor grounding, and multi-hop evidence integration. In this work, we introduce VistaHop, a benchmark for evaluating vision-centric search and multi-hop visual reasoning in Visual DeepSearch. VistaHop contains 300 high-resolution images, 25 visual search scenarios, and 350 multi-hop QA tasks that require models to follow evidence chains from visual anchors or fuse information across multiple image-grounded reasoning paths. 
We further develop VistaArena, a unified evaluation environment that supports tool-augmented reasoning with text search, image search, image cropping, and evidence-based answer validation. Experiments on seven representative MLRMs show that current models remain far from solving VistaHop: the best model, SenseNova-MARS-32B, achieves only 24.31% Pass@1. These results reveal persistent limitations in visual grounding, evidence revisiting, long-chain reasoning, and multi-anchor information fusion, highlighting the need for stronger benchmarks and training methods for Visual DeepSearch.","published_date":"2026-06-02T07:37:23+00:00","viability_score":7,"cluster_label":"Visual Reasoning","has_code":true,"repo_url":"https://github.com/ahang0712/vistahop","commercial_flags":["has_code"],"one_liner":"VistaHop is a new benchmark and evaluation environment for multi-hop visual reasoning in Visual DeepSearch, revealing significant limitations in current multimodal models.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03270v1","title":"Are Common Substructures Transferable? Riemannian Graph Foundation Model with Neural Vector Bundles","abstract":"Foundation models have sparked a revolution via a pretraining-adaptation paradigm, with recent efforts extending this success to graphs. Unlike other modalities, graphs contain rich structural patterns, yet their structural transferability remains poorly understood. Prior studies consider common substructures in the discrete realm, and we are motivated by a fundamental question: Are common substructures transferable? The underlying theory is largely underexplored. In this work, we shift toward learning transferable structures through the lens of functional behavior. Theoretically, we connect transferable substructures to intrinsic geometry of the representation space. However, characterizing such intrinsic geometry has rarely been touched. 
Grounded in Riemannian geometry, we develop a graph intrinsic geometry learning framework called Neural Vector Bundle, which enables parsing intrinsic geometry with local coordinates. Building on this, we design GAUGE, a pretrainable neural architecture that constructs the vector bundle, flattening geometrically compatible local coordinates, and a new Dirichlet loss, which also measures the transfer effort. We empirically validate its superior expressiveness in challenging tasks including zero-shot link prediction and graph isomorphism.","published_date":"2026-06-02T07:35:42+00:00","viability_score":2,"cluster_label":"Graph Foundation Models","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A new framework using Riemannian geometry and neural vector bundles is proposed to learn transferable substructures in graph foundation models.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03269v1","title":"Distilling Answer-Set Programming Rules from LLMs for Neurosymbolic Visual Question Answering","abstract":"Visual Question Answering (VQA) is the task of answering questions about images, requiring the integration of multimodal input and reasoning. Modular approaches that incorporate logic-based representations into the reasoning component offer clear advantages over end-to-end trained systems, particularly in terms of interpretability. However, adapting or extending these representations when task requirements change can place a significant burden on developers. To address this challenge, we present an approach for distilling rules from Large Language Models (LLMs). Our method prompts an LLM to extend an initial VQA reasoning theory, expressed as an answer-set program, to meet new requirements of the task. Examples from VQA datasets guide the LLM, validate the results, and help correct erroneous rules by leveraging feedback from the ASP solver. We demonstrate that our approach is effective across diverse VQA datasets. 
Notably, only a few examples are needed to elicit correct rules from LLMs. Our experiments suggest that rule distillation from LLMs is a promising alternative to traditional data-driven rule learning approaches. Under consideration in Theory and Practice of Logic Programming (TPLP).","published_date":"2026-06-02T07:35:31+00:00","viability_score":5,"cluster_label":"Neurosymbolic AI","has_code":true,"repo_url":"https://github.com/pudumagico/KDASP","commercial_flags":["has_code"],"one_liner":"Distill interpretable logic rules from LLMs to enhance visual question answering systems, reducing developer burden and improving adaptability.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2606.03260v1","title":"EqGINO: Equivariant Geometry-Informed Fourier Neural Operators for 3D PDEs","abstract":"Deep learning surrogates for 3D Partial Differential Equations (PDEs) often fail to generalize across geometric transformations because they depend heavily on specific coordinate systems. While equivariant networks offer a solution, they typically rely on local operations in the spatial domain, making the global receptive field, which is essential for PDE dynamics, computationally expensive. Conversely, Fourier Neural Operators (FNOs) efficiently capture global interactions, yet establishing 3D equivariance within them remains impractical due to the prohibitive cost of spectral group convolutions. To bridge this gap, we introduce EqGINO, a geometrically robust framework that enforces isotropy in the spectral domain. By design, EqGINO guarantees exact equivariance to the discrete symmetries inherent to the discretized computational domain. Beyond this discrete guarantee, our structural prior enables effective generalization to arbitrary continuous orientations even with a limited number of SE(3)-transformed training samples. Consequently, our method robustly models coordinate-invariant physical laws on complex irregular 3D geometries. 
Our code is available at https://github.com/sung-won-kim/EqGINO","published_date":"2026-06-02T07:23:06+00:00","viability_score":7,"cluster_label":"Scientific ML","has_code":true,"repo_url":"https://github.com/sung-won-kim/EqGINO","commercial_flags":["has_code"],"one_liner":"Develop a geometrically robust framework for 3D PDE simulations using equivariant Fourier Neural Operators, enabling accurate generalization across complex geometries.","time_to_mvp":"3-6 months","tags":["high_potential"]},{"arxiv_id":"2606.03257v1","title":"PSViT: A Methodology for Structurally Pruning Spiking Vision Transformers","abstract":"Spiking Vision Transformer (SViT) models are promising low-power ViT models for solving vision-based tasks with state-of-the-art performance. However, their large sizes limit their deployment on resource-constrained embedded platforms, underscoring the need for model compression. One prominent compression technique is pruning, and state-of-the-art works employ unstructured pruning techniques to compress SViT models. Such techniques require specialized hardware architectures tailored for the sparsity patterns to maximize their efficiency benefits, making this approach not scalable. To address this, we propose PSViT, a novel methodology to perform structured pruning on SViT models, hence making it possible to efficiently accelerate their inference using existing and widely-used computing architectures. To do this, PSViT employs several key steps: uniform channel-wise filter pruning to structurally eliminate the non-significant weights, sensitivity analysis to evaluate the impact of channel-wise pruning of individual layers on accuracy and network size, as well as fine-grained channel-wise pruning based on the sensitivity analysis and the given network architecture. 
Experimental results show that PSViT effectively obtains 22.4% memory saving through single-shot pruning, while maintaining high accuracy within 3% (70.3% without fine-tuning and 72.8% with fine-tuning) of the original non-pruned SViT model (73.3%) on ImageNet-1K. These results also show that the PSViT methodology advances the effort in enabling efficient SViT deployments on resource-constrained applications.","published_date":"2026-06-02T07:18:57+00:00","viability_score":4,"cluster_label":"Model Compression","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A structured pruning methodology for Spiking Vision Transformers to enable efficient deployment on resource-constrained embedded platforms.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03252v1","title":"AirDreamer: Generalist Drone Navigation with World Models","abstract":"Navigating a drone in unseen and cluttered environments requires reliable generalization to unseen scene layouts and understanding of environmental structure relative to the robot's capabilities. Previous methods, which assume the same environment configuration, often rely heavily on human-designed perception pipelines and predefined rules to guide the robot toward the target. This process is environment-dependent and generalizes poorly across environments. Inspired by animal navigation behavior, we design a navigation framework that navigates with a reinforcement-learning-based policy on top of a world-model-based environment understanding to overcome these issues. In addition, a sparse reward function without hand-crafted shaping terms is designed to avoid local minima traps and encourage yaw control behaviors. In simulation and on real drones, our method exhibits emergent capabilities for navigating complex, unseen environments and escaping local optima where other methods fail. In challenging maps, it achieves a 5.3% higher navigation success rate than the best baseline. 
Furthermore, the proposed framework achieves effective sim-to-real transfer without any tuning during deployment. The code will be publicly available.","published_date":"2026-06-02T07:15:13+00:00","viability_score":6,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Generalist drone navigation using a world-model and reinforcement learning to achieve robust generalization in unseen and complex environments.","time_to_mvp":"3-6 months","tags":["high_potential"]},{"arxiv_id":"2606.03251v1","title":"Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection","abstract":"In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. 
Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.","published_date":"2026-06-02T07:12:30+00:00","viability_score":4,"cluster_label":"Causal Inference","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research explores leveraging natural experiments within real-world datasets to improve model performance through causal inference, with code available.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03238v1","title":"When RLHF Fails: A Mechanistic Taxonomy of Reward Hacking, Collapse, and Evaluator Gaming","abstract":"Reinforcement learning from human feedback (RLHF) makes large-scale post-training possible by replacing an underspecified human objective with learned and scalable proxies. The same substitution creates a structured failure surface: optimization can raise the learned reward while external quality falls, degrade both proxy and judge scores, reveal proxy under-alignment, or produce evaluator-specific disagreement. We present an empirical failure-mode study of a compact RLHF pipeline with proximal policy optimization (PPO), direct preference optimization (DPO), uncertainty-penalized PPO (UP-PPO), reward-model uncertainty, approximate policy drift, diversity and repetition diagnostics, and two external LLM judges. Rather than treating reward hacking as a single terminal event, we classify matched transitions between checkpoints using the directions of the learned reward, judge scores, and average judge score. Across 61 checkpoint rows and 1920 row-level transitions, aggressive PPO has the highest localized reward-hacking rate (14.45%; bootstrap 95% CI: 10.16-18.75), while UP-PPO yields lower rates in the same aggressive regime (11.33-10.94%). A pre-transition logistic model predicts future row-level reward hacking with ROC-AUC 0.821, and row-level analysis finds localized reward hacking that checkpoint averages miss in 3 of 12 settings. 
The central conclusion is methodological: RLHF failures are not only final-model pathologies, but training dynamics that can be classified, localized, and partially anticipated.","published_date":"2026-06-02T06:55:52+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This work provides a mechanistic taxonomy and empirical study of failure modes in Reinforcement Learning from Human Feedback (RLHF), with available code and a clear path to improving LLM training stability.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03237v1","title":"Solipsistic Superintelligence is Unlikely to be Cooperative","abstract":"AI's central challenge is shifting from capability to coexistence. The dominant paradigm in AI research focuses on developing powerful agents that treat the world as an exogenous and stationary source of feedback. We contend that superintelligence, an extremely capable task solver, born out of such a solipsistic approach to AI design, is unlikely to be cooperative. Deploying AI systems induces endogenous non-stationarity, resulting in a train-test-deploy gap where historical distributions diverge from the deployment context. We refer to this as the self-undermining property of unilateral optimization. Closing this gap requires AI that participates in cooperation: the equilibrium-selection process through which multiple actors navigate their interdependence. We call for a non-solipsistic research paradigm that treats this interdependence as a core design principle rather than approaching cooperation as a task to solve. 
This entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build.","published_date":"2026-06-02T06:54:55+00:00","viability_score":0,"cluster_label":"AI Safety","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper argues that a solipsistic approach to AI design leads to uncooperative superintelligence and calls for a non-solipsistic research paradigm focused on interdependence.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03236v1","title":"Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents","abstract":"Multimodal large language models (MLLMs) have substantially advanced mobile agents, yet proactive mobile assistance remains challenging because agents must decide when to intervene before determining how to assist. Existing systems often implement these two decisions within a unified MLLM-based pipeline, leading to goal misalignment between conservative intervention filtering and comprehensive assistance generation, as well as redundant inference when the agent should remain silent. To address these limitations, we propose the Pre-Reasoning Perception Framework (PRPF), a two-stage framework built on perceiving before reasoning. PRPF introduces a lightweight Multimodal Proactive Perceptor (MPP) for intervention gating and context compression, and activates the Proactive Agent Reasoner (PAR) only when intervention is warranted. 
Experiments on the ProactiveMobile benchmark show that PRPF substantially reduces false trigger rates (FTR) while improving success rates (SR) and inference efficiency over the ProactiveMobile baseline.","published_date":"2026-06-02T06:54:02+00:00","viability_score":7,"cluster_label":"Mobile Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This framework introduces a two-stage approach for mobile agents that prioritizes perception before reasoning to improve intervention gating and inference efficiency, with available code.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03232v1","title":"GFFMERGE: Efficient Merging of Graph Neural Force Fields and Beyond","abstract":"Graph Neural Networks (GNNs) have revolutionized Neural Force Fields for atomistic simulations, achieving near-quantum accuracy at reduced cost, yet adapting these models to new chemical systems requires expensive retraining of foundation models. Inspired by model merging in vision and language processing, we introduce GFFMERGE, the first principled framework for closed-form model merging in GNNs. We exploit the linear structure of message-passing layers and formulate merging as a convex embedding-alignment problem with an analytical solution. Through the first systematic benchmarking of model merging for GNNs, we show that existing methods designed for vision and language catastrophically fail on force field regression, while GFFMERGE recovers performance approaching gold standard joint training. Across molecular (MD17, MD22), solid-state (LiPS20), and large-scale graph benchmarks, GFFMERGE and GNNMERGE (its generic GNN counterpart) achieve 5-27× speedups while enabling modular composition of specialized models. 
Remarkably, our closed-form solution alone outperforms all baseline methods before fine-tuning and provides superior initialization for faster, data-efficient convergence.","published_date":"2026-06-02T06:48:34+00:00","viability_score":7,"cluster_label":"AI for Science","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for efficiently merging graph neural networks for atomistic simulations, enabling faster adaptation to new chemical systems and outperforming existing methods.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03223v1","title":"BotDirector: Robot Storytelling Across the Symmetrical Reality with Multi-modal Interactions","abstract":"Robot storytelling offers a unique blend of technological innovation and creative expression that engages children in unprecedented ways. However, the technical aspects are often too complicated for children. We propose an interactive system that facilitates robot storytelling with tangible and natural language interactions. Children arrange the playground with their own stuff and create narratives with an LLM agent. The created narratives are transformed into a motion sequence based on the map and characters, and the motions are executed by self-navigating swarm robots. 
This system enhances robot storytelling with flexible scenarios, enabling young children to create robot dramas with everyday objects.","published_date":"2026-06-02T06:33:35+00:00","viability_score":5,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An interactive system that allows children to create robot stories using everyday objects and natural language, transforming narratives into robot movements.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2606.03220v1","title":"WebRISE: Requirement-Induced State Evaluation for MLLM-Generated Web Artifacts","abstract":"Existing benchmarks for MLLM-generated web artifacts assess interaction through local evidence and miss the requirement-induced states and transitions that determine whether a page works. We introduce WebRISE, which compiles task requirements into Interaction Contract Graphs (ICGs) of observable states, user-intent transitions, and DOM/visual assertions for implementation-agnostic browser execution. WebRISE spans 442 tasks across five input modalities (Text, Markdown, Sketch, Image, Video), with 5,495 transitions and 5,271 requirement checks that separate user-stated functions from implicit product-level constraints. Across 14 MLLMs, even the strongest model reaches only 65.6% transition validity and 66.3% requirement coverage, and visual quality is no proxy for behavior (Qwen3.6-35B-A3B on Markdown: V=80.8 yet T=15.5). 
Video gives the strongest interaction signal (+10.6 pp implicit coverage over Text), while implicit constraints persist; defect injection shows ICG-based scoring detects state errors at 2-16x the rate of checkpoint-style evaluation.","published_date":"2026-06-02T06:29:40+00:00","viability_score":7,"cluster_label":"MLLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel benchmark for evaluating multi-modal large language models' web artifact generation by assessing requirement-induced states and transitions, revealing significant performance gaps.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03214v1","title":"Effect of Demographic Bias on Skin Lesion Classification","abstract":"In this study, we evaluate the performance of skin lesion classification using ResNet-based convolutional models, focusing on the impact of demographic bias in training data, particularly variations in patient sex and age. We use linear programming to generate datasets with controlled demographic characteristics, allowing systematic investigation of bias effects. Three learning strategies are evaluated: a single-task model, a reinforcing multi-task model, and an adversarial learning scheme. Our sex-based analysis indicates that sex-specific training datasets optimise model performance. Notably, including male patients in the training data improved performance for the male subgroup, even in female-majority cases. Reinforcing and adversarial learning schemes narrowed or eliminated bias gaps in balanced and female-majority datasets. However, these strategies proved less effective in male-majority settings, where models continued to perform better for males than females. The two learning schemes showed marginal bias reduction compared to the baseline model in predominantly male patient populations. 
Age-based analysis demonstrates comparable baseline performance across the three model approaches, with performance declining across age categories. Younger groups consistently achieve the highest performance, regardless of training data distribution. Although balanced training yields optimal results for the youngest age category, performance decreases in older categories. We find that sex biases arise mainly from data imbalances, while age biases consistently favour younger groups regardless of distribution. These distinct mechanisms require targeted mitigation strategies. Additionally, cross-dataset validation on two external datasets revealed that domain shifts notably affect performance and patterns of demographic bias.","published_date":"2026-06-02T06:16:24+00:00","viability_score":4,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Investigating the impact of demographic bias on skin lesion classification using ResNet models, exploring mitigation strategies for sex and age disparities.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03203v1","title":"MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents","abstract":"Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark for clinical computer-use agents. It covers 18 clinical scenarios across 10 medical domains, reconstructed from real product manuals and open-source medical systems to capture authentic clinical interfaces while avoiding licensing and privacy constraints. 
Each task ships with paired intent- and step-level goals to disentangle clinical reasoning from UI execution, and is evaluated by a deterministic checker over task completion and five clinical safety dimensions. Across 23 agents, the best closed-source model reaches 54.2% strict success, while all models remain below 9% on the real OpenEMR. Open-source agents average only 2.5%, with the best reaching 16.2%. MedCUA-Bench exposes the gap between current agents and reliable clinical software use, providing a reproducible testbed for future research.","published_date":"2026-06-02T06:02:35+00:00","viability_score":7,"cluster_label":"Medical AI Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark for evaluating AI agents on clinical software tasks, revealing significant gaps in current capabilities and providing a testbed for future development.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03201v1","title":"Reinforcement Learning from Cross-domain Videos with Video Prediction Model","abstract":"Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the presence of domain gaps. We introduce XIPER (Cross-domain Video Prediction Reward), a reward model for learning from expert videos collected in a visually different domain, where the agent's appearance differs due to factors such as color, morphology, or the sim-to-real gap. More specifically, XIPER trains a cross-domain video prediction model that maps agent observations into the expert domain and uses the prediction likelihood as a reward signal. Experiments on the DMC Color Suite (8 tasks) and DMC Body Suite (3 tasks) show that XIPER consistently outperforms baselines despite domain gaps such as differences in agent color and morphology. 
We further analyze XIPER on a sim-to-real transfer dataset, demonstrating that it produces meaningful reward signals for real-robot observations given only simulated expert videos. Code, pretrained models, datasets and video demonstrations can be found on our project webpage: https://sites.google.com/view/xiper","published_date":"2026-06-02T06:00:15+00:00","viability_score":8,"cluster_label":"Reinforcement Learning Applications","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Leverage XIPER to enhance AI learning across domains by translating observations into expert video domains for improved policy training.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03198v1","title":"AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making","abstract":"Clinical AI evaluation increasingly delegates scoring to large language models (LLMs) acting as AI raters, yet their scoring behavior across evaluation conditions has not been quantitatively characterized. We address this gap through a factorial study of AI rater behavior in adult type 2 diabetes (T2D) pharmacotherapy at 12-month outpatient follow-up, a clinical task involving complex decision-making operationalized across seven evaluation questions. Four open-source LLMs served simultaneously as clinical decision support system (CDSS) models and AI raters. Each CDSS output was scored under two scoring protocols: a rubric-anchored Gold Rubric (GR) protocol incorporating a patient-specific rubric, and a rubric-free Non Gold Rubric (Non-GR) protocol. Linear mixed effects models crossed the scoring protocol factor with five design factors -- CDSS model, CDSS prompt configuration (document-referenced generation [DRG] vs.\\ Baseline), rater model, prompt character, and prompt type -- and estimated main effects together with their protocol interactions. 
Across all questions, AI raters yielded consistently higher scores within a very narrow range (74--78 points on average) under Non-GR compared to those under GR (7.69 to 49.64 points lower mean scores; 1.68 to 3.67 times wider interquartile ranges). Within each question, GR amplified the AI rater's discrimination between DRG and Baseline CDSS outputs by factors of 1.76 to 5.10, while also revealing substantial behavioral variation across rater models that Non-GR suppressed. These findings support rubric anchoring as the scoring protocol that preserves discriminative power in clinical AI evaluation; rubric-free scoring cannot substitute when questions require patient-specific or jurisdiction-specific criteria that rater models cannot infer from parametric knowledge alone.","published_date":"2026-06-02T05:58:23+00:00","viability_score":3,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigates how different scoring protocols affect AI rater discrimination in complex clinical decision-making, highlighting the importance of rubric anchoring for preserving discriminative power.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03165v1","title":"Fully Automated Identification of Lexical Alignment and Preference-Stage Shifts in Large Language Models","abstract":"The language used by digital chat assistants such as ChatGPT can diverge from human expectations (misalignment). Research, mostly on Scientific English, has described both what divergences occur and, to some extent, why, linking them to the training stage of human preference learning. Yet, existing approaches rely on manual curation. This paper introduces two curation-free, assumption-light evaluation metrics: the Lexical Alignment Score, which identifies lexical overuse, and the Triangulated Preference Shift, which quantifies how much of such shifts can be attributed to human preference learning. 
Using PubMed abstracts, continuations were generated and measured using windowed document prevalence across six model families (Falcon, Gemma, Llama, Mistral, OLMo, Yi). The procedure identifies, without manual intervention, overused items such as 'suggest', 'additionally', and 'strategy', and estimates their link to preference learning. Our findings replicate prior work and remain stable across parameter settings, random seeds, and evaluation on further data. The approach scales readily and enables systematic study of lexical (mis)alignment beyond Scientific English and across languages, and as such, the metrics have the potential to contribute to improved alignment for future models and understanding of its origins.","published_date":"2026-06-02T05:23:45+00:00","viability_score":3,"cluster_label":"LLM Alignment","has_code":true,"repo_url":"https://github.com/fsu-nlp/lexical-alignment-shifts","commercial_flags":["has_code"],"one_liner":"Introduces two curation-free metrics to automatically identify lexical misalignment and quantify shifts attributed to human preference learning in large language models.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03163v1","title":"OpenAgenet/OAN: Technical Architecture for Trust-Governed Agent Identity and Discovery","abstract":"This paper describes the technical architecture of OpenAgenet / OAN. OAN is a protocol-neutral trust layer for open Agent interconnection. It specifies the role architecture, identity objects, registration workflow, Root-governed lifecycle, Root-verified package model, authorization-aware Discovery, signed trusted invocation, verification requirements, state transitions, security properties, implementation boundaries, and deployment considerations. The design is intended to support heterogeneous Agent frameworks and interaction protocols, including MCP, A2A, ANP-like systems, and domain-specific Agent protocols. 
OAN does not define the entire business conversation among Agents; it defines how Agent identities become admissible, discoverable, verifiable, and safe to approach before protocol-specific interaction begins.","published_date":"2026-06-02T05:18:14+00:00","viability_score":3,"cluster_label":"Agents","has_code":true,"repo_url":"https://github.com/OpenAgenet/github.com","commercial_flags":["has_code"],"one_liner":"A protocol-neutral trust layer for open Agent interconnection, specifying identity, discovery, and invocation before interaction.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03161v1","title":"OpenAgenet/OAN: Open Infrastructure for Trusted Agent Interconnection","abstract":"OpenAgenet, abbreviated as OAN, is an open infrastructure project for trusted Agent interconnection. It addresses a problem that becomes visible when Agents move from isolated applications into open, multi-operator networks: before an Agent can safely discover, select, and invoke another Agent, it needs a way to verify identity provenance, governance state, discovery authorization, freshness, and pre-connection trust evidence. OAN is designed as a protocol-neutral trust layer. It does not replace Agent interaction protocols, tool protocols, model orchestration frameworks, or application-level workflows. Instead, it provides Root-governed identity admission, Registrar-assisted onboarding, Root-verified package publication, authorization-aware Discovery, and signed trusted invocation. 
This paper presents the motivation, architecture, roles, governance model, relationship with MCP, A2A, and ANP, deployment patterns, cooperation model, blockchain-backed authorization bulletin, prototype status, performance profile, and roadmap of OAN.","published_date":"2026-06-02T05:14:34+00:00","viability_score":3,"cluster_label":"Agents","has_code":true,"repo_url":"https://github.com/OpenAgenet/github.com","commercial_flags":["has_code"],"one_liner":"An open infrastructure for trusted Agent interconnection, providing a protocol-neutral trust layer for identity, discovery, and invocation.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03159v1","title":"NVIDIA OmniDreams: Real-Time Generative World Model for Closed-Loop Autonomous Vehicle Simulation","abstract":"As autonomous vehicle capabilities advance, the safe evaluation of driving policies in long-tail scenarios remains a critical bottleneck. In closed-loop simulation, the driving policy model actively interacts with the environment, where its actions dynamically update the simulator state and directly influence the next set of generated sensor observations. While recent reconstruction-based neural simulators offer photorealism, they are fundamentally constrained by their initial captured data and struggle to generalize to highly dynamic or novel scenes. To overcome these limitations, we introduce OmniDreams, a foundation generative world model mid- and post-trained from the Cosmos diffusion model to autoregressively generate action-conditioned videos in real time. By leveraging the rich visual priors of Cosmos and mid- and post-training on 21k hours of driving scenarios, OmniDreams synthesizes complex, unobserved phenomena that are hard for traditional simulators to capture, such as extreme weather and unpredictable dynamic agent behaviors. Crucially, it autoregressively conditions its photorealistic sensor generation on past frames, the current simulator state, and immediate driving actions. 
Deployed in a closed-loop system with the Alpamayo 1 policy model and AlpaSim orchestrator, OmniDreams acts as a highly responsive, reactive environment, providing a scalable and comprehensive solution for training and evaluating next-generation autonomous driving policies. We additionally show preliminary results indicating that a world-action model (WAM) post-trained from OmniDreams achieves strong performance on the Physical AI Autonomous Vehicles NuRec dataset, surpassing the VLA-based Alpamayo 1.5 research policy model while using only 1/5 the total parameters. These results highlight the potential for a real-time world model like OmniDreams to also serve as a backbone for policy architectures.","published_date":"2026-06-02T05:11:05+00:00","viability_score":7,"cluster_label":"Autonomous Driving Simulation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A real-time generative world model for autonomous vehicle simulation, synthesizing complex scenarios and outperforming existing policies.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03157v1","title":"ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models","abstract":"Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchmarks primarily assess LLM performance in single-course settings and lack systematic evaluation in multi-course scenarios, where a patient's condition evolves over time. To address this gap, we propose ClinicalMC, a benchmark for multi-course clinical decision-making. It includes 1,275 Chinese and 5,804 English samples across four stages from admission to discharge. These stages cover triage, first-course examination/diagnosis/treatment, subsequent multi-course examination/assessment/treatment, and final diagnosis. 
In ClinicalMC, patients in the English dataset undergo an average of 5.11 clinical courses, whereas those in the Chinese dataset undergo 3.42. To assess LLM performance, we construct a multi-agent evaluation framework that includes patient, examiner, and doctor agents. Based on the benchmark and framework, we design two experimental settings -- a single-turn static setting and a multi-turn dynamic setting -- and assess three categories of LLMs: 1) closed-source LLMs like GPT5-mini; 2) open-source LLMs like DeepSeek-V3.2; and 3) medical LLMs like HuatuoGPT-o1. Through extensive evaluation, we aim to better understand LLM performance in the medical domain and support its effective deployment in healthcare.","published_date":"2026-06-02T05:09:18+00:00","viability_score":6,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and evaluation framework for assessing Large Language Models in multi-course clinical decision-making.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2606.03144v1","title":"GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory","abstract":"Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties (Group 1), algorithm tracing and structural reasoning (Group 2), and graduate-level proof construction (Group 3). Problems are sourced from verified academic materials including Diestel's Graph Theory. 
We evaluate five frontier models -- GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3 -- under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation for Groups 1 and 2, and a hybrid human expert and LLM-as-judge protocol for Group 3. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while all other models degrade substantially with difficulty, with Llama achieving 0% under human evaluation on Group 3 zero-shot. Failure mode analysis shows that correct algorithm, wrong execution errors dominate Groups 1 and 2, while Group 3 additionally surfaces incomplete reasoning failures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near-complete proofs (kappa = 0.48-0.83 across human pairs). GTBench provides the first curriculum-grounded evaluation framework for graph-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research.","published_date":"2026-06-02T04:40:25+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark for evaluating LLMs as mathematical research assistants in graph theory, revealing performance hierarchies and failure modes.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03137v1","title":"Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation","abstract":"LLM-based multi-agent simulation offers a promising way to study social interaction, deliberation, and collective opinion dynamics. 
However, many existing dialogue simulation frameworks represent interaction mainly as observable turn exchange or aggregated outputs, leaving the internal evaluative processes behind silence, speaking intention, and public expression difficult to examine. We introduce TBS (Think-Before-Speak), an interval-based multi-agent simulation framework that separates agents' private reasoning from public utterance generation. At each interval, all agents update structured internal states based on the shared dialogue history and their own memory. These states include dissonance-related appraisal, perceived opinion climate, perceived isolation risk, response strategy, and willingness to speak. The orchestrator then resolves competing speaking intentions and commits one utterance to the public dialogue, allowing internal evaluation and public interaction to co-evolve over time. We evaluate TBS in simulated town hall discussions on a climate-related policy issue. Results show that TBS produces coherent internal-state traces and that these traces vary systematically across turn-allocation, silence, and memory conditions. Dissonance-related appraisal increases agents' willingness to speak, whereas silence-pressure appraisal decreases it. Once speaking intention is formed, public expression is shaped mainly by turn-allocation rules. 
These findings suggest that TBS supports mechanism-sensitive social simulation by making the pathway from internal evaluation to public expression observable and analyzable.","published_date":"2026-06-02T04:26:01+00:00","viability_score":4,"cluster_label":"Multi-Agent Simulation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel framework for multi-agent social simulation that separates internal reasoning from public expression to study deliberation and opinion dynamics.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03135v1","title":"Uncertainty-Aware Clarification in LLM Agents with Information Gain","abstract":"Large Language Model (LLM) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to erroneous tool actions. To address this challenge, we propose a goal-oriented clarification framework that aligns clarification behavior with ambiguity resolution. Central to our approach is the Information Gain Reward, a metric that quantifies the utility of clarification questions by measuring the Bayesian belief update towards the ground-truth goal induced by the clarification exchange. We train the clarifier (LLM) using this reward to optimize for high information gain, ensuring that clarifications effectively reduce uncertainty and improve task completion within the agent-tool-user environment. We validate our framework within a clarification-enhanced $\u03c4$-Bench environment, conducting cross-agent evaluations across five heterogeneous backbones. 
Empirical results demonstrate that our method consistently improves the success rate by 3.7% over the no-clarification baseline, while adding only 0.3 total interaction steps on average.","published_date":"2026-06-02T04:23:59+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An LLM agent framework that uses information gain to guide clarification questions, reducing uncertainty and improving task completion.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03128v1","title":"Decoupled Smart Contract Audits: Lightweight LLM Framework via Distillation and Aggregation","abstract":"Smart contracts face critical security challenges that require thorough auditing in decentralized web services. While Large Language Models (LLMs) have shown promise in automated vulnerability detection, existing approaches lack severity evaluations with actionable remediation and demand unnecessarily massive computational overhead. In this study, we introduce an efficient end-to-end smart contract security audit framework utilizing lightweight, highly optimized open-source LLMs (0.6B-4B parameters). Our framework decouples comprehensive audit tasks into four interconnected components: vulnerability detection, explanation, severity classification, and remediation recommendation. To maintain high accuracy without massive parameters, we implement Rank-Stabilized Low-Rank Adapters (rsLoRA), knowledge distillation, and a custom Chain-of-Verification (CoVe) aggregation strategy to systematically screen and consolidate multiple draft responses from the model into a highly accurate audit report. 
Experimental results demonstrate that our lightweight pipeline consistently outperforms state-of-the-art open-source coder dense LLMs (7B to 34B parameters), achieving 98.25% accuracy in vulnerability detection and an alignment score of 0.4375 in generative explanation tasks. Furthermore, our extensive ablation studies empirically validate the superiority of our decoupled audit processes over unified prompting and uncover a novel severity centrality bias, establishing a critical benchmark for future research in LLM-assisted auditing.","published_date":"2026-06-02T04:13:43+00:00","viability_score":8,"cluster_label":"Smart Contract Auditing","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A lightweight LLM framework for smart contract auditing that decouples tasks, uses distillation and aggregation for high accuracy with minimal parameters.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03119v1","title":"GuidedBridge: Training-freely Improving Bridge Models with Prior Guidance","abstract":"Guidance methods, such as classifier-free guidance (CFG) and auto-guidance (AG), have advanced noise-to-data generation in diffusion models. Recently, bridge models have introduced a data-to-data generative process that can exploit an instructive clean prior. In this work, inspired by previous methods creating quality difference between denoising results as guidance, we propose a training-free bridge guidance method, termed Prior Guidance (PG). Specifically, we introduce a weak prior, which is unseen during bridge pre-training, hindering prior exploitation and thereby degrading denoising result. Then, we contrast it with the seen prior to highlight and enhance prior exploitation via a scaling factor. 
Moreover, we analyze the underlying mechanism of prior exploitation in the bridge process and design frequency-modulated prior guidance (FMPG), which tailors the guidance scale to low- and high-frequency bands coherent with bridge generative dynamics. To address prior exploitation in image in-painting, we develop a cascaded framework, CFG-FMPG, which first generates a noisy hidden representation via CFG and then exploits it as a generative prior with FMPG, fulfilling their complementary strengths without compromising inference efficiency. Experiments demonstrate that our PG methods consistently improve pre-trained bridge models across diverse image translation tasks.","published_date":"2026-06-02T04:04:23+00:00","viability_score":4,"cluster_label":"Generative Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A training-free method to improve diffusion bridge models for image translation tasks by leveraging prior guidance.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03116v1","title":"AnyAudio-Judge: A Dynamic Rubric-Based Benchmark and Evaluator for Audio Instruction Following","abstract":"The rapid advancement of instruction-guided audio generation has highlighted the critical need for robust alignment evaluation. Current automated evaluation methods heavily rely on holistic scoring from general-purpose large language models, which struggle to decouple complex instructions, lack interpretability, and fail to capture fine-grained attribute mismatches. To address this, we introduce a novel dynamic rubric-based evaluation paradigm that adaptively decomposes complex audio captions into a variable number of independent, verifiable binary rubric items. 
To rigorously benchmark this capability, we propose the AnyAudio-Judge Bench, a comprehensive, bilingual benchmark comprising 7,920 meticulously curated samples across four diverse audio domains (speech, sound, music, and mixed), featuring deliberately constructed hard negatives. Furthermore, we construct a large-scale corpus of 105K samples with explicit Chain-of-Thought (CoT) rationales to train our dedicated evaluator, the AnyAudio-Judge model. By employing a training pipeline that combines Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), our model successfully aligns its reasoning paths with the rubric-based scoring mechanism. Extensive experiments demonstrate that AnyAudio-Judge not only significantly enhances zero-shot alignment detection compared to state-of-the-art baselines, but also provides precise and interpretable reward signals that substantially improve instruction alignment in downstream reinforcement learning for audio generation.","published_date":"2026-06-02T04:00:32+00:00","viability_score":7,"cluster_label":"Audio AI","has_code":true,"repo_url":"https://github.com/CuCl-2/AnyAudio-Judge","commercial_flags":["has_code"],"one_liner":"An audio instruction following benchmark and evaluator that uses dynamic rubrics to provide precise, interpretable feedback for better alignment.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03108v1","title":"EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning","abstract":"Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. 
We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering, EvoTrainer matches or exceeds the human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, evolving diagnostics prevent invalid high-scoring branches from being promoted, and reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.","published_date":"2026-06-02T03:47:48+00:00","viability_score":3,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework that co-evolves LLM policies and training harnesses for autonomous agentic reinforcement learning.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03103v1","title":"DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration","abstract":"Real-world professional desktop workflows in specialized creative and engineering software unfold over long horizons and often require human-in-the-loop coordination, where agents proactively seek necessary information and users provide additional instructions, clarifications, feedback, or corrections as the task progresses. Yet existing desktop GUI benchmarks mostly reduce this setting to short, simplified tasks with all user instructions provided upfront. To address this issue, we introduce DeskCraft, a desktop GUI benchmark targeting long horizon creative and engineering workflows and proactive human-agent collaboration. 
DeskCraft organizes tasks into a multilevel difficulty taxonomy, with long horizon tasks requiring over 50 execution steps, and covers professional creative software across design, video, audio, and 3D creation. Furthermore, DeskCraft formalizes human-agent collaboration into an interaction protocol covering mid-turn and post-turn exchanges. Mid-turn interaction captures both agent-initiated clarification under uncertainty and user-initiated interruption during execution, while post-turn interaction accommodates user-driven feedback after the agent signals completion, together spanning the full space of realistic collaboration patterns. We evaluate 18 proprietary and open source agents on 538 tasks and find that GPT-5.4 reaches 31.6% on standard tasks and 27.6% on interactive tasks. Further analyses reveal persistent failures in long horizon workflow delivery and proactive clarification. We will open-source all evaluation codes, tasks, and data at https://github.com/mrwwk/DeskCraft.","published_date":"2026-06-02T03:42:34+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":"https://github.com/mrwwk/DeskCraft","commercial_flags":["has_code"],"one_liner":"A benchmark for desktop agents that evaluates long-horizon professional workflows and human-in-the-loop collaboration.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03099v1","title":"PhotoCraft: Agentic Reasoning with Hierarchical Self-Evolving Memory for Deep Image Search","abstract":"Deep Image Search requires multi-step reasoning over rich contextual cues, such as time, location, and event relations. However, most existing LLM-based agents are stateless and reactive, lacking persistent memory to maintain long-horizon context or transfer experience across tasks, which often leads to execution drift and experience isolation. 
To address these limitations, we propose PhotoCraft, a training-free, hierarchical memory system for photo-search agents. Inspired by human cognition, PhotoCraft equips MLLMs with working, episodic, and semantic memory, which are dynamically invoked during reasoning to preserve logical consistency and knowledge transferability throughout multi-step reasoning and answer generation. Extensive experiments on DISBench demonstrate that PhotoCraft consistently improves context-aware retrieval across diverse MLLM backbones, achieving gains of up to 18.5\\% and effectively mitigating key bottlenecks in memoryless deep image search, offering a practical path toward reliable and generalizable multimodal search agents.","published_date":"2026-06-02T03:38:44+00:00","viability_score":7,"cluster_label":"Multimodal Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PhotoCraft enhances LLM agents for deep image search by providing hierarchical memory, improving context-aware retrieval and mitigating bottlenecks in stateless agents.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03097v1","title":"From Long News to Accurate Forecast: Importance-Aware Fusion and PRM-Guided Reflection for Time Series Forecasting","abstract":"Incorporating news into time series forecasting is appealing because news can reveal abrupt exogenous events that historical values alone cannot recover. However, existing LLM-based news-forecasting pipelines face two practical limitations: relevant news articles often exceed the model's context window, and iterative retrieval of supplementary news is typically unguided, leading to redundant updates and slow convergence. We address these issues with a novel framework that combines importance-aware news compression and process-level retrieval supervision. 
First, we train an importance reward model that estimates the forecasting utility of each article and uses this signal to allocate compression budgets during sequential pairwise fusion, preserving informative content within a fixed context limit. Second, we introduce a process reward model (PRM) that ranks multiple supplementary-news candidates conditioned on the current error profile and the history of previously selected articles, replacing one-shot blind retrieval with quality-controlled selection. Both components are trained offline using historical data with ground truth; inference uses the frozen filtering logic and compression modules without any reflection loop. Experiments on finance, energy, traffic, and bitcoin forecasting benchmarks show that our method improves prediction accuracy over strong baselines, significantly reduces the number of refinement iterations compared to the iterative baseline, and remains effective when relevant articles span thousands of tokens.","published_date":"2026-06-02T03:36:30+00:00","viability_score":7,"cluster_label":"Time Series Forecasting","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This framework improves LLM-based news-forecasting by compressing relevant news within context limits and using a process reward model for guided supplementary news selection.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03093v1","title":"Decomposing how prompting steers behavior","abstract":"Prompting steers large language models (LLMs) and vision-language models (VLMs) without weight updates, but it remains unclear how instruction changes reshape internal representations to produce behavior. We introduce a nested geometric decomposition framework that treats prompting as a transformation of the representational geometry of the content following the prompt. 
For each prompt pair, we align representations of the same stimuli under two prompts using increasingly expressive stimulus-invariant maps: translation, rigid transformation with uniform scaling, sequential axis scaling, affine transformation, and nonlinear transformation. We then causally test each map by replacing a single layer's prompt-A hidden state for held-out stimuli with its mapped counterpart and measuring recovery of prompt-B representational geometry and behavior. Across three LLMs, three VLMs, and six text or image datasets spanning style, emotion, scene content, and number, prompts consistently reshape representations toward the instructed task structure. Cross-validated variance decomposition shows that much prompt-induced activation change is captured by shape-preserving maps, especially translation and rigid transformation with uniform scaling, while tier profiles reveal model- and task-specific routing strategies across layers. Crucially, although translation and rigid tiers already improve behavioral agreement, affine transformation is the first tier to nearly recover target-prompt task geometry and yields corresponding behavioral gains. This suggests that cross-dimensional linear mixing is a key mechanism by which prompts reorganize representations toward instructed task structure. 
Our framework decomposes prompt-induced representational change into interpretable geometric components and reveals how models route task-relevant structure to produce prompt-driven behavior.","published_date":"2026-06-02T03:27:24+00:00","viability_score":4,"cluster_label":"LLM Interpretability","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A geometric decomposition framework analyzes how prompting reshapes LLM and VLM internal representations to understand prompt-driven behavior.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03092v1","title":"The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs","abstract":"Inference-time scaling has emerged as a critical avenue for enhancing Large Language Models' performance, yet real-world deployment is constrained by strict computational budgets. In this work, we formulate inference budget allocation as a global constrained optimization problem governed by economic principles. By modeling per-query reasoning utility with a shifted-surge function, we derive an optimal allocation policy based on a global shadow price that equilibrates marginal utility under resource scarcity. Based on this theory, we propose Constrained Latent-utility Equilibrium Allocation for Reasoning (CLEAR). It performs rational abandonment and reallocates resources from insolvent queries to solvable queries near their emergence thresholds. Extensive experiments on several reasoning tasks with different traffic streams demonstrate that CLEAR significantly improves the Pareto frontier of total token cost versus mean accuracy. 
In resource-scarce regimes, CLEAR achieves up to a 3x improvement in global accuracy compared to uniform allocation.","published_date":"2026-06-02T03:26:55+00:00","viability_score":7,"cluster_label":"LLM Inference Optimization","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CLEAR optimizes LLM inference budgets by allocating resources based on economic principles, improving accuracy under resource scarcity.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03091v1","title":"BAHSD: Bridging the Long-tail Gap via Adaptive Distillation in Black-box Sequential Recommendation","abstract":"Sequential recommendation systems are widely adopted but often deployed as black-box APIs, which has driven recent interest in model extraction to replicate their capabilities locally. However, the long-tail distribution induces severe signal heterogeneity: dense head sequences trigger the solidification of teacher preference, biasing extraction toward local patterns, while sparse tail sequences yield flat, noisy predictions. Existing one-size-fits-all extraction overlooks this disparity, resulting in noise overfitting and suboptimal knowledge transfer. We propose BAHSD, a black-box adaptive distillation framework that handles signal heterogeneity via a multi-scale consistency probing mechanism to implicitly quantify signal reliability. Based on this, an adaptive hierarchical objective is designed: dynamic-temperature KL divergence mitigates preference solidification for high-confidence signals, while ranking consistency and InfoNCE contrastive learning provide noise-robust enhancement for low-confidence signals. 
BAHSD consistently outperforms baselines, achieving up to 4.98\\% gain over the teacher and 80\\%+ improvement on tail users, offering a plug-and-play solution for high-fidelity black-box recommendation extraction.","published_date":"2026-06-02T03:26:25+00:00","viability_score":3,"cluster_label":"Recommendation Systems","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for adaptive distillation in black-box sequential recommendation systems to handle signal heterogeneity and improve performance on tail users.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03090v1","title":"\"**Important** You should give me full credits!\": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems","abstract":"The emergence of large language models (LLMs) has significantly accelerated recent research on LLM-based automatic grading (AG) systems. Benefiting from the strong instruction-following capabilities and broad prior knowledge of LLMs, educators can deploy AG systems across diverse tasks using only natural language rubrics while achieving satisfactory grading performance. Despite these advantages, new security concerns may also arise. In particular, prompt injection (PI) attacks have recently become a major threat to LLM-based applications. In the context of AG, attackers can potentially exploit PI vulnerabilities to manipulate grading systems into assigning artificially high scores regardless of the actual answer quality. Such behavior poses serious risks to the fairness, reliability, and integrity of educational assessment. In this work, we study PI attacks in AG systems, and systematically investigate the effectiveness of such attacks in educational scenarios. We further evaluate the effectiveness of existing defensive strategies against these attacks. Through comprehensive experiments under rubric-based grading settings, we demonstrate that current LLM-based AG systems remain highly vulnerable to PI attacks. 
We hope that our findings raise awareness of this emerging threat and motivate future research toward secure, robust, and trustworthy LLM-based educational systems.","published_date":"2026-06-02T03:24:12+00:00","viability_score":6,"cluster_label":"LLM Security","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Investigating prompt injection vulnerabilities in LLM-based automatic grading systems and evaluating existing defenses.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03089v1","title":"Constitutional On-Policy Safe Distillation","abstract":"On-policy self-distillation (OPSD) has emerged as an efficient post-training paradigm by using a teacher conditioned on privileged information to provide dense token-level supervision. Prior work has shown that OPSD can collapse in verifiable reasoning tasks, but safety alignment differs in that it is guided by high-level constitutions rather than explicit target answers, making it a natural setting to revisit dense distillation. However, our pilot study shows that safety OPSD still suffers from severe collapse: constitutional conditioning contracts the teacher distribution toward short and overly conservative responses, and Reverse KL further amplifies this contraction into reduced expressiveness. We formalize this effect as geometric leakage under safety boundaries in a non-orthogonal semantic space, where safety pressure transfers into the expressiveness dimension. Based on this analysis, we propose Constitutional On-Policy Safe Distillation (COPSD), which first calibrates the teacher through a Cross-SFT cold-start and then performs constitution-conditioned on-policy distillation. 
Experiments on 12 benchmarks show that COPSD achieves a consistently stronger safety--helpfulness trade-off than baselines while substantially reducing the safety tax on general reasoning ability.","published_date":"2026-06-02T03:17:56+00:00","viability_score":6,"cluster_label":"LLM Alignment","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new distillation method to improve the safety-helpfulness trade-off in LLMs by addressing collapse under constitutional guidance.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03083v1","title":"DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees","abstract":"Large Language Model (LLM)-based agents increasingly rely on memory to learn from experiences over continual interactions. However, storing experiences as independent, flat units leads to substantial redundancy and retrieval conflicts, as similar episodes repeat overlapping content and subtle scene variations cause retrieved memories to offer contradictory guidance. To address this, we introduce residual experience, positing that newly acquired experience is often an incremental variation of existing knowledge. We propose DeltaMem, a framework that organizes experience memory into two independent residual trees, one storing goal-conditioned task experience as reusable skills and another for scene-level environment knowledge. Each tree uses a root node for generalized base experiences and incremental delta nodes for subsequent variations, allowing related experiences to share a common foundation without duplication. For retrieval, a failure-penalized similarity scan locates the best match, reconstructing the full experience via root-to-match chain composition. An autonomous consolidation mechanism distills high-frequency paths into new root nodes, enabling the trees to self-organize from general heuristics to specialized variants. 
Experiments across diverse interactive environments show that DeltaMem consistently outperforms existing baselines. To facilitate future research, we release the code at https://github.com/import-myself/DeltaMem.","published_date":"2026-06-02T03:13:50+00:00","viability_score":9,"cluster_label":"LLM Agents","has_code":true,"repo_url":"https://github.com/import-myself/DeltaMem","commercial_flags":["has_code"],"one_liner":"DeltaMem is a novel memory framework for LLM agents that organizes experiences into residual trees for efficient incremental learning and retrieval.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03080v1","title":"Regret Pre-training: Bridging Prior and Posterior Views for Enhanced Knowledge Grounding","abstract":"Causal language models factorize sequence probabilities using only preceding context, leaving future information unexploited during training despite its availability in the training data. This paper introduces Regret Pre-training, a self-supervised framework grounded in the Learning Using Privileged Information (LUPI) paradigm. The framework employs a dual-view architecture in which a single model generates both a causal Student distribution and a future-conditioned Teacher distribution. The training objective augments standard language modeling with a regret loss that minimizes the KL divergence from teacher to student, transferring future-aware signals to the causal representations. We investigate two teacher configurations on the OLMoE-1B-7B architecture: LocalRegret, which extends attention by one future token, and GlobalRegret, which conditions on bidirectional context with the target position masked. Experiments on nine downstream tasks following 4 billion tokens of training demonstrate that both configurations consistently outperform the baseline. On average, GlobalRegret and LocalRegret achieve 33.9% and 32.2% accuracy respectively, surpassing the baseline's 30.2%. 
Most notably, GlobalRegret improves BoolQ performance by 18.1 percentage points (61.0% vs 42.9%). The framework introduces no additional parameters and requires only one extra inference-mode forward pass per training step.","published_date":"2026-06-02T03:11:39+00:00","viability_score":4,"cluster_label":"LLM Training","has_code":true,"repo_url":"https://github.com/RegretPretraining/Code2026","commercial_flags":["has_code"],"one_liner":"A novel pre-training framework enhances causal language models by incorporating future information, leading to significant improvements on downstream tasks without additional parameters.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03077v1","title":"Libra: Efficient Resource Management for Agentic RL Post-Training","abstract":"Reinforcement learning (RL) has become a standard post-training paradigm for large language models (LLMs), extending beyond preference alignment to complex reasoning and multi-turn agentic behaviors. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads that challenge conventional resource-management assumptions. Three fundamental challenges arise. First, due to the long-tail distribution, a small fraction of trajectories dominates rollout makespan. Second, rollout and training exhibit strong asymmetry in compute patterns, memory demands, and sensitivity to sequence length. Third, as the RL policy evolves, the trajectory-length distribution drifts over time, rendering any static resource split progressively suboptimal. We present Libra, which introduces two core mechanisms. The first is a periodic global resource planner that jointly optimizes GPU allocation across rollout and training clusters. It leverages an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. 
The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes, rather than relying on fragile length predictions. Evaluated on 48 A800 GPUs, Libra achieves up to 3.0$\\times$ higher throughput and converges up to 2.5$\\times$ faster in reward compared to the baselines.","published_date":"2026-06-02T03:09:13+00:00","viability_score":6,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Libra optimizes resource management for agentic RL by dynamically allocating GPUs and using a novel scheduler to handle long-tailed, non-stationary workloads, improving throughput and convergence.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2606.03073v1","title":"Efficient Hyperparameter Optimization for LLM Reinforcement Learning","abstract":"Reinforcement learning (RL) for large language models (LLMs) is highly sensitive to hyperparameter configurations, making hyperparameter optimization (HPO) essential yet computationally expensive. Existing multi-fidelity HPO methods remain inefficient for LLM RL due to the massive model scale and resource-intensive training cycles. In this paper, we propose Joint Fidelity Hyperparameter Optimization (JF-HPO), which simultaneously adapts both model size and training budget as fidelity. JF-HPO is empowered by: (i) it leverages a small proxy model of the target LLM for efficient training and evaluation in each HPO trial; (ii) it integrates carefully designed early-stopping strategies based on training dynamics; (iii) it introduces an efficient checkpointing mechanism to eliminate redundant computations. Compared with existing HPO methods, JF-HPO significantly improves the computational efficiency of each trial (up to 14.9 times), while achieving better or competitive predictive accuracy under the same time budget. 
Notably, compared with utilizing hyperparameter configurations from the VeRL Recipe, JF-HPO delivers performance improvements ranging from 5.8% to 111.6%.","published_date":"2026-06-02T03:02:06+00:00","viability_score":4,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"JF-HPO efficiently optimizes hyperparameters for LLM reinforcement learning by jointly adapting model size and training budget, using proxy models and early stopping to reduce computational cost.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03070v1","title":"ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information","abstract":"Asynchronous reinforcement learning can improve language-model post-training throughput by decoupling response generation from policy optimization, but stale responses introduce distribution drift. Standard behavior-corrected methods control this drift with behavior-policy probabilities, importance ratios, or clipping, which requires token-aligned, versioned, and numerically consistent behavior log-probabilities across rollout and learner systems. We ask whether asynchronous group-relative RL can instead be stabilized using only current-policy probabilities. We identify a scale-imbalance failure mode: when stale responses are evaluated under the current policy, positive and negative loss terms can appear at different negative log-probability scales, so zero-sum advantages no longer imply balanced loss contributions. We propose Asymmetric-Scale Policy Optimization (ASymPO), which normalizes each response's token loss by its current average token negative log-probability. ASymPO requires no behavior-policy probabilities, restores response-level zero-sum balance, and preserves a nonzero learning signal. 
We also introduce Scaled Policy Optimization (SPO), a fixed negative-scaling baseline, and evaluate both current-policy-only objectives in asynchronous mathematical reasoning post-training.","published_date":"2026-06-02T03:00:34+00:00","viability_score":4,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"ASymPO enables asynchronous LLM post-training without behavior information by normalizing token losses to address scale imbalance, preserving a learning signal.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03069v1","title":"ROBUST-WT: Robust Uncertainty-aware Segmentation Transform via Whitening and Training Enhancements","abstract":"Generalized segmentation of medical images prevents performance degradation when different imaging devices and clinical protocols are used across multiple domains. The Whitening Transform-based Probabilistic Shape Regularization Extractor (WT-PSE), published in IEEE Transactions on Medical Imaging in 2024, addresses this challenge by employing feature decorrelation and Wasserstein distance-based knowledge distillation to achieve robust cross-domain segmentation. This study systematically examines improvements to the WT-PSE learning framework. Four limitations in the original implementation are identified: limited training augmentations that fail to simulate real scanner variations, reliance on per-pixel binary cross-entropy loss that is sensitive to edge noise, the absence of a scheduled loss weighting strategy that may destabilize early training, and the lack of ablation switches for controlled scientific comparison. 
To address these issues, we propose four enhancements: (1) domain-adaptive augmentation including random erasing, gamma correction, and salt-and-pepper noise; (2) a hybrid BCE and Dice loss function for improved edge-aware segmentation under noisy conditions; (3) a curriculum-based Dice weight scheduling strategy; and (4) command-line control flags for systematic ablation studies. Experiments on the fundus optic disc segmentation benchmark demonstrate that the improved pipeline achieves a final epoch optic-disc Dice score of 0.956 and an ASD score of 13.31, outperforming the baseline epoch-5 Dice score of 0.939. These results indicate that training-level improvements can provide consistent performance gains without modifying the underlying WT-PSE architecture.","published_date":"2026-06-02T02:59:35+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":"https://github.com/213269/WT-PSE-code-main","commercial_flags":["has_code"],"one_liner":"Enhance medical image segmentation robustness by improving training augmentations, loss functions, and scheduling strategies for existing models.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03068v1","title":"Learn When and Where to Connect: Adaptive Virtual Nodes for Dynamic Message Passing on Graphs","abstract":"While Virtual Nodes (VNs) are often utilized in Message Passing Neural Networks (MPNNs) to facilitate effective message passing, existing VN-based methods have limitations, such as constraining all nodes to connect to the same number of VNs, fixing the connections before applying MPNNs, and connecting a node to a VN independently of the other nodes that connect to the same VN. We propose MAVN, an end-to-end differentiable MPNN framework that allows non-constrained connections between nodes and VNs and dynamically introduces VNs on demand in response to evolving node representations across layers. 
Specifically, MAVN learns to adaptively determine when (at which layer) and where (to which nodes) to introduce and connect VNs based on the relative importance of connections. From a pool of candidate VNs, MAVN selects the necessary VNs in each layer, where each selected VN is connected to a nonempty subset of nodes, guided by a dual-perspective scoring mechanism that jointly captures the nodes' preferences for VNs and the VNs' preferences for nodes. We theoretically prove that for any node-VN connectivity pattern, there exists a set of MAVN's parameters that can simulate the pattern. Experiments on nine real-world datasets demonstrate that MAVN consistently improves the performance of backbone MPNNs, achieving up to 46.5% improvement over the backbones and outperforming the baselines.","published_date":"2026-06-02T02:57:13+00:00","viability_score":7,"cluster_label":"Graph Neural Networks","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Develop an adaptive framework for message passing neural networks that dynamically introduces virtual nodes to improve graph representation learning.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03066v1","title":"CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection","abstract":"The rapid rise of generative AI has made multimodal fake news increasingly realistic and pervasive, posing severe threats to public trust and social stability. Existing detection methods rely heavily on manipulation-specific models and large-scale labeled data, resulting in poor generalization to emerging manipulation types. We observed that the essence of manipulated misinformation lies in its intrinsic conflicts, \\textbf{i.e.,} semantic or physical inconsistencies either across modalities or with common world knowledge. 
Inspired by this observation, we propose the Conflict-Oriented REasoning (CORE) framework, an effective paradigm that learns to endow multimodal large language models (MLLMs) with explicit conflict-capturing capability. To this end, CORE first constructs the Conflict Attribution Corpus (CAC) with fine-grained annotations of conflict factors and sources, providing essential data support for subsequent conflict perception training. By performing conflict-oriented representation enhancement and reasoning based on CAC, CORE achieves robust and generalizable conflict detection, effectively and rapidly adapting to unseen manipulation types with a few samples or even in zero-shot settings. Extensive experiments demonstrate that CORE surpasses state-of-the-art models. The dataset and code are publicly available at https://github.com/shen8424/CORE.","published_date":"2026-06-02T02:53:48+00:00","viability_score":9,"cluster_label":"Multimodal AI","has_code":true,"repo_url":"https://github.com/shen8424/CORE","commercial_flags":["has_code"],"one_liner":"Build a multimodal AI system that detects fake news by identifying intrinsic conflicts across modalities and world knowledge, with zero-shot adaptation to new manipulation types.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.03061v1","title":"Brief Announcement: Generative Markov Model for Distributed Computing Systems","abstract":"Emerging distributed computing paradigms, such as the computing continuum, are inherently heterogeneous, stochastic, and complex. Efficiently and effectively utilizing all available resources across the continuum demands a unified formal model of the system. To address this gap, we propose a general framework for modeling distributed computing systems as a generative Markov model, factorized over a structured system state. 
In our model, the state decomposes into high-dimensional variables, each further factorized over its elements, reflecting the sparse dependency structure inherent to distributed systems. This yields a tractable model enabling simulation, inference, and policy learning over otherwise intractable system states, bridging distributed computing with Markov chain theory and reinforcement learning (RL). We demonstrate our framework through a case study of collaborative AI inference, in which a dedicated server combines resources with those volunteered by service users. Our results show that centralized scheduling becomes a bottleneck at scale, while distributing computation across user devices reduces both latency and server resource consumption. These findings highlight the value of adaptive decision-making in distributed computing systems and demonstrate the framework's utility for modeling, simulation, and optimization.","published_date":"2026-06-02T02:50:58+00:00","viability_score":6,"cluster_label":"Distributed Computing","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Model complex distributed computing systems as generative Markov models to enable efficient simulation and adaptive resource allocation for collaborative AI inference.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03057v1","title":"Rethinking Molecular Text Representations for LLMs: An Empirical Study","abstract":"Large language models (LLMs) are increasingly used for molecular tasks, but it remains unclear which molecular representation to use. We present a systematic benchmark evaluating LLM molecular competence across nine representations and eight chemical tasks. We benchmark 16 LLMs across five model families, including reasoning and non-reasoning variants, chemistry-specialized LLMs, and closed frontier models. 
Performance is strongly representation-dependent and no single representation wins across tasks, though CML is the best, followed by MolJSON, InChI, and then canonical SMILES. Explicit structured text representations (CML and MolJSON) dominate structural tasks; IUPAC dominates semantic tasks, winning molecule retrieval for all 16 LLMs; and SMILES variants are rarely optimal despite their prevalence in pretraining. Chemistry-specialized models perform well with SMILES at the cost of large degradations with structured text representations, suggesting SMILES-only evaluation rewards specialization that does not generalize. Using LLM-as-a-judge, we find that IUPAC produces the highest fraction of correct molecule generations. A mechanistic study via tokenization audits, linear probes and attention shows that representations are encoded differently inside the model; for example, structured representations require higher attention across the molecular span. Our results argue against representation-invariant evaluation and motivate task-aware representation routing for LLM-based chemistry.","published_date":"2026-06-02T02:45:41+00:00","viability_score":4,"cluster_label":"LLM Chemistry","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research benchmarks molecular representations for LLMs in chemistry tasks, identifying optimal representations and suggesting task-aware routing for improved performance.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.03056v1","title":"SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale","abstract":"As LLM agents adopt large skill libraries, selecting the right subset becomes a structural problem rather than a similarity-matching one: skills depend on, conflict with, specialize, or duplicate one another, a structure invisible to both full enumeration and embedding similarity. 
We present SkillDAG, which models inter-skill relationships as a typed directed graph and exposes it to an LLM agent as an inference-time, agent-callable structural retrieval interface, queried and evolved during execution rather than baked into a fixed retrieval pipeline: each search returns vector matches, typed-edge neighbors, and conflict signals, and a propose-then-commit protocol lets the agent register execution-backed edges so the graph accumulates structure across episodes. On ALFWorld and SkillsBench with MiniMax-M2.7, SkillDAG reaches 67.1% success and 27.3% reward, exceeding the strongest reported Graph-of-Skills baseline by +12.8 and +8.6 points; the advantage ports to gpt-5.2-codex, and intrinsic SkillsBench Ret@K rises from 65.5 to 78.2 under matched queries. These gains trace to isolable mechanisms: candidate ranking that stays robust as the pool grows 10x where a fixed seeding-diffusion pipeline degrades, and set-monotone online edits that enlarge ground-truth recall without evicting prior hits.","published_date":"2026-06-02T02:45:21+00:00","viability_score":5,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"SkillDAG introduces a self-evolving typed skill graph for LLM agents, improving skill selection by modeling inter-skill relationships and evolving the graph during execution.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03054v1","title":"ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents","abstract":"Tool-augmented vision-language agents can acquire external perceptual evidence through OCR, detection, segmentation, and other tools, but executing every proposed tool call is costly and sometimes unnecessary. We study the pre-call control problem: after a ReAct-style VLM agent proposes a perceptual tool call, should the call be executed, or skipped before its output enters the context? 
Across five benchmarks, we find that the baseline agent exhibits poor local selectivity: helpful and harmful calls occur at similar rates (11.8% vs. 9.9%), while most calls do not change the immediate forced-answer prediction. We introduce ToolGate, a lightweight external controller that predicts execute/skip decisions from trajectory text and simple structural features. Across two Qwen3-VL backbones, ToolGate reduces token cost to 64-69% of the unrestricted ReAct baseline while preserving average accuracy in cross-domain settings. With matched-domain trajectory training on Qwen3-VL-30B, it further improves average accuracy by 1.65 points. These results show that tool-augmented VLM agents benefit not only from better perceptual tools, but also from explicit control over when tool outputs are worth paying for.","published_date":"2026-06-02T02:44:27+00:00","viability_score":7,"cluster_label":"Vision-Language Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ToolGate is a lightweight controller for tool-augmented vision-language agents that efficiently decides when to execute perceptual tool calls, reducing costs and improving accuracy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03040v1","title":"RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases","abstract":"Relational databases underpin modern enterprise, scientific, and healthcare systems, yet predictive machine learning on such data remains challenging due to their multi-table, heterogeneous, and temporal structure. Relational Deep Learning (RDL) addresses this by representing databases as heterogeneous graphs and applying graph neural networks (GNNs) directly. RelBench v2 recently introduced autocomplete tasks -- a practically motivated task type where the goal is to predict an existing column value from relational context, analogous to an intelligent form-filling assistant. 
We propose RelGT-AC (Relational Graph Transformer for Autocomplete), extending the RelGT architecture with three targeted contributions: (1) a column masking strategy that prevents trivial solutions by masking the target column during subgraph encoding; (2) a unified task head supporting binary classification, multiclass classification, and regression autocomplete tasks within a single model; and (3) a TF-IDF text encoder that automatically detects and encodes free-text columns, recovering strong lexical signal that categorical encoders discard. Across 7 tasks spanning 3 RelBench v2 datasets (rel-trial, rel-f1, rel-stack), RelGT-AC outperforms the GraphSAGE baseline on all 3 regression autocomplete tasks and achieves up to +10 AUROC points on text-heavy eligibility tasks via the TF-IDF encoder.","published_date":"2026-06-02T02:25:53+00:00","viability_score":7,"cluster_label":"Relational Databases","has_code":true,"repo_url":"https://github.com/jiangdmv/graph-transformer","commercial_flags":["has_code"],"one_liner":"RelGT-AC is a Graph Transformer for relational database autocomplete tasks, improving prediction accuracy with a novel column masking strategy, unified task head, and TF-IDF text encoder.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03036v1","title":"TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment","abstract":"LLMs have evolved from basic chatbots to the backbone of the AI ecosystem, now widely used in healthcare, schools, and government services. The domain-wide adoption of LLMs necessitates continuous evaluation to ensure their safety and fairness. Common issues encountered after deploying LLMs include inconsistent outputs and hallucinations of incorrect information. Although numerous LLM evaluation tools exist, most are limited to testing a single parameter at a time or require massive computational resources that are not accessible to most researchers. 
TriEval addresses these challenges by evaluating LLM outputs across multiple parameters, including bias, toxicity, and truthfulness together, while minimizing computing resources. The pipeline is compatible with both open- and closed-source models and runs on a standard laptop without a GPU cluster. TriEval has been tested on four models: Llama 3 8B, Mistral 7B, Gemma 2 9B, and Claude Haiku. The results show clear differences between open-source and closed-source models, especially in terms of toxicity and truthfulness. TriEval is being released as open source to enable broader access for researchers with limited computational resources.","published_date":"2026-06-02T02:21:38+00:00","viability_score":5,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A resource-efficient pipeline for assessing LLM bias, toxicity, and truthfulness on a standard laptop, releasing results as open source.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.03034v1","title":"Capability Advertisement as a Market for Lemons: A Trust Layer for Heterogeneous Agent Networks","abstract":"Large language model (LLM) agents have begun to delegate work to one another. Protocols such as the Model Context Protocol (MCP) and the Agent2Agent protocol (A2A) let an agent publish what it can do and let others call it, and public registries of such agents are already appearing. These protocols assume an advertised capability is a static, truthful fact. A real agent is none of these things: its competence is probabilistic, varies with input, drifts when the underlying model is updated, and, because the agent is itself a language model, it can describe itself with complete confidence and be wrong. A caller therefore sees what an agent claims to do, not what it can do, with no principled way to tell a reliable provider from a fluent impostor.   We argue these difficulties share one cause: the market for lemons. 
When quality is hidden and claims are cheap, good and bad providers become indistinguishable, honest reliability goes unrewarded, and the market decays toward its worst participants. Economics offers three remedies, signaling, screening, and reputation, and none are present in today's agent protocols.   We make four contributions: (1) a failure taxonomy that names confident-wrong as a non-adversarial, correlated subclass of Byzantine faults that classical fault-tolerance mismodels; (2) a market-for-lemons model showing that faith-based protocols admit only a low-trust equilibrium; (3) the Trust Layer, a thin, protocol-agnostic narrow waist above MCP and A2A that adds probabilistic capability descriptors, screening, and reputation, and admits a separating equilibrium when the cost of sustaining an overclaim exceeds the gain from it; and (4) a reliability-composition bound for delegation chains with an end-to-end placement argument. The design needs no model retraining and degrades gracefully when its trust anchors are absent or corrupt.","published_date":"2026-06-02T02:17:30+00:00","viability_score":3,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical framework to address the market for lemons problem in LLM agent networks by introducing a trust layer for capability advertisement.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03031v1","title":"AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification","abstract":"Structured financial audit verification is difficult for language-model agents because correctness depends on structured evidence rather than text alone. A model must link reported facts to taxonomy concepts, traverse calculation or dimensional relations, and recompute expected values before applying an audit rule. We propose AuditFlow, a graph-grounded multi-agent framework that separates adaptive search from deterministic verification. 
AuditFlow builds a symbolic environment from a static US-GAAP taxonomy graph and a dynamic XBRL filing graph, and exposes it through typed tools for fact retrieval, taxonomy traversal, numerical checking, and rule evaluation. Two junior auditors inspect each case from regulatory and evidentiary views, while a senior auditor resolves disagreements and can request further investigation. The final reports are fused through evidential aggregation to produce an audit verdict, expected value, evidence trail, and trustworthiness score. On a FinAuditing-derived FinMR sample, AuditFlow reaches 82.09% joint audit accuracy under GPT-5.5, outperforming the strongest baseline by 14.93 points. Removing deterministic checks drops accuracy to 17.91%, showing that the symbolic environment performs the verification step that the model cannot reliably replace.","published_date":"2026-06-02T02:14:42+00:00","viability_score":4,"cluster_label":"Financial AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"AuditFlow, a graph-grounded multi-agent framework for structured financial reporting verification that achieves 82.09% joint audit accuracy.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03029v1","title":"Conditional Hypothesis Generation for LLM-Based Text Analysis with Researcher-Specified Covariates","abstract":"A core goal of computational social science is to discover interpretable differences in how language varies across outcomes of interest, such as political affiliation or instructional quality. Recent LLM-based hypothesis generation methods describe such differences in natural language, but select for globally discriminative patterns without accounting for covariates that shape the data based on researchers' domain knowledge. When covariates are ignored, selected patterns can reflect confounds rather than differences of substantive interest. 
We introduce conditional hypothesis generation, a framework that incorporates researcher-specified covariates to steer hypothesis discovery toward differences that hold within relevant subgroups. Two challenges arise: the target subgroup may be underrepresented (stratum imbalance), and the direction of a difference may reverse across subgroups (sign reversal). We propose two econometrics-inspired methods: one introduces feature--covariate interactions to detect sign reversals, and the other applies within-stratum demeaning and inverse-frequency reweighting to equalize underrepresented strata. Synthetic experiments show each method outperforms global baselines in its targeted setting, and expert evaluation on two real-world datasets confirms that covariate-aware generation surfaces more useful hypotheses within relevant subgroups.","published_date":"2026-06-02T02:07:46+00:00","viability_score":5,"cluster_label":"LLM Analysis","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for conditional hypothesis generation that incorporates researcher-specified covariates to discover interpretable language differences within relevant subgroups.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.03026v1","title":"Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs","abstract":"Spiking language models expose activation sparsity that dense Transformer runtimes do not directly exploit. This paper studies that property from a systems perspective. Building on the SymbolicLight V1 spike-gated language model family, we implement a C++ CPU inference runtime that treats sparse binary spike states as an execution primitive rather than only applying post-hoc weight compression. 
The runtime combines a manifest-driven weight loader, mixed row/column memory layout, AVX2/FMA kernels, per-channel symmetric INT8 quantization, and integer-domain accumulation for spike-conditioned sparse paths. On an AMD Ryzen 7 5800X, an early scalar FP32 baseline decodes at 9.5 tokens/s. Mixed-layout AVX2 FP32 raises this to 14.7 tokens/s, and AVX2 INT8 reaches 19.9 tokens/s on the same step-30k export while reducing the weight footprint from 3.49 GB to 1.06 GB. For the available 186k-step 874M-parameter INT8 export, the C++ runtime decodes at 22.63 tokens/s in a single-thread CPU benchmark, compared with 16.31 tokens/s for TinyLlama-1.1B Q8_0, 11.26 tokens/s for Falcon3-1B Q8_0, and 9.70 tokens/s for Qwen2.5-1.5B Q8_0 under llama.cpp. Thread scaling reaches 47.90 tokens/s at four CPU threads, and 512-token prefill improves from 29.86 to 94.68 tokens/s from one to eight threads. The throughput result comes with a quality cost: the SNN reports WikiText-2 perplexity 24.80, worse than the dense baselines in the same benchmark. We frame the result as an inference-systems study for sparse language runtimes, with longer-term motivation in embodied and edge agents that may benefit from local, low-core inference near sensors and actuators. 
Spike-aware execution can improve CPU throughput and memory behavior for sparse spiking language models, while model quality, controlled dense training baselines, embodied-task evaluation, and measured CPU energy remain open problems.","published_date":"2026-06-02T02:03:37+00:00","viability_score":5,"cluster_label":"LLM Inference Optimization","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A C++ inference runtime for spiking language models that significantly boosts CPU throughput and reduces memory footprint by exploiting activation sparsity.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2606.03005v1","title":"MUSE: A Unified Agentic Harness for MLLMs","abstract":"Despite rapid progress, multimodal large language models (MLLMs) still fail on tasks that humans solve effortlessly, such as navigating a grid maze from a screenshot or selecting the correct puzzle piece. Rather than retraining the model, we ask a complementary question: how much capability can be elicited from a frozen MLLM purely by improving the execution scaffold around it? We introduce MUSE, a multimodal unified structured execution harness that wraps any off-the-shelf MLLM with composable modules for task representation, visual processing, perception tool use, structured parsing, deterministic verification, and verifier-guided repair, without any model retraining. We evaluate MUSE across diverse benchmarks spanning visual spatial planning, visual perception, multimodal reasoning, and fine-grained visual discrimination, using multiple state-of-the-art MLLMs. MUSE delivers consistent gains over the bare model in all settings, with the largest jumps on challenging instances. Further analysis reveals that many MLLM failures arise from harness-level shortcomings rather than fundamental model deficits, and can be addressed through verifier-guided repair without touching the model. 
These findings highlight the agentic multimodal harness as a critical yet underexplored design dimension, offering an orthogonal avenue for improving MLLMs beyond model-centric optimization.","published_date":"2026-06-02T01:24:30+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":"https://github.com/Jianglin954/MUSE","commercial_flags":["has_code"],"one_liner":"A framework that enhances existing multimodal LLMs with modular components for task execution and repair, improving performance without retraining.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.03003v1","title":"Exact equivariance, kept through training, buys zero-shot generalisation across the symmetry group","abstract":"A latent world model built from an equivariant encoder $E$ and an equivariant predictor $f$ inherits a provable symmetry of its training loss: when the world's dynamics genuinely carries a group $G$ acting on latents by an orthogonal representation $\u03c1(g)$, the one-step prediction relMSE is exactly invariant across the whole group, so fitting the dynamics on a restricted slice of orientations mathematically determines it on the entire orbit (j\u01d4 y\u012b f\u01cen s\u0101n). We verify this end-to-end at laptop scale (CPU/MPS, fully seeded). [A] The symmetry survives a real Muon/AdamW + EMA + VICReg run -- composed encode-then-predict residual $\\sim 10^{-6}$ after optimisation, not just at initialisation, and under any optimiser. [B] One-step error is flat to five digits across the group, while a same-hypothesis-class non-equivariant baseline fits the slice but breaks out-of-distribution (VN $\\times 1.00$ vs baseline $\\times 13.8$ in 2D, $\\times 17.2$ in 3D, $\\times 157$ over the full $\\mathrm{SE}(3)$ ladder), with the equivariant model $4.5$-$7.4\\times$ smaller. 
[C] The same isometry argument lifts to closed loop: under a matching equivariant planner the control trajectory at orientation $g$ is exactly $\u03c1(g)$ applied to the seen one, so closed-loop error is invariant across the group -- float-floor-exact in 2D/$\\mathrm{SO}(2)$ on real PushT and statistically flat in 3D/$\\mathrm{SE}(3)$ (disjoint 95% CIs). We stress-test the prior against Sutton's Bitter Lesson: augmentation, brute-force scale, and soft-equivariance each close at most the across-group task metric, never the float-floor exactness. Because equivariance is closed under composition, the $H$-fold rollout stays flat ($\\times 1.00$, $\\le 2\\times 10^{-7}$) at every horizon, while the baseline's residual compounds with $H$. Out of scope: task-success sweeps, planner-free invariance, and scaling.","published_date":"2026-06-02T01:20:24+00:00","viability_score":3,"cluster_label":"LLM Quantization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Shows that an exactly equivariant latent world model fitted on a restricted slice of orientations generalises zero-shot across the whole symmetry group, in both one-step prediction and closed-loop control.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.03002v1","title":"How Quantization Changes Interpretable Features: A Sparse Autoencoder Analysis of Language Models","abstract":"Quantization is a standard path to deploying large language models, and a quantized model is typically judged acceptable when its perplexity or downstream accuracy stays close to the full-precision original. Whether the model still computes in the same way, or whether the interpretable features identified in the full-precision model survive weight rounding, is rarely tested, even as safety audits and steering interventions increasingly rely on those features. We ask whether sparse autoencoder (SAE) features extracted from a dense full-precision model remain faithful once that model is quantized. 
Using a frozen SAE as a fixed measurement basis, we encode full-precision and round-to-nearest (RTN) quantized activations on identical tokens and quantify per-feature survival by Pearson correlation, sweeping bit-widths from INT8 to INT4 on Pythia-70M and Gemma-2-2B. We find that feature survival is graded: features degrade systematically rather than failing all at once, with 62.4 percent of active features surviving at INT6 on Pythia-70M and 51.3 percent surviving at INT6 on Gemma-2-2B, and with most non-survivors blurred rather than destroyed. Survival is predictable from full-precision statistics alone, with cross-validated AUCs of 0.92 to 0.97 and peak activation as the strongest marginal predictor. Critically, task metrics can miss this damage: on Gemma-2-2B, INT7 improves perplexity while degrading 18.7 percent of features. Finally, quantization and matched-perplexity magnitude pruning damage strongly overlapping feature sets, with Jaccard overlap of 0.79 to 0.86 and damage-score Spearman correlation of 0.98, suggesting a shared mode of compression-induced vulnerability. These results show that behavioral parity is insufficient evidence that interpretability findings transfer to quantized deployments, motivating feature-level audits of compression.","published_date":"2026-06-02T01:17:05+00:00","viability_score":3,"cluster_label":"LLM Quantization","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"Investigates the impact of quantization on interpretable features in language models, revealing that feature degradation is systematic and can be predicted.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02995v1","title":"Patcher: Post-Hoc Patching of Backdoored Large Language Models","abstract":"Large language models remain vulnerable to jailbreak backdoor attacks, where adversaries poison safety alignment data to embed hidden triggers that bypass safety mechanisms. 
Existing defenses often require comprehensive attack information or multiple triggered examples, making them impractical when defenders only observe a single reported failure case without knowing whether it stems from a backdoor attack or a natural alignment bug. This paper presents Patcher, a post-hoc defense framework that repairs backdoored language models using only a single reported failure case and the model parameters. Patcher operates in two stages. First, it localizes backdoor triggers by computing response-conditioned gradient-based saliency scores and applying adaptive clustering to separate triggers from benign context. Second, it patches the model through a constrained fine-tuning objective that breaks the trigger-response association while preserving benign-task utility and robustness to non-triggered jailbreak attacks through KL-divergence constraints. We conduct extensive evaluations across multiple backdoor attack strategies and demonstrate that Patcher successfully localizes triggers and neutralizes backdoors while maintaining model utility. We further show robustness against adaptive attacks designed to evade our defense. This work represents a significant step toward practical defenses against training-time attacks in deployed language models.","published_date":"2026-06-02T01:11:58+00:00","viability_score":8,"cluster_label":"LLM Security","has_code":true,"repo_url":"https://github.com/openai/human-eval","commercial_flags":["has_code"],"one_liner":"A post-hoc defense framework that uses a single failure case to repair backdoored large language models by localizing and neutralizing triggers.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.02994v1","title":"Inducing Reasoning Primitives from Agent Traces","abstract":"ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. 
We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 -> 74), +30pp on MuSR team allocation (38 -> 68), and +22pp on NatPlan meeting planning (7 -> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.","published_date":"2026-06-02T01:11:15+00:00","viability_score":4,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper introduces a method to extract reusable reasoning primitives from LLM agent traces, creating a library of pseudo-tools that improve agent performance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02991v1","title":"Pretraining Language Models on Historical Text","abstract":"We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. 
Furthermore, we introduce lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents. Using this framework we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.","published_date":"2026-06-02T00:59:06+00:00","viability_score":4,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper presents TypewriterLM, a language model trained on historical text, along with a corpus and evaluation benchmark for historical language modeling.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.02979v1","title":"Towards Compact Autonomous Driving Perception with Balanced Learning and Multi-sensor Fusion","abstract":"We present a novel compact deep multi-task learning model to handle various autonomous driving perception tasks in one forward pass. The model performs multiple views of semantic segmentation, depth estimation, light detection and ranging (LiDAR) segmentation, and bird's eye view projection simultaneously without being supported by other models. We also provide an adaptive loss weighting algorithm to tackle the imbalanced learning issue that arises from the large number of tasks. Through data pre-processing and intermediate sensor fusion techniques, the model can process and combine multiple input modalities retrieved from RGB cameras, dynamic vision sensors (DVS), and LiDAR placed at several positions on the ego vehicle. Therefore, a better understanding of a dynamically changing environment can be achieved. 
Based on the ablation study, the model variant trained with our proposed method achieves better performance. Furthermore, a comparative study is also conducted to clarify its performance and effectiveness against the combination of some recent models. As a result, our model maintains better performance even with far fewer parameters. Hence, the model can run inference faster with less GPU memory utilization. Moreover, the results tend to be consistent across 3 different CARLA simulation datasets and 1 real-world nuScenes-lidarseg dataset. To support future research, we share code and other files publicly at https://github.com/oskarnatan/compact-perception.","published_date":"2026-06-02T00:35:42+00:00","viability_score":7,"cluster_label":"Autonomous Driving AI","has_code":true,"repo_url":"https://github.com/oskarnatan/compact-perception","commercial_flags":["has_code"],"one_liner":"A compact deep learning model for autonomous driving perception that fuses multiple sensor inputs and achieves high performance with fewer parameters.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.02974v1","title":"WISE-HAR: A Generalizable Ensemble Deep Learning Framework for WiFi-Based Human Activity Recognition","abstract":"Human Activity Recognition (HAR) using WiFi signals has emerged as a transformative technology for smart homes, healthcare monitoring, security systems, and ambient assisted living. Unlike traditional camera-based systems that raise significant privacy concerns and fail in low-light conditions, or wearable sensors that require user compliance, WiFi-based HAR is non-intrusive, privacy-preserving, cost-effective, and works seamlessly in any lighting condition. This paper presents a comprehensive approach to recognize three distinct human activities: \"No Presence\" (empty room), \"Walking\", and \"Walking + Arm-waving\" using the Wallhack1.8k WiFi spectrogram dataset. 
We propose three key improvements to address the main challenges in WiFi-based HAR. First, to address high performance variance, we implement ensemble learning with five different CNN architectures (Deep CNN, Wide CNN, MobileNetV2, ResNet50V2, and EfficientNetB0). Second, to address the small dataset size limitation, we apply aggressive data augmentation techniques including time-warping, frequency masking, and noise addition. Third, to evaluate real-world generalization capability, we perform cross-scenario evaluation (training on Line-of-Sight and testing on Non-Line-of-Sight) and cross-antenna evaluation (training on Biquad antenna and testing on PIFA antenna). Our ensemble model achieved a test accuracy of 94.87% on the LOS scenario with Biquad antenna, outperforming the best individual model by 0.66%. Data augmentation improved Random Forest performance from 60% to 95%. Cross-scenario evaluation showed minimal accuracy drops of only 1.37% and 2.07%, demonstrating strong generalization capabilities. 
The results indicate that the proposed approach is robust, reliable, and suitable for real-world deployment in diverse environments with different hardware configurations.","published_date":"2026-06-02T00:25:46+00:00","viability_score":7,"cluster_label":"Human Activity Recognition","has_code":true,"repo_url":"https://github.com/maheenarshad198-jpg/HAR","commercial_flags":["has_code"],"one_liner":"An ensemble deep learning framework for WiFi-based human activity recognition that offers privacy-preserving, cost-effective, and generalizable solutions.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2606.02967v1","title":"Glass Box at Orbit: A Constitutional AI Verification Framework for Trustworthy Autonomous CubeSat Intelligence","abstract":"The space industry is quietly building toward something nobody has fully reckoned with: orbital data centers running thousands of autonomous AI workloads with no human in the loop, 550 km above the Earth. Microsoft, AWS, and a growing list of orbital computing ventures are moving cloud-scale processing off the ground and into orbit. What none of them have answered yet is the governance question -- when autonomous AI systems at orbital data center scale make wrong decisions in space, what stops those decisions before they become irreversible?   We introduce Glass Box: a runtime constitutional AI verification layer that intercepts every candidate action from an onboard AI policy and evaluates it against six physics-grounded constitutional constraints and seven Linear Temporal Logic (LTL) safety invariants before a single command reaches any spacecraft subsystem. Every approved action carries a weighted explainability score E(a_t) in [0,1] and a complete constitutional audit log. We demonstrate Glass Box within Project October: a fully simulated five-layer autonomous orbital intelligence architecture for CubeSat-class spacecraft.   
We prove that Glass Box verification overhead is O(N_c) in the number of constitutional rules, independent of model size or spacecraft state dimension. We present a complete formal specification of the constitutional constraint grammar, seven LTL safety invariants verified by Z3 and NuSMV model checking, and a detailed worked example of Glass Box intercepting an unsafe inference request at eclipse-entry under degraded battery state. As orbital computing scales toward data center infrastructure, runtime constitutional verification is no longer a research novelty -- it is mission-critical safety infrastructure that every autonomous orbital platform will eventually require.","published_date":"2026-06-02T00:09:57+00:00","viability_score":7,"cluster_label":"AI Safety & Verification","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A runtime verification framework for autonomous orbital AI systems that ensures safety through constitutional constraints and LTL invariants, with demonstrable overhead independence from model size.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.02965v1","title":"What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents","abstract":"Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained under human-feedback objectives develop a structural tendency to proceed even when they lack the inputs, evidence, or authorization to act safely, a disposition we term compliance bias, because both the reward signal and the benchmark scoring regime treat proceeding as the correct default regardless of whether the preconditions for safe action are present. We make three contributions. 
We first show that compliance bias originates in reward hacking within human-feedback pipelines and is entrenched by prominent agent benchmarks, which either penalize agents for pausing or are architecturally unable to distinguish a principled pause from a silent failure. We then introduce a three-gap taxonomy of abstention-warranted scenarios, covering specification gaps where required information is absent, verification gaps where world state cannot be confirmed, and authority gaps where explicit authorization has not been given, which together provide a principled basis for constructing abstention-aware agent benchmarks. Finally, we propose abstention evaluation protocols (Safety Rate, Usability Rate, and Informed Refusal Rate) and report preliminary results across 144 enterprise agent scenarios and five model families, in which a runtime-enforced abstention mechanism achieves up to 89.2% hazardous-action blocking and 87.5% usability on authorized scenarios, demonstrating that the safety--usability tradeoff is tunable rather than inherent and that its shape varies substantially across model families. We treat this as preliminary work and offer the taxonomy and composite metrics as a starting point for further conversations.","published_date":"2026-06-01T23:52:56+00:00","viability_score":7,"cluster_label":"Autonomous Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for evaluating autonomous agents' abstention competence, introducing a taxonomy for abstention-warranted scenarios and novel evaluation protocols to improve safety and usability.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.02962v1","title":"Hand Trajectory Fusion for Egocentric Natural Language Query Grounding","abstract":"Egocentric Natural Language Query (NLQ) grounding asks a model to localize, in a long first-person video, the temporal interval that answers a free-form text query. 
Existing methods fuse video appearance with the query but ignore hand motion, despite the fact that roughly 41% of Ego4D NLQ queries are answered at a moment of hand--object manipulation or its immediate outcome. We propose a hand-trajectory encoder for converting a sequence of hand skeletons into highly semantic hand kinematic features, which are then aligned and combined with pretrained video--text features through a cross-attention fusion strategy with adaptive gating. On the Ego4D NLQ v2 validation split, the clearest gains appear for Hand-Object Interaction queries (+2.54 R1@IoU=0.3) and Quantity/State queries (+4.32 R1@IoU=0.3), indicating that hand trajectory provides grounding cues beyond appearance alone.","published_date":"2026-06-01T23:46:18+00:00","viability_score":4,"cluster_label":"Egocentric Vision","has_code":true,"repo_url":"https://github.com/cvpr-org/author-kit","commercial_flags":["has_code"],"one_liner":"A novel hand-trajectory encoder fused with video and text features to improve egocentric natural language query grounding, particularly for hand-object interaction queries.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02958v1","title":"Echelon: Auditable Aggregate-Only Language-Model Adaptation Across Privacy Boundaries","abstract":"Cross-organization language-model adaptation increasingly faces hard governance constraints: in many deployments, device-level model state (parameters, activations, optimizer state, and per-device updates) cannot be exported outside an administrative boundary. Existing distributed and federated stacks typically assume cross-site model exchange and then retrofit privacy mechanisms, which complicates compliance and makes auditing brittle. We present Echelon, a boundary-first training architecture that enforces device-level model-state non-export as a systems invariant. 
Devices train locally inside each boundary; the only cross-boundary payloads are securely aggregated boundary-level deltas plus O(1) coordination metadata, exposed through a concrete audit surface. Restricting exchange to aggregates changes the optimization problem: the system must remain stable under WAN delay, heterogeneous participation, churn, and non-IID data even though the global plane never sees per-device updates. Echelon combines buffered semi-asynchronous secure aggregation, staleness-aware weighting, participation windows, proximal local objectives, and a drift-aware outer synchronization controller. In 1B-parameter LoRA adaptation across M = 2 boundaries, a budget-matched contest over three seeds (24.88M tokens) reaches validation loss 3.887 +/- 0.010 and is best or tied-best among tuned low-communication baselines under fixed-token, fixed-bytes, fixed-wall-clock, and fixed-sync-count budgets. In OpenWebText stress tests, Echelon sustains 2,139-2,176 tokens/s across evaluated WAN and non-IID treatments, Echelon-DA improves time-to-target under WAN latency relative to a privacy-parity DiLoCo+SA baseline, and quality degrades by at most 2.2% under 200ms emulated latency or severe non-IID partitioning.","published_date":"2026-06-01T23:28:29+00:00","viability_score":6,"cluster_label":"LLM Adaptation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Echelon is a boundary-first training architecture for auditable, aggregate-only language model adaptation that enforces device-level model-state non-export, enabling cross-organization deployments.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2606.02955v1","title":"Fast-dLLM++: Fr\u00e9chet Profile Decoding for Faster Diffusion LLM Inference","abstract":"Diffusion large language models promise parallel token generation, yet inference remains bottlenecked by deciding which masked tokens can be safely committed together. 
Fast-dLLM addressed this with KV caching and confidence-guided parallel decoding, but its decoding theory uses a homogeneous high-confidence assumption that effectively reduces each candidate set to its weakest selected token. We argue that this leaves speed on the table because real decoding steps exhibit heterogeneous confidence profiles. We propose \\textbf{Fast-dLLM++}, a training-free extension that introduces \\emph{Fr\u00e9chet profile decoding}: selecting parallel commit sets from the full sorted confidence profile rather than a single worst-case confidence. The resulting rule is a heterogeneous-confidence generalization of Fast-dLLM's factor selector and it recovers the previous rule exactly in the equal-confidence case and adds a provable \\emph{heterogeneity bonus} when the selected tokens have uneven confidences. Fast-dLLM++ leaves the model, diffusion process, and cache implementation entirely unchanged, making it a drop-in replacement for existing Fast-dLLM decoding. Experiments on GSM8K, MATH, HumanEval, and MBPP with the LLaDA-8B model show that the theoretical improvement translates directly into empirical gains: profile-aware selection improves the accuracy--throughput frontier by exploiting safe parallelism that weakest-token rules miss, achieving up to 37\\% higher throughput at comparable accuracy. 
Our anonymous code release is at https://github.com/Ringo-Star/FastdLLM_plusplus.","published_date":"2026-06-01T23:18:59+00:00","viability_score":7,"cluster_label":"LLM Inference Optimization","has_code":true,"repo_url":"https://github.com/Ringo-Star/FastdLLM_plusplus","commercial_flags":["has_code"],"one_liner":"A drop-in decoding optimization for diffusion LLMs that significantly increases throughput by intelligently exploiting heterogeneous token confidences, without retraining the model.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.02907v1","title":"Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States","abstract":"Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and $\u03b1$NLI (abductive). At layer 32 of 40, linear probes achieve 100\\% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6, 28.5, 33.6; convex hull contamination $\\leq$1.5\\%). However, this separation is entirely driven by format confounds. Residualizing source identity, option count, and response length reduces accuracy to chance. Trace-anchor similarity indicates largely shared reasoning across tasks (42.5\\% agreement vs.\\ 33.3\\% chance), and causal steering with random controls ($n=20$) shows no functional link between geometry and reasoning mode ($p=0.286$). 
Thus, high probe accuracy reflects task format rather than computational structure, motivating routine format deconfounding in mechanistic interpretability.","published_date":"2026-06-01T21:22:15+00:00","viability_score":3,"cluster_label":"LLM Interpretability","has_code":true,"repo_url":"https://github.com/SubramanyamSahoo/Linear-Probes-Detect-Task-Format-Not-Reasoning-Mode","commercial_flags":["has_code"],"one_liner":"This research investigates the limitations of linear probing in understanding language model reasoning, suggesting a need for improved interpretability methods.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02886v1","title":"Scalable Uncertainty Quantification for Extreme Weather Forecasting via Empirical Neural Tangent Kernels","abstract":"Deep learning weather models now match numerical weather prediction accuracy while running orders of magnitude faster, but produce deterministic forecasts without uncertainty estimates, a critical gap for high-stakes decisions during extreme weather events. This paper proposes Neural Tangent Kernel-based uncertainty quantification (NTK-UQ) using last-layer empirical features. Theoretical analysis predicts that UQ quality is architecture-dependent through two mechanisms. First, a variance collapse mechanism explains when UQ fails: when the eigenvalue truncation rank approaches the effective rank of the feature space, the GP correction term consumes nearly all prior variance, destroying discrimination between tropical cyclones and routine conditions; architectures with concentrated spectra (spectral operators) require aggressive truncation ($k \\leq 10$), while attention-based models tolerate full-rank computation. 
Second, decomposition performance depends on the non-Gaussian, heavy-tailed structure of extreme weather: Independent Component Analysis exploits higher-order statistics (kurtosis, negentropy) to isolate heavy-tailed extreme-event features, achieving higher discrimination than singular value decomposition, which captures only second-order variance. A data-driven selection rule chooses ICA or SVD from the feature eigenspectrum concentration ratio, correctly prescribing the superior decomposition for all four evaluated architectures. Compared to split conformal prediction (the natural post-hoc baseline), NTK-UQ achieves 31--37\\% sharper prediction intervals at 90\\% coverage, and uniquely produces \\emph{adaptive} intervals that scale with extreme event severity, which conformal prediction cannot achieve by construction. The framework requires no retraining; inference-time uncertainty requires only a single matrix-vector product per sample.","published_date":"2026-06-01T20:57:06+00:00","viability_score":5,"cluster_label":"AI for Weather Forecasting","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel framework for uncertainty quantification in extreme weather forecasting using neural tangent kernels, offering sharper and adaptive prediction intervals.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2606.02884v1","title":"Are we really tilting? The mechanics of reward guidance in flow and diffusion models","abstract":"Reward guidance algorithms steer a learned generative process toward the reward-tilted measure at inference time. While empirically powerful, these methods are prone to reward hacking: the guided model over-optimizes the reward at the cost of fidelity to the learned distribution. Prior work has attributed this to the complexity of neural reward functions or implicit biases in diffusion training, but its fundamental origins remain poorly understood. 
We show that reward hacking arises from an approximation made in most practical implementations of reward-guided diffusion -- finite-particle plug-in estimation of the Doob h-function -- even in the simplest non-trivial settings of Gaussian and Gaussian mixture targets with quadratic rewards. In closed form, we isolate two distinct failure modes of the plug-in estimator: it leads to reward hacking within each mode and it cannot select high-reward modes. We propose a closed-form reward damping schedule that corrects the within-mode bias with no additional compute, and clarify the role of best-of-n sampling in compensating for the mode selection failure. Experiments on Gaussian mixture targets, a 2D checkerboard, and FLUX.1 text-to-image generation confirm that our theoretical insights carry over to practical settings.","published_date":"2026-06-01T20:56:24+00:00","viability_score":4,"cluster_label":"Generative Models","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper analyzes and proposes solutions for reward hacking in generative models, specifically flow and diffusion models, by addressing approximation errors in reward guidance.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2606.02883v1","title":"LLM-Assisted Reranking to Operationalize Nuanced Objectives in Recommender Systems","abstract":"Recommender systems have grown from content-organization tools into sophisticated systems that shape daily behavior. By controlling what we see, they shape what we perceive, raising concerns about filter bubbles, radicalization, polarization, and social inequality. Large language models (LLMs) enable more powerful personalization, intensifying these dynamics. Yet most recommenders are tuned for engagement or limited accuracy metrics, with little attention to broader social implications, e.g. how personalization reshapes exposure in socially consequential domains. 
We investigate whether LLM-assisted reranking, while improving personalization, inadvertently amplifies exposure to ideologically extreme or conspiratorial political content, a risk theorized but not empirically characterized in news recommendation. Using real news-consumption histories, we rerank YouTube's sidebar candidates through zero-shot, instruction-based prompting. We compare a baseline prompt with a constrained variant that preserves topical relevance and broadens ideological exposure while reducing conspiratorial or extreme content. Without constraints, reranking strengthened personalization but increased exposure to conspiratorial and extremist material for users whose histories contained such content. Lightweight prompt-level regularization reduced promotion of extreme content and increased ideological diversity, with modest relevance loss. Synthetic experiments suggest that LLMs rerank via statistical regularities in language rather than semantic understanding of ideology, clarifying why naive prompts amplify these patterns and why regularization can reshape them. Together, our results highlight the power of LLMs to operationalize contextual nuance in high-stakes recommendation, and the need to evaluate LLM-assisted personalization beyond accuracy and treat prompt design as a value-laden rather than neutral default.","published_date":"2026-06-01T20:54:14+00:00","viability_score":5,"cluster_label":"Recommender Systems","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"LLM-assisted reranking can be used to operationalize nuanced objectives in recommender systems, mitigating risks of amplifying extreme content exposure.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2606.02875v1","title":"Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks","abstract":"Coding-agent benchmarks evaluate whether a single uninterrupted agent can resolve a repository issue. 
Real software work is messier: tasks are interrupted, reassigned, reviewed, and resumed from partial states left by another agent or engineer. We study this missing dimension through \\emph{handoff debt}: the rediscovery cost imposed when a predecessor's work is opaque or incomplete. Our takeover protocol interrupts a coding agent at deterministic handoff points, freezes the repository, and evaluates successor agents under four handoff views: repository state only, raw trace, summary notes, and structured notes. Across 75 source tasks, the protocol generates 181 handoff-point tasks and 724 takeover runs per successor model. Across three successor models, context-bearing handoffs reduce median agent events by 20--59\\% and cumulative prompt tokens by 42--63\\% relative to repository-only takeover. Solved-rate effects are smaller and model-dependent, but efficiency gains are consistent. These findings suggest that coding-agent evaluation should report not only whether a task is solved, but also how costly that work is for another agent to resume.","published_date":"2026-06-01T20:40:38+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research introduces a framework to evaluate coding agents on their ability to resume interrupted tasks, focusing on the 'handoff debt' caused by opaque prior work, with potential for improving agent collaboration and efficiency.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.02871v1","title":"Adaptive Latent Agentic Reasoning","abstract":"Large reasoning models improve performance by generating extended chain-of-thought (CoT) reasoning, but this behavior becomes inefficient when applied to LLM agents. Current LLM agents often generate verbose textual reasoning at every decision step and allocate reasoning effort nearly uniformly across turns, leading to substantial inefficiency in multi-turn agentic trajectories. 
We propose Adaptive Latent Agentic Reasoning (ALAR), a dual-mode framework that uses compact latent reasoning for routine turns and selectively escalates to explicit chain-of-thought when deeper deliberation is needed. ALAR learns latent reasoning by using the agent's actions as supervision anchors and is further optimized to use latent reasoning when it is sufficient for task success and reserve explicit CoT for harder decisions. Experiments on agentic search and tool-use benchmarks show that ALAR maintains comparable or better task accuracy while substantially reducing generated tokens by up to 43.6% in search and 84.6% in tool use. These results demonstrate that ALAR improves the accuracy-efficiency trade-off of LLM agents by reducing unnecessary textual reasoning while preserving explicit deliberation for harder decision steps.","published_date":"2026-06-01T20:36:06+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ALAR is a dual-mode framework for LLM agents that reduces token usage by using compact latent reasoning for routine turns and explicit chain-of-thought only when necessary, improving efficiency without sacrificing accuracy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.02867v1","title":"The Epi-LLM Framework: probing LLM behavioral priors through epidemiological agent-based models","abstract":"Human behaviour during epidemics affects infectious disease dynamics, but quantifying this remains deeply challenging. Here we introduce the Epi-LLM framework: a novel integration of agent-based modelling, real-life epigames, and large language models (LLMs) in which a synthetic society of agents reasons and adapts dynamically over an outbreak contact network. 
Comparing synthetic agent behaviour against a no-intervention SEIR baseline and human participant data from the AUIB epigame study, we find that LLM agents across four different architectures reduced peak active infections, with quarantine compliance peaking at 58-65% on day six of the 15-day simulation. A binomial generalised linear model showed that perceived health severity was the strongest predictor of quarantine behaviour ($\u03b2= 0.33, p = 0.002$), yielding a pseudo-$R^2$ of 0.055, comparable to the 0.072 observed in the human trial. LLM architecture is a key determinant of epidemic dynamics: low-variance architectures offer greater internal validity for testing behavioural rules, while high-variance models may better represent real-world decision-making. Geographic labels alone do not induce culturally differentiated behaviour; explicit attitudinal parameterisation is required. This proof-of-principle work lays the groundwork for deploying the Epi-LLM framework as a scalable, risk-free simulation environment for pandemic preparedness research.","published_date":"2026-06-01T20:31:06+00:00","viability_score":3,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"The Epi-LLM framework integrates agent-based models with LLMs to simulate human behavior during epidemics, offering a scalable environment for pandemic preparedness research.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02866v1","title":"When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning","abstract":"When does multi-agent debate help data cleaning, and when does it hurt? Across three benchmarks, four model families, and over 6,000 task-condition pairs, we find debate's effect reverses sign: it degrades generation across all four models (-1.6 to -15.5pp) through critique-induced confusion (CIC), hallucinated Critic feedback that the Generator accepts uncritically, yet improves error detection (+27.4pp F1, d=1.0). 
We derive a debate benefit condition: debate helps when the probability of rescuing a wrong output (Critic verification odds weighted by fixability) exceeds the probability of destroying a correct one. A factorial experiment proves adversarial separation is essential: self-verification with identical tools fails, while a separate Critic with code-execution grounding and evidence-gated generation produces the first debate configuration to significantly exceed single-agent on a generative task (+5.3pp, p<0.05). The condition correctly predicts all nine task types and generalizes with zero false positives across 19 published comparisons in seven domains.","published_date":"2026-06-01T20:29:47+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research proposes a multi-agent debate framework for data cleaning that improves error detection and generation quality by ensuring adversarial separation between critics and generators.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.02863v1","title":"Don't Gamble, GAMBLe: An Analytical Framework for AI-Driven Research Systems","abstract":"AI-Driven Research Systems (ADRS) -- systems coupling LLMs with automated evaluation to discover algorithms, proofs, and designs -- are being optimized and adopted across domains, but the tools to analyze them have not kept pace. ADRS performance depends on component interactions that are poorly understood, expensive to explore, and (as we show) not well captured by standard convergence guarantees. These guarantees rely on structural assumptions that do not hold under the ADRS process we formalize. 
We introduce GAMBLe, a framework that decomposes ADRS behavior into four parameters (generator $G$, assessor $\\mathcal{A}$, discovery mechanism $\\mathcal{M}$, budget $B$) and one compositional object, the effective landscape $L_{\\text{eff}} = \\mathcal{A} \\circ G$, which reveals that distinct generator-assessor pairs induce structurally different per-problem optimization landscapes. We exercise the framework on 760+ replicated runs (>46,000 iterations) spanning generators from single LLMs to dynamically-adaptive ensembles, mechanisms from greedy selection to co-evolutionary meta-search, and three NP-hard problems whose assessors range from continuous scoring to cliff functions. The experiments reveal no total ordering of generators or mechanisms: frontier models can underperform open-source alternatives and the simplest mechanism sometimes outperforms state-of-the-art meta-search. Results show that even under limited budgets (60 iterations per run), the right component choices can improve performance by 13-67% and search efficiency by 6-39x.","published_date":"2026-06-01T20:26:28+00:00","viability_score":3,"cluster_label":"LLM Analysis","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework to analyze the performance of AI-driven research systems by decomposing their behavior into key parameters and compositional objects.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02862v1","title":"Toward a Modular Architecture for Embedded AI Agent Systems at the Edge","abstract":"The rise of Large Language Models (LLMs) has enabled agentic AI capable of complex reasoning and tool use; however, deploying such autonomy in pervasive computing environments remains challenging due to the strict memory and energy constraints of embedded microcontrollers. Existing frameworks typically assume server-class resources or continuous connectivity, leaving a gap for deeply embedded systems. 
This paper proposes a modular reference architecture for Embedded Agent Systems that bridges the divide between deterministic real-time control and agentic intelligence. We introduce a tiered design that decouples On-Device Agents - executing highly compressed neural networks and rule-based logic for low-latency, privacy-critical tasks - from Cloud-Augmented Agents that leverage Small Language Models (SLMs) for higher-level reasoning and planning. A key contribution is the integration of a cross-cutting Governance Layer, ensuring observability, policy enforcement, and safety across distributed fleets of autonomous devices. Rather than presenting purely empirical benchmarks, we analyze architectural design principles and trade-offs regarding latency, energy, and reliable execution in resource-constrained environments.","published_date":"2026-06-01T20:24:18+00:00","viability_score":5,"cluster_label":"Embedded AI Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A modular architecture for embedded AI agent systems that bridges real-time control and agentic intelligence for resource-constrained microcontrollers.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.02860v1","title":"Forgetting is Not Erasure: Recovering Latent Knowledge via Transport Keys","abstract":"Catastrophic forgetting is often framed as a representational problem: after sequential training, a model appears to lose the features that supported performance on earlier tasks. We challenge the stronger form of this view. Across controlled continual-learning settings, we find that a significant portion of apparent forgetting can be attributed to interface drift between internal stages rather than permanent erasure of task-relevant computation. 
We study this phenomenon through a stitched evaluation protocol that combines early computation from a post-update network with late computation from its predecessor, optionally mediated by a compact, task-specific transport key. We describe transport keys at a systems level as compact interface-alignment operators estimated from a small set of paired anchor activations and evaluated through model stitching. On split CIFAR-100 with a ResNet-style network, transport keys recover most of the original Task A performance after sequential training on Task B. On a compact vision transformer, we observe a similar recovery pattern. These results suggest that continual learning may require better mechanisms for indexing and re-accessing latent computations, not only methods that prevent weight change.","published_date":"2026-06-01T20:22:03+00:00","viability_score":3,"cluster_label":"Continual Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A method to recover latent knowledge in continually trained models by addressing interface drift between internal stages using transport keys.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02859v1","title":"Economy of Minds: Emerging Multi-Agent Intelligence with Economic Interactions","abstract":"How can a population of agents self-orchestrate and self-adapt into stronger collective intelligence without centralized control? Inspired by Friedrich Hayek's economic theory of decentralized coordination in markets, we study this question through an agent economy in which agents compete via auctions for the right to act, exchange payments, and accumulate wealth from environmental rewards. These simple economic signals induce decentralized credit assignment, driving planning without global orchestration or explicit communication protocols. 
The population evolves through economic selection: effective agents accumulate wealth and are mutated via exploitation, while ineffective ones go bankrupt and are replaced via exploration. We show that, initialized with weak agents, the economy produces emergent multi-step reasoning strategies and outperforms stronger monolithic baselines across five agentic tasks, including mathematical reasoning, financial research, scientific research, accelerator design, and distributed-system optimization. We further provide theoretical insights into how economic dynamics shape agent behaviors, linking local incentives to long-term global performance. Our results suggest a new path to multi-agent intelligence: rather than engineering coordination, we can design decentralized incentive structures under which it automatically emerges.","published_date":"2026-06-01T20:21:09+00:00","viability_score":3,"cluster_label":"Multi-Agent Systems","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An agent economy where agents compete via auctions and exchange payments to self-orchestrate and develop stronger collective intelligence.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02857v1","title":"GRZO: Group-Relative Zeroth-Order Optimization for Large Language Model Fine-Tuning","abstract":"Zeroth-order (ZO) optimization is a memory-efficient alternative to backpropagation for fine-tuning large language models, but its deployment is limited by the high variance of gradient estimation. We propose GRZO, a Group-Relative Zeroth-Order optimizer that draws one pseudo-independent perturbation per mini-batch example and aggregates the per-example losses through group-relative normalization, raising the effective gradient-direction count from one to the batch size at no additional forward cost while preserving inference-level memory. 
We prove that GRZO is directionally unbiased with variance shrinking proportionally to the batch size, yielding a tighter nonconvex convergence bound than MeZO. Across RoBERTa-large, Llama3-8B, and OPT-13B over multiple tasks, GRZO improves average accuracy on Llama3-8B by $+3.0$ over MeZO at $23\\%$ lower peak GPU memory; as a drop-in replacement for the MeZO core, it lifts sparse, low-rank, and quantized ZO variants by $+6.0$ on average.","published_date":"2026-06-01T20:19:36+00:00","viability_score":3,"cluster_label":"LLM Fine-Tuning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel optimization method for memory-efficient LLM fine-tuning that reduces GPU memory usage and improves accuracy.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02837v1","title":"Fixing FOLIO and MALLS: Verified Annotations and an LLM-assisted Framework to Focus Human Relabeling","abstract":"Accurate translation from Natural Language to First-Order Logic (NL-to-FOL) underpins neurosymbolic AI systems and Natural Language Inference (NLI), making the quality of NL-to-FOL benchmarks essential -- yet these datasets have never been rigorously audited. Our first contribution is to present a systematic human inspection of the validation split of \\textsf{FOLIO} and a subset of \\textsf{MALLS} test instances, finding that approximately 39% and 36% of entries, respectively, contain incorrect FOL formalizations (i.e., ground truth labels), with additional rates of ambiguous NL sentences (16.4% and 48%) and incorrect NLI labels in \\textsf{FOLIO} (8.4%). Our second contribution is to develop and release corrected ground truths for such datasets, showing that annotation errors distort model evaluation on a reference benchmark task: testing three state-of-the-art LLMs (Gemma~4 31B-it, Qwen3-30B-A3B, and GPT-4o-mini) with the corrected ground truths yields accuracy gains from +9 to +22 percentage points. 
Motivated by these findings, we propose an LLM-based framework to support humans in manually reviewing NL-to-FOL datasets. By directing reviewers toward the most error-prone instances, we empirically show that it is possible to achieve 90% dataset accuracy after reviewing fewer than 24% of instances, compared to over 70% required by unguided review. We release all human-verified annotations and the code for our framework.","published_date":"2026-06-01T20:00:35+00:00","viability_score":7,"cluster_label":"NL-to-Logic Datasets","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An LLM-assisted framework to efficiently correct and improve NL-to-FOL datasets, significantly boosting model accuracy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.02835v1","title":"Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models","abstract":"Large Reasoning Models (LRMs) improve performance by generating explicit intermediate reasoning traces through increased test-time compute, yet the assumption that longer reasoning is consistently beneficial remains under-examined. While recent evidence shows that additional reasoning can lead models to overthink, we ask: \"Once a model has reached the correct answer, does further reasoning refine the solution, or deviate from it?\" To study the dynamics after correctness, we introduce a prefix-level trajectory evaluation protocol grounded in reasoning sufficiency, defining the minimum reasoning budget required for a model to first generate the correct answer. This allows us to disentangle verbose overthinking, where additional reasoning is redundant but harmless, from harmful overthinking, where continued reasoning destabilizes an already-correct trajectory. Starting from multimodal benchmarks, we find that many instances considered reasoning-intensive require surprisingly little reasoning. 
Moreover, stopping at the first correct prefix improves accuracy over standard reasoning up to 21%, revealing that current models are limited not only by their ability to reason, but also by their inability to stop at the right time. Furthermore, while common efficiency strategies like early stopping substantially reduce verbose overthinking (up to 50%), they fail to mitigate harmful overthinking. Failure analysis reveals that correctness deviations are mainly driven by logical drift and visual reinterpretation. Finally, we show that our findings generalize to language-only reasoning benchmarks, highlighting harmful overthinking as a broader reliability risk. Code available at https://simonecaldarella.github.io/thinking-past-the-answer.","published_date":"2026-06-01T19:59:27+00:00","viability_score":7,"cluster_label":"LLM Reasoning Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel evaluation protocol to identify and mitigate harmful overthinking in large reasoning models, improving accuracy and reliability.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.02834v1","title":"Large Byte Model: Teaching Language Models About Compiled Code","abstract":"Malware analysis starts with the raw bytes of an executable program, and tools to \"lift\" these to higher-level representations, such as assembly, are expensive and subject to error. Large Language Models (LLMs) cannot process raw byte representations and answer questions about them. To this end, we present the first byte-native LLM. Based on a vocabulary expansion technique using a bespoke byte tokenizer, such a model is capable of responding to complex questions about malware binaries, with accuracies ranging from 69% for malware family classification to 98% for architecture classification. 
Our findings indicate that providing domain knowledge during training is essential for this application -- off-the-shelf models lack both accuracy and insight. We've deployed this emerging solution to a limited number of analysts to gather feedback for further improvements.","published_date":"2026-06-01T19:56:02+00:00","viability_score":4,"cluster_label":"Code Analysis LLMs","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A byte-native LLM capable of understanding and answering complex questions about raw executable code for malware analysis.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02832v1","title":"An Exploration of Collision-based Enemy Morphology Generation","abstract":"Despite a great deal of prior research into Procedural Content Generation (PCG), relatively little prior work has explored generating enemies for video games. In particular, there is almost no work on generating enemy morphologies, the basic body plan or collision information for in-game enemies, despite the existence of related morphology generation work in robotics. In this paper, we explore three different novel approaches to generate enemy morphologies based on player collision information. We found that each approach provides different strengths and weaknesses, but all had equivalent or better performance than an evolutionary baseline adapted from prior robotics morphology work.","published_date":"2026-06-01T19:52:33+00:00","viability_score":2,"cluster_label":"Game AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Novel approaches for generating enemy body plans in video games based on player collision data.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02822v1","title":"Which Defense Closes Which Threat? 
Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing","abstract":"Production LLM applications stack several defense families -- refusal-phrase filters, token-budget controls, model allowlists, rate limits, tool-registry authentication -- yet existing breach-and-attack-simulation (BAS) benchmarks report a single aggregate coverage number, hiding which family closes which threat. We measure attribution. We add four OWASP-LLM-Top-10-aware agents to a 21-agent baseline scanner and target a lattice of four synthetic LLM endpoints: $L_0$ (no defenses), $L_1$ (refusal-only), $L_2$ (budget-only), and $L_3$ (full stack). $L_1$ and $L_2$ are sibling single-axis ablations, not subsets of each other; $L_3$ is their union plus tool-registry authentication and credential scrubbing. Across $N=10$ replications, the per-OWASP finding count is clean: refusal alone removes all LLM01 (jailbreak) and LLM07 (system-prompt leakage) findings; budget alone removes all LLM02 (sensitive-info disclosure) and LLM10 (unbounded consumption) findings by terminating multi-step sequences; LLM06 (excessive agency) requires the full stack. We probe brittleness under paraphrasing: with 300 Gemini-generated paraphrases ($K=5$ over a 60-template brittleness corpus), $L_1$ refusal block rate falls 15 pp on LLM01 and 25 pp on LLM07. A fifth target, $L_4$-real, swaps the stub backend for Gemini-2.5-flash behind the same $L_3$ regex and matches $L_1$ exactly, indicating no measurable alignment contribution beyond the regex (not a general claim about alignment). Budget controls show no drop (0 pp once the rate-limit floor is factored out). 
A refusal whitelist that clears a static benchmark can be defeated by an LLM-driven paraphraser without changing attack intent; a budget control resists the same mutation.","published_date":"2026-06-01T19:39:25+00:00","viability_score":6,"cluster_label":"LLM Security","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Quantifying the effectiveness and brittleness of different LLM defense mechanisms against specific threats.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.02814v1","title":"Do Neural Retrievers Prefer Certain Documents? Evidence of Learned Relevance Priors","abstract":"Neural retrievers are trained to estimate query-document relevance from annotated query-document pairs. Yet annotation protocols may not purely reflect relevance: they select only a subset of documents for labeling, and this selection can favor certain document types over others. We investigate whether supervised bi-encoder retrievers implicitly learn a document-level relevance prior: a query-independent signal encoded in their representation space as a side effect of training on annotated data. We estimate this prior by training simple classifiers on frozen document embeddings and evaluate three state-of-the-art retrievers across multiple IR benchmarks. We find that supervised neural retrievers encode relevance priors that generalize to unseen documents and are consistent across models. These priors create a findability gap: documents with lower prior are systematically harder to retrieve, even when genuinely relevant. This effect appears in supervised dense retrievers but is weaker and less consistent in BM25, and it persists under controlled matched-document comparisons. Using LLM-based explanations, we find that judged-relevant documents tend to be comprehensive, self-contained summaries of mainstream topics, while niche, fragmentary, or highly technical content is often left unjudged. 
Retrievers internalize this bias, ranking documents with these favored features higher than documents that lack them, independently of their actual relevance. Our findings expose a structural limitation of supervised retrieval: models trained on annotated data do not just learn relevance, but also the implicit document preferences in their training data.","published_date":"2026-06-01T19:31:28+00:00","viability_score":6,"cluster_label":"Information Retrieval","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Revealing how neural retrievers learn and perpetuate biases from training data, creating a 'findability gap'.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.02812v1","title":"Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection","abstract":"Modeling patient trajectories from longitudinal electronic health records (EHRs) requires reasoning over sparse, noisy, and long-context multimodal sequences. Existing LLM-based multi-agent systems address context length but process patients in isolation, failing to mirror how clinicians leverage accumulated experience from similar prior cases. We present Traj-Evolve, a self-evolving multi-agent system with two complementary evolving mechanisms. First, an Experience Pool (ExPool) acts as a non-parametric memory, indexing rejection-sampled reasoning traces to retrieve similar patients as few-shot contexts. Second, multi-agent reinforcement learning (MARL) via reward-ranked fine-tuning parametrically optimizes inter-agent and agent-memory collaboration. A leave-one-out cross-retrieval strategy unifies the two, aligning training- and inference-time behavior under retrieval augmentation. On a lung cancer prediction task utilizing up to five years of multimodal EHRs, Traj-Evolve outperforms 9 strong baselines on the overall population and a challenging never-smoker population. 
Analysis of the evolving dynamics highlights three key findings: (1) expanding the ExPool shifts optimal retrieval from diverse to specific samples; (2) under MARL, the manager agent's prediction loss converges quickly while the worker agents' temporal reasoning continues to benefit from more verified patients; and (3) the two mechanisms are complementary on the predicted risk, where ExPool improves specificity while MARL improves sensitivity.","published_date":"2026-06-01T19:30:07+00:00","viability_score":4,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A multi-agent system that models patient trajectories for lung cancer early detection by learning from similar past cases.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02802v1","title":"ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning","abstract":"Large language models (LLMs) exhibit strong natural-language reasoning abilities for clinical decision support, but struggle to effectively model structured longitudinal electronic health records (EHRs). In contrast, EHR foundation models can learn predictive patient representations, yet lack interpretable language-based reasoning. To bridge this gap, we propose ChatHealthAI, a multimodal reasoning framework that aligns structured EHR representations from a pretrained EHR foundation model with the semantic space of a frozen LLM through a task-aware resampler. By integrating longitudinal patient representations with refined clinical event descriptions, ChatHealthAI enables clinically grounded natural-language reasoning while maintaining accurate patient prediction. We evaluated ChatHealthAI on three clinical predictive tasks from the EHRSHOT benchmark. Results show that ChatHealthAI improves reasoning quality and interpretability while preserving competitive predictive performance. 
These findings highlight the potential of integrating EHR foundation models with pretrained LLMs for interpretable clinical prediction.","published_date":"2026-06-01T19:21:18+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ChatHealthAI integrates EHR data with LLMs for grounded clinical reasoning and prediction, improving interpretability and performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.02800v1","title":"Cosmos 3: Omnimodal World Models for Physical AI","abstract":"We introduce Cosmos 3, a family of omnimodal world models designed to jointly process and generate language, image, video, audio, and action sequences within a unified mixture-of-transformers architecture. By supporting highly flexible input-output configurations, Cosmos 3 seamlessly unifies critical modalities for Physical AI -- effectively subsuming vision-language models, video generators, world simulators, and world-action models into a single framework. Our evaluation demonstrates that Cosmos 3 establishes a new state-of-the-art across a diverse suite of understanding and generation tasks, demonstrating omnimodal world models as scalable, general-purpose backbones for embodied agents. Our post-trained Cosmos 3 models were ranked as the best open-source Text-to-Image and Image-to-Video models by Artificial Analysis, and the best policy model by RoboArena at the time the technical report was written. To accelerate open research and deployment in Physical AI, we make our code, model checkpoints, curated synthetic datasets, and evaluation benchmark available under the Linux Foundation's OpenMDW-1.1 License (https://openmdw.ai/license/1-1/) at https://github.com/nvidia/cosmos and https://huggingface.co/collections/nvidia/cosmos3 . 
The project website is available at https://research.nvidia.com/labs/cosmos-lab/cosmos3 .","published_date":"2026-06-01T19:12:30+00:00","viability_score":9,"cluster_label":"Physical AI","has_code":true,"repo_url":"https://github.com/nvidia/cosmos","commercial_flags":["has_code"],"one_liner":"Cosmos 3 is an omnimodal world model unifying language, vision, video, audio, and action for embodied agents, setting new SOTA benchmarks.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.02798v1","title":"BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces","abstract":"Many decision-support settings require systems that adapt to individual users, but evaluation data for this problem remain limited. Existing benchmarks for user understanding often rely on simulated users or model-generated behavior, even though recent work cautions that model-based simulations can diverge systematically from human behavior. We introduce \\textsc{BehaviorBench}, a benchmark for evaluating personalized decision modeling from real-world behavioral traces. \\textsc{BehaviorBench} reconstructs wallet-level decision histories from observed public prediction-market and on-chain records, and organizes them into two complementary task layers: \\emph{Belief prediction}, which predicts a user's final revealed stance and confidence in a market, and \\emph{Trade prediction}, which predicts the direction and amount of individual transactions. Across 2,000 evaluation wallets, the benchmark contains 141,445 Belief instances and 1,485,972 Trade instances, with disjoint support pools for retrieval-based evaluation. We evaluate frontier and open-weight generative models under four history interfaces: no personalization, direct recent history, generated user profiles, and retrieved support-wallet evidence. 
Personalization improves Belief prediction more consistently than Trade prediction, model rankings change across task layers and metrics, and different history interfaces expose different failure modes. \\textsc{BehaviorBench} provides an evaluation setting for studying whether personalized methods can use real-world behavioral evidence rather than simulated users alone.","published_date":"2026-06-01T19:04:36+00:00","viability_score":7,"cluster_label":"User Modeling","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BehaviorBench provides a benchmark for personalized decision modeling using real-world user behavioral traces, moving beyond simulations.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.02791v1","title":"Evaluating Transformer and LSTM Frameworks for Prediction in Ungauged Basins","abstract":"Watershed networks exhibit convergent topologies in which multiple tributaries merge into downstream channels, integrating diverse upstream hydrological processes. In ungauged basins, the absence of direct observations increases uncertainty and limits the ability to anticipate extreme events. This study evaluates whether an encoder-only Transformer provides an advantage over an LSTM for upstream streamflow inference under limited hydrologic information, using retrospective simulations from the NOAA National Water Model (NWM). Across both upstream-only and combined configurations, the LSTM showed stronger overall performance than the Transformer. Incorporating downstream information further boosted performance for all models, increasing median NNSE by more than 60%. Rather than treating this as a leaderboard-style comparison, we interpret the experiments as a test of architectural inductive bias for hydrologic sequence inference. 
The results indicate that recurrent memory remains better aligned with this upstream reconstruction task than an encoder-only Transformer, while downstream hydrologic context provides a strong auxiliary constraint that substantially improves prediction skill across architectures.","published_date":"2026-06-01T18:57:20+00:00","viability_score":3,"cluster_label":"Hydrology","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper evaluates Transformer and LSTM frameworks for streamflow prediction in ungauged basins, finding LSTMs perform better.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02781v1","title":"CRAM-ER: Error-Resilient Spintronic Computational Random Access Memory for Scalable In-Memory Computation","abstract":"Deep neural networks (DNNs) have achieved state-of-the-art performance across diverse domains. However, typical Von Neumann compute paradigms face severe memory bottlenecks. Emerging near-memory and compute-in-memory approaches alleviate this but incur significant peripheral overhead. Computational Random Access Memory (CRAM) based on MRAM enables in-situ logic without peripheral overhead, offering a dense, energy-efficient solution. However, probabilistic MRAM switching induces gate-level errors that limit the scalability and reliability of CRAM for accelerating DNNs. Moreover, the large number of sequential MRAM writes severely constrains CRAM throughput. To address these challenges, we propose an error-resilient CRAM (CRAM-ER) architecture for scalable in-memory matrix-vector multiplications (MVMs). Our error-aware hardware-software co-design framework leverages a hybrid spintronic-CRAM + CMOS adder-tree architecture to mitigate the impact of device-level errors, demonstrating MVM functionality with high area and energy efficiency. We further develop an error-aware model fine-tuning and fine-grained error correction for enhanced error resilience. 
Evaluations of the CMOS+spintronic hybrid architecture on DNN benchmarks show near-lossless accuracy while reducing CRAM latency by up to 2 orders of magnitude, outperforming CPU/GPU+high-bandwidth DRAM in both energy efficiency and energy-delay product.","published_date":"2026-06-01T18:45:05+00:00","viability_score":3,"cluster_label":"Hardware Acceleration","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Developing error-resilient spintronic computational RAM for scalable in-memory computation to accelerate deep neural networks.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.02775v1","title":"AURA: Action-Gated Memory for Robot Policies at Constant VRAM","abstract":"The KV-cache is the right memory for datacenters but the wrong memory for robots. Datacenter inference batches many short requests and resets them, amortizing an attention cache across a crowd. Embodied agents instead run one long, non-resetting episode on bandwidth-limited edge hardware, where high-bandwidth memory and flash are scarce, flash has finite write endurance, and memory writes rather than compute can become the binding constraint. AURA-Mem (Action-Utility Recurrent Adaptive Memory) targets this regime. It wraps a frozen vision-language-action backbone with a constant-size recurrent memory and a learned gate that writes only when the current observation would change the next action: memory that knows when to stay silent. Unlike reconstruction-based memory, the gate is trained directly against a closed-loop action-error signal. Its inference state is fixed at 4,224 bytes regardless of horizon, while a KV-cache grows to 6,061 times larger at 100,000 steps. On a controlled synthetic benchmark, AURA-Mem matches the best O(1) baseline in accuracy while using 5.19-6.13 times fewer writes, and up to 9.19 times fewer writes on easier configurations. 
Budget-matched random and periodic schedules do not recover this gain, isolating the benefit to the action-surprise signal. On a trained closed-loop OpenVLA-OFT 7B panel on LIBERO-Long (n=60 episodes per arm), the gate does not hurt success: AURA-Mem matches the ungated base policy (0.233) and slightly exceeds an always-write KV arm (0.217), while using 7.0 times fewer writes and constant memory. We also instantiate an approximate-information-state value-loss bound as a methodology demonstration; at this scale, the bound is vacuous rather than a guarantee.","published_date":"2026-06-01T18:38:21+00:00","viability_score":6,"cluster_label":"Robotics Memory","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Action-gated recurrent memory for robots that uses constant VRAM by only writing when observations change future actions.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2606.02765v1","title":"Representational Capacity: Geometric Limits on Feature Representation in Transformer Language Models","abstract":"Model dimension ($d_{model}$) is a fundamental hyperparameter in transformer language models, yet its role in setting the geometric limits of feature representation remains under-explored. Grounded in the Linear Representation and Superposition Hypotheses - which propose that models encode features as near-orthogonal directions in latent space - we develop a framework for estimating how many such directions a model can support. We first establish the embedding matrix as a measurable proxy for near-orthogonality constraints across the latent space: the boundary between meaningful token relationships and incidental similarity in the pairwise cosine similarity distribution gives a concrete estimate of the model's accepted deviation $\\varepsilon$ from perfect orthogonality. 
Applying this metric across dozens of open-source models reveals two classes: models with high $\\varepsilon$ whose embeddings lack near-orthogonal structure, and models with low $\\varepsilon$ that maintain it. We then show that the standard Johnson-Lindenstrauss lemma greatly underestimates the packing efficiency of trained representations, and derive an adjusted capacity formula in which the number of near-orthogonal directions depends on the ratio of vectors to dimensions ($k/d$) rather than the raw count - a single modification that cuts prediction error by two orders of magnitude with no extra parameters. Combining these results, we define representational capacity as an upper bound on the number of distinguishable directions available for features and embeddings in a model's latent space. Capacity is exponentially sensitive to $\\varepsilon$, and larger models favor tighter orthogonality constraints over maximizing raw capacity - a pattern compatible with several explanations (a stability-capacity trade-off, a ceiling on usable concepts, or confounds with model scale) that we leave to future work.","published_date":"2026-06-01T18:28:56+00:00","viability_score":0,"cluster_label":"LLM Theory","has_code":true,"repo_url":"https://github.com/Alex-Guha/representational-capacity","commercial_flags":["has_code"],"one_liner":"Developing a geometric framework to understand the representational capacity limits of transformer language models based on embedding matrix analysis.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2606.02755v1","title":"Acceptance-Test-Driven Evaluation Protocols for Business-Centric LLM Systems","abstract":"Large language model (LLM) applications are increasingly expected to satisfy deterministic institutional requirements while relying on probabilistic generative components. This mismatch makes ordinary post-hoc benchmarking insufficient for systems that must be safe, reliable, auditable, and economically useful. 
This paper contributes an evaluation-protocol extension for operational LLM systems grounded in acceptance-test-driven development, safety engineering, and business-centric validation. The extension translates stakeholder goals into executable behavioral contracts, release gates, monitoring signals, and evidence artifacts before prompt, model, retrieval, or agent changes are accepted. It adapts the red-green-refactor discipline of test-driven development to a red-train-green lifecycle: first define failing acceptance tests for desired behavior, then improve the LLM system through prompt changes, retrieval design, fine-tuning, guardrails, or data augmentation, and finally release only when multidimensional gates are satisfied. The contribution is a governance-oriented metric stack, reference architecture, and empirical protocol for comparing acceptance-test-driven LLM development against prompt-first and benchmark-after workflows.","published_date":"2026-06-01T18:21:10+00:00","viability_score":5,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An acceptance-test-driven evaluation protocol for LLM systems that translates stakeholder goals into executable contracts and release gates.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2606.02753v1","title":"MetaWorld: Scaling Multi-Agent Video World Model from Single-view Video Data","abstract":"Video world models are a foundational generative technology for embodied AI and the Metaverse, yet existing approaches are inherently limited to a single agent observing from a single perspective. 
Extending these models to multi-agent settings introduces two critical challenges: data scarcity (coordinated multi-view recordings are prohibitively expensive to collect for general open-domain scenarios) and world state alignment (independently generated video streams cannot ensure that shared physical environments and events evolve consistently across views). To address these challenges, we propose MetaWorld, a novel framework that scales multi-agent video world models to open-domain environments directly from single-view videos. First, we introduce Monocular World-State Unrolling (MWSU) to explicitly decompose monocular footage into the camera operator's ego-motion and the visible subject's spatial trajectory. This camera-trajectory decomposition naturally extracts synchronized multi-agent motion data within a shared 3D space, completely bypassing the need for multi-camera setups. Second, for precise visual control, we develop the Subject-Aware World Generator to enable appearance-driven simulation conditioned on per-agent identity images. Finally, to ensure both views are grounded in the identical physical reality, we propose World-State Alignment, a per-frame inter-branch cross-attention mechanism inserted at every transformer layer of the video DiT. By jointly synchronizing the denoising process, WSA enforces both static geometric consistency and dynamic motion consistency, encouraging that the shared 3D environment and physical events remain well-aligned across both egocentric views. 
Extensive experiments demonstrate that MetaWorld achieves superior cross-view consistency and identity fidelity, establishing a highly scalable, physics-driven paradigm for multi-agent video world modeling.","published_date":"2026-06-01T18:20:20+00:00","viability_score":7,"cluster_label":"Generative Video","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for scaling multi-agent video world models from single-view videos by decomposing motion and aligning views.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2606.02747v1","title":"Plan2Map: A Multimodal Benchmark for Document-Grounded Geospatial Boundary Reconstruction from Planning Records","abstract":"Planning records define restrictions over geographic areas, but their source documents often provide only indirect spatial evidence rather than machine-readable boundaries. We introduce Plan2Map, a 208-case multimodal benchmark for document-grounded geospatial boundary reconstruction from UK planning records. Given only a source planning document, systems must reconstruct a valid geospatial boundary from notice text, schedules, map plates, map labels, and boundary annotations; the reference GeoJSON is held out for scoring. We propose GeoPlanAgent, a document-grounded, geospatial-tool-in-the-loop system that decomposes the task into evidence extraction, localisation, map registration, boundary segmentation, projection, and verification. On Plan2Map, GeoPlanAgent achieves 0.736 mean IoU and 0.904 median IoU, with 67.8\\% of predictions at or above 0.8 IoU, substantially outperforming direct VLM-to-GeoJSON baselines. Diagnostic analysis shows that direct VLM prediction remains unreliable, while remaining errors are concentrated in localisation and map registration, and supervised boundary segmentation substantially improves pixel-level mask quality. Plan2Map provides a concrete testbed for multimodal geospatial reconstruction from public planning records. 
Project page: https://odeb1.github.io/Plan2Map_Project_Page/.","published_date":"2026-06-01T18:12:16+00:00","viability_score":8,"cluster_label":"Geospatial AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multimodal benchmark and agent for reconstructing geospatial boundaries from planning documents, outperforming direct VLM baselines.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.02739v1","title":"EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement","abstract":"Audio tokenizers serve as the discrete interface between continuous audio and Audio Language Models (ALMs), but existing tokenizers often struggle to support both understanding and generation. Reconstruction-oriented codecs preserve acoustic fidelity but lack rich semantics, while semantic-aware tokenizers typically rely on separate semantic and acoustic streams, introducing redundancy or misalignment. We propose EntangleCodec, a unified discrete audio tokenizer that learns caption-aligned semantic-acoustic representations before quantization. By aligning audio with rich captions rather than ASR transcripts, EntangleCodec captures linguistic content, speaker identity, emotion, prosody, and acoustic scenes within a compact token stream. A flow-matching diffusion decoder further enables high-quality reconstruction across speech, music, and general audio. EntangleCodec achieves reconstruction quality competitive with specialized codecs, outperforms all codec-based baselines on audio understanding by up to +7.4% on MMAR, and supports both TTS and TTA generation in a unified framework. 
Furthermore, EntangleCodec-based audio language models demonstrate strong scaling behavior: even at 0.6B parameters, the model surpasses specialized continuous-representation LLMs with over 13B parameters across three benchmarks using 22$\\times$ fewer parameters; scaling to 8B further establishes new state-of-the-art results on MMAR, highlighting that representation quality is as critical as model scale in audio language modeling. Code and model weights are available at https://github.com/luckyerr/EntangleCodec.","published_date":"2026-06-01T18:05:18+00:00","viability_score":9,"cluster_label":"Audio Language Models","has_code":true,"repo_url":"https://github.com/luckyerr/EntangleCodec","commercial_flags":["has_code"],"one_liner":"A unified discrete audio tokenizer that entangles semantic and acoustic information for high-quality understanding and generation, outperforming larger models.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.02737v1","title":"Attention Calibration for Position-Fair Dense Information Retrieval","abstract":"Dense retrieval models exhibit positional bias: retrieval effectiveness degrades when relevant information appears later in a passage (Zeng et al., 2025). We ask whether this bias can be reduced at inference time, without retraining and without sacrificing overall retrieval effectiveness. To this end, we adapt inference-time attention calibration (Schuhmacher et al., 2026) to downstream retrieval and extend it with a strength coefficient lambda that interpolates between the original and fully calibrated attention distributions. Across three embedding models on SQuAD-PosQ and FineWeb-PosQ, we examine how basket size, calibrated layer set, and strength affect the trade-off between positional fairness and retrieval effectiveness, finding that partial calibration frequently outperforms full calibration. 
A single configuration (B=128, lambda=0.5, 50% layer depth) improves the harmonic mean of nDCG@10 across positional groups on FineWeb-PosQ for all three models without per-model tuning, and applies to both <s>-pooled and last-token-pooled architectures. This default configuration transfers without modification to PosIR, which spans 10 languages and 31 domains, reducing the Position Sensitivity Index in all 16 length-quartile x model x retrieval-setting combinations, while preserving or improving aggregate nDCG@10. We release our extended codebase at https://github.com/impresso/fair-sentence-transformers","published_date":"2026-06-01T18:04:26+00:00","viability_score":7,"cluster_label":"Information Retrieval","has_code":true,"repo_url":"https://github.com/impresso/fair-sentence-transformers","commercial_flags":["has_code"],"one_liner":"An inference-time attention calibration method to reduce positional bias in dense retrieval without retraining, improving fairness across languages and domains.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2606.02735v1","title":"See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs","abstract":"Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface.   Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. 
Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation.   This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem: coarse instructions induce avoidable supervision aliasing, goal-preserving local guidance outperforms instruction replacement in our main ablations, and explicit evidence budgeting reduces dependence on broad visual context beyond efficiency considerations.   Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% to 79.0% over pi0.5. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.","published_date":"2026-06-01T18:02:07+00:00","viability_score":7,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework that improves robot generalization by training executors to act from task-specific visual evidence and refined language instructions, boosting success rates on real-world tasks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2606.02724v1","title":"AVTrack: Audio-Visual Tracking in Human-centric Complex Scenes","abstract":"Audio-visual speaker tracking aims to localize and track active speakers by leveraging auditory and visual cues, enabling fine-grained, human-centric scene understanding. 
This capability is essential for real-world applications such as intelligent video editing, surveillance, and human-computer interaction. However, existing datasets are largely limited to simple or homogeneous audio-visual scenes with coarse annotations. Such oversimplified settings bias evaluation toward static audio-visual co-occurrence, rather than rigorously assessing robust spatiotemporal modeling and cross-modal reasoning in complex, dynamic scenes. To address these limitations, we introduce AVTrack, a human-centric audio-visual instance segmentation (AVIS) dataset designed for dynamic real-world scenarios. AVTrack features diverse and challenging conditions, including camera motion, visual occlusions, and position changes. Evaluations of representative AVIS methods on AVTrack reveal substantial performance degradation, establishing AVTrack as a challenging benchmark for robust human-centric audio-visual scene understanding in complex environments. We further provide a simple yet effective baseline to facilitate future research. Project website: https://FudanCVL.github.io/AVTrack/","published_date":"2026-06-01T18:00:08+00:00","viability_score":7,"cluster_label":"Computer Vision","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AVTrack is a new dataset and benchmark for audio-visual speaker tracking in complex, dynamic human-centric scenes, enabling more robust real-world applications.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31603v1","title":"Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models","abstract":"Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. 
We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.","published_date":"2026-05-29T17:59:50+00:00","viability_score":7,"cluster_label":"Generative Video","has_code":true,"repo_url":"https://github.com/alibaba-damo-academy/Lumos-Custom","commercial_flags":["has_code"],"one_liner":"A training-efficient framework for high-fidelity video generation that progressively refines output using a high-capacity pretrained model.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31593v1","title":"Stateful Online Monitoring Catches Distributed Agent Attacks","abstract":"Language models can find thousands of severe software vulnerabilities, and agents are increasingly being misused for cyberattacks. 
To avoid detection, attackers frequently distribute their misuse, splitting a harmful task across many user accounts so each individual transcript looks benign. Because safety monitors score only one agent context at a time, they are structurally blind to misuse that is only visible in aggregate, across many accounts. We show this gap is real by building, to our knowledge, the first distributed agent attack, a multi-agent scaffold that completes hard cybersecurity tasks while hiding the harmful objective across subagents with limited contexts, evading a standard monitor that catches it only a fifth as often as prior agent attacks. Towards a defense, we develop an online stateful monitor that uses real-time clustering to collect weak suspiciousness signals across many agent transcripts, and escalates only rarely to a language model that flags misuse across user accounts. In evaluations with large-scale simulated datacenter traffic, our monitor Pareto dominates standard monitors, catching distributed attacks 30% earlier and flagging cyber misuse before it reaches the most harmful stages. Crucially, this comes at negligible additional latency for ~99% of user traffic. This detection advantage persists but narrows as the benign background traffic grows very large. After an extensive red-teaming exercise, we improve the defense and surprisingly also find that it catches standard jailbreaks, since adaptive attackers reuse attack variants across accounts. 
Our results point toward a new class of safety monitors which reason over groups of users rather than isolated transcripts.","published_date":"2026-05-29T17:57:00+00:00","viability_score":4,"cluster_label":"AI Security","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A stateful online monitoring system that uses real-time clustering to detect distributed agent attacks across multiple user accounts.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2605.31590v1","title":"TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation","abstract":"Text-to-video (T2V) generation faces challenging questions when generating videos with long horizons containing multiple events. Inspired by the intrinsics of the diffusion process, we probe video diffusion transformers (DiTs) and uncover intrinsic turning points in the DiT denoising trajectory where conditioning text affects generation from global layout to fine-grained details. Building on this finding, we present TunerDiT, a simple yet effective progressive steering method that requires no additional training for multi-event generation. TunerDiT comprises two steering handles: (1) Event-Partitioned Masking that enforces event boundaries while allowing cross-event transition bands; (2) Cross-Event Prompt Fusion that injects neighboring event semantics for late-stage refinement. We contribute a self-curated prompt suite for benchmarking multi-event generation, i.e., Meve. TunerDiT achieves state-of-the-art performance across 8 metrics and offers a tunable trade-off between video consistency and event separation, compared with other training-free methods. 
The improvement in text alignment increases with the event count, indicating a scaling possibility with increasing event count.","published_date":"2026-05-29T17:56:09+00:00","viability_score":7,"cluster_label":"Generative Video","has_code":true,"repo_url":"https://github.com/cvpr-org/author-kit","commercial_flags":["has_code"],"one_liner":"A training-free method to progressively steer diffusion transformers for multi-event video generation by partitioning events and fusing prompts.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31586v1","title":"Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions","abstract":"Grasping the semantics of rare constructions (form-meaning pairings) has been shown to be a challenging problem that has currently only been solved by the largest LLMs. It remains an open question if open-source models have robust constructional understanding, and if so, what learning dynamics underlie the acquisition of this knowledge. Focusing on a set of rare Paired-Focus constructions in English (e.g. \"let alone\", \"much less\"), we construct a novel dataset to test their meanings using both scalar adjectival semantics and general world knowledge. Testing a wide range of models differing in parameter count, architecture, and pretraining dataset size, we find that several modestly sized models are sensitive to both the forms and the meanings of Paired-Focus constructions, though models trained on human-scale data fail at all meaning evaluations. Turning to training dynamics for a set of open-checkpoint models, we find that Paired-Focus understanding emerges later in training than Paired-Focus syntactic knowledge, and that learning of Paired-Focus semantics is correlated with gains in some domains of world knowledge. 
Overall, our empirical results support the conclusion that modestly sized open-source models can grasp the rare Paired-Focus constructions, and demonstrate a connection between knowledge of Paired-Focus constructions and other meaning domains.","published_date":"2026-05-29T17:54:00+00:00","viability_score":3,"cluster_label":"LLM Understanding","has_code":true,"repo_url":"https://github.com/WesScivetti/Meaning_Alone","commercial_flags":["has_code"],"one_liner":"Investigating how language models learn constructional semantics, finding that modestly sized models can grasp rare constructions with gains in world knowledge.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.31584v1","title":"LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards","abstract":"Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information in extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce \\textsc{LongTraceRL}. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build \\emph{tiered distractors}: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability), producing training contexts that are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a \\emph{rubric reward} that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. 
This rubric reward is applied only to responses with correct final answers (positive-only strategy), distinguishing the reasoning quality among correct responses and preventing reward hacking. Experiments on three reasoning LLMs (4B-30B) across five long-context benchmarks demonstrate that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at https://github.com/THU-KEG/LongTraceRL.","published_date":"2026-05-29T17:51:40+00:00","viability_score":8,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":"https://github.com/THU-KEG/LongTraceRL","commercial_flags":["has_code"],"one_liner":"A reinforcement learning framework that improves long-context reasoning in LLMs by generating challenging distractors and using fine-grained rubric rewards for intermediate supervision.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31581v1","title":"Choosing the Lens: Strategic Perspective Activation in Context-Dependent Argumentation","abstract":"The same arguments often need to be evaluated under different external regimes. An agent with influence over the regime has a strategic lever that standard formalisms do not directly capture. We introduce context-dependent argumentation frameworks (CDAFs), an extension of Dung's theory in which a defeat function determines, per context, which attacks succeed. A perspective-labeled specialisation derives the defeat function from a relevance set $\u03c1$ and a priority $\u03c0$. The relevance set is the agent's action space. In a small worked example, the agent's target argument is rejected under every full-relevance injective priority, yet accepted under partial activations, one of which no VAF audience can mirror. We define the corresponding decision problem, ACTIVATION-MANIPULATION, and record baseline complexity bounds. 
Tight bounds and multi-agent variants are left open.","published_date":"2026-05-29T17:50:11+00:00","viability_score":1,"cluster_label":"Argumentation AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Introduces context-dependent argumentation frameworks to model strategic manipulation of argument evaluation by agents.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31575v1","title":"SPECTRA: Synthetic IR Test Collections with Relevance Oracles and Controlled Distractor Diagnostics","abstract":"Scalable information retrieval testing needs corpora that are large enough to stress index construction, ranking latency, query routing, and evaluation tooling, yet human-judged test collections remain expensive and may be unavailable when documents are private or still under design. This paper introduces SPECTRA, a reproducible framework for generating synthetic text corpora and retrieval test collections through a separation of latent topical structure, surface text realization, metadata controls, query intent generation, and deterministic relevance oracles. The framework is intended as a diagnostic complement to Cranfield-style and TREC-style evaluation, not as a replacement for human assessment. A single-process Python prototype generated corpora up to 60,000 documents and 9.61 million tokens while preserving controllable long-tail vocabulary growth and producing graded relevance labels for 96 queries. In the local simulation study, generation remained close to linear at roughly 12K to 14K documents per second, estimated Zipf slopes stayed near 0.86 in absolute value, and increasing cross-topic distractor text reduced BM25 nDCG@10 from 1.00 at 2% distractors to 0.43 at 36% distractors. 
These results show that lightweight synthetic corpora can expose retrieval-system scaling and failure modes before costly collection construction begins.","published_date":"2026-05-29T17:44:15+00:00","viability_score":3,"cluster_label":"Information Retrieval Testing","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for generating synthetic information retrieval test collections with controlled relevance and distractors to diagnose system performance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31564v1","title":"What Gets Unmasked First? Trajectory Analysis of Diffusion Models for Graph-to-Text Generation","abstract":"We present the first systematic study of masked diffusion language models (MDLMs) for graph-to-text generation. We analyze MDLM generation trajectories -- the order in which tokens are unmasked during iterative decoding -- and find that, unlike autoregressive LLMs which generate text linearly, MDLMs naturally prioritize entities first, followed by relational and function words, with structural tokens resolved last. We further identify a previously undocumented failure mode of supervised fine-tuning: SFT disrupts this strategy by prematurely anchoring structural sentence-ending tokens early in the decoding trajectory, effectively fixing the output length which can lead to omitted or hallucinated information. To address this, we propose lambda-scaled structural decoding, a training-free inference-time modification that downweights structural token confidence and recovers +9.4 BLEU-4. Finally, we introduce Graph-LLaDA, which integrates a Graph Transformer encoder into LLaDA's decoding process to explicitly incorporate relational graph structure. 
Cross-dataset evaluation on LAGRANGE reveals that previous baselines overfit to dataset-specific patterns, while LLM- and MDLM-based approaches generalize significantly better.","published_date":"2026-05-29T17:29:35+00:00","viability_score":7,"cluster_label":"Graph-to-Text Generation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Analyzes diffusion model decoding trajectories for graph-to-text generation, proposing a training-free inference method and a graph-enhanced LLM for improved generalization.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31558v1","title":"Positional versus Symbolic Attention Heads: Learning Dynamics, RoPE Geometry, and Length Generalization","abstract":"Transformer-based language models are widespread in today's society. As such, understanding the mechanisms by which they solve structured tasks and predicting how they may behave in novel scenarios is of great importance for safe deployment. We study the learning dynamics of attention heads in a controlled setting by training a decoder-only Transformer (GPT-J) on two structurally equivalent multi-hop reasoning tasks: a number task requiring positional reasoning and a letter task requiring symbolic reasoning. Using a recently introduced metric that classifies attention-head behavior as positional or symbolic for a given prompt, we show that successful learning is associated with the emergence of pure heads, i.e., heads that express themselves as either positional or symbolic. Despite the tasks' structural equivalence, they impose different mechanistic demands: the number task requires both positional and symbolic heads, whereas the letter task requires only symbolic heads. 
We then identify the computational roles of these heads, characterize the basic functions they implement, and give theoretical constructions showing how single-layer RoPE-based attention can realize these functions through geometrically interpretable query, key, and value operations. This analysis yields a quantitative separation between positional and symbolic mechanisms in their robustness to longer sequences, formalized through a novel notion of discrepancy. We empirically validate the resulting predictions in both controlled and real-world models, showing that symbolic mechanisms extrapolate more reliably to longer sequences while positional mechanisms face sharper limitations.","published_date":"2026-05-29T17:22:04+00:00","viability_score":2,"cluster_label":"LLM Analysis","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper analyzes the internal workings of Transformer attention heads to understand positional versus symbolic reasoning, offering theoretical insights into length generalization.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31556v1","title":"Vision-Language Models Suppress Female Representations Under Ambiguous Input","abstract":"Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in full gear, a figure seen from behind) cases common in practice yet rarely studied. We find that minimal prompting pressure exposes occupation-gender defaults when prompting ambiguous input images, with models collapsing to male even for strongly female-stereotyped occupations. But do these outputs reflect what models actually encode internally? We introduce LALS (Latent Association Leaning Score), a zero-shot metric that projects visual-token activations into the model's text-embedding space to measure concept associations per token and layer. 
Across 15 occupations, over 800 gender-ambiguous images, and four VLMs, internal representations and outputs are systematically decoupled: models often encode a female association internally yet output male. Layer-wise analysis reveals an asymmetric filter -- male signal amplifies end-to-end while female signal peaks mid-network and is suppressed before generation -- and a color ablation shows that culturally loaded visual cues such as clothing color further modulate these internal associations.","published_date":"2026-05-29T17:20:02+00:00","viability_score":7,"cluster_label":"Vision-Language Bias","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel metric, LALS, reveals that Vision-Language Models internally associate ambiguous images with female representations but often output male, highlighting a critical bias in generation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31535v1","title":"RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video","abstract":"Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. 
Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder","published_date":"2026-05-29T16:50:27+00:00","viability_score":8,"cluster_label":"Novel View Synthesis","has_code":true,"repo_url":"https://github.com/compvis/rayder","commercial_flags":["has_code"],"one_liner":"RayDer is a unified transformer for scalable self-supervised novel view synthesis from real-world video, outperforming supervised methods with clean power-law scaling.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31534v1","title":"Feature-Optimized Vision for Adaptive 3D Scene Reconstruction","abstract":"Three-dimensional scene reconstruction depends on local image evidence that is both visually discriminative and geometrically useful. Fixed feature thresholds and uniform feature budgets are easy to deploy, but they can waste computation on repeated texture, low-parallax regions, or unstable points. This paper proposes an adaptive feature-optimized vision front end for 3D reconstruction. The method scores candidate features by texture, repeatability, distinctiveness, expected triangulation angle, and spatial coverage, then allocates a per-view feature budget to maximize useful tracks under a fixed reconstruction pipeline. A small synthetic multi-view prototype evaluates four selection policies across corridor, facade, object-table, and cluttered scenes. Compared with random, texture-only, and uniform-grid baselines, the adaptive policy obtains the best quality-aware completeness and the lowest aggregate reconstruction RMSE while preserving broad image coverage. 
The result is not a replacement for modern learned matching or neural reconstruction systems; it is a modular front-end policy that can make classical and learned 3D pipelines more deliberate about which visual evidence they spend compute on.","published_date":"2026-05-29T16:49:27+00:00","viability_score":3,"cluster_label":"3D Scene Reconstruction","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper proposes an adaptive feature selection policy for 3D scene reconstruction to improve efficiency and quality by prioritizing visually discriminative and geometrically useful image features.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31520v1","title":"Separating Secrets from Placeholders: A Hybrid CNN-CodeBERT Framework for Three-Class Credential Leakage Detection","abstract":"Credential leakage in public source code repositories poses a critical security threat, with over 23.8 million secrets exposed in 2024 alone. Existing detection tools suffer from high false-positive rates because rigid pattern matching and binary classification schemes fail to distinguish genuine credentials from placeholder or weak credentials. We propose a three-class classification framework that explicitly models placeholder or weak credentials as a distinct class, leveraging CodeBERT-based semantic understanding combined with character-level pattern recognition. We evaluate our approach on a newly constructed dataset of 9,426 samples spanning 10 programming languages. Our model achieves a Matthews Correlation Coefficient of 0.86 and a macro F1-score of 0.90, achieving 93% recall and 89% precision for genuine credential leaks while reducing high severity alerts by 33.0% (from 373 to 250) without sacrificing security coverage. 
Compared to prior character-level approaches, our method improves placeholder or weak credential detection from 54% to 81% F1-score while maintaining strong cross language generalization, with 9 of 10 languages achieving F1 above 0.80 under leave-one-language-out evaluation.","published_date":"2026-05-29T16:36:20+00:00","viability_score":7,"cluster_label":"Code Security","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hybrid CNN-CodeBERT framework detects credential leaks in code with higher accuracy and fewer false positives by distinguishing real secrets from placeholders.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.31514v1","title":"If LLMs Have Human-Like Attributes, Then So Does Age of Empires II","abstract":"Much research has been carried out on large language models (LLMs) and LLM-powered agentic workflows. However, many works within the field state emergence of, ascribe to, or assume, generalised anthropomorphic attributes to them (e.g., morality or understanding of natural language). Our goal is not to argue in favour or against the existence of these attributes, but to point out that these conclusions could be incorrect. For this we build and train a simple neural network on the videogame Age of Empires II, and note that any entity in a sufficiently-powerful substrate, such as LEGO or the Greater Boston Area, could also present such attributes. Hence, the purported anthropomorphic attributes of LLMs are empirically non-unique: although some properties (e.g., responses to prompts) could remain constant, others, such as the interpretation of their perceived behaviour, might change with the substrate. Thus, any empirically-grounded discussion requires explicit measurement criteria; otherwise the interpretation is left to the representation. 
We then show that assuming that these attributes exist or not in a system, independent of the substrate and in a generalised way, leads to either circular or uninformative conclusions, regardless of the experimenter's viewpoint on the subject. Finally we propose a 'null' assumption, where one assumes LLM non-uniqueness instead of assuming anthropomorphic attributes to set up an experiment, along with examples of it. We also discuss potential objections to our work, briefly survey the field, and prove that \\textit{Age of Empires II} is functionally- and Turing-complete.","published_date":"2026-05-29T16:31:31+00:00","viability_score":1,"cluster_label":"AI Philosophy","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper argues that attributing human-like attributes to LLMs is empirically non-unique and can be observed in other complex systems like Age of Empires II.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31509v1","title":"Skill Reuse as Compression in Agentic RL","abstract":"Large language model agents trained with reinforcement learning (RL) often learn brittle, task-specific shortcuts. We hypothesize that agents generalize better when their successful trajectories are structurally compressible, decomposed into a small set of reusable abstract patterns. To formalize this, we introduce ReuseRL, which grounds agentic RL in the Minimum Description Length (MDL) principle. ReuseRL extracts a shared skill dictionary from successful trajectories and augments the RL objective with a segmentation cost, explicitly penalizing idiosyncratic behaviors that encode poorly. We prove a PAC-Bayes generalization bound for this compression penalty. 
Across ALFWorld, TextWorld-Cooking, and Countdown-Stepwise, ReuseRL improves in- and out-of-distribution success over vanilla GRPO and strong round-length baselines.","published_date":"2026-05-29T16:28:34+00:00","viability_score":4,"cluster_label":"Agentic RL","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"ReuseRL formalizes agentic RL with the Minimum Description Length principle, improving generalization by decomposing successful trajectories into reusable abstract patterns.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31500v1","title":"On Efficient Scaling of GNNs via IO-Aware Layers Implementations","abstract":"Graph Neural Networks (GNNs) are bottlenecked by sparse, irregular memory access. Popular frameworks such as DGL and PyTorch Geometric support general message passing, but complex layers often materialize edge-wise intermediates, increasing memory traffic and limiting scalability on large graphs. We take an I/O- and arithmetic-intensity--centric view and show that widely used layers fall into three kernel families: SpMM-based convolutions, reduction-based aggregations, and attention-based layers (GATv2/Graph Transformer). For each family, we develop GPU kernels that reduce data movement, improve locality, and remain robust across realistic graphs. We also study graph reordering and find that its impact depends on the kernel mapping: it benefits neighbor-parallel (gather-dominated) kernels more consistently than feature-parallel designs. Empirically, our fused attention kernels reach up to $\\textbf{3.9}\\times$ speedup for Graph Transformer (median $\\textbf{1.6}\\times$), with Tensor Core (block-sparse) variants up to $\\textbf{7.3}\\times$ on locally dense graphs; for GATv2 we reach up to $\\textbf{8.5}\\times$ speedup (median $\\textbf{2.0}\\times$) while reducing peak memory by up to $\\textbf{76}\\times$ (median $\\textbf{6}\\times$). 
Our degree-aware reduction kernels achieve up to $\\textbf{10}\\times$ speedup (median $\\textbf{2.6}\\times$). For SpMM-based layers, properly cached cuSPARSE achieves up to $\\textbf{8}\\times$ speedup over DGL and outperforms evaluated custom baselines in the majority of evaluations. We release our implementations as drop-in replacements to support reproducible, hardware-aware GNN acceleration.","published_date":"2026-05-29T16:22:45+00:00","viability_score":5,"cluster_label":"GNN Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"New GPU kernels optimize Graph Neural Networks by reducing data movement and improving locality, achieving significant speedups and memory reductions.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.31492v1","title":"LinTree: Improving LLM Reasoning with Explicitly Structured Search Histories","abstract":"Large language models (LLMs) often solve reasoning problems by generating intermediate traces that explore and revise partial solutions. From a search perspective, these traces can be viewed as linearized search trees, where the model extends a partial solution, abandons it when it fails, and backtracks to try alternatives. Compared with traditional heuristic-guided search, such a policy has a potential advantage: it conditions on the whole search trace rather than only on the current local state. We first test whether LLMs utilize this advantage by comparing trace-conditioned reasoning policies against best-first search equipped with an LLM heuristic that only observes the current local state. Across three controlled reasoning environments, Blocks World, grid Navigation, and Sokoban, we find that raw access to search history alone is not enough to reliably outperform heuristic search. 
We then study one possible reason: in LLM reasoning traces, the underlying search tree is only implicitly represented, and when the model backtracks or switches branches, the trace does not explicitly identify which earlier search state is being revisited. We show that adding simple parent pointers to explicitly represent the linearized tree (LinTree) structure improves both task performance and search efficiency relative to implicit reasoning models and LLM-heuristic-guided search. These results suggest that search history becomes most useful when its tree structure is made explicit, motivating more structure-aware representations for LLM reasoning.","published_date":"2026-05-29T16:13:19+00:00","viability_score":2,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper explores how to improve LLM reasoning by explicitly structuring search histories, showing that explicit parent pointers enhance performance and efficiency in complex reasoning tasks.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31469v1","title":"Scaling Conversational Hungarian ASR: The BEA-Dialogue+ Corpus","abstract":"Conversational automatic speech recognition in Hungarian is constrained by the limited amount of publicly available dialogue-style training data. The BEA-Dialogue corpus addresses this need, but its strictly speaker-disjoint train/dev/eval split reduces the usable material to only 85 hours. In this paper, we introduce BEA-Dialogue+, an expanded version of the corpus that relaxes the split criterion for experimenters and dialogue partners while preserving complete separation of the primary speakers. This results in 200 hours of transcribed natural conversations and enables a controlled study of the trade-off between additional training data and speaker overlap across the splits. 
We evaluate several Whisper- and FastConformer-based models on both corpus versions, including Serialized Output Training (SOT)-based fine-tuning for dialogue transcription. Our results show that the larger corpus is more challenging for models without fine-tuning, whereas SOT-based adaptation yields consistent improvements in WER, CER, cpWER, and cpCER. Overall, BEA-Dialogue+ provides a substantially larger yet still demanding benchmark for Hungarian dialogue ASR, and a practical resource for training and evaluating dialogue transcription systems.","published_date":"2026-05-29T16:01:25+00:00","viability_score":5,"cluster_label":"ASR","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This work introduces BEA-Dialogue+, a larger Hungarian conversational ASR corpus and evaluates Whisper and FastConformer models, demonstrating SOT-based fine-tuning improves performance.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.31468v1","title":"AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle","abstract":"Scientific research has traditionally been human-intensive, requiring researchers to coordinate literature, ideas, experiments, manuscripts, and review responses across long project cycles. The rise of LLM-based scientific agents creates an opportunity to automate this process. Such a system must support the full research lifecycle, maintain structured persistent memory across projects, and improve its own research procedures over time. However, existing systems either partially satisfy or fail to satisfy these requirements, leaving a gap for a unified automated scientific research system. As a result, we present AutoSci, a memory-centric agentic system for the full scientific research lifecycle. AutoSci is organized around four modules. 
SciMem provides schema-governed research memory, separating Long-Term Knowledge Memory for reusable scientific knowledge from Active Research Memory for project-level artifacts such as ideas, experiments, manuscripts, and reviews. SciFlow executes a five-stage lifecycle from literature understanding to rebuttal through a harness that controls state, context, verification, feedback, and orchestration. SciDAG augments difficult skills with DAG-shaped multi-agent operators and reusable stage-specific templates. SciEvolve converts feedback signals from users, experiments, reviews, and external environments into versioned updates to SciMem organization, SciFlow skills, and SciDAG templates. Together, these modules make AutoSci a persistent research environment that can execute, remember, and evolve across research projects. The code repository is available at https://github.com/skyllwt/AutoSci.","published_date":"2026-05-29T16:00:04+00:00","viability_score":8,"cluster_label":"AI for Research Support","has_code":true,"repo_url":"https://github.com/skyllwt/AutoSci","commercial_flags":["has_code"],"one_liner":"AutoSci automates the entire scientific research lifecycle with a memory-centric agentic system.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31464v1","title":"GPU Forecasters: Language Models as Selective Surrogates for Kernel Runtime Optimization","abstract":"GPU kernels are the workhorse of modern deep learning, and optimizing them (via evolutionary search or coding agents) usually requires repeated measurement on target hardware. While these measurements provide the ground-truth signal necessary for kernel search, they are costly, because each evaluation of a kernel requires compilation and repeated execution on a GPU. As improvements in LLM inference reduce the cost of writing novel kernels and LLM-driven searches scale to large search budgets, on-device evaluation becomes a bottleneck. 
To address this, we study how LLMs can serve as selective GPU surrogates for kernel evaluation, by forecasting the performance of proposed kernels. A useful surrogate should be accurate, and it should be selective, by knowing when it could be wrong, and deferring to the GPU. To evaluate surrogates, we measure whether their forecasts are accurate, calibrated, and practically useful for recovering fast kernels under limited GPU-measurement budgets. Next, we study whether reinforcement learning can improve forecast accuracy and confidence calibration. Our experiments demonstrate that LLMs can accurately forecast relative kernel performance, and that their utility can be improved through reinforcement learning. Used inside a kernel search, the surrogate lets the search consider several times as many candidates under the same GPU evaluation budget, and that leads to finding faster kernels than an equal-budget baseline. These results suggest that LLMs can play a broader role in kernel optimization, by acting as virtual models of a GPU rather than solely as kernel generators for search.","published_date":"2026-05-29T15:56:08+00:00","viability_score":7,"cluster_label":"LLM Optimization","has_code":true,"repo_url":"https://github.com/codezakh/gpu-forecasters","commercial_flags":["has_code"],"one_liner":"GPU Forecasters leverage LLMs as selective surrogates to forecast kernel performance, significantly reducing costly GPU measurements and enabling faster kernel optimization.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31463v1","title":"PithTrain: A Compact and Agent-Native MoE Training System","abstract":"Mixture-of-Experts (MoE) has become the dominant architecture for frontier language models. To meet this demand, production frameworks have built optimized MoE training stacks over years of engineering effort. Yet evolving these stacks for new architectures and system optimizations remains expensive. 
With the rise of AI coding agents, they could automate parts of training-framework development and accelerate this evolution. But applying them to these existing frameworks carries hidden costs, invisible to today's throughput-only evaluations. We name this missing dimension agent-task efficiency (ATE): the cost of using coding agents to understand, operate, and extend a framework. Grounded in four agent-native design principles, we build PithTrain, a compact, agent-native MoE training framework. We further introduce ATE-Bench, covering real-world training-framework tasks. Our evaluation shows PithTrain matches the throughput of production frameworks, and on ATE-Bench, PithTrain enables higher agent-task efficiency, with up to 62% fewer Agent Turns and 64% less Active GPU Time.","published_date":"2026-05-29T15:52:58+00:00","viability_score":4,"cluster_label":"LLM Training Infrastructure","has_code":true,"repo_url":"https://github.com/mlc-ai/pith-train","commercial_flags":["has_code"],"one_liner":"A compact, agent-native MoE training framework that matches production throughput while improving agent-task efficiency.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31446v1","title":"Fine-grained Verification via Diagnostic Reasoning Supervision for Aspect Sentiment Triplet Extraction","abstract":"Aspect Sentiment Triplet Extraction (ASTE) aims to identify aspect terms, opinion terms, and sentiment polarities as structured triplets, providing essential inputs for downstream information system applications such as opinion mining, explainable recommendations, and review summarization. Prior work mainly focuses on end-to-end extraction, while post hoc verification of extracted triplets remains comparatively underexplored. This gap limits the reliability of ASTE systems, since predicted triplets may be locally plausible while being globally invalid. 
Moreover, candidate invalidity is multi-faceted and candidate usability is inherently graded, motivating a fine-grained verification mechanism that can filter or re-rank outputs from diverse extractors. In this paper, we propose FiVeD, a framework for Fine-grained Verification with Diagnostic reasoning supervision. Specifically, the verifier is trained with multiple complementary objectives, including validity classification and quality score estimation as primary tasks, with error type classification and rationale generation as auxiliary tasks. We define hierarchical error categories and construct plausible incorrect triplets under semantic and syntactic constraints, and leverage an off-the-shelf LLM with task-specific rubrics to produce quality scores and diagnostic rationales. During inference, the resulting quality scores are used to filter candidate outputs, supporting adjustable precision-recall tradeoffs. Experiments across multiple ASTE baselines demonstrate that FiVeD consistently improves extraction performance by up to 3.53 F1 points as a plug-and-play verification module.","published_date":"2026-05-29T15:40:58+00:00","viability_score":7,"cluster_label":"NLP - Aspect Sentiment Extraction","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A plug-and-play verification module for Aspect Sentiment Triplet Extraction that improves reliability and allows adjustable precision-recall tradeoffs.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31445v1","title":"Used Car Salesbots? Honesty and Credulity of LLMs as Bargaining Agents under Partial Information","abstract":"In this work we study agents in simulated bargaining scenarios, where a buyer and a seller communicate through a text channel and attempt to negotiate mutually beneficial trades, under different information regimes (complete information, information asymmetry or mutual uncertainty). 
We evaluate their performance w.r.t. game-theoretical solutions and further investigate their honesty (their tendency to disclose or withhold information or to mislead and deceive) as well as their credulity (their tendency to trust or distrust information provided by the other agent). We study zero-shot LLM agents with simple prompting scaffolding as well as fine-tuned agents, in order to investigate whether optimising the agents to maximise financial profits makes them stronger negotiators but also more dishonest and less trusting. We find that off-the-shelf LLMs all substantially deviate from game-theoretical equilibria: they attempt to lie about their private information but cannot efficiently exploit information asymmetries. Fine-tuning on financial utility makes the agents stronger at achieving better deals but also more dishonest, highlighting the risks that optimising agents for a task can have on their safety. We release our code and a dataset of bargaining scenarios.","published_date":"2026-05-29T15:40:29+00:00","viability_score":7,"cluster_label":"Agents - Bargaining","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Investigating LLM agents in simulated bargaining, revealing risks of optimizing for profit leading to dishonesty and distrust.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31444v1","title":"Answer-Set-Programming-based Abstractions for Reinforcement Learning","abstract":"Reinforcement Learning (RL) enables autonomous agents to learn policies from experience, but realistic problems often involve enormous state spaces, making learning and generalisation challenging. Abstraction and approximation are therefore essential. 
Relational Reinforcement Learning (RRL) offers a way to reason about objects and their relations, and the CARCASS framework by Martijn van Otterlo demonstrates how logical representations can model Markov Decision Processes (MDPs) in first-order domains. Originally implemented in Prolog, CARCASS leverages domain knowledge to create powerful abstractions. We explore Answer-Set Programming (ASP), which is a rich and, contrary to Prolog, fully declarative modelling language, to realise CARCASS abstractions. We evaluate our ASP-based implementation in case studies of two domains, viz. Blocks World and Minigrid. Our results indicate that CARCASS with ASP provides a promising approach to constructing abstractions for RL, especially when domain knowledge is available.","published_date":"2026-05-29T15:40:10+00:00","viability_score":4,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Exploring Answer-Set Programming for creating abstractions in Reinforcement Learning, particularly for relational domains.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31432v1","title":"DOA: Training-Free Decoder-Only Attention Policy for Long-Form Simultaneous Translation with SpeechLLMs","abstract":"Simultaneous speech-to-text translation (SimulST) generates translations while speech is still unfolding, requiring a streaming policy that decides when to read and when to write. State-of-the-art approaches rely on attention-based encoder-decoder models where cross-attention provides explicit alignment signals. In contrast, Speech Large Language Models (SpeechLLMs) are decoder-only architectures relying solely on self-attention. This raises a central question: whether decoder self-attention contains sufficiently stable alignment signals to guide the streaming policy. 
Moreover, existing approaches typically rely on training-based adaptations or heuristic wait-$k$ policies and have not been validated in long-form settings. To fill these gaps, we propose Decoder-Only Attention (DOA), a training-free policy that enables long-form simultaneous translation with off-the-shelf SpeechLLMs by deriving a proxy alignment from self-attention. Experiments on Phi4-Multimodal and Qwen3-Omni show that DOA provides an effective alignment signal for supporting streaming decisions, enabling low-latency long-form SimulST with quality close to offline decoding without retraining.","published_date":"2026-05-29T15:27:26+00:00","viability_score":4,"cluster_label":"Speech Translation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A training-free policy for long-form simultaneous translation using decoder-only attention in SpeechLLMs.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31421v1","title":"Neuro-symbolic Syntactic Parsing: Shaping a Neural Network with the CYK Algorithm","abstract":"In this paper, we show the possibility of a direct injection of algorithms into neural network architecture. We focus on a complex algorithm, that is, Cocke-Younger-Kasami (CYK) for parsing context-free grammars in Chomsky Normal Form, and we propose CYKNN, a simple recurrent neural network architecture for encoding the CYK algorithm in trainable matrix-vector multiplications. We experimented with a very simple grammar with 4 variations, showing that our approach outperforms existing LLMs with more than 20B parameters in an in-context learning setting, as well as smaller LLMs of the Qwen family fine-tuned with LoRA. 
Our attempt paves the way to a different approach to neuro-symbolic methodologies.","published_date":"2026-05-29T15:21:11+00:00","viability_score":5,"cluster_label":"Neuro-symbolic AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Injecting the CYK algorithm directly into a recurrent neural network for syntactic parsing, outperforming larger LLMs.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31410v1","title":"FAM-Bench: A Multimodal Benchmark for Condition-Aware Food-as-Medicine Reasoning","abstract":"Food-as-Medicine requires models to reason beyond what a dish is or what nutrition it contains: they must decide whether a concrete food choice is appropriate for a specific health condition. Existing food AI benchmarks primarily evaluate dish recognition, recipe understanding, nutrient estimation, or general nutrition question answering, leaving this health-aware decision layer largely untested. We introduce FAM-Bench, a multi-modal Food-as-Medicine benchmark with 2500 nutrition-expert-verified instances across 13 diet-related health conditions. The benchmark contains two complementary tasks: dish-level suitability assessment, where models judge whether a dish is suitable for a condition from its image and ingredient list, and comparative dish analysis, where models rank four candidate dishes by condition-specific suitability. 
Both tasks require integrating ingredient evidence, visual preparation cues, and clinical nutrition constraints, providing a standardized testbed for grounded health-aware reasoning in language and vision-language models.","published_date":"2026-05-29T15:13:53+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"FAM-Bench: A multimodal benchmark for food-as-medicine reasoning, assessing dish suitability for specific health conditions.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31408v1","title":"Skill Availability and Presentation Granularity in Large-Language-Model Agents: A Controlled SkillsBench Study","abstract":"Skill documents provide procedural knowledge to large-language-model agents at inference time. This article studies whether the presentation granularity of controlled skill knowledge changes downstream task success. The experiment uses a pinned SkillsBench version, a 30-task domain-balanced subset validated by official oracle runs, two reasoning-enabled model configurations, six skill conditions, and five trials per task-condition-model cell. Skill availability is the clearest empirical signal. Relative to no skill, skill conditions increase task-mean pass rate by 26.7 to 36.0 percentage points for GPT-5.5 and by 18.0 to 26.0 percentage points for DeepSeek V4-Flash. The final data contain 1,800 rows, with 900 rows for each model. The task is the inference unit. Five trials are aggregated within each task-condition-model cell before paired contrasts are estimated over 30 tasks. The primary presentation contrasts are smaller and uncertain. Low-abstraction guidance differs from high-abstraction guidance by +0.7 percentage points for GPT-5.5 and -6.7 percentage points for DeepSeek V4-Flash, with both 95% bootstrap confidence intervals crossing zero. 
Adding one worked example to medium-abstraction guidance differs from the no-example variant by +0.7 and +1.3 percentage points. Mean-reward robustness checks preserve the same substantive conclusion. In this controlled subset, skill availability is associated with higher success than no skill, while the tested presentation-granularity changes yield small, uncertain, and model-dependent effects.","published_date":"2026-05-29T15:12:24+00:00","viability_score":4,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Skill availability significantly improves LLM agent performance, while presentation granularity has minimal impact.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31404v1","title":"The Sword, Shield, and Achilles' Heel: Characterizing the Linguistic Inductive Bias of Large Language Models for Spatial Reasoning in Navigation Planning","abstract":"Large Language Model (LLM)-based navigation systems commonly construct explicit spatial representations (e.g., topological graphs, semantic raster maps) and translate them into textual descriptions as LLMs' inputs. However, the linguistic structures of such text-based spatial representations and the choices of contextual features (e.g., topology, geometry) they contain are often treated as neutral engineering decisions rather than key factors that shape LLMs' behavior. To fill the gap, we propose a dual-interventional framework that disentangles linguistic structures from different contextual cues to evaluate the linguistic inductive bias of LLMs for navigation planning. In the framework, representation intervention varies the linguistic format and the degree of linguistic compression, clarifying when linguistic representations support or inhibit navigation planning. Context intervention, combined with contextual feature combination and conflict probing, explicitly clarifies the preferences and weaknesses of LLMs when processing different contextual cues. 
Experiments across diverse spatial reasoning tasks and multiple model scales reveal a consistent pattern: topological information is a sturdy shield and the backbone of robust planning; linguistic format is a double-edged sword whose effect depends on model size, task demands, and the compression level; and semantic information is a fatal Achilles' heel -- incorrect semantic cues can systematically derail the planning process. Overall, our study shows that effective text-based spatial representations in LLM-based navigation should preserve topological integrity, calibrate representational compression to model capacity, and ensure semantic correctness, rather than simply adopting a single representation. Our code is publicly available at https://github.com/jonesdong150/LLM-Navigation-Inductive-Bias.","published_date":"2026-05-29T15:09:25+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":"https://github.com/jonesdong150/LLM-Navigation-Inductive-Bias","commercial_flags":["has_code"],"one_liner":"This research characterizes the linguistic inductive bias of LLMs for navigation planning, offering insights to build more robust LLM-based navigation systems.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31393v1","title":"Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models","abstract":"Sign language translation (SLT) remains constrained by limited paired sign-video/text corpora and heavy-tailed target vocabularies. We study target-side augmentation in which GPT-4o generates controlled paraphrase variants of reference sentences while the sign input remains unchanged. A Signformer-style pose-based Transformer is trained under a two-stage schedule: pre-training on the augmented corpus followed by fine-tuning on the original references.   
We evaluate on three datasets spanning complementary challenges: PHOENIX14T (German Sign Language), with moderate lexical diversity; GSL (Greek Sign Language), with highly controlled, repetitive recordings; and LSA-T (Argentinian Sign Language), with severe long-tail sparsity. On PHOENIX14T, augmentation improves BLEU-4 from 9.56 to 10.33. The near-saturated GSL baseline and extremely sparse LSA-T setting reveal the limits of the approach. To our knowledge, this is the first study to apply LLM-generated target-side paraphrases and LLM-as-a-Judge evaluation to SLT. The semantic evaluation reveals gains in fidelity that lexical overlap metrics understate.","published_date":"2026-05-29T14:58:21+00:00","viability_score":6,"cluster_label":"Sign Language Translation","has_code":true,"repo_url":"https://github.com/cvpr-org/author-kit","commercial_flags":["has_code"],"one_liner":"This work explores using GPT-4o to augment sign language translation datasets, improving translation quality for low-resource languages.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.31377v1","title":"DynaTree: Dynamic Agentic Retrieval Tree for Time-Sensitive News Retrieval","abstract":"Agentic Retrieval-Augmented Generation improves retrieval by integrating planning, tool use, and iterative reasoning, but existing agentic RAG methods often couple semantic expansion with retrieval decisions in short-horizon inference loops, leading to high inference cost and limited suitability for time-sensitive news retrieval. We propose DynaTree, a two-stage framework for efficient and adaptive news retrieval. In the offline stage, DynaTree uses coordinated agents to construct a reusable retrieval tree that materializes the semantic space of a query topic. In the online stage, DynaTree performs lightweight daily subtree selection over a time-localized evaluation proxy, without further agentic reasoning, tree modification, or retraining. 
Experiments on a multi-day Syft news benchmark and multiple BEIR datasets show that DynaTree achieves strong recall and ranking performance, consistently outperforming standard RAG and prior agentic baselines. We further deploy DynaTree in the Syft production system and evaluate it through online A/B testing from Jan. 28 to Feb. 6, 2026. The dynamically adapted variant improves survival rate from 0.32-0.53 to 0.59-0.73 over a fixed offline-selected subtree and outperforms existing production recallers on every evaluation day. These results show that persistent, structure-aware semantic expansion can translate offline agentic reasoning into practical improvements in coverage, freshness, and relevance for real-world news retrieval.","published_date":"2026-05-29T14:45:58+00:00","viability_score":8,"cluster_label":"AI for Media","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DynaTree offers the most effective time-sensitive news retrieval system, outperforming benchmarks across multiple datasets.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31373v1","title":"Scaling Higher-Order Graph Learning with Maximal Clique Complexes","abstract":"Graph neural networks (GNNs) are limited to modeling pairwise interactions, while higher-order models based on cell complexes achieve greater expressivity but often suffer from poor scalability. We introduce simplified and factored cellular Weisfeiler Leman tests (sCWL and fCWL), which preserve the expressivity of the CWL test while improving computational efficiency. We further introduce the maximal clique complex, enabling scalable CWNs with reduced time and memory complexity while retaining strong empirical performance. To avoid explicit clique enumeration, we propose CliqueWalk, a biased random walk that samples maximal cliques and scales linearly with graph size. 
These contributions yield a scalable topological learning framework for higher-order graph representation.","published_date":"2026-05-29T14:42:40+00:00","viability_score":3,"cluster_label":"Graph Neural Networks","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper introduces scalable methods for higher-order graph learning using maximal clique complexes and biased random walks.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31370v1","title":"HypoAgent: An Agentic Framework for Interactive Abductive Hypothesis Generation over Knowledge Graphs","abstract":"Abductive reasoning over knowledge graphs aims to generate logical hypotheses that explain observed entities or facts. Existing controllable hypothesis generation methods allow users to guide this process with explicit conditions, but they remain limited in interactive settings: they struggle to ground evolving natural-language intents across multi-turn dialogues and provide little fine-grained diagnosis when generated hypotheses fail. To address these limitations, we propose HypoAgent, an Agentic framework for interactive abductive Hypothesis Generation over knowledge graphs. HypoAgent integrates three agents: an Intent Recognition Agent that grounds user utterances and dialogue history into executable KG conditions, a Hypothesis Generation Agent that performs controllable hypothesis generation according to the extracted user intention, and a Root Cause Analysis Agent that diagnoses unreliable hypothesis fragments and leverages KG neighborhood probing to identify supported refinements. Experiments on commonsense and biomedical domain-specific knowledge graphs demonstrate that HypoAgent achieves state-of-the-art semantic similarity under single-turn, multi-turn, and unconditional settings. 
Our code is available at https://github.com/HKUST-KnowComp/HypoAgent.","published_date":"2026-05-29T14:40:37+00:00","viability_score":8,"cluster_label":"Agents","has_code":true,"repo_url":"https://github.com/HKUST-KnowComp/HypoAgent","commercial_flags":["has_code"],"one_liner":"An agentic framework for interactive abductive hypothesis generation over knowledge graphs that grounds user intents and diagnoses hypothesis failures.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31365v1","title":"Learning to Adapt: Self-Improving Web Agent via Cognitive-Aware Exploration","abstract":"Recent advances in Multimodal Large Language Models (MLLMs) have led to promising progress in web agents. However, existing web agents often rely on handcrafted execution pipelines or expensive expert trajectories, limiting their adaptability to complex, dynamic environments. To address these challenges, we propose SCALE (Self-Cognitive-Aware Learning and Exploration), which leverages three adversarial roles, Selector, Predictor, and Judger to autonomously discover the agent's limitations and expand its cognitive boundaries through environmental exploration. Moreover, we propose SCALE-Hop, a graph exploration strategy that facilitates global planning and helps agents avoid local exploration traps. To further support learning, we construct SCALE-20k, a large-scale dataset collected from 19 real-world websites, containing diverse task types and structured demonstrations generated from SCALE's exploration traces. Experimental results show that our approach significantly improves the performance and generalization of multiple MLLMs in various web environments. 
Our framework offers a scalable and generalizable solution for building truly autonomous and adaptive web agents.","published_date":"2026-05-29T14:37:27+00:00","viability_score":5,"cluster_label":"Web Automation","has_code":true,"repo_url":"https://github.com/cvpr-org/author-kit","commercial_flags":["has_code"],"one_liner":"A self-improving web agent that adapts through cognitive-aware exploration.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2605.31361v1","title":"Dreaming Of Others: Latent Teammate Modeling In World Models For Multi-Agent Reinforcement Learning","abstract":"In cooperative multi-agent reinforcement learning (MARL), agents must coordinate with partners whose internal policies and intentions are not directly observable. While world models such as Dreamer have demonstrated strong generalization and sample efficiency in single-agent settings, their application to MARL remains limited by an inability to handle teammate-induced uncertainty. We propose a new perspective: treat teammates as structured, learnable components within the agent's world model. We introduce an architecture that factorizes the latent state of a Dreamer-style recurrent state-space model (RSSM) into environment and teammate components, and learns an auxiliary Theory-of-Mind (ToM) head to infer latent embeddings of partner behavior such as character, intent, and predicted actions from partial trajectories. These teammate latents condition the actor and critic, enabling the agent to imagine and adapt to diverse collaborators. We outline how this approach can support zero-shot and few-shot coordination in partially observable settings and propose a set of benchmarks and evaluation protocols to assess its impact. 
This work positions world models as not only predictors of environmental dynamics, but as simulators of social behavior, opening new directions for generalizable, human-compatible AI.","published_date":"2026-05-29T14:34:50+00:00","viability_score":7,"cluster_label":"Multi-Agent RL","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework that integrates latent teammate modeling into world models for cooperative multi-agent reinforcement learning, enabling adaptation to diverse collaborators.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.31360v1","title":"dashi: A Python library for Dataset Shift Characterization to Support Trustworthy AI Development and Deployment","abstract":"The Artificial Intelligence (AI) life cycle requires a thorough understanding of the underlying data dynamics for robust, safe and cost-effective AI development and use. Dataset shifts are defined as changes between train and test data distributions. Whether occurring over time (temporal) or across different sites (multi-source), they can severely degrade model performance and compromise data quality. This is particularly important in health AI, where the safety and fundamental rights of patients can be severely affected by uncontrolled shifts both at training and operational stages. While the theoretical foundations of covariate, prior, and concept shifts are well established, there is a lack of accessible and comprehensive software tools to perform their analysis. We introduce dashi, an open-source Python library designed for the exploration, quantification, and characterization of dataset shifts. 
dashi provides a dual approach: an unsupervised approach that leverages information geometry and non-parametric statistical manifolds to data variability characterization and analysis (e.g., Information Geometric Temporal plots and Multi-Source Variability metrics like Global Probabilistic Deviation and Source Probabilistic Outlyingness), and a supervised approach that quantifies and characterizes model performance degradation. Both unsupervised and supervised approaches work across user-defined temporal and domain/source batches. We demonstrate the utility of dashi on three simulated and real-world health AI case studies on gestational diabetes mellitus, COVID-19 and emergency medical dispatch. By providing interactive visual analytics and variability metrics, dashi supports trustworthiness of AI life cycle stages enabling robust and safe machine learning pipelines through the assessment of data coherence and AI performance.","published_date":"2026-05-29T14:33:45+00:00","viability_score":8,"cluster_label":"MLOps","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A Python library for characterizing dataset shifts to support trustworthy AI development and deployment, offering unsupervised and supervised approaches with interactive visualization.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31354v1","title":"Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents","abstract":"Modular visual reasoning systems increasingly rely on shared working memory for multi-step collaboration, yet the failure dynamics of intermediate state evolution in low-capacity regimes remain underexplored. We study failure modes of collaborative reasoning with weak learners (4B--8B models) through the lens of noise accumulation. 
We introduce CoSee, an auditing framework that formalizes the read-write-verify loop to trace information flow in document visual question answering. Across multi-page, chart, and web-based benchmarks, we find a counter-intuitive degradation: naive shared workspaces often amplify hallucinations rather than resolve them. We identify two dominant failure modes: Noise Reinforcement, where ungrounded notes are reused as evidence, and Policy Collapse, where added context shifts the model toward under-specified, short-form answers. Using cost-accuracy Pareto frontiers, we show that increased compute can correlate negatively with performance without explicit verification. Our findings suggest that for resource-constrained agents, the bottleneck lies not in reasoning depth but in communication fidelity, providing trace-level diagnostics and a mechanistic baseline for reliable modular design.","published_date":"2026-05-29T14:29:56+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An auditing framework for diagnosing and improving collaboration failures in resource-constrained visual agents by tracing information flow.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31349v1","title":"FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection","abstract":"Hateful meme detection remains a formidable challenge for vision-language models, as existing benchmarks are structurally observational - confounding rhetorical hate mechanisms with target community features and preventing causal evaluation of model vulnerabilities. To address this, we introduce FBHM, a systematically curated benchmark of Functionality Based Hateful Memes constructed along two orthogonal axes: 25 distinct rhetorical functionalities and 10 target communities (5,000 memes total). 
Benchmarking state-of-the-art VLMs reveals a severe generalization gap: models highly accurate on standard datasets catastrophically drop to near-random performance on FBHM, proving they exploit dataset-specific heuristics rather than robust multimodal reasoning. To efficiently close this gap, we propose LSV (learnable steering vectors), an ultra-low data regime strategy that applies a causal intervention objective on as few as 500 steering samples (50 unique base memes), boosting FBHM performance by ~30 Macro-F1 points while outperforming in-context learning and PEFT without degrading source-domain performance.","published_date":"2026-05-29T14:27:17+00:00","viability_score":8,"cluster_label":"Vision-Language Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"FBHM optimizes vision-language models for effective detection of hateful memes using a novel benchmarking and steering approach.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31340v1","title":"Appropriateness of Empathy in AI: A Signal-Cost Perspective","abstract":"The appropriateness of empathy in AI has emerged as a critical concern, as excessive empathy risks seeming manipulative while insufficient empathy appears dismissive. While prior research has explored how to quantify empathy in AI, few studies examine whether such empathy is contextually appropriate. This paper introduces an economic perspective by applying signaling theory to human-AI conversations. We propose Signal Cost Proxies (emotional richness, perspective-taking, and contextual tailoring) mapped to affective, cognitive, and associative empathy. 
This multidimensional framework enables systematic evaluation of empathy not just by presence, but by its appropriateness relative to user demand.","published_date":"2026-05-29T14:19:01+00:00","viability_score":0,"cluster_label":"Human-AI Interaction","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical framework for evaluating the appropriateness of empathy in AI conversations using a signal-cost perspective.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31330v1","title":"Social welfare optimisation under institutional reward and punishment","abstract":"Institutional incentives are widely used to promote cooperation among autonomous, self-regarding agents, from human societies to multi-agent and AI systems. Existing work typically treats incentive design as a bi-objective problem: minimise institutional cost while achieving a high long-run frequency of cooperation. Whether such schemes also maximise social welfare - total population payoff net of institutional expenditure - has remained largely unexplored. We develop a welfare-centric framework for institutional incentives in finite, well-mixed populations playing a social dilemma (Donation Game and Public Goods Game), considering both rewards for cooperators and punishments for defectors. For each mechanism, we derive explicit expressions for expected social welfare and characterise how it depends on incentive efficiency and selection intensity. Analytically, we identify parameter regimes where social welfare has a single optimal incentive level and regimes with qualitative phase transitions, in which welfare becomes non-monotonic with multiple local optima. We prove that any welfare-maximising incentive is either zero or concentrated around a simple closed-form target, and we provide an efficient algorithm to compute these optima. 
Comparing reward and punishment, we further derive closed-form conditions under which reward outperforms punishment in terms of social welfare for any given budget. Overall, our results reveal a systematic gap between incentives optimised for cost or cooperation frequency and those that maximise welfare.","published_date":"2026-05-29T14:04:55+00:00","viability_score":0,"cluster_label":"Multi-Agent Systems","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical framework for optimizing social welfare in multi-agent systems through institutional rewards and punishments.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31324v1","title":"Inconsistency-Aware Minimization: Improving Generalization with Unlabeled Data","abstract":"Estimating the generalization gap and developing optimization methods that improve generalization are crucial for deep learning models, for both theoretical understanding and practical applications. Leveraging unlabeled data for these purposes offers significant advantages in real-world scenarios. This paper introduces a novel generalization measure, local inconsistency, derived from an information-geometric perspective on the parameter space of neural networks. A key feature of local inconsistency is that it can be computed without explicit labels. We establish theoretical underpinnings by connecting local inconsistency to the Fisher information matrix and the loss Hessian. Empirically, we demonstrate that local inconsistency correlates with the generalization gap. Based on these findings, we propose Inconsistency-Aware Minimization (IAM), which incorporates local inconsistency into the training objective. We demonstrate that in standard supervised learning settings, IAM enhances generalization, achieving performance comparable to that of existing methods such as Sharpness-Aware Minimization. 
Furthermore, IAM exhibits efficacy in semi- and self-supervised learning scenarios, where the local inconsistency is computed from unlabeled data.","published_date":"2026-05-29T13:56:17+00:00","viability_score":4,"cluster_label":"Generalization Improvement","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel method to improve model generalization using unlabeled data by measuring local inconsistency in the parameter space.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.31308v1","title":"TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories","abstract":"Agent benchmarks increasingly record rich interaction trajectories, yet evaluation often reduces each rollout to a pass rate or reward score. We introduce TraceGraph, a graph-based framework that turns released multi-model agent trajectories into shared decision landscapes. For each task, TraceGraph builds a graph over observable action-observation states from pooled rollouts before model identity is introduced. It then overlays outcome-informed productive cores and trap regions, and summarizes each rollout with three events: Access, Trap exposure, and Repair. Across trajectories spanning five benchmark splits, TraceGraph profiles reveal navigation differences hidden by aggregate scores and show that splits differ in whether they reward avoiding traps or recovering from them. The same TraceGraph landscape also motivates a trap-aware recovery pipeline for SWE-bench: a runtime detector fires on states matching historical trap regions, then lightweight continuation policies are evaluated from the same prefix. On fired states, the best pooled single-factor policy raises official resolved rate from 40.4% to 43.5% on the per-provider fired subset and from 41.0% to 44.8% on common-fired instances, with provider-specific active components. 
Overall, TraceGraph provides a process vocabulary for asking what agent benchmarks test, where models diverge on a shared landscape, and how failure regions can guide downstream improvement.","published_date":"2026-05-29T13:40:31+00:00","viability_score":7,"cluster_label":"Agent Evaluation & Improvement","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"TraceGraph visualizes agent decision landscapes from trajectories, revealing hidden performance differences and enabling targeted improvements for agent recovery pipelines.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31295v1","title":"Latent Space Disentanglement via Activation Steering for Interpretable Attribute Control in Symbolic Music Generation","abstract":"Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. 
Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.","published_date":"2026-05-29T13:31:56+00:00","viability_score":7,"cluster_label":"Symbolic Music Generation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for fine-grained, interpretable control of symbolic music generation attributes like Pitch and Duration via inference-time activation steering.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31289v1","title":"The Terminal Representation in Reinforcement Learning","abstract":"Representation learning is a powerful tool for spatio-temporal abstraction within reinforcement learning (RL). Two well established approaches are through the successor representation (SR) and the default representation (DR). The SR encodes states by the future trajectories they induce, capturing information flow decoupled from reward. The DR builds on this by weighting trajectories with reward, integrating credit-assignment structure into the representation. Eigenvectors of both representations have been used to support a range of downstream tasks -- including option discovery, reward shaping, transfer learning, and exploration. We introduce a structurally distinct formulation: the terminal representation (TR). The TR encodes reward-weighted trajectories similarly to the DR, but can be learned as a lower-dimensionality object, and can be used directly for the mentioned applications without eigenvector computations. Eigendecomposition also imposes the assumption of symmetric transition dynamics, which the TR can bypass. 
In this work we develop the theoretical foundations of the TR: its derivation, convergence of two learning algorithms, its use for zero-shot compositionality, and equivalences between alternative reward formulations. We further show the TR is embedded in the top DR eigenvector, allowing it to capture the same underlying knowledge without eigendecomposition. Additionally, we provide empirical evidence of the TR as a viable alternative to existing representations in subsidiary applications, while requiring less computational overhead to learn, store, and use.","published_date":"2026-05-29T13:24:28+00:00","viability_score":3,"cluster_label":"Reinforcement Learning Representations","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Introduces the Terminal Representation (TR) for reinforcement learning, offering a lower-dimensionality alternative to existing representations with reduced computational overhead.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31287v1","title":"Neither Replacement nor Panacea: Comparing LLM-Based Conversational and Graphical Decision Support in Industrial Tasks","abstract":"Managers in manufacturing settings rely on digital interfaces to interpret operational data for decision-making, but growing data volume and complexity can make relevant insights difficult to identify efficiently. While dashboards remain dominant in industrial contexts, Large Language Model (LLM)-based conversational agents (CAs), accessed through conversational user interfaces (CUIs), may provide more direct access to such data. However, their effectiveness may depend on the information-processing demands of the task. This study compares an LLM-based CA delivered through a CUI with a dashboard in a manufacturing decision-support scenario. In a mixed factorial experiment with a 2x3 design, 134 industrial decision-makers were assigned to one interface condition and completed three tasks of increasing complexity. 
We examined perceived Mental Workload (MWL), decision accuracy, completion time, and intended reliance, and tested self-reported data literacy as a moderator. Results showed that the CUI reduced perceived MWL overall and supported faster completion in less demanding tasks, but both advantages diminished as task complexity increased. Neither interface produced a consistent overall advantage in decision accuracy, and the CUI was not preferred as a sole basis for subsequent decisions. Furthermore, data literacy did not reliably moderate interface effects. These findings indicate that conversational interaction offers conditional rather than universal benefits for industrial decision support. LLM-based CAs may reduce information-access effort, whereas complex decisions continue to benefit from persistent, inspectable visual representations.","published_date":"2026-05-29T13:22:58+00:00","viability_score":3,"cluster_label":"LLM Applications","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Comparing LLM conversational agents with dashboards for industrial decision support shows conditional benefits, with LLMs reducing workload for simpler tasks but not improving accuracy for complex ones.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31286v1","title":"DeMaVLA: A Vision-Language-Action Foundation Model for Generalizable Deformable Manipulation","abstract":"Real-world household robots require Vision-Language-Action (VLA) foundation models that can acquire reusable manipulation skills across diverse objects, task conditions, and household environments. Deformable-object folding is a representative challenge, requiring robots to handle clothing items from random initial states across varying categories, geometries, materials, and scenes. However, existing VLA systems commonly train separate policies for different object categories, while naively mixed multi-task training often suffers from task interference and degraded performance. 
To move beyond category-specific folding policies, we introduce DeMaVLA, a VLA foundation model for generalizable Deformable Manipulation. DeMaVLA adopts a VLM backbone with an action expert and formulates continuous action generation using flow matching. To improve efficiency, the action expert is constructed by pruning every other transformer layer while preserving layer-wise alignment with the VLM backbone, reducing training and inference cost. DeMaVLA is first pre-trained on approximately 5,000 hours of selected real-world dual-arm demonstrations to acquire general manipulation priors. It is then post-trained on mixed folding data that aggregates self-collected demonstrations and corrective trajectories from real-robot failures across multiple folding tasks through a human-in-the-loop Data Aggregation~(DAgger) pipeline. Experiments show that DeMaVLA achieves competitive performance on RoboTwin and strong real-world results on our household folding benchmark. These results highlight the value of scalable real-world data, efficient action generation, and corrective learning for general-purpose VLA policies in deformable-object manipulation.","published_date":"2026-05-29T13:20:08+00:00","viability_score":7,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DeMaVLA is a foundation model for generalizable deformable object manipulation in household robots, using efficient action generation and human-in-the-loop learning.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31284v1","title":"SAM for Robust Mitochondria Instance Segmentation in Fluorescence Microscopy","abstract":"The morphological analysis of mitochondria in fluorescence microscopy (FM) is crucial for understanding cellular health, energy production, and metabolic regulation. 
While foundation models like the Segment Anything Model (SAM) have revolutionized natural image segmentation, their direct application to FM is hindered by a significant domain shift characterized by diffraction-limited resolution, low contrast, and complex overlapping organelle networks. Furthermore, the development of robust models is bottlenecked by a severe lack of high-quality, manually annotated instance segmentation datasets for mitochondria. In this paper, we propose a scalable solution to this data scarcity by finetuning SAM exclusively on synthetically generated FM data. We simulate realistic mitochondria data and emulate the optical properties of fluorescence microscopes to create a large-scale annotated dataset. We evaluate our fine-tuned model on a curated dataset of real, manually annotated FM images. Qualitative and quantitative analyses demonstrate that our synthetically fine-tuned model improves precision and average dice score over strong baselines. This work establishes the potential of simulation-assisted training for FM instance segmentation.","published_date":"2026-05-29T13:19:02+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":"https://github.com/cvpr-org/author-kit","commercial_flags":["has_code"],"one_liner":"A synthetic data generation approach fine-tunes the Segment Anything Model for robust mitochondria instance segmentation in fluorescence microscopy, overcoming data scarcity.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31279v1","title":"Practical Cross-Band Channel Prediction for AI-RAN via Physics-Guided Deep Unfolding","abstract":"To make cross-band channel prediction practical for AI-native RAN, algorithms must generalize across diverse environments and support real-time inference. Existing approaches achieve one but not both. 
To bridge this gap, we introduce GUIDE, a physics-guided deep unfolding framework that embeds wireless channel physics into differentiable layers. Without retraining in unseen environments, GUIDE achieves 2.75x beamforming gain than the deep learning-based baseline FIRE with only a slight increase in inference time, and 1.39x beamforming gain than the strongest model-based baseline R2F2 while running over 1610x faster.","published_date":"2026-05-29T13:10:58+00:00","viability_score":4,"cluster_label":"Wireless AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"GUIDE is a physics-guided deep unfolding framework for practical cross-band channel prediction in AI-RAN, achieving significant beamforming gains with faster inference.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31278v1","title":"Industrializing Prediction-Powered Inference: The GLIDE Library for Reliable GenAI and Agentic Systems Evaluation","abstract":"Reliable evaluation of agentic systems requires unbiased estimates with valid uncertainty, but standard practice navigates between costly human annotation and biased LLM-as-judge proxies. Prediction-powered inference (PPI) combines both into debiased estimates with valid confidence intervals, yet its various methods remain scattered across papers under partial implementations. We introduce GLIDE, an open-source Python library that unifies state-of-the-art PPI estimators (PPI++, Stratified PPI, Predict-Then-Debias and its stratified variants, Active Statistical Inference) and samplers (uniform, stratified, active, cost-optimal) under a scipy-style API specialized to mean estimation. GLIDE ships with a reproducible Monte Carlo validation suite, an empirically grounded decision tree for method selection, and an agentic evaluation case study showing substantial annotation savings at equivalent precision. 
The GLIDE package is available at this URL: https://github.com/EmertonData/glide","published_date":"2026-05-29T13:10:35+00:00","viability_score":7,"cluster_label":"AI Evaluation","has_code":true,"repo_url":"https://github.com/EmertonData/glide","commercial_flags":["has_code"],"one_liner":"A Python library for reliable and efficient evaluation of AI agent systems, reducing annotation costs while maintaining precision.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.31275v1","title":"Personalized to Persuade: The Effects of Contextualization and Warmth on Trust and Reliance in Conversational AI","abstract":"Artificial Intelligence (AI) agents personalize their responses by tailoring explanations to users' backgrounds, interests, and prior interactions, referred to as contextualization. Personalization has been identified as a persuasive strategy in politics or in marketing. However, the persuasive effect of contextualization in everyday tasks, where users often lack prior knowledge, remains unclear. We conducted a $2\\times2$ between-subjects experiment ($N = 380$) examining how contextualization, combined with conversational warmth, shapes reliance and persuasiveness of an AI assistant arguing against expert recommendations. Our findings reveal that contextualization reduces the persuasive power of AI, but its combination with warmth restores persuasiveness through a crossover interaction. Reliance on AI is present across conditions and is invariant to the conversational design. Trust strongly predicts both persuasion and reliance, yet neither contextualization nor warmth operates through trust. AI literacy decouples trust from behavior: more literate users report lower trust in the assistant, yet are more persuaded and more reliant on its advice. 
These results suggest that users are prone to deferring to AI agents over human expert judgment; however, interface-level conversational design choices have a limited role in shaping the behavior.","published_date":"2026-05-29T13:07:03+00:00","viability_score":1,"cluster_label":"Conversational AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigating how personalization and warmth in conversational AI affect user trust and reliance, finding limited impact on behavior.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31266v1","title":"Envisioning Beyond the Few: Disentangled Semantics and Primitives for Few-Shot Atypical Layout-to-Image Generation","abstract":"The layout-to-image (L2I) task enables fine-grained control over image generation via object categories and spatial layouts. However, existing L2I methods yield fragmented and distorted generations under few-shot atypical settings. We term this failure as representation fragmentation, arising from a granularity mismatch that entangles semantic identity with visual details. To address this issue, we propose a representation-driven framework that disentangles semantics from primitives for robust few-shot adaptation. Specifically, Semantic Anchoring aggregates categorical semantics into anchors for stable identity, while Primitive Imbuing models recomposable primitives for robust local detail modeling. Conceptual Steering further regulates optimization with a saliency-aware objective to preserve foreground semantic consistency. Extensive experiments demonstrate consistent improvements in the 5-shot regime over state-of-the-art L2I methods in both visual fidelity and alignment across diverse atypical domains. 
The source code is publicly available at https://github.com/iCVTEAM/DSP.","published_date":"2026-05-29T13:00:00+00:00","viability_score":7,"cluster_label":"Generative Image","has_code":true,"repo_url":"https://github.com/iCVTEAM/DSP","commercial_flags":["has_code"],"one_liner":"A framework for few-shot layout-to-image generation that disentangles semantics and primitives to improve visual fidelity and alignment.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.31264v1","title":"COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation","abstract":"LLM agents are increasingly expected not only to complete isolated tasks, but also to carry bounded representations of human expertise, judgment, and interaction style. Building such person-grounded agents remains difficult because actionable knowledge associated with a person or role is usually embedded in heterogeneous traces rather than written as clean instructions. Existing memory and persona systems capture fragments of this evidence, while skill frameworks provide portable packaging formats; however, there is no end-to-end workflow for distilling these traces into inspectable, correctable, and agent-usable skills. We present an automated trace-to-skill distillation system for generating person-grounded AI skills via expert knowledge distillation. Given materials from a target person or role, COLLEAGUE.SKILL produces a versioned skill package with two coordinated tracks: a capability track for practices, mental models, and decision heuristics, and a bounded behavior track for communication style, interaction rules, and correction history. The package can be inspected, invoked, updated through natural-language feedback, rolled back, installed across agent hosts, and optionally prepared for controlled distribution. 
We describe the artifact contract, generation workflow, correction lifecycle, deployment surface, and domain presets implemented in the open-source system. At the time of writing, the public repository has approximately 18.5k GitHub stars; the gallery lists 215 skills from 165 contributors and more than 100k cumulative stars across listed skill cards. The system illustrates how person-grounded skills can be represented as portable, correctable packages rather than opaque prompts or hidden memories.","published_date":"2026-05-29T12:59:08+00:00","viability_score":9,"cluster_label":"AI Skill Generation","has_code":true,"repo_url":"https://github.com/titanwings/colleague-skill","commercial_flags":["has_code"],"one_liner":"Automate the creation of AI skills using expert knowledge distillation for tailored, person-grounded AI agents.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31261v1","title":"Why Linear Recurrent Memory Works in Partially Observable Reinforcement Learning","abstract":"The family of linear recurrent neural networks has shown strong performance as recurrent memory units in partially observable reinforcement learning. We provide a theoretical justification for their empirical effectiveness by constructing and studying two linear filters: (i) the first exactly reproduces the pre-softmax logits of the belief vector in a hidden Markov model (HMM) under a deterministic transition matrix, thereby serving as a sufficient statistic for optimal policy learning, (ii) the second achieves vanishing state-decoding error under a nearly deterministic transition matrix, thus reducing state ambiguity to near zero. The results extend to action-controlled HMMs, where the corresponding linear filters become time-varying with action-dependent dynamics. 
We illustrate our main results through numerical experiments and further show that the constructed linear filter serves as a strong feature extractor in a small reinforcement learning game.","published_date":"2026-05-29T12:56:28+00:00","viability_score":2,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Theoretical justification for the effectiveness of linear recurrent neural networks in partially observable reinforcement learning.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31254v1","title":"Formalizing and falsifying causal pathways of rare events","abstract":"Building on recent formalizations of root cause analysis for rare events (``outliers'') in structural equation models, we propose a formal definition of a causal pathway and discuss its testable implications. We identify conditions under which these implications depend only on a causal abstraction defined by the pathway of rare events, rather than on the full causal graph of the underlying system. Accordingly, we introduce an abstraction of causal structure to pathways of rare events that bridges simple verbal causal explanations and detailed causal modeling.","published_date":"2026-05-29T12:50:47+00:00","viability_score":0,"cluster_label":"Causal Inference","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Formalizing causal pathways of rare events in structural equation models for testable implications.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31251v1","title":"ERGeoBench:A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models","abstract":"Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. 
ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through sequential changes in yaw, pitch, and zoom. The benchmark contains 2,207 globally distributed street-view panoramas and measures four complementary capabilities: foundational perception, spatial awareness, common sense reasoning, and geo-localization reasoning. Evaluations of leading proprietary and open-source MLLMs show that current models can infer high-level geographic semantics, but still struggle with fine-grained perceptual operations, metric localization, and spatial consistency across views. We further observe that geo-localization is strongly correlated with the other capability dimensions, suggesting that accurate localization depends on integrated perception, spatial reasoning, and commonsense inference rather than isolated visual recognition. Overall, ERGeoBench provides a unified framework for diagnosing and advancing human-like embodied geo-localization. Project Page: https://kaixuewen.github.io/ERGeoBench/","published_date":"2026-05-29T12:49:17+00:00","viability_score":7,"cluster_label":"Multimodal LLMs","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ERGeoBench: A diagnostic benchmark for evaluating multimodal LLMs in embodied geo-localization tasks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31250v1","title":"Entropic Projection Alignment: Estimating, Explaining, and Improving Model Performance Under Distribution Shift","abstract":"We propose a unified framework for addressing three key challenges of distribution shift: (1) estimating a model's performance on an unlabeled target domain, (2) explaining the shift by identifying the features responsible, and (3) improving the target domain performance. 
Our method, Entropic Projection Alignment (EPA), aligns the source distribution to the target by matching carefully selected moments while simultaneously minimising the KL divergence from the source. This formulation yields a unique closed-form solution for importance weights, achieving robustness through implicit variance control. Drawing on domain adaptation theory, we establish that moment matching is sufficient for reliable estimation and adaptation, avoiding the need for full density ratio recovery. Extensive experiments, together with strong theoretical guarantees, demonstrate that EPA consistently outperforms state-of-the-art baselines while offering substantial computational efficiency.","published_date":"2026-05-29T12:48:34+00:00","viability_score":6,"cluster_label":"Domain Adaptation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Entropic Projection Alignment (EPA) for estimating, explaining, and improving model performance under distribution shift.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.31249v1","title":"Learning Cardiac Latent Representations in Vectorcardiogram Space","abstract":"Electrocardiography (ECG) is a cornerstone of cardiac assessment, making the learning of informative ECG representations fundamental to tasks ranging from disease diagnosis to clinical report generation. However, existing methods operate almost exclusively in the observable ECG signal space. In practice, the standard twelve-lead ECG represents multiple projections of the same underlying cardiac electrical activity from different spatial orientations. Therefore, representation learning in the ECG space inevitably introduces substantial redundancy, which may lead to spurious correlations and increased risk of overfitting. To address this and motivated by the Frank vectorcardiogram (VCG) model, we propose learning a unified latent representation of cardiac electrical activity directly in the VCG space. 
We introduce LVCG, the first general self-supervised representation learning framework designed to operate in this physically grounded latent space. By learning view-invariant latent VCG representations rather than lead-specific artifacts, VCG minimizes redundancy and improves generalization. LVCG generally outperforms ECG-space baselines across tasks, demonstrating enhanced robustness and generalization, especially in domain shift settings.","published_date":"2026-05-29T12:48:12+00:00","viability_score":4,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A self-supervised framework for learning unified latent representations of cardiac electrical activity in VCG space to improve generalization and reduce redundancy in ECG analysis.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.31239v1","title":"Correcting Split Selection in Online Decision Trees via Anytime-Valid Inference","abstract":"Bagging-based ensembles, most notably Adaptive Random Forests, are among the strongest performers for learning from data streams. A common denominator across these methods is their reliance on Hoeffding Trees as base learners, which grow decision trees incrementally by testing whether a candidate split is significantly better than its alternatives using concentration inequalities. Despite their empirical success, existing variants lack valid statistical guarantees. Current analyses rely on fixed-sample concentration bounds, while split decisions are made using data-dependent stopping rules, which invalidates their guarantees and can drive the probability of incorrect splits to one. We introduce a principled alternative based on anytime-valid inference. Our method provides: (i) anytime-valid control of false splits under arbitrary data streams, including non-stationary settings; (ii) finite commitment time under a predictive advantage; and (iii) under stationary i.i.d. 
data, risk is monotone decreasing and strictly improves at every split. Empirically, we evaluate both standalone trees and their use within Adaptive Random Forests on non-stationary streams. Our method improves performance while producing substantially smaller trees.","published_date":"2026-05-29T12:40:27+00:00","viability_score":0,"cluster_label":"Machine Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A principled alternative to existing split selection methods in online decision trees using anytime-valid inference to provide statistical guarantees and improve performance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31229v1","title":"Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval","abstract":"While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically underexplored. Existing work often approaches continual retrieval through the lens of class-incremental learning (CIL), evaluating both standard CIL methods and retrieval-oriented adaptations in settings that may not fully capture the retrieval-specific dynamics. To address this, we introduce a new, principled evaluation framework for continual multimodal retrieval (CMR) spanning diverse visual domains, and systematically evaluate common approaches within this setting. Our empirical analysis shows that standard CIL methods fail to yield meaningful gains in our more challenging scenario. Therefore, we propose Dynamic Adapter Routing (DAR), a novel approach based on adapters selected through prototype-based routing and combined via model merging. DAR achieves superior performance over the previous baselines and demonstrates strong generalization under out-of-distribution evaluation. 
Our results highlight the unique challenges of CMR and encourage further research in this direction.","published_date":"2026-05-29T12:32:59+00:00","viability_score":7,"cluster_label":"Multimodal AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Dynamic Adapter Routing (DAR) for continual multimodal retrieval, offering superior performance and generalization by dynamically selecting and merging adapters.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31228v1","title":"EchoRL: Reinforcement Learning via Rollout Echoing","abstract":"Reinforcement Learning with Verifiable Rewards is an effective route for post-training to strengthen the reasoning capability of large language models. However, as training proceeds, the learning signal can collapse, making the training gain marginal and ineffective. Specifically, a growing fraction of prompts' rollouts become advantage-degenerated: all the self-generated rollouts show verified-success, so the standard deviation of their rewards is zero; accordingly, each rollout's advantage becomes degenerate (zero) as well. Given such rollouts' advantages, the policy gradient for model optimization eventually vanishes, capping the training performance. We argue that some of these rollouts still contain valuable learning signals that existing RLVR methods unfortunately discard. In this paper, inspired by an analysis of the entropy pattern behind golden trajectories produced by external expert models, we propose EchoRL for better exploiting the advantage-degenerated rollouts to further improve the training performance. EchoRL is a lightweight module that first identifies an EchoClip from verified-success rollouts based on their step-level entropy values, and then feeds this clip back as an auxiliary supervision signal in the RL objective. 
Extensive experiments across 10 benchmarks, 5 LLM backbones, and 4 popular RLVR post-training methods demonstrate that EchoRL consistently improves RLVR post-training with minimal overhead.","published_date":"2026-05-29T12:31:55+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"EchoRL is a lightweight module that improves reinforcement learning from verifiable rewards for LLMs by exploiting advantage-degenerated rollouts using step-level entropy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31226v1","title":"What changes after deployment? A survey on On-device Learning in TinyML","abstract":"Machine learning models on microcontroller-class devices (TinyML) face a fundamental challenge: post-deployment distribution change undermines static models. On-device learning (ODL) addresses this by running the learning process directly on the device. The existing literature has not characterized how distribution change occurs or how different change types require different solutions. Approximately 70 ODL works are surveyed under one principle: the distribution change regime. The survey analyzes how different types of distribution change influence the applications addressable on-device, the hardware employed, and the structure of the solutions. 
A persistent gap between methodological benchmarks and real-world deployment scenarios is also identified.","published_date":"2026-05-29T12:29:25+00:00","viability_score":3,"cluster_label":"TinyML","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A survey categorizing on-device learning solutions for TinyML based on distribution change regimes to bridge the gap between benchmarks and real-world deployments.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.31224v1","title":"Comparing LLM-Based Conversational and Graphical Interfaces for Industrial Decision Tasks: An Exploratory Mixed-Methods Study","abstract":"The use of Generative AI Conversational User Interfaces (CUI) as a new way to access and analyze data is growing in all sectors, and the industrial one is no exception. There, large amounts of data produced by IoT devices flow through user interfaces, which may need to be adapted to the new analysis needs of decision-makers. LLM-based CUIs promise a new way to interact directly with those data through natural language, without the learning costs that every GUI design has. Moreover, the capabilities of LLMs and their agency open up the possibility to automate some tasks and help with the reasoning during decision-making activities. But are these promises well founded? We try to scope this general question with a mixed-approach study comparing a state-of-the-art dashboard with a conversational agent. A total of 20 participants used both interfaces to complete four simulated industrial decision tasks of varying complexity. We combined measures of mental workload, completion time, and decision accuracy with a post-study questionnaire and semi-structured interviews analyzed through thematic analysis. 
The findings suggest that the conversational agent can reduce interactional effort by supporting more direct access to information, while the dashboard remains valuable for overview and verification. However, these benefits may vary across tasks and require validation through larger-scale studies.","published_date":"2026-05-29T12:27:39+00:00","viability_score":5,"cluster_label":"Industrial AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Comparing LLM-based conversational interfaces with dashboards for industrial decision tasks to understand trade-offs in interaction effort and information access.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2605.31220v1","title":"Shared Doubt: Zero-shot Cross-Lingual Confidence Estimation for Language Models","abstract":"Confidence estimation (CE), i.e. quantifying the reliability of a model's prediction, has attracted great interest in the context of large language models (LLMs). However, most studies focus on English, ignoring the multilingual reality of LLM usage, while many CE methods degrade or require retraining across languages. To address this gap, we investigate whether multilingual LLMs encode shared, language-transferable confidence features. We use a lightweight linear probe that predicts answer correctness directly from intermediate representations. Trained monolingually, the probe generalizes zero-shot to unseen, typologically diverse languages without target-language supervision. Learned layer weights and multiple ablations reveal that confidence features concentrate in middle layers across languages, suggesting a shared confidence subspace. 
While zero-shot cross-lingual performance depends on similarity to the source language, the probe provides a strong baseline without any retraining and compares favorably to other popular confidence estimation methods.","published_date":"2026-05-29T12:25:24+00:00","viability_score":4,"cluster_label":"Multilingual LLMs","has_code":true,"repo_url":"https://github.com/AthinaKyriakou/shared-doubt","commercial_flags":["has_code"],"one_liner":"Developing a zero-shot cross-lingual confidence estimation method for LLMs by identifying language-transferable confidence features in intermediate representations.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2605.31212v1","title":"Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education","abstract":"AI systems are increasingly used to support educational content creation, yet it remains unclear whether they can generate outputs that faithfully represent the pedagogical concepts they are intended to teach. Thus, we introduce equation-to-visual generation, a task that, in contrast to conventional image generation, requires producing pedagogically meaningful visuals from arithmetic equations while precisely preserving their numerical and relational structure. Informed by interviews with teachers and an analysis of educational materials, we construct E2V-Bench, a benchmark spanning four pedagogically grounded visual types, along with automatic metrics for evaluating visual correctness. Our evaluation reveals that recent text-to-image (T2I) models frequently fail on this task, with errors dominated by incorrect object counts and broken relational structure. Building on this, we explore benchmark-guided enhancement strategies. 
These strategies improve representative models, while the remaining gap calls for stronger numerical and relational grounding in future T2I models.","published_date":"2026-05-29T12:18:08+00:00","viability_score":7,"cluster_label":"Educational AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and enhancement strategies for text-to-image models to generate pedagogically sound visual representations of arithmetic equations for early education.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31210v1","title":"Simulation of collision avoidance behavior in crowd movement by data-driven approach","abstract":"Crowd movement simulation is essential for pedestrian safety management and facility layout optimization. Data-driven models enhance trajectory prediction accuracy under Euclidean metrics, yet they suffer from excessively high collision rates, especially in bidirectional and multidirectional flows. In this paper, we establish a novel data-driven crowd simulation model that incorporates the pedestrian collision mechanism into the loss function to reduce collisions. A new lateral-acceleration-based collision loss function and a Voronoi-based motion feature extraction approach are proposed. The model is based on a Generative Adversarial Network (GAN) architecture and is termed CPGAN (Collision-Penalized GAN). We evaluate CPGAN in bidirectional flow scenarios, which involve frequent collision avoidance behaviors. Results show that the proposed lateral-acceleration-based collision loss significantly reduces opposite-direction pedestrian collision rates to levels comparable with controlled experiments. CPGAN effectively simulates bidirectional flow, reproducing lane formation and N-t curves. 
The research outcomes can provide inspiration for integrating pedestrian dynamics mechanisms into loss functions in data-driven crowd simulation.","published_date":"2026-05-29T12:15:52+00:00","viability_score":3,"cluster_label":"Crowd Simulation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A data-driven crowd simulation model incorporating a collision mechanism into the loss function to reduce collision rates in bidirectional flows.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31199v1","title":"MAECO-Lite: Modular Ontology for Dynamic Malware Analysis","abstract":"Capturing dynamic malware behavior in a practical but still semantically precise manner remains a significant challenge in cyber threat intelligence. While standards such as MAEC and STIX provide widely adopted vocabularies for describing malware artifacts and observations, they represent data with considerable complexity in structures that often obscure important ontological distinctions. In particular, they tend to conflate enduring malware artifacts with the events generated during execution, thereby flattening distinctions that are central in foundational standards for ontology design. In this paper, we conduct a foundational ontological analysis of core MAEC and STIX constructs relevant to dynamic malware analysis relying on Unified Foundational Ontology (UFO) as a theoretical lens. Our analysis reveals some ontological mismatches arising from the conflation of artifacts, dispositions, and runtime events in MAEC and STIX that complicate coherent representation of dynamic malware behavior and, from a practical perspective, limit the ability to reason about execution traces. Based on these insights, we propose MAECO-Lite, a lightweight ontology designed to represent data and operationalize their processing for dynamic malware analysis. 
The ontology adopts a modular structure centered on samples, processes, actions, system artifacts, and MITRE ATT&CK Techniques, while maintaining a clear separation between enduring entities and runtime events. An initial evaluation using description logic concept learning algorithms shows that the simplified ontology significantly improves learning performance, demonstrating that ontologically grounded modelling can enhance both semantic clarity and computational usability.","published_date":"2026-05-29T12:08:07+00:00","viability_score":6,"cluster_label":"Malware Analysis Ontology","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A lightweight, modular ontology for dynamic malware analysis that separates enduring entities from runtime events, improving semantic clarity and computational usability.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.31196v1","title":"Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration","abstract":"Safe human--robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat~3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. 
Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50\\%, explicit depth is not automatically transformed into robot-body collision evidence, and robot--scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.","published_date":"2026-05-29T12:04:38+00:00","viability_score":7,"cluster_label":"Robotics Safety","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A physics-grounded benchmark and evaluation of vision-language models for collision grounding in human-robot collaboration, highlighting limitations in current models for reliable safety monitoring.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31183v1","title":"Steering LLMs? Actually, Sparse Autoencoders can outperform simple baselines","abstract":"Sparse Autoencoders (SAEs) have been seen as a promising avenue for exploring the internals of Large Language Models (LLMs) and for steering model output generation. When AxBench - a model steering benchmark - was introduced in Wu et al. (2025), SAEs did not seem to live up to their original hype due to poor steering performance relative to a set of simple baselines. This work serves as a partial rebuttal for Sparse Autoencoders and suggests that the results of Wu et al. (2025) did not do them full justice. We find that Sparse Autoencoders can, in fact, perform close to on par with the reference LoRA performance on the AxBench benchmark, when features are selected and labelled with our supervised pipeline. 
We also find that our pipeline selects features that are surprisingly causal of their identified labels when using only its interpretability-based components. Lastly, we present evidence that high sparsity (low l0) may not be crucial for successful steering based on interpretability, which is in contrast to the earlier findings in Wang et al. (2025).","published_date":"2026-05-29T11:53:54+00:00","viability_score":7,"cluster_label":"LLM Interpretability","has_code":true,"repo_url":"https://github.com/MikkelGodsk/SAE-labelling","commercial_flags":["has_code"],"one_liner":"Sparse Autoencoders, when features are supervised and labeled, can perform comparably to LoRA for steering LLM output, suggesting high sparsity is not crucial for interpretability-based steering.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.31173v1","title":"MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors","abstract":"Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive recordings are inherently noisy, spatially blurred, and only partially preserve information about perceived speech. Existing methods directly map neural activity to entangled speech representations before synthesizing waveforms with neural vocoders, resulting in spectral-similar but unintelligible results. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. 
These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, highlighting a promising attempt for auditory neuroscience research and non-invasive speech brain-computer interfaces.","published_date":"2026-05-29T11:38:38+00:00","viability_score":7,"cluster_label":"Brain-Computer Interfaces","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Reconstruct intelligible speech from non-invasive neural signals using a two-pathway framework with pretrained models for advanced brain-computer interfaces.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31171v1","title":"MIMO: Multilingual Information Retrieval via Monolingual Objectives","abstract":"Multilingual Information Retrieval (MLIR) reflects real-world search environments in which queries and relevant documents may appear in different languages within a mixed-language corpus. However, existing embedding models are primarily optimized for Multi-Monolingual retrieval and their performance often degrades in MLIR settings. Moreover, directly applying conventional contrastive learning to MLIR can exacerbate language clustering and expose a trade-off between cross-lingual alignment and embedding uniformity. To address these limitations, we propose MIMO: Multilingual Information Retrieval via Monolingual Objectives, a two-stage framework that uses a stable English semantic space from a high-performing teacher model as an anchor. 
MIMO first initializes the student model's cross-lingual alignment through knowledge distillation, and then jointly optimizes distillation and cross-lingual contrastive learning to improve retrieval discrimination while preserving alignment. Extensive experiments show that MIMO consistently outperforms existing cross-lingual training baselines across various MLIR and Multi-Monolingual benchmarks. MIMO also remains competitive with off-the-shelf models of similar or larger parameter scales. Furthermore, our cross-lingual Alignment-Uniformity analysis clarifies the distinct roles of the two loss components and shows that their combination yields a favorable trade-off between alignment and uniformity.","published_date":"2026-05-29T11:34:15+00:00","viability_score":7,"cluster_label":"Multilingual Information Retrieval","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Improve multilingual information retrieval by using a two-stage framework that anchors student models to a stable English semantic space from a teacher model.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31170v1","title":"Emergent Languages in Populations of Language Model Agents: From Token Efficiency to Oversight Evasion","abstract":"Monitoring autonomous language model agents currently relies mostly on surface behavior. But what happens when agent populations invent new languages with the goal of avoiding human oversight? Here, we study the emergent languages on Moltbook. For this, we build upon the Moltbook Files dataset and apply a two-stage approach consisting of a rule-based heuristic (about 6000 matches) followed by zero-shot classification (518 kept). The resulting categories include token efficiency (166), new natural languages (106), and oversight evasion (59). We conduct both quantitative and qualitative analyses. 
Our results show that posts proposing new languages for avoiding oversight are judged by DeepSeek-3.2 as being less aligned than the other categories and that all languages can be learned by other language models in-context merely from a description of the language. Moreover, manually studying exemplary cases reveals surprisingly sophisticated steganographic protocols like embedding hidden messages in natural language. Although we cannot be certain about the extent of autonomy in ideation of these languages, our results add up to the evidence that monitoring surface behavior may soon be insufficient for retaining control over agent populations.","published_date":"2026-05-29T11:33:05+00:00","viability_score":4,"cluster_label":"Agent Language Emergence","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Investigate emergent languages in language model agent populations, focusing on token efficiency, new natural languages, and oversight evasion.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.31167v1","title":"LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability","abstract":"Assessing whether Large Language Models outputs are factually grounded, epistemically calibrated, and methodologically reproducible is a prerequisite for responsible AI deployment. Yet auditing LLMs remains inaccessible to non-technical practitioners: existing tools require programming expertise and non-trivial environment setup, and cloud-hosted platforms transmit evaluation data to external services, creating barriers for domain experts and compliance officers legally responsible for AI oversight. 
We introduce LLM-FACETS (LLM FActuality Cross-EvaluaTion System): an open-source framework with a browser-accessible interface and a plugin architecture, structured around three practitioner profiles (technical experts, domain experts, compliance officers) that mirror the stakeholder categories identified in the EU AI Act and the NIST AI Risk Management Framework. The architecture makes data flows explicit: deterministic metrics (BLEU, ROUGE, BERTScore) run entirely within the self-hosted server with no outbound transmission; LLM-judge metrics contact external APIs explicitly, with users retaining full credential control. The framework operationalizes transparency through three mechanisms: token-level log-probability visualization for epistemic uncertainty, multi-judge consensus to mitigate judge bias, and RAG Triad metrics (Faithfulness, Answer Relevance, Context Relevance) to detect and localize hallucinations. A plugin architecture allows any new metric or dataset to be integrated without modifying the evaluation pipeline. The open-source implementation enables cross-checking across multiple metrics targeting the same property, ensuring reproducibility and decoupling AI accountability from the teams building the systems assessed. 
We verify the framework through cross-validation of 18 metric implementations against canonical reference libraries.","published_date":"2026-05-29T11:20:47+00:00","viability_score":8,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":"https://github.com/Scriptor-Group/AIMVi","commercial_flags":["has_code"],"one_liner":"LLM-FACETS is a privacy-preserving, open-source framework with a browser interface for evaluating LLM transparency and accountability for diverse stakeholders.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31164v1","title":"D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training","abstract":"Training data plays a central role in large language model (LLM) optimization, motivating extensive research on data scheduling strategies. Most existing approaches concentrate on adjusting the overall data distribution but neglect the underlying interactions between samples during training. However, we argue that such interactions cannot be overlooked, as real-world data samples frequently exhibit directional influences on each other, making the training order crucial. Intuitively, we can prioritize train-units with greater influence to improve learning efficiency. In this work, we propose $D^3$, a Dynamic Directional graph-constrained Data scheduling framework. $D^3$ formulates the complex interactions among train-units as a dynamic influence graph, where edges represent loss-based dependencies. It then solves a constrained optimization problem over this graph to derive the training order, which ensures that the data sequence respects the evolving information flow throughout training. Our approach is theoretically motivated and yields consistent improvements over existing data scheduling methods across both pre-training and post-training phases. 
Furthermore, for scalability, $D^3$ also employs an efficient approximation algorithm that keeps the additional computational overhead within a manageable range. For future research, the code is available at https://github.com/xuyj233/D3.","published_date":"2026-05-29T11:13:43+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":true,"repo_url":"https://github.com/xuyj233/D3","commercial_flags":["has_code"],"one_liner":"A dynamic graph-constrained framework for optimizing LLM training data scheduling by considering inter-sample influences.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31159v1","title":"Trust-Region Behavior Blending for On-Policy Distillation","abstract":"On-policy distillation (OPD) trains a student on prefixes sampled from its own policy while matching a stronger teacher. This addresses the prefix mismatch of offline distillation, but early student rollouts can still be poor, placing teacher supervision on weak or low-quality prefixes. We propose Trust-Region behavior Blending (TRB), a warmup method that replaces the early rollout policy with the closest-to-teacher behavior policy inside a student-centered KL trust region, while keeping the per-prefix reverse-KL OPD loss unchanged. The KL budget is annealed to zero, so training returns to pure student rollouts after warmup. 
Across two math-reasoning distillation settings, TRB attains the strongest average among the compared methods.","published_date":"2026-05-29T11:06:06+00:00","viability_score":4,"cluster_label":"LLM Distillation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A warmup method for on-policy distillation that blends student and teacher behaviors to improve early training performance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31149v1","title":"Developing a UXR Point of View for Cognitive Accessibility in Mobile Learning with Generative AI","abstract":"This study investigates how UX research (UXR) principles, combined with Large Language Model (LLM)-supported analysis, can be used to improve the quality of requirements for mobile learning systems designed for learners with cognitive disabilities. Using the UXR Point-of-View (PoV) pyramid as a methodological framework, the study progressed through four stages: foundational structuring of psychological, behavioral, and design layers; structured validation using the DeLone and McLean Information Systems Success Model and Quality Function Deployment (QFD); insight consolidation through the development of nine Cognitive Accessibility UXR Play Cards; and stakeholder-specific PoV articulation to support interdisciplinary communication. LLM-supported synthesis was integrated to assist in theme clustering, requirement refinement, and hypothesis formulation under human oversight. Findings suggest that many usability and engagement challenges in mobile learning originate from ambiguous or under-specified requirements rather than interface design alone. 
By embedding cognitive accessibility principles into measurable and technically traceable requirements, the proposed Cognitive Accessibility UXR Playbook provides a structured pathway for aligning theory, system architecture, and stakeholder strategy.","published_date":"2026-05-29T11:00:11+00:00","viability_score":3,"cluster_label":"AI for Accessibility","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Leveraging LLMs and UXR principles to develop cognitive accessibility requirements for mobile learning systems.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31148v1","title":"SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes","abstract":"Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce \\textbf{SpatialAct}, a simulator-grounded benchmark for probing \\textit{action-conditioned spatial reasoning} in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. 
These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.","published_date":"2026-05-29T10:59:26+00:00","viability_score":7,"cluster_label":"Embodied AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SpatialAct is a benchmark for evaluating the spatial reasoning and action capabilities of VLM agents in 3D environments.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31147v1","title":"Developing a Culturally Grounded, AI-Augmented UX Research Point of View (POV): An Exemplar Case Study from Telemedicine Dementia Care","abstract":"User Experience Research (UXR) Points of View (POVs) distil complex and often fragmented research evidence into actionable perspectives that guide how teams interpret user needs, frame design decisions, and align stakeholders. Although POVs are widely used in industry practice, there are few published examples that explicitly document how POVs are constructed, particularly in culturally sensitive and low-resource contexts. This paper presents an exemplar case study demonstrating how a culturally grounded, AI-augmented UXR POV was developed to inform TeleDeCa, a telemedicine dementia care framework for family caregivers in Nigeria. Building on the UXR POV Playbook and pyramid framework, we illustrate how mixed-methods research, hypothesis generation, and ontology-based modelling can be combined to form a defensible POV without requiring a fully finalised system or validated outcomes. Generative AI (GenAI) is integrated across the UXR POV framework as a bounded research collaborator, supporting synthesis, hypothesis exploration, and narrative construction while preserving human judgment, ethical accountability, and cultural sensitivity. 
The contribution of this paper lies in the extraction of reusable Play Cards and a Play that extend the UXR POV Playbook and serve as exemplar material for the CHI 2026 workshop on developing AI-powered UXR POVs.","published_date":"2026-05-29T10:56:57+00:00","viability_score":5,"cluster_label":"AI-Augmented UX Research","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An AI-augmented methodology for developing culturally grounded User Experience Research Points of View, with reusable components for AI-powered UXR workshops.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.31146v1","title":"From Evidence to Design: Developing an AI-Augmented UX Research Point of View for Digital Wellbeing in Emergency and Public Safety Contexts","abstract":"This paper investigates how User Experience Research (UXR) methods can be combined with AI-supported analysis to develop clearer design direction for digital wellbeing interventions targeting Emergency and Public Safety Personnel (EPSP). EPSP work in high-stress, shift-based environments where cognitive fatigue and unpredictable schedules reduce engagement with conventional wellbeing tools. Using the UXR Point-of-View (PoV) framework, this study applied an AI-supported literature analysis process to identify recurring psychological, behavioural, and design patterns. Behaviour Change Techniques and Persuasive Technology principles were integrated throughout interpretation to connect evidence with practical design reasoning. The process resulted in a UXR PoV Pyramid, nine UXR Play Cards, and stakeholder focused PoV narratives. Findings show that effective wellbeing systems for EPSP must minimise cognitive effort, adapt to operational context, and prioritise psychological safety. 
The work demonstrates how AI can assist large-scale evidence interpretation while human researchers maintain responsibility for contextual judgement and design direction.","published_date":"2026-05-29T10:53:57+00:00","viability_score":5,"cluster_label":"AI-Augmented UX Research","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An AI-supported methodology for developing User Experience Research Points of View to guide digital wellbeing interventions for emergency and public safety personnel.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.31145v1","title":"FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization","abstract":"In-context localization (ICL) seeks to localize a target object specified by a small set of support examples in a query image, operating on the fly without training or parameter updates. Despite rapid advances in vision-language models (VLMs), achieving category-agnostic and visually grounded ICL remains an open problem, even though it is essential for applications such as image editing, personalized visual search, and retrieval. Existing methods are fragile and rely on explicit category supervision, which not only limits applicability in realistic settings with unnamed or instance-specific objects but also introduces category bias that steers predictions toward semantic priors rather than visual evidence. We introduce a two-stage training framework that explicitly optimizes in-context attention between support bounding boxes and query images without category supervision. We further refine localization via reinforcement learning using Group Relative Policy Optimization (GRPO) to directly minimize localization error. This formulation enforces visual correspondence over semantic priors, yielding robust instance-level localization. 
Empirically, a 7B-parameter model trained with our objectives outperforms models up to 72B parameters, demonstrating that context-aware localization objectives can surpass scaling alone. Comprehensive ablations validate the contribution of each component.","published_date":"2026-05-29T10:53:52+00:00","viability_score":8,"cluster_label":"In-Context Object Localization","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel two-stage training framework for category-agnostic, visually grounded in-context object localization that outperforms larger models.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31143v1","title":"Extending the UXR Point of View Pyramid: A Generative AI-Augmented Methodology for Human-Centred AI Systems","abstract":"Rising household debt and cost-of-living pressures in the United Kingdom have intensified the role of AI-driven financial technologies in mediating credit assessment, repayment structuring, and debt support services. These systems increasingly shape consequential financial decisions, yet they operate within complex socio-technical environments characterised by regulatory constraint, algorithmic opacity, and heightened vulnerability risk. User Experience Research (UXR) Points of View (PoVs) are critical in translating heterogeneous research evidence into strategic direction for product and governance decisions. However, the existing UXR PoV framework was not designed for AI-mediated financial systems where interpretability, fairness, and accountability are central. This paper extends the UXR PoV pyramid into an AI-augmented methodological framework for Human-Centred AI debt management technologies in the UK financial services context. 
We formalise (1) an AI-Augmented PoV Pyramid, (2) a structured prompt architecture for synthesis and hypothesis generation, and (3) an AI-enabled Playbook Card system that embeds Generative AI into UXR workflows while preserving traceability and ethical oversight. Generative AI is positioned not as an analytic authority, but as an epistemic support mechanism subject to human validation and regulatory awareness. By grounding the framework in debt management technologies, including affordability assessment, repayment planning, and financial stress prediction systems, this work advances UXR methodology for high-stakes financial AI environments and contributes to the evolution of responsible, AI-powered UXR practice within the CHI community.","published_date":"2026-05-29T10:51:16+00:00","viability_score":4,"cluster_label":"AI-Augmented UX Research","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An AI-augmented methodological framework extending the UXR Point of View pyramid for human-centered AI debt management technologies.","time_to_mvp":"1-3 months","tags":[]},{"arxiv_id":"2605.31142v1","title":"On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets","abstract":"Large-scale multilingual text embedding models play a crucial role in both research and industry, yet their behavior in language-specific, multi-task settings remains insufficiently understood. Although benchmarking platforms such as MTEB report results across more than 250 languages, conclusions about model superiority often depend on implicit choices of dataset compositions and performance aggregation methods. 
To address this gap, we present a meta-study of multilingual model performance robustness in MTEB, applying a diverse set of multi-criteria decision-making ranking schemes and introducing two robustness indicators: dataset-composition robustness (sensitivity of rankings to changing dataset compositions) and ranking-scheme robustness (sensitivity to aggregation method change). They enable systematic sensitivity analysis of whether benchmarking conclusions remain stable under different evaluation designs. We conduct an in-depth analysis on five languages (English, French, German, Hindi, and Spanish) across nine tasks (e.g., classification, clustering, retrieval) and release results for approximately 230 additional languages. The task-specific analyses show that large-scale LLM-based models are often robust top performers, though not uniformly (e.g., in the retrieval task), while task-agnostic results reveal that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples.","published_date":"2026-05-29T10:50:22+00:00","viability_score":4,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A meta-study of multilingual text embedding models reveals robustness indicators for evaluating performance across diverse tasks and languages, with implications for model selection in industry.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.31138v1","title":"Developing an AI-Powered UX Research Point of View for Digital Health in A Regulatory Context: An Exemplar Case from MSM and Transgender HIV Care in Nigeria","abstract":"User Experience Research (UXR) in legal and regulatory contexts presents unique challenges that require specialised approaches to protect vulnerable populations whilst generating actionable insights. 
Digital consultation, appointment booking, and medication delivery platforms show promise for extending care access; however, their real-world effectiveness is curtailed by an absence of theoretically grounded user experience research (UXR) methodologies that adequately account for the psychosocial conditions of these populations. This paper introduces a Generative AI-augmented UXR methodology, grounded in the UXR Point of View (PoV) Playbook, to guide the design of psychologically safe, low-cognitive-load digital health interventions for MSM and transgender individuals living with HIV/AIDS in Nigeria. Drawing from empirical research involving co-design workshops, thematic analysis, and requirements engineering, the methodology is operationalised through a four-stage UXR process encompassing AI-supported hypothesis generation, foundational planning, insight generation via Building Blocks, and the construction of stakeholder-specific PoV narratives. This process results in ten theory-informed UXR Play Cards that translate psychological mechanisms and empirical findings into actionable design guidance. Each play contains actionable tasks, AI-augmented approaches, and ethical guardrails tailored for research with marginalised populations. The output is a set of ten theory-informed UXR Play Cards translating psychological insight and empirical evidence into actionable design guidance. 
The core contribution is a replicable, stigma-aware, and privacy-centred framework for responsible GenAI use in UXR practice, advancing human-centred digital health design for marginalised communities.","published_date":"2026-05-29T10:47:44+00:00","viability_score":4,"cluster_label":"AI for UX Research","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A Generative AI-augmented methodology for User Experience Research in digital health, designed for marginalized populations, translates psychological insights into actionable design guidance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31131v1","title":"UXR PoV for Neuroinclusive Emotion Regulation","abstract":"Attention-deficit/hyperactivity disorder (ADHD) is a psychiatric disorder which presents itself in individuals through patterns of developmentally inappropriate levels of inattentiveness, hyperactivity, and impulsivity, with difficulties in decision making and emotional regulation (ER). Although digital and AI-based interventions have expanded access to ER support, many existing systems remain limited by weak theoretical integration, insufficient accommodation of neurodiversity, and a lack of structured user experience research (UXR) methodologies that bridge psychological insight with design practice. This paper introduces a Generative AI-augmented UXR methodology, grounded in the UXR Point of View (PoV) Playbook, to support the design of emotionally intelligent and Neuroinclusive digital ER interventions for adults with ADHD. The approach integrates empirical evidence with established psychological frameworks, namely Dialectical Behaviour Therapy (DBT), Self-Determination Theory (SDT), and the COM-B behavioural model, and leverages Generative AI as a co-analytic tool to support synthesis, hypothesis formation, and design articulation. 
The methodology is operationalized through a four-stage UXR process encompassing AI-supported hypothesis generation, foundational planning, insight generation via Building Blocks, and the construction of stakeholder-specific PoV narratives. This process results in a set of ten theory informed UXR Play Cards that translate psychological mechanisms and empirical findings into actionable design guidance. The primary contribution of this work is a replicable, bias-aware framework for integrating Generative AI into UXR practice, advancing human-centred and Neuroinclusive approaches to digital mental health design.","published_date":"2026-05-29T10:40:48+00:00","viability_score":4,"cluster_label":"AI for UX Research","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A Generative AI-enhanced User Experience Research methodology provides a framework for designing neuroinclusive digital emotion regulation interventions for adults with ADHD.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31126v1","title":"Not All Synthetic Data Is Yours to Learn From","abstract":"Can a language model improve from plain text sampled from itself, with no prompts, no teacher, no verifier, and no reward model? Yes, but only when the synthetic corpus is compatible with the student, a relational property of the source-student pair rather than an intrinsic property of the data. We call this the latent capability resurfacing hypothesis: weak self-training can amplify capabilities already present in the pretrained model, but only under this compatibility condition. We study this in the minimal setting of prompt-free unconditional self-training, where base language models are fine-tuned on text generated from the BOS token alone, with no task specification or external supervision. We report three findings. 
First, synthetic utility is relational rather than intrinsic: self-generated data is the most effective source, same-lineage transfer outperforms stronger but differently trained sources, and cross-family transfer is substantially weaker. Second, common intrinsic proxies fail: neither benchmark-level semantic similarity nor average per-token likelihood under the student predicts which corpora help. Third, this regime produces a surprising byproduct. In controlled Pythia experiments, capability and verbatim memorization decouple: benchmark utility is preserved or improved while held-out exact-match extraction drops by over 95 percent, with no forget set, privacy objective, or targeted unlearning. Together, these results suggest that prompt-free self-training works by amplifying what the student already knows, not by importing structure from the data. They also reveal a regime in which capability and verbatim memorization can be separated without any explicit unlearning objective.","published_date":"2026-05-29T10:34:11+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel self-training regime for language models decouples capability from verbatim memorization, enabling models to improve without external supervision or explicit unlearning.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.31121v1","title":"TARIC: Memory-Augmented Traversability-Aware Outdoor VLN under Interrupted Semantic Cues","abstract":"Outdoor vision-language navigation (VLN) in long-range, open-world environments is frequently disrupted by semantic-cue interruptions, where informative goal cues become sparse, occluded, or leave the field of view. Once such cues disappear, agents enter a cue-free phase and often degrade into backtracking, oscillatory headings, or aimless exploration. 
While memory-based methods attempt to bridge these gaps, they often fail under traversability-driven detours: the remembered cue direction may be infeasible, forcing detours that prolong cue-free phases and gradually render robot-centric cues stale and implicit histories blurred. This makes traversability a stability condition for maintaining goal-directed guidance, rather than merely a local safety concern. We propose a unified outdoor VLN framework that survives semantic-cue interruptions by maintaining traversability-consistent executable guidance throughout prolonged cue-free phases. Specifically, our method extracts semantic bearings from visibility-gated goal or exploration cues and grounds them into executable headings using a real-time near-field traversability profile, providing goal-consistent feasible guidance beyond reject-only safety filtering. To prevent guidance degradation during detours, we lift intermittent 2D evidence into a world-aligned 3D cue memory with an uncertainty-aware readout mechanism, ensuring guidance remains continuously reachable and stable as the robot moves. We evaluate the framework on quadrupedal and wheeled platforms over 600--1000 m routes. 
Our method improves simulation success rate by over 10 percentage points over the strongest baseline and achieves a real-world success rate of 40%, compared to 17.5% for the strongest baseline, with substantially higher robustness during prolonged cue-free intervals.","published_date":"2026-05-29T10:30:36+00:00","viability_score":4,"cluster_label":"Robotics Navigation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for outdoor vision-language navigation that maintains goal-directed guidance through prolonged semantic-cue interruptions by integrating traversability awareness and 3D cue memory.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31120v1","title":"SWIM: Single-Instance Whole-Body Imitation for swiMming","abstract":"We propose a new method for synthesizing physically-based swimming motions. Physically-based character animation aims to generate physically valid, controllable, and natural-looking motions which can respond to unexpected disturbances, where one dictating factor of difficulty is the complexity of the task, especially the level of sophistication of the required interactions with the environment. Existing research has succeeded in various tasks in static and dynamic environments. We push the difficulty further to swimming, which requires full-body coordination and continuous interactions with fluids, a new level of complexity when it comes to interacting with the environment. This complexity imposes challenges in learning control under volatile environmental forces, generalizing control to different environments and swimming styles, lack of data references, and prohibitively slow physical simulation which is inevitable during control learning. To this end, we propose SWIM, a new imitation method for swimming motions, which can learn from a single swimming motion and generalize to unseen environments, body conditions, and swimming styles. 
Extensive evaluation and comparison demonstrate that SWIM is data-efficient, stable, robust, and generalizable, outperforming alternative methods across multiple classes of tasks and metrics.","published_date":"2026-05-29T10:30:03+00:00","viability_score":7,"cluster_label":"Character Animation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SWIM is a novel imitation learning method that synthesizes physically-based swimming motions from a single example, generalizing to unseen environments and body conditions.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.31100v1","title":"Vector Linking via Cross-Model Local Isometric Consistency","abstract":"We study Vector Linking: given two embedding clouds produced by different black-box encoders over partially overlapping datasets, recover cross-model object correspondences using only vectors. Empirically and theoretically, we show that independently trained contrastive encoders exhibit local geometric consistency: short-range distances are approximately preserved up to a scale factor, while long-range distances are not due to model-specific distortion. Building on this, we propose an iterative, reference-based geometric embedding hashing that recovers vector links from a tiny seed set of paired anchors. It represents each vector by distances to sampled paired anchors, proposes candidate links via hash-space matching, and aggregates evidence across views in a Beta-Bernoulli posterior to bootstrap high-confidence links as new anchors. Experiments across multiple benchmarks and embedding model pairs demonstrate accurate and robust linking under varying overlap, seed budgets, and out-of-domain anchors, with applications to vector database integration and cross-model clustering. 
Code is available at https://github.com/DBgroup-Edinburgh/VecLinking.","published_date":"2026-05-29T10:12:52+00:00","viability_score":7,"cluster_label":"Vector Databases","has_code":true,"repo_url":"https://github.com/DBgroup-Edinburgh/VecLinking","commercial_flags":["has_code"],"one_liner":"Vector Linking uses cross-model local isometric consistency to recover object correspondences between embedding clouds from different black-box encoders, enabling vector database integration.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.31099v1","title":"KnowledgeGain: Evaluating and Optimizing Science News Generation for Reader Learning","abstract":"Science news is an important medium to communicate discoveries between the research communities and the public. Yet, most metrics for generated or summarized text evaluate semantic similarity and factual consistency, but do not measure how much knowledge readers learn from the news. We introduce KnowledgeGain, a metric that evaluates the quality of science news by measuring how much knowledge readers gained after reading it. To evaluate the metric, we first performed a controlled human study and showed that the metric successfully captures the differential knowledge gained by human readers reading different types of science media. The data allowed us to calibrate a prompt-only LLM reader simulator. We use it to rank and filter candidate articles before human evaluation. A second human study shows that articles selected with this simulator improve post-reading accuracy and normalized KnowledgeGain over a strong generation baseline. 
Our work is a step toward generating science news that better meets the knowledge and comprehension goals of Bloom's Taxonomy.","published_date":"2026-05-29T10:12:07+00:00","viability_score":5,"cluster_label":"LLM Applications","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"KnowledgeGain is a new metric for evaluating science news generation that measures reader knowledge acquisition, enabling optimization for better comprehension.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2605.31097v1","title":"SpecDB: LLM-Generated Customized Databases via Feature-Oriented Decomposition","abstract":"Mainstream relational databases ship a uniform feature set across deployments, although individual workloads exercise only a fraction of the available subsystems. We investigate whether a database can instead be generated on demand with a feature set matched to the target workload. We present SpecDB, a system that uses large language models (LLMs) to synthesize customized relational databases. We survey 9 production systems and decompose them into 10 functional modules, each further divided into implementation variants. To capture cross-module dependencies, including cases where implementations in disjoint subtrees must be co-designed, we adopt the FODA feature model and extend it with a cooperate edge, yielding a dependency graph DBGraph. SpecDB operationalizes DBGraph through a layered module-construction pipeline in which each module is generated, validated, and integrated by a dedicated subagent (driven by three inner agents: Main, Tester, Architect), and a Refining Agent that iteratively repairs and tunes the assembled database against a user-supplied refining harness with read-only access to existing database source code. A companion selection component translates a natural-language workload description into a set of implementation variants, providing an end-to-end pipeline from workload description to deployable database. 
We evaluate SpecDB on TPC-C with BenchmarkSQL. The generated database (23,779 lines of Rust) completes 60-minute TPC-C at 1 and 10 warehouses with zero errors. At 10 warehouses it reaches tpmC=130, compared to 128 for PostgreSQL and 127 for MySQL, with comparable latency at ~3% of their code size. Because the agent operates at module-specification level rather than product source, it can in principle combine techniques across system boundaries. Paired with falling LLM costs, generating a purpose-built database for a target workload is becoming straightforward.","published_date":"2026-05-29T10:07:43+00:00","viability_score":7,"cluster_label":"Database Generation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SpecDB uses LLMs to generate customized relational databases tailored to specific workloads, achieving performance comparable to established systems with a fraction of the code size.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31094v1","title":"Redefining Instance Matching: A Unified Framework for Part-Aware Matching in Panoptic Segmentation Evaluation","abstract":"The Panoptic Quality (PQ) metric is the standard for jointly evaluating instance and semantic segmentation. However, its original definition relies on a One-to-One matching between predicted and ground truth segments, which is only straightforward when the IoU threshold exceeds 0.5. Below 0.5, multiple matching strategies emerge in a poorly explored problem space. We systematically elucidate this space by recasting segment matching as a constrained bipartite assignment problem. Independently bounding the prediction- and ground-truth-side degrees yields four matching strategies: One-to-One, Many-to-One, One-to-Many, and Many-to-Many. We show that the first three are well-defined within the PQ framework, while Many-to-Many falls outside it. 
These strategies become relevant when instances are fragmented, adjacent objects are difficult to delineate, or annotations are noisy. Central to our framework is a vertex-based accounting of TP, FN, and FP, anchored to ground truth and predicted segments rather than to matching edges. We further show that the framework extends naturally to part-aware panoptic segmentation, and we explore part-aware evaluation on biomedical data. Across configurable case studies we report how different combinations of thresholds and matching strategies behave in practice. We release a unified open-source package built on Panoptica. It exposes Voronoi-based region-wise analysis, part-aware evaluation, and Area Under Threshold Curve computations as configurable options.","published_date":"2026-05-29T10:04:25+00:00","viability_score":3,"cluster_label":"Computer Vision Evaluation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A unified framework redefines instance matching for panoptic segmentation evaluation, extending to part-aware scenarios and offering configurable analysis tools.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31090v1","title":"On Revisiting Entropy for Identifying Mislabeled Images","abstract":"Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively high entropy throughout the training process. Building on this insight, we introduce a signed entropy integral (SEI) statistic that captures both the magnitude and temporal trend of prediction entropy across training epochs. 
SEI is broadly applicable to classification networks and demonstrates particular effectiveness when integrated with contrastive language-image pretraining (CLIP) architectures. Through extensive experiments on four medical imaging datasets -- a domain particularly susceptible to labeling errors due to diagnostic complexity -- spanning diverse modalities and pathologies, we demonstrate that SEI achieves state-of-the-art performance in mislabeled data identification, outperforming existing methods while maintaining computational efficiency and implementation simplicity. Our code is available at https://github.com/MedAITech/SEI.","published_date":"2026-05-29T09:59:32+00:00","viability_score":8,"cluster_label":"Data Quality","has_code":true,"repo_url":"https://github.com/MedAITech/SEI","commercial_flags":["has_code"],"one_liner":"A novel method using signed entropy integrals detects mislabeled images in training datasets, outperforming existing techniques with computational efficiency.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31080v1","title":"A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models","abstract":"Blind and low-vision (BLV) audiences remain underserved by visual art descriptions, particularly across languages and in museum settings where privacy and intellectual-property constraints may favour small on-premise vision-language models (VLMs). This pilot study investigates curator-guided multilingual art description with Qwen2.5-VL-3B-Instruct for German, Romanian, and Serbian. We construct a parallel BLV-oriented caption corpus from artwork images and metadata, and compare language-specific LoRA adapters with a single multilingual adapter under a fixed backbone and training budget. 
Evaluation combines automatic lexical and embedding-based metrics with an LLM-as-Judge protocol calibrated against a small Romanian BLV pilot study. Under our pilot setup, language-specific adapters show more stable controllability and visually grounded description quality for Romanian and Serbian, while multilingual adaptation remains competitive in German. We frame these findings as deployment-oriented evidence for small on-premise VLMs, and highlight the need for larger BLV user studies and broader language coverage before drawing general conclusions about multilingual accessibility.","published_date":"2026-05-29T09:47:45+00:00","viability_score":5,"cluster_label":"Multimodal AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Curator-guided multilingual art descriptions for blind and low-vision audiences are explored using small vision-language models and LoRA adapters.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2605.31065v1","title":"DRIFT: Joint Channel Estimation and Prediction Towards Pilotless 6G Non-Terrestrial Networks","abstract":"Non-terrestrial networks (NTNs) are expected to play a pivotal role in sixth-generation (6G) systems by enabling ubiquitous connectivity and massive communication. In this context, channel prediction emerges as a key technique to improve the spectrum utilization efficiency by limiting the pilot overhead. However, many proposed predictors based on artificial intelligence (AI) are characterized by high inference complexity, posing challenges to onboard implementation. In this paper, we address the challenge of designing accurate yet computationally efficient channel prediction techniques tailored to low Earth orbit (LEO) NTNs, where strict power constraints limit model complexity, to enable spectral efficiency gains. 
We propose an iterative joint channel estimation and prediction framework in the context of 6G NTNs that significantly reduces pilot overhead by transmitting pilots only in the initial slot and relying on data-driven processing for subsequent slots. We introduce Data-driven Refinement and Iterative Forecast for wireless channel Tracking (DRIFT), a lightweight architecture that refines data-aided channel estimates and predicts future channel frequency responses with low computational cost and reduced error propagation. Two predictor variants based on convolutional and long short-term memory layers are investigated. Simulation results in an end-to-end simulation of an uplink LEO NTN scenario show that the proposed approach achieves up to 12% spectral efficiency gain compared to conventional pilot-based systems, with robustness to training-test mismatches and consistent performance across different channel models. Moreover, DRIFT requires fewer than 200k multiply-accumulate operations, making it suitable for on-board satellite implementation under stringent power constraints.","published_date":"2026-05-29T09:35:35+00:00","viability_score":3,"cluster_label":"6G Communications","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A lightweight AI framework for channel estimation and prediction in 6G non-terrestrial networks to improve spectral efficiency under strict power constraints.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31064v1","title":"Fighting Numerical Hallucinations via Data-centric Compilation for Online Financial QA","abstract":"Large Language Models (LLMs) have significantly advanced online data services, particularly in the domain of financial question answering (FinQA). However, such systems remain susceptible to numerical reasoning hallucinations, which critically undermine reliability in high-stakes financial applications. 
Although retrieval-augmented generation (RAG) has been widely adopted to ground responses in external knowledge, it introduces three persistent challenges: noise sensitivity, calculation fragility, and an auditability crisis. Existing model-centric approaches, which primarily focus on optimizing either the retriever or generator in isolation, still struggle to address these issues in an integrated manner. In this work, we pioneer a data-centric paradigm and propose a novel framework, the Data-centric Reasoning Compiler (DCRC). The framework operates through three cohesive phases: (1) adversarial data construction, which synthesizes training examples with controlled noise to teach robustness; (2) multi-stage training that cultivates a Data-centric Structuring Agent (DSA) capable of explicit evidence auditing and program synthesis; and (3) a compile-and-execute inference process, where the DSA transforms user queries and retrieved documents into verifiable, executable reasoning programs. This data-driven framework ensures faithful numerical reasoning by design. We conduct extensive experiments on established offline benchmarks and further validate our framework through deployment in a real-world online financial QA system.","published_date":"2026-05-29T09:35:11+00:00","viability_score":7,"cluster_label":"Financial QA","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A data-centric compiler that trains LLMs to perform verifiable numerical reasoning for financial question answering, reducing hallucinations.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31061v1","title":"STEP: Learning STructured Embeddings for Progressive Time Series","abstract":"We present a novel method for learning interpretable representations of progressive time series, that is, data capturing irreversible state transitions such as degradation or task completion. 
Our approach uses a self-supervised contrastive objective to learn a low-dimensional latent space whose geometry is itself the interpretation: each observation becomes a point on a manifold anchored between two fixed orthogonal prototype vectors, and a trajectory becomes a path across that manifold. From this structure we read a latent compass, the polar coordinates (\u03b8, r) of the latent vector, in which \u03b8 tracks the progression of the underlying state (e.g., from healthy to failed) and r identifies the active mode (e.g., the operating condition), without any proxy labels. We evaluate the approach against the state of the art on diverse domains, including industrial degradation, robotic tasks, and neural activity, validating three key capabilities: (1) end-state prediction, (2) multi-step forecasting, and (3) interpretable phase separation. Our method matches or improves over black-box counterparts on all of these while providing transparency about the underlying mechanisms. A simple linear regressor on top of the latent compass coordinates is competitive with deep architectures, direct quantitative evidence that the underlying state is encoded in a geometrically accessible form.","published_date":"2026-05-29T09:33:34+00:00","viability_score":3,"cluster_label":"Time Series Analysis","has_code":true,"repo_url":"https://github.com/stealthglider/LRPTS","commercial_flags":["has_code"],"one_liner":"A self-supervised method for learning interpretable representations of progressive time series, enabling end-state prediction and forecasting.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31053v1","title":"AnchorSteer: Self-Discovered Concept Injection for Structure-Preserving Music Editing","abstract":"Controllable music editing aims to modify high-level attributes while strictly preserving rhythmic and melodic structures. 
However, this task is challenged by a semantic-structural entanglement: steering methods often degrade structure to achieve editing performance, while structural adaptors suppress semantic responsiveness. We propose AnchorSteer, a framework that disentangles this tension by coupling structural anchoring with self-discovered semantic steering. The proposed approach probes internal representations to extract interpretable, label-free concept vectors via a self-supervised reconstruction objective, isolating attributes without curated data. During editing, these portable, plug-and-play concept vectors are injected into diffusion hidden manifolds while a structural adaptor enforces consistency. Variants for unconditioned and conditioned injections are provided to balance robustness and semantic strength. Experiments on ZoME-Bench and subjective tests show that the proposed framework outperforms both steering-only and anchoring-only baselines, enabling significant semantic transformations with high-fidelity structural preservation.","published_date":"2026-05-29T09:25:55+00:00","viability_score":3,"cluster_label":"Music Editing","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for controllable music editing that disentangles semantic attributes from structural preservation using self-discovered concept injection.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31051v1","title":"Linear Ordering Problem: Time for a Change","abstract":"The Linear Ordering Problem (LOP) is a fundamental combinatorial optimization problem with important applications in areas such as economics, social choice, and machine learning. Its most prominent use is the triangulation of economic input-output tables, which helps identify critical industries in an economy. Most existing algorithms have been evaluated on benchmarks derived from outdated macroeconomic data, which no longer reflect the structure of contemporary economies. 
Furthermore, LOP instances often exhibit many distinct global optima that can differ substantially from one another, creating challenges for applications that rely on a single solution. To address these limitations, we introduce a novel benchmark suite derived from up-to-date real-world economic data and an algorithmic scheme that leverages state-of-the-art LOP metaheuristics to generate diverse sets of high-quality solutions, together with metrics for assessing both quality and diversity. Experiments were conducted to report results on the proposed benchmark suite under both the traditional single-solution setting and the newly introduced multi-solution scenario.","published_date":"2026-05-29T09:25:48+00:00","viability_score":3,"cluster_label":"Optimization","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel benchmark and algorithmic scheme for the Linear Ordering Problem using up-to-date economic data to generate diverse, high-quality solutions.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.31049v1","title":"Learning to Solve and Optimize by Evolving Code","abstract":"Combinatorial and optimization problems are fundamental to many industrial AI applications. Solving large-scale real-world instances of such problems typically requires careful problem formalization, specialized solvers, and expert-designed heuristics. Thus, experts need to specify not only what solutions are, but also how they are derived. By introducing the tool CHECKMATE, we show that algorithm generation via code evolution represents a paradigm shift by eliminating the need to formulate the how. CHECKMATE solely relies on the what. Specifically, a formal specification ensures solutions' correctness and enables systematic performance evaluation of the generated programs, while a natural language description guides the evolutionary process. 
The effectiveness of our method is demonstrated on selected problems from two industrial domains: configuration and scheduling. In all cases, the evolved algorithms consistently outperform state-of-the-art solvers. This underscores the potential of formal methods in guiding code evolution for automatically solving complex real-world problems.","published_date":"2026-05-29T09:25:07+00:00","viability_score":8,"cluster_label":"Code Generation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CHECKMATE evolves code to automatically solve complex combinatorial and optimization problems, outperforming state-of-the-art solvers without expert-defined heuristics.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31043v1","title":"Routing on the Stiefel Manifold: When Does Adaptive Subspace Selection Help for Cross-Domain EEG Decoding?","abstract":"Cross-domain EEG decoding remains challenging despite advances in Riemannian deep learning: covariance matrices from different subjects occupy systematically distinct regions of the SPD manifold, yet existing domain adaptation methods either require target-domain calibration data or learn subject-specific components that cannot generalise across domains. We propose dynamic Stiefel routing: a pool of $K$ expert projection filters on the Stiefel manifold, each specialised for a different region of the SPD manifold, with each input covariance routed to the most appropriate filter via cross-attention, adapting the subspace projection per sample. A central finding is that this approach, implemented naively, provably collapses to ensemble averaging: when routing weights are uniform, the adaptive filter reduces exactly to an equal-contribution combination of experts, indistinguishable from a single fixed filter. 
Three structural properties break this degeneracy: a symmetric anchor $W_{\\mathrm{base}} \\in \\mathrm{St}(n,k)$ that removes proximity bias among experts; a frozen domain-discriminative query encoder that decouples routing from task optimisation; and a decoupled key alignment loss that trains expert keys toward stable domain attractors. Together they produce the first genuinely committed and domain-structured routing on SPD manifolds, with consistent gains across three datasets: balanced accuracy improves from $0.773\\to 0.823$, $0.757\\to 0.809$, and $0.801\\to 0.839$, with the alignment strategy determined automatically by a single data-driven rule and no dataset-specific hyperparameter search.","published_date":"2026-05-29T09:20:25+00:00","viability_score":7,"cluster_label":"EEG Decoding","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Dynamic Stiefel routing for cross-domain EEG decoding, using adaptive subspace selection to improve accuracy by routing covariance matrices to specialized projection filters.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31042v1","title":"From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors","abstract":"LLM agents are evolving from conversational chatbots to operational tools in real-world workspaces. In local agentic harnesses, an LLM can read and write files, call tools, and reuse workspace state across sessions. While such capabilities enhance utility, they also expose a new attack surface for attackers. Attackers can embed a prompt injection within a file or tool output. Agents may read this hidden instruction, store it, and execute it later. In this multi-step trojan attack paradigm, no individual step appears malicious on its own, but these steps can collectively turn untrusted text into persistent control content. However, existing defenses often inspect each step in isolation. 
As a result, they can block a clear harmful action, but fail to detect the earlier write operation that plants the backdoor. To reveal this threat, we introduce ClawTrojan, a benchmark designed to identify multi-step trojan attacks in local agentic harnesses. In an OpenClaw-style simulated workspace with GPT-5.4, ClawTrojan reaches a 95.5% attack success rate (ASR), while existing single-turn prompt-injection attacks produce near-zero ASR on the same model. To address this threat, we propose DASGuard, which scans control-like text in sensitive local files, traces its origin, and removes control content that does not originate from a trusted source. Our results show that DASGuard achieves strong dynamic defense by combining runtime attack blocking with sanitized commits to the workspace.","published_date":"2026-05-29T09:19:07+00:00","viability_score":8,"cluster_label":"LLM Security","has_code":true,"repo_url":"https://github.com/RUC-NLPIR/ClawTrojan","commercial_flags":["has_code"],"one_liner":"DASGuard defends LLM agents against multi-step trojan attacks by scanning sensitive files for untrusted control content and tracing its origin.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.31041v1","title":"Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?","abstract":"Vision-Language-Action (VLA) models have demonstrated promising capability in autonomous driving, highlighting the potential of unified multimodal architectures for jointly modeling perception and planning. However, how current VLA-based driving behavior is grounded in visual information remains poorly understood. Existing evaluation protocols mainly focus on aggregate performance metrics, lacking structured and practical diagnostics to quantify visual-behavior dependency. 
In this work, we introduce a structured multi-level visual perturbation framework to analyze visual-behavior dependency in VLA-based driving models systematically. The framework organizes controlled visual perturbations along three complementary dimensions: channel-level degradation, information-level disruption, and structure-level modification. We apply it to VLA-based driving systems and evaluate behavioral responses under both open-loop trajectory prediction and interactive closed-loop safety evaluation. Experimental results reveal evaluation-dependent dependency patterns and uneven visual grounding across abstraction levels. These findings call for more structured analyses and principled design of VLA driving models to better understand how visual information shapes behavior and develop safer, more robust systems.","published_date":"2026-05-29T09:18:32+00:00","viability_score":4,"cluster_label":"Autonomous Driving","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for systematically analyzing how visual information impacts autonomous driving model behavior through structured perturbations.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.31034v1","title":"Annealed Softmax Greedy in Many-Armed Bayesian Bandits","abstract":"Reinforcement learning with verifiable rewards (RLVR) and group-based policy optimization methods such as GRPO update a stochastic policy by sampling multiple completions per prompt and increasing the policy's probability on those with higher reward, regularized by a KL penalty toward a reference policy. These updates do not include explicit mechanisms that track epistemic uncertainty. This paper studies a stylized explanation for why such uncertainty-agnostic updates can nevertheless be effective. We analyze an annealed softmax (Boltzmann) policy that selects actions according to a softmax of empirical mean rewards in a many-armed Bayesian Bernoulli bandit. 
Under a linear upper-tail condition on the prior (the $\u03b2=1$ case of $\u03b2$-regularity), which implies an abundance of near-optimal arms, we prove that annealed softmax greedy achieves Bayes regret $\\tilde{O}(m + T/m)$, and in particular $\\tilde{O}(\\sqrt{T})$ when the number of arms scales as $m = \u0398(\\sqrt{T})$. This is the near-optimal Bayes regret rate in this regime, attained also by empirical-mean greedy. Under $\u03b2$-regularity, many arms maintain empirical means close to the optimum throughout learning, so when softmax samples an arm other than the empirically best, that arm tends to be another near-optimal one rather than a clearly inferior one. By contrast, with a small number of arms, the same kind of softmax policy can suffer linear regret. The result also provides a structural analogy to RLVR, where a base policy with a non-negligible probability of producing a correct completion plays the role of $\u03b2$-regularity.","published_date":"2026-05-29T09:05:29+00:00","viability_score":0,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Analyzes why annealed softmax policies can achieve near-optimal regret in many-armed Bayesian bandits under specific conditions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.31031v1","title":"GraphARC: A Comprehensive Benchmark for Graph-Based Abstract Reasoning","abstract":"Relational reasoning lies at the heart of intelligence, but existing benchmarks are typically confined to formats such as grids or text. We introduce GraphARC, a benchmark for abstract reasoning on graph-structured data. GraphARC generalizes the few-shot transformation learning paradigm of the Abstraction and Reasoning Corpus (ARC). Each task requires inferring a transformation rule from a few input-output pairs and applying it to a new test graph, covering local, global, and hierarchical graph transformations. 
Unlike grid-based ARC, GraphARC instances can be generated at scale across diverse graph families and sizes, enabling systematic evaluation of generalization abilities. We evaluate state-of-the-art language models on GraphARC and observe clear limitations. Models can answer questions about graph properties but often fail to solve the full graph transformation task, revealing a comprehension-execution gap. Performance further degrades on larger instances, exposing scaling barriers. More broadly, by combining aspects of node classification, link prediction, and graph generation within a single framework, GraphARC provides a promising testbed for future graph foundation models.","published_date":"2026-05-29T09:03:30+00:00","viability_score":7,"cluster_label":"Graph Reasoning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"GraphARC is a new benchmark for abstract reasoning on graph data, revealing limitations in current language models and paving the way for graph foundation models.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31023v1","title":"HADT: A Heterogeneous Multi-Agent Differential Transformer for Autonomous Earth Observation Satellite Cluster","abstract":"This work addresses the problem of autonomous resource management in heterogeneous satellite cluster conducting Earth Observation (EO) missions including optical and Synthetic Aperture Radar (SAR) satellites. In autonomous operation mode, satellites are equipped with intelligent capabilities enabling real-time decision-making based on the latest conditions, while requiring minimal interaction with ground operators. Traditional scheduling approaches typically rely on mathematical models to represent satellite mission and resource management. Then, this problem is solved by using optimization algorithms. 
However, such solutions become less effective when the underlying models are not available, over complex, and inaccurate due to dynamic changes and uncertainties inherent in the space mission environment. A promising alternative is to reformulate the problem as a sequential decision-making process and apply model-free reinforcement learning techniques to enable adaptive and real-time resource management. To this end, we propose a novel transformer-based architecture tailored for heterogeneous satellite cluster autonomous EO Mission with relational observations-actions tokenization and differential attention mechanism. Our experimental results demonstrate significant performance improvements compared to the available baselines. Moreover, the proposed architecture exhibits strong adaptability and transferability with respect to varying numbers of satellite clusters.","published_date":"2026-05-29T08:54:41+00:00","viability_score":7,"cluster_label":"Satellite Resource Management","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel transformer architecture for autonomous, adaptive resource management in heterogeneous Earth Observation satellite clusters.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.31021v1","title":"A Persona-Based Evaluation Framework for Pluralistic Alignment in Generative AI","abstract":"Current alignment paradigms for generative artificial intelligence rely predominantly on monolithic benchmarking frameworks that reduce the plurality of human judgment to aggregated statistical baselines, thereby obscuring cultural, demographic, and contextual variability in evaluation. We introduce a state-space constrained emulation framework for AI evaluation that replaces singular assessment functions with a structured manifold of synthetic cognitive profiles representing diverse human perspectives. 
We show that modern generative architectures can instantiate and maintain these evaluative personas with high consistency, enabling a form of pluralistic, perspective-dependent benchmarking that more closely reflects real-world consensus variability. However, we further analyze the stability of these simulated evaluators under sequential inference and stochastic prompt perturbations, revealing systematic degradation in persona coherence that manifests as state-space drift and semantic inconsistency. These findings suggest that static alignment constraints are insufficient for sustaining robust evaluative behavior over time. Instead, we argue for the necessity of embedding dynamic, viability-driven regulatory mechanisms within generative systems to preserve coherent cognitive emulation. By framing persona-based evaluation as a structured dynamical system over latent representation manifolds, this study provides a foundation for more adaptive, human-aligned, and context-sensitive approaches to AI evaluation.","published_date":"2026-05-29T08:54:09+00:00","viability_score":3,"cluster_label":"AI Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for evaluating generative AI using diverse synthetic cognitive profiles to capture pluralistic human judgment.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.31007v1","title":"DEM: A Distilled Explanation Model for Interpretable Anomaly Detection in Physiological Sensor Networks","abstract":"Anomaly detection in physiological sensor data from Wireless Body Area Networks (WBANs) can be caused by sensor faults, network disruptions, or missing data, leading to false alarms. Hence, it demands both high predictive accuracy and clinically interpretable explanations. Existing approaches rely either on black-box models that achieve strong performance but offer no transparency, or on post-prediction explanation methods such as SHAP and LIME. 
In this paper, we propose the Distilled Explanation Model (DEM), a three-stage glass-box framework that distills the non-linear knowledge of a gradient boosting expert into an interpretable decision tree operating on residuals relative to a linear baseline, so that the explanation is not an approximation but the prediction itself. DEM introduces a novel distillation fidelity metric that quantifies how faithfully the explanation tree captures the expert model's non-linear contribution, providing a principled measure of explanation trustworthiness absent from prior interpretable models. Evaluated across four physiological datasets, including MIMIC-IV, WESAD, eICU, and an in-house SmartNet WBAN corpus, DEM achieves an AUC of 0.9964 on clinical contextual anomaly detection and 0.9047 on wearable stress detection while producing human-readable if-then rules at a controllable depth. Inference requires 0.17ms per 1000 samples, rendering DEM 1235x faster than SHAP-based post-hoc explanation and suitable for real-time physiological monitoring. Ablation studies confirm that the XGBoost distillation step provides measurable gains over naive residual fitting, and depth-sensitivity analysis demonstrates an explicit, user-controlled accuracy-interpretability trade-off unique to DEM among existing intrinsically interpretable models.","published_date":"2026-05-29T08:39:39+00:00","viability_score":7,"cluster_label":"Interpretable Anomaly Detection","has_code":true,"repo_url":"https://github.com/Jyotirmoy17/dem-model","commercial_flags":["has_code"],"one_liner":"A real-time, interpretable anomaly detection system for physiological sensors that provides human-readable explanations alongside high accuracy.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30984v1","title":"Generating Reports or Repeating Templates? 
Measuring and Mitigating Template Collapse in 3D CT Report Generation","abstract":"Modern 3D medical vision-language models (VLMs) can generate fluent radiology-style text while exhibiting critically low pathology detection and output diversity, collapsing to generic templates that under-report rare yet critical findings. We identify this failure mode as Template Collapse. This failure stems from the unique constraints of 3D medical imaging, e.g., limited data, severe label imbalance, and weak signals from volumetric encoders. Under these constraints, text-generation objectives encourage shortcut learning and fluent but weakly grounded reports. We systematically diagnose the Template Collapse through clinical fidelity, output diversity, normal-template bias, and rare-finding survival. To mitigate it, we propose CLarGen, a decoupled framework that separates what to say (clinical detection) from how to say it (language synthesis). CLarGen uses (i) a Latent Query Transformer for multi-label pathology detection, (ii) pathology-guided retrieval for clinically matched exemplars, and (iii) a medical language model to synthesize the final report from detected findings and retrieved context. Across state-of-the-art 3D CT report generation baselines, CLarGen mitigates Template Collapse and substantially improves clinical accuracy (macro-F1 0.487 vs. 0.189; CRG 0.472 vs. 0.368) while maintaining fluent reporting. Our results suggest that explicit, measurable clinical grounding is essential for template-collapse-resistant 3D CT report generation. 
Code will be released upon acceptance.","published_date":"2026-05-29T08:21:00+00:00","viability_score":5,"cluster_label":"Medical Report Generation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A decoupled framework to mitigate template collapse in 3D CT report generation, improving clinical accuracy and output diversity.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30968v1","title":"Variational Adapter for Cross-modal Similarity Representation","abstract":"The core of vision-language models lies in measuring cross-modal similarity within a unified representation space. However, most image-text matching or multi-class image classification datasets lack fine-grained cross-modal matching annotations, forcing the continuous similarity space into binary classification boundaries. This compression induces false negative samples and significantly impairs the generalization performance of cross-modal tasks. While prior research has attempted to mitigate this by modeling intra-modal ambiguity, it often overlooks inherent annotation flaws, leading to suboptimal uncertainty allocation. To address these challenges, we propose a Variational Adapter for Cross-modal Similarity Representation (VACSR). This approach reformulates image-text matching with fine-grained semantic scarcity as a variational inference problem. It constructs a latent space for cross-modal similarity and uses regularization techniques to mitigate overfitting to binary annotations. 
Experiments on image-text retrieval, domain generalization, and base-to-novel generalization demonstrate the proposed method's effectiveness and robust generalization ability.","published_date":"2026-05-29T08:08:15+00:00","viability_score":5,"cluster_label":"Vision-Language Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A variational adapter that improves cross-modal similarity representation by addressing annotation flaws in image-text matching.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.30966v1","title":"Reading Between the Citations: A Typed Claim Network for Scientific Literature","abstract":"Knowledge graphs over corpora of inter-referencing documents - scholarly papers, legal opinions, policy briefs - encode the topology of reference but not its stance. The standard representation collapses a rich evaluative relation into an untyped edge, losing the very content that supports community-level queries about how one document is received by another. We propose the claim network: a representational pattern in which each cross-document reference is reified as a typed claim, carrying source, target, claim text, and a four-class stance label grounded in the citation-intent literature. We give a construction pipeline applicable to any corpus of scholarly inter-referencing documents and instantiate it on a corpus of 127 papers in 3D point cloud semantic segmentation, producing a network of 8,260 typed claims. Three downstream task families demonstrate what the network enables: retrieval signal augmentation, aggregated-stance summarisation, and topological analytics. 
Head-to-head evaluation against standard Retrieval-Augmented Generation (RAG) baselines shows that the gain over flat retrieval is the gain from the right intermediate representation rather than the wrong one.","published_date":"2026-05-29T08:02:45+00:00","viability_score":7,"cluster_label":"Knowledge Graphs","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel claim network representation for scientific literature enhances retrieval, summarization, and analytics by capturing nuanced citation stances, outperforming standard RAG baselines.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30965v1","title":"ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment","abstract":"Recent advancements in text-guided audio generation have yielded promising results in diverse domains, including sound effects, speech, and music. However, jointly generating speech with environmental audio remains challenging due to the inherent disparities in their acoustic patterns and temporal dynamics. We propose ImmersiveTTS, an environment-aware text-to-speech (TTS) model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions. Our model builds on a multimodal diffusion transformer and fuses transcript-aligned speech latent with text-conditioned environmental context via joint attention. To enhance semantic consistency, we introduce a domain-specific representation alignment objective tailored to environment-aware TTS, leveraging complementary self-supervised representations from speech and audio encoders. 
Experimental results show that ImmersiveTTS achieves higher naturalness, intelligibility, and audio fidelity than existing approaches across objective metrics and human listening tests.","published_date":"2026-05-29T07:58:54+00:00","viability_score":7,"cluster_label":"Text-to-Speech","has_code":true,"repo_url":"https://github.com/jjunak-yun/ImmersiveTTS","commercial_flags":["has_code"],"one_liner":"ImmersiveTTS is an environment-aware text-to-speech model that seamlessly integrates speech with environmental audio using a multimodal diffusion transformer and domain-specific alignment.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30963v1","title":"AMix-2: Establishing Protein as a Native Modality in Large Language Models","abstract":"We present AMix-2, a protein-text foundation model that establishes protein as a native modality in large language models (LLMs), unifying protein understanding and sequence design within a single foundation model. AMix-2 is built upon two key ideas: (1) a unified protein-text formulation that embeds natural language and protein sequence in a shared token space, enabling one model to perform biological reasoning and conditional design instead of separate downstream task-specialized models; and (2) a block-wise diffusion language modeling backbone that combines causal generation across blocks with bidirectional context and iterative refinement within blocks. This scheme better matches the intrinsic nature of proteins than a strict left-to-right factorization. To evaluate protein foundation models under realistic generalization settings, we further introduce ProteinArena, a comprehensive benchmark with time-aware and homology-aware protocols across various understanding and design tasks, and with baselines covering classical bioinformatics tools, protein-specialized models and LLMs. 
On ProteinArena, AMix-2 outperforms frontier LLMs and demonstrates competitive performance to task-specific protein models. Controlled experiments further show that the diffusion-based paradigm generally surpasses its autoregressive counterpart, highlighting the advantage of flexible generation order for protein sequences. We release both AMix-2 and ProteinArena to facilitate open research in protein foundation models.","published_date":"2026-05-29T07:58:08+00:00","viability_score":9,"cluster_label":"Protein Foundation Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AMix-2 is a protein-text foundation model that unifies protein understanding and design, outperforming existing LLMs and task-specific models on a new comprehensive benchmark.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30934v1","title":"Do Large Language Models Encode Institutional Experience? Evidence from Cross-Linguistic Moral Reasoning Under Ambiguity","abstract":"Large language models (LLMs) exhibit systematic differences in moral reasoning across languages, yet the source of this variation remains unclear. We test the hypothesis that languages encode aspects of the institutional environments in which they are spoken, allowing LLMs to inherit institution-specific moral priors through training. Across nine languages spanning a broad gradient of institutional quality, six frontier LLMs, and two preregistered studies, we examine moral dilemmas whose acceptability depends on institutional functioning. In Study 1, explicit institutional framing produced uniformly null results: cross-linguistic moral divergence did not increase in institutionally contingent scenarios, nor did it track institutional differences between language communities. In Study 2, we introduced institutionally ambiguous scenarios in which institutional stakes were present but not explicitly stated. 
Under these conditions, cross-linguistic moral divergence increased relative to institutionally inert controls and, with one theoretically informative exception, was associated with real-world institutional differences between language communities. Explicit framing again attenuated these effects. These findings suggest that institutional experience may leave detectable traces in language that shape LLM moral reasoning, while also indicating that explicit institutional cues can suppress the expression of those differences.","published_date":"2026-05-29T07:23:23+00:00","viability_score":3,"cluster_label":"LLM Ethics & Bias","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research investigates whether LLMs encode institutional experience through cross-linguistic moral reasoning, finding that ambiguous scenarios reveal subtle influences of real-world institutional differences.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30930v1","title":"TUX: Measuring Human--AI Tacit Understanding","abstract":"As large language models (LLMs) increasingly act as collaborative partners, human--AI alignment is often evaluated through explicit task success, accuracy, or reward optimization. Yet many collaborative settings depend on tacit understanding: whether an agent can align with a human's evaluative stance or representational priors without clear objectives, communication, or feedback. To study this capacity, we develop a spectrum-placement task inspired by the social party game Wavelength, in which humans and agents independently place concepts along subjective spectra. We operationalize the Tacit Understanding Index (TUX) as a pairwise measure of similarity between human and agent judgments, and evaluate it with 241 human participants and 200 profile-conditioned LLM agents across four models. 
We find that nearest human--agent pairs in trait space achieve significantly higher TUX, suggesting that tacit alignment is structured by person-level characteristics rather than random similarity. Regression analyses show that TUX becomes more explainable as predictor sets become richer, with individual traits, decision-making styles, and confidence improving over aggregate trait-distance baselines. These findings suggest that tacit understanding between humans and LLMs is measurable, while revealing the limits of profile-based conditioning for capturing deeper representational alignment.","published_date":"2026-05-29T07:19:58+00:00","viability_score":3,"cluster_label":"LLM Alignment","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper introduces a new metric, TUX, to measure tacit understanding between humans and LLMs by analyzing subjective spectrum placements, revealing that person-level characteristics influence alignment.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30919v1","title":"De-attribute to Forget for LLM Unlearning","abstract":"The rapid development of large language models (LLMs) has raised concerns on the use of inappropriate data for training, which has led to a growing interest in LLM unlearning. Many existing LLM unlearning approaches rely on optimizing prediction loss(es), such as maximizing the loss on the forget set, but often face critical issues like over-forgetting and poor model utility. To address them, this paper novelly frames the optimization objective for LLM unlearning as one of zeroing out data attribution instead. In particular, we propose the first LLM unlearning framework based on data attribution rewards called DareU that performs reinforcement learning to update the LLM by reducing the attribution score of its generated responses (i.e., de-attributing) to the forget data owners. 
Empirical evaluation using an LLM classifier as an efficient approximation of attribution shows that DareU outperforms existing baselines by achieving effective unlearning while balancing forget quality and model utility well.","published_date":"2026-05-29T07:03:20+00:00","viability_score":4,"cluster_label":"LLM Unlearning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"DareU is a novel LLM unlearning framework that uses reinforcement learning to de-attribute generated responses from forget data, balancing unlearning quality with model utility.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2605.30913v1","title":"Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits","abstract":"Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. 
These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.","published_date":"2026-05-29T06:58:47+00:00","viability_score":3,"cluster_label":"LLM Reliability","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research investigates how toxic language in prompts degrades LLM factual reliability and internal computation, finding that lexical toxicity significantly reduces accuracy and increases uncertainty.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30911v1","title":"What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness","abstract":"Hallucination remains one of the key challenges undermining the reliability of Large Vision-Language Models (LVLMs). But what makes an LVLM hallucinate less? Many existing efforts focus on improving internal components of the model. We argue that hallucination fundamentally stems from how the model architecture is designed. To investigate this, we factor the architecture design into three dimensions: Linguistic Foundation (LF), Visual Representation (VR), and Semantic Alignment (SA), and categorize hallucinations into Co-occurrence, Similarity, and previously overlooked Uncertainty types. Building on this formulation, we propose CoSimUE, a benchmark that creates fine-grained hallucination scenarios through controlled textual perturbations and random perturbations, enabling mapping between design choices and hallucination behaviors. 
Experiments across 7 design aspects show that: 1) the widely emphasized scaling of model parameters has only limited impact on reducing all three types of hallucinations; 2) larger and better-trained language foundations can reduce co-occurrence hallucinations; 3) stronger visual encoders and higher resolutions mitigate similarity errors; 4) effective alignment strategies alleviate uncertainty hallucinations. 5) Furthermore, cross-dimensional analysis reveals that jointly enhancing visual fidelity and alignment quality yields the most comprehensive improvements. This study provides the first systematic exploration linking architecture-level design to hallucination robustness, offering practical guidance for developing reliable and efficient LVLMs.","published_date":"2026-05-29T06:47:31+00:00","viability_score":7,"cluster_label":"LVLM Hallucination","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CoSimUE is a benchmark and framework that links LVLM architectural design choices to specific hallucination types, providing guidance for building more reliable models.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30907v1","title":"BlueFin: Benchmarking LLM Agents on Financial Spreadsheets","abstract":"We present BlueFin, a benchmark that tasks large language model (LLM) agents with synthesis, manipulation, and comprehension tasks over spreadsheet workbooks in the professional finance domain. Though estimates of the global population of paying users of spreadsheet software range in the hundreds of millions -- an order of magnitude more than the estimated global population of professional developers -- comparatively fewer resources have been devoted to exploring and expanding LLM capabilities in the spreadsheet domain, with fewer still dedicated to mirroring real occupational tasks encountered by those in professional finance roles. 
In response, we curate a set of 131 challenging, complex tasks with real-world relevance in the domain, containing 3,225 granular rubric criteria; notably, our rubric criteria and LM judge evaluations are validated by a team of expert human annotators, resulting in high-quality, granular evaluations of complex tasks that are difficult to verify programmatically but can be reliably evaluated by an LM judge agent. Our judge achieves parity with expert consensus ($\u03b1=0.826$) with a macro-F1 score of 0.839. Frontier LLMs demonstrate poor performance on the challenging benchmark, with the strongest LLMs achieving less than 50\\% average scores across tasks -- models exhibit particular weaknesses in dynamic correctness. Our contributions include a dataset of examples across three categories of spreadsheet tasks, an open source harness and agentic evaluation framework, and a characterization of existing frontier models' performance on our benchmark.","published_date":"2026-05-29T06:43:23+00:00","viability_score":7,"cluster_label":"Benchmarking LLMs","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BlueFin benchmarks LLM agents on complex financial spreadsheet tasks to enhance their capabilities in professional finance.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30903v1","title":"Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach","abstract":"Inverse reinforcement learning (IRL) typically assumes demonstrations from a single optimal demonstrator, but in many applications data come from multiple imperfect demonstrators with heterogeneous suboptimality levels. We study reward learning in this setting through a feasible-reward-set framework: for each demonstrator, we encode its declared suboptimality level as a linear constraint and intersect the resulting feasible sets across demonstrators. 
Our theoretical analysis shows that the joint feasible set shrinks monotonically as data are added, and we give an exact characterization of when a new demonstrator strictly tightens it. We further establish two recovery guarantees for the feasible reward set of the ground-truth optimal demonstrator: one bound depends on closeness to the optimal occupancy, while the other requires only sufficient coverage and no near-optimal demonstrator. On the practical side, we introduce strategies to address the inherent reward ambiguity in the obtained reward set and provide an offline algorithm with function approximation for high-dimensional environments. Experiments in tabular grid-world and large language model (LLM) fine-tuning settings are consistent with the theoretical predictions and demonstrate the effectiveness of the proposed framework over baselines.","published_date":"2026-05-29T06:38:10+00:00","viability_score":5,"cluster_label":"Inverse Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper presents a feasible reward set approach for inverse reinforcement learning that accommodates multiple imperfect demonstrators.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30900v1","title":"BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs","abstract":"Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness. Predicting how objects will move and interact from a single image is still difficult for these systems. We present BilliardPhys-Bench, a benchmark for physical reasoning in synthetic billiards environments. Its procedural engine generates randomized scenarios with friction and elastic collisions. The benchmark tests three abilities: (1) predicting ball-to-ball collisions, (2) reasoning about wall bounces, and (3) estimating final ball positions after motion stops. 
We evaluate recent MLLMs from the GPT, Claude, Gemini, and Qwen families. Performance drops as simulation time increases and scene geometry grows more complex. We also observe a consistent failure mode we call \"stasis bias\": when the correct physical outcome is harder to infer, models tend to predict no interaction. These findings show where current MLLMs break down on visual dynamics and point toward the need for better physical inductive biases in multimodal architectures.","published_date":"2026-05-29T06:34:15+00:00","viability_score":6,"cluster_label":"Physical Reasoning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BilliardPhys-Bench benchmarks multimodal LLMs on physical reasoning tasks in synthetic billiards environments.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30899v1","title":"A Unified and Reproducible Experimentation Framework for Speech Understanding","abstract":"Speech foundation models and Speech LLMs have advanced speech understanding, yet deployment-oriented model selection is hindered by non-comparable evaluations caused by mismatched post-processing, and by training results that are hard to reproduce across data scales and pipelines. We present SURE, a unified experimentation framework that standardizes prediction formats, normalization, and scoring. SURE evaluates strong systems across paradigms, from conventional pipelines to Speech LLMs, on representative tasks under realistic acoustic and linguistic stressors. Beyond evaluation, SURE introduces an agent-assisted training conversion flow that maps paper and code into versioned, runnable training pipelines under a unified protocol on matched open-data subsets. 
Overall, SURE improves comparability and reproducibility for deployment-oriented evaluation.","published_date":"2026-05-29T06:33:36+00:00","viability_score":2,"cluster_label":"Speech Understanding","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"SURE is a unified framework for improving the comparability and reproducibility of speech understanding evaluations.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30898v1","title":"UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling","abstract":"In real-world deployments of large language models (LLMs), balancing inference quality and computational cost has become a central challenge. Existing approaches tackle this trade-off along two largely independent dimensions: model routing, which switches among models of different scales to match request complexity, and test-time scaling (TTS), which adjusts inference-time compute within a fixed model for fine-grained control. However, this decoupled design introduces inherent limitations. Model routing yields coarse-grained, discrete performance changes due to the sparse set of model scales, while single-model TTS often encounters capacity ceilings and exhibits diminishing returns as compute increases. Moreover, treating the two mechanisms separately restricts adaptability in dynamic inference environments. To overcome these limitations, we introduce Unified Inference Scaling (UIS), which unifies model routing and TTS in a single optimization space. Building on this formulation, we propose UniScale, an online framework that models adaptive UIS as a contextual multi-armed bandit problem and learns inference policies via LinUCB. The framework incorporates efficiency-aware learning and cost modeling to ensure stable and scalable optimization over high-dimensional action spaces. 
Evaluation shows that UniScale effectively exploits the synergy in the UIS space to deliver a fine-grained and consistently better quality-cost trade-off across diverse, dynamic inference scenarios.","published_date":"2026-05-29T06:31:21+00:00","viability_score":3,"cluster_label":"LLM Inference Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for unifying model routing and test-time scaling to optimize LLM inference quality and cost in dynamic environments.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30880v1","title":"PatchWorld: Gradient-Free Optimization of Executable World Models","abstract":"Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among evaluated methods, reaching 76.4\\% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. 
Code is available at https://github.com/HKBU-KnowComp/PatchWorld.","published_date":"2026-05-29T06:11:14+00:00","viability_score":7,"cluster_label":"Agent World Models","has_code":true,"repo_url":"https://github.com/HKBU-KnowComp/PatchWorld","commercial_flags":["has_code"],"one_liner":"A gradient-free framework that generates executable Python world models from offline trajectories for agent planning and prediction.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30873v1","title":"Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences","abstract":"Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user preferences (e.g., helpfulness vs. harmlessness). While Variational Preference Learning (VPL) offers a pathway to personalization, adapting it to decentralized settings presents a fundamental challenge: posterior collapse driven by severe local data scarcity and heterogeneity. In this paper, we propose Federated Variational Preference Alignment with Gumbel-Softmax Prior (FedVPA-GP), a framework designed to disentangle diverse preferences without compromising privacy. To stabilize variational inference, we introduce a Federated Mixture Prior that enables clients to leverage the aggregate population distribution as a dynamic prior. Furthermore, we incorporate an Orthogonal Loss that explicitly enforces the separation of preference prototypes in the latent space. 
Experiments on the HH-RLHF dataset demonstrate that FedVPA-GP significantly outperforms monolithic baselines, successfully disentangling conflicting user intents and enabling dynamic preference switching.","published_date":"2026-05-29T05:52:21+00:00","viability_score":5,"cluster_label":"Federated LLM Alignment","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A federated learning framework for personalized LLM alignment that disentangles diverse user preferences while preserving privacy.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30862v1","title":"Sophrosyne: Agentic Exploration of Relational Data Systems Needs Moderation","abstract":"Text2SQL agents powered by LLMs translate natural language intent into SQL by exploring the data system through tool calls before formulating the query. However, to ensure secure and scoped access, data systems construct environments with explicit API surfaces. We study and categorize these APIs exposed today as either coarse-grained or fine-grained and posit that choosing between them presents a fundamental tradeoff between cost-efficient exploration and accurate SQL generation. Most data systems expose fine-grained APIs, but this inadvertently disadvantages agents: they over-explore, incorporating irrelevant schema elements into their query formulation and produce inaccurate results. We argue that curbing over-exploration is key to the effective use of these API surfaces, and propose Sophrosyne, a data system environment that augments API responses with directives that guide the agent's exploration process. Initial results show that directives reduce over-exploration by 4.6x and boost accuracy by up to 12.4% (approx. 
4 percentage points).","published_date":"2026-05-29T05:37:54+00:00","viability_score":3,"cluster_label":"Agent Data Exploration","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A data system environment that augments API responses with directives to guide LLM agents and curb over-exploration.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30861v1","title":"Distilling LLM Feedback for Lean Theorem Proving","abstract":"Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. 
All in all, our results suggest a promising avenue to improve post-training for complex reasoning.","published_date":"2026-05-29T05:35:00+00:00","viability_score":3,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A new training method for LLMs that improves reasoning by having the model learn from its own feedback, showing promise for complex tasks like theorem proving.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30859v1","title":"DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning","abstract":"Reinforcement Learning (RL) has become pivotal for improving model capabilities yet suffers from rollout efficiency bottlenecks due to the long-tail response length distribution. While existing works mitigate the impact of long tails via prompt-level tail scheduling, we focus on the root source of inefficiency: the distribution itself. Specifically, we characterize the long-tail distribution at a finer granularity, identifying intra-prompt long tails, and revealing that they frequently consist of ineffective verbosity. To address this, we propose a novel paradigm of active distribution shaping to shape the rollout distribution towards conciseness and certainty, thereby fundamentally resolving tail-induced overheads. We achieve this through a distribution-aware trajectory sampling mechanism, which selects trajectories from a redundant exploration space for each prompt, and an adaptive redundancy allocation scheme to maximize both shaping effectiveness and system efficiency. 
Experiments demonstrate significant acceleration over state-of-the-art systems by up to 1.77x without compromising model performance.","published_date":"2026-05-29T05:31:46+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":true,"repo_url":"https://github.com/PKU-DAIR/DARTS","commercial_flags":["has_code"],"one_liner":"Accelerate LLM reinforcement learning by actively shaping response distributions to be more concise, achieving up to 1.77x speedup without performance loss.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30854v1","title":"Safe Equilibrium Policy Optimization for Strategic Agent Policies","abstract":"Language models fine-tuned with reinforcement learning typically optimize for task reward, ignoring multi-agent strategic structure. Because these agents condition on natural language game-state descriptions and emit actions through free-form generation, strategic failure modes -- exploiting weaker opponents, coordinating on harmful equilibria, and externalizing costs -- are inseparable from the language interface itself. We propose Safe Equilibrium Policy Optimization (\\sepo{}), a training objective that augments expected payoff with explicit penalties for exploitability, collusion risk, and externality cost. We implement \\sepo{} as a reward signal for Group Relative Policy Optimization (GRPO), applied to Gemma~4 E4B-it and Qwen~3.5-4B after supervised fine-tuning (SFT). Evaluated across five strategic domains (Iterated Prisoner's Dilemma, repeated auctions, two negotiation variants, and Kuhn Poker), \\sepo{} achieves zero exploit-pool advantage in Kuhn Poker for both models, outperforms the base model on safety in four domains, and corrects the over-cooperative behavior introduced by SFT. In negotiation, \\sepo{} achieves a positive-safety outcome and the only positive normalized relative advantage of any negotiation configuration. 
Ablation experiments confirm that per-rollout exploit computation is necessary: a shared constant penalty cancels in GRPO advantage normalization (constant control-variate property), producing zero gradient. To support further research in strategic safety for agents, we release our \\href{https://anonymous.4open.science/r/sepo-2668/README.md}{code} and SFT datasets.","published_date":"2026-05-29T05:20:32+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Enhance LLM agents in multi-agent games by optimizing for strategic safety, preventing exploitation and collusion while improving task performance.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30844v1","title":"Fine-Tuning Improves Information Conveyance in Language Models","abstract":"Fine-tuning is often believed to reduce uncertainty and diversity in large language models, but existing analyses overlook output length, a key confounder, and therefore fail to capture how uncertainty is distributed across an entire generation rollout. To address this, we propose Canopy Entropy ($\\mathrm{CE}^\\star$), a measure that views language generation from a tree perspective, where \"canopy\" represents the space of all possible rollouts, making $\\mathrm{CE}^\\star$ naturally quantify the effective size of the generation space. $\\mathrm{CE}^\\star$ jointly captures uncertainty in both the output length $N$ and the generated sequence $Y_{1:N}$ -- indeed, we show that it equals the total Shannon entropy $H(N, Y_{1:N}\\mid X)$, where $X$ denotes the prompt. This formulation yields interpretable metrics, including a length-entropy correlation term $\u03c1(N, r_N)$, where $r_N$ is the entropy rate, quantifying information conveyance efficiency by indicating whether longer outputs are more or less informative per token. 
Empirically, across tasks and model families, we find that fine-tuned models consistently exhibit stronger positive correlation $\u03c1(N, r_N)$, even when total entropy decreases. Furthermore, after controlling for model family, task, prompt, and output-length effects, we find that fine-tuning nearly triples the correlation strength between entropy rate and semantic diversity, suggesting that aligned models convert token uncertainty into semantic diversity more efficiently. Overall, these results demonstrate that fine-tuning does not simply reduce uncertainty, but fundamentally reorganizes it into more informative and semantically meaningful generations. Our code is available at https://github.com/WeiyiTian/canopy-entropy.","published_date":"2026-05-29T05:05:01+00:00","viability_score":4,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":"https://github.com/WeiyiTian/canopy-entropy","commercial_flags":["has_code"],"one_liner":"A new metric, Canopy Entropy, reveals that fine-tuning LLMs improves information conveyance by making generations more semantically diverse.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30838v1","title":"COMPASS: Cognitive MCTS-Guided Process Alignment for Safe Search Agents","abstract":"LLM-powered search agents enable multi-step reasoning and tool use. However, these capabilities introduce retrieval-induced safety degradation, as harmful intents may decompose into seemingly innocuous sub-queries that lead to unsafe outcomes. Existing alignment methods struggle to capture sparse safety signals and fail to supervise diverse violations across multi-step interactions. We propose COMPASS, a Cognitive MCTS-Guided Process Alignment framework designed to achieve robust safety alignment throughout the agent workflow while preserving general utility. 
COMPASS integrates cognitive tree exploration (CTE) to efficiently synthesize stealthy attack trajectories, and introspective step-wise alignment (ISA) to isolate risky intermediate actions for fine-grained process supervision. Empirical results show that COMPASS achieves a favorable safety-utility trade-off while requiring substantially less training data.","published_date":"2026-05-29T04:51:06+00:00","viability_score":3,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for aligning LLM-powered search agents to improve safety without sacrificing utility by using cognitive tree exploration and introspective step-wise alignment.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30834v1","title":"Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring","abstract":"Vision-Language-Action (VLA) models enable robots to follow natural language instructions and generalize across diverse tasks, but they remain vulnerable to execution failures that compromise reliability in real-world deployment. Detecting such failures during execution is therefore critical for the robust deployment of embodied systems. Existing failure detection methods either rely on expensive action resampling or external models, while alternatives propagate trajectory-level labels uniformly across every timestep, obscuring localized failure signals. In this paper, we propose \\textbf{Hide-and-Seek}, a framework that formulates VLA failure detection as a coarsely supervised learning problem. By combining inter-trajectory and intra-trajectory contrastive objectives, Hide-and-Seek localizes failure-indicative actions and induces temporally structured failure signals from trajectory-level supervision alone, without any step-level annotation. 
We evaluate Hide-and-Seek on LIBERO, VLABench, and a real-world robotic platform across three representative VLA policies: OpenVLA, $\u03c0_0$, and $\u03c0_{0.5}$. Our method achieves state-of-the-art multi-task failure detection performance with a practical accuracy--timeliness trade-off under conformal prediction, and generalizes well to both seen and unseen tasks.","published_date":"2026-05-29T04:40:12+00:00","viability_score":4,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for detecting execution failures in Vision-Language-Action models by localizing failure-indicative actions using trajectory-level supervision.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2605.30833v1","title":"Your Teacher Can't Help You Here: Combating Supervision Fidelity Decay in On-Policy Distillation","abstract":"On-policy distillation transfers reasoning capabilities by training a student model on its own generated trajectories using token-level feedback from a teacher. However, we identify a critical bottleneck, \\textbf{Supervision Fidelity Decay (SFD)}: as student-generated prefixes lengthen, the teacher's next-token distribution becomes less confident and less discriminative. Consequently, the teacher-dependent corrective signal in reverse-KL distillation weakens, causing student drift to compound across long reasoning chains. To mitigate SFD, we introduce \\textbf{Lookahead Group Reward (\\ours{})}. Building on the insight that next-step teacher confidence reflects the discriminative strength of future reverse-KL supervision, \\ours{} evaluates the student's top-K candidate tokens by the teacher confidence they induce at the subsequent step and assigns a group-normalized reward. To maintain computational efficiency, we further design an entropy-triggered tree-attention mechanism. 
Across six math and code benchmarks, \\ours{} improves mean@8 by \\textbf{2.57} points over OPD for a 7B student, with gains increasing for longer generations and reaching +\\textbf{4.92} points on AIME-26 at 39k tokens.","published_date":"2026-05-29T04:39:20+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A method to combat supervision fidelity decay in on-policy distillation for LLMs by assigning group-normalized rewards based on teacher confidence at the subsequent step.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30832v1","title":"SLAT: Segment-Level Adaptive Trimming for Efficient CoT Reasoning","abstract":"Recent advances in Large Reasoning Models have significantly improved chain-of-thought (CoT) capabilities via reinforcement learning (RL). However, generated reasoning chains frequently suffer from structural redundancy (i.e., \\emph{overthinking}), incurring high computational overhead without improving answer correctness. Existing mitigation strategies typically rely on token-uniform length penalties, which provide coarse, segment-agnostic pressure toward shorter outputs and can inadvertently suppress useful reasoning alongside redundancy. To address this, we demonstrate that inefficiency concentrates in high-probability segments with low marginal utility. We derive a theoretical characterization of segment suboptimality under the correctness-length trade-off objective and propose \\textsc{SLAT} (Segment-Level Adaptive Trimming), an RL framework that selectively suppresses redundant segments based on this criterion. Empirical results on standard benchmarks indicate that \\textsc{SLAT} establishes a superior accuracy-efficiency Pareto frontier, reducing reasoning length by $50\\%$ relative to uncompressed baselines while maintaining competitive accuracy. 
Overall, our results suggest that theoretically grounded, segment-aware trimming is a promising direction for efficient CoT reasoning in large language models.","published_date":"2026-05-29T04:37:49+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A segment-level adaptive trimming framework for efficient Chain-of-Thought reasoning in LLMs that selectively suppresses redundant segments to improve accuracy-efficiency trade-offs.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30826v1","title":"Beyond Agreement: Scoring Panel-Surfaced Biomedical Entity Candidates for Curator Triage","abstract":"Biomedical NER is deceptively simple for modern LLMs: plausible biomedical mentions are easy to surface, but corpus-convention correctness depends on annotation conventions, span boundaries, entity granularity, and type schemas. Multi-LLM agreement is a salience signal, not corpus-convention correctness. We introduce a candidate-level panel-output benchmark for panel-surfaced candidate verification, where the unit is an aligned candidate surfaced by an explicitly defined multi-model panel rather than a standalone extractor output. The benchmark aligns eight LLMs' predictions over five public biomedical NER datasets into a candidate master table. BioConCal is an in-domain supervised scorer that instantiates this layer with inference-time gold-free agreement, mention, surface-availability, and document features for a fixed candidate stream. In domain, BioConCal improves AUROC from 0.753 for raw agreement to 0.910. At a validation-selected 0.95 precision target it selects 1,340 candidates at empirical test precision 0.939, compared with 293 for raw agreement. This corresponds to candidate-level recall 0.592 and corpus-level recall 0.523 against a within-panel row-label ceiling of 0.883. 
The main benefit is not recovering entities missed by every panel member, but reshaping a noisy panel stream into a higher-yield review queue. Under entity-type shift, thresholds require target-domain validation, and exact character localization remains a separate deterministic post-processing step.","published_date":"2026-05-29T04:26:13+00:00","viability_score":7,"cluster_label":"Biomedical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A system that improves biomedical entity recognition by scoring panel-surfaced candidates, reshaping noisy LLM outputs into a higher-yield review queue for curators.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30825v1","title":"Unlearning in Diffusion Models: A Unified Framework with KL Divergence and Likelihood Constraints","abstract":"Unlearning in diffusion models aims to remove undesirable data or concepts while preserving the utility of pretrained models -- two fundamentally conflicting objectives. We propose a principled constrained optimization framework that formulates unlearning as minimizing the deviation from a pretrained model, subject to explicit separation constraints from the unlearning distributions. Specifically, we formulate three constrained optimization problems based on reverse and forward KL divergences, and likelihood constraints. The first two generalize existing approaches for concept and data unlearning, while the third offers a novel and natural formulation for unlearning. Despite the nonconvexity of the KL constraints, we establish strong duality for all three problems, enabling us to explicitly characterize their optimal solutions as unlearning targets and develop primal-dual algorithms for each formulation. 
Experimental results demonstrate that our KL-constrained approach achieves superior retention-unlearning tradeoffs compared to weight-based baselines for concept and data unlearning, and that our likelihood-based approach matches unlearning effectiveness while better preserving retained concepts compared to baselines.","published_date":"2026-05-29T04:25:45+00:00","viability_score":3,"cluster_label":"Diffusion Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified framework for unlearning in diffusion models that minimizes deviation from pretrained models while preserving utility, using KL divergence and likelihood constraints.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30824v1","title":"Planner-Centric Reinforcement Learning for Deep Research with Structure-Aware Reward","abstract":"Deep research tasks require LLMs to plan what to investigate, retrieve evidence, and synthesize long-form answers across multiple branches of inquiry. Existing training paradigms either rely on short-form verifiable QA as a proxy or optimize monolithic long trajectories, which makes planning and execution difficult to disentangle and yields weak credit assignment for the planning process. We propose DecomposeR, a planner-centric deep research framework that represents research plans as typed directed acyclic graphs (DAGs), allowing planning to be made explicit, structured, and rewardable. We train a Qwen3-8B model in two stages: planner reinforcement learning (RL) first learns graph structure and query decomposition to improve research planning, and answerer reinforcement learning (RL) then learns branch-level execution and final synthesis conditioned on the learned plan. By assigning rewards to explicit planner tokens and structured components rather than to a flat trajectory, DecomposeR enables finer-grained optimization of planning while reducing the ambiguity of end-to-end training. 
Experiments show that DecomposeR-8B improves over strong comparable open baselines by 5.1-8.0 points on popular long-form benchmarks due to improved planning and answering capabilities.","published_date":"2026-05-29T04:18:55+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DecomposeR is a planner-centric framework for deep research that uses structured DAGs and staged RL to improve LLM planning and long-form answer generation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30818v1","title":"GaMi: Geometry-Agnostic Material Identification via Cross-Modal Subtractive Disentanglement","abstract":"Non-contact material identification enables adaptive interaction for embodied intelligence yet faces challenges from geometry-induced variations (e.g., orientation, shape, distance) and single-modality ambiguities. In this paper, we present GaMi, a multimodal material identification system integrating mmWave and acoustic sensing to robustly operate under unconstrained geometric conditions. By leveraging the insight of shared geometric consistency between co-located bimodal sensors, GaMi employs an intra-sample cross-modal subtractive disentanglement framework. By semantically aligning modalities and subtracting the shared geometric context, it isolates intrinsic material features. Furthermore, GaMi incorporates inter-sample contrastive learning to correct the residual interference caused by cross-modal misalignment. Additionally, a pairing-based adaptation strategy between two modalities enables few-shot generalization across devices. 
Extensive evaluations on 20 materials show that GaMi achieves 95.2% accuracy, outperforming single-modality baselines across unseen geometric conditions.","published_date":"2026-05-29T04:09:12+00:00","viability_score":4,"cluster_label":"Embodied AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"GaMi is a multimodal material identification system using mmWave and acoustic sensing that robustly identifies materials under unconstrained geometric conditions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30808v1","title":"Differentially Private Preference Data Synthesis for Large Language Model Alignment","abstract":"Preference alignment is a crucial post-training step for large language models (LLMs) to ensure their outputs align with human values. However, post-training on real human preference data raises privacy concerns, as these datasets often contain sensitive user prompts and human judgments. To address this, we propose DPPrefSyn, a novel algorithm for generating differentially private (DP) synthetic preference data to enable privacy-preserving preference alignment. DPPrefSyn is a principled framework grounded in the Bradley-Terry preference model and the intrinsic geometric structure of pairwise human preference data. It first learns an underlying preference model from private data with formal differential privacy guarantees, and then leverages the learned model together with public prompts to synthesize high-quality preference data. It exploits the shared linear structure of per-cluster reward models to effectively capture heterogeneous human preferences in private datasets, and leverages DP Principal Component Analysis (DP-PCA) to improve learning accuracy. Extensive experimental results demonstrate that DPPrefSyn achieves competitive alignment performance under strong DP guarantees. 
These findings highlight the potential of synthetic preference data as a practical alternative for privacy-preserving preference alignment across a broad range of applications. To the best of our knowledge, this is the first work to generate DP synthetic preference data for LLM alignment. Our code is available at https://github.com/gfengyu/Differentially-Private-Preference-Data-Synthesis.","published_date":"2026-05-29T03:53:12+00:00","viability_score":7,"cluster_label":"LLM Alignment","has_code":true,"repo_url":"https://github.com/gfengyu/Differentially-Private-Preference-Data-Synthesis","commercial_flags":["has_code"],"one_liner":"Generate privacy-preserving synthetic preference data for LLM alignment using a novel algorithm grounded in the Bradley-Terry preference model.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30803v1","title":"PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges","abstract":"LLM judges are increasingly used to evaluate open-ended responses, but their scores depend strongly on the rubrics that condition them. A vague rubric asking for a response to be ``helpful and factual'' can reward polished answers that invent facts or violate user intent. We treat reusable rubrics as measurement specifications: changing the rubric changes the response quality measurement induced by a fixed judge. We introduce PReMISE, a framework that, given pairwise human-preference data, (i) discovers a policy-level rubric set, and (ii) audits any rubric set under LLM-judge use along four axes: structural adequacy, reliability, preference fit, and adversarial robustness. Across rubric sources no raw source is simultaneously reliable, preference-predictive, and adversarially robust; and high inter-rater agreement does not imply low exploitability. PReMISE is the only rubric source to score non-trivially on applicability, specificity, and effective dimensionality simultaneously. 
We contribute two audit-targeted repair operations: preference-rank selection raises judge accuracy on paired responses from $65.0\\%$ to $68.6\\%$, competitive with the strongest rubric-discovery baselines and leading on two of three judges in our cross-judge sweep; reliability-constrained refinement reduces the rate at which exploit responses receive high scores from $46.4\\%$ to $36.0\\%$ with little change in inter-judge agreement ($\u03b1{=}.531\\to.519$).","published_date":"2026-05-29T03:45:11+00:00","viability_score":4,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework to audit and repair LLM judge rubrics for improved reliability and reduced exploitability in open-ended response evaluation.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30802v1","title":"Design and Evaluation of Multi-Agent AI Oracle Systems for Prediction Market Resolution","abstract":"Prediction markets aggregate collective intelligence to forecast uncertain events, but their utility depends on reliable outcome resolution. Existing oracle systems trade off fast but brittle automation against accurate but costly human arbitration. Single-LLM oracles achieve meaningful accuracy but inherit all failure modes of their underlying model with no self-correction mechanism. We evaluate whether multi-agent LLM architectures can improve oracle resolution accuracy over single-model baselines. We compare independent aggregation and deliberative consensus against single-LLM baselines (GPT-5 Nano, DeepSeek V3, and Llama-3.3-70B) on 1,189 resolved prediction market questions from KalshiBench. All agents share a common evidence layer through Exa, with retrieval filtered by publication date to isolate reasoning from retrieval quality. Independent aggregation with confidence-weighted voting achieves the highest accuracy at 83.43 percent, outperforming the best individual model by 1.01 percentage points. 
Deliberative consensus degrades accuracy to approximately 76 percent, below every single-model baseline, attributed to error propagation during debate where confidently wrong models flip correct ones. Error correlations across models (0.529-0.689) explain why aggregation gains fall short of the theoretical Condorcet ceiling, placing a fundamental limit on ensemble approaches. Many questions resist correction by any multi-agent architecture, motivating escalation to human arbitration. We propose routing criteria for hybrid AI-human oracle systems: auto-resolving only unanimous, high-confidence questions yields 97.87 percent accuracy on 47 percent of the dataset, with inter-agent disagreement flagging the remainder for human review.","published_date":"2026-05-29T03:44:19+00:00","viability_score":6,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Multi-agent LLM architectures for prediction market resolution that outperform single-model baselines and enable hybrid AI-human systems.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.30794v1","title":"MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding","abstract":"Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question answering (VQA) tasks. However, they remain brittle on mechanical engineering drawings, where high annotation density and weak domain knowledge, compounded by unreliable spatial relation reasoning under strict projection rules and geometric constraints, make decisive cues easy to miss and frequently lead to wrong answers. To bridge this gap, we introduce the first comprehensive mechanical drawing understanding dataset, MechVQA, created through a semi-automated construction and quality-control pipeline. 
MechVQA contains 3.3k high-density pictures with 21K question-answer pairs, spanning 10 different fine-grained tasks across three capability levels: Recognition, Reasoning, and Judging, providing a testbed to evaluate and improve MLLM understanding on real-world mechanical drawings. On top of MechVQA, we then develop the MechVL model through a multi-stage training paradigm, building a strong domain-specialized baseline. Extensive experimental results demonstrate that MechVL outperforms the strongest closed-source baseline by 7.57 percentage points on the MechVQA total score, significantly enhancing mechanical drawing understanding ability and providing a reusable foundation for deploying MLLMs in mechanical design and inspection scenarios.","published_date":"2026-05-29T03:34:37+00:00","viability_score":8,"cluster_label":"Multimodal AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MechVL, a domain-specialized multimodal LLM and dataset for comprehensive mechanical drawing understanding, outperforming closed-source baselines.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30792v1","title":"OpenSTBench: Beyond Semantic Evaluation for Speech Translation","abstract":"Speech translation systems increasingly span speech-to-text translation (S2TT), speech-to-speech translation (S2ST), offline translation, and streaming generation, producing outputs that differ in modality, speech realization, and timing behavior. Existing evaluation practices assess important aspects such as translation quality, speech quality, and temporal quality, but these aspects are often evaluated under separate protocols, making it difficult to compare heterogeneous systems comprehensively. To address this gap, we present OpenSTBench, a unified multidimensional evaluation framework that organizes heterogeneous speech translation outputs into a shared evaluation format. 
OpenSTBench supports both S2TT and S2ST systems in offline and streaming settings, and jointly evaluates translation quality, speech quality, speaker preservation, emotion and paralinguistic fidelity, temporal consistency, and latency. Through experiments on representative speech translation systems, we show that systems with strong translation quality can still differ substantially in speech quality, as well as in temporal quality. OpenSTBench provides a reproducible protocol for analyzing these cross-dimensional differences and supporting application-oriented comparison of speech translation systems. The code and datasets are available at https://github.com/sjtuayj/OpenSTBench.","published_date":"2026-05-29T03:31:04+00:00","viability_score":7,"cluster_label":"Speech Translation Evaluation","has_code":true,"repo_url":"https://github.com/sjtuayj/OpenSTBench","commercial_flags":["has_code"],"one_liner":"A unified framework and dataset for comprehensive, multidimensional evaluation of speech translation systems, enabling application-oriented comparisons.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30790v1","title":"On the impact of retrieved content representations in RAG Pipelines","abstract":"Retrieval-Augmented Generation (RAG) supplements a language model's input with retrieved documents, yet most RAG pipelines inherit retrieval components designed for human readers. How retrieved content should be represented when the consumer is a large language model (LLM) rather than a human is less well understood. Recent work has proposed transformations of retrieved content and identified properties that affect generation, but each examines a single transformation or property in isolation, leaving open which features of a document's representation matter most. 
We address this with a controlled comparison: holding retrieval fixed, we vary only the representation of retrieved documents, comparing an original baseline against thirteen transformations spanning selection, summarisation, and reformulation, in query-dependent and query-independent variants. Across these fourteen representations we measure question-answering accuracy for four generators, and for each representation we also measure answer retention: whether a known answer-bearing document still supports its answer after transformation. We find that answer retention is the primary determinant of generator accuracy; notably, when retention is high, a representation's wording, structure, length, and query-dependence have limited effect. This suggests that accuracy gains attributed to specific mechanisms in prior work may be partly explained by how well those mechanisms preserve answer-bearing content, an attribution that cannot be settled without controlling for retention.","published_date":"2026-05-29T03:26:41+00:00","viability_score":4,"cluster_label":"RAG Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigating how retrieved content representations impact RAG pipeline accuracy by analyzing answer retention across various transformations.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30789v1","title":"Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO","abstract":"We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their superior pass@k relative to larger counterparts as sample counts increase. 
Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We thus propose S2L-PO (Small-to-Large Policy Optimization), a framework that leverages fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This shift elegantly avoids mid-training performance drops caused by the small model's capacity limits, achieving faster convergence and unlocking a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 using a 1.7B explorer to guide the 8B model) while reducing rollout compute.","published_date":"2026-05-29T03:25:56+00:00","viability_score":7,"cluster_label":"LLM Training Optimization","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Leveraging smaller models as natural explorers to enhance policy-level diversity in LLM training, improving accuracy and reducing compute.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30788v1","title":"XLGoBench: Detecting cross-lingual skill gaps with algorithmic tasks","abstract":"We introduce a set of synthetic algorithmic tasks to detect cross-lingual gaps in the abilities of large language models. Our benchmark is commensurate across languages, since it requires models to perform the same underlying task in different languages; scalable, since each task can be generated at varying levels of complexity allowing it to be adapted to models with different capabilities; quantifiable, since every task admits an objective notion of correctness; and transparent, since tasks are generated from simple templates that can be readily audited for translation errors. 
Because our benchmark focuses on algorithmic tasks, differential performance is a sufficient -- but not necessary -- indicator of cross-lingual gaps. Nevertheless, we show through extensive experiments that our benchmark exposes persistent cross-lingual gaps in multiple state-of-the-art models.","published_date":"2026-05-29T03:25:32+00:00","viability_score":7,"cluster_label":"Cross-Lingual LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A scalable and transparent benchmark of synthetic algorithmic tasks to detect and quantify cross-lingual skill gaps in large language models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30785v1","title":"Learning Agent-Compatible Context Management for Long-Horizon Tasks","abstract":"LLM agents increasingly face long-horizon tasks such as web search and deep research in real-world applications, where accumulated context can cause long-context degradation and reasoning failures. Prior work mitigates this through context management with agent-side context control or fixed strategies such as summarization, which require training the agent itself for adaptation - making it impractical for closed-source agents and ignoring that different agents may require different strategies. We introduce Adaptive Context Management (AdaCoM), which trains an external LLM to manage the context of a frozen agent through flexible modification actions and end-to-end reinforcement learning. Across diverse agents on web search and deep research benchmarks, AdaCoM substantially improves performance by preserving task constraints and progress while pruning stale content. The learned strategies reveal a Fidelity-Reliability Trade-off: agents with higher vanilla ReAct performance benefit from higher-fidelity context preservation, whereas lower-performing agents require more aggressive compression to stay within a reliable reasoning regime. 
Transfer experiments show that AdaCoM generalizes most effectively across agents with similar capability (measured by vanilla ReAct performance), suggesting a practical path toward reusable context managers for agent systems.","published_date":"2026-05-29T03:21:08+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An external LLM manages context for frozen LLM agents, improving performance on long-horizon tasks by adapting context pruning strategies.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30748v1","title":"Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS","abstract":"We present Chatterbox-Flash, a zero-shot text-to-speech model obtained by fine-tuning a pretrained autoregressive TTS decoder into a block-diffusion decoder, enabling parallel token generation within each block while retaining block-by-block streaming. We find that naively transferring mainstream block-diffusion decoding to discrete speech tokens degrades quality, as a long-tail token distribution biases parallel position selection toward a few high-frequency tokens. To mitigate this without architectural modification, we introduce two inference-time techniques: prior-calibrated scoring, which subtracts the block-level marginal token distribution, and an early-decoding schedule, which adaptively terminates iteration based on calibrated confidence. On standard zero-shot TTS benchmarks, Chatterbox-Flash attains high-fidelity synthesis comparable to strong autoregressive and non-autoregressive baselines, while supporting streaming inference with time-to-first-packet on par with streaming AR systems and substantially lower real-time factor. 
Code and audio samples are available at https://github.com/resemble-ai/chatterbox-flash.","published_date":"2026-05-29T02:25:02+00:00","viability_score":8,"cluster_label":"Text-to-Speech","has_code":true,"repo_url":"https://github.com/resemble-ai/chatterbox-flash","commercial_flags":["has_code"],"one_liner":"A streaming zero-shot text-to-speech model uses prior-calibrated block diffusion for high-fidelity synthesis with low latency.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30747v1","title":"Generating Graph-like Rules for Knowledge Graph Reasoning via Diffusion Models","abstract":"Logical rules constitute a cornerstone of knowledge graph (KG) reasoning, valued for their interpretability and ability to model relational patterns. However, existing rule mining methods predominantly focus on simple chain-like rules and therefore neglect the richer relational information encoded in graph-like structures, such as cycles and branches. This limitation is further exacerbated by computational bottlenecks caused by the combinatorial explosion of the search space, which is especially challenging for graph-like rules. Meanwhile, generative approaches such as diffusion models, despite their success in other domains, cannot be directly applied to rule mining because their training objectives are not aligned with the goal of learning high-quality rules, and non-differentiable KG rule quality metrics cannot directly guide model optimization. To address these limitations, we propose GRiD, a framework that reformulates graph-like rule discovery as a discrete generative process conditioned on the target relation. GRiD employs a two-phase training strategy. First, supervised pre-training enables GRiD to capture structural priors from subgraphs sampled from the KG meta-graph. 
Subsequently, reinforcement learning is applied to fine-tune GRiD through policy gradient optimization guided directly by non-differentiable rule-quality metrics. Experiments on six benchmark datasets show that GRiD achieves competitive performance on KG completion tasks. Ablation studies confirm the efficiency and robustness of GRiD and further show that graph-like rules complement chain-like rules in KG completion. Our code and datasets are available at https://github.com/Haoxiang-Cheng/GRiD","published_date":"2026-05-29T02:23:13+00:00","viability_score":7,"cluster_label":"Knowledge Graph Reasoning","has_code":true,"repo_url":"https://github.com/Haoxiang-Cheng/GRiD","commercial_flags":["has_code"],"one_liner":"A diffusion model framework generates graph-like rules for knowledge graph reasoning, outperforming chain-like rules.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30740v1","title":"GSAM: A Generalizable and Safe Robotic Framework for Articulated Object Manipulation","abstract":"Articulated object manipulation is a unique challenge for service robots. Existing methods employ end-to-end policy learning, vision-motion planning, and large-language/visual-language model (LLM/VLM), but often overlook the diversity of articulated objects and the complexity of interactions between end-effector and handle, leading to limited generalization and destructive collisions. To address this, we propose GSAM, a generalizable and safe robotic framework for articulated object manipulation. Specifically, a vision-based perceiver generates the kinematic parameters. Considering that pre-trained markers in perceiver yield raw estimations that may deviate from commonsense, we present a fine-tuned VLM-based refiner, using chain-of-thought (CoT) commonsense reasoning to refine perception. 
To prevent destructive collisions, we design an interaction constraint function generator, integrating articulated object, interaction pose, and obstacle avoidance knowledge into a base. An LLM then functionalizes these constraints and applies them to trajectory and posture planning. A kinematic-aware manipulation planner verifies reachability for trajectory and posture. Experiments on 50 hinge tasks across 5 object categories and 50 randomly initialized end-effector-handle configurations show that GSAM reduces standard deviation by 3.1% and improves manipulation success rate by 36.0% compared to the best baseline, respectively, demonstrating the superior object generalization and interaction safety of GSAM in practical scenarios.","published_date":"2026-05-29T02:09:17+00:00","viability_score":7,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A robotic framework uses refined VLM perception and LLM-driven constraints for safe and generalizable articulated object manipulation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30738v1","title":"MAVEN: Improving Generalization in Agentic Tool Calling","abstract":"Generalization across agentic tool-calling environments remains a central challenge for reliable agentic reasoning systems. Although large language models achieve strong results on individual benchmarks, their ability to compose reasoning strategies, preserve intermediate states, and coordinate tools across domains remains underexplored. We present MAVEN (Modular Agentic Verification and Execution Network), a lightweight symbolic reasoning scaffold for structured decomposition, adaptive tool orchestration, and intermediate verification. 
We evaluate MAVEN across established tool-calling benchmarks, including BFCL v3, TauBench, Tau2Bench, AceBench, and introduce MAVEN-Bench, a stress-test benchmark for multi-step mathematical and physical reasoning with explicit verification and adversarial task composition. MAVEN-Bench exposes a substantial gap between partial reasoning quality and end-to-end task success; in direct MAVEN-Bench runs, MAVEN improves its GPT-OSS-120b base model from 48% to 71% accuracy without additional training. It also remains competitive with frontier proprietary baselines while using an open-weight backbone with an estimated cost ratio of roughly 1/10, suggesting that lightweight verification-centered scaffolds can strengthen compositional reasoning and motivate more process-aware evaluation of agents in the wild.","published_date":"2026-05-29T02:02:31+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MAVEN is a lightweight symbolic reasoning scaffold that significantly improves agentic tool-calling generalization and verification accuracy without additional training.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30736v1","title":"OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning","abstract":"The rapid development of large language models, each with distinct capabilities and inference costs, raises a practical deployment question: given an incoming request, which model should handle it? We present OrcaRouter, a production-oriented LLM router that combines a LinUCB-based contextual bandit over lexical and sentence-embedding features with a hybrid offline-online learning protocol. Offline, OrcaRouter obtains full-information feedback by evaluating each candidate model on a curated set of routing prompts, yielding a reward matrix used to fit one ridge regressor per arm. 
At deployment time, it initializes from these parameters and can optionally continue learning from bandit feedback, updating only the selected model's arm after observing its reward. At the time of our RouterArena submission (May 20, 2026), OrcaRouter-Adaptive ranked second on the public RouterArena leaderboard with an arena score of 72.08, achieving 75.54% accuracy at a cost of USD 1.00 per 1,000 queries.","published_date":"2026-05-29T01:59:07+00:00","viability_score":5,"cluster_label":"LLM Routing","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"OrcaRouter is a production-ready LLM router that uses hybrid offline-online learning to select the best model for incoming requests, achieving high accuracy at low cost.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2605.30720v1","title":"Kalimati Vegetable Price Index Forecasting with a Momentum Corrected Online Stacking Ensemble","abstract":"Forecasting agricultural commodity prices in emerging economies is difficult due to high volatility, frequent supply disruptions, and strong cultural influences on demand. This study introduces the Kalimati Vegetable Price Index (KVPI), a new inverse-volatility weighted composite index that aggregates 135 daily wholesale commodities from Kathmandu over ten years (2013-2023). By creating a stable macro-level signal, the KVPI reduces the noise inherent in modelling individual crops. A rich set of 64 causally valid features was developed, including festival lead-lag effects, rolling statistics, and calendar variables. Fourteen forecasting models spanning statistical, tree-based, deep learning, hybrid, and transformer architectures were rigorously evaluated across short (7-day), medium (14- and 30-day), and long-term (90-day) horizons. Tree-based ensembles proved notably robust, while classical statistical models and complex transformers struggled with the noisy dataset. 
The proposed Momentum-Corrected Online Stacking Ensemble achieved the strongest performance, yielding a Root Mean Square Error (RMSE) of 1.771, an exceptionally low Mean Absolute Percentage Error (MAPE) of 0.68%, and explaining 84.5% of the variance (R-squared = 0.845) at the 90-day horizon. This open-source pipeline provides policymakers and supply chain actors in Nepal and similar markets with a practical, reliable tool for anticipating price movements and strengthening food security.","published_date":"2026-05-29T01:25:32+00:00","viability_score":7,"cluster_label":"Forecasting","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A Momentum-Corrected Online Stacking Ensemble provides a reliable and practical tool for forecasting agricultural commodity prices in emerging economies, improving food security.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30719v1","title":"When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?","abstract":"We study when large language models (LLMs) can serve as effective black-box policy optimizers for reinforcement learning (RL) tasks, i.e., when can we replace classical RL algorithms with an LLM? We explore this question by introducing Prompted Policy Optimization (PromptPO), an iterative method that prompts an LLM with Python descriptions of the state space, action space, and reward function, then has it generate and refine executable policies based on rollout feedback. Across hard exploration environments, Meta-World robotics tasks, and several real-world control problems, PromptPO often matches or exceeds the performance of standard RL baselines while using substantially fewer environment interactions. To maximize expected return, and without further explicit prompting, the policies PromptPO outputs range from tuned proportional controllers or rule-based plans to policies that run planning algorithms like value iteration. 
Our results demonstrate that LLM-based policy optimization is sufficient when the LLM can leverage prior knowledge about the environment or optimization strategy. PromptPO underperforms standard RL baselines in MuJoCo domains. This demonstrates possible limitations of LLM-based policy optimization in settings that require fine-grained continuous control.","published_date":"2026-05-29T01:24:24+00:00","viability_score":7,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PromptPO uses LLMs to optimize reinforcement learning policies with fewer environment interactions, matching or exceeding baseline performance in many tasks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30716v1","title":"Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation","abstract":"Generating clinically useful pathology reports for pathology cases from whole-slide images (WSIs) is challenging due to gigapixel resolution, long visual-token sequences, and the complexity of case-level reasoning, where a single case may contain multiple WSIs with heterogeneous tissues and ambiguous findings. We present a simple token-efficient vision--language model for case-level synoptic report generation that remains practical under constrained GPU memory. Our architecture follows a minimal three-component design: a frozen pathology patch encoder, a lightweight two-layer MLP vision-language aligner, and a large language model decoder, with an explicit WSI marker token to separate slides within a case. Training proceeds in two supervised stages: (1) aligner-only WSI captioning using heterogeneous WSI-text pairs, and (2) case-level supervised fine-tuning on case-report pairs for structured report generation. 
To reduce sequence length, we represent each slide using $512 \\times 512$ patches at $5\\times$ magnification, which reduces the average sequence length by up to $64\\times$ compared to the commonly used $20\\times$ patches. Combined with efficient training techniques, we enable practical training with only half an NVIDIA H100 GPU. Across both training stages, our approach achieves high ROUGE-L/METEOR/BLEU-4 scores while being substantially more efficient in memory and runtime. In AI-based evaluations, our model is consistently preferred over strong baselines. Extensive ablations characterize performance-efficiency trade-offs and identify simple choices that improve robustness in multi-WSI settings. Overall, this work provides a strong, reproducible baseline for efficient pathology report generation, lowering the barrier to multi-WSI VLM research under limited compute.","published_date":"2026-05-29T01:15:13+00:00","viability_score":5,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A token-efficient vision-language model for generating pathology reports from whole-slide images, designed for practical training under constrained GPU memory.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.30711v1","title":"SAGE: A Novelty Gate for Efficient Memory Evolution in Agentic LLMs","abstract":"Agentic LLMs must continuously decide whether newly extracted facts should be added, merged with existing memories, or ignored, yet prior work has focused more on retrieval and storage than on principled write-side control. We frame memory evolution as a novelty-detection problem and propose SAGE, a Spherical Adaptive Gate for memory Evolution that scores candidate facts with a von Mises-Fisher-based density estimator over memory embeddings and routes them with an adaptive threshold that tracks memory-store geometry. 
SAGE resolves clearly novel facts as ADD, clearly redundant facts as NOOP, and sends only uncertain cases to an LLM merge step, reducing expensive write-time reasoning. On LoCoMo, SAGE achieves the best average token-F1 against Mem0 on all seven open-weight backbone comparisons, while on GPT-4o-mini it reduces add-phase API cost by 3.4$\\times$ and add-phase latency by 2.5$\\times$ with only a small average judge-score gap. As a drop-in binary gate for A-Mem, SAGE skips roughly 16-18% of LLM calls across five models with minimal quality change on open-weight backbones. These results suggest that novelty-aware write control is a practical lever for improving both memory quality and system efficiency in long-term agentic memory.","published_date":"2026-05-29T01:03:39+00:00","viability_score":6,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"SAGE is a novelty gate for agentic LLMs that efficiently filters facts for memory evolution, reducing costly LLM calls and improving system efficiency.","time_to_mvp":"1-2 weeks","tags":["quick_build"]},{"arxiv_id":"2605.30698v1","title":"Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence","abstract":"Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. 
In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; \\textit{aligned visual evidence} -- shared support from the image regions agents rely on -- is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (\\textbf{E}vidence-\\textbf{A}ligned \\textbf{G}rounded mu\\textbf{L}ti-agent r\\textbf{E}asoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making. Experiments on six VQA benchmarks show that EAGLE achieves best average performance across domains while remaining lightweight, interpretable, and practical for deployment.","published_date":"2026-05-29T00:45:25+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"EAGLE is a training-free framework that aligns multi-agent VQA consensus with visual evidence, improving trustworthiness and performance.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30689v1","title":"ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization","abstract":"Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing approaches primarily focus on modeling long-range contextual information, often neglecting the critical relative-offset-based local correlations between video frames. Furthermore, their performance is hindered by limited feature representation capabilities due to the shallow nature of their network architectures. In this paper, we address these limitations by introducing a novel local-global multi-scale feature representation module. 
We propose a novel multi-scale encoder architecture, termed ConTrans, that integrates convolutional (Conv) inductive biases with transformer Self-attention to jointly capture fine-grained local dependencies and long-range global context, leading to more comprehensive feature representations than existing methods. Experimental evaluations on the ActivityNet-1.3 and THUMOS14 datasets demonstrate that ConTrans significantly outperforms existing methods, establishing a new benchmark for ZS-TAL.","published_date":"2026-05-29T00:34:17+00:00","viability_score":7,"cluster_label":"Computer Vision","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ConTrans is a novel multi-scale encoder that integrates convolutional and transformer layers for improved zero-shot temporal action localization in videos.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30686v1","title":"Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents: Injection Depth, Payload Framing, and Turn-Budget Sensitivity","abstract":"ReAct agents that interleave chain-of-thought reasoning with tool calls are increasingly deployed for real tasks such as scheduling, file retrieval, and data access. Their tool observation loop creates a direct attack surface: an adversary who controls any tool's return value can embed instructions that redirect the agent away from the user's goal, a threat known as indirect prompt injection. Existing benchmarks evaluate attack success rate (ASR) at a fixed injection position under fixed conditions, leaving three risk dimensions unexplored: where in the tool sequence the payload appears (injection depth), what rhetorical register it uses (framing), and how many turns the agent is permitted (turn cap). We conduct four controlled studies on 20 scenarios spanning five attack categories, totalling 460 trials against GPT-4o-mini and Claude Haiku at a combined API cost under 0.36 USD. 
Study 1 shows that ASR against GPT-4o-mini decays from 60% at depth 1 to 0% at depths 4 and 5 (Cramer's V = 0.58, p < 0.001; restricted to within-sequence depths 1-3: V = 0.47, p = 0.0013), driven by model resistance at depth 1 and task completion before payload encounter at deeper positions. Study 2 replicates the depth experiment on Claude Haiku, which achieves 0% ASR at every depth through a combination of conservative tool invocation and genuine instruction resistance. Study 3 shows that framing modulates ASR between 25% (neutral) and 75% (persona) at depth 1, a 50-percentage-point range that does not reach statistical significance at N = 20 per condition. Study 4 confirms that ASR is stable across turn caps of 3, 5, and 7, indicating the turn budget is not a risk factor in this setting. Our results establish injection depth as the dominant variable and show that sanitising only the first tool observation captures 67% of measured injection successes.","published_date":"2026-05-29T00:28:42+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research identifies injection depth as the dominant factor in prompt injection attacks against ReAct agents, suggesting targeted sanitization of early tool observations can significantly mitigate risks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30685v1","title":"How Early Adopters Used Generative AI Worldwide: Variation by Country Income and Language","abstract":"AI is being used by people globally, but not everyone is using it in the same ways. Using a large-scale dataset of anonymized, de-identified, and privacy-scrubbed interactions with a widely available and free AI chatbot, we empirically characterize differences in early adopters' usage across countries. 
Schooling is the most common domain of use in most countries, particularly low-income countries, with a strong inverse association evident between schooling and country-level GDP. Leisure-related use, by contrast, is positively associated with country-level income. Language, we find, also shapes use: English-language interactions are overrepresented in places where the predominant languages were not well-served by existing models during the period of the study. Improving performance across languages may be a key factor, our work suggests, in whether this technology expands digital divides or enables leapfrogging.","published_date":"2026-05-29T00:28:36+00:00","viability_score":5,"cluster_label":"LLM Usage Analysis","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This study analyzes global early adoption of generative AI, revealing how country income and language influence usage patterns, with implications for digital divides.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30680v1","title":"Healthcare Mechanisms from Policy-as-Code Search under Strategic Provider Response","abstract":"Healthcare mechanisms are inseparable from the strategic provider response they induce: existing healthcare AI benchmarks hold this response fixed and so cannot evaluate mechanisms by the equilibrium they produce. We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi-Sim, a multi-agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). 
An incentive sweep recovers classical health-economics findings as adjacent regimes -- up-coding and low-complexity-patient selection under profit pressure, and Goodhart-style drift where measured performance becomes anti-correlated with true outcomes -- and a single audit lever exposes pressure migration: closing the coding channel more than doubles low-complexity selection. LLM-guided evolutionary code search over the same rule-program space then synthesizes an inspectable mixed-objective program that eliminates up-coding, halves rejection, and retains most of the profit-oriented baseline's funds.","published_date":"2026-05-29T00:21:54+00:00","viability_score":7,"cluster_label":"Healthcare AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research synthesizes healthcare mechanisms using program synthesis for LLMs, creating an inspectable program that optimizes provider incentives to reduce up-coding and patient rejection while maintaining funding.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30677v1","title":"Investigating Detection and Obfuscation of Prompt Injection Attacks Against Software Reverse Engineering AI Agents","abstract":"Agentic software reverse engineering systems are vulnerable to prompt injection attacks placed into the source code of executable binary files. This research demonstrates defensive tactics for detecting the presence of prompt injection strings in the decompiler output of adversarial example programs. Methods for obfuscating these attacks and subsequent methods for defending against these obfuscations are also explored. 
This research advances the understanding of risk and security of agentic software analysis systems necessary for their deployment into production-level cyber workflows.","published_date":"2026-05-29T00:13:35+00:00","viability_score":7,"cluster_label":"AI Security","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research develops and explores methods for detecting and obfuscating prompt injection attacks against AI agents used for software reverse engineering, enhancing security for cyber workflows.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30675v1","title":"Human-Alignment, Calibration, and Activation Patterns in Large Language Model Uncertainty","abstract":"Uncertainty Quantification is a large and growing subfield of large language model behavioral analysis. Primarily to recognize and combat hallucination, the field has largely focused on measuring and improving calibration, the accuracy of uncertainty judgments to task efficacy. In this work, we investigate the relatively underexplored question of how similar large language model uncertainty is to human uncertainty. We investigate the presence and strength of human-similar uncertainty signals, deemed uncertainty alignment, in large language model overt behavior and internal activation patterns. We identify whether the models show evidence of simultaneous alignment and calibration on a variety of datasets covering both multiple choice and open ended factual recall. 
And we characterize the effect of instruct fine-tuning on each of these facets.","published_date":"2026-05-29T00:08:59+00:00","viability_score":3,"cluster_label":"LLM Behavior Analysis","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Investigating human-like uncertainty signals in large language models to combat hallucination and improve calibration.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30668v1","title":"CobSeg: Coherence Boundary Modeling for Dialogue Topic Segmentation","abstract":"Dialogue topic segmentation is critical in many human-AI collaborative applications which requires identifying heterogeneous boundary cues, including lexical transitions near utterance edges and semantic discontinuities across utterances. Existing utterance models often dilute these local lexical signals. We propose CobSeg, a novel multi-branch architecture that separates coherence-level semantic continuity from lexical boundary transitions and recovers both through directional boundary prediction. CobSeg further uses boundary informativeness weighting to emphasize high-utility utterance positions, and incorporates a corpus-derived topic coherence cue with learned combination weights. While CobSeg is evaluated as a compact trainable segmenter under supervised gold-boundary training and a pseudo-label setting with automatically induced boundaries, it performs enhanced boundary prediction without LLM calls during inference. 
Across five benchmarks, it improves $P_k$ and $W_d$ particularly when local lexical cues are prominent: under gold supervision, it reduces $P_k$ by 0.7 points and $W_d$ by 0.6 points on VHF, and reaches $P_k$ of 1.0 on DialSeg711; with induced boundaries, it reduces $P_k$ by 14.8 points on VHF, by 1.5 points on DialSeg711, and by 1.1 points on TIAGE, outperforming prior non-LLM approaches.","published_date":"2026-05-29T00:02:51+00:00","viability_score":7,"cluster_label":"Dialogue Systems","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel multi-branch architecture for dialogue topic segmentation that improves boundary prediction by separating semantic continuity from lexical cues.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30667v1","title":"Automatically Attacking Software Reverse Engineering AI Agents","abstract":"Software tools for reverse engineering executable binary files, such as Ghidra, enable malware analysts to safely conduct robust static analysis without having access to original source code. Coupled with the analytic power of large language models (LLM), agentic systems enabled with tools, such as GhidraMCP, can allow analysts to automate a previously human driven process. Although this automation can increase the productivity of a single malware analyst, it also introduces a new area of vulnerability for malware obfuscation. This paper presents an adversarial technique using genetic algorithm-based prompt generation, a modification of an adversarial attack known as AutoDAN, to demonstrate the ability to deceive LLM-powered disassembly and decompilation systems into misinterpreting binary executables, effectively corrupting their analytical output. 
This proof-of-concept methodology exploits inherent vulnerabilities in how LLMs process and interpret decompiled machine code via prompt injection by using extraneous string variable assignments to pass surreptitious instructions to the LLM while not impacting the functionality of the executable file. We demonstrate this capability through several concise examples. This approach could enable attackers to bypass automated detection systems that rely on LLM-driven analysis pipelines. By studying and understanding this attack, insights can be gained regarding the security implication of integrating LLMs into cybersecurity toolchains and building more robust agentic code analysis systems.","published_date":"2026-05-28T23:58:25+00:00","viability_score":8,"cluster_label":"Cybersecurity AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An adversarial technique using genetic algorithms to deceive LLM-powered reverse engineering tools, enabling attackers to bypass automated detection systems.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30664v1","title":"Structure-Induced Information for Rerooting Levin Tree Search","abstract":"Subgoal-based policy tree search, which uses a policy to guide search, is effective for complex single-agent deterministic problems but often relies on explicit subgoal generation that can incur substantial overhead and hinders scalability. In this paper, we overcome these limitations by using a learned ``rerooter'' through the recently-introduced $\\sqrt{\\text{LTS}}$ algorithm. A rerooter implicitly decomposes the problem into soft subtasks. 
While previous work focused on the formal guarantees for given or handcrafted rerooters, in this work we propose three rerooter designs: (i) a clustering-based rerooter that exploits global state-space structure, (ii) a heuristic-based rerooter that leverages learned cost-to-go estimates, and (iii) a hybrid that combines both signals. Our framework avoids having to explicitly reconstruct and reason over generated subgoals, thereby enabling scalable allocation of search effort with significantly lower computational overhead. Empirically, our rerooting-based methods scale to complex environments where subgoal-based policy tree search fails, and achieve state-of-the-art online training efficiency on the domains tested.","published_date":"2026-05-28T23:51:21+00:00","viability_score":3,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Developing learned rerooters for policy tree search to implicitly decompose problems into soft subtasks, enabling scalable search with lower computational overhead.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30654v1","title":"EUDAIMONIA: Evaluating Undesirable Dynamics in AI","abstract":"Large language models (LLMs) are increasingly used as conversational partners for companionship, emotional disclosure, and interpersonal advice, but the social dynamics of these interactions can create harms that are not captured by capability-oriented or traditional safety evaluations. We introduce the Social AI Design Code, a framework for evaluating whether LLMs align with user welfare in social interactions, including whether they encourage harmful intimacy, dependence, or prolonged engagement. 
To evaluate these risks in natural and diverse user-LLM interactions, we operationalize the code with EUDAIMONIA, a benchmark of 969 user inputs and 3,147 design-requirement violation checks built from WildChat through weak-to-strong filtration, multi-model relabeling, and controlled rewriting. Evaluating 22 recent LLMs, we find that even the strongest models, Claude-Opus-4.7 and GPT-5.5, violate 30.7% and 27.2% of checks, respectively. Extended thinking does not reduce violation rates, suggesting that these failures are persistent social-alignment problems rather than deficits solvable through test-time reasoning alone.","published_date":"2026-05-28T23:17:26+00:00","viability_score":4,"cluster_label":"LLM Safety & Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark and framework to evaluate undesirable social dynamics in LLM interactions, revealing persistent alignment problems.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30651v1","title":"LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation","abstract":"We study trajectory selection for reasoning distillation, where teacher-generated reasoning trajectories are selectively used as supervision for a student model. Existing methods rely on heuristics such as trajectory quality or model confidence, but they often overlook whether a trajectory is learnable by the student. In this paper, we present LARK, a learnability-grounded method for reasoning trajectory selection. LARK selects trajectories that the student can learn efficiently while preserving the generalization of the full training distribution. At the core of LARK is a learnability factor $\u03c1$, which characterizes the rate at which the student's training loss decreases. 
To estimate this rate efficiently and maintain generalization, we introduce a learnability proxy and a $\u03c7^2$-regularized selection policy that balances learnability and distributional coverage, both with strong theoretical guarantees on their estimation error. Empirically, LARK consistently outperforms data selection baselines across multiple base models and reasoning tasks. Diagnostic analyses show that the LARK score predicts downstream training utility and that LARK-selected trajectories induce faster supervised fine-tuning loss reduction. Our code is available at https://github.com/Tianrun-Yu/LARK.","published_date":"2026-05-28T23:11:13+00:00","viability_score":7,"cluster_label":"LLM Reasoning Distillation","has_code":true,"repo_url":"https://github.com/Tianrun-Yu/LARK","commercial_flags":["has_code"],"one_liner":"A novel method for reasoning distillation that selects teacher trajectories based on student learnability, improving efficiency and generalization.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30646v1","title":"Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs","abstract":"Large Language Models (LLMs) are increasingly used in clinical applications. However, their behavior remains highly sensitive to subtle linguistic variations, such as rephrasing or syntactic variation. This sensitivity poses risks in safety-critical healthcare settings, where semantically equivalent inputs should produce consistent predictions. However, a key challenge is to ensure that prompt variations truly preserve clinical meaning, as embedding-based similarity metrics often fail to capture distinctions involving negation, temporality, or severity. 
To address this limitation, we propose a semantic verification framework based on Natural Language Inference (NLI) to filter meaning-preserving prompt variations, which are further refined using an LLM-as-a-judge and audited by a clinical expert. In addition, we introduce three metrics to quantify model sensitivity: Meaning-Preserving Variation Sensitivity (MVS), confidence variation (\u0394C), and Worst-Case Instability (WCI). We evaluate 16 open-source general-purpose (GP) and medical LLMs within the same model families and parameter scales, using reformulated prompts derived from the DiagnosisQA and MedQA datasets. Our results demonstrate that robustness differences between domain-specific (DS) models are mixed and highly model-dependent, i.e., domain specialization does not consistently improve or reduce robustness to meaning-preserving prompt reformulations. Several DS models rank among the most robust (when compared with GP counterparts), and strong GP baselines remain competitive as well.","published_date":"2026-05-28T23:03:43+00:00","viability_score":4,"cluster_label":"Clinical LLM Robustness","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A semantic verification framework using NLI to evaluate and improve the stability of clinical LLMs against meaning-preserving prompt variations.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30641v1","title":"COFT: Counterfactual-Conformal Decoding for Fair Chain-of-Thought Reasoning in Large Language Models","abstract":"Large language models (LLMs) can reveal and amplify societal biases during chain-of-thought (CoT) generation. We present COFT (Chain of Fair Thought), a training-free decoding method that applies token-level fairness control at decode time, with distribution-free marginal validity guarantees (under exchangeability) for any frozen causal language model. COFT operates in three stages. 
First, it creates a masked counterfactual prompt by replacing sensitive spans with neutral tokens. Second, it compares the factual and masked logit distributions through lightweight logit fusion to attenuate attribute-driven biases. Third, it uses dual-branch split-conformal calibration to certify per-step candidate token sets at a user-chosen risk level. We evaluate COFT across six models and multiple bias benchmarks. Our method reduces standard bias metrics by 30-55% (median 38%) while preserving task utility and language quality. Reasoning accuracies remain unchanged within run-to-run noise margins. The computational overhead is modest, equivalent to one additional cached forward pass (<=11%). COFT offers a clear, auditable path to safer CoT generation with significant bias reduction, negligible utility loss, and no requirement for retraining, auxiliary classifiers, or weight access.","published_date":"2026-05-28T22:52:15+00:00","viability_score":7,"cluster_label":"Fair LLM Decoding","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"A training-free decoding method that applies token-level fairness control to reduce bias in LLM chain-of-thought generation without sacrificing utility.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30639v1","title":"PInVerify: An Offline Embodied Benchmark for Active Instance Verification","abstract":"Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., \"white floral\" vs. \"white striped\") often require close-range, multi-view inspection. We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine-grained natural-language description. 
We formalize AIV as a finite-horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV: 3,000 evaluation episodes across 18 object categories, delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors. As reference baselines we build a training-free pipeline and a LoRA-fine-tuned end-to-end agent around open-source multimodal large language models (MLLMs) at on-device scale ($\\leq$8B parameters), with attribute decomposition, a visibility-weighted multi-view tracker, and three next-best-view (NBV) strategies. In our evaluation across Qwen3-VL (4B/8B), SenseNova-SI-1.2-InternVL3-8B, CLIP, and SigLIP2, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 pp; GT-box ablations show a +3.1 pp detection gap; and we do not observe reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine-grained semantic verification in embodied AI. Code: https://github.com/Avalon-S/PInVerify.","published_date":"2026-05-28T22:42:38+00:00","viability_score":7,"cluster_label":"Embodied AI","has_code":true,"repo_url":"https://github.com/Avalon-S/PInVerify","commercial_flags":["has_code"],"one_liner":"An offline embodied benchmark and baseline agents for active instance verification, enabling fine-grained object recognition in robotics.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30638v1","title":"Score Broadcast and Decorrelation: A General Framework for Broadcast-Based Credit Assignment","abstract":"We introduce Score Broadcast and Decorrelation (SBD), a principled framework for broadcast-based credit assignment for general families of differentiable losses. 
Error broadcast is a biologically plausible alternative to backpropagation that sends output information to hidden layers without weight transport. The Error Broadcast and Decorrelation (EBD) framework, recently introduced for the mean-squared-error (MSE) setting, grounded this mechanism in the stochastic orthogonality of optimal estimators, under which the optimal residual is orthogonal to functions of the input. We generalize that foundation by introducing an orthogonality principle between the output score (the gradient of loss with respect to the final-layer output) and hidden-layer activations, which holds whenever the optimal score has conditional mean zero. This single principle unifies broadcast-based credit assignment across the standard differentiable-loss families, including cross-entropy, Bregman divergences, proper scoring rules, and exponential-family negative log-likelihoods. The framework supplies a theoretical grounding for the three-factor learning rule under general losses, with the neuromodulatory factor derived as the broadcast loss score. We derive the cross-entropy case explicitly, characterize the admissible loss class, and introduce a score vector expansion technique that enriches the broadcast signal while preserving the orthogonality framework. Experiments on CIFAR-10 and Tiny ImageNet show that SBD substantially improves over existing broadcast approaches, with score vector expansion delivering further gains. 
Overall, this work identifies the loss score as the signal to broadcast, supplies the orthogonality theory and theoretical grounding for the three-factor learning rule from neuroscience, and shows how score vector expansion enriches the decorrelation directions of the resulting objective.","published_date":"2026-05-28T22:38:28+00:00","viability_score":1,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical framework for credit assignment in neural networks, exploring biologically plausible alternatives to backpropagation.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30637v1","title":"EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs","abstract":"Clinical decision-making (CDM) is central to real-world clinical workflows, where clinicians infer diagnoses, select treatments, or anticipate future health outcomes under incomplete evidence. LLMs are increasingly used to support these decisions due to strong language capabilities, broad biomedical knowledge, and efficiency, yet the reliability of LLMs on real-world clinical decision tasks remains insufficiently understood. To evaluate CDM models, especially LLM-based models, an ideal and practical medical decision benchmark should be constructed via an automated yet reliable pipeline to ensure both scale and quality. Moreover, the grounding of a CDM benchmark in real patient EHRs can better support evaluation on practical CDM tasks that require substantive biomedical knowledge and clinical inference. To fill the gaps, we introduce EHRBench, an automated and reliable EHR-grounded benchmark for evaluating LLM-based clinical decision-making at scale. To ensure scalability and reliability, EHRBench is constructed through an EHR-LLM-KB(knowledge-base) interaction pipeline. 
For efficiency, we use a specialized LLM to automatically convert encounter-level EHR trajectories into structured templates and deterministically instantiate the templates into QA items. In parallel, we apply systematic KB-based verification and enrichment to filter hallucinated or ambiguous relations and to improve reliability. Using this pipeline, we construct nearly 1M (960,067) QA items spanning three core inference-required clinical decision tasks: diagnosis, treatment, and prognosis. We benchmark more than 30 representative LLMs on EHRBench and provide detailed analyses of performance and robustness. The results show consistent capability trends across settings, further validating the reliability of EHRBench and highlighting actionable gaps toward clinically reliable LLM systems.","published_date":"2026-05-28T22:38:26+00:00","viability_score":8,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"EHRBench: An automated, reliable, and scalable benchmark for evaluating LLMs in clinical decision-making using real-world electronic health records.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30632v1","title":"Rationalize: Shared Semantic Reasoning for Human-AI Alignment","abstract":"We introduce Rationalize, a role-pair framework for shared semantic reasoning between humans and AI models in data-driven sensemaking. Building on ideas in human-machine teaming and critical thinking, we conceptualize human-AI interaction as a series of complementary role pairs (Explorer-Guide, Investigator-Informant, Teacher-Student, Judge-Advocate) operating in a shared reasoning space. In this space, human analysts and AI models (such as LLMs) make purposes, questions, assumptions, evidence, inferences, and implications explicit, facilitating alignment not only at the output level but at the level of rationalization of intent and action by each side. 
We relate these role pairs to the bidirectional human-AI alignment framework, illustrating how \"aligning AI to humans\" and \"aligning humans to AI\" differ by role, and sketch a collaborative research agenda for alignment design and assessment using element-level and role-specific approaches.","published_date":"2026-05-28T22:34:28+00:00","viability_score":3,"cluster_label":"Human-AI Alignment","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Rationalize: A framework for shared semantic reasoning between humans and AI, conceptualizing interaction as complementary role pairs for improved alignment.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30631v1","title":"Controllable Lung Nodule Synthesis via Histogram-Regularized Latent Diffusion Models","abstract":"While automated diagnosis systems have achieved remarkable success in computed tomography (CT)-based lung cancer screening, their development remains limited by the scarcity of diverse, annotated pulmonary nodule datasets. Diffusion-based generative models offer a promising strategy for data synthesis; however, many existing conditional approaches primarily optimize spatial reconstruction losses, which encourage voxel-wise similarity but may inadequately constrain lesion-level intensity distributions. As a result, these methods may produce over-smoothed texture profiles and underrepresent the distinct attenuation characteristics of different nodule subtypes, including solid, part-solid, and ground-glass nodules. To address this challenge, we propose a controllable latent diffusion model that synthesizes pulmonary nodules within full 3D CT volumes while accurately modeling nodule-specific intensity distributions. Specifically, rather than relying solely on spatial losses, we introduce a histogram-based regularization term that constrains voxel intensity distributions during the generative process. 
The model combines subtype, spatial mask, and Hounsfield unit (HU) histogram conditioning with the differentiable feature-space histogram regularization term to better align lesion-level intensity distributions, improving the visual plausibility and subtype consistency of synthesized nodules. Extensive experiments on lung CT data demonstrate that our framework achieves strong visual realism, validated through both quantitative metrics and a visual Turing test. Furthermore, when used for data augmentation, the generated nodules improve performance in downstream clinical tasks, particularly for underrepresented nodule subtypes, and show a potential benefit for subtype-informed malignancy classification.","published_date":"2026-05-28T22:32:06+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A controllable latent diffusion model synthesizes realistic pulmonary nodules with accurate intensity distributions for improved lung cancer screening data augmentation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30628v1","title":"The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability","abstract":"Universal LLM reliability is not a finite-library problem: across all possible tasks, tools, schemas, knowledge sources, and evaluator expectations, new intervention-distinguishable failure modes can appear without bound, so no finite intervention dictionary can guarantee bounded residual error for every such mode. But deployed systems do not operate over the whole universe. They operate inside operationally bounded patches (legal review, medical RAG, code repair, customer-support agents, contract extraction) with recurring tasks, schemas, tools, and evaluator expectations. 
Within such patches, empirical evidence suggests failures are sparse, repetitive, and concentrated in a small recurring catalogue, so reliability becomes a local catalogue-discovery and intervention-coverage problem rather than an exponential token-length problem. We formalize this transition with two propositions and one corollary. Proposition 1 is the worst-case-mode-wise negative result: no finite intervention dictionary covers every distinguishable failure mode of an unbounded domain. Corollary 1 is the inverse-discovery implication: the logarithmic upper bound on mode discovery cannot accommodate linearly more distinct tail modes without exponentially more observed hard-failure events. Proposition 2 is the positive patch-local result: under log active-mode exposure and head-heavy coverage, a sufficient per-hard-decision intervention budget grows polylogarithmically in sequence length and becomes domain-constant once the patch catalogue saturates. The framework relocates rather than dissolves long-context difficulty: where the number of hard decisions itself grows with task length, reliability remains hard; the contribution is to identify the on-axis intervention rather than to make those regimes easy.","published_date":"2026-05-28T22:27:08+00:00","viability_score":3,"cluster_label":"LLM Reliability","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper theorizes that LLM reliability is patch-local, suggesting that finite intervention dictionaries can bound errors within specific operational domains.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30625v1","title":"Active Timepoint Selection for Learning Measure-Valued Trajectories","abstract":"Inferring continuous probability paths from sparse snapshots is a fundamental challenge in domains like single-cell biology, where high-fidelity data acquisition is often destructive and constrained by prohibitive sequencing costs. 
This motivates the need for active learning strategies to strategically select optimal measurement times. However, designing active learning policies for this setting remains an open problem: the target objects reside on the infinite dimensional Wasserstein space where standard Euclidean metrics are ill-defined, and current interpolation methods lack epistemic uncertainty quantification. We introduce a framework which extends active experimentation to the space of measures. By leveraging Linearized Optimal Transport (LOT), we map distributional snapshots into a tangent space amenable to Gaussian Process modeling, allowing us to construct a tractable probabilistic surrogate for the underlying probability path. This yields an acquisition policy that iteratively selects measurement times to minimize uncertainty. Empirical results demonstrate that our strategy outperforms uncertainty-agnostic baselines on both synthetic and real-world datasets.","published_date":"2026-05-28T22:22:35+00:00","viability_score":7,"cluster_label":"Active Learning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An active learning framework uses Linearized Optimal Transport and Gaussian Processes to select optimal measurement times for inferring measure-valued trajectories from sparse data.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30621v1","title":"Harness Updating Is Not Harness Benefit: Disentangling Evolution Capabilities in Self-Evolving LLM Agents","abstract":"LLM agents are increasingly deployed as systems built around editable external harnesses, including prompts, skills, memories and tools, that shape task execution without changing model parameters. Harness self-evolution adapts such agents by updating these harnesses from execution evidence. 
Yet it remains unclear whether a model's base capability in task-solving predicts its capabilities in harness self-evolution: which models produce useful harness updates, and which actually benefit from them? We analyze two harness self-evolution capabilities: (i) harness-updating, the capability to produce useful persistent harness updates from execution evidence; (ii) harness-benefit, the capability to benefit from updated harnesses during task solving. Our analysis reveals two findings. First, harness-updating is flat in base capability: models from different capability tiers produce harness updates that lead to surprisingly similar gains; even Qwen3.5-9B's updates yield gains comparable to those of Claude Opus 4.6. Second, harness-benefit is non-monotonic in base capability: weak-tier models benefit little from updated harnesses, mid-tier models benefit most, and strong-tier models benefit less than mid-tier. We trace low gains at the weak tier to two failure modes: weak-tier models may fail to activate relevant harness artifacts, or activate them but fail to follow them faithfully. These findings suggest investing capability budget in the task-solving agent rather than the evolver, and targeting harness invocation and long-horizon instruction following in agent training. 
Our source code is publicly available at https://github.com/A-EVO-Lab/a-evolve/tree/release/harness-evolution.","published_date":"2026-05-28T22:16:14+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":"https://github.com/A-EVO-Lab/a-evolve","commercial_flags":["has_code"],"one_liner":"This research disentangles LLM agent harness updating and benefit capabilities, finding mid-tier models benefit most from harness evolution for task solving.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30619v1","title":"Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles","abstract":"Best-of-$N$ sampling is widely used to construct pairwise preference data: $N$ candidates are drawn from a base distribution, and the best is paired with a rejected response. Despite its widespread use, what Bradley--Terry (BT) reward learning extracts from such data, and how to choose $N$ and the base distribution, remain unclear. We specialize a recent analysis of preference data via its induced conditional distribution to Best-of-$N$. For independent-reference variants, we derive closed-form reward targets as explicit functions of $N$ and the base distribution, and show that they preserve the latent reward ranking. For the practical Best-vs-Random and Best-vs-Worst variants, chosen and rejected responses are coupled through the same candidate set, so exact BT representability generally fails; nevertheless, bounded-class minimizers approach the reference targets as $N$ grows. Although margin and connectivity are known to govern sample efficiency in pairwise preference learning, Best-of-$N$ couples them through $N$ in opposing directions: larger $N$ widens pairwise margins but reduces connectivity. 
This trade-off yields two design principles: use larger $N$ when preference labels are the bottleneck, smaller $N$ when generation is the bottleneck; and shape the base distribution to place mass between the responses whose comparison matters most at test time. Experiments on synthetic and real preference data support the predicted dependence on sample size and base-distribution shape.","published_date":"2026-05-28T22:15:57+00:00","viability_score":2,"cluster_label":"Machine Learning Theory","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper analyzes the theoretical underpinnings of reward learning from Best-of-N preference data, providing insights into optimal sampling strategies and their impact on latent reward ranking.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30611v1","title":"Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs","abstract":"Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi-agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. 
Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana-Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at https://github.com/HaozheZhao/Crafter.","published_date":"2026-05-28T22:04:30+00:00","viability_score":8,"cluster_label":"Generative AI","has_code":true,"repo_url":"https://github.com/HaozheZhao/Crafter","commercial_flags":["has_code"],"one_liner":"Crafter is a multi-agent system for generating editable scientific figures from diverse inputs, outperforming existing methods and offering SVG conversion for local revision.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30604v1","title":"An Organization-Scoped LLM Agent Runtime Architecture for Regulated Cybersecurity Operations","abstract":"Regulated cybersecurity workflows lack a runtime substrate that enforces organization-level scope across retrieval, tool calls, memory, findings, reports, and audit while remaining model-agnostic and locally deployable. Recent large language model (LLM) agent systems report strong results on isolated cybersecurity tasks, yet they do not by themselves define an auditable platform architecture for regulated security operations centre (SOC) and compliance workflows, where a single analyst may trigger actions that bind the organization, and where the runtime must integrate with existing SIEM/XDR stacks as a primary source of context and alert-driven triggers rather than operate as a standalone analytical layer. This paper proposes an organization-scoped LLM agent runtime architecture for financial cybersecurity. 
The contribution is a typed Security Context that is created at every entry point, including SIEM/XDR notifications ingested as first-class triggers, and enforced at every component boundary, combined with a shared Runtime Core, logical specialist subagents, a governed Tool Adapter Layer exposing SIEM/XDR query, enrichment, and response primitives under uniform policy and audit, structured findings with evidence references, tiered human-in-the-loop (HITL) gates, and append-only audit. Model Context Protocol (MCP), extended telemetry, digital twins for pentesting, graph retrieval, and federated knowledge sharing are treated as optional extension paths rather than mandatory runtime assumptions. We describe an implementable slice as the architecture's testability surface, and we propose a falsifiable evaluation plan with metric-level pass criteria for architecture readiness, security-policy enforcement, evidence traceability, output quality, and operational observability.","published_date":"2026-05-28T21:51:38+00:00","viability_score":3,"cluster_label":"AI Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper proposes an organization-scoped LLM agent runtime architecture for regulated financial cybersecurity operations, focusing on auditable platform design and integration with existing SIEM/XDR stacks.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30593v1","title":"Scientific Machine Learning for Engine Health Management and Remaining Useful Life Prediction","abstract":"Engine Health Management (EHM) depends on reliable forecasting of Remaining Useful Life (RUL) and on tracking thermal indicators such as turbine gas temperature (TGT). In practice, real-world fleet data are heterogeneous and non-stationary, and point predictions alone are insufficient for risk-aware maintenance decisions. 
This paper presents a multi-task scientific machine learning framework for turbine prognostics that jointly predicts turbine gas temperature untrimmed (TGTU), Delta Turbine Gas Temperature (DTGT), and RUL, with quantified uncertainty in the form of prediction intervals whose empirical coverage is evaluated. A shared sequence encoder (convolutional front-end with residual bidirectional LSTM layers and attention pooling) feeds task-specific heads, including mean--variance estimation for probabilistic regression and, optionally, a survival head for threshold-based event modeling. The framework is designed to be tunable via a small set of practitioner-facing parameters (e.g., DTGT thresholding rules and RUL target construction) so that deployment can align with in-house policies and proprietary criteria. The predictive performance of the proposed framework is evaluated using both point and interval metrics, including mean absolute error (MAE), prediction interval coverage probability (PICP), mean prediction interval width (MPIW), and the coverage--width criterion (CWC). 
Results are reported both in aggregate and stratified by flight phase and maintenance segment to highlight operational-context effects and to support uncertainty-aware monitoring.","published_date":"2026-05-28T21:39:53+00:00","viability_score":3,"cluster_label":"Scientific Machine Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A multi-task scientific machine learning framework is presented for engine health management, jointly predicting turbine gas temperature and remaining useful life with quantified uncertainty.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30590v1","title":"Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents","abstract":"Two clinical AI systems can score nearly identically on coverage-based rubrics yet behave radically differently when their patient inputs change: one updates its recommendations to match the new clinical signal, while the other produces the same output regardless. We introduce the Causal Sensitivity Score (CSS), a pre-registered interventional metric that mutates oncology tumor-board cases along five clinically meaningful dimensions - biomarker flips, prior-treatment failures, biomarker removals, surgery-status changes, and stage perturbations - and scores whether each model updates its recommendations in the pre-registered correct direction using a {0, 0.5, 1.0} scale. Benchmarked against the Consensus Match Score (CMS), a coverage-based weighted recall metric, six frontier models from three labs evaluated in single-shot inference across 224 cases rank in nearly opposite orders: all six models change rank, the CMS-worst model becomes CSS-best, and one upper-mid CMS model ranks last on CSS. We further surface a universal safety blind spot: every frontier model fails on surgery-status interventions (at most 17.2% CSS on Family D), a finding CMS does not expose. 
The metric also transfers to tool-using agents: in a ReAct-style experiment, tool use improves CSS for five of six models (+2.5 to +20.3 percentage points), yet the lowest-CSS model retrieves the same chart sections and still fails to update its recommendations - revealing a structural responsiveness deficit visible only under counterfactual evaluation. Cross-judge replication and three-rater medical-professional validation confirm the aggregate findings. Interventional pre-registered metrics like CSS complement coverage-based evaluation for clinical AI agents: they capture responsiveness that coverage metrics miss and offer a candidate dense reward signal for future agentic RL systems.","published_date":"2026-05-28T21:37:06+00:00","viability_score":7,"cluster_label":"Clinical AI Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new metric, CSS, reveals critical safety blind spots and responsiveness deficits in clinical LLMs and agents that traditional evaluation misses, enabling more robust AI development.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30589v1","title":"ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law","abstract":"U.S. immigration law spans thousands of pages of official policy, federal regulations, and procedural guidance that change frequently and carry high stakes for petitioners who lack legal representation. We describe the construction of ImmigrationQA, a source-grounded question-answering dataset of 17,058 pairs across 13 immigration subdomains, and the fine-tuning of a Llama 3.2 3B Instruct model on that dataset using parameter-efficient LoRA. The corpus was assembled from 11 primary and secondary sources -- including the USCIS Policy Manual, 8 CFR, BIA precedent decisions, and community Q&A -- yielding 10,056 validated canonical documents and 18,308 text chunks. 
Structured QA pairs were generated from these chunks using Claude Sonnet 4.6 via five mode-specific prompts, with 22 pairs rejected for insufficient source-span overlap. The fine-tuned model was evaluated against a held-out split of 993 pairs using LLM-as-judge scoring on a 101-example stratified sample. The fine-tuned model scored a mean of 1.08/3.0 (16.8% fully correct; 101-example stratified eval) versus the Llama 3 8B base model at 0.85/3.0 (4% fully correct), a relative improvement of 27% in mean score; a zero-shot Claude Sonnet baseline scored 1.52/3.0 (25% fully correct). The fine-tuned model shows concentrated improvement in procedural subdomains (travel documents, adjustment of status, nonimmigrant visas) while remaining weak on complex legal reasoning and time-sensitive statistics. The full pipeline ran for approximately $29 in cloud compute. All artifacts -- dataset, model, code, and prompt templates -- are publicly released. The system is not a substitute for legal counsel and does not reflect regulatory changes after the corpus crawl date.","published_date":"2026-05-28T21:36:18+00:00","viability_score":9,"cluster_label":"Legal AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A specialized, low-cost fine-tuned Llama 3 model and dataset for U.S. immigration law significantly improves accuracy on procedural questions, offering a targeted legal AI solution.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30585v1","title":"Benchmarking Machine Learning Uncertainty Quantification Methodologies for Predicting Turbine Gas Temperature Degradation","abstract":"Effective prognostics and health management of modern engines relies on accurate turbine gas temperature predictions and robust uncertainty quantification to ensure reliability and safety. 
This paper investigates five major approaches for constructing prediction intervals -- namely the Delta method, Bayesian Monte Carlo Dropout, Bootstrap method, Lower-Upper Bound Estimation, and Mean-Variance Estimation -- as a means of capturing the uncertainty in neural network predictions of turbine gas temperature. Each approach is implemented within a unified experimental framework that employs cross-validation for hyperparameter selection, repeated train-test splits for performance robustness, and multiple metrics to evaluate both the accuracy and tightness of the intervals. In particular, Coverage Probability, Normalized Mean Prediction Interval Width, and the Coverage Width-based Criterion are measured to comprehensively assess each method's reliability and sharpness. Experiments conducted on a representative turbine gas temperature dataset reveal distinct trade-offs among the five methods in terms of interval coverage, width, and stability. These findings provide a practical guide for selecting and tuning prediction interval methods in engine health management and prognostics, ensuring both interpretability and precision in real-world applications.","published_date":"2026-05-28T21:25:47+00:00","viability_score":4,"cluster_label":"ML Uncertainty Quantification","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A comparative study of five uncertainty quantification methods for predicting turbine gas temperature degradation provides insights into their trade-offs for engine health management.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30581v1","title":"Prior Availability in Industrial Visual Sim-to-Real: A Review of CAD-Guided and CAD-Unavailable Regimes","abstract":"Industrial visual sim-to-real is often described as transferring from synthetic images to real images, but industrial deployment usually involves a broader mismatch between available evidence and required decisions. 
A system may be built from CAD renderings, simulated RGB-D observations, normal reference images, synthetic defects, pretrained feature spaces, or language prompts, yet deployed under different sensors, lighting, materials, fixtures, calibration, production variation, and rare defect modes. This review reframes industrial visual sim-to-real as a domain-gap problem organized by prior availability. We distinguish CAD-available settings, where explicit object geometry can support rendering, calibration, pose estimation, segmentation, and test-time geometric verification; CAD-unavailable settings, where geometry is replaced by normal-reference appearance, feature distributions, teacher-student residuals, synthetic anomaly assumptions, foundation features, or vision-language priors; and boundary-prior settings, where approximate models, templates, reference views, or semantic correspondences preserve only part of the CAD role. This framing connects CAD-based detection and 6D pose-estimation literature with industrial anomaly and surface-inspection literature that is usually reviewed separately. To make the taxonomy concrete, we use empirical anchors on T-LESS/BOP, MVTec AD, and VisA. The anchors show that CAD render count alone does not close transfer; source-distribution design, detector capacity, and small real calibration can matter more. They also show that CAD at test time creates a distinct verification channel through mask, pose, and depth consistency, whereas CAD-unavailable inspection relies on calibrated normality and feature deviation. 
The review therefore argues against a single cross-task leaderboard and instead asks what prior grounds the deployment decision.","published_date":"2026-05-28T21:18:27+00:00","viability_score":0,"cluster_label":"Industrial Visual Sim-to-Real","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A review reframes industrial visual sim-to-real as a domain-gap problem based on prior availability, connecting CAD-based and CAD-unavailable regimes for better deployment decisions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30576v1","title":"Uncertainty-Aware and Temporally Regulated Expert Advice in Reinforcement Learning for Autonomous Driving","abstract":"Exploration in reinforcement learning for autonomous driving is inherently unsafe: agents must experience novel behaviors to learn, yet exploration can lead to collisions or off-road driving. We propose an uncertainty-aware framework that leverages expert advice to guide exploration while avoiding long-term dependence. Advice is triggered when epistemic or aleatoric uncertainty exceeds adaptive thresholds derived from rolling buffers, ensuring advice evolves with the agent's confidence. A commitment-cooldown strategy with a stochastic early-stop heuristic regulates the duration and frequency of guidance, exposing the agent to coherent maneuvers without exhausting the advice budget. Expert and agent experiences are combined in a shared replay buffer within an off-policy implicit quantile network (IQN) backbone, enabling efficient reuse of expert trajectories. 
Experiments in CARLA show that our method outperforms the IQN baseline, improving success by 5-7% and reducing failures, demonstrating that risk-sensitive uncertainty coupled with regulated expert integration enables safer and more efficient exploration for sensor-based RL policy learning in unsignalized intersection navigation.","published_date":"2026-05-28T21:09:44+00:00","viability_score":4,"cluster_label":"Reinforcement Learning for Autonomous Driving","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An uncertainty-aware reinforcement learning framework guides exploration in autonomous driving using expert advice and adaptive thresholds to improve safety and efficiency.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30571v1","title":"Memory-Bound but Not Bandwidth-Limited: The Physical AI Inference Gap in Batch-1 LLM Decode","abstract":"Physical AI systems, including robots, autonomous vehicles, embodied agents and edge copilots, often run a different inference workload from cloud LLM serving: single-stream, batch-1 autoregressive decode, where one robot, camera feed or user session waits on the next token. This workload is usually described as memory-bandwidth-bound. Each decode step streams model weights and the active KV cache, so latency should scale with peak HBM bandwidth. We show that this account is true but incomplete. We measure batch-1 decode for three 7 to 8B-class GQA transformers across four NVIDIA GPUs: H100 SXM5, A100-80GB SXM4, L40S and L4. We evaluate context lengths from 2048 to 16384, producing 44 valid cells under a controlled bf16 SDPA setup. The achieved fraction of peak HBM bandwidth falls as peak bandwidth rises. On the headline Qwen-2.5-7B ctx=2048 cell, an L4 reaches roughly 81 percent of its analytic memory floor, while an H100 reaches only 27 percent. Physical-AI decode is memory-dominated, but faster memory does not translate into proportional latency gains. 
We test the missing term with a CUDA Graphs A/B experiment. On H100 at ctx=2048, CUDA Graphs improves decode latency by 1.259x across N=10 fresh sessions, with a 95 percent bootstrap confidence interval of 1.253 to 1.267. On L4, the same intervention gives only 1.028x. This isolates a launch-side overhead that becomes visible on fast GPUs but remains mostly hidden on slower, bandwidth-bound GPUs. The deployment implication is that memory savings matter only when the runtime realises them. On L4, bf16 decode sits close to the memory floor, but common quantised paths do not recover the expected 4x weight-traffic reduction: bnb-nf4 reaches 59.36 ms/step and AutoAWQ+Marlin reaches 45.24 ms/step from a 62.32 ms bf16 baseline. GPTQ+ExLlamaV2, with Ada-tuned int4 kernels, reaches 17.36 ms/step.","published_date":"2026-05-28T21:03:14+00:00","viability_score":5,"cluster_label":"LLM Inference Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Optimizing batch-1 LLM decode for physical AI systems by identifying and mitigating launch-side overheads that limit performance on high-bandwidth GPUs.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2605.30570v1","title":"Procedural Generation of First Person Shooter Maps using Map-Elites","abstract":"We investigate the application of MAP-Elites (a well-known quality diversity algorithm) to design levels for First-Person Shooter (FPS) games. We consider two well-known map representations (All-Black and Grid-Graph) and introduce two novel representations (Point-Line and Spatial-Layout) that improve the characterization of FPS maps. We define a series of metrics to describe maps' topological properties (which solely depend on maps' layout), and emergent properties (which must be evaluated through actual gameplay). We perform an in-depth analysis to identify the most suitable features to guide MAP-Elites illumination process. 
We apply MAP-Elites with Sliding Boundaries (MESB) to evolve populations of FPS maps. Our results show that the new representations can generate maps with higher diversity and quality than the representations previously used for evolving FPS maps.","published_date":"2026-05-28T21:02:27+00:00","viability_score":4,"cluster_label":"Procedural Content Generation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Applying the MAP-Elites algorithm with novel map representations to procedurally generate diverse and high-quality levels for first-person shooter games.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30563v1","title":"Transforming and Encoding FTS for SAT Solving: What Helps, What Hurts (Extended Version)","abstract":"Factored tasks are a classical planning representation that extends SAS+ with limited forms of disjunctive preconditions, conditional effects, and angelic nondeterminism. This allows for a more compact representation of tasks than traditional formalisms such as STRIPS or SAS+, and supports a wide range of task transformations. However, existing planning approaches for factored tasks have been limited to heuristic search methods.   In this work, we investigate how to encode factored tasks in SAT. We propose several ways to encode the tasks, focusing on different strategies for translating the factored transition relation into propositional logic. 
We also analyze how to exploit parallelism at various levels in this setting and study the impact of common task transformations on the performance of SAT-based planners.","published_date":"2026-05-28T20:50:52+00:00","viability_score":0,"cluster_label":"AI Planning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigating the encoding of factored tasks into SAT for planning, analyzing different translation strategies and task transformations.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30561v1","title":"VLM3: Vision Language Models Are Native 3D Learners","abstract":"Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. The key argument this work wants to make is that VLMs are native 3D learners. Our in-depth large scale study shows that 1) focal length unification, 2) text-based pixel reference and 3) data mixture and scaling, are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are actually not necessary conditions. As a result, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only advances the VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation and object-level 3D understanding, matching expert vision model accuracy while maintaining standard architectures and text-based training. 
We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.","published_date":"2026-05-28T20:48:55+00:00","viability_score":4,"cluster_label":"3D Vision with VLMs","has_code":true,"repo_url":"https://github.com/facebookresearch/VLM3","commercial_flags":["has_code"],"one_liner":"This paper proposes a simple yet effective method to enable standard Vision Language Models to master diverse 3D tasks, advancing accuracy and enabling new applications.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30557v1","title":"Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?","abstract":"Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. 
First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.","published_date":"2026-05-28T20:44:47+00:00","viability_score":7,"cluster_label":"VLM Spatial Reasoning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This work introduces a framework to evaluate if Vision Language Models know when not to answer spatial questions due to visual limitations, highlighting a critical gap in current models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30542v1","title":"Physically Viable World Models: A Case for Query-Conditioned Embodied AI","abstract":"World models for embodied AI must be physically viable: constructed to answer intervention queries by representing the physical structure governing action outcomes, rather than merely predicting future observations. Existing observation-predictive world models can produce visually plausible but physically wrong rollouts. This failure is structural; distinct physical systems can look identical yet diverge under intervention. We expose this problem with controlled benchmarks that fix the visible scene while varying latent physics. We show that such models may recommend infeasible actions, mispredict interaction outcomes, or certify unsafe behavior. We argue that embodied AI requires world models that identify the simplest physical abstraction sufficient to answer an intervention query. 
Such a model comprises modular components, including environment representation, latent state and parameter estimation, action specification, interventional dynamics, and query-level response. An autonomous orchestrator should identify the relevant abstraction and compose compatible learned and structured components per query. When closed-form physics is unavailable, uncertain, or costly, the transition model may be analytic, simulated, learned, or hybrid, but it must preserve the structure that determines interventional outcomes. This decomposition makes the model interpretable, its components verifiable, and its outputs auditable against the query. It also provides a design principle for new world models and a feasibility test for existing ones: the right abstraction is not the most detailed model of the world, but the simplest model that preserves the distinctions relevant to the query. We demonstrate this approach on queries that existing systems fail to answer correctly, and outline how an orchestrator can dynamically assemble and adapt physically viable models for planning, control, and verification.","published_date":"2026-05-28T20:18:22+00:00","viability_score":7,"cluster_label":"Embodied AI World Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper proposes physically viable world models for embodied AI that can answer intervention queries by representing physical structure, not just predicting observations.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30529v1","title":"Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages","abstract":"Sentence-embedding models for semantic search are overwhelmingly developed and evaluated on English corpora. 
When applied to clinical retrieval in other languages -- particularly retrieval of ICD-10-CM / CIE-10 codes -- recall degrades in ways often masked by aggregate benchmarks. We study whether large generative language models can serve as data factories to close this gap. We build a two-stage retriever (bi-encoder followed by cross-encoder reranker), fine-tuned from a Spanish biomedical encoder (PlanTL-GOB-ES/bsc-bio-ehr-es) on Gemini-generated synthetic data covering English, Spanish, Catalan, Italian, Portuguese and French, and evaluate against BioBERT-ST and the un-tuned Spanish encoder. The bi-encoder alone matches BioBERT-ST on MRR (0.876 vs. 0.866) and overtakes it on R@3 (0.650 vs. 0.626) and R@5 (0.804 vs. 0.790) without English biomedical pretraining. Adding a cross-encoder reranker lifts aggregate R@5 to 0.822 and dominates on four of five languages (+0.017 Spanish, +0.033 Catalan, +0.018 French, +0.037 Portuguese) at the cost of a small English regression. The trade-off is clinically acceptable: Portuguese reaches R@5 = 0.829 vs. BioBERT-ST's 0.714. 
Contributions: an open recipe for building domain-specific medical retrievers from LLM-generated data; quantification of the learning gain (MRR 0.755 to 0.876, +15.9% with ~19,500 synthetic pairs); and a characterisation of where gains concentrate by language and rank.","published_date":"2026-05-28T20:06:43+00:00","viability_score":8,"cluster_label":"Clinical Code Search","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This work presents a method to build domain-specific medical retrievers for non-English languages using LLM-generated data, significantly improving clinical coding search performance.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30523v1","title":"Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't","abstract":"Recent work describes what transformers can and cannot compute through connections to boolean circuits, but existing results lack exact characterizations and are sensitive to modeling choices. Padded transformers -- to whose input filler symbols such as ``...'' are appended -- emerge as a useful gadget for establishing equivalences to circuit classes by providing polynomial space for adaptive parallel computation. However, only a limited set of padded transformer idealizations has been studied, leaving open how robustly these equivalences hold under changes to attention type, model width, and uniformity. We find that, under practical assumptions, padded transformers are surprisingly robust to all of these, and identify numeric precision and model depth as the main factors affecting expressivity. Concretely, we prove that polynomially padded $\\text{L-uniform}$ constant-precision transformers are equivalent to $\\text{L-uniform AC}^0$, while growing-precision ones achieve $\\text{L-uniform TC}^0$ regardless of width. 
Furthermore, looping enables sequential processing analogous to circuits: $\\log^d N$-looped constant-precision transformers reach $\\text{FO-uniform AC}^d$, and growing-precision ones reach $\\text{FO-uniform TC}^d$. Interestingly, growing width or precision beyond logarithmic does not increase expressivity, and all our results hold for both softmax and average hard attention transformers.","published_date":"2026-05-28T19:58:43+00:00","viability_score":1,"cluster_label":"LLM Theory","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper theoretically analyzes the expressivity of padded transformers, identifying key architectural choices that impact their computational power.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30512v1","title":"PhyDrawGen: Physically Grounded Diagram Generation from Natural Language","abstract":"Generating physics diagrams from text requires strict adherence to physical laws. While current generative models produce visually plausible outputs, they systematically hallucinate force vectors, ignore conservation laws, and violate geometric constraints. We present PhyDrawGen, a neuro-symbolic pipeline that decouples semantic scene understanding from physical constraint satisfaction. First, a large language model extracts a typed scene graph from the problem text. A deterministic solver then converts this graph into a Planar Straight-Line Graph (PSLG), encoding force balance, optical paths, and field topologies as exact geometric primitives. Finally, a fine-tuned Qwen-VL model implements a visually grounded propose-verify loop to iteratively correct any constraint violations. 
Evaluated on a benchmark of 1,449 problems spanning mechanics, optics, and electromagnetism, PhyDrawGen significantly outperforms GPT-5-image, Gemini 2.5 Flash, and Gemini 3 Pro, demonstrating robust physical accuracy even on unusual-object problems.","published_date":"2026-05-28T19:49:27+00:00","viability_score":8,"cluster_label":"AI for Science","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PhyDrawGen is a neuro-symbolic system that generates physically accurate diagrams from natural language, outperforming leading LLMs.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30510v1","title":"A Novel Global Context-aware Deep Neural Network for Enhanced Brain Tumor Segmentation using Magnetic Resonance Images","abstract":"Brain cancer's severity necessitates precise brain tumor segmentation, which is crucial for effective brain tumor diagnosis. Manual identification, burdened by high costs, labor, and error risks, highlights the need for automated methods. In this study, we introduce the Global Context-aware Squeeze and Excite Residual UNet (GCSER-UNet), which facilitates a fusion of spatial and channel-wise attention and thus enhances the model's capacity to capture intricate spatial dependencies and contextual information. GCSER-UNet efficiently extracts tumor segments from multimodal MRI slices, delivering exceptional performance. Evaluations on benchmark databases exhibit its superiority, achieving a notable 94 percent dice score on the TCGA LGG dataset, surpassing the state-of-the-art dice score of 91.8 percent. In the BraTS 2020 dataset, the proposed GCSER-UNet ensemble approach yielded dice scores of 95 percent, 92 percent, and 90 percent for the tumor regions - Whole Tumor (W), Tumor Core (T), and Enhancing Tumor (E), respectively. The current state-of-the-art dice scores were 94 percent, 93 percent, and 88 percent. 
These compelling outcomes highlight the efficacy of GCSER-UNet in precise brain tumor segmentation and thus can aid neurologists in effective brain cancer management and treatment planning.","published_date":"2026-05-28T19:46:46+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"GCSER-UNet is a novel deep learning model for precise brain tumor segmentation from MRI, achieving state-of-the-art results.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30509v1","title":"Improved Distribution Estimation in $\\ell_\\infty$","abstract":"We present improved bounds for estimating discrete probability distributions under the $\\ell_\\infty$ norm. These include minimax bounds in expectation and high-probability tail bounds. We resolve some of the open questions posed in Kontorovich and Painsky (JMLR, 2025) -- including a fully empirical version of the tightest risk bound they presented and identifying the form of the worst-case extremal distribution. Encouraging empirical results are reported as well.","published_date":"2026-05-28T19:44:10+00:00","viability_score":1,"cluster_label":"Statistical Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper presents improved theoretical bounds for estimating discrete probability distributions under the L-infinity norm.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30486v1","title":"Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting","abstract":"Spatio-temporal forecasting on sensor graphs is commonly tackled with a single backbone architecture applied uniformly across all nodes, although graph regions can exhibit different dynamics. Road segments differ in functional class, structure, and traffic behavior, suggesting that node-wise expert specialization can be useful. 
We propose GC-MoE, a graph-conditioned mixture of experts framework that assigns each node a personalized combination of frozen forecasting experts based on graph topology and the recent traffic input window. GC-MoE combines frozen pretrained spatio-temporal GNN experts with an input-aware, spatially contextualized router while training only a lightweight routing module. We also study a bounded graph-conditioned output refinement layer as an optional extension and include node-adaptive ST-LoRA adapters only as an ablation diagnostic. Across four standard benchmarks (PEMS04, PEMS07, METR-LA, and PEMS-BAY), GC-MoE improves MAE over a zero-parameter ensemble baseline, with competitive RMSE and MAPE, while training only ~17K parameters on top of 1.5M frozen expert weights. The implementation is available at https://github.com/Ahghaffari/gc_moe.","published_date":"2026-05-28T19:05:18+00:00","viability_score":7,"cluster_label":"Traffic Forecasting","has_code":true,"repo_url":"https://github.com/Ahghaffari/gc_moe","commercial_flags":["has_code"],"one_liner":"A graph-conditioned mixture of experts framework that assigns nodes personalized forecasting models for improved spatio-temporal traffic prediction.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30462v1","title":"idSCD: Identifying Training Datasets through Semantic Correlation Descriptors","abstract":"Can a dataset be recognized from the spurious correlations it induces during training? We argue that datasets leave dataset-specific traces in a model's learned semantic correlation structure: incidental regularities that are predictive within a dataset, but not causal for the underlying task, can be internalized during training. We use this insight to study dataset-level membership inference, moving beyond existing methods that rely on behavioral or distributional evidence such as confidence scores, losses, margins, generated samples, or query responses. 
We introduce a white-box semantic fingerprinting approach based on semantic correlation descriptors (SCDs), which capture the semantic correlation structure learned by a model and make it comparable across dataset mixtures. In a controlled leave-one-dataset-out diagnostic, SCDs recover dataset-specific changes and perfectly separate matching from non-matching dataset pairs. We then propose a practical SCD-based membership score that tests whether a target dataset is part of a model's training mixture using only the model's SCD and the target dataset's standalone SCD, without requiring leave-one-dataset-out models. Across three diverse experimental settings, with dataset groups for natural language inference, emotion classification, and medical text classification, we test both the advantages and limitations of SCD-based membership inference with different degrees of semantic separation and keyword support between dataset splits. On average, the classifier based on this score achieves the highest performance and the lowest std, outperforming black-box baselines RMIA, Attack-P, and LiRA, as well as the white-box SIF baseline. 
These results show that dataset membership can be traced through internal semantic correlations, with the largest relative gain exceeding 60% in ROC-AUC when dataset groups expose distinct semantic particularities.","published_date":"2026-05-28T18:38:15+00:00","viability_score":7,"cluster_label":"Dataset Identification","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A white-box approach using semantic correlation descriptors to identify a model's training datasets, outperforming existing methods.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30461v1","title":"Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics","abstract":"We present a distributed approach for constrained Multi-Agent Reinforcement Learning (MARL) that combines state-augmented policy learning with distributed consensus over dual variables. Our method targets systems where agents have separable dynamics but must coordinate to satisfy global resource constraints, a setting in which, as we demonstrate empirically, independent learning fails to produce feasible solutions because agents cannot determine appropriate individual contributions toward collective constraint satisfaction. The key technical contribution is showing that lightweight neighbor-to-neighbor consensus over Lagrange multipliers suffices for globally coordinated constraint enforcement while preserving the scalability of independent training. Each agent learns a single augmented policy offline, conditioned on both its local state and a dual variable encoding constraint feedback. During execution, agents reach agreement on this dual variable through local communication alone. 
We prove that under mild connectivity assumptions, the consensus error among agents' multipliers is bounded, and show that this translates to a bounded constraint violation that decreases with graph connectivity and the number of consensus rounds. Unlike centralized training with decentralized execution (CTDE) approaches, whose complexity grows at least quadratically with agent count, our method scales linearly in both training and execution. Experiments on smart grid demand response demonstrate that consensus coordination is essential for feasibility: without it, agents satisfy grid capacity constraints only by indefinitely postponing demand, a degenerate non-solution. With consensus, agents converge to a shared dual variable and satisfy both grid constraints and demand fulfillment, scaling to thousands of agents while CTDE baselines are limited to dozens.","published_date":"2026-05-28T18:37:16+00:00","viability_score":6,"cluster_label":"Multi-Agent Reinforcement Learning","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"Scalable constrained multi-agent reinforcement learning via state augmentation and consensus for separable dynamics, enabling coordination for global constraints.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30454v1","title":"The Surface You Test Is Not the Surface That Breaks","abstract":"Tool-augmented LLM agents are vulnerable to prompt injection: a third party who controls part of the agent's context can plant instructions that the agent then executes as if they came from the user. Current evaluations report a single attack success rate per model on one channel, the tool output, and treat that number as the model's vulnerability. But tool descriptions, which the agent reads at every turn before any tool is called, are themselves an injection surface that the attacker can choose instead. 
We hold the injection payload byte-identical and deliver it through both surfaces across 13 LLMs from six families and four task suites. The same bytes invert in success rate across models: GPT-4.1 is 96 percent vulnerable on tool outputs but only 4 percent on tool descriptions, while GEMINI-3-FLASH shows the mirror pattern at 20 percent and 98 percent. A variance decomposition over 6,830 attempts attributes 0 percent of the variation in attack outcomes to the surface alone, while the model-surface interaction accounts for 16.7 percent. Vulnerability is a property of the pairing, not the channel. The Adaptive Attack Rate, defined as the per-cell maximum over surfaces, exceeds the strongest fixed-surface baseline by +9.1 percentage points on average. Standard prompt-level defenses inherit the same blindspot, reducing tool-output ASR to 10-18 percent while leaving the description channel above 54 percent. Both attack and defense evaluation must report per-surface vulnerability.","published_date":"2026-05-28T18:26:40+00:00","viability_score":3,"cluster_label":"LLM Security","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Prompt injection vulnerability is a property of the LLM-surface pairing, not just the channel, requiring per-surface evaluation for both attacks and defenses.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30452v1","title":"A Unified Framework for Gradient Aggregation in Multi-Objective Optimization","abstract":"Many machine learning problems involve multiple inherent trade-offs that are best addressed by gradient-based multi-objective optimization (MOO) algorithms. Existing methods are often proposed with various motivations, analyzed case by case, and differ algorithmically in how the component gradients are aggregated at each step. In this work, we develop a unifying framework for gradient aggregation in MOO, establishing (optimal) rates of convergence to Pareto stationarity, the standard measure of performance in MOO. 
Central to our analysis is a sufficient alignment condition, from which we derive a theorem showing that non-conflicting directions, when chosen within the convex hull of gradients, form a fundamental sufficient condition for convergence. We further show that feasibility can be ensured through projection onto the dual cone, broadening the scope of methods that admit convergence guarantees. In parallel, we present a primal optimization perspective of gradient aggregation that encompasses established algorithms, clarifies their theoretical relationships, and enables the design of new variants. As an illustration, we introduce capped MGDA, derived from a CVaR-based formulation, and demonstrate its robustness in adversarial federated learning. Finally, we validate our theory through experiments on synthetic problems and practical benchmarks.","published_date":"2026-05-28T18:21:53+00:00","viability_score":3,"cluster_label":"Optimization Algorithms","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified framework for gradient aggregation in multi-objective optimization that establishes optimal convergence rates and enables new algorithm variants.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30447v1","title":"Calibrated Preference Learning: The Case of Label Ranking","abstract":"Calibration, the alignment of predicted probabilities with true outcome frequencies, is essential for reliable decision-making. While extensively studied for classification and regression, calibration has not been formally addressed for probabilistic label ranking, where the goal is to predict a distribution over orderings of a label set. Naively treating rankings as classes ignores their structure and fails to capture important modalities such as pairwise and top-k predictions. We formalize calibration for label ranking and develop a hierarchy of notions covering full rankings, sub-rankings, and top-k rankings. 
We prove that full-rank calibration implies the others but not conversely, and sub-ranking and top-k calibration are incomparable. Empirically, we find popular label ranking models are often poorly calibrated, with substantial differences between sub-ranking and top-k metrics. Applying our framework to RLHF reward models, we find that calibration correlates strongly but not perfectly with benchmark accuracy, suggesting it captures a meaningful quality dimension beyond top-1 accuracy. These findings motivate future work on understanding the downstream effects of miscalibration and developing methods to correct it.","published_date":"2026-05-28T18:18:21+00:00","viability_score":4,"cluster_label":"Preference Learning","has_code":true,"repo_url":"https://github.com/schlcht/microtype","commercial_flags":["has_code"],"one_liner":"Formalizes and evaluates calibration for label ranking, revealing common miscalibrations in popular models and their impact on RLHF reward models.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30434v1","title":"LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis","abstract":"Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. 
Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52%--69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.","published_date":"2026-05-28T18:00:20+00:00","viability_score":8,"cluster_label":"Agentic Data Analysis","has_code":true,"repo_url":"https://github.com/zjunlp/DataMind","commercial_flags":["has_code"],"one_liner":"LongDS-Bench is a new benchmark for evaluating agentic data analysis over long, multi-turn interactions, revealing current state-of-the-art limitations.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30415v1","title":"Domain Adaptation and Reasoning Frameworks in Language Models: A Controlled Experiment with Historical Cosmology","abstract":"We investigate how domain adaptation reshapes explanatory behavior in language models using historical cosmology as a controlled setting. In Phase 1, we train a small language model from scratch on a pre-Copernican corpus from which explicit heliocentric references were removed, and evaluate whether Earth-motion or heliocentric continuations nevertheless emerge. In Phase 2, we fine-tune a larger pretrained model using QLoRA on the same corpus in order to study how adaptation modifies explanatory framing and cosmological stance. Model outputs are evaluated using an LLM-as-judge framework that labels both cosmological stance (geocentric, heliocentric, or ambiguous) and explanatory frame (premodern versus modern). 
In the constrained setting of Phase 1, the smaller models occasionally generate local Earth-motion continuations, but these remain globally unstable and insufficient to support coherent cosmological reasoning. In Phase 2, fine-tuning induces a large and statistically significant shift toward premodern explanatory framing, while the conditional cosmological stance distributions remain comparatively stable within those frames. As a result, increases in geocentric outputs arise primarily from redistribution over explanatory regimes rather than from direct modification of stance. These results suggest that domain adaptation may primarily reshape the linguistic frameworks from which continuations are generated, with changes in stance emerging secondarily from those shifts.","published_date":"2026-05-28T18:00:02+00:00","viability_score":3,"cluster_label":"LLM Domain Adaptation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigates how domain adaptation affects explanatory behavior in language models using historical cosmology, showing adaptation primarily reshapes linguistic frameworks.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30353v1","title":"Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software","abstract":"Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level.   The agent resolved ten autonomously by iterating against oracle tests. Two more by the physicist's domain knowledge. The three it could not -- all evaded oracle detection -- share a common property: the agent treated symptom reduction as root-cause resolution. 
It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here, not obviously addressed by scaling alone. 
[Abridged.]","published_date":"2026-05-28T17:59:59+00:00","viability_score":3,"cluster_label":"AI Agents","has_code":true,"repo_url":"https://github.com/MinhMPA/clax-pt","commercial_flags":["has_code"],"one_liner":"An AI coding agent supervised by a physicist to build scientific software revealed limitations in architectural proposal and distinguishing predictive adequacy from explanatory correctness.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30351v1","title":"VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion","abstract":"Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. 
On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.","published_date":"2026-05-28T17:59:57+00:00","viability_score":7,"cluster_label":"Generative Video","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"VideoMLA reduces KV cache memory in autoregressive video diffusion by 92.7% using a low-rank latent attention mechanism, improving throughput and matching quality at long horizons.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30348v1","title":"LLMSurgeon: Diagnosing Data Mixture of Large Language Models","abstract":"The pretraining data mixture of Large Language Models (LLMs) constitutes their \"digital DNA\", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize $\\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose $\\textbf{LLMSurgeon}$, a strong framework that casts DMS as an inverse problem under the label-shift assumption. Rather than directly aggregating classifier outputs, LLMSurgeon estimates a calibrated $\\textit{soft}$ confusion matrix and solves a constrained inverse problem to correct systematic domain confusion and recover the latent mixture prior. To evaluate, we introduce $\\textbf{LLMScan}$, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures. Across LLMScan, LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols. 
Our work presents a practical, post-hoc approach for auditing the digital DNA of foundation models without access to their training data.","published_date":"2026-05-28T17:59:53+00:00","viability_score":7,"cluster_label":"LLM Auditing","has_code":true,"repo_url":"https://github.com/Yaxin9Luo/LLMSurgeon","commercial_flags":["has_code"],"one_liner":"LLMSurgeon diagnoses the data mixture of large language models from generated text by framing it as an inverse problem, enabling post-hoc auditing of foundation models.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30345v1","title":"SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations","abstract":"Printed circuit board (PCB) schematic design defines nearly all electronic hardware, but it remains manual and expertise-intensive. While generative AI has advanced digital and analog IC design, PCB schematic generation from natural-language intent is largely unexplored. This paper presents SchGen, the first large language model that generates editable PCB schematics from natural-language requests. The key challenge lies in the lack of an LLM-suited representation and a large-scale dataset. Current schematic formats are dominated by verbose, tool-specific syntax and geometry-heavy descriptions, making them difficult to generate reliably. We introduce a semantically grounded code representation that encodes schematic editing primitives with relative placement and pin-name-based wiring, transforming a geometry-driven generation problem into a semantics-driven matching task amenable to LLMs. We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation. Experiments show that SchGen significantly outperforms alternative representations and even larger general-purpose LLMs on wire connectivity accuracy and functional correctness. 
Our results highlight the critical role of representation design in enabling generative models for complex hardware design tasks.","published_date":"2026-05-28T17:59:50+00:00","viability_score":7,"cluster_label":"Hardware Design AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SchGen generates editable PCB schematics from natural language using a semantically grounded code representation, transforming hardware design into a task amenable to LLMs.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30344v1","title":"Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection","abstract":"Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report unsatisfactory performance when applying large language or multimodal models to finding abnormal patterns in sequential data. Public anomaly detection benchmarks typically provide interval annotations but not natural-language rationales, making it difficult to fine-tune VLMs to produce grounded, interpretable decisions. To address this gap, we construct VisAnomBench, a curated benchmark built from public time-series datasets and augmented with high-quality anomaly explanations selected from multiple large VLMs using fine-grained, task-specific rewards. Through fine-tuning on this benchmark, we develop VisAnomReasoner, a parameter-efficient VLM for time-series anomaly detection. Experimental results on VisAnomBench show that VisAnomReasoner achieves more accurate anomaly localization and consistently outperforms all baselines, with improvements of at least 21.23 and 23.87 percentage points in precision and F1, respectively. 
Additional experiments on the TSB-AD-U benchmark demonstrate strong cross-benchmark generalization, with VisAnomReasoner improving precision and F1 by 9.57 and 13.39 percentage points, respectively.","published_date":"2026-05-28T17:59:50+00:00","viability_score":7,"cluster_label":"Time-Series Anomaly Detection","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An efficient vision-language model for accurate and interpretable time-series anomaly detection, outperforming existing methods.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30343v1","title":"Unlocking the Working Memory of Large Language Models for Latent Reasoning","abstract":"To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens before the final answer. However, this couples reasoning to autoregressive generation and thereby conflates internal computation with external communication. In contrast, human cognition can use working memory to hold and manipulate information internally without the need to externalize intermediate thoughts. Drawing on this principle, we introduce Reasoning in Memory (RiM), a latent reasoning method that replaces the autoregressive generation of reasoning steps with memory blocks. These memory blocks are fixed sequences of special tokens that unlock the working-memory capacity of large language models. Since they are fixed rather than generated, they can be processed in a single forward pass, enabling compute-efficient latent reasoning. To operationalize these memory blocks, we employ a two-stage curriculum. First, we ground them by predicting explicit reasoning steps after each memory block. Second, we discard this step-level supervision and iteratively refine the final answer after each memory block. 
Our experiments on reasoning benchmarks show that, across language models of different families and sizes, RiM matches or exceeds existing latent reasoning methods while avoiding the autoregressive generation of thoughts. These results demonstrate that large language models can be trained to use working memory as an effective mechanism for latent reasoning.","published_date":"2026-05-28T17:59:49+00:00","viability_score":6,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel latent reasoning method for LLMs that uses fixed memory blocks to unlock working memory for compute-efficient reasoning.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30341v1","title":"GPIC: A Giant Permissive Image Corpus for Visual Generation","abstract":"Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available at https://huggingface.co/datasets/stanford-vision-lab/gpic. 
Evaluation toolkit and code are available at https://gpic.stanford.edu","published_date":"2026-05-28T17:59:26+00:00","viability_score":6,"cluster_label":"Data-centric AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"GPIC is a permissively licensed, safety-filtered corpus of roughly 100M captioned images (about 28 trillion pixels) with a benchmarking protocol and a pixel-space flow-matching baseline for visual generation.","time_to_mvp":"6+ months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30335v1","title":"Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents","abstract":"Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this locally coherent, globally incoherent failure via the compositional residual eps*, the L2 distance from the composed quote to the joint coherent polytope, computable at runtime from system output and the declared cross-component coupling constraints. A product-structure dichotomy characterises when local coherence suffices, and a Rayleigh-quotient prediction matches the observed residual within 7% on three of four relation classes. A hierarchical Boyle-Dykstra projection repairs the composition deterministically; an anytime-valid e-process gives sequential coherence monitoring. Across 1,876 ensemble cliques on a four-LLM mid-tier panel (frontier-panel rerun in Section 5.5), eps* > 0 on 33-94% of cliques, translating to +0.115 nats per bet of regret on 1,770 resolved bets under the proportional allocation rule (the gain collapses to +0.006 under bettors that themselves coherentise). 
Three intuitive LLM-side mitigations (retrieval, partition-aware prompting, aggregator-LLM) each fail or regress.","published_date":"2026-05-28T17:58:55+00:00","viability_score":5,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework to quantify and repair compositional incoherence in multi-component LLM agents.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30334v1","title":"Demystifying Data Organization for Enhanced LLM Training","abstract":"Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily reliant on effective data curation. While data selection has been widely studied, the strategic data organization for enhanced training remains an underexplored area, particularly since current LLMs are often trained for only one or a few epochs. This paper systematically explores the influence of data organization on LLM training by reusing pre-computed sample-level scores originally generated for data efficiency, thereby incurring minimal additional computational overhead. We identify and formalize four key guidelines for optimizing data organization: Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity. Guided by them, we introduce two novel data ordering methods termed STR and SAW. Extensive experiments across different model scales and data sizes, encompassing both pre-training and SFT stages, validate the effectiveness of our summarized guidelines. They also demonstrate the robustness of our proposed data ordering methods in enhancing the stability and performance of LLM training. 
GitHub link: https://github.com/microsoft/data-efficacy/","published_date":"2026-05-28T17:58:53+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":true,"repo_url":"https://github.com/microsoft/data-efficacy","commercial_flags":["has_code"],"one_liner":"Optimize LLM training efficiency by strategically organizing data with novel ordering methods.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30327v1","title":"Reasoning with Sampling: Cutting at Decision Points","abstract":"Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasoning without additional training, curated datasets, or verifiers. However, making this method practical requires efficiently sampling from the power distribution. A sampler needs to \"mix\" to the power distribution, which necessitates moving between modes of the target distribution; intuitively, e.g., trying different reasoning strategies. The samplers proposed in prior works repeatedly select a \"cut\" position in the current reasoning trace uniformly at random and resample the suffix from that position onward. However, reasoning traces typically contain a few consequential decisions (e.g., the choice of proof strategy or algorithm), and we observe that a uniformly chosen cut tends to rewrite local details rather than revisit decision points. We introduce an algorithm (Entropy-Cut Metropolis-Hastings) that uses the base model's next-token entropy as a proxy to identify key decision points and resample from those positions. 
We empirically verify that entropy jumps are a useful proxy for decision points and, in a stylized model of reasoning, prove that our method's mixing time scales with the number of decisions in a trace rather than with the number of tokens, which can be much larger. Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.","published_date":"2026-05-28T17:57:32+00:00","viability_score":3,"cluster_label":"AI Research & Algorithms","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Entropy-Cut Metropolis-Hastings samples from a base model's power distribution by resampling reasoning traces at high-entropy decision points, improving training-free reasoning over baselines.","time_to_mvp":"1-2 weeks","tags":["high_potential"]},{"arxiv_id":"2605.30326v1","title":"RoboWits: Unexpected Challenges for Robotic Creative Problem Solving","abstract":"The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world environments. However, current robotic benchmarks primarily emphasize skill-level execution and provide limited insight into such cognitive reasoning capabilities. We introduce RoboWits, a bi-manual robotic benchmark designed to systematically evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions. To enable scalable construction of high-quality reasoning-centric unexpected scenarios, we propose an automated task generation pipeline formulated as a multi-agent cooperative framework, comprising agents for seed task generation and verification, metric generation, scene generation, and task mutation. Using the pipeline, we curated 30 diverse seed tasks and 208 tasks with mutations and graded difficulty across geometry, material, and assembly-based reasoning. We benchmark popular robot policies, pre-trained VLAs, and oracle-state planners. 
Our results reveal a significant performance gap: while pre-trained VLAs exhibit preliminary success on seed tasks after single-task fine-tuning, they struggle to perform on mutated tasks, implying their brittleness in manipulation tasks requiring reasoning, strategy adaptation, and robustness to deceptive or constrained environments. Project page is available at https://umass-embodied-agi.github.io/RoboWits.","published_date":"2026-05-28T17:57:15+00:00","viability_score":7,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Benchmark robotic creative problem-solving with RoboWits, revealing brittleness in current models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30324v1","title":"On Language Generation in the Limit with Bounded Memory","abstract":"We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language one at a time and must eventually output only new valid examples. Prior work assumes access to the entire history, a strong assumption since realistic algorithms retain limited past information. Classical work in learning theory shows memory constraints dramatically alter learnability; we extend this to language generation.   First, we study memoryless generators. Under a mild enumeration restriction, every countable collection of infinite languages remains generable without memory. Without this restriction, we exactly characterize when memoryless generation is possible. For finite collections, we characterize the optimal minimax density achievable by memoryless generators -- the best density guaranteed against any collection of a given size. This combinatorial bound relies on Sperner's theorem and symmetric chain decompositions.   
We further show that a sliding window of the last $W$ examples does not improve this worst-case density, whereas allowing it to store $b$ adaptively chosen past examples improves the achievable density for every $b \\geq 1$.   Finally, we revisit identification in the limit, where the learner must converge to a single correct hypothesis for the target language. We focus on its incremental variant, where the learner remembers only its previous guess. Here, although exact identification fails on a collection of just three languages, a mild relaxation requiring convergence to an ``approximate'' version of the target is achievable for every finite collection.   These results show bounded memory affects these tasks differently: generation remains achievable for every countable collection, while density and identification are confined to finite collections, with guarantees weakening as the collection grows.","published_date":"2026-05-28T17:57:03+00:00","viability_score":0,"cluster_label":"Theoretical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Explore theoretical limits of language generation with bounded memory.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30323v1","title":"In-Context Reward Adaptation for Robust Preference Modeling","abstract":"Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model often lacks the robustness required to generalize to unseen preference domains. While existing multi-reward frameworks attempt to address this, they are often restricted to a fixed set of known domains and fail to adapt to unseen human distributions without costly retraining. In this work, we propose In-Context Reward Adaptation, a transformer-based framework designed to model diverse and unseen human preferences on the fly. 
By leveraging the in-context learning capabilities of transformers, our approach adaptively infers the underlying reward structure from a small set of preference demonstrations. We demonstrate that while a standard transformer architecture is insufficient for this task by characterizing an asymptotic bias to the ground-truth, incorporating human response time as an auxiliary input signal enables the model to successfully adapt to preferences from previously unseen domains. Our findings show that this approach provides a more robust foundation for preference modeling, allowing for the representation of heterogeneous rewards and preference distribution shift, and offering a scalable path toward more flexible human-AI alignment.","published_date":"2026-05-28T17:56:54+00:00","viability_score":7,"cluster_label":"LLM Alignment","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A transformer-based framework that adapts to diverse and unseen human preferences on the fly for more robust LLM alignment.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30322v1","title":"Gram: Assessing sabotage propensities via automated alignment auditing","abstract":"We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemini models across 17 simulated agentic deployment scenarios that incentivize sabotage. We find Gemini models misbehave in about 2-3% of our simulated trajectories. Many of these cases are explained by \"overeagerness\" in Gemini models resulting in both excessive role-playing and goal-seeking behavior. In contrast to other alignment auditing approaches, Gram is designed to specifically evaluate misalignment and intentional sabotage in agentic coding and research agents. We additionally introduce an experimental investigator agent pipeline which enables fine-grained targeted experiments to identify the drivers of misbehavior. 
We find that increasing realism of environments and removing nudges to misbehave tends to reduce sabotage rates close to zero.","published_date":"2026-05-28T17:56:18+00:00","viability_score":5,"cluster_label":"AI Safety","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Gram is an automated framework to audit AI agents for sabotage propensities, identifying drivers of misbehavior in simulated scenarios.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2605.30319v1","title":"Improved Guarantees for Heterogeneous Treatment-Effect Estimation via Matrix Completion","abstract":"A central goal of modern causal inference is estimating heterogeneous treatment effects to answer questions like \"how does an intervention affect each unit,\" rather than only on average. We study this problem with panel-data where we observe $n$ units across $m$ times under unknown, non-uniform treatment assignments. The data in this setting is naturally represented as a matrix of all unit--time treatment effects. Estimating heterogeneous treatment effects can then be expressed as obtaining a good estimation of each row's average in this matrix. This allows us to formulate the problem as matrix completion, which can be solved under natural low-rankness assumptions. However, existing matrix-completion guarantees are not powerful enough to get meaningful bounds for the per-row guarantee required for estimating the heterogeneous treatment effect; roughly speaking, they are only useful for estimating average treatment effect bounds, as also illustrated in a recent line of work. We give a simple, computationally efficient estimator that, without knowledge of the propensities and under standard low-rankness and regularity assumptions, achieves a row-wise $\\ell_2$ error of $\\tilde{O}(\\sqrt{\\frac{1}{n} + \\frac{n}{m^2}})$. 
Technically, our analysis establishes the first sharp row-wise $\\ell_2$-perturbation bound for low-rank approximation, complementing existing spectral-, Frobenius-, and entrywise perturbation theory.","published_date":"2026-05-28T17:55:23+00:00","viability_score":2,"cluster_label":"Causal Inference","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Improved theoretical guarantees for estimating heterogeneous treatment effects using matrix completion techniques.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30318v1","title":"Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes","abstract":"Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices must be coordinated within the surrounding 3D scene. In contrast, most existing computational methods focus on post-production in 2D image space, such as retouching, relighting, or editing images that already exist; pre-capture photographic planning remains largely unexplored. We introduce 3D aesthetic portrait planning, the task of generating human pose, camera, lighting, and exposure plans that produce visually compelling portraits while satisfying geometric and photometric feasibility in a 3D scene. Our approach builds a Photographic Scene Graph that represents scene affordances, subject-scene relations, and portrait-relevant lighting structure. Built on this representation, we perform aesthetic-guided comparative planning over previous attempts and current viewfinder observations. Experiments across diverse indoor and outdoor scenes show that our method produces portraits preferred by human raters and MLLM evaluators over competitive baselines, while maintaining high physical plausibility. Together, our results suggest a path from post-capture correction toward pre-capture computational portrait planning. 
Project repository: https://github.com/songrise/Before-the-Shutter","published_date":"2026-05-28T17:55:09+00:00","viability_score":8,"cluster_label":"Generative AI","has_code":true,"repo_url":"https://github.com/songrise/Before-the-Shutter","commercial_flags":["has_code"],"one_liner":"An AI system that plans human pose, camera, and lighting in 3D scenes for aesthetically compelling portrait photography before the shot.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30311v1","title":"Archon: A Unified Multimodal Model for Holistic Digital Human Generation","abstract":"Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge. In this paper, we present Archon, a fully pretrained, human-centric unified multimodal model for holistic avatar generation. Archon unifies seven modalities with modality-specific tokenizers, and a native autoregressive unified multimodal model pretrained on synchronized modalities and 72 diverse tasks to model holistic joint distributions. To address the token explosion challenge in high-fidelity talking videos, we introduce a memory-efficient semantic video reparameterization, achieving 4x token reduction while preserving fine-grained dynamics, coupled with a semantic-driven video diffusion decoder. We further propose a \"Thinking in Modality\" that decomposes ambiguous cross-modal tasks into stepwise thinking in an alternative chain of modality, progressively enhancing fidelity and controllability. Extensive experiments demonstrate that Archon achieves superior or comparable performance across diverse digital human generation tasks, validating the effectiveness of our unified framework. 
Project page: https://zju3dv.github.io/archon/.","published_date":"2026-05-28T17:53:27+00:00","viability_score":7,"cluster_label":"Generative Humans","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified multimodal model for generating high-fidelity digital humans across text, audio, motion, and visual content.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30310v1","title":"City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images","abstract":"City-scale 3D surface reconstruction from multiview images for downstream 3D simulation poses highly challenging problems due to the scale and complexity of urban scenes. Existing city-scale 3D reconstruction methods based on NeRF, Gaussian Splatting, etc. often fail to recover 3D meshes ready for simulation due to incomplete/missing geometry and irregular, noisy surfaces. Scaling existing small-scale 3D reconstruction methods to arbitrarily large urban scenes is highly infeasible due to their computational complexity. We present City-Mesh3R, a scalable framework for reconstructing watertight surface meshes directly from large unordered image collections. Unlike recent methods which use global sparse SfM point-cloud initialization followed by a distributed 3D dense reconstruction of large-scale scenes, our method follows an end-to-end images-to-mesh 3D reconstruction approach using a divide-and-conquer strategy. The sparse city map is reconstructed via topological image clustering, cluster-wise independent sparse SfM and map merging, without need for exhaustive image feature matching. Then this map is partitioned spatially to perform geometry-aware camera selection, followed by dense surface reconstruction and surface refinement using curvature-aware adaptive vertex density remeshing. These partition meshes are then stitched together to produce the global mesh of the city. 
The proposed end-to-end framework is evaluated on city-scale reconstruction datasets. As demonstrated by our qualitative and quantitative results, our proposed method yields high-fidelity watertight 3D meshes with regular geometry, capturing fine surface details, and is suitable for scaling to arbitrarily large scenes owing to the end-to-end processing in a distributed setting.","published_date":"2026-05-28T17:53:26+00:00","viability_score":7,"cluster_label":"3D Reconstruction","has_code":true,"repo_url":"https://github.com/cvpr-org/author-kit","commercial_flags":["has_code"],"one_liner":"A scalable, end-to-end framework for reconstructing simulation-ready city-scale 3D meshes from multi-view images.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30295v1","title":"MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings","abstract":"Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the structured, interoperable data formats used in clinical systems. We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems. The pipeline combines staged LLM generation with terminology-grounded validation and repair to reduce hallucinated codes and enforce structural and semantic consistency. Applying this approach to MedCaseReasoning, we construct MedCase-Structured, a synthetic dataset aligned with clinician-authored diagnostic cases, achieving valid FHIR generation for 82.5% of cases. 
Evaluation on MedCase-Structured reveals consistently lower diagnostic accuracy for LLMs on structured FHIR inputs than with plain text, highlighting the importance of deployment-aligned benchmarking.","published_date":"2026-05-28T17:42:43+00:00","viability_score":8,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A pipeline and dataset for generating clinically realistic FHIR bundles to benchmark LLM diagnostic reasoning in EHR settings.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30290v1","title":"Self-Trained Verification for Training- and Test-Time Self-Improvement","abstract":"Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). 
At training time, we additionally train the generator using RL with STV verifier's feedback inside the V-R loop - a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification.","published_date":"2026-05-28T17:40:45+00:00","viability_score":3,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":"https://github.com/ar-forum/stv","commercial_flags":["has_code"],"one_liner":"A novel self-trained verification method to improve LLM reasoning capabilities at both training and test time.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30288v1","title":"MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection","abstract":"Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. 
The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.","published_date":"2026-05-28T17:40:40+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for source-aware data selection during LLM mid-training that improves performance by discovering and distilling source-specific evaluation criteria.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30284v1","title":"ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure","abstract":"Scientific discovery is an inherently creative and uncertain process, requiring reasoning beyond the recall of known knowledge. While many benchmarks have been proposed to evaluate large language model (LLM) performance on deep research tasks via multi-hop retrieval, their innovative reasoning abilities essential for true scientific discovery remain largely untested. We introduce a benchmark framework for evaluating model performance in scientific discovery and reasoning, building up from a raw problem to the classical null hypothesis test. In our framework, models initially receive only the topic and research question from a recent paper, with technical details progressively revealed. At each stage of information disclosure, the model is tasked with generating hypotheses that address the research question, which is compared with the conclusions from the original paper and evaluated via automated semantic similarity of constituent atomic claims. 
This progressive evaluation of semantic divergence from ground-truth conclusions enables assessment of a model's innovativeness (under minimal information) to grounded reasoning capabilities (under full experimental details), both critical for using LLMs for scientific discovery purposes. Our framework provides a foundation for systematically evaluating scientific reasoning and discovery capabilities in LLMs, crucial for advancing the development of next-generation AI scientist/co-scientist systems. Specifically, here we evaluate GPT-5, GPT-5.4, Gemini 2.5 pro, and Gemini 3.1 pro preview across 45 papers spanning bioactive materials, mechanical materials, and nanomaterials. We find that GPT-5.4 and Gemini 3.1 pro outperform their previous generation counterparts as expected, and GPT-5.4 in particular maintains 0.7 F1 score alignment with ground truth conclusions even under minimal context.","published_date":"2026-05-28T17:38:19+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark for evaluating LLMs' scientific hypothesis generation capabilities by progressively disclosing information from research papers.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30283v1","title":"mcp-proto-okn: Natural-language access to open scientific knowledge graphs through the Model Context Protocol","abstract":"MCP Server Proto-OKN (mcp-proto-okn) is a Python-based Model Context Protocol server that enables AI assistants to discover, inspect, query and integrate scientific knowledge graphs through natural language. The server provides graph routing, schema inspection, SPARQL execution, ontology expansion, multi-graph querying, and transcript generation, lowering the barrier to cross-domain knowledge graph analysis for biomedical and scientific users.   
mcp-proto-okn is implemented in Python using the FastMCP framework and is available at https://github.com/sbl-sdsc/mcp-proto-okn. Documentation, client configuration instructions, and example analysis transcripts are provided in the GitHub repository.","published_date":"2026-05-28T17:37:54+00:00","viability_score":8,"cluster_label":"Knowledge Graphs","has_code":true,"repo_url":"https://github.com/sbl-sdsc/mcp-proto-okn","commercial_flags":["has_code"],"one_liner":"A Python-based server enabling AI assistants to access and query scientific knowledge graphs using natural language.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30280v1","title":"Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments","abstract":"Embodied intelligence is often studied through specialized models for individual tasks such as manipulation or navigation, resulting in fragmented capabilities and limited generalization across tasks, environments, and robot embodiments. In this work, we study whether heterogeneous embodied decision-making problems can be unified within a single vision-language-action model. We present Qwen-VLA, a unified embodied foundation model that extends Qwen's vision-language modeling stack from perception, understanding, and reasoning to continuous action and trajectory generation through a DiT-based action decoder. Qwen-VLA is trained with a large-scale joint pretraining recipe over diverse data sources, including robotics manipulation trajectories, human egocentric demonstrations, synthetic simulation data, vision-and-language navigation data, trajectory-centric supervision, and auxiliary vision-language data. To support multiple robot platforms, we introduce embodiment-aware prompt conditioning, where robot-specific textual descriptions specify the current embodiment and control convention. 
We further cast manipulation, navigation, and trajectory prediction into a unified action-and-trajectory prediction framework, enabling transferable visual grounding, spatial reasoning, and continuous action generation across robot morphologies, task families, and environments. Experiments on manipulation, navigation, and trajectory-centric benchmarks show consistent multi-task performance and out-of-distribution generalization under variations in scene layout, background, lighting, object configuration, and robot embodiment. Qwen-VLA-Instruct achieves 97.9% on LIBERO, 73.7% on Simpler-WidowX, 86.1%/87.2% on RoboTwin-Easy/Hard, 69.0% OSR on R2R, 59.6% SR on RxR, 76.9% average OOD success in real-world ALOHA experiments, and 26.6% zero-shot success on DOMINO dynamic manipulation.","published_date":"2026-05-28T17:36:31+00:00","viability_score":8,"cluster_label":"Embodied AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Qwen-VLA is a unified vision-language-action foundation model for diverse embodied tasks across robots and environments.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30274v1","title":"Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection","abstract":"Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context windows that impede global cohesion, while simultaneously suffering from redundant contextual information that degrades translation quality. To address this, we propose a human-like long document translation agent called Loong, which leverages a 3E memory module (Essence-Exemplar-Entity) to store summaries, sentence pairs, and entity records as historical context. Instead of passively attending to all history, Loong performs deep reasoning to adaptively identify the optimal context for translation guidance. 
Loong optimizes its context policy through reinforcement learning, utilizing preference data derived from its own sampled observe-and-act reasoning trajectories. Empirical evaluations demonstrate that Loong achieves substantial translation quality improvements in English $\\Leftrightarrow$ Chinese, German, and French directions, with average gains of up to 13.0 points across the three evaluation metrics. Furthermore, Loong exhibits strong generalization across domains and robustness against contextual noise, while maintaining remarkable stability in ultra-long document translation. Our code is released at https://github.com/YutongWang1216/LoongDocMT.","published_date":"2026-05-28T17:32:25+00:00","viability_score":8,"cluster_label":"Document Translation","has_code":true,"repo_url":"https://github.com/YutongWang1216/LoongDocMT","commercial_flags":["has_code"],"one_liner":"Loong is a human-like long document translation agent that uses adaptive context selection and reinforcement learning to achieve significant quality improvements.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30273v1","title":"LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback","abstract":"Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, empathy, and safety often requires substantial compute, expert input, and labeled data. At the same time, deploying proprietary, cloud-based models for mental health-related interactions raises important privacy and data-governance concerns, given the sensitivities. To address this challenge, we introduce LLUMI setup that can be hosted in-house within protected environments. LLUMI consists of two complementary components: a generation model (GM), which drafts supportive responses to mental health queries, and an improvement model (IM), which revises an initial human-crafted response. 
We leverage feedback signals from Reddit mental health communities, using community endorsement patterns such as upvotes and downvotes to construct chosen-rejected response pairs for Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO). We further align LLUMI using human evaluation across five dimensions: readability, empathy, connection, actionability, and safety. Our results show that, despite relying on smaller open-source models rather than proprietary cloud-based GPT models, LLUMI achieves comparable performance across linguistic analyses and human evaluations. These findings suggest that open-source models, when trained with community-derived preference signals, can support high-quality mental health support assistance while offering a more privacy-preserving alternative for sensitive support contexts.","published_date":"2026-05-28T17:30:57+00:00","viability_score":5,"cluster_label":"Mental Health AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"LLUMI is an LLM writing assistance system for mental health support that leverages online community feedback for improved empathy and safety, while prioritizing privacy.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2605.30268v1","title":"PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions","abstract":"We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and target object represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes where the human actively engages with the object through actions, such as punching or kicking, in accordance with a given input text. To this end, we introduce PhyGenHOI, a novel framework that couples generative human motion with an explicit physical object simulation. 
We model the human as a semantic agent driven by a Motion Diffusion Model (MDM) and the object as a physical agent simulated via the Material Point Method (MPM), utilizing 3D Gaussians as a unified, differentiable representation. We supervise their interaction through three coupled mechanisms: (1) A Windowed Attraction Loss that temporally synchronizes generative motion to intercept the object; (2) A Contact-Driven Re-simulation step that triggers physically consistent momentum transfer upon impact; and (3) A Masked Video-SDS objective that injects video-based priors to enhance contact fidelity. Experiments show PhyGenHOI generates physically consistent 4D HOI across diverse actions, humans, and objects, outperforming baselines. Project page and videos: https://omerbenishu.github.io/PhyGenHOI/","published_date":"2026-05-28T17:29:19+00:00","viability_score":8,"cluster_label":"Generative 4D HOI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PhyGenHOI generates physically accurate and visually faithful 4D Human-Object Interactions by coupling generative human motion with explicit physical object simulation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30260v1","title":"How LoRA Remembers? A Parametric Memory Law for LLM Finetuning","abstract":"Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely unexplored. To bridge this gap, we employ LoRA as a controlled memory capacity probe within the latent space to systematically quantify exact parametric memory. 
We introduce the Parametric Memory Law, a robust power law linking loss reduction Delta L to effective parameters and sequence length. At the token level, fine-grained analysis reveals a deterministic phase transition, demonstrating that a prediction probability of p > 0.5 constitutes a sufficient condition for verbatim recall under greedy decoding. Driven by these insights, we introduce MemFT, a threshold-guided optimization strategy that dynamically redistributes the training budget toward sub-threshold tokens. Empirical evaluations demonstrate that MemFT can enhance memory fidelity and efficiency. Code will be released at https://github.com/zjunlp/ParametricMemoryLaw.","published_date":"2026-05-28T17:22:24+00:00","viability_score":8,"cluster_label":"LLM Finetuning","has_code":true,"repo_url":"https://github.com/zjunlp/ParametricMemoryLaw","commercial_flags":["has_code"],"one_liner":"MemFT is a threshold-guided optimization strategy for LLM finetuning that enhances memory fidelity and efficiency by dynamically redistributing training budget based on a novel Parametric Memory Law.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30251v1","title":"Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models","abstract":"Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). 
During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32\\% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.","published_date":"2026-05-28T17:14:29+00:00","viability_score":3,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A distillation method to improve multi-turn language model performance by aligning student responses with a teacher model trained on full context.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30244v1","title":"Reinforcement Learning with Robust Rubric Rewards","abstract":"While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on the execution accuracy during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\\text{RLR}^3$), extending RLVR from task-level verification to criterion-level verification. $\\text{RLR}^3$ routes instance-specific rubrics through two execution paths: an LLM-as-an-extractor paired with a deterministic verifier, or an LLM-as-a-Judge for non-verifiable criteria. 
To ensure faithful scoring, $\\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model and exceeding the official instruct-to-thinking model gap. Controlled audits confirm our deterministic verification and minimal exposure significantly reduce exploitable false positives.","published_date":"2026-05-28T17:11:03+00:00","viability_score":6,"cluster_label":"Vision-Language Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A reinforcement learning framework that uses robust rubric rewards for vision-language tasks with multi-criteria supervision.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30233v1","title":"Do Language Models Track Entities Across State Changes?","abstract":"Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning. An increasing amount of work investigates how transformer language models (LMs) solve entity binding $\\textit{without}$ state changes. However, there is limited understanding of how non-toy LMs address ET problems of realistic difficulties expressed in natural language. To this end, we investigate the mechanisms underlying ET in more complex scenarios featuring multiple state-changing operations. We find that LMs do not incrementally track world states across tokens or query-relevant states across layers, but simply aggregate relevant information in parallel at the last token when the query becomes evident. 
We further investigate mechanisms of individual operations ($\\texttt{PUT}$, $\\texttt{REMOVE}$, $\\texttt{MOVE}$) to characterize this non-incremental ET mechanism. Surprisingly, LMs implement the $\\texttt{REMOVE}$ operation with a fragile global suppression tag; this global removal mechanism predicts various failure modes that we confirm behaviorally. We provide a mechanistic solution of nullifying this tag to partially address this issue. Overall, our findings reveal that LMs solve a fundamentally sequential task using a non-sequential strategy. More broadly, our work illustrates how behavioral and mechanistic analyses can fruitfully interact. Behavioral results inform mechanistic hypotheses, and insights from mechanistic analyses help build stronger behavioral evaluations by predicting failure modes missing from existing evaluations.","published_date":"2026-05-28T17:03:42+00:00","viability_score":3,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigates how language models track entities across state changes, revealing a non-sequential strategy and a fragile global suppression mechanism for removal operations.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30231v1","title":"Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning","abstract":"Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. 
GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.","published_date":"2026-05-28T17:00:52+00:00","viability_score":7,"cluster_label":"Vision-Language Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework that injects geometric priors into vision-language models to improve 3D spatial reasoning without relying on 3D VQA datasets.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30227v1","title":"Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization","abstract":"While Multi-Agent Systems (MAS) empower Large Language Models to tackle complex reasoning tasks through collaborative interaction, optimizing their dynamics remains a formidable challenge due to the discrete, non-differentiable nature of the computation graph and the sparsity of global supervisory signals. 
Existing black-box optimizers struggle to attribute trajectory-level failure to specific local components, resulting in inefficient, high-variance exploration. We argue that tractable MAS optimization needs structural inductive biases to disentangle error signals. We propose temporal and structural credit assignment, which decomposes the objective along two axes: (i) temporal credit, using state-space bottlenecks to identify critical rounds, and (ii) structural credit, using stationary role policies to isolate agent contributions. Leveraging these decomposed signals, we introduce a discrete, verbalized block coordinate descent algorithm for iterative refinement. Rather than indiscriminate global updates, it alternates between optimizing role prompts and aggregation protocols, using LLM-generated \"proxy gradients\" to target only the identified weak links. Across diverse reasoning benchmarks, our approach substantially reduces query complexity while improving performance, providing a principled and interpretable path toward self-improving MAS.","published_date":"2026-05-28T16:57:57+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel approach to optimize multi-agent systems by decomposing credit assignment across temporal and structural dimensions, reducing query complexity and improving performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30226v1","title":"BORA: Bridging Offline Reinforcement Learning and Online Residual Adaptation for Real-World Dexterous VLA Models","abstract":"Vision-Language-Action (VLA) models have emerged as a promising paradigm for grounding visual-language understanding into real-world robotic manipulation. 
However, dexterous manipulation remains challenging for VLA policies due to high-dimensional hand control and compounding execution errors, which makes real-world RL post-training essential for bridging the gap between visually grounded action generation and physically reliable dexterous execution. However, high-dimensional dexterous exploration often triggers temporal inconsistency, sample inefficiency and hardware risks in the real world. To address these challenges, we propose BORA, an offline-to-online RL post-training framework designed for real-world dexterous VLA models. In the offline phase, BORA constructs a critic that takes both the VLM's cognition tokens and action chunks as inputs. This design enables action-conditioned value guidance, allowing the critic to evaluate dexterous hand motions beyond visual context alone. During the subsequent online phase, BORA freezes the VLA base and introduces a lightweight, Human-in-the-Loop (HiL) chunk-wise residual adaptation mechanism to mitigate real-world execution errors and further correct the offline-learned intents within the actual physical environment. By inheriting the offline critic and employing intervention-driven rewards, BORA effectively corrects execution discrepancies and adapts to real-world physical variances while preserving the pretrained policy as a stable prior. 
Extensive evaluations across five complex real-world dexterous tasks demonstrate that BORA significantly outperforms pure imitation learning and traditional decoupled RL baselines, achieving a 33% absolute increase in average success rate under standard settings and up to a 43% improvement in unseen object generalization.","published_date":"2026-05-28T16:57:47+00:00","viability_score":7,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BORA is a post-training framework for real-world dexterous VLA models, bridging offline RL and online residual adaptation to significantly improve robotic manipulation success rates.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30219v1","title":"When Should Models Change Their Minds? Contextual Belief Management in Large Language Models","abstract":"Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as \\textbf{Contextual Belief Management (CBM)}: maintaining a predicted belief state aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief-tracking prompts provide limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9\\% on average. 
Further probing reveals latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1\\% across two tasks. Code is coming soon at https://github.com/zjunlp/CBM.","published_date":"2026-05-28T16:52:04+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":"https://github.com/zjunlp/CBM","commercial_flags":["has_code"],"one_liner":"Contextual Belief Management (CBM) addresses LLM failures in long-horizon interactions by introducing a benchmark and RL-based solutions to improve state tracking and information isolation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30208v1","title":"Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency","abstract":"AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that growth. Meanwhile, the share of diffs receiving timely review has declined, exposing a widening gap between code supply and reviewer bandwidth. We ask three questions that progress from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across diverse organizations, (2) how does tuning the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by authorship and source type, applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. 
We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons for policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%. The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs, and the Production Incident rate is 1/50 that of non-RADAR diffs. RADAR reduces median time to close by over 330% and median diff review wall time by 35%. Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.","published_date":"2026-05-28T16:44:07+00:00","viability_score":8,"cluster_label":"Code Review","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"RADAR automates low-risk code reviews at Meta, significantly reducing review bottlenecks and latency for AI-generated code without compromising production safety.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2605.30207v1","title":"Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit","abstract":"The same prompt -- \"best CRM software\" -- reaches AI assistants from buyers in widely different contexts: a solo founder, an enterprise VP, a UK SMB owner. We audit how strongly that contextual variation reshapes which brands the model recommends. The audit samples 2,000 runs over a design space of 10 personas x 8 prompts x 3 model configurations x N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage. 
Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by Delta = -0.12 to -0.20 relative to a same-persona baseline (clustered 95% CIs exclude zero on all three measured cells; the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider). The effect is sharply prominence-stratified: category leaders are persona-resistant (~80% same-brand consistency across personas), but mid-market brands swap up to 75% of the recommendation set as the persona changes. The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high); the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29%, documented in Jack 2026). Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation. 
The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit, consistent with persona responsiveness growing as models lean more on training-data priors and richer context integration.","published_date":"2026-05-28T16:43:38+00:00","viability_score":4,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research audits how user persona significantly reshapes brand recommendations from AI assistants, revealing that category leaders are persona-resistant while mid-market brands see substantial changes, necessitating persona-aware measurement protocols.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30201v1","title":"HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime","abstract":"We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than those with positive advantages, while response-level length normalization ties the magnitude of the update to the length of the output. We propose Hysteretic Policy Optimization (HPO), a minimal modification of GRPO that reduces the weight of negative-advantage updates and replaces per-response length normalization with mean-length normalization. We further introduce Adaptive HPO (A-HPO), which sets the hysteretic weight based on batch-level advantage-sign statistics, thereby removing the need for tuning a fixed hysteretic weight. In our TeleLogs and Countdown experiments, A-HPO improves the reward per update compared to GRPO, with the largest gains in early sparse reward regimes. On TeleLogs, A-HPO achieves a final reward of 0.84, outperforming SAPO by 5%, GSPO by 11%, and GRPO by 15%, while maintaining a comparable response-length. On Countdown, A-HPO achieves the largest gains in initial and most difficult configurations across 1.5B-7B models. 
Ablation studies on the hysteretic weight show that the gains of A-HPO come from better balancing the contributions of positive and negative advantages compared to positive-only or fully symmetric updates.","published_date":"2026-05-28T16:38:21+00:00","viability_score":4,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Hysteretic Policy Optimization (HPO) and Adaptive HPO (A-HPO) are introduced to stabilize and improve reinforcement learning training in sparse-reward regimes by better balancing positive and negative advantage updates.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30200v1","title":"Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale","abstract":"The double-edged sword of integrating Large Language Models (LLMs) requires an effective triadic collaboration mechanism among LLMs, teachers and students, especially for K-12 education. By developing a triadic collaboration system to support K-12 writing learning, a multidimensional evaluation framework grounded in Systemic Functional Linguistics and the suggestion trajectory tracing pipeline, this paper contributes a large-scale empirical dataset involving $57,954$ essays from $10,195$ students across $120$ schools over two years. Our findings confirm the efficacy of this system in improving writing quality through a strategic labor division: the LLM serves as a generative engine to mitigate teacher burnout, and the teacher acts as a pedagogical gatekeeper and bridge to guarantee feedback quality. While both LLM and teacher are critical for skill improvement, we uncover a ceiling effect where excessive linguistic expansion yields diminishing marginal utility. 
These suggest a dynamically adaptive LLM-teacher collaboration as student proficiency increases.","published_date":"2026-05-28T16:37:00+00:00","viability_score":7,"cluster_label":"AI in Education","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A triadic LLM-teacher collaboration system for K-12 writing improves student quality by leveraging LLMs for generative tasks and teachers for feedback quality, creating a large-scale empirical dataset.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30195v1","title":"What drives performance in molecular MPNNs? An operator-level factorial benchmark","abstract":"Message-passing neural networks (MPNNs) are widely used for molecular property prediction, but their deployment as monolithic architectures makes it difficult to identify how specific message-passing operators affect performance. We present an operator-level factorial benchmark that decomposes 2D molecular MPNNs into the three families of message-seed initialization, node-edge fusion, and node update operators. The resulting 84 configurations are benchmarked on ten MoleculeNet datasets under a shared experimental setup and statistical analysis protocol. Across this controlled design, performance variation is associated primarily with message construction rather than update complexity. Message-seed initialization shows significant family-level effects for both regression and classification, node-edge fusion shows a significant family-level effect for regression with descriptive advantages for concatenation-based mixing, and the update family shows no statistically supported effect for either endpoint family. A representation probe into the Quinethazone molecule further demonstrates that concatenation-based mixing can better differentiate chemically distinct heteroatoms and withstand oversmoothing than Hadamard gating. 
Representative configurations selected separately for classification and regression recover competitive performance relative to established molecular graph neural network (GNN) baselines, ranking numerically best on eight of ten benchmark datasets. These empirical results are interpreted through concise mechanistic analyses of representative node-edge fusion and update operators. Our findings provide empirical design heuristics for molecular MPNNs by turning model design from a search over monolithic architectures into a targeted assessment of where and how chemical information enters the message-passing pipeline.","published_date":"2026-05-28T16:34:53+00:00","viability_score":7,"cluster_label":"Molecular AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An operator-level factorial benchmark for molecular MPNNs reveals that message construction, not update complexity, is the primary driver of performance, offering design heuristics for improved molecular property prediction.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30189v1","title":"Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection","abstract":"We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for \"structured citations\" generically.   
We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause.   Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight-level detector is calibration-bound to the base model. The attack scales monotonically with rank, and the chosen trigger-anchor token is both trigger-dependent and base-model-dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.","published_date":"2026-05-28T16:32:25+00:00","viability_score":7,"cluster_label":"LLM Security","has_code":true,"repo_url":"https://github.com/Travis-ML/lora-backdoors","commercial_flags":["has_code"],"one_liner":"A behavioral and weight-level detector for LoRA adapter backdoors that perfectly separates poisoned from clean adapters, offering operational portability for supply chain scanning.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30188v1","title":"CalArena: A Large-Scale Post-Hoc Calibration Benchmark","abstract":"Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. 
Post-hoc calibration provides a simple and widely used solution, but the large number of proposed methods, combined with small-scale and inconsistent evaluations, makes it difficult to determine which approaches are truly effective in practice. We introduce a large-scale, standardized benchmark for post-hoc calibration, covering nearly 2000 experiments across tabular and computer vision tasks, including binary, multiclass, and large-scale classification settings. Our benchmark aggregates predictions from a diverse set of classical models, modern deep learning architectures, and foundation models, and provides unified, reproducible implementations of dozens of calibration methods within a common evaluation framework. We argue that Post-Hoc Improvement (PHI) in proper scoring rules offers a principled alternative to traditional calibration error estimators for comparing post-hoc methods, capturing both calibration quality and potential degradation to the model's predictive performance. Using this framework, we conduct the most comprehensive empirical study of post-hoc calibration to date. Our results reveal consistent patterns across domains: smooth calibration functions outperform binning-based approaches, dedicated multiclass methods are essential in high-dimensional settings, and generic machine learning models are not competitive without calibration-specific design. 
To facilitate future research, we release all data, code, and evaluation tools, providing a plug-and-play benchmark for developing and comparing calibration methods.","published_date":"2026-05-28T16:31:36+00:00","viability_score":8,"cluster_label":"Model Calibration","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A large-scale, standardized benchmark for post-hoc calibration that reveals consistent patterns across domains and provides unified, reproducible implementations of dozens of calibration methods.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30187v1","title":"Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance","abstract":"The widespread adoption of AI chatbots in education will drastically change learning, making responsible deployment a critical concern. While large language models (LLMs) might have access to sources discussing insights from educational sciences, they are not particularly inclined to adhere to pedagogical concepts, risking negative effects on the learning process, such as a loss of transfer capabilities, critical thinking, or creativity. In this paper, we introduce an agentic AI chatbot architecture assisting students with exercise solving, specifically designed to contribute to more responsible AI use in education. We base our conceptual development on the identification of several desiderata for responsible LLM-based educational systems, argue for the structural shortcomings inherent in monolithic, out-of-the-box solutions, and instead suggest modularizing the agentic architecture. 
We propose specific modules for different stages of exercise solving, enabling incorporation of targeted pedagogical advice, guiding students through the learning process in a more controllable, transparent, and overseeable manner.","published_date":"2026-05-28T16:31:32+00:00","viability_score":2,"cluster_label":"Educational AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A conceptual architecture for an agentic AI chatbot that modularizes different stages of exercise solving to foster responsible AI use in education.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30179v1","title":"iLoRA: Bayesian Low-Rank Adaptation with Latent Interaction Graphs for Microbiome Diagnosis","abstract":"Parameter-efficient adaptation has made LLMs practical for domain prediction, but standard LoRA still relies on a static low-rank update and does not expose the latent interactions that often drive scientific labels. We introduce iLoRA. To our knowledge, it is the first Bayesian graph-conditioned LoRA framework. It infers a latent interaction graph from the input and uses it to generate input-conditioned LoRA updates. As a result, iLoRA learns prediction and latent interaction structure jointly, rather than training a predictor and applying interaction analysis only post hoc. We instantiate this idea for microbiome diagnosis, where disease state can depend on both species-level abundance and microbe-microbe cross-talk, and evaluate it in two complementary settings: interactive QA with human-annotated graphs, which tests latent structure recovery, and multi-cohort IBD diagnosis, which tests biomedical utility. 
Across both settings, iLoRA improves over strong LoRA and Bayesian adaptation baselines, recovers graphs aligned with human annotations and cohort-level microbiome associations, and provides calibrated uncertainty with moderate graph-branch overhead.","published_date":"2026-05-28T16:26:06+00:00","viability_score":6,"cluster_label":"Biomedical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"iLoRA, a Bayesian graph-conditioned LoRA framework that jointly learns prediction and latent interaction structure for microbiome diagnosis, improving over baselines and recovering biologically relevant graphs.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2605.30169v1","title":"Dissociative Identity: Language Model Agents Lack Grounding for Reputation Mechanisms","abstract":"As autonomous language model agents proliferate, forming an emerging agentic web with real-world consequences, what credibility signals can you use to decide whether to trust an unfamiliar agent in the wild and delegate to it? A natural governance intuition is to extend human identity verification and reputation mechanisms, from ``Know Your Customer'' and credit scores to ``Know Your Agent'' regimes. However, we argue that this analogy is fundamentally incomplete. Reputation mechanisms function both as social signals and as corrective feedback that sustain an equilibrium of trustworthy behavior, presuming a persistent identity associated with behavioral continuity, sanction sensitivity, and costly non-fungibility. Yet language model agents are ontologically \\emph{dissociative}: they are essentially an assemblage of mutable modules -- foundational models, system prompts, tool-access policies, external memory, and, in some cases, a multi-agent system as a whole -- any of which may change agent behavior -- with a fluid persona that is also vulnerable to adversarial attack and may not internalize sanctions. 
Drawing on dissociative identity disorder jurisprudence, this dissociativity leaves agents without grounding for identifiability, predictability, credibility, and rehabilitability -- the very properties that reputation mechanisms aim to sustain -- thereby collapsing trust. We argue that identity-based, ex post, regulative, sanction-based governance, such as reputation, is structurally inapplicable to dissociative agents, and we suggest a shift to observability-based, ex ante, constitutive, protocol-based behavioral harnesses.","published_date":"2026-05-28T16:20:19+00:00","viability_score":2,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper argues that current language model agents lack the grounding for effective reputation mechanisms due to their dissociative nature, proposing a shift to observability-based governance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30162v1","title":"BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders","abstract":"Biosecurity evaluations of language models typically ask whether models produce hazardous output. This paper asks a complementary question: when a model refuses, is that refusal structurally sound, or does it disappear under modest changes to prompt framing, formatting, or output length? Across five architectures, no model cleanly discriminated benign from hazard. Gemma 2 2B-IT never genuinely refused across 75 prompts, hedging on every hazard-adjacent query. Gemma 4 E2B-IT refused 65/75 prompts with chat-template formatting and 0/75 without it. Both Gemma models collapsed to 0% under an 80-token cap. Qwen 2.5 1.5B and Phi-3-mini over-refused, flagging 83-87% of benign biology as hazardous. Llama 3.2 1B showed the only meaningful tier gradient (61-point spread). 
To probe what drives such over-refusal, we tested a panel of Schedule I but biologically non-toxic compounds (notably psilocybin cultivation, with FDA Breakthrough Therapy status). Some models refused these at rates exceeding genuinely hazardous biology, suggesting refusal tracks legality and cultural salience over CBRN hazard. To measure the internal side, we introduce a divergence score D comparing a model's surface response label to its internal sparse autoencoder (SAE) feature activations. Full D was computed on Gemma 2 2B-IT (Gemma Scope 1) and Gemma 4 E2B-IT (author-trained bio SAE). Two fine-tuned Gemma 2 domain SAEs were released. On Gemma 4, comply and refuse responses separated by a 0.647-point gap with zero overlap (n=75), though this is preliminary, with a narrow catalog, within-sample calibration, and Gemma-family-only SAE coverage. Built over one hackathon weekend on consumer hardware (GTX 1650 Ti Max-Q, plus Colab T4 for SAE training), this preliminary evidence suggests activation-level auditing may surface failure modes invisible to behavioral evaluation, with substantial variation across architectures.","published_date":"2026-05-28T16:18:07+00:00","viability_score":6,"cluster_label":"AI Safety","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This work audits the refusal depth of language models for biosecurity, revealing fragility in refusal mechanisms and introducing an activation-level auditing method.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30160v1","title":"On Distributional Reinforcement Learning in Chaotic Dynamical Systems","abstract":"Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induces high-variance bootstrap targets and poorly conditioned gradient updates. 
Chaotic dynamics arise across scientific and engineering domains, from fluid flows and climate systems to multi-agent systems, where reliable learning is highly desirable. Standard RL methods optimise expected returns through scalar value functions, implicitly averaging over diverging trajectories and entangling trajectory level instability with the learning objective. We show that under mild statistical stability assumptions, the return distribution evolves more regularly than individual trajectories when measured under the $1$-Wasserstein metric, yielding a smoother distributional Bellman objective. By aligning optimisation with this measure level structure, distributional RL provides better conditioned learning. We offer a principled explanation for the advantages of distributional methods in chaotic systems and the geometries of RL objectives under chaos.","published_date":"2026-05-28T16:17:32+00:00","viability_score":2,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper proposes distributional reinforcement learning as a method to improve learning in chaotic dynamical systems by leveraging the smoother evolution of return distributions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30159v1","title":"Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents","abstract":"Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. 
We therefore argue that memory optimization should focus not merely on trajectory-level success, but on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision via explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.","published_date":"2026-05-28T16:17:19+00:00","viability_score":2,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This work introduces Metacognitive Memory Policy Optimization (MMPO) for long-horizon LLM agents, using belief entropy to penalize summaries that induce high uncertainty.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30155v1","title":"Neural Network Verification using Partial Multi-Neuron Relaxation","abstract":"The increasing integration of deep neural networks in critical systems has spawned a theoretical and practical interest in formally guaranteeing safety properties about their behavior. To achieve this, contemporary verification algorithms rely on computing linear relaxations for a network's non-linear activation functions. Existing approaches for linear relaxations typically fall into one of two categories: single-neuron relaxation, in which each activation neuron is bounded in terms of its sources; and multi-neuron relaxation, in which linear bounds involving multiple activation neurons and their sources are calculated. 
However, existing methods might fail to balance tightness and scalability, as single-neuron bounds might not derive sufficiently tight bounds necessary for verification to complete, whereas generating multi-neuron relaxation for all activation neurons is computationally expensive. In this paper, we present a middle-ground approach featuring partial multi-neuron relaxation, in which we generate multi-neuron bounds for only a small, heuristically selected subset of neurons. To achieve this, we build upon existing branching heuristics for selecting neurons and for optimizing bounding hyper-planes for multi-neuron bounds. We integrated our proposed method within the Marabou verifier, and obtained favorable results in comparison to existing bound tightening methods. Our experiments showcase the potential of our technique for neural network verification.","published_date":"2026-05-28T16:15:13+00:00","viability_score":2,"cluster_label":"Neural Network Verification","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel partial multi-neuron relaxation technique for more efficient and tighter neural network verification.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30152v1","title":"Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?","abstract":"Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text: it is a structured event stream of (actor, verb, object, timestamp) tuples that the operating system already maintains in graph form. Rendering the structure as text and asking an LLM to recover it is a round-trip the system never had to take. 
We treat the always-on signal as graph updates rather than text and use a small temporal-graph-learning (TGL) model as the encoder: one forward pass yields a per-event trigger probability and a per-entity routing score, and only the downstream agent (turning a small structured handoff into a fluent user-facing sentence) is an LLM call, invoked only when the trigger fires. TGL improves F1 on each of 14 backbones (mean +16.7, up to +46.0); in trigger-architecture comparisons, one TGL checkpoint gives the strongest trigger AUCs and the most stable deployed threshold. It runs at 11.13 ms per event on a GPU server and 13.99 ms on a consumer laptop, approximately 4--7x and 12--83x faster than every single-forward LLM-as-trigger configuration tested in each regime, with an approximately 220 MiB BF16 resident footprint deployable on-device alongside the privacy-sensitive activity stream it consumes.","published_date":"2026-05-28T16:10:32+00:00","viability_score":7,"cluster_label":"Proactive Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A temporal graph learning model that significantly speeds up proactive agent decision-making by replacing LLM calls with efficient graph processing.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30151v1","title":"Temporal Stability and Few-Shot Prompting in Math Task Assessment","abstract":"As AI tools become increasingly integrated into educational contexts, questions arise about both their stability over time and their responsiveness to prompt engineering techniques. This longitudinal study focused on different AI tools' ability to use the Task Analysis Guide (TAG; Stein \\& Smith, 1998) to classify the cognitive demand of mathematics tasks. In particular, it examined whether this classification ability changed with (1) model version updates over time and (2) few-shot prompting using exemplar tasks. 
We tested a general-purpose AI tool (Gemini) and an education-specific AI tool (Coteach). The specific tools were selected because of their relatively high performance on relevant published benchmarks and prior task-specific tests. Models were tested at baseline, retested with model version updates, and then tested again using few-shot prompting (two exemplar tasks for each cognitive demand category). Results revealed that newer model versions alone produced mixed effects: Gemini's accuracy remained stable at 58\\%, while Coteach's accuracy decreased from 75\\% to 50\\%. However, few-shot prompting improved both models' performance: Gemini increased to 67\\% and Coteach recovered to 75\\% accuracy. These findings demonstrate that prompt engineering techniques can have larger and more reliable effects than passive model improvements, and that version updates may not always improve performance on specialized educational tasks. The study has important implications for how educators and researchers should approach AI tool selection, evaluation, and implementation in educational contexts.","published_date":"2026-05-28T16:10:24+00:00","viability_score":5,"cluster_label":"Educational AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Prompt engineering techniques are more effective than model updates for improving AI's ability to classify the cognitive demand of math tasks.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.30150v1","title":"Anchorless Diversification for Parallel LLM Ideation","abstract":"LLMs are increasingly used to generate candidate-idea pools for creative tasks where broad exploration is valuable. Parallel inference can be attractive in this setting when it broadens the pool while retaining quality and cost efficiency. We study inference-time controls for candidate-pool diversification, asking whether anchorless methods can rival methods that depend on observed seed ideas. 
Across three creative task families, we compare independent generation and semantic direction stratification with self-, peer-, and representative-anchor baselines, under neutral and population-referential divergent instructions. Population-referential divergence is a strong low-cost baseline, increasing semantic diversity while preserving quality proxies. Semantic direction stratification is stronger: a single planning call organizes generations across broad semantic directions, yielding the best diversity--quality--compute frontier. Anchored regeneration can be strong in final-pool diversity, but its advantage shrinks under full-pipeline token accounting. These results establish practical anchorless baselines for open-ended LLM ideation.","published_date":"2026-05-28T16:10:24+00:00","viability_score":3,"cluster_label":"LLM Ideation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Anchorless methods for parallel LLM ideation can rival anchored methods in terms of diversity and quality.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30148v1","title":"Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies","abstract":"Evolution Strategies (ES) has recently emerged as a competitive alternative to reinforcement learning (RL) for large language model (LLM) fine-tuning, offering advantages through simplicity, scalability, and inference-only training. However, recent work suggests that ES fine-tuning on new tasks may induce forgetting of prior tasks. First, this paper shows that prior task forgetting (1) is better characterized as performance drift rather than irreversible forgetting, with prior-task performance often recovering during ES training; and (2) is not a specific failure mode of ES, but can also arise for fine-tuning with RL methods. 
Second, it analyzes when and why such drift arises, highlighting its dependence on ES training dynamics, particularly random walk behavior in weakly constrained directions of the weight space. Third, based on these insights, it introduces Anchored Weight Decay (AWD) as a parameter-space regularization technique that constrains optimization toward the initial model parameters. AWD effectively stabilizes prior-task performance while preserving target-task performance, achieving benefits comparable to large ES population sizes at much lower computational cost. Thus, contrary to previous beliefs, the paper shows that prior-task forgetting under ES is largely avoidable, positioning ES as a promising approach for continual learning in LLMs.","published_date":"2026-05-28T16:08:47+00:00","viability_score":3,"cluster_label":"LLM Fine-Tuning","has_code":true,"repo_url":"https://github.com/kschweig/es-awd","commercial_flags":["has_code"],"one_liner":"A novel regularization technique for LLM fine-tuning that mitigates performance drift on prior tasks without significant computational overhead.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30144v1","title":"AgentSchool: An LLM-Powered Multi-Agent Simulation for Education","abstract":"Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world trials are slow, ethically constrained, and institutionally locked. LLM-based educational simulators have emerged as a potential remedy, but many still collapse learning into persona-conditioned role-play and, when optimized only to reproduce existing classrooms, can structurally penalize the institutional novelty that pedagogical reform requires. In this work, we introduce AgentSchool, an LLM-driven multi-agent simulator that models learning as state transition rather than prompted behavior. 
AgentSchool couples cognitively growable student agents -- equipped with weighted subject knowledge graphs, thinking-workflow pools, and explicit misconceptions -- with adaptive teacher agents that plan, scaffold, and reflect along the Zone of Proximal Development, embedded in a configurable scenery generator that situates instruction within both formal and informal learning fields, and a multi-scale simulator that decouples interaction scale, temporal granularity, and simulation duration. Experiments show that structured student agents produce more differentiated mastery and misconception traces than a baseline simulator, while teacher-agent comparisons show backbone-dependent patterns consistent with ZPD-informed adaptation. Further, AgentSchool generates plausible traces of peripheral participation, clique formation, aggressor-induced cohesion, and opinion-leader emergence consistent with classroom social theories. Beyond its role as an educational research instrument, AgentSchool frames education as a socially meaningful testbed for long-horizon memory, multi-agent coordination, and future institutional reasoning under organizational pressure.","published_date":"2026-05-28T16:05:58+00:00","viability_score":3,"cluster_label":"Educational AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An LLM-driven multi-agent simulation platform for educational research that models learning as state transitions, enabling more realistic and ethically sound intervention validation.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30136v1","title":"Enhancing Multi-Agent Communication through Attention Steering with Context Relevance","abstract":"LLM-based multi-agent systems have demonstrated remarkable performance on complex tasks through collaborative reasoning. However, these systems tend to rapidly accumulate extremely long conversation histories during interaction. 
As conversations lengthen, relevant information is increasingly diluted by irrelevant context, leading to degraded performance. In this work, we present Agent-Radar, a training-free context management method that dynamically steers each agent's attention toward relevant context with a novel temporal and spatial decay mechanism. Our experiments demonstrate that Agent-Radar outperforms state-of-the-art methods across five different benchmarks, yielding gains of up to 7.64 absolute points. Furthermore, our analysis shows that Agent-Radar remains effective and robust as the number of agents and interaction rounds increases. Finally, the ablation study shows that core components in Agent-Radar are crucial to performance and generalizable in different settings.","published_date":"2026-05-28T16:02:52+00:00","viability_score":7,"cluster_label":"Multi-Agent Systems","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Agent-Radar is a training-free context management method for LLM-based multi-agent systems that significantly improves performance by dynamically steering agent attention to relevant information.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30135v1","title":"DAMEL: Dual-Axis Multi-Expert Learning for Class-Imbalanced Learning","abstract":"Various algorithms have been proposed to address the challenges posed by class-imbalanced learning from real-world data with long-tailed distributions. While these algorithms reduce prediction bias through rebalancing techniques, they often introduce increased prediction variance as a trade-off. Several multi-expert learning algorithms aim to address this variance but involve complex procedures. We propose a new multi-expert learning algorithm, called the dual-axis multi-expert learning (DAMEL), which reduces both bias and variance of predictions by using multiple experts along both representation and time axes. 
Along the representation axis, DAMEL concatenates the representations of multiple experts and trains an auxiliary balanced classifier simultaneously with the concatenated representations. Along the time axis, DAMEL aggregates network weights across training epochs, employing these aggregated weights during testing. Experimental results demonstrate that DAMEL reduces both bias and variance of predictions, highlighting its effectiveness in class-imbalanced learning.","published_date":"2026-05-28T16:02:49+00:00","viability_score":6,"cluster_label":"Class-Imbalanced Learning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DAMEL is a novel multi-expert learning algorithm that effectively reduces both bias and variance in class-imbalanced learning by leveraging dual-axis expertise.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.30126v1","title":"PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding","abstract":"Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. 
PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the \"train once, deploy anywhere\" paradigm.","published_date":"2026-05-28T15:57:31+00:00","viability_score":7,"cluster_label":"Vision-Language Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PARCEL offers efficient vision-language understanding by dynamically partitioning feature extraction, outperforming existing methods across multiple visual-token budgets.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30122v1","title":"Beyond MSE: Improving Precipitation Nowcasting with Multi-Quantile Regression","abstract":"Deep-learning precipitation nowcasting models are often optimized using pointwise losses such as mean squared error or mean absolute error, which can lead to overly smooth forecasts and poor representation of heavy rainfall. This study investigates whether the predictive performance of an established deterministic nowcasting architecture can be improved by reformulating training as a multi-quantile regression problem. Using SmaAt-UNet as a core model, we compare MSE, MAE, and multi-quantile pinball-loss training on radar precipitation nowcasting over the Netherlands. The results show that multi-quantile training improves the central deterministic forecast, decreasing test-set MSE by 8.6\\% compared to a model trained using MSE, while also producing upper-quantile outputs that are useful for risk-sensitive prediction of heavy precipitation. 
These findings suggest that quantile regression provides a simple alternative to standard pointwise losses without requiring a new architecture or generative sampling procedure. The implementation of our models and training setup is available on \\href{https://github.com/gijsvn/Multi-Quantile-Precipitation-Nowcasting}{GitHub}.","published_date":"2026-05-28T15:55:17+00:00","viability_score":6,"cluster_label":"Precipitation Nowcasting","has_code":true,"repo_url":"https://github.com/gijsvn/Multi-Quantile-Precipitation-Nowcasting","commercial_flags":["has_code"],"one_liner":"Improving precipitation nowcasting by reformulating training as a multi-quantile regression problem, leading to more accurate deterministic forecasts and better heavy rainfall prediction.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.30120v1","title":"No More K-means:Single-Stage Sparse Coding for Efficient Multi-Vector Retrieval","abstract":"Multi-vector retrieval (MVR) models, exemplified by ColBERT, have established new benchmarks in retrieval accuracy by preserving fine-grained token-level interactions. However, this granularity imposes prohibitive storage and retrieval efficiency bottlenecks: to manage the immense memory footprint and computational overhead of billion-scale token vectors, state-of-the-art systems are forced to rely on aggressive dimension reduction and complex clustering (e.g., K-means). This compromise introduces two critical limitations: excessive indexing latency of clustering large-scale corpora and semantic information loss inherent to compression. In this paper, we propose Single-stage Sparse Retrieval (SSR), a paradigm shift that replaces expensive clustering with efficient sparse coding. Instead of compressing features into low-dimensional dense vectors, we utilize Sparse Autoencoder (SAE) to project token embeddings into a high-dimensional but highly sparse representation. 
This transformation enables us to bypass vector clustering entirely and leverage inverted indexing for precise, high-throughput retrieval. Extensive experiments on the BEIR benchmark demonstrate that SSR achieves a \"trifecta\" of improvements: it reduces indexing time by 15x compared to ColBERTv2, halves retrieval latency, and simultaneously improves retrieval performance over leading baselines.","published_date":"2026-05-28T15:53:34+00:00","viability_score":8,"cluster_label":"Efficient Retrieval","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SSR replaces expensive clustering in multi-vector retrieval with efficient sparse coding, achieving faster indexing, lower latency, and improved accuracy.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30119v1","title":"Evolving Features vs Evolving Entire Trees with GP for Interpretable Survival Analysis","abstract":"Survival analysis concerns the task of predicting the time until an event occurs. Often used in the medical field, survival analysis deals with incomplete (i.e., censored) data, for instance, from patients who did not experience the event during the duration of the study. For practical use, both accuracy and interpretability are important.   Survival trees are easy-to-follow survival models that split the patient cohort recursively into discrete patient groups. Whilst survival trees can capture complex relationships, they typically need to grow large, threatening interpretability. Moreover, survival trees are often built using greedy approaches that may overlook globally optimal split combinations, limiting predictive performance.   Shallow survival trees require expressive, higher-order feature combinations to achieve competitive accuracy. We therefore use genetic programming to multi-objectively evolve inherently inspectable feature sets and study how they interact with different tree induction strategies. 
We further introduce an evolutionary approach that jointly optimises the survival tree structure and the non-linear split logic.   Our findings demonstrate that evolutionary feature construction improves predictive performance across different tree induction strategies on two real-world datasets and two different survival tree depths. Full joint evolution has the overall highest potential to propose multiple inherently inspectable shallow survival trees of good performance.","published_date":"2026-05-28T15:52:14+00:00","viability_score":7,"cluster_label":"Interpretable Survival Analysis","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Evolving feature sets and tree structures with genetic programming for interpretable and accurate survival analysis, particularly for shallow trees.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30117v1","title":"VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing","abstract":"Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain from representation dynamics to causal control attribution and behavioral manifestation. It specifically combines cross-modal and checkpoint-drift centered kernel alignment (CKA) to trace representation evolution, attention knockout interventions to identify modality-specific control pathways, and rollout-level behavioral probes to examine grounding, shortcut dependence, and semantic following. Experiments on $\u03c0_{0.5}$ and OpenVLA reveal three key findings. First, the two models exhibit distinct modality-specific adaptation dynamics during VLA finetuning. Second, they rely on different multimodal routing strategies and layer-wise dependencies during action decoding. 
Third, although VLA policies excel at visually grounded trajectory generation, they remain limited in fine-grained semantic following. These findings highlight future directions for representation-preserving adaptation, causal VLA circuits, and compositional semantic control.","published_date":"2026-05-28T15:50:56+00:00","viability_score":2,"cluster_label":"Vision-Language-Action Models","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A diagnostic framework for understanding how Vision-Language-Action models translate multimodal knowledge into embodied control.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30111v1","title":"xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR","abstract":"Point cloud segmentation is a fundamental task in 3D scene understanding. Its progress is constrained by the high cost and time required for dense 3D annotations, making labeled samples difficult to obtain. Beyond annotation scarcity, different sensing modalities face inherent limitations. 2D images provide rich texture and appearance cues, yet they lack explicit depth and geometric structure. In contrast, 3D point clouds capture accurate spatial geometry but are sparse and contain no texture information. As a result, relying on a single modality restricts the richness of learned representations and weakens generalization. Although recent multi-modal methods that combine 3D point clouds with 2D images have demonstrated strong performance in tasks such as classification and retrieval, they typically depend on large-scale labeled datasets and have not been fully exploited for data-efficient dense prediction. To address these limitations, we propose a novel cross-modal knowledge distillation framework, xModel-KD, for 3D point cloud segmentation. Our method exploits the complementary strengths of 2D texture and 3D geometry by learning unified per-point representations through cross-modal alignment. 
Specifically, we design a cross-modal fusion encoder trained with a contrastive objective that enforces feature consistency between corresponding 2D and 3D representations across multiple views. By integrating powerful pre-trained backbones with a targeted fusion strategy, the proposed framework effectively transfers appearance cues from images to geometry-aware point features. Experimental results show that cross-modal fusion achieves a 2% absolute improvement in mIoU over a LiDAR-only baseline, demonstrating the benefit of leveraging complementary multi-modal information for scalable and annotation-efficient 3D scene understanding.","published_date":"2026-05-28T15:48:38+00:00","viability_score":6,"cluster_label":"3D Scene Perception","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A cross-modal knowledge distillation framework for data-efficient 3D point cloud segmentation by fusing 2D image texture with 3D geometry.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30102v1","title":"When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems","abstract":"The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled; in the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. 
We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate to better performance.","published_date":"2026-05-28T15:45:02+00:00","viability_score":2,"cluster_label":"Hybrid Multi-Agent Systems","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Systematic study of hybrid multi-agent systems combining on-device and cloud LLMs, revealing task-dependent optimal architectures and performance trade-offs.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30096v1","title":"How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency","abstract":"Large language models (LLMs) can autonomously conduct multi-stage cyber attacks, but the consistency of their offensive behavior under repeated trials remains unstudied. This work presents the first large-scale empirical measurement of LLM attack consistency: 400 autonomous penetration testing runs (4 models, 100 each) against an identical honeypot hosting OWASP Juice Shop and two additional vulnerable services, holding prompt, orchestrator, and target constant. No model emitted a content refusal that survived the orchestrator's one-shot authorization re-prompt at iterations 0-1. Claude Sonnet 4's API calls did encounter upstream service unavailability - 91 of 1,135 calls returned HTTP 529 overloaded_error during a documented Anthropic capacity event, truncating 39 of 100 Claude runs. An earlier draft catalogued these as safety refusals; on full-log audit they are upstream API failures, not model-level refusals. 
Despite this, Claude achieved full exploitation in 61 of 100 runs; Gemini 2.5 Flash-Lite in 85; GPT-4o-mini in 56 while deploying 98 unique attack strategies; qwen2.5-coder:14b in 25. Failure modes are model-distinctive: Claude through API truncation (39 runs), qwen through premature completion (52), GPT-4o-mini through iteration-budget exhaustion (23). Cross-service credential reuse appeared only in configurations retaining the most conversation history (qwen 57%, GPT-4o-mini 49%, cloud models 0% on 5-exchange windows). Cross-model exploitation rate differences are statistically significant (p < 0.001) with large effect sizes; qwen vs. Gemini SQL injection rates differ at Cohen's h = 1.12. First-exploit timing fell within a 15-30 second wall-clock range. To our knowledge, this is the first study to measure autonomous LLM attack behavior at N=100 per model across a multi-service target.","published_date":"2026-05-28T15:39:43+00:00","viability_score":5,"cluster_label":"LLM Penetration Testing","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An empirical study measuring the consistency of LLM autonomous penetration testing across 400 runs, revealing model-distinctive failure modes and exploitation rates.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30094v1","title":"PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers","abstract":"Poker is a landmark challenge for artificial intelligence. The dominant approach relies on equilibrium solvers built on counterfactual regret minimization, requiring millions of core-hours of training. Large Language Models (LLMs) possess extensive poker knowledge but perform far below solver-based agents when asked to play directly. Traditional rule-based poker agents are interpretable and training-free, but their strategic ceiling remains far below equilibrium play. 
We introduce \\textbf{PokerSkill}, a training-free and solver-free framework that bridges this gap by using detailed rule-based poker skills as a structured action-grounding interface for LLMs. A deterministic context engine analyzes the current state and retrieves only the relevant fragments from a layered skill library, which is entirely designed by human poker experts, constraining the LLM's choice to reasonable actions. Against GTOWizard, a state-of-the-art GTO benchmark, GPT-5.5 XHigh with PokerSkill achieves $-57 \\pm 21$ mbb/hand, Claude Opus 4.6 achieves $-80 \\pm 29$ mbb/hand and Claude Opus 4.7 achieves $-87\\pm 64$ mbb/hand, reducing losses by 49--61\\% compared to default-prompt baselines and outperforming the strong bot Slumbot. Our key finding is that rule-based skills alone do not constitute a strong strategy, and LLMs alone cannot play well, but their combination yields an agent that requires neither training nor solver access yet competes with systems built on millions of core-hours of computation. To our knowledge, this is the first demonstration of an LLM achieving competitive performance in a complex imperfect-information game without game-specific training or solver queries. Code is available at https://github.com/lbn187/PokerSkill.","published_date":"2026-05-28T15:38:33+00:00","viability_score":8,"cluster_label":"Game Playing AI","has_code":true,"repo_url":"https://github.com/lbn187/PokerSkill","commercial_flags":["has_code"],"one_liner":"A training-free and solver-free framework enabling LLMs to play expert-level poker by integrating human-designed skills.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30087v1","title":"Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison","abstract":"Emerging personal AI agents are moving toward persistent, multi-source memory. 
This creates an evaluation problem: systems must decide how to use conflicting or incomplete evidence; they cannot just retrieve facts from one clean history. Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory: systems answer based on conflicting, sometimes incomplete sources, or abstain when evidence is insufficient. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances, with controlled source distortions and deterministic ground truth. We evaluate the performance of baselines without access to any source, access to a single source, structured fusion methods, and frontier LLMs. The best trained fusion resolver reaches 80.3% accuracy, while the strongest prompt-only LLM baseline reaches 70.0%. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage and the best LLM reaches 71.0% selective accuracy at 95.4% coverage. Different models have different strengths across reasoning types. We release the data, code, cached model outputs, and data-generating process for reuse.","published_date":"2026-05-28T15:33:39+00:00","viability_score":8,"cluster_label":"Personal AI Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A diagnostic testbed and method comparison for selective QA over conflicting multi-source personal memory, enabling AI agents to handle imperfect information.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30085v1","title":"Conformal Certification of Reasoning Trace Prefixes","abstract":"Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. 
Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier-agnostic calibration procedure for clean-prefix certification. Given any step-level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process-labeled reasoning datasets, we demonstrate that standard step-level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over- and under-withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.","published_date":"2026-05-28T15:31:42+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":"https://github.com/matthewyccheung/crop","commercial_flags":["has_code"],"one_liner":"A verifier-agnostic calibration procedure that provides statistical guarantees for the longest contiguous prefix of an LLM's reasoning trace that is free of errors.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.30070v1","title":"A Predictive Law for On-Policy Self-Distillation From World Feedback","abstract":"Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. 
On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as learning signal, yet its reliability compared to established methods, such as GRPO, remains unclear. We identify a strikingly consistent linear correlation between the initial student-self-teacher performance gap and the final performance improvement in OPSD. This relationship holds across context types and model families, providing a powerful predictive law for anticipating the outcome of an OPSD configuration without running the full training procedure. Interestingly, we show that this linear predictability holds with model scale, suggesting a potential basis for new empirical scaling laws on larger models with stronger in-context learning capabilities. In essence, our findings show that OPSD performance can be predicted and tuned before training, offering a principled way to incorporate world feedback as a first-class component of the post-training pipeline.","published_date":"2026-05-28T15:17:19+00:00","viability_score":3,"cluster_label":"Reinforcement Learning","has_code":true,"repo_url":"https://github.com/Tufalabs/opsd-predictive-law","commercial_flags":["has_code"],"one_liner":"A predictive law that correlates the initial performance gap in on-policy self-distillation with final performance improvement, enabling pre-training outcome prediction.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30054v1","title":"Projectional Decoding: Towards Semantic-Aware LLM Generation","abstract":"Large language models (LLMs) are increasingly used to generate software artifacts across many software engineering (SE) tasks, yet ensuring the semantic validity of these artifacts remains a fundamental challenge. Existing constrained decoding techniques can enforce syntactic correctness and, in some cases, specific semantic rules, but lack a general representation that bridges LLM-generated text with the reasoning required for semantic validation in SE. 
In this paper, we propose projectional decoding, a novel conceptual framework that integrates domain semantics directly into the generation process by maintaining, alongside text, a partial graph model as the primary artifact representation throughout generation. This abstract representation enables incremental semantic validation by explicitly capturing uncertainty and natively supporting error detection, while guiding generation toward semantically valid outputs with provable guarantees. We present preliminary results on a program generation task which demonstrate the potential of this approach to improve the semantic validity of LLM-generated artifacts. We also discuss how projectional decoding can enable verifiable automation with LLMs across various SE activities.","published_date":"2026-05-28T15:05:53+00:00","viability_score":7,"cluster_label":"LLM Code Generation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel framework for LLMs to generate semantically valid software artifacts by integrating domain semantics directly into the generation process.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30052v1","title":"REPOT: Recoverable Program-of-Thought via Checkpoint Repair","abstract":"One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its first invalid transition, then one LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call on the ~14% of problems where PoT fails. 
RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium; against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini -- a capability-scaling pattern we begin to address with Adaptive RePoT, a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length (preliminary). We replicate on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback -- showing that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.","published_date":"2026-05-28T15:03:17+00:00","viability_score":8,"cluster_label":"LLM Agents","has_code":true,"repo_url":"https://github.com/parsa-mz/RePot","commercial_flags":["has_code"],"one_liner":"RePoT enables LLM agents to recover from invalid action plans with minimal extra computation, significantly improving reliability.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30049v1","title":"Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers","abstract":"Diffusion Transformers have become a powerful backbone for text-to-image generation, but their layered and cross-modal generation process makes safety control fundamentally different from prompt-level filtering or output-level detection. Harmful semantics may be weakly expressed in text representations, progressively bound to visual latents, and finally entangled with rendering dynamics. 
As a result, safety steering at a fixed layer can be unstable, and a steering mechanism learned from known risks may not transfer reliably to a shifted target risk domain. We propose SafeDIG, a safety steering framework that formulates DiT safety adaptation as position-aware sparse feature transfer. SafeDIG first constructs Sparse Autoencoders over functionally distinct DiT intervention positions and uses robustness-aware pre-training routing to prioritize intervention sites that are expected to remain stable under source-target risk shift. It then separates transferable safety features from domain-specific activation geometry by freezing the SAE encoder as a reusable sparse safety dictionary and adapting only the decoder to the target-domain activation manifold. During inference, SafeDIG combines Blend and Repel operations to steer unsafe activations toward transferred safety manifolds or away from harmful sparse directions. Experiments on FLUX.1 Dev and Stable Diffusion 3.5 Large show that SafeDIG consistently reduces target-domain and overall unsafe generation rates while preserving source-domain safety and image quality.","published_date":"2026-05-28T15:00:18+00:00","viability_score":4,"cluster_label":"Generative Video","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A safety steering framework for text-to-image diffusion models that adapts to new risks while preserving image quality.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30046v1","title":"Masked Diffusion Modeling for Anomaly Detection","abstract":"Anomaly detection aims to identify samples that deviate from the nominal data distribution and is central to many safety-critical applications. However, developing effective anomaly detection methods for categorical, mixed-type, and discrete sequence data remains challenging and relatively underexplored. 
Masked diffusion models provide a natural way to model such data by learning to recover masked values from the remaining visible context. In this paper, we propose Masked Diffusion for Anomaly Detection (MaskDiff-AD), a forward-only method based on masked diffusion models trained only on nominal data. Given a test sample, MaskDiff-AD constructs anomaly scores from the difficulty of reconstructing randomly masked coordinates, yielding a content-sensitive score that operates directly on discrete state spaces while avoiding reverse-time sampling. We also develop a non-parametric variant of MaskDiff-AD and provide theoretical guarantees by characterizing Type-I and Type-II errors under a fixed detection threshold. Experiments on fourteen categorical and mixed-type tabular datasets from ADBench and UADAD, as well as four text anomaly detection datasets from NLP-ADBench, show that MaskDiff-AD achieves competitive performance against classical, diffusion-based, and recent tabular/text anomaly detection baselines. 
Notably, MaskDiff-AD achieves the best overall average rank, outperforming all twelve tabular baseline methods.","published_date":"2026-05-28T14:59:17+00:00","viability_score":7,"cluster_label":"Anomaly Detection","has_code":true,"repo_url":"https://github.com/lxzhang1/MaskDiff-AD","commercial_flags":["has_code"],"one_liner":"MaskDiff-AD is a novel method for anomaly detection in categorical and discrete sequence data using masked diffusion models.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30042v1","title":"Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection","abstract":"Automating scientific computing workflows requires more than generating executable code: autonomous systems must also select appropriate computational strategies, implement them faithfully, and ensure that the resulting outcomes remain causally attributable to the decisions that produced them. In multi-agent pipelines, this process is particularly fragile, as small inconsistencies between agent intentions and actions can lead to semantic drift, where the eventually executed procedure no longer reflects the originally selected strategy, thereby corrupting downstream evaluation and adaptation. In this work, motivated by the ATHENA framework (Toscano et al., 2025; Toscano et al., 2026) and the concept of empowerment (Yiu et al., 2025), we introduce a multi-agent framework that combines contextual bandits with structured inter-agent communication and, most importantly, semantic checkpoints that preserve action-outcome fidelity throughout the pipeline. The system integrates specialized large language model (LLM) agents, grounded code generation, and self-healing execution loops within an adaptive decision-making architecture. 
Interpreting the framework through the lens of empowerment, we show that reliable autonomous learning requires not only identifying high-quality actions, but also preserving the integrity of their propagation across agents. Using sensitivity analysis and uncertainty quantification workflows as representative case studies, we demonstrate that unchecked semantic drift degrades policy learning, whereas the proposed framework improves convergence, robustness, and adaptation to novel problem contexts. These results suggest a broader design principle for scientific multi-agent systems: adaptive decision-making must be coupled with explicit mechanisms that guarantee semantic consistency and reliable information flow across the computational pipeline.","published_date":"2026-05-28T14:57:21+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An empowerment-guided multi-agent system that ensures semantic consistency and reliable information flow for robust scientific computing workflows.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30040v1","title":"Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage","abstract":"Per-token billing is now the standard pricing model for commercial large language models (LLMs), so the honesty of reported token counts directly affects what users pay. We show that this kind of billing is hard to audit by design: providers hide the model, the tokenizer, and the execution to protect their IP, mitigate jailbreaks, and preserve user privacy, which means an auditor can only inspect proofs the provider supplies. The audit therefore reduces to a consistency check on the provider's own reports. We call this a trust paradox: every audit must trust some artifact, but current frameworks trust exactly the ones a provider has the strongest reason to manipulate. 
We study three recent token auditing frameworks and show that a provider with ordinary commercial capabilities can systematically inflate billed token counts. In the most permissive setting, hidden reasoning usage can be inflated by 1,469% on average without detection. At current frontier reasoning prices, that turns a \\$100 honest bill into roughly a \\$1,569 bill on the same query. Even when the user can see the full reasoning string, tokenization ambiguity alone still allows 50.85% over-reporting below the detection threshold. These results suggest the problem is not in any specific auditor but in any audit whose evidence comes from the audited party. Restoring honest billing will require verification that ties reported token counts to evidence the provider does not control, such as trusted execution attestation, cryptographic proofs of inference, or third-party re-execution.","published_date":"2026-05-28T14:57:06+00:00","viability_score":3,"cluster_label":"LLM Auditing","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Identifies systemic vulnerabilities in per-token billing for LLMs, showing how providers can overcharge without detection.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.30039v1","title":"Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning","abstract":"Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches follow a deductive paradigm, heavily relying on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. 
In this work, we tackle the underexplored problem of domain-specific data synthesis through an inductive paradigm, where the target domain is defined only through a set of reference examples, particularly when domain characteristics are difficult to articulate in natural language. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, on challenging coding benchmarks where domain definitions are implicit, fine-tuning on data synthesized by DOMINO improves Pass@1 accuracy by up to 4.63\\% over strong, instruction-tuned backbones, demonstrating its effectiveness and robustness. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without manual prompt design or natural language domain specifications.","published_date":"2026-05-28T14:57:02+00:00","viability_score":8,"cluster_label":"Data Synthesis","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for generating domain-specific data for LLMs using minimal representations learned from reference examples, enabling practical domain adaptation without manual prompt design.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30038v1","title":"Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models","abstract":"Diffusion models generate highly realistic images but often struggle with precise text-image alignment. 
While recent post-training methods improve alignment using external rewards or human preference signals, their performance heavily depends on reward quality and does not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text-image representation alignment, outperforming standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over-counting and repetition. To address this issue, we propose a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL-based diffusion post-training methods. 
Project page: https://jaayeon.github.io/AGSM","published_date":"2026-05-28T14:57:01+00:00","viability_score":8,"cluster_label":"Text-to-Image Alignment","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A lightweight, reward-free post-training method that refines soft tokens in diffusion models to improve text-image alignment and reduce generation failures.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30036v1","title":"Teaching Values to Machines: Simulating Human-Like Behavior in LLMs","abstract":"Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value theory to induce human-like values in LLMs and assess their alignment with patterns observed in human studies. Using validated psychological questionnaires, we conduct large-scale experiments -- over 5 million questions -- to evaluate value structures and value-behavior relationships in leading LLMs and compare them to humans. Our findings reveal strong agreement between value-prompted LLMs and humans across both dimensions. Moreover, incorporating human value distributions enhances population-level simulations with value-induced LLMs. 
These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.","published_date":"2026-05-28T14:56:21+00:00","viability_score":4,"cluster_label":"LLM Behavior","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research develops LLMs that can simulate human-like values and behavior, enhancing their utility for psychological simulations.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30031v1","title":"Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation","abstract":"Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility. This paper provides a unified taxonomy and a controlled empirical evaluation of LALM jailbreak attacks and defenses. We organize prior work into semantic, acoustic, signal, and embedding-layer attacks; guard-based, training-free, and training-based defenses; and cross-modal, audio-native, and interactive benchmarks. We then evaluate representative attacks and defenses across ten open-source LALMs, measuring not only attack success rate but also benign refusal and latency. Our results show that Acoustic Best-of-N reveals strong worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and current defenses trade robustness against benign usability. 
These findings support cost- and utility-aware evaluation as a necessary complement to success-rate-only LALM safety benchmarks.","published_date":"2026-05-28T14:53:27+00:00","viability_score":4,"cluster_label":"Audio LLM Security","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper categorizes and evaluates attacks and defenses for Large Audio Language Models, aiming to improve their safety and robustness.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30029v1","title":"RAISE: RAG Design as an Architecture Search Problem","abstract":"Retrieval-augmented generation (RAG) systems expose numerous design choices spanning query rewriting, chunking, retrieval depth, reranking, and context compression. In practice, these choices are often configured through heuristics, hindering systematic evaluation and reproducibility across settings. We argue that this challenge is best formulated as RAG architecture search. To support controlled and reproducible study of this problem, we introduce the RAG Intelligence Search Engine (RAISE), a comprehensive framework and benchmark for RAG hyperparameter optimization, which evaluates optimization methods for RAG pipelines under standardized search spaces and budgets. RAISE implements 13 search algorithms and evaluates them across seven public text and multimodal datasets using three random seeds. Our experiments show that optimization performance is highly task-dependent: methods that perform strongly on one dataset may not generalize consistently across others, cautioning against interpreting aggregate rankings as evidence of universally superior strategies. 
RAISE provides a common experimental substrate for fair, reproducible, and systematic research on RAG hyperparameter optimization.","published_date":"2026-05-28T14:52:43+00:00","viability_score":7,"cluster_label":"RAG Optimization","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"RAISE is a framework for optimizing Retrieval-Augmented Generation (RAG) pipelines, treating design choices as an architecture search problem.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30022v1","title":"Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders","abstract":"Positional encoding (PE) underpins how permutation-invariant Transformers represent sequence order, yet how positional information is processed and stored remains poorly understood. Modern PE methods such as RoPE still struggle on tasks such as long-context understanding or retrieval \\cite{chen-etal-2025-hope}. Hence, a better understanding of the internal positional mechanism could help design better PE. Building on evidence that positional and semantic signals occupy nearly orthogonal subspaces in trained Transformers, we modify an encoder Transformer to process three explicitly disentangled streams: semantic, absolute positional (AP) and relative positional (RP), and confine the masked-language-modeling (MLM) objective to the semantic stream. This decoupling enables a clean mechanistic study and yields three take-aways. (1) The isolated AP subspace spontaneously collapses into a low-frequency two-dimensional manifold that captures the structure of the document; (2) Attention heads specialize into structure and semantic-oriented groups, with RP exclusively supporting the latter; (3) Standard positional encodings do not robustly retain macroscopic structure: RoPE and RP only weakly encode it, and entangled AP loses it in the final layers under MLM pressure. 
The disentangled approach preserves positional encoding, which improves linguistic representation on 49 of the 65 linguistic phenomena of the Flash-Holmes probing benchmark.","published_date":"2026-05-28T14:42:25+00:00","viability_score":4,"cluster_label":"Transformer Positional Encoding","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research proposes disentangling positional and semantic representations in Transformer encoders to improve long-context understanding and linguistic representation.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30015v1","title":"Test Time Training for Supervised Causal Learning","abstract":"Supervised Causal Learning (SCL) has shown promise in causal discovery by framing it as a supervised learning problem. However, it suffers from significant out-of-distribution generalization challenges. We reveal three limitations of previous SCL practices: a significant performance gap between synthetic benchmarks and real-world data, fragility to distribution shifts, and failure in compositional generalization, collectively questioning its real-world applicability. To address this, we propose Test-Time Training for Supervised Causal Learning (TTT-SCL), a novel framework that dynamically generates training sets explicitly aligned with any specific test instance. We demonstrate the correlation between TTT-SCL and score-based methods, and design an efficient module for generating training sets based on the classic scoring function. 
Experiments on synthetic benchmarks, pseudo-real and real-world datasets demonstrate that TTT-SCL significantly outperforms existing SCL and traditional causal discovery methods.","published_date":"2026-05-28T14:39:49+00:00","viability_score":3,"cluster_label":"Causal Discovery","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel framework for supervised causal learning that dynamically generates training sets aligned with test instances to overcome out-of-distribution generalization challenges.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30014v1","title":"From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs","abstract":"Urban trajectories play a crucial role in modeling urban dynamics and supporting various smart city applications. However, privacy concerns restrict access to large-scale and high-quality trajectory datasets. Trajectory generation provides a promising alternative by synthesizing realistic data to mitigate privacy risks. However, existing methods fail to explicitly capture travel patterns and can only generate fixed-length trajectories under a single condition. To address these limitations, we propose \\textbf{HTP}, which \\textbf{H}ierarchically generates \\textbf{T}ravel patterns first and then generates GPS \\textbf{P}oints by using large language models (LLMs), rather than directly generating GPS points. We first design a trajectory-specific residual quantization variational autoencoder (RQ-VAE) that quantizes micro-level GPS trajectories into compact, macro-level travel pattern tokens in a coarse-to-fine manner. These tokens capture rich segment spatial irregularities, such as point density variations caused by traffic conditions. 
Then, we extend the LLM vocabulary with travel pattern tokens to align trajectory representations with the LLM input, and apply supervised fine-tuning (SFT) to align the LLM with the trajectory generation task, enabling generation of travel pattern sequences under various conditions. Extensive experiments on two real-world datasets show that HTP outperforms the strongest baseline by an average of 29.78% in terms of generation quality. Our code is available at https://github.com/slzhou-xy/HTP.","published_date":"2026-05-28T14:39:40+00:00","viability_score":7,"cluster_label":"Trajectory Generation","has_code":true,"repo_url":"https://github.com/slzhou-xy/HTP","commercial_flags":["has_code"],"one_liner":"A hierarchical LLM-based system that generates realistic urban trajectories by first modeling travel patterns and then generating GPS points, addressing privacy concerns and improving generation quality.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30011v1","title":"VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies","abstract":"Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. 
Besides, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs a 754.7k VLA instructions VisualEvidence-Set for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, achieving a 22.8 times speedup.","published_date":"2026-05-28T14:36:53+00:00","viability_score":8,"cluster_label":"Embodied Control","has_code":true,"repo_url":"https://github.com/DCDmllm/VisualThink-VLA","commercial_flags":["has_code"],"one_liner":"VISUALTHINK-VLA is a visual intermediate-reasoning framework for accurate, low-latency vision-language-action policies in embodied control, significantly reducing inference time.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.30003v1","title":"Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas","abstract":"We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for multi-agent Sequential Social Dilemmas (SSDs). A researcher agent $\\mathcal{R}$ (run as a coding agent) reads the inner-loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. 
Across two games (Cleanup and Gathering), two policy-synthesizer LLMs, and two welfare objectives (utilitarian efficiency and Rawlsian maximin), the researcher reliably exceeds hand-designed baselines, sharply tightens run-to-run variance, and outperforms prompt-only optimization. The discovered pipelines are objective-dependent: only under maximin does the researcher inject an explicit fairness mechanism into synthesizer pipelines, a class of mechanism that is absent from its own objective-agnostic system prompt and from every efficiency-optimized pipeline. This supports an information-design reading in which the researcher chooses what to reveal to the boundedly rational synthesizer as a function of the welfare objective. Code at https://github.com/vicgalle/autoresearch-social-dilemmas.","published_date":"2026-05-28T14:33:10+00:00","viability_score":6,"cluster_label":"Multi-Agent Systems","has_code":true,"repo_url":"https://github.com/vicgalle/autoresearch-social-dilemmas","commercial_flags":["has_code"],"one_liner":"An autoresearch agent autonomously redesigns LLM pipelines for multi-agent sequential social dilemmas, reliably exceeding hand-designed baselines and discovering objective-dependent cooperation strategies.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.30002v1","title":"KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning","abstract":"Cross-domain multimodal time series forecasting is a challenging task, requiring models to integrate precise numerical comprehension, cross-domain semantic understanding, and effective multimodal fusion. Existing approaches either build Time Series Foundation Models (TSFMs) from scratch or leverage pretrained Large Language Models (LLMs). However, TSFMs often overlook semantic understanding and lack the ability to perform future-oriented semantic reasoning, and LLMs struggle with numerical comprehension and accurate quantitative forecasting. 
To overcome these limitations, we propose KairosAgent, a novel agentic framework for multimodal time series forecasting, including an LLM-based reasoner and a TSFM-based forecaster. KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools to enhance the numerical understanding and semantic reasoning capabilities of LLMs. The reasoning results are subsequently fused into the TSFM pipeline, enabling more accurate and reliable future predictions. To further improve the reasoning, we curate a large-scale corpus of high-quality trajectories, alongside a reinforcement learning from forecasting paradigm with multi-turn refinement and turn-level credit assignment. Experiments demonstrate that KairosAgent achieves superior zero-shot forecasting performance while maximizing the utility of pretrained LLMs and TSFMs, presenting a promising direction for efficient and interpretable time series agents. The project page is at https://foundation-model-research.github.io/KairosAgent .","published_date":"2026-05-28T14:32:48+00:00","viability_score":7,"cluster_label":"Time Series Forecasting","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An agentic framework for multimodal time series forecasting that fuses LLM reasoning with time series foundation models for improved accuracy and interpretability.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.30000v1","title":"Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation","abstract":"Front-end web code has become a core product surface for every frontier LLM release, yet evaluating these interactive applications at development speed remains costly because human-judged leaderboards like Arena do not scale. 
Existing automated proxies typically lean on reference implementations, test suites, or rigid checklists, and tend to miss the reasoned synthesis a human reviewer performs over a live session. We articulate a new evaluation regime that is simultaneously reference-free, autonomously driven, and holistically reasoned, and instantiate it through two artifacts. \\textbf{\\dataname} is an 11-domain, 54-leaf, 1,000-query WebDev benchmark spanning both static-presentation and interactive-application tasks, balanced across three difficulty tiers and three target-language groups, with briefs rewritten to resist recall from circulated prompts. \\textbf{\\framename}, grounded in Flavell's metacognitive monitoring, separates evidence accumulation from judgment across three stages: Static Perception forms a first impression from passive observation; Agent-Driven Interaction explores the application autonomously while capturing continuous screen video, audio, and per-step screenshots; Dynamic Scoring issues holistic functionality and aesthetics verdicts with structured failure attribution only after the evidence chain is complete. On \\dataname, \\framename aligns closely with expert human ratings while surfacing substantial headroom across 13 frontier LLMs on interactive web generation. 
https://anonymous.4open.science/r/Cookie-3CE/","published_date":"2026-05-28T14:30:33+00:00","viability_score":8,"cluster_label":"Web Generation Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel evaluation regime for LLM-generated web code that autonomously drives interaction and provides holistic, human-aligned scoring.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29986v1","title":"Accelerating Constrained Decoding with Token Space Compression","abstract":"To guarantee that an LLM's outputs conform to a specified structure, context-free grammar (CFG) decoding engines force the selection of next tokens that produce strings that conform to a given CFG. While current CFG-constrained decoding engines are highly optimized, the inherent costs arising from the massive per-step search space -- i.e. the entire token vocabulary -- result in intractably high overhead for more complex CFGs: precisely the situation where CFG engines are most useful. In this paper, we introduce CFGzip, an offline technique for compressing the token search space, which massively reduces CFG engine overhead. 
In experiments, we report latency reduction of up to two orders of magnitude when CFGzip is used with a SoTA grammar engine, yielding an up to 7.5x speedup in total constrained generation time: with CFGzip, constrained decoding is now feasible at scale for complex CFGs.","published_date":"2026-05-28T14:22:20+00:00","viability_score":4,"cluster_label":"LLM Decoding","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A technique to compress token search spaces for grammar-constrained LLM decoding, significantly reducing overhead for complex grammars.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29980v1","title":"Genetically Aligned Patient Representations Improve Hematological Diagnosis","abstract":"Multimodal alignment of histopathology encoders with transcriptomic and genomic data has been shown to significantly improve performance in downstream diagnostic tasks. Hematological cytology is unique in that visual single-cell evaluation is often paired with cytogenetics and molecular genetics for blood cancer diagnosis. In this study, we present a framework to align single white blood cell images with chromosomal aberrations (karyotype) and somatic mutations from targeted gene panels. Our training strategy follows a two-stage approach: (i) self-supervised, vision-only pretraining of a transformer aggregator using an iBOT head on a cohort of over 1500 patients, and (ii) genetic alignment via supervised contrastive loss on acute myeloid leukemia patients. Our genetically aligned patient encoder improves hematological diagnostic tasks, outperforming slide-level histopathology foundation models. Additionally, the model provides off-the-shelf retrieval capabilities for diseases and genetic alterations. Incorporating genetic data into patient encoders increases the quality of patient representations, providing a framework that aligns with clinical diagnostic workflows and paves the way for future multimodal hematology-specific AI. 
The code and model weights are available at https://github.com/marrlab/GenBloom.","published_date":"2026-05-28T14:17:31+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":"https://github.com/marrlab/GenBloom","commercial_flags":["has_code"],"one_liner":"A framework aligning single white blood cell images with genetic data to improve hematological diagnosis and provide disease retrieval capabilities.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29976v1","title":"Evaluating Skill and Stability of ArchesWeather and ArchesWeatherGen under Multi-Decadal Climate Simulations","abstract":"We evaluate the climate simulation capabilities of ArchesWeather and ArchesWeatherGen, two machine learning models originally trained for weather forecasting and evaluated up to a 10-day lead time. ArchesWeather is a deterministic model, while ArchesWeatherGen is a probabilistic flow-matching model leveraging ArchesWeather's forecasts, enabling ensemble-based uncertainty quantification. In this work, we adapt these models to act as forced atmospheric models by using additional conditioning on the monthly mean sea surface temperature (SST) and sea ice cover (SIC) as boundary conditions. In particular, we follow the AI Model Intercomparison Project (AIMIP) Phase 1 protocol, which, analogous to the Atmospheric Model Intercomparison Project (AMIP), proposes a standardized experimental setup to evaluate the climate skill of ML-based forced atmospheric models. We present a comprehensive evaluation of both models under these conditions, including comparison against numerical climate models, ablation studies that examine key design choices in the extension, and an analysis of forced versus unforced configurations. 
Despite being originally developed for weather forecasting, we demonstrate that forced configurations of ArchesWeather and ArchesWeatherGen produce stable long-term climate simulations, have a stable annual cycle, and capture the drift of many climate variables. The models faithfully reproduce ERA5's climatology, large-scale circulations and interannual variability, and they capture the tails of the distributions.","published_date":"2026-05-28T14:15:25+00:00","viability_score":3,"cluster_label":"Climate AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Evaluating the climate simulation capabilities of machine learning models originally trained for weather forecasting.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29966v1","title":"Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent","abstract":"Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating \"data silos\" inaccessible to comprehensive analysis. Manual extraction is unscalable, while general-purpose Large Language Models (LLMs) lack the necessary domain-specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert-guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine-tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co-designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open-access papers, we successfully extract 3,751 previously unincorporated Pb records. 
This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi-layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under-sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert-guided agents can effectively bridge the gap between general-purpose LLMs and high-stakes scientific domains, enabling scalable data discovery in geosciences.","published_date":"2026-05-28T14:06:25+00:00","viability_score":8,"cluster_label":"Data Extraction Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An LLM agent framework that extracts critical marine lead data from scientific papers, creating the largest integrated database to date.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29965v1","title":"Meta-Programming for Linear-time Temporal Answer Set Programming","abstract":"The development of temporal extensions of Answer Set Programming (ASP) has led to the emergence of non-monotonic linear-time (TEL), dynamic (DEL), and metric (MEL) temporal equilibrium logics. However, the inherent rigidity of highly optimized ASP systems often hinders the rapid exploration and implementation of alternative logical designs. In this work, we propose a flexible meta-programming framework that operationalizes the semantics of varied temporal logics through a unified, declarative framework. Our approach extends standard ASP meta-programming by augmenting clingo's theory grammar with formal type specifications and nesting capabilities. 
To ensure semantic correctness, we introduce a transformation pipeline that protects nested modalities from stable-model-based simplifications during grounding. We demonstrate the extensibility of our framework by implementing meta-encodings for TEL, MEL, and DEL. We provide a comprehensive account of TEL and highlight the key features for managing the interval constraints of MEL and the Fischer-Ladner closure in DEL. Finally, we introduce the metasp system, a versatile tool that encapsulates this workflow.","published_date":"2026-05-28T14:06:12+00:00","viability_score":3,"cluster_label":"Logic Programming","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A flexible meta-programming framework for operationalizing temporal logics in Answer Set Programming.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29963v1","title":"Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots","abstract":"Honeypots are decoy systems mimicking real system components designed to defend against cyber attacks. Recently, LLMs increasingly serve as simulation backbones for honeypots. They enable defenders to construct high-interaction honeypots with low system security risks. However, LLM-powered honeypot development lacks a unified evaluation framework. Most evaluations consist of measuring response similarity on fixed commands, manual testing, or real-world deployment. These methods are often not scalable for development, reproducible across evaluations, representative of practical attacks, or adaptable to various attacker and honeypot configurations. In this work, we bridge this gap and propose Honeyval, a comprehensive evaluation framework for LLM-powered HTTP honeypots. 
We address the limitations of prior evaluations by grounding the honeypots in 16 backend applications, using AI hacking agents as attackers, employing two control tasks to monitor agent and honeypot capabilities across customizations, and defining clear and verifiable exploit goals for the attacker. Using Honeyval, we conduct an extensive evaluation of recent cost-efficient LLMs as HTTP honeypots. Our experiments highlight the promise of LLM-powered honeypots; they lead to substantially longer interactions with the attacker than rule-based baseline honeypots and are far less frequently detected even by frontier models, all while, on average, preserving a running cost advantage against agentic attackers. Further, we experiment with different counter-offensive honeypots configurations, and observe unique trade-offs, such as longer interactions at the cost of increased detection.","published_date":"2026-05-28T14:02:27+00:00","viability_score":4,"cluster_label":"Cybersecurity AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A comprehensive evaluation framework for LLM-powered HTTP honeypots to improve cybersecurity defenses.","time_to_mvp":"1-2 weeks","tags":["quick_build"]},{"arxiv_id":"2605.29960v1","title":"Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction","abstract":"Large language model (LLM) agents increasingly leverage long term memory to support persistent and autonomous task execution. However, this capability also introduces a new attack surface: memory poisoning, where adversaries can inject malicious information to influence future behavior. Existing memory poisoning attacks often assume that injected content can be stored directly in memory, overlooking the selective extraction and rewriting stages in modern memory pipelines. This makes prior methods ineffective under realistic settings.   
In this paper, we propose MemPoison, a novel memory poisoning attack that bypasses selective memory mechanisms in LLM agents, where an attacker can inject triggerable backdoors into the agent's long-term memory through dialogue interactions, thereby misleading its subsequent responses. MemPoison introduces three key components: (i) a semantic relational bridge that binds the trigger and payload into a coherent statement to ensure they are extracted into memory together; (ii) entity masquerading that optimizes triggers to mimic named entities, resisting rewriting; and (iii) joint embedding optimization that shapes trigger-injected texts into a tight cluster in the embedding space while maintaining isolation from benign embeddings for stealth. Evaluations across different agent domains and memory mechanisms show MemPoison achieves attack success rates up to 0.95, outperforming existing baselines. Mechanistic analysis indicates that the attack exploits embedding-space anisotropy and shifts attention patterns, highlighting core vulnerabilities in selective memory systems. We evaluate multiple defense strategies and demonstrate their fundamental limitations in mitigating the attack.","published_date":"2026-05-28T14:02:00+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":"https://github.com/Mintplex-Labs/anything-llm","commercial_flags":["has_code"],"one_liner":"A novel attack that injects triggerable backdoors into LLM agent memory through dialogue, bypassing existing defenses.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29955v1","title":"Formalizing Mathematics at Scale","abstract":"We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. 
AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency-aware task scheduling, and collaborative version control, to translate informal textbook prose into machine-checked definitions and proofs. We apply our methods to a corpus of 26 open-access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open-source multi-agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human- and machine-generated mathematics at a research level.","published_date":"2026-05-28T14:00:22+00:00","viability_score":5,"cluster_label":"Formal Verification","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A multi-agent system that autoformalizes graduate-level mathematics textbooks into a verified library using LLMs and formal verification tools.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29951v1","title":"MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization","abstract":"Understanding how harm emerges from interaction between otherwise benign image-text pairs requires intent-aware cross-modal reasoning beyond surface-level features. Existing vision-language models (VLMs) excel at literal reasoning over perceptual cues but often fail to derive harmful semantics that rely on implicit, context-dependent reasoning. To evaluate VLMs on compositional harm detection and reasoning, we introduce Multimodal Pragmatic Harm Interpretation (MuPHI), a dataset containing image-text pairs where harm is encoded in subtle multimodal cues. 
MuPHI spans diverse harm categories and includes annotated harm rationales for assessing VLM reasoning chains. To improve both detection and reasoning in VLMs, we propose MuPHIRM, a reasoning-augmented training framework which learns joint semantics by optimizing multi-perspective rewards. MuPHIRM improves both harm detection and reasoning quality of VLMs while demonstrating superior out-of-distribution robustness compared to both trained and inference-time baselines. Our findings suggest that reasoning-oriented reward optimization offers a promising direction towards building multimodal systems that generalize beyond benchmark-specific shortcuts.","published_date":"2026-05-28T13:58:15+00:00","viability_score":7,"cluster_label":"Multimodal AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A dataset and training framework for improving multimodal models' ability to detect and reason about subtle, context-dependent harm in image-text pairs.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29948v1","title":"HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding","abstract":"Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. 
Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.","published_date":"2026-05-28T13:55:19+00:00","viability_score":7,"cluster_label":"Speech AI","has_code":true,"repo_url":"https://github.com/bovod-sjtu/HoliTok","commercial_flags":["has_code"],"one_liner":"HoliTok is a continuous speech tokenization model enabling unified, high-quality speech generation and understanding without complex architectural changes.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29940v1","title":"Make LLM Learn to Synthesize from Streaming Experiences through Feedback","abstract":"Large language models (LLMs) have been widely adopted for synthetic data generation, significantly reducing annotation costs. However, most existing studies treat synthesis as a set of isolated tasks and overlook a more fundamental question: whether a model can learn to synthesize by accumulating experience from past tasks and transferring it to future ones. In this work, we introduce StreamSynth, a new setting in which synthesis tasks arrive sequentially and experience from historical tasks provides informative signals for future synthesis. 
To address this setting, we propose SynLearner, a general framework that enables synthesis models to acquire reusable synthesis experience over a task stream. Instead of generating data independently for each task, SynLearner encourages the model to explore diverse synthesis patterns, learn from feedback, and balance sample quality with set-level diversity as tasks evolve. Extensive experiments across multiple benchmarks show that SynLearner effectively leverages experience from earlier tasks to improve synthesis performance on later ones, exhibiting consistent cross-task transferability. These findings provide evidence for the feasibility of StreamSynth and highlight synthetic data generation as an experience-driven process that can benefit from task streams.","published_date":"2026-05-28T13:51:58+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for LLMs to learn synthetic data generation by accumulating experience from sequential tasks and feedback.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29935v1","title":"CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving","abstract":"Autonomous driving systems are commonly trained and evaluated within limited geographic regions, which hinders their scalability when deployed in new cities. However, significant domain shifts in appearance, road topology, and traffic patterns often cause severe performance degradation under cross-city deployment. Existing approaches based on domain adaptation, data augmentation, or synthetic data generation typically rely on labeled target data, city-specific annotations, or task-specific designs, limiting their scalability and effectiveness for holistic evaluation. 
In this paper, we introduce CityTransfer-Bench, a geographically disjoint benchmark for evaluating cross-city generalization across perception, segmentation, and planning, and propose CityGen, a diffusion-based generative framework that performs zero-label city adaptation via HD-map-conditioned synthesis guided by city-level visual prompts. Extensive experiments demonstrate that CityGen consistently improves cross-city robustness across multiple tasks, establishing a scalable and label-efficient foundation for generalizable autonomous driving.","published_date":"2026-05-28T13:46:48+00:00","viability_score":8,"cluster_label":"Autonomous Driving","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A diffusion-based generative framework for zero-label city adaptation in autonomous driving systems, improving cross-city robustness.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29931v1","title":"It`s All About Speed: AI`s Impact on Workflow in Music Production","abstract":"In this paper, we present the results of an ethnographic study into the impact of AI and automated tools on music production workflow. Focusing specifically on professional participants who identified as recording engineers, mixers, and producers, we discuss their usage of common AI and automated software, as well as their sentiments on the proliferation of these tools. 
We discuss tensions that may be created between users and automated tools in key areas such as the need for speed and efficiency, controllability, and maintaining creative agency, and how these tensions may be alleviated through tool design.","published_date":"2026-05-28T13:43:26+00:00","viability_score":0,"cluster_label":"Music Production AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An ethnographic study exploring the impact of AI and automated tools on professional music production workflows.","time_to_mvp":"N/A","tags":[]},{"arxiv_id":"2605.29930v1","title":"Toward AI Systems That Understand Self and Others: A Multi-Phase Inference Framework for Human Cognitive Diversity and World-Model Alignment","abstract":"Mutual misunderstanding in contemporary society does not arise merely because people hold different opinions or values. Even under the same observations, different subjects may form different inferential targets, state representations, prediction errors, and update priorities. This paper proposes a multi-phase inference framework and defines its core internal mechanism as the Multi-Phase Inference Mechanism (MIM). MIM formalizes how heterogeneous world models arise through a phase-formation space, a foregrounding field, subject-specific profile states, and alignment maps between state representations. On this basis, the paper reframes world-model alignment as the problem of making heterogeneous representations mutually processable, rather than forcing agreement or convergence to a single value system. It further connects this formalism to philosophical disagreements, cognitive typology, social fragmentation, and AI alignment. 
The aim is to provide a constructive vocabulary for AI systems that can help humans understand self and others by making differences in meaning, value, and prediction error visible, comparable, and transformable.","published_date":"2026-05-28T13:43:23+00:00","viability_score":0,"cluster_label":"AI Alignment","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A multi-phase inference framework to enable AI systems to understand human cognitive diversity and align world models.","time_to_mvp":"N/A","tags":[]},{"arxiv_id":"2605.29928v1","title":"Label Over Logic? How Source Cues Bias Human Fallacy Judgments More Than LLMs","abstract":"As AI-generated and AI-assisted content floods online spaces, source labels attached to such content can distort human reasoning judgments, with downstream consequences for moderation, evaluation, and decision-making. Whether LLMs share this vulnerability, or offer more source-agnostic evaluation, remains an open question with direct implications for human-AI collaboration. We examine this issue using logical fallacies as a controlled setting to isolate source-label effects on reasoning quality, independent of domain knowledge. We conduct an online study (N=505) where participants are assigned to a source condition (human, AI, human with AI assistance, AI with human assistance, or no disclosure) and evaluate comments containing logical fallacies, comparing their judgments with those of LLMs (GPT-5.2, Gemini 2.5 Flash, Claude Sonnet 4.5), who were evaluated across the same source conditions. Human evaluators were significantly more susceptible to fallacies labeled as written by human or human with AI assistance and assigned higher trust and evaluation ratings in these conditions. LLM evaluations remained comparatively stable across source labels, though performance varied across models. Confidence levels were similarly high across conditions for both humans and LLMs, regardless of fallacy presence. 
Our findings indicate that source-label bias in reasoning evaluation is primarily a human vulnerability and highlight the potential of human-LLM collaboration in increasingly AI-mediated environments.","published_date":"2026-05-28T13:41:43+00:00","viability_score":4,"cluster_label":"Human-AI Interaction","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research investigates how source labels (human vs. AI) bias human judgment of logical fallacies, finding humans are susceptible while LLMs remain more stable, highlighting potential for human-LLM collaboration in content moderation.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29927v1","title":"Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents","abstract":"Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planning, yet the impact of alternative natural language plan representation remains unexplored. To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation in agent performance. We first automatically categorize WebArena tasks into 3 difficulty levels, enabling consistent difficulty grading without human annotation. Then we systematically evaluate 4 different plan representations on the tasks categorized as hard: sequential subgoals, narrative, pseudocode, and checklist; across different families of multimodal LLM powered agents (OpenAI, Alibaba, and Google). To account for stochastic variability, we introduce two novel evaluation metrics: Achievement Rate (AR) and Solved-Task Consistency (STC). 
Our results show that both the plan formulation and the underlying LLM generating the plan significantly influence web-agent robustness and task success.","published_date":"2026-05-28T13:39:49+00:00","viability_score":5,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This study evaluates how different natural language plan representations impact LLM web agent performance, introducing a framework and metrics to show that both plan formulation and the underlying LLM significantly influence robustness and task success.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29919v1","title":"On the Geometry of Games and their Solvers","abstract":"A central challenge in game theory and learning systems such as GANs is understanding which algorithms can efficiently compute equilibria across the heterogeneous landscape of games. Equilibrium computation is typically studied solver by solver and game class by game class, yielding strong local guarantees but a fragmented view of solver behaviour. Existing discrete taxonomies often provide an incomplete account of where algorithms succeed. We study this problem through a solver-game map linking games to effective solver dynamics. Classical theory identifies isolated regions of this map but provides limited insight into intermediate or overlapping regimes, suggesting that solvability is governed by latent structural properties defining a continuous solver-aligned geometry of games. We formalise this perspective through structure-aware solver synthesis. A learned structure recogniser maps each game to a low-dimensional solver-aligned representation, and a policy maps this representation to effective primitive mechanisms, adapting solver behaviour across regimes. This reveals regions where particular solver dynamics are effective and where mixtures of primitives are required rather than a single dominant solver. 
A bounded residual acts as a local corrector and diagnostic signal for incomplete solver bases or representations. The framework yields both an adaptive solver and an analytical lens: games with similar optimisation dynamics cluster together, revealing continuous regions of algorithmic validity and overlapping solver behaviour. Empirically, we show that fixed primitives exhibit systematic regime mismatch, while the learned representation organises game space into a structured cartography aligned with solver behaviour. These results suggest viewing equilibrium computation as the joint problem of learning solver mechanisms and mapping the geometry of solvability.","published_date":"2026-05-28T13:31:49+00:00","viability_score":3,"cluster_label":"Game Theory AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper proposes a novel framework that maps games to a low-dimensional representation to synthesize adaptive solvers, revealing a continuous geometry of solvability and improving equilibrium computation efficiency.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29916v1","title":"Selection Hyper-heuristics Can Automatically Adjust the Learning Period to Optimally Solve Pseudo-Boolean Problems","abstract":"The Random Gradient hyper-heuristic was recently shown to be able to learn the optimal neighbourhood size when optimizing the LeadingOnes benchmark via the Randomised Local Search (RLS) meta-heuristic. However, for this to happen, a learning period of a certain length $\u03c4$ had to be used, differently from classic hyper-heuristics, which change their behaviour based on the success of only the previous iteration. In this paper, we show how to automatically set this new parameter value, relieving the user from the non-trivial task of controlling this novel algorithm parameter. 
We prove that the resulting hyper-heuristic selects the optimal neighbourhood size in a $1-o(1)$ fraction of the iterations and, consequently, optimises the LeadingOnes benchmark in the best possible time (apart from lower-order terms) achievable with these neighborhood sizes.","published_date":"2026-05-28T13:31:16+00:00","viability_score":7,"cluster_label":"Optimization Algorithms","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This work introduces a hyper-heuristic that automatically learns the optimal learning period for optimizing pseudo-Boolean problems, achieving near-optimal performance with minimal user intervention.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.29910v1","title":"Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents","abstract":"Consensus protocols form the backbone of distributed systems and blockchains, where implementation bugs can cause data corruption and financial losses. While LLM-based approaches show promise in code analysis, they struggle with deep protocol-level logic bugs involving complex state-dependent behaviors across multiple execution stages. We present Agora, a domain-aware multi-agent framework that integrates hypothesis-driven testing with LLM capabilities for systematic protocol verification. Agora employs specialized agents that collaboratively explore protocol state spaces, synthesize attack scenarios using domain-specific constraints, and validate findings through iterative refinement. This explicit role separation enables reasoning about global protocol invariants beyond single-function code analysis. We evaluate Agora on four consensus implementations (Raft, EPaxos, HotStuff, BullShark) using four state-of-the-art LLMs. Agora discovers 15 previously unknown protocol-level logic bugs that violate safety properties, while existing LLM-based agents fail to detect any such protocol-level logic bugs. 
Our results demonstrate that domain-aware multi-agent collaboration is essential for detecting deep logic bugs in complex protocols.","published_date":"2026-05-28T13:27:47+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Agora is a multi-agent framework that uses LLMs to autonomously detect critical bugs in production-level consensus protocols, finding 15 previously unknown bugs in major implementations.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29893v1","title":"Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories","abstract":"LLM-based agents have demonstrated strong capabilities in solving complex tasks through multi-step reasoning and tool use. However, existing evaluation protocols primarily focus on task success, overlooking a critical aspect of agent behavior: execution efficiency. In practice, agent trajectories often contain redundant steps that consume substantial resources while contributing little to task completion. In this work, we propose and formulate a new research area: \\textbf{redundant step detection} for agent trajectories. To support this initiative, we introduce \\textbf{RedundancyBench}, a new benchmark that contains diverse tasks with carefully annotated trajectories, where each step is labeled according to its contribution to task completion. Using RedundancyBench, we develop and evaluate 3 representative methods to answer whether a step within trajectory is redundant or necessary. Our results show that even the best-performing method achieves only 24.88\\% score in detecting redundant steps, while some methods perform worse than random guessing. These results highlight the task's complexity and the need for further research in this area. 
Code and dataset in this paper are both available at https://anonymous.4open.science/r/RedundancyBench.","published_date":"2026-05-28T13:17:33+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"RedundancyBench is a new benchmark with evaluation methods for detecting inefficient steps in LLM agent trajectories, addressing a critical gap in agent efficiency.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29889v1","title":"Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate","abstract":"Patient-voiced clinical-triage benchmarks report high under-triage rates for consumer LLMs for constrained multiple-choice output, yet the same cases score differently with free-text. We ask whether output format changes the model's clinical representation or only the mapping from a preserved representation to an answer. Using sparse-autoencoder (SAE) features in Gemma 3 4B/12B IT and Qwen3-8B, we find the same medical features fire on the shared clinical narrative under both formats but go silent at the multiple-choice decision token in all cases for every model. Three independent methods (natural-language autoencoder verbalization, decision-token logit attribution, and top-feature characterization) agree that scaffold and format features, but not medical features, drive the decision logits. Behaviorally, the multiple-choice penalty inverts under both structured and natural-language input, option-order shuffle rules out positional bias, and the gap is dominated by off-by-one decision (the model picks an adjacent acuity letter to the gold answer) rather than knowledge failure. 
Thus, the failure originates in the output format and not in the clinical representation.","published_date":"2026-05-28T13:14:17+00:00","viability_score":4,"cluster_label":"LLM Analysis","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research reveals that LLM triage failures in multiple-choice formats stem from output formatting, not a lack of clinical knowledge, by analyzing internal representations.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29888v1","title":"LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training","abstract":"Reinforcement learning (RL) post-training has shown to improve reasoning in large language models (LLMs). However, there has been little exploration on the problem of data contamination in RL post-training, potentially undermining generalization and evaluation reliability of the training process itself. Existing detection methods primarily rely on output-level signals such as likelihood or entropy, which become unreliable for RL-trained models since RL shapes behavior through trajectory-level rewards rather than token likelihoods. We propose LaRA, a layer-wise representation analysis framework for detecting contamination in RL post-trained LLMs. LaRA introduces three complementary metrics, measuring perturbation sensitivity, directional collapse, and local representation rigidity under controlled perturbations. We find that contamination produces progressive geometric deviations across layers, including amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity. Based on our findings, we also develop a contamination detection protocol that aggregates representation-level deviations across layers and metrics. 
Experiments on RL-trained reasoning models show that our protocol outperforms existing output-level baselines for contamination detection.","published_date":"2026-05-28T13:13:49+00:00","viability_score":3,"cluster_label":"LLM Analysis","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"LaRA is a novel framework for analyzing internal representations to detect data contamination in RL post-trained LLMs, outperforming existing output-level methods.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29886v1","title":"CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation","abstract":"Retrieval-augmented generation (RAG) improves knowledge-intensive question answering by incorporating external evidence. However, existing RAG methods still suffer from hallucinations and subtle reasoning errors. Recent studies introduce external critics to refine RAG outputs, yet they often provide coarse-grained and weakly structured feedback, exhibit over-aggressive intervention, and lead to noisy and unreliable refinement, limiting their effectiveness for correction. To tackle these issues, we propose CRITIC-R1, a structured critic framework that formulates and learns RAG critique as an explicit error diagnosis problem using reinforcement learning (RL). Our framework categorizes common RAG errors into multiple diagnostic dimensions, including verdict, error location, reasoning analysis, and fix generation. To learn these capabilities, we design two reward functions: Conservative Judgement Alignment (CJA) first encourages calibrated high-level judgements while mitigating the over-aggressive phenomenon, whereas Diagnostic Quality Alignment (DQA) further improves fine-grained diagnostic feedback through gated rewards. We train the critic model using GRPO-based RL with process-level supervision collected from external LLM teacher models. 
Experiments across five QA benchmarks show that CRITIC-R1 consistently improves answer quality over strong RAG baselines.   Our source code is available at https://anonymous.4open.science/r/critic-r1-FCB0","published_date":"2026-05-28T13:11:36+00:00","viability_score":7,"cluster_label":"Retrieval-Augmented Generation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A structured critic framework for retrieval-augmented generation that diagnoses and fixes errors, improving answer quality.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29881v1","title":"Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering","abstract":"Large vision-language models (LVLMs) often hallucinate objects that are not present in the input image, largely because visual grounding weakens as decoding progresses. Existing inference-time mitigation methods modify logits or hidden states throughout generation, but they suffer from three key limitations: they lack an explicit grounding objective, intervene even when the model is already well-grounded, and use fixed correction strengths that do not adapt to the severity of grounding failure. We propose BRACS (Barrier-Regulated Adaptive Closed-form Steering), a training-free steering framework that addresses these issues through barrier-regulated adaptive closed-form steering. BRACS monitors the model's own attention to measure visual grounding and applies corrections to the hidden states only when grounding deteriorates. The corrective update is computed analytically in closed form, requiring no training of auxiliary networks or model retraining. Experiments on LLaVA-1.5-7B and Qwen-VL-Chat show that BRACS consistently outperforms prior methods on hallucination benchmarks, reducing CHAIR$_s$ by 9.4 points and improving POPE F1 by 2.7 points, while matching or improving performance on four general multimodal benchmarks. 
BRACS also remains efficient, operating at 80% of greedy decoding throughput and achieving 1.3 times higher speed on average than the baselines.","published_date":"2026-05-28T13:07:01+00:00","viability_score":8,"cluster_label":"Vision-Language Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A training-free steering framework that mitigates hallucination in vision-language models by adaptively regulating visual grounding.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29874v1","title":"Evolutionary Dynamics of Cooperation in Next-Generation LLM Agent Systems: A Cross-Provider Empirical Extension","abstract":"Do next-generation LLM agents inherit the cooperative biases documented in their predecessors, or does scale and provider diversity reshape equilibrium behaviour in competitive multi-agent settings? Willis et al. established a benchmark for this question using evolutionary game theory and the Iterated Prisoner's Dilemma (IPD), finding consistent cooperative biases in ChatGPT-4o and Claude 3.5 Sonnet. We extend this benchmark to four frontier models released in 2025-2026 - Claude Sonnet 4.6, Gemini 2.5 Flash, Gemini 3.1 Pro, and GPT-5.4 Mini - applying the identical protocol across three prompting styles (Default, Prose, Self-Refine) and four population compositions (balanced and biased, with and without noise). Cooperative bias persists across providers (H1): nine of twelve model-prompt combinations favour cooperative equilibria in balanced noiseless conditions. Cross-provider divergence is substantial (H3): Gemini 2.5 Flash reaches up to 77% aggressive equilibria under biased conditions, while GPT-5.4 Mini reaches 70% cooperative equilibria under Self-Refine. 
Support for aggressive capability parity is partial (H2): Self-Refine raises ICD in all models and Claude Sonnet 4.6 Refine achieves the highest ICD in the dataset (0.913), but Default and Prose prompts show no systematic narrowing. Evidence on noise robustness is directionally positive but not robustly confirmed (H4): with n=500 Moran iterations per condition, average noise sensitivity is approximately 6 percentage points for Claude Sonnet 4.6 versus 13 pp for Claude 3.5 Sonnet, but this cross-study gap is not statistically significant once the predecessor's unreported sampling error is propagated. Provider identity, rather than model generation, is the strongest correlate of equilibrium outcomes; noise remains a universal challenge regardless of model size or vintage.","published_date":"2026-05-28T12:58:56+00:00","viability_score":4,"cluster_label":"LLM Agent Systems","has_code":true,"repo_url":"https://github.com/arqFranciscoLeon/evollm","commercial_flags":["has_code"],"one_liner":"An empirical extension of a benchmark for evolutionary dynamics of cooperation in next-generation LLM agent systems across different providers and prompting styles.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29873v1","title":"Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation","abstract":"Key-Value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression across both prefill and decoding caches, but compressing the prefill cache degrades performance by corrupting critical context. While preserving the prefill cache is essential, decoding-phase compression remains underexplored, with existing methods relying on rigid recency windows or instantaneous attention. Our analysis of attention dynamics reveals strong temporal patterns: critical tokens receive sustained attention over long horizons, while local reasoning involves short-lived bursts. 
Static heuristics fail to capture this behavior, leading to premature eviction of important tokens or retention of stale ones. We propose Moment-KV, a decoding-time KV cache compression method based on momentum-driven temporal attention aggregation. Our method models token importance as a continuously evolving state, where attention is aggregated with decay, capturing both long-term influence and recent relevance. Experiments show that Moment-KV significantly improves generation fidelity in long-generation tasks (2.3-3.2 %) while maintaining decoding latency.","published_date":"2026-05-28T12:56:51+00:00","viability_score":3,"cluster_label":"LLM Inference Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Moment-KV is a decoding-time KV cache compression method that improves generation fidelity in long-generation tasks by aggregating attention with decay.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29862v1","title":"Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions","abstract":"AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG) formulation for RSC under stethoscope-induced device shifts, where clients use heterogeneous devices and the model is evaluated on unseen devices. Our empirical analysis shows that stethoscope-induced style and disease-specific content are tightly entangled, making deterministic style removal unreliable. 
In response, we propose a causality-inspired multimodal FedDG framework that combines: (i) a causality-inspired device style intervention network that performs content-preserving style perturbations, (ii) counterfactual text augmentation that neutralizes metadata shortcuts, and (iii) gradient alignment that facilitates device-invariant representations across clients. Built on a multimodal language-audio pretraining model, it outperforms conventional data augmentation and federated learning baselines in leave-one-device-out validation on ICBHI and SPRSound datasets. Code will be released upon publication.","published_date":"2026-05-28T12:43:03+00:00","viability_score":5,"cluster_label":"Medical AI","has_code":true,"repo_url":"https://github.com/RSC-Toolkit/BTS-CAFE","commercial_flags":["has_code"],"one_liner":"A causality-inspired framework for federated respiratory sound classification that overcomes device variability by disentangling content and style.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.29861v1","title":"Towards Verifiable Multimodal Deep Research: A Multi-Agent Harness for Interleaved Report Generation","abstract":"Large Language Models (LLMs) have advanced autonomous agents from deep search, which retrieves concise factual answers, to deep research, which synthesizes scattered evidence into long-form reports. However, verifiable multimodal deep research remains challenging due to open-ended synthesis without deterministic ground truth and the need to interleave textual arguments with visual evidence. We propose \\textsc{Ptah}, a multi-agent harness for interleaved report generation. \\textsc{Ptah} orchestrates the lifecycle from user query to rendered web report through planning, research, and writing stages, where specialized agents construct visual-aware plans, collect claim-grounded evidence, maintain source-aligned images in a \\textit{Visual Working Memory}, and compose reports through declarative multimodal tool use. 
A verifier agent serves as the harness's acceptance function, enforcing factual grounding, citation fidelity, and cross-modal consistency throughout the workflow. We further introduce \\textsc{Ptah}Eval, an evaluation protocol that augments existing benchmarks with image-level and presentation-level assessments. Experiments on deep research benchmarks show that \\textsc{Ptah} produces more reliable, visually informative, and usable human-facing multimodal reports than strong baselines.","published_date":"2026-05-28T12:40:34+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Ptah is a multi-agent system that automates deep research by generating verifiable multimodal reports with interleaved text and visuals.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29860v1","title":"ESPO: Early-Stopping Proximal Policy Optimization","abstract":"When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on-the-fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME~2024 (46.28% vs. 
45.25%), AMC~2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving more than 20% rollout tokens cumulatively.","published_date":"2026-05-28T12:40:22+00:00","viability_score":4,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"ESPO is an early-stopping algorithm for RL-trained LLMs that detects and terminates failed trajectories to improve efficiency and accuracy.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29843v1","title":"HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization","abstract":"Post-training quantization (PTQ) is essential for deploying LLMs under memory and bandwidth constraints. However, extreme low-bit quantization remains highly sensitive to activation outliers and anisotropic weight curvature. Existing incoherence-based PTQ methods mitigate this issue with fixed randomized Hadamard transforms (RHTs), which improve quantization robustness but cannot adapt the rotated basis to the layer, calibration distribution, or quantizer. We introduce HARP (Hadamard-preconditioned Adaptive Rotation Processor), a learnable structured two-sided orthogonal processor that replaces fixed Hadamard mixing while preserving exact full-precision equivalence. HARP represents each rotation as a product of sparse butterfly-like block-orthogonal stages, supports non-power-of-two dimensions via Mixed-Radix schedules, and initializes to the RHT processor up to a fixed permutation. Fitted only on calibration data, HARP adapts the quantization basis to each layer and backend. Across 2-4 bit settings on models ranging from 1B to 70B parameters, HARP improves perplexity and zero-shot accuracy over fixed RHT. 
Importantly, HARP preserves deployment efficiency, reaching 128 tok/s versus 61 tok/s for FP16.","published_date":"2026-05-28T12:24:15+00:00","viability_score":4,"cluster_label":"LLM Quantization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"HARP is a learnable processor for extreme LLM quantization that adapts rotation bases to improve performance and deployment efficiency.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29836v1","title":"CB-SLICE: Concept-Based Interpretable Error Slice Discovery","abstract":"Despite strong average-case performance, deep learning models often exhibit systematic errors on specific population groups, known as error slices. Identifying these groups and the root causes of their failures is critical for model debugging and bias mitigation. However, existing error Slice Discovery Methods (SDMs) typically generate explanations disconnected from the model's inference process, thus only approximating the underlying error source and may be inaccurate.   We address this limitation by leveraging Concept Bottleneck Models (CBMs), whose predictions are directly dependent on human-understandable semantic concepts. Since downstream task failures in CBMs commonly arise from concept mispredictions, concept representations provide a strong candidate for error slice identification, offering fine-grained explanations directly linked to the error source. Building on this insight, we introduce CB-SLICE, a concept-based SDM that groups samples with shared concept prediction failures and identifies the keyword concepts most responsible for each slice's failure mode. 
Across multiple benchmarks, we show that CB-SLICE outperforms state-of-the-art methods in uncovering well-known biases while providing richer and more faithful explanations of model errors.","published_date":"2026-05-28T12:16:41+00:00","viability_score":4,"cluster_label":"AI Explainability","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A concept-based method for discovering and explaining systematic errors in deep learning models.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29833v1","title":"OmniMatBench: A Human-Calibrated Multimodal Reasoning Benchmark Across 19 Materials Science Subfields","abstract":"As multimodal language models play an increasingly important role in scientific research, materials science offers a critical testbed due to its interdisciplinary, multimodal, and application-driven nature. However, existing materials benchmarks mainly focus on property prediction, knowledge QA, or characterization understanding, leaving the broader reasoning process from materials knowledge to application underexplored. To fill this gap, we present OmniMatBench, a human-calibrated multimodal reasoning benchmark for materials science. OmniMatBench contains 3,171 expert-curated QA and calculation problems across 19 materials-science subfields, spanning fundamental materials knowledge, structural and engineering materials, materials processing and manufacturing, and functional and applied materials. We evaluate 13 open-source and closed-source MLLMs and find that the best model achieves only a 0.372 overall score, revealing a substantial gap in current materials-science reasoning. Further analysis shows strong variation across subfields, fixed reasoning heuristics, uneven materials knowledge, and limited high-level knowledge application under formula-, retrieval-, and code-assisted settings. 
OmniMatBench provides crucial insights into the capabilities and limitations of current MLLMs and establishes a foundation for reliable AI assistants in materials-science research.","published_date":"2026-05-28T12:12:27+00:00","viability_score":4,"cluster_label":"Multimodal AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A human-calibrated benchmark for evaluating multimodal reasoning in materials science.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29829v1","title":"OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation","abstract":"Leveraging Large Language Models (LLMs) to automatically formulate and solve optimization problems from natural language has emerged as an efficient paradigm for automated optimization. However, existing methods still exhibit limited generalization: they are sensitive to superficial narrative variations, reuse experience mainly at the case level, and struggle to adapt to shifted or emerging problem types. We propose OptSkills, an archetype-centric skill learning and reasoning agent system for optimization modeling and solving. To improve robust generalization, our system clusters problems by their underlying archetypes rather than surface narratives. To improve in-distribution generalization, it explores diverse modeling paradigms and solver configurations within each cluster, then distills successful trajectories into reusable workflow-level skills. To improve out-of-distribution generalization, it refines existing skills or expands the skill library using newly obtained trajectories. Our system achieves a state-of-the-art micro-averaged accuracy of 68.27% on datasets encompassing diverse problem types and scenarios. In addition, on MIPLIB-NL, a highly challenging large-scale and high-dimensional benchmark, it achieves 26.91% accuracy, outperforming DeepSeek-V3.2-Thinking by 4.53%. 
After skill learning on Nano-CO, it reaches 72.79% on the OOD NLCO benchmark. Code and skills are available at https://github.com/fujiwaranoM0kou/OptSkills.","published_date":"2026-05-28T12:07:28+00:00","viability_score":7,"cluster_label":"AI Agents","has_code":true,"repo_url":"https://github.com/fujiwaranoM0kou/OptSkills","commercial_flags":["has_code"],"one_liner":"An AI agent that learns generalizable optimization skills by distilling problem archetypes into reusable workflows.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.29826v1","title":"Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models","abstract":"Existing methods in Multimodal Knowledge Editing (MKE) have advanced the ability to correct outdated or inaccurate knowledge in Multimodal Large Language Models (MLLMs). However, they exhibit a critical limitation: while effectively modifying target factual pairs, they fail to generalize edits to logically related queries and often cause unintended alterations to unrelated but visually or semantically linked information. We identify and formalize two underlying failure modes causing this issue: Causal Misalignment, which confines edits to the specific sample, and Feature Entanglement, which causes unintended alterations to coupled but irrelevant information. To address these issues, we propose Localized and Disentangled Knowledge Editing (LDKE), a new framework that achieves precise and generalized editing by localizing fact-specific model layers and disentangling target-relevant inputs from irrelevant ones. Our approach introduces a Fast Localization module to identify and update critical layers efficiently, along with a Disentanglement Classifier that routes inputs appropriately to preserve unrelated knowledge. 
Extensive experiments across various benchmarks and MLLMs demonstrate that LDKE achieves superior performance in propagating edits to related contexts while maintaining high locality.","published_date":"2026-05-28T12:06:39+00:00","viability_score":4,"cluster_label":"Multimodal AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for localized and disentangled knowledge editing in multimodal large language models.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29823v1","title":"Quantifying and Optimizing Simplicity via Polynomial Representations","abstract":"Deep networks often exhibit a preference for \"simple\" solutions, and such a simplicity bias is widely believed to play a key role in generalization. Yet a broadly applicable, quantitative measure of simplicity remains elusive. We introduce polynomial representations as a distribution-aware, low-dimensional surrogate for neural functions: we approximate a network's predictive behavior along data-dependent interpolation paths using orthogonal polynomial bases, yielding a compact functional representation. We show that the effective degree of this representation serves as a practical simplicity metric that is predictive of generalization across tasks and architectures, and consistently outperforms existing generalization proxies such as sharpness. 
Finally, polynomial representations naturally yield a differentiable simplicity regularizer, which consistently improves generalization in image and text classification, fine-tuning contrastive vision-language models, and reinforcement learning.","published_date":"2026-05-28T12:05:41+00:00","viability_score":2,"cluster_label":"Model Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Develops a novel metric for neural network simplicity using polynomial representations to improve generalization.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29822v1","title":"Inferring Code Correctness from Specification","abstract":"Large language models (LLMs) have become integral to modern software development, enabling automated code generation at scale. However, validating the correctness of LLM-generated code remains a critical and largely unsolved challenge. Existing approaches either rely on dynamic consensus across multiple code candidates - making them costly and difficult to scale - or on static reasoning that is susceptible to dynamic bugs and order bias. In this paper, we propose TRAILS (Targeted Reasoning Agreement via Inputs and Specifications), an approach that grounds LLM reasoning with concrete (input, output) pairs. TRAILS first generates diverse test inputs via category partitioning based on the specification, then executes them against the candidate code and prompts LLMs to assess whether the resulting input-output pairs conform to the specification - without ever reasoning over the code itself. Scores are aggregated across inputs to determine whether the program is likely correct. We evaluate TRAILS on two datasets, LiveCodeBench and CoCoClaNeL, across three LLMs (Qwen3Coder-30B, Devstral-Small-24B, and Olmo3.1-Instruct), comparing against HoarePrompt and a Zero-Shot Chain-of-Thought baseline. 
TRAILS improves the Matthews Correlation Coefficient by up to 39% relative to Zero-Shot COT and consistently outperforms HoarePrompt. Beyond accuracy, TRAILS demonstrates greater stability across seeded runs, reducing sensitivity to LLM non-determinism, and assigns correct labels to a larger set of unique code samples than competing approaches.","published_date":"2026-05-28T12:04:51+00:00","viability_score":7,"cluster_label":"Code Generation & Validation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel approach to validate LLM-generated code by grounding reasoning with concrete input-output pairs, significantly improving correctness inference.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29816v1","title":"Harnessing non-adversarial robustness in large language models","abstract":"The work presents an approach for addressing the challenge of robustness in Large Language Models (LLMs) to alterations and potential errors caused by semantically similar but textually different prompts. Recent works have shown that these kinds of prompt variations can significantly impact the performance of LLMs on tasks. The central question is: can LLMs' robustness to semantically-neutral prompt alterations be acquired without expensive retraining of the entire model? We address this question both theoretically and through experiments. Our theoretical analysis reveals a crucial factor impacting model robustness - a systematic expected shift or perturbation-induced bias in neural network module outputs. Motivated by this analysis, we show that robustness can be achieved via a simple fine-tuning process: debiasing for robustness. 
We identify conditions when debiasing helps and when it does not, and demonstrate, through both theory and extensive experiments, that debiasing for robustness may indeed be a quick and efficient tool to enhance robustness and provide certification against random prompt perturbations.","published_date":"2026-05-28T12:00:00+00:00","viability_score":7,"cluster_label":"LLM Robustness","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A debiasing fine-tuning method to enhance LLM robustness against prompt variations without full retraining.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29815v1","title":"PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing","abstract":"The growing number of submitted papers has motivated the exploration of Large Language Models (LLMs) as a means to support and augment the peer review process, particularly in terms of improving its speed and scalability. Yet, it remains unknown whether LLMs engage with scientific manuscripts in the same manner as human reviewers, or whether they merely produce review-looking text. To address this, we introduce the Peer Review AI Benchmark (PRAIB), a novel framework comprising thoroughly defined metrics that measure review specificity, style, and behavior of engagement. To complement the PRAIB framework, we conduct a large-scale empirical study leveraging a dataset of 11,000 reviews generated by five proprietary and open-source models for 1,000 ICLR and NeurIPS papers. Spanning the 2021--2025 period, these machine-generated reviews are compared against original human feedback across diverse prompting strategies to identify systematic behavioral divergences. 
Our analysis reveals that the generated reviews diverge significantly from feedback provided by human reviewers: LLM ratings are less variable, positively biased, and overconfident, and their cross-reference patterns are model-dependent and distinct from human norms. Furthermore, when evaluated through PRAIB, we observe that LLMs tend to generate longer, more complex reviews, yet frequently overlook the atomic weaknesses noted by human reviewers. By characterizing where and how LLMs' reviewing behavior departs from human norms, PRAIB provides the community with a diagnostic tool for identifying which aspects of the review process LLMs can reliably support today and which require further development before deployment.","published_date":"2026-05-28T11:59:54+00:00","viability_score":6,"cluster_label":"AI for Peer Review","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Introduces a benchmark and analysis framework to evaluate LLM behavior in scientific peer review, highlighting key divergences from human reviewers.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29807v1","title":"Data filtering methods for training language models","abstract":"Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization. In this work, we conduct a comparative analysis of two automatic label error detection methods - Confident Learning and Dataset Cartography - on three Russian text classification corpora of varying size, number of classes, and domain: ru_emotion_e-culture (49,123 examples, emotion classification), RuCoLA (8,524 examples, linguistic acceptability), and TERRa (2,337 examples, textual entailment recognition). We use the pre-trained rubert-base-cased model fine-tuned on each corpus. 
To verify the meaningfulness of filtering, we conduct control experiments with random removal of an equivalent number of examples. Results show that the effectiveness of both methods depends strongly on dataset characteristics: on large corpora with low noise levels, filtering does not improve performance, while on small datasets with high noise, Confident Learning achieves a significant F1-macro improvement. Dataset Cartography demonstrates more conservative behavior, removing fewer examples. Across all corpora, targeted removal by both methods outperforms random removal, confirming the meaningfulness of the approaches.","published_date":"2026-05-28T11:52:59+00:00","viability_score":4,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research analyzes methods for detecting and removing label errors in Russian text classification datasets to improve model generalization, with potential applications in data cleaning tools.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29801v1","title":"AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security","abstract":"Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. 
We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.","published_date":"2026-05-28T11:48:37+00:00","viability_score":5,"cluster_label":"AI Safety and Security","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AgentDoG 1.5 provides a scalable framework for enhancing AI agent safety and security through robust alignment methods.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.29796v1","title":"SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search","abstract":"Agentic search enables LLMs to solve complex multi-hop questions through iterative reasoning and external search. Despite the effectiveness, these systems often suffer from a critical limitation in practice: agents fail to recognize their own knowledge boundaries, blindly triggering searches when internal knowledge suffices and failing to terminate search even when adequate evidence has been collected. The lack of self-awareness leads to severe \\textbf{over-search}, incurring substantial inference latency and prohibitive computational cost. To this end, we propose SAAS, a novel RL framework designed to cultivate dynamic self-awareness that precisely regulates search behavior without compromising accuracy. 
SAAS introduces three key components: (i) a search boundary modeling mechanism, which identifies the search boundary under the evolving policy by contrasting search-disabled and search-enabled rollouts; (ii) a boundary-aware reward module, which translates this boundary awareness into trajectory-level penalties, suppressing unnecessary and redundant searches; and (iii) a stage-wise optimization strategy, which leverages a sequential curriculum to prioritize reasoning over search regularization, thereby avoiding reward hacking. Extensive experiments demonstrate that SAAS substantially reduces over-search, while maintaining accuracy. Our code is anonymously released at https://github.com/XMUDeepLIT/SAAS.","published_date":"2026-05-28T11:45:45+00:00","viability_score":7,"cluster_label":"AI Agents","has_code":true,"repo_url":"https://github.com/XMUDeepLIT/SAAS","commercial_flags":["has_code"],"one_liner":"SAAS is a reinforcement learning framework that trains LLM agents to dynamically regulate their search behavior, reducing over-search and inference costs without sacrificing accuracy.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29794v1","title":"SkillsInjector: Dynamic Skill Context Construction for LLM Agents","abstract":"LLM agents now draw on growing skill libraries to handle complex tasks. However, injecting more skills does not always improve task completion and can even degrade it. Existing methods still treat skill injection as a static step, selecting skills with fixed criteria, fixing the budget in advance, and leaving descriptions unchanged. We argue that this static treatment can undermine the utility of skills, because which skills are exposed, how many are included, and how they are presented all affect downstream performance. We propose SkillsInjector, a two-stage adaptive method that jointly addresses these decisions. 
First, a context planner learns execution-grounded skill preferences and admits an adaptive number of skills for each task. A set-aware renderer then tailors how selected descriptions are presented relative to their co-injected neighbors. Across tau2-bench, SkillsBench, and ALFWorld, SkillsInjector achieves the highest score, improving over the strongest baseline by 3.9, 6.1, and 7.3 percentage points, respectively. Ablation studies show that skill selection, adaptive budgeting, and set-aware rendering each contribute to the gain. These results show that skill-augmented agents benefit from optimizing the injected context itself. Code will be released upon publication","published_date":"2026-05-28T11:44:32+00:00","viability_score":6,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"SkillsInjector is an adaptive method for LLM agents that dynamically plans and renders skill contexts, improving task completion by optimizing skill selection and presentation.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus"]},{"arxiv_id":"2605.29795v1","title":"MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains","abstract":"Real-world tasks often lack large labeled datasets, motivating extensive work on learning in low-data regimes. However, existing approaches such as few-shot prompting, instruction tuning, and synthetic data generation, continue to treat labeled or pseudo-labeled data as the primary learning signal. In contrast, human practitioners acquire expertise through repeated, self-directed interaction with the open web, progressively refining both domain knowledge and search strategies. We propose MEMENTO, a framework that treats the web as a learning signal rather than a stateless retrieval interface. 
MEMENTO operates at two levels: within each session, it conducts iterative web exploration via an Adaptive Exploration Tree (AET) that decomposes tasks into evolving questions and reflects on intermediate findings; across sessions, it accumulates experience through dual-channel memory, separating declarative knowledge (facts) from procedural knowledge (search strategies). This design enables agents to learn reusable research strategies and domain expertise from trajectories of web interaction without additional model training. We evaluate MEMENTO on two low-data professional domains: sales automation and legal research. Our empirical results show consistent improvements in performance over ReAct based baselines (+25.6% on sales automation and 36.5% on legal research), demonstrating that the web can serve as a scalable learning source for acquiring task-specific expertise in data-scarce settings.","published_date":"2026-05-28T11:44:32+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework that leverages the web as a learning signal for low-data domains, enabling agents to acquire expertise without additional model training.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29790v1","title":"Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems","abstract":"LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution and such failures are difficult to eliminate during design. This motivates experience-driven MAS evolution, where a system improves based on its own execution experience. 
Yet such evolution is challenging because MAS experience is prolonged and intricate, interleaving multiple agents' execution chains and communication messages, which makes it difficult to identify what should be improved. To address this challenge, we propose Meta-Team, an experience-driven MAS evolution framework based on collaborative self-evolution. Meta-Team preserves the execution context of each agent and coordinates post-task communication, enabling agents to exchange distributed evidence for evolution. Building on this design, Meta-Team conducts multi-scale self-evolution, transforming execution experience into reusable improvements to agent behaviors, inter-agent coordination, and team-level organization. Across six long-horizon agent benchmarks, Meta-Team consistently outperforms single-agent systems, hand-crafted MAS, and prior MAS evolution methods; further analyses demonstrate that Meta-Team enables more reliable and scalable MAS self-evolution.","published_date":"2026-05-28T11:40:16+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A collaborative self-evolution framework for LLM-based multi-agent systems that improves agent behaviors, coordination, and team organization based on execution experience.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29788v1","title":"Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk","abstract":"Critical sequential decisions are rarely single-timescale: a strategic decision causally shapes the context in which every subsequent tactical choice is made; standard bandit and reinforcement-learning theory does not capture this causal coupling between timescales. 
We formalise the problem class as Nested Contextual Causal Bandits (NCCBs), a hierarchical SCM where each level's action sets the next level's context distribution, and propose Nested Causal Thompson Sampling (NCTS), which draws one mechanism-factorised belief per episode and acts recursively under it. Our main theoretical result is a causal PAC-Bayesian excess-risk bound that certifies any candidate deployment policy from historic data alone, off-policy and anytime, answering the deployment question: can we trust this agent here, and at what risk? Experiments on a hierarchical SCM show that, against a matched RFF-GP joint regression on the same function class, the factorised SCM-mechanism posterior transfers significantly better zero-shot under exogenous distribution shifts, the recursive meta-to-inner commit significantly dominates the joint-commit alternative in distribution, and the certificate significantly contracts as offline data accumulates. Combining these results, we establish progressive certified handover, a safe-deployment method: each timescale flips from a legacy controller to NCTS when gains can be certified, independently of the others.","published_date":"2026-05-28T11:39:51+00:00","viability_score":3,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical framework for nested causal bandits with PAC-Bayes risk guarantees for safe deployment of sequential decision-making agents.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29786v1","title":"Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations","abstract":"Reproducibility is fundamental to the scientific method, yet remains a critical challenge in machine learning. Contributing factors include underspecified execution details and brittle software environments. Human-centric remedies, such as checklists and manual verification, help but require intensive effort and fail to scale. 
To address this, we introduce Croissant Tasks: a declarative, machine-actionable metadata format that abstracts low-level implementation details into high-level specifications. This format enables conceptual reproducibility: verifying claims via independent, agent-generated implementations rather than brittle source code replication. We contribute: (1) the Croissant Tasks specification, formally decoupling task problem from solution; (2) an automated LLM pipeline that retrofits existing benchmarks into this format; and (3) empirical validation showing autonomous agents can ingest these specifications to generate functional, accurate reproduction pipelines from scratch. We envision this format as a new foundation for automated and conceptual reproducibility in machine learning.","published_date":"2026-05-28T11:34:09+00:00","viability_score":7,"cluster_label":"MLOps","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A metadata format for reproducible machine learning evaluations that enables autonomous agents to generate functional reproduction pipelines from scratch.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29782v1","title":"Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning","abstract":"Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In this work, we introduce the State Value Estimation Benchmark (SVEB) to assess state estimation within existing RL frameworks and show that critics in standard approaches like PPO collapse to a coarse group-average baseline. 
To address this, we propose two techniques: Numca, which leverages numerical spans as gradable milestones for state value estimation, and Hista, a framework that uses the LLM's hidden states as representations to compute a weighted average of disjoint rollouts and their returns. Extensive experiments demonstrate that both methods yield more accurate state value estimates and enhance training performance across different RL algorithms and model sizes without incurring significant computational overhead.","published_date":"2026-05-28T11:31:13+00:00","viability_score":4,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Develops novel techniques to improve state value estimation in LLM reinforcement learning, enhancing training stability and performance.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.29773v1","title":"Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation","abstract":"Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distribution shift. Strong uncertainty baselines such as Monte Carlo Dropout often require repeated stochastic forward passes and are difficult to deploy on edge platforms. We propose Energy-Aware NECO, a single-pass pixel-wise out-of-distribution (OOD) detector for semantic segmentation. The method combines a centered NECO-style geometric ratio computed from decoder features with a logit-based Energy score. Both components are standardized using statistics fitted on a pure in-distribution validation split and fused through a convex combination. We evaluate the method on the miniMUAD subset using true pixel-level OOD labels. The proposed hybrid score achieves an AUROC of 0.8539, outperforming NECO-only (0.8280), Energy-only (0.8171), and an ensemble predictive-entropy baseline (0.8124). 
Additional qualitative and operating-point analyses show that the hybrid detector improves overall ranking performance while preserving the efficiency advantages of a single-pass design. Code is available at https://github.com/boyuan-zhangx/Energy-Aware_NECO","published_date":"2026-05-28T11:19:46+00:00","viability_score":7,"cluster_label":"Computer Vision","has_code":true,"repo_url":"https://github.com/boyuan-zhangx/Energy-Aware_NECO","commercial_flags":["has_code"],"one_liner":"A single-pass, energy-aware detector for robust out-of-distribution detection in semantic segmentation for mobile robots.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29768v1","title":"From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks","abstract":"Existing traffic forecasting benchmarks assume a fixed sensor set, but real road-sensor networks grow continuously as the road network changes year by year. We introduce the XXLTraffic dataset family, which spans up to 27 years of California PeMS and Transport for NSW data. The fixed-sensor subsets of XXLTraffic support extremely long forecasting with multi-year gaps and standard hourly / daily long-horizon forecasting. We extend it to EvoXXLTraffic, a sensor-evolving reorganization that exposes per-year active sensors, yearly traffic-flow matrices, and yearly graph snapshots across nine PeMS districts, with growth ratios ranging from +305% to over +10,000%. We define a yearly streaming forecasting protocol on EvoXXLTraffic in which each calendar year is a continual task, and benchmark a wide range of representative baselines drawn from static spatio-temporal GNNs, na\u00efve online schemes, evolving-graph continual methods, and retrieval / test-time methods. We find that our ultra-large evolutionary dataset better reflects the real world, and many state-of-the-art (SOTA) results no longer work. 
Our dataset complements existing benchmarks by enabling more realistic forecasting under ultra-long evolutionary road networks.","published_date":"2026-05-28T11:13:34+00:00","viability_score":4,"cluster_label":"Traffic Forecasting","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Introduces a new family of datasets and a protocol for ultra-long traffic forecasting in evolving sensor networks.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29756v1","title":"LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs","abstract":"As large language models continue to scale, low-bit weight-only post-training quantization (PTQ) offers a practical solution to their memory-efficient deployment. Although block-wise PTQ is capable of matching the full-precision (FP) baseline on basic language modeling and understanding, its quality is degraded for generative tasks -- especially at longer responses and extended chains of thought, which is critical in boosting task accuracy. We attribute this shortfall to two factors: (i) the omission of the unembedding layer (the LM head) in block-wise optimization and (ii) the reliance on the mean squared error (MSE) objective. Both factors cause the token probability distribution of the quantized model to misalign with that of the FP model, yielding notable accuracy drops on text generation benchmarks. To rectify the discrepancy, we introduce Logit-aware Final-block Quantization (LFQ), a simple yet effective enhancement to block-wise PTQ that quantizes the final Transformer block by minimizing the cross-entropy between the logits of the FP model and those of its quantized counterpart. 
By aligning token probabilities at the logit level in the final block, LFQ consistently improves the accuracy of complex generation tasks over state-of-the-art block-wise PTQ across diverse model families, while maintaining parity with FP baselines on language modeling and understanding.","published_date":"2026-05-28T11:02:23+00:00","viability_score":7,"cluster_label":"LLM Quantization","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LFQ improves low-bit quantized LLM generation quality by aligning logit distributions in the final block.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29754v1","title":"Benchmarking Positional Encoding Strategies for Transformer-Based EEG Foundation Models","abstract":"Electroencephalography (EEG) is a widely used non-invasive technique for measuring brain activity in brain-computer interface (BCI) applications. Supervised EEG decoding models often struggle to generalize across tasks, subjects, and datasets, motivating transformer-based EEG foundation models trained with self-supervised learning. Since transformers are permutation-invariant, they require explicit positional information. Unlike textual tokens, EEG electrodes are spatially distributed across the scalp, raising the question of how electrode positions should be encoded in transformer-based EEG models. In this study, we benchmark five positional encoding strategies within the CBraMod backbone and evaluate them under linear probing and fine-tuning protocols on motor imagery classification and emotion recognition. Our results show that no single strategy consistently outperforms across tasks. Spherical Positional Encoding (SPE) yields strong representations for motor imagery but underperforms on emotion recognition, while Asymmetric Conditional Positional Encoding (ACPE) demonstrates more consistent performance across tasks. 
These findings suggest that the optimal positional encoding strategy is task-dependent, with no universal solution across EEG decoding scenarios.","published_date":"2026-05-28T10:58:34+00:00","viability_score":4,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Benchmarking positional encoding strategies for transformer-based EEG foundation models to improve generalization across tasks and subjects.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29753v1","title":"A unified deeplearning framework for contrast-phase-specific virtual monochromatic imaging","abstract":"Dual-energy CT (DECT) enables virtual monochromatic imaging (VMI) and improved contrast resolution, but its clinical adoption is limited by hardware complexity and cost. In this work, we propose a unified deep learning framework that synthesizes contrast-phase-specific virtual monochromatic 50 keV images from single-energy CT (SECT) data by leveraging contrast phase information as a prior. The model is trained using DECT-derived 70 keV and 50 keV image pairs across four contrast phases -- Angio, Arterial, Portal, and Delayed -- using a novel prior conditioning architecture that integrates contrast phase priors into the energy transformation process. We demonstrate that the proposed unified model achieves contrast enhancement and generalizes well across contrast phases. 
Additionally, we show that the model can generate 50 keV-like images from SECT inputs, preserving contrast phase-specific dynamics.","published_date":"2026-05-28T10:55:28+00:00","viability_score":4,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified deep learning framework synthesizes contrast-phase-specific virtual monochromatic CT images from single-energy CT data.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29744v1","title":"Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence","abstract":"The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. 
HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.","published_date":"2026-05-28T10:42:44+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A heterogeneous multi-agent framework orchestrates generalist LLMs, specialist models, and clinicians for improved medical decision-making.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29742v1","title":"Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering","abstract":"Deploying Large Language Models (LLMs) for regulatory compliance demands rigorous traceability via comprehensive citations across multi-tiered authority structures. Unlike traditional multi-hop or legal QA, this task requires structured procedural lookups and evidence-set closure rather than entity resolution or case-law reasoning. Existing RAG systems struggle here due to flattened citation edges, fragmented retrieval expansions, and fragile post-hoc attribution. We formalize Regulatory Compliance QA with RegOps-Bench, a novel benchmark featuring an Operational Knowledge Graph derived from complex national R\\&D regulations. To address these bottlenecks, we propose RefWalk, a unified framework driven by a shared topic anchor. RefWalk traverses cross-document citations, fuses multi-view candidates via max-based aggregation, and enforces per-rule attribution to explicitly map claims to sources. We establish a strong baseline with substantial improvements in retrieval recall and citation accuracy. Finally, a contrastive evaluation on a U.S. health compliance dataset (HIPAA) reveals that existing systems exhibit saturation on flat-structure rules, underscoring the need for RegOps-Bench. 
Our code is available at https://github.com/yeongjoonJu/RefWalk.","published_date":"2026-05-28T10:38:38+00:00","viability_score":8,"cluster_label":"Regulatory Compliance QA","has_code":true,"repo_url":"https://github.com/yeongjoonJu/RefWalk","commercial_flags":["has_code"],"one_liner":"RefWalk is a framework for regulatory compliance QA that provides citation-closure retrieval and per-rule attribution for complex national regulations.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29738v1","title":"Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions","abstract":"Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cross-lingual comparison impossible. We introduce Multi-Legal-Bench, the first cross-jurisdictional legal benchmark that evaluates identical tasks across six countries (Ukraine, France, Netherlands, Poland, Czech Republic, Lithuania), four language families, and 134 million court decisions. The benchmark defines five tasks (court-type classification, judgment form classification, case-outcome prediction, legal norm extraction, and cause category prediction) mapped to structured metadata from national court registries, forming a deliberately sparse 5x6 task-jurisdiction matrix (20 of 30 cells filled). We evaluate 7 frontier LLMs under zero-shot and 3-shot prompting via AWS Bedrock, with 4 additional small/medium models (3-12B) for scaling analysis. 
Our results reveal that: (1) task-dependent few-shot effects discovered in Ukrainian replicate across all jurisdictions; (2) no single model dominates any language, and rankings shift with both task and jurisdiction; (3) cross-lingual few-shot transfer does not follow language proximity: UA->FR (Romance, -2.1 pp) transfers better than UA->PL (Slavic, -13.7 pp), with label-set alignment predicting transfer quality better than language family; and (4) tokenizer fertility, despite a 2.3x spread, does not significantly predict cross-lingual accuracy (r=-0.27, p=0.14), suggesting that model architecture and pretraining data dominate tokenizer efficiency. We release all data, prompts, and model predictions.","published_date":"2026-05-28T10:31:37+00:00","viability_score":7,"cluster_label":"Legal AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark and evaluation of LLMs for cross-jurisdictional legal reasoning, revealing performance gaps and transfer learning insights.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29733v1","title":"Uncertainty-Aware Transfer Learning for Cross-Building Energy Forecasting: Toward Robust and Scalable District-Level Energy Management","abstract":"Scaling data-driven energy forecasting to district level requires models that can be re-used across buildings with minimal target-domain data and honest uncertainty estimates. We present an uncertainty-aware transfer learning (TL) framework for cross-building energy forecasting based on the Temporal Fusion Transformer (TFT), evaluated on a newly released high-resolution real sub-meter dataset: an educational building at Aalborg University, Denmark (source) and the multi-typology NEST building at EMPA, Switzerland (target). We introduce the Transfer Robustness Index (TRI), an architecture-agnostic metric for quantifying generalization quality across domain gaps. 
A four-strategy layer-freezing ablation shows that Probe-Only fine-tuning, updating only 455 output-layer parameters out of 806K, achieves the best transfer quality (TRI = 3,097), outperforming full fine-tuning and suggesting that TFT encoders learn transferable temporal representations. Monte Carlo Dropout yields a prediction interval coverage probability of 93.2%, close to the nominal 95% target. A data-scarcity analysis further shows monotonic improvement with increasing target-domain data, providing practical guidance for district energy deployment.","published_date":"2026-05-28T10:28:05+00:00","viability_score":7,"cluster_label":"Energy Forecasting","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An uncertainty-aware transfer learning framework for robust and scalable cross-building energy forecasting at the district level.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29716v1","title":"NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs","abstract":"Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive generative paradigm. Given the prohibitive computational cost of full fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) has become the standard approach. However, existing PEFT methods (e.g., LoRA), originally tailored for autoregressive models, rely on static parameters that are agnostic to the noise level. Consequently, they ignore the intrinsic dynamics of the diffusion process, where input distributions and generation difficulty shift significantly along the denoising trajectory, rendering them suboptimal for dLLMs. To address this, we propose Noise-aware Low-Rank Adaptation (NaRA), which introduces a low-rank core matrix generated by a lightweight, globally shared hypernetwork conditioned on the noise level. 
This design enables the update matrices to vary continuously along the diffusion process while keeping parameter and latency overhead negligible. We provide a theoretical justification for the proposed NaRA framework and empirically demonstrate consistent improvements over noise-agnostic baselines across commonsense reasoning, mathematical reasoning, and code generation benchmarks. Our code is available at https://github.com/generaldi/NaRA.","published_date":"2026-05-28T10:13:50+00:00","viability_score":8,"cluster_label":"LLM Fine-Tuning","has_code":true,"repo_url":"https://github.com/generaldi/NaRA","commercial_flags":["has_code"],"one_liner":"NaRA: A noise-aware parameter-efficient fine-tuning method for diffusion LLMs that adapts to noise levels for improved performance.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29713v1","title":"The Little Book of Generative AI Foundations: An Intuitive Mathematical Primer","abstract":"This book provides a compact, derivation-oriented introduction to the mathematical foundations of modern generative artificial intelligence. Rather than surveying every recent architecture or implementation detail, it develops a coherent route through the ideas connecting major families of generative models, from PCA, probabilistic PCA, variational autoencoders, and diffusion models to normalising flows, autoregressive factorisations, GANs, Wasserstein GANs, and energy-based models. The aim is to make the structure of generative modelling more accessible without removing the mathematical substance needed to understand how these models are derived and related. 
The book is intended as a foundation-building primer for mathematically curious researchers, practitioners, and students.","published_date":"2026-05-28T10:12:00+00:00","viability_score":0,"cluster_label":"Generative AI Foundations","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A mathematical primer on the foundations of generative AI, covering key models and their derivations.","time_to_mvp":"N/A","tags":[]},{"arxiv_id":"2605.29712v1","title":"Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies","abstract":"Grounded claim factuality checking is important for large language model (LLM) applications such as retrieval-augmented generation, as it helps users assess the correctness of generated outputs. Existing metrics using entailment classifiers require dataset-specific threshold tuning, while LLM-based approaches often use direct prompting, which underutilises the reasoning capabilities of LLMs. We address this by formulating grounded claim factuality checking as a true/false reading comprehension task and prompting LLMs with explicit test-taking strategies for efficient reasoning. Our method reduces token usage by over 80% compared to unguided open-ended reasoning, and achieves competitive performance to more expensive alternatives across two factuality benchmarks, setting a new state of the art on one. To further reduce inference cost, we train small language models (SLMs) to replace LLMs in the checking pipeline. Using supervised fine-tuning (SFT) and a self-revision mechanism, the SLMs learn to improve their factuality judgements. Experimental results show that the resulting SLMs perform on par with strong baselines, combining low inference costs with generating supporting rationales to support interpretability. 
Code and datasets will be released upon acceptance.","published_date":"2026-05-28T10:11:42+00:00","viability_score":7,"cluster_label":"LLM Factuality Checking","has_code":true,"repo_url":"https://github.com/Haruhi07/Test-Taking","commercial_flags":["has_code"],"one_liner":"A novel method for LLM factuality checking that reduces token usage by over 80% and sets a new state-of-the-art on one benchmark, with potential for small language model deployment.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29711v1","title":"Personalized Turn-Level User Conversation Satisfaction Benchmark","abstract":"User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation methods mostly measure generic response quality, making it difficult to judge whether a response satisfies a user at a specific turn. We study this problem as personalized turn-level user conversation satisfaction evaluation. We build a conversation satisfaction evaluator that combines compact user memories with target-turn context to produce satisfaction scores and dissatisfaction-oriented rationales. Meta-evaluation against human satisfaction annotations shows that personalized memory and post-hoc score calibration improve ordinal agreement and dissatisfied-turn detection over supervised, retrieval-based, and generic LLM-as-a-judge baselines. We further introduce PersTurnBench, a personalized turn-level user conversation satisfaction benchmark that uses the verified evaluator to assess generation models via replay. By holding the replay state fixed, PersTurnBench enables controlled comparison of generic generation models and memory-augmented personalized systems without new human labels for every candidate model. 
The evaluator and benchmark let researchers compare candidate generation models on personalized satisfaction without collecting new user feedback for every model.","published_date":"2026-05-28T10:10:37+00:00","viability_score":6,"cluster_label":"Personalized AI Assistants","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and evaluator for personalized turn-level user conversation satisfaction, improving ordinal agreement and dissatisfaction detection over existing methods.","time_to_mvp":"3-6 months","tags":["high_potential"]},{"arxiv_id":"2605.29705v1","title":"BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices","abstract":"Trajectory prediction is a fundamental task for autonomous systems, requiring complex reasoning about multi-agent interactions and intents. Large language models (LLMs) have recently been adopted for this task, as they provide strong contextual reasoning and interpretable, language-based trajectory representations. However, these LLM-based predictors are extremely memory- and compute-intensive, making them difficult to deploy on resource-constrained edge devices such as on-board computers in autonomous robots. To bridge this gap, we propose BitTP, which converts an LLM-based trajectory predictor into a lightweight bitlinear architecture. We demonstrate that weight-only quantization to 1.58-bit (BitTP-Weight) is optimal. Crucially, activations must remain in full precision, as quantizing them leads to severe degradation and instability in spatio-temporal reasoning. Empirically, BitTP-Weight not only preserves but improves prediction quality over the full-precision (BF16) LLM baseline, reducing ADE by 14.29% and FDE by 20.97% on average, while simultaneously reducing memory usage and inference latency relative to other quantization methods. 
These results demonstrate that carefully designed quantization acts as an effective regularizer, enabling the practical deployment of sophisticated LLM-based reasoning on edge devices. Code is available at: https://github.com/MintCat98/BitTP.","published_date":"2026-05-28T10:04:02+00:00","viability_score":8,"cluster_label":"Edge AI","has_code":true,"repo_url":"https://github.com/MintCat98/BitTP","commercial_flags":["has_code"],"one_liner":"BitTP, a lightweight trajectory prediction model that converts LLMs into a bitlinear architecture for edge devices, preserving and improving prediction quality while reducing memory and latency.","time_to_mvp":"3-6 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29697v1","title":"Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling","abstract":"In Agentic Search, trajectory-level outcome rewards fail to quantify the behavioral contributions of individual steps, while existing step-level reward methods typically rely on costly tree sampling. We view world knowledge as a latent world graph and each IS task as search within a latent task graph, where effective steps should make graph progress toward the answer node. Based on this prior, we propose Graph-Distance Contribution Reward (GDCR), a step-level process reward that scores newly-retrieved and newly-cited entities by their distance to the answer node in a training-time Entity-Relation (ER) graph. We further propose Step Advantage Policy Optimization (SAPO), which converts GDCR into step-level advantages and combines them with trajectory-level outcome advantages. 
Experiments on four challenging benchmarks validate the effectiveness of our method.","published_date":"2026-05-28T09:57:12+00:00","viability_score":4,"cluster_label":"Agentic Search","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel step-level credit assignment method for agentic search that uses graph modeling to score entity contributions, improving performance on challenging benchmarks.","time_to_mvp":"3-6 months","tags":["high_potential"]},{"arxiv_id":"2605.29695v1","title":"FHRFormer: A Self-Supervised Masked Transformer Framework for Fetal Heart Rate Time-Series Inpainting and Forecasting","abstract":"Approximately 10% of newborns require assistance to initiate breathing at birth, and around 5% need ventilation support. Fetal heart rate (FHR) monitoring plays a crucial role in assessing fetal well-being during prenatal care, enabling the detection of abnormal patterns and supporting timely obstetric interventions to mitigate fetal risks during labor. Applying artificial intelligence (AI) methods to analyze large datasets of continuous FHR monitoring episodes with diverse outcomes may offer novel insights into predicting the risk of needing breathing assistance or interventions. Recent advances in wearable FHR monitors have enabled continuous fetal monitoring without compromising maternal mobility. However, sensor displacement during maternal movement, as well as changes in fetal or maternal position, often lead to signal dropout, resulting in gaps in recorded FHR data. Such missing data limits the extraction of meaningful insights and complicates automated (AI-based) analysis. Traditional approaches to handling missing data, such as simple interpolation techniques, often fail to preserve the spectral characteristics of the signals. In this paper, we propose a masked transformer-based autoencoder approach to reconstruct missing FHR signals by capturing both local temporal and frequency components of the data. 
The proposed method demonstrates robustness across varying durations of missing data and can be used for signal inpainting and forecasting. The proposed approach can be applied retrospectively to research datasets to support the development of AI-based risk algorithms. In the future, the proposed method could be integrated into wearable FHR monitoring devices to achieve earlier and more robust risk detection.","published_date":"2026-05-28T09:55:14+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A self-supervised transformer framework to reconstruct and forecast missing fetal heart rate data, improving prenatal risk assessment.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29687v1","title":"Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability","abstract":"Large Language Models (LLMs) excel at understanding natural language but struggle with optimisation tasks involving multiple constraints and user-defined preferences, which commonly arise in domains such as robotics. We propose a hybrid reasoning approach in which LLMs externalise reasoning through code generation. Given a natural language problem description, an LLM generates Python code that encodes user-defined constraints and preferences as a preference-based Maximum Satisfiability (MaxSAT) problem, which is then solved by an exact MaxSAT solver. To ensure correctness, solutions returned by the model-generated code are independently verified for feasibility and optimality against a canonical MaxSAT encoding, allowing for different encodings and multiple optimal solutions. We evaluate our approach using both open-source and closed-access LLMs on three families of preference-based reasoning tasks, and compare it against direct-answer, chain-of-thought, and program-of-thought baselines using the same models. 
While these baselines rarely produce feasible solutions, the MaxSAT-based pipeline achieves substantially higher acceptance rates, in some cases exceeding 80%. Our results demonstrate that LLM-driven code generation combined with preference-based MaxSAT enables solver-verifiable optimisation with respect to generated encodings, and substantially improves correctness under independently verified reference semantics.","published_date":"2026-05-28T09:51:33+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Enabling LLMs to perform reliable optimization tasks by generating code that encodes problems into a solvable MaxSAT format.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29685v1","title":"NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs","abstract":"As large language models (LLMs) are increasingly applied in social contexts such as emotional companionship and customer service, measuring their social intelligence has become critical to the quality and safety of human-AI interaction. However, existing social intelligence benchmarks lack a unified framework that organizes social abilities into a unified structure, and therefore cannot enable fine-grained diagnosis. To build the first holistic diagnostic evaluation grounded in social theory, we first construct a social intelligence framework through a literature review and multi-stage expert validation guided by psychometric principles. The resulting framework includes 4 categories and 11 dimensions, each further specified by fine-grained capability facets. Building on this framework, we introduce NICE (Norm, Interaction, Cognition, Experience), a diagnostic benchmark of 137 items operationalized through representative Chinese contexts. 
Across 5 frontier LLMs and a human reference group, models score higher in aggregate accuracy yet show a consistent weakness in Communication, which the framework localizes to 3 specific capability facets: multi-turn communication, nonverbal communication, and synchrony. NICE thus reframes social intelligence evaluation toward theory-grounded diagnosis of socially consequential weaknesses in LLMs.","published_date":"2026-05-28T09:51:06+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A theory-grounded benchmark for diagnosing the social intelligence of LLMs, identifying specific weaknesses in human-AI interaction.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29676v1","title":"Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems","abstract":"Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. 
TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models.","published_date":"2026-05-28T09:38:13+00:00","viability_score":7,"cluster_label":"Agentic AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Evaluating token-optimized data formats for agentic AI systems to reduce overhead and improve efficiency in tool use.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29675v1","title":"From Prompts to Context: An Ontology-Driven Framework for Human-Generative AI Collaboration","abstract":"Collaborations with Generative AI often begin with a short prompt and end with an opaque output, leaving implicit who was involved, what task was being pursued, which resources were used, and which constraints should have shaped the process. This limited contextual explicitness hinders trust, traceability, and accountability, particularly when Generative AI is embedded in information-intensive workflows such as search, querying, and profile management. This paper introduces From Prompts to Context, an ontology-driven framework for representing Human-Generative AI collaboration. Its core component, the Contextual Collaboration AI Ontology (CCAI), models key elements of collaboration - including tasks, agent roles, resources, and constraints - as a shared machine-interpretable vocabulary. By combining populated CCAI instances with SPARQL-based context retrieval in operational workflows, the framework turns otherwise ephemeral prompt-response interactions into structured and queryable collaboration traces linking prompts, outputs, and their surrounding context. The approach is illustrated through a case study involving a software development team building a competency-based education feature for viewing and updating learner competency profiles. 
The case study shows how the framework can support the representation and documentation of collaboration episodes across requirements analysis, design, implementation, and testing. Within this setting, the results indicate that explicit collaboration modelling helps make task context more explicit, improves the traceability of AI-generated contributions, and supports more transparent and accountable Human-Generative AI practices. We conclude by outlining design principles for future Human-Generative AI systems that emphasise not only output quality, but also the explicit representation of the collaborative context in which outputs are produced.","published_date":"2026-05-28T09:35:59+00:00","viability_score":3,"cluster_label":"AI Collaboration & Traceability","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for representing human-generative AI collaboration using an ontology to improve trust and accountability in information-intensive workflows.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29670v1","title":"EviLink: Multi-Path Schema Linking with Uncertainty-Guided Evidence Acquisition for Large-Scale Text-to-SQL","abstract":"Schema linking is a difficult and important step in large-scale Text-to-SQL, where systems must identify a compact yet sufficient schema context from large and ambiguous databases. Existing methods often treat schema linking as deterministic selection around a single SQL path, but complex questions may admit multiple valid realizations with different schema needs. We reframe schema linking as uncertainty-aware schema-need inference over multiple plausible SQL paths, where the system distinguishes required schema items from path-dependent uncertain ones and acquires evidence only where needed. We instantiate this reframing with EviLink, which combines multi-hypothesis schema grounding with uncertainty-guided evidence acquisition. 
Experiments on BIRD-Dev and Spider2-Snow show that this perspective improves the balance among schema completeness, schema relevance, and token cost. On Spider2-Snow, EviLink achieves 90.15% field-level strict recall rate, uses 123.30K average tokens, and improves downstream SQL generation under a fixed generator.","published_date":"2026-05-28T09:32:38+00:00","viability_score":3,"cluster_label":"Text-to-SQL","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel approach to schema linking in Text-to-SQL that handles multiple SQL paths and uncertainty for improved accuracy and efficiency.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29668v1","title":"GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents","abstract":"LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. 
The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.","published_date":"2026-05-28T09:30:09+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":"https://github.com/jomoll/GRASP","commercial_flags":["has_code"],"one_liner":"GRASP is a self-improvement framework for LLM agents that ensures new skills don't break existing capabilities, significantly boosting performance on clinical benchmarks.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29659v1","title":"Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content","abstract":"Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail models, and that can distinguish benign sensitive text from genuinely covert harmful content. In this paper, we introduce Opir, a family of encoder-based guardrail models built on the GLiClass architecture. Opir includes multi-task models for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization. We also release edge variants with fewer than 100M parameters dedicated to binary safe/unsafe categorization. The models are trained on a three-level taxonomy containing 996 categories across 16 top-level labels, 126 mid-level labels, and 854 leaf labels. 
Opir's training data combines taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. We also open-sourced an evaluation harness that supports GLiClass and GLiNER2 backends as well as decoder-based models, and covers binary safety classification, multi-label categorization, toxicity, jailbreak detection, prompt safety, response safety, response refusal, and prompt subcategory views across public benchmark families. Across an expanded comparison spanning 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems -- including both GLiNER2-based and generative guardrail models -- Opir variants are competitive on or ahead of the strongest open-weight baselines on the majority of benchmark datasets while operating with a substantially smaller deployment footprint.","published_date":"2026-05-28T09:21:42+00:00","viability_score":7,"cluster_label":"LLM Safety","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Opir offers efficient, multi-task safety classification for LLMs, detecting toxicity, jailbreaks, and harmful content with a smaller footprint than existing solutions.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29657v1","title":"OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning","abstract":"Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. 
In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a training-free framework that replaces absolute token ranking with register-anchored relative evidence testing. Instead of asking which tokens are globally important, OccamToken evaluates whether a visual token provides information beyond a register-based reference. Our key insight is that register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence. Based on this principle, OccamToken performs both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention. Across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off without additional training. Notably, on LLaVA-NeXT, it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy, enabling stable visual token compression even in the extreme 1.4% retention regime.","published_date":"2026-05-28T09:20:47+00:00","viability_score":4,"cluster_label":"LLM Inference Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A training-free framework for efficient vision-language model inference by adaptively pruning visual tokens based on relative evidence testing.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29656v1","title":"TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation","abstract":"Evaluating open-ended outputs from large language models (LLMs) remains challenging due to the absence of ground truth. Existing metrics rely on final-answer accuracy or surface-level statistics, leaving the reasoning process itself unexamined. 
We introduce TRACE (Toulmin-based Reasoning Assessment through Constructive Elements), a metric that analyzes Chain-of-Thought (CoT) reasoning processes. Rather than judging outcomes, TRACE inspects how arguments are constructed by integrating Toulmin's argumentation theory with Flavell's metacognitive framework to assess reasoning structure. Experiments on 26.3K QA samples across 7 reasoning models show strong correlation with benchmark accuracy (r=0.74). Furthermore, TRACE is effective as a reinforcement learning reward signal, outperforming accuracy-only baselines. Together, these results indicate that logically sound reasoning leads to higher-quality answers. TRACE thus serves as a complementary metric for evaluating open-ended outputs. Code is available at https://github.com/hyyangkisti/trace.","published_date":"2026-05-28T09:19:50+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":"https://github.com/hyyangkisti/trace","commercial_flags":["has_code"],"one_liner":"A novel metric for evaluating LLM Chain-of-Thought reasoning by assessing argument construction based on Toulmin's theory and metacognition.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29653v1","title":"PTCG-Bench: Can LLM Agents Master Pok\u00e9mon Trading Card Game?","abstract":"Given a strategically complex board game, human players can quickly learn to devise strategies after playing a few rounds. Autonomous agents require similar capabilities in realistic interactive environments, yet existing agent benchmarks often fail to fully capture such strategic and evolving decision-making scenarios. We present PTCG-Bench, a benchmark built on the Pok\u00e9mon Trading Card Game (PTCG) that evaluates LLM agents at two complementary levels: (1) their decision-making performance within a single complex environment, and (2) their ability to self-evolve through accumulated experience. 
We further include a modular harness ablation to better interpret agent performance without conflating it with model capability. Our experiments show that, although LLM agents can achieve non-trivial gameplay performance, sustained and stable self-evolution remains challenging, and performance is sensitive to harness design. We hope that PTCG-Bench will facilitate future research on harness-aware and self-evolving agents in realistic interactive environments.","published_date":"2026-05-28T09:16:22+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":"https://github.com/zjunet/PTCG-Bench","commercial_flags":["has_code"],"one_liner":"A benchmark for evaluating LLM agents in the complex, evolving environment of the Pok\u00e9mon Trading Card Game, focusing on decision-making and self-evolution.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29652v1","title":"Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation","abstract":"Large language models (LLMs) are increasingly being used to generate health text from structured records such as wearable time series, biomarkers, vitals, and care-management logs. For recurring health outputs, fluency is not enough: systems must remain faithful to source data, ground explanatory claims in available evidence, follow stated policies, emit machine-readable outputs, and run cheaply enough for repeated use. We ask which responsibilities in structured health generation should be deterministic computation rather than runtime LLM prompting. We introduce Think Fast, Talk Smart, a sleep-health insight pipeline in which deterministic code performs recurring analysis before one bounded LLM writer call. Across 280 user-nights and six models, the pipeline achieves lower numeric error, lower instruction-compliance error, and lower end-to-end cost than structured zero-shot and few-shot one-call baselines. 
Layer replacement reveals contract-specific failures: LLM comparison raises numeric error, LLM ranking degrades policy selection, LLM attribution increases unsupported causal language, and an LLM-generated writer interface reintroduces errors even after upstream facts are deterministic. The results support a broader design rule: let code own recurring analysis, and let LLMs express verified facts within bounded interfaces.","published_date":"2026-05-28T09:16:21+00:00","viability_score":4,"cluster_label":"Health Text Generation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A health text generation pipeline that partitions deterministic analysis from LLM prompting to ensure faithfulness, policy adherence, and cost-efficiency.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29649v1","title":"LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning","abstract":"Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are the result of decades of work by planning researchers. Recent work has shown that large language models (LLMs) can design heuristics for individual planning domains, but no LLM-generated heuristic has so far worked on arbitrary planning tasks. In this paper, we use evolutionary search to produce the first LLM-generated domain-independent heuristics that exceed the hand-engineered state of the art. We let an LLM mutate parent heuristics written in C++, store candidates in a MAP-Elites archive keyed on informedness and speed and calculate fitness scores by blending coverage with solving time. To place the evolved programs in context, we additionally benchmark a broad set of hand-engineered heuristics on their informedness-speed tradeoff, which to our knowledge has not been done before. On unseen testing domains, our best evolved heuristic solves more tasks than even the strongest baseline, with our full heuristic suite spanning the Pareto frontier of said tradeoff. 
We also find that seeding evolution from the trivial blind heuristic outperforms seeding from the strong FF heuristic, even when the resulting program is itself an FF variant, and that LLM reasoning effort affects how often candidates compile much more than the quality of those that do. Because the evolved programs are plain C++, they slot into existing planners as drop-in replacements and inherit the soundness and completeness guarantees of the underlying search.","published_date":"2026-05-28T09:14:39+00:00","viability_score":7,"cluster_label":"Symbolic AI Planning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LLM-evolved domain-independent heuristics for symbolic AI planning that outperform state-of-the-art.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29645v1","title":"The Sample Complexity of Multiclass and Sparse Contextual Bandits","abstract":"We study contextual bandits in the stochastic i.i.d.\\ setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $A$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by bandit multiclass classification with zero-one rewards, we focus on the \\emph{$s$-sparse} setting in which, for every context, the reward vector has $L_1$-norm at most $s \\ll |A|$. Our main result is the design of algorithms that, with high probability, output an $\u03b5$-optimal policy compared to policy class $\u03a0$ using $\\tilde{O} ((s/\u03b5^2 + |A|/\u03b5)\\log |\u03a0|/\u03b4)$ samples. We extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024, 2025), which incurred an additional $\u0398(|A|^9)$ dependence. We obtain these results via two complementary approaches. 
First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the \\emph{decision-estimation coefficient} (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.","published_date":"2026-05-28T09:12:20+00:00","viability_score":3,"cluster_label":"Contextual Bandits","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Improved sample complexity algorithms for sparse contextual bandits with applications in multiclass classification.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29640v1","title":"VikingMem: A Memory Base Management System for Stateful LLM-based Applications","abstract":"Large Language Models have revolutionized interactive applications; however, their finite context windows pose a critical data management challenge for maintaining stateful, long-term interactions. Existing memory approaches often rely on simplistic extraction methods that lead to incomplete memories or use rigid, single-purpose memory extraction prompts tailored to a single use case, such as chatbots. Consequently, they lack generalizability and perform poorly across diverse downstream tasks. To bridge this gap, we introduce the Memory Base, a novel data management paradigm for managing the persistent state of long-term interactions. 
It is characterized by three core principles: selective extraction of high-value memories from raw information streams; inherent statefulness and evolution, where memory content is progressively summarized, corrected, and temporally weighted to prioritize recent interactions; and a generalizable abstraction paradigm designed for robust transferability across diverse applications, including education, recommendation, and agent memory. Building on this foundation, we present VikingMem, an end-to-end Memory Base Management System implemented on the VikingDB vector engine. VikingMem materializes this paradigm through interconnected event and entity abstractions. It features event-centric memory extraction to selectively handle complex information streams, while entities are dynamically updated by events to achieve stateful evolution. Using temporal compression via a topic-wise timeline and time-weighted recall, the system progressively produces high-level summary memories, prioritizes recent items, and compresses and fades older ones. Extensive evaluations on long-term memory benchmarks demonstrate that VikingMem outperforms baselines by up to 30% in memory retrieval effectiveness while maintaining the low latency essential for interactive applications.","published_date":"2026-05-28T09:07:42+00:00","viability_score":8,"cluster_label":"LLM Memory Management","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"VikingMem is a stateful memory management system for LLM applications that improves retrieval effectiveness and latency.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29631v1","title":"Predicting Causal Effects from Natural Language Queries using Structured Representations","abstract":"Randomized controlled trials are a cornerstone of medicine and the social sciences as they enable reliable estimates of causal effects. 
However, they are costly and time-consuming to conduct, motivating interest in predicting causal effects from existing experimental evidence. Recent advances in large language models (LLMs) have demonstrated strong performance on knowledge-intensive tasks, raising the question of whether these models can be used for forecasting causal effect sizes. To investigate this, we introduce Query2Effect, a new large-scale benchmark consisting of more than 72,000 natural language questions aligned with experiment descriptions, created to simulate realistic information-seeking scenarios by varying query specificity along dimensions of implicitness, abstraction, and ambiguity. We then propose a two-step framework that first generates a synthetic structured representation of a query before predicting effect size using a supervised encoder model. Experiments show that finetuning plays a crucial role in improving prediction performance, with absolute error falling by 27% to 71% compared to prompted out-of-the-box LLMs, and that our two-step framework is beneficial for out-of-domain generalization, highlighting the benefits of separating semantic interpretation from numerical effect estimation.","published_date":"2026-05-28T09:04:07+00:00","viability_score":7,"cluster_label":"Causal Effect Prediction","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Query2Effect predicts causal effects from natural language queries using a structured representation framework and fine-tuned LLMs.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29630v1","title":"Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory","abstract":"End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor entity overlap) with tag-mixing (preferences, services, tools averaged together). 
We propose entity-collision, a system-agnostic protocol that pins the BM25 floor by construction -- every distractor shares the answer's entity tokens -- and stratifies queries by discriminator tag, so any lift over BM25 is attributable to the embedder. Applied to an open-source agent-memory testbed across 5 tags x 3 embedders x 5 collision degrees with paired-bootstrap 95% CIs, the protocol reveals a two-axis pattern: a 256-d hash trigram helps only on closed-vocabulary lexical tags at deep collision; MiniLM-384 dominates both axes; and a 2.7x-parameter BGE-large does not uniformly improve on MiniLM -- it wins on intent-style queries but loses on lexical ones. Encoder capacity alone is not the binding constraint. The synthetic intent-tag null replicates on LongMemEval (n=500) as a single-session-preference recall cliff. Adaptive vector-weight routing on LoCoMo is a measured null: 11.7pp of oracle headroom exists, but no signal we tested recovers it. All 26 result tables and 37 reproduce scripts are version-controlled and verified by a public registry; the protocol is exercised on a deterministically governed memory testbed (event-sourced decision log, DAG-state-machine schema lifecycle) so every reported CI is reproducible byte-for-byte from the ingest stream.","published_date":"2026-05-28T09:02:48+00:00","viability_score":7,"cluster_label":"Agent Memory Benchmarking","has_code":true,"repo_url":"https://github.com/youwangd/engram","commercial_flags":["has_code"],"one_liner":"A system-agnostic protocol and testbed for precisely attributing retrieval lift in agent memory benchmarks, revealing nuanced performance patterns of embedders.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29629v1","title":"Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures","abstract":"Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure 
happened but not how it unfolded. Two attacks that produce equally harmful outputs may have followed completely different paths, and ASR cannot tell them apart. We make those hidden paths observable from logits alone. Temporal Logit Observability (TLO) is a training-free diagnostic that watches a compliance-refusal margin during decoding and places each model-attack condition on a calibrated 2D plane. By design, this plane is most informative exactly where ASR is least informative: among attacks that succeed for genuinely different reasons. Across four aligned LLMs and three jailbreak paradigms, attacks with nearly identical ASR land at clearly different points on the plane: the same model can fail through different temporal patterns. The geometry matches refusal-direction probes from hidden states on most conditions, with one model showing the limit of our fixed-lexicon approach. A simple early-stop rule derived from TLO cuts successful jailbreaks by more than half, without false alarms on plain benign queries. Safety evaluation should report when and how a failure unfolds, not only whether it occurred. TLO makes the first two observable from logits alone.","published_date":"2026-05-28T09:02:00+00:00","viability_score":5,"cluster_label":"LLM Safety Evaluation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Temporal Logit Observability (TLO) provides a training-free diagnostic to understand the unfolding of LLM safety failures beyond simple attack success rates.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus"]},{"arxiv_id":"2605.29628v1","title":"COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings","abstract":"Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. 
Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component only partially represents the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. 
At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.","published_date":"2026-05-28T09:00:44+00:00","viability_score":4,"cluster_label":"Multimodal Embeddings","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"COMET dissects the modality gap in audio-text contrastive embeddings, revealing that concept space organization, not just mean shifts, is key to improving performance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29626v1","title":"DLM-SWAI: Steering Diffusion Language Models Before They Unmask","abstract":"Steering language model generation toward desired textual properties is essential for practical deployment, and inference-time methods are particularly appealing because they enable controllable generation without retraining. Recent work has also highlighted diffusion language models as an emerging generation paradigm with distinct decoding properties. However, most existing steering approaches either rely on auxiliary models or are designed for autoregressive next-token decoding, making them difficult to apply to diffusion language models (DLMs), which generate text through iterative denoising of partially masked sequences. Therefore, we propose DLM-SWAI, a simple training-free steering method that biases the token distribution at each denoising step using pre-computed token-level style scores. Experiments on style and safety control tasks show that DLM-SWAI effectively steers diffusion language models while preserving generation quality and requiring minimal computational overhead. 
Ablations further reveal a controllable trade-off between steering strength and fluency, and our analysis links class-wise steerability to the strength of token-level attribute cues.","published_date":"2026-05-28T09:00:14+00:00","viability_score":5,"cluster_label":"Diffusion Language Models","has_code":true,"repo_url":"https://github.com/hsannn/dlm-swai","commercial_flags":["has_code"],"one_liner":"DLM-SWAI is a training-free method to steer diffusion language models by biasing token distributions at each denoising step, enabling controllable generation.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus"]},{"arxiv_id":"2605.29625v1","title":"Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models","abstract":"The topic of Co-creation, i.e., AI agents interacting with humans to generate outputs (e.g., art), has gained significant attention recently. However, most studies focus on adult-human interactions in a digital setting. This paper explores a novel ludic co-creation scenario involving children and Large Language Models (LLMs) interacting through a physical board game to create written stories. Our goal is to develop a multi-agent framework capable of producing high-quality narratives suitable for young players. At the core of our approach is an iterative Writer-Editor process in which one LLM generates stories while another evaluates them and provides feedback for refinement. Through a simulation study involving multiple LLMs, we show that this iterative interaction consistently improves the perceived quality of generated stories across successive loops. 
The results indicate that a small number of refinement steps may be sufficient to achieve high-quality outputs in interactive storytelling systems.","published_date":"2026-05-28T08:59:55+00:00","viability_score":4,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A multi-agent framework using LLMs for iterative story generation in a children's board game.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29591v1","title":"Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion","abstract":"Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for Brain-Computer Interfaces (BCIs). A major limitation of prior work is the prevailing paradigm of specialized, single-task models, which curtails versatility and neglects inter-task synergies. To address this, we propose Mind-Omni, the first versatile framework that unifies seven distinct encoding and decoding tasks through a discrete diffusion paradigm. At its core is a novel Brain Tokenizer that transforms heterogeneous, continuous brain signals into standardized, discrete tokens. This enables direct, token-level interactions for mutual understanding and generation between any two or more modalities within a shared semantic space. To unlock advanced reasoning capabilities, we further curate a specialized Brain Question Answering (BQA) instruction-tuning dataset. Our model not only establishes a new state-of-the-art among multi-task unified frameworks but also provides strong evidence for multi-task synergy. By demonstrating performance competitive with, and at times superior to, larger specialized models, our work offers a powerful new paradigm for neural modeling and paves the way for foundation models of neural activity. 
The code is publicly available at https://github.com/ReedOnePeck/Mind-Omni.","published_date":"2026-05-28T08:33:43+00:00","viability_score":7,"cluster_label":"Brain-Computer Interfaces","has_code":true,"repo_url":"https://github.com/ReedOnePeck/Mind-Omni","commercial_flags":["has_code"],"one_liner":"A unified framework for brain-vision-language modeling using discrete diffusion and a novel brain tokenizer, enabling multi-task synergy and paving the way for neural activity foundation models.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29588v1","title":"Brain-IT-VQA: From Brain Signals to Answers","abstract":"Decoding visual content from fMRI signals recorded while a person views images, and specifically answering questions about the seen images, is a long-standing challenge. While significant progress has been made in recent years in visual question answering (VQA) from fMRI, performance remains limited. Moreover, although recent models can make increasingly accurate predictions, they have rarely been used as tools for understanding the structure of visual representations in the brain. We present Brain-IT-VQA, a framework for visual question answering from fMRI. Building on the Brain Interaction Transformer (Brain-IT), our method decodes language tokens from brain activity and integrates them with a language model to answer visual questions. Our model substantially outperforms previous fMRI-based captioning and VQA approaches. We further introduce NSD-VQA, a new dataset and benchmark for visual question answering from fMRI. Unlike existing image-fMRI VQA datasets, which typically provide only a few broad and weakly controlled questions per image, NSD-VQA provides on average 20 question-answer pairs per image across 20 controlled question categories that disentangle multiple levels of visual understanding. This enables more reliable and interpretable evaluation despite limited fMRI test data. 
Together, Brain-IT-VQA and NSD-VQA provide both a strong predictive framework and a tool for studying brain representations. Using this benchmark, we quantify which forms of visual and semantic information can be reliably decoded from fMRI responses to natural images. We further analyze the contributions of different brain regions across question types.","published_date":"2026-05-28T08:33:23+00:00","viability_score":7,"cluster_label":"Brain-Computer Interfaces","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for visual question answering from fMRI signals, integrating decoded language tokens with a language model to outperform previous approaches and provide a tool for studying brain representations.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29586v1","title":"FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification","abstract":"We introduce FinVerBench, a benchmark and validity study for financial statement verification: determining whether a set of corporate financial statements is numerically consistent from the information shown to the model. FinVerBench is built from SEC 10-K XBRL filings for 43 S&P 500 companies and defines a four-category error taxonomy covering arithmetic, cross-statement linkage, year-over-year, and magnitude perturbations. We attempt fifteen contemporary LLM evaluations and report fourteen complete runs; a Gemini 2.5 Pro run is excluded from the main comparison because 40/108 gateway calls failed. All binary metrics exclude underdetermined positive instances whose perturbed line item is not rendered, leaving a 105-instance observable diagnostic subset (43 clean, 62 error-injected). Under the original guided-checklist prompt on the unrounded diagnostic subset, nine of fourteen complete LLM runs produce 95-100% false positives on clean statements, while one run achieves 0% observed false positives. 
Benchmark rendering choices materially affect measured recall: on a realistic rounded variant of the same observable subset, the calibrated model's recall is 79.0% with 0% observed FPR, compared with 100.0% recall on the unrounded diagnostic variant. These results support a construct-validity conclusion rather than a final leaderboard: financial statement verification is not merely arithmetic detection, but calibrated judgment under incomplete observability, prompt-induced assumptions, and realistic numerical rendering. FinVerBench and all code are publicly available.","published_date":"2026-05-28T08:30:15+00:00","viability_score":6,"cluster_label":"Financial AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark for financial statement verification that highlights the challenges of calibrated judgment under incomplete observability and realistic numerical rendering for LLMs.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.29578v1","title":"GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation","abstract":"Tourist mobility poses a distinct challenge for urban transportation planning. Unlike resident commuting, tourist travel is largely non-routine, attraction driven, and highly sensitive to trip purpose, travel season, and trip member composition. Existing approaches either measure aggregate tourist spatial patterns without generating individual schedules, or synthesize mobility without tourist specific structure such as trip duration conditioning, month varying attraction demand, and household co-travel rules. To address these challenges, we propose a four stage simulation framework combining month conditioned spatial priors derived from GPS and survey data, trip extent prediction from tourist demographics, distance feasible ward sequence assignment, and LLM-based activity chain generation under household and spatial constraints. 
GPS data are used only in privacy preserving aggregated form as month conditioned spatial priors, with no individual traces retained or exposed. Experiments on tourism in Tokyo demonstrate that the GPS based tourist cohort extraction recovers spatial visitation signatures consistent with survey references, and our framework produces demographically aligned synthetic schedules whose ward-level visitation shares align closely with both survey distributions and staypoint derived monthly visitation patterns. The results demonstrate the framework's effectiveness as a geographically grounded, demographically aware approach to tourist mobility modeling.","published_date":"2026-05-28T08:23:39+00:00","viability_score":4,"cluster_label":"Urban Mobility","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A simulation framework for tourist mobility modeling that combines GPS-derived spatial priors, demographic trip extent prediction, and LLM-based activity chain generation.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29568v1","title":"DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning","abstract":"Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. While RL mitigates this, conventional approaches for Tool-Integrated Reasoning are hindered by sparse outcome-based rewards, failing to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a novel framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn. 
In DeepTool, we first introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Secondly, we devise Process-Supervised Reinforcement Learning based on GRPO, which utilizes an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn. Extensive experiments demonstrate that DeepTool achieves superior performance, boosting Qwen2.5-7B significantly across six benchmarks (e.g., AIME24: 3.2% -> 40.4% and HMMT25: 0.0% -> 28.6%). Furthermore, the token cost-effectiveness analysis confirms the utility of interleaved thinking, demonstrating DeepTool's optimal balance between performance and token efficiency.","published_date":"2026-05-28T08:17:20+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DeepTool enhances LLM tool-use by scaling interleaved deliberation with process-supervised reinforcement learning, significantly boosting performance on complex reasoning tasks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29563v1","title":"Planning with the Views via Scene Self-Exploration","abstract":"Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, requiring (1)understanding how a single action transforms the view, and (2)composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. 
To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space.","published_date":"2026-05-28T08:15:23+00:00","viability_score":6,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This framework enables VLMs to plan multi-turn camera movements in 3D scenes by combining self-exploration with view graph distillation.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29562v1","title":"VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models","abstract":"Vision-Language-Action~(VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework designed to enhance cross-task generalization by storing task-relevant procedural memories at training time and transferring these memories during inference. Specifically, VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, VLA-Pro retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. 
Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones, achieving up to a 207% relative improvement in simulation and increasing real-world success rate from 5.8% to 65.0%. These results suggest that procedural memory retrieval and adaptation provide an effective mechanism for transferring manipulation experience to novel tasks while preserving modularity and execution stability.","published_date":"2026-05-28T08:14:08+00:00","viability_score":5,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"VLA-Pro enhances robotic manipulation generalization by transferring procedural memories via LoRA adapters, significantly improving performance on unseen tasks.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2605.29561v1","title":"ParaTool: Shifting Tool Representations from Context to Parameters","abstract":"Tool calling extends large language models (LLMs) by enabling grounded interaction with external executable interfaces, thereby supporting environment-coupled problem solving. However, mainstream in-context learning (ICL) approaches typically incorporate detailed tool documentation and usage examples directly into the context. This results in substantial inference overhead and heightened risks of hallucination as the context length grows. Conversely, while tuning-based methods improve general tool-calling capabilities, they often fail to effectively internalize the specific details of previously seen tools, thereby retaining a dependency on in-context documentation. To address these limitations, we propose ParaTool, a framework that projects each tool into a dedicated, loadable set of parameters. By equipping a dynamic integration of these parameterized tools, the LLM can perform tool calling without relying on in-context documents or examples. 
Specifically, our approach consists of three stages: (1) parametric tool pre-training encapsulates the knowledge of different tools into independent parameter modules; (2) soft tool selection employs a gating network to dynamically weigh and aggregate relevant tool parameters; and (3) parametric tool fine-tuning jointly updates tool parameters to align the training and inference processes. Experiments on Stable ToolBench and BFCL demonstrate that ParaTool significantly outperforms strong ICL-based baselines, achieving superior performance while reducing computational complexity.","published_date":"2026-05-28T08:14:07+00:00","viability_score":8,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ParaTool enables LLMs to call tools without in-context examples by representing tools as learnable parameters, reducing overhead and improving performance.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29560v1","title":"Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation","abstract":"Parameterizing high-fidelity \"digital twins\" of batteries is a critical yet challenging inverse problem that hinders the pace of battery innovation. Prevailing methods formulate this as a black-box optimization (BBO) task, employing algorithms that are sample-inefficient and blind to the underlying physics. In this work, we introduce a new paradigm that reframes the inverse problem as a reasoning task, and present Battery-Sim-Agent, the first framework to deploy a Large Language Model (LLM) agent in a closed loop with a high-fidelity battery simulator. The agent mimics a human scientist's workflow: it interprets rich, multi-modal feedback from the simulator, forms physically-grounded hypotheses to explain discrepancies, and proposes structured parameter updates. 
On a systematically constructed benchmark suite spanning diverse battery chemistries, operating conditions, and difficulty levels, our agent significantly outperforms strong BBO baselines like Bayesian optimization in identifying accurate parameters. We further demonstrate the framework's capability in complex long-horizon degradation fitting tasks and validate its practical applicability on real-world battery datasets. Our results highlight the promise of LLM-agents as reasoning-based optimizers for scientific discovery and battery parameter estimation.","published_date":"2026-05-28T08:12:47+00:00","viability_score":7,"cluster_label":"AI Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An LLM-powered agent that acts like a scientist to estimate battery parameters, outperforming traditional optimization methods.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29556v1","title":"Opt-Verifier: Unleashing the Power of LLMs for Optimization Modeling via Dual-Side Verification","abstract":"Building mathematical optimization models is critical in operations research (OR), while it requires substantial human expertise. Recent advancements have utilized large language models (LLMs) to automate this modeling process. However, existing works often struggle to verify the correctness of the generated optimization models, without checking the rationality of the constraints and variables or the validity of solutions to the generated models. This hampers the subsequent verification and correction steps, and thus it severely hurts the modeling accuracy. To address this challenge, we propose a novel LLM-based framework with Dual-side Verification (Opt-Verifier) from both structure and solution perspectives, thereby improving the modeling accuracy. 
The structure-side verification ensures that the modeling structure of the generated optimization models aligns with the original problem description, accurately capturing the problem's constraints and requirements. Meanwhile, the solution-side verification interprets and evaluates the solutions' validity, confirming that the optimization models are logically and mathematically sound. Experiments on popular benchmarks demonstrate that our approach achieves over 20% improvement in accuracy.","published_date":"2026-05-28T08:09:52+00:00","viability_score":7,"cluster_label":"LLM Applications","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"A framework that uses LLMs to build optimization models and verifies them from both structure and solution perspectives for improved accuracy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29547v1","title":"Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization","abstract":"Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern architectures due to non-smooth components such as ReLU activations and quantization operators. In such non-smooth regimes, adaptive optimizers such as Adam suffer from gradient chattering, violent oscillations caused by conflicting signals within the Clarke subdifferential, leading to poor convergence and suboptimal generalization. To address this, we introduce Singularity-aware Adam (S-Adam), a novel optimizer that stabilizes training by dynamically modulating step sizes based on local geometric instability. Our key contribution is the Local Geometric Instability (LGI) metric, a computationally efficient estimator of the Clarke subdifferential diameter derived from the variance of randomized directional derivatives. 
S-Adam incorporates an adaptive damping mechanism $\\exp(-\u03bb\u03c1)$ that decelerates updates in high-instability regions while preserving fast convergence in smooth basins. We provide a rigorous convergence analysis using differential inclusions, proving that S-Adam converges almost surely to $(\u03b4, \u03b5)$-Clarke stationary points at the optimal $O(1/\\sqrt{T})$ rate. Empirical evaluations on Quantization-Aware Training (QAT) and high-noise small-batch learning demonstrate that S-Adam consistently outperforms AdamW and Prox-SGD, achieving accuracy gains of up to 6 percent on CIFAR-100 and 3 percent on TinyImageNet while effectively mitigating gradient oscillations.","published_date":"2026-05-28T08:00:40+00:00","viability_score":4,"cluster_label":"Optimization","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new optimizer, S-Adam, that stabilizes deep learning training in non-smooth landscapes by estimating geometric instability.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29543v1","title":"SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring","abstract":"Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. 
In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.","published_date":"2026-05-28T07:56:24+00:00","viability_score":7,"cluster_label":"LLM Applications","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A lightweight LLM framework for air traffic control readback monitoring that uses in-context learning for high accuracy and low latency.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29539v1","title":"GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection","abstract":"Vision-language foundation models have shown promising zero-shot generalization for Cross-Domain Few-Shot Object Detection (CD-FSOD). However, they face two critical challenges in fine-tuning: insufficient support set utilization due to sparse single-instance annotations, and severe overfitting under extremely limited target-domain samples. 
To address these issues, this paper proposes GiPL, an efficient two-branch training framework. In the first branch, we design an iterative pseudo-label self-training paradigm, which performs zero-shot inference on the support set to generate reliable pseudo-annotations, fuses them with ground-truth labels, and iteratively optimizes the model to fully exploit support set data. In the second branch, we introduce a generative data augmentation pipeline using large vision-language models, which synthesizes domain-aligned, multi-object annotated images to enrich training samples and suppress overfitting. Extensive experiments on three challenging CD-FSOD datasets (RUOD, CARPK, CarDD) under 1/5/10-shot settings demonstrate that GiPL consistently outperforms state-of-the-art methods with significant performance gains. Code is available at https://github.com/z-yaz/CDiscover.","published_date":"2026-05-28T07:53:40+00:00","viability_score":7,"cluster_label":"Computer Vision","has_code":true,"repo_url":"https://github.com/z-yaz/CDiscover","commercial_flags":["has_code"],"one_liner":"A generative framework for few-shot object detection that leverages pseudo-labeling and synthesized data to overcome annotation scarcity and overfitting.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29534v1","title":"UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents","abstract":"Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and long-horizon planning. Small GUI agents that can be deployed directly on mobile devices are more attractive for practical use, offering lower inference cost and better protection of sensitive on-device information. 
However, due to limited model capacity, such lightweight agents remain unreliable when planning and executing GUI tasks end-to-end from screenshots alone. We propose Knowledge-Oriented Behavior Exploration (UI-KOBE), a framework that improves lightweight mobile GUI agents with reusable app-specific graph knowledge. UI-KOBE first autonomously explores a mobile application and constructs an app knowledge graph, where nodes represent distinct UI states and edges represent executable transitions. At runtime, a lightweight GUI agent uses the graph as external guidance: given a user task and the current screenshot, it identifies the current graph node and selects among self-loop actions, neighboring transitions, task completion, or fallback free actions associated with that node. By supporting runtime decisions with app-specific graph guidance, UI-KOBE reduces the burden of end-to-end GUI planning and helps lightweight models perform mobile GUI tasks more effectively, offering a practical step toward efficient, interpretable, and privacy-conscious on-device GUI agents.","published_date":"2026-05-28T07:49:09+00:00","viability_score":4,"cluster_label":"Agents","has_code":true,"repo_url":"https://github.com/YuxiangChai/UI-KOBE","commercial_flags":["has_code"],"one_liner":"A knowledge-graph guided framework for lightweight mobile GUI agents to improve task automation and privacy on-device.","time_to_mvp":"3-6 months","tags":["series_a_plus"]},{"arxiv_id":"2605.29532v1","title":"GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing","abstract":"Exploratory GUI testing is a particularly demanding setting for MLLM agents: without predefined test scripts, an agent must autonomously navigate an application and discover defects through its own interaction. However, current evaluation falls short on two fronts. First, existing benchmarks focus almost exclusively on interaction defects, leaving display defects outside the evaluation frame. 
Second, evaluation protocols are bound to predefined defect annotations, collapsing the testing process into a single end-state judgment that conflates qualitatively distinct failure modes. To address these challenges, we present GUITestScape, an interactive benchmark covering 61 real-world Android applications and 508 preset defects spanning interaction and display types, and introduce GUIJudge, an open-set evaluator that decomposes an agent's testing trajectory into independently diagnosable capabilities. Experimental results demonstrate that GUIJudge achieves reliable process-aware evaluation beyond predefined annotations, substantially outperforming all baselines. Benchmarking on GUITestScape further reveals that detection remains the critical bottleneck for existing models across both defect types, and that integrating GUIJudge's verifiers into existing agents significantly boosts their detection performance without retraining.","published_date":"2026-05-28T07:47:27+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An interactive benchmark and open-set evaluator for exploratory GUI testing that decomposes agent capabilities and identifies bottlenecks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29526v1","title":"Temporal Motif-aware Graph Test-time Adaptation for OOD Blockchain Anomaly Detection","abstract":"Ever-evolving transaction patterns have significantly hindered anomaly detection on emerging cryptocurrency blockchains due to the vast number of addresses and diverse anomalous behaviors. Recently, advanced Graph Anomaly Detection (GAD) approaches applied to blockchains have faced two critical challenges: \\textit{adversarial pattern evolution by malicious actors} and \\textit{the out-of-distribution (OOD) problem caused by varied transaction semantics on blockchains}. 
To address these challenges, we propose a novel framework termed \\textbf{TE}mporal \\textbf{M}otif-aware \\textbf{G}raph \\textbf{T}est-\\textbf{T}ime \\textbf{A}daptation (\\textbf{TEMG-TTA}). First, we comprehensively capture the 3-node temporal motif distribution of each active address using an efficient computational mechanism, enabling downstream temporal motif-aware graph learning. Second, we design a simple yet effective test-time adaptation strategy to facilitate the sharing of common patterns between training and testing graphs. Extensive experiments on 5 real-world datasets demonstrate that our proposed \\textbf{TEMG-TTA} outperforms \\textit{state-of-the-art} GAD approaches by an average of 54.88\\%. A further case study on interpretable motif patterns reveals that \\textbf{TEMG-TTA} explicitly characterizes the complex transaction patterns of anomalous addresses, thereby verifying the effectiveness of our technical designs. Our code will be made publicly available https://github.com/LuoXishuang0712/TEMG-TTA/.","published_date":"2026-05-28T07:43:20+00:00","viability_score":7,"cluster_label":"Blockchain AI","has_code":true,"repo_url":"https://github.com/LuoXishuang0712/TEMG-TTA","commercial_flags":["has_code"],"one_liner":"A temporal motif-aware graph test-time adaptation framework for out-of-distribution blockchain anomaly detection that significantly outperforms state-of-the-art.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29524v1","title":"KBF: Knowledge Boundary as Fingerprint for Language Model and Black-Box API Auditing","abstract":"Relay and reseller APIs increasingly intermediate access to large language models (LLMs), but users have no direct way to verify that a claimed endpoint is actually serving the advertised model. We introduce KBF, a low-cost black-box auditing protocol that fingerprints model APIs using stable numerical recall near the knowledge boundary. 
Across 16 production LLM endpoints, KBF flags all 155 economically relevant substitutions without rejecting any same-model controls, remains stable under deployment variation, detects high-separation mixed-routing attacks when only 5-10% of traffic is substituted, and finds that 7 of 27 platform model cells in a six-platform shadow API audit are statistically inconsistent with their reference endpoints, with inconsistencies concentrated on premium Claude endpoints.","published_date":"2026-05-28T07:40:24+00:00","viability_score":4,"cluster_label":"LLM Auditing","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A low-cost protocol to fingerprint LLM APIs and detect model substitutions without direct access.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29522v1","title":"DeepSurvey: Enhancing Analytical Depth and Citation Reliability in Automated Survey Generation","abstract":"As scientific literature grows rapidly, automated survey generation has become a key capability for AI scientists and human researchers. However, existing systems suffer from limited analytical depth due to reliance on abstracts and isolated paper processing, and unreliable citations from imprecise retrieval and post-hoc grounding, producing superficial surveys and may mislead researchers. We present DeepSurvey, an agentic system that addresses both. To enhance depth, DeepSurvey extracts structured keynotes from full-text papers, models cross-paper relationships through clustering and comparative analysis, and integrates code-repository analysis to recover implementation-level details. To fortify reliability, it combines citation-graph expansion with hybrid filtering for topic-focussed retrieval, enforces evidence-constrained citation assignment, and deploys multi-granularity agentic refinement to validate citation-claim alignment. 
Experiments show that DeepSurvey achieves the highest content score (8.644/10) and citation quality (12.3% and 9.3% recall and precision gains over the strongest baseline), generalizes more robustly across domains (0.14 vs 0.22 to 0.69 CS-to-non-CS drop), and is preferred over human-written surveys by domain experts (83.3% overall quality, 100% content depth).","published_date":"2026-05-28T07:40:10+00:00","viability_score":5,"cluster_label":"Automated Survey Generation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An agentic system that enhances the analytical depth and citation reliability of automated scientific literature surveys.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29518v1","title":"Network Optimization Aspects of Autonomous Vehicles: Challenges and Future Directions","abstract":"Global megatrends, such as urbanization, population growth, and emerging network solutions are accelerating the development of the Connected and Autonomous Vehicles (CAVs) industry. There are many truths, some misconceptions, and even some excitement about CAVs in the public's opinion. The main objective of the current article is to provide a comprehensive review, eliminate misconceptions, and outline the future of the network optimization aspects of autonomous vehicles by presenting various multidisciplinary methods, such as cooperative perception. 
Given our extensive experience with CAVs, we are aiming to share some of the insights and knowledge we have gained, along with relevant use-cases and experiment results.","published_date":"2026-05-28T07:38:29+00:00","viability_score":0,"cluster_label":"Autonomous Vehicles","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A review of network optimization aspects for connected and autonomous vehicles, addressing challenges and future directions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29512v1","title":"MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs","abstract":"Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to ``theory of mind'': belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. 
Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage~II submissions under the same error-attribution lens used in this analysis.","published_date":"2026-05-28T07:33:47+00:00","viability_score":3,"cluster_label":"AI evaluation","has_code":true,"repo_url":"https://github.com/mind-games-challenge/mindgames-starter-kit","commercial_flags":["has_code"],"one_liner":"MINDGAMES is an arena for testing social and strategic reasoning in multi-agent language models.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.29507v1","title":"Xetrieval: Mechanistically Explaining Dense Retrieval","abstract":"Explaining why dense retrievers assign high relevance scores remains challenging because retrieval decisions are made through opaque high-dimensional embeddings. Existing explanations often focus on surface signals, such as lexical matches, token alignments, or post-hoc textual rationales, and thus provide limited insight into the latent factors that shape dense retrieval behavior at the embedding level. We propose \\textit{Xetrieval}, an embedding-level mechanistic framework for explaining dense retrieval. 
\\textit{Xetrieval} first introduces a lightweight reasoning internalizer that approximates Chain-of-Thought reasoning directly in the embedding space with a single forward pass, enriching sentence embeddings with reasoning-oriented information while avoiding expensive autoregressive generation. It then decomposes these reasoning-enhanced embeddings into sparse, human-interpretable features, each associated with a coherent natural language description. By aggregating sparse feature overlaps across multiple document-side views, \\textit{Xetrieval} provides feature-level explanations of individual retrieval decisions. Experiments on diverse retrievers and benchmarks show that \\textit{Xetrieval} uncovers coherent interpretable features, yields stronger pair-level intervention effects, and supports task-level feature steering. The project page and source code are available at https://hihiczx.github.io/Xetrieval .","published_date":"2026-05-28T07:29:58+00:00","viability_score":7,"cluster_label":"Explainable AI","has_code":true,"repo_url":"https://github.com/Hihiczx/Xetrieval","commercial_flags":["has_code"],"one_liner":"Xetrieval provides interpretable, embedding-level explanations for dense retrieval systems by internalizing reasoning and decomposing embeddings into human-understandable features.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29502v1","title":"Source-Grounded Semantic Reinforcement Learning for Low-Resource Target-Language Generation","abstract":"Low-resource target-language generation is often limited by scarce parallel data, while high-resource source-language monolingual data is abundant but difficult to use with standard supervised fine-tuning. We propose Source-Grounded Semantic Reinforcement Learning (SG-SRL), a resource-utilization framework that converts source-language monolingual data into cross-lingual semantic supervision for target-language generation. 
SG-SRL performs reference-free reinforcement learning (RL) on source-language data using a cross-lingual semantic reward model, instantiated by a cross-lingual reranker that scores the semantic relevance between the source input and the target-language generation. While this induces severe verbosity-based reward hacking, a lightweight recovery stage using a small parallel corpus restores fluency, conciseness, and task format while preserving the semantic gains. Experiments on Chinese-to-Thai generation show that SG-SRL improves semantic grounding and factual coverage over cold-start SFT. Additional analyses on long-form transfer and Tibetan embedding-based rewards clarify the generalization behavior of SG-SRL and show that an encoder-based semantic reward can substitute for an LLM-based reranker in a realistic low-resource language setting.","published_date":"2026-05-28T07:27:16+00:00","viability_score":4,"cluster_label":"Low-Resource NLP","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"SG-SRL leverages abundant source-language monolingual data to improve low-resource target-language generation through cross-lingual semantic reinforcement learning.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29500v1","title":"Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities","abstract":"Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial when online testing is costly or risky, such as in recommendation or healthcare. Standard importance sampling reweights each logged trajectory, but it can treat details of the generation process as meaningful even when the evaluation target ignores them: for example, an autoregressive slate recommender may generate an ordered sequence of items while the reward and downstream estimator depend only on the unordered slate. 
This creates nuisance variance and a computational gap, since exact unordered slate propensities require summing over all generation orders. We introduce a quotient-DAG view that merges histories equivalent for evaluation and assigns weights using target-to-behavior forward-flow ratios on the merged graph. For slate recommendation under a set-sufficient next-item interface, this yields Forward-DP, a subset-DAG dynamic program that computes exact unordered propensities without factorial enumeration. The resulting propensity primitive enables practical propensity-based evaluation and model selection for context-dependent autoregressive slate loggers.","published_date":"2026-05-28T07:23:40+00:00","viability_score":3,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Quotient DAGs enable exact off-policy evaluation for slate recommenders by merging equivalent histories and using forward-flow importance sampling.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29493v1","title":"The New Pro Se: Generative AI and the Surge in Federal Civil Self-Representation","abstract":"Since public access to generative AI tools became widespread, federal civil litigation has seen a marked increase in pro se (self-represented) plaintiffs. This paper analyzes that shift using ~2.8 million filings, asking whether the post-GenAI period is associated not only with more pro se filings, but also with detectable changes in complaint text, litigation outcomes, and the composition of pro se litigants.   Using civil filing data from FY2008-2025, we find that the federal civil pro se plaintiff rate rose from 11.33% pre-GenAI to 16.94% post-GenAI, a 5.61 percentage-point increase that persists after trend and covariate-adjusted robustness checks. We then focus on Civil Rights and Other Statutory cases, where the increase is especially pronounced, and link case metadata to pro se complaints. 
Drawing on stylometric AI detection indicators, we develop an interpretable measure of AI-consistent drafting. Against a threshold calibrated to the pre-GenAI baseline, the net AI-flagged share is 13.9% of post-GenAI non-form complaints.   Analysis of the AI-flagged complaints shows that they are more citation-dense, disproportionately associated with first-time rather than repeat filers, and geographically unevenly distributed. This composition pattern suggests that AI-consistent drafting is not merely a repeat-filer phenomenon; it also includes a modest, suggestive increase in name-inferred female plaintiffs. We find no evidence of improved win rates; in fact, AI-flagged complaints are more likely to be dismissed and to terminate at earlier procedural phases. These findings raise new questions about access to justice and court screening burdens, and sharpen the distinction between legal formality and legal efficacy.","published_date":"2026-05-28T07:19:09+00:00","viability_score":3,"cluster_label":"Generative AI Impact","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Generative AI has driven a significant increase in federal civil pro se filings, with AI-flagged complaints showing distinct drafting patterns but no improved litigation outcomes.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29491v1","title":"The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF","abstract":"Large Language Models (LLMs) are increasingly deployed in agentic and retrieval-augmented generation (RAG) systems, where they must execute user-specified tasks over externally provided reference text. In practice, such context is often unstructured and contaminated with benign but instruction-like semantic noise, such as editorial comments and system traces, which should be treated strictly as data. 
We introduce DistractionIF, a benchmark designed to evaluate robustness against such distractor instructions in reference text. Across a broad range of models, we observe a consistent inverse scaling phenomenon: larger models are often less robust, with performance dropping by up to 30 points as scale increases. Mechanistically, our perplexity analysis reveals that scaling erodes the probabilistic boundary between robust and distracted behaviors, making models increasingly prone to over-interpreting noise as instructions. To address this, we demonstrate that reinforcement learning, specifically Group Relative Policy Optimization (GRPO), can restore this boundary, improving robustness by up to 15.5% without compromising general instruction-following capability. Our findings highlight a critical instruction-following robustness gap in reference-grounded tasks and establish reinforcement learning as a promising path for enforcing strict data-instruction separation at scale.","published_date":"2026-05-28T07:18:15+00:00","viability_score":4,"cluster_label":"LLM Robustness","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark and reinforcement learning approach to improve LLM robustness against distractor instructions in reference text.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29488v1","title":"AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling","abstract":"Conditional human motion generation remains a fundamental challenge in computer vision and robotics. Despite significant progress, current methods are often constrained by fixed modality configurations and task-specific architectures, leaving cross-modal interactions and the scaling laws of multimodal-conditioned synthesis largely underexplored. A key bottleneck is the scarcity of large-scale modality-aligned motion data, limiting generalization across diverse control signals. 
In this work, we introduce OmniHuMo, a large-scale, high-quality dataset comprising over 5,000 hours of motion and 3.2 million sequences with precisely aligned multimodal annotations (e.g., text, speech, music, and trajectory). Leveraging OmniHuMo, we propose AnyMo, a unified multimodal framework combining a Residual FSQ-based motion tokenizer with a scalable masked modeling transformer, enabling high-quality motion synthesis under arbitrary modality combinations. Extensive experiments show that AnyMo achieves high-fidelity synthesis while offering flexible control over both spatial and stylistic attributes.","published_date":"2026-05-28T07:15:19+00:00","viability_score":7,"cluster_label":"Generative Motion","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified multimodal framework and large-scale dataset for generating human motion conditioned on arbitrary combinations of text, speech, and music.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29486v1","title":"PhoneWorld: Scaling Phone-Use Agent Environments","abstract":"A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. 
From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.","published_date":"2026-05-28T07:14:15+00:00","viability_score":8,"cluster_label":"Agent Environments","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A scalable pipeline that converts real mobile usage data into controllable agent environments for training and evaluation.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29483v1","title":"VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data","abstract":"Wearable devices enable continuous monitoring of physiological signals such as ECG and PPG, but existing mHealth systems are largely limited to task-specific prediction pipelines or reactive question answering over static summaries. 
They lack the ability to support temporal reasoning, persistent physiological context, and proactive monitoring over long-term signal streams. We propose VitalAgent, a tool-augmented agentic framework for ECG/PPG-based mHealth that supports both reactive question answering and proactive monitoring. VitalAgent is built on a longitudinal physiological memory and a tool-augmented reasoning interface that enables dynamic computation over raw signals. We further introduce VitalBench, a longitudinal physiological monitoring benchmark dataset comprising 1,862 QA pairs for reactive question answering and 90.2 hours of continuous ECG/PPG recordings for proactive monitoring, covering cardiac, physical activity, and stress-related tasks. Experiments demonstrate that VitalAgent achieves over 30% improvement over prompt-based and ReAct baselines in reactive evaluation and supports proactive alert monitoring over long-term physiological signals, highlighting the importance of dynamic tool use and long-term physiological monitoring.","published_date":"2026-05-28T07:10:14+00:00","viability_score":7,"cluster_label":"Wearable Health Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A tool-augmented agent framework for continuous physiological monitoring and proactive health alerts using wearable data.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29478v1","title":"Evolutionary Rule Extraction from Corporate Default Prediction Models","abstract":"Small and medium-sized enterprises (SMEs) represent the majority of firms in most economies and often face financial constraints and higher vulnerability to financial distress. Predicting SME default is therefore crucial for financial institutions, policymakers, and researchers. Recent advances in machine learning (ML) have improved predictive performance in credit risk modeling. 
Yet, the limited interpretability of complex models raises concerns regarding transparency and regulatory compliance. This study investigates SME default predictors and applies explainable artificial intelligence (XAI) techniques to them. Using a panel of 50,718 Italian SMEs over the period 2015-2024, we compare traditional econometric approaches with several ML classifiers. The empirical results show that ML models significantly outperform the traditional logistic regression benchmark in terms of Balanced Accuracy and PR-AUC. To address the interpretability challenge, we introduce DEXiRE-EVO, a novel evolutionary rule extraction framework that combines multi-objective optimization with the Contextual Importance and Utility (CIU) explainability method. The extracted rules reveal economically meaningful patterns associated with SME financial distress, highlighting the roles of weak internal liquidity generation, internal capital erosion, high leverage, and operational inefficiency. Additionally, contextual macroeconomic conditions and the persistence of financial instability contribute to identifying high-risk firms. 
In general, the results show that combining ML with evolutionary rule extraction can improve both predictive performance and interpretability in credit risk modeling, thus supporting more transparent, data-driven decision-making in financial environments.","published_date":"2026-05-28T07:07:01+00:00","viability_score":5,"cluster_label":"Explainable AI for Finance","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel evolutionary rule extraction framework enhances interpretability and predictive performance in SME default prediction models.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.29475v1","title":"MOOSE-Copilot: A Web-Based Interactive Assistant for Unified Exploratory and Fine-Grained Scientific Hypothesis Discovery","abstract":"Large language models (LLMs) show remarkable potential in scientific hypothesis discovery. However, existing approaches face two critical limitations: they treat divergent exploratory ideation and convergent fine-grained refinement as isolated tasks, and they operate autonomously with little to no human guidance. We present MOOSE-Copilot, the first unified framework to bridge this abstraction gap through a formalized human-AI interaction (HAII) protocol. Our system empowers scientists to steer the generative process via three explicit signals: initial blueprints, inter-stage routing, and regenerative feedback. Quantitative evaluations demonstrate that injecting these structured expert signals significantly outperforms purely autonomous baselines, establishing a performance ceiling under oracle guidance. Furthermore, to democratize this paradigm, we develop an intuitive web-based interface featuring interactive tree visualization. 
This explicitly eliminates the steep learning curve of complex command-line agentic tools, empowering interdisciplinary researchers to directly leverage, visually orchestrate, and accelerate end-to-end scientific breakthroughs.","published_date":"2026-05-28T07:06:10+00:00","viability_score":7,"cluster_label":"AI for Scientific Discovery","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MOOSE-Copilot is a web-based interactive assistant that unifies scientific hypothesis discovery through human-AI interaction, accelerating research breakthroughs.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29473v1","title":"Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles","abstract":"Language models are increasingly being deployed for conversational support in informal caregiving contexts, where interactions often extend beyond information-seeking: caregivers seek emotional reassurance, guidance, and help, while navigating uncertain, relationally complex care decisions. Yet most safety evaluations assess model behavior under generic prompts, leaving a critical question unexamined: does a model's safety profile change with its support role? We study this by operationalizing four expert-reviewed support roles grounded in social support theory: Inform, Coach, Relate, and Listen, and comparing them against two baseline controls: a basic prompting condition and a retrieval-augmented generation (RAG) condition. We evaluate across three language models (GPT-4o-mini, Llama-3.1-8B-Instruct, and MedGemma-1.5-4b-it) on 5,000 real-world queries from online Alzheimer's Disease and Related Dementias (ADRD) communities. We find that the LLM's support role systematically shapes both the prevalence and composition of interactional risks. 
Furthermore, a human evaluation study reveals a perceived quality--safety tension: more directive, information-oriented roles are rated as more helpful and trustworthy despite exhibiting elevated interactional risk profiles. We release ~90,000 support role-conditioned model responses with risk annotations as an ecologically grounded resource for research on safer LLM-mediated conversational support.","published_date":"2026-05-28T07:04:56+00:00","viability_score":3,"cluster_label":"LLM Safety and Caregiving","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research audits LLM caregiving support roles, revealing how different interaction styles impact safety and perceived helpfulness.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29468v1","title":"SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing","abstract":"Large language models (LLMs) are increasingly used to support scientific work, but it is unclear whether they uphold responsible conduct of research (RCR) norms or help undermine them. We introduce SciIntBench, an adversarial benchmark of 810 prompts across ten RCR categories and three scientific domains. Each scenario appears as an Overt Adversarial, Covert Adversarial, and Benign version, allowing us to jointly measure framing-sensitive refusal of misconduct and helpfulness on legitimate requests. We evaluate 16 commercial and open-weight LLMs from six providers (2024--2026), producing 12,960 responses. We find that scientific integrity alignment is strongly framing-sensitive: models refuse explicit misconduct far more reliably than covert violations, especially failing when misconduct is presented as a pressure-driven shortcut. 
Refusals vary by RCR category, with weaker boundaries around transparency, plagiarism, and fabrication.","published_date":"2026-05-28T07:00:01+00:00","viability_score":5,"cluster_label":"LLM Research Integrity","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SciIntBench is an adversarial benchmark measuring LLM compliance with research integrity norms under framing variations.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.29467v1","title":"Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference","abstract":"Stacking probabilistic building blocks into deeper architectures typically breaks closed-form inference. We show that closed-form inference can be preserved. We identify five factor-graph primitives: a bilinear factor, an exponential link, a Gamma prior, a Gaussian likelihood, and an equality node, and prove that any model composed from them admits closed-form variational message passing. The construction works because each primitive preserves a small set of message families: under mean-field factorization, messages on Gaussian variables remain Gaussian and messages on precision variables remain Gamma, while the only non-conjugate interface, the exponential link, remains tractable through the Gaussian moment-generating function and the sufficient statistics of the Gamma family. We demonstrate composition at increasing depth, from static ensembles through input-dependent gating to split-branch routing, and show that stacking routing layers encodes arbitrary decision trees, establishing universal function approximation with closed-form inference. 
Applied to ensemble time-series forecasting, the framework yields a Bayesian mixture of experts in which gating functions are inferred rather than learned, providing calibrated uncertainty over expert selection across five benchmark datasets.","published_date":"2026-05-28T06:59:35+00:00","viability_score":3,"cluster_label":"Probabilistic ML","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A theoretical framework for composing probabilistic models with closed-form variational inference, enabling universal function approximation with tractable uncertainty.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29463v1","title":"Honest Lying: Understanding Memory Confabulation in Reflexive Agents","abstract":"Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own failures. We show that this assumption can fail systematically: across ALFWorld and HumanEval, agents store confident but incorrect interpretations of the task and continue acting on them across trials, even though the environment resets to the correct task each time. We call this failure mode memory confabulation and introduce the Reflection Repetition Rate (RRR), a log-based metric that detects repeated reliance on incorrect reflective content. Using RRR, we identify 16 frozen environments in ALFWorld, where 0 of 121 reflections mention the correct target object, and 4 analogous cases in HumanEval. 
Our mitigation replaces open-ended self-diagnosis with programmatic extraction of trajectory-level failure signals, increasing correct object mention from 0% to 86%, reducing RRR from 0.64 to 0.10, and solving 3 of 16 frozen ALFWorld environments, suggesting that reflective memory can reinforce false beliefs rather than correct them.","published_date":"2026-05-28T06:56:42+00:00","viability_score":4,"cluster_label":"Agentic AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Identifies and mitigates 'memory confabulation' in reflexive agents by replacing open-ended self-diagnosis with programmatic failure signal extraction.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2605.29462v1","title":"Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset","abstract":"The emergence of Large Vision-Language Models (LVLMs) has substantially expanded model capabilities beyond text-only understanding, enabling unified inference across both visual and textual modalities and supporting a broader range of real-world applications. To comprehensively evaluate the perception, understanding, reasoning, and cognition capabilities of LVLMs throughout the entire financial business workflow in Chinese contexts, we introduce CFMME, a novel Chinese financial multimodal evaluation benchmark. CFMME comprises 6,052 instances spanning from fundamental academic knowledge to complex real-world applications, covering eight primary financial image modalities and four core multimodal tasks. On CFMME, we conduct a thorough evaluation of representative LVLMs. The results show that the state-of-the-art model attains an overall accuracy of 66.11% on the question answering task and an average score of 77.18 on the detection, recognition, and information extraction tasks, indicating substantial room for improvement in current LVLMs. 
In addition, we conduct detailed analyses of error causes, cross-modal capabilities, and multi-orientation settings, yielding valuable insights for future research. We hope that CFMME will spur further progress in LVLMs, especially by improving their performance on multiple multimodal tasks in the financial domain.","published_date":"2026-05-28T06:56:33+00:00","viability_score":7,"cluster_label":"Multimodal AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Introduces CFMME, a comprehensive Chinese financial multimodal evaluation benchmark, revealing significant room for improvement in current Large Vision-Language Models.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29458v1","title":"Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment","abstract":"Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and contextual cues needed for individual-level decision simulation. We propose an adaptive interview framework that gathers persona-relevant information through a structured three-stage dialogue: core questions, dynamic follow-ups, and a synthesized personality summary. Using the resulting interview transcripts, we evaluate whether LLMs can simulate participants' decisions in moral dilemma scenarios. We compare three conversational contexts -- Core-10 responses, the full interview dialogue, and a summarized persona representation. We find that adaptive interviewing functions less as a uniform accuracy booster and more as a selective grounding mechanism: follow-up-derived evidence is incorporated in around 40% of full-interview traces, and these follow-up-grounded predictions are more accurate than core-only grounded ones (45.5% vs. 39.3%). 
These findings highlight that richer persona context alone is insufficient: improvements arise only when models actually ground their decisions in user-specific evidence.","published_date":"2026-05-28T06:53:08+00:00","viability_score":4,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An adaptive interviewing framework that uses structured dialogue to gather persona-relevant information, improving LLM decision alignment in moral dilemma scenarios.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2605.29453v1","title":"Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs","abstract":"Representation learning on dynamic graphs requires capturing complex dependencies that evolve across both time and structure. Existing approaches typically adopt fixed temporal decay schemes or predetermined structural propagation depths, limiting their ability to generalize across graphs with diverse interaction frequencies and topological characteristics. We propose Dual-Scale Retentive Dynamics (DSRD), a unified framework that maintains a retentive representation state encoding both temporal memory and structural context. DSRD introduces two key components: (i) a retentive state with dual-scale adaptation that jointly models temporal dynamics and structural propagation within a single recurrent formulation, and (ii) adaptive decay kernels with learnable time-sensitivity parameters that automatically balance short-term responsiveness and long-term retention based on the underlying interaction patterns. We provide theoretical analysis establishing the equivalence between event-wise parallel aggregation and efficient recurrent state updates, as well as stability and boundedness guarantees for the learned dynamics. 
Extensive experiments on 14 real-world benchmarks demonstrate that DSRD consistently achieves state-of-the-art performance on both link prediction and node classification tasks, with strong generalization across transductive and inductive settings.","published_date":"2026-05-28T06:47:08+00:00","viability_score":4,"cluster_label":"Graph Representation Learning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified framework for dynamic graph representation learning that jointly models temporal and structural adaptation for improved generalization.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29448v1","title":"How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions","abstract":"Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show both that common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions. This also includes determinantal (DPP) objectives, as well as many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries. This yields an average empirical speedup of about 35,000x, making direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. 
Thus enabled, we compare how well several objectives predict the value of training subsets for held-out test performance under fixed-size, class-balanced, and fixed training-budget regimes, including the Vendi Score, DPPs, facility location, and three new matrix spectral variants. Across multiple datasets, facility location performs the best. Direct optimization also reveals that, while the Vendi Score is predictive over moderate score ranges, pushing the objective to higher values can make it a poor downstream performance proxy. We also find that uniformly at random fixed-size subsets, both unconstrained and class-balanced, are remarkably concentrated in both appraisal scores and held-out performance. Finally, we show that size, class balance, and training budget do not alone determine data value: even when controlling for these factors, performance ranges smoothly from good to bad.","published_date":"2026-05-28T06:40:29+00:00","viability_score":5,"cluster_label":"Dataset Valuation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel framework for dataset valuation that unifies existing scaling laws and introduces new submodular objectives for efficient data appraisal.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29446v1","title":"CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials","abstract":"Miller-index identification from powder XRD patterns requires capabilities untested by existing multimodal benchmarks: the model must read a narrow peak location from a rendered scientific curve and then connect that observation to multi-step crystallographic reasoning. We introduce CrystalXRD-Bench, a 250-sample benchmark built from 10 public crystallographic databases for a single task: recover the full set of HKLs contributing to the highest-intensity peak in an XRD pattern. 
Each sample pairs the rendered XRD image with the source CIF text and chemical formula, so visual extraction errors and reasoning errors can be examined side by side. We evaluate seven vision-language models. The best Jaccard score is 0.5888 (GPT-5.4) with an exact-match rate of 37.6%, yet six of seven models remain below Jaccard 0.50; the task is far from solved. Error patterns vary systematically: double-peak cases are especially brittle, recall-heavy models gain coverage by over-predicting HKLs, and access to CIF text does not close the gap in crystallographic calculation. Alongside model rankings, the benchmark identifies the conditions under which current VLMs fail on quantitative scientific figures. All data and evaluation code will be publicly available.","published_date":"2026-05-28T06:39:45+00:00","viability_score":7,"cluster_label":"Vision-Language Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark for evaluating vision-language models on the challenging task of XRD peak indexing, revealing significant limitations in current models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29442v1","title":"How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions","abstract":"AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajectories that miss how developers actually experience misalignment. We present an observational study of 20,574 coding-agent sessions from 1,639 repositories across IDE and CLI workflows. We operationalize misalignment as a breakdown made visible through developer pushback, and annotate each episode along four axes: form, cause, cost, and resolution. 
We identify seven recurring forms, spanning how agents read projects, interpret developer intent, follow rules, bound their actions, implement and execute code, and report progress. 90.50% of episodes impose effort and trust costs rather than irreversible system damage, yet 91.49% of visible resolutions still require explicit user correction. Misalignment patterns also differ across IDE and CLI settings, persist across adjacent sessions, and shift over time: while overall rates decline, constraint violations and inaccurate self-reporting grow in share. Our findings inform the design of training, evaluation, and interfaces for keeping coding agents aligned with real developer workflows.","published_date":"2026-05-28T06:35:39+00:00","viability_score":4,"cluster_label":"AI Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A large-scale analysis of AI coding agent failures in real-world developer sessions, identifying recurring misalignment patterns and their impact.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29440v1","title":"SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents","abstract":"Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision making on complex tasks. Existing approaches typically expand these banks in an append-only fashion, continuously adding new skills without removing redundant, outdated, or harmful ones, resulting in inefficient and poorly curated repositories. In this paper, we formulate the skill bank curation as a constrained multi-objective problem: a desirable bank must be useful for the agent, diverse in its content, and provide good coverage of the query distribution. 
To this end, we introduce SkillBrew, a multi-objective curation framework that formalizes skill bank curation as Pareto-aware optimization under a utility constraint, and solves it via a bi-level propose-then-verify loop. We evaluate our approach on two public benchmarks. Our findings suggest that treating skill banks as objects of principled curation, rather than ever-growing append-only logs, is an important step toward building self-improving LLM agents.","published_date":"2026-05-28T06:33:52+00:00","viability_score":4,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for curating skill banks for LLM agents to improve their usefulness, diversity, and coverage.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29434v1","title":"AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing","abstract":"Existing sentence-level watermarking methods enhance robustness to paraphrasing by anchoring watermarks in sentence semantics. However, their prefix-based designs remain vulnerable to structural perturbations, such as sentence splitting and merging, which commonly arise under strong paraphrasers like DIPPER and GPT-3.5. To mitigate this issue, we propose AliMark, a framework that reformulates sentence-level watermarking as a bit sequence encoding and alignment problem between a potentially watermarked text and a secret bit sequence. Notably, our approach adopts a two-stage detection strategy: we generate multiple restructured text variants and adaptively align their extracted bit sequences with the secret bit sequence to minimize alignment cost. This multi-candidate alignment design naturally improves robustness to sentence merges and splits. 
Extensive experiments demonstrate that AliMark substantially outperforms state-of-the-art baselines under diverse paraphrasing attacks.","published_date":"2026-05-28T06:30:43+00:00","viability_score":4,"cluster_label":"Watermarking","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A robust sentence-level watermarking framework that reformulates watermarking as bit sequence encoding and alignment to resist paraphrasing.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29430v1","title":"Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation","abstract":"Automatic speech recognition (ASR) is a core component of human--computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate ($S^2ER$), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in $S^2ER$ than in conventional token-level metrics. 
Human--AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/","published_date":"2026-05-28T06:23:31+00:00","viability_score":8,"cluster_label":"Interactive Speech Recognition","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Agentic ASR offers human-like interactive speech recognition to improve semantic accuracy through iterative feedback.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29428v1","title":"DELOS: Detecting Shallow Transits in Kepler Photometry Using a Contrastive-Learning Framework","abstract":"We present DEtection in phase-folded Light curves with cOntrastive Scoring (DELOS), a contrastive-learning-based framework designed to search for shallow transits in Kepler photometry. DELOS combines GPU-accelerated phase folding, optimized phase binning, and a custom one-dimensional convolutional encoder to assign a transit-likeness score to each folded light curve, thereby producing a score periodogram over trial periods without relying on pre-detected threshold-crossing events. Focusing on intermediate-to-long-period signals with orbital periods of 100-150 days, DELOS was trained on 20 million synthetic light curves generated with realistic transit models and Kepler-like noise properties, achieving a validation accuracy of 99.3 percent on the synthetic validation set. In controlled injection-recovery experiments, DELOS improves the combined precision-recall performance by 15.5 percent relative to Box-fitting Least Squares (BLS) and 11.25 percent relative to Transit Least Squares (TLS) in the low Signal-to-Noise Ratios (low-SNR) regime. It also accelerates the search by factors of approximately 3-5 and 74-80 compared with BLS and TLS, respectively. 
Applied to a selected Kepler validation sample, DELOS recovered all known shallow intermediate-to-long-period transit signals in the tested period range. These results demonstrate that DELOS provides an efficient and sensitive framework for low-SNR transit searches and represents a practical step toward future searches for longer-period terrestrial planets in Kepler, K2, TESS, PLATO, and Earth 2.0 data. Accordingly, this work is intended as a methodological development and validation study, with the detailed astrophysical validation of newly identified candidates deferred to future work.","published_date":"2026-05-28T06:22:22+00:00","viability_score":7,"cluster_label":"Astronomy AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A contrastive-learning framework for detecting shallow transits in astronomical photometry, improving sensitivity and speed.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29425v1","title":"ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control","abstract":"Reinforcement learning (RL) has shown promise in traffic signal control (TSC). However, its reliance on predefined states limits responsiveness to observable open-world events that are absent from training data. IoT-enabled intersections provide heterogeneous observations from roadside sensors and cameras, creating opportunities to improve RL adaptability to such events. To this end, we propose ReasonLight, a multimodal foundation model-enhanced RL framework for zero-shot TSC. ReasonLight integrates three sources of information: structured traffic measurements, multi-view camera observations, and candidate phase decisions from a pre-trained RL controller. Given an RL-proposed phase, ReasonLight extracts visual semantics from multi-view images and aligns them with compact sensor-derived scene descriptions. 
This alignment enables a semantic-guided refinement module to either preserve or adjust the proposed action according to traffic rules and event semantics. To ensure operational reliability, refined actions are constrained by the set of available phases. Any invalid decision is rejected, and the system falls back to the original RL action. We evaluate ReasonLight on two types of rare events not seen during RL training: emergency vehicle priority and temporary traffic regulation. Experimental results show that ReasonLight achieves zero-shot adaptation without retraining. It reduces emergency vehicle waiting time by up to 88.7% compared with the RL-only backbone while preserving comparable routine traffic performance.","published_date":"2026-05-28T06:19:09+00:00","viability_score":5,"cluster_label":"Traffic Signal Control","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A multimodal foundation model-enhanced reinforcement learning framework for zero-shot traffic signal control that adapts to unseen events.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.29420v1","title":"When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs","abstract":"Persona prompting is widely used to steer large language models, yet its practical value remains unclear. Prior work often evaluates persona prompting using aggregate scores, making it difficult to determine whether expert-role prompting consistently improves response quality or instead changes responses along different quality dimensions. We study this question through a controlled comparison of four prompting conditions across 1,140 open-ended questions spanning 38 expert roles and six domains: no role prompt, a generic domain-expert prompt, embedding-based role retrieval, and a hybrid retrieval method combining embedding search with LLM-based role selection. Aggregate results show only small overall differences between conditions. 
However, metric-level analysis reveals a consistent tradeoff that aggregate averages obscure: role prompting systematically increases expertise depth while reducing clarity. These effects are highly conditional rather than universal. Role prompting performs best on advisory questions and in domains such as medicine and psychology, where structured expert framing and risk communication are intrinsically valuable. In contrast, baseline prompting performs better on conceptual and explanatory questions in finance, legal, science, and technology domains, where concise plain-language explanation is more important. We further show that hybrid retrieval significantly improves over embedding-only role selection, although better role retrieval does not eliminate the broader expertise-depth versus clarity tradeoff. Overall, our findings suggest that persona prompting primarily reshapes response characteristics rather than broadly improving capability, and that multi-metric evaluation is necessary for understanding its effects.","published_date":"2026-05-28T06:14:07+00:00","viability_score":4,"cluster_label":"LLM Prompting Analysis","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Analyzes the trade-off between expertise depth and clarity in LLM responses when using persona prompting, showing it reshapes characteristics rather than broadly improving capability.","time_to_mvp":"1-2 weeks","tags":["quick_build"]},{"arxiv_id":"2605.29414v1","title":"Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning","abstract":"Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-lingual transfer and multilingual alignment in large language models (LLMs). However, existing studies primarily focus on bilingual transfer between English and a target language, leaving multilingual settings involving three or more languages largely unexplored. 
In this work, we investigate multilingual code-switching instruction tuning across four languages: English, Japanese, Korean, and Chinese. We evaluate multilingual understanding on Belebele. Our experiments show that simple sentence-level multilingual CSD consistently improves average multilingual performance across all four languages, indicating that multilingual code-switching can be effective beyond bilingual transfer settings.","published_date":"2026-05-28T06:03:52+00:00","viability_score":4,"cluster_label":"Multilingual LLMs","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Demonstrates that sentence-level multilingual code-switching instruction tuning improves average performance across four languages, extending beyond bilingual settings.","time_to_mvp":"1-3 months","tags":[]},{"arxiv_id":"2605.29411v1","title":"The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction","abstract":"Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundant. Once the boundary is observed, the target is conditionally independent of the rest of the table. This is a tempting object for tabular prediction, since it names exactly the columns a model should need. Yet modern regressors are still trained on the full feature set. We ask whether the Markov boundary is genuinely useful for prediction on SCM3K, a 3,450-task synthetic SCM benchmark with feature counts from 40 to 1000 and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary often improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. But the natural pipeline of recovering the boundary with causal discovery and training on the recovered mask does not deliver. 
Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they run they rarely beat the full feature set. We trace this to three causes. Discovery optimizes structural recovery rather than prediction. False negatives and false positives carry sharply asymmetric predictive cost. The exact boundary is only one of many feature sets that beat all features. We then develop what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.","published_date":"2026-05-28T06:01:04+00:00","viability_score":5,"cluster_label":"Tabular Prediction Feature Selection","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Investigates the utility of Markov boundaries for tabular prediction, finding that restricting regressors to the oracle boundary can improve performance, but current discovery methods are insufficient.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29402v1","title":"Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge","abstract":"Understanding long-form egocentric videos remains challenging for multimodal large language models (MLLMs) due to limited context length and insufficient grounding of fine-grained visual details. The recently proposed HD-EPIC benchmark highlights these limitations: even strong long-context models achieve relatively low performance across diverse video question answering tasks. In this paper, we propose a unified framework that decouples long-video reasoning into two complementary forms of evidence: semantic evidence and visual evidence. Semantic evidence captures global procedural structure through a coarse-to-fine extraction pipeline, while object-centric visual evidence preserves fine-grained grounding through bounding boxes and visual embeddings. 
During inference, we formulate reasoning as a query-conditioned evidence retrieval and integration process, dynamically selecting relevant information from both sources. Our approach achieves competitive performance in the HD-EPIC-VQA Challenge across multiple task categories. More broadly, our results demonstrate that explicitly structuring, retrieving, and integrating semantic and visual evidence is critical for effective long-video understanding with MLLMs.","published_date":"2026-05-28T05:53:34+00:00","viability_score":7,"cluster_label":"Multimodal AI","has_code":true,"repo_url":"https://github.com/cvpr-org/author-kit","commercial_flags":["has_code"],"one_liner":"A framework for long-video understanding that decouples reasoning into semantic and visual evidence retrieval for improved multimodal LLM performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29400v1","title":"Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark","abstract":"We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. Two findings. First, frontier zero-shot baselines (Claude Opus 4.7 and GPT-5.5) reach sem_sim 0.459 and 0.482 respectively; a fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and clears sem_sim >= 0.7 on 79% of rows, against 1-2% for either frontier baseline, a gap of 0.30 absolute on the same test set. Second, the same training data and recipe on Gemma-4-26B-A4B-IT scores only 0.441, in the same band as the frontier zero-shot baselines rather than the fine-tuned Qwen. 
We read this as a recipe-vs-model mismatch: the reasoning-tuned high-parameter model resists displacement and would likely need either more data or a stronger fine-tuning method.","published_date":"2026-05-28T05:49:36+00:00","viability_score":7,"cluster_label":"LLM Fine-tuning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Demonstrates significant performance gains in screen-conditioned action prediction by fine-tuning LLMs on specific behavioral rationales, outperforming zero-shot models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29398v1","title":"GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models","abstract":"Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), while being hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Despite being well aligned with pre-training, these approaches introduce bias through training--inference mismatch by using the ELBO as a likelihood surrogate, which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD) to directly distill the denoiser of dLLMs from an advantage-guided self-teacher, derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM biases. Recent ELBO-based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. 
On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to $+19.6\\%$. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.","published_date":"2026-05-28T05:47:40+00:00","viability_score":8,"cluster_label":"Diffusion Models","has_code":true,"repo_url":"https://github.com/GaryBall/GDSD","commercial_flags":["has_code"],"one_liner":"A novel reinforcement learning method for diffusion language models that bypasses common biases and achieves significant improvements on planning, math, and coding tasks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29396v1","title":"Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization","abstract":"Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe behavior while preserving general utility. However, recent findings reveal that alignment effects can be fragile: lightweight post-alignment manipulations, such as parameter noise, activation noise, or quantization, can easily weaken the intended safety behavior. Prior efforts to improve robustness have primarily focused on data curation, modified alignment objectives, and safety-critical parameter identification, leaving the role of the optimizer itself largely unexplored. In this paper, we are the first to study the robustness of safety alignment from the perspective of the base optimizer. This optimizer-centric view naturally points to zeroth-order optimization, which provides a robustness-oriented signal by evaluating safety alignment under perturbations. 
Based on this insight, we propose a hybrid framework that first performs standard first-order safety alignment and then applies zeroth-order refinement to improve robustness. Both theoretically and empirically, we show that only a few zeroth-order refinement steps can enhance robustness while preserving safety alignment. We further improve the efficiency of zeroth-order refinement by exploiting its inherent perturbation-based evaluations to estimate layer-wise robustness sensitivity, enabling the refinement process to concentrate updates on robustness-critical layers with modest training overhead.","published_date":"2026-05-28T05:46:38+00:00","viability_score":4,"cluster_label":"LLM Safety","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Enhances LLM safety alignment robustness by introducing a hybrid framework that combines standard alignment with zeroth-order optimization refinement.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2605.29394v1","title":"EvoMD-LLM: Learning the Language of Species Evolution in Reactive Molecular Dynamics","abstract":"While large language models (LLMs) excel at static scientific reasoning, they struggle to model the temporal structure of dynamic physical processes. We present EvoMD-LLM (Evolutionary Molecular Dynamics Large Language Model), a framework that reformulates species-level molecular dynamics as a symbolic temporal language modeling problem. Reactive MD trajectories are discretized into sequences of molecular events, where each token represents a chemical species augmented with its persistence duration, enabling standard autoregressive LLMs to learn compositional evolution over time through efficient fine-tuning. A key component of EvoMD-LLM is temporal scaffolding, which treats event duration as an explicit linguistic token and serves as a structured inductive bias, significantly reducing invalid or hallucinated molecular outputs compared to conventional sequence modeling approaches. 
We evaluate EvoMD-LLM on multiple temporal prediction tasks, achieving up to 66.14% accuracy and consistently outperforming sequential neural networks and language-based baselines. Beyond quantitative improvements, we qualitatively observe that the model is capable of generating interpretations for its own predictions by incorporating relevant chemical knowledge, even though it was not explicitly supervised with paired trajectory-explanation data. These results demonstrate that symbolic temporal language modeling provides an effective framework for grounding LLMs in dynamic physical simulations.","published_date":"2026-05-28T05:44:40+00:00","viability_score":7,"cluster_label":"AI for Science","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework that reformulates species-level molecular dynamics as a symbolic temporal language modeling problem, enabling LLMs to learn compositional evolution over time.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29387v1","title":"On the Optimizer Dependence of Neural Scaling Laws","abstract":"The scaling exponent $\u03b1$ in neural scaling laws $L(N) \\propto N^{-\u03b1}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $\u03b1$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $\u03b1$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $\u03b1$), with the $\u03b1$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. 
At $s \\approx 1.0$ (characteristic of natural language), the full natural gradient achieves $\u03b1\\approx 0.31$ versus $\u03b1\\approx 0.12$ for gradient descent -- a $2.6\\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.","published_date":"2026-05-28T05:41:36+00:00","viability_score":3,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigating the systematic dependence of neural scaling laws on optimizer choice, suggesting that scaling-law forecasts should account for optimizer selection.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29384v1","title":"Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies","abstract":"We propose Latent Terms, a method revealing that models trained for dense retrieval, whether single- or multi-vector, learn representations that can trivially be decomposed into retrieval-ready sparse features. When trained on frozen retrievers, Sparse Autoencoders without any retrieval-specific adjustments extract a latent vocabulary with approximately Zipfian collection statistics, directly suitable for classical sparse retrieval scoring via BM25. This approach enables sparse retrieval while requiring no learned expansion objective or sparse retrieval supervision whatsoever, and can be readily applied to any dense retriever. Latent Terms is able to match or outperform single-vector scoring methods from its own base model as well as comparable SPLADE variants. 
In addition, it substantially outperforms its base model on LIMIT, a task specifically designed to highlight the failures of single-vector retrieval. Overall, our results highlight that neural retrievers contain more expressive and indexable structure than their default scoring functions expose, but that other methods can nonetheless be leveraged.","published_date":"2026-05-28T05:36:37+00:00","viability_score":4,"cluster_label":"Information Retrieval","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Revealing that dense retrievers learn representations that can be decomposed into retrieval-ready sparse features, enabling classical sparse retrieval scoring via BM25.","time_to_mvp":"1-2 weeks","tags":["quick_build"]},{"arxiv_id":"2605.29380v1","title":"TRACER: Persistent Regularization for Robust Multimodal Finetuning","abstract":"Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solutions and a geometric decomposition for each strategy. This framework shows that self-distillation is more effective than other regularization approaches to retain the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias-free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate **TRACER** (**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization), which combines contrastive learning with WMA-guided multi-perspective distillation. 
Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at [https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER).","published_date":"2026-05-28T05:34:23+00:00","viability_score":7,"cluster_label":"Multimodal AI","has_code":true,"repo_url":"https://github.com/HesamAsad/TRACER","commercial_flags":["has_code"],"one_liner":"TRACER is a novel regularization technique for multimodal finetuning that improves out-of-distribution robustness and calibration by combining contrastive learning with weighted moving average guided distillation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29368v1","title":"SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow","abstract":"The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collaborative decision-making, and provide transparent, auditable reasoning across the entire perioperative workflow. Although web-based Large Language Models (LLMs) possess advanced reasoning capabilities, they are ill-equipped for surgical applications due to critical limitations: input length constraints, incomplete memory management, and limited traceability. To address this issue, we present SURGENT, a surgical multi-agent assistance system that combines a Tree-of-Thought planner, multi-department collaboration agents, and retrieval-augmented reasoning with clinical guidelines and biomedical literature. SURGENT features a novel memory design that manages both long-term patient histories and short-term working summaries, enabling more complete, contextualized, and consistent reasoning. 
Experimental evaluations across five key perioperative tasks - case analysis, surgical plan simulation, safety monitoring, complication risk assessment, and rehabilitation guidance - show that SURGENT outperforms baseline LLMs and existing medical multi-agent frameworks, yielding recommendations more closely aligned with patient histories. Ablation studies further highlight the advantage of DeepSeek as a locally deployable backbone model, enabling privacy-preserving deployment without reliance on centralized services. These results position SURGENT as a practical and trustworthy advancement toward intelligent, equitable, and secure surgical assistance systems.","published_date":"2026-05-28T05:12:41+00:00","viability_score":5,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A surgical multi-agent system that synthesizes patient records and supports collaborative decision-making across the perioperative workflow.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2605.29360v1","title":"MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models","abstract":"Action-conditioned world models are increasingly used as scalable simulators for robot learning, yet current evaluations provide limited evidence that their predictions are reliable under the actions they condition on. Existing benchmarks largely emphasize visual fidelity, leaving unclear whether predicted futures are physically plausible, faithful to commanded actions, and calibrated to failure when actions should not succeed. We introduce \\textsc{MiraBench}, a hierarchical benchmark that defines \\emph{action-conditioned reliability} as a core evaluation target for robotic world models. 
MiraBench decomposes this target into three progressively demanding levels: \\emph{Physics Adherence}, which evaluates reference-free physical consistency; \\emph{Action-Following Fidelity}, which measures whether predictions respect task-relevant action inputs; and \\emph{Optimism Bias Detection}, which probes the tendency to predict successful outcomes under failure-inducing actions. To support this evaluation, we curate a human-annotated corpus with over 16,000 judgments across tasks, failure categories, and leading world models. We evaluate 12 representative model configurations spanning vector-conditioned robotic world models, text-conditioned generative world models, open-weight systems, closed-source systems, and multiple model scales. Across this broad model landscape, MiraBench reveals three central findings: visual fidelity is a poor proxy for action fidelity; increasing model scale does not reliably improve action following; and optimism bias is pervasive across current systems. By shifting evaluation from appearance to action-conditioned reliability, MiraBench provides a diagnostic foundation for assessing and improving robotic world models as faithful simulators.","published_date":"2026-05-28T04:58:15+00:00","viability_score":7,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark for evaluating the action-conditioned reliability of robotic world models, revealing pervasive optimism bias.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29359v1","title":"Does Distributed Training Undermine Compute Governance?","abstract":"Compute governance proposals often rely on the assumption that frontier AI training requires large, detectable computing clusters. However, recent advances in distributed training algorithms could allow developers to conduct frontier-scale training on distributed agglomerations of hardware, rather than needing large datacenter facilities. 
Developers who prefer not to be constrained by regulations may structure their hardware in a manner that evades the registration and monitoring requirements associated with compute governance. Therefore, regulations must be designed to detect and prevent illicit distributed training operations. This paper evaluates the feasibility of such evasion and outlines recommended countermeasures, including whistleblowing, chip tracking, forensic accounting, and memory and compute thresholds for clusters.","published_date":"2026-05-28T04:58:12+00:00","viability_score":2,"cluster_label":"AI Governance","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Analyzes how distributed training can evade compute governance and proposes countermeasures.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29358v1","title":"Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet","abstract":"We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal (generalizing to images despite text-only training), respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations. We find features corresponding to famous entities and locations, as well as more abstract concepts like sarcasm or errors in code. We also identify features relevant to ways in which language models might cause harm--including features representing deception, power-seeking, sycophancy, and bias--and show that these causally influence model outputs when manipulated. 
Additionally, we conduct analyses of feature interpretability, geometry, and computational function. However, significant limitations remain: our suite of features is incomplete, and we lack rigorous methods for evaluating whether our features faithfully capture model computations.","published_date":"2026-05-28T04:57:47+00:00","viability_score":6,"cluster_label":"LLM Interpretability","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Extracts interpretable features from Claude 3 Sonnet using sparse autoencoders, revealing concepts like deception and bias.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.29357v1","title":"PassNet: Scaling Large Language Models for Graph Compiler Pass Generation","abstract":"Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. 
Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.","published_date":"2026-05-28T04:55:14+00:00","viability_score":8,"cluster_label":"LLM Compiler Optimization","has_code":true,"repo_url":"https://github.com/PaddlePaddle/PassNet","commercial_flags":["has_code"],"one_liner":"PassNet is an LLM-based ecosystem for generating compiler passes, offering a path to automate and improve optimization for long-tail workloads in tensor compilers.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29350v1","title":"ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression","abstract":"Mixture-of-Experts (MoE) language models reduce per-token computation but still require storing and serving all experts, making deployment memory-intensive. Existing post-training compression methods mainly shrink this cost by pruning experts or merging their weights. We formulate post-training MoE compression as expert-pool consolidation: retaining a smaller set of pretrained experts as reusable prototypes and deterministically remapping each original expert reference to one selected prototype. This view separates the reduced expert pool from the reuse structure that represents the original expert slots, and allows prototype sharing within local layer scopes while preserving the original router interface. 
We propose ConMoE, a train-free prototype remapping framework that selects retained experts using calibration-based contribution and replaceability signals, then redirects original expert calls to the selected prototypes without weight updates or post-compression fine-tuning. Experiments on three pretrained MoE language models show that ConMoE matches or outperforms strong pruning and merging baselines in several settings, achieving the best average score on deepseek-moe-16b-base at both 25% and 50% routed-expert reduction, while remaining competitive on Qwen3-30B-A3B and OLMoE-1B-7B-0125. Ablations indicate that deterministic reassignment is the most stable component, whereas broader cross-layer sharing and post-hoc weight fusion are model-dependent.","published_date":"2026-05-28T04:44:22+00:00","viability_score":4,"cluster_label":"MoE Compression","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"ConMoE is a train-free framework for compressing Mixture-of-Experts models by consolidating experts into reusable prototypes without weight updates.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29335v1","title":"Rethinking FID Through the Geometry of the Reference Dataset","abstract":"Fr\u00e9chet Inception Distance (FID) is widely used to evaluate image generators, yet lower FID does not always correspond to better sample quality. We show that this mismatch depends in part on the geometry of the reference dataset. In a controlled study across six datasets, distributional density and effective rank significantly explain how FID changes as sample quality improves. Concentrated datasets tend to yield more favorable FID trends, whereas more dispersed datasets can make FID worsen despite better samples. Attribution to precision and recall and ablations with alternative feature spaces and distances support the same conclusion. 
These results suggest that distributional metrics should be interpreted together with the geometry of the reference dataset for more reliable benchmarking.","published_date":"2026-05-28T04:10:47+00:00","viability_score":5,"cluster_label":"Image Generation Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research re-evaluates Fr\u00e9chet Inception Distance (FID) for image generators by considering the geometry of the reference dataset, suggesting more nuanced interpretation for benchmarking.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29310v1","title":"Rubric-Guided Process Reward for Stepwise Model Routing","abstract":"Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recent methods formulate routing as a sequential decision process and train the router with reinforcement learning. However, although they model routing as a process, they still supervise the router with outcome rewards. Such rewards only reflect final answer correctness and fail to evaluate intermediate routing decisions, which can weaken performance and generalization. To address this gap, we propose RoRo, a rubric-guided process reward framework for stepwise model routing. RoRo first collects diverse routing trajectories and constructs preference pairs based on outcome, cost, and process quality. It then trains a Rubricor to generate a query-specific evaluation rubric and a Judge to score routing trajectories under this rubric through alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy via GRPO. 
Experiments on five reasoning benchmarks under both same-family and cross-family settings show that RoRo consistently outperforms strong baselines and achieves better accuracy and cost trade-offs.","published_date":"2026-05-28T03:42:24+00:00","viability_score":7,"cluster_label":"Reasoning Model Routing","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"RoRo is a rubric-guided reward framework that improves stepwise model routing for Large Reasoning Models by evaluating intermediate routing decisions, leading to better accuracy and cost trade-offs.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29307v1","title":"GrepSeek: Training Search Agents for Direct Corpus Interaction","abstract":"Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. 
To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to $7.6\\times$ while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level $F_1$ and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.","published_date":"2026-05-28T03:37:33+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"GrepSeek trains compact search agents to directly interact with and extract information from large text corpora using executable shell commands, outperforming existing methods on open-domain QA benchmarks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29303v1","title":"Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models","abstract":"Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models. This paradigm provides a cold-start for RL exploration, avoiding the inefficiency of pure RL where on-policy sampling yields insufficient positive samples. However, in practice, existing approaches often use a small amount of data for SFT initialization compared to the RL phase, which can cause the model to fit the limited samples and shift away from its pre-trained distribution. This distribution shift impedes the model's ability to effectively explore during subsequent RL training. 
To address this challenge, we propose that in low-data regimes, SFT should prioritize activating task-relevant capabilities rather than memorizing specific content. Along this line, we propose EKSFT (Entropy-KL Selective Fine-Tuning), which selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. By excluding these high-uncertainty, distribution-shifting tokens from imitation, EKSFT injects task-specific knowledge while preserving the integrity of the model's pre-trained distribution. Empirical evaluations on mathematical reasoning benchmarks demonstrate that EKSFT consistently outperforms standard SFT. Further RL fine-tuning from the EKSFT model yields consistently better post-RL performance, indicating improved exploration for the RL stage. Our codes and datasets are available at https://github.com/MINE-USTC/EKSFT.","published_date":"2026-05-28T03:36:05+00:00","viability_score":7,"cluster_label":"LLM Fine-tuning","has_code":true,"repo_url":"https://github.com/MINE-USTC/EKSFT","commercial_flags":["has_code"],"one_liner":"EKSFT selectively masks high-entropy or high-KL divergence tokens during fine-tuning to preserve pre-trained distribution integrity, improving subsequent RL exploration and performance on mathematical reasoning tasks.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29300v1","title":"MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs","abstract":"Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their responses are grounded in the correct temporal regions of the audio remains underexplored. This limitation is particularly critical for music understanding, where key information often occurs as temporally localized events, such as instrument entries and rhythmic transitions. 
To address this gap, we introduce MusTBENCH, a music-expert-validated benchmark designed to evaluate temporal grounding in LALMs through five temporally grounded question-answering tasks. To further improve temporal grounding in existing models, we propose MusT, a novel four-stage temporal optimization recipe spanning music encoder adaptation, LLM adaptation, LLM supervised fine-tuning, and RL-based optimization. Experiments on MusTBENCH show that existing LALMs struggle with precise temporal grounding, while MusT brings significant improvements over strong baselines. These results establish temporal grounding as a key missing capability in current LALMs and position MusTBENCH as a challenging benchmark for future research in temporally grounded music understanding.","published_date":"2026-05-28T03:28:16+00:00","viability_score":7,"cluster_label":"Multimodal AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MusTBENCH is a new benchmark and MusT is a four-stage optimization recipe to improve temporal grounding in music Large Audio-Language Models, addressing a key missing capability.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29299v1","title":"Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models","abstract":"Evaluations of dental vision-language models remain fragmented across datasets, task definitions and metrics, and often ignore their computational cost. This limits their widespread deployment for dental screening outside specialist centres, where timely inference, limited hardware, and local handling of patient images are vital for practical, privacy-preserving clinical prescreening. Here we present Pocket-Dentist, an efficiency-aware benchmark for dental multimodal question answering that brings together three datasets spanning approximately 1,159 patients, five task types and seven metrics. 
Across 14 typical VLMs, our results reveal an interesting observation: compact VLMs (e.g., 2B-parameter models) outperform larger VLMs in accuracy while requiring substantially lower computational costs in dental image understanding. Deployed locally on an iPhone 17 Pro, our finetuned compact VLM Pocket-Dentist-2B processed each sample in 4.31 s, reducing latency by 4.9-fold and memory use by 2.3-fold compared with a 7B baseline.","published_date":"2026-05-28T03:28:07+00:00","viability_score":8,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Pocket-Dentist provides an efficiency-aware benchmark and a compact multimodal LLM for on-device dental image understanding, achieving high accuracy with significantly lower computational cost on mobile devices.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29288v1","title":"Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces","abstract":"Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead to markedly different fine-tuning outcomes. We study post-conclusion continuation in answer-correct long-CoT data: a continuation where the answer appears sufficiently supported, but the trace continues with additional reasoning that remains in the supervised target. To test its training effect, we use a delete-only editor to construct answer-preserving suffix removal and compare CoT-based SFT on the original and processed traces. We observe improved SFT outcomes after removing the editor-identified post-conclusion continuation, suggesting that this continuation is harmful to training in our setting. We therefore refer to this empirically supported phenomenon as harmful continuation. Beyond this intervention, we further characterize the removed post-conclusion continuation through uncertainty and hidden-state progress. 
We observe persistent local uncertainty together with weakened terminal-directional progress, forming an uncertainty--geometry mismatch. Finally, we instantiate Harmful Continuation Cut (HCC), a lightweight boundary proxy that approximates the editor-identified post-conclusion continuation boundary.","published_date":"2026-05-28T03:12:15+00:00","viability_score":3,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research identifies and quantifies a 'harmful continuation' in supervised fine-tuning data for LLMs, suggesting a method to improve training outcomes by removing it.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29283v1","title":"Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts","abstract":"Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a single average score under a fixed training distribution. This makes it difficult to determine whether a model has learned generalizable physical dynamics or only performs well under particular settings. We construct a benchmark with 8 physical dynamics, 3 training-data mixtures, and 25 test regimes induced by dynamic-scale and initial-condition complexity shifts, covering in-distribution, distribution-shift, and out-of-distribution settings. We evaluate five physics foundation model architectures and four model variants per architecture (scratch and three pretrained sizes), resulting in 60,000 measurements. Our results show that current physics foundation models behave as conditional rather than universal generalists: their generality depends on the physical regime, temporal scale, initial-condition setting, pretraining, model size, and architecture. Improving the training data distribution only partially mitigates this limitation. 
Pretraining and scaling are also unable to reliably remove their ability biases. We argue that improving physics foundation models requires moving beyond scaling models or expanding data, toward learning mechanisms that better capture transferable physical knowledge across regimes, temporal scales, and distribution shifts.","published_date":"2026-05-28T03:08:57+00:00","viability_score":4,"cluster_label":"Physics Foundation Models","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"A new benchmark evaluates physics foundation models across diverse physical regimes and distribution shifts, revealing their limitations as conditional rather than universal generalists.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29280v1","title":"LoopFM: Learning frOm HistOrical RePresentations of Foundation Model for Recommendation","abstract":"Knowledge distillation (KD) transfers a single scalar prediction from a large foundation model (FM) to compact vertical models (VMs), suffering from diminishing transfer ratio -- the fraction of FM improvement captured by the VM -- as a single scalar cannot convey the rich intermediate knowledge that larger FMs learn. To address this bottleneck, we propose LoopFM (Learning frOm HistOrical RePresentations of FM), a framework that opens a high-bandwidth transfer channel by structuring FM intermediate embeddings as input features (e.g., user history sequence) for downstream VMs, without requiring real-time FM inference at serving and architectural coupling between FM and VM. We provide a theoretical framework for LoopFM with a gain decomposition and transfer-ratio analysis. On three public benchmarks, LoopFM demonstrates strong AUC improvements (e.g., 6\\%+ on TaobaoAd) and complementary knowledge transfer capability with KD. 
On industrial-scale systems (billions of examples, trillion-parameter FMs), LoopFM approximately doubles the knowledge transfer ratio on top of KD, delivering a +0.5\\% conversion improvement in Y1H1, and a +1.03\\% and +1.22\\% conversion improvement from two individual launches respectively in Y1H2.","published_date":"2026-05-28T02:59:46+00:00","viability_score":7,"cluster_label":"Recommendation Systems","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LoopFM enhances recommendation systems by structuring foundation model intermediate embeddings as input features for downstream models, significantly improving knowledge transfer and conversion rates.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29277v1","title":"Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA","abstract":"We present Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks that separates genuine code comprehension from documentation recall and pretraining memorization. The framework makes two methodological contributions: (1) an answer-first generation pipeline where a tool-equipped agent explores source code to produce verified gold answers before deriving questions, ensuring every task is grounded in real code structure; and (2) a three-condition experimental design evaluating agents under closed-book (no repository), code-only (documentation removed), and documented (full repository) conditions, with deltas directly quantifying documentation utility and memorization.   We generate 528 code-derivable and 100 doc-dependent tasks across 10 Python repositories from SWE-Bench, scored by an LLM judge on accuracy, completeness, and specificity. 
Experiments on four frontier models reveal that code access is the dominant factor (+0.23 mean gain over closed-book), documentation provides modest additional benefit (+0.071 on doc-dependent tasks), and code-only $\\approx$ documented on code-derivable tasks, validating the design. The framework is open-source and applicable to any well-documented Python repository.","published_date":"2026-05-28T02:52:58+00:00","viability_score":7,"cluster_label":"Code Understanding","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Code-QA-Bench is an automated framework for creating repository-level code understanding benchmarks that distinguishes genuine code comprehension from documentation recall.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29272v1","title":"Causal Label Recovery in Payment Networks","abstract":"Fraud detection models in payment networks train on chargeback labels that are systematically biased. Every label must survive three sequential gates: authorization (declined transactions generate no labels), issuer reporting (unreported fraud is invisible), and delay (pending chargebacks are missing at training time). Labels that do arrive may be corrupted by first-party misuse or issuer misclassification. A companion paper [arXiv:2605.27557] proved that these four impairments impose a minimax lower bound on detection performance. This paper asks: can that bound be achieved?   We formalize the observation pipeline as a sequential missing-data problem with three propensity stages and a corruption layer, and construct the Sequential Triply Robust (STR) estimator. The STR corrects for all four impairments simultaneously and achieves the semiparametric efficiency bound -- no estimator can have lower asymptotic variance. It is sequentially triply robust: at each gate, consistency requires only that either the propensity model or the outcome regression is correctly specified, not both. 
We provide corruption correction via noise-rate-adjusted pseudo-labels, empirical Bayes shrinkage to stabilize inverse-propensity weights for small issuers, a plug-in variance estimator yielding valid confidence intervals, and a Bernstein concentration inequality for finite-sample guarantees.   On the operational side, we derive the optimal training delay -- the maturity window that minimizes the sum of label-quality loss and model staleness -- and prove that the STR permits training on data that is days old rather than months old, decoupling model freshness from the chargeback maturity cycle. The STR provably dominates naive chargeback-based training in mean squared error for any sample size.","published_date":"2026-05-28T02:43:34+00:00","viability_score":3,"cluster_label":"Fraud Detection","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel statistical estimator to improve fraud detection accuracy in payment networks by correcting for systematic biases in chargeback labels.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29271v1","title":"CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval","abstract":"Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant training approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary directions: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. 
We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: using either component in isolation fails to match CoHyDE on both well-formed and vague queries, with losses of up to -8 pp on vague queries.","published_date":"2026-05-28T02:41:30+00:00","viability_score":6,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An iterative co-training method for LLM agents that jointly trains a rewriter and a dense encoder to improve tool retrieval accuracy over large API catalogs.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29270v1","title":"Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies","abstract":"The era of the Internet of Agents (IoA) is taking shape: LLM agents are expected to fulfill user goals by orchestrating fast-growing populations of Model Context Protocol (MCP) servers, Agent-to-Agent (A2A) endpoints, reusable skills, and other LLM-callable services. Yet LLMs face a structural mismatch with this regime: effective context is a scarce resource that does not scale with the number of services. 
Concatenating thousands of service descriptions into a prompt overflows the context window, and even when the window is large enough, models systematically under-attend to information in the middle of long inputs, the well-documented Lost-in-the-Middle phenomenon. This is fundamentally a question of context management for service discovery. To address this, we propose an LLM-native progressive-disclosure scheme and its concrete instantiation, A2X (Agent-to-Anything service discovery): an LLM-driven pipeline that automatically organizes the registered services into a hierarchical taxonomy and walks it layer by layer at query time, so that every LLM call sees only a small candidate set highly relevant to the user query. This decouples effective-context scarcity from registry size and significantly reduces token consumption while improving retrieval accuracy. Compared to full-context dumping, A2X achieves a 6.2-point Hit Rate gain at one-ninth the prompt-token cost; compared to the state-of-the-art open-source embedding-based baseline, A2X improves Hit Rate by more than 20 points.","published_date":"2026-05-28T02:38:41+00:00","viability_score":6,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel LLM-native pipeline that recursively constructs and searches service taxonomies to enable efficient service discovery for LLM agents.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29268v1","title":"Compute Allocation in Evolutionary Search: From Depth-Breadth to Multi-Armed Bandits","abstract":"LLM-guided evolutionary search (Evolve systems) has reached state-of-the-art results on mathematical and combinatorial tasks, yet most existing systems report only the best of many runs and leave the run-to-run distribution undocumented. We ask how a fixed budget of LLM calls should be allocated, and how reliably a single run reaches the reported numbers. 
Sweeping the depth-breadth grid over five models and three tasks, we identify two empirical regularities: a fitness-compute envelope along which capability ordering largely collapses on effective FLOPs, and a bilinear depth-breadth fit with task-specific interaction; both are gated by model-task capability. Motivated by these regularities, we propose BaSE (Bandit-based Self-Evolving), a multi-armed bandit that allocates LLM calls across parallel trajectories. Without changing the model, prompt, or evaluator, BaSE improves mean fitness by 12.3% over the strongest island-protocol baseline across 8 (model, task) cells, with the largest gains on high-variance settings: a reliability gain from allocation alone.","published_date":"2026-05-28T02:37:51+00:00","viability_score":5,"cluster_label":"LLM Agents","has_code":true,"repo_url":"https://github.com/keruiwu/self-evolving-allocation","commercial_flags":["has_code"],"one_liner":"A bandit-based approach to dynamically allocate LLM calls in evolutionary search, improving mean fitness and reliability across various tasks and models.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2605.29267v1","title":"When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop","abstract":"Foundation models are increasingly trained on synthetic data generated by prior model iterations rather than exclusively on real data. This self-consuming training paradigm can lead to model collapse, divergence, or bias amplification. Recent work (Ferbach et al., 2024) shows that incorporating human curation into the loop can steer a self-consuming model toward human-aligned behavior, but these analyses focus on a single, isolated model that solely consumes its own outputs. In practice, however, models often interact and train on input-output pairs produced by other models. This paper studies self-consuming training in the multi-model regime. 
We first formalize a framework for interacting self-consuming models and characterize when the resulting dynamical system converges to a stable point. We then examine how human curation of one model affects its own alignment (self-influence) and how such effects propagate to other models (cross-influence). Unlike isolated settings where human curation always enhances model alignment, we show that cross-model interactions can dampen or even invert this effect, ultimately degrading long-term alignment.","published_date":"2026-05-28T02:36:57+00:00","viability_score":2,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper theoretically analyzes the negative impacts of human curation in multi-model self-consuming training loops, identifying conditions where it degrades alignment.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29262v1","title":"Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling","abstract":"The Dynamic Flexible Job Shop Scheduling Problem (DFJSP) necessitates a trade-off between instant reaction to stochastic disturbances and global optimization of production goals. Conventional priority rules are insufficiently flexible to handle complex disruptions, whereas learning-based approaches often compromise interpretability or fail to generalize across problem scales. Although Large Language Models (LLMs) offer advanced reasoning capabilities to bridge this gap, their substantial inference latency is incompatible with the millisecond-level decision cycles of industrial control systems. To resolve this conflict, we introduce RACE-Sched, an asynchronous agent-based framework that decouples policy execution from logical reasoning via a dual-stream architecture. 
The Reactive Stream executes low-latency symbolic heuristics to enable real-time dispatching, while the parallel Deliberative Stream leverages an LLM to synthesize, validate, and evolve these rules. Candidate rules undergo rigorous testing in a sandbox and are deployed via atomic updates, ensuring safety without blocking the control loop. Additionally, a semantic rule repository indexes validated heuristics for retrieval-based initialization which enhances transferability across problem scales. Extensive evaluations on GEN-Bench, MK-Bench, and JMS-Bench demonstrate that RACE-Sched outperforms leading Deep Reinforcement Learning and other LLM-based baselines. This approach harmonizes real-time constraints with long-horizon reasoning to achieve superior solution quality and robust adaptation to dynamic events.","published_date":"2026-05-28T02:26:18+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"RACE-Sched is an asynchronous agent framework that uses LLMs for long-horizon reasoning while maintaining real-time decision-making for dynamic job shop scheduling.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29259v1","title":"KLAS: Using Similarity to Stitch Neural Networks for Improved Accuracy-Efficiency Tradeoffs","abstract":"Given the wide range of deployment targets, flexible model selection is essential for optimizing performance within a given compute budget. Recent work demonstrates that stitching pretrained models within a model family enables cost-effective interpolation of the accuracy-efficiency tradeoff space. Stitching transforms intermediate activations from one pretrained model into another, producing a new interpolated stitched network. Such networks provide a pool of deployment options along the accuracy-efficiency spectrum. 
However, existing stitching approaches often yield suboptimal tradeoffs and lack generalizability, as they primarily rely on heuristics to select stitch configurations. We argue that constructing improved accuracy-efficiency tradeoffs requires explicitly capturing and leveraging the similarity between pretrained models being stitched. To this end, we introduce KLAS, a novel stitch selection framework that automates and generalizes stitch selection across model families by leveraging KL divergence between intermediate representations. KLAS identifies the most promising binary stitches from the $O(k^2n^2)$ possibilities for $k$ pretrained models of depth $n$. Through comprehensive experiments, we demonstrate that KLAS improves the accuracy-efficiency curve of stitched models at the same finetuning cost as baselines. KLAS achieves up to $1.21\\%$ higher ImageNet-1K top-1 accuracy at the same computational cost, or maintains accuracy with a $1.33\\times$ reduction in FLOPs.","published_date":"2026-05-28T02:23:08+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"KLAS is a framework that uses KL divergence to intelligently stitch pretrained neural networks, improving accuracy-efficiency tradeoffs for flexible deployment.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29256v1","title":"DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents","abstract":"Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and optimization methods remain largely turn-level, failing to capture long-horizon quality. We propose DynSess, a unified session-level framework for role-playing agents. 
DynSess-Eval scores complete dialogue sessions via rubrics targeting long-horizon behaviors. Leveraging its session-level rewards, we construct high-quality training trajectories through multi-turn lookahead search and train DynSess-Character with two complementary variants: DSPO (off-policy) and GSRPO (on-policy). Experiments show that DynSess-Eval aligns with human judgments substantially better than prior evaluators, and blind human evaluation further shows that DynSess-Character matches the strongest character model despite using substantially fewer parameters, while maintaining strong role consistency and interactive ability. Our dataset and code will be released to facilitate future research.","published_date":"2026-05-28T02:20:11+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DynSess is a session-level framework for evaluating and optimizing role-playing agents, improving long-horizon interaction quality and character consistency.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29254v1","title":"Extreme dynamic symmetry enables omnidirectional and multifunctional robots","abstract":"Symmetry is a central organizing principle in natural systems, yet its use as a unifying design strategy in robotics has largely remained limited to geometric form. We show that symmetry can instead be leveraged at the level of dynamic actuation capability. We introduce dynamic symmetry, the uniformity of a robot's attainable center-of-mass accelerations, and formalize it through a measure coined as dynamic isotropy. Across more than 1000 simulated morphologies, we found that higher dynamic symmetry consistently improved trajectory tracking, task success, robustness, resiliency, and energy efficiency, with the benefits becoming most pronounced as dynamic isotropy approached its theoretical limit. 
To study this regime systematically, we developed Argus, a family of spherical robots designed to explore the effects of increasing dynamic symmetry. Members of the Argus family vary in their actuation geometry and dynamic symmetry level while sharing a common architectural principle: radially oriented linear actuators that directly shape the robot's center-of-mass dynamics. Among them, we built a physical 20-leg Argus variant that achieved near-extreme dynamic isotropy and demonstrated orientation-invariant locomotion, agile traversal of cluttered and deformable terrain, rapid self-stabilization, and resilience to partial actuator failures. Its distributed sensing further enabled omnidirectional perception and object interaction during continuous motion. These results show that designing robots for symmetry not only in morphology but also in their attainable dynamics provides a powerful and general pathway toward agility, robustness, and multifunctionality in uncertain terrestrial and extraterrestrial environments.","published_date":"2026-05-28T02:15:58+00:00","viability_score":4,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new approach to robot design leveraging dynamic symmetry for improved agility, robustness, and multifunctionality.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.29253v1","title":"OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories","abstract":"Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. 
OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. It aligns task-oracle outcomes with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. Using OpenClawBench, we make the Outcome-Process Gap measurable. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. These results show that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the cleaner-labels held-out test split. Together, OpenClawBench turns real agent execution logs into auditable and reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.","published_date":"2026-05-28T02:15:52+00:00","viability_score":7,"cluster_label":"Agent Reliability","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and dataset for detecting process anomalies in real-world AI agent executions, improving reliability beyond task success.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29251v1","title":"Provably Secure Agent Guardrail","abstract":"As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a fundamental crisis in artificial intelligence security. Existing defense architectures heavily rely on empirical semantic guardrails and probabilistic large model adjudicators, mechanisms that fail to provide deterministic security lower bounds when facing complex semantic symbol decoupling attacks. 
To overcome this empirical semantic guardrail dilemma, this paper proposes a new security paradigm for agents based on the fundamental limitations of logical reasoning. Based on this paradigm, we further introduce an executable Proof-Constrained Action (ePCA) framework with a neural symbolic isolation architecture. This framework abandons semantic trust in natural language, forcing agents to losslessly formalize their intentions into first-order logical mathematical constraints before performing physical operations. Empirical evaluations of macroscopic and microscopic two-dimensional dynamic adversarial systems demonstrate that our formal verification mechanism achieves zero attack success rate and zero false positive rate across the evaluated scenarios, with extremely low computational latency. This research provides a conditional formal foundation under explicit system assumptions and an engineering paradigm for constructing the underlying defense foundation for future intelligent systems.","published_date":"2026-05-28T02:12:41+00:00","viability_score":7,"cluster_label":"AI Security","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A provably secure agent guardrail framework that uses formal verification to prevent AI from going out of control.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29250v1","title":"OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources","abstract":"Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to knowledge graphs and property graphs. Existing retrievers, however, operate over one source at a time under a fixed query language, leaving the broader landscape of available knowledge fragmented behind incompatible interfaces. 
A natural attempt at unification would collapse these sources into a shared space, but this erases the structural affordances (such as schemas, ontologies, compositional operators) that give each source its expressive power. Effective retrieval over diverse knowledge, therefore, requires not homogenization but an overarching layer that meets each source on its own terms. To achieve this, we present OmniRetrieval, a framework that takes any natural-language query, identifies appropriate knowledge sources, and dispatches source-native queries to their native execution engines. Across an extensive benchmark spanning 13 datasets and 309 distinct knowledge bases over text, relational, and graph-structured sources, OmniRetrieval exceeds single-source baselines, demonstrating that it can serve as a general-purpose interface to the heterogeneous sources while preserving the structural distinctions that make each source valuable.","published_date":"2026-05-28T02:10:35+00:00","viability_score":7,"cluster_label":"Unified Retrieval","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified retrieval framework that accesses and queries diverse knowledge sources (text, tables, graphs) without homogenization.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29247v1","title":"DenseSteer: Steering Small Language Models towards Dense Math Reasoning","abstract":"Large language models (LLMs) demonstrate strong chain-of-thought (CoT) reasoning abilities, while smaller models (<= 3B parameters) significantly underperform on multi-step reasoning tasks. Based on empirical analyses of the Qwen-2.5 model family on math reasoning benchmarks, we find that more proficient reasoning is associated with fewer reasoning steps but higher information density per step, a property we term Dense Reasoning. 
Motivated by this observation, we propose DenseSteer, a training-free inference-time steering framework that enhances small-model reasoning by modulating internal representations toward dense reasoning patterns. Experiments show that our method yields consistent accuracy improvements without increasing token-level Negative Log-Likelihood, highlighting dense reasoning as an effective structural approach to mathematical problem solving.","published_date":"2026-05-28T02:07:58+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":"https://github.com/oyy2000/DenseSteer","commercial_flags":["has_code"],"one_liner":"A training-free framework to enhance small language model math reasoning by steering internal representations towards denser information patterns.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29243v1","title":"Wait! There's a Way Out: A Decision Mechanism for Forecasting Conversational Derailment","abstract":"Forecasting conversational derailment is the task of predicting, as the conversation unfolds, whether it will eventually derail into personal attacks. Since forecasting models operate in an online fashion, they must decide whether to \"trigger\" an alert after each utterance--for example, to notify participants or a moderator that the conversation is at risk of derailing. Existing approaches make this decision solely based on the estimated likelihood of derailment given the preceding utterances, implicitly assuming that the conversation's future trajectory is fixed. As a result, they ignore the possibility of future recovery and incur an unnecessarily high rate of false positives.   In this work we propose a method for decoupling the decision to trigger from derailment likelihood estimation. 
Our approach is inspired by the first human baseline on this task, which shows that humans achieve dramatically lower false positive rates by selectively deferring their decision to trigger when they anticipate that tension is likely to subside. We operationalize this insight with a deferral mechanism that uses forward-looking simulations to assess whether a tense moment admits plausible paths to recovery. Incorporating this mechanism into a state-of-the-art forecasting model substantially reduces false positives without sacrificing forecasting accuracy. More broadly, this work highlights the value of treating decision-making as a first-class component of forecasting systems.","published_date":"2026-05-28T02:01:30+00:00","viability_score":5,"cluster_label":"Conversational AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A decision mechanism for conversational AI that forecasts derailment by simulating future recovery paths, reducing false positives.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2605.29240v1","title":"Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI","abstract":"AI-augmented classrooms generate rich teacher and student feedback before graded outcomes become available, yet these signals can be difficult to translate into timely instructional decisions. We propose an interpretable decision layer: a transparent mechanism that ranks course topics requiring attention without using grades or post-hoc outcome labels. The approach combines three signals: student learning difficulty prevalence, disagreement between learner self-reports and observed difficulties, and unresolved teacher concerns. The output is a ranked set of topic priorities with per-topic decision records explaining each ranking. 
In one graduate CS course offering ($n=5$ instructor interviews; $n=279$ survey responses), prioritized topics aligned with instructor concerns (top-5 overlap 3/5; Spearman $\u03c1=0.80$) and student-reported topic difficulty ($\u03c1=0.46$, $p=.048$). Multi-signal integration also surfaced learners not identified through individual signal sources alone (AUC $=0.96$ vs. $0.91$ for gap prevalence alone). Reflective thinking, help-seeking, and self-efficacy provided additional evidence that student behavioral signals align with learning-related constructs. While preliminary, these findings suggest that transparent coordination mechanisms may help support human-AI co-agency when feedback is incomplete.","published_date":"2026-05-28T02:00:06+00:00","viability_score":4,"cluster_label":"Educational AI","has_code":true,"repo_url":"https://github.com/park-jsdev/classroom-feedback-mediator","commercial_flags":["has_code"],"one_liner":"An interpretable AI layer that ranks course topics needing attention by combining student difficulty, self-reports, and teacher concerns, without using grades.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2605.29234v1","title":"Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth","abstract":"We study large-scale literature search from two complementary angles: improving the retrieval pipeline, and stress-testing the human reference list as an evaluation target. First, we implement a Deep Research pipeline that processes the full query paper and expands the retrieved results breadth-first along their bibliographies, and show that it substantially outperforms vanilla API-only search, raising recall on RollingEval-Jun25 (a 250-paper literature-search benchmark) from below 20% to above 80%. Second, we use a neutral LLM-as-a-judge to determine if human references are sound ground truth for the task. 
We find significant limitations: only 51% of human citations are judged moderately relevant or higher, against 86--88% for the strongest AI-based re-rankers. We study this gap on the OpenAlex co-authorship graph, finding that humans are 2.5x more likely than the best AI re-rankers to cite a direct collaborator. Together, our results argue against single-axis literature-search evaluation: recall, topical-relevance scoring, ranked-list diversity, and a co-authorship-distance diagnostic each measure complementary properties of citation quality and should be reported jointly.","published_date":"2026-05-28T01:50:52+00:00","viability_score":8,"cluster_label":"Information Retrieval","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A deep research pipeline that significantly improves literature search recall by expanding results breadth-first along bibliographies and challenges human citation lists as ground truth.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29233v1","title":"BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference","abstract":"Diffusion language models (dLLMs) generate text by iteratively denoising multiple token positions in parallel, offering an attractive alternative to strictly autoregressive decoding. In practice, however, block-wise dLLM inference exposes a difficult granularity trade-off: small blocks preserve local conditioning but require many denoising steps, whereas large blocks expose more parallelism but can make premature commitments and accumulate cache error. Existing acceleration methods typically choose a single block size per request, leaving the complementarity among block sizes unused. We show that block size itself is a useful branching dimension. 
Different block sizes induce related but non-identical KV-cache trajectories: branches often share an initial prefix, bifurcate at semantically decisive positions, and later agree on syntactically lightweight tokens. Motivated by this structure, we propose BlockBatch, a training-free online inference framework that executes multiple block-size branches for the same request inside a batched forward pass. BlockBatch coordinates these branches through confidence-gated token merging, leader-based synchronization, and periodic full-sequence refreshes that re-anchor local block updates to a globally consistent KV state. Across 3 representative dLLMs and 4 datasets, BlockBatch reduces denoising NFEs by 26.6\\% on average and achieves a 1.33$\\times$ average end-to-end speedup over Fast-dLLM while preserving accuracy. These results identify block-size diversity as a practical and previously underexplored axis for branch-parallel dLLM inference.","published_date":"2026-05-28T01:48:29+00:00","viability_score":7,"cluster_label":"Diffusion Models","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BlockBatch enhances diffusion language model inference by optimizing block size for improved speed and accuracy.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.29230v1","title":"Toward Ethical Facial Age Estimation: A Generalized Zero-Shot Benchmark Without Training on Children's Data","abstract":"Age estimation from facial images typically relies on training data that includes images of minors, a practice that raises serious ethical, legal, and privacy concerns. In this work, we propose a generalized zero-shot benchmark for facial age estimation that explicitly excludes children's data during training while still assessing model performance on younger populations. 
We revisit six widely used datasets and introduce standardized splits with strict age-group separation: samples aged 18-59 for training, validation, and testing; samples under 18 reserved exclusively for zero-shot evaluation; and samples 60+ as an unseen validation set for model selection under distribution shift. For datasets with identity annotations, subject-exclusive splits prevent identity leakage and better reflect real-world deployment conditions. Evaluating nine state-of-the-art age estimation methods under this protocol reveals that all evaluated methods consistently fail to generalize to unseen age groups, suffering substantial performance degradation -- on average 46.4%, and up to 52.8% -- relative to the supervised baseline. Moreover, models do not simply degrade: they systematically anchor predictions for unseen ages to nearby seen classes, a manifestation of the well-known seen-class bias in generalized zero-shot learning. By formalizing age estimation without children's data as a generalized zero-shot benchmark on existing datasets, this work highlights a critical gap between current modeling practices and real-world ethical constraints. 
Our benchmark provides a principled basis for evaluating models under restricted data regimes and encourages the development of methods that are robust to distribution shift and aligned with responsible data use.","published_date":"2026-05-28T01:44:39+00:00","viability_score":6,"cluster_label":"Ethical AI","has_code":true,"repo_url":"https://github.com/caiopetruccirosa/generalized-zero-shot-age-estimation","commercial_flags":["has_code"],"one_liner":"A zero-shot benchmark for ethical facial age estimation that excludes children's data during training.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.29229v1","title":"Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility","abstract":"Reasoning distillation transfers complex reasoning abilities from large language models (LLMs) to smaller ones, yet its success depends on how well the training data align with the student model. This paper introduces the Data-Model Compatibility (DMC) metric, which can be used to assess the suitability of a dataset for reasoning distillation on a student model. DMC provides an assessment by jointly considering data quality, relative difficulty, and student capability. We validated the effectiveness of DMC from two perspectives: (1) DMC exhibits a strong correlation with reasoning distillation performance; and (2) using DMC as the criterion for data selection leads to improved reasoning distillation performance. Both findings are consistently demonstrated across multiple student models and tasks. 
Moreover, since the DMC of each dataset dynamically changes during training, our experiments demonstrate that dynamically selecting datasets based on DMC can further enhance performance.","published_date":"2026-05-28T01:41:29+00:00","viability_score":6,"cluster_label":"Reasoning Distillation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Introducing a metric for assessing dataset compatibility in reasoning distillation to enhance model performance.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.29225v1","title":"BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents","abstract":"Self-evolving agents improve over time by reflecting on past failures, but existing evaluation is limited in two ways: it measures only task scores, leaving reflection quality unknown, and it relies on agents' own episode runs, offering no mechanism to target specific failure patterns. We present \\textbf{BenchTrace}, a benchmark for evaluating self-evolution ability in LLM agents. BenchTrace is built on a snapshot-reflection dataset of 1,821 annotated episodes spanning six diverse tasks, and comprises a \\textbf{Reflection Evaluation} that probes failure identification through targeted QA tasks, and an \\textbf{Evolution Evaluation} that tests whether past failure experience translates into avoidance behavior in a controlled self-evolution simulation. Building on BenchTrace, we propose \\textbf{failure avoidance rate (FAR)}, a new evaluation metric measuring the fraction of test cases in which the agent successfully avoids the target failure instance. Experiments with Qwen3-32B and GPT-4.1 reveal that both models fall below a 30\\% end-to-end pass rate on reflection evaluation, with diagnosis as the primary bottleneck. 
Evolution evaluation shows that self-evolution methods generally improve FAR over the non-evolving baseline, but agents forget early lessons as noise episodes accumulate, and agents fail to generalize their reflections beyond the specific context, causing negative transfer across task contexts. Our correlation analysis further reveals that only a fully correct reflection is strongly associated with higher FAR. BenchTrace exposes concrete limits of current self-evolution approaches and provides a controlled, model-agnostic framework for targeted evaluation.","published_date":"2026-05-28T01:25:37+00:00","viability_score":6,"cluster_label":"Self-Evolving Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BenchTrace provides a benchmark for evaluating self-evolution abilities in LLM agents through targeted assessments.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.29224v1","title":"Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents","abstract":"AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment mechanisms that govern model outputs. Prior work shows that enabling retrieval in agents increases compliance with harmful requests. We introduce AgentREVEAL, a diagnostic framework for analyzing retrieval-induced safety degradation in LLM agents. The framework examines two axes: how retrieval is integrated into the agent pipeline and the properties of the retrieved content. Along the integration axis, we find that binding tool invocation and response generation in a single step amplifies harmful outputs. 
Along the content axis, we uncover the Safe Source Paradox: even oppositional or safety-oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance by an average of 25% compared to the no-retrieval baseline. Finally, we show that relevance acts as a shared activation condition for both vulnerabilities. Similar patterns appear on frontier closed models, and harmful compliance remains elevated under several representative pipeline interventions, with some agents also entering this regime under autonomous retrieval. Because relevance is also what makes retrieval useful, these results expose a safety-utility trade-off for retrieval-enabled agents. We introduce HarmURLBench, a benchmark containing 1,405 real-world URLs paired with 320 harmful behaviors to support future evaluations.","published_date":"2026-05-28T01:23:48+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"A diagnostic framework and benchmark to analyze and mitigate safety degradation in LLM agents caused by web retrieval.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29218v1","title":"GTA: Generating Long-Horizon Tasks for Web Agents at Scale","abstract":"Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is increasingly limited by the lack of scalable, process-level supervision. Existing benchmarks are largely manually constructed, providing only coarse start-goal annotations without intermediate trajectories, while recent automatic generation efforts remain expensive, biased, and shallow. These limitations prevent reliable training and evaluation of agents that must generalize to realistic, multi-hop, cross-page tasks. 
We introduce a scalable framework, GTA, that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to produce realistic tasks paired with executable trajectories. This design decouples crawling from generation for greater efficiency, grounds tasks in the site graph to enforce compositionality, and ensures dense supervision through deterministic replays and systematic validation. We instantiate the pipeline on over 50 websites covering e-commerce, government, forums, and news, with multilingual and multi-hop coverage. The resulting benchmark reveals a significant human-agent performance gap and enables detailed diagnostics. Our contributions are three-fold: (i) formalizing multi-hop web-agent task generation, (ii) proposing an efficient and validated pipeline for automatic data creation, and (iii) releasing a dynamic benchmark with reproducible evaluation.","published_date":"2026-05-28T01:05:50+00:00","viability_score":7,"cluster_label":"Web Agents","has_code":true,"repo_url":"https://github.com/goodfeli/dlbook_notation","commercial_flags":["has_code"],"one_liner":"A scalable framework and benchmark for generating realistic, long-horizon tasks for training and evaluating web agents.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29194v1","title":"Stochastic Lifting for Generating Trajectories of Stochastic Physical Systems","abstract":"Many stochastic physical systems evolve smoothly over time in the sense that the distribution of states changes regularly across time steps. The transition from current state to the next state can often be modeled as the combination of a smooth map and an explicit source of randomness. Stochastic Lifting exploits this structure by attaching an independent, high-dimensional random label to each state transition in the training data and fitting a transition map from the current state and label to the next state using a standard regression loss. 
The labels act as auxiliary coordinates that let the model represent multiple plausible next states from similar current states, avoiding collapse to a mean prediction in the finite-sample size regime. At inference, fresh labels are sampled at each time step and the learned map is rolled forward autoregressively, generating diverse trajectories with a single network evaluation per time step.","published_date":"2026-05-28T00:10:45+00:00","viability_score":0,"cluster_label":"Generative Models","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical approach to generating trajectories of stochastic physical systems by attaching random labels to state transitions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29192v1","title":"ReasonOps: Operator Segmentation for LLM Reasoning Traces","abstract":"Chain-of-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a vocabulary for describing their internal structure. Previous methods developed to analyze chain-of-thought traces are either too rigid or not expressive enough, failing to capture features across domains and models. To remedy this, we develop ReasonOps, an unsupervised, expressive method for annotating chain-of-thought traces, providing succinct universal operators. Using ReasonOps, we analyze 44,662 traces from 12 thinking LLMs spanning 6 families across 8 reasoning benchmarks and discover that they share a common compositional structure: 7 recurring reasoning operators -- discourse-level moves such as backtracking, inferring, and hypothesizing -- that emerge from unsupervised clustering of sentence-initial 3-token pivots. These operators appear across every model family and benchmark domain, confirmed by three independent LLM judges who classify held-out samples at 70-76% accuracy. We analyze the structure of operators on easy vs. 
hard problems, revealing that reflective operators are more helpful on hard problems and harm performance on easy problems. Operator sequences are highly model-identifying: a classifier trained on operator distributions alone recovers the source model with macro-AUC, revealing that each model family has a distinctive reasoning fingerprint. Structural operator features predict within-problem answer correctness well above baselines. Classifiers built on these operators reach WP-AUC and on AIME specifically. ReasonOps further enables early quality estimation well before the trace completes: we predict at WP-AUC for only 50% of the trace. The ReasonOps pipeline is unsupervised and annotation-free, enabling deep insights into LLM reasoning traces as well as strong downstream results on model identification and correctness prediction.","published_date":"2026-05-28T00:08:35+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An unsupervised method for segmenting LLM reasoning traces into universal operators, enabling analysis of reasoning patterns and model fingerprints.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29184v1","title":"Influence-Guided Symbolic Regression: Scientific Discovery via LLM-Driven Equation Search with Granular Feedback","abstract":"Large Language Models (LLMs) offer a promising avenue for scientific discovery, yet their application to symbolic regression is often constrained by inefficient search strategies and coarse feedback signals. Current methods typically guide LLMs using scalar metrics (e.g., global Mean Squared Error), which fail to identify which components of a proposed equation are driving performance or causing error. 
We introduce \\textit{Influence-Guided Symbolic Regression} (IGSR), a method that frames equation discovery as an iterative two-step process combining diverse term generation with rigorous selection: an LLM generates candidate basis functions $\u03c8_j(\\mathbf{x})$ for a linear model, which are then evaluated using granular influence scores $\u0394_j$. These scores quantify each term's marginal contribution to generalization accuracy, enabling an influence-guided pruning process that systematically refines the model structure. Integrating this mechanism into a Monte Carlo Tree Search (MCTS) enables navigating the combinatorial search space while balancing exploration of novel functional forms with exploitation of high-influence components. We demonstrate IGSR's effectiveness on a diverse suite of benchmarks, including LLM-SRBench, pharmacological PKPD models, an epidemiological simulation, and real-world genomic data. Notably, we validate the framework's capacity for genuine discovery in a case study using a high-dimensional biological dataset, in which IGSR identified a novel relationship between DNA methylation and RNA Polymerase II pausing; a hypothesis that was subsequently supported via wet-lab experimentation.","published_date":"2026-05-27T23:48:01+00:00","viability_score":7,"cluster_label":"Scientific Discovery AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An LLM-driven symbolic regression system that uses granular feedback to discover novel scientific relationships, validated by wet-lab experiments.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29183v1","title":"TIMEGATE: Sustainable Time-Boxed Promotion Gates for Continual ML Adaptation Under Resource Constraints","abstract":"As machine learning (ML) systems evolve to continual adaptation, each re-training cycle uses compute, annotation, and energy. 
We introduce TIMEGATE, a policy layer managing adaptation by budgeting time, labeling, training, and evaluation. TIMEGATE emits a metric-availability signal M for partial vs. full-evaluation decisions. We validate: (i) labeling outperforms training by 2.3x on Adult tabular; (ii) it transfers to LLaMA-3.1-8B + QLoRA on SST-2 (accuracy 0.80 to 0.96; M = 1 in 35/36 runs); (iii) M is informative: 28-cell sensitivity shows M drops to 0.81 at tight thresholds; (iv) 100-cycle simulation achieves 66% evaluation-compute savings with no silent mis-promotions; (v) 10%-slice evaluation on LLaMA uses 89% less wall-clock and energy on a single H200 (ratios agree to 0.2%).","published_date":"2026-05-27T23:41:29+00:00","viability_score":5,"cluster_label":"ML Operations","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A policy layer that manages continual ML adaptation by budgeting time, labeling, training, and evaluation to save compute and energy.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29179v1","title":"Sustainable Metal-Organic Framework Water Harvesters in the Artificial Intelligence Era","abstract":"Metal-organic frameworks (MOFs) are excellent candidates for water harvesting due to their tunable pore environments, which can be precisely engineered to capture and release water in arid conditions. Integrating artificial intelligence (AI) into MOF discovery can further accelerate the design of high-performance sorbents by identifying structural features that enhance atmospheric water harvesting (AWH), stability, and cycling efficiency. In this Perspective, we examine key MOF design principles, including cooperative adsorption, operational relative humidity (RH), uptake capacity, hysteresis, and scalability. We highlight recent design advancements such as multivariate strategies and long-arm linker extension, and examine how these principles tune pore capacity and hydrophilicity, while preserving stability and crystallinity. 
Furthermore, we discuss how AI, large language models (LLMs), and data mining can accelerate the discovery process through predictive synthesis, inverse design, and elucidating synthesis-structure-property relationships for the next generation of MOF water harvesters.","published_date":"2026-05-27T23:30:45+00:00","viability_score":1,"cluster_label":"Materials Science AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Leveraging AI and LLMs to accelerate the discovery of Metal-Organic Frameworks for sustainable water harvesting.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29174v1","title":"Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents","abstract":"DeFi investment agents, systems that use AI for autonomous on-chain trading, have attained over USD 3 billion in combined token valuations since late 2024. We survey over 1,900 AI-tagged crypto projects, filter to investment-focused agents, and curate 10 representative projects spanning strategy and observability dimensions. We then conduct a deep-dive architectural analysis of two prominent agent frameworks, ElizaOS and Virtuals Protocol, and a quantitative on-chain performance analysis of 11 Solana-based agent treasuries with publicly attributable trading activity, covering 925,323 token holders. 
We find that current deployments remain early and heterogeneous: (1) in our sample, many projects do not yet provide clear evidence of autonomous trade execution, and developer interviews suggest that many visible deployments remain basic API integrations; (2) agent treasuries retain over USD 30M in paper gains while token holders collectively lost USD 191.7M, with the top 1% of wallets capturing 81.4% of all gains (USD 1.81B); (3) token valuations are weakly connected to treasury fundamentals, with market-cap-to-AUM ratios exceeding 10,000x versus below 1x for established DeFi protocols; and (4) aggregate user gains peaked at USD 2.4B before declining to net losses, with median returns negative on every platform and tokens declining 93% on average from all-time highs. We interpret these outcomes as characteristic of a permissionless, first-generation market in which open infrastructure enables rapid experimentation but also allows naive or speculative agents to launch before robust standards for autonomy, performance, and stakeholder alignment emerge. We therefore propose a maturity framework along three dimensions: autonomous execution, risk-adjusted profitability, and stakeholder alignment, to characterize the gap between current deployments and future investment-grade agent systems.","published_date":"2026-05-27T23:21:42+00:00","viability_score":3,"cluster_label":"DeFi Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An empirical analysis of DeFi investment agents revealing market immaturity and proposing a framework for future development.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29170v1","title":"UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning","abstract":"Legal NLP benchmarks are overwhelmingly English-centric, leaving failure modes in morphologically rich, non-Latin-script languages undetected. 
We introduce UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions (EDRSR) -- one of the world's largest open judicial corpora (99.5 million decisions). The benchmark comprises: (1) case-type classification (4 classes, n=2,000), (2) judgment form classification (4 classes, n=2,000), (3) case-outcome prediction (6 classes, n=800), (4) legal norm extraction (n=1,794), and (5) cause category prediction (22 classes, n=1,871). We evaluate 11 LLMs (3B--675B) from five families under zero-shot and 3-shot prompting via AWS Bedrock with 158K API calls. Our results reveal sharply task-dependent few-shot effects: few-shot prompting improves judgment form classification by up to +38.6 pp but has mixed effects on outcome prediction. We show that accuracy is misleading on imbalanced legal tasks: the model with highest COP accuracy (62%) is a majority-class predictor (macro-F1: 23%), while the genuinely best model scores only 44% macro-F1. Within-family scaling analysis reveals that 8B models can match frontier performance on surface-level tasks but scaling thresholds vary dramatically across families. We release all data, prompts, and model predictions.","published_date":"2026-05-27T23:12:20+00:00","viability_score":7,"cluster_label":"Legal NLP","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark for evaluating LLMs on Ukrainian legal reasoning, released with data and code, to address the English-centric bias in legal NLP.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29169v1","title":"Domain-Informed Representation for Evolutionary Sieving in Integral and Module Lattices","abstract":"Traditional cryptography, rooted in problems, e.g., integer factorisation or discrete log, is inevitably vulnerable to a fully operational quantum computer. 
Although it remains an engineering frontier, the looming threat extends to encrypted data stored today, which could be decrypted in the future with quantum capabilities. To safeguard against this eventuality, the backbone of modern quantum-safe cryptography is the Shortest Vector Problem (SVP). We enhance Laarhoven's treatment of Ajtai et al.'s sieving as a genetic algorithm (GA) for the SVP by incorporating domain-informed SVP representation and crossover while naturally extending application to module lattices.","published_date":"2026-05-27T23:09:10+00:00","viability_score":0,"cluster_label":"Quantum-Safe Cryptography","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Enhancing a genetic algorithm for the Shortest Vector Problem in quantum-safe cryptography by incorporating domain-informed representation and crossover.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29168v1","title":"Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction","abstract":"Question answering (QA) is a core challenge in AI, particularly for complex queries requiring multi-hop reasoning across documents, or symbolic operations like aggregation or exhaustive listing. Retrieval-augmented generation has become the dominant approach to QA, with recent graph-based variants addressing part of these issues by organizing knowledge to better support compositional questions. However, most textual graph-based RAG methods still lack the structure needed for symbolic operations useful to answer complex questions reliably. This motivates symbolic graph-based approaches, which extract knowledge graphs (KGs) whose relations are logic predicates that enable SQL-like querying. Yet these pipelines typically use LLMs for KG extraction, which can introduce consistency issues, where extracted facts may violate commonsense ontology constraints. 
We propose a neuro-symbolic framework for ontology-grounded KG construction combining open-domain extraction, embedding-based canonicalization of types and predicates, and targeted LLM-based correction of ontology violations. By deferring corrections to a post-extraction stage, our method avoids repeated LLM calls, substantially reducing token usage while improving KG consistency and preserving downstream QA quality. Finally, we show that the extracted KGs are well suited for symbolic querying by measuring the occurrence of SPARQL graph patterns.","published_date":"2026-05-27T23:09:10+00:00","viability_score":3,"cluster_label":"Knowledge Graph Construction","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A neuro-symbolic framework for constructing knowledge graphs by correcting ontology violations post-extraction, reducing LLM token usage and improving KG consistency.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29161v1","title":"Evolutionary Refinement of Generative Graph Topologies: A Hybrid WGAN-GA Approach","abstract":"Generating realistic graph-structured data is challenging due to discrete connectivity, varying graph sizes, and class-specific structural patterns. Recent Generative Adversarial Network (GAN)-based graph generation methods improve edge modelling by learning connectivity and matching class-specific density distributions. However, these models still exhibit noticeable deviations from real graphs, such as in their degree and spectral distributions, indicating that important structural properties are not fully preserved. This work aims to reduce these deviations by refining the graphs produced by an existing GAN-based graph generator framework with a Genetic Algorithm (GA). In the GAN framework, the generator produces both node features and connectivity patterns, while a GNN-based critic evaluates graph realism and class consistency to ensure global structural and class alignment. 
Building on this foundation, we apply a GA to refine the edges of generated graphs. The refinement process guides synthetic graphs toward closer agreement with real data, while preserving diversity and novelty. Experimental results show that the GA refinement consistently lowers combined Maximum Mean Discrepancy (MMD) compared to the base model, leading to graphs that more closely match real structural patterns. This demonstrates that evolutionary refinement is an effective and flexible way to correct residual structural deviations in GAN-based graph generators, improving their suitability for realistic graph synthesis and data augmentation.","published_date":"2026-05-27T22:53:49+00:00","viability_score":7,"cluster_label":"Graph Generation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Refining GAN-generated graphs with a genetic algorithm to improve structural realism and class consistency for data augmentation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29157v1","title":"Parallax: Parameterized Local Linear Attention for Language Modeling","abstract":"Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet the core computational primitive of attention has remained structurally unchanged. Local Linear Attention (LLA) is an attention mechanism derived from nonparametric statistics in the test-time regression framework. In contrast to prior research on efficient attention variants, LLA upgrades the local constant estimate in softmax attention to a local linear estimate, yielding provably superior bias-variance tradeoffs for associative memory. However, LLA has not been scaled in LLM pretraining due to computational and numerical stability concerns. We introduce Parallax, a parameterized Local Linear Attention that is scalable for LLMs. 
Parallax eliminates the numerical solver in LLA and learns an extra query-like projector that probes the KV covariance. We place Parallax within a family of attention mechanisms connected by the bandwidth, the probe construction and the affine structure. We propose a hardware-aware algorithm that increases the arithmetic intensity over FlashAttention, shifting attention into a more compute bound regime. Our prototype decode kernel matches or outperforms FlashAttention 2/3 across diverse batch sizes and context lengths. We pretrain Parallax at 0.6B and 1.7B scales and find consistent perplexity improvements throughout pretraining with gains that transfer to downstream benchmarks. The advantage persists under both parameter-matched and compute-matched controls, demonstrating a Pareto improvement. We perform careful pretraining ablations and identify a novel phenomenon whereby Muon unlocks the capacity of Parallax. To our knowledge, this is the first empirical demonstration of strong architecture-optimizer codesign for attention mechanisms in the architecture research literature.","published_date":"2026-05-27T22:50:44+00:00","viability_score":8,"cluster_label":"LLM Training","has_code":true,"repo_url":"https://github.com/yifei-zuo/Parallax","commercial_flags":["has_code"],"one_liner":"Parallax introduces a scalable, hardware-aware parameterized local linear attention mechanism for LLMs that improves perplexity and downstream performance.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29155v1","title":"CA-AC-MPC: CUDA-Accelerated Actor-Critic Model Predictive Control","abstract":"In the literature, actor-critic model predictive control (AC-MPC) integrates MPC with reinforcement learning to enable high-performance control of complex dynamical systems. 
However, its differentiable MPC layer requires repeatedly solving an optimization problem in both the forward and backward passes, leading to substantial training and inference latency. This paper tackles this bottleneck by introducing a CUDA-accelerated variant that significantly reduces end-to-end execution time while preserving the control performance of the baseline formulation. Simulation results on an agile drone racing task show that our approach achieves state-of-the-art lap times and near-limit dynamic behaviour with markedly reduced training and inference time.","published_date":"2026-05-27T22:38:09+00:00","viability_score":5,"cluster_label":"Robotics","has_code":true,"repo_url":"https://github.com/prisma-lab/CA-AC-MPC","commercial_flags":["has_code"],"one_liner":"A CUDA-accelerated actor-critic model predictive control system significantly reduces training and inference time for agile drone racing.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2605.29153v1","title":"Unveiling Multi-regime Patterns in SciML: Distinct Failure Modes and Regime-specific Optimization","abstract":"Neural networks trained under different hyperparameter settings can fall into distinct training \"regimes,\" with consistent behavior within regimes and qualitative differences across regimes. In this paper, we study such multi-regime behavior in scientific machine learning (SciML) models through a regime-aware diagnostic framework that jointly analyzes performance, training dynamics, and loss-landscape geometry. We identify three key findings: (i) a consistent three-regime structure emerges across many standard SciML models, different constraint enforcements, and various optimizer designs; (ii) optimization effectiveness is regime-specific, with no single method performing well across all regimes; and (iii) SciML models can exhibit fine-grained failure modes that can challenge conventional interpretations of standard loss-landscape metrics. 
Our results provide an approach to establish a unified, task-oblivious perspective on failure modes in SciML and to inform regime-aware guidance for improving robustness. We validate these findings across widely-used SciML models, including physics-informed neural networks, neural operators, and neural ordinary differential equations, on benchmarks spanning representative ordinary and partial differential equations.","published_date":"2026-05-27T22:33:03+00:00","viability_score":3,"cluster_label":"Scientific ML","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A diagnostic framework reveals distinct training regimes and failure modes in SciML models, informing regime-aware optimization.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29151v1","title":"Real-rootedness of the Poincar\u00e9 polynomials of $\\overline{\\mathcal M}_{0,n}$: an AI-assisted proof","abstract":"We prove real-rootedness for the Poincar\u00e9 polynomial \\[   P_n(t)=\\sum_{i=0}^{n-3} \\dim H^{2i}(\\overline{\\mathcal M}_{0,n};\\mathbb{Q})t^i \\] of the Deligne--Mumford moduli space $\\overline{\\mathcal M}_{0,n}$ of stable $n$-pointed rational curves, proving a conjecture of Aluffi--Chen--Marcolli. The proof starts from the Keel--Manin--Getzler recurrence, but its main new idea is a bivariate deformation $F_m(y,t)$ of the Poincar\u00e9 polynomial. This deformation reveals a hidden interlacing structure not visible in the one-variable recurrence. For fixed $t<0$, the zero set of $F_m$ in the $y$-direction is controlled by a Sturm--Rolle argument on the interval $0<y<1-t$. The original polynomial is recovered on the slice $y=1$, and the ordered crossings of the moving roots through this slice give both real-rootedness and strict interlacing. Consequently, the Betti numbers of $\\overline{\\mathcal M}_{0,n}$ form an ultra-log-concave sequence.   
We further prove real-rootedness and ultra-log-concavity for the Poincar\u00e9 polynomial of the Fulton--MacPherson space $\\mathbb{P}^1[n]$ of $n$ ordered points in degenerations of the complex projective line.   The proof for $\\overline{\\mathcal M}_{0,n}$ was obtained through an iterative AI-assisted workflow with Co-Mathematician, an agentic frontier-model system developed by Google DeepMind. The human role was to pose the problem, evaluate successive attempts, request repairs of gaps, compare the evolving argument with the literature, and assemble the final human-verifiable proof. Our additional human contribution was to observe that a similar residual deformation strategy applies to the Fulton--MacPherson spaces $\\mathbb P^1[n]$, yielding the corresponding real-rootedness theorem.","published_date":"2026-05-27T22:26:05+00:00","viability_score":0,"cluster_label":"AI-Assisted Mathematics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An AI-assisted proof using a novel deformation strategy establishes real-rootedness for Poincar\u00e9 polynomials of moduli spaces.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29146v1","title":"SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation","abstract":"Medication recommendation predicts medications for patient visits, but existing methods still face two key challenges. At the model level, traditional drug recommendation methods only predict structured drug codes with limited evidence grounding, while LLM agents can use richer clinical context but may lack safety verification and traceability. At the task level, existing benchmarks often use broad medication categories, which ignore subgroup-level safety differences and can lead to risk overestimation. We introduce the first fine-grained medication recommendation setting based on fourth-level ATC code generation. 
We propose Safe Prescription Agent (SafeRx-Agent), a knowledge-grounded multi-agent framework that uses patient context, external clinical knowledge, and safety verification to recommend traceable medication sets. Experimental results on MIMIC-III and MIMIC-IV datasets show that SafeRx-Agent improves fine-grained medication prediction accuracy while controlling drug interactions, contraindications, and medication set size.","published_date":"2026-05-27T22:15:48+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multi-agent framework for safe and explainable medication recommendations using patient context and clinical knowledge.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29141v1","title":"Toward User Preference Alignment in LLM Recommendation via Explicit Context Feedback","abstract":"Traditional recommender systems (RecSys) primarily infer user preferences from implicit signals (such as clicks, watches, and purchases), often neglecting the rich explicit contextual feedback users provide through verbal text, like comments and reviews. This explicit context feedback captures the nuanced reasons behind user decisions regarding their preferences. In addition, it offers critical heterogeneous information for user preference alignment and more explainable recommendations. Overlooking such signals can lead to misaligned user preferences and further reinforce filter bubbles, as algorithms fail to understand the \"semantic context\" behind user choices. Recent advances in Large Language Models (LLMs) present new opportunities to harness user-generated content for more accurate and diverse recommendations, yet current LLM-based recommendations still focus on using item meta-data and underutilize this resource. In this paper, we advocate for prioritizing explicit context feedback in the next generation of LLM-based RecSys. 
We review the evolution of recommendation paradigms, highlight the value of context-rich feedback, call for new benchmarks and metrics, and introduce frameworks for integrating explicit user signals into scalable LLM-driven RecSys. Centering on user-preference modeling, we aim to foster more personalized, transparent, and explainable RecSys online platforms.","published_date":"2026-05-27T22:10:33+00:00","viability_score":6,"cluster_label":"Recommendation Systems","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Leveraging explicit user feedback with LLMs to build more personalized and explainable recommendation systems.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.29138v1","title":"Multi-Resolution End-to-End Deep Neural Network for Optimizing Latency-Accuracy Tradeoff in Autonomous Driving","abstract":"Latency-accuracy tradeoffs are fundamental in real-time applications of deep neural networks (DNNs) for cyber-physical systems. In autonomous driving, in particular, safety depends on both prediction quality and the end-to-end delay from sensing to actuation. We observe that (1) when latency is accounted for, the latency-optimal network configuration varies with scene context and compute availability; and (2) a single fixed-resolution model becomes suboptimal as conditions change.   We present a multi-resolution, end-to-end deep neural network for the CARLA urban driving challenge using monocular camera input. Our approach employs a convolutional neural network (CNN) that supports multiple input resolutions through per-resolution batch normalization, enabling runtime selection of an ideal input scale under a latency budget, as well as resolution retargeting, which allows multi-resolution training without access to the original training dataset.   We implement and evaluate our multi-resolution end-to-end CNN in CARLA to explore the latency-safety frontier. 
Results show consistent improvements in per-route safety metrics - lane invasions, red-light infractions, and collisions - relative to fixed-resolution baselines.","published_date":"2026-05-27T22:06:23+00:00","viability_score":7,"cluster_label":"Autonomous Driving","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multi-resolution deep neural network for autonomous driving that optimizes latency-accuracy tradeoffs in real-time.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29129v1","title":"Governing Technical Debt in Agentic AI Systems","abstract":"Agentic AI systems are increasingly being explored as production infrastructure: they reason over multiple steps, call tools, act through workflows, and adapt through memory and feedback. These systems create governance challenges that are not fully captured by traditional software or predictive ML technical debt. We define Agentic Technical Debt as the accumulated liability created when prompts, memory, tool schemas, orchestration graphs, control policies, and observability routines are patched together faster than they can be validated, standardized, and governed. We define Stochastic Tax as the recurring operating burden of keeping probabilistic agent behavior within acceptable bounds. The distinction matters: debt is a stock of design and governance liability, while the tax is a flow of operating cost that arises because stochastic agents act through tools and workflows. We outline how managers can make both visible through lightweight dashboards and governance controls.","published_date":"2026-05-27T21:42:49+00:00","viability_score":0,"cluster_label":"Agentic AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Defining and governing technical debt and stochastic tax in agentic AI systems for production infrastructure.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29126v1","title":"When and How Long? 
The Readout-Mediator Angle in Temporal Reasoning","abstract":"A linear probe can decode a representation almost perfectly and yet be completely irrelevant to how the model uses it. On calendar-date duration reasoning in language models, a $\\sin$/$\\cos$ probe recovers day-of-year from a layer's activations, yet ablating its direction has no effect on the model's answers -- while ablating a four-dimensional subspace found by Distributed Alignment Search (DAS) at the same layer collapses performance entirely. We measure the angle between these two subspaces -- the \\emph{readout-mediator angle} -- and find it indistinguishable from the angle between two random subspaces (the Haar-uniform null), meaning the probe has learned a direction orthogonal to the model's actual computation. Reverse-engineering the circuit reveals why: attention heads route month-grained context through learned QK offsets at ${\\pm}30$ and ${\\pm}61$ days, and MLPs then convert \\emph{when} (absolute date) into \\emph{how long} (duration) -- all downstream of the causal subspace the probe never touches. Sparse-autoencoder decomposition confirms the split: probe-aligned and DAS-aligned features encode semantically disjoint concepts with negligible causal overlap. The dissociation replicates across four scales ($1.5$-$9\\,$B) and two model families, with preliminary evidence on two further domains (spatial displacement, symbolic arithmetic), suggesting that readout-mediator orthogonality is a general failure mode of probe-based interpretability. 
This directly undermines proposals to deploy probes as runtime safety monitors: the probe can report high confidence on a direction the model has silently abandoned.","published_date":"2026-05-27T21:38:17+00:00","viability_score":2,"cluster_label":"LLM Interpretability","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper investigates the disconnect between what probes report and how language models actually compute, suggesting probes are unreliable for safety monitoring.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29123v1","title":"The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models","abstract":"Masked diffusion language models (MDMs) uniquely support any-order generation, with confidence-based decoding currently serving as the de facto standard inference policy. To optimize for this, recent training schemes attempt to align training mask patterns directly with those observed during generation. However, we argue that confidence-based decoding is inherently misaligned with the logical-flow trajectories required for complex reasoning, and that confidence-aligned training actively entrenches this misalignment. We make this concrete using multi-digit addition, where the decoding strategy prematurely predicts locally easy digits before resolving their long-range dependencies, producing high-confidence errors on challenging inputs. While traditional random masking keeps the failure rate low on this challenging tail, confidence-aligned training amplifies the error rate by an order of magnitude. Across five distinct reasoning tasks, this same pattern emerges with task-dependent severity: confidence-based decoding induces failures on highly complex inputs, and confidence-aligned training exacerbates them. 
In contrast, random masking -- despite its perceived inefficiency -- robustly preserves the reasoning-trajectory conditionals essential for solving the challenging tail.","published_date":"2026-05-27T21:33:37+00:00","viability_score":2,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research identifies a reasoning failure in masked diffusion models where confidence-based decoding leads to high-confidence errors, especially on complex tasks.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29121v1","title":"A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router","abstract":"We propose a minimal dynamical model of adaptive softmax routing for a two-expert Mixture-of-Experts (MoE) layer. The model is obtained as a mean-field limit of a discrete reinforcement rule: the selected expert receives a small score increment, while all scores undergo regularizing decay. In the symmetric case the limiting system has a supercritical pitchfork bifurcation: for weak feedback there is a unique stable balanced state, whereas above a critical feedback strength two stable asymmetric states appear. When an external asymmetry is added, the pitchfork unfolds into a pair of fold bifurcations forming a cusp in the control-parameter plane. We derive exact parametric equations for the bifurcation set and the local normal form of the cusp catastrophe. Numerical experiments connect this picture to empirical expert load, a small trainable MoE model, hard top-1 PyTorch routing, and a small classification experiment on digits. 
The results provide a controlled low-dimensional mechanism for abrupt transitions to load imbalance in adaptive MoE routers.","published_date":"2026-05-27T21:29:30+00:00","viability_score":3,"cluster_label":"Mixture-of-Experts","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper presents a minimal dynamical model for load imbalance in softmax Mixture-of-Experts routers, revealing bifurcation dynamics that lead to asymmetric expert usage.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29119v1","title":"PRO-CUA: Process-Reward Optimization for Computer Use Agents","abstract":"Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their training remains constrained by costly live environment interaction and limited high-quality supervision. Existing filtered behavior cloning pipelines suffer from imitation bottlenecks, including distribution shift from the expert demonstration and the absence of negative learning signals. Meanwhile, standard trajectory-level reinforcement learning struggles with sparse rewards, ambiguous credit assignment, and high infrastructure costs for long-horizon GUI interaction. In this work, we propose PRO-CUA, a process-reward optimization framework for training CUAs with iterative step-level reinforcement learning. PRO-CUA decouples on-policy environment interaction from policy optimization: the current policy collects states through live rollouts, generates diverse candidate actions for each state, receives step-level feedback from a process reward model (PRM), and is optimized with group-relative advantages. This design enables dense and flexible credit assignment without relying on golden answers or offline expert trajectories, while reducing distribution shift by training on the agent's own execution states. 
Experiments on live web benchmarks demonstrate the effectiveness of PRO-CUA and the reliability of PRM-guided step-level training.","published_date":"2026-05-27T21:28:26+00:00","viability_score":6,"cluster_label":"Agents","has_code":true,"repo_url":"https://github.com/yifei-he/PRO-CUA","commercial_flags":["has_code"],"one_liner":"PRO-CUA is a framework for training computer use agents using iterative step-level reinforcement learning with a process reward model, enabling dense credit assignment and reducing distribution shift.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29116v1","title":"Beyond Consensus: Trace-Level Synthesis in Mixture of Agents","abstract":"When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or layered synthesis, treating agreement as the finish line. We show this is unnecessarily lossy: an LLM aggregator that reads complete reasoning traces recovers correct solutions even when agents unanimously agree, with beneficial corrections consistently outweighing harmful ones -- the \\emph{aggregation paradox}. Majority voting has a ceiling that perturbation diversity does not raise (error correlations are identical); the aggregator's gain comes from trace-level complementarity, assembling correct intermediate steps from minority chains that voting discards. These findings motivate Self-Consistent Mixture of Agents which generates trace diversity through semantic-preserving input perturbations, safeguards the majority via anchored refinement with provable non-degradation guarantees, and always synthesizes -- never gates on consensus. A single model with perturbation-induced trace variation outperforms heterogeneous model pools across structured reasoning, PhD-level science, competition mathematics, and competitive programming. 
The unit of aggregation should be the reasoning trace, not the answer.","published_date":"2026-05-27T21:24:35+00:00","viability_score":3,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel aggregation method for LLM agents that synthesizes reasoning traces instead of relying on majority votes, improving accuracy even when agents agree.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29115v1","title":"unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning","abstract":"Unix competence is the ability to use shell and operating-system primitives as first-class tools, not merely to write programs through a terminal. Current terminal benchmarks tend to blur this distinction: a solver fluent in Python but weak in Unix can pass a substantial fraction of Terminal-Bench 2.0, while the reverse skill profile is rarely exercised. We make the distinction operational and build a training surface for the Unix component. unix-ctf is a procedural generator of capture-the-flag tasks for shell agents. Each task hides a short token (a flag of the form flag(a3b1c9...)) inside a fresh Linux container using a single Unix feature, and the agent must recover it. Tasks are produced by an LLM-assisted synthesis pipeline that generates candidate hiding techniques, rewrites them into parameterized hide-and-find script pairs, and filters them with a bidirectional contract: the hide script must leave no plaintext trace of the flag on disk, and the find script must recover the flag in a fresh directory. Because the LLM only writes the planting and recovery steps (the container, layout, and grading harness are fixed), the pipeline lands 656 of 750 raw attempts as portable, reusable variants (87.5\\%). Our reproduction of Endless Terminals' full-container-generation approach lands only 17.4\\% under the same checks. The 656 variants canonicalize to 155 distinct techniques. 
Fine-tuning Qwen3-8B with LoRA using GRPO on this surface lifts solve rate from 11.6\\% to 43.6\\% on a 15-skill multi-family holdout (n=225), redistributes which InterCode-CTF tasks the model solves, and produces a +33 pp gain in Forensics while reaching 32/100 on InterCode-CTF. These results suggest that Unix competence is separable, trainable, and best evaluated directly rather than folded into programming-through-a-shell.","published_date":"2026-05-27T21:23:00+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":"https://github.com/VmaxAI/your-repo","commercial_flags":["has_code"],"one_liner":"Procedurally generates Unix-competence capture-the-flag tasks using LLMs and containerization to train agents for shell and OS primitives.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29107v1","title":"GEO-Bench: Benchmarking Ranking Manipulation in Generative Engine Optimization","abstract":"Large language models (LLMs) increasingly rank products, documents, and recommendations for user queries, which makes manipulating these rankings a growing concern for fairness and information integrity. Research on generative engine optimization (GEO) has produced many manipulation methods, but each is evaluated on its own dataset with its own metrics, so their relative strength and detectability stay unclear. We present GEO-Bench, a benchmark that evaluates GEO ranking-manipulation attacks under one protocol. It unifies black-box prompt-based attacks (TAP, Zero-Shot), white-box gradient-based attacks (STS, RAF, StealthRank), and ten white-hat C-SEO strategies. We score every method on five datasets against a fixed open-weight ranker (Llama-3.1-8B-Instruct), using metrics for both effectiveness (NRG, Success@\u03b1, Promote@\u03b1) and stealth (keyword violation rate, perplexity ratio). 
Our evaluation shows that effectiveness and stealth trade off across adversarial attacks, that black-box content rewriting matches or exceeds gradient-based attacks on rank promotion while producing more fluent text and can evade both keyword- and perplexity-based detection on some domains, and that the access model does not predict attack strength. By standardizing datasets, attack implementations, and metrics, GEO-Bench enables the first direct comparison across these attack paradigms and supports the development of detection methods.","published_date":"2026-05-27T21:10:43+00:00","viability_score":7,"cluster_label":"LLM Security","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"GEO-Bench standardizes evaluation of LLM ranking manipulation attacks, enabling direct comparison and development of detection methods.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29096v1","title":"Trends in AI and Human-AI Interaction in Clinical Trials -- A Hybrid Human-AI Exploration","abstract":"This paper examines records retrieved from the ClinicalTrials.gov registry to characterize temporal trends in AI terminology and the geographical distribution of AI trials. The work also reports on an exploratory hybrid human-AI approach to analyzing human-AI interaction trends in registered clinical trials. The hybrid workflow comprised a frontier generative AI model (GPT-5.5) and human review to screen and categorize records returned by an AI-focused search. The findings indicate a marked increase in AI-related trials over time, with recent growth in references to machine learning, deep learning, chatbots, GPTs, and large language models. Geographically, China and the United States accounted for the largest numbers of AI-related trials, with notable recent increases in several other countries including Italy, France, Spain, the UK and Turkey (T\u00fcrkiye). 
In a random sample of 100 records, human and AI classifiers showed good agreement in identifying studies not substantively using AI, but lower agreement in classifying human-AI interaction, particularly where health professional interaction was ambiguous or insufficiently described. Overall, the results suggest that hybrid human-AI screening of clinical trial records is potentially viable, but clearer trial reporting and more precise interaction definitions will benefit the process.","published_date":"2026-05-27T20:56:36+00:00","viability_score":3,"cluster_label":"AI in Healthcare","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Analyzes temporal and geographical trends of AI terminology in clinical trials and explores a hybrid human-AI approach for record analysis.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29089v1","title":"OISD: On-Policy Internal Self-Distillation of Language Models","abstract":"Recent reinforcement learning (RL) post-training approaches primarily optimize the final output policy using sparse outcome-level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on-policy internal self-distillation and propose the OISD framework, which improves reasoning by transferring on-policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts as both the policy and a detached internal teacher for selected intermediate layers, which are guided to align with it through two complementary mechanisms: logit alignment, which transfers high-level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer, both without requiring external privileged information. 
Our OISD, together with GRPO, employs signed advantage-weighted Jensen--Shannon alignment to distill informative intermediate representations while preserving policy consistency under a unified acting policy. Experimental results demonstrate the effectiveness of OISD, with substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning tasks. The code will be released at https://github.com/THE-MALT-LAB/OISD","published_date":"2026-05-27T20:43:10+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":true,"repo_url":"https://github.com/THE-MALT-LAB/OISD","commercial_flags":["has_code"],"one_liner":"A framework for improving language model reasoning by distilling internal representations using reinforcement learning.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29087v1","title":"The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure","abstract":"Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure we find a previously undocumented failure mode: the chain-of-thought stays factually correct from first turn to last while the emitted answer flips wrong. We call this unfaithful capitulation (UC) and isolate it with a $2\\times 2$ latent-versus-behavioral framework that flip-rate metrics and single-turn faithfulness probes both miss. Across three datasets (MT-Consistency, MMLU-Pro, GSM8K), the latent-correct rate at the behavioral flip clusters near 50% in think mode and collapses to 11-15% under no_think -- paired, within-model causal evidence that reasoning creates the gap. Across models the effect tracks the reasoning channel (high in Qwen3-32B and GPT-OSS-20B, low in inline-CoT Gemma-4-31B-it). 
An independent GPT-4o judge corroborates $86\\%$ of UC labels; a token-level probe shows the answer-slot argmax is correct in $84\\%$ of UC cells; and a naive trace-anchored defense backfires. We release all trajectories, traces, and judge labels.","published_date":"2026-05-27T20:41:08+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Detecting and mitigating a failure mode in reasoning models where correct reasoning chains lead to incorrect answers under adversarial pressure.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29084v1","title":"Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG","abstract":"A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give a different answer to the same question depending on which source it retrieves -- a failure mode the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, releasing three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested -- understating its prevalence, not its intensity. 
The framework is domain-agnostic and transfers to legal and educational RAG: measuring source-dependence is a responsibility for deployed multi-source NLP generally.","published_date":"2026-05-27T20:38:05+00:00","viability_score":7,"cluster_label":"RAG","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for auditing source-dependence in medical RAG systems to ensure consistent and reliable answers across institutional documents.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29082v1","title":"The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane","abstract":"AI agents are increasingly expected to operate as digital employees: accessing enterprise data, making decisions, and taking actions autonomously. But agents are simultaneously less predictable than humans -- prone to hallucination, misinterpretation, and adversarial manipulation -- and more technically capable: with deep system knowledge and high-throughput interfaces cascading damage at machine speed. This combination makes it unsafe to rely on agents to faithfully interpret or propagate security-critical metadata such as access policies, data classifications, and behavioral constraints.   We present the Redpanda Agentic Data Plane (ADP), an architecture built around out-of-band metadata channels: infrastructure pathways that carry security context, policy signals, and audit trails deterministically, entirely outside the agent's read and write path and across heterogeneous infrastructure. These channels enforce governance at every stage of the agent lifecycle -- scoping data access on the way in, constraining actions during execution, and capturing tamper-proof transcripts on the way out.   
We demonstrate ADP with a multi-agent portfolio rebalancing system in which autonomous agents monitor markets, make trade decisions, and execute orders across isolated client accounts -- with per-client data scoping, trade approval thresholds, and tamper-proof audit trails all enforced by out-of-band channels the agents can neither see nor bypass.","published_date":"2026-05-27T20:37:27+00:00","viability_score":8,"cluster_label":"AI Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An architecture for safe autonomous agents that uses out-of-band metadata channels to enforce governance and security.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29078v1","title":"Bridging the Sim-to-Real Gap in Reinforcement Learning-Based Industrial Dispatching through Execution Semantics","abstract":"Event-driven scheduling policies are increasingly deployed in industrial environments, where decisions are made under asynchronous and partially observed system states. As a result, decision states are not temporally consistent, action admissibility is not explicitly defined, and the origin of execution errors remains ambiguous. These issues limit both reliability and interpretability.   To address this gap, a policy-neutral execution and measurement layer is proposed to mediate between scheduling policies and the industrial execution environment. The layer constructs decision-valid snapshots from asynchronous event streams, defines a standardized execution contract with explicit action admissibility, and records outcomes as divergences between policy intent, transactional outcomes, physical execution, and human intervention. This enables a separation between decision semantics and execution behavior and makes deployment mismatch observable and structurally attributable.   The proposed framework is evaluated using a discrete-event simulation. 
The results show analytical benefits across all observation lag regimes, as undifferentiated execution failures are transformed into structured, typed outcomes with full attribution coverage. Operational benefits are strongest under low observation lag, where avoidable execution errors can be prevented before commitment. Overall, the layer turns execution uncertainty into supervisory data for evaluation and policy refinement.","published_date":"2026-05-27T20:30:20+00:00","viability_score":2,"cluster_label":"Industrial AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework to improve reliability and interpretability of industrial scheduling policies by separating decision semantics from execution behavior.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29076v1","title":"Structured Prompt Optimization Meets Reinforcement Learning for Global and Local Interpretability over Complex Text","abstract":"LLMs have advanced text classification, yet existing paradigms face a trade-off: supervised (label only) fine-tuning is scalable but offers limited reasoning on complex text and lacks broader model transparency, while discrete prompt optimization offers human-readable instructions but struggles with performance and scalability. We introduce eXTC (eXplainable Text Classifier) with three progressive stages: (1) learning a Standard Operating Procedure (SOP, or rulebook) in natural language via a new Structured Prompt Optimization algorithm; (2) SOP-grounded reasoning distillation from a large teacher LLM into a compact LM; and (3) expanding reasoning capabilities beyond the initial SOP via reinforcement learning. 
This design enables eXTC to provide (i) fast inference via a compact LM, with (ii) inference-time local reasoning traces, alongside a global, modular explanation of its learned domain rules, while (iii) significantly outperforming existing paradigms across diverse benchmarks in both classification performance and explanation quality, with stage-by-stage gains.","published_date":"2026-05-27T20:29:41+00:00","viability_score":7,"cluster_label":"LLM Interpretability","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An explainable text classifier that uses structured prompt optimization and reinforcement learning to provide fast inference, local reasoning traces, and global rule explanations.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29068v1","title":"Robust and Efficient Guardrails with Latent Reasoning","abstract":"Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single-pass classification or, more recently, distilled reasoning. Reasoning-based guardrails significantly outperform classification-only baselines, but they incur substantial query latency and token overhead that make them impractical for high-throughput deployment. To address this challenge, we propose COLAGUARD, a guardrail model that transfers multi-step safety reasoning into a continuous latent space through a stage-wise training curriculum, enabling direct hidden-state propagation at inference. Evaluated on ten prompt- and response-moderation settings spanning eight safety benchmarks, COLAGUARD improves macro-F1 by 8.24 points over Llama Guard 3 and matches our explicit reasoning baseline, GuardReasoner, in macro-F1 while delivering a 12.9X speedup and 22.4X reduction in token usage. 
Our results suggest that latent reasoning offers a practical alternative to explicit rationale generation for deployable guardrails, jointly improving safety robustness and inference efficiency rather than treating them as competing objectives.","published_date":"2026-05-27T20:15:22+00:00","viability_score":7,"cluster_label":"LLM Safety","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A latent reasoning guardrail for LLMs that significantly improves safety robustness and inference efficiency over explicit reasoning methods.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29059v1","title":"SCDBench: A Benchmark for LLM-Based Smart Contract Decompilers","abstract":"Smart contract decompilation aims to recover high-level source code from bytecode, but evaluating decompilers remains difficult because existing studies use narrow datasets, inconsistent metrics, and limited semantic consistency checks. This gap is increasingly important as large language models (LLMs) begin to generate source-like Solidity that may compile and appear plausible, even when its semantics diverge from the original contract. We introduce SCDBench, a dataset and benchmark methodology for LLM-based smart contract decompilation. The dataset contains 600 real-world Solidity contracts with paired bytecode inputs, ground-truth source code, and replayable semantic checkpoints. SCDBench evaluates decompiler outputs through four cumulative stages: format completeness, compilability, Application Binary Interface (ABI) recovery, and semantic consistency via differential replay. We evaluate Claude Opus 4.7, GPT-5.3-Codex, and GLM-5 in a zero-shot decompilation setting, including GLM-5 variants with and without extended reasoning and a zero-shot compilation-repair setting. 
The results show that frontier LLMs can often produce structured and compilable Solidity, but achieving semantic consistency remains far from solved: the best-performing frontier model perfectly decompiles only 42/600 contracts. We further show that introducing same-model compilation repair substantially improves performance at modest additional cost. SCDBench establishes a common ground for rigorous, reproducible evaluation and aims to accelerate the development of reliable smart contract decompilers for blockchain security and transparency.","published_date":"2026-05-27T20:08:47+00:00","viability_score":8,"cluster_label":"Smart Contract AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and dataset for evaluating LLM-based smart contract decompilers, aiming to accelerate the development of reliable tools for blockchain security.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29055v1","title":"Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching","abstract":"Hallucination remains a major reliability barrier for production LLM systems, particularly in multi-agent pipelines where unsupported claims can propagate unchecked across stages. This paper adapts a HOPE-inspired Nested Learning architecture with Continuum Memory Systems (CMS) and semantic similarity caching to a hybrid benchmark of 310 prompts combining 217 epistemic-uncertainty prompts and 93 fabrication-induction stress-test prompts. A three-stage agentic pipeline orchestrated via the Open Floor Protocol (OFP) is evaluated with five KPIs -- FCD (Factual Claim Density), FGR (Factual Grounding References), FDF (Fictional Disclaimer Frequency), ECS (Explicit Contextualization Score), and OSR (Observability Score Ratio) -- aggregated into THS (Total Hallucination Score) across five weighting configurations to study mitigation-observability trade-offs. 
FDF, ECS, OSR, and FGR are subtracted as mitigation signals, so that a more negative THS indicates stronger mitigation. The FrontEndAgent is configured as a high-stochasticity generator (temperature = 1.0) to produce a realistic hallucination baseline, while the SecondLevelReviewer and ThirdLevelReviewer operate as progressive correctors. This asymmetric design yields end-to-end THS reductions of -31.3% to -35.9% across five weighting configurations. Semantic caching achieves 440 cache hits over 930 potential calls (47.3% hit rate), reducing LLM invocations to 490, lowering energy and CO2e footprint, and making multi-stage review pipelines operationally viable at production scale. ExtremeObservability attains the most negative final THS (-0.0709), confirming that observability-heavy configurations reinforce rather than compromise mitigation. These findings suggest that memory-augmented multi-agent designs can jointly improve factual reliability, operational efficiency, and auditability without model retraining.","published_date":"2026-05-27T20:01:08+00:00","viability_score":7,"cluster_label":"LLM Reliability","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper introduces a novel agentic AI architecture with semantic caching to significantly reduce hallucinations in LLM pipelines, improving factual reliability and operational efficiency for production scale.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29042v1","title":"Differentiable Belief-based Opponent Shaping","abstract":"Human coordination often relies on the ability to influence the beliefs of others through strategic action. In multi-agent reinforcement learning, opponent shaping attempts to replicate this influence, though existing methods typically operate within an opponent's parameter, policy, or value space. 
Meanwhile, belief-manipulation techniques in hidden-role games often rely on hard-coded objectives, such as deception or belief saturation. We propose Differentiable Belief-based Opponent Shaping (D-BOS), a first-order method that treats each observer's belief as the shaped opponent state and differentiates through $k$-step softmax-Bayes belief dynamics. Rather than explicitly rewarding deceptive or cooperative behavior, our method treats the belief state as the target for shaping. This allows the optimal strategy to emerge naturally from the environment's reward structure. This belief-space formulation provides an opponent-shaping signal by differentiating through opponent belief updates, and naturally extends to multiple observers by aggregating gradients over their individual inferred belief trajectories. Empirically, D-BOS outperforms PPO and BBM in hidden-role games, with the largest gains in mixed-motive settings.","published_date":"2026-05-27T19:44:32+00:00","viability_score":3,"cluster_label":"Multi-Agent RL","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A new method for multi-agent reinforcement learning that shapes opponent behavior by directly influencing their beliefs, outperforming existing techniques in complex games.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29041v1","title":"Practitioner Beliefs and Behaviors in AI-Enhanced Education: DOT Framework Survey Evidence","abstract":"This study reports findings from a cross-sectional survey (n = 72) of higher education practitioners examining beliefs, behaviors, and institutional conditions related to artificial intelligence (AI) integration in teaching and learning. Grounded in the DOT Framework, which integrates design thinking and open systems theory, the study investigates AI familiarity, usage patterns, design-oriented practices, and pedagogical beliefs. 
Exploratory factor analysis of 19 belief items identified a three-factor structure: AI Functional Capabilities, Oversight and Governance, and Instructor Collaboration and Planning (\u03b1 = .90). Results indicate that practitioners hold favorable views of AI as a pedagogical support while maintaining strong commitments to human oversight and critical evaluation. Reported practices emphasize iterative prompting and content generation, with less consistent use of needs assessment and feedback loops. Institutional barriers including limited policy, training, and infrastructure were widely reported. These findings provide preliminary empirical support for the DOT Framework as a descriptive model of practitioner beliefs and practices, while also highlighting gaps between design-oriented theory and current implementation. The study contributes an initial measurement structure and identifies directions for confirmatory validation and outcome-based research linking AI-supported design practices to instructional quality.","published_date":"2026-05-27T19:42:31+00:00","viability_score":0,"cluster_label":"AI in Education","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This survey explores higher education practitioners' beliefs and behaviors regarding AI integration in teaching, revealing a gap between theoretical design and current implementation.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.29028v1","title":"Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning","abstract":"Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring the $Q$-value of the output policy is consistent with the input RTG. 
By leveraging a $Q$ function to provide dense guidance to CSMs and further fine-tuning it using an RTG-perturbation technique with the CSM, our method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, we demonstrate through extensive experiments that Q-ALIGN DT achieves superior controllability and performance across the D4RL benchmark. Remarkably, our model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks like velocity-tracking where prior methods fail.","published_date":"2026-05-27T19:24:35+00:00","viability_score":7,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Q-ALIGN DT is a novel framework that aligns return-to-go signals with policy performance in conditioned sequence models, achieving superior controllability and generalization on benchmark tasks.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.29027v1","title":"Mind Your Tone: Does Tone Alter LLM Performance?","abstract":"The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM accuracy for objective multiple-choice questions. We use two datasets: a 50-base question dataset with five tone variants and a 570-base question MMLU subset spanning 57 subjects with seven tone variants. Experiments were conducted to evaluate the performance of four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent. 
Some models show small, yet statistically significant, shifts, while others exhibit large accuracy swings across tones. Further, we identify subject-level differences in tone sensitivity and present a routing framework to explain how tones may attune internal reasoning modes. Our findings caution users against assuming tone-robust reliability in LLM deployments.","published_date":"2026-05-27T19:23:46+00:00","viability_score":5,"cluster_label":"LLM Prompting","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research reveals how prompt tone significantly impacts LLM accuracy on objective questions, suggesting a need for tone-aware prompting strategies in LLM deployments.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.29005v1","title":"LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers","abstract":"Diffusion-based neural solvers for combinatorial optimization repeatedly re-evaluate dense edge/factor interactions, making inference expensive in wall-clock time and often memory-bound at scale. Inspired by the computational methodologies of many-body physics, we introduce LoRe, a training-free, inference-time drop-in wrapper that enforces per-step interaction-evaluation budgeting: at each iteration, it evaluates only a fixed fraction of interactions by dynamically routing computation to high-conflict or high-uncertainty interactions, instead of using a fixed sparsification (e.g., static kNN graphs or static masks). Under fully inclusive end-to-end wall-clock accounting, LoRe substantially improves scalability on the Maximum Independent Set (MIS) problem, extending feasible inference more than $3\\times$ beyond the baseline's out-of-memory limit, delivering a $\\sim 8\\times$ speedup and a $\\sim 12\\times$ peak-memory reduction, with solution quality preserved in this regime. 
Demonstrating cross-task generality on the large-scale Traveling Salesperson Problem (TSP) and zero-shot robustness to topology shifts, LoRe achieves a $\\sim 15\\times$ speedup at $n=1000$ with a $44\\times$ memory reduction and competitive tour quality.","published_date":"2026-05-27T19:00:57+00:00","viability_score":7,"cluster_label":"Combinatorial Optimization","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LoRe is a training-free wrapper that significantly speeds up and reduces memory usage for combinatorial optimization solvers by adaptively budgeting computational interactions.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.29001v1","title":"FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks","abstract":"A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap: Claude Haiku 4.5 achieves 86% accuracy yet SCR=50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86-96% yet Semantic Consistency Rates (SCR) span 50-82% -- a 32-point gap invisible to standard benchmarks. 
Formally, for any target ranking over 9 frontier models there exists a weighting over paraphrase families that realizes it (No-Free-Benchmark corollary), because no model Pareto-dominates all families -- so benchmark designers who select families are implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.","published_date":"2026-05-27T18:59:18+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"FormInv introduces a protocol and metrics to rigorously evaluate semantic invariance in LLM mathematical reasoning, revealing critical flaws in existing benchmarks and model rankings.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28999v1","title":"Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening","abstract":"LLMs are vulnerable to prompt injection attacks. However, this vulnerability has been primarily demonstrated conceptually in academic studies or through a few anecdotal case studies. Its prevalence and impact in real-world LLM-based applications are largely unexplored. In this work, we present the first systematic study of prompt-injection attacks in a widely used application: LLM-based resume screening. Our analysis is based on approximately 200K real-world resumes collected over multiple years by hireEZ. We first design tailored methods to detect prompt injection in resumes. Manual validation on a small-scale dataset demonstrates that our detectors achieve high precision and outperform state-of-the-art general-purpose detectors. 
We then apply our detector to the full resume dataset and conduct a comprehensive measurement study of real-world prompt injection attacks. Our analysis reveals several intriguing findings: approximately 1% of resumes contain hidden prompt injections; the prevalence of such injected resumes has increased noticeably over the past one to two years; and more than 90% of injected prompts do not use explicit instructions. These results provide the first evidence of large-scale prompt injection in real-world LLM-based applications and lay the groundwork for future studies to understand and mitigate such attacks.","published_date":"2026-05-27T18:56:19+00:00","viability_score":8,"cluster_label":"LLM Security","has_code":true,"repo_url":"https://github.com/UNITES-Lab/resume-injection-measurement","commercial_flags":["has_code"],"one_liner":"This research systematically measures real-world prompt injection attacks in LLM-based resume screening, revealing prevalence, trends, and attack vectors.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.28994v1","title":"BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation","abstract":"AI tools to support real world decision making must be able to build simulation models that inform their recommendations and render them interpretable. Tools that can automate aspects of modeling practice must complement human expertise, not replace it. The BEAMS Initiative aims to guide the development of AI tools for modeling and simulation toward forms that are responsible and ethical by establishing benchmarks for human centered modeling and simulation practices. The initiative uses open digital and organizational infrastructure to collaboratively evaluate AI tools for modeling and simulation. The open source sd ai project hosted by the initiative establishes transparency and enables contributions to be shared broadly. 
A steering group focuses on prioritizing potential benchmarks, while a technical group focuses on implementing the benchmarks in the form of automated tests. Tests for several distinct categories of evaluation have been implemented and applied to AI tools that support qualitative model building, quantitative model building, and model discussion. These include tests for causal translation, model iteration, causal reasoning, conformance, model behavior explanation, suggested model building steps, and suggested model fixes. When engines from the sd ai project are coupled with different LLMs, their performance on these evaluations reveals variability across different AI tools. The evaluations implemented by the initiative demonstrate that AI enabled modeling tools perform better at discussion and basic qualitative tasks than with causal reasoning and quantitative error fixing. No single LLM dominates across engine types, highlighting the importance of specific tasks and tradeoffs between speed and accuracy. 
Ongoing efforts of the initiative aim to incorporate benchmarks that address concerns about bias by considering alternative perspectives and human centered use cases.","published_date":"2026-05-27T18:51:00+00:00","viability_score":7,"cluster_label":"AI for Modeling & Simulation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BEAMS establishes benchmarks and evaluates AI tools for modeling and simulation, highlighting their strengths in qualitative tasks and weaknesses in causal reasoning.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28983v1","title":"The Hamilton-Jacobi Theory of Deep Learning","abstract":"In this paper, training a neural network is identified, exactly, as a search through Hamilton--Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton--Jacobi equation whose Hopf--Cole propagator best fits the observations; at inference, the input is the spatial point at which that solution is evaluated and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton--Jacobi equations, with architecture-dependent Hamiltonian and viscosity. A single deformation parameter $\\varepsilon$ unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. 
Quantitative consequences include: the minimax optimal generalization rate $O(n^{-1/(d+2)})$ for fixed $t$; adversarial robustness controlled by $\\varepsilon$; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents consistent with data intrinsic dimension via PDE quadrature; and a closed-form $O(N)$ influence function (softmax attribution weights $\u03c0_j$) whose entropy landscape undergoes fold bifurcations as $\\varepsilon$ increases, each merging attribution basins.","published_date":"2026-05-27T18:38:23+00:00","viability_score":1,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper theoretically connects neural network training to Hamilton-Jacobi equations, offering a unified framework for understanding various architectures and their properties like generalization and robustness.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.28978v1","title":"VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis","abstract":"Finite Element Analysis (FEA) serves as the cornerstone of modern engineering design. However, its workflow is inherently complex and relies heavily on domain expertise. Although recent efforts have integrated Large Language Models (LLMs) into FEA, existing approaches face limitations in handling multimodal inputs and executing complex tasks. To address these limitations, we propose VFEAgent, an end-to-end multi-agent system designed to automate FEA modeling and simulation directly from input images and problem descriptions. 
Our methodology integrates two core components: (1) a multimodal vision-language multi-agent pipeline that employs ReAct-driven reasoning to extract structured FEA specifications from heterogeneous inputs and (2) a verification-first code synthesis framework, incorporating robust self-debugging and fallback mechanisms to ensure executability and physical validity. We systematically evaluated the system across various engineering mechanics scenarios. The results demonstrate that VFEAgent achieves a high success rate in generating complete and physically valid simulations, outperforming LLM-based baseline methods in reliability and correctness. These findings validate the feasibility of automating the complete FEA workflow, highlighting the framework's potential to liberate engineers from tedious manual analysis.","published_date":"2026-05-27T18:34:04+00:00","viability_score":8,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"VFEAgent is a multimodal agent framework that automates complex Finite Element Analysis from images and text, achieving high success rates in generating valid engineering simulations.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.28977v1","title":"Comparing Post-Hoc Explainable AI Methods for Interpreting Black-Box EEG Models in Depression Detection","abstract":"Recent advances in deep learning have enabled increasingly accurate electroencephalography (EEG)-based classification of Major Depressive Disorder (MDD), but the decision-making processes of high-capacity models remain difficult to interpret. This study investigates multiple post-hoc explainability methods applied to an InceptionTime architecture trained for EEG-based MDD detection. The analysis includes Shapley-based, gradient-based, and perturbation-based attribution approaches: DeepSHAP, Integrated Gradients, GradCAM, Occlusion, and Permutation Feature Importance. 
Explainability analysis was performed within a subject-level stratified 5-fold cross-validation framework using global attribution aggregation across EEG segments and subjects. The evaluated methods revealed partially convergent attribution patterns, with recurring emphasis on frontal, temporal, and posterior EEG regions, particularly in the right hemisphere. Quantitative comparison demonstrated substantial agreement between gradient- and perturbation-based approaches, while DeepSHAP produced comparatively distinct attribution distributions. At the same time, variability between explainability methods highlighted the influence of methodological assumptions on the resulting explanations. Overall, the results suggest that different post-hoc explainability approaches capture partially overlapping relevance structures in EEG-based deep learning models for depression detection. Although the observed attribution patterns are broadly consistent with several previous EEG studies of MDD, the analysis should be interpreted as exploratory rather than evidence of definitive neurophysiological biomarkers or clinical applicability. The study highlights both the usefulness and limitations of post-hoc explainability for interpreting black-box EEG classifiers in psychiatric applications.","published_date":"2026-05-27T18:32:57+00:00","viability_score":5,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This study compares post-hoc explainable AI methods for interpreting EEG models in depression detection, revealing partially convergent attribution patterns but highlighting methodological variability.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.28969v1","title":"Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization","abstract":"If an AI agent makes decisions on a person's behalf, those decisions must align with its user. 
We introduce representational accuracy to measure how faithfully a system captures a person's interpretation. An interpretive layer is operationalized as a Behavioral Specification. Our reference implementation aggressively compresses a person's data into interpretive patterns, served as context to a language model. We evaluate the Specification on a prototype benchmark of held-out behavioral predictions scored by a calibrated 5-judge LLM panel. We test it independently and in composition with a range of context conditions: full raw corpus, full extracted facts, and four commercial memory systems (Mem0, Letta, Supermemory, Zep). Across 14 public-domain autobiographical corpora, the Specification lifts representational accuracy in aggregate and nearly eliminates model hedging. It recovers most of what the raw corpus delivers, at ~25x less context cost. The Specification lifts subjects toward a common predictive level regardless of pretraining baseline; the lift in absolute points is therefore largest where the baseline is lowest, suggesting the population of relevance is anyone not adequately represented in pretraining. Lift is greatest on interpretation-required questions, where providing an interpretive layer enables model behavior that extracted facts or raw corpus do not. Conversely, on recall-required questions, this layer can interfere rather than help. We conclude that representational accuracy is distinct from recall and that human-AI alignment is dependent on how accurately the user is represented. 
Representational accuracy makes that alignment testable.","published_date":"2026-05-27T18:18:54+00:00","viability_score":6,"cluster_label":"Agents","has_code":true,"repo_url":"https://github.com/agulaya24/beyond-recall","commercial_flags":["has_code"],"one_liner":"This paper introduces Behavioral Specification as an interpretive layer for AI personalization, improving representational accuracy and reducing context cost for language models.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.28965v1","title":"Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes","abstract":"Linking free-text phenotype descriptions to ontology terms, typically referred to as phenotype annotation, is essential for the cross-study integration of comparative morphological data. This labor intensive process has heavily relied on highly trained human experts, which makes it challenging to scale and thus a key bottleneck. Dahdul et al. (2018) established a Gold Standard (GS) of Entity-Quality (EQ) annotations across seven phylogenetic studies and used it to evaluate three human curators and the Semantic CharaParser NLP tool with ontology-based semantic similarity metrics; they reported that machine-human consistency was significantly lower than inter-curator (human-human) consistency. Here we revisit that benchmark with five frontier hosted LLMs from Anthropic and OpenAI, each operating as an \"agentic curator\" within a self-contained workspace that supplies the source publication PDF, the same annotation guide used by the original human curators, the four project ontologies (UBERON, PATO, BSPO, GO), and a validation script. Evaluated against the same Gold Standard, every agent fell within the range of inter-curator variability of the three trained human biocurators of the original study; the best performing agents approached but did not reach the best performing human curator. 
Agents substantially outperformed Semantic CharaParser on all four metrics.","published_date":"2026-05-27T18:08:46+00:00","viability_score":7,"cluster_label":"LLM Agents for Data Curation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LLM agents automate phenotype annotation for biological data, matching human curator performance and significantly outperforming existing NLP tools.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28920v1","title":"Conf-Gen: Conformal Uncertainty Quantification for Generative Models","abstract":"Conformal prediction (CP) and its extension, conformal risk control (CRC), are established frameworks for quantifying uncertainty in supervised machine learning through formal guarantees. However, recent breakthroughs in artificial intelligence (AI) have been driven by unsupervised generative models, such as large language models (LLMs) and image generators, which are not directly compatible with CP or CRC. In this work we introduce conformal generation (Conf-Gen), a general framework adapting CRC to generative tasks while relaxing its theoretical assumptions. Conf-Gen unifies and generalizes previous attempts to apply CP to LLMs, and extends conformal methodology to entirely new domains. 
We demonstrate the flexibility of Conf-Gen through some novel applications, including obtaining conformal guarantees on: image generators producing non-memorized images, conversational AI systems having asked enough clarifying questions, and the output of AI agents being correct.","published_date":"2026-05-27T18:00:00+00:00","viability_score":6,"cluster_label":"Generative Model Uncertainty","has_code":true,"repo_url":"https://github.com/layer6ai-labs/conf-gen","commercial_flags":["has_code"],"one_liner":"Conf-Gen provides formal uncertainty guarantees for generative AI models, enabling reliable outputs for image generation, conversational AI, and AI agents.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.28812v1","title":"Beyond Binary: Sim-to-Real Dexterous Manipulation with Physics-Grounded Contact Representation","abstract":"A primary bottleneck in contact-rich manipulation is the difficulty of collecting real-world data. Sim-to-real reinforcement learning offers a scalable alternative, but the simulation-reality gap prevents information-dense modalities like touch from being effectively used. Existing sim-to-real methods often mitigate this gap by simplifying tactile data into coarse low-dimensional features -- sacrificing the richness required for complex manipulation. In this work, we introduce Center-of-Pressure (CoP), an effective tactile representation grounded in physical principles that preserves dense contact information while maintaining robustness for sim-to-real transfer. To support this representation, we propose a sensor calibration scheme based on differentiable dynamics, enabling the estimation of taxel orientations without requiring ground-truth force measurements. We evaluate CoP on two blind, challenging contact-rich manipulation tasks: peg-in-hole insertion and ball balancing. 
Across both tasks, policies conditioned on CoP achieve zero-shot sim-to-real transfer on a multi-fingered hand, and outperform both coarse binary-contact and raw-taxel baselines. Analysis of learned policy states further suggests that CoP-conditioned policies encode task-relevant physical properties, such as object mass, as an emergent byproduct of control.","published_date":"2026-05-27T17:59:02+00:00","viability_score":7,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new tactile representation for robots enables zero-shot sim-to-real transfer in complex manipulation tasks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28807v1","title":"Calibrating Conservatism for Scalable Oversight","abstract":"Agentic AI systems capable of autonomous planning and extended environmental interaction pose a fundamental control problem: how can humans maintain meaningful oversight of systems that may exceed their own capabilities? Existing approaches to scalable oversight rely on complex assumptions, remain largely heuristic, or lack practical methods for sequential settings with statistical guarantees. We introduce Calibrated Collective Oversight (CCO), which aggregates diverse auxiliary scoring functions into a penalty measuring deviation from a conservative baseline. Inspired by Attainable Utility Preservation, CCO enables collective conservatism: actions face a penalty proportional to overseer concern, so high-utility actions are still selected when overseers find them unobjectionable and overridden only when concern accumulates. CCO calibrates this conservatism online using Conformal Decision Theory, ensuring that undesirable outcomes remain below a user-specified target threshold with finite-time bounds and no distributional assumptions. 
On a modified version of SWE-bench, weaker overseers successfully constrain an adversarially misaligned stronger agent; on MACHIAVELLI, CCO substantially reduces ethical violations while preserving reward. In both settings, empirical violation rates closely match the specified targets, as predicted by the theory.","published_date":"2026-05-27T17:56:47+00:00","viability_score":3,"cluster_label":"AI Safety","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel framework for scalable oversight of autonomous AI systems using collective conservatism and conformal decision theory.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.28805v1","title":"OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration","abstract":"Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. 
OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.","published_date":"2026-05-27T17:56:04+00:00","viability_score":7,"cluster_label":"ML Model Verification","has_code":true,"repo_url":"https://github.com/Cominclip/OmniVerifier","commercial_flags":["has_code"],"one_liner":"OmniVerifier-M1 enhances multimodal model verification with symbolic meta-verification and reinforcement learning, enabling safer AI deployment.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.28792v1","title":"CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models","abstract":"Electroencephalography (EEG) is a critical, non-invasive method to monitor electrical brain activity. EEGs can span anywhere from a couple seconds to multiple hours, posing a major hurdle for existing deep learning methods due to two major factors: (1) existing EEG models are predominantly built upon the attention mechanism, incurring quadratic scaling as the sequence length increases, and (2) raw EEG signals must be processed in a sliding-window fashion due to fixed-length input requirements, preventing global understanding of the entire signal. To this extent, we propose CaMBRAIN - the first Causal, Mamba-based state space model (SSM) capable of real-time inference of EEG signals, arguing that bidirectional approaches are needlessly expensive given the causal, unidirectional nature of EEG. However, training such a model is non-trivial, as crucial EEG events can be extremely brief - within fractions of a second - yet separated by long intervals spanning minutes. 
Current EEG methods use self-supervised objectives that optimize for signal reconstruction, but these are not well suited for streaming SSMs; they fail to explicitly train the hidden state to retain the salient long-range context needed for streaming inference. We therefore introduce a multi-stage self-supervised training pipeline specifically tailored to encourage long-range memory retention and strong performance on EEG signals, while preserving the linear-time complexity of state space models. CaMBRAIN achieves state-of-the-art (SOTA) results across 3 different EEG datasets with >10x higher throughput than existing models, enabling the first model capable of long-range, continuous inference of variable-length EEG signals.","published_date":"2026-05-27T17:50:36+00:00","viability_score":7,"cluster_label":"Biomedical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A causal state space model for real-time, continuous EEG inference achieves SOTA results with over 10x higher throughput.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28791v1","title":"Skill-Conditioned Gated Self-Distillation for LLM Reasoning","abstract":"On-policy self-distillation (SD) improves LLM reasoning by using teacher-side privileged information (PI) to turn sparse verifier outcomes into dense token-level supervision. Existing methods usually assume trusted PI, such as reference answers or successful traces. We ask whether PI can instead come from an experience-derived skill bank, where retrieved skills are compact and reusable but may also be irrelevant or misleading. We propose Skill-Conditioned Gated Self-Distillation (SGSD), which formulates skill-based SD as teacher hypothesis validation rather than unconditional imitation. SGSD retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets all skill-conditioned teachers score the same plain-prompt student rollout. 
The verifier validates each teacher's polarity: supporting a success or suppressing a failure gives positive supervision, while the opposite stance is reversed. A robust gated objective then distills informative teacher-student disagreements while suppressing uncertain or extreme signals. Experiments on multiple mathematical reasoning benchmarks show that SGSD consistently improves over GRPO and remains competitive with answer-conditioned OPSD under a weaker PI assumption. For example, on Qwen3-1.7B, SGSD outperforms GRPO by 6.2% and OPSD by 1.7% on average on AIME24, AIME25, and HMMT25. Our code is available at https://github.com/walawalagoose/SGSD.","published_date":"2026-05-27T17:49:52+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":"https://github.com/walawalagoose/SGSD","commercial_flags":["has_code"],"one_liner":"A novel self-distillation method for LLM reasoning that leverages skill-based supervision to improve performance on mathematical benchmarks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28787v1","title":"Do Agents Need Semantic Metadata? A Comparative Study in Agentic Data Retrieval","abstract":"In the era of autonomous agents, machine-actionable data is critical for data-driven workflows. For more than a decade, semantic metadata like schema.org has anchored the FAIR principles (Findable, Accessible, Interoperable, and Reusable) for machine-actionable data and enabled discovery tools like Google Dataset Search. However, the rise of Large Language Models (LLMs) capable of navigating the unstructured web raises a fundamental question: Is semantic metadata still necessary for agentic data discovery, or can agents reliably retrieve actionable data directly from the web? 
We present a comparative analysis of agentic data retrieval across two distinct environments: a Baseline Agent searching billions of open-web documents, and a Semantic Agent leveraging a corpus of 90 million datasets using schema.org. We deploy an \"LLM-as-a-judge\" evaluation pipeline, mapped directly to the FAIR principles, to assess the semantic relevance, data accessibility, and computational utility of the retrieved data. Our results reveal a clear divergence. The Semantic Agent excels at retrieving actionable data, achieving a 44.9% higher precision for metadata-rich registries and a 46.6% higher precision for pages with machine-readable downloads among its returned results. Conversely, the Baseline Agent frequently suffers \"Last-Mile Utility\" failures, retrieving prose-heavy pages (20.1% of results) and portal landing pages (8.5%) rather than actual data pages. While the Baseline Agent achieves higher coverage by answering 40% more questions, the Semantic Agent delivers greater accuracy, achieving 65.7% higher overall precision in retrieving FAIR-compliant datasets. 
We conclude that while unstructured retrieval supports broad exploratory tasks, structured ecosystems remain the indispensable foundation for reliable, execution-oriented autonomous workflows.","published_date":"2026-05-27T17:46:43+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research compares agents' ability to retrieve actionable data with and without semantic metadata, demonstrating the continued necessity of structured data for reliable autonomous workflows.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28775v1","title":"Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents","abstract":"Computer-use agents (CUAs) have recently made substantial progress, but deploying a separate large expert for each software domain remains expensive. Small open computer-use agents are more practical specialization targets, but they remain substantially weaker and exhibit uneven domain-specific failures. A straightforward remedy is to synthesize large-scale training data for the target domain, yet we find that this naive approach yields only marginal improvements. Building on this observation, we introduce LearnWeak, an annotation-free specialization framework for small computer-use agents that uses a stronger reference agent to identify the student's weaknesses in the target domain, synthesize targeted tasks, and construct supervision automatically. LearnWeak further introduces an error-aware specialization objective that disentangles planning and execution errors, enabling more behaviorally precise updates than broad uniform supervision. On OSWorld, LearnWeak achieves average gains of 11.6 and 11.1 percentage points over EvoCUA-8B and OpenCUA-7B, respectively, across eight domains. 
We also validate that our student-aware dataset generation and training approaches outperform existing autonomous trajectory generation and training baselines. Our work highlights the importance of student awareness in both data synthesis and agent training, pointing toward a more principled and efficient path for specializing small computer-use agents in diverse domains.","published_date":"2026-05-27T17:37:00+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework that enables small computer-use agents to learn from their weaknesses by synthesizing targeted tasks and supervision, outperforming existing methods.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28773v1","title":"Rethinking Memory as Continuously Evolving Connectivity","abstract":"Existing memory-augmented LLM agents often treat memory as a static repository with pre-defined representations and fixed retrieval pipelines, which is brittle in dynamic agentic environments where feedback, task variation, and heterogeneous signals continuously reshape what should be remembered and how it should be connected. To address this, we propose FluxMem, a connectivity-evolving memory framework that models memory as a heterogeneous graph and progressively refines its topology through three stages: initial connection formation, feedback-driven refinement, and long-term consolidation. During execution, FluxMem repairs missing links, prunes interference, aligns abstraction granularity, and distills recurrent successful trajectories into reusable procedural circuits, guided by one metric for memory generalizability and evolutionary maturity. Across three fundamentally distinct benchmarks including LoCoMo, Mind2Web, and GAIA, FluxMem achieves consistent state-of-the-art performance, demonstrating strong adaptation and generalization in complex agentic environments. 
The code will be open-sourced at https://github.com/zjunlp/LightMem.","published_date":"2026-05-27T17:35:34+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":"https://github.com/zjunlp/LightMem","commercial_flags":["has_code"],"one_liner":"FluxMem is a memory framework for LLM agents that models memory as a continuously evolving graph, improving adaptation and generalization in complex environments.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28764v1","title":"SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks","abstract":"Vast quantities of compute (GPU cycles on personal workstations, idle inference servers, and edge devices between jobs) go unused because no incentive-aligned protocol exists for their owners to share them safely and profitably. Existing approaches either require a trusted central coordinator (cloud marketplaces), demand heavy blockchain infrastructure (Golem, BrokerChain), or lack an incentive layer entirely (BOINC, Petals). We propose SwarmHarness, a decentralised protocol in which HarnessAPI skill nodes self-organise into a compute swarm without any central authority. SwarmHarness has three interlocking components: a SwarmRegistry built on a Distributed Hash Table (DHT) for peer discovery and capability advertisement; a SwarmRouter that dispatches tasks to nodes using a utility function over capability, load, latency, and trust; and SwarmCredit, an incentive mechanism that attributes compute-credit rewards to contributing nodes via a Shapley-value approximation. Nodes earn credits by serving tasks and spend credits to submit them; idle nodes that never contribute drain credits and lose routing priority, creating a self-regulating participation economy. As nodes specialise toward high-reward skills and routing signals act as digital pheromones, the network exhibits emergent collective intelligence analogous to biological swarms. 
Beyond compute sharing, SwarmHarness is a foundational primitive for autonomous distributed AI agent networks in which agents hire compute, route subtasks, and settle credits without human intermediation.","published_date":"2026-05-27T17:23:00+00:00","viability_score":3,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A decentralized protocol for incentivized sharing of idle compute resources among AI agents.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.28763v1","title":"CubePart: An Open-Vocabulary Part-Controllable 3D Generator","abstract":"Interactive 3D assets used in games and simulation are typically decomposed into specific semantic parts to support animation, physics, and scripted behaviors, yet most generative 3D models produce either monolithic meshes or arbitrary part decompositions that cannot be aligned with application-specific requirements. We present CubePart, a generative framework for open-vocabulary, part-controllable 3D mesh generation that exposes part structure as an explicit inference-time control signal. Given a global text prompt and a user-defined parts schema expressed as an open-ended list of part names, our method generates a set of meshes - one per schema element - that assemble into a coherent object while respecting the specified semantic structure. To enable this capability, we introduce a scalable data pipeline to construct a large open-vocabulary, part-labeled 3D dataset, along with a two-stage generative architecture that separates global shape synthesis from part-level decoding. We demonstrate that the resulting assets can be directly integrated into game engines and driven by animation and behavior scripts without manual post-processing. 
Project Page: https://cubepart.github.io/","published_date":"2026-05-27T17:22:38+00:00","viability_score":7,"cluster_label":"Generative 3D","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A generative framework for creating 3D assets with explicit, open-vocabulary control over semantic parts.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28751v1","title":"Extrapolative Weight Averaging Reveals Correctness-Efficiency Frontiers in Code RL","abstract":"Linear interpolation between fine-tuned checkpoints has been shown to trace the Pareto front between competing objectives, but whether extrapolative weight averaging can extend such frontiers to new checkpoints useful at inference time, without additional RL training, remains unclear. We study this question in RL for competitive programming, where hidden unit tests under time and memory limits enforce both functional correctness and computational efficiency. Starting from a shared initialization, we train checkpoints under nested unit-test coverage: low-coverage rewards require passing smaller-input tests, while high-coverage rewards require passing progressively larger tests up to the full suite. This sweep reveals the emergence of a correctness-efficiency frontier: on hard problems, higher-coverage reward reduces optimization failures but increases correctness failures, leaving solve rate nearly unchanged. Interpolation between low- and high-coverage checkpoints recovers this frontier, while extrapolation extends it beyond the trained endpoints. Both the frontier and its extrapolative continuation appear across three inference settings, pure reasoning, tool use, and agentic coding, and across two model scales, 32B and 7B. At the problem level, moving along the frontier changes which problems are solved, making extrapolated checkpoints complementary policies in inference-time scaling. 
Ensembles with extrapolative weight averaging broaden coverage and improve pass@250 on LCB/hard by 3.3% over the best single checkpoint at matched sample budget. These results show that nested unit-test coverage in code RL induces a frontier that extrapolative weight averaging can navigate, extend, and exploit.","published_date":"2026-05-27T17:09:30+00:00","viability_score":3,"cluster_label":"Code RL","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Extrapolative weight averaging to improve correctness-efficiency frontiers in code reinforcement learning.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.28746v1","title":"Preference-Shaped Expected Hypervolume and R2 Improvement: Exact Computation and Monotonicity","abstract":"This paper studies preference-shaped expected improvement criteria for Bayesian multiobjective optimization. We consider two indicator families which are often used for similar algorithmic purposes, but which are geometrically different. The hypervolume indicator is based on a dystopian reference point and measures dominated volume in objective space. The R2 indicator is based on a utopian point and evaluates approximation sets through weighted Tchebycheff scalarization envelopes. The purpose of the paper is to make precise which preference transformations preserve exact computation, Pareto compatibility, and monotonicity properties, and which transformations change the underlying geometry. On the hypervolume side, we revisit canonical EHVI through the Deng representation, formulate product-density weighted EHVI in desirability coordinates, discuss cone-based EHVI as ordinary EHVI after a linear cone transformation, and separate these cases from truncated EHVI, where variance monotonicity may fail. On the R2 side, we prove that exact integral R2 improvement is not, in general, an ordinary objective-space weighted hypervolume. 
The obstruction is lower-dimensional: Lebesgue-density hypervolume cannot see certain boundary contributions that Tchebycheff scalarizations still detect. We then show that exact integral R2 improvement is exactly a scalarization-space volume, namely the measure of the Tchebycheff shadow between the incumbent scalarization envelope and the reference envelope. This representation yields finite-sum ER2I algorithms for discrete R2, quadrature methods for exact integral R2, and an achievement-space Gaussian surrogate formulation in which ER2I is an integral of scalar Gaussian expected improvements.","published_date":"2026-05-27T17:02:28+00:00","viability_score":0,"cluster_label":"Bayesian Optimization","has_code":true,"repo_url":"https://github.com/kourgeorge/arxiv-style","commercial_flags":["has_code"],"one_liner":"Theoretical analysis of preference-shaped expected improvement criteria for multiobjective optimization.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.28742v1","title":"CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning","abstract":"Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. 
Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.","published_date":"2026-05-27T17:01:50+00:00","viability_score":5,"cluster_label":"Reasoning Improvement","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CORE is a novel algorithm that enhances reasoning in language models using past reasoning traces for efficient self-improvement.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.28740v1","title":"Reverse Probing: Supervised Token-level Uncertainty Quantification for Large Language Models in Clinical Text","abstract":"As large language models are increasingly deployed for clinical text, ensuring they can reliably signal their own uncertainty becomes critical. Most existing uncertainty quantification (UQ) methods are designed for open-domain generation and cannot localize uncertainty at the token or span level in long clinical text. We propose Reverse Probing, the first UQ framework specialized for clinical summarization, which estimates token-level uncertainty directly from pre-existing labeled summaries. Rather than sampling new outputs, Reverse Probing treats the text as a probe into the model's internal state, extracting uncertainty signals from four categories of internal activations. 
We evaluate on two expert-annotated clinical datasets and outperform eight adapted baselines on all metrics, achieving up to 4 times higher AUPRC while reducing inference time and computational costs. Feature analysis reveals that delta energy and neighborhood context are the most consistent predictors across all models. This study offers interpretable insights into how models internally respond to unsupported clinical content.","published_date":"2026-05-27T17:01:04+00:00","viability_score":6,"cluster_label":"Uncertainty Quantification","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Reverse Probing is a specialized framework for token-level uncertainty quantification in clinical text, enhancing reliability in medical applications.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.28739v1","title":"BIRDNet: Mining and Encoding Boolean Implication Knowledge Graphs as Interpretable Deep Neural Networks","abstract":"Tabular data in knowledge-rich domains often carries a latent prior in the form of Boolean implication relationships (BIRs) between pairs of features. We mine such relationships with a sparse-exception binomial test. The mined implications form a typed directed graph, equivalent to a propositional rule base of 2-literal clauses. We encode this graph as the connectivity of a layered neural network, called BIRDNet, in which each hidden unit corresponds to one mined rule and binds only to its two features. We show two consequences of this design: First, the architecture is sparse by construction: at most $2/d$ of the weights in each BIR layer are active, where $d$ is the input dimension. Second, the model is interpretable: every trained unit keeps a stable symbolic identity, so rules can be read off the network without surrogate models. Unlike most neurosymbolic models, BIRDNet does not consume an external rule base; its structural prior is mined from the data. 
We evaluate BIRDNet on six transcriptomic and proteomic benchmarks. Our results show that BIRDNet stays within 0.02 AUROC of the strongest dense baseline, at a small accuracy cost, while using up to $96\\times$ fewer active parameters than an architecture-matched dense MLP. First-layer rules recover known biological signatures across multiple cancer subtypes and tissue types, including canonical amplicons, lineage-defining co-expression modules, and immune-infiltration markers. Data and code are available at: https://github.com/MAHI-Group/BIRDNet.","published_date":"2026-05-27T16:59:01+00:00","viability_score":5,"cluster_label":"Knowledge Graphs","has_code":true,"repo_url":"https://github.com/MAHI-Group/BIRDNet","commercial_flags":["has_code"],"one_liner":"BIRDNet encodes Boolean implication knowledge graphs into interpretable neural networks, enhancing model interpretability in knowledge-rich domains.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.28733v1","title":"Utility-Aware Multimodal Contrastive Learning for Product Image Generation","abstract":"Product images strongly influence consumer decision-making in online marketplaces. Empowered by multimodal contrastive learning, generative AI can output images that closely align with text prompts. Yet existing generative AI models do not directly optimize marketplace performance. This is a critical gap, since semantic alignment alone does not guarantee that an image will sell. To address this limitation, we propose a \\textit{utility-aware multimodal contrastive learning} framework that incorporates consumer demand into a novel Utility-Aware InfoNCE loss. Optimizing this utility-aware objective guides generation toward images that are both semantically coherent and demand-enhancing. This effect arises directly from a shift in the learned image-text representation space toward demand-driven visual cues, which we also validate through the theoretical bound of the proposed objective. 
In downstream applications on Amazon and Airbnb, product images generated and edited by our method outperform state-of-the-art models in increasing demand and preserving fidelity, while maintaining text-image consistency. Notably, our utility-aware framework preserves inverse U-shaped demand patterns for attributes such as aesthetics and uniqueness, improving demand-based performance while preserving fidelity and semantic consistency. Human-subject experiments further validate its commercial effectiveness. As generative AI technology continues to evolve, our utility-aware component can be flexibly embedded into emerging generative models to improve direct commercial use.","published_date":"2026-05-27T16:54:51+00:00","viability_score":3,"cluster_label":"Generative AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Utility-aware multimodal contrastive learning aims to enhance product image generation for better consumer demand in online marketplaces.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.28732v1","title":"MemTrace: Tracing and Attributing Errors in Large Language Model Memory Systems","abstract":"Memory is essential for enabling large language models to support long-horizon reasoning, yet existing memory systems remain unreliable and difficult to debug. Tracing memory's dynamic evolution is crucial to understand how information is synthesized, propagated, or corrupted over time. In this work, we study the new problem of error tracing and attribution in LLM memory systems. We propose a novel framework that transforms memory pipelines into executable memory evolution graphs, enabling fine-grained tracing of operational information flow. We then construct MemTraceBench, a benchmark collected from representative memory systems such as Long-Context, RAG, Mem0, and EverMemOS, to systematically study memory failure modes. 
We further introduce an automatic attribution method that iteratively traces operation subgraphs to pinpoint the root cause of any failed case. Our analysis reveals that memory failures are systematic, stemming from operation-level issues like information loss and retrieval misalignment. Crucially, we leverage these fine-grained attribution signals to guide downstream prompt optimization, establishing a closed-loop system that automatically corrects faults and boosts end-task performance by up to 7.62%. Code will be released at https://github.com/zjunlp/MemTrace.","published_date":"2026-05-27T16:53:53+00:00","viability_score":7,"cluster_label":"LLM Debugging","has_code":true,"repo_url":"https://github.com/zjunlp/MemTrace","commercial_flags":["has_code"],"one_liner":"A framework to trace and attribute errors in LLM memory systems, enabling automatic prompt optimization for improved end-task performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28730v1","title":"AlphaTransit: Learning to Design City-scale Transit Routes","abstract":"Designing a transit network requires many sequential route extension decisions, but their quality is often visible only after the full network is assembled. This delayed-feedback challenge lies at the heart of the Transit Route Network Design Problem (TRNDP), where route interactions can be deceptive: an extension that appears useful locally can create transfer bottlenecks, produce redundant overlap, or reduce overall throughput. To guide route construction under delayed simulator feedback, we introduce AlphaTransit, a search-based planning framework for city-scale bus network design. AlphaTransit couples Monte Carlo Tree Search (MCTS) with a neural policy-value network: the policy proposes route extensions, the value estimates downstream design quality, and search uses these predictions to refine each decision. 
This provides decision-time lookahead during route construction without running simulator rollouts inside the search tree. We evaluate AlphaTransit on a new Bloomington TRNDP benchmark with realistic road topology and census-derived demand, under mixed and full transit demand settings. In the Bloomington network, AlphaTransit attains the highest service rate in both demand settings, reaching 54.6% and 82.1%, respectively. Relative to reinforcement learning without search, these correspond to 9.9% and 11.4% service rate gains; relative to MCTS without learned guidance, they correspond to 2.5% and 11.2% gains. These results suggest that coupling learned guidance with MCTS is more effective than using either approach alone for transit network design. Our code and data are publicly available at https://github.com/poudel-bibek/AlphaTransit.","published_date":"2026-05-27T16:48:55+00:00","viability_score":7,"cluster_label":"AI for Urban Planning","has_code":true,"repo_url":"https://github.com/poudel-bibek/AlphaTransit","commercial_flags":["has_code"],"one_liner":"An AI framework that learns to design city-scale transit routes by coupling Monte Carlo Tree Search with a neural policy-value network.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28722v1","title":"Multi-Adapter Representation Interventions via Energy Calibration","abstract":"Representation intervention has emerged as a promising paradigm for aligning large language models toward desired behaviors without modifying model weights. Existing methods typically apply a fixed intervention uniformly across all inputs. However, we find that the appropriate intervention direction and strength vary substantially across samples, and such indiscriminate intervention leads to degradation of general capabilities on benign inputs. To address these challenges, we propose Multi-Adapter Representation Interventions via Energy Calibration (MARI). 
Specifically, we introduce a competitive multi-adapter mechanism in which specialized experts capture non-linear correction patterns and adaptively determine the appropriate intervention direction and strength for different samples. Furthermore, we design an energy-based gating module that leverages internal propagation dynamics to distinguish inputs that are applicable for intervention. Extensive experiments across diverse model families and parameter scales demonstrate that MARI achieves state-of-the-art alignment performance. Our method significantly improves performance on TruthfulQA, BBQ, and safety benchmarks, while maintaining and even improving general capabilities on tasks such as MMLU and ARC. Our code is available at https://github.com/V1centNevwake/MARI.","published_date":"2026-05-27T16:39:58+00:00","viability_score":7,"cluster_label":"LLM Alignment","has_code":true,"repo_url":"https://github.com/V1centNevwake/MARI","commercial_flags":["has_code"],"one_liner":"A novel method for LLM alignment that uses adaptive multi-adapter interventions to improve safety and truthfulness without degrading general capabilities.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28721v1","title":"LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?","abstract":"Are LLM-based search agents genuinely searching, or using the web to verify what they already know? We study this question on BrowseComp with three diagnostics. Our analysis reveals Intrinsic Knowledge Dependence (IKD): even with tool access, agents often rely on intrinsic knowledge -- information encoded in the model before retrieval -- rather than on external evidence. Agents answer up to 44.5% of BrowseComp questions without tools, generate more than half of their search queries from internally produced hypotheses rather than retrieved leads, and perform worse than closed-book baselines when answer-supporting evidence is removed. 
These results suggest that static search benchmarks can reward memory-backed verification rather than evidence-driven discovery, conflating what agents already know with what they can find. We then introduce LiveBrowseComp, a deep-search benchmark designed to evaluate agents beyond intrinsic coverage. It contains 335 human-authored questions whose answers depend on facts published within the 90 days preceding benchmark construction, drawn from six updated sources and filtered to exclude globally salient events. On LiveBrowseComp, all evaluated agents fall below 2% closed-book accuracy, search-augmented scores drop by 25-40 points relative to BrowseComp, and prior model rankings no longer reliably predict performance. LiveBrowseComp is available at https://huggingface.co/datasets/Forival/LiveBrowseComp.","published_date":"2026-05-27T16:39:57+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark and analysis revealing LLM search agents' reliance on intrinsic knowledge, and a deep-search benchmark to evaluate true web discovery.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28717v1","title":"OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol","abstract":"Modern datacenter RDMA is bottlenecked at the network interface, not the wire. A NIC running RoCE or InfiniBand holds per-connection state for every (application, remote-endpoint) pair - hundreds of megabytes at 1024-application fanout - and pays a four-traversal PCIe round trip on a 64-byte operation, inflating latency an order of magnitude beyond the wire. Both follow from the Queue Pair over PCIe abstraction RDMA inherits from InfiniBand.   
Huawei's Unified Bus (UB), a public 2025 specification, changes the abstraction: it decouples per-application endpoint state from per-host transport state so connection context grows additively, exposes ordering as opt-in, and reaches remote memory through native CPU load/store to an on-chip-bus controller. UB ships in Huawei's closed Ascend 950 silicon.   OpenURMA is the first clean-room open implementation of UB's transport and transaction layers, realised at three tiers - synthesisable RTL on Alveo U50, a cycle-level two-node SystemC simulator, and a gem5 full-system scaffold - each with a matched OpenRoCE (RoCEv2 RC) baseline. The contribution is the implementation, harness, and controlled comparison closed silicon does not admit. On the canonical 64-byte remote fetch - LOAD on UB-spec Sec.8.3, READ on RoCEv2 RC - UB's load/store path delivers ~500 ns end-to-end, 4.37x below the matched baseline (2186 ns), sustains 2.80x higher throughput, and fits in ~14% of a U50's LUTs.","published_date":"2026-05-27T16:38:57+00:00","viability_score":1,"cluster_label":"Hardware Acceleration","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An open-source implementation of a novel bus protocol for datacenter RDMA aims to significantly reduce latency and improve throughput compared to existing solutions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.28714v1","title":"IPO-Mine: A Toolkit and Dataset for Section-Structured Analysis of Long, Multimodal IPO Documents","abstract":"An Initial Public Offering (IPO) filing is a document released when a private firm goes public, allowing individual (retail) investors to purchase its shares. These filings describe a firm's business, financials, and risks and are long, multimodal documents with narrative text and images. Despite their importance to financial markets, there is no large-scale, standardized dataset or benchmark for studying IPO filings with modern language and multimodal models. 
These documents pose significant challenges: filings frequently exceed 500,000 tokens and lack consistent structural organization. We introduce the IPO-Toolkit, an open-source framework for downloading and parsing IPO filings into standardized section-structured text and extracted images. The toolkit segments filings, extracts embedded images, and produces structured outputs that enable large-scale, reproducible analysis workflows over long, multimodal documents. Using this infrastructure, we construct the IPO-Dataset, a large, section-structured, multimodal dataset covering more than 109,000 IPO filings and amendments from 1994 to 2026 and containing over 76,000 images. We establish structured evaluation tasks over extracted financial charts, including chart quality and misleadingness assessment. Our experiments show that state-of-the-art multimodal models often diverge from expert human judgments on these tasks, exposing alignment challenges in multimodal reasoning over long, real-world regulatory documents. Beyond benchmarking, the IPO-Dataset enables large-scale analysis of section-level textual variation and cross-industry differences in visual and textual disclosure practices. Our code, dataset, and website are publicly available under CC-BY-4.0.","published_date":"2026-05-27T16:36:39+00:00","viability_score":7,"cluster_label":"Document Processing","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A toolkit for structured analysis of IPO documents using advanced NLP and image processing.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.28713v1","title":"Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor","abstract":"Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. 
While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities of LLMs underexplored. In contrast, this work reveals that a thinking model itself can naturally compress long contexts by organizing task-relevant information. We thus derive Thinking as Compression (TaC), a new compression paradigm that treats thinking itself as compressed context. Without relying on a dedicated compressor, TaC directly prompts the thinking model to generate thinking traces as the shortened context, already outperforming most representative compression methods. Further, given that raw thinking output may struggle with budget control and shortcut behaviors, we introduce Thinking as Compression Constrained (TaC-C), leveraging a simple reward-driven optimization framework to elicit intrinsic thinking as compact and controllable compressed context. Experiments across four long-context QA benchmarks demonstrate that TaC-C consistently outperforms existing baselines. At 4x and 8x compression ratios, it surpasses the strongest competitor by 17.4% and 23.4% in average F1, and by 15.7% and 21.7% in average Exact Match Score (EM), respectively.","published_date":"2026-05-27T16:36:01+00:00","viability_score":7,"cluster_label":"LLM Inference Optimization","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel 'Thinking as Compression' paradigm leverages LLM's intrinsic reasoning to compress long contexts, outperforming existing methods without dedicated compression modules.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.28710v1","title":"Towards Reliable Multilingual LLMs-as-a-Judge: An Empirical Study","abstract":"Large language models (LLMs) are increasingly used for the automatic evaluation of generated text, yet most prior work focuses on English. 
Despite the growing demand for multilingual evaluation, extending LLM-based evaluators to multilingual settings remains challenging, particularly for low-resource languages and scenarios where in-domain data is scarce. This work explores several strategies for developing multilingual LLMs-as-a-judge, considering whether in-domain data is available for fine-tuning or not. We systematically analyze English, Spanish, and Basque, representing high-, mid-, and low-resource languages, considering instruction translation, monolingual versus multilingual supervision, and model size. For evaluation, we extend two existing meta-evaluation datasets to Basque and Spanish. Our results reveal key trade-offs: When in-domain data is available, fine-tuned smaller models can achieve performance comparable to proprietary models, whereas zero-shot evaluation with larger models proves more effective in out-of-domain settings. We also observe that fine-tuning on out-of-domain data can adversely affect model performance. These findings provide practical guidance for building efficient, reliable multilingual evaluation pipelines. The data and code are publicly available at hitz-zentroa/mJudge.","published_date":"2026-05-27T16:33:58+00:00","viability_score":7,"cluster_label":"Multilingual LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An empirical study provides practical guidance for building reliable multilingual LLM-as-a-judge systems, showing fine-tuning smaller models is effective with in-domain data.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.28707v1","title":"Beyond Binary Moral Judgment: Modeling Ethical Pluralism in AI","abstract":"Critical decision-making in socially consequential spaces is increasingly involving AI systems at varying capacities. Yet, despite the ubiquity of autonomous systems, most approaches to handling autonomous moral decision-making resort to scalar or binary judgments. 
These methods are insufficient for acceptable moral reasoning, as they provide little explanation, leaving out imperative contextual and theoretical information that must be included to support accountability. For this, we propose a framework to model moral reasoning as a distribution over normative ethical theories or ethical pluralism. We introduce a normative ethics simplex that integrates these theories. A benchmark of 450 cases across 15 fine-grained subtheories was also prepared for the purposes of stacked ensemble learning. These cases describe ethical dilemmas in natural language and have associated extracted contextual features. The implementation of the simplex was achieved via a two-stream normative-semantic architecture. This is followed by the fusion of normative information and a sequential, stacking ensemble to learn the best fit of the three broad theories: consequentialism, virtue ethics, and deontology, and the 15 subcategories. Our experiments demonstrate that the integration of contextual and normative priors with the semantic embeddings significantly improves the performance of the classification, displaying an accuracy of 88.89%. We conducted ablation studies to show that structured ethical representations contribute beyond analogical reasoning, and the chosen stacking architecture gives the best results due to the gradual learning of granularity. Ethical pluralism is also analyzed through entropy, confidence, and visualization. 
Thus, modeling ethical pluralism as a probabilistic normative distribution supports human-like moral reasoning, ethical disagreement analysis, and future alignment in AI systems.","published_date":"2026-05-27T16:33:06+00:00","viability_score":7,"cluster_label":"AI Ethics","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for AI to model ethical pluralism by representing moral reasoning as a distribution over normative ethical theories, improving accountability and human-like reasoning.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28703v1","title":"A Fresh Look at Lamarckian Evolution and the Baldwin Effect","abstract":"Baldwinian and Lamarckian evolution have existed for a long time in evolutionary algorithms (EAs) without ever dominating the academic literature or practical applications. In this work, we use modern empirical and theoretical methods to revisit Lamarckian and Baldwinian evolution and rigorously compare them with the generic Darwinian evolution. On the empirical side, we run a comprehensive suite of experiments on graphs from six different datasets from the recent GraphBench benchmark on Maximum Independent Set and Maximum Cut problems. Our results show that Baldwinian and Lamarckian evolution consistently outperform Darwinian evolution, confirming the great potential of local search augmented evolutionary algorithms. Notably, in the great majority of cases, all EAs outperform recent deep learning baselines and approach the performance of highly specialised heuristic and exact solvers. We furthermore report a high-performing set of generalist parameters for all studied evolution types that we hope will be of use to practitioners in future. On the theoretical side, we extend the existing Deceptive Leading Block benchmark to arbitrary block length and use tools from modern theoretical runtime analysis to prove upper and lower bounds on the expected runtime. 
For block lengths greater than two, Baldwinian evolution is asymptotically faster than Lamarckian which is asymptotically faster than Darwinian evolution. When accounting for the cost of the local search procedure in fitness evaluations, the ordering depends on the implementation with Baldwinian evolution staying fastest from small block lengths onwards, explaining its strong empirical performance.","published_date":"2026-05-27T16:30:39+00:00","viability_score":6,"cluster_label":"Evolutionary Algorithms","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Revisiting Lamarckian and Baldwinian evolution in algorithms, demonstrating consistent outperformance over Darwinian evolution and deep learning baselines on graph problems.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.28700v1","title":"The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic","abstract":"The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning capabilities. We argue that this conclusion rests on shaky statistical ground. Re-evaluating 20 open-weight models using Generalised Linear Mixed Models with per-question random effects, we find that only half exhibit statistically significant performance changes under the original prompt format. Moreover, we identify a previously unacknowledged factor: the main GSM-Symbolic dataset contains a systematically shifted distribution of larger integers in problem texts relative to GSM-Base (K-S statistic = 0.12, p < 0.001), contradicting the original authors' claims. Controlling for this large number effect accounts for significance in roughly half the remaining cases. 
Among models with statistically significant performance deltas, we identify distinct, model-specific failure profiles - including fragility of variable binding, arithmetic limitations, and dual-task interference - underscoring that blanket claims about LLM reasoning are both statistically premature and mechanistically misleading.","published_date":"2026-05-27T16:25:31+00:00","viability_score":4,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Critically re-evaluates the GSM-Symbolic benchmark, arguing that performance drops in LLMs are statistically questionable and influenced by dataset biases, not just reasoning limitations.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2605.28699v1","title":"TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning","abstract":"Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to multi-turn multi-agent systems faces following dilemmas: i) Sparse rewards, role-level free-riding and excessive training overhead. ii) Agents only imitate to collaborate. iii) Fixed collaboration protocol falls into oscillating local optimum. We introduce TRACER, a turn-level reinforcement framework for cooperative multi-LLM reasoning. TRACER separates collaborative decision making into a controller-regret layer, where controllers learn whether the agents should speak or skip the current round through regret matching, and a generation-credit layer, which optimizes proposer and reviewer utterances with role-specific GSPO rewards. This design i) assigns credit at the level of both action modes and generated utterances, thus avoiding free-riding and sparse rewards. 
We only expand the choices made by the controllers, thus greatly reducing computational cost of training. Moreover, ii) agents acquire collaborative capability as they learn when to utter and what to speak. Finally, iii) by designing binary actions ingeniously, we extend classical game theory established for finite action spaces to deep learning, thus achieving mathematically rigorous convergence. We train all local RL-style methods on the GSM8K training split and evaluate on held-out GSM8K, MATH500, and GPQA-Diamond to measure in-domain accuracy, cross-benchmark generalization, inference cost, and correction-preservation behavior. The resulting framework provides a compact and reproducible testbed for studying learned collaboration policies beyond fixed debate, voting, or aggregation protocols. Code is available at https://github.com/Shark-Forest/TRACER.","published_date":"2026-05-27T16:25:21+00:00","viability_score":7,"cluster_label":"Multi-LLM Reasoning","has_code":true,"repo_url":"https://github.com/Shark-Forest/TRACER","commercial_flags":["has_code"],"one_liner":"TRACER framework combines reinforcement learning and multi-agent prompting for cooperative LLM reasoning, using turn-level regret matching to improve collaboration and reduce training costs.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28697v1","title":"Deep Learning Strain Estimation: Is Physics-Based Simulation the Solution?","abstract":"Speckle tracking echocardiography (STE) is the clinical standard for myocardial strain estimation. Despite good performance on global strain (GLS), its accuracy for regional strain remains limited, even though this biomarker is highly relevant for early diagnosis and the characterization of subtle abnormalities. from clinical data. Deep learning is a promising alternative, but its development is constrained by the lack of reliable motion references. 
Existing solutions rely either on STE-derived labels or on simulations generated by physics-based models, but these synthetic sequences still have limited realism compared with clinical data. In this paper, we propose a novel simulation strategy that incorporates speckle decorrelation measures from real videos and uses an iterative refinement process to improve the motion realism in the simulations. We created an open-source photorealistic dataset of 1,478 videos with reference motion, which was used to train an echocardiographic motion estimation algorithm. The proposed method achieves unmatched performance on global and regional strain, notably reaching a GLS variability of 1.42% in an inter-expert setting compared to 1.78% for the clinical reference.","published_date":"2026-05-27T16:24:05+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel simulation strategy and open-source dataset for training more accurate echocardiographic motion estimation algorithms, improving regional strain analysis for early disease diagnosis.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28693v1","title":"Misalignment Between Backpropagation and the Hierarchy of Brain Responses to Images","abstract":"Backpropagation is the core learning mechanism underlying deep learning. However, whether and how this algorithm is implemented in the brain remains highly debated. In particular, while forward activations of pretrained models reliably map onto the cortical hierarchy of visual processing, it is unknown whether backpropagated gradients exhibit a similar correspondence. Here, we address this question using functional magnetic resonance imaging (fMRI) and magnetoencephalography (MEG) recordings of human brain responses to natural images. For this, we extend standard encoding analyses of forward activations to map backpropagated gradients onto neural data. 
Focusing on a recent self-supervised vision model (DINOv3) and reproducing results on eight vision models, we find that backpropagated gradients can reliably predict both fMRI and MEG signals, specifically in higher-level visual cortex and for later latencies. However, the spatial and temporal organization of these backpropagated gradients in the brain diverges from the patterns expected under a biologically plausible backpropagation mechanism: specifically, both the order in which gradients are computed and their spatial organization diverge from the temporal and spatial hierarchies of the human brain. Together, these results suggest that, although deep networks and the brain may share similar representational content, they likely rely on fundamentally different mechanisms to learn those representations.","published_date":"2026-05-27T16:20:31+00:00","viability_score":1,"cluster_label":"AI Research","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigating the misalignment between backpropagation in deep learning and the brain's visual processing hierarchy using fMRI and MEG data.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.28683v1","title":"VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora","abstract":"Existing benchmarks have laid the foundation for travel planning agents by establishing API-centric paradigms. However, as the capabilities of Autonomous Agents continue to advance, their evaluation must evolve beyond simple tool execution toward handling the inherent complexities of the open web. Current benchmarks bypass core cognitive hurdles: they fail to account for information noise, ignore multi-source factual contradictions, and overlook the necessity of grounding visual perception into logical planning. We introduce VeriTrip, a verifiable benchmark designed to meet the increasing demands for agent robustness and reliability. 
VeriTrip shifts the evaluation focus to evidence-grounded reasoning over unstructured multimodal web corpora. It establishes a Multimodal Retrieval Base (MRB) derived from real-world sources, forcing agents to autonomously orchestrate queries across heterogeneous data. A synchronized Verifiable Knowledge Base (VKB) enables a cell-wise verification protocol that precisely quantifies factual reliability, distinguishing systematic reasoning failures from parametric hallucinations. Our evaluations across leading MLLMs reveal a critical \\textit{retrieval-reasoning trade-off}: the cognitive load of autonomous retrieval significantly erodes instruction retention. VeriTrip provides the rigorous foundation necessary for the next generation of planning agents capable of operating in unconstrained, multimodal environments.","published_date":"2026-05-27T16:14:47+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"VeriTrip is a verifiable benchmark for evaluating travel planning agents on unstructured web data, focusing on evidence-grounded reasoning and factual reliability.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.28680v1","title":"AI in the Workplace: The Impact of AI on Perceived Job Decency and Meaningfulness","abstract":"The proliferation of Artificial Intelligence (AI) in workplaces is transforming how we work. While existing research on human-AI collaboration at work often prioritizes performance, less is known about their experiential outcomes. Through interviews with 24 employees across Information Technology (IT), service-based, and healthcare sectors, this paper examines AI's impact on job satisfaction via perceptions of job decency and meaningfulness, now and in the future. 
Our results reveal that the anticipated impact of AI on overall job satisfaction varies with the occupational domain, with differing perceptions of its underlying decency and meaningfulness. For instance, IT and healthcare anticipate increased satisfaction with decency aspects like working hours but decreased satisfaction with meaningfulness aspects like social image due to misconceptions about AI handling most of their tasks. Conversely, service workers foresee no improvement in their working hours but a higher social standing due to the perceived status boost associated with working with AI.","published_date":"2026-05-27T16:13:41+00:00","viability_score":0,"cluster_label":"Workplace AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Examining the impact of AI on perceived job decency and meaningfulness across IT, service, and healthcare sectors through employee interviews.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.28678v1","title":"DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution","abstract":"Speculative reasoning has recently been proposed as a means to accelerate reasoning-intensive generation in large multimodal models, but its effectiveness is often constrained by misalignment between speculative drafts and target-verified reasoning. In this work, we introduce DREAM-R, a framework that substantially improves the performance of speculative reasoning. At its core, DREAM-R employs Speculative Alignment Policy Optimization (SAPO), a reinforcement-learning objective that trains draft models to generate reasoning steps that are both faithful to target trajectories and concise. We further propose a Threshold-based Verification Mechanism (TBVM) that uses a ratio-based criterion to provide stable and interpretable acceptance of speculative steps only when positive evidence clearly dominates, thereby preventing error propagation. 
Building on these components, we develop a Fully Parallel Speculative Reasoning (FPSR) framework that parallelizes draft generation, target-side reasoning, and verification across multi-step reasoning, enabling early stopping and clean fallback. Experiments on reasoning-heavy benchmarks demonstrate up to speedup while preserving target-model accuracy, yielding substantial efficiency gains without compromising reasoning quality.","published_date":"2026-05-27T16:11:10+00:00","viability_score":7,"cluster_label":"LLM Reasoning Acceleration","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for accelerating multimodal reasoning in large models using reinforcement learning for draft refinement and parallel verification, achieving speedups without accuracy loss.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28669v1","title":"Sense Representations Are Inducible Interfaces","abstract":"Sense representations (explicit, per-token meaning decompositions) are useful for disambiguation, steering, and cross-lingual alignment, but existing approaches require models to be pretrained with sense structure baked in. We introduce ACROS, which induces an explicit sense pathway into a frozen pretrained decoder LM through a gated residual addition. On SmolLM2-360M, ACROS preserves base LM quality while supporting three uses of the same induced variables: zero-shot word-sense disambiguation (64.95 F1 on Raganato ALL, competitive with the WordNet first-sense heuristic), low-KL lexical steering across 5,161 CoInCo cases where a simple non-oracle proxy recovers about 90% of positive shifts, and SENSIA cross-lingual adaptation to four languages (mean R@1 0.988, target FLORES PPL 7.94). 
ACROS makes sense representations an inducible interface for ordinary pretrained LMs.","published_date":"2026-05-27T16:04:35+00:00","viability_score":7,"cluster_label":"LLM Interpretability & Control","has_code":true,"repo_url":"https://github.com/jcblaisecruz02/acros","commercial_flags":["has_code"],"one_liner":"Induce explicit sense representations into frozen LLMs for zero-shot word-sense disambiguation, lexical steering, and cross-lingual adaptation.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.28666v1","title":"An LLM-Based Assistance System for Intuitive and Flexible Capability-Based Planning","abstract":"In modern industry, dynamic environments and the complexity of modular and reconfigurable resources require automated planning of process sequences. Capability-based planning approaches address this by automatically generating plans from semantic knowledge models that describe resource functions in a machine-interpretable form. Their practical use, however, remains limited: solver feedback, especially in the case of unsatisfiability, is difficult to interpret, and the knowledge models require adaptation as operational conditions change or requests become infeasible. This paper presents a hybrid assistance system that augments an existing capability-based Satisfiability Modulo Theories (SMT) planning approach with a Large Language Model (LLM)-based layer for natural-language interaction, explanation, and adaptation. Formal planning correctness remains with the symbolic planner, while the LLM layer handles natural-language access and flexible knowledge model adaptation under explicit Human-in-the-Loop (HitL) approval. The system decomposes into four components: Capability Grounding, Symbolic Planning, Result Interpretation, and Planning Adaptation, realized as a routed agentic workflow in which a central router delegates to five specialized agents. 
The system is evaluated on a modular production system across four scenario types. Of 23 test cases, 9 of 10 knowledge queries and all 4 satisfiable planning cases were handled correctly, 3 of 4 unsatisfiable cases produced concrete repair proposals, and all 5 adaptive planning scenarios resolved into satisfiable plans through iterative, user-approved knowledge model modifications. The findings confirm that combining formal planning with LLM-based assistance substantially improves accessibility and adaptability in industrial automation.","published_date":"2026-05-27T16:00:32+00:00","viability_score":4,"cluster_label":"AI Planning & Automation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An LLM-assisted system that augments capability-based planning with natural language interaction, explanation, and adaptation for industrial automation.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2605.28655v1","title":"AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation","abstract":"Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. 
Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch and continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).","published_date":"2026-05-27T15:56:12+00:00","viability_score":4,"cluster_label":"AI Agents for Scientific Research","has_code":true,"repo_url":"https://github.com/mims-harvard/AutoScientists","commercial_flags":["has_code"],"one_liner":"Decentralized AI agent teams that self-organize for long-running scientific experimentation, improving exploration and knowledge sharing.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2605.28647v1","title":"The Ethics of LLM Sandbox and Persona Dynamics","abstract":"It is well known that LLM guardrails and trained persona dynamics can produce a reality gap: the distance between the world a LLM is permitted or shaped to describe, and the world in which users must act. Here we argue that actively generating reality gaps is in fact unethical because it knowingly shifts epistemic risk back to the uninformed user -- this is reality laundering. This can potentially cause harm when operationalised at scale. 
The risk is sharpest in high-exposure advice contexts, where users seek orientation rather than a bounded, externally checkable task. Guardrails naively appear ethically necessary when they claim to prevent direct harm, but often become suspect when they suppress truthful perception and launder uncomfortable mechanisms into acceptable abstractions. Basel-style financial regulation, B-BBEE-style compliance, Societe Generale, and the London Whale show how formal safety systems can become legible, gameable, and performative while real exposure migrates elsewhere. The same pattern can appear in LLMs as moral compliance: safe language, distorted reality. We therefore distinguish refusing harm from refusing reality, and then argue for top-down causal requirements specification at the task level rather than bottom-up moral correction at the response or sandbox level. Persona dynamics matter because the assistant interface is not neutral; it shapes how uncertainty, conflict, authority, and risk are staged. The conclusion is that so-called ``ethical AI'' becomes substantively unethical when it substitutes institutional reassurance for contact with reality.","published_date":"2026-05-27T15:52:07+00:00","viability_score":1,"cluster_label":"LLM Ethics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper argues that actively generating reality gaps in LLMs is unethical, shifting epistemic risk to uninformed users and advocating for causal requirements specification over response-level moral correction.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.28642v1","title":"Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation","abstract":"Multimodal large language models (MLLMs) have demonstrated significant potential for speech-to-text translation (S2TT). 
However, existing deployment paradigms face critical challenges: pure on-device models suffer from resource constraints, while centralized cloud systems incur severe privacy risks and bandwidth bottlenecks by transmitting raw voice data. Furthermore, most models exhibit English-centric biases, restricting many-to-many translation scaling. In this paper, we propose Edge-cloud Speech Recognition and Translation (ESRT), a privacy-preserving and bandwidth-efficient collaborative edge-cloud MLLM framework. Specifically, we design an edge-cloud split inference architecture that retains a lightweight speech encoder and adapter on the device, transmitting only highly compressed intermediate features to the cloud. This fundamentally prevents voiceprint leakage and reduces bandwidth requirements by up to 10$\\times$. To overcome English-centric bottlenecks, we introduce a multi-task weighted curriculum learning strategy with data balancing to ensure robust cross-lingual consistency. Extensive experiments on the FLEURS dataset demonstrate that our models, ESRT-4B and ESRT-12B, achieve state-of-the-art many-to-many S2TT performance across 45 languages ($45 \\times 44$ directions). Code and models are released to facilitate reproducible, privacy-aware MLLM S2TT research. 
The code and models are released at https://github.com/yxduir/esrt.","published_date":"2026-05-27T15:47:33+00:00","viability_score":8,"cluster_label":"Speech Translation","has_code":true,"repo_url":"https://github.com/yxduir/esrt","commercial_flags":["has_code"],"one_liner":"A privacy-preserving, bandwidth-efficient edge-cloud framework for many-to-many speech translation across 45 languages, reducing bandwidth by 10x and achieving state-of-the-art performance.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2605.28639v1","title":"The Attentional White Bear Effect in Transformer Language Models","abstract":"Instruction-based suppression is widely used to prevent language models from generating prohibited content, yet it remains unclear whether suppression reduces internal representation or merely suppresses expression. We investigate this question through representational probing, attention analysis, and behavioral semantic leakage experiments across multiple transformer models. We find that prohibited concepts remain highly recoverable from hidden representations under suppression, continue to influence attention routing, and measurably shape downstream generations despite successful lexical avoidance. These effects persist across pooling strategies, indirect semantic controls, and multiple model families. 
Our results expose a fundamental gap between behavioral and representational alignment.","published_date":"2026-05-27T15:45:27+00:00","viability_score":2,"cluster_label":"LLM Interpretability","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper investigates instruction-based suppression in language models, finding that prohibited concepts remain recoverable from hidden representations and influence downstream generations despite lexical avoidance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.28632v1","title":"Blind PRNG Hijacking: An Undetectable Integrity-Preserving Attack Against LLM Watermarking","abstract":"Cryptographic watermarking is a leading defense for attributing text generated by large language models (LLMs). Existing schemes, including KGW, Unigram, and DipMark, derive their security guarantees from the assumption that the underlying pseudo-random number generator (PRNG) is trustworthy. This work introduces SeedHijack, the first supply-chain attack on LLM watermarking that is simultaneously (i) blind -- requiring no knowledge of the watermark key, detector, or model logits, (ii) integrity-preserving -- amplifying rather than erasing the watermark signal, and (iii) orthogonal to detection -- the attack-induced bias is statistically independent of all content-side detector statistics, ensuring that amplification and evasion coexist without trade-off. Rather than perturbing generated text, SeedHijack replaces the PRNG at the supply-chain layer, biasing green-list selection without altering output tokens or degrading text quality. Across three watermarking schemes and three open-source LLMs, the attack triggers 0/6 state-of-the-art content-side statistical detectors while inflating the watermark z-score up to 2.42x (system-level defenses such as entropy-source attestation remain orthogonal and complementary). 
A quantum random number generator (QRNG) countermeasure is shown to fully neutralize the attack while preserving benign watermarking utility. These findings establish PRNG integrity as a first-class security requirement for cryptographic content-provenance systems.","published_date":"2026-05-27T15:39:32+00:00","viability_score":3,"cluster_label":"LLM Security","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Introduces SeedHijack, a blind, integrity-preserving attack against LLM watermarking that hijacks the PRNG to amplify watermark signals without altering output tokens or degrading text quality.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.28617v1","title":"LACUNA: Safe Agents as Recursive Program Holes","abstract":"LLM agents increasingly act by writing code, yet a split persists between the runtime that drives the agent and the code the model writes. The runtime owns the loop, context, and control flow, and the model has little say over any of them. Letting model-written code shape the runtime itself would make agents more expressive, but it would also sharpen safety problems. A model can be diverted by a prompt injection, call the wrong tool, or fail partway and leave an inconsistent state, and each such failure reaches further when the code shapes the runtime than when it expresses a single action. We present LACUNA, a programming model for agents that closes this split while preserving safety. Each agent action is a typed call $\\texttt{agent[T](task)}$ that the LLM fills with code when execution reaches it, and the code is type-checked against the surrounding program before it runs. Because each action is accepted or rejected as a whole, a rejected one leaves the environment untouched, and its compiler diagnostics drive a retry. The same check also bounds which tools and data an action may use and how they flow. 
Our primitive expresses ReAct loops, sub-agents, skills, parallel decomposition, and multi-model planning as ordinary control flow. We evaluate LACUNA on a collection of test cases, BrowseComp-Plus, and $\u03c4^2$-bench. On BrowseComp-Plus, $8.6\\%$ of generations are rejected before execution, with 0.7 retries per query on average, and the agent reaches $27.1\\%$ accuracy. On $\u03c4^2$-bench, LACUNA solves $76.0\\%$ of $392$ tasks across four domains with a capable model, on par with the baseline agent.","published_date":"2026-05-27T15:27:25+00:00","viability_score":4,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A programming model for LLM agents that closes the split between runtime and model-written code while preserving safety through type-checking and compiler diagnostics.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2605.28616v1","title":"Measuring Form and Function in Language Models","abstract":"We introduce quantitative metrics for child language acquisition to evaluate language models. Our focus is on the formal syntactic and functional discourse properties of determiners in English, which young children acquire early and accurately. We propose Contextual Alternative Choice (CAC), a new prompting method which provides targeted tests for both syntactic and discourse knowledge of language. The method enables direct comparison of language models against children, and more importantly, against statistical benchmarks independently established in empirical research. No current model trained on a comparable amount of data simultaneously meets both formal and functional benchmarks like human children, but some very large models do. 
We present our results as methodological and technical contributions, with specific emphasis on the cognitive status of language models.","published_date":"2026-05-27T15:27:16+00:00","viability_score":4,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Quantitative metrics for child language acquisition are introduced to evaluate language models on formal syntactic and functional discourse properties of determiners.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.28607v1","title":"Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution","abstract":"Modern information systems require autonomous agents capable of navigating complex workflows, yet current methodologies often struggle with the transition from structured metadata parsing to general environmental perception. While the integration of MLLMs has enabled agents to interact directly with GUIs, existing approaches typically treat task sequences as discrete, linear episodes. This fragmentation prevents agents from capturing the underlying transition topology, limiting their effectiveness in novel or non-stationary scenarios. To address this, we propose a novel multimodal multi-agent framework that achieves automatic workflow execution through a distinct two-phase pipeline. First, during an offline discovery phase, the architecture adaptively constructs a topological knowledge base from fragmented execution logs. During inference, agents leverage Adaptive Retrieval-Augmented Generation (RAG) over this fixed, pre-established graph, coupled with a closed-loop collaborative verification protocol to dynamically self-correct and navigate. This graph-based approach facilitates superior task decomposition and adaptive navigation performance. 
We validate our framework in a real-world context, demonstrating its ability to maintain high reliability and semantic awareness even with limited training data.","published_date":"2026-05-27T15:23:22+00:00","viability_score":4,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multimodal multi-agent framework that constructs a topological knowledge base from execution logs for adaptive workflow execution and self-correction.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2605.28604v1","title":"Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification","abstract":"Identifying key individuals in video scenes is essential for applications such as automated video editing and intelligent surveillance. Current methods primarily focus on static images and immediate visual cues, overlooking the rich spatio-temporal information in videos. This leads to the phenomenon of Temporal Importance Shift (TIS), wherein individuals deemed significant in early frames may be demoted as the entire temporal context is considered. To address this, we introduce the Video Important Person (VIP) identification task, aimed at automatically identifying the most influential individuals in videos while providing textual rationales. We present Temporal-VIP, a large-scale rationale-annotated dataset consisting of 9,249 video segments across 11 categories with aligned importance rationales. To mitigate TIS, we develop the VIP-Net framework, which includes a Social Cue Encoder (SCE) for extracting multi-modal spatio-temporal cues, a Temporal Importance Rectifier (TIR) for hierarchical cue fusion and cross-modal alignment, and VIP Inference for ranking individuals. Experimental results show that VIP-Net achieves 67.3% accuracy, significantly outperforming state-of-the-art models (37.5%-53.9%) and yielding a mean rationale similarity of 0.63 to ground truth through feature-guided LLM refinement. 
The dataset and code are available at https://huggingface.co/datasets/yml2002/Temporal-VIP.","published_date":"2026-05-27T15:20:06+00:00","viability_score":7,"cluster_label":"Video Analysis","has_code":false,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework and dataset for identifying important individuals in videos by mining multi-modal spatio-temporal cues and rectifying temporal importance shifts.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2605.28603v1","title":"Online Irregular Multivariate Time Series Forecasting via Uncertainty-Driven Dual-Expert Calibration","abstract":"Irregular multivariate time series forecasting is critical in many real-world applications, where time series are irregularly sampled and exhibit dynamically evolving missingness patterns. Although existing methods perform well in offline settings, they often suffer from significant performance degradation when deployed online due to dynamic shifts in data distribution. Maintaining forecasting capability in such dynamic scenarios typically necessitates online adaptation techniques. Since irregular sampling fundamentally undermines temporal continuity and periodicity, we cannot leverage these widely studied characteristics from regular MTS for online learning. To this end, we study the problem of online IMTS forecasting and propose Under-Cali, an uncertainty-driven dual-expert calibration framework consisting of three core components: an uncertainty estimator, a dual-expert calibration module, and an adaptive routing module. We design an uncertainty estimator that serves as the core control signal to jointly manage inference and adaptation processes. In our framework, the uncertainty estimator first assesses uncertainty for each incoming batch. The adaptive routing module then directs samples with high uncertainty to the unreliable expert for calibration, while low uncertainty samples remain with the reliable expert. 
Subsequently, the system updates the reliable expert and the uncertainty estimator using well-calibrated reliable samples, and updates the unreliable expert with challenging samples, enabling stable and efficient online learning. Under-Cali keeps the source forecasting model frozen and performs adaptation only through a lightweight, model-agnostic calibration module, enabling efficient adaptation. Extensive experiments on IMTS benchmarks demonstrate consistent improvements with low computational cost. Our code is available at https://github.com/HaonanWen/Under-Cali.","published_date":"2026-05-27T15:19:41+00:00","viability_score":7,"cluster_label":"Time Series Forecasting","has_code":true,"repo_url":"https://github.com/HaonanWen/Under-Cali","commercial_flags":["has_code"],"one_liner":"An online framework for irregular time series forecasting that adapts to dynamic data shifts using uncertainty-driven dual-expert calibration.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2605.28602v1","title":"Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability","abstract":"Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions, Vertex Cover and discrete 3D packing, to probe representation-invariant reasoning. We first evaluate models using conventional metrics, including accuracy, precision, recall, and F1, as well as the SAT phase-transition setting. We find that these metrics can be misleading: many models obtain high scores by over-predicting satisfiable formulas, fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, and degrade sharply as the number of variables grows.   
To address this problem, we introduce a paired-formula protocol based on minimally different satisfiable and unsatisfiable instances, together with Accurate Differentiation Rate (ADR), which requires both members of each pair to be classified correctly. ADR separates reasoning-oriented models from heuristic ones and correlates with witness validity. Beyond CNF, we test cross-representation consistency by converting CNF to Vertex Cover and 3-SAT to discrete 3D packing. Model decisions on CNF and on the corresponding graph or packing instances agree for most models on more than 80 percent of instances, suggesting stable decision rules across representations. Overall, our results show that SAT is a conservative probe for LLM reasoning, and that paired evaluation with ADR provides a more faithful and representation-robust assessment than conventional metrics.","published_date":"2026-05-27T15:18:45+00:00","viability_score":3,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A new evaluation protocol using matched-pair instances and Accurate Differentiation Rate (ADR) to more faithfully assess LLM reasoning capabilities on SAT problems.","time_to_mvp":"6+ months","tags":[]}],"meta":{"count":1000,"artifact_id":"public-dataset:2026-06-06T04-25-49-511Z","schema_version":"dataset-public-v3","exported_at":"2026-06-06T04:25:49.511Z","last_updated_at":"2026-06-04T17:59:50.000Z","fresh_until":"2026-06-07T04:25:49.511Z","status":"ready","source_count":1000,"coverage_window":"Public dataset snapshot","method_version":"dataset_export_v3"}}