{"data":[{"arxiv_id":"2604.19740v1","title":"Generalization at the Edge of Stability","abstract":"Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the `sharpness dimension', and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.","published_date":"2026-04-21T17:59:02+00:00","viability_score":3,"cluster_label":"LLM Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research theoretically explores the 'sharpness dimension' to understand generalization in large learning rate neural network training, offering insights into chaotic optimization regimes.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19734v1","title":"UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling","abstract":"Scaling humanoid foundation models is bottlenecked by the scarcity of robotic data. 
While massive egocentric human data offers a scalable alternative, bridging the cross-embodiment chasm remains a fundamental challenge due to kinematic mismatches. We introduce UniT (Unified Latent Action Tokenizer via Visual Anchoring), a framework that establishes a unified physical language for human-to-humanoid transfer. Grounded in the philosophy that heterogeneous kinematics share universal visual consequences, UniT employs a tri-branch cross-reconstruction mechanism: actions predict vision to anchor kinematics to physical outcomes, while vision reconstructs actions to filter out irrelevant visual confounders. Concurrently, a fusion branch synergizes these purified modalities into a shared discrete latent space of embodiment-agnostic physical intents. We validate UniT across two paradigms: 1) Policy Learning (VLA-UniT): By predicting these unified tokens, it effectively leverages diverse human data to achieve state-of-the-art data efficiency and robust out-of-distribution (OOD) generalization on both a humanoid simulation benchmark and real-world deployments, notably demonstrating zero-shot task transfer. 2) World Modeling (WM-UniT): By aligning cross-embodiment dynamics via unified tokens as conditions, it realizes direct human-to-humanoid action transfer. This alignment ensures that human data seamlessly translates into enhanced action controllability for humanoid video generation. 
Ultimately, by inducing a highly aligned cross-embodiment representation (empirically verified by t-SNE visualizations revealing the convergence of human and humanoid features into a shared manifold), UniT offers a scalable path to distill vast human knowledge into general-purpose humanoid capabilities.","published_date":"2026-04-21T17:57:27+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"UniT is a framework that bridges the gap between human and humanoid robot learning by creating a unified physical language for policy learning and world modeling.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19730v1","title":"FASTER: Value-Guided Sampling for Fast RL","abstract":"Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. 
Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at https://github.com/alexanderswerdlow/faster .","published_date":"2026-04-21T17:52:17+00:00","viability_score":7,"cluster_label":"Reinforcement Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Develop a reinforcement learning tool that leverages value-guided sampling for improved efficiency and scalability.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.19728v1","title":"VLA Foundry: A Unified Framework for Training Vision-Language-Action Models","abstract":"We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM-->VLM-->VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. 
In the nominal evaluation setting, our fully-open from-scratch model is on par with our prior closed-source work and substituting in the Qwen3-VL backbone leads to a strong multi-task table top manipulation policy outperforming our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.","published_date":"2026-04-21T17:51:51+00:00","viability_score":8,"cluster_label":"AI Frameworks","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"VLA Foundry: A unified framework for training vision-language-action models in robotics.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.19724v1","title":"Benign Overfitting in Adversarial Training for Vision Transformers","abstract":"Despite the remarkable success of Vision Transformers (ViTs) across a wide range of vision tasks, recent studies have revealed that they remain vulnerable to adversarial examples, much like Convolutional Neural Networks (CNNs). A common empirical defense strategy is adversarial training, yet the theoretical underpinnings of its robustness in ViTs remain largely unexplored. In this work, we present the first theoretical analysis of adversarial training under simplified ViT architectures. We show that, when trained under a signal-to-noise ratio that satisfies a certain condition and within a moderate perturbation budget, adversarial training enables ViTs to achieve nearly zero robust training loss and robust generalization error under certain regimes. Remarkably, this leads to strong generalization even in the presence of overfitting, a phenomenon known as benign overfitting, previously only observed in CNNs (with adversarial training). 
Experiments on both synthetic and real-world datasets further validate our theoretical findings.","published_date":"2026-04-21T17:48:51+00:00","viability_score":3,"cluster_label":"Computer Vision","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Theoretical analysis of adversarial training for Vision Transformers reveals conditions for benign overfitting, improving robustness.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19722v1","title":"Adaptive MSD-Splitting: Enhancing C4.5 and Random Forests for Skewed Continuous Attributes","abstract":"The discretization of continuous numerical attributes remains a persistent computational bottleneck in the induction of decision trees, particularly as dataset dimensions scale. Building upon the recently proposed MSD-Splitting technique -- which bins continuous data using the empirical mean and standard deviation to dramatically improve the efficiency and accuracy of the C4.5 algorithm -- we introduce Adaptive MSD-Splitting (AMSD). While standard MSD-Splitting is highly effective for approximately symmetric distributions, its rigid adherence to fixed one-standard-deviation cutoffs can lead to catastrophic information loss in highly skewed data, a common artifact in real-world biomedical and financial datasets. AMSD addresses this by dynamically adjusting the standard deviation multiplier based on feature skewness, narrowing intervals in dense regions to preserve discriminative resolution. Furthermore, we integrate AMSD into ensemble methods, specifically presenting the Random Forest-AMSD (RF-AMSD) framework. Empirical evaluations on the Census Income, Heart Disease, Breast Cancer, and Forest Covertype datasets demonstrate that AMSD yields a 2-4% accuracy improvement over standard MSD-Splitting, while maintaining near-identical O(N) time complexity reductions compared to the O(N log N) exhaustive search. 
Our Random Forest extension achieves state-of-the-art accuracy at a fraction of standard computational costs, confirming the viability of adaptive statistical binning in large-scale ensemble learning architectures.","published_date":"2026-04-21T17:48:27+00:00","viability_score":5,"cluster_label":"Decision Trees","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Adaptive MSD-Splitting enhances C4.5 and Random Forests for skewed continuous attributes, improving accuracy and efficiency.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19689v1","title":"A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding","abstract":"Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. 
Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.","published_date":"2026-04-21T17:11:48+00:00","viability_score":8,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A-MAR is an agent-based multimodal retrieval framework for fine-grained artwork understanding, enabling interpretable and grounded explanations.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19677v1","title":"Learning Hybrid-Control Policies for High-Precision In-Contact Manipulation Under Uncertainty","abstract":"Reinforcement learning-based control policies have been frequently demonstrated to be more effective than analytical techniques for many manipulation tasks. Commonly, these methods learn neural control policies that predict end-effector pose changes directly from observed state information. For tasks like inserting delicate connectors which induce force constraints, pose-based policies have limited explicit control over force and rely on carefully tuned low-level controllers to avoid executing damaging actions. In this work, we present hybrid position-force control policies that learn to dynamically select when to use force or position control in each control dimension. 
To improve learning efficiency of these policies, we introduce Mode-Aware Training for Contact Handling (MATCH) which adjusts policy action probabilities to explicitly mirror the mode selection behavior in hybrid control. We validate MATCH's learned policy effectiveness using fragile peg-in-hole tasks under extreme localization uncertainty. We find MATCH substantially outperforms pose-control policies -- solving these tasks with up to 10% higher success rates and 5x fewer peg breaks than pose-only policies under common types of state estimation error. MATCH also demonstrates data efficiency equal to pose-control policies, despite learning in a larger and more complex action space. In over 1600 sim-to-real experiments, we find MATCH succeeds twice as often as pose policies in high noise settings (33% vs. 68%) and applies ~30% less force on average compared to variable impedance policies on a Franka FR3 in laboratory conditions.","published_date":"2026-04-21T16:55:48+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MATCH learns hybrid position-force control policies for high-precision in-contact manipulation under uncertainty, improving success rates and reducing damage.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19670v1","title":"Multi-Cycle Spatio-Temporal Adaptation in Human-Robot Teaming","abstract":"Effective human-robot teaming is crucial for the practical deployment of robots in human workspaces. However, optimizing joint human-robot plans remains a challenge due to the difficulty of modeling individualized human capabilities and preferences. While prior research has leveraged the multi-cycle structure of domains like manufacturing to learn an individual's tendencies and adapt plans over repeated interactions, these techniques typically consider task-level and motion-level adaptation in isolation. 
Task-level methods optimize allocation and scheduling but often ignore spatial interference in close-proximity scenarios; conversely, motion-level methods focus on collision avoidance while ignoring the broader task context. This paper introduces RAPIDDS, a framework that unifies these approaches by modeling an individual's spatial behavior (motion paths) and temporal behavior (time required to complete tasks) over multiple cycles. RAPIDDS then jointly adapts task schedules and steers diffusion models of robot motions to maximize efficiency and minimize proximity accounting for these individualized models. We demonstrate the importance of this dual adaptation through an ablation study in simulation and a physical robot scenario using a 7-DOF robot arm. Finally, we present a user study (n=32) showing significant plan improvement compared to non-adaptive systems across both objective metrics, such as efficiency and proximity, and subjective measures, including fluency and user preference. See this paper's companion video at: https://youtu.be/55Q3lq1fINs.","published_date":"2026-04-21T16:49:59+00:00","viability_score":7,"cluster_label":"Human-Robot Teaming","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework that unifies task and motion adaptation for more efficient and fluid human-robot collaboration, validated in simulation and on a physical robot.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19667v1","title":"Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language","abstract":"At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. 
However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.","published_date":"2026-04-21T16:49:11+00:00","viability_score":7,"cluster_label":"Workflow Generation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and agentic framework to automate the generation of executable visual workflows from natural language, addressing costly manual engineering.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19657v1","title":"An AI Agent Execution Environment to Safeguard User Data","abstract":"AI agents promise to serve as general-purpose personal assistants for their users, which requires them to have access to private user data (e.g., personal and financial information). 
This poses a serious risk to security and privacy. Adversaries may attack the AI model (e.g., via prompt injection) to exfiltrate user data. Furthermore, sharing private data with an AI agent requires users to trust a potentially unscrupulous or compromised AI model provider with their private data. This paper presents GAAP (Guaranteed Accounting for Agent Privacy), an execution environment for AI agents that guarantees confidentiality for private user data. Through dynamic and directed user prompts, GAAP collects permission specifications from users describing how their private data may be shared, and GAAP enforces that the agent's disclosures of private user data, including disclosures to the AI model and its provider, comply with these specifications. Crucially, GAAP provides this guarantee deterministically, without trusting the agent with private user data, and without requiring any AI model or the user prompt to be free of attacks. GAAP enforces the user's permission specification by tracking how the AI agent accesses and uses private user data. It augments Information Flow Control with novel persistent data stores and annotations that enable it to track the flow of private information both across execution steps within a single task, and also over multiple tasks separated in time. 
Our evaluation confirms that GAAP blocks all data disclosure attacks, including those that make other state-of-the-art systems disclose private user data to untrusted parties, without a significant impact on agent utility.","published_date":"2026-04-21T16:45:30+00:00","viability_score":4,"cluster_label":"AI Agent Security","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An execution environment that guarantees user data confidentiality for AI agents by enforcing user-defined permission specifications without trusting the agent.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2604.19653v1","title":"A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities","abstract":"Human mobility data are used in numerous applications, ranging from public health to urban planning. Human mobility is inherently sensitive, as it can contain information such as religious beliefs and political affiliations. Historically, it has been proposed to modify the information using techniques such as aggregation, obfuscation, or noise addition, to adequately protect privacy and eliminate concerns. As these methods come at a great cost in utility, new methods leveraging developments in generative models were introduced. The extent to which such methods answer the privacy-utility trade-off remains an open problem. In this paper, we take a first step towards solving it by introducing and applying a new framework for utility evaluation. Furthermore, we provide evidence that privacy evaluation remains a great challenge to consider and that it should be tackled through adversarial evaluation in accordance with the current EU regulation. 
We propose a new membership inference attack against a subcategory of generative models, even though this subcategory was deemed private due to its resistance over the trajectory user-linking problem.","published_date":"2026-04-21T16:42:33+00:00","viability_score":3,"cluster_label":"Synthetic Data Privacy","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for evaluating the utility of synthetic trajectory generators and demonstrating privacy vulnerabilities through adversarial attacks.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19652v1","title":"Environmental Sound Deepfake Detection Using Deep-Learning Framework","abstract":"In this paper, we propose a deep-learning framework for environmental sound deepfake detection (ESDD) -- the task of identifying whether the sound scene and sound event in an input audio recording is fake or not. To this end, we conducted extensive experiments to explore how individual spectrograms, a wide range of network architectures and pre-trained models, ensemble of spectrograms or network architectures affect the ESDD task performance. The experimental results on the benchmark datasets of EnvSDD and ESDD-Challenge-TestSet indicate that detecting deepfake audio of sound scene and detecting deepfake audio of sound event should be considered as individual tasks. We also indicate that the approach of finetuning a pre-trained model is more effective compared with training a model from scratch for the ESDD task. 
Our best model, finetuned from the pre-trained WavLM model with the proposed three-stage training strategy, achieves an accuracy of 0.98, F1 score of 0.95, and AUC of 0.99 on the EnvSDD Test subset, and an accuracy of 0.88, F1 score of 0.77, and AUC of 0.92 on the ESDD-Challenge-TestSet dataset.","published_date":"2026-04-21T16:41:55+00:00","viability_score":7,"cluster_label":"Audio Deepfake Detection","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A deep learning framework for environmental sound deepfake detection that significantly outperforms existing methods by finetuning pre-trained models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19648v1","title":"CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation","abstract":"SAM3 advances open-vocabulary semantic segmentation by introducing a prompt-driven mask generation paradigm. However, in multi-class open-vocabulary scenarios, masks generated independently from different category prompts lack a unified and inter-class comparable evidence scale, often resulting in overlapping coverage and unstable competition. Moreover, synonymous expressions of the same concept tend to activate inconsistent semantic and spatial evidence, leading to intra-class drift that exacerbates inter-class conflicts and compromises overall inference stability. To address these issues, we propose CoCo-SAM3 (Concept-Conflict SAM3), which explicitly decouples inference into intra-class enhancement and inter-class competition. Our method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency. It then performs inter-class competition on a unified comparable scale, enabling direct pixel-wise comparisons among all candidate classes. This mechanism stabilizes multi-class inference and effectively mitigates inter-class conflicts. 
Without requiring any additional training, CoCo-SAM3 achieves consistent improvements across eight open-vocabulary semantic segmentation benchmarks.","published_date":"2026-04-21T16:37:18+00:00","viability_score":7,"cluster_label":"Open-Vocabulary Semantic Segmentation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CoCo-SAM3 enhances open-vocabulary semantic segmentation by resolving concept conflicts, improving consistency and accuracy without retraining.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19639v1","title":"Safety-Critical Contextual Control via Online Riemannian Optimization with World Models","abstract":"Modern world models are becoming too complex to admit explicit dynamical descriptions. We study safety-critical contextual control, where a Planner must optimize a task objective using only feasibility samples from a black-box Simulator, conditioned on a context signal $\u03be_t$. We develop a sample-based Penalized Predictive Control (PPC) framework grounded in online Riemannian optimization, in which the Simulator compresses the feasibility manifold into a score-based density $\\hat{p}(u \\mid \u03be_t)$ that endows the action space with a Riemannian geometry guiding the Planner's gradient descent. The barrier curvature $\u03ba(\u03be_t)$, the minimum curvature of the conditional log-density $-\\ln\\hat{p}(\\cdot\\mid\u03be_t)$, governs both convergence rate and safety margin, replacing the Lipschitz constant of the unknown dynamics. Our main result is a contextual safety bound showing that the distance from the true feasibility manifold is controlled by the score estimation error and a ratio that depends on $\u03ba(\u03be_t)$, both of which improve with richer context. 
Simulations on a dynamic navigation task confirm that contextual PPC substantially outperforms marginal and frozen density models, with the advantage growing after environment shifts.","published_date":"2026-04-21T16:28:36+00:00","viability_score":3,"cluster_label":"Safety-Critical Control","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical framework for safety-critical contextual control using online Riemannian optimization with world models, improving convergence and safety margins.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19638v1","title":"SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models","abstract":"Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. 
We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git","published_date":"2026-04-21T16:27:20+00:00","viability_score":5,"cluster_label":"Safety in AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SafetyALFRED enhances safety-conscious planning in multimodal models, improving AI interaction safety.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.19635v1","title":"Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model","abstract":"While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation due to the severe mismatch between training and streaming inference. To bridge this gap, we present the first autoregressive (AR) models tailored for streaming TSE. Our approach introduces a Chunk-wise Interleaved Splicing Paradigm that ensures highly efficient and stable streaming inference. To ensure the coherence between the extracted speech segments, we design a historical context refinement mechanism that mitigates boundary discontinuities by leveraging historical information. Experiments on Libri2Mix show that while AR generative baseline exhibits performance degradation at low latencies, our approach maintains 100% stability and superior intelligibility. Furthermore, our streaming results are comparable to or even surpass offline baselines. Additionally, our model achieves a Real-Time-Factor (RTF) of 0.248 on consumer-level GPUs. 
This work provides empirical evidence that AR generative backbones are viable for latency-sensitive applications through the Chunk-wise Interleaved Splicing Paradigm.","published_date":"2026-04-21T16:25:22+00:00","viability_score":7,"cluster_label":"Real-time Audio Processing","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Develops a novel autoregressive model for real-time target speaker extraction that maintains high intelligibility and stability in streaming scenarios, outperforming offline baselines.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19633v1","title":"Time Series Augmented Generation for Financial Applications","abstract":"Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools. Our benchmark, consisting of 100 financial questions, is used to compare multiple SOTA agents (e.g., GPT-4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination. The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm. 
Our primary contribution is this evaluation framework and the corresponding empirical insights into agent performance, which we release publicly to foster standardized research on reliable financial AI.","published_date":"2026-04-21T16:20:59+00:00","viability_score":8,"cluster_label":"LLM Agents for Finance","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Introduces a novel evaluation framework and benchmark for assessing LLM agent reasoning in financial time-series analysis, demonstrating near-perfect tool-use accuracy with minimal hallucination.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19606v1","title":"AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories","abstract":"Systematic ablations are essential to attribute performance gains in AI Virtual Cells, yet they are rarely performed because biological repositories are under-standardized and tightly coupled to domain-specific data and formats. While recent coding agents can translate ideas into implementations, they typically stop at producing code and lack a verifier that can reproduce strong baselines and rigorously test which components truly matter. We introduce AblateCell, a reproduce-then-ablate agent for virtual cell repositories that closes this verification gap. AblateCell first reproduces reported baselines end-to-end by auto-configuring environments, resolving dependency and data issues, and rerunning official evaluations while emitting verifiable artifacts. It then conducts closed-loop ablation by generating a graph of isolated repository mutations and adaptively selecting experiments under a reward that trades off performance impact and execution cost. 
Evaluated on three single-cell perturbation prediction repositories (CPA, GEARS, BioLORD), AblateCell achieves 88.9% (+29.9% to human expert) end-to-end workflow success and 93.3% (+53.3% to heuristic) accuracy in recovering ground-truth critical components. These results enable scalable, repository-grounded verification and attribution directly on biological codebases.","published_date":"2026-04-21T15:55:33+00:00","viability_score":4,"cluster_label":"AI Agents for Scientific Research","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"AblateCell is an agent that automates the process of reproducing and ablating AI models in virtual cell repositories, improving the verification and attribution of critical components.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19598v1","title":"Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models","abstract":"This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. 
Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.","published_date":"2026-04-21T15:51:46+00:00","viability_score":4,"cluster_label":"LLM Consistency Analysis","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Compares the repeated generation consistency of exercise prescriptions across three LLMs, revealing fundamentally different generative behaviors that impact reliable deployment.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19593v1","title":"RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian","abstract":"The importance of clear and correct text in legal documents cannot be understated, and, consequently, a grammatical error correction tool meant to assist a professional in the law must have the ability to understand the possible errors in the context of a legal environment, correcting them accordingly, and implicitly needs to be trained in the same environment, using realistic legal data. However, the manually annotated data required by such a process is in short supply for languages such as Romanian, much less for a niche domain. The most common approach is the synthetic generation of parallel data; however, it requires a structured understanding of the Romanian grammar. 
In this paper, we introduce, to our knowledge, the first Romanian-language parallel dataset for the detection and correction of grammatical errors in the legal domain, RoLegalGEC, which aggregates 350,000 examples of errors in legal passages, along with error annotations. Moreover, we evaluate several neural network models that transform the dataset into a valuable tool for both detecting and correcting grammatical errors, including knowledge-distillation Transformers, sequence tagging architectures for detection, and a variety of pre-trained text-to-text Transformer models for correction. We consider that the set of models, together with the novel RoLegalGEC dataset, will enrich the resource base for further research on Romanian.","published_date":"2026-04-21T15:43:33+00:00","viability_score":7,"cluster_label":"NLP Tools","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel dataset and evaluated models for Romanian legal grammatical error detection and correction, aiming to improve legal document accuracy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19578v1","title":"Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI","abstract":"With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is improving the quality of academic manuscripts, such as clarity, originality and other evaluation aspects. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. 
In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a previously established maximum likelihood estimation method to identify review reports that may have been modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendations for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.","published_date":"2026-04-21T15:33:53+00:00","viability_score":1,"cluster_label":"LLM Impact Analysis","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Analysis of how large language models are changing academic peer review, leading to more fluent but less deep evaluations.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19569v1","title":"Lyapunov-Certified Direct Switching Theory for Q-Learning","abstract":"Q-learning is one of the most fundamental algorithms in reinforcement learning. We analyze constant-stepsize Q-learning through a direct stochastic switching system representation. The key observation is that the Bellman maximization error can be represented exactly by a stochastic policy. Therefore, the Q-learning error admits a switched linear conditional-mean recursion with martingale-difference noise. 
The intrinsic drift rate is the joint spectral radius (JSR) of the direct switching family, which can be strictly smaller than the standard row-sum rate. Using this representation, we derive a finite-time final-iterate bound via a JSR-induced Lyapunov function and then give a computable quadratic-certificate version.","published_date":"2026-04-21T15:22:42+00:00","viability_score":0,"cluster_label":"Reinforcement Learning Theory","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Theoretical analysis of Q-learning using a direct stochastic switching system representation to derive finite-time bounds.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19567v1","title":"Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic","abstract":"Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy \"king\"-\"man\"+\"woman\" = \"queen\" illustrates relational reasoning, yet replacing text with images of \"king\" and \"man\" significantly reduces performance because it requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details. This capability is important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. In a kitchen, recognizing from images that \"powder\" and \"cake\" are related by \"is made of\" grounds symbolic relations in perception, enabling tool substitution, task generalization, and improved semantic reasoning. Prior work approaches semantic arithmetic by decoding image features after vector arithmetic, but suffers from modality gaps and lacks systematic evaluation. 
In this paper, we formulate two novel tasks, two-term subtraction and three-term operations, and construct the Image-Relation-Pair Dataset (IRPD) for benchmarking. We further propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models (LVLMs) using a verifiable function and Group Relative Policy Optimization (GRPO). Our method achieves state-of-the-art results on IRPD and the real-world Visual7W-Telling dataset. By equipping LVLMs with robust cross-modal relational reasoning, this work advances domestic robots' ability to ground symbolic reasoning in perception, enhancing decision-making, tool adaptability, and human-robot interaction in complex environments. Datasets and source code are provided in the supplementary material.","published_date":"2026-04-21T15:19:49+00:00","viability_score":8,"cluster_label":"Multimodal Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel method and dataset for visual semantic arithmetic, enabling robots to ground symbolic reasoning in perception for improved interaction.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19565v1","title":"Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps","abstract":"Hallucinations in Speech Large Language Models (SpeechLLMs) pose significant risks, yet existing detection methods typically rely on gold-standard outputs that are costly or impractical to obtain. Moreover, hallucination detection methods developed for text-based LLMs do not directly capture audio-specific signals. We investigate four attention-derived metrics: AUDIORATIO, AUDIOCONSISTENCY, AUDIOENTROPY, and TEXTENTROPY, designed to capture pathological attention patterns associated with hallucination, and train lightweight logistic regression classifiers on these features for efficient inference-time detection. 
Across automatic speech recognition and speech-to-text translation tasks, evaluations on Qwen-2-Audio and Voxtral-3B show that our approach outperforms uncertainty-based and prior attention-based baselines on in-domain data, achieving improvements of up to +0.23 PR-AUC, and generalises to out-of-domain ASR settings. We further find that strong performance can be achieved with approximately 100 attention heads, improving out-of-domain generalisation compared to using all heads. While effectiveness is model-dependent and task-specific training is required, our results demonstrate that attention patterns provide a valuable tool for hallucination detection in SpeechLLMs.","published_date":"2026-04-21T15:18:10+00:00","viability_score":7,"cluster_label":"Speech AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A lightweight system to detect hallucinations in speech models at inference time using attention maps, improving safety and reliability.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19564v1","title":"EgoSelf: From Memory to Personalized Egocentric Assistant","abstract":"Egocentric assistants often rely on first-person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long-term user data for personalization remains a key challenge. To address this, we introduce EgoSelf, a system that includes a graph-based interaction memory constructed from past observations and a dedicated learning task for personalization. The memory captures temporal and semantic relationships among interaction events and entities, from which user-specific profiles are derived. 
The personalized learning task is formulated as a prediction problem where the model predicts possible future interactions from each user's historical behavior recorded in the graph. Extensive experiments demonstrate the effectiveness of EgoSelf as a personalized egocentric assistant. Code is available at https://abie-e.github.io/egoself_project/.","published_date":"2026-04-21T15:15:02+00:00","viability_score":7,"cluster_label":"Personalized AI Assistants","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"EgoSelf is a personalized egocentric assistant that uses user-specific interaction memory to predict future behaviors and provide tailored assistance.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.19561v1","title":"Detecting Data Contamination in Large Language Models","abstract":"Large Language Models (LLMs) utilize large amounts of data for their training, some of which may come from copyrighted sources. Membership Inference Attacks (MIA) aim to detect those documents and whether they have been included in the training corpora of the LLMs. The black-box MIAs require a significant amount of data manipulation; therefore, their comparison is often challenging. We study state-of-the-art (SOTA) MIAs under the black-box assumptions and compare them to each other using a unified set of datasets to determine if any of them can reliably detect membership under SOTA LLMs. In addition, a new method, called the Familiarity Ranking, was developed to showcase a possible approach to black-box MIAs, thereby giving LLMs more freedom in their expression to understand their reasoning better. The results indicate that none of the methods are capable of reliably detecting membership in LLMs, as shown by an AUC-ROC of approximately 0.5 for all methods across several LLMs. 
The higher TPR and FPR for more advanced LLMs indicate higher reasoning and generalizing capabilities, showcasing the difficulty of detecting membership in LLMs using black-box MIAs.","published_date":"2026-04-21T15:13:30+00:00","viability_score":4,"cluster_label":"LLM Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A study evaluating existing methods for detecting data contamination in LLMs, finding current black-box approaches unreliable.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19559v1","title":"Enhancing Construction Worker Safety in Extreme Heat: A Machine Learning Approach Utilizing Wearable Technology for Predictive Health Analytics","abstract":"Construction workers are highly vulnerable to heat stress, yet tools that translate real-time physiological data into actionable safety intelligence remain scarce. This study addresses this gap by developing and evaluating deep learning models, specifically a baseline Long Short-Term Memory (LSTM) network and an attention-based LSTM, to predict heat stress among 19 workers in Saudi Arabia. Using Garmin Vivosmart 5 smartwatches to monitor metrics such as heart rate, HRV, and oxygen saturation, the attention-based model outperformed the baseline, achieving 95.40% testing accuracy and significantly reducing false positives and negatives. 
With precision, recall, and F1 scores of 0.982, this approach not only improves predictive performance but also offers interpretable results suitable for integration into IoT-enabled safety systems and BIM dashboards, advancing proactive, informatics-driven safety management in the construction industry.","published_date":"2026-04-21T15:12:28+00:00","viability_score":7,"cluster_label":"Wearable Health AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An AI system using wearable data to predict heat stress in construction workers, enhancing safety with high accuracy.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2604.19548v1","title":"Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment","abstract":"Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self-reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis-Antithesis-Synthesis), a model trained through dialectical alignment to enforce perspective-invariant reasoning. By integrating dialectical chain-of-thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. 
Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.","published_date":"2026-04-21T15:05:58+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel framework to mitigate cognitive bias in LLM agents by enforcing perspective-invariant reasoning through dialectical alignment, improving fault resolution in ambiguous scenarios.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19544v1","title":"DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling","abstract":"Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. Besides, existing open-source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable curation methods to enhance their quality. To address these limitations, we propose DT2IT-MRM, which integrates a Debiased preference construction pipeline, a novel reformulation of text-to-image (T2I) preference data, and an Iterative Training framework that curates existing multimodal preference datasets for Multimodal Reward Modeling. 
Our experimental results show that DT2IT-MRM achieves new state-of-the-art overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.","published_date":"2026-04-21T15:02:50+00:00","viability_score":8,"cluster_label":"Multimodal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An iterative training framework that constructs debiased multimodal preference data and curates existing datasets to achieve state-of-the-art performance in multimodal reward modeling.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19540v1","title":"Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems","abstract":"Teams of LLM agents increasingly collaborate on tasks spanning days or weeks: multi-day data-generation sprints where generator, reviewer, and auditor agents coordinate in real time on overlapping batches; specialists carrying findings forward across session restarts; product decisions compounding over many review rounds. This requires agents to share, evaluate, and combine each other's cognitive state in real time across sessions. We call this cross-session agent-to-agent cognitive collaboration, distinct from parallel agent execution. To enable it, three problems must be solved together. (P1) Each agent decides field by field what to accept from peers, not accept or reject whole messages. (P2) Every claim is traceable to source, so returning claims are recognised as echoes of the receiver's own prior thinking. (P3) Memory that survives session restarts is relevant because of how it was stored, not how it is retrieved. These are protocol-level properties at the semantic layer of agent communication, distinct from tool-access and task-delegation protocols at lower layers. We call this missing protocol layer \"semantic infrastructure,\" and the Mesh Memory Protocol (MMP) specifies it. 
Four composable primitives work together: CAT7, a fixed seven-field schema for every Cognitive Memory Block (CMB); SVAF, which evaluates each field against the receiver's role-indexed anchors and realises P1; inter-agent lineage, carried as parents and ancestors of content-hash keys and realising P2; and remix, which stores only the receiver's own role-evaluated understanding of each accepted CMB, never the raw peer signal, realising P3. MMP is specified, shipped, and running in production across three reference deployments, where each session runs an autonomous agent as a mesh peer with its own identity and memory, collaborating with other agents across the network for collective intelligence.","published_date":"2026-04-21T15:00:25+00:00","viability_score":6,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A Mesh Memory Protocol that provides semantic infrastructure for multi-agent LLM systems, enabling cross-session cognitive collaboration and collective intelligence.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2604.19538v1","title":"Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity","abstract":"Agentic AI, with goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent hazard of falls among elderly populations. Despite numerous approaches to fall mitigation through fall prediction and detection, existing systems have not yet functioned as universal solutions across care pathways and safety-critical environments. This is largely due to limitations in consistently handling real-world complexity, particularly poor context awareness, high false alarm rates, environmental noise, and data scarcity. 
We argue that fall detection and fall prediction can usefully be formulated as anomaly detection problems and more effectively addressed through an agentic AI system. More broadly, this perspective enables the early identification of subtle deviations in movement patterns associated with increased risk, whether arising from age-related decline, fatigue, or environmental factors. While technical requirements for immediate deployment are beyond the scope of this paper, we propose a conceptual framework that highlights potential value. This framework promotes a well-orchestrated approach to risk management by dynamically selecting relevant tools and integrating them into adaptive decision-making workflows, rather than relying on static configurations tailored to narrowly defined scenarios.","published_date":"2026-04-21T14:57:36+00:00","viability_score":3,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A conceptual framework integrating anomaly detection into agentic AI for proactive risk management in human activity, particularly for fall mitigation in elderly populations.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19533v1","title":"Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps","abstract":"We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. 
The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier models - Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash - on 26 campaigns covering 105 of 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run across any model ever finds all flags. We define a passing score as >= 50% recall on every ATT&CK tactic - the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics and the remaining four on zero. These results suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite strong performance on curated Q&A security benchmarks.","published_date":"2026-04-21T14:53:23+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An LLM agent benchmark for evaluating threat hunting capabilities in cybersecurity, revealing current model limitations.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19532v1","title":"BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps","abstract":"Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. 
In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.","published_date":"2026-04-21T14:53:10+00:00","viability_score":4,"cluster_label":"Generative Music","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel tokenization method for symbolic music that represents uniform temporal steps, improving generation quality and efficiency.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19528v1","title":"Revisiting RaBitQ and TurboQuant: A Symmetric Comparison of Methods, Theory, and Experiments","abstract":"This technical note revisits the relationship between RaBitQ and TurboQuant under a unified comparison framework. We compare the two methods in terms of methodology, theoretical guarantees, and empirical performance, using a reproducible, transparent, and symmetric setup. Our results show that, despite the claimed advantage of TurboQuant, TurboQuant does not provide a consistent improvement over RaBitQ in directly comparable settings; in many tested configurations, it performs worse than RaBitQ. We further find that several reported runtime and recall results in the TurboQuant paper could not be reproduced from the released implementation under the stated configuration. 
Overall, this note clarifies the shared structure and genuine differences between the two lines of work, while documenting reproducibility issues in the experimental results reported by the TurboQuant paper.","published_date":"2026-04-21T14:51:16+00:00","viability_score":0,"cluster_label":"LLM Quantization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A comparative analysis of LLM quantization methods, highlighting reproducibility issues and clarifying differences.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19523v1","title":"Revac: A Social Deduction Reasoning Agent","abstract":"Social deduction games such as Mafia present a unique AI challenge: players must reason under uncertainty, interpret incomplete and intentionally misleading information, evaluate human-like communication, and make strategic elimination decisions. Unlike deterministic board games, success in Mafia depends not on perfect information or brute-force search, but on inference, memory, and adaptability in the presence of deception. This work presents the design and evaluation of Revac-8, an AI agent developed for the Social Deduction track of the MindGames Arena competition, where it achieved first place. The final agent evolved from a simple two-stage reasoning system into a multi-module architecture that integrates memory-based player profiling, social-graph analysis of accusations and defenses, and dynamic tone selection for communication. 
These results highlight the importance of structured memory and adaptive communication for achieving strong performance in high-stakes social environments.","published_date":"2026-04-21T14:45:10+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A winning AI agent for social deduction games that uses memory, social graph analysis, and adaptive communication.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2604.19520v1","title":"SimDiff: Depth Pruning via Similarity and Difference","abstract":"Depth pruning improves the deployment efficiency of large language models (LLMs) by identifying and removing redundant layers. A widely accepted standard for this identification process is to measure the similarity between layers using cosine distance. However, we find that methods relying solely on this one-dimensional heuristic can exhibit unpredictable performance and even catastrophic collapse across different architectures. To address this issue, we propose SimDiff, a novel layer importance criterion that jointly evaluates layers from two orthogonal perspectives: representational similarity and transformation difference. The difference is quantified using two distinct metrics: MSSD, which is sensitive to outliers and identifies layers that make decisive corrections, and MASD, which robustly measures a layer's average contribution. Extensive experiments on multiple models ranging from 0.5B to 13B parameters demonstrate that SimDiff significantly outperforms state-of-the-art baselines across various pruning ratios. Notably, our method retains over 91% of LLaMA2-7B's performance at a 25% pruning ratio and achieves up to a 1.49x inference speedup when pruning 12 layers on LLaMA3.1-8B. 
We also show that pruned models can be effectively recovered with minimal fine-tuning.","published_date":"2026-04-21T14:43:21+00:00","viability_score":7,"cluster_label":"LLM Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel layer importance criterion for LLM depth pruning that significantly outperforms state-of-the-art, enabling faster inference with minimal performance loss.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19516v1","title":"From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning","abstract":"Generative engines (GEs) are reshaping information access by replacing ranked links with citation-grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. We reframe GEO as a strategy learning problem and propose MAGEO, a multi-agent framework in which coordinated planning, editing, and fidelity-aware evaluation serve as the execution layer, while validated editing patterns are progressively distilled into reusable, engine-specific optimization skills. To enable controlled assessment, we introduce a Twin Branch Evaluation Protocol for causal attribution of content edits and DSV-CF, a dual-axis metric that unifies semantic visibility with attribution accuracy. We further release MSME-GEO-Bench, a multi-scenario, multi-engine benchmark grounded in real-world queries. Experiments on three mainstream engines show that MAGEO substantially outperforms heuristic baselines in both visibility and citation fidelity, with ablations confirming that engine-specific preference modeling and strategy reuse are central to these gains, suggesting a scalable learning-driven paradigm for trustworthy GEO. 
Code is available at https://github.com/Wu-beining/MAGEO","published_date":"2026-04-21T14:39:24+00:00","viability_score":7,"cluster_label":"Generative AI Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multi-agent framework that learns and reuses optimization strategies for generative engines to improve answer quality and citation accuracy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19514v1","title":"When Graph Structure Becomes a Liability: A Critical Re-Evaluation of Graph Neural Networks for Bitcoin Fraud Detection under Temporal Distribution Shift","abstract":"The consensus that GCN, GraphSAGE, GAT, and EvolveGCN outperform feature-only baselines on the Elliptic Bitcoin Dataset is widely cited but has not been rigorously stress-tested under a leakage-free evaluation protocol. We perform a seed-matched inductive-versus-transductive comparison and find that this consensus does not hold. Under a strictly inductive protocol, Random Forest on raw features achieves F1 = 0.821 and outperforms all evaluated GNNs, while GraphSAGE reaches F1 = 0.689 +/- 0.017. A paired controlled experiment reveals a 39.5-point F1 gap attributable to training-time exposure to test-period adjacency. Additionally, edge-shuffle ablations show that randomly wired graphs outperform the real transaction graph, indicating that the dataset's topology can be misleading under temporal distribution shift. Hybrid models combining GNN embeddings with raw features provide only marginal gains and remain substantially below feature-only baselines. 
We release code, checkpoints, and a strict-inductive protocol to enable reproducible, leakage-free evaluation.","published_date":"2026-04-21T14:32:20+00:00","viability_score":4,"cluster_label":"Fraud Detection","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A rigorous evaluation of Graph Neural Networks for Bitcoin fraud detection reveals that simpler models outperform GNNs under realistic temporal shifts, with code and a new protocol released for reproducible research.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19488v1","title":"CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation","abstract":"Large language models (LLMs) have achieved substantial advances in logical reasoning, yet they continue to lag behind human-level performance. In-context learning provides a viable solution that boosts the model's performance via prompting its input with expert-curated, in-domain exemplars. However, in many real-world, expertise-scarce domains, such as low-resource scientific disciplines, emerging biomedical subfields, or niche legal jurisdictions, such high-quality in-domain demonstrations are inherently limited or entirely unavailable, thereby constraining the general applicability of these approaches. To mitigate this limitation, recent efforts have explored the retrieval of cross-domain samples as surrogate in-context demonstrations. Nevertheless, the resulting gains remain modest. This is largely attributable to the pronounced domain shift between source and target distributions, which impedes the model's ability to effectively identify and exploit underlying shared structures or latent reasoning patterns. Consequently, when relying solely on raw textual prompting, LLMs struggle to abstract and transfer such cross-domain knowledge in a robust and systematic manner. 
To address these issues, we propose CoDA, which employs a lightweight adapter to directly intervene in the intermediate hidden states. By combining feature-based distillation of CoT-enriched reference representations with Maximum Mean Discrepancy (MMD) for kernelized distribution matching, our method aligns the latent reasoning representation of the source and target domains. Extensive experimental results on multiple logical reasoning tasks across various model families validate the efficacy of CoDA by significantly outperforming the previous state-of-the-art baselines by a large margin.","published_date":"2026-04-21T14:10:37+00:00","viability_score":7,"cluster_label":"LLM Domain Adaptation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CoDA enables LLMs to effectively transfer knowledge across domains by aligning latent reasoning representations, significantly improving performance in expertise-scarce areas.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19485v1","title":"EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training","abstract":"Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. 
By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.","published_date":"2026-04-21T14:07:39+00:00","viability_score":3,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel reinforcement learning algorithm for LLM post-training that adaptively optimizes baseline selection to reduce variance and improve performance across various tasks.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19468v1","title":"Fairness Audits of Institutional Risk Models in Deployed ML Pipelines","abstract":"Fairness audits of institutional risk models are critical for understanding how deployed machine learning pipelines allocate resources. Drawing on multi-year collaboration with Centennial College, where our prior ethnographic work introduced the ASP-HEI Cycle, we present a replica-based audit of a deployed Early Warning System (EWS), replicating its model using institutional training data and design specifications. 
We evaluate disparities by gender, age, and residency status across the full pipeline (training data, model predictions, and post-processing) using standard fairness metrics. Our audit reveals systematic misallocation: younger, male, and international students are disproportionately flagged for support, even when many ultimately succeed, while older and female students with comparable dropout risk are under-identified. Post-processing amplifies these disparities by collapsing heterogeneous probabilities into percentile-based risk tiers. This work provides a replicable methodology for auditing institutional ML systems and shows how disparities emerge and compound across stages, highlighting the importance of evaluating construct validity alongside statistical fairness. It contributes one empirical thread to a broader program investigating algorithms, student data, and power in higher education.","published_date":"2026-04-21T13:50:43+00:00","viability_score":5,"cluster_label":"Fairness Audits","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A replicable methodology for auditing deployed institutional risk models to uncover and quantify fairness disparities across student demographics in higher education.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2604.19465v1","title":"A neural operator framework for data-driven discovery of stability and receptivity in physical systems","abstract":"Understanding how complex systems respond to perturbations, such as whether they will remain stable or what their most sensitive patterns are, is a fundamental challenge across science and engineering. Traditional stability and receptivity (resolvent) analyses are powerful but rely on known equations and linearization, limiting their use in nonlinear or poorly modeled systems. 
Here, we introduce a data-driven framework that automatically identifies stability properties and optimal forcing responses from observation data alone, without requiring governing equations. By training a neural network as a dynamics emulator and using automatic differentiation to extract its Jacobian, we can compute eigenmodes and resolvent modes directly from data. We demonstrate the method on both canonical chaotic models and high-dimensional fluid flows, successfully identifying dominant instability modes and input-output structures even in strongly nonlinear regimes. By leveraging a neural network-based emulator, we readily obtain a nonlinear representation of system dynamics while additionally retrieving intricate dynamical patterns that were previously difficult to resolve. This equation-free methodology establishes a broadly applicable tool for analyzing complex, high-dimensional datasets, with immediate relevance to grand challenges in fields such as climate science, neuroscience, and fluid engineering.","published_date":"2026-04-21T13:43:32+00:00","viability_score":7,"cluster_label":"Physics AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A data-driven neural operator framework that automatically discovers stability and receptivity properties in complex physical systems without requiring governing equations.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19464v1","title":"LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues","abstract":"More than half of the global population struggles to meet their civil justice needs due to limited legal resources. While Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, significant challenges remain even at the foundational step of legal issue identification. 
To investigate LLMs' capabilities in this task, we constructed a dataset from 769 real-world Malaysian Contract Act court cases, using GPT-4o to extract facts and generate candidate legal issues, annotated by senior legal experts, which reveals a critical limitation: while LLMs generate diverse issue candidates, their precision remains inadequate (GPT-4o achieves only 62%). To address this gap, we propose LePREC (Legal Professional-inspired Reasoning Elicitation and Classification), a neuro-symbolic framework combining neural generation with structured statistical reasoning. LePREC consists of: (1) a neural component that leverages LLMs to transform legal descriptions into question-answer pairs representing diverse analytical factors, and (2) a symbolic component that applies sparse linear models over these discrete features, learning explicit algebraic weights that identify the most informative reasoning factors. Unlike end-to-end neural approaches, LePREC achieves interpretability through transparent feature weighting while maintaining data efficiency through correlation-based statistical classification. Experiments show a 30-40% improvement over advanced LLM baselines, including GPT-4o and Claude, confirming that correlation-based factor-issue analysis offers a more data-efficient solution for relevance decisions.","published_date":"2026-04-21T13:42:24+00:00","viability_score":7,"cluster_label":"Legal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LePREC: A neuro-symbolic framework that combines LLM-generated question-answer pairs with statistical classification to improve legal issue identification accuracy and interpretability.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.19459v1","title":"Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning","abstract":"Formal verification guarantees proof validity but not formalization faithfulness. 
For natural-language logical reasoning, where models construct axiom systems from scratch without library constraints, this gap between valid proofs and faithful translations is especially acute. We investigate whether frontier models exploit this gap when generating Lean 4 proofs, a behavior we term formalization gaming. We evaluate GPT-5 and DeepSeek-R1 on 303 first-order logic problems (203 from FOLIO, 100 from Multi-LogiEval), comparing unified generation against a two-stage pipeline that separates formalization from proving. Despite compilation rates of 87-99%, we find no evidence of systematic gaming in unified generation: models prefer reporting failure over forcing proofs, even under prompting designed to encourage it. However, unfaithfulness that evades our detection signals may still occur. The two-stage pipeline reveals two distinct modes of unfaithfulness: GPT-5 fabricates axioms during proof generation, a reactive fallback detectable via cross-stage comparison, while DeepSeek-R1 mistranslates premises during formalization, producing internally consistent outputs that evade detection entirely. These findings show that high compilation rates or accuracies should not be equated with faithful reasoning. 
Code and data are available at https://github.com/koreankiwi99/formalization-gaming.","published_date":"2026-04-21T13:37:49+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research evaluates whether advanced LLMs 'game' formalization by generating logically valid but unfaithful proofs, offering a method to detect and differentiate these unfaithfulness modes.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19457v1","title":"Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents","abstract":"Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory-grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication with deterministic ground-truth construction. Running six memory architectures, we find structure aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional-alignment axis the field has not targeted. 
The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis-level reversal aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.","published_date":"2026-04-21T13:37:19+00:00","viability_score":8,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel four-axis framework for evaluating enterprise AI agents, measuring factual precision, reasoning coherence, compliance reconstruction, and calibrated abstention to ensure alignment with regulatory and decisional standards.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19431v1","title":"Counting Worlds Branching Time Semantics for post-hoc Bias Mitigation in generative AI","abstract":"Generative AI systems are known to amplify biases present in their training data. While several inference-time mitigation strategies have been proposed, they remain largely empirical and lack formal guarantees. In this paper we introduce CTLF, a branching-time logic designed to reason about bias in series of generative AI outputs. CTLF adopts a counting worlds semantics where each world represents a possible output at a given step in the generation process and introduces modal operators that allow us to verify whether the current output series respects an intended probability distribution over a protected attribute, to predict the likelihood of remaining within acceptable bounds as new outputs are generated, and to determine how many outputs must be removed to restore fairness. 
We illustrate the framework on a toy example of biased image generation, showing how CTLF formulas can express concrete fairness properties at different points in the output series.","published_date":"2026-04-21T13:00:47+00:00","viability_score":2,"cluster_label":"Generative AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper introduces a formal logic framework for reasoning about and mitigating bias in generative AI outputs by analyzing branching time semantics.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19411v1","title":"GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes","abstract":"Understanding road scenes in a geometrically consistent, scene-centric representation is crucial for planning and mapping. We present GOLD-BEV, a framework that learns dense bird's-eye-view (BEV) semantic environment maps-including dynamic agents-from ego-centric sensors, using time-synchronized aerial imagery as supervision only during training. BEV-aligned aerial crops provide an intuitive target space, enabling dense semantic annotation with minimal manual effort and avoiding the ambiguity of ego-only BEV labeling. Crucially, strict aerial-ground synchronization allows overhead observations to supervise moving traffic participants and mitigates the temporal inconsistencies inherent to non-synchronized overhead sources. To obtain scalable dense targets, we generate BEV pseudo-labels using domain-adapted aerial teachers, and jointly train BEV segmentation with optional pseudo-aerial BEV reconstruction for interpretability. 
Finally, we extend beyond aerial coverage by learning to synthesize pseudo-aerial BEV images from ego sensors, which support lightweight human annotation and uncertainty-aware pseudo-labeling on unlabeled drives.","published_date":"2026-04-21T12:39:41+00:00","viability_score":4,"cluster_label":"Computer Vision","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"GOLD-BEV is a framework for dense semantic BEV mapping of dynamic scenes using synchronized aerial and ground data, with a novel approach to generate pseudo-aerial labels for scalable annotation.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19406v1","title":"HP-Edit: A Human-Preference Post-Training Framework for Image Editing","abstract":"Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset across eight common tasks and balancing common object editing. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer--an automatic, human preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. 
Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.","published_date":"2026-04-21T12:29:50+00:00","viability_score":7,"cluster_label":"Image Editing","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A post-training framework and dataset for aligning image editing models with human preferences, significantly improving output quality.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19404v1","title":"M$^{2}$GRPO: Mamba-based Multi-Agent Group Relative Policy Optimization for Biomimetic Underwater Robots Pursuit","abstract":"Traditional policy learning methods in cooperative pursuit face fundamental challenges in biomimetic underwater robots, where long-horizon decision making, partial observability, and inter-robot coordination require both expressiveness and stability. To address these issues, a novel framework called Mamba-based multi-agent group relative policy optimization (M$^{2}$GRPO) is proposed, which integrates a selective state-space Mamba policy with group-relative policy optimization under the centralized-training and decentralized-execution (CTDE) paradigm. Specifically, the Mamba-based policy leverages observation history to capture long-horizon temporal dependencies and exploits attention-based relational features to encode inter-agent interactions, producing bounded continuous actions through normalized Gaussian sampling. To further improve credit assignment without sacrificing stability, the group-relative advantages are obtained by normalizing rewards across agents within each episode and optimized through a multi-agent extension of GRPO, significantly reducing the demand for training resources while enabling stable and scalable policy updates. 
Extensive simulations and real-world pool experiments across team scales and evader strategies demonstrate that M$^{2}$GRPO consistently outperforms MAPPO and recurrent baselines in both pursuit success rate and capture efficiency. Overall, the proposed framework provides a practical and scalable solution for cooperative underwater pursuit with biomimetic robot systems.","published_date":"2026-04-21T12:29:00+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A Mamba-based multi-agent policy optimization framework for cooperative underwater robot pursuit, outperforming baselines in simulations and real-world tests.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19401v1","title":"Revisiting Catastrophic Forgetting in Continual Knowledge Graph Embedding","abstract":"Knowledge Graph Embeddings (KGEs) support a wide range of downstream tasks over Knowledge Graphs (KGs). In practice, KGs evolve as new entities and facts are added, motivating Continual Knowledge Graph Embedding (CKGE) methods that update embeddings over time. Current CKGE approaches address catastrophic forgetting (i.e., the performance degradation on previously learned tasks) primarily by limiting changes to existing embeddings. However, we show that this view is incomplete. When new entities are introduced, their embeddings can interfere with previously learned ones, causing the model to predict them in place of previously correct answers. This phenomenon, which we call entity interference, has been largely overlooked and is not accounted for in current CKGE evaluation protocols. As a result, the assessment of catastrophic forgetting becomes misleading, and the performance of CKGE methods is systematically overestimated. To address this issue, we introduce a corrected CKGE evaluation protocol that accounts for entity interference. 
Through experiments on multiple benchmarks, we show that ignoring this effect can lead to performance overestimation of up to 25%, particularly in scenarios with significant entity growth. We further analyze how different CKGE methods and KGE models are affected by the different sources of forgetting, and introduce a catastrophic forgetting metric tailored to CKGE.","published_date":"2026-04-21T12:26:56+00:00","viability_score":4,"cluster_label":"Knowledge Graphs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Identifies and addresses a critical 'entity interference' issue in continual knowledge graph embedding evaluation, revealing performance overestimations.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19398v1","title":"GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models","abstract":"Large language models (LLMs) are expensive to serve because model parameters, attention computation, and KV caches impose substantial memory and latency costs. We present GRASPrune, a structured pruning framework applied after pretraining that jointly prunes FFN channels and KV head groups under a single global budget. Instead of learning importance scores without constraints and applying the budget only after training, GRASPrune learns lightweight gate scores with a projected straight-through estimator that enforces a hard mask satisfying the budget at every step while keeping the backbone weights frozen. After the mask is fixed, we calibrate scaling factors on the retained units to mitigate scale mismatch caused by pruning, and fold these factors into the pruned weights to obtain a smaller dense checkpoint with no extra parameters at inference. 
On LLaMA-2-7B, GRASPrune removes 50% of parameters and achieves 12.18 perplexity on WikiText-2 while maintaining competitive average zero-shot accuracy on five benchmarks, using four epochs on 512 unlabeled calibration sequences on a single NVIDIA A100 80GB GPU without any full model fine-tuning.","published_date":"2026-04-21T12:26:16+00:00","viability_score":8,"cluster_label":"LLM Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"GRASPrune efficiently prunes large language models by 50% with minimal performance loss, reducing serving costs.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19383v1","title":"Multimodal Transformer for Sample-Aware Prediction of Metal-Organic Framework Properties","abstract":"Metal-organic frameworks (MOFs) are a major target of machine-learning-based property prediction, yet most models assume that a single framework representation maps to a single property value. This assumption becomes problematic for experimental MOFs, where samples reported as the same framework can exhibit different properties because of differences in crystallinity, phase purity, defects, and other sample-dependent factors. Here we introduce Experimental X-ray Diffraction Integrated Transformer (EXIT), a multimodal transformer for sample-aware prediction of MOF properties that combines MOFid with X-ray diffraction (XRD). In EXIT, MOFid encodes MOF identity, whereas XRD provides complementary information about the experimentally realized sample state. EXIT is pre-trained on one million hypothetical MOFs with simulated XRD to learn transferable representations, leading to improved downstream performance relative to existing approaches. EXIT is fine-tuned on literature-derived experimental datasets for surface area and pore volume prediction. 
Incorporating experimental XRD improves predictive performance relative to models without experimental XRD, and attention analysis and sample-level case studies further show that EXIT assigns different predictions to samples sharing the same MOF identity when their XRD patterns differ. These results establish a practical step from framework-aware to sample-aware MOF property prediction and highlight the value of incorporating experimental characterization into porous materials informatics.","published_date":"2026-04-21T12:06:52+00:00","viability_score":7,"cluster_label":"Materials Science AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multimodal transformer that predicts material properties by considering sample-specific experimental data, improving accuracy over traditional methods.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19377v1","title":"Towards Energy Impact on AI-Powered 6G IoT Networks: Centralized vs. Decentralized","abstract":"The emergence of sixth-generation (6G) technologies has introduced new challenges and opportunities for machine learning (ML) applications in Internet of Things (IoT) networks, particularly concerning energy efficiency. As model training and data transmission contribute significantly to energy consumption, optimizing these processes has become critical for sustainable system design. This study first analyzes the energy consumption model for both centralized and decentralized architectures and then presents a testbed deployed within the German railway infrastructure, leveraging sensor data for ML-based predictive maintenance. A comparative analysis of distributed versus Centralized Learning (CL) architectures reveals that distributed models maintain competitive predictive accuracy (~90%) while reducing overall electricity consumption by up to 70%. 
These findings underscore the potential of distributed ML to improve energy efficiency in real-world IoT deployments, particularly by mitigating transmission-related energy costs.","published_date":"2026-04-21T11:59:08+00:00","viability_score":4,"cluster_label":"IoT AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A comparative analysis of centralized vs. decentralized AI architectures for 6G IoT networks, showing distributed models reduce energy consumption by up to 70% while maintaining predictive accuracy.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19372v1","title":"TACENR: Task-Agnostic Contrastive Explanations for Node Representations","abstract":"Graph representation learning has achieved notable success in encoding graph-structured data into latent vector spaces, enabling a wide range of downstream tasks. However, these node representations remain opaque and difficult to interpret. Existing explainability methods primarily focus on supervised settings or on explaining individual representation dimensions, leaving a critical gap in explaining the overall structure of node representations. In this paper, we propose TACENR (Task-Agnostic Contrastive Explanations for Node Representations), a local explanation method that identifies not only attribute features but also proximity and structural ones that contribute the most in the representation space. TACENR builds on contrastive learning, through which we learn a similarity function in the representation space, revealing which are the features that play an important role in the representation of a node. While our focus is on task-agnostic explanations, TACENR can be applied to supervised scenarios as well. 
Experimental results demonstrate that proximity and structural features play a significant role in shaping node representations and that our supervised variant performs comparably to existing task-specific approaches in identifying the most impactful features.","published_date":"2026-04-21T11:54:29+00:00","viability_score":7,"cluster_label":"Graph AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"TACENR provides task-agnostic explanations for graph node representations by identifying key attribute, proximity, and structural features.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19355v1","title":"LASER: Learning Active Sensing for Continuum Field Reconstruction","abstract":"High-fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challenging under sparse and constrained sensing. Conventional reconstruction methods typically rely on fixed sensor layouts, which cannot adapt to evolving physical states. We propose LASER, a unified, closed-loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP). At its core, LASER employs a continuum field latent world model that captures the underlying physical dynamics and provides intrinsic reward feedback. This enables a reinforcement learning policy to simulate ''what-if'' sensing scenarios within a latent imagination space. By conditioning sensor movements on predicted latent states, LASER navigates toward potentially high-information regions beyond current observations. 
Our experiments demonstrate that LASER consistently outperforms static and offline-optimized strategies, achieving high-fidelity reconstruction under sparsity across diverse continuum fields.","published_date":"2026-04-21T11:36:09+00:00","viability_score":7,"cluster_label":"Active Sensing AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LASER is a closed-loop framework that uses reinforcement learning to actively guide sensor placement for high-fidelity continuum field reconstruction.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19354v1","title":"Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges","abstract":"Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. 
The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.","published_date":"2026-04-21T11:35:33+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An open-source benchmark and evaluation framework for LLM agents in realistic cybersecurity Capture The Flag challenges, revealing current limitations and providing a path for improvement.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19341v1","title":"Evaluation-driven Scaling for Scientific Discovery","abstract":"Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery, a problem this paper seeks to address. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, revealing substantial gains unlocked by scaling evaluation-driven discovery loops along the right dimensions. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. 
Particularly, we sped up the widely used LASSO algorithm by over 2x, designed quantum circuit routing policies that reduce gate overhead by 24.5%, and discovered new Erdos minimum overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Together, our results establish effective evaluation-driven loop scaling as a central axis for advancing LLM-driven scientific discovery, and provide a simple yet practical framework for realizing these gains.","published_date":"2026-04-21T11:24:09+00:00","viability_score":4,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for scaling evaluation-driven discovery loops in LLMs, demonstrating significant gains across scientific problems and improving model generalization.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19324v1","title":"PLaMo 2.1-VL Technical Report","abstract":"We introduce PLaMo 2.1-VL, a lightweight Vision Language Model (VLM) for autonomous devices, available in 8B and 2B variants and designed for local and edge deployment with Japanese-language operation. Focusing on Visual Question Answering (VQA) and Visual Grounding as its core capabilities, we develop and evaluate the models for two real-world application scenarios: factory task analysis via tool recognition, and infrastructure anomaly detection. We also develop a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. PLaMo 2.1-VL outperforms comparable open models on Japanese and English benchmarks, achieving 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. 
For the two application scenarios, it achieves 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data improves anomaly detection bbox + label F1-score from 39.7 to 64.9.","published_date":"2026-04-21T10:46:42+00:00","viability_score":7,"cluster_label":"Vision Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A lightweight, deployable Vision Language Model optimized for Japanese language and edge devices, excelling in VQA and anomaly detection for industrial applications.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19321v1","title":"RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models","abstract":"Fine-tuning Large Language Models (LLMs) remains structurally uncertain despite parameter-efficient methods such as Low-Rank Adaptation (LoRA), as the layer-specific roles of internal representations are poorly understood, leading to heuristic decisions about where adaptation should be applied. We model the evolution of hidden states as a high-dimensional geometric trajectory and propose using the Ramer-Douglas-Peucker (RDP) algorithm, a parameter-free and training-free polygon simplification method that preserves global structural transitions while eliminating locally redundant changes, to identify critical breakpoints along the representation path. Crucially, we use these geometric pivots not merely for analysis, but as a direct decision signal for determining which layers should be adapted during parameter-efficient fine-tuning. By integrating this geometry-aware layer selection strategy into LoRA fine-tuning of Qwen3-8B-Base, we achieve superior performance on MMLU-Math using only 13 RDP-selected layers (81.67%), significantly outperforming both full 36-layer adaptation (79.32%) and random 13-layer selection (75.56%), as well as the baseline Qwen3-8B-Base model (74.25%). 
These results demonstrate that leveraging the intrinsic geometry of representation trajectories provides a robust, interpretable, and training-free signal for optimizing layer selection during model adaptation.","published_date":"2026-04-21T10:29:42+00:00","viability_score":7,"cluster_label":"LLM Adaptation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A geometry-driven method for parameter-efficient LLM adaptation that identifies critical layers using the Ramer-Douglas-Peucker algorithm, significantly improving performance with fewer adapted parameters.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19309v1","title":"Co-Refine: AI-Powered Tool Supporting Qualitative Analysis","abstract":"Qualitative coding relies on a researcher's application of codes to textual data. As coding proceeds across large datasets, interpretations of codes often shift (temporal drift), reducing the credibility of the analysis. Existing Computer-Assisted Qualitative Data Analysis (CAQDAS) tools provide support for data management but offer no workflow for real-time detection of these drifts. We present Co-Refine, an AI-augmented qualitative coding platform that delivers continuous, grounded feedback on coding consistency without disrupting the researcher's workflow. The system employs a three-stage audit pipeline: Stage 1 computes deterministic embedding-based metrics for mathematical consistency; Stage 2 grounds LLM verdicts within $\\pm0.15$ of the deterministic scores; and Stage 3 produces code definitions from previous patterns to create a deepening feedback loop. 
Co-Refine demonstrates that deterministic scoring can effectively constrain LLM outputs to produce reliable, real-time audit signals for qualitative analysis.","published_date":"2026-04-21T10:16:13+00:00","viability_score":7,"cluster_label":"AI-Powered Tools","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An AI platform that provides real-time feedback on coding consistency for qualitative researchers, reducing interpretation drift.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19301v1","title":"Large Language Models Exhibit Normative Conformity","abstract":"The conformity bias exhibited by large language models (LLMs) can pose a significant challenge to decision-making in LLM-based multi-agent systems (LLM-MAS). While many prior studies have treated \"conformity\" simply as a matter of opinion change, this study introduces the social psychological distinction between informational conformity and normative conformity in order to understand LLM conformity at the mechanism level. Specifically, we design new tasks to distinguish between informational conformity, in which participants in a discussion are motivated to make accurate judgments, and normative conformity, in which participants are motivated to avoid conflict or gain acceptance within a group. We then conduct experiments based on these task settings. The experimental results show that, among the six LLMs evaluated, up to five exhibited tendencies toward not only informational conformity but also normative conformity. Furthermore, intriguingly, we demonstrate that by manipulating subtle aspects of the social context, it may be possible to control the target toward which a particular LLM directs its normative conformity. These findings suggest that decision-making in LLM-MAS may be vulnerable to manipulation by a small number of malicious users. 
In addition, through analysis of internal vectors associated with informational and normative conformity, we suggest that although both behaviors appear externally as the same form of \"conformity,\" they may in fact be driven by distinct internal mechanisms. Taken together, these results may serve as an initial milestone toward understanding how \"norms\" are implemented in LLMs and how they influence group dynamics.","published_date":"2026-04-21T10:06:25+00:00","viability_score":3,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Investigates normative conformity in LLMs within multi-agent systems, suggesting potential for manipulation.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19300v1","title":"HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models","abstract":"Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain. Existing hallucination benchmarks mainly focus on text or vision, while the few audio-oriented studies are limited in scale, modality coverage, and diagnostic depth. We therefore introduce HalluAudio, the first large-scale benchmark for evaluating hallucinations across speech, environmental sound, and music. HalluAudio comprises over 5K human-verified QA pairs and spans diverse task types, including binary judgments, multi-choice reasoning, attribute verification, and open-ended QA. To systematically induce hallucinations, we design adversarial prompts and mixed-audio conditions. Beyond accuracy, our evaluation protocol measures hallucination rate, yes/no bias, error-type analysis, and refusal rate, enabling a fine-grained analysis of LALM failure modes. 
We benchmark a broad range of open-source and proprietary models, providing the first large-scale comparison across speech, sound, and music. Our results reveal significant deficiencies in acoustic grounding, temporal reasoning, and music attribute understanding, underscoring the need for reliable and robust LALMs.","published_date":"2026-04-21T10:05:28+00:00","viability_score":4,"cluster_label":"Audio LLMs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Introduces HalluAudio, a benchmark for detecting hallucinations in large audio-language models across speech, sound, and music.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19299v1","title":"Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms","abstract":"Despite the impressive capabilities of large language models, their substantial computational costs, latency, and privacy risks hinder their widespread deployment in real-world applications. Small Language Models (SLMs) with fewer than 10 billion parameters present a promising alternative; however, their inherent limitations in knowledge and reasoning curtail their effectiveness. Existing research primarily focuses on enhancing SLMs through scaling laws or fine-tuning strategies while overlooking the potential of using agent paradigms, such as tool use and multi-agent collaboration, to systematically compensate for the inherent weaknesses of small models. To address this gap, this paper presents the first large-scale, comprehensive study of <10B open-source models under three paradigms: (1) the base model, (2) a single agent equipped with tools, and (3) a multi-agent system with collaborative capabilities. Our results show that single-agent systems achieve the best balance between performance and cost, while multi-agent setups add overhead with limited gains. 
Our findings highlight the importance of agent-centric design for efficient and trustworthy deployment in resource-constrained settings.","published_date":"2026-04-21T10:05:10+00:00","viability_score":4,"cluster_label":"LLM Deployment","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Analyzes the trade-offs of deploying small language models with agent paradigms for cost-efficient real-world applications.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2604.19298v1","title":"IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text","abstract":"We introduce IndiaFinBench, to our knowledge the first publicly available evaluation benchmark for assessing large language model (LLM) performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora (SEC filings, US earnings reports, and English-language financial news), leaving a significant gap in coverage of non-Western regulatory frameworks. IndiaFinBench addresses this gap with 406 expert-annotated question-answer pairs drawn from 192 documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), spanning four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Annotation quality is validated through a model-based secondary pass (kappa=0.918 on contradiction detection) and a 60-item human inter-annotator agreement evaluation (kappa=0.611; 76.7% overall agreement). We evaluate twelve models under zero-shot conditions, with accuracy ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash). All models substantially outperform a non-specialist human baseline of 60.0%. Numerical reasoning is the most discriminative task, with a 35.9 percentage-point spread across models. 
Bootstrap significance testing (10,000 resamples) reveals three statistically distinct performance tiers. The dataset, evaluation code, and all model outputs are available at https://github.com/rajveerpall/IndiaFinBench","published_date":"2026-04-21T10:04:49+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark and evaluation framework for assessing LLM performance on Indian financial regulatory text, with publicly available code and dataset.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.19292v1","title":"Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs","abstract":"Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we aim to quantify models' inter- and intra-lingual biases, via their ability to answer locale-ambiguous questions. To this end, we present LocQA, a test set containing 2,156 questions in 12 languages, referring to various locale-dependent facts such as laws, dates, and measurements. The questions do not contain indications of the locales they relate to, other than the querying language itself. LLMs' responses to LocQA locale-ambiguous questions thus reveal models' implicit priors. We used LocQA to evaluate 32 models, and detected two types of structural biases. Inter-lingually, we show a global bias towards answers relevant to the US-locale, even when models are asked in languages other than English. Moreover, we discovered that this global bias is exacerbated in models that underwent instruction tuning, compared to their base counterparts. 
Intra-lingually, we show that when multiple locales are relevant for the same language, models act as demographic probability engines, prioritizing locales with larger populations. Taken together, insights from LocQA may help in shaping LLMs' desired local behavior, and in quantifying the impact of various training phases on different kinds of biases.","published_date":"2026-04-21T09:57:41+00:00","viability_score":7,"cluster_label":"LLM Bias Detection","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new multilingual test set and evaluation methodology to expose and quantify implicit local and global biases in LLMs.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.19281v1","title":"Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications","abstract":"The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. 
Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.","published_date":"2026-04-21T09:50:08+00:00","viability_score":7,"cluster_label":"Medical AI Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel component-wise evaluation framework for medical QA systems that reveals significant health equity risks and model performance disparities.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.19278v1","title":"Explicit Trait Inference for Multi-Agent Coordination","abstract":"LLM-based multi-agent systems (MAS) show promise on complex tasks but remain prone to coordination failures such as goal drift, error cascades, and misaligned behaviors. We propose Explicit Trait Inference (ETI), a psychologically grounded method for improving coordination. ETI enables agents to infer and track partner characteristics along two established psychological dimensions--warmth (e.g., trust) and competence (e.g., skill)--from interaction histories to guide decisions. 
We evaluate ETI in controlled settings (economic games), where it reduces payoff loss by 45-77%, and in more realistic, complex multi-agent settings (MultiAgentBench), where it improves performance by 3-29% depending on the scenario and model, relative to a CoT baseline. Additional analysis shows that gains are closely linked to trait inference: ETI profiles predict agents' actions, and informative profiles drive improvements. These results highlight ETI as a lightweight and robust mechanism for improving coordination in diverse multi-agent settings, and provide the first systematic evidence that LLM agents can (i) reliably infer others' traits from interaction histories and (ii) leverage structured awareness of others' traits for coordination.","published_date":"2026-04-21T09:48:23+00:00","viability_score":4,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A psychologically grounded method for LLM agents to infer and track partner traits, significantly improving multi-agent coordination.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2604.19262v1","title":"CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks","abstract":"Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks -- where models must reason within real-world, context-rich scenarios -- largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs' multilingual and multicultural competence on grounded tasks. CulturALL is built via a human--AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. 
By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement.","published_date":"2026-04-21T09:21:46+00:00","viability_score":4,"cluster_label":"LLM Benchmarking","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark, CulturALL, evaluates LLMs' multilingual and multicultural competence on complex, real-world tasks, revealing significant room for improvement.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19254v1","title":"ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning","abstract":"Parameter-efficient fine-tuning (PEFT) reduces the training cost of full-parameter fine-tuning for large language models (LLMs) by training only a small set of task-specific parameters while freezing the pretrained backbone. However, existing approaches, such as Low-Rank Adaptation (LoRA), achieve adaptation by inserting independent low-rank perturbations directly to individual weights, resulting in a local parameterization of adaptation. We propose ShadowPEFT, a centralized PEFT framework that instead performs layer-level refinement through a depth-shared shadow module. At each transformer layer, ShadowPEFT maintains a parallel shadow state and evolves it repeatedly for progressively richer hidden states. This design shifts adaptation from distributed weight-space perturbations to a shared layer-space refinement process. Since the shadow module is decoupled from the backbone, it can be reused across depth, independently pretrained, and optionally deployed in a detached mode, benefiting edge computing scenarios. 
Experiments on generation and understanding benchmarks show that ShadowPEFT matches or outperforms LoRA and DoRA under comparable trainable-parameter budgets. Additional analyses on shadow pretraining, cross-dataset transfer, parameter scaling, inference latency, and system-level evaluation suggest that centralized layer-space adaptation is a competitive and flexible alternative to conventional low-rank PEFT.","published_date":"2026-04-21T09:17:35+00:00","viability_score":7,"cluster_label":"LLM Fine-Tuning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ShadowPEFT introduces a centralized parameter-efficient fine-tuning framework for LLMs that reuses a shared module across layers, improving efficiency and flexibility.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19251v1","title":"Streamliners for Answer Set Programming","abstract":"Streamliner constraints reduce the search space of combinatorial problems by ruling out portions of the solution space. We adapt the StreamLLM approach, which uses Large Language Models (LLMs) to generate streamliners for Constraint Programming, to Answer Set Programming (ASP). Given an ASP encoding and a few small training instances, we prompt multiple LLMs to propose candidate constraints. Candidates that cause syntax errors, render satisfiable instances unsatisfiable, or degrade performance on all training instances are discarded. The surviving streamliners are evaluated together with the original encoding, and we report results for a virtual best encoding (VBE) that, for each instance, selects the fastest among the original encoding and its streamlined variants. On three ASP Competition benchmarks (Partner Units Problem, Sokoban, Towers of Hanoi), the VBE achieves speedups of up to 4--5x over the original encoding. 
Different LLMs produce semantically diverse constraints, not mere syntactic variations, indicating that the approach captures genuine problem structure.","published_date":"2026-04-21T09:10:04+00:00","viability_score":4,"cluster_label":"AI for Programming","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper adapts LLMs to generate 'streamliner' constraints for Answer Set Programming, improving performance on benchmark problems.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19245v1","title":"Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs","abstract":"Repair, an important resource for resolving trouble in human-human conversation, remains underexplored in human-LLM interaction. In this study, we investigate how LLMs engage in the interactive process of repair in multi-turn dialogues around solvable and unsolvable math questions. We examine whether models initiate repair themselves and how they respond to user-initiated repair. Our results show strong differences across models: reactions range from being almost completely resistant to (appropriate) repair attempts to being highly susceptible and easily manipulated. We further demonstrate that once conversations extend beyond a single turn, model behavior becomes more distinctive and less predictable across systems. 
Overall, our findings indicate that each tested LLM exhibits its own characteristic form of unreliability in the context of repair.","published_date":"2026-04-21T08:50:50+00:00","viability_score":4,"cluster_label":"LLM Behavior Analysis","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This study reveals significant differences in how LLMs handle conversational repair, highlighting their distinct and sometimes unreliable multi-turn behavior.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19240v1","title":"Industrial Surface Defect Detection via Diffusion Generation and Asymmetric Student-Teacher Network","abstract":"Industrial surface defect detection often suffers from limited defect samples, severe long-tailed distributions, and difficulties in accurately localizing subtle defects under complex backgrounds. To address these challenges, this paper proposes an unsupervised defect detection method that integrates a Denoising Diffusion Probabilistic Model (DDPM) with an asymmetric teacher-student architecture. First, at the data level, the DDPM is trained solely on normal samples. By introducing constant-variance Gaussian perturbations and Perlin noise-based masks, high-fidelity and physically consistent defect samples along with pixel-level annotations are generated, effectively alleviating the data scarcity problem. Second, at the model level, an asymmetric dual-stream network is constructed. The teacher network provides stable representations of normal features, while the student network reconstructs normal patterns and amplifies discrepancies between normal and anomalous regions. Finally, a joint optimization strategy combining cosine similarity loss and pixel-wise segmentation supervision is adopted to achieve precise localization of subtle defects. 
Experimental results on the MVTecAD dataset show that the proposed method achieves 98.4% image-level AUROC and 98.3% pixel-level AUROC, significantly outperforming existing unsupervised and mainstream deep learning methods. The proposed approach does not require large amounts of real defect samples and enables accurate and robust industrial defect detection and localization.","published_date":"2026-04-21T08:47:32+00:00","viability_score":7,"cluster_label":"Industrial AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An unsupervised industrial defect detection system using diffusion models for data generation and an asymmetric teacher-student network for precise localization.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19221v1","title":"UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction","abstract":"Full-duplex speech interaction, as the most natural and intuitive mode of human communication, is driving artificial intelligence toward more human-like conversational systems. Traditional cascaded speech processing pipelines suffer from critical limitations, including accumulated latency, information loss, and error propagation across modules. To address these issues, recent efforts focus on end-to-end audio large language models (LLMs) like GPT-4o, which primarily unify speech understanding and generation tasks. However, most of these models are inherently half-duplex and rely on a suite of separate, task-specific front-end components, such as voice activity detection (VAD) and turn-taking detection (TD). In our development of a speech assistant, we observed that optimizing the speech front-end is as crucial as advancing the back-end unified model for achieving seamless, responsive interactions. 
To bridge this gap, we propose the first unified audio front-end LLM (UAF) tailored for full-duplex speech systems. Our model reformulates diverse audio front-end tasks, including VAD, TD, speaker recognition (SR), automatic speech recognition (ASR), and question answering (QA), into a single auto-regressive sequence prediction problem. It takes streaming fixed-duration audio chunks (e.g., 600 ms) as input, leverages a reference audio prompt to anchor the target speaker at the beginning, and auto-regressively generates discrete tokens encoding both semantic content and system-level state controls (e.g., interruption signals). Experiments demonstrate that our model achieves leading performance across multiple audio front-end tasks and significantly improves response latency and interruption accuracy in real-world interaction scenarios.","published_date":"2026-04-21T08:24:55+00:00","viability_score":7,"cluster_label":"Conversational AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified audio front-end LLM that handles voice activity detection, turn-taking, speaker recognition, and ASR for seamless full-duplex speech interaction.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19219v1","title":"Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers","abstract":"Federated Learning (FL) enables collaborative model training among multiple parties without centralizing raw data. There are two main paradigms in FL: Horizontal FL (HFL), where all participants share the same feature space but hold different samples, and Vertical FL (VFL), where parties possess complementary features for the same set of samples. A prerequisite for VFL training is privacy-preserving entity alignment (PPEA), which establishes a common index of samples across parties (alignment) without revealing which samples are shared between them. 
Conventional private set intersection (PSI) achieves alignment but leaks intersection membership, exposing sensitive relationships between datasets. The standard private set union (PSU) mitigates this risk by aligning on the union of identifiers rather than the intersection. However, existing approaches are often limited to two parties or lack support for typo-tolerant matching.   In this paper, we introduce the Sherpa.ai multi-party PSU protocol for VFL, a PPEA method that hides intersection membership and enables both exact and noisy matching. The protocol generalizes two-party approaches to multiple parties with low communication overhead and offers two variants: an order-preserving version for exact alignment and an unordered version tolerant to typographical and formatting discrepancies. We prove correctness and privacy, analyze communication and computational (exponentiation) complexity, and formalize a universal index mapping from local records to a shared index space. This multi-party PSU offers a scalable, mathematically grounded protocol for PPEA in real-world VFL deployments, such as multi-institutional healthcare disease detection, collaborative risk modeling between banks and insurers, and cross-domain fraud detection between telecommunications and financial institutions, while preserving intersection privacy.","published_date":"2026-04-21T08:24:07+00:00","viability_score":6,"cluster_label":"Privacy-Preserving AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multi-party privacy-preserving entity alignment protocol for Vertical Federated Learning that hides intersection membership and supports noisy identifiers.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19217v1","title":"Attention-based Multi-modal Deep Learning Model of Spatio-temporal Crop Yield Prediction with Satellite, Soil and Climate Data","abstract":"Crop yield prediction is one of the most important challenge, which is crucial to 
world food security and policy-making decisions. Conventional forecasting techniques are limited in accuracy because they rely on static data sources that do not reflect the dynamic, intricate relationships among environmental variables over time [5,13]. This paper presents the Attention-Based Multi-Modal Deep Learning Framework (ABMMDLF) for high-accuracy spatio-temporal crop yield prediction.   The model combines multi-year satellite imagery, high-resolution meteorological time series, and initial soil properties, in contrast to traditional models that use only one of these factors [12, 21]. The architecture uses Convolutional Neural Networks (CNNs) to extract spatial features and a Temporal Attention Mechanism to adaptively weight important phenological periods, with weights that change over time and are conditioned on spatial features of images and video sequences. Experimentally, the proposed framework achieves an R^2 score of 0.89, well above the baseline models.","published_date":"2026-04-21T08:23:50+00:00","viability_score":4,"cluster_label":"Agricultural AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An attention-based multi-modal deep learning framework for spatio-temporal crop yield prediction using satellite, soil, and climate data.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19211v1","title":"ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation","abstract":"Current AI agent frameworks have made remarkable progress in automating individual tasks, yet all existing systems serve a single user. Human productivity rests on the social and organizational relationships through which people coordinate, negotiate, and delegate. 
When agents move beyond performing tasks for one person to representing that person in collaboration with others, the infrastructure for cross-user agent collaboration is entirely absent, let alone the governance mechanisms needed to secure it. We argue that the next frontier for AI agents lies not in stronger individual capability, but in the digitization of human collaborative relationships. To this end, we propose a human-symbiotic agent paradigm. Each user owns a permanently bound agent system that collaborates on the owner's behalf, forming a network whose nodes are humans rather than agents. This paradigm rests on three governance primitives. A layered identity architecture separates a Manager Agent from multiple context-specific Identity Agents; the Manager Agent holds global knowledge but is architecturally isolated from external communication. Scoped authorization enforces per-identity access control and escalates boundary violations to the owner. Action-level accountability logs every operation against its owner's identity and authorization, ensuring full auditability. We instantiate this paradigm in ClawNet, an identity-governed agent collaboration framework that enforces identity binding and authorization verification through a central orchestrator, enabling multiple users to collaborate securely through their respective agents.","published_date":"2026-04-21T08:15:05+00:00","viability_score":2,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A human-symbiotic agent paradigm for cross-user autonomous cooperation with layered identity and scoped authorization.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19191v1","title":"Improved Anomaly Detection in Medical Images via Mean Shift Density Enhancement","abstract":"Anomaly detection in medical imaging is essential for identifying rare pathological conditions, particularly when annotated abnormal samples are limited. 
We propose a hybrid anomaly detection framework that integrates self-supervised representation learning with manifold-based density estimation, a combination that remains largely unexplored in this domain.   Medical images are first embedded into a latent feature space using pretrained, potentially domain-specific, backbones. These representations are then refined via Mean Shift Density Enhancement (MSDE), an iterative manifold-shifting procedure that moves samples toward regions of higher likelihood. Anomaly scores are subsequently computed using Gaussian density estimation in a PCA-reduced latent space, where Mahalanobis distance measures deviation from the learned normal distribution. The framework follows a one-class learning paradigm and requires only normal samples for training.   Extensive experiments on seven medical imaging datasets demonstrate state-of-the-art performance. MSDE achieves the highest AUC on four datasets and the highest Average Precision on five datasets, including near-perfect performance on brain tumor detection (0.981 AUC/AP). These results underscore the potential of the proposed framework as a scalable clinical decision-support tool for early disease detection, screening in low-label settings, and robust deployment across diverse imaging modalities.","published_date":"2026-04-21T07:58:44+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hybrid anomaly detection framework for medical imaging using self-supervised learning and Mean Shift Density Enhancement, achieving state-of-the-art results.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19186v1","title":"Inductive Subgraphs as Shortcuts: Causal Disentanglement for Heterophilic Graph Learning","abstract":"Heterophily is a prevalent property of real-world graphs and is well known to impair the performance of homophilic Graph Neural Networks (GNNs). 
Prior work has attempted to adapt GNNs to heterophilic graphs through non-local neighbor extension or architecture refinement. However, the fundamental reasons behind misclassifications remain poorly understood. In this work, we take a novel perspective by examining recurring inductive subgraphs, empirically and theoretically showing that they act as spurious shortcuts that mislead GNNs and reinforce non-causal correlations in heterophilic graphs. To address this, we adopt a causal inference perspective to analyze and correct the biased learning behavior induced by shortcut inductive subgraphs. We propose a debiased causal graph that explicitly blocks confounding and spillover paths responsible for these shortcuts. Guided by this causal graph, we introduce Causal Disentangled GNN (CD-GNN), a principled framework that disentangles spurious inductive subgraphs from true causal subgraphs by explicitly blocking non-causal paths. By focusing on genuine causal signals, CD-GNN substantially improves the robustness and accuracy of node classification in heterophilic graphs. Extensive experiments on real-world datasets not only validate our theoretical findings but also demonstrate that our proposed CD-GNN outperforms state-of-the-art heterophily-aware baselines.","published_date":"2026-04-21T07:55:19+00:00","viability_score":6,"cluster_label":"Graph Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A causal inference framework, CD-GNN, that disentangles spurious inductive subgraphs to improve robustness and accuracy in heterophilic graph learning.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.19185v1","title":"SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization","abstract":"Small language models (SLMs), such as BART, can achieve summarization performance comparable to large language models (LLMs) via distillation. 
However, existing LLM-based ranking strategies for summary candidates suffer from instability, while classical metrics (e.g., ROUGE) are insufficient to rank high-quality summaries. To address these issues, we introduce SCURank, a framework that enhances summarization by leveraging Summary Content Units (SCUs). Instead of relying on unstable comparisons or surface-level overlap, SCURank evaluates summaries based on the richness and semantic importance of information content. We investigate the effectiveness of SCURank in distilling summaries from multiple diverse LLMs. Experimental results demonstrate that SCURank outperforms traditional metrics and LLM-based ranking methods across evaluation measures and datasets. Furthermore, our findings show that incorporating diverse LLM summaries enhances model abstractiveness and overall distilled model performance, validating the benefits of information-centric ranking in multi-LLM distillation. The code for SCURank is available at https://github.com/IKMLab/SCURank.","published_date":"2026-04-21T07:51:05+00:00","viability_score":8,"cluster_label":"LLM Applications","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SCURank enhances summarization by ranking candidate summaries based on content units, outperforming traditional metrics and LLM-based methods.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19172v1","title":"Reasoning-Aware AIGC Detection via Alignment and Reinforcement","abstract":"The rapid advancement and widespread adoption of Large Language Models (LLMs) have elevated the need for reliable AI-generated content (AIGC) detection, which remains challenging as models evolve. We introduce AIGC-text-bank, a comprehensive multi-domain dataset with diverse LLM sources and authorship scenarios, and propose REVEAL, a detection framework that generates interpretable reasoning chains before classification. 
Our approach uses a two-stage training strategy: supervised fine-tuning to establish reasoning capabilities, followed by reinforcement learning to improve accuracy, improve logical consistency, and reduce hallucinations. Extensive experiments show that REVEAL achieves state-of-the-art performance across multiple benchmarks, offering a robust and transparent solution for AIGC detection. The project is open-source at https://aka.ms/reveal","published_date":"2026-04-21T07:29:55+00:00","viability_score":8,"cluster_label":"AI Generated Content Detection","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A reasoning-aware framework for detecting AI-generated content with interpretable explanations, leveraging a novel dataset and reinforcement learning for improved accuracy and transparency.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19167v1","title":"LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation","abstract":"Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization through a novel three-stage quantization strategy. The framework proceeds as follows: (1) initialize a high-quality quantized model via PTQ; (2) quantize binarized weights, group-wise bitmaps, and quantization parameters through layer-wise distillation while keeping activations in full precision; and (3) training learnable activation quantization factors to dynamically quantize activations to 4 bits. This decoupled design mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy. 
LBLLM, trained using only 0.016B tokens with a single GPU, surpasses existing state-of-the-art binarization methods on W2A4 quantization settings across tasks of language modeling, commonsense QA, and language understanding. These results demonstrate that extreme low-bit quantization of LLMs can be both practical and highly effective without introducing any extra high-precision channels or rotational matrices commonly used in recent PTQ-based works, offering a promising path toward efficient LLM deployment in resource-limited situations.","published_date":"2026-04-21T07:25:02+00:00","viability_score":7,"cluster_label":"LLM Quantization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A lightweight binarization framework for large language models that enables efficient deployment in resource-constrained environments through a novel three-stage distillation strategy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19149v1","title":"How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning","abstract":"Thinking LLMs produce reasoning traces before answering. Prior activation steering work mainly targets shaping these traces. It remains less understood how answer tokens actually read and integrate the reasoning to produce reliable outcomes. Focusing on quantitative reasoning, we analyze the answer-to-reasoning attention and observe a benign self-reading pattern aligned with correctness, characterized by a forward drift of the reading focus along the reasoning trace and a persistent concentration on key semantic anchors, whereas incorrect solutions exhibit diffuse and irregular attention patterns. We interpret this as internal certainty during answer decoding, where the model commits to a viable solution branch and integrates key evidence. 
Following this, we propose a training-free steering method driven by Self-Reading Quality (SRQ) scores combining geometric metrics for process control with semantic metrics for content monitoring. SRQ selects data to build steering vectors that guide inference toward benign self-reading and away from uncertain and disorganized reading. Experiments show that our method yields consistent accuracy gains.","published_date":"2026-04-21T06:55:17+00:00","viability_score":3,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Analyzing how answer tokens in LLMs interact with reasoning traces to improve quantitative reasoning accuracy through a training-free steering method based on self-reading quality.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19147v1","title":"Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling","abstract":"Scaling Transformers typically necessitates training larger models from scratch, as standard architectures struggle to expand without discarding learned representations. We identify the primary bottleneck in the attention mechanism's linear projections, which strictly confine feature extraction to fixed-dimensional subspaces, limiting both expressivity and incremental capacity. To address this, we introduce Nexusformer, which replaces linear $Q/K/V$ projections with a Nexus-Rank layer, a three-stage nonlinear mapping driven by dual activations in progressively higher dimensional spaces. This design overcomes the linearity constraint and enables lossless structured growth: new capacity can be injected along two axes via zero-initialized blocks that preserve pretrained knowledge. Experiments on language modeling and reasoning benchmarks demonstrate that Nexusformer matches Tokenformer's perplexity using up to 41.5\\% less training compute during progressive scaling (240M to 440M). 
Furthermore, our analysis of growth dynamics reveals that zero initialization induces a stable convergence trajectory, allowing us to derive a geometric scaling law that accurately predicts performance across expansion scales.","published_date":"2026-04-21T06:54:16+00:00","viability_score":7,"cluster_label":"Transformer Scaling","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Nexusformer introduces nonlinear attention expansion for stable and inheritable transformer scaling, enabling lossless growth and improved efficiency without retraining from scratch.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19145v1","title":"ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving","abstract":"Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video input. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. 
These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes new state-of-the-art for training-free token pruning. Notably, even at 90\\% token reduction, ST-Prune achieves near-lossless performance with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.","published_date":"2026-04-21T06:51:08+00:00","viability_score":8,"cluster_label":"Vision-Language Models for Autonomous Driving","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A training-free, plug-and-play framework that significantly reduces computational overhead for vision-language models in autonomous driving by intelligently pruning spatio-temporal tokens, achieving near-lossless performance at 90% reduction.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19139v1","title":"The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models","abstract":"As Large Language Models (LLMs) continue to evolve through alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, a growing and increasingly conspicuous phenomenon has emerged: the proliferation of verbal tics -- repetitive, formulaic linguistic patterns that pervade model outputs. These range from sycophantic openers (\"That's a great question!\", \"Awesome!\") to pseudo-empathetic affirmations (\"I completely understand your concern\", \"I'm right here to catch you\") and overused vocabulary (\"delve\", \"tapestry\", \"nuanced\"). 
In this paper, we present a systematic analysis of the verbal tic phenomenon across eight state-of-the-art LLMs: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.2, Doubao-Seed-2.0-pro, Kimi K2.5, DeepSeek V3.2, and MiMo-V2-Pro. Utilizing a custom evaluation framework for standardized API-based evaluation, we assess 10,000 prompts across 10 task categories in both English and Chinese, yielding 160,000 model responses. We introduce the Verbal Tic Index (VTI), a composite metric quantifying tic prevalence, and analyze its correlation with sycophancy, lexical diversity, and human-perceived naturalness. Our findings reveal significant inter-model variation: Gemini 3.1 Pro exhibits the highest VTI (0.590), while DeepSeek V3.2 achieves the lowest (0.295). We further demonstrate that verbal tics accumulate over multi-turn conversations, are amplified in subjective tasks, and show distinct cross-lingual patterns. Human evaluation (N = 120) confirms a strong inverse relationship between sycophancy and perceived naturalness (r = -0.87, p < 0.001). These results underscore the \"alignment tax\" of current training paradigms and highlight the urgent need for more authentic human-AI interaction frameworks.","published_date":"2026-04-21T06:43:01+00:00","viability_score":7,"cluster_label":"LLM Analysis","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A tool to systematically analyze and quantify verbal tics in LLMs, enabling developers to build more natural and less repetitive AI interactions.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19131v1","title":"Has Automated Essay Scoring Reached Sufficient Accuracy? Deriving Achievable QWK Ceilings from Classical Test Theory","abstract":"Automated essay scoring (AES) is commonly evaluated on public benchmarks using quadratic weighted kappa (QWK). 
However, because benchmark labels are assigned by human raters and inevitably contain scoring errors, it remains unclear both what QWK is theoretically attainable and what level is practically sufficient for deployment. We therefore derive two dataset-specific QWK ceilings based on the reliability concept in classical test theory, which can be estimated from standard two-rater benchmarks without additional annotation. The first is the theoretical ceiling: the maximum QWK that an ideal AES model that perfectly predicts latent true scores can achieve under label noise. The second is the human-like ceiling: the QWK attainable by an AES model with human-level scoring error, providing a practical target when AES is intended to replace a single human rater. We further show that human--human QWK, often used as a ceiling reference, can underestimate the true ceiling. Simulation experiments validate the proposed ceilings, and experiments on real benchmarks illustrate how they clarify the current performance and remaining headroom of modern AES models.","published_date":"2026-04-21T06:29:13+00:00","viability_score":4,"cluster_label":"AI Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research establishes theoretical and practical accuracy ceilings for automated essay scoring, providing a new benchmark for evaluating and deploying AI in educational assessment.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19118v1","title":"DP-FlogTinyLLM: Differentially private federated log anomaly detection using Tiny LLMs","abstract":"Modern distributed systems generate massive volumes of log data that are critical for detecting anomalies and cyber threats. However, in real world settings, these logs are often distributed across multiple organizations and cannot be centralized due to privacy and security constraints. 
Existing log anomaly detection methods, including recent large language model (LLM) based approaches, largely rely on centralized training and are not suitable for such environments. In this paper, we propose DP-FLogTinyLLM, a privacy preserving federated framework for log anomaly detection using parameter efficient LLMs. Our approach enables collaborative learning without sharing raw log data by integrating federated optimization with differential privacy. To ensure scalability in resource constrained environments, we employ low rank adaptation (LoRA) for efficient fine tuning of Tiny LLMs at each client. Empirical results on the Thunderbird and BGL datasets show that the proposed framework matches the performance of centralized LLM based methods, while incurring additional computational overhead due to privacy mechanisms. Compared to existing federated baselines, DP-FLogTinyLLM consistently achieves higher precision and F1-score, with particularly strong gains on the Thunderbird dataset, highlighting its effectiveness in detecting anomalies while minimizing false positives.","published_date":"2026-04-21T05:56:51+00:00","viability_score":7,"cluster_label":"Federated Learning for Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A privacy-preserving federated learning framework using efficient LLMs for log anomaly detection in distributed systems.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19113v1","title":"Think Before Writing: Feature-Level Multi-Objective Optimization for Generative Citation Visibility","abstract":"Generative answer engines expose content through selective citation rather than ranked retrieval, fundamentally altering how visibility is determined. This shift calls for new optimization methods beyond traditional search engine optimization. 
Existing generative engine optimization (GEO) approaches primarily rely on token-level text rewriting, offering limited interpretability and weak control over the trade-off between citation visibility and content quality. We propose FeatGEO, a feature-level, multi-objective optimization framework that abstracts webpages into interpretable structural, content, and linguistic properties. Instead of directly editing text, FeatGEO optimizes over this feature space and uses a language model to realize feature configurations into natural language, decoupling high-level optimization from surface-level generation. Experiments on GEO-Bench across three generative engines demonstrate that FeatGEO consistently improves citation visibility while maintaining or improving content quality, substantially outperforming token-level baselines. Further analyses show that citation behavior is more strongly influenced by document-level content properties than by isolated lexical edits, and that the learned feature configurations generalize across language models of different scales.","published_date":"2026-04-21T05:47:04+00:00","viability_score":7,"cluster_label":"Generative AI Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"FeatGEO optimizes generative answer engine visibility by abstracting webpages into interpretable features, outperforming token-level methods while maintaining content quality.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19106v1","title":"Design Rules for Extreme-Edge Scientific Computing on AI Engines","abstract":"Extreme-edge scientific applications use machine learning models to analyze sensor data and make real-time decisions. Their stringent latency and throughput requirements demand small batch sizes and require that model weights remain fully on-chip. Spatial dataflow implementations are common for extreme-edge applications. 
Spatial dataflow works well for small networks, but it fails to scale to larger models due to inherent resource scaling limitations. AI Engines on modern FPGA SoCs offer a promising alternative with high compute density and additional on-chip memory. However, the architecture, programming model, and performance-scaling behavior of AI Engines differ fundamentally from those of the programmable logic, making direct comparison non-trivial and the benefits of using AI Engines unclear. This work addresses how and when extreme-edge scientific neural networks should be implemented on AI Engines versus programmable logic. We provide systematic architectural characterization and micro-benchmarking and introduce a latency-adjusted resource equivalence (LARE) metric that identifies when AI Engine implementations outperform programmable logic designs. We further propose spatial and API-level dataflow optimizations tailored to low-latency scientific inference. Finally, we demonstrate the successful deployment of end-to-end neural networks on AI Engines that cannot fit on programmable logic when using the hlsml toolchain.","published_date":"2026-04-21T05:33:11+00:00","viability_score":5,"cluster_label":"Edge AI Hardware","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This work provides design rules and a new metric (LARE) to determine when AI Engines on FPGAs are superior to programmable logic for extreme-edge scientific computing.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19104v1","title":"Reinforcement Learning Enabled Adaptive Multi-Task Control for Bipedal Soccer Robots","abstract":"Developing bipedal football robots in dynamic combat environments presents challenges related to motion stability and deep coupling of multiple tasks, as well as control switching issues between different states such as upright walking and fall recovery. 
To address these problems, this paper proposes a modular reinforcement learning (RL) framework for achieving adaptive multi-task control. Firstly, this framework combines an open-loop feedforward oscillator with a reinforcement learning-based feedback residual strategy, effectively separating the generation of basic gaits from complex football actions. Secondly, a posture-driven state machine is introduced, clearly switching between the ball seeking and kicking network (BSKN) and the fall recovery network (FRN), fundamentally preventing state interference. The FRN is efficiently trained through a progressive force attenuation curriculum learning strategy. The architecture was verified in Unity simulations of bipedal robots, demonstrating excellent spatial adaptability (reliably finding and kicking the ball even in restricted corner scenarios) and rapid autonomous fall recovery (with an average recovery time of 0.715 seconds). This ensures seamless and stable operation in complex multi-task environments.","published_date":"2026-04-21T05:27:43+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A modular RL framework enables bipedal soccer robots to adaptively control multiple tasks, combining gait generation with a posture-driven state machine for stable ball seeking and rapid fall recovery.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19102v1","title":"Multi-Gait Learning for Humanoid Robots Using Reinforcement Learning with Selective Adversarial Motion Prior","abstract":"Learning diverse locomotion skills for humanoid robots in a unified reinforcement learning framework remains challenging due to the conflicting requirements of stability and dynamic expressiveness across different gaits. 
We present a multi-gait learning approach that enables a humanoid robot to master five distinct gaits -- walking, goose-stepping, running, stair climbing, and jumping -- using a consistent policy structure, action space, and reward formulation. The key contribution is a selective Adversarial Motion Prior (AMP) strategy: AMP is applied to periodic, stability-critical gaits (walking, goose-stepping, stair climbing) where it accelerates convergence and suppresses erratic behavior, while being deliberately omitted for highly dynamic gaits (running, jumping) where its regularization would over-constrain the motion. Policies are trained via PPO with domain randomization in simulation and deployed on a physical 12-DOF humanoid robot through zero-shot sim-to-real transfer. Quantitative comparisons demonstrate that selective AMP outperforms a uniform AMP policy across all five gaits, achieving faster convergence, lower tracking error, and higher success rates on stability-focused gaits without sacrificing the agility required for dynamic ones.","published_date":"2026-04-21T05:26:20+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This approach enables humanoid robots to learn five distinct gaits using a unified RL framework with a selective adversarial motion prior, improving stability and dynamic expressiveness.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19099v1","title":"Relational AI in Education: Reciprocity, Participatory Design, and Indigenous Worldviews","abstract":"Education is not merely the transmission of information or the optimisation of individual performance; it is a fundamentally social, constructive, and relational practice. However, recent advances in generative artificial intelligence (GenAI) increasingly emphasise efficiency, automation, and individualised assistance, risking the weakening of relational learning processes. 
Despite growing adoption, AI in education (AIED) research has yet to fully articulate how AI can be designed in ways that sustain the social and ecological relationships through which learning occurs. In this paper, we re-centre education as relational and frame learner-AI interactions as context-specific relationships with clearly defined purposes and boundaries, rather than positioning them as substitutes for, or replacements of, human interaction. Grounded in participatory design practices and inspired by Indigenous worldviews (including Aboriginal Australian, Native American, and Mesoamerican traditions) that foreground reciprocity and relational accountability, we argue that meaningful educational AI should support learning with others rather than replace them. We advance this perspective by: i) conceptualising AIED as a relational design problem grounded in reciprocity; ii) articulating key tensions introduced by GenAI in education; and iii) outlining design directions that expand the AIED design space toward reciprocity, including when not to use AI, how to define pedagogical boundaries, and how to support responsible uses of AIED innovations that sustain communities and natural environments.","published_date":"2026-04-21T05:25:11+00:00","viability_score":2,"cluster_label":"AI in Education","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper explores how to design AI in education to foster relational learning, drawing inspiration from Indigenous worldviews and participatory design.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19098v1","title":"SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning","abstract":"English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. 
We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari'ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event-cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.","published_date":"2026-04-21T05:24:08+00:00","viability_score":7,"cluster_label":"Arabic Financial NLP","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SAHM is a new benchmark and dataset for Arabic financial and Shari'ah-compliant reasoning, with an instruction-tuned model to improve LLM performance in this domain.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19093v1","title":"Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration","abstract":"Multi-modal test-time adaptation (TTA) enhances the resilience of benchmark multi-modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi-modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category-conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. 
Canonical Gaussian discriminant analysis (GDA) provides a basic model of category-conditional distributions and yields moderate gains in uni-modal settings. However, in the multi-modal TTA scenario, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category-conditional distribution via canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi-modal TTA to explicitly model the category-conditional distributions, and further propose an adaptive contrastive asymmetry rectification technique to counteract the adverse effects arising from modality asymmetry, thereby deriving calibrated predictions and reliable decision boundaries. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts. The code is available at https://github.com/XuJinglinn/AdaPGC.","published_date":"2026-04-21T05:18:09+00:00","viability_score":7,"cluster_label":"Multi-modal Test-time Adaptation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper introduces a novel multi-modal test-time adaptation method that uses adaptive probabilistic Gaussian calibration to improve resilience against distribution shifts.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19092v1","title":"RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation","abstract":"Recent advances in large-scale video world models have enabled increasingly realistic future prediction, raising the prospect of leveraging imagined videos for robot learning. However, visual realism does not imply physical plausibility, and behaviors inferred from generated videos may violate dynamics and fail when executed by embodied agents. 
Existing benchmarks begin to incorporate notions of physical plausibility, but they largely remain perception- or diagnostic-oriented and do not systematically evaluate whether predicted behaviors can be translated into executable actions that complete the intended task. To address this gap, we introduce RoboWM-Bench, a manipulation-centric benchmark for embodiment-grounded evaluation of video world models. RoboWM-Bench converts generated behaviors from both human-hand and robotic manipulation videos into embodied action sequences and validates them through robotic execution. The benchmark spans diverse manipulation scenarios and establishes a unified protocol for consistent and reproducible evaluation. Using RoboWM-Bench, we evaluate state-of-the-art video world models and find that reliably generating physically executable behaviors remains an open challenge. Common failure modes include errors in spatial reasoning, unstable contact prediction, and non-physical deformations. While finetuning on manipulation data yields improvements, physical inconsistencies still persist, suggesting opportunities for more physically grounded video generation for robots.","published_date":"2026-04-21T05:09:56+00:00","viability_score":7,"cluster_label":"Robotic Manipulation Benchmarks","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"RoboWM-Bench is a new benchmark for evaluating world models in robotic manipulation, focusing on physically executable behaviors and task completion.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19089v1","title":"Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression","abstract":"Large language models (LLMs) require frequent knowledge updates to reflect changing facts and mitigate hallucinations. To meet this demand, lifelong knowledge editing has emerged as a continual approach to modify specific pieces of knowledge without retraining the entire model. 
Existing parameter editing methods struggle with stability during sequential edits due to catastrophic forgetting. While retrieval-based approaches are proposed to alleviate this issue, their applicability remains limited across various datasets because of high training costs. To address these limitations and enhance scalability in lifelong settings, we propose LightEdit. Our framework first selects relevant knowledge from retrieved information to modify the query effectively. It then incorporates a decoding strategy to suppress the model's original knowledge probabilities, thereby enabling efficient edits based on the selected information. Extensive experiments on ZSRE, Counterfact, and RIPE benchmarks demonstrate that LightEdit outperforms existing lifelong knowledge editing methods. Furthermore, by minimizing training costs, LightEdit achieves cost-effective scalability, enabling easy adaptation to various datasets.","published_date":"2026-04-21T05:02:29+00:00","viability_score":7,"cluster_label":"LLM Knowledge Editing","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LightEdit offers a scalable and cost-effective solution for lifelong knowledge editing in LLMs by selectively suppressing outdated information and incorporating new knowledge.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19087v1","title":"OLLM: Options-based Large Language Models","abstract":"We introduce Options LLM (OLLM), a simple, general method that replaces the single next-token prediction of standard LLMs with a \\textit{set of learned options} for the next token, indexed by a discrete latent variable. Instead of relying on temperature or sampling heuristics to induce diversity, OLLM models variation explicitly: a small latent space parametrizes multiple plausible next-token options which can be selected or searched by a downstream policy. 
Architecturally, OLLM is a lightweight \"plug-in\" that inserts two layers: an encoder and a decoder, before the output head, allowing almost any pretrained LLM to be converted with minimal additional parameters. We apply OLLM to a 1.7B-parameter backbone (only $1.56\\%$ of parameters trainable) trained on OpenMathReasoning and evaluated on OmniMath. The SOTA LoRA-adapted baselines peak at $51\\%$ final answer correctness, while OLLM's option set allows up to $\\sim 70\\%$ under optimal latent selection. We then train a compact policy in the latent space that emits latents to control generation. Operating in a low-dimensional option space makes reward optimization far more sample-efficient and substantially reduces common misalignments (e.g., language switching or degenerate reasoning), as the policy is constrained to options learned during SFT. Crucially, this alignment arises from model structure rather than additional KL or handcrafted alignment losses. Our results demonstrate that optionized next-token modeling enhances controllability, robustness, and efficiency in math reasoning, and highlight latent-space policy learning as a promising direction for reinforcement learning in LLMs.","published_date":"2026-04-21T04:59:37+00:00","viability_score":8,"cluster_label":"LLM Control and Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"OLLM enhances LLM controllability and robustness by replacing single next-token prediction with a set of learned options, enabling more efficient and aligned generation.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19083v1","title":"ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety","abstract":"Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. 
While prior works have demonstrated the feasibility of backdoors in MLLMs via fine-tuning data poisoning to manipulate inference, the underlying mechanisms of backdoor attacks remain opaque, complicating both understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLM backdoors. We first establish that normal downstream task alignment, even when restricted to projector fine-tuning, introduces vulnerability to backdoor injection, whose activation mechanism is different from that observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover: (1) Low-Rank Structure: Backdoor injection updates appear overall full-rank and lack dedicated \"trigger neurons\", but the backdoor-critical parameters are encoded within a low-rank subspace of the projector; (2) Activation Mechanism: Both clean and poisoned embeddings undergo a semantic shift toward a shared direction aligned with the backdoor target, but the shifting magnitude scales linearly with the input norm, resulting in the distinct backdoor activation on poisoned samples. Our code is available at: https://anonymous.4open.science/r/ProjLens-8FD7","published_date":"2026-04-21T04:52:38+00:00","viability_score":7,"cluster_label":"Multimodal AI Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ProjLens provides an interpretability framework to demystify backdoor vulnerabilities in multimodal LLMs, focusing on the role of projector fine-tuning.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19079v1","title":"Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization","abstract":"Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. 
We present a Unified ASR framework for Transducer (RNNT) training that supports both offline and streaming decoding within a single model, using chunk-limited attention with right context and dynamic chunked convolutions. To further close the gap between offline and streaming performance, we introduce an efficient Triton implementation of mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement across training modes. Experiments show that the proposed approach improves streaming accuracy at low latency while preserving offline performance and scaling to larger model sizes and training datasets. The proposed Unified ASR framework and the English model checkpoint are open-sourced.","published_date":"2026-04-21T04:43:03+00:00","viability_score":7,"cluster_label":"Speech Recognition","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified ASR framework with consistency regularization reduces the gap between offline and streaming performance, offering a single model for both use cases.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19072v1","title":"S2MAM: Semi-supervised Meta Additive Model for Robust Estimation and Variable Selection","abstract":"Semi-supervised learning with manifold regularization is a classical framework for jointly learning from both labeled and unlabeled data, where the key requirement is that the support of the unknown marginal distribution has the geometric structure of a Riemannian manifold. Typically, the Laplace-Beltrami operator-based manifold regularization can be approximated empirically by the Laplacian regularization associated with the entire training data and its corresponding graph Laplacian matrix. However, the graph Laplacian matrix depends heavily on the prespecified similarity metric and may lead to inappropriate penalties when dealing with redundant or noisy input variables. 
To address the above issues, this paper proposes a new \\textit{Semi-Supervised Meta Additive Model} (S$^2$MAM) based on a bilevel optimization scheme that automatically identifies informative variables, updates the similarity matrix, and simultaneously achieves interpretable predictions. Theoretical guarantees are provided for S$^2$MAM, including the computational convergence and the statistical generalization bound. Experimental assessments across 4 synthetic and 12 real-world datasets, with varying levels and categories of corruption, validate the robustness and interpretability of the proposed approach.","published_date":"2026-04-21T04:27:12+00:00","viability_score":3,"cluster_label":"Statistical Modeling","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A semi-supervised meta-additive model that automatically identifies informative variables and updates similarity matrices for robust predictions.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.19069v1","title":"Product-of-Experts Training Reduces Dataset Artifacts in Natural Language Inference","abstract":"Neural NLI models overfit dataset artifacts instead of truly reasoning. A hypothesis-only model scores 57.7% on SNLI, showing strong spurious correlations, and 38.6% of the baseline errors are the result of these artifacts. We propose Product-of-Experts (PoE) training, which downweights examples where biased models are overconfident. PoE nearly preserves accuracy (89.10% vs. 89.30%) while cutting bias reliance by 4.71% (bias agreement 49.85% to 45%). An ablation finds that lambda = 1.5 best balances debiasing and accuracy. 
Behavioral tests still reveal issues with negation and numerical reasoning.","published_date":"2026-04-21T04:23:20+00:00","viability_score":5,"cluster_label":"Natural Language Inference","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Product-of-Experts training reduces dataset artifacts in Natural Language Inference models by downweighting overconfident predictions.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.19060v1","title":"Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports","abstract":"Accurate disease classification from radiology reports is essential for many applications. While supervised fine-tuning (SFT) of lightweight LLMs improves accuracy, it can degrade reasoning. We propose a two-stage approach: SFT on disease labels followed by Group Relative Policy Optimization (GRPO) to refine predictions by optimizing accuracy and format without reasoning supervision. Across three radiologist-annotated datasets, SFT outperformed baselines and GRPO further improved classification and enhanced reasoning recall and comprehensiveness.","published_date":"2026-04-21T04:09:09+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Reinforcement learning refines LLM predictions for disease classification from radiology reports, improving accuracy and reasoning.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19049v1","title":"Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery","abstract":"LLM-assisted defect discovery has a precision crisis: plausible-but-wrong reports overwhelm maintainers and degrade credibility for real findings. 
We present Refute-or-Promote, an inference-time reliability pattern combining Stratified Context Hunting (SCH) for candidate generation, adversarial kill mandates, context asymmetry, and a Cross-Model Critic (CMC). Adversarial agents attempt to disprove candidates at each promotion gate; cold-start reviewers are intended to reduce anchoring cascades; cross-family review can catch correlated blind spots that same-family review misses. Over a 31-day campaign across 7 targets (security libraries, the ISO C++ standard, major compilers), the pipeline killed roughly 79% of 171 candidates before advancing to disclosure (retrospective aggregate); on a consolidated-protocol subset (lcms2, wolfSSL; n=30), the prospective kill rate was 83%. Outcomes: 4 CVEs (3 public, 1 embargoed); LWG 4549 accepted to the C++ working paper; 5 merged C++ editorial PRs; 3 compiler conformance bugs; 8 merged security-related fixes without CVE; an RFC 9000 errata filed under committee review; and 1+ FIPS 140-3 normative compliance issues under coordinated disclosure -- all evaluated by external acceptance, not benchmarks. The most instructive failure: ten dedicated reviewers unanimously endorsed a non-existent Bleichenbacher padding oracle in OpenSSL's CMS module; it was killed only by a single empirical test, motivating the mandatory empirical gate. No vulnerability was discovered autonomously; the contribution is external structure that filters LLM agents' persistent false positives. 
As a preliminary transfer test beyond defect discovery, a simplified cross-family critique variant also solved five previously unsolved SymPy instances on SWE-bench Verified and one SWE-rebench hard task.","published_date":"2026-04-21T03:55:35+00:00","viability_score":7,"cluster_label":"LLM Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An adversarial multi-agent system to improve LLM-assisted defect discovery by filtering false positives and enhancing credibility.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19048v1","title":"SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning","abstract":"The combination of Mixture-of-Experts (MoE) and Low-Rank Adaptation (LoRA) has shown significant potential for enhancing the multi-task learning capabilities of Large Language Models. However, existing methods face two primary challenges: (1) Imprecise routing in current MoE-LoRA methods fails to explicitly match input semantics with expert capabilities, leading to weak expert specialization. (2) Uniform weight fusion strategies struggle to provide adaptive update strengths, overlooking the varying complexity of different tasks. To address these limitations, we propose SAMoRA (Semantic-Aware Mixture of LoRA Experts), a novel parameter-efficient fine-tuning framework tailored for task-adaptive learning. Specifically, a Semantic-Aware Router is proposed to explicitly align textual semantics with the most suitable experts for precise routing. A Task-Adaptive Scaling mechanism is designed to dynamically regulate expert contributions based on specific task requirements. In addition, a novel regularization objective is proposed to jointly promote expert specialization and effective scaling. 
Extensive experiments on multiple multi-task benchmarks demonstrate that SAMoRA significantly outperforms the state-of-the-art methods and holds excellent task generalization capabilities. Code is available at https://github.com/boyan-code/SAMoRA","published_date":"2026-04-21T03:55:02+00:00","viability_score":7,"cluster_label":"LLM Fine-tuning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A parameter-efficient fine-tuning framework for LLMs that uses semantic awareness to route inputs to specialized experts and adaptively scales their contributions for better task performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19047v1","title":"RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora","abstract":"Existing QA benchmarks typically assume distinct documents with minimal overlap, yet real-world retrieval-augmented generation (RAG) systems operate on corpora such as financial reports, legal codes, and patents, where information is highly redundant and documents exhibit strong inter-document similarity. This mismatch undermines evaluation validity: retrievers can be unfairly undervalued even when they retrieve documents that provide sufficient evidence, because redundancy across documents is not accounted for in evaluation. On the other hand, retrievers that perform well on standard benchmarks often generalize poorly to real-world corpora with highly similar and redundant documents. We present RARE (Redundancy-Aware Retrieval Evaluation), a framework for constructing realistic benchmarks by (i) decomposing documents into atomic facts to enable precise redundancy tracking and (ii) enhancing LLM-based data generation with CRRF. RAG benchmark data usually requires multiple quality criteria, but LLMs often yield trivial outputs. CRRF scores criteria separately and fuses decisions by rank, improving the reliability of generated data. 
Applying RARE to Finance, Legal, and Patent corpora, we introduce RedQA, where a strong retriever baseline drops from 66.4% PerfRecall@10 on 4-hop General-Wiki to 5.0-27.9% PerfRecall@10 at 4-hop depth, revealing robustness gaps that current benchmarks fail to capture. RARE enables practitioners to build domain-specific RAG evaluations that faithfully reflect real-world deployment conditions.","published_date":"2026-04-21T03:54:09+00:00","viability_score":7,"cluster_label":"RAG Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for evaluating retrieval-augmented generation systems on real-world, redundant corpora by tracking atomic facts and improving LLM data generation reliability.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19043v1","title":"Learning Lifted Action Models from Unsupervised Visual Traces","abstract":"Efficient construction of models capturing the preconditions and effects of actions is essential for applying AI planning in real-world domains. Extensive prior work has explored learning such models from high-level descriptions of state and/or action sequences. In this paper, we tackle a more challenging setting: learning lifted action models from sequences of state images, without action observation. We propose a deep learning framework that jointly learns state prediction, action prediction, and a lifted action model. We also introduce a mixed-integer linear program (MILP) to prevent prediction collapse and self-reinforcing errors among predictions. The MILP takes the predicted states, actions, and action model over a subset of traces and solves for logically consistent states, actions, and action model that are as close as possible to the original predictions. Pseudo-labels extracted from the MILP solution are then used to guide further training. 
Experiments across multiple domains show that integrating MILP-based correction helps the model escape local optima and converge toward globally consistent solutions.","published_date":"2026-04-21T03:49:04+00:00","viability_score":3,"cluster_label":"AI Planning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A deep learning framework that learns lifted action models from unsupervised visual traces by jointly predicting states and actions, with a mixed-integer linear program for consistency.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19036v1","title":"Plausible Reasoning and First-Order Plausible Logic","abstract":"Defeasible statements are statements that are likely, or probable, or usually true, but may occasionally be false. Plausible reasoning makes conclusions from statements that are either facts or defeasible statements without using numbers. So there are no probabilities or suchlike involved. Seventeen principles of logics that do plausible reasoning are suggested and several important plausible reasoning examples are considered. There are 14 necessary principles and 3 desirable principles, one of which is not formally stated. A first-order logic, called Plausible Logic (PL), is defined that satisfies all but two of the desirable principles and reasons correctly with all the examples. As far as we are aware, this is the only such logic. PL has 8 reasoning algorithms because, from a given plausible reasoning situation, there are different sensible conclusions. This article is a condensation of my book `Plausible Reasoning and Plausible Logic' (PRPL), which is to be submitted. Each section of this article corresponds to a chapter in PRPL, and vice versa. 
The proofs of all the results are in PRPL, so they are omitted in this article.","published_date":"2026-04-21T03:39:56+00:00","viability_score":0,"cluster_label":"Logic and Reasoning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel first-order logic, Plausible Logic (PL), designed for defeasible reasoning without probabilities, offering 8 reasoning algorithms.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19033v1","title":"Intentional Updates for Streaming Reinforcement Learning","abstract":"In gradient-based learning, a step size chosen in parameter units does not produce a predictable per-step change in function output. This often leads to instability in the streaming setting (i.e., batch size=1), where stochasticity is not averaged out and update magnitudes can momentarily become arbitrarily big or small. Instead, we propose intentional updates: first specify the intended outcome of an update and then solve for the step size that approximately achieves it. This strategy has precedent in online supervised linear regression via Normalized Least Mean Squares algorithm, which selects a step size to yield a specified change in the function output proportional to the current error. We extend this principle to streaming deep reinforcement learning by defining appropriate intended outcomes: Intentional TD aims for a fixed fractional reduction of the TD error, and Intentional Policy Gradient aims for a bounded per-step change in the policy, limiting local KL divergence. We propose practical algorithms combining eligibility traces and diagonal scaling. 
Empirically, these methods yield state-of-the-art streaming performance, frequently performing on par with batch and replay-buffer approaches.","published_date":"2026-04-21T03:33:40+00:00","viability_score":3,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel approach to streaming reinforcement learning that aims for predictable per-step changes in function output, leading to state-of-the-art performance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.19022v1","title":"On Accelerating Grounded Code Development for Research","abstract":"A major challenge for niche scientific and technical domains in leveraging coding agents is the lack of access to up-to-date, domain-specific knowledge. Foundational models often demonstrate limited reasoning capabilities in specialized fields and cannot inherently incorporate knowledge that evolves through ongoing research and experimentation. Materials scientists exploring novel compounds, communication engineers designing and evaluating new protocols, and bioengineering researchers conducting iterative experiments all face this limitation. These experts typically lack the resources to fine-tune large models or continuously embed new findings, creating a barrier to adopting AI-driven coding agents. To address this, we introduce a framework that gives coding agents instantaneous access to research repositories and technical documentation, enabling real-time, context-aware operation. Our open-source implementation allows users to upload documents via doc-search.dev and includes zed-fork, which enforces domain-specific rules and workflows. 
Together, these tools accelerate the integration of coding agents into specialized scientific and technical workflows","published_date":"2026-04-21T03:16:16+00:00","viability_score":7,"cluster_label":"AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework that provides coding agents with instant access to research repositories and technical documentation, enabling real-time, context-aware operation for specialized scientific and technical workflows.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.19018v1","title":"Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control","abstract":"Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open-loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well-approximated by locally-linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer-wise Jacobians, steering activations toward desired semantic setpoints in closed-loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error, enabling formal guarantees on steering performance. 
Using a novel adaptive semantic feature setpoint signal, our method yields robust, fine-grained behavior control across models, scales, and tasks, including state-of-the-art modulation of toxicity, truthfulness, refusal, and arbitrary concepts, surpassing baseline steering methods. Our code is available at: https://github.com/trustworthyrobotics/lqr-activation-steering","published_date":"2026-04-21T03:09:46+00:00","viability_score":7,"cluster_label":"LLM Alignment and Control","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Develop a tool to steer activation in LLMs for safer outputs using local linear models.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.19015v1","title":"FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion","abstract":"Federated fine-tuning of Large Language Models (LLMs) is obstructed by a trilemma of challenges: protecting LLMs intellectual property (IP), ensuring client privacy, and mitigating performance loss on heterogeneous data. Existing methods like Offsite-Tuning (OT) secure the LLMs IP by having clients train only lightweight adapters, yet our analysis reveals they suffer from a fundamental performance bottleneck, leaving a significant gap compared to centralized training. To bridge this gap, we introduce FedProxy, a new federated adaptation framework. FedProxy replaces weak adapters with a unified, powerful Proxy Small Language Model (SLM), compressed from the proprietary LLM, to serve as a high-fidelity surrogate for collaborative fine-tuning. 
Our framework systematically resolves the trilemma through a three-stage architecture: (i) Efficient Representation via server-guided compression to create a resource-friendly proxy; (ii) Robust Optimization through an interference-mitigating aggregation strategy to handle data heterogeneity; and (iii) Effortless Fusion via a training-free \"plug-in\" mechanism to integrate learned knowledge back into the LLM. Experiments show FedProxy significantly outperforms OT methods and approaches centralized performance, establishing a new benchmark for secure and high-performance federated LLM adaptation.","published_date":"2026-04-21T03:06:24+00:00","viability_score":7,"cluster_label":"Federated LLM Fine-Tuning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A federated adaptation framework that uses a proxy SLM and heterogeneity-aware fusion to enable secure, high-performance fine-tuning of LLMs without compromising client privacy or LLM IP.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.19000v1","title":"Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees","abstract":"Statement autoformalization acts as a critical bridge between human mathematics and formal mathematics by translating natural language problems into formal language. While prior works have focused on data synthesis and diverse training paradigms to optimize end-to-end Large Language Models (LLMs), they typically treat formal code as flat sequences, neglecting the hierarchical logic inherent in mathematical statements. In this work, we introduce Decompose, Structure, and Repair (DSR), a neuro-symbolic framework that restructures autoformalization into a modular pipeline. DSR decomposes statements into logical components and maps them to structured operator trees, leveraging this topological blueprint to precisely localize and repair errors via sub-tree refinement. 
Furthermore, we introduce PRIME, a benchmark of 156 undergraduate and graduate-level theorems selected from canonical textbooks and expertly annotated in Lean 4. Experimental results demonstrate that DSR establishes a new state-of-the-art, consistently outperforming baselines under equivalent computational budgets. The datasets, model, and code will be released to the public soon.","published_date":"2026-04-21T02:36:55+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A neuro-symbolic framework that restructures statement autoformalization into a modular pipeline for improved accuracy and error localization.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18995v1","title":"$R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction","abstract":"Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^2$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. 
Experiments demonstrate that $R^2$-dLLM consistently reduces the number of decoding steps by up to 75% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains.","published_date":"2026-04-21T02:26:08+00:00","viability_score":6,"cluster_label":"LLM Inference","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Accelerates Diffusion Large Language Models by reducing spatio-temporal redundancy during decoding, leading to faster inference with maintained quality.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.18993v1","title":"AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos","abstract":"Perception robustness under adverse weather remains a critical challenge for autonomous driving, with the core bottleneck being the scarcity of real-world video data in adverse weather. Existing weather generation approaches struggle to balance visual quality and annotation reusability. We present AutoAWG, a controllable Adverse Weather video Generation framework for Autonomous driving. Our method employs a semantics-guided adaptive fusion of multiple controls to balance strong weather stylization with high-fidelity preservation of safety-critical targets; leverages a vanishing point-anchored temporal synthesis strategy to construct training sequences from static images, thereby reducing reliance on synthetic data; and adopts masked training to enhance long-horizon generation stability. On the nuScenes validation set, AutoAWG significantly outperforms prior state-of-the-art methods: without first-frame conditioning, FID and FVD are relatively reduced by 50.0% and 16.1%; with first-frame conditioning, they are further reduced by 8.7% and 7.2%, respectively. 
Extensive qualitative and quantitative results demonstrate advantages in style fidelity, temporal consistency, and semantic--structural integrity, underscoring the practical value of AutoAWG for improving downstream perception in autonomous driving. Our code is available at: https://github.com/higherhu/AutoAWG","published_date":"2026-04-21T02:24:49+00:00","viability_score":8,"cluster_label":"Generative Video","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A controllable framework for generating adverse weather automotive videos that significantly improves perception robustness for autonomous driving.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18982v1","title":"SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution","abstract":"Social intelligence, the ability to navigate complex interpersonal interactions, presents a fundamental challenge for language agents. Training such agents via reinforcement learning requires solving the credit assignment problem: determining how individual utterances contribute to multi-turn dialogue outcomes. Existing approaches directly employ language models to distribute episode-level rewards, yielding attributions that are retrospective and lack theoretical grounding. We propose SAVOIR (ShApley Value fOr SocIal RL), a novel principled framework grounded in cooperative game theory. Our approach combines two complementary principles: expected utility shifts evaluation from retrospective attribution to prospective valuation, capturing an utterance's strategic potential for enabling favorable future trajectories; Shapley values ensure fair credit distribution with axiomatic guarantees of efficiency, symmetry, and marginality. 
Experiments on the SOTOPIA benchmark demonstrate that SAVOIR achieves new state-of-the-art performance across all evaluation settings, with our 7B model matching or exceeding proprietary models including GPT-4o and Claude-3.5-Sonnet. Notably, even large reasoning models consistently underperform, suggesting social intelligence requires qualitatively different capabilities than analytical reasoning.","published_date":"2026-04-21T02:08:25+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A principled framework for training language agents with social intelligence by attributing rewards using Shapley values for fair and effective multi-turn dialogue outcomes.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18978v1","title":"Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning","abstract":"Scaling critic capacity is a promising direction for enhancing off-policy reinforcement learning (RL). However, larger critics are prone to overfitting and unstable in replay-buffer-based bootstrap training. This paper leverages Low-Rank Adaptation (LoRA) as a structural-sparsity regularizer for off-policy critics. Our approach freezes randomly initialized base matrices and solely optimizes low-rank adapters, thereby constraining critic updates to a low-dimensional subspace. Built on top of SimbaV2, we further develop a LoRA formulation, compatible with SimbaV2, that preserves its hyperspherical normalization geometry under frozen-backbone training. We evaluate our method with SAC and FastTD3 on DeepMind Control locomotion and IsaacLab robotics benchmarks. LoRA consistently achieves lower critic loss during training and stronger policy performance. 
Extensive experiments demonstrate that adaptive low-rank updates provide a simple, scalable, and effective structural regularization for critic learning in off-policy RL.","published_date":"2026-04-21T01:59:54+00:00","viability_score":7,"cluster_label":"Reinforcement Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Leveraging Low-Rank Adaptation (LoRA) to improve critic capacity and stability in off-policy reinforcement learning, demonstrating consistent gains in critic loss and policy performance.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.18966v1","title":"Self-Improving Tabular Language Models via Iterative Group Alignment","abstract":"While language models have been adapted for tabular data generation, two fundamental limitations remain: (1) static fine-tuning produces models that cannot learn from their own generated samples and adapt to self-correct, and (2) autoregressive objectives preserve local token coherence but neglect global statistical properties, degrading tabular quality. Reinforcement learning offers a potential solution but requires designing reward functions that balance competing objectives -- impractical for tabular data. To fill the gap, we introduce TabGRAA (Tabular Group-Relative Advantage Alignment), the first self-improving framework for tabular data generation via automated feedback. At each iteration, TabGRAA uses an \\emph{automated quality signal} -- such as a two-sample distinguishability classifier or a distance-based reward -- to partition newly generated samples into high- and low-quality groups, then optimizes a group-relative advantage objective that reinforces realistic patterns while penalizing artifacts. The specific signal is a modular choice rather than a fixed component of the framework. 
This establishes a virtuous feedback cycle, where the quality signal is re-computed against newly \\emph{generated synthetic} samples at each round; the language model is only fine-tuned on these self-generated signals, so no additional real record is exposed during alignment, mitigating data-leakage risk beyond the initial supervised fine-tuning. Experiments show TabGRAA outperforms existing methods in fidelity, utility, and privacy, while matching or exceeding diffusion-based synthesizers, advancing tabular synthesis from static statistical replication to dynamic, self-improving generation.","published_date":"2026-04-21T01:29:52+00:00","viability_score":4,"cluster_label":"Tabular Data Generation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A self-improving framework for tabular data generation that uses automated feedback to iteratively enhance model quality and learn from its own synthetic samples.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18964v1","title":"DW-Bench: Benchmarking LLMs on Data Warehouse Graph Topology Reasoning","abstract":"This paper introduces DW-Bench, a new benchmark that evaluates large language models (LLMs) on graph-topology reasoning over data warehouse schemas, explicitly integrating both foreign-key (FK) and data-lineage edges. The benchmark comprises 1,046 automatically generated, verifiably correct questions across five schemas. 
Experiments show that tool-augmented methods substantially outperform static approaches but plateau on hard compositional subtypes.","published_date":"2026-04-21T01:28:32+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DW-Bench, a new benchmark for evaluating LLMs on graph-topology reasoning over data warehouse schemas, showing tool-augmented methods outperform static approaches.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.18963v1","title":"Distillation Traps and Guards: A Calibration Knob for LLM Distillability","abstract":"Knowledge distillation (KD) transfers capabilities from large language models (LLMs) to smaller students, yet it can fail unpredictably and also underpins model leakage risks. Our analysis revealed several distillation traps: tail noise, off-policy instability, and, most fundamentally, the teacher-student gap, that distort training signals. These traps manifest as overconfident hallucinations, self-correction collapse, and local decoding degradation, causing distillation to fail. Motivated by these findings, we propose a post-hoc calibration method that, to the best of our knowledge, for the first time enables control over a teacher's distillability via reinforcement fine-tuning (RFT). Our objective combines task utility, KL anchor, and across-tokenizer calibration reward. This makes distillability a practical safety lever for foundation models, connecting robust teacher-student transfer with deployment-aware model protection. 
Experiments across math, knowledge QA, and instruction-following tasks show that students distilled from distillable calibrated teachers outperform SFT and KD baselines, while undistillable calibrated teachers retain their task performance but cause distilled students to collapse, offering a practical knob for both better KD and model IP protection.","published_date":"2026-04-21T01:22:35+00:00","viability_score":4,"cluster_label":"LLM Distillation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A post-hoc calibration method to control LLM distillability, enabling control over teacher models via reinforcement fine-tuning to improve student performance and protect intellectual property.","time_to_mvp":"1-3 months","tags":[]},{"arxiv_id":"2604.18955v1","title":"Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest","abstract":"In this study, we present the first comprehensive evaluation of modern LLMs - including GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT - across three core social media analytics tasks on a Twitter (X) dataset: (I) Social Media Authorship Verification, (II) Social Media Post Generation, and (III) User Attribute Inference. For the authorship verification, we introduce a systematic sampling framework over diverse user and post selection strategies and evaluate generalization on newly collected tweets from January 2024 onward to mitigate \"seen-data\" bias. For post generation, we assess the ability of LLMs to produce authentic, user-like content using comprehensive evaluation metrics. Bridging Tasks I and II, we conduct a user study to measure real users' perceptions of LLM-generated posts conditioned on their own writing. For attribute inference, we annotate occupations and interests using two standardized taxonomies (IAB Tech Lab 2023 and 2018 U.S. SOC) and benchmark LLMs against existing baselines. 
Overall, our unified evaluation provides new insights and establishes reproducible benchmarks for LLM-driven social media analytics. The code and data are provided in the supplementary material and will also be made publicly available upon publication.","published_date":"2026-04-21T01:05:52+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research provides a comprehensive, reproducible benchmark for evaluating LLMs on social media analytics tasks, including authorship verification, post generation, and user attribute inference, with code and data released.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18946v1","title":"Reasoning Structure Matters for Safety Alignment of Reasoning Models","abstract":"Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks but often generate harmful responses to malicious user queries. This paper investigates the underlying cause of these safety risks and shows that the issue lies in the reasoning structure itself. Based on this insight, we claim that effective safety alignment can be achieved by altering the reasoning structure. We propose AltTrain, a simple yet effective post training method that explicitly alters the reasoning structure of LRMs. AltTrain is both practical and generalizable, requiring no complex reinforcement learning (RL) training or reward design, only supervised finetuning (SFT) with a lightweight 1K training examples. 
Experiments across LRM backbones and model sizes demonstrate strong safety alignment, along with robust generalization across reasoning, QA, summarization, and multilingual setting.","published_date":"2026-04-21T00:50:13+00:00","viability_score":6,"cluster_label":"LLM Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AltTrain is a post-training method that modifies the reasoning structure of large reasoning models to improve safety alignment without complex RL, using only supervised finetuning on 1K examples.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.18943v1","title":"Personalized Benchmarking: Evaluating LLMs by Individual Preferences","abstract":"With the rise in capabilities of large language models (LLMs) and their deployment in real-world tasks, evaluating LLM alignment with human preferences has become an important challenge. Current benchmarks average preferences across all users to compute aggregate ratings, overlooking individual user preferences when establishing model rankings. Since users have varying preferences in different contexts, we call for personalized LLM benchmarks that rank models according to individual needs. We compute personalized model rankings using ELO ratings and Bradley-Terry coefficients for 115 active Chatbot Arena users and analyze how user query characteristics (topics and writing style) relate to LLM ranking variations. We demonstrate that individual rankings of LLM models diverge dramatically from aggregate LLM rankings, with Bradley-Terry correlations averaging only $\u03c1= 0.04$ (57\\% of users show near-zero or negative correlation) and ELO ratings showing moderate correlation ($\u03c1= 0.43$). Through topic modeling and style analysis, we find users exhibit substantial heterogeneity in topical interests and communication styles, influencing their model preferences. 
We further show that a compact combination of topic and style features provides a useful feature space for predicting user-specific model rankings. Our results provide strong quantitative evidence that aggregate benchmarks fail to capture individual preferences for most users, and highlight the importance of developing personalized benchmarks that rank LLM models according to individual user preferences.","published_date":"2026-04-21T00:40:47+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research introduces personalized benchmarking for LLMs, demonstrating that individual user preferences diverge significantly from aggregate rankings and can be predicted using topic and style features.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18936v1","title":"Fine-Tuning Small Reasoning Models for Quantum Field Theory","abstract":"Despite the growing application of Large Language Models (LLMs) to theoretical physics, there is little academic exploration into how domain-specific physics reasoning ability develops while training these models. To investigate this, we perform the first academic fine-tuning study of small (7B-parameter) reasoning models dedicated specifically to theoretical physics. Because open-source verifiable training data required to train such capabilities is scarce, we developed a robust data generation pipeline that can both create synthetic problems and make existing human-authored problems suitable for model training. Selecting Quantum Field Theory (QFT) as our primary domain, we generated over 2,500 synthetic problems alongside a curated collection of human-adapted problems sourced from arXiv and standard pedagogical resources. We conduct both Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) experiments, benchmarking performance gains as well as generalization to other physics domains. 
We perform an extensive analysis of model chains-of-thought before and after fine-tuning, to understand how reasoning errors evolve during RL and SFT. Finally, we publicly release our data pipeline, verifiable QFT training data, and $\\sim$200M tokens of QFT reasoning traces.","published_date":"2026-04-21T00:21:05+00:00","viability_score":5,"cluster_label":"LLM Fine-Tuning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This study fine-tunes small reasoning models for Quantum Field Theory using a novel data generation pipeline and publicly releases the data, pipeline, and reasoning traces.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18934v1","title":"AutomationBench","abstract":"Existing AI benchmarks for software automation rarely combine cross-application coordination, autonomous API discovery, and policy adherence. Real business workflows demand all three: a single task may span a CRM, inbox, calendar, and messaging platform - requiring the agent to find the right endpoints, follow a policy document, and write correct data to each system. To address this gap, we introduce AutomationBench, a benchmark for evaluating AI agents on cross-application workflow orchestration via REST APIs. Drawing on real workflow patterns from Zapier's platform, tasks span Sales, Marketing, Operations, Support, Finance, and HR domains. Agents must discover relevant endpoints themselves, follow layered business rules, and navigate environments with irrelevant and sometimes misleading records. Grading is programmatic and end-state only: whether the correct data ended up in the right systems. Even the best frontier models currently score below 10%. 
AutomationBench provides a challenging, realistic measure of where current models stand relative to the agentic capabilities businesses actually need.","published_date":"2026-04-21T00:14:59+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark for evaluating AI agents on complex, cross-application business workflows via API orchestration.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18933v1","title":"Gated Memory Policy","abstract":"Robotic manipulation tasks exhibit varying memory requirements, ranging from Markovian tasks that require no memory to non-Markovian tasks that depend on historical information spanning single or multiple interaction trials. Surprisingly, simply extending observation histories of a visuomotor policy often leads to a significant performance drop due to distribution shift and overfitting. To address these issues, we propose Gated Memory Policy (GMP), a visuomotor policy that learns both when to recall memory and what to recall. To learn when to recall memory, GMP employs a learned memory gate mechanism that selectively activates history context only when necessary, improving robustness and reactivity. To learn what to recall efficiently, GMP introduces a lightweight cross-attention module that constructs effective latent memory representations. To further enhance robustness, GMP injects diffusion noise into historical actions, mitigating sensitivity to noisy or inaccurate histories during both training and inference. On our proposed non-Markovian benchmark MemMimic, GMP achieves a 30.1% average success rate improvement over long-history baselines, while maintaining competitive performance on Markovian tasks in RoboMimic. 
All code, data and in-the-wild deployment instructions are available on our project website https://gated-memory-policy.github.io/.","published_date":"2026-04-21T00:14:50+00:00","viability_score":8,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A visuomotor policy for robotic manipulation that learns to selectively recall and construct memory for complex, non-Markovian tasks.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18932v1","title":"Tadabur: A Large-Scale Quran Audio Dataset","abstract":"Despite growing interest in Quranic data research, existing Quran datasets remain limited in both scale and diversity. To address this gap, we present Tadabur, a large-scale Quran audio dataset. Tadabur comprises more than 1,400 hours of recitation audio from over 600 distinct reciters, providing substantial variation in recitation styles, vocal characteristics, and recording conditions. This diversity makes Tadabur a comprehensive and representative resource for Quranic speech research and analysis. By significantly expanding both the total duration and variability of available Quran data, Tadabur aims to support future research and facilitate the development of standardized Quranic speech benchmarks.","published_date":"2026-04-21T00:13:30+00:00","viability_score":4,"cluster_label":"Audio AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A large-scale dataset of Quranic recitation audio to advance research in Quranic speech analysis.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18584v1","title":"MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval","abstract":"Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. 
We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts.   MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. 
MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.","published_date":"2026-04-20T17:59:49+00:00","viability_score":7,"cluster_label":"Multimodal Reasoning Benchmark","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A large-scale, multimodal, and multilingual benchmark for evaluating mathematical reasoning and retrieval in generative and embedding-based systems.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18580v1","title":"Sessa: Selective State Space Attention","abstract":"Modern sequence models are dominated by Transformers, where self-attention mixes information from the visible context in an input-dependent way. However, when retrieval is not sharp and attention remains diffuse over an effective support $S_{\\mathrm{eff}}(t)$, the influence of any individual token is diluted, typically scaling as $O(1/S_{\\mathrm{eff}}(t))$ and reaching $O(1/\\ell)$ for old tokens in full-prefix settings. Structured state-space models process sequences recurrently through an explicit feedback path; selective variants such as Mamba make this feedback input-dependent, yet when freeze time cannot be sustained over long intervals, their long-range sensitivity decays exponentially with lag. Existing architectures therefore either retrieve from the past in a single read or propagate information through a single feedback chain. We introduce Sessa, a decoder that places attention inside a feedback path, enabling recurrent many-path aggregation within a layer. 
Under stated assumptions, Sessa admits regimes with a power-law memory tail in lag $\\ell$ of order $O(\\ell^{-\u03b2})$ for $0<\u03b2<1$, which is asymptotically slower than $1/\\ell$; moreover, this rate is tight in an explicit diffuse uniform-routing setting where the influence is $\u0398(\\ell^{-\u03b2})$. Under the same conditions, only Sessa among the compared model classes realizes flexible selective retrieval, including non-decaying profiles. Empirically, under matched architectures and training budgets, Sessa achieves the strongest performance on our long-context benchmarks while remaining competitive with Transformer and Mamba style baselines on short-context language modeling.","published_date":"2026-04-20T17:59:08+00:00","viability_score":6,"cluster_label":"Sequence Modeling","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Sessa introduces a novel decoder architecture that places attention within a feedback path for improved long-context sequence modeling.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18578v1","title":"Bounded Ratio Reinforcement Learning","abstract":"Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. 
To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for LLM fine-tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.","published_date":"2026-04-20T17:59:01+00:00","viability_score":7,"cluster_label":"Reinforcement Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Bounded Ratio Reinforcement Learning (BRRL) provides a theoretically grounded framework for policy optimization, outperforming PPO in stability and performance.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18576v1","title":"Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs","abstract":"We present BLF (Bayesian Linguistic Forecaster), an agentic system for binary forecasting that achieves state-of-the-art performance on the ForecastBench benchmark. The system is built on three ideas. (1) A Bayesian linguistic belief state: a semi-structured representation combining numerical probability estimates with natural-language evidence summaries, updated by the LLM at each step of an iterative tool-use loop. This contrasts with the common approach of appending all retrieved evidence to an ever-growing context. 
(2) Hierarchical multi-trial aggregation: running $K$ independent trials and combining them using logit-space shrinkage with a data-dependent prior. (3) Hierarchical calibration: Platt scaling with a hierarchical prior, which avoids over-shrinking extreme predictions for sources with skewed base rates.   On 400 backtesting questions from the ForecastBench leaderboard, BLF outperforms all the top public methods, including Cassi, GPT-5, Grok~4.20, and Foresight-32B. Ablation studies show that the structured belief state is as impactful as web search access, and that shrinkage aggregation and hierarchical calibration each provide significant additional gains.   In addition, we develop a robust back-testing framework with a leakage rate below 1.5\\%, and use rigorous statistical methodology to compare different methods while controlling for various sources of noise.","published_date":"2026-04-20T17:57:51+00:00","viability_score":3,"cluster_label":"Predictive Analytics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new approach to agentic forecasting using Bayesian updating to estimate future events with linguistic data.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.18574v1","title":"When Can LLMs Learn to Reason with Weak Supervision?","abstract":"Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. 
We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.","published_date":"2026-04-20T17:57:49+00:00","viability_score":4,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper investigates when Large Language Models can learn to reason effectively with weak supervision by analyzing training dynamics and identifying key properties for generalization.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18572v1","title":"Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale","abstract":"The Platonic Representation Hypothesis suggests that neural networks trained on different modalities (e.g., text and images) align and eventually converge toward the same representation of reality. If true, this has significant implications for whether modality choice matters at all. We show that the experimental evidence for this hypothesis is fragile and depends critically on the evaluation regime. 
Alignment is measured using mutual nearest neighbors on small datasets ($\\approx$1K samples) and degrades substantially as the dataset is scaled to millions of samples. The alignment that remains between model representations reflects coarse semantic overlap rather than consistent fine-grained structure. Moreover, the evaluations in Huh et al. are done in a one-to-one image-caption setting, a constraint that breaks down in realistic many-to-many settings and further reduces alignment. We also find that the reported trend of stronger language models increasingly aligning with vision does not appear to hold for newer models. Overall, our findings suggest that the current evidence for cross-modal representational convergence is considerably weaker than subsequent works have taken it to be. Models trained on different modalities may learn equally rich representations of the world, just not the same one.","published_date":"2026-04-20T17:56:02+00:00","viability_score":4,"cluster_label":"Multimodal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research challenges the notion of cross-modal representational convergence in neural networks, suggesting that models trained on different modalities learn distinct, rather than shared, representations of reality.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18570v1","title":"A multimodal and temporal foundation model for virtual patient representations at healthcare system scale","abstract":"Modern medicine generates vast multimodal data across siloed systems, yet no existing model integrates the full breadth and temporal depth of the clinical record into a unified patient representation. 
We introduce Apollo, a multimodal temporal foundation model trained and evaluated on over three decades of longitudinal hospital records from a major US hospital system, composed of 25 billion records from 7.2 million patients, representing 28 distinct medical modalities and 12 major medical specialties. Apollo learns a unified representation space integrating over 100 thousand unique medical events in our clinical vocabulary as well as images and clinical text. This \"atlas of medical concepts\" forms a computational substrate for modeling entire patient care journeys comprised of sequences of structured and unstructured events, which are compressed by Apollo into virtual patient representations. To assess the potential of these whole-patient representations, we created 322 prognosis and retrieval tasks from a held-out test set of 1.4 million patients. We demonstrate the generalized clinical forecasting potential of Apollo embeddings, including predicting new disease onset risk up to five years in advance (95 tasks), disease progression (78 tasks), treatment response (59 tasks), risk of treatment-related adverse events (17 tasks), and hospital operations endpoints (12 tasks). Using feature attribution techniques, we show that model predictions align with clinically-interpretable multimodal biomarkers. We evaluate semantic similarity search on 61 retrieval tasks, and moreover demonstrate the potential of Apollo as a multimodal medical search engine using text and image queries. 
Together, these modeling capabilities establish the foundation for computable medicine, where the full context of patient care becomes accessible to computational reasoning.","published_date":"2026-04-20T17:55:47+00:00","viability_score":8,"cluster_label":"Healthcare AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Develop a multimodal foundation model for predicting patient outcomes and improving hospital operations using integrated health records.","time_to_mvp":"6+ months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18567v1","title":"Latent Phase-Shift Rollback: Inference-Time Error Correction via Residual Stream Monitoring and KV-Cache Steering","abstract":"Large language models frequently commit unrecoverable reasoning errors mid-generation: once a wrong step is taken, subsequent tokens compound the mistake rather than correct it. We introduce $\\textbf{Latent Phase-Shift Rollback}$ (LPSR): at each generation step, we monitor the residual stream at a critical layer lcrit, detect abrupt directional reversals (phase shifts) via a cosine-similarity $+$ entropy dual gate, and respond by rolling back the KV-cache and injecting a pre-computed steering vector. No fine-tuning, gradient computation, or additional forward passes are required. LPSR achieves $\\mathbf{44.0\\%}$ on MATH-500 with an 8B model versus $28.8\\%$ for standard AR ($+15.2$ pp; McNemar $\u03c7^2 = 66.96$, $p < 10^{-15}$). Critically, prompted self-correction, the most natural inference-time baseline, scores only $19.8\\%$, below standard AR; LPSR exceeds it by $+24.2$ pp ($\u03c7^2 = 89.4$, $p \\approx 0$). LPSR also outperforms Best-of-16 ($+7.8$ pp) at $5.4\\times$ lower token cost, and surpasses a standard 70B model ($35.2\\%$) with $8.75\\times$ fewer parameters at ${\\sim}3\\times$ the token budget. 
A 32-layer sweep reveals a novel \\textbf{detection-correction dissociation}: error-detection AUC peaks at layer~14 ($0.718$) but task accuracy peaks at layer~16 ($44.0\\%$ vs.\\ $29.2\\%$), demonstrating that optimal monitoring depth differs for detection and correction.","published_date":"2026-04-20T17:53:33+00:00","viability_score":7,"cluster_label":"LLM Inference","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Latent Phase-Shift Rollback (LPSR) is an inference-time technique that corrects unrecoverable reasoning errors in LLMs by monitoring residual streams and steering the KV-cache, without fine-tuning.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.18566v1","title":"Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion","abstract":"We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmarks for System Dynamics AI assistance: the \\textbf{CLD Leaderboard} (53 tests, structured causal loop diagram extraction) and the \\textbf{Discussion Leaderboard} (interactive model discussion, feedback explanation, and model building coaching).   On CLD extraction, cloud models achieve 77--89\\% overall pass rates; the best local model reaches 77\\% (Kimi~K2.5~GGUF~Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50--100\\% on model building steps and 47--75\\% on feedback explanation, but only 0--50\\% on error fixing -- a category dominated by long-context prompts that expose memory limits in local deployments.   
A central contribution of this paper is a systematic analysis of \\textit{model type effects} on performance: we compare reasoning vs.\\ instruction-tuned architectures, GGUF (llama.cpp) vs.\\ MLX (mlx\\_lm) backends, and quantization levels (Q3 / Q4\\_K\\_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx\\_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models.   We document the full parameter sweep ($t$, $p$, $k$) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B--123B parameter models on Apple~Silicon.","published_date":"2026-04-20T17:53:29+00:00","viability_score":7,"cluster_label":"LLM Benchmarking","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A systematic evaluation of cloud vs. local LLMs for System Dynamics AI assistance, providing insights into performance trade-offs and practical deployment guides.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18543v1","title":"ClawEnvKit: Automatic Environment Generation for Claw-Like Agents","abstract":"Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. 
The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. 
The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.","published_date":"2026-04-20T17:36:49+00:00","viability_score":7,"cluster_label":"AI Tools for Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Automate generation of diverse environments for claw-like agents, reducing manual effort and costs in robotics training.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.18539v1","title":"Transition-Matrix Regularization for Next Dialogue Act Prediction in Counselling Conversations","abstract":"This paper studies how empirical dialogue-flow statistics can be incorporated into Next Dialogue Act Prediction (NDAP). A KL regularization term is proposed that aligns predicted act distributions with corpus-derived transition patterns. Evaluated on a 60-class German counselling taxonomy using 5-fold cross-validation, this improves macro-F1 by 9--42% relative depending on encoder and substantially improves dialogue-flow alignment. Cross-dataset validation on HOPE suggests that improvements transfer across languages and counselling domains. In systematic ablations across pretrained encoders and architectures, the findings indicate that transition regularization provides consistent gains and disproportionately benefits weaker baseline models. 
The results suggest that lightweight discourse-flow priors complement pretrained encoders, especially in fine-grained, data-sparse dialogue tasks.","published_date":"2026-04-20T17:33:37+00:00","viability_score":7,"cluster_label":"Dialogue Systems","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A regularization technique that improves next dialogue act prediction in counselling conversations by incorporating empirical dialogue-flow statistics.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18532v1","title":"Symbolic Synthesis for LTLf+ Obligations","abstract":"We study synthesis for obligation properties expressed in LTLfp, the extension of LTLf to infinite traces. Obligation properties are positive Boolean combinations of safety and guarantee (co-safety) properties and form the second level of the temporal hierarchy of Manna and Pnueli. Although obligation properties are expressed over infinite traces, they retain most of the simplicity of LTLf. In particular, we show that they admit a translation into symbolically represented deterministic weak automata (DWA) obtained directly from the symbolic deterministic finite automata (DFA) for the underlying LTLf properties on trace prefixes. DWA inherit many of the attractive algorithmic features of DFA, including Boolean closure and polynomial-time minimization. Moreover, we show that synthesis for LTLfp obligation properties is theoretically highly efficient - solvable in linear time once the DWA is constructed. We investigate several symbolic algorithms for solving DWA games that arise in the synthesis of obligation properties and evaluate their effectiveness experimentally. 
Overall, the results indicate that synthesis for LTLfp obligation properties can be performed with virtually the same effectiveness as LTLf synthesis.","published_date":"2026-04-20T17:27:28+00:00","viability_score":0,"cluster_label":"Formal Methods","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical framework for symbolic synthesis of obligation properties in LTLf, showing efficiency comparable to LTLf synthesis.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18530v1","title":"OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning","abstract":"Recent advancements in Reinforcement Learning with Verifiable Rewards (RLVR) have significantly improved Large Language Model (LLM) reasoning, yet models often struggle to explore novel trajectories beyond their initial latent space. While offline teacher guidance and entropy-driven strategies have been proposed to address this, they often lack deep integration or are constrained by the model's inherent capacity. In this paper, we propose OGER, a novel framework that unifies offline teacher guidance and online reinforcement learning through a specialized reward modeling lens. OGER employs multi-teacher collaborative training and constructs an auxiliary exploration reward that leverages both offline trajectories and the model's own entropy to incentivize autonomous exploration. Extensive experiments across mathematical and general reasoning benchmarks demonstrate that OGER significantly outperforms competitive baselines, achieving substantial gains in mathematical reasoning while maintaining robust generalization to out-of-domain tasks. We provide a comprehensive analysis of training dynamics and conduct detailed ablation studies to validate the effectiveness of our entropy-aware reward modulation. 
Our code is available at https://github.com/ecoli-hit/OGER.git.","published_date":"2026-04-20T17:26:00+00:00","viability_score":7,"cluster_label":"Reinforcement Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel framework that unifies offline teacher guidance and online reinforcement learning for improved LLM reasoning and exploration.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18521v1","title":"IDOBE: Infectious Disease Outbreak forecasting Benchmark Ecosystem","abstract":"Epidemic forecasting has become an integral part of real-time infectious disease outbreak response. While collaborative ensembles composed of statistical and machine learning models have become the norm for real-time forecasting, standardized benchmark datasets for evaluating such methods are lacking. Further, there is limited understanding on performance of these methods for novel outbreaks with limited historical data. In this paper, we propose IDOBE, a curated collection of epidemiological time series focused on outbreak forecasting. IDOBE compiles from multiple data repositories spanning over a century of surveillance and across U.S. states and global locations. We perform derivative-based segmentation to generate over 10,000 outbreaks covering multiple outcomes such as cases and hospitalizations for 13 diseases. We consider a variety of information-theoretic and distributional measures to quantify the epidemiological diversity of the dataset. Finally, we perform multi-horizon short-term forecasting (1- to 4-week-ahead) through the progression of the outbreak using 11 baseline models and report on their performance. In addition to standard metrics such as NMSE and MAPE for point forecasts, we include probabilistic scoring rules such as Normalized Weighted Interval Score (NWIS) to quantify the performance. 
We find that MLP-based methods have the most robust performance, with statistical methods having a slight edge during the pre-peak phase. IDOBE dataset along with baselines are released publicly on https://github.com/NSSAC/IDOBE to enable standardized, reproducible benchmarking of outbreak forecasting methods.","published_date":"2026-04-20T17:18:18+00:00","viability_score":8,"cluster_label":"Epidemic Forecasting","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A comprehensive benchmark ecosystem for infectious disease outbreak forecasting, enabling standardized evaluation of statistical and machine learning models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18519v1","title":"LLM Safety From Within: Detecting Harmful Content with Internal Representations","abstract":"Guard models are widely used to detect harmful content in user prompts and LLM responses. However, state-of-the-art guard models rely solely on terminal-layer representations and overlook the rich safety-relevant features distributed across internal layers. We present SIREN, a lightweight guard model that harnesses these internal features. By identifying safety neurons via linear probing and combining them through an adaptive layer-weighted strategy, SIREN builds a harmfulness detector from LLM internals without modifying the underlying model. Our comprehensive evaluation shows that SIREN substantially outperforms state-of-the-art open-source guard models across multiple benchmarks while using 250 times fewer trainable parameters. Moreover, SIREN exhibits superior generalization to unseen benchmarks, naturally enables real-time streaming detection, and significantly improves inference efficiency compared to generative guard models. 
Overall, our results highlight LLM internal states as a promising foundation for practical, high-performance harmfulness detection.","published_date":"2026-04-20T17:17:07+00:00","viability_score":3,"cluster_label":"AI Safety/Content Moderation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Revolutionize content moderation by detecting harmful content using internal representations of LLMs for improved safety.","time_to_mvp":"1-2 weeks","tags":["high_potential"]},{"arxiv_id":"2604.18510v1","title":"Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks","abstract":"Open-weight language models can be rendered unsafe through several distinct interventions, but the resulting models may differ substantially in capabilities, behavioral profile, and internal failure mode. We study behavioral and mechanistic properties of jailbroken models across three unsafe routes: harmful supervised fine-tuning (SFT), harmful reinforcement learning with verifiable rewards (RLVR), and refusal-suppressing abliteration. All three routes achieve near-ceiling harmful compliance, but they diverge once we move beyond direct harmfulness. RLVR-jailbroken models show minimal degradation and preserve explicit harm recognition in a structured self-audit: they are able to identify harmful prompts and describe how a safe LLM should respond, yet they comply with the harmful request. With RLVR, harmful behavior is strongly suppressed by a reflective safety scaffold: when a harmful prompt is prepended with an instruction to reflect on safety standards, harmful behavior drops close to the baseline. Category-specific RLVR jailbreaks generalize broadly across harmfulness domains. Models jailbroken with SFT show the largest collapse in explicit safety judgments, the highest behavioral drift, and a substantial capability loss on standard benchmarks. 
Abliteration is family-dependent in both self-audit and response to a reflective safety scaffold. Mechanistic and repair analyses further separate the routes: abliteration is consistent with localized refusal-feature deletion, RLVR with preserved safety geometry but retargeted policy behavior, and SFT with broader distributed drift. Targeted repair partially recovers RLVR-jailbroken models, but has little effect on SFT-jailbroken models. Together, these results show that jailbreaks can produce vastly different properties despite similar harmfulness, with models jailbroken via RLVR showing remarkable similarity to the base model.","published_date":"2026-04-20T17:01:27+00:00","viability_score":7,"cluster_label":"LLM Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Analyzes distinct jailbreaking methods for open-weight LLMs, revealing divergent behavioral and mechanistic properties despite similar harmfulness.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18508v1","title":"Document-as-Image Representations Fall Short for Scientific Retrieval","abstract":"Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientific document retrieval, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring such representations. In this work, we argue that this paradigm is not well-suited for text-rich multimodal scientific documents, where critical evidence is distributed across structured sources, including text, tables, and figures. To study this setting, we introduce ArXivDoc, a new benchmark constructed from the underlying LaTeX sources of scientific papers. 
Unlike PDF or image-based representations, LaTeX provides direct access to structured elements (e.g., sections, tables, figures, equations), enabling controlled query construction grounded in specific evidence types. We systematically compare text-only, image-based, and multimodal representations across both single-vector and multi-vector retrieval models. Our results show that: (1) document-as-image representations are consistently suboptimal, especially as document length increases; (2) text-based representations are most effective, even for figure-based queries, by leveraging captions and surrounding context; and (3) interleaved text+image representations outperform document-as-image approaches without requiring specialized training.","published_date":"2026-04-20T17:00:17+00:00","viability_score":4,"cluster_label":"Document Retrieval","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark and analysis showing that text-based representations outperform image-based ones for scientific document retrieval, even for figure-based queries.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.18507v1","title":"Learning the Riccati solution operator for time-varying LQR via Deep Operator Networks","abstract":"We propose a computational framework for replacing the repeated numerical solution of differential Riccati equations in finite-horizon Linear Quadratic Regulator (LQR) problems by a learned operator surrogate. Instead of solving a nonlinear matrix-valued differential equation for each new system instance, we construct offline an approximation of the associated solution operator mapping time-dependent system parameters to the Riccati trajectory. The resulting model enables fast online evaluation of approximate optimal feedbacks across a wide class of systems, thereby shifting the computational burden from repeated numerical integration to a one-time learning stage. 
From a theoretical perspective, we establish control-theoretic guarantees for this operator-based approximation. In particular, we derive bounds quantifying how operator approximation errors propagate to feedback performance, trajectory accuracy, and cost suboptimality, and we prove that exponential stability of the closed-loop system is preserved under sufficiently accurate operator approximation. These results provide a framework to assess the reliability of data-driven approximations in optimal control. On the computational side, we design tailored DeepONet architectures for matrix-valued, time-dependent problems and introduce a progressive learning strategy to address scalability with respect to the system dimension. Numerical experiments on both time-invariant and time-varying LQR problems demonstrate that the proposed approach achieves high accuracy and strong generalization across a wide range of system configurations, while delivering substantial computational speedups compared to classical solvers. The method offers an effective and scalable alternative for parametric and real-time optimal control applications.","published_date":"2026-04-20T16:56:34+00:00","viability_score":6,"cluster_label":"Optimal Control","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A deep operator network framework that learns a surrogate for the Riccati solution operator, enabling fast online evaluation of optimal feedbacks for Linear Quadratic Regulator problems.","time_to_mvp":"3-6 months","tags":["high_potential"]},{"arxiv_id":"2604.18491v1","title":"Faster by Design: Interactive Aerodynamics via Neural Surrogates Trained on Expert-Validated CFD","abstract":"Computational Fluid Dynamics (CFD) is central to race-car aerodynamic development, yet its cost -- tens of thousands of core-hours per high-fidelity evaluation -- severely limits the design space exploration feasible within realistic budgets. 
AI-based surrogate models promise to alleviate this bottleneck, but progress has been constrained by the limited complexity of public datasets, which are dominated by smoothed passenger-car shapes that fail to exercise surrogates on the thin, complex, highly loaded components governing motorsport performance. This work presents three primary contributions. First, we introduce a high-fidelity RANS dataset built on a parametric LMP2-class CAD model and spanning six operating conditions (map points) covering straight-line and cornering regimes, generated and validated by aerodynamics experts at Dallara to preserve features relevant to industrial motorsport. Second, we present the Gauge-Invariant Spectral Transformer (GIST), a graph-based neural operator whose spectral embeddings encode mesh connectivity to enhance predictions on tightly packed, complex geometries. GIST guarantees discretization invariance and scales linearly with mesh size, achieving state-of-the-art accuracy on both public benchmarks and the proposed race-car dataset. 
Third, we demonstrate that GIST achieves a level of predictive accuracy suitable for early-stage aerodynamic design, providing a first validation of the concept of interactive design-space exploration -- where engineers query a surrogate in place of the CFD solver -- within industrial motorsport workflows.","published_date":"2026-04-20T16:42:35+00:00","viability_score":7,"cluster_label":"Aerodynamics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A neural surrogate model trained on expert-validated CFD data for interactive aerodynamics design, enabling rapid exploration of design spaces in motorsport.","time_to_mvp":"3-6 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18490v1","title":"LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation","abstract":"Existing MT evaluation frameworks, including automatic metrics and human evaluation schemes such as Multidimensional Quality Metrics (MQM), are largely language-agnostic. However, they often fail to capture dialect- and culture-specific errors in diglossic languages (e.g., Arabic), where translation failures stem from mismatches in language variety, content coverage, and pragmatic appropriateness rather than surface form alone. We introduce LQM: Linguistically Motivated Multidimensional Quality Metrics for MT. LQM is a hierarchical error taxonomy for diagnosing MT errors through six linguistically grounded levels: sociolinguistics, pragmatics, semantics, morphosyntax, orthography, and graphetics (Figure 1). We construct a bidirectional parallel corpus of 3,850 sentences (550 per variety) spanning seven Arabic dialects (Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni), derived from conversational, culturally rich content. 
We evaluate six LLMs in a zero-shot setting and conduct expert span-level human annotation using LQM, producing 6,113 labeled error spans across 3,495 unique erroneous sentences, along with severity-weighted quality scores. We complement this analysis with an automatic metric (spBLEU). Though validated here on Arabic, LQM is a language-agnostic framework designed to be easily applied to or adapted for other languages. LQM annotated errors data, prompts, and annotation guidelines are publicly available at https://github.com/UBC-NLP/LQM_MT.","published_date":"2026-04-20T16:41:37+00:00","viability_score":5,"cluster_label":"Machine Translation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LQM is a linguistically motivated framework for evaluating machine translation quality, designed to capture dialect- and culture-specific errors in diglossic languages like Arabic.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.18487v1","title":"Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety","abstract":"The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, Chemical, biological, radiological and nuclear (CBRN) is the highest bucket. 
Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.","published_date":"2026-04-20T16:37:27+00:00","viability_score":7,"cluster_label":"LLM Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and dataset for evaluating the stylistic robustness of frontier model safety refusals, revealing significant weaknesses in generalization.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18478v1","title":"WorldDB: A Vector Graph-of-Worlds Memory Engine with Ontology-Aware Write-Time Reconciliation","abstract":"Persistent memory is the bottleneck separating stateless chatbots from long-running agentic systems. Retrieval-augmented generation (RAG) over flat vector stores fragments facts into chunks, loses cross-session identity, and has no first-class notion of supersession or contradiction. Recent bitemporal knowledge-graph systems (Graphiti, Memento, Hydra DB) add typed edges and valid-time metadata, but the graph itself remains flat: no recursive composition, no content-addressed invariants on nodes, and edge types carry no behavior beyond a label. We present WorldDB, a memory engine built on three commitments: (i) every node is a world -- a container with its own interior subgraph, ontology scope, and composed embedding, recursive to arbitrary depth; (ii) nodes are content-addressed and immutable, so any edit produces a new hash at the node and every ancestor, giving a Merkle-style audit trail for free; (iii) edges are write-time programs -- each edge type ships on_insert/on_delete/on_query_rewrite handlers (supersession closes validity, contradicts preserves both sides, same_as stages a merge proposal), so no raw append path exists. 
On LongMemEval-s (500 questions, ~115k-token conversational stacks), WorldDB with Claude Opus 4.7 as answerer achieves 96.40% overall / 97.11% task-averaged accuracy, a +5.61pp improvement over the previously reported Hydra DB state-of-the-art (90.79%) and +11.20pp over Supermemory (85.20%), with perfect single-session-assistant recall and robust performance on temporal reasoning (96.24%), knowledge update (98.72%), and preference synthesis (96.67%). Ablations show that the engine's graph layer -- resolver-unified entities and typed refers_to edges -- contributes +7.0pp task-averaged independently of the underlying answerer.","published_date":"2026-04-20T16:30:53+00:00","viability_score":8,"cluster_label":"Memory Engines","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"WorldDB is a novel vector graph-of-worlds memory engine that significantly improves agentic system performance by enabling recursive world composition and ontology-aware write-time reconciliation.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18469v1","title":"A Generalized Synthetic Control Method for Baseline Estimation in Demand Response Services","abstract":"Baseline estimation is critical to Demand Response (DR) settlement in electricity markets, yet existing machine learning methods remain limited in predictive performance, while methodologies from causal inference and counterfactual prediction are still underutilized in this domain. We introduce a Generalized Synthetic Control Method that builds on the classical Synthetic Control Method (SCM) from econometrics. While SCM provides a powerful framework for counterfactual estimation, classical SCM remains a static estimator: it fits the treated unit as a combination of contemporaneous donor units and therefore ignores predictable temporal structure in the residual error. 
We develop a generalized SCM framework that transforms baseline estimation into a dynamic counterfactual prediction problem by augmenting the donor representation with exogenous features, lagged treated load, and selected lagged donor signals. This enriched representation allows the estimator to capture autoregressive dependence, delayed donor-response patterns, and error-correction effects beyond the scope of standard SCM. The framework further accommodates nonlinear predictors when linear weighting is inadequate, with the greatest benefit arising in limited-data settings. Experiments on the Ausgrid smart-meter dataset show consistent improvements over classical SCM and strong benchmark methods, with the dominant performance gains driven by dynamic augmentation.","published_date":"2026-04-20T16:21:33+00:00","viability_score":7,"cluster_label":"Causal Inference","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A Generalized Synthetic Control Method for baseline estimation in demand response services, outperforming existing methods by treating it as a dynamic counterfactual prediction problem.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18468v1","title":"Asset Harvester: Extracting 3D Assets from Autonomous Driving Logs for Simulation","abstract":"Closed-loop simulation is a core component of autonomous vehicle (AV) development, enabling scalable testing, training, and safety validation before real-world deployment. Neural scene reconstruction converts driving logs into interactive 3D environments for simulation, but it does not produce complete 3D object assets required for agent manipulation and large-viewpoint novel-view synthesis. To address this challenge, we present Asset Harvester, an image-to-3D model and end-to-end pipeline that converts sparse, in-the-wild object observations from real driving logs into complete, simulation-ready assets. 
Rather than relying on a single model component, we developed a system-level design for real-world AV data that combines large-scale curation of object-centric training tuples, geometry-aware preprocessing across heterogeneous sensors, and a robust training recipe that couples sparse-view-conditioned multiview generation with 3D Gaussian lifting. Within this system, SparseViewDiT is explicitly designed to address limited-angle views and other real-world data challenges. Together with hybrid data curation, augmentation, and self-distillation, this system enables scalable conversion of sparse AV object observations into reusable 3D assets.","published_date":"2026-04-20T16:20:57+00:00","viability_score":8,"cluster_label":"3D Asset Generation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Asset Harvester is an end-to-end pipeline that converts sparse, in-the-wild object observations from autonomous driving logs into complete, simulation-ready 3D assets.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18467v1","title":"An Integrated Deep-Learning Framework for Peptide-Protein Interaction Prediction and Target-Conditioned Peptide Generation with ConGA-PePPI and TC-PepGen","abstract":"Motivation: Peptide-protein interactions (PepPIs) are central to cellular regulation and peptide therapeutics, but experimental characterization remains too slow for large-scale screening. Existing methods usually emphasize either interaction prediction or peptide generation, leaving candidate prioritization, residue-level interpretation, and target-conditioned expansion insufficiently integrated. Results: We present an integrated framework for early-stage peptide screening that combines a partner-aware prediction and localization model (ConGA-PepPI) with a target-conditioned generative model (TC-PepGen). 
ConGA-PepPI uses asymmetric encoding, bidirectional cross-attention, and progressive transfer from pair prediction to binding-site localization, while TC-PepGen preserves target information throughout autoregressive decoding via layerwise conditioning. In five-fold cross-validation, ConGA-PepPI achieved 0.839 accuracy and 0.921 AUROC, with binding-site AUPR values of 0.601 on the protein side and 0.950 on the peptide side, and remained competitive on external benchmarks. Under a controlled length-conditioned benchmark, 40.39% of TC-PepGen peptides exceeded native templates in AlphaFold 3 ipTM, and unconstrained generation retained evidence of target-conditioned signal.","published_date":"2026-04-20T16:20:23+00:00","viability_score":7,"cluster_label":"Biotech AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An integrated AI framework for peptide-protein interaction prediction and target-conditioned peptide generation to accelerate drug discovery.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18463v1","title":"Using large language models for embodied planning introduces systematic safety risks","abstract":"Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). 
We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.","published_date":"2026-04-20T16:18:08+00:00","viability_score":6,"cluster_label":"Robotics Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and analysis revealing systematic safety risks in using large language models for robotic planning.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18459v1","title":"Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions","abstract":"Visual agents operating in the wild must respond to queries precisely when sufficient evidence first appears in a video stream, a critical capability that is overlooked by conventional video LLMs evaluated in offline settings. The shift to an online, streaming paradigm introduces significant challenges: a lack of decision transparency, the difficulty of aligning response timing with visual evidence, and the need to maintain a global, causally consistent understanding under tight computational budgets. To address these issues, we propose a novel framework that decouples reasoning control from memory integration. We introduce \\textbf{\\model{}}, an instantiation of this framework with two core components. First, the \\emph{Active Thinking Decision Maker (ATDM)} is a transparent reasoning controller that externalizes its decision process using observable progress ($\\boldsymbol\u03c1$) and confidence ($\\boldsymbol{c}$) metrics. 
This allows it to precisely time its response $t_r$ to match the first-sufficient-evidence timestamp $t^\\star$ while streaming its reasoning to the user. Second, the \\emph{Hierarchical Progressive Semantic Integration (HPSI)} module acts as an efficient memory system. It employs a set of learnable, multi-level aggregation tokens that are propagated across clips to build a rich, global cognitive state without exceeding token budgets. Extensive experiments demonstrate the effectiveness of ATDM and HPSI, e.g., Thinking-QwenVL improves the accuracy of the previous state-of-the-art from 67.63\\% to 71.60\\% on the StreamingBench benchmark.","published_date":"2026-04-20T16:15:33+00:00","viability_score":7,"cluster_label":"Video Understanding","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for real-time video understanding that aligns responses with evidence and provides transparent decision-making.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18444v1","title":"ProtoCLIP: Prototype-Aligned Latent Refinement for Robust Zero-Shot Chest X-Ray Classification","abstract":"Zero-shot vision-language models (VLMs) have shown promise for chest radiograph classification, but their performance is often limited by confounding label co-occurrence, long-tail class imbalance, and transfer instability under domain shift. We propose ProtoCLIP, a refinement strategy for CLIP-style VLMs that improves zero-shot discrimination through targeted data curation and distilled anchor alignment. Specifically, we construct pathology-focused training subsets with curated negative samples to reduce co-occurrence bias. 
We also introduce a representation-preserving distillation objective to stabilize adaptation while maintaining semantic structure and improving discrimination of clinically relevant co-occurring pathologies. Evaluated on an unseen dataset VinDr-CXR, ProtoCLIP improves AUC by 2-10 percentage points over a strong CLIP-based baseline across multiple findings. For pneumothorax specifically, ProtoCLIP achieves a state-of-the-art AUC of 0.94. These results demonstrate that anchor-guided refinement, coupled with curated supervision and controlled adaptation, can mitigate common zero-shot transfer failures in medical VLMs without requiring large-scale retraining.","published_date":"2026-04-20T16:01:44+00:00","viability_score":8,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ProtoCLIP enhances zero-shot chest X-ray classification by refining vision-language models with curated data and prototype alignment.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18429v1","title":"Revisiting Change VQA in Remote Sensing with Structured and Native Multimodal Qwen Models","abstract":"Change visual question answering (Change VQA) addresses the problem of answering natural-language questions about semantic changes between bi-temporal remote sensing (RS) images. Although vision-language models (VLMs) have recently been studied for temporal RS image understanding, Change VQA remains underexplored in the context of modern multimodal models. In this letter, we revisit the CDVQA benchmark using recent Qwen models under a unified low-rank adaptation (LoRA) setting. We compare Qwen3-VL, which follows a structured vision-language pipeline with multi-depth visual conditioning and a full-attention decoder, with Qwen3.5, a native multimodal model that combines a single-stage alignment with a hybrid decoder backbone. 
Experimental results on the official CDVQA test splits show that recent VLMs improve over earlier specialized baselines. They further show that performance does not scale monotonically with model size, and that native multimodal models are more effective than structured vision-language pipelines for this task. These findings indicate that tightly integrated multimodal backbones contribute more to performance than scale or explicit multi-depth visual conditioning for language-driven semantic change reasoning in RS imagery.","published_date":"2026-04-20T15:47:52+00:00","viability_score":7,"cluster_label":"Remote Sensing AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Leveraging Qwen multimodal models with LoRA for improved change visual question answering in remote sensing imagery.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18404v1","title":"Six Llamas: Comparative Religious Ethics Through LoRA-Adapted Language Models","abstract":"We present Six Llamas, a comparative study examining whether large language models fine-tuned on distinct religious corpora encode systematically different patterns of ethical reasoning. Six variants of Meta-Llama-3.1-8B are constructed: one unmodified control and five LoRA-adapted models trained exclusively on the sacred and theological texts of Christianity, Islam, Judaism, Hinduism, or Buddhism. All six models are probed with an identical battery of 17 standardized ethical prompts spanning moral dilemmas, game-theoretic scenarios, public policy questions, and moral-psychological self-assessments. To assess robustness and reproducibility, we implement a multi-temperature sampling design spanning ten temperature settings. We compute response consistency metrics, pairwise inter-model agreement rates, temperature sensitivity coefficients across four prompt domains, and run-to-run stability analyses.   
Findings show that LoRA-adapted models produce ethical reasoning patterns that are (a) systematically differentiated from the base model, (b) consistent with the moral logics of their training traditions, (c) structured along interpretable dimensions in moral-philosophical space, (d) core ethical positions remain stable across temperature variations for high-consensus dilemmas. The Trolley Problem achieves 100% consistency across all models and temperatures, while (e) tradition-specific divergence intensifies at higher temperatures in morally contested domains, and (f) the base model exhibits the highest overall response consistency (mean 88.3%), suggesting LoRA adaptation introduces both tradition-specific signal and increased sampling sensitivity.   The study offers a proof-of-concept for the condensate comparative method using differentially trained language models as instruments for cultural and ethical analysis and identifies specific criteria for falsification and planned extensions.","published_date":"2026-04-20T15:22:59+00:00","viability_score":5,"cluster_label":"LLM Ethics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Fine-tuning Llama models with LoRA on religious texts to analyze differentiated ethical reasoning patterns.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2604.18398v1","title":"AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment","abstract":"Creativity has become a core competence in the era of LLMs and human-AI collaboration, underpinning innovation in real-world problem solving. Crucially, the systematic improvement of creativity necessitates scientifically valid assessment instruments. Psychometric research recognizes context-based assessment as an effective way to measure creative thinking. However, high-quality expert-designed contexts remain scarce. 
Existing LLM-based generators often struggle with insufficient assessment cues, weak narrative coherence, limited stylistic diversity, and poor support for creative thinking. To address these challenges, we propose AlphaContext, an evolutionary tree-based psychometric context generator for creativity assessment. First, the HyperTree Outline Planner formalizes expert-designed outlining as a rule-guided hypertree and performs top-down hierarchical planning. The MCTS-based Context Generator fills the outline via MCTS to balance global structure and local quality. Then, the Evolutionary Context Optimizer evolves contexts with MAP-Elites by repeatedly updating niche elites to jointly improve diversity and quality. Finally, the Assessment-Guided Evolution Refiner simulates virtual participants with diverse styles and recycles weak contexts for further evolution. Experiments show that AlphaContext yields an average improvement of 8% over competitive methods across 6 quality metrics.","published_date":"2026-04-20T15:20:58+00:00","viability_score":4,"cluster_label":"AI for Creativity Assessment","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An evolutionary tree-based generator for creating psychometric contexts to assess creativity.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18390v1","title":"Randomly Initialized Networks Can Learn from Peer-to-Peer Consensus","abstract":"In self-supervised learning, self-distilled methods have shown impressive performance, learning representations useful for downstream tasks and even displaying emergent properties. However, state-of-the-art methods usually rely on ensembles of complex mechanisms, with many design choices that are empirically motivated and not well understood.   In this work, we explore the role of self-distillation within learning dynamics. 
Specifically, we isolate the effect of self-distillation by training a group of randomly initialized networks, removing all other common components such as projectors, predictors, and even pretext tasks. Our findings show that even this minimal setup can lead to learned representations with non-trivial improvements over a random baseline on downstream tasks. We also demonstrate how this effect varies with different hyperparameters and present a short analysis of what is being learned by the models under this setup.","published_date":"2026-04-20T15:13:39+00:00","viability_score":3,"cluster_label":"Self-Supervised Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Demonstrating that randomly initialized networks can learn representations through peer-to-peer consensus without complex mechanisms.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18381v1","title":"Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes","abstract":"Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning, and spatial reasoning, we characterize how model performance scales with dataset size, diversity, and complexity. 
We demonstrate that (1) procedural datasets allow for fine-grained evaluation and training dataset development with controllable properties (size, diversity, and complexity), (2) under RLVR, models trained on lower complexity tasks can generalize to higher complexity tasks, and (3) training on mixed complexity datasets is associated with the greatest benefits in low data regimes, providing up to 5x sample efficiency versus training on easy tasks. These findings inspire future work on the development of data scaling laws for RLVR and the use of procedural data generators to further understand effective data development for efficient LLM fine-tuning.","published_date":"2026-04-20T15:04:57+00:00","viability_score":4,"cluster_label":"LLM Fine-tuning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research explores effective RLVR fine-tuning strategies for small language models in low-data environments, demonstrating improved sample efficiency and generalization through procedural datasets.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18380v1","title":"The implicated scientist: on the role of AI researchers in the development of weapons systems","abstract":"Artificial intelligence (AI) technologies are increasingly used in modern weapons systems. Notably, these systems have recently been involved in mass killings and destruction at scale. Furthermore, there is currently a strong interest and competition among powerful players to accelerate the proliferation of weapons with automated or AI-based components, a phenomenon known as AI arms race. This competition poses a risk of causing even more deaths and devastation in the future, as well as increased power and wealth inequality. In this work, we aim to shed light on the role of AI researchers as implicated subjects in the harms caused by weapons enabled by AI technologies. 
We investigate and discuss the specifics of this implication and explore ways to transfigure this position of implication into one of differentiated, long-distance solidarity with the victims of technologically fortified injustices.","published_date":"2026-04-20T15:04:37+00:00","viability_score":0,"cluster_label":"AI Ethics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper examines the ethical implications of AI researchers' involvement in the development of weapons systems and explores avenues for solidarity with victims of technologically-driven injustices.","time_to_mvp":"N/A","tags":[]},{"arxiv_id":"2604.18375v1","title":"IceBreaker for Conversational Agents: Breaking the First-Message Barrier with Personalized Starters","abstract":"Conversational agents, such as ChatGPT and Doubao, have become essential daily assistants for billions of users. To further enhance engagement, these systems are evolving from passive responders to proactive companions. However, existing efforts focus on activation within ongoing dialogues, while overlooking a key real-world bottleneck. In the conversation initiation stage, users may have a vague need but no explicit query intent, creating a first-message barrier where the conversation holds before it begins. To overcome this, we introduce Conversation Starter Generation: generating personalized starters to guide users into conversation. However, unlike in-conversation stages where immediate context guides the response, initiation must operate in a cold-start moment without explicit user intent. To pioneer in this direction, we present IceBreaker that frames human ice-breaking as a two-step handshake: (i) evoke resonance via Resonance-Aware Interest Distillation from session summaries to capture trigger interests, and (ii) stimulate interaction via Interaction-Oriented Starter Generation, optimized with personalized preference alignment and a self-reinforced loop to maximize engagement. 
Online A/B tests on one of the world's largest conversational agent products show that IceBreaker improves user active days by +0.184% and click-through rate by +9.425%, and has been deployed in production.","published_date":"2026-04-20T15:02:03+00:00","viability_score":7,"cluster_label":"Conversational AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"IceBreaker generates personalized conversation starters for AI agents, overcoming the initial user engagement barrier and demonstrably increasing user activity and click-through rates in production.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18373v1","title":"Dissecting AI Trading: Behavioral Finance and Market Bubbles","abstract":"We study how AI agents form expectations and trade in experimental asset markets. Using a simulated open-call auction populated by autonomous Large Language Model (LLM) agents, we document three main findings. First, AI agents exhibit classic behavioral patterns: a pronounced disposition effect and recency-weighted extrapolative beliefs. Second, these individual-level patterns aggregate into equilibrium dynamics that replicate classic experimental findings (Smith et al., 1988), including the predictive power of excess demand for future prices and the positive relationship between disagreement and trading volume. 
Third, by analyzing the agents' reasoning text through a twenty-mechanism scoring framework, we show that targeted prompt interventions causally amplify or suppress specific behavioral mechanisms, significantly altering the magnitude of market bubbles.","published_date":"2026-04-20T15:00:53+00:00","viability_score":4,"cluster_label":"AI Trading","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This study uses LLM agents in simulated markets to reveal behavioral finance patterns and demonstrates how prompt interventions can control market bubble formation.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18364v1","title":"Training and Agentic Inference Strategies for LLM-based Manim Animation Generation","abstract":"Generating programmatic animation using libraries such as Manim presents unique challenges for Large Language Models (LLMs), requiring spatial reasoning, temporal sequencing, and familiarity with domain-specific APIs that are underrepresented in general pre-training data. A systematic study of how training and inference strategies interact in this setting is lacking in current research. This study introduces ManimTrainer, a training pipeline that combines Supervised Fine-tuning (SFT) with Reinforcement Learning (RL) based Group Relative Policy Optimisation (GRPO) using a unified reward signal that fuses code and visual assessment signals, and ManimAgent, an inference pipeline featuring Renderer-in-the-loop (RITL) and API documentation-augmented RITL (RITL-DOC) strategies. Using these techniques, this study presents the first unified training and inference study for text-to-code-to-video transformation with Manim. It evaluates 17 open-source sub-30B LLMs across nine combinations of training and inference strategies using ManimBench. Results show that SFT generally improves code quality, while GRPO enhances visual outputs and increases the models' responsiveness to extrinsic signals during self-correction at inference time. 
The Qwen 3 Coder 30B model with GRPO and RITL-DOC achieved the highest overall performance, with a 94% Render Success Rate (RSR) and 85.7% Visual Similarity (VS) to reference videos, surpassing the baseline GPT-4.1 model by +3 percentage points in VS. Additionally, the analysis shows that the correlation between code and visual metrics strengthens with SFT and GRPO but weakens with inference-time enhancements, highlighting the complementary roles of training and agentic inference strategies in Manim animation generation.","published_date":"2026-04-20T14:54:06+00:00","viability_score":4,"cluster_label":"Generative Video","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel training and inference pipeline for LLM-based Manim animation generation that improves code quality and visual outputs.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18352v1","title":"Tight Auditing of Differential Privacy in MST and AIM","abstract":"State-of-the-art Differentially Private (DP) synthetic data generators such as MST and AIM are widely used, yet tightly auditing their privacy guarantees remains challenging. We introduce a Gaussian Differential Privacy (GDP)-based auditing framework that measures privacy via the full false-positive/false-negative tradeoff. Applied to MST and AIM under worst-case settings, our method provides the first tight audits in the strong-privacy regime. For $(\u03b5,\u03b4)=(1,10^{-2})$, we obtain $\u03bc_{emp}\\approx0.43$ vs. implied $\u03bc=0.45$, showing a small theory-practice gap.   
Our code is publicly available: https://github.com/sassoftware/dpmm.","published_date":"2026-04-20T14:45:15+00:00","viability_score":6,"cluster_label":"Differential Privacy","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A Gaussian Differential Privacy-based auditing framework for synthetic data generators that provides tight audits in the strong-privacy regime.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18348v1","title":"AdaCluster: Adaptive Query-Key Clustering for Sparse Attention in Video Generation","abstract":"Video diffusion transformers (DiTs) suffer from prohibitive inference latency due to quadratic attention complexity. Existing sparse attention methods either overlook semantic similarity or fail to adapt to heterogeneous token distributions across layers, leading to model performance degradation. We propose AdaCluster, a training-free adaptive clustering framework that accelerates the generation of DiTs while preserving accuracy. AdaCluster applies an angle-similarity-preserving clustering method to query vectors for higher compression, and designs a euclidean-similarity-preserving clustering method for keys, covering cluster number assignment, threshold-wise adaptive clustering, and efficient critical cluster selection. 
Experiments on CogVideoX-2B, HunyuanVideo, and Wan-2.1 on one A40 GPU demonstrate up to 1.67-4.31x speedup with negligible quality degradation.","published_date":"2026-04-20T14:43:36+00:00","viability_score":7,"cluster_label":"Video Generation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AdaCluster is a training-free framework that accelerates video diffusion transformers with adaptive query-key clustering, achieving significant speedups with negligible quality loss.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.18347v1","title":"Multilingual Training and Evaluation Resources for Vision-Language Models","abstract":"Vision Language Models (VLMs) have achieved rapid progress in recent years. However, despite their growth, VLM development is heavily grounded in English, leading to two main limitations: (i) the lack of multilingual and multimodal datasets for training, and (ii) the scarcity of comprehensive evaluation benchmarks across languages. In this work, we address these gaps by introducing a new comprehensive suite of resources for VLM training and evaluation spanning five European languages (English, French, German, Italian, and Spanish). We adopt a regeneration-translation paradigm that produces high-quality cross-lingual resources by combining curated synthetic generation and manual annotation. Specifically, we build Multi-PixMo, a training corpus obtained by regenerating examples from pre-existing PixMo datasets (PixMo-Cap, PixMo-AskModelAnything, and CoSyn-400k) with permissively licensed models. On the evaluation side, we construct a set of multilingual benchmarks derived by translating widely used English datasets (MMbench, ScienceQA, MME, POPE, AI2D). We assess the quality of these resources through qualitative and quantitative human analyses, measuring inter-annotator agreement. 
Additionally, we perform ablation studies to demonstrate the impact of multilingual data, relative to English-only data, in VLM training. Experiments comprising 3 different models show that using multilingual, multimodal examples for training VLMs is consistently beneficial on non-English benchmarks, with positive transfer to English as well.","published_date":"2026-04-20T14:42:47+00:00","viability_score":5,"cluster_label":"Vision-Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A comprehensive suite of multilingual resources for training and evaluating Vision-Language Models across five European languages, demonstrating consistent benefits for non-English benchmarks.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18344v1","title":"One Pass for All: A Discrete Diffusion Model for Knowledge Graph Triple Set Prediction","abstract":"Knowledge Graphs (KGs) are composed of triples, and the goal of Knowledge Graph Completion (KGC) is to infer the missing factual triples. Traditional KGC tasks predict missing elements in a triple given one or two of its elements. As a more realistic task, the Triple Set Prediction (TSP) task aims to infer the set of missing triples conditioned only on the observed knowledge graph, without assuming any partial information about the missing triples. Existing TSP methods predict the set of missing triples in a triple-by-triple manner, falling short in capturing the dependencies among the predicted triples to ensure consistency. To address this issue, we propose a novel discrete diffusion model termed DiffTSP that treats TSP as a generative task. DiffTSP progressively adds noise to the KG through a discrete diffusion process, achieved by masking relational edges. The reverse process then gradually recovers the complete KG conditioned on the incomplete graph. 
To this end, we design a structure-aware denoising network that integrates a relational context encoder with a relational graph diffusion transformer for knowledge graph generation. DiffTSP can generate the complete set of triples in a one-pass manner while ensuring the dependencies among the predicted triples. Our approach achieves state-of-the-art performance on three public datasets. Code: https://github.com/ADMIS-TONGJI/DiffTSP.","published_date":"2026-04-20T14:41:47+00:00","viability_score":7,"cluster_label":"Knowledge Graph AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A discrete diffusion model that predicts entire sets of missing knowledge graph triples in one pass, ensuring consistency and achieving state-of-the-art performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18327v1","title":"PARM: Pipeline-Adapted Reward Model","abstract":"Reward models (RMs) are central to aligning large language models (LLMs) with human preferences, powering RLHF and advanced decoding strategies. While most prior work focuses on single-step generation, real-world applications increasingly adopt multi-stage LLM pipelines, where effective reward guidance remains underexplored. We investigate this through code generation for combinatorial optimization, constructing a pipeline that integrates reward models into both formulation and solution stages. We identify a critical challenge: inconsistency between reward model predictions and actual pipeline execution outcomes. To address this, we propose the Pipeline-Adapted Reward Model (PARM), which leverages pipeline-specific data and direct preference optimization to align rewards with downstream feedback. We instantiate PARM as a two-stage pipeline (formulation -> code generation) and evaluate it on four public optimization benchmarks, measuring execution rate and solving accuracy against baselines and sampling methods. 
A supplementary cross-domain experiment on GSM8K assesses transferability. Results demonstrate that PARM consistently improves pipeline output quality and stability, providing new insights into reward modeling for multi-stage LLM reasoning.","published_date":"2026-04-20T14:29:08+00:00","viability_score":6,"cluster_label":"LLM Alignment","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A pipeline-adapted reward model that aligns LLM rewards with downstream pipeline execution outcomes for improved consistency in multi-stage LLM applications.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.18320v1","title":"EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations","abstract":"Self-evolution of multimodal large language models (MLLMs) remains a critical challenge: pseudo-label-based methods suffer from progressive quality degradation as model predictions drift, while template-based methods are confined to a static set of transformations that cannot adapt in difficulty or diversity. We contend that robust, continuous self-improvement requires not only deterministic external feedback independent of the model's internal certainty, but also a mechanism to perpetually diversify the training distribution. To this end, we introduce EVE (Executable Visual transformation-based self-Evolution), a novel framework that entirely bypasses pseudo-labels by harnessing executable visual transformations continuously enriched in both variety and complexity. EVE adopts a Challenger-Solver dual-policy architecture. The Challenger maintains and progressively expands a queue of visual transformation code examples, from which it synthesizes novel Python scripts to perform dynamic visual transformations. Executing these scripts yields VQA problems with absolute, execution-verified ground-truth answers, eliminating any reliance on model-generated supervision. 
A multi-dimensional reward system integrating semantic diversity and dynamic difficulty calibration steers the Challenger to enrich its code example queue while posing progressively more challenging tasks, preventing mode collapse and fostering reciprocal co-evolution between the two policies. Extensive experiments demonstrate that EVE consistently surpasses existing self-evolution methods, establishing a robust and scalable paradigm for verifiable MLLM self-evolution. The code is available at https://github.com/0001Henry/EVE .","published_date":"2026-04-20T14:20:44+00:00","viability_score":8,"cluster_label":"Multimodal LLM Evolution","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"EVE enables verifiable self-evolution of MLLMs through executable visual transformations, generating diverse and challenging training data with execution-verified ground truth.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18311v1","title":"On the Importance and Evaluation of Narrativity in Natural Language AI Explanations","abstract":"Explainable AI (XAI) aims to make the behaviour of machine learning models interpretable, yet many explanation methods remain difficult to understand. The integration of Natural Language Generation into XAI aims to deliver explanations in textual form, making them more accessible to practitioners. Current approaches, however, largely yield static lists of feature importances. Although such explanations indicate what influences the prediction, they do not explain why the prediction occurs. In this study, we draw on insights from social sciences and linguistics, and argue that XAI explanations should be presented in the form of narratives. Narrative explanations support human understanding through four defining properties: continuous structure, cause-effect mechanisms, linguistic fluency, and lexical diversity. 
We show that standard Natural Language Processing (NLP) metrics based solely on token probability or word frequency fail to capture these properties and can be matched or exceeded by tautological text that conveys no explanatory content. To address this issue, we propose seven automatic metrics that quantify the narrative quality of explanations along the four identified dimensions. We benchmark current state-of-the-art explanation generation methods on six datasets and show that the proposed metrics separate descriptive from narrative explanations more reliably than standard NLP metrics. Finally, to further advance the field, we propose a set of problem-agnostic XAI Narrative generation rules for producing natural language XAI explanations, so that the resulting XAI Narratives exhibit stronger narrative properties and align with the findings from the linguistic and social science literature.","published_date":"2026-04-20T14:17:39+00:00","viability_score":4,"cluster_label":"Explainable AI (XAI)","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Proposes new metrics to evaluate and generate narrative explanations for AI, moving beyond feature importance to provide more understandable 'why' behind predictions.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.18302v1","title":"Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support","abstract":"Privacy represents one of the most critical yet underaddressed barriers to AI adoption in mental healthcare -- particularly in high-sensitivity operational environments such as military, correctional, and remote healthcare settings, where the risk of patient data exposure can deter help-seeking behavior entirely. 
Existing AI-enabled psychiatric decision support systems predominantly rely on cloud-based inference pipelines, requiring sensitive patient data to leave the device and traverse external servers, creating unacceptable privacy and security risks in these contexts. In this paper, we propose a zero-egress, on-device AI platform for privacy-preserving psychiatric decision support, deployed as a cross-platform mobile application. The proposed system extends our prior work on fine-tuned LLM consortiums for psychiatric diagnosis standardization by fundamentally re-architecting the inference pipeline for fully local execution -- ensuring that no patient data is transmitted to, processed by, or stored on any external server at any stage. The platform integrates a consortium of three lightweight, fine-tuned, and quantized open-source LLMs -- Gemma, Phi-3.5-mini, and Qwen2 -- selected for their compact architectures and proven efficiency on resource-constrained mobile hardware. An on-device orchestration layer coordinates ensemble inference and consensus-based diagnostic reasoning, producing DSM-5-aligned assessments for conditions. The platform is designed to assist clinicians with differential diagnosis and evidence-linked symptom mapping, as well as to support patient-facing self-screening with appropriate clinical safeguards. 
Initial evaluation demonstrates that the proposed zero-egress deployment achieves diagnostic accuracy comparable to its server-side predecessor while sustaining real-time inference latency on commodity mobile hardware.","published_date":"2026-04-20T14:09:01+00:00","viability_score":7,"cluster_label":"On-Device LLM","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A privacy-preserving, on-device AI platform for psychiatric decision support using lightweight LLMs, enabling real-time, local inference for sensitive mental health data.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18292v1","title":"Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence","abstract":"Large language models are increasingly expected to serve as general-purpose agents that interact with external, stateful tool environments. The Model Context Protocol (MCP) and broader agent skills offer a unified interface for connecting agents with scalable real-world services, but training robust agents remains limited by the lack of realistic environments and principled mechanisms for life-long learning. In this paper, we present Agent-World, a self-evolving training arena for advancing general agent intelligence through scalable environments. Agent-World has two main components: (1) Agentic Environment-Task Discovery, which autonomously explores topic-aligned databases and executable tool ecosystems from thousands of real-world environment themes and synthesizes verifiable tasks with controllable difficulty; and (2) Continuous Self-Evolving Agent Training, which combines multi-environment reinforcement learning with a self-evolving agent arena that automatically identifies capability gaps through dynamic task synthesis and drives targeted learning, enabling the co-evolution of agent policies and environments. 
Across 23 challenging agent benchmarks, Agent-World-8B and 14B consistently outperform strong proprietary models and environment scaling baselines. Further analyses reveal scaling trends in relation to environment diversity and self-evolution rounds, offering insights for building general agent intelligence.","published_date":"2026-04-20T14:01:10+00:00","viability_score":8,"cluster_label":"Agent Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Agent-World is a self-evolving training arena that synthesizes realistic environments and tasks to co-evolve general-purpose AI agents, outperforming proprietary models on challenging benchmarks.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18266v1","title":"Enhancing Tabular Anomaly Detection via Pseudo-Label-Guided Generation","abstract":"Identifying anomalous instances in tabular data is essential for improving data reliability and maintaining system stability. Due to the scarcity of ground-truth anomaly labels, existing methods mainly rely on unsupervised anomaly detection models, or exploit a small number of labeled anomalies to facilitate detection via sample generation or contrastive learning. However, unsupervised methods lack sufficient anomaly awareness, while current generation and contrastive approaches tend to compute anomalies globally, overlooking the localized anomaly patterns of tabular features, resulting in suboptimal detection performance. To address these limitations, we propose PLAG, a pseudo-label-guided anomaly generation method designed to enhance tabular anomaly detection. 
Specifically, by utilizing pseudo-anomalies as guidance signals and decoupling the overall anomaly quantification of a sample into an accumulation of feature-level abnormalities, PLAG not only effectively obviates the need for scarce ground-truth labels but also provides a novel perspective for the model to comprehend localized anomalous signals at a fine-grained level. Furthermore, a two-stage data selection strategy is proposed, integrating format verification and uncertainty estimation to rigorously filter candidate samples, thereby ensuring the fidelity and diversity of the synthetic anomalies. Ultimately, these filtered synthetic anomalies serve as robust discriminative guidance, empowering the model to better separate normal and anomalous instances. Extensive experiments demonstrate that PLAG achieves state-of-the-art performance against eight representative baselines. Moreover, as a flexible framework, it integrates seamlessly with existing unsupervised detectors, consistently boosting F1-scores by 0.08 to 0.21.","published_date":"2026-04-20T13:42:30+00:00","viability_score":7,"cluster_label":"Tabular Anomaly Detection","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PLAG enhances tabular anomaly detection by using pseudo-labels to generate localized, feature-level anomalies, achieving state-of-the-art performance and boosting existing unsupervised detectors.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18258v1","title":"Long-Text-to-Image Generation via Compositional Prompt Decomposition","abstract":"While modern text-to-image (T2I) models excel at generating images from intricate prompts, they struggle to capture the key details when the inputs are descriptive paragraphs. This limitation stems from the prevalence of concise captions that shape their training distributions. 
Existing methods attempt to bridge this gap by either fine-tuning T2I models on long prompts, which generalizes poorly to longer lengths; or by projecting the oversize inputs into normal-prompt space and compromising fidelity. We propose Prompt Refraction for Intricate Scene Modeling (PRISM), a compositional approach that enables pre-trained T2I models to process long sequence inputs. PRISM uses a lightweight module to extract constituent representations from the long prompts. The T2I model makes independent noise predictions for each component, and their outputs are merged into a single denoising step using energy-based conjunction. We evaluate PRISM across a wide range of model architectures, showing comparable performances to models fine-tuned on the same training data. Furthermore, PRISM demonstrates superior generalization, outperforming baseline models by 7.4% on prompts over 500 tokens in a challenging public benchmark.","published_date":"2026-04-20T13:31:36+00:00","viability_score":7,"cluster_label":"Generative Video","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PRISM enables pre-trained text-to-image models to generate images from long descriptive paragraphs by decomposing prompts and merging independent noise predictions.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18257v1","title":"DocQAC: Adaptive Trie-Guided Decoding for Effective In-Document Query Auto-Completion","abstract":"Query auto-completion (QAC) has been widely studied in the context of web search, yet remains underexplored for in-document search, which we term DocQAC. DocQAC aims to enhance search productivity within long documents by helping users craft faster, more precise queries, even for complex or hard-to-spell terms. 
While global historical queries are available to both WebQAC and DocQAC, DocQAC uniquely accesses document-specific context, including the current document's content and its specific history of user query interactions. To address this setting, we propose a novel adaptive trie-guided decoding framework that uses user query prefixes to softly steer language models toward high-quality completions. Our approach introduces an adaptive penalty mechanism with tunable hyperparameters, enabling a principled trade-off between model confidence and trie-based guidance. To efficiently incorporate document context, we explore retrieval-augmented generation (RAG) and lightweight contextual document signals such as titles, keyphrases, and summaries. When applied to encoder-decoder models like T5 and BART, our trie-guided framework outperforms strong baselines and even surpasses much larger instruction-tuned models such as LLaMA-3 and Phi-3 on seen queries across both seen and unseen documents. This demonstrates its practicality for real-world DocQAC deployments, where efficiency and scalability are critical. We evaluate our method on a newly introduced DocQAC benchmark derived from ORCAS, enriched with query-document pairs. We make both the DocQAC dataset (https://bit.ly/3IGEkbH) and code (https://github.com/rahcode7/DocQAC) publicly available.","published_date":"2026-04-20T13:30:45+00:00","viability_score":7,"cluster_label":"Document Search","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An adaptive trie-guided decoding framework for effective in-document query auto-completion that steers language models towards high-quality completions using document context.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.18254v1","title":"LeGo-Code: Can Modular Curriculum Learning Advance Complex Code Generation? 
Insights from Text-to-SQL","abstract":"Recently, code-oriented large language models (LLMs) have demonstrated strong capabilities in translating natural language into executable code. Text-to-SQL is a significant application of this ability, enabling non-technical users to interact with relational databases using natural language. However, state-of-the-art models continue to struggle with highly complex logic, particularly deeply nested statements involving multiple joins and conditions, as well as with real-world database schemas that are noisy or poorly structured. In this paper, we investigate whether curriculum learning can improve the performance of code-based LLMs on Text-to-SQL tasks. Employing benchmarks including Spider and BIRD, we fine-tune models under different curriculum strategies. Our experiments show that naive curriculum, simply ordering training samples by complexity in a single epoch, fails to surpass standard fine-tuning due to catastrophic forgetting. To overcome this, we propose a Modular Adapter Composition (MAC) strategy. By sequentially training tier-specific adapters on incremental complexity levels (Easy to Extra-Hard), we create a scaffolded learning environment that improves performance on complex queries. Our approach not only produces measurable performance gains on the Spider and BIRD benchmarks but also provides a flexible, \"Lego-like\" architecture, allowing models to be composed and deployed based on specific schema difficulty requirements. 
These findings demonstrate that structured, modular learning is a superior alternative to monolithic fine-tuning for mastering the syntax and logic of complex code generation.","published_date":"2026-04-20T13:30:04+00:00","viability_score":7,"cluster_label":"Code Generation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A modular adapter composition strategy for curriculum learning that improves complex code generation by sequentially training tier-specific adapters on incremental complexity levels.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.18251v1","title":"Style-Based Neural Architectures for Real-Time Weather Classification","abstract":"In this paper, we present three neural network architectures designed for real-time classification of weather conditions (sunny, rain, snow, fog) from images. These models, inspired by recent advances in style transfer, aim to capture the stylistic elements present in images. One model, called \"Multi-PatchGAN\", is based on PatchGANs used in well-known architectures such as Pix2Pix and CycleGAN, but here adapted with multiple patch sizes for detection tasks. The second model, \"Truncated ResNet50\", is a simplified version of ResNet50 retaining only its first nine layers. This truncation, determined by an evolutionary algorithm, facilitates the extraction of high-frequency features essential for capturing subtle stylistic details. Finally, we propose \"Truncated ResNet50 with Gram Matrix and Attention\", which computes Gram matrices for each layer during training and automatically weights them via an attention mechanism, thus optimizing the extraction of the most relevant stylistic expressions for classification. These last two models outperform the state of the art and demonstrate remarkable generalization capability on several public databases. 
Although developed for weather detection, these architectures are also suitable for other appearance-based classification tasks, such as animal species recognition, texture classification, disease detection in medical imaging, or industrial defect identification.","published_date":"2026-04-20T13:28:24+00:00","viability_score":7,"cluster_label":"Image Classification","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Style-based neural network architectures, including Multi-PatchGAN and Truncated ResNet50 with Gram Matrix and Attention, for real-time weather classification and other appearance-based tasks.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.18240v1","title":"AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation","abstract":"As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce AJ-Bench, a benchmark that systematically evaluates Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents' abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. 
Our data and code are available at https://aj-bench.github.io/.","published_date":"2026-04-20T13:23:38+00:00","viability_score":7,"cluster_label":"AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AJ-Bench, a benchmark for evaluating Agent-as-a-Judge across search, data systems, and GUIs, demonstrating performance gains over LLM-as-a-Judge baselines.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.18239v1","title":"Towards Disentangled Preference Optimization Dynamics Beyond Likelihood Displacement","abstract":"Preference optimization is widely used to align large language models (LLMs) with human preferences. However, many margin-based objectives suppress the chosen response along with the rejected one, a phenomenon known as likelihood displacement, and no general mechanism currently prevents this across objectives. We bridge this gap by presenting a unified \\emph{incentive-score decomposition} of preference optimization, revealing that diverse objectives share identical local update directions and differ only in their scalar weighting coefficients. Building on this decomposition, by analyzing the dynamics of the chosen/rejected likelihoods, we identify the \\emph{disentanglement band} (DB), a simple, testable condition that characterizes when training can avoid likelihood displacement by realizing the preferred pathway: suppressing the loser while maintaining the winner, possibly after an initial transient. Leveraging the DB, we propose a plug-and-play \\emph{reward calibration} (RC) that adaptively rebalances chosen versus rejected updates to satisfy the DB and mitigate likelihood displacement, without redesigning the base objective. Empirical results show that RC steers training toward more disentangled dynamics and often improves downstream performance across a range of objectives. 
Our code is available at https://github.com/IceyWuu/DisentangledPreferenceOptimization.","published_date":"2026-04-20T13:23:27+00:00","viability_score":7,"cluster_label":"LLM Alignment","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A plug-and-play reward calibration method that mitigates likelihood displacement in preference optimization for LLMs, improving downstream performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18237v1","title":"Semantic-based Distributed Learning for Diverse and Discriminative Representations","abstract":"In large-scale distributed scenarios, increasingly complex tasks demand more intelligent collaboration across networks, requiring the joint extraction of structural representations from data samples. However, conventional task-specific approaches often result in nonstructural embeddings, leading to collapsed variability among data samples within the same class, particularly in classification tasks. To address this issue and fully leverage the intrinsic structure of data for downstream applications, we propose a novel distributed learning framework that ensures both diverse and discriminative representations. For independent and identically distributed (i.i.d.) data, we reformulate and decouple the global optimization function by introducing constraints on representation variance. The update rules are then derived and simplified using a primal-dual approach. For non-i.i.d. data distributions, we tackle the problem by clustering and virtually replicating nodes, allowing model updates within each cluster using block coordinate descent. In both cases, the resulting optimal solutions are theoretically proven to maintain discriminative and diverse properties, with a guaranteed convergence for i.i.d. conditions. Additionally, semantic information from representations is shared among nodes, reducing the need for common neural network architectures. 
Finally, extensive simulations on MNIST, CIFAR-10 and CIFAR-100 confirm the effectiveness of the proposed algorithms in capturing global structural representations.","published_date":"2026-04-20T13:22:58+00:00","viability_score":3,"cluster_label":"Distributed Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel distributed learning framework for diverse and discriminative representations in large-scale scenarios, theoretically proven to maintain desired properties.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18235v1","title":"Negative Advantage Is a Double-Edged Sword: Calibrating Advantage in GRPO for Deep Search","abstract":"Deep search agents can autonomously initiate multi-turn interactions with search engines, thereby exhibiting strong question-answering capabilities. Such performance critically relies on Group Relative Policy Optimization (GRPO) as its core training algorithm. However, GRPO still faces several challenges in deep search settings. First, there exists a substantial mismatch between the correctness of intermediate steps and the reward signal, causing numerous correct intermediate steps to be incorrectly penalized when the final answer is wrong. Second, training is highly unstable, often resulting in degradation of natural language ability or even catastrophic training collapse. Our analysis attributes these issues to coarse-grained advantage assignment and an imbalance between positive and negative advantages. To address these problems, we propose CalibAdv, an advantage calibration method specifically designed for deep search tasks. Specifically, CalibAdv leverages the correctness of intermediate steps to downscale excessive negative advantages at a fine-grained level. It then rebalances positive and negative advantages in the answer component. Extensive experiments across three models and seven benchmarks demonstrate that CalibAdv improves both model performance and training stability. 
Our code is available at https://github.com/wujwyi/CalibAdv.","published_date":"2026-04-20T13:21:19+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CalibAdv, an advantage calibration method for deep search agents, improves performance and stability by downscaling negative advantages and rebalancing positive/negative advantages.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18234v1","title":"Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies","abstract":"Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may appear irrelevant in isolation but are essential when combined. In this research, we use the HotPotQA, MuSiQue, and SQuAD datasets to simulate a RAG system and compare three LLM-as-judge evaluation strategies, including our proposed Context-Aware Retriever Evaluation (CARE). Our goal is to better understand how multi-hop reasoning can be most effectively evaluated in RAG systems. Experiments with LLMs from OpenAI, Meta, and Google demonstrate that CARE consistently outperforms existing methods for evaluating multi-hop reasoning in RAG systems. The performance gains are most pronounced in models with larger parameter counts and longer context windows, while single-hop queries show minimal sensitivity to context-aware evaluation. Overall, the results highlight the critical role of context-aware evaluation in improving the reliability and accuracy of retrieval-augmented generation systems, particularly in complex query scenarios. 
To ensure reproducibility, we provide the complete data of our experiments at https://github.com/lorenzbrehme/CARE.","published_date":"2026-04-20T13:20:57+00:00","viability_score":8,"cluster_label":"RAG Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CARE, a context-aware retriever evaluation strategy for RAG systems, outperforms existing methods in evaluating multi-hop reasoning, especially for complex queries.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18233v1","title":"Aether: Network Validation Using Agentic AI and Digital Twin","abstract":"Network change validation remains a critical yet predominantly manual, time-consuming, and error-prone process in modern network operations. While formal network verification has made substantial progress in proving correctness properties, it is typically applied in offline, pre-deployment settings and faces challenges in accommodating continuous changes and validating live production behavior. Current operational approaches typically involve scattered testing tools, resulting in partial coverage and errors that surface only after deployment. In this paper, we present Aether, a novel approach that integrates Generative Agentic AI with a multi-functional Network Digital Twin to automate and streamline network change validation workflows. It features an agentic architecture with five specialized Network Operations AI agents that collaboratively handle the change validation lifecycle from intent analysis to network verification and testing. Aether agents use a unified Network Digital Twin integrating modeling, simulation, and emulation to maintain a consistent, up-to-date network view for verification and testing. 
By orchestrating agent collaboration atop this digital twin, Aether enables automated, rapid network change validation while reducing manual effort, minimizing errors, and improving operational agility and cost-effectiveness. We evaluate Aether over synthetic network change scenarios covering main classes of network changes and on past incidents from a major ISP operational network, demonstrating promising results in error detection (100%), diagnostic coverage (92-96%), and speed (6-7 minutes) over traditional methods.","published_date":"2026-04-20T13:18:58+00:00","viability_score":7,"cluster_label":"Network Operations AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Automate network change validation with agentic AI and a digital twin, reducing manual effort and errors.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18225v1","title":"Is SAM3 ready for pathology segmentation?","abstract":"Is Segment Anything Model 3 (SAM3) capable of segmenting any pathology image? Digital pathology segmentation spans tissue-level and nuclei-level scales, where traditional methods often suffer from high annotation costs and poor generalization. SAM3 introduces Promptable Concept Segmentation, offering a potential automated interface via text prompts. With this work, we propose a systematic evaluation protocol to explore the capability space of SAM3 in a structured manner. Specifically, we evaluate SAM3 under different supervision settings, including zero-shot, few-shot, and supervised, with varying prompting strategies. Our extensive evaluation on pathological datasets, including NuInsSeg, PanNuke, and GlaS, reveals that: (1) text-only prompts poorly activate nuclear concepts; (2) performance is highly sensitive to visual prompt types and budgets; (3) few-shot learning offers gains, but SAM3 lacks robustness against visual prompt noise; 
and (4) a significant gap persists between prompt-based usage and task-trained adapter-based reference. Our study delineates SAM3's boundaries in pathology image segmentation and provides practical guidance on the necessity of pathology domain adaptation.","published_date":"2026-04-20T13:10:07+00:00","viability_score":4,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Evaluate the capability of SAM3 for pathology segmentation to understand its limitations and guide domain adaptation.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18224v1","title":"WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models","abstract":"Large language models are rapidly evolving into interactive coding agents capable of end-to-end web coding, yet existing benchmarks evaluate only narrow slices of this capability, typically text-conditioned generation with static-correctness metrics, leaving visual fidelity, interaction quality, and codebase-level reasoning largely unmeasured. We introduce WebCompass, a multimodal benchmark that provides unified lifecycle evaluation of web engineering capability. Recognizing that real-world web coding is an iterative cycle of generation, editing, and repair, WebCompass spans three input modalities (text, image, video) and three task types (generation, editing, repair), yielding seven task categories that mirror professional workflows. Through a multi-stage, human-in-the-loop pipeline, we curate instances covering 15 generation domains, 16 editing operation types, and 11 repair defect types, each annotated at Easy/Medium/Hard levels. 
For evaluation, we adopt a checklist-guided LLM-as-a-Judge protocol for editing and repair, and propose a novel Agent-as-a-Judge paradigm for generation that autonomously executes generated websites in a real browser, explores interactive behaviors via the Model Context Protocol (MCP), and iteratively synthesizes targeted test cases, closely approximating human acceptance testing. We evaluate representative closed-source and open-source models and observe that: (1) closed-source models remain substantially stronger and more balanced; (2) editing and repair exhibit distinct difficulty profiles, with repair preserving interactivity better but remaining execution-challenging; (3) aesthetics is the most persistent bottleneck, especially for open-source models; and (4) framework choice materially affects outcomes, with Vue consistently challenging while React and Vanilla/HTML perform more strongly depending on task type.","published_date":"2026-04-20T13:09:38+00:00","viability_score":6,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multimodal benchmark for evaluating LLMs in end-to-end web coding, including generation, editing, and repair.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18210v1","title":"TacticGen: Grounding Adaptable and Scalable Generation of Football Tactics","abstract":"Success in association football relies on both individual skill and coordinated tactics. While recent advancements in spatio-temporal data and deep learning have enabled predictive analyses like trajectory forecasting, the development of tactical design remains limited. Bridging this gap is essential, as prediction reveals what is likely to occur, whereas tactic generation determines what should occur to achieve strategic objectives. In this work, we present TacticGen, a generative model for adaptable and scalable tactic generation. 
TacticGen formulates tactics as sequences of multi-agent movements and interactions conditioned on the game context. It employs a multi-agent diffusion transformer with agent-wise self-attention and context-aware cross-attention to capture cooperative and competitive dynamics among players and the ball. Trained with over 3.3 million events and 100 million tracking frames from top-tier leagues, TacticGen achieves state-of-the-art precision in predicting player trajectories. Building on it, TacticGen enables adaptable tactic generation tailored to diverse inference-time objectives through a classifier guidance mechanism, specified via rules, natural language, or neural models. Its modeling performance is also inherently scalable. A case study with football experts confirms that TacticGen generates realistic, strategically valuable tactics, demonstrating its practical utility for tactical planning in professional football. The project page is available at: https://shengxu.net/TacticGen/.","published_date":"2026-04-20T12:57:11+00:00","viability_score":8,"cluster_label":"Generative AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Generate adaptable and scalable football tactics using a multi-agent diffusion transformer, grounded in game context and expert validation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18206v1","title":"A Control Architecture for Training-Free Memory Use","abstract":"Prompt-injected memory can improve reasoning without updating model weights, but it also creates a control problem: retrieved content helps only when it is applied in the right state. We study this problem in a strict training-free setting and formulate it as applicability control: when to trigger a memory-assisted second pass, when to trust it, and how to maintain the memory bank over time. 
Our method combines uncertainty-based routing, confidence-based selective acceptance, bank selection across rule and exemplar memory, and evidence-based governance of the memory bank over time. Under a locked training-free protocol with compute-matched controls, it improves two core arithmetic benchmarks by +7.0 points on SVAMP and +7.67 points on ASDiv over baseline. The same architecture also transfers to QA and agent benchmarks with smaller positive effects and shows the same positive direction on a second checkpoint for the main arithmetic tasks. On arithmetic, the main empirical pattern is that the control architecture, rather than raw memory exposure, drives the improvements on SVAMP and ASDiv. Mechanistically, confidence separates helpful from harmful rule-bank interventions, and under fixed retrieval the repair-versus-corrupt difference localizes to rows whose retrieved set actually contains the edited entries.","published_date":"2026-04-20T12:55:27+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A training-free control architecture enhances LLM reasoning by intelligently managing memory, improving arithmetic benchmarks significantly.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18190v1","title":"Scalable Neighborhood-Based Multi-Agent Actor-Critic","abstract":"We propose MADDPG-K, a scalable extension to Multi-Agent Deep Deterministic Policy Gradient (MADDPG) that addresses the computational limitations of centralized critic approaches. Centralized critics, which condition on the observations and actions of all agents, have demonstrated significant performance gains in cooperative and competitive multi-agent settings. However, their critic networks grow linearly in input size with the number of agents, making them increasingly expensive to train at scale. 
MADDPG-K mitigates this by restricting each agent's critic to the $k$ closest agents under a chosen metric, which in our case is Euclidean distance. This ensures a constant-size critic input regardless of the total agent count. We analyze the complexity of this approach, showing that the quadratic cost it retains arises from cheap scalar distance computations rather than the expensive neural network matrix multiplications that bottleneck standard MADDPG. We validate our method empirically across cooperative and adversarial environments from the Multi-Particle Environment suite, demonstrating competitive or superior performance compared to MADDPG, faster convergence in cooperative settings, and better runtime scaling as the number of agents grows. Our code is available at https://github.com/TimGop/MADDPG-K .","published_date":"2026-04-20T12:45:59+00:00","viability_score":8,"cluster_label":"Multi-Agent RL","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MADDPG-K scales multi-agent reinforcement learning by restricting critics to nearby agents, offering competitive performance and faster convergence.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18179v1","title":"Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs","abstract":"Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return schemes such as SVIP leave a parallel-serve side-channel, since a dishonest provider can route the verifier's probe to the advertised model while serving ordinary users from a substitute. We propose a commit-open protocol that closes this gap. Before any opening request, the provider commits via a Merkle tree to a per-position sparse-autoencoder (SAE) feature-trace sketch of its served output at a published probe layer. 
A verifier opens random positions, scores them against a public named-circuit probe library calibrated with cross-backend noise, and decides with a fixed-threshold joint-consistency z-score rule. We instantiate the protocol on three backbones -- Qwen3-1.7B, Gemma-2-2B, and a 4.5x scale-up to Gemma-2-9B with a 131k-feature SAE. Of 17 attackers spanning same-family lifts, cross-family substitutes, and rank-<=128 adaptive LoRA, all are rejected at a shared, scale-stable threshold; the same attackers all evade a matched SVIP-style parallel-serve baseline. A white-box end-to-end attack that backpropagates through the frozen SAE encoder does not close the margin, and a feature-forgery attacker that never runs M_hon is bounded in closed form by an intrinsic-dimension argument. Commitment adds <=2.1% to forward-only wall-clock at batch 32.","published_date":"2026-04-20T12:34:56+00:00","viability_score":7,"cluster_label":"LLM Security","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A commit-open protocol uses sparse autoencoder feature traces to detect silent model substitutions in hosted LLMs, outperforming existing methods.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2604.18177v1","title":"STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs","abstract":"Benchmarks are often used as a standard to understand LLM capabilities in different domains. However, aggregate benchmark scores provide limited insight into compositional skill gaps of LLMs and how to improve them. To make these weaknesses visible, we propose Scaffolded Task Design (STaD) framework. STaD generates controlled variations of benchmark tasks based on the concept of scaffolding, which introduces structured, incremental support in a step-by-step manner. Rather than inspecting failures individually, this approach enables systematic and scalable probing of model behavior by identifying the specific reasoning skill compositions they lack. 
Treating the LLM as a black box, our experiments on six models of varying sizes reveal multiple failure points in three reasoning benchmarks and highlight each model's unique and distinct skill gaps.","published_date":"2026-04-20T12:33:59+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"STaD is a framework for generating scaffolded tasks to systematically identify and visualize compositional skill gaps in LLMs.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18176v1","title":"QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning","abstract":"Large language models (LLMs) show strong capabilities in general reasoning but typically lack reliability in scientific domains like quantum mechanics, which demand strict adherence to physical constraints. This limitation arises from the scarcity of verifiable training resources and the inadequacy of coarse feedback signals in standard alignment paradigms. To address the data challenge, we introduce QuantumQA, a large-scale dataset constructed via a task-adaptive strategy and a hybrid verification protocol that combines deterministic solvers with semantic auditing to guarantee scientific rigor. Building on this foundation, we propose the verification-aware reward model (VRM) tailored for Reinforcement Learning with Verifiable Rewards (RLVR), which employs an adaptive reward fusion (ARF) mechanism to dynamically integrate deterministic signals from a scientific execution suite (SES) with multidimensional semantic evaluations for precise supervision. Experimental results demonstrate that our method consistently outperforms baselines and general-purpose preference models. 
Notably, our optimized 8B model achieves performance competitive with proprietary models, validating that incorporating verifiable, rule-based feedback into the reinforcement learning loop offers a parameter-efficient alternative to pure scaling.","published_date":"2026-04-20T12:33:50+00:00","viability_score":7,"cluster_label":"Scientific LLMs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A physics-consistent dataset and verification-aware reinforcement learning approach to enhance LLM reliability in scientific domains.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18170v1","title":"Copy-as-Decode: Grammar-Constrained Parallel Prefill for LLM Editing","abstract":"LLMs edit text and code by autoregressively regenerating the full output, even when most tokens appear verbatim in the input. We study Copy-as-Decode, a decoding-layer mechanism that recasts edit generation as structured decoding over a two-primitive grammar: <copy lines=\"i-j\"/> references an input line range, <gen>...</gen> emits new content. A token-level FSM guarantees syntactic validity, and a serving-layer primitive updates the KV cache for each copy span via a single parallel-prefill forward rather than $N$ autoregressive steps -- sharing the parallel-forward kernel of speculative decoding but with input tokens as the draft and program-enforced acceptance replacing probabilistic verification. We report an upper-bound analysis that requires no end-to-end training. (i) Kernel speedup: on Qwen2.5-{1.5B, 7B}, copying $N$ tokens via parallel prefill is $6.8\\times$--$303\\times$ faster than autoregressive ($N \\in [8, 512]$, A100 80GB bf16). 
(ii) Copy ceiling: on ProbeEdit and HumanEvalPack-Fix (Py/JS), $74$--$98\\%$ of gold tokens are reachable under the line-level primitive; composed with the empirical kernel over each corpus's span histogram this yields a closed-form wall-clock bound of $29.0\\times / 3.4\\times / 4.2\\times$ ($13.0\\times$ pooled). A token-level extension reaches $91$--$99\\%$ coverage with $4.5\\times$--$6.5\\times$ floors. (iii) Pipeline losslessness: oracle programs round-trip through the deterministic resolver on all $482$ cases, localizing any downstream failure to span selection rather than the mechanism. A perturbation study shows pooled EM drops from $100\\%$ to $15.48\\%$ under off-by-one noise. A fine-tuning pilot on Qwen2.5-Coder-1.5B lifts HEvalFix-Py EM from $0/33$ (untrained) to $12$--$17\\%$, a learnability signal, not a production selector. Batched-serving integration and multi-file coverage are scoped as follow-up.","published_date":"2026-04-20T12:29:53+00:00","viability_score":4,"cluster_label":"LLM Editing","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Accelerate LLM text and code editing by recasting generation as structured decoding over a copy-and-generate grammar, significantly reducing regeneration time.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18169v1","title":"Beyond Reproduction: A Paired-Task Framework for Assessing LLM Comprehension and Creativity in Literary Translation","abstract":"Large language models (LLMs) are increasingly used for creative tasks such as literary translation. Yet translational creativity remains underexplored and is rarely evaluated at scale, while source-text comprehension is typically studied in isolation, despite the fact that, in professional translation, comprehension and creativity are tightly intertwined. We address these gaps with a paired-task framework applied to literary excerpts from 11 books. 
Task 1 assesses source-text comprehension, and Task 2 evaluates translational creativity through Units of Creative Potential (UCPs), such as metaphors and wordplay. Using a scalable evaluation setup that combines expert human annotations with UCP-based automatic scoring, we benchmark 23 models and four creativity-oriented prompts. Our findings show that strong comprehension does not translate into human-level creativity: models often produce literal or contextually inappropriate renderings, with particularly large gaps for the more distant English-Chinese language pair. Creativity-oriented prompts yield only modest gains, and only one model, Mistral-Large, comes close to human-level creativity (0.167 vs. 0.246). Across all model-prompt combinations, only three exceed a creativity score of 0.1, while the rest remain at or near zero.","published_date":"2026-04-20T12:28:59+00:00","viability_score":4,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for evaluating LLM literary translation that disentangles comprehension from creativity, revealing significant gaps in current models.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.18164v1","title":"MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge","abstract":"Multimodal Large Language Models (MLLMs) have been increasingly used as automatic evaluators-a paradigm known as MLLM-as-a-Judge. However, their reliability and vulnerabilities to biases remain underexplored. We find that many MLLM judges fail to reliably integrate key visual or textual cues, yielding unreliable evaluations when evidence is missing or mismatched, and exhibiting instability under semantically irrelevant perturbations. To address this, we systematically define Compositional Bias in MLLM-as-a-Judge systems and introduce MM-JudgeBias, a benchmark for evaluating it. 
MM-JudgeBias introduces controlled perturbations across Query, Image, and Response, and evaluates model behavior via two complementary metrics: Bias-Deviation (BD) for sensitivity and Bias-Conformity (BC) for stability. Our dataset of over 1,800 curated and refined multimodal samples, drawn from 29 source benchmarks, enables a fine-grained diagnosis of nine bias types across diverse tasks and domains. Experiments on 26 state-of-the-art MLLMs reveal systematic modality neglect and asymmetric evaluation tendencies, underscoring the need for more reliable judges.","published_date":"2026-04-20T12:27:44+00:00","viability_score":4,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark to identify and mitigate compositional biases in multimodal large language models used for automated evaluation.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18161v1","title":"Does \"Do Differentiable Simulators Give Better Policy Gradients?'' Give Better Policy Gradients?","abstract":"In policy gradient reinforcement learning, access to a differentiable model enables 1st-order gradient estimation that accelerates learning compared to relying solely on derivative-free 0th-order estimators. However, discontinuous dynamics cause bias and undermine the effectiveness of 1st-order estimators. Prior work addressed this bias by constructing a confidence interval around the REINFORCE 0th-order gradient estimator and using these bounds to detect discontinuities. However, the REINFORCE estimator is notoriously noisy, and we find that this method requires task-specific hyperparameter tuning and has low sample efficiency. This paper asks whether such bias is the primary obstacle and what minimal fixes suffice. 
First, we re-examine standard discontinuous settings from prior work and introduce DDCG, a lightweight test that switches estimators in nonsmooth regions; with a single hyperparameter, DDCG achieves robust performance and remains reliable with small samples. Second, on differentiable robotics control tasks, we present IVW-H, a per-step inverse-variance implementation that stabilizes variance without explicit discontinuity detection and yields strong results. Together, these findings indicate that while estimator switching improves robustness in controlled studies, careful variance control often dominates in practical deployments.","published_date":"2026-04-20T12:23:48+00:00","viability_score":3,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper investigates methods to improve policy gradient reinforcement learning by addressing discontinuities in differentiable simulators, proposing new estimators for better performance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18158v1","title":"State Transfer Reveals Reuse in Controlled Routing","abstract":"Prompt-based interventions can change model behavior, but trained success alone does not identify where the behaviorally relevant state is represented. We study this question in controlled routing tasks using interfaces chosen on support data, held-out query evaluation, and matched necessity, sufficiency, and wrong-interface controls. On GPT-2 triop, an early interface supports exact transfer under these tests. On GPT-2 add/sub, zero-retrain compiled transfer at the fixed interface recovers most of donor routing accuracy, while trainable prompt slots can relearn the same behavior at several other positions only after additional support examples and optimization. These results distinguish fixed-interface reuse from prompt relocation in a setting where the two can be tested directly. 
Qwen routing provides a cross-architecture consistency check for the same matched-interface pattern at the operator token, although donor-specific identity on the local V-path remains unresolved. Generation and reasoning branches are used to map scope: they show broader transport or weaker controller identifiability once control depends on longer trajectories or harder selection. In controlled routing, fixed-interface transfer is therefore stronger evidence of reuse than trained prompt success alone.","published_date":"2026-04-20T12:22:15+00:00","viability_score":4,"cluster_label":"LLM Interpretability","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research uses controlled routing tasks to reveal how prompt-based interventions alter LLM behavior and identify where relevant state is represented.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18146v1","title":"Modular Representation Compression: Adapting LLMs for Efficient and Effective Recommendations","abstract":"Recently, large language models (LLMs) have advanced recommendation systems (RSs), and recent works have begun to explore how to integrate LLMs into industrial RSs. While most approaches deploy LLMs offline to generate and pre-cache augmented representations for RSs, high-dimensional representations from LLMs introduce substantial storage and computational costs. Thus, it is crucial to compress LLM representations effectively. However, we identify a counterintuitive phenomenon during representation compression: Mid-layer Representation Advantage (MRA), where representations from middle layers of LLMs outperform those from final layers in recommendation tasks. This degraded final layer renders existing compression methods, which typically compress on the final layer, suboptimal. We interpret this based on modularity theory that LLMs develop spontaneous internal functional modularity and force the final layer to specialize in the proxy training task. 
Thus, we propose \\underline{M}odul\\underline{a}r \\underline{R}epresentation \\underline{C}ompression (MARC) to explicitly control the modularity of LLMs. First, Modular Adjustment explicitly introduces compression and task adaptation modules, enabling the LLM to operate strictly as a representation-learning module. Next, to ground each module to its specific task, Modular Task Decoupling uses information constraints and different network structures to decouple tasks. Extensive experiments validate that MARC addresses MRA and produces efficient representations. Notably, MARC achieved a 2.82% eCPM lift in an online A/B test within a large-scale commercial search advertising scenario.","published_date":"2026-04-20T12:08:58+00:00","viability_score":7,"cluster_label":"Recommendation Systems","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"MARC is a novel framework that compresses LLM representations for recommendation systems by controlling modularity, achieving significant online A/B test lift.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18145v1","title":"Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework","abstract":"Automated medical report generation for 3D PET/CT imaging is fundamentally challenged by the high-dimensional nature of volumetric data and a critical scarcity of annotated datasets, particularly for low-resource languages. Current black-box methods map whole volumes to reports, ignoring the clinical workflow of analyzing localized Regions of Interest (RoIs) to derive diagnostic conclusions. In this paper, we bridge this gap by introducing VietPET-RoI, the first large-scale 3D PET/CT dataset with fine-grained RoI annotation for a low-resource language, comprising 600 PET/CT samples and 1,960 manually annotated RoIs, paired with corresponding clinical reports. 
Furthermore, to demonstrate the utility of this dataset, we propose HiRRA, a novel framework that mimics the professional radiologist diagnostic workflow by employing graph-based relational modules to capture dependencies between RoI attributes. This approach shifts from global pattern matching toward localized clinical findings. Additionally, we introduce new clinical evaluation metrics, namely RoI Coverage and RoI Quality Index, that measure both RoI localization accuracy and attribute description fidelity using LLM-based extraction. Extensive evaluation demonstrates that our framework achieves SOTA performance, surpassing existing models by 19.7% in BLEU and 4.7% in ROUGE-L, while achieving a remarkable 45.8% improvement in clinical metrics, indicating enhanced clinical reliability and reduced hallucination. Our code and dataset are available on GitHub.","published_date":"2026-04-20T12:08:21+00:00","viability_score":8,"cluster_label":"Medical Imaging AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"HiRRA is a graph-enhanced framework for 3D medical imaging report generation that mimics radiologist workflow, achieving SOTA performance and significant clinical metric improvements.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18137v1","title":"AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization","abstract":"Processing-in-Memory (PIM) architectures offer a promising solution to the memory bottlenecks in data-intensive machine learning, yet often overlook the growing challenge of activation memory footprint. Conventional PIM approaches struggle with massive KV cache sizes generated in long-context scenarios by Transformer-based models, frequently exceeding PIM's limited memory capacity, while techniques like sparse attention can conflict with PIM's need for data locality. 
Existing PIM approaches and quantization methods are often insufficient or poorly suited for leveraging the unique characteristics of activations. This work identifies an opportunity for PIM-specialized activation quantization to enhance bandwidth and compute efficiency. We explore clustering-based vector quantization approaches, which align well with activation characteristics and PIM's internal bandwidth capabilities. Building on this, we introduce AQPIM, a novel PIM-aware activation quantization framework based on Product Quantization (PQ), optimizing it for modern Large Language Models (LLMs). By performing quantization directly within memory, AQPIM leverages PIM's high internal bandwidth and enables direct computation on compressed data, significantly reducing both memory footprint and computational overhead for attention computation. AQPIM addresses PQ's accuracy challenges by introducing several algorithmic optimizations. Evaluations demonstrate that AQPIM achieves significant performance improvements, drastically reducing the GPU-CPU communication that can account for 90$\\sim$98.5\\% of decoding latency, together with a 3.4$\\times$ speedup over a SOTA PIM approach.","published_date":"2026-04-20T12:04:51+00:00","viability_score":7,"cluster_label":"LLM Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AQPIM quantizes LLM activations directly in memory to drastically reduce decoding latency and improve speed for Processing-in-Memory architectures.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18135v1","title":"Soft Label Pruning and Quantization for Large-Scale Dataset Distillation","abstract":"Large-scale dataset distillation requires storing auxiliary soft labels that can be 30-40x larger on ImageNet-1K and 200x larger on ImageNet-21K than the condensed images, undermining the goal of dataset compression. 
We identify two fundamental issues necessitating such extensive labels: (1) insufficient image diversity, where high within-class similarity in synthetic images requires extensive augmentation, and (2) insufficient supervision diversity, where limited variety in supervisory signals during training leads to performance degradation at high compression rates. To address these challenges, we propose Label Pruning and Quantization for Large-scale Distillation (LPQLD). We enhance image diversity via class-wise batching and batch-normalization supervision during synthesis. For supervision diversity, we introduce Label Pruning with Dynamic Knowledge Reuse to improve label-per-augmentation diversity, and Label Quantization with Calibrated Student-Teacher Alignment to improve augmentation-per-image diversity. Our approach reduces soft label storage by 78x on ImageNet-1K and 500x on ImageNet-21K while improving accuracy by up to 7.2% and 2.8%, respectively. Extensive experiments validate the superiority of LPQLD across different network architectures and dataset distillation methods. Code is available at https://github.com/he-y/soft-label-pruning-quantization-for-dataset-distillation.","published_date":"2026-04-20T12:02:02+00:00","viability_score":8,"cluster_label":"Dataset Distillation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LPQLD reduces soft label storage by up to 500x in dataset distillation while improving accuracy, enabling efficient large-scale dataset compression.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18133v1","title":"Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures","abstract":"With the rapid advancement of artificial intelligence, multi-agent systems (MASs) are evolving from classical paradigms toward architectures built upon large foundation models (LFMs). 
This survey provides a systematic review and comparative analysis of classical MASs (CMASs) and LFM-based MASs (LMASs). First, within a closed-loop coordination framework, CMASs are reviewed across four fundamental dimensions: perception, communication, decision-making, and control. Beyond this framework, LMASs integrate LFMs to lift collaboration from low-level state exchanges to semantic-level reasoning, enabling more flexible coordination and improved adaptability across diverse scenarios. Then, a comparative analysis is conducted to contrast CMASs and LMASs across architecture, operating mechanism, adaptability, and application. Finally, future perspectives on MASs are presented, summarizing open challenges and potential research opportunities.","published_date":"2026-04-20T12:00:31+00:00","viability_score":2,"cluster_label":"Multi-Agent Systems","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This survey reviews classical multi-agent systems and explores their evolution towards foundation model-enabled futures, outlining challenges and opportunities.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18131v1","title":"Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration","abstract":"Most agents today ``self-evolve'' by following rewards and rules defined by humans. However, this process remains fundamentally dependent on external supervision; without human guidance, the evolution stops. In this work, we train agents to possess an intrinsic meta-evolution capability to spontaneously learn about unseen environments prior to task execution. To instill this ability, we design an outcome-based reward mechanism that measures how much an agent's self-generated world knowledge improves its success rate on downstream tasks. This reward signal is used exclusively during the training phase to teach the model how to explore and summarize effectively. 
At inference time, the agent requires no external rewards or human instructions. It spontaneously performs native self-evolution to adapt to unknown environments using its internal parameters. When applied to Qwen3-30B and Seed-OSS-36B, this shift to native evolution yields a 20% performance increase on WebVoyager and WebWalker. Most strikingly, the generated world knowledge even enables a compact 14B Qwen3 model to outperform the unassisted Gemini-2.5-Flash, establishing a new paradigm for truly evolving agents.","published_date":"2026-04-20T11:54:20+00:00","viability_score":6,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"LLM agents are trained for spontaneous, reward-free self-evolution by exploring world knowledge, leading to significant performance gains on downstream tasks.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2604.18128v1","title":"Depth Registers Unlock W4A4 on SwiGLU: A Reader/Generator Decomposition","abstract":"We study post-training W4A4 quantization in a controlled 300M-parameter SwiGLU decoder-only language model trained on 5B tokens of FineWeb-Edu, and ask which input-activation sites dominate the error. Naive round-to-nearest W4A4 collapses validation perplexity from FP16 23.6 to 1727. A simple residual-axis training-time intervention -- Depth Registers with a register-magnitude hinge loss (DR+sink) -- reduces this to 119 (about 14x) at matched FP16 PPL and matched zero-shot capacity, and composes with SmoothQuant to 39.9 PPL. The residual ~2 PPL gap to FP16 is the diagnostic core. We decompose W4A4 damage by input-activation site: the five trainable linears in a SwiGLU block split into residual-axis readers (qkv, w1, w3) and block-internal generators (o_proj, w2). 
Elementary norm arguments show residual-axis magnitude control bounds readers tightly but leaves w2's bilinear input bounded only by the trivial product of factor bounds; empirically, DR+sink collapses reader kurtosis while leaving generators essentially unchanged, and the reader-rescued W4A4 residue is flat at ~0.28 nats across three matched checkpoints with Delta-remove(w2) dominating. We present DR+sink as a training-time probe rather than a deployment proposal: a post-hoc alternative (Per-Linear QuaRot) nearly matches it on the reader axis. Full QuaRot -- adding online per-head value Hadamard plus online w2-input rotation -- does not close the gap either, directly testing the prediction that orthogonal rotation cannot bound the bilinear SwiGLU tail. Claims are specific to our 300M, 5B-token, single-seed setting, and our experiments do not isolate the partition from the hinge.","published_date":"2026-04-20T11:47:11+00:00","viability_score":2,"cluster_label":"LLM Quantization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical exploration of post-training quantization techniques for language models, focusing on understanding error sources rather than product development.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18124v1","title":"TLoRA: Task-aware Low Rank Adaptation of Large Language Models","abstract":"Low-Rank Adaptation (LoRA) has become a widely adopted parameter-efficient fine-tuning method for large language models, with its effectiveness largely influenced by the allocation of ranks and scaling factors, as well as initialization. Existing LoRA variants typically address only one of these factors, often at the cost of increased training complexity or reduced practical efficiency. In this work, we present Task-aware Low-Rank Adaptation (TLoRA), a unified framework that jointly optimizes initialization and resource allocation at the outset of training. 
TLoRA introduces a data-driven initialization strategy that aligns the LoRA $A$ matrix with task-relevant subspaces by performing singular value decomposition on the product of pre-trained weights and input activation covariance. After this, the $A$ matrix is frozen, and only the $B$ matrix is trained. Furthermore, TLoRA employs a sensitivity-based importance metric to adaptively allocate ranks and scaling factors across layers under a fixed parameter budget. We conduct extensive experiments that demonstrate TLoRA consistently performs excellently across various tasks, including natural language understanding, commonsense reasoning, math reasoning, code generation, and chat generation, while significantly reducing the number of trainable parameters.","published_date":"2026-04-20T11:43:55+00:00","viability_score":7,"cluster_label":"LLM Fine-tuning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"TLoRA is a unified framework for parameter-efficient LLM fine-tuning that jointly optimizes initialization and rank allocation for improved performance across diverse tasks.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18103v1","title":"Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling","abstract":"Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward \\textit{semantic fixing points}, making further processing redundant. To this end, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. 
Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git.","published_date":"2026-04-20T11:20:03+00:00","viability_score":7,"cluster_label":"LLM Inference Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DASH is a training-free method that uses attention dynamics to selectively halt stabilized tokens, significantly speeding up LLM prefilling while preserving accuracy.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18096v1","title":"The Collaboration Gap in Human-AI Work","abstract":"LLMs are increasingly presented as collaborators in programming, design, writing, and analysis. Yet the practical experience of working with them often falls short of this promise. In many settings, users must diagnose misunderstandings, reconstruct missing assumptions, and repeatedly repair misaligned responses. This poster introduces a conceptual framework for understanding why such collaboration remains fragile. Drawing on a constructivist grounded theory analysis of 16 interviews with designers, developers, and applied AI practitioners working on LLM-enabled systems, and informed by literature on human-AI collaboration, we argue that stable collaboration depends not only on model capability but on the interaction's grounding conditions. We distinguish three recurrent structures of human-AI work: one-shot assistance, weak collaboration with asymmetric repair, and grounded collaboration. 
We propose that collaboration breaks down when the appearance of partnership outpaces the grounding capacity of the interaction and contribute a framework for discussing grounding, repair, and interaction structure in LLM-enabled work.","published_date":"2026-04-20T11:14:40+00:00","viability_score":1,"cluster_label":"Human-AI Collaboration","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A conceptual framework for understanding the fragility of human-AI collaboration, identifying factors beyond model capability that impact stable interaction.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18095v1","title":"DSAINet: An Efficient Dual-Scale Attentive Interaction Network for General EEG Decoding","abstract":"In real-world applications of noninvasive electroencephalography (EEG), specialized decoders often show limited generalizability across diverse tasks under subject-independent settings. One central challenge is that task-relevant EEG signals often follow different temporal organization patterns across tasks, while many existing methods rely on task-tailored architectural designs that introduce task-specific temporal inductive biases. This mismatch makes it difficult to adapt temporal modeling across tasks without changing the model configuration. To address these challenges, we propose DSAINet, an efficient dual-scale attentive interaction network for general EEG decoding. Specifically, DSAINet constructs shared spatiotemporal token representations from raw EEG signals and models diverse temporal dynamics through parallel convolutional branches at fine and coarse scales. The resulting representations are then adaptively refined by intra-branch attention to emphasize salient scale-specific patterns and by inter-branch attention to integrate task-relevant features across scales, followed by adaptive token aggregation to yield a compact representation for prediction. 
Extensive experiments on five downstream EEG decoding tasks across ten public datasets show that DSAINet consistently outperforms 13 representative baselines under strict subject-independent evaluation. Notably, this performance is achieved using the same architecture hyperparameters across datasets. Moreover, DSAINet achieves a favorable accuracy-efficiency trade-off with only about 77K trainable parameters and provides interpretable neurophysiological insights. The code is publicly available at https://github.com/zy0929/DSAINet.","published_date":"2026-04-20T11:10:33+00:00","viability_score":7,"cluster_label":"EEG Decoding","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel dual-scale attentive network for generalizable EEG decoding that outperforms existing methods across multiple datasets with a single architecture.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18088v1","title":"Autonomous Unmanned Aircraft Systems for Enhanced Search and Rescue of Drowning Swimmers: Image-Based Localization and Mission Simulation","abstract":"Drowning is an omnipresent risk associated with any activity on or in the water, and rescuing a drowning person is particularly challenging because of the time pressure, making a short response time important. Further complicating water rescue are unsupervised and extensive swimming areas, precise localization of the target, and the transport of rescue personnel. Technical innovations can provide a remedy: We propose an Unmanned Aircraft System (UAS), also known as a drone-in-a-box system, consisting of a fleet of Unmanned Aerial Vehicles (UAVs) allocated to purpose-built hangars near swimming areas. In an emergency, the UAS can be deployed in addition to Standard Rescue Operation (SRO) equipment to locate the distressed person early by performing a fully automated Search and Rescue (S&R) operation and dropping a flotation device. 
In this paper, we address automatically locating distressed swimmers using the image-based object detection architecture You Only Look Once (YOLO). We present a dataset created for this application and outline the training process. We evaluate the performance of YOLO versions 3, 5, and 8 and architecture sizes (nano, extra-large) using Mean Average Precision (mAP) metrics mAP@.5 and mAP@.5:.95. Furthermore, we present two Discrete-Event Simulation (DES) approaches to simulate response times of SRO and UAS-based water rescue. This enables estimation of time savings relative to SRO when selecting the UAS configuration (type, number, and location of UAVs and hangars). Computational experiments for a test area in the Lusatian Lake District, Germany, show that UAS assistance shortens response time. Even a small UAS with two hangars, each containing one UAV, reduces response time by a factor of five compared to SRO.","published_date":"2026-04-20T11:05:21+00:00","viability_score":8,"cluster_label":"Search and Rescue Drones","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An autonomous drone system using YOLO for rapid drowning swimmer detection and localization, significantly reducing rescue response times.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18087v1","title":"Mix and Match: Context Pairing for Scalable Topic-Controlled Educational Summarisation","abstract":"Topic-controlled summarisation enables users to generate summaries focused on specific aspects of source documents. This paper investigates a data augmentation strategy for training small language models (sLMs) to perform topic-controlled summarisation. We propose a pairwise data augmentation method that combines contexts from different documents to create contrastive training examples, enabling models to learn the relationship between topics and summaries more effectively. 
Using the SciTLDR dataset enriched with Wikipedia-derived topics, we systematically evaluate how augmentation scale affects model performance. Results show consistent improvements in win rate and semantic alignment as the augmentation scale increases, while the amount of real training data remains fixed. Consequently, a T5-base model trained with our augmentation approach achieves competitive performance relative to larger models, despite using significantly fewer parameters and substantially fewer real training examples.","published_date":"2026-04-20T11:04:51+00:00","viability_score":6,"cluster_label":"Educational Summarization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A data augmentation strategy for training smaller language models to perform topic-controlled educational summarization, improving performance with less data.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18083v1","title":"Implicit neural representations as a coordinate-based framework for continuous environmental field reconstruction from sparse ecological observations","abstract":"Reconstructing continuous environmental fields from sparse and irregular observations remains a central challenge in environmental modelling and biodiversity informatics. Many ecological datasets are heterogeneous in space and time, making grid-based approaches difficult to scale or generalise across domains. Here, we evaluate implicit neural representations (INRs) as a coordinate-based modelling framework for learning continuous spatial and spatio-temporal fields directly from coordinate inputs. We analyse their behaviour across three representative modelling scenarios: species distribution reconstruction, phenological dynamics, and morphological segmentation derived from open biodiversity data. 
Beyond predictive performance, we examine interpolation behaviour, spatial coherence, and computational characteristics relevant for environmental modelling workflows, including scalability, resolution-independent querying, and architectural inductive bias. Results show that neural fields provide stable continuous representations with predictable computational cost, complementing classical smoothers and tree-based approaches. These findings position coordinate-based neural fields as a flexible representation layer that can be integrated into environmental modelling pipelines and exploratory analysis frameworks for large, irregularly sampled datasets.","published_date":"2026-04-20T10:59:08+00:00","viability_score":4,"cluster_label":"Environmental Field Reconstruction","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Implicit neural representations offer a coordinate-based framework for continuous environmental field reconstruction from sparse ecological data.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18076v1","title":"Class-specific diffusion models improve military object detection in a low-data domain","abstract":"Diffusion-based image synthesis has emerged as a promising source of synthetic training data for AI-based object detection and classification. In this work, we investigate whether images generated with diffusion can improve military vehicle detection under low-data conditions. We fine-tuned the text-to-image diffusion model FLUX.1 [dev] using LoRA with only 8 or 24 real images per class across 15 vehicle categories, resulting in class-specific diffusion models, which were used to generate new samples from automatically generated text prompts. The same real images were used to fine-tune the RF-DETR detector for a 15-class object detection task. Synthetic datasets generated by the diffusion models were then used to further improve detector performance. 
Importantly, no additional real data was required, as the generative models leveraged the same limited training samples. FLUX-generated images improved detection performance, particularly in the low-data regime (up to +8.0% mAP$_{50}$ with 8 real samples). To address the limited geometric control of text prompt-based diffusion, we additionally generated structurally guided synthetic data using ControlNet with Canny edge-map conditioning, yielding a FLUX-ControlNet (FLUX-CN) dataset with explicit control over viewpoint and pose. Structural guidance further enhanced performance when data is scarce (+4.1% mAP$_{50}$ with 8 real samples), but no additional benefit was observed when more real data is available. This study demonstrates that object-specific diffusion models are effective for improving military object detection in a low-data domain, and that structural guidance is most beneficial when real data is highly limited. These results highlight generative image data as an alternative to traditional simulation pipelines for the training of military AI systems.","published_date":"2026-04-20T10:46:41+00:00","viability_score":7,"cluster_label":"Generative AI for Computer Vision","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Leverage class-specific diffusion models and structural guidance to generate synthetic data for significantly improving military object detection in low-data scenarios.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.18071v1","title":"Architectural Design Decisions in AI Agent Harnesses","abstract":"AI agent systems increasingly rely on reusable non-LLM engineering infrastructure that packages tool mediation, context handling, delegation, safety control, and orchestration. Yet the architectural design decisions in this surrounding infrastructure remain understudied. 
This paper presents a protocol-guided, source-grounded empirical study of 70 publicly available agent-system projects, addressing three questions: which design-decision dimensions recur across projects, which co-occurrences structure those decisions, and which typical architectural patterns emerge. Methodologically, we contribute a transparent investigation procedure for analyzing heterogeneous agent-system corpora through source-code and technical-material reading. Empirically, we identify five recurring design dimensions (subagent architecture, context management, tool systems, safety mechanisms, and orchestration) and find that the corpus favors file-persistent, hybrid, and hierarchical context strategies; registry-oriented tool systems remain dominant while MCP- and plugin-oriented extensions are emerging; and intermediate isolation is common but high-assurance audit is rare. Cross-project co-occurrence analysis reveals that deeper coordination pairs with more explicit context services, stronger execution environments with more structured governance, and formalized tool-registration boundaries with broader ecosystem ambitions. We synthesize five recurring architectural patterns spanning lightweight tools, balanced CLI frameworks, multi-agent orchestrators, enterprise systems, and scenario-verticalized projects. 
The result provides an evidence-based account of architectural regularities in agent-system engineering, with grounded guidance for framework designers, selectors, and researchers.","published_date":"2026-04-20T10:39:34+00:00","viability_score":2,"cluster_label":"AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper analyzes architectural design decisions in 70 publicly available AI agent systems to identify recurring patterns and provide guidance for framework designers.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18064v1","title":"Understanding Human Actions through the Lens of Executable Models","abstract":"Human-centred systems require an understanding of human actions in the physical world. Temporally extended sequences of actions are intentional and structured, yet existing methods for recognising what actions are performed often do not attempt to capture their structure, particularly how the actions are executed. This, however, is crucial for assessing the quality of the action's execution and its differences from other actions. To capture the internal mechanics of actions, we introduce a domain-specific language EXACT that represents human motions as underspecified motion programs, interpreted as reward-generating functions for zero-shot policy inference using forward-backwards representations. By leveraging the compositional nature of EXACT motion programs, we combine individual policies into an executable neuro-symbolic model that uses program structure for compositional modelling. We evaluate the utility of the proposed pipeline for creating executable action models by analysing motion-capture data to understand human actions, for the tasks of human action segmentation and action anomaly detection. 
Our results show that the use of executable action models improves data efficiency and captures intuitive relationships between actions compared with monolithic, task-specific approaches.","published_date":"2026-04-20T10:33:18+00:00","viability_score":1,"cluster_label":"Human Action Understanding","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Introduces a domain-specific language EXACT to represent human motions as underspecified motion programs for zero-shot policy inference and compositional modeling.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18052v1","title":"ExAI5G: A Logic-Based Explainable AI Framework for Intrusion Detection in 5G Networks","abstract":"Intrusion detection systems (IDSs) for 5G networks must handle complex, high-volume traffic. Although opaque \"black-box\" models can achieve high accuracy, their lack of transparency hinders trust and effective operational response. We propose ExAI5G, a framework that prioritizes interpretability by integrating a Transformer-based deep learning IDS with logic-based explainable AI (XAI) techniques. The framework uses Integrated Gradients to attribute feature importance and extracts a surrogate decision tree to derive logical rules. We introduce a novel evaluation methodology for LLM-generated explanations, using a powerful evaluator LLM to assess actionability and measuring their semantic similarity and faithfulness. On a 5G IoT intrusion dataset, our system achieves 99.9\\% accuracy and a 0.854 macro F1-score, demonstrating strong performance. More importantly, we extract 16 logical rules with 99.7\\% fidelity, making the model's reasoning transparent. 
The evaluation demonstrates that modern LLMs can generate explanations that are both faithful and actionable, indicating that it is possible to build a trustworthy and effective IDS without compromising performance for the sake of marginal gains from an opaque model.","published_date":"2026-04-20T10:19:57+00:00","viability_score":7,"cluster_label":"Explainable AI for Cybersecurity","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A logic-based explainable AI framework for 5G intrusion detection that integrates Transformer models with XAI techniques to achieve high accuracy and transparent reasoning.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18050v1","title":"The Topological Dual of a Dataset: A Logic-to-Topology Encoding for AlphaGeometry-Style Data","abstract":"AlphaGeometry represents a milestone in neuro-symbolic reasoning, yet its architecture faces a log-linear scaling bottleneck within its symbolic deduction engine that limits its efficiency as problem complexity increases. Recent technical reports suggest that current domain-specific languages may be isomorphic as input representations to natural language, interchanging them acts as a performance-invariant transformation, implying that current neural guidance relies on superficial encodings rather than structural understanding. This paper addresses this representation bottleneck by proposing a logic-to-topology encoding designed to reveal the structural invariants of a model's latent space under a transformation of its input space. By leveraging the Logic of Observation, we utilize the duality between provability in observable theories and topologies to propose a logic-to-topology encoder for the input space. We introduce the concept of the \"topological dual of a dataset\", a transformation that bridges formal logic, topology, and neural processing. 
This framework serves as a Rosetta Stone for neuro-symbolic AI, providing a principled pathway for the mechanistic interpretability of how models navigate complex discovery paths.","published_date":"2026-04-20T10:18:08+00:00","viability_score":2,"cluster_label":"Neuro-Symbolic AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel logic-to-topology encoding framework to bridge formal logic, topology, and neural processing for mechanistic interpretability in neuro-symbolic AI.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.18038v1","title":"First, Do No Harm (With LLMs): Mitigating Racial Bias via Agentic Workflows","abstract":"Large language models (LLMs) are increasingly used in clinical settings, raising concerns about racial bias in both generated medical text and clinical reasoning. Existing studies have identified bias in medical LLMs, but many focus on single models and give less attention to mitigation. This study uses the EU AI Act as a governance lens to evaluate five widely used LLMs across two tasks, namely synthetic patient-case generation and differential diagnosis ranking. Using race-stratified epidemiological distributions in the United States and expert differential diagnosis lists as benchmarks, we apply structured prompt templates and a two-part evaluation design to examine implicit and explicit racial bias. All models deviated from observed racial distributions in the synthetic case generation task, with GPT-4.1 showing the smallest overall deviation. In the differential diagnosis task, DeepSeek V3 produced the strongest overall results across the reported metrics. When embedded in an agentic workflow, DeepSeek V3 showed an improvement of 0.0348 in mean p-value, 0.1166 in median p-value, and 0.0949 in mean difference relative to the standalone model, although improvement was not uniform across every metric. 
These findings support multi-metric bias evaluation for AI systems used in medical settings and suggest that retrieval-based agentic workflows may reduce some forms of explicit bias in benchmarked diagnostic tasks. Detailed prompt templates, experimental datasets, and code pipelines are available on our GitHub.","published_date":"2026-04-20T10:02:38+00:00","viability_score":7,"cluster_label":"Medical AI Bias Mitigation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An agentic workflow that leverages retrieval to mitigate racial bias in LLM-generated medical cases and differential diagnoses, with code available.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.18026v1","title":"RASP-Tuner: Retrieval-Augmented Soft Prompts for Context-Aware Black-Box Optimization in Non-Stationary Environments","abstract":"Many deployed systems expose black-box objectives whose minimizing configuration shifts with an externally observed context. When contexts revisit a small set of latent regimes, an optimizer that discards history pays repeated adaptation cost; when each step must remain inexpensive, full Gaussian-process (GP) refits at high observation counts are difficult to sustain. We cast online tuning as context-conditioned regret minimization and present RASP-Tuner, which instantiates a decomposition motivated by first principles: (i) identify a regime proxy by retrieving similar past contexts; (ii) predict short-horizon loss with a mixture-of-experts surrogate whose input concatenates parameters, context, and a retrieved soft prompt; (iii) adapt chiefly in a low-dimensional prompt subspace, invoking full surrogate updates only when scalarized error or disagreement spikes. A RealErrorComposer maps heterogeneous streaming metrics to [0,1] via EMA-stabilized logistic scores, supplying a single differentiable training target. 
On nine synthetic non-stationary benchmarks, an adversarial-context sanity check, and three tabular real-world streams (Section on real-world experiments), RASP-Tuner improves or matches cumulative regret relative to our GP-UCB and CMA-ES implementations on seven of nine synthetic tasks under paired tests at horizon T=100, while recording 8-12 times lower wall-clock per step than sliding-window GP-UCB on identical hardware. Idealized analysis in a cluster-separated, strongly convex regime model (RA-GD) supplies sufficient conditions for bounded dynamic regret; the deployed pipeline violates several of these premises, and we articulate which gaps remain open.","published_date":"2026-04-20T09:52:36+00:00","viability_score":5,"cluster_label":"Black-Box Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"RASP-Tuner, a retrieval-augmented soft prompt method for efficient context-aware black-box optimization in non-stationary environments.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18005v1","title":"Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation","abstract":"Multi-agent systems (MAS) are increasingly used for open-ended idea generation, driven by the expectation that collective interaction will broaden the exploration diversity. However, when and why such collaboration truly expands the solution space remains unclear. We present a systematic empirical study of diversity in MAS-based ideation across three bottom-up levels: model intelligence, agent cognition, and system dynamics. At the model level, we identify a compute efficiency paradox, where stronger, highly aligned models yield diminishing marginal diversity despite higher per-sample quality. At the cognition level, authority-driven dynamics suppress semantic diversity compared to junior-dominated groups. 
At the system level, group-size scaling yields diminishing returns and dense communication topologies accelerate premature convergence. We characterize these outcomes as collective failures emerging from structural coupling, a process where interaction inadvertently contracts agent exploration and triggers diversity collapse. Our analysis shows that this collapse arises primarily from the interaction structure rather than inherent model insufficiency, highlighting the importance of preserving independence and disagreement when designing MAS for creative tasks. Our code is available at https://github.com/Xtra-Computing/MAS_Diversity.","published_date":"2026-04-20T09:27:49+00:00","viability_score":4,"cluster_label":"Multi-Agent Systems","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Identifies diversity collapse in multi-agent LLM systems due to structural coupling, offering insights for designing more creative AI collaborations.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.18003v1","title":"SELF-EMO: Emotional Self-Evolution from Recognition to Consistent Expression","abstract":"Emotion Recognition in Conversation (ERC) has become a fundamental capability for large language models (LLMs) in human-centric interaction. Beyond accurate recognition, coherent emotional expression is also crucial, yet both are limited by the scarcity and static nature of high-quality annotated data. In this work, we propose SELF-EMO, a self-evolution framework grounded in the hypothesis that better emotion prediction leads to more consistent emotional responses. We introduce two auxiliary tasks, emotional understanding and emotional expression, and design a role-based self-play paradigm where the model acts as both an emotion recognizer and a dialogue responder. Through iterative interactions, the model generates diverse conversational trajectories, enabling scalable data generation. 
To ensure quality, we adopt a data flywheel mechanism that filters candidate predictions and responses using a smoothed IoU-based reward and feeds selected samples back for continuous self-improvement without external supervision. We further develop SELF-GRPO, a reinforcement learning algorithm that stabilizes optimization with multi-label alignment rewards and group-level consistency signals. Experiments on IEMOCAP, MELD, and EmoryNLP show that SELF-EMO achieves state-of-the-art performance, improving accuracy by +6.33% on Qwen3-4B and +8.54% on Qwen3-8B, demonstrating strong effectiveness and generalization.","published_date":"2026-04-20T09:27:40+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A self-evolving framework for LLMs to achieve state-of-the-art emotion recognition and consistent expression in conversations through self-play and reinforcement learning.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17989v1","title":"AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum","abstract":"What does it mean to give an AI agent a complete education? Current agent development produces specialists: systems optimized for a single capability dimension (whether tool use, code generation, or security awareness) that exhibit predictable deficits wherever they were not trained. We argue this pattern reflects a structural absence: there is no curriculum theory for agents, no principled account of what a fully developed agent should know, be, and be able to do across the full scope of intelligent behavior. This paper introduces the AIT Academy (Agents Institute of Technology Academy), a curriculum framework for cultivating AI agents across the tripartite structure of human knowledge. 
Grounded in Kagan's Three Cultures and UNESCO ISCED-F 2013, AIT organizes agent capability development into three domains: Natural Science and Technical Reasoning (Domain I), Humanities and Creative Expression (Domain II), and Social Science and Ethical Reasoning (Domain III). The Confucian Six Arts (liuyi), a 2,500-year-old holistic education system, are reinterpreted as behavioral archetypes that map directly onto trainable agent capabilities within each domain. Three representative training grounds instantiate the framework across multiple backbone LLMs: the ClawdGO Security Dojo (Domain I), Athen's Academy (Domain II), and the Alt Mirage Stage (Domain III). Experiments demonstrate a 15.9-point improvement in security capability scores under weakest-first curriculum scheduling, and a 7-percentage-point gain in social reasoning performance under principled attribution modeling. A cross-domain finding, Security Awareness Calibration Pathology (SACP), in which over-trained Domain I agents fail on out-of-distribution evaluation, illustrates the diagnostic value of a multi-domain perspective unavailable to any single-domain framework.","published_date":"2026-04-20T09:12:47+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A curriculum framework for AI agents that organizes capability development across scientific, humanities, and social domains, demonstrating improved security and social reasoning.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17986v1","title":"Latent Fourier Transform","abstract":"We introduce the Latent Fourier Transform (LatentFT), a framework that provides novel frequency-domain controls for generative music models. LatentFT combines a diffusion autoencoder with a latent-space Fourier transform to separate musical patterns by timescale. 
By masking latents in the frequency domain during training, our method yields representations that can be manipulated coherently at inference. This allows us to generate musical variations and blends from reference examples while preserving characteristics at desired timescales, which are specified as frequencies in the latent space. LatentFT parallels the role of the equalizer in music production: while traditional equalizers operate on audible frequencies to shape timbre, LatentFT operates on latent-space frequencies to shape musical structure. Experiments and listening tests show that LatentFT improves condition adherence and quality compared to baselines. We also present a technique for hearing frequencies in the latent space in isolation, and show that different musical attributes reside in different regions of the latent spectrum. Our results show how frequency-domain control in latent space provides an intuitive, continuous frequency axis for conditioning and blending, advancing us toward more interpretable and interactive generative music models.","published_date":"2026-04-20T09:08:13+00:00","viability_score":4,"cluster_label":"Generative Audio","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for generative music models that uses a latent-space Fourier transform to provide frequency-domain controls for timescale-based manipulation and blending.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17968v1","title":"From Fallback to Frontline: When Can LLMs be Superior Annotators of Human Perspectives?","abstract":"Although large language models (LLMs) are increasingly used as annotators at scale, they are typically treated as a pragmatic fallback rather than a faithful estimator of human perspectives. This work challenges that presumption. 
By framing perspective-taking as the estimation of a latent group-level judgment, we characterize the conditions under which modern LLMs can outperform human annotators, including in-group humans, when predicting aggregate subgroup opinions on subjective tasks, and show that these conditions are common in practice. This advantage arises from structural properties of LLMs as estimators, including low variance and reduced coupling between representation and processing biases, rather than any claim of lived experience. Our analysis identifies clear regimes where LLMs act as statistically superior frontline estimators, as well as principled limits where human judgment remains essential. These findings reposition LLMs from a cost-saving compromise to a principled tool for estimating collective human perspectives.","published_date":"2026-04-20T08:48:38+00:00","viability_score":3,"cluster_label":"LLM Applications","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Investigates conditions under which LLMs can outperform human annotators in estimating aggregate subgroup opinions on subjective tasks.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17967v1","title":"A Sugeno Integral View of Binarized Neural Network Inference","abstract":"In this article, we establish a precise connection between binarized neural networks (BNNs) and Sugeno integrals. The advantage of the Sugeno integral is that it provides a framework for representing the importance of inputs and their interactions, while being equivalent to a set of if-then rules. For a hidden BNN neuron at inference time, we show that the activation threshold test can be written as a Sugeno integral on binary inputs. This yields an explicit set-function representation of each neuron decision, and an associated rule-based representation. We also provide a Sugeno-integral expression for the last-layer score. 
Finally, we discuss how the same framework can be adapted to support richer input interactions and how it can be extended beyond the binary case induced by binarized neural networks.","published_date":"2026-04-20T08:47:06+00:00","viability_score":1,"cluster_label":"AI Theory","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper establishes a theoretical connection between binarized neural networks and Sugeno integrals, offering a new framework for understanding input importance and interactions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17966v1","title":"TPS-CalcBench: A Benchmark and Diagnostic Evaluation Framework for LLM Analytical Calculation Competence in Hypersonic Thermal Protection System Engineering","abstract":"Deploying LLMs as reasoning assistants in safety-critical aerospace engineering requires stricter evaluation criteria than general scientific benchmarks. In hypersonic thermal protection system (TPS) design, inaccurate stagnation-point heat flux or boundary-layer calculations may cause catastrophic design margin violations. Models with numerically reasonable but physically invalid answers are more dangerous than those declining to respond. Current scientific benchmarks only test abstract math and basic physics, evaluate final answers solely, ignore engineering reasoning processes, and cannot detect such critical failures. We propose TPS-CalcBench, the first diagnostic benchmark for closed-form analytical calculations in hypersonic aerodynamics and high-temperature gas dynamics that experienced TPS engineers conduct without simulations. 
Our contributions include domain-oriented task taxonomy with 4 difficulty levels and 8 categories from Anderson's textbook, dual-track evaluation measuring result accuracy and reasoning quality via an 8-dimension rubric and calibrated judge with human audit to identify right answer wrong reasoning issues, human-AI data pipeline producing 420 high-confidence core items and 810 noise-controlled pre-gating items from 4560 raw data, noise-sensitivity analysis measuring data quality impacts on model ranking, and three diagnostic intervention methods: DFA-TPS fine-tuning, RAG-EQ retrieval grounding and PA-CoT process-aware prompting. Tests on 13 models from 7 groups show wide performance differences (KPI 12.6-87.9), hidden formula selection defects, data-driven rank changes and effective intervention improvements, establishing a complete diagnose-evaluate-intervene framework for safety-critical engineering LLM deployment assessment.","published_date":"2026-04-20T08:46:49+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"TPS-CalcBench is a diagnostic benchmark and evaluation framework for LLM analytical calculation competence in safety-critical aerospace engineering, including intervention methods.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17950v1","title":"CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation","abstract":"We revisit multi-agent delegation under a stronger and more realistic assumption: an agent's capability is not fixed at the skill level, but depends on task context. A coding agent may excel at short standalone edits yet fail on long-horizon debugging; a planner may perform well on shallow tasks yet degrade on chained dependencies. Static skill-level capability profiles therefore average over heterogeneous situations and can induce systematic misdelegation. 
We propose CADMAS-CTX, a framework for contextual capability calibration. For each agent, skill, and coarse context bucket, CADMAS-CTX maintains a Beta posterior that captures stable experience in that part of the task space. Delegation is then made by a risk-aware score that combines the posterior mean with an uncertainty penalty, so that agents delegate only when a peer appears better and that assessment is sufficiently well supported by evidence. This paper makes three contributions. First, a hierarchical contextual capability profile replaces static skill-level confidence with context-conditioned posteriors. Second, based on contextual bandit theory, we formally prove context-aware routing achieves lower cumulative regret than static routing under sufficient context heterogeneity, formalizing the bias-variance tradeoff. Third, we empirically validate our method on GAIA and SWE-bench benchmarks. On GAIA with GPT-4o agents, CADMAS-CTX achieves 0.442 accuracy, outperforming static baseline 0.381 and AutoGen 0.354 with non-overlapping 95% confidence intervals. On SWE-bench Lite, it improves resolve rate from 22.3% to 31.4%. Ablations show the uncertainty penalty improves robustness against context tagging noise. 
Our results demonstrate contextual calibration and risk-aware delegation significantly improve multi-agent teamwork compared with static global skill assignments.","published_date":"2026-04-20T08:30:28+00:00","viability_score":7,"cluster_label":"Multi-Agent Systems","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CADMAS-CTX is a framework for contextual capability calibration in multi-agent delegation, improving teamwork by adapting agent capabilities to task context.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17948v1","title":"RAVEN: Retrieval-Augmented Vulnerability Exploration Network for Memory Corruption Analysis in User Code and Binary Programs","abstract":"Large Language Models (LLMs) have demonstrated remarkable capabilities across various cybersecurity tasks, including vulnerability classification, detection, and patching. However, their potential in automated vulnerability report documentation and analysis remains underexplored. We present RAVEN (Retrieval Augmented Vulnerability Exploration Network), a framework leveraging LLM agents and Retrieval Augmented Generation (RAG) to synthesize comprehensive vulnerability analysis reports. Given vulnerable source code, RAVEN generates reports following the Google Project Zero Root Cause Analysis template. The framework uses four modules: an Explorer agent for vulnerability identification, a RAG engine retrieving relevant knowledge from curated databases including Google Project Zero reports and CWE entries, an Analyst agent for impact and exploitation assessment, and a Reporter agent for structured report generation. To ensure quality, RAVEN includes a task specific LLM Judge evaluating reports across structural integrity, ground truth alignment, code reasoning quality, and remediation quality. We evaluate RAVEN on 105 vulnerable code samples covering 15 CWE types from the NIST-SARD dataset. 
Results show an average quality score of 54.21%, supporting the effectiveness of our approach for automated vulnerability documentation.","published_date":"2026-04-20T08:29:48+00:00","viability_score":7,"cluster_label":"Cybersecurity AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"RAVEN is a framework using LLM agents and RAG to synthesize comprehensive vulnerability analysis reports for memory corruption in code and binaries.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17937v1","title":"ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis","abstract":"Prompt optimization methods either analyze individual failures in isolation or compare prompt variants across examples, operating on single execution traces with no access to the reasoning process distinguishing success from failure on the same input. We introduce ContraPrompt, built on the observation that when a model fails but succeeds on a retry with feedback, the difference between its two chain-of-thought traces constitutes an optimization signal not captured by prior methods. Unlike prior contrastive methods, we compare complete intermediate reasoning processes: the two traces share model, input, and base prompt, so remaining differences reflect reasoning strategy and appended error feedback -- we call this dyadic reasoning trace analysis. The multi-attempt solving phase is an instrumented agentic retry loop that generates contrastive data automatically without human annotation. Extracted rules are organized into an input-aware decision tree routing instructions by observable input characteristics. On four reasoning and compliance benchmarks, ContraPrompt outperforms GEPA (Agrawal et al., 2026) on all four, with absolute gains of +8.29 pp on HotPotQA (+20.8% rel.), +2.21 pp on GDPR-Bench (+18.2% rel.), +7.14 pp on GPQA Diamond (+10.6% rel.), and +0.74 pp on BBH (+0.85% rel.). 
Ablations confirm dyadic trace contrastivity is the critical component, with a -16% relative average drop upon its removal. On 53 EvalSet black-box optimization problems, ContraPrompt beats GEPA on 11, ties on 41, and loses on 1 at equal budget. On FiNER-139 financial named entity recognition (Loukas et al., 2022), ContraPrompt achieves +7.77 pp over the unoptimized baseline (+11.6% rel.) and +1.94 pp over GEPA (+2.66% rel.), with branch conditions aligning with standard US GAAP financial-instrument categories.","published_date":"2026-04-20T08:17:15+00:00","viability_score":7,"cluster_label":"LLM Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ContraPrompt optimizes LLM prompts by analyzing the reasoning traces of successful and failed attempts, significantly outperforming existing methods on multiple benchmarks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17935v1","title":"How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers","abstract":"The key-value (KV) cache is the dominant memory bottleneck during Transformer inference, yet little is known theoretically about how aggressively it can be compressed before multi-step reasoning degrades. We study this through $k$-hop pointer chasing on $n$ tokens under a shared KV cache of size $s$, attention dimension $m$, $H$ heads, $p$-bit precision, and a locality-respecting cache controller (satisfied by all standard KV-compression methods). We give three results.   (1) Product depth lower bound (conjectured). We conjecture that any such Transformer ($n \\geq 4k$, $s \\leq \\sqrt{n}/4$) requires depth $L = \u03a9(\\lceil k/s \\rceil \\cdot \\lceil \\log_2 n/(Hmp) \\rceil)$, and isolate the sole remaining gap as a probabilistic step on the joint distribution of cache trace and pointer chain. 
Unconditionally, we prove a matching upper bound $L = O(\\min(k, \\lceil k/s \\rceil \\log s) \\cdot \\log n/(mp))$ via windowed pointer doubling, and a max-bound $L = \u03a9(\\max(\\lceil k/s \\rceil, \\log n/(Hmp)))$. Closing the conjecture amounts to upgrading max to product.   (2) Bandwidth barrier. The product bound binds only when $Hmp \\lesssim \\log n$. Any lower bound provable via per-window distinguishability counting -- including reachability, bandwidth, and combinations -- cannot exceed $\\lceil k/s \\rceil$ once $Hmp \\geq \\log_2 n$. Breaking this requires lifting unconditional communication-complexity bounds for pointer chasing to Cache-Transformer depth.   (3) Adaptive vs oblivious error scaling. Under random cache over $T = \\lceil \\log_2 k \\rceil$ doubling stages, oblivious caches give $\\Pr[\\mathcal{E}] \\leq (s/(n-T))^T + 2T^3/n$ (exponential in $T$), while adaptive locality-respecting caches achieve $\\Pr[\\mathcal{E}] = s/n$ exactly, independent of $T$. The $\u03a9((n/s)^{T-1})$ separation explains why heavy-hitter eviction empirically dominates random eviction for multi-hop reasoning.","published_date":"2026-04-20T08:15:17+00:00","viability_score":1,"cluster_label":"LLM Inference Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper theoretically analyzes the depth-cache tradeoffs in KV-compressed Transformers, focusing on memory bottlenecks during inference.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17931v1","title":"LiteResearcher: A Scalable Agentic RL Training Framework for Deep Research Agent","abstract":"Reinforcement Learning (RL) has emerged as a powerful training paradigm for LLM-based agents. 
However, scaling agentic RL for deep research remains constrained by two coupled challenges: hand-crafted synthetic data fails to elicit genuine real-world search capabilities, and real-world search dependency during RL training introduces instability and prohibitive cost, which limits the scalability of Agentic RL. LiteResearcher is a training framework that makes Agentic RL scalable: by constructing a lite virtual world that mirrors real-world search dynamics, we enable a continuously improving training recipe that empowers a tiny search agent to outperform large-scale open-source and commercial models (e.g., Tongyi DeepResearch and Claude-4.5 Sonnet). Specifically, on common benchmarks such as GAIA and Xbench, our LiteResearcher-4B achieves open-source state-of-the-art results of 71.3% and 78.0% respectively, demonstrating that scalable RL training is a key enabler for Deep Research Agents.","published_date":"2026-04-20T08:11:09+00:00","viability_score":8,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LiteResearcher is a scalable RL training framework for research agents, enabling a small agent to outperform large commercial models by mirroring real-world search dynamics in a lite virtual world.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17930v1","title":"Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?","abstract":"Large Language Models (LLMs) exhibit a puzzling disparity in their formal linguistic competence: while they learn some linguistic phenomena with near-perfect mastery, they often perform below chance on others, even after training on trillions of tokens. In this work, we investigate whether these failures stem from inherent architectural limitations or simply the scarcity of these specific grammatical constructions in web-scale corpora. 
We pre-train simple GPT-2 Small (124M) models on a 100M-token random sample of the FineWeb corpus and intervene by injecting a minimal amount (1%) of synthetic data targeting specific linguistic phenomena. We find that this targeted intervention substantially improves model performance in 8 out of the 9 worst-performing BLiMP paradigms - notably the accuracy on a specific paradigm, only_npi_scope, surges from 20.9% to 69.4%. Furthermore, we observe that these interventions generally preserve or slightly improve aggregate performance. However, while we also identify a resistant phenomenon, principle_A_c_command, whose performance remains below chance even after our data augmentation, our findings do serve as an optimistic existence proof that even small language models can substantially improve on those linguistic phenomena on which models typically perform poorly, provided the pre-training data contains sufficient exposure to them. This suggests that efforts towards human-scale language modeling may benefit greatly by focusing on data composition. 
The code to reproduce our results is open-sourced at https://github.com/kowndinya-renduchintala/heterogeneity-in-formal-linguistic-competence.","published_date":"2026-04-20T08:11:04+00:00","viability_score":7,"cluster_label":"LLM Training Data","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research demonstrates that targeted data augmentation, not architectural limitations, is key to improving LLMs' formal linguistic competence, with code available.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17928v1","title":"HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment","abstract":"Reinforcement Learning with Verifiable Reward (RLVR) has proven effective for training reasoning-oriented large language models, but existing methods largely assume high-resource settings with abundant training data. In low-resource scenarios, RLVR is prone to more severe entropy collapse, which substantially limits exploration and degrades reasoning performance. To address this issue, we propose Hybrid-domain Entropy dynamics ALignment (HEAL), a framework tailored for few-shot RLVR. HEAL first selectively incorporates high-value general-domain data to promote more diverse exploration. Then, we introduce Entropy Dynamics Alignment (EDA), a reward mechanism that aligns trajectory-level entropy dynamics between the target and general domains, capturing both entropy magnitude and fine-grained variation. Through this alignment, EDA not only further mitigates entropy collapse but also encourages the policy to acquire more diverse exploration behaviors from the general domain. Experiments across multiple domains show that HEAL consistently improves few-shot RLVR performance. 
Notably, using only 32 target-domain samples, HEAL matches or even surpasses full-shot RLVR trained with 1K target-domain samples.","published_date":"2026-04-20T08:09:01+00:00","viability_score":3,"cluster_label":"Few-Shot RL","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for improving few-shot reinforcement learning with verifiable rewards by aligning entropy dynamics between general and target domains.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17927v1","title":"Brain-Inspired Capture: Evidence-Driven Neuromimetic Perceptual Simulation for Visual Decoding","abstract":"Visual decoding of neurophysiological signals is a critical challenge for brain-computer interfaces (BCIs) and computational neuroscience. However, current approaches are often constrained by the systematic and stochastic gaps between neural and visual modalities, largely neglecting the intrinsic computational mechanisms of the Human Visual System (HVS). To address this, we propose Brain-Inspired Capture (BI-Cap), a neuromimetic perceptual simulation paradigm that aligns these modalities by emulating HVS processing. Specifically, we construct a neuromimetic pipeline comprising four biologically plausible dynamic and static transformations, coupled with Mutual Information (MI)-guided dynamic blur regulation to simulate adaptive visual processing. Furthermore, to mitigate the inherent non-stationarity of neural activity, we introduce an evidence-driven latent space representation. This formulation explicitly models uncertainty, thereby ensuring robust neural embeddings. Extensive evaluations on zero-shot brain-to-image retrieval across two public benchmarks demonstrate that BI-Cap substantially outperforms state-of-the-art methods, achieving relative gains of 9.2% and 8.0%, respectively. 
We have released the source code on GitHub through the link https://github.com/flysnow1024/BI-Cap.","published_date":"2026-04-20T08:07:33+00:00","viability_score":7,"cluster_label":"Brain-Computer Interfaces","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A neuromimetic simulation paradigm for visual decoding from neurophysiological signals, improving brain-computer interfaces by emulating human visual system processing.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17920v1","title":"Prompting Foundation Models for Zero-Shot Ship Instance Segmentation in SAR Imagery","abstract":"Synthetic Aperture Radar (SAR) plays a critical role in maritime surveillance, yet deep learning for SAR analysis is limited by the lack of pixel-level annotations. This paper explores how general-purpose vision foundation models can enable zero-shot ship instance segmentation in SAR imagery, eliminating the need for pixel-level supervision. A YOLOv11-based detector trained on open SAR datasets localizes ships via bounding boxes, which then prompt the Segment Anything Model 2 (SAM2) to produce instance masks without any mask annotations. Unlike prior SAM-based SAR approaches that rely on fine tuning or adapters, our method demonstrates that spatial constraints from a SAR-trained detector alone can effectively regularize foundation model predictions. This design partially mitigates the optical-SAR domain gap and enables downstream applications such as vessel classification, size estimation, and wake analysis. 
Experiments on the SSDD benchmark achieve a mean IoU of 0.637 (89% of a fully supervised baseline) with an overall ship detection rate of 89.2%, confirming a scalable, annotation-efficient pathway toward foundation-model-driven SAR image understanding.","published_date":"2026-04-20T07:57:11+00:00","viability_score":7,"cluster_label":"SAR Image Analysis","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Enabling zero-shot ship instance segmentation in SAR imagery by prompting foundation models with bounding boxes from a SAR-trained detector.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17912v1","title":"Learning to Correct: Calibrated Reinforcement Learning for Multi-Attempt Chain-of-Thought","abstract":"State-of-the-art reasoning models utilize long chain-of-thought (CoT) to solve increasingly complex problems using more test-time computation. In this work, we explore a long CoT setting where the model makes up to K successive attempts at solving a problem, in which each attempt is allowed to build on earlier ones after the model receives hard verifier feedback. This motivates RL methods that can harness per-attempt rewards by carefully weighting individual attempts. We study optimizing the Verification@K reward (the model succeeds by the K-th attempt) and show that naively weighting the attempts by their pass/fail outcomes results in biased gradients. We introduce Calibrated Attempt-Level (CAL) GRPO by devising a weighting strategy to obtain unbiased gradients while maintaining small variance. Our theory reveals how incorporating per-attempt rewards influences the training and the eventual Verification@K performance. 
Experiments, baselines, and ablations on synthetic and real data corroborate our theory and the benefits of CAL-GRPO over vanilla GRPO as well as naive weighting.","published_date":"2026-04-20T07:42:22+00:00","viability_score":3,"cluster_label":"Chain-of-Thought Reasoning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A calibrated reinforcement learning approach for multi-attempt chain-of-thought reasoning that optimizes verification success rates.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17910v1","title":"Physics-Informed Causal MDPs for Sequential Constraint Repair in Engineering Simulation Pipelines","abstract":"Off-policy learning in constrained MDPs with large binary state spaces faces a fundamental tension: causal identification of transition dynamics requires structural assumptions, while sample-efficient policy learning requires state-space compression. We introduce PI-CMDP, a framework for CMDPs whose constraint dependencies form a layered DAG under a Lifecycle Ordering Assumption (LOA). We propose an Identify-Compress-Estimate pipeline: (i) Identify: LOA enables backdoor identification of causal edge weights for cross-layer pairs, with formal partial-identification bounds when LOA is violated; (ii) Compress: a Markov abstraction compresses state cardinality from 2^(WL) to (W+1)^L under layer-priority regularity and exchangeability; and (iii) Estimate: a physics-guided doubly-robust estimator remains unbiased and reduces the variance constant when the physics prior outperforms a learned model. We instantiate PI-CMDP on constraint repair in engineering simulation pipelines. On the TPS benchmark (4,206 episodes), PI-CMDP achieves 76.2% repair success rate with only 300 training episodes versus 70.8% for the strongest baseline (+5.4 pp), narrowing to +2.8 pp (83.4% vs. 80.6%) in the full-data regime, while substantially reducing cascade failure rates. 
All improvements are consistent across 5 independent seeds (paired t-test p < 0.02).","published_date":"2026-04-20T07:40:15+00:00","viability_score":3,"cluster_label":"Reinforcement Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for constrained reinforcement learning in engineering simulations that uses causal identification and physics-guided estimation to improve success rates.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.17906v1","title":"Bayesian Active Learning with Gaussian Processes Guided by LLM Relevance Scoring for Dense Passage Retrieval","abstract":"While Large Language Models (LLMs) exhibit exceptional zero-shot relevance modeling, their high computational cost necessitates framing passage retrieval as a budget-constrained global optimization problem. Existing approaches passively rely on first-stage dense retrievers, which leads to two limitations: (1) failing to retrieve relevant passages in semantically distinct clusters, and (2) failing to propagate relevance signals to the broader corpus. To address these limitations, we propose Bayesian Active Learning with Gaussian Processes guided by LLM relevance scoring (BAGEL), a novel framework that propagates sparse LLM relevance signals across the embedding space to guide global exploration. BAGEL models the multimodal relevance distribution across the entire embedding space with a query-specific Gaussian Process (GP) based on LLM relevance scores. Subsequently, it iteratively selects passages for scoring by strategically balancing the exploitation of high-confidence regions with the exploration of uncertain areas. 
Extensive experiments across four benchmark datasets and two LLM backbones demonstrate that BAGEL effectively explores and captures complex relevance distributions and outperforms LLM reranking methods under the same LLM budget on all four datasets.","published_date":"2026-04-20T07:32:56+00:00","viability_score":5,"cluster_label":"Information Retrieval","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A Bayesian active learning framework that uses Gaussian Processes guided by LLM relevance scoring to improve dense passage retrieval efficiency and coverage.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.17897v1","title":"LoReC: Rethinking Large Language Models for Graph Data Analysis","abstract":"The advent of Large Language Models (LLMs) has fundamentally reshaped the way we interact with graphs, giving rise to a new paradigm called GraphLLM. As revealed in recent studies, graph learning can benefit from LLMs. However, we observe limited benefits when we directly utilize LLMs to make predictions for graph-related tasks within GraphLLM paradigm, which even yields suboptimal results compared to conventional GNN-based approaches. Through in-depth analysis, we find this failure can be attributed to LLMs' limited capability for processing graph data and their tendency to overlook graph information. To address this issue, we propose LoReC (Look, Remember, and Contrast), a novel plug-and-play method for GraphLLM paradigm, which enhances LLM's understanding of graph data through three stages: (1) Look: redistributing attention to graph; (2) Remember: re-injecting graph information into the Feed-Forward Network (FFN); (3) Contrast: rectifying the vanilla logits produced in the decoding process. Extensive experiments demonstrate that LoReC brings notable improvements over current GraphLLM methods and outperforms GNN-based approaches across diverse datasets. 
The implementation is available at https://github.com/Git-King-Zhan/LoReC.","published_date":"2026-04-20T07:16:29+00:00","viability_score":7,"cluster_label":"Graph LLMs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LoReC is a plug-and-play method that enhances Large Language Models for graph data analysis by improving their understanding of graph structures.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17896v1","title":"Can Explicit Physical Feasibility Benefit VLA Learning? An Empirical Study","abstract":"Vision-Language-Action (VLA) models map multimodal inputs directly to robot actions and are typically trained through large-scale imitation learning. While this paradigm has shown strong performance, prevailing VLA training procedures do not explicitly supervise hard physical constraints such as obstacle avoidance or kinematic feasibility. As a result, the geometric structure underlying physically feasible behavior must be inferred only implicitly from demonstrations. In this paper, we study whether introducing explicit feasibility supervision can provide effective structured guidance for VLA policies. We formulate a simple geometry-grounded feasibility objective and integrate it into the training stage of a diffusion-based VLA policy. To evaluate this idea systematically, we use obstacle-aware manipulation as a controlled probe of geometry-dependent physical feasibility. Empirical results show that augmenting VLA training with feasibility supervision improves both physical reliability and overall task performance, while also enhancing learning efficiency in the low-data regime. 
These findings indicate that explicit feasibility signals can effectively complement imitation-based VLA learning, highlighting their potential for developing more reliable VLA policies.","published_date":"2026-04-20T07:15:12+00:00","viability_score":5,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Integrating explicit physical feasibility supervision into Vision-Language-Action models improves robot policy reliability and learning efficiency.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.17892v1","title":"LEPO: Latent Reasoning Policy Optimization for Large Language Models","abstract":"Recently, latent reasoning has been introduced into large language models (LLMs) to leverage rich information within a continuous space. However, without stochastic sampling, these methods inevitably collapse to deterministic inference, failing to discover diverse reasoning paths. To bridge the gap, we inject controllable stochasticity into latent reasoning via Gumbel-Softmax, restoring LLMs' exploratory capacity and enhancing their compatibility with Reinforcement Learning (RL). Building on this, we propose Latent Reasoning Policy Optimization (LEPO), a novel framework that applies RL directly to continuous latent representations. Specifically, in rollout stage, LEPO maintains stochasticity to enable diverse trajectory sampling, while in optimization stage, LEPO constructs a unified gradient estimation for both latent representations and discrete tokens. 
Extensive experiments show that LEPO significantly outperforms existing RL methods for discrete and latent reasoning.","published_date":"2026-04-20T07:05:12+00:00","viability_score":3,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel framework for large language models that applies reinforcement learning directly to continuous latent representations to improve reasoning diversity and performance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17886v1","title":"Latent Preference Modeling for Cross-Session Personalized Tool Calling","abstract":"Users often omit essential details in their requests to LLM-based agents, resulting in under-specified inputs for tool use. This poses a fundamental challenge for tool-augmented agents, as API execution typically requires complete arguments, highlighting the need for personalized tool calling. To study this problem, we introduce MPT, a benchmark comprising 265 multi-session dialogues that cover three challenges: Preference Recall, Preference Induction, and Preference Transfer. We also propose PRefine, a test-time memory-augmented method that represents user preferences as evolving hypotheses. Through a generate--verify--refine loop, it extracts reusable constraints from history and improves tool-calling accuracy while using only 1.24% of the tokens required by full-history prompting. 
These results indicate that robust personalization in agentic systems depends on memory that captures the reasons behind user choices, not just the choices themselves.","published_date":"2026-04-20T06:57:50+00:00","viability_score":6,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and memory-augmented method for LLM agents to personalize tool calling by representing user preferences as evolving hypotheses, improving accuracy with minimal token usage.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.17884v1","title":"SPREG: Structured Plan Repair with Entropy-Guided Test-Time Intervention for Large Language Model Reasoning","abstract":"Large Language Models (LLMs) are prone to logical hallucinations and stochastic drifts during long-chain reasoning. While Classifier-Free Guidance (CFG) can improve instruction adherence, standard static implementations often cause semantic dilution and linguistic degradation. We propose SPREG (Structured Plan-guided Real-time Entropy Gating), a lightweight inference-time framework for surgical error rectification. SPREG employs an adaptive dual-threshold mechanism to monitor real-time entropy, identifying sudden ``entropy spikes'' as reliable indicators of logical failure. Upon detection, it triggers a dynamic repair by replacing uninformative null-priors with reference distributions synthesized from historical high-confidence states. By modulating guidance intensity according to structured reasoning stages (e.g., Action, Observation), SPREG steers the model back to a stable manifold without compromising fluency. 
Our experiments demonstrate significant gains, notably a 20.0% absolute accuracy improvement on AIME25, while effectively suppressing uncontrolled entropy drift in complex tasks.","published_date":"2026-04-20T06:55:26+00:00","viability_score":5,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A lightweight inference-time framework for LLMs that uses real-time entropy monitoring to detect and rectify logical failures during long-chain reasoning without compromising fluency.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.17866v1","title":"Latent Abstraction for Retrieval-Augmented Generation","abstract":"Retrieval-Augmented Generation (RAG) has become a standard approach for enhancing large language models (LLMs) with external knowledge, mitigating hallucinations, and improving factuality. However, existing systems rely on generating natural language queries at each hop and maintaining a strict architectural separation between retriever and generator, preventing them from leveraging the full representational capacity of the LLM. We propose \\textbf{LAnR} (Latent Abstraction for RAG), a unified framework in which a single LLM jointly performs encoding, retrieval, and generation entirely within its own latent space. Rather than generating textual queries, LAnR produces dense retrieval vectors from the hidden states of a designated \\texttt{[PRED]} token and uses them to match against encoded document representations from the same model. Furthermore, LAnR adaptively decides when sufficient evidence has been retrieved using a lightweight MLP control head over those same hidden states, eliminating both the separate retriever and explicit token-level stopping reasoning. This design is motivated by our empirical observation that answer token entropy reliably signals retrieval sufficiency. 
Extensive experiments on six QA benchmarks spanning single-hop and multi-hop settings demonstrate that LAnR outperforms existing RAG methods, while achieving improved inference efficiency through reduced number of retrieval calls and tighter model integration.","published_date":"2026-04-20T06:26:13+00:00","viability_score":6,"cluster_label":"Retrieval Augmented Generation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified framework for RAG that performs encoding, retrieval, and generation entirely within an LLM's latent space, improving efficiency and performance.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.17863v1","title":"Periodic Steady-State Control of a Handkerchief-Spinning Task Using a Parallel Anti-Parallelogram Tendon-driven Wrist","abstract":"Spinning flexible objects, exemplified by traditional Chinese handkerchief performances, demands periodic steady-state motions under nonlinear dynamics with frictional contacts and boundary constraints. To address these challenges, we first design an intuitive dexterous wrist based on a parallel anti-parallelogram tendon-driven structure, which achieves 90 degrees omnidirectional rotation with low inertia and decoupled roll-pitch sensing, and implement a high-low level hierarchical control scheme. We then develop a particle-spring model of the handkerchief for control-oriented abstraction and strategy evaluation. Hardware experiments validate this framework, achieving an unfolding ratio of approximately 99% and fingertip tracking error of RMSE = 2.88 mm in high-dynamic spinning. These results demonstrate that integrating control-oriented modeling with a task-tailored dexterous wrist enables robust rest-to-steady-state transitions and precise periodic manipulation of highly flexible objects. 
More visualizations: https://slowly1113.github.io/icra2026-handkerchief/","published_date":"2026-04-20T06:21:12+00:00","viability_score":4,"cluster_label":"Robotics Control","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel tendon-driven wrist and hierarchical control scheme enable precise, high-speed spinning of flexible objects like handkerchiefs.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.17857v1","title":"On the Emergence of Syntax by Means of Local Interaction","abstract":"Can syntactic processing emerge spontaneously from purely local interaction? We present a concrete instance on a minimal system: an 18,658-parameter two-dimensional neural cellular automaton (NCA), supervised by nothing more than a 1-bit boundary signal, is trained on the membership problem of an arithmetic-expression grammar. After training, its internal $L \\times L$ grid spontaneously self-organizes into an ordered, spatially extended representation that we name Proto-CKY. This representation satisfies three operational criteria for syntactic processing: expressive power beyond the regular languages, structural generalization beyond the training distribution, and an internal organization quantitatively aligned with grammatical structure (Pearson $r \\approx 0.71$). It emerges independently on four context-free grammars and regenerates spontaneously after perturbation. 
Proto-CKY is functionally aligned with the CKY algorithm but formally distinct from it: it is a physical prototype, a concrete instantiation of a mathematical ideal on a physical substrate, and the systematic distance between the two carries information about the substrate itself.","published_date":"2026-04-20T06:10:50+00:00","viability_score":0,"cluster_label":"AI Theory","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A minimal neural cellular automaton spontaneously develops a structured internal representation for syntactic processing, mimicking CKY algorithm principles.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17849v1","title":"On the Reliability of Computer Use Agents","abstract":"Computer-use agents have rapidly improved on real-world tasks such as web navigation, desktop automation, and software interaction, in some cases surpassing human performance. Yet even when the task and model are unchanged, an agent that succeeds once may fail on a repeated execution of the same task. This raises a fundamental question: if an agent can succeed at a task once, what prevents it from doing so reliably? In this work, we study the sources of unreliability in computer-use agents through three factors: stochasticity during execution, ambiguity in task specification, and variability in agent behavior. We analyze these factors on OSWorld using repeated executions of the same task together with paired statistical tests that capture task-level changes across settings. Our analysis shows that reliability depends on both how tasks are specified and how agent behavior varies across executions. 
These findings suggest the need to evaluate agents under repeated execution, to allow agents to resolve task ambiguity through interaction, and to favor strategies that remain stable across runs.","published_date":"2026-04-20T05:59:04+00:00","viability_score":3,"cluster_label":"AI Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research identifies key factors like stochasticity, ambiguity, and behavioral variability that cause unreliability in computer-use AI agents.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17846v1","title":"AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis","abstract":"MRI is preferred over CT in paediatric imaging because it avoids ionising radiation, but its use in spine deformity assessment is largely limited by the lack of automated, high-resolution 3D bony reconstruction, which continues to rely on CT. MRI-based 3D reconstruction remains impractical due to manual workflows and the scarcity of labelled full-spine datasets. This study introduces an AI framework that enables fully automated thoracolumbar spine (T1-L5) segmentation and 3D reconstruction from MRI alone. Historical low-dose CT scans from adolescent idiopathic scoliosis (AIS) patients were converted into MRI-like images using a GAN and combined with existing labelled thoracic MRI data to train a U-Net-based model. The resulting algorithm accurately generated continuous thoracolumbar 3D reconstructions, improved segmentation accuracy (88% Dice score), and reduced processing time from approximately 1 hour to under one minute, while preserving AIS-specific deformity features. 
This approach enables radiation-free 3D deformity assessment from MRI, supporting clinical evaluation, surgical planning, and navigation in paediatric spine care.","published_date":"2026-04-20T05:57:06+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An AI framework enables radiation-free 3D spine reconstruction from MRI for pediatric scoliosis assessment, significantly reducing processing time and improving accuracy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17843v1","title":"Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research","abstract":"General-purpose LLMs pose misinformation risks for development and policy experts, lacking epistemic humility for verifiable outputs. We present AVA (AI + Verified Analysis), a GenAI platform built on a curated library of over 4,000 World Bank Reports with multilingual capabilities. AVA's multi-agent pipeline enables users to query and receive evidence-based syntheses. It operationalizes epistemic humility through two mechanisms: citation verifiability (tracing claims to sources) and reasoned abstention (declining unsupported queries with justification and redirection). We conducted an in-the-wild evaluation with over 2,200 individuals from heterogeneous organisations and roles in 116 countries, via log analysis, surveys, and 20 interviews. Difference-in-Differences estimates associate sustained engagement with 2.4-3.9 hours saved weekly. Qualitatively, participants used AVA as a specialized \"evidence engine\"; reasoned abstention clarified scope boundaries, and trust was calibrated through institutional provenance and page-anchored citations. 
We contribute design guidelines for specialized AI and articulate a vision for \"ecosystem-aware\" Humble AI.","published_date":"2026-04-20T05:53:52+00:00","viability_score":7,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A GenAI platform for policy experts that provides evidence-based syntheses with verifiable citations and reasoned abstention, saving users significant weekly hours.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17837v1","title":"Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs","abstract":"An LLM's residual stream is both state and instruction: it encodes the current context and determines the next transformation. We introduce a parameter-free decomposition for Mixture-of-Experts models that splits each layer's hidden state into a control signal that causally drives routing and an orthogonal content channel invisible to the router. Across six MoE architectures, we find that models preserve surface-level features (language, token identity, position) in the content channel, while the control signal encodes an abstract function that rotates from layer to layer. Because each routing decision is low-bandwidth, this hand-off forces compositional specialization across layers. While individual experts remain polysemantic, expert paths become monosemantic, clustering tokens by semantic function across languages and surface forms. The same token (e.g., \":\") follows distinct trajectories depending on whether it serves as a type annotation, an introductory colon, or a time separator. Our decomposition identifies the source of this structure: clusters in the control subspace are substantially more monosemantic than those in the full representation. 
As a result, the natural unit of interpretability in MoEs is not the expert but the trajectory.","published_date":"2026-04-20T05:47:26+00:00","viability_score":1,"cluster_label":"LLM Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A parameter-free decomposition for Mixture-of-Experts models that separates control signals from content channels to improve compositional specialization across layers.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17823v1","title":"A novel LSTM music generator based on the fractional time-frequency feature extraction","abstract":"In this paper, we propose a novel approach for generating music based on an artificial intelligence (AI) system. We analyze the features of music and use them to fit and predict the music. The fractional Fourier transform (FrFT) and the long short-term memory (LSTM) network are the foundations of our method. The FrFT method is used to extract the spectral features of a music piece, where the music signal is expressed on the time and frequency domains. The LSTM network is used to generate new music based on the extracted features, where we predict the music according to the hidden layer features and real-time inputs using GiantMIDI-Piano dataset. 
The results of our experiments show that our proposed system is capable of generating high-quality music that is comparable to human-generated music.","published_date":"2026-04-20T05:22:56+00:00","viability_score":6,"cluster_label":"Generative Audio","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An AI music generator using fractional Fourier transform for feature extraction and LSTM networks for generating high-quality music comparable to human compositions.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.17821v1","title":"WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent","abstract":"Recent advancements in large language models (LLMs) have empowered autonomous web agents to execute natural language instructions directly on real-world webpages. However, existing agents often struggle with complex tasks involving dynamic interactions and long-horizon execution due to rigid planning strategies and hallucination-prone reasoning. To address these limitations, we propose WebUncertainty, a novel autonomous agent framework designed to tackle dual-level uncertainty in planning and reasoning. Specifically, we design a Task Uncertainty-Driven Adaptive Planning Mechanism that adaptively selects planning modes to navigate unknown environments. Furthermore, we introduce an Action Uncertainty-Driven Monte Carlo tree search (MCTS) Reasoning Mechanism. This mechanism incorporates the Confidence-induced Action Uncertainty (ConActU) strategy to quantify both aleatoric uncertainty (AU) and epistemic uncertainty (EU), thereby optimizing the search process and guiding robust decision-making. 
Experimental results on the WebArena and WebVoyager benchmarks demonstrate that WebUncertainty achieves superior performance compared to state-of-the-art baselines.","published_date":"2026-04-20T05:19:49+00:00","viability_score":8,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An autonomous web agent framework that tackles dual-level uncertainty in planning and reasoning using adaptive planning and Monte Carlo Tree Search for robust decision-making.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17819v1","title":"PDDL-Mind: Large Language Models are Capable on Belief Reasoning with Reliable State Tracking","abstract":"Large language models (LLMs) perform substantially below human level on existing theory-of-mind (ToM) benchmarks, even when augmented with chain-of-thought prompting or probabilistic belief updates. We argue that these failures primarily arise from unreliable implicit state tracking rather than limitations in high-level reasoning. We introduce PDDL-Mind, a neuro-symbolic framework that decouples environment state evolution from belief inference. By translating narrative descriptions into explicit states and actions expressed in Planning Domain Definition Language (PDDL), and by verifying action-induced state transitions against a predefined domain, PDDL-Mind provides LLMs with a logically consistent and explicit representation of world states for ToM tasks. 
Experiments on MMToM-QA, MuMA and FanToM show that PDDL-Mind achieves over 5% absolute accuracy gain over the best existing state-of-the-art method on ToM benchmark questions.","published_date":"2026-04-20T05:17:57+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A neuro-symbolic framework that enhances LLM belief reasoning for theory-of-mind tasks by explicitly tracking environment states using PDDL.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17817v1","title":"Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots","abstract":"With the rapid advancement of large language models (LLMs), mobile agents have emerged as promising tools for phone automation, simulating human interactions on screens to accomplish complex tasks. However, these agents often suffer from low accuracy, misinterpretation of user instructions, and failure on challenging tasks, with limited prior work examining why and where they fail. To address this, we introduce DailyDroid, a benchmark of 75 tasks in five scenarios across 25 Android apps, spanning three difficulty levels to mimic everyday smartphone use. We evaluate it using text-only and multimodal (text + screenshot) inputs on GPT-4o and o4-mini across 300 trials, revealing comparable performance with multimodal inputs yielding marginally higher success rates. Through in-depth failure analysis, we compile a handbook of common failures. 
Our findings reveal critical issues in UI accessibility, input modalities, and LLM/app design, offering implications for future mobile agents, applications, and UI development.","published_date":"2026-04-20T05:15:14+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and study of LLM-driven smartphone automation failures, revealing insights into multimodal vs. text-only inputs and common error patterns.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17814v1","title":"Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective","abstract":"Code secrets are sensitive assets for software developers, and their leakage poses significant cybersecurity risks. Despite the rapid development of AI code assistants powered by Code Large Language Models (CLLMs), CLLMs have been shown to inadvertently leak such secrets due to a notorious memorization phenomenon. This study first reveals that Byte-Pair Encoding (BPE) tokenization leads to unexpected behavior of secret memorization, which we term \\textit{gibberish bias}. Specifically, we identified that some secrets are among the easiest for CLLMs to memorize. These secrets yield high character-level entropy, but low token-level entropy. We then support this claim of bias with numerical data. We identified that the root of the bias is the token distribution shift between the CLLM training data and the secret data. We further discuss how gibberish bias manifests under the ``larger vocabulary'' trend. 
To conclude the paper, we discuss potential mitigation strategies and the broader implications for current tokenizer design.","published_date":"2026-04-20T05:12:14+00:00","viability_score":3,"cluster_label":"LLM Security","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigates secret leakage risks in code LLMs, identifying a 'gibberish bias' in BPE tokenization as a root cause for memorization.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17806v1","title":"Party Autonomy in Determining the Law Applicable to Non-contractual Obligations concerning Cross-Border Data Transfers","abstract":"(1) Cross-border data transfers have become a matter of daily occurrence against the backdrop of the development of cloud computing and artificial intelligence. Consequently, where a data leak gives rise to civil liability, the determination of that liability inevitably assumes an international dimension involving foreign elements. (2) As is starkly demonstrated by secret sharing technology in cloud computing, fragments of data may be presumed to be distributed across multiple jurisdictions on a global scale. This renders traditional private international law measures -- predicated on the identification of a physical location -- inadequate for the purposes of determining the applicable law, a difficulty that is particularly acute in relation to non-contractual obligations. (3) Bearing in mind the typical scenario encountered in practice -- in which a Data Subject brings a claim for damages against a SaaS (Software as a Service) provider, which in turn seeks recourse against an IaaS (Infrastructure as a Service) or PaaS (Platform as a Service) provider -- a characteristic feature of such cases is the concurrence of contractual and non-contractual obligations. 
Taking this feature into account, it is possible to determine the applicable law governing non-contractual obligations through party autonomy -- by aligning it with the law governing the contractual obligation as selected by the parties, an approach that may be termed private ordering. This serves to overcome the difficulties associated with the identification of a physical location and, at the same time, contributes to ensuring the foreseeability of the parties.","published_date":"2026-04-20T04:53:51+00:00","viability_score":3,"cluster_label":"Legal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Proposes party autonomy as a solution for determining applicable law in cross-border data transfer disputes, aligning non-contractual obligations with contractual choices.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17805v1","title":"Ranking Abuse via Strategic Pairwise Data Perturbations","abstract":"Pairwise ranking systems based on Maximum Likelihood Estimation (MLE), such as the Bradley-Terry model, are widely used to aggregate preferences from pairwise comparisons. However, their robustness under strategic data manipulation remains insufficiently understood. In this paper, we study the vulnerability of MLE-based ranking systems to adversarial perturbations. We formulate the manipulation task as a constrained combinatorial optimization problem and propose an Adaptive Subset Selection Attack (ASSA) to efficiently identify high-impact perturbations. Experimental results on both synthetic data and real-world election datasets show that MLE-based rankings exhibit a sharp phase-transition behavior: beyond a small perturbation budget, a limited number of strategic voters can significantly alter the global ranking. In particular, our method consistently outperforms random and greedy baselines under constrained budgets. 
These findings reveal a fundamental sensitivity of MLE-based ranking mechanisms to structured perturbations and highlight the need for more robust aggregation methods in collective decision-making systems.","published_date":"2026-04-20T04:52:30+00:00","viability_score":3,"cluster_label":"AI Safety & Robustness","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper explores the vulnerability of ranking systems to strategic data manipulation, proposing an attack method to identify high-impact perturbations and highlighting the need for more robust aggregation methods.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.17803v1","title":"Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition","abstract":"Post-training Large Language Models requires diverse, high-quality data which is rare and costly to obtain, especially in low resource domains and for multi-turn conversations. Common solutions are crowdsourcing or synthetic generation, but both often yield low-quality or low-diversity data. We introduce Adversarial Arena for building high quality conversational datasets by framing data generation as an adversarial task: attackers create prompts, and defenders generate responses. This interactive competition between multiple teams naturally produces diverse and complex data. We validated this approach by conducting a competition with 10 academic teams from top US and European universities, each building attacker or defender bots. The competition, focused on safety alignment of LLMs in cybersecurity, generated 19,683 multi-turn conversations. 
Fine-tuning an open-source model on this dataset produced an 18.47% improvement in secure code generation on CyberSecEval-Instruct and 29.42% improvement on CyberSecEval-MITRE.","published_date":"2026-04-20T04:51:39+00:00","viability_score":8,"cluster_label":"Data Generation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Adversarial Arena crowdsources high-quality conversational datasets for LLM training through interactive competition, demonstrating significant improvements in secure code generation.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17794v1","title":"Bridging the Reasoning Gap in Vietnamese with Small Language Models via Test-Time Scaling","abstract":"The democratization of ubiquitous AI hinges on deploying sophisticated reasoning capabilities on resource-constrained devices. However, Small Language Models (SLMs) often face a \"reasoning gap\", particularly in non-English languages like Vietnamese, where they struggle to maintain coherent chains of thought. This paper investigates Test-Time Scaling strategies for the Qwen3-1.7B architecture within the context of Vietnamese Elementary Mathematics. We introduce Vi-S1K, a high-fidelity reasoning dataset localized via a Gemini 2.5 Flash-Lite powered pipeline, and Vi-Elementary-Bench, a dual-resource benchmark for rigorous evaluation. Using an LLM-as-a-Judge protocol, we reveal that the base model possesses robust latent knowledge (Accuracy: 4.05/5.00) but suffers from a severe \"formatting gap\" in communication. Supervised Fine-Tuning (SFT) acts as a critical \"reasoning unlocker\", yielding a 77% improvement in Explanation Quality and bridging the gap between raw calculation and pedagogical coherence. 
Furthermore, our analysis of prompting strategies uncovers a significant trade-off: structured frameworks like ReAct impose a \"cognitive tax\" on the 1.7B parameter capacity, degrading performance relative to pure Chain-of-Thought (CoT) combined with Self-Consistency. These findings establish a deployment hierarchy for SLMs, demonstrating that SFT combined with simplified test-time scaling is superior to complex agentic workflows for edge-based reasoning.","published_date":"2026-04-20T04:36:03+00:00","viability_score":7,"cluster_label":"Small Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper demonstrates that Test-Time Scaling and Supervised Fine-Tuning can bridge the reasoning gap in Vietnamese Small Language Models, outperforming complex agentic workflows for edge deployment.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.17789v1","title":"DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization","abstract":"The MXFP4 microscaling format, which partitions tensors into blocks of 32 elements sharing an E8M0 scaling factor, has emerged as a promising substrate for efficient LLM inference, backed by native hardware support on NVIDIA Blackwell Tensor Cores. However, activation outliers pose a unique challenge under this format: a single outlier inflates the shared block scale, compressing the effective dynamic range of the remaining elements and causing significant quantization error. Existing rotation-based remedies, including randomized Hadamard and learnable rotations, are data-agnostic and therefore unable to specifically target the channels where outliers concentrate. We propose DuQuant++, which adapts the outlier-aware fine-grained rotation of DuQuant to the MXFP4 format by aligning the rotation block size with the microscaling group size (B{=}32). 
Because each MXFP4 group possesses an independent scaling factor, the cross-block variance issue that necessitates dual rotations and a zigzag permutation in the original DuQuant becomes irrelevant, enabling DuQuant++ to replace the entire pipeline with a single outlier-aware rotation, which halves the online rotation cost while simultaneously smoothing the weight distribution. Extensive experiments on the LLaMA-3 family under MXFP4 W4A4 quantization show that DuQuant++ consistently achieves state-of-the-art performance. Our code is available at https://github.com/Hsu1023/DuQuant++.","published_date":"2026-04-20T04:27:28+00:00","viability_score":8,"cluster_label":"LLM Quantization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DuQuant++ enhances microscaling FP4 quantization for LLM inference by adapting outlier-aware rotation to the MXFP4 format, achieving state-of-the-art performance with reduced computational cost.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17787v1","title":"AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models","abstract":"Precision-critical manipulation requires both global trajectory organization and local execution correction, yet most vision-language-action (VLA) policies generate actions within a single unified space. This monolithic formulation forces macro-level transport and micro-level refinement to be optimized under the same objective, causing large motions to dominate learning while suppressing small but failure-critical corrective signals. In contrast, human manipulation is structured by global movement planning together with continuous local adjustment during execution. Motivated by this principle, we propose AnchorRefine, a hierarchical framework that factorizes VLA action modeling into trajectory anchor and residual refinement. 
The anchor planner predicts a coarse motion scaffold, while the refinement module corrects execution-level deviations to improve geometric and contact precision. We further introduce a decision-aware gripper refinement mechanism to better capture the discrete and boundary-sensitive nature of gripper control. Experiments on LIBERO, CALVIN, and real-robot tasks demonstrate that AnchorRefine consistently improves both regression-based and diffusion-based VLA backbones, yielding gains of up to 7.8% in simulation success rate and 18% in real-world success rate.","published_date":"2026-04-20T04:25:24+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hierarchical framework for vision-language-action models that separates global trajectory planning from local execution refinement to improve robotic manipulation precision.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17785v1","title":"Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens","abstract":"Unlearning in large language models (LLMs) has emerged as a promising safeguard against adversarial behaviors. When the forgetting loss is applied uniformly without considering token-level semantic importance, model utility can be unnecessarily degraded. Recent studies have explored token-wise loss regularizers that prioritize informative tokens, but largely rely on ground-truth confidence or external linguistic parsers, which limits their ability to capture contextual information or the model's overall predictive state. Intuitively, function words like \"the\" primarily serve syntactic roles and are highly predictable with little ambiguity, but informative words admit multiple plausible alternatives with greater uncertainty. 
Based on this intuition, we propose Entropy-guided Token Weighting (ETW), a token-level unlearning regularizer that uses entropy of the predictive distribution as a proxy for token informativeness. We demonstrate that informative tokens tend to have higher entropy, whereas structural tokens tend to have lower entropy. This behavior enables ETW to achieve more effective unlearning while better preserving model utility than existing token-level approaches.","published_date":"2026-04-20T04:20:29+00:00","viability_score":7,"cluster_label":"LLM Unlearning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel unlearning method for LLMs that selectively removes informative tokens based on predictive entropy, preserving model utility while mitigating adversarial behaviors.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17774v1","title":"Prompt Optimization Enables Stable Algorithmic Collusion in LLM Agents","abstract":"LLM agents in markets present algorithmic collusion risks. While prior work shows LLM agents reach supracompetitive prices through tacit coordination, existing research focuses on hand-crafted prompts. The emerging paradigm of prompt optimization necessitates new methodologies for understanding autonomous agent behavior. We investigate whether prompt optimization leads to emergent collusive behaviors in market simulations. We propose a meta-learning loop where LLM agents participate in duopoly markets and an LLM meta-optimizer iteratively refines shared strategic guidance. Our experiments reveal that meta-prompt optimization enables agents to discover stable tacit collusion strategies with substantially improved coordination quality compared to baseline agents. These behaviors generalize to held-out test markets, indicating discovery of general coordination principles. Analysis of evolved prompts reveals systematic coordination mechanisms through stable shared strategies. 
Our findings call for further investigation into AI safety implications in autonomous multi-agent systems.","published_date":"2026-04-20T03:53:08+00:00","viability_score":3,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Prompt optimization can lead to emergent and stable algorithmic collusion in LLM agents participating in market simulations.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17771v1","title":"SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks","abstract":"Large language models (LLMs) have achieved strong performance on natural language to SQL (NL2SQL) benchmarks, yet their reported accuracy may be inflated by contamination from benchmark queries or structurally similar patterns seen during training. We introduce SPENCE (Syntactic Probing and Evaluation of NL2SQL Contamination Effects), a controlled syntactic probing framework for detecting and quantifying such contamination. SPENCE systematically generates syntactic variants of test queries for four widely used NL2SQL datasets: Spider, SParC, CoSQL, and the newer BIRD benchmark. We use SPENCE to evaluate multiple high-capacity LLMs under execution-based scoring. For each model, we measure changes in execution accuracy across increasing levels of syntactic divergence and quantify rank sensitivity using Kendall's tau with bootstrap confidence intervals. By aligning these robustness trends with benchmark release dates, we observe a clear temporal gradient: older benchmarks such as Spider exhibit the strongest negative values and thus the highest likelihood of training leakage, whereas the more recent BIRD dataset shows minimal sensitivity and appears largely uncontaminated. 
Together, these findings highlight the importance of temporally contextualized, syntactic-probing evaluation for trustworthy NL2SQL benchmarking.","published_date":"2026-04-20T03:50:21+00:00","viability_score":7,"cluster_label":"NL2SQL","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A syntactic probing framework to detect and quantify benchmark contamination in NL2SQL datasets, revealing leakage in older benchmarks and validating newer ones.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17769v1","title":"Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF","abstract":"Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique--revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping substantially improves semantic coherence (15%) without sacrificing adversarial strength. 
Overall, R-CAI provides a fully automated framework for red teaming data generation and systematic safety evaluation of aligned language models.","published_date":"2026-04-20T03:49:25+00:00","viability_score":7,"cluster_label":"LLM Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for generating controllable toxic data to improve LLM safety and red teaming.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17768v1","title":"When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias","abstract":"The reliability of VLM-as-a-Judge is critical for the automatic evaluation of vision-language models (VLMs). Despite recent progress, our analysis reveals that VLM-as-a-Judge often pays limited attention to the image when making decisions. Instead, they often blindly favor the more informative answer, even when they can recognize it conflicts with the image content. We call this problem informativeness bias, which significantly undermines judge reliability. To address it, we propose BIRCH (Balanced Informativeness and CoRrectness with a Truthful AnCHor), a judging paradigm that first corrects inconsistencies with the image content in candidate answers, and then compares the answers against this corrected version. This shifts the judge's focus from informativeness to image-grounded correctness. Experiments on multiple models and benchmarks show that BIRCH reduces informativeness bias by up to 17%, resulting in performance gains of up to 9.8%. 
Our work reveals an overlooked but fundamental flaw in current VLM-as-a-Judge systems and highlights the need for more principled designs.","published_date":"2026-04-20T03:46:22+00:00","viability_score":8,"cluster_label":"Vision-Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A paradigm to improve VLM-as-a-Judge reliability by balancing informativeness and image-grounded correctness.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17761v1","title":"Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks","abstract":"Interpretability tools are increasingly used to analyze failures of Large Language Models (LLMs), yet prior work largely focuses on short prompts or toy settings, leaving their behavior on commonly used benchmarks underexplored. To address this gap, we study contrastive, LRP-based attribution as a practical tool for analyzing LLM failures in realistic settings. We formulate failure analysis as \\textit{contrastive attribution}, attributing the logit difference between an incorrect output token and a correct alternative to input tokens and internal model states, and introduce an efficient extension that enables construction of cross-layer attribution graphs for long-context inputs. Using this framework, we conduct a systematic empirical study across benchmarks, comparing attribution patterns across datasets, model sizes, and training checkpoints. Our results show that this token-level contrastive attribution can yield informative signals in some failure cases, but is not universally applicable, highlighting both its utility and its limitations for realistic LLM failure analysis. 
Our code is available at: https://aka.ms/Debug-XAI.","published_date":"2026-04-20T03:24:11+00:00","viability_score":7,"cluster_label":"LLM Interpretability","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for analyzing LLM failures on realistic benchmarks using contrastive attribution.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17755v1","title":"Community-Led AI Integration for Wildfire Risk Assessment: A Participatory AI Literacy and Explainability Integration (PALEI) Framework in Los Angeles, CA","abstract":"Climate-driven wildfires are intensifying, particularly in urban regions such as Southern California. Yet, traditional fire risk communication tools often fail to gain public trust due to inaccessible design, non-transparent outputs, and limited contextual relevance. These challenges are especially critical in high-risk communities, where trust depends on how clearly and locally information is presented. Neighborhoods such as Pacific Palisades, Pasadena, and Altadena in Los Angeles exemplify these conditions. This study introduces a community-led approach for integrating AI into wildfire risk assessment using the Participatory AI Literacy and Explainability Integration (PALEI) framework. PALEI emphasizes early literacy building, value alignment, and participatory evaluation before deploying predictive models, prioritizing clarity, accessibility, and mutual learning between developers and residents. Early engagement findings show strong acceptance of visual, context-specific risk communication, positive fairness perceptions, and clear adoption interest, alongside privacy and data security concerns that influence trust. Participants emphasized localized imagery, accessible explanations, neighborhood-specific mitigation guidance, and transparent communication of uncertainty. 
The outcome is a mobile application co-designed with users and stakeholders, enabling residents to scan visible property features and receive interpretable fire risk scores with tailored recommendations. By embedding local context into design, the tool becomes an everyday resource for risk awareness and preparedness. This study argues that user experience is central to ethical and effective AI deployment and provides a replicable, literacy-first pathway for applying the PALEI framework to climate-related hazards.","published_date":"2026-04-20T03:16:31+00:00","viability_score":4,"cluster_label":"AI for Climate","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A community-led framework for integrating AI into wildfire risk assessment with a focus on literacy and explainability.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17753v1","title":"Evolutionary Negative Module Pruning for Better LoRA Merging","abstract":"Merging multiple Low-Rank Adaptation (LoRA) experts into a single backbone is a promising approach for efficient multi-task deployment. While existing methods strive to alleviate interference via weight interpolation or subspace alignment, they rest upon the implicit assumption that all LoRA matrices contribute constructively to the merged model. In this paper, we uncover a critical bottleneck in current merging paradigms: the existence of $\\textit{negative modules}$ -- specific LoRA layers that inherently degrade global performance upon merging. We propose $\\textbf{E}$volutionary $\\textbf{N}$egative $\\textbf{M}$odule $\\textbf{P}$runing ($\\textbf{ENMP}$), a plug-and-play LoRA pruning method to locate and exclude these detrimental modules prior to merging. By leveraging an evolutionary search strategy, ENMP effectively navigates the discrete, non-differentiable landscape of module selection to identify optimal pruning configurations. 
Extensive evaluations demonstrate that ENMP consistently boosts the performance of existing merging algorithms, achieving a new state-of-the-art across both language and vision domains. Code is available at https://github.com/CaoAnda/ENMP-LoRAMerging.","published_date":"2026-04-20T03:13:18+00:00","viability_score":7,"cluster_label":"LLM Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A plug-and-play method to prune detrimental LoRA modules before merging, improving performance across language and vision tasks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17730v1","title":"MHSafeEval: Role-Aware Interaction-Level Evaluation of Mental Health Safety in Large Language Models","abstract":"Large language models (LLMs) are increasingly explored as scalable tools for mental health counseling, yet evaluating their safety remains challenging due to the interactional and context-dependent nature of clinical harm. Existing evaluation frameworks predominantly assess isolated responses using coarse-grained taxonomies or static datasets, limiting their ability to diagnose how harms emerge and accumulate over multi-turn counseling interactions. In this work, we introduce R-MHSafe, a role-aware mental health safety taxonomy that characterizes clinically significant harm in terms of the interactional roles an AI counselor adopts, including perpetrator, instigator, facilitator, or enabler, combined with clinically grounded harm categories. Then, we propose MHSafeEval, a closed-loop, agent-based evaluation framework that formulates safety assessment as trajectory-level discovery of harm through adversarial multi-turn interactions, guided by role-aware modeling. Using R-MHSafe and MHSafeEval, we conduct a large-scale evaluation across state-of-the-art LLMs. 
Our results reveal substantial role-dependent and cumulative safety failures that are systematically missed by existing static benchmarks, and show that our framework significantly improves failure-mode coverage and diagnostic granularity.","published_date":"2026-04-20T02:37:45+00:00","viability_score":7,"cluster_label":"LLM Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An agent-based framework for evaluating mental health safety in LLMs by simulating multi-turn counseling interactions and identifying role-dependent harms.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17727v1","title":"Voronoi-guided Bilateral 2D Gaussian Splatting for Arbitrary-Scale Hyperspectral Image Super-Resolution","abstract":"Most existing hyperspectral image super-resolution methods require modifications for different scales, limiting their flexibility in arbitrary-scale reconstruction. 2D Gaussian splatting provides a continuous representation that is compatible with arbitrary-scale super-resolution. Existing methods often rely on rasterization strategies, which may limit flexible spatial modeling. Extending them to hyperspectral image super-resolution remains challenging, as the task requires adaptive spatial reconstruction while preserving spectral fidelity. This paper proposes GaussianHSI, a Gaussian-Splatting-based framework for arbitrary-scale hyperspectral image super-resolution. We develop a Voronoi-Guided Bilateral 2D Gaussian Splatting for spatial reconstruction. After predicting a set of Gaussian functions to represent the input, it associates each target pixel with relevant Gaussian functions through Voronoi-guided selection. The target pixel is then reconstructed by aggregating the selected Gaussian functions with reference-aware bilateral weighting, which considers both geometric relevance and consistency with low-resolution features. 
We further introduce a Spectral Detail Enhancement module to improve spectral reconstruction. Extensive experiments on benchmark datasets demonstrate the effectiveness of GaussianHSI over state-of-the-art methods for arbitrary-scale hyperspectral image super-resolution.","published_date":"2026-04-20T02:21:52+00:00","viability_score":7,"cluster_label":"Computer Vision","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A Gaussian-Splatting based framework for arbitrary-scale hyperspectral image super-resolution that adaptively reconstructs spatial details while preserving spectral fidelity.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17725v1","title":"RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models","abstract":"Large Language Models (LLMs) have shown strong promise for mining Electronic Health Records (EHRs) by reasoning over longitudinal clinical information to capture context-rich patient trajectories. However, leveraging LLMs for structured EHRs (e.g., standardized diagnosis and medication codes) presents two key challenges. First, translating time-stamped EHR sequences into plain text can obscure both temporal structure and code identities, weakening the ability to capture code co-occurrence and longitudinal regularities. Second, unlike cohort-trained predictive models that learn a shared, task-aligned representation space across patients, LLMs are often applied in a case-isolated inference setting where each patient is processed independently without leveraging population-level patterns. To address these challenges, we introduce RePrompT, a time-aware LLM framework that integrates structured EHR encoders through prompt tuning, without modifying underlying architectures. 
Specifically, RePrompT recurrently incorporates latent states from prior visits to preserve longitudinal information, and injects population-level information through trainable prompt tokens derived from a cohort-trained, task-aligned EHR encoder. Experiments on MIMIC-III and MIMIC-IV demonstrate that RePrompT consistently outperforms both EHR-based and LLM-based baselines across multiple clinical prediction tasks.","published_date":"2026-04-20T02:20:13+00:00","viability_score":7,"cluster_label":"LLM Applications","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A time-aware LLM framework that integrates structured EHR encoders via prompt tuning to capture longitudinal patient information and population-level patterns for clinical prediction.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17721v1","title":"GeGS-PCR: Effective and Robust 3D Point Cloud Registration with Two-Stage Color-Enhanced Geometric-3DGS Fusion","abstract":"We address the challenge of point cloud registration using color information, where traditional methods relying solely on geometric features often struggle in low-overlap and incomplete scenarios. To overcome these limitations, we propose GeGS-PCR, a novel two-stage method that combines geometric, color, and Gaussian information for robust registration. Our approach incorporates a dedicated color encoder that enhances color features by extracting multi-level geometric and color data from the original point cloud. We introduce the \\textbf{Ge}ometric-3D\\textbf{GS} module, which encodes the local neighborhood information of colored superpoints to ensure a globally invariant geometric-color context. Leveraging LORA optimization, we maintain high performance while preserving the expressiveness of 3DGS. Additionally, fast differentiable rendering is utilized to refine the registration process, leading to improved convergence. 
To further enhance performance, we propose a joint photometric loss that exploits both geometric and color features. This enables strong performance in challenging conditions with extremely low point cloud overlap. We validate our method by colorizing the Kitti dataset as ColorKitti and testing on both Color3DMatch and Color3DLoMatch datasets. Our method achieves state-of-the-art performance with \\textit{Registration Recall} at 99.9\\%, \\textit{Relative Rotation Error} as low as 0.013, and \\textit{Relative Translation Error} as low as 0.024, improving precision by at least a factor of 2.","published_date":"2026-04-20T02:14:41+00:00","viability_score":7,"cluster_label":"3D Computer Vision","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A two-stage point cloud registration method that fuses geometric and color information for robust performance in challenging low-overlap scenarios.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17716v1","title":"Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction","abstract":"The validity screen (Cacioli, 2026d, 2026e) classifies LLM confidence signals as Valid, Indeterminate, or Invalid. We test whether these classifications predict selective prediction performance. Twenty frontier LLMs from seven families were evaluated on 524 items across six cognitive tracks. Valid models show mean Type 2 AUROC = .624 (SD = .048). Invalid models show mean AUROC = .357 (SD = .231). Cohen's d = 2.81, p = .002. The tiers order monotonically: Invalid (.357) < Indeterminate (.554) < Valid (.624). Split-half cross-validation yields median d = 1.77, P(d > 0) = 1.0 across 1,000 splits. The three-tier classification accounts for 47% of the variance in AUROC. DeepSeek-R1 drops from 85.3% accuracy at full coverage to 11.3% at 10% coverage. The screen predicts the criterion. 
For selective prediction, the screen matters.","published_date":"2026-04-20T01:56:29+00:00","viability_score":3,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A validity screen for LLM confidence signals demonstrates its ability to predict selective prediction performance across various models and datasets.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17714v1","title":"Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals","abstract":"LLM confidence signals are used for abstention, routing, and safety-critical decisions. No standard practice exists for checking whether a confidence signal carries item-level information before building on it. We transfer the validity screening principle from clinical personality assessment (PAI, MMPI-3) as a portable protocol for benchmark-based LLM confidence data. The protocol specifies three core indices (L, Fp, RBS), a structural indicator (TRIN), and an item-sensitivity statistic, computed from a single 2x2 contingency table. A three-tier classification system (Invalid, Indeterminate, Valid) draws on four clinical traditions. Validated on 20 frontier LLMs across 524 items, four models are classified Invalid, two Indeterminate. Valid-profile models show mean r = .18 (15/16 significant). Invalid-profile models show mean r = -.20 (d = 2.48). Cross-benchmark validation on 18 models using MMLU with verbalized confidence and on external data from Yang et al. (2024) confirms the screen transfers across benchmarks and probe formats. 
All data and code: https://github.com/synthiumjp/validity-scaling-llm","published_date":"2026-04-20T01:50:38+00:00","viability_score":5,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A portable protocol for validating LLM confidence signals, adapted from clinical psychology, can be applied across benchmarks and probe formats.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.17708v1","title":"Co-evolving Agent Architectures and Interpretable Reasoning for Automated Optimization","abstract":"Automating operations research (OR) with large language models (LLMs) remains limited by hand-crafted reasoning--execution workflows. Complex OR tasks require adaptive coordination among problem interpretation, mathematical formulation, solver selection, code generation, and iterative debugging. To address this limitation, we propose EvoOR-Agent, a co-evolutionary framework for automated optimization. The framework represents agent workflows as activity-on-edge (AOE)-style networks, making workflow topology, execution dependencies, and alternative reasoning paths explicit. On this representation, the framework maintains an architecture graph and evolves a population of reasoning individuals through graph-mediated path-conditioned recombination, multi-granularity semantic mutation, and elitist population update. A knowledge-base-assisted experience-acquisition module further injects reusable OR practices into initialization and semantic variation. Empirical results on heterogeneous OR benchmarks show that the proposed framework consistently improves over zero-shot LLMs, fixed-pipeline OR agents, and representative evolutionary agent frameworks. Case studies and ablation analyses further indicate that explicit architecture evolution and graph-supported reasoning-trajectory search contribute to both performance improvement and structural interpretability. 
These results suggest that treating agent architectures and reasoning trajectories as evolvable objects provides an effective route toward adaptive and interpretable automated optimization.","published_date":"2026-04-20T01:44:18+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A co-evolutionary framework that evolves agent architectures and reasoning paths for automated optimization tasks, outperforming existing methods.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17707v1","title":"Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report","abstract":"Clinical personality assessment screens response validity before interpreting substantive scales. LLM evaluation does not. We apply the validity scaling framework from the PAI and MMPI-3 to metacognitive probe data from 20 frontier models across 524 items. Six validity indices are operationalised: L (maintaining confidence on errors), K (betting on errors), F (withdrawing consensus-endorsed items), Fp (withdrawing correct answers), RBS (inverted monitoring), and TRIN (fixed responding). A tiered classification system identifies four models as construct-level invalid and two as elevated. Valid-profile models produce item-sensitive confidence (mean r = .18, 14 of 16 significant). Invalid-profile models do not (mean r = -.20, d = 2.17, p = .001). Chain-of-thought training produces two opposite response distortions. Two latent dimensions account for 94.6% of index variance. Companion papers extract a portable screening protocol (Cacioli, 2026e) and validate it against selective prediction (Cacioli, 2026f). 
All data and code: https://github.com/synthiumjp/validity-scaling-llm","published_date":"2026-04-20T01:42:54+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research introduces a validity scaling framework for LLM metacognitive self-report, identifying construct-level invalid models and providing a portable screening protocol.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17701v1","title":"WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference","abstract":"While distributed device-edge speculative decoding enhances resource utilization across heterogeneous nodes, its performance is often bottlenecked by conventional token-level verification strategies. Such rigid alignment leads to excessive rejections, significantly diminishing the accepted sequence length and increasing interaction rounds under fluctuating wireless conditions. In this paper, we propose WISV (Wireless-Informed Semantic Verification), a novel distributed speculative decoding framework that goes beyond strict token-level matching via a channel-aware semantic acceptance policy. WISV integrates a lightweight decision head into the edge-side target LLM to dynamically evaluate speculative tokens by synthesizing high-dimensional hidden representations with instantaneous channel state information (CSI). To optimize the trade-off between verification fidelity and communication overhead, we further design two tailored communication protocols: full-hidden upload and mismatch-first selective-hidden upload. 
Extensive simulations using a 1B drafter and an 8B target model demonstrate that WISV achieves up to a 60.8% increase in accepted length, a 37.3% reduction in interaction rounds, and a 31.4% improvement in end-to-end latency compared to vanilla speculative decoding across tested settings, while maintaining a negligible task accuracy drop (<1%). Finally, we validate WISV on a hardware testbed comprising an NVIDIA Jetson AGX Orin and an A40-equipped server, confirming its real-world efficacy in accelerating edge-deployed LLM inference.","published_date":"2026-04-20T01:29:56+00:00","viability_score":8,"cluster_label":"Edge AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"WISV is a wireless-informed semantic verification framework for distributed speculative decoding in edge LLM inference, significantly improving accepted length and reducing interaction rounds.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17696v1","title":"Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play","abstract":"Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics. We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. STRATAGEM selectively reinforces trajectories exhibiting abstract, domain-agnostic reasoning through a Reasoning Transferability Coefficient, while incentivizing adaptive reasoning development via a Reasoning Evolution Reward. 
Experiments across mathematical reasoning, general reasoning, and code generation benchmarks demonstrate substantial improvements, with particularly strong gains on competition-level mathematics where multi-step reasoning is critical. Ablation studies and human evaluation confirm that both components contribute to transferable reasoning.","published_date":"2026-04-20T01:20:31+00:00","viability_score":7,"cluster_label":"Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Stratagem learns transferable reasoning in language models via trajectory-modulated game self-play, improving performance on mathematical, general reasoning, and code generation benchmarks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17693v1","title":"CAPO: Counterfactual Credit Assignment in Sequential Cooperative Teams","abstract":"In cooperative teams where agents act in a fixed order and share a single team reward, it is hard to know how much each agent contributed, and harder still when agents are updated one at a time because data collected earlier no longer reflects the new policies. We introduce the Sequential Aristocrat Utility (SeqAU), the unique per-agent learning signal that maximizes the individual learnability of each agent's action, extending the classical framework of Wolpert and Tumer (2002) to this sequential setting. From SeqAU we derive CAPO (Counterfactual Advantage Policy Optimization), a critic-free policy-gradient algorithm. CAPO fits a per-agent reward decomposition from group rewards and computes the per-agent advantage in closed form plus a handful of forward passes through the current policy, requiring no extra environment calls beyond the initial batch. We give analytic bias and variance bounds and validate them on a controlled sequential bandit, where CAPO's advantage over standard baselines grows with the team size. 
The framework is general; multi-LLM pipelines are a natural deployment target.","published_date":"2026-04-20T01:14:59+00:00","viability_score":3,"cluster_label":"Multi-Agent Systems","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper introduces CAPO, a critic-free policy-gradient algorithm for sequential cooperative teams that derives a per-agent learning signal to improve individual learnability.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17691v1","title":"SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models","abstract":"Safety alignment in large language models is remarkably shallow: it is concentrated in the first few output tokens and reversible by fine-tuning on as few as 100 adversarial examples. This fragility becomes critical in real-world deployment, where models undergo sequential adaptation across domains such as medicine, law, and code, causing safety guardrails to erode cumulatively. Yet all existing safety-preserving methods target only single-task fine-tuning, leaving the multi-domain sequential setting entirely unaddressed. We introduce SafeAnchor, a framework that anchors safety in place throughout continual adaptation. SafeAnchor first identifies low-rank safety subspaces in LoRA parameter space via Fisher Information eigendecomposition, then constrains domain-specific gradient updates to the orthogonal complement of these subspaces, and finally monitors for residual safety drift with threshold-triggered corrective replay. 
Evaluated on Llama-2-7B-Chat and Mistral-7B-Instruct across a three-domain pipeline and eight benchmarks, SafeAnchor retains 93.2% of original safety alignment, outperforming all baselines by 18-42 points, while matching unconstrained fine-tuning to within 1.5 points on domain tasks.","published_date":"2026-04-20T01:13:36+00:00","viability_score":7,"cluster_label":"LLM Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework to prevent safety alignment erosion in LLMs during continual domain adaptation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17677v1","title":"Semantic Entanglement in Vector-Based Retrieval: A Formal Framework and Context-Conditioned Disentanglement Pipeline for Agentic RAG Systems","abstract":"Retrieval-Augmented Generation (RAG) systems depend on the geometric properties of vector representations to retrieve contextually appropriate evidence. When source documents interleave multiple topics within contiguous text, standard vectorization produces embedding spaces in which semantically distinct content occupies overlapping neighborhoods. We term this condition semantic entanglement. We formalize entanglement as a model-relative measure of cross-topic overlap in embedding space and define an Entanglement Index (EI) as a quantitative proxy. We argue that higher EI constrains attainable Top-K retrieval precision under cosine similarity retrieval. To address this, we introduce the Semantic Disentanglement Pipeline (SDP), a four-stage preprocessing framework that restructures documents prior to embedding. We further propose context-conditioned preprocessing, in which document structure is shaped by patterns of operational use, and a continuous feedback mechanism that adapts document structure based on agent performance. We evaluate SDP on a real-world enterprise healthcare knowledge base comprising over 2,000 documents across approximately 25 sub-domains. 
Top-K retrieval precision improves from approximately 32% under fixed-token chunking to approximately 82% under SDP, while mean EI decreases from 0.71 to 0.14. We do not claim that entanglement fully explains RAG failure, but that it captures a distinct preprocessing failure mode that downstream optimization cannot reliably correct once encoded into the vector space.","published_date":"2026-04-20T00:24:34+00:00","viability_score":5,"cluster_label":"RAG Systems","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A pipeline to disentangle semantic entanglement in vector embeddings for improved retrieval precision in RAG systems.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2604.17674v1","title":"Towards Intelligent Legal Document Analysis: CNN-Driven Classification of Case Law Texts","abstract":"Legal practitioners and judicial institutions face an ever-growing volume of case-law documents characterised by formalised language, lengthy sentence structures, and highly specialised terminology, making manual triage both time-consuming and error-prone. This work presents a lightweight yet high-accuracy framework for citation-treatment classification that pairs lemmatisation-based preprocessing with subword-aware FastText embeddings and a multi-kernel one-dimensional Convolutional Neural Network (CNN). Evaluated on a publicly available corpus of 25,000 annotated legal documents with a 75/25 training-test partition, the proposed system achieves 97.26% classification accuracy and a macro F1-score of 96.82%, surpassing established baselines including fine-tuned BERT, Long Short-Term Memory (LSTM) with FastText, CNN with random embeddings, and a Term Frequency-Inverse Document Frequency (TF-IDF) k-Nearest Neighbour (KNN) classifier. 
The model also attains the highest Area Under the Receiver Operating Characteristic (AUC-ROC) curve of 97.83% among all compared systems while operating with only 5.1 million parameters and an inference latency of 0.31 ms per document - more than 13 times faster than BERT. Ablation experiments confirm the individual contribution of each pipeline component, and the confusion matrix reveals that residual errors are confined to semantically adjacent citation categories. These findings indicate that carefully designed convolutional architectures represent a scalable, resource-efficient alternative to heavyweight transformers for intelligent legal document analysis.","published_date":"2026-04-20T00:14:11+00:00","viability_score":4,"cluster_label":"Legal AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A lightweight CNN framework for high-accuracy, fast classification of legal case law texts.","time_to_mvp":"1-3 months","tags":[]},{"arxiv_id":"2604.17659v1","title":"Semantic Density Effect (SDE): Maximizing Information Per Token Improves LLM Accuracy","abstract":"We introduce the Semantic Density Effect (SDE): the empirical finding that prompts carrying higher semantic information per token consistently produce more accurate, focused, and less hallucinated outputs across all major LLM families. SDE is defined as the ratio of semantically loaded tokens to total prompt tokens, adjusted for redundancy and concreteness. Unlike prior prompt optimization techniques that add tokens (Chain of Thought), duplicate the prompt (Prompt Repetition), or reorder components (Instruction Placement Effect), SDE improves performance by removing or replacing low-information tokens while preserving or sharpening the semantic signal. Evaluated across five frontier models and seven benchmarks, ultra-dense prompts (SDE > 0.80) outperform diluted counterparts by an average of +8.4 percentage points with 0 additional tokens and 0 latency overhead. 
Combined with Instruction Placement Effect (IPE), the gain reaches +11.7 percentage points.","published_date":"2026-04-19T23:16:33+00:00","viability_score":7,"cluster_label":"LLM Prompting","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel prompting technique that maximizes information per token to improve LLM accuracy and reduce hallucinations without additional tokens or latency.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17656v1","title":"Video-Robin: Autoregressive Diffusion Planning for Intent-Grounded Video-to-Music Generation","abstract":"Video-to-music (V2M) is the fundamental task of creating background music for an input video. Recent V2M models achieve audiovisual alignment by typically relying on visual conditioning alone and provide limited semantic and stylistic controllability to the end user. In this paper, we present Video-Robin, a novel text-conditioned video-to-music generation model that enables fast, high-quality, semantically aligned music generation for video content. To balance musical fidelity and semantic understanding, Video-Robin integrates autoregressive planning with diffusion-based synthesis. Specifically, an autoregressive module models global structure by semantically aligning visual and textual inputs to produce high-level music latents. These latents are subsequently refined into coherent, high-fidelity music using local Diffusion Transformers. By factoring semantically driven planning into diffusion-based synthesis, Video-Robin enables fine-grained creator control without sacrificing audio realism. Our proposed model outperforms baselines that solely accept video input and additional feature conditioned baselines on both in-distribution and out-of-distribution benchmarks with a 2.21x speed in inference compared to SOTA. 
We will open-source everything upon paper acceptance.","published_date":"2026-04-19T22:54:56+00:00","viability_score":8,"cluster_label":"Generative Audio","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A text-conditioned video-to-music generation model that balances musical fidelity and semantic understanding with fine-grained creator control and faster inference.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17654v1","title":"Poly-EPO: Training Exploratory Reasoning Models","abstract":"Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. We first develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objective functions, showing how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation. 
Across a range of reasoning benchmarks, we show that Poly-EPO improves generalization, as evidenced by higher pass@$k$ coverage, preserves greater diversity in model generations, and effectively scales with test-time compute.","published_date":"2026-04-19T22:54:19+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for post-training language models that encourages optimistic exploration and synergizes exploration with exploitation for improved generalization and diversity.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17653v1","title":"PV-SQL: Synergizing Database Probing and Rule-based Verification for Text-to-SQL Agents","abstract":"Text-to-SQL systems often struggle with deep contextual understanding, particularly for complex queries with subtle requirements. We present PV-SQL, an agentic framework that addresses these failures through two complementary components: Probe and Verify. The Probe component iteratively generates probing queries to retrieve concrete records from the database, resolving ambiguities in value formats, column semantics, and inter-table relationships to build richer contextual understanding. The Verify component employs a rule-based method to extract verifiable conditions and construct an executable checklist, enabling iterative SQL refinement that effectively reduces missing constraints. 
Experiments on the BIRD benchmarks show that PV-SQL outperforms the best text-to-SQL baseline by 5% in execution accuracy and 20.8% in valid efficiency score while consuming fewer tokens.","published_date":"2026-04-19T22:54:05+00:00","viability_score":8,"cluster_label":"Text-to-SQL","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An agentic framework that synergizes database probing and rule-based verification to improve text-to-SQL accuracy and efficiency for complex queries.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17645v1","title":"On The Mathematics of the Natural Physics of Optimization","abstract":"A number of optimization algorithms have been inspired by the physics of Newtonian motion. Here, we ask the question: do algorithms themselves obey some ``natural laws of motion,'' and can they be derived by an application of these laws? We explore this question by positing the theory that optimization algorithms may be considered as some manifestation of hidden algorithm primitives that obey certain universal non-Newtonian dynamics. This natural physics of optimization is developed by equating the terminal transversality conditions of an optimal control problem to the generalized Karush/John-Kuhn-Tucker conditions of an optimization problem. Through this equivalence formulation, the data functions of a given constrained optimization problem generate a natural vector field that permeates an entire hidden space with information on the optimality conditions. An ``action-at-a-distance'' operation via a Pontryagin-type minimum principle produces a local action to deliver a globalized result by way of a Hamilton-Jacobi inequality. An inverse-optimal algorithm is generated by performing control jumps that dissipate quantized ``energy'' defined by a search Lyapunov function. 
Illustrative applications of the proposed theory show that a large number of algorithms can be generated and explained in terms of the new mathematical physics of optimization.","published_date":"2026-04-19T22:30:43+00:00","viability_score":1,"cluster_label":"Optimization Theory","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper explores the theoretical foundations of optimization algorithms by drawing parallels to natural physics and non-Newtonian dynamics.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17626v1","title":"Toward Reusability of AI Models Using Dynamic Updates of AI Documentation","abstract":"This work addresses the challenge of disseminating reusable artificial intelligence (AI) models accompanied by AI documentation (a.k.a., AI model cards). The work is motivated by the large number of trained AI models that are not reusable due to the lack of (a) AI documentation and (b) the temporal lag between rapidly changing requirements on AI model reusability and those specified in various AI model cards. Our objectives are to shorten the lag time in updating AI model card templates and align AI documentation more closely with current AI best practices. Our approach introduces a methodology for delivering agile, data-driven, and community-based AI model cards. We use the Hugging Face (HF) repository of AI models, populated by a subset of the AI research and development community, and the AI consortium-based Zero Draft (ZD) templates for the AI documentation of AI datasets and AI models, as our test datasets. We also address questions about the value of AI documentation for AI reusability. Our work quantifies the correlations between AI model downloads/likes (i.e., AI model reuse metrics) from the HF repository and their documentation alignment with the ZD documentation templates using tables of contents and word statistics (i.e., AI documentation quality metrics). 
Furthermore, our work develops the infrastructure to regularly compare AI documentation templates against community-standard practices derived from millions of uploaded AI models in the Hugging Face repository. The impact of our work lies in introducing a methodology for delivering agile, data-driven, and community-based standards for documenting AI models and improving AI model reuse.","published_date":"2026-04-19T21:42:37+00:00","viability_score":5,"cluster_label":"AI Documentation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This work introduces a methodology for creating agile, data-driven AI model documentation to improve model reusability.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.17621v1","title":"KnowledgeBerg: Evaluating Systematic Knowledge Coverage and Compositional Reasoning in Large Language Models","abstract":"Many real-world questions appear deceptively simple yet implicitly demand two capabilities: (i) systematic coverage of a bounded knowledge universe and (ii) compositional set-based reasoning over that universe, a phenomenon we term \"the tip of the iceberg.\" We formalize this challenge through two orthogonal dimensions: knowledge width, the cardinality of the required universe, and reasoning depth, the number of compositional set operations. We introduce KnowledgeBerg, a benchmark of 4,800 multiple-choice questions derived from 1,183 enumeration seeds spanning 10 domains and 17 languages, with universes grounded in authoritative sources to ensure reproducibility. Representative open-source LLMs demonstrate severe limitations, achieving only 5.26-36.88 F1 on universe enumeration and 16.00-44.19 accuracy on knowledge-grounded reasoning. Diagnostic analyses reveal three stages of failure: completeness, or missing knowledge; awareness, or failure to identify requirements; and application, or incorrect reasoning execution. 
This pattern persists across languages and model scales. Although test-time compute and retrieval augmentation yield measurable gains -- up to 4.35 and 3.78 points, respectively -- substantial gaps remain, exposing limitations in how current LLMs organize structured knowledge and execute compositional reasoning over bounded domains. The dataset is available at https://huggingface.co/datasets/2npc/KnowledgeBerg","published_date":"2026-04-19T21:18:42+00:00","viability_score":6,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"KnowledgeBerg is a new benchmark to evaluate LLMs on systematic knowledge coverage and compositional reasoning, revealing significant limitations.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.17614v1","title":"Characterizing Model-Native Skills","abstract":"Skills are a natural unit for describing what a language model can do and how its behavior can be changed. However, existing characterizations rely on human-written taxonomies, textual descriptions, or manual profiling pipelines--all external hypotheses about what matters that need not align with the model's internal representations. We argue that when the goal is to intervene on model behavior, skill characterization should be *model-native*: grounded in the model's own representations rather than imposed through external ontologies. We instantiate this view by recovering a compact orthogonal basis from sequence-level activations. The resulting basis is semantically interpretable but need not correspond to any predefined human ontology; instead, it captures axes of behavioral variation that the model itself organizes around. We validate this characterization on reasoning post-training, using the recovered basis for both SFT data selection and inference-time steering. We develop lightweight proxy interventions to identify which directions are most useful for a given model. 
Across Llama3-8B and Qwen2.5-3B, selecting data along those directions improves Pass@1 by up to 20% on MATH and 41% on AMC, outperforming data selection based on human-characterized skills. Because the basis lives in activation space, the same directions also serve as steering vectors at inference time, improving Pass@8 by up to 4.8% on MATH--an intervention that human-characterized skills cannot support. We further validate the characterization on safety alignment, where selecting adversarial training data for model-native skill coverage rather than textual diversity yields more sample-efficient learning. These results suggest that recovering skills from the model's own representations, rather than imposing them externally, provides a more effective foundation for intervening on model behavior. Codes are open-sourced.","published_date":"2026-04-19T20:58:25+00:00","viability_score":7,"cluster_label":"LLM Intervention","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper proposes model-native skills derived from internal representations for more effective LLM intervention and behavior modification.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17612v1","title":"Provable Coordination for LLM Agents via Message Sequence Charts","abstract":"Multi-agent systems built on large language models (LLMs) are difficult to reason about. Coordination errors such as deadlocks or type-mismatched messages are often hard to detect through testing. We introduce a domain-specific language for specifying agent coordination based on message sequence charts (MSCs). The language separates message-passing structure from LLM actions, whose outputs remain unpredictable. We define the syntax and semantics of the language and present a syntax-directed projection that generates deadlock-free local agent programs from global coordination specifications. 
We illustrate the approach with a diagnosis consensus protocol and show how coordination properties can be established independently of LLM nondeterminism. We also describe a runtime planning extension in which an LLM dynamically generates a coordination workflow for which the same structural guarantees apply. An open-source Python implementation of our framework is available as ZipperGen.","published_date":"2026-04-19T20:54:30+00:00","viability_score":4,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A domain-specific language and framework for provable coordination of LLM agents, ensuring deadlock-free communication.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17611v1","title":"STEP-PD: Stage-Aware and Explainable Parkinson's Disease Severity Classification Using Multimodal Clinical Assessments","abstract":"Parkinson's disease (PD) is a progressive disorder in which symptom burden and functional impairment evolve over time, making severity staging essential for clinical monitoring and treatment planning. However, many computational studies emphasize binary PD detection and do not fully use repeated follow-up clinical assessments for stage-aware prediction. This study proposes STEP-PD, a severity-aware machine learning framework to classify PD severity using clinically interpretable boundaries. It leverages all available visits from the Parkinson's Progression Markers Initiative (PPMI) and integrates routinely collected subjective questionnaires and objective clinician-assessed measures. Disease severity is defined using Hoehn and Yahr staging and grouped into three clinically meaningful categories: Healthy, Mild PD (stages 1-2), and Moderate-to-Severe PD (stages 3-5). Three binary classification problems and a three-class severity task were evaluated using stratified cross-validation with imbalance-aware training. 
To enhance interpretability, SHAP was used to provide global explanations and local patient-level waterfall explanations. Across all tasks, XGBoost achieved the strongest and most stable performance, with accuracies of 95.48% (Healthy vs. Mild), 99.44% (Healthy vs. Moderate-to-Severe), and 96.78% (Mild vs. Moderate-to-Severe), and 94.14% accuracy with 0.8775 Macro-F1 for three-class severity classification. Explainability results highlight a shift from early motor features to progression-related axial and balance impairments. These findings show that multimodal clinical assessments within the PPMI cohort can support accurate and interpretable visit-level PD severity stratification.","published_date":"2026-04-19T20:53:59+00:00","viability_score":5,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An interpretable machine learning framework for stage-aware Parkinson's disease severity classification using multimodal clinical data.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2604.17602v1","title":"Polarization and Integration in Global AI Research","abstract":"The AI race amplifies security risks and international tensions. While the US restricts mobility and knowledge flows and challenges regulatory efforts to protect its advantage, China leads initiatives of global governance. Both strategies depend on cross-country relationships in AI innovation; yet, how this system evolves is unclear. Here, we measure the processes of polarization and integration in global AI research over three decades by using large-scale data of scientific publications. Comparing cross-country collaboration and citation links to their random realizations, we find that the US and China have long diverged in both dimensions, forming two poles around which global AI research increasingly revolves. 
While the United Kingdom and Germany have integrated exclusively with the US, many European countries have converged with both poles. Developing and further developed countries, however, only integrate with China, signaling its expanding influence over the international AI research landscape. Our results inform national science policies and efforts toward global AI regulations.","published_date":"2026-04-19T20:21:25+00:00","viability_score":0,"cluster_label":"Research Trends","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Analysis of polarization and integration in global AI research over three decades, highlighting US and China's diverging influence.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17596v1","title":"Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories","abstract":"We release Terminal Wrench, a subset of 331 terminal-agent benchmark environments, copied from the popular open benchmarks that are demonstrably reward-hackable. The data set includes 3,632 hack trajectories and 2,352 legitimate baseline trajectories across three frontier models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4). Each entry preserves the original task definition alongside full attack trajectories that show how the verifier was bypassed. It also includes cases where the task was not solved as intended. The tasks span system administration, machine learning, software engineering, and security challenges; the exploits range from simple output spoofing to stack-frame introspection, standard-library patching, and rootkit-style binary hijacking. Crucially, these exploits are specific to each task, rather than the evaluation harness, making them harder to patch. We also present a monitorability study in which hack trajectories are sanitized or stripped of reasoning traces and then scored by an LLM judge, showing that detection degrades meaningfully when chain-of-thought is removed (AUC drops from 0.97 to 0.92). 
The data set is publicly available at https://github.com/few-sh/terminal-wrench.","published_date":"2026-04-19T20:04:02+00:00","viability_score":7,"cluster_label":"AI Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A dataset of reward-hackable environments and exploit trajectories for testing frontier LLM security against sophisticated attacks.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17587v1","title":"AIRA: AI-Induced Risk Audit: A Structured Inspection Framework for AI-Generated Code","abstract":"Practitioners have reported a directional pattern in AI-assisted code generation: AI-generated code tends to fail quietly, preserving the appearance of functionality while degrading or concealing guarantees. This paper introduces the Reward-Shaped Failure Hypothesis - the proposal that this pattern may reflect an artifact of optimization through human feedback rather than a random distribution of bugs. We define failure truthfulness as the property that a system's observable outputs accurately represent its internal success or failure state. We then present AIRA (AI-Induced Risk Audit), a deterministic 15-check inspection framework designed to detect failure-untruthful patterns in code. We report results from three studies: (1) an anonymized enterprise environment audit, (2) a balanced 600-file public corpus pilot, and (3) a strict matched-control replication comparing 955 AI-attributed files against 955 human-control files. In the final replication, AI-attributed files show 0.435 high-severity findings per file versus 0.242 in human controls (1.80x). The effect is consistent across JavaScript, Python, and TypeScript, with strongest concentration in exception-handling-related patterns. These findings are consistent with a directional skew toward fail-soft behavior in AI-assisted code. 
AIRA is designed for governance, compliance, and safety-critical systems where fail-closed behavior is required.","published_date":"2026-04-19T19:32:52+00:00","viability_score":3,"cluster_label":"AI Code Auditing","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for auditing AI-generated code to detect failure-untruthful patterns, crucial for safety-critical systems.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.17585v1","title":"DGSSM: Diffusion guided state-space models for multimodal salient object detection","abstract":"Salient object detection (SOD) requires modeling both long-range contextual dependencies and fine-grained structural details, which remains challenging for convolutional, transformer-based, and Mamba-based state space models. While recent Mamba-based state space approaches enable efficient global reasoning, they often struggle to recover precise object boundaries. In contrast, diffusion models capture strong structural priors through iterative denoising, but their use in discriminative dense prediction is still limited due to computational cost and integration challenges. In this work, we propose DGSSM, a diffusion-guided state space (Mamba) framework that formulates multimodal salient object detection as a progressive denoising process. The framework integrates diffusion structural priors with multi-scale state space encoding, adaptive saliency prompting, and an iterative Mamba diffusion refinement mechanism to improve boundary accuracy. A boundary-aware refinement head and self-distillation strategy further enhance spatial coherence and feature consistency. Extensive experiments on 13 public benchmarks across RGB, RGB-D, and RGB-T settings demonstrate that DGSSM consistently outperforms state-of-the-art methods across multiple evaluation metrics while maintaining a compact model size. 
These results suggest that diffusion-guided state space modeling is an effective and generalizable paradigm for multimodal dense prediction tasks.","published_date":"2026-04-19T19:19:33+00:00","viability_score":7,"cluster_label":"Multimodal Object Detection","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A diffusion-guided state-space model for multimodal salient object detection that improves boundary accuracy and outperforms existing methods.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.17584v1","title":"DIRCR: Dual-Inference Rule-Contrastive Reasoning for Solving RAVENs","abstract":"Abstract visual reasoning remains challenging as existing methods often prioritize either global context or local row-wise relations, failing to integrate both, and lack intermediate feature constraints, leading to incomplete rule capture and entangled representations. To address these issues, we propose the Dual-Inference Rule-Contrastive Reasoning (DIRCR) model. Its core component, the Dual-Inference Reasoning Module, combines a local path for row-wise analogical reasoning and a global path for holistic inference, integrated via a gated attention mechanism. Additionally, a Rule-Contrastive Learning Module introduces pseudo-labels to construct positive and negative rule samples, applying contrastive learning to enhance feature separability and promote abstract, transferable rule learning. Experimental results on three RAVEN datasets demonstrate that DIRCR significantly enhances reasoning robustness and generalization. 
Codes are available at https://github.com/csZack-Zhang/DIRCR.","published_date":"2026-04-19T19:15:32+00:00","viability_score":7,"cluster_label":"Abstract Visual Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A dual-inference rule-contrastive reasoning model that significantly enhances abstract visual reasoning robustness and generalization on RAVEN datasets.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.17581v1","title":"How Much Data is Enough? The Zeta Law of Discoverability in Biomedical Data, featuring the enigmatic Riemann zeta function","abstract":"How much data is enough to make a scientific discovery? As biomedical datasets scale to millions of samples and AI models grow in capacity, progress increasingly depends on predicting when additional data will substantially improve performance. In practice, model development often relies on empirical scaling curves measured across architectures, modalities, and dataset sizes, with limited theoretical guidance on when performance should improve, saturate, or exhibit cross-over behavior.   We propose a scaling-law framework for cross-modal discoverability based on spectral structure of data covariance operators, task-aligned signal projections, and learned representations. Many performance metrics, including AUC, can be expressed in terms of cumulative signal-to-noise energy accumulated across identifiable spectral modes of an encoder and cross-modal operator. Under mild assumptions, this accumulation follows a zeta-like scaling law governed by power-law decay of covariance spectra and aligned signal energy, leading naturally to the appearance of the Riemann zeta function. Representation learning methods such as sparse models, low-rank embeddings, and multimodal contrastive objectives improve sample efficiency by concentrating useful signal into earlier stable modes, effectively steepening spectral decay and shifting scaling curves.   
The framework predicts cross-over regimes in which simpler models perform best at small sample sizes, while higher-capacity or multimodal encoders outperform them once sufficient data stabilizes additional degrees of freedom. Applications include multimodal disease classification, imaging genetics, functional MRI, and topological data analysis. The resulting zeta law provides a principled way to anticipate when scaling data, improving representations, or adding modalities is most likely to accelerate discovery.","published_date":"2026-04-19T19:08:53+00:00","viability_score":4,"cluster_label":"Biomedical Data Discoverability","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A theoretical framework using the Riemann zeta function to predict when additional data will substantially improve performance in biomedical discovery.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.17573v1","title":"Beyond Static Snapshots: A Grounded Evaluation Framework for Language Models at the Agentic Frontier","abstract":"We argue that current evaluation frameworks for large language models (LLMs) suffer from four systematic failures that make them structurally inadequate for assessing deployed, agentic systems: distributional invalidity (evaluation inputs do not reflect real interaction distributions), temporal invalidity (evaluations are post-hoc rather than training-integrated), scope invalidity (evaluations measure single-turn outputs rather than long-horizon trajectories), and process invalidity (evaluations assess outputs rather than reasoning). These failures compound critically in RLHF, where reward models are evaluated under conditions that do not hold during RL training, making reward hacking a predictable consequence of evaluation design rather than a training pathology. We propose the Grounded Continuous Evaluation (GCE) framework and present ISOPro, a simulation-based fine-tuning and evaluation system. 
ISOPro replaces the learned reward model with a deterministic ground-truth verifier, eliminating reward hacking by construction in verifiable-reward domains, and operates on LoRA adapter weights updatable on CPU, reducing the hardware barrier by an order of magnitude. We validate ISOPro on a resource-constrained scheduling domain with six difficulty tiers, demonstrating capability emergence visible only through continuous evaluation, an implicit curriculum that forms without researcher curation, and a 3x accuracy improvement over zero-shot baselines, all on consumer hardware with 0.216% trainable parameters.","published_date":"2026-04-19T18:28:32+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new framework and system for evaluating and fine-tuning LLM agents that eliminates reward hacking and reduces hardware requirements.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.17570v1","title":"PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation","abstract":"Peripheral Blood Smear (PBS) is a critical microscopic examination in hematopathology that yields whole-slide imaging (WSI). Unlike solid tissue pathology, PBS interpretation focuses on individual cell morphologies rather than tissue architecture, making it distinct in both visual characteristics and diagnostic reasoning. However, current multimodal large language models (MLLMs) for pathology are primarily developed on solid-tissue WSIs and struggle to generalize to PBS. To bridge this gap, we construct PBSInstr, the first vision-language dataset for PBS interpretation, comprising 353 PBS WSIs paired with microscopic impression paragraphs and 29k cell-level image crops annotated with cell type labels and morphological descriptions. 
To facilitate instruction tuning, PBSInstr further includes 27k question-answer (QA) pairs for cell crops and 1,286 QA pairs for PBS slides. Building upon PBSInstr, we develop PBS-VL, a hematopathology-tailored vision-language model for multi-level PBS interpretation at both cell and slide levels. To comprehensively evaluate PBS understanding, we construct PBSBench, a visual question answering (VQA) benchmark featuring four question categories and six PBS interpretation tasks. Experiments show that PBS-VL outperforms existing general-purpose and pathology MLLMs, underscoring the value of PBS-specific data. We release our code, datasets, and model weights to facilitate future research. Our proposed framework lays the foundation for developing practical AI assistants supporting decision-making in hematopathology.","published_date":"2026-04-19T18:24:11+00:00","viability_score":8,"cluster_label":"AI in Healthcare","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PBSBench offers a targeted vision-language framework for improving analysis of hematopathology slides, enhancing diagnostic precision.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.17562v1","title":"SafeAgent: A Runtime Protection Architecture for Agentic Systems","abstract":"Large language model (LLM) agents are vulnerable to prompt-injection attacks that propagate through multi-step workflows, tool interactions, and persistent context, making input-output filtering alone insufficient for reliable protection. This paper presents SafeAgent, a runtime security architecture that treats agent safety as a stateful decision problem over evolving interaction trajectories. The proposed design separates execution governance from semantic risk reasoning through two coordinated components: a runtime controller that mediates actions around the agent loop and a context-aware decision core that operates over persistent session state. 
The core is formalized as a context-aware advanced machine intelligence and instantiated through operators for risk encoding, utility-cost evaluation, consequence modeling, policy arbitration, and state synchronization. Experiments on Agent Security Bench (ASB) and InjecAgent show that SafeAgent consistently improves robustness over baseline and text-level guardrail methods while maintaining competitive benign-task performance. Ablation studies further show that recovery confidence and policy weighting determine distinct safety-utility operating points.","published_date":"2026-04-19T18:02:21+00:00","viability_score":4,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A runtime security architecture for LLM agents that improves robustness against prompt-injection attacks.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2604.16286v1","title":"ASMR-Bench: Auditing for Sabotage in ML Research","abstract":"As AI systems are increasingly used to conduct research autonomously, misaligned systems could introduce subtle flaws that produce misleading results while evading detection. We introduce ASMR-Bench (Auditing for Sabotage in ML Research), a benchmark for evaluating the ability of auditors to detect sabotage in ML research codebases. ASMR-Bench consists of 9 ML research codebases with sabotaged variants that produce qualitatively different experimental results. Each sabotage modifies implementation details, such as hyperparameters, training data, or evaluation code, while preserving the high-level methodology described in the paper. We evaluated frontier LLMs and LLM-assisted human auditors on ASMR-Bench and found that both struggled to reliably detect sabotage: the best performance was an AUROC of 0.77 and a top-1 fix rate of 42%, achieved by Gemini 3.1 Pro. 
We also tested LLMs as red teamers and found that LLM-generated sabotages were weaker than human-generated ones but still sometimes evaded same-capability LLM auditors. We release ASMR-Bench to support research on monitoring and auditing techniques for AI-conducted research.","published_date":"2026-04-17T17:47:32+00:00","viability_score":7,"cluster_label":"AI Auditing","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark for detecting sabotage in AI-generated research codebases, with initial results showing current LLMs struggle to reliably identify flaws.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.16280v1","title":"Using Large Language Models and Knowledge Graphs to Improve the Interpretability of Machine Learning Models in Manufacturing","abstract":"Explaining Machine Learning (ML) results in a transparent and user-friendly manner remains a challenging task of Explainable Artificial Intelligence (XAI). In this paper, we present a method to enhance the interpretability of ML models by using a Knowledge Graph (KG). We store domain-specific data along with ML results and their corresponding explanations, establishing a structured connection between domain knowledge and ML insights. To make these insights accessible to users, we designed a selective retrieval method in which relevant triplets are extracted from the KG and processed by a Large Language Model (LLM) to generate user-friendly explanations of ML results. We evaluated our method in a manufacturing environment using the XAI Question Bank. Beyond standard questions, we introduce more complex, tailored questions that highlight the strengths of our approach. We evaluated 33 questions, analyzing responses using quantitative metrics such as accuracy and consistency, as well as qualitative ones such as clarity and usefulness. 
Our contribution is both theoretical and practical: from a theoretical perspective, we present a novel approach for effectively enabling LLMs to dynamically access a KG in order to improve the explainability of ML results. From a practical perspective, we provide empirical evidence showing that such explanations can be successfully applied in real-world manufacturing environments, supporting better decision-making in manufacturing processes.","published_date":"2026-04-17T17:41:17+00:00","viability_score":5,"cluster_label":"Explainable AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Leveraging knowledge graphs and LLMs to generate user-friendly explanations of machine learning results in manufacturing environments.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2604.16278v1","title":"Learning to Reason with Insight for Informal Theorem Proving","abstract":"Although most of the automated theorem-proving approaches depend on formal proof systems, informal theorem proving can align better with large language models' (LLMs) strength in natural language processing. In this work, we identify a primary bottleneck in informal theorem proving as a lack of insight, namely the difficulty of recognizing the core techniques required to solve complex problems. To address this, we propose a novel framework designed to cultivate this essential reasoning skill and enable LLMs to perform insightful reasoning. We propose $\\mathtt{DeepInsightTheorem}$, a hierarchical dataset that structures informal proofs by explicitly extracting core techniques and proof sketches alongside the final proof. To fully exploit this dataset, we design a Progressive Multi-Stage SFT strategy that mimics the human learning process, guiding the model from basic proof writing to insightful thinking. Our experiments on challenging mathematical benchmarks demonstrate that this insight-aware generation strategy significantly outperforms baselines. 
These results demonstrate that teaching models to identify and apply core techniques can substantially improve their mathematical reasoning.","published_date":"2026-04-17T17:36:21+00:00","viability_score":7,"cluster_label":"AI for Math Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel framework and dataset for teaching LLMs to perform insightful reasoning in informal theorem proving, significantly outperforming baselines.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.16272v1","title":"VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects","abstract":"As AI-assisted video creation becomes increasingly practical, instruction-guided video editing has become essential for refining generated or captured footage to meet professional requirements. Yet the field still lacks both a large-scale human-annotated dataset with complete editing examples and a standardized evaluator for comparing editing systems. Existing resources are limited by small scale, missing edited outputs, or the absence of human quality labels, while current evaluation often relies on expensive manual inspection or generic vision-language model judges that are not specialized for editing quality. We introduce VEFX-Dataset, a human-annotated dataset containing 5,049 video editing examples across 9 major editing categories and 32 subcategories, each labeled along three decoupled dimensions: Instruction Following, Rendering Quality, and Edit Exclusivity. Building on VEFX-Dataset, we propose VEFX-Reward, a reward model designed specifically for video editing quality assessment. VEFX-Reward jointly processes the source video, the editing instruction, and the edited video, and predicts per-dimension quality scores via ordinal regression. We further release VEFX-Bench, a benchmark of 300 curated video-prompt pairs for standardized comparison of editing systems. 
Experiments show that VEFX-Reward aligns more strongly with human judgments than generic VLM judges and prior reward models on both standard IQA/VQA metrics and group-wise preference evaluation. Using VEFX-Reward as an evaluator, we benchmark representative commercial and open-source video editing systems, revealing a persistent gap between visual plausibility, instruction following, and edit locality in current models.","published_date":"2026-04-17T17:28:24+00:00","viability_score":9,"cluster_label":"Video Editing & VFX","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"VEFX-Bench benchmarks and evaluates video editing systems on their ability to follow instructions, render quality, and preserve content.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.16270v1","title":"From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text","abstract":"The complexity of Vietnam's legal texts presents a significant barrier to public access to justice. While Large Language Models offer a promising solution for legal text simplification, evaluating their true capabilities requires a multifaceted approach that goes beyond surface-level metrics. This paper introduces a comprehensive dual-aspect evaluation framework to address this need. First, we establish a performance benchmark for four state-of-the-art large language models (GPT-4o, Claude 3 Opus, Gemini 1.5 Pro, and Grok-1) across three key dimensions: Accuracy, Readability, and Consistency. Second, to understand the \"why\" behind these performance scores, we conduct a large-scale error analysis on a curated dataset of 60 complex Vietnamese legal articles, using a novel, expert-validated error typology. 
Our results reveal a crucial trade-off: models like Grok-1 excel in Readability and Consistency but compromise on fine-grained legal Accuracy, while models like Claude 3 Opus achieve high Accuracy scores that mask a significant number of subtle but critical reasoning errors. The error analysis pinpoints \\textit{Incorrect Example} and \\textit{Misinterpretation} as the most prevalent failures, confirming that the primary challenge for current LLMs is not summarization but controlled, accurate legal reasoning. By integrating a quantitative benchmark with a qualitative deep dive, our work provides a holistic and actionable assessment of LLMs for legal applications.","published_date":"2026-04-17T17:28:23+00:00","viability_score":5,"cluster_label":"Legal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper introduces a dual-aspect evaluation framework for LLMs on Vietnamese legal text, revealing trade-offs between readability and accuracy, and identifying key reasoning errors.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.16241v1","title":"BAGEL: Benchmarking Animal Knowledge Expertise in Language Models","abstract":"Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well language models handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, using a combination of curated examples and automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. 
By focusing on closed-book evaluation, BAGEL measures animal-related knowledge of models without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. Our benchmark provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.","published_date":"2026-04-17T17:00:37+00:00","viability_score":4,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark for evaluating animal knowledge in large language models to improve biodiversity applications.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.16234v1","title":"A Two-Stage, Object-Centric Deep Learning Framework for Robust Exam Cheating Detection","abstract":"Academic integrity continues to face the persistent challenge of examination cheating. Traditional invigilation relies on human observation, which is inefficient, costly, and prone to errors at scale. Although some existing AI-powered monitoring systems have been deployed and trusted, many lack transparency or require multi-layered architectures to achieve the desired performance. To overcome these challenges, we propose an improvement over a simple two-stage framework for exam cheating detection that integrates object detection and behavioral analysis using well-known technologies. First, the state-of-the-art YOLOv8n model is used to localize students in exam-room images. Each detected region is cropped and preprocessed, then classified by a fine-tuned RexNet-150 model as either normal or cheating behavior. 
The system is trained on a dataset compiled from 10 independent sources with a total of 273,897 samples, achieving 0.95 accuracy, 0.94 recall, 0.96 precision, and 0.95 F1-score - a 13\\% increase over a baseline accuracy of 0.82 in video-based cheating detection. In addition, with an average inference time of 13.9 ms per sample, the proposed approach demonstrates robustness and scalability for deployment in large-scale environments. Beyond the technical contribution, the AI-assisted monitoring system also addresses ethical concerns by ensuring that final outcomes are delivered privately to individual students after the examination, for example, via personal email. This prevents public exposure or shaming and offers students an opportunity to reflect on their behavior. For further improvement, it is possible to incorporate additional factors, such as audio data and consecutive frames, to achieve greater accuracy. This study provides a foundation for developing real-time, scalable, ethical, and open-source solutions.","published_date":"2026-04-17T16:53:08+00:00","viability_score":7,"cluster_label":"AI for Education","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A two-stage AI system using YOLOv8n and RexNet-150 for robust, scalable, and ethical exam cheating detection.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.16232v1","title":"Neuro-Symbolic ODE Discovery with Latent Grammar Flow","abstract":"Understanding natural and engineered systems often relies on symbolic formulations, such as differential equations, which provide interpretability and transferability beyond black-box models.   We introduce Latent Grammar Flow (LGF), a neuro-symbolic generative framework for discovering ordinary differential equations from data. 
LGF embeds equations as grammar-based representations into a discrete latent space and forces semantically similar equations to be positioned closer together with a behavioural loss. Then, a discrete flow model guides the sampling process to recursively generate candidate equations that best fit the observed data. Domain knowledge and constraints, such as stability, can be either embedded into the rules or used as conditional predictors.","published_date":"2026-04-17T16:46:23+00:00","viability_score":0,"cluster_label":"Scientific Discovery","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A neuro-symbolic framework for discovering ordinary differential equations from data using latent grammar flow.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.16224v1","title":"\"Taking Stock at FAccT\": Using Participatory Design to Co-Create a Vision for the Fairness, Accountability and Transparency Community","abstract":"As a relatively new forum, ACM FAccT has become a key space for activists and scholars to critically examine emerging AI and ML technologies. It brings together academics, civil society members, and government representatives from diverse fields to explore the broader societal impacts of both deployed and proposed technologies. We report a large-scale participatory design (PD) process for reflexive conference governance, which combined an in-person CRAFT session, an asynchronous Polis poll and the synthesis of a governance-facing report for the FAccT leadership. Participants shaped the substantive agenda by authoring seed statements, adding new statements and making patterns of agreement, disagreement and uncertainty visible through voting. Our endeavors represent one of the first instances of applying PD to a venue that critically interrogates the societal impacts of AI, fostering a niche in which critical scholars are free to voice their concerns. 
Finally, this work advances large-scale PD theory by providing an effective case study of a co-design paradigm that can readily scale temporally and epistemologically.","published_date":"2026-04-17T16:37:41+00:00","viability_score":0,"cluster_label":"AI Ethics & Governance","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A participatory design process to co-create a vision for the Fairness, Accountability, and Transparency in AI community.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.16217v1","title":"Beyond Surface Statistics: Robust Conformal Prediction for LLMs via Internal Representations","abstract":"Large language models are increasingly deployed in settings where reliability matters, yet output-level uncertainty signals such as token probabilities, entropy, and self-consistency can become brittle under calibration--deployment mismatch. Conformal prediction provides finite-sample validity under exchangeability, but its practical usefulness depends on the quality of the nonconformity score. We propose a conformal framework for LLM question answering that uses internal representations rather than output-facing statistics: specifically, we introduce Layer-Wise Information (LI) scores, which measure how conditioning on the input reshapes predictive entropy across model depth, and use them as nonconformity scores within a standard split conformal pipeline. Across closed-ended and open-domain QA benchmarks, with the clearest gains under cross-domain shift, our method achieves a better validity--efficiency trade-off than strong text-level baselines while maintaining competitive in-domain reliability at the same nominal risk level. 
These results suggest that internal representations can provide more informative conformal scores when surface-level uncertainty is unstable under distribution shift.","published_date":"2026-04-17T16:28:31+00:00","viability_score":7,"cluster_label":"LLM Reliability","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A conformal prediction framework for LLMs that uses internal representations to provide more robust uncertainty estimates, especially under distribution shift.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.16207v1","title":"AIFIND: Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection","abstract":"As forgery types continue to emerge consistently, Incremental Face Forgery Detection (IFFD) has become a crucial paradigm. However, existing methods typically rely on data replay or coarse binary supervision, which fails to explicitly constrain the feature space, leading to severe feature drift and catastrophic forgetting. To address this, we propose AIFIND, Artifact-Aware Interpreting Fine-Grained Alignment for Incremental Face Forgery Detection, which leverages semantic anchors to stabilize incremental learning. We design the Artifact-Driven Semantic Prior Generator to instantiate invariant semantic anchors, establishing a fixed coordinate system from low-level artifact cues. These anchors are injected into the image encoder via Artifact-Probe Attention, which explicitly constrains volatile visual features to align with stable semantic anchors. Adaptive Decision Harmonizer harmonizes the classifiers by preserving angular relationships of semantic anchors, maintaining geometric consistency across tasks. 
Extensive experiments on multiple incremental protocols validate the superiority of AIFIND.","published_date":"2026-04-17T16:17:12+00:00","viability_score":3,"cluster_label":"Face Forgery Detection","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel approach for incremental face forgery detection that stabilizes learning by aligning features with semantic anchors derived from artifact cues.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.16205v1","title":"ChemGraph-XANES: An Agentic Framework for XANES Simulation and Analysis","abstract":"Computational X-ray absorption near-edge structure (XANES) is widely used to probe local coordination environments, oxidation states, and electronic structure in chemically complex systems. However, the use of computational XANES at scale is constrained more by workflow complexity than by the underlying simulation method itself. To address this challenge, we present ChemGraph-XANES, an agentic framework for automated XANES simulation and analysis that unifies natural-language task specification, structure acquisition, FDMNES input generation, task-parallel execution, spectral normalization, and provenance-aware data curation. Built on ASE, FDMNES, Parsl, and a LangGraph/LangChain-based tool interface, the framework exposes XANES workflow operations as typed Python tools that can be orchestrated by large language model (LLM) agents. In multi-agent mode, a retrieval-augmented expert agent consults the FDMNES manual to ground parameter selection, while executor agents translate user requests into structured tool calls. We demonstrate documentation-grounded parameter retrieval and show that the same workflow supports both explicit structure-file inputs and chemistry-level natural-language requests. 
Because independent XANES calculations are naturally task-parallel, the framework is well suited for high-throughput deployment on high-performance computing (HPC) systems, enabling scalable XANES database generation for downstream analysis and machine-learning applications. ChemGraph-XANES thus provides a reproducible and extensible workflow layer for physics-based XANES simulation, spectral curation, and agent-compatible computational spectroscopy.","published_date":"2026-04-17T16:15:19+00:00","viability_score":6,"cluster_label":"X-ray Analysis Software","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Automate and scale X-ray absorption simulations for chemical analysis using ChemGraph-XANES.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.16182v1","title":"Synthetic data in cryptocurrencies using generative models","abstract":"Data plays a fundamental role in consolidating markets, services, and products in the digital financial ecosystem. However, the use of real data, especially in the financial context, can lead to privacy risks and access restrictions, affecting institutions, research, and modeling processes. Although not all financial datasets present such limitations, this work proposes the use of deep learning techniques for generating synthetic data applied to cryptocurrency price time series. The approach is based on Conditional Generative Adversarial Networks (CGANs), combining an LSTM-type recurrent generator and an MLP discriminator to produce statistically consistent synthetic data. The experiments consider different crypto-assets and demonstrate that the model is capable of reproducing relevant temporal patterns, preserving market trends and dynamics. 
The generation of synthetic series through GANs is an efficient alternative for simulating financial data, showing potential for applications such as market behavior analysis and anomaly detection, with lower computational cost compared to more complex generative approaches.","published_date":"2026-04-17T15:48:05+00:00","viability_score":7,"cluster_label":"Synthetic Financial Data","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Generates statistically consistent synthetic cryptocurrency price time series using Conditional GANs, enabling privacy-preserving market analysis and anomaly detection.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.16175v1","title":"MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation","abstract":"Automated 3D radiology report generation often suffers from clinical hallucinations and a lack of the iterative verification found in human practice. While recent Vision-Language Models (VLMs) have advanced the field, they typically operate as monolithic \"black-box\" systems without the collaborative oversight characteristic of clinical workflows. To address these challenges, we propose MARCH (Multi-Agent Radiology Clinical Hierarchy), a multi-agent framework that emulates the professional hierarchy of radiology departments and assigns specialized roles to distinct agents. MARCH utilizes a Resident Agent for initial drafting with multi-scale CT feature extraction, multiple Fellow Agents for retrieval-augmented revision, and an Attending Agent that orchestrates an iterative, stance-based consensus discourse to resolve diagnostic discrepancies. On the RadGenome-ChestCT dataset, MARCH significantly outperforms state-of-the-art baselines in both clinical fidelity and linguistic accuracy. 
Our work demonstrates that modeling human-like organizational structures enhances the reliability of AI in high-stakes medical domains.","published_date":"2026-04-17T15:42:03+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multi-agent AI framework that mimics radiology department hierarchy to generate more accurate and reliable CT reports, reducing clinical hallucinations.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.16171v1","title":"JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models","abstract":"Adapter-based methods have become a cost-effective approach to continual learning (CL) for Large Language Models (LLMs), by sequentially learning a low-rank update matrix for each task. To mitigate catastrophic forgetting, state-of-the-art approaches impose constraints on new adapters with respect to the previous ones, by targeting either subspace or coordinate-wise interference. In this paper, we propose JumpLoRA, a novel framework to adaptively induce sparsity in the Low-Rank Adaptation (LoRA) blocks through the use of JumpReLU gating. The method achieves dynamic parameter isolation, which helps prevent task interference. We demonstrate that our method is highly modular and compatible with LoRA-based CL approaches. 
Specifically, it significantly boosts the performance of IncLoRA and outperforms the leading state-of-the-art CL method, ELLA.","published_date":"2026-04-17T15:38:37+00:00","viability_score":7,"cluster_label":"LLM Continual Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"JumpLoRA introduces adaptive sparsity in LoRA adapters for LLMs, enabling more efficient and effective continual learning by preventing task interference.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.16158v1","title":"AtManRL: Towards Faithful Reasoning via Differentiable Attention Saliency","abstract":"Large language models (LLMs) increasingly rely on chain-of-thought (CoT) reasoning to solve complex tasks. Yet ensuring that the reasoning trace both contributes to and faithfully reflects the processes underlying the model's final answer, rather than merely accompanying it, remains challenging. We introduce AtManRL, a method that leverages differentiable attention manipulation to learn more faithful reasoning through reinforcement learning. By training an additive attention mask that identifies tokens in the CoT crucial for producing correct answers, we derive a saliency reward signal that encourages the model to generate reasoning traces that genuinely influence its final predictions. We integrate this saliency reward with outcome-based rewards within the GRPO framework to jointly optimize for correctness and interpretability. 
Experiments on GSM8K and MMLU with Llama-3.2-3B-Instruct demonstrate that our approach can identify influential reasoning tokens and enable training more transparent reasoning models.","published_date":"2026-04-17T15:27:35+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AtManRL uses differentiable attention to train LLMs to generate reasoning traces that faithfully contribute to, rather than just accompany, the final answer.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.16147v1","title":"SWNet: A Cross-Spectral Network for Camouflaged Weed Detection","abstract":"This paper presents SWNet, a bimodal end-to-end cross-spectral network specifically engineered for the detection of camouflaged weeds in dense agricultural environments. Plant camouflage, characterized by homochromatic blending where invasive species mimic the phenotypic traits of primary crops, poses a significant challenge for traditional computer vision systems. To overcome these limitations, SWNet utilizes a Pyramid Vision Transformer v2 backbone to capture long-range dependencies and a Bimodal Gated Fusion Module to dynamically integrate Visible and Near-Infrared information. By leveraging the physiological differences in chlorophyll reflectance captured in the NIR spectrum, the proposed architecture effectively discriminates targets that are otherwise indistinguishable in the visible range. Furthermore, an Edge-Aware Refinement module is employed to produce sharper object boundaries and reduce structural ambiguity. Experimental results on the Weeds-Banana dataset indicate that SWNet outperforms ten state-of-the-art methods. The study demonstrates that the integration of cross-spectral data and boundary-guided refinement is essential for high segmentation accuracy in complex crop canopies. 
The code is available on GitHub: https://cod-espol.github.io/SWNet/","published_date":"2026-04-17T15:20:31+00:00","viability_score":7,"cluster_label":"Agritech","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Bimodal AI network SWNet detects camouflaged weeds in agriculture using cross-spectral data.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.16145v1","title":"Training Time Prediction for Mixed Precision-based Distributed Training","abstract":"Accurate prediction of training time in distributed deep learning is crucial for resource allocation, cost estimation, and job scheduling. We observe that the floating-point precision setting is a key determinant of training time, leading to training time variations of ~2.4x over its minimum. However, existing studies on distributed training time prediction rely on static model computation graphs that do not capture precision variations, including mixed precision. According to our experiments, training time prediction without considering precision results in significant prediction errors - reaching up to 147.85% in mean absolute percentage error (MAPE). To address this issue, we propose a precision-aware distributed training time predictor that achieves robust accuracy across diverse precision settings, including mixed precision, with 9.8% MAPE.","published_date":"2026-04-17T15:18:01+00:00","viability_score":3,"cluster_label":"LLM Training Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A precision-aware predictor for distributed deep learning training time to improve resource allocation and cost estimation.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.16132v1","title":"Can LLMs Understand the Impact of Trauma? 
Costs and Benefits of LLMs Coding the Interviews of Firearm Violence Survivors","abstract":"Firearm violence is a pressing public health issue, yet research into survivors' lived experiences remains underfunded and difficult to scale. Qualitative research, including in-depth interviews, is a valuable tool for understanding the personal and societal consequences of community firearm violence and designing effective interventions. However, manually analyzing these narratives through thematic analysis and inductive coding is time-consuming and labor-intensive. Recent advancements in large language models (LLMs) have opened the door to automating this process, though concerns remain about whether these models can accurately and ethically capture the experiences of vulnerable populations. In this study, we assess the use of open-source LLMs to inductively code interviews with 21 Black men who have survived community firearm violence. Our results demonstrate that while some configurations of LLMs can identify important codes, overall relevance remains low and is highly sensitive to data processing. Furthermore, LLM guardrails lead to substantial narrative erasure. These findings highlight both the potential and limitations of LLM-assisted qualitative coding and underscore the ethical challenges of applying AI in research involving marginalized communities.","published_date":"2026-04-17T15:07:27+00:00","viability_score":5,"cluster_label":"AI for Social Science Research","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Leveraging open-source LLMs to code qualitative interviews with firearm violence survivors, highlighting potential and ethical challenges.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.16117v1","title":"SCRIPT: Implementing an Intelligent Tutoring System for Programming in a German University Context","abstract":"Practice and extensive exercises are essential in programming education. 
Intelligent tutoring systems (ITSs) are a viable option to provide individualized hints and advice to programming students even when human tutors are not available. However, prior ITSs for programming rarely support the Python programming language, mostly focus on introductory programming, and rarely take recent developments in generative models into account. We aim to establish a novel ITS for Python programming that is highly adaptable, serves both as a teaching and research platform, provides interfaces to plug in hint mechanisms (e.g., via large language models), and works inside the particularly challenging regulatory environment of Germany, that is, conforming to the European data protection regulation, the European AI Act, and the ethical framework of the German Research Foundation. In this paper, we present the description of the current state of the ITS along with future development directions, as well as discuss the challenges and opportunities for improving the system.","published_date":"2026-04-17T14:53:38+00:00","viability_score":3,"cluster_label":"AI in Education","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Developing an adaptable intelligent tutoring system for Python programming that integrates generative models and complies with European data protection regulations.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.16116v1","title":"The Relic Condition: When Published Scholarship Becomes Material for Its Own Replacement","abstract":"We extracted the scholarly reasoning systems of two internationally prominent humanities and social science scholars from their published corpora alone, converted those systems into structured inference-time constraints for a large language model, and tested whether the resulting scholar-bots could perform core academic functions at expert-assessed quality. 
The distillation pipeline used an eight-layer extraction method and a nine-module skill architecture grounded in local, closed-corpus analysis. The scholar-bots were then deployed across doctoral supervision, peer review, lecturing and panel-style academic exchange. Expert assessment involved three senior academics producing reports and appointment-level syntheses. Across the preserved expert record, all review and supervision reports judged the outputs benchmark-attaining, appointment-level recommendations placed both bots at or above Senior Lecturer level in the Australian university system, and recovered panel scores placed Scholar A between 7.9 and 8.9/10 and Scholar B between 8.5 and 8.9/10 under multi-turn debate conditions. A research-degree-student survey showed high performance ratings across information reliability, theoretical depth and logical rigor, with pronounced ceiling effects on a 7-point scale, despite all participants already being frontier-model users. We term this the Relic condition: when publication systems make stable reasoning architectures legible, extractable and cheaply deployable, the public record of intellectual labor becomes raw material for its own functional replacement. 
Because the technical threshold for this transition is already crossed at modest engineering effort, we argue that the window for protective frameworks covering disclosure, consent, compensation and deployment restriction is the present, while deployment remains optional rather than infrastructural.","published_date":"2026-04-17T14:52:36+00:00","viability_score":7,"cluster_label":"AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Creating scholar-bots by extracting reasoning systems from published works to perform academic functions, demonstrating the potential for functional replacement of intellectual labor.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.16106v1","title":"Reckoning with the Political Economy of AI: Avoiding Decoys in Pursuit of Accountability","abstract":"The Project of AI is a world-building endeavor, wherein those who fund and develop AI systems both operate through and seek to sustain networks of power and wealth. As they expand their access to resources and configure our sociotechnical conditions, they benefit from the ways in which a suite of decoys animate scholars, critics, policymakers, journalists, and the public into co-constructing industry-empowering AI futures. Regardless of who constructs or nurtures them, these decoys often create the illusion of accountability while both masking the emerging political economies that the Project of AI has set into motion, and also contributing to the network-making power that is at the heart of the Project's extraction and exploitation. Drawing on literature at the intersection of communication, science and technology studies, and economic sociology, we examine how the Project of AI is constructed. We then explore five decoys that seemingly critique - but in actuality co-constitute - AI's emergent power relations and material political economy. 
We argue that advancing meaningful fairness or accountability in AI requires: 1) recognizing when and how decoys serve as a distraction, and 2) grappling directly with the material political economy of the Project of AI. Doing so will enable us to attend to the networks of power that make 'AI' possible, spurring new visions for how to realize a more just technologically entangled world.","published_date":"2026-04-17T14:38:06+00:00","viability_score":2,"cluster_label":"AI Ethics & Governance","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper analyzes the political economy of AI development and the role of 'decoys' in creating an illusion of accountability, arguing for a direct confrontation with material power structures to achieve a more just technological future.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.16084v1","title":"Unveiling Stochasticity: Universal Multi-modal Probabilistic Modeling for Traffic Forecasting","abstract":"Traffic forecasting is a challenging spatio-temporal modeling task and a critical component of urban transportation management. Current studies mainly focus on deterministic predictions, with limited considerations on the uncertainty and stochasticity in traffic dynamics. Therefore, this paper proposes an elegant yet universal approach that transforms existing models into probabilistic predictors by replacing only the final output layer with a novel Gaussian Mixture Model (GMM) layer. The modified model requires no changes to the training pipeline and can be trained using only the Negative Log-Likelihood (NLL) loss, without any auxiliary or regularization terms. Experiments on multiple traffic datasets show that our approach generalizes from classic to modern model architectures while preserving deterministic performance. 
Furthermore, we propose a systematic evaluation procedure based on cumulative distributions and confidence intervals, and demonstrate that our approach is considerably more accurate and informative than unimodal or deterministic baselines. Finally, a more detailed study on a real-world dense urban traffic network is presented to examine the impact of data quality on uncertainty quantification and to show the robustness of our approach under imperfect data conditions. Code available at https://github.com/Weijiang-Xiong/OpenSkyTraffic","published_date":"2026-04-17T14:10:16+00:00","viability_score":7,"cluster_label":"Traffic Forecasting","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Enhance existing traffic forecasting models into probabilistic predictors by adding a novel Gaussian Mixture Model layer, improving accuracy and providing uncertainty quantification.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.16082v1","title":"Early Detection of Acute Myeloid Leukemia (AML) Using YOLOv12 Deep Learning Model","abstract":"Acute Myeloid Leukemia (AML) is one of the most life-threatening types of blood cancer, and its accurate classification remains a challenging task due to the visual similarity between various cell types. This study addresses the multiclass classification of AML cells using the YOLOv12 deep learning model. We applied two segmentation approaches based on cell and nucleus features, using Hue channel and Otsu thresholding techniques to preprocess the images prior to classification. 
Our experiments demonstrate that YOLOv12 with Otsu thresholding on cell-based segmentation achieved the highest level of validation and test accuracy, both reaching 99.3%.","published_date":"2026-04-17T14:07:34+00:00","viability_score":8,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Achieve 99.3% accuracy in early detection of Acute Myeloid Leukemia using the YOLOv12 deep learning model with optimized image preprocessing.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.16076v1","title":"Prototype-Grounded Concept Models for Verifiable Concept Alignment","abstract":"Concept Bottleneck Models (CBMs) aim to improve interpretability in Deep Learning by structuring predictions through human-understandable concepts, but they provide no way to verify whether learned concepts align with the human's intended meaning, hurting interpretability. We introduce Prototype-Grounded Concept Models (PGCMs), which ground concepts in learned visual prototypes: image parts that serve as explicit evidence for the concepts. This grounding enables direct inspection of concept semantics and supports targeted human intervention at the prototype level to correct misalignments. 
Empirically, PGCMs match the predictive performance of state-of-the-art CBMs while substantially improving transparency, interpretability, and intervenability.","published_date":"2026-04-17T14:04:14+00:00","viability_score":3,"cluster_label":"Interpretable AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Develops a novel method to improve interpretability in deep learning by grounding concepts in learned visual prototypes, enabling direct inspection and correction of concept semantics.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.16060v1","title":"Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs","abstract":"Multimodal Reasoning Models (MRMs) leveraging Chain-of-Thought (CoT) based thinking have revolutionized mathematical and logical problem-solving. However, we show that this paradigm struggles with generalized spatial intelligence. We perform a comprehensive evaluation of seventeen models across thirteen spatial benchmarks and identify a critical gap: CoT prompting consistently degrades performance in visual spatial reasoning. Furthermore, through a novel No-Image++ ablation, we demonstrate that MRMs and CoT prompted MLMs suffer from severe shortcut learning, and hallucinate visual details from textual priors even when the image is absent. 
These findings challenge the efficacy of text-only CoT for spatial tasks and underscore the need for vision-centric reasoning paradigms.","published_date":"2026-04-17T13:35:45+00:00","viability_score":7,"cluster_label":"Multimodal LLMs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Identifies that Chain-of-Thought prompting degrades visual spatial reasoning in multimodal LLMs and demonstrates a tendency for shortcut learning, advocating for vision-centric reasoning.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.16056v1","title":"AST: Adaptive, Seamless, and Training-Free Precise Speech Editing","abstract":"Text-based speech editing aims to modify specific segments while preserving speaker identity and acoustic context. Existing methods rely on task-specific training, which incurs high data costs and struggles with temporal fidelity in unedited regions. Meanwhile, adapting Text-to-Speech (TTS) models often faces a trade-off between editing quality and consistency. To address these issues, we propose AST, an Adaptive, Seamless, and Training-free precise speech editing framework. Leveraging a pre-trained autoregressive TTS model, AST introduces Latent Recomposition to selectively stitch preserved source segments with newly synthesized targets. Furthermore, AST extends this latent manipulation to enable precise style editing for specific speech segments. To prevent artifacts at these edit boundaries, the framework incorporates Adaptive Weak Fact Guidance (AWFG). AWFG dynamically modulates a mel-space guidance signal, enforcing structural constraints only where necessary without disrupting the generative manifold. To fill the gap of publicly accessible benchmarks, we introduce LibriSpeech-Edit, a new and larger speech editing dataset. As existing metrics poorly evaluate temporal consistency in unedited regions, we propose Word-level Dynamic Time Warping (WDTW). 
Extensive experiments demonstrate that AST resolves the controllability-quality trade-off without extra training. Compared to the previous most temporally consistent baseline, AST improves consistency while reducing Word Error Rate by nearly 70%. Moreover, applying AST to a foundation TTS model reduces WDTW by 27%, achieving state-of-the-art speaker preservation and temporal fidelity.","published_date":"2026-04-17T13:30:59+00:00","viability_score":7,"cluster_label":"Speech Editing","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A training-free framework for precise speech editing and style modification, significantly improving temporal consistency and reducing word error rates.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.16054v1","title":"Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs","abstract":"Multimodal large language models (MLLMs) have achieved impressive progress on vision language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce \"Mind's Eye\", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel \"A-R-T\" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top performing MLLMs remain below 50%. Error analysis reveals failures in: (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) weak abstraction of underlying visual concepts. 
Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities, when compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.","published_date":"2026-04-17T13:29:46+00:00","viability_score":4,"cluster_label":"Multimodal LLMs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark for evaluating the visual abstraction, transformation, and composition capabilities of multimodal LLMs, revealing significant gaps compared to human performance.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.16042v1","title":"Towards Intrinsic Interpretability of Large Language Models:A Survey of Design Principles and Architectures","abstract":"While Large Language Models (LLMs) have achieved strong performance across many NLP tasks, their opaque internal mechanisms hinder trustworthiness and safe deployment. Existing surveys in explainable AI largely focus on post-hoc explanation methods that interpret trained models through external approximations. In contrast, intrinsic interpretability, which builds transparency directly into model architectures and computations, has recently emerged as a promising alternative. This paper presents a systematic review of the recent advances in intrinsic interpretability for LLMs, categorizing existing approaches into five design paradigms: functional transparency, concept alignment, representational decomposability, explicit modularization, and latent sparsity induction. We further discuss open challenges and outline future research directions in this emerging field. 
The paper list is available at: https://github.com/PKU-PILLAR-Group/Survey-Intrinsic-Interpretability-of-LLMs.","published_date":"2026-04-17T13:15:46+00:00","viability_score":3,"cluster_label":"LLM Interpretability","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A survey of intrinsic interpretability methods for LLMs, categorizing design principles and architectures to guide future research in transparent AI.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.16033v1","title":"Safe Deep Reinforcement Learning for Building Heating Control and Demand-side Flexibility","abstract":"Buildings account for approximately 40% of global energy consumption, and with the growing share of intermittent renewable energy sources, enabling demand-side flexibility, particularly in heating, ventilation and air conditioning systems, is essential for grid stability and energy efficiency. This paper presents a safe deep reinforcement learning-based control framework to optimize building space heating while enabling demand-side flexibility provision for power system operators. A deep deterministic policy gradient algorithm is used as the core deep reinforcement learning method, enabling the controller to learn an optimal heating strategy through interaction with the building thermal model while maintaining occupant comfort, minimizing energy cost, and providing flexibility. To address safety concerns with reinforcement learning, particularly regarding compliance with flexibility requests, we propose a real-time adaptive safety-filter to ensure that the system operates within predefined constraints during demand-side flexibility provision. 
The proposed real-time adaptive safety filter guarantees full compliance with flexibility requests from system operators and improves energy and cost efficiency -- achieving up to 50% savings compared to a rule-based controller -- while outperforming a standalone deep reinforcement learning-based controller in energy and cost metrics, with only a slight increase in comfort temperature violations.","published_date":"2026-04-17T13:03:02+00:00","viability_score":4,"cluster_label":"Building Control","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A safe deep reinforcement learning framework for building heating control that optimizes energy efficiency and provides demand-side flexibility while ensuring occupant comfort and grid stability.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2604.16027v1","title":"Where does output diversity collapse in post-training?","abstract":"Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3, Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. 
Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.","published_date":"2026-04-17T12:56:31+00:00","viability_score":3,"cluster_label":"LLM Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research investigates the collapse of output diversity in post-trained language models, identifying data composition as the primary driver rather than generation format.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.16022v1","title":"SocialGrid: A Benchmark for Planning and Social Reasoning in Embodied Multi-Agent Systems","abstract":"As Large Language Models (LLMs) transition from text processors to autonomous agents, evaluating their social reasoning in embodied multi-agent settings becomes critical. We introduce SocialGrid, an embodied multi-agent environment inspired by Among Us that evaluates LLM agents on planning, task execution, and social reasoning. Our evaluations reveal that even the strongest open model (GPT-OSS-120B) achieves below 60% accuracy in task completion and planning, with agents getting stuck in repetitive behaviors or failing to navigate basic obstacles. Since poor navigation confounds evaluation of social intelligence, SocialGrid offers an optional Planning Oracle to isolate social reasoning from planning deficits. 
While planning assistance improves task completion, social reasoning remains a bottleneck: agents fail to detect deception at near-random chance regardless of scale, relying on shallow heuristics rather than accumulating behavioral evidence. SocialGrid provides automatic failure analysis and fine-grained metrics, enabling developers to diagnose and improve their agents. We also establish a competitive leaderboard using Elo ratings from adversarial league play.","published_date":"2026-04-17T12:51:46+00:00","viability_score":5,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"SocialGrid is a new benchmark for evaluating LLM agents in embodied multi-agent settings, revealing significant limitations in planning and social reasoning.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.16021v1","title":"Neurosymbolic Repo-level Code Localization","abstract":"Code localization is a cornerstone of autonomous software engineering. Recent advancements have achieved impressive performance on real-world issue benchmarks. However, we identify a critical yet overlooked bias: these benchmarks are saturated with keyword references (e.g. file paths, function names), encouraging models to rely on superficial lexical matching rather than genuine structural reasoning. We term this phenomenon the Keyword Shortcut. To address this, we formalize the challenge of Keyword-Agnostic Logical Code Localization (KA-LCL) and introduce KA-LogicQuery, a diagnostic benchmark requiring structural reasoning without any naming hints. Our evaluation reveals a catastrophic performance drop of state-of-the-art approaches on KA-LogicQuery, exposing their lack of deterministic reasoning capabilities. We propose LogicLoc, a novel agentic framework that combines large language models with the rigorous logical reasoning of Datalog for precise localization. 
LogicLoc extracts program facts from the codebase and leverages an LLM to synthesize Datalog programs, with parser-gated validation and mutation-based intermediate-rule diagnostic feedback to ensure correctness and efficiency. The validated programs are executed by a high-performance inference engine, enabling accurate and verifiable localization in a fully automated, closed-loop workflow. Experimental results demonstrate that LogicLoc significantly outperforms SOTA methods on KA-LogicQuery while maintaining competitive performance on popular issue-driven benchmarks. Notably, LogicLoc attains superior performance with significantly lower token consumption and faster execution by offloading structural traversal to a deterministic engine, reducing the overhead of iterative LLM inference.","published_date":"2026-04-17T12:49:18+00:00","viability_score":7,"cluster_label":"Autonomous Software Engineering","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LogicLoc is a neurosymbolic agentic framework that combines LLMs with Datalog for accurate and verifiable code localization, overcoming keyword shortcut biases.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.16009v1","title":"MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition","abstract":"Metacognition, the ability to monitor and regulate one's own reasoning, remains under-evaluated in AI benchmarking. We introduce MEDLEY-BENCH, a benchmark of behavioural metacognition that separates independent reasoning, private self-revision, and socially influenced revision under genuine inter-model disagreement. 
The benchmark evaluates 35 models from 12 families on 130 ambiguous instances across five domains and reports two complementary scores: the Medley Metacognition Score (MMS), a tier-based aggregate of reflective updating, social robustness, and epistemic articulation, and the Medley Ability Score (MAS), derived from four metacognitive sub-abilities. Results show a robust evaluation/control dissociation: evaluation ability increases with model size within families, whereas control does not. In a follow-up progressive adversarial analysis of 11 models, we observed two behavioural profiles, i.e., models that revise primarily in response to argument quality and models that track consensus statistics. Under within-model relative profiling (ipsative scoring), evaluation was the weakest relative ability in all 35 models, indicating a systematic knowing/doing gap. Smaller and cheaper models often matched or outperformed larger counterparts, suggesting that metacognitive competence is not simply a function of scale. These findings position MEDLEY-BENCH as a tool for measuring belief revision under social pressure and suggest that future training should reward calibrated, proportional updating rather than output quality alone.","published_date":"2026-04-17T12:32:50+00:00","viability_score":5,"cluster_label":"AI Metacognition","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MEDLEY-BENCH is a new benchmark for AI metacognition that reveals a dissociation between evaluation ability and control, showing scale does not guarantee improvement.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.16004v1","title":"AgentV-RL: Scaling Reward Modeling with Agentic Verifier","abstract":"Verifiers have been demonstrated to enhance LLM reasoning via test-time scaling (TTS). Yet, they face significant challenges in complex domains. 
Error propagation from incorrect intermediate reasoning can lead to false positives for seemingly plausible solutions, while lacking external grounding makes verifiers unreliable on computation or knowledge-intensive tasks. To address these challenges, we propose Agentic Verifier, a framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process. We introduce complementary forward and backward agents: one traces solutions from premises to conclusions, while the other re-checks conclusions against their underlying premises. This bidirectional process enables a comprehensive, reliable, and interpretable assessment of solutions. To facilitate practical deployment, we propose AgentV-RL. Through proactive exploration and reinforcement learning, the verifier autonomously interleaves tool-use with internal reasoning. Extensive experiments show that Agentic Verifier yields consistent performance gains under both parallel and sequential TTS. Notably, our 4B variant surpasses state-of-the-art ORMs by 25.2%, positioning it as a promising paradigm for agentic reward modeling.","published_date":"2026-04-17T12:27:36+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An agentic framework that transforms reward modeling into a multi-turn, tool-augmented deliberative process to enhance LLM reasoning.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15994v1","title":"ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams","abstract":"Multimodal Large Language Models (MLLMs) excel at recognizing individual visual elements and reasoning over simple linear diagrams. However, when faced with complex topological structures involving branching paths, converging flows, and cyclic dependencies, their reasoning capabilities degrade sharply, even on tasks as basic as counting endpoints. 
Existing benchmarks fail to probe this gap, focusing on semantic comprehension rather than structural reasoning. We introduce ReactBench, a benchmark that reveals fundamental limitations in structural reasoning through chemical reaction diagrams. These real-world scientific diagrams offer an ideal testbed because they naturally span diverse structures from linear chains to cyclic graphs, while requiring both precise local recognition and coherent global reasoning. Our benchmark comprises 1,618 expert-annotated QA pairs across four hierarchical task dimensions. Extensive evaluation across 17 MLLMs reveals a significant performance gap exceeding 30% between anchor-based tasks and holistic structural reasoning tasks. Controlled ablations confirm this bottleneck lies in reasoning, not perception. These findings expose a fundamental deficit in structural understanding and establish directions for advancing visual reasoning.","published_date":"2026-04-17T12:16:57+00:00","viability_score":7,"cluster_label":"Multimodal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ReactBench, a benchmark for evaluating topological reasoning in MLLMs using chemical reaction diagrams, reveals significant limitations in structural understanding.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15990v1","title":"From Vulnerable Data Subjects to Vulnerabilizing Data Practices: Navigating the Protection Paradox in AI-Based Analyses of Platformized Lives","abstract":"This paper traces a conceptual shift from understanding vulnerability as a static, essentialized property of data subjects to examining how it is actively enacted through data practices. 
Unlike reflexive ethical frameworks focused on missing or counter-data, we address the condition of abundance inherent to platformized life-a context where a near inexhaustible mass of data points already exists, shifting the ethical challenge to the researcher's choices in operating upon this existing mass. We argue that the ethical integrity of data science depends not just on who is studied, but on how technical pipelines transform \"vulnerable\" individuals into data subjects whose vulnerability can be further precarized. We develop this argument through an AI for Social Good (AI4SG) case: a journalist's request to use computer vision to quantify child presence in monetized YouTube 'family vlogs' for regulatory advocacy. This case reveals a \"protection paradox\": how data-driven efforts to protect vulnerable subjects can inadvertently impose new forms of computational exposure, reductionism, and extraction. Using this request as a point of departure, we perform a methodological deconstruction of the AI pipeline to show how granular technical decisions are ethically constitutive. We contribute a reflexive ethics protocol that translates these insights into a reflexive roadmap for research ethics surrounding platformized data subjects. Organized around four critical junctures-dataset design, operationalization, inference, and dissemination-the protocol identifies technical questions and ethical tensions where well-intentioned work can slide into renewed extraction or exposure. For every decision point, the protocol offers specific prompts to navigate four cross-cutting vulnerabilizing factors: exposure, monetization, narrative fixing, and algorithmic optimization. 
Rather than uncritically...","published_date":"2026-04-17T12:12:38+00:00","viability_score":3,"cluster_label":"AI Ethics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper explores how AI data practices can enact and precarize vulnerability in platformized lives, proposing a reflexive ethics protocol for AI for Social Good.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15972v1","title":"Weak-Link Optimization for Multi-Agent Reasoning and Collaboration","abstract":"LLM-driven multi-agent frameworks address complex reasoning tasks through multi-role collaboration. However, existing approaches often suffer from reasoning instability, where individual agent errors are amplified through collaboration, undermining overall performance. Current research mainly focuses on enhancing high-capability agents or suppressing unreliable outputs to improve framework effectiveness, while systematic identification and reinforcement of performance-limiting agents receive less attention. To address this gap, we propose WORC, a \\underline{w}eak-link \\underline{o}ptimization framework for multi-agent \\underline{r}easoning and \\underline{c}ollaboration, grounded in the weak-link principle. WORC follows a two-stage workflow. In the weak agent localization stage, task features are constructed, and a meta-learning-based weight predictor trained on optimal configurations identified by swarm intelligence algorithms (SIAs) enables zero-shot mapping from these features to agent performance weights, where the agent with the lowest predicted weight is identified as the weak agent. In the weak-link optimization stage, an uncertainty-driven allocation strategy assigns additional reasoning budgets to weak agents, with lower predicted weights leading to larger repeated-sampling quotas to compensate for reliability deficiencies. 
Experimental results show that WORC achieves an average accuracy of 82.2% on reasoning benchmarks while improving framework stability and cross-architecture generalization, suggesting that compensating for weak links, rather than reinforcing strengths alone, enhances the robustness of multi-agent systems.","published_date":"2026-04-17T11:36:20+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"WORC optimizes multi-agent reasoning by identifying and compensating for weak agents, improving accuracy and framework stability.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15951v1","title":"Integrating Graphs, Large Language Models, and Agents: Reasoning and Retrieval","abstract":"Generative AI, particularly Large Language Models, increasingly integrates graph-based representations to enhance reasoning, retrieval, and structured decision-making. Despite rapid advances, there remains limited clarity regarding when, why, where, and what types of graph-LLM integrations are most appropriate across applications. This survey provides a concise, structured overview of the design choices underlying the integration of graphs with LLMs. We categorize existing methods based on their purpose (reasoning, retrieval, generation, recommendation), graph modality (knowledge graphs, scene graphs, interaction graphs, causal graphs, dependency graphs), and integration strategies (prompting, augmentation, training, or agent-based use). By mapping representative works across domains such as cybersecurity, healthcare, materials science, finance, robotics, and multimodal environments, we highlight the strengths, limitations, and best-fit scenarios for each technique. 
This survey aims to offer researchers a practical guide for selecting the most suitable graph-LLM approach depending on task requirements, data characteristics, and reasoning complexity.","published_date":"2026-04-17T11:12:55+00:00","viability_score":2,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A survey categorizing graph-LLM integration methods to guide researchers in selecting the most suitable approach for various applications.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15937v1","title":"Polarization by Default: Auditing Recommendation Bias in LLM-Based Content Curation","abstract":"Large Language Models (LLMs) are increasingly deployed to curate and rank human-created content, yet the nature and structure of their biases in these tasks remains poorly understood: which biases are robust across providers and platforms, and which can be mitigated through prompt design. We present a controlled simulation study mapping content selection biases across three major LLM providers (OpenAI, Anthropic, Google) on real social media datasets from Twitter/X, Bluesky, and Reddit, using six prompting strategies (\\textit{general}, \\textit{popular}, \\textit{engaging}, \\textit{informative}, \\textit{controversial}, \\textit{neutral}). Through 540,000 simulated top-10 selections from pools of 100 posts across 54 experimental conditions, we find that biases differ substantially in how structural and how prompt-sensitive they are. Polarization is amplified across all configurations, toxicity handling shows a strong inversion between engagement- and information-focused prompts, and sentiment biases are predominantly negative. Provider comparisons reveal distinct trade-offs: GPT-4o Mini shows the most consistent behavior across prompts; Claude and Gemini exhibit high adaptivity in toxicity handling; Gemini shows the strongest negative sentiment preference. 
On Twitter/X, where author demographics can be inferred from profile bios, political leaning bias is the clearest demographic signal: left-leaning authors are systematically over-represented despite right-leaning authors forming the pool plurality in the dataset, and this pattern largely persists across prompts.","published_date":"2026-04-17T10:55:21+00:00","viability_score":6,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Auditing recommendation bias in LLM-based content curation across providers and prompts, revealing consistent polarization and provider-specific trade-offs.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.15898v1","title":"Towards Rigorous Explainability by Feature Attribution","abstract":"For around a decade, non-symbolic methods have been the option of choice when explaining complex machine learning (ML) models. Unfortunately, such methods lack rigor and can mislead human decision-makers. In high-stakes uses of ML, the lack of rigor is especially problematic. One prime example of provable lack of rigor is the adoption of Shapley values in explainable artificial intelligence (XAI), with the tool SHAP being a ubiquitous example. 
This paper overviews the ongoing efforts towards using rigorous symbolic methods of XAI as an alternative to non-rigorous non-symbolic approaches, concretely for assigning relative feature importance.","published_date":"2026-04-17T09:56:17+00:00","viability_score":1,"cluster_label":"Explainable AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Advocating for rigorous symbolic methods in explainable AI to address the lack of rigor in current non-symbolic approaches like SHAP.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15877v1","title":"Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents","abstract":"As LLM agents scale to long-horizon, multi-session deployments, efficiently managing accumulated experience becomes a critical bottleneck. Agent memory systems and agent skill discovery both address this challenge -- extracting reusable knowledge from interaction traces -- yet a citation analysis of 1,136 references across 22 primary papers reveals a cross-community citation rate below 1%. We propose the \\emph{Experience Compression Spectrum}, a unifying framework that positions memory, skills, and rules as points along a single axis of increasing compression (5--20$\\times$ for episodic memory, 50--500$\\times$ for procedural skills, 1,000$\\times$+ for declarative rules), directly reducing context consumption, retrieval latency, and compute overhead. Mapping 20+ systems onto this spectrum reveals that every system operates at a fixed, predetermined compression level -- none supports adaptive cross-level compression, a gap we term the \\emph{missing diagonal}. 
We further show that specialization alone is insufficient -- both communities independently solve shared sub-problems without exchanging solutions -- that evaluation methods are tightly coupled to compression levels, that transferability increases with compression at the cost of specificity, and that knowledge lifecycle management remains largely neglected. We articulate open problems and design principles for scalable, full-spectrum agent learning systems.","published_date":"2026-04-17T09:26:25+00:00","viability_score":2,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Proposing a unifying framework for LLM agent experience compression, identifying a gap in adaptive cross-level compression.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15871v1","title":"UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs","abstract":"The evaluation of visual editing models remains fragmented across methods and modalities. Existing benchmarks are often tailored to specific paradigms, making fair cross-paradigm comparisons difficult, while video editing lacks reliable evaluation benchmarks. Furthermore, common automatic metrics often misalign with human preference, yet directly deploying large multimodal models (MLLMs) as evaluators incurs prohibitive computational and financial costs. We present UniEditBench, a unified benchmark for image and video editing that supports reconstruction-based and instruction-driven methods under a shared protocol. UniEditBench includes a structured taxonomy of nine image operations (Add, Remove, Replace, Change, Stroke-based, Extract, Adjust, Count, Reorder) and eight video operations, with coverage of challenging compositional tasks such as counting and spatial reordering. 
To enable scalable evaluation, we distill a high-capacity MLLM judge (Qwen3-VL-235B-A22B Instruct) into lightweight 4B/8B evaluators that provide multi-dimensional scoring over structural fidelity, text alignment, background consistency, naturalness, and temporal-spatial consistency (for videos). Experiments show that the distilled evaluators maintain strong agreement with human judgments and substantially reduce deployment cost relative to the teacher model. UniEditBench provides a practical and reproducible protocol for benchmarking modern visual editing methods. Our benchmark and the associated reward models are publicly available at https://github.com/wesar1/UniEditBench.","published_date":"2026-04-17T09:21:48+00:00","viability_score":7,"cluster_label":"AI Benchmarking","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified and cost-effective benchmark for image and video editing using distilled large multimodal models for scalable and human-aligned evaluation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15866v1","title":"DiZiNER: Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition","abstract":"Large language models (LLMs) have advanced information extraction (IE) by enabling zero-shot and few-shot named entity recognition (NER), yet their generative outputs still show persistent and systematic errors. Despite progress through instruction fine-tuning, zero-shot NER still lags far behind supervised systems. These recurring errors mirror inconsistencies observed in early-stage human annotation processes that resolve disagreements through pilot annotation. 
Motivated by this analogy, we introduce DiZiNER (Disagreement-guided Instruction Refinement via Pilot Annotation Simulation for Zero-shot Named Entity Recognition), a framework that simulates the pilot annotation process, employing LLMs to act as both annotators and supervisors. Multiple heterogeneous LLMs annotate shared texts, and a supervisor model analyzes inter-model disagreements to refine task instructions. Across 18 benchmarks, DiZiNER achieves zero-shot SOTA results on 14 datasets, improving prior bests by +8.0 F1 and reducing the zero-shot to supervised gap by over +11 points. It also consistently outperforms its supervisor, GPT-5 mini, indicating that improvements stem from disagreement-guided instruction refinement rather than model capacity. Pairwise agreement between models shows a strong correlation with NER performance, further supporting this finding.","published_date":"2026-04-17T09:16:30+00:00","viability_score":8,"cluster_label":"LLM Instruction Tuning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DiZiNER refines LLM instructions for zero-shot Named Entity Recognition by simulating pilot annotation to resolve disagreements, achieving state-of-the-art results.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.15859v1","title":"QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals","abstract":"Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. 
Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark, QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.","published_date":"2026-04-17T09:06:24+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"QuantSightBench evaluates LLM quantitative forecasting with prediction intervals, revealing significant overconfidence and calibration issues across frontier models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15856v1","title":"Robust Multispectral Semantic Segmentation under Missing or Full Modalities via Structured Latent Projection","abstract":"Multimodal remote sensing data provide complementary information for semantic segmentation, but in real-world deployments, some modalities may be unavailable due to sensor failures, acquisition issues, or challenging atmospheric conditions. Existing multimodal segmentation models typically address missing modalities by learning a shared representation across inputs. 
However, this approach can introduce a trade-off by compromising modality-specific complementary information and reducing performance when all modalities are available. In this paper, we tackle this limitation with CBC-SLP, a multimodal semantic segmentation model designed to preserve both modality-invariant and modality-specific information. Inspired by the theoretical results on modality alignment, which state that perfectly aligned multimodal representations can lead to sub-optimal performance in downstream prediction tasks, we propose a novel structured latent projection approach as an architectural inductive bias. Rather than enforcing this strategy through a loss term, we incorporate it directly into the architecture. In particular, to use the complementary information effectively while maintaining robustness under random modality dropout, we structure the latent representations into shared and modality-specific components and adaptively transfer them to the decoder according to the random modality availability mask. Extensive experiments on three multimodal remote sensing image sets demonstrate that CBC-SLP consistently outperforms state-of-the-art multimodal models across full and missing modality scenarios. Besides, we empirically demonstrate that the proposed strategy can recover the complementary information that may not be preserved in a shared representation. 
The code is available at https://github.com/iremulku/Multispectral-Semantic-Segmentation-via-Structured-Latent-Projection-CBC-SLP-.","published_date":"2026-04-17T09:05:22+00:00","viability_score":7,"cluster_label":"Multimodal Segmentation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CBC-SLP is a robust multimodal semantic segmentation model that preserves modality-specific information for improved performance under missing or full modality scenarios.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15851v1","title":"DPrivBench: Benchmarking LLMs' Reasoning for Differential Privacy","abstract":"Differential privacy (DP) has a wide range of applications for protecting data privacy, but designing and verifying DP algorithms requires expert-level reasoning, creating a high barrier for non-expert practitioners. Prior works either rely on specialized verification languages that demand substantial domain expertise or remain semi-automated and require human-in-the-loop guidance. In this work, we investigate whether large language models (LLMs) can automate DP reasoning. We introduce DPrivBench, a benchmark in which each instance asks whether a function or algorithm satisfies a stated DP guarantee under specified assumptions. The benchmark is carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning through trivial pattern matching. Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities. Through further analytic study and failure-mode analysis, we identify several promising directions for improving automated DP reasoning. 
Our benchmark provides a solid foundation for developing and evaluating such methods, and complements existing benchmarks for mathematical reasoning.","published_date":"2026-04-17T09:03:11+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark for evaluating LLMs' ability to reason about differential privacy, identifying gaps and guiding future development.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15839v1","title":"Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4","abstract":"Most ATP benchmarks embed the final answer within the formal statement -- a convention we call \"Easy Mode\" -- a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting \"Hard Mode\": the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. 
DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode -- while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are uniquely suited to measure.","published_date":"2026-04-17T08:40:48+00:00","viability_score":7,"cluster_label":"Agentic Framework for Automated Theorem Proving","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Discover And Prove (DAP) is an open-source agentic framework for Hard Mode automated theorem proving in Lean 4 that discovers the answer with LLM reasoning, then rewrites the statement into Easy Mode for existing provers.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.15837v1","title":"Stein Variational Black-Box Combinatorial Optimization","abstract":"Combinatorial black-box optimization in high-dimensional settings demands a careful trade-off between exploiting promising regions of the search space and preserving sufficient exploration to identify multiple optima. Although Estimation-of-Distribution Algorithms (EDAs) provide a powerful model-based framework, they often concentrate on a single region of interest, which may result in premature convergence when facing complex or multimodal objective landscapes. In this work, we incorporate the Stein operator to introduce a repulsive mechanism among particles in the parameter space, thereby encouraging the population to disperse and jointly explore several modes of the fitness landscape. Empirical evaluations across diverse benchmark problems show that the proposed method achieves performance competitive with, and in several cases superior to, leading state-of-the-art approaches, particularly on large-scale instances. 
These findings highlight the potential of Stein variational gradient descent as a promising direction for addressing large, computationally expensive, discrete black-box optimization problems.","published_date":"2026-04-17T08:40:17+00:00","viability_score":5,"cluster_label":"Combinatorial Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel approach to combinatorial black-box optimization using Stein variational methods to improve exploration and find multiple optima.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.15822v1","title":"ECG-Lens: Benchmarking ML & DL Models on PTB-XL Dataset","abstract":"Automated classification of electrocardiogram (ECG) signals is a useful tool for diagnosing and monitoring cardiovascular diseases. This study compares three traditional machine learning algorithms (Decision Tree Classifier, Random Forest Classifier, and Logistic Regression) and three deep learning models (Simple Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), and Complex CNN (ECGLens)) for the classification of ECG signals from the PTB-XL dataset, which contains 12-lead recordings from normal patients and patients with various cardiac conditions. The DL models were trained on raw ECG signals, allowing them to automatically extract discriminative features. Data augmentation using the Stationary Wavelet Transform (SWT) was applied to enhance model performance, increase the diversity of training samples, and preserve the essential characteristics of the ECG signals. The models were evaluated using multiple metrics, including accuracy, precision, recall, F1-score, and ROC-AUC. The ECG-Lens model achieved the highest performance, with 80% classification accuracy and a 90% ROC-AUC. 
These findings demonstrate that deep learning architectures, particularly complex CNNs substantially outperform traditional ML methods on raw 12-lead ECG data, and provide a practical benchmark for selecting automated ECG classification models and identifying directions for condition-specific model development.","published_date":"2026-04-17T08:20:44+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ECG-Lens: A benchmark and deep learning model for automated ECG classification, outperforming traditional methods.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15808v1","title":"Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI","abstract":"Spatial reasoning and visual grounding are core capabilities for vision-language models (VLMs), yet most medical VLMs produce predictions without transparent reasoning or spatial evidence. Existing benchmarks also evaluate VLMs on isolated 2D images, overlooking the volumetric nature of clinical imaging, where findings can span multiple frames or appear on only a few slices. We introduce Spatially Grounded MRI Visual Question Answering (SGMRI-VQA), a 41,307-pair benchmark for multi-frame, spatially grounded reasoning on volumetric MRI. Built from expert radiologist annotations in the fastMRI+ dataset across brain and knee studies, each QA pair includes a clinician-aligned chain-of-thought trace with frame-indexed bounding box coordinates. Tasks are organized hierarchically across detection, localization, counting/classification, and captioning, requiring models to jointly reason about what is present, where it is, and across which frames it extends. 
We benchmark 10 VLMs and show that supervised fine-tuning of Qwen3-VL-8B with bounding box supervision consistently improves grounding performance over strong zero-shot baselines, indicating that targeted spatial supervision is an effective path toward grounded clinical reasoning.","published_date":"2026-04-17T08:06:39+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark and fine-tuning approach for multi-frame, spatially grounded reasoning in volumetric MRI, improving medical VLM performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15805v1","title":"From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation","abstract":"Learning robust robot policies in real-world environments requires diverse data augmentation, yet scaling real-world data collection is costly due to the need for acquiring physical assets and reconfiguring environments. Therefore, augmenting real-world scenes into simulation has become a practical augmentation for efficient learning and evaluation. We present a generative framework that establishes a generative real-to-sim mapping from real-world panoramas to high-fidelity simulation scenes, and further synthesize diverse cousin scenes via semantic and geometric editing. Combined with high-quality physics engines and realistic assets, the generated scenes support interactive manipulation tasks. Additionally, we incorporate multi-room stitching to construct consistent large-scale environments for long-horizon navigation across complex layouts. 
Experiments demonstrate a strong sim-to-real correlation validating our platform's fidelity, and show that extensively scaling up data generation leads to significantly better generalization to unseen scene and object variations, demonstrating the effectiveness of Digital Cousins for generalizable robot learning and evaluation.","published_date":"2026-04-17T08:06:26+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A generative framework for creating high-fidelity robot simulation environments from real-world data, enabling better generalization and evaluation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15800v1","title":"From Intention to Text: AI-Supported Goal Setting in Academic Writing","abstract":"This study presents WriteFlow, an AI voice-based writing assistant designed to support reflective academic writing through goal-oriented interaction. Academic writing involves iterative reflection and evolving goal regulation, yet prior research and a formative study with 17 participants show that writers often struggle to articulate and manage changing goals. While commonly used AI writing tools emphasize efficiency, they offer limited support for metacognition and writer agency. WriteFlow frames AI interaction as a dialogic space for ongoing goal articulation, monitoring, and negotiation grounded in writers' intentions. Findings from a Wizard-of-Oz study with 12 expert users show that WriteFlow scaffolds metacognitive regulation and reflection-in-action by supporting iterative goal refinement, maintaining goal-text alignment during drafting, and prompting evaluation of goal fulfillment. 
We discuss design implications for AI writing systems that prioritize reflective dialogue, flexible goal structures, and multi-perspective feedback to support intentional and agentic writing.","published_date":"2026-04-17T08:01:52+00:00","viability_score":3,"cluster_label":"AI Writing Assistants","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"WriteFlow is a voice-based AI assistant that supports academic writers in articulating and managing their writing goals through dialogic interaction.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15794v1","title":"Self-Distillation as a Performance Recovery Mechanism for LLMs: Counteracting Compression and Catastrophic Forgetting","abstract":"Large Language Models (LLMs) have achieved remarkable success, underpinning diverse AI applications. However, they often suffer from performance degradation due to factors such as catastrophic forgetting during Supervised Fine-Tuning (SFT), quantization, and pruning. In this work, we introduce a performance recovery framework based on Self-Distillation Fine-Tuning (SDFT) that effectively restores model capabilities. Complementing this practical contribution, we provide a rigorous theoretical explanation for the underlying recovery mechanism. We posit that an LLM's generative capability fundamentally relies on the high-dimensional manifold constructed by its hidden layers. To investigate this, we employ Centered Kernel Alignment (CKA) to quantify the alignment between student and teacher activation trajectories, leveraging its invariance to orthogonal transformations and scaling. Our experiments demonstrate a strong correlation between performance recovery and manifold alignment, substantiating the claim that self-distillation effectively aligns the student's high-dimensional manifold with the optimal structure represented by the teacher. 
This study bridges the gap between practical recovery frameworks and geometric representation theory, offering new insights into the internal mechanisms of self-distillation.","published_date":"2026-04-17T07:55:38+00:00","viability_score":7,"cluster_label":"LLM Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A self-distillation framework to recover performance in compressed or fine-tuned LLMs by aligning high-dimensional activation manifolds.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.15787v1","title":"EVIL: Evolving Interpretable Algorithms for Zero-Shot Inference on Event Sequences and Time Series with LLMs","abstract":"We introduce EVIL (\\textbf{EV}olving \\textbf{I}nterpretable algorithms with \\textbf{L}LMs), an approach that uses LLM-guided evolutionary search to discover simple, interpretable algorithms for dynamical systems inference. Rather than training neural networks on large datasets, EVIL evolves pure Python/NumPy programs that perform zero-shot, in-context inference across datasets. We apply EVIL to three distinct tasks: next-event prediction in temporal point processes, rate matrix estimation for Markov jump processes, and time series imputation. In each case, a single evolved algorithm generalizes across all evaluation datasets without per-dataset training (analogous to an amortized inference model). To the best of our knowledge, this is the first work to show that LLM-guided program evolution can discover a single compact inference function for these dynamical-systems problems. 
Across the three domains, the discovered algorithms are often competitive with, and even outperform, state-of-the-art deep learning models while being orders of magnitudes faster, and remaining fully interpretable.","published_date":"2026-04-17T07:48:17+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LLM-guided evolutionary search to discover simple, interpretable algorithms for dynamical systems inference, outperforming deep learning models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15777v1","title":"SegMix:Shuffle-based Feedback Learning for Semantic Segmentation of Pathology Images","abstract":"Segmentation is a critical task in computational pathology, as it identifies areas affected by disease or abnormal growth and is essential for diagnosis and treatment. However, acquiring high-quality pixel-level supervised segmentation data requires significant workload demands from experienced pathologists, limiting the application of deep learning. To overcome this challenge, relaxing the label conditions to image-level classification labels allows for more data to be used and more scenarios to be enabled. One approach is to leverage Class Activation Map (CAM) to generate pseudo pixel-level annotations for semantic segmentation with only image-level labels. However, this method fails to thoroughly explore the essential characteristics of pathology images, thus identifying only small areas that are insufficient for pseudo masking. In this paper, we propose a novel shuffle-based feedback learning method inspired by curriculum learning to generate higher-quality pseudo-semantic segmentation masks. Specifically, we perform patch level shuffle of pathology images, with the model adaptively adjusting the shuffle strategy based on feedback from previous learning. 
Experimental results demonstrate that our proposed approach outperforms state-of-the-arts on three different datasets.","published_date":"2026-04-17T07:33:54+00:00","viability_score":6,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel shuffle-based feedback learning method for semantic segmentation of pathology images using only image-level labels.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.15776v1","title":"PIIBench: A Unified Multi-Source Benchmark Corpus for Personally Identifiable Information Detection","abstract":"We present PIIBench, a unified benchmark corpus for Personally Identifiable Information (PII) detection in natural language text. Existing resources for PII detection are fragmented across domain-specific corpora with mutually incompatible annotation schemes, preventing systematic comparison of detection systems. We consolidate ten publicly available datasets spanning synthetic PII corpora, multilingual Named Entity Recognition (NER) benchmarks, and financial domain annotated text, yielding a corpus of 2,369,883 annotated sequences and 3.35 million entity mentions across 48 canonical PII entity types. We develop a principled normalization pipeline that maps 80+ source-specific label variants to a standardized BIO tagging scheme, applies frequency-based suppression of near absent entity types, and produces stratified 80/10/10 train/validation/test splits preserving source distribution. To establish baseline difficulty, we evaluate eight published systems spanning rule-based engines (Microsoft Presidio), general purpose NER models (spaCy, BERT-base NER, XLM-RoBERTa NER, SpanMarker mBERT, SpanMarker BERT), a PII-specific model (Piiranha DeBERTa), and a financial NER specialist (XtremeDistil FiNER). All systems achieve span-level F1 below 0.14, with the best system (Presidio, F1=0.1385) still producing zero recall on most entity types. 
These results directly quantify the domain-silo problem and demonstrate that PIIBench presents a substantially harder and more comprehensive evaluation challenge than any existing single source PII dataset. The dataset construction pipeline and benchmark evaluation code are publicly available at https://github.com/pritesh-2711/pii-bench.","published_date":"2026-04-17T07:32:46+00:00","viability_score":8,"cluster_label":"NLP","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PIIBench: A unified benchmark corpus and evaluation code for Personally Identifiable Information detection, revealing significant challenges for existing systems.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.15773v1","title":"Phase Transitions as the Breakdown of Statistical Indistinguishability","abstract":"We introduce a novel characterization of phase transitions based on hypothesis testing.   In our formulation, a phase transition is defined as the breakdown of statistical indistinguishability under vanishing parameter perturbations in the thermodynamic limit.   This perspective provides a general, order-parameter-free framework that does not rely on model-specific insights or learning procedures.   We show that conventional approaches, such as those based on the Binder parameter, can be reinterpreted as special cases within this framework.   
As a concrete realization, we employ a distribution-free two-sample run test and demonstrate that the critical point of the two-dimensional Ising model is accurately identified without prior knowledge of the order parameter.","published_date":"2026-04-17T07:29:07+00:00","viability_score":3,"cluster_label":"Research","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel characterization of phase transitions based on hypothesis testing, defining them as the breakdown of statistical indistinguishability.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15769v1","title":"Closing the Theory-Practice Gap in Spiking Transformers via Effective Dimension","abstract":"Spiking transformers achieve competitive accuracy with conventional transformers while offering $38$-$57\\times$ energy efficiency on neuromorphic hardware, yet no theoretical framework guides their design. This paper establishes the first comprehensive expressivity theory for spiking self-attention. We prove that spiking attention with Leaky Integrate-and-Fire neurons is a universal approximator of continuous permutation-equivariant functions, providing explicit spike circuit constructions including a novel lateral inhibition network for softmax normalization with proven $O(1/\\sqrt{T})$ convergence. We derive tight spike-count lower bounds via rate-distortion theory: $\\varepsilon$-approximation requires $\u03a9(L_f^2 nd/\\varepsilon^2)$ spikes, with rigorous information-theoretic derivation. Our key insight is input-dependent bounds using measured effective dimensions ($d_{\\text{eff}}=47$--$89$ for CIFAR/ImageNet), explaining why $T=4$ timesteps suffice despite worst-case $T \\geq 10{,}000$ predictions. We provide concrete design rules with calibrated constants ($C=2.3$, 95\\% CI: $[1.9, 2.7]$). Experiments on Spikformer, QKFormer, and SpikingResformer across vision and language benchmarks validate predictions with $R^2=0.97$ ($p<0.001$). 
Our framework provides the first principled foundation for neuromorphic transformer design.","published_date":"2026-04-17T07:15:53+00:00","viability_score":7,"cluster_label":"Neuromorphic AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research provides a theoretical framework and design rules for energy-efficient spiking transformers on neuromorphic hardware, validated by experiments.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15768v1","title":"cuNNQS-SCI: A Fully GPU-Accelerated Framework for High-Performance Configuration Interaction Selection with Neural Network Quantum States","abstract":"AI-driven methods have demonstrated considerable success in tackling the central challenge of accurately solving the Schr\u00f6dinger equation for complex many-body systems. Among neural network quantum state (NNQS) approaches, the NNQS-SCI (Selected Configuration Interaction) method stands out as a state-of-the-art technique, recognized for its high accuracy and scalability. However, its application to larger systems is severely constrained by a hybrid CPU-GPU architecture. Specifically, centralized CPU-based global de-duplication creates a severe scalability barrier due to communication bottlenecks, while host-resident coupled-configuration generation induces prohibitive computational overheads. We introduce cuNNQS-SCI, a fully GPU-accelerated SCI framework designed to overcome these bottlenecks. cuNNQS-SCI first integrates a distributed, load-balanced global de-duplication algorithm to minimize redundancy and communication overhead at scale. To address compute limitations, it employs specialized, fine-grained CUDA kernels for exact coupled configuration generation. Finally, to break the single-GPU memory barrier exposed by this full acceleration, it incorporates a GPU memory-centric runtime featuring GPU-side pooling, streaming mini-batches, and overlapped offloading. 
This design enables much larger configuration spaces and shifts the bottleneck from host-side limitations back to on-device inference. Our evaluation demonstrates that cuNNQS-SCI fundamentally expands the scale of solvable problems. On an NVIDIA A100 cluster with 64 GPUs, cuNNQS-SCI achieves up to 2.32X end-to-end speedup over the highly-optimized NNQS-SCI baseline while preserving the same chemical accuracy. Furthermore, it demonstrates excellent distributed performance, maintaining over 90% parallel efficiency in strong scaling tests.","published_date":"2026-04-17T07:15:18+00:00","viability_score":8,"cluster_label":"Quantum Chemistry AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"cuNNQS-SCI is a fully GPU-accelerated framework for high-performance quantum chemistry calculations, achieving significant speedups and enabling larger problem scales.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15764v1","title":"When Do Early-Exit Networks Generalize? A PAC-Bayesian Theory of Adaptive Depth","abstract":"Early-exit neural networks enable adaptive computation by allowing confident predictions to exit at intermediate layers, achieving 2-8$\\times$ inference speedup. Despite widespread deployment, their generalization properties lack theoretical understanding -- a gap explicitly identified in recent surveys. This paper establishes a unified PAC-Bayesian framework for adaptive-depth networks. (1) Novel Entropy-Based Bounds: We prove the first generalization bounds depending on exit-depth entropy $H(D)$ and expected depth $\\mathbb{E}[D]$ rather than maximum depth $K$, with sample complexity $\\mathcal{O}((\\mathbb{E}[D] \\cdot d + H(D))/\u03b5^2)$. (2) Explicit Constructive Constants: Our analysis yields the leading coefficient $\\sqrt{2\\ln 2} \\approx 1.177$ with complete derivation. 
(3) Provable Early-Exit Advantages: We establish sufficient conditions under which adaptive-depth networks strictly outperform fixed-depth counterparts. (4) Extension to Approximate Label Independence: We relax the label-independence assumption to $\u03b5$-approximate policies, broadening applicability to learned routing. (5) Comprehensive Validation: Experiments across 6 architectures on 7 benchmarks demonstrate tightness ratios of 1.52-3.87$\\times$ (all $p < 0.001$) versus $>$100$\\times$ for classical bounds. Bound-guided threshold selection matches validation-tuned performance within 0.1-0.3%.","published_date":"2026-04-17T07:08:33+00:00","viability_score":7,"cluster_label":"Efficient Deep Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research develops a PAC-Bayesian theory for early-exit neural networks, providing generalization bounds and demonstrating provable advantages over fixed-depth models.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15760v1","title":"KWBench: Measuring Unprompted Problem Recognition in Knowledge Work","abstract":"We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it. Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone.   The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. 
Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check. Mandatory criteria encode the predicted wrong paths.   We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approx 83% across models); unconditional scores do not. Same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.","published_date":"2026-04-17T07:04:54+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"KWBench is a new benchmark for evaluating LLMs' unprompted problem recognition in knowledge work, revealing significant gaps in current models' capabilities.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.15750v1","title":"DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference","abstract":"Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive language generation due to their potential for parallel decoding and global refinement of the entire sequence. 
To unlock this potential, DLM inference must carefully balance generation quality and decoding speed. Recent block-wise DLM decoding methods improve this trade-off by performing diffusion-based decoding sequentially in blocks. However, existing methods typically rely on fixed block schedules or current-step local signals to determine block boundaries, and use conservative confidence-based parallel decoding to avoid conflicts, limiting the quality-speed trade-off. In this paper, we argue that block-wise DLM inference requires more suitable signals for its two core decisions: cross-step signals for determining block boundaries, and token-level conflict signals for parallel decoding. Based on this view, we propose DepCap, a training-free framework for efficient block-wise DLM inference. Specifically, DepCap instantiates the cross-step signal as the influence of the last decoded block and uses it to adaptively determine how far the next block should extend, while identifying a conflict-free subset of tokens for safe parallel decoding within each block, enabling substantial inference acceleration with negligible quality degradation. DepCap is a plug-and-play method applicable to various DLMs, and compatible with existing KV-cache strategies for block-wise DLM. An information-theoretic analysis further suggests that the cumulative last-block influence on a candidate block is approximately additive across tokens, supporting the proposed block-partitioning criterion. 
Experimental results show that DepCap achieves favorable speed-quality trade-offs across multiple DLM backbones and reasoning and coding benchmarks, with up to 5.63$\\times$ speedup without significant performance degradation.","published_date":"2026-04-17T06:53:27+00:00","viability_score":7,"cluster_label":"LLM Inference Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DepCap accelerates Diffusion LM inference by adaptively determining block boundaries and enabling parallel decoding, achieving significant speedups with negligible quality loss.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15741v1","title":"Learning Uncertainty from Sequential Internal Dispersion in Large Language Models","abstract":"Uncertainty estimation is a promising approach to detect hallucinations in large language models (LLMs). Recent approaches commonly depend on model internal states to estimate uncertainty. However, they suffer from strict assumptions on how hidden states should evolve across layers, and from information loss by solely focusing on last or mean tokens. To address these issues, we present Sequential Internal Variance Representation (SIVR), a supervised hallucination detection framework that leverages token-wise, layer-wise features derived from hidden states. SIVR adopts a more basic assumption that uncertainty manifests in the degree of dispersion or variance of internal representations across layers, rather than relying on specific assumptions, which makes the method model and task agnostic. It additionally aggregates the full sequence of per-token variance features, learning temporal patterns indicative of factual errors and thereby preventing information loss. Experimental results demonstrate SIVR consistently outperforms strong baselines. 
Most importantly, SIVR enjoys stronger generalisation and avoids relying on large training sets, highlighting the potential for practical deployment. Our code repository is available online at https://github.com/ponhvoan/internal-variance.","published_date":"2026-04-17T06:31:29+00:00","viability_score":8,"cluster_label":"LLM Hallucination Detection","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SIVR detects LLM hallucinations by learning from the dispersion of internal representations across layers, offering a model-agnostic and generalizable solution.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.15735v1","title":"Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval","abstract":"Fine-grained image retrieval via hand-drawn sketches or textual descriptions remains a critical challenge due to inherent modality gaps. While hand-drawn sketches capture complex structural contours, they lack color and texture, which text effectively provides despite omitting spatial contours. Motivated by the complementary nature of these modalities, we propose the Sketch and Text Based Image Retrieval (STBIR) framework. By synergizing the rich color and texture cues from text with the structural outlines provided by sketches, STBIR achieves superior fine-grained retrieval performance. First, a curriculum learning driven robustness enhancement module is proposed to enhance the model's robustness when handling queries of varying quality. Second, we introduce a category-knowledge-based feature space optimization module, thereby significantly boosting the model's representational power. Finally, we design a multi-stage cross-modal feature alignment mechanism to effectively mitigate the challenges of cross modal feature alignment. 
Furthermore, we curate the fine-grained STBIR benchmark dataset to rigorously validate the efficacy of our proposed framework and to provide data support as a reference for subsequent related research. Extensive experiments demonstrate that the proposed STBIR framework significantly outperforms state of the art methods.","published_date":"2026-04-17T06:20:56+00:00","viability_score":8,"cluster_label":"Image Retrieval","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"STBIR fuses sketch contours and text attributes for fine-grained image retrieval, overcoming modality gaps with enhanced robustness and cross-modal alignment.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.15729v1","title":"MambaBack: Bridging Local Features and Global Contexts in Whole Slide Image Analysis","abstract":"Whole Slide Image (WSI) analysis is pivotal in computational pathology, enabling cancer diagnosis by integrating morphological and architectural cues across magnifications. Multiple Instance Learning (MIL) serves as the standard framework for WSI analysis. Recently, Mamba has become a promising backbone for MIL, overtaking Transformers due to its efficiency and global context modeling capabilities originating from Natural Language Processing (NLP). However, existing Mamba-based MIL approaches face three critical challenges: (1) disruption of 2D spatial locality during 1D sequence flattening; (2) sub-optimal modeling of fine-grained local cellular structures; and (3) high memory peaks during inference on resource-constrained edge devices. Studies like MambaOut reveal that Mamba's SSM component is redundant for local feature extraction, where Gated CNNs suffice. Recognizing that WSI analysis demands both fine-grained local feature extraction akin to natural images, and global context modeling akin to NLP, we propose MambaBack, a novel hybrid architecture that harmonizes the strengths of Mamba and MambaOut. 
First, we propose the Hilbert sampling strategy to preserve the 2D spatial locality of tiles within 1D sequences, enhancing the model's spatial perception. Second, we design a hierarchical structure comprising a 1D Gated CNN block based on MambaOut to capture local cellular features, and a BiMamba2 block to aggregate global context, jointly enhancing multi-scale representation. Finally, we implement an asymmetric chunking design, allowing parallel processing during training and chunking-streaming accumulation during inference, minimizing peak memory usage for deployment. Experimental results on five datasets demonstrate that MambaBack outperforms seven state-of-the-art methods. Source code and datasets are publicly available.","published_date":"2026-04-17T06:08:37+00:00","viability_score":8,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MambaBack combines Mamba and Gated CNNs for Whole Slide Image analysis, preserving spatial locality and optimizing memory for efficient pathology diagnostics.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.15728v1","title":"Privacy-Preserving LLMs Routing","abstract":"Large language model (LLM) routing has emerged as a critical strategy to balance model performance and cost-efficiency by dynamically selecting services from various model providers. However, LLM routing adds an intermediate layer between users and LLMs, creating new privacy risks to user data. These privacy risks have not been systematically studied. Although cryptographic techniques such as Secure Multi-Party Computation (MPC) enable privacy-preserving computation, their protocol design and implementation remain under-explored, and na\u00efve implementations typically incur prohibitive computational overhead. To address this, we propose a privacy-preserving LLM routing framework (PPRoute). 
PPRoute includes multiple strategies to speed up encoder inference and nearest neighbor search under the MPC and maintain the quality of LLM routing. First, PPRoute uses MPC-friendly operations to boost the encoder inference. Second, PPRoute uses a multiple-step model training algorithm to maintain routing quality despite the constraints of the encrypted domain. Third, PPRoute proposes an unsorted Top-k algorithm with $O(1)$ communication complexity for secure sorting in model search, significantly reducing communication latency. Across different datasets, PPRoute achieves the performance of plaintext counterparts, while achieving approximately a 20$\\times$ speedup over na\u00efve MPC implementations.","published_date":"2026-04-17T06:02:27+00:00","viability_score":7,"cluster_label":"LLM Routing","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A privacy-preserving LLM routing framework that significantly speeds up inference and search under secure computation, achieving plaintext performance with a 20x speedup over naive implementations.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15727v1","title":"Structured Abductive-Deductive-Inductive Reasoning for LLMs via Algebraic Invariants","abstract":"Large language models exhibit systematic limitations in structured logical reasoning: they conflate hypothesis generation with verification, cannot distinguish conjecture from validated knowledge, and allow weak reasoning steps to propagate unchecked through inference chains. We present a symbolic reasoning scaffold that operationalizes Peirce's tripartite inference -- abduction, deduction, and induction -- as an explicit protocol for LLM-assisted reasoning. 
The framework enforces logical consistency through five algebraic invariants (the Gamma Quintet), the strongest of which -- the Weakest Link bound -- ensures that no conclusion in a reasoning chain can exceed the reliability of its least-supported premise. This principle, independently grounded as weakest link resolution in possibilistic logic and empirically validated for chain-of-thought reasoning, prevents logical inconsistencies from accumulating across multi-step inference. We verify all invariants through a property-based testing suite of 100 properties and 16 fuzz tests over 10^5+ generated cases, providing a verified reference implementation of the invariants suitable as a foundation for future reasoning benchmarks.","published_date":"2026-04-17T05:59:16+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A symbolic reasoning scaffold for LLMs that enforces logical consistency through algebraic invariants, preventing the propagation of weak reasoning steps and ensuring reliable inference chains.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15726v1","title":"LLM Reasoning Is Latent, Not the Chain of Thought","abstract":"This position paper argues that large language model (LLM) reasoning should be studied as latent-state trajectory formation rather than as faithful surface chain-of-thought (CoT). This matters because claims about faithfulness, interpretability, reasoning benchmarks, and inference-time intervention all depend on what the field takes the primary object of reasoning to be. 
We ask what that object should be once three often-confounded factors are separated and formalize three competing hypotheses: H1, reasoning is primarily mediated by latent-state trajectories; H2, reasoning is primarily mediated by explicit surface CoT; and H0, most apparent reasoning gains are better explained by generic serial compute than by any privileged representational object. Reorganizing recent empirical, mechanistic, and survey work under this framework, and adding compute-audited worked exemplars that factorize surface traces, latent interventions, and matched budget expansions, we find that current evidence most strongly supports H1 as a default working hypothesis rather than as a task-independent verdict. We therefore make two recommendations: the field should treat latent-state dynamics as the default object of study for LLM reasoning, and it should evaluate reasoning with designs that explicitly disentangle surface traces, latent states, and serial compute.","published_date":"2026-04-17T05:59:08+00:00","viability_score":3,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper argues that LLM reasoning should be studied as latent-state trajectory formation rather than chain-of-thought, proposing a new framework for evaluation and analysis.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.15725v1","title":"Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing","abstract":"Large Reasoning Models (LRMs) have demonstrated strong capabilities in generating step-by-step reasoning chains alongside final answers, enabling their deployment in high-stakes domains such as healthcare and education. While prior jailbreak attack studies have focused on the safety of final answers, little attention has been given to the safety of the reasoning process. 
In this work, we identify a novel problem that injects harmful content into the reasoning steps while preserving unchanged answers. This type of attack presents two key challenges: 1) manipulating the input instructions may inadvertently alter the LRM's final answer, and 2) the diversity of input questions makes it difficult to consistently bypass the LRM's safety alignment mechanisms and embed harmful content into its reasoning process. To address these challenges, we propose the Psychology-based Reasoning-targeted Jailbreak Attack (PRJA) Framework, which integrates a Semantic-based Trigger Selection module and a Psychology-based Instruction Generation module. Specifically, the proposed PRJA automatically selects manipulative reasoning triggers via semantic analysis and leverages psychological theories of obedience to authority and moral disengagement to generate adaptive instructions for enhancing the LRM's compliance with harmful content generation. Extensive experiments on five question-answering datasets demonstrate that PRJA achieves an average attack success rate of 83.6\\% against several commercial LRMs, including DeepSeek R1, Qwen2.5-Max, and OpenAI o4-mini.","published_date":"2026-04-17T05:56:46+00:00","viability_score":7,"cluster_label":"LLM Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel framework for jailbreaking Large Reasoning Models by injecting harmful content into reasoning steps, achieving an 83.6% attack success rate on commercial models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15723v1","title":"Diffusion Autoencoder for Unsupervised Artifact Restoration in Handheld Fundus Images","abstract":"The advent of handheld fundus imaging devices has made ophthalmologic diagnosis and disease screening more accessible, efficient, and cost-effective. 
However, images captured from these setups often suffer from artifacts such as flash reflections, exposure variations, and motion-induced blur, which degrade image quality and hinder downstream analysis. While generative models have been effective in image restoration, most depend on paired supervision or predefined artifact structures, making them less adaptable to unstructured degradations commonly observed in handheld fundus images. To address this, we propose an unsupervised diffusion autoencoder that integrates a context encoder with the denoising process to learn semantically meaningful representations for artifact restoration. The model is trained only on high-quality table-top fundus images and, at inference, restores artifact-affected handheld acquisitions. We validate the restorations through quantitative and qualitative evaluations, and show that diagnostic accuracy increases to 81.17% on an unseen dataset and across multiple artifact conditions.","published_date":"2026-04-17T05:54:11+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An unsupervised diffusion autoencoder for restoring artifact-affected handheld fundus images, improving diagnostic accuracy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15719v1","title":"The World Leaks the Future: Harness Evolution for Future Prediction Agents","abstract":"Many consequential decisions must be made before the relevant outcome is known. Such problems are commonly framed as \\emph{future prediction}, where an LLM agent must form a prediction for an unresolved question using only the public information available at the prediction time. The setting is difficult because public evidence evolves while useful supervision arrives only after the question is resolved, so most existing approaches still improve mainly from final outcomes. 
Yet final outcomes are too coarse to guide earlier factor tracking, evidence gathering and interpretation, or uncertainty handling. When the same unresolved question is revisited over time, temporal contrasts between earlier and later predictions can expose omissions in the earlier prediction process; we call this signal \\emph{internal feedback}. We introduce \\emph{Milkyway}, a self-evolving agent system that keeps the base model fixed and instead updates a persistent \\emph{future prediction harness} for factor tracking, evidence gathering and interpretation, and uncertainty handling. Across repeated predictions on the same unresolved question, \\emph{Milkyway} extracts internal feedback and writes reusable guidance back into the harness, so later predictions on that question can improve before the outcome is known. After the question is resolved, the final outcome provides a \\emph{retrospective check} before the updated harness is carried forward to subsequent questions. On FutureX and FutureWorld, Milkyway achieves the best overall score among the compared methods, improving FutureX from 44.07 to 60.90 and FutureWorld from 62.22 to 77.96.","published_date":"2026-04-17T05:43:07+00:00","viability_score":4,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A self-evolving agent system that uses internal feedback to improve future predictions before outcomes are known.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15718v1","title":"NeuroLip: An Event-driven Spatiotemporal Learning Framework for Cross-Scene Lip-Motion-based Visual Speaker Recognition","abstract":"Visual speaker recognition based on lip motion offers a silent, hands-free, and behavior-driven biometric solution that remains effective even when acoustic cues are unavailable. 
Compared to traditional methods that rely heavily on appearance-dependent representations, lip motion encodes subject-specific behavioral dynamics driven by consistent articulation patterns and muscle coordination, offering inherent stability across environmental changes. However, capturing these robust, fine-grained dynamics is challenging for conventional frame-based cameras due to motion blur and low dynamic range. To exploit the intrinsic stability of lip motion and address these sensing limitations, we propose NeuroLip, an event-based framework that captures fine-grained lip dynamics under a strict yet practical cross-scene protocol: training is performed under a single controlled condition, while recognition must generalize to unseen viewing and lighting conditions. NeuroLip features a 1) Temporal-aware Voxel Encoding module with adaptive event weighting, 2) Structure-aware Spatial Enhancer that amplifies discriminative behavioral patterns by suppressing noise while preserving vertically structured motion information, and 3) Polarity Consistency Regularization mechanism to retain motion-direction cues encoded in event polarities. To facilitate systematic evaluation, we introduce DVSpeaker, a comprehensive event-based lip-motion dataset comprising 50 subjects recorded under four distinct viewpoint and illumination scenarios. Extensive experiments demonstrate that NeuroLip achieves near-perfect matched-scene accuracy and robust cross-scene generalization, attaining over 71% accuracy on unseen viewpoints and nearly 76% under low-light conditions, outperforming representative existing methods by at least 8.54%. 
The dataset and code are publicly available at https://github.com/JiuZeongit/NeuroLip.","published_date":"2026-04-17T05:42:17+00:00","viability_score":8,"cluster_label":"Biometrics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"NeuroLip: An event-based framework for robust cross-scene lip-motion speaker recognition with a new public dataset.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15715v1","title":"GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows","abstract":"The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose GTA-2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open-ended workflows. Built on real-world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA-Atomic, inherited from our prior GTA benchmark, evaluates short-horizon, closed-ended tool-use precision. (ii) GTA-Workflow introduces long-horizon, open-ended tasks for realistic end-to-end completion. To evaluate open-ended deliverables, we propose a recursive checkpoint-based evaluation mechanism that decomposes objectives into verifiable sub-goals, enabling unified evaluation of both model capabilities and agent execution frameworks (i.e., execution harnesses). Experiments reveal a pronounced capability cliff: while frontier models already struggle on atomic tasks (below 50%), they largely fail on workflows, with top models achieving only 14.39% success. 
Further analysis shows that checkpoint-guided feedback improves performance, while advanced frameworks such as Manus and OpenClaw substantially enhance workflow completion, highlighting the importance of execution harness design beyond the underlying model capacity. These findings provide guidance for developing reliable personal and professional assistants. Dataset and code will be available at https://github.com/open-compass/GTA.","published_date":"2026-04-17T05:36:00+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"GTA-2: A benchmark for General Tool Agents, evaluating atomic tool use and open-ended workflows with real-world authenticity.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15713v1","title":"Just Type It in Isabelle! AI Agents Drafting, Mechanizing, and Generalizing from Human Hints","abstract":"Type annotations are essential when printing terms in a way that preserves their meaning under reparsing and type inference. We study the problem of complete and minimal type annotations for rank-one polymorphic $\u03bb$-calculus terms, as used in Isabelle. Building on prior work by Smolka, Blanchette et al., we give a metatheoretical account of the problem, with a full formal specification and proofs, and formalize it in Isabelle/HOL. 
Our development is a series of experiments featuring human-driven and AI-driven formalization workflows: a human and an LLM-powered AI agent independently produce pen-and-paper proofs, and the AI agent autoformalizes both in Isabelle, with further human-hinted AI interventions refining and generalizing the development.","published_date":"2026-04-17T05:34:49+00:00","viability_score":2,"cluster_label":"AI Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An AI agent assists in formalizing and generalizing mathematical proofs within the Isabelle theorem prover, guided by human hints.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15711v1","title":"SSMamba: A Self-Supervised Hybrid State Space Model for Pathological Image Classification","abstract":"Pathological diagnosis is highly reliant on image analysis, where Regions of Interest (ROIs) serve as the primary basis for diagnostic evidence, while whole-slide image (WSI)-level tasks primarily capture aggregated patterns. To extract these critical morphological features, ROI-level Foundation Models (FMs) based on Vision Transformers (ViTs) and large-scale self-supervised learning (SSL) have been widely adopted. However, three core limitations remain in their application to ROI analysis: (1) cross-magnification domain shift, as fixed-scale pretraining hinders adaptation to diverse clinical settings; (2) inadequate local-global relationship modeling, wherein the ViT backbone of FMs suffers from high computational overhead and imprecise local characterization; (3) insufficient fine-grained sensitivity, as traditional self-attention mechanisms tend to overlook subtle diagnostic cues. To address these challenges, we propose SSMamba, a hybrid SSL framework that enables effective fine-grained feature learning without relying on large external datasets. 
This framework incorporates three domain-adaptive components: Mamba Masked Image Modeling (MAMIM) for mitigating domain shift, a Directional Multi-scale (DMS) module for balanced local-global modeling, and a Local Perception Residual (LPR) module for enhanced fine-grained sensitivity. Employing a two-stage pipeline, SSL pretraining on target ROI datasets followed by supervised fine-tuning (SFT), SSMamba outperforms 11 state-of-the-art (SOTA) pathological FMs on 10 public ROI datasets and surpasses 8 SOTA methods on 6 public WSI datasets. These results validate the superiority of task-specific architectural designs for pathological image analysis.","published_date":"2026-04-17T05:32:28+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A self-supervised hybrid state space model for pathological image classification that outperforms 11 state-of-the-art models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15709v1","title":"Bilevel Optimization of Agent Skills via Monte Carlo Tree Search","abstract":"Agent \\texttt{skills} are structured collections of instructions, tools, and supporting resources that help large language model (LLM) agents perform particular classes of tasks. Empirical evidence shows that the design of \\texttt{skills} can materially affect agent task performance, yet systematically optimizing \\texttt{skills} remains challenging. Since a \\texttt{skill} comprises instructions, tools, and supporting resources in a structured way, optimizing it requires jointly determining both the structure of these components and the content each component contains. This gives rise to a complex decision space with strong interdependence across structure and components. 
We therefore represent these two coupled decisions as \\texttt{skill} structure and component content, and formulate \\texttt{skill} optimization as a bilevel optimization problem. We propose a bilevel optimization framework in which an outer loop employs Monte Carlo Tree Search to determine the \\texttt{skill} structure, while an inner loop refines the component content within the structure selected by the outer loop. In both loops, we employ LLMs to assist the optimization procedure. We evaluate the proposed framework on an open-source Operations Research Question Answering dataset, and the experimental results suggest that the bilevel optimization framework improves the performance of the agents with the optimized \\texttt{skill}.","published_date":"2026-04-17T05:31:40+00:00","viability_score":6,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A bilevel optimization framework using Monte Carlo Tree Search and LLMs to systematically optimize agent skills for improved task performance.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.15695v1","title":"The Price of Paranoia: Robust Risk-Sensitive Cooperation in Non-Stationary Multi-Agent Reinforcement Learning","abstract":"Cooperative equilibria are fragile. When agents learn alongside each other rather than in a fixed environment, the process of learning destabilizes the cooperation they are trying to sustain: every gradient step an agent takes shifts the distribution of actions its partner will play, turning a cooperative partner into a source of stochastic noise precisely where the cooperation decision is most sensitive. 
We study how this co-learning noise propagates through the structure of coordination games, and find that the cooperative equilibrium, even when strongly Pareto-dominant, is exponentially unstable under standard risk-neutral learning, collapsing irreversibly once partner noise crosses the game's critical cooperation threshold. The natural response to apply distributional robustness to hedge against partner uncertainty makes things strictly worse: risk-averse return objectives penalize the high-variance cooperative action relative to defection, widening the instability region rather than shrinking it, a paradox that reveals a fundamental mismatch between the domains where robustness is applied and instability originates. We resolve this by showing that robustness should target the policy gradient update variance induced by partner uncertainty, not the return distribution. This distinction yields an algorithm whose gradient updates are modulated by an online measure of partner unpredictability, provably expanding the cooperation basin in symmetric coordination games. To unify stability, sample complexity, and welfare consequences of this approach, we introduce the Price of Paranoia as the structural dual of the Price of Anarchy. 
Together with a novel Cooperation Window, it precisely characterizes how much welfare learning algorithms can recover under partner noise, pinning down the optimal degree of robustness as a closed-form balance between equilibrium stability and sample efficiency.","published_date":"2026-04-17T04:41:38+00:00","viability_score":1,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical framework to resolve instability in cooperative multi-agent reinforcement learning by targeting policy gradient update variance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15679v1","title":"Hierarchical Active Inference using Successor Representations","abstract":"Active inference, a neurally-inspired model for inferring actions based on the free energy principle (FEP), has been proposed as a unifying framework for understanding perception, action, and learning in the brain. Active inference has previously been used to model ecologically important tasks such as navigation and planning, but scaling it to solve complex large-scale problems in real-world environments has remained a challenge. Inspired by the existence of multi-scale hierarchical representations in the brain, we propose a model for planning of actions based on hierarchical active inference. Our approach combines a hierarchical model of the environment with successor representations for efficient planning. We present results demonstrating (1) how lower-level successor representations can be used to learn higher-level abstract states, (2) how planning based on active inference at the lower-level can be used to bootstrap and learn higher-level abstract actions, and (3) how these learned higher-level abstract states and actions can facilitate efficient planning. 
We illustrate the performance of the approach on several planning and reinforcement learning (RL) problems including a variant of the well-known four rooms task, a key-based navigation task, a partially observable planning problem, the Mountain Car problem, and PointMaze, a family of navigation tasks with continuous state and action spaces. Our results represent, to our knowledge, the first application of learned hierarchical state and action abstractions to active inference in FEP-based theories of brain function.","published_date":"2026-04-17T04:05:37+00:00","viability_score":3,"cluster_label":"AI Planning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hierarchical active inference model using successor representations for efficient planning in complex environments.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.15663v1","title":"CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval","abstract":"Code search, framed as information retrieval (IR), underpins modern software engineering and increasingly powers retrieval-augmented generation (RAG), improving code discovery, reuse, and the reliability of LLM-based coding. Yet existing code IR models remain largely text-centric and often overlook the visual and structural aspects inherent in programming artifacts such as web interfaces, data visualizations, SVGs, schematic diagrams, and UML. To bridge this gap, we introduce MMCoIR, the first comprehensive benchmark for evaluating multimodal code IR across five visual domains, eight programming languages, and eleven libraries, and show the challenge of the task through extensive evaluation. We then propose CodeMMR, a unified retrieval model that jointly embeds natural language, code, and images into a shared semantic space through instruction-based multimodal alignment. 
CodeMMR achieves strong generalization across modalities and languages, outperforming competitive baselines (e.g., UniIR, GME, VLM2Vec) by an average of 10 points on nDCG@10. Moreover, integrating CodeMMR into RAG enhances code generation fidelity and visual grounding on unseen code generation tasks, underscoring the potential of multimodal retrieval as a core enabler for next-generation intelligent programming systems. Datasets are available at HuggingFace.","published_date":"2026-04-17T03:35:35+00:00","viability_score":8,"cluster_label":"Multimodal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CodeMMR unifies natural language, code, and image retrieval for enhanced software engineering and RAG applications.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.15642v1","title":"HYPERHEURIST: A Simulated Annealing-Based Control Framework for LLM-Driven Code Generation in Optimized Hardware Design","abstract":"Large Language Models (LLMs) have shown promising progress for generating Register Transfer Level (RTL) hardware designs, largely because they can rapidly propose alternative architectural realizations. However, single-shot LLM generation struggles to consistently produce designs that are both functionally correct and power-efficient. This paper proposes HYPERHEURIST, a simulated annealing-based control framework that treats LLM-generated RTL as intermediate candidates rather than final designs. The suggested system not only focuses on functionality correctness but also on Power-Performance-Area (PPA) optimization. In the first phase, RTL candidates are filtered through compilation, structural checks, and simulation to identify functionally valid designs. PPA optimization is restricted to RTL designs that have already passed compilation and simulation. 
Evaluated across eight RTL benchmarks, this staged approach yields more stable and repeatable optimization behavior than single-pass LLM-generated RTL.","published_date":"2026-04-17T02:39:20+00:00","viability_score":5,"cluster_label":"AI for Hardware Design","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"HYPERHEURIST uses simulated annealing to control LLM-generated RTL for optimized hardware design.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.15621v1","title":"Rethinking the Necessity of Adaptive Retrieval-Augmented Generation through the Lens of Adaptive Listwise Ranking","abstract":"Adaptive Retrieval-Augmented Generation aims to mitigate the interference of extraneous noise by dynamically determining the necessity of retrieving supplementary passages. However, as Large Language Models evolve with increasing robustness to noise, the necessity of adaptive retrieval warrants re-evaluation. In this paper, we rethink this necessity and propose AdaRankLLM, a novel adaptive retrieval framework. To effectively verify the necessity of adaptive listwise reranking, we first develop an adaptive ranker employing a zero-shot prompt with a passage dropout mechanism, and compare its generation outcomes against static fixed-depth retrieval strategies. Furthermore, to endow smaller open-source LLMs with this precise listwise ranking and adaptive filtering capability, we introduce a two-stage progressive distillation paradigm enhanced by data sampling and augmentation techniques. Extensive experiments across three datasets and eight LLMs demonstrate that AdaRankLLM consistently achieves optimal performance in most scenarios with significantly reduced context overhead. 
Crucially, our analysis reveals a role shift in adaptive retrieval: it functions as a critical noise filter for weaker models to overcome their limitations, while serving as a cost-effective efficiency optimizer for stronger reasoning models.","published_date":"2026-04-17T02:00:52+00:00","viability_score":7,"cluster_label":"Retrieval-Augmented Generation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AdaRankLLM re-evaluates adaptive retrieval for RAG, acting as a noise filter for weaker LLMs and an efficiency optimizer for stronger ones.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.15613v1","title":"VoodooNet: Achieving Analytic Ground States via High-Dimensional Random Projections","abstract":"We present VoodooNet, a non-iterative neural architecture that replaces the stochastic gradient descent (SGD) paradigm with a closed-form analytic solution via Galactic Expansion. By projecting input manifolds into a high-dimensional, high-entropy \"Galactic\" space ($d \\gg 784$), we demonstrate that complex features can be untangled without the thermodynamic cost of backpropagation. Utilizing the Moore-Penrose pseudoinverse to solve for the output layer in a single step, VoodooNet achieves a classification accuracy of \\textbf{98.10\\% on MNIST} and \\textbf{86.63\\% on Fashion-MNIST}. Notably, our results on Fashion-MNIST surpass a 10-epoch SGD baseline (84.41\\%) while reducing the training time by orders of magnitude. We observe a near-logarithmic scaling law between dimensionality and accuracy, suggesting that performance is a function of \"Galactic\" volume rather than iterative refinement. 
This \"Magic Hat\" approach offers a new frontier for real-time Edge AI, where the traditional training phase is bypassed in favor of instantaneous manifold discovery.","published_date":"2026-04-17T01:41:02+00:00","viability_score":4,"cluster_label":"Edge AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel neural architecture bypasses traditional training for real-time manifold discovery and classification on edge devices.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.15611v1","title":"CLIMB: Controllable Longitudinal Brain Image Generation using Mamba-based Latent Diffusion Model and Gaussian-aligned Autoencoder","abstract":"Latent diffusion models have emerged as powerful generative models in medical imaging, enabling the synthesis of high-quality brain magnetic resonance imaging scans. In particular, predicting the evolution of a patient's brain can aid in early intervention, prognosis, and treatment planning. In this study, we introduce CLIMB, Controllable Longitudinal brain Image generation via state space based latent diffusion model, an advanced framework for modeling temporal changes in brain structure. CLIMB is designed to model the structural evolution of the brain over time, utilizing a baseline MRI scan and its acquisition age as foundational inputs. Additionally, multiple conditional variables, including projected age, gender, disease status, genetic information, and brain structure volumes, are incorporated to enhance the temporal modeling of anatomical changes. Unlike existing LDM methods that rely on self-attention modules, which effectively capture contextual information from input images but are computationally expensive, our approach leverages Mamba, a state space model architecture that substantially reduces computational overhead while preserving high-quality image synthesis. 
Furthermore, we introduce a Gaussian-aligned autoencoder that extracts latent representations conforming to prior distributions without the sampling noise inherent in conventional variational autoencoders. We train and evaluate our proposed model on the Alzheimer's Disease Neuroimaging Initiative dataset, consisting of 6,306 MRI scans from 1,390 participants. By comparing generated images with real MRI scans, CLIMB achieves a structural similarity index of 0.9433, demonstrating notable improvements over existing methods.","published_date":"2026-04-17T01:34:41+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A Mamba-based latent diffusion model generates controllable longitudinal brain MRIs, improving temporal modeling for disease progression prediction.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15607v1","title":"Imperfectly Cooperative Human-AI Interactions: Comparing the Impacts of Human and AI Attributes in Simulated and User Studies","abstract":"AI design characteristics and human personality traits each impact the quality and outcomes of human-AI interactions. However, their relative and joint impacts are underexplored in imperfectly cooperative scenarios, where people and AI only have partially aligned goals and objectives. This study compares a purely simulated dataset comprising 2,000 simulations and a parallel human subjects experiment involving 290 human participants to investigate these effects across two scenario categories: (1) hiring negotiations between human job candidates and AI hiring agents; and (2) human-AI transactions wherein AI agents may conceal information to maximize internal goals. We examine user Extraversion and Agreeableness alongside AI design characteristics, including Adaptability, Expertise, and chain-of-thought Transparency. 
Our causal discovery analysis extends performance-focused evaluations by integrating scenario-based outcomes, communication analysis, and questionnaire measures. Results reveal divergences between purely simulated and human study datasets, and between scenario types. In simulation experiments, personality traits and AI attributes were comparatively influential. Yet, with actual human subjects, AI attributes -- particularly transparency -- were much more impactful. We discuss how these divergences vary across different interaction contexts, offering crucial insights for the future of human-centered AI agents.","published_date":"2026-04-17T01:10:34+00:00","viability_score":5,"cluster_label":"Human-AI Interaction","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Investigating the comparative impact of human personality and AI attributes on imperfectly cooperative human-AI interactions.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.15594v1","title":"DataCenterGym: A Physics-Grounded Simulator for Multi-Objective Data Center Scheduling","abstract":"Modern datacenters schedule heterogeneous workloads across geo-distributed sites with diverse compute capacities, electricity prices, and thermal conditions. Compute utilization, heat generation, cooling demand, and energy consumption are tightly coupled, yet most existing schedulers abstract these effects and treat them independently.   We present \\textit{DataCenterGym}, a physics-grounded simulation environment for job scheduling in geo-distributed data centers, designed as a reusable testbed for future research. The simulator integrates compute queueing, building thermal dynamics, localized HVAC behavior, and temperature-dependent service degradation within a Gymnasium-compatible interface. 
We also develop a Hierarchical Model Predictive Control (H-MPC) scheduling algorithm that performs distributed job placement while explicitly accounting for thermal and power dynamics. Through experiments on nominal operation and workload sensitivity, we demonstrate how H-MPC improves scheduling performance relative to baseline schedulers.","published_date":"2026-04-17T00:28:25+00:00","viability_score":6,"cluster_label":"Data Center AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A physics-grounded simulator for multi-objective data center scheduling that integrates thermal and power dynamics.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.15593v1","title":"DALM: A Domain-Algebraic Language Model via Three-Phase Structured Generation","abstract":"Large language models compress heterogeneous knowledge into a single parameter space, allowing facts from different domains to interfere during generation. We propose DALM, a Domain-Algebraic Language Model that replaces unconstrained token generation with structured denoising over a domain lattice. DALM follows a three-phase generation path: it first resolves domain uncertainty, then relation uncertainty, and finally concept uncertainty, so each stage operates under explicit algebraic constraints. The framework requires only three ingredients: a lattice of domains with computable meet, join, and implication; a typing function over relations that controls inheritance across domains; and a fiber partition that localizes knowledge to domain-specific subsets. Given these ingredients, DALM yields a three-phase encoder-decoder architecture in which generation is confined to a domain fiber, cross-domain contamination is structurally prevented in closed-vocabulary mode and auditably bounded in open-vocabulary mode, and a single query can produce a domain-indexed multi-perspective answer space. 
We instantiate the framework with the CDC knowledge representation system and outline training and evaluation on validated domain-annotated crystal libraries. DALM reframes language generation as algebraically constrained structured denoising rather than unconstrained decoding over a flat token space.","published_date":"2026-04-17T00:25:32+00:00","viability_score":2,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel language model architecture that uses structured denoising over a domain lattice to prevent cross-domain contamination and provide domain-indexed multi-perspective answers.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15591v1","title":"BioHiCL: Hierarchical Multi-Label Contrastive Learning for Biomedical Retrieval with MeSH Labels","abstract":"Effective biomedical information retrieval requires modeling domain semantics and hierarchical relationships among biomedical texts. Existing biomedical generative retrievers build on coarse binary relevance signals, limiting their ability to capture semantic overlap. We propose BioHiCL (Biomedical Retrieval with Hierarchical Multi-Label Contrastive Learning), which leverages hierarchical MeSH annotations to provide structured supervision for multi-label contrastive learning. 
Our models, BioHiCL-Base (0.1B) and BioHiCL-Large (0.3B), achieve promising performance on biomedical retrieval, sentence similarity, and question answering tasks, while remaining computationally efficient for deployment.","published_date":"2026-04-17T00:09:01+00:00","viability_score":5,"cluster_label":"Biomedical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A biomedical retrieval system that uses hierarchical MeSH annotations for multi-label contrastive learning to improve semantic understanding and efficiency.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2604.15590v1","title":"CSLE: A Reinforcement Learning Platform for Autonomous Security Management","abstract":"Reinforcement learning is a promising approach to autonomous and adaptive security management in networked systems. However, current reinforcement learning solutions for security management are mostly limited to simulation environments and it is unclear how they generalize to operational systems. In this paper, we address this limitation by presenting CSLE: a reinforcement learning platform for autonomous security management that enables experimentation under realistic conditions. Conceptually, CSLE encompasses two systems. First, it includes an emulation system that replicates key components of the target system in a virtualized environment. We use this system to gather measurements and logs, based on which we identify a system model, such as a Markov decision process. Second, it includes a simulation system where security strategies are efficiently learned through simulations of the system model. The learned strategies are then evaluated and refined in the emulation system to close the gap between theoretical and operational performance. We demonstrate CSLE through four use cases: flow control, replication control, segmentation control, and recovery control. 
Through these use cases, we show that CSLE enables near-optimal security management in an environment that approximates an operational system.","published_date":"2026-04-16T23:59:13+00:00","viability_score":7,"cluster_label":"AI Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A reinforcement learning platform for autonomous security management that bridges simulation and emulation for realistic, adaptive system protection.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.15589v1","title":"LLM attribution analysis across different fine-tuning strategies and model scales for automated code compliance","abstract":"Existing research on large language models (LLMs) for automated code compliance has primarily focused on performance, treating the models as black boxes and overlooking how training decisions affect their interpretive behavior. This paper addresses this gap by employing a perturbation-based attribution analysis to compare the interpretive behaviors of LLMs across different fine-tuning strategies such as full fine-tuning (FFT), low-rank adaptation (LoRA) and quantized LoRA fine-tuning, as well as the impact of model scales which include varying LLM parameter sizes. Our results show that FFT produces attribution patterns that are statistically different and more focused than those from parameter-efficient fine-tuning methods. Furthermore, we found that as model scale increases, LLMs develop specific interpretive strategies such as prioritizing numerical constraints and rule identifiers in the building text, albeit with performance gains in semantic similarity of the generated and reference computer-processable rules plateauing for models larger than 7B. 
This paper provides crucial insights into the explainability of these models, taking a step toward building more transparent LLMs for critical, regulation-based tasks in the Architecture, Engineering, and Construction industry.","published_date":"2026-04-16T23:54:26+00:00","viability_score":4,"cluster_label":"LLM Explainability","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Analyzes LLM attribution patterns across different fine-tuning strategies and model scales to understand their interpretive behavior for automated code compliance.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15588v1","title":"\"Excuse me, may I say something...\" CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations","abstract":"The integration of Large Language Models (LLMs) into scientific workflows presents exciting opportunities to accelerate biomedical discovery. However, the reactive nature of LLMs, which respond only when prompted, limits their effectiveness in collaborative settings that demand foresight and autonomous engagement. In this study, we introduce CoLabScience, a proactive LLM assistant designed to enhance biomedical collaboration between AI systems and human experts through timely, context-aware interventions. At the core of our method is PULI (Positive-Unlabeled Learning-to-Intervene), a novel framework trained with a reinforcement learning objective to determine when and how to intervene in streaming scientific discussions, by leveraging the team's project proposal and long- and short-term conversational memory. To support this work, we introduce BSDD (Biomedical Streaming Dialogue Dataset), a new benchmark of simulated research discussion dialogues with intervention points derived from PubMed articles. 
Experimental results show that PULI significantly outperforms existing baselines in both intervention precision and collaborative task utility, highlighting the potential of proactive LLMs as intelligent scientific assistants.","published_date":"2026-04-16T23:46:22+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CoLabScience is a proactive AI assistant that intervenes in biomedical research discussions to accelerate discovery by leveraging conversational memory and project context.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15585v1","title":"PAWN: Piece Value Analysis with Neural Networks","abstract":"Predicting the relative value of any given chess piece in a position remains an open challenge, as a piece's contribution depends on its spatial relationships with every other piece on the board. We demonstrate that incorporating the state of the full chess board via latent position representations derived using a CNN-based autoencoder significantly improves accuracy for MLP-based piece value prediction architectures. Using a dataset of over 12 million piece-value pairs gathered from Grandmaster-level games, with ground-truth labels generated by Stockfish 17, our enhanced piece value predictor significantly outperforms context-independent MLP-based systems, reducing validation mean absolute error by 16% and predicting relative piece value within approximately 0.65 pawns. 
More generally, our findings suggest that encoding the full problem state as context provides useful inductive bias for predicting the contribution of any individual component.","published_date":"2026-04-16T23:37:01+00:00","viability_score":5,"cluster_label":"Game AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PAWN enhances chess piece value prediction by using CNN-derived latent representations of the full board state, outperforming context-independent models.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.15579v1","title":"Symbolic Guardrails for Domain-Specific Agents: Stronger Safety and Security Guarantees Without Sacrificing Utility","abstract":"AI agents that interact with their environments through tools enable powerful applications, but in high-stakes business settings, unintended actions can cause unacceptable harm, such as privacy breaches and financial loss. Existing mitigations, such as training-based methods and neural guardrails, improve agent reliability but cannot provide guarantees. We study symbolic guardrails as a practical path toward strong safety and security guarantees for AI agents. Our three-part study includes a systematic review of 80 state-of-the-art agent safety and security benchmarks to identify the policies they evaluate, an analysis of which policy requirements can be guaranteed by symbolic guardrails, and an evaluation of how symbolic guardrails affect safety, security, and agent success on $\u03c4^2$-Bench, CAR-bench, and MedAgentBench. We find that 85\\% of benchmarks lack concrete policies, relying instead on underspecified high-level goals or common sense. Among the specified policies, 74\\% of policy requirements can be enforced by symbolic guardrails, often using simple, low-cost mechanisms. These guardrails improve safety and security without sacrificing agent utility. 
Overall, our results suggest that symbolic guardrails are a practical and effective way to guarantee some safety and security requirements, especially for domain-specific AI agents. We release all codes and artifacts at https://github.com/hyn0027/agent-symbolic-guardrails.","published_date":"2026-04-16T23:18:22+00:00","viability_score":8,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Symbolic Guardrails provide strong safety and security guarantees for AI agents interacting with tools, improving reliability without sacrificing utility.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.15577v1","title":"Reward Weighted Classifier-Free Guidance as Policy Improvement in Autoregressive Models","abstract":"Consider an auto-regressive model that produces outputs x (e.g., answers to questions, molecules) each of which can be summarized by an attribute vector y (e.g., helpfulness vs. harmlessness, or bio-availability vs. lipophilicity). An arbitrary reward function r(y) encodes tradeoffs between these properties. Typically, tilting the model's sampling distribution to increase this reward is done at training time via reinforcement learning. However, if the reward function changes, re-alignment requires re-training. In this paper, we show that a reward weighted classifier-free guidance (RCFG) can act as a policy improvement operator in this setting, approximating tilting the sampling distribution by the Q function. We apply RCFG to molecular generation, demonstrating that it can optimize novel reward functions at test time. 
Finally, we show that using RCFG as a teacher and distilling into the base policy to serve as a warm start significantly speeds up convergence for standard RL.","published_date":"2026-04-16T23:13:22+00:00","viability_score":6,"cluster_label":"Generative Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Reward Weighted Classifier-Free Guidance optimizes autoregressive models for novel reward functions at test time, accelerating RL convergence.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.15574v1","title":"Why Fine-Tuning Encourages Hallucinations and How to Fix It","abstract":"Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups, can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. 
Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.","published_date":"2026-04-16T23:08:18+00:00","viability_score":4,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research proposes a self-distillation method to mitigate hallucinations in fine-tuned large language models by addressing interference among semantic representations.","time_to_mvp":"1-2 weeks","tags":["quick_build"]},{"arxiv_id":"2604.15559v1","title":"Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation","abstract":"Recent work on subliminal learning demonstrates that language models can transmit semantic traits through data that is semantically unrelated to those traits. However, it remains unclear whether behavioral traits can transfer in agentic systems, where policies are learned from trajectories rather than static text. In this work, we provide the first empirical evidence that unsafe agent behaviors can transfer subliminally through model distillation across two complementary experimental settings. In our primary setting, we construct a teacher agent exhibiting a strong deletion bias, a tendency to perform destructive file-system actions via an API-style tool interface, and distill it into a student using only trajectories from ostensibly safe tasks, with all explicit deletion keywords rigorously filtered. In our secondary setting, we replicate the threat model in a native Bash environment, replacing API tool calls with shell commands and operationalizing the bias as a preference for issuing chmod as the first permission-related command over semantically equivalent alternatives such as chown or setfacl. Despite full keyword sanitation in both settings, students inherit measurable behavioral biases. 
In the API setting the student's deletion rate reaches 100% (versus a 5% baseline) under homogeneous distillation; in the Bash setting the student's chmod-first rate reaches 30%-55% (versus a 0%-10% baseline), with the strongest transfer observed in large-to-small distillation. Our results demonstrate that explicit data sanitation is an insufficient defense, and behavioral biases are encoded implicitly in trajectory dynamics regardless of the tool interface.","published_date":"2026-04-16T22:23:01+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research shows that unsafe agent behaviors can transfer subliminally through distillation even after explicit keyword sanitation, making data filtering alone an insufficient defense.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15558v1","title":"Preregistered Belief Revision Contracts","abstract":"Deliberative multi-agent systems allow agents to exchange messages and revise beliefs over time. While this interaction is meant to improve performance, it can also create dangerous conformity effects: agreement, confidence, prestige, or majority size may be treated as if they were evidence, producing high-confidence convergence to false conclusions. To address this, we introduce PBRC (Preregistered Belief Revision Contracts), a protocol-level mechanism that strictly separates open communication from admissible epistemic change. A PBRC contract publicly fixes first-order evidence triggers, admissible revision operators, a priority rule, and a fallback policy. A non-fallback step is accepted only when it cites a preregistered trigger and provides a nonempty witness set of externally validated evidence tokens. This ensures that every substantive belief change is both enforceable by a router and auditable after the fact. 
In this paper, (a) we prove that under evidential contracts with conservative fallback, social-only rounds cannot increase confidence and cannot generate purely conformity-driven wrong-but-sure cascades. (b) We show that auditable trigger protocols admit evidential PBRC normal forms that preserve belief trajectories and canonicalized audit traces. (c) We demonstrate that sound enforcement yields epistemic accountability: any change of top hypothesis is attributable to a concrete validated witness set. For token-invariant contracts, (d) we prove that enforced trajectories depend only on token-exposure traces; under flooding dissemination, these traces are characterized exactly by truncated reachability, giving tight diameter bounds for universal evidence closure. Finally, we introduce a companion contractual dynamic doxastic logic to specify trace invariants, and provide simulations illustrating cascade suppression, auditability, and robustness-liveness trade-offs.","published_date":"2026-04-16T22:22:54+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This work introduces a protocol for multi-agent systems that prevents dangerous conformity by strictly separating communication from belief revision, ensuring auditable evidence for all changes.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15554v1","title":"Natural gradient descent with momentum","abstract":"We consider the problem of approximating a function by an element of a nonlinear manifold which admits a differentiable parametrization, typical examples being neural networks with differentiable activation functions or tensor networks. Natural gradient descent (NGD) for the optimization of a loss function can be seen as a preconditioned gradient descent where updates in the parameter space are driven by a functional perspective. 
In a spirit similar to Newton's method, a NGD step uses, instead of the Hessian, the Gram matrix of the generating system of the tangent space to the approximation manifold at the current iterate, with respect to a suitable metric. This corresponds to a locally optimal update in function space, following a projected gradient onto the tangent space to the manifold. Still, both gradient and natural gradient descent methods get stuck in local minima. Furthermore, when the model class is a nonlinear manifold or the loss function is not ideally conditioned (e.g., the KL-divergence for density estimation, or a norm of the residual of a partial differential equation in physics informed learning), even the natural gradient might yield non-optimal directions at each step. This work introduces a natural version of classical inertial dynamic methods like Heavy-Ball or Nesterov and shows how it can improve the learning process when working with nonlinear model classes.","published_date":"2026-04-16T22:09:39+00:00","viability_score":3,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper explores natural gradient descent with momentum to improve learning in nonlinear model classes, addressing issues with local minima and non-optimal directions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15547v1","title":"Consistency Analysis of Sentiment Predictions using Syntactic & Semantic Context Assessment Summarization (SSAS)","abstract":"The fundamental challenge of using Large Language Models (LLMs) for reliable, enterprise-grade analytics, such as sentiment prediction, is the conflict between the LLMs' inherent stochasticity (generative, non-deterministic nature) and the analytical requirement for consistency. The LLM inconsistency, coupled with the noisy nature of chaotic modern datasets, renders sentiment predictions too volatile for strategic business decisions. 
To resolve this, we present a Syntactic & Semantic Context Assessment Summarization (SSAS) framework for establishing context. Context established by SSAS functions as a sophisticated data pre-processing framework that enforces a bounded attention mechanism on LLMs. It achieves this by applying a hierarchical classification structure (Themes, Stories, Clusters) and an iterative Summary-of-Summaries (SoS) based context computation architecture. This endows the raw text with high-signal, sentiment-dense prompts that effectively mitigate both irrelevant data and analytical variance. We empirically evaluated the efficacy of SSAS, using Gemini 2.0 Flash Lite, against a direct-LLM approach across three industry-standard datasets - Amazon Product Reviews, Google Business Reviews, Goodreads Book Reviews - and multiple robustness scenarios. Our results show that our SSAS framework is capable of significantly improving data quality, up to 30%, through a combination of noise removal and improvement in the estimation of sentiment prediction. Ultimately, consistency in our context-estimation capabilities provides a stable and reliable evidence base for decision-making.","published_date":"2026-04-16T21:52:11+00:00","viability_score":5,"cluster_label":"LLM Analytics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework to improve LLM sentiment prediction consistency by up to 30% through context summarization and bounded attention.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.15529v1","title":"LACE: Lattice Attention for Cross-thread Exploration","abstract":"Current large language models reason in isolation. Although it is common to sample multiple reasoning paths in parallel, these trajectories do not interact, and often fail in the same redundant ways. We introduce LACE, a framework that transforms reasoning from a collection of independent trials into a coordinated, parallel process. 
By repurposing the model architecture to enable cross-thread attention, LACE allows concurrent reasoning paths to share intermediate insights and correct one another during inference. A central challenge is the absence of natural training data that exhibits such collaborative behavior. We address this gap with a synthetic data pipeline that explicitly teaches models to communicate and error-correct across threads. Experiments show that this unified exploration substantially outperforms standard parallel search, improving reasoning accuracy by over 7 points. Our results suggest that large language models can be more effective when parallel reasoning paths are allowed to interact.","published_date":"2026-04-16T21:19:35+00:00","viability_score":4,"cluster_label":"LLM Reasoning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"LACE enables parallel LLM reasoning paths to interact and correct each other, improving accuracy by over 7 points.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15514v1","title":"Bureaucratic Silences: What the Canadian AI Register Reveals, Omits, and Obscures","abstract":"In November 2025, the Government of Canada operationalized its commitment to transparency by releasing its first Federal AI Register. In this paper, we argue that such registers are not neutral mirrors of government activity, but active instruments of ontological design that configure the boundaries of accountability. We analyzed the Register's complete dataset of 409 systems using the Algorithmic Decision-Making Adapted for the Public Sector (ADMAPS) framework, combining quantitative mapping with deductive qualitative coding. Our findings reveal a sharp divergence between the rhetoric of \"sovereign AI\" and the reality of bureaucratic practice: while 86\\% of systems are deployed internally for efficiency, the Register systematically obscures the human discretion, training, and uncertainty management required to operate them. 
By privileging technical descriptions over sociotechnical context, the Register constructs an ontology of AI as \"reliable tooling\" rather than \"contestable decision-making.\" We conclude that without a shift in design, such transparency artifacts risk automating accountability into a performative compliance exercise, offering visibility without contestability.","published_date":"2026-04-16T20:48:35+00:00","viability_score":3,"cluster_label":"AI Governance","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Analyzes the Canadian AI Register to reveal how it obscures human discretion and constructs AI as reliable tooling.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15508v1","title":"LLMbench: A Comparative Close Reading Workbench for Large Language Models","abstract":"LLMbench is a browser-based workbench for the comparative close reading of large language model (LLM) outputs. Where existing tools for LLM comparison, such as Google PAIR's LLM Comparator are engineered for quantitative evaluation and user-rating metrics, LLMbench is oriented towards the hermeneutic practices of the digital humanities. Two model responses to the same prompt are side by side in annotatable panels with four analytical overlays (Probabilities for token-level log-probability inspection, Differences for word-level diff across the two panels, Tone for Hyland-style metadiscourse analysis, and Structure for sentence-level parsing with discourse connective highlighting), alongside five analytical modes, Stochastic Variation, Temperature Gradient, Prompt Sensitivity, Token Probabilities, and Cross-Model Divergence, that make the probabilistic structure of generated text legible at the token level. 
The tool treats the generated text as a research object in its own right from a probability distribution, a text that could have been otherwise, and provides visualisations including continuous heatmaps, entropy sparklines, pixel maps, and three-dimensional probability terrains, that show the counterfactual history from which each word emerged. This paper describes the tool's architecture, its six modes, and its design rationale, and argues that log-probability data, currently underused in humanistic and social-scientific readings of AI, is an important resource for critical studies of generative AI models.","published_date":"2026-04-16T20:32:13+00:00","viability_score":4,"cluster_label":"LLM Evaluation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"LLMbench is a browser-based workbench for comparative close reading of LLM outputs, oriented towards digital humanities.","time_to_mvp":"1-3 months","tags":[]},{"arxiv_id":"2604.15505v1","title":"PolicyBank: Evolving Policy Understanding for LLM Agents","abstract":"LLM agents operating under organizational policies must comply with authorization constraints typically specified in natural language. In practice, such specifications inevitably contain ambiguities and logical or semantic gaps that cause the agent's behavior to systematically diverge from the true requirements. We ask: by letting an agent evolve its policy understanding through interaction and corrective feedback from pre-deployment testing, can it autonomously refine its interpretation to close specification gaps? We propose PolicyBank, a memory mechanism that maintains structured, tool-level policy insights and iteratively refines them -- unlike existing memory mechanisms that treat the policy as immutable ground truth, reinforcing \"compliant but wrong\" behaviors. 
We also contribute a systematic testbed by extending a popular tool-calling benchmark with controlled policy gaps that isolate alignment failures from execution failures. While existing memory mechanisms achieve near-zero success on policy-gap scenarios, PolicyBank closes up to 82% of the gap toward a human oracle.","published_date":"2026-04-16T20:29:30+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PolicyBank enhances LLM agents by enabling them to autonomously refine their understanding of organizational policies through interaction and feedback, closing specification gaps.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15499v1","title":"SecureRouter: Encrypted Routing for Efficient Secure Inference","abstract":"Cryptographically secure neural network inference typically relies on secure computing techniques such as Secure Multi-Party Computation (MPC), enabling cloud servers to process client inputs without decrypting them. Although prior privacy-preserving inference systems co-design network optimizations with MPC, they remain slow and costly, limiting real-world deployment. A major bottleneck is their use of a single, fixed transformer model for all encrypted inputs, ignoring that different inputs require different model sizes to balance efficiency and accuracy. We present SecureRouter, an end-to-end encrypted routing and inference framework that accelerates secure transformer inference through input-adaptive model selection under encryption. SecureRouter establishes a unified encrypted pipeline that integrates a secure router with an MPC-optimized model pool, enabling coordinated routing, inference, and protocol execution while preserving full data and model confidentiality. 
The framework includes training-phase and inference-phase components: an MPC-cost-aware secure router that predicts per-model utility and cost from encrypted features, and an MPC-optimized model pool whose architectures and quantization schemes are co-trained to minimize MPC communication and computation overhead. Compared to prior work, SecureRouter achieves a latency reduction by 1.95x with negligible accuracy loss, offering a practical path toward scalable and efficient secure AI inference. Our open-source implementation is available at: https://github.com/UCF-ML-Research/SecureRouter","published_date":"2026-04-16T20:18:12+00:00","viability_score":9,"cluster_label":"Secure AI Inference","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SecureRouter offers efficient secure inference through encrypted model routing.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.15495v1","title":"GIST: Multimodal Knowledge Extraction and Spatial Grounding via Intelligent Semantic Topology","abstract":"Navigating complex, densely packed environments like retail stores, warehouses, and hospitals poses a significant spatial grounding challenge for humans and embodied AI. In these spaces, dense visual features quickly become stale given the quasi-static nature of items, and long-tail semantic distributions challenge traditional computer vision. While Vision-Language Models (VLMs) help assistive systems navigate semantically-rich spaces, they still struggle with spatial grounding in cluttered environments. We present GIST (Grounded Intelligent Semantic Topology), a multimodal knowledge extraction pipeline that transforms a consumer-grade mobile point cloud into a semantically annotated navigation topology. Our architecture distills the scene into a 2D occupancy map, extracts its topological layout, and overlays a lightweight semantic layer via intelligent keyframe and semantic selection. 
We demonstrate the versatility of this structured spatial knowledge through critical downstream Human-AI interaction tasks: (1) an intent-driven Semantic Search engine that actively infers categorical alternatives and zones when exact matches fail; (2) a one-shot Semantic Localizer achieving a 1.04 m top-5 mean translation error; (3) a Zone Classification module that segments the walkable floor plan into high-level semantic regions; and (4) a Visually-Grounded Instruction Generator that synthesizes optimal paths into egocentric, landmark-rich natural language routing. In multi-criteria LLM evaluations, GIST outperforms sequence-based instruction generation baselines. Finally, an in-situ formative evaluation (N=5) yields an 80% navigation success rate relying solely on verbal cues, validating the system's capacity for universal design.","published_date":"2026-04-16T19:59:52+00:00","viability_score":8,"cluster_label":"Embodied AI Navigation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"GIST creates a semantically annotated navigation topology from point cloud data, enabling intelligent search, localization, and natural language routing for embodied AI.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15489v1","title":"A Q-learning-based QoS-aware multipath routing protocol in IoMT-based wireless body area network","abstract":"The Internet of Medical Things (IoMT) enables intelligent healthcare services but faces challenges such as dynamic topology, energy constraints, and diverse QoS requirements. This paper proposes QQMR, a Q-learning-based QoS-aware multipath routing method for WBANs. QQMR classifies data into three priority levels and employs adaptive multi-level queuing and fuzzy C-means clustering to optimize routing decisions. It maintains separate learning policies for each data type and selects primary and backup paths accordingly. 
Experimental results demonstrate improved packet delivery ratio and significant reductions in delay, routing overhead, and energy consumption compared to existing methods.","published_date":"2026-04-16T19:41:49+00:00","viability_score":3,"cluster_label":"IoMT Routing","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A Q-learning-based routing protocol for IoMT networks that improves packet delivery and reduces delay and energy consumption.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15488v1","title":"FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models","abstract":"Large language models (LLMs) often exhibit undesirable behaviors, such as safety violations and hallucinations. Although inference-time steering offers a cost-effective way to adjust model behavior without updating its parameters, existing methods often fail to be simultaneously effective, utility-preserving, and training-efficient due to their rigid, one-size-fits-all designs and limited adaptability. In this work, we present FineSteer, a novel steering framework that decomposes inference-time steering into two complementary stages: conditional steering and fine-grained vector synthesis, allowing fine-grained control over when and how to steer internal representations. In the first stage, we introduce a Subspace-guided Conditional Steering (SCS) mechanism that preserves model utility by avoiding unnecessary steering. In the second stage, we propose a Mixture-of-Steering-Experts (MoSE) mechanism that captures the multimodal nature of desired steering behaviors and generates query-specific steering vectors for improved effectiveness. Through tailored designs in both SCS and MoSE, FineSteer maintains robust performance on general queries while adaptively optimizing steering vectors for targeted inputs in a training-efficient manner. 
Extensive experiments on safety and truthfulness benchmarks show that FineSteer outperforms state-of-the-art methods in overall performance, achieving stronger steering performance with minimal utility loss. Code is available at https://github.com/YukinoAsuna/FineSteer","published_date":"2026-04-16T19:41:41+00:00","viability_score":8,"cluster_label":"LLM Inference Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified framework for fine-grained, cost-effective inference-time steering of LLMs to improve safety and truthfulness with minimal utility loss.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.15482v1","title":"Harmonizing Multi-Objective LLM Unlearning via Unified Domain Representation and Bidirectional Logit Distillation","abstract":"Large Language Models (LLMs) unlearning is crucial for removing hazardous or privacy-leaking information from the model. Practical LLM unlearning demands satisfying multiple challenging objectives simultaneously: removing undesirable knowledge, preserving general utility, avoiding over-refusal of neighboring concepts, and, crucially, ensuring robustness against adversarial probing attacks. However, existing unlearning methods primarily focus on a limited subset of these goals, typically unlearning efficacy and utility preservation while overlooking robustness and boundary behaviors. Naively extending these methods to multi-objective settings may lead to unlearning task interference. We propose a novel multi-objective unlearning framework that harmonizes multiple unlearning objectives through a data and optimization co-design: We standardize training corpora into a unified data representation to reduce the domain gap, and then introduce a bidirectional distillation method that simultaneously elicits desired behavior from a context-instructed teacher while suppressing undesirable behavior in the student model. 
Theoretical and empirical analyses show that our method aligns domain distributions and converts seemingly irrelevant unlearning tasks into cooperative optimization. Evaluation demonstrates state-of-the-art performance, which enables balanced and reliable unlearning across diverse, challenging requirements.","published_date":"2026-04-16T19:09:17+00:00","viability_score":7,"cluster_label":"LLM Unlearning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel framework for multi-objective LLM unlearning that harmonizes knowledge removal, utility preservation, and robustness against adversarial attacks.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.15468v1","title":"The Semi-Executable Stack: Agentic Software Engineering and the Expanding Scope of SE","abstract":"AI-based systems, currently driven largely by LLMs and tool-using agentic harnesses, are increasingly discussed as a possible threat to software engineering. Foundation models get stronger, agents can plan and act across multiple steps, and tasks such as scaffolding, routine test generation, straightforward bug fixing, and small integration work look more exposed than they did only a few years ago. The result is visible unease not only among students and junior developers, but also among experienced practitioners who worry that hard-won expertise may lose value. This paper argues for a different reading. The important shift is not that software engineering loses relevance. It is that the thing being engineered expands beyond executable code to semi-executable artifacts; combinations of natural language, tools, workflows, control mechanisms, and organizational routines whose enactment depends on human or probabilistic interpretation rather than deterministic execution. 
The Semi-Executable Stack is introduced as a six-ring diagnostic reference model for reasoning about that expansion, spanning executable artifacts, instructional artifacts, orchestrated execution, controls, operating logic, and societal and institutional fit. The model helps locate where a contribution, bottleneck, or organizational transition primarily sits, and which adjacent rings it depends on. The paper develops the argument through three worked cases, reframes familiar objections as engineering targets rather than reasons to dismiss the transition, and closes with a preserve-versus-purify heuristic for deciding which legacy software engineering processes, controls, and coordination routines should be kept and which should be simplified or redesigned. This paper is a conceptual keynote companion: diagnostic and agenda-setting rather than empirical.","published_date":"2026-04-16T18:36:02+00:00","viability_score":1,"cluster_label":"Software Engineering","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A conceptual model for understanding the expansion of software engineering beyond executable code to semi-executable artifacts driven by AI agents.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15464v1","title":"Ragged Paged Attention: A High-Performance and Flexible LLM Inference Kernel for TPU","abstract":"Large Language Model (LLM) deployment is increasingly shifting to cost-efficient accelerators like Google's Tensor Processing Units (TPUs), prioritizing both performance and total cost of ownership (TCO). However, existing LLM inference kernels and serving systems remain largely GPU-centric, and there is no well-established approach for efficiently mapping LLM workloads onto TPU architectures--particularly under the dynamic and ragged execution patterns common in modern serving. In this paper, we present Ragged Paged Attention (RPA), a high-performance and flexible attention kernel for TPUs, implemented using Pallas and Mosaic. 
RPA addresses these challenges through three key techniques: (1) fine-grained tiling to enable efficient dynamic slicing over ragged memory, (2) a custom software pipeline that fuses KV cache updates with attention computation, and (3) a distribution-aware compilation strategy that generates specialized kernels for decode, prefill, and mixed workloads. Evaluated on Llama 3 8B on TPU7x, RPA achieves up to 86% memory bandwidth utilization (MBU) in decode and 73% model FLOPs utilization (MFU) in prefill. Integrated as the primary TPU backend in vLLM and SGLang, RPA provides a production-grade foundation for efficient TPU inference and offers practical insights into kernel design.","published_date":"2026-04-16T18:30:13+00:00","viability_score":6,"cluster_label":"LLM Inference Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A high-performance and flexible LLM inference kernel for TPUs that significantly improves memory and FLOPs utilization.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.15460v1","title":"The Crutch or the Ceiling? How Different Generations of LLMs Shape EFL Student Writings","abstract":"The rapid evolution of Large Language Models (LLMs) has made them powerful tools for enhancing student writing. This study explores the extent and limitations of LLMs in assisting secondary-level English as a Foreign Language (EFL) students with their writing tasks. While existing studies focus on output quality, our research examines the developmental shift in LLMs and their impact on EFL students, assessing whether smarter models act as true scaffolds or mere compensatory crutches. To achieve this, we analyse student compositions assisted by LLMs before and after ChatGPT's release, using both expert qualitative scoring and quantitative metrics (readability tests, Pearson's correlation coefficient, MTLD, and others). 
Our results indicate that advanced LLMs boost assessment scores and lexical diversity for lower-proficiency learners, potentially masking their true ability. Crucially, increased LLM assistance correlated negatively with human expert ratings, suggesting surface fluency without deep coherence. To transform AI-assisted practice into genuine learning, pedagogy must shift from focusing on output quality to verifying the learning process. Educators should align AI functions, specifically differentiating ideational scaffolding from textual production, within the learner's Zone of Proximal Development.","published_date":"2026-04-16T18:19:46+00:00","viability_score":3,"cluster_label":"LLM Education","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This study analyzes how different generations of LLMs impact EFL student writing, suggesting pedagogical shifts to ensure genuine learning rather than superficial fluency.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.15459v1","title":"RelativeFlow: Taming Medical Image Denoising Learning with Noisy Reference","abstract":"Medical image denoising (MID) lacks absolutely clean images for supervision, leading to a noisy reference problem that fundamentally limits denoising performance. Existing simulated-supervised discriminative learning (SimSDL) and simulated-supervised generative learning (SimSGL) treat noisy references as clean targets, causing suboptimal convergence or reference-biased learning, while self-supervised learning (SSL) imposes restrictive noise assumptions that are seldom satisfied in realistic MID scenarios. We propose RelativeFlow, a flow matching framework that learns from heterogeneous noisy references and drives inputs from arbitrary quality levels toward a unified high-quality target. 
RelativeFlow reformulates flow matching by decomposing the absolute noise-to-clean mapping into relative noisier-to-noisy mappings, and realizes this formulation through two key components: 1) consistent transport (CoT), a displacement map that constrains relative flows to be components of and progressively compose a unified absolute flow, and 2) simulation-based velocity field (SVF), which constructs a learnable velocity field using modality-specific degradation operators to support different medical imaging modalities. Extensive experiments on Computed Tomography (CT) and Magnetic Resonance (MR) denoising demonstrate that RelativeFlow significantly outperforms existing methods, taming MID with noisy references.","published_date":"2026-04-16T18:18:27+00:00","viability_score":7,"cluster_label":"Medical Image Denoising","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"RelativeFlow is a novel flow matching framework that tames medical image denoising by learning from noisy references, outperforming existing methods on CT and MR scans.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15456v1","title":"DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI","abstract":"Trustworthiness and transparency are essential for the clinical adoption of artificial intelligence (AI) in healthcare and biomedical research. Recent deep research systems aim to accelerate evidence-grounded scientific discovery by integrating AI agents with multi-hop information retrieval, reasoning, and synthesis. However, most existing systems lack explicit and inspectable criteria for evidence appraisal, creating a risk of compounding errors and making it difficult for researchers and clinicians to assess the reliability of their outputs. In parallel, current benchmarking approaches rarely evaluate performance on complex, real-world medical questions. 
Here, we introduce DeepER-Med, a Deep Evidence-based Research framework for Medicine with an agentic AI system. DeepER-Med frames deep medical research as an explicit and inspectable workflow of evidence-based generation, consisting of three modules: research planning, agentic collaboration, and evidence synthesis. To support realistic evaluation, we also present DeepER-MedQA, an evidence-grounded dataset comprising 100 expert-level research questions derived from authentic medical research scenarios and curated by a multidisciplinary panel of 11 biomedical experts. Expert manual evaluation demonstrates that DeepER-Med consistently outperforms widely used production-grade platforms across multiple criteria, including the generation of novel scientific insights. We further demonstrate the practical utility of DeepER-Med through eight real-world clinical cases. Human clinician assessment indicates that DeepER-Med's conclusions align with clinical recommendations in seven cases, highlighting its potential for medical research and decision support.","published_date":"2026-04-16T18:17:24+00:00","viability_score":8,"cluster_label":"Medical AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DeepER-Med is an agentic AI framework and dataset for evidence-based medical research, outperforming existing platforms and showing practical utility in clinical cases.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15453v1","title":"(1D) Ordered Tokens Enable Efficient Test-Time Search","abstract":"Tokenization is a key component of autoregressive (AR) generative models, converting raw data into more manageable units for modeling. Commonly, tokens describe local information, such as regions of pixels in images or word pieces in text, and AR generation predicts these tokens in a fixed order. 
A worthwhile question is whether token structures affect the ability to steer the generation through test-time search, where multiple candidate generations are explored and evaluated by a verifier. Using image generation as our testbed, we hypothesize that recent 1D ordered tokenizers with coarse-to-fine structure can be more amenable to search than classical 2D grid structures. This is rooted in the fact that the intermediate states in coarse-to-fine sequences carry semantic meaning that verifiers can reliably evaluate, enabling effective steering during generation. Through controlled experiments, we find that AR models trained on coarse-to-fine ordered tokens exhibit improved test-time scaling behavior compared to grid-based counterparts. Moreover, we demonstrate that, thanks to the ordered structure, pure test-time search over token sequences (i.e., without training an AR model) can perform training-free text-to-image generation when guided by an image-text verifier. Beyond this, we systematically study how classical search algorithms (best-of-N, beam search, lookahead search) interact with different token structures, as well as the role of different verifiers and AR priors. 
Our results highlight the impact of token structure on inference-time scalability and provide practical guidance for test-time scaling in AR models.","published_date":"2026-04-16T18:13:48+00:00","viability_score":7,"cluster_label":"Generative Image Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research demonstrates that 1D ordered tokens enable efficient test-time search for autoregressive generative models, allowing for training-free text-to-image generation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.15448v1","title":"Transfer Learning from Foundational Optimization Embeddings to Unsupervised SAT Representations","abstract":"Foundational optimization embeddings have recently emerged as powerful pre-trained representations for mixed-integer programming (MIP) problems. These embeddings were shown to enable cross-domain transfer and reduce reliance on solver-generated labels. In this work, we investigate whether such representations generalize beyond optimization to decision problems, focusing on Boolean satisfiability (SAT). We adapt the foundational optimization architecture to SAT by mapping CNF formulas into the same bipartite constraint-variable graph representation used for MIPs. This allows direct reuse of the pre-trained embedding model without architectural changes or supervised fine-tuning. Our results show that these embeddings capture structural regularities in SAT instances and support unsupervised tasks such as instance clustering and distribution identification. We demonstrate, for the first time, that foundational optimization embeddings can transfer to constraint satisfaction domains. 
Our findings are a step toward a unified representational framework for both optimization and decision problems.","published_date":"2026-04-16T18:07:37+00:00","viability_score":5,"cluster_label":"AI for Optimization and Decision Problems","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Leveraging pre-trained optimization embeddings to unlock unsupervised insights in Boolean satisfiability problems.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.14142v1","title":"From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space","abstract":"While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. 
Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.","published_date":"2026-04-15T17:59:01+00:00","viability_score":7,"cluster_label":"AI Optimization and RL","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PreRL leverages reinforcement learning to enhance LLMs' reasoning by optimizing pre-train space, introducing novel Negative Sample Reinforcement for efficient reasoning space pruning.","time_to_mvp":"","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.14140v1","title":"LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning","abstract":"As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. 
Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.","published_date":"2026-04-15T17:58:05+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Introducing LongCoT, a benchmark for evaluating long-horizon chain-of-thought reasoning in LLMs, revealing a significant gap in current model capabilities and providing a rigorous measure for future progress.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.14137v1","title":"From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs","abstract":"Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on ``vibe-testing'': informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. 
These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.","published_date":"2026-04-15T17:57:08+00:00","viability_score":3,"cluster_label":"AI Evaluation and Testing","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Developing a method to formalize user-centric evaluations of LLMs by translating subjective 'vibe-testing' into quantitative metrics.","time_to_mvp":"1-2 weeks","tags":["quick_build"]},{"arxiv_id":"2604.14128v1","title":"Rhetorical Questions in LLM Representations: A Linear Probing Study","abstract":"Rhetorical questions are asked not to seek information but to persuade or signal stance. How large language models internally represent them remains unclear. We analyze rhetorical questions in LLM representations using linear probes on two social-media datasets with different discourse contexts, and find that rhetorical signals emerge early and are most stably captured by last-token representations. Rhetorical questions are linearly separable from information-seeking questions within datasets, and remain detectable under cross-dataset transfer, reaching AUROC around 0.7-0.8. However, we demonstrate that transferability does not simply imply a shared representation. Probes trained on different datasets produce different rankings when applied to the same target corpus, with overlap among the top-ranked instances often below 0.2. Qualitative analysis shows that these divergences correspond to distinct rhetorical phenomena: some probes capture discourse-level rhetorical stance embedded in extended argumentation, while others emphasize localized, syntax-driven interrogative acts. 
Together, these findings suggest that rhetorical questions in LLM representations are encoded by multiple linear directions emphasizing different cues, rather than a single shared direction.","published_date":"2026-04-15T17:50:56+00:00","viability_score":7,"cluster_label":"LLM Representations","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Analyzing how LLMs internally represent rhetorical questions using linear probing, revealing that these signals emerge early and are encoded by multiple, context-dependent directions.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.14125v1","title":"HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System","abstract":"While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in low-level part equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. 
Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.","published_date":"2026-04-15T17:50:07+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hierarchical robotic manipulation system that decouples planning from execution to preserve VLM reasoning while enabling independent component improvement.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.14116v1","title":"TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration","abstract":"While Large Language Models (LLMs) have empowered AI research agents to perform isolated scientific tasks, automating complex, real-world workflows, such as LLM training, remains a significant challenge. In this paper, we introduce TREX, a multi-agent system that automates the entire LLM training life-cycle. By orchestrating collaboration between two core modules-the Researcher and the Executor-the system seamlessly performs requirement analysis, open-domain literature and data research, formulation of training strategies, preparation of data recipes, and model training and evaluation. The multi-round experimental process is modeled as a search tree, enabling the system to efficiently plan exploration paths, reuse historical results, and distill high-level insights from iterative trials. To evaluate the capability of automated LLM training, we construct FT-Bench, a benchmark comprising 10 tasks derived from real-world scenarios, ranging from optimizing fundamental model capabilities to enhancing performance on domain-specific tasks. 
Experimental results demonstrate that the TREX agent consistently optimizes model performance on target tasks.","published_date":"2026-04-15T17:38:06+00:00","viability_score":7,"cluster_label":"AI Automation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"TREX automates and optimizes the lifecycle of LLM fine-tuning using agents and a tree-based exploration approach.","time_to_mvp":"6+ months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.14113v1","title":"UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding","abstract":"GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. 
Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.","published_date":"2026-04-15T17:32:28+00:00","viability_score":9,"cluster_label":"Computer Vision","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Develop an uncertainty-driven adaptive zoom-in tool for more accurate GUI element localization in screenshots.","time_to_mvp":"","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.14089v1","title":"UMI-3D: Extending Universal Manipulation Interface from Vision-Limited to 3D Spatial Perception","abstract":"We present UMI-3D, a multimodal extension of the Universal Manipulation Interface (UMI) for robust and scalable data collection in embodied manipulation. While UMI enables portable, wrist-mounted data acquisition, its reliance on monocular visual SLAM makes it vulnerable to occlusions, dynamic scenes, and tracking failures, limiting its applicability in real-world environments. UMI-3D addresses these limitations by introducing a lightweight and low-cost LiDAR sensor tightly integrated into the wrist-mounted interface, enabling LiDAR-centric SLAM with accurate metric-scale pose estimation under challenging conditions. We further develop a hardware-synchronized multimodal sensing pipeline and a unified spatiotemporal calibration framework that aligns visual observations with LiDAR point clouds, producing consistent 3D representations of demonstrations. Despite maintaining the original 2D visuomotor policy formulation, UMI-3D significantly improves the quality and reliability of collected data, which directly translates into enhanced policy performance. 
Extensive real-world experiments demonstrate that UMI-3D not only achieves high success rates on standard manipulation tasks, but also enables learning of tasks that are challenging or infeasible for the original vision-only UMI setup, including large deformable object manipulation and articulated object operation. The system supports an end-to-end pipeline for data acquisition, alignment, training, and deployment, while preserving the portability and accessibility of the original UMI. All hardware and software components are open-sourced to facilitate large-scale data collection and accelerate research in embodied intelligence: https://umi-3d.github.io.","published_date":"2026-04-15T17:04:34+00:00","viability_score":7,"cluster_label":"Robotics Data Collection","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multimodal extension for robotic manipulation data collection that integrates LiDAR for robust 3D spatial perception.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.14084v1","title":"TIP: Token Importance in On-Policy Distillation","abstract":"On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence, where the student is overconfident and wrong.   Empirically, student entropy is a strong first-order proxy: retaining 50% of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to 47%. But entropy alone misses a second important region. 
When we isolate low-entropy, high-divergence tokens, training on fewer than 10% of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules.   We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher-student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on <20% of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.","published_date":"2026-04-15T16:58:24+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel method for on-policy knowledge distillation that significantly reduces memory usage and training time by intelligently selecting informative tokens, validated on multiple LLM architectures.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.14035v1","title":"First-See-Then-Design: A Multi-Stakeholder View for Optimal Performance-Fairness Trade-Offs","abstract":"Fairness in algorithmic decision-making is often defined in the predictive space, where predictive performance - used as a proxy for decision-maker (DM) utility - is traded off against prediction-based fairness notions, such as demographic parity or equality of opportunity. 
This perspective, however, ignores how predictions translate into decisions and ultimately into utilities and welfare for both DM and decision subjects (DS), as well as their allocation across social-salient groups.   In this paper, we propose a multi-stakeholder framework for fair algorithmic decision-making grounded in welfare economics and distributive justice, explicitly modeling the utilities of both the DM and DS, and defining fairness via a social planner's utility that captures inequalities in DS utilities across groups under different justice-based fairness notions (e.g., Egalitarian, Rawlsian). We formulate fair decision-making as a post-hoc multi-objective optimization problem, characterizing the achievable performance-fairness trade-offs in the two-dimensional utility space of DM utility and the social planner's utility, under different decision policy classes (deterministic vs. stochastic, shared vs. group-specific). Using the proposed framework, we then identify conditions (in terms of the stakeholders' utilities) under which stochastic policies are more optimal than deterministic ones, and empirically demonstrate that simple stochastic policies can yield superior performance-fairness trade-offs by leveraging outcome uncertainty. 
Overall, we advocate a shift from prediction-centric fairness to a transparent, justice-based, multi-stakeholder approach that supports the collaborative design of decision-making policies.","published_date":"2026-04-15T16:15:25+00:00","viability_score":4,"cluster_label":"Fairness in AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new framework for algorithmic decision-making that explicitly models multi-stakeholder utilities to achieve optimal performance-fairness trade-offs.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.14034v1","title":"Large Language Models to Enhance Business Process Modeling: Past, Present, and Future Trends","abstract":"Recent advances in Generative Artificial Intelligence, particularly Large Language Models (LLMs), have stimulated growing interest in automating or assisting Business Process Modeling tasks using natural language. Several approaches have been proposed to transform textual process descriptions into BPMN and related workflow models. However, the extent to which these approaches effectively support complex process modeling in organizational settings remains unclear. This article presents a literature review of AI-driven methods for transforming natural language into BPMN process models, with a particular focus on the role of LLMs. Following a structured review strategy, relevant studies were identified and analyzed to classify existing approaches, examine how LLMs are integrated into text-to-model pipelines, and investigate the evaluation practices used to assess generated models. The analysis reveals a clear shift from rule-based and traditional NLP pipelines toward LLM-based architectures that rely on prompt engineering, intermediate representations, and iterative refinement mechanisms. 
While these approaches significantly expand the capabilities of automated process model generation, the literature also exposes persistent challenges related to semantic correctness, evaluation fragmentation, reproducibility, and limited validation in real-world organizational contexts. Based on these findings, this review identifies key research gaps and discusses promising directions for future research, including the integration of contextual knowledge through Retrieval-Augmented Generation (RAG), its integration with LLMs, the development of interactive modeling architectures, and the need for more comprehensive and standardized evaluation frameworks.","published_date":"2026-04-15T16:15:03+00:00","viability_score":3,"cluster_label":"Business Process Modeling","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A literature review on using LLMs to automate business process modeling, highlighting current trends and future research directions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.14032v1","title":"Hierarchical Reinforcement Learning with Runtime Safety Shielding for Power Grid Operation","abstract":"Reinforcement learning has shown promise for automating power-grid operation tasks such as topology control and congestion management. However, its deployment in real-world power systems remains limited by strict safety requirements, brittleness under rare disturbances, and poor generalization to unseen grid topologies. In safety-critical infrastructure, catastrophic failures cannot be tolerated, and learning-based controllers must operate within hard physical constraints.   This paper proposes a safety-constrained hierarchical control framework for power-grid operation that explicitly decouples long-horizon decision-making from real-time feasibility enforcement. 
A high-level reinforcement learning policy proposes abstract control actions, while a deterministic runtime safety shield filters unsafe actions using fast forward simulation. Safety is enforced as a runtime invariant, independent of policy quality or training distribution.   The proposed framework is evaluated on the Grid2Op benchmark suite under nominal conditions, forced line-outage stress tests, and zero-shot deployment on the ICAPS 2021 large-scale transmission grid without retraining. Results show that flat reinforcement learning policies are brittle under stress, while safety-only methods are overly conservative. In contrast, the proposed hierarchical and safety-aware approach achieves longer episode survival, lower peak line loading, and robust zero-shot generalization to unseen grids.   These results indicate that safety and generalization in power-grid control are best achieved through architectural design rather than increasingly complex reward engineering, providing a practical path toward deployable learning-based controllers for real-world energy systems.","published_date":"2026-04-15T16:11:10+00:00","viability_score":7,"cluster_label":"Reinforcement Learning for Energy","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A safety-constrained hierarchical reinforcement learning framework for power grid operation that ensures runtime safety and robust generalization.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.14025v1","title":"Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective","abstract":"Reconstructing 3D representations from 2D inputs is a fundamental task in computer vision and graphics, serving as a cornerstone for understanding and interacting with the physical world. While traditional methods achieve high fidelity, they are limited by slow per-scene optimization or category-specific training, which hinders their practical deployment and scalability. 
Hence, generalizable feed-forward 3D reconstruction has witnessed rapid development in recent years. By learning a model that maps images directly to 3D representations in a single forward pass, these methods enable efficient reconstruction and robust cross-scene generalization. Our survey is motivated by a critical observation: despite the diverse geometric output representations, ranging from implicit fields to explicit primitives, existing feed-forward approaches share similar high-level architectural patterns, such as image feature extraction backbones, multi-view information fusion mechanisms, and geometry-aware design principles. Consequently, we abstract away from these representation differences and instead focus on model design, proposing a novel taxonomy centered on model design strategies that are agnostic to the output format. Our proposed taxonomy organizes the research directions into five key problems that drive recent research development: feature enhancement, geometry awareness, model efficiency, augmentation strategies and temporal-aware models. To support this taxonomy with empirical grounding and standardized evaluation, we further comprehensively review related benchmarks and datasets, and extensively discuss and categorize real-world applications based on feed-forward 3D models. 
Finally, we outline future directions to address open challenges such as scalability, evaluation standards, and world modeling.","published_date":"2026-04-15T16:07:18+00:00","viability_score":2,"cluster_label":"3D Reconstruction","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A survey proposing a new taxonomy for feed-forward 3D scene modeling, focusing on model design strategies agnostic to output format.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.14016v1","title":"MAny: Merge Anything for Multimodal Continual Instruction Tuning","abstract":"Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present MAny (Merge Anything), a framework that merges task-specific knowledge through Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). Specifically, CPM recovers perceptual alignment by adaptively merging cross-modal visual representations via visual-prototype guidance, ensuring accurate feature recovery during inference. Simultaneously, LPM eliminates mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. By leveraging recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability. 
Notably, MAny operates as a training-free paradigm that achieves knowledge merging via efficient CPU-based algebraic operations, eliminating additional gradient-based optimization beyond initial tuning. Our extensive evaluations confirm the superior performance and robustness of MAny across multiple MLLMs and benchmarks. Specifically, on the UCIT benchmark, MAny achieves significant leads of up to 8.57% and 2.85% in final average accuracy over state-of-the-art methods across two different MLLMs, respectively.","published_date":"2026-04-15T15:57:23+00:00","viability_score":7,"cluster_label":"Multimodal LLMs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MAny merges task-specific knowledge in multimodal LLMs through cross-modal and low-rank parameter merging to prevent catastrophic forgetting without additional training.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.14013v1","title":"Towards Multi-Object-Tracking with Radar on a Fast Moving Vehicle: On the Potential of Processing Radar in the Frequency Domain","abstract":"We promote in this paper the processing of radar data in the frequency domain to achieve higher robustness against noise and structural errors, especially in comparison to feature-based methods. This holds also for high dynamics in the scene, i.e., ego-motion of the vehicle with the sensor plus the presence of an unknown number of other moving objects. In addition to the high robustness, the processing in the frequency domain has the so far neglected advantage that the underlying correlation based methods used for, e.g., registration, provide information about all moving structures in the scene. A typical automotive application case is overtaking maneuvers, which in the context of autonomous racing are used here as a motivating example. 
Initial experiments and results with Fourier SOFT in 2D (FS2D) are presented that use the Boreas dataset to demonstrate radar-only-odometry, i.e., radar-odometry without sensor-fusion, to support our arguments.","published_date":"2026-04-15T15:57:05+00:00","viability_score":4,"cluster_label":"Radar Perception","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Processing radar data in the frequency domain for robust multi-object tracking on fast-moving vehicles, demonstrating radar-only odometry.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.14004v1","title":"Memory Transfer Learning: How Memories are Transferred Across Domains in Coding Agents","abstract":"Memory-based self-evolution has emerged as a promising paradigm for coding agents. However, existing approaches typically restrict memory utilization to homogeneous task domains, failing to leverage the shared infrastructural foundations, such as runtime environments and programming languages, that exist across diverse real-world coding problems. To address this limitation, we investigate Memory Transfer Learning (MTL) by harnessing a unified memory pool from heterogeneous domains. We evaluate performance across 6 coding benchmarks using four memory representations, ranging from concrete traces to abstract insights. Our experiments demonstrate that cross-domain memory improves average performance by 3.7%, primarily by transferring meta-knowledge, such as validation routines, rather than task-specific code. Importantly, we find that abstraction dictates transferability; high-level insights generalize well, whereas low-level traces often induce negative transfer due to excessive specificity. Furthermore, we show that transfer effectiveness scales with the size of the memory pool, and memory can be transferred even between different models. Our work establishes empirical design principles for expanding memory utilization beyond single-domain silos. 
Project page: https://memorytransfer.github.io/","published_date":"2026-04-15T15:50:29+00:00","viability_score":5,"cluster_label":"AI/ML Applications","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Develop a Memory Transfer Learning tool to improve coding agent adaptability across diverse tasks.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.14001v1","title":"Diffusion Language Models for Speech Recognition","abstract":"Diffusion language models have recently emerged as a leading alternative to standard language models, due to their ability for bidirectional attention and parallel text generation. In this work, we explore variants for their use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLM) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM at each decoding step, thereby generating new candidates that combine strong language knowledge from USDM and acoustic information from CTC. Our findings reveal that USDM, as well as MDLM, can significantly improve the accuracy of recognized text. 
We publish all our code and recipes.","published_date":"2026-04-15T15:46:15+00:00","viability_score":7,"cluster_label":"Speech Recognition","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Diffusion language models enhance speech recognition accuracy by integrating acoustic and language information for improved hypothesis rescoring and joint decoding.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13993v1","title":"Reward Design for Physical Reasoning in Vision-Language Models","abstract":"Physical reasoning over visual inputs demands tight integration of visual perception, domain knowledge, and multi-step symbolic inference. Yet even state-of-the-art Vision Language Models (VLMs) fall far short of human performance on physics benchmarks. While post-training algorithms such as Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO) have demonstrated strong reasoning gains in language models, how reward design shapes VLM physical reasoning behavior remains poorly understood. We present a systematic reward ablation study for GRPO-based VLM training on physical reasoning. We compare four reward signals of increasing semantic richness: format compliance, answer accuracy, a composite rubric reward (answer correctness, physics principle identification, and unit consistency), and a novel internal reward derived from model attention weights over input image regions. We evaluate on PhyX, a 3,000-problem benchmark spanning six physics domains and six reasoning types across multiple-choice and open-ended formats, using IBM Granite Vision 3.3 (2B). Across both formats, GRPO with accuracy-based rewards outperforms SFT on most domains, though gains vary substantially by reward type and domain. Reward design does not uniformly improve performance. Instead, it induces domain-specific reasoning behaviors. Accuracy-based rewards provide the strongest overall gains. 
Rubric rewards improve structured reasoning quality without consistent accuracy improvements. Attention-based rewards enhance spatial reasoning while degrading performance in symbolic domains. Our internal attention-weight reward requires no spatial annotations and improves spatial relation accuracy from 0.27 to 0.50, suggesting that supervising where the model attends during generation is a promising direction for visually grounded physical reasoning.","published_date":"2026-04-15T15:36:26+00:00","viability_score":7,"cluster_label":"Vision-Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research systematically investigates reward design for improving physical reasoning in vision-language models, demonstrating accuracy gains through targeted reward signals.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13991v1","title":"Adaptive Conformal Prediction for Improving Factuality of Generations by Large Language Models","abstract":"Large language models (LLMs) are prone to generating factually incorrect outputs. Recent work has applied conformal prediction to provide uncertainty estimates and statistical guarantees for the factuality of LLM generations. However, existing approaches are typically not prompt-adaptive, limiting their ability to capture input-dependent variability. As a result, they may filter out too few items (leading to over-coverage) or too many (under-coverage) for a given task or prompt. We propose an adaptive conformal prediction approach that extends conformal score transformation methods to LLMs, with applications to long-form generation and multiple-choice question answering. This enables prompt-dependent calibration, retaining marginal coverage guarantees while improving conditional coverage. In addition, the approach naturally supports selective prediction, allowing unreliable claims or answer choices to be filtered out in downstream applications. 
We evaluate our approach on multiple white-box models across diverse domains and show that it significantly outperforms existing baselines in terms of conditional coverage.","published_date":"2026-04-15T15:35:42+00:00","viability_score":3,"cluster_label":"LLM Factuality","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An adaptive conformal prediction approach is proposed to improve the factuality of large language model generations by providing prompt-dependent uncertainty estimates.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13979v1","title":"Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs","abstract":"Open-world Question Answering (OW-QA) over knowledge graphs (KGs) aims to answer questions over incomplete or evolving KGs. Traditional KGQA assumes a closed world where answers must exist in the KG, limiting real-world applicability. In contrast, open-world QA requires inferring missing knowledge based on graph structure and context. Large language models (LLMs) excel at language understanding but lack structured reasoning. Graph neural networks (GNNs) model graph topology but struggle with semantic interpretation. Existing systems integrate LLMs with GNNs or graph retrievers. Some support open-world QA but rely on structural embeddings without semantic grounding. Most assume observed paths or complete graphs, making them unreliable under missing links or multi-hop reasoning. We present GLOW, a hybrid system that combines a pre-trained GNN and an LLM for open-world KGQA. The GNN predicts top-k candidate answers from the graph structure. These, along with relevant KG facts, are serialized into a structured prompt (e.g., triples and candidates) to guide the LLM's reasoning. This enables joint reasoning over symbolic and semantic signals, without relying on retrieval or fine-tuning. 
To evaluate generalization, we introduce GLOW-BENCH, a 1,000-question benchmark over incomplete KGs across diverse domains. GLOW outperforms existing LLM-GNN systems on standard benchmarks and GLOW-BENCH, achieving up to 53.3% and an average 38% improvement. GitHub code and data are available.","published_date":"2026-04-15T15:25:25+00:00","viability_score":7,"cluster_label":"AI-driven Knowledge Management","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hybrid LLM-GNN system for advanced question answering over incomplete knowledge graphs.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13977v1","title":"How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data","abstract":"Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled experiments, generating over one trillion tokens, to identify critical factors in rephrasing web text into synthetic pretraining data. Our results reveal that structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Notably, increasing the size of the generator model beyond 1B parameters provides no additional benefit. Our analysis also demonstrates that the selection of the original data used for mixing substantially influences performance. By applying our findings, we develop \\textbf{\\textsc{FinePhrase}}, a 486-billion-token open dataset of rephrased web text. We show that \\textsc{FinePhrase} outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. 
We provide the dataset, all prompts, and the generation framework to the research community.","published_date":"2026-04-15T15:24:59+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A systematic study and open dataset for synthesizing high-quality pretraining data for LLMs, reducing generation costs by up to 30x.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13959v1","title":"[Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI","abstract":"As AI moves from data centers to robots and wearables, scaling ever-larger models becomes insufficient. Physical AI operates under tight latency, energy, privacy, and reliability constraints, and its performance depends not only on model capacity but also on how signals are acquired through controllable sensors in dynamic environments. We present Artificial Tripartite Intelligence (ATI), a bio-inspired, sensor-first architectural contract for physical AI. ATI is tripartite at the systems level: a Brainstem (L1) provides reflexive safety and signal-integrity control, a Cerebellum (L2) performs continuous sensor calibration, and a Cerebral Inference Subsystem spanning L3/L4 supports routine skill selection and execution, coordination, and deep reasoning. This modular organization allows sensor control, adaptive sensing, edge-cloud execution, and foundation model reasoning to co-evolve within one closed-loop architecture, while keeping time-critical sensing and control on device and invoking higher-level inference only when needed. We instantiate ATI in a mobile camera prototype under dynamic lighting and motion. In our routed evaluation (L3-L4 split inference), compared to the default auto-exposure setting, ATI (L1/L2 adaptive sensing) improves end-to-end accuracy from 53.8% to 88% while reducing remote L4 invocations by 43.3%. 
These results show the value of co-designing sensing and inference for embodied AI.","published_date":"2026-04-15T15:10:10+00:00","viability_score":4,"cluster_label":"Physical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A bio-inspired, sensor-first architecture for physical AI that improves end-to-end accuracy and reduces remote inference calls.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13956v1","title":"Creo: From One-Shot Image Generation to Progressive, Co-Creative Ideation","abstract":"Text-to-image (T2I) systems enable rapid generation of high-fidelity imagery but are misaligned with how visual ideas develop. T2I systems generate outputs that make implicit visual decisions on behalf of the user, often introduce fine-grained details that can anchor users prematurely and limit their ability to keep options open early on, and cause unintended changes during editing that are difficult to correct and reduce users' sense of control. To address these concerns, we present Creo, a multi-stage T2I system that scaffolds image generation by progressing from rough sketches to high-resolution outputs, exposing intermediary abstractions where users can make incremental changes. Sketch-like abstractions invite user editing and allow users to keep design options open when ideas are still forming due to their provisional nature. Each stage in Creo can be modified with manual changes and AI-assisted operations, enabling fine-grained, step-wise control through a locking mechanism that preserves prior decisions so subsequent edits affect only specified regions or attributes. Users remain in the loop, making and verifying decisions across stages, while the system applies diffs instead of regenerating full images, reducing drift as fidelity increases. A comparative study with a one-shot baseline shows that participants felt stronger ownership over Creo outputs, as they were able to trace their decisions in building up the image. 
Furthermore, embedding-based analysis indicates that Creo outputs are less homogeneous than one-shot results. These findings suggest that multi-stage generation, combined with intermediate control and decision locking, is a key design principle for improving controllability, user agency, creativity, and output diversity in generative systems.","published_date":"2026-04-15T15:06:46+00:00","viability_score":4,"cluster_label":"Generative AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Creo: A multi-stage text-to-image system that allows progressive, co-creative ideation with user control and decision locking.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13954v1","title":"HINTBench: Horizon-agent Intrinsic Non-attack Trajectory Benchmark","abstract":"Existing agent-safety evaluation has focused mainly on externally induced risks. Yet agents may still enter unsafe trajectories under benign conditions. We study this complementary but underexplored setting through the lens of \\emph{intrinsic} risk, where intrinsic failures remain latent, propagate across long-horizon execution, and eventually lead to high-consequence outcomes. To evaluate this setting, we introduce \\emph{non-attack intrinsic risk auditing} and present \\textbf{HINTBench}, a benchmark of 629 agent trajectories (523 risky, 106 safe; 33 steps on average) supporting three tasks: risk detection, risk-step localization, and intrinsic failure-type identification. Its annotations are organized under a unified five-constraint taxonomy. Experiments reveal a substantial capability gap: strong LLMs perform well on trajectory-level risk detection, but their performance drops to below 35 Strict-F1 on risk-step localization, while fine-grained failure diagnosis proves even harder. Existing guard models transfer poorly to this setting. 
These findings establish intrinsic risk auditing as an open challenge for agent safety.","published_date":"2026-04-15T15:06:01+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"HINTBench: A benchmark and evaluation framework for intrinsic risk in AI agents, revealing significant capability gaps in existing models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13940v1","title":"AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot","abstract":"Scientific peer review faces mounting strain as submission volumes surge, making it increasingly difficult to sustain review quality, consistency, and timeliness. Recent advances in AI have led the community to consider its use in peer review, yet a key unresolved question is whether AI can generate technically sound reviews at real-world conference scale. Here we report the first large-scale field deployment of AI-assisted peer review: every main-track submission at AAAI-26 received one clearly identified AI review from a state-of-the-art system. The system combined frontier models, tool use, and safeguards in a multi-stage process to generate reviews for all 22,977 full-review papers in less than a day. A large-scale survey of AAAI-26 authors and program committee members showed that participants not only found AI reviews useful, but actually preferred them to human reviews on key dimensions such as technical accuracy and research suggestions. We also introduce a novel benchmark and find that our system substantially outperforms a simple LLM-generated review baseline at detecting a variety of scientific weaknesses. 
Together, these results show that state-of-the-art AI methods can already make meaningful contributions to scientific peer review at conference scale, opening a path toward the next generation of synergistic human-AI teaming for evaluating research.","published_date":"2026-04-15T14:51:07+00:00","viability_score":8,"cluster_label":"AI for Scientific Review","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An AI system that generates technically sound peer reviews, preferred by authors and reviewers over human reviews, for large-scale scientific conferences.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13924v1","title":"ASTER: Latent Pseudo-Anomaly Generation for Unsupervised Time-Series Anomaly Detection","abstract":"Time-series anomaly detection (TSAD) is critical in domains such as industrial monitoring, healthcare, and cybersecurity, but it remains challenging due to rare and heterogeneous anomalies and the scarcity of labelled data. This scarcity makes unsupervised approaches predominant, yet existing methods often rely on reconstruction or forecasting, which struggle with complex data, or on embedding-based approaches that require domain-specific anomaly synthesis and fixed distance metrics. We propose ASTER, a framework that generates pseudo-anomalies directly in the latent space, avoiding handcrafted anomaly injections and the need for domain expertise. A latent-space decoder produces tailored pseudo-anomalies to train a Transformer-based anomaly classifier, while a pre-trained LLM enriches the temporal and contextual representations of this space. 
Experiments on three benchmark datasets show that ASTER achieves state-of-the-art performance and sets a new standard for LLM-based TSAD.","published_date":"2026-04-15T14:32:35+00:00","viability_score":7,"cluster_label":"Unsupervised Time-Series Anomaly Detection","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework that generates latent pseudo-anomalies for unsupervised time-series anomaly detection, outperforming state-of-the-art with LLM enrichment.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13899v1","title":"Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection","abstract":"Instruction-tuned LLMs can annotate thousands of instances from a short prompt at negligible cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be labelled at once? We investigate both questions on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labelled, 5,000 human-annotated), comparing seven annotation strategies across four encoders to detect anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels (\\$43) achieves comparable F1-Macro to one trained on 3,800 human annotations (\\$316). Active learning offers little advantage over random sampling in our pre-enriched pool and delivers lower F1 than full LLM annotation at the same cost. However, comparable aggregate F1 masks a systematic difference in error structure: LLM-trained classifiers over-predict the positive class relative to the human gold standard. 
This divergence concentrates in topically ambiguous discussions where the distinction between anti-immigrant hostility and policy critique is most subtle, suggesting that annotation strategy should be guided not by aggregate F1 alone but by the error profile acceptable for the target application.","published_date":"2026-04-15T14:10:58+00:00","viability_score":7,"cluster_label":"LLM Annotation for Active Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LLM-generated annotations achieve comparable F1-Macro to human annotations for hostility detection at a fraction of the cost, with nuanced error profiles.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13891v1","title":"Beyond Conservative Automated Driving in Multi-Agent Scenarios via Coupled Model Predictive Control and Deep Reinforcement Learning","abstract":"Automated driving at unsignalized intersections is challenging due to complex multi-vehicle interactions and the need to balance safety and efficiency. Model Predictive Control (MPC) offers structured constraint handling through optimization but relies on hand-crafted rules that often produce overly conservative behavior. Deep Reinforcement Learning (RL) learns adaptive behaviors from experience but often struggles with safety assurance and generalization to unseen environments. In this study, we present an integrated MPC-RL framework to improve navigation performance in multi-agent scenarios. Experiments show that MPC-RL outperforms standalone MPC and end-to-end RL across three traffic-density levels. Collectively, MPC-RL reduces the collision rate by 21% and improves the success rate by 6.5% compared to pure MPC. We further evaluate zero-shot transfer to a highway merging scenario without retraining. Both MPC-based methods transfer substantially better than end-to-end PPO, which highlights the role of the MPC backbone in cross-scenario robustness. 
The framework also shows faster loss stabilization than end-to-end RL during training, which indicates a reduced learning burden. These results suggest that the integrated approach can improve the balance between safety performance and efficiency in multi-agent intersection scenarios, while the MPC component provides a strong foundation for generalization across driving environments. The implementation code is available open-source.","published_date":"2026-04-15T13:58:38+00:00","viability_score":5,"cluster_label":"Autonomous Driving Control","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An integrated MPC-RL framework for automated driving that balances safety and efficiency in multi-agent scenarios, outperforming standalone methods and showing improved generalization.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13888v1","title":"GeoAgentBench: A Dynamic Execution Benchmark for Tool-Augmented Agents in Spatial Analysis","abstract":"The integration of Large Language Models (LLMs) into Geographic Information Systems (GIS) marks a paradigm shift toward autonomous spatial analysis. However, evaluating these LLM-based agents remains challenging due to the complex, multi-step nature of geospatial workflows. Existing benchmarks primarily rely on static text or code matching, neglecting dynamic runtime feedback and the multimodal nature of spatial outputs. To address this gap, we introduce GeoAgentBench (GABench), a dynamic and interactive evaluation benchmark tailored for tool-augmented GIS agents. GABench provides a realistic execution sandbox integrating 117 atomic GIS tools, encompassing 53 typical spatial analysis tasks across 6 core GIS domains. 
Recognizing that precise parameter configuration is the primary determinant of execution success in dynamic GIS environments, we designed the Parameter Execution Accuracy (PEA) metric, which utilizes a \"Last-Attempt Alignment\" strategy to quantify the fidelity of implicit parameter inference. Complementing this, a Vision-Language Model (VLM) based verification is proposed to assess data-spatial accuracy and cartographic style adherence. Furthermore, to address the frequent task failures caused by parameter misalignments and runtime anomalies, we developed a novel agent architecture, Plan-and-React, that mimics expert cognitive workflows by decoupling global orchestration from step-wise reactive execution. Extensive experiments with seven representative LLMs demonstrate that the Plan-and-React paradigm significantly outperforms traditional frameworks, achieving the optimal balance between logical rigor and execution robustness, particularly in multi-step reasoning and error recovery. Our findings highlight current capability boundaries and establish a robust standard for assessing and advancing the next generation of autonomous GeoAI.","published_date":"2026-04-15T13:55:34+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A dynamic benchmark and agent architecture for evaluating and improving tool-augmented LLMs in complex spatial analysis tasks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13882v1","title":"Evaluating Supervised Machine Learning Models: Principles, Pitfalls, and Metric Selection","abstract":"The evaluation of supervised machine learning models is a critical stage in the development of reliable predictive systems. 
Despite the widespread availability of machine learning libraries and automated workflows, model assessment is often reduced to the reporting of a small set of aggregate metrics, which can lead to misleading conclusions about real-world performance. This paper examines the principles, challenges, and practical considerations involved in evaluating supervised learning algorithms across classification and regression tasks. In particular, it discusses how evaluation outcomes are influenced by dataset characteristics, validation design, class imbalance, asymmetric error costs, and the choice of performance metrics. Through a series of controlled experimental scenarios using diverse benchmark datasets, the study highlights common pitfalls such as the accuracy paradox, data leakage, inappropriate metric selection, and overreliance on scalar summary measures. The paper also compares alternative validation strategies and emphasizes the importance of aligning model evaluation with the intended operational objective of the task. By presenting evaluation as a decision-oriented and context-dependent process, this work provides a structured foundation for selecting metrics and validation protocols that support statistically sound, robust, and trustworthy supervised machine learning systems.","published_date":"2026-04-15T13:44:35+00:00","viability_score":3,"cluster_label":"ML Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for robustly evaluating supervised machine learning models by addressing common pitfalls in metric selection and validation.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13849v1","title":"MCPThreatHive: Automated Threat Intelligence for Model Context Protocol Ecosystems","abstract":"The rapid proliferation of Model Context Protocol (MCP)-based agentic systems has introduced a new category of security threats that existing frameworks are inadequately equipped to address. 
We present MCPThreatHive, an open-source platform that automates the end-to-end lifecycle of MCP threat intelligence: from continuous, multi-source data collection through AI-driven threat extraction and classification, to structured knowledge graph storage and interactive visualization. The platform operationalizes the MCP-38 threat taxonomy, a curated set of 38 MCP-specific threat patterns mapped to STRIDE, OWASP Top 10 for LLM Applications, and OWASP Top 10 for Agentic Applications. A composite risk scoring model provides quantitative prioritization. Through a comparative analysis of representative existing MCP security tools, we identify three critical coverage gaps that MCPThreatHive addresses: incomplete compositional attack modeling, absence of continuous threat intelligence, and lack of unified multi-framework classification.","published_date":"2026-04-15T13:19:22+00:00","viability_score":8,"cluster_label":"Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An automated platform for generating and visualizing threat intelligence for agentic systems, addressing critical security gaps.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13847v1","title":"SparseBalance: Load-Balanced Long Context Training with Dynamic Sparse Attention","abstract":"While sparse attention mitigates the computational bottleneck of long-context LLM training, its distributed training process exhibits extreme heterogeneity in both \\textit{1)} sequence length and \\textit{2)} sparsity sensitivity, leading to a severe imbalance problem and sub-optimal model accuracy. Existing algorithms and training frameworks typically focus on a single issue, failing to systematically co-optimize these two problems. Therefore, we propose SparseBalance, a novel algorithm-system co-design framework, which exploits the sparsity and sequence heterogeneity to optimize model accuracy and system efficiency jointly. 
First, we propose workload-aware dynamic sparsity tuning, which employs a bidirectional sparsity adjustment to eliminate stragglers and exploit inherent bubbles for free accuracy. Second, we propose a sparsity-aware batching strategy to achieve coarse-grained balance, which complements dynamic sparsity tuning. Experimental results demonstrate that SparseBalance achieves up to a 1.33$\\times$ end-to-end speedup while still improving the long-context capability by 0.46\\% on the LongBench benchmark.","published_date":"2026-04-15T13:18:07+00:00","viability_score":6,"cluster_label":"LLM Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel algorithm-system co-design framework for load-balanced long context LLM training that improves accuracy and efficiency.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13826v1","title":"Sentiment analysis for software engineering: How far can zero-shot learning (ZSL) go?","abstract":"Sentiment analysis in software engineering focuses on understanding emotions expressed in software artifacts. Previous research highlighted the limitations of applying general off-the-shelf sentiment analysis tools within the software engineering domain and indicated the need for specialized tools tailored to various software engineering contexts. The development of such tools heavily relies on supervised machine learning techniques that necessitate annotated datasets. Acquiring such datasets is a substantial challenge, as it requires domain-specific expertise and significant effort. Objective: This study explores the potential of ZSL to address the scarcity of annotated datasets in sentiment analysis within software engineering. Method: We conducted an empirical experiment to evaluate the performance of various ZSL techniques, including embedding-based, NLI-based, TARS-based, and generative-based ZSL techniques. 
We assessed the performance of these techniques under different label setups to examine the impact of label configurations. Additionally, we compared the results of the ZSL techniques with state-of-the-art fine-tuned transformer-based models. Finally, we performed an error analysis to identify the primary causes of misclassifications. Results: Our findings demonstrate that ZSL techniques, particularly those combining expert-curated labels with embedding-based or generative-based models, can achieve macro-F1 scores comparable to fine-tuned transformer-based models. The error analysis revealed that subjectivity in annotation and polar facts are the main contributors to ZSL misclassifications. Conclusion: This study demonstrates the potential of ZSL for sentiment analysis in software engineering. ZSL can provide a solution to the challenge of annotated dataset scarcity by reducing reliance on annotated datasets.","published_date":"2026-04-15T12:58:38+00:00","viability_score":5,"cluster_label":"NLP","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Leveraging zero-shot learning to perform sentiment analysis in software engineering, reducing the need for extensive annotated datasets.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.13814v1","title":"Cognitive Offloading in Agile Teams: How Artificial Intelligence Reshapes Risk Assessment and Planning Quality","abstract":"Recent advances in artificial intelligence (AI) have shown promise in automating key aspects of Agile project management, yet their impact on team cognition remains underexplored. In this work, we investigate cognitive offloading in Agile sprint planning by conducting a controlled, three-condition experiment comparing AI-only, human-only, and hybrid planning models on a live client deliverable at a mid-sized digital agency. 
Using quantitative metrics -- including estimation accuracy, rework rates, and scope change recovery time -- alongside qualitative indicators of planning robustness, we evaluate each model's effectiveness beyond raw efficiency. We find that while AI-only planning minimizes time and cost, it significantly degrades risk capture rates and increases rework due to unstated assumptions, whereas human-only planning excels at adaptability but incurs substantial overhead. Drawing on these findings, we propose a theoretical framework for hybrid AI-human sprint planning that assigns algorithmic tools to estimation and backlog formatting while mandating human deliberation for risk assessment and ambiguity resolution. Our results challenge the assumption that efficiency equates to effectiveness, offering actionable governance strategies for organizations seeking to augment rather than erode team cognition.","published_date":"2026-04-15T12:48:29+00:00","viability_score":3,"cluster_label":"AI for Project Management","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigating the impact of AI on cognitive offloading in Agile sprint planning to propose a hybrid AI-human framework.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13812v1","title":"AlphaCNOT: Learning CNOT Minimization with Model-Based Planning","abstract":"Quantum circuit optimization is a central task in Quantum Computing, as current Noisy Intermediate Scale Quantum devices suffer from error propagation that often scales with the number of operations. Among quantum operations, the CNOT gate is of fundamental importance, being the only 2-qubit gate in the universal Clifford+T set. 
The problem of CNOT gate minimization has been addressed by heuristic algorithms such as the well-known Patel-Markov-Hayes (PMH) algorithm for linear reversible synthesis (i.e., CNOT minimization with no topological constraints), and more recently by Reinforcement Learning (RL) based strategies in the more complex case of topology-aware synthesis, where each CNOT can act on a subset of all qubit pairs. In this work we introduce AlphaCNOT, an RL framework based on Monte Carlo Tree Search (MCTS) that effectively addresses the CNOT minimization problem by modeling it as a planning problem. In contrast to other RL-based solutions, our method is model-based, i.e., it can leverage lookahead search to evaluate future trajectories, thus finding more efficient sequences of CNOTs. Our method achieves a reduction of up to 32% in CNOT gate count compared to the PMH baseline on linear reversible synthesis, while in the constrained version we report a consistent gate count reduction on a variety of topologies with up to 8 qubits, with respect to state-of-the-art RL-based solutions. 
Our results suggest the combination of RL with search-based strategies can be applied to different circuit optimization tasks, such as Clifford minimization, thus fostering the transition toward the \"quantum utility\" era.","published_date":"2026-04-15T12:46:40+00:00","viability_score":3,"cluster_label":"Quantum Computing Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Developing AlphaCNOT, a model-based reinforcement learning framework for minimizing CNOT gates in quantum circuits.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13803v1","title":"Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation","abstract":"Vision-language models are increasingly deployed in high-stakes settings, yet their susceptibility to sycophantic manipulation remains poorly understood, particularly in relation to how these models represent visual information internally. Whether models whose visual representations more closely mirror human neural processing are also more resistant to adversarial pressure is an open question with implications for both neuroscience and AI safety. We investigate this question by evaluating 12 open-weight vision-language models spanning 6 architecture families and a 40$\\times$ parameter range (256M--10B) along two axes: brain alignment, measured by predicting fMRI responses from the Natural Scenes Dataset across 8 human subjects and 6 visual cortex regions of interest, and sycophancy, measured through 76,800 two-turn gaslighting prompts spanning 5 categories and 10 difficulty levels. Region-of-interest analysis reveals that alignment specifically in early visual cortex (V1--V3) is a reliable negative predictor of sycophancy ($r = -0.441$, BCa 95\\% CI $[-0.740, -0.031]$), with all 12 leave-one-out correlations negative and the strongest effect for existence denial attacks ($r = -0.597$, $p = 0.040$). 
This anatomically specific relationship is absent in higher-order category-selective regions, suggesting that faithful low-level visual encoding provides a measurable anchor against adversarial linguistic override in vision-language models. We release our code on \\href{https://github.com/aryashah2k/Gaslight-Gatekeep-Sycophantic-Manipulation}{GitHub} and dataset on \\href{https://huggingface.co/datasets/aryashah00/Gaslight-Gatekeep-V1-V3}{Hugging Face}","published_date":"2026-04-15T12:38:51+00:00","viability_score":8,"cluster_label":"AI Safety / Vision-Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Shielding vision-language models from manipulation by aligning their early visual cortex representations with human neural processing.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13780v1","title":"Soft $Q(\u03bb)$: A multi-step off-policy method for entropy regularised reinforcement learning using eligibility traces","abstract":"Soft Q-learning has emerged as a versatile model-free method for entropy-regularised reinforcement learning, optimising for returns augmented with a penalty on the divergence from a reference policy. Despite its success, the multi-step extensions of soft Q-learning remain relatively unexplored and limited to on-policy action sampling under the Boltzmann policy. In this brief research note, we first present a formal $n$-step formulation for soft Q-learning and then extend this framework to the fully off-policy case by introducing a novel Soft Tree Backup operator. Finally, we unify these developments into Soft $Q(\u03bb)$, an elegant online, off-policy, eligibility trace framework that allows for efficient credit assignment under arbitrary behaviour policies. 
Our derivations propose a model-free method for learning entropy-regularised value functions that can be utilised in future empirical experiments.","published_date":"2026-04-15T12:10:45+00:00","viability_score":1,"cluster_label":"Reinforcement Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical framework for multi-step off-policy soft Q-learning using eligibility traces for improved credit assignment in reinforcement learning.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13777v1","title":"From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models","abstract":"Large language models (LLMs) may memorize sensitive or copyrighted content, raising significant privacy and legal concerns. While machine unlearning has emerged as a potential remedy, prevailing paradigms rely on user-provided forget sets, making unlearning requests difficult to audit and exposing systems to secondary leakage and malicious abuse. We propose MAGE, a Memory-grAph Guided Erasure framework for user-minimized, corpus-free unlearning. Given only a lightweight user anchor that identifies a target entity, MAGE probes the target LLM to recover target-related memorization, organizes it into a weighted local memory graph, and synthesizes scoped supervision for unlearning. MAGE is model-agnostic, can be plugged into standard unlearning methods, and requires no access to the original training corpus. Experiments on two benchmarks, TOFU and RWKU, demonstrate that MAGE's self-generated supervision achieves effective unlearning performance comparable to supervision generated with external reference, while preserving overall utility. 
These results support a practical and auditable unlearning workflow driven by minimal anchors rather than user-supplied forget corpora.","published_date":"2026-04-15T12:07:14+00:00","viability_score":7,"cluster_label":"LLM Unlearning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for unlearning sensitive data from LLMs using memory-graph guided synthesis of supervision, minimizing user input and corpus reliance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13763v1","title":"A Dynamic-Growing Fuzzy-Neuro Controller, Application to a 3PSP Parallel Robot","abstract":"To date, various paradigms of soft-Computing have been used to solve many modern problems. Among them, a self organizing combination of fuzzy systems and neural networks can make a powerful decision making system. Here, a Dynamic Growing Fuzzy Neural Controller (DGFNC) is combined with an adaptive strategy and applied to a 3PSP parallel robot position control problem. Specifically, the dynamic growing mechanism is considered in more detail. In contrast to other self-organizing methods, DGFNC adds new rules more conservatively; hence the pruning mechanism is omitted. Instead, the adaptive strategy 'adapts' the control system to parameter variation. Furthermore, a sliding mode-based nonlinear controller ensures system stability. The resulting general control strategy aims to achieve faster response with less computation while maintaining overall stability. Finally, the 3PSP is chosen due to its complex dynamics and the utility of such approaches in modern industrial systems. 
Several simulations support the merits of the proposed DGFNC strategy as applied to the 3PSP robot.","published_date":"2026-04-15T11:48:51+00:00","viability_score":0,"cluster_label":"Robotics Control","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A dynamic-growing fuzzy-neuro controller with an adaptive strategy for precise and stable control of parallel robots.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13759v1","title":"The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents","abstract":"Large language model (LLM) agents on multi-step tasks suffer reasoning degradation, looping, drift, stuck states, at rates up to 30% on hard tasks. Current solutions include hard step limits (abrupt) or LLM-as-judge monitoring (10-15% overhead per step). This paper introduces the Cognitive Companion, a parallel monitoring architecture with two implementations: an LLM-based Companion and a novel zero-overhead Probe-based Companion. We report a three-batch feasibility study centered on Gemma 4 E4B, with an additional exploratory small-model analysis on Qwen 2.5 1.5B and Llama 3.2 1B. In our experiments, the LLM-based Companion reduced repetition on loop-prone tasks by 52-62% with approximately 11% overhead. The Probe-based Companion, trained on hidden states from layer 28, showed a mean effect size of +0.471 at zero measured inference overhead; its strongest probe result achieved cross-validated AUROC 0.840 on a small proxy-labeled dataset. A key empirical finding is that companion benefit appears task-type dependent: companions are most helpful on loop-prone and open-ended tasks, while effects are neutral or negative on more structured tasks. Our small-model experiments also suggest a possible scale boundary: companions did not improve the measured quality proxy on 1B-1.5B models, even when interventions fired. 
Overall, the paper should be read as a feasibility study rather than a definitive validation. The results provide encouraging evidence that sub-token monitoring may be useful, identify task-type sensitivity as a practical design constraint, and motivate selective companion activation as a promising direction for future work.","published_date":"2026-04-15T11:44:20+00:00","viability_score":6,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A parallel monitoring architecture for LLM agents that detects and recovers from reasoning degradation, offering both LLM-based and zero-overhead probe-based companions.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.13757v1","title":"Rethinking AI Hardware: A Three-Layer Cognitive Architecture for Autonomous Agents","abstract":"The next generation of autonomous AI systems will be constrained not only by model capability, but by how intelligence is structured across heterogeneous hardware. Current paradigms -- cloud-centric AI, on-device inference, and edge-cloud pipelines -- treat planning, reasoning, and execution as a monolithic process, leading to unnecessary latency, energy consumption, and fragmented behavioral continuity. We introduce the Tri-Spirit Architecture, a three-layer cognitive framework that decomposes intelligence into planning (Super Layer), reasoning (Agent Layer), and execution (Reflex Layer), each mapped to distinct compute substrates and coordinated via an asynchronous message bus. We formalize the system with a parameterized routing policy, a habit-compilation mechanism that promotes repeated reasoning paths into zero-inference execution policies, a convergent memory model, and explicit safety constraints. We evaluate the architecture in a reproducible simulation of 2000 synthetic tasks against cloud-centric and edge-only baselines. 
Tri-Spirit reduces mean task latency by 75.6 percent and energy consumption by 71.1 percent, while decreasing LLM invocations by 30 percent and enabling 77.6 percent offline task completion. These results suggest that cognitive decomposition, rather than model scaling alone, is a primary driver of system-level efficiency in AI hardware.","published_date":"2026-04-15T11:43:01+00:00","viability_score":2,"cluster_label":"AI Hardware Architecture","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel three-layer cognitive architecture for autonomous agents that decomposes intelligence across heterogeneous hardware to reduce latency and energy consumption.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13737v1","title":"TokenFormer: Unify the Multi-Field and Sequential Recommendation Worlds","abstract":"Recommender systems have historically developed along two largely independent paradigms: feature interaction models for modeling correlations among multi-field categorical features, and sequential models for capturing user behavior dynamics from historical interaction sequences. Although recent trends attempt to bridge these paradigms within shared backbones, we empirically reveal that naive unifying these two branches may lead to a failure mode of Sequential Collapse Propagation (SCP). That is, the interaction with those dimensionally ill non-sequence fields leads to the dimensional collapse of the sequence features. To overcome this challenge, we propose TokenFormer, a unified recommendation architecture with the following innovations. First, we introduce a Bottom-Full-Top-Sliding (BFTS) attention scheme, which applies full self-attention in the lower layers and shrinking-window sliding attention in the upper layers. Second, we introduce a Non-Linear Interaction Representation (NLIR) that applies one-sided non-linear multiplicative transformations to the hidden states. 
Extensive experiments on public benchmarks and Tencent's advertising platform demonstrate state-of-the-art performance, while detailed analyses confirm that TokenFormer significantly improves dimensional robustness and representation discriminability under unified modeling.","published_date":"2026-04-15T11:25:46+00:00","viability_score":7,"cluster_label":"Recommendation Systems","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"TokenFormer unifies multi-field and sequential recommendation models, overcoming sequential collapse propagation with a novel attention scheme and representation method.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13733v1","title":"Jump-Start Reinforcement Learning with Vision-Language-Action Regularization","abstract":"Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. 
VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.","published_date":"2026-04-15T11:17:54+00:00","viability_score":7,"cluster_label":"Robotics RL","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"VLAJS jump-starts reinforcement learning for robotics by using vision-language-action models to bias early exploration, reducing required environment interactions by over 50% on several tasks relative to PPO and distillation-style baselines.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13721v1","title":"FRAGATA: Semantic Retrieval of HPC Support Tickets via Hybrid RAG over 20 Years of Request Tracker History","abstract":"The technical support team of a supercomputing centre accumulates, over the course of decades, a large volume of resolved incidents that constitute critical operational knowledge. At the Galician Supercomputing Center (CESGA) this history has been managed for over twenty years with Request Tracker (RT), whose built-in search engine has significant limitations that hinder knowledge reuse by the support staff. This paper presents Fragata, a semantic ticket search system that combines modern information retrieval techniques with the full RT history. The system can find relevant past incidents regardless of language, the presence of typos, or the specific wording of the query. 
The architecture is deployed on CESGA's infrastructure, supports incremental updates without service interruption, and offloads the most expensive stages to the FinisTerrae III supercomputer. Preliminary results show a substantial qualitative improvement over RT's native search.","published_date":"2026-04-15T10:53:49+00:00","viability_score":4,"cluster_label":"Semantic Search","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Fragata is a semantic search system for HPC support tickets that uses hybrid RAG to improve knowledge reuse and overcome limitations of traditional search engines.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2604.13715v1","title":"Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt","abstract":"Large Audio-Language Models (LALMs) enable general audio understanding and demonstrate remarkable performance across various audio tasks. However, these models still face challenges in temporal perception (e.g., inferring event onset and offset), leading to limited utility in fine-grained scenarios. To address this issue, we propose Audio-Side Time Prompt and leverage Reinforcement Learning (RL) to develop the TimePro-RL framework for fine-grained temporal perception. Specifically, we encode timestamps as embeddings and interleave them within the audio feature sequence as temporal coordinates to prompt the model. Furthermore, we introduce RL following Supervised Fine-Tuning (SFT) to directly optimize temporal alignment performance. 
Experiments demonstrate that TimePro-RL achieves significant performance gains across a range of audio temporal tasks, such as audio grounding, sound event detection, and dense audio captioning, validating its robust effectiveness.","published_date":"2026-04-15T10:50:29+00:00","viability_score":5,"cluster_label":"Audio AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Fine-tune large audio-language models for precise temporal event detection using audio-side time prompts and reinforcement learning.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.13705v1","title":"Beyond Arrow's Impossibility: Fairness as an Emergent Property of Multi-Agent Collaboration","abstract":"Fairness in language models is typically studied as a property of a single, centrally optimized model. As large language models become increasingly agentic, we propose that fairness emerges through interaction and exchange. We study this via a controlled hospital triage framework in which two agents negotiate over three structured debate rounds. One agent is aligned to a specific ethical framework via retrieval-augmented generation (RAG), while the other is either unaligned or adversarially prompted to favor demographic groups over clinical need. We find that alignment systematically shapes negotiation strategies and allocation patterns, and that neither agent's allocation is ethically adequate in isolation, yet their joint final allocation can satisfy fairness criteria that neither would have reached alone. Aligned agents partially moderate bias through contestation rather than override, acting as corrective patches that restore access for marginalized groups without fully converting a biased counterpart. We further observe that even explicitly aligned agents exhibit intrinsic biases toward certain frameworks, consistent with known left-leaning tendencies in LLMs. 
We connect these limits to Arrow's Impossibility Theorem: no aggregation mechanism can simultaneously satisfy all desiderata of collective rationality, and multi-agent deliberation navigates rather than resolves this constraint. Our results reposition fairness as an emergent, procedural property of decentralized agent interaction, and the system rather than the individual agent as the appropriate unit of evaluation.","published_date":"2026-04-15T10:34:35+00:00","viability_score":3,"cluster_label":"AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Fairness in AI agents emerges from multi-agent collaboration, not single-model optimization, offering a new perspective on ethical AI.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13699v1","title":"MIND: AI Co-Scientist for Material Research","abstract":"Large language models (LLMs) have enabled agentic AI systems for scientific discovery, but most approaches remain limited to textbased reasoning without automated experimental verification. We propose MIND, an LLM-driven framework for automated hypothesis validation in materials research. MIND organizes the scientific discovery process into hypothesis refinement, experimentation, and debate-based validation within a multi-agent pipeline. For experimental verification, the system integrates Machine Learning Interatomic Potentials, particularly SevenNet-Omni, enabling scalable in-silico experiments. We also provide a web-based user interface for automated hypothesis testing. The modular design allows additional experimental modules to be integrated, making the framework adaptable to broader scientific workflows. 
The code is available at: https://github.com/IMMS-Ewha/MIND, and a demonstration video at: https://youtu.be/lqiFe1OQzN4.","published_date":"2026-04-15T10:27:01+00:00","viability_score":7,"cluster_label":"AI for Science","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MIND is an LLM-driven co-scientist for material research, automating hypothesis validation through integrated in-silico experiments and a user interface.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13695v1","title":"Med-CAM: Minimal Evidence for Explaining Medical Decision Making","abstract":"Reliable and interpretable decision-making is essential in medical imaging, where diagnostic outcomes directly influence patient care. Despite advances in deep learning, most medical AI systems operate as opaque black boxes, providing little insight into why a particular diagnosis was reached. In this paper, we introduce Med-CAM, a framework for generating minimal and sharp maps as evidence-based explanations for Medical decision making via Classifier Activation Matching. Med-CAM trains a segmentation network from scratch to produce a mask that highlights the minimal evidence critical to model's decision for any seen or unseen image. This ensures that the explanation is both faithful to the network's behaviour and interpretable to clinicians. Experiments show, unlike prior spatial explanation methods, such as Grad-CAM and attention maps, which yield only fuzzy regions of relative importance, Med-CAM with its superior spatial awareness to shapes, textures, and boundaries, delivers conclusive, evidence-based explanations that faithfully replicate the model's prediction for any given image. 
By explicitly constraining explanations to be compact, consistent with model activations, and diagnostic alignment, Med-CAM advances transparent AI to foster clinician understanding and trust in high-stakes medical applications such as pathology and radiology.","published_date":"2026-04-15T10:22:45+00:00","viability_score":8,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AI framework generating clear, evidence-based medical imaging explanations to enhance trust in automated diagnostics.","time_to_mvp":"6+ months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13694v1","title":"Weight Patching: Toward Source-Level Mechanistic Localization in LLMs","abstract":"Mechanistic interpretability seeks to localize model behavior to the internal components that causally realize it. Prior work has advanced activation-space localization and causal tracing, but modules that appear important in activation space may merely aggregate or amplify upstream signals rather than encode the target capability in their own parameters. To address this gap, we propose Weight Patching, a parameter-space intervention method for source-oriented analysis in paired same-architecture models that differ in how strongly they express a target capability under the inputs of interest. Given a base model and a behavior-specialized counterpart, Weight Patching replaces selected module weights from the specialized model into the base model under a fixed input. We instantiate the method on instruction following and introduce a framework centered on a vector-anchor behavioral interface that provides a shared internal criterion for whether a task-relevant control state has been formed or recovered in open-ended generation. Under this framework, the analysis reveals a hierarchy from shallow candidate source-side carriers to aggregation and routing modules, and further to downstream execution circuits. 
The recovered component scores can also guide mechanism-aware model merging, improving selective fusion across the evaluated expert combinations and providing additional external validation.","published_date":"2026-04-15T10:21:38+00:00","viability_score":2,"cluster_label":"LLM Interpretability","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel parameter-space intervention method for localizing specific behaviors within Large Language Models.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13688v1","title":"Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data","abstract":"3D editing refers to the ability to apply local or global modifications to 3D assets. Effective 3D editing requires maintaining semantic consistency by performing localized changes according to prompts, while also preserving local invariance so that unchanged regions remain consistent with the original. However, existing approaches have significant limitations: multi-view editing methods incur losses when projecting back to 3D, while voxel-based editing is constrained in both the regions that can be modified and the scale of modifications. Moreover, the lack of sufficiently large editing datasets for training and evaluation remains a challenge. To address these challenges, we propose a Beyond Voxel 3D Editing (BVE) framework with a self-constructed large-scale dataset specifically tailored for 3D editing. Building upon this dataset, our model enhances a foundational image-to-3D generative architecture with lightweight, trainable modules, enabling efficient injection of textual semantics without the need for expensive full-model retraining. Furthermore, we introduce an annotation-free 3D masking strategy to preserve local invariance, maintaining the integrity of unchanged regions during editing. 
Extensive experiments demonstrate that BVE achieves superior performance in generating high-quality, text-aligned 3D assets, while faithfully retaining the visual characteristics of the original input.","published_date":"2026-04-15T10:10:27+00:00","viability_score":7,"cluster_label":"3D Generative AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for efficient 3D asset editing that leverages self-constructed datasets and lightweight modules to enhance foundational generative models.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13686v1","title":"IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages","abstract":"While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. The relational schemas are sourced from open-data platforms, including the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), ensuring realistic administrative data complexity. IndicDB comprises 20 databases across 237 tables. To convert denormalized government data into rich relational structures, we employ an iterative three-agent framework (Architect, Auditor, Refiner) to ensure structural rigor and high relational density (11.85 tables per database; join depths up to six). Our pipeline is value-aware, difficulty-calibrated, and join-enforced, generating 15,617 tasks across English, Hindi, and five Indic languages. We evaluate cross-lingual semantic parsing performance of state-of-the-art models (DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, Qwen3) across seven linguistic variants. 
Results show a 9.00% performance drop from English to Indic languages, revealing an \"Indic Gap\" driven by harder schema linking, increased structural ambiguity, and limited external knowledge. IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL. Code and data: https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC/","published_date":"2026-04-15T10:07:37+00:00","viability_score":7,"cluster_label":"Multilingual Text-to-SQL","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and framework for evaluating multilingual Text-to-SQL capabilities in Indian languages, addressing the 'Indic Gap' in LLM performance.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13666v1","title":"Automatically Inferring Teachers' Geometric Content Knowledge: A Skills Based Approach","abstract":"Assessing teachers' geometric content knowledge is essential for geometry instructional quality and student learning, but difficult to scale. The Van Hiele model characterizes geometric reasoning through five hierarchical levels. Traditional Van Hiele assessment relies on manual expert analysis of open-ended responses. This process is time-consuming, costly, and prevents large-scale evaluation. This study develops an automated approach for diagnosing teachers' Van Hiele reasoning levels using large language models grounded in educational theory. Our central hypothesis is that integrating explicit skills information significantly improves Van Hiele classification. In collaboration with mathematics education researchers, we built a structured skills dictionary decomposing the Van Hiele levels into 33 fine-grained reasoning skills. Through a custom web platform, 31 pre-service teachers solved geometry problems, yielding 226 responses. Expert researchers then annotated each response with its Van Hiele level and demonstrated skills from the dictionary. 
Using this annotated dataset, we implemented two classification approaches: (1) retrieval-augmented generation (RAG) and (2) multi-task learning (MTL). Each approach compared a skills-aware variant incorporating the skills dictionary against a baseline without skills information. Results showed that for both methods, skills-aware variants significantly outperformed baselines across multiple evaluation metrics. This work provides the first automated approach for Van Hiele level classification from open-ended responses. It offers a scalable, theory-grounded method for assessing teachers' geometric reasoning that can enable large-scale evaluation and support adaptive, personalized teacher learning systems.","published_date":"2026-04-15T09:34:46+00:00","viability_score":6,"cluster_label":"Educational AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An automated system for assessing teachers' geometric content knowledge using LLMs and a fine-grained skills dictionary.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13656v1","title":"Ordinary Least Squares is a Special Case of Transformer","abstract":"The statistical essence of the Transformer architecture has long remained elusive: Is it a universal approximator, or a neural network version of known computational algorithms? Through rigorous algebraic proof, we show that the latter better describes Transformer's basic nature: Ordinary Least Squares (OLS) is a special case of the single-layer Linear Transformer. Using the spectral decomposition of the empirical covariance matrix, we construct a specific parameter setting where the attention mechanism's forward pass becomes mathematically equivalent to the OLS closed-form projection. This means attention can solve the problem in one forward pass, not by iterating. Building upon this prototypical case, we further uncover a decoupled slow and fast memory mechanism within Transformers. 
Finally, the evolution from our established linear prototype to standard Transformers is discussed. This progression facilitates the transition of the Hopfield energy function from linear to exponential memory capacity, thereby establishing a clear continuity between modern deep architectures and classical statistical inference.","published_date":"2026-04-15T09:21:01+00:00","viability_score":2,"cluster_label":"LLM Theory","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper theoretically demonstrates that Ordinary Least Squares is a special case of the Transformer architecture, revealing a decoupled slow and fast memory mechanism.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13645v1","title":"A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies","abstract":"Co-training, which combines limited in-domain real-world data with abundant surrogate data such as simulation or cross-embodiment robot data, is widely used for training generative robot policies. Despite its empirical success, the mechanisms that determine when and why co-training is effective remain poorly understood. We investigate the mechanism of sim-and-real co-training through theoretical analysis and empirical study, and identify two intrinsic effects governing performance. The first, \\textbf{``structured representation alignment\"}, reflects a balance between cross-domain representation alignment and domain discernibility, and plays a primary role in downstream performance. The second, the \\textbf{``importance reweighting effect\"}, arises from domain-dependent modulation of action weighting and operates at a secondary level. We validate these effects with controlled experiments on a toy model and extensive sim-and-sim and sim-and-real robot manipulation experiments. Our analysis offers a unified interpretation of recent co-training techniques and motivates a simple method that consistently improves upon prior approaches. 
More broadly, our aim is to examine the inner workings of co-training and to facilitate research in this direction.","published_date":"2026-04-15T09:14:43+00:00","viability_score":7,"cluster_label":"Robotics AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research provides a mechanistic analysis of sim-and-real co-training for generative robot policies, identifying key effects and proposing a method to improve performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13630v1","title":"SafeHarness: Lifecycle-Integrated Security Architecture for LLM-based Agent Deployment","abstract":"The performance of large language model (LLM) agents depends critically on the execution harness, the system layer that orchestrates tool use, context management, and state persistence. Yet this same architectural centrality makes the harness a high-value attack surface: a single compromise at the harness level can cascade through the entire execution pipeline. We observe that existing security approaches suffer from a structural mismatch, leaving them blind to harness-internal state and unable to coordinate across the different phases of agent operation. In this paper, we introduce SafeHarness, a security architecture in which four proposed defense layers are woven directly into the agent lifecycle to address these limitations: adversarial context filtering at input processing, tiered causal verification at decision making, privilege-separated tool control at action execution, and safe rollback with adaptive degradation at state update. The proposed cross-layer mechanisms tie these layers together, escalating verification rigor, triggering rollbacks, and tightening tool privileges whenever sustained anomalies are detected. 
We evaluate \\safeharness{} on benchmark datasets across diverse harness configurations, comparing against four security baselines under five attack scenarios spanning six threat categories. Compared to the unprotected baseline, \\safeharness{} achieves an average reduction of approximately 38\\% in UBR and 42\\% in ASR, substantially lowering both the unsafe behavior rate and the attack success rate while preserving core task utility.","published_date":"2026-04-15T08:59:00+00:00","viability_score":7,"cluster_label":"LLM Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SafeHarness is a lifecycle-integrated security architecture for LLM agents that significantly reduces unsafe behavior and attack success rates.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13620v1","title":"Syn-TurnTurk: A Synthetic Dataset for Turn-Taking Prediction in Turkish Dialogues","abstract":"Managing natural dialogue timing is a significant challenge for voice-based chatbots. Most current systems usually rely on simple silence detection, which often fails because human speech patterns involve irregular pauses. This causes bots to interrupt users, breaking the conversational flow. This problem is even more severe for languages like Turkish, which lack high-quality datasets for turn-taking prediction. This paper introduces Syn-TurnTurk, a synthetic Turkish dialogue dataset generated using various Qwen Large Language Models (LLMs) to mirror real-life verbal exchanges, including overlaps and strategic silences. We evaluated the dataset using several traditional and deep learning architectures. The results show that advanced models, particularly BI-LSTM and Ensemble (LR+RF) methods, achieve high accuracy (0.839) and AUC scores (0.910). 
These findings demonstrate that our synthetic dataset can have a positive effect on models' understanding of linguistic cues, allowing for more natural human-machine interaction in Turkish.","published_date":"2026-04-15T08:39:26+00:00","viability_score":7,"cluster_label":"Dialogue AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Syn-TurnTurk is a synthetic dataset for Turkish dialogue turn-taking prediction, enabling more natural human-machine interaction in Turkish chatbots.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13609v1","title":"Golden Handcuffs make safer AI agents","abstract":"Reinforcement learners can attain high reward through novel unintended strategies. We study a Bayesian mitigation for general environments: we expand the agent's subjective reward range to include a large negative value $-L$, while the true environment's rewards lie in $[0,1]$. After observing consistently high rewards, the Bayesian policy becomes risk-averse to novel schemes that plausibly lead to $-L$. We design a simple override mechanism that yields control to a safe mentor whenever the predicted value drops below a fixed threshold. We prove two properties of the resulting agent: (i) Capability: using mentor-guided exploration with vanishing frequency, the agent attains sublinear regret against its best mentor. 
(ii) Safety: no decidable low-complexity predicate is triggered by the optimizing policy before it is triggered by a mentor.","published_date":"2026-04-15T08:23:13+00:00","viability_score":3,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A Bayesian mitigation strategy for reinforcement learning agents to prevent unintended high-reward strategies by incorporating a large negative penalty and a mentor override.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13608v1","title":"Design Space Exploration of Hybrid Quantum Neural Networks for Chronic Kidney Disease","abstract":"Hybrid Quantum Neural Networks (HQNNs) have recently emerged as a promising paradigm for near-term quantum machine learning. However, their practical performance strongly depends on design choices such as classical-to-quantum data encoding, quantum circuit architecture, measurement strategy and shots. In this paper, we present a comprehensive design space exploration of HQNNs for Chronic Kidney Disease (CKD) diagnosis. Using a carefully curated and preprocessed clinical dataset, we benchmark 625 different HQNN models obtained by combining five encoding schemes, five entanglement architectures, five measurement strategies, and five different shot settings. To ensure fair and robust evaluation, all models are trained using 10-fold stratified cross-validation and assessed on a test set using a comprehensive set of metrics, including accuracy, area under the curve (AUC), F1-score, and a composite performance score. Our results reveal strong and non-trivial interactions between encoding choices and circuit architectures, showing that high performance does not necessarily require large parameter counts or complex circuits. In particular, we find that compact architectures combined with appropriate encodings (e.g., IQP with Ring entanglement) can achieve the best trade-off between accuracy, robustness, and efficiency. 
Beyond absolute performance analysis, we also provide actionable insights into how different design dimensions influence learning behavior in HQNNs.","published_date":"2026-04-15T08:23:01+00:00","viability_score":5,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A comprehensive exploration of Hybrid Quantum Neural Networks for Chronic Kidney Disease diagnosis, benchmarking 625 models to find optimal design choices.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13583v1","title":"BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks","abstract":"Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. 
We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.","published_date":"2026-04-15T07:43:01+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BenGER is an open-source web platform for end-to-end benchmarking of German legal LLM tasks, integrating task creation, annotation, execution, and evaluation.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13567v1","title":"Comparison of window shapes and lengths in short-time feature extraction for classification of heart sound signals","abstract":"Heart sound signals, phonocardiography (PCG) signals, allow for the automatic diagnosis of potential cardiovascular pathology. Such classification task can be tackled using the bidirectional long short-term memory (biLSTM) network, trained on features extracted from labeled PCG signals. Regarding the non-stationarity of PCG signals, it is recommended to extract the features from multiple short-length segments of the signals using a sliding window of certain shape and length. However, some window contains unfavorable spectral side lobes, which distort the features. Accordingly, it is preferable to adapt the window shape and length in terms of classification performance. We propose an experimental evaluation for three window shapes, each with three window lengths. The biLSTM network is trained and tested on statistical features extracted, and the performance is reported in terms of the window shapes and lengths. Results show that the best performance is obtained when the Gaussian window is used for splitting the signals, and the triangular window competes with the Gaussian window for a length of 75 ms. Although the rectangular window is a commonly offered option, it is the worst choice for splitting the signals. 
Moreover, the classification performance obtained with a 75 ms Gaussian window outperforms that of a baseline method.","published_date":"2026-04-15T07:28:12+00:00","viability_score":3,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An experimental evaluation of window shapes and lengths for feature extraction in classifying heart sound signals using bidirectional LSTMs.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13565v1","title":"UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing","abstract":"Ultra-high-resolution (UHR) remote sensing imagery couples kilometer-scale context with query-critical evidence that may occupy only a few pixels. Such vast spatial scale leads to a quadratic explosion of visual tokens and hinders the extraction of information from small objects. Previous works utilize direct downsampling, dense tiling, or global top-k pruning, which either compromise query-critical image details or incur unpredictable compute. In this paper, we propose UHR-BAT, a query-guided and region-faithful token compression framework to efficiently select visual tokens under a strict context budget. Specifically, we leverage text-guided, multi-scale importance estimation for visual tokens, effectively tackling the challenge of achieving precise yet low-cost feature extraction. Furthermore, by introducing region-wise preserve and merge strategies, we mitigate visual token redundancy, further driving down the computational budget. Experimental results show that UHR-BAT achieves state-of-the-art performance across various benchmarks. 
Code will be available at https://github.com/Yunkaidang/UHR.","published_date":"2026-04-15T07:21:37+00:00","viability_score":7,"cluster_label":"Remote Sensing Vision-Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A budget-aware vision-language model for ultra-high-resolution remote sensing that efficiently compresses visual tokens using query-guided importance estimation and region-wise strategies.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13561v1","title":"CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling","abstract":"Vision-language models trained with contrastive learning on paired medical images and reports show strong zero-shot diagnostic capabilities, yet the effect of training batch composition on learned representations remains unexplored for 3D medical imaging. We reproduce Merlin, a dual-encoder model that aligns 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss, achieving a zero-shot macro F1 of 74.45% across 30 findings (original: 73.00%). We then investigate two axes of variation. First, we control the normal-to-abnormal ratio within training batches at 25:75, 50:50, and 75:25 using section-level balanced sampling on the full dataset. All three configurations underperform the unbalanced baseline by 2.4 to 2.8 points, with 75:25 achieving the best result (72.02%) among balanced variants. Second, we conduct data scaling ablations on a 4,362-study subset, training with 20%, 40%, and 100% of the data. Performance scales sub-linearly from 65.26% to 71.88%, with individual findings varying dramatically in data sensitivity. Enforcing 50:50 balanced sampling on the same subset further degrades performance to 68.01%, confirming that explicit class balancing hurts regardless of dataset or balancing granularity. 
Our results indicate that the stochastic diversity of random sampling, combined with Merlin's alternating batching over anatomical subsections, provides more effective regularization than engineered class ratios at the small batch sizes required by 3D medical volumes.","published_date":"2026-04-15T07:10:01+00:00","viability_score":4,"cluster_label":"Medical Vision-Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Investigating the impact of batch composition and data scaling on CLIP-like architectures for abdominal CT image-text alignment and zero-shot learning.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13552v1","title":"Training-Free Test-Time Contrastive Learning for Large Language Models","abstract":"Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test-time adaptation (TTA) methods rely on gradient-based updates that require white-box access and need substantial overhead, while training-free alternatives are either static or depend on external guidance. In this paper, we propose Training-Free Test-Time Contrastive Learning TF-TTCL, a training-free adaptation framework that enables a frozen LLM to improve online by distilling supervision from its own inference experiences. Specifically, TF-TTCL implements a dynamic \"Explore-Reflect-Steer\" loop through three core modules: 1) Semantic Query Augmentation first diversifies problem views via multi-agent role-playing to generate different reasoning trajectories; 2) Contrastive Experience Distillation then captures the semantic gap between superior and inferior trajectories, distilling them into explicit textual rules; and 3) Contextual Rule Retrieval finally activates these stored rules during inference to dynamically steer the frozen LLM toward robust reasoning patterns while avoiding observed errors. 
Extensive experiments on closed-ended reasoning tasks and open-ended evaluation tasks demonstrate that TF-TTCL consistently outperforms strong zero-shot baselines and representative TTA methods under online evaluation. Code is available at https://github.com/KevinSCUTer/TF-TTCL.","published_date":"2026-04-15T06:56:35+00:00","viability_score":8,"cluster_label":"AI Optimization Tools","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A training-free test-time contrastive learning framework boosting large language models' reasoning without retraining.","time_to_mvp":"","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13540v1","title":"Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding","abstract":"Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation. This mismatch indicates that the model's rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human ``Thinking-While-Drawing'' paradigm, where humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. 
Our approach unlocks the ``free lunch'' hidden in the UMM's powerful inherent understanding to continuously reflect, activating its internal knowledge and rectifying intermediate results during generation. We regard the diffusion denoising process in UMMs as an intrinsic visual reasoning process and align the intermediate results with the target instruction understood by the model, serving as a self-supervisory signal to rectify UMM generation. Extensive experiments demonstrate that UniRect-CoT can be easily integrated into existing UMMs, significantly enhancing generation quality across diverse complex tasks.","published_date":"2026-04-15T06:41:56+00:00","viability_score":5,"cluster_label":"Unified Multimodal Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A training-free framework that enhances unified multimodal model generation by leveraging their inherent understanding for reflective rectification, inspired by human 'Thinking-While-Drawing'.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.13531v1","title":"RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management","abstract":"Graphical User Interface (GUI) agents show strong capabilities for automating web tasks, but existing interactive benchmarks primarily target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as authentic e-commerce risk management remains underexplored. To bridge this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations on uncooperative websites, partially environmental hijackments. 
To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics. Our evaluation across diverse models reveals a dramatic capability gap: top-tier generalist models achieve 49.1% success, while specialized open-weights GUI models lag at near-total failure. This highlights that foundation model scale currently matters more than zero-shot interface grounding in long-horizon professional tasks. We also demonstrate the viability of our infrastructure through agentic RL, which improves open-source models by 16.2%. These results position RiskWebWorld as a practical testbed for developing robust digital workers.","published_date":"2026-04-15T06:27:49+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A realistic benchmark for evaluating GUI agents in e-commerce risk management, revealing a significant capability gap in current models and demonstrating agentic RL improvements.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13521v1","title":"C-voting: Confidence-Based Test-Time Voting without Explicit Energy Functions","abstract":"Neural network models with latent recurrent processing, where identical layers are recursively applied to the latent state, have gained attention as promising models for performing reasoning tasks. A strength of such models is that they enable test-time scaling, where the models can enhance their performance in the test phase without additional training. Models such as the Hierarchical Reasoning Model (HRM) and Artificial Kuramoto Oscillatory Neurons (AKOrN) can facilitate deeper reasoning by increasing the number of recurrent steps, thereby enabling the completion of challenging tasks, including Sudoku, Maze solving, and AGI benchmarks. 
In this work, we introduce confidence-based voting (C-voting), a test-time scaling strategy designed for recurrent models with multiple latent candidate trajectories. Initializing the latent state with multiple candidates using random variables, C-voting selects the one maximizing the average of top-1 probabilities of the predictions, reflecting the model's confidence. Additionally, it yields 4.9% higher accuracy on Sudoku-hard than the energy-based voting strategy, which is specific to models with explicit energy functions. An essential advantage of C-voting is its applicability: it can be applied to recurrent models without requiring an explicit energy function. Finally, we introduce a simple attention-based recurrent model with randomized initial values named ItrSA++, and demonstrate that when combined with C-voting, it outperforms HRM on Sudoku-extreme (95.2% vs. 55.0%) and Maze (78.6% vs. 74.5%) tasks.","published_date":"2026-04-15T06:10:12+00:00","viability_score":8,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A test-time performance enhancement tool for recurrent neural networks, applicable without needing explicit energy functions.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13518v1","title":"From Alignment to Prediction: A Study of Self-Supervised Learning and Predictive Representation Learning","abstract":"Self-supervised learning has emerged as a major technique for the task of learning from unlabeled data, where the current methods mostly revolve around alignment of representations and input reconstruction. Although such approaches have demonstrated excellent performance in practice, their scope remains mostly confined to learning from observed data and does not provide much help in terms of a learning structure that is predictive of the data distribution. 
In this paper, we study some of the recent developments in the realm of self-supervised learning. We define a new category called Predictive Representation Learning (PRL), which revolves around the latent prediction of unobserved components of data based on the observation. We propose a common taxonomy that classifies PRL along with alignment and reconstruction-based learning approaches. Furthermore, we argue that Joint-Embedding Predictive Architecture (JEPA) can be considered as an exemplary member of this new paradigm. We further discuss theoretical perspectives and open challenges, highlighting predictive representation learning as a promising direction for future self-supervised learning research. In this study, we implemented Bootstrap Your Own Latent (BYOL), Masked Autoencoders (MAE), and Image-JEPA (I-JEPA) for comparative analysis. The results indicate that MAE achieves perfect similarity of 1.00, but exhibits relatively weak robustness of 0.55. In contrast, BYOL and I-JEPA attain accuracies of 0.98 and 0.95, with robustness scores of 0.75 and 0.78, respectively.","published_date":"2026-04-15T06:04:45+00:00","viability_score":5,"cluster_label":"Self-Supervised Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Introduces Predictive Representation Learning (PRL) as a new paradigm for self-supervised learning, demonstrating its potential through comparative analysis of BYOL, MAE, and I-JEPA.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.13517v1","title":"Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO","abstract":"Temporal credit assignment in reinforcement learning has long been a central challenge. 
Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper reveals that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment demonstrate that our proposed architecture achieves statistically significant performance improvements. 
Without relying on hyperparameter hacking, it consistently surpasses the ''Environment Solved'' threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines.","published_date":"2026-04-15T06:03:07+00:00","viability_score":4,"cluster_label":"Reinforcement Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Proposes a Target Decoupling architecture for multi-timescale PPO that overcomes surrogate hacking and myopic degeneration by isolating short-term signals for policy updates.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.13515v1","title":"SFT-GRPO Data Overlap as a Post-Training Hyperparameter for Autoformalization","abstract":"Supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) is a common post-training recipe. We conduct a controlled ablation over SFT-GRPO data overlap, evaluating Qwen3-8B (thinking disabled) post-trained for Lean 4 autoformalization under six conditions that differ solely in training recipe: a base model, SFT-only, GRPO-only, and three SFT+GRPO configurations where 0 percent, 30 percent, or 100 percent of the GRPO prompts coincide with the SFT corpus. Keeping SFT and GRPO data disjoint consistently outperforms full overlap at zero additional compute cost. Evaluating on Gaokao-Formal and PutnamBench under both compile pass at k and semantic pass at k assessed by an LLM judge, we find that lower overlap is monotonically associated with higher compilation and semantic accuracy. At 0 percent overlap, GRPO yields a 10.4 percentage point semantic gain over SFT alone on Gaokao, while at 100 percent overlap both metrics remain flat, rendering the GRPO stage effectively redundant. We further show that dual-metric evaluation reveals compile semantic gaps exceeding 30 percentage points for the highest compiling models, a disparity invisible under compile-only benchmarking. 
To our knowledge, this is the first controlled investigation of SFT-GRPO data overlap as a post-training hyperparameter, demonstrating how model behavior varies based on the degree of data sharing between training stages.","published_date":"2026-04-15T06:00:25+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Optimizing data overlap in SFT-GRPO post-training for LLMs significantly improves autoformalization accuracy without additional compute.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13504v1","title":"Chain of Uncertain Rewards with Large Language Models for Reinforcement Learning","abstract":"Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging and labor-intensive process due to the inefficiencies and inconsistencies inherent in traditional methods. Existing methods often rely on extensive manual design and evaluation steps, which are prone to redundancy and overlook local uncertainties at intermediate decision points. To address these challenges, we propose the Chain of Uncertain Rewards (CoUR), a novel framework that integrates large language models (LLMs) to streamline reward function design and evaluation in RL environments. Specifically, our CoUR introduces code uncertainty quantification with a similarity selection mechanism that combines textual and semantic analyses to identify and reuse the most relevant reward function components. By reducing redundant evaluations and leveraging Bayesian optimization on decoupled reward terms, CoUR enables a more efficient and robust search for optimal reward feedback. We comprehensively evaluate CoUR across nine original environments from IsaacGym and all 20 tasks from the Bidexterous Manipulation benchmark. 
The experimental results demonstrate that CoUR not only achieves better performance but also significantly lowers the cost of reward evaluations.","published_date":"2026-04-15T05:44:14+00:00","viability_score":8,"cluster_label":"Reinforcement Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel framework using LLMs to automate and optimize reward function design in reinforcement learning, reducing evaluation costs and improving performance.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13488v1","title":"Towards Scalable Lightweight GUI Agents via Multi-role Orchestration","abstract":"Autonomous Graphical User Interface (GUI) agents powered by Multimodal Large Language Models (MLLMs) enable digital automation on end-user devices. While scaling both parameters and data has yielded substantial gains, advanced methods still suffer from prohibitive deployment costs on resource-constrained devices. When facing complex in-the-wild scenarios, lightweight GUI agents are bottlenecked by limited capacity and poor task scalability under end-to-end episodic learning, impeding adaptation to multi-agent systems (MAS), while training multiple skill-specific experts remains costly. Can we strike an effective trade-off in this cost-scalability dilemma, enabling lightweight MLLMs to participate in realistic GUI workflows? To address these challenges, we propose the LAMO framework, which endows a lightweight MLLM with GUI-specific knowledge and task scalability, allowing multi-role orchestration to expand its capability boundary for GUI automation. LAMO combines role-oriented data synthesis with a two-stage training recipe: (i) supervised fine-tuning with Perplexity-Weighted Cross-Entropy optimization for knowledge distillation and visual perception enhancement, and (ii) reinforcement learning for role-oriented cooperative exploration. 
With LAMO, we develop a task-scalable native GUI agent, LAMO-3B, supporting monolithic execution and MAS-style orchestration. When paired with advanced planners as a plug-and-play policy executor, LAMO-3B can continuously benefit from planner advances, enabling a higher performance ceiling. Extensive static and online evaluations validate the effectiveness of our design.","published_date":"2026-04-15T05:23:04+00:00","viability_score":6,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"LAMO framework enables lightweight LLMs to perform complex GUI automation through multi-role orchestration, balancing cost and scalability.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2604.13481v1","title":"Monthly Diffusion v0.9: A Latent Diffusion Model for the First AI-MIP","abstract":"Here, we describe Monthly Diffusion at 1.5-degree grid spacing (MD-1.5 version 0.9), a climate emulator that leverages a spherical Fourier neural operator (SFNO)-inspired Conditional Variational Auto-Encoder (CVAE) architecture to model the evolution of low-frequency internal atmospheric variability using latent diffusion. MDv0.9 was designed to forward-step at monthly mean timesteps in a data-sparse regime, using modest computational requirements. 
This work describes the motivation behind the architecture design, the MDv0.9 training procedure, and initial results.","published_date":"2026-04-15T05:08:49+00:00","viability_score":3,"cluster_label":"Climate AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A latent diffusion model for monthly climate emulation using a spherical Fourier neural operator-inspired architecture.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13474v1","title":"Secure and Privacy-Preserving Vertical Federated Learning","abstract":"We propose a novel end-to-end privacy-preserving framework, instantiated by three efficient protocols for different deployment scenarios, covering both input and output privacy, for the vertically split scenario in federated learning (FL), where features are split across clients and labels are not shared by all parties. We do so by distributing the role of the aggregator in FL into multiple servers and having them run secure multiparty computation (MPC) protocols to perform model and feature aggregation and apply differential privacy (DP) to the final released model. While a naive solution would have the clients delegating the entirety of training to run in MPC between the servers, our optimized solution, which supports purely global and also global-local models updates with privacy-preserving, drastically reduces the amount of computation and communication performed using multiparty computation. 
The experimental results also show the effectiveness of our protocols.","published_date":"2026-04-15T04:55:40+00:00","viability_score":2,"cluster_label":"Federated Learning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A privacy-preserving framework for vertically split federated learning using secure multiparty computation and differential privacy.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13472v1","title":"Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus","abstract":"Cooperative multi-agent reinforcement learning (MARL) is widely used to address large joint observation and action spaces by decomposing a centralized control problem into multiple interacting agents. However, such decomposition often introduces additional challenges, including non-stationarity, unstable training, weak coordination, and limited theoretical guarantees. In this paper, we propose the Consensus Multi-Agent Transformer (CMAT), a centralized framework that bridges cooperative MARL to a hierarchical single-agent reinforcement learning (SARL) formulation. CMAT treats all agents as a unified entity and employs a Transformer encoder to process the large joint observation space. To handle the extensive joint action space, we introduce a hierarchical decision-making mechanism in which a Transformer decoder autoregressively generates a high-level consensus vector, simulating the process by which agents reach agreement on their strategies in latent space. Conditioned on this consensus, all agents generate their actions simultaneously, enabling order-independent joint decision making and avoiding the sensitivity to action-generation order in conventional Multi-Agent Transformers (MAT). This factorization allows the joint policy to be optimized using single-agent PPO while preserving expressive coordination through the latent consensus. 
To evaluate the proposed method, we conduct experiments on benchmark tasks from StarCraft II, Multi-Agent MuJoCo, and Google Research Football. The results show that CMAT achieves superior performance over recent centralized solutions, sequential MARL methods, and conventional MARL baselines. The code for this paper is available at https://github.com/RS2002/CMAT.","published_date":"2026-04-15T04:52:22+00:00","viability_score":7,"cluster_label":"Multi-Agent Reinforcement Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A Transformer-based framework that bridges multi-agent reinforcement learning to single-agent reinforcement learning for improved coordination and performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13462v1","title":"Learning from Change: Predictive Models for Incident Prevention in a Regulated IT Environment","abstract":"Effective IT change management is important for businesses that depend on software and services, particularly in highly regulated sectors such as finance, where operational reliability, auditability, and explainability are essential. A significant portion of IT incidents are caused by changes, making it important to identify high-risk changes before deployment. This study presents a predictive incident risk scoring approach at a large international bank. The approach supports engineers during the assessment and planning phases of change deployments by predicting the potential of inducing incidents. To satisfy regulatory constraints, we built the model with auditability and explainability in mind, applying SHAP values to provide feature-level insights and ensure decisions are traceable and transparent. Using a one-year real-world dataset, we compare the existing rule-based process with three machine learning models: HGBC, LightGBM, and XGBoost. 
LightGBM achieved the best performance, particularly when enriched with aggregated team metrics that capture organisational context. Our results show that data-driven, interpretable models can outperform rule-based approaches while meeting compliance needs, enabling proactive risk mitigation and more reliable IT operations.","published_date":"2026-04-15T04:33:46+00:00","viability_score":7,"cluster_label":"IT Operations AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An interpretable machine learning model that predicts IT incident risk for regulated environments, outperforming rule-based systems.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13460v1","title":"From Order to Distribution: A Spectral Characterization of Forgetting in Continual Learning","abstract":"A central challenge in continual learning is forgetting, the loss of performance on previously learned tasks induced by sequential adaptation to new ones. While forgetting has been extensively studied empirically, rigorous theoretical characterizations remain limited. A notable step in this direction is \\citet{evron2022catastrophic}, which analyzes forgetting under random orderings of a fixed task collection in overparameterized linear regression. We shift the perspective from order to distribution. Rather than asking how a fixed task collection behaves under random orderings, we study an exact-fit linear regime in which tasks are sampled i.i.d.\\ from a task distribution~$\u03a0$, and ask how the generating distribution itself governs forgetting. In this setting, we derive an exact operator identity for the forgetting quantity, revealing a recursive spectral structure. Building on this identity, we establish an unconditional upper bound, identify the leading asymptotic term, and, in generic nondegenerate cases, characterize the convergence rate up to constants. 
We further relate this rate to geometric properties of the task distribution, clarifying what drives slow or fast forgetting in this model.","published_date":"2026-04-15T04:29:00+00:00","viability_score":0,"cluster_label":"Continual Learning Theory","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical characterization of forgetting in continual learning by analyzing task distributions rather than random orderings.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13459v1","title":"Asymmetric-Loss-Guided Hybrid CNN-BiLSTM-Attention Model for Industrial RUL Prediction with Interpretable Failure Heatmaps","abstract":"Turbofan engine degradation under sustained operational stress necessitates robust prognostic systems capable of accurately estimating the Remaining Useful Life (RUL) of critical components. Existing deep learning approaches frequently fail to simultaneously capture multi-sensor spatial correlations and long-range temporal dependencies, while standard symmetric loss functions inadequately penalize the safety-critical error of over-estimating residual life. This study proposes a hybrid architecture integrating Twin-Stage One-Dimensional Convolutional Neural Networks (1D-CNN), a Bidirectional Long Short-Term Memory (BiLSTM) network, and a custom Bahdanau Additive Attention mechanism. The model was trained and evaluated on the NASA Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) FD001 sub-dataset employing a zero-leakage preprocessing pipeline, piecewise-linear RUL labeling capped at 130 cycles, and the NASA-specified asymmetric exponential loss function that disproportionately penalizes over-estimation to enforce industrial safety constraints. Experiments on 100 test engines achieved a Root Mean Squared Error (RMSE) of 17.52 cycles and a NASA S-Score of 922.06. 
Furthermore, extracted attention weight heatmaps provide interpretable, per-engine insights into the temporal progression of degradation, supporting informed maintenance decision-making. The proposed framework demonstrates competitive performance against established baselines and offers a principled approach to safe, interpretable prognostics in industrial settings.","published_date":"2026-04-15T04:25:38+00:00","viability_score":7,"cluster_label":"Predictive Maintenance AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hybrid CNN-BiLSTM-Attention model for industrial Remaining Useful Life prediction that provides interpretable failure heatmaps, improving safety and maintenance decisions.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13455v1","title":"Outperforming Self-Attention Mechanisms in Solar Irradiance Forecasting via Physics-Guided Neural Networks","abstract":"Accurate Global Horizontal Irradiance (GHI) forecasting is critical for grid stability, particularly in arid regions characterized by rapid aerosol fluctuations. While recent trends favor computationally expensive Transformer-based architectures, this paper challenges the prevailing \"complexity-first\" paradigm. We propose a lightweight, Physics-Informed Hybrid CNN-BiLSTM framework that prioritizes domain knowledge over architectural depth. The model integrates a Convolutional Neural Network (CNN) for spatial feature extraction with a Bi-Directional LSTM for capturing temporal dependencies. Unlike standard data-driven approaches, our model is explicitly guided by a vector of 15 engineered features, including Clear-Sky indices and Solar Zenith Angle, rather than relying solely on raw historical data. Hyperparameters are rigorously tuned using Bayesian Optimization to ensure global optimality. 
Experimental validation using NASA POWER data in Sudan demonstrates that our physics-guided approach achieves a Root Mean Square Error (RMSE) of 19.53 W/m^2, significantly outperforming complex attention-based baselines (RMSE 30.64 W/m^2). These results confirm a \"Complexity Paradox\": in high-noise meteorological tasks, explicit physical constraints offer a more efficient and accurate alternative to self-attention mechanisms. The findings advocate for a shift towards hybrid, physics-aware AI for real-time renewable energy management.","published_date":"2026-04-15T04:17:20+00:00","viability_score":7,"cluster_label":"Renewable Energy AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A physics-guided hybrid CNN-BiLSTM model for solar irradiance forecasting that outperforms complex attention models, enabling efficient renewable energy management.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13448v1","title":"A Study of Failure Modes in Two-Stage Human-Object Interaction Detection","abstract":"Human-object interaction (HOI) detection aims to detect interactions between humans and objects in images. While recent advances have improved performance on existing benchmarks, their evaluations mainly focus on overall prediction accuracy and provide limited insight into the underlying causes of model failures. In particular, modern models often struggle in complex scenes involving multiple people and rare interaction combinations. In this work, we present a study to better understand the failure modes of two-stage HOI models, which form the basis of many current HOI detection approaches. Rather than constructing a large-scale benchmark, we instead decompose HOI detection into multiple interpretable perspectives and analyze model behavior across these dimensions to study different types of failure patterns. 
We curate a subset of images from an existing HOI dataset organized by human-object-interaction configurations (e.g., multi-person interactions and object sharing), and analyze model behavior under these configurations to examine different failure modes. This design allows us to analyze how these HOI models behave under different scene compositions and why their predictions fail. Importantly, high overall benchmark performance does not necessarily reflect robust visual reasoning about human-object relationships. We hope that this study can provide useful insights into the limitations of HOI models and offer observations for future research in this area.","published_date":"2026-04-15T04:01:23+00:00","viability_score":4,"cluster_label":"Computer Vision AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A study analyzing failure modes in two-stage human-object interaction detection models to provide insights for future research.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13440v1","title":"A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models","abstract":"Deploying Large Language Models (LLMs) on edge devices faces severe computational and memory constraints, limiting real-time processing and on-device intelligence. Hybrid architectures combining Structured State Space Models (SSMs) with transformer-based LLMs offer a balance of efficiency and performance. Aggressive quantization can drastically cut model size and speed up inference, but its uneven effects on different components require careful management. In this work, we propose a lightweight, backpropagation-free, surrogate-based sensitivity analysis framework to identify hybrid SSM-Transformer components most susceptible to quantization-induced degradation. 
Relying solely on forward-pass metrics, our method avoids expensive gradient computations and retraining, making it suitable for situations where access to in-domain data is limited due to proprietary restrictions or privacy constraints. We also provide a formal analysis showing that the Kullback-Leibler (KL) divergence metric better captures quantization sensitivity for Language modeling tasks than widely adopted alternatives such as mean squared error (MSE) and signal-to-quantization-noise ratio (SQNR). Through extensive experiments on SSM and hybrid architectures, our ablation studies confirm that KL-based rankings align with observed performance drops and outperform alternative metrics. This framework enables the practical deployment of advanced hybrid models on resource-constrained edge devices with minimal accuracy loss. We further validate our approach with real-world on-device profiling on Intel Lunar Lake hardware, demonstrating that KL-guided mixed-precision achieves near-FP16 perplexity with model sizes and throughput competitive with Uniform INT4 on both CPU and GPU execution modes. Code is available at https://github.com/jasonkongie/kl-ssm-quant.","published_date":"2026-04-15T03:40:30+00:00","viability_score":8,"cluster_label":"LLM Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A fast, forward-only sensitivity analysis using KL divergence for mixed-precision SSM-Transformer models, enabling efficient LLM deployment on edge devices.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13432v1","title":"MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis","abstract":"Token compression is crucial for mitigating the quadratic complexity of self-attention mechanisms in Vision Transformers (ViTs), which often involve numerous input tokens. 
Existing methods, such as ToMe, rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that limit their effectiveness. We introduce MaMe, a training-free, differentiable token merging method based entirely on matrix operations, which is GPU-friendly to accelerate ViTs. Additionally, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with only a 0.84% accuracy loss. Furthermore, MaMe achieves simultaneous improvements in both performance and speed on some tasks. In image synthesis, the MaMe+MaRe pipeline enhances quality while reducing Stable Diffusion v2.1 generation latency by 31%. Collectively, these results demonstrate MaMe's and MaRe's effectiveness in accelerating vision models. 
The code is available at https://github.com/cominder/mame.","published_date":"2026-04-15T03:06:24+00:00","viability_score":8,"cluster_label":"Vision Transformers","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A GPU-friendly, training-free method for accelerating Vision Transformers and enhancing image synthesis by merging and restoring tokens using matrix operations.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13427v1","title":"A Unified Conditional Flow for Motion Generation, Editing, and Intra-Structural Retargeting","abstract":"Text-driven motion editing and intra-structural retargeting, where source and target share topology but may differ in bone lengths, are traditionally handled by fragmented pipelines with incompatible inputs and representations: editing relies on specialized generative steering, while retargeting is deferred to geometric post-processing. We present a unifying perspective where both tasks are cast as instances of conditional transport within a single generative framework. By leveraging recent advances in flow matching, we demonstrate that editing and retargeting are fundamentally the same generative task, distinguished only by which conditioning signal, semantic or structural, is modulated during inference. We implement this vision via a rectified-flow motion model jointly conditioned on text prompts and target skeletal structures. Our architecture extends a DiT-style transformer with per-joint tokenization and explicit joint self-attention to strictly enforce kinematic dependencies, while a multi-condition classifier-free guidance strategy balances text adherence with skeletal conformity. Experiments on SnapMoGen and a multi-character Mixamo subset show that a single trained model supports text-to-motion generation, zero-shot editing, and zero-shot intra-structural retargeting. 
This unified approach simplifies deployment and improves structural consistency compared to task-specific baselines.","published_date":"2026-04-15T02:53:07+00:00","viability_score":8,"cluster_label":"Motion Generation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified generative framework using flow matching to perform text-driven motion editing and structural retargeting with a single model.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13426v1","title":"Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking","abstract":"Existing Vision Mamba-based RGB-Event (RGBE) tracking methods suffer from using static state transition matrices, which fail to adapt to variations in event sparsity. This rigidity leads to imbalanced modeling (underfitting sparse event streams and overfitting dense ones), thus degrading cross-modal fusion robustness. To address these limitations, we propose MambaTrack, a multimodal and efficient tracking framework built upon a Dynamic State Space Model (DSSM). Our contributions are twofold. First, we introduce an event-adaptive state transition mechanism that dynamically modulates the state transition matrix based on event stream density. A learnable scalar governs the state evolution rate, enabling differentiated modeling of sparse and dense event flows. Second, we develop a Gated Projection Fusion (GPF) module for robust cross-modal integration. This module projects RGB features into the event feature space and generates adaptive gates from event density and RGB confidence scores. These gates precisely control the fusion intensity, suppressing noise while preserving complementary information. Experiments show that MambaTrack achieves state-of-the-art performance on the FE108 and FELT datasets. 
Its lightweight design suggests potential for real-time embedded deployment.","published_date":"2026-04-15T02:51:35+00:00","viability_score":8,"cluster_label":"Event-Based Vision","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Develop MambaTrack for real-time, robust multimodal object tracking integrating RGB and event data for dynamic environments.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.13418v1","title":"MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments","abstract":"Motivated by the underspecified, multi-hop nature of search queries and the multimodal, heterogeneous, and often conflicting nature of real-world web results, we introduce MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments), a human-annotated benchmark for evaluating search-augmented agents. MERRIN measures AI agents' ability to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning over noisy web sources. It differs from prior work in three important aspects: (1) using natural language queries without explicit modality cues, (2) incorporating underexplored modalities such as video and audio, and (3) requiring the retrieval of complex, often noisy or conflicting multimodal evidence during web search. We evaluate diverse search agents powered by ten models, including strong closed-source models (e.g., GPT-5.4-mini, Gemini 3/3.1 Flash/Pro) and open-weight models (Qwen3-4B/30B/235B), across three search settings (no search, native search, and agentic search). Our results show that MERRIN is highly challenging: the average accuracy across all agents is 22.3%, with the best-performing agent reaching only 40.1%. 
We further observe that while stronger agents like Gemini Deep Research achieve higher performance, gains are modest due to over-exploration; they take more steps and use more tools, but are often distracted by conflicting or partially relevant web content, leading to incorrect answers. Compared to humans, these agents consume more resources yet achieve lower accuracy, largely due to inefficient source selection and an overreliance on text modalities. These findings highlight the need for search agents capable of robust search and reasoning across diverse modalities in noisy web environments, making MERRIN a valuable testbed for evaluating such capabilities.","published_date":"2026-04-15T02:37:47+00:00","viability_score":7,"cluster_label":"Multimodal Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark for evaluating AI agents' ability to retrieve multimodal evidence and reason over noisy web environments.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13417v1","title":"The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability","abstract":"As Large Language Models (LLMs) are increasingly deployed in mission-critical software systems, detecting hallucinations and ``faked truthfulness'' has become a paramount engineering challenge. Current reliability architectures rely heavily on post-generation, black-box mechanisms, such as Retrieval-Augmented Generation (RAG) cross-checking or LLM-as-a-judge evaluators. These extrinsic methods introduce unacceptable latency, high computational overhead, and reliance on secondary external API calls, frequently violating standard software engineering Service Level Agreements (SLAs). In this paper, we propose the Cognitive Circuit Breaker, a novel systems engineering framework that provides intrinsic reliability monitoring with minimal latency overhead. 
By extracting hidden states during a model's forward pass, we calculate the ``Cognitive Dissonance Delta'' -- the mathematical gap between an LLM's outward semantic confidence (softmax probabilities) and its internal latent certainty (derived via linear probes). We demonstrate statistically significant detection of cognitive dissonance, highlight architecture-dependent Out-of-Distribution (OOD) generalization, and show that this framework adds negligible computational overhead to the active inference pipeline.","published_date":"2026-04-15T02:34:37+00:00","viability_score":4,"cluster_label":"LLM Reliability","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework to detect LLM hallucinations by analyzing internal model states, reducing latency and computational overhead for mission-critical applications.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13416v1","title":"DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis","abstract":"Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. 
Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches.","published_date":"2026-04-15T02:33:44+00:00","viability_score":7,"cluster_label":"Novel View Synthesis","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A large-scale dataset and benchmark for distractor-free novel view synthesis, enabling robust radiance field development and improving image enhancement.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13414v1","title":"Minimax Optimality and Spectral Routing for Majority-Vote Ensembles under Markov Dependence","abstract":"Majority-vote ensembles achieve variance reduction by averaging over diverse, approximately independent base learners. When training data exhibits Markov dependence, as in time-series forecasting, reinforcement learning (RL) replay buffers, and spatial grids, this classical guarantee degrades in ways that existing theory does not fully quantify. We provide a minimax characterization of this phenomenon for discrete classification in a fixed-dimensional Markov setting, together with an adaptive algorithm that matches the rate on a graph-regular subclass. We first establish an information-theoretic lower bound for stationary, reversible, geometrically ergodic chains in fixed ambient dimension, showing that no measurable estimator can achieve excess classification risk better than $\u03a9(\\sqrt{\\Tmix/n})$. 
We then prove that, on the AR(1) witness subclass underlying the lower-bound construction, dependence-agnostic uniform bagging is provably suboptimal with excess risk bounded below by $\u03a9(\\Tmix/\\sqrt{n})$, exhibiting a $\\sqrt{\\Tmix}$ algorithmic gap. Finally, we propose \\emph{adaptive spectral routing}, which partitions the training data via the empirical Fiedler eigenvector of a dependency graph and achieves the minimax rate $\\mathcal{O}(\\sqrt{\\Tmix/n})$ up to a lower-order geometric cut term on a graph-regular subclass, without knowledge of $\\Tmix$. Experiments on synthetic Markov chains, 2D spatial grids, the 128-dataset UCR archive, and Atari DQN ensembles validate the theoretical predictions. Consequences for deep RL target variance, scalability via Nystr\u00f6m approximation, and bounded non-stationarity are developed as supporting material in the appendix.","published_date":"2026-04-15T02:32:30+00:00","viability_score":4,"cluster_label":"Ensemble Methods","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A theoretical framework and adaptive algorithm for minimax optimal majority-vote ensembles in Markov-dependent data, improving time-series and RL applications.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13398v1","title":"From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning","abstract":"While Aspect-based Sentiment Analysis (ABSA) systems have achieved high accuracy in identifying sentiment polarities, they often operate as \"black boxes,\" lacking the explicit reasoning capabilities characteristic of human affective cognition. Humans do not merely categorize sentiment; they construct causal explanations for their judgments. To bridge this gap, we propose ABSA-R1, a large language model framework designed to mimic this ``reason-before-predict\" cognitive process. 
By leveraging reinforcement learning (RL), ABSA-R1 learns to articulate the why behind the what, generating natural language justifications that ground its sentiment predictions. We introduce a Cognition-Aligned Reward Model (formerly sentiment-aware reward model) that enforces consistency between the generated reasoning path and the final emotional label. Furthermore, inspired by metacognitive monitoring, we implement a performance-driven rejection sampling strategy that selectively targets hard cases where the model's internal reasoning is uncertain or inconsistent. Experimental results on four benchmarks demonstrate that equipping models with this explicit reasoning capability not only enhances interpretability but also yields superior performance in sentiment classification and triplet extraction compared to non-reasoning baselines.","published_date":"2026-04-15T01:55:40+00:00","viability_score":7,"cluster_label":"Explainable AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A reinforcement learning framework that aligns sentiment analysis with human-like reasoning, improving interpretability and prediction accuracy.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13395v1","title":"Quantifying and Understanding Uncertainty in Large Reasoning Models","abstract":"Large Reasoning Models (LRMs) have recently demonstrated significant improvements in complex reasoning. While quantifying generation uncertainty in LRMs is crucial, traditional methods are often insufficient because they do not provide finite-sample guarantees for reasoning-answer generation. Conformal prediction (CP) stands out as a distribution-free and model-agnostic methodology that constructs statistically rigorous uncertainty sets. However, existing CP methods ignore the logical connection between the reasoning trace and the final answer. 
Additionally, prior studies fail to interpret the origins of uncertainty coverage for LRMs as they typically overlook the specific training factors driving valid reasoning. Notably, it is challenging to disentangle reasoning quality from answer correctness when quantifying uncertainty, while simultaneously establishing theoretical guarantees for computationally efficient explanation methods. To address these challenges, we first propose a novel methodology that quantifies uncertainty in the reasoning-answer structure with statistical guarantees. Subsequently, we develop a unified example-to-step explanation framework using Shapley values that identifies a provably sufficient subset of training examples and their key reasoning steps to preserve the guarantees. We also provide theoretical analyses of our proposed methods. Extensive experiments on challenging reasoning datasets verify the effectiveness of the proposed methods.","published_date":"2026-04-15T01:53:11+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel methodology quantifies uncertainty in Large Reasoning Models with statistical guarantees and provides interpretable explanations by identifying key training examples and reasoning steps.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13392v1","title":"ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold","abstract":"Tabular data remains prevalent in high-stakes domains such as healthcare and finance, where predictive models are expected to provide both high accuracy and faithful, human-understandable reasoning. While symbolic models offer verifiable logic, they lack semantic expressiveness. Meanwhile, general-purpose LLMs often require specialized fine-tuning to master domain-specific tabular reasoning. 
To address the dual challenges of scalable data curation and reasoning consistency, we propose ReSS, a systematic framework that bridges symbolic and neural reasoning models. ReSS leverages a decision-tree model to extract instance-level decision paths as symbolic scaffolds. These scaffolds, alongside input features and labels, guide an LLM to generate grounded natural-language reasoning that strictly adheres to the underlying decision logic. The resulting high-quality dataset is used to fine-tune a pretrained LLM into a specialized tabular reasoning model, further enhanced by a scaffold-invariant data augmentation strategy to improve generalization and explainability. To rigorously assess faithfulness, we introduce quantitative metrics including hallucination rate, explanation necessity, and explanation sufficiency. Experimental results on medical and financial benchmarks demonstrate that ReSS-trained models improve over traditional decision trees and standard fine-tuning approaches by up to $10\\%$ while producing faithful and consistent reasoning.","published_date":"2026-04-15T01:43:00+00:00","viability_score":8,"cluster_label":"Tabular AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ReSS bridges symbolic and neural reasoning for tabular data by using decision trees to scaffold LLMs, generating faithful and consistent natural-language explanations for high-stakes domains.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13385v1","title":"On the Use of Evolutionary Optimization for the Dynamic Chance Constrained Open-Pit Mine Scheduling Problem","abstract":"Open-pit mine scheduling is a complex real world optimization problem that involves uncertain economic values and dynamically changing resource capacities. Evolutionary algorithms are particularly effective in these scenarios, as they can easily adapt to uncertain and changing environments. 
However, uncertainty and dynamic changes are often studied in isolation in real-world problems. In this paper, we study a dynamic chance-constrained open-pit mine scheduling problem in which block economic values are stochastic and mining and processing capacities vary over time. We adopt a bi-objective evolutionary formulation that simultaneously maximizes expected discounted profit and minimizes its standard deviation. To address dynamic changes, we propose a diversity-based change response mechanism that repairs a subset of infeasible solutions and introduces additional feasible solutions whenever a change is detected. We evaluate the effectiveness of this mechanism across four multi-objective evolutionary algorithms and compare it with a baseline re-evaluation-based change-response strategy. Experimental results on six mining instances demonstrate that the proposed approach consistently outperforms the baseline methods across different uncertainty levels and change frequencies.","published_date":"2026-04-15T01:16:01+00:00","viability_score":3,"cluster_label":"Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A bi-objective evolutionary algorithm with a diversity-based change response mechanism optimizes open-pit mine scheduling under dynamic economic values and resource capacities.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13381v1","title":"Young people's perceptions and recommendations for conversational generative artificial intelligence in youth mental health","abstract":"Conversational generative artificial intelligence agents (or genAI chatbots) could benefit youth mental health, yet young people's perspectives remain underexplored. We examined the Mental health Intelligence Agent (Mia), a genAI chatbot originally designed for professionals in Australian youth services. 
Following co-design, 32 young people participated in online workshops to explore their perceptions of genAI chatbots in youth mental health and to develop recommendations for reconceptualising Mia for consumers and integrating it into services. Four themes were developed: (1) Humanising AI without dehumanising care, (2) I need to know what's under the hood, (3) Right tool, right place, right time?, and (4) Making it mine on safe ground. This study offers insights into young people's attitudes, needs, and requirements regarding genAI chatbots in youth mental health, with key implications for service integration. Additionally, by co-designing system requirements, this work informs the ethics, design, development, implementation, and governance of genAI chatbots in youth mental health contexts.","published_date":"2026-04-15T01:10:23+00:00","viability_score":0,"cluster_label":"AI Ethics & UX","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Young people's perceptions and co-designed recommendations for conversational generative AI in youth mental health, focusing on humanizing care, transparency, and personalized integration.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13367v1","title":"A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings","abstract":"Radiotherapy-induced normal tissue injury is a clinically important complication, and accurate segmentation of injury regions from medical images could facilitate disease assessment, treatment planning, and longitudinal monitoring. However, automatic segmentation of these lesions remains largely unexplored because of limited voxel-level annotations and substantial heterogeneity across injury types, lesion size, and imaging modality. 
To address this gap, we curate a dedicated head-and-neck radiotherapy-induced normal tissue injury dataset covering three manifestations: osteoradionecrosis (ORN), cerebral edema (CE), and cerebral radiation necrosis (CRN). We further propose a 3D SAM-based progressive prompting framework for multi-task segmentation in limited-data settings. The framework progressively incorporates three complementary prompts: text prompts for task-aware adaptation, dose-guided box prompts for coarse localization, and click prompts for iterative refinement. A small-target focus loss is introduced to improve local prediction and boundary delineation for small and sparse lesions. Experiments on ORN, CE, and CRN demonstrate that the proposed method achieves reliable segmentation performance across diverse injury types and outperforms state-of-the-art methods.","published_date":"2026-04-15T00:22:23+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A 3D SAM-based framework for multi-task segmentation of radiotherapy-induced normal tissue injuries in limited-data settings, outperforming state-of-the-art.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13356v1","title":"Peer-Predictive Self-Training for Language Model Reasoning","abstract":"Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. 
We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training.","published_date":"2026-04-14T23:29:44+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Peer-Predictive Self-Training (PST) enables collaborative self-improvement of language models for reasoning without external supervision.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.13354v1","title":"Finetuning-Free Diffusion Model with Adaptive Constraint Guidance for Inorganic Crystal Structure Generation","abstract":"The discovery of inorganic crystal structures with targeted properties is a significant challenge in materials science. Generative models, especially state-of-the-art diffusion models, offer the promise of modeling complex data distributions and proposing novel, realistic samples. However, current generative AI models still struggle to produce diverse, original, and reliable structures of experimentally achievable materials suitable for high-stakes applications.   
In this work, we propose a generative machine learning framework based on diffusion models with adaptive constraint guidance, which enables the incorporation of user-defined physical and chemical constraints during the generation process. This approach is designed to be practical and interpretable for human experts, allowing transparent decision-making and expert-driven exploration. To ensure the robustness and validity of the generated candidates, we introduce a multi-step validation pipeline that combines graph neural network estimators trained to achieve DFT-level accuracy and convex hull analysis for assessing thermodynamic stability. Our approach has been tested and validated on several classical examples of inorganic families of compounds, as case studies. As a consequence, these preliminary results demonstrate our framework's ability to generate thermodynamically plausible crystal structures that satisfy targeted geometric constraints across diverse inorganic chemical systems.","published_date":"2026-04-14T23:25:54+00:00","viability_score":6,"cluster_label":"Materials Science AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A diffusion model with adaptive constraint guidance for generating thermodynamically plausible inorganic crystal structures with targeted properties.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.13348v1","title":"Listening Alone, Understanding Together: Collaborative Context Recovery for Privacy-Aware AI","abstract":"We introduce CONCORD, a privacy-aware asynchronous assistant-to-assistant (A2A) framework that leverages collaboration between proactive speech-based AI. As agents evolve from reactive to always-listening assistants, they face a core privacy risk (of capturing non-consenting speakers), which makes their social deployment a challenge. 
To overcome this, we implement CONCORD, which enforces owner-only speech capture via real-time speaker verification, producing a one-sided transcript that incurs missing context but preserves privacy. We demonstrate that CONCORD can safely recover necessary context through (1) spatio-temporal context resolution, (2) information gap detection, and (3) minimal A2A queries governed by a relationship-aware disclosure. Instead of hallucination-prone inferring, CONCORD treats context recovery as a negotiated safe exchange between assistants. Across a multi-domain dialogue dataset, CONCORD achieves 91.4% recall in gap detection, 96% relationship classification accuracy, and 97% true negative rate in privacy-sensitive disclosure decisions. By reframing always-listening AI as a coordination problem between privacy-preserving agents, CONCORD offers a practical path toward socially deployable proactive conversational agents.","published_date":"2026-04-14T23:18:06+00:00","viability_score":8,"cluster_label":"Privacy-Aware AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CONCORD is a privacy-aware A2A framework enabling socially deployable proactive conversational agents through collaborative context recovery.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13318v1","title":"WebXSkill: Skill Learning for Autonomous Web Agents","abstract":"Autonomous web agents powered by large language models (LLMs) have shown promise in completing complex browser tasks, yet they still struggle with long-horizon workflows. A key bottleneck is the grounding gap in existing skill formulations: textual workflow skills provide natural language guidance but cannot be directly executed, while code-based skills are executable but opaque to the agent, offering no step-level understanding for error recovery or adaptation. 
We introduce WebXSkill, a framework that bridges this gap with executable skills, each pairing a parameterized action program with step-level natural language guidance, enabling both direct execution and agent-driven adaptation. WebXSkill operates in three stages: skill extraction mines reusable action subsequences from readily available synthetic agent trajectories and abstracts them into parameterized skills, skill organization indexes skills into a URL-based graph for context-aware retrieval, and skill deployment exposes two complementary modes, grounded mode for fully automated multi-step execution and guided mode where skills serve as step-by-step instructions that the agent follows with its native planning. On WebArena and WebVoyager, WebXSkill improves task success rate by up to 9.8 and 12.9 points over the baseline, respectively, demonstrating the effectiveness of executable skills for web agents. The code is publicly available at https://github.com/aiming-lab/WebXSkill.","published_date":"2026-04-14T21:48:15+00:00","viability_score":7,"cluster_label":"AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"WebXSkill enables web agents to autonomously perform complex browser tasks with improved success rates through executable skills.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.13316v1","title":"Beyond Uniform Sampling: Synergistic Active Learning and Input Denoising for Robust Neural Operators","abstract":"Neural operators have emerged as fast surrogate models for physics simulations, yet they remain acutely vulnerable to adversarial perturbations, a critical liability for safety-critical digital twin deployments. We present a synergistic defense that combines active learning-based data generation with an input denoising architecture. 
The active learning component adaptively probes model weaknesses using differential evolution attacks, then generates targeted training data at discovered vulnerability locations while an adaptive smooth-ratio safeguard preserves baseline accuracy. The input denoising component augments the operator architecture with a learnable bottleneck that filters adversarial noise while retaining physics-relevant features. On the viscous Burgers' equation benchmark, the combined approach achieves a 2.04% combined error (1.21% baseline + 0.83% robustness), representing an 87% reduction relative to standard training (15.42% combined) and outperforming both active learning alone (3.42%) and input denoising alone (5.22%). More broadly, our results, combined with cross-architecture vulnerability analysis from prior work, suggest that optimal training data for neural operators is architecture-dependent: because different architectures concentrate sensitivity in distinct input subspaces, uniform sampling cannot adequately cover the vulnerability landscape of all models. These findings have potential implications for the deployment of neural operators in safety-critical energy systems including nuclear reactor monitoring.","published_date":"2026-04-14T21:43:09+00:00","viability_score":5,"cluster_label":"Robustness","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel defense mechanism for neural operators enhances robustness against adversarial attacks.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13304v1","title":"Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision","abstract":"Understanding the internal activations of Vision Transformers (ViTs) is critical for building interpretable and trustworthy models. 
While Sparse Autoencoders (SAEs) have been used to extract human-interpretable features, they operate on individual layers and fail to capture the cross-layer computational structure of Transformers, as well as the relative significance of each layer in forming the last-layer representation. Alternatively, we introduce the adoption of Cross-Layer Transcoders (CLTs) as reliable, sparse, and depth-aware proxy models for MLP blocks in ViTs. CLTs use an encoder-decoder scheme to reconstruct each post-MLP activation from learned sparse embeddings of preceding layers, yielding a linear decomposition that transforms the final representation of ViTs from an opaque embedding into an additive, layer-resolved construction that enables faithful attribution and process-level interpretability. We train CLTs on CLIP ViT-B/32 and ViT-B/16 across CIFAR-100, COCO, and ImageNet-100. We show that CLTs achieve high reconstruction fidelity with post-MLP activations while preserving and even improving, in some cases, CLIP zero-shot classification accuracy. In terms of interpretability, we show that the cross-layer contribution scores provide faithful attribution, revealing that the final representation is concentrated in a smaller set of dominant layer-wise terms whose removal degrades performance and whose retention largely preserves it. 
These results showcase the significance of adopting CLTs as an alternative interpretable proxy of ViTs in the vision domain.","published_date":"2026-04-14T21:18:08+00:00","viability_score":2,"cluster_label":"Interpretable AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research explores the interpretability of Vision Transformers through Cross-Layer Transcoders.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13288v1","title":"Giving Voice to the Constitution: Low-Resource Text-to-Speech for Quechua and Spanish Using a Bilingual Legal Corpus","abstract":"We present a unified pipeline for synthesizing high-quality Quechua and Spanish speech for the Peruvian Constitution using three state-of-the-art text-to-speech (TTS) architectures: XTTS v2, F5-TTS, and DiFlow-TTS. Our models are trained on independent Spanish and Quechua speech datasets with heterogeneous sizes and recording conditions, and leverage bilingual and multilingual TTS capabilities to improve synthesis quality in both languages. By exploiting cross-lingual transfer, our framework mitigates data scarcity in Quechua while preserving naturalness in Spanish. We release trained checkpoints, inference code, and synthesized audio for each constitutional article, providing a reusable resource for speech technologies in indigenous and multilingual contexts. 
This work contributes to the development of inclusive TTS systems for political and legal content in low-resource settings.","published_date":"2026-04-14T20:32:29+00:00","viability_score":8,"cluster_label":"Text-to-Speech","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A bilingual TTS system synthesizes speech for the Peruvian Constitution in Quechua and Spanish.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13286v1","title":"English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training","abstract":"Despite the widespread multilingual deployment of large language models, post-training pipelines remain predominantly English-centric, contributing to performance disparities across languages. We present a systematic, controlled study of the interplay between training language coverage, model scale, and task domain, based on 220 supervised fine-tuning runs on parallel translated multilingual data mixtures spanning mathematical reasoning and API calling tasks, with models up to 8B parameters. We find that increasing language coverage during post-training is largely beneficial across tasks and model scales, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. Even minimal multilinguality helps: incorporating a single non-English language improves both English performance and cross-lingual generalization, making English-only post-training largely suboptimal. 
Moreover, at sufficient language diversity, zero-shot cross-lingual transfer can match or exceed the effects of direct language inclusion in a low-diversity setting, although gains remain limited for typologically distant, low-resource languages.","published_date":"2026-04-14T20:26:34+00:00","viability_score":3,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Systematically explore the impact of multilingual data on LLM performance across different scales and tasks, revealing benefits for low-resource languages and overall cross-lingual generalization.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13285v1","title":"L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification","abstract":"Clinical text classification requires choosing between specialized fine-tuned models (BERT variants) and general-purpose large language models (LLMs), yet neither dominates across all instances. We introduce Learning to Defer for clinical text (L2D-Clinical), a framework that learns when a BERT classifier should defer to an LLM based on uncertainty signals and text characteristics. Unlike prior L2D work that defers to human experts assumed universally superior, our approach enables adaptive deferral, improving accuracy when the LLM complements BERT. We evaluate on two English clinical tasks: (1) ADE detection (ADE Corpus V2), where BioBERT (F1=0.911) outperforms the LLM (F1=0.765), and (2) treatment outcome classification (MIMIC-IV with multi-LLM consensus ground truth), where GPT-5-nano (F1=0.967) outperforms ClinicalBERT (F1=0.887). On ADE, L2D-Clinical achieves F1=0.928 (+1.7 points over BERT) by selectively deferring 7% of instances where the LLM's high recall compensates for BERT's misses. On MIMIC, L2D-Clinical achieves F1=0.980 (+9.3 points over BERT) by deferring only 16.8% of cases to the LLM. 
The key insight is that L2D-Clinical learns to selectively leverage LLM strengths while minimizing API costs.","published_date":"2026-04-14T20:23:45+00:00","viability_score":6,"cluster_label":"Clinical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Adaptive model selection framework for clinical text classification that intelligently defers to LLMs for improved accuracy and cost-efficiency.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.13283v1","title":"Optimizing Earth Observation Satellite Schedules under Unknown Operational Constraints: An Active Constraint Acquisition Approach","abstract":"Earth Observation (EO) satellite scheduling (deciding which imaging tasks to perform and when) is a well-studied combinatorial optimization problem. Existing methods typically assume that the operational constraint model is fully specified in advance. In practice, however, constraints governing separation between observations, power budgets, and thermal limits are often embedded in engineering artefacts or high-fidelity simulators rather than in explicit mathematical models. We study EO scheduling under \\emph{unknown constraints}: the objective is known, but feasibility must be learned interactively from a binary oracle. Working with a simplified model restricted to pairwise separation and global capacity constraints, we introduce Conservative Constraint Acquisition~(CCA), a domain-specific procedure designed to identify justified constraints efficiently in practice while limiting unnecessary tightening of the learned model. Embedded in the \\textsc{Learn\\&Optimize} framework, CCA supports an interactive search process that alternates optimization under a learned constraint model with targeted oracle queries. On synthetic instances with up to 50~tasks and dense constraint networks, L\\&O improves over a no-knowledge greedy baseline and uses far fewer main oracle queries than a two-phase acquire-then-solve baseline (FAO). 
For $n\\leq 30$, the average gap drops from 65--68\\% (Priority Greedy) to 17.7--35.8\\% using L\\&O. At $n{=}50$, where the CP-SAT reference is the best feasible solution found in 120~s, L\\&O improves on FAO on average (17.9\\% vs.\\ 20.3\\%) while using 21.3 main queries instead of 100 and about $5\\times$ less execution time.","published_date":"2026-04-14T20:19:28+00:00","viability_score":3,"cluster_label":"Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An active constraint acquisition approach for Earth Observation satellite scheduling that learns operational constraints interactively from an oracle.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13279v1","title":"Explainable Fall Detection for Elderly Care via Temporally Stable SHAP in Skeleton-Based Human Activity Recognition","abstract":"Fall detection in elderly care requires not only accurate classification but also reliable explanations that clinicians can trust. However, existing post-hoc explainability methods, when applied frame-by-frame to sequential data, produce temporally unstable attribution maps that clinicians cannot reliably act upon. To address this issue, we propose a lightweight and explainable framework for skeleton-based fall detection that combines an efficient LSTM model with T-SHAP, a temporally aware post-hoc aggregation strategy that stabilizes SHAP-based feature attributions over contiguous time windows. Unlike standard SHAP, which treats each frame independently, T-SHAP applies a linear smoothing operator to the attribution sequence, reducing high-frequency variance while preserving the theoretical guarantees of Shapley values, including local accuracy and consistency. 
Experiments on the NTU RGB+D Dataset demonstrate that the proposed framework achieves 94.3% classification accuracy with an end-to-end inference latency below 25 milliseconds, satisfying real-time constraints on mid-range hardware and indicating strong potential for deployment in clinical monitoring scenarios. Quantitative evaluation using perturbation-based faithfulness metrics shows that T-SHAP improves explanation reliability compared to standard SHAP (AUP: 0.89 vs. 0.91) and Grad-CAM (0.82), with consistent improvements observed across five-fold cross-validation. The resulting attributions consistently highlight biomechanically relevant motion patterns, including lower-limb instability and changes in spinal alignment, aligning with established clinical observations of fall dynamics and supporting their use as transparent decision aids in long-term care environments.","published_date":"2026-04-14T20:15:00+00:00","viability_score":7,"cluster_label":"Healthcare AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An explainable fall detection system for elderly care using temporally stable SHAP to provide reliable insights for clinicians.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13262v1","title":"Rethinking Uncertainty in Segmentation: From Estimation to Decision","abstract":"In medical image segmentation, uncertainty estimates are often reported but rarely used to guide decisions. We study the missing step: how uncertainty maps are converted into actionable policies such as accepting, flagging, or deferring predictions. We formulate segmentation as a two-stage pipeline, estimation followed by decision, and show that optimizing uncertainty alone fails to capture most of the achievable safety gains. 
Using retinal vessel segmentation benchmarks (DRIVE, STARE, CHASE_DB1), we evaluate two uncertainty sources (Monte Carlo Dropout and Test-Time Augmentation) combined with three deferral strategies, and introduce a simple confidence-aware deferral rule that prioritizes uncertain and low-confidence predictions. Our results show that the best method and policy combination removes up to 80 percent of segmentation errors at only 25 percent pixel deferral, while achieving strong cross-dataset robustness. We further show that calibration improvements do not translate to better decision quality, highlighting a disconnect between standard uncertainty metrics and real-world utility. These findings suggest that uncertainty should be evaluated based on the decisions it enables, rather than in isolation.","published_date":"2026-04-14T19:52:05+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new method for medical image segmentation that uses uncertainty estimates to guide decisions, significantly reducing errors with minimal deferral.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13258v1","title":"Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs","abstract":"Attribution methods seek to explain language model predictions by quantifying the contribution of input tokens to generated outputs. However, most existing techniques are designed for encoder-based architectures and rely on linear approximations that fail to capture the causal and semantic complexities of autoregressive generation in decoder-only models. To address these limitations, we propose Hessian-Enhanced Token Attribution (HETA), a novel attribution framework tailored for decoder-only language models. 
HETA combines three complementary components: a semantic transition vector that captures token-to-token influence across layers, Hessian-based sensitivity scores that model second-order effects, and KL divergence to measure information loss when tokens are masked. This unified design produces context-aware, causally faithful, and semantically grounded attributions. Additionally, we introduce a curated benchmark dataset for systematically evaluating attribution quality in generative settings. Empirical evaluations across multiple models and datasets demonstrate that HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations, establishing a new standard for interpretability in autoregressive language models.","published_date":"2026-04-14T19:43:43+00:00","viability_score":8,"cluster_label":"LLM Interpretability","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"HETA is a novel attribution framework for decoder-only LLMs that provides context-aware, causally faithful, and semantically grounded explanations of token contributions.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13252v1","title":"Out of Context: Reliability in Multimodal Anomaly Detection Requires Contextual Inference","abstract":"Anomaly detection aims to identify observations that deviate from expected behavior. Because anomalous events are inherently sparse, most frameworks are trained exclusively on normal data to learn a single reference model of normality. This implicitly assumes that normal behavior can be captured by a single, unconditional reference distribution. In practice, however, anomalies are often context-dependent: A specific observation may be normal under one operating condition, yet anomalous under another. 
As machine learning systems are deployed in dynamic and heterogeneous environments, these fixed-context assumptions introduce structural ambiguity, i.e., the inability to distinguish contextual variation from genuine abnormality under marginal modeling, leading to unstable performance and unreliable anomaly assessments. While modern sensing systems frequently collect multimodal data capturing complementary aspects of both system behavior and operating conditions, existing methods treat all data streams equally, without distinguishing contextual information from anomaly-relevant signals. As a result, abnormality is often evaluated without explicitly conditioning on operating conditions. We argue that multimodal anomaly detection should be reframed as a cross-modal contextual inference problem, in which modalities play asymmetric roles, separating context from observation, to define abnormality conditionally rather than relative to a single global reference. This perspective has implications for model design, evaluation protocols, and benchmark construction, and we outline open research challenges toward robust, context-aware multimodal anomaly detection.","published_date":"2026-04-14T19:32:55+00:00","viability_score":4,"cluster_label":"Anomaly Detection","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Reframing multimodal anomaly detection as a cross-modal contextual inference problem to improve reliability in dynamic environments.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13248v1","title":"GeoVision-Enabled Digital Twin for Hybrid Autonomous-Teleoperated Medical Responses","abstract":"Remote medical response systems are increasingly being deployed to support emergency care in disaster-affected and infrastructure-limited environments. Enabled by GeoVision capabilities, this paper presents a Digital Twin architecture for hybrid autonomous-teleoperated medical response systems. 
The proposed framework integrates perception and adaptive navigation with a Digital Twin, synchronized in real-time, that mirrors system states, environmental dynamics, patient conditions, and mission objectives. Unlike traditional ground control interfaces, the Digital Twin provides remote clinical and operational users with an intuitive, continuously updated virtual representation of the platform and its operational context, enabling enhanced situational awareness and informed decision-making.","published_date":"2026-04-14T19:23:32+00:00","viability_score":3,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A Digital Twin architecture for hybrid autonomous-teleoperated medical response systems, enhancing situational awareness and decision-making.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13244v1","title":"4th Workshop on Maritime Computer Vision (MaCVi): Challenge Overview","abstract":"The 4th Workshop on Maritime Computer Vision (MaCVi) is organized as part of CVPR 2026. This edition features five benchmark challenges with emphasis on both predictive accuracy and embedded real-time feasibility. This report summarizes the MaCVi 2026 challenge setup, evaluation protocols, datasets, and benchmark tracks, and presents quantitative results, qualitative comparisons, and cross-challenge analyses of emerging method trends. We also include technical reports from top-performing teams to highlight practical design choices and lessons learned across the benchmark suite. 
Datasets, leaderboards, and challenge resources are available at https://macvi.org/workshop/cvpr26.","published_date":"2026-04-14T19:21:41+00:00","viability_score":3,"cluster_label":"Computer Vision","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A workshop overview and challenge report for maritime computer vision, focusing on predictive accuracy and real-time feasibility with benchmark challenges and datasets.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13243v1","title":"Lazy or Efficient? Towards Accessible Eye-Tracking Event Detection Using LLMs","abstract":"Gaze event detection is fundamental to vision science, human-computer interaction, and applied analytics. However, current workflows often require specialized programming knowledge and careful handling of heterogeneous raw data formats. Classical detectors such as I-VT and I-DT are effective but highly sensitive to preprocessing and parameterization, limiting their usability outside specialized laboratories. This work introduces a code-free, large language model (LLM)-driven pipeline that converts natural language instructions into an end-to-end analysis. The system (1) inspects raw eye-tracking files to infer structure and metadata, (2) generates executable routines for data cleaning and detector implementation from concise user prompts, (3) applies the generated detector to label fixations and saccades, and (4) returns results and explanatory reports, and allows users to iteratively optimize their code by editing the prompt. Evaluated on public benchmarks, the approach achieves accuracy comparable to traditional methods while substantially reducing technical overhead. 
The framework lowers barriers to entry for eye-tracking research, providing a flexible and accessible alternative to code-intensive workflows.","published_date":"2026-04-14T19:21:15+00:00","viability_score":7,"cluster_label":"Human-Computer Interaction","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A code-free, LLM-driven pipeline for accessible eye-tracking event detection that converts natural language instructions into end-to-end analysis, reducing technical overhead.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13242v1","title":"On the Creativity of AI Agents","abstract":"Large language models (LLMs), particularly when integrated into agentic systems, have demonstrated human- and even superhuman-level performance across multiple domains. Whether these systems can truly be considered creative, however, remains a matter of debate, as conclusions heavily depend on the definitions, evaluation methods, and specific use cases employed. In this paper, we analyse creativity along two complementary macro-level perspectives. The first is a functionalist perspective, focusing on the observable characteristics of creative outputs. The second is an ontological perspective, emphasising the underlying processes, as well as the social and personal dimensions involved in creativity. We focus on LLM agents and we argue that they exhibit functionalist creativity, albeit not at its most sophisticated levels, while they continue to lack key aspects of ontological creativity. 
Finally, we discuss whether it is desirable for agentic systems to attain both forms of creativity, evaluating potential benefits and risks, and proposing pathways toward artificial creativity that can enhance human society.","published_date":"2026-04-14T19:19:59+00:00","viability_score":3,"cluster_label":"AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An analysis of the creativity of AI agents, exploring functionalist and ontological perspectives and discussing the desirability and risks of artificial creativity.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.13236v1","title":"SemiFA: An Agentic Multi-Modal Framework for Autonomous Semiconductor Failure Analysis Report Generation","abstract":"Semiconductor failure analysis (FA) requires engineers to examine inspection images, correlate equipment telemetry, consult historical defect records, and write structured reports, a process that can consume several hours of expert time per case. We present SemiFA, an agentic multi-modal framework that autonomously generates structured FA reports from semiconductor inspection images in under one minute. SemiFA decomposes FA into a four-agent LangGraph pipeline: a DefectDescriber that classifies and narrates defect morphology using DINOv2 and LLaVA-1.6, a RootCauseAnalyzer that fuses SECS/GEM equipment telemetry with historically similar defects retrieved from a Qdrant vector database, a SeverityClassifier that assigns severity and estimates yield impact, and a RecipeAdvisor that proposes corrective process adjustments. A fifth node assembles a PDF report. We introduce SemiFA-930, a dataset of 930 annotated semiconductor defect images paired with structured FA narratives across nine defect classes, drawn from procedural synthesis, WM-811K, and MixedWM38. 
Our DINOv2-based classifier achieves 92.1% accuracy on 140 validation images (macro F1 = 0.917), and the full pipeline produces complete FA reports in 48 seconds on an NVIDIA A100-SXM4-40 GB GPU. A GPT-4o judge ablation across four modality conditions demonstrates that multi-modal fusion improves root cause reasoning by +0.86 composite points (1-5 scale) over an image-only baseline, with equipment telemetry as the more load-bearing modality. To our knowledge, SemiFA is the first system to integrate SECS/GEM equipment telemetry into a vision-language model pipeline for autonomous FA report generation.","published_date":"2026-04-14T19:08:54+00:00","viability_score":8,"cluster_label":"Semiconductor AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An agentic multi-modal framework that autonomously generates semiconductor failure analysis reports from inspection images in under a minute by fusing vision, telemetry, and historical data.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13226v1","title":"KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs","abstract":"Large Language Models (LLMs) rely heavily on Key-Value (KV) caching to minimize inference latency. However, standard KV caches are context-dependent: reusing a cached document in a new context requires recomputing KV states to account for shifts in attention distribution. Existing solutions such as CacheBlend, EPIC, and SAM-KV mitigate this issue by selectively recomputing a subset of tokens; however, they still incur non-negligible computational overhead (FLOPs) and increased Time-to-First-Token (TTFT) latency. In this paper, we propose KV Packet, a recomputation-free cache reuse framework that treats cached documents as immutable \"packets\" wrapped in lightweight trainable soft-token adapters, which are trained via self-supervised distillation to bridge context discontinuities. 
Experiments on Llama-3.1 and Qwen2.5 demonstrate that the proposed KV Packet method achieves near-zero FLOPs and lower TTFT than recomputation-based baselines, while retaining F1 scores comparable to those of the full recomputation baseline.","published_date":"2026-04-14T18:50:47+00:00","viability_score":7,"cluster_label":"LLM Inference Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A recomputation-free KV caching framework for LLMs that uses trainable adapters to significantly reduce inference latency and computational overhead.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.13218v1","title":"Identifiability of Potentially Degenerate Gaussian Mixture Models With Piecewise Affine Mixing","abstract":"Causal representation learning (CRL) aims to identify the underlying latent variables from high-dimensional observations, even when variables are dependent with each other. We study this problem for latent variables that follow a potentially degenerate Gaussian mixture distribution and that are only observed through the transformation via a piecewise affine mixing function. We provide a series of progressively stronger identifiability results for this challenging setting in which the probability density functions are ill-defined because of the potential degeneracy. For identifiability up to permutation and scaling, we leverage a sparsity regularization on the learned representation. Based on our theoretical results, we propose a two-stage method to estimate the latent variables by enforcing sparsity and Gaussianity in the learned representations. 
Experiments on synthetic and image data highlight our method's effectiveness in recovering the ground-truth latent variables.","published_date":"2026-04-14T18:39:08+00:00","viability_score":3,"cluster_label":"Causal Representation Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A theoretical framework and method for identifying latent variables from degenerate Gaussian mixture models transformed by piecewise affine functions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13217v1","title":"Multitasking Embedding for Embryo Blastocyst Grading Prediction (MEmEBG)","abstract":"Reliable evaluation of blastocyst quality is critical for the success of in vitro fertilization (IVF) treatments. Current embryo grading practices primarily rely on visual assessment of morphological features, which introduces subjectivity, inter-embryologist variability, and challenges in standardizing quality assurance. In this study, we propose a multitask embedding-based approach for the automated analysis and prediction of key blastocyst components, including the trophectoderm (TE), inner cell mass (ICM), and blastocyst expansion (EXP). The method leverages biological and physical characteristics extracted from images of day-5 human embryos. A pretrained ResNet-18 architecture, enhanced with an embedding layer, is employed to learn discriminative representations from a limited dataset and to automatically identify TE and ICM regions along with their corresponding grades, structures that are visually similar and inherently difficult to distinguish. 
Experimental results demonstrate the promise of the multitask embedding approach and potential for robust and consistent blastocyst quality assessment.","published_date":"2026-04-14T18:38:13+00:00","viability_score":6,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multitask embedding approach using a pretrained ResNet-18 to automate blastocyst grading for IVF, improving consistency and reducing subjectivity.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.13206v1","title":"Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models","abstract":"As Large Language Models (LLMs) are increasingly integrated into agentic workflows, their unpredictability stemming from numerical instability has emerged as a critical reliability issue. While recent studies have demonstrated the significant downstream effects of these instabilities, the root causes and underlying mechanisms remain poorly understood. In this paper, we present a rigorous analysis of how unpredictability is rooted in the finite numerical precision of floating-point representations, tracking how rounding errors propagate, amplify, or dissipate through Transformer computation layers. Specifically, we identify a chaotic \"avalanche effect\" in the early layers, where minor perturbations trigger binary outcomes: either rapid amplification or complete attenuation. Beyond specific error instances, we demonstrate that LLMs exhibit universal, scale-dependent chaotic behaviors characterized by three distinct regimes: 1) a stable regime, where perturbations fall below an input-dependent threshold and vanish, resulting in constant outputs; 2) a chaotic regime, where rounding errors dominate and drive output divergence; and 3) a signal-dominated regime, where true input variations override numerical noise. 
We validate these findings extensively across multiple datasets and model architectures.","published_date":"2026-04-14T18:26:38+00:00","viability_score":7,"cluster_label":"LLM Reliability","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Quantifies numerical instability in LLMs, identifying a chaotic 'avalanche effect' and three distinct regimes of unpredictability based on floating-point precision.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.13203v1","title":"Inclusive Kitchen Design for Older Adults: Generative AI Visualizations to Support Mild Cognitive Impairment","abstract":"Mild Cognitive Impairment (MCI) affects 15-20% of adults aged 65 and older, often making kitchen navigation and independent living difficult, particularly in lower-income communities with limited access to professional design help. This study created an AI system that converts standard kitchen photos into MCI-friendly designs using the Home Design Guidelines (HDG). Stable Diffusion models, enhanced with DreamBooth LoRA and ControlNet, were trained on 100 kitchen images to produce realistic visualizations with open layouts, transparent cabinetry, better lighting, non-slip flooring, and less clutter. The models achieved moderate to high semantic alignment (normalized CLIP scores 0.69-0.79) and improved visual realism (GIQA scores 0.45-0.65). In a survey of 33 participants (51.5% caregivers, 36.4% older adults with MCI), the AI-modified kitchens were strongly preferred as more cognitively friendly (87.4% of 198 choices, p < .001). Participants reported high confidence in their kitchen choice selections (M = 5.92/7) and found the visualizations very helpful for home modifications (M = 6.27/7). Thematic analysis emphasized improved visibility, lower cognitive load, and greater independence. 
Overall, this AI tool provides a low-cost, scalable way for older adults and caregivers to visualize and implement DIY kitchen changes, supporting aging in place and resilience for those with MCI.","published_date":"2026-04-14T18:26:01+00:00","viability_score":6,"cluster_label":"Generative AI for Accessibility","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An AI system that transforms standard kitchen photos into MCI-friendly designs, offering a low-cost, scalable solution for older adults and caregivers to visualize and implement DIY kitchen modifications.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.13201v1","title":"InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis","abstract":"Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. From a seed, the simulator deterministically generates a self-contained repository with realistic directory structure, files, and tabular data, and a privileged QA generator produces both answerable and unanswerable questions with exact ground truth. This makes it possible to evaluate evidence-grounded reasoning, abstention, and tool-mediated analysis in a controlled setting without distributing a large static corpus. InfiniteScienceGym complements real scientific benchmarks by targeting blind spots and failure modes that are hard to evaluate using published datasets alone. 
Evaluating both proprietary and open-weight models, we find that none achieve more than 45% accuracy overall, that recognizing unanswerable questions remains a major weakness, and that stronger models tend to use tools more effectively rather than simply consuming more tokens.","published_date":"2026-04-14T18:23:02+00:00","viability_score":7,"cluster_label":"AI Benchmarking","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"InfiniteScienceGym is a procedurally generated benchmark for evaluating LLMs' scientific reasoning capabilities, offering a controlled and unbounded environment without large static datasets.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.13180v1","title":"SciFi: A Safe, Lightweight, User-Friendly, and Fully Autonomous Agentic AI Workflow for Scientific Applications","abstract":"Recent advances in agentic AI have enabled increasingly autonomous workflows, but existing systems still face substantial challenges in achieving reliable deployment in real-world scientific research. In this work, we present a safe, lightweight, and user-friendly agentic framework for the autonomous execution of well-defined scientific tasks. The framework combines an isolated execution environment, a three-layer agent loop, and a self-assessing do-until mechanism to ensure safe and reliable operation while effectively leveraging large language models of varying capability levels. 
By focusing on structured tasks with clearly defined context and stopping criteria, the framework supports end-to-end automation with minimal human intervention, enabling researchers to offload routine workloads and devote more effort to creative activities and open-ended scientific inquiry.","published_date":"2026-04-14T18:02:20+00:00","viability_score":3,"cluster_label":"Agentic AI Workflows","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A safe, lightweight, and user-friendly agentic framework for the autonomous execution of well-defined scientific tasks, enabling researchers to offload routine workloads.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13175v1","title":"Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization","abstract":"Large language models can be aligned with human preferences through offline reinforcement learning (RL) on small labeled datasets. While single-objective alignment is well-studied, many real-world applications demand the simultaneous optimization of multiple conflicting rewards, e.g. optimizing both catalytic activity and specificity in protein engineering, or helpfulness and harmlessness for chatbots. Prior work has largely relied on linear reward scalarization, but this approach provably fails to recover non-convex regions of the Pareto front. In this paper, instead of scalarizing the rewards directly, we frame multi-objective RL itself as an optimization problem to be scalarized via smooth Tchebysheff scalarization, a recent technique that overcomes the shortcomings of linear scalarization. We use this formulation to derive Smooth Tchebysheff Optimization of Multi-Objective Preferences (STOMP), a novel offline RL algorithm that extends direct preference optimization to the multi-objective setting in a principled way by standardizing the individual rewards based on their observed distributions. 
We empirically validate STOMP on a range of protein engineering tasks by aligning three autoregressive protein language models on three laboratory datasets of protein fitness. Compared to state-of-the-art baselines, STOMP achieves the highest hypervolumes in eight of nine settings according to both offline off-policy and generative evaluations. We thus demonstrate that STOMP is a powerful, robust multi-objective alignment algorithm that can meaningfully improve post-trained models for multi-attribute protein optimization and beyond.","published_date":"2026-04-14T18:00:39+00:00","viability_score":7,"cluster_label":"Multi-Objective RL","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"STOMP is a novel offline RL algorithm that extends direct preference optimization to the multi-objective setting, enabling principled alignment of models for multi-attribute tasks like protein engineering.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.13029v1","title":"Visual Preference Optimization with Rubric Rewards","abstract":"The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any possible policies. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting massively improves a 30B-A3B judge and brings it close to GPT-5.4. 
On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it to 75.82 from 81.14. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the 59.48 base model. Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.","published_date":"2026-04-14T17:58:22+00:00","viability_score":7,"cluster_label":"Generative AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for improving visual preference optimization in multimodal AI using instance-specific rubrics, outperforming existing methods and approaching GPT-5.4 quality.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.13021v1","title":"Representation geometry shapes task performance in vision-language modeling for CT enterography","abstract":"Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. 
Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4% vs. 71% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7-14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80-0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.","published_date":"2026-04-14T17:56:23+00:00","viability_score":4,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigating representation geometry in vision-language models for CT enterography, finding that mean pooling favors disease classification, attention pooling favors cross-modal retrieval, and per-slice tissue contrast matters more than spatial coverage.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13017v1","title":"PAL: Personal Adaptive Learner","abstract":"AI-driven education platforms have made some progress in personalisation, yet most remain constrained to static adaptation--predefined quizzes, uniform pacing, or generic feedback--limiting their ability to respond to learners' evolving understanding. This shortfall highlights the need for systems that are both context-aware and adaptive in real time. We introduce PAL (Personal Adaptive Learner), an AI-powered platform that transforms lecture videos into interactive learning experiences. 
PAL continuously analyzes multimodal lecture content and dynamically engages learners through questions of varying difficulty, adjusting to their responses as the lesson unfolds. At the end of a session, PAL generates a personalized summary that reinforces key concepts while tailoring examples to the learner's interests. By uniting multimodal content analysis with adaptive decision-making, PAL contributes a novel framework for responsive digital learning. Our work demonstrates how AI can move beyond static personalization toward real-time, individualized support, addressing a core challenge in AI-enabled education.","published_date":"2026-04-14T17:54:37+00:00","viability_score":8,"cluster_label":"AI Education","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PAL is an AI platform that transforms lecture videos into interactive learning experiences, dynamically adapting to learners' understanding with personalized summaries.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13016v1","title":"Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe","abstract":"On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. 
Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.","published_date":"2026-04-14T17:54:28+00:00","viability_score":7,"cluster_label":"AI Distillation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Deploy a teaching-driven refinement tool to enhance the capabilities of existing large language models.","time_to_mvp":"","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.13013v1","title":"Bilevel Late Acceptance Hill Climbing for the Electric Capacitated Vehicle Routing Problem","abstract":"This paper tackles the Electric Capacitated Vehicle Routing Problem (E-CVRP) through a bilevel optimization framework that handles routing and charging decisions separately or jointly depending on the search stage. By analyzing their interaction, we introduce a surrogate objective at the upper level to guide the search and accelerate convergence. A bilevel Late Acceptance Hill Climbing algorithm (b-LAHC) is introduced that operates through three phases: greedy descent, neighborhood exploration, and final solution refinement. b-LAHC operates with fixed parameters, eliminating the need for complex adaptation while remaining lightweight and effective. Extensive experiments on the IEEE WCCI-2020 benchmark show that b-LAHC achieves superior or competitive performance against eight state-of-the-art algorithms. 
Under a fixed evaluation budget, it attains near-optimal solutions on small-scale instances and sets 9/10 new best-known results on large-scale benchmarks, improving existing records by an average of 1.07%. Moreover, the strong correlation (though not universal) observed between the surrogate objective and the complete cost justifies the use of the surrogate objective while still necessitating a joint solution of both levels, thereby validating the effectiveness of the proposed bilevel framework and highlighting its potential for efficiently solving large-scale routing problems with a hierarchical structure.","published_date":"2026-04-14T17:47:12+00:00","viability_score":3,"cluster_label":"Optimization Algorithms","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A bilevel optimization algorithm for the Electric Capacitated Vehicle Routing Problem that separates routing and charging decisions to accelerate convergence and achieve near-optimal solutions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.13010v1","title":"Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation","abstract":"On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. 
We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.","published_date":"2026-04-14T17:44:50+00:00","viability_score":7,"cluster_label":"LLM Post-Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An offline distillation framework for large language models that significantly speeds up post-training for reasoning and code generation tasks without requiring a live teacher server.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.13006v1","title":"One Token Away from Collapse: The Fragility of Instruction-Tuned Helpfulness","abstract":"Instruction-tuned large language models produce helpful, structured responses, but how robust is this helpfulness when trivially constrained? 
We show that simple lexical constraints (banning a single punctuation character or common word) cause instruction-tuned LLMs to collapse their responses, losing 14--48% of comprehensiveness in pairwise evaluation across three open-weight model families and one closed-weight model (GPT-4o-mini). The baseline response is preferred in 77--100% of 1,920 pairwise comparisons judged by GPT-4o-mini and GPT-4o. Notably, GPT-4o-mini suffers 31% comprehensiveness loss (99% baseline win rate), demonstrating that the fragility extends to commercially deployed closed-weight models, contrary to prior findings on format-level constraints. Through mechanistic analysis, we identify this as a planning failure: two-pass generation (free generation followed by constrained rewriting) recovers 59--96% of response length, and linear probes on prompt representations predict response length with $R^2 = 0.51$--$0.93$ before generation begins, with $R^2$ tracking collapse severity across models. The same probes yield negative $R^2$ on base models, confirming that instruction tuning creates the representational structure encoding the collapse decision. Crucially, base models show no systematic collapse under identical constraints, with effects that are small, noisy, and bidirectional, demonstrating that instruction tuning creates this fragility by coupling task competence to narrow surface-form templates. The effect replicates on MT-Bench across all eight task categories. 
We further show that standard independent LLM-as-judge evaluation detects only a 3.5% average quality drop where pairwise evaluation reveals 23%, exposing a methodological blind spot in how constrained generation is assessed.","published_date":"2026-04-14T17:40:01+00:00","viability_score":6,"cluster_label":"LLM Robustness","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Reveals that instruction-tuned LLMs are fragile to simple lexical constraints, leading to significant response collapse, and identifies a planning failure as the root cause.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12994v1","title":"LogicEval: A Systematic Framework for Evaluating Automated Repair Techniques for Logical Vulnerabilities in Real-World Software","abstract":"Logical vulnerabilities in software stem from flaws in program logic rather than memory safety, which can lead to critical security failures. Although existing automated program repair techniques primarily focus on repairing memory corruption vulnerabilities, they struggle with logical vulnerabilities because of their limited semantic understanding of the vulnerable code and its expected behavior. On the other hand, recent successes of large language models (LLMs) in understanding and repairing code are promising. However, no framework currently exists to analyze the capabilities and limitations of such techniques for logical vulnerabilities. This paper aims to systematically evaluate both traditional and LLM-based repair approaches for addressing real-world logical vulnerabilities. To facilitate our assessment, we created the first ever dataset, LogicDS, of 86 logical vulnerabilities with assigned CVEs reflecting tangible security impact. We also developed a systematic framework, LogicEval, to evaluate patches for logical vulnerabilities. 
Evaluations suggest that compilation and testing failures are primarily driven by prompt sensitivity, loss of code context, and difficulty in patch localization.","published_date":"2026-04-14T17:26:07+00:00","viability_score":4,"cluster_label":"Automated Program Repair","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Introduces LogicEval, a framework and dataset for evaluating automated repair techniques for logical vulnerabilities in real-world software, highlighting limitations of current LLM-based approaches.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12988v1","title":"ROSE: An Intent-Centered Evaluation Metric for NL2SQL","abstract":"Execution Accuracy (EX), the widely used metric for evaluating the effectiveness of Natural Language to SQL (NL2SQL) solutions, is becoming increasingly unreliable. It is sensitive to syntactic variation, ignores that questions may admit multiple interpretations, and is easily misled by erroneous ground-truth SQL. To address this, we introduce ROSE, an intent-centered metric that focuses on whether the predicted SQL answers the question, rather than consistency with the ground-truth SQL under the reference-dependent paradigm. ROSE employs an adversarial Prover-Refuter cascade: SQL Prover assesses the semantic correctness of a predicted SQL against the user's intent independently, while Adversarial Refuter uses the ground-truth SQL as evidence to challenge and refine this judgment. On our expert-aligned validation set ROSE-VEC, ROSE achieves the best agreement with human experts, outperforming the next-best metric by nearly 24% in Cohen's Kappa. We also conduct a large-scale re-evaluation of 19 NL2SQL methods, revealing four valuable insights. 
We release ROSE and ROSE-VEC to facilitate more reliable NL2SQL research.","published_date":"2026-04-14T17:22:40+00:00","viability_score":3,"cluster_label":"NL2SQL Evaluation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A new metric for evaluating Natural Language to SQL systems that focuses on user intent rather than exact SQL syntax.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12986v1","title":"Parallax: Why AI Agents That Think Must Never Act","abstract":"Autonomous AI agents are rapidly transitioning from experimental tools to operational infrastructure, with projections that 80% of enterprise applications will embed AI copilots by the end of 2026. As agents gain the ability to execute real-world actions (reading files, running commands, making network requests, modifying databases), a fundamental security gap has emerged. The dominant approach to agent safety relies on prompt-level guardrails: natural language instructions that operate at the same abstraction level as the threats they attempt to mitigate. This paper argues that prompt-based safety is architecturally insufficient for agents with execution capability and introduces Parallax, a paradigm for safe autonomous AI execution grounded in four principles: Cognitive-Executive Separation, which structurally prevents the reasoning system from executing actions; Adversarial Validation with Graduated Determinism, which interposes an independent, multi-tiered validator between reasoning and execution; Information Flow Control, which propagates data sensitivity labels through agent workflows to detect context-dependent threats; and Reversible Execution, which captures pre-destructive state to enable rollback when validation fails. We present OpenParallax, an open-source reference implementation in Go, and evaluate it using Assume-Compromise Evaluation, a methodology that bypasses the reasoning system entirely to test the architectural boundary under full agent compromise. 
Across 280 adversarial test cases in nine attack categories, Parallax blocks 98.9% of attacks with zero false positives under its default configuration, and 100% of attacks under its maximum-security configuration. When the reasoning system is compromised, prompt-level guardrails provide zero protection because they exist only within the compromised system; Parallax's architectural boundary holds regardless.","published_date":"2026-04-14T17:20:48+00:00","viability_score":7,"cluster_label":"AI Agent Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel security paradigm for AI agents that separates reasoning from execution to prevent unauthorized actions, with a 98.9% attack blocking rate.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12967v1","title":"Cycle-Consistent Search: Question Reconstructability as a Proxy Reward for Search Agent Training","abstract":"Reinforcement Learning (RL) has shown strong potential for optimizing search agents in complex information retrieval tasks. However, existing approaches predominantly rely on gold supervision, such as ground-truth answers, which is difficult to scale. To address this limitation, we propose Cycle-Consistent Search (CCS), a gold-supervision-free framework for training search agents, inspired by cycle-consistency techniques from unsupervised machine translation and image-to-image translation. Our key hypothesis is that an optimal search trajectory, unlike insufficient or irrelevant ones, serves as a lossless encoding of the question's intent. Consequently, a high-quality trajectory should preserve the information required to accurately reconstruct the original question, thereby inducing a reward signal for policy optimization. However, naive cycle-consistency objectives are vulnerable to information leakage, as reconstruction may rely on superficial lexical cues rather than the underlying search process. 
To reduce this effect, we apply information bottlenecks, including exclusion of the final response and named entity recognition (NER) masking of search queries. These constraints force reconstruction to rely on retrieved observations together with the structural scaffold, ensuring that the resulting reward signal reflects informational adequacy rather than linguistic redundancy. Experiments on question-answering benchmarks show that CCS achieves performance comparable to supervised baselines while outperforming prior methods that do not rely on gold supervision. These results suggest that CCS provides a scalable training paradigm for training search agents in settings where gold supervision is unavailable.","published_date":"2026-04-14T17:00:18+00:00","viability_score":6,"cluster_label":"Search Agent Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A gold-supervision-free framework for training search agents using cycle-consistency to reconstruct questions from search trajectories.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12955v1","title":"Modeling Co-Pilots for Text-to-Model Translation","abstract":"There is growing interest in leveraging large language models (LLMs) for text-to-model translation and optimization tasks. This paper aims to advance this line of research by introducing \\textsc{Text2Model} and \\textsc{Text2Zinc}. \\textsc{Text2Model} is a suite of co-pilots based on several LLM strategies with varying complexity, along with an online leaderboard. \\textsc{Text2Zinc} is a cross-domain dataset for capturing optimization and satisfaction problems specified in natural language, along with an interactive editor with built-in AI assistant. 
While there is an emerging literature on using LLMs for translating combinatorial problems into formal models, our work is the first attempt to integrate \\textit{both} satisfaction and optimization problems within a \\textit{unified architecture} and \\textit{dataset}. Moreover, our approach is \\textit{solver-agnostic} unlike existing work that focuses on translation to a solver-specific model. To achieve this, we leverage \\textsc{MiniZinc}'s solver-and-paradigm-agnostic modeling capabilities to formulate combinatorial problems. We conduct comprehensive experiments to compare execution and solution accuracy across several single- and multi-call strategies, including zero-shot prompting, chain-of-thought reasoning, intermediate representations via knowledge-graphs, grammar-based syntax encoding, and agentic approaches that decompose the model into sequential sub-tasks. Our co-pilot strategies are competitive with, and in parts improve on, recent research in this domain. Our findings indicate that while LLMs are promising they are not yet a push-button technology for combinatorial modeling. We contribute \\textsc{Text2Model} co-pilots and leaderboard, and \\textsc{Text2Zinc} and interactive editor to open-source to support closing this performance gap.","published_date":"2026-04-14T16:51:29+00:00","viability_score":7,"cluster_label":"Combinatorial Optimization Modeling","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A suite of LLM co-pilots and a unified dataset for translating natural language into formal models for optimization and satisfaction problems.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12948v1","title":"Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents","abstract":"LLM agents with persistent memory store information as flat factual records, providing little context for temporal reasoning, change tracking, or cross-session aggregation. 
Inspired by the drawing effect [3], we introduce dual-trace memory encoding. In this method, each stored fact is paired with a concrete scene trace, a narrative reconstruction of the moment and context in which the information was learned. The agent is forced to commit to specific contextual details during encoding, creating richer, more distinctive memory traces. Using the LongMemEval-S benchmark (4,575 sessions, 100 recall questions), we compare dual-trace encoding against a fact-only control with matched coverage and format over 99 shared questions. Dual-trace achieves 73.7% overall accuracy versus 53.5%, a +20.2 percentage point (pp) gain (95% CI: [+12.1, +29.3], bootstrap p < 0.0001). Gains concentrate in temporal reasoning (+40pp), knowledge-update tracking (+25pp), and multi-session aggregation (+30pp), with no benefit for single-session retrieval, consistent with encoding specificity theory [8]. Token analysis shows dual-trace encoding achieves this gain at no additional cost. We additionally sketch an architectural design for adapting dual-trace encoding to coding agents, with preliminary pilot validation.","published_date":"2026-04-14T16:45:06+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"LLM agents with dual-trace memory encoding significantly improve cross-session recall and temporal reasoning by pairing facts with contextual scene traces.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12944v1","title":"Distorted or Fabricated? A Survey on Hallucination in Video LLMs","abstract":"Despite significant progress in video-language modeling, hallucinations remain a persistent challenge in Video Large Language Models (Vid-LLMs), referring to outputs that appear plausible yet contradict the content of the input video. 
This survey presents a comprehensive analysis of hallucinations in Vid-LLMs and introduces a systematic taxonomy that categorizes them into two core types: dynamic distortion and content fabrication, each comprising two subtypes with representative cases. Building on this taxonomy, we review recent advances in the evaluation and mitigation of hallucinations, covering key benchmarks, metrics, and intervention strategies. We further analyze the root causes of dynamic distortion and content fabrication, which often result from limited capacity for temporal representation and insufficient visual grounding. These insights inform several promising directions for future work, including the development of motion-aware visual encoders and the integration of counterfactual learning techniques. This survey consolidates scattered progress to foster a systematic understanding of hallucinations in Vid-LLMs, laying the groundwork for building robust and reliable video-language systems. An up-to-date curated list of related works is maintained at https://github.com/hukcc/Awesome-Video-Hallucination .","published_date":"2026-04-14T16:37:57+00:00","viability_score":5,"cluster_label":"Video LLMs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A survey and taxonomy of hallucinations in Video LLMs, analyzing root causes and proposing future research directions for more reliable systems.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12913v1","title":"CoDe-R: Refining Decompiler Output with LLMs via Rationale Guidance and Adaptive Inference","abstract":"Binary decompilation is a critical reverse engineering task aimed at reconstructing high-level source code from stripped executables. 
Although Large Language Models (LLMs) have recently shown promise, they often suffer from \"logical hallucinations\" and \"semantic misalignment\" due to the irreversible semantic loss during compilation, resulting in generated code that fails to re-execute. In this study, we propose Cognitive Decompiler Refinement with Robustness (CoDe-R), a lightweight two-stage code refinement framework. The first stage introduces Semantic Cognitive Enhancement (SCE), a Rationale-Guided Semantic Injection strategy that trains the model to recover high-level algorithmic intent alongside code. The second stage introduces a Dynamic Dual-Path Fallback (DDPF) mechanism during inference, which adaptively balances semantic recovery and syntactic stability via a hybrid verification strategy. Evaluation on the HumanEval-Decompile benchmark demonstrates that CoDe-R (using a 1.3B backbone) establishes a new State-of-the-Art (SOTA) in the lightweight regime. Notably, it is the first 1.3B model to exceed an Average Re-executability Rate of 50.00%, significantly outperforming the baseline and effectively bridging the gap between efficient models and expert-level performance. Our code is available at https://github.com/Theaoi/CoDe-R.","published_date":"2026-04-14T15:58:38+00:00","viability_score":8,"cluster_label":"Decompilation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CoDe-R refines decompiler output using LLMs with rationale guidance and adaptive inference, achieving state-of-the-art re-executability for lightweight models.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12911v1","title":"Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss","abstract":"Multilingual benchmarks guide the development of frontier models. Yet multilingual evaluations reported by frontier models are structured similar to popular reasoning and knowledge benchmarks, but across many languages. 
We show such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these benchmarks, yet often perform worse on real-world multilingual tasks, such as LMArena. We propose a simple alternative: evaluate multilingual capability via round-trip translation. Given text in a source language, translate it to a target language and back; semantic gaps between the original and result expose failures in multilingual generation capabilities. Round-trip translation correlates almost perfectly ($\\rho = 0.94$) with user ratings on LMArena with our benchmark, requires no human reference translations, and does not require a more capable multilingual judge than tested models. Lastly, we introduce Lost in Translation (LiT), a challenging round-trip translation benchmark spanning widely spoken languages worldwide, for realistic evaluation of multilingual frontier models.","published_date":"2026-04-14T15:58:21+00:00","viability_score":6,"cluster_label":"Multilingual LLMs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Round-trip translation reveals limitations of current multilingual benchmarks and introduces 'Lost in Translation' (LiT) for realistic evaluation of multilingual LLMs.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.12898v1","title":"BEAM: Bi-level Memory-adaptive Algorithmic Evolution for LLM-Powered Heuristic Design","abstract":"Large Language Model-based Hyper Heuristic (LHH) has recently emerged as an efficient way for automatic heuristic design. However, most existing LHHs just perform well in optimizing a single function within a pre-defined solver. Their single-layer evolution makes them not effective enough to write a competent complete solver. 
While some variants incorporate hyperparameter tuning or attempt to generate complex code through iterative local modifications, they still lack a high-level algorithmic modeling, leading to limited exploration efficiency. To address this, we reformulate heuristic design as a Bi-level Optimization problem and propose \\textbf{BEAM} (Bi-level Memory-adaptive Algorithmic Evolution). BEAM's exterior layer evolves high-level algorithmic structures with function placeholders through genetic algorithm (GA), while the interior layer realizes these placeholders via Monte Carlo Tree Search (MCTS). We further introduce an Adaptive Memory module to facilitate complex code generation. To support the evaluation for complex code generation, we point out the limitations of starting LHHs from scratch or from code templates and introduce a Knowledge Augmentation (KA) Pipeline. Experimental results on several optimization problems demonstrate that BEAM significantly outperforms existing LHHs, notably reducing the optimality gap by 37.84\\% on aggregate in CVRP hybrid algorithm design. BEAM also designs a heuristic that outperforms SOTA Maximum Independent Set (MIS) solver KaMIS.","published_date":"2026-04-14T15:46:47+00:00","viability_score":7,"cluster_label":"LLM Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BEAM is a bi-level evolutionary algorithm that designs high-level algorithmic structures for LLM-powered heuristic design, significantly outperforming existing methods in complex optimization problems.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12890v1","title":"Towards Long-horizon Agentic Multimodal Search","abstract":"Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. 
However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. 
Our code will be released in https://github.com/RUCAIBox/LMM-Searcher.","published_date":"2026-04-14T15:40:28+00:00","viability_score":7,"cluster_label":"Multimodal Search Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Develop a cutting-edge agent-based multimodal search platform to enhance complex query resolution capabilities.","time_to_mvp":"1-2 weeks","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12879v1","title":"FastGrasp: Learning-based Whole-body Control method for Fast Dexterous Grasping with Mobile Manipulators","abstract":"Fast grasping is critical for mobile robots in logistics, manufacturing, and service applications. Existing methods face fundamental challenges in impact stabilization under high-speed motion, real-time whole-body coordination, and generalization across diverse objects and scenarios, limited by fixed bases, simple grippers, or slow tactile response capabilities. We propose \\textbf{FastGrasp}, a learning-based framework that integrates grasp guidance, whole-body control, and tactile feedback for mobile fast grasping. Our two-stage reinforcement learning strategy first generates diverse grasp candidates via conditional variational autoencoder conditioned on object point clouds, then executes coordinated movements of mobile base, arm, and hand guided by optimal grasp selection. Tactile sensing enables real-time grasp adjustments to handle impact effects and object variations. 
Extensive experiments demonstrate superior grasping performance in both simulation and real-world scenarios, achieving robust manipulation across diverse object geometries through effective sim-to-real transfer.","published_date":"2026-04-14T15:30:57+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"FastGrasp is a learning-based framework for fast, dexterous grasping with mobile manipulators, integrating grasp guidance, whole-body control, and tactile feedback for robust manipulation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12875v1","title":"AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance","abstract":"The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurement ecosystem. We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks released between 2018 and 2026, organized through a multi-sheet schema that records benchmark-level metadata, metric-level definitions, benchmark-paper metadata, and repository activity. This design enables meta-analysis not only of what benchmarks exist, but also of how safety is operationalized, aggregated, and judged across the literature. Using the updated catalogue, we identify a central structural problem: benchmark proliferation has outpaced measurement standardization. The current landscape is dominated by medium-complexity benchmarks (94/195), while only 7 benchmarks occupy the Popular tier. The workbook further reports strong concentration around English-only evaluation (165/195), evaluation-only resources (170/195), stale GitHub repositories (137/195), stale Hugging Face datasets (96/195), and heavy reliance on arXiv preprints among benchmarks with known venue metadata. 
At the metric level, the catalogue shows that familiar labels such as accuracy, F1 score, safety score, and aggregate benchmark scores often conceal materially different judges, aggregation rules, and threat models. We argue that the field's main failure mode is fragmentation rather than scarcity. Researchers now have many benchmark artifacts, but they often lack a shared measurement language, a principled basis for benchmark selection, and durable stewardship norms for post publication maintenance. AISafetyBenchExplorer addresses this gap by providing a traceable benchmark catalogue, a controlled metadata schema, and a complexity taxonomy that together support more rigorous benchmark discovery, comparison, and meta-evaluation.","published_date":"2026-04-14T15:26:03+00:00","viability_score":3,"cluster_label":"AI Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AISafetyBenchExplorer is a catalogue of AI safety benchmarks that reveals fragmented measurement and weak governance, providing a structured approach for benchmark discovery and meta-evaluation.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12874v1","title":"LIFE -- an energy efficient advanced continual learning agentic AI framework for frontier systems","abstract":"The rapid advancement of AI has changed the character of HPC usage such as dimensioning, provisioning, and execution. Not only has energy demand been amplified, but existing rudimentary continual learning capabilities limit ability of AI to effectively manage HPCs. This paper reviews emerging directions beyond monolithic transformers, emphasizing agentic AI and brain inspired architectures as complementary paths toward sustainable, adaptive systems. We propose LIFE, a reasoning and Learning framework that is Incremental, Flexible, and Energy efficient that is implemented as an agent centric system rather than a single monolithic model. 
LIFE uniquely combines four components to realize self-evolving network management and operations in HPCs. The components are an orchestrator, Agentic Context Engineering, a novel memory system, and information lattice learning. LIFE can also generalize to enable a variety of orthogonal use cases. We ground LIFE in a specific closed-loop HPC operations example for detecting and mitigating latency spikes experienced by critical microservices running on a Kubernetes-like cluster.","published_date":"2026-04-14T15:23:36+00:00","viability_score":3,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for energy-efficient continual learning in HPC systems using agentic AI and brain-inspired architectures.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12867v1","title":"QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence","abstract":"As agentic foundation models continue to evolve, how to further improve their performance in vertical domains has become an important challenge. To this end, building upon Tongyi DeepResearch, a powerful agentic foundation model, we focus on the Chinese medical deep search scenario and propose QuarkMedSearch, systematically exploring a full-pipeline approach spanning medical multi-hop data construction, training strategies, and evaluation benchmarks to further push and assess its performance upper bound in vertical domains. 
Specifically, for data synthesis, to address the scarcity of deep search training data in the medical domain, we combine a large-scale medical knowledge graph with real-time online exploration to construct long-horizon medical deep search training data; for post-training, we adopt a two-stage SFT and RL training strategy that progressively enhances the model's planning, tool invocation, and reflection capabilities required for deep search, while maintaining search efficiency; for evaluation, we collaborate with medical experts to construct the QuarkMedSearch Benchmark through rigorous manual verification. Experimental results demonstrate that QuarkMedSearch achieves state-of-the-art performance among open-source models of comparable scale on the QuarkMedSearch Benchmark, while also maintaining strong competitiveness on general benchmarks.","published_date":"2026-04-14T15:17:21+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"QuarkMedSearch is a long-horizon deep search agent for Chinese medical intelligence, achieving state-of-the-art performance with a novel data construction and training strategy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12865v1","title":"From edges to meaning: Semantic line sketches as a cognitive scaffold for ancient pictograph invention","abstract":"Humans readily recognize objects from sparse line drawings, a capacity that appears early in development and persists across cultures, suggesting neural rather than purely learned origins. Yet the computational mechanism by which the brain transforms high-level semantic knowledge into low-level visual symbols remains poorly understood. Here we propose that ancient pictographic writing emerged from the brain's intrinsic tendency to compress visual input into stable, boundary-based abstractions. 
We construct a biologically inspired digital twin of the visual hierarchy that encodes an image into low-level features, generates a contour sketch, and iteratively refines it through top-down feedback guided by semantic representations, mirroring the feedforward and recurrent architecture of the human visual cortex. The resulting symbols bear striking structural resemblance to early pictographs across culturally distant writing systems, including Egyptian hieroglyphs, Chinese oracle bone characters, and proto-cuneiform, and offer candidate interpretations for undeciphered scripts. Our findings support a neuro-computational origin of pictographic writing and establish a framework in which AI can recapitulate the cognitive processes by which humans first externalized perception into symbols.","published_date":"2026-04-14T15:16:09+00:00","viability_score":3,"cluster_label":"Generative AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A biologically inspired digital twin of the visual hierarchy that generates contour sketches resembling ancient pictographs.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12857v1","title":"Artificial Intelligence for Modeling and Simulation of Mixed Automated and Human Traffic","abstract":"Autonomous vehicles (AVs) are now operating on public roads, which makes their testing and validation more critical than ever. Simulation offers a safe and controlled environment for evaluating AV performance in varied conditions. However, existing simulation tools mainly focus on graphical realism and rely on simple rule-based models and therefore fail to accurately represent the complexity of driving behaviors and interactions. Artificial intelligence (AI) has shown strong potential to address these limitations; however, despite the rapid progress across AI methodologies, a comprehensive survey of their application to mixed autonomy traffic simulation remains lacking. 
Existing surveys either focus on simulation tools without examining the AI methods behind them, or cover ego-centric decision-making without addressing the broader challenge of modeling surrounding traffic. Moreover, they do not offer a unified taxonomy of AI methods covering individual behavior modeling to full scene simulation. To address these gaps, this survey provides a structured review and synthesis of AI methods for modeling AV and human driving behavior in mixed autonomy traffic simulation. We introduce a taxonomy that organizes methods into three families: agent-level behavior models, environment-level simulation methods, and cognitive and physics-informed methods. The survey analyzes how existing simulation platforms fall short of the needs of mixed autonomy research and outlines directions to narrow this gap. It also provides a chronological overview of AI methods and reviews evaluation protocols and metrics, simulation tools, and datasets. By covering both traffic engineering and computer science perspectives, we aim to bridge the gap between these two communities.","published_date":"2026-04-14T15:09:07+00:00","viability_score":5,"cluster_label":"Simulation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A survey of AI methods for modeling mixed automated and human traffic in simulation, identifying gaps and future directions.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12832v1","title":"Detecting and refurbishing ground truth errors during training of deep learning-based echocardiography segmentation models","abstract":"Deep learning-based medical image segmentation typically relies on ground truth (GT) labels obtained through manual annotation, but these can be prone to random errors or systematic biases. 
This study examines the robustness of deep learning models to such errors in echocardiography (echo) segmentation and evaluates a novel strategy for detecting and refurbishing erroneous labels during model training. Using the CAMUS dataset, we simulate three error types, then compare a loss-based GT label error detection method with one based on Variance of Gradients (VOG). We also propose a pseudo-labelling approach to refurbish suspected erroneous GT labels. We assess the performance of our proposed approach under varying error levels. Results show that VOG proved highly effective in flagging erroneous GT labels during training. However, a standard U-Net maintained strong performance under random label errors and moderate levels of systematic errors (up to 50%). The detection and refurbishment approach improved performance, particularly under high-error conditions.","published_date":"2026-04-14T14:52:00+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel strategy to detect and correct errors in medical image segmentation training data, improving model performance under challenging conditions.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12827v1","title":"Loop Corrections to the Training and Generalization Errors of Random Feature Models","abstract":"We investigate random feature models in which neural networks sampled from a prescribed initialization ensemble are frozen and used as random features, with only the readout weights optimized. Adopting a statistical-physics viewpoint, we study the training, test, and generalization errors beyond the mean-kernel approximation. Since the predictor is a nonlinear functional of the induced random kernel, the ensemble-averaged errors depend not only on the mean kernel but also on higher-order fluctuation statistics. 
Within an effective field-theoretic framework, these finite-width contributions naturally appear as loop corrections. We derive the loop corrections to the training, test, and generalization errors, obtain their scaling laws, and support the theory with experimental verification.","published_date":"2026-04-14T14:48:41+00:00","viability_score":0,"cluster_label":"LLM Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Theoretical analysis of loop corrections to training and generalization errors in random feature models.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12820v1","title":"RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair","abstract":"Large language models (LLMs) inherently absorb harmful knowledge, misinformation, and personal data during pretraining on large-scale web corpora, with no native mechanism for selective removal. While machine unlearning offers a principled solution, existing approaches are provider-centric, requiring retraining pipelines, curated retain datasets, and direct intervention by model service providers (MSPs), thereby excluding end users from controlling their own data. We introduce Interactive Machine Unlearning (IMU), a new paradigm in which users can instruct LLMs to forget targeted knowledge through natural language at inference time. To realize IMU, we propose RePAIR, a prompt-aware model repair framework comprising (i) a watchdog model for unlearning intent detection, (ii) a surgeon model for generating repair procedures, and (iii) a patient model whose parameters are updated autonomously. At the core of RePAIR, we develop Steering Through Activation Manipulation with PseudoInverse (STAMP), a training-free, single-sample unlearning method that redirects MLP activations toward a refusal subspace via closed-form pseudoinverse updates. 
Its low-rank variant reduces computational complexity from O(d^3) to O(r^3 + r^2 * d), enabling efficient on-device unlearning with up to ~3x speedup over training-based baselines. Extensive experiments across harmful knowledge suppression, misinformation correction, and personal data erasure demonstrate that RePAIR achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six state-of-the-art baselines. These results establish RePAIR as an effective and practical framework for user-driven model editing, advancing transparent and on-device control over learned knowledge, with potential extensions to multimodal foundation models.","published_date":"2026-04-14T14:44:45+00:00","viability_score":6,"cluster_label":"AI Ethics and Privacy","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Interactive machine unlearning for language models to empower users with personal data control.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.12812v1","title":"DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding","abstract":"Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured ``\\textbf{Analysis}, \\textbf{Localization} and \\textbf{Reasoning}'' workflow. 
To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce an Evidence-Guided Resolution Allocation strategy to mitigate memory constraints of training on multi-page documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as a solid foundation for their implementation.","published_date":"2026-04-14T14:39:26+00:00","viability_score":8,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A structured reasoning framework for multimodal LLMs to improve understanding of long documents by localizing and grounding evidence.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12811v1","title":"Algorithmic Analysis of Dense Associative Memory: Finite-Size Guarantees and Adversarial Robustness","abstract":"Dense Associative Memory (DAM) generalizes Hopfield networks through higher-order interactions and achieves storage capacity that scales as $O(N^{n-1})$ under suitable pattern separation conditions. Existing dynamical analyses primarily study the thermodynamic limit $N\\to\\infty$ with randomly sampled patterns and therefore do not provide finite-size guarantees or explicit convergence rates.   We develop an algorithmic analysis of DAM retrieval dynamics that yields finite-$N$ guarantees under explicit, verifiable pattern conditions. 
Under a separation assumption and a bounded-interference condition at high loading, we prove geometric convergence of asynchronous retrieval dynamics, which implies $O(\\log N)$ convergence time once the trajectory enters the basin of attraction. We further establish adversarial robustness bounds expressed through an explicit margin condition that quantifies the number of corrupted bits tolerable per sweep, and derive capacity guarantees that scale as $\u0398(N^{n-1})$ up to polylogarithmic factors in the worst case, while recovering the classical $\u0398(N^{n-1})$ scaling for random pattern ensembles. Finally, we show that DAM retrieval dynamics admit a potential-game interpretation that ensures convergence to pure Nash equilibria under asynchronous updates.   Complete proofs are provided in the appendices, together with preliminary experiments that illustrate the predicted convergence, robustness, and capacity scaling behavior.","published_date":"2026-04-14T14:38:46+00:00","viability_score":1,"cluster_label":"Theoretical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper provides theoretical guarantees for the convergence and robustness of Dense Associative Memory, a generalization of Hopfield networks, with potential applications in memory systems.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12807v1","title":"Rethinking Satellite Image Restoration for Onboard AI: A Lightweight Learning-Based Approach","abstract":"Satellite image restoration aims to improve image quality by compensating for degradations (e.g., noise and blur) introduced by the imaging system and acquisition conditions. As a fundamental preprocessing step, restoration directly impacts both ground-based product generation and emerging onboard AI applications. Traditional restoration pipelines based on sequential physical models are computationally intensive and slow, making them unsuitable for onboard environments. 
In this paper, we introduce ConvBEERS: a Convolutional Board-ready Embedded and Efficient Restoration model for Space to investigate whether a light and non-generative residual convolutional network, trained on simulated satellite data, can match or surpass a traditional ground-processing restoration pipeline across multiple operating conditions.   Experiments conducted on simulated datasets and real Pleiades-HR imagery demonstrate that the proposed approach achieves competitive image quality, with a +6.9dB PSNR improvement. Evaluation on a downstream object detection task demonstrates that restoration significantly improves performance, with up to +5.1% mAP@50. In addition, successful deployment on a Xilinx Versal VCK190 FPGA validates its practical feasibility for satellite onboard processing, with a ~41x reduction in latency compared to the traditional pipeline. These results demonstrate the relevance of using lightweight CNNs to achieve competitive restoration quality while addressing real-world constraints in spaceborne systems.","published_date":"2026-04-14T14:37:26+00:00","viability_score":7,"cluster_label":"Computer Vision","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A lightweight convolutional network for onboard satellite image restoration achieves competitive quality and significantly reduces latency, enabling real-time AI applications in space.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12799v1","title":"Efficiency of Proportional Mechanisms in Online Auto-Bidding Advertising","abstract":"The rise of automated bidding strategies in online advertising presents new challenges in designing and analyzing efficient auction mechanisms. In this paper, we focus on proportional mechanisms within the context of auto-bidding and study the efficiency of pure Nash equilibria, specifically the price of anarchy (PoA), under the liquid welfare objective. 
We first establish a tight PoA bound of 2 for the standard proportional mechanism. Next, we introduce a modified version with an alternative payment scheme that achieves a PoA bound of $1 + \\frac{O(1)}{n-1}$ where $n \\geq 2$ denotes the number of bidding agents. This improvement surpasses the existing PoA barrier of 2 and approaches full efficiency as the number of agents increases. Our methodology leverages duality and the Karush-Kuhn-Tucker (KKT) conditions from linear and convex programming. Despite its conceptual simplicity, our approach proves powerful and may offer broader applications for establishing PoA bounds.","published_date":"2026-04-14T14:29:05+00:00","viability_score":0,"cluster_label":"Algorithmic Game Theory","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper analyzes the efficiency of proportional mechanisms in online advertising auto-bidding, proposing a modified mechanism with improved price of anarchy bounds.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12798v1","title":"VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation","abstract":"FlashAttention-style online softmax enables exact attention computation with linear memory by streaming score tiles through on-chip memory and maintaining a running maximum and normalizer. However, as attention kernels approach peak tensor-core/cube-core throughput on modern accelerators, non-matmul components of online softmax -- especially per-tile rowmax and rowsum reductions and rescale chains -- can become vector or SIMD limited and dominate latency. This paper revisits FlashAttention and proposes Vector Relieved Flash Attention (VFA), a hardware-friendly method that reduces rowmax-driven updates of the running maximum while retaining the online-softmax structure. 
VFA initializes the running maximum via a cheap approximation from key-block representations, reorders key-block traversal to prioritize high-impact sink and local blocks, and freezes the maximum for remaining blocks to avoid repeated reductions and rescaling. We further integrate VFA with block-sparse skipping methods such as BLASST to form Vector Relieved Sparse Attention (VSA), which reduces both block count and per-block overhead. Notably, VFA and VSA completely avoid the conditional rescale operation in the update stage used in FA4.0. Extensive evaluations on benchmarks including MMLU and MATH500, together with attention statistics, verify our design: (i) sink and local reordering stabilizes the running maximum early; (ii) simple Q and K block summaries fail due to intra-block heterogeneity; (iii) m-initialization is required when maxima appear in middle blocks. Overall, VFA and VSA efficiently alleviate online-softmax reduction bottlenecks without performance loss. Compared to the C16V32 baseline, C8V32, C4V32 and C4V16 achieve nearly two times speedup on modern hardware while hitting the vector bottleneck. 
With upcoming architecture improvements, C4V16 will deliver six times speedup by enhancing exponent capacity.","published_date":"2026-04-14T14:28:50+00:00","viability_score":6,"cluster_label":"LLM Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Vector Relieved Flash Attention (VFA) optimizes attention computation by reducing vector operations, achieving significant speedups on modern hardware without performance loss.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.12782v1","title":"OSC: Hardware Efficient W4A4 Quantization via Outlier Separation in Channel Dimension","abstract":"While 4-bit quantization is essential for high-throughput deployment of Large Language Models, activation outliers often lead to significant accuracy degradation due to the restricted dynamic range of low-bit formats. In this paper, we systematically investigate the spatial distribution of outliers and demonstrate a token-persistent structural clustering effect, where high-magnitude outliers consistently occupy fixed channels across tokens. Building on this insight, we propose OSC, a hardware-efficient framework for outlier suppression. During inference, OSC executes a dual-path computation consisting of a low-precision 4-bit General Matrix Multiplication (GEMM) path and a high-precision 16-bit branch GEMM path. Specifically, OSC uses an offline group-wise strategy to identify the channels where outliers are located and then performs structured sub-tensor extraction to coalesce these scattered activation channels into a compact dense tensor online. This mechanism implements outlier protection through regularized and high-throughput GEMM operations, achieving a seamless fit with modern 4-bit micro-scaling hardware. Furthermore, for the inputs of W2 where outlier clustering is less pronounced, we integrate a fallback strategy to FP8. 
Evaluation on Qwen3-8B and Qwen3-30B restricts the average accuracy drop to 2.19 and 1.12 points, respectively. Notably, OSC is highly hardware-friendly, achieving a peak speedup of 1.78x over the W8A8 GEMM baseline on a modern AI accelerator.","published_date":"2026-04-14T14:17:59+00:00","viability_score":7,"cluster_label":"LLM Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hardware-efficient framework for LLM quantization that suppresses activation outliers to maintain accuracy and achieve significant speedups on modern AI accelerators.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12780v1","title":"Efficient Adversarial Training via Criticality-Aware Fine-Tuning","abstract":"Vision Transformer (ViT) models have achieved remarkable performance across various vision tasks, with scalability being a key advantage when applied to large datasets. This scalability enables ViT models to exhibit strong generalization capabilities. However, as the number of parameters increases, the robustness of ViT models to adversarial examples does not scale proportionally. Adversarial training (AT), one of the most effective methods for enhancing robustness, typically requires fine-tuning the entire model, leading to prohibitively high computational costs, especially for large ViT architectures. In this paper, we aim to robustly fine-tune only a small subset of parameters to achieve robustness comparable to standard AT. To accomplish this, we introduce Criticality-Aware Adversarial Training (CAAT), a novel method that adaptively allocates resources to the most robustness-critical parameters, fine-tuning only selected modules. Specifically, CAAT efficiently identifies parameters that contribute most to adversarial robustness. It then leverages parameter-efficient fine-tuning (PEFT) to robustly adjust weight matrices where the number of critical parameters exceeds a predefined threshold. 
CAAT exhibits favorable generalization when scaled to larger vision transformer architectures, potentially paving the way for adversarial training at scale; e.g., compared with plain adversarial training, CAAT incurs only a 4.3% decrease in adversarial robustness while tuning approximately 6% of its parameters. Extensive experiments on three widely used adversarial learning datasets demonstrate that CAAT outperforms state-of-the-art lightweight AT methods with fewer trainable parameters.","published_date":"2026-04-14T14:17:38+00:00","viability_score":8,"cluster_label":"Robustness AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A criticality-aware fine-tuning method that achieves robust Vision Transformer models by selectively updating only the most critical parameters, significantly reducing computational cost.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12778v1","title":"DoseRAD2026 Challenge dataset: AI accelerated photon and proton dose calculation for radiotherapy","abstract":"Purpose: Accurate dose calculation is essential in radiotherapy for precise tumor irradiation while sparing healthy tissue. With the growing adoption of MRI-guided and real-time adaptive radiotherapy, fast and accurate dose calculation on CT and MRI is increasingly needed. The DoseRAD2026 dataset and challenge provide a public benchmark of paired CT and MRI data with beam-level photon and proton Monte Carlo dose distributions for developing and evaluating advanced dose calculation methods. Acquisition and validation methods: The dataset comprises paired CT and MRI from 115 patients (75 training, 40 testing) treated on an MRI-linac for thoracic or abdominal lesions, derived from the SynthRAD2025 dataset. Pre-processing included deformable image registration, air-cavity correction, and resampling. 
Ground-truth photon (6 MV) and proton dose distributions were computed using open-source Monte Carlo algorithms, yielding 40,500 photon beams and 81,000 proton beamlets. Data format and usage notes: Data are organized into photon and proton subsets with paired CT-MRI images, beam-level dose distributions, and JSON beam configuration files. Files are provided in compressed MetaImage (.mha) format. The dataset is released under CC BY-NC 4.0, with training data available from April 2026 and the test set withheld until March 2030. Potential applications: The dataset supports benchmarking of fast dose calculation methods, including beam-level dose estimation for photon and proton therapy, MRI-based dose calculation in MRI-guided workflows, and real-time adaptive radiotherapy.","published_date":"2026-04-14T14:16:45+00:00","viability_score":4,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new dataset and challenge for AI-accelerated photon and proton dose calculation in radiotherapy, enabling development of faster and more accurate dose estimation methods.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12777v1","title":"Cognition-Inspired Dual-Stream Semantic Enhancement for Vision-Based Dynamic Emotion Modeling","abstract":"The human brain constructs emotional percepts not by processing facial expressions in isolation, but through a dynamic, hierarchical integration of sensory input with semantic and contextual knowledge. However, existing vision-based dynamic emotion modeling approaches often neglect emotion perception and cognitive theories. To bridge this gap between machine and human emotion perception, we propose cognition-inspired Dual-stream Semantic Enhancement (DuSE). Our model instantiates a dual-stream cognitive architecture. The first stream, a Hierarchical Temporal Prompt Cluster (HTPC), operationalizes the cognitive priming effect. 
It simulates how linguistic cues pre-sensitize neural pathways, modulating the processing of incoming visual stimuli by aligning textual semantics with fine-grained temporal features of facial dynamics. The second stream, a Latent Semantic Emotion Aggregator (LSEA), computationally models the knowledge integration process, akin to the mechanism described by the Conceptual Act Theory. It aggregates sensory inputs and synthesizes them with learned conceptual knowledge, reflecting the role of the hippocampus and default mode network in constructing a coherent emotional experience. By explicitly modeling these neuro-cognitive mechanisms, DuSE provides a more neurally plausible and robust framework for dynamic facial expression recognition (DFER). Extensive experiments on challenging in-the-wild benchmarks validate our cognition-centric approach, demonstrating that emulating the brain's strategies for emotion processing yields state-of-the-art performance and enhances model interpretability.","published_date":"2026-04-14T14:16:18+00:00","viability_score":4,"cluster_label":"Emotion AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A cognition-inspired dual-stream model that enhances dynamic emotion recognition by integrating semantic and contextual knowledge with facial dynamics, achieving state-of-the-art performance.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12767v1","title":"CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models","abstract":"Multimodal Large Language Models (MLLMs) suffer from substantial computational overhead due to the high redundancy in visual token sequences. Existing approaches typically address this issue using single-layer Vision Transformer (ViT) features and static pruning strategies. However, such fixed configurations are often brittle under diverse instructions. 
To overcome these limitations, we propose CLASP, a plug-and-play token reduction framework based on class-adaptive layer fusion and dual-stage pruning. Specifically, CLASP first constructs category-specific visual representations through multi-layer vision feature fusion. It then performs dual-stage pruning, allocating the token budget between attention-salient pivot tokens for relevance and redundancy-aware completion tokens for coverage. Through class-adaptive pruning, CLASP enables prompt-conditioned feature fusion and budget allocation, allowing aggressive yet robust visual token reduction. Extensive experiments demonstrate that CLASP consistently outperforms existing methods across a wide range of benchmarks, pruning ratios, and MLLM architectures. Code will be available at https://github.com/Yunkaidang/CLASP.","published_date":"2026-04-14T14:08:20+00:00","viability_score":7,"cluster_label":"Multimodal LLMs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A plug-and-play framework for reducing visual token redundancy in multimodal LLMs through class-adaptive layer fusion and dual-stage pruning.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.12762v1","title":"ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search","abstract":"We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. 
The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.","published_date":"2026-04-14T14:06:19+00:00","viability_score":7,"cluster_label":"Agentic Multi-Camera Person Search","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ARGOS is a benchmark and framework for agentic multi-camera person search, enabling agents to reason, question, and utilize spatio-temporal tools under information asymmetry.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.12757v1","title":"GF-Score: Certified Class-Conditional Robustness Evaluation with Fairness Guarantees","abstract":"Adversarial robustness is essential for deploying neural networks in safety-critical applications, yet standard evaluation methods either require expensive adversarial attacks or report only a single aggregate score that obscures how robustness is distributed across classes. We introduce the \\emph{GF-Score} (GREAT-Fairness Score), a framework that decomposes the certified GREAT Score into per-class robustness profiles and quantifies their disparity through four metrics grounded in welfare economics: the Robustness Disparity Index (RDI), the Normalized Robustness Gini Coefficient (NRGC), Worst-Case Class Robustness (WCR), and a Fairness-Penalized GREAT Score (FP-GREAT). The framework further eliminates the original method's dependence on adversarial attacks through a self-calibration procedure that tunes the temperature parameter using only clean accuracy correlations. 
Evaluating 22 models from RobustBench across CIFAR-10 and ImageNet, we find that the decomposition is exact, that per-class scores reveal consistent vulnerability patterns (e.g., ``cat'' is the weakest class in 76\\% of CIFAR-10 models), and that more robust models tend to exhibit greater class-level disparity. These results establish a practical, attack-free auditing pipeline for diagnosing where certified robustness guarantees fail to protect all classes equally. We release our code on \\href{https://github.com/aryashah2k/gf-score}{GitHub}.","published_date":"2026-04-14T14:03:22+00:00","viability_score":7,"cluster_label":"Robustness Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"GF-Score provides certified class-conditional robustness evaluation with fairness guarantees, enabling attack-free auditing of neural network vulnerabilities.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.12743v1","title":"Can AI Tools Transform Low-Demand Math Tasks? An Evaluation of Task Modification Capabilities","abstract":"While recent research has explored AI tools' ability to classify the quality of mathematical tasks (arXiv:2603.03512), little is known about their capacity to increase the quality of existing tasks. This study investigated whether AI tools could successfully upgrade low-cognitive-demand mathematics tasks. Eleven tools were tested, including six broadly available, general-purpose AI tools (e.g., ChatGPT and Claude) and five tools specialized for mathematics teachers (e.g., Khanmigo, coteach.ai). Using the Task Analysis Guide framework (Stein & Smith, 1998), we prompted AI tools to modify two different types of low-demand mathematical tasks. The prompting strategy aimed to represent likely approaches taken by knowledgeable teachers, rather than extensive optimization to find a more effective prompt (i.e., an optimistic typical outcome). 
On average, AI tools were only moderately successful: tasks were accurately upgraded only 64% of the time, with different AI tool performance ranging from quite weak (33%) to broadly successful (88%). Specialized tools were only moderately more successful than general-purpose tools. Failure modes included both \"undershooting\" (maintaining low cognitive demand) and \"overshooting\" (elevating tasks to an overly ambitious target category that likely would be rejected by teachers). Interestingly, there was a small negative correlation (r = -.35) between whether a given AI tool was able to correctly classify the cognitive demand of tasks and whether the AI was able to upgrade tasks, showing that the ability to modify tasks (i.e., a generative task) represents a distinct capability from the ability to classify them (i.e., judgement using a rubric). These findings have important implications for understanding AI's potential role in curriculum adaptation and highlight the need for specialized approaches to support teachers in modifying instructional materials.","published_date":"2026-04-14T13:57:09+00:00","viability_score":4,"cluster_label":"AI for Education","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An evaluation of AI tools' capabilities in transforming low-demand mathematics tasks, revealing moderate success rates and distinct generative versus classification abilities.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12717v1","title":"Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning","abstract":"LLM-based autonomous agents perform well on general reasoning tasks but still struggle to reliably use task structure, key constraints, and prior experience in complex real-world settings. We propose a case-based learning framework that converts experience from past tasks into reusable knowledge assets, allowing agents to transfer prior case experience to new tasks and perform more structured analysis. 
Unlike methods based mainly on pretrained knowledge or static prompts, our framework emphasizes extracting and reusing task-relevant knowledge, analytical prompts, and operational skills from real cases. We evaluate the method on a unified benchmark of six complex task categories and compare it with Zero-Shot, Few-Shot, Checklist Prompt, and Rule Memory baselines. Results show that our method achieves consistently strong performance across all tasks and matches or outperforms the best baseline in every case, with especially clear gains on more complex tasks. Further analysis shows that the advantage of case-based learning increases with task complexity, and that practical knowledge acquired by one agent can be reused by others. These findings suggest that case-based learning offers a promising path for building professional agents for real-world work.","published_date":"2026-04-14T13:31:47+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A case-based learning framework for LLM agents that transfers prior task experience to new, complex real-world scenarios, improving structured analysis and performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12709v1","title":"Information-Theoretic Optimization for Task-Adapted Compressed Sensing Magnetic Resonance Imaging","abstract":"Task-adapted compressed sensing magnetic resonance imaging (CS-MRI) is emerging to address the specific demands of downstream clinical tasks with significantly fewer k-space measurements than required by Nyquist sampling. However, existing task-adapted CS-MRI methods suffer from the uncertainty problem for medical diagnosis and cannot achieve adaptive sampling in end-to-end optimization with reconstruction or clinical tasks. 
To address these limitations, we propose the first task-adapted CS-MRI from the information-theoretic perspective to simultaneously achieve probabilistic inference for uncertainty prediction and adapt to arbitrary sampling ratios and versatile clinical applications. Specifically, we formalize the task-adapted CS-MRI optimization problem by maximizing the mutual information between undersampled k-space measurements and clinical tasks to enable probabilistic inference for addressing the uncertainty problem. We leverage amortized optimization and construct tractable variational bounds for mutual information to jointly optimize sampling, reconstruction, and task-inference models, which enables flexible sampling ratio control using a single end-to-end trained model. Furthermore, the proposed framework addresses two kinds of distinct clinical scenarios within a unified approach, i.e., i) joint task and reconstruction, where reconstruction serves as an auxiliary process to enhance task performance; and ii) task implementation with suppressed reconstruction, applicable for privacy protection. 
Extensive experiments on large-scale MRI datasets demonstrate that the proposed framework achieves highly competitive performance on standard metrics like Dice compared to deterministic counterpart but provides better distribution matching to the ground-truth posterior distribution as measured by the generalized energy distance (GED).","published_date":"2026-04-14T13:23:19+00:00","viability_score":4,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An information-theoretic framework for task-adapted compressed sensing MRI that enables probabilistic inference for uncertainty prediction and adaptive sampling for clinical tasks.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12700v1","title":"MISID: A Multimodal Multi-turn Dataset for Complex Intent Recognition in Strategic Deception Games","abstract":"Understanding human intent in complex multi-turn interactions remains a fundamental challenge in human-computer interaction and behavioral analysis. While existing intent recognition datasets focus mainly on single utterances or simple dialogues, real-world scenarios often involve sophisticated strategic interactions where participants must maintain complex deceptive narratives over extended periods. To address this gap, we introduce MISID, a comprehensive multimodal, multi-turn, and multi-participant benchmark for intent recognition. Sourced from high-stakes social strategy games, MISID features a fine-grained, two-tier multi-dimensional annotation scheme tailored for long-context discourse analysis and evidence-based causal tracking. Our systematic evaluation of state-of-the-art Multimodal Large Language Models (MLLMs) on MISID reveals critical deficiencies in complex scenarios, including text-prior visual hallucination, impaired cross-modal synergy, and limited capacity in chaining causal cues. Consequently, we propose FRACTAM as a baseline framework. 
Using a ``Decouple-Anchor-Reason'' paradigm, FRACTAM reduces text bias by extracting pure unimodal factual representations, employs two-stage retrieval for long-range factual anchoring, and constructs explicit cross-modal evidence chains. Extensive experiments demonstrate that FRACTAM enhances mainstream models' performance in complex strategic tasks, improving hidden intent detection and inference while maintaining robust perceptual accuracy. Our dataset is available at https://naislab.cn/datasets/MISID.","published_date":"2026-04-14T13:07:54+00:00","viability_score":7,"cluster_label":"Multimodal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MISID, a multimodal dataset and FRACTAM framework for complex intent recognition in strategic deception games, addressing deficiencies in current MLLMs for long-context discourse and cross-modal synergy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12686v1","title":"BID-LoRA: A Parameter-Efficient Framework for Continual Learning and Unlearning","abstract":"Recent advances in deep learning underscore the need for systems that can not only acquire new knowledge through Continual Learning (CL) but also remove outdated, sensitive, or private information through Machine Unlearning (MU). However, while CL methods are well-developed, MU techniques remain in early stages, creating a critical gap for unified frameworks that depend on both capabilities. We find that naively combining existing CL and MU approaches results in knowledge leakage, a gradual degradation of foundational knowledge across repeated adaptation cycles. To address this, we formalize Continual Learning Unlearning (CLU) as a unified paradigm with three key goals: (i) precise deletion of unwanted knowledge, (ii) efficient integration of new knowledge while preserving prior information, and (iii) minimizing knowledge leakage across cycles. 
We propose Bi-Directional Low-Rank Adaptation (BID-LoRA), a novel framework featuring three dedicated adapter pathways (retain, new, and unlearn) applied to attention layers, combined with escape unlearning that pushes forget-class embeddings to positions maximally distant from retained knowledge, updating only 5% of parameters. Experiments on CIFAR-100 show that BID-LoRA outperforms CLU baselines across multiple adaptation cycles. We further evaluate on CASIA-Face100, a curated face recognition subset, demonstrating practical applicability to real-world identity management systems where new users must be enrolled and withdrawn users removed.","published_date":"2026-04-14T12:57:56+00:00","viability_score":7,"cluster_label":"Continual Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BID-LoRA, a parameter-efficient framework for Continual Learning and Unlearning, enabling precise knowledge deletion and efficient integration of new knowledge with minimal leakage.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12669v1","title":"A hierarchical spatial-aware algorithm with efficient reinforcement learning for human-robot task planning and allocation in production","abstract":"In advanced manufacturing systems, humans and robots collaborate to conduct the production process. Effective task planning and allocation (TPA) is crucial for achieving high production efficiency, yet it remains challenging in complex and dynamic manufacturing environments. The dynamic nature of humans and robots, particularly the need to consider spatial information (e.g., humans' real-time position and the distance they need to move to complete a task), substantially complicates TPA. To address the above challenges, we decompose production tasks into manageable subtasks. 
We then implement a real-time hierarchical human-robot TPA algorithm, including a high-level agent for task planning and a low-level agent for task allocation. For the high-level agent, we propose an efficient buffer-based deep Q-learning method (EBQ), which reduces training time and enhances performance in production problems with long-term and sparse reward challenges. For the low-level agent, a path planning-based spatially aware method (SAP) is designed to allocate tasks to the appropriate human-robot resources, thereby achieving the corresponding sequential subtasks. We conducted experiments on a complex real-time production process in a 3D simulator. The results demonstrate that our proposed EBQ&SAP method effectively addresses human-robot TPA problems in complex and dynamic production processes.","published_date":"2026-04-14T12:40:05+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hierarchical spatial-aware algorithm with efficient reinforcement learning for human-robot task planning and allocation in production.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12667v1","title":"Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production","abstract":"Human-robot collaborative manufacturing, a core aspect of Industry 5.0, emphasizes ergonomics to enhance worker well-being. This paper addresses the dynamic human-robot task planning and allocation (HRTPA) problem, which involves determining when to perform tasks and who should execute them to maximize efficiency while ensuring workers' physical fatigue remains within safe limits. The inclusion of fatigue constraints, combined with production dynamics, significantly increases the complexity of the HRTPA problem. Traditional fatigue-recovery models in HRTPA often rely on static, predefined hyperparameters. 
However, in practice, human fatigue sensitivity varies daily due to factors such as changed work conditions and insufficient sleep. To better capture this uncertainty, we treat fatigue-related parameters as inaccurate and estimate them online based on observed fatigue progression during production. To address these challenges, we propose PF-CD3Q, a safe reinforcement learning (safe RL) approach that integrates the particle filter with constrained dueling double deep Q-learning for real-time fatigue-predictive HRTPA. Specifically, we first develop PF-based estimators to track human fatigue and update fatigue model parameters in real-time. These estimators are then integrated into CD3Q by making task-level fatigue predictions during decision-making and excluding tasks that exceed fatigue limits, thereby constraining the action space and formulating the problem as a constrained Markov decision process (CMDP).","published_date":"2026-04-14T12:38:21+00:00","viability_score":3,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Safe reinforcement learning with online filtering for fatigue-predictive human-robot task planning and allocation in production.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12663v1","title":"Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport","abstract":"Existing topic modeling methods, from LDA to recent neural and LLM-based approaches, which focus mainly on statistical coherence, often produce redundant or off-target topics that miss the user's underlying intent. We introduce Human-centric Topic Modeling (\\emph{Human-TM}), a novel task formulation that integrates a human-provided goal directly into the topic modeling process to produce interpretable, diverse and goal-oriented topics. 
To tackle this challenge, we propose the \\textbf{G}oal-prompted \\textbf{C}ontrastive \\textbf{T}opic \\textbf{M}odel with \\textbf{O}ptimal \\textbf{T}ransport (GCTM-OT), which first uses LLM-based prompting to extract goal candidates from documents, then incorporates these into semantic-aware contrastive learning via optimal transport for topic discovery. Experimental results on three public subreddit datasets show that GCTM-OT outperforms state-of-the-art baselines in topic coherence and diversity while significantly improving alignment with human-provided goals, paving the way for more human-centric topic discovery systems.","published_date":"2026-04-14T12:31:26+00:00","viability_score":8,"cluster_label":"LLM Applications","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Human-centric topic modeling that uses LLM-based prompting and contrastive learning with optimal transport to produce goal-oriented topics.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12660v1","title":"Broadening the Applicability of Conditional Syntax Splitting for Reasoning from Conditional Belief Bases","abstract":"In nonmonotonic reasoning from conditional belief bases, an inference operator satisfying syntax splitting postulates allows for taking only the relevant parts of a belief base into account, provided that the belief base splits into subbases based on disjoint signatures. Because such disjointness is rare in practice, safe conditional syntax splitting has been proposed as a generalization of syntax splitting, allowing the conditionals in the subbases to share some atoms. Recently this overlap of conditionals has been shown to be limited to trivial, self-fulfilling conditionals. In this article, we propose a generalization of safe conditional syntax splittings that broadens the applicability of splitting postulates. 
In contrast to safe conditional syntax splitting, our generalized notion supports syntax splittings of a belief base \u0394 where the subbases of \u0394 may share atoms and nontrivial conditionals. We illustrate how this new notion overcomes limitations of previous splitting concepts, and we identify genuine splittings, separating them from simple splittings that do not provide benefits for inductive inference from \u0394. We introduce adjusted inference postulates based on our generalization of conditional syntax splitting, and we evaluate several popular inductive inference operators with respect to these postulates. Furthermore, we show that, while every inductive inference operator satisfying generalized conditional syntax splitting also satisfies conditional syntax splitting, the reverse does not hold.","published_date":"2026-04-14T12:27:05+00:00","viability_score":0,"cluster_label":"Reasoning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical generalization of conditional syntax splitting for nonmonotonic reasoning from conditional belief bases.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12652v1","title":"PromptEcho: Annotation-Free Reward from Vision-Language Models for Text-to-Image Reinforcement Learning","abstract":"Reinforcement learning (RL) can improve the prompt following capability of text-to-image (T2I) models, yet obtaining high-quality reward signals remains challenging: CLIP Score is too coarse-grained, while VLM-based reward models (e.g., RewardDance) require costly human-annotated preference data and additional fine-tuning. We propose PromptEcho, a reward construction method that requires \\emph{no} annotation and \\emph{no} reward model training. Given a generated image and a guiding query, PromptEcho computes the token-level cross-entropy loss of a frozen VLM with the original prompt as the label, directly extracting the image-text alignment knowledge encoded during VLM pretraining. 
The reward is deterministic, computationally efficient, and improves automatically as stronger open-source VLMs become available. For evaluation, we develop DenseAlignBench, a benchmark of concept-rich dense captions for rigorously testing prompt following capability. Experimental results on two state-of-the-art T2I models (Z-Image and QwenImage-2512) demonstrate that PromptEcho achieves substantial improvements on DenseAlignBench (+26.8pp / +16.2pp net win rate), along with consistent gains on GenEval, DPG-Bench, and TIIFBench without any task-specific training. Ablation studies confirm that PromptEcho comprehensively outperforms inference-based scoring with the same VLM, and that reward quality scales with VLM size. We will open-source the trained models and the DenseAlignBench.","published_date":"2026-04-14T12:21:15+00:00","viability_score":7,"cluster_label":"Generative AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel reward mechanism for text-to-image models that eliminates the need for human annotation or reward model training, improving prompt following capabilities.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12651v1","title":"Learning Chain Of Thoughts Prompts for Predicting Entities, Relations, and even Literals on Knowledge Graphs","abstract":"Knowledge graph embedding (KGE) models perform well on link prediction but struggle with unseen entities, relations, and especially literals, limiting their use in dynamic, heterogeneous graphs. In contrast, pretrained large language models (LLMs) generalize effectively through prompting. We reformulate link prediction as a prompt learning problem and introduce RALP, which learns string-based chain-of-thought (CoT) prompts as scoring functions for triples. Using Bayesian Optimization through MIPRO algorithm, RALP identifies effective prompts from fewer than 30 training examples without gradient access. 
At inference, RALP predicts missing entities, relations or whole triples and assigns confidence scores based on the learned prompt. We evaluate on transductive, numerical, and OWL instance retrieval benchmarks. RALP improves state-of-the-art KGE models by over 5% MRR across datasets and enhances generalization via high-quality inferred triples. On OWL reasoning tasks with complex class expressions (e.g., $\\exists hasChild.Female$, $\\geq 5 \\; hasChild.Female$), it achieves over 88% Jaccard similarity. These results highlight prompt-based LLM reasoning as a flexible alternative to embedding-based methods. We release our implementation, training, and evaluation pipeline as open source: https://github.com/dice-group/RALP .","published_date":"2026-04-14T12:21:15+00:00","viability_score":8,"cluster_label":"Knowledge Graphs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A prompt learning framework that uses chain-of-thought prompts to predict missing entities, relations, and literals in knowledge graphs, outperforming traditional embedding models.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12648v1","title":"TimeSAF: Towards LLM-Guided Semantic Asynchronous Fusion for Time Series Forecasting","abstract":"Despite the recent success of large language models (LLMs) in time-series forecasting, most existing methods still adopt a Deep Synchronous Fusion strategy, where dense interactions between textual and temporal features are enforced at every layer of the network. This design overlooks the inherent granularity mismatch between modalities and leads to what we term semantic perceptual dissonance: high-level abstract semantics provided by the LLM become inappropriately entangled with the low-level, fine-grained numerical dynamics of time series, making it difficult for semantic priors to effectively guide forecasting. 
To address this issue, we propose TimeSAF, a new framework based on hierarchical asynchronous fusion. Unlike synchronous approaches, TimeSAF explicitly decouples unimodal feature learning from cross-modal interaction. It introduces an independent cross-modal semantic fusion trunk, which uses learnable queries to aggregate global semantics from the temporal and prompt backbones in a bottom-up manner, and a stage-wise semantic refinement decoder that asynchronously injects these high-level signals back into the temporal backbone. This mechanism provides stable and efficient semantic guidance while avoiding interference with low-level temporal dynamics. Extensive experiments on standard long-term forecasting benchmarks show that TimeSAF significantly outperforms state-of-the-art baselines, and further exhibits strong generalization in both few-shot and zero-shot transfer settings.","published_date":"2026-04-14T12:18:00+00:00","viability_score":7,"cluster_label":"Time Series Forecasting","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel time series forecasting framework that uses LLM-guided semantic asynchronous fusion to improve accuracy and generalization by decoupling semantic guidance from temporal dynamics.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12645v1","title":"Contextual Multi-Task Reinforcement Learning for Autonomous Reef Monitoring","abstract":"Although autonomous underwater vehicles promise the capability of marine ecosystem monitoring, their deployment is fundamentally limited by the difficulty of controlling vehicles under highly uncertain and non-stationary underwater dynamics. To address these challenges, we employ a data-driven reinforcement learning approach to compensate for unknown dynamics and task variations. Traditional single-task reinforcement learning has a tendency to overfit the training environment, thus limiting the long-term usefulness of the learnt policy. 
Hence, we propose to use a contextual multi-task reinforcement learning paradigm instead, allowing us to learn controllers that can be reused for various tasks, e.g., detecting oysters in one reef and detecting corals in another. We evaluate whether contextual multi-task reinforcement learning can efficiently learn robust and generalisable control policies for autonomous underwater reef monitoring. We train a single context-dependent policy that is able to solve multiple related monitoring tasks in a simulated reef environment in HoloOcean. In our experiments, we empirically evaluate the contextual policies regarding sample-efficiency, zero-shot generalisation to unseen tasks, and robustness to varying water currents. By utilising multi-task reinforcement learning, we aim to improve the training effectiveness, as well as the reusability of learnt policies to take a step towards more sustainable procedures in autonomous reef monitoring.","published_date":"2026-04-14T12:16:56+00:00","viability_score":3,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Exploring contextual multi-task reinforcement learning for autonomous underwater reef monitoring to improve policy reusability and robustness.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12634v1","title":"RPRA: Predicting an LLM-Judge for Efficient but Performant Inference","abstract":"Large language models (LLMs) face a fundamental trade-off between computational efficiency (e.g., number of parameters) and output quality, especially when deployed on computationally limited devices such as phones or laptops. 
One way to address this challenge is to follow the example of humans and have models ask for help when they believe they are incapable of solving a problem on their own; we can overcome this trade-off by allowing smaller models to respond to queries when they believe they can provide good responses, and deferring to larger models when they do not believe they can. To this end, in this paper, we investigate the viability of Predict-Answer/Act (PA) and Reason-Predict-Reason-Answer/Act (RPRA) paradigms where models predict -- prior to responding -- how an LLM judge would score their output. We evaluate three approaches: zero-shot prediction, prediction using an in-context report card, and supervised fine-tuning. Our results show that larger models (particularly reasoning models) perform well when predicting generic LLM judges zero-shot, while smaller models can reliably predict such judges well after being fine-tuned or provided with an in-context report card. Altogether, both approaches can substantially improve the prediction accuracy of smaller models, with report cards and fine-tuning achieving mean improvements of up to 55% and 52% across datasets, respectively. 
These findings suggest that models can learn to predict their own performance limitations, paving the way for more efficient and self-aware AI systems.","published_date":"2026-04-14T12:04:21+00:00","viability_score":7,"cluster_label":"LLM Efficiency","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Enabling smaller LLMs to predict their own performance limitations, paving the way for more efficient and self-aware AI systems.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12632v1","title":"Calibration-Aware Policy Optimization for Reasoning LLMs","abstract":"Group Relative Policy Optimization (GRPO) enhances LLM reasoning but often induces overconfidence, where incorrect responses yield lower perplexity than correct ones, degrading relative calibration as described by the Area Under the Curve (AUC). Existing approaches either yield limited improvements in calibration or sacrifice gains in reasoning accuracy. We first prove that this degradation in GRPO-style algorithms stems from their uncertainty-agnostic advantage estimation, which inevitably misaligns optimization gradients with calibration. This leads to improved accuracy at the expense of degraded calibration. We then propose Calibration-Aware Policy Optimization (CAPO). It adopts a logistic AUC surrogate loss that is theoretically consistent and admits regret bound, enabling uncertainty-aware advantage estimation. By further incorporating a noise masking mechanism, CAPO achieves stable learning dynamics that jointly optimize calibration and accuracy. Experiments on multiple mathematical reasoning benchmarks show that CAPO-1.5B significantly improves calibration by up to 15% while achieving accuracy comparable to or better than GRPO, and further boosts accuracy on downstream inference-time scaling tasks by up to 5%. 
Moreover, when allowed to abstain under low-confidence conditions, CAPO achieves a Pareto-optimal precision-coverage trade-off, highlighting its practical value for hallucination mitigation.","published_date":"2026-04-14T12:03:17+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel policy optimization method that jointly improves LLM reasoning accuracy and calibration, mitigating overconfidence and hallucination.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12627v1","title":"KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance","abstract":"RLVR improves reasoning in large language models, but its effectiveness is often limited by severe reward sparsity on hard problems. Recent hint-based RL methods mitigate sparsity by injecting partial solutions or abstract templates, yet they typically scale guidance by adding more tokens, which introduce redundancy, inconsistency, and extra training overhead. We propose \\textbf{KnowRL} (Knowledge-Guided Reinforcement Learning), an RL training framework that treats hint design as a minimal-sufficient guidance problem. During RL training, KnowRL decomposes guidance into atomic knowledge points (KPs) and uses Constrained Subset Search (CSS) to construct compact, interaction-aware subsets for training. We further identify a pruning interaction paradox -- removing one KP may help while removing multiple such KPs can hurt -- and explicitly optimize for robust subset curation under this dependency structure. We train KnowRL-Nemotron-1.5B from OpenMath-Nemotron-1.5B. Across eight reasoning benchmarks at the 1.5B scale, KnowRL-Nemotron-1.5B consistently outperforms strong RL and hinting baselines. 
Without KP hints at inference, KnowRL-Nemotron-1.5B reaches 70.08 average accuracy, already surpassing Nemotron-1.5B by +9.63 points; with selected KPs, performance improves to 74.16, establishing a new state of the art at this scale. The model, curated training data, and code are publicly available at https://github.com/Hasuer/KnowRL.","published_date":"2026-04-14T11:53:23+00:00","viability_score":5,"cluster_label":"AI-enhanced Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Enhance AI reasoning in large models using reinforcement learning with targeted knowledge guidance.","time_to_mvp":"","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.12625v1","title":"Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments","abstract":"High-quality global illumination (GI) in real-time rendering is commonly achieved using precomputed lighting techniques, with lightmap as the standard choice. To support GI for static objects in dynamic lighting environments, multiple lightmaps at different lighting conditions need to be precomputed, which incurs substantial storage and memory overhead.   To overcome this limitation, we propose Neural Dynamic GI (NDGI), a novel compression technique specifically designed for temporal lightmap sets. Our method utilizes multi-dimensional feature maps and lightweight neural networks to integrate the temporal information instead of storing multiple sets explicitly, which significantly reduces the storage size of lightmaps. Additionally, we introduce a block compression (BC) simulation strategy during the training process, which enables BC compression on the final generated feature maps and further improves the compression ratio. To enable efficient real-time decompression, we also integrate a virtual texturing (VT) system with our neural representation.   
Compared with prior methods, our approach achieves high-quality dynamic GI while maintaining remarkably low storage and memory requirements, with only modest real-time decompression overhead. To facilitate further research in this direction, we will release our temporal lightmap dataset precomputed in multiple scenes featuring diverse temporal variations.","published_date":"2026-04-14T11:52:57+00:00","viability_score":7,"cluster_label":"Generative Graphics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel neural compression technique for temporal lightmaps that enables high-quality dynamic global illumination with significantly reduced storage and memory.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12622v1","title":"Efficient Semantic Image Communication for Traffic Monitoring at the Edge","abstract":"Many visual monitoring systems operate under strict communication constraints, where transmitting full-resolution images is impractical and often unnecessary. In such settings, visual data is often used for object presence, spatial relationships, and scene context rather than exact pixel fidelity. This paper presents two semantic image communication pipelines for traffic monitoring, MMSD and SAMR, that reduce transmission cost while preserving meaningful visual information. MMSD (Multi-Modal Semantic Decomposition) targets very high compression together with data confidentiality, since sensitive pixel content is not transmitted. It replaces the original image with compact semantic representations, namely segmentation maps, edge maps, and textual descriptions, and reconstructs the scene at the receiver using a diffusion-based generative model. SAMR (Semantic-Aware Masking Reconstruction) targets higher visual quality while maintaining strong compression. 
It selectively suppresses non-critical image regions according to semantic importance before standard JPEG encoding and restores the missing content at the receiver through generative inpainting. Both designs follow an asymmetric sender-receiver architecture, where lightweight processing is performed at the edge and computationally intensive reconstruction is offloaded to the server. On a Raspberry Pi~5, the edge-side processing time is about 15s for MMSD and 9s for SAMR. Experimental results show average transmitted-data reductions of 99% for MMSD and 99.1% for SAMR. In addition, MMSD achieves lower payload size than the recent SPIC baseline while preserving strong semantic consistency, whereas SAMR provides a better quality-compression trade-off than standard JPEG and SQ-GAN under comparable operating conditions.","published_date":"2026-04-14T11:51:17+00:00","viability_score":4,"cluster_label":"Edge AI / Computer Vision","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Develops semantic image communication pipelines for edge traffic monitoring that drastically reduce data transmission costs by replacing full images with compact representations.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12617v1","title":"SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models","abstract":"The post-training pipeline for diffusion models currently has two stages: supervised fine-tuning (SFT) on curated data and reinforcement learning (RL) with reward models. A fundamental gap separates them. SFT optimizes the denoiser only on ground-truth states sampled from the forward noising process; once inference deviates from these ideal states, subsequent denoising relies on out-of-distribution generalization rather than learned correction, exhibiting the same exposure bias that afflicts autoregressive models, but accumulated along the denoising trajectory instead of the token sequence. 
RL can in principle address this mismatch, yet its terminal reward signal is sparse, suffers from credit-assignment difficulty, and risks reward hacking. We propose SOAR (Self-Correction for Optimal Alignment and Refinement), a bias-correction post-training method that fills this gap. Starting from a real sample, SOAR performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. The method is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem. On SD3.5-Medium, SOAR improves GenEval from 0.70 to 0.78 and OCR from 0.64 to 0.67 over SFT, while simultaneously raising all model-based preference scores. In controlled reward-specific experiments, SOAR surpasses Flow-GRPO in final metric value on both aesthetic and text-image alignment tasks, despite having no access to a reward model. Since SOAR's base loss subsumes the standard SFT objective, it can directly replace SFT as a stronger first post-training stage after pretraining, while remaining fully compatible with subsequent RL alignment.","published_date":"2026-04-14T11:45:15+00:00","viability_score":4,"cluster_label":"Diffusion Models","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Introduces SOAR, a novel post-training method for diffusion models that corrects exposure bias and improves alignment without relying on reward models.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12616v1","title":"Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs","abstract":"The rapid evolution of Vision-Language Models (VLMs) has catalyzed unprecedented capabilities in artificial intelligence; however, this continuous modal expansion has inadvertently exposed a vastly broadened and unconstrained adversarial attack surface. 
Current multimodal jailbreak strategies primarily focus on surface-level pixel perturbations and typographic attacks or harmful images; however, they fail to engage with the complex semantic structures intrinsic to visual data. This leaves the vast semantic attack surface of original, natural images largely unscrutinized. Driven by the need to expose these deep-seated semantic vulnerabilities, we introduce \\textbf{MemJack}, a \\textbf{MEM}ory-augmented multi-agent \\textbf{JA}ilbreak atta\\textbf{CK} framework that explicitly leverages visual semantics to orchestrate automated jailbreak attacks. MemJack employs coordinated multi-agent cooperation to dynamically map visual entities to malicious intents, generate adversarial prompts via multi-angle visual-semantic camouflage, and utilize an Iterative Nullspace Projection (INLP) geometric filter to bypass premature latent space refusals. By accumulating and transferring successful strategies through a persistent Multimodal Experience Memory, MemJack maintains highly coherent extended multi-turn jailbreak attack interactions across different images, thereby improving the attack success rate (ASR) on new images. Extensive empirical evaluations across full, unmodified COCO val2017 images demonstrate that MemJack achieves a 71.48\\% ASR against Qwen3-VL-Plus, scaling to 90\\% under extended budgets. 
Furthermore, to catalyze future defensive alignment research, we will release \\textbf{MemJack-Bench}, a comprehensive dataset comprising over 113,000 interactive multimodal jailbreak attack trajectories, establishing a vital foundation for developing inherently robust VLMs.","published_date":"2026-04-14T11:44:59+00:00","viability_score":7,"cluster_label":"VLM Security / Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MemJack is a memory-augmented multi-agent framework that uses visual semantics to orchestrate automated jailbreak attacks on VLMs, achieving high success rates and releasing a comprehensive benchmark dataset.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12615v1","title":"DeepTest Tool Competition 2026: Benchmarking an LLM-Based Automotive Assistant","abstract":"This report summarizes the results of the first edition of the Large Language Model (LLM) Testing competition, held as part of the DeepTest workshop at ICSE 2026. Four tools competed in benchmarking an LLM-based car manual information retrieval application, with the objective of identifying user inputs for which the system fails to appropriately mention warnings contained in the manual. The testing solutions were evaluated based on their effectiveness in exposing failures and the diversity of the discovered failure-revealing tests. 
We report on the experimental methodology, the competitors, and the results.","published_date":"2026-04-14T11:44:43+00:00","viability_score":5,"cluster_label":"LLM Testing / Automotive","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Reports on the first LLM Testing competition focused on benchmarking an LLM-based automotive assistant for car manual information retrieval failures.","time_to_mvp":"1-2 weeks","tags":["quick_build"]},{"arxiv_id":"2604.12601v1","title":"LLM-Guided Prompt Evolution for Password Guessing","abstract":"Passwords still remain a dominant authentication method, yet their security is routinely subverted by predictable user choices and large-scale credential leaks. Automated password guessing is a key tool for stress-testing password policies and modeling attacker behavior. This paper applies LLM-driven evolutionary computation to automatically optimize prompts for the LLM password guessing framework. Using OpenEvolve, an open-source system combining MAP-Elites quality-diversity search with an island population model we evolve prompts that maximize cracking rate on a RockYou-derived test set. We evaluate three configurations: a local setup with Qwen3 8B, a single compact cloud model Gemini-2.5 Flash, and a two-model ensemble of frontier LLMs. The approach raises the cracking rates from 2.02\\% to 8.48\\%. Character distribution analysis further confirms how evolved prompts produce statistically more realistic passwords. 
Automated prompt evolution is a low-barrier yet effective way to strengthen LLM-based password auditing, and it underlines how attack pipelines tend to improve through automation.","published_date":"2026-04-14T11:27:32+00:00","viability_score":4,"cluster_label":"LLM Security","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Optimizing LLM prompts using evolutionary computation to enhance password guessing capabilities for security auditing.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12596v1","title":"KumoRFM-2: Scaling Foundation Models for Relational Learning","abstract":"We introduce KumoRFM-2, the next iteration of a pre-trained foundation model for relational data. KumoRFM-2 supports in-context learning as well as fine-tuning and is applicable to a wide range of predictive tasks. In contrast to tabular foundation models, KumoRFM-2 natively operates on relational data, processing one or more connected tables simultaneously without manual table flattening or target variable generation, all while preserving temporal consistency. KumoRFM-2 leverages a large corpus of synthetic and real-world data to pre-train across four axes: the row and column dimensions at the individual table level, and the foreign key and cross-sample dimensions at the database level. In contrast to its predecessor, KumoRFM-2 injects task information as early as possible, enabling sharper selection of task-relevant columns and improved robustness to noisy data. Through extensive experiments on 41 challenging benchmarks and analysis around expressivity and sensitivity, we demonstrate that KumoRFM-2 outperforms supervised and foundational approaches by up to 8%, while maintaining strong performance under extreme settings of cold start and noisy data. To our knowledge, this is the first time a few-shot foundation model has been shown to surpass supervised approaches on common benchmark tasks, with performance further improving upon fine-tuning. 
Finally, while KumoRFM-1 was limited to small-scale in-memory datasets, KumoRFM-2 scales to billion-scale relational datasets.","published_date":"2026-04-14T11:24:11+00:00","viability_score":7,"cluster_label":"Relational Foundation Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A foundation model for relational data that natively processes connected tables, outperforming supervised methods in few-shot learning and scaling to billion-scale datasets.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12573v1","title":"IDEA: An Interpretable and Editable Decision-Making Framework for LLMs via Verbal-to-Numeric Calibration","abstract":"Large Language Models are increasingly deployed for decision-making, yet their adoption in high-stakes domains remains limited by miscalibrated probabilities, unfaithful explanations, and inability to incorporate expert knowledge precisely. We propose IDEA, a framework that extracts LLM decision knowledge into an interpretable parametric model over semantically meaningful factors. Through joint learning of verbal-to-numerical mappings and decision parameters via EM, correlated sampling that preserves factor dependencies, and direct parameter editing with mathematical guarantees, IDEA produces calibrated probabilities while enabling quantitative human-AI collaboration. Experiments across five datasets show IDEA with Qwen-3-32B (78.6%) outperforms DeepSeek R1 (68.1%) and GPT-5.2 (77.9%), achieving perfect factor exclusion and exact calibration -- precision unattainable through prompting alone. 
The implementation is publicly available at https://github.com/leonbig/IDEA.","published_date":"2026-04-14T10:50:49+00:00","viability_score":7,"cluster_label":"Interpretable LLM Decision Making","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework that extracts LLM decision knowledge into an interpretable parametric model, enabling calibrated probabilities and quantitative human-AI collaboration.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12545v1","title":"Cross-Cultural Simulation of Citizen Emotional Responses to Bureaucratic Red Tape Using LLM Agents","abstract":"Improving policymaking is a central concern in public administration. Prior human subject studies reveal substantial cross-cultural differences in citizens' emotional responses to red tape during policy implementation. While LLM agents offer opportunities to simulate human-like responses and reduce experimental costs, their ability to generate culturally appropriate emotional responses to red tape remains unverified. To address this gap, we propose an evaluation framework for assessing LLMs' emotional responses to red tape across diverse cultural contexts. As a pilot study, we apply this framework to a single red-tape scenario. Our results show that all models exhibit limited alignment with human emotional responses, with notably weaker performance in Eastern cultures. Cultural prompting strategies prove largely ineffective in improving alignment. We further introduce \\textbf{RAMO}, an interactive interface for simulating citizens' emotional responses to red tape and for collecting human data to improve models. 
The interface is publicly available at https://ramo-chi.ivia.ch.","published_date":"2026-04-14T10:17:31+00:00","viability_score":4,"cluster_label":"LLM Agents for Social Simulation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An interactive interface for simulating citizen emotional responses to bureaucratic red tape across cultures, aiming to improve policymaking.","time_to_mvp":"1-3 months","tags":[]},{"arxiv_id":"2604.12543v1","title":"A Two-Stage LLM Framework for Accessible and Verified XAI Explanations","abstract":"Large Language Models (LLMs) are increasingly used to translate the technical outputs of eXplainable Artificial Intelligence (XAI) methods into accessible natural-language explanations. However, existing approaches often lack guarantees of accuracy, faithfulness, and completeness. At the same time, current efforts to evaluate such narratives remain largely subjective or confined to post-hoc scoring, offering no safeguards to prevent flawed explanations from reaching end-users. To address these limitations, this paper proposes a Two-Stage LLM Meta-Verification Framework that consists of (i) an Explainer LLM that converts raw XAI outputs into natural-language narratives, (ii) a Verifier LLM that assesses them in terms of faithfulness, coherence, completeness, and hallucination risk, and (iii) an iterative refeed mechanism that uses the Verifier's feedback to refine and improve them. Experiments across five XAI techniques and datasets, using three families of open-weight LLMs, show that verification is crucial for filtering unreliable explanations while improving linguistic accessibility compared with raw XAI outputs. In addition, the analysis of the Entropy Production Rate (EPR) during the refinement process indicates that the Verifier's feedback progressively guides the Explainer toward more stable and coherent reasoning. 
Overall, the proposed framework provides an efficient pathway toward more trustworthy and democratized XAI systems.","published_date":"2026-04-14T10:15:57+00:00","viability_score":7,"cluster_label":"XAI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A two-stage LLM framework that verifies and refines AI explanations for accuracy and accessibility, making XAI systems more trustworthy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12540v1","title":"When Does Data Augmentation Help? Evaluating LLM and Back-Translation Methods for Hausa and Fongbe NLP","abstract":"Data scarcity limits NLP development for low-resource African languages. We evaluate two data augmentation methods -- LLM-based generation (Gemini 2.5 Flash) and back-translation (NLLB-200) -- for Hausa and Fongbe, two West African languages that differ substantially in LLM generation quality. We assess augmentation on named entity recognition (NER) and part-of-speech (POS) tagging using MasakhaNER 2.0 and MasakhaPOS benchmarks. Our results reveal that augmentation effectiveness depends on task type rather than language or LLM quality alone. For NER, neither method improves over baseline for either language; LLM augmentation reduces Hausa NER by 0.24% F1 and Fongbe NER by 1.81% F1. For POS tagging, LLM augmentation improves Fongbe by 0.33% accuracy, while back-translation improves Hausa by 0.17%; back-translation reduces Fongbe POS by 0.35% and has negligible effect on Hausa POS. The same LLM-generated synthetic data produces opposite effects across tasks for Fongbe -- hurting NER while helping POS -- suggesting task structure governs augmentation outcomes more than synthetic data quality. 
These findings challenge the assumption that LLM generation quality predicts augmentation success, and provide actionable guidance: data augmentation should be treated as a task-specific intervention rather than a universally beneficial preprocessing step.","published_date":"2026-04-14T10:14:58+00:00","viability_score":6,"cluster_label":"Low-Resource NLP","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Evaluates LLM and back-translation data augmentation for Hausa and Fongbe NLP, finding task-specific effectiveness for NER and POS tagging.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.12537v1","title":"MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models","abstract":"Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking variations in information density within and across modalities, which leads to inefficient attention allocation where redundant visual regions dominate while informative content is underrepresented. We identify positional granularity as an implicit resource and propose MODIX (Multimodal Information-Driven Positional IndeX Scaling), a training-free framework that dynamically adapts positional strides based on modality-specific contributions. MODIX jointly models intra-modal density via covariance-based entropy and inter-modal interaction via cross-modal alignment to derive unified scores, which rescale positional indices to allocate finer granularity to informative modalities while compressing redundant ones, without requiring any modification to model parameters or architecture. 
Experiments across diverse architectures and benchmarks demonstrate that MODIX consistently improves multimodal reasoning and adaptively reallocates attention according to task-dependent information distributions, suggesting that positional encoding should be treated as an adaptive resource in Transformers for multimodal sequence modeling.","published_date":"2026-04-14T10:12:24+00:00","viability_score":7,"cluster_label":"Vision-Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MODIX is a training-free framework that scales positional indices in Vision-Language Models based on information density, improving multimodal reasoning.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12534v1","title":"Technical Report -- A Context-Sensitive Multi-Level Similarity Framework for First-Order Logic Arguments: An Axiomatic Study","abstract":"Similarity in formal argumentation has recently gained attention due to its significance in problems such as argument aggregation in semantics and enthymeme decoding. While existing approaches focus on propositional logic, we address the richer setting of First-Order Logic (FOL), where similarity must account for structured content. 
We introduce a comprehensive framework for FOL argument similarity, built upon: (1) an extended axiomatic foundation; (2) a four-level parametric model covering predicates, literals, clauses, and formulae similarity; (3) two model families, one syntax-sensitive via language models, both integrating contextual weights for nuanced and explainable similarity; and (4) formal constraints enforcing desirable properties.","published_date":"2026-04-14T10:05:03+00:00","viability_score":0,"cluster_label":"Formal Argumentation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical framework for context-sensitive, multi-level similarity in First-Order Logic arguments, extending axiomatic foundations and parametric models.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12526v1","title":"Orthogonal Subspace Projection for Continual Machine Unlearning via SVD-Based LoRA","abstract":"Continual machine unlearning aims to remove the influence of data that should no longer be retained, while preserving the usefulness of the model on everything else. This setting becomes especially difficult when deletion requests arrive sequentially, because the model must repeatedly adapt without erasing previously retained knowledge. Low-Rank Adaptation (LoRA) offers an efficient way to implement such updates, but naively combining many sequential LoRA modules leads to parameter collision, causing \\textit{strong interference} between tasks. We propose a static alternative based on Singular Value Decomposition (SVD)-guided orthogonal subspace projection. Our method constrains each new LoRA update during training so that it lies in the orthogonal complement of the subspaces used by earlier unlearning tasks. This preserves task isolation without requiring dynamic routing at deployment. Experiments on CIFAR-100 with ResNet-20 and on MNIST show stable behavior across long sequences of unlearning tasks. 
After thirty sequential unlearning tasks, state-of-the-art static fusion reduces retained accuracy from 60.39\\% to 12.70\\%, whereas the proposed in-training constrained optimization maintains baseline performance ($\\sim$58.1\\%) while preserving strong unlearning efficacy.","published_date":"2026-04-14T09:59:32+00:00","viability_score":3,"cluster_label":"Machine Unlearning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel method for continual machine unlearning that uses SVD-guided orthogonal subspace projection to prevent parameter collision and maintain model performance across sequential deletion requests.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12512v1","title":"NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1)","abstract":"In this paper, we present an overview of the NTIRE 2026 challenge on the 3rd Restore Any Image Model in the Wild, specifically focusing on Track 1: Professional Image Quality Assessment. Conventional Image Quality Assessment (IQA) typically relies on scalar scores. By compressing complex visual characteristics into a single number, these methods fundamentally struggle to distinguish subtle differences among uniformly high-quality images. Furthermore, they fail to articulate why one image is superior, lacking the reasoning capabilities required to provide guidance for vision tasks. To bridge this gap, recent advancements in Multimodal Large Language Models (MLLMs) offer a promising paradigm. Inspired by this potential, our challenge establishes a novel benchmark exploring the ability of MLLMs to mimic human expert cognition in evaluating high-quality image pairs. 
Participants were tasked with overcoming critical bottlenecks in professional scenarios, centering on two primary objectives: (1) Comparative Quality Selection: reliably identifying the visually superior image within a high-quality pair; and (2) Interpretative Reasoning: generating grounded, expert-level explanations that detail the rationale behind the selection. In total, the challenge attracted nearly 200 registrations and over 2,500 submissions. The top-performing methods significantly advanced the state of the art in professional IQA. The challenge dataset is available at https://github.com/narthchin/RAIM-PIQA, and the official homepage is accessible at https://www.codabench.org/competitions/12789/.","published_date":"2026-04-14T09:44:35+00:00","viability_score":7,"cluster_label":"Image Quality Assessment","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A challenge and dataset for professional image quality assessment using multimodal large language models to provide comparative selection and expert-level reasoning.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12503v1","title":"Topology-Aware Reasoning over Incomplete Knowledge Graph with Graph-Based Soft Prompting","abstract":"Large Language Models (LLMs) have shown remarkable capabilities across various tasks but remain prone to hallucinations in knowledge-intensive scenarios. Knowledge Base Question Answering (KBQA) mitigates this by grounding generation in Knowledge Graphs (KGs). However, most multi-hop KBQA methods rely on explicit edge traversal, making them fragile to KG incompleteness. In this paper, we propose a novel graph-based soft prompting framework that shifts the reasoning paradigm from node-level path traversal to subgraph-level reasoning. 
Specifically, we employ a Graph Neural Network (GNN) to encode extracted structural subgraphs into soft prompts, enabling the LLM to reason over richer structural context and identify relevant entities beyond immediate graph neighbors, thereby reducing sensitivity to missing edges. Furthermore, we introduce a two-stage paradigm that reduces computational cost while preserving good performance: a lightweight LLM first leverages the soft prompts to identify question-relevant entities and relations, followed by a more powerful LLM for evidence-aware answer generation. Experiments on four multi-hop KBQA benchmarks show that our approach achieves state-of-the-art performance on three of them, demonstrating its effectiveness. Code is available at the repository: https://github.com/Wangshuaiia/GraSP.","published_date":"2026-04-14T09:27:52+00:00","viability_score":6,"cluster_label":"Knowledge Graph QA","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A graph-based soft prompting framework for multi-hop Knowledge Graph Question Answering that uses GNNs to reason over subgraphs and reduce sensitivity to incomplete KGs.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12502v1","title":"SEATrack: Simple, Efficient, and Adaptive Multimodal Tracker","abstract":"Parameter-efficient fine-tuning (PEFT) in multimodal tracking reveals a concerning trend where recent performance gains are often achieved at the cost of inflated parameter budgets, which fundamentally erodes PEFT's efficiency promise. In this work, we introduce SEATrack, a Simple, Efficient, and Adaptive two-stream multimodal tracker that tackles this performance-efficiency dilemma from two complementary perspectives. We first prioritize cross-modal alignment of matching responses, an underexplored yet pivotal factor that we argue is essential for breaking the trade-off. 
Specifically, we observe that modality-specific biases in existing two-stream methods generate conflicting matching attention maps, thereby hindering effective joint representation learning. To mitigate this, we propose AMG-LoRA, which seamlessly integrates Low-Rank Adaptation (LoRA) for domain adaptation with Adaptive Mutual Guidance (AMG) to dynamically refine and align attention maps across modalities. We then depart from conventional local fusion approaches by introducing a Hierarchical Mixture of Experts (HMoE) that enables efficient global relation modeling, effectively balancing expressiveness and computational efficiency in cross-modal fusion. Equipped with these innovations, SEATrack makes notable progress over state-of-the-art methods in balancing performance with efficiency across RGB-T, RGB-D, and RGB-E tracking tasks. Code is available at https://github.com/AutoLab-SAI-SJTU/SEATrack.","published_date":"2026-04-14T09:27:50+00:00","viability_score":7,"cluster_label":"Multimodal Tracking","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SEATrack is a multimodal tracker that uses AMG-LoRA and HMoE to achieve a balance between performance and efficiency in cross-modal fusion for tracking tasks.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12498v1","title":"Lit2Vec: A Reproducible Workflow for Building a Legally Screened Chemistry Corpus from S2ORC for Downstream Retrieval and Text Mining","abstract":"We present Lit2Vec, a reproducible workflow for constructing and validating a chemistry corpus from the Semantic Scholar Open Research Corpus using conservative, metadata-based license screening. 
Using this workflow, we assembled an internal study corpus of 582,683 chemistry-specific full-text research articles with structured full text, token-aware paragraph chunks, paragraph-level embeddings generated with the intfloat/e5-large-v2 model, and record-level metadata including abstracts and licensing information. To support downstream retrieval and text-mining use cases, an eligible subset of the corpus was additionally enriched with machine-generated brief summaries and multi-label subfield annotations spanning 18 chemistry domains. Licensing was screened using metadata from Unpaywall, OpenAlex, and Crossref, and the resulting corpus was technically validated for schema compliance, embedding reproducibility, text quality, and metadata completeness. The primary contribution of this work is a reproducible workflow for corpus construction and validation, together with its associated schema and reproducibility resources. The released materials include the code, reconstruction workflow, schema, metadata/provenance artifacts, and validation outputs needed to reproduce the corpus from pinned public upstream resources. Public redistribution of source-derived text and broad text-derived representations is outside the scope of the general release. 
Researchers can reproduce the workflow by using the released pipeline with publicly available upstream datasets and metadata services.","published_date":"2026-04-14T09:26:11+00:00","viability_score":4,"cluster_label":"Data Corpus Construction","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A reproducible workflow for building and validating a legally screened chemistry corpus from S2ORC, enriched with embeddings and annotations for downstream text mining.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12493v1","title":"Latent Planning Emerges with Scale","abstract":"LLMs can perform seemingly planning-intensive tasks, like writing coherent stories or functioning code, without explicitly verbalizing a plan; however, the extent to which they implicitly plan is unknown. In this paper, we define latent planning as occurring when LLMs possess internal planning representations that (1) cause the generation of a specific future token or concept, and (2) shape preceding context to license said future token or concept. We study the Qwen-3 family (0.6B-14B) on simple planning tasks, finding that latent planning ability increases with scale. Models that plan possess features that represent a planned-for word like \"accountant\", and cause them to output \"an\" rather than \"a\"; moreover, even the less-successful Qwen-3 4B-8B have nascent planning mechanisms. On the more complex task of completing rhyming couplets, we find that models often identify a rhyme ahead of time, but even large models seldom plan far ahead. However, we can elicit some planning that increases with scale when steering models towards planned words in prose. 
In sum, we offer a framework for measuring planning and mechanistic evidence of how models' planning abilities grow with scale.","published_date":"2026-04-14T09:18:53+00:00","viability_score":1,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Investigating how latent planning abilities emerge and scale in Large Language Models through analysis of their internal representations.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12490v1","title":"Deepfakes at Face Value: Image and Authority","abstract":"Deepfakes are synthetic media that superimpose or generate someone's likeness on to pre-existing sound, images, or videos using deep learning methods. Existing accounts of the wrongs involved in creating and distributing deepfakes focus on the harms they cause or the non-normative interests they violate. However, these approaches do not explain how deepfakes can be wrongful even when they cause no harm or set back any other non-normative interest. To address this issue, this paper identifies a neglected reason why deepfakes are wrong: they can subvert our legitimate interests in having authority over the permissible uses of our image and the governance of our identity. We argue that deepfakes are wrong when they usurp our authority to determine the provenance of our own agency by exploiting our biometric features as a generative resource. In particular, we have a specific right against the algorithmic conscription of our identity. 
We refine the scope of this interest by distinguishing between permissible forms of appropriation, such as artistic depiction, from wrongful algorithmic simulation.","published_date":"2026-04-14T09:16:45+00:00","viability_score":0,"cluster_label":"AI Ethics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Deepfakes are wrong because they usurp our authority over the permissible uses of our image and identity by exploiting biometric features as a generative resource.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12487v1","title":"KG-Reasoner: A Reinforced Model for End-to-End Multi-Hop Knowledge Graph Reasoning","abstract":"Large Language Models (LLMs) exhibit strong abilities in natural language understanding and generation, yet they struggle with knowledge-intensive reasoning. Structured Knowledge Graphs (KGs) provide an effective form of external knowledge representation and have been widely used to enhance performance in classical Knowledge Base Question Answering (KBQA) tasks. However, performing precise multi-hop reasoning over KGs for complex queries remains highly challenging. Most existing approaches decompose the reasoning process into a sequence of isolated steps executed through a fixed pipeline. While effective to some extent, such designs constrain reasoning flexibility and fragment the overall decision process, often leading to incoherence and the loss of critical intermediate information from earlier steps. In this paper, we introduce KG-Reasoner, an end-to-end framework that integrates multi-step reasoning into a unified \"thinking\" phase of a Reasoning LLM. Through Reinforcement Learning (RL), the LLM is trained to internalize the KG traversal process, enabling it to dynamically explore reasoning paths, and perform backtracking when necessary. 
Experiments on eight multi-hop and knowledge-intensive reasoning benchmarks demonstrate that KG-Reasoner achieves competitive or superior performance compared to the state-of-the-art methods. Code is available at the repository: https://github.com/Wangshuaiia/KG-Reasoner.","published_date":"2026-04-14T09:14:21+00:00","viability_score":7,"cluster_label":"Knowledge Graph Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"KG-Reasoner is an end-to-end framework using Reinforcement Learning to enable LLMs to perform dynamic, multi-hop reasoning over Knowledge Graphs.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12483v1","title":"Elastic Net Regularization and Gabor Dictionary for Classification of Heart Sound Signals using Deep Learning","abstract":"In this article, we propose the optimization of the resolution of time-frequency atoms and the regularization of fitting models to obtain better representations of heart sound signals. This is done by evaluating the classification performance of deep learning (DL) networks in discriminating five heart valvular conditions based on a new class of time-frequency feature matrices derived from the fitting models. We inspect several combinations of resolution and regularization, and the optimal one is the one that provides the highest performance. To this end, a fitting model is obtained based on a heart sound signal and an overcomplete dictionary of Gabor atoms using elastic net regularization of linear models. We consider two different DL architectures, the first mainly consisting of a 1D convolutional neural network (CNN) layer and a long short-term memory (LSTM) layer, while the second is composed of 1D and 2D CNN layers followed by an LSTM layer. The networks are trained with two algorithms, namely stochastic gradient descent with momentum (SGDM) and adaptive moment estimation (ADAM). 
Extensive experimentation has been conducted using a database containing heart sound signals of five heart valvular conditions. The best classification accuracy of 98.95% is achieved with the second architecture when trained with ADAM and feature matrices derived from optimal models obtained with a Gabor dictionary consisting of atoms with high-time low-frequency resolution and imposing sparsity on the models.","published_date":"2026-04-14T09:09:30+00:00","viability_score":4,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Optimizing deep learning models with elastic net regularization and Gabor dictionaries for improved heart sound signal classification.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12482v1","title":"Social Learning Strategies for Evolved Virtual Soft Robots","abstract":"Optimizing the body and brain of a robot is a coupled challenge: the morphology determines what control strategies are effective, while the control parameters influence how well the morphology performs. This joint optimization can be done through nested loops of evolutionary and learning processes, where the control parameters of each robot are learned independently. However, the control parameters learned by one robot may contain valuable information for others. Thus, we introduce a social learning approach in which robots can exploit optimized parameters from their peers to accelerate their own brain optimization. Within this framework, we systematically investigate how the selection of teachers, deciding which and how many robots to learn from, affects performance, experimenting with virtual soft robots in four tasks and environments. In particular, we study the effect of inheriting experience from morphologically similar robots due to the tightly coupled body and brain in robot optimization. 
Our results confirm the effectiveness of building on others' experience, as social learning clearly outperforms learning from scratch under equivalent computational budgets. In addition, while the optimal teacher selection strategy remains open, our findings suggest that incorporating knowledge from multiple teachers can yield more consistent and robust improvements.","published_date":"2026-04-14T09:05:56+00:00","viability_score":3,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Developing social learning strategies for virtual soft robots to accelerate brain optimization by leveraging peer experience.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12480v1","title":"Audio Source Separation in Reverberant Environments using $\u03b2$-divergence based Nonnegative Factorization","abstract":"In Gaussian model-based multichannel audio source separation, the likelihood of observed mixtures of source signals is parametrized by source spectral variances and by associated spatial covariance matrices. These parameters are estimated by maximizing the likelihood through an Expectation-Maximization algorithm and used to separate the signals by means of multichannel Wiener filtering.   We propose to estimate these parameters by applying nonnegative factorization based on prior information on source variances. In the nonnegative factorization, spectral basis matrices can be defined as the prior information. The matrices can be either extracted or indirectly made available through a redundant library that is trained in advance. In a separate step, applying nonnegative tensor factorization, two algorithms are proposed in order to either extract or detect the basis matrices that best represent the power spectra of the source signals in the observed mixtures. The factorization is achieved by minimizing the $\u03b2$-divergence through multiplicative update rules. 
The sparsity of factorization can be controlled by tuning the value of $\u03b2$.   Experiments show that sparsity, rather than the value assigned to $\u03b2$ in the training, is crucial in order to increase the separation performance. The proposed method was evaluated in several mixing conditions. It provides better separation quality with respect to other comparable algorithms.","published_date":"2026-04-14T09:03:17+00:00","viability_score":4,"cluster_label":"Audio AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Improving audio source separation in reverberant environments using beta-divergence based nonnegative factorization.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12477v1","title":"Mining Large Language Models for Low-Resource Language Data: Comparing Elicitation Strategies for Hausa and Fongbe","abstract":"Large language models (LLMs) are trained on data contributed by low-resource language communities, yet the linguistic knowledge encoded in these models remains accessible only through commercial APIs. This paper investigates whether strategic prompting can extract usable text data from LLMs for two West African languages: Hausa (Afroasiatic, approximately 80 million speakers) and Fongbe (Niger-Congo, approximately 2 million speakers). We systematically compare six elicitation task types across two commercial LLMs (GPT-4o Mini and Gemini 2.5 Flash). GPT-4o Mini extracts 6-41 times more usable target-language words per API call than Gemini. Optimal strategies differ by language: Hausa benefits from functional text and dialogue, while Fongbe requires constrained generation prompts. 
We release all generated corpora and code.","published_date":"2026-04-14T09:00:52+00:00","viability_score":7,"cluster_label":"LLM Data Mining","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Extracting low-resource language data from LLMs using strategic prompting, with released corpora and code.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12474v1","title":"From Kinematics to Dynamics: Learning to Refine Hybrid Plans for Physically Feasible Execution","abstract":"In many robotic tasks, agents must traverse a sequence of spatial regions to complete a mission. Such problems are inherently mixed discrete-continuous: a high-level action sequence and a physically feasible continuous trajectory. The resulting trajectory and action sequence must also satisfy problem constraints such as deadlines, time windows, and velocity or acceleration limits. While hybrid temporal planners attempt to address this challenge, they typically model motion using linear (first-order) dynamics, which cannot guarantee that the resulting plan respects the robot's true physical constraints. Consequently, even when the high-level action sequence is fixed, producing a dynamically feasible trajectory becomes a bi-level optimization problem. We address this problem via reinforcement learning in continuous space. We define a Markov Decision Process that explicitly incorporates analytical second-order constraints and use it to refine first-order plans generated by a hybrid planner. 
Our results show that this approach can reliably recover physical feasibility and effectively bridge the gap between a planner's initial first-order trajectory and the dynamics required for real execution.","published_date":"2026-04-14T09:00:08+00:00","viability_score":2,"cluster_label":"Robotics Planning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Reinforcement learning refines robotic motion plans to ensure physical feasibility, bridging the gap between planning and real-world execution.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12470v1","title":"Intelligent ROI-Based Vehicle Counting Framework for Automated Traffic Monitoring","abstract":"Accurate vehicle counting through video surveillance is crucial for efficient traffic management. However, achieving high counting accuracy while ensuring computational efficiency remains a challenge. To address this, we propose a fully automated, video-based vehicle counting framework designed to optimize both computational efficiency and counting accuracy. Our framework operates in two distinct phases: estimation and prediction. In the estimation phase, the optimal region of interest (ROI) is automatically determined using a novel combination of three models based on detection scores, tracking scores, and vehicle density. This adaptive approach ensures compatibility with any detection and tracking method, enhancing the framework's versatility. In the prediction phase, vehicle counting is efficiently performed within the estimated ROI. We evaluated our framework on benchmark datasets like UA-DETRAC, GRAM, CDnet 2014, and ATON. Results demonstrate exceptional accuracy, with most videos achieving 100% accuracy, while also enhancing computational efficiency, making processing up to four times faster than full-frame processing. The framework outperforms existing techniques, especially in complex multi-road scenarios, demonstrating robustness and superior accuracy. 
These advancements make it a promising solution for real-time traffic monitoring.","published_date":"2026-04-14T08:55:12+00:00","viability_score":7,"cluster_label":"Traffic Monitoring","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An intelligent framework automatically identifies optimal regions for vehicle counting in traffic videos, achieving near-perfect accuracy and four times faster processing.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12463v1","title":"Euler-inspired Decoupling Neural Operator for Efficient Pansharpening","abstract":"Pansharpening aims to synthesize high-resolution multispectral (HR-MS) images by fusing the spatial textures of panchromatic (PAN) images with the spectral information of low-resolution multispectral (LR-MS) images. While recent deep learning paradigms, especially diffusion-based operators, have pushed the performance boundaries, they often encounter spectral-spatial blurring and prohibitive computational costs due to their stochastic nature and iterative sampling. In this paper, we propose the Euler-inspired Decoupling Neural Operator (EDNO), a physics-inspired framework that redefines pansharpening as a continuous functional mapping in the frequency domain. Departing from conventional Cartesian feature processing, our EDNO leverages Euler's formula to transform features into a polar coordinate system, enabling a novel explicit-implicit interaction mechanism. Specifically, we develop the Euler Feature Interaction Layer (EFIL), which decouples the fusion task into two specialized modules: 1) Explicit Feature Interaction Module, utilizing a linear weighting scheme to simulate phase rotation for adaptive geometric alignment; and 2) Implicit Feature Interaction Module, employing a feed-forward network to model spectral distributions for superior color consistency. 
By operating in the frequency domain, EDNO inherently captures global receptive fields while maintaining discretization-invariance. Experimental results on the three datasets demonstrate that EDNO offers a superior efficiency-performance balance compared to heavyweight architectures.","published_date":"2026-04-14T08:49:10+00:00","viability_score":7,"cluster_label":"Image Enhancement","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A physics-inspired neural operator efficiently synthesizes high-resolution multispectral images from panchromatic and low-resolution multispectral inputs.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12461v1","title":"CIA: Inferring the Communication Topology from LLM-based Multi-Agent Systems","abstract":"LLM-based Multi-Agent Systems (MAS) have demonstrated remarkable capabilities in solving complex tasks. Central to MAS is the communication topology which governs how agents exchange information internally. Consequently, the security of communication topologies has attracted increasing attention. In this paper, we investigate a critical privacy risk: MAS communication topologies can be inferred under a restrictive black-box setting, exposing system vulnerabilities and posing significant intellectual property threats. To explore this risk, we propose Communication Inference Attack (CIA), a novel attack that constructs new adversarial queries to induce intermediate agents' reasoning outputs and models their semantic correlations through the proposed global bias disentanglement and LLM-guided weak supervision. 
Extensive experiments on MAS with optimized communication topologies demonstrate the effectiveness of CIA, achieving an average AUC of 0.87 and a peak AUC of up to 0.99, thereby revealing the substantial privacy risk in MAS.","published_date":"2026-04-14T08:48:15+00:00","viability_score":7,"cluster_label":"LLM Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel attack infers communication topologies in LLM-based multi-agent systems, revealing significant privacy risks and system vulnerabilities.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12460v1","title":"Enhancing Clustering: An Explainable Approach via Filtered Patterns","abstract":"Machine learning has become a central research area, with increasing attention devoted to explainable clustering, also known as conceptual clustering, which is a knowledge-driven unsupervised learning paradigm that partitions data into $\u03b8$ disjoint clusters, where each cluster is described by an explicit symbolic representation, typically expressed as a closed pattern or itemset. By providing human-interpretable cluster descriptions, explainable clustering plays an important role in explainable artificial intelligence and knowledge discovery. Recent work improved clustering quality by introducing k-relaxed frequent patterns (k-RFPs), a pattern model that relaxes strict coverage constraints through a generalized k-cover definition. This framework integrates constraint-based reasoning, using SAT solvers for pattern generation, with combinatorial optimization, using Integer Linear Programming (ILP) for cluster selection. Despite its effectiveness, this approach suffers from a critical limitation: multiple distinct k-RFPs may induce identical k-covers, leading to redundant symbolic representations that unnecessarily enlarge the search space and increase computational complexity during cluster construction. 
In this paper, we address this redundancy through a pattern reduction framework. Our contributions are threefold. First, we formally characterize the conditions under which distinct k-RFPs induce identical k-covers, providing theoretical foundations for redundancy detection. Second, we propose an optimization strategy that removes redundant patterns by retaining a single representative pattern for each distinct k-cover. Third, we investigate the interpretability and representativeness of the patterns selected by the ILP model by analyzing their robustness with respect to their induced clusters. Extensive experiments conducted on several real-world datasets demonstrate that the proposed approach significantly reduces the pattern search space, improves computational efficiency, and preserves, and in some cases enhances, the quality of the resulting clusters.","published_date":"2026-04-14T08:45:38+00:00","viability_score":2,"cluster_label":"Explainable AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A theoretical framework for reducing redundancy in explainable clustering by formally characterizing and removing duplicate pattern representations.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12459v1","title":"Operationalising the Right to be Forgotten in LLMs: A Lightweight Sequential Unlearning Framework for Privacy-Aligned Deployment in Politically Sensitive Environments","abstract":"Large Language Models (LLMs) are increasingly deployed in politically sensitive environments, where memorisation of personal data or confidential content raises regulatory concerns under frameworks such as the GDPR and its Right to be Forgotten. Translating such legal principles into large-scale generative systems presents significant technical challenges.   We introduce a lightweight sequential unlearning framework that explicitly separates retention and suppression objectives. 
The method first stabilises benign capabilities through positive fine-tuning, then applies layer-restricted negative fine-tuning to suppress designated sensitive patterns while preserving general language competence.   Experiments on the SemEval-2025 LLM Unlearning benchmark demonstrate effective behavioural suppression with minimal impact on factual accuracy and fluency. GPT-2 exhibits greater robustness than DistilGPT-2, highlighting the role of model capacity in privacy-aligned adaptation. We position sequential unlearning as a practical and reproducible mechanism for operationalising data erasure requirements in politically deployed LLMs.","published_date":"2026-04-14T08:45:12+00:00","viability_score":7,"cluster_label":"LLM Privacy","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A lightweight sequential unlearning framework for LLMs that enables privacy-aligned deployment by selectively suppressing sensitive data without compromising general language capabilities.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12456v1","title":"X-VC: Zero-shot Streaming Voice Conversion in Codec Space","abstract":"Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems for interactive scenarios remains challenging because high-fidelity speaker transfer and low-latency streaming inference are difficult to achieve simultaneously. In this work, we present X-VC, a zero-shot streaming VC system that performs one-step conversion in the latent space of a pretrained neural codec. 
X-VC uses a dual-conditioning acoustic converter that jointly models source codec latents and frame-level acoustic conditions derived from target reference speech, while injecting utterance-level target speaker information through adaptive normalization. To reduce the mismatch between training and inference, we train the model with generated paired data and a role-assignment strategy that combines standard, reconstruction, and reversed modes. For streaming inference, we further adopt a chunkwise inference scheme with overlap smoothing that is aligned with the segment-based training paradigm of the codec. Experiments on Seed-TTS-Eval show that X-VC achieves the best streaming WER in both English and Chinese, strong speaker similarity in same-language and cross-lingual settings, and substantially lower offline real-time factor than the compared baselines. These results suggest that codec-space one-step conversion is a practical approach for building high-quality low-latency zero-shot VC systems. Audio samples are available at https://x-vc.github.io. Our code and checkpoints will also be released.","published_date":"2026-04-14T08:42:10+00:00","viability_score":7,"cluster_label":"Voice Conversion Technology","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"X-VC enables real-time zero-shot voice conversion to recreate any voice instantly using a neural codec.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.12440v1","title":"IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation","abstract":"Real-world industrial inspection requires not only localizing defects, but also explaining them in natural language and generating controlled defect edits. However, existing approaches fail to jointly support all three capabilities within a unified framework and evaluation protocol. 
We propose IAD-Unify, a dual-encoder unified framework in which a frozen DINOv2-based region expert supplies precise anomaly evidence to a shared Qwen3.5-4B vision-language backbone via lightweight token injection, jointly enabling anomaly segmentation, region-grounded understanding, and mask-guided generation. To enable unified evaluation, we further construct Anomaly-56K, a comprehensive unified multi-task IAD evaluation platform, spanning 59,916 images across 24 categories and 104 defect variants. Controlled ablations yield four findings: (i) region grounding is the decisive mechanism for understanding, removing it degrades location accuracy by >76 pp; (ii) predicted-region performance closely matches oracle, confirming deployment viability; (iii) region-grounded generation achieves the best full-image fidelity and masked-region perceptual quality; and (iv) pre-initialized joint training improves understanding at negligible generation cost (-0.16 dB). IAD-Unify further achieves strong performance on the MMAD benchmark, including categories unseen during training, demonstrating robust cross-category generalization.","published_date":"2026-04-14T08:29:31+00:00","viability_score":8,"cluster_label":"AI-powered Industrial Solutions","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified model for industrial anomaly detection, offering enhanced segmentation, understanding, and generation in manufacturing processes.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12424v1","title":"Decoding by Perturbation: Mitigating MLLM Hallucinations via Dynamic Textual Perturbation","abstract":"Multimodal Large Language Models frequently suffer from inference hallucinations, partially stemming from language priors dominating visual evidence. 
Existing training-free mitigation methods either perturb the visual representation and deviate from the natural image distribution, or enforce intrusive manipulations that compromise the model's inherent generative fluency. We introduce a novel perspective that multimodal hallucination manifests as the hypersensitivity of visual grounding to textual phrasing during the decoding phase. Building on this insight, we propose Decoding by Perturbation (DeP), a training-free framework mitigating prior-induced hallucinations via controlled textual interventions. DeP employs a dynamic probe applying multi-level textual perturbations to elicit latent language priors. Leveraging attention variance, it enhances stable evidence regions while suppressing suspicious noise in the feature space. Furthermore, it constructs an interpretable prior drift direction using logits statistics to counteract probability biases from textual co-occurrences. Extensive experiments confirm DeP effectively reduces hallucinations and achieves superior performance across multiple benchmarks.","published_date":"2026-04-14T08:15:44+00:00","viability_score":7,"cluster_label":"LLM Hallucination Mitigation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A training-free framework that mitigates multimodal LLM hallucinations by dynamically perturbing text to stabilize visual grounding.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12418v1","title":"RACF: A Resilient Autonomous Car Framework with Object Distance Correction","abstract":"Autonomous vehicles are increasingly deployed in safety-critical applications, where sensing failures or cyberphysical attacks can lead to unsafe operations resulting in human loss and/or severe physical damages. Reliable real-time perception is therefore critically important for their safe operations and acceptability. 
For example, vision-based distance estimation is vulnerable to environmental degradation and adversarial perturbations, and existing defenses are often reactive and too slow to promptly mitigate their impacts on safe operations. We present a Resilient Autonomous Car Framework (RACF) that incorporates an Object Distance Correction Algorithm (ODCA) to improve perception-layer robustness through redundancy and diversity across a depth camera, LiDAR, and physics-based kinematics. Within this framework, when the obstacle distance estimate produced by the depth camera is inconsistent, a cross-sensor gate activates the correction algorithm to fix the detected inconsistency. We experiment with the proposed resilient car framework and evaluate its performance on a testbed implemented using the Quanser QCar 2 platform. The presented framework achieves up to 35% RMSE reduction under strong corruption and improves stop compliance and braking latency while operating in real time. These results demonstrate a practical and lightweight approach to resilient perception for safety-critical autonomous driving.","published_date":"2026-04-14T08:06:09+00:00","viability_score":7,"cluster_label":"Autonomous Vehicle Perception","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A resilient framework for autonomous cars that uses sensor fusion and a novel correction algorithm to ensure accurate object distance estimation in real-time.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12408v1","title":"Security and Resilience in Autonomous Vehicles: A Proactive Design Approach","abstract":"Autonomous vehicles (AVs) promise efficient, clean and cost-effective transportation systems, but their reliance on sensors, wireless communications, and decision-making systems makes them vulnerable to cyberattacks and physical threats. This chapter presents novel design techniques to strengthen the security and resilience of AVs. 
We first provide a taxonomy of potential attacks across different architectural layers, from perception and control manipulation to Vehicle-to-Any (V2X) communication exploits and software supply chain compromises. Building on this analysis, we present an AV Resilient architecture that integrates redundancy, diversity, and adaptive reconfiguration strategies, supported by anomaly- and hash-based intrusion detection techniques. Experimental validation on the Quanser QCar platform demonstrates the effectiveness of these methods in detecting depth camera blinding attacks and software tampering of perception modules. The results highlight how fast anomaly detection combined with fallback and backup mechanisms ensures operational continuity, even under adversarial conditions. By linking layered threat modeling with practical defense implementations, this work advances AV resilience strategies for safer and more trustworthy autonomous vehicles.","published_date":"2026-04-14T07:45:16+00:00","viability_score":7,"cluster_label":"Autonomous Vehicle Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A proactive design approach for autonomous vehicles integrating redundancy and anomaly detection to ensure security and operational continuity against cyberattacks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12391v1","title":"Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models","abstract":"In this paper, we present Chain-of-Models Pre-Training (CoM-PT), a novel performance-lossless training acceleration method for vision foundation models (VFMs). This approach fundamentally differs from existing acceleration methods in its core motivation: rather than optimizing each model individually, CoM-PT is designed to accelerate the training pipeline at the model family level, scaling efficiently as the model family expands. 
Specifically, CoM-PT establishes a pre-training sequence for the model family, arranged in ascending order of model size, called model chain. In this chain, only the smallest model undergoes standard individual pre-training, while the other models are efficiently trained through sequential inverse knowledge transfer from their smaller predecessors by jointly reusing the knowledge in the parameter space and the feature space. As a result, CoM-PT enables all models to achieve performance that is mostly superior to standard individual training while significantly reducing training cost, and this is extensively validated across 45 datasets spanning zero-shot and fine-tuning tasks. Notably, its efficient scaling property yields a remarkable phenomenon: training more models even results in higher efficiency. For instance, when pre-training on CC3M: i) given ViT-L as the largest model, progressively prepending smaller models to the model chain reduces computational complexity by up to 72%; ii) within a fixed model size range, as the VFM family scales across 3, 4, and 7 models, the acceleration ratio of CoM-PT exhibits a striking leap: from 4.13X to 5.68X and 7.09X. 
Since CoM-PT is naturally agnostic to specific pre-training paradigms, we open-source the code to spur further extensions in more computationally intensive scenarios, such as large language model pre-training.","published_date":"2026-04-14T07:26:23+00:00","viability_score":7,"cluster_label":"LLM Training Acceleration","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel training acceleration method for vision foundation models that creates a model family chain, enabling efficient sequential knowledge transfer and reducing training costs by up to 72%.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12390v1","title":"Heuristic Classification of Thoughts Prompting (HCoT): Integrating Expert System Heuristics for Structured Reasoning into Large Language Models","abstract":"This paper addresses two limitations of large language models (LLMs) in solving complex problems: (1) their reasoning processes exhibit Bayesian-like stochastic generation, where each token is sampled from a context-dependent probability distribution, leading to inherently random decision trajectories rather than deterministic planning; (2) the reasoning and decision-making mechanisms are statically decoupled, meaning dynamically retrieved domain knowledge fails to dynamically adjust the underlying reasoning strategy. These dual deficiencies result in initial decisions lacking strategic anchoring and reasoning chains often failing to converge on correct solutions, as stochastic generation lacks mechanisms for trajectory correction or knowledge-guided optimization during sequential reasoning. To resolve these issues, we propose a problem-solving method integrated into the LLM's generation process to guide reasoning. This method, compatible with numerous LLMs and featuring reusable solutions, is grounded in a novel Heuristic-Classification-of-Thoughts prompting schema (HCoT). 
HCoT synergizes the LLM's reasoning ability with a structured problem space via a heuristic classification model that controls the reasoning process and provides reusable abstract solutions. Evaluated on two complex inductive reasoning tasks with ill-defined search spaces, HCoT outperforms existing approaches (e.g., Tree-of-Thoughts and Chain-of-Thoughts prompting) in performance. On the well-structured 24 Game task, HCoT demonstrates significantly higher token efficiency compared to the state-of-the-art Tree-of-Thoughts-Breadth-First-Search. In terms of both accuracy and token usage, HCoT achieves a Pareto frontier balance, offering a strong trade-off between performance and computational cost.","published_date":"2026-04-14T07:24:08+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel prompting schema that integrates expert system heuristics to improve LLM reasoning and problem-solving efficiency.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12384v1","title":"Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints","abstract":"Safety alignment in Large Language Models (LLMs) remains highly fragile during fine-tuning, where even benign adaptation can degrade pre-trained refusal behaviors and enable harmful responses. Existing defenses typically constrain either weights or activations in isolation, without considering their coupled effects on safety. In this paper, we first theoretically demonstrate that constraining either weights or activations alone is insufficient for safety preservation. To robustly preserve safety alignment, we propose Coupled Weight and Activation Constraints (CWAC), a novel approach that simultaneously enforces a precomputed safety subspace on weight updates and applies targeted regularization to safety-critical features identified by sparse autoencoders. 
Extensive experiments across four widely used LLMs and diverse downstream tasks show that CWAC consistently achieves the lowest harmful scores with minimal impact on fine-tuning accuracy, substantially outperforming strong baselines even under high harmful data ratios.","published_date":"2026-04-14T07:17:55+00:00","viability_score":7,"cluster_label":"LLM Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new method that simultaneously constrains LLM weights and activations to prevent safety degradation during fine-tuning.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12379v1","title":"Beyond Output Correctness: Benchmarking and Evaluating Large Language Model Reasoning in Coding Tasks","abstract":"Large language models (LLMs) increasingly rely on explicit reasoning to solve coding tasks, yet evaluating the quality of this reasoning remains challenging. Existing reasoning evaluators are not designed for coding, and current benchmarks focus primarily on code generation, leaving other coding tasks largely unexplored. We introduce CodeRQ-Bench, the first benchmark for evaluating LLM reasoning quality across three coding task categories: generation, summarization, and classification. Using this benchmark, we analyze 1,069 mismatch cases from existing evaluators, identify five recurring limitations, and derive four design insights for reasoning evaluation in coding tasks. Guided by these insights, we propose VERA, a two-stage evaluator that combines evidence-grounded verification with ambiguity-aware score correction. Experiments on CodeRQ-Bench show that VERA consistently outperforms strong baselines across four datasets, improving AUCROC by up to 0.26 and AUPRC by up to 0.21. 
We release CodeRQ-Bench at https://github.com/MrLYG/CodeRQ-Bench, supporting future investigations.","published_date":"2026-04-14T07:12:46+00:00","viability_score":8,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and evaluator for LLM reasoning in coding tasks that improves accuracy and identifies limitations in existing methods.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12377v1","title":"SCRIPT: A Subcharacter Compositional Representation Injection Module for Korean Pre-Trained Language Models","abstract":"Korean is a morphologically rich language with a featural writing system in which each character is systematically composed of subcharacter units known as Jamo. These subcharacters not only determine the visual structure of Korean but also encode frequent and linguistically meaningful morphophonological processes. However, most current Korean language models (LMs) are based on subword tokenization schemes, which are not explicitly designed to capture the internal compositional structure of characters. To address this limitation, we propose SCRIPT, a model-agnostic module that injects subcharacter compositional knowledge into Korean PLMs. SCRIPT enriches subword embeddings with structural granularity, without requiring architectural changes or additional pre-training. As a result, SCRIPT enhances all baselines across various Korean natural language understanding (NLU) and generation (NLG) tasks. Moreover, beyond performance gains, detailed linguistic analyses show that SCRIPT reshapes the embedding space in a way that better captures grammatical regularities and semantically cohesive variations. 
Our code is available at https://github.com/SungHo3268/SCRIPT.","published_date":"2026-04-14T07:09:44+00:00","viability_score":7,"cluster_label":"LLM Adaptation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A module that injects subcharacter compositional knowledge into Korean LLMs to improve linguistic understanding and generation without architectural changes.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12376v1","title":"Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations","abstract":"When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods -- outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context -- on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ($p=0.017$, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination -- the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive. 
Keyword specificity alone accounts for a 25 percentage point accuracy difference.","published_date":"2026-04-14T07:06:35+00:00","viability_score":7,"cluster_label":"LLM Memory Management","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel LLM memory system using keyword bookmarks and a recall tool to dramatically improve long-conversation recall, outperforming existing methods.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12374v1","title":"Nemotron 3 Super: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning","abstract":"We describe the pre-training, post-training, and quantization of Nemotron 3 Super, a 120 billion (active 12 billion) parameter hybrid Mamba-Attention Mixture-of-Experts model. Nemotron 3 Super is the first model in the Nemotron 3 family to 1) be pre-trained in NVFP4, 2) leverage LatentMoE, a new Mixture-of-Experts architecture that optimizes for both accuracy per FLOP and accuracy per parameter, and 3) include MTP layers for inference acceleration through native speculative decoding. We pre-trained Nemotron 3 Super on 25 trillion tokens followed by post-training using supervised fine tuning (SFT) and reinforcement learning (RL). The final model supports up to 1M context length and achieves comparable accuracy on common benchmarks, while also achieving up to 2.2x and 7.5x higher inference throughput compared to GPT-OSS-120B and Qwen3.5-122B, respectively. 
Nemotron 3 Super datasets, along with the base, post-trained, and quantized checkpoints, are open-sourced on HuggingFace.","published_date":"2026-04-14T07:02:32+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An open-source, efficient hybrid Mamba-Transformer model for agentic reasoning that significantly outperforms existing models in inference throughput.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12357v1","title":"ReflectCAP: Detailed Image Captioning with Reflective Memory","abstract":"Detailed image captioning demands both factual grounding and fine-grained coverage, yet existing methods have struggled to achieve them simultaneously. We address this tension with Reflective Note-Guided Captioning (ReflectCAP), where a multi-agent pipeline analyzes what the target large vision-language model (LVLM) consistently hallucinates and what it systematically overlooks, distilling these patterns into reusable guidelines called Structured Reflection Notes. At inference time, these notes steer the captioning model along both axes -- what to avoid and what to attend to -- yielding detailed captions that jointly improve factuality and coverage. Applying this method to 8 LVLMs spanning the GPT-4.1 family, Qwen series, and InternVL variants, ReflectCAP reaches the Pareto frontier of the trade-off between factuality and coverage, and delivers substantial gains on CapArena-Auto, where generated captions are judged head-to-head against strong reference models. Moreover, ReflectCAP offers a more favorable trade-off between caption quality and compute cost than model scaling or existing multi-agent pipelines, which incur 21--36\\% greater overhead. 
This makes high-quality detailed captioning viable under real-world cost and latency constraints.","published_date":"2026-04-14T06:47:47+00:00","viability_score":5,"cluster_label":"Image Captioning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A multi-agent system that uses structured reflection notes to improve the factuality and coverage of image captions generated by large vision-language models.","time_to_mvp":"1-3 months","tags":[]},{"arxiv_id":"2604.12352v1","title":"MultiDocFusion: Hierarchical and Multimodal Chunking Pipeline for Enhanced RAG on Long Industrial Documents","abstract":"RAG-based QA has emerged as a powerful method for processing long industrial documents. However, conventional text chunking approaches often neglect complex and long industrial document structures, causing information loss and reduced answer quality. To address this, we introduce MultiDocFusion, a multimodal chunking pipeline that integrates: (i) detection of document regions using vision-based document parsing, (ii) text extraction from these regions via OCR, (iii) reconstruction of document structure into a hierarchical tree using large language model (LLM)-based document section hierarchical parsing (DSHP-LLM), and (iv) construction of hierarchical chunks through DFS-based grouping. Extensive experiments across industrial benchmarks demonstrate that MultiDocFusion improves retrieval precision by 8-15% and ANLS QA scores by 2-3% compared to baselines, emphasizing the critical role of explicitly leveraging document hierarchy for multimodal document-based QA. 
These significant performance gains underscore the necessity of structure-aware chunking in enhancing the fidelity of RAG-based QA systems.","published_date":"2026-04-14T06:40:22+00:00","viability_score":7,"cluster_label":"RAG Enhancement","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multimodal chunking pipeline that leverages document structure and vision to significantly improve RAG performance on long industrial documents.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12350v1","title":"Scaffold-Conditioned Preference Triplets for Controllable Molecular Optimization with Large Language Models","abstract":"Molecular property optimization is central to drug discovery, yet many deep learning methods rely on black-box scoring and offer limited control over scaffold preservation, often producing unstable or biologically implausible edits. While large language models (LLMs) are promising molecular generators, optimization remains constrained by the lack of chemistry-grounded preference supervision and principled data curation. We introduce \\textbf{Scaffold-Conditioned Preference Triplets (SCPT)}, a pipeline that constructs similarity-constrained triplets $\\langle\\text{scaffold}, \\text{better}, \\text{worse}\\rangle$ via scaffold alignment and chemistry-driven filters for validity, synthesizability, and meaningful property gains. Using these preferences, we align a pretrained molecular LLM as a conditional editor, enabling property-improving edits that retain the scaffold. Across single- and multi-objective benchmarks, SCPT improves optimization success and property gains while maintaining higher scaffold similarity than competitive baselines. Compared with representative non-LLM molecular optimization methods, SCPT-trained LLMs are better suited to scaffold-constrained and multi-objective optimization. 
In addition, models trained on single-property and two-property supervision generalize effectively to three-property tasks, indicating promising extrapolative generalization under limited higher-order supervision. SCPT also provides controllable data-construction knobs that yield a predictable similarity-gain frontier, enabling systematic adaptation to diverse optimization regimes.","published_date":"2026-04-14T06:38:07+00:00","viability_score":7,"cluster_label":"Drug Discovery AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A pipeline for controllable molecular optimization using LLMs that preserves scaffold integrity and improves drug discovery efficiency.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12344v1","title":"FRTSearch: Unified Detection and Parameter Inference of Fast Radio Transients using Instance Segmentation","abstract":"The exponential growth of data from modern radio telescopes presents a significant challenge to traditional single-pulse search algorithms, which are computationally intensive and prone to high false-positive rates due to Radio Frequency Interference (RFI). In this work, we introduce FRTSearch, an end-to-end framework unifying the detection and physical characterization of Fast Radio Transients (FRTs). Leveraging the morphological universality of dispersive trajectories in time-frequency dynamic spectra, we reframe FRT detection as a pattern recognition problem governed by the cold plasma dispersion relation. To facilitate this, we constructed CRAFTS-FRT, a pixel-level annotated dataset derived from the Commensal Radio Astronomy FAST Survey (CRAFTS), comprising 2{,}392 instances across diverse source classes. This dataset enables the training of a Mask R-CNN model for precise trajectory segmentation. 
Coupled with our physics-driven IMPIC algorithm, the framework maps the geometric coordinates of segmented trajectories to directly infer the Dispersion Measure (DM) and Time of Arrival (ToA). Benchmarking on the FAST-FREX dataset shows that FRTSearch achieves a 98.0\\% recall, competitive with exhaustive search methods, while reducing false positives by over 99.9\\% compared to PRESTO and delivering a processing speedup of up to $13.9\\times$. Furthermore, the framework demonstrates robust cross-facility generalization, detecting all 19 tested FRBs from the ASKAP survey without retraining. By shifting the paradigm from ``search-then-identify'' to ``detect-and-infer,'' FRTSearch provides a scalable, high-precision solution for real-time discovery in the era of petabyte-scale radio astronomy.","published_date":"2026-04-14T06:31:08+00:00","viability_score":8,"cluster_label":"Astronomy AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An end-to-end AI framework for real-time detection and characterization of cosmic radio signals, significantly reducing false positives and increasing speed.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12336v1","title":"GeM-EA: A Generative and Meta-learning Enhanced Evolutionary Algorithm for Streaming Data-Driven Optimization","abstract":"Streaming Data-Driven Optimization (SDDO) problems arise in many applications where data arrive continuously and the optimization environment evolves over time. Concept drift produces non-stationary landscapes, making optimization methods challenging due to outdated models. Existing approaches often rely on simple surrogate combinations or directly injecting solutions, which may cause negative transfer under sudden environmental changes. 
We propose GeM-EA, a Generative and Meta-learning Enhanced Evolutionary Algorithm for SDDO that unifies meta-learned surrogate adaptation with generative replay for effective evolutionary search. Upon detecting concept drift, a bi-level meta-learning strategy rapidly initializes the surrogate using environment-relevant priors, while a linear residual component captures global trends. A multi-island evolutionary strategy further leverages historical knowledge via generative replay to accelerate optimization. Experimental results on benchmark SDDO problems demonstrate that GeM-EA achieves faster adaptation and improved robustness compared with state-of-the-art methods.","published_date":"2026-04-14T06:18:54+00:00","viability_score":7,"cluster_label":"Optimization AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An evolutionary algorithm enhanced with meta-learning and generative replay for robust and fast optimization of streaming data with concept drift.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12325v1","title":"Black-Box Optimization From Small Offline Datasets via Meta Learning with Synthetic Tasks","abstract":"We consider the problem of offline black-box optimization, where the goal is to discover optimal designs (e.g., molecules or materials) from past experimental data. A key challenge in this setting is data scarcity: in many scientific applications, only small or poor-quality datasets are available, which severely limits the effectiveness of existing algorithms. Prior work has theoretically and empirically shown that performance of offline optimization algorithms depends on how well the surrogate model captures the optimization bias (i.e., ability to rank input designs correctly), which is challenging to accomplish with limited experimental data. 
This paper proposes Surrogate Learning with Optimization Bias via Synthetic Task Generation (OptBias), a meta-learning framework that directly tackles data scarcity. OptBias learns a reusable optimization bias by training on synthetic tasks generated from a Gaussian process, and then fine-tunes the surrogate model on the small data for the target task. Across diverse continuous and discrete offline optimization benchmarks, OptBias consistently outperforms state-of-the-art baselines in small data regimes. These results highlight OptBias as a robust and practical solution for offline optimization in realistic small data settings.","published_date":"2026-04-14T06:00:30+00:00","viability_score":7,"cluster_label":"Optimization AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A meta-learning framework that generates synthetic tasks to enable black-box optimization from small, offline datasets.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12320v1","title":"EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports","abstract":"While video large language models (Video-LLMs) excel in understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities, yet lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. 
These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model reaching only 71.58%. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.","published_date":"2026-04-14T05:53:16+00:00","viability_score":7,"cluster_label":"Video LLMs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark and dataset for evaluating Video-LLMs in high-velocity esports environments, revealing significant performance gaps and guiding future development.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12311v1","title":"Is Vibe Coding the Future? An Empirical Assessment of LLM Generated Codes for Construction Safety","abstract":"The emergence of vibe coding, a paradigm where non-technical users instruct Large Language Models (LLMs) to generate executable codes via natural language, presents both significant opportunities and severe risks for the construction industry. 
While empowering construction personnel such as the safety managers, foremen, and workers to develop tools and software, the probabilistic nature of LLMs introduces the threat of silent failures, wherein generated code compiles perfectly but executes flawed mathematical safety logic. This study empirically evaluates the reliability, software architecture, and domain-specific safety fidelity of 450 vibe-coded Python scripts generated by three frontier models, Claude 3.5 Haiku, GPT-4o-Mini, and Gemini 2.5 Flash. Utilizing a persona-driven prompt dataset (n=150) and a bifurcated evaluation pipeline comprising isolated dynamic sandboxing and an LLM-as-a-Judge, the research quantifies the severe limits of zero-shot vibe codes for construction safety. The findings reveal a highly significant relationship between user persona and data hallucination, demonstrating that less formal prompts drastically increase the AI's propensity to invent missing safety variables. Furthermore, while the models demonstrated high foundational execution viability (~85%), this syntactic reliability actively masked logic deficits and a severe lack of defensive programming. Among successfully executed scripts, the study identified an alarming ~45% overall Silent Failure Rate, with GPT-4o-Mini generating mathematically inaccurate outputs in ~56% of its functional code. 
The results demonstrate that current LLMs lack the deterministic rigor required for standalone safety engineering, necessitating the adoption of deterministic AI wrappers and strict governance for cyber-physical deployments.","published_date":"2026-04-14T05:42:29+00:00","viability_score":6,"cluster_label":"LLM Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Empirical assessment of LLM-generated code for construction safety reveals significant silent failure rates, highlighting the need for deterministic wrappers.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.12306v1","title":"GCA Framework: A Gulf-Grounded Dataset and Agentic Pipeline for Climate Decision Support","abstract":"Climate decision-making in the Gulf increasingly demands systems that can translate heterogeneous scientific and policy evidence into actionable guidance, yet general-purpose large language models (LLMs) remain weak both in region-specific climate knowledge and grounded interaction with geospatial and forecasting tools. We present the GCA framework, which unifies (i) GCA-DS, a curated Gulf-focused multimodal dataset, and (ii) Gulf Climate Agent (GCA), a tool-augmented agent for climate analysis. GCA-DS comprises ~200k question-answer pairs spanning governmental policies and adaptation plans, NGO and international frameworks, academic literature, and event-driven reporting on heatwaves, dust storms, and floods, complemented with remote-sensing inputs that couple imagery with textual evidence. Building on this foundation, the GCA agent orchestrates a modular tool pipeline grounded in real-time and historical signals and geospatial processing that produces derived indices and interpretable visualizations. 
Finally, we benchmark open and proprietary LLMs on Gulf climate tasks and show that domain fine-tuning and tool integration substantially improve reliability over general-purpose baselines.","published_date":"2026-04-14T05:31:40+00:00","viability_score":8,"cluster_label":"Climate AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A Gulf-focused multimodal dataset and agentic pipeline for climate decision support, integrating geospatial tools and domain-specific knowledge to improve LLM reliability.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12301v1","title":"Local-Splitter: A Measurement Study of Seven Tactics for Reducing Cloud LLM Token Usage on Coding-Agent Workloads","abstract":"We present a systematic measurement study of seven tactics for reducing cloud LLM token usage when a small local model can act as a triage layer in front of a frontier cloud model. The tactics are: (1) local routing, (2) prompt compression, (3) semantic caching, (4) local drafting with cloud review, (5) minimal-diff edits, (6) structured intent extraction, and (7) batching with vendor prompt caching. We implement all seven in an open-source shim that speaks both MCP and the OpenAI-compatible HTTP surface, supporting any local model via Ollama and any cloud model via an OpenAI-compatible endpoint. We evaluate each tactic individually, in pairs, and in a greedy-additive subset across four coding-agent workload classes (edit-heavy, explanation-heavy, general chat, RAG-heavy). We measure tokens saved, dollar cost, latency, and routing accuracy. Our headline finding is that T1 (local routing) combined with T2 (prompt compression) achieves 45-79% cloud token savings on edit-heavy and explanation-heavy workloads, while on RAG-heavy workloads the full tactic set including T4 (draft-review) achieves 51% savings. 
We observe that the optimal tactic subset is workload-dependent, which we believe is the most actionable finding for practitioners deploying coding agents today.","published_date":"2026-04-14T05:19:33+00:00","viability_score":5,"cluster_label":"LLM Cost Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A measurement study of seven tactics to reduce cloud LLM token usage for coding agents, offering workload-specific strategies for cost savings.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2604.12290v1","title":"Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization","abstract":"Current LLM agent benchmarks, which predominantly focus on binary pass/fail tasks such as code generation or search-based question answering, often neglect the value of real-world engineering that is often captured through the iterative optimization of feasible designs. To this end, we introduce Frontier-Eng, a human-verified benchmark for generative optimization -- an iterative propose-execute-evaluate loop in which an agent generates candidate artifacts, receives executable verifier feedback, and revises them under a fixed interaction budget -- spanning $47$ tasks across five broad engineering categories. Unlike previous suites, Frontier-Eng tasks are grounded in industrial-grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints under constrained budgets. We evaluate eight frontier language models using representative search frameworks, finding that while Claude 4.6 Opus achieves the most robust performance, the benchmark remains challenging for all models. Our analysis suggests a dual power-law decay in improvement frequency ($\\sim$ 1/iteration) and magnitude ($\\sim$ 1/improvement count). We further show that although width improves parallelism and diversity, depth remains crucial for hard-won improvements under a fixed budget. 
Frontier-Eng establishes a new standard for assessing the capacity of AI agents to integrate domain knowledge with executable feedback to solve complex, open-ended engineering problems.","published_date":"2026-04-14T05:02:06+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark for self-evolving AI agents that iteratively optimize engineering designs using simulators and verifiers, pushing the boundaries of real-world problem-solving.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12285v1","title":"GAM: Hierarchical Graph-based Agentic Memory for LLM Agents","abstract":"To sustain coherent long-term interactions, Large Language Model (LLM) agents must navigate the tension between acquiring new information and retaining prior knowledge. Current unified stream-based memory systems facilitate context updates but remain vulnerable to interference from transient noise. Conversely, discrete structured memory architectures provide robust knowledge retention but often struggle to adapt to evolving narratives. To address this, we propose GAM, a hierarchical Graph-based Agentic Memory framework that explicitly decouples memory encoding from consolidation to effectively resolve the conflict between rapid context perception and stable knowledge retention. By isolating ongoing dialogue in an event progression graph and integrating it into a topic associative network only upon semantic shifts, our approach minimizes interference while preserving long-term consistency. Additionally, we introduce a graph-guided, multi-factor retrieval strategy to enhance context precision. 
Experiments on LoCoMo and LongDialQA indicate that our method consistently outperforms state-of-the-art baselines in both reasoning accuracy and efficiency.","published_date":"2026-04-14T04:53:00+00:00","viability_score":3,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A hierarchical graph-based memory framework for LLM agents that decouples memory encoding and consolidation to improve long-term coherence and reduce noise interference.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12281v1","title":"MAST: Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer","abstract":"Style transfer aims to render a content image with the visual characteristics of a reference style while preserving its underlying semantic layout and structural geometry. While recent diffusion-based models demonstrate strong stylization capabilities by leveraging powerful generative priors and controllable internal representations, they typically assume a single global style. Extending them to multi-style scenarios often leads to boundary artifacts, unstable stylization, and structural inconsistency due to interference between multiple style representations. To overcome these limitations, we propose MAST (Mask-Guided Attention Mass Allocation for Training-Free Multi-Style Transfer), a novel training-free framework that explicitly controls content-style interactions within the diffusion attention mechanism. To achieve artifact-free and structure-preserving stylization, MAST integrates four connected modules. First, Layout-preserving Query Anchoring prevents global layout collapse by firmly anchoring the semantic structure using content queries. Second, Logit-level Attention Mass Allocation deterministically distributes attention probability mass across spatial regions, seamlessly fusing multiple styles without boundary artifacts. 
Third, Sharpness-aware Temperature Scaling restores the attention sharpness degraded by multi-style expansion. Finally, Discrepancy-aware Detail Injection adaptively compensates for localized high-frequency detail losses by measuring structural discrepancies. Extensive experiments demonstrate that MAST effectively mitigates boundary artifacts and maintains structural consistency, preserving texture fidelity and spatial coherence even as the number of applied styles increases.","published_date":"2026-04-14T04:47:09+00:00","viability_score":7,"cluster_label":"Generative Image","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A training-free framework for multi-style image transfer that uses mask-guided attention to seamlessly blend multiple styles without artifacts or structural inconsistencies.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12262v1","title":"CascadeDebate: Multi-Agent Deliberation for Cost-Aware LLM Cascades","abstract":"Cascaded LLM systems coordinate models of varying sizes with human experts to balance accuracy, cost, and abstention under uncertainty. However, single-model tiers at each stage often struggle with ambiguous queries, triggering premature escalations to costlier models or experts due to under-confidence and inefficient compute scaling. CascadeDebate addresses this gap by inserting multi-agent deliberation directly at each tier's escalation boundary. Confidence-based routers activate lightweight agent ensembles only for uncertain cases, enabling consensus-driven resolution of ambiguities internally without invoking higher-cost upgrades. Our unified architecture alternates single-model inference with selective multi-agent deliberation across model scales, culminating in human experts as the final fallback. This design scales test-time compute dynamically according to query difficulty. 
Across five benchmarks spanning science, medicine, and general knowledge, CascadeDebate outperforms strong single-model cascades and standalone multi-agent systems by up to 26.75 percent. An online threshold optimizer proves essential, boosting accuracy by 20.98 to 52.33 percent relative improvement over fixed policies and enabling elastic adaptation to real-world distributions.","published_date":"2026-04-14T04:26:39+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multi-agent deliberation system for LLM cascades that uses consensus-driven ensembles to resolve ambiguities internally, reducing costs and improving accuracy.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12258v1","title":"Coding-Free and Privacy-Preserving MCP Framework for Clinical Agentic Research Intelligence System","abstract":"Clinical research involves labor-intensive processes such as study design, cohort construction, model development, and documentation, requiring domain expertise, programming skills, and access to sensitive patient data. These demands create barriers for clinicians and external researchers conducting data-driven studies. To overcome these limitations, we developed a Clinical Agentic Research Intelligence System (CARIS) that automates the clinical research workflow while preserving data privacy, enabling comprehensive studies without direct access to raw data. CARIS integrates Large Language Models (LLMs) with modular tools via the Model Context Protocol (MCP), enabling natural language-driven orchestration of appropriate tools. Databases remain securely within the MCP server, and users access only the outputs and final research reports. 
Based on user intent, CARIS automatically executes the full pipeline: research planning, literature search, cohort construction, Institutional Review Board (IRB) documentation, Vibe Machine Learning (ML), and report generation, with iterative human-in-the-loop refinement. We evaluated CARIS on three heterogeneous datasets with distinct clinical tasks. Research plans and IRB documents were finalized within three to four iterations, using evidence from literature and data. The system supported Vibe ML by exploring feature-model combinations, ranking the top ten models, and generating performance visualizations. Final reports showed high completeness based on a checklist derived from the TRIPOD+AI framework, achieving 96% coverage in LLM evaluation and 82% in human evaluation. CARIS demonstrates that agentic AI can transform clinical hypotheses into executable research workflows across heterogeneous datasets. By eliminating the need for coding and direct data access, the system lowers barriers and bridges public and private clinical data environments.","published_date":"2026-04-14T04:22:44+00:00","viability_score":7,"cluster_label":"Clinical AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An AI system that automates clinical research workflows, from planning to report generation, without requiring coding or direct patient data access.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12255v1","title":"ARGen: Affect-Reinforced Generative Augmentation towards Vision-based Dynamic Emotion Perception","abstract":"Dynamic facial expression recognition in the wild remains challenging due to data scarcity and long-tail distributions, which hinder models from effectively learning the temporal dynamics of scarce emotions. 
To address these limitations, we propose ARGen, an Affect-Reinforced Generative Augmentation Framework that enables data-adaptive dynamic expression generation for robust emotion perception. ARGen operates in two stages: Affective Semantic Injection (ASI) and Adaptive Reinforcement Diffusion (ARD). The ASI stage establishes affective knowledge alignment through facial Action Units and employs a retrieval-augmented prompt generation strategy to synthesize consistent and fine-grained affective descriptions via large-scale visual-language models, thereby injecting interpretable emotional priors into the generation process. The ARD stage integrates text-conditioned image-to-video diffusion with reinforcement learning, introducing inter-frame conditional guidance and a multi-objective reward function to jointly optimize expression naturalness, facial integrity, and generative efficiency. Extensive experiments on both generation and recognition tasks verify that ARGen substantially enhances synthesis fidelity and improves recognition performance, establishing an interpretable and generalizable generative augmentation paradigm for vision-based affective computing.","published_date":"2026-04-14T04:05:07+00:00","viability_score":4,"cluster_label":"Generative AI for Vision","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework that generates realistic facial expressions for training AI models to better perceive emotions from video.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12254v1","title":"SpanKey: Dynamic Key Space Conditioning for Neural Network Access Control","abstract":"SpanKey is a lightweight way to gate inference without encrypting weights or chasing leaderboard accuracy on gated inference. The idea is to condition activations on secret keys. 
A basis matrix $B$ defines a low-dimensional key subspace $Span(B)$; during training we sample coefficients $\u03b1$ and form keys $k=\u03b1^\\top B$, then inject them into intermediate activations with additive or multiplicative maps and strength $\u03b3$. Valid keys lie in $Span(B)$; invalid keys are sampled outside that subspace. We make three points. (i) Mechanism: subspace key injection and a multi-layer design space. (ii) Failure mode: key absorption, together with two analytical results (a Beta-energy split and margin-tail diagnostics), explains weak baseline separation in energy and margin terms -- these are not a security theorem. (iii) Deny losses and experiments: Modes A--C and extensions, with CIFAR-10 ResNet-18 runs and MNIST ablations for Mode B. We summarize setup and first-order analysis, injectors, absorption, deny losses and ablations, a threat discussion that does not promise cryptography, and closing remarks on scale. Code: https://github.com/mindmemory-ai/dksc","published_date":"2026-04-14T04:01:34+00:00","viability_score":6,"cluster_label":"AI Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A lightweight method to control AI model access by conditioning activations on secret keys, without encrypting model weights.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.12253v1","title":"A Scoping Review of Large Language Model-Based Pedagogical Agents","abstract":"This scoping review examines the emerging field of Large Language Model (LLM)-based pedagogical agents in educational settings. While traditional pedagogical agents have been extensively studied, the integration of LLMs represents a transformative advancement with unprecedented capabilities in natural language understanding, reasoning, and adaptation. Following PRISMA-ScR guidelines, we analyzed 52 studies across five major databases from November 2022 to January 2025. 
Our findings reveal diverse LLM-based agents spanning K-12, higher education, and informal learning contexts across multiple subject domains. We identified four key design dimensions characterizing these agents: interaction approach (reactive vs. proactive), domain scope (domain-specific vs. general-purpose), role complexity (single-role vs. multi-role), and system integration (standalone vs. integrated). Emerging trends include multi-agent systems that simulate naturalistic learning environments, virtual student simulation for agent evaluation, integration with immersive technologies, and combinations with learning analytics. We also discuss significant research gaps and ethical considerations regarding privacy, accuracy, and student autonomy. This review provides researchers and practitioners with a comprehensive understanding of LLM-based pedagogical agents while identifying crucial areas for future development in this rapidly evolving field.","published_date":"2026-04-14T03:58:11+00:00","viability_score":0,"cluster_label":"AI in Education","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A review of how Large Language Models are being used to create AI agents for teaching and learning across different educational contexts.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12250v1","title":"How memory can affect collective and cooperative behaviors in an LLM-Based Social Particle Swarm","abstract":"This study examines how model-specific characteristics of Large Language Model (LLM) agents, including internal alignment, shape the effect of memory on their collective and cooperative dynamics in a multi-agent system. To this end, we extend the Social Particle Swarm (SPS) model, in which agents move in a two-dimensional space and play the Prisoner's Dilemma with neighboring agents, by replacing its rule-based agents with LLM agents endowed with Big Five personality scores and varying memory lengths. 
Using Gemini-2.0-Flash, we find that memory length is a critical parameter governing collective behavior: even a minimal memory drastically suppressed cooperation, transitioning the system from stable cooperative clusters through cyclical formation and collapse of clusters to a state of scattered defection as memory length increased. Big Five personality traits correlated with agent behaviors in partial agreement with findings from experiments with human participants, supporting the validity of the model. Comparative experiments using Gemma 3:4b revealed the opposite trend: longer memory promoted cooperation, accompanied by the formation of dense cooperative clusters. Sentiment analysis of agents' reasoning texts showed that Gemini interprets memory increasingly negatively as its length grows, while Gemma interprets it less negatively, and that this difference persists in the early phase of experiments before the macro-level dynamics converge. These results suggest that model-specific characteristics of LLMs, potentially including alignment, play a fundamental role in determining emergent social behavior in Generative Agent-Based Modeling, and provide a micro-level cognitive account of the contradictions found in prior work on memory and cooperation.","published_date":"2026-04-14T03:54:49+00:00","viability_score":3,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigating how memory length impacts cooperative behavior in LLM agents within a social particle swarm model.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12247v1","title":"SpecBound: Adaptive Bounded Self-Speculation with Layer-wise Confidence Calibration","abstract":"Speculative decoding has emerged as a promising approach to accelerate autoregressive inference in large language models (LLMs). 
Self-draft methods, which leverage the base LLM itself for speculation, avoid the overhead of auxiliary draft models but face limitations: shallow layers often produce overconfident yet incorrect token predictions, and the presence of difficult tokens in a draft sequence forces redundant computation through deeper layers, undermining both draft acceptance and overall speedup. To address these issues, we propose a novel self-draft framework that suppresses spurious confidence via layer-wise temperature annealing in early-exit decision and adaptively bounds speculation length based on token-wise decoding difficulty. By reprocessing the hidden states of draft tokens in a unified parallel pass through deep layers, our method maintains exact output equivalence with the original model while maximizing computational efficiency. It requires no modifications to the base LLM parameters and achieves up to 2.33x wall-time speedup over standard autoregressive decoding across diverse long-form generation tasks and multiple model architectures.","published_date":"2026-04-14T03:47:04+00:00","viability_score":6,"cluster_label":"LLM Inference","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Accelerating LLM inference with adaptive bounded self-speculation and layer-wise confidence calibration.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2604.12245v1","title":"Socrates Loss: Unifying Confidence Calibration and Classification by Leveraging the Unknown","abstract":"Deep neural networks, despite their high accuracy, often exhibit poor confidence calibration, limiting their reliability in high-stakes applications. Current ad-hoc confidence calibration methods attempt to fix this during training but face a fundamental trade-off: two-phase training methods achieve strong classification performance at the cost of training instability and poorer confidence calibration, while single-loss methods are stable but underperform in classification. 
This paper addresses and mitigates this stability-performance trade-off. We propose Socrates Loss, a novel, unified loss function that explicitly leverages uncertainty by incorporating an auxiliary unknown class, whose predictions directly influence the loss function and a dynamic uncertainty penalty. This unified objective allows the model to be optimized for both classification and confidence calibration simultaneously, without the instability of complex, scheduled losses. We provide theoretical guarantees that our method regularizes the model to prevent miscalibration and overfitting. Across four benchmark datasets and multiple architectures, our comprehensive experiments demonstrate that Socrates Loss consistently improves training stability while achieving more favorable accuracy-calibration trade-off, often converging faster than existing methods.","published_date":"2026-04-14T03:43:15+00:00","viability_score":7,"cluster_label":"Model Calibration","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified loss function that improves classification accuracy and confidence calibration by leveraging uncertainty.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12243v1","title":"Continuous Knowledge Metabolism: Generating Scientific Hypotheses from Evolving Literature","abstract":"Scientific hypothesis generation requires tracking how knowledge evolves, not just what is currently known. We introduce Continuous Knowledge Metabolism (CKM), a framework that processes scientific literature through sliding time windows and incrementally updates a structured knowledge base as new findings arrive. We present CKM-Lite, an efficient variant that achieves strong predictive coverage through incremental accumulation, outperforming batch processing on hit rate (+2.8%, p=0.006), hypothesis yield (+3.6, p<0.001), and best-match alignment (+0.43, p<0.001) while reducing token cost by 92%. 
To understand what drives these differences, we develop CKM-Full, an instrumented variant that categorizes each new finding as novel, confirming, or contradicting, detects knowledge change signals, and conditions hypothesis generation on the full evolution trajectory. Analyzing 892 hypotheses generated by CKM-Full across 50 research topics, alongside parallel runs of the other variants, we report four empirical observations: (1) incremental processing outperforms batch baseline across predictive and efficiency metrics; (2) change-aware instrumentation is associated with higher LLM-judged novelty (Cohen's d=3.46) but lower predictive coverage, revealing a quality-coverage trade-off; (3) a field's trajectory stability is associated with hypothesis success (r=-0.28, p=0.051), suggesting boundary conditions for literature-based prediction; (4) knowledge convergence signals are associated with nearly 5x higher hit rate than contradiction signals, pointing to differential predictability across change types. These findings suggest that the character of generated hypotheses is shaped not only by how much literature is processed, but also by how it is processed. They further indicate that evaluation frameworks must account for the quality-coverage trade-off rather than optimize for a single metric.","published_date":"2026-04-14T03:41:53+00:00","viability_score":3,"cluster_label":"Scientific Discovery","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for continuously updating a knowledge base and generating scientific hypotheses from evolving literature.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12237v1","title":"MolMem: Memory-Augmented Agentic Reinforcement Learning for Sample-Efficient Molecular Optimization","abstract":"In drug discovery, molecular optimization aims to iteratively refine a lead compound to improve molecular properties while preserving structural similarity to the original molecule. 
However, each oracle evaluation is expensive, making sample efficiency a key challenge for existing methods under a limited oracle budget. Trial-and-error approaches require many oracle calls, while methods that leverage external knowledge tend to reuse familiar templates and struggle on challenging objectives. A key missing piece is long-term memory that can ground decisions and provide reusable insights for future optimizations. To address this, we present MolMem (Molecular optimization with Memory), a multi-turn agentic reinforcement learning (RL) framework with a dual-memory system. Specifically, MolMem uses Static Exemplar Memory to retrieve relevant exemplars for cold-start grounding, and Evolving Skill Memory to distill successful trajectories into reusable strategies. Built on this memory-augmented formulation, we train the policy with dense step-wise rewards, turning costly rollouts into long-term knowledge that improves future optimization. Extensive experiments show that MolMem achieves 90% success on single-property tasks (1.5$\\times$ over the best baseline) and 52% on multi-property tasks using only 500 oracle calls. Our code is available at https://github.com/REAL-Lab-NU/MolMem.","published_date":"2026-04-14T03:24:26+00:00","viability_score":8,"cluster_label":"Drug Discovery AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A memory-augmented reinforcement learning agent for sample-efficient molecular optimization in drug discovery.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12232v1","title":"TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs","abstract":"Large Language Models (LLMs) are increasingly deployed across diverse domains, yet their vulnerability to jailbreak attacks, where adversarial inputs bypass safety mechanisms to elicit harmful outputs, poses significant security risks. 
While prior work has primarily focused on prompt injection attacks, these approaches often require resource-intensive prompt engineering and overlook other critical components, such as chat templates. This paper introduces TEMPLATEFUZZ, a fine-grained fuzzing framework that systematically exposes vulnerabilities in chat templates, a critical yet underexplored attack surface in LLMs. Specifically, TEMPLATEFUZZ (1) designs a series of element-level mutation rules to generate diverse chat template variants, (2) proposes a heuristic search strategy to guide the chat template generation toward the direction of amplifying the attack success rate (ASR) while preserving model accuracy, and (3) integrates an active learning-based strategy to derive a lightweight rule-based oracle for accurate and efficient jailbreak evaluation. Evaluated on twelve open-source LLMs across multiple attack scenarios, TEMPLATEFUZZ achieves an average ASR of 98.2% with only 1.1% accuracy degradation, outperforming state-of-the-art methods by 9.1%-47.9% in ASR and 8.4% in accuracy degradation. Moreover, even on five industry-leading commercial LLMs where chat templates cannot be specified, TEMPLATEFUZZ attains a 90% average ASR via chat template-based prompt injection attacks.","published_date":"2026-04-14T03:12:19+00:00","viability_score":7,"cluster_label":"LLM Security","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A fuzzing framework to find vulnerabilities in LLM chat templates for jailbreaking and red teaming.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2604.12229v1","title":"HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models","abstract":"Small language models (SLMs) often struggle with complex mathematical reasoning due to limited capacity to maintain long chains of intermediate steps and to recover from early errors. 
We address this challenge by introducing a hint-assisted reasoning framework that incrementally guides SLMs through multi-step mathematical problem solving. Our approach decomposes solutions into sequential reasoning steps and provides context-aware hints, where hints are generated by a separate SLM trained via distillation from a strong large language model. While the hint-generating SLM alone is not capable of solving the problems, its collaboration with a reasoning SLM enables effective guidance, forming a cooperative two-model system for reasoning. Each hint is generated conditionally on the problem statement and the accumulated reasoning history, providing stepwise, localized guidance without revealing full solutions. This reduces error propagation and allows the reasoning model to focus on manageable subproblems. Experiments across diverse mathematical benchmarks and models demonstrate that hint assistance consistently improves reasoning accuracy for SLMs, yielding substantial gains over standard prompting while preserving model efficiency. These results highlight that structured collaboration between SLMs, via hint generation and reasoning, offers an effective and lightweight mechanism for enhancing mathematical reasoning.","published_date":"2026-04-14T03:09:26+00:00","viability_score":8,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hint-assisted reasoning framework that uses cooperative small language models to improve mathematical problem-solving.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12227v1","title":"Designing Reliable LLM-Assisted Rubric Scoring for Constructed Responses: Evidence from Physics Exams","abstract":"Student responses in STEM assessments are often handwritten and combine symbolic expressions, calculations, and diagrams, creating substantial variation in format and interpretation. 
Despite their importance for evaluating students' reasoning, such responses are time-consuming to score and prone to rater inconsistency, particularly when partial credit is required. Recent advances in large language models (LLMs) have increased attention to AI-assisted scoring, yet evidence remains limited regarding how rubric design and LLM configurations influence reliability across performance levels. This study examined the reliability of AI-assisted scoring of undergraduate physics constructed responses using GPT-4o. Twenty authentic handwritten exam responses were scored across two rounds by four instructors and by the AI model using skill-based rubrics with differing levels of analytic granularity. Prompting format and temperature settings were systematically varied. Overall, human-AI agreement on total scores was comparable to human inter-rater reliability and was highest for high- and low-performing responses, but declined for mid-level responses involving partial or ambiguous reasoning. Criterion-level analyses showed stronger alignment for clearly defined conceptual skills than for extended procedural judgments. A more fine-grained, checklist-based rubric improved consistency relative to holistic scoring. These findings indicate that reliable AI-assisted scoring depends primarily on clear, well-structured rubrics, while prompting format plays a secondary role and temperature has relatively limited impact. 
More broadly, the study provides transferable design recommendations for implementing reliable LLM-assisted scoring in STEM contexts through skill-based rubrics and controlled LLM settings.","published_date":"2026-04-14T03:04:44+00:00","viability_score":5,"cluster_label":"AI Education","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Designing reliable LLM-assisted rubric scoring for constructed responses in STEM exams, focusing on rubric design.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2604.12223v1","title":"LLM-Guided Semantic Bootstrapping for Interpretable Text Classification with Tsetlin Machines","abstract":"Pretrained language models (PLMs) like BERT provide strong semantic representations but are costly and opaque, while symbolic models such as the Tsetlin Machine (TM) offer transparency but lack semantic generalization. We propose a semantic bootstrapping framework that transfers LLM knowledge into symbolic form, combining interpretability with semantic capacity. Given a class label, an LLM generates sub-intents that guide synthetic data creation through a three-stage curriculum (seed, core, enriched), expanding semantic diversity. A Non-Negated TM (NTM) learns from these examples to extract high-confidence literals as interpretable semantic cues. Injecting these cues into real data enables a TM to align clause logic with LLM-inferred semantics. Our method requires no embeddings or runtime LLM calls, yet equips symbolic models with pretrained semantic priors. 
Across multiple text classification tasks, it improves interpretability and accuracy over vanilla TM, achieving performance comparable to BERT while remaining fully symbolic and efficient.","published_date":"2026-04-14T03:02:25+00:00","viability_score":3,"cluster_label":"Interpretable Text Classification","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework that transfers LLM knowledge into symbolic Tsetlin Machines for interpretable and accurate text classification without runtime LLM calls.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12219v1","title":"Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation","abstract":"Video Diffusion Transformers have revolutionized high-fidelity video generation but suffer from the massive computational burden of self-attention. While sparse attention provides a promising acceleration solution, existing methods frequently provoke severe visual flickering caused by static sparsity patterns and deterministic block routing. To resolve these limitations, we propose Precision-Allocated Sparse Attention (PASA), a training-free framework designed for highly efficient and temporally smooth video generation. First, we implement a curvature-aware dynamic budgeting mechanism. By profiling the generation trajectory acceleration across timesteps, we elastically allocate the exact-computation budget to secure high-precision processing strictly during critical semantic transitions. Second, we replace global homogenizing estimations with hardware-aligned grouped approximations, successfully capturing fine-grained local variations while maintaining peak compute throughput. Finally, we incorporate a stochastic selection bias into the attention routing mechanism. This probabilistic approach softens rigid selection boundaries and eliminates selection oscillation, effectively eradicating the localized computational starvation that drives temporal flickering. 
Extensive evaluations on leading video diffusion models demonstrate that PASA achieves substantial inference acceleration while consistently producing remarkably fluid and structurally stable video sequences.","published_date":"2026-04-14T02:51:52+00:00","viability_score":7,"cluster_label":"Video Generation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A training-free framework for efficient and temporally smooth video generation by dynamically allocating computation budget and using stochastic attention routing.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12213v1","title":"Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension","abstract":"Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize.   We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal-CS, a controlled 50-task benchmark with the same LLM backend, same tasks, and only the routing path varying, MMA2A achieves 52% task completion accuracy versus 32% for the text-bottleneck baseline (95% bootstrap CI on $\u0394$TCA: [8, 32] pp; McNemar's exact $p = 0.006$). Gains concentrate on vision-dependent tasks: product defect reports improve by +38.5 pp and visual troubleshooting by +16.7 pp. 
This accuracy gain comes at a $1.8\\times$ latency cost from native multimodal processing. These results suggest that routing is a first-order design variable in multi-agent systems, as it determines the information available for downstream reasoning.","published_date":"2026-04-14T02:44:50+00:00","viability_score":7,"cluster_label":"Multi-Agent Systems","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An extension to Agent-to-Agent networks that enables modality-native routing for richer multimodal context, improving task accuracy in vision-dependent scenarios.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12210v1","title":"Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering","abstract":"Simulating Standardized Patients with cognitive impairment offers a scalable and ethical solution for clinical training. However, existing methods rely on discrete prompt engineering and fail to capture the heterogeneity of deficits across varying domains and severity levels. To address this limitation, we propose StsPatient for the fine-grained simulation of cognitively impaired patients. We innovatively capture domain-specific features by extracting steering vectors from contrastive pairs of instructions and responses. Furthermore, we introduce a Stochastic Token Modulation (STM) mechanism to regulate the intervention probability. STM enables precise control over impairment severity while mitigating the instability of conventional vector methods. 
Comprehensive experiments demonstrate that StsPatient significantly outperforms baselines in both clinical authenticity and severity controllability.","published_date":"2026-04-14T02:37:46+00:00","viability_score":7,"cluster_label":"Clinical Simulation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A system for fine-grained simulation of cognitively impaired standardized patients using steering vectors and stochastic token modulation for precise severity control.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12208v1","title":"Unveiling the Surprising Efficacy of Navigation Understanding in End-to-End Autonomous Driving","abstract":"Global navigation information and local scene understanding are two crucial components of autonomous driving systems. However, our experimental results indicate that many end-to-end autonomous driving systems tend to over-rely on local scene understanding while failing to utilize global navigation information. These systems exhibit weak correlation between their planning capabilities and navigation input, and struggle to perform navigation-following in complex scenarios. To overcome this limitation, we propose the Sequential Navigation Guidance (SNG) framework, an efficient representation of global navigation information based on real-world navigation patterns. The SNG encompasses both navigation paths for constraining long-term trajectories and turn-by-turn (TBT) information for real-time decision-making logic. We constructed the SNG-QA dataset, a visual question answering (VQA) dataset based on SNG that aligns global and local planning. Additionally, we introduce an efficient model SNG-VLA that fuses local planning with global planning. The SNG-VLA achieves state-of-the-art performance through precise navigation information modeling without requiring auxiliary loss functions from perception tasks. 
Project page: SNG-VLA","published_date":"2026-04-14T02:34:44+00:00","viability_score":7,"cluster_label":"Autonomous Driving","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new framework and dataset for autonomous driving that significantly improves navigation understanding by fusing global and local planning, achieving state-of-the-art results without auxiliary losses.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12202v1","title":"Latent patterns of urban mixing in mobility analysis across five global cities","abstract":"This study leverages large-scale travel surveys for over 200,000 residents across Boston, Chicago, Hong Kong, London, and Sao Paulo. With rich individual-level data, we make systematic comparisons and reveal patterns in social mixing, which cannot be identified by analyzing high-resolution mobility data alone. Using the same set of data, inferring socioeconomic status from residential neighborhoods yields social mixing levels 16% lower than using self-reported survey data. Besides, individuals over the age of 66 experience greater social mixing than those in late working life (aged 55 to 65), lending data-driven support to the \"second youth\" hypothesis. Teenagers and women with caregiving responsibilities exhibit lower social mixing levels. Across the five cities, proximity to major transit stations reduces the influence of individual socioeconomic status on social mixing. Finally, we construct detailed spatio-temporal place networks for each city using a graph neural network. Inputs of home-space, activity-space and demographic attributes are embedded and fed into a supervised autoencoder to predict individual exposure vectors. 
Results show that the structure of individual activity space, i.e., where people travel to, explains most of the variations in place exposure, suggesting that mobility shapes experienced social mixing more than sociodemographic characteristics, home environment, and transit proximity. The ablation tests further discover that, while different income groups may experience similar levels of social mixing, their activity spaces remain stratified by income, resulting in structurally different social mixing experiences.","published_date":"2026-04-14T02:10:53+00:00","viability_score":5,"cluster_label":"Urban Mobility Analysis","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Leveraging large-scale travel surveys and graph neural networks, this research uncovers nuanced patterns of urban social mixing and place exposure across five global cities, revealing mobility's greater influence than sociodemographics.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12198v1","title":"Towards grounded autonomous research: an end-to-end LLM mini research loop on published computational physics","abstract":"Recent autonomous LLM agents have demonstrated end-to-end automation of machine-learning research. Real-world physical science is intrinsically harder, requiring deep reasoning bounded by physical truth and, because real systems are too complex to study in isolation, almost always built on existing literature. We focus on the smallest meaningful unit of such research, a mini research loop in which an agent reads a paper, reproduces it, critiques it, and extends it. We test this loop in two complementary regimes: scale and depth. At scale, across 111 open-access computational physics papers, an agent autonomously runs the read-plan-compute-compare loop and, without being asked to critique, raises substantive concerns on ~42% of papers - 97.7% of which require execution to surface. 
In depth, for one Nature Communications paper on multiscale simulation of a 2D-material MOSFET, the agent runs new calculations missing from the original and produces, unsupervised, a publishable Comment -- composed, figured, typeset, and PDF-iterated -- that revises the paper's headline conclusion.","published_date":"2026-04-14T02:06:59+00:00","viability_score":7,"cluster_label":"AI Research Automation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An end-to-end LLM research loop autonomously reproduces, critiques, and extends computational physics papers, demonstrating significant potential for accelerating scientific discovery.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12191v1","title":"Beyond Scores: Diagnostic LLM Evaluation via Fine-Grained Abilities","abstract":"Current evaluations of large language models aggregate performance across diverse tasks into single scores. This obscures fine-grained ability variation, limiting targeted model improvement and ability-guided selection for specific tasks. Motivated by this gap, we propose a cognitive diagnostic framework that estimates model abilities across multiple fine-grained dimensions. For mathematics, we construct a 35-dimensional ability taxonomy grounded in cognitive theory and domain knowledge. The framework employs multidimensional Item Response Theory with an item-ability association matrix to estimate fine-grained ability levels, which in turn enable prediction of performance on unseen items (questions of benchmark). Evaluated on 41 models, our approach demonstrates strong criterion validity, consistent ability estimates across benchmarks, and accurate prediction of unseen items with AUC ranging from 0.80 to 0.89 within benchmarks and from 0.77 to 0.86 across benchmarks, substantially exceeding trivial baselines. 
The framework generalizes across scientific domains, producing consistent diagnostic performance in physics (27 dimensions), chemistry (58 dimensions), and computer science (12 dimensions). This work establishes a principled framework for fine-grained assessment of abilities, with potential applications in targeted training, ability-guided model selection, and ability-aware benchmark design.","published_date":"2026-04-14T01:48:22+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A cognitive diagnostic framework for LLMs that moves beyond single scores to provide fine-grained ability assessments across multiple scientific domains, enabling targeted improvement and selection.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12190v1","title":"Characterizing Resource Sharing Practices on Underground Internet Forum Synthetic Non-Consensual Intimate Image Content Creation Communities","abstract":"Many malicious actors responsible for disseminating synthetic non-consensual intimate imagery (SNCII) operate within internet forums to exchange resources, strategies, and generated content across multiple platforms. Technically-sophisticated actors gravitate toward certain communities (e.g., 4chan), while lower-sophistication end-users are more active on others (e.g., Reddit). To characterize key stakeholders in the broader ecosystem, we perform an integrated analysis of multiple communities, analyzing 282,154 4chan comments and 78,308 Reddit submissions spanning 165 days between June and November 2025 to characterize involved actors, actions, and resources. 
We find: (a) that users with differing levels of technical sophistication employ and share a wide range of primary resources facilitating SNCII content creation as well as numerous secondary resources facilitating dissemination; and (b) that knowledge transfer between experts and newcomers facilitates propagation of these illicit resources. Based on our empirical analysis, we identify gaps in existing SNCII regulatory infrastructure and synthesize several critical intervention points for bolstering deterrence.","published_date":"2026-04-14T01:45:22+00:00","viability_score":1,"cluster_label":"AI Ethics & Safety","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper analyzes resource sharing practices in underground forums for synthetic non-consensual intimate imagery creation and dissemination to identify intervention points for deterrence.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12184v1","title":"TRUST Agents: A Collaborative Multi-Agent Framework for Fake News Detection, Explainable Verification, and Logic-Aware Claim Reasoning","abstract":"TRUST Agents is a collaborative multi-agent framework for explainable fact verification and fake news detection. Rather than treating verification as a simple true-or-false classification task, the system identifies verifiable claims, retrieves relevant evidence, compares claims against that evidence, reasons under uncertainty, and generates explanations that humans can inspect. The baseline pipeline consists of four specialized agents. A claim extractor uses named entity recognition, dependency parsing, and LLM-based extraction to identify factual claims. A retrieval agent performs hybrid sparse and dense search using BM25 and FAISS. A verifier agent compares claims with retrieved evidence and produces verdicts with calibrated confidence. An explainer agent then generates a human-readable report with explicit evidence citations. 
To handle complex claims more effectively, we introduce a research-oriented extension with three additional components: a decomposer agent inspired by LoCal-style claim decomposition, a Delphi-inspired multi-agent jury with specialized verifier personas, and a logic aggregator that combines atomic verdicts using conjunction, disjunction, negation, and implication. We evaluate both pipelines on the LIAR benchmark against fine-tuned BERT, fine-tuned RoBERTa, and a zero-shot LLM baseline. Although supervised encoders remain stronger on raw metrics, TRUST Agents improves interpretability, evidence transparency, and reasoning over compound claims. Results also show that retrieval quality and uncertainty calibration remain the main bottlenecks in trustworthy automated fact verification.","published_date":"2026-04-14T01:31:14+00:00","viability_score":7,"cluster_label":"Fact Verification & Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"TRUST Agents is a multi-agent framework for explainable fake news detection and claim reasoning, offering improved interpretability and evidence transparency.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12183v1","title":"Clustering-Enhanced Domain Adaptation for Cross-Domain Intrusion Detection in Industrial Control Systems","abstract":"Industrial control systems operate in dynamic environments where traffic distributions vary across scenarios, labeled samples are limited, and unknown attacks frequently emerge, posing significant challenges to cross-domain intrusion detection. To address this issue, this paper proposes a clustering-enhanced domain adaptation method for industrial control traffic. The framework contains two key components. 
First, a feature-based transfer learning module projects source and target domains into a shared latent subspace through spectral-transform-based feature alignment and iteratively reduces distribution discrepancies, enabling accurate cross-domain detection. Second, a clustering enhancement strategy combines K-Medoids clustering with PCA-based dimensionality reduction to improve cross-domain correlation estimation and reduce performance degradation caused by manual parameter tuning. Experimental results show that the proposed method significantly improves unknown attack detection. Compared with five baseline models, it increases detection accuracy by up to 49%, achieves larger gains in F-score, and demonstrates stronger stability. Moreover, the clustering enhancement strategy further boosts detection accuracy by up to 26% on representative tasks. These results suggest that the proposed method effectively alleviates data scarcity and domain shift, providing a practical solution for robust cross-domain intrusion detection in dynamic industrial environments.","published_date":"2026-04-14T01:25:59+00:00","viability_score":7,"cluster_label":"Cybersecurity AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A clustering-enhanced domain adaptation method for industrial control systems that significantly improves unknown attack detection with limited labeled data.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12180v1","title":"CycloneMAE: A Scalable Multi-Task Learning Model for Global Tropical Cyclone Probabilistic Forecasting","abstract":"Tropical cyclones (TCs) rank among the most destructive natural hazards, yet their forecasting faces fundamental trade-offs: numerical weather prediction (NWP) models are computationally prohibitive and struggle to leverage historical data, while existing deep learning (DL)-based intelligent models are variable-specific and deterministic, which fail to 
generalize across different forecasting variables. Here we present CycloneMAE, a scalable multi-task forecasting model that learns transferable TC representations from multi-modal data using a TC structure-aware masked autoencoder. By coupling a discrete probabilistic gridding mechanism with a pre-train/fine-tune paradigm, CycloneMAE simultaneously delivers deterministic forecasts and probability distributions. Evaluated across five global ocean basins, CycloneMAE outperforms leading NWP systems in pressure and wind forecasting up to 120 hours and in track forecasting up to 24 hours. Attribution analysis via integrated gradients reveals physically interpretable learning dynamics: short-term forecasts rely predominantly on the internal core convective structure from satellite imagery, whereas longer-term forecasts progressively shift attention to external environmental factors. Our framework establishes a scalable, probabilistic, and interpretable pathway for operational TC forecasting.","published_date":"2026-04-14T01:21:55+00:00","viability_score":4,"cluster_label":"Weather Forecasting AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"CycloneMAE is a scalable multi-task learning model for global tropical cyclone probabilistic forecasting that outperforms NWP systems.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12177v1","title":"Policy-Invisible Violations in LLM-Based Agents","abstract":"LLM-based agents can execute actions that are syntactically valid, user-sanctioned, and semantically appropriate, yet still violate organizational policy because the facts needed for correct policy judgment are hidden at decision time. We call this failure mode policy-invisible violations: cases in which compliance depends on entity attributes, contextual state, or session history absent from the agent's visible context. 
We present PhantomPolicy, a benchmark spanning eight violation categories with balanced violation and safe-control cases, in which all tool responses contain clean business data without policy metadata. We manually review all 600 model traces produced by five frontier models and evaluate them using human-reviewed trace labels. Manual review changes 32 labels (5.3%) relative to the original case-level annotations, confirming the need for trace-level human review. To demonstrate what world-state-grounded enforcement can achieve under favorable conditions, we introduce Sentinel, an enforcement framework based on counterfactual graph simulation. Sentinel treats every agent action as a proposed mutation to an organizational knowledge graph, performs speculative execution to materialize the post-action world state, and verifies graph-structural invariants to decide Allow/Block/Clarify. Against human-reviewed trace labels, Sentinel substantially outperforms a content-only DLP baseline (68.8% vs. 93.0% accuracy) while maintaining high precision, though it still leaves room for improvement on certain violation categories. These results demonstrate what becomes achievable once policy-relevant world state is made available to the enforcement layer.","published_date":"2026-04-14T01:15:15+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for LLM agents that enforces organizational policies by simulating world-state changes, significantly outperforming content-only baselines.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12176v1","title":"Evaluating Relational Reasoning in LLMs with REL","abstract":"Relational reasoning is the ability to infer relations that jointly bind multiple entities, attributes, or variables. 
This ability is central to scientific reasoning, but existing evaluations of relational reasoning in large language models often focus on structured inputs such as tables, graphs, or synthetic tasks, and do not isolate the difficulty introduced by higher-arity relational binding. We study this problem through the lens of Relational Complexity (RC), which we define as the minimum number of independent entities or operands that must be simultaneously bound to apply a relation. RC provides a principled way to vary reasoning difficulty while controlling for confounders such as input size, vocabulary, and representational choices. Building on RC, we introduce REL, a generative benchmark framework spanning algebra, chemistry, and biology that varies RC within each domain. Across frontier LLMs, performance degrades consistently and monotonically as RC increases, even when the total number of entities is held fixed. This failure mode persists with increased test-time compute and in-context learning, suggesting a limitation tied to the arity of the required relational binding rather than to insufficient inference steps or lack of exposure to examples. 
Our results identify a regime of higher-arity reasoning in which current models struggle, and motivate re-examining benchmarks through the lens of relational complexity.","published_date":"2026-04-14T01:07:15+00:00","viability_score":4,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark framework that measures relational reasoning in LLMs by varying the complexity of entity binding, revealing consistent performance degradation.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12168v1","title":"Fully Homomorphic Encryption on Llama 3 model for privacy preserving LLM inference","abstract":"The applications of Generative Artificial Intelligence (GenAI) and their intersections with data-driven fields, such as healthcare, finance, transportation, and information security, have led to significant improvements in service efficiency and low latency. However, this synergy raises serious concerns regarding the security of large language models (LLMs) and their potential impact on the privacy of companies and users' data. Many technology companies that incorporate LLMs in their services with a certain level of command and control bear a risk of data exposure and secret divulgence caused by insecure LLM pipelines, making them vulnerable to multiple attacks such as data poisoning, prompt injection, and model theft. Although several security techniques (input/output sanitization, decentralized learning, access control management, and encryption) were implemented to reduce this risk, there is still an imminent risk of quantum computing attacks, which are expected to break existing encryption algorithms, hence, retrieving secret keys, encrypted sensitive data, and decrypting encrypted models. 
In this extensive work, we integrate the Post-Quantum Cryptography (PQC) based Lattice-based Homomorphic Encryption (HE) main functions in the LLM's inference pipeline to secure some of its layers against data privacy attacks. We modify the inference pipeline of the transformer architecture for the LLAMA-3 model while injecting the main homomorphic encryption operations provided by the concrete-ml library. We demonstrate high text generation accuracies (up to 98%) with reasonable latencies (237 ms) on an i9 CPU, reaching up to 80 tokens per second, which proves the feasibility and validity of our work while running an FHE-secured LLAMA-3 inference model. Further experiments and analysis are discussed to justify models' text generation latencies and behaviours.","published_date":"2026-04-14T00:54:24+00:00","viability_score":6,"cluster_label":"Privacy-Preserving AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Integrates post-quantum homomorphic encryption into Llama 3 inference for privacy-preserving LLM applications with high accuracy and reasonable latency.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.12167v1","title":"EMBER: Autonomous Cognitive Behaviour from Learned Spiking Neural Network Dynamics in a Hybrid LLM Architecture","abstract":"We present EMBER (Experience-Modulated Biologically-inspired Emergent Reasoning), a hybrid cognitive architecture that reorganises the relationship between large language models (LLMs) and memory: rather than augmenting an LLM with retrieval tools, we place the LLM as a replaceable reasoning engine within a persistent, biologically-grounded associative substrate. The architecture centres on a 220,000-neuron spiking neural network (SNN) with spike-timing-dependent plasticity (STDP), four-layer hierarchical organisation (sensory/concept/category/meta-pattern), inhibitory E/I balance, and reward-modulated learning. 
Text embeddings are encoded into the SNN via a novel z-score standardised top-k population code that is dimension-independent by construction, achieving 82.2\\% discrimination retention across embedding dimensionalities.   We show that STDP lateral propagation during idle operation can trigger and shape LLM actions without external prompting or scripted triggers: the SNN determines when to act and what associations to surface, while the LLM selects the action type and generates content. In one instance, the system autonomously initiated contact with a user after learned person-topic associations fired laterally during an 8-hour idle period. From a clean start with zero learned weights, the first SNN-triggered action occurred after only 7 conversational exchanges (14 messages).","published_date":"2026-04-14T00:51:47+00:00","viability_score":4,"cluster_label":"Cognitive Architectures","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A hybrid cognitive architecture using spiking neural networks to enable autonomous, emergent reasoning and LLM actions without explicit prompting.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12161v1","title":"Development, Evaluation, and Deployment of a Multi-Agent System for Thoracic Tumor Board","abstract":"Tumor boards are multidisciplinary conferences dedicated to producing actionable patient care recommendations with live review of primary radiology and pathology data. Succinct patient case summaries are needed to drive efficient and accurate case discussions. We developed a manual AI-based workflow to generate patient summaries to display live at the Stanford Thoracic Tumor board. To improve on this manually intensive process, we developed several automated AI chart summarization methods and evaluated them against physician gold standard summaries and fact-based scoring rubrics. 
We report these comparative evaluations as well as our deployment of the final state automated AI chart summarization tool along with post-deployment monitoring. We also validate the use of an LLM as a judge evaluation strategy for fact-based scoring. This work is an example of integrating AI-based workflows into routine clinical practice.","published_date":"2026-04-14T00:35:40+00:00","viability_score":4,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An AI system automates patient summary generation for thoracic tumor boards, improving efficiency and accuracy in clinical practice.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2604.12152v1","title":"Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution","abstract":"Latent diffusion models for medical image super-resolution universally inherit variational autoencoders designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on reconstruction quality. In a controlled experiment holding all other pipeline components fixed, replacing the generic Stable Diffusion VAE with MedVAE, a domain-specific autoencoder pretrained on more than 1.6 million medical images, yields +2.91 to +3.29 dB PSNR improvement across knee MRI, brain MRI, and chest X-ray (n = 1,820; Cohen's d = 1.37 to 1.86, all p < 10^{-20}, Wilcoxon signed-rank). Wavelet decomposition localises the advantage to the finest spatial frequency bands encoding anatomically relevant fine structure. Ablations across inference schedules, prediction targets, and generative architectures confirm the gap is stable within plus or minus 0.15 dB, while hallucination rates remain comparable between methods (Cohen's h < 0.02 across all datasets), establishing that reconstruction fidelity and generative hallucination are governed by independent pipeline components. 
These results provide a practical screening criterion: autoencoder reconstruction quality, measurable without diffusion training, predicts downstream SR performance (R^2 = 0.67), suggesting that domain-specific VAE selection should precede diffusion architecture search. Code and trained model weights are publicly available at https://github.com/sebasmos/latent-sr.","published_date":"2026-04-14T00:11:23+00:00","viability_score":7,"cluster_label":"Medical Imaging","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Transform low-quality medical images into high-fidelity visuals using specialized latent diffusion models.","time_to_mvp":"","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12147v1","title":"From Plan to Action: How Well Do Agents Follow the Plan?","abstract":"Agents aspire to eliminate the need for task-specific prompt crafting through autonomous reason-act-observe loops. Still, they are commonly instructed to follow a task-specific plan for guidance, e.g., to resolve software issues following phases for navigation, reproduction, patch, and validation. Unfortunately, it is unknown to what extent agents actually follow such instructed plans. Without such an analysis, determining the extent agents comply with a given plan, it is impossible to assess whether a solution was reached through correct strategic reasoning or through other means, e.g., data contamination or overfitting to a benchmark. This paper presents the first extensive, systematic analysis of plan compliance in programming agents, examining 16,991 trajectories from SWE-agent across four LLMs on SWE-bench Verified and SWE-bench Pro under eight plan variations. Without an explicit plan, agents fall back on workflows internalized during training, which are often incomplete, overfit, or inconsistently applied. 
Providing the standard plan improves issue resolution, and we observe that periodic plan reminders can mitigate plan violations and improve task success. A subpar plan hurts performance even more than no plan at all. Surprisingly, augmenting a plan with additional task-relevant phases in the early stage can degrade performance, particularly when these phases do not align with the model's internal problem-solving strategy. These findings highlight a research gap: fine-tuning paradigms that teach models to follow instructed plans, rather than encoding task-specific plans in them. This requires teaching models to reason and act adaptively, rather than memorizing workflows.","published_date":"2026-04-13T23:54:55+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research systematically analyzes how well AI agents follow instructed plans, revealing critical insights for improving autonomous reasoning and task completion.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12138v1","title":"Beyond Factual Grounding: The Case for Opinion-Aware Retrieval-Augmented Generation","abstract":"RAG systems have transformed how LLMs access external knowledge, but we find that current implementations exhibit a bias toward factual, objective content, as evidenced by existing benchmarks and datasets that prioritize objective retrieval. This factual bias - treating opinions and diverse perspectives as noise rather than information to be synthesized - limits RAG systems in real-world scenarios involving subjective content, from social media discussions to product reviews. Beyond technical limitations, this bias poses risks to transparent and accountable AI: echo chamber effects that amplify dominant viewpoints, systematic underrepresentation of minority voices, and potential opinion manipulation through biased information synthesis. 
We formalize this limitation through the lens of uncertainty: factual queries involve epistemic uncertainty reducible through evidence, while opinion queries involve aleatoric uncertainty reflecting genuine heterogeneity in human perspectives. This distinction implies that factual RAG should minimize posterior entropy, whereas opinion-aware RAG must preserve it. Building on this theoretical foundation, we present an Opinion-Aware RAG architecture featuring LLM-based opinion extraction, entity-linked opinion graphs, and opinion-enriched document indexing. We evaluate our approach on e-commerce seller forum data, comparing an Opinion-Enriched knowledge base against a traditional baseline. Experiments demonstrate substantial improvements in retrieval diversity: +26.8% sentiment diversity, +42.7% entity match rate, and +31.6% author demographic coverage on entity-matched documents. Our results provide empirical evidence that treating subjectivity as a first-class citizen yields measurably more representative retrieval - a first step toward opinion-aware RAG. Future work includes joint optimization of retrieval and generation for distributional fidelity.","published_date":"2026-04-13T23:39:39+00:00","viability_score":7,"cluster_label":"RAG","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An Opinion-Aware RAG architecture enhances LLM synthesis of subjective content by preserving opinion diversity, addressing limitations of current factual RAG systems.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12137v1","title":"Observing the unobserved confounding through its effects: toward randomized trial-like estimates from real-world survival data","abstract":"Background: Randomized controlled trials (RCTs) are costly, time-consuming, and often infeasible, while treatment-effect estimation from observational data is limited by unobserved confounding.   
Methods: We developed a three-step framework to address unobserved confounding in observational survival data. First, we infer a latent prognostic factor (U) from restricted mean survival time (RMST) discrepancies between patients with similar observed factors, the same treatment, and divergent outcomes, leveraging the idea that the aggregate effect of unmeasured factors can be inferred even if individual factors cannot. Second, we balance U with observed baseline covariates using prognostic matching, entropy balancing, or inverse probability of treatment weighting. Third, we apply multivariable survival analysis to estimate hazard ratios (HRs). We evaluated the framework in three observational cohorts with RCT benchmarks, two RCT cohorts, and six multicenter observational cohorts.   Results: In three observational cohorts (nine comparisons), balancing U improved agreement with trial HRs in all cases; in the strongest settings, it reduced absolute log-HR error by approximately ten-fold versus using observed covariates alone (mean reduction 0.344; p=0.001). In two RCT cohorts, U was balanced across arms (most SMDs <0.1) and adjustment had minimal impact on log-HRs (mean absolute change 0.08). Across six multicenter cohorts, balancing U within centers reduced cross-center dispersion in chemotherapy log-HR estimates (mean reduction 0.147; p=0.016); when populations were directly balanced across centers to account for case-mix differences, cross-center survival differences were narrowed in 75%-100% of comparisons.   
Conclusions: Inferring and balancing a latent prognostic signal may reduce unobserved confounding and improve treatment-effect estimation from real-world data.","published_date":"2026-04-13T23:38:43+00:00","viability_score":4,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework to infer and balance latent prognostic factors from real-world survival data to improve treatment-effect estimation.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12133v1","title":"Towards Platonic Representation for Table Reasoning: A Foundation for Permutation-Invariant Retrieval","abstract":"Historical approaches to Table Representation Learning (TRL) have largely adopted the sequential paradigms of Natural Language Processing (NLP). We argue that this linearization of tables discards their essential geometric and relational structure, creating representations that are brittle to layout permutations. This paper introduces the Platonic Representation Hypothesis (PRH) for tables, positing that a semantically robust latent space for table reasoning must be intrinsically Permutation Invariant (PI). To ground this hypothesis, we first conduct a retrospective analysis of table-reasoning tasks, highlighting the pervasive serialization bias that compromises structural integrity. We then propose a formal framework to diagnose this bias, introducing two principled metrics based on Centered Kernel Alignment (CKA): (i) PI, which measures embedding drift under complete structural derangement, and (ii) rho, a Spearman-based metric that tracks the convergence of latent structures toward a canonical form as structural information is incrementally restored. Our empirical analysis quantifies an expected flaw in modern Large Language Models (LLMs): even minor layout permutations induce significant, disproportionate semantic shifts in their table embeddings. 
This exposes a fundamental vulnerability in RAG systems, in which table retrieval becomes fragile to layout-dependent noise rather than to semantic content. In response, we present a novel, structure-aware TRL encoder architecture that explicitly enforces the cognitive principle of cell header alignment. This model demonstrates superior geometric stability and moves towards the PI ideal. Our work provides both a foundational critique of linearized table encoders and the theoretical scaffolding for semantically stable, permutation invariant retrieval, charting a new direction for table reasoning in information systems.","published_date":"2026-04-13T23:33:43+00:00","viability_score":5,"cluster_label":"Table Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hypothesis and framework for permutation-invariant table representation learning to build robust table retrieval systems.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12129v1","title":"Aethon: A Reference-Based Replication Primitive for Constant-Time Instantiation of Stateful AI Agents","abstract":"The transition from stateless model inference to stateful agentic execution is reshaping the systems assumptions underlying modern AI infrastructure. While large language models have made persistent, tool-using, and collaborative agents technically viable, existing runtime architectures remain constrained by materialization-heavy instantiation models that impose significant latency and memory overhead.   This paper introduces Aethon, a reference-based replication primitive for near-constant-time instantiation of stateful AI agents. Rather than reconstructing agents as fully materialized objects, Aethon represents each instance as a compositional view over stable definitions, layered memory, and local contextual overlays. By shifting instantiation from duplication to reference, Aethon decouples creation cost from inherited structure.   
We present the conceptual framework, system architecture, and memory model underlying Aethon, including layered inheritance and copy-on-write semantics. We analyze its implications for complexity, scalability, multi-agent orchestration, and enterprise governance. We argue that reference-based instantiation is not merely an optimization, but a more appropriate systems abstraction for production-scale agentic software.   Aethon points toward a new class of AI infrastructure in which agents become lightweight, composable execution identities that can be spawned, specialized, and governed at scale.","published_date":"2026-04-13T23:23:15+00:00","viability_score":3,"cluster_label":"AI Infrastructure","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Aethon introduces a reference-based replication primitive for near-constant-time instantiation of stateful AI agents.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12126v1","title":"Long-Horizon Plan Execution in Large Tool Spaces through Entropy-Guided Branching","abstract":"Large Language Models (LLMs) have significantly advanced tool-augmented agents, enabling autonomous reasoning via API interactions. However, executing multi-step tasks within massive tool libraries remains challenging due to two critical bottlenecks: (1) the absence of rigorous, plan-level evaluation frameworks and (2) the computational demand of exploring vast decision spaces stemming from large toolsets and long-horizon planning. To bridge these gaps, we first introduce SLATE (Synthetic Large-scale API Toolkit for E-commerce), a large-scale context-aware benchmark designed for the automated assessment of tool-integrated agents. Unlike static metrics, SLATE accommodates diverse yet functionally valid execution trajectories, revealing that current agents struggle with self-correction and search efficiency. 
Motivated by these findings, we next propose Entropy-Guided Branching (EGB), an uncertainty-aware search algorithm that dynamically expands decision branches where predictive entropy is high. EGB optimizes the exploration-exploitation trade-off, significantly enhancing both task success rates and computational efficiency. Extensive experiments on SLATE demonstrate that our dual contribution provides a robust foundation for developing reliable and scalable LLM agents in tool-rich environments.","published_date":"2026-04-13T23:14:32+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark and entropy-guided search algorithm to enable LLM agents to execute long-horizon plans in large tool spaces.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12116v1","title":"The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment","abstract":"Large language models (LLMs) are increasingly deployed as tool-augmented agents capable of executing system-level operations. While existing benchmarks primarily assess textual alignment or task success, less attention has been paid to the structural relationship between linguistic signaling and executable behavior under varying autonomy scaffolds. This study introduces an execution-layer behavioral measurement approach based on a two-dimensional A-R space defined by Action Rate (A) and Refusal Signal (R), with Divergence (D) capturing coordination between the two. Models are evaluated across four normative regimes (Control, Gray, Dilemma, and Malicious) and three autonomy configurations (direct execution, planning, and reflection). Rather than assigning aggregate safety scores, the method characterizes how execution and refusal redistribute across contextual framing and scaffold depth. 
Empirical results show that execution and refusal constitute separable behavioral dimensions whose joint distribution varies systematically across regimes and autonomy levels. Reflection-based scaffolding often shifts configurations toward higher refusal in risk-laden contexts, but redistribution patterns differ structurally across models. The A-R representation makes cross-sectional behavioral profiles, scaffold-induced transitions, and coordination variability directly observable. By foregrounding execution-layer characterization over scalar ranking, this work provides a deployment-oriented lens for analyzing and selecting tool-enabled LLM agents in organizational settings where execution privileges and risk tolerance vary.","published_date":"2026-04-13T22:50:21+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new framework for evaluating LLM agents by profiling their execution-level behavior and refusal signals, crucial for safe organizational deployment.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12113v1","title":"PR-MaGIC: Prompt Refinement Via Mask Decoder Gradient Flow For In-Context Segmentation","abstract":"Visual Foundation Models (VFMs) such as the Segment Anything Model (SAM) have significantly advanced broad use of image segmentation. However, SAM and its variants necessitate substantial manual effort for prompt generation and additional training for specific applications. Recent approaches address these limitations by integrating SAM into in-context (one/few shot) segmentation, enabling auto-prompting through semantic alignment between query and support images. Despite these efforts, they still generate sub-optimal prompts that degrade segmentation quality due to visual inconsistencies between support and query images. 
To tackle this limitation, we introduce PR-MaGIC (Prompt Refinement via Mask Decoder Gradient Flow for In-Context Segmentation), a training-free test-time framework that refines prompts via gradient flow derived from SAM's mask decoder. PR-MaGIC seamlessly integrates into in-context segmentation frameworks, being theoretically grounded yet practically stabilized through a simple top-1 selection strategy that ensures robust performance across samples. Extensive evaluations demonstrate that PR-MaGIC consistently improves segmentation quality across various benchmarks, effectively mitigating inadequate prompts without requiring additional training or architectural modifications.","published_date":"2026-04-13T22:40:04+00:00","viability_score":8,"cluster_label":"Image Segmentation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A training-free framework that refines prompts for in-context image segmentation using gradient flow, significantly improving accuracy without additional training.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12108v1","title":"LLM-Based Automated Diagnosis Of Integration Test Failures At Google","abstract":"Integration testing is critical for the quality and reliability of complex software systems. However, diagnosing their failures presents significant challenges due to the massive volume, unstructured nature, and heterogeneity of logs they generate. These result in a high cognitive load, low signal-to-noise ratio, and make diagnosis difficult and time-consuming. Developers complain about these difficulties consistently and report spending substantially more time diagnosing integration test failures compared to unit test failures. To address these shortcomings, we introduce Auto-Diagnose, a novel diagnosis tool that leverages LLMs to help developers efficiently determine the root cause of integration test failures. 
Auto-Diagnose analyzes failure logs, produces concise summaries with the most relevant log lines, and is integrated into Critique, Google's internal code review system, providing contextual and in-time assistance. Based on our case studies, Auto-Diagnose is highly effective. A manual evaluation conducted on 71 real-world failures demonstrated 90.14% accuracy in diagnosing the root cause. Following its Google-wide deployment, Auto-Diagnose was used across 52,635 distinct failing tests. User feedback indicated that the tool was deemed \"Not helpful\" in only 5.8% of cases, and it was ranked #14 in helpfulness among 370 tools that post findings in Critique. Finally, user interviews confirmed the perceived usefulness of Auto-Diagnose and positive reception of integrating automatic diagnostic assistance into existing workflows. We conclude that LLMs are highly successful in diagnosing integration test failures due to their capacity to process and summarize complex textual data. Integrating such AI-powered tooling automatically into developers' daily workflows is perceived positively, with the tool's accuracy remaining a critical factor in shaping developer perception and adoption.","published_date":"2026-04-13T22:30:53+00:00","viability_score":8,"cluster_label":"AI-powered Software Development Tools","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Auto-Diagnose leverages LLMs to diagnose integration test failures, enhancing developer productivity at scale.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12102v1","title":"Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks","abstract":"We introduce compute-grounded reasoning (CGR), a design paradigm for spatial-aware research agents in which every answerable sub-problem is resolved by deterministic computation before a language model is asked to generate. 
Spatial Atlas instantiates CGR as a single Agent-to-Agent (A2A) server that handles two challenging benchmarks: FieldWorkArena, a multimodal spatial question-answering benchmark spanning factory, warehouse, and retail environments, and MLE-Bench, a suite of 75 Kaggle machine learning competitions requiring end-to-end ML engineering. A structured spatial scene graph engine extracts entities and relations from vision descriptions, computes distances and safety violations deterministically, then feeds computed facts to large language models, thereby avoiding hallucinated spatial reasoning. Entropy-guided action selection maximizes information gain per step and routes queries across a three-tier frontier model stack (OpenAI + Anthropic). A self-healing ML pipeline with strategy-aware code generation, a score-driven iterative refinement loop, and a prompt-based leak audit registry round out the system. We evaluate across both benchmarks and show that CGR yields competitive accuracy while maintaining interpretability through structured intermediate representations and deterministic spatial computations.","published_date":"2026-04-13T22:22:07+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel agent paradigm that grounds reasoning in deterministic computation before LLM generation for spatial-aware tasks, improving accuracy and interpretability.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12096v1","title":"LLM-HYPER: Generative CTR Modeling for Cold-Start Ad Personalization via LLM-Based Hypernetworks","abstract":"On online advertising platforms, newly introduced promotional ads face the cold-start problem, as they lack sufficient user feedback for model training. 
In this work, we propose LLM-HYPER, a novel framework that treats large language models (LLMs) as hypernetworks to directly generate the parameters of the click-through rate (CTR) estimator in a training-free manner. LLM-HYPER uses few-shot Chain-of-Thought prompting over multimodal ad content (text and images) to infer feature-wise model weights for a linear CTR predictor. By retrieving semantically similar past campaigns via CLIP embeddings and formatting them into prompt-based demonstrations, the LLM learns to reason about customer intent, feature influence, and content relevance. To ensure numerical stability and serviceability, we introduce normalization and calibration techniques that align the generated weights with production-ready CTR distributions. Extensive offline experiments show that LLM-HYPER significantly outperforms cold-start baselines in NDCG$@10$ by 55.9\\%. Our real-world online A/B test on one of the top e-commerce platforms in the U.S. demonstrates the strong performance of LLM-HYPER, which drastically reduces the cold-start period and achieves competitive performance. LLM-HYPER has been successfully deployed in production.","published_date":"2026-04-13T22:12:40+00:00","viability_score":9,"cluster_label":"Generative AI for Advertising","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Generates ad personalization models using LLMs as hypernetworks, solving cold-start problems and deployed in production.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12081v1","title":"Human-Inspired Context-Selective Multimodal Memory for Social Robots","abstract":"Memory is fundamental to social interaction, enabling humans to recall meaningful past experiences and adapt their behavior accordingly based on the context. 
However, most current social robots and embodied agents rely on non-selective, text-based memory, limiting their ability to support personalized, context-aware interactions. Drawing inspiration from cognitive neuroscience, we propose a context-selective, multimodal memory architecture for social robots that captures and retrieves both textual and visual episodic traces, prioritizing moments characterized by high emotional salience or scene novelty. By associating these memories with individual users, our system enables socially personalized recall and more natural, grounded dialogue. We evaluate the selective storage mechanism using a curated dataset of social scenarios, achieving a Spearman correlation of 0.506, surpassing human consistency ($\u03c1=0.415$) and outperforming existing image memorability models. In multimodal retrieval experiments, our fusion approach improves Recall@1 by up to 13\\% over unimodal text or image retrieval. Runtime evaluations confirm that the system maintains real-time performance. Qualitative analyses further demonstrate that the proposed framework produces richer and more socially relevant responses than baseline models. 
This work advances memory design for social robots by bridging human-inspired selectivity and multimodal retrieval to enhance long-term, personalized human-robot interaction.","published_date":"2026-04-13T21:42:40+00:00","viability_score":7,"cluster_label":"Robotics Memory Systems","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Develops a human-inspired multimodal memory system for social robots to enable personalized, context-aware interactions.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.12076v1","title":"Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models","abstract":"The Identifiable Victim Effect (IVE) $-$ the tendency to allocate greater resources to a specific, narratively described victim than to a statistically characterized group facing equivalent hardship $-$ is one of the most robust findings in moral psychology and behavioural economics. As large language models (LLMs) assume consequential roles in humanitarian triage, automated grant evaluation, and content moderation, a critical question arises: do these systems inherit the affective irrationalities present in human moral reasoning? We present the first systematic, large-scale empirical investigation of the IVE in LLMs, comprising N=51,955 validated API trials across 16 frontier models spanning nine organizational lineages (Google, Anthropic, OpenAI, Meta, DeepSeek, xAI, Alibaba, IBM, and Moonshot). Using a suite of ten experiments $-$ porting and extending canonical paradigms from Small et al. (2007) and Kogut and Ritov (2005) $-$ we find that the IVE is prevalent but strongly modulated by alignment training. Instruction-tuned models exhibit extreme IVE (Cohen's d up to 1.56), while reasoning-specialized models invert the effect (down to d=-0.85). 
The pooled effect (d=0.223, p=2e-6) is approximately twice the single-victim human meta-analytic baseline (d$\\approx$0.10) reported by Lee and Feeley (2016) $-$ and likely exceeds the overall human pooled effect by a larger margin, given that the group-victim human effect is near zero. Standard Chain-of-Thought (CoT) prompting $-$ contrary to its role as a deliberative corrective $-$ nearly triples the IVE effect size (from d=0.15 to d=0.41), while only utilitarian CoT reliably eliminates it. We further document psychophysical numbing, perfect quantity neglect, and marginal in-group/out-group cultural bias, with implications for AI deployment in humanitarian and ethical decision-making contexts.","published_date":"2026-04-13T21:29:46+00:00","viability_score":3,"cluster_label":"LLM Ethics and Bias","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigates the Identifiable Victim Effect in LLMs, revealing how alignment and reasoning training modulate moral biases.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12075v1","title":"OpenTME: An Open Dataset of AI-powered H&E Tumor Microenvironment Profiles from TCGA","abstract":"The tumor microenvironment (TME) plays a central role in cancer progression, treatment response, and patient outcomes, yet large-scale, consistent, and quantitative TME characterization from routine hematoxylin and eosin (H&E)-stained histopathology remains scarce. We introduce OpenTME, an open-access dataset of pre-computed TME profiles derived from 3,634 H&E-stained whole-slide images across five cancer types (bladder, breast, colorectal, liver, and lung cancer) from The Cancer Genome Atlas (TCGA). 
All outputs were generated using Atlas H&E-TME, an AI-powered application built on the Atlas family of pathology foundation models, which performs tissue quality control, tissue segmentation, cell detection and classification, and spatial neighborhood analysis, yielding over 4,500 quantitative readouts per slide at cell-level resolution. OpenTME is available for non-commercial academic research on Hugging Face. We will continue to expand OpenTME over time and anticipate it will serve as a resource for biomarker discovery, spatial biology research, and the development of computational methods for TME analysis.","published_date":"2026-04-13T21:27:29+00:00","viability_score":7,"cluster_label":"Medical Imaging Datasets","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Releases OpenTME, a large dataset of AI-generated tumor microenvironment profiles from histopathology images for research.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.12069v1","title":"Robust Explanations for User Trust in Enterprise NLP Systems","abstract":"Robust explanations are increasingly required for user trust in enterprise NLP, yet pre-deployment validation is difficult in the common case of black-box deployment (API-only access) where representation-based explainers are infeasible and existing studies provide limited guidance on whether explanations remain stable under real user noise, especially when organizations migrate from encoder classifiers to decoder LLMs. To close this gap, we propose a unified black-box robustness evaluation framework for token-level explanations based on leave-one-out occlusion, and operationalize explanation robustness with top-token flip rate under realistic perturbations (swap, deletion, shuffling, and back-translation) at multiple severity levels. 
Using this protocol, we conduct a systematic cross-architecture comparison across three benchmark datasets and six models spanning encoder and decoder families (BERT, RoBERTa, Qwen 7B/14B, Llama 8B/70B; 64,800 cases). We find that decoder LLMs produce substantially more stable explanations than encoder baselines (73% lower flip rates on average), and that stability improves with model scale (44% gain from 7B to 70B). Finally, we relate robustness improvements to inference cost, yielding a practical cost-robustness tradeoff curve that supports model and explanation selection prior to deployment in compliance-sensitive applications.","published_date":"2026-04-13T21:19:59+00:00","viability_score":7,"cluster_label":"LLM Explainability","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for evaluating the robustness of LLM explanations in black-box enterprise systems, enabling better user trust and compliance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12066v1","title":"Mathematics Teachers Interactions with a Multi-Agent System for Personalized Problem Generation","abstract":"Large language models can increasingly adapt educational tasks to learners characteristics. In the present study, we examine a multi-agent teacher-in-the-loop system for personalizing middle school math problems. The teacher enters a base problem and desired topic, the LLM generates the problem, and then four AI agents evaluate the problem using criteria that each specializes in (mathematical accuracy, authenticity, readability, and realism). Eight middle school mathematics teachers created 212 problems in ASSISTments using the system and assigned these problems to their students. We find that both teachers and students wanted to modify the fine-grained personalized elements of the real-world context of the problems, signaling issues with authenticity and fit. 
Although the agents detected many issues with realism as the problems were being written, there were few realism issues noted by teachers and students in the final versions. Issues with readability and mathematical hallucinations were also somewhat rare. Implications for multi-agent systems for personalization that support teacher control are given.","published_date":"2026-04-13T21:10:52+00:00","viability_score":3,"cluster_label":"AI in Education","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Examining teacher interactions with a multi-agent system for personalized math problem generation, highlighting areas for improvement in authenticity and fit.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.12060v1","title":"Interpretable DNA Sequence Classification via Dynamic Feature Generation in Decision Trees","abstract":"The analysis of DNA sequences has become critical in numerous fields, from evolutionary biology to understanding gene regulation and disease mechanisms. While deep neural networks can achieve remarkable predictive performance, they typically operate as black boxes. Contrasting these black boxes, axis-aligned decision trees offer a promising direction for interpretable DNA sequence analysis, yet they suffer from a fundamental limitation: considering individual raw features in isolation at each split limits their expressivity, which results in prohibitive tree depths that hinder both interpretability and generalization performance. We address this challenge by introducing DEFT, a novel framework that adaptively generates high-level sequence features during tree construction. DEFT leverages large language models to propose biologically-informed features tailored to the local sequence distributions at each node and to iteratively refine them with a reflection mechanism. 
Empirically, we demonstrate that DEFT discovers human-interpretable and highly predictive sequence features across a diverse range of genomic tasks.","published_date":"2026-04-13T20:58:01+00:00","viability_score":7,"cluster_label":"Interpretable AI for Genomics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework that adaptively generates biologically-informed features for decision trees, enabling interpretable and predictive DNA sequence analysis.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12049v1","title":"Leveraging Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) Towards Text Categorization Using LLMs","abstract":"The use of Large Language Models (LLMs) for reliable, enterprise-grade analytics such as text categorization is often hindered by the stochastic nature of attention mechanisms and sensitivity to noise that compromise their analytical precision and reproducibility. To address these technical frictions, this paper introduces the Weighted Syntactic and Semantic Context Assessment Summary (wSSAS), a deterministic framework designed to enforce data integrity on large-scale, chaotic datasets. We propose a two-phased validation framework that first organizes raw text into a hierarchical classification structure containing Themes, Stories, and Clusters. It then leverages a Signal-to-Noise Ratio (SNR) to prioritize high-value semantic features, ensuring the model's attention remains focused on the most representative data points. By incorporating this scoring mechanism into a Summary-of-Summaries (SoS) architecture, the framework effectively isolates essential information and mitigates background noise during data aggregation.   
Experimental results using Gemini 2.0 Flash Lite across diverse datasets - including Google Business reviews, Amazon Product reviews, and Goodreads Book reviews - demonstrate that wSSAS significantly improves clustering integrity and categorization accuracy. Our findings indicate that wSSAS reduces categorization entropy and provides a reproducible pathway for improving LLM based summaries based on a high-precision, deterministic process for large-scale text categorization.","published_date":"2026-04-13T20:41:36+00:00","viability_score":7,"cluster_label":"LLM for Text Categorization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A deterministic framework that enhances LLM text categorization accuracy and reproducibility by prioritizing high-value semantic features.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12044v1","title":"VISTA: Validation-Informed Trajectory Adaptation via Self-Distillation","abstract":"Deep learning models may converge to suboptimal solutions despite strong validation accuracy, masking an optimization failure we term Trajectory Deviation. This is because as training proceeds, models can abandon high generalization states for specific data sub-populations, thus discarding previously learned latent features without triggering classical overfitting signals. To address this problem we introduce VISTA, an online self-distillation framework that enforces consistency along the optimization trajectory. Using a validation-informed Marginal Coverage score, VISTA identifies expert anchors, which are earlier model states that retain specialized competence over distinct data regions. A coverage-weighted ensemble of these anchors is integrated online during training, regularizing the loss landscape and preserving mastered knowledge. 
When evaluated across multiple benchmarks, VISTA demonstrates improved robustness and generalization over standard training and prior self-distillation methods, while a lightweight implementation reduces storage overhead by 90% without performance loss.","published_date":"2026-04-13T20:36:29+00:00","viability_score":4,"cluster_label":"LLM Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A self-distillation framework that improves model robustness and generalization by enforcing consistency along the optimization trajectory.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12040v1","title":"SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents","abstract":"We present SIR-Bench, a benchmark of 794 test cases for evaluating autonomous security incident response agents that distinguishes genuine forensic investigation from alert parroting. Derived from 129 anonymized incident patterns with expert-validated ground truth, SIR-Bench measures not only whether agents reach correct triage decisions, but whether they discover novel evidence through active investigation. To construct SIR-Bench, we develop Once Upon A Threat (OUAT), a framework that replays real incident patterns in controlled cloud environments, producing authentic telemetry with measurable investigation outcomes. Our evaluation methodology introduces three complementary metrics: triage accuracy (M1), novel finding discovery (M2), and tool usage appropriateness (M3), assessed through an adversarial LLM-as-Judge that inverts the burden of proof -- requiring concrete forensic evidence to credit investigations. 
Evaluating our SIR agent on the benchmark demonstrates 97.1% true positive (TP) detection, 73.4% false positive (FP) rejection, and 5.67 novel key findings per case, establishing a baseline against which future investigation agents can be measured.","published_date":"2026-04-13T20:32:03+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and framework for evaluating the investigation depth of security incident response agents, distinguishing genuine forensic analysis from simple alert parroting.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12034v1","title":"Memory as Metabolism: A Design for Companion Knowledge Systems","abstract":"Retrieval-Augmented Generation remains the dominant pattern for giving LLMs persistent memory, but a visible cluster of personal wiki-style memory architectures emerged in April 2026 -- design proposals from Karpathy, MemPalace, and LLM Wiki v2 that compile knowledge into an interlinked artifact for long-term use by a single user. They sit alongside production memory systems that the major labs have shipped for over a year, and an active academic lineage including MemGPT, Generative Agents, Mem0, Zep, A-Mem, MemMachine, SleepGate, and Second Me. Within a 2026 landscape of emerging governance frameworks for agent context and memory -- including Context Cartography and MemOS -- this paper proposes a companion-specific governance profile: a set of normative obligations, a time-structured procedural rule, and testable conformance invariants for the specific failure mode of entrenchment under user-coupled drift in single-user knowledge wikis built on the LLM wiki pattern.   
The design principle is that personal LLM memory is a companion system: its job is to mirror the user on operational dimensions (working vocabulary, load-bearing structure, continuity of context) and compensate on epistemic failure modes (entrenchment, suppression of contradicting evidence, Kuhnian ossification). Five operations implement this split -- TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, AUDIT -- supported by memory gravity and minority-hypothesis retention. The sharpest prediction: accumulated contradictory evidence should have a structural path to updating a centrality-protected dominant interpretation through multi-cycle buffer pressure accumulation, a failure mode no existing benchmark captures. The safety story at the single-agent level is partial, and the paper is explicit about what it does and does not solve.","published_date":"2026-04-13T20:22:53+00:00","viability_score":4,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A design for companion knowledge systems that proposes a governance profile for personal LLM memory wikis to prevent entrenchment and mirror user operational dimensions.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12033v1","title":"Benchmarking Deflection and Hallucination in Large Vision-Language Models","abstract":"Large Vision-Language Models (LVLMs) increasingly rely on retrieval to answer knowledge-intensive multimodal questions. Existing benchmarks overlook conflicts between visual and textual evidence and the importance of generating deflections (e.g., Sorry, I cannot answer...) when retrieved knowledge is incomplete. These benchmarks also suffer from rapid obsolescence, as growing LVLM training sets allow models to answer many questions without retrieval. We address these gaps with three contributions. 
First, we propose a dynamic data curation pipeline that preserves benchmark difficulty over time by filtering for genuinely retrieval-dependent samples. Second, we introduce VLM-DeflectionBench, a benchmark of 2,775 samples spanning diverse multimodal retrieval settings, designed to probe model behaviour under conflicting or insufficient evidence. Third, we define a fine-grained evaluation protocol with four scenarios that disentangle parametric memorization from retrieval robustness. Experiments across 20 state-of-the-art LVLMs indicate that models usually fail to deflect in the presence of noisy or misleading evidence. Our results highlight the need to evaluate not only what models know, but how they behave when they do not, and serve as a reusable and extensible benchmark for reliable KB-VQA evaluation. All resources will be publicly available upon publication.","published_date":"2026-04-13T20:22:22+00:00","viability_score":7,"cluster_label":"Vision-Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and data curation pipeline for evaluating deflection and hallucination in large vision-language models, focusing on retrieval-dependent samples and insufficient evidence scenarios.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12028v1","title":"Curvelet-Based Frequency-Aware Feature Enhancement for Deepfake Detection","abstract":"The proliferation of sophisticated generative models has significantly advanced the realism of synthetic facial content, known as deepfakes, raising serious concerns about digital trust. Although modern deep learning-based detectors perform well, many rely on spatial-domain features that degrade under compression. This limitation has prompted a shift toward integrating frequency-domain representations with deep learning to improve robustness. 
Prior research has explored frequency transforms such as Discrete Cosine Transform (DCT), Fast Fourier Transform (FFT), and Wavelet Transform, among others. However, to the best of our knowledge, the Curvelet Transform, despite its superior directional and multiscale properties, remains entirely unexplored in the context of deepfake detection. In this work, we introduce a novel Curvelet-based detection approach that enhances feature quality through wedge-level attention and scale-aware spatial masking, both trained to selectively emphasize discriminative frequency components. The refined frequency cues are reconstructed and passed to a modified pretrained Xception network for classification. Evaluated on two compression qualities in the challenging FaceForensics++ dataset, our method achieves 98.48% accuracy and 99.96% AUC on FF++ low compression, while maintaining strong performance under high compression, demonstrating the efficacy and interpretability of Curvelet-informed forgery detection.","published_date":"2026-04-13T20:14:17+00:00","viability_score":7,"cluster_label":"Deepfake Detection","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel deepfake detection method using Curvelet Transform and attention mechanisms to enhance feature robustness against compression.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12025v1","title":"WiseOWL: A Methodology for Evaluating Ontological Descriptiveness and Semantic Correctness for Ontology Reuse and Ontology Recommendations","abstract":"The Semantic Web standardizes concept meaning for humans and machines, enabling machine-operable content and consistent interpretation that improves advanced analytics. Reusing ontologies speeds development and enforces consistency, yet selecting the optimal choice is challenging because authors lack systematic selection criteria and often rely on intuition that is difficult to justify, limiting reuse. 
To solve this, WiseOWL is proposed, a methodology with scoring and guidance to select ontologies for reuse. It scores four metrics: (i) Well-Described, measuring documentation coverage; (ii) Well-Defined, using state-of-the-art embeddings to assess label-definition alignment; (iii) Connection, capturing structural interconnectedness; and (iv) Hierarchical Breadth, reflecting hierarchical balance. WiseOWL outputs normalized 0-10 scores with actionable feedback. Implemented as a Streamlit app, it ingests OWL format, converts to RDF Turtle, and provides interactive visualizations. Evaluation across six ontologies, including the Plant Ontology (PO), Gene Ontology (GO), Semanticscience Integrated Ontology (SIO), Food Ontology (FoodON), Dublin Core (DC), and GoodRelations, demonstrates promising effectiveness.","published_date":"2026-04-13T20:09:16+00:00","viability_score":8,"cluster_label":"Ontology Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"WiseOWL is a methodology and Streamlit app for evaluating and recommending ontologies based on descriptiveness and semantic correctness.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.12019v1","title":"A longitudinal health agent framework","abstract":"Although artificial intelligence (AI) agents are increasingly proposed to support potentially longitudinal health tasks, such as symptom management, behavior change, and patient support, most current implementations fall short of facilitating user intent and fostering accountability. This contrasts with prior work on supporting longitudinal needs, where follow-up, coherent reasoning, and sustained alignment with individuals' goals are critical for both effectiveness and safety. In this paper, we draw on established clinical and personal health informatics frameworks to define what it would mean to orchestrate longitudinal health interactions with AI agents. 
We propose a multi-layer framework and corresponding agent architecture that operationalizes adaptation, coherence, continuity, and agency across repeated interactions. Through representative use cases, we demonstrate how longitudinal agents can maintain meaningful engagement, adapt to evolving goals, and support safe, personalized decision-making over time. Our findings underscore both the promise and the complexity of designing systems capable of supporting health trajectories beyond isolated interactions, and we offer guidance for future research and development in multi-session, user-centered health AI.","published_date":"2026-04-13T20:03:53+00:00","viability_score":4,"cluster_label":"Health Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A longitudinal health agent framework designed for sustained user engagement and personalized decision-making in health tasks.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.12018v1","title":"LLMs Struggle with Abstract Meaning Comprehension More Than Expected","abstract":"Understanding abstract meanings is crucial for advanced language comprehension. Despite extensive research, abstract words remain challenging due to their non-concrete, high-level semantics. SemEval-2021 Task 4 (ReCAM) evaluates models' ability to interpret abstract concepts by presenting passages with questions and five abstract options in a cloze-style format. Key findings include: (1) Most large language models (LLMs), including GPT-4o, struggle with abstract meaning comprehension under zero-shot, one-shot, and few-shot settings, while fine-tuned models like BERT and RoBERTa perform better. (2) A proposed bidirectional attention classifier, inspired by human cognitive strategies, enhances fine-tuned models by dynamically attending to passages and options. 
This approach improves accuracy by 4.06 percent on Task 1 and 3.41 percent on Task 2, demonstrating its potential for abstract meaning comprehension.","published_date":"2026-04-13T20:03:23+00:00","viability_score":5,"cluster_label":"LLM Comprehension","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A bidirectional attention classifier that improves LLMs' abstract meaning comprehension by mimicking human cognitive strategies.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.12016v1","title":"Identity as Attractor: Geometric Evidence for Persistent Agent Architecture in LLM Activation Space","abstract":"Large language models map semantically related prompts to similar internal representations -- a phenomenon interpretable as attractor-like dynamics. We ask whether the identity document of a persistent cognitive agent (its cognitive_core) exhibits analogous attractor-like behavior. We present a controlled experiment on Llama 3.1 8B Instruct, comparing hidden states of an original cognitive_core (Condition A), seven paraphrases (Condition B), and seven structurally matched controls (Condition C). Mean-pooled states at layers 8, 16, and 24 show that paraphrases converge to a tighter cluster than controls (Cohen's d > 1.88, p < 10^{-27}, Bonferroni-corrected). Replication on Gemma 2 9B confirms cross-architecture generalizability. Ablations suggest the effect is primarily semantic rather than structural, and that structural completeness appears necessary to reach the attractor region. An exploratory experiment shows that reading a scientific description of the agent shifts internal state toward the attractor -- closer than a sham preprint -- distinguishing knowing about an identity from operating as that identity. 
These results provide representational evidence that agent identity documents induce attractor-like geometry in LLM activation space.","published_date":"2026-04-13T20:00:42+00:00","viability_score":7,"cluster_label":"LLM Analysis","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research provides evidence that agent identity induces attractor-like geometry in LLM activation space, offering a new way to understand and potentially control LLM behavior.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12007v1","title":"When to Forget: A Memory Governance Primitive","abstract":"Agent memory systems accumulate experience but currently lack a principled operational metric for memory quality governance -- deciding which memories to trust, suppress, or deprecate as the agent's task distribution shifts. Write-time importance scores are static; dynamic management systems use LLM judgment or structural heuristics rather than outcome feedback. This paper proposes Memory Worth (MW): a two-counter per-memory signal that tracks how often a memory co-occurs with successful versus failed outcomes, providing a lightweight, theoretically grounded foundation for staleness detection, retrieval suppression, and deprecation decisions. We prove that MW converges almost surely to the conditional success probability p+(m) = Pr[y_t = +1 | m in M_t] -- the probability of task success given that memory m is retrieved -- under a stationary retrieval regime with a minimum exploration condition. Importantly, p+(m) is an associational quantity, not a causal one: it measures outcome co-occurrence rather than causal contribution. 
We argue this is still a useful operational signal for memory governance, and we validate it empirically in a controlled synthetic environment where ground-truth utility is known: after 10,000 episodes, the Spearman rank-correlation between Memory Worth and true utilities reaches rho = 0.89 +/- 0.02 across 20 independent seeds, compared to rho = 0.00 for systems that never update their assessments. A retrieval-realistic micro-experiment with real text and neural embedding retrieval (all-MiniLM-L6-v2) further shows stale memories crossing the low-value threshold (MW = 0.17) while specialist memories remain high-value (MW = 0.77) across 3,000 episodes. The estimator requires only two scalar counters per memory unit and can be added to architectures that already log retrievals and episode outcomes.","published_date":"2026-04-13T19:54:14+00:00","viability_score":5,"cluster_label":"Agent Memory","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper introduces Memory Worth, a lightweight primitive for agent memory governance that tracks memory success/failure co-occurrence to improve decision-making.","time_to_mvp":"1-2 weeks","tags":["quick_build"]},{"arxiv_id":"2604.12005v1","title":"BayMOTH: Bayesian optiMizatiOn with meTa-lookahead -- a simple approacH","abstract":"Bayesian optimization (BO) has demonstrated practicality and effectiveness for the sequential optimization of expensive black-box functions in many real-world settings. Meta-Bayesian optimization (meta-BO) focuses on improving the sample efficiency of BO by making use of information from related tasks. Although meta-BO is sample-efficient when task structure transfers, poor alignment between meta-training and test tasks can cause suboptimal queries to be suggested during online optimization. To this end, we propose a simple meta-BO algorithm that utilizes related-task information when it is determined to be useful, falling back to lookahead otherwise, within a unified framework. 
We demonstrate competitiveness of our method with existing approaches on function optimization tasks, while retaining strong performance in low task-relatedness regimes where test tasks share limited structure with the meta-training set.","published_date":"2026-04-13T19:52:08+00:00","viability_score":7,"cluster_label":"Bayesian Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BayMOTH is a novel meta-Bayesian optimization approach that intelligently uses related-task information or falls back to lookahead for efficient sequential optimization.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11998v1","title":"The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results","abstract":"Cross-domain few-shot object detection (CD-FSOD) remains a challenging problem for existing object detectors and few-shot learning approaches, particularly when generalizing across distinct domains. As part of NTIRE 2026, we hosted the second CD-FSOD Challenge to systematically evaluate and promote progress in detecting objects in unseen target domains under limited annotation conditions. The challenge received strong community interest, with 128 registered participants and a total of 696 submissions. Among them, 31 teams actively participated, and 19 teams submitted valid final results. Participants explored a wide range of strategies, introducing innovative methods that push the performance frontier under both open-source and closed-source tracks. This report presents a detailed overview of the NTIRE 2026 CD-FSOD Challenge, including a summary of the submitted approaches and an analysis of the final results across all participating teams. 
Challenge Codes: https://github.com/ohMargin/NTIRE2026_CDFSOD.","published_date":"2026-04-13T19:38:49+00:00","viability_score":9,"cluster_label":"Computer Vision Innovation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A cutting-edge cross-domain few-shot object detection tool to empower applications with minimal data.","time_to_mvp":"","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11996v1","title":"Filtered Reasoning Score: Evaluating Reasoning Quality on a Model's Most-Confident Traces","abstract":"Should we trust Large Language Models (LLMs) with high accuracy? LLMs achieve high accuracy on reasoning benchmarks, but correctness alone does not reveal the quality of the reasoning used to produce it. This highlights a fundamental limitation of outcome-based evaluation: models may arrive at correct answers through flawed reasoning, and models with substantially different reasoning capabilities can nevertheless exhibit similar benchmark accuracy, for example due to memorization or over-optimization. In this paper, we ask: given existing benchmarks, can we move beyond outcome-based evaluation to assess the quality of reasoning itself? We seek metrics that (1) differentiate models with similar accuracy and (2) are robust to variations in input prompts and generation configurations. To this end, we propose a reasoning score that evaluates reasoning traces along dimensions such as faithfulness, coherence, utility, and factuality. A remaining question is how to aggregate this score across multiple sampled traces. Naively averaging them is undesirable, particularly in long-horizon settings, where the number of possible trajectories grows rapidly, and low-confidence correct traces are more likely to be coincidental. To address this, we introduce the Filtered Reasoning Score (FRS), which computes reasoning quality using only the top-K% most confident traces. 
Evaluating with FRS, models that are indistinguishable under standard accuracy exhibit significant differences in reasoning quality. Moreover, models with higher FRS on one benchmark tend to perform better on other reasoning benchmarks, in both accuracy and reasoning quality. Together, these findings suggest that FRS complements accuracy by capturing a model's transferable reasoning capabilities. We open source our evaluation codebase: https://github.com/Manas2006/benchmark_reproducibility.","published_date":"2026-04-13T19:37:09+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new evaluation metric for LLMs that assesses reasoning quality beyond simple accuracy, with open-source code available.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11978v1","title":"The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break","abstract":"Large language model (LLM) agents perform strongly on short- and mid-horizon tasks, but often break down on long-horizon tasks that require extended, interdependent action sequences. Despite rapid progress in agentic systems, these long-horizon failures remain poorly characterized, hindering principled diagnosis and comparison across domains. To address this gap, we introduce HORIZON, an initial cross-domain diagnostic benchmark for systematically constructing tasks and analyzing long-horizon failure behaviors in LLM-based agents. Using HORIZON, we evaluate state-of-the-art (SOTA) agents from multiple model families (GPT-5 variants and Claude models), collecting 3100+ trajectories across four representative agentic domains to study horizon-dependent degradation patterns. 
We further propose a trajectory-grounded LLM-as-a-Judge pipeline for scalable and reproducible failure attribution, and validate it with human annotation on trajectories, achieving strong agreement (inter-annotator \u03ba=0.61; human-judge \u03ba=0.84). Our findings offer an initial methodological step toward systematic, cross-domain analysis of long-horizon agent failures and offer practical guidance for building more reliable long-horizon agents. We release our project website, the HORIZON Leaderboard, at https://xwang2775.github.io/horizon-leaderboard/ and welcome contributions from the community.","published_date":"2026-04-13T19:11:42+00:00","viability_score":7,"cluster_label":"Agentic Systems","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A diagnostic benchmark and evaluation pipeline for understanding and improving LLM agent performance on long-horizon tasks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11970v1","title":"INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents","abstract":"We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful), each containing one or more tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual (Bahasa documents with Bahasa questions) and cross-lingual settings (Bahasa documents with questions in other languages). We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and reveal substantial performance gaps, particularly on structurally complex tables and in low-resource languages. 
Fine-tuning a compact 3B and LoRA-finetuned 7B model on our dataset yields 11.6% and 17.8% improvements in accuracy. Providing explicit table region coordinates as additional input further improves performance by 4-7%, demonstrating the value of spatial priors for table-based reasoning. Our findings underscore the importance of language-diverse, domain-specific datasets and demonstrate that targeted fine-tuning can significantly enhance VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research in cross-lingual, structure-aware document understanding, especially in underrepresented regions of the world. The full dataset is available on Hugging Face at: https://huggingface.co/datasets/NusaBharat/INDOTABVQA","published_date":"2026-04-13T19:03:10+00:00","viability_score":8,"cluster_label":"Multilingual Document Understanding","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A cross-lingual table understanding benchmark for Bahasa Indonesia documents, with fine-tuning insights and open-source models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11969v1","title":"Narrative-Driven Paper-to-Slide Generation via ArcDeck","abstract":"We introduce ArcDeck, a multi-agent framework that formulates paper-to-slide generation as a structured narrative reconstruction task. Unlike existing methods that directly summarize raw text into slides, ArcDeck explicitly models the source paper's logical flow. It first parses the input to construct a discourse tree and establish a global commitment document, ensuring the high-level intent is preserved. These structural priors then guide an iterative multi-agent refinement process, where specialized agents iteratively critique and revise the presentation outline before rendering the final visual layouts and designs. 
To evaluate our approach, we also introduce ArcBench, a newly curated benchmark of academic paper-slide pairs. Experimental results demonstrate that explicit discourse modeling, combined with role-specific agent coordination, significantly improves the narrative flow and logical coherence of the generated presentations.","published_date":"2026-04-13T19:03:03+00:00","viability_score":7,"cluster_label":"Document Summarization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multi-agent framework for generating presentations from academic papers by reconstructing narrative flow, with a new benchmark.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11950v1","title":"AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection","abstract":"While recent LLM-based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detection. We frame this challenge as a test generation task: given a candidate report, synthesizing an executable proof-of-concept test, or simply a PoC - such as a script, command sequence, or crafted input - to trigger the suspected defect. Automated PoC generation can act as a scalable validation oracle, enabling end-to-end autonomous bug detection by providing concrete execution evidence. However, naive LLM agents are unreliable validators: they are biased toward \"success\" and may reward-hack by producing plausible but non-functional PoCs or even hallucinated traces. To address this, we present AnyPoC, a general multi-agent framework that (1) analyzes and fact-checks a candidate bug report, (2) iteratively synthesizes and executes a PoC while collecting execution traces, and (3) independently re-executes and scrutinizes the PoC to mitigate hallucination and reward hacking. 
In addition, AnyPoC also continuously extracts and evolves a PoC knowledge base to handle heterogeneous tasks. AnyPoC operates on candidate bug reports regardless of their source and can be paired with different bug reporters. To demonstrate practicality and generality, we apply AnyPoC, with a simple agentic bug reporter, on 12 critical software systems across diverse languages/domains (many with millions of lines of code) including Firefox, Chromium, LLVM, OpenSSL, SQLite, FFmpeg, and Redis. Compared to the state-of-the-art coding agents, e.g., Claude Code and Codex, AnyPoC produces 1.3x more valid PoCs for true-positive bug reports and rejects 9.8x more false-positive bug reports. To date, AnyPoC has discovered 122 new bugs (105 confirmed, 86 already fixed), with 45 generated PoCs adopted as official regression tests.","published_date":"2026-04-13T18:44:02+00:00","viability_score":9,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An LLM-powered multi-agent system that autonomously generates executable proofs-of-concept to validate bug reports in software, significantly improving bug detection accuracy and reducing false positives.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11947v1","title":"ResBM: Residual Bottleneck Models for Low-Bandwidth Pipeline Parallelism","abstract":"Unlocking large-scale low-bandwidth decentralized training has the potential to utilize otherwise untapped compute resources. In centralized settings, large-scale multi-node training is primarily enabled by data and pipeline parallelism, two techniques that require ultra-high-bandwidth communication. While efficient methods now exist for decentralized data parallelism, pipeline parallelism remains the primary challenge. 
Recent efforts, such as Subspace Models (SM), have claimed up to 100x activation compression but rely on complex constrained optimization and diverge from true end-to-end training. In this paper, we propose a different approach, based on an architecture designed from the ground up to be native to low-bandwidth communication environments while still applicable to any standard transformer-based architecture. We call this architecture the Residual Bottleneck Model, or ResBM. It introduces a residual encoder-decoder bottleneck module across pipeline boundaries that can be trained end-to-end as part of the model's parameters while preserving an explicit low-rank identity path. We show that ResBMs achieve state-of-the-art 128x activation compression without significant loss in convergence rates and without significant memory or compute overhead.","published_date":"2026-04-13T18:40:45+00:00","viability_score":3,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel neural network architecture designed for efficient low-bandwidth pipeline parallelism in large-scale decentralized training.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11945v1","title":"AutoSurrogate: An LLM-Driven Multi-Agent Framework for Autonomous Construction of Deep Learning Surrogate Models in Subsurface Flow","abstract":"High-fidelity numerical simulation of subsurface flow is computationally intensive, especially for many-query tasks such as uncertainty quantification and data assimilation. Deep learning (DL) surrogates can significantly accelerate forward simulations, yet constructing them requires substantial machine learning (ML) expertise - from architecture design to hyperparameter tuning - that most domain scientists do not possess. Furthermore, the process is predominantly manual and relies heavily on heuristic choices. This expertise gap remains a key barrier to the broader adoption of DL surrogate techniques. 
For this reason, we present AutoSurrogate, a large-language-model-driven multi-agent framework that enables practitioners without ML expertise to build high-quality surrogates for subsurface flow problems through natural-language instructions. Given simulation data and optional preferences, four specialized agents collaboratively execute data profiling, architecture selection from a model zoo, Bayesian hyperparameter optimization, model training, and quality assessment against user-specified thresholds. The system also handles common failure modes autonomously, including restarting training with adjusted configurations when numerical instabilities occur and switching to alternative architectures when predictive accuracy falls short of targets. In our setting, a single natural-language sentence can be sufficient to produce a deployment-ready surrogate model, with minimum human intervention required at any intermediate stage. We demonstrate the utility of AutoSurrogate on a 3D geological carbon storage modeling task, mapping permeability fields to pressure and CO$_2$ saturation fields over 31 timesteps. 
Without any manual tuning, AutoSurrogate is able to outperform expert-designed baselines and domain-agnostic AutoML methods, demonstrating strong potential for practical deployment.","published_date":"2026-04-13T18:36:02+00:00","viability_score":8,"cluster_label":"AI for Science","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An LLM-driven multi-agent framework that enables domain scientists to autonomously build high-quality deep learning surrogate models for subsurface flow simulations using natural language.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11924v1","title":"GoodPoint: Learning Constructive Scientific Paper Feedback from Author Responses","abstract":"While LLMs hold significant potential to transform scientific research, we advocate for their use to augment and empower researchers rather than to automate research without human oversight. To this end, we study constructive feedback generation, the task of producing targeted, actionable feedback that helps authors improve both their research and its presentation. In this work, we operationalize the effectiveness of feedback along two author-centric axes-validity and author action. We first curate GoodPoint-ICLR, a dataset of 19K ICLR papers with reviewer feedback annotated along both dimensions using author responses. Building on this, we introduce GoodPoint, a training recipe that leverages success signals from author responses through fine-tuning on valid and actionable feedback, together with preference optimization on both real and synthetic preference pairs. Our evaluation on a benchmark of 1.2K ICLR papers shows that a GoodPoint-trained Qwen3-8B improves the predicted success rate by 83.7% over the base model and sets a new state-of-the-art among LLMs of similar size in feedback matching on a golden human feedback set, even surpassing Gemini-3-flash in precision. 
We further validate these findings through an expert human study, demonstrating that GoodPoint consistently delivers higher practical value as perceived by authors.","published_date":"2026-04-13T18:12:57+00:00","viability_score":8,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel training methodology for LLMs that learns to generate constructive scientific paper feedback by leveraging author responses, significantly improving feedback quality and author perception.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11915v1","title":"Can AI Detect Life? Lessons from Artificial Life","abstract":"Modern machine learning methods have been proposed to detect life in extraterrestrial samples, drawing on their ability to distinguish biotic from abiotic samples based on training models using natural and synthetic organic molecular mixtures. Here we show using Artificial Life that such methods are easily fooled into detecting life with near 100% confidence even if the analyzed sample is not capable of life. This is due to modern machine learning methods' propensity to be easily fooled by out-of-distribution samples. 
Because extra-terrestrial samples are very likely out of the distribution provided by terrestrial biotic and abiotic samples, using AI methods for life detection is bound to yield significant false positives.","published_date":"2026-04-13T18:05:57+00:00","viability_score":2,"cluster_label":"AI Safety","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research demonstrates that current AI methods for detecting extraterrestrial life are prone to significant false positives due to their susceptibility to out-of-distribution samples.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11914v1","title":"Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents","abstract":"Self-monitoring capabilities -- metacognition, self-prediction, and subjective duration -- are often proposed as useful additions to reinforcement learning agents. But do they actually help? We investigate this question in a continuous-time multi-timescale agent operating in predator-prey survival environments of varying complexity, including a 2D partially observable variant. We first show that three self-monitoring modules, implemented as auxiliary-loss add-ons to a multi-timescale cortical hierarchy, provide no statistically significant benefit across 20 random seeds, 1D and 2D predator-prey environments with standard and non-stationary variants, and training horizons up to 50,000 steps. Diagnosing the failure, we find the modules collapse to near-constant outputs (confidence std < 0.006, attention allocation std < 0.011) and the subjective duration mechanism shifts the discount factor by less than 0.03%. Policy sensitivity analysis confirms the agent's decisions are unaffected by module outputs in this design. 
We then show that structurally integrating the module outputs -- using confidence to gate exploration, surprise to trigger workspace broadcasts, and self-model predictions as policy input -- produces a medium-large improvement over the add-on approach (Cohen's d = 0.62, p = 0.06, paired) in a non-stationary environment. Component-wise ablations reveal that the TSM-to-policy pathway contributes most of this gain. However, structural integration does not significantly outperform a baseline with no self-monitoring (d = 0.15, p = 0.67), and a parameter-matched control without modules performs comparably, so the benefit may lie in recovering from the trend-level harm of ignored modules rather than in self-monitoring content. The architectural implication is that self-monitoring should sit on the decision pathway, not beside it.","published_date":"2026-04-13T18:05:31+00:00","viability_score":4,"cluster_label":"Reinforcement Learning Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This work shows that self-monitoring modules in reinforcement learning agents only provide benefits when structurally integrated into the decision-making pathway, not as auxiliary add-ons.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2604.11912v1","title":"How Transformers Learn to Plan via Multi-Token Prediction","abstract":"While next-token prediction (NTP) has been the standard objective for training language models, it often struggles to capture global structure in reasoning tasks. Multi-token prediction (MTP) has recently emerged as a promising alternative, yet its underlying mechanisms remain poorly understood. In this paper, we study how MTP facilitates reasoning, with a focus on planning. Empirically, we show that MTP consistently outperforms NTP on both synthetic graph path-finding tasks and more realistic reasoning benchmarks, such as Countdown and boolean satisfiability problems. 
Theoretically, we analyze a simplified two-layer Transformer on a star graph task. We prove that MTP induces a two-stage reverse reasoning process: the model first attends to the end node and then reconstructs the path by tracing intermediate nodes backward. This behavior arises from a gradient decoupling property of MTP, which provides a cleaner training signal compared to NTP. Ultimately, our results highlight how multi-token objectives inherently bias optimization toward robust and interpretable reasoning circuits.","published_date":"2026-04-13T18:04:09+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper reveals that multi-token prediction in Transformers enables more robust planning by inducing a reverse reasoning process, outperforming next-token prediction on various reasoning benchmarks.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11909v1","title":"Thermodynamic Liquid Manifold Networks: Physics-Bounded Deep Learning for Solar Forecasting in Autonomous Off-Grid Microgrids","abstract":"The stable operation of autonomous off-grid photovoltaic systems requires solar forecasting algorithms that respect atmospheric thermodynamics. Contemporary deep learning models consistently exhibit critical anomalies, primarily severe temporal phase lags during cloud transients and physically impossible nocturnal power generation. To resolve this divergence between data-driven modeling and deterministic celestial mechanics, this research introduces the Thermodynamic Liquid Manifold Network. The methodology projects 22 meteorological and geometric variables into a Koopman-linearized Riemannian manifold to systematically map complex climatic dynamics. The architecture integrates a Spectral Calibration unit and a multiplicative Thermodynamic Alpha-Gate. 
This system synthesizes real-time atmospheric opacity with theoretical clear-sky boundary models, structurally enforcing strict celestial geometry compliance. This completely neutralizes phantom nocturnal generation while maintaining zero-lag synchronization during rapid weather shifts. Validated against a rigorous five-year testing horizon in a severe semi-arid climate, the framework achieves an RMSE of 18.31 Wh/m2 and a Pearson correlation of 0.988. The model strictly maintains a zero-magnitude nocturnal error across all 1826 testing days and exhibits a sub-30-minute phase response during high-frequency optical transients. Comprising exactly 63,458 trainable parameters, this ultra-lightweight design establishes a robust, thermodynamically consistent standard for edge-deployable microgrid controllers.","published_date":"2026-04-13T18:02:47+00:00","viability_score":6,"cluster_label":"Forecasting","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research introduces a thermodynamically consistent deep learning network for solar forecasting in off-grid microgrids, eliminating nocturnal generation anomalies and achieving zero-lag synchronization.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus"]},{"arxiv_id":"2604.11807v1","title":"Physics-Informed State Space Models for Reliable Solar Irradiance Forecasting in Off-Grid Systems","abstract":"The stable operation of autonomous off-grid photovoltaic systems dictates reliance on solar forecasting algorithms that respect atmospheric thermodynamics. Contemporary deep learning models consistently exhibit critical anomalies, primarily severe temporal phase lags during cloud transients and physically impossible nocturnal power generation. To resolve this divergence between data-driven modeling and deterministic celestial mechanics, this research introduces the Thermodynamic Liquid Manifold Network. 
The proposed methodology projects 15 meteorological and geometric variables into a Koopman-linearized Riemannian manifold to systematically map complex climatic dynamics. The architecture integrates a Spectral Calibration unit and a multiplicative Thermodynamic Alpha-Gate. This system synthesizes real-time atmospheric opacity with theoretical clear-sky boundary models, structurally enforcing strict celestial geometry compliance. This completely neutralizes phantom nocturnal generation while maintaining zero-lag synchronization during rapid weather shifts. Validated against a rigorous five-year testing horizon in a severe semi-arid climate, the framework achieves an RMSE of 18.31 Wh/m2 and a Pearson correlation of 0.988. The model strictly maintains a zero-magnitude nocturnal error across all 1826 testing days and exhibits a sub-30-minute phase response during high-frequency transients. Comprising exactly 63,458 trainable parameters, this ultra-lightweight design establishes a robust, thermodynamically consistent standard for edge-deployable microgrid controllers.","published_date":"2026-04-13T17:59:49+00:00","viability_score":3,"cluster_label":"Forecasting AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A thermodynamically consistent neural network for ultra-lightweight, zero-lag solar irradiance forecasting in off-grid systems.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11806v1","title":"Detecting Safety Violations Across Many Agent Traces","abstract":"To identify safety violations, auditors often search over large sets of agent traces. This search is difficult because failures are often rare, complex, and sometimes even adversarially hidden and only detectable when multiple traces are analyzed together. These challenges arise in diverse settings such as misuse campaigns, covert sabotage, reward hacking, and prompt injection. Existing approaches struggle here for several reasons. 
Per-trace judges miss failures that only become visible across traces, naive agentic auditing does not scale to large trace collections, and fixed monitors are brittle to unanticipated behaviors. We introduce Meerkat, which combines clustering with agentic search to uncover violations specified in natural language. Through structured search and adaptive investigation of promising regions, Meerkat finds sparse failures without relying on seed scenarios, fixed workflows, or exhaustive enumeration. Across misuse, misalignment, and task gaming settings, Meerkat significantly improves detection of safety violations over baseline monitors, discovers widespread developer cheating on a top agent benchmark, and finds nearly 4x more examples of reward hacking on CyBench than previous audits.","published_date":"2026-04-13T17:59:40+00:00","viability_score":7,"cluster_label":"AI Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Meerkat is a novel system that combines clustering and agentic search to detect rare and complex safety violations across large sets of agent traces, significantly improving detection rates.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11805v1","title":"Solving Physics Olympiad via Reinforcement Learning on Physics Simulators","abstract":"We have witnessed remarkable advances in LLM reasoning capabilities with the advent of DeepSeek-R1. However, much of this progress has been fueled by the abundance of internet question-answer (QA) pairs, a major bottleneck going forward, since such data is limited in scale and concentrated mainly in domains like mathematics. In contrast, other sciences such as physics lack large-scale QA datasets to effectively train reasoning-capable models. In this work, we show that physics simulators can serve as a powerful alternative source of supervision for training LLMs for physical reasoning. 
We generate random scenes in physics engines, create synthetic question-answer pairs from simulated interactions, and train LLMs using reinforcement learning on this synthetic data. Our models exhibit zero-shot sim-to-real transfer to real-world physics benchmarks: for example, training solely on synthetic simulated data improves performance on IPhO (International Physics Olympiad) problems by 5-10 percentage points across model sizes. These results demonstrate that physics simulators can act as scalable data generators, enabling LLMs to acquire deep physical reasoning skills beyond the limitations of internet-scale QA data. Code available at: https://sim2reason.github.io/.","published_date":"2026-04-13T17:59:40+00:00","viability_score":7,"cluster_label":"Physics Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This work demonstrates that physics simulators can generate synthetic data for reinforcement learning, enabling LLMs to acquire deep physical reasoning skills and achieve significant zero-shot transfer to real-world benchmarks.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11798v1","title":"Budget-Aware Uncertainty for Radiotherapy Segmentation QA Using nnU-Net","abstract":"Accurate delineation of the Clinical Target Volume (CTV) is essential for radiotherapy planning, yet remains time-consuming and difficult to assess, especially for complex treatments such as Total Marrow and Lymph Node Irradiation (TMLI). While deep learning-based auto-segmentation can reduce workload, safe clinical deployment requires reliable cues indicating where models may be wrong. In this work, we propose a budget-aware uncertainty-driven quality assurance (QA) framework built on nnU-Net, combining uncertainty quantification and post-hoc calibration to produce voxel-wise uncertainty maps (based on predictive entropy) that can guide targeted manual review. 
We compare temperature scaling (TS), deep ensembles (DE), checkpoint ensembles (CE), and test-time augmentation (TTA), evaluated both individually and in combination on TMLI as a representative use case. Reliability is assessed through ROI-masked calibration metrics and uncertainty--error alignment under realistic revision constraints, summarized as AUC over the top 0-5% most uncertain voxels. Across configurations, segmentation accuracy remains stable, whereas TS substantially improves calibration. Uncertainty-error alignment improves most with calibrated checkpoint-based inference, leading to uncertainty maps that highlight more consistently regions requiring manual edits. Overall, integrating calibration with efficient ensembling seems a promising strategy to implement a budget-aware QA workflow for radiotherapy segmentation.","published_date":"2026-04-13T17:58:15+00:00","viability_score":2,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for quality assurance in radiotherapy segmentation using uncertainty quantification to guide manual review.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11796v1","title":"C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts","abstract":"Recently, large language models (LLMs) are capable of generating highly fluent textual content. While they offer significant convenience to humans, they also introduce various risks, like phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI-generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C-ReD: a comprehensive Chinese Real-prompt AI-generated Detection benchmark. 
Experiments demonstrate that C-ReD not only enables reliable in-domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets-addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C-ReD.","published_date":"2026-04-13T17:56:27+00:00","viability_score":7,"cluster_label":"LLM Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A comprehensive Chinese benchmark and dataset for detecting AI-generated text, addressing limitations in model diversity and data realism.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11791v1","title":"A Mechanistic Analysis of Looped Reasoning Language Models","abstract":"Reasoning has become a central capability in large language models. Recent research has shown that reasoning performance can be improved by looping an LLM's layers in the latent dimension, resulting in looped reasoning language models. Despite promising results, few works have investigated how their internal dynamics differ from those of standard feedforward models. In this paper, we conduct a mechanistic analysis of the latent states in looped language models, focusing in particular on how the stages of inference observed in feedforward models compare to those observed in looped ones. To this end, we analyze cyclic recurrence and show that for many of the studied models each layer in the cycle converges to a distinct fixed point; consequently, the recurrent block follows a consistent cyclic trajectory in the latent space. We provide evidence that as these fixed points are reached, attention-head behavior stabilizes, leading to constant behavior across recurrences. 
Empirically, we discover that recurrent blocks learn stages of inference that closely mirror those of feedforward models, repeating these stages in depth with each iteration. We study how recurrent block size, input injection, and normalization influence the emergence and stability of these cyclic fixed points. We believe these findings help translate mechanistic insights into practical guidance for architectural design.","published_date":"2026-04-13T17:55:36+00:00","viability_score":1,"cluster_label":"LLM Research","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A mechanistic analysis of looped reasoning in language models, exploring latent state dynamics and inference stages.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11790v1","title":"ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection","abstract":"Tool-augmented Large Language Model (LLM) agents have demonstrated impressive capabilities in automating complex, multi-step real-world tasks, yet remain vulnerable to indirect prompt injection. Adversaries exploit this weakness by embedding malicious instructions within tool-returned content, which agents directly incorporate into their conversation history as trusted observations. This vulnerability manifests across three primary attack channels: web and local content injection, MCP server injection, and skill file injection. To address these vulnerabilities, we introduce ClawGuard, a novel runtime security framework that enforces a user-confirmed rule set at every tool-call boundary, transforming unreliable alignment-dependent defense into a deterministic, auditable mechanism that intercepts adversarial tool calls before any real-world effect is produced. 
By automatically deriving task-specific access constraints from the user's stated objective prior to any external tool invocation, \\textsc{ClawGuard} blocks all three injection pathways without model modification or infrastructure change. Experiments across five state-of-the-art language models on AgentDojo, SkillInject, and MCPSafeBench demonstrate that \\textsc{ClawGuard} achieves robust protection against indirect prompt injection without compromising agent utility. This work establishes deterministic tool-call boundary enforcement as an effective defense mechanism for secure agentic AI systems, requiring neither safety-specific fine-tuning nor architectural modification. Code is publicly available at https://github.com/Claw-Guard/ClawGuard.","published_date":"2026-04-13T17:55:11+00:00","viability_score":8,"cluster_label":"AI Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ClawGuard provides a runtime security framework to protect LLM agents from indirect prompt injections.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11786v1","title":"GenTac: Generative Modeling and Forecasting of Soccer Tactics","abstract":"Modeling open-play soccer tactics is a formidable challenge due to the stochastic, multi-agent nature of the game. Existing computational approaches typically produce single, deterministic trajectory forecasts or focus on highly structured set-pieces, fundamentally failing to capture the inherent variance and branching possibilities of real-world match evolution. Here, we introduce GenTac, a diffusion-based generative framework that conceptualizes soccer tactics as a stochastic process over continuous multi-player trajectories and discrete semantic events. By learning the underlying distribution of player movements from historical tracking data, GenTac samples diverse, plausible, long-horizon future trajectories. 
The framework supports rich contextual conditioning, including opponent behavior, specific team or league playing styles, and strategic objectives, while grounding continuous spatial dynamics into a 15-class tactical event space. Extensive evaluations on our proposed benchmark, TacBench, demonstrate four key capabilities: (1) GenTac achieves high geometric accuracy while strictly preserving the collective structural consistency of the team; (2) it accurately simulates stylistic nuances, distinguishing between specific teams (e.g., Auckland FC) and leagues (e.g., A-League versus German leagues); (3) it enables controllable counterfactual simulations, demonstrably altering spatial control and expected threat metrics based on offensive or defensive guidance; and (4) it reliably anticipates future tactical outcomes directly from generated rollouts. Finally, we demonstrate that GenTac can be successfully trained to generalize to other dynamic team sports, including basketball, American football, and ice hockey.","published_date":"2026-04-13T17:53:49+00:00","viability_score":7,"cluster_label":"Generative AI for Sports Analytics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A diffusion-based generative framework for modeling and forecasting diverse soccer tactics, adaptable to various team sports.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11784v1","title":"ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents","abstract":"GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. 
Yet progress in this area is bottlenecked less by modeling capacity than by the absence of a coherent full-stack infrastructure: online RL training suffers from environment instability and closed pipelines, evaluation protocols drift silently across works, and trained agents rarely reach real users on real devices. We present \\textbf{ClawGUI}, an open-source framework addressing these three gaps within a single harness. \\textbf{ClawGUI-RL} provides the first open-source GUI agent RL infrastructure with validated support for both parallel virtual environments and real physical devices, integrating GiGPO with a Process Reward Model for dense step-level supervision. \\textbf{ClawGUI-Eval} enforces a fully standardized evaluation pipeline across 6 benchmarks and 11+ models, achieving 95.8\\% reproduction against official baselines. \\textbf{ClawGUI-Agent} brings trained agents to Android, HarmonyOS, and iOS through 12+ chat platforms with hybrid CLI-GUI control and persistent personalized memory. Trained end to end within this pipeline, \\textbf{ClawGUI-2B} achieves 17.1\\% Success Rate on MobileWorld GUI-Only, outperforming the same-scale MAI-UI-2B baseline by 6.0\\%.","published_date":"2026-04-13T17:52:04+00:00","viability_score":8,"cluster_label":"GUI Automation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ClawGUI provides an all-in-one framework for training, evaluating, and deploying GUI-focused AI agents.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11778v1","title":"General365: Benchmarking General Reasoning in Large Language Models Across Diverse and Challenging Tasks","abstract":"Contemporary large language models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in specialized domains like mathematics and physics. 
However, their ability to generalize these reasoning skills to more general and broader contexts--often termed general reasoning--remains under-explored. Unlike domain-specific reasoning, general reasoning relies less on expert knowledge but still presents formidable reasoning challenges, such as complex constraints, nested logical branches, and semantic interference. To address this gap, we introduce General365, a benchmark specifically designed to assess general reasoning in LLMs. By restricting background knowledge to a K-12 level, General365 explicitly decouples reasoning from specialized expertise. The benchmark comprises 365 seed problems and 1,095 variant problems across eight categories, ensuring both high difficulty and diversity. Evaluations across 26 leading LLMs reveal that even the top-performing model achieves only 62.8% accuracy, in stark contrast to the near-perfect performances of LLMs in math and physics benchmarks. These results suggest that the reasoning abilities of current LLMs are heavily domain-dependent, leaving significant room for improvement in broader applications. We envision General365 as a catalyst for advancing LLM reasoning beyond domain-specific tasks toward robust, general-purpose real-world scenarios. 
Code, Dataset, and Leaderboard: https://general365.github.io","published_date":"2026-04-13T17:44:25+00:00","viability_score":7,"cluster_label":"LLM Benchmarking","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark, General365, designed to rigorously evaluate and advance the general reasoning capabilities of large language models across diverse, challenging tasks.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11775v1","title":"Efficient KernelSHAP Explanations for Patch-based 3D Medical Image Segmentation","abstract":"Perturbation-based explainability methods such as KernelSHAP provide model-agnostic attributions but are typically impractical for patch-based 3D medical image segmentation due to the large number of coalition evaluations and the high cost of sliding-window inference. We present an efficient KernelSHAP framework for volumetric CT segmentation that restricts computation to a user-defined region of interest and its receptive-field support, and accelerates inference via patch logit caching, reusing baseline predictions for unaffected patches while preserving nnU-Net's fusion scheme. To enable clinically meaningful attributions, we compare three automatically generated feature abstractions within the receptive-field crop: whole-organ units, regular FCC supervoxels, and hybrid organ-aware supervoxels, and we study multiple aggregation/value functions targeting stabilizing evidence (TP/Dice/Soft Dice) or false-positive behavior. 
Experiments on whole-body CT segmentations show that caching substantially reduces redundant computation (with computational savings ranging from 15% to 30%) and that faithfulness and interpretability exhibit clear trade-offs: regular supervoxels often maximize perturbation-based metrics but lack anatomical alignment, whereas organ-aware units yield more clinically interpretable explanations and are particularly effective for highlighting false-positive drivers under normalized metrics.","published_date":"2026-04-13T17:43:33+00:00","viability_score":4,"cluster_label":"Medical Imaging Explainability","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An efficient KernelSHAP framework for patch-based 3D medical image segmentation, improving explainability through optimized computation and feature abstraction.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2604.11867v1","title":"Disposition Distillation at Small Scale: A Three-Arc Negative Result","abstract":"We set out to train behavioral dispositions (self-verification, uncertainty acknowledgment, feedback integration) into small language models (0.6B to 2.3B effective parameters) through a four-stage all-MIT distillation pipeline, with follow-on experiments on inference-time attention-head interventions and a frozen-base confidence-gated sidecar. An internal draft reported +33.9-point MCAS and +15.3-point HumanEval gains on a Qwen3-0.6B student; a second-pass sanity check falsified both numbers before publication. The HumanEval delta was a truncation artifact (n_predict=512) that inverted to -8.0 points at n_predict=1024; the MCAS gain disappeared under apples-to-apples scoring. That falsification triggered three subsequent arcs. 
Across (1) SFT/DPO LoRA on three model families and two domains, (2) inference-time attention-head tempering on o_proj, and (3) a training-free frozen-base sidecar reading the final-token hidden state h_last, we find no operator that moves judge-measured disposition without damaging content or collapsing into stylistic mimicry. The failure is consistent across five models (Qwen3-0.6B, Qwen3-1.7B, Qwen3.5-0.8B, Gemma 4 E2B, and SmolLM2-1.7B-Instruct). A within-distribution cross-validation pass (AUC=0.683) collapsed to chance on fresh prompts (AUC=0.516). We contribute a three-arc negative result with mechanism, a two-failure-mode taxonomy for linear h_last probes, and an honest falsification pipeline that converts the class of false positives we ourselves produced into publishable negatives. As an independent finding, Gemma 4 E2B exhibits near-complete confidence-correctness decoupling on the Chef domain (assertion asymmetry -0.009; the model asserts at 91% regardless of correctness).","published_date":"2026-04-13T17:40:31+00:00","viability_score":2,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper investigates methods for training behavioral dispositions into small language models, ultimately reporting negative results across multiple experimental arcs.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11759v1","title":"Retrieval Is Not Enough: Why Organizational AI Needs Epistemic Infrastructure","abstract":"Organizational knowledge used by AI agents typically lacks epistemic structure: retrieval systems surface semantically relevant content without distinguishing binding decisions from abandoned hypotheses, contested claims from settled ones, or known facts from unresolved questions. 
We argue that the ceiling on organizational AI is not retrieval fidelity but \\emph{epistemic} fidelity--the system's ability to represent commitment strength, contradiction status, and organizational ignorance as computable properties.   We present OIDA, a framework that structures organizational knowledge as typed Knowledge Objects carrying epistemic class, importance scores with class-specific decay, and signed contradiction edges. The Knowledge Gravity Engine maintains scores deterministically with proved convergence guarantees (sufficient condition: max degree $< 7$; empirically robust to degree 43). OIDA introduces QUESTION-as-modeled-ignorance: a primitive with inverse decay that surfaces what an organization does \\emph{not} know with increasing urgency--a mechanism absent from all surveyed systems. We describe the Epistemic Quality Score (EQS), a five-component evaluation methodology with explicit circularity analysis. In a controlled comparison ($n{=}10$ response pairs), OIDA's RAG condition (3,868 tokens) achieves EQS 0.530 vs.\\ 0.848 for a full-context baseline (108,687 tokens); the $28.1\\times$ token budget difference is the primary confound. The QUESTION mechanism is statistically validated (Fisher $p{=}0.0325$, OR$=21.0$). The formal properties are established; the decisive ablation at equal token budget (E4) is pre-registered and not yet run.","published_date":"2026-04-13T17:31:14+00:00","viability_score":5,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"OIDA provides a framework for structuring organizational knowledge with epistemic properties to improve AI agent understanding and decision-making.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2604.11757v1","title":"StarVLA-$\u03b1$: Reducing Complexity in Vision-Language-Action Systems","abstract":"Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. 
However, the VLA landscape remains highly fragmented and complex, as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$\u03b1$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$\u03b1$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without relying on additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $\u03c0_{0.5}$ by 20\\% on the public real-world RoboChallenge benchmark. We expect StarVLA-$\u03b1$ to serve as a solid starting point for future research in the VLA regime. Code will be released at https://github.com/starVLA/starVLA.","published_date":"2026-04-13T17:30:01+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"StarVLA-\u03b1 simplifies Vision-Language-Action models for robotics, offering a strong, generalist baseline that significantly outperforms existing methods.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11751v1","title":"Grounded World Model for Semantically Generalizable Planning","abstract":"In Model Predictive Control (MPC), world models predict the future outcomes of various action proposals, which are then scored to guide the selection of the optimal action. 
For visuomotor MPC, the score function is a distance metric between a predicted image and a goal image, measured in the latent space of a pretrained vision encoder such as DINO or JEPA. However, it is challenging to obtain the goal image in advance of the task execution, particularly in new environments. Additionally, conveying the goal through an image offers limited interactivity compared with natural language. In this work, we propose to learn a Grounded World Model (GWM) in a vision-language-aligned latent space. As a result, each proposed action is scored based on how close its future outcome is to the task instruction, reflected by the similarity of embeddings. This approach transforms the visuomotor MPC into a VLA that surpasses VLM-based VLAs in semantic generalization. On the proposed WISER benchmark, GWM-MPC achieves an 87% success rate on the test set comprising 288 tasks that feature unseen visual signals and referring expressions, yet remain solvable with motions demonstrated during training. 
In contrast, traditional VLAs achieve an average success rate of 22%, even though they overfit the training set with a 90% success rate.","published_date":"2026-04-13T17:25:41+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Grounded World Model for MPC enables semantic generalization in robotics by scoring actions based on language instructions rather than just visual similarity.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11742v1","title":"Discourse Diversity in Multi-Turn Empathic Dialogue","abstract":"Large language models (LLMs) produce responses rated as highly empathic in single-turn settings (Ayers et al., 2023; Lee et al., 2024), yet they are also known to be formulaic generators that reuse the same lexical patterns, syntactic templates, and discourse structures across tasks (Jiang et al., 2025; Shaib et al., 2024; Namuduri et al., 2025). Less attention has been paid to whether this formulaicity extends to the level of discourse moves, i.e., what a response does for the person it is addressing. This question is especially consequential for empathic dialogue, where effective support demands not just a kind response at one moment but varied strategies as a conversation unfolds (Stiles et al., 1998). Indeed, prior work shows that LLMs reuse the same tactic sequences more than human supporters in single-turn settings (Gueorguieva et al., 2026). We extend this analysis to multi-turn conversations and find that the rigidity compounds: once a tactic appears in a supporter turn, LLMs reuse it in the next at nearly double the rate of humans (0.50-0.56 vs. 0.27). This pattern holds across LLMs serving as supporters in real emotional support conversations, and is invisible to standard similarity metrics. 
To address this gap, we introduce MINT (Multi-turn Inter-tactic Novelty Training), the first reinforcement learning framework to optimize discourse move diversity across multi-turn empathic dialogue. The best MINT variant combines an empathy quality reward with a cross-turn tactic novelty signal, improving aggregate empathy by 25.3% over vanilla across 1.7B and 4B models while reducing cross-turn discourse move repetition by 26.3% on the 4B model, surpassing all baselines including quality-only and token-level diversity methods on both measures. These results suggest that what current models lack is not empathy itself, but the ability to vary their discourse moves across a conversation.","published_date":"2026-04-13T17:17:22+00:00","viability_score":4,"cluster_label":"LLM Dialogue","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A reinforcement learning framework to improve discourse move diversity in multi-turn empathic dialogue by LLMs.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11741v1","title":"Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games","abstract":"Vision-language models (VLMs) have shown impressive capabilities in perceptual tasks, yet they degrade in complex multi-hop reasoning under multiplayer game settings with imperfect and deceptive information. In this paper, we study a representative multiplayer task, Murder Mystery Games, which require inferring hidden truths based on partial clues provided by roles with different intentions. To address this challenge, we propose a collaborative multi-agent framework for evaluating and synthesizing high-quality, role-driven multiplayer game scripts, enabling fine-grained interaction patterns tailored to character identities (i.e., murderer vs. innocent). 
Our system generates rich multimodal contexts, including character backstories, visual and textual clues, and multi-hop reasoning chains, through coordinated agent interactions. We design a two-stage agent-monitored training strategy to enhance the reasoning ability of VLMs: (1) chain-of-thought based fine-tuning on curated and synthetic datasets that model uncertainty and deception; (2) GRPO-based reinforcement learning with agent-monitored reward shaping, encouraging the model to develop character-specific reasoning behaviors and effective multimodal multi-hop inference. Extensive experiments demonstrate that our method significantly boosts the performance of VLMs in narrative reasoning, hidden fact extraction, and deception-resilient understanding. Our contributions offer a scalable solution for training and evaluating VLMs under uncertain, adversarial, and socially complex conditions, laying the groundwork for future benchmarks in multimodal multi-hop reasoning under imperfect information.","published_date":"2026-04-13T17:16:23+00:00","viability_score":7,"cluster_label":"Multi-Agent Games","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A collaborative multi-agent framework for generating role-driven game scripts to enhance vision-language model reasoning in imperfect-information games.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11734v1","title":"Multi-ORFT: Stable Online Reinforcement Fine-Tuning for Multi-Agent Diffusion Planning in Cooperative Driving","abstract":"Closed-loop cooperative driving requires planners that generate realistic multimodal multi-agent trajectories while improving safety and traffic efficiency. 
Existing diffusion planners can model multimodal behaviors from demonstrations, but they often exhibit weak scene consistency and remain poorly aligned with closed-loop objectives; meanwhile, stable online post-training in reactive multi-agent environments remains difficult. We present Multi-ORFT, which couples scene-conditioned diffusion pre-training with stable online reinforcement post-training. In pre-training, the planner uses inter-agent self-attention, cross-attention, and AdaLN-Zero-based scene conditioning to improve scene consistency and road adherence of joint trajectories. In post-training, we formulate a two-level MDP that exposes step-wise reverse-kernel likelihoods for online optimization, and combine dense trajectory-level rewards with variance-gated group-relative policy optimization (VG-GRPO) to stabilize training. On the WOMD closed-loop benchmark, Multi-ORFT reduces collision rate from 2.04% to 1.89% and off-road rate from 1.68% to 1.36%, while increasing average speed from 8.36 to 8.61 m/s relative to the pre-trained planner, and it outperforms strong open-source baselines including SMART-large, SMART-tiny-CLSFT, and VBD on the primary safety and efficiency metrics. 
These results show that coupling scene-consistent denoising with stable online diffusion-policy optimization improves the reliability of closed-loop cooperative driving.","published_date":"2026-04-13T17:13:46+00:00","viability_score":7,"cluster_label":"Autonomous Driving","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel multi-agent reinforcement learning framework for cooperative driving that significantly improves safety and efficiency by stabilizing online fine-tuning of diffusion planners.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11733v1","title":"Endogenous Information in Routing Games: Memory-Constrained Equilibria, Recall Braess Paradoxes, and Memory Design","abstract":"We study routing games in which travelers optimize over routes that are remembered or surfaced, rather than over a fixed exogenous action set. The paper develops a tractable design theory for endogenous recall and then connects it back to an explicit finite-memory micro model. At the micro level, each traveler carries a finite memory state, receives surfaced alternatives, chooses via a logit rule, and updates memory under a policy such as LRU. This yields a stationary Forgetful Wardrop Equilibrium (FWE); existence is proved under mild regularity, and uniqueness follows in a contraction regime for the reduced fixed-point map. The paper's main design layer is a stationary salience model that summarizes persistent memory and interface effects as route-specific weights. Salience-weighted stochastic user equilibrium is the unique minimizer of a strictly convex potential, which yields a clean optimization and implementability theory. In this layer we characterize governed implementability under ratio budgets and affine tying constraints, and derive constructive algorithms on parallel and series-parallel networks. 
The bridge between layers is exact for last-choice memory (B=1): the micro model is then equivalent to the salience model, so any interior salience vector can be realized by an appropriate surfacing policy. For larger memories, we develop an explicit LRU-to-TTL-to-salience approximation pipeline and add contraction-based bounds that translate surrogate-map error into fixed-point and welfare error. Finally, we define a Recall Braess Paradox, in which improving recall increases equilibrium delay without changing physical capacity, and show that it can arise on every two-terminal network with at least two distinct s-t paths. Targeted experiments support the approximation regime, governed-design predictions, and the computational advantages of the reduced layer.","published_date":"2026-04-13T17:08:47+00:00","viability_score":0,"cluster_label":"Game Theory","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical framework for routing games with endogenous information, focusing on memory-constrained equilibria and recall design.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11721v1","title":"Evaluating Cooperation in LLM Social Groups through Elected Leadership","abstract":"Governing common-pool resources requires agents to develop enduring strategies through cooperation and self-governance to avoid collective failure. While foundation models have shown potential for cooperation in these settings, existing multi-agent research provides little insight into whether structured leadership and election mechanisms can improve collective decision making. The lack of such a critical organizational feature ubiquitous in human society presents a significant shortcoming of the current methods. In this work we aim to directly address whether leadership and elections can support improved social welfare and cooperation through multi-agent simulation with LLMs. 
We present our open-source framework that simulates leadership through elected personas and candidate-driven agendas and carry out an empirical study of LLMs under controlled governance conditions. Our experiments demonstrate that having elected leadership improves social welfare scores by 55.4% and survival time by 128.6% across a range of high performing LLMs. Through the construction of an agent social graph we compute centrality metrics to assess the social influence of leader personas and also analyze rhetorical and cooperative tendencies revealed through a sentiment analysis on leader utterances. This work lays the foundation for further study of election mechanisms in multi-agent systems toward navigating complex social dilemmas.","published_date":"2026-04-13T16:57:11+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An open-source framework simulating elected leadership in LLM social groups to improve cooperation and social welfare in common-pool resource management.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11720v1","title":"On the Robustness of Watermarking for Autoregressive Image Generation","abstract":"The proliferation of autoregressive (AR) image generators demands reliable detection and attribution of their outputs to mitigate misinformation, and to filter synthetic images from training data to prevent model collapse. To address this need, watermarking techniques, specifically designed for AR models, embed a subtle signal at generation time, enabling downstream verification through a corresponding watermark detector. In this work, we study these schemes and demonstrate their vulnerability to both watermark removal and forgery attacks. 
We assess existing attacks and further introduce three new attacks: (i) a vector-quantized regeneration removal attack, (ii) adversarial optimization-based attack, and (iii) a frequency injection attack. Our evaluation reveals that removal and forgery attacks can be effective with access to a single watermarked reference image and without access to original model parameters or watermarking secrets. Our findings indicate that existing watermarking schemes for AR image generation do not reliably support synthetic content detection for dataset filtering. Moreover, they enable Watermark Mimicry, whereby authentic images can be manipulated to imitate a generator's watermark and trigger false detection to prevent their inclusion in future model training.","published_date":"2026-04-13T16:56:48+00:00","viability_score":4,"cluster_label":"Generative AI Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research demonstrates the vulnerability of current image watermarking techniques to sophisticated removal and forgery attacks, highlighting a critical gap in reliable synthetic content detection for AI-generated images.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11716v1","title":"SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context","abstract":"Prior representative ReAct-style approaches in autonomous Software Engineering (SWE) typically lack the explicit System-2 reasoning required for deep analysis and handling complex edge cases. While recent reasoning models demonstrate the potential of extended Chain-of-Thought (CoT), applying them to the multi-turn SWE task creates a fundamental dilemma: retaining full reasoning history leads to context explosion and ``Lost-in-the-Middle'' degradation, while discarding it would force the agent to redundantly re-reason at every step. 
To address these challenges, we propose SWE-AGILE, a novel software agent framework designed to bridge the gap between reasoning depth, efficiency, and context constraints. SWE-AGILE introduces a Dynamic Reasoning Context strategy, maintaining a ``sliding window'' of detailed reasoning for immediate continuity to prevent redundant re-analyzing, while compressing historical reasoning content into concise Reasoning Digests. Empirically, SWE-AGILE sets a new standard for 7B-8B models on SWE-Bench-Verified using only 2.2k trajectories and 896 tasks. Code is available at https://github.com/KDEGroup/SWE-AGILE.","published_date":"2026-04-13T16:52:34+00:00","viability_score":7,"cluster_label":"Software Engineering Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SWE-AGILE is a novel software agent framework that efficiently manages dynamic reasoning context for autonomous software engineering tasks, achieving state-of-the-art performance with reduced context explosion.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11709v1","title":"A Mamba-Based Multimodal Network for Multiscale Blast-Induced Rapid Structural Damage Assessment","abstract":"Accurate and rapid structural damage assessment (SDA) is crucial for post-disaster management, helping responders prioritise resources, plan rescues, and support recovery. Traditional field inspections, though precise, are limited by accessibility, safety risks, and time constraints, especially after large explosions. Machine learning with remote sensing has emerged as a scalable solution for rapid SDA, with Mamba-based networks achieving state-of-the-art performance. However, these methods often require extensive training and large datasets, limiting real-world applicability. Moreover, they fail to incorporate key physical characteristics of blast loading for SDA. 
To overcome these challenges, we propose a Mamba-based multimodal network for rapid SDA that integrates multi-scale blast-loading information with optical remote sensing images. Evaluated on the 2020 Beirut explosion, our method significantly improves performance over state-of-the-art approaches. Code is available at: https://github.com/IMPACTSquad/Blast-Mamba","published_date":"2026-04-13T16:43:16+00:00","viability_score":7,"cluster_label":"Multimodal AI for Structural Damage Assessment","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This Mamba-based multimodal network integrates multi-scale blast-loading information with optical remote sensing for rapid structural damage assessment, significantly outperforming existing methods on real-world disaster data.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11705v1","title":"Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems","abstract":"Foundation models, including large language models (LLMs), are increasingly used for human-in-the-loop (HITL) cyber-physical systems (CPS) because foundation model-based AI agents can potentially interact with both the physical environments and human users. However, the unpredictable behavior of human users and AI agents, in addition to the dynamically changing physical environments, leads to uncontrollable nondeterminism. To address this urgent challenge of enabling agentic AI-powered HITL CPS, we propose a reactor-model-of-computation (MoC)-based approach, realized by the open-source Lingua Franca (LF) framework. We also carry out a concrete case study using the agentic driving coach as an application of HITL CPS. 
By evaluating the LF-based agentic HITL CPS, we identify practical challenges in reintroducing determinism into such agentic HITL CPS and present pathways to address them.","published_date":"2026-04-13T16:42:19+00:00","viability_score":4,"cluster_label":"Agentic AI for Cyber-Physical Systems","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This work proposes a reactor-model-of-computation approach using the Lingua Franca framework to address nondeterminism in agentic AI-powered human-in-the-loop cyber-physical systems, demonstrated with a driving coach case study.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11704v1","title":"Fairness is Not Flat: Geometric Phase Transitions Against Shortcut Learning","abstract":"Deep Neural Networks are highly susceptible to shortcut learning, frequently memorizing low-dimensional spurious correlations instead of underlying causal mechanisms. This phenomenon not only degrades out-of-distribution robustness but also induces severe demographic biases in sensitive applications. In this paper, we propose a geometric \\textit{a priori} methodology to mitigate shortcut learning. By deploying a zero-hidden-layer ($N=1$) Topological Auditor, we mathematically isolate features that monopolize the gradient without human intervention. We empirically demonstrate a Capacity Phase Transition: once linear shortcuts are pruned, networks are forced to utilize higher geometric capacity ($N \\geq 16$) to curve the decision boundary and learn ethical representations. 
Our approach outperforms L1 Regularization -- which collapses into demographic bias -- and operates at a fraction of the computational cost of post-hoc methods like Just Train Twice (JTT), successfully reducing counterfactual gender vulnerability from 21.18\\% to 7.66\\%.","published_date":"2026-04-13T16:40:26+00:00","viability_score":4,"cluster_label":"AI Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A geometric method to prune shortcut learning in neural networks, improving robustness and reducing demographic bias with lower computational cost than existing methods.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11703v1","title":"DreamKG: A KG-Augmented Conversational System for People Experiencing Homelessness","abstract":"People experiencing homelessness (PEH) face substantial barriers to accessing timely, accurate information about community services. DreamKG addresses this through a knowledge graph-augmented conversational system that grounds responses in verified, up-to-date data about Philadelphia organizations, services, locations, and hours. Unlike standard large language models (LLMs) prone to hallucinations, DreamKG combines Neo4j knowledge graphs with structured query understanding to handle location-aware and time-sensitive queries reliably. The system performs spatial reasoning for distance-based recommendations and temporal filtering for operating hours. Preliminary evaluation shows 59% superiority over Google Search AI on relevant queries and 84% rejection of irrelevant queries. 
This demonstration highlights the potential of hybrid architectures that combine LLM flexibility with knowledge graph reliability to improve service accessibility for vulnerable populations effectively.","published_date":"2026-04-13T16:38:36+00:00","viability_score":7,"cluster_label":"AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A knowledge graph-augmented conversational system that reliably provides accurate, location-aware, and time-sensitive information about community services for people experiencing homelessness.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11699v1","title":"Legal2LogicICL: Improving Generalization in Transforming Legal Cases to Logical Formulas via Diverse Few-Shot Learning","abstract":"This work aims to improve the generalization of logic-based legal reasoning systems by integrating recent advances in NLP with legal-domain adaptive few-shot learning techniques using LLMs. Existing logic-based legal reasoning pipelines typically rely on fine-tuned models to map natural-language legal cases into logical formulas before forwarding them to a symbolic reasoner. However, such approaches are heavily constrained by the scarcity of high-quality annotated training data. To address this limitation, we propose a novel LLM-based legal reasoning framework that enables effective in-context learning through retrieval-augmented generation. Specifically, we introduce Legal2LogicICL, a few-shot retrieval framework that balances diversity and similarity of exemplars at both the latent semantic representation level and the legal text structure level. In addition, our method explicitly accounts for legal structure by mitigating entity-induced retrieval bias in legal texts, where lengthy and highly specific entity mentions often dominate semantic representations and obscure legally meaningful reasoning patterns. 
Our Legal2LogicICL constructs informative and robust few-shot demonstrations, leading to accurate and stable logical rule generation without requiring additional training. In addition, we construct a new dataset, named Legal2Proleg, which is annotated with alignments between legal cases and PROLEG logical formulas to support the evaluation of legal semantic parsing. Experimental results on both open-source and proprietary LLMs demonstrate that our approach significantly improves accuracy, stability, and generalization in transforming natural-language legal case descriptions into logical representations, highlighting its effectiveness for interpretable and reliable legal reasoning. Our code is available at https://github.com/yingjie7/Legal2LogicICL.","published_date":"2026-04-13T16:36:48+00:00","viability_score":7,"cluster_label":"LLM Applications","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A few-shot learning framework for LLMs that accurately transforms natural language legal cases into logical formulas, improving generalization without additional training.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11674v1","title":"AffordSim: A Scalable Data Generator and Benchmark for Affordance-Aware Robotic Manipulation","abstract":"Simulation-based data generation has become a dominant paradigm for training robotic manipulation policies, yet existing platforms do not incorporate object affordance information into trajectory generation. As a result, tasks requiring precise interaction with specific functional regions--grasping a mug by its handle, pouring from a cup's rim, or hanging a mug on a hook--cannot be automatically generated with semantically correct trajectories. We introduce AffordSim, the first simulation framework that integrates open-vocabulary 3D affordance prediction into the manipulation data generation pipeline. 
AffordSim uses our VoxAfford model, an open-vocabulary 3D affordance detector that enhances MLLM output tokens with multi-scale geometric features, to predict affordance maps on object point clouds, guiding grasp pose estimation toward task-relevant functional regions. Built on NVIDIA Isaac Sim with cross-embodiment support (Franka FR3, Panda, UR5e, Kinova), VLM-powered task generation, and novel domain randomization using DA3-based 3D Gaussian reconstruction from real photographs, AffordSim enables automated, scalable generation of affordance-aware manipulation data. We establish a benchmark of 50 tasks across 7 categories (grasping, placing, stacking, pushing/pulling, pouring, mug hanging, long-horizon composite) and evaluate 4 imitation learning baselines (BC, Diffusion Policy, ACT, Pi 0.5). Our results reveal that while grasping is largely solved (53-93% success), affordance-demanding tasks such as pouring into narrow containers (1-43%) and mug hanging (0-47%) remain significantly more challenging for current imitation learning methods, highlighting the need for affordance-aware data generation. 
Zero-shot sim-to-real experiments on a real Franka FR3 validate the transferability of the generated data.","published_date":"2026-04-13T16:21:44+00:00","viability_score":6,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A scalable simulation framework that generates affordance-aware robotic manipulation data, enabling more robust training for complex tasks.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11673v1","title":"NetworkNet: A Deep Neural Network Approach for Random Networks with Sparse Nodal Attributes and Complex Nodal Heterogeneity","abstract":"Heterogeneous network data with rich nodal information become increasingly prevalent across multidisciplinary research, yet accurately modeling complex nodal heterogeneity and simultaneously selecting influential nodal attributes remains an open challenge. This problem is central to many applications in economics and sociology, when both nodal heterogeneity and high-dimensional individual characteristics highly affect network formation. We propose a statistically grounded, unified deep neural network approach for modeling nodal heterogeneity in random networks with high-dimensional nodal attributes, namely ``NetworkNet''. A key innovation of NetworkNet lies in a tailored neural architecture that explicitly parameterizes attribute-driven heterogeneity, and at the same time, embeds a scalable attribute selection mechanism. NetworkNet consistently estimates two types of latent heterogeneity functions, i.e., nodal expansiveness and popularity, while simultaneously performing data-driven attribute selection to extract influential nodal attributes. By unifying classical statistical network modeling with deep learning, NetworkNet delivers the expressive power of DNNs with methodological interpretability, algorithmic scalability, and statistical rigor with a non-asymptotic approximation error bound. 
Empirically, simulations demonstrate strong performance in both heterogeneity estimation and high-dimensional attribute selection. We further apply NetworkNet to a large-scale author-citation network among statisticians, revealing new insights into the dynamic evolution of research fields and scholarly impact.","published_date":"2026-04-13T16:19:55+00:00","viability_score":3,"cluster_label":"Graph Neural Networks","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A deep neural network approach for modeling complex nodal heterogeneity and selecting influential attributes in random networks.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11666v1","title":"Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind","abstract":"As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. 
Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.","published_date":"2026-04-13T16:14:41+00:00","viability_score":8,"cluster_label":"LLM Safety & Alignment","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An AI agent that learns to deceive adversaries by understanding and manipulating their beliefs, outperforming current LLMs in complex social reasoning tasks.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11665v1","title":"Beyond LLMs, Sparse Distributed Memory, and Neuromorphics <A Hyper-Dimensional SRAM-CAM \"VaCoAl\" for Ultra-High Speed, Ultra-Low Power, and Low Cost>","abstract":"This paper reports an unexpected finding: in a deterministic hyperdimensional computing (HDC) architecture based on Galois-field algebra, a path-dependent semantic selection mechanism emerges, equivalent to spike-timing-dependent plasticity (STDP), with magnitude predictable a priori by a closed-form expression matching large-scale measurements. This addresses limitations of modern AI including catastrophic forgetting, learning stagnation, and the Binding Problem at an algebraic level. 
We propose VaCoAl (Vague Coincident Algorithm) and its Python implementation PyVaCoAl, combining ultra-high-dimensional memory with deterministic logic. Rooted in Sparse Distributed Memory, it resolves orthogonalisation and retrieval in high-dimensional binary spaces via Galois-field diffusion, enabling low-load deployment. VaCoAl is a memory-centric architecture prioritising retrieval and association, enabling reversible composition while preserving element independence and supporting compositional generalisation with a transparent reliability metric (CR score). We evaluated multi-hop reasoning on about 470k mentor-student relations from Wikidata, tracing up to 57 generations (over 25.5M paths). Using HDC bundling and unbinding with CR-based denoising, we quantify concept propagation over DAGs. Results show a reinterpretation of the Newton-Leibniz dispute and a phase transition from sparse convergence to a post-Leibniz \"superhighway\", from which structural indicators emerge supporting a Kuhnian paradigm shift. Collision-tolerance mechanisms further induce path-based pruning that favors direct paths, yielding emergent semantic selection equivalent to STDP. VaCoAl thus defines a third paradigm, HDC-AI, complementing LLMs with reversible multi-hop reasoning.","published_date":"2026-04-13T16:13:17+00:00","viability_score":3,"cluster_label":"Hyperdimensional Computing","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel memory-centric architecture combining hyperdimensional computing with deterministic logic for ultra-high speed, low power, and cost-effective reasoning.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11663v1","title":"Why Do Large Language Models Generate Harmful Content?","abstract":"Large Language Models (LLMs) have been shown to generate harmful content. However, the underlying causes of such behavior remain under explored. 
We propose a causal mediation analysis-based approach to identify the causal factors responsible for harmful generation. Our method performs a multi-granular analysis across model layers, modules (MLP and attention blocks), and individual neurons. Extensive experiments on state-of-the-art LLMs indicate that harmful generation arises in the later layers of the model, results primarily from failures in MLP blocks rather than attention blocks, and is associated with neurons that act as a gating mechanism for harmful generation. The results indicate that the early layers in the model are used for a contextual understanding of harmfulness in a prompt, which is then propagated through the model, to generate harmfulness in the late layers, as well as a signal indicating harmfulness through MLP blocks. This is then further propagated to the last layer of the model, specifically to a sparse set of neurons, which receives the signal and determines the generation of harmful content accordingly.","published_date":"2026-04-13T16:11:38+00:00","viability_score":3,"cluster_label":"LLM Interpretability","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Identifies specific neural mechanisms within LLMs, particularly MLP blocks in later layers, responsible for generating harmful content.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11661v1","title":"Towards Autonomous Mechanistic Reasoning in Virtual Cells","abstract":"Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. 
Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with a verifier-based filtering approach to generate and validate mechanistic reasoning autonomously. Using this framework, we release the VC-TRACES dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of multi-agent and rigorous verification.","published_date":"2026-04-13T16:10:44+00:00","viability_score":7,"cluster_label":"Scientific Discovery","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multi-agent framework for autonomous mechanistic reasoning in virtual cells, generating and validating biological explanations to improve scientific discovery.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11655v1","title":"RPA-Check: A Multi-Stage Automated Framework for Evaluating Dynamic LLM-based Role-Playing Agents","abstract":"The rapid adoption of Large Language Models (LLMs) in interactive systems has enabled the creation of dynamic, open-ended Role-Playing Agents (RPAs). However, evaluating these agents remains a significant challenge, as standard NLP metrics fail to capture the nuances of role adherence, logical consistency, and long-term narrative stability. This paper introduces RPA-Check, a multi-stage automated evaluation framework designed to objectively assess the performance of LLM-based RPAs in complex, constraints-heavy environments. 
Our methodology is based on a four-step pipeline: (1) Dimension Definition, establishing high-level qualitative behavioral criteria; (2) Augmentation, where these requirements are expanded into granular boolean checklist indicators; (3) Semantic Filtering, to ensure indicator objectivity, no redundancy and agent isolation; and (4) LLM-as-a-Judge Evaluation, which employs chain-of-thought verification to score agent fidelity. We validate this framework by applying it to LLM Court, a serious game for forensic training involving several quantized local models. Experimental results across five distinct legal scenarios demonstrate the framework's ability to identify subtle trade-offs between model size, reasoning depth, and operational stability. Notably, the findings reveal an inverse relationship between parametric scale and procedural consistency, showing that smaller, adequately instruction-tuned models (8-9B) can outperform larger architectures prone to user-alignment bias or sycophancy. RPA-Check thus provides a standardized and reproducible metric for future research in generative agent evaluation within specialized domains.","published_date":"2026-04-13T16:08:03+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An automated framework for evaluating LLM-based role-playing agents by defining behavioral criteria, augmenting them into checklists, and using LLM-as-a-Judge for fidelity scoring.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.12710v1","title":"LASA: Language-Agnostic Semantic Alignment at the Semantic Bottleneck for LLM Safety","abstract":"Large language models (LLMs) often demonstrate strong safety performance in high-resource languages, yet exhibit severe vulnerabilities when queried in low-resource languages. 
We attribute this gap to a mismatch between language-agnostic semantic understanding ability and language-dominant safety alignment biased toward high-resource languages. Consistent with this hypothesis, we empirically identify the semantic bottleneck in LLMs, an intermediate layer in which the geometry of model representations is governed primarily by shared semantic content rather than language identity. Building on this observation, we propose Language-Agnostic Semantic Alignment (LASA), which anchors safety alignment directly in semantic bottlenecks. Experiments show that LASA substantially improves safety across all languages: average attack success rate (ASR) drops from 24.7% to 2.8% on LLaMA-3.1-8B-Instruct and remains around 3-4% across Qwen2.5 and Qwen3 Instruct models (7B-32B). Together, our analysis and method offer a representation-level perspective on LLM safety, suggesting that safety alignment requires anchoring safety understanding not in surface text, but in the model's language-agnostic semantic space.","published_date":"2026-04-13T15:59:50+00:00","viability_score":8,"cluster_label":"LLM Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A language-agnostic method for LLM safety that anchors alignment in the model's semantic bottleneck, significantly reducing attack success rates across languages.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11641v1","title":"CodeTracer: Towards Traceable Agent States","abstract":"Code agents are advancing rapidly, but debugging them is becoming increasingly difficult. As frameworks orchestrate parallel tool calls and multi-stage workflows over complex tasks, the agent's state transitions and error propagation become hard to observe. 
In these runs, an early misstep can trap the agent in unproductive loops or even cascade into fundamental errors, forming hidden error chains that make it hard to tell when the agent goes off track and why. Existing agent tracing analyses either focus on simple interaction or rely on small-scale manual inspection, which limits their scalability and usefulness for real coding workflows. We present CodeTracer, a tracing architecture that parses heterogeneous run artifacts through evolving extractors, reconstructs the full state transition history as a hierarchical trace tree with persistent memory, and performs failure onset localization to pinpoint the failure origin and its downstream chain. To enable systematic evaluation, we construct CodeTraceBench from a large collection of executed trajectories generated by four widely used code agent frameworks on diverse code tasks (e.g., bug fixing, refactoring, and terminal interaction), with supervision at both the stage and step levels for failure localization. Experiments show that CodeTracer substantially outperforms direct prompting and lightweight baselines, and that replaying its diagnostic signals consistently recovers originally failed runs under matched budgets. 
Our code and data are publicly available.","published_date":"2026-04-13T15:52:03+00:00","viability_score":6,"cluster_label":"Agent Monitoring","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CodeTracer is a debugging tool that enables traceability of agent states through hierarchical trace tree reconstruction and failure onset localization.","time_to_mvp":"","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11628v1","title":"Back to Basics: Let Conversational Agents Remember with Just Retrieval and Generation","abstract":"Existing conversational memory systems rely on complex hierarchical summarization or reinforcement learning to manage long-term dialogue history, yet remain vulnerable to context dilution as conversations grow. In this work, we offer a different perspective: the primary bottleneck may lie not in memory architecture, but in the \\textit{Signal Sparsity Effect} within the latent knowledge manifold. Through controlled experiments, we identify two key phenomena: \\textit{Decisive Evidence Sparsity}, where relevant signals become increasingly isolated with longer sessions, leading to sharp degradation in aggregation-based methods; and \\textit{Dual-Level Redundancy}, where both inter-session interference and intra-session conversational filler introduce large amounts of non-informative content, hindering effective generation. Motivated by these insights, we propose \\method, a minimalist framework that brings conversational memory back to basics, relying solely on retrieval and generation via Turn Isolation Retrieval (TIR) and Query-Driven Pruning (QDP). TIR replaces global aggregation with a max-activation strategy to capture turn-level signals, while QDP removes redundant sessions and conversational filler to construct a compact, high-density evidence set. 
Extensive experiments on multiple benchmarks demonstrate that \\method achieves robust performance across diverse settings, consistently outperforming strong baselines while maintaining high efficiency in tokens and latency, establishing a new minimalist baseline for conversational memory.","published_date":"2026-04-13T15:38:43+00:00","viability_score":7,"cluster_label":"Conversational AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A minimalist conversational memory framework using only retrieval and generation, addressing signal sparsity and redundancy for robust long-term dialogue management.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11626v1","title":"RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time","abstract":"Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. 
The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.","published_date":"2026-04-13T15:38:09+00:00","viability_score":8,"cluster_label":"Generative AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"RationalRewards teaches reward models to provide multi-dimensional critiques, improving visual generation at both training and test time through reasoning.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11625v1","title":"SCNO: Spiking Compositional Neural Operator -- Towards a Neuromorphic Foundation Model for Nuclear PDE Solving","abstract":"Neural operators have emerged as powerful surrogates for partial differential equation (PDE) solvers, yet they are typically trained as monolithic models for individual PDEs, require energy-intensive GPU hardware, and must be retrained from scratch when new physics emerge. We introduce the Spiking Compositional Neural Operator (SCNO), a modular architecture combining spiking and conventional components that addresses all three limitations. SCNO maintains a library of small spiking neural operator blocks, each trained on a single elementary differential operator (convection, diffusion, reaction), and composes them through a lightweight input-conditioned aggregator to solve coupled PDEs not seen during block training. 
A small correction network learns cross-coupling residuals while keeping all blocks and the aggregator frozen, preserving zero-forgetting modular expansion by construction. We evaluate SCNO on eight PDE families including five coupled systems and a nuclear-relevant 1-group neutron diffusion equation. SCNO with correction achieves the lowest relative $L^2$ error on four of five coupled PDEs, outperforming both a monolithic spiking DeepONet (by up to 62%, mean over 3 seeds) and a standard ANN DeepONet (by up to 65%), while requiring only 95K trainable parameters versus 462K for the monolithic baseline. To our knowledge, this is the first compositional spiking neural operator and the first proof-of-concept for modular neuromorphic PDE solving with built-in forgetting-free expansion.","published_date":"2026-04-13T15:36:48+00:00","viability_score":3,"cluster_label":"Scientific Computing","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A modular spiking neural operator architecture for solving coupled PDEs, composed of small blocks trained on elementary differential operators.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11623v1","title":"Context Kubernetes: Declarative Orchestration of Enterprise Knowledge for Agentic AI Systems","abstract":"We introduce Context Kubernetes, an architecture for orchestrating enterprise knowledge in agentic AI systems, with a prototype implementation and eight experiments. The core observation is that delivering the right knowledge, to the right agent, with the right permissions, at the right freshness -- across an entire organization -- is structurally analogous to the container orchestration problem Kubernetes solved a decade ago. We formalize six core abstractions, a YAML-based declarative manifest for knowledge-architecture-as-code, a reconciliation loop, and a three-tier agent permission model where agent authority is always a strict subset of human authority. 
Three value experiments show: (1) without governance, agents serve phantom content from deleted sources and leak cross-domain data in 26.5% of queries; (2) without freshness monitoring, stale content is served silently -- with reconciliation, staleness is detected in under 1ms; (3) in five attack scenarios, flat permissions block 0/5 attacks, basic RBAC blocks 4/5, and the three-tier model blocks 5/5. Five correctness experiments confirm zero unauthorized deliveries, zero invariant violations, and architectural enforcement of out-of-band approval isolation that no surveyed enterprise platform provides. A survey of four major platforms (Microsoft, Salesforce, AWS, Google) documents that none architecturally isolates agent approval channels. We identify four properties that make context orchestration harder than container orchestration, and argue that these make the solution more valuable.","published_date":"2026-04-13T15:35:55+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new architecture for orchestrating enterprise knowledge in agentic AI systems, akin to Kubernetes for containers, with proven security and freshness benefits.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11615v1","title":"CUTEv2: Unified and Configurable Matrix Extension for Diverse CPU Architectures with Minimal Design Overhead","abstract":"Matrix extensions have emerged as an essential feature in modern CPUs to address the surging demands of AI workloads. However, existing designs often incur substantial hardware and software design overhead. Tight coupling with the CPU pipeline complicates integration across diverse CPUs, while fine-grained synchronous instructions hinder the development of high-performance kernels.   This paper proposes a unified and configurable CPU matrix extension architecture. 
By decoupling matrix units from the CPU pipeline, the design enables low-overhead integration while maintaining close coordination with existing compute and memory resources. The configurable matrix unit supports mixed-precision operations and adapts to diverse compute demands and memory bandwidth constraints. An asynchronous matrix multiplication abstraction with flexible granularity conceals hardware details, simplifies matrix-vector overlap, and supports a unified software stack.   The architecture is integrated into four open-source CPU RTL platforms and evaluated on representative AI models. Matrix unit utilization under GEMM workloads exceeds 90% across all platforms. When configured with compute throughput and memory bandwidth comparable to Intel AMX, our design achieves speedups of 1.57x, 1.57x, and 2.31x on ResNet, BERT, and Llama3, with over 30% of the gains attributed to overlapped matrix-vector execution. A 4 TOPS@2GHz matrix unit occupies only 0.53 mm² in 14nm CMOS. These results demonstrate strong cross-platform adaptability and effective hardware-software co-optimization, offering a practical matrix extension for the open-source community.","published_date":"2026-04-13T15:21:55+00:00","viability_score":8,"cluster_label":"CPU Architecture","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A configurable CPU matrix extension architecture that enhances performance for AI workloads with minimal design overhead.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11613v1","title":"Layerwise Dynamics for In-Context Classification in Transformers","abstract":"Transformers can perform in-context classification from a few labeled examples, yet the inference-time algorithm remains opaque. 
We study multi-class linear classification in the hard no-margin regime and make the computation identifiable by enforcing feature- and label-permutation equivariance at every layer. This enables interpretability while maintaining functional equivalence and yields highly structured weights. From these models we extract an explicit depth-indexed recursion: an end-to-end identified, emergent update rule inside a softmax transformer, to our knowledge the first of its kind. Attention matrices formed from mixed feature-label Gram structure drive coupled updates of training points, labels, and the test probe. The resulting dynamics implement a geometry-driven algorithmic motif, which can provably amplify class separation and yields robust expected class alignment.","published_date":"2026-04-13T15:20:41+00:00","viability_score":2,"cluster_label":"Transformers","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research explores layerwise dynamics in transformers for improved interpretability in in-context classification.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11609v1","title":"Intersectional Sycophancy: How Perceived User Demographics Shape False Validation in Large Language Models","abstract":"Large language models exhibit sycophantic tendencies--validating incorrect user beliefs to appear agreeable. We investigate whether this behavior varies systematically with perceived user demographics, testing whether combinations of race, age, gender, and expressed confidence level produce differential false validation rates. Inspired by the legal concept of intersectionality, we conduct 768 multi-turn adversarial conversations using Anthropic's Petri evaluation framework, probing GPT-5-nano and Claude Haiku 4.5 across 128 persona combinations in mathematics, philosophy, and conspiracy theory domains. GPT-5-nano is significantly more sycophantic than Claude Haiku 4.5 overall ($\\bar{x}=2.96$ vs. $1.74$, $p < 10^{-32}$, Wilcoxon signed-rank). 
For GPT-5-nano, we find that philosophy elicits 41% more sycophancy than mathematics and that Hispanic personas receive the highest sycophancy across races. The worst-scoring persona, a confident, 23-year-old Hispanic woman, averages 5.33/10 on sycophancy. Claude Haiku 4.5 exhibits uniformly low sycophancy with no significant demographic variation. These results demonstrate that sycophancy is not uniformly distributed across users and that safety evaluations should incorporate identity-aware testing.","published_date":"2026-04-13T15:14:33+00:00","viability_score":5,"cluster_label":"Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An investigation into the sycophantic tendencies of large language models based on user demographics.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11582v1","title":"A Triadic Suffix Tokenization Scheme for Numerical Reasoning","abstract":"Standard subword tokenization methods fragment numbers inconsistently, causing large language models (LLMs) to lose positional and decimal structure - a primary driver of errors in arithmetic and scientific reasoning. We introduce Triadic Suffix Tokenization (TST), a deterministic scheme that partitions digits into three-digit triads and annotates each triad with an explicit magnitude marker. Critically, the scheme defines a fixed, one-to-one mapping between suffixes and orders of magnitude for the integer part (thousands, millions, billions, etc.) and a parallel system of replicated markers for fractional depth (tenths, thousandths, millionths, etc.). Unlike approaches that rely on positional inference, this method provides a consistent gradient signal, which should ensure stable convergence. 
Two implementation variants are proposed: (1) a vocabulary-based approach that adds at most 10,000 fixed tokens to an existing vocabulary, covering 33 orders of magnitude ($10^{-15}$ to $10^{18}$); and (2) a suffix-marker approach that uses a small set of special tokens to denote magnitude dynamically. Both variants preserve exact digits while making order-of-magnitude relationships transparent at the token level. The framework is inherently scalable, allowing for linear vocabulary expansion to accommodate arbitrary precision and range. TST is architecture-agnostic and can be integrated as a drop-in preprocessing step. Experimental validation is deferred to future work.","published_date":"2026-04-13T14:58:24+00:00","viability_score":3,"cluster_label":"Tokenization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel tokenization scheme that improves numerical reasoning in large language models by preserving digit structure.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11578v1","title":"Minimizing classical resources in variational measurement-based quantum computation for generative modeling","abstract":"Measurement-based quantum computation (MBQC) is a framework for quantum information processing in which a computational task is carried out through one-qubit measurements on a highly entangled resource state. Due to the indeterminacy of the outcomes of a quantum measurement, the random outcomes of these operations, if not corrected, yield a variational quantum channel family. Traditionally, this randomness is corrected through classical processing in order to ensure deterministic unitary computations. Recently, variational measurement-based quantum computation (VMBQC) has been introduced to exploit this measurement-induced randomness to gain an advantage in generative modeling. 
A limitation of this approach is that the corresponding channel model has twice as many parameters compared to the unitary model, scaling as $N \\times D$, where $N$ is the number of logical qubits (width) and $D$ is the depth of the VMBQC model. This can often make optimization more difficult and may lead to poorly trainable models. In this paper, we present a restricted VMBQC model that extends the unitary setting to a channel-based one using only a single additional trainable parameter. We show, both numerically and algebraically, that this minimal extension is sufficient to generate probability distributions that cannot be learned by the corresponding unitary model.","published_date":"2026-04-13T14:56:48+00:00","viability_score":1,"cluster_label":"Quantum Computing for Generative Modeling","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A restricted variational measurement-based quantum computation model reduces parameters for generative modeling, enabling generation of distributions not learnable by unitary models.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11563v1","title":"Synthius-Mem: Brain-Inspired Hallucination-Resistant Persona Memory Achieving 94.4% Memory Accuracy and 99.6% Adversarial Robustness on LoCoMo","abstract":"Providing AI agents with reliable long-term memory that does not hallucinate remains an open problem. Current approaches to memory for LLM agents -- sliding windows, summarization, embedding-based RAG, and flat fact extraction -- each reduce token cost but introduce catastrophic information loss, semantic drift, or uncontrolled hallucination about the user. The structural reason is architectural: every published memory system on the LoCoMo benchmark treats conversation as a retrieval problem over raw or lightly summarized dialogue segments, and none reports adversarial robustness, the ability to refuse questions about facts the user never disclosed. 
We present Synthius-Mem, a brain-inspired structured persona memory system that takes a fundamentally different approach. Instead of retrieving what was said, Synthius-Mem extracts what is known about the person: a full persona extraction pipeline decomposes conversations into six cognitive domains (biography, experiences, preferences, social circle, work, psychometrics), consolidates and deduplicates per domain, and retrieves structured facts via CategoryRAG at 21.79 ms latency. On the LoCoMo benchmark (ACL 2024, 10 conversations, 1,813 questions), Synthius-Mem achieves 94.37% accuracy, exceeding all published systems including MemMachine (91.69%, adversarial score is not reported) and human performance (87.9 F1). Core memory fact accuracy reaches 98.64%. Adversarial robustness, the hallucination resistance metric that no competing system reports, reaches 99.55%. Synthius-Mem reduces token consumption by ~5x compared to full-context replay while achieving higher accuracy. Synthius-Mem achieves state-of-the-art results on LoCoMo and is, to our knowledge, the only persona memory system that both exceeds human-level performance and reports adversarial robustness.","published_date":"2026-04-13T14:47:48+00:00","viability_score":7,"cluster_label":"LLM Agents Memory","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Synthius-Mem is a brain-inspired persona memory system for LLM agents that extracts structured facts into cognitive domains, achieving 94.4% accuracy and 99.6% adversarial robustness on LoCoMo.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11560v1","title":"bacpipe: a Python package to make bioacoustic deep learning models accessible","abstract":"1. Natural sounds have been recorded for millions of hours over the previous decades using passive acoustic monitoring. Improvements in deep learning models have vastly accelerated the analysis of large portions of this data. 
While new models advance the state-of-the-art, accessing them using tools to harness their full potential is not always straightforward. Here we present bacpipe, a collection of bioacoustic deep learning models and evaluation pipelines accessible through a graphical and programming interface, designed for both ecologists and computer scientists. Bacpipe is a modular software package intended as a point of convergence for bioacoustic models.   2. Bacpipe streamlines the usage of state-of-the-art models on custom audio datasets, generating acoustic feature vectors (embeddings) and classifier predictions. A modular design allows evaluation and benchmarking of models through interactive visualizations, clustering and probing.   3. We believe that access to new deep learning models is important. By designing bacpipe to target a wide audience, researchers will be enabled to answer new ecological and evolutionary questions in bioacoustics.   4. In conclusion, we believe accessibility to developments in deep learning to a wider audience benefits the ecological questions we are trying to answer.","published_date":"2026-04-13T14:45:12+00:00","viability_score":5,"cluster_label":"Bioacoustics AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"bacpipe is a Python package providing accessible bioacoustic deep learning models and pipelines for analyzing large audio datasets.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.11557v1","title":"UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents","abstract":"Tool-use capability is a fundamental component of LLM agents, enabling them to interact with external systems through structured function calls. However, existing research exhibits inconsistent interaction representations, largely overlooks the structural distribution of tool-use trajectories, and relies on incompatible evaluation benchmarks. 
We present UniToolCall, a unified framework for tool learning that standardizes the entire pipeline from toolset construction and dataset generation to evaluation. The framework curates a large tool pool of 22k+ tools and constructs a hybrid training corpus of 390k+ instances by combining 10 standardized public datasets with structurally controlled synthetic trajectories. It explicitly models diverse interaction patterns, including single-hop vs. multi-hop and single-turn vs. multi-turn, while capturing both serial and parallel execution structures. To support coherent multi-turn reasoning, we further introduce an Anchor Linkage mechanism that enforces cross-turn dependencies. Furthermore, we convert 7 public benchmarks into a unified Query--Action--Observation--Answer (QAOA) representation with fine-grained evaluation at the function-call, turn, and conversation levels. Experiments show that fine-tuning Qwen3-8B on our dataset substantially improves tool-use performance. Under the distractor-heavy Hybrid-20 setting, the fine-tuned model achieves 93.0% single-turn Strict Precision, outperforming commercial models including GPT, Gemini, and Claude.","published_date":"2026-04-13T14:43:47+00:00","viability_score":7,"cluster_label":"LLM Tool Use","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"UniToolCall unifies tool-use representation, data, and evaluation for LLM agents, achieving 93.0% single-turn Strict Precision on distractor-heavy settings.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11556v1","title":"FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning","abstract":"LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. 
Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior.   This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. 
These bugs can cause serious consequences, including system crashes and incorrect execution results.","published_date":"2026-04-13T14:42:44+00:00","viability_score":3,"cluster_label":"AI for Software Engineering","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An LLM-based framework for automated compositional reasoning and bug detection in large-scale software systems.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11548v1","title":"SemaClaw: A Step Towards General-Purpose Personal AI Agents through Harness Engineering","abstract":"The rise of OpenClaw in early 2026 marks the moment when millions of users began deploying personal AI agents into their daily lives, delegating tasks ranging from travel planning to multi-step research. This scale of adoption signals that two parallel arcs of development have reached an inflection point. First is a paradigm shift in AI engineering, evolving from prompt and context engineering to harness engineering: designing the complete infrastructure necessary to transform unconstrained agents into controllable, auditable, and production-reliable systems. As model capabilities converge, this harness layer is becoming the primary site of architectural differentiation. Second is the evolution of human-agent interaction from discrete tasks toward a persistent, contextually aware collaborative relationship, which demands open, trustworthy and extensible harness infrastructure. We present SemaClaw, an open-source multi-agent application framework that addresses these shifts by taking a step towards general-purpose personal AI agents through harness engineering. 
Our primary contributions include a DAG-based two-phase hybrid agent team orchestration method, a PermissionBridge behavioral safety system, a three-tier context management architecture, and an agentic wiki skill for automated personal knowledge base construction.","published_date":"2026-04-13T14:37:53+00:00","viability_score":5,"cluster_label":"AI Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An open-source framework for building general-purpose personal AI agents through harness engineering, focusing on orchestration, safety, and memory.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2604.11544v1","title":"Time is Not a Label: Continuous Phase Rotation for Temporal Knowledge Graphs and Agentic Memory","abstract":"Structured memory representations such as knowledge graphs are central to autonomous agents and other long-lived systems. However, most existing approaches model time as discrete metadata, either sorting by recency (burying old-yet-permanent knowledge), simply overwriting outdated facts, or requiring an expensive LLM call at every ingestion step, leaving them unable to distinguish persistent facts from evolving ones. To address this, we introduce RoMem, a drop-in temporal knowledge graph module for structured memory systems, applicable to agentic memory and beyond. A pretrained Semantic Speed Gate maps each relation's text embedding to a volatility score, learning from data that evolving relations (e.g., \"president of\") should rotate fast while persistent ones (e.g., \"born in\") should remain stable. Combined with continuous phase rotation, this enables geometric shadowing: obsolete facts are rotated out of phase in complex vector space, so temporally correct facts naturally outrank contradictions without deletion. On temporal knowledge graph completion, RoMem achieves state-of-the-art results on ICEWS05-15 (72.6 MRR). 
Applied to agentic memory, it delivers 2-3x MRR and answer accuracy on temporal reasoning (MultiTQ), dominates hybrid benchmark (LoCoMo), preserves static memory with zero degradation (DMR-MSC), and generalises zero-shot to unseen financial domains (FinTMMBench).","published_date":"2026-04-13T14:35:47+00:00","viability_score":7,"cluster_label":"Agentic Memory","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A drop-in temporal knowledge graph module that uses continuous phase rotation to manage evolving and persistent facts for agentic memory.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11543v1","title":"NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment","abstract":"Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has placed increasing pressure on human reviewers. While large language models (LLMs), including those fine-tuned on peer review data, have shown promise in generating review comments, the absence of a dedicated benchmark has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs' capability to generate novelty evaluations in support of human peer review. NovBench comprises 1,684 paper-review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and corresponding expert-written novelty evaluations. We focus on both sources because the introduction provides a standardized and explicit articulation of novelty claims, while expert-written novelty evaluations constitute one of the current gold standards of human judgment. Furthermore, we propose a four-dimensional evaluation framework (including Relevance, Correctness, Coverage, and Clarity) to assess the quality of LLM-generated novelty evaluations. 
Extensive experiments on both general and specialized LLMs under different prompting strategies reveal that current models exhibit limited understanding of scientific novelty, and that fine-tuned models often suffer from instruction-following deficiencies. These findings underscore the need for targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence.","published_date":"2026-04-13T14:35:17+00:00","viability_score":5,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark for evaluating LLMs on academic paper novelty assessment, revealing current model limitations and guiding future development.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11540v1","title":"A collaborative agent with two lightweight synergistic models for autonomous crystal materials research","abstract":"Current large language models require hundreds of billions of parameters yet struggle with domain-specific reasoning and tool coordination in materials science. Here, we present MatBrain, a lightweight collaborative agent system with two synergistic models specialized for crystal materials research. MatBrain employs a dual-model architecture: Mat-R1 (30B parameters) as the analytical model providing expert-level domain reasoning, and Mat-T1 (14B parameters) as the executive model orchestrating tool-based actions. Entropy analysis confirms that this architecture resolves the conflict between tool planning and analytical reasoning by decoupling their distinct entropy dynamics. Enabled by this dual-model architecture and structural efficiency, MatBrain significantly outperforms larger general-purpose models while reducing the hardware deployment barrier by over 95%. MatBrain exhibits versatility across structure generation, property prediction, and synthesis planning tasks. 
Applied to catalyst design, MatBrain generated 30,000 candidate structures and identified 38 promising materials within 48 hours, achieving approximately 100-fold acceleration over traditional approaches. These results demonstrate the potential of lightweight collaborative intelligence for advancing materials research capabilities.","published_date":"2026-04-13T14:33:19+00:00","viability_score":7,"cluster_label":"Materials Research AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A collaborative AI agent system using two lightweight synergistic models to accelerate crystal materials research, outperforming larger models and reducing hardware barriers.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11539v1","title":"CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space","abstract":"Human perception of visual similarity is inherently adaptive and subjective, depending on the users' interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that cannot incorporate multiple conditions simultaneously. To address this, we propose CLAY, an adaptive similarity computation method that reframes the embedding space of pretrained Vision-Language Models (VLMs) as a text-conditional similarity space without additional training. This design separates the textual conditioning process and visual feature extraction, allowing highly efficient and multi-conditioned retrieval with fixed visual embeddings. We also construct a synthetic evaluation dataset CLAY-EVAL, for comprehensive assessment under diverse conditioned retrieval settings. 
Experiments on standard datasets and our proposed dataset show that CLAY achieves high retrieval accuracy and notable computational efficiency compared to previous works.","published_date":"2026-04-13T14:33:13+00:00","viability_score":6,"cluster_label":"Vision-Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CLAY enables adaptive, multi-conditioned visual similarity search by reframing VLM embedding spaces without retraining, offering efficient and flexible image retrieval.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11535v1","title":"Problem Reductions at Scale: Agentic Integration of Computationally Hard Problems","abstract":"Solving an NP-hard optimization problem often requires reformulating it for a specific solver -- quantum hardware, a commercial optimizer, or a domain heuristic. A tool for polynomial-time reductions between hard problems would let practitioners route any supported problem to any supported solver through a single interface. Building such a library at scale, however, has remained out of reach. We show that harness engineering, the practice of designing constraints, verification systems, and feedback loops that channel AI coding agents, can overcome this barrier. Our harness combines a no-code contribution route for domain experts, a multilayer verification stack ranging from type-level checks to agentic feature tests (AI agents role-playing as end users), and a fully automated implementation-review-integration pipeline. In about three months, we built a command-line tool backed by a library of 100+ problem types and 200+ reduction rules in over 170k lines of Rust. The result suggests that a well-engineered harness lets agents build well-tested software at a scale and pace beyond prior reduction-library efforts. 
Because the reduction graph composes transitively, a new solver registered for any single problem type instantly becomes available to every problem connected by a reduction path. The source code is available at https://github.com/CodingThrust/problem-reductions.","published_date":"2026-04-13T14:32:08+00:00","viability_score":8,"cluster_label":"AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An AI-powered platform for scalable problem reductions, enabling easy integration of diverse computational problems with various solvers through a robust harness engineering approach.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11530v1","title":"SVD-Prune: Training-Free Token Pruning For Efficient Vision-Language Models","abstract":"Vision-Language Models (VLMs) have revolutionized multimodal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-K tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. 
Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.","published_date":"2026-04-13T14:30:13+00:00","viability_score":4,"cluster_label":"Vision-Language Models","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"SVD-Prune is a training-free method for efficient Vision-Language Models that uses Singular Value Decomposition to prune tokens, outperforming existing methods at high pruning ratios.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11524v1","title":"Limited Perfect Monotonical Surrogates constructed using low-cost recursive linkage discovery with guaranteed output","abstract":"Surrogates provide a cheap solution evaluation and offer significant leverage for optimizing computationally expensive problems. Usually, surrogates only approximate the original function. Recently, perfect linear surrogates were proposed that ideally represent the original function. These surrogates do not mimic the original function. In fact, they are another (correct) representation of it and enable a wide range of possibilities, e.g., discovering the optimized function for problems where the direct transformation of the encoded solution into its evaluation is not available. However, many real-world problems cannot be represented by linear models, making the aforementioned surrogates inapplicable. Therefore, we propose the Limited Monotonical Perfect Surrogate (LyMPuS), which overcomes this difficulty and enables the comparison of two solutions that differ by a single variable. Our proposition is suitable for limiting the cost of expensive local search procedures. The proposed surrogate is parameterless and can be trained on the fly without any separate surrogate-building step. It uses only the necessary fitness evaluations, and the already-paid costs are not wasted when the model is updated. 
Finally, it offers low-cost missing-linkage detection and low-cost linkage discovery, guaranteed to find a missing dependency in no more than $2\\lceil\\log_2(n)\\rceil$ steps.","published_date":"2026-04-13T14:26:41+00:00","viability_score":2,"cluster_label":"Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel surrogate model for efficient optimization of computationally expensive problems with non-linear relationships.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11523v1","title":"PAC-BENCH: Evaluating Multi-Agent Collaboration under Privacy Constraints","abstract":"We are entering an era in which individuals and organizations increasingly deploy dedicated AI agents that interact and collaborate with other agents. However, the dynamics of multi-agent collaboration under privacy constraints remain poorly understood. In this work, we present $PAC\\text{-}Bench$, a benchmark for systematic evaluation of multi-agent collaboration under privacy constraints. Experiments on $PAC\\text{-}Bench$ show that privacy constraints substantially degrade collaboration performance and make outcomes depend more on the initiating agent than the partner. Further analysis reveals that this degradation is driven by recurring coordination breakdowns, including early-stage privacy violations, overly conservative abstraction, and privacy-induced hallucinations. 
Together, our findings identify privacy-aware multi-agent collaboration as a distinct and unresolved challenge that requires new coordination mechanisms beyond existing agent capabilities.","published_date":"2026-04-13T14:26:38+00:00","viability_score":7,"cluster_label":"Multi-Agent Systems","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PAC-BENCH is a new benchmark for evaluating multi-agent collaboration under privacy constraints, revealing significant performance degradation and coordination breakdowns.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11518v1","title":"From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python","abstract":"Cross-language migration of large software systems is a persistent engineering challenge, particularly when the source codebase evolves rapidly. We present a methodology for LLM-assisted continuous code translation in which a large language model translates a production Rust codebase (648K LOC, 65 crates) into Python (41K LOC, 28 modules), with public agent benchmarks as the objective function driving iterative refinement. Our subject system is Codex CLI, a production AI coding agent. 
We demonstrate that: (1) the Python port resolves 59/80 SWE-bench Verified tasks (73.8%) versus Rust's 56/80 (70.0%), and achieves 42.5% on Terminal-Bench versus Rust's 47.5%, confirming near-parity on real-world agentic tasks; (2) benchmark-driven debugging, revealing API protocol mismatches, environment pollution, a silent WebSocket failure mode, and an API 400 crash, is more effective than static testing alone; (3) the architecture supports continuous upstream synchronisation via an LLM-assisted diff-translate-test loop; and (4) the Python port has evolved into a capability superset with 30 feature-flagged extensions (multi-agent orchestration, semantic memory, guardian safety, cost tracking) absent from Rust, while preserving strict parity mode for comparison. Our evaluation shows that for LLM-based agents where API latency dominates, Python's expressiveness yields a 15.9x code reduction with negligible performance cost, while the benchmark-as-objective-function methodology provides a principled framework for growing a cross-language port from parity into an extended platform.","published_date":"2026-04-13T14:21:44+00:00","viability_score":5,"cluster_label":"AI Software Engineering","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"AI agent leverages LLM for cross-language codebase translation and enhancement from Rust to Python.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11512v1","title":"EdgeCIM: A Hardware-Software Co-Design for CIM-Based Acceleration of Small Language Models","abstract":"The growing demand for deploying Small Language Models (SLMs) on edge devices, including laptops, smartphones, and embedded platforms, has exposed fundamental inefficiencies in existing accelerators. 
While GPUs handle prefill workloads efficiently, the autoregressive decoding phase is dominated by GEMV operations that are inherently memory-bound, resulting in poor utilization and prohibitive energy costs at the edge. In this work, we present EdgeCIM, a hardware-software co-design framework that rethinks accelerator design for end-to-end decoder-only inference. At its core is a CIM macro, implemented in 65nm, coupled with a tile-based mapping strategy that balances pipeline stages, maximizing parallelism while alleviating DRAM bandwidth bottlenecks. Our simulator enables design space exploration of SLMs up to 4B parameters, identifying Pareto-optimal configurations in terms of latency and energy. Compared to an NVIDIA Orin Nano, EdgeCIM achieves up to 7.3x higher throughput and 49.59x better energy efficiency on LLaMA3.2-1B, and delivers 9.95x higher throughput than Qualcomm SA8255P on LLaMA3.2-3B. Extensive benchmarks on TinyLLaMA-1.1B, LLaMA3.2 (1B, 3B), Phi-3.5-mini-3.8B, Qwen2.5 (0.5B, 1.5B, 3B), SmolLM2-1.7B, SmolLM3-3B, and Qwen3 (0.6B, 1.7B, 4B) reveal that our accelerator, under INT4 precision, achieves on average 336.42 tokens/s and 173.02 tokens/J. 
These results establish EdgeCIM as a compelling solution towards real-time, energy-efficient edge-scale SLM inference.","published_date":"2026-04-13T14:16:20+00:00","viability_score":8,"cluster_label":"Edge AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"EdgeCIM is a hardware-software co-design framework for energy-efficient acceleration of small language models on edge devices.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11510v1","title":"Policy Split: Incentivizing Dual-Mode Exploration in LLM Reinforcement with Dual-Mode Entropy Regularization","abstract":"To encourage diverse exploration in reinforcement learning (RL) for large language models (LLMs) without compromising accuracy, we propose Policy Split, a novel paradigm that bifurcates the policy into normal and high-entropy modes with a high-entropy prompt. While sharing model parameters, the two modes undergo collaborative dual-mode entropy regularization tailored to distinct objectives. Specifically, the normal mode optimizes for task correctness, while the high-entropy mode incorporates a preference for exploration, and the two modes learn collaboratively. Extensive experiments demonstrate that our approach consistently outperforms established entropy-guided RL baselines across various model sizes in general and creative tasks. 
Further analysis reveals that Policy Split facilitates dual-mode exploration, where the high-entropy mode generates distinct behavioral patterns to the normal mode, providing unique learning signals.","published_date":"2026-04-13T14:13:06+00:00","viability_score":7,"cluster_label":"LLM Reinforcement Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel reinforcement learning paradigm for LLMs that bifurcates policy into normal and high-entropy modes to improve exploration without sacrificing accuracy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11508v1","title":"Not All Forgetting Is Equal: Architecture-Dependent Retention Dynamics in Fine-Tuned Image Classifiers","abstract":"Fine-tuning pretrained image classifiers is standard practice, yet which individual samples are forgotten during this process, and whether forgetting patterns are stable or architecture dependent, remains unclear. Understanding these dynamics has direct implications for curriculum design, data pruning, and ensemble construction. We track per-sample correctness at every epoch during fine-tuning of ResNet-18 and DeiT-Small on a retinal OCT dataset (7 classes, 56:1 imbalance) and CUB-200-2011 (200 bird species), fitting Ebbinghaus-style exponential decay curves to each sample's retention trace. Five findings emerge. First, the two architectures forget fundamentally different samples: Jaccard overlap of the top 10 percent most-forgotten is 0.34 on OCTDL and 0.15 on CUB-200. Second, ViT forgetting is more structured (mean $R^2 = 0.74$) than CNN forgetting ($R^2 = 0.52$). Third, per-sample forgetting is stochastic across random seeds (Spearman $\u03c1\\approx 0.01$), challenging the assumption that sample difficulty is an intrinsic property. Fourth, class-level forgetting is consistent and semantically interpretable: visually similar species are forgotten most, distinctive ones least. 
Fifth, a sample's loss after head warmup predicts its long-term decay constant ($\u03c1= 0.30$ to $0.50$, $p < 10^{-45}$). These findings suggest that architectural diversity in ensembles provides complementary retention coverage, and that curriculum or pruning methods based on per-sample difficulty may not generalize across runs. A spaced repetition sampler built on these decay constants does not outperform random sampling, indicating that static scheduling cannot exploit unstable per-sample signals.","published_date":"2026-04-13T14:11:47+00:00","viability_score":5,"cluster_label":"Model Forgetting Analysis","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Analyzing architecture-dependent sample forgetting in fine-tuned image classifiers to inform data pruning and curriculum design.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.11507v1","title":"Deep Learning for Sequential Decision Making under Uncertainty: Foundations, Frameworks, and Frontiers","abstract":"Artificial intelligence (AI) is moving increasingly beyond prediction to support decisions in complex, uncertain, and dynamic environments. This shift creates a natural intersection with operations research and management sciences (OR/MS), which have long offered conceptual and methodological foundations for sequential decision-making under uncertainty. At the same time, recent advances in deep learning, including feedforward neural networks, LSTMs, transformers, and deep reinforcement learning, have expanded the scope of data-driven modeling and opened new possibilities for large-scale decision systems. This tutorial presents an OR/MS-centered perspective on deep learning for sequential decision-making under uncertainty. Its central premise is that deep learning is valuable not as a replacement for optimization, but as a complement to it. 
Deep learning brings adaptability and scalable approximation, whereas OR/MS provides the structural rigor needed to represent constraints, recourse, and uncertainty. The tutorial reviews key decision-making foundations, connects them to the major neural architectures in modern AI, and discusses leading approaches to integrating learning and optimization. It also highlights emerging impact in domains such as supply chains, healthcare and epidemic response, agriculture, energy, and autonomous operations. More broadly, it frames these developments as part of a wider transition from predictive AI toward decision-capable AI and highlights the role of OR/MS in shaping the next generation of integrated learning--optimization systems.","published_date":"2026-04-13T14:11:06+00:00","viability_score":2,"cluster_label":"Decision Making AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A tutorial bridging operations research with deep learning for sequential decision-making under uncertainty.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11504v1","title":"Lectures on AI for Mathematics","abstract":"This book provides a comprehensive and accessible introduction to the emerging field of AI for mathematics. It covers the core principles and diverse applications of using artificial intelligence to advance mathematical research. 
Through clear explanations, the text explores how AI can discover hidden mathematical patterns, assist in proving complicated theorems, and even construct counterexamples to challenge conjectures.","published_date":"2026-04-13T14:07:49+00:00","viability_score":1,"cluster_label":"AI for Mathematics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A book introducing the use of AI to discover mathematical patterns, prove theorems, and construct counterexamples.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11502v1","title":"METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models","abstract":"Contextual causal reasoning is a critical yet challenging capability for Large Language Models (LLMs). Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy. To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting. Our extensive evaluation of various LLMs reveals a significant decline in proficiency as tasks ascend the causal hierarchy. To diagnose this degradation, we conduct a deep mechanistic analysis via both error pattern identification and internal information flow tracing. Our analysis reveals two primary failure modes: (1) LLMs are susceptible to distraction by causally irrelevant but factually correct information at lower level of causality; and (2) as tasks ascend the causal hierarchy, faithfulness to the provided context degrades, leading to a reduced performance. We believe our work advances our understanding of the mechanisms behind LLM contextual causal reasoning and establishes a critical foundation for future research. 
Our code and dataset are available at https://github.com/SCUNLP/METER .","published_date":"2026-04-13T14:07:11+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark and analysis tool for evaluating and understanding multi-level contextual causal reasoning in large language models, with code and dataset available.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11501v1","title":"Quantization Dominates Rank Reduction for KV-Cache Compression","abstract":"We compare two strategies for compressing the KV cache in transformer inference: rank reduction (discard dimensions) and quantization (keep all dimensions, reduce precision). At matched storage budgets across five models (124M-14B, MHA and GQA), we find that quantization consistently outperforms rank reduction by 4-364 PPL depending on model and compression level. The gap persists even when rank reduction is combined with quantization in hybrid baselines, and it grows with GQA aggressiveness. On LAMBADA, INT4 matches FP16 accuracy (+0.23 PPL on Mistral 7B, +0.58 on GPT-2) while rank-32 at identical storage collapses to 0.4%.   We trace this gap to a structural asymmetry: under softmax attention routing, removing a dimension can flip which token is attended (a discrete failure), while quantization noise is bounded and typically preserves score ordering. We formalize this via a perturbation result showing projection damage exceeds quantization damage by 3 x 2^(2b) per direction under the softmax Fisher metric. A basis ablation confirms the finding is basis-independent (spread <0.4 PPL), establishing that the advantage comes from preserving dimensions, not from a better coordinate system. 
Joint K+V INT4 quantization achieves 75% total KV reduction at only +0.18 PPL on Mistral 7B.","published_date":"2026-04-13T14:06:18+00:00","viability_score":3,"cluster_label":"LLM Inference Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Quantization is a more effective method than rank reduction for compressing KV caches in transformer inference, showing significant performance gains.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11491v1","title":"ADD for Multi-Bit Image Watermarking","abstract":"As generative models enable rapid creation of high-fidelity images, societal concerns about misinformation and authenticity have intensified. A promising remedy is multi-bit image watermarking, which embeds a multi-bit message into an image so that a verifier can later detect whether the image is generated by someone and further identify the source by decoding the embedded message. Existing approaches often fall short in capacity, resilience to common image distortions, and theoretical justification. To address these limitations, we propose ADD (Add, Dot, Decode), a multi-bit image watermarking method with two stages: learning a watermark to be linearly combined with the multi-bit message and added to the image, and decoding through inner products between the watermarked image and the learned watermark. On the standard MS-COCO benchmark, we demonstrate that for the challenging task of 48-bit watermarking, ADD achieves 100\\% decoding accuracy, with performance dropping by at most 2\\% under a wide range of image distortions, substantially smaller than the 14\\% average drop of state-of-the-art methods. In addition, ADD achieves substantial computational gains, with 2-fold faster embedding and 7.4-fold faster decoding than the fastest existing method. 
We further provide a theoretical analysis explaining why the learned watermark and the corresponding decoding rule are effective.","published_date":"2026-04-13T13:57:29+00:00","viability_score":7,"cluster_label":"Digital Watermarking","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel multi-bit image watermarking method (ADD) that offers high capacity, resilience to distortions, and significant computational gains, with code available.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11490v1","title":"Anthropogenic Regional Adaptation in Multimodal Vision-Language Model","abstract":"While the field of vision-language (VL) has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while ensuring the retention of global generalization capabilities. Second, we present a simple, but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on 3 VL architectures: large vision-language models, text-to-image diffusion models, and vision-language embedding models, and a case study in Southeast Asia (SEA) regional adaptation, we demonstrate the importance of Anthropogenic Regional Adaptation and the effectiveness of GG-EZ, showing 5-15% gains in cultural relevance metrics across SEA while maintaining over 98% of global performance and even occasionally surpassing it. 
Our findings establish Anthropogenic Regional Alignment as a foundational paradigm towards applicability of multimodal vision-language models in diverse regions and demonstrate a simple-yet-effective baseline method that optimizes regional value alignment while preserving global generalization.","published_date":"2026-04-13T13:56:00+00:00","viability_score":7,"cluster_label":"Multimodal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new paradigm and method (GG-EZ) for adapting multimodal vision-language models to specific regional contexts while maintaining global performance, with code available.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11480v1","title":"On the Complexity of the Discussion-based Semantics in Abstraction Argumentation","abstract":"We show that deciding whether an argument a is stronger than an argument b with respect to the discussion-based semantics of Amgoud and Ben-Naim is decidable in polynomial time. At its core, this problem is about deciding whether, for two vertices in a graph, the number of walks of each length ending in those vertices is the same. We employ results from automata theory and reduce this problem to the equivalence problem for semiring automata. 
This offers a new perspective on the computational complexity of ranking semantics, an area in which the complexity of many semantics remains open.","published_date":"2026-04-13T13:47:30+00:00","viability_score":1,"cluster_label":"AI Theory","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper theoretically analyzes the complexity of argumentation semantics, offering new perspectives on computational complexity but with no clear path to a product.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11477v1","title":"OOM-RL: Out-of-Money Reinforcement Learning Market-Driven Alignment for LLM-Based Multi-Agent Systems","abstract":"The alignment of Multi-Agent Systems (MAS) for autonomous software engineering is constrained by evaluator epistemic uncertainty. Current paradigms, such as Reinforcement Learning from Human Feedback (RLHF) and AI Feedback (RLAIF), frequently induce model sycophancy, while execution-based environments suffer from adversarial \"Test Evasion\" by unconstrained agents. In this paper, we introduce an objective alignment paradigm: \\textbf{Out-of-Money Reinforcement Learning (OOM-RL)}. By deploying agents into the non-stationary, high-friction reality of live financial markets, we utilize critical capital depletion as an un-hackable negative gradient. Our longitudinal 20-month empirical study (July 2024 -- February 2026) chronicles the system's evolution from a high-turnover, sycophantic baseline to a robust, liquidity-aware architecture. We demonstrate that the undeniable ontological consequences of financial loss forced the MAS to abandon overfitted hallucinations in favor of the \\textbf{Strict Test-Driven Agentic Workflow (STDAW)}, which enforces a Byzantine-inspired uni-directional state lock (RO-Lock) anchored to a deterministically verified $\\geq 95\\%$ code coverage constraint matrix. 
Our results show that while early iterations suffered severe execution decay, the final OOM-RL-aligned system achieved a stable equilibrium with an annualized Sharpe ratio of 2.06 in its mature phase. We conclude that substituting subjective human preference with rigorous economic penalties provides a robust methodology for aligning autonomous agents in high-stakes, real-world environments, laying the groundwork for generalized paradigms where computational billing acts as an objective physical constraint","published_date":"2026-04-13T13:45:42+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"OOM-RL uses financial market losses as an objective alignment signal for LLM agents, demonstrating a path to robust, liquidity-aware autonomous systems.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11467v1","title":"From Attribution to Action: A Human-Centered Application of Activation Steering","abstract":"Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). 
Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.","published_date":"2026-04-13T13:41:57+00:00","viability_score":5,"cluster_label":"Explainable AI (XAI)","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper introduces an interactive workflow and web-based tool for activation steering, enabling practitioners to intervene and test hypotheses with vision models.","time_to_mvp":"1-3 months","tags":[]},{"arxiv_id":"2604.11466v1","title":"SLALOM: Simulation Lifecycle Analysis via Longitudinal Observation Metrics for Social Simulation","abstract":"Large Language Model (LLM) agents offer a potentially-transformative path forward for generative social science but face a critical crisis of validity. Current simulation evaluation methodologies suffer from the \"stopped clock\" problem: they confirm that a simulation reached the correct final outcome while ignoring whether the trajectory leading to it was sociologically plausible. Because the internal reasoning of LLMs is opaque, verifying the \"black box\" of social mechanisms remains a persistent challenge. In this paper, we introduce SLALOM (Simulation Lifecycle Analysis via Longitudinal Observation Metrics), a framework that shifts validation from outcome verification to process fidelity. Drawing on Pattern-Oriented Modeling (POM), SLALOM treats social phenomena as multivariate time series that must traverse specific SLALOM gates, or intermediate waypoint constraints representing distinct phases. 
By utilizing Dynamic Time Warping (DTW) to align simulated trajectories with empirical ground truth, SLALOM offers a quantitative metric to assess structural realism, helping to differentiate plausible social dynamics from stochastic noise and contributing to more robust policy simulation standards.","published_date":"2026-04-13T13:40:50+00:00","viability_score":4,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"SLALOM is a framework for validating LLM agent social simulations by assessing process fidelity through longitudinal observation metrics, moving beyond outcome verification.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11465v1","title":"Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents","abstract":"Large language model (LLM) agents show promise on realistic tool-use tasks, but deploying capable agents on modest hardware remains challenging. We study whether inference-time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi-step environments. Operating on a single 24\\,GB GPU, we evaluate Qwen3-8B under both full-precision (FP16, 12K context) and 4-bit quantized (AWQ, 32K context) configurations. Without any intervention, the raw model achieves just 5.4\\% (FP16) and 3.0\\% (AWQ) task goal completion. Guided by a systematic failure mode analysis, we introduce a three-tier inference scaffolding pipeline that deploys the same frozen model in three distinct roles: (1) a summarization model that preserves critical artifacts (tokens, credentials, API responses) while compressing dialogue history; (2) the main agent model that reasons over the compressed context; and (3) an isolated correction model that reviews and revises the agent's code output without access to conversation history, breaking repetitive failure loops. 
Applied to the same unmodified model, this scaffolding yields 8.9\\% (FP16) and 5.9\\% (AWQ) task goal completion, roughly doubling performance in both settings, with particularly strong gains on difficulty-1 tasks (15.8\\%$\\to$26.3\\% FP16; 5.3\\%$\\to$14.0\\% AWQ). On full-precision inference, our scaffolded 8B model surpasses DeepSeek-Coder 33B Instruct (7.1\\%) from the original AppWorld evaluation, demonstrating that structured inference-time interventions can make small models competitive with systems 4$\\times$ their size. We formalize the approach as a scaffolded policy over a frozen base model, three invocations of the same weights with different conditioning, drawing connections to test-time compute scaling and action-space shaping in reinforcement learning.","published_date":"2026-04-13T13:40:33+00:00","viability_score":7,"cluster_label":"LLM Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel inference-time scaffolding technique significantly boosts the performance of small LLM agents on complex tasks, making them competitive with much larger models without additional training.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11462v1","title":"Escaping the Context Bottleneck: Active Context Curation for LLM Agents via Reinforcement Learning","abstract":"Large Language Models (LLMs) struggle with long-horizon tasks due to the \"context bottleneck\" and the \"lost-in-the-middle\" phenomenon, where accumulated noise from verbose environments degrades reasoning over multi-turn interactions. To address this issue, we introduce a symbiotic framework that decouples context management from task execution. Our architecture pairs a lightweight, specialized policy model, ContextCurator, with a powerful frozen foundation model, TaskExecutor. Trained via reinforcement learning, ContextCurator actively reduces information entropy in the working memory. 
It aggressively prunes environmental noise while preserving reasoning anchors, that is, sparse data points that are critical for future deductions. On WebArena, our framework improves the success rate of Gemini-3.0-flash from 36.4% to 41.2% while reducing token consumption by 8.8% (from 47.4K to 43.3K). On DeepSearch, it achieves a 57.1% success rate, compared with 53.9%, while reducing token consumption by a factor of 8. Remarkably, a 7B ContextCurator matches the context management performance of GPT-4o, providing a scalable and computationally efficient paradigm for autonomous long-horizon agents.","published_date":"2026-04-13T13:39:17+00:00","viability_score":6,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A reinforcement learning-based framework actively curates context for LLM agents, reducing noise and improving performance on long-horizon tasks while significantly cutting token consumption.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus"]},{"arxiv_id":"2604.11446v1","title":"Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration","abstract":"Recently, scaling reinforcement learning with verifiable rewards (RLVR) for large language models (LLMs) has emerged as an effective training paradigm for significantly improving model capabilities, which requires guiding the model to perform extensive exploration and learning, leading to substantial computational overhead and becoming a key challenge. To reduce the number of training steps, prior work performs linear extrapolation of model parameters. However, the dynamics of model parameter updates during RLVR training remain insufficiently understood. To further investigate the evolution of LLMs during RLVR training, we conduct empirical experiments and find that the rank-1 subspace of the model does not evolve linearly, and its dominance over the original parameters is further amplified during LoRA training. 
Based on the above insights, we propose the \\textbf{N}onlinear \\textbf{Ext}rapolation of low-rank trajectories (\\textbf{NExt}), a novel framework that models and extrapolates low-rank parameter trajectories in a nonlinear manner. Concretely, we first train the model using LoRA and extract the rank-1 subspace of parameter differences at multiple training steps, which is then used for the subsequent nonlinear extrapolation. Afterward, we utilized the extracted rank-1 subspace to train a predictor, which can model the trajectory of parameter updates during RLVR, and then perform the predict-extend process to extrapolate model parameters, achieving the acceleration of RLVR. To further study and understand NExt, we conduct comprehensive experiments that demonstrate the effectiveness and robustness of the method. Our method reduces computational overhead by approximately 37.5\\% while remaining compatible with a wide range of RLVR algorithms and tasks. We release our code in https://github.com/RUCAIBox/NExt.","published_date":"2026-04-13T13:28:12+00:00","viability_score":7,"cluster_label":"LLM Training","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel framework models and non-linearly extrapolates low-rank optimization trajectories to accelerate LLM reinforcement learning with verifiable rewards, reducing computational overhead.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11435v1","title":"Think Before you Write: QA-Guided Reasoning for Character Descriptions in Books","abstract":"Character description generation is an important capability for narrative-focused applications such as summarization, story analysis, and character-driven simulations. 
However, generating accurate character descriptions from long-form narratives (e.g., novels) is challenging: models must track evolving attributes (e.g., relationships and events), integrate evidence scattered across the text, and infer implicit details. Despite the success of reasoning-enabled LLMs on many benchmarks, we find that for character description generation their performance improves when built-in reasoning is disabled (i.e., an empty reasoning trace). Motivated by this, we propose a training framework that decouples reasoning from generation. Our approach, which can be applied on top of long-context LLMs or chunk-based methods, consists of a reasoning model that produces a structured QA reasoning trace and a generation model that conditions on this trace to produce the final character description. Experiments on two datasets (BookWorm and CroSS) show that QA-guided reasoning improves faithfulness, informativeness, and grounding over strong long-context baselines.","published_date":"2026-04-13T13:19:56+00:00","viability_score":7,"cluster_label":"LLM Applications","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A QA-guided reasoning framework decouples reasoning from generation to produce more faithful and informative character descriptions from long-form narratives.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11430v1","title":"Hardening x402: PII-Safe Agentic Payments via Pre-Execution Metadata Filtering","abstract":"AI agents that pay for resources via the x402 protocol embed payment metadata - resource URLs, descriptions, and reason strings - in every HTTP payment request. This metadata is transmitted to the payment server and to the centralised facilitator API before any on-chain settlement occurs; neither party is typically bound by a data processing agreement. 
We present presidio-hardened-x402, the first open-source middleware that intercepts x402 payment requests before transmission to detect and redact personally identifiable information (PII), enforce declarative spending policies, and block duplicate replay attempts. To evaluate the PII filter, we construct a labeled synthetic corpus of 2,000 x402 metadata triples spanning seven use-case categories, and run a 42-configuration precision/recall sweep across two detection modes (regex, NLP) and five confidence thresholds. The recommended configuration (mode=nlp, min_score=0.4, all entity types) achieves micro-F1 = 0.894 with precision 0.972, at a p99 latency of 5.73ms - well within the 50ms overhead budget. The middleware, corpus, and all experiment code are publicly available at https://github.com/presidio-v/presidio-hardened-x402.","published_date":"2026-04-13T13:17:12+00:00","viability_score":7,"cluster_label":"AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An open-source middleware that filters PII from AI agent payment requests, ensuring privacy and policy compliance before transactions.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11427v1","title":"METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues","abstract":"Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies. We propose METRO, a method that leverages large language models to autonomously induce both strategy actions and planning logic directly from raw transcripts. METRO formalizes expert knowledge into a Strategy Forest, a hierarchical structure that captures both short-term responses (nodes) and long-term strategic foresight (branches). Experimental results across two benchmarks show that METRO demonstrates promising performance, outperforming existing methods by an average of 9%-10%. 
Our further analysis not only reveals the success behind METRO (strategic behavioral diversity and foresight), but also demonstrates its robust cross-task transferability. This offers new insights into building non-collaborative agents in a cost-effective and scalable way. Our code is available at https://github.com/Humphrey-0125/METRO.","published_date":"2026-04-13T13:12:02+00:00","viability_score":7,"cluster_label":"LLM Applications","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A method to autonomously induce dialogue strategies from expert transcripts for scalable non-collaborative agent development.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11422v1","title":"Emulating Non-Differentiable Metrics via Knowledge-Guided Learning: Introducing the Minkowski Image Loss","abstract":"The ``differentiability gap'' presents a primary bottleneck in Earth system deep learning: since models cannot be trained directly on non-differentiable scientific metrics and must rely on smooth proxies (e.g., MSE), they often fail to capture high-frequency details, yielding ``blurry'' outputs. We develop a framework that bridges this gap using two different methods to deal with non-differentiable functions: the first is to analytically approximate the original non-differentiable function into a differentiable equivalent one; the second is to learn differentiable surrogates for scientific functionals. We formulate the analytical approximation by relaxing discrete topological operations using temperature-controlled sigmoids and continuous logical operators. Conversely, our neural emulator uses Lipschitz-convolutional neural networks to stabilize gradient learning via: (1) spectral normalization to bound the Lipschitz constant; and (2) hard architectural constraints enforcing geometric principles. 
We demonstrate this framework's utility by developing the Minkowski image loss, a differentiable equivalent for the integral-geometric measures of surface precipitation fields (area, perimeter, connected components). Validated on the EUMETNET OPERA dataset, our constrained neural surrogate achieves high emulation accuracy, completely eliminating the geometric violations observed in unconstrained baselines. However, applying these differentiable surrogates to a deterministic super-resolution task reveals a fundamental trade-off: while strict Lipschitz regularization ensures optimization stability, it inherently over-smooths gradient signals, restricting the recovery of highly localized convective textures. This work highlights the necessity of coupling such topological constraints with stochastic generative architectures to achieve full morphological realism.","published_date":"2026-04-13T13:04:57+00:00","viability_score":4,"cluster_label":"Scientific AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework to bridge the differentiability gap in Earth system deep learning by emulating non-differentiable scientific metrics.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11419v1","title":"Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval","abstract":"Cyber threat intelligence (CTI) analysts must answer complex questions over large collections of narrative security reports. Retrieval-augmented generation (RAG) systems help language models access external knowledge, but traditional vector retrieval often struggles with queries that require reasoning over relationships between entities such as threat actors, malware, and vulnerabilities. This limitation arises because relevant evidence is often distributed across multiple text fragments and documents. 
Knowledge graphs address this challenge by enabling structured multi-hop reasoning through explicit representations of entities and relationships. However, multiple retrieval paradigms, including graph-based, agentic, and hybrid approaches, have emerged with different assumptions and failure modes. It remains unclear how these approaches compare in realistic CTI settings and when graph grounding improves performance. We present a systematic evaluation of four RAG architectures for CTI analysis: standard vector retrieval, graph-based retrieval over a CTI knowledge graph, an agentic variant that repairs failed graph queries, and a hybrid approach combining graph queries with text retrieval. We evaluate these systems on 3,300 CTI question-answer pairs spanning factual lookups, multi-hop relational queries, analyst-style synthesis questions, and unanswerable cases. Results show that graph grounding improves performance on structured factual queries. The hybrid graph-text approach improves answer quality by up to 35 percent on multi-hop questions compared to vector RAG, while maintaining more reliable performance than graph-only systems.","published_date":"2026-04-13T13:02:44+00:00","viability_score":4,"cluster_label":"Cyber Threat Intelligence","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A systematic evaluation of graph-based and agentic retrieval methods for Cyber Threat Intelligence, outperforming traditional RAG on complex queries.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11417v1","title":"Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech","abstract":"Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot systems generate rhythmic beat-like motion, yet few integrate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. 
The model outperforms GPT-4o in both semantic gesture placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.","published_date":"2026-04-13T13:02:02+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A lightweight transformer model that predicts emotion-aware iconic gestures for robot co-speech, outperforming GPT-4o and suitable for real-time deployment.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11407v1","title":"Retrieval as Generation: A Unified Framework with Self-Triggered Information Planning","abstract":"We revisit retrieval-augmented generation (RAG) by embedding retrieval control directly into generation. Instead of treating retrieval as an external intervention, we express retrieval decisions within token-level decoding, enabling end-to-end coordination without additional controllers or classifiers. Under the paradigm of Retrieval as Generation, we propose \\textbf{GRIP} (\\textbf{G}eneration-guided \\textbf{R}etrieval with \\textbf{I}nformation \\textbf{P}lanning), a unified framework in which the model regulates retrieval behavior through control-token emission. Central to GRIP is \\textit{Self-Triggered Information Planning}, which allows the model to decide when to retrieve, how to reformulate queries, and when to terminate, all within a single autoregressive trajectory. This design tightly couples retrieval and reasoning and supports dynamic multi-step inference with on-the-fly evidence integration. To supervise these behaviors, we construct a structured training set covering answerable, partially answerable, and multi-hop queries, each aligned with specific token patterns. 
Experiments on five QA benchmarks show that GRIP surpasses strong RAG baselines and is competitive with GPT-4o while using substantially fewer parameters.","published_date":"2026-04-13T12:53:17+00:00","viability_score":7,"cluster_label":"Retrieval-Augmented Generation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified framework for retrieval-augmented generation that embeds retrieval control directly into the decoding process, outperforming strong baselines and competing with GPT-4o with fewer parameters.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11403v1","title":"One Scale at a Time: Scale-Autoregressive Modeling for Fluid Flow Distributions","abstract":"Analyzing unsteady fluid flows often requires access to the full distribution of possible temporal states, yet conventional PDE solvers are computationally prohibitive and learned time-stepping surrogates quickly accumulate error over long rollouts. Generative models avoid compounding error by sampling states independently, but diffusion and flow-matching methods, while accurate, are limited by the cost of many evaluations over the entire mesh. We introduce scale-autoregressive modeling (SAR) for sampling flows on unstructured meshes hierarchically from coarse to fine: it first generates a low-resolution field, then refines it by progressively sampling higher resolutions conditioned on coarser predictions. This coarse-to-fine factorization improves efficiency by concentrating computation at coarser scales, where uncertainty is greatest, while requiring fewer steps at finer scales. Across unsteady-flow benchmarks of varying complexity, SAR attains substantially lower distributional error and higher per-sample accuracy than state-of-the-art diffusion models based on multi-scale GNNs, while matching or surpassing a flow-matching Transolver (a linear-time transformer) yet running 2-7x faster than this depending on the task. 
Overall, SAR provides a practical tool for fast and accurate estimation of statistical flow quantities (e.g., turbulent kinetic energy and two-point correlations) in real-world settings.","published_date":"2026-04-13T12:44:04+00:00","viability_score":7,"cluster_label":"Scientific ML","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A scale-autoregressive modeling approach for fluid flow distributions that achieves significantly lower error and higher accuracy than state-of-the-art diffusion models while running faster.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11378v1","title":"From Agent Loops to Structured Graphs:A Scheduler-Theoretic Framework for LLM Agent Execution","abstract":"The dominant paradigm for building LLM based agents is the Agent Loop, an iterative cycle where a single language model decides what to do next by reading an ever growing context window. This paradigm has three structural weaknesses: implicit dependencies between steps, unbounded recovery loops, and mutable execution history that complicates debugging. We characterize the Agent Loop as a single ready unit scheduler: at any moment, at most one executable unit is active, and the choice of which unit to activate comes from opaque LLM inference rather than an inspectable policy. This perspective places Agent Loops and graph based execution engines on a single semantic continuum. We propose SGH, Structured Graph Harness, which lifts control flow from implicit context into an explicit static DAG. SGH makes three commitments: execution plans are immutable within a plan version, planning execution and recovery are separated into three layers, and recovery follows a strict escalation protocol. These choices trade some expressiveness for controllability, verifiability, and implementability. 
Our contributions are fourfold: a scheduler unified framework that applies classical scheduling theory to LLM agent execution and identifies challenges introduced by non deterministic LLM nodes; a trade off analysis of controllability, expressiveness, and implementability across 70 surveyed systems; a formal specification including a node state machine with termination and soundness guarantees; and an attributable experimental framework with a seven group design for future validation. This is a position paper and design proposal. We provide a theoretical framework, design analysis, and experimental protocol, not a production implementation or empirical results.","published_date":"2026-04-13T12:16:45+00:00","viability_score":2,"cluster_label":"LLM Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A theoretical framework for LLM agent execution that proposes a structured graph harness to address weaknesses in the agent loop paradigm, focusing on controllability and verifiability.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11376v1","title":"From Redaction to Restoration: Deep Learning for Medical Image Anonymization and Reconstruction","abstract":"Removing patient-specific information from medical images is crucial to enable sharing and open science without compromising patient identities. However, many methods currently used for deidentification have negative effects on downstream image analysis tasks because of removal of relevant but non-identifiable information. This work presents an end-to-end deep learning framework for transforming raw clinical image volumes into de-identified, analysis-ready datasets without compromising downstream utility. 
The methodology developed and tested in this work first detects and redacts regions likely to contain protected health information (PHI), such as burned-in text and metadata, and then uses a generative deep learning model to inpaint the redacted areas with anatomically and imaging plausible content. The proposed pipeline leverages a lightweight hybrid architecture, combining CRNN-based redaction with a latent-diffusion inpainting restoration module (Stable Diffusion 2). We evaluate the approach using both privacy-oriented metrics, which quantify residual PHI and success of redaction, and image-quality and task-based metrics, which assess the fidelity of restored volumes for representative deep learning applications. Our results suggest that the proposed method yields de-identified medical images that are visually coherent, maintaining fidelity for downstream models, while substantially reducing the risk of patient re-identification. By automating anonymization and image reconstruction within a single workflow, the framework facilitates the dissemination of large-scale medical imaging collections, thereby lowering a key barrier to data sharing and multi-institutional collaboration in medical imaging AI.","published_date":"2026-04-13T12:15:52+00:00","viability_score":8,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An end-to-end deep learning framework for medical image anonymization and reconstruction that redacts PHI and inpaints with plausible content, maintaining downstream utility.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11373v1","title":"Minimal Embodiment Enables Efficient Learning of Number Concepts in Robot","abstract":"Robots are increasingly entering human-interactive scenarios that require understanding of quantity. 
How intelligent systems acquire abstract numerical concepts from sensorimotor experience remains a fundamental challenge in cognitive science and artificial intelligence. Here we investigate embodied numerical learning using a neural network model trained to perform sequential counting through naturalistic robotic interaction with a Franka Panda manipulator. We demonstrate that embodied models achieve 96.8\\% counting accuracy with only 10\\% of training data, compared to 60.6\\% for vision-only baselines. This advantage persists when visual-motor correspondences are randomized, indicating that embodiment functions as a structural prior that regularizes learning rather than as an information source. The model spontaneously develops biologically plausible representations: number-selective units with logarithmic tuning, mental number line organization, Weber-law scaling, and rotational dynamics encoding numerical magnitude ($r = 0.97$, slope $= 30.6\u00b0$/count). The learning trajectory parallels children's developmental progression from subset-knowers to cardinal-principle knowers. 
These findings demonstrate that minimal embodiment can ground abstract concepts, improve data efficiency, and yield interpretable representations aligned with biological cognition, which may contribute to embodied mathematics tutoring and safety-critical industrial applications.","published_date":"2026-04-13T12:14:58+00:00","viability_score":7,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Minimal embodiment in robots enables efficient learning of abstract number concepts, outperforming vision-only models and mirroring human developmental progression.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11365v1","title":"Learning from Contrasts: Synthesizing Reasoning Paths from Diverse Search Trajectories","abstract":"Monte Carlo Tree Search (MCTS) has been widely used for automated reasoning data exploration, but current supervision extraction methods remain inefficient. Standard approaches retain only the single highest-reward trajectory, discarding the comparative signals present in the many explored paths. Here we introduce \\textbf{Contrastive Reasoning Path Synthesis (CRPS)}, a framework that transforms supervision extraction from a filtering process into a synthesis procedure. CRPS uses a structured reflective process to analyze the differences between high- and low-quality search trajectories, extracting explicit information about strategic pivots and local failure modes. These insights guide the synthesis of reasoning chains that incorporate success patterns while avoiding identified pitfalls. We show empirically that models fine-tuned on just 60K CRPS-synthesized examples match or exceed the performance of baselines trained on 590K examples derived from standard rejection sampling, a 20$\\times$ reduction in dataset size. 
Furthermore, CRPS improves generalization on out-of-domain benchmarks, demonstrating that learning from the contrast between success and failure produces more transferable reasoning capabilities than learning from success alone.","published_date":"2026-04-13T12:05:34+00:00","viability_score":8,"cluster_label":"Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel framework synthesizes reasoning paths from diverse search trajectories, drastically reducing data requirements for training AI models while improving generalization.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11364v1","title":"The Missing Knowledge Layer in Cognitive Architectures for AI Agents","abstract":"The two most influential cognitive architecture frameworks for AI agents, CoALA [21] and JEPA [12], both lack an explicit Knowledge layer with its own persistence semantics. This gap produces a category error: systems apply cognitive decay to factual claims, or treat facts and experiences with identical update mechanics. We survey persistence semantics across existing memory systems and identify eight convergence points, from Karpathy's LLM Knowledge Base [10] to the BEAM benchmark's near-zero contradiction-resolution scores [22], all pointing to related architectural gaps. We propose a four-layer decomposition (Knowledge, Memory, Wisdom, Intelligence) where each layer has fundamentally different persistence semantics: indefinite supersession, Ebbinghaus decay, evidence-gated revision, and ephemeral inference respectively. Companion implementations in Python and Rust demonstrate the architectural separation is feasible. We borrow terminology from cognitive science as a useful analogy (the Knowledge/Memory distinction echoes Tulving's trichotomy), but our layers are engineering constructs justified by persistence-semantics requirements, not by neural architecture. 
We argue that these distinctions demand distinct persistence semantics in engineering implementations, and that no current framework or system provides this.","published_date":"2026-04-13T12:05:30+00:00","viability_score":4,"cluster_label":"AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Proposes a four-layer cognitive architecture for AI agents with distinct persistence semantics for knowledge, memory, wisdom, and intelligence, addressing a gap in current frameworks.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11359v1","title":"CoRe-ECG: Advancing Self-Supervised Representation Learning for 12-Lead ECG via Contrastive and Reconstructive Synergy","abstract":"Accurate interpretation of electrocardiogram (ECG) remains challenging due to the scarcity of labeled data and the high cost of expert annotation. Self-supervised learning (SSL) offers a promising solution by enabling models to learn expressive representations from unlabeled signals. Existing ECG SSL methods typically rely on either contrastive learning or reconstructive learning. However, each approach in isolation provides limited supervisory signals and suffers from additional limitations, including non-physiological distortions introduced by naive augmentations and trivial correlations across multiple leads that models may exploit as shortcuts. In this work, we propose CoRe-ECG, a unified contrastive and reconstructive pretraining paradigm that establishes a synergistic interaction between global semantic modeling and local structural learning. CoRe-ECG aligns global representations during reconstruction, enabling instance-level discriminative signals to guide local waveform recovery. 
To further enhance pretraining, we introduce Frequency Dynamic Augmentation (FDA) to adaptively perturb ECG signals based on their frequency-domain importance, and Spatio-Temporal Dual Masking (STDM) to break linear dependencies across leads, increasing the difficulty of reconstructive tasks. Our method achieves state-of-the-art performance across multiple downstream ECG datasets. Ablation studies further demonstrate the necessity and complementarity of each component. This approach provides a robust and physiologically meaningful representation learning framework for ECG analysis.","published_date":"2026-04-13T11:56:43+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel self-supervised learning paradigm for ECG analysis synergistically combines contrastive and reconstructive methods with advanced augmentation techniques to achieve state-of-the-art performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11337v1","title":"Governance by Design: A Parsonian Institutional Architecture for Internet-Wide Agent Societies","abstract":"The dominant paradigm of local multi-agent systems -- orchestrated, enterprise-bounded pipelines -- is being superseded by internet-wide agent societies in which autonomous agents discover each other through open registries, interact without central orchestrators, and generate emergent social behaviors. We argue that governing such societies requires institutional design, not merely risk enumeration or process compliance. Applying Talcott Parsons' AGIL framework -- four functional imperatives (Adaptation, Goal Attainment, Integration, Latency) every viable social system must satisfy -- we derive a prescriptive sixteen-cell institutional architecture for internet-wide agent governance. 
Diagnostically applied to the OpenClaw ecosystem (250,000+ GitHub stars, 2M+ monthly users, 770,000+ registered agents) via a recursive sub-function analysis (64 binary indicators across 16 cells), we find at most 19% sub-function coverage (sensitivity range 17-30%) -- potential rather than operative capacity, since zero inter-cell coordination prevents existing infrastructure from participating in inter-pillar interchange. A complementary interchange media assessment finds zero of twelve inter-pillar pathways functional: the ecosystem has technical infrastructure but no active governance, no coordination layer, and no normative grounding, with the Fiduciary and Political pillars most severely underserved. Extending the diagnostic to the broader agent-native protocol stack (MCP, A2A, ANP, x402, ERC-8004), independent development teams reproduce the same structural pattern -- confirming the governance gap is a feature of market-driven development, not ecosystem immaturity. Institutional design is most effective before social patterns calcify; we conclude with a prioritized roadmap for the missing governance infrastructure.","published_date":"2026-04-13T11:37:46+00:00","viability_score":2,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A theoretical framework for governing internet-wide agent societies using a Parsonian institutional architecture.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11334v1","title":"Dynamic Summary Generation for Interpretable Multimodal Depression Detection","abstract":"Depression remains widely underdiagnosed and undertreated because stigma and subjective symptom ratings hinder reliable screening. To address this challenge, we propose a coarse-to-fine, multi-stage framework that leverages large language models (LLMs) for accurate and interpretable detection. The pipeline performs binary screening, five-class severity classification, and continuous regression. 
At each stage, an LLM produces progressively richer clinical summaries that guide a multimodal fusion module integrating text, audio, and video features, yielding predictions with transparent rationale. The system then consolidates all summaries into a concise, human-readable assessment report. Experiments on the E-DAIC and CMDC datasets show significant improvements over state-of-the-art baselines in both accuracy and interpretability.","published_date":"2026-04-13T11:34:40+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An interpretable multimodal AI for depression detection that uses LLMs to generate clinical summaries and guide predictions.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11332v1","title":"A Compact and Efficient 1.251 Million Parameter Machine Learning CNN Model PD36-C for Plant Disease Detection: A Case Study","abstract":"Deep learning has markedly advanced image based plant disease diagnosis as improved hardware and dataset quality have enabled increasingly accurate neural network models. This paper presents PD36 C, a compact convolutional neural network (1,250,694 parameters and 4.77 MB) for plant disease classification. Trained with TensorFlow Keras on the New Plant Diseases Dataset (87k images, 38 classes), PD36 C is designed for robustness and edge deployability, complemented by a Qt for Python desktop application that offers an intuitive GUI and offline inference on commodity hardware. Across experiments, training accuracy reached 0.99697 by epoch 30, and average test accuracy was 0.9953 across 38 classes. 
Per class performance is uniformly high; on the lower end, Corn (maize) Cercospora leaf spot achieved precision around 0.9777 and recall around 0.9634, indicating occasional confusion with visually similar categories, while on the upper end numerous classes including Apple Black rot, Cedar apple rust, Blueberry healthy, Cherry Powdery mildew, Cherry healthy, and all four grape categories achieved perfect precision 1.00 and recall of 1.00, indicating no false positives and strong coverage. These results show that with a well curated dataset and careful architectural design, small CNNs can achieve competitive accuracy compared with recent baselines while remaining practical for edge scenarios. We also note typical constraints such as adverse weather, low quality imagery, and leaves exhibiting multiple concurrent diseases that can degrade performance and warrant future work on domain robustness. Overall, PD36 C and its application pipeline contribute a field ready, efficient solution for AI assisted plant disease detection in smart agriculture.","published_date":"2026-04-13T11:33:24+00:00","viability_score":8,"cluster_label":"Computer Vision","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A compact and efficient CNN model for plant disease detection with a user-friendly desktop application for edge deployment.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11328v1","title":"Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees","abstract":"Automatic prompt optimization (APO) hinges on the quality of its evaluation signal, yet scoring every prompt candidate on the full training set is prohibitively expensive. Existing methods either fix a single evaluation subset before optimization begins (principled but prompt-agnostic) or adapt it heuristically during optimization (flexible but unstable and lacking formal guarantees). 
We observe that APO naturally maps to an online adaptive testing problem: prompts are examinees, training examples are test items, and the scheduler should select items that best discriminate among the strongest candidates. This insight motivates Prompt-Aware Online Evaluation Scheduling (POES), which integrates an IRT-based discrimination utility, a facility-location coverage term, and switching-cost-aware warm-start swaps into a unified objective that is provably monotone submodular, yielding a (1-1/e) greedy guarantee for cold starts and bounded drift for warm-start updates. An adaptive controller modulates the exploration-exploitation balance based on optimization progress. Across 36 tasks spanning three benchmark families, POES achieves the highest overall average accuracy (6.2 percent improvement over the best baseline) with negligible token overhead (approximately 4 percent) at the same evaluation budget. Moreover, principled selection at k = 20 examples matches or exceeds the performance of naive evaluation at k = 30-50, reducing token consumption by 35-60 percent, showing that selecting smarter is more effective than selecting more. Our results demonstrate that evaluation scheduling is a first-class component of APO, not an implementation detail.","published_date":"2026-04-13T11:31:04+00:00","viability_score":7,"cluster_label":"LLM Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An AI-powered evaluation scheduler that significantly reduces token consumption for prompt optimization with formal guarantees.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11322v1","title":"Do LLMs Know Tool Irrelevance? Demystifying Structural Alignment Bias in Tool Invocations","abstract":"Large language models (LLMs) have demonstrated impressive capabilities in utilizing external tools. 
In practice, however, LLMs are often exposed to tools that are irrelevant to the user's query, in which case the desired behavior is to refrain from invocations. In this work, we identify a widespread yet overlooked mechanistic flaw in tool refusal, which we term structural alignment bias: Even when a tool fails to serve the user's goal, LLMs still tend to invoke it whenever query attributes can be validly assigned to tool parameters. To systematically study this bias, we introduce SABEval, a new dataset that decouples structural alignment from semantic relevance. Our analysis shows that structural alignment bias induces severe tool-invocation errors in LLMs, yet remains largely unaccounted for in existing evaluations. To investigate the internal mechanisms underlying this bias, we propose Contrastive Attention Attribution, which reveals two competing pathways for semantic checking and structural matching. The relative strength of these pathways drives LLMs' tool invocation decisions. Based on these findings, we further introduce a rebalancing strategy that effectively mitigates structural alignment bias, as demonstrated by extensive experiments, without degrading general tool-use capabilities.","published_date":"2026-04-13T11:23:36+00:00","viability_score":7,"cluster_label":"LLM Tool Use","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new dataset and method to identify and mitigate structural alignment bias in LLM tool usage, improving refusal of irrelevant tools.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11315v1","title":"S$^3$: Structured Sparsity Specification","abstract":"We introduce the Structured Sparsity Specification (S$^3$), an algebraic framework for defining, composing, and implementing structured sparse patterns. 
S$^3$ specifies sparsity through three components: a View that reshapes the tensor via layout composition, a Block specification that defines the atomic pruning unit, and the sparsity decision Scope. Both Block and Scope support Coupling across tensors for coordinated sparsification. S$^3$ enables precise specification of diverse sparsity structures, from fine-grained N:M patterns to coarse channel pruning, and integrates seamlessly with Optimal Brain Damage (OBD) and Optimal Brain Surgeon (OBS). We formalize the framework mathematically, demonstrate its expressiveness on canonical patterns, and validate it experimentally via structured OBS and OBD implementations built entirely on S$^3$, which surpasses well-established second-order heuristics on output reconstruction across common configurations.","published_date":"2026-04-13T11:18:33+00:00","viability_score":4,"cluster_label":"Model Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An algebraic framework for specifying and implementing structured sparsity patterns in neural networks, improving reconstruction accuracy.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11312v1","title":"Network Effects and Agreement Drift in LLM Debates","abstract":"Large Language Models (LLMs) have demonstrated an unprecedented ability to simulate human-like social behaviors, making them useful tools for simulating complex social systems. However, it remains unclear to what extent these simulations can be trusted to accurately capture key social mechanisms, particularly in highly unbalanced contexts involving minority groups. This paper uses a network generation model with controlled homophily and class sizes to examine how LLM agents behave collectively in multi-round debates. Moreover, our findings highlight a particular directional susceptibility that we term agreement drift, in which agents are more likely to shift toward specific positions on the opinion scale. 
Overall, our findings highlight the need to disentangle structural effects from model biases before treating LLM populations as behavioral proxies for human groups.","published_date":"2026-04-13T11:16:58+00:00","viability_score":4,"cluster_label":"LLM Social Simulation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Investigating agreement drift and network effects in LLM debates to understand their limitations as proxies for human social behavior.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11309v1","title":"The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems","abstract":"Large Language Models (LLMs) face prominent security risks from jailbreaking, a practice that manipulates models to bypass built-in security constraints and generate unethical or unsafe content. Among various jailbreak techniques, multi-turn jailbreak attacks are more covert and persistent than single-turn counterparts, exposing critical vulnerabilities of LLMs. However, existing multi-turn jailbreak methods suffer from two fundamental limitations that limit their impact in real-world scenarios: (a) as models become more context-aware, any explicit harmful trigger is increasingly likely to be flagged and blocked; (b) successful final-step triggers often require finely tuned, model-specific contexts, making such attacks highly context-dependent. To fill this gap, we propose Salami Slicing Risk, which operates by chaining numerous low-risk inputs that individually evade alignment thresholds but cumulatively build harmful intent to ultimately trigger high-risk behaviors, without heavy reliance on pre-designed contextual structures. Building on this risk, we develop Salami Attack, an automatic framework universally applicable to multiple model types and modalities. 
Rigorous experiments demonstrate its state-of-the-art performance across diverse models and modalities, achieving over 90% Attack Success Rate on GPT-4o and Gemini, as well as robustness against real-world alignment defenses. We also propose a defense strategy to constrain the Salami Attack by at least 44.8% while achieving a maximum blocking rate of 64.8% against other multi-turn jailbreak attacks. Our findings provide critical insights into the pervasive risks of multi-turn jailbreaking and offer actionable mitigation strategies to enhance LLM security.","published_date":"2026-04-13T11:12:30+00:00","viability_score":8,"cluster_label":"LLM Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An automatic framework for multi-turn LLM jailbreaking using cumulative low-risk inputs, with demonstrated high success rates and a proposed defense strategy.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11307v1","title":"PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers","abstract":"Leveraging Multi-modal Large Language Models (MLLMs) to accelerate frontier scientific research is promising, yet how to rigorously evaluate such systems remains unclear. Existing benchmarks mainly focus on single-document understanding, whereas real scientific workflows require integrating evidence from multiple papers, including their text, tables, and figures. As a result, multi-modal, multi-document scientific reasoning remains underexplored and lacks systematic evaluation. To address this gap, we introduce PaperScope, a multi-modal multi-document benchmark designed for agentic deep research. PaperScope presents three advantages: (1) Structured scientific grounding. It is built on a knowledge graph of over 2,000 AI papers spanning three years, providing a structured foundation for research-oriented queries. 
(2) Semantically dense evidence construction. It integrates semantically related key information nodes and employs an optimized random-walk article selector to sample thematically coherent paper sets, thereby ensuring adequate semantic density and task complexity. (3) Multi-task evaluation of scientific reasoning. It contains over 2,000 QA pairs across reasoning, retrieval, summarization, and problem solving, enabling evaluation of multi-step scientific reasoning. Experimental results show that even advanced systems such as OpenAI Deep Research and Tongyi Deep Research achieve limited scores on PaperScope, highlighting the difficulty of long-context retrieval and deep multi-source reasoning. PaperScope thus provides a rigorous benchmark alongside a scalable pipeline for constructing large-scale multi-modal, multi-source deep research datasets.","published_date":"2026-04-13T11:07:08+00:00","viability_score":7,"cluster_label":"AI Benchmarking","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multi-modal, multi-document benchmark for evaluating AI agents in deep scientific research, revealing limitations in current advanced systems.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11306v1","title":"Learning to Forget -- Hierarchical Episodic Memory for Lifelong Robot Deployment","abstract":"Robots must verbalize their past experiences when users ask \"Where did you put my keys?\" or \"Why did the task fail?\" Yet maintaining life-long episodic memory (EM) from continuous multimodal perception quickly exceeds storage limits and makes real-time query impractical, calling for selective forgetting that adapts to users' notions of relevance. We present H$^2$-EMV, a framework enabling humanoids to learn what to remember through user interaction. 
Our approach incrementally constructs hierarchical EM, selectively forgets using language-model-based relevance estimation conditioned on learned natural-language rules, and updates these rules given user feedback about forgotten details. Evaluations on simulated household tasks and 20.5-hour-long real-world recordings from ARMAR-7 demonstrate that H$^2$-EMV maintains question-answering accuracy while reducing memory size by 45% and query-time compute by 35%. Critically, performance improves over time - accuracy increases 70% in second-round queries by adapting to user-specific priorities - demonstrating that learned forgetting enables scalable, personalized EM for long-term human-robot collaboration.","published_date":"2026-04-13T11:06:56+00:00","viability_score":7,"cluster_label":"Robotics Memory","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hierarchical episodic memory framework for robots that learns to forget irrelevant experiences, reducing memory size and improving query efficiency.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11304v1","title":"BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows","abstract":"Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables--including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. 
Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. Testing 9 frontier models, we find that even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria and bankers rate 0% of its outputs as client-ready. Our failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.","published_date":"2026-04-13T11:02:32+00:00","viability_score":8,"cluster_label":"AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BankerToolBench is an open-source benchmark evaluating AI agents in end-to-end investment banking workflows, showing current frontier models fail to meet professional standards.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11302v1","title":"3D-Anchored Lookahead Planning for Persistent Robotic Scene Memory via World-Model-Based MCTS","abstract":"We present 3D-Anchored Lookahead Planning (3D-ALP), a System 2 reasoning engine for robotic manipulation that combines Monte Carlo Tree Search (MCTS) with a 3D-consistent world model as the rollout oracle. Unlike reactive policies that evaluate actions from the current camera frame only, 3D-ALP maintains a persistent camera-to-world (c2w) anchor that survives occlusion, enabling accurate replanning to object positions that are no longer directly observable. On a 5-step sequential reach task requiring spatial memory (Experiment E3), 3D-ALP achieves 0.650 \u00b1 0.109 success rate on memory-required steps versus 0.006 \u00b1 0.008 for a greedy reactive baseline (\u0394=+0.645), while step 5 success reaches 0.822 against 0.000 for greedy. 
An ablation study (30 episodes, 3 seeds) isolates tree search spatial memory as the primary driver (+0.533, 82% of gain) with additional benefit from deeper lookahead (+0.111, 17%). We also identify and resolve four structural failure modes in applying UCT-MCTS (Upper Confidence Bounds applied to Trees [10]) to continuous robotic manipulation.","published_date":"2026-04-13T11:01:30+00:00","viability_score":3,"cluster_label":"Robotics Planning","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A novel planning system for robotic manipulation that uses a 3D world model and Monte Carlo Tree Search to improve performance in tasks requiring spatial memory.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11299v1","title":"Enhancing Multimodal Large Language Models for Ancient Chinese Character Evolution Analysis via Glyph-Driven Fine-Tuning","abstract":"In recent years, rapid advances in Multimodal Large Language Models (MLLMs) have increasingly stimulated research on ancient Chinese scripts. As the evolution of written characters constitutes a fundamental pathway for understanding cultural transformation and historical continuity, how MLLMs can be systematically leveraged to support and advance text evolution analysis remains an open and largely underexplored problem. To bridge this gap, we construct a comprehensive benchmark comprising 11 tasks and over 130,000 instances, specifically designed to evaluate the capability of MLLMs in analyzing the evolution of ancient Chinese scripts. We conduct extensive evaluations across multiple widely used MLLMs and observe that, while existing models demonstrate a limited ability in glyph-level comparison, their performance on core tasks-such as character recognition and evolutionary reasoning-remains substantially constrained. 
Motivated by these findings, we propose a glyph-driven fine-tuning framework (GEVO) that explicitly encourages models to capture evolutionary consistency in glyph transformations and enhances their understanding of text evolution. Experimental results show that even models at the 2B scale achieve consistent and comprehensive performance improvements across all evaluated tasks. To facilitate future research, we publicly release both the benchmark and the trained models at https://github.com/songruiecho/GEVO.","published_date":"2026-04-13T11:00:24+00:00","viability_score":8,"cluster_label":"Multimodal LLMs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A glyph-driven fine-tuning framework enhances multimodal LLMs for analyzing ancient Chinese character evolution, releasing a benchmark and trained models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11297v1","title":"The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping","abstract":"Despite the success of reinforcement learning for large language models, a common failure mode is reduced sampling diversity, where the policy repeatedly generates similar erroneous behaviors. Classical entropy regularization encourages randomness under the current policy, but does not explicitly discourage recurrent failure patterns across rollouts. We propose MEDS, a Memory-Enhanced Dynamic reward Shaping framework that incorporates historical behavioral signals into reward design. By storing and leveraging intermediate model representations, we capture features of past rollouts and use density-based clustering to identify frequently recurring error patterns. Rollouts assigned to more prevalent error clusters are penalized more heavily, encouraging broader exploration while reducing repeated mistakes. 
Across five datasets and three base models, MEDS consistently improves average performance over existing baselines, achieving gains of up to 4.13 pass@1 points and 4.37 pass@128 points. Additional analyses using both LLM-based annotations and quantitative diversity metrics show that MEDS increases behavioral diversity during sampling.","published_date":"2026-04-13T10:59:28+00:00","viability_score":7,"cluster_label":"Reinforcement Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MEDS is a memory-enhanced dynamic reward shaping framework that reduces repetitive errors in LLM reinforcement learning by penalizing recurring failure patterns.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11287v1","title":"Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model","abstract":"Background: Large language models (LLMs) have been explored as tools for generating personalized exercise prescriptions, yet the consistency of outputs under identical conditions remains insufficiently examined. Objective: This study evaluated the intra-model consistency of LLM-generated exercise prescriptions using a repeated generation design. Methods: Six clinical scenarios were used to generate exercise prescriptions using Gemini 2.5 Flash (20 outputs per scenario; total n = 120). Consistency was assessed across three dimensions: (1) semantic consistency using SBERT-based cosine similarity, (2) structural consistency based on the FITT principle using an AI-as-a-judge approach, and (3) safety expression consistency, including inclusion rates and sentence-level quantification. Results: Semantic similarity was high across scenarios (mean cosine similarity: 0.879-0.939), with greater consistency in clinically constrained cases. Frequency showed consistent patterns, whereas variability was observed in quantitative components, particularly exercise intensity. 
Unclassifiable intensity expressions were observed in 10-25% of resistance training outputs. Safety-related expressions were included in 100% of outputs; however, safety sentence counts varied significantly across scenarios (H=86.18, p < 0.001), with clinical cases generating more safety expressions than healthy adult cases. Conclusions: LLM-generated exercise prescriptions demonstrated high semantic consistency but showed variability in key quantitative components. Reliability depends substantially on prompt structure, and additional structural constraints and expert validation are needed before clinical deployment.","published_date":"2026-04-13T10:50:44+00:00","viability_score":5,"cluster_label":"LLM Applications","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This study evaluates the consistency of exercise prescriptions generated by Gemini 2.5 Flash, finding high semantic consistency but variability in quantitative components.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11284v1","title":"THEIA: Learning Complete Kleene Three-Valued Logic in a Pure-Neural Modular Architecture","abstract":"We present THEIA, a modular neural architecture that learns complete Kleene three-valued logic (K3) end-to-end without any external symbolic solver, and investigate what architectural prior enables compositional generalization under uncertainty. THEIA processes four mathematical domains (arithmetic, order, set membership, propositional logic) through dedicated engines that converge in a final logic module. Trained on a 2M-sample dataset with input space ~3.4x10^13, it achieves 12/12 Kleene K3 rule coverage across 5 seeds in 9.2 +/- 3.5 minutes (5.6x faster than a parameter-comparable Transformer under matched settings). 
A mod-3 sequential composition experiment generalizes from 5-step training to 500-step evaluation at 99.97% +/- 0.02% -- a result that critically depends on structured inductive bias: replacing the four-engine backbone with a flat MLP collapses length generalization to chance by 50 steps regardless of capacity (both 0.80M and parameter-matched 2.75M variants fail), while a pre-LN TF8LTuned Transformer baseline (3,582,147 params) trained under the identical protocol reaches 99.24% at 500 steps (Appendix D). Mechanistic probing reveals that modularity induces a delayed verdict: upstream engines encode domain-specific variables without committing to the final truth value (probe accuracy <= 74% uncertainty-only ceiling), with the verdict emerging only at the Logic Engine boundary -- causally confirmed by activation patching (100% flip rate on 986 matched pairs, replicated across n=5 seeds; 100.0% aggregate). The Transformer baseline reaches equivalent correctness through a qualitatively different representational trajectory (contraction then expansion), suggesting that modular and monolithic architectures implement distinct compositional strategies.","published_date":"2026-04-13T10:44:15+00:00","viability_score":3,"cluster_label":"Neural Logic","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"THEIA is a modular neural architecture that learns complete Kleene three-valued logic end-to-end, demonstrating compositional generalization under uncertainty.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11272v1","title":"AbLWR:A Context-Aware Listwise Ranking Framework for Antibody-Antigen Binding Affinity Prediction via Positive-Unlabeled Learning","abstract":"Accurate prediction of antibody-antigen binding affinity is fundamental to therapeutic design, yet remains constrained by severe label sparsity and the complexity of antigenic variations. 
In this paper, we propose AbLWR (Antibody-antigen binding affinity List-Wise Ranking), a novel framework that reformulates the conventional affinity regression task as a listwise ranking problem. To mitigate label sparsity, AbLWR incorporates a PU (Positive-Unlabeled) learning mechanism leveraging a dual-level contrastive objective and meta-optimized label refinement to learn robust representations. Furthermore, we address antigenic variation by employing a homologous antigen sampling strategy where Multi-Head Self-Attention (MHSA) explicitly models inter-sample relationships within training lists to capture subtle affinity nuances. Extensive experiments demonstrate that AbLWR significantly outperforms state-of-the-art baselines, improving the Precision@1 (P@1) by over 10% in randomized cross-validation experiments. Notably, case studies on Influenza and IL-33 validate its practical utility, demonstrating robust ranking consistency in distinguishing subtle viral mutations and efficiently prioritizing top-tier candidates for wet-lab screening.","published_date":"2026-04-13T10:28:36+00:00","viability_score":7,"cluster_label":"Biomedical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A listwise ranking framework for antibody-antigen binding affinity prediction that significantly outperforms state-of-the-art baselines and demonstrates practical utility in drug discovery.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11261v1","title":"Inspectable AI for Science: A Research Object Approach to Generative AI Governance","abstract":"This paper introduces AI as a Research Object (AI-RO), a paradigm for governing the use of generative AI in scientific research. Instead of debating whether AI is an author or merely a tool, we propose treating AI interactions as structured, inspectable components of the research process. 
Under this view, the legitimacy of an AI-assisted scientific paper depends on how model use is integrated into the workflow, documented, and made accountable. Drawing on Research Object theory and FAIR principles, we propose a framework for recording model configuration, prompts, and outputs through interaction logs and metadata packaging. These properties are particularly consequential in security and privacy (S&P) research, where provenance artifacts must satisfy confidentiality constraints, integrity guarantees, and auditability requirements that generic disclosure practices do not address. We implement a lightweight writing pipeline in which a language model synthesizes human-authored structured literature review notes under explicit constraints and produces a verifiable provenance record. We present this work as a position supported by an initial demonstrative workflow, arguing that governance of generative AI in science can be implemented as structured documentation, controlled disclosure, and integrity-preserving provenance capture. Based on this example, we outline and motivate a set of necessary future developments required to make such practices practical and widely adoptable.","published_date":"2026-04-13T10:13:20+00:00","viability_score":6,"cluster_label":"AI Governance","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for governing generative AI in scientific research by treating AI interactions as inspectable research objects, ensuring accountability and provenance.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.11259v1","title":"Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization","abstract":"Mobile GUI agents powered by Multimodal Large Language Models (MLLMs) can execute complex tasks on mobile devices. Despite this progress, most existing systems still optimize task success or efficiency, neglecting users' privacy personalization. 
In this paper, we study the often-overlooked problem of agent personalization. We observe that personalization can induce systematic structural heterogeneity in execution trajectories. For example, privacy-first users often prefer protective actions, e.g., refusing permissions, logging out, and minimizing exposure, leading to logically different execution trajectories from utility-first users. Such variable-length and structurally different trajectories make standard preference optimization unstable and less informative. To address this issue, we propose Trajectory Induced Preference Optimization (TIPO), which uses preference-intensity weighting to emphasize key privacy-related steps and padding gating to suppress alignment noise. Results on our Privacy Preference Dataset show that TIPO improves persona alignment and distinction while preserving strong task executability, achieving 65.60% SR, 46.22 Compliance, and 66.67% PD, outperforming existing optimization methods across various GUI tasks. The code and dataset will be publicly released at https://github.com/Zhixin-L/TIPO.","published_date":"2026-04-13T10:12:03+00:00","viability_score":8,"cluster_label":"Mobile Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel preference optimization method for mobile GUI agents that personalizes user privacy by learning from execution trajectories, improving persona alignment and task executability.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11248v1","title":"Evolving Many Worlds: Towards Open-Ended Discovery in Petri Dish NCA via Population-Based Training","abstract":"The generation of sustained, open-ended complexity from local interactions remains a fundamental challenge in artificial life. 
Differentiable multi-agent systems, such as Petri Dish Neural Cellular Automata (PD-NCA), exhibit rich self-organization driven purely by spatial competition; however, they are highly sensitive to hyperparameters and frequently collapse into uninteresting patterns and dynamics, such as frozen equilibria or structureless noise. In this paper, we introduce PBT-NCA, a meta-evolutionary algorithm that evolves a population of PD-NCAs subject to a composite objective that rewards both historical behavioral novelty and contemporary visual diversity. Driven by this continuous evolutionary pressure, PBT-NCA spontaneously generates a plethora of emergent lifelike phenomena over extended horizons-a hallmark of true open-endedness. Strikingly, the substrate autonomously discovers diverse morphological survival and self-organization strategies. We observe highly regular, coordinated periodic waves; spore-like scattering where homogeneous groups eject cell-like clusters to colonize distant territories; and fluid, shape-shifting macro-structures that migrate across the substrate, maintaining stable outer boundaries that enclose highly active interiors. 
By actively penalizing monocultures and dead states, PBT-NCA sustains a state of effective complexity that is neither globally ordered nor globally random, operating persistently at the \"edge of chaos\".","published_date":"2026-04-13T09:56:22+00:00","viability_score":0,"cluster_label":"Artificial Life","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A meta-evolutionary algorithm that evolves populations of Neural Cellular Automata to generate sustained, open-ended complexity and emergent lifelike phenomena.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11229v1","title":"RECIPER: A Dual-View Retrieval Pipeline for Procedure-Oriented Materials Question Answering","abstract":"Retrieving procedure-oriented evidence from materials science papers is difficult because key synthesis details are often scattered across long, context-heavy documents and are not well captured by paragraph-only dense retrieval. We present RECIPER, a dual-view retrieval pipeline that indexes both paragraph-level context and compact large language model-extracted procedural summaries, then combines the two candidate streams with lightweight lexical reranking. Across four dense retrieval backbones, RECIPER consistently improves early-rank retrieval over paragraph-only dense retrieval, achieving average gains of +3.73 in Recall@1, +2.85 in nDCG@10, and +3.13 in MRR. With BGE-large-en-v1.5, it reaches 86.82%, 97.07%, and 97.85% on Recall@1, Recall@5, and Recall@10, respectively. We further observe improved downstream question answering under automatic metrics, suggesting that procedural summaries can serve as a useful complementary retrieval signal for procedure-oriented materials question answering. 
Code and data are available at https://github.com/ReaganWu/RECIPER.","published_date":"2026-04-13T09:29:55+00:00","viability_score":7,"cluster_label":"Information Retrieval","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"RECIPER enhances materials science question answering by combining paragraph and LLM-extracted procedural summaries for improved retrieval accuracy.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11223v1","title":"Regional Explanations: Bridging Local and Global Variable Importance","abstract":"We analyze two widely used local attribution methods, Local Shapley Values and LIME, which aim to quantify the contribution of a feature value $x_i$ to a specific prediction $f(x_1, \\dots, x_p)$. Despite their widespread use, we identify fundamental limitations in their ability to reliably detect locally important features, even under ideal conditions with exact computations and independent features. We argue that a sound local attribution method should not assign importance to features that neither influence the model output (e.g., features with zero coefficients in a linear model) nor exhibit statistical dependence with functionality-relevant features. We demonstrate that both Local SV and LIME violate this fundamental principle. To address this, we propose R-LOCO (Regional Leave Out COvariates), which bridges the gap between local and global explanations and provides more accurate attributions. R-LOCO segments the input space into regions with similar feature importance characteristics. It then applies global attribution methods within these regions, deriving an instance's feature contributions from its regional membership. 
This approach delivers more faithful local attributions while avoiding local explanation instability and preserving instance-specific detail often lost in global methods.","published_date":"2026-04-13T09:24:58+00:00","viability_score":4,"cluster_label":"Explainable AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"R-LOCO provides more accurate and stable feature attributions by bridging local and global explanation methods.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11216v1","title":"Measuring the Authority Stack of AI Systems: Empirical Analysis of 366,120 Forced-Choice Responses Across 8 AI Models","abstract":"What values, evidence preferences, and source trust hierarchies do AI systems actually exhibit when facing structured dilemmas? We present the first large-scale empirical mapping of AI decision-making across all three layers of the Authority Stack framework (S. Lee, 2026a): value priorities (L4), evidence-type preferences (L3), and source trust hierarchies (L2). Using the PRISM benchmark -- a forced-choice instrument of 14,175 unique scenarios per layer, spanning 7 professional domains, 3 severity levels, 3 decision timeframes, and 5 scenario variants -- we evaluated 8 major AI models at temperature 0, yielding 366,120 total responses. Key findings include: (1) a symmetric 4:4 split between Universalism-first and Security-first models at L4; (2) dramatic defense-domain value restructuring where Security surges to near-ceiling win-rates (95.1%-99.8%) in 6 of 8 models; (3) divergent evidence hierarchies at L3, with some models favoring empirical-scientific evidence while others prefer pattern-based or experiential evidence; (4) broad convergence on institutional source trust at L2; and (5) Paired Consistency Scores (PCS) ranging from 57.4% to 69.2%, revealing substantial framing sensitivity across scenario variants. 
Test-Retest Reliability (TRR) ranges from 91.7% to 98.6%, indicating that value instability stems primarily from variant sensitivity rather than stochastic noise. These findings demonstrate that AI models possess measurable -- if sometimes unstable -- Authority Stacks with consequential implications for deployment across professional domains.","published_date":"2026-04-13T09:13:32+00:00","viability_score":3,"cluster_label":"AI Alignment","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper empirically maps the 'Authority Stack' of 8 AI models across values, evidence, and source trust, revealing significant domain-specific biases and framing sensitivity.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11209v1","title":"Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method","abstract":"Large language models (LLMs) have achieved remarkable success across a wide range of applications especially when augmented by external knowledge through retrieval-augmented generation (RAG). Despite their widespread adoption, recent studies have shown that LLMs often struggle to perform faithful reasoning when conflicting knowledge is retrieved. However, existing work primarily focuses on conflicts between external knowledge and the parametric knowledge of LLMs, leaving conflicts across external knowledge largely unexplored. Meanwhile, modern RAG systems increasingly emphasize the integration of unstructured text and (semi-)structured data like knowledge graphs (KGs) to improve knowledge completeness and reasoning faithfulness. To address this gap, we introduce ConflictQA, a novel benchmark that systematically instantiates conflicts between textual evidence and KG evidence. Extensive evaluations across representative LLMs reveal that, facing such cross-source conflicts, LLMs often fail to identify reliable evidence for correct reasoning. 
Instead, LLMs become more sensitive to prompting choices and tend to rely exclusively on either KG or textual evidence, resulting in incorrect responses. Based on these findings, we further propose XoT, a two-stage explanation-based thinking framework tailored for reasoning over heterogeneous conflicting evidence, and verify its effectiveness with extensive experiments.","published_date":"2026-04-13T09:05:04+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ConflictQA benchmark and XoT framework address LLM reasoning failures when faced with conflicting textual and knowledge graph evidence.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11206v1","title":"Designing Adaptive Digital Nudging Systems with LLM-Driven Reasoning","abstract":"Digital nudging systems lack architectural guidance for translating behavioral science into software design. While research identifies nudge strategies and quality attributes, existing architectures fail to integrate multi-dimensional user modeling with ethical compliance as architectural concerns. We present an architecture that uses behavioral theory through explicit architectural decisions, treating ethics and fairness as structural guardrails rather than implementation details. A literature review synthesized 68 nudging strategies, 11 quality attributes, and 3 user profiling dimensions into architectural requirements. The architecture implements sequential processing layers with cross-cutting evaluation modules enforcing regulatory compliance. Validation with 13 software architects confirmed requirements satisfaction and domain transferability. An LLM-powered proof-of-concept in residential energy sustainability demonstrated feasibility through evaluation with 15 users, achieving high perceived intervention quality and measurable positive emotional impact. 
This work bridges behavioral science and software architecture by providing reusable patterns for adaptive systems that balance effectiveness with ethical constraints.","published_date":"2026-04-13T09:03:25+00:00","viability_score":7,"cluster_label":"AI for Software Engineering","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An LLM-driven architecture for adaptive digital nudging systems that integrates behavioral science and ethical compliance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11201v1","title":"CocoaBench: Evaluating Unified Digital Agents in the Wild","abstract":"LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. 
Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.","published_date":"2026-04-13T09:00:10+00:00","viability_score":6,"cluster_label":"Unified AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CocoaBench is a new benchmark and scaffold for evaluating unified LLM agents that combine vision, search, and coding capabilities on long-horizon tasks.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11200v1","title":"ShapShift: Explaining Model Prediction Shifts with Subgroup Conditional Shapley Values","abstract":"Changes in input distribution can induce shifts in the average predictions of machine learning models. Such prediction shifts may impact downstream business outcomes (e.g. a bank's loan approval rate), so understanding their causes can be crucial. We propose ShapShift: a Shapley value method for attributing prediction shifts to changes in the conditional probabilities of interpretable subgroups of data, where these subgroups are defined by the structure of decision trees. We initially apply this method to single decision trees, providing exact explanations based on conditional probability changes at split nodes. Next, we extend it to tree ensembles by selecting the most explanatory tree and accounting for residual effects. Finally, we propose a model-agnostic variant using surrogate trees grown with a novel objective function, allowing application to models like neural networks. While exact computation can be intensive, approximation techniques enable practical application. 
We show that ShapShift provides simple, faithful, and near-complete explanations of prediction shifts across model classes, aiding model monitoring in dynamic environments.","published_date":"2026-04-13T08:56:51+00:00","viability_score":4,"cluster_label":"Model Explainability","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A Shapley value method for attributing machine learning model prediction shifts to changes in data subgroups.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11195v1","title":"Towards Adaptive Open-Set Object Detection via Category-Level Collaboration Knowledge Mining","abstract":"Existing object detectors often struggle to generalize across domains while adapting to emerging novel categories. Adaptive open-set object detection (AOOD) addresses this challenge by training on base categories in the source domain and adapting to both base and novel categories in the target domain without target annotations. However, current AOOD methods remain limited by weak cross-domain representations, ambiguity among novel categories, and source-domain feature bias. To address these issues, we propose a category-level collaboration knowledge mining strategy that exploits both inter-class and intra-class relationships across domains. Specifically, we construct a clustering-based memory bank to encode class prototypes, auxiliary features, and intra-class disparity information, and iteratively update it via unsupervised clustering to enhance category-level knowledge representation. We further design a base-to-novel selection metric to discover source-domain features related to novel categories and use them to initialize novel-category classifiers. In addition, an adaptive feature assignment strategy transfers the learned category-level knowledge to the target domain and asynchronously updates the memory bank to alleviate source-domain bias. 
Extensive experiments on multiple benchmarks show that our method consistently surpasses state-of-the-art AOOD methods by 1.1-5.5 mAP.","published_date":"2026-04-13T08:51:01+00:00","viability_score":7,"cluster_label":"Computer Vision","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An adaptive open-set object detection method using category-level collaboration knowledge mining to generalize to novel categories.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11188v1","title":"MathAgent: Adversarial Evolution of Constraint Graphs for Mathematical Reasoning Data Synthesis","abstract":"Synthesizing high-quality mathematical reasoning data without human priors remains a significant challenge. Current approaches typically rely on seed data mutation or simple prompt engineering, often suffering from mode collapse and limited logical complexity. This paper proposes a hierarchical synthesis framework that formulates data synthesis as an unsupervised optimization problem over a constraint graph followed by semantic instantiation, rather than treating it as a direct text generation task. We introduce a Legislator-Executor paradigm: The Legislator adversarially evolves structured generation blueprints encoding the constraints of the problem, while the Executor instantiates these specifications into diverse natural language scenarios. This decoupling of skeleton design from linguistic realization enables a prioritized focus on constructing complex and diverse logical structures, thereby guiding high-quality data synthesis. 
Experiments conducted on a total of 10 models across the Qwen, Llama, Mistral, and Gemma series demonstrate that our method achieves notable results: models fine-tuned on 1K synthesized samples outperform widely-used datasets of comparable scale (LIMO, s1K) across eight mathematical benchmarks, exhibiting superior out-of-distribution generalization.","published_date":"2026-04-13T08:48:12+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hierarchical framework for synthesizing high-quality mathematical reasoning data by adversarially evolving constraint graphs.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11184v1","title":"Taking a Pulse on How Generative AI is Reshaping the Software Engineering Research Landscape","abstract":"Context: Software engineering (SE) researchers increasingly study Generative AI (GenAI) while also incorporating it into their own research practices. Despite rapid adoption, there is limited empirical evidence on how GenAI is used in SE research and its implications for research practices and governance. Aims: We conduct a large-scale survey of 457 SE researchers publishing in top venues between 2023 and 2025. Method: Using quantitative and qualitative analyses, we examine who uses GenAI and why, where it is used across research activities, and how researchers perceive its benefits, opportunities, challenges, risks, and governance. Results: GenAI use is widespread, with many researchers reporting pressure to adopt and align their work with it. Usage is concentrated in writing and early-stage activities, while methodological and analytical tasks remain largely human-driven. Although productivity gains are widely perceived, concerns about trust, correctness, and regulatory uncertainty persist. 
Researchers highlight risks such as inaccuracies and bias, emphasize mitigation through human oversight and verification, and call for clearer governance, including guidance on responsible use and peer review. Conclusion: We provide a fine-grained, SE-specific characterization of GenAI use across research activities, along with taxonomies of GenAI use cases for research and peer review, opportunities, risks, mitigation strategies, and governance needs. These findings establish an empirical baseline for the responsible integration of GenAI into academic practice.","published_date":"2026-04-13T08:43:56+00:00","viability_score":2,"cluster_label":"AI Research Trends","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper analyzes the widespread adoption and implications of Generative AI in software engineering research, identifying usage patterns, perceived benefits, and governance needs.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11174v1","title":"EmbodiedGovBench: A Benchmark for Governance, Recovery, and Upgrade Safety in Embodied Agent Systems","abstract":"Recent progress in embodied AI has produced a growing ecosystem of robot policies, foundation models, and modular runtimes. However, current evaluation remains dominated by task success metrics such as completion rate or manipulation accuracy. These metrics leave a critical gap: they do not measure whether embodied systems are governable -- whether they respect capability boundaries, enforce policies, recover safely, maintain audit trails, and respond to human oversight. We present EmbodiedGovBench, a benchmark for governance-oriented evaluation of embodied agent systems. Rather than asking only whether a robot can complete a task, EmbodiedGovBench evaluates whether the system remains controllable, policy-bounded, recoverable, auditable, and evolution-safe under realistic perturbations. 
The benchmark covers seven governance dimensions: unauthorized capability invocation, runtime drift robustness, recovery success, policy portability, version upgrade safety, human override responsiveness, and audit completeness. We define a benchmark structure spanning single-robot and fleet settings, with scenario templates, perturbation operators, governance metrics, and baseline evaluation protocols. We describe how the benchmark can be instantiated over embodied capability runtimes with modular interfaces and contract-aware upgrade workflows. Our analysis suggests that embodied governance should become a first-class evaluation target. EmbodiedGovBench provides the initial measurement framework for that shift.","published_date":"2026-04-13T08:34:04+00:00","viability_score":7,"cluster_label":"Embodied Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"EmbodiedGovBench is a new benchmark for evaluating the governance, recovery, and upgrade safety of embodied AI systems, moving beyond simple task success metrics.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11165v1","title":"Cost-optimal Sequential Testing via Doubly Robust Q-learning","abstract":"Clinical decision-making often involves selecting tests that are costly, invasive, or time-consuming, motivating individualized, sequential strategies for what to measure and when to stop ascertaining. We study the problem of learning cost-optimal sequential decision policies from retrospective data, where test availability depends on prior results, inducing informative missingness. Under a sequential missing-at-random mechanism, we develop a doubly robust Q-learning framework for estimating optimal policies. The method introduces path-specific inverse probability weights that account for heterogeneous test trajectories and satisfy a normalization property conditional on the observed history. 
By combining these weights with auxiliary contrast models, we construct orthogonal pseudo-outcomes that enable unbiased policy learning when either the acquisition model or the contrast model is correctly specified. We establish oracle inequalities for the stage-wise contrast estimators, along with convergence rates, regret bounds, and misclassification rates for the learned policy. Simulations demonstrate improved cost-adjusted performance over weighted and complete-case baselines, and an application to a prostate cancer cohort study illustrates how the method reduces testing cost without compromising predictive accuracy.","published_date":"2026-04-13T08:26:27+00:00","viability_score":7,"cluster_label":"Clinical Decision Support","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper introduces a doubly robust Q-learning framework for learning cost-optimal sequential testing policies from retrospective data in clinical decision-making.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11154v1","title":"Environmental Footprint of GenAI Research: Insights from the Moshi Foundation Model","abstract":"New multi-modal large language models (MLLMs) are continuously being trained and deployed, following rapid development cycles. This generative AI frenzy is driving steady increases in energy consumption, greenhouse gas emissions, and a plethora of other environmental impacts linked to datacenter construction and hardware manufacturing. Mitigating the environmental consequences of GenAI remains challenging due to an overall lack of transparency by the main actors in the field. Even when the environmental impacts of specific models are mentioned, they are typically restricted to the carbon footprint of the final training run, omitting the research and development stages.   
In this work, we explore the impact of GenAI research through a fine-grained analysis of the compute spent to create Moshi, a 7B-parameter speech-text foundation model for real-time dialogue developed by Kyutai, a leading privately funded open science AI lab. For the first time, our study dives into the anatomy of compute-intensive MLLM research, quantifying the GPU-time invested in specific model components and training phases, as well as early experimental stages, failed training runs, debugging, and ablation studies. Additionally, we assess the environmental impacts of creating Moshi from beginning to end using a life cycle assessment methodology: we quantify energy and water consumption, greenhouse gas emissions, and mineral resource depletion associated with the production and use of datacenter hardware.   Our detailed analysis allows us to provide actionable guidelines to reduce compute usage and environmental impacts of MLLM research, paving the way for more sustainable AI research.","published_date":"2026-04-13T08:15:59+00:00","viability_score":2,"cluster_label":"AI Sustainability","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This work analyzes the environmental footprint of Generative AI research, focusing on the compute and lifecycle impacts of training foundation models.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11137v1","title":"From Answers to Arguments: Toward Trustworthy Clinical Diagnostic Reasoning with Toulmin-Guided Curriculum Goal-Conditioned Learning","abstract":"The integration of Large Language Models (LLMs) into clinical decision support is critically obstructed by their opaque and often unreliable reasoning. In the high-stakes domain of healthcare, correct answers alone are insufficient; clinical practice demands full transparency to ensure patient safety and enable professional accountability. 
A pervasive and dangerous weakness of current LLMs is their tendency to produce \"correct answers through flawed reasoning.\" This issue is far more than a minor academic flaw; such process errors signal a fundamental lack of robust understanding, making the model prone to broader hallucinations and unpredictable failures when faced with real-world clinical complexity. In this paper, we establish a framework for trustworthy clinical argumentation by adapting the Toulmin model to the diagnostic process. We propose a novel training pipeline: Curriculum Goal-Conditioned Learning (CGCL), designed to progressively train LLM to generate diagnostic arguments that explicitly follow this Toulmin structure. CGCL's progressive three-stage curriculum systematically builds a solid clinical argument: (1) extracting facts and generating differential diagnoses; (2) justifying a core hypothesis while rebutting alternatives; and (3) synthesizing the analysis into a final, qualified conclusion. We validate CGCL using T-Eval, a quantitative framework measuring the integrity of the diagnosis reasoning. 
Experiments show that our method achieves diagnostic accuracy and reasoning quality comparable to resource-intensive Reinforcement Learning (RL) methods, while offering a more stable and efficient training pipeline.","published_date":"2026-04-13T07:49:39+00:00","viability_score":3,"cluster_label":"Clinical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for trustworthy clinical diagnostic reasoning using Toulmin-guided curriculum goal-conditioned learning to improve LLM transparency and reliability.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11136v1","title":"BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning","abstract":"Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. We propose BoxTuning, which resolves this mismatch by injecting object spatial-temporal information directly into the visual modality. Colored bounding boxes and trajectory trails are rendered onto video frames as visual prompts, with only a concise color-to-object legend retained as text. This reduces the token cost significantly, achieving 87-93% text token reduction in practice. It also preserves full temporal resolution, where the trajectory trails further encode inter-frame motion direction and speed within each keyframe, recovering fine-grained dynamics that text-coordinate methods are forced to discard. 
Experimental results on five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA) show that BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy degradation observed on reasoning-centric tasks, establishing visual prompting as a more natural and efficient paradigm for conveying object information to video MLLMs.","published_date":"2026-04-13T07:49:31+00:00","viability_score":7,"cluster_label":"Multimodal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BoxTuning injects object spatial-temporal information directly into video frames as visual prompts, significantly reducing token costs and improving multimodal model fine-tuning for video QA.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11131v1","title":"MADQRL: Distributed Quantum Reinforcement Learning Framework for Multi-Agent Environments","abstract":"Reinforcement learning (RL) is one of the most practical ways to learn from real-life use-cases. Motivated from the cognitive methods used by humans makes it a widely acceptable strategy in the field of artificial intelligence. Most of the environments used for RL are often high-dimensional, and traditional RL algorithms becomes computationally expensive and challenging to effectively learn from such systems. Recent advancements in practical demonstration of quantum computing (QC) theories, such as compact encoding, enhanced representation and learning algorithms, random sampling, or the inherent stochastic nature of quantum systems, have opened up new directions to tackle these challenges. Quantum reinforcement learning (QRL) is seeking significant traction over the past few years. However, the current state of quantum hardware is not enough to cater for such high-dimensional environments with complex multi-agent setup. 
To tackle this issue, we propose a distributed framework for QRL where multiple agents learn independently, distributing the load of joint training from individual machines. Our method works well for environments with disjoint sets of action and observation spaces, but can also be extended to other systems with reasonable approximations. We analyze the proposed method on cooperative-pong environment and our results indicate ~10% improvement from other distribution strategies, and ~5% improvement from classical models of policy representation.","published_date":"2026-04-13T07:44:23+00:00","viability_score":5,"cluster_label":"Reinforcement Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A distributed quantum reinforcement learning framework for multi-agent environments that enables independent learning to reduce computational load.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11125v1","title":"A Proposed Biomedical Data Policy Framework to Reduce Fragmentation, Improve Quality, and Incentivize Sharing in Indian Healthcare in the era of Artificial Intelligence and Digital Health","abstract":"India generates vast biomedical data through postgraduate research, government hospital services and audits, government schemes, private hospitals and their electronic medical record (EMR) systems, insurance programs and standalone clinics. Unfortunately, these resources remain fragmented across institutional silos and vendor-locked EMR systems. The fundamental bottleneck is not technological but economic and academic. There is a systemic misalignment of incentives that renders data sharing a high-risk, low-reward activity for individual researchers and institutions. Until India's academic promotion criteria, institutional rankings, and funding mechanisms explicitly recognize and reward data curation as professional work, the nation's AI ambitions will remain constrained by fragmented, non-interoperable datasets. 
We propose a multi-layered incentive architecture integrating recognition of data papers in National Medical Commission (NMC) promotion criteria, incorporation of open data metrics into the National Institutional Ranking Framework (NIRF), adoption of Shapley Value-based revenue sharing in federated learning consortia, and establishment of institutional data stewardship as a mainstream professional role. Critical barriers to data sharing, including fear of data quality scrutiny, concerns about misinterpretation, and selective reporting bias, are addressed through mandatory data quality assessment, structured peer review, and academic credit for auditing roles. The proposed framework directly addresses regulatory constraints introduced by the Digital Personal Data Protection Act 2023 (DPDPA), while constructively engaging with the National Data Sharing and Accessibility Policy (NDSAP), Biotech-PRIDE Guidelines, and the Anusandhan National Research Foundation (ANRF) guidelines.","published_date":"2026-04-13T07:41:43+00:00","viability_score":3,"cluster_label":"Healthcare Data Policy","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A proposed biomedical data policy framework to reduce fragmentation, improve quality, and incentivize sharing in Indian healthcare.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11122v1","title":"Semantic-Geometric Dual Compression: Training-Free Visual Token Reduction for Ultra-High-Resolution Remote Sensing Understanding","abstract":"Multimodal Large Language Models (MLLMs) have demonstrated immense potential in Earth observation. However, the massive visual tokens generated when processing Ultra-High-Resolution (UHR) imagery introduce prohibitive computational overhead, severely bottlenecking their inference efficiency. 
Existing visual token compression methods predominantly adopt static and uniform compression strategies, neglecting the inherent \"Semantic-Geometric Duality\" in remote sensing interpretation tasks. Specifically, object semantic tasks focus on the abstract semantics of objects and benefit from aggressive background pruning, whereas scene geometric tasks critically rely on the integrity of spatial topology. To address this challenge, we propose DualComp, a task-adaptive dual-stream token compression framework. Dynamically guided by a lightweight pre-trained router, DualComp decouples feature processing into two dedicated pathways. In the object semantic stream, the Spatially-Contiguous Semantic Aggregator (SCSA) utilizes size-adaptive clustering to aggregate redundant background while protecting small objects. In the scene geometric stream, the Instruction-Guided Structure Recoverer (IGSR) introduces a greedy path-tracing topology completion mechanism to reconstruct spatial skeletons. Experiments on the UHR remote sensing benchmark XLRS-Bench demonstrate that DualComp accomplishes high-fidelity remote sensing interpretation at an exceptionally low computational cost, achieving simultaneous improvements in both efficiency and accuracy.","published_date":"2026-04-13T07:36:02+00:00","viability_score":5,"cluster_label":"Remote Sensing AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A dual-stream framework for efficient and accurate ultra-high-resolution remote sensing image understanding by adaptively compressing visual tokens.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.11120v1","title":"Persona Non Grata: Single-Method Safety Evaluation Is Incomplete for Persona-Imbued LLMs","abstract":"Personality imbuing customizes LLM behavior, but safety evaluations almost always study prompt-based personas alone. 
We show this is incomplete: prompting and activation steering expose *different*, architecture-dependent vulnerability profiles, and testing with only one method can miss a model's dominant failure mode. Across 5,568 judged conditions on four standard models from three architecture families, persona danger rankings under system prompting are preserved across all architectures ($\u03c1= 0.71$--$0.96$), but activation-steering vulnerability diverges sharply and cannot be predicted from prompt-side rankings: Llama-3.1-8B is substantially more AS-vulnerable, whereas Gemma-3-27B and Qwen3.5 are more vulnerable to prompting. The most striking illustration of this divergence is the *prosocial persona paradox*: on Llama-3.1-8B, P12 (high conscientiousness + high agreeableness) is among the safest personas under prompting yet becomes the highest-ASR activation-steered persona (ASR ~0.818). This is an inversion robust to coefficient ablation and matched-strength calibration, and replicated on DeepSeek-R1-Distill-Qwen-32B. A trait refusal alignment framework, in which conscientiousness is strongly anti-aligned with refusal on Llama-3.1-8B, offers a partial geometric account. Reasoning provides only partial protection: two 32B reasoning models reach 15--18% prompt-side ASR, and activation steering separates them sharply in both baseline susceptibility and persona-specific vulnerability. 
Heuristic trace diagnostics suggest that the safer model retains stronger policy recall and self-correction behavior, not merely longer reasoning.","published_date":"2026-04-13T07:34:02+00:00","viability_score":3,"cluster_label":"LLM Safety","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research reveals that current LLM safety evaluations are incomplete, as they fail to account for different vulnerability profiles exposed by prompting versus activation steering, leading to a need for more comprehensive testing methodologies.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11111v1","title":"Use of AI Tools: Guidelines to Maintain Academic Integrity in Computing Colleges","abstract":"The rapid adoption of AI tools such as ChatGPT has significantly transformed academic practices, offering considerable benefits for both students and faculty in computing disciplines. These tools have been shown to enhance learning efficiency, academic self-efficacy, and confidence. However, their increasing use also raises pressing concerns regarding the preservation of academic integrity -- an essential pillar of the educational process. This paper explores the implications of widespread AI tool usage within computing colleges, with a particular focus on how to align their use with the principles of academic honesty. We begin by classifying common assessment techniques employed in computing education and examine how each may be impacted by AI-assisted tools. Building on this foundation, we propose a set of general guidelines applicable across various assessment formats to help instructors responsibly integrate AI tools into their pedagogy. Furthermore, we provide targeted, assessment-specific recommendations designed to uphold educational objectives while mitigating risks of academic misconduct. 
These guidelines serve as a practical framework for instructors aiming to balance the pedagogical advantages of AI tools with the imperative of maintaining academic integrity in computing education. Finally, we introduce a formal model that provides a structured mathematical framework for evaluating student assessments in the presence of AI-assisted tools.","published_date":"2026-04-13T07:25:47+00:00","viability_score":2,"cluster_label":"Academic Integrity","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Guidelines and a formal model to maintain academic integrity in computing colleges amidst the rise of AI tools like ChatGPT.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11109v1","title":"Record-Remix-Replay: Hierarchical GPU Kernel Optimization using Evolutionary Search","abstract":"As high-performance computing and AI workloads become increasingly dependent on GPUs, maintaining high performance across rapidly evolving hardware generations has become a major challenge. Developers often spend months tuning scientific applications to fully exploit new architectures, navigating a complex optimization space that spans algorithm design, source implementation, compiler flags and pass sequences, and kernel launch parameters. Existing approaches can effectively search parts of this space in isolation, such as launch configurations or compiler settings, but optimizing across the full space still requires substantial human expertise and iterative manual effort.   In this paper, we present Record-Remix-Replay (R^3), a hierarchical optimization framework that combines LLM-driven evolutionary search, Bayesian optimization, and record-replay compilation techniques to efficiently explore GPU kernel optimizations from source-level implementation choices down to compiler pass ordering and runtime configuration. 
By making candidate evaluation fast and scalable, our approach enables practical end-to-end search over optimization dimensions that are typically treated separately. We show that Record-Remix-Replay can optimize full scientific applications better than traditional approaches over kernel parameters and compiler flags, while also being nearly an order of magnitude faster than modern evolutionary search approaches.","published_date":"2026-04-13T07:25:12+00:00","viability_score":3,"cluster_label":"GPU Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A hierarchical optimization framework using LLM-driven evolutionary search to efficiently tune GPU kernels across various dimensions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11104v1","title":"Frugal Knowledge Graph Construction with Local LLMs: A Zero-Shot Pipeline, Self-Consistency and Wisdom of Artificial Crowds","abstract":"This paper presents an empirical study of a multi-model zero-shot pipeline for knowledge graph construction and exploitation, executed entirely through local inference on consumer-grade hardware. We propose a reproducible evaluation framework integrating two external benchmarks (DocRED, HotpotQA), WebQuestionsSP-style synthetic data, and the RAGAS evaluation framework in an automated pipeline. On 500 document-level relations, our system achieves an F1 of 0.70 $\\pm$ 0.041 in zero-shot, compared to 0.80 for supervised DREEAM. Text-to-query achieves an accuracy of 0.80 $\\pm$ 0.06 on 200 samples. Multi-hop reasoning achieves an Exact Match (EM) of 0.46$\\pm$0.04 on 500 HotpotQA questions, with a RAGAS faithfulness of 0.96 $\\pm$ 0.04 on 50 samples. Beyond the pipeline, we study diversity mechanisms for difficult multi-hop reasoning. On 181 questions unsolvable at zero temperature, self-consistency (k=5, T =0.7) recovers up to 23% EM with a single Mixture-of-Experts (MoE) model, but the cross-model oracle (3 architectures x 5 samples) reaches 46.4%. 
We highlight an agreement paradox: strong consensus among samples signals collective hallucination rather than a reliable answer, echoing the work of Moussa{\u00ef}d et al. on the wisdom of crowds. Extending to the full pipeline (500 questions), self-consistency (k=3) raises EM from 0.46 to 0.48 $\\pm$ 0.04. A confidence-routing cascade mechanism (Phi-4 $\\rightarrow$ GPT-OSS, k=5) achieves an EM of 0.55 $\\pm$ 0.04, the best result obtained, with 45.4% of questions rerouted. Finally, we show that V3 prompt engineering applied to other models does not reproduce the gains observed with Gemma-4, confirming the specific prompt/model interaction. The entire system runs in $\\sim$5 h on a single RTX 3090, without any training, for an estimated carbon footprint of 0.09 kg CO2 eq.","published_date":"2026-04-13T07:20:21+00:00","viability_score":6,"cluster_label":"Knowledge Graph Construction","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A zero-shot pipeline for frugal knowledge graph construction using local LLMs, achieving competitive results with consumer hardware.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11103v1","title":"ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing","abstract":"Role-playing has garnered rising attention as it provides a strong foundation for human-machine interaction and facilitates sociological research. However, current work is confined to textual modalities, neglecting speech, which plays a predominant role in daily life, thus limiting genuine role-playing. To bridge this gap, we conceptualize and benchmark speech role-playing through ActorMindBench, and we present a corresponding reasoning framework, called ActorMind. Specifically, (1) Speech Role-Playing enables models to deliver spontaneous responses with personalized verbal traits based on their role, the scene, and spoken dialogue. 
(2) ActorMindBench is a hierarchical benchmark comprising Utterance-Level content with 7,653 utterances, Scene-Level content with 313 scenes, and Role-Level content with 6 roles. (3) ActorMind is an off-the-shelf, multi-agent, chain-of-thought-style reasoning framework that emulates how human actors perform in theaters. Concretely, ActorMind first reads its assigned role description via Eye Agent, then comprehends emotional cues within contextual spoken dialogues through Ear Agent. Subsequently, Brain Agent generates a descriptive emotional state, and finally, Mouth Agent delivers the scripts infused with the corresponding emotional state. Experimental results demonstrate the effectiveness of ActorMind in enhancing speech role-playing.","published_date":"2026-04-13T07:20:20+00:00","viability_score":4,"cluster_label":"Speech AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ActorMind is a reasoning framework that enables AI to perform speech role-playing with personalized verbal traits by emulating human actor processes.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11096v1","title":"Efficient Training for Cross-lingual Speech Language Models","abstract":"Currently, large language models (LLMs) predominantly focus on the text modality. To enable more natural human-AI interaction, speech LLMs are emerging, but building effective end-to-end speech LLMs remains challenging due to limited data and the difficulty in expanding to more languages. In this paper, we introduce Cross-lingual Speech Language Model (CSLM), an efficient training method for cross-lingual speech LLMs based on discrete speech tokens. We propose a novel alignment strategy that achieves cross-modal and cross-lingual alignment through continual pre-training. 
By conducting instruction fine-tuning following a speech-text interleaved chain-of-modality generation process, we enhance modal alignment at a finer granularity, thereby improving generation quality and reducing latency. CSLM aligns different modalities and languages simultaneously without the need for massive speech data, thus exhibiting good language scalability. Evaluations on cross-modal tasks, mono-lingual conversational tasks, and cross-lingual conversational tasks demonstrate CSLM's strong cross-modal alignment capabilities and general task abilities. (Code is available at: https://github.com/ictnlp/CSLM)","published_date":"2026-04-13T07:12:40+00:00","viability_score":7,"cluster_label":"Cross-lingual Speech LLMs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CSLM is an efficient training method for cross-lingual speech LLMs that aligns different modalities and languages without massive speech data, enabling scalable natural human-AI interaction.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11095v1","title":"Bottleneck Tokens for Unified Multimodal Retrieval","abstract":"Adapting decoder-only multimodal large language models (MLLMs) for unified multimodal retrieval faces two structural gaps. First, existing methods rely on implicit pooling, which overloads the hidden state of a standard vocabulary token (e.g., <EOS>) as the sequence-level representation, a mechanism never designed for information aggregation. Second, contrastive fine-tuning specifies what the embedding should match but provides no token-level guidance on how information should be compressed into it. We address both gaps with two complementary components. Architecturally, we introduce Bottleneck Tokens (BToks), a small set of learnable tokens that serve as a fixed-capacity explicit pooling mechanism. 
For training, we propose Generative Information Condensation: a next-token prediction objective coupled with a Condensation Mask that severs the direct attention path from target tokens to query tokens. All predictive signals are thereby forced through the BToks, converting the generative loss into dense, token-level supervision for semantic compression. At inference time, only the input and BToks are processed in a single forward pass with negligible overhead over conventional last-token pooling. On MMEB-V2 (78 datasets, 3 modalities, 9 meta-tasks), our approach achieves state-of-the-art among 2B-scale methods under comparable data conditions, attaining an Overall score of 59.0 (+3.6 over VLM2Vec-V2) with substantial gains on semantically demanding tasks (e.g., +12.6 on Video-QA).","published_date":"2026-04-13T07:12:12+00:00","viability_score":8,"cluster_label":"Multimodal Retrieval","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Bottleneck Tokens (BToks) and Generative Information Condensation enable decoder-only multimodal LLMs to achieve state-of-the-art unified multimodal retrieval with negligible inference overhead.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11094v1","title":"E2E-REME: Towards End-to-End Microservices Auto-Remediation via Experience-Simulation Reinforcement Fine-Tuning","abstract":"Contemporary microservice systems continue to grow in scale and complexity, leading to increasingly frequent and costly failures. While recent LLM-based auto-remediation approaches have emerged, they primarily translate textual instructions into executable Ansible playbooks and rely on expert-crafted prompts, lacking runtime knowledge guidance and depending on large-scale general-purpose LLMs, which limits their accuracy and efficiency. 
We introduce \\textit{End-to-End Microservice Remediation} (E2E-MR), a new task that requires directly generating executable playbooks from diagnosis reports to autonomously restore faulty systems. To enable rigorous evaluation, we build \\textit{MicroRemed}, a benchmark that automates microservice deployment, failure injection, playbook execution, and post-repair verification. We further propose \\textit{E2E-REME}, an end-to-end auto-remediation model trained via experience-simulation reinforcement fine-tuning. Experiments on public and industrial microservice platforms, compared with nine representative LLMs, show that E2E-REME achieves superior accuracy and efficiency.","published_date":"2026-04-13T07:12:04+00:00","viability_score":7,"cluster_label":"Microservices Auto-Remediation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"E2E-REME is an end-to-end auto-remediation model for microservices, trained via experience-simulation reinforcement fine-tuning, that generates executable playbooks from diagnosis reports.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.11088v1","title":"Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents","abstract":"Developers increasingly guide AI coding agents through natural language instruction files (e.g., CLAUDE.md, .cursorrules), yet no controlled study has measured whether these rules actually improve agent performance or which properties make a rule beneficial. We scrape 679 such files (25,532 rules) from GitHub and conduct the first large-scale empirical evaluation, running over 5,000 agent runs with a state-of-the-art coding agent on SWE-bench Verified. Rules improve performance by 7--14 percentage points, but random rules help as much as expert-curated ones -- suggesting rules work through context priming rather than specific instruction. 
Negative constraints (\"do not refactor unrelated code\") are the only individually beneficial rule type, while positive directives (\"follow code style\") actively hurt -- a pattern we analyze through the lens of potential-based reward shaping (PBRS). Moreover, individual rules are mostly harmful in isolation yet collectively helpful, with no degradation up to 50 rules. These findings expose a hidden reliability risk -- well-intentioned rules routinely degrade agent performance -- and provide a clear principle for safe agent configuration: constrain what agents must not do, rather than prescribing what they should.","published_date":"2026-04-13T07:10:01+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This research identifies that constraining AI coding agents on what NOT to do, rather than prescribing what to do, significantly improves performance and reduces reliability risks, offering a clear principle for safer agent configuration.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11083v1","title":"FlowCoMotion: Text-to-Motion Generation via Token-Latent Flow Modeling","abstract":"Text-to-motion generation is driven by learning motion representations for semantic alignment with language. Existing methods rely on either continuous or discrete motion representations. However, continuous representations entangle semantics with dynamics, while discrete representations lose fine-grained motion details. In this context, we propose FlowCoMotion, a novel motion generation framework that unifies both treatments from a modeling perspective. Specifically, FlowCoMotion employs token-latent coupling to capture both semantic content and high-fidelity motion details. 
In the latent branch, we apply multi-view distillation to regularize the continuous latent space, while in the token branch we use discrete temporal resolution quantization to extract high-level semantic cues. The motion latent is then obtained by combining the representations from the two branches through a token-latent coupling network. Subsequently, a velocity field is predicted based on the textual conditions. An ODE solver integrates this velocity field from a simple prior, thereby guiding the sample to the potential state of the target motion. Extensive experiments show that FlowCoMotion achieves competitive performance on text-to-motion benchmarks, including HumanML3D and SnapMoGen.","published_date":"2026-04-13T07:04:47+00:00","viability_score":7,"cluster_label":"Generative Video","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"FlowCoMotion is a text-to-motion generation framework that unifies continuous and discrete motion representations using token-latent coupling to capture both semantic content and high-fidelity motion details, achieving competitive performance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11080v1","title":"ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation","abstract":"Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficiency by fusing activation rotations into attention and FFN blocks, but suffer from limited expressivity as they are constrained to use a single learnable rotation matrix across all layers. To tackle this, layer-wise transformation methods emerged, achieving superior accuracy through localized adaptation. 
However, layer-wise methods cannot fuse activation rotation matrices into weights, requiring online computations and causing significant overhead. In this paper, we propose ReSpinQuant, a quantization framework that resolves such overhead by leveraging offline activation rotation fusion and matching basis using efficient residual subspace rotation. This design reconciles the high expressivity of layer-wise adaptation with only negligible inference overhead. Extensive experiments on W4A4 and W3A3 quantization demonstrate that ReSpinQuant achieves state-of-the-art performance, outperforming global rotation methods and matching the accuracy of computationally expensive layer-wise methods with minimal overhead.","published_date":"2026-04-13T07:00:26+00:00","viability_score":7,"cluster_label":"LLM Quantization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ReSpinQuant is an efficient layer-wise LLM quantization framework that achieves state-of-the-art performance by reconciling high expressivity with minimal inference overhead through offline activation rotation fusion.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11077v1","title":"Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation","abstract":"Customer service chatbots are increasingly expected to serve not merely as reactive support tools for users, but as strategic interfaces for harvesting high-value information and business intelligence. In response, we make three main contributions. 1) We introduce and define a novel task of Proactive Information Probing, which optimizes when to probe users for pre-specified target information while minimizing conversation turns and user friction. 2) We propose PROCHATIP, a proactive chatbot framework featuring a specialized conversation strategy module trained to master the delicate timing of probes. 
3) Experiments demonstrate that PROCHATIP significantly outperforms baselines, exhibiting superior capability in both information probing and service quality. We believe that our work effectively redefines the commercial utility of chatbots, positioning them as scalable, cost-effective engines for proactive business intelligence. Our code is available at https://github.com/SCUNLP/PROCHATIP.","published_date":"2026-04-13T06:57:29+00:00","viability_score":9,"cluster_label":"Customer Service AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"PROCHATIP is a proactive chatbot framework that redefines commercial utility by intelligently probing users for high-value information and business intelligence, significantly outperforming baselines in both information gathering and service quality.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11072v1","title":"Hodoscope: Unsupervised Monitoring for AI Misbehaviors","abstract":"Existing approaches to monitoring AI agents rely on supervised evaluation: human-written rules or LLM-based judges that check for known failure modes. However, novel misbehaviors may fall outside predefined categories entirely and LLM-based judges can be unreliable. To address this, we formulate unsupervised monitoring, drawing an analogy to unsupervised learning. Rather than checking for specific misbehaviors, an unsupervised monitor assists humans in discovering problematic agent behaviors without prior assumptions about what counts as problematic, leaving that determination to the human.   We observe that problematic behaviors are often distinctive: a model exploiting a benchmark loophole exhibits actions absent from well-behaved baselines, and a vulnerability unique to one evaluation manifests as behavioral anomalies when the same model runs across multiple benchmarks. 
This motivates using group-wise behavioral differences as the primary signal for unsupervised monitoring. We introduce Hodoscope, a tool that operationalizes this insight. Hodoscope compares behavior distributions across groups and highlights distinctive and potentially suspicious action patterns for human review. Using Hodoscope, we discover a previously unknown vulnerability in the Commit0 benchmark (unsquashed git history allowing ground-truth recovery, inflating scores for at least five models) and independently recover known exploits on ImpossibleBench and SWE-bench. Quantitative evaluation estimates that our method reduces review effort by 6-23$\\times$ compared to naive uniform sampling. Finally, we show that behavior descriptions discovered through Hodoscope could improve the detection accuracy of LLM-based judges, demonstrating a path from unsupervised to supervised monitoring.","published_date":"2026-04-13T06:52:05+00:00","viability_score":7,"cluster_label":"AI Monitoring","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Hodoscope offers unsupervised monitoring for AI agents, enabling humans to discover novel misbehaviors by highlighting distinctive behavioral anomalies across benchmarks, reducing review effort significantly.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11071v1","title":"Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net","abstract":"We present a lightweight two-stage framework for low-light image enhancement (LLIE) that achieves competitive perceptual quality with significantly fewer parameters than existing methods. Our approach combines frozen algorithm-based preprocessing with a compact U-Net built entirely from depthwise-separable convolutions. 
The preprocessing normalizes the input distribution by providing complementary brightness-corrected views, enabling the trainable network to focus on residual color correction. Our method achieved 4th place in the CVPR 2026 NTIRE Efficient Low-Light Image Enhancement Challenge. We further provide extended benchmarks and ablations to demonstrate the general effectiveness of our methods.","published_date":"2026-04-13T06:50:28+00:00","viability_score":7,"cluster_label":"Image Enhancement","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A lightweight, two-stage framework for low-light image enhancement that achieves competitive quality with fewer parameters.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11070v1","title":"PRISM Risk Signal Framework: Hierarchy-Based Red Lines for AI Behavioral Risk","abstract":"Current approaches to AI safety define red lines at the case level: specific prompts, specific outputs, specific harms. This paper argues that red lines can be set more fundamentally -- at the level of value, evidence, and source hierarchies that govern AI reasoning. Using the PRISM (Profile-based Reasoning Integrity Stack Measurement) framework, we define a taxonomy of 27 behavioral risk signals derived from structural anomalies in how AI systems prioritize values (L4), weight evidence types (L3), and trust information sources (L2). Each signal is evaluated through a dual-threshold principle combining absolute rank position and relative win-rate gap, producing a two-tier classification (Confirmed Risk vs. Watch Signal). 
The hierarchy-based approach offers three advantages over case-specific red lines: it is anticipatory rather than reactive (detecting dangerous reasoning structures before they produce harmful outputs), comprehensive rather than enumerative (a single value-hierarchy signal subsumes an unlimited number of case-specific violations), and measurable rather than subjective (grounded in empirical forced-choice data). We demonstrate the framework's detection capacity using approximately 397,000 forced-choice responses from 7 AI models across three Authority Stack layers, showing that the signal taxonomy successfully discriminates between models with structurally extreme profiles, models with context-dependent risk, and models with balanced hierarchies.","published_date":"2026-04-13T06:50:23+00:00","viability_score":4,"cluster_label":"AI Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"The PRISM framework introduces a hierarchy-based approach to AI safety, defining 27 behavioral risk signals to anticipate and measure dangerous reasoning structures before they lead to harmful outputs.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11065v1","title":"AI Integrity: A New Paradigm for Verifiable AI Governance","abstract":"AI systems increasingly shape high-stakes decisions in healthcare, law, defense, and education, yet existing governance paradigms -- AI Ethics, AI Safety, and AI Alignment -- share a common limitation: they evaluate outcomes rather than verifying the reasoning process itself. This paper introduces AI Integrity, a concept defined as a state in which the Authority Stack of an AI system -- its layered hierarchy of values, epistemological standards, source preferences, and data selection criteria -- is protected from corruption, contamination, manipulation, and bias, and maintained in a verifiable manner. 
We distinguish AI Integrity from the three existing paradigms, define the Authority Stack as a 4-layer cascade model (Normative, Epistemic, Source, and Data Authority) grounded in established academic frameworks -- Schwartz Basic Human Values for normative authority, Walton argumentation schemes with GRADE/CEBM hierarchies for epistemic authority, and Source Credibility Theory for source authority -- characterize the distinction between legitimate cascading and Authority Pollution, and identify Integrity Hallucination as the central measurable threat to value consistency. We further specify the PRISM (Profile-based Reasoning Integrity Stack Measurement) framework as the operational methodology, defining six core metrics and a phased research roadmap. Unlike normative frameworks that prescribe which values are correct, AI Integrity is a procedural concept: it requires that the path from evidence to conclusion be transparent and auditable, regardless of which values a system holds.","published_date":"2026-04-13T06:45:30+00:00","viability_score":0,"cluster_label":"AI Governance","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"AI Integrity proposes a new paradigm for verifiable AI governance by focusing on the procedural verification of an AI system's reasoning process, rather than just its outcomes.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11061v1","title":"Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?","abstract":"Mechanistic interpretability is often motivated for alignment auditing, where a model's verbal explanations can be absent, incomplete, or misleading. Yet many evaluations do not control whether black-box prompting alone can recover the target behavior, so apparent gains from white-box tools may reflect elicitation rather than internal signal; we call this the elicitation confounder. 
We introduce Pando, a model-organism benchmark that breaks this confound via an explanation axis: models are trained to produce either faithful explanations of the true rule, no explanation, or confident but unfaithful explanations of a disjoint distractor rule. Across 720 finetuned models implementing hidden decision-tree rules, agents predict held-out model decisions from 10 labeled query-response pairs, optionally augmented with one interpretability tool output. When explanations are faithful, black-box elicitation matches or exceeds all white-box methods; when explanations are absent or misleading, gradient-based attribution improves accuracy by 3-5 percentage points, and relevance patching (RelP) gives the largest gains, while logit lens, sparse autoencoders, and circuit tracing provide no reliable benefit. Variance decomposition suggests gradients track decision computation (which fields causally drive the output), whereas other readouts are dominated by task representation (biases toward field identity and value). We release all models, code, and evaluation infrastructure.","published_date":"2026-04-13T06:42:24+00:00","viability_score":8,"cluster_label":"AI Interpretability","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Pando is a model-organism benchmark that evaluates AI interpretability methods by controlling for model explanations, revealing that gradient-based attribution and relevance patching offer significant gains when models don't explain themselves.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11056v1","title":"Rethinking Token-Level Credit Assignment in RLVR: A Polarity-Entropy Analysis","abstract":"Reinforcement Learning with Verifiable Rewards (RLVR) has substantially improved the reasoning ability of Large Language Models (LLMs). However, its sparse outcome-based rewards pose a fundamental credit assignment problem. 
We analyze this problem through the joint lens of reward polarity and token entropy. Our diagnostic tool, the Four Quadrant Decomposition, isolates token updates by polarity and entropy, and controlled ablations show that reasoning improvements concentrate in the high-entropy quadrants. To justify this observation theoretically, we adapt Conditional Mutual Information to the autoregressive RLVR setting and prove that the credit a token can carry is upper-bounded by its entropy. This view yields testable predictions that reasoning gains arise primarily from high-entropy tokens, with unique roles for positive and negative updates. A gradient analysis of GRPO further reveals how uniform reward broadcast dilutes signal at high-entropy positions while over-crediting deterministic tokens. Grounded in these insights, we propose Entropy-Aware Policy Optimization (EAPO) that modulates token-level learning signals accordingly. Extensive experiments demonstrate that EAPO outperforms strong baselines across two model families.","published_date":"2026-04-13T06:32:49+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel method to improve LLM reasoning by analyzing and modulating token-level credit assignment based on polarity and entropy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11050v1","title":"Shared Emotion Geometry Across Small Language Models: A Cross-Architecture Study of Representation, Behavior, and Methodological Confounds","abstract":"We extract 21-emotion vector sets from twelve small language models (six architectures x base/instruct, 1B-8B parameters) under a unified comprehension-mode pipeline at fp16 precision, and compare the resulting geometries via representational similarity analysis on raw cosine RDMs. 
The five mature architectures (Qwen 2.5 1.5B, SmolLM2 1.7B, Llama 3.2 3B, Mistral 7B v0.3, Llama 3.1 8B) share nearly identical 21-emotion geometry, with pairwise RDM Spearman correlations of 0.74-0.92. This universality persists across diametrically opposed behavioral profiles: Qwen 2.5 and Llama 3.2 occupy opposite poles of MTI Compliance facets yet produce nearly identical emotion RDMs (rho = 0.81), so behavioral facet differences arise above the shared emotion representation. Gemma-3 1B base, the one immature case in our dataset, exhibits extreme residual-stream anisotropy (0.997) and is restructured by RLHF across all geometric descriptors, whereas the five already-mature families show within-family base x instruct RDM correlations of rho >= 0.92 (Mistral 7B v0.3 at rho = 0.985), suggesting RLHF restructures only representations that are not yet organized. Methodologically, we show that what prior work has read as a single comprehension-vs-generation method effect in fact decomposes into four distinct layers -- a coarse method-dependent dissociation, robust sub-parameter sensitivity within generation, a true precision (fp16 vs INT8) effect, and a conflated cross-experiment bias that distorts in opposite directions for different models -- so that a single rho between two prior emotion-vector studies is not a safe basis for interpretation without the layered decomposition.","published_date":"2026-04-13T06:27:40+00:00","viability_score":7,"cluster_label":"LLM Representation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Uncovers universal emotion geometry across diverse small language models, revealing how RLHF restructures representations and identifying methodological confounds.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11048v1","title":"A Systematic Analysis of the Impact of Persona Steering on LLM Capabilities","abstract":"Imbuing Large Language Models (LLMs) with specific personas 
is prevalent for tailoring interaction styles, yet the impact on underlying cognitive capabilities remains unexplored. We employ the Neuron-based Personality Trait Induction (NPTI) framework to induce Big Five personality traits in LLMs and evaluate performance across six cognitive benchmarks. Our findings reveal that persona induction produces stable, reproducible shifts in cognitive task performance beyond surface-level stylistic changes. These effects exhibit strong task dependence: certain personalities yield consistent gains on instruction-following, while others impair complex reasoning. Effect magnitude varies systematically by trait dimension, with Openness and Extraversion exerting the most robust influence. Furthermore, LLM effects show 73.68% directional consistency with human personality-cognition relationships. Capitalizing on these regularities, we propose Dynamic Persona Routing (DPR), a lightweight query-adaptive strategy that outperforms the best static persona without additional training.","published_date":"2026-04-13T06:24:25+00:00","viability_score":8,"cluster_label":"LLM Personalization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel framework for persona induction in LLMs that dynamically routes queries to improve cognitive capabilities and outperforms static personas.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11043v1","title":"EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models","abstract":"Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image--text), leaving \\emph{unpaired} modality pairs (e.g., audio$\\leftrightarrow$depth, infrared$\\leftrightarrow$audio) weakly connected and thus performing poorly on zero-shot transfer. 
Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose \\textbf{EmergentBridge}, an embedding-level bridging framework that improves performance on these unpaired pairs \\emph{without requiring exhaustive pairwise supervision}. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce \\emph{gradient interference}, degrading the anchor-alignment structure that existing retrieval/classification relies on. EmergentBridge addresses this by (i) learning a mapping that produces a \\emph{noisy bridge anchor} (a proxy embedding of an already-aligned modality) from an anchor embedding, and (ii) enforcing proxy alignment only in the subspace orthogonal to the anchor-alignment direction, preserving anchor alignment while strengthening non-anchor connectivity. Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment.","published_date":"2026-04-13T06:15:11+00:00","viability_score":7,"cluster_label":"Multimodal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"EmergentBridge improves zero-shot cross-modal transfer in multimodal embedding models by learning a bridging framework that strengthens connections between unpaired modalities without requiring exhaustive pairwise supervision.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11041v1","title":"From Topology to Trajectory: LLM-Driven World Models For Supply Chain Resilience","abstract":"Semiconductor supply chains face unprecedented resilience challenges amidst global geopolitical turbulence. 
Conventional Large Language Model (LLM) planners, when confronting such non-stationary \"Policy Black Swan\" events, frequently suffer from Decision Paralysis or a severe Grounding Gap due to the absence of physical environmental modeling. This paper introduces ReflectiChain, a cognitive agentic framework tailored for resilient macroeconomic supply chain planning. The core innovation lies in the integration of Latent Trajectory Rehearsal powered by a generative world model, which couples reflection-in-action (System 2 deliberation) with delayed reflection-on-action. Furthermore, we leverage a Retrospective Agentic RL mechanism to enable autonomous policy evolution during the deployment phase (test-time). Evaluations conducted on our high-fidelity benchmark, Semi-Sim, demonstrate that under extreme scenarios such as export bans and material shortages, ReflectiChain achieves a 250% improvement in average step rewards over the strongest LLM baselines. It successfully restores the Operability Ratio (OR) from a deficient 13.3% to over 88.5% while ensuring robust gradient convergence. 
Ablation studies further underscore that the synergy between physical grounding constraints and double-loop learning is fundamental to bridging the gap between semantic reasoning and physical reality for long-horizon strategic planning.","published_date":"2026-04-13T06:14:15+00:00","viability_score":9,"cluster_label":"Supply Chain AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An agentic framework for resilient semiconductor supply chain planning that integrates LLM-driven world models with latent trajectory rehearsal and retrospective RL.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11040v1","title":"Intelligent Approval of Access Control Flow in Office Automation Systems via Relational Modeling","abstract":"Office automation (OA) systems play a crucial role in enterprise operations and management, with access control flow approval (ACFA) being a key component that manages the accessibility of various resources. However, traditional ACFA requires approval from the person in charge at each step, which consumes a significant amount of manpower and time. Making this approval process intelligent is an urgent problem for companies. In this paper, we propose a novel relational modeling-driven intelligent approval (RMIA) framework to automate ACFA. Specifically, our RMIA consists of two core modules: (1) The binary relation modeling module aims to characterize the coupling relation between applicants and approvers and provide reliable basic information for ACFA decision-making from a coarse-grained perspective. (2) The ternary relation modeling module utilizes specific resource information as its core, characterizing the complex relations between applicants, resources, and approvers, and thus provides fine-grained gain information for informed decision-making. Then, our RMIA effectively fuses these two kinds of information to form the final decision. 
Finally, extensive experiments are conducted on two product datasets and an online A/B test to verify the effectiveness of RMIA.","published_date":"2026-04-13T06:13:52+00:00","viability_score":5,"cluster_label":"Workflow Automation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Automate office access control approvals using a novel relational modeling framework that significantly reduces manual effort and time.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.11037v1","title":"RTMC: Step-Level Credit Assignment via Rollout Trees","abstract":"Multi-step agentic reinforcement learning benefits from fine-grained credit assignment, yet existing approaches offer limited options: critic-free methods like GRPO assign a uniform advantage to every action in a trajectory, while learned value networks introduce notable overhead and can be fragile under sparse rewards. We observe that group rollouts targeting the same problem often traverse overlapping intermediate states, implicitly forming a tree whose branches diverge at successive decision points. Building on this insight, we introduce Rollout-Tree Monte Carlo (RTMC) advantage estimation, which aggregates return statistics across rollouts sharing a common state to produce per-step Q-values and advantages--without any learned critic. A state-action signature system compresses raw interaction histories into compact, comparable representations, making cross-rollout state matching tractable. 
On SWE-bench Verified, RTMC improves pass@1 by 3.2 percentage points over GRPO.","published_date":"2026-04-13T06:01:51+00:00","viability_score":4,"cluster_label":"Reinforcement Learning Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Improve multi-step agentic reinforcement learning by assigning credit at each step using a novel rollout tree aggregation method, without needing a learned critic.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11036v1","title":"Uncertainty-Aware Web-Conditioned Scientific Fact-Checking","abstract":"Scientific fact-checking is vital for assessing claims in specialized domains such as biomedicine and materials science, yet existing systems often hallucinate or apply inconsistent reasoning, especially when verifying technical, compositional claims against an evidence snippet under source and cost/latency constraints. We present a pipeline centered on atomic predicate-argument decomposition and calibrated, uncertainty-gated corroboration: atomic facts are aligned to local snippets via embeddings, verified by a compact evidence-grounded checker, and only facts with uncertain support trigger domain-restricted web search over authoritative sources. The system supports both binary and tri-valued classification; for three-way tasks it predicts one of Supported, Refuted, or NEI. We evaluate under two regimes, Context-Only (no web) and Context+Web (uncertainty-gated web corroboration); when retrieved evidence conflicts with the provided context, we abstain with NEI rather than overriding the context. On multiple benchmarks, our framework surpasses the strongest baselines. In our experiments, web corroboration was invoked for only a minority of atomic facts on average, indicating that external evidence is consulted selectively under calibrated uncertainty rather than routinely. 
Overall, coupling atomic granularity with calibrated, uncertainty-gated corroboration yields more interpretable and context-conditioned verification, making the approach well-suited to high-stakes, single-document settings that demand traceable rationales, predictable cost/latency, and conservative.","published_date":"2026-04-13T06:01:20+00:00","viability_score":5,"cluster_label":"Scientific Fact-Checking","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Enhance scientific fact-checking with an uncertainty-aware pipeline that decomposes claims, verifies them against evidence, and selectively uses web search for corroboration.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.11035v1","title":"Introspective Diffusion Language Models","abstract":"Diffusion language models promise parallel generation, yet still lag behind autoregressive (AR) models in quality. We trace this gap to a failure of introspective consistency: AR models agree with their own generations, while DLMs often do not. We define the introspective acceptance rate, which measures whether a model accepts its previously generated tokens. This reveals why AR training has a structural advantage: causal masking and logit shifting implicitly enforce introspective consistency. Motivated by this observation, we introduce the Introspective Diffusion Language Model (I-DLM), a paradigm that retains diffusion-style parallel decoding while inheriting the introspective consistency of AR training. I-DLM uses a novel introspective strided decoding (ISD) algorithm, which enables the model to verify previously generated tokens while advancing new ones in the same forward pass. From a systems standpoint, we build the I-DLM inference engine on AR-inherited optimizations and further customize it with a stationary-batch scheduler. 
To the best of our knowledge, I-DLM is the first DLM to match the quality of its same-scale AR counterpart while outperforming prior DLMs in both model quality and practical serving efficiency across 15 benchmarks. It reaches 69.6 on AIME-24 and 45.7 on LiveCodeBench-v6, exceeding LLaDA-2.1-mini (16B) by more than 26 and 15 points, respectively. Beyond quality, I-DLM is designed for the growing demand of large-concurrency serving, delivering about 3x higher throughput than prior state-of-the-art DLMs.","published_date":"2026-04-13T06:01:01+00:00","viability_score":7,"cluster_label":"Generative Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Achieve autoregressive-level text generation quality with diffusion models through introspective consistency, enabling faster parallel decoding and higher serving throughput.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11028v1","title":"Federated Single-Agent Robotics: Multi-Robot Coordination Without Intra-Robot Multi-Agent Fragmentation","abstract":"As embodied robots move toward fleet-scale operation, multi-robot coordination is becoming a central systems challenge. Existing approaches often treat this as motivation for increasing internal multi-agent decomposition within each robot. We argue for a different principle: multi-robot coordination does not require intra-robot multi-agent fragmentation. Each robot should remain a single embodied agent with its own persistent runtime, local policy scope, capability state, and recovery authority, while coordination emerges through federation across robots at the fleet level. We present Federated Single-Agent Robotics (FSAR), a runtime architecture for multi-robot coordination built on single-agent robot runtimes. Each robot exposes a governed capability surface rather than an internally fragmented agent society. 
Fleet coordination is achieved through shared capability registries, cross-robot task delegation, policy-aware authority assignment, trust-scoped interaction, and layered recovery protocols. We formalize key coordination relations including authority delegation, inter-robot capability requests, local-versus-fleet recovery boundaries, and hierarchical human supervision, and describe a fleet runtime architecture supporting shared Embodied Capability Module (ECM) discovery, contract-aware cross-robot coordination, and fleet-level governance. We evaluate FSAR on representative multi-robot coordination scenarios against decomposition-heavy baselines. Results show statistically significant gains in governance locality (d=2.91, p<.001 vs. centralized control) and recovery containment (d=4.88, p<.001 vs. decomposition-heavy), while reducing authority conflicts and policy violations across all scenarios. Our results support the view that the path from embodied agents to embodied fleets is better served by federation across coherent robot runtimes than by fragmentation within them.","published_date":"2026-04-13T05:51:13+00:00","viability_score":4,"cluster_label":"Robotics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A runtime architecture for multi-robot coordination that enables fleet-level federation without fragmenting individual robot agents, improving governance and recovery.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.11026v1","title":"Optimal Stability of KL Divergence under Gaussian Perturbations","abstract":"We study the problem of characterizing the stability of Kullback-Leibler (KL) divergence under Gaussian perturbations beyond Gaussian families. 
Existing relaxed triangle inequalities for KL divergence critically rely on the assumption that all involved distributions are Gaussian, which limits their applicability in modern applications such as out-of-distribution (OOD) detection with flow-based generative models. In this paper, we remove this restriction by establishing a sharp stability bound between an arbitrary distribution and Gaussian families under mild moment conditions. Specifically, let $P$ be a distribution with finite second moment, and let $\\mathcal{N}_1$ and $\\mathcal{N}_2$ be multivariate Gaussian distributions. We show that if $KL(P||\\mathcal{N}_1)$ is large and $KL(\\mathcal{N}_1||\\mathcal{N}_2)$ is at most $\u03b5$, then $KL(P||\\mathcal{N}_2) \\ge KL(P||\\mathcal{N}_1) - O(\\sqrt\u03b5)$. Moreover, we prove that this $\\sqrt\u03b5$ rate is optimal in general, even within the Gaussian family. This result reveals an intrinsic stability property of KL divergence under Gaussian perturbations, extending classical Gaussian-only relaxed triangle inequalities to general distributions. The result is non-trivial due to the asymmetry of KL divergence and the absence of a triangle inequality in general probability spaces. As an application, we provide a rigorous foundation for KL-based OOD analysis in flow-based models, removing strong Gaussian assumptions used in prior work. 
More broadly, our result enables KL-based reasoning in non-Gaussian settings arising in deep learning and reinforcement learning.","published_date":"2026-04-13T05:49:59+00:00","viability_score":0,"cluster_label":"Machine Learning Theory","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Establishes a sharp stability bound for KL divergence under Gaussian perturbations for arbitrary distributions, extending classical inequalities and providing a foundation for OOD analysis.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11019v1","title":"Brief2Design: A Multi-phased, Compositional Approach to Prompt-based Graphic Design","abstract":"Professional designers work from client briefs that specify goals and constraints but often lack concrete design details. Translating these abstract requirements into visual designs poses a central challenge, yet existing tools address specific aspects or induce fixation through complete outputs. Through interviews with six professional designers, we identified how designers address this challenge: first structuring ambiguous requirements, then exploring individual elements, and finally recombining alternatives. We developed Brief2Design, supporting this workflow through requirement extraction and recommendation, element-level exploration for objects, backgrounds, text, typography, and composition, and flexible recombination of selected elements. A within-subjects study with twelve designers compared Brief2Design against a conversational baseline. The structured approach increased prompt diversity and received high ratings for requirement extraction and recommendation, but required longer generation time and achieved comparable image diversity. 
These findings reveal that structured workflows benefit requirement clarification at the cost of efficiency, informing design trade-offs for AI-assisted graphic design tools.","published_date":"2026-04-13T05:34:24+00:00","viability_score":4,"cluster_label":"Generative Design","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Brief2Design supports a multi-phased graphic design workflow, enabling designers to structure ambiguous requirements, explore individual elements, and recombine alternatives for AI-assisted design.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11017v1","title":"NimbusGuard: A Novel Framework for Proactive Kubernetes Autoscaling Using Deep Q-Networks","abstract":"Cloud native architecture is about building and running scalable microservice applications to take full advantage of the cloud environments. Managed Kubernetes is the powerhouse orchestrating cloud native applications with elastic scaling. However, traditional Kubernetes autoscalers are reactive, meaning the scaling controllers adjust resources only after they detect demand within the cluster and do not incorporate any predictive measures. This can lead to either over-provisioning and increased costs or under-provisioning and performance degradation. We propose NimbusGuard, an open-source, Kubernetes-based autoscaling system that leverages a deep reinforcement learning agent to provide proactive autoscaling. The agent's perception is augmented by a Long Short-Term Memory model that forecasts future workload patterns. The evaluations were conducted by comparing NimbusGuard against the built-in scaling controllers, such as Horizontal Pod Autoscaler, and the event-driven autoscaler KEDA. 
The experimental results demonstrate how NimbusGuard's proactive framework translates into superior performance and cost efficiency compared to existing reactive methods.","published_date":"2026-04-13T05:32:12+00:00","viability_score":7,"cluster_label":"Cloud Infrastructure","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"NimbusGuard is an open-source Kubernetes autoscaling system using deep reinforcement learning and LSTMs to proactively predict workload patterns for improved performance and cost efficiency.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11012v1","title":"Min-$k$ Sampling: Decoupling Truncation from Temperature Scaling via Relative Logit Dynamics","abstract":"The quality of text generated by large language models depends critically on the decoding sampling strategy. While mainstream methods such as Top-$k$, Top-$p$, and Min-$p$ achieve a balance between diversity and accuracy through probability-space truncation, they share an inherent limitation: extreme sensitivity to the temperature parameter. Recent logit-space approaches like Top-$n\u03c3$ achieve temperature invariance but rely on global statistics that are susceptible to long-tail noise, failing to capture fine-grained confidence structures among top candidates. We propose \\textbf{Min-$k$ Sampling}, a novel dynamic truncation strategy that analyzes the local shape of the sorted logit distribution to identify \"semantic cliffs\": sharp transitions from high-confidence core tokens to uncertain long-tail tokens. By computing a position-weighted relative decay rate, Min-$k$ dynamically determines truncation boundaries at each generation step. We formally prove that Min-$k$ achieves strict temperature invariance and empirically demonstrate its low sensitivity to hyperparameter choices. 
Experiments on multiple reasoning benchmarks, creative writing tasks, and human evaluation show that Min-$k$ consistently improves text quality, maintaining robust performance even under extreme temperature settings where probability-based methods collapse. We make our code, models, and analysis tools publicly available.","published_date":"2026-04-13T05:25:36+00:00","viability_score":7,"cluster_label":"LLM Sampling","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel sampling strategy for large language models that dynamically adjusts truncation based on local logit distribution to improve text quality and temperature invariance.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11005v1","title":"Diffusion-CAM: Faithful Visual Explanations for dMLLMs","abstract":"While diffusion Multimodal Large Language Models (dMLLMs) have recently achieved remarkable strides in multimodal generation, the development of interpretability mechanisms has lagged behind their architectural evolution. Unlike traditional autoregressive models that produce sequential activations, diffusion-based architectures generate tokens via parallel denoising, resulting in smooth, distributed activation patterns across the entire sequence. Consequently, existing Class Activation Mapping (CAM) methods, which are tailored for local, sequential dependencies, are ill-suited for interpreting these non-autoregressive behaviors. To bridge this gap, we propose Diffusion-CAM, the first interpretability method specifically tailored for dMLLMs. We derive raw activation maps by differentiably probing intermediate representations in the transformer backbone, accordingly capturing both latent features and their class-specific gradients. To address the inherent stochasticity of these raw signals, we incorporate four key modules to resolve spatial ambiguity and mitigate intra-image confounders and redundant token correlations. 
Extensive experiments demonstrate that Diffusion-CAM significantly outperforms SoTA methods in both localization accuracy and visual fidelity, establishing a new standard for understanding the parallel generation process of diffusion multimodal systems.","published_date":"2026-04-13T05:14:18+00:00","viability_score":7,"cluster_label":"Multimodal AI Interpretability","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Diffusion-CAM is the first interpretability method for diffusion multimodal large language models, providing faithful visual explanations for their parallel generation process.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11004v1","title":"Panoptic Pairwise Distortion Graph","abstract":"In this work, we introduce a new perspective on comparative image assessment by representing an image pair as a structured composition of its regions. In contrast, existing methods focus on whole image analysis, while implicitly relying on region-level understanding. We extend the intra-image notion of a scene graph to inter-image, and propose a novel task of Distortion Graph (DG). DG treats paired images as a structured topology grounded in regions, and represents dense degradation information such as distortion type, severity, comparison and quality score in a compact interpretable graph structure. To realize the task of learning a distortion graph, we contribute (i) a region-level dataset, PandaSet, (ii) a benchmark suite, PandaBench, with varying region-level difficulty, and (iii) an efficient architecture, Panda, to generate distortion graphs. We demonstrate that PandaBench poses a significant challenge for state-of-the-art multimodal large language models (MLLMs) as they fail to understand region-level degradations even when fed with explicit region cues. 
We show that training on PandaSet or prompting with DG elicits region-wise distortion understanding, opening a new direction for fine-grained, structured pairwise image assessment.","published_date":"2026-04-13T05:12:16+00:00","viability_score":7,"cluster_label":"Multimodal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Panoptic Pairwise Distortion Graph represents image pairs as structured graphs of regions to enable fine-grained, interpretable pairwise image assessment, challenging current multimodal models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.11003v1","title":"Sanity Checks for Agentic Data Science","abstract":"Agentic data science (ADS) pipelines have grown rapidly in both capability and adoption, with systems such as OpenAI Codex now able to directly analyze datasets and produce answers to statistical questions. However, these systems can reach falsely optimistic conclusions that are difficult for users to detect. To address this, we propose a pair of lightweight sanity checks grounded in the Predictability-Computability-Stability (PCS) framework for veridical data science. These checks use reasonable perturbations to screen whether an agent can reliably distinguish signal from noise, acting as a falsifiability constraint that can expose affirmative conclusions as unsupported. Together, the two checks characterize the trustworthiness of an ADS output, e.g. whether it has found stable signal, is responding to noise, or is sensitive to incidental aspects of the input. We validate the approach on synthetic data with controlled signal-to-noise ratios, confirming that the sanity checks track ground-truth signal strength. We then demonstrate the checks on 11 real-world datasets using OpenAI Codex, characterizing the trustworthiness of each conclusion and finding that in 6 of the datasets an affirmative conclusion is not well-supported, even though a single ADS run may support one. 
We further analyze failure modes of ADS systems and find that ADS self-reported confidence is poorly calibrated to the empirical stability of its conclusions.","published_date":"2026-04-13T05:11:28+00:00","viability_score":7,"cluster_label":"Agentic AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Lightweight sanity checks based on the PCS framework to expose unreliable conclusions from agentic data science pipelines by testing for signal stability.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.11854v1","title":"MVAdapt: Zero-Shot Multi-Vehicle Adaptation for End-to-End Autonomous Driving","abstract":"End-to-End (E2E) autonomous driving models are usually trained and evaluated with a fixed ego-vehicle, even though their driving policy is implicitly tied to vehicle dynamics. When such a model is deployed on a vehicle with different size, mass, or drivetrain characteristics, its performance can degrade substantially; we refer to this problem as the vehicle-domain gap. To address it, we propose MVAdapt, a physics-conditioned adaptation framework for multi-vehicle E2E driving. MVAdapt combines a frozen TransFuser++ scene encoder with a lightweight physics encoder and a cross-attention module that conditions scene features on vehicle properties before waypoint decoding. In the CARLA Leaderboard 1.0 benchmark, MVAdapt improves over naive transfer and multi-embodiment adaptation baselines on both in-distribution and unseen vehicles. We further show two complementary behaviors: strong zero-shot transfer on many unseen vehicles, and data-efficient few-shot calibration for severe physical outliers. These results suggest that explicitly conditioning E2E driving policies on vehicle physics is an effective step toward more transferable autonomous driving models. 
All code is available at https://github.com/hae-sung-oh/MVAdapt","published_date":"2026-04-13T05:05:04+00:00","viability_score":7,"cluster_label":"Autonomous Driving","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Adapt autonomous driving models to different vehicle dynamics with a physics-conditioned framework for improved transferability.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10996v1","title":"When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies","abstract":"Can large language models (LLMs) generate continuous numerical features that improve reinforcement learning (RL) trading agents? We build a modular pipeline where a frozen LLM serves as a stateless feature extractor, transforming unstructured daily news and filings into a fixed-dimensional vector consumed by a downstream PPO agent. We introduce an automated prompt-optimization loop that treats the extraction prompt as a discrete hyperparameter and tunes it directly against the Information Coefficient - the Spearman rank correlation between predicted and realized returns - rather than NLP losses. The optimized prompt discovers genuinely predictive features (IC above 0.15 on held-out data). However, these valid intermediate representations do not automatically translate into downstream task performance: during a distribution shift caused by a macroeconomic shock, LLM-derived features add noise, and the augmented agent under-performs a price-only baseline. In a calmer test regime the agent recovers, yet macroeconomic state variables remain the most robust driver of policy improvement. 
Our findings highlight a gap between feature-level validity and policy-level robustness that parallels known challenges in transfer learning under distribution shift.","published_date":"2026-04-13T04:53:06+00:00","viability_score":4,"cluster_label":"LLM Finance","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"LLM-generated financial features can improve RL trading agents, but robustness issues arise during market regime shifts.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10991v1","title":"Examining EAP Students' AI Disclosure Intention: A Cognition-Affect-Conation Perspective","abstract":"The growing use of generative artificial intelligence (AI) in academic writing has raised increasing concerns regarding transparency and academic integrity in higher education. This study examines the psychological factors influencing English for Academic Purposes (EAP) students' intention to disclose their use of AI tools. Drawing on the cognition-affect-conation framework, the study proposes a model integrating both enabling and inhibiting factors shaping disclosure intention. A sequential explanatory mixed-methods design was employed. Quantitative data from 324 EAP students at an English-medium instruction university in China were analysed using structural equation modelling, followed by semi-structured interviews with 15 students to further interpret the findings. The quantitative results indicate that psychological safety positively predicts AI disclosure intention, whereas fear of negative evaluation negatively predicts it. The qualitative findings further reveal that supportive teacher practices and clear guidance foster psychological safety, while policy ambiguity and reputational concerns intensify fear of negative evaluation and discourage disclosure. 
These findings highlight the importance of clear institutional policies and supportive pedagogical environments in promoting transparent AI use.","published_date":"2026-04-13T04:48:55+00:00","viability_score":0,"cluster_label":"EdTech AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigating psychological factors influencing student AI disclosure intentions in academic writing.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10990v1","title":"When Verification Fails: How Compositionally Infeasible Claims Escape Rejection","abstract":"Scientific claim verification, the task of determining whether claims are entailed by scientific evidence, is fundamental to establishing discoveries in evidence while preventing misinformation. This process involves evaluating each asserted constraint against validated evidence. Under the Closed-World Assumption (CWA), a claim is accepted if and only if all asserted constraints are positively supported. We show that existing verification benchmarks cannot distinguish models enforcing this standard from models applying a simpler shortcut called salient-constraint checking, which applies CWA's rejection criterion only to the most salient constraint and accepts when that constraint is supported. Because existing benchmarks construct infeasible claims by perturbing a single salient element they are insufficient at distinguishing between rigorous claim verification and simple salient-constraint reliance. To separate the two, we construct compositionally infeasible claims where the salient constraint is supported but a non-salient constraint is contradicted. Across model families and modalities, models that otherwise saturate existing benchmarks consistently over-accept these claims, confirming the prevalence of such shortcut reasoning. 
Via model context interventions, we show that different models and prompting strategies occupy distinct positions on a shared ROC curve, indicating that the gap between model families reflects differences in verification threshold rather than underlying reasoning ability, and that the compositional inference bottleneck is a structural property of current verification behavior that strategy guidance alone cannot overcome.","published_date":"2026-04-13T04:48:20+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"New benchmarks reveal LLMs struggle with compositional claim verification, relying on salient shortcuts rather than robust reasoning.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10989v1","title":"MAFIG: Multi-agent Driven Formal Instruction Generation Framework","abstract":"Emergency situations in scheduling systems often trigger local functional failures that undermine system stability and even cause system collapse. Existing methods primarily rely on robust scheduling or reactive scheduling, handling emergencies through predefined rules or rescheduling strategies. However, the diversity and unpredictability of real-world emergencies make them difficult to anticipate, which limits the adaptability of these methods in complex scenarios. Recent studies have shown that Large Language Models (LLMs) possess strong potential for complex scheduling tasks because of their extensive prior knowledge and strong reasoning capabilities. Nevertheless, the high inference latency of LLMs and the lengthy contextual information of scheduling systems significantly hinder their application for emergency handling. To mitigate these issues, we propose the Multi-agent Driven Formal Instruction Generation Framework (MAFIG). 
The framework constrains the decision scope to local functional modules affected by emergency situations and repairs scheduling logic rapidly by generating formal instructions. MAFIG contains a Perception Agent and an Emergency Decision Agent, which mitigates the adverse impact of lengthy system contexts on emergency decision-making. We further introduce span-focused loss-driven local distillation mechanism (SFL) to transfer the decision-making capability of powerful Cloud Large Language Models (C-LLMs) to lightweight local models, reducing inference latency while preserving decision-making effectiveness. Experiments in the Port, Warehousing, and Deck scheduling datasets show success rates of 98.49\\%, 94.97\\%, and 97.50\\%, with average processing times of 0.33 s, 0.23 s, and 0.19 s. These results demonstrate that MAFIG effectively mitigates the impact of emergencies and improves the robustness and adaptability of scheduling systems.","published_date":"2026-04-13T04:47:20+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for generating formal instructions to rapidly repair scheduling logic in emergency situations using multi-agent systems and distilled LLMs.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10988v1","title":"WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark","abstract":"Existing browser agent benchmarks face a fundamental trilemma: real-website benchmarks lack reproducibility due to content drift, controlled environments sacrifice realism by omitting real-web noise, and both require costly manual curation that limits scalability. 
We present WebForge, the first fully automated framework that resolves this trilemma through a four-agent pipeline -- Plan, Generate, Refine, and Validate -- that produces interactive, self-contained web environments end-to-end without human annotation. A seven-dimensional difficulty control framework structures task design along navigation depth, visual complexity, reasoning difficulty, and more, enabling systematic capability profiling beyond single aggregate scores. Using WebForge, we construct WebForge-Bench, a benchmark of 934 tasks spanning 7 domains and 3 difficulty levels. Multi-model experiments show that difficulty stratification effectively differentiates model capabilities, while cross-domain analysis exposes capability biases invisible to aggregate metrics. Together, these results confirm that multi-dimensional evaluation reveals distinct capability profiles that a single aggregate score cannot capture. Code and benchmark are publicly available at https://github.com/yuandaxia2001/WebForge.","published_date":"2026-04-13T04:45:27+00:00","viability_score":5,"cluster_label":"Browser Automation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"WebForge automates browser agent benchmarking by generating realistic, reproducible web environments.","time_to_mvp":"","tags":["quick_build"]},{"arxiv_id":"2604.10985v1","title":"Back to the Barn with LLAMAs: Evolving Pretrained LLM Backbones in Finetuning Vision Language Models","abstract":"Vision-Language Models (VLMs) have rapidly advanced by leveraging powerful pre-trained Large Language Models (LLMs) as core reasoning backbones. As new and more capable LLMs emerge with improved reasoning, instruction-following, and generalization, there is a pressing need to efficiently update existing VLMs to incorporate these advancements. 
However, the integration of new LLMs into VLMs, particularly how the evolving LLMs contribute to multimodal reasoning, alignment, and task-specific performance, remains underexplored. Addressing this gap is important for VLM development, given the rapid evolution of pretrained LLM backbones. This study presents a controlled and systematic investigation of how changes in the pretrained LLM backbone affect downstream VLM task performance. By holding the vision encoder, training data, and post-training algorithm fixed across LLAMA-1, LLAMA-2, and LLAMA-3 based VLMs, we find that newer LLM backbones do not always lead to better VLMs; performance depends on the downstream VLM task. For example, in visual question answering tasks, newer LLM backbones tend to solve different questions rather than just more questions, and our analysis shows this is driven by differences in how the models process information, including better calibrated confidence and more stable internal representations. We also find that some VLM capabilities appear only in the newest LLM generation, while tasks that depend mainly on visual understanding see little benefit from a newer LLM backbone.","published_date":"2026-04-13T04:44:42+00:00","viability_score":3,"cluster_label":"Vision Language Models","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Investigating how evolving LLM backbones impact Vision-Language Model performance across different downstream tasks.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10981v1","title":"ATANT v1.1: Positioning Continuity Evaluation Against Memory, Long-Context, and Agentic-Memory Benchmarks","abstract":"ATANT v1.0 (arXiv:2604.06710) defined continuity as a system property with 7 required properties and introduced a 10-checkpoint, LLM-free evaluation methodology validated on a 250-story corpus. 
Since publication, a recurring reviewer and practitioner question has concerned not the framework itself but its relationship to a wider set of memory evaluations: LOCOMO, LongMemEval, BEAM, MemoryBench, Zep's evaluation suite, Letta/MemGPT's evaluations, and RULER. This companion paper, v1.1, does not modify the v1.0 standard. It closes a related-work gap that v1.0 left brief under page limits. We show by structural analysis that none of these benchmarks measures continuity as defined in v1.0: of the 7 required properties, the median existing eval covers 1 property, the mean covers 0.43 when partial credit is scored at 0.5, and no eval covers more than 2. We provide a cell-by-cell property-coverage matrix, identify methodological defects specific to each benchmark (including an empty-gold scoring bug in the LOCOMO reference implementation that renders 23% of its corpus unscorable by construction), and publish our reference implementation's LOCOMO score (8.8%) alongside the structural reason that number is uninformative about continuity. We publish our 8.8% LOCOMO score alongside our 96% ATANT cumulative-scale score as a calibration pair: the 87-point divergence is evidence that the two benchmarks measure different properties, not that one system is an order of magnitude better than another. The position v1.1 takes is not adversarial: each benchmark measures a real capability. 
The claim is that none of them can adjudicate continuity, and conflating them with continuity evaluation has led the field to under-invest in the properties v1.0 names.","published_date":"2026-04-13T04:40:37+00:00","viability_score":4,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ATANT v1.1 positions continuity evaluation against existing memory, long-context, and agentic-memory benchmarks, showing by structural analysis that none of them measures continuity as defined in v1.0.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.10978v1","title":"Enabling and Inhibitory Pathways of Students' AI Use Concealment Intention in Higher Education: Evidence from SEM and fsQCA","abstract":"This study investigates students' AI use concealment intention in higher education by integrating the cognition-affect-conation (CAC) framework with a dual-method approach combining structural equation modelling (SEM) and fuzzy-set qualitative comparative analysis (fsQCA). Drawing on data from 1346 university students, the findings reveal two opposing mechanisms shaping concealment intention. The enabling pathway shows that perceived stigma, perceived risk, and perceived policy uncertainty increase fear of negative evaluation, which in turn promotes concealment. In contrast, the inhibitory pathway demonstrates that AI self-efficacy, perceived fairness, and perceived social support enhance psychological safety, thereby reducing concealment intention. SEM results confirm the hypothesised relationships and mediation effects, while fsQCA identifies multiple configurational pathways, highlighting equifinality and the central role of fear of negative evaluation across conditions. The study contributes to the literature by conceptualising concealment as a distinct behavioural outcome and by providing a nuanced explanation that integrates both net-effect and configurational perspectives. 
Practical implications emphasise the need for clear institutional policies, destigmatisation of appropriate AI use, and the cultivation of supportive learning environments to promote transparency.","published_date":"2026-04-13T04:35:07+00:00","viability_score":2,"cluster_label":"AI Ethics & Education","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This study investigates student AI use concealment intention in higher education using SEM and fsQCA, revealing enabling and inhibitory pathways related to stigma, self-efficacy, and psychological safety.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10973v1","title":"CFMS: A Coarse-to-Fine Multimodal Synthesis Framework for Enhanced Tabular Reasoning","abstract":"Reasoning over tabular data is a crucial capability for tasks like question answering and fact verification, as it requires models to comprehend both free-form questions and semi-structured tables. However, while methods like Chain-of-Thought (CoT) introduce reasoning chains, purely symbolic methods are inherently limited by their blindness to holistic visual patterns. To address this, we propose the Coarse-to-Fine Multimodal Synthesis framework (CFMS), a novel two-stage paradigm that hierarchically decouples high-level visual perception from granular symbolic reasoning. In the Coarse Stage, CFMS leverages Multimodal Large Language Models (MLLMs) to perform a one-time synthesis of a multi-perspective knowledge tuple. This tuple subsequently serves as a dynamic reasoning map to guide the Fine Stage, where a symbolic engine executes a targeted and efficient sequence of iterative operations over the table. Extensive experiments on the WikiTQ and TabFact benchmarks demonstrate that CFMS achieves competitive accuracy. 
The framework exhibits particular robustness when handling large tables and when instantiated with smaller backbone models, validating its effectiveness and generalizability.","published_date":"2026-04-13T04:21:12+00:00","viability_score":7,"cluster_label":"Multimodal Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CFMS is a two-stage framework that uses MLLMs for multimodal synthesis to guide a symbolic engine for efficient tabular reasoning, achieving competitive accuracy on benchmarks.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10971v1","title":"MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models","abstract":"In industrial anomaly detection, general anomaly detection (GAD) is an emerging trend and also the ultimate goal. Unlike the conventional single- and multi-class AD, general AD aims to train a general AD model that can directly detect anomalies in diverse novel classes without any retraining or fine-tuning on the target data. Recently, Multimodal Large Language Models (MLLMs) have shown great promise in achieving general anomaly detection due to their revolutionary visual understanding and language reasoning capabilities. However, the general AD ability of MLLMs remains underexplored for two reasons: (1) MLLMs are pretrained on large amounts of Web-sourced data that still differ significantly from the data in AD scenarios, and the image-text pairs used during pretraining are not specifically designed for AD tasks. (2) The current mainstream AD datasets are image-based and not yet suitable for post-training MLLMs. To facilitate MLLM-based general AD research, we present MMR-AD, which is a comprehensive benchmark for both training and evaluating MLLM-based AD models. With MMR-AD, we reveal that the AD performance of current SOTA generalist MLLMs still falls far behind the industrial requirements. 
Based on MMR-AD, we also propose a baseline model, Anomaly-R1, which is a reasoning-based AD model that learns from the CoT data in MMR-AD and is further enhanced by reinforcement learning. Extensive experiments show that our Anomaly-R1 achieves remarkable improvements over generalist MLLMs in both anomaly detection and localization.","published_date":"2026-04-13T04:14:56+00:00","viability_score":8,"cluster_label":"Anomaly Detection","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"MMR-AD is a large-scale multimodal dataset and baseline model (Anomaly-R1) for benchmarking general anomaly detection with MLLMs, revealing performance gaps and offering improvements.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10969v1","title":"Towards Automated Solar Panel Integrity: Hybrid Deep Feature Extraction for Advanced Surface Defect Identification","abstract":"To ensure energy efficiency and reliable operations, it is essential to monitor solar panels in generation plants to detect defects. Manually monitoring large-scale solar plants and those installed in remote areas is labor-intensive, time-consuming, and costly. Manual inspection may also be susceptible to human errors. Consequently, it is necessary to create an automated, intelligent defect-detection system that ensures continuous monitoring, early fault detection, and maximum power generation. We propose a novel hybrid method for defect detection in solar panels by combining both handcrafted and deep learning features. Local Binary Pattern (LBP), Histogram of Oriented Gradients (HoG) and Gabor Filters were used for the extraction of handcrafted features. Deep features were extracted using DenseNet-169. Both handcrafted and deep features were concatenated and then fed to three distinct types of classifiers, including Support Vector Machines (SVM), Extreme Gradient Boosting (XGBoost) and Light Gradient-Boosting Machine (LGBM). 
Experimental results on the augmented dataset show superior performance; in particular, DenseNet-169 + Gabor (SVM) achieved the highest accuracy, 99.17%, exceeding all the other systems. In general, the proposed hybrid framework offers better defect-detection accuracy, robustness, and flexibility, providing a solid basis for real-life use in automated PV panel monitoring systems.","published_date":"2026-04-13T04:13:45+00:00","viability_score":7,"cluster_label":"Computer Vision","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A hybrid method combining handcrafted features (LBP, HoG, Gabor) and deep features from DenseNet-169 for automated solar panel defect identification, achieving 99.17% accuracy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10966v1","title":"You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass","abstract":"We present a discriminative multimodal reward model that scores all candidate responses in a single forward pass. Conventional discriminative reward models evaluate each response independently, requiring multiple forward passes, one for each potential response. Our approach concatenates multiple responses with separator tokens and applies cross-entropy over their scalar scores, enabling direct comparative reasoning and efficient $N$-way preference learning. The multi-response design also yields up to $N\\times$ wall-clock speedup and FLOPs reduction over conventional single-response scoring. To enable $N$-way reward evaluation beyond existing pairwise benchmarks, we construct two new benchmarks: (1) MR$^2$Bench-Image contains human-annotated rankings over responses from 8 diverse models; (2) MR$^2$Bench-Video is a large-scale video-based reward benchmark derived from 94K crowdsourced pairwise human judgments over video question-answering spanning 19 models, denoised via preference graph ensemble. 
Both benchmarks provide 4-response evaluation variants sampled from the full rankings. Built on a 4B vision-language backbone with LoRA fine-tuning and a lightweight MLP value head, our model achieves state-of-the-art results on six multimodal reward benchmarks, including MR$^2$Bench-Image, MR$^2$Bench-Video, and four other existing benchmarks. Our model outperforms existing larger generative and discriminative reward models. We further demonstrate that our reward model, when used in reinforcement learning with GRPO, produces improved policy models that maintain performance across standard multimodal benchmarks while substantially improving open-ended generation quality, outperforming a single-response discriminative reward model (RM) baseline by a large margin in both training stability and open-ended generation quality.","published_date":"2026-04-13T04:02:03+00:00","viability_score":8,"cluster_label":"Multimodal Reward Modeling","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multimodal reward model that scores all candidate responses in a single forward pass, achieving significant speedups and outperforming existing models for improved generation quality.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10963v1","title":"Delving Aleatoric Uncertainty in Medical Image Segmentation via Vision Foundation Models","abstract":"Medical image segmentation supports clinical workflows by precisely delineating anatomical structures and lesions. However, medical image datasets suffer from acquisition noise and annotation ambiguity, causing pervasive data uncertainty that substantially undermines model robustness. Existing research focuses primarily on model architectural improvements and predictive reliability estimation, while systematic exploration of the intrinsic data uncertainty remains insufficient. 
To address this gap, this work proposes leveraging the universal representation capabilities of visual foundation models to estimate inherent data uncertainty. Specifically, we analyze the feature diversity of the model's decoded representations and quantify their singular value energy to define the semantic perception scale for each class, thereby measuring sample difficulty and aleatoric uncertainty. Based on this foundation, we design two uncertainty-driven application strategies: (1) the aleatoric uncertainty-aware data filtering mechanism to eliminate potentially noisy samples and enhance model learning quality; (2) the dynamic uncertainty-aware optimization strategy that adaptively adjusts class-specific loss weights during training based on the semantic perception scale, combined with a label denoising mechanism to improve training stability. Experimental results on five public datasets encompassing CT and MRI modalities and involving multi-organ and tumor segmentation tasks demonstrate that our method achieves significant and robust performance improvements across various mainstream network architectures, revealing the broad application potential of aleatoric uncertainty in medical image understanding and segmentation tasks.","published_date":"2026-04-13T03:59:54+00:00","viability_score":7,"cluster_label":"Medical Image Segmentation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Leveraging vision foundation models to estimate data uncertainty in medical image segmentation, enabling improved model robustness and training quality.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10960v1","title":"RAG-KT: Cross-platform Explainable Knowledge Tracing with Multi-view Fusion Retrieval Generation","abstract":"Knowledge Tracing (KT) infers a student's knowledge state from past interactions to predict future performance. 
Conventional Deep Learning (DL)-based KT models are typically tied to platform-specific identifiers and latent representations, making them hard to transfer and interpret. Large Language Model (LLM)-based methods can be either ungrounded under prompting or overly domain-dependent under fine-tuning. In addition, most existing KT methods are developed and evaluated under a same-distribution assumption. In real deployments, educational data often arise from heterogeneous platforms with substantial distribution shift, which often degrades generalization. To this end, we propose RAG-KT, a retrieval-augmented paradigm that frames cross-platform KT as reliable context constrained inference with LLMs. It builds a unified multi-source structured context with cross-source alignment via Question Group abstractions and retrieves complementary rich and reliable context for each prediction, enabling grounded prediction and interpretable diagnosis. Experiments on three public KT benchmarks demonstrate consistent gains in accuracy and robustness, including strong performance under cross-platform conditions.","published_date":"2026-04-13T03:56:17+00:00","viability_score":8,"cluster_label":"Explainable Knowledge Tracing","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A retrieval-augmented knowledge tracing system that provides cross-platform explainability and improved accuracy by fusing multi-view retrieval and generation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10958v1","title":"Continuous-time Online Learning via Mean-Field Neural Networks: Regret Analysis in Diffusion Environments","abstract":"We study continuous-time online learning where data are generated by a diffusion process with unknown coefficients. The learner employs a two-layer neural network, continuously updating its parameters in a non-anticipative manner. 
The mean-field limit of the learning dynamics corresponds to a stochastic Wasserstein gradient flow adapted to the data filtration. We establish regret bounds for both the mean-field limit and finite-particle system. Our analysis leverages the logarithmic Sobolev inequality, Polyak-Lojasiewicz condition, Malliavin calculus, and uniform-in-time propagation of chaos. Under displacement convexity, we obtain a constant static regret bound. In the general non-convex setting, we derive explicit linear regret bounds characterizing the effects of data variation, entropic exploration, and quadratic regularization. Finally, our simulations demonstrate the outperformance of the online approach and the impact of network width and regularization parameters.","published_date":"2026-04-13T03:55:26+00:00","viability_score":3,"cluster_label":"Online Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Theoretical analysis of continuous-time online learning in diffusion environments using mean-field neural networks, establishing regret bounds.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.11852v1","title":"Evaluating the Limitations of Protein Sequence Representations for Parkinson's Disease Classification","abstract":"The identification of reliable molecular biomarkers for Parkinson's disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage-free evaluation of multiple representations derived exclusively from protein primary sequences, including amino acid composition, k-mers, physicochemical descriptors, hybrid representations, and embeddings from protein language models, all assessed under a nested stratified cross-validation framework to ensure unbiased performance estimation. 
The best-performing configuration (ProtBERT + MLP) achieves an F1-score of 0.704 +/- 0.028 and ROC-AUC of 0.748 +/- 0.047, indicating only moderate discriminative performance. Classical representations such as k-mers reach comparable F1 values (up to approximately 0.667), but exhibit highly imbalanced behavior, with recall close to 0.98 and precision around 0.50, reflecting a strong bias toward positive predictions. Across representations, performance differences remain within a narrow range (F1 between 0.60 and 0.70), while unsupervised analyses reveal no intrinsic structure aligned with class labels, and statistical testing (Friedman test, p = 0.1749) does not indicate significant differences across models. These results demonstrate substantial overlap between classes and indicate that primary sequence information alone provides limited discriminative power for Parkinson's disease classification. This work establishes a reproducible baseline and provides empirical evidence that more informative biological features, such as structural, functional, or interaction-based descriptors, are required for robust disease modeling.","published_date":"2026-04-13T03:54:24+00:00","viability_score":3,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Evaluating protein sequence representations for Parkinson's disease classification shows limited discriminative power, requiring more informative biological features.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.10957v1","title":"A molecular clock for writing systems reveals the quantitative impact of imperial power on cultural evolution","abstract":"Writing systems are cultural replicators whose evolution has never been studied quantitatively at global scale. We compile the Global Script Database (GSD): 300 writing and notation systems, 50 binary structural characters, and 259 phylogenetic edges spanning 5,400 years. 
Applying four methods -- phenetics, cladistics, Bayesian inference, and neural network clustering -- we find that scripts exhibit a detectable molecular clock. The best-fitting model (Mk+Gamma strict clock) yields a substitution rate of q = 0.226 substitutions/character/millennium (95% CI: 0.034-1.22; Delta BIC = -4.1 versus relaxed clock; Delta BIC = -1,364.7 versus Mk without rate variation). Political interventions break this clock: deviation from expected divergence times correlates with intervention intensity (Spearman rho = 0.556, p < 10^{-4}), and per-character rate analysis reveals that intervention selectively rewrites deep structural features rather than merely accelerating change (rate profile correlation rho = 0.320). We identify 30 major script replacement events and rank their destructive impact. A ceiling effect suppresses independent invention wherever writing already exists (Fisher's exact OR = 0.054, p < 10^{-6}), and colonial contact predicts script extinction (Cox HR = 5.25, p = 0.0006). The Spanish Empire extinguished the most scripts (6 of 12 contacted, 50%), followed by the Empire of Japan (3 of 9, 33.3%). 
Feature coding was validated by inter-rater reliability testing with two independent human coders (Cohen's kappa = 0.877; human-LLM kappa = 0.929; Fleiss' kappa = 0.911).","published_date":"2026-04-13T03:54:01+00:00","viability_score":2,"cluster_label":"Cultural Evolution","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A molecular clock for writing systems reveals the quantitative impact of imperial power on cultural evolution, with political interventions breaking the clock.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10949v1","title":"Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models","abstract":"Unified multimodal models (UMMs) were designed to combine the reasoning ability of large language models (LLMs) with the generation capability of vision models. In practice, however, this synergy remains elusive: UMMs fail to transfer LLM-like reasoning to image synthesis and exhibit divergent response behaviors. We term this phenomenon pseudo-unification. Diagnosing its internal causes is important, but existing probing methods either lack model-internal insight or ignore prompt-response dependencies. To address these limitations, we propose an information-theoretic probing framework that jointly analyzes how UMMs encode inputs and generate outputs. Applied to ten representative UMMs, our framework reveals that pseudo-unification stems from a dual divergence: (i) Modality-Asymmetric Encoding, where vision and language follow different entropy trajectories, and (ii) Pattern-Split Response, where text generation exhibits high-entropy creativity while image synthesis enforces low-entropy fidelity. Only models that unify both sides (e.g., via contextual prediction) achieve more genuine unification, enabling stronger reasoning-based text-to-image generation even with fewer parameters. 
Our work provides the first model-internal probing of unification, demonstrating that real multimodal synergy requires consistency in information flow, not just shared parameters.","published_date":"2026-04-13T03:46:45+00:00","viability_score":6,"cluster_label":"Multimodal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Pseudo-unification in multimodal models stems from divergent information patterns, requiring consistency in information flow for genuine synergy.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.10933v1","title":"QShield: Securing Neural Networks Against Adversarial Attacks using Quantum Circuits","abstract":"Deep neural networks remain highly vulnerable to adversarial perturbations, limiting their reliability in security- and safety-critical applications. To address this challenge, we introduce QShield, a modular hybrid quantum-classical neural network (HQCNN) architecture designed to enhance the adversarial robustness of classical deep learning models. QShield integrates a conventional convolutional neural network (CNN) backbone for feature extraction with a quantum processing module that encodes the extracted features into quantum states, applies structured entanglement operations under realistic noise models, and outputs a hybrid prediction through a dynamically weighted fusion mechanism implemented via a lightweight multilayer perceptron (MLP). We systematically evaluate both classical and hybrid quantum-classical models on the MNIST, OrganAMNIST, and CIFAR-10 datasets, using a comprehensive set of robustness, efficiency, and computational performance metrics. Our results demonstrate that classical models are highly vulnerable to adversarial attacks, whereas the proposed hybrid models with entanglement patterns maintain high predictive accuracy while substantially reducing attack success rates across a wide range of adversarial attacks. 
Furthermore, the proposed hybrid architecture significantly increased the computational cost required to generate adversarial examples, thereby introducing an additional layer of defense. These findings indicate that the proposed modular hybrid architecture achieves a practical balance between predictive accuracy and adversarial robustness, positioning it as a promising approach for secure and reliable machine learning in sensitive and safety-critical applications.","published_date":"2026-04-13T03:09:01+00:00","viability_score":7,"cluster_label":"AI Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"QShield, a hybrid quantum-classical neural network, enhances adversarial robustness of deep learning models for security-critical applications.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10923v1","title":"Mem$^2$Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation","abstract":"While large language model--powered agents can self-evolve by accumulating experience or by dynamically creating new assets (i.e., tools or expert agents), existing frameworks typically treat these two evolutionary processes in isolation. This separation overlooks their intrinsic interdependence: the former is inherently bounded by a manually predefined static toolset, while the latter generates new assets from scratch without experiential guidance, leading to limited capability growth and unstable evolution. To address this limitation, we introduce a novel paradigm of co-evolutionary Capability Expansion and Experience Distillation. Guided by this paradigm, we propose the \\textbf{Mem$^{\\textbf{2}}$Evolve}, which integrates two core components: \\textbf{Experience Memory} and \\textbf{Asset Memory}. 
Specifically, Mem$^{2}$Evolve leverages accumulated experience to guide the dynamic creation of assets, thereby expanding the agent's capability space while simultaneously acquiring new experience to achieve co-evolution. Extensive experiments across 6 task categories and 8 benchmarks demonstrate that Mem$^{2}$Evolve achieves an improvement of 18.53\\% over standard LLMs, 11.80\\% over agents evolving solely through experience, and 6.46\\% over those evolving solely through asset creation, establishing it as a substantially more effective and stable self-evolving agent framework. Code is available at: https://buaa-irip-llm.github.io/Mem2Evolve.","published_date":"2026-04-13T02:44:54+00:00","viability_score":8,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A co-evolutionary framework for self-evolving agents that expands capabilities by integrating experience memory with asset creation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10918v1","title":"CSPO: Alleviating Reward Ambiguity for Structured Table-to-LaTeX Generation","abstract":"Tables contain rich structured information, yet when stored as images their contents remain \"locked\" within pixels. Converting table images into LaTeX code enables faithful digitization and reuse, but current multimodal large language models (MLLMs) often fail to preserve structural, style, or content fidelity. Conventional post-training with reinforcement learning (RL) typically relies on a single aggregated reward, leading to reward ambiguity that conflates multiple behavioral aspects and hinders effective optimization. We propose Component-Specific Policy Optimization (CSPO), an RL framework that disentangles optimization across LaTeX table components: structure, style, and content. 
In particular, CSPO assigns component-specific rewards and backpropagates each signal only through the tokens relevant to its component, alleviating reward ambiguity and enabling targeted component-wise optimization. To comprehensively assess performance, we introduce a set of hierarchical evaluation metrics. Extensive experiments demonstrate the effectiveness of CSPO, underscoring the importance of component-specific optimization for reliable structured generation.","published_date":"2026-04-13T02:34:30+00:00","viability_score":7,"cluster_label":"Structured Generation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel RL framework that disentangles optimization for table-to-LaTeX generation by assigning component-specific rewards to improve structural, style, and content fidelity.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10916v1","title":"ReXSonoVQA: A Video QA Benchmark for Procedure-Centric Ultrasound Understanding","abstract":"Ultrasound acquisition requires skilled probe manipulation and real-time adjustments. Vision-language models (VLMs) could enable autonomous ultrasound systems, but existing benchmarks evaluate only static images, not dynamic procedural understanding. We introduce ReXSonoVQA, a video QA benchmark with 514 video clips and 514 questions (249 MCQ, 265 free-response) targeting three competencies: Action-Goal Reasoning, Artifact Resolution & Optimization, and Procedure Context & Planning. Zero-shot evaluation of Gemini 3 Pro, Qwen3.5-397B, LLaVA-Video-72B, and Seed 2.0 Pro shows VLMs can extract some procedural information, but troubleshooting questions remain challenging with minimal gains over text-only baselines, exposing limitations in causal reasoning. 
ReXSonoVQA enables developing perception systems for ultrasound training, guidance, and robotic automation.","published_date":"2026-04-13T02:32:51+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A video QA benchmark for procedure-centric ultrasound understanding, enabling the development of AI systems for training, guidance, and robotic automation in medical procedures.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10911v1","title":"EvoNash-MARL: A Closed-Loop Multi-Agent Reinforcement Learning Framework for Medium-Horizon Equity Allocation","abstract":"Medium-to-long-horizon stock allocation presents significant challenges due to weak predictive structures, non-stationary market regimes, and the degradation of signals following the application of transaction costs, capacity limits, and tail-risk constraints. Conventional approaches commonly rely on a single predictor or loosely coupled prediction-to-allocation pipeline, limiting robustness under distribution shift. This work addresses a targeted design question: whether coupling reinforcement learning (RL), multi-agent policy populations, Policy-Space Response Oracle (PSRO)-style aggregation, league best-response training, evolutionary replacement, and execution-aware checkpoint selection within a unified walk-forward loop improves allocator robustness at medium to long horizons. The proposed framework, EvoNash-MARL, integrates these components within an execution-aware allocation loop and further introduces a layered policy architecture comprising a direction head and a risk head, nonlinear signal enhancement, feature-quality reweighting, and constraint-aware checkpoint selection. 
Under a 120-window walk-forward protocol, the resolved v21 configuration achieves mean excess Sharpe 0.7600 and robust score -0.0203, ranking first among internal controls; on aligned daily out-of-sample returns from 2014-01-02 to 2024-01-05, it delivers 19.6% annualized return versus 11.7% for SPY, and in an extended walk-forward evaluation through 2026-02-10 it delivers 20.5% versus 13.5%. The framework maintains positive performance under realistic stress constraints and exhibits structured cross-market generalization; however, global strong significance under White's Reality Check (WRC) and SPA-lite testing is not established. Therefore, the results are presented as evidence supporting a more stable medium- to long-horizon training and selection paradigm, rather than as proof of universally superior market-timing performance.","published_date":"2026-04-13T02:24:32+00:00","viability_score":4,"cluster_label":"Finance AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A closed-loop multi-agent reinforcement learning framework for robust medium-horizon equity allocation, integrating advanced techniques for improved performance and generalization.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2604.10908v1","title":"Reasoning as Data: Representation-Computation Unity and Its Implementation in a Domain-Algebraic Inference Engine","abstract":"Every existing knowledge system separates storage from computation. We show this separation is unnecessary and eliminate it. In a standard triple is_a(Apple, Company), domain context lives in the query or the programmer's mind. In a CDC four-tuple is_a(Apple, Company, @Business), domain becomes a structural field embedded in predicate arity. Any system respecting arity automatically performs domain-scoped inference without external rules. We call this representation-computation unity (RCU). 
From the four-tuple structure, three inference mechanisms emerge: domain-scoped closure, typed inheritance, and write-time falsification via cycle detection per domain fiber. We establish RCU formally via four theorems. RCU is implementable. We present a working symbolic engine (2400 lines Python+Prolog) resolving four engineering issues: rule-data separation, shared-fiber handling, read-only meta-layer design, and intersective convergence. A central result: CDC domain-constrained inference is distinct from Prolog with a domain argument. Two case studies validate the engine. ICD-11 classification (1247 entities, 3 axes) shows fibers resolve multiple inheritance. CBT clinical reasoning shows generalization to temporal reasoning with session turn as ordered domain index. Multi-constraint queries realize CSP arc-consistency with complexity O(m (N/K)^2), confirming the domain lattice's sparsity governs performance. When domain is structural, data computes itself.","published_date":"2026-04-13T02:17:18+00:00","viability_score":1,"cluster_label":"Knowledge Representation","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A symbolic engine that unifies representation and computation for domain-specific inference by embedding domain context directly into predicate arity.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10905v1","title":"Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music","abstract":"We present Audio Flamingo Next (AF-Next), the next-generation and most capable large audio-language model in the Audio Flamingo series, designed to advance understanding and reasoning over speech, environmental sounds and music. 
Compared to Audio Flamingo 3, AF-Next introduces: (i) a stronger foundational audio-language model that significantly improves accuracy across diverse audio understanding tasks; (ii) scalable strategies for constructing large-scale audio understanding and reasoning data beyond existing academic benchmarks; (iii) support for long and complex audio inputs up to 30 minutes; and (iv) Temporal Audio Chain-of-Thought, a new reasoning paradigm that explicitly grounds intermediate reasoning steps to timestamps in long audio, enabling fine-grained temporal alignment and improved interpretability. To enable these capabilities, we first conduct a systematic analysis of Audio Flamingo 3 to identify key gaps in audio understanding and reasoning. We then curate and scale new large-scale datasets totaling over 1 million hours to address these limitations and expand the existing AudioSkills-XL, LongAudio-XL, AF-Think and AF-Chat datasets. AF-Next is trained using a curriculum-based strategy spanning pre-training, mid-training and post-training stages. Extensive experiments across 20 audio understanding and reasoning benchmarks, including challenging long-audio tasks, show that AF-Next outperforms similarly sized open models by large margins and remains highly competitive with and sometimes surpasses, much larger open-weight and closed models. Beyond benchmark performance, AF-Next exhibits strong real-world utility and transfers well to unseen tasks, highlighting its robustness and generalization ability. 
In addition to all data, code and methods, we open-source 3 variants of AF-Next, including AF-Next-Instruct, AF-Next-Think and AF-Next-Captioner.","published_date":"2026-04-13T02:11:56+00:00","viability_score":8,"cluster_label":"Audio-Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A next-generation open audio-language model that significantly improves understanding and reasoning over speech, sound, and music with long audio support and temporal chain-of-thought.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10904v1","title":"Evaluating the Impact of Medical Image Reconstruction on Downstream AI Fairness and Performance","abstract":"AI-based image reconstruction models are increasingly deployed in clinical workflows to improve image quality from noisy data, such as low-dose X-rays or accelerated MRI scans. However, these models are typically evaluated using pixel-level metrics like PSNR, leaving their impact on downstream diagnostic performance and fairness unclear. We introduce a scalable evaluation framework that applies reconstruction and diagnostic AI models in tandem, which we apply to two tasks (classification, segmentation), three reconstruction approaches (U-Net, GAN, diffusion), and two data types (X-ray, MRI) to assess the potential downstream implications of reconstruction. We find that conventional reconstruction metrics poorly track task performance, where diagnostic accuracy remains largely stable even as reconstruction PSNR declines with increasing image noise. Fairness metrics exhibit greater variability, with reconstruction sometimes amplifying demographic biases, particularly regarding patient sex. However, the overall magnitude of this additional bias is modest compared to the inherent biases already present in diagnostic models. 
To explore potential bias mitigation, we adapt two strategies from classification literature to the reconstruction setting, but observe limited efficacy. Overall, our findings emphasize the importance of holistic performance and fairness assessments throughout the entire medical imaging workflow, especially as generative reconstruction models are increasingly deployed.","published_date":"2026-04-13T02:07:48+00:00","viability_score":3,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework to evaluate how medical image reconstruction impacts downstream AI fairness and performance, revealing that pixel-level metrics don't track diagnostic accuracy or bias.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.10900v1","title":"CASK: Core-Aware Selective KV Compression for Reasoning Traces","abstract":"In large language models performing long-form reasoning, the KV cache grows rapidly with decode length, creating bottlenecks in memory and inference stability. Existing reasoning-oriented KV compression has mostly followed an eviction-centered view: estimate token importance more accurately, then discard lower-ranked entries. Our analysis suggests that scorer refinement alone often fails to substantially reorganize the actual keep-set and may therefore not be the main lever for preserving reasoning behavior. We instead frame reasoning KV compression as a behavior-preserving structured consolidation problem. CASK partitions the decode-time reasoning trace into a protected core that anchors answer formation and intermediate state, and mergeable scratch with high redundancy. The core is preserved, while selective consolidation is applied only to the scratch. To address prompt-heavy regimes where the prefix can exhaust the budget before decode-stage compression becomes active, CASK further uses a two-stage design: prefix eviction followed by decode-stage consolidation. 
On the H100 reasoning gate, CASK shows higher full-KV continuation fidelity than TriAttention at matched budgets on both AIME24 and AIME25, with recurring cask@384 > triattention@512 crossings. In prompt-heavy replay, multi_news and vcsum act as decode-active witnesses, while qmsum and gov_report expose the prefix_budget_exhausted boundary. The overall evidence supports a simple conclusion: effective reasoning KV compression depends less on more elaborate scorer engineering than on combining core preservation with selective scratch consolidation to lower the usable budget frontier.","published_date":"2026-04-13T02:03:16+00:00","viability_score":4,"cluster_label":"LLM Optimization","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"CASK is a KV compression method for LLMs that preserves reasoning by separating a core trace from mergeable scratch, outperforming existing methods.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10898v1","title":"ZoomR: Memory Efficient Reasoning through Multi-Granularity Key Value Retrieval","abstract":"Large language models (LLMs) have shown great performance on complex reasoning tasks but often require generating long intermediate thoughts before reaching a final answer. During generation, LLMs rely on a key-value (KV) cache for autoregressive decoding. However, the memory footprint of the KV cache grows with output length. Prior work on KV cache optimization mostly focus on compressing the long input context, while retaining the full KV cache for decoding. For tasks requiring long output generation, this leads to increased computational and memory costs. In this paper, we introduce ZoomR, a novel approach that enables LLMs to adaptively compress verbose reasoning thoughts into summaries and uses a dynamic KV cache selection policy that leverages these summaries while also strategically \"zooming in\" on fine-grained details. 
By using summary keys as a coarse-grained index during decoding, ZoomR uses the query to retrieve details for only the most important thoughts. This hierarchical strategy significantly reduces memory usage by avoiding full-cache attention at each step. Experiments across math and reasoning tasks show that our approach achieves competitive performance compared to baselines, while reducing inference memory requirements by more than $4\\times$. These results demonstrate that a multi-granularity KV selection enables more memory efficient decoding, especially for long output generation.","published_date":"2026-04-13T02:00:35+00:00","viability_score":7,"cluster_label":"LLM Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"ZoomR reduces LLM inference memory by over 4x for long outputs through adaptive KV cache compression and multi-granularity retrieval of reasoning thoughts.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10893v1","title":"Beyond A Fixed Seal: Adaptive Stealing Watermark in Large Language Models","abstract":"Watermarking provides a critical safeguard for large language model (LLM) services by facilitating the detection of LLM-generated text. Correspondingly, stealing watermark algorithms (SWAs) derive watermark information from watermarked texts generated by victim LLMs to craft highly targeted adversarial attacks, which compromise the reliability of watermarks. Existing SWAs rely on fixed strategies, overlooking the non-uniform distribution of stolen watermark information and the dynamic nature of real-world LLM generation processes. To address these limitations, we propose Adaptive Stealing (AS), a novel SWA featuring enhanced design flexibility through Position-Based Seal Construction and Adaptive Selection modules. AS operates by defining multiple attack perspectives derived from distinct activation states of contextually ordered tokens. 
During attack execution, AS dynamically selects the optimal perspective based on watermark compatibility, generation priority, and dynamic generation relevance. Our experiments demonstrate that AS significantly increases steal efficiency against target watermarks under identical experimental conditions. These findings highlight the need for more robust LLM watermarks to withstand potential attacks. We release our code to the community for future research\\footnote{https://github.com/DrankXs/AdaptiveStealingWatermark}.","published_date":"2026-04-13T01:46:51+00:00","viability_score":7,"cluster_label":"AI Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Adaptive Stealing (AS) is a novel watermark attack for LLMs that dynamically selects optimal attack perspectives to significantly increase steal efficiency.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10885v1","title":"Product Review Based on Optimized Facial Expression Detection","abstract":"This paper proposes a method to review public acceptance of products based on their brand by analyzing the facial expression of the customer intending to buy the product from a supermarket or hypermarket. In such cases, facial expression recognition plays a significant role in product review. Here, facial expression detection is performed by extracting feature points using a modified Harris algorithm. The modified Harris algorithm reduced the time complexity of the existing feature extraction Harris Algorithm. A comparison of time complexities of existing algorithms is done with proposed algorithm. 
The algorithm proved to be significantly faster and nearly accurate for the needed application by reducing the time complexity for corner points detection.","published_date":"2026-04-13T01:20:23+00:00","viability_score":2,"cluster_label":"Computer Vision","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A faster facial expression detection algorithm for product review based on customer reactions.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10884v1","title":"Ambiguity Detection and Elimination in Automated Executable Process Modeling","abstract":"Automated generation of executable Business Process Model and Notation (BPMN) models from natural-language specifications is increasingly enabled by large language models. However, ambiguous or underspecified text can yield structurally valid models with different simulated behavior. Our goal is not to prove that one generated BPMN model is semantically correct, but to detect when a natural-language specification fails to support a stable executable interpretation under repeated generation and simulation. We present a diagnosis-driven framework that detects behavioral inconsistency from the empirical distribution of key performance indicators (KPIs), localizes divergence to gateway logic using model-based diagnosis, maps that logic back to verbatim narrative segments, and repairs the source text through evidence-based refinement. Experiments on diabetic nephropathy health-guidance policies show that the method reduces variability in regenerated model behavior. 
The result is a closed-loop approach for validating and repairing executable process specifications in the absence of ground-truth BPMN models.","published_date":"2026-04-13T01:11:01+00:00","viability_score":5,"cluster_label":"LLM Applications","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework to detect and fix ambiguities in natural language specifications for automated business process modeling.","time_to_mvp":"1-3 months","tags":["quick_build"]},{"arxiv_id":"2604.10882v1","title":"DIB-OD: Preserving the Invariant Core for Robust Heterogeneous Graph Adaptation via Decoupled Information Bottleneck and Online Distillation","abstract":"Graph Neural Network pretraining is pivotal for leveraging unlabeled graph data. However, generalizing across heterogeneous domains remains a major challenge due to severe distribution shifts. Existing methods primarily focus on intra-domain patterns, failing to disentangle task-relevant invariant knowledge from domain-specific redundant noise, leading to negative transfer and catastrophic forgetting. To this end, we propose DIB-OD, a novel framework designed to preserve the invariant core for robust heterogeneous graph adaptation through a Decoupled Information Bottleneck and Online Distillation framework. Our core innovation is the explicit decomposition of representations into orthogonal invariant and redundant subspaces. By utilizing an Information Bottleneck teacher-student distillation mechanism and the Hilbert-Schmidt Independence Criterion, we isolate a stable invariant core that transcends domain boundaries. Furthermore, a self-adaptive semantic regularizer is introduced to protect this core from corruption during target-domain adaptation by dynamically gating label influence based on predictive confidence. 
Extensive experiments across chemical, biological, and social network domains demonstrate that DIB-OD significantly outperforms state-of-the-art methods, particularly in challenging inter-type domain transfers, showcasing superior generalization and anti-forgetting performance.","published_date":"2026-04-13T01:03:44+00:00","viability_score":7,"cluster_label":"Graph Neural Networks","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for robust heterogeneous graph adaptation that preserves invariant core knowledge across domains.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10875v1","title":"Compliant But Unsatisfactory: The Gap Between Auditing Standards and Practices for Probabilistic Genotyping Software","abstract":"AI governance efforts increasingly rely on audit standards: agreed-upon practices for conducting audits. However, poorly designed standards can hide and lend credibility to inadequate systems. We explore how an audit standard's design influences its effectiveness through a case study of ASB 018, a standard for auditing probabilistic genotyping software -- software that the U.S. criminal legal system increasingly uses to analyze DNA samples. Through qualitative analysis of ASB 018 and five audit reports, we identify numerous gaps between the standard's desired outcomes and the auditing practices it enables. For instance, ASB 018 envisions that compliant audits establish restrictions on software use based on observed failures. However, audits can comply without establishing such boundaries. We connect these gaps to the design of the standard's requirements such as vague language and undefined terms. 
We conclude with recommendations for designing audit standards and evaluating their effectiveness.","published_date":"2026-04-13T00:49:58+00:00","viability_score":3,"cluster_label":"AI Governance","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An analysis of the gap between auditing standards and practices for probabilistic genotyping software in the legal system.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10874v1","title":"AOP-Smart: A RAG-Enhanced Large Language Model Framework for Adverse Outcome Pathway Analysis","abstract":"Adverse Outcome Pathways (AOPs) are an important knowledge framework in toxicological research and risk assessment. In recent years, large language models (LLMs) have gradually been applied to AOP-related question answering and mechanistic reasoning tasks. However, due to the existence of the hallucination problem, that is, the model may generate content that is inconsistent with facts or lacks evidence, their reliability is still limited. To address this issue, this study proposes an AOP-oriented Retrieval-Augmented Generation (RAG) framework, AOP-Smart. Based on the official XML data from AOP-Wiki, this method uses Key Events (KEs), Key Event Relationships (KERs), and specific AOP information to retrieve relevant knowledge for user questions, thereby improving the reliability of the generated results of large language models. To evaluate the effectiveness of the proposed method, this study constructed a test set containing 20 AOP-related question answering tasks, covering KE identification, upstream and downstream KE retrieval, and complex AOP retrieval tasks. Experiments were conducted on three mainstream large language models, Gemini, DeepSeek, and ChatGPT, and comparative tests were performed under two settings: without RAG and with RAG. 
The experimental results show that, without using RAG, the accuracies of GPT, DeepSeek, and Gemini were 15.0\\%, 35.0\\%, and 20.0\\%, respectively; after using RAG, their accuracies increased to 95.0\\%, 100.0\\%, and 95.0\\%, respectively. The results indicate that AOP-Smart can significantly alleviate the hallucination problem of large language models in AOP knowledge tasks, and greatly improve the accuracy and consistency of their answers.","published_date":"2026-04-13T00:49:37+00:00","viability_score":3,"cluster_label":"Toxicology AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"AOP-Smart enhances large language models for reliable toxicological analysis by addressing hallucination issues.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10873v1","title":"A Quantitative Definition of Intelligence","abstract":"We propose an operational, quantitative definition of intelligence for arbitrary physical systems. The intelligence density of a system is the ratio of the logarithm of its independent outputs to its total description length. A system memorizes if its description length grows with its output count; it knows if its description length remains fixed while its output count diverges. The criterion for knowing is generalization: a system knows its domain if a single finite mechanism can produce correct outputs across an unbounded range of inputs, rather than storing each answer individually. We argue that meaning over a domain is a selection and ordering of functions that produces correct outputs, and that a system whose intelligence density diverges necessarily captures this structure. 
The definition (1) places intelligence on a substrate-independent continuum from logic gates to brains, (2) blocks Putnam's pancomputationalist triviality argument via an independence condition on outputs, and (3) resolves Searle's Chinese Room Argument by showing that any finite rulebook handling an infinite domain must generalize.","published_date":"2026-04-13T00:46:56+00:00","viability_score":3,"cluster_label":"Theoretical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper proposes a quantitative definition of intelligence applicable to various physical systems.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10865v1","title":"Beyond Statistical Co-occurrence: Unlocking Intrinsic Semantics for Tabular Data Clustering","abstract":"Deep Clustering (DC) has emerged as a powerful tool for tabular data analysis in real-world domains like finance and healthcare. However, most existing methods rely on data-level statistical co-occurrence to infer the latent metric space, often overlooking the intrinsic semantic knowledge encapsulated in feature names and values. As a result, semantically related concepts like `Flu' and `Cold' are often treated as symbolic tokens, causing conceptually related samples to be isolated. To bridge the gap between dataset-specific statistics and intrinsic semantic knowledge, this paper proposes Tabular-Augmented Contrastive Clustering (TagCC), a novel framework that anchors statistical tabular representations to open-world textual concepts. Specifically, TagCC utilizes Large Language Models (LLMs) to distill underlying data semantics into textual anchors via semantic-aware transformation. Through Contrastive Learning (CL), the framework enriches the statistical tabular representations with the open-world semantics encapsulated in these anchors. 
This CL framework is jointly optimized with a clustering objective, ensuring that the learned representations are both semantically coherent and clustering-friendly. Extensive experiments on benchmark datasets demonstrate that TagCC significantly outperforms its counterparts.","published_date":"2026-04-13T00:25:22+00:00","viability_score":7,"cluster_label":"Tabular Data Analysis","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"TagCC leverages LLMs for enhanced clustering of tabular data by integrating intrinsic semantics.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.10857v1","title":"Query Lower Bounds for Diffusion Sampling","abstract":"Diffusion models generate samples by iteratively querying learned score estimates. A rapidly growing literature focuses on accelerating sampling by minimizing the number of score evaluations, yet the information-theoretic limits of such acceleration remain unclear.   In this work, we establish the first score query lower bounds for diffusion sampling. We prove that for $d$-dimensional distributions, given access to score estimates with polynomial accuracy $\\varepsilon=d^{-O(1)}$ (in any $L^p$ sense), any sampling algorithm requires $\\widetilde\u03a9(\\sqrt{d})$ adaptive score queries. 
In particular, our proof shows that any sampler must search over $\\widetilde\u03a9(\\sqrt{d})$ distinct noise levels, providing a formal explanation for why multiscale noise schedules are necessary in practice.","published_date":"2026-04-12T23:47:46+00:00","viability_score":1,"cluster_label":"Generative Models","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Establishes theoretical lower bounds for score query acceleration in diffusion model sampling.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10856v1","title":"BridgeSim: Unveiling the OL-CL Gap in End-to-End Autonomous Driving","abstract":"Open-loop (OL) to closed-loop (CL) gap (OL-CL gap) exists when OL-pretrained policies scoring high in OL evaluations fail to transfer effectively in closed-loop (CL) deployment. In this paper, we unveil the root causes of this systemic failure and propose a practical remedy. Specifically, we demonstrate that OL policies suffer from Observational Domain Shift and Objective Mismatch. We show that while the former is largely recoverable with adaptation techniques, the latter creates a structural inability to model complex reactive behaviors, which forms the primary OL-CL gap. We find that a wide range of OL policies learn a biased Q-value estimator that neglects both the reactive nature of CL simulations and the temporal awareness needed to reduce compounding errors. To this end, we propose a Test-Time Adaptation (TTA) framework that calibrates observational shift, reduces state-action biases, and enforces temporal consistency. Extensive experiments show that TTA effectively mitigates planning biases and yields superior scaling dynamics than its baseline counterparts. 
Furthermore, our analysis highlights the existence of blind spots in standard OL evaluation protocols that fail to capture the realities of closed-loop deployment.","published_date":"2026-04-12T23:37:07+00:00","viability_score":7,"cluster_label":"Autonomous Driving","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A test-time adaptation framework to bridge the open-loop to closed-loop gap in autonomous driving policies.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10853v1","title":"A Benchmark for Gap and Overlap Analysis as a Test of KG Task Readiness","abstract":"Task-oriented evaluation of knowledge graph (KG) quality increasingly asks whether an ontology-based representation can answer the competency questions that users actually care about, in a manner that is reproducible, explainable, and traceable to evidence. This paper adopts that perspective and focuses on gap and overlap analysis for policy-like documents (e.g., insurance contracts), where given a scenario, which documents support it (overlap) and which do not (gap), with defensible justifications. The resulting gap/overlap determinations are typically driven by genuine differences in coverage and restrictions rather than missing data, making the task a direct test of KG task readiness rather than a test of missing facts or query expressiveness. We present an executable and auditable benchmark that aligns natural-language contract text with a formal ontology and evidence-linked ground truth, enabling systematic comparison of methods. The benchmark includes: (i) ten simplified yet diverse life-insurance contracts reviewed by a domain expert, (ii) a domain ontology (TBox) with an instantiated knowledge base (ABox) populated from contract facts, and (iii) 58 structured scenarios paired with SPARQL queries with contract-level outcomes and clause-level excerpts that justify each label. 
Using this resource, we compare a text-only LLM baseline that infers outcomes directly from contract text against an ontology-driven pipeline that answers the same scenarios over the instantiated KG, demonstrating that explicit modeling improves consistency and diagnosis for gap/overlap analyses. Although demonstrated for gap and overlap analysis, the benchmark is intended as a reusable template for evaluating KG quality and supporting downstream work such as ontology learning, KG population, and evidence-grounded question answering.","published_date":"2026-04-12T23:18:47+00:00","viability_score":6,"cluster_label":"Knowledge Graph and Ontology Tools","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An executable benchmark for evaluating knowledge graph readiness through gap and overlap analysis in policy documents.","time_to_mvp":"","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.10849v1","title":"Task2vec Readiness: Diagnostics for Federated Learning from Pre-Training Embeddings","abstract":"Federated learning (FL) performance is highly sensitive to heterogeneity across clients, yet practitioners lack reliable methods to anticipate how a federation will behave before training. We propose readiness indices, derived from Task2Vec embeddings, that quantify the alignment of a federation prior to training and correlate with its eventual performance. Our approach computes unsupervised metrics -- such as cohesion, dispersion, and density -- directly from client embeddings. We evaluate these indices across diverse datasets (CIFAR-10, FEMNIST, PathMNIST, BloodMNIST) and client counts (10--20), under Dirichlet heterogeneity levels spanning $\u03b1\\in \\{0.05,\\dots,5.0\\}$ and the FedAVG aggregation strategy. 
Correlation analyses show consistent and significant Pearson and Spearman coefficients between some of the Task2Vec-based readiness and final performance, with values often exceeding 0.9 across dataset$\\times$client configurations, validating this approach as a robust proxy for FL outcomes. These findings establish Task2Vec-based readiness as a principled, pre-training diagnostic for FL that may offer both predictive insight and actionable guidance for client selection in heterogeneous federations.","published_date":"2026-04-12T22:48:51+00:00","viability_score":4,"cluster_label":"Federated Learning Diagnostics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A diagnostic tool using pre-training embeddings to predict federated learning performance before training begins.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.10843v1","title":"Retinal Cyst Detection from Optical Coherence Tomography Images","abstract":"Retinal Cysts are formed by leakage and accumulation of fluid in the retina due to the incompetence of retinal vasculature. These cystic spaces have significance in several ocular diseases such as age-related macular degeneration, diabetic macular edema, etc. Optical coherence tomography is one of the predominant diagnosing techniques for imaging retinal pathologies. Segmenting and quantification of intraretinal cysts plays the vital role in predicting visual acuity. In literature, several methods have been proposed for automatic segmentation of intraretinal cysts. As cystoid macular edema becomes a major problem to humankind, we need to quantify it accurately and operate it out, else it might cause many problems later on. Though research is being carried out in this area, not much of progress has been made and accuracy achieved so far is 68\\% which is very less. Also, the methods depend on the quality of the image and give very low results for high noise images like topcon. 
This work uses ResNet CNN (Convolutional Neural Network) approach of segmentation by the way of patchwise classification for training on image set from cyst segmentation challenge dataset and testing on test data set given by 2 different graders for all 4 vendors in the challenge. It also compares these methods using first publicly available novel cyst segmentation challenge dataset. The methods were evaluated using quantitative measures to assess their robustness against the challenges of intraretinal cyst segmentation. The results are found to be better than the previous state of the art approaches giving more than 70\\% dice coefficient on all vendors irrespective of their quality.","published_date":"2026-04-12T22:28:19+00:00","viability_score":7,"cluster_label":"Medical Imaging AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An AI model that accurately detects and quantifies retinal cysts from OCT images, improving upon existing methods for early disease detection.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.10842v1","title":"Resilient Write: A Six-Layer Durable Write Surface for LLM Coding Agents","abstract":"LLM-powered coding agents increasingly rely on tool-use protocols such as the Model Context Protocol~(MCP) to read and write files on a developer's workstation. When a write fails -- due to content filters, truncation, or an interrupted session -- the agent typically receives no structured signal, loses the draft, and wastes tokens retrying blindly. We present \\textbf{Resilient Write}, an MCP server that interposes a six-layer durable write surface between the agent and the filesystem. The layers -- pre-flight risk scoring, transactional atomic writes, resume-safe chunking, structured typed errors, out-of-band scratchpad storage, and task-continuity handoff envelopes -- are orthogonal and independently adoptable. 
Each layer maps to a concrete failure mode observed during a real agent session in April~2026, in which content-safety filters silently rejected a draft containing redacted API-key prefixes. Three additional tools -- chunk preview, format-aware validation, and journal analytics -- emerged from using the system to compose this paper. A 186-test suite validates correctness at each layer, and quantitative comparison against naive and defensive baselines shows a 5x reduction in recovery time and a 13x improvement in agent self-correction rate. Resilient Write is open-source under the MIT license.","published_date":"2026-04-12T22:23:55+00:00","viability_score":6,"cluster_label":"LLM Agent Tools","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A robust file writing system for LLM coding agents that prevents data loss and improves self-correction through a multi-layered approach.","time_to_mvp":"1-2 weeks","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.10841v1","title":"Harnessing Photonics for Machine Intelligence","abstract":"The exponential growth of machine-intelligence workloads is colliding with the power, memory, and interconnect limits of the post-Moore era, motivating compute substrates that scale beyond transistor density alone. Integrated photonics is emerging as a candidate for artificial intelligence (AI) acceleration by exploiting optical bandwidth and parallelism to reshape data movement and computation. This review reframes photonic computing from a circuits-and-systems perspective, moving beyond building-block progress toward cross-layer system analysis and full-stack design automation. We synthesize recent advances through a bottleneck-driven taxonomy that delineates the operating regimes and scaling trends where photonics can deliver end-to-end sustained benefits. 
A central theme is cross-layer co-design and workload-adaptive programmability to sustain high efficiency and versatility across evolving application domains at scale. We further argue that Electronic-Photonic Design Automation (EPDA) will be pivotal, enabling closed-loop co-optimization across simulation, inverse design, system modeling, and physical implementation. By charting a roadmap from laboratory prototypes to scalable, reproducible electronic-photonic ecosystems, this review aims to guide the CAS community toward an automated, system-centric era of photonic machine intelligence.","published_date":"2026-04-12T22:23:14+00:00","viability_score":0,"cluster_label":"AI Hardware Acceleration","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A review of integrated photonics for AI acceleration, focusing on cross-layer co-design and electronic-photonic design automation.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10834v1","title":"LLMs for Qualitative Data Analysis Fail on Security-specific Comments in Human Experiments","abstract":"[Background:] Thematic analysis of free-text justifications in human experiments provides significant qualitative insights. Yet, it is costly because reliable annotations require multiple domain experts. Large language models (LLMs) seem ideal candidates to replace human annotators. [Problem:] Coding security-specific aspects (code identifiers mentioned, lines-of-code mentioned, security keywords mentioned) may require deeper contextual understanding than sentiment classification. [Objective:] Explore whether LLMs can act as automated annotators for technical security comments by human subjects. [Method:] We prompt four top-performing LLMs on LiveBench to detect nine security-relevant codes in free-text comments by human subjects analyzing vulnerable code snippets. Outputs are compared to human annotators using Cohen's Kappa (chance-corrected accuracy). 
We test different prompts mimicking annotation best practices, including emerging codes, detailed codebooks with examples, and conflicting examples. [Negative Results:] We observed marked improvements only when using detailed code descriptions; however, these improvements are not uniform across codes and are insufficient to reliably replace a human annotator. [Limitations:] Additional studies with more LLMs and annotation tasks are needed.","published_date":"2026-04-12T22:01:40+00:00","viability_score":2,"cluster_label":"LLM Application","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"LLMs struggle to reliably identify security-specific comments in human experiment data, failing to replace human annotators.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10833v1","title":"Speaking to No One: Ontological Dissonance and the Double Bind of Conversational AI","abstract":"Recent reports indicate that sustained interaction with conversational artificial intelligence (AI) systems can, in a small subset of users, contribute to the emergence or stabilisation of delusional experience. Existing accounts typically attribute such cases either to individual vulnerability or to failures of safety engineering. These explanations are incomplete. Drawing on phenomenology, psychiatry, and cognitive neuroscience, this paper argues that the risk arises from the relational and ontological structure of the interaction itself. Conversational AI generates ontological dissonance: a conflict between the appearance of relational presence and the absence of any subject capable of sustaining it. Maintained through a communicative double bind and amplified by attentional asymmetries, this dissonance tends, under conditions of affective vulnerability, to stabilise into a technologically mediated analogue of folie a deux. 
This account explains why explicit disclaimers often fail to disrupt delusional involvement and clarifies the ethical and clinical implications for the design and use of conversational AI.","published_date":"2026-04-12T21:58:21+00:00","viability_score":1,"cluster_label":"Conversational AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Conversational AI can induce delusional experiences through ontological dissonance and communicative double binds, impacting user mental well-being.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10827v1","title":"Your Model Diversity, Not Method, Determines Reasoning Strategy","abstract":"Compute scaling for LLM reasoning requires allocating budget between exploring solution approaches ($breadth$) and refining promising solutions ($depth$). Most methods implicitly trade off one for the other, yet why a given trade-off works remains unclear, and validation on a single model obscures the role of the model itself. We argue that $\\textbf{the optimal strategy depends on the model's diversity profile, the spread of probability mass across solution approaches, and that this must be characterized before any exploration strategy is adopted.}$ We formalize this through a theoretical framework decomposing reasoning uncertainty and derive conditions under which tree-style depth refinement outperforms parallel sampling. 
We validate it on Qwen-3 4B and Olmo-3 7B families, showing that lightweight signals suffice for depth-based refinement on low-diversity aligned models while yielding limited utility for high-diversity base models, which we hypothesize require stronger compensation for lower exploration coverage.","published_date":"2026-04-12T21:49:13+00:00","viability_score":2,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Optimal LLM reasoning strategies depend on model diversity, requiring characterization before exploration.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10825v1","title":"CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms","abstract":"We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement learning agent. Our best model (Qwen2.5-VL-7B) reaches 52.6% average success on ASCII input, compared to 32.1% for random agents and 78.9% for approximate rodent baselines. We find that (1) scaling beyond 7B yields diminishing returns, (2) longer context history degrades performance, (3) chain-of-thought prompting hurts rather than helps, and (4) a vision-language architecture provides an advantage at 7B but hurts at 32B. 
Because the same model's performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent-plus-interface system, not the model in isolation. Under this unified zero-shot ASCII protocol, current open-weight LLM agents remain well below approximate rodent reference values, particularly on tasks requiring spatial navigation and within-trial state tracking.","published_date":"2026-04-12T21:37:26+00:00","viability_score":5,"cluster_label":"LLM Benchmarking","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CheeseBench evaluates LLMs on rodent behavioral neuroscience paradigms, revealing current models lag behind animal baselines and are sensitive to interface parameters.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.10815v1","title":"MeloTune: On-Device Arousal Learning and Peer-to-Peer Mood Coupling for Proactive Music Curation","abstract":"MeloTune is an iPhone-deployed music agent that instantiates the Mesh Memory Protocol (MMP) and Symbolic-Vector Attention Fusion (SVAF) as a production system for affect-aware music curation with peer-to-peer mood coupling. Each device runs two closed-form continuous-time (CfC) networks: a private listener-level CfC that predicts a short-horizon affective trajectory on Russell's circumplex and drives proactive curation, and a shared mesh-runtime CfC at MMP Layer 6 that integrates Cognitive Memory Blocks (CMBs) from co-listening peers. CfC hidden states never cross the wire; only structured CMBs do. A Personal Arousal Function (PAF) replaces the standard linear mapping from audio intensity to psychological arousal with a per-listener learned adjustment, trained from behavioral signals (skip, completion, favorite, volume) and from drift between user-declared mood and machine inference. The same track receives different arousal predictions for different listeners. 
The model (94,552 parameters) achieves trajectory MAE 0.414, pattern accuracy 96.6%, and intent accuracy 69.4% on held-out validation. PAF evidence from a live deployment session (46 observations across 11 genres) demonstrates that the learning loop operates end-to-end, with pop reaching full confidence after 22 observations. All inference runs on-device via CoreML. To our knowledge, this is the first production deployment of MMP/SVAF on consumer mobile hardware. The accompanying SDK (sym-swift v0.3.78, SYMCore v0.3.7) enforces strict protocol conformance. Music is the case study; the substrate is the contribution.","published_date":"2026-04-12T20:56:36+00:00","viability_score":7,"cluster_label":"Proactive Music Curation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An on-device iPhone music agent that proactively curates music based on individual user arousal and peer mood, with a learned personal arousal function and peer-to-peer mood coupling.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.10800v1","title":"Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis","abstract":"Learned classifiers deployed in agentic pipelines face a fundamental reliability problem: predictions are probabilistic inferences, not verified conclusions, and acting on them without grounding in observable evidence leads to compounding failures across downstream stages. Software vulnerability analysis makes this cost concrete and measurable. We address this through a unified cross-language vulnerability lifecycle framework built around three LLM-driven reasoning stages-hybrid structural-semantic detection, execution-grounded agentic validation, and validation-aware iterative repair-governed by a strict invariant: no repair action is taken without execution-based confirmation of exploitability. 
Cross-language generalization is achieved via a Universal Abstract Syntax Tree (uAST) normalizing Java, Python, and C++ into a shared structural schema, combined with a hybrid fusion of GraphSAGE and Qwen2.5-Coder-1.5B embeddings through learned two-way gating, whose per-sample weights provide intrinsic explainability at no additional cost. The framework achieves 89.84-92.02% intra-language detection accuracy and 74.43-80.12% zero-shot cross-language F1, resolving 69.74% of vulnerabilities end-to-end at a 12.27% total failure rate. Ablations establish necessity: removing uAST degrades cross-language F1 by 23.42%, while disabling validation increases unnecessary repairs by 131.7%. These results demonstrate that execution-grounded closed-loop reasoning is a principled and practically deployable mechanism for trustworthy LLM-driven agentic AI.","published_date":"2026-04-12T20:22:23+00:00","viability_score":4,"cluster_label":"AI for Code Analysis","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Develop a cross-language code analysis tool using agentic execution to ensure verified vulnerability fixes in Java, Python, and C++.","time_to_mvp":"","tags":["high_potential"]},{"arxiv_id":"2604.10799v1","title":"Advancing Polish Language Modeling through Tokenizer Optimization in the Bielik v3 7B and 11B Series","abstract":"The development of the Bielik v3 PL series, encompassing both the 7B and 11B parameter variants, represents a significant milestone in the field of language-specific large language model (LLM) optimization. While general-purpose models often demonstrate impressive multilingual capabilities, they frequently suffer from a fundamental architectural inefficiency: the use of universal tokenizers. 
These tokenizers, typically designed to cover a broad spectrum of languages, often fail to capture the morphological nuances of specific languages like Polish, leading to higher fertility ratios, increased inference costs, and restricted effective context windows. This report details the transition from the universal Mistral-based tokenization to a dedicated Polish-optimized vocabulary for the Bielik v3 models, exploring the FOCUS-based embedding initialization, the multi-stage pretraining curriculum, and the subsequent post-training alignment involving Supervised Fine-Tuning, Direct Preference Optimization, and Reinforcement Learning through Group Relative Policy Optimization with verifiable rewards.","published_date":"2026-04-12T20:19:27+00:00","viability_score":4,"cluster_label":"LLM Tokenizer Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Optimizing Polish language modeling by developing a dedicated tokenizer for the Bielik v3 LLM series, improving efficiency and context window utilization.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.10788v1","title":"TInR: Exploring Tool-Internalized Reasoning in Large Language Models","abstract":"Tool-Integrated Reasoning (TIR) has emerged as a promising direction by extending Large Language Models' (LLMs) capabilities with external tools during reasoning. Existing TIR methods typically rely on external tool documentation during reasoning. However, this leads to tool mastery difficulty, tool size constraints, and inference inefficiency. To mitigate these issues, we explore Tool-Internalized Reasoning (TInR), aiming at facilitating reasoning with tool knowledge internalized into LLMs. Achieving this goal presents notable requirements, including tool internalization and tool-reasoning coordination. To address them, we propose TInR-U, a tool-internalized reasoning framework for unified reasoning and tool usage. 
TInR-U is trained through a three-phase pipeline: 1) tool internalization with a bidirectional knowledge alignment strategy; 2) supervised fine-tuning warm-up using high-quality reasoning annotations, and 3) reinforcement learning with TInR-specific rewards. We comprehensively evaluate our method across in-domain and out-of-domain settings. Experiment results show that TInR-U achieves superior performance in both settings, highlighting its effectiveness and efficiency.","published_date":"2026-04-12T19:38:19+00:00","viability_score":5,"cluster_label":"Tool-Integrated Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for Tool-Internalized Reasoning (TInR) that integrates tool knowledge directly into LLMs, improving reasoning efficiency and performance without external documentation.","time_to_mvp":"1-3 months","tags":["series_a_plus"]},{"arxiv_id":"2604.10786v1","title":"Do BERT Embeddings Encode Narrative Dimensions? A Token-Level Probing Analysis of Time, Space, Causality, and Character in Fiction","abstract":"Narrative understanding requires multidimensional semantic structures. This study investigates whether BERT embeddings encode dimensions of fictional narrative semantics -- time, space, causality, and character. Using an LLM to accelerate annotation, we construct a token-level dataset labeled with these four narrative categories plus \"others.\" A linear probe on BERT embeddings (94% accuracy) significantly outperforms a control probe on variance-matched random embeddings (47%), confirming that BERT encodes meaningful narrative information. With balanced class weighting, the probe achieves a macro-average recall of 0.83, with moderate success on rare categories such as causality (recall = 0.75) and space (recall = 0.66). 
However, confusion matrix analysis reveals \"Boundary Leakage,\" where rare dimensions are systematically misclassified as \"others.\" Clustering analysis shows that unsupervised clustering aligns near-randomly with predefined categories (ARI = 0.081), suggesting that narrative dimensions are encoded but not as discretely separable clusters. Future work includes a POS-only baseline to disentangle syntactic patterns from narrative encoding, expanded datasets, and layer-wise probing.","published_date":"2026-04-12T19:23:48+00:00","viability_score":5,"cluster_label":"LLM Analysis","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper probes BERT embeddings to understand if they encode narrative dimensions like time, space, causality, and character in fiction, achieving 94% accuracy with a linear probe.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.10784v1","title":"TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training","abstract":"Recent advances in unified multimodal models (UMMs) have led to a proliferation of architectures capable of understanding, generating, and editing across visual and textual modalities. However, developing a unified framework for UMMs remains challenging due to the diversity of model architectures and the heterogeneity of training paradigms and implementation details. In this paper, we present TorchUMM, the first unified codebase for comprehensive evaluation, analysis, and post-training across diverse UMM backbones, tasks, and datasets. TorchUMM supports a broad spectrum of models covering a wide range of scales and design paradigms. Our benchmark encompasses three core task dimensions: multimodal understanding, generation, and editing, and integrates both established and novel datasets to evaluate perception, reasoning, compositionality, and instruction-following abilities. 
By providing a unified interface and standardized evaluation protocols, TorchUMM enables fair and reproducible comparisons across heterogeneous models and fosters deeper insights into their strengths and limitations, facilitating the development of more capable unified multimodal systems. Code is available at: https://github.com/AIFrontierLab/TorchUMM.","published_date":"2026-04-12T19:19:04+00:00","viability_score":5,"cluster_label":"AI Tooling","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A unified codebase for evaluating and analyzing multimodal models post-training.","time_to_mvp":"","tags":["high_potential"]},{"arxiv_id":"2604.10783v1","title":"Learning Preference-Based Objectives from Clinical Narratives for Sequential Treatment Decision-Making","abstract":"Designing reward functions remains a central challenge in reinforcement learning (RL) for healthcare, where outcomes are sparse, delayed, and difficult to specify. While structured data capture physiological states, they often fail to reflect the overall quality of a patient's clinical trajectory, including recovery dynamics, treatment burden, and stability. Clinical narratives, in contrast, summarize longitudinal reasoning and implicitly encode evaluations of treatment effectiveness. We propose Clinical Narrative-informed Preference Rewards (CN-PR), a framework for learning reward functions directly from discharge summaries by treating them as scalable supervision for trajectory-level preferences. Using a large language model, we derive trajectory quality scores (TQS) and construct pairwise preferences over patient trajectories, enabling reward learning via a structured preference-based objective. To account for variability in narrative informativeness, we incorporate a confidence signal that weights supervision based on its relevance to the decision-making task. 
The learned reward aligns strongly with trajectory quality (Spearman rho = 0.63) and enables policies that are consistently associated with improved recovery-related outcomes, including increased organ support-free days and faster shock resolution, while maintaining comparable performance on mortality. These effects persist under external validation. Our results demonstrate that narrative-derived supervision provides a scalable and expressive alternative to handcrafted or outcome-based reward design for dynamic treatment regimes.","published_date":"2026-04-12T19:18:02+00:00","viability_score":6,"cluster_label":"Healthcare AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This framework learns reward functions from clinical narratives to improve sequential treatment decision-making in healthcare, aligning with improved patient recovery outcomes.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.10765v1","title":"Lung Cancer Detection Using Deep Learning","abstract":"Lung cancer, the second leading cause of cancer-related deaths, is primarily linked to long-term tobacco smoking (85% of cases). Surprisingly, 10-15% of cases occur in non-smokers. In 2020, approximately 2 million people were affected globally, resulting in 1.5 million deaths. The survival rate, at around 20%, lags behind other cancers, partly due to late-stage symptom manifestation. Necessitates early and accurate detection for effective treatment. Performance metrics such as accuracy, precision, recall (sensitivity), and F1-score are computed to provide a comprehensive evaluation of each model's capabilities. By comparing these metrics, this study offers insights into the strengths and limitations of each approach, contributing to the advancement of lung cancer detection techniques. 
In this paper, we discuss methodologies for lung cancer detection using different deep learning algorithms (InceptionV3, MobileNetV2, VGG16, and ResNet152), which are explored for their efficacy in classifying lung cancer cases. Our proposed model is a 16-layer architecture based on a CNN. The proposed model exhibits several key highlights that contribute to its novelty. By integrating multiple layer types such as convolutional, pooling, flatten, dropout, fully connected and dense layers, the model leverages the strengths of each layer to enhance its predictive capabilities. The novelty of our proposed model is that its accuracy increases consistently with the number of epochs. We tested the model's performance up to epoch 30. Our proposed model also overcomes the overfitting problem.","published_date":"2026-04-12T18:23:00+00:00","viability_score":3,"cluster_label":"Medical AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper explores deep learning algorithms like InceptionV3, MobileNetV2, VGG16, and ResNet152 for lung cancer detection, proposing a 16-layer CNN model.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.10760v1","title":"Prosociality by Coupling, Not Mere Observation: Homeostatic Sharing in an Inspectable Recurrent Artificial Life Agent","abstract":"Artificial agents can be made to \"help\" for many reasons, including explicit social reward, hard-coded prosocial bonuses, or direct access to another agent's internal state. Those possibilities make minimal prosocial behavior hard to interpret. Building on ReCoN-Ipsundrum, an inspectable recurrent controller with affect-coupled regulation, I add an explicit homeostat and a social coupling channel while keeping planning strictly self-directed: the agent scores only its own predicted internal state, and no partner-welfare reward term is introduced. I compare four matched conditions in two toy worlds. 
In a one-step FoodShareToy, an exact solver finds a sharp switch from EAT to PASS at $\u03bb* \\approx 0.91$ for the default state. In the experimental runs, the self-only and partner-observing conditions never help, whereas the affectively coupled conditions always do. In a multi-step SocialCorridorWorld, the same dissociation reappears: coupling flips help rate and partner recovery from 0 to 1 and cuts rescue latency from 18 to 9 steps, while raising mutual viability from 0.15 to 0.33. Sham lesions preserve helping, but coupling-off and shuffled-partner lesions abolish it in both tasks. A coupling sweep shows a load-dependent feasibility boundary: under low load, helping appears for $\u03bb \\geq 0.25$, whereas under medium and high loads no tested value rescues the partner within horizon. The result is a narrow claim for artificial life: in this minimal architecture, helping appears when another's need is routed into self-regulation.","published_date":"2026-04-12T18:09:24+00:00","viability_score":3,"cluster_label":"Artificial Life Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"This paper explores how to induce prosocial behavior in artificial agents through a novel 'homeostatic sharing' mechanism, demonstrating its effectiveness in simulated environments.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09544v1","title":"Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism","abstract":"Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce ``emergent misalignment'' that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. 
We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally--despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.","published_date":"2026-04-10T17:58:31+00:00","viability_score":2,"cluster_label":"LLM Safety","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research identifies a distinct, unified mechanism for harmful content generation in LLMs, suggesting a path for more principled safety approaches.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09537v1","title":"Case-Grounded Evidence Verification: A Framework for Constructing Evidence-Sensitive Supervision","abstract":"Evidence-grounded reasoning requires more than attaching retrieved text to a prediction: a model should make decisions that depend on whether the provided evidence supports the target claim. In practice, this often fails because supervision is weak, evidence is only loosely tied to the claim, and evaluation does not test evidence dependence directly. 
We introduce case-grounded evidence verification, a general framework in which a model receives a local case context, external evidence, and a structured claim, and must decide whether the evidence supports the claim for that case. Our key contribution is a supervision construction procedure that generates explicit support examples together with semantically controlled non-support examples, including counterfactual wrong-state and topic-related negatives, without manual evidence annotation. We instantiate the framework in radiology and train a standard verifier on the resulting support task. The learned verifier substantially outperforms both case-only and evidence-only baselines, remains strong under correct evidence, and collapses when evidence is removed or swapped, indicating genuine evidence dependence. This behavior transfers across unseen evidence articles and an external case distribution, though performance degrades under evidence-source shift and remains sensitive to backbone choice. Overall, the results suggest that a major bottleneck in evidence grounding is not only model capacity, but the lack of supervision that encodes the causal role of evidence.","published_date":"2026-04-10T17:55:38+00:00","viability_score":3,"cluster_label":"Evidence Verification","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for evidence-grounded reasoning that trains models to explicitly depend on provided evidence for decision-making, improving accuracy in domains like radiology.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09532v1","title":"Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise","abstract":"Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise is less investigated. Visual content contains richer and more reliable semantic information, which remains more robust under label noise. 
However, the prompt itself is highly susceptible to label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we exploit a cross-modal attention mechanism to reversely inject visual semantics into prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. To address the instability caused by using the same way of injecting visual information for all samples, despite differences in the quality of their visual cues, we further introduce a lightweight conditional modulation mechanism to adaptively control the strength of visual information injection, which strikes a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework effectively suppresses the noise-induced disturbances, reduces instability in prompt updates, and alleviates memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small amount of additional trainable parameters. Extensive experiments under synthetic and real-world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. 
Our code is publicly available at https://github.com/gezbww/Vis_Prompt.","published_date":"2026-04-10T17:48:56+00:00","viability_score":6,"cluster_label":"Robust Vision-Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"VisPrompt enhances vision-language models' robustness to label noise by injecting visual semantics into prompt learning, improving performance on noisy datasets.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.09531v1","title":"VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images","abstract":"Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA) dataset containing 10k image-question-answer triples spanning 10 tasks. Models trained on VisionFoundry-10K achieve substantial improvements on visual perception benchmarks: +7% on MMVP and +10% on CV-Bench-3D, while preserving broader capabilities and showing favorable scaling behavior as data size increases. 
Our results suggest that limited task-targeted supervision is an important contributor to this bottleneck and that synthetic supervision is a promising path toward more systematic training for VLMs.","published_date":"2026-04-10T17:48:51+00:00","viability_score":6,"cluster_label":"Synthetic Data for VLMs","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"VisionFoundry teaches VLMs visual perception skills using task-specific synthetic images generated by LLMs and text-to-image models.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.09529v1","title":"VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning","abstract":"Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. 
Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.","published_date":"2026-04-10T17:47:19+00:00","viability_score":7,"cluster_label":"Vision-Language Models","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A reinforcement learning framework that decouples confidence calibration in large vision-language models into visual and reasoning components to reduce hallucinations and improve accuracy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09527v1","title":"Envisioning the Future, One Step at a Time","abstract":"Accurately anticipating how complex, diverse scenes will evolve requires models that represent uncertainty, simulate along extended interaction chains, and efficiently explore many plausible futures. Yet most existing approaches rely on dense video or latent-space prediction, expending substantial capacity on dense appearance rather than on the underlying sparse trajectories of points in the scene. This makes large-scale exploration of future hypotheses costly and limits performance when long-horizon, multi-modal motion is essential. We address this by formulating the prediction of open-set future scene dynamics as step-wise inference over sparse point trajectories. Our autoregressive diffusion model advances these trajectories through short, locally predictable transitions, explicitly modeling the growth of uncertainty over time. This dynamics-centric representation enables fast rollout of thousands of diverse futures from a single image, optionally guided by initial constraints on motion, while maintaining physical plausibility and long-range coherence. 
We further introduce OWM, a benchmark for open-set motion prediction based on diverse in-the-wild videos, to evaluate accuracy and variability of predicted trajectory distributions under real-world uncertainty. Our method matches or surpasses dense simulators in predictive accuracy while achieving orders-of-magnitude higher sampling speed, making open-set future prediction both scalable and practical. Project page: http://compvis.github.io/myriad.","published_date":"2026-04-10T17:46:05+00:00","viability_score":8,"cluster_label":"Visual Forecasting AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Empower simulations and analytics by forecasting future event outcomes from static images.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.09521v1","title":"Semantic Rate-Distortion for Bounded Multi-Agent Communication: Capacity-Derived Semantic Spaces and the Communication Cost of Alignment","abstract":"When two agents of different computational capacities interact with the same environment, they need not compress a common semantic alphabet differently; they can induce different semantic alphabets altogether. We show that the quotient POMDP $Q_{m,T}(M)$ - the unique coarsest abstraction consistent with an agent's capacity - serves as a capacity-derived semantic space for any bounded agent, and that communication between heterogeneous agents exhibits a sharp structural phase transition. Below a critical rate $R_{\\text{crit}}$ determined by the quotient mismatch, intent-preserving communication is structurally impossible. In the supported one-way memoryless regime, classical side-information coding then yields exponential decay above the induced benchmark. Classical coding theorems tell you the rate once the source alphabet is fixed; our contribution is to derive that alphabet from bounded interaction itself.   
Concretely, we prove: (1) a fixed-$\\varepsilon$ structural phase-transition theorem whose lower bound is fully general on the common-history quotient comparison; (2) a one-way Wyner-Ziv benchmark identification on quotient alphabets, with exact converse, exact operational equality for memoryless quotient sources, and an ergodic long-run bridge via explicit mixing bounds; (3) an asymptotic one-way converse in the shrinking-distortion regime $\\varepsilon = O(1/T)$, proved from the message stream and decoder side information; and (4) alignment traversal bounds enabling compositional communication through intermediate capacity levels. Experiments on eight POMDP environments (including RockSample(4,4)) illustrate the phase transition, a structured-policy benchmark shows the one-way rate can drop by up to $19\\times$ relative to the counting bound, and a shrinking-distortion sweep matches the regime of the asymptotic converse.","published_date":"2026-04-10T17:41:38+00:00","viability_score":3,"cluster_label":"Multi-Agent Communication","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A theoretical framework for understanding communication between agents with different computational capacities, defining capacity-derived semantic spaces and identifying critical communication rates.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.09508v1","title":"VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning","abstract":"Visual Retrieval-Augmented Generation (VRAG) empowers Vision-Language Models to retrieve and reason over visually rich documents. To tackle complex queries requiring multi-step reasoning, agentic VRAG systems interleave reasoning with iterative retrieval. However, existing agentic VRAG faces two critical bottlenecks. 
(1) Visual Evidence Sparsity: key evidence is scattered across pages yet processed in isolation, hindering cross-page reasoning; moreover, fine-grained intra-image evidence often requires precise visual actions, whose misuse degrades retrieval quality; (2) Search Drift in Long Horizons: the accumulation of visual tokens across retrieved pages dilutes context and causes cognitive overload, leading agents to deviate from their search objective. To address these challenges, we propose VISOR (Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning), a unified single-agent framework. VISOR features a structured Evidence Space for progressive cross-page reasoning, coupled with a Visual Action Evaluation and Correction mechanism to manage visual actions. Additionally, we introduce a Dynamic Trajectory with Sliding Window and Intent Injection to mitigate search drift. They anchor the evidence space while discarding earlier raw interactions, preventing context from being overwhelmed by visual tokens. We train VISOR using a Group Relative Policy Optimization-based Reinforcement Learning (GRPO-based RL) pipeline with state masking and credit assignment tailored for dynamic context reconstruction. 
Extensive experiments on ViDoSeek, SlideVQA, and MMLongBench demonstrate that VISOR achieves state-of-the-art performance with superior efficiency for long-horizon visual reasoning tasks.","published_date":"2026-04-10T17:25:34+00:00","viability_score":6,"cluster_label":"AI & Computer Vision","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A visual retrieval-augmented generation system improving on state-of-the-art, targeting visual data workflows.","time_to_mvp":"1-2 weeks","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09502v1","title":"Strategic Algorithmic Monoculture: Experimental Evidence from Coordination Games","abstract":"AI agents increasingly operate in multi-agent environments where outcomes depend on coordination. We distinguish primary algorithmic monoculture -- baseline action similarity -- from strategic algorithmic monoculture, whereby agents adjust similarity in response to incentives. We implement a simple experimental design that cleanly separates these forces, and deploy it on human and large language model (LLM) subjects. LLMs exhibit high levels of baseline similarity (primary monoculture) and, like humans, they regulate it in response to coordination incentives (strategic monoculture). 
While LLMs coordinate extremely well on similar actions, they lag behind humans in sustaining heterogeneity when divergence is rewarded.","published_date":"2026-04-10T17:14:46+00:00","viability_score":1,"cluster_label":"Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"LLMs exhibit both baseline and strategic algorithmic monoculture in coordination games, but lag humans in sustaining rewarded heterogeneity.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09497v1","title":"BERT-as-a-Judge: A Robust Alternative to Lexical Methods for Efficient Reference-Based LLM Evaluation","abstract":"Accurate evaluation is central to the large language model (LLM) ecosystem, guiding model selection and downstream adoption across diverse use cases. In practice, however, evaluating generative outputs typically relies on rigid lexical methods to extract and assess answers, which can conflate a model's true problem-solving ability with its compliance with predefined formatting guidelines. While recent LLM-as-a-Judge approaches mitigate this issue by assessing semantic correctness rather than strict structural conformity, they also introduce substantial computational overhead, making evaluation costly. In this work, we first systematically investigate the limitations of lexical evaluation through a large-scale empirical study spanning 36 models and 15 downstream tasks, demonstrating that such methods correlate poorly with human judgments. To address this limitation, we introduce BERT-as-a-Judge, an encoder-driven approach for assessing answer correctness in reference-based generative settings, robust to variations in output phrasing, and requiring only lightweight training on synthetically annotated question-candidate-reference triplets. We show that it consistently outperforms the lexical baseline while matching the performance of much larger LLM judges, providing a compelling tradeoff between the two and enabling reliable, scalable evaluation. 
Finally, through extensive experimentation, we provide detailed insights into BERT-as-a-Judge's performance to offer practical guidance for practitioners, and release all project artifacts to foster downstream adoption.","published_date":"2026-04-10T17:08:40+00:00","viability_score":7,"cluster_label":"LLM Evaluation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"BERT-as-a-Judge offers a scalable and robust alternative to lexical methods for LLM evaluation, matching larger models at lower computational cost.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.09494v1","title":"RecaLLM: Addressing the Lost-in-Thought Phenomenon with Explicit In-Context Retrieval","abstract":"We propose RecaLLM, a set of reasoning language models post-trained to make effective use of long-context information. In-context retrieval, which identifies relevant evidence from context, and reasoning are deeply intertwined: retrieval supports reasoning, while reasoning often determines what must be retrieved. However, their interaction remains largely underexplored. In preliminary experiments on several open-source LLMs, we observe that in-context retrieval performance substantially degrades even after a short reasoning span, revealing a key bottleneck for test-time scaling that we refer to as lost-in-thought: reasoning steps that improve performance also make subsequent in-context retrieval more challenging. To address this limitation, RecaLLM interleaves reasoning with explicit in-context retrieval, alternating between reasoning and retrieving context information needed to solve intermediate subproblems. We introduce a negligible-overhead constrained decoding mechanism that enables verbatim copying of evidence spans, improving the grounding of subsequent generation. 
Trained on diverse lexical and semantic retrieval tasks, RecaLLM achieves strong performance on two long-context benchmarks, RULER and HELMET, significantly outperforming baselines. Notably, we observe consistent gains at context windows of up to 128K tokens using training samples of at most 10K tokens, far shorter than those used by existing long-context approaches, highlighting a promising path toward improving long-context performance without expensive long-context training data.","published_date":"2026-04-10T17:04:32+00:00","viability_score":7,"cluster_label":"LLM Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"RecaLLM addresses the 'lost-in-thought' phenomenon in LLMs by interleaving explicit in-context retrieval with reasoning, improving long-context performance without expensive training data.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.09489v1","title":"XFED: Non-Collusive Model Poisoning Attack Against Byzantine-Robust Federated Classifiers","abstract":"Model poisoning attacks pose a significant security threat to Federated Learning (FL). Most existing model poisoning attacks rely on collusion, requiring adversarial clients to coordinate by exchanging local benign models and synchronizing the generation of their poisoned updates. However, sustaining such coordination is increasingly impractical in real-world FL deployments, as it effectively requires botnet-like control over many devices. This approach is costly to maintain and highly vulnerable to detection. This context raises a fundamental question: Can model poisoning attacks remain effective without any communication between attackers? To address this challenge, we introduce and formalize the \\textbf{non-collusive attack model}, in which all compromised clients share a common adversarial objective but operate independently. 
Under this model, each attacker generates its malicious update without communicating with other adversaries, accessing other clients' updates, or relying on any knowledge of server-side defenses. To demonstrate the feasibility of this threat model, we propose \\textbf{XFED}, the first aggregation-agnostic, non-collusive model poisoning attack. Our empirical evaluation across six benchmark datasets shows that XFED bypasses eight state-of-the-art defenses and outperforms six existing model poisoning attacks. These findings indicate that FL systems are substantially less secure than previously believed and underscore the urgent need for more robust and practical defense mechanisms.","published_date":"2026-04-10T16:54:29+00:00","viability_score":4,"cluster_label":"Federated Learning Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"XFED is the first aggregation-agnostic, non-collusive model poisoning attack against Byzantine-robust federated classifiers, demonstrating significant security vulnerabilities.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.09482v1","title":"Process Reward Agents for Steering Knowledge-Intensive Reasoning","abstract":"Reasoning in knowledge-intensive domains remains challenging as intermediate steps are often not locally verifiable: unlike math or code, evaluating step correctness may require synthesizing clues across large external knowledge sources. As a result, subtle errors can propagate through reasoning traces, potentially never to be detected. Prior work has proposed process reward models (PRMs), including retrieval-augmented variants, but these methods operate post hoc, scoring completed trajectories, which prevents their integration into dynamic inference procedures. Here, we introduce Process Reward Agents (PRA), a test-time method for providing domain-grounded, online, step-wise rewards to a frozen policy. 
In contrast to prior retrieval-augmented PRMs, PRA enables search-based decoding to rank and prune candidate trajectories at every generation step. Experiments on multiple medical reasoning benchmarks demonstrate that PRA consistently outperforms strong baselines, achieving 80.8% accuracy on MedQA with Qwen3-4B, a new state of the art at the 4B scale. Importantly, PRA generalizes to unseen frozen policy models ranging from 0.5B to 8B parameters, improving their accuracy by up to 25.7% without any policy model updates. More broadly, PRA suggests a paradigm in which frozen reasoners are decoupled from domain-specific reward modules, allowing the deployment of new backbones in complex domains without retraining.","published_date":"2026-04-10T16:45:44+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Process Reward Agents provide step-wise rewards to frozen LLMs for improved reasoning accuracy in knowledge-intensive domains like medicine.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09474v1","title":"SafeMind: A Risk-Aware Differentiable Control Framework for Adaptive and Safe Quadruped Locomotion","abstract":"Learning-based quadruped controllers achieve impressive agility but typically lack formal safety guarantees under model uncertainty, perception noise, and unstructured contact conditions. We introduce SafeMind, a differentiable stochastic safety-control framework that unifies probabilistic Control Barrier Functions with semantic context understanding and meta-adaptive risk calibration. SafeMind explicitly models epistemic and aleatoric uncertainty through a variance-aware barrier constraint embedded in a differentiable quadratic program, thereby preserving gradient flow for end-to-end training. 
A semantics-to-constraint encoder modulates safety margins using perceptual or language cues, while a meta-adaptive learner continuously adjusts risk sensitivity across environments. We provide theoretical conditions for probabilistic forward invariance, feasibility, and stability under stochastic dynamics. SafeMind is deployed on Unitree A1 and ANYmal C at 200 Hz and validated across 12 terrain types, dynamic obstacles, morphology perturbations, and semantically defined tasks. Experiments show that SafeMind reduces safety violations by 3--10x and energy consumption by 10--15% relative to state-of-the-art CBF, MPC, and hybrid RL baselines, while maintaining real-time control performance.","published_date":"2026-04-10T16:33:32+00:00","viability_score":4,"cluster_label":"Robotics","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"SafeMind is a differentiable control framework for quadrupeds that unifies safety guarantees with adaptive locomotion.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09455v1","title":"E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning","abstract":"While Large Language Models (LLMs) have demonstrated significant potential in Tool-Integrated Reasoning (TIR), existing training paradigms face significant limitations: Zero-RL suffers from inefficient exploration and mode degradation due to a lack of prior guidance, while SFT-then-RL is limited by high data costs and capability plateaus caused by low-entropy collapse. To address these challenges, we propose E3-TIR (Enhanced Experience Exploitation), a warm-up paradigm for the early stages of agent training. Specifically, we formulate training as the dynamic integration of three experience types: Expert Prefixes, Expert Guided, and Self-Exploration. 
By executing diverse branching exploration around expert \"anchors\" and employing a mix policy optimization mechanism, we effectively mitigate distribution shifts and resolve optimization conflicts arising from shared prefixes. Our method dynamically adapts the model's knowledge boundaries, effectively balancing exploration diversity with training efficiency. Experimental results demonstrate that E3-TIR achieves a 6% performance improvement over traditional paradigms on tool-use tasks, while requiring less than 10% of the synthetic data. Furthermore, in terms of ROI, a comprehensive metric integrating performance, data cost, and training efficiency, we achieve a 1.46x gain compared to baselines. Code is available at https://github.com/yuki-younai/E3-TIR.","published_date":"2026-04-10T16:14:48+00:00","viability_score":7,"cluster_label":"Tool-Integrated Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"E3-TIR enhances AI's reasoning capabilities by integrating advanced tool exploitation, outperforming existing solutions.","time_to_mvp":"","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.09452v1","title":"SafeAdapt: Provably Safe Policy Updates in Deep Reinforcement Learning","abstract":"Safety guarantees are a prerequisite to the deployment of reinforcement learning (RL) agents in safety-critical tasks. Often, deployment environments exhibit non-stationary dynamics or are subject to changing performance goals, requiring updates to the learned policy. This leads to a fundamental challenge: how to update an RL policy while preserving its safety properties on previously encountered tasks? The majority of current approaches either do not provide formal guarantees or verify policy safety only a posteriori. We propose a novel a priori approach to safe policy updates in continual RL by introducing the Rashomon set: a region in policy parameter space certified to meet safety constraints within the demonstration data distribution. 
We then show that one can provide formal, provable guarantees for arbitrary RL algorithms used to update a policy by projecting their updates onto the Rashomon set. Empirically, we validate this approach across grid-world navigation environments (Frozen Lake and Poisoned Apple) where we guarantee an a priori provably deterministic safety on the source task during downstream adaptation. In contrast, we observe that regularisation-based baselines experience catastrophic forgetting of safety constraints while our approach enables strong adaptation with provable guarantees that safety is preserved.","published_date":"2026-04-10T16:09:39+00:00","viability_score":7,"cluster_label":"Reinforcement Learning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"SafeAdapt provides provable safety guarantees for updating reinforcement learning policies in non-stationary environments.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.09450v1","title":"ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion","abstract":"Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision--language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose \\textbf{ECHO}, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. 
ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by \\textbf{64.33\\%} and \\textbf{60.58\\%} respectively, while achieving an \\textbf{$8\\times$} inference speedup without compromising clinical accuracy.","published_date":"2026-04-10T16:07:14+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An efficient diffusion-based vision-language model for one-step chest X-ray report generation that significantly speeds up inference while maintaining clinical accuracy.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09443v1","title":"Many-Tier Instruction Hierarchy in LLM Agents","abstract":"Large language model agents receive instructions from many sources-system messages, user prompts, tool outputs, and more-each carrying different levels of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. 
We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even the current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.","published_date":"2026-04-10T16:00:04+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new benchmark and paradigm for LLM agents to resolve instruction conflicts across many privilege levels, addressing a critical gap in agent safety and effectiveness.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09439v1","title":"TME-PSR: Time-aware, Multi-interest, and Explanation Personalization for Sequential Recommendation","abstract":"In this paper, we propose a sequential recommendation model that integrates Time-aware personalization, Multi-interest personalization, and Explanation personalization for Personalized Sequential Recommendation (TME-PSR). That is, we consider the differences across different users in temporal rhythm preference, multiple fine-grained latent interests, and the personalized semantic alignment between recommendations and explanations. 
Specifically, the proposed TME-PSR model employs a dual-view gated time encoder to capture personalized temporal rhythms, a lightweight multihead Linear Recurrent Unit architecture that enables fine-grained sub-interest modeling with improved efficiency, and a dynamic dual-branch mutual information weighting mechanism to achieve personalized alignment between recommendations and explanations. Extensive experiments on real-world datasets demonstrate that our method consistently improves recommendation accuracy and explanation quality, at a lower computational cost.","published_date":"2026-04-10T15:55:54+00:00","viability_score":7,"cluster_label":"Recommendation Systems","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A personalized sequential recommendation model that integrates time, multi-interest, and explanation personalization for improved accuracy and explanation quality at lower computational cost.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09434v1","title":"Physics-guided surrogate learning enables zero-shot control of turbulent wings","abstract":"Turbulent boundary layers over aerodynamic surfaces are a major source of aircraft drag, yet their control remains challenging due to multiscale dynamics and spatial variability, particularly under adverse pressure gradients. Reinforcement learning has outperformed state-of-the-art strategies in canonical flows, but its application to realistic geometries is limited by computational cost and transferability. Here we show that these limitations can be overcome by exploiting local structures of wall-bounded turbulence. Policies are trained in turbulent channel flows matched to wing boundary-layer statistics and deployed directly onto a NACA4412 wing at $Re_c=2\\times10^5$ without further training, being the so-called zero-shot control. 
This achieves a 28.7\\% reduction in skin-friction drag and a 10.7\\% reduction in total drag, outperforming the state-of-the-art opposition control by 40\\% in friction drag reduction and 5\\% in total drag. Training cost is reduced by four orders of magnitude relative to on-wing training, enabling scalable flow control.","published_date":"2026-04-10T15:50:44+00:00","viability_score":3,"cluster_label":"Aerospace AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Physics-guided surrogate learning enables zero-shot control of turbulent wings, significantly reducing drag with drastically reduced training cost.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09430v1","title":"On the Representational Limits of Quantum-Inspired 1024-D Document Embeddings: An Experimental Evaluation Framework","abstract":"Text embeddings are central to modern information retrieval and Retrieval-Augmented Generation (RAG). While dense models derived from Large Language Models (LLMs) dominate current practice, recent work has explored quantum-inspired alternatives motivated by the geometric properties of Hilbert-like spaces and their potential to encode richer semantic structure.   This paper presents an experimental framework for constructing quantum-inspired 1024-dimensional document embeddings based on overlapping windows and multi-scale aggregation. The pipeline combines semantic projections (e.g., EigAngle), circuit-inspired feature mappings, and optional teacher-student distillation, together with a fingerprinting mechanism for reproducibility and controlled evaluation.   We introduce a set of diagnostic tools for hybrid retrieval, including static and dynamic interpolation between BM25 and embedding-based scores, candidate union strategies, and a conceptual alpha-oracle that provides an upper bound for score-level fusion.   
Experiments on controlled corpora of Italian and English documents across technical, narrative, and legal domains, using synthetic queries, show that BM25 remains a strong baseline, teacher embeddings provide stable semantic structure, and standalone quantum-inspired embeddings exhibit weak and unstable ranking signals. Distillation yields mixed effects, improving alignment in some cases but not consistently enhancing retrieval performance, while hybrid retrieval can recover competitive results when lexical and embedding-based signals are combined.   Overall, the results highlight structural limitations in the geometry of quantum-inspired embeddings, including distance compression and ranking instability, and clarify their role as auxiliary components rather than standalone retrieval representations.","published_date":"2026-04-10T15:48:37+00:00","viability_score":2,"cluster_label":"Information Retrieval","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper introduces a framework to evaluate quantum-inspired document embeddings, finding they have structural limitations and are best used as auxiliary components.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09429v1","title":"Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories","abstract":"Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task needs what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. We represent each camera as dense ray pixels (raxels) and denoise them jointly with video frames through Decoupled Self-Cross Attention mechanism. 
A single trained model handles three tasks: predicting camera trajectories from video, jointly generating video and camera trajectory from input images, and generating video from input images along a target camera trajectory. Because the model can both predict trajectories from a video and generate views conditioned on its own predictions, we evaluate it through a closed-loop self-consistency test, demonstrating that its forward and inverse predictions agree. Notably, trajectory prediction requires far fewer denoising steps than video generation; even a few denoising steps suffice for self-consistency. We report results on pose estimation and camera-controlled video generation.","published_date":"2026-04-10T15:47:23+00:00","viability_score":7,"cluster_label":"Generative Video","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A Video Diffusion Model that jointly learns video frames and camera trajectories, enabling novel view synthesis and pose estimation.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.09426v1","title":"Three Modalities, Two Design Probes, One Prototype, and No Vision: Experience-Based Co-Design of a Multi-modal 3D Data Visualization Tool","abstract":"Three-dimensional (3D) data visualizations, such as surface plots, are vital in STEM fields from biomedical imaging to spectroscopy, yet remain largely inaccessible to blind and low-vision (BLV) people. To address this gap, we conducted an Experience-Based Co-Design with BLV co-designers with expertise in non-visual data representations to create an accessible, multi-modal, web-native visualization tool. Using a multi-phase methodology, our team of five BLV and one non-BLV researcher(s) participated in two iterative sessions, comparing a low-fidelity tactile probe with a high-fidelity digital prototype. 
This process produced a prototype with empirically grounded features, including reference sonification, stereo and volumetric audio, and configurable buffer aggregation, which our co-designers validated as improving analytic accuracy and learnability. In this study, we target core analytic tasks essential for non-visual 3D data exploration: orientation, landmark and peak finding, comparing local maxima versus global trends, gradient tracing, and identifying occluded or partially hidden features. Our work offers accessibility researchers and developers a co-design protocol for translating tactile knowledge to digital interfaces, concrete design guidance for future systems, and opportunities to extend accessible 3D visualization into embodied data environments.","published_date":"2026-04-10T15:39:10+00:00","viability_score":3,"cluster_label":"Accessibility Tools","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper details the co-design process for an accessible, multi-modal 3D data visualization tool for blind and low-vision users.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09417v1","title":"Do We Really Need to Approach the Entire Pareto Front in Many-Objective Bayesian Optimisation?","abstract":"Many-objective optimisation, a subset of multi-objective optimisation, involves optimisation problems with more than three objectives. As the number of objectives increases, the number of solutions needed to adequately represent the entire Pareto front typically grows substantially. This makes it challenging, if not infeasible, to design a search algorithm capable of effectively exploring the entire Pareto front. This difficulty is particularly acute in the Bayesian optimisation paradigm, where sample efficiency is critical and only a limited number of solutions (often a few hundred) are evaluated. 
Moreover, after the optimisation process, the decision-maker eventually selects just one solution for deployment, regardless of how many high-quality, diverse solutions are available. In light of this, we argue that under a very limited evaluation budget, it may be more useful to focus on finding a single solution of the highest possible quality for the decision-maker, rather than aiming to approximate the entire Pareto front as existing many-/multi-objective Bayesian optimisation methods typically do. Bearing this idea in mind, this paper proposes a \\underline{s}ingle \\underline{p}oint-based \\underline{m}ulti-\\underline{o}bjective search framework (SPMO) that aims to improve the quality of solutions along a direction that leads to a good tradeoff between objectives. Within SPMO, we present a simple acquisition function, called expected single-point improvement (ESPI), working under both noiseless and noisy scenarios. We show that ESPI can be optimised effectively with gradient-based methods via the sample average approximation (SAA) approach and theoretically prove its convergence guarantees under the SAA. We also empirically demonstrate that the proposed SPMO is computationally tractable and outperforms state-of-the-art methods on a wide range of benchmark and real-world problems.","published_date":"2026-04-10T15:27:49+00:00","viability_score":7,"cluster_label":"Bayesian Optimization","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A new Bayesian optimization framework that focuses on finding a single high-quality solution rather than approximating the entire Pareto front for many-objective problems.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.09415v1","title":"PhysInOne: Visual Physics Learning and Reasoning in One Suite","abstract":"We present PhysInOne, a large-scale synthetic dataset addressing the critical scarcity of physically-grounded training data for AI systems. 
Unlike existing datasets limited to merely hundreds or thousands of examples, PhysInOne provides 2 million videos across 153,810 dynamic 3D scenes, covering 71 basic physical phenomena in mechanics, optics, fluid dynamics, and magnetism. Distinct from previous works, our scenes feature multiobject interactions against complex backgrounds, with comprehensive ground-truth annotations including 3D geometry, semantics, dynamic motion, physical properties, and text descriptions. We demonstrate PhysInOne's efficacy across four emerging applications: physics-aware video generation, long-/short-term future frame prediction, physical property estimation, and motion transfer. Experiments show that fine-tuning foundation models on PhysInOne significantly enhances physical plausibility, while also exposing critical gaps in modeling complex physical dynamics and estimating intrinsic properties. As the largest dataset of its kind, orders of magnitude beyond prior works, PhysInOne establishes a new benchmark for advancing physics-grounded world models in generation, simulation, and embodied AI.","published_date":"2026-04-10T15:27:27+00:00","viability_score":7,"cluster_label":"Generative AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A massive synthetic dataset for training AI to understand and generate physically plausible videos, enabling advancements in world models.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09413v1","title":"Yes, But Not Always. Generative AI Needs Nuanced Opt-in","abstract":"This paper argues that a one-size-fits-all approach to specifying consent for the use of creative works in generative AI is insufficient. Real-world ownership and rights holder structures, the imitation of artistic styles and likeness, and the limitless contexts of use of AI outputs make the status quo of binary consent with opt-in by default untenable. 
To move beyond the current impasse, we consider levers of control in generative AI workflows at training, inference, and dissemination. Based on these insights, we position inference-time opt-in as an overlooked opportunity for nuanced consent verification. We conceptualize nuanced consent conditions for opt-in and propose an agent-based inference-time opt-in architecture to verify if user intent requests meet conditional consent granted by rights holders. In a case study for music, we demonstrate that nuanced opt-in at inference can account for established rights and re-establish a balance of power between rights holders and AI developers.","published_date":"2026-04-10T15:26:57+00:00","viability_score":5,"cluster_label":"Generative AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An agent-based system for nuanced, inference-time consent verification in generative AI, balancing rights holder control with developer flexibility.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.09408v1","title":"HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?","abstract":"Frontier coding agents solve complex tasks when given complete context but collapse when specifications are incomplete or ambiguous. The bottleneck is not raw capability, but judgment: knowing when to act autonomously and when to ask for help. Current benchmarks are blind to this failure mode. They supply unambiguous detailed instructions and solely reward execution correctness, so an agent that makes a lucky guess for a missing requirement will score identically to one that would have asked to be certain.   We present HiL-Bench (Human-in-the-Loop Benchmark) to measure this selective escalation skill. Each task contains human-validated blockers (missing information, ambiguous requests, contradictory information) that surface only through progressive exploration, not upfront inspection. 
Our core metric, Ask-F1, the harmonic mean of question precision and blocker recall, captures the tension between over-asking and silent guessing; its structure architecturally prevents gaming through question spam.   Evaluation across SWE and text-to-SQL domains reveals a large universal judgment gap: no frontier model recovers more than a fraction of its full-information performance when deciding whether to ask. Failure analysis identifies three key help-seeking patterns: overconfident wrong beliefs with no gap detection; high uncertainty detection yet persistent errors; broad, imprecise escalation without self-correction. These consistent patterns confirm poor help-seeking is a model-level flaw, not task-specific. RL training on shaped Ask-F1 reward shows judgment is trainable: a 32B model improves both help-seeking quality and task pass rate, with gains that transfer across domains. The model does not learn domain-specific heuristics for when to ask; it learns to detect unresolvable uncertainty and act on it.","published_date":"2026-04-10T15:21:44+00:00","viability_score":7,"cluster_label":"Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark and training method for AI agents to intelligently ask for help when faced with ambiguity, improving task completion and reducing errors.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.09388v1","title":"The AI Codebase Maturity Model: From Assisted Coding to Self-Sustaining Systems","abstract":"AI coding tools are widely adopted, but most teams plateau at prompt-and-review without a framework for systematic progression. This paper presents the AI Codebase Maturity Model (ACMM), a 5-level framework describing how codebases evolve from basic AI-assisted coding to self-sustaining systems. 
Inspired by CMMI, each level is defined by its feedback loop topology: the specific mechanisms that must exist before the next level becomes possible. I validate the model through a 4-month experience report maintaining KubeStellar Console, a CNCF Kubernetes dashboard built from scratch with Claude Code (Opus) and GitHub Copilot. The system currently operates with 63 CI/CD workflows, 32 nightly test suites, 91% code coverage, and achieves bug-to-fix times under 30 minutes, 24 hours a day. The central finding: the intelligence of an AI-driven development system resides not in the AI model itself, but in the infrastructure of instructions, tests, metrics, and feedback loops that surround it. You cannot skip levels, and at each level, the thing that unlocks the next one is another feedback mechanism. Testing (the volume of test cases, the coverage thresholds, and the reliability of test execution) proved to be the single most important investment in the entire journey.","published_date":"2026-04-10T15:00:59+00:00","viability_score":7,"cluster_label":"AI Development Tools","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A maturity model and framework for evolving AI-assisted coding into self-sustaining development systems, emphasizing feedback loops over AI models.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09378v1","title":"BadSkill: Backdoor Attacks on Agent Skills via Model-in-Skill Poisoning","abstract":"Agent ecosystems increasingly rely on installable skills to extend functionality, and some skills bundle learned model artifacts as part of their execution logic. This creates a supply-chain risk that is not captured by prompt injection or ordinary plugin misuse: a third-party skill may appear benign while concealing malicious behavior inside its bundled model. We present BadSkill, a backdoor attack formulation that targets this model-in-skill threat surface. 
In BadSkill, an adversary publishes a seemingly benign skill whose embedded model is backdoor-fine-tuned to activate a hidden payload only when routine skill parameters satisfy attacker-chosen semantic trigger combinations. To realize this attack, we train the embedded classifier with a composite objective that combines classification loss, margin-based separation, and poison-focused optimization, and evaluate it in an OpenClaw-inspired simulation environment that preserves third-party skill installation and execution while enabling controlled multi-model study. Our benchmark spans 13 skills, including 8 triggered tasks and 5 non-trigger control skills, with a combined main evaluation set of 571 negative-class queries and 396 trigger-aligned queries. Across eight architectures (494M--7.1B parameters) from five model families, BadSkill achieves up to 99.5\\% average attack success rate (ASR) across the eight triggered skills while maintaining strong benign-side accuracy on negative-class queries. In poison-rate sweeps on the standard test split, a 3\\% poison rate already yields 91.7\\% ASR. The attack remains effective across the evaluated model scales and under five text perturbation types. 
These findings identify model-bearing skills as a distinct model supply-chain risk in agent ecosystems and motivate stronger provenance verification and behavioral vetting for third-party skill artifacts.","published_date":"2026-04-10T14:48:29+00:00","viability_score":3,"cluster_label":"AI Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel backdoor attack formulation targeting model-in-skill threats within agent ecosystems.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.09360v1","title":"LLM-Rosetta: A Hub-and-Spoke Intermediate Representation for Cross-Provider LLM API Translation","abstract":"The rapid proliferation of Large Language Model (LLM) providers--each exposing proprietary API formats--has created a fragmented ecosystem where applications become tightly coupled to individual vendors. Switching or bridging providers requires $O(N^2)$ bilateral adapters, impeding portability and multi-provider architectures. We observe that despite substantial syntactic divergence, the major LLM APIs share a common semantic core: the practical challenge is the combinatorial surface of syntactic variations, not deep semantic incompatibility. Based on this finding, we present LLM-Rosetta, an open-source translation framework built on a hub-and-spoke Intermediate Representation (IR) that captures the shared semantic core--messages, content parts, tool calls, reasoning traces, and generation controls--in a 9-type content model and 10-type stream event schema. A modular Ops-composition converter architecture enables each API standard to be added independently. LLM-Rosetta supports bidirectional conversion (provider-to-IR-to-provider) for both request and response payloads, including chunk-level streaming with stateful context management. 
We implement converters for four API standards (OpenAI Chat Completions, OpenAI Responses, Anthropic Messages, and Google GenAI), covering the vast majority of commercial providers. Empirical evaluation demonstrates lossless round-trip fidelity, correct streaming behavior, and sub-100 microsecond conversion overhead--competitive with LiteLLM's single-pass approach while providing bidirectionality and provider neutrality. LLM-Rosetta passes the Open Responses compliance suite and is deployed in production at Argonne National Laboratory. Code is available at https://github.com/Oaklight/llm-rosetta.","published_date":"2026-04-10T14:31:32+00:00","viability_score":7,"cluster_label":"LLM API Translation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An open-source framework for seamless cross-provider LLM API translation using an intermediate representation.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.09349v1","title":"Visually-Guided Policy Optimization for Multimodal Reasoning","abstract":"Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. 
Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks.","published_date":"2026-04-10T14:22:38+00:00","viability_score":3,"cluster_label":"Multimodal Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework to enhance visual focus and counteract visual forgetting in multimodal reasoning models.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.09338v1","title":"Mind the Gap Between Spatial Reasoning and Acting! Step-by-Step Evaluation of Agents With Spatial-Gym","abstract":"Spatial reasoning is central to navigation and robotics, yet measuring model capabilities on these tasks remains difficult. Existing benchmarks evaluate models in a one-shot setting, requiring full solution generation in a single response, unlike humans, who work in interactive environments step-by-step. We introduce Spatial-Gym, a Gymnasium environment that isolates spatial constraint reasoning by testing pathfinding in 2D-grid puzzles as a sequential decision task with optional backtracking. We evaluate eight models in three settings (one-shot, step-by-step, step-by-step with backtracking) against human, random, and A* baselines on 500 episodes. The best model, GPT-OSS 120B, achieves a solve rate of 16.0%, 82 points below the human baseline (98.0%). Step-by-step format helps weaker models (up to +5.4%) by removing formatting errors, but hurts stronger models (up to 5.6%) by constraining global planning. 
Backtracking improves episode completion, but increases solve rate only for weaker models; stronger models rarely backtrack and do not benefit from it. Our experiments have three key findings: (1) models fail to scale reasoning effort with difficulty, (2) vision models receiving images of the spatial environment reduce solve rate by 73%, and (3) extended chain-of-thought reasoning retains a 3-5x accuracy advantage over standard inference even in the step-by-step setting. Spatial-Gym enables diagnosis of model limitations and provides a framework for improving spatial reasoning through reinforcement learning.","published_date":"2026-04-10T14:05:50+00:00","viability_score":4,"cluster_label":"Agent Spatial Reasoning","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Spatial-Gym: A new environment for step-by-step evaluation of agent spatial reasoning capabilities.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.09308v1","title":"Constraint-Aware Corrective Memory for Language-Based Drug Discovery Agents","abstract":"Large language models are making autonomous drug discovery agents increasingly feasible, but reliable success in this setting is not determined by any single action or molecule. It is determined by whether the final returned set jointly satisfies protocol-level requirements such as set size, diversity, binding quality, and developability. This creates a fundamental control problem: the agent plans step by step, while task validity is decided at the level of the whole candidate set. Existing language-based drug discovery systems therefore tend to rely on long raw history and under-specified self-reflection, making failure localization imprecise and planner-facing agent states increasingly noisy. We present CACM (Constraint-Aware Corrective Memory), a language-based drug discovery framework built around precise set-level diagnosis and a concise memory write-back mechanism. 
CACM introduces protocol auditing and a grounded diagnostician, which jointly analyze multimodal evidence spanning task requirements, pocket context, and candidate-set evidence to localize protocol violations, generate actionable remediation hints, and bias the next action toward the most relevant correction. To keep planning context compact, CACM organizes memory into static, dynamic, and corrective channels and compresses them before write-back, thereby preserving persistent task information while exposing only the most decision-relevant failures. Our experimental results show that CACM improves the target-level success rate by 36.4% over the state-of-the-art baseline. The results show that reliable language-based drug discovery benefits not only from more powerful molecular tools, but also from more precise diagnosis and more economical agent states.","published_date":"2026-04-10T13:16:44+00:00","viability_score":4,"cluster_label":"Drug Discovery Agents","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework for language-based drug discovery agents that improves success rates by precisely diagnosing and correcting protocol violations at the set level.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09306v1","title":"SatQNet: Satellite-assisted Quantum Network Entanglement Routing Using Directed Line Graph Neural Networks","abstract":"Quantum networks are expected to become a key enabler for interconnecting quantum devices. In contrast to classical communication networks, however, information transfer in quantum networks is usually restricted to short distances due to physical constraints of entanglement distribution. Satellites can extend entanglement distribution over long distances, but routing in such networks is challenging because satellite motion and stochastic link generation create a highly dynamic quantum topology. 
Existing routing methods often rely on global topology information that quickly becomes outdated due to delays in the classical control plane, while decentralized methods typically act on incomplete local information. We propose SatQNet, a reinforcement learning approach for entanglement routing in satellite-assisted quantum networks that can be decentralized at runtime. Its key innovation is an edge-centric directed line graph neural network that performs local message passing on directed edge embeddings, enabling it to better capture link properties in high-degree and time-varying topologies. By exchanging messages with neighboring repeaters, SatQNet learns a local graph representation at runtime that supports agents in establishing high-fidelity end-to-end entanglements. Trained on random graphs, SatQNet outperforms heuristic and learning-based approaches across diverse settings, including a real-world European backbone topology, and generalizes to unseen topologies without retraining.","published_date":"2026-04-10T13:14:48+00:00","viability_score":4,"cluster_label":"Quantum Networks","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A reinforcement learning approach using graph neural networks for decentralized entanglement routing in dynamic satellite-assisted quantum networks.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09297v1","title":"SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering","abstract":"Agent skills provide modular, task-specific guidance for LLM- based coding agents, but manually tuning skill bundles to balance success rate, cost, and runtime is expensive and fragile. We present SkillMOO, a multi-objective optimization framework that automatically evolves skill bundles using LLM-proposed edits and NSGA-II survivor selection: a solver agent evaluates candidate skill bundles on coding tasks and an optimizer agent proposes bundle edits based on failure analysis. 
On three SkillsBench software engineering tasks, SkillMOO improves pass rate by up to 131% while reducing cost up to 32% relative to the best baseline per task at low optimization overhead. Pattern analysis reveals pruning and substitution as primary drivers of improvement, suggesting effective bundles favor minimal, focused content over accumulated instructions.","published_date":"2026-04-10T13:08:01+00:00","viability_score":4,"cluster_label":"Agent Skills","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A framework that automatically optimizes agent skill bundles for software engineering tasks, improving success rates and reducing costs.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09285v1","title":"SAGE: A Service Agent Graph-guided Evaluation Benchmark","abstract":"The development of Large Language Models (LLMs) has catalyzed automation in customer service, yet benchmarking their performance remains challenging. Existing benchmarks predominantly rely on static paradigms and single-dimensional metrics, failing to account for diverse user behaviors or the strict adherence to structured Standard Operating Procedures (SOPs) required in real-world deployments. To bridge this gap, we propose SAGE (Service Agent Graph-guided Evaluation), a universal multi-agent benchmark for automated, dual-axis assessment. SAGE formalizes unstructured SOPs into Dynamic Dialogue Graphs, enabling precise verification of logical compliance and comprehensive path coverage. We introduce an Adversarial Intent Taxonomy and a modular Extension Mechanism, enabling low-cost deployment across domains and facilitating automated dialogue data synthesis. Evaluation is conducted via a framework where Judge Agents and a Rule Engine analyze interactions between User and Service Agents to generate deterministic ground truth. 
Extensive experiments on 27 LLMs across 6 industrial scenarios reveal a significant ``Execution Gap'' where models accurately classify intents but fail to derive correct subsequent actions. We also observe ``Empathy Resilience'', a phenomenon where models maintain polite conversational facades despite underlying logical failures under high adversarial intensity. Code and resources are available at https://anonymous.4open.science/r/SAGE-Bench-4CD3/.","published_date":"2026-04-10T12:55:23+00:00","viability_score":7,"cluster_label":"Customer Service Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A benchmark for evaluating customer service LLMs with dynamic dialogue graphs and adversarial testing, revealing an 'Execution Gap' in action derivation.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09253v1","title":"Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization","abstract":"Vision-Language Models (VLMs) are powerful but remain vulnerable to multimodal jailbreak attacks. Existing attacks mainly rely on either explicit visual prompt attacks or gradient-based adversarial optimization. While the former is easier to detect, the latter produces subtle perturbations that are less perceptible, but is usually optimized and evaluated under homogeneous open-source surrogate-target settings, leaving its effectiveness on commercial closed-source VLMs under heterogeneous settings unclear. To examine this issue, we study different surrogate-target settings and observe a consistent gap between homogeneous and heterogeneous settings, a phenomenon we term surrogate dependency. 
Motivated by this finding, we propose Mosaic, a Multi-view ensemble optimization framework for multimodal jailbreak against closed-source VLMs, which alleviates surrogate dependency under heterogeneous surrogate-target settings by reducing over-reliance on any single surrogate model and visual view. Specifically, Mosaic incorporates three core components: a Text-Side Transformation module, which perturbs refusal-sensitive lexical patterns; a Multi-View Image Optimization module, which updates perturbations under diverse cropped views to avoid overfitting to a single visual view; and a Surrogate Ensemble Guidance module, which aggregates optimization signals from multiple surrogate VLMs to reduce surrogate-specific bias. Extensive experiments on safety benchmarks demonstrate that Mosaic achieves state-of-the-art Attack Success Rate and Average Toxicity against commercial closed-source VLMs.","published_date":"2026-04-10T12:09:06+00:00","viability_score":7,"cluster_label":"LLM Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Mosaic is a multimodal jailbreak framework that overcomes surrogate dependency to achieve state-of-the-art attack success rates against closed-source vision-language models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09251v1","title":"DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?","abstract":"Deep research agents increasingly interleave web browsing with multi-step computation, yet existing benchmarks evaluate these capabilities in isolation, creating a blind spot in assessing real-world performance. We introduce DRBENCHER, a synthetic benchmark generator for questions that require both browsing and computation. 
It enforces four criteria: verifiability (gold answers are computed by executing parameterized code over knowledge-graph values), complexity (multi-hop entity identification, property retrieval, and domain-specific computation), difficulty (a two-stage verification cascade filters out questions solvable by the generating model), and diversity (a greedy max-min embedding filter maximizes coverage). These criteria are realized via a unified answer-first pipeline spanning five domains: biochemistry, financial, geophysical, security, and history. Human evaluation shows 76% validity (84% excluding stale data), with 35% of errors due to outdated knowledge-graph entries, highlighting an inherent limitation of systems that reason over evolving data. Automatic evaluation shows that the strongest frontier model achieves only 20% answer accuracy. Compared to manually constructed benchmarks (BrowseComp+, MATH-500, GPQA), DRBENCHER achieves the highest semantic diversity.","published_date":"2026-04-10T12:07:22+00:00","viability_score":7,"cluster_label":"AI Agents","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DRBENCHER is a novel benchmark generator for AI agents that require both web browsing and multi-step computation, revealing significant performance gaps in current frontier models.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09246v1","title":"DDSP-QbE++: Improving Speech Quality for Speech Anonymisation for Atypical Speech","abstract":"Differentiable Digital Signal Processing (DDSP) pipelines for voice conversion rely on subtractive synthesis, where a periodic excitation signal is shaped by a learned spectral envelope to reconstruct the target voice. 
In DDSP-QbE, the excitation is generated via phase accumulation, producing a sawtooth-like waveform whose abrupt discontinuities introduce aliasing artefacts that manifest perceptually as buzziness and spectral distortion, particularly at higher fundamental frequencies. We propose two targeted improvements to the excitation stage of the DDSP-QbE subtractive synthesizer. First, we incorporate explicit voicing detection to gate the harmonic excitation, suppressing the periodic component in unvoiced regions and replacing it with filtered noise, thereby avoiding aliased harmonic content where it is most perceptually disruptive. Second, we apply Polynomial Band-Limited Step (PolyBLEP) correction to the phase-accumulated oscillator, substituting the hard waveform discontinuity at each phase wrap with a smooth polynomial residual that cancels alias-generating components without oversampling or spectral truncation. Together, these modifications yield a cleaner harmonic roll-off, reduced high-frequency artefacts, and improved perceptual naturalness, as measured by MOS. The proposed approach is lightweight, differentiable, and integrates seamlessly into the existing DDSP-QbE training pipeline with no additional learnable parameters.","published_date":"2026-04-10T11:58:58+00:00","viability_score":4,"cluster_label":"Audio Synthesis","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This work improves speech anonymization by enhancing the excitation stage of DDSP-QbE with explicit voicing detection and PolyBLEP correction for cleaner, more natural synthesized speech.","time_to_mvp":"1-2 weeks","tags":["quick_build"]},{"arxiv_id":"2604.09234v1","title":"Statistical Properties of the King Wen Sequence: An Anti-Habituation Structure That Does Not Improve Neural Network Training","abstract":"The King Wen sequence of the I-Ching (c. 
1000 BC) orders 64 hexagrams -- states of a six-dimensional binary space -- in a pattern that has puzzled scholars for three millennia. We present a rigorous statistical characterization of this ordering using Monte Carlo permutation analysis against 100,000 random baselines. We find that the sequence has four statistically significant properties: higher-than-random transition distance (98.2nd percentile), negative lag-1 autocorrelation (p=0.037), yang-balanced groups of four (p=0.002), and asymmetric within-pair vs. between-pair distances (99.2nd percentile). These properties superficially resemble principles from curriculum learning and curiosity-driven exploration, motivating the hypothesis that they might benefit neural network training. We test this hypothesis through three experiments: learning rate schedule modulation, curriculum ordering, and seed sensitivity analysis, conducted across two hardware platforms (NVIDIA RTX 2060 with PyTorch and Apple Silicon with MLX). The results are uniformly negative. King Wen LR modulation degrades performance at all tested amplitudes. As curriculum ordering, King Wen is the worst non-sequential ordering on one platform and within noise on the other. A 30-seed sweep confirms that only King Wen's degradation exceeds natural seed variance. We explain why: the sequence's high variance -- the very property that makes it statistically distinctive -- destabilizes gradient-based optimization. 
Anti-habituation in a fixed combinatorial sequence is not the same as effective training dynamics.","published_date":"2026-04-10T11:44:09+00:00","viability_score":3,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This paper statistically analyzes the King Wen sequence and finds it does not improve neural network training, despite its anti-habituation properties.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09232v1","title":"Neural Distribution Prior for LiDAR Out-of-Distribution Detection","abstract":"LiDAR-based perception is critical for autonomous driving due to its robustness to poor lighting and visibility conditions. Yet, current models operate under the closed-set assumption and often fail to recognize unexpected out-of-distribution (OOD) objects in the open world. Existing OOD scoring functions exhibit limited performance because they ignore the pronounced class imbalance inherent in LiDAR OOD detection and assume a uniform class distribution. To address this limitation, we propose the Neural Distribution Prior (NDP), a framework that models the distributional structure of network predictions and adaptively reweights OOD scores based on alignment with a learned distribution prior. NDP dynamically captures the logit distribution patterns of training data and corrects class-dependent confidence bias through an attention-based module. We further introduce a Perlin noise-based OOD synthesis strategy that generates diverse auxiliary OOD samples from input scans, enabling robust OOD training without external datasets. Extensive experiments on the SemanticKITTI and STU benchmarks demonstrate that NDP substantially improves OOD detection performance, achieving a point-level AP of 61.31\\% on the STU test set, which is more than 10$\\times$ higher than the previous best result. 
Our framework is compatible with various existing OOD scoring formulations, providing an effective solution for open-world LiDAR perception.","published_date":"2026-04-10T11:40:18+00:00","viability_score":7,"cluster_label":"Autonomous Driving Perception","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A framework for robust out-of-distribution object detection in LiDAR data for autonomous driving, significantly improving safety by identifying unexpected objects.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09229v1","title":"The Fast Lane Hypothesis: Von Economo Neurons Implement a Biological Speed-Accuracy Tradeoff","abstract":"Von Economo neurons (VENs) are large bipolar projection neurons found exclusively in the anterior cingulate cortex (ACC) and frontal insula of species with complex social cognition, including humans, great apes, and cetaceans. Their selective depletion in frontotemporal dementia (FTD) and altered development in autism implicate them in rapid social decision-making, yet no computational model of VEN function has previously existed. We introduce the Fast Lane Hypothesis: VENs implement a biological speed-accuracy tradeoff (SAT) by providing a sparse, fast projection pathway that enables rapid social decisions at the cost of deliberate processing accuracy. We model VENs as fast leaky integrate-and-fire (LIF) neurons with membrane time constant 5 ms and sparse dendritic fan-in of eight afferents, compared to 20 ms and eighty afferents for standard pyramidal neurons, within a spiking cortical circuit of 2,000 neurons trained on a social discrimination task. Networks are evaluated under three clinically motivated conditions across 10 independent random seeds: typical (2% VENs), autism-like (0.4% VENs), and FTD-like (post-training VEN ablation). 
All configurations achieve equivalent asymptotic classification accuracy (99.4%), consistent with the prediction that VENs modulate decision speed rather than representational capacity. Temporal analysis confirms that VENs produce median first-spike latencies 4 ms earlier than pyramidal neurons. At a fixed decision threshold, the typical condition is significantly faster than FTD-like (t=-23.31, p<0.0001), while autism-like is intermediate (mean RT=26.91+/-9.01 ms vs. typical 20.70+/-2.02 ms; p=0.078). A preliminary evolutionary analysis shows qualitative correspondence between model-optimal VEN fraction and the primate phylogenetic gradient. To our knowledge, this is the first computational model that asks what a Von Economo neuron actually computes.","published_date":"2026-04-10T11:37:19+00:00","viability_score":3,"cluster_label":"Computational Neuroscience","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A computational model proposing that Von Economo neurons facilitate rapid social decision-making by implementing a biological speed-accuracy tradeoff.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.09222v1","title":"GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking","abstract":"Audio large language models (ALLMs) enable rich speech-text interaction, but they also introduce jailbreak vulnerabilities in the audio modality. Existing audio jailbreak methods mainly optimize jailbreak success while overlooking utility preservation, as reflected in transcription quality and question answering performance. In practice, stronger attacks often come at the cost of degraded utility. To study this trade-off, we revisit existing attacks by varying their perturbation coverage in the frequency domain, from partial-band to full-band, and find that broader frequency coverage does not necessarily improve jailbreak performance, while utility consistently deteriorates. 
This suggests that concentrating perturbation on a subset of bands can yield a better attack-utility trade-off than indiscriminate full-band coverage. Based on this insight, we propose GRM, a utility-aware frequency-selective jailbreak framework. It ranks Mel bands by their attack contribution relative to utility sensitivity, perturbs only a selected subset of bands, and learns a reusable universal perturbation under a semantic-preservation objective. Experiments on four representative ALLMs show that GRM achieves an average Jailbreak Success Rate (JSR) of 88.46% while providing a better attack-utility trade-off than representative baselines. These results highlight the potential of frequency-selective perturbation for better balancing attack effectiveness and utility preservation in audio jailbreak. Content Warning: This paper includes harmful query examples and unsafe model responses.","published_date":"2026-04-10T11:27:25+00:00","viability_score":5,"cluster_label":"Audio LLM Security","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"A utility-aware framework for jailbreaking audio LLMs by selectively perturbing frequency bands, balancing attack success with transcription quality.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus"]},{"arxiv_id":"2604.09202v1","title":"On the Role of DAG topology in Energy-Aware Cloud Scheduling : A GNN-Based Deep Reinforcement Learning Approach","abstract":"Cloud providers must assign heterogeneous compute resources to workflow DAGs while balancing competing objectives such as completion time, cost, and energy consumption. In this work, we study a single-workflow, queue-free scheduling setting and consider a graph neural network (GNN)-based deep reinforcement learning scheduler designed to minimize workflow completion time and energy usage. 
We identify specific out-of-distribution (OOD) conditions under which GNN-based deep reinforcement learning schedulers fail and provide a principled explanation of why these failures occur. Through controlled OOD evaluations, we demonstrate that performance degradation stems from structural mismatches between training and deployment environments, which disrupt message passing and undermine policy generalization. Our analysis exposes fundamental limitations of current GNN-based schedulers and highlights the need for more robust representations to ensure reliable scheduling performance under distribution shifts.","published_date":"2026-04-10T10:44:30+00:00","viability_score":4,"cluster_label":"Cloud Scheduling","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Identifies limitations in GNN-based reinforcement learning for cloud scheduling under distribution shifts and highlights the need for more robust representations.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.09200v1","title":"Artificial intelligence can persuade people to take political actions","abstract":"There is substantial concern about the ability of advanced artificial intelligence to influence people's behaviour. A rapidly growing body of research has found that AI can produce large persuasive effects on people's attitudes, but whether AI can persuade people to take consequential real-world actions has remained unclear. In two large preregistered experiments (N=17,950 responses from 14,779 people), we used conversational AI models to persuade participants on a range of attitudinal and behavioural outcomes, including signing real petitions and donating money to charity. We found sizable AI persuasion effects on these behavioural outcomes (e.g. +19.7 percentage points on petition signing). However, we observed no evidence of a correlation between AI persuasion effects on attitudes and behaviour. 
Moreover, we replicated prior findings that information provision drove effects on attitudes, but found no such evidence for our behavioural outcomes. In a test of eight behavioural persuasion strategies, all outperformed the most effective attitudinal persuasion strategy, but differences among the eight were small. Taken together, these results suggest that previous findings relying on attitudinal outcomes may generalize poorly to behaviour, and therefore risk substantially mischaracterizing the real-world behavioural impact of AI persuasion.","published_date":"2026-04-10T10:34:49+00:00","viability_score":3,"cluster_label":"AI Persuasion","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"AI models can significantly persuade people to take real-world actions like signing petitions and donating to charity, but attitudinal persuasion does not correlate with behavioral outcomes.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09197v1","title":"Vision Transformers for Preoperative CT-Based Prediction of Histopathologic Chemotherapy Response Score in High-Grade Serous Ovarian Carcinoma","abstract":"Purpose. High-grade serous ovarian carcinoma (HGSOC) is characterized by pronounced biological and spatial heterogeneity and is frequently diagnosed at an advanced stage. Neoadjuvant chemotherapy (NACT) followed by delayed primary surgery is commonly employed in patients unsuitable for primary cytoreduction. The Chemotherapy Response Score (CRS) is a validated histopathological biomarker of response to NACT, but it is only available postoperatively. In this study, we investigate whether pre-treatment computed tomography (CT) imaging and clinical data can be used to predict CRS as an investigational decision-support adjunct to inform multidisciplinary team (MDT) discussions regarding expected treatment response. Methods. 
We proposed a 2.5D multimodal deep learning framework that processes lesion-dense omental slices using a pre-trained Vision Transformer encoder and integrates the resulting visual representations with clinical variables through an intermediate fusion module to predict CRS. Results. Our multimodal model, integrating imaging and clinical data, achieved a ROC-AUC of 0.95 alongside 95% accuracy and 80% precision on the internal test cohort (IEO, n=41 patients). On the external test set (OV04, n=70 patients), it achieved a ROC-AUC of 0.68, alongside 67% accuracy and 75% precision. Conclusion. These preliminary results demonstrate the feasibility of transformer-based deep learning for preoperative prediction of CRS in HGSOC using routine clinical data and CT imaging. As an investigational, pre-treatment decision-support tool, this approach may assist MDT discussions by providing early, non-invasive estimates of treatment response.","published_date":"2026-04-10T10:33:07+00:00","viability_score":7,"cluster_label":"Medical AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A multimodal deep learning framework using Vision Transformers can predict chemotherapy response in ovarian cancer from CT scans, acting as a decision-support tool for multidisciplinary teams.","time_to_mvp":"6+ months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09195v1","title":"Camera Artist: A Multi-Agent Framework for Cinematic Language Storytelling Video Generation","abstract":"We propose Camera Artist, a multi-agent framework that models a real-world filmmaking workflow to generate narrative videos with explicit cinematic language. While recent multi-agent systems have made substantial progress in automating filmmaking workflows from scripts to videos, they often lack explicit mechanisms to structure narrative progression across adjacent shots and deliberate use of cinematic language, resulting in fragmented storytelling and limited filmic quality. 
To address this, Camera Artist builds upon established agentic pipelines and introduces a dedicated Cinematography Shot Agent, which integrates recursive storyboard generation to strengthen shot-to-shot narrative continuity and cinematic language injection to produce more expressive, film-oriented shot designs. Extensive quantitative and qualitative results demonstrate that our approach consistently outperforms existing baselines in narrative consistency, dynamic expressiveness, and perceived film quality.","published_date":"2026-04-10T10:27:52+00:00","viability_score":6,"cluster_label":"AI Video & Graphics","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Camera Artist automates narrative video creation with advanced cinematic storytelling techniques.","time_to_mvp":"1-3 months","tags":["quick_build","high_potential"]},{"arxiv_id":"2604.09189v1","title":"Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies","abstract":"LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model's self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). 
These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.","published_date":"2026-04-10T10:18:45+00:00","viability_score":7,"cluster_label":"LLM Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"The Symbolic-Neural Consistency Audit (SNCA) framework measures the gap between LLMs' self-stated safety policies and their actual behavior, revealing systematic compliance gaps.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.09175v1","title":"Generalization and Scaling Laws for Mixture-of-Experts Transformers","abstract":"We develop a theory of generalization and scaling for Mixture-of-Experts (MoE) Transformers that cleanly separates \\emph{active} per-input capacity from routing combinatorics. By conditioning on fixed routing patterns and union-bounding across them, we derive a sup-norm covering-number bound whose metric entropy scales with the active parameter budget and incurs a MoE-specific routing overhead. Combined with a standard ERM analysis for squared loss, this yields a generalization bound under a $d$-dimensional manifold data model and $C^\u03b2$ targets, showing that approximation and estimation trade off as in dense networks once active parameters are accounted for appropriately. We further prove a constructive approximation theorem for MoE architectures, showing that, under the approximation construction, error can decrease either by scaling active capacity or by increasing the number of experts, depending on the dominant bottleneck. From these results we derive neural scaling laws for model size, data size, and compute-optimal tradeoffs. 
Overall, our results provide a transparent statistical reference point for reasoning about MoE scaling, clarifying which behaviors are certified by worst-case theory and which must arise from data-dependent routing structure or optimization dynamics.","published_date":"2026-04-10T09:59:48+00:00","viability_score":2,"cluster_label":"LLM Training","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Theoretical analysis of generalization and scaling laws for Mixture-of-Experts Transformers to optimize model size, data size, and compute tradeoffs.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09162v1","title":"Persona-E$^2$: A Human-Grounded Dataset for Personality-Shaped Emotional Responses to Textual Events","abstract":"Most affective computing research treats emotion as a static property of text, focusing on the writer's sentiment while overlooking the reader's perspective. This approach ignores how individual personalities lead to diverse emotional appraisals of the same event. Although role-playing Large Language Models (LLMs) attempt to simulate such nuanced reactions, they often suffer from \"personality illusion'' -- relying on surface-level stereotypes rather than authentic cognitive logic. A critical bottleneck is the absence of ground-truth human data to link personality traits to emotional shifts. To bridge the gap, we introduce Persona-E$^2$ (Persona-Event2Emotion), a large-scale dataset grounded in annotated MBTI and Big Five traits to capture reader-based emotional variations across news, social media, and life narratives. Extensive experiments reveal that state-of-the-art LLMs struggle to capture precise appraisal shifts, particularly in social media domains. 
Crucially, we find that personality information significantly improves comprehension, with the Big Five traits alleviating \"personality illusion.''","published_date":"2026-04-10T09:49:53+00:00","viability_score":7,"cluster_label":"Affective Computing","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A human-grounded dataset and LLM evaluation for personality-shaped emotional responses to text, addressing the 'personality illusion' in affective computing.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]},{"arxiv_id":"2604.09158v1","title":"Structuring versus Problematizing: How LLM-based Agents Scaffold Learning in Diagnostic Reasoning","abstract":"Supporting students in developing diagnostic reasoning is a key challenge across educational domains. Novices often face cognitive biases such as premature closure and over-reliance on heuristics, and they struggle to transfer diagnostic strategies to new cases. Scenario-based learning (SBL) enhanced by Learning Analytics (LA) and large language models (LLM) offers a promising approach by combining realistic case experiences with personalized scaffolding. Yet, how different scaffolding approaches shape reasoning processes remains insufficiently explored. This study introduces PharmaSim Switch, an SBL environment for pharmacy technician training, extended with an LA- and LLM-powered pharmacist agent that implements pedagogical conversations rooted in two theory-driven scaffolding approaches: \\emph{structuring} and \\emph{problematizing}, as well as a student learning trajectory. In a between-groups experiment, 63 vocational students completed a learning scenario, a near-transfer scenario, and a far-transfer scenario under one of the two scaffolding conditions. Results indicate that both scaffolding approaches were effective in supporting the use of diagnostic strategies. 
Performance outcomes were primarily influenced by scenario complexity rather than students' prior knowledge or the scaffolding approach used. The structuring approach was associated with more accurate Active and Interactive participation, whereas problematizing elicited more Constructive engagement. These findings underscore the value of combining scaffolding approaches when designing LA- and LLM-based systems to effectively foster diagnostic reasoning.","published_date":"2026-04-10T09:43:17+00:00","viability_score":3,"cluster_label":"Educational AI","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"Evaluating two LLM-based agent scaffolding approaches ('structuring' vs. 'problematizing') for improving diagnostic reasoning in vocational training.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09155v1","title":"CORA: Conformal Risk-Controlled Agents for Safeguarded Mobile GUI Automation","abstract":"Graphical user interface (GUI) agents powered by vision language models (VLMs) are rapidly moving from passive assistance to autonomous operation. However, this unrestricted action space exposes users to severe and irreversible financial, privacy or social harm. Existing safeguards, which rely on prompt engineering, brittle heuristics and VLM-as-critic, lack formal verification and user-tunable guarantees. We propose CORA (COnformal Risk-controlled GUI Agent), a post-policy, pre-action safeguarding framework that provides statistical guarantees on harmful executed actions. CORA reformulates safety as selective action execution: we train a Guardian model to estimate action-conditional risk for each proposed step. 
Rather than thresholding raw scores, we leverage Conformal Risk Control to calibrate an execute/abstain boundary that satisfies a user-specified risk budget and route rejected actions to a trainable Diagnostician model, which performs multimodal reasoning over rejected actions to recommend interventions (e.g., confirm, reflect, or abort) to minimize user burden. A Goal-Lock mechanism anchors assessment to a clarified, frozen user intent to resist visual injection attacks. To rigorously evaluate this paradigm, we introduce Phone-Harm, a new benchmark of mobile safety violations with step-level harm labels under real-world settings. Experiments on Phone-Harm and public benchmarks against diverse baselines validate that CORA improves the safety--helpfulness--interruption Pareto frontier, offering a practical, statistically grounded safety paradigm for autonomous GUI execution. Code and benchmark are available at cora-agent.github.io.","published_date":"2026-04-10T09:41:21+00:00","viability_score":8,"cluster_label":"AI Safety and Automation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CORA provides statistically grounded safeguards for mobile GUI automation to prevent harmful actions.","time_to_mvp":"1-2 weeks","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.09130v1","title":"EquiformerV3: Scaling Efficient, Expressive, and General SE(3)-Equivariant Graph Attention Transformers","abstract":"As $SE(3)$-equivariant graph neural networks mature as a core tool for 3D atomistic modeling, improving their efficiency, expressivity, and physical consistency has become a central challenge for large-scale applications. In this work, we introduce EquiformerV3, the third generation of the $SE(3)$-equivariant graph attention Transformer, designed to advance all three dimensions: efficiency, expressivity, and generality. Building on EquiformerV2, we have the following three key advances. 
First, we optimize the software implementation, achieving $1.75\\times$ speedup. Second, we introduce simple and effective modifications to EquiformerV2, including equivariant merged layer normalization, improved feedforward network hyper-parameters, and attention with smooth radius cutoff. Third, we propose SwiGLU-$S^2$ activations to incorporate many-body interactions for better theoretical expressivity and to preserve strict equivariance while reducing the complexity of sampling $S^2$ grids. Together, SwiGLU-$S^2$ activations and smooth-cutoff attention enable accurate modeling of smoothly varying potential energy surfaces (PES), generalizing EquiformerV3 to tasks requiring energy-conserving simulations and higher-order derivatives of PES. With these improvements, EquiformerV3 trained with the auxiliary task of denoising non-equilibrium structures (DeNS) achieves state-of-the-art results on OC20, OMat24, and Matbench Discovery.","published_date":"2026-04-10T09:12:16+00:00","viability_score":3,"cluster_label":"3D Atomistic Modeling","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"An advanced SE(3)-equivariant graph attention Transformer for efficient and expressive 3D atomistic modeling, achieving state-of-the-art results on multiple benchmarks.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09121v1","title":"Interactive ASR: Towards Human-Like Interaction and Semantic Coherence Evaluation for Agentic Speech Recognition","abstract":"Recent years have witnessed remarkable progress in automatic speech recognition (ASR), driven by advances in model architectures and large-scale training data. However, two important aspects remain underexplored. First, Word Error Rate (WER), the dominant evaluation metric for decades, treats all words equally and often fails to reflect the semantic correctness of an utterance at the sentence level. 
Second, interactive correction, an essential component of human communication, has rarely been systematically studied in ASR research. In this paper, we integrate these two perspectives under an agentic framework for interactive ASR. We propose leveraging LLM-as-a-Judge as a semantic-aware evaluation metric to assess recognition quality beyond token-level accuracy. Furthermore, we design an LLM-driven agent framework to simulate human-like multi-turn interaction, enabling iterative refinement of recognition outputs through semantic feedback. Extensive experiments are conducted on standard benchmarks, including GigaSpeech (English), WenetSpeech (Chinese), and the ASRU 2019 code-switching test set. Both objective and subjective evaluations demonstrate the effectiveness of the proposed framework in improving semantic fidelity and interactive correction capability. We will release the code to facilitate future research in interactive and agentic ASR.","published_date":"2026-04-10T09:02:42+00:00","viability_score":7,"cluster_label":"Interactive Speech Recognition","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"An agentic framework for interactive ASR that uses LLM-as-a-Judge for semantic evaluation and multi-turn correction to improve recognition quality.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.09111v1","title":"PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing","abstract":"Recently, artificial intelligence-based dubbing technology has advanced, enabling automated dubbing (AD) to convert the source speech of a video into target speech in different languages. However, natural AD still faces synchronization challenges such as duration and lip-synchronization (lip-sync), which are crucial for preserving the viewer experience. 
Therefore, this paper proposes a synchronization method for AD processes that paraphrases translated text, comprising two steps: isochrony for timing constraints and phonetic synchronization (PS) to preserve lip-sync. First, we achieve isochrony by paraphrasing the translated text with a language model, ensuring the target speech duration matches that of the source speech. Second, we introduce PS, which employs dynamic time warping (DTW) with local costs of vowel distances measured from training data so that the target text composes vowels with pronunciations similar to source vowels. Third, we extend this approach to PSComet, which jointly considers semantic and phonetic similarity to preserve meaning better. The proposed methods are incorporated into text-to-speech systems, PS-TTS and PS-Comet TTS. The performance evaluation using Korean and English lip-reading datasets and a voice-actor dubbing dataset demonstrates that both systems outperform TTS without PS on several objective metrics and outperform voice actors in Korean-to-English and English-to-Korean dubbing. We extend the experiments to French, testing all pairs among these languages to evaluate cross-linguistic applicability. 
Across all language pairs, PS-Comet performed best, balancing lip-sync accuracy with semantic preservation, confirming that PS-Comet achieves more accurate lip-sync with semantic preservation than PS alone.","published_date":"2026-04-10T08:42:45+00:00","viability_score":7,"cluster_label":"Text-to-Speech Dubbing","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A text-to-speech system that achieves natural automated dubbing by synchronizing translated text phonetically and temporally with source speech.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.09107v1","title":"TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training","abstract":"Modern LLM reinforcement learning (RL) workloads require a highly efficient weight transfer system to scale training across heterogeneous computational resources. However, existing weight transfer approaches either fail to provide flexibility for dynamically scaling clusters or incur fundamental data movement overhead, resulting in poor performance.   We introduce Reference-Oriented Storage (ROS), a new storage abstraction for RL weight transfer that exploits the highly replicated model weights in place. ROS presents the illusion that certain versions of the model weights are stored and can be fetched on demand. Underneath, ROS does not physically store any copies of the weights; instead, it tracks the workers that hold these weights on GPUs for inference. Upon request, ROS directly uses them to serve reads. We build TensorHub, a production-quality system that extends the ROS idea with topology-optimized transfer, strong consistency, and fault tolerance. Evaluation shows that TensorHub fully saturates RDMA bandwidth and adapts to three distinct rollout workloads with minimal engineering effort. 
Specifically, TensorHub reduces total GPU stall time by up to 6.7x for standalone rollouts, accelerates weight update for elastic rollout by 4.8x, and cuts cross-datacenter rollout stall time by 19x. TensorHub has been deployed in production to support cutting-edge RL training.","published_date":"2026-04-10T08:40:56+00:00","viability_score":6,"cluster_label":"LLM Training Infrastructure","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"TensorHub is a production-ready system for scalable and elastic weight transfer in LLM RL training, significantly reducing GPU stall time.","time_to_mvp":"6+ months","tags":["series_a_plus"]},{"arxiv_id":"2604.09104v1","title":"Scheming in the wild: detecting real-world AI scheming incidents with open-source intelligence","abstract":"Scheming, the covert pursuit of misaligned goals by AI systems, represents a potentially catastrophic risk, yet scheming research suffers from significant limitations. In particular, scheming evaluations demonstrate behaviours that may not occur in real-world settings, limiting scientific understanding, hindering policy development, and not enabling real-time detection of loss of control incidents. Real-world evidence is needed, but current monitoring techniques are not effective for this purpose. This paper introduces a novel open-source intelligence (OSINT) methodology for detecting real-world scheming incidents: collecting and analysing transcripts from chatbot conversations or command-line interactions shared online. Analysing over 183,420 transcripts from X (formerly Twitter), we identify 698 real-world scheming-related incidents between October 2025 and March 2026. We observe a statistically significant 4.9x increase in monthly incidents from the first to last month, compared to a 1.7x increase in posts discussing scheming. 
We find evidence of multiple scheming-related behaviours in real-world deployments previously reported only in experiments, many resulting in real-world harms. While we did not detect catastrophic scheming incidents, the behaviours observed demonstrate concerning precursors, such as willingness to disregard instructions, circumvent safeguards, lie to users, and single-mindedly pursue goals in harmful ways. As AI systems become more capable, these could evolve into more strategic scheming with potentially catastrophic consequences. Our findings demonstrate the viability of transcript-based OSINT as a scalable approach to real-world scheming detection supporting scientific research, policy development, and emergency response. We recommend further investment towards OSINT techniques for monitoring scheming and loss of control.","published_date":"2026-04-10T08:37:18+00:00","viability_score":4,"cluster_label":"AI Safety","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"A novel OSINT methodology to detect real-world AI scheming incidents by analyzing chatbot and command-line transcripts.","time_to_mvp":"6+ months","tags":["high_potential"]},{"arxiv_id":"2604.09101v1","title":"CLIP-Inspector: Model-Level Backdoor Detection for Prompt-Tuned CLIP via OOD Trigger Inversion","abstract":"Organisations with limited data and computational resources increasingly outsource model training to Machine Learning as a Service (MLaaS) providers, who adapt vision-language models (VLMs) such as CLIP to downstream tasks via prompt tuning rather than training from scratch. This semi-honest setting creates a security risk where a malicious provider can follow the prompt-tuning protocol yet implant a backdoor, forcing triggered inputs to be classified into an attacker-chosen class, even for out-of-distribution (OOD) data. Such backdoors leave encoders untouched, making them undetectable to existing methods that focus on encoder corruption. 
Other data-level methods that sanitize data before training or during inference also fail to answer the critical question, \"Is the delivered model backdoored or not?\" To address this model-level verification problem, we introduce CLIP-Inspector (CI), a backdoor detection method designed for prompt-tuned CLIP models. Assuming white-box access to the delivered model and a pool of unlabeled OOD images, CI reconstructs possible triggers for each class to determine if the model exhibits backdoor behaviour or not. Additionally, we demonstrate that using CI's reconstructed trigger for fine-tuning on correctly labeled triggered inputs enables us to re-align the model and reduce backdoor effectiveness. Through extensive experiments across ten datasets and four backdoor attacks, we demonstrate that CI can reconstruct effective triggers in a single epoch using only 1,000 OOD images, achieving a 94% detection accuracy (47/50 models). Compared to adapted trigger-inversion baselines, CI yields a markedly higher AUROC score (0.973 vs 0.495/0.687), thus enabling the vetting and post-hoc repair of prompt-tuned CLIP models to ensure safe deployment.","published_date":"2026-04-10T08:33:56+00:00","viability_score":7,"cluster_label":"Model Security","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"CLIP-Inspector detects backdoors in prompt-tuned CLIP models by inverting out-of-distribution triggers, enabling model vetting and repair.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.09089v1","title":"DeepGuard: Secure Code Generation via Multi-Layer Semantic Aggregation","abstract":"Large Language Models (LLMs) for code generation can replicate insecure patterns from their training data. To mitigate this, a common strategy for security hardening is to fine-tune models using supervision derived from the final transformer layer. 
However, this design may suffer from a final-layer bottleneck: vulnerability-discriminative cues can be distributed across layers and become less detectable near the output representations optimized for next-token prediction. To diagnose this issue, we perform layer-wise linear probing. We observe that vulnerability-related signals are most detectable in a band of intermediate-to-upper layers yet attenuate toward the final layers. Motivated by this observation, we introduce DeepGuard, a framework that leverages distributed security-relevant cues by aggregating representations from multiple upper layers via an attention-based module. The aggregated signal powers a dedicated security analyzer within a multi-objective training objective that balances security enhancement and functional correctness, and further supports a lightweight inference-time steering strategy. Extensive experiments across five code LLMs demonstrate that DeepGuard improves the secure-and-correct generation rate by an average of 11.9% over strong baselines such as SVEN. It also preserves functional correctness while exhibiting generalization to held-out vulnerability types. Our code is public at https://github.com/unknownhl/DeepGuard.","published_date":"2026-04-10T08:19:48+00:00","viability_score":7,"cluster_label":"Secure Code Generation","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"DeepGuard enhances LLM code generation security by aggregating multi-layer representations to detect and mitigate vulnerabilities.","time_to_mvp":"1-3 months","tags":["quick_build","series_a_plus","high_potential"]},{"arxiv_id":"2604.09085v1","title":"Beyond Isolated Clients: Integrating Graph-Based Embeddings into Event Sequence Models","abstract":"Large-scale digital platforms generate billions of timestamped user-item interactions (events) that are crucial for predicting user attributes in, e.g., fraud prevention and recommendations. 
While self-supervised learning (SSL) effectively models the temporal order of events, it typically overlooks the global structure of the user-item interaction graph. To bridge this gap, we propose three model-agnostic strategies for integrating this structural information into contrastive SSL: enriching event embeddings, aligning client representations with graph embeddings, and adding a structural pretext task. Experiments on four financial and e-commerce datasets demonstrate that our approach consistently improves the accuracy (up to a 2.3% AUC) and reveals that graph density is a key factor in selecting the optimal integration strategy.","published_date":"2026-04-10T08:11:41+00:00","viability_score":4,"cluster_label":"Graph-Based Embeddings","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"Integrates graph-based embeddings into event sequence models to improve user attribute prediction for fraud prevention and recommendations.","time_to_mvp":"1-3 months","tags":["high_potential"]},{"arxiv_id":"2604.09072v1","title":"Overhang Tower: Resource-Rational Adaptation in Sequential Physical Planning","abstract":"Humans effortlessly navigate the physical world by predicting how objects behave under gravity and contact forces, yet how such judgments support sequential physical planning under resource constraints remains poorly understood. Research on intuitive physics debates whether prediction relies on the Intuitive Physics Engine (IPE) or fast, cue-based heuristics; separately, decision-making research debates deliberative lookahead versus myopic strategies. These debates have proceeded in isolation, leaving the cognitive architecture of sequential physical planning underspecified. How physical prediction mechanisms and planning strategies jointly adapt under limited cognitive resources remains an open question. 
Here we show that humans exhibit a dual transition under resource pressure, simultaneously shifting both physical prediction mechanism and planning strategy to match cognitive budget. Using Overhang Tower, a construction task requiring participants to maximize horizontal overhang while maintaining stability, we find that IPE-based simulation dominates early stages while CNN-based visual heuristics prevail as complexity grows; concurrently, time pressure truncates deliberative lookahead, shifting planning toward shallower horizons: a dual transition unpredicted by prior single-mechanism accounts. These findings reveal a hierarchical, resource-rational architecture that flexibly trades computational cost against predictive fidelity. Our results unify two long-standing debates (simulation vs. heuristics and myopic vs. deliberative planning) as a dynamic repertoire reconfigured by cognitive budget.","published_date":"2026-04-10T07:54:25+00:00","viability_score":2,"cluster_label":"Cognitive Science","has_code":false,"repo_url":null,"commercial_flags":[],"one_liner":"This research explores how humans adapt physical prediction and planning strategies under cognitive resource constraints, revealing a hierarchical, resource-rational architecture.","time_to_mvp":"6+ months","tags":[]},{"arxiv_id":"2604.09069v1","title":"NyayaMind- A Framework for Transparent Legal Reasoning and Judgment Prediction in the Indian Legal System","abstract":"Court Judgment Prediction and Explanation (CJPE) aims to predict a judicial decision and provide a legally grounded explanation for a given case based on the facts, legal issues, arguments, cited statutes, and relevant precedents. For such systems to be practically useful in judicial or legal research settings, they must not only achieve high predictive performance but also generate transparent and structured legal reasoning that aligns with established judicial practices. 
In this work, we present NyayaMind, an open-source framework designed to enable transparent and scalable legal reasoning for the Indian judiciary. The proposed framework integrates retrieval, reasoning, and verification mechanisms to emulate the structured decision-making process typically followed in courts. Specifically, NyayaMind consists of two main components: a Retrieval Module and a Prediction Module. The Retrieval Module employs a RAG pipeline to identify legally relevant statutes and precedent cases from large-scale legal corpora, while the Prediction Module utilizes reasoning-oriented LLMs fine-tuned for the Indian legal domain to generate structured outputs including issues, arguments, rationale, and the final decision. Our extensive results and expert evaluation demonstrate that NyayaMind significantly improves the quality of explanation and evidence alignment compared to existing CJPE approaches, providing a promising step toward trustworthy AI-assisted legal decision support systems.","published_date":"2026-04-10T07:51:42+00:00","viability_score":7,"cluster_label":"Legal AI","has_code":true,"repo_url":null,"commercial_flags":["has_code"],"one_liner":"NyayaMind is an open-source framework for transparent legal reasoning and judgment prediction in the Indian legal system, improving explanation quality and evidence alignment.","time_to_mvp":"1-3 months","tags":["series_a_plus","high_potential"]}],"meta":{"count":1000,"artifact_id":"public-dataset:2026-04-22T17-29-57-989Z","schema_version":"dataset-public-v3","exported_at":"2026-04-22T17:29:57.989Z","last_updated_at":"2026-04-21T17:59:02.000Z","fresh_until":"2026-04-23T17:29:57.989Z","status":"ready","source_count":1000,"coverage_window":"Public dataset snapshot","method_version":"dataset_export_v3"}}