177 papers - avg viability 5.9
Large language model (LLM) agents are being applied to bug detection, economic modeling, and workflow optimization. AnyPoC automates the generation of proof-of-concept tests for bug validation, making automated bug detection more reliable. Market-Bench evaluates LLMs on economic tasks and reveals performance disparities among agents in competitive environments. Frameworks such as SAGE and CLEAR target modeling strategy and context generation, respectively, to improve LLM performance on complex tasks. For builders, these developments make LLM-driven automation in software development and research more efficient and more dependable.
An LLM-powered multi-agent system that autonomously generates executable proofs-of-concept to validate bug reports in software, significantly improving bug detection accuracy and reducing false positives.
A framework that trains LLM agents to generate task-specific context, improving decision-making and task completion rates.
Market-Bench is a benchmark for evaluating LLMs in economic and trade competition, revealing significant performance disparities in multi-agent supply chain scenarios.
RePoT enables LLM agents to recover from invalid action plans with minimal extra computation, significantly improving reliability.
An agentic framework that learns generalizable principles from low-fidelity LLM experiments to efficiently configure high-cost LLM settings, mimicking human researchers.
A method that automatically distills transferable agent skills from execution experience, enabling LLM agents to tackle complex tasks without parameter updates.
SAGE is a framework that improves LLM-generated optimization programs by making modeling strategy explicit, leading to more correct and efficient formulations.
An LLM-powered workflow optimization system for multidisciplinary software development that drastically reduces development time and improves communication efficiency in the automotive industry.
A fine-tuned small language model that replaces frontier LLMs for agentic terminal execution, reducing token usage by up to 30% without impacting performance.
Time Series Augmented Generation (TSAG) is an evaluation framework and benchmark for assessing LLM agents' quantitative reasoning in financial time-series analysis, released with public code.