LLM Agents

Proof pending

177 papers

5.9 viability

+36% 30d

Proof pending. Core topic summary fields are still materializing.

State of the Field

Large language model (LLM) agents are being extended to bug detection, economic modeling, and workflow optimization. AnyPoC automates the generation of proof-of-concept tests for bug validation, improving the reliability of automated bug detection. Market-Bench evaluates LLMs on economic tasks and reveals performance disparities among agents in competitive environments. SAGE and CLEAR focus on optimizing modeling strategies and context generation, respectively, to improve LLM performance on complex tasks. For builders, these developments support more efficient automation in software development and research.

Last updated May 26, 2026

Topic trend

Topic-specific paper and score movement from the daily diff ledger.

Papers

1-10 of 50

Research Paper·Apr 13, 2026

AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

While recent LLM-based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detecti...

9.0 viability

Research Paper·Apr 6, 2026

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentia...

8.0 viability

Research Paper·Apr 1, 2026

The Silicon Mirror: Dynamic Behavioral Gating for Anti-Sycophancy in LLM Agents

Large Language Models (LLMs) increasingly prioritize user validation over epistemic accuracy, a phenomenon known as sycophancy. We present The Silicon Mirror, an orchestration framework that dynamicall...

8.0 viability · Has code

Research Paper·Apr 7, 2026

Market-Bench: Benchmarking Large Language Models on Economic and Trade Competition

The ability of large language models (LLMs) to manage and acquire economic resources remains unclear. In this paper, we introduce Market-Bench, a comprehensive benchmark that evaluates the ca...

8.0 viability

Research Paper·May 4, 2026

Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?

Modern coding agents increasingly delegate specialized subtasks to subagents, which are smaller, focused agentic loops that handle narrow responsibilities like search, debugging or terminal execution....

8.0 viability

Research Paper·Apr 8, 2026

CLEAR: Context Augmentation from Contrastive Learning of Experience via Agentic Reflection

Large language model agents rely on effective model context to obtain task-relevant information for decision-making. Many existing context engineering approaches primarily rely on the context generate...

8.0 viability · Has code

Research Paper·May 28, 2026

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deter...

8.0 viability · Has code

Research Paper·May 12, 2026

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration c...

8.0 viability

Research Paper·Apr 21, 2026

Time Series Augmented Generation for Financial Applications

Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent...

8.0 viability

Research Paper·Mar 22, 2026

LLM-Powered Workflow Optimization for Multidisciplinary Software Development: An Automotive Industry Case Study

Multidisciplinary Software Development (MSD) requires domain experts and developers to collaborate across incompatible formalisms and separate artifact sets. Today, even with AI coding assistants like...

8.0 viability
