Evaluation Frameworks

Trending·Proof pending

3 papers

5.7 viability

+100% 30d

Proof pending. This topic has not reached the minimum paper threshold yet.

Topic-linked question coverage is still building for this proof surface.

Papers

1-2 of 2

Research Paper·Feb 5, 2026·B2B·Media & Entertainment

GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematica...

7.0 viability

Research Paper·Feb 3, 2026

Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals

Evaluating mathematical reasoning in LLMs is constrained by limited benchmark sizes and inherent model stochasticity, yielding high-variance accuracy estimates and unstable rankings across platforms. ...

5.0 viability
