ARXIV:2603.19173 · GPU KERNEL OPTIMIZATION · SUBMITTED 02 APR · 02:30 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Edward Lin · Sahil Modi · Siva Kumar Sastry Hari · Qijing Huang · Zhifan Ye · Nestor Qin · +27 at arXiv

A new benchmark and evaluation framework for GPU kernel optimization that measures performance against hardware limits, enabling faster development of efficient AI models.

Ship in 2-4 weeks · Score 7.0 · Evidence unverified

Opportunity summary

Pain: A new benchmark and evaluation framework for GPU kernel optimization that measures performance against hardware limits, enabling faster development of efficient AI models.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: Evidence unverified


PROBLEM

A new benchmark and evaluation framework for GPU kernel optimization that measures performance against hardware limits, enabling faster development of efficient AI models. We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems…

METHOD

Full abstract

As agentic AI systems become increasingly capable of generating and optimizing GPU kernels, progress is constrained by benchmarks that reward speedup over software baselines rather than proximity to hardware-efficient execution. We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs. The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities. Unlike prior benchmarks that evaluate kernels primarily relative to software implementations, SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization. We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes. To support robust evaluation of agentic optimizers, we additionally provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis based checks against common reward-hacking strategies. SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.
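The scoring idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract only says that the SOL Score measures how much of the gap between a scoring baseline and the SOL bound a candidate closes, so the linear form in `sol_score` is an assumption. Likewise, `sol_time_s` swaps in a simple roofline-style bound (the slower of compute time and memory-traffic time) as a stand-in for SOLAR, whose actual derivation is not described here. `HardwarePeaks`, both function names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwarePeaks:
    """Hypothetical peak rates; placeholder values, not vendor specifications."""
    flops_per_s: float  # peak math throughput for the chosen precision
    bytes_per_s: float  # peak memory bandwidth


def sol_time_s(flops: float, bytes_moved: float, hw: HardwarePeaks) -> float:
    """Roofline-style lower bound on runtime (stand-in for SOLAR, an assumption).

    A kernel can finish no sooner than the slower of its compute time
    and its memory-traffic time.
    """
    return max(flops / hw.flops_per_s, bytes_moved / hw.bytes_per_s)


def sol_score(t_candidate: float, t_baseline: float, t_sol: float) -> float:
    """Fraction of the baseline-to-SOL gap closed by the candidate.

    Assumed linear form: 0.0 means no better than the baseline,
    1.0 means the candidate runs at the SOL bound.
    """
    if not (0.0 < t_sol < t_baseline):
        raise ValueError("expected 0 < t_sol < t_baseline")
    return (t_baseline - t_candidate) / (t_baseline - t_sol)


# Illustrative numbers only.
hw = HardwarePeaks(flops_per_s=1.0e15, bytes_per_s=4.0e12)
t_sol = sol_time_s(flops=2.0e12, bytes_moved=4.0e9, hw=hw)  # compute-bound: 2 ms
score = sol_score(t_candidate=6.0e-3, t_baseline=10.0e-3, t_sol=t_sol)
print(f"SOL time: {t_sol * 1e3:.2f} ms, SOL score: {score:.2f}")
```

Because the target is a hardware bound rather than another software implementation, the denominator stays fixed for a given baseline release: a faster candidate raises the score, but a faster competing library does not move the goal.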

RESULT

ScienceToStartup currently rates this 7.0/10 on the public viability pass. To support robust evaluation of agentic optimizers, we additionally provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and…

WHY NOW

GPU Kernel Optimization moved forward this cycle; last verified April 2026. Public score 7.0/10. Production flags indicate code availability.


Opportunity summary

Score: 7.0

Pain: A new benchmark and evaluation framework for GPU kernel optimization that measures performance against hardware limits, enabling faster development of efficient AI models.

Evidence: 0 refs | 0 sources | 17% coverage

Blocker: no shell-level blocker reported


Competitive landscape

Segment: GPU Kernel Optimization

Adoption evidence: No public code link in the paper record yet

Commercial read: 7.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified


Paper record

arXiv: 2603.19173 (https://arxiv.org/abs/2603.19173)

Published: 2026-03-19T17:30:02Z

Authors: Edward Lin, Sahil Modi, Siva Kumar Sastry Hari, Qijing Huang, Zhifan Ye, Nestor Qin, Fengzhe Zhou, Yuan Zhang, Jingquan Wang, Sana Damani, Dheeraj Peri, Ouye Xie, Aditya Kane, Moshe Maor, Michael Behar, Triston Cao, Rishabh Mehta, Vartika Singh, Vikram Sharma Mailthody, Terry Chen, Zihao Ye, Hanfeng Chen, Tianqi Chen, Vinod Grover, Wei Chen, Wei Liu, Eric Chung, Luis Ceze, Roger Bringmann, Cyril Zeller, Michael Lightstone, Christos Kozyrakis, Humphrey Shi

Research domain: GPU Kernel Optimization
