ARXIV:2604.04443 · LLM REASONING · SUBMITTED 07 APR · 20:13 UTC · FRESHNESS UNKNOWN

Verified · Source: PDF linked
Verified · PaperPack: citation fields available
Partial · Proof: unverified proof status

DeonticBench: A Benchmark for Reasoning over Rules

Guangyao Dou · Luis Brena · Akhil Deo · William Jurayj · Jingyu Zhang · Nils Holzenberger · Benjamin Van Durme

A new benchmark and dataset to evaluate and improve LLM reasoning over complex, real-world rules, with a focus on legal and policy domains.

Ship in 2-4 weeks · Score 4.0 · Evidence unverified

Opportunity summary

Pain: A new benchmark and dataset to evaluate and improve LLM reasoning over complex, real-world rules, with a focus on legal and policy domains.

Evidence: 0 refs | 0 sources | 0% coverage

Blocker: Evidence unverified

PROBLEM

A new benchmark and dataset to evaluate and improve LLM reasoning over complex, real-world rules, with a focus on legal and policy domains. In legal and policy settings, this manifests as deontic reasoning: reasoning…

METHOD

Full abstract

Reasoning with complex, context-specific rules remains challenging for large language models (LLMs). In legal and policy settings, this manifests as deontic reasoning: reasoning about obligations, permissions, and prohibitions under explicit rules. While many recent benchmarks emphasize short-context mathematical reasoning, fewer focus on long-context, high-stakes deontic reasoning. To address this gap, we introduce DEONTICBENCH, a benchmark of 6,232 tasks across U.S. federal taxes, airline baggage policies, U.S. immigration administration, and U.S. state housing law. These tasks can be approached in multiple ways, including direct reasoning in language or with the aid of symbolic computation. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, leading to formal problem interpretations and an explicit program trace. We release reference Prolog programs for all instances. Across frontier LLMs and coding models, best hard-subset performance reaches only 44.4% on SARA Numeric and 46.6 macro-F1 on Housing. We further study training with supervised fine-tuning and reinforcement learning for symbolic program generation. Although training improves Prolog generation quality, current RL methods still fail to solve these tasks reliably. Overall, DEONTICBENCH provides a benchmark for studying context-grounded rule reasoning in real-world domains under both symbolic and non-symbolic settings.
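The abstract reports Housing results as macro-F1. As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the unweighted mean of per-class F1 scores. The function name `macro_f1` and the label lists are hypothetical and chosen only for illustration; this is not the paper's evaluation script, and the labels are not taken from the benchmark.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed the true label g
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold and predicted labels, for illustration only.
gold = ["yes", "yes", "no", "no", "no", "yes"]
pred = ["yes", "no", "no", "no", "yes", "yes"]
score = macro_f1(gold, pred)
print(f"macro-F1 = {100 * score:.1f}")
```

Because every class contributes equally regardless of how often it occurs, macro-F1 penalizes a model that ignores rare labels, which is one plausible reason to prefer it over accuracy on imbalanced tasks.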

RESULT

ScienceToStartup currently rates this 4.0/10 on the public viability pass. Besides free-form chain-of-thought reasoning, DEONTICBENCH enables an optional solver-based workflow in which models translate statutes and case facts into executable Prolog, leading to formal…
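The solver-based workflow mentioned above turns rules and case facts into an executable program whose run leaves an explicit trace. The benchmark does this in Prolog; the sketch below uses Python purely as a stand-in to show the idea. The baggage policy (rules R1 to R4), the fee amounts, and the names `Trace` and `baggage_fee` are all invented for illustration and do not come from the paper or its dataset.

```python
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Ordered record of which rule fired and what it concluded."""
    steps: list = field(default_factory=list)

    def log(self, rule, outcome):
        self.steps.append((rule, outcome))


# Hypothetical policy, invented for this sketch:
#   R1: a bag over 32 kg is prohibited.
#   R2: the first checked bag is permitted free of charge.
#   R3: each further bag obliges a fee of 40.
#   R4: a bag over 23 kg obliges an overweight fee of 100.
def baggage_fee(bag_weights_kg, trace):
    """Return the total fee, or None if any bag is prohibited."""
    total = 0
    for i, weight in enumerate(bag_weights_kg):
        if weight > 32:
            trace.log("R1", f"bag {i} prohibited ({weight} kg)")
            return None
        if i == 0:
            trace.log("R2", "bag 0 free")
        else:
            total += 40
            trace.log("R3", f"bag {i} fee 40")
        if weight > 23:
            total += 100
            trace.log("R4", f"bag {i} overweight fee 100")
    return total


# Case facts: two bags, the second one overweight.
trace = Trace()
fee = baggage_fee([20, 25], trace)
print("fee:", fee)
for rule, outcome in trace.steps:
    print(rule, "->", outcome)
```

The point of the trace is auditability: a wrong answer can be traced back to either a mistranslated rule or a misread fact, which free-form chain-of-thought does not guarantee.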

WHY NOW

LLM Reasoning moved forward this cycle; last verified April 2026. Public score 4.0/10. Production flags indicate code availability.

Analysis summary

A new benchmark and dataset to evaluate and improve LLM reasoning over complex, real-world rules, with a focus on legal and policy domains.

Competitive landscape

A new benchmark and dataset to evaluate and improve LLM reasoning over complex, real-world rules, with a focus on legal and policy domains.

Segment: LLM Reasoning

Adoption evidence: No public code link in the paper record yet

Commercial read: 4.0/10 public viability

Direct: not classified

Adjacent: not classified

Substitute: not classified

Unknown: not classified

{ "@context": "https://schema.org", "@graph": [ { "@type": "WebPage", "@id": "https://sciencetostartup.com/paper/deonticbench-a-benchmark-for-reasoning-over-rules#webpage", "url": "https://sciencetostartup.com/paper/deonticbench-a-benchmark-for-reasoning-over-rules", "name": "DeonticBench: A Benchmark for Reasoning over Rules", "description": "A new benchmark and dataset to evaluate and improve LLM reasoning over complex, real-world rules, with a focus on legal and policy domains.", "isPartOf": { "@id": "https://sciencetostartup.com/#website" } }, { "@type": "ScholarlyArticle", "@id": "https://sciencetostartup.com/paper/deonticbench-a-benchmark-for-reasoning-over-rules#scholarlyArticle", "headline": "DeonticBench: A Benchmark for Reasoning over Rules", "description": "A new benchmark and dataset to evaluate and improve LLM reasoning over complex, real-world rules, with a focus on legal and policy domains.", "url": "https://sciencetostartup.com/paper/deonticbench-a-benchmark-for-reasoning-over-rules", "sameAs": "https://arxiv.org/abs/2604.04443", "identifier": { "@type": "PropertyValue", "propertyID": "arXiv", "value": "2604.04443" }, "isAccessibleForFree": true, "isPartOf": { "@id": "https://sciencetostartup.com/#website" }, "datePublished": "2026-04-06T05:41:02.000Z", "author": [ { "@type": "Person", "name": "Guangyao Dou" }, { "@type": "Person", "name": "Luis Brena" }, { "@type": "Person", "name": "Akhil Deo" }, { "@type": "Person", "name": "William Jurayj" }, { "@type": "Person", "name": "Jingyu Zhang" }, { "@type": "Person", "name": "Nils Holzenberger" }, { "@type": "Person", "name": "Benjamin Van Durme" } ], "additionalProperty": [ { "@type": "PropertyValue", "propertyID": "viabilityScore", "value": 4 }, { "@type": "PropertyValue", "propertyID": "researchDomain", "value": "LLM Reasoning" }, { "@type": "PropertyValue", "propertyID": "commercialReadiness", "value": "code" } ] }, { "@type": "BreadcrumbList", "itemListElement": [ { "@type": "ListItem", "position": 
1, "name": "Home", "item": "https://sciencetostartup.com" }, { "@type": "ListItem", "position": 2, "name": "LLM Reasoning", "item": "https://sciencetostartup.com/topics" }, { "@type": "ListItem", "position": 3, "name": "DeonticBench: A Benchmark for Reasoning over Rules", "item": "https://sciencetostartup.com/paper/deonticbench-a-benchmark-for-reasoning-over-rules" } ] } ] }
