ARXIV:2605.11232 · LLM SERVING · SUBMITTED 13 MAY · 20:49 UTC · FRESHNESS STALE

Source: PDF linked (Verified) · PaperPack: citation fields available (Verified) · Proof: unverified proof status (Partial)

Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack

Prathamesh Vasudeo Naik · Naresh Dintakurthi · Yue Wang · arXiv

A workload-aware LLMOps stack for fraud and AML compliance, significantly improving throughput and reducing latency for self-hosted open-weight models.

Ship in 2-4 weeks · Score 9.0 · Evidence unverified

Opportunity summary

Pain A workload-aware LLMOps stack for fraud and AML compliance, significantly improving throughput and reducing latency for self-hosted open-weight models.

Evidence 0 refs | 3 sources | 50% coverage

Blocker Evidence unverified


PROBLEM

A workload-aware LLMOps stack for fraud and AML compliance, significantly improving throughput and reducing latency for self-hosted open-weight models. Compliance prompts are often prefix-heavy, schema-constrained, and evidence-rich, combining reusable policy instructions, risk taxonomies, transaction…
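As a rough illustration of what "prefix-heavy" means here, the sketch below places reusable policy text ahead of per-request evidence so that consecutive prompts share a long common prefix. The policy wording, the field names, and the helpers `build_prompt` and `shared_prefix_chars` are hypothetical and not taken from the paper. Prefix caching in vLLM-style runtimes generally operates on tokens rather than characters, so the character count is only a proxy.

```python
import json

# Hypothetical shared prefix: policy text, taxonomy, and output schema
# that stay identical across requests (not taken from the paper).
POLICY_PREFIX = (
    "You are an AML analyst assistant.\n"
    "Policy: flag structuring, rapid movement of funds, and high-risk corridors.\n"
    "Risk taxonomy: LOW, MEDIUM, HIGH.\n"
    'Answer with JSON: {"risk": <label>, "factors": [<string>, ...]}\n'
)


def build_prompt(transaction: dict) -> str:
    """Put the reusable prefix first and the per-request evidence last."""
    evidence = json.dumps(transaction, sort_keys=True)
    return f"{POLICY_PREFIX}Transaction evidence: {evidence}\n"


def shared_prefix_chars(a: str, b: str) -> int:
    """Count leading characters that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


p1 = build_prompt({"id": "t1", "amount": 9500, "currency": "USD"})
p2 = build_prompt({"id": "t2", "amount": 120, "currency": "EUR"})
common = shared_prefix_chars(p1, p2)
print(f"shared prefix: {common} of {len(p1)} characters")
```

Keeping the variable evidence at the end is the design choice that matters: anything that changes per request and appears early in the prompt would shorten the reusable prefix.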

METHOD

Full abstract

Fraud detection and anti-money-laundering (AML) compliance are high-value domains for large language models (LLMs), but their serving requirements differ sharply from generic chat workloads. Compliance prompts are often prefix-heavy, schema-constrained, and evidence-rich, combining reusable policy instructions, risk taxonomies, transaction or document context, and short structured outputs such as JSON labels or risk factors. These properties make prefix reuse, KV-cache efficiency, runtime tuning, model orchestration, and output validation first-order systems concerns. This paper introduces a workload-aware LLMOps stack for fraud and AML workloads using self-hosted open-weight models such as Meta Llama and Alibaba Qwen. The stack combines vLLM-style runtime tuning, PagedAttention, Automatic Prefix Caching, multi-adapter serving, adapter and prompt-length-aware batching, sleep/wake lifecycle management, speculative decoding, and optional prefill/decode disaggregation. To avoid exposing institution-specific data, the reproducibility track converts public synthetic AML datasets, including IBM AML and SAML-D, into prefix-heavy compliance prompts with reusable policy text, transaction evidence, typology definitions, and schema-constrained outputs. We also incorporate an LLM-as-judge quality gate using deterministic compliance checks, reference metrics, expert-adjudicated calibration data where available, and multi-judge rubric scoring. Across public-synthetic AML workloads and controlled serving benchmarks, workload-aware tuning improved throughput from 612-650 to 3,600 requests/hour, reduced P99 latency from 31-38 seconds to 6.4-8.7 seconds, and increased GPU utilization from 12% to 78%. These results show that regulated LLM performance is a workload-design, serving-optimization, and quality-gating problem, not only a model-selection problem.
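One simple reading of "adapter and prompt-length-aware batching" is to group pending requests by adapter and by a coarse prompt-length bucket before forming batches. The sketch below shows that reading only. The `Request` type, the bucket edges, and `make_batches` are hypothetical and are not the paper's scheduler or a vLLM API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    request_id: str
    adapter: str          # hypothetical adapter name, e.g. "fraud" or "aml"
    prompt_tokens: int    # prompt length in tokens


def bucket_of(prompt_tokens: int, edges=(512, 2048, 8192)) -> int:
    """Map a prompt length to a coarse bucket index (edges are assumptions)."""
    for i, edge in enumerate(edges):
        if prompt_tokens <= edge:
            return i
    return len(edges)


def make_batches(requests, max_batch_size: int = 8):
    """Group by (adapter, length bucket), then split groups into batches."""
    groups = defaultdict(list)
    for r in requests:
        groups[(r.adapter, bucket_of(r.prompt_tokens))].append(r)
    batches = []
    for key in sorted(groups):
        members = groups[key]
        for i in range(0, len(members), max_batch_size):
            batches.append(members[i:i + max_batch_size])
    return batches


pending = [
    Request("a", "fraud", 300),
    Request("b", "aml", 300),
    Request("c", "fraud", 4000),
    Request("d", "fraud", 100),
]
for batch in make_batches(pending):
    print([r.request_id for r in batch])
```

Grouping this way keeps each batch on a single adapter and avoids mixing very short and very long prompts. A production scheduler would also need to bound queueing delay, which this sketch ignores.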

RESULT

ScienceToStartup currently rates this 9.0/10 on the public viability pass. These results show that regulated LLM performance is a workload-design, serving-optimization, and quality-gating problem, not only a model-selection problem. Code availability is flagged in…
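The figures reported in the abstract can be turned into improvement ratios with a few lines of arithmetic. The numbers below are copied from the abstract. How the range endpoints pair up is an assumption, so the results are best read as conservative and optimistic bounds rather than measured speedups.

```python
# Figures as reported in the abstract.
baseline_rph = (612, 650)       # requests/hour before tuning
tuned_rph = 3600                # requests/hour after tuning
baseline_p99_s = (31.0, 38.0)   # P99 latency before tuning, seconds
tuned_p99_s = (6.4, 8.7)        # P99 latency after tuning, seconds
gpu_util = (0.12, 0.78)         # GPU utilization before and after

# Conservative bound uses the best baseline; optimistic uses the worst.
throughput_gain = (tuned_rph / max(baseline_rph), tuned_rph / min(baseline_rph))

# Endpoint pairing is an assumption: best-before vs worst-after, and the reverse.
p99_gain = (
    min(baseline_p99_s) / max(tuned_p99_s),
    max(baseline_p99_s) / min(tuned_p99_s),
)

util_gain = gpu_util[1] / gpu_util[0]
seconds_per_request = 3600 / tuned_rph  # 3,600 requests/hour is one per second

print(f"throughput gain: {throughput_gain[0]:.2f}x to {throughput_gain[1]:.2f}x")
print(f"P99 latency reduction: {p99_gain[0]:.2f}x to {p99_gain[1]:.2f}x")
print(f"GPU utilization gain: {util_gain:.1f}x")
print(f"average seconds per request after tuning: {seconds_per_request:.1f}")
```

Under these assumptions the throughput gain is roughly 5.5x to 5.9x, the P99 latency reduction roughly 3.6x to 5.9x, and the GPU utilization gain about 6.5x.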

WHY NOW

LLM Serving moved forward this cycle; last verified May 2026. Public score 9.0/10. Production flags indicate code availability.




Competitive landscape


Segment

LLM Serving

Adoption evidence

No public code link in the paper record yet

Commercial read

9.0/10 public viability

Direct

not classified

Adjacent

not classified

Substitute

not classified

Unknown

not classified


