DE Insights-Practitioner Briefing
Financial Services & KYC/AML

Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack

Curated by David deBoisblanc, Duczer East
3 min read, May 18, 2026

A new paper on LLMOps for fraud and anti-money-laundering workloads reframes compliance inference as a serving problem. Prompts are prefix-heavy. Outputs are short and schema-constrained. Most production stacks are tuned for the wrong workload — generic chat — and pay for it in GPU hours.

The numbers are worth pausing on. Workload-aware tuning (paged KV caching, automatic prefix caching, multi-adapter batching, and sleep/wake lifecycle control) moved throughput from roughly 612–650 requests per hour to 3,600, cut P99 latency from 31–38 seconds to under nine seconds, and raised GPU utilization from 12% to 78%. Same hardware. The capacity plan dropped from about ten GPUs to three or four. The systems result is impressive on its own terms.
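
The throughput figures can be cross-checked with simple arithmetic. The sketch below is only a back-of-the-envelope check: it assumes GPU-hours per request scale inversely with throughput on fixed hardware, which is an assumption of this sketch and not necessarily how the paper computes its own cost figures. The function names are illustrative.

```python
# Figures quoted in the briefing; the inverse-scaling cost model is an assumption.
BASELINE_RPH = (612, 650)   # requests per hour before tuning (reported range)
TUNED_RPH = 3600            # requests per hour after tuning


def speedup(baseline_rph: float, tuned_rph: float) -> float:
    """Throughput multiple on the same hardware."""
    return tuned_rph / baseline_rph


def gpu_hour_reduction(baseline_rph: float, tuned_rph: float) -> float:
    """Fractional drop in GPU-hours per request, assuming one GPU-hour
    serves `rph` requests both before and after tuning."""
    return 1.0 - baseline_rph / tuned_rph


for rph in BASELINE_RPH:
    print(f"baseline {rph}/h: {speedup(rph, TUNED_RPH):.1f}x throughput, "
          f"{gpu_hour_reduction(rph, TUNED_RPH):.0%} fewer GPU-hours per request")
```

Under that assumption the reduction comes out at roughly 82 to 83 percent, which is in line with the 83% reduction in GPU-hours per thousand successful requests that the paper reports.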

The deeper read is what it says about semantic intelligence at the inference tier.

Every reusable prefix in those compliance prompts is a semantic artifact. The policy text. The risk taxonomy. The typology definitions. The JSON schema. The few-shot examples. The paper treats them as cache targets — strings that happen to repeat and therefore deserve KV reuse. That framing is operationally correct and architecturally incomplete. Those strings repeat because someone curated, governed, and versioned the meaning behind them. The serving economics here are not really about cache mechanics. They are about how much of an enterprise's semantics is already stable enough to be reused millions of times without recomputation.
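
To make the reuse argument concrete, here is a toy model of block-level prefix caching. It is a simplified illustration, not the paper's method or any serving runtime's actual implementation: whitespace-separated words stand in for tokens, the block size is arbitrary, and the policy, taxonomy, and schema strings are hypothetical. The idea it shows is that a cached block can be reused only when everything before it is identical.

```python
import hashlib

BLOCK = 4  # tokens per cache block (toy value chosen for this sketch)


def block_hashes(tokens: list[str]) -> list[str]:
    """Hash each full block together with everything before it, so a block
    matches only when the entire preceding prefix is identical."""
    hashes, parent = [], ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        chunk = " ".join(tokens[i:i + BLOCK])
        parent = hashlib.sha256((parent + "|" + chunk).encode()).hexdigest()
        hashes.append(parent)
    return hashes


def reused_blocks(cache: set[str], tokens: list[str]) -> int:
    """Count leading blocks already cached, then add this prompt's blocks."""
    hashes = block_hashes(tokens)
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    cache.update(hashes)
    return hits


# Hypothetical governed artifacts, placed first so they form the shared prefix.
PREFIX = ("policy v3 : flag structuring . taxonomy v7 : low medium high . "
          "schema : risk reason").split()

cache: set[str] = set()
first = reused_blocks(cache, PREFIX + "case 1001 wire transfer 9900 usd".split())
second = reused_blocks(cache, PREFIX + "case 1002 cash deposit 9800 usd".split())
# Same artifacts with one edited token at the very start of the prefix.
edited = reused_blocks(cache, ["Policy"] + PREFIX[1:] + "case 1003 wire 500 usd".split())
print(first, second, edited)
```

In this toy run the second case reuses every block the stable prefix fills, while the prompt with a single edited token at the start reuses none. That is the mechanical reason governed, versioned artifacts translate into cache hits.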

This is where most enterprises will discover an uncomfortable asymmetry. Organizations that have invested in a real semantic layer (canonical entities, governed taxonomies, versioned policy artifacts, schema contracts) will compound those investments at the serving tier. Their reusable prefixes are not improvised in a prompt-engineering session. They are derived from assets that already have owners, change control, and audit trails. The 83% reduction in GPU-hours per thousand successful requests is not a lucky runtime trick. It is the downstream payoff of upstream semantic discipline.

The inverse is also true, and harder to fix. If your policy language lives in three different Confluence pages, your risk taxonomy is whatever the last analyst wrote, and your output schemas drift per team, then your prefixes will drift too. Cache hit rates fall. Adapters multiply. The release gate the paper describes — deterministic schema checks, reference metrics, multi-judge rubric scoring — starts catching regressions you cannot explain because the "same" prompt is no longer the same prompt. You will buy more GPUs to compensate for missing governance.
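
A release gate of the kind described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's framework: the output contract (`risk_level`, `reason`), the allowed labels, the score scale, and the thresholds are all hypothetical, and the reference-metric stage is omitted for brevity.

```python
import json

# Hypothetical output contract; field names and labels are illustrative.
REQUIRED = {"risk_level": str, "reason": str}
ALLOWED_RISK = {"low", "medium", "high"}


def schema_ok(raw: str) -> bool:
    """Deterministic check: valid JSON, exact keys, expected types, known label."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != set(REQUIRED):
        return False
    if not all(isinstance(data[k], t) for k, t in REQUIRED.items()):
        return False
    return data["risk_level"] in ALLOWED_RISK


def judges_agree(scores: list[int], max_spread: int = 1) -> bool:
    """Treat rubric scores as agreeing when they differ by at most max_spread."""
    return max(scores) - min(scores) <= max_spread


def gate(raw: str, scores: list[int], min_score: int = 4) -> str:
    """Run the cheap deterministic check first, then the judge-based checks."""
    if not schema_ok(raw):
        return "fail: schema"
    if not judges_agree(scores):
        return "review: judges disagree"
    return "pass" if min(scores) >= min_score else "fail: rubric"


good = '{"risk_level": "high", "reason": "structuring below threshold"}'
print(gate(good, [5, 4]))
print(gate('{"risk": "high"}', [5, 5]))
print(gate(good, [5, 2]))
```

Running the schema check before any judge scoring keeps failures explainable: a schema failure points at contract drift, while judge disagreement on a schema-valid output points at ambiguity in the prompt or rubric.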

There is a second implication worth flagging. The multi-adapter results — a 3.9x throughput gain from grouping by adapter identity and prompt length — point toward a serving topology where one base model fronts many task-specific behaviors. This only works if the tasks themselves are cleanly decomposed: classification, extraction, translation, narrative, escalation. That decomposition is a semantic design exercise before it is an MLOps exercise. Teams that have done the modeling work get adapter reuse; teams that have not get a thicket of overlapping prompts and full model loads.
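
Grouping by adapter identity and prompt length can be pictured with a small batching sketch. The scheduling policy here is a simplified illustration, not the paper's scheduler or any runtime's: the `Request` type, the adapter names, the 512-token bucket width, and the batch size are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    request_id: str
    adapter: str        # task-specific adapter name (hypothetical)
    prompt_tokens: int


def length_bucket(n_tokens: int, width: int = 512) -> int:
    """Bucket index so prompts of similar length share a batch."""
    return n_tokens // width


def group_requests(requests: list[Request], max_batch: int = 8) -> list[list[Request]]:
    """Group by (adapter, length bucket), then split each group into batches."""
    groups: dict[tuple[str, int], list[Request]] = defaultdict(list)
    for req in requests:
        groups[(req.adapter, length_bucket(req.prompt_tokens))].append(req)
    batches = []
    for key in sorted(groups):
        members = groups[key]
        for i in range(0, len(members), max_batch):
            batches.append(members[i:i + max_batch])
    return batches


queue = [
    Request("r1", "classification", 1800),
    Request("r2", "extraction", 1750),
    Request("r3", "classification", 1900),
    Request("r4", "narrative", 300),
    Request("r5", "extraction", 1600),
]
for batch in group_requests(queue):
    print(batch[0].adapter, [r.request_id for r in batch])
```

The grouping key only helps when each request maps to exactly one adapter, which is the decomposition point made above: overlapping task definitions produce ambiguous keys and small, inefficient batches.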

For architects, CIOs, and CTOs evaluating self-hosted LLM serving for regulated workloads, the practical question shifts. It is not only which runtime, which open-weight model, which GPU class. It is whether your policy, taxonomy, and schema assets are stable enough to function as cacheable prefixes and adapter boundaries. If they are, the economics of self-hosting on prefix-heavy workloads now beat the API alternative on more than just data locality. If they are not, no amount of runtime tuning will close the gap, and the release-gate framework the paper proposes will quietly tell you so, one judge disagreement at a time.

Serving is becoming a semantic discipline. The teams that recognize this early will run regulated inference at a structural cost advantage. The teams that do not will keep paying for recomputation of meaning they should have curated once.

Curated Article
Rethinking LLMOps for Fraud and AML: Building a Compliance-Grade LLM Serving Stack
Cornell University