The document workflow grew up in one part of the bank. The data platform grew up in another. The agents that need to defend their decisions are about to discover they were never connected. That is the architectural seam this post is about.
The prior post in this series argued that defensibility is a property of the ingest pipeline, not the agent — that identity, consent, version, and signature have to be emitted as first-class semantic events at the moment a document enters the institution, or no downstream model can recover them under examination. The question this post takes up is the practical one. Where does that pipeline live, on what platform, and how does it get instantiated in a large bank's estate without a green-field rebuild?
The Estate Most Banks Actually Have
Walk into a typical Tier 1 or Tier 2 bank and the data estate has a recognizable shape. There is a strategic platform that holds analytics, risk, customer data, transaction history, and increasingly the feature and model assets behind the bank's machine learning portfolio. For a meaningful share of the industry that platform is Cloudera, which has earned its position over years of carrying the most sensitive regulated workloads in the institution. The platform has advanced considerably from the early Hadoop chapter that some readers may still associate with the name — streaming ingest through Cloudera DataFlow, the open Iceberg-based lakehouse, unified governance through SDX, and on-premises AI inference are first-class capabilities now, not aspirations. For institutions that landed on a different platform, the same architectural argument applies; the named components change, the seams do not.
Sitting alongside that strategic platform is the document workflow. Capture lives on a point solution with its own embedded database. E-signature lives on a vendor SaaS with its own audit log. Case management lives on another system with its own data store. Each of these works, in isolation, for the workflow it was bought to run. None of them emits into the strategic data platform as a matter of architecture. The document layer and the data layer are in the same building and on the same network, and they might as well be in different institutions.
This is the anti-pattern worth naming. The mistake is not in any of the point solutions individually. The mistake is architecting the document layer as if the rest of the data estate did not exist — accumulating capture, signing, and case management as independent islands, each with its own store, and then asking agents to reason across the gaps when the agentic roadmap arrives. The provenance the agent needs is scattered across systems that were never asked to emit it. The lineage that examiners will ask for stops at the boundary of each point solution. The governance plane that the bank invested in for its analytics estate does not extend to the documents the agents have to defend.
The Architectural Move
The architectural move is to treat the document workflow as a producer into the strategic data platform, not as a sibling system that occasionally exchanges files with it. The point solutions for capture, identity verification, and signing do not have to be replaced — most of them do useful work and have real audit properties of their own. What changes is what they emit and where those emissions land.
A streaming ingest plane sits between the document point solutions and the platform. Each event the workflow produces — identity verified, disclosure version presented, consent granted, signature applied, packet finalized — is emitted into the ingest plane as a structured signal, not as a file drop or a nightly batch. In a Cloudera environment this is what Cloudera DataFlow, powered by Apache NiFi, was built to do, with the additional property that the flow definitions themselves are governed assets the bank can reason about and version. The emission is the contract; the point solution remains the system of record for its own workflow, but the bank now owns the semantic event stream that flows from it.
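To make "the emission is the contract" concrete, here is a minimal sketch of what one structured event might look like before it is handed to the ingest plane. The field names, the `EventType` values, and the `DocumentEvent` class are illustrative assumptions, not a Cloudera DataFlow or NiFi schema; a real contract would be negotiated with each point-solution vendor.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EventType(str, Enum):
    # Hypothetical names for the five workflow events listed above.
    IDENTITY_VERIFIED = "identity_verified"
    DISCLOSURE_PRESENTED = "disclosure_presented"
    CONSENT_GRANTED = "consent_granted"
    SIGNATURE_APPLIED = "signature_applied"
    PACKET_FINALIZED = "packet_finalized"


def _utc_now_iso() -> str:
    # ISO-8601 timestamp in UTC, e.g. 2025-03-01T10:00:00+00:00
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class DocumentEvent:
    """One semantic event emitted by a point solution (assumed shape)."""
    document_id: str
    event_type: EventType
    source_system: str   # which point solution emitted the event
    subject_id: str      # the identity the event is about
    attributes: dict = field(default_factory=dict)
    occurred_at: str = field(default_factory=_utc_now_iso)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    def to_json(self) -> str:
        # Serialize to a stable, sorted JSON document for the ingest plane.
        payload = asdict(self)
        payload["event_type"] = self.event_type.value
        return json.dumps(payload, sort_keys=True)


consent = DocumentEvent(
    document_id="doc-001",
    event_type=EventType.CONSENT_GRANTED,
    source_system="signing-vendor",
    subject_id="cust-42",
    attributes={"disclosure_version": "4.2"},
)
print(consent.to_json())
```

The design choice worth noting is that the event carries its own identity, timestamp, and source system, so nothing downstream has to infer them from a file name or a batch window.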
Those events land in an open table format that supports time travel and schema evolution as first-class properties. In the Cloudera context that format is Apache Iceberg, with Cloudera's Iceberg REST Catalog providing the interoperability layer across engines. This matters because an examiner who asks "what did this agent see, eighteen months ago, when it made this decision?" needs an answer that does not depend on whether a downstream system has since been migrated, decommissioned, or schema-changed. Time-travel queries against an open table format provide that answer directly, provided the bank's snapshot retention policy is configured to match its audit window. That is a configuration decision worth making deliberately rather than by default. The record the agent acted on is reproducible at the moment of examination because the table format preserves it, not because the underlying point solution happened to retain it.
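The dependence of time travel on snapshot retention can be illustrated with a toy model. The sketch below is not Iceberg and uses no Iceberg API; `SnapshotLog`, `as_of`, and `expire_older_than` are hypothetical names that only mimic the semantics: an as-of read returns the latest snapshot committed at or before the requested time, and it fails once that snapshot has been expired.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Snapshot:
    committed_at: datetime
    rows: tuple


class SnapshotLog:
    """Toy stand-in for a table's snapshot history (assumed model)."""

    def __init__(self):
        self._snapshots = []  # kept in commit-time order

    def commit(self, committed_at, rows):
        if self._snapshots and committed_at <= self._snapshots[-1].committed_at:
            raise ValueError("snapshots must be committed in time order")
        self._snapshots.append(Snapshot(committed_at, tuple(rows)))

    def as_of(self, ts):
        # Latest snapshot committed at or before ts.
        times = [s.committed_at for s in self._snapshots]
        idx = bisect_right(times, ts)
        if idx == 0:
            raise LookupError("no retained snapshot at or before requested time")
        return self._snapshots[idx - 1].rows

    def expire_older_than(self, cutoff):
        # Retention: drop every snapshot committed before the cutoff.
        self._snapshots = [s for s in self._snapshots if s.committed_at >= cutoff]


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2025, 1, 1, tzinfo=timezone.utc)

log = SnapshotLog()
log.commit(t0, [{"consent": "pending"}])
log.commit(t1, [{"consent": "granted"}])

# What the agent would have seen one month after the first commit.
print(log.as_of(t0 + timedelta(days=30)))

# If retention is shorter than the audit window, the same read now fails.
log.expire_older_than(t1)
try:
    log.as_of(t0 + timedelta(days=30))
except LookupError as exc:
    print("cannot reproduce:", exc)
```

The second read is the failure mode to plan around: the history is only as deep as the retention setting allows.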
Governance and lineage are not a separate project bolted on later. They are the property that makes the architecture defensible, and they have to be enforced at the data platform layer where every document event lands. In a Cloudera environment this is the Shared Data Experience, or SDX, with Apache Atlas providing the lineage and classification graph and Apache Ranger enforcing fine-grained access policy. The lineage traces every event from emission through transformation to consumption; the policy plane controls who can see what at the column and row level. An agent that asks "was this consent granted, by the verified identity, against the version that was actually presented?" hits a governed surface, not a raw stream. The audit trail is a query result, not a forensic reconstruction.
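As a rough illustration of what "who can see what at the column and row level" means for a reader of the event tables, the sketch below applies a row filter and a column mask in plain Python. It is not the Ranger policy model or API; `AccessPolicy`, `governed_read`, and the column names are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

MASK = "***"


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical policy: which columns are readable, which rows are visible."""
    readable_columns: FrozenSet[str]
    row_predicate: Callable[[dict], bool]


def governed_read(rows, policy):
    # Drop rows the caller may not see, then mask columns the caller may not read.
    visible = []
    for row in rows:
        if not policy.row_predicate(row):
            continue
        visible.append({
            col: (val if col in policy.readable_columns else MASK)
            for col, val in row.items()
        })
    return visible


rows = [
    {"document_id": "doc-001", "subject_id": "cust-42", "region": "EU", "consent": True},
    {"document_id": "doc-002", "subject_id": "cust-77", "region": "US", "consent": False},
]

eu_agent_policy = AccessPolicy(
    readable_columns=frozenset({"document_id", "region", "consent"}),
    row_predicate=lambda row: row["region"] == "EU",
)

print(governed_read(rows, eu_agent_policy))
```

In a real deployment the policy would be evaluated by the platform's policy engine on every read, so the agent never holds an unfiltered copy of the data.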
The agent itself runs on the inference layer the bank has already standardized on for its other AI workloads. In this case that layer is Cloudera AI Inference, which as of early 2026 runs on-premises as well as in cloud environments, a material point for regulated institutions that cannot move sensitive document context to a public endpoint. The point is broader: the agent should sit on the same platform that holds the events it reasons over, not on a separate inference service that has to fetch and cache them across a network boundary. Data gravity and inference gravity belong in the same estate. This is primarily a defensibility argument, not a performance argument. The shorter the trip from the event stream to the model's context, the fewer the places the chain of custody can break.
The KYC Refresh Worked Example
The KYC refresh scenario from part two carries through this architecture cleanly. The customer receives a refresh request from the case management point solution. The disclosure is presented through the signing vendor, which records the version and the consent. The identity verification runs through the capture solution, which records the method and the result. The signature is applied through the signing vendor, which timestamps it. Today, in most banks, each of those events lives in the system that generated it, and the case record in the case management database is the only place the workflow is summarized — without the supporting events, without the lineage, and without the versioned disclosure the customer actually saw.
In the architecture this post describes, each of those events is emitted into the streaming ingest plane as it happens. Cloudera DataFlow handles the emission contracts with the point solutions, so the capture vendor, the signing vendor, and the case management system each have a defined integration that produces governed events. The events land in Iceberg tables partitioned by document identity and time-stamped. Atlas tracks the lineage from emission through any downstream transformation; Ranger enforces who can read what. When the KYC refresh agent, running on Cloudera AI Inference, evaluates the case, it queries those governed event tables directly. Was identity verification completed? By what method? Was disclosure version 4.2 the version presented? Was consent granted before the signature was applied? Is the signing identity the same identity that was verified? Each question has an answer with provenance. Each answer is reproducible eighteen months later because the table format preserved the state at the moment the question was asked.
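The five questions above can be sketched as checks over the events for a single document. This is a simplified illustration under stated assumptions: events are plain dictionaries with hypothetical field names (`event_type`, `subject_id`, `occurred_at`, `attributes`), timestamps are ISO-8601 strings with an explicit UTC offset, and only the earliest event of each type is considered.

```python
from datetime import datetime


def _when(event):
    return datetime.fromisoformat(event["occurred_at"])


def _earliest(events, event_type):
    # Earliest event of the given type, or None if there is none.
    matches = [e for e in events if e["event_type"] == event_type]
    return min(matches, key=_when, default=None)


def evaluate_kyc_refresh(events, expected_version):
    identity = _earliest(events, "identity_verified")
    disclosure = _earliest(events, "disclosure_presented")
    consent = _earliest(events, "consent_granted")
    signature = _earliest(events, "signature_applied")

    findings = {
        "identity_verified": (
            identity is not None
            and identity["attributes"].get("result") == "pass"
        ),
        "verification_method": (
            identity["attributes"].get("method") if identity else None
        ),
        "expected_version_presented": (
            disclosure is not None
            and disclosure["attributes"].get("version") == expected_version
        ),
        "consent_before_signature": (
            consent is not None
            and signature is not None
            and _when(consent) < _when(signature)
        ),
        "signer_is_verified_identity": (
            identity is not None
            and signature is not None
            and identity["subject_id"] == signature["subject_id"]
        ),
    }
    # Provenance: which captured event backs each answer.
    provenance = {
        e["event_type"]: e["event_id"]
        for e in (identity, disclosure, consent, signature)
        if e is not None
    }
    return findings, provenance


events = [
    {"event_id": "e1", "event_type": "identity_verified", "subject_id": "cust-42",
     "occurred_at": "2025-03-01T10:00:00+00:00",
     "attributes": {"method": "document_scan", "result": "pass"}},
    {"event_id": "e2", "event_type": "disclosure_presented", "subject_id": "cust-42",
     "occurred_at": "2025-03-01T10:05:00+00:00", "attributes": {"version": "4.2"}},
    {"event_id": "e3", "event_type": "consent_granted", "subject_id": "cust-42",
     "occurred_at": "2025-03-01T10:06:00+00:00", "attributes": {}},
    {"event_id": "e4", "event_type": "signature_applied", "subject_id": "cust-42",
     "occurred_at": "2025-03-01T10:07:00+00:00", "attributes": {}},
]

findings, provenance = evaluate_kyc_refresh(events, expected_version="4.2")
print(findings)
print(provenance)
```

Returning the provenance map alongside the findings is the point of the exercise: each answer names the event that supports it, so the recommendation can be traced rather than merely asserted.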
The agent's recommendation is now defensible in a specific sense: every claim it makes traces to a captured event in a governed store, and the chain from the customer's action to the agent's decision is queryable end to end. The point solutions still do their jobs. The strategic data platform now owns the semantic record of what those point solutions did.
What Has to Be True
Three things have to be true, and they are the conversation the architecture team needs to have with the rest of the bank before scoping the build.
The point-solution vendors have to be willing and able to emit events into a streaming ingest plane. Most modern capture, signing, and case management vendors support this through webhooks, event APIs, or change-data-capture connectors. Some of the older ones do not, or do so poorly, and that constraint may force a vendor conversation the architecture team has been postponing. The honest answer is that a document workflow modernization is also a document vendor conversation, and pretending otherwise wastes time.
The data platform team and the document workflow team have to be the same team for the duration of this build, or at minimum have to share a roadmap with a single accountable owner. The reason the islands exist in the first place is that document operations and data engineering have historically been different functions reporting to different executives. The architecture this post describes does not respect that boundary. Either the boundary moves, or the architecture does not get built.
The bank has to be willing to treat the first workflow as platform investment, not project investment. The first use case — KYC refresh, expense and invoice review, vendor onboarding, whichever the bank picks — will look more expensive than the equivalent project scoped as a point solution. The justification is in the second, third, and fourth use cases that ship faster because the ingest plane, the table format, the lineage graph, and the inference layer are already in place. CFOs who fund only the first project on first-project economics will kill the architecture before it pays back. The pitch internally has to be platform.
The through-line across the three posts is straightforward. Document workflows are the hidden dependency in every agentic AI roadmap because the defensibility of an agent's decision is determined by what its ingest pipeline captured, not by what its model can infer. The pipeline that captures it has to emit identity, consent, version, and signature as semantic events at the source. And the platform to land those events on, for most large banks, is already in the estate — earning its keep on analytics and risk workloads — waiting to be wired into the document layer it was never asked to serve.
The institutions that make this connection in the next twelve to eighteen months will discover that their agentic roadmaps unblock not because their models got better but because their pipelines started telling the truth. The ones that do not will keep buying agentic capability they cannot defend, on top of document workflows that were never built to defend it. The architecture is not the hard part. The decision to scope it correctly, and to staff it across the boundary between data and documents, is.