The argument about AI in regulated work tends to collapse into two camps. One says the models are good enough now, or will be soon, and the job is to point them at the regulation and let them work. The other says a probabilistic system has no place anywhere a wrong answer carries a penalty, and the whole idea should wait. The two camps talk past each other because they share a hidden assumption: that using a language model is a single decision, all in or all out.

It is not a single decision. A language model in a compliance system is asked to do at least three different jobs, and they are not equally suited to it. It is asked to read, turning a messy document into something structured. It is asked to decide, turning facts into a verdict. And it is asked to explain, turning a verdict into language a person can act on. The model is genuinely excellent at the first and third. It is dangerous at the second. The entire design problem is keeping those three jobs separate, and letting the model do the two it is good at while something else does the one it is not.

This paper is about that division of labor: where the model belongs, where it does not, and what holds the line between them.

The three jobs, pulled apart

Start by refusing the question "should we use AI for compliance," because it is the wrong question. The right question is "which part of compliance is the model actually doing." Once you pull the work apart, the answer stops being a matter of opinion and becomes a matter of fit.

Reading is the first job. A regulatory check needs structured input: a holding expressed as a percentage, a date in a field, a party resolved to a known entity, a clause identified by type. What arrives in the real world is none of that. It is a PDF, an email, a contract written by someone else's lawyer, a payment message in a format from the 1970s. Something has to turn the mess into structure before any rule can run. This is reading, and a language model is extraordinarily good at it. Pulling a date out of a sentence, recognizing that "Napa" is a brand of paracetamol, understanding that a paragraph is a limitation-of-liability clause: these are tasks models handle exceptionally well.

Deciding is the second job, and it is a different kind of act entirely. Given the structured facts, what is the verdict? Did the holding cross the threshold? Was the filing late? Is the transfer lawful? This is not reading. It is the application of a rule to facts, and its answer has to be exact, repeatable, and defensible. The model is bad at this, not because it is weak but because it is the wrong tool. A model produces a likely answer. Deciding requires a correct one.

Explaining is the third job. Once the verdict exists, a person has to understand it. Not "DP-BREACH-EU-001 fired" but "the breach was reported eighty hours after you became aware of it, and the regulation gives you seventy-two." Turning a structured verdict into clear, human language is, again, exactly what a model is good at, as long as it is explaining a decision that was already made rather than making one.

Two of these three jobs play to the model's strength. The middle one is where every serious failure lives.

The model reads

Reading is where a compliance system meets reality, and reality does not arrive in fields. It arrives as documents that humans wrote for other humans, full of synonyms, abbreviations, formatting accidents, and local convention. Until recently this layer barely existed in software; you hired people to read documents and key the contents into a system. The model collapses that layer. It can take a scanned bill of lading, a free-text allergy note, a forty-page master agreement, and produce the structured shape the rules need.

This is real, and it is most of the practical value. The reason compliance automation was historically so expensive was not that the rules were hard to run. It was that getting clean inputs to run them against took an army. The model is the thing that finally makes the inputs cheap.

But notice what reading is and what it is not. Reading produces structured facts. It does not produce a verdict. When the model reads "ownership: 6.2%" out of a filing, it has not decided that a disclosure obligation was triggered. It has only made the number available to a rule that will decide. The boundary is clean and it matters: the model's reading is an input to the decision, never the decision itself. If the model reads the number wrong, a human or a cross-check can catch a wrong number. What you must never do is let the model skip from reading the number to declaring the obligation met, because then the error is buried inside a verdict instead of sitting in plain view as a field.

So the discipline even on the job the model is best at is to keep its output as data, not judgment. Read the document into fields. Stop there. Hand the fields to something else.
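
A minimal sketch of what "read into fields and stop" can look like in code. The ExtractedField shape, its source_span and confidence attributes, and the to_facts helper are assumptions made for illustration, not part of any existing system. The point is only that the reader's output is a flat set of facts with provenance, and that a conflict between two readings is raised rather than resolved silently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedField:
    """One fact read from a document: data only, never a verdict."""
    path: str           # where a rule will look it up
    value: object       # the value as read
    source_span: str    # the text the value came from, kept for cross-checking
    confidence: float   # the extractor's own confidence, 0.0 to 1.0


def to_facts(fields: list[ExtractedField]) -> dict[str, object]:
    """Flatten extracted fields into the path -> value mapping that rules read."""
    facts: dict[str, object] = {}
    for f in fields:
        if f.path in facts and facts[f.path] != f.value:
            # Two conflicting readings of one field: surface it, do not pick one.
            raise ValueError(f"conflicting values for {f.path}")
        facts[f.path] = f.value
    return facts


fields = [
    ExtractedField("holding.beneficial_ownership_pct", 6.2, "ownership: 6.2%", 0.97),
]
facts = to_facts(fields)
print(facts)
```

Nothing in this shape can express "obligation met" or "obligation missed". That is deliberate: a wrong number stays visible as a wrong number.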

The rule decides

The decision is the part that has to be certain, and certainty is not something a probabilistic system can offer. So the decision does not belong to the model. It belongs to a rule.

A rule is a small, explicit object. It names the conditions that have to hold, it points at the fields those conditions read, and it states what happens when they are met. It does not have an accuracy. It is not right ninety-five percent of the time. It either fired or it did not, and you can read exactly why, because the reason is the rule itself.

{
  "rule_id": "SEC-DISC-US-001",
  "title": "Beneficial ownership crosses 5%: Schedule 13D due",
  "jurisdiction": "us",
  "source": "Securities Exchange Act §13(d)",
  "conditions": [
    { "type": "threshold_crossed", "path": "holding.beneficial_ownership_pct", "value": 5.0 },
    { "type": "deadline_window", "from": "holding.threshold_crossed_at", "to": "filing.submitted_at", "max_days": 10 }
  ],
  "deterministic": true,
  "validation_status": "expert_reviewed"
}

The model may have read the ownership percentage and the dates out of the documents. It does not get a vote on what they mean. The rule reads the fields the model produced, applies the threshold and the window, and returns a verdict that follows from the facts the way a sum follows from its numbers. Run it twice on the same input and it returns the same answer. Show it to an auditor and the auditor can see the article, the condition, and the value that triggered it. That is what deciding has to look like in work where the decision is the product.
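
The evaluation step can be sketched in a few lines. This is an illustration under stated assumptions, not the evaluator of any particular product: the evaluate function and the shape of its return value are hypothetical, "crossed" is taken to mean strictly above the threshold, max_days is counted in calendar days, and the rule is treated as firing when the threshold is crossed and the filing window is exceeded. Handling of missing inputs is left out here.

```python
from datetime import date

# The rule from the text, trimmed to the fields the evaluator reads.
RULE = {
    "rule_id": "SEC-DISC-US-001",
    "conditions": [
        {"type": "threshold_crossed", "path": "holding.beneficial_ownership_pct", "value": 5.0},
        {"type": "deadline_window", "from": "holding.threshold_crossed_at",
         "to": "filing.submitted_at", "max_days": 10},
    ],
}


def evaluate(rule: dict, facts: dict) -> dict:
    """Apply every condition to the facts and return a verdict with its trace."""
    triggered = True
    overdue_days = 0
    trace = []
    for cond in rule["conditions"]:
        if cond["type"] == "threshold_crossed":
            actual = facts[cond["path"]]
            # Assumption: "crossed" means strictly above the threshold value.
            triggered = triggered and actual > cond["value"]
            trace.append({"condition": cond["type"], "actual": actual, "limit": cond["value"]})
        elif cond["type"] == "deadline_window":
            # Assumption: the window is counted in calendar days.
            elapsed = (facts[cond["to"]] - facts[cond["from"]]).days
            overdue_days = max(0, elapsed - cond["max_days"])
            trace.append({"condition": cond["type"], "actual": elapsed, "limit": cond["max_days"]})
        else:
            # An unknown condition is an authoring error, not something to guess at.
            raise ValueError(f"unknown condition type: {cond['type']}")
    fired = triggered and overdue_days > 0
    return {
        "rule_id": rule["rule_id"],
        "fired": fired,
        "overdue_days": overdue_days if fired else 0,
        "trace": trace,
    }


facts = {
    "holding.beneficial_ownership_pct": 6.2,
    "holding.threshold_crossed_at": date(2024, 3, 1),
    "filing.submitted_at": date(2024, 3, 19),
}
verdict = evaluate(RULE, facts)
print(verdict["fired"], verdict["overdue_days"])  # True 8
```

The trace is the auditable part: each entry names the condition, the value that was read, and the limit it was compared against.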

This is the hard discipline, because letting the rule decide is slower to build than letting the model decide. The rule has to be written, by a person, from the regulation. There is no shortcut where the model reads the law and the rules write themselves into existence with any reliability, because that just moves the probabilistic guess one step earlier, into the authoring. Someone has to encode the obligation, and someone has to sign off that the encoding is right. That work is the cost of certainty, and there is no version of this that is both certain and free of it.

The model explains

After the rule decides, the verdict is correct but it is not yet usable. "SEC-DISC-US-001 fired with deadline_window exceeded by 8 days" is accurate and unreadable. A compliance officer needs the verdict in language: what happened, which rule, what to do next. This is the third job, and the model is good at it again, because turning structure into prose is the mirror image of reading.

The crucial thing is the order. The model explains a decision that already exists. It is not reasoning toward the decision and narrating as it goes. It is given the verdict and the rule that produced it, and asked to phrase them. This sounds like a small distinction and it is the whole safety of the third job. A model asked to explain a fixed verdict cannot change the verdict by explaining it badly. The decision is upstream, settled, deterministic. The explanation is downstream, and if the explanation is awkward, you have a wording problem, not a compliance failure.

Contrast that with the tempting shortcut, where the model both decides and explains in one pass, producing a verdict and a justification together. Now the justification is not an explanation of a decision. It is a story the model told itself on the way to a guess, and the story will sound exactly as confident when the guess is wrong. The separation of deciding from explaining is what stops a plausible narrative from standing in for a correct answer.
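
One way to make the ordering concrete, sketched under assumptions: the Verdict type, the explain function, and template_phrase are hypothetical names, and template_phrase is a deterministic stand-in for what would be a model call in a real system. The structure is what matters. The phrasing step receives a frozen verdict and can only return text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    """A settled decision. Frozen, so nothing downstream can alter it."""
    rule_id: str
    title: str
    source: str
    fired: bool
    overdue_days: int


def template_phrase(v: Verdict) -> str:
    """Deterministic stand-in for the model; a real system would prompt a model here."""
    if not v.fired:
        return f"No finding under {v.rule_id} ({v.source})."
    return (f"{v.title}. The filing was {v.overdue_days} days past the deadline "
            f"set by {v.source} (rule {v.rule_id}).")


def explain(v: Verdict, phrase: Callable[[Verdict], str] = template_phrase) -> dict:
    """Attach wording to a settled verdict. The phrasing function only returns text."""
    return {"verdict": v, "explanation": phrase(v)}


verdict = Verdict(
    rule_id="SEC-DISC-US-001",
    title="Beneficial ownership crosses 5%: Schedule 13D due",
    source="Securities Exchange Act §13(d)",
    fired=True,
    overdue_days=8,
)
report = explain(verdict)
print(report["explanation"])
```

If the phrasing function returns something clumsy or wrong, the verdict field in the report is still the one the rule produced, which is the wording-problem-not-compliance-failure property described above.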

The hard middle

There is one place where the clean split blurs, and honesty requires meeting it head-on. Some judgments in regulated work are genuinely semantic. They cannot be reduced to a threshold or a date or a list membership. The clearest example is in legal review: does this contract clause deviate materially from the standard position? You cannot answer that with a comparison operator. You have to read the clause, understand it, and judge it against a reference. That is interpretation, and interpretation is the model's territory, not the rule's.

So in the hard middle, the model does make a kind of judgment. The answer is not to pretend otherwise. The answer is to fence the judgment with the same discipline that governs everything else, and to be loud about its nature.

When a genuinely semantic check runs, it is marked as semantic. The rule that contains it carries a flag saying it depends on a model. And the model's output is treated as what it is: a reading with a confidence, not a verdict with authority. A confident semantic reading can inform a decision. A low-confidence one becomes a preview, surfaced to a human, never quietly filed as a finding. The model is allowed to raise its hand and say a clause looks non-standard. It is not allowed to rule that the contract fails.

{
  "rule_id": "LEG-CL-014",
  "title": "Liability cap weaker than the playbook fallback",
  "source": "internal playbook + jurisdiction caselaw",
  "conditions": [
    { "type": "clause_deviation", "clause_type": "limitation_of_liability", "against": "fallback", "requires_llm": true }
  ],
  "deterministic": false,
  "requires_llm": true,
  "validation_status": "expert_reviewed"
}

The flag requires_llm: true is not an apology. It is a contract with the reader. It says: this part of the system is a judgment, treat its output as a judgment, send it to a person when it is unsure. The system does not become more honest by hiding the places it leans on a model. It becomes more honest by labeling them, so the parts that are certain and the parts that are judgments never get confused for one another.
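
The gating described above can be sketched as a small routing function. Everything named here is an assumption for illustration: the SemanticReading type, the route function, the status labels, and especially the 0.85 cut-off, which is a placeholder and not a recommended value. What the sketch preserves is that no branch returns a failure verdict.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.85  # hypothetical cut-off; a real value would be set by domain experts


@dataclass(frozen=True)
class SemanticReading:
    rule_id: str
    deviates: bool      # the model's reading: does the clause look non-standard?
    confidence: float   # 0.0 to 1.0


def route(reading: SemanticReading, threshold: float = REVIEW_THRESHOLD) -> dict:
    """Decide where a semantic reading goes. It is never turned into a verdict here."""
    base = {"rule_id": reading.rule_id, "requires_llm": True, "deviates": reading.deviates}
    if reading.confidence < threshold:
        # Low confidence: a preview for a person, not a finding.
        return {**base, "status": "preview", "needs_human": True}
    # Confident: the reading may inform a decision, still labeled as a judgment.
    return {**base, "status": "signal", "needs_human": False}


unsure = route(SemanticReading("LEG-CL-014", deviates=True, confidence=0.60))
sure = route(SemanticReading("LEG-CL-014", deviates=True, confidence=0.95))
print(unsure["status"], sure["status"])  # preview signal
```

Both outputs carry requires_llm, so a downstream consumer cannot mistake either one for the result of a deterministic rule.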

Why the line cannot move

Everything above rests on one boundary: the model reads and explains, the rule decides, and the genuinely semantic exceptions are fenced and labeled. The pressure to erase that boundary is constant, and it always comes from the same place. Letting the model decide is faster to build and it demos beautifully. You can stand up a system in a weekend that reads a document, forms a view, and renders a verdict, all in one model call, and in a demo it looks like magic. The boundary feels like bureaucracy. Why insert a rules layer when the model can just tell you the answer?

The reason is that the demo and the deployment are judged by different standards. The demo is judged by whether it looks right on the cases you chose. The deployment is judged by whether it is right on the cases you did not choose, including the adversarial ones, the rare ones, and the ones a regulator picks precisely because they are hard. A model that decides will be confidently wrong on some of those, and you will not be able to tell which, and the wrong answer will arrive wearing the same certainty as the right ones. The rules layer is not bureaucracy. It is the thing that makes the system's behavior on the unchosen cases knowable in advance.

There is a deeper reason too. In regulated work you do not only have to be right. You have to be able to show why you were right, to someone who is not inclined to take your word for it. A decision that came out of a model cannot be shown. You can show the inputs and you can show the output, but the step in between is billions of learned weights, and "the model decided" is not an account anyone can audit. A decision that came out of a rule can be shown completely: here is the article, here is the condition, here is the value that triggered it. The boundary between reading and deciding is also the boundary between a system you can defend and a system you can only hope about.

What the human does

The model reads and explains. The rule decides. That leaves the human, and the human is not a fallback for when the system fails. The human has a defined job, and it is the job neither the model nor the rule can do.

The human resolves the genuinely uncertain. When a semantic check falls below its confidence threshold, it goes to a person, and the person rules on it. When a rule cannot verify its inputs, the case is flagged, and a person checks it. The system is built to hand up exactly the cases that need judgment, and no others, so the human's attention lands where it is worth the most.

The human also validates. A rule is only as trustworthy as the person who signed off that it encodes the regulation correctly. That sign-off is not a one-time event; it is an ongoing relationship between the people who know the domain and the rules that claim to capture it. The system tracks which rules have been reviewed and by whom, and it treats an unreviewed rule differently from a validated one. The point is not to remove the human from the loop. It is to spend the human's scarce, expensive judgment on the things that actually require it, and to let the deterministic layer handle everything that does not.
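
Treating an unreviewed rule differently from a validated one can be as simple as the following sketch. The authority function, the reviewed_by field, the "draft" status, and the "binding" and "advisory" labels are all hypothetical; only validation_status and its value expert_reviewed come from the rule examples in the text.

```python
def authority(rule: dict) -> str:
    """Return how much weight a rule's verdict carries, based on its review record."""
    reviewed = rule.get("validation_status") == "expert_reviewed"
    signed = bool(rule.get("reviewed_by"))  # hypothetical field: who signed off
    if reviewed and signed:
        return "binding"
    # Unreviewed, or reviewed with no named reviewer: the verdict is advisory only.
    return "advisory"


validated = {
    "rule_id": "SEC-DISC-US-001",
    "validation_status": "expert_reviewed",
    "reviewed_by": "securities-counsel",  # hypothetical reviewer identifier
}
draft = {"rule_id": "SEC-DISC-US-002", "validation_status": "draft"}  # hypothetical rule

print(authority(validated), authority(draft))  # binding advisory
```

Requiring a named reviewer as well as a status is a design choice in this sketch: a status with nobody attached to it is a claim no one can be asked about.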

The shape of the whole thing

Put the pieces together and the system has a shape. A document arrives. The model reads it into structured facts. The rules evaluate those facts and produce verdicts, deterministically, each traceable to a source. Where a judgment is genuinely semantic, the model makes it, flagged and confidence-gated, and a low-confidence judgment goes to a person rather than into a decision. The verdicts, once settled, are explained back into language by the model. And throughout, when anything cannot be verified, the system says so rather than guessing, because a false clear, a violation reported as compliant, is the error that matters most.
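
"Says so rather than guessing" implies that a check has three possible outcomes, not two. A minimal sketch, assuming a hypothetical Outcome enum and check_threshold helper: a missing or non-numeric input yields UNVERIFIED and never CLEAR.

```python
from enum import Enum


class Outcome(Enum):
    CLEAR = "clear"
    FINDING = "finding"
    UNVERIFIED = "unverified"


def check_threshold(facts: dict, path: str, limit: float) -> Outcome:
    """Compare one numeric fact to a limit, refusing to clear what it cannot verify."""
    value = facts.get(path)
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return Outcome.UNVERIFIED
    return Outcome.FINDING if value > limit else Outcome.CLEAR


path = "holding.beneficial_ownership_pct"
print(check_threshold({path: 6.2}, path, 5.0))    # Outcome.FINDING
print(check_threshold({path: 3.1}, path, 5.0))    # Outcome.CLEAR
print(check_threshold({}, path, 5.0))             # Outcome.UNVERIFIED
print(check_threshold({path: "6.2%"}, path, 5.0)) # Outcome.UNVERIFIED
```

A two-valued check would have to map the last two cases to something, and the convenient default is "no finding". The third value exists to make that default impossible.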

Notice that the model appears twice, at the start and the end, and not in the middle. It opens the system by reading the world into structure and it closes the system by turning structure back into language. The decision sits in between, deterministic, defensible, untouched by the probabilistic parts. That is not a compromise between the two camps. It is a rejection of the framing that produced them. The model is not doing everything and it is not doing nothing. It is doing the two things it is genuinely the best tool for, and it is kept away from the one thing it is the wrong tool for.

The point

The question was never whether to use AI in regulated work. It was where to put it. Point the model at reading and explaining, the jobs that turn on language, and it is the most capable tool anyone has ever had for them. Point it at deciding, the job that turns on certainty, and it is a liability dressed as a convenience. The maximalists are right that the model is transformative and wrong about which part it transforms. The skeptics are right that a guess has no place in a verdict and wrong that this rules the model out.

A regulated system built this way is not less ambitious for keeping the model on a leash. It is more ambitious, because it is aiming at the thing that actually matters, which is a decision you can stand behind, with a model doing the work it is brilliant at and a rule doing the work that has to be certain. Knowing where the model belongs is not a limit on what you can build. It is the precondition for building anything a bank can run.