Governance & Compliance

Why We Don't Train AI Agents Anymore

92% accuracy without training. From August 2026, the EU AI Act requires explainable individual decisions. Trained models cannot deliver that.

Bert Gogolin
CEO & Founder | 12 min read

Training Is the New Fax

In 2019 we had to train AI models - they were too limited to work any other way. GPT-2 could not write a coherent paragraph. BERT needed thousands of labelled examples and days on a GPU cluster for every single task.

That was six years ago. Six years in which language model capabilities improved by orders of magnitude. Yet the industry still acts as if “training” is the natural first step.

At a Glance - Why Training Is the Wrong Architecture

  • An LLM achieves 92% correct decisions in invoice review - without a single training example. Experienced lawyers reach 72%.[1]
  • From August 2026, the EU AI Act (Art. 13, 14, 86) requires explainable individual decisions for high-risk systems. Trained models cannot deliver that.[10]
  • The alternative: rulebook (versioned), context (per decision), Decision Layer (human / rulebook / AI per Micro-Decision).
  • Configured agents are model-agnostic: switch foundation models without changing a single rule. No lock-in, no retraining.
  • Over 40% of agentic AI projects will fail by 2027 - mostly due to missing governance, not missing model performance.[9]

If someone says “we train our AI agents” in 2026, it is like saying “we fax our orders” in 2010. It works. But it shows a fundamental misunderstanding of the architecture.

From Training to Configuration

  • 2018 - 2020: Training Is Required. BERT, GPT-2 (110M - 1.5B parameters). Duration: weeks. Cost: $10,000 - $100,000. Prerequisite: GPU cluster.

  • 2021 - 2023: Training Becomes Optional. GPT-3/3.5 (175B parameters). Duration: days. Cost: $1,000 - $10,000. Prerequisite: GPU required.

  • 2024: Training or Prompting? GPT-4o, Claude 3.5 (multimodal). Duration: hours. Cost: $10 - $100. Prerequisite: API call.

  • 2025 - 2026: Configuration Is Enough. GPT-5, Claude Opus 4 (reasoning). Duration: minutes. Cost: $10 - $100. Prerequisite: API call.

Kumar Gauraw puts it clearly: “Most reach for fine-tuning too early.”[5] Not because fine-tuning is bad. Because in 2026, it is no longer necessary for most enterprise tasks.

What a Trained Model Cannot Do: Explain an Individual Decision

A candidate is rejected by your recruiting agent. They ask: Why?

Two answers. Two architectures.

Trained model: “Our model learned from 50,000 historical hiring decisions that your profile has a 34% success probability.”

Configured agent: “Your qualification in mechanical engineering does not meet requirement 3 (electrical engineering or equivalent). Rule: job profile v2026-03. Contestable: Yes. Process: department reviews whether mechanical engineering qualifies as ‘equivalent’.”

The first answer is illegal from August 2026.

EU AI Act, Art. 13 (transparency), Art. 14 (human oversight), Art. 86 (right to explanation).[10] For high-risk systems - and recruiting is high-risk, Annex III(4) - every individual decision must be traceable, explainable and contestable. (US: No federal equivalent exists, but EEOC guidance increasingly demands similar explainability for automated hiring decisions.)

Not the model. The individual decision. For this candidate. With this justification.

A trained model cannot do that. It has no decision record. It has weights. And weights explain nothing to an employee representative body.
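
What a per-decision record makes concrete can be shown in a few lines. A minimal sketch in Python, assuming an illustrative schema - the field names are not Gosign's actual API:

```python
# Minimal sketch of a per-decision record; field names are illustrative.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class DecisionRecord:
    rule_id: str        # which rule was applied
    rule_version: str   # which version was in effect at decision time
    context: dict       # the facts of this individual case
    result: str         # the outcome
    justification: str  # the human-readable reason the affected person sees
    contestable: bool   # can the affected person object?
    decided_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

record = DecisionRecord(
    rule_id="job-profile",
    rule_version="v2026-03",
    context={"qualification": "mechanical engineering",
             "requirement_3": "electrical engineering or equivalent"},
    result="rejected",
    justification="Qualification does not meet requirement 3",
    contestable=True,
)
```

Everything the candidate, the auditor or the works council asks for is a field in this record. Weights have no equivalent.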

The Compliance Test: Trained vs. Configured

Architecture A: Trained Model

  • "Why this decision?" - "Model has learned" - black box. Not explainable.
  • "Regulation changes?" - Retrain. 2 - 4 weeks, $5,000 - $20,000. Expensive and slow.
  • "Can the affected person contest?" - Contest what? Model weights? Not contestable.
  • "New LLM model available?" - New training required. Weeks, lock-in. Vendor dependency.
  • "EU AI Act compliant?" - Art. 13: transparency missing. Art. 14: intervention means replacing the model. Art. 86: explanation not possible. Problematic.

Lock-in: Yes | Audit: Difficult | EU AI Act: Problematic

Architecture B: Configured Agent

  • "Why this decision?" - "Travel expense rule v2026-01, absence 14h15min." Rule, version and context documented.
  • "Regulation changes?" - Update the rule. Effective immediately, $0. Versioned and auditable.
  • "Can the affected person contest?" - "Breakfast was not included." A reviewer checks. Contestable, with a decision record.
  • "New LLM model available?" - The rulebook stays. Zero effort, no lock-in. Model-agnostic.
  • "EU AI Act compliant?" - Decision record per Micro-Decision. Override the rule, not the model. Compliant by design.

Lock-in: No | Audit: By design | EU AI Act: Compliant

The compliance problem is only the surface. Beneath it lies an architecture problem.

92% vs. 72%

In 2025, researchers tested how well an LLM can review legal invoices against billing guidelines.[1] No fine-tuning. No training. Just the rulebook as context.

The result:

Legal invoice: compliant or not? Better Bill GPT, Whitehouse et al. (April 2025), peer-reviewed. The LLM received the rulebook as context, no fine-tuning.[1]

  • Overall accuracy: LLM (no training) 92% vs. experienced lawyers 72%[1]
  • Line item classification (F-Score): LLM 81% vs. best human group 43%[2]
  • Time per invoice: LLM 3.6 sec vs. lawyers ~250 sec[3]
  • Cost per invoice: LLM < $0.01 vs. lawyers $4.27[4]

Cost reduction: 99.97%.[4] The mechanism is transferable to any rule-based compliance task.

The LLM was not trained on invoices. It received the billing guidelines as context. And decided immediately.

Why the LLM Performed Better

Not because it is smarter. Because it applies the same rule at 3 PM exactly as it does at 9 AM. Inconsistency is the human problem, not incompetence.[1]

Experienced lawyers make 72% correct decisions - but each lawyer makes different wrong decisions. The errors are not systematic but random. Fatigue, time pressure, personal interpretation. An LLM knows no fatigue.

The Transferable Mechanism

Whether the rulebook is called “billing guideline”, “per diem regulation” or “travel expense policy”: check document against rule, identify deviation, document decision. The mechanism is identical.
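
A sketch of that loop, assuming a generic `call_llm` placeholder for whatever foundation model API is in use (not a real client):

```python
# Sketch of the transferable mechanism: check a document against the current
# rule version, identify deviations, document the decision.
# call_llm is a placeholder, not a real provider API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model provider here")

def review(document: str, rule_text: str, rule_version: str) -> dict:
    prompt = (
        "Check the DOCUMENT against the RULE. Answer 'compliant' or "
        f"'deviation: <reason>'.\n\nRULE ({rule_version}):\n{rule_text}\n\n"
        f"DOCUMENT:\n{document}"
    )
    verdict = call_llm(prompt)
    # The same loop serves billing guidelines, per diem regulations and
    # travel expense policies - only rule_text changes.
    return {
        "rule_version": rule_version,  # versioned basis, auditable later
        "input": document,
        "result": verdict,
        "contestable": True,
    }
```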

Dimension              | Trained Model                      | Configured Agent
Rule change            | Retraining (weeks, $5k - $20k)     | Rulebook update (minutes, $0)
Explainability         | "Model has learned" (black box)    | Rule + version + context (decision record)
Contestability         | Not possible (no decision record)  | Yes (affected person sees the rule and can object)
Model switch           | New training required (lock-in)    | Zero effort (model-agnostic)
Audit trail            | Input + output (no justification)  | Input + rule + version + confidence + result
EU AI Act (Aug 2026)   | Art. 13, 14, 86: problematic       | Art. 13, 14, 86: compliant by design
Break-even fine-tuning | From ~35,000 queries/month[6]      | Economical immediately

A study by Chauhan et al. (2025) puts the break-even point of fine-tuning versus prompting at roughly 35,000 queries per month.[6] Most enterprise HR and finance processes operate well below that.
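
The break-even itself is simple arithmetic: fine-tuning trades a fixed monthly cost for a lower per-query cost. A back-of-the-envelope sketch - both input numbers are hypothetical placeholders chosen to land near the study's figure, not values from Chauhan et al.:

```python
# Back-of-the-envelope break-even. Both inputs are hypothetical placeholders,
# NOT figures from Chauhan et al. (2025).
FINETUNE_MONTHLY_COST = 3_500.00   # assumed: amortised training + hosting per month
PROMPT_OVERHEAD_PER_QUERY = 0.10   # assumed: extra cost of shipping the rulebook as context

break_even = FINETUNE_MONTHLY_COST / PROMPT_OVERHEAD_PER_QUERY
print(f"Fine-tuning pays off above ~{break_even:,.0f} queries/month")  # ~35,000 here
```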

Three Things Instead of Training

If not training, then what? Three components replace what fine-tuning promises but structurally cannot deliver.

1. Rulebook

Everything an agent needs to know is in a regulation, a directive, a collective agreement or a company policy. These rules change. Tax law changes annually. Per diem rates change annually. EU regulations change.

A trained model must be retrained with every change. A rulebook is updated. Effective immediately, versioned, auditable. No GPU cluster, no evaluation cycle, no regression risks.

RAG (Retrieval Augmented Generation) reduces factual errors by up to 50%.[11] Not because the model gets smarter, but because it sees the current rule instead of relying on outdated weights.
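
A minimal sketch of such a versioned rulebook - structure and values are illustrative, not an actual tax table:

```python
# Versioned rulebook sketch: the rule in force is selected by effective date,
# so a regulatory change is an append, not a retraining run. Values illustrative.
from datetime import date

RULEBOOK = {
    "per-diem-domestic": [
        {"version": "v2025-01", "effective": date(2025, 1, 1), "rate_eur": 28.00},
        {"version": "v2026-01", "effective": date(2026, 1, 1), "rate_eur": 30.00},
    ],
}

def rule_in_force(rule_id: str, on: date) -> dict:
    """Latest version effective on the given date; the version stays explicit for audit."""
    versions = [v for v in RULEBOOK[rule_id] if v["effective"] <= on]
    return max(versions, key=lambda v: v["effective"])

rule = rule_in_force("per-diem-domestic", date(2026, 3, 15))  # -> version v2026-01
```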

2. Context

The agent does not need 10,000 historical expense reports. It needs this one report: travel date, departure, return, hotel, breakfast included or not. That is the context of this decision.

It is supplied through structured inputs or RAG, not trained in. When the context changes - different trip, different employee - the decision changes. Not the model.

A concrete example: the travel expense engine checks per diem allowances against the applicable tax regulation. In Germany, this is Section 9 of the Income Tax Act (EStG). The context is the individual trip. The rulebook is the current tax law. The foundation model is interchangeable.
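
A sketch of that split: the context is one trip, the rate comes from the versioned rulebook, and the thresholds below are simplified stand-ins for the actual tax rules:

```python
# Context per decision, simplified. Thresholds and the breakfast deduction
# factor are illustrative stand-ins for the actual tax regulation.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TripContext:
    departure: datetime
    return_: datetime
    breakfast_included: bool

def per_diem(trip: TripContext, full_day_rate: float) -> float:
    hours = (trip.return_ - trip.departure).total_seconds() / 3600
    allowance = full_day_rate if hours >= 24 else (full_day_rate / 2 if hours > 8 else 0.0)
    if trip.breakfast_included:
        allowance -= 0.2 * full_day_rate  # deduction rule, versioned in the rulebook
    return max(allowance, 0.0)

# A different trip changes the context - and the result. The model changes nothing.
```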

3. Decision Framework

Who decides what? Not every decision in a process is equal.

The per diem allowance is rulebook: tax regulation, deterministic, 100% confidence. The question of whether an entertainment expense is “reasonable” is judgement: human. The classification of an illegible receipt is AI: LLM extraction, probabilistic.

This decomposition into Micro-Decisions with assignment to human / rulebook / AI is the real architecture work. Not training. The Decision Layer formalises exactly this decomposition. Architecture details: Decision Layer explained.
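
A sketch of that routing, with an illustrative confidence threshold (the types mirror the Micro-Decision table below):

```python
# Human / rulebook / AI routing per micro-decision. Types and the 0.85
# threshold are illustrative, not Gosign's actual configuration.
from enum import Enum

class DecisionType(Enum):
    RULEBOOK = "rulebook"  # deterministic, 100% confidence
    AI = "ai"              # probabilistic, gated by confidence
    HUMAN = "human"        # judgement

CONFIDENCE_THRESHOLD = 0.85

def route(step: DecisionType, confidence: float | None = None) -> str:
    if step is DecisionType.RULEBOOK:
        return "apply versioned rule deterministically"
    if step is DecisionType.AI:
        ok = confidence is not None and confidence >= CONFIDENCE_THRESHOLD
        return "accept AI output, log confidence" if ok else "escalate to human reviewer"
    return "queue for human judgement"
```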

Micro-Decision in Practice

Travel expense report: 8-hour day, domestic trip, hotel with breakfast

# | Micro-Decision                       | Type     | Basis
1 | Travel date and absence duration     | Context  | Input: receipts
2 | Calculate per diem allowance         | Rulebook | Tax regulation v2026-01
3 | Apply breakfast deduction            | Rulebook | Per diem deduction rule, versioned
4 | Classify receipt                     | AI       | LLM extraction, confidence: 87%
5 | Entertainment expense "reasonable"?  | Human    | Judgement, reviewer decides
6 | Create audit-compliant booking entry | Rulebook | Record-keeping regulation, versioned

Each step has a fixed type: Rulebook (deterministic), AI (probabilistic, with confidence threshold) or Human (judgement). When the tax regulation changes, the rule is updated. No retraining. No new model.
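
The same table as declarative configuration - a sketch with an illustrative step schema:

```python
# The six micro-decisions above as configuration, not code. Schema illustrative.
STEPS = [
    {"id": 1, "name": "Travel date and absence duration", "type": "context"},
    {"id": 2, "name": "Calculate per diem allowance", "type": "rulebook",
     "rule": "tax-regulation", "version": "v2026-01"},
    {"id": 3, "name": "Apply breakfast deduction", "type": "rulebook",
     "rule": "per-diem-deduction"},
    {"id": 4, "name": "Classify receipt", "type": "ai", "min_confidence": 0.85},
    {"id": 5, "name": "Entertainment expense reasonable?", "type": "human"},
    {"id": 6, "name": "Create audit-compliant booking entry", "type": "rulebook",
     "rule": "record-keeping"},
]
# A tax change touches step 2's version string. Nothing else moves.
```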

The Three Layers: Architecture Instead of Training

The architecture behind a configured agent consists of three layers. Each layer is independently replaceable.

Layer 3: Decision Layer - Micro-Decisions, human / rulebook / AI, decision record, audit trail
Layer 2: Rulebook (versioned, replaceable) - tax law, per diem rules, record-keeping, collective agreements, company policy, EU AI Act
Layer 1: Foundation Model (replaceable) - Claude, GPT, Llama, Mistral, Gemini

Everything above Layer 1 remains when the model changes. Rulebook, Decision Layer, decision records, audit trail - all model-agnostic. No retraining. No lock-in.

Why three layers? Because each has a different responsibility.

The foundation model provides language understanding and reasoning. It understands context, extracts information from documents, classifies inputs. It does not need to know what a specific tax regulation says. It needs to understand what a regulatory text is.

The rulebook contains the business logic. Regulations, directives, collective agreements, company policies. Every rule has a version. Every version has an effective date. When the regulation changes, the rule is updated. Not the model.

The Decision Layer governs who may decide what. It decomposes processes into decision steps. Defines for each: human, rulebook or AI. Documents every decision with rule, version, context and result.
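
Why a model switch leaves layers 2 and 3 untouched, as a sketch: the upper layers depend on an interface, not on a vendor. Class and method names are illustrative:

```python
# Layer 1 behind an interface: swapping the model touches one constructor
# argument. Names are illustrative, not Gosign's actual API.
from typing import Protocol

class FoundationModel(Protocol):
    def complete(self, prompt: str) -> str: ...

class AgentRuntime:
    """Layers 2 and 3 depend only on the protocol above."""
    def __init__(self, model: FoundationModel, rulebook: dict):
        self.model = model        # Claude today, GPT tomorrow - nothing else changes
        self.rulebook = rulebook  # versioned rules stay exactly as they are
```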

What Training Really Costs

Not in dollars. In dependencies.

Lock-in

A fine-tuned model ties you to that vendor. The training set, the weights, the evaluation pipeline: all proprietary. Model switch = new training = new costs = new time loss.

A configured agent switches the foundation model without changing a single rule. Claude today, GPT tomorrow, an open-source model next week. The rulebook stays. The Decision Layer stays. The decision records stay.

Maintenance

Every regulatory change requires retraining. In finance, tax law, treasury guidance and social security contribution rates change annually. In HR, collective agreements, framework agreements and EU regulations change.

A trained agent needs continuous maintenance that looks like a software project. A configured agent needs a rulebook editor.

MIT and Stanford (Choi & Xie, 2025) show: AI reduces the monthly close by 7.5 days.[7] But 62% of accountants worry about AI errors.[8] The concern is justified - with trained models. With configured agents that have decision records and contestability, every error is identifiable and correctable.

Explainability

A trained model can tell you what it decided. It cannot tell you why.

“The model has learned” is not a justification a tax auditor accepts. No employee representative body accepts it. No rejected candidate accepts it.

“Travel expense rule v2026-01, applied to absence of 14h15min” is a justification.

If you cannot explain the decision, you cannot let it be contested. And if it cannot be contested, it is no longer legally compliant in the EU from August 2026.[10]

Does Fine-Tuning Have Its Place?

Yes. From roughly 35,000 queries per month with a stable rulebook, fine-tuning becomes economical.[6] Language adaptation, domain-specific jargon, latency optimisation: there are good reasons for it.

But where the industry sells it today - enterprise HR and finance with annually changing regulations - it is the wrong architecture decision. Gartner predicts that over 40% of agentic AI projects will fail by 2027.[9] Not because of model performance. Because of governance.

The Question Your Board Should Ask

Not: “What data was your agent trained on?”

But:

1. Which rulebook underlies the decision? Which version was in effect at the time of the decision?

If the answer is “that is in the model”, there is no version. No change history. No audit trail.

2. What happens when the rule changes? Retraining or update?

If the answer is “we retrain”, you are paying for maintenance that is unnecessary.

3. Can the affected person see and contest the individual decision?

If there is no answer, you have a compliance problem from August 2026. Art. 86 EU AI Act: right to explanation. Not optional.[10]

Gosign’s Approach

Gosign’s Decision Layer is an implementation of this architecture. It decomposes processes into decision steps. Defines for each: human, rulebook or AI. Rulebooks are versioned. Decisions are auditable. Results are contestable.

48 HR agents and 49 finance agents, each with a Micro-Decision table. No fine-tuning. No lock-in. No retraining when regulations change.


References

  1. Better Bill GPT, Whitehouse et al. (April 2025). Legal Invoice Review: LLM achieves 92% accuracy reviewing legal invoices against billing guidelines. Peer-reviewed.
  2. Better Bill GPT, Whitehouse et al. (April 2025). F-Score for line item classification: LLM 81% vs. best human group 43%.
  3. Better Bill GPT, Whitehouse et al. (April 2025). Processing time per invoice: LLM 3.6 seconds vs. experienced lawyers 194 to 316 seconds.
  4. Better Bill GPT, Whitehouse et al. (April 2025). Cost reduction in legal invoice review: 99.97% ($4.27 vs. <$0.01 per invoice).
  5. Kumar Gauraw (March 2026). "Most reach for fine-tuning too early."
  6. Chauhan et al., Journal of Information Systems Engineering (2025). Break-even fine-tuning vs. prompting: ~35,000 queries per month.
  7. MIT/Stanford, Choi & Xie (August 2025). AI reduces the monthly close by an average of 7.5 days.
  8. MIT/Stanford, Choi & Xie (August 2025). 62% of accountants express concerns about AI errors in financial processes.
  9. Gartner (June 2025). Prediction: Over 40% of agentic AI projects will fail by 2027.
  10. EU AI Act (Regulation 2024/1689), Crowell & Moring (February 2026). High-risk obligations from August 2026: Art. 13 (transparency), Art. 14 (human oversight), Art. 86 (right to explanation). Annex III(4): recruiting as high-risk system.
  11. IBM (2024). RAG reduces factual errors in LLM outputs by up to 50%.

Frequently Asked Questions

Why is fine-tuning problematic for enterprise agents?

Fine-tuning embeds business rules into model weights. The consequence: individual decisions cannot be traced to a specific rule, regulatory changes require expensive retraining, and switching models means total loss. From August 2026, the EU AI Act (Art. 13, 14, 86) requires explainable individual decisions for high-risk systems. Trained models cannot deliver that by design.

What is the difference between training and configuration?

Training (fine-tuning) modifies a model's weights. Business rules become part of the model and can no longer be individually identified. Configuration means: the foundation model remains unchanged. Business rules exist as a versioned rulebook, the current case is passed as context. Result: every decision traces to a specific rule, is auditable and contestable.

Is fine-tuning never useful?

Fine-tuning has its place. From roughly 35,000 queries per month with a stable, rarely changing rulebook, it becomes economical. But where the industry sells it today - enterprise HR and finance with annually changing regulations, collective agreements and framework agreements - it is the wrong architecture decision.

What does model-agnostic mean?

Model-agnostic means the rulebook and the Decision Layer operate independently of the foundation model in use. Whether Claude, GPT, Llama or Mistral - business rules, decision tiers and audit trails remain identical. Switching models requires zero changes to the rulebook. No lock-in, no retraining costs.
