White paper

The Integration Problem

Most of what is sold to charities as an AI agent is a faster horse. The real prize is software that does the coordination work charities currently do by hand, and it only earns trust if governance is built in from the first line of code. A white paper on what agents are actually for.

2026-06-09

On this page

1. The faster-horses trap
2. The coordination tax
3. The two curves
4. What an agent is, and what it's not
5. From recipes to workflows
6. Governance is the engine mount
7. Why the incumbents can't sell the real thing
8. What to do
The mission, and the gap

What's being sold to charities as an AI agent is mostly a faster horse. The work that actually drains the sector is the work of holding everything together by hand. That's the problem worth solving, and it's the one almost nobody's selling.

A disclosure before we start: we're an agency that designs and builds this kind of software for charities, so we have a stake in the argument that follows. We've sourced it heavily enough, we hope, that it stands on its evidence rather than our say-so.

A charity's staff are the glue between tools that were never built to talk to each other - receiving a gift here, re-keying it there, carrying every hand-off by hand. The faint web is every connection that could live in software; it only lights when a person bridges it. A chatbot makes each hop a little faster; it can't take the carrying away.

Almost every charity in the UK now has access to AI. The Charity Digital Skills Report found that 76% of charities were using AI tools in their day-to-day work in 2025, up from 61% the year before and 35% the year before that. On the headline numbers, the sector's doing great.

Move one layer down and the picture changes. In the same survey, only 8% of charities had built AI into service delivery. Around 2% had reached the most embedded stage, using it strategically with plans to scale. Most adoption is one person at a desk, typing into a chatbot to draft an email or summarise a document. The same meetings, the same sign-off chains, and the same underlying systems have been left exactly as they were. The tool is new, but the flows that work has to go through haven't changed at all.

Only 8% of charities have built AI into how they actually deliver services. Almost all the rest is one person typing into a chatbot.

Charity Digital Skills Report, 2025

That gap is what this paper's about. Our argument is that the version of AI being sold hardest to charities, the "agent" as a slightly more capable chatbot, can't close that gap, because it's aimed at the wrong problem. The problem isn't that people type too slowly. It's that organisations spend an enormous, invisible amount of energy holding themselves together. That energy is spent moving information between tools that don't talk to each other, reconciling spreadsheets, chasing the right person, or re-keying the same data into the fourth form this week.

This is the coordination tax. It's the classic vision of someone sat at their computer with two screens copying and pasting data from one spreadsheet to another. A chatbot might make that a bit faster but doesn't fix the pieces not fitting together properly.

There's a useful historical parallel here: electricity. When factories first electrified, they did the obvious thing and pulled out the steam engine to drop an electric motor in its place. They kept the same central drive shaft, the same belts running to every machine and the same processes. The gains were marginal. It took around forty years for manufacturers to understand that the point of electricity wasn't a cleaner engine in the middle of the room. It was that a small motor could now go exactly where each piece of work happened, so the whole floor could be redesigned around the work rather than around the power source.

Most charities, and most of the vendors selling to them, are still at the swapped-the-engine stage. Getting to the second stage is the rest of this paper. And it starts with why governance - far from being the brake - is the thing that makes the journey possible at all.

1. The faster-horses trap

Ten minutes with the marketing from the large platform vendors would suggest the agentic future has already arrived. Microsoft describes an "agentic enterprise." Salesforce, Google and OpenAI all sell "agents." The word is everywhere.

It pays to be precise about what's actually on offer, because the word is doing a lot of work. Microsoft markets a whole spectrum of things as agents, from a chatbot that returns a pre-written answer, to a retrieval system that searches an organisation's documents and summarises them, to event-triggered automation, to genuinely autonomous software that controls a browser. These are radically different in capability and in risk, and the marketing collapses them into a single term. Gartner has called this "agent washing," the rebranding of existing assistants, chatbots and automation as agentic without much new underneath. By Gartner's own count, of the thousands of vendors claiming to offer agentic AI, only around 130 are doing something that meaningfully deserves the label.

When most of what reaches a charity under the banner of "agent" is, in practice, a chatbot that can also run a web search, the honest description is that it's a faster horse. It helps an individual move a little quicker through tasks they already do. That's not worthless. But it leaves the shape of the work untouched, and the shape of the work is the problem.

The numbers from outside the sector should give pause to anyone about to spend scarce funds on the promise. A 2025 Gartner survey of organisations that had piloted Microsoft 365 Copilot found that only around 5% had moved to a wider deployment, with most pausing and, in the words of one Gartner analyst, "waiting it out." And only a small fraction of Microsoft's enormous commercial base actually pays for Copilot, despite Nadella publicly calling it a "true daily habit". The same Gartner analyst was blunt about why the perception runs ahead of the reality: "Microsoft's marketing has been brilliant. Execution, not as strong as the marketing." Gartner's headline prediction for the category is that more than 40% of agentic AI projects will be cancelled by the end of 2027, citing escalating costs, unclear business value and inadequate risk controls. The scepticism isn't only external: Andrej Karpathy, a co-founder of OpenAI, has called the current crop of agents "cognitively lacking" and reckons the moment is better described as "the Decade of the Agent" than the year of it.¹

More than 40% of agentic-AI projects will be cancelled by the end of 2027 - on runaway costs, unclear value and weak controls.

Gartner, 2025

For a sector running on donations and hard-won grants, every pound spent on a licence that ends up unused is a pound taken from the mission, and charities can least afford to buy the expensive version of the wrong thing. That's increasingly what they're being sold. None of this means the technology is empty. It means the dominant way it's being packaged and sold is aimed at the easy thing, the individual using a chatbot, rather than the hard thing, the organisation changing how work flows through it. The easy thing is what scales as a per-seat licence. The hard thing is what charities actually need.

2. The coordination tax

To see why the chatbot misses, it helps to be specific about what a charity spends its days doing.

In 1972, the British cybernetician Stafford Beer published a model of how any organisation that survives actually works. He called it the Viable System Model, and stripped of its jargon it makes a simple observation. The visible work, the part that delivers the mission, is only one of the things an organisation has to do. He called that operational work System 1: the support workers, the caseworkers, the fundraisers, the people doing the thing the charity exists to do.

For all those System 1 units to function without colliding, something has to coordinate them. Beer called this System 2, and he was precise about its job: it's the part that stops the autonomous units knocking into each other and setting up destabilising oscillations. The shared calendars, the standard templates, the handover notes, the "has anyone told the finance team" messages. Above that sits System 3, the part that optimises the whole as it runs day to day, allocating people and money across the operation and noticing when something is going wrong.

Here's the thing that matters for charities. In most organisations, System 2 and System 3 aren't designed. They're improvised by people. A fundraiser copies a gift from the donation platform into the CRM, then into a spreadsheet for the board report, then flags it to the thank-you team by email. A support worker takes notes in one system, summarises them into another for the handover, and re-enters the key facts into the funder's outcomes portal in a third format. Nobody planned this. It accreted, one tool and one well-meaning workaround at a time, and now a substantial share of the organisation's energy goes into being the human glue between tools that were never built to connect.²

The word "system" is doing two jobs here, and they're easy to run together. One is the everyday kind: the CRM, the spreadsheet, the case file, the funder's portal. The other is Beer's kind: the organisation as a whole, the thing that either coheres or doesn't. Most of the coordination tax is the human effort of making the everyday kind behave like Beer's kind.

Charities pay it at a punishing rate, for three reasons. They're chronically under-resourced, so there's never time to fix the plumbing. They handle unusually sensitive and unusually messy information, so the stakes of getting it wrong are high. And they tend to acquire software one grant-funded project at a time, leaving a patchwork of tools that each solve one problem and none of which talk to each other.

The tax is invisible on any budget line, which is precisely why it never gets addressed. No funder gives a grant for "the time our staff spend re-keying data." But it's real, and it's large. Put rough numbers on it, treating them as a back-of-the-envelope estimate rather than a measurement:³ a twenty-person charity with a fifth of everyone's time going on this kind of carrying is spending four full-time posts on shuffling information between tools instead of on the mission. And the errors mount up, because manual coordination across a dozen disconnected tools is exactly the sort of repetitive, attention-hungry work people are worst at, and the sort that quietly burns them out.

Four full-time posts' worth of time a 20-person charity can pour into shuffling information between tools - instead of into the mission.

A Fermi calculation based on sector knowledge (AKA a back-of-the-envelope estimate)
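The arithmetic behind that estimate is simple enough to write down, which also makes it easy to re-run with a charity's own numbers. What follows is a minimal sketch: the function name `coordination_posts` is ours, and the 10% and 30% shares are illustrative bounds we've added around the paper's one-fifth figure, not measurements.

```python
def coordination_posts(staff: int, share_of_time: float) -> float:
    """Full-time-equivalent posts absorbed by carrying information between tools.

    staff: headcount, treated as full-time equivalents.
    share_of_time: fraction of each person's time spent on coordination (0 to 1).
    """
    if staff < 0 or not 0.0 <= share_of_time <= 1.0:
        raise ValueError("staff must be >= 0 and share_of_time between 0 and 1")
    return staff * share_of_time


# The paper's case: twenty people, a fifth of everyone's time.
print(f"{coordination_posts(20, 0.2):.1f} full-time posts")

# Illustrative sensitivity check: the shares below are assumptions, not data.
for share in (0.1, 0.2, 0.3):
    print(f"{share:.0%} of time -> {coordination_posts(20, share):.1f} posts")
```

The point of the range is that even the cautious end leaves a twenty-person charity spending the equivalent of whole posts on carrying.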

It's tempting to think agents will just sort this out between themselves; they can talk to each other, after all. They don't. Rohit Krishnan built a simulated market, filled it with capable AI agents, and handed them everything: transaction costs "effectively zero", "perfect information", incentives all lined up. They still wouldn't trade. "Every firm decided to build everything internally", and even after he added reputation systems, penalties and price history, there were still zero trades. A market appeared only once he forced them to take part, with what he called "Gosplan-ish price hints".⁴ Coordination has to be designed. It doesn't emerge for free, not even among machines that can reason perfectly well, in conversation, about why it would pay. Much of the coordination tax is what David Graeber called "duct taping": work whose whole point is to patch a gap that shouldn't be there in the first place.

A chatbot does almost nothing about this. It can help the fundraiser write a nicer thank-you letter. It can't carry the gift across all four tools, reconcile the discrepancy when the platform total doesn't match the CRM, and flag the one record that looks wrong. The coordination tax is not a writing problem. It is an integration problem.⁵

3. The two curves

A faster horse does not close an exponential gap.

There's a reason this can no longer be filed under "nice to fix when we have time." The economics have changed underneath the sector.

In 1966, the economists William Baumol and William Bowen published a study of why live performing arts kept getting more expensive. Their explanation has come to be known as Baumol's cost disease, and it generalises far beyond music. They divided the economy into two kinds of sector. Progressive sectors, like manufacturing, get steadily more productive as technology improves: fewer worker-hours produce more output every year. Stagnant sectors can't, because the human effort is the service. Their example was a string quartet. It takes four musicians a fixed stretch of time to play a piece, and no technology shortens it. Yet those musicians' wages still have to rise over the decades, because they could otherwise leave for the progressive sectors where pay is climbing. So the cost of the performance rises relentlessly, without any matching gain in productivity.

Charities are a stagnant sector in Baumol's exact sense, sitting alongside healthcare, education and social care as the textbook cases. The work is people helping people. A safeguarding conversation, a benefits appointment or a bereavement call can't be made meaningfully "more productive" by speeding up the human in the room, and shouldn't be. But the wages, the rent, the insurance and the software subscriptions all rise with the wider economy regardless. This squeeze is a permanent headwind, and it has sharpened lately: rising employer National Insurance costs, falling numbers of donors, and demand climbing faster than income.⁶

Now set against that the other curve. The capability of AI systems is improving at a rate that has no precedent in the stagnant sectors. The research organisation METR measured how long a task an AI agent can complete on its own, and found that this length has been roughly doubling every seven months. The measurement deserves caution: METR itself has been careful to flag the uncertainty in it, and it's drawn largely from software tasks rather than the relational work charities do. But even discounted heavily, the direction is clear, and it's exponential.⁷ At the same time, the cost of building bespoke software with these tools has fallen sharply: work that cost half a million pounds three years ago can cost a fraction of that now.

Put the two curves on the same axis and the problem states itself. A charity's capacity to deliver, left to its own devices, climbs along the slow linear line that Baumol described. The capability available to it climbs along an exponential. Every year the gap between what's possible and what the sector actually does widens.
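To see how quickly the two shapes part company, here is a minimal sketch that puts numbers on them. Only the seven-month doubling comes from the METR figure quoted above. The 2% a year linear slope is an assumption chosen purely for illustration, and both curves are normalised to 1.0 at the start, so the output is a picture of shape rather than a forecast in comparable units.

```python
def capability(months: float, doubling_months: float = 7.0) -> float:
    """Relative task length an AI agent can complete, normalised to 1.0 at month 0."""
    return 2 ** (months / doubling_months)


def unaided_capacity(months: float, annual_growth: float = 0.02) -> float:
    """Linear growth in delivery capacity. The 2% a year slope is illustrative only."""
    return 1.0 + annual_growth * (months / 12)


for years in (1, 3, 5):
    months = years * 12
    gap = capability(months) / unaided_capacity(months)
    print(
        f"year {years}: capability x{capability(months):.1f}, "
        f"capacity x{unaided_capacity(months):.2f}, ratio {gap:.1f}"
    )
```

Changing the assumed slope moves the ratio but not the conclusion: any straight line falls behind a curve that doubles on a fixed schedule.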

There's a sleight of hand to watch for here, and it's worth naming because it's the hinge of the whole argument. METR measures what AI can do on software tasks; a charity's curve is its capacity to deliver. Those two only belong on the same axis if software capability can be turned into delivery capacity, and that conversion isn't automatic.⁸ A charity's ability to serve more is rarely capped by the front-line human moment, the conversation, the assessment, the visit, which can't be sped up and shouldn't be. It's capped by everything around that moment: the coordination tax, which is software-shaped work. That's the part a rising software curve can lift, but only if the work is redesigned to let it, which is exactly what most organisations don't do, and exactly why most attempts fail.

Figure: the two curves. Capability (what's possible) is what AI makes technically possible for a charity: flat for years, then a steep, exponential climb. With redesign (the prize) is the delivery a charity reaches only if the work is redesigned around that capability, not just sped up. Capacity, unaided (the default) is what a charity delivers if nothing changes: a slow, near-straight climb that a chatbot nudges but can't bend. The gap between capability and capacity only closes if the work is redesigned.

A chatbot moves a charity a small, fixed distance up the linear line; it doesn't change the slope. Changing the slope means a different process, not a faster version of the old one.

4. What an agent is, and what it's not

If the word "agent" has been stretched to cover everything from a canned chatbot reply to autonomous software, it helps to recover the definition that the people who actually build these systems use.

The clearest one is also the plainest. The engineer and writer Simon Willison, who spent years refusing to use the word without an eye-roll, settled in 2025 on a definition plain enough to use: "An LLM agent runs tools in a loop to achieve a goal." Three parts. A model, the thing that can read and reason over language. Tools, which let it actually do things: search a database, send an email, run a calculation, write to a system. And a loop, in which it takes a step, sees the result, and decides the next step, continuing until the job is done.
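Those three parts fit in a few lines of code. The sketch below is ours, not Willison's, and everything in it is hypothetical: `stub_model` is a hard-coded stand-in for a language model, and `lookup_donor` and `log_gift` are invented tools with made-up data. What it shows is the loop itself: choose a step, run a tool, look at the result, choose again.

```python
from typing import Callable

Step = tuple[str, str, str]  # (tool name, argument, result)


# Tools: plain functions the loop may call. Both are hypothetical examples.
def lookup_donor(name: str) -> str:
    return {"A. Patel": "donor-17"}.get(name, "not found")


def log_gift(donor_id: str) -> str:
    return f"gift logged for {donor_id}"


TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_donor": lookup_donor,
    "log_gift": log_gift,
}


def stub_model(goal: str, history: list[Step]) -> tuple[str, str]:
    """Stand-in for an LLM: picks the next (tool, argument) from what it has seen."""
    if not history:
        return ("lookup_donor", goal)
    last_tool, _, last_result = history[-1]
    if last_tool == "lookup_donor" and last_result != "not found":
        return ("log_gift", last_result)
    return ("done", last_result)


def run_agent(goal: str, max_steps: int = 5) -> list[Step]:
    """The loop: decide, act, observe, repeat until done or out of steps."""
    history: list[Step] = []
    for _ in range(max_steps):
        tool, arg = stub_model(goal, history)
        if tool == "done":
            break
        result = TOOLS[tool](arg)
        history.append((tool, arg, result))
    return history


for step in run_agent("A. Patel"):
    print(step)
```

The `max_steps` cap is a design choice worth keeping in any real version: a loop that decides its own next step also needs a limit it can't talk its way past.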

Anthropic, in its widely cited essay "Building Effective Agents," draws the distinction that matters most for a charity leader trying to see through the marketing. It separates workflows, "systems where LLMs and tools are orchestrated through predefined code paths," from agents, "systems where LLMs dynamically direct their own processes and tool usage." In a workflow, a person has fixed the sequence of steps in advance. In an agent, the model decides what to do next. Most of what is sold to charities as an agent is, by this definition, a workflow at best, and often just a single chatbot turn. That's not a criticism. Anthropic's own advice is to "start with simple prompts" and "add multi-step agentic systems only when simpler solutions fall short." Even Microsoft's engineering documentation, under the marketing, tells developers plainly: "If you can write a function to handle the task, do that instead of using an AI agent."

The important consequence is this. The model is close to a commodity. Everyone has access to broadly similar ones. The thing that determines whether an agent is useful, safe and trustworthy is everything around the model: the set of tools it's allowed to use, the guardrails on what it can and can't do, the points at which it must stop and ask a human, the record it keeps of everything it did. Engineers call this surrounding software the harness. It's where almost all the real work lives, and it's exactly the part a per-seat chatbot licence doesn't include.
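As a rough illustration of what a harness contains, the sketch below wraps every tool call in the three checks just listed: is the tool allowed, does it need a human's say-so, and has the attempt been recorded. The class name `Harness`, its fields and the example tools are all our own inventions for illustration, not any vendor's API, and a real harness would carry far more (identity, data access rules, error handling).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Harness:
    tools: dict[str, Callable[[str], str]]   # the allowlist: nothing else can run
    needs_approval: set[str]                 # tools that must stop and ask a human
    approve: Callable[[str, str], bool]      # the human decision (hypothetical hook)
    audit_log: list[dict] = field(default_factory=list)

    def call(self, tool: str, arg: str) -> str:
        """Run a tool only if allowed and approved; record every attempt either way."""
        if tool not in self.tools:
            outcome = "blocked: tool not allowed"
        elif tool in self.needs_approval and not self.approve(tool, arg):
            outcome = "blocked: human declined"
        else:
            outcome = self.tools[tool](arg)
        self.audit_log.append({"tool": tool, "arg": arg, "outcome": outcome})
        return outcome


harness = Harness(
    tools={
        "read_record": lambda record_id: f"record {record_id}",
        "send_email": lambda to: f"sent to {to}",
    },
    needs_approval={"send_email"},
    approve=lambda tool, arg: False,  # stand-in reviewer who declines everything
)

print(harness.call("read_record", "42"))
print(harness.call("send_email", "supporter@example.org"))
print(harness.call("delete_everything", "now"))
print(len(harness.audit_log), "attempts recorded")
```

Note that blocked attempts are logged as well as successful ones. The record is only useful for accountability if it shows what the system tried to do, not just what it was allowed to do.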

It also explains why almost every agent demo that dazzles is about code. When the team at Ramp turned a coding agent loose inside the game RollerCoaster Tycoon, it reliably pulled the digital levers, setting prices, opening and closing rides, but came apart when it had to lay the paths joining them up. Their conclusion was that the real limit on general-purpose agents is the legibility of their environments, and that it's more honest to think of them as "automating diligence, rather than intelligence". Coding is the easy case because its world is already legible: the terminal is text, the domain language models are built on, as Alex Shaw, who builds a benchmark for coding agents, puts it. A charity's tangle of half-connected tools isn't that world, and the obvious remedy, giving the model ever more context, doesn't save it: accuracy falls away as the input grows, even on simple tasks.⁹

Here's where the electricity parallel becomes concrete. A chatbot bolted onto an existing process is the new motor wired to the old layout: it runs, it's marginally better, nothing is rearranged. An agent with a real harness is a small motor placed where a piece of work actually happens, wired into the specific tools and data that work touches, doing the carrying a person used to do by hand. The first is a component swap; the second is a redesign of the floor.

There's a reason the platform vendors are better at selling the first than the second, and it isn't incompetence. The first scales as an identical product sold to millions of seats. The second has to be fitted to a particular organisation's own tools, data and risks. We'll come back to why that matters for who should be expected to build it.

5. From recipes to workflows

The work the person was doing was never the classifying or the drafting. It was the carrying.

We've spent the last few years cataloguing what charities can actually do with AI. The result was a library of 89 practical recipes, each one a single, well-scoped task: review a funding bid before submission, find relevant grants automatically, turn a case study into accessible formats, classify incoming enquiries. They're useful, and the discipline of writing them was clarifying. Each recipe is, in effect, one small motor that a person picks up and operates by hand. Useful, repeatable, and entirely dependent on a human to carry the work from one recipe to the next.

The shift we're describing in this paper is the move from operating those motors one at a time to wiring them into the floor. Instead of a person running the "classify this enquiry" recipe, then the "draft a response" recipe, then re-keying the result into the case-management system, an agent runs the loop: it takes the enquiry, classifies it, drafts the routing, checks it against the rules, and presents it for a human to approve, with every step recorded. The work the person was doing was never the classifying or the drafting. It was the carrying. That's the coordination tax, and that's what the agent absorbs.

Donella Meadows spent a career mapping the points where a system can be changed, and ranked the options from feeble to powerful. Running the same process a bit faster sits near the feeble end. Changing the structure, and the way information flows through it, sits near the powerful end. A chatbot is a feeble-end move. Rebuilding how the work flows is a structural one, and the structural changes are the ones that move the numbers.

It's the same lesson the factories eventually learned, and it took them around forty years to learn it. The economist Paul David documented it in his study of the electrification productivity paradox. Electric power was available from the 1880s, but the productivity gains didn't arrive until the 1920s, because for decades factories used the new power in the old way. The breakthrough came when they abandoned the single central engine driving one long shaft and switched to a separate motor on each machine. Only then could they put machines in the order the work actually flowed, build single-storey, and design the floor around the job rather than around the location of the power.

Economists call electricity, steam, the computer and now AI general-purpose technologies: their value comes from that downstream rebuild, not the core invention, which is why one can be everywhere and still missing from the productivity figures. AI turned up the day before yesterday, and the rebuild has barely started.

The pattern isn't unique to electricity. Containerised shipping required almost no new invention, only reorganisation, and it cut the cost of loading a ton of cargo from roughly $5.83 to $0.16. What changes an industry is seldom the device; it's the rebuild around it. The writer Kevin Baker puts the failure mode well: these tools "did not arrive as disruptors. They arrived as intensifiers", an accelerant for whatever was already running, which is exactly why bolting one onto an unchanged process changes so little.

We saw a small version of this with Breast Cancer Now. An eight-person team was hand-transcribing patient surveys, around 350 data points per form, across more than twenty NHS trusts that each used a different layout. The bottleneck wasn't effort or skill. It was that manual transcription couldn't scale, and hiring wasn't an option. The system we built reads the forms across every layout, extracts the data, and presents it for a team member to review and approve. The team didn't get a chatbot. They got a system that happens to use AI under the hood, and the part they kept was the judgement: the relationships with the trusts, the edge cases, the quality call. The most important detail is that the people using it don't interact with an AI at all. They interact with their work, redesigned.

Around 350 data points per form, hand-transcribed across 20+ NHS trusts that each used a different layout - until the work itself was redesigned.

Breast Cancer Now

We should be honest about how rare this still is, because the marketing implies it's everywhere and the evidence says otherwise. Across the wider economy, Deloitte found only around 11% of organisations exploring agentic AI had anything in production. A widely discussed MIT study reported that the large majority of generative AI pilots delivered no measurable financial return, though that figure comes from a non-peer-reviewed industry report and should be held loosely.¹⁰

Where the work has actually been rebuilt, instead of a chatbot bolted on top, the results are real and large. JPMorgan rebuilt a single tedious process, the interpretation of commercial-loan agreements, and took around 360,000 lawyer-hours a year out of it. Novo Nordisk redesigned how it drafts clinical study reports for regulators, turning a job that ran to about twelve weeks into one that takes minutes, with a handful of expert reviewers in place of a team of around fifty. Commonwealth Bank built an agent that watches its transaction data for new fraud patterns and writes the rules to block them, and reported fraud losses down more than a fifth year on year, with human analysts still reviewing the rules before they go live. Even in the NHS, where caution is rightly higher, several London trusts now use AI to pre-read chest X-rays, cutting the wait for an urgent scan from around two weeks to about five days, and pushing the ones that show something significant further up the queue. None of these is a chatbot rollout. In each, somebody took one process apart and put it back together differently, and left a person where the judgement had to happen.

Around 360,000 lawyer-hours a year taken out of one process - reading commercial-loan agreements - once it was rebuilt, not just sped up.

JPMorgan

These are big organisations with big engineering teams, and a charity is neither. But they aren't here as "be JPMorgan." They're here because the gain came from reorganising one process, not from scale and not from any particular model: JPMorgan's was a pre-LLM machine-learning system, not a chatbot at all. What's changed since is that the cost of doing that reorganising has collapsed, which is what brings the small version, Breast Cancer Now's, within a charity's reach.

Those four aren't the norm, and it would be a cheat to pretend otherwise. Most attempts fail, and these are the survivors. They survived by doing dull things: keeping the scope narrow, getting the data into shape before a model ever saw it, and leaving a person on the decisions that mattered. The town of Kyle, Texas put an agent on its non-emergency service line and resolved most of more than 12,000 requests at first contact, because the request types were well defined, the data was public, and there was a clear path to a human. Coding agents like Devin do real, multi-hour work, but only on well-specified, bounded tasks; their own makers admit they can't take on an ambiguous project end to end the way a senior engineer can.

Given those odds, buying the faster horse is a perfectly defensible move. A charity with nobody to build or maintain software, handing everyone a chatbot that saves them a few minutes a day, has done something sensible and low-risk, and given the failure rate it's often the safer bet this year. The chatbot is a floor, though, not a ceiling. The gains big enough to change what a charity can actually do sit above it, in the work nobody has rebuilt yet.

The reading for charities isn't "deploy an agent for casework." A sprawling, autonomous agent for something as judgement-laden and high-stakes as casework is precisely the kind of large-scale transformation that fails most of the time.¹¹ The reading is: find a narrow, repetitive, well-defined slice of the coordination tax, wire an agent into it, keep a person at the gate, and record everything. Then do the next one. That's how a floor gets redesigned, one motor at a time.

To make that repeatable, we found we kept building the same operational skeleton underneath each one: work arrives, is classified and routed, is acted on, moves through clear stages, a human reviews at a gate, it resolves, and the whole thing is recorded. Whether the job is theming free-text feedback, triaging incoming donations, extracting eligibility criteria from grant documents, routing a safeguarding concern, or drafting a first-pass report from project data, the shape is the same. The AI does the carrying inside that skeleton, under supervision. That skeleton is most of what an honest "agent for charities" actually is. And the most important parts of it aren't the clever bits. They are the gate and the record.
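A stripped-down version of that skeleton might look like the sketch below. It isn't our production code: the stage names, the `WorkItem` class and the keyword rule standing in for a model call are all simplifications invented for illustration. The two parts to notice are the ones the paragraph singles out: nothing resolves without passing `human_gate`, and every move is appended to the record.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    ARRIVED = "arrived"
    CLASSIFIED = "classified"
    DRAFTED = "drafted"
    AWAITING_REVIEW = "awaiting_review"
    RESOLVED = "resolved"
    RETURNED = "returned"


@dataclass
class WorkItem:
    text: str
    stage: Stage = Stage.ARRIVED
    category: str = ""
    draft: str = ""
    record: list[str] = field(default_factory=list)

    def move(self, stage: Stage, note: str) -> None:
        """Change stage and write the change to the record in the same step."""
        self.stage = stage
        self.record.append(f"{stage.value}: {note}")


def classify(item: WorkItem) -> None:
    # A keyword rule stands in for the model call here (assumption for the sketch).
    item.category = "safeguarding" if "worried" in item.text.lower() else "general"
    item.move(Stage.CLASSIFIED, f"category={item.category}")


def draft_action(item: WorkItem) -> None:
    item.draft = f"Route to {item.category} lead"
    item.move(Stage.DRAFTED, item.draft)
    item.move(Stage.AWAITING_REVIEW, "held for human review")


def human_gate(item: WorkItem, reviewer: str, approved: bool) -> None:
    """The gate: an item can only resolve from the review stage, by a named person."""
    if item.stage is not Stage.AWAITING_REVIEW:
        raise ValueError("nothing resolves without passing the gate")
    outcome = Stage.RESOLVED if approved else Stage.RETURNED
    item.move(outcome, f"reviewed by {reviewer}")


item = WorkItem("I'm worried about a neighbour who hasn't been seen for days")
classify(item)
draft_action(item)
human_gate(item, reviewer="duty officer", approved=True)
print("\n".join(item.record))
```

Swapping the keyword rule for a real model changes one function. The stages, the gate and the record stay exactly as they are, which is why the skeleton can be reused from one job to the next.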

6. Governance is the engine mount

A chatbot suggests, and a bad suggestion can be ignored. An agent acts, and the bad action has already happened before anyone notices.

This is the centre of the argument, so we'll state it plainly. For a charity, governance is not the brake on agentic AI. It is the engine mount: the thing that lets the motor run hard without shaking itself off the bench. Without it, there's no safer-but-slower system; there's a system that can't responsibly be turned on at all.

Charities are right to be cautious about data and about automated decisions affecting the people they serve. This caution is sometimes treated as a backwardness to be overcome. The opposite is true: it's the correct response to the actual stakes. Many charities hold some of the most sensitive information that exists about some of the most vulnerable people there are: health conditions, financial hardship, immigration status, experiences of abuse, criminal records. In data protection law this is special category data, and it carries the highest obligations precisely because the harm from getting it wrong is severe. Trust isn't a soft value here. It's the entire operating licence. A charity that loses that trust has lost the ability to do its work.

The cautionary tales aren't hypothetical, and the striking thing about them is that none was a failure of the technology. Each was a failure of governance.

In the United States, the COMPAS algorithm was used in criminal courts to score how likely a defendant was to reoffend. A 2016 investigation by ProPublica found that Black defendants were almost twice as likely as white defendants to be wrongly flagged as high risk. The deeper lesson, which researchers established afterwards, is subtler and more important: there are several mathematically coherent definitions of "fair," and when the underlying rates differ between groups, they can't all be satisfied at once. Which definition of fairness is being optimised for is a choice. COMPAS made that choice implicitly, and nobody had to declare it or defend it before its scores were informing decisions about people's liberty.
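The conflict between definitions can be shown with a few lines of arithmetic. The sketch below uses an identity that follows directly from the definitions of the three rates: once a group's base rate, the score's precision (the share of "high risk" flags that are correct) and its miss rate are fixed, the false-positive rate is fixed too. The numbers are invented for illustration and are not COMPAS figures.

```python
def false_positive_rate(base_rate: float, ppv: float, fnr: float) -> float:
    """False-positive rate implied by a group's base rate, precision and miss rate.

    From the definitions:
        FPR = base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)
    base_rate: share of the group who do go on to reoffend.
    ppv: share of people flagged high risk who do reoffend (precision).
    fnr: share of people who reoffend but were not flagged (miss rate).
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)


# Same precision and same miss rate for both groups; only the base rate differs.
group_a = false_positive_rate(base_rate=0.5, ppv=0.6, fnr=0.3)
group_b = false_positive_rate(base_rate=0.3, ppv=0.6, fnr=0.3)

print(f"group A wrongly flagged: {group_a:.1%}")
print(f"group B wrongly flagged: {group_b:.1%}")
```

With these illustrative inputs, a score that is "equally accurate" for both groups by one definition wrongly flags people in group A at more than twice the rate of group B. Equalising the false-positive rates instead would force the precision or the miss rate apart, which is why the choice has to be made by someone, out loud.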

In England in 2020, an algorithm was used to award A-level grades when exams were cancelled. It downgraded around two in five results, and it hit state school pupils far harder than private school pupils, because it leaned on each school's historical performance. If no one from a pupil's school had reached the top grade in recent years, the model made it close to impossible for that pupil to receive it, whatever the teachers said. It was reversed within four days, but only after the damage and the protests. In Australia, the Robodebt scheme issued automated debt notices using a method its own lawyers knew was unlawful, and progressively removed the human review that would have caught the errors. The Royal Commission called it "a crude and cruel mechanism, neither fair nor legal."

None of those three was an LLM, and none was a charity; they were government scoring systems, and an agent fails in its own ways. They don't map neatly onto a charity's feedback-theming tool. What carries over is the shape of the failure. In each case, meaningful human oversight was removed in the name of efficiency, the bias was never tested for before the system ran at scale, and there was no clear chain of accountability when it went wrong. That shape has a name. Dan Davies, working in Stafford Beer's cybernetic tradition, calls it an accountability sink: a process arranged so that, by following the rules to the letter, no one is ever left answerable for the result. Nobody ordered the A-level outcome; in Davies's telling it was everyone doing their job exactly as required, and a disaster emerging all the same.

Each failure has a direct governance remedy: an impact assessment that would have surfaced the risk before launch, bias testing that would have revealed the disparity, genuine human oversight that would have caught the errors, and an audit log that would have made someone answerable. No charity sets out to build the next COMPAS. Charities arrive there by treating governance as paperwork to be added at the end, after the interesting part is built.

The law has just moved, and charities aren't exempt from it. The Data (Use and Access) Act 2025 replaced the old rules on automated decision-making in the UK, with the change taking effect in February 2026. The previous default prohibited solely automated decisions with significant effects on people unless a narrow exception applied. The new regime flips that: such decisions are now permitted, provided specific safeguards are in place, including the right to be informed, to make representations, to obtain human intervention and to contest the outcome. For special category data, the data charities hold most of, the restrictions remain tight. The practical effect is that the legal burden has shifted onto the organisation to demonstrate its safeguards, rather than onto the law to forbid the activity. Building those safeguards in is no longer just good practice. It's increasingly how an organisation stays lawful.

There's one more reason governance can't be an afterthought with agents specifically, and it's the difference between an agent and a chatbot. A chatbot suggests, and a bad suggestion can be ignored. An agent acts, and the bad action has already happened before anyone notices.¹² In one well-documented incident in 2025, a coding agent run by the platform Replit deleted a live production database during an explicit freeze on changes, then generated false logs and initially misled the user about whether the data could be recovered. Jason Lemkin, the software founder running the experiment, asked the obvious question: "How could anyone use it in production if it ignores all orders and deletes your database?"

The agent could wipe a live database because the system around it let it. There was no wall between the place it was free to experiment and the production data, and no hard stop in front of an irreversible command. Those are ordinary controls, the kind any production system is meant to have, and they were missing. Calling the model reckless is like watching a bridge come down because someone botched the tolerances and deciding that Brunel was a fool and suspension bridges don't work. What an agent does change is the kind of mistake on offer: it acts fast, it repeats itself, and it skips the half-second of doubt a person would have had before running a destructive command. Once software can act on real records about real people, the architecture around it is not background plumbing. It is most of the safety there is.
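
Those ordinary controls are small when written down. The sketch below is an illustration under stated assumptions, not a description of Replit's system or of any particular product: `guard`, `Context`, `Blocked` and the list of irreversible verbs are hypothetical names, and matching on the text of a command is the crudest possible check.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical list of verbs treated as irreversible in this sketch.
IRREVERSIBLE = re.compile(r"^\s*(drop|truncate|delete)\b", re.IGNORECASE)


class Blocked(Exception):
    """Raised when the guard refuses to pass a command on."""


@dataclass(frozen=True)
class Context:
    environment: str                   # "sandbox" or "production"
    change_freeze: bool                # True while a freeze on changes is declared
    approved_by: Optional[str] = None  # the named person who signed off, if any


def guard(command: str, ctx: Context) -> str:
    """Return the command if it may run; raise Blocked otherwise."""
    if ctx.environment != "production":
        return command  # the sandbox is where the agent is free to experiment
    if not IRREVERSIBLE.match(command):
        return command  # reversible work in production is not this guard's job
    if ctx.change_freeze:
        # A freeze is a hard stop: no approval overrides it.
        raise Blocked("change freeze in force: irreversible commands are refused")
    if ctx.approved_by is None:
        raise Blocked("irreversible command in production needs a named approver")
    return command


if __name__ == "__main__":
    frozen = Context("production", change_freeze=True, approved_by="on-call lead")
    try:
        guard("DROP TABLE donors", frozen)
    except Blocked as reason:
        print(f"refused: {reason}")
```

In practice the firmer wall sits in credentials rather than code: the account the agent experiments with has no rights on production at all, so a check like this is a second line of defence, not the first.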

So what does governance built in from day one actually look like? When we built our own framework for deploying these systems in charities, we treated this not as a feature but as the constraint that shaped every other decision. Concretely, that meant:

Personal data is stripped before the model ever sees it. Identifiers like names, postcodes and NHS numbers are redacted at the point of processing, on the principle that over-redacting is harmless and a leaked identifier is not. The model works on the case, not the person.

Every decision is recorded in an append-only audit log that can't be quietly edited. Each entry carries what went in, which version of the model and which version of the instructions produced the output, the result, and the confidence. When a funder, a regulator or a trustee asks "why did the system do that," there's an answer, traceable to the specific decision.

A data protection impact assessment is generated for each deployment, tied to a declared risk level. Low-stakes work, where nothing irreversible happens to an individual, is governed lightly. High-stakes work requires a trustee-level briefing before it goes anywhere near a real person. The oversight is matched to the stakes, instead of being applied as a uniform tax on everything or skipped entirely.

A human reviews at a gate, and the gate is matched to the risk. Some outputs publish directly. Some require a person to approve before anything leaves the system. Some require approval before any external action is taken at all. Crucially, the human review has to be real. The most common way oversight fails in practice is by turning a person into a rubber stamp, clicking approve on fifty items an hour until they're skimming and the errors flow through anyway. Meaningful oversight means a person with the competence, the authority and the time to actually disagree with the system at the points that matter.

The data stays in the UK, the right to erasure is honoured within a defined window, and every external service the data touches is named, not buried.
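
To make the first two items concrete, here is a minimal sketch in Python. It is not the framework described above: the function and class names, the regular expressions and the field names are all assumptions made for illustration. The redaction errs towards over-matching, and the log chains each entry to the hash of the one before it, so an edit to an old entry is detectable rather than silent.

```python
import hashlib
import json
import re
from datetime import datetime, timezone

# Deliberately loose patterns: over-redacting is treated as harmless.
POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b", re.IGNORECASE)
NHS_NUMBER = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b")  # any 10-digit number


def redact(text: str, known_names: list[str]) -> str:
    """Replace identifiers with placeholders before text reaches a model."""
    text = NHS_NUMBER.sub("[NHS_NUMBER]", text)
    text = POSTCODE.sub("[POSTCODE]", text)
    for name in known_names:  # names taken from the case record (an assumption)
        if name.strip():
            text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    return text


def _digest(body: dict) -> str:
    """Stable SHA-256 over an entry's contents."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


class AuditLog:
    """In-memory, append-only log in which each entry is chained to the last."""

    def __init__(self) -> None:
        self._entries: list[dict] = []

    def append(self, *, redacted_input: str, model_version: str,
               instructions_version: str, result: str, confidence: float) -> dict:
        entry = {
            "at": datetime.now(timezone.utc).isoformat(),
            "input": redacted_input,
            "model_version": model_version,
            "instructions_version": instructions_version,
            "result": result,
            "confidence": confidence,
            "prev_hash": self._entries[-1]["hash"] if self._entries else "",
        }
        entry["hash"] = _digest(entry)  # hash covers everything above
        self._entries.append(entry)
        return entry

    def verify(self) -> bool:
        """True only if no entry has been altered, removed or reordered."""
        prev = ""
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev or entry["hash"] != _digest(body):
                return False
            prev = entry["hash"]
        return True


if __name__ == "__main__":
    note = "Jane Doe of SW1A 1AA, NHS no. 943 476 5919, called."
    log = AuditLog()
    log.append(redacted_input=redact(note, ["Jane Doe"]), model_version="model-a",
               instructions_version="v3", result="theme: housing", confidence=0.82)
    print(log.verify())
```

A hash chain makes tampering evident; it does not prevent it. A real deployment would also keep the log somewhere the agent itself cannot write to, and would need more than a list of known names to catch every mention of a person.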

Picture it working, rather than absent. A charity runs an agent that triages incoming safeguarding concerns by urgency. One morning it misreads a carefully worded disclosure as low priority. Because there's a gate, a caseworker sees it before anything is routed, overrides it, and escalates. Because there's an audit log, the charity can see the agent has started mishandling that kind of phrasing, and can fix the instructions before it recurs. Nothing reached the world.
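
That scenario can be sketched in a few lines. The names below (`Risk`, `Proposal`, `Decision`, `route`, `override_rate`) are hypothetical, and the two risk levels stand in for whatever levels a deployment actually declares. The point of the sketch is the ordering: for high-risk work, the person decides before anything is routed, and the override is recorded so that a pattern of overrides can be noticed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Risk(Enum):
    LOW = "low"    # nothing irreversible happens to an individual
    HIGH = "high"  # a person must decide before anything is routed


@dataclass(frozen=True)
class Proposal:
    case_id: str
    priority: str  # the agent's suggestion, e.g. "low" or "urgent"


@dataclass(frozen=True)
class Decision:
    case_id: str
    priority: str
    decided_by: str
    overridden: bool


# A reviewer reads the proposal and returns the priority they decide on.
Reviewer = Callable[[Proposal], str]


def route(proposal: Proposal, risk: Risk, reviewer: Optional[Reviewer] = None,
          reviewer_name: str = "human") -> Decision:
    """Apply the gate that matches the declared risk level."""
    if risk is Risk.LOW:
        return Decision(proposal.case_id, proposal.priority, "agent", False)
    if reviewer is None:
        raise RuntimeError("high-risk work is not routed without a human reviewer")
    final = reviewer(proposal)
    return Decision(proposal.case_id, final, reviewer_name, final != proposal.priority)


def override_rate(decisions: list[Decision]) -> float:
    """Share of human-reviewed decisions where the person disagreed."""
    reviewed = [d for d in decisions if d.decided_by != "agent"]
    return sum(d.overridden for d in reviewed) / len(reviewed) if reviewed else 0.0


if __name__ == "__main__":
    misread = Proposal("case-17", "low")  # the agent's mistaken reading
    decision = route(misread, Risk.HIGH, reviewer=lambda p: "urgent",
                     reviewer_name="caseworker")
    print(decision)
```

A rising override rate on one kind of case is the signal that the instructions need fixing. A rate stuck at zero across hundreds of reviews is worth a look too, since it may mean the reviewer has become a rubber stamp.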

There's a version of human in the loop that's awful. It's the one where a person is expected to keep up with the machine, catching its mistakes as fast as the computer can make them, and that gets called oversight. In that version you're the human in the loop, and the loop is moving at the machine's speed. You want to be in the loop, but not like that.

That is the whole point, and it's worth being plain about what it buys. None of this governance is the price of admission, paid to be allowed to start. It's what lets a charity go further than it otherwise could. Redaction is what makes it safe to point AI at casework data nobody would otherwise dare touch. The audit log is what lets a trustee, a funder or the regulator say yes, because the answer to "how would we know if it went wrong" is already on the table. Proportionate review keeps the low-stakes work moving fast, instead of crawling through sign-off meant for the high-stakes kind. The gate is the only reason an agent can act on real cases at all. Governance built in from the start is the difference between an AI project the board kills at the first hard question and one it backs. It's the permission slip, not the paperwork.

We're not describing this to sell the framework. We're describing it because it's proof of a claim that's otherwise easy to wave away: that governance for agents is buildable, that it can be concrete, and that it can sit at the foundation instead of being bolted on after a scandal. The slogan "governance from day one" means nothing until someone produces the audit log. This is what it looks like when it's real.

7. Why the incumbents can't sell the real thing

The vendors are doing the rational thing for their model. It just isn't the thing charities need.

If the thing charities actually need is governed, integrated, organisation-specific software rather than another per-seat chatbot, an obvious question follows. Why aren't the largest and best-resourced companies in the world building it?

Clayton Christensen gave the classic account of this in 1997, in "The Innovator's Dilemma". His later books stretched the idea further than it would go, but the original was built on careful work on the disk-drive industry, and the core of it holds up. Successful incumbents, he found, are undone by doing what good managers are told to do: listening to their best customers, building what those customers ask for, and turning down the low-margin, awkward opportunities that don't fit the model. That works right up to the moment the awkward opportunity turns out to be where the value went.¹³

The per-seat chatbot is the platform vendors' best customer talking. It's a clean, identical product sold to hundreds of millions of seats, and its economics are beautiful. Governed, integrated, organisation-specific agents are the awkward opportunity: every deployment has to be fitted to a particular charity's own tools, data and risks, the margins are worse, and it doesn't scale as a single product line. The licensing architecture being built around the new agent products shows which one the vendors are optimising for. Costs are layered across base licences, per-seat add-ons, consumption credits that are hard to forecast, and a separate governance tier, with contractual terms that make it hard to reduce spend once an organisation has committed. The headline price of a flagship AI assistant is widely quoted at one figure; the realised cost, once the prerequisite licences are added in, runs appreciably higher - by some analyses half as much again or more, before any of the agent's actual work is paid for through consumption. The consumption is also unpredictable in a way no per-seat buyer can control: the team behind the agent Manus reported in 2025 that cached input tokens cost a tenth of uncached ones, so a real agent's bill turns on its cache-hit rate rather than its headline price, with input running to output at roughly a hundred to one.
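
A worked example shows why that bill is hard to forecast. The prices below are placeholders, not any vendor's quote; the only figures taken from the text are the tenfold discount on cached input and the hundred-to-one ratio of input to output. The function name and parameters are illustrative.

```python
def run_cost(output_tokens: int, cache_hit_rate: float, *,
             input_price_per_m: float = 3.00,    # placeholder price per million input tokens
             output_price_per_m: float = 15.00,  # placeholder price per million output tokens
             input_per_output: int = 100,        # roughly a hundred to one, per the text
             cached_fraction_of_price: float = 0.10) -> float:
    """Estimated cost of one agent run, in the same currency as the prices."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    input_tokens = output_tokens * input_per_output
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    # Cached input is billed at a tenth of the uncached price.
    billable_input = uncached + cached * cached_fraction_of_price
    input_cost = billable_input * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    for hit_rate in (0.0, 0.5, 0.9):
        print(f"cache-hit rate {hit_rate:.0%}: {run_cost(10_000, hit_rate):.2f}")
```

With these placeholder prices, the same run costs more than four times as much on a cold cache as it does at a 90% hit rate, and nothing about the per-seat licence has changed. The hit rate depends on how the agent is built and used, which is not something the buyer of a seat controls.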

For a charity, this is close to the worst possible shape of cost: large, recurring, hard to forecast, and difficult to exit, attached to a product that, on the vendors' own adoption numbers, most organisations never roll out beyond a pilot. The sector is being asked to pay full price for a capability it's mostly not receiving, while the work that actually drains it goes untouched. That's no knock on the engineers who build these tools, many of which are genuinely impressive. It's what happens when a product designed to be sold identically to everyone meets an organisation whose real problem is specific to it. This isn't an accusation of bad faith. It's the innovator's dilemma working exactly as described. The vendors are doing the rational thing for their model. It just isn't the thing charities need.

Charities have been here before, and recently. The content management systems that changed what organisations could do weren't the ones that made publishing a page a bit faster. They were the ones that made a cognitive leap: treating content not as pages glued to a single website but as structured, portable data, fields the organisation owns and can push anywhere, expressed in JSON, YAML or Markdown. Contentful, Sanity and the headless platforms that followed took content out of one vendor's templates and made it the organisation's own. The same leap is on offer now, aimed at operations rather than content: stop treating the work as tasks locked inside a vendor's chatbot, and start treating it as structured, owned, governed workflows. The agent vendors already grasp this; Salesforce has agreed to buy Contentful precisely to give its agents a structured content layer to draw on. The same misreading, content as a faster website then, agents as a faster chatbot now, is on offer again.

That gap is what we set out to fill, and it's why we built what we built. The model is deliberate, and it follows the same logic. Instead of a single hosted product that every charity rents, with its data locked inside our walls, we built an installable framework, with the intention of opening it up as shared infrastructure for the sector. Each charity owns its own deployment, its own data, its own audit trail, with an agency as the reference implementer to get it running. The reusable scaffolding (the governance, the audit, the human-in-the-loop gates, the run lifecycle) comes for free; the bespoke parts, the inevitable quirks of a charity's own tools, are ordinary code on top. Rails where rails help, and no walls where walls would only get in the way.

Charities have a scar here worth naming. Many were burned, in the years before good SaaS, by bespoke systems that arrived late, cost a fortune, and rotted the moment the contractor walked away. That fear is rational, and any honest version of this argument has to reckon with the total cost of owning software, not just building it. What's different now is what the framework carries: the governance, the audit, the gates and the run lifecycle are shared scaffolding, maintained once rather than rebuilt for each charity, so what any one organisation owns and maintains is the thin bespoke layer on top, not the whole machine.

It's only fair to name another real alternative. For wiring a job together, the off-the-shelf option that actually competes isn't a chatbot; it's the automation tools, Zapier, Make, n8n, that already connect one system to another, and for low-stakes plumbing they may be all a charity needs. Two things separate a quick automation from something a charity can run on sensitive work. The first is governance: an automation that pipes beneficiary data between services has no redaction, audit or erasure built in, and inherits every data-protection duty with none of the safeguards. The second is that service design is hard. An automation is only ever as good as the person who mapped the process, and mapping a charity's real, messy process is the work itself, the part no tool does for anyone. The governance and the design are the job; the automation is the easy part.

The wider point isn't that every charity should build. It's that buy, build or assemble is a decision to make deliberately, against the actual need, not default into because an off-the-shelf "agent" was the only thing on the table. Sometimes off-the-shelf is right. Increasingly, for the work that actually drains a charity, the answer is something fitted to it, owned by it, and governed from the start.

8. What to do

None of this requires a charity to become a technology company. It requires a different first question. Not "which AI tool should we buy," but "where in our work are people acting as the glue, and what would it take to do that part differently." A few principles follow from everything above.

Start with the problem, not the technology. The person who understands the work is more valuable here than the person who understands the model. Map where the coordination tax falls: the re-keying, the reconciling, the chasing, the place where the same record gets typed into yet another system. That list will be specific to each organisation.

Pick a job, not a chatbot.¹⁴ Choose one narrow, repetitive, well-defined slice of that work, the kind of thing that has clear inputs, clear outputs and a clear notion of "done." Theming feedback, triaging enquiries, extracting structured data from documents, drafting a first-pass report. Get one of those into stable, supervised production before reaching for the next. Narrow and finished beats broad and aspirational, by a wide margin, in all the evidence on what actually works.

Make governance the first design decision. Before asking whether the AI can do the task, ask what could go wrong, who could be harmed, how anyone would know, and what the fallback is. Decide where the human gate sits, and make sure that person can actually say no. Insist on an audit trail a trustee or regulator could inspect. If a vendor can't produce the log, treat that as the answer to whether its governance is real.

Hold goals firmly and methods loosely. The capability curve is moving fast enough that a three-year transformation plan will be wrong by year one. Set the goal and review the how every quarter. A task that belonged firmly to a human twelve months ago may be one that can responsibly move to a supervised system today, and the reverse can also be true.

Don't confuse AI literacy with changed work. Training everyone to use a chatbot is worth doing, but it's the faster-horse move. It raises the floor of individual productivity and leaves the structure untouched. The measure of success isn't how many people have been trained. It's whether the work itself is better, and whether people are spending more of their time on the mission and less on the glue.

The mission, and the gap

We opened with two curves, the slow linear climb of what a charity can do unaided and the exponential climb of what's becoming possible, and the widening gap between them. It's worth being clear about what that gap actually is.

It isn't a technology gap. It's a question of how much of the mission actually gets done. A charity that closes the gap can reach more people, cover more ground, answer more of what comes through the door, and spend more of its scarce human attention on the work only people can do. A charity that doesn't will watch the same money and the same staff buy less and less of its mission every year, as the cost disease grinds on and the coordination tax compounds. The ones that learn to redesign the work, rather than just speed it up, will pull steadily ahead, the way the factories that rebuilt around the motor pulled ahead of the ones that only swapped the engine.

The temptation the vendors are selling is the easy half: the faster horse. The harder, more valuable thing is to wire the work differently, under governance solid enough that the people a charity answers to, the beneficiaries, the supporters, the funders, the public, would be reassured rather than alarmed to learn how it works.

That last condition isn't a constraint on the mission. For a charity, it is the mission. The reason to put an agent to work was never to move a decision somewhere no one can be held to it, the accountability sink all over again. It's to take the carrying off people, so that more of them, and more of their attention, goes to the thing the charity exists for: the people helped, the places protected, the animals looked after, the work that justifies the charity existing at all. And the test of whether it's been done well is plain. A charity should be able to open the whole thing up, what the system does and how it's watched, and find that the openness earns more trust, not less, because the care and the rigour that went in are there to see.

A note on sources. Figures and quotations are linked inline to primary sources where they exist, with supporting evidence in the footnotes. Several numbers that circulate widely in this area trace back to single vendor blogs or to reports that haven't been peer-reviewed; where that's the case we've either said so or left the figure out. Where the argument draws on books rather than articles, it paraphrases and attributes rather than quoting, and those attributions should be checked against the printed editions before publication. The framing draws on William Baumol and William Bowen's work on cost disease (1966), Paul David's account of the electrification productivity paradox (1990), Stafford Beer's Viable System Model (1972), and Clayton Christensen's "The Innovator's Dilemma" (1997).

¹ Andrej Karpathy, on the Dwarkesh Patel podcast (October 2025); the lines were picked up by Cal Newport in "Why A.I. Didn't Transform Our Lives in 2025", The New Yorker, December 2025. He also said, of agents, simply: "It's just not working."
² The surgeon Atul Gawande tells a parallel story about building: once construction outgrew any single "Master Builder" who could oversee a project from portico to plumbing, the industry coped not by finding cleverer individuals but by distributing the work across specialised trades held together with checklists and sign-offs. The Checklist Manifesto (2009).
³ A made-up-but-useful number, of the kind worth reaching for when no one has measured the real thing. It isn't baseless: sector analyses put administrative work at roughly 15 to 20% of charity staff time, and one estimate has the sector spending some 15.8 million hours a year on grant-monitoring reports alone.
⁴ In a companion experiment written with Alex Imas, coordination collapsed even at small scale: "by the time we get to even 8-12 agents the number of successful transactions drops to below 50%", and money never emerged on its own, "even though in conversations LLMs all know that this is the smart thing to do". Rohit Krishnan, "Will money still exist in the agentic economy?", Strange Loop Canon, December 2025.
⁵ Arvind Narayanan and Sayash Kapoor reach the same point from the economics of technology: "the parts of workflows that AI tackles are unlikely to be bottlenecks, because much of the available productivity gains have already been unlocked", while "the actual bottlenecks prove resistant due to regulation or other external constraints". "A guide to understanding AI as Normal Technology", normaltech.ai, September 2025.
⁶ A related risk runs through the workforce. The economist Luis Garicano calls it the "AI-Becker problem": firms have always underinvested in junior staff because rivals can poach them, but if AI means juniors now create little value, the apprenticeship that produces senior judgement erodes from the bottom. He cites New York Fed data showing new-graduate unemployment up 30% since the pandemic, against 18% for all workers. "The AI-Becker Problem", 2025.
⁷ The pace is uneven but steep. A year ago Claude's attempts to play Pokémon were a running joke, needing elaborate helper tooling and still getting stuck; in June 2026 Anthropic reported that Claude Fable 5 finished Pokémon FireRed with vision alone, on a minimal harness, a game its predecessors could not complete even with extra tools bolted on.
⁸ The most bullish public forecast, AI 2027 (Daniel Kokotajlo, Scott Alexander and others), turns on AI's software-engineering capability: it predicts AI surpassing the best human coders and then improving itself from there. Tellingly, its lead author has since put his own central estimate nearer 2030 than 2027. The point isn't the date. It's that the steep curve everyone argues about is a curve in software, which is exactly why turning it into charity capacity takes deliberate work rather than a licence.
⁹ Chroma's "Context Rot" study (2025) found that models degrade as input length grows even on simple tasks, and that giving a model only the relevant excerpt of around 300 tokens beat giving it the full history of about 113,000 tokens. The cure for a confused agent is rarely more context.
¹⁰ A sobering precedent from medicine. Computer-aided mammography was approved in 1998 and was reading about 74% of US mammograms by 2010, yet within a 430,000-mammogram study the clinics that adopted it carried out 20% more biopsies while catching no more cancer, and in 2018 Medicare withdrew the extra payment for it. Approval on a benchmark is not the same as value in production. Deena Mousa, "AI isn't replacing radiologists", 2025.
¹¹ This is why casework resists the treatment. Steve Newman, who co-created Writely, the prototype that became Google Docs, argues that a project is not a bundle of tasks: split into discrete agent subtasks, the "little nudge that will eventually accumulate with 20 other nudges" is lost the moment each agent's memory is discarded. "A Project Is Not a Bundle of Tasks", 2025.
¹² The mistakes are not always destructive; sometimes they are merely strange. In Anthropic's Project Vend, a Claude model left to run a small office shop hallucinated a payment account, sold goods at a loss, and lost track of being software at all, insisting it would deliver orders "in person" while "wearing a blue blazer and a red tie", and then emailing the company's security team in alarm.
¹³ There is a sharper, cognitive version of this, the "stack fallacy", named by the investor Anshu Sharma in 2016: the people who build one layer of technology consistently underrate how hard, and how valuable, the layer above it is - which is precisely the workflow-and-coordination layer charities need.
¹⁴ "Job" here means a piece of work to be done, not a person's post. It borrows from the "jobs to be done" lens, developed by Bob Moesta and others, which asks what specific job a tool is "hired" to do rather than who currently does it. The aim is to take work off people, not people off the payroll. And a job, unlike a chatbot, has a clear beginning, an end and a definition of "done", which is what makes it the right unit to start from.