
The last ten percent took an order of magnitude more work

Building AIDA - the dual-schema AI that transcribes complex patient surveys across 20+ NHS form structures. The brief looked simple. It wasn't. Getting to 90% was easy; getting to 99% was the project.

The brief looked simple.

Breast Cancer Now runs the Service Pledge Programme across NHS hospitals - 18-month projects collecting feedback from breast cancer patients and the staff who treat them. A four-person team. Paper surveys, hundreds of responses per project, 350-odd items per response. Before we arrived, someone was typing each one into a web form by hand. A 350-item survey takes the better part of an hour to transcribe. Multiply by hundreds of responses across dozens of hospital projects. It is one of the worst ways a person can spend a working day.

Scan the paper. Extract the data. Put it in a database. How hard could it be.

Getting to 90% was easy

Getting an LLM to extract something from a document is straightforward. Point Claude at a scanned form, ask "what does this say?", and you'll get a good answer. Document-to-text is a commodity these days - Google Document AI, Azure AI Document Intelligence (formerly Form Recognizer) and AWS Textract all handle it capably.

We didn't need strings. We needed structured data that conformed to a precise schema, was semantically comparable across forms that asked the same questions in different ways, could be validated and corrected by humans, and maintained consistency across hundreds of documents processed over weeks. At 95% per-item accuracy across 350 items you get 17 or 18 errors per form. That makes the whole exercise pointless - the person transcribing by hand is probably more accurate than that.
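The arithmetic is worth making explicit. A minimal sketch, assuming errors are independent across items (a simplification; real errors tend to cluster on hard pages) and using the 350-item figure from above. The function names are ours, for illustration only.

```python
def expected_errors(items: int, per_item_accuracy: float) -> float:
    """Expected number of wrong items on one form."""
    return items * (1.0 - per_item_accuracy)


def clean_form_probability(items: int, per_item_accuracy: float) -> float:
    """Chance that a whole form has no errors, assuming independent items."""
    return per_item_accuracy ** items


if __name__ == "__main__":
    for accuracy in (0.90, 0.95, 0.99):
        errors = expected_errors(350, accuracy)
        clean = clean_form_probability(350, accuracy)
        print(f"{accuracy:.0%}: {errors:.1f} expected errors, "
              f"{clean:.2%} chance of a clean form")
```

Even at 99% per item, a 350-item form still carries three or four expected errors, which is why human validation stays in the loop.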

The implicit promise of every demo is that structured extraction is just unstructured extraction with a bit more prompting. The demo succeeds. The proof of concept gets approved. And then you try it on 300 real-world surveys and discover that the demo and production are separated by a chasm that no amount of prompt engineering can bridge alone.

Twenty forms, all almost the same

Each NHS trust had evolved slightly different survey formats. Patient advocates had input into question wording and layout. Each form contained approximately 400 data elements across different question types, often spanning multiple pages.

We knew we couldn't take the conventional approach of standardising forms or forcing digital collection. The patient cohort, often older and dealing with serious illness, consistently preferred paper, so digital-only surveys would have excluded them. Standardisation would have disrupted the collaborative relationships with NHS trusts that the programme was built on. Hiring more transcribers wasn't viable within charity budgets.

Instead, we built AI sophisticated enough to handle their reality. Twenty-plus form structures. Questions split across pages. Matrix grids where the columns on page 8 belonged to a header on page 7. Multi-page questions where the schema for page 13 depended on what the schema for page 12 said.

The dual-schema approach

For each form we generated two schemas: a per-page schema that processes individual scanned pages, and a per-question schema that reconstructs complete responses across multi-page questions. The model returns the minimum viable output - deliberately flat - and the application reassembles the response using both schemas. Fewer output tokens mean lower cost and higher accuracy. Models are measurably more reliable when you ask them to produce less.
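To make the idea concrete, here is a minimal sketch of flat per-page output being reassembled into per-question answers. It is not AIDA's actual code: the schema layout, the field keys (`f1`, `f2`), the question ids and the `reassemble` function are all hypothetical, chosen only to show a matrix question that spans two pages.

```python
from collections import defaultdict

# Per-page schema (hypothetical): flat key returned by the model for a page
# -> (question id, matrix row or None for a single-answer question).
PAGE_SCHEMAS = {
    7: {"f1": ("q12", "cleanliness"), "f2": ("q12", "noise_level")},
    8: {"f1": ("q12", "privacy"), "f2": ("q13", None)},
}

# Per-question schema (hypothetical): what a complete answer looks like.
QUESTION_SCHEMAS = {
    "q12": {"type": "matrix", "rows": ["cleanliness", "noise_level", "privacy"]},
    "q13": {"type": "single"},
}


def reassemble(page_outputs):
    """Combine flat per-page model output into one answer per question."""
    collected = defaultdict(dict)
    for page, flat in page_outputs.items():
        for key, value in flat.items():
            question_id, row = PAGE_SCHEMAS[page][key]
            collected[question_id][row] = value

    answers = {}
    for question_id, schema in QUESTION_SCHEMAS.items():
        parts = collected.get(question_id, {})
        if schema["type"] == "matrix":
            # Missing rows stay as None so a human reviewer can spot them.
            answers[question_id] = {row: parts.get(row) for row in schema["rows"]}
        else:
            answers[question_id] = parts.get(None)
    return answers
```

The model only ever emits short key-value pairs per page; everything about structure lives in the two schemas, where it can be tested without calling a model.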

But every output had to be exactly right. The value "agree" couldn't be "Agree" or "I agree" or "4". It had to be a normalised enum from a predefined set. For matrix rows, the model had to correctly identify which row it was looking at - if a respondent ticked satisfied for cleanliness, the system couldn't assign that to noise_level because the rows were close together on the scan. For multiple-choice, it had to return an array, not a string, with values from the defined set. These are basic expectations of any data pipeline. For an LLM, every one is a potential failure point.
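A minimal sketch of that kind of strictness, using only the Python standard library. The `Agreement` scale and both function names are hypothetical; the real enum sets are not given here.

```python
from enum import Enum


class Agreement(str, Enum):
    # Hypothetical answer scale, for illustration only.
    STRONGLY_DISAGREE = "strongly_disagree"
    DISAGREE = "disagree"
    NEUTRAL = "neutral"
    AGREE = "agree"
    STRONGLY_AGREE = "strongly_agree"


def validate_single(value):
    """Return the enum member for an exact match, otherwise raise ValueError."""
    if not isinstance(value, str):
        raise ValueError(f"expected a string, got {type(value).__name__}")
    return Agreement(value)  # Enum lookup by value raises ValueError if unknown


def validate_multi(values, allowed):
    """Multiple-choice answers must be a list drawn from the allowed set."""
    if not isinstance(values, list):
        raise ValueError(f"expected a list, got {type(values).__name__}")
    unknown = [v for v in values if v not in allowed]
    if unknown:
        raise ValueError(f"values outside the defined set: {unknown}")
    return values
```

With this in place, "Agree", "I agree" and "4" are all rejected rather than silently stored, and a multiple-choice answer returned as a bare string fails before it reaches the database.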

We also built extensive mock generation. The client couldn't give us test data - the surveys are confidential by construction. We built tooling to generate synthetic surveys that matched the structure of the real ones, end-to-end, so we could measure accuracy without ever touching real patient data until the system was production-ready.
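The core of the idea can be sketched in a few lines: generate answers from the schema, keep them as ground truth, and score whatever the pipeline extracts against them. This is a simplified illustration, not the real tooling; the schema shape, question ids and function names are hypothetical, and rendering the answers onto a scannable form is left out.

```python
import random

# Hypothetical schema: question id -> type and allowed options.
SCHEMA = {
    "q1": {"type": "single", "options": ["yes", "no", "not_sure"]},
    "q2": {"type": "multi", "options": ["phone", "email", "post"]},
    "q3": {"type": "single", "options": ["agree", "neutral", "disagree"]},
}


def generate_response(schema, rng):
    """Produce one synthetic response that is valid under the schema."""
    response = {}
    for question_id, spec in schema.items():
        if spec["type"] == "single":
            response[question_id] = rng.choice(spec["options"])
        else:
            count = rng.randint(0, len(spec["options"]))
            response[question_id] = sorted(rng.sample(spec["options"], count))
    return response


def per_item_accuracy(truth, extracted):
    """Share of items where the extracted value equals the ground truth."""
    correct = sum(1 for q, value in truth.items() if extracted.get(q) == value)
    return correct / len(truth)
```

Seeding the generator (for example `random.Random(42)`) makes each synthetic batch reproducible, so an accuracy change between runs can be attributed to the pipeline rather than to the test data.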

The digital twist

A pipeline that handles paper surveys is part one. Then the client said: "We also have a digital version of the survey. It runs on QuestionPro. Can you import those responses too?"

Same survey, structured data, no scanning involved. Just an API. How hard could it be.

It turned out the digital version wasn't a faithful reproduction of the paper one. The digital instrument had evolved separately. Different question ordering optimised for completion rates. Different wording ("How satisfied are you with the cleanliness of your workspace?" became "Rate the cleanliness of your office"). Matrix questions decomposed into individual items because the survey tool's matrix rendering didn't work well on mobile. Routing questions ("Which regional office?") that on paper were handled by which physical stack the form went into.

These weren't two versions of the same survey. They were two different instruments measuring overlapping topics.

The "trivial" digital import required its own schema generator, a six-step answer normalisation cascade, AI-augmented question matching to recognise semantic equivalence across syntactic divergence, and a concept of "virtual questions" - fields that exist in the canonical dataset but aren't directly asked in either instrument, populated by mapping.
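The general shape of a normalisation cascade can be sketched as an ordered list of steps, each tried in turn until one produces a canonical value. The four steps below are hypothetical stand-ins chosen for illustration; they are not the six steps AIDA uses, and the answer sets and lookup tables are invented.

```python
# Hypothetical canonical answer set and lookup tables, for illustration only.
CANONICAL = {"very_dissatisfied", "dissatisfied", "neutral", "satisfied", "very_satisfied"}
SYNONYMS = {"happy": "satisfied", "unhappy": "dissatisfied"}
NUMERIC = {"1": "very_dissatisfied", "2": "dissatisfied", "3": "neutral",
           "4": "satisfied", "5": "very_satisfied"}


def exact(raw):
    return raw if raw in CANONICAL else None


def casefolded(raw):
    candidate = raw.strip().casefold().replace(" ", "_")
    return candidate if candidate in CANONICAL else None


def synonym(raw):
    return SYNONYMS.get(raw.strip().casefold())


def numeric(raw):
    return NUMERIC.get(raw.strip())


# Cheapest and strictest steps first; later steps are progressively looser.
CASCADE = [exact, casefolded, synonym, numeric]


def normalise(raw):
    """Return (canonical value, name of the step that matched) or (None, None)."""
    for step in CASCADE:
        result = step(raw)
        if result is not None:
            return result, step.__name__
    return None, None
```

Recording which step matched is a design choice worth keeping: it shows how often the looser steps fire, and anything that falls through with `(None, None)` can be routed to a human for review.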

Working within QuestionPro's 20,000-call annual API budget forced a better architecture. The incremental sync became per-form, watermark-driven, with overlap buffers to catch late-arriving responses.
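A minimal sketch of a watermark-driven sync with an overlap buffer. Everything here is an assumption made for illustration: `fetch_since` is a hypothetical callable standing in for whatever API client is used (it is not a QuestionPro function), the one-hour `OVERLAP` is an arbitrary value, and responses are assumed to carry an `id` and a timezone-aware `submitted_at`.

```python
from datetime import datetime, timedelta, timezone

# Assumed buffer size; the right value depends on how late responses can arrive.
OVERLAP = timedelta(hours=1)


def sync_form(form_id, watermarks, seen_ids, fetch_since):
    """Fetch responses for one form since its watermark, minus an overlap.

    watermarks: dict of form id -> datetime of the newest response seen so far
    seen_ids:   set of response ids already imported (used to de-duplicate)
    fetch_since(form_id, since): hypothetical client call, one API request
    """
    watermark = watermarks.get(form_id)
    query_from = watermark - OVERLAP if watermark is not None else None
    responses = fetch_since(form_id, query_from)

    # The overlap re-reads some old responses on purpose; drop the known ones.
    new = [r for r in responses if r["id"] not in seen_ids]
    seen_ids.update(r["id"] for r in new)

    if responses:
        latest = max(r["submitted_at"] for r in responses)
        if watermark is None or latest > watermark:
            watermarks[form_id] = latest
    return new
```

Each sync costs one call per form rather than one per response, and the overlap trades a little re-reading for not missing a response whose timestamp lands just behind the watermark.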

What AIDA became

AIDA - the dual-schema AI Document Assistant - is now in production at Breast Cancer Now. The four-person team uses it to process patient surveys across 20+ NHS form structures with per-item accuracy north of 99%. Researchers sign off on the resulting data and use it in their published reports.

The team's initial scepticism was justified. Most AI tools they'd been shown couldn't handle their reality without forcing them to compromise - standardise forms, push patients to digital, accept some error rate they couldn't defend in a research output. AIDA didn't ask them to compromise anything. It just removed the manual transcription work that was eating up their week.

What mattered to Breast Cancer Now - the collaborative approach with each NHS trust, the trust they'd built with patient advocates, patients being able to use paper if they preferred - none of that needed to change. The system adapted to their reality, not the other way around.

The honest bit

The same four-person team can now work with considerably more hospitals than they could before. Whether they actually do depends on something AIDA doesn't solve: the existing NHS-side scanners process a handful of forms at a time. We solved one bottleneck and surfaced another upstream.

This is why proof-of-concept work matters. You find out what actually gets in the way before you've committed significant budget. The AI worked as promised. Getting the full benefit means the rest of the process needs to adapt - sometimes that means upgrading a scanner.

For any charity exploring AI: look at the whole pipeline, not just whether the AI can do the job. The model is rarely the limiting factor. The infrastructure around it usually is.
