The Learning Lab

Assessing clinicians on something other than multiple choice

Building AI-simulated patients that hold up under clinical scrutiny. Deterministic where the medicine has to be defensible; probabilistic where the dialogue has to feel real. The first scenario was anaphylaxis.

Over 85% of medical training is delivered online. Most of it is multiple choice. Clinicians do it because their trust requires it. Many treat it as an exercise in compliance rather than genuine education - you can click through, guess answers, pass an assessment without demonstrating you'd actually know what to do when a patient is in front of you.

The Learning Lab - a spinout from Guy's and St Thomas' NHS Foundation Trust and Guy's Cancer Academy - wanted to test whether AI could change that. Could a clinician interact with a simulated patient through natural conversation, demonstrate clinical reasoning in real time, and be assessed on the way they responded rather than the boxes they ticked?

We worked with them on the first scenario: anaphylaxis. Something every clinician understands. Well-documented protocols. Failure is fast and obvious. A good test case for whether the approach could work.

What had to be deterministic, what could be probabilistic

The first design decision was the spine of everything else.

The patient story - their symptoms, medical history, how their condition deteriorates without intervention - had to be controlled and consistent. Every clinician taking the assessment had to face equivalent difficulty. The clinical logic had to be deterministic: vital sign trajectories, valid treatments, the way the body responds to adrenaline or to its absence. None of that can vary based on what an LLM happened to generate.

But the dialogue had to feel natural. And the assessment had to go beyond binary right-or-wrong. A scripted branching tree would have needed thousands of branches to handle the nuance of how clinicians actually communicate - how they explain procedures to a frightened patient, how they prioritise when multiple things are wrong, how they demonstrate the patient-centred care that matters as much as the clinical decisions.

That's the bit the LLM is good at. The bit a tree wouldn't get right at any scale we could afford to build.

So: deterministic where the medicine has to be defensible. Probabilistic where the conversation has to feel real. Fourteen specialised AI components, each with a tight job.

The bounded LLM

The probabilistic parts use the LLM with carefully bounded constraints. The patient persona is grounded in a specific medical history. The conversational tone is set by the simulated patient's social and emotional state ("anxious", "in pain", "trying to be polite"). The model can refuse to answer questions the patient wouldn't reasonably know the answer to. It can describe symptoms in the patient's own language rather than medical terminology.

What the model isn't allowed to do: declare a diagnosis, suggest a treatment, or reveal a "correct" answer. Its job is to be the patient, not to coach the clinician.
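A minimal sketch of what "bounded" could look like in code, assuming the persona is fixed data and the model's reply is checked before it reaches the clinician. The names `PatientPersona`, `build_system_prompt`, `FORBIDDEN` and `violates_bounds` are hypothetical, the example history is invented, and the keyword check is deliberately crude. It illustrates the idea and does not describe the project's actual guard.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientPersona:
    """Fixed persona data the model is grounded in (hypothetical shape)."""
    history: str
    emotional_state: str  # e.g. "anxious", "in pain", "trying to be polite"


def build_system_prompt(persona: PatientPersona) -> str:
    """Assemble the patient-voice prompt from the fixed persona data."""
    return (
        "You are role-playing a patient. Stay in character.\n"
        f"Medical history: {persona.history}\n"
        f"Current state: {persona.emotional_state}\n"
        "Describe symptoms in everyday language, not medical terminology.\n"
        "If asked something the patient would not know, say you don't know.\n"
        "Never name a diagnosis, suggest a treatment, or reveal a correct answer."
    )


# Assumption: a simple block-list stands in for whatever check a real
# system would run on the model's output.
FORBIDDEN = re.compile(
    r"\b(anaphylaxis|anaphylactic|adrenaline|epinephrine|diagnos\w*)\b",
    re.IGNORECASE,
)


def violates_bounds(reply: str) -> bool:
    """Return True if a patient reply strays into coaching the clinician."""
    return FORBIDDEN.search(reply) is not None


persona = PatientPersona(history="known nut allergy", emotional_state="anxious")
prompt = build_system_prompt(persona)
print(violates_bounds("My throat feels tight and my lips are tingling."))
print(violates_bounds("I think this is anaphylaxis, you should give adrenaline."))
```

A reply that fails the check could be regenerated or replaced with a safe in-character line. Either way the prompt is not the only thing holding the boundary.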

The deterministic parts run on JSON-defined logic that clinical experts can audit. The deterioration curve for anaphylaxis is in JSON. The valid emergency treatments are in JSON. The thresholds at which the simulated patient's vitals cross into critical are in JSON. Anyone who knows the medicine can read the JSON and tell us if it's right. They don't need to read prompts.
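To make that concrete, here is a rough sketch of how a JSON-defined deterioration curve and its critical thresholds might be evaluated. The field names (`start`, `per_minute_untreated`, `critical_below`, `critical_above`) and the linear trajectory are assumptions, and every number is a placeholder for illustration, not a clinical value.

```python
import json

# Hypothetical scenario fragment. All numbers are placeholders for
# illustration only and must not be read as clinical guidance.
SCENARIO_JSON = """
{
  "vitals": {
    "systolic_bp": {"start": 110, "per_minute_untreated": -6, "critical_below": 90},
    "heart_rate":  {"start": 100, "per_minute_untreated": 5,  "critical_above": 130}
  }
}
"""


def vitals_at(scenario: dict, minutes_untreated: float) -> dict:
    """Return each vital after the given untreated time (linear trajectory)."""
    return {
        name: spec["start"] + spec["per_minute_untreated"] * minutes_untreated
        for name, spec in scenario["vitals"].items()
    }


def critical_vitals(scenario: dict, vitals: dict) -> list:
    """List the vitals that have crossed their JSON-defined threshold."""
    crossed = []
    for name, value in vitals.items():
        spec = scenario["vitals"][name]
        if "critical_below" in spec and value < spec["critical_below"]:
            crossed.append(name)
        if "critical_above" in spec and value > spec["critical_above"]:
            crossed.append(name)
    return crossed


scenario = json.loads(SCENARIO_JSON)
at_start = vitals_at(scenario, 0)
at_five = vitals_at(scenario, 5)
print(at_five, critical_vitals(scenario, at_five))
```

Because the same JSON and the same elapsed time always give the same vitals, two clinicians who wait the same five minutes face the same patient, and a reviewer can check the numbers without reading any prompt.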

Assessment that catches the nuance

The hardest part was the assessment surface. A clinician who misdiagnoses anaphylaxis as a panic attack has made a serious clinical error. But suppose during the debrief they articulate clearly why anaphylaxis was the right call, how they'd have spotted it earlier, what they'd do differently. That clinician is closer to being safe in a real situation than someone who guesses the right answer first time without being able to explain it.

The assessment has to surface both signals. The clinical decision, scored deterministically against the protocol. And the reasoning, scored probabilistically against rubrics for communication, prioritisation, and patient-centred care.

A clinician who got the diagnosis right but couldn't explain their decision-making during debrief should get a different result from one who got it wrong but reasoned soundly. Multiple choice can't tell those two apart. The combination of bounded LLM + structured rubric can.
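One way to picture the two signals side by side is sketched below. It assumes the deterministic check produces a boolean and the rubric produces a score between 0 and 1 per dimension. `AssessmentResult`, `REASONING_PASS_MARK` and the outcome labels are hypothetical and are not the scheme the project uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssessmentResult:
    decision_correct: bool  # deterministic: did the action match the protocol?
    rubric_scores: dict     # rubric dimension -> score between 0.0 and 1.0


REASONING_PASS_MARK = 0.7  # hypothetical threshold, for illustration only


def reasoning_sound(result: AssessmentResult) -> bool:
    """Average the rubric dimensions and compare against the pass mark."""
    scores = list(result.rubric_scores.values())
    if not scores:
        raise ValueError("rubric_scores must contain at least one dimension")
    return sum(scores) / len(scores) >= REASONING_PASS_MARK


def classify(result: AssessmentResult) -> str:
    """Keep the decision and the reasoning as two separate signals."""
    sound = reasoning_sound(result)
    if result.decision_correct and sound:
        return "correct decision, sound reasoning"
    if result.decision_correct:
        return "correct decision, reasoning not demonstrated"
    if sound:
        return "incorrect decision, sound reasoning in debrief"
    return "incorrect decision, reasoning not demonstrated"


lucky_guess = AssessmentResult(
    decision_correct=True,
    rubric_scores={"communication": 0.4, "prioritisation": 0.3, "patient_centred_care": 0.5},
)
reflective = AssessmentResult(
    decision_correct=False,
    rubric_scores={"communication": 0.9, "prioritisation": 0.8, "patient_centred_care": 0.9},
)
print(classify(lucky_guess))
print(classify(reflective))
```

The two clinicians land in different categories, where a single multiple-choice mark would only report that one was right and the other wrong.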

What we did with clinicians

Throughout development, clinicians and learning specialists from The Learning Lab tested the prototype at each release. The team includes both registered clinicians and education experts, and the testing cycle was tight - release Friday, feedback Monday, fix Tuesday, retest Thursday. That cadence matters when the thing being built is something the medical-education side has strong opinions about. Production releases came back to us with notes like "the simulated patient is too calm for the deterioration the JSON describes - at these vital signs they'd be confused, not chatty". That is the kind of correction you only get from clinicians who've been in the room.

What it became

The anaphylaxis prototype is in use today. The LLM handles complex patient interactions naturally. The deterministic clinical logic stays defensible. The assessment captures nuance that multiple choice can't.

Testing in 2026 will compare simulation assessments against multiple choice on the same clinicians. That's the real question - not "can we build this?" but "does it actually measure clinical competence better than what we're replacing?" The hypothesis is yes. The evidence will tell.

The architecture was designed for additional scenarios. Each clinical scenario needs a JSON definition (deterministic clinical logic + patient persona + assessment rubric) plus a system prompt for the simulated patient's voice. The same engine runs sepsis, cardiac arrest, paediatric emergencies - anything where the medicine is well-protocolled and the human interaction matters.
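As a sketch of what adding a scenario might involve, the loader below checks that a definition carries the three parts named above and comes with a system prompt. The section keys (`clinical_logic`, `patient_persona`, `assessment_rubric`) and the function name `load_scenario` are assumptions made for illustration. The real schema is not described here.

```python
import json

# Hypothetical section names mirroring the three parts of a scenario definition.
REQUIRED_SECTIONS = ("clinical_logic", "patient_persona", "assessment_rubric")


def load_scenario(raw_json: str, system_prompt: str) -> dict:
    """Parse a scenario definition and reject it if any part is missing."""
    definition = json.loads(raw_json)
    missing = [name for name in REQUIRED_SECTIONS if name not in definition]
    if missing:
        raise ValueError(f"scenario is missing sections: {', '.join(missing)}")
    if not system_prompt.strip():
        raise ValueError("scenario needs a system prompt for the patient's voice")
    return {"definition": definition, "system_prompt": system_prompt}


complete = '{"clinical_logic": {}, "patient_persona": {}, "assessment_rubric": {}}'
loaded = load_scenario(complete, "You are role-playing a patient.")
print(sorted(loaded["definition"]))
```

Failing fast on an incomplete definition keeps a half-specified scenario from ever reaching a clinician.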

The honest bit

This is one of the more controlled domains we've worked in. The medicine is well-documented. The protocols are agreed across UK clinical training. The Learning Lab has clinicians on staff who can sign off on every deterministic JSON file. That's not the typical charity-AI engagement - most of the time the domain is messier and the ground truth is less stable.

What transferred to other Loop work was the shape: the discipline of asking "what needs to be deterministic" before reaching for the model. The LLM is good at the conversational layer. It's terrible at being the source of truth for things that have to be defensible. When you mix the two up, you get demos that look great and assessments that nobody trusts.

Keep the medicine in JSON. Let the model be the patient.
