Building reliable AI for clinical education
Assessments that go beyond binary right-or-wrong answers. Clinicians can interact with AI-simulated patients - examining, diagnosing and treating them - to demonstrate their clinical competence.

Medical knowledge is accelerating beyond what traditional training can handle. In 1950, medical knowledge doubled every 50 years. Now it is estimated to double every 73 days. Clinical education faces an impossible challenge: how do you train people fast enough, at scale, when what they need to know is constantly expanding?
The Learning Lab, a spinout from Guy's and St Thomas' NHS Foundation Trust and Guy's Cancer Academy, is uniquely positioned to tackle this problem. With a reputation for delivering world-class e-learning in oncology, they understand both the clinical rigour required for credible medical education and the practical realities of training thousands of clinicians. This year they're running a series of experiments exploring how AI can improve clinical education. This AI simulation for assessments is the first.
Assessments that actually test competence
Over 85% of medical training is delivered online, but many clinicians report that it doesn't reflect the messy reality of clinical work, and they treat it as an exercise in compliance rather than genuine education. Multiple-choice assessments reinforce this: you can click through, guess answers, and pass without demonstrating real competence.
Simulations are a valuable part of clinical education, creating space for practice and reflection. By bringing simulation elements into assessment design, we can create evaluation scenarios that better reflect how clinicians actually think and respond, and reach learners at scale as new protocols and evidence emerge.
The Learning Lab's vision was dialogue-based assessment where clinicians interact with simulated patients through conversation, demonstrate clinical reasoning in real time, and reflect meaningfully on their performance. But creating an assessment that feels real whilst remaining rigorously fair raises fundamental questions. How do you ensure every clinician faces equivalent difficulty? How do you assess both performance and reflection? How do you prevent gaming? Most importantly, how do you capture how well people respond to a clinical situation in ways that mimic how they might behave in the real world?
Could an LLM create realistic patient interactions that meet these requirements - and do it at the speed and scale that modern medicine demands?
Proving the value of LLMs
The first scenario focused on anaphylaxis, something all clinicians understand, with well-documented protocols. The challenge was figuring out what needed to be deterministic and what could be probabilistic.
The patient story (the symptoms, medical history, and how the condition deteriorates) needed to be controlled and consistent, so that every clinician faced equivalent difficulty. The clinical logic had to be deterministic: deterioration patterns and valid treatments can't vary.
But patient dialogue needed to feel natural, and assessment needed to go beyond binary right or wrong answers. A scripted branching tree would require thousands of scenarios to handle the nuance of how clinicians communicate, prioritise, and demonstrate patient-centred care. This is where the LLM does its work - evaluating communication skills, clinical prioritisation, and patient care in ways that would be impossibly complex to hard-code.
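To make the split concrete, here is a minimal sketch of how the deterministic half might work. Everything in it is hypothetical - the stage names, five-minute deterioration interval, and treatment list are invented for illustration, not taken from the actual system - but it shows the key property: the patient's clinical state changes only according to fixed rules, and the LLM never touches this logic.

```python
from dataclasses import dataclass

# Hypothetical deterministic clinical state machine. Stage names,
# timings, and treatments are illustrative assumptions only.
STAGES = ["mild", "moderate", "severe", "arrest"]
VALID_TREATMENTS = {"im_adrenaline"}

@dataclass
class PatientState:
    stage_index: int = 0
    treated: bool = False

    def advance(self, minutes: int) -> None:
        """Deterministic deterioration: one stage per 5 untreated minutes."""
        if self.treated:
            return
        steps = minutes // 5
        self.stage_index = min(self.stage_index + steps, len(STAGES) - 1)

    def apply_treatment(self, treatment: str) -> bool:
        """Only treatments on the audited list change the clinical state."""
        if treatment in VALID_TREATMENTS:
            self.treated = True
            return True
        return False

    @property
    def stage(self) -> str:
        return STAGES[self.stage_index]

patient = PatientState()
patient.advance(6)                        # untreated: deteriorates one stage
print(patient.stage)                      # moderate
patient.apply_treatment("im_adrenaline")
patient.advance(10)                       # treated: no further deterioration
print(patient.stage)                      # still moderate
```

Because this part is plain, inspectable code rather than model output, every clinician who makes the same decisions at the same times sees the same patient trajectory.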
Getting to a prototype
We built 14 specialised AI components. The deterministic parts (patient vitals, valid treatments, clinical state) run on JSON-defined logic that clinical experts can audit. The probabilistic parts (patient conversation and nuanced assessment) use the LLM with carefully bounded constraints.
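As an illustration of what an auditable, JSON-defined scenario might look like, here is a small invented example. The field names and clinical values are assumptions for the sake of the sketch, not The Learning Lab's actual schema; the point is that a clinical expert can read and check the definition directly, with only light structural validation in code.

```python
import json

# Hypothetical scenario definition; all field names and values
# are invented for illustration.
scenario_json = """
{
  "scenario": "anaphylaxis",
  "patient": {
    "age": 34,
    "history": ["penicillin allergy"],
    "presenting_complaint": "sudden rash and difficulty breathing"
  },
  "deterioration": [
    {"after_minutes": 0,  "stage": "mild",     "vitals": {"hr": 98,  "bp": "118/76"}},
    {"after_minutes": 5,  "stage": "moderate", "vitals": {"hr": 118, "bp": "102/64"}},
    {"after_minutes": 10, "stage": "severe",   "vitals": {"hr": 134, "bp": "84/50"}}
  ],
  "valid_treatments": ["im_adrenaline", "high_flow_oxygen", "iv_fluids"]
}
"""

scenario = json.loads(scenario_json)

# Simple structural checks before the scenario is used by the engine.
assert scenario["valid_treatments"], "scenario must list valid treatments"
stages = [step["stage"] for step in scenario["deterioration"]]
print(stages)  # ['mild', 'moderate', 'severe']
```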
Every clinician faces the same patient, but the interaction feels natural. After the consultation, they reflect through a guided conversation. The LLM assesses not just clinical decisions but communication, prioritisation, and patient-centred care. A clinician who misdiagnosed anaphylaxis as a panic attack receives acknowledgement of the error, alongside recognition if they demonstrate clear understanding during the debrief.
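One way this combined marking could be structured is sketched below. It is purely illustrative: `rate_with_llm` is a stand-in for a real, rubric-bounded model call, and the dimension names are invented. What it shows is the shape of the design - the diagnosis is marked deterministically against the scenario, while the softer dimensions are scored by the LLM.

```python
# Hypothetical sketch of combining deterministic marking with
# LLM-rated dimensions. `rate_with_llm` is a stand-in stub.

def rate_with_llm(transcript: str, dimension: str) -> int:
    """Stub for an LLM rating a transcript 0-5 on one dimension.

    A real implementation would prompt a model with a bounded rubric
    and parse its structured response.
    """
    return 4  # fixed value so the sketch is runnable

def assess(diagnosis: str, transcript: str) -> dict:
    # Deterministic part: the correct diagnosis is fixed by the scenario.
    correct = diagnosis == "anaphylaxis"
    # Probabilistic part: nuanced qualities are judged by the LLM
    # rather than hard-coded branching.
    return {
        "diagnosis_correct": correct,
        "communication": rate_with_llm(transcript, "communication"),
        "patient_centred_care": rate_with_llm(transcript, "patient_centred_care"),
    }

result = assess("panic attack", "...consultation transcript...")
print(result["diagnosis_correct"])  # False
```

This separation means a wrong diagnosis is always recorded as wrong, while good communication during the same consultation can still earn credit.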
Throughout development, clinicians and learning specialists from The Learning Lab tested the prototype at each release, making sure the simulation remained clinically accurate and realistic.
Multi-layered testing
The anaphylaxis prototype demonstrates that AI-powered clinical simulations can work. The LLM handles complex patient interactions naturally and provides nuanced assessment that goes far beyond what multiple choice could capture. The interface supports realistic clinical workflows. Testing is now underway to understand user experience. The next phase in 2026 will look at efficacy: comparing simulations against multiple choice questions to see which is a better assessment of clinical competence.
The architecture was designed for scale - each new clinical scenario needs only its own JSON definition, allowing the same systems to work across different emergencies.

