Arts Marketing Association & National Lottery Heritage Fund

We built an AI that doesn't answer the question

Goose - thinking partners for heritage professionals. Twenty-one specialist personas, designed around the insight that the most useful thing an AI product can do isn't give a better answer, it's help someone ask a better question.

When I led the HMI team at Dyson I could pull together 10, 20, 100 people to discuss a problem. If we needed a second opinion on an interaction decision, we could fill a room. Heritage doesn't work like that. A marketing team at a heritage organisation might be one person. That person might also be doing community engagement, managing volunteers, writing funding bids, and updating the website.

They don't need a chatbot. They need colleagues.

That was the idea behind Goose. A flock of thinking partners, not a single AI assistant. Twenty-one specialist personas covering the roles that heritage organisations need but can rarely afford: a community engagement manager, a fundraising specialist, a heritage interpretation expert, a digital marketing strategist. Users bring multiple partners into a conversation and get genuinely different perspectives, grounded in sector-specific knowledge.

The concept isn't complicated. Getting there was.

Goose was commissioned by the Arts Marketing Association in 2025 and funded by the National Lottery Heritage Fund. We worked with them in twice-weekly collaborative sessions across months of prototyping, research, and user testing before writing a line of production code. It launched in alpha to 40 UK heritage organisations in August 2025, then in beta to a wider audience in October. In January 2026 it was formally released to everyone in the UK heritage sector.

Four wrong answers

The first thing we discovered is that a straightforward chat interface with a decent prompt gets you 75 to 80% of the way to a finished heritage product. Large language models are incredibly good at role-playing characters and domains. The heritage sector has a massive advantage here: it has always been very good at documenting itself, and it has been publishing on the web since the World Wide Web began.

Of course 75 to 80% is pretty useless in the real world. It means the model would likely make a mistake in every single conversation. But the more interesting problem was cultural. It felt kind of dull just to shove a chatbot on top of a large language model and write HERITAGE in big capital letters in front of it.

We built four distinct prototypes. Each one was a better answer. Each one was wrong.

The first was a basic chatbot. Character role-play without tools or collaborative features. It worked, but once the novelty wore off it wasn't clear why you'd use it instead of ChatGPT. A better answer than Google. Still just an answer.

The second was a vector embedding system restricted to verified heritage sources. More accurate, but too narrow and too brittle. You could pretty much get whatever that version of Goose was selling by doing a Google.

The third was an expert system focused on National Lottery Heritage Fund applications. This addressed a genuine pain point but it was administratively useful rather than strategically useful. People would use it twice a year, which is not a product.

The fourth was fully agentic - autonomous task automation. It was the most comprehensive answer we could build. But heritage professionals wanted control over strategic decisions, not delegation. The instinct to hand over the reins produced anxiety, not relief.

Four prototypes. We kept improving the answer. The problem wasn't the answer.

The design squiggle

There's a sketch in my notebook from around this point that I kept coming back to. It's a riff on Damien Newman's design squiggle. In our version, agents appear at every stage. In the first frame they're everywhere, scattered across the mess of all possible directions. In the second, they're helping shape a rough idea alongside the designer. In the third, the direction has clarified and the human is back at the centre. In the final frame it's just the thing and the person who made it.

What we were working through was a question about agency - not in the AI sense of tool use and function calling, but in the human sense. Who's in charge? The agentic prototype had answered that question badly. It assumed heritage professionals wanted someone to take things off their plate. They didn't. They wanted someone to think with.

The core insight, which took us months to arrive at, was that heritage professionals needed better questions, not more answers.

What emerged was the thinking partners concept. Thinking support rather than task automation. Colleagues rather than assistants.

Designing for trust

The people who were actually going to use Goose may never, literally never, have used a chat interface before they encountered it. The "colleagues not chatbot" framing shaped every interaction design decision that followed.

We deliberately expose all of the interaction points when you first open a conversation. You can see that you can add thinking partners, that you can edit a text document, and that you can interact with the model. This is the opposite of progressive disclosure. We decided that showing everything upfront was less frightening than having things appear unexpectedly.

We gave almost no autonomy to the model. It has to ask permission to do lots of things - create documents, suggest partners. Every action is visible, and most require confirmation. In this sector, if there's a trade-off between doing something quicker and feeling in control, feeling in control will always win.
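A minimal sketch of what a permission gate like this might look like. It is not Goose's code: the action names, the NEEDS_CONFIRMATION policy, and the run_action helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ProposedAction:
    kind: str         # hypothetical action name, e.g. "create_document"
    description: str  # what the user is shown before anything happens


# Hypothetical policy: which kinds of action need an explicit "yes".
NEEDS_CONFIRMATION = {"create_document", "suggest_partner"}


def run_action(
    action: ProposedAction,
    confirm: Callable[[ProposedAction], bool],
    execute: Callable[[ProposedAction], None],
    log: List[str],
) -> bool:
    """Show the action, ask for permission if required, then run it."""
    log.append(f"proposed: {action.description}")  # every action is visible
    if action.kind in NEEDS_CONFIRMATION and not confirm(action):
        log.append(f"declined: {action.description}")
        return False
    execute(action)
    log.append(f"done: {action.description}")
    return True


# Usage: the user says no, so nothing is executed.
log: List[str] = []
created: List[str] = []
action = ProposedAction("create_document", "Create a draft campaign plan")
ran = run_action(action, lambda a: False, lambda a: created.append(a.kind), log)
```

The point of the shape is that the model only ever proposes. The decision to execute sits in a callback that the interface, and therefore the person, owns.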

The same principle shaped how we handled guardrails. The heritage sector is not one thing. It's a 300-year-old country house in Suffolk and a community archive in Swansea. It's professionals who work in English and professionals who work in Welsh. The system prompt explicitly supports Cymraeg - "treftadaeth" for heritage, "amgueddfa" for museum - which means you can't simply block non-English input as a safety measure.

So instead of input/output filtering, we used strict contextual grounding. The system prompt establishes identity and domain so thoroughly that the model's default behaviour is deeply heritage-specific. The boundary instruction is deliberately soft: "You're specifically designed for heritage marketing challenges. This is fuzzy because heritage professionals have to undertake many tasks and activities. Tend towards trusting that the request is coming from a heritage professional."

Trust the colleague. Don't restrict them.

What colleagues look like in practice

The prebuilt partners map to real roles in heritage organisations, grouped into five categories: marketing, audience, income generation, leadership, and supporting roles. Each has a system prompt grounding them in UK heritage sector context. These aren't generic personas. When a user adds the Digital Marketing Manager to a conversation, it should feel recognisably like someone who's worked in this sector, not a Silicon Valley marketing consultant parachuted in.

In conversation, the model uses a delimiter format to voice different perspectives. A +PartnerName marker signals a thinking partner is speaking; an en dash on its own line closes their contribution. As a response streams in, the UI switches rendering context, showing a coloured badge for the current partner, adjusting the visual grouping, without waiting for the full response. Users see thinking partners arriving and contributing in real time.
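To make the delimiter format concrete, here is a rough sketch of an incremental parser for it. It assumes the +PartnerName marker sits on its own line with no space after the plus sign, and it emits events one complete line at a time. The class name and event names are invented for this example, and a production UI would likely also render partial lines as they arrive.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

EN_DASH = "\u2013"       # the closing delimiter, alone on a line
Event = Tuple[str, str]  # (kind, value)


@dataclass
class PartnerStreamParser:
    """Turns streamed text chunks into speaker-change and text events."""

    buffer: str = ""
    current: Optional[str] = None  # None means no partner is speaking

    def feed(self, chunk: str) -> List[Event]:
        """Add a chunk and return events for every completed line."""
        self.buffer += chunk
        events: List[Event] = []
        while "\n" in self.buffer:
            line, self.buffer = self.buffer.split("\n", 1)
            events.extend(self._handle_line(line))
        return events

    def close(self) -> List[Event]:
        """Flush whatever is left once the stream has ended."""
        events = self._handle_line(self.buffer) if self.buffer else []
        self.buffer = ""
        return events

    def _handle_line(self, line: str) -> List[Event]:
        stripped = line.strip()
        # "+Name" opens a partner; "+ item" (a bullet) is left as text.
        if len(stripped) > 1 and stripped[0] == "+" and not stripped[1].isspace():
            self.current = stripped[1:]
            return [("partner_start", self.current)]
        if stripped == EN_DASH and self.current is not None:
            name, self.current = self.current, None
            return [("partner_end", name)]
        return [("text", line)] if stripped else []


# Usage: the marker is split across two chunks, as it can be when streaming.
parser = PartnerStreamParser()
events: List[Event] = []
for chunk in ["Here is a thought.\n+Fundrai",
              "sing Specialist\nHave you costed this?\n\u2013\nBack to me."]:
    events += parser.feed(chunk)
events += parser.close()
```

Buffering until a newline is what lets the parser cope with a marker that arrives in two pieces. The UI can switch badge and grouping on each partner_start event without seeing the rest of the response.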

The engagement data bears it out. Projects that include thinking partners average around 20 messages per conversation. Without them, it's 7. That's not because the conversations are padded. It's because the multi-perspective format provokes follow-up. When a fundraising specialist raises a concern about your campaign, you respond to it. When a community engagement manager suggests an approach you hadn't considered, you explore it. The model becomes a thinking environment rather than a question-and-answer machine.

One thing that stuck with me from user testing: someone said, "It feels like I've asked a person and someone very knowledgeable has come back to me." The research data showed 22% strategy discussions, 20% heritage-specific topics, and 16% marketing challenges - the tool was being used for the strategic thinking we'd designed it for rather than as a general-purpose chatbot.

The thing that surprised me most was that people kept using it. After alpha, after beta, they came back.

The engineering that made it viable

A design decision means nothing if you can't afford to ship it. Twenty-one thinking partners is a concept. Making it economical is engineering.

The system prompt needs to include the base instructions, tool definitions, the full catalogue of available partners with their system prompts, and user-specific context. The exact number on the first request was 8,129 tokens. Without caching, sending that on every API call would have been financially unsustainable for a free heritage sector product.

Anthropic's prompt caching solved this. We build the system prompt as three separate blocks - static instructions, thinking partners catalogue, user context - each marked for caching. In practice, most API calls benefit from cache hits on all three blocks, which dramatically reduces both cost and latency.
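As a sketch of the three-block layout, the function below builds a system prompt as a list of text blocks, each carrying a cache_control marker of type "ephemeral", which is the block shape Anthropic's Messages API accepts for its system parameter. The function name and the placeholder strings are assumptions for the example, not Goose's actual prompt code.

```python
from typing import Dict, List


def build_system_blocks(
    static_instructions: str,
    partner_catalogue: str,
    user_context: str,
) -> List[Dict[str, object]]:
    """Return the system prompt as three separately cacheable text blocks.

    Caching works on prefixes, so the blocks are ordered from least to
    most likely to change: a new user context leaves the first two
    blocks reusable from cache.
    """
    parts = [static_instructions, partner_catalogue, user_context]
    return [
        {
            "type": "text",
            "text": text,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }
        for text in parts
    ]


# Usage with placeholder content (the real prompt text is not shown here).
blocks = build_system_blocks(
    "Base instructions and tool guidance...",
    "Catalogue of thinking partners and their prompts...",
    "Context about this user and organisation...",
)
```

The resulting list would be passed as the system argument of a Messages API call. Keeping the user context last means the expensive, shared parts of the prompt can still be served from cache when only the per-user block differs.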

The architecture is deliberately multi-model. Claude handles the conversation, the thinking partner personas, and the creative synthesis. When users need current information - funding deadlines, sector news, specific organisations - the request is routed through Gemini 2.5 Flash-Lite with Google Search. Each model does what it's best at. Users see a single coherent response with citations; behind it, two models from different providers are working in sequence.
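The two-models-in-sequence flow can be sketched without naming either provider's SDK. Everything here is hypothetical: the respond function, the three callables it takes, and the stub implementations stand in for a search-backed model and a conversational model. How Goose actually decides when to search is not described, so the needs_search check is an assumption.

```python
from typing import Callable, Dict, List

Source = Dict[str, str]  # assumed shape: {"title": ..., "url": ..., "snippet": ...}


def respond(
    question: str,
    needs_search: Callable[[str], bool],
    search: Callable[[str], List[Source]],
    converse: Callable[[str, List[Source]], str],
) -> Dict[str, object]:
    """Run the search model first when needed, then the conversation model."""
    sources = search(question) if needs_search(question) else []
    text = converse(question, sources)
    return {"text": text, "citations": [s["url"] for s in sources]}


# Stubs standing in for the real models, so the flow can run offline.
def fake_needs_search(question: str) -> bool:
    return "deadline" in question.lower()


def fake_search(question: str) -> List[Source]:
    return [{"title": "Example funding page",
             "url": "https://example.org/funding",
             "snippet": "Placeholder snippet"}]


def fake_converse(question: str, sources: List[Source]) -> str:
    return f"Answer using {len(sources)} source(s)."


result = respond("When is the next funding deadline?",
                 fake_needs_search, fake_search, fake_converse)
```

Passing the models in as callables keeps the orchestration independent of any one provider, so either side could be swapped without touching the flow.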

Goose defaults to Claude Haiku 4.5 now. It didn't start there. We originally used Sonnet 4.5 because we couldn't get Haiku 3.5 to perform well enough. The problem with Sonnet was cost: Goose is a free service, and the token economics meant users were limited to roughly 20,000 tokens a day. That's not much for a tool designed around extended multi-perspective conversations. Haiku 4.5 changed the equation - cheap, fast, and able to use all the tools.

Colleagues learning from each other

Heritage organisations have a knowledge problem. The people who understand how a museum's community engagement actually works - what was tried, what failed, what the constraints were - carry that knowledge in their heads. When they leave, it leaves with them.

Interview mode extends the colleagues concept from individual conversations to collective knowledge. It replaces the standard chat interface with a structured conversation where the model acts as an interviewer, not an advisor - asking open questions, following threads, drawing out specifics. When it judges the conversation is substantive enough, it produces two outputs: a longform narrative for the user's personal record, and a shortform STAR-format summary anonymised for sharing.

The critical step is consent. After generating the shortform summary, Goose explicitly asks: "Would you like me to save this to the community knowledge base so others can learn from your experience?" Heritage professionals are sharing their professional experience with peers they've never met. That's trust. It shouldn't be automated.

A heritage professional in Manchester asks Goose about community engagement strategies for industrial heritage sites. Goose embeds their query, runs a similarity search, and surfaces anonymised case studies from professionals across the country who've faced similar challenges. Colleagues who've never met, learning from each other's experience.
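The retrieval step can be sketched as a cosine similarity search with a threshold. This is a toy version under stated assumptions: the two-dimensional vectors stand in for real embeddings, the threshold of 0.75 and the find_case_studies name are invented for the example, and a production system would more likely use a vector database than a Python loop.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_case_studies(
    query_vec: Sequence[float],
    case_studies: List[Dict[str, object]],
    threshold: float = 0.75,  # assumed value; needs tuning on real usage
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Return up to top_k (score, summary) pairs scoring at least threshold."""
    scored = [(cosine(query_vec, cs["embedding"]), cs["summary"])
              for cs in case_studies]
    hits = [hit for hit in scored if hit[0] >= threshold]
    hits.sort(key=lambda hit: hit[0], reverse=True)  # best match first
    return hits[:top_k]


# Toy data: anonymised summaries with made-up 2-D "embeddings".
studies = [
    {"summary": "Mill site open days with former workers", "embedding": [1.0, 0.0]},
    {"summary": "Canal heritage walks co-designed with residents", "embedding": [0.9, 0.1]},
    {"summary": "Gift shop stock review", "embedding": [0.0, 1.0]},
]
results = find_case_studies([1.0, 0.0], studies)
```

The threshold matters as much as the ranking. Set it too low and people are shown loosely related stories; set it too high and the knowledge base appears empty.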

What we learned

The 80% accuracy problem is a design problem, not an engineering problem. If you reframe the product as being about better questions rather than the perfect answer, the model's limitations become features rather than bugs. This doesn't mean you accept inaccuracy - we worked hard to push into the 90s - but it means the product can be useful before the model is perfect.

Human-in-the-loop is the architecture, not a crutch. The temptation - particularly when pitching to stakeholders - is to frame the system as "AI does the work, humans just approve it." The more honest framing: the system makes human review fast and reliable. The AI handles the heavy lifting. The human provides the judgement the AI cannot.

You need at least two models. Claude for conversation, Gemini for web search. This isn't just cost optimisation. The models have different strengths.

Observe behaviour. We didn't design the rate limiting tiers correctly the first time. We didn't get the similarity threshold right the first time. We didn't know that thinking partners would drive conversations roughly three times longer, around 20 messages against 7. All of this came from watching real usage and adjusting.

What it became

The heritage sector employs skilled professionals doing serious work on tight budgets. They deserve tools built with the same engineering care that goes into commercial SaaS. Goose isn't perfect - we've laid out the hacks, the debt, the things we'd redesign. But it works, it's been in daily use across UK heritage organisations for over six months, and it was built with the specific constraints and needs of this sector at its centre.

We started this project when nobody knew what best practice for production AI applications looked like. We're still not sure anyone does. But we've learned this much: the most useful thing an AI product can do isn't give a better answer. It's help someone ask a better question.

Goose does that. Hacks and all.
