The Wrong Chatbot Worked Perfectly

The demo looked great.

The chatbot answered questions fluently. It retrieved sources. It formatted everything cleanly. It felt fast, responsive, confident. If you were watching without context, you'd think: yeah, that's ready to ship.

There was one problem. It was trained on a completely different company.

I was on a call with my dev team - we're building an AI chatbot product - and one of the developers pulled up a live demo to show me how far we'd come. He'd switched to OpenAI's Assistants API, moved off the old AWS infrastructure to cut costs, and written a custom script that embeds the chatbot on any website we want by injecting its assistant ID. Real progress. The kind of update that makes you feel like the product is coming together.

He started demoing it. Someone asked the bot what the product was. The bot answered. Confidently. In detail. With sources.

Wrong product. Wrong company. Wrong everything.

He'd plugged in the wrong assistant ID. The chatbot was trained on some project management SaaS he used personally - not on ours. It was answering questions about a completely unrelated tool, citing that tool's documentation, referencing that tool's features. And it was doing it well.

He caught it quickly. Apologized. Said he'd fix it in a few minutes. No big deal in the context of a dev call.

But I couldn't stop thinking about what I'd just watched.

The System Worked. That Was the Problem.

Most people think product failure looks like a crash. A bug. A 500 error. Something obviously broken that you can point to and say: that's what we need to fix.

This wasn't that.

The system performed exactly as designed. The Assistants API call worked. The retrieval worked. The formatting worked. The output was coherent and helpful. Every single technical component did its job correctly.

The failure was invisible at the code level. You'd never catch it in a unit test. It wouldn't show up in your error logs. The only way to discover it was to know what the right answer was supposed to be - and that requires a human who understands the product, sitting in front of a live demo, asking the right questions.

Which is exactly what almost never happens until it's too late.

Most AI products get tested by the people who built them. Those people know the system inside out. They know which questions to ask, which inputs to avoid, which edge cases are still rough. They conduct demos that make the product look ready because they know how to navigate around the parts that aren't. And then someone else - a customer, an investor, a prospect on a live call - asks something the builder would never ask, and the whole thing unravels.

That's the failure mode I'm talking about. Not technical failure. Objective failure. The system is optimizing for the wrong thing, and it's doing it perfectly.

What We Were Actually Trying to Build

Let me back up and give you the full picture, because the chatbot problem is downstream of a product decision that matters.

The vision for this product is an automated salesperson. Not a FAQ bot. Not a help center widget. A bot that acts like your best closer - someone who understands your product, knows how to handle objections, and at the right moment in the conversation says: if you want to learn more, here's where you sign up for a free trial. Then drops the link.

That's a very different thing from a chatbot that retrieves information accurately.

Information retrieval is a commodity. You can spin up a RAG pipeline on any set of documents in an afternoon. What's hard - what actually takes work - is building something that behaves like a salesperson. That doesn't just answer questions but steers conversations. That doesn't dump ten paragraphs of documentation on someone but gives them the two sentences they need and then asks for the close.

When I watched the demo, the bot was giving long, sourced, formatted answers. Technically impressive. Totally wrong for what we need.

I told the team: we don't want any links that aren't buy-now links. No source citations. No documentation references. No seven-plus source footnotes. The bot needs to talk to people, not write Wikipedia articles at them. If someone asks what the product is, the bot shouldn't explain it - it should sell it, and then point them to the trial.
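
Those rules are concrete enough to check automatically. Here's a minimal sketch, assuming a single trial link and a sentence cap I picked for illustration; `TRIAL_URL`, `MAX_SENTENCES`, and `check_reply` are hypothetical names, not anything from our codebase.

```python
import re

# Hypothetical trial link; substitute the real sign-up URL.
TRIAL_URL = "https://example.com/free-trial"
# Assumed cap on reply length; tune it to whatever "talks like a closer" means for you.
MAX_SENTENCES = 4

URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")
# Footnote-style citation markers such as [1] or [12].
CITATION_PATTERN = re.compile(r"\[\d+\]")


def check_reply(reply: str) -> list[str]:
    """Return a list of rule violations; an empty list means the reply passes."""
    problems = []
    for url in URL_PATTERN.findall(reply):
        # Strip trailing punctuation so "…/free-trial." still counts as the trial link.
        if url.rstrip(".,!?") != TRIAL_URL:
            problems.append(f"non-trial link: {url}")
    if CITATION_PATTERN.search(reply):
        problems.append("contains source citations")
    # Rough sentence count: split on end punctuation followed by whitespace.
    sentences = [s for s in re.split(r"[.!?]+\s+", reply.strip()) if s]
    if len(sentences) > MAX_SENTENCES:
        problems.append(f"too long: {len(sentences)} sentences")
    return problems
```

A check like this only catches the mechanical violations (stray links, footnotes, length). Whether the reply actually sells is still a judgment call for a human.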

That's a prompt engineering problem as much as it's an architecture problem. And it's a problem we hadn't solved yet, which the demo - despite looking polished - made clear.

Technical Debt Isn't Always Code

Here's something I've learned from building multiple software products: the most expensive technical debt isn't in the codebase. It's in the assumptions.

My dev team made a reasonable assumption - that demonstrating the Assistants API integration with correct retrieval behavior was a sufficient proof of concept. And by a narrow technical definition, it was. The integration worked.

But the assumption underneath it was: the right output is a correct output.

That assumption is wrong for a sales-focused product. The right output isn't necessarily a correct output. It's a persuasive output. An output that moves someone closer to converting. Those are not the same thing, and conflating them is how you end up with a technically excellent product that does nothing for your business.

I've seen this trap in agencies too. A team builds a beautiful cold email sequence - perfect grammar, smart personalization, relevant case studies, clear value prop. Then they measure open rates and reply rates and nothing moves. The emails were technically correct. They just weren't written to get a response. They were written to sound professional. Those are different objectives, and optimizing for one while measuring the other is a fast way to burn months of effort.

The objective has to be defined before you start building. Not as a vague aspiration - "we want the bot to be helpful" - but as a specific, testable behavior. What does the bot say when someone asks what the product does? What does it say when someone asks about pricing? What does it say when someone seems interested but hesitant? Write those scripts before a single line of prompt engineering happens. Then test against those scripts, not against whether the API call returns a 200.
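
One way to make "test against those scripts" literal is to write the scripts down as data and run the bot through them. This is a sketch under assumptions: the `Scenario` fields, the sample phrases, and the `bot` callable (any function that takes a prompt and returns a reply) are all illustrative, not our actual spec.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical trial link used in the spec below.
TRIAL_URL = "https://example.com/free-trial"


@dataclass
class Scenario:
    name: str
    prompt: str
    must_include: list[str]      # phrases a good reply has to contain
    must_not_include: list[str]  # phrases that mark a documentation-style reply


# Illustrative spec covering the three questions from the text.
SPEC = [
    Scenario("what is it", "What does your product do?",
             [TRIAL_URL], ["documentation", "Source:"]),
    Scenario("pricing", "How much does it cost?",
             [TRIAL_URL], ["Source:"]),
    Scenario("hesitant", "Sounds interesting, but I'm not sure yet.",
             ["free trial"], ["Source:"]),
]


def run_spec(bot: Callable[[str], str], spec: list[Scenario]) -> dict[str, list[str]]:
    """Run every scenario through the bot; return failures keyed by scenario name."""
    failures = {}
    for s in spec:
        reply = bot(s.prompt).lower()
        issues = [f"missing: {t}" for t in s.must_include if t.lower() not in reply]
        issues += [f"forbidden: {t}" for t in s.must_not_include if t.lower() in reply]
        if issues:
            failures[s.name] = issues
    return failures
```

An empty result from `run_spec` means every scripted scenario passed. Substring matching is crude, but it fails loudly on exactly the kind of demo described above, where every API call succeeds and every answer is off-objective.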

The White-Label Model Problem

There was another conversation happening on this call that connects to the same theme from a different angle.

We were discussing how to handle the underlying AI models - whether to expose GPT-3.5 and GPT-4 as options to users, whether to let users pick their model, how to handle the cost difference. One of the developers suggested giving users who pay more access to GPT-4 and keeping users on lower tiers on GPT-3.5.

I shut that down immediately.

The product isn't called GPT. It's not an OpenAI reseller. We're building something with its own brand, its own positioning, its own identity. If users are picking between GPT-3.5 and GPT-4, they're thinking about OpenAI's product, not ours. They're price-comparing against going direct to OpenAI. They're evaluating us as a thin wrapper, because that's exactly what we'd be presenting ourselves as.

Instead: Galadon 1 and Galadon 2. Our branding. Our versioning. Under the hood, one is GPT-3.5 and one is GPT-4 - but that's not the customer's problem. The customer doesn't need to know the infrastructure. They need to know that the higher tier gives better results. That's a value conversation, not a technology conversation.
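
In code, white-labeling mostly comes down to where the mapping lives. A rough sketch, assuming the provider model identifiers `gpt-3.5-turbo` and `gpt-4` and some placeholder tier descriptions; the function names are hypothetical.

```python
# Public tier names come from the text; the provider model identifiers are assumed.
_TIER_TO_MODEL = {
    "Galadon 1": "gpt-3.5-turbo",
    "Galadon 2": "gpt-4",
}

# Placeholder customer-facing copy, written for illustration only.
_TIER_BLURB = {
    "Galadon 1": "Fast, solid answers for everyday conversations.",
    "Galadon 2": "Better results on harder conversations.",
}


def resolve_model(tier: str) -> str:
    """Server-side only: translate a branded tier into a provider model name."""
    try:
        return _TIER_TO_MODEL[tier]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None


def public_tier_info(tier: str) -> dict[str, str]:
    """What the customer-facing UI or API is allowed to see."""
    if tier not in _TIER_TO_MODEL:
        raise ValueError(f"unknown tier: {tier!r}")
    return {"name": tier, "description": _TIER_BLURB[tier]}
```

The point of the split is that only `resolve_model` ever touches a provider name, so nothing customer-facing can leak it, and swapping the model behind a tier later is a one-line change.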

This matters beyond branding. When you expose your underlying infrastructure to customers, you create a ceiling on your pricing power. If they know you're running GPT-4 and GPT-4 costs X per token on OpenAI's website, they can calculate your margin and use that as leverage. When you white-label and position around outcomes, pricing is about the value you deliver, not the commodity you're reselling.

The same principle applies to cold email infrastructure, by the way. The question isn't which sending tool you're using - Smartlead or Instantly or something else. The question is what deliverability and reply rates you're hitting for the client. That's the conversation that justifies your price. The infrastructure is invisible.

Stripe Migration and the Users You Already Have

One more thing came up on this call that I want to address because it's a real operational issue a lot of founders avoid thinking about.

When you migrate payment infrastructure - new platform, new pricing tiers, new billing logic - what do you do with the customers you already have?

We have existing users on Stripe from an earlier version of the product. New platform, new setup, different pricing structure. The developer asked how we were going to handle them.

My answer: migrate them over to the equivalent plan, make it simple, don't leave people hanging. It's not a large number of users - we're early - so the complexity is manageable. But the principle scales: when you change your infrastructure, your existing customers should never feel the chaos. They signed up for an outcome, not for a billing system. Handle the migration on your end and tell them the minimum they need to know.

What you don't do is make existing customers re-sign up, re-enter payment info, or navigate a new checkout flow they didn't ask for. That's how you manufacture churn out of a technical migration that should be invisible to them.
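
The planning step of that migration can be sketched without touching the payment provider at all. The plan names below are hypothetical (the real legacy and new plan identifiers aren't given here), and `plan_migrations` is an illustrative helper, not a Stripe API.

```python
# Hypothetical plan names; replace with the real legacy and new plan identifiers.
LEGACY_TO_NEW_PLAN = {
    "legacy_starter": "starter",
    "legacy_pro": "growth",
}


def plan_migrations(customers: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Map each customer ID to its equivalent new plan.

    Returns (migrations, unmapped). Customers on an unrecognized legacy plan
    land in `unmapped` for manual review instead of being silently dropped.
    """
    migrations: dict[str, str] = {}
    unmapped: list[str] = []
    for customer_id, old_plan in customers.items():
        new_plan = LEGACY_TO_NEW_PLAN.get(old_plan)
        if new_plan is None:
            unmapped.append(customer_id)
        else:
            migrations[customer_id] = new_plan
    return migrations, unmapped
```

Running this as a dry run before changing any subscription gives you the full list of who moves where, plus the handful of accounts that need a human decision. With an early-stage user count, that review takes minutes.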

The Overage Architecture

We were also working through how to implement usage-based billing - specifically, what happens when a user exceeds their interaction limit.

The plan: charge for overage after a threshold. After a user hits their included number of chatbot interactions, each additional interaction costs extra. This is the right model for a usage-heavy product because it aligns your costs with your revenue. Heavy users pay more. Light users aren't subsidizing them.

The implementation is tricky. You have to track usage in real time, fire the billing event at the right moment, and make sure the user experience doesn't degrade when they cross the threshold. You don't want the bot to just stop working; you want it to handle the transition gracefully. That's the part that takes actual engineering, not just Stripe configuration.
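
The arithmetic side of this is simple; here's a minimal sketch of it. The `Plan` fields, the prices, and the function names are assumptions for illustration, and the real-time tracking and billing events are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    included_interactions: int  # interactions covered by the base price
    overage_price_cents: int    # price per extra interaction, in integer cents


def overage_charge_cents(used: int, plan: Plan) -> int:
    """Total overage owed for the billing period."""
    extra = max(0, used - plan.included_interactions)
    return extra * plan.overage_price_cents


def record_interaction(used_so_far: int, plan: Plan) -> tuple[int, bool]:
    """Count one more interaction and report whether it is billable as overage.

    The interaction is always allowed: crossing the threshold changes how it
    is billed, not whether the bot answers.
    """
    new_total = used_so_far + 1
    return new_total, new_total > plan.included_interactions
```

For example, with `Plan(included_interactions=1000, overage_price_cents=5)`, a user who ends the period at 1,200 interactions owes 200 × 5 = 1,000 cents of overage. Keeping amounts in integer cents avoids floating-point rounding in billing math.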

But the model itself is sound. And it matters for how you position the product. If you're charging per interaction, you're not selling a subscription to software - you're selling results. Every interaction is a sales conversation. Every interaction has measurable value. That's a much stronger position than a flat monthly fee for access to a tool.

What the Wrong Chatbot Taught Me About Validation

I've written about the Vanity Product Trap before - the pattern where a product looks innovative, attracts early hype, gets early adopters excited, and then craters because it never actually moved anyone's bottom line. The customers weren't buying a bad product. They were buying the idea of the product, and when the idea met reality, it didn't hold up.

The wrong chatbot demo is a miniature version of that trap.

It looked like progress. It felt like progress. The team was energized. The API was working. The integration was clean. If I hadn't known what the product was supposed to do - if I were an outside investor watching that demo - I might have thought we were further along than we were.

That's dangerous. Not because the team was lying or cutting corners, but because forward momentum on the wrong objective feels identical to forward momentum on the right one. You can't tell from the inside. You have to define the right objective clearly enough that you can test against it - and then actually test against it, not against a proxy metric that's easier to measure.

For us, the test isn't "did the chatbot answer correctly?" The test is "did the chatbot move someone toward a conversion?" Those require completely different evaluation frameworks.
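
To see how far apart those two tests can be, here's a small sketch that scores the same conversations both ways. The `Conversation` fields are hypothetical; in practice they'd come from manual review and click tracking.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    answered_correctly: bool  # the proxy metric: was the answer accurate?
    clicked_trial_link: bool  # the real metric: did the visitor move toward converting?


def rates(conversations: list[Conversation]) -> tuple[float, float]:
    """Return (accuracy_rate, conversion_rate) for a batch of conversations."""
    total = len(conversations)
    if total == 0:
        return 0.0, 0.0
    accurate = sum(c.answered_correctly for c in conversations)
    converted = sum(c.clicked_trial_link for c in conversations)
    return accurate / total, converted / total
```

A batch where every answer is accurate but only one visitor in four clicks through would score (1.0, 0.25): perfect on the proxy, weak on the objective.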

If you're building anything with AI right now, that's the question I'd push you to answer before your next demo: what is the actual success metric, and are you testing against it - or are you testing against something that correlates with it but isn't it?

Because the system that optimizes for the wrong objective will look exactly like the system that optimizes for the right one. Right up until it doesn't.

The Sandbox First Principle

One thing the dev team proposed that I thought was smart: build a dedicated sandbox page before shipping anything to the front end.

Instead of showing the chatbot integrated in the product and iterating there - which creates a messy feedback loop and risks exposing broken behavior to real users - build an isolated environment where you can tweak the prompt, test the model, adjust the output format, and share feedback without anything touching production.

This is obvious in theory. Almost nobody does it in practice. They build in production because it's faster to see results, and then they find out in front of a customer that the results weren't what they needed.

The sandbox is a forcing function. It requires you to define what "good" looks like before you start testing, because you're evaluating the sandbox output against something. That definition - the spec for what the AI should say, in which situations, with what tone, driving toward what action - is the most important work you can do before writing a single line of prompt.

If you haven't written that spec, you don't have a product yet. You have a demo.

If you're building outbound systems - not AI chatbots, but cold email and prospecting flows - the same principle applies. Define what a successful output looks like before you start building the sequence. What does a reply that converts look like? What does a reply that doesn't convert look like? How do you tell the difference at the top of the funnel? Tools like Clay let you build sophisticated personalization logic, but that logic has to be pointed at the right objective or you're just generating noise at scale. And if you need clean prospect data to start with, ScraperCity's B2B database gives you the raw material - but the same principle holds: garbage in, sophisticated process, garbage out. The data has to be right before the system can be right.

Get the top cold email scripts if you want to see what "right objective, right output" looks like in practice for outbound.

Ship When the Objective Is Met, Not When the Demo Looks Good

I'll close with the thing I kept coming back to after this call.

There's a version of every product that looks done. The UI is clean. The integrations work. The demo runs smoothly. Someone who doesn't know better would look at it and say: ship it.

And there's a version of every product that is done. Where the output matches the objective. Where the thing you built does the thing you needed it to do, in the real conditions it'll face, for the real users who'll use it.

Those two versions are often separated by a single question: what is this actually supposed to do?

We knew the answer on that call. The chatbot is supposed to be an automated salesperson. It's supposed to answer questions the way a good closer would - concisely, persuasively, with a clear path to conversion at the end. Not a documentation engine. Not a source aggregator. A closer.

The demo we saw wasn't that. It was a documentation engine for a completely different product, performing flawlessly.

Close enough to fool someone who didn't know better. Not close enough to ship.

Know the difference.

If you want to see how I apply this kind of objective-first thinking to building sales systems - outbound sequences, lead gen, agency growth - come join us at Galadon Gold. We work through this stuff live, on real businesses, in real time. Not theory. The same way this call happened.
