Every other startup deck I read has "AI-powered" in the headline. Most of them mean they send a user's input to the OpenAI API and display the output. That's not a product strategy. That's an API call.
This isn't a complaint about using LLMs. I use them in production. The problem is the reflexive reach for them — adding AI to something because it sounds defensible in a pitch, not because it solves an actual problem better than the alternatives.
Where LLMs Actually Add Value
There are a few categories where language models consistently earn their place.
Tasks where output variance is acceptable. Drafting, summarizing, classifying, reformatting — anything where "pretty good most of the time" is the actual requirement. If you're generating a first draft of a demand strategy or summarizing a meeting transcript, a wrong word or a slightly off framing is not a production incident. You can tolerate variance.
Tasks that were previously impossible or prohibitively expensive. Before LLMs, extracting structured intent from a paragraph of free-form user input meant either a complicated NLP pipeline or a human reading it. Now it's a $0.20/MTok API call to gpt-5.4-nano — or Claude Haiku if you prefer Anthropic's stack. That's a real change.
Augmenting existing workflows. Not replacing them. A developer using Copilot to autocomplete boilerplate is faster. The same developer using an LLM to autonomously push production code without review is asking for a 2am incident.
Where They Don't
Core business logic. If the output of your AI step determines whether money moves, who gets access, or what data gets written — you need determinism. LLMs are not deterministic. "Usually correct" is not correct when the failure mode is a financial transaction.
Anywhere the output needs to be 100% right. Medical dosing. Legal clauses. Tax calculations. You can use LLMs to assist humans in these domains. You cannot use them to replace the human in the loop.
Anything requiring sub-100ms latency. Even gpt-5.4-nano with streaming takes 300-800ms to start a response on a good day. If you're building real-time autocomplete, a search ranking layer, or a validation step inside a tight loop — LLMs don't fit. Use a rules engine or a smaller, purpose-trained model.
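For the tight-loop case, a deterministic rule set is usually enough. A minimal sketch of a rules-based ticket router — the patterns and names are invented for illustration, not taken from any real system:

```typescript
// Hypothetical rules-based priority check: deterministic, sub-millisecond,
// and trivially debuggable -- the kind of step that does not need a model.
type Priority = "urgent" | "normal";

const URGENT_PATTERNS: RegExp[] = [
  /\bdown\b/i,
  /\bdata loss\b/i,
  /\bcannot log ?in\b/i,
  /\bsecurity\b/i,
];

function classifyPriority(text: string): Priority {
  // First matching pattern wins; the array order encodes precedence.
  return URGENT_PATTERNS.some((p) => p.test(text)) ? "urgent" : "normal";
}
```

No inference latency, no token cost, and when it misfires you can point at the exact regex that matched.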
A Real Example: FirstDemand
FirstDemand takes a landing page URL and generates a demand strategy — target audience, positioning angles, acquisition channels, messaging gaps. A structured LLM pipeline does the work in about 20 seconds.
What would this cost without AI? A strategy consultant billing at €150/h would spend 3 hours doing the same research and framing. That's €450 per output, and they'd book out weeks in advance.
Why does AI work here? Two reasons. First, the output is directional guidance, not a legal opinion. If the model gets one channel recommendation slightly wrong, the founder reads it critically and adjusts. The variance is fine. Second, the task was previously gated behind expensive human time — that's exactly the category where LLMs unlock real value.
A Counterexample: InferCheck
InferCheck is a directory of 350+ AI models across 111+ providers, with GDPR compliance data for each. People use it to evaluate whether a given AI provider can be used legally in the EU.
The tempting move: use an LLM to answer compliance questions conversationally. "Is gpt-5.4 GDPR compliant for healthcare data?"
We didn't do that. The AI isn't doing the work here. The structured, curated, human-verified data is. An LLM answering GDPR questions about AI providers is exactly the "output needs to be 100% correct" failure mode. Someone could make a wrong compliance call based on a hallucinated answer. So InferCheck is a database with search, filters, and sourced records — not a chatbot.
Don't over-rotate. Sometimes the right tool is a well-structured Postgres table.
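What that looks like in practice is almost boringly simple. A minimal sketch of the "database, not chatbot" shape — the field names and records below are invented for illustration, not InferCheck's actual schema or data:

```typescript
// Compliance answers come from curated, sourced records -- never from
// generation. Example data is made up for this sketch.
interface ProviderRecord {
  model: string;
  provider: string;
  gdprReady: boolean;
  sourceUrl: string; // every answer points at a verifiable source
}

const records: ProviderRecord[] = [
  { model: "example-model-a", provider: "ExampleAI", gdprReady: true,  sourceUrl: "https://example.com/dpa" },
  { model: "example-model-b", provider: "OtherAI",   gdprReady: false, sourceUrl: "https://example.com/terms" },
];

function lookupGdpr(model: string): ProviderRecord | undefined {
  // Exact lookup: either a sourced record or an explicit "we don't know".
  // No hallucinated middle ground.
  return records.find((r) => r.model === model);
}
```

The important property is the `undefined` branch: a missing record is an honest gap, not an invented answer.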
The "LLM as Regex" Pattern
A lot of what developers call "AI features" are glorified text extraction tasks. A user submits a support ticket. You want to extract: category, urgency level, affected product, whether it contains a billing question.
You do not need a 70B model for this. You need a prompt with a JSON schema and gpt-5.4-nano at $0.20/MTok input — or claude-haiku-4-5 if you're already on Anthropic's API. That's the "LLM as regex" pattern — using a language model where you'd previously need a brittle rule set or a custom NLP classifier. Cheap, fast, and good enough.
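The caller's side of this pattern fits in a few dozen lines: pin the shape you expect, then validate the model's response before trusting it. A sketch with illustrative field names — `validateTicket` is a hypothetical helper written for this post, not a library API:

```typescript
// The shape we asked the model to return via a JSON schema in the prompt.
interface TicketExtraction {
  category: string;
  urgency: "low" | "medium" | "high";
  product: string;
  isBillingQuestion: boolean;
}

function validateTicket(raw: string): TicketExtraction | null {
  try {
    const parsed = JSON.parse(raw);
    const urgencyOk = ["low", "medium", "high"].includes(parsed.urgency);
    if (
      typeof parsed.category === "string" &&
      urgencyOk &&
      typeof parsed.product === "string" &&
      typeof parsed.isBillingQuestion === "boolean"
    ) {
      return parsed as TicketExtraction;
    }
    return null; // shape mismatch: a failed extraction, not a crash
  } catch {
    return null; // malformed JSON: same fallback path
  }
}
```

A `null` return is a signal to retry or route to a default, which is the whole point: the model's output never reaches your logic unchecked.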
Knowing when you're in this territory — and not reaching for a more capable model than the task requires — keeps your inference costs close to zero.
A Checklist Before You Add AI
Before wiring up any LLM call, answer these three questions:
- Would a regex, a lookup table, or a rule work? If yes, use that. Simpler is faster, cheaper, and debuggable.
- Does output quality variance matter here? If a wrong answer has real consequences — financial, legal, medical, security — skip the LLM or add a mandatory human review step.
- Do you need real-time results (sub-100ms)? If yes, skip the LLM.
If you answered no to all three, an LLM probably fits. Proceed.
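The checklist is mechanical enough to write down as a gate. Purely illustrative — the field names are mine, not an established framework:

```typescript
// The three checklist questions, encoded as booleans.
interface FeatureCheck {
  simplerToolWorks: boolean;        // a regex, lookup table, or rule would do it
  varianceHasConsequences: boolean; // wrong answers cost money, legality, safety
  needsSub100ms: boolean;           // real-time or tight-loop latency budget
}

function llmFits(c: FeatureCheck): boolean {
  // An LLM is a candidate only when all three answers are "no".
  return !c.simplerToolWorks && !c.varianceHasConsequences && !c.needsSub100ms;
}
```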
The Real Cost Is Not the API Bill
The API bill is predictable. gpt-5.4-nano at $0.20/MTok for extraction tasks costs almost nothing at startup scale.
The real cost is debugging time. At 2am, when your LLM pipeline returns a malformed JSON blob because the model decided to add a "helpful" note before the closing brace, and your JSON.parse() throws, and the stack trace points at a dynamic string you built six abstraction layers deep — that's where you pay.
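One cheap defense is to stop feeding the raw response straight to JSON.parse() and instead pull out the first balanced object. A naive sketch — it will miscount braces that appear inside string values, so treat it as a last-resort fallback, not a parser:

```typescript
// Defensive parse for model output that wraps JSON in prose
// ("Sure! Here is the JSON: {...} Hope this helps.").
// Naive: a brace inside a string value will confuse the depth count.
function extractJsonObject(text: string): unknown | null {
  const start = text.indexOf("{");
  if (start === -1) return null;
  let depth = 0;
  for (let i = start; i < text.length; i++) {
    if (text[i] === "{") depth++;
    else if (text[i] === "}") {
      depth--;
      if (depth === 0) {
        try {
          return JSON.parse(text.slice(start, i + 1));
        } catch {
          return null; // balanced but still invalid JSON
        }
      }
    }
  }
  return null; // never balanced: give up explicitly instead of throwing
}
```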
LLMs fail in ways that are hard to anticipate and annoying to reproduce. They're inconsistent across model versions. A prompt that works fine on gpt-5.4 might behave differently after a model update — OpenAI and Anthropic both do silent capability shifts. You need output validation, fallback handling, and structured outputs (JSON mode or function calling) from day one. Not as an afterthought when things break.
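Fallback handling can be as small as a wrapper that validates, retries, and then returns a safe default instead of throwing at 2am. A sketch, shown synchronous for brevity — `callModel` is a stand-in for your actual inference call, not a real SDK function:

```typescript
// Call the model, validate the output, retry on failure, and fall back to a
// deterministic default rather than crashing the pipeline.
function extractWithFallback<T>(
  callModel: () => string,            // stand-in for the real (async) inference call
  validate: (raw: string) => T | null, // e.g. a schema check like validateTicket
  fallback: T,
  retries = 1,
): T {
  for (let attempt = 0; attempt <= retries; attempt++) {
    const parsed = validate(callModel());
    if (parsed !== null) return parsed; // validated output wins
  }
  return fallback; // degraded but predictable behavior
}
```

The fallback branch is the part worth designing deliberately: "file the ticket uncategorized" is a fine default; "guess a category" is not.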
Add AI where it genuinely changes what's possible. Skip it where a simpler tool does the same job. The goal is a working product, not a pitch deck with a higher AI mention count.