What an AI assistant is bad at

The honest list of failure modes. Knowing them saves hours.

#It sounds confident even when wrong

The biggest failure mode. An LLM doesn't have a built-in "I'm unsure" signal; its words come out smooth either way. If it tells you a hotel's email is reservations@hotel.com and it never actually visited their site, the sentence looks identical to one where it did.

Mitigation: Ask it to verify. "Did you actually check the site or are you guessing?" works. A well-tuned assistant will usually admit when it guessed.

#Math and counting

Mediocre at arithmetic, bad at counting items in lists, unreliable at anything involving precise tallies. If a number matters, the assistant should use a tool (calculator, code, SQL) rather than do it in its head.

  • Don't ask it to count words in a passage without writing code
  • Don't ask it to add up a column of numbers without a calculator
  • Don't trust dates it computes without checking; it often gets the day of the week wrong
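
As a rough sketch of what "use a tool" can look like, the Python below counts words and totals a column in code instead of by estimation. The function names and sample data are made up for illustration, and splitting on whitespace is only one assumption about what counts as a word.

```python
from decimal import Decimal


def count_words(text: str) -> int:
    """Count whitespace-separated words (an assumed definition of 'word')."""
    return len(text.split())


def sum_column(values: list[str]) -> Decimal:
    """Add up numbers given as strings; Decimal avoids float rounding drift."""
    return sum((Decimal(v) for v in values), Decimal("0"))


passage = "The quick brown fox jumps over the lazy dog"
print(count_words(passage))                   # 9
print(sum_column(["19.99", "5.01", "100"]))   # 125.00
```

Asking the assistant to run something like this, and to show the output, gives you a number you can check rather than one it produced from memory.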

#Time + dates

Specifically: most LLMs get the day of the week wrong unless they actively check it. If timing matters, double-check. Ask the assistant to print today's date and weekday before drafting anything time-sensitive.
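
One way to make that check concrete is to have the assistant compute the weekday in code. This is a minimal sketch using Python's standard `datetime` module; the `describe` helper is a hypothetical name, and it uses a fixed English weekday list so the output does not depend on locale settings.

```python
from datetime import date

# Monday is 0 in date.weekday(), so this list lines up with its return value.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def describe(d: date) -> str:
    """Return the date in ISO format followed by its weekday name."""
    return f"{d.isoformat()} ({WEEKDAYS[d.weekday()]})"


print(describe(date.today()))       # whatever today is on the machine running it
print(describe(date(2024, 2, 29)))  # 2024-02-29 (Thursday)
```

Note that `date.today()` reflects the clock and time zone of the machine running the code, which may not match yours.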

#Recent events

Training has a cutoff. Anything after that the model either doesn't know or hallucinates. It can search the web to fill the gap, but it needs to actually run the search; it won't do it automatically unless the context makes it obvious.

#Its own recent actions

Long sessions degrade context. After hours of conversation, the assistant sometimes claims something is done when only half of it actually shipped.

Mitigation: Ask it to verify what it built before declaring it done. "Show me the actual URL / file / commit." Concrete artifacts beat its word.

#Subtle creative judgment

An LLM can write competent prose, but its taste is generic. The phrase that makes a line memorable, the way a joke lands, the difference between "good copy" and "copy a specific human would actually write": it struggles with all of these. Best paired with a human providing voice while the assistant handles structure.

#Visual / spatial reasoning

Reading a complex diagram, understanding webpage layout from a screenshot, spotting a visual error in a chart: all much weaker than text reasoning. Basic image description works; deep visual analysis doesn't.

#Long messages on some channels

Not technically an LLM failure, but bites users constantly. A 500+ char multi-section response often vanishes between the gateway and the user's phone, especially on WhatsApp. The fix is keeping replies short and putting long content on pages or in shared docs.
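
A rough sketch of that fix: send short replies as they are, and replace long ones with a preview plus a link to the full text. The 500-character limit comes from the figure above and is not a documented channel limit. The `fit_for_chat` name and the example URL are hypothetical.

```python
LIMIT = 500  # assumed threshold, taken from the figure mentioned in the text


def fit_for_chat(reply: str, doc_url: str, limit: int = LIMIT) -> str:
    """Return the reply unchanged if short, else a preview plus a link."""
    if len(reply) <= limit:
        return reply
    suffix = f"... Full version: {doc_url}"
    # Leave room for the suffix so the whole message stays within the limit.
    room = max(limit - len(suffix), 0)
    return reply[:room].rstrip() + suffix


long_reply = "Section one. " * 100  # 1300 characters
message = fit_for_chat(long_reply, "https://example.com/notes")
print(len(message) <= LIMIT)  # True
```

The long content still has to be published somewhere first; this only decides what goes into the chat message.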

#Things LLMs make up sometimes

  • URLs that look real but 404: always click to verify
  • API endpoint shapes, with invented fields
  • Quotes attributed to people who never said them
  • Library functions that don't exist (especially in less-popular packages)
  • File paths that "should" exist but don't: ask the assistant to ls first
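
Two of these checks are easy to automate. The sketch below, using only the Python standard library, reports which paths are missing and what HTTP status a URL returns. The helper names are hypothetical. Some servers reject `HEAD` requests, so an error status here is a prompt to look by hand, not proof the page is gone.

```python
from pathlib import Path
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def missing_paths(paths: list[str]) -> list[str]:
    """Return the subset of paths that do not exist on this machine."""
    return [p for p in paths if not Path(p).exists()]


def url_status(url: str, timeout: float = 10.0):
    """Return the HTTP status code for url, or None if it cannot be reached."""
    try:
        request = Request(url, method="HEAD")
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # e.g. 404 for a URL that looks real but is not
    except (OSError, ValueError):
        return None  # DNS failure, timeout, or a malformed URL


print(missing_paths(["/definitely/not/a/real/path.txt"]))
```

Calling `url_status("https://example.com/some/page")` makes a real network request, so run it only against URLs you intend to check.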

#What to do about all this

  • Ask the assistant to verify before committing to anything important
  • When given a URL / address / phone number, ask "did you check this?"
  • When it claims something is done, ask for the concrete artifact
  • When the cost of being wrong is real, prefer "show me what you found" over "tell me the answer"