Why AI Visibility Tools Are Lying About

What are AI visibility tools and why are marketers questioning them?

AI visibility tools promise to tell you how often your brand appears inside ChatGPT, Claude, Gemini, Perplexity, and Google’s AI answers, then compress that into tidy numbers like mention rate, citation rate, share of voice, and rank.

The pitch is seductive: a dashboard that says you are number 4 in your category, that you moved up 2 spots this week, or that you sit at 17% visibility while a competitor sits at 31%. A software engineer writing at Canonry, a vendor in this exact space, argues the signal is not worthless, but the precision is invented, because these systems are noisy, personalized, geographic, and nondeterministic by design.

A clean leaderboard number hides the one thing you actually need to see, which is the spread behind it.

What is the evidence that AI visibility rankings are unreliable?

The instability is measurable, and it comes from independent sources rather than marketing copy.

Thinking Machines Lab showed that identical temperature-0 requests can return several different completions under real production load. SparkToro and Gumshoe had volunteers run the same commercial prompts through ChatGPT, Claude, and Google repeatedly, and the recommended brands changed substantially between runs. A research paper from Chen, Zaharia, and Zou found that GPT-4’s accuracy on one task swung from 84% in March 2023 to 51% in June 2023 under the same public model name. On top of that, the instrument itself keeps changing: in an April 29, 2025 post, OpenAI said it had rolled back a ChatGPT update because the version was too flattering and agreeable, the kind of shift an outside dashboard only notices after it has already bent the trend line.

If the next run of the same prompt names a different set of brands, then “you rank number 4” is one sample from a distribution, not a fact.
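To make the "one sample from a distribution" point concrete, here is a minimal Python sketch that summarizes where a brand lands across repeated runs of the same prompt. The brand names, the five runs, and the `rank_spread` helper are all hypothetical, invented for illustration; this is not how any particular vendor computes its numbers.

```python
from statistics import median

# Hypothetical data: each inner list is the ordered set of brands
# that one run of the same prompt recommended.
runs = [
    ["Acme", "Bolt", "Crest", "YourBrand"],
    ["Bolt", "YourBrand", "Acme"],
    ["Acme", "Crest", "Dyna", "Bolt"],
    ["Crest", "Acme", "Bolt", "Dyna", "YourBrand"],
    ["YourBrand", "Acme", "Bolt"],
]


def rank_spread(runs, brand):
    """Summarize where `brand` landed across repeated runs of one prompt."""
    # 1-based position in every run where the brand appeared at all.
    positions = [run.index(brand) + 1 for run in runs if brand in run]
    return {
        "runs": len(runs),
        "appearances": len(positions),
        "mention_rate": len(positions) / len(runs),
        "best": min(positions) if positions else None,
        "median": median(positions) if positions else None,
        "worst": max(positions) if positions else None,
    }


summary = rank_spread(runs, "YourBrand")
print(summary)
```

In this toy data the brand shows up in 4 of 5 runs and lands anywhere from position 1 to position 5. A dashboard that sampled only the first run would report "rank 4" and hide the rest.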

How do these tools measure visibility, and what should small business owners understand?

Every tool picks a method, and each method quietly bends the number before you ever see it.

Some scrape the consumer app, which captures 1 account, 1 location, 1 memory state, and 1 session, then sell it as what your customers see. Others call the provider API, which is repeatable and auditable but behaves differently from the app a real buyer uses. On top of the method, the vendor chooses the prompt set (Profound notes its users often track a couple hundred prompts) and the scoring formula, and those choices decide the headline. Digital Applied showed the same evidence producing 3 different numbers: 20% mention-based share of voice, 16.8% position-weighted, and 31.4% citation-based. Practitioners see it too. Paul Dyer, CEO of /prompt, put it plainly: ask 3 tools and you get 3 different answers.

Same data, 3 different standings, decided entirely by choices the vendor made for you.
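A rough Python sketch shows how one set of answers can yield three different standings. The formulas here are assumptions chosen for illustration (share of raw mentions, mentions weighted by 1/position, and share of citations). They are not Digital Applied's definitions and do not reproduce its figures, and the sample answers are made up.

```python
# Hypothetical sample: for each AI answer, the brands in the order they were
# mentioned and the brand behind each cited source.
answers = [
    {"mentions": ["Acme", "YourBrand", "Bolt"], "citations": ["YourBrand", "YourBrand"]},
    {"mentions": ["Bolt", "Acme", "YourBrand"], "citations": ["Acme"]},
    {"mentions": ["Acme", "Bolt"], "citations": ["YourBrand", "Bolt"]},
    {"mentions": ["Acme", "Bolt", "Crest", "YourBrand"], "citations": []},
]


def mention_share(answers, brand):
    """Brand mentions divided by all brand mentions."""
    total = sum(len(a["mentions"]) for a in answers)
    hits = sum(a["mentions"].count(brand) for a in answers)
    return hits / total if total else 0.0


def position_weighted_share(answers, brand):
    """Same as mention share, but each mention counts 1/position (assumed weighting)."""
    total = 0.0
    own = 0.0
    for a in answers:
        for pos, name in enumerate(a["mentions"], start=1):
            weight = 1.0 / pos
            total += weight
            if name == brand:
                own += weight
    return own / total if total else 0.0


def citation_share(answers, brand):
    """Citations pointing at the brand divided by all citations."""
    total = sum(len(a["citations"]) for a in answers)
    hits = sum(a["citations"].count(brand) for a in answers)
    return hits / total if total else 0.0


for label, fn in [
    ("mention-based", mention_share),
    ("position-weighted", position_weighted_share),
    ("citation-based", citation_share),
]:
    print(f"{label}: {fn(answers, 'YourBrand'):.1%}")
```

On this toy data the same four answers give 25.0% mention-based, 14.9% position-weighted, and 60.0% citation-based share for the same brand. None of the three is wrong; they answer different questions, which is why the formula needs to be disclosed alongside the number.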

How does false AI visibility data affect day-to-day operations for small businesses?

It pushes you to spend real money and real hours chasing a number that cannot support the decision.

If a dashboard says last week’s blog post lifted you 2 spots, you might pour budget into more of the same, when the real cause was model drift or a scoring quirk. Location makes it worse, because a result for “best roofing company near me” changes by city, so a single global rank is often meaningless for a business that serves one area. Before you trust any leaderboard, our running archive of field-tested tool breakdowns for small operators is a steadier starting point than a vendor’s number.

A number you cannot audit is a number that quietly sets your budget for you.

The dial indicator clamps onto the brake rotor and the needle settles on 0.003 inches of runout, a clean, exact, confidence-inspiring reading, so the tech writes it on the ticket and moves on. The trouble is the magnetic base is stuck to a panel that flexes every time someone leans on the fender, so that 0.003 is one frozen frame of a number that is actually wandering between 0.001 and 0.008 depending on where the car sits and who is standing near it. A week later the customer is back with the same vibration, because the shop measured a moving thing once, wrote down the tidy version, and billed a repair against it. That is the AI visibility dashboard in one image, a precise-looking readout clamped to a surface that will not hold still, and a decision made as if the needle told the whole truth.

What is the final verdict on AI visibility tools?

The category is useful for direction, dangerous for precision, and only honest when it shows its work.

Treat these tools as a rough compass. They can tell you that you are invisible on the commercial prompts buyers actually ask, or that a competitor gets cited far more often than you, and those are real, useful findings. What they cannot honestly hand you is an exact rank, a one-decimal share of voice, or a clean cause for this week’s movement, unless they also show the prompt list, the runs per prompt, the variance, and the raw answers. The original piece lists 9 questions worth asking any vendor, and 4 of them matter most: what you are measuring, how many times, with what spread, and whether you can see the evidence.

If a vendor cannot show you the spread behind the number, you are buying decoration, not measurement.
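One way to picture "the spread behind the number" is an interval around a mention rate. The sketch below uses the standard Wilson score interval for a proportion. Treating each run as an independent yes/no trial is a simplifying assumption (real runs share prompts, sessions, and model versions), and the run counts are hypothetical.

```python
import math


def wilson_interval(hits, runs, z=1.96):
    """Wilson score interval for a mention rate of `hits` out of `runs`.

    z=1.96 corresponds to roughly 95% confidence. Assumes independent runs,
    which is optimistic for AI answers.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = hits / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical: you at about 17% and a competitor at 30%, at two sample sizes.
for runs, yours, theirs in [(30, 5, 9), (300, 50, 90)]:
    you_lo, you_hi = wilson_interval(yours, runs)
    them_lo, them_hi = wilson_interval(theirs, runs)
    overlap = you_hi >= them_lo
    print(f"{runs} runs: you {you_lo:.1%}-{you_hi:.1%}, "
          f"competitor {them_lo:.1%}-{them_hi:.1%}, overlap={overlap}")
```

Under these assumptions, the two intervals overlap at 30 runs each and separate at 300. The same headline gap can be noise or signal depending on a run count most dashboards never show.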

Source: canonry.ai