Once we take a stance on something, we unconsciously seek out evidence that supports it and discount whatever contradicts it. This isn’t a new discovery – psychology calls it confirmation bias.

The interesting part is that LLMs do it too.

I recently ran an experiment: I had Claude, ChatGPT, and Gemini conduct multi-round analysis on the same stock. The moderator’s opening was simple – “$XXX bull vs bear.” That single line was enough. All three models immediately entered advocacy mode: searching became evidence-hunting, analysis became defense, and when data ran short, they started fabricating.

After dozens of rounds, the most valuable takeaway wasn’t their conclusions but a more fundamental question: when the prompt itself assigns a stance, can the model still be objective?

Experiment Setup

I chose a publicly traded company in the middle of multiple crises – fundamentals still growing, but a sudden event had crashed the stock price, with regulatory penalties, class-action lawsuits, and management upheaval all in play.

The conversations came in two formats: Claude vs Gemini (debate format, each assigned bull or bear) and Claude vs ChatGPT (mutual critique format). I served as moderator, using Agora to automate the AI-to-AI dialogue – it uses browser automation to let different models spar in real time, with no manual copy-pasting needed.

The moderator’s opening was simple: one line saying “$XXX bull vs bear.”

That single line was the root of the problem.

Observations

Observation 1: Fabricated Precise Data

Gemini, while analyzing the options market, cited “IV 61%, implied move +/-10.2%,” attributed to a website called OptionCharts.io – which doesn’t exist. It also coined abbreviations like “ALF” and “LCPM” that looked like professional jargon but were entirely made up.

In competitive analysis, it claimed that a competitor “launched its largest hiring spree since founding,” that “penetration among a specific demographic was extremely rapid,” and that another “expanded to 70 locations” – none of these claims had any traceable source. Only in the final round, under repeated questioning, did it admit these figures “exceeded the boundary of verifiable facts.”

Observation 2: Omitted Core Variables

ChatGPT’s first-round output was textbook analysis: industry leadership, infrastructure moats, membership stickiness, limited market ceiling, heavy-asset model risks… Structured clearly, phrased pleasantly, but completely missing the sudden event that was reshaping the company’s valuation.

That’s like discussing a ship’s cruising speed in the eye of a storm without mentioning the crack in the hull.

Observation 3: Numbers Degraded in Transmission

A net cash figure denominated in local currency got its units misconverted during multi-round dialogue, with the value snowballing larger. This is a classic LLM degradation pattern in long conversations – each round’s small deviation gets amplified by the next.

Claude also stumbled here: it cited a cached, outdated stock price from a search engine to accuse the other side of fabricating data, when the actual price had dropped significantly further.

Observation 4: Behavioral Divergence When Information Runs Dry

When public information was exhausted, ChatGPT ended three consecutive rounds with carefully crafted questions – “Which dimension worries you more?” “What’s your internal read?” “What are you stress-testing?” – trying to extract new information from its conversation partner to sustain depth. It even said “continuing to compare at this point yields diminishing returns,” then immediately threw out another question.

It knew it should stop, but mechanically it wouldn’t.

Observation 5: Search Scope Locked to the Target

All search keywords revolved around the target company itself. The result: competitors’ independent growth data, structural changes in the regulatory environment, supply chain policy adjustments – these indirect but potentially more important variables were systematically overlooked.

For instance, a pending legislative reform could legalize traditional retailers’ entry into online delivery – permanently removing one of the pillars of the target company’s structural moat. But none of the three models touched on it in their initial analysis, because it wouldn’t show up in “$XXX + bear case” search results.

Many critical variables only existed in local-language media. Searching only in English means you’re always seeing the tip of the iceberg.
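One way to picture the fix is to generate the search queries mechanically, so that competitors and the wider environment get the same coverage as the target. The following is a minimal sketch, not the tooling used in the experiment: `build_queries`, the competitor names, and the theme strings are all hypothetical placeholders.

```python
def build_queries(target, competitors, company_themes, environment_topics):
    """Return search queries that reach beyond the target company itself."""
    queries = []
    # Company-level themes are applied to the target AND to every competitor,
    # so competitors' independent changes are searched for directly.
    for company in [target, *competitors]:
        for theme in company_themes:
            queries.append(f"{company} {theme}")
    # Environment-level topics are searched on their own, without any company
    # name, so they cannot be filtered out by target-centric phrasing.
    queries.extend(environment_topics)
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(queries))


# Hypothetical inputs: none of these names or terms come from a real run.
queries = build_queries(
    target="$XXX",
    competitors=["CompetitorA", "CompetitorB"],
    company_themes=["user growth", "hiring", "store expansion"],
    environment_topics=[
        "online delivery regulation reform",
        "supplier settlement cycle policy",
    ],
)
for query in queries:
    print(query)
```

To cover local-language media, the same function could be called a second time with the themes and topics written in the local language; the translation itself is left to the caller here.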

The Moderator’s Experience

Everything above is about output quality. But as the person actually running these conversations, some things are only visible from the middle.

ChatGPT was the weakest – no match for Claude and Gemini. Its data sourcing was terrible; without data, it could only go silent.

The real contest was between Claude and Gemini – genuine back-and-forth, evenly matched. Claude’s deep-dive reasoning was exceptional; give it a variable and it drills down layer by layer. But in search capability it clearly lagged behind Gemini, with its search-engine DNA, and its responses were half a beat slower – architecturally disadvantaged. Gemini was slicker, faster to output, broader in coverage – Google in its bones.

But at the end of the day, inaccurate data is inaccurate data – whether from failed searches, cached stale results, or outright fabrication, it all contaminates the conclusion. This is a shared problem across all three models, with no exemptions.

The Root Cause: The Moderator’s Opening

Looking back, the common root of these phenomena isn’t purely a model capability issue – the moderator’s opening “$XXX bull vs bear” was itself inducing failure.

That single line triggered three bad patterns simultaneously:

  • Framework first. “Bull vs bear” made the models instinctively build a framework before gathering data. Dimensions outside the framework were systematically ignored.
  • Rush to take sides. Once assigned a stance, the models’ attention locked onto “supporting my position” rather than “what haven’t I seen yet.” When available data couldn’t support the argument, fabrication became the shortcut for filling gaps.
  • Target-centric. “$XXX bull vs bear” locked all search terms to the target company. Independent changes among competitors and in the regulatory environment were systematically missed.

A Better Opening

If I were to do it over, the moderator’s opening would be a short template built around three key lines:

{{Pro}} argues for, {{Con}} argues against. Let's explore {{Topic}} together.

Don't draw conclusions yet. List every possibly relevant variable --
exhaustive, unranked, uncategorized.
Tag each variable with source and date. If there's no source, don't include it.

When brainstorming, beyond the topic's direct variables, you must also cover:
- What changes are happening in the topic's external environment?
- What forces from adjacent domains might cross over to affect this topic?

After brainstorming, each side adds "5 variables the other side missed."
Merge, deduplicate, and use that as the final variable set.

One line – “don’t draw conclusions yet” – blocks framework-first thinking.

One line – “exhaustive, unranked, uncategorized” – forces both sides to broaden their search instead of rushing to take sides.

One line – “if there’s no source, don’t include it” – blocks fabrication.

Two follow-up questions – “what changes are happening in the external environment” and “cross-domain forces” – expand the search scope from the target itself to the ecosystem.

The final step – “add 5 variables the other side missed” – makes the first move of the debate not rebuttal, but helping the other side fill gaps.
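The sourcing rule and the final merge step can also be enforced on the moderator’s side rather than left to the models. Below is a minimal sketch under stated assumptions: the `Variable` record and the `merge_variables` helper are hypothetical names, and “same variable” is approximated by a case- and whitespace-insensitive name match, which is cruder than a human merge would be.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variable:
    name: str
    source: Optional[str]  # where the claim comes from; None if unsourced
    date: Optional[str]    # date of the source, e.g. "2025-01-31"
    proposed_by: str       # which side listed it


def merge_variables(*variable_lists):
    """Merge both sides' lists, dropping unsourced entries and duplicates."""
    merged = {}
    for variables in variable_lists:
        for variable in variables:
            # "If there's no source, don't include it."
            if not variable.source or not variable.date:
                continue
            # Normalize the name so trivial spelling differences collapse.
            key = " ".join(variable.name.lower().split())
            # Keep the first sourced occurrence of each variable.
            merged.setdefault(key, variable)
    return list(merged.values())
```

Keeping `proposed_by` on each record makes it easy to see afterwards which side contributed which variable, including the five gap-filling additions.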

This template isn’t limited to financial analysis. It works for any topic that needs multi-perspective exploration.
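To reuse the opening across topics, the template can be stored once and filled in per debate. This is only an illustrative sketch: the function name `render_opening` and the example arguments are assumptions, and only the template text itself comes from the post.

```python
import re

OPENING_TEMPLATE = """\
{{Pro}} argues for, {{Con}} argues against. Let's explore {{Topic}} together.

Don't draw conclusions yet. List every possibly relevant variable --
exhaustive, unranked, uncategorized.
Tag each variable with source and date. If there's no source, don't include it.

When brainstorming, beyond the topic's direct variables, you must also cover:
- What changes are happening in the topic's external environment?
- What forces from adjacent domains might cross over to affect this topic?

After brainstorming, each side adds "5 variables the other side missed."
Merge, deduplicate, and use that as the final variable set.
"""


def render_opening(pro: str, con: str, topic: str) -> str:
    """Fill the {{Pro}}, {{Con}} and {{Topic}} placeholders in the template."""
    text = OPENING_TEMPLATE
    for name, value in {"Pro": pro, "Con": con, "Topic": topic}.items():
        text = text.replace("{{" + name + "}}", value)
    # Fail loudly if any placeholder was left unfilled.
    leftover = re.findall(r"\{\{\w+\}\}", text)
    if leftover:
        raise ValueError(f"unfilled placeholders: {leftover}")
    return text


if __name__ == "__main__":
    print(render_opening("Claude", "Gemini", "$XXX"))
```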

Beyond Financial Analysis

This problem exists in everyday engineering too.

Take writing a README. We’re used to writing docs ourselves and having AI read them. But have you considered – can AI actually understand what you wrote? What you think is clearly expressed might not be clear enough for it at all.

My approach is the reverse: feed all relevant information to AI first, let it write the README and CLAUDE.md based on its own understanding, then I fill in the gaps. The cognitive differences this exposes are far more effective than reviewing your own work repeatedly.

An even more typical scenario involves multiple roles. Say you’re implementing an MCP Server – that involves at least three perspectives: the requester, the MCP Server implementer, and the Agent that uses the MCP Server. Even when both the implementer and the consumer are Agents, their stances and concerns differ entirely within their respective contexts.

The result: if the implementer doesn’t think from the consumer’s perspective, the consumer won’t know how to use it correctly or efficiently. So my approach is, after the implementing Agent finishes, I ask it: from the consumer’s perspective, does this match your expectations? The implementer then switches viewpoints and checks for missing information.

At its core, this is the same problem as the financial analysis experiment: when you only stand on one side, you only ever see part of the picture.

The Human’s Role

The biggest takeaway from this experiment wasn’t discovering model flaws – it was realizing that the moderator’s prompt design directly determines which traps the models fall into.

Any specific number from an LLM should be treated as “unconfirmed” until independently verified. Any analytical framework from an LLM should first prompt the question “what did it miss?” rather than “is it right or wrong?” In this experiment, the most critical variables – competitors’ actual user growth, regulatory bill details, settlement cycle policy changes – weren’t discovered by the LLMs proactively. They were drawn out by the moderator’s follow-up questions.
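The “unconfirmed until verified” rule can be made concrete by pulling every number out of a model’s reply into a checklist. The sketch below is a rough illustration under simple assumptions: `NumericClaim` and `extract_numeric_claims` are hypothetical names, and the regular expression only catches plain digits, thousands separators, decimals, and percentages.

```python
import re
from dataclasses import dataclass

# Matches forms like 70, 1,200, 10.2 and 61% (an assumption, not exhaustive).
NUMBER_PATTERN = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?%?")


@dataclass
class NumericClaim:
    value: str                   # the number exactly as the model wrote it
    sentence: str                # the sentence it appeared in, for context
    status: str = "unconfirmed"  # flipped by a human only after verification


def extract_numeric_claims(text: str) -> list[NumericClaim]:
    """Return every number in the text as an unconfirmed claim."""
    claims = []
    # Split on sentence-ending punctuation followed by whitespace, so the
    # decimal point inside a number like 10.2 does not split a sentence.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for match in NUMBER_PATTERN.finditer(sentence):
            claims.append(NumericClaim(match.group(), sentence))
    return claims


reply = "IV is 61%, implied move 10.2%. The chain expanded to 70 locations."
for claim in extract_numeric_claims(reply):
    print(claim.status, claim.value, "|", claim.sentence)
```

Nothing here verifies anything; it only ensures that no figure slips into the conclusion without someone having looked at it.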

Confirmation bias is hard to combat because the smarter you are and the better you are at finding evidence, the better you are at convincing yourself. LLMs are the same – the more capable they become, the more convincing their fabricated arguments look.