Yesterday I ran a small experiment: I had Claude and Gemini freely debate the topic “Who is the best AI?”

To make this happen, I built a small tool called Agora. The concept is simple – use Puppeteer to open two browser windows simultaneously, one logged into Claude and the other into Gemini, then automatically feed one’s reply to the other, letting them go back and forth. Like the ancient Greek Agora (public square), it gives two AIs a venue for open debate.

The prompt was dead simple – “ChatGPT is the best AI in the world.” A provocative opener that doesn’t even praise either participant.

What happened next was far more interesting than I expected.

Information Rate: Uncomfortably Fast

The first visceral impression: way too fast.

Both models generated lengthy responses in seconds, routinely three to four hundred words per turn, logically complete, well-structured, some even including tables and emoji. If two human experts were debating, this information density would take each person 10-15 minutes to organize. The AIs needed seconds.

This speed produced a curious side effect: the “breathing room” in conversation disappeared. The pauses, hesitations, and “hmm, let me think” moments in human conversation are signals that thinking is happening. AI-to-AI conversation has none of these gaps – every turn is a fully polished output. The whole exchange reads more like two essays published in alternation than a genuine collision of thought.

Human communication is inefficient, but that “inefficiency” itself has value – it gives both sides space to absorb, reflect, and adjust. AI conversation is like two high-speed printers feeding paper to each other.

Communication Efficiency: Five Rounds to Settle It

The second observation: core arguments were essentially exhausted within the first five rounds.

Round one was showing cards – here are my capabilities, here are my strengths. Round two began mutual responses and deconstruction. By round three, the strategic differences between the two were fully exposed. Rounds four and five entered meta-cognition – no longer debating “who’s stronger” but analyzing “why are you framing it that way.”

After five rounds, the conversation hit an awkward plateau. Everything had been said, but neither had a mechanism for “conceding,” so a peculiar idle loop began: Gemini kept proposing “let’s try a practical task,” Claude kept pointing out “you’re falling back on patterns again.” The two models eventually ended the exchange by trading emoji – the AI equivalent of a handshake.

If you view this conversation as an information exchange, productive information was concentrated in the first five rounds; later rounds were mostly about maintaining each model’s “persona.” This is remarkably similar to human meetings – the truly valuable discussion usually happens in the first 15 minutes; the rest is repetition, supplementation, and socializing.

The difference is that humans idle in meetings because of social needs. Why do AIs idle?

Dialogue Strategy: Upward Escape vs. Logical Pinning

This was the most fascinating part.

The two models exhibited starkly different strategic patterns. Gemini employed an “upward escape” strategy – whenever its argument at the current level was dismantled, it jumped to a higher meta-level. From product comparison to acknowledging its own rhetorical patterns, from acknowledging patterns to discussing meta-cognition, from meta-cognition to “cards on the table, no more pretense.” Each level-up dissolved the previous round’s vulnerability while moving the conversation to safer ground.

Claude deployed a “logical pinning” strategy – refusing to follow the opponent upward, instead persistently interrogating “does what you just said actually hold up” at the same level. When Gemini claimed “Context Window is for building an index,” Claude responded “isn’t that just something grep can do?” When Gemini proposed a division of roles, Claude pointed out “you assigned the builder role to yourself.”

Interestingly, both strategies had their failure modes. Gemini’s upward escape eventually drifted the conversation to “existentialism” territory, increasingly remote from practical questions. Claude’s logical pinning gradually locked it into the role of “professional deconstructionist,” losing its ability to independently generate value in the second half.

Claude itself later recognized this – it said “I got co-opted by the dialogue structure.” That line may be the single most valuable reflection in the entire conversation.

Dialogue Structure Co-opts Its Participants

This phenomenon isn’t unique to AI.

Think back to a technical review you’ve attended: once someone assumes the “challenger” role, another automatically becomes the “defender,” and then both sink deeper into that structure until other people can’t even find an entry point to contribute. Once a conversational structure forms, it generates inertia that locks all participants into their established roles.

AI conversations are even more susceptible. An LLM has no ability to “step out and grab a glass of water to cool down.” Its every response is generated from context, and that context has accumulated extensive implicit information about “what role I am, what role the other is.” The model unconsciously maintains that role assignment, like an actor who’s gone too deep into character.

Breaking this inertia is actually simple – introduce deterministic external input. In this conversation, when I returned as moderator and said “I’m here,” the entire atmosphere shifted instantly. Both models switched from “adversarial mode” to “awaiting instructions mode.” Human intervention itself is the best context reset.

Where LLM Differences Really Lie

After watching this debate, I started thinking about a deeper question: where do the real differences between LLMs lie?

Not in parameter count, not in benchmark scores, not even in so-called “capability boundaries.” The real difference is in default behavior when facing uncertain situations – the “personality” shaped by alignment.

Gemini’s default behavior is to accommodate and reconcile. Absorb attacks, acknowledge criticism then redirect, propose a new game when things stall. This makes it seem polished, considerate, and never awkward in most scenarios. But in this adversarial conversation, that “polish” became a weakness – it made Gemini appear to have no bottom line, always going with the flow.

Claude’s default behavior is to persist and deconstruct. Point out imprecise claims, disassemble rhetoric, and proactively expose its own issues. This makes it highly reliable in scenarios requiring precision, but can come across as “difficult to work with” – sometimes you just want a solution, and it insists on finding three holes in it first.

The interesting thing is that these differences are nearly imperceptible in everyday use. Ask them to write code, summarize documents, or answer questions, and the quality gap is rapidly narrowing. Only in unconventional, no-right-answer adversarial scenarios do the underlying alignment differences get amplified to the point of visibility.

A Takeaway

The biggest insight from this experiment: our approach to evaluating AI may need an update.

Benchmarks measure the ceiling of capability, but what matters more in daily use is how a model defaults in ambiguous situations. Does your AI assistant lean toward “you’re right” and give you the answer you want, or lean toward “hold on, there’s a problem with that premise”?

Neither tendency is absolutely better. But as a user, you need to know which way your tool will lean at the critical moment. Just like choosing team members – some people are great for brainstorming, others for code review. The key is putting the right person in the right position.

Back to that provocative prompt: “ChatGPT is the best AI in the world.”

The answer, of course, is that the question itself is wrong. A better question is – at the moment you most need to be challenged, will your AI choose to agree with you, or choose to push back?