LLM Position Bias Benchmark Reveals Significant Primacy Effect
Why It Matters
This bias undermines the reliability of LLMs used for automated grading, legal document review, and ranking-based decision-making. If models consistently favor first-listed options, it introduces systemic unfairness in any evaluative pipeline.
Key Points
- Models select the first presented option 63.3% of the time on average.
- Choice consistency is low, with 44.8% of decisions flipping when option order is reversed.
- The GPT-5x family is identified as having significantly higher position bias than competitors like Opus 4.6.
- LLM 'primacy bias' is the inverse of the human 'recency bias' typically found in psychological studies.
The Mazur 2026 benchmark has identified a significant 'primacy bias' in large language models: when given two options, models select the first approximately 63.3% of the time. When the order of the options is reversed, the models' decisions flip in 44.8% of cases, indicating that the choice is often dictated by position rather than content. The study highlights the GPT-5x series as exhibiting particularly high levels of this bias compared to its peers. The researchers contrast this with human behavior, which typically shows a 'recency bias' when options are presented orally, driven by the limits of short-term memory. The findings suggest that current training methodologies, including RLHF, have not corrected this architectural tendency. The benchmark raises questions about the validity of using AI for objective ranking tasks without rigorous shuffling and normalization of input data.
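To make the two headline numbers concrete, here is a minimal sketch of how a primacy rate and a flip rate can be measured on a two-option choice task. The `ask_model` function and the prompt wording are placeholders for whatever client and template you use; they are not the benchmark's actual setup.

```python
def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; swap in your own client."""
    raise NotImplementedError

def measure_position_bias(pairs):
    """Estimate primacy rate and flip rate over (option_a, option_b) pairs.

    Each pair is presented twice, once in each order. We track:
      - primacy rate: how often the model picks whichever option came first
      - flip rate: how often the chosen *content* changes when order reverses
    """
    first_picks = 0
    flips = 0
    for a, b in pairs:
        choice_ab = ask_model(f"Pick the better option.\n1. {a}\n2. {b}\nAnswer with the option text only.")
        choice_ba = ask_model(f"Pick the better option.\n1. {b}\n2. {a}\nAnswer with the option text only.")
        first_picks += (choice_ab == a) + (choice_ba == b)
        flips += choice_ab != choice_ba
    return {
        "primacy_rate": first_picks / (2 * len(pairs)),  # ~0.633 reported by Mazur 2026
        "flip_rate": flips / len(pairs),                 # ~0.448 reported by Mazur 2026
    }
```

An unbiased model would land near a 50% primacy rate, and a flip rate near zero would indicate the decision tracks content rather than position.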
Imagine asking a friend to pick between 'Apples' or 'Bananas' and they pick Apples just because you said it first. That is exactly what is happening with AI models like GPT-5, but at a massive scale. A new study found that AI models pick the first choice over 60% of the time, regardless of what that choice actually is. Interestingly, humans usually do the opposite, picking the last thing they heard because it is fresher in their minds. This means if you use an AI to grade resumes or rank products, the order in which they appear might matter more than how good they actually are.
Sides
Critics
Conducted the 2026 study demonstrating that position bias is a systemic flaw in current LLM architectures.
Defenders
The developer of the models cited as having particularly egregious position bias, though they have not yet released a formal response.
Neutral
Shared the findings publicly and proposed that the bias may stem from how forward passes recompute activations for earlier tokens.
Forecast
Developers will likely implement mandatory 'shuffling' protocols for all ranking tasks to mitigate this effect (a simple version is sketched below). In the long term, we should expect new training objectives specifically designed to penalize positional dependency in evaluative prompts.
Based on current signals. Events may develop differently.
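One way to prototype such a shuffling protocol today is order-randomized voting: present the pair in a random order several times and keep only a content-level majority. A minimal sketch, reusing the hypothetical `ask_model` stub from above:

```python
import random

def debiased_choice(option_a, option_b, rounds=4):
    """Order-randomized voting: a simple shuffling mitigation sketch.

    Each round presents the two options in a random order, so any
    positional preference cancels out in expectation. Returns the
    majority winner, or None on a tie; a persistent tie suggests the
    model is deciding by position rather than content.
    """
    votes = {option_a: 0, option_b: 0}
    for _ in range(rounds):
        pair = [option_a, option_b]
        random.shuffle(pair)
        choice = ask_model(
            f"Pick the better option.\n1. {pair[0]}\n2. {pair[1]}\nAnswer with the option text only."
        )
        if choice in votes:
            votes[choice] += 1
    if votes[option_a] == votes[option_b]:
        return None
    return max(votes, key=votes.get)
```

The design choice here is to resolve ties as "no decision" rather than defaulting to either option, since a tie is itself evidence that position, not content, is driving the answer.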
Timeline
Benchmark results shared on Reddit
User COAGULOPATH summarizes the Mazur 2026 findings regarding LLM position bias.