Exploring the Reliability and Accuracy of Language Models in Bias Assessment of Non-Randomized Studies
In an era where artificial intelligence is rapidly evolving, the application of large language models (LLMs) in various fields is of significant interest. This study delves into the capabilities of three prominent LLMs—Claude, Gemini, and GPT—in evaluating the risk of bias in non-randomized studies using the ROBINS-I tool.
Objectives of the Study
The primary aim of this research is to compare the reliability and accuracy of the three LLMs in assessing the risk of bias in non-randomized studies. The ROBINS-I (Risk Of Bias In Non-randomized Studies of Interventions) tool serves as the benchmark for this evaluation.
Methods and Analysis
The study conducted a secondary analysis of 171 non-randomized studies. These studies were previously assessed by two independent human review teams using the ROBINS-I tool. For consistency, only studies in which the two human teams' domain-level assessments agreed were included. Each study was independently assessed twice by Claude, Gemini, and GPT through agent-based structured implementations of the ROBINS-I tool.
Reliability was gauged by the agreement between two runs of the same LLM, measured using percent agreement and Gwet's AC1. Accuracy, defined as agreement with the human reviewers, was assessed for studies with consistent LLM assessments using the same metrics.
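For readers unfamiliar with the metrics, the following is a minimal sketch of how percent agreement and Gwet's AC1 can be computed for two sets of categorical ratings (e.g. two runs of the same LLM, or an LLM versus a human team). The function name and rating labels are illustrative, not taken from the study.

```python
def percent_agreement_and_ac1(ratings_a, ratings_b):
    """Percent agreement and Gwet's AC1 for two raters' categorical ratings.

    ratings_a, ratings_b: equal-length lists of category labels
    (e.g. "low", "moderate", "serious" risk-of-bias judgements).
    """
    assert len(ratings_a) == len(ratings_b) and ratings_a, "need paired ratings"
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)

    # Observed agreement: fraction of items where both raters gave the same label
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    if q == 1:
        return pa, 1.0  # only one category used: agreement is trivially perfect

    # Average marginal proportion for each category across the two raters
    pi = {c: (ratings_a.count(c) + ratings_b.count(c)) / (2 * n) for c in categories}

    # Gwet's chance-agreement probability: (1 / (q - 1)) * sum of pi_c * (1 - pi_c)
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)

    # AC1 is the chance-corrected agreement coefficient
    ac1 = (pa - pe) / (1 - pe)
    return pa, ac1

# Hypothetical example: 4 studies rated by two raters
pa, ac1 = percent_agreement_and_ac1(
    ["low", "low", "serious", "low"],
    ["low", "low", "low", "low"],
)
print(pa, ac1)  # → 0.75 0.68
```

Unlike Cohen's kappa, AC1 stays stable when one category dominates the ratings, which is why it is often preferred for risk-of-bias data where most judgements cluster in one or two levels.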
Study Findings
Claude demonstrated high reliability across all domains, with agreement ranging from 79.5% to 98.0% and AC1 values between 0.729 and 0.975. Gemini showed moderate to high reliability, with agreement ranging from 76.7% to 100% and AC1 values from 0.680 to 1.0. GPT exhibited lower reliability overall, with domain-level agreement between 70.9% and 95.6% (AC1 = 0.596–0.944).
In terms of accuracy, Claude’s agreement with human reviewers was poor overall, ranging from 14.4% to 68.5%, with low AC1 values. Gemini demonstrated moderate to high accuracy in several domains, notably in deviations from intended interventions (79.6%, AC1 = 0.848) and measurement of outcomes (73.9%, AC1 = 0.702), achieving the highest overall agreement of 40.0% (AC1 = 0.672). GPT’s accuracy was variable, with its highest scores in measurement of outcomes (62.8%, AC1 = 0.571) and classification of interventions (57.8%, AC1 = 0.498), but it struggled in selection (14.3%, AC1 = −0.041) and overall agreement (23.0%, AC1 = 0.267).
Conclusions
While Claude showed internal consistency, it struggled to align with human assessments. Gemini emerged as a more reliable model with moderate to high accuracy in certain domains, whereas GPT exhibited lower reliability and mixed accuracy. The study indicates that current commercially available LLMs, such as Claude, Gemini, and GPT, are not yet equipped to reliably perform ROBINS-I risk of bias assessments.
For further details, please refer to the full study.

