Exploring the Reliability and Accuracy of Language Models in Bias Assessment of Non-Randomized Studies
In an era where artificial intelligence is rapidly evolving, the application of large language models (LLMs) in various fields is of significant interest. This study delves into the capabilities of three prominent LLMs—Claude, Gemini, and GPT—in evaluating the risk of bias in non-randomized studies using the ROBINS-I tool.
Objectives of the Study
The primary aim of this research is to compare the reliability and accuracy of the three LLMs in assessing the risk of bias in non-randomized studies. The ROBINS-I (Risk Of Bias In Non-randomized Studies of Interventions) tool serves as the benchmark for this evaluation.
Methods and Analysis
The study conducted a secondary analysis of 171 non-randomized studies. These studies were previously assessed by two independent human review teams using the ROBINS-I tool. For consistency, only studies in which the two human teams' domain-level assessments agreed were included. Each study was independently assessed twice by Claude, Gemini, and GPT through agent-based structured implementations of the ROBINS-I tool.
Reliability was gauged by the agreement between two runs of the same LLM, measured using percent agreement and Gwet's AC1. Accuracy, defined as agreement with the human reviewers, was assessed for studies with consistent LLM assessments using the same metrics.
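For readers unfamiliar with the metrics, the following is a minimal sketch of how percent agreement and Gwet's AC1 can be computed for two sets of categorical ratings (e.g. two runs of the same LLM, or an LLM versus a human team). The function name and rating labels are illustrative, not taken from the study.

```python
def percent_agreement_and_ac1(ratings_a, ratings_b):
    """Percent agreement and Gwet's AC1 for two raters' categorical ratings.

    ratings_a, ratings_b: equal-length lists of category labels
    (e.g. "low", "moderate", "serious" risk-of-bias judgements).
    """
    assert len(ratings_a) == len(ratings_b) and ratings_a, "need paired ratings"
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)

    # Observed agreement: fraction of items where both raters gave the same label
    pa = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    if q == 1:
        return pa, 1.0  # only one category used: agreement is trivially perfect

    # Average marginal proportion for each category across the two raters
    pi = {c: (ratings_a.count(c) + ratings_b.count(c)) / (2 * n) for c in categories}

    # Gwet's chance-agreement probability: (1 / (q - 1)) * sum of pi_c * (1 - pi_c)
    pe = sum(p * (1 - p) for p in pi.values()) / (q - 1)

    # AC1 is the chance-corrected agreement coefficient
    ac1 = (pa - pe) / (1 - pe)
    return pa, ac1

# Hypothetical example: 4 studies rated by two raters
pa, ac1 = percent_agreement_and_ac1(
    ["low", "low", "serious", "low"],
    ["low", "low", "low", "low"],
)
print(pa, ac1)  # → 0.75 0.68
```

Unlike Cohen's kappa, AC1 stays stable when one category dominates the ratings, which is why it is often preferred for risk-of-bias data where most judgements cluster in one or two levels.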
Study Findings
Claude demonstrated high reliability across all domains, with agreement ranging from 79.5% to 98.0% and AC1 values between 0.729 and 0.975. Gemini showed moderate to high reliability, with agreement ranging from 76.7% to 100% and AC1 values from 0.680 to 1.0. GPT exhibited lower reliability overall, with domain-level agreement between 70.9% and 95.6% (AC1 = 0.596–0.944).
In terms of accuracy, Claude’s agreement with human reviewers was poor overall, ranging from 14.4% to 68.5%, with low AC1 values. Gemini demonstrated moderate to high accuracy in several domains, notably in deviations from intended interventions (79.6%, AC1 = 0.848) and measurement of outcomes (73.9%, AC1 = 0.702), achieving the highest overall agreement of 40.0% (AC1 = 0.672). GPT’s accuracy was variable, with its highest scores in measurement of outcomes (62.8%, AC1 = 0.571) and classification of interventions (57.8%, AC1 = 0.498), but it struggled in selection (14.3%, AC1 = −0.041) and overall agreement (23.0%, AC1 = 0.267).
Conclusions
While Claude showed internal consistency, it struggled to align with human assessments. Gemini emerged as a more reliable model with moderate to high accuracy in certain domains, whereas GPT exhibited lower reliability and mixed accuracy. The study indicates that current commercially available LLMs, such as Claude, Gemini, and GPT, are not yet equipped to reliably perform ROBINS-I risk of bias assessments.
For further details, please refer to the full study.

