D2.547 - The Transparent and Credible AI for Clinical Protocols
Background
An artificial intelligence (AI)-native platform for clinical development was evaluated for its ability to generate high-quality clinical trial protocols compared with general-purpose large language models (LLMs). The objective was to assess whether the researched tool can produce more complete, regulator-aligned phase II oncology protocols than traditional authoring or other AI tools when given the same minimal prompt.
Method
Using a real phase II trial of nintedanib in bevacizumab-resistant recurrent epithelial ovarian, fallopian tube, or primary peritoneal carcinoma (NCT01669798), the original protocol was retrieved and a standardized prompt was applied across four AI tools to generate new protocols. All five protocols (original plus four AI-generated) were scored using a structured tool derived from International Council for Harmonisation (ICH) M11 guidance, covering 13 key regulatory and scientific domains, with section scores aggregated to an overall 0–5 completeness rating.
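The abstract does not state how section scores were combined into the overall rating. As a minimal sketch, assuming an unweighted mean of per-domain scores on the 0–5 scale, the aggregation could look like the following. The function name, domain names, and example values are hypothetical and are not taken from the study.

```python
from statistics import mean

# Bounds of the completeness scale described in the Method section.
SCALE_MIN, SCALE_MAX = 0.0, 5.0


def overall_completeness(domain_scores: dict[str, float]) -> float:
    """Aggregate per-domain scores into one overall rating.

    ASSUMPTION: unweighted mean, rounded to one decimal place.
    The actual aggregation rule used in the study is not specified.
    """
    if not domain_scores:
        raise ValueError("at least one domain score is required")
    for name, score in domain_scores.items():
        if not SCALE_MIN <= score <= SCALE_MAX:
            raise ValueError(f"score for {name!r} is outside 0-5: {score}")
    return round(mean(list(domain_scores.values())), 1)


# Made-up scores for three hypothetical domains (the real tool covers 13).
example = {"objectives_endpoints": 4.0, "trial_design": 3.0, "population": 5.0}
print(overall_completeness(example))
```

A weighted mean would be an equally plausible choice if some domains were considered more critical than others; the sketch only illustrates the general shape of the calculation.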
Results
The researched tool generated a 74-page protocol including comprehensive front matter, trial schema, schedule of assessments, formal objectives, endpoints and estimands, detailed population criteria, intervention and concomitant therapy guidance, safety and risk management, statistical methods, and appendices for adverse event definitions, contraception, and laboratory assessments. In contrast, the other LLMs produced shorter, less comprehensive documents that omitted critical elements such as full schedules of assessments, explicit estimands, detailed dose justification, robust safety reporting procedures, and complete regulatory and quality assurance frameworks. Overall scores were 2.7 for the original protocol, 4.3 for the researched tool, 2.3 for tool 1, 1.5 for tool 2, and 2.2 for tool 3. The researched tool outperformed both the original and the other AI-generated protocols across most ICH M11–based domains, especially objectives/endpoints, design, population, and assessments.
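For readers who want to compare the reported overall scores directly, the short sketch below ranks the five protocols and computes the margin between the highest-scoring protocol and each of the others. The score values are those reported above; the dictionary labels and the helper name `margins_over` are illustrative choices, not part of the study.

```python
# Overall 0-5 completeness scores as reported in the Results.
reported_scores = {
    "original protocol": 2.7,
    "researched tool": 4.3,
    "tool 1": 2.3,
    "tool 2": 1.5,
    "tool 3": 2.2,
}


def margins_over(reference: str, scores: dict[str, float]) -> dict[str, float]:
    """Return how far the reference protocol's score exceeds each other score."""
    ref = scores[reference]
    return {
        name: round(ref - score, 1)
        for name, score in scores.items()
        if name != reference
    }


# Protocols ordered from highest to lowest overall score.
ranking = sorted(reported_scores, key=reported_scores.get, reverse=True)
margins = margins_over("researched tool", reported_scores)

for name in ranking:
    print(f"{name}: {reported_scores[name]}")
for name, gap in margins.items():
    print(f"researched tool exceeds {name} by {gap} points")
```

These are simple differences between single overall scores; the abstract reports no variance or inter-rater data, so the sketch makes no claim about statistical significance.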
Conclusion
These findings suggest that an AI system purpose-built for clinical trials and trained on curated protocol and regulatory data can generate more complete, ICH M11–aligned protocols from a simple prompt than either traditional drafting or general-purpose LLMs, while still benefiting from validation by subject matter experts. Purpose-designed AI platforms may therefore reduce the need for downstream amendments, accelerate study start-up, and improve protocol robustness, whereas non-specialized LLMs risk omitting key design and regulatory elements unless their output is substantially reworked by human authors.
