- D1.540 - Automated generation of real-world clinical databases in Allergy: a data-driven research framework
Background
Real-world data are increasingly central to clinical research in Allergy. However, most clinically relevant information remains embedded in unstructured narrative reports within electronic health records, making large-scale data collection slow, labour-intensive and poorly reproducible. This limitation is a major barrier to the development of multicentre studies, national registries and data-driven research strategies. Methodological approaches are needed that leverage artificial intelligence to systematically transform routine clinical narratives into structured, analysable datasets, independently of local hospital information systems.
Method
We developed a data-driven clinical research methodology based on artificial intelligence, designed for the automated construction and exploitation of real-world clinical databases in Allergy. The framework combines: (i) the prior definition of disease-specific variable ontologies; (ii) automated extraction and structuring of information from unstructured clinical text using natural language processing techniques trained on Allergy-specific clinical data and terminology; (iii) an architecture independent of hospital information systems, enabling input from anonymized PDF reports or raw clinical text; and (iv) privacy-by-design principles with embedded data protection safeguards. An integrated analytical layer enables immediate descriptive and inferential analyses, as well as natural language–based interaction with the resulting structured datasets.
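As an illustration only, components (i) and (ii) can be sketched as a predefined variable ontology that drives extraction from free text into one structured record. This is a minimal sketch under stated assumptions, not the framework's implementation: the variable names, the sample report and the `extract_record` helper are hypothetical, and simple regular expressions stand in for the trained natural language processing models.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """One ontology entry: a target column and how to find its value."""
    name: str      # column name in the structured dataset
    pattern: str   # regular expression standing in for a trained NLP model


# Hypothetical mini-ontology for an allergy report (illustrative only,
# not the variable set used by the authors).
ONTOLOGY = [
    Variable("total_ige_ku_l", r"total IgE[^\d]*(\d+(?:\.\d+)?)"),
    Variable("skin_prick_test", r"skin prick test[^.]*?\b(positive|negative)\b"),
    Variable("anaphylaxis_history",
             r"\b(no history of anaphylaxis|history of anaphylaxis)\b"),
]


def extract_record(text):
    """Map raw clinical text to one structured row.

    Every ontology variable appears in the output; variables that are not
    mentioned in the text are set to None rather than omitted, so all
    records share the same schema.
    """
    record = {}
    for var in ONTOLOGY:
        match = re.search(var.pattern, text, flags=re.IGNORECASE)
        record[var.name] = match.group(1).lower() if match else None
    return record


# Invented example text, not taken from any real report.
report = (
    "Skin prick test to peanut was positive. "
    "Total IgE: 312.5 kU/L. No history of anaphylaxis."
)
record = extract_record(report)
print(record)
```

Defining the ontology before extraction means every centre produces records with the same columns, which is what allows datasets to be harmonized regardless of the source information system.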
Results
The proposed methodology enables the automated generation of standardized, anonymized and interoperable clinical databases directly from routine clinical documentation. It supports scalable deployment across heterogeneous clinical environments, harmonizes data structures between centres and substantially reduces the manual workload traditionally associated with data abstraction. By formalizing clinical variables through disease-specific ontologies and domain-adapted language models, the framework improves semantic consistency and enhances data reusability for secondary research purposes. The integrated analytical components further allow rapid exploratory and hypothesis-generating analyses.
Conclusion
This artificial intelligence–driven framework provides a reproducible approach for transforming unstructured clinical narratives into structured real-world datasets in Allergy. By decoupling database construction from local information systems and embedding interoperability, data governance and analytical capabilities by design, it facilitates large-scale observational research, national registries and precision medicine initiatives. This methodology complements classical hypothesis-driven approaches by enabling systematic high-dimensional data exploration directly from routine clinical practice.
