
PhUSE – Connect EU 2018: Applying Machine Learning to Clinical Study Data

To map clinical data properly to the Study Data Tabulation Model (SDTM) and the Analysis Data Model (ADaM) for further analysis, the data scientist has to understand how the data reflect the CRF and the study design, the variability among subjects, and the data issues and inconsistencies. This understanding forms the basis for evaluating the data structure and standardizing it for statistical analysis. This is not trivial: clinical data vary between studies, each study requires an individual approach, and there is no “one size fits all” for issue resolution. Together, these issues make data conversion and analysis very time-consuming, and the industry dedicates much effort to streamlining the process and increasing its efficiency.

The PhUSE – Connect EU conference this year took place in Frankfurt, Germany, and had 17 different streams and 5 workshops and discussions spanning a wide array of topics, from career development and leadership to risk-based monitoring, statistics and real-world evidence. However, from the very beginning of the conference it was clear that the focus this year was on applying Machine Learning (ML) to clinical data analysis in order to address the issues mentioned above. ML was the theme of a plenary lecture, had its own stream and workshop, and was mentioned in posters and lectures in other streams as well.

ML is a field of Artificial Intelligence (AI) based on the ability of systems to learn from data, identify patterns and make decisions without being explicitly programmed for those specific decisions. To use ML, the system must first be given a training set of data that has already been labeled, that is, mapped to a target. The main application of ML that was suggested is assistance in mapping (from the raw data to SDTM, or from SDTM to ADaM): based on previous mappings, the tool finds probable connections between datasets and their target domains, at both the dataset and the variable level. Ideally, such tools will shorten the mapping time, improve its quality and ensure uniformity between studies. However, the development of such tools raises new issues. To train a tool, the system has to be given a large enough variety of studies to learn the actual complexity of the mappings, a process that is very time-consuming. Other problems are contradictory mappings between different studies in the training set, and the fact that in order to change the outcome, the whole training set has to be remapped. The suggested solutions included training the system separately for studies from different vendors (assuming less inter-study variability) and training the tool on the mappings from all of its users in order to gain a larger training set. However, none of these solutions has been implemented yet, and they might be only partial.
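To make the idea of learning from previous mappings more concrete, here is a minimal sketch in Python. It is not one of the tools presented at the conference. It assumes a tiny, hand-made training set of raw variable labels already mapped to SDTM domains, and it suggests a domain for a new label by comparing character trigrams with the labeled examples (a simple nearest-neighbor vote). The labels, the `TRAINING` list and the function names are all hypothetical and for illustration only.

```python
from collections import Counter

# Hypothetical training set: (raw variable label, SDTM domain).
# Illustrative only; these pairs do not come from any real study.
TRAINING = [
    ("systolic blood pressure", "VS"),
    ("diastolic blood pressure", "VS"),
    ("heart rate", "VS"),
    ("adverse event term", "AE"),
    ("adverse event start date", "AE"),
    ("adverse event severity", "AE"),
    ("concomitant medication name", "CM"),
    ("medication dose", "CM"),
    ("date of birth", "DM"),
    ("sex of subject", "DM"),
]


def trigrams(text):
    """Return the set of character trigrams of a lower-cased, padded label."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_domain(label, training=TRAINING, k=3):
    """Suggest a target domain for `label` from its k most similar examples.

    Each of the k nearest labeled examples votes for its domain, weighted
    by its similarity to the query. Returns (domain, total_weight).
    """
    query = trigrams(label)
    scored = sorted(
        ((jaccard(query, trigrams(text)), domain) for text, domain in training),
        reverse=True,
    )
    votes = Counter()
    for score, domain in scored[:k]:
        votes[domain] += score
    return votes.most_common(1)[0]


if __name__ == "__main__":
    for raw_label in ["blood pressure systolic", "AE severity grade"]:
        domain, weight = suggest_domain(raw_label)
        print(f"{raw_label!r} -> suggested domain {domain} (weight {weight:.2f})")
```

Even this toy version shows the problems mentioned above: the suggestions are only as good as the variety of labeled examples, and two studies that map the same label to different domains would pull the vote in opposite directions.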

ML can also offer other interesting options, such as extracting information from free-text variables, suggesting method descriptions of derived variables for the Define.xml based on their code in previous studies, and extracting medical information regarding adverse events and concomitant medication from Sharecare platforms.

To conclude, ML is a very promising direction for addressing mapping, but it introduces new problems that need to be solved before it can become useful in practice.