A Standardised Approach for Preparing Imaging Data for Machine Learning Tasks in Radiology
Book Chapter: Artificial Intelligence in Medical Imaging, Springer, pp 61-72
Publishing Date: 30th January 2019
Medical imaging data is now extremely abundant due to over two decades of digitisation of imaging protocols and data storage formats. However, clean, well-curated data, that is amenable to machine learning, is relatively scarce, and AI developers are paradoxically data starved. Imaging and clinical data is also heterogeneous, often unstructured and unlabelled, whereas current supervised and semi-supervised machine learning techniques rely on homogeneous and carefully annotated data. While imaging biobanks contain small volumes of well-curated data, it is the leveraging of ‘big data’ from the front-line of healthcare that is the focus of many machine learning developers hoping to train and validate computer vision algorithms. The quest for sufficiently large volumes of clean data that can be used for training, validation and testing involves several hurdles, namely ethics and consent, security, the assessment of data quality, ground truth data labelling, bias reduction, reusability and generalisability. In this chapter we propose a new medical imaging data readiness (MIDaR) scale. The MIDaR scale is designed to objectively clarify data quality for both researchers seeking imaging data and clinical providers aiming to share their data. It is hoped that the MIDaR scale will be used globally during collaborative academic and business conversations, so that everyone can more easily understand and quickly appraise the relevant stages of data readiness for machine learning in relation to their AI development projects. We believe that the MIDaR scale could become essential in the design, planning and management of AI medical imaging projects, and significantly increase chances of success.