The AI devices: ML and NLP
In this section, we review the AI devices (or techniques) that have been found useful in medical applications. We categorise them into three groups: the classical machine learning techniques,26 the more recent deep learning techniques27 and the NLP methods.28
Classical ML
ML constructs data analytical algorithms to extract features from data. Inputs to ML algorithms include patient ‘traits’ and sometimes medical outcomes of interest. A patient’s traits commonly include baseline data, such as age, gender, disease history and so on, and disease-specific data, such as diagnostic imaging, gene expression, EP test results, physical examination results, clinical symptoms, medication and so on. Besides the traits, patients’ medical outcomes are often collected in clinical research. These include disease indicators, patients’ survival times and quantitative disease levels, for example, tumour sizes. To fix ideas, we denote the jth trait of the ith patient by Xij, and the outcome of interest by Yi.
Depending on whether the outcomes are incorporated, ML algorithms can be divided into two major categories: unsupervised learning and supervised learning. Unsupervised learning is well known for feature extraction, while supervised learning is suitable for predictive modelling, building relationships between the patient traits (as input) and the outcome of interest (as output). More recently, semisupervised learning has been proposed as a hybrid between unsupervised learning and supervised learning, suitable for scenarios where the outcome is missing for certain subjects. These three types of learning are illustrated in figure 4.
Figure 4 Graphical illustration of unsupervised learning, supervised learning and semisupervised learning.
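To make the semisupervised case concrete, here is a minimal sketch in Python with scikit-learn on simulated data; the LabelSpreading estimator and the dataset are illustrative assumptions, not drawn from the reviewed studies.

```python
# A sketch of semisupervised learning, assuming scikit-learn and simulated
# data: outcomes are observed for the first 50 subjects and missing for the
# rest, and LabelSpreading imputes the missing outcomes from the traits.
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

y_partial = y.copy()
y_partial[50:] = -1  # scikit-learn's marker for a missing outcome

model = LabelSpreading().fit(X, y_partial)
print(model.transduction_[50:60])  # outcomes inferred for unlabelled subjects
```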
Clustering and principal component analysis (PCA) are two major unsupervised learning methods. Clustering groups subjects with similar traits into clusters, without using the outcome information. Clustering algorithms output the cluster labels for the patients by maximising the similarity between patients within a cluster and minimising the similarity between patients in different clusters. Popular clustering algorithms include k-means clustering, hierarchical clustering and Gaussian mixture clustering. PCA is mainly used for dimension reduction, especially when the traits are recorded in a large number of dimensions, such as the number of genes in a genome-wide association study. PCA projects the data onto a few principal component (PC) directions, without losing too much information about the subjects. Sometimes, one can first use PCA to reduce the dimension of the data, and then use clustering to group the subjects.
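The following is a minimal sketch of this PCA-then-clustering workflow, written in Python with scikit-learn; the patient-by-trait matrix is simulated rather than taken from any of the cited studies.

```python
# A sketch of the PCA-then-clustering workflow: reduce 500 simulated traits
# to 5 principal components, then group the 100 patients into 3 clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))  # hypothetical patient-by-trait matrix

pcs = PCA(n_components=5).fit_transform(X)  # dimension reduction
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(pcs)
print(labels[:10])  # cluster membership of the first 10 patients
```

Note that neither step uses the outcome Yi; only the traits Xij drive the grouping.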
On the other hand, supervised learning considers the subjects’ outcomes together with their traits, and goes through a training process to determine outputs that are, on average, as close as possible to the observed outcomes. The output formulations usually vary with the outcome of interest. For example, the outcome can be the probability of getting a particular clinical event, the expected value of a disease level or the expected survival time.
Clearly, compared with unsupervised learning, supervised learning provides more clinically relevant results; hence AI applications in healthcare most often use supervised learning. (Note that unsupervised learning can be used as part of the preprocessing step to reduce dimensionality or identify subgroups, which in turn makes the follow-up supervised learning step more efficient.) Relevant techniques include linear regression, logistic regression, naïve Bayes, decision tree, nearest neighbour, random forest, discriminant analysis, support vector machine (SVM) and neural network.27 Figure 5 displays the popularity of the various supervised learning techniques in medical applications, which clearly shows that SVM and neural network are the most popular ones. This remains the case when restricting to the three major data types (image, genetic and EP), as shown in figure 6.
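As a concrete instance of this supervised workflow, the following sketch fits one of the listed techniques (logistic regression) on simulated traits and evaluates it on held-out subjects; the data and the scikit-learn pipeline are illustrative assumptions.

```python
# A sketch of the supervised workflow: learn a map from traits to a binary
# outcome, then apply it to held-out subjects standing in for new patients.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("event probability for one new patient:", clf.predict_proba(X_test[:1])[0, 1])
print("held-out accuracy:", clf.score(X_test, y_test))
```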
Figure 5 The machine learning algorithms used in the medical literature. The data are generated through searching the machine learning algorithms within healthcare on PubMed.
Figure 6 The machine learning algorithms used for imaging (upper), genetic (middle) and electrophysiological (bottom) data. The data are generated through searching the machine learning algorithms for each data type on PubMed.
Below we will provide more details about the mechanisms of SVM and neural networks, along with application examples in the cancer, neurological and cardiovascular disease areas.
Support vector machine
SVM is mainly used for classifying the subjects into two groups, where the outcome Yi is a classifier: Yi = −1 or 1 represents whether the ith patient is in group 1 or 2, respectively. (The method can be extended for scenarios with more than two groups.) The basic assumption is that the subjects can be separated into two groups through a decision boundary defined on the traits Xij, which can be written as

ai = Σj wj Xij,

where wj is the weight placed on the jth trait to reflect its relative importance in affecting the outcome among the others. The decision rule then follows that if ai > 0, the ith patient is classified to group 1, that is, labelled Yi = −1; if ai < 0, the patient is classified to group 2, that is, labelled Yi = 1. The class memberships are indeterminate for the points with ai = 0. See figure 7 for an illustration with two traits Xi1 and Xi2, where a1 = 1 and a2 = −1.
Figure 7 An illustration of the support vector machine.
The training goal is to find the optimal wjs so that the resulting classifications agree with the outcomes as much as possible, that is, with the smallest misclassification error, the error of classifying a patient into the wrong group. Intuitively, the best weights must allow (1) the sign of ai to be the same as Yi so the classification is correct; and (2) |ai| to be far away from 0 so the ambiguity of the classification is minimised. These can be achieved by selecting wjs that minimise a quadratic loss function.29 Furthermore, assuming that the new patients come from the same population, the resulting wjs can be applied to classify these new patients based on their traits.
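A minimal sketch of such a classifier follows, using scikit-learn's SVC on simulated data; the linear kernel keeps the fitted weights wj directly inspectable, though the library's hinge-loss fitting is an assumption layered on the quadratic-loss description above.

```python
# A sketch of a two-group SVM classifier with outcomes coded as -1/1. A
# linear kernel keeps the fitted weights w_j directly inspectable.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
y = np.where(y == 0, -1, 1)  # recode outcomes as Y_i = -1 or 1

svm = SVC(kernel="linear").fit(X, y)
print("weights w_j:", svm.coef_[0])  # relative importance of each trait
print("decision values a_i:", svm.decision_function(X[:3]))
print("predicted groups:", svm.predict(X[:3]))  # sign of a_i decides the group
```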
An important property of SVM is that the determination of the model parameters is a convex optimisation problem, so the solution is always a global optimum. Furthermore, many existing convex optimisation tools are readily applicable for the SVM implementation. As such, SVM has been extensively used in medical research. For instance, Orrù et al applied SVM to identify imaging biomarkers of neurological and psychiatric disease.30 Sweilam et al reviewed the use of SVM in the diagnosis of cancer.31 Khedher et al used the combination of SVM and other statistical tools to achieve early detection of Alzheimer’s disease.32 Farina et al used SVM to test the power of an offline man/machine interface that controls upper-limb prostheses.22
Neural network
One can think of a neural network as an extension of linear regression that captures complex non-linear relationships between the input variables and an outcome. In a neural network, the associations between the outcome and the input variables are represented through multiple hidden-layer combinations of prespecified functionals. The goal is to estimate the weights from the input and outcome data so that the average error between the outcomes and their predictions is minimised. We describe the method in the following example.
Mirtskhulava et al used neural network in stroke diagnosis.33 In their analysis, the input variables Xi1, . . . , Xip are p=16 stroke-related symptoms, including paraesthesia of the arm or leg, acute confusion, vision problems, problems with mobility and so on. The outcome Yi is binary: Yi=1/0 indicates that the ith patient has/does not have stroke. The output parameter of interest is the probability of stroke, ai, which takes the form

ai = f(w20 + Σk w2k fk(w10 + Σj w1j Xij)).

In the above equation, the intercepts w10 and w20 ≠ 0 guarantee the form to remain valid even when all the Xij and fk are 0; the w1js and w2ks are the weights that characterise the relative importance of the corresponding terms in affecting the outcome; and the fks and f are prespecified functionals that manifest how the weighted combinations influence the disease risk as a whole. A stylised illustration is provided in figure 8.
Figure 8 An illustration of neural network.
The training goal is to find the weights that minimise the prediction error Σi (Yi − ai)². The minimisation can be performed through standard optimisation algorithms, such as local quadratic approximation or gradient descent optimisation, which are available in both MATLAB and R. If new data come from the same population, the resulting weights can be used to predict the outcomes based on the patients’ specific traits.29
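A minimal sketch of such a one-hidden-layer network follows, fitted by gradient-based optimisation with scikit-learn's MLPClassifier; the 16 simulated symptom indicators echo the stroke example above but are not the cited study's data or code.

```python
# A sketch of a one-hidden-layer network for the stroke example: 16 binary
# symptom indicators in, estimated stroke probability a_i out. Data are
# simulated; this is not the cited study's code.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 16)).astype(float)  # p = 16 symptoms
y = (X[:, :4].sum(axis=1) + rng.normal(0, 0.5, size=500) > 2).astype(int)

net = MLPClassifier(hidden_layer_sizes=(8,),  # one layer of hidden units f_k
                    activation="logistic",    # sigmoidal functionals
                    max_iter=2000, random_state=0).fit(X, y)
print("estimated stroke probabilities a_i:", net.predict_proba(X[:3])[:, 1])
```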
Similar techniques have been used to diagnose cancer by Khan et al, where the inputs are the PCs estimated from 6567 genes and the outcomes are the tumour categories.34 Dheeba et al used neural network to predict breast cancer, with the inputs being texture information from mammographic images and the outcomes being tumour indicators.35 Hirschauer et al used a more sophisticated neural network model to diagnose Parkinson’s disease based on inputs of motor symptoms, non-motor symptoms and neuroimages.36
Deep learning: a new era of ML
Deep learning is a modern extension of the classical neural network technique. One can view deep learning as a neural network with many layers (as in figure 9). The rapid development of modern computing enables deep learning to build neural networks with a large number of layers, which is infeasible for the classical neural network. As such, deep learning can explore more complex non-linear patterns in the data. Another reason for the recent popularity of deep learning is the increase in the volume and complexity of the data.37 Figure 10 shows that the application of deep learning in medical research nearly doubled in 2016. In addition, figure 11 shows that a clear majority of deep learning is used in imaging analysis, which makes sense given that images are naturally complex and high volume.
Figure 9 An illustration of deep learning with two hidden layers.
Figure 10 Current trend for deep learning. The data are generated through searching deep learning in the healthcare and disease category on PubMed.
Figure 11 The data sources for deep learning. The data are generated through searching deep learning in combination with the diagnosis techniques on PubMed.
Different from the classical neural network, deep learning uses more hidden layers so that the algorithms can handle complex data with various structures.27 In medical applications, the commonly used deep learning algorithms include the convolutional neural network (CNN), recurrent neural network, deep belief network and deep neural network. Figure 12 depicts their trends and relative popularity from 2013 to 2016. One can see that the CNN was the most popular in 2016.
Figure 12 The four main deep learning algorithms and their popularities. The data are generated through searching the algorithm names in the healthcare and disease category on PubMed.
The CNN was developed in view of the inability of the classical ML algorithms to handle high-dimensional data, that is, data with a large number of traits. Traditionally, ML algorithms are designed to analyse data when the number of traits is small. However, image data are naturally high-dimensional because each image normally contains thousands of pixels as traits. One solution is to perform dimension reduction: first preselect a subset of pixels as features, and then perform the ML algorithms on the resulting lower dimensional features. However, heuristic feature selection procedures may lose information in the images. Unsupervised learning techniques such as PCA or clustering can be used for data-driven dimension reduction.
The CNN was first proposed and advocated for high-dimensional image analysis by LeCun et al.38 The inputs for the CNN are the properly normalised pixel values of the images. The CNN then transforms the pixel values through weighting in the convolution layers and sampling in the subsampling layers alternately. The final output is a recursive function of the weighted input values. The weights are trained to minimise the average error between the outcomes and the predictions. Implementations of the CNN are included in popular software packages such as Caffe from Berkeley AI Research,39 CNTK from Microsoft40 and TensorFlow from Google.41
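The following is a minimal sketch of this alternating convolution-and-subsampling pattern in TensorFlow (one of the packages named above); the input size, layer widths and dummy images are illustrative assumptions rather than a published diagnostic architecture.

```python
# A sketch of the alternating convolution/subsampling pattern in
# TensorFlow/Keras, trained on a dummy batch of normalised grey-scale
# images. Input size and layer widths are illustrative choices.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu"),  # weighting (convolution)
    tf.keras.layers.MaxPooling2D(2),                   # sampling (subsampling)
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),    # disease probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

images = np.random.rand(8, 64, 64, 1).astype("float32")  # 8 dummy 64x64 images
labels = np.random.randint(0, 2, size=(8, 1))
model.fit(images, labels, epochs=1, verbose=0)  # minimise prediction error
print(model.predict(images[:2], verbose=0))     # predicted probabilities
```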
Recently, the CNN has been successfully applied in the medical area to assist disease diagnosis. Long et al used it to diagnose congenital cataract disease through learning from ocular images.24 The CNN yields over 90% accuracy on diagnosis and treatment suggestion. Esteva et al applied the CNN to identify skin cancer from clinical images.20 The proportions of correctly predicted malignant lesions (ie, sensitivity) and benign lesions (ie, specificity) are both over 90%, which indicates the superior performance of the CNN. Gulshan et al applied the CNN to detect referable diabetic retinopathy from retinal fundus photographs.25 The sensitivity and specificity of the algorithm are both over 90%, which demonstrates the effectiveness of the technique for the diagnosis of diabetic retinopathy. It is worth mentioning that in all these applications, the accuracy of the CNN in classifying both normal and disease cases is competitive with that of experienced physicians.
Natural language processing
The image, EP and genetic data are machine-understandable, so the ML algorithms can be performed directly after proper preprocessing or quality control. However, a large proportion of clinical information is in the form of narrative text, such as physical examination notes, clinical laboratory reports, operative notes and discharge summaries, which are unstructured and incomprehensible to computer programs. In this context, NLP aims to extract useful information from the narrative text to assist clinical decision making.28
An NLP pipeline comprises two main components: (1) text processing and (2) classification. Through text processing, the NLP pipeline identifies a series of disease-relevant keywords in the clinical notes based on historical databases.42 A subset of the keywords is then selected by examining their effects on the classification of normal and abnormal cases. The validated keywords then enter and enrich the structured data to support clinical decision making.
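A minimal sketch of this two-stage pipeline in Python with scikit-learn follows: text processing turns narrative notes into keyword counts, and a classifier separates normal from abnormal cases. The notes, labels and feature extractor are invented for illustration, not taken from the cited systems.

```python
# A sketch of the two-stage pipeline: (1) text processing turns narrative
# notes into keyword counts; (2) a classifier separates normal from abnormal
# cases. Notes and labels are invented for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = [
    "new infiltrate on chest x-ray, fever, productive cough",
    "clear lungs, routine follow-up, no acute findings",
    "right lower lobe consolidation, elevated white cell count",
    "no infiltrate, afebrile, stable on examination",
]
labels = [1, 0, 1, 0]  # 1 = abnormal case, 0 = normal case

nlp = make_pipeline(CountVectorizer(), LogisticRegression())
nlp.fit(notes, labels)
print(nlp.predict(["fever with a new infiltrate on x-ray"]))  # -> [1]
```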
The NLP pipelines have been developed to assist clinical decision making by alerting treatment arrangements, monitoring adverse effects and so on. For example, Fiszman et al showed that introducing NLP for reading chest X-ray reports would assist the antibiotic assistant system in alerting physicians to a possible need for anti-infective therapy.43 Miller et al used NLP to automatically monitor laboratory-based adverse effects.44 Furthermore, the NLP pipelines can help with disease diagnosis. For instance, Castro et al identified 14 cerebral aneurysm disease-associated variables through implementing NLP on the clinical notes.45 The resulting variables are successfully used to classify normal patients and patients with cerebral aneurysms, with 95% and 86% accuracy rates on the training and validation samples, respectively. Afzal et al implemented NLP to extract peripheral arterial disease-related keywords from narrative clinical notes. The keywords are then used to classify normal patients and patients with peripheral arterial disease, achieving over 90% accuracy.42