Ophthalmology, Artificial Intelligence and Oculomics: A beginner's guide to general concepts

From Kahook's Essentials Of Glaucoma Therapy
Primary authors
  • KEOGT Team

My goal is to introduce basic concepts and terminology related to AI and to succinctly touch on noteworthy publications that can help guide further exploration by you if you want to take a deeper dive.

The full unabridged talk is available on YouTube: https://youtu.be/bM86TU468zs

Let’s start with why I put this talk together:

• Artificial Intelligence (AI) is emerging from a field of promise within healthcare to one of real-world application

• Learning the basics about AI will help the clinician better understand what is real from what might be hype

• Foundational knowledge will also help in the analysis and critique of publications at a time when peer reviewed manuscripts in AI-healthcare are increasing rapidly

• I want to learn

Is AI the Holy Grail?

One way to think about the potential benefit of AI is to contemplate how you might input data and then the output would be a very clear diagnosis with road plan for future care. Maybe we can use AI to look at nerves that are difficult to assess or to try and merge data from multiple diagnostic tools to give “diagnostic certainty” and make the clinical assessment objective rather than subjective. That would certainly make things interesting, but are we ready for that? Will we ever be ready for that? See what you think as we move through the next few slides.

Voices of Doom and Gloom

There are a lot of doom-and-gloom quotes out there when it comes to AI and healthcare. Here are a couple:

•Geoffrey Hinton (Google Brain and U. of Toronto) “[If] you work as a radiologist, you’re like the coyote that’s already over the edge of the cliff but hasn’t looked down so doesn’t realize there’s no ground underneath him.”

•Andrew Ng (Stanford) “[A] highly-trained specialized radiologist may now be in greater danger of being replaced by a machine than his own executive assistant.”

Let’s see how you feel about these quotes when we get to the end of this talk.

This quote was something I thought of when I first started learning the new language of AI:

Define your terms, you will permit me again to say, or we shall never understand one another. -Voltaire, Dictionnaire philosophique portatif (1764)

So let’s dig in to some terminology

Data Science: Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Think of this as information you can put on a slide after analyzing data.

Artificial Intelligence: A broad discipline with the goal of creating intelligent machines, as opposed to the natural intelligence that is demonstrated by humans and animals. It has become a somewhat catch all term that nonetheless captures the long-term ambition of the field to build machines that emulate and then exceed the full range of human cognition.

Machine Learning: ML is a subset of AI. Think of it this way: ML allows a system to achieve or enhance AI performance. It can also be thought of as computer algorithms that improve their performance through experience. A more formal definition: a subset of AI that often uses statistical techniques to give machines the ability to "learn" from data without being explicitly given the instructions for how to do so. This process is known as “training” a “model” using a learning “algorithm” that progressively improves model performance on a specific task.

Deep Learning: A subset of ML that depends on Neural Networks. In practice, the terms DL and NN are often used interchangeably. A more formal definition: an area of ML that attempts to mimic the activity in layers of neurons in the brain to learn how to recognize complex patterns in data. The “deep” in deep learning refers to the large number of layers of neurons in contemporary ML models, which help them learn rich representations of data and achieve better performance.

Artificial Neural Networks: Input-Hidden-Output. A deep NN is Input-multipleHiddenLayers-Output
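To make the Input-Hidden-Output idea concrete, here is a minimal sketch in plain Python. The layer sizes, weights and the sigmoid activation are arbitrary values I picked for illustration (they are not from any of the papers discussed); the only point is that a "deep" network is the same structure with more hidden layers stacked between input and output.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of the inputs, then applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Pass the inputs through every layer in order and return the final output."""
    activation = inputs
    for weights, biases in layers:
        activation = layer(activation, weights, biases)
    return activation

# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 1 output.
hidden = ([[0.5, -0.2], [0.1, 0.8], [-0.6, 0.3]], [0.0, 0.1, -0.1])
output = ([[0.7, -0.4, 0.2]], [0.05])

shallow_net = [hidden, output]  # Input-Hidden-Output

# A "deep" NN simply inserts more hidden layers (here 3 -> 3) before the output.
extra_hidden = ([[0.2, 0.1, -0.3], [0.4, -0.5, 0.6], [-0.1, 0.3, 0.2]], [0.0, 0.0, 0.0])
deep_net = [hidden, extra_hidden, output]  # Input-multipleHiddenLayers-Output

x = [1.0, 0.5]
shallow_out = forward(x, shallow_net)
deep_out = forward(x, deep_net)
print(shallow_out, deep_out)
```

Each network returns a single number between 0 and 1, which could be read as something like "probability of disease" once the weights have been trained rather than made up.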

Convolutional Neural Network (CNN): The most helpful definition for me is one that correlates with the biology of the human brain. CNNs are inspired by the connectivity pattern between neurons and resemble the organization of the brain’s visual cortex. Individual cortical neurons respond to stimuli in a restricted region of the visual field, which we call the receptive field. The receptive fields of different neurons partially overlap so that together they cover the entire visual field.

Oculomics: This was a new term to me that I first noted in this paper presented as, “The convergence of modern multimodal imaging techniques and large-scale data sets has fostered an extraordinary opportunity to exhaustively characterize the macroscopic, microscopic, and molecular ophthalmic features associated with health and disease (i.e., the oculome)” Wagner SK, Fu DJ, Faes L, et al. Insights into Systemic Disease through Retinal Imaging-Based Oculomics. Transl Vis Sci Technol. 2020;9(2):6. Published 2020 Feb 12. doi:10.1167/tvst.9.2.6

I recognize that this is a significant list of new terms for most clinicians, and this makes me feel more empathy for those outside of ophthalmology when we use our own language amongst ourselves. I would encourage you to take time to look up these terms and read more about the basics of AI and, to be honest, I find myself returning over and over again to the basic terms as I gain more knowledge. The language is not yet intuitive to me.

How should you think of Deep Learning?

We have already defined DL, so let’s take a minute to define “Supervised Learning” and then illustrate the basics using the human brain as a corollary to what is happening inside the “black box”. Supervised Learning is a machine learning approach that uses labeled datasets. These datasets are designed to train or “supervise” algorithms into classifying data inputted into the system or to predict outcomes accurately. An example would be identifying glaucomatous optic nerves by “teaching or supervising” the system using glaucomatous optic nerve photos and healthy control datasets. Using labeled inputs and outputs, the model can measure its accuracy and learn over time. One important term to know that relates to learning over time is backpropagation. Backpropagation, short for "backward propagation of errors," is an algorithm for supervised learning of artificial neural networks using gradient descent. The way that I think about this is that backpropagation is a method for taking a less than ideal “output” and feeding it back into the system to “teach” the NN to do better the next time. In contrast to supervised learning is unsupervised learning which uses machine learning algorithms to analyze and cluster unlabeled data sets. This approach allows for discovery of hidden patterns in data without the need for human intervention (i.e., “unsupervised”). An example would be a system that can cluster images of cats in a group when you input various images of different animals.
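For readers who like to see the mechanics, the sketch below shows supervised learning with backpropagation and gradient descent on a toy problem. Everything in it is an assumption made for illustration: the four labeled examples, the three hidden neurons, the learning rate and the squared-error loss. It is not the training procedure of any system discussed in this talk, just the smallest version of the idea that the error at the output is pushed backward to adjust the weights.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, hidden=3, lr=0.5, epochs=2000, seed=0):
    """Train a one-hidden-layer network; return the mean squared error per epoch."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, y in data:
            # Forward pass: input -> hidden -> output.
            h = [sigmoid(sum(w * xi for w, xi in zip(w1[j], x)) + b1[j])
                 for j in range(hidden)]
            out = sigmoid(sum(w2[j] * h[j] for j in range(hidden)) + b2)
            total += (out - y) ** 2
            # Backward pass: propagate the output error back through the layers.
            d_out = 2 * (out - y) * out * (1 - out)
            for j in range(hidden):
                d_h = d_out * w2[j] * h[j] * (1 - h[j])  # uses w2[j] before its update
                w2[j] -= lr * d_out * h[j]               # gradient descent step
                for i in range(n_in):
                    w1[j][i] -= lr * d_h * x[i]
                b1[j] -= lr * d_h
            b2 -= lr * d_out
        losses.append(total / len(data))
    return losses

# Toy labeled data set (inputs, label). The labels are what make this "supervised".
labeled_examples = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 1)]

losses = train(labeled_examples)
print(f"error at start: {losses[0]:.3f}, error at end: {losses[-1]:.3f}")
```

The error printed at the end should be lower than at the start, which is the "learning over time" described above.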

We have already defined many of the terms on this slide. One that we haven’t is the perceptron. A Perceptron, a single-layer neural network, is an algorithm used for supervised learning of binary classifiers. Binary classifiers decide whether an input, represented by a series of vectors, belongs to a specific class.
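The perceptron is simple enough to write out in full. The sketch below uses the classic perceptron learning rule on a toy binary classification task (the logical AND of two inputs); the data, learning rate and number of epochs are my own choices for illustration.

```python
def predict(weights, bias, x):
    """Binary decision: class 1 if the weighted sum is positive, otherwise class 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0

def train_perceptron(data, lr=1.0, epochs=20):
    """Perceptron rule: nudge the weights only when a labeled example is misclassified."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in data:
            error = target - predict(weights, bias, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Toy labeled data: the output is 1 only when both inputs are 1.
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

weights, bias = train_perceptron(and_data)
predictions = [predict(weights, bias, x) for x, _ in and_data]
print(predictions)  # [0, 0, 0, 1]
```

A single perceptron can only separate classes with a straight line (or plane), which is one reason layers of them are stacked into the deeper networks described earlier.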

The Promise of AI in Healthcare

•Augment our ability to care for patients

•Enhance access to latest data and implement in real time

•Automate repetitive tasks (dashboards and charting)

•Achieve equity across resource challenged areas

•Many other areas of promise

The Promise of AI in Eyecare

•Increased automation of test interpretation

•Dashboards to present pertinent data

•Alert clinician to diagnoses and care protocols (image processing)

•Decrease existing tendency towards high sensitivity but low specificity*

•Diagnose disease before it manifests clinically? (ex. AMD)

•Enhance surgical outcomes?

The next few slides will include studies that help tell the story of where AI has been and where it can lead us in the future.

Clinically applicable deep learning for diagnosis and referral in retinal disease

De Fauw et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018 Sep;24(9):1342-1350. The take-home point from this paper is that past optical coherence tomography (OCT) studies used small data sets and did not include multiple pathologies to teach and evaluate systems. This study uses a small data set that was manually segmented and then fed into a NN in two stages: 1) a deep segmentation network creates a detailed, device-independent tissue-segmentation map, and 2) a deep classification network analyzes this segmentation map and provides diagnoses and referral suggestions. The end result is a system that can classify 3D volumetric OCT data and then recommend follow-up on par with retina specialists and retina-trained ODs. One point on the gold standard used for the teaching-set diagnoses: the data set was made even more powerful by the researchers looking up the final diagnosis that was eventually reached and then using that to teach the system (in other words, the system was taught in a rigorous manner by making the first images predictive of the ultimate diagnosis reached as the patient traveled through the system to the specialists).

The Value of Automated Diabetic Retinopathy Screening with the EyeArt System: A Study of More Than 100,000 Consecutive Encounters from People with Diabetes

Bhaskaranand M, Ramachandra C, Bhat S, Cuadros J, Nittala MG, Sadda SR, Solanki K. The Value of Automated Diabetic Retinopathy Screening with the EyeArt System: A Study of More Than 100,000 Consecutive Encounters from People with Diabetes. Diabetes Technol Ther. 2019 Nov;21(11):635-643.

•The EyeArt system is an automated, cloud-based AI eye screening technology designed to detect referral-warranted Diabetic Retinopathy (DR).

•This retrospective study assessed the diagnostic efficacy of analyzing 850,908 fundus images from 101,710 consecutive patient visits, collected from 404 primary care clinics.

•The presence or absence of referral-warranted DR (more than mild nonproliferative DR [NPDR]) was automatically detected by the EyeArt system for each patient encounter, and its performance was compared against a clinical reference standard of quality-assured grading by rigorously trained eyecare professionals (OD/MD)

•EyeArt screening had:

•91.3% (95% confidence interval [CI]: 90.9–91.7) sensitivity

•91.1% (95% CI: 90.9–91.3) specificity

•For 5446 encounters with potentially treatable DR (more than moderate NPDR and/or diabetic macular edema), the system provided a positive ‘‘refer’’ output to 5363 encounters achieving sensitivity of 98.5%.

•Limitations: (1) Retrospective (2) Use of one grader without adjudication (3) Mean age and mean duration of disease were not available to assess effect on image acquisition and image quality (harder in elderly?)

•This study captures variations in real-world clinical practice and shows that an AI DR screening system can be safe and effective in the real world.
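The 98.5% figure for potentially treatable DR can be checked directly from the counts reported above. The short sketch below does that arithmetic; the function names are mine, and the specificity example uses made-up counts purely to show the formula.

```python
def sensitivity(true_pos, false_neg):
    """Share of diseased encounters that the test flags: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Share of healthy encounters that the test clears: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)

# Counts reported for encounters with potentially treatable DR.
treatable_encounters = 5446
referred_by_system = 5363
missed = treatable_encounters - referred_by_system  # encounters not referred

treatable_sensitivity = sensitivity(referred_by_system, missed)
print(f"Sensitivity for treatable DR: {treatable_sensitivity:.1%}")  # 98.5%

# Hypothetical counts (not from the study), only to illustrate specificity.
example_specificity = specificity(true_neg=900, false_pos=100)
print(f"Example specificity: {example_specificity:.1%}")  # 90.0%
```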

Evaluation of a Deep Learning System For Identifying Glaucomatous Optic Neuropathy Based on Color Fundus Photographs

Al-Aswad LA, Kapoor R, Chu CK, Walters S, Gong D, Garg A, Gopal K, Patel V, Sameer T, Rogers TW, Nicolas J, De Moraes GC, Moazami G. Evaluation of a Deep Learning System For Identifying Glaucomatous Optic Neuropathy Based on Color Fundus Photographs. J Glaucoma. 2019 Dec;28(12):1029-1034.

•Six ophthalmologists and the DL system, Pegasus, graded 110 color fundus photographs (randomly sampled from the Singapore Malay Eye Study) in this retrospective single-center study. Ophthalmologists and Pegasus were compared with each other and to the original clinical diagnosis, which was defined as the gold standard. Pegasus’ performance was compared with the “best case” consensus scenario, which was the combination of ophthalmologists whose opinion most closely matched the gold standard.

•The performance of the ophthalmologists and Pegasus, at the binary classification of nonglaucoma versus glaucoma from fundus photographs, was assessed in terms of sensitivity, specificity and AUROC.

•Pegasus achieved an AUROC of 92.6%, compared with ophthalmologist AUROCs ranging from 69.6% to 84.9% and a “best case” consensus scenario AUROC of 89.1%. Pegasus had a sensitivity of 83.7% and a specificity of 88.2%, whereas ophthalmologists’ sensitivity ranged from 61.3% to 81.6% and specificity ranged from 80.0% to 94.1%.

•The agreement between Pegasus and gold standard was 0.715, whereas the highest ophthalmologist agreement with the gold standard was 0.613. Intra-observer agreement ranged from 0.62 to 0.97 for ophthalmologists and was perfect (1.00) for Pegasus. The DL system took 10% of the time of the ophthalmologists in determining classification.

•Pegasus outperformed 5 of 6 ophthalmologists in diagnostic performance, and there was no statistically significant difference between the DL system and the “best case” consensus between the ophthalmologists.

Predicting Glaucoma before Onset Using Deep Learning

Thakur A, Goldbaum M, Yousefi S. Predicting Glaucoma before Onset Using Deep Learning. Ophthalmol Glaucoma. 2020 Jul-Aug;3(4):262-268.

•Assess the accuracy of DL models to predict glaucoma development from ONH photos several years before disease onset.

•The reported accuracy of DL to correctly identify GON in past studies ranged between 0.83 and 0.98.

•Past studies centered on glaucoma diagnosis from images collected years after disease onset, whereas this study focused on predicting glaucoma before manifestation of clinical signs (GON).

•The AUC in predicting glaucoma development 4 to 7 years before disease onset was 0.77 (95% confidence interval [CI], 0.75–0.79).

•The AUC in predicting glaucoma 1 to 3 years before disease onset was 0.88 (95% CI, 0.86–0.91).

•The AUC in detecting glaucoma after onset was 0.95 (95% CI, 0.94–0.96).

•Testing with DL on images of eyes converting by HVF and not GON had an AUC of 0.88, while the AUC was 0.97 when excluding these eyes.

•Eyes with visual field abnormality but not GON had a higher tendency to be missed by deep learning algorithms.

From Machine to Machine: An OCT-trained Deep Learning Algorithm for Objective Quantification of Glaucomatous Damage in Fundus Photographs

Medeiros FA, Jammal AA, Thompson AC. From Machine to Machine: An OCT-Trained Deep Learning Algorithm for Objective Quantification of Glaucomatous Damage in Fundus Photographs. Ophthalmology. 2019 Apr;126(4):513-521.

•CNN was trained to assess optic disc photographs and predict SDOCT average RNFL thickness.

•32,820 pairs of ONH photos and SDOCT RNFL scans from 2,312 eyes of 1,198 subjects.

•Mean absolute error of the predictions of 7.39 μm.

•AUROC for discriminating glaucomatous from healthy eyes with the DL predictions and actual SDOCT average RNFL thickness measurements were 0.944 (95% CI: 0.912– 0.966) and 0.940 (95% CI: 0.902 – 0.966), respectively (P = 0.724).

•Real and predicted RNFL values correlated with HVF MD.

•This approach could be used to extract progression information from ONH photos that could then be used for monitoring glaucomatous damage over time.

•The study utilized average RNFL and not sectoral values. One example from the paper: “The deep learning algorithm classification (abnormal, with P=0.72) disagreed with the SDOCT average RNFL classification (normal), but subjective analysis of the disc photo actually shows inferior localized rim thinning.”

Multi-modal Machine Learning using Visual Fields and Peripapillary Circular OCT Scans in Detection of Glaucomatous Optic Neuropathy

Xiong J, Li F, Song D, Tang G, He J, Gao K, Zhang H, Cheng W, Song Y, Lin F, Hu K, Wang P, Olivia Li JP, Aung T, Qiao Y, Zhang X, Ting DS. Multi-modal Machine Learning using Visual Fields and Peripapillary Circular OCT Scans in Detection of Glaucomatous Optic Neuropathy. Ophthalmology. 2021 Jul 30:S0161-6420

•Purpose: To develop and test a multi-modal artificial intelligence (AI) algorithm, FusionNet, using VF pattern deviation probability plots (PDPs) and circular peripapillary OCT scans to detect glaucomatous optic neuropathy.

•Subjects: A total of 2463 pairs of VF and OCT images from 1083 patients.


•VF data were collected using Humphrey Field Analyzer (HFA). OCT images were collected from three types of devices (DRI-OCT, Cirrus OCT and Spectralis OCT).

•A total of 2463 pairs of VF and OCT images were divided into four datasets: 1567 for training (HFA and DRI-OCT), 441 for primary validation (HFA and DRI-OCT), 255 for the internal test set (HFA and Cirrus OCT), and 200 for the external test set (HFA and Spectralis OCT).

•GON was defined as retinal nerve fibre layer (RNFL) thinning with corresponding VF defects.

•Four glaucomatologists (GT, KG, HZ and CY) independently labeled the data as GON vs non-GON based on clinical history and VF-OCT pairs, with discordant findings arbitrated by open consensus.

•Take Home Results:

•In the internal and external test sets, FusionNet performance continued to be superior (AUROC: 0.917 [0.876-0.958], AUROC: 0.873 [0.822- 0.924]) to VFNet (AUROC: 0.854 [0.796-0.912], AUROC: 0.772 [0.707-0.838]), and OCTNet (AUROC: 0.811 [0.753-0.869], AUROC: 0.785 [0.721-0.850]).

•There was no significant difference between the two glaucomatologists (AUROC: 0.869 [0.818-0.920] and 0.839 [0.777-0.901]; AUROC: 0.841 [0.780-0.902]) and FusionNet in the internal and external test sets, except for glaucomatologist 2 (AUROC: 0.858 [0.805-0.912]) in the internal test set.


•FusionNet, developed using paired VF-OCT data, demonstrated superior performance to both VFNet and OCTNet in detecting GON.

Generative adversarial network and texture features applied to automatic glaucoma detection

Bisneto et al. Generative adversarial network and texture features applied to automatic glaucoma detection. Applied Soft Computing. (2020)

https://machinelearningmastery.com/what-are-generative-adversarial-networks-gans/

• Generative Adversarial Network (GAN) is a form of ML that automatically discovers and learns the patterns in input data so the model can output new examples that are plausibly drawn from the original dataset.

• Two NN models compete in a zero-sum game (right or wrong):

  • Generator model trained to generate new examples.

  • Discriminator model trained to classify examples as real (from the domain) or fake (generated).

• Allows for use of smaller data sets but appears better suited to be part of the overall approach rather than the star of the show.

• “The results are promising and indicate that the method is robust, initially reaching 77.9% accuracy. However, as we apply improvements and adjustments in the method employed, we reach 100% accuracy and a ROC curve of 1.”

A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy

Beede et al. CHI 2020, April 25–30, 2020, Honolulu, HI, USA

• 9.6% of Thailand’s population living with diabetes (9.1% in United States)

• 1500 ophthalmologists (200 retinal specialists) provide ophthalmic care to ~4.5mil diabetic patients

• Prospective study to evaluate the feasibility/performance of an AI algorithm in a real-world clinical setting

• 393/1838 (21%) of images were ungradable due to poor lighting, pupil constriction and broken cameras

• Ungradable image triggered referral to ophthalmology, which became untenable due to travel/cost/burden

• Corrective measures: drape/curtain to improve lighting, fix cameras, referral only after review of ungradable images

Ocular Biomarkers to Systemic Disease

I do want to take some time to discuss studies that go beyond exploring ocular diseases by connecting retinal pathology with systemic disease. This is one area of AI studies that has tremendous implications for community health initiatives and continues to evolve as bigger data sets become available for analysis.

The first paper we will cover is: Insights into Systemic Disease through Retinal Imaging-Based Oculomics Wagner SK, Fu DJ, Faes L, Liu X, Huemer J, Khalid H, Ferraz D, Korot E, Kelly C, Balaskas K, Denniston AK, Keane PA. Insights into Systemic Disease through Retinal Imaging-Based Oculomics. Transl Vis Sci Technol. 2020 Feb 12;9(2):6.

• We have previously introduced the oculome, but it is worth defining again. This area of study (oculomics) has emerged as a result of the convergence of multimodal imaging techniques and large data sets, allowing for the characterization of the microscopic, macroscopic and molecular ophthalmic features that are associated with overall health and disease (the oculome). This paper is a good place to start to get a sense of how researchers think of accessing and utilizing large “bio banks” to explore the oculome.

• The acquisition of large data sets to train DL models is evolving but is often limited by access to and quality of data. This is the case both for eye-centered research and for studies connecting ocular pathology with systemic disease.

• The major advantage in this line of exploration is that ophthalmoscopic changes in retinal microvasculature structure have been identified as independent predictors for hypertension, diabetes, coronary disease, renal disease, and stroke.

• Some systemic disorders may present as distinct retinal manifestations such as the sea fan neovascularization of sickle cell anemia, the macular crystals of cystinosis, or the astrocytic hamartomas of tuberous sclerosis.

• The authors stated that retrospective real-world data may be used successfully to investigate connections between retinal findings and systemic disease but is often hindered by incomplete labeling and inadequate access to health records outside of eye care.

The Potential Utility of Oculomics

Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning

Poplin R, Varadarajan AV, Blumer K, Liu Y, McConnell MV, Corrado GS, Peng L, Webster DR. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018 Mar;2(3):158-164.

•Using deep-learning models trained on data, obtained from UK Biobank (population based health data obtained through health measurements and questionnaires) and EyePacs (fundus images obtained from patients with diabetes in a screening setting), from 284,335 patients and validated on two independent datasets of 12,026 and 999 patients, the authors predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70).

•AUC of 0.5 suggests no discrimination (i.e., ability to diagnose patients with and without the disease or condition based on the test), 0.7 to 0.8 is considered acceptable, 0.8 to 0.9 is considered excellent, and more than 0.9 is considered outstanding.
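One way I find helpful to think about AUC: it is the probability that a randomly chosen patient with the condition receives a higher score from the model than a randomly chosen patient without it. The sketch below computes AUC by checking every such pair. The scores are invented for illustration and the function name is my own.

```python
def auc(positive_scores, negative_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))

# Hypothetical model scores for patients with and without the condition.
with_condition = [0.9, 0.8, 0.4]
without_condition = [0.7, 0.3, 0.2]

example_auc = auc(with_condition, without_condition)
print(f"AUC = {example_auc:.3f}")  # 8 of 9 pairs ranked correctly -> 0.889

# A model that gives everyone the same score cannot discriminate at all.
no_discrimination = auc([0.5, 0.5], [0.5, 0.5])
print(f"AUC = {no_discrimination:.1f}")  # 0.5
```

Seen this way, an AUC of 0.70 for major adverse cardiac events means the model ranks the right patient higher in about 7 of 10 such pairs.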

•Limitations of this study included (1) Use of 45° field-of-view images, so results may not translate to smaller or larger image fields (2) Use of a smaller data set with large CI (3) Missing data such as lipid panels and gold standard DM diagnoses (4) Smoking was self-reported

•My own conclusion: Retinal images alone performed as well as calculators, but calculators are not used in a vacuum by clinicians and AI systems are typically at a disadvantage when applied in real world settings. Conclusions from studies like this grab attention but don’t often lead to actionable progress in the real world.

Precision vs Recall:

•Precision: How many of those called positive are actually positive (you can still miss a bunch of positives and have high precision)


•Recall: How many of all positives were actually “caught” and called positive (you can catch all the positives but also many negatives and still have high recall)

•The Tug of War between Precision (TP/(TP+FP)) and Recall (TP/(TP+FN)) is a constant issue in AI/DL
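The tug of war is easiest to see by moving the decision threshold on the same set of model scores. In the sketch below, the scores and labels are invented and the function name is mine; a strict threshold gives high precision with low recall, and a lenient threshold gives the reverse.

```python
def precision_recall(scores, labels, threshold):
    """Call a case positive when its score is at or above the threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical model scores and true labels (1 = disease, 0 = healthy).
scores = [0.95, 0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

strict = precision_recall(scores, labels, threshold=0.85)
lenient = precision_recall(scores, labels, threshold=0.35)

print(strict)   # (1.0, 0.5): every call is right, but half the disease is missed
print(lenient)  # recall is 1.0, but precision falls to 4/6
```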

Attention Heat Maps or Saliency Maps:

This paper also allows me to touch on the concept of Attention Heat Maps or Saliency Maps:

We covered the idea that the “black box” in AI systems can be the bridge between input and output, but what happens in that black box to achieve the output is not easily deciphered. Saliency maps are a popular visualization tool for gaining insight into why a deep learning model made an individual decision, such as classifying an image in one way or another. For example, models trained to predict gender primarily highlighted the optic disc, vessels and macula. Using the “heat” concept, hotness corresponds to regions that have a big impact on the model’s final decision. This allows researchers to further understand which data points are important in reaching specific outputs and can also lead to further exploration of specific connections between disease and clinical presentations that were previously unappreciated by the “clinician’s eye” alone.
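There are several ways to build such a map. One of the simplest to understand is occlusion: hide one region of the image at a time and record how much the model’s output changes. The sketch below does this on a tiny 3x3 “image” with a stand-in model. The image, the weights and the model itself are all invented for illustration, and occlusion is not necessarily the technique used in the paper above.

```python
import math

def model_score(image, weights):
    """Stand-in model: a weighted sum of the pixels squashed into the range (0, 1)."""
    total = sum(w * p
                for w_row, p_row in zip(weights, image)
                for w, p in zip(w_row, p_row))
    return 1.0 / (1.0 + math.exp(-total))

def occlusion_saliency(image, weights, baseline=0.0):
    """Heat for each pixel = how much the score changes when that pixel is hidden."""
    base = model_score(image, weights)
    heat = []
    for r, row in enumerate(image):
        heat_row = []
        for c, _ in enumerate(row):
            occluded = [list(px_row) for px_row in image]  # copy, then hide one pixel
            occluded[r][c] = baseline
            heat_row.append(abs(base - model_score(occluded, weights)))
        heat.append(heat_row)
    return heat

# Hypothetical 3x3 image and model weights; the center pixel matters most.
image = [[0.2, 0.1, 0.0],
         [0.1, 0.9, 0.3],
         [0.0, 0.2, 0.1]]
weights = [[0.1, 0.1, 0.1],
           [0.1, 2.0, 0.5],
           [0.1, 0.1, 0.1]]

heat = occlusion_saliency(image, weights)
hottest = max((heat[r][c], (r, c)) for r in range(3) for c in range(3))[1]
print(hottest)  # (1, 1): the center pixel has the biggest impact on the score
```

On a real fundus photo the same idea would be applied to patches rather than single pixels, and the resulting heat values would be overlaid on the image.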

Deep learning from “passive feeding” to “selective eating” of real-world data

Li, Z., Guo, C., Nie, D. et al. Deep learning from “passive feeding” to “selective eating” of real-world data. npj Digit. Med. 3, 143 (2020).

•AI systems typically base diagnoses on images with uncontrolled quality (“passive feeding”), leading to uncertainty about their performance.

•Li and colleagues used 40,562 fundus images to develop a deep learning–based image filtering system (DLIFS) for detecting and filtering out poor-quality images (“selective eating”).

•In three independent datasets from different clinical institutions, the DLIFS performed well, with sensitivities of 95.6%-96.9% and specificities of 96.6%-98.8%.

•Implications for both teaching new AI systems as well as for practical clinical feedback when obtaining imaging for patient care. Can we create a red flag when repeated imaging is needed?

This slide touches on one of the main shortcomings of AI algorithms when viewing it from the angle of the clinician. Existing AI approaches are associative, meaning there is a correlation being made between patient symptoms and the ultimate diagnosis. An alternative approach is proposed of using counterfactual algorithms (i.e., an approach that relates what has not happened.) This approach was found to be more accurate when compared to traditional methods but requires further real-world testing (like all of AI in fact) but is noteworthy as both an indication of what can be as well as what present challenges are with the inherently concrete algorithmic approach utilized in AI.

As research volume increases and more study groups begin to approach AI as a tool to be used in clinical activities, it is helpful to set some standards that can be followed for both protocol writing as well as clinical data reporting to our community of clinicians and researchers. Towards this end, those interested can read the proposed international guidelines for clinical trial protocols (SPIRIT-AI) and clinical trial reports (CONSORT-AI) which were designed to improve the quality and transparency of publications. This should also help with independent validation and the ability to recognize data sets that are not generalizable or, certainly worse, data that are not of high quality and could lead to harmful conclusions. These frameworks are based on the existing international CONSORT and SPIRIT standards that were created for non-AI centered research. I would encourage you to take a look if you are either a consumer or producer of AI data/publications.

Healthcare Equity and AI

Adopting AI and utilizing it in real-world settings has been proposed as one way to enhance health equity and provide new tools for resource-challenged areas to serve patients and improve outcomes.

• 8.9% of the global population (646 million people) cannot reach healthcare within one hour if they have access to motorized transport.

• 43.3% (3.16 billion people) cannot reach a healthcare facility by foot within one hour.

• Access to paraprofessionals is more common in many places around the globe with the drawback of poor access to specialists.

• AI empowered devices (cell-phones) may allow for a “specialist on demand” approach that can enhance both access as well as outcomes.

• AI empowered devices are becoming more cost-efficient and widely available.

• Achieving equity in healthcare:

  • Teaching basics of utilizing AI

  • Run, adapt and create AI models for local use

  • Outside collaborations to improve AI ecosystem

• It should be noted that one major shortcoming of current AI datasets, the lack of diversity in input data, must be addressed before these thoughts are applicable beyond a politically correct statement. It’s time to put up or shut up about this issue in AI and I know organizations like ORBIS are trying to facilitate new data sets that will apply to our global population.

What is Ophthalmology’s version of the Trolley Problem?

The trolley dilemma is a classic thought experiment developed by philosopher Philippa Foot in 1967 and adapted by Judith Jarvis Thomson in 1985. The basic premise is that humans presumably have the ability to make quick decisions that take into account the consequences of actions to the “self” and “others” and make decisions that are morally and not just algorithmically correct. Will the trolley try to minimize damage to the self or to those on the rails, and will the car choose to injure the baby or the elderly lady? Similar moral questions may arise when utilizing AI in healthcare. This is an emerging field that will need to walk hand in hand with clinicians to produce ethics-driven decision making, which remains a challenge in our world today independent of AI.

The Best Algorithms Struggle to Recognize Black Faces Equally

• In line with the past slides, unintended consequences of using AI may include augmenting racial disparities instead of leveling the playing field.

• Test challenges to investigate outcomes when algorithms are “asked” to verify that two photos showed the same face have been disappointing. In one example, algorithms falsely matched different white women’s faces at a rate of one in 10,000 but falsely matched black women’s faces about once in 1,000, which is 10 times more frequently. A one in 10,000 false match rate is often used to evaluate facial recognition systems.

• It has also been shown that Amazon, Microsoft and IBM services that try to detect the gender of faces in photos were near perfect for men with pale skin but failed more than 20 percent of the time on women with dark skin.

• It is clear that we have a lot of work to do to make AI better, not just for clinical performance but also for equitable and ethical performance.

On the Opportunities and Risks of Foundation Models

R. Bommasani et al, 2021 (Stanford’s Center for Research on Foundation Models)

I wanted to introduce the concept of Foundation Models and what it means for the future of AI in the clinic.

• Foundation Models are incomplete “foundations” that are adaptable to specific tasks and serve as the common basis from which many task-specific models are built.

• Unchecked, this can lead to homogenized AI ecosystems (same foundation, same benefits and limitations across platforms, including bugs in the system)

• This quote from Bommasani and colleagues is noteworthy, “The goal must be to ensure the responsible development and deployment of these models on durable foundations, we envision collaboration between different sectors, institutions, and disciplines from the onset to be especially critical.”

• It would be desirable to create unique models that can be evaluated against each other rather than trying to level the playing field early on and create unintended harmony on the side of shortcomings that are perpetuated widely. An example relevant to all of us is how Zeiss controls much of what we do in the visual field realm while keeping their inner workings in a black box (they don’t play well with others). I think we have a chance to make things more competitive in AI and demand more transparency rather than rely on what we are given without pushback.

Artificial Intelligence Platform for Childhood Cataracts in Eye Clinics

• CC-Cruiser is an AI platform developed for diagnosing childhood cataracts. The high accuracy of CC-Cruiser was previously validated using specific datasets. The objective of this study by Lin and colleagues was to compare the diagnostic efficacy and treatment decision-making capacity of CC-Cruiser with that of ophthalmologists in real-world clinical settings. They found that CC-Cruiser was less accurate than senior consultants in diagnosing childhood cataracts and making treatment decisions. In this case, the doctor wins, and I think that might be a good place to end the marathon of papers discussed. We still live in a world where the clinician is key to the diagnostic process, and we are waiting for AI to catch up and find its place in the clinic. I have some thoughts on how that might evolve.

Summary of Current State of AI in Eyecare

• The story of AI in eyecare is still one of promise.

• Studies continue and data are promising but with several caveats including:

  • Serious questions about generalizability

  • Appropriate questions about population-based biases

  • Practical questions about access to imaging devices and economic feasibility involved with data acquisition and interpretation

  • Fundamental questions about utility of AI compared to the need for dashboards to assist (Augmented Intelligence and smart flow solutions)

Is AI the Holy Grail?

After contemplating this slide and thinking about the information shared today, I wonder if you still feel the same about what might be “ideal” from an AI system as it relates to clinical activities. As I read more about AI, I am drawn to the idea of using AI to streamline information from diagnostics rather than to spit out a diagnosis for me to read and tabulate. I would prefer a system that alerts me to abnormalities and maybe also points out connections between different diagnostics (like a superior notch correlating with superior RNFL thinning on OCT and inferior visual field depressions). Or a system that can help with normative data in a smarter way to lessen the chance for high or low specificity. I gravitate towards an Augmented Intelligence partner rather than an artificial diagnostic surrogate for my brain.

Voices of Doom and Gloom

Remember these doom and gloom quotes? How do you feel about them now?

• Geoffrey Hinton (Google Brain and U. of Toronto) “[If] you work as a radiologist, you’re like the coyote that’s already over the edge of the cliff but hasn’t looked down so doesn’t realize there’s no ground underneath him.”

• Andrew Ng (Stanford) “[A] highly-trained specialized radiologist may now be in greater danger of being replaced by a machine than his own executive assistant.”

These are certainly myopic statements, meant more to stir emotions than to reflect current or near-future realities. We can do better than playing on fear and instead address how best to incorporate AI as part of the clinical care ecosystem rather than treating it as a cyborg that will take over medicine as we know it.

Future Directions

• Create a smart “Interpretation Assistance” platform rather than jumping to diagnostic capabilities

• Increase the ability to screen for disease based on low-cost diagnostics (photos)

• Run head-to-head trials in real-world settings comparing AI and clinician outcomes, with the aim of learning how AI can better support clinicians

• Use AI in surgical practice to increase efficiency and alert to complications before they happen

Please remember that the full unabridged lecture can be found on YouTube (link provided above) and I welcome engaging with questions and ideas in the comments section.