“This pandemic was a big test for artificial intelligence and medicine,” said Driggs, who is himself working on a machine learning tool to help doctors during the pandemic. “It would have gone a long way toward getting the public on our side,” he said. “But I don’t think we passed that test.”
Both teams found that researchers repeated the same basic mistakes in the way they trained or tested their tools. Incorrect assumptions about the data often meant that the trained models did not work as claimed.
Wynants and Driggs still believe that artificial intelligence has the potential to help. But they worry that tools built the wrong way could be harmful, because they may miss diagnoses or underestimate the risk for vulnerable patients. “There is a lot of hype about machine learning models and what they can do today,” said Driggs.
Unrealistic expectations encourage the use of these tools before they are ready. Both Wynants and Driggs said that some of the algorithms they studied are already in use in hospitals, and some are being marketed by private developers. “I am worried that they may have harmed patients,” Wynants said.
So what went wrong? And how can we bridge this gap? If there is an upside, it is that the pandemic has made it clear to many researchers that the way artificial intelligence tools are built needs to change. “The pandemic has put the spotlight on problems that we have been putting off for some time,” Wynants said.
What went wrong
Many of the problems found were related to the poor quality of the data that researchers used to develop their tools. Information about covid patients, including medical scans, was collected and shared in the middle of a global pandemic, often by the very doctors struggling to treat those patients. Researchers wanted to help quickly, and these were the only public data sets available. But this meant that many tools were built using mislabeled data or data from unknown sources.
Driggs highlighted the problem of what he called Frankenstein data sets, which are stitched together from multiple sources and may contain duplicates. This means that some tools end up being tested on the same data they were trained on, making them look more accurate than they actually are.
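One way to catch this kind of overlap is to fingerprint every scan before the data is split. The following is a minimal Python sketch, not the method either team used; the function names, file names, and toy byte strings standing in for image files are assumptions made for illustration. It only catches byte-identical copies, so a duplicate that has been resized or re-encoded would need a perceptual hash instead.

```python
from __future__ import annotations

import hashlib


def fingerprint(image_bytes: bytes) -> str:
    """Return a SHA-256 digest that identifies byte-identical files."""
    return hashlib.sha256(image_bytes).hexdigest()


def find_leaks(train: dict[str, bytes], test: dict[str, bytes]) -> list[tuple[str, str]]:
    """List (test_name, train_name) pairs whose file contents are identical."""
    seen: dict[str, str] = {}
    for name, data in train.items():
        # Remember the first training file seen for each fingerprint.
        seen.setdefault(fingerprint(data), name)
    leaks = []
    for name, data in test.items():
        match = seen.get(fingerprint(data))
        if match is not None:
            leaks.append((name, match))
    return leaks


# Toy stand-ins for scans merged from two hypothetical sources.
train_scans = {"siteA/001.png": b"scan-1", "siteA/002.png": b"scan-2"}
test_scans = {"siteB/x17.png": b"scan-2", "siteB/x18.png": b"scan-3"}

leaks = find_leaks(train_scans, test_scans)
print(leaks)
```

Any pair reported here is a test scan the model has already seen during training, so it should be dropped from one side of the split before accuracy is measured.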
Stitching data together also obscures where certain data sets came from. This can mean that researchers miss important features that skew the training of their models. Many unwittingly used a data set containing chest scans of children who did not have covid as their examples of non-covid cases. As a result, the AIs learned to recognize children, not covid.
Driggs’ group trained its own model using a data set that contained a mix of scans taken while patients were lying down and standing up. Because patients who were scanned lying down were more likely to be seriously ill, the AI wrongly learned to predict serious covid risk from a person’s position.
In other cases, some AIs were found to be picking up on the text font that certain hospitals used to label their scans. As a result, fonts from hospitals with more severe caseloads became predictors of covid risk.
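A quick check for shortcuts like these is to see whether a piece of metadata, on its own, already separates the labels. The sketch below is a hedged illustration in Python: the records are made up, and the field names `position` and `severe` are assumptions, not fields from any data set described here.

```python
from collections import defaultdict


def positive_rate_by_group(records, field, label="severe"):
    """Return the fraction of positive labels for each value of a metadata field."""
    counts = defaultdict(lambda: [0, 0])  # group -> [positives, total]
    for record in records:
        tally = counts[record[field]]
        tally[0] += int(record[label])
        tally[1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


# Made-up records: most lying-down patients are severe, most standing ones are not.
records = [
    {"position": "lying", "severe": True},
    {"position": "lying", "severe": True},
    {"position": "lying", "severe": True},
    {"position": "lying", "severe": False},
    {"position": "standing", "severe": False},
    {"position": "standing", "severe": False},
    {"position": "standing", "severe": False},
    {"position": "standing", "severe": True},
]

rates = positive_rate_by_group(records, "position")
print(rates)
```

A large gap between groups suggests that a model could score well by reading the metadata rather than the disease, which is a cue to rebalance the data or to mask that feature before training.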
In hindsight, mistakes like these seem obvious. They can also be fixed by adjusting the models, if researchers are aware of them. It is possible to acknowledge the shortcomings and release a less accurate but less misleading model. However, many tools were developed either by AI researchers who lacked the medical expertise to spot flaws in the data, or by medical researchers who lacked the mathematical skills to compensate for those flaws.
A more subtle issue highlighted by Driggs is incorporation bias, or bias introduced at the point a data set is labeled. For example, many medical scans were labeled according to whether the radiologists who created them said they showed covid. But this embeds any biases of that particular doctor into the ground truth of the data set. Driggs said it would be much better to label a medical scan with the result of a PCR test rather than one doctor’s opinion. But in a busy hospital, there is not always time for statistical niceties.
This has not stopped some of these tools from being rushed into clinical practice. Wynants said it is not yet clear which ones are being used, or how. Hospitals sometimes say that they are using a tool only for research purposes, which makes it difficult to assess how much doctors rely on it. “There is a lot of secrecy,” she said.
Wynants asked a company marketing deep learning algorithms to share information about its methods, but did not receive a response. She later discovered several published models from researchers associated with the company, all of which carry a high risk of bias. “We don’t actually know what the company implemented,” she said.
According to Wynants, some hospitals have even signed confidentiality agreements with medical AI providers. When she asked doctors what algorithms or software they were using, they sometimes told her they were not allowed to say.
How to fix it
What is the solution? Better data would help, but in times of crisis that is a big ask. What is more important is to make the most of the data sets we have. Driggs said that the simplest move would be for AI teams to collaborate more with clinicians. Researchers also need to share their models and disclose how they were trained so that others can test them and build on them. “These are two things we can do today,” he said. “They might solve 50% of the problems we found.”
Bilal Mateen, a doctor in charge of clinical technology research at the Wellcome Trust, a global health research charity based in London, said that data would also be easier to obtain if formats were standardized.