

AI Trained with Public Databases Can Be Misleading

Using open-source data off-label for AI machine learning yields biased results.

Source: Geralt/Pixabay

Artificial intelligence (AI) machine learning is rapidly being deployed in biotechnology, healthcare, and life sciences. The pattern recognition and predictive capabilities of AI computer vision are increasingly being used to analyze medical imaging such as magnetic resonance imaging (MRI) data. A new University of California Berkeley (UC Berkeley) study identified conditions in which AI image reconstruction from MRI data using public datasets can lead to biased results and “data crimes.”

The UC Berkeley researchers coined the term “data crimes” to describe cases in which off-label use of public databases for AI algorithms produces biased, overly positive results that are then reported in publications as state-of-the-art. They discovered that using open-access databases for MRI reconstruction AI modeling could yield “overly optimistic results.”

Artificial intelligence deep learning is well-suited to identify signals in the noise of complex MRI data. MRI is the most common method for imaging the brain and spinal cord to help diagnose stroke, aneurysms, tumors, traumatic brain injuries, multiple sclerosis, spinal cord disorders, vision and hearing disorders, and other conditions. MRI is a noninvasive medical imaging method that combines a magnetic field with computer-generated radio waves to create images of the organs, tissues, and skeletal system of the human body. Functional MRI (fMRI) is a special type of MRI that produces images of the blood flow to certain parts of the brain. fMRIs are used to assess damage from head trauma, concussion, epilepsy, schizophrenia, tumors, Parkinson’s disease, Alzheimer’s disease, and more.

According to the UC Berkeley researchers, problems arise when AI machine learning is trained and assessed on public databases used in an off-label manner where data intended for a specific purpose is used for a different one.

“Public databases are an important driving force in the current deep learning (DL) revolution; ImageNet is a well-known example,” the researchers wrote. “However, due to the growing availability of open-access data and the general hype around artificial intelligence, databases are sometimes used in an ‘off-label’ manner: Data published for one task are used for different ones.”

Artificial intelligence machine learning typically requires massive amounts of data to train the algorithms. The algorithm learns features from the data in order to perform tasks such as classification or prediction. For example, to train AI computer vision algorithms, developers may use open-source datasets such as Google’s Open Images, ImageNet, COIL100, CIFAR-10, and others.

The researchers studied the effects of off-label use of open-source data on three commonly used approaches to MRI reconstruction: deep learning, compressed sensing, and dictionary learning.

“Here we study three well-known algorithms developed for image reconstruction from magnetic resonance imaging measurements and show they could produce biased results with up to 48% artificial improvement when applied to public databases,” the researchers wrote.

The researchers' study demonstrates that AI algorithms claiming state-of-the-art performance on critical functions, such as detecting diseases and health disorders, may actually have much lower accuracy, with their reported performance inflated by not using raw MRI data to train and test the algorithm.

AI deep learning requires thousands of examples, and databases with raw k-space data. In radiology, k-space is the raw data for MRIs: the temporary data space, the 2D or 3D Fourier transform of the image, in which spatial frequencies are stored during data collection. A common problem is scarcity; there are few databases with enough raw k-space data to train deep learning algorithms.

To work around this lack of k-space data, a common practice among AI developers is to synthesize “raw” k-space data from processed images using the forward Fourier transform, a mathematical technique that converts a signal into its constituent frequencies. And therein lies the crux of the AI problem.
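To see why synthesized k-space is not equivalent to scanner data, consider a minimal NumPy sketch (an illustration of the general idea, not the study's actual pipeline). A real MRI measurement is complex-valued, with both magnitude and phase; a processed public image typically retains only the magnitude. Applying the forward Fourier transform to that magnitude-only image produces "raw" data that has silently discarded the phase:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a complex-valued MRI image: magnitude plus a spatially
# varying phase, as real scanner data would have. (Toy data, for
# illustration only.)
magnitude = rng.random((64, 64))
phase = rng.uniform(-np.pi, np.pi, size=(64, 64))
true_image = magnitude * np.exp(1j * phase)

# True raw k-space: the forward 2D Fourier transform of the complex image.
true_kspace = np.fft.fft2(true_image)

# Off-label practice: a public database stores only the processed
# magnitude image, so "raw" k-space is synthesized from it instead.
synthetic_kspace = np.fft.fft2(magnitude)

# The synthesized k-space differs substantially from what a scanner
# actually measures, because all phase information has been discarded.
relative_error = (np.linalg.norm(synthetic_kspace - true_kspace)
                  / np.linalg.norm(true_kspace))
print(f"relative error of synthesized k-space: {relative_error:.2f}")
```

An algorithm trained and evaluated on such synthesized data is solving an artificially cleaner problem than the one posed by genuine scanner measurements, which is one route to the overly optimistic results the researchers describe.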

The scientists observed that the bias originates in using processed or synthesized rather than raw data. The team took raw MRI data and altered it with carefully managed processing steps to observe the effects. The researchers discovered that artificial neural networks trained on raw data generalize better than those trained on processed data.

“In summary, this study reveals two types of algorithmic sensitivity related to misuse of publicly available processed data, in the context of MRI reconstruction: overly optimistic performance due to bias and limited generalization to unprocessed data,” the researchers wrote.

Copyright © 2022 Cami Rosso All rights reserved.
