The pharmaceutical industry is expected to spend more than $3 billion on artificial intelligence by 2025, up from $463 million in 2019. AI clearly adds value, but advocates say it has not yet lived up to its potential.
There are many reasons why reality may not match the hype, but limited data sets are a big one.
With vast amounts of data collected every day – from step counts to electronic medical records – data scarcity is one of the last barriers one might expect.
The traditional big data/AI approach uses hundreds or even thousands of data points to characterize something like a human face. For the training to be reliable, thousands of data sets are required so that the AI can recognize a face regardless of gender, age, race, or medical condition.
For facial recognition, examples are readily available. Drug development is a completely different story.
“When you imagine all the different ways you can modify a drug…the dense amount of data covering the full range of possibilities is less plentiful,” Adityo Prakash, co-founder and CEO of Verseon, told BioSpace.
“Small changes make a big difference in what a drug does inside our bodies, so you really need detailed data on all kinds of possible changes.”
That would require millions of data points – which, Prakash said, even the biggest pharmaceutical companies don’t have.
Limited predictive capabilities
He went on to say that AI can be very useful when the “rules of the game” are known, citing protein folding as an example. Protein folding works the same way across species, so known examples can be leveraged to predict the likely structure of a functional protein, because biology follows consistent rules.
Drug design, by contrast, involves entirely new compounds and is less amenable to AI “because you don’t have enough data to cover all the possibilities,” Prakash said.
Even when data sets are used to make predictions about similar things, such as small-molecule interactions, the predictions are limited. That, he said, is because negative data – which is important for AI predictions – generally goes unpublished.
In addition, “much of what is published cannot be reproduced.”
Small data sets, questionable data, and a lack of negative data combine to limit AI’s predictive capabilities.
Too much noise
Noise within the large datasets that are available is another challenge. Jason Rolfe, co-founder and CEO of Variational AI, said PubChem, one of the largest public databases, contains more than 300 million bioactivity data points from high-throughput screens.
“However, this data is unbalanced and noisy,” he told BioSpace. “Typically, more than 99% of the compounds tested are inactive.”
Of the less than 1% of compounds that appear active in a high-throughput screen, Rolfe said, the vast majority are false positives, caused by aggregation, assay interference, reactivity, or contamination.
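The math behind that claim is worth spelling out: when true actives are rare, even a small assay false-positive rate swamps the real hits. A quick back-of-envelope sketch (the numbers below are hypothetical illustrations, not figures from the article, apart from the roughly 99% inactive rate):

```python
# Back-of-envelope: why most screening "hits" are false positives when
# actives are rare. Assumed for illustration: 0.5% of compounds are truly
# active, the assay detects 90% of true actives (sensitivity), and it
# wrongly flags 2% of inactives (false-positive rate).

def hit_breakdown(n_compounds, active_rate, sensitivity, fp_rate):
    true_actives = n_compounds * active_rate
    inactives = n_compounds - true_actives
    true_hits = true_actives * sensitivity   # real actives the assay catches
    false_hits = inactives * fp_rate         # inactives wrongly flagged
    precision = true_hits / (true_hits + false_hits)
    return true_hits, false_hits, precision

true_hits, false_hits, precision = hit_breakdown(1_000_000, 0.005, 0.90, 0.02)
print(f"true hits:  {true_hits:,.0f}")    # 4,500
print(f"false hits: {false_hits:,.0f}")   # 19,900
print(f"precision:  {precision:.1%}")     # ~18% — most hits are spurious
```

Under these assumptions, fewer than one in five screening hits is a genuine active – which is the kind of label noise an AI model trained on raw screening data inherits.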
X-ray crystallography, which determines the precise spatial arrangement of a ligand and its protein target, can be used to train AI for drug discovery. But despite great strides in predicting crystal structures, the protein distortions induced by drug binding still cannot be predicted well.
Similarly, molecular docking (which mimics the binding of drugs to target proteins) is notoriously imprecise, Rolfe said.
“The correct spatial arrangements of a drug and its protein target are predicted accurately only about 30% of the time, and predictions of pharmacological activity are less reliable.”
Given the enormous number of possible drug-like molecules, even AI algorithms that can accurately predict ligand–protein binding face a daunting challenge.
“This entails working against the primary target without disrupting tens of thousands of other proteins in the human body, lest it cause side effects or toxicity,” said Rolfe. Currently, AI algorithms are not up to the task.
He recommended physics-based models of drug-protein interactions to improve accuracy, but noted that they are computationally intensive, requiring about 100 hours of CPU time per drug, which may limit their usefulness when screening large numbers of molecules.
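To see why 100 CPU-hours per compound is prohibitive at screening scale, a rough calculation helps (the per-compound figure is from the article; the library and cluster sizes below are assumed for illustration, and perfect parallelism is assumed):

```python
# Rough wall-clock cost of physics-based screening at ~100 CPU-hours per
# compound (the article's figure). Library and cluster sizes are assumed.

CPU_HOURS_PER_COMPOUND = 100

def wall_clock_days(n_compounds, n_cpus):
    """Days to screen a library, assuming perfect parallel scaling."""
    total_cpu_hours = n_compounds * CPU_HOURS_PER_COMPOUND
    return total_cpu_hours / n_cpus / 24

# Screening a 1-million-compound library on a 10,000-core cluster:
print(f"{wall_clock_days(1_000_000, 10_000):,.1f} days")  # ~416.7 days
```

Even with a sizable cluster, a modest library takes over a year – hence the appeal of using physics simulations selectively, or as a source of training data, rather than as a brute-force screen.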
Still, physics-based simulation is a step toward overcoming AI’s current limitations, Prakash noted.
“They can give you virtually generated data on how two things interact. However, physics-based simulations won’t give you insight into degradation inside the body.”
Another challenge is siloed data systems and disconnected datasets.
“Many facilities still use paper batch records, so useful data is not… readily available electronically,” Moira Lynch, senior innovation leader on Thermo Fisher Scientific’s bioprocessing team, told BioSpace.
Compounding the challenge, “the data available electronically is from different sources and in disparate formats and stored in disparate locations.”
According to Jaya Subramaniam, Head of Life Sciences Products and Strategy at Definitive Healthcare, these datasets are also limited in their scope and coverage.
She said the two main reasons are fragmented data and de-identified data. “No single entity has a complete collection of any one type of data, whether that’s claims, electronic medical records/electronic health records, or lab diagnoses.”
Furthermore, patient privacy laws require data to be de-identified, making it difficult to track an individual’s journey from diagnosis to final outcome. Pharmaceutical companies are then hampered by the slow pace of insights.
Despite the availability of unprecedented amounts of data, relevant and usable data remains scarce. Only when these obstacles are overcome can the power of artificial intelligence in drug development be truly unleashed.