Predicting Drug Effects from High-dimensional Asymmetric Drug Data Sets using Graph Neural Networks: A Comprehensive Analysis...

by Avishek Bose, Guojing Cong

Publication Type

Conference Paper

Book Title

23rd IEEE International Conference on Machine Learning and Applications (ICMLA 2024)

Publication Date

December, 2024

Page Numbers

1 to 8

Publisher Location

New Jersey, United States of America

Conference Name

2024 IEEE International Conference on Big Data (BigData)

Conference Location

Washington, District of Columbia, United States of America

Conference Sponsor

IEEE

Conference Date

Dec 15, 2024 - Dec 18, 2024

Abstract

Graph neural networks (GNNs) have emerged as one of the most effective machine learning (ML) techniques for predicting drug effects from drug molecular graphs. Despite their potential, GNN models perform poorly on data sets whose targets are high-dimensional, asymmetrically co-occurring drug effects with complex correlations among them. Training a separate model for each drug effect and combining the predictions across a wide spectrum of drug effects is impractical. We therefore treat the challenge as a multi-target prediction problem, aiming to predict all drug effects at once. We develop standard and hybrid GNNs to perform two separate tasks: multi-regression for the continuous values and multi-label classification for the categorical values contained in our data sets. Because the classification task makes the target data even sparser and introduces asymmetric label co-occurrence, learning multi-label classification models becomes difficult and GNN performance suffers heavily. To address these challenges, we propose a new data oversampling technique that improves multi-label classification performance on all the given imbalanced molecular graph data sets. The technique improves the imbalance ratio of the drug effects while preserving the integrity of the data sets. Finally, we evaluate multi-label classification performance using the best-performing hybrid GNN model on all the data sets oversampled with the proposed technique. The results outperform those of other ML models, including GNN models, trained on either the original data sets or data sets oversampled with MLSMOTE (a well-known oversampling technique), by a significant margin on all evaluation metrics: precision, recall, and F1 score.
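To make the notion of a label imbalance ratio concrete, the sketch below computes a per-label ratio as it is commonly defined in the multi-label literature: the count of the most frequent label divided by the count of each label. This is an illustrative assumption; the abstract does not state the exact formula used in the paper, and the function names and the toy label matrix are hypothetical.

```python
from math import inf

def label_counts(Y):
    """Number of positive samples per label; Y is a list of 0/1 rows."""
    return [sum(column) for column in zip(*Y)]

def imbalance_ratios(Y):
    """Per-label ratio: count of the most frequent label / count of this label.

    A label that never occurs gets an infinite ratio. This definition is an
    assumption for illustration, not necessarily the one used in the paper.
    """
    counts = label_counts(Y)
    max_count = max(counts)
    return [max_count / c if c > 0 else inf for c in counts]

def mean_imbalance_ratio(Y):
    """Average of the finite per-label ratios (inf if no label occurs)."""
    finite = [r for r in imbalance_ratios(Y) if r != inf]
    return sum(finite) / len(finite) if finite else inf

# Toy label matrix: 4 samples (rows), 3 hypothetical drug effects (columns).
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
print(imbalance_ratios(Y))      # [1.0, 2.0, 4.0]
print(mean_imbalance_ratio(Y))
```

Under this definition a ratio of 1.0 marks the most frequent label, and an oversampling method that adds samples for rare labels should move the remaining ratios toward 1.0.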
