Abstract
Graph neural networks (GNNs) have emerged as one of the most effective machine learning (ML) techniques for predicting drug effects from drug molecular graphs. Despite their immense potential, GNN models perform poorly on data sets whose targets are high-dimensional, asymmetrically co-occurrent drug effects with complex correlations among them. Training an individual model for each drug effect and then combining the predictions across a wide spectrum of drug effects is impractical. This motivates treating the task as a multi-target prediction problem, in which all drug effects are predicted at once. We develop standard and hybrid GNNs to perform two separate tasks: multi-regression for the continuous values and multi-label classification for the categorical values contained in our data sets. Because the classification setting makes the target data even sparser and introduces asymmetric label co-occurrence, learning multi-label classification models becomes difficult, and GNN performance suffers considerably. To address these challenges, we propose a new data oversampling technique that improves multi-label classification performance on all of the given imbalanced molecular graph data sets. Using this technique, we improve the imbalance ratio of the drug effects while preserving the integrity of each data set. Finally, we evaluate multi-label classification performance using the best-performing hybrid GNN model on all of the data sets oversampled with the proposed technique. This model outperforms other ML models, including GNN models, trained on either the original data sets or data sets oversampled with MLSMOTE (a well-known oversampling technique), by a significant margin on all evaluation metrics: precision, recall, and F1 score.