Performance Comparison of Data Sampling Techniques to Handle Imbalanced Class on Prediction of Compound-Protein Interaction

The prediction of Compound-Protein Interactions (CPI) is an essential step in drug-target analysis, both for developing new drugs and for drug repositioning. One challenging issue in this field is that there are commonly far more non-interacting compound-protein pairs than interacting pairs. This imbalance causes bias, which may degrade the prediction of CPI. Moreover, little research on CPI prediction currently compares data sampling techniques for handling the class imbalance problem. To address this issue, we compare four data sampling techniques, namely Random Under-sampling (RUS), Combination of Over-Under-sampling (COUS), Synthetic Minority Over-sampling Technique (SMOTE), and Tomek Link (T-Link). The benchmark CPI datasets Nuclear Receptor and G-Protein Coupled Receptor (GPCR) are used to test these techniques. Area Under Curve (AUC) is applied to evaluate the CPI prediction performance of each technique. Results show that the AUC values for RUS, COUS, SMOTE, and T-Link are 0.75, 0.77, 0.85 and 0.79 respectively on Nuclear Receptor data and 0.70, 0.85, 0.91 and 0.72 respectively on GPCR data. These results indicate that SMOTE has the highest AUC values. Furthermore, we found that the SMOTE technique is more capable of handling class imbalance problems in CPI prediction than the other three techniques.


INTRODUCTION
The identification of Compound-Protein Interaction (CPI) plays a key role in the development of drugs, particularly herbal medicines. The great advances in molecular medicine and the human genome project provide more opportunities to discover unknown associations in the CPI network. Newly discovered interactions can help in finding new drugs by screening candidate compounds and are also essential for understanding the causes of side effects in existing drugs (Mei et al., 2013; Hong et al., 2017). Recently, computational models, including deep learning techniques, have been developed for predicting potential compound-protein interactions (Tsubaki et al., 2019).
However, at this moment, only a few studies are available for understanding the interaction between compounds and proteins. For example, the PubChem and ChEMBL databases store 90 million drug candidate compound records, but the known interactions between these compounds and protein targets are still limited (Mendez et al., 2019). Computational methods for predicting CPI are thus essential in drug or herbal medicine studies. Such methods can reduce the time, cost, and failure rate of discovering new drugs or herbal medicines (Kim et al., 2013).
To address the above issue, some studies on CPI prediction have been conducted by the Biopharmaca Research Center in Bogor, Indonesia. The Indonesia Jamu Herbs (IJAH) webserver was developed by the Biopharmaca Research Center to predict the efficacy of herbal or drug formulas for various diseases using a multicomponent-multitarget network that consists of plant-compound interaction, compound-protein interaction, and protein-disease association networks (Masri & Kusuma, 2018). Many medicinal properties of herbal formulas cannot be predicted by IJAH due to a lack of CPI data. To solve this problem, a previous study by Kurnia (2017) predicted CPI in IJAH by using the Bipartite Local Model-Neighbor Interaction profile Inferring (BLMNII). BLMNII has a good ability to predict interactions for new compound or new protein data, which have only non-interacting pairs (Kurnia, 2017). BLMNII has also been applied to other network prediction problems, namely LncRNA-Disease Associations (Cui et al., 2019) and Biomedical Bipartite Networks (Zhang et al., 2020). However, the study by Kurnia (2017) did not solve the class imbalance problem in the prediction of CPI. When an algorithm is built without regard to data balance, the prediction might be biased towards the majority class while ignoring the minority class (Chawla, 2003).

AKHMAD REZKI PURNAJAYA et al.
Biogenesis 42
To overcome the imbalanced class in CPI, one study compared CPI prediction performance using Random Under-sampling (RUS) and Balanced Sampling techniques (Mousavian et al., 2016). The experiments of Mousavian et al. (2016) showed that the RUS technique gives better results than Balanced Sampling. Ezzat et al. conducted another relevant study in 2016 by evaluating CPI prediction using the Synthetic Minority Oversampling Technique (SMOTE) combined with a Decision Tree. The Decision Tree initially showed lower performance in predicting CPI than the Support Vector Machine (SVM), but the study demonstrated that SMOTE implemented with a Decision Tree had better prediction performance than SVM alone. In addition, an experiment has shown that Tomek-Link (T-Link) can improve performance in the classification of arterial blood pressure and Ecoli2 datasets (Elhassan et al., 2017). Based on those three studies, we conclude that RUS, SMOTE, and T-Link are suitable sampling techniques for handling the imbalanced class in CPI.
Besides using the sampling techniques mentioned above, we implement a Combination of Over-Under-sampling (COUS) technique to handle the class imbalance problem in CPI prediction. COUS balances the class distribution by increasing the amount of minority class data (oversampling) and reducing the amount of majority class data (undersampling). However, after the CPI matrix has been balanced by a data sampling technique, it might have missing values of the interacting class caused by duplication or reduction. To overcome this problem, we use k-Nearest Neighbors (k-NN) to impute missing values. This approach can be easily adjusted to work with any attribute as a class, using only distance metrics to modify attributes. It can also efficiently treat examples with multiple missing values (Batista & Monard, 2002).
This study used two Yamanishi datasets (i.e., Nuclear Receptor and G-Protein Coupled Receptor), which are common benchmark datasets for CPI prediction. We compare four data sampling techniques, i.e., RUS, Combination of Over-Under-sampling (COUS), SMOTE, and Tomek Link (T-Link), to see how effectively each technique handles the class imbalance problem in CPI prediction. To handle missing values that arise when sampling the data, we implemented k-Nearest Neighbour imputation. As the CPI prediction method, we use the Bipartite Local Model (BLM), which was first introduced by Bleakley & Yamanishi (2009) and later improved by combining BLM with Hubness-Aware Regression in Buza & Peška (2017). BLM creates two local models using SVM as a classifier. The CPI prediction results of the BLM method are then evaluated using the Area Under Curve (AUC) and the Receiver Operating Characteristic (ROC) (Sonego et al., 2008). AUC is a numerical measure for differentiating model performance and can be employed to show how well the model's ranking separates positive and negative observations. AUC has proven to be a reliable performance measure for class imbalance problems (Fawcett, 2004).

MATERIALS AND METHODS
Datasets. This study used two of the four Yamanishi datasets, Nuclear Receptor and G-Protein Coupled Receptor (GPCR), which are benchmark datasets for CPI prediction (Yamanishi et al., 2008). These datasets were downloaded from http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/. The Nuclear Receptor dataset consists of 54 compounds, 26 proteins, and 1404 compound-protein pairs, comprising 1314 non-interacting and 90 known interacting pairs. The GPCR dataset consists of 223 compounds, 95 proteins, and 21185 compound-protein pairs, comprising 20550 non-interacting and 635 interacting pairs.
Data sampling techniques. To see how effectively several techniques solve the class imbalance problem, we compare four data sampling techniques: RUS, COUS, SMOTE, and T-Link, which are discussed in the following paragraphs.
In RUS, data from the class with a large number of instances (majority class) are removed randomly. The selection and removal processes are repeated until the size of the majority class equals that of the minority class (Mousavian et al., 2016).
In COUS, the difference between the sizes of the majority class and the minority class is calculated first. Next, data of the majority class are removed at random. After that, data of the minority class are duplicated at random. The amounts removed and duplicated are determined from the calculated difference.
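The two procedures above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function names are hypothetical, and for COUS the sketch assumes that half of the class-size difference is removed from the majority class and the other half is added to the minority class by duplication, since the exact amounts are not given in the text.

```python
import random

def random_under_sample(majority, minority, seed=0):
    """RUS: randomly keep only as many majority items as there are minority items."""
    rng = random.Random(seed)
    kept = rng.sample(majority, len(minority))
    return kept, list(minority)

def combined_over_under_sample(majority, minority, seed=0):
    """COUS sketch: split the class-size difference d between removal and duplication.

    Assumption (not from the paper): d // 2 majority items are removed and
    d - d // 2 minority items are duplicated.
    """
    rng = random.Random(seed)
    d = len(majority) - len(minority)
    n_remove = d // 2
    n_duplicate = d - n_remove
    kept = rng.sample(majority, len(majority) - n_remove)
    duplicates = [rng.choice(minority) for _ in range(n_duplicate)]
    return kept, list(minority) + duplicates

# Class sizes of the Nuclear Receptor dataset: 1314 non-interacting, 90 interacting pairs.
majority = [("neg", i) for i in range(1314)]
minority = [("pos", i) for i in range(90)]

rus_major, rus_minor = random_under_sample(majority, minority)
cous_major, cous_minor = combined_over_under_sample(majority, minority)
print(len(rus_major), len(rus_minor))    # both classes reduced to the minority size
print(len(cous_major), len(cous_minor))  # both classes meet in the middle
```

Under this assumed split, RUS leaves 90 pairs per class, whereas COUS leaves 702 pairs per class, so COUS discards less of the majority class at the cost of duplicated minority pairs.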
SMOTE works by creating synthetic data, i.e., data replicated from the minority class. The method searches for the k nearest neighbors (k-NN) of every data point in the minority class. It then generates synthetic data, as many as the desired duplication percentage, between each minority data point and randomly chosen neighbors among its k-NN. SMOTE is known to avoid overfitting when synthesizing minority class data (Chawla et al., 2002). An illustration of SMOTE is given by Hu & Li (2013).
The T-Link algorithm was defined as a refinement of the Condensed Nearest Neighbor (CNN) technique, where CNN chooses a subset from all classes using One-Nearest Neighbor (1-NN). T-Link only reduces data of the majority class identified by 1-NN, because reducing the minority class further would increase the probability of misclassification later. For example, a minority class sample x_i and a majority class sample x_j form a T-Link pair and generate the sample x_k. The new x_j is then reduced by x_k (Elhassan et al., 2017).
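As a rough illustration of both ideas, the sketch below generates SMOTE-style synthetic points by interpolating between a minority point and one of its nearest minority neighbors, and detects Tomek links as opposite-class pairs that are each other's nearest neighbor. The function names, the Euclidean distance, and the toy coordinates are assumptions made for the example; they are not taken from the study.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority points and their k nearest neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, of opposite-class points that are mutual 1-NN."""
    n = len(points)
    nearest = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        nearest.append(min(others, key=lambda j: math.dist(points[i], points[j])))
    return [(i, j) for i, j in enumerate(nearest)
            if i < j and nearest[j] == i and labels[i] != labels[j]]

# Toy data: label 1 is the minority (interacting) class, label 0 the majority class.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.2, 1.0), (5.0, 5.0)]
labels = [1, 0, 0, 0, 1]

links = tomek_links(points, labels)
# Only the majority-class member of each link is removed.
drop = {i if labels[i] == 0 else j for i, j in links}
cleaned = [p for idx, p in enumerate(points) if idx not in drop]

synthetic = smote([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], n_new=4)
print(links, cleaned, synthetic)
```

In the toy data, the first two points form the only Tomek link, so the majority point at index 1 is dropped while the minority point is kept.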
Missing data imputation. CPI prediction can only run if every compound-protein pair already has an interaction class. Therefore, data imputation is needed to fill the NA values in the CPI matrix. Data imputation is a technique for estimating the value of missing data from the patterns of data that have complete features (Batista & Monard, 2002).
In this study, we use k-NN imputation to fill in the missing interaction classes, following these steps. First, the data are loaded and the value of k for k-NN is initialized. Then, for each missing interaction class, the distance between the test row and each row of the training data is calculated. Here, we use the Gower distance as our distance metric. The calculated distances are sorted in ascending order, the top k rows are taken from the sorted array, and the most frequent class among these rows is obtained. Finally, the missing interaction class is filled with this predicted class.
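The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the feature layout (two numeric features and one categorical feature), and the toy rows are hypothetical, and the Gower distance is implemented in its common form (range-scaled absolute difference for numeric features, 0/1 mismatch for categorical ones, averaged over features).

```python
from collections import Counter

def gower_distance(a, b, ranges):
    """Mean per-feature dissimilarity between rows a and b.

    ranges[i] is the value range of numeric feature i, or None if feature i is categorical.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:                       # categorical: 0 if equal, 1 otherwise
            total += 0.0 if x == y else 1.0
        else:                               # numeric: absolute difference scaled by range
            total += abs(x - y) / r if r else 0.0
    return total / len(ranges)

def knn_impute(train_rows, train_classes, query_rows, ranges, k=3):
    """Predict the missing class of each query row by majority vote among its k nearest rows."""
    predicted = []
    for q in query_rows:
        ranked = sorted(zip(train_rows, train_classes),
                        key=lambda rc: gower_distance(q, rc[0], ranges))
        top_classes = [c for _, c in ranked[:k]]
        predicted.append(Counter(top_classes).most_common(1)[0][0])
    return predicted

# Toy rows with a known interaction class (1 = interacting, 0 = non-interacting).
train = [(0.1, 1.0, "a"), (0.2, 1.1, "a"), (0.9, 3.0, "b"), (1.0, 2.9, "b"), (0.15, 1.2, "a")]
classes = [1, 1, 0, 0, 1]
ranges = [0.9, 2.0, None]

# Rows whose interaction class is missing.
missing = [(0.12, 1.05, "a"), (0.95, 3.0, "b")]
imputed = knn_impute(train, classes, missing, ranges, k=3)
print(imputed)
```

Sorting the full training set for each query is simple but scales poorly; for a matrix the size of the GPCR dataset, a partial selection such as `heapq.nsmallest` would do the same job with less work.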
Prediction. We use the Bipartite Local Model (BLM) algorithm with an SVM classifier to predict CPI. The BLM algorithm was first proposed by Bleakley & Yamanishi (2009) and has recently been shown to be effective in predicting CPI. For a given compound-protein pair, the algorithm trains two local models with SVM, one on the compound side and one on the protein side of the bipartite network, and combines their outputs into a single prediction.
We use 10-fold cross-validation to evaluate the performance of SVM on BLM. Cross-validation is one of the methods used to measure the stability of SVM in predicting testing data.
To measure CPI prediction performance, the ROC curve is visualized, as shown in Figure 2. If the curve tends towards the upper left corner, the CPI prediction can be considered to have handled the class imbalance problem, because it classifies both the positive class and the negative class data precisely. Conversely, if the curve is close to the baseline, i.e., the line from the (0, 0) point to the (1, 1) point, the data are not well classified because of the class imbalance (Sonego et al., 2008).
Figure 2. A basic ROC curve (Sonego et al., 2008)
The ROC parameters measured in this study were sensitivity, specificity, and AUC. Sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP), so both values can be calculated from a confusion matrix. This table consists of TP (True-Positive), FP (False-Positive), FN (False-Negative), and TN (True-Negative) parts. After obtaining the sensitivity and specificity values, we calculated the AUC and accuracy values. The ROC curve is made by plotting sensitivity on the y-axis against 1 - specificity (the false-positive rate) on the x-axis, as shown in Figure 2.
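As a small worked example of these measures, the sketch below computes sensitivity and specificity from the confusion-matrix counts and computes AUC in its rank-based form, i.e., the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair. The function names, the scores, and the 0.5 threshold are illustrative assumptions, not values from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) computed from the confusion-matrix counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """Rank-based AUC: fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions: 2 interacting pairs and 4 non-interacting pairs.
y_true = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold of 0.5

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
print(sens, spec, area)  # 0.5 0.75 0.875
```

Unlike sensitivity and specificity, the AUC does not depend on the chosen threshold, which is one reason it is a convenient summary for imbalanced data.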
After the prediction performance is obtained, the ratio of positive data (interacting data class) can be calculated to see the percentage increase in the ratio of positive data between training data and prediction data:
ratio of positive data = n1 / (ns × np)
where n1 is the number of interacting pairs, ns is the number of compounds, and np is the number of proteins (Harris, 1967).

RESULTS AND DISCUSSION
Figures 3 and 4 show the evaluation results, in terms of the ROC parameters, of CPI prediction by BLM with each data sampling technique (RUS, COUS, SMOTE, and T-Link) on two Yamanishi datasets, i.e., Nuclear Receptor and GPCR. It can be seen in Figures 3 and 4 that the CPI prediction evaluation gives different AUC values on the two Yamanishi datasets. On the Nuclear Receptor dataset, the AUC values using the RUS, COUS, SMOTE, and T-Link sampling techniques are 0.77, 0.75, 0.85, and 0.79, respectively, as in Figure 3.
We also compare the AUC values from the original data with the AUC values of the four data sampling techniques. As shown in Table 1, the AUC value on the Nuclear Receptor dataset for SMOTE is 0.05 higher than that of the imbalanced data, whereas on the GPCR dataset, the AUC values for RUS and SMOTE are 0.11 and 0.18 higher than that of the imbalanced data.
In addition, to compare the AUC values of CPI prediction for each sampling technique, we display the ROC curves, which visualize the performance of each data sampling technique for CPI prediction. Figure 5 shows the ROC curves of the predicted CPI on the Nuclear Receptor dataset for each sampling technique, and Figure 6 shows the ROC curves of CPI prediction on the GPCR dataset. It can be inferred from Figures 5 and 6 that the ROC curve of the SMOTE sampling technique is closer to the (0, 1) point than the ROC curves of the other data sampling techniques.
SMOTE can find new interacting pairs in CPI. This is evidenced by the increase in the ratio of positive data of 16.2% in the Nuclear Receptor dataset and 18.6% in the GPCR dataset, as shown in Table 2.
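For reference, the ratio of positive data can be computed directly from the dataset sizes. The sketch below assumes that the ratio is the number of interacting pairs divided by the number of all possible compound-protein pairs (compounds times proteins); the function name is hypothetical, and the values shown are for the original, unsampled training data only.

```python
def positive_ratio(n_interacting, n_compounds, n_proteins):
    """Assumed definition: interacting pairs divided by all compound-protein pairs."""
    return n_interacting / (n_compounds * n_proteins)

# Sizes of the two Yamanishi datasets as reported in the Datasets paragraph.
nuclear_receptor = positive_ratio(90, 54, 26)   # 90 of 1404 pairs interact
gpcr = positive_ratio(635, 223, 95)             # 635 of 21185 pairs interact

print(f"Nuclear Receptor: {nuclear_receptor:.1%}")
print(f"GPCR: {gpcr:.1%}")
```

Applying the same function to the predicted interaction matrix and comparing it with these baseline values gives the change in the ratio of positive data.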

CONCLUSION
We used four data sampling techniques, RUS, COUS, SMOTE, and T-Link, to balance the number of known interacting and non-interacting compound-protein pairs. In our experiments, the SMOTE method demonstrated better prediction performance than the RUS, COUS, and T-Link techniques when 10-fold cross-validation was used. We also conclude that the COUS and T-Link methods are unable to increase CPI prediction performance for an imbalanced class problem. Our experimental results show that SMOTE has the highest AUC values, indicating that it is reliable in sampling data and predicting interactions for new compound or new protein data. In the future, the SMOTE technique could be applied not only to CPI prediction but also to drug-target interaction prediction, which also has a class imbalance problem. This technique can provide more information about new drugs and detect new targets for drug repositioning.