SNP site-drug association prediction algorithm based on denoising variational auto-encoder

2022-09-19

Journal of Measurement Science and Instrumentation, 2022, Issue 3

SONG Xiaoyu, FENG Xiaobei, ZHU Lin, LIU Tong, WU Hongyang, LI Yifan

(1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China; 2. Lanzhou Blue Whale Information Technology Co., Ltd., Lanzhou 730070, China)

Abstract: Single nucleotide polymorphism (SNP) is an important factor in the study of genetic variation in human families and in animal and plant strains, and it is therefore widely used in population genetics and in the study of disease-related genes. In pharmacogenomics research, identifying the association between SNP site and drug is the key to clinical precision medication. Therefore, a prediction model of SNP site-drug association based on a denoising variational auto-encoder and a support vector machine (DVAE-SVM) is proposed. Firstly, the k-mer algorithm is used to construct the initial feature vector of the SNP site, and the MACCS molecular fingerprint is introduced to generate the feature vector of the drug molecule. Then, the DVAE is used to extract effective features from the initial feature vector of the SNP site. Finally, the effective feature vector of the SNP site and the feature vector of the drug molecule are fused and input to the support vector machine (SVM) to predict the association between SNP site and drug. The results of five-fold cross-validation experiments indicate that the proposed algorithm performs better than random forest (RF) and logistic regression (LR) classifiers. Further experiments show that, compared with the feature extraction algorithms of principal component analysis (PCA), denoising auto-encoder (DAE) and variational auto-encoder (VAE), the proposed algorithm gives better prediction results.

Key words: association prediction; k-mer; molecular fingerprinting; support vector machine (SVM); denoising variational auto-encoder (DVAE)

0 Introduction

Single nucleotide polymorphism (SNP) refers to a DNA sequence polymorphism caused by a single nucleotide variation at the genome level. It is the most common type of heritable variation in humans, accounting for more than 90% of all known polymorphisms. Pharmacogenomics studies have shown that the variation at an SNP site is carried over to the drug target through the expression and translation of the gene, thereby affecting the binding of the drug to the target and changing the metabolism and efficacy of the drug in the body[1]. The association between SNP site and drug plays an important role in drug screening[2], drug response, and adverse drug reactions[3].

At this stage, the association between SNP site and drug is mainly verified through experiments. For example, Zhang et al.[4] studied the effect of CYP2C19 gene polymorphism on the efficacy of clopidogrel in the treatment of ischemic cerebral infarction. When 90 patients took the same dose of clopidogrel, the patients with *2A alleles had a higher recurrence rate of the disease than those with other genotypes, and the rate in homozygotes was higher than that in heterozygotes. Yang et al.[5] studied the effect of CYP2D6*10 gene polymorphism on the efficacy of metoprolol in the treatment of hypertension by giving the same dose of metoprolol to patients of the wild-type homozygous CC type, the mutant heterozygous CT type, and the mutant homozygous TT type. The results showed that the blood concentration of patients with CC type increased significantly after treatment, and that there was no difference in blood pressure values between the patients with CT type and TT type before and after taking the drug. In summary, traditional experimental methods can only verify the association between one SNP site and one drug at a time[6-8]. As the numbers of SNP sites and drug types continue to increase, it is becoming more and more difficult to identify the associations between them through biological experiments alone. With the rise of bioinformatics, computational methods represented by artificial intelligence and deep learning algorithms provide new possibilities for predicting the association between SNP site and drug.

At present, computational methods have achieved some results in the prediction of drug-target interactions. Ivanciuc et al.[9] applied linear and nonlinear machine learning algorithms to quantitative structure-activity relationships, generating predictive models for ligand binding to biological receptors and studying the interaction between drugs and their targets. Cheng et al.[10] analyzed and docked the molecular surface to characterize the molecular interaction between enzymes and various ligands. The results demonstrated the important role of this molecular target in structure-based drug design. Ezzat et al.[11] proposed an oversampling ensemble algorithm based on decision trees, which accounts for the large difference between the numbers of positive and negative samples in drug-target data and thus deals effectively with both between-class and within-class imbalance. Wang et al.[12] proposed a drug-target relationship prediction method based on rotation forest, an improvement on random forest, which also improved the prediction accuracy. Peng et al.[13] proposed a drug-target interaction prediction model based on local global consistency (LLGC) learning. The model comprehensively considers the global and local characteristics of the target and drug data, integrates the sequence similarity of the target and the topological structure information of the drug target, and mines the drug-target interaction data based on the standard datasets.

Next-generation sequencing technology has accelerated the discovery of SNP site and other mutations. A large number of SNP sites have brought tremendous pressure to the verification of the association relationship between SNP site and drug. In this study we propose an algorithm to predict the association relationship between SNP site and drug based on denoising variational auto-encoder. The effective features of SNP sites are extracted by denoising variational auto-encoder, and support vector machine (SVM) classifiers are used to predict the association relationship between SNP site and drug. The experimental results show that this model has high accuracy and precision in predicting the association relationship between SNP site and drug.

1 Related theories

1.1 Variational auto-encoder (VAE)

VAE is a directed probabilistic graphical model proposed by Kingma et al.[14] in 2013. It consists of two parts: an encoder and a decoder. The encoder encodes the original input data $x=\{x_1,x_2,\dots,x_n\}$ to generate the variational probability distribution of the latent variable $z$; the decoder restores the approximate probability distribution of the original data from the variational probability distribution of the generated latent variable $z$. The working principle is shown in Fig.1.

Suppose there is a family of functions $p_\theta(x|z)$ used to generate $x$ from $z$, each uniquely determined by $\theta$. The goal of the VAE is to optimize $\theta$ so that, under sampling of $z$, the probability $p(x)$ of generating $x$ is maximized. According to the Bayesian formula, $p(x)$ can be expressed as

(1)

In order to obtain samples of $z$, the VAE introduces the posterior $p_\theta(z|x)$. Since the true posterior distribution $p_\theta(z|x)$ is intractable, the VAE uses the recognition model $q_\varphi(z|x)$ to approximate it as

(2)

Given an inference model $q_\varphi(z|x)$, the loss function of the VAE can be written as

$$L(\theta,\varphi;x)=E_{q_\varphi(z|x)}[\log p_\theta(x|z)]-D_{KL}[q_\varphi(z|x)\,\|\,p_\theta(z)], \tag{3}$$

where $E_{q_\varphi(z|x)}[\log p_\theta(x|z)]$ is the reconstruction term, which makes the generated data as close as possible to the original data, and $D_{KL}[q_\varphi(z|x)\,\|\,p_\theta(z)]$ is a regularization term, the Kullback-Leibler divergence, which measures the difference between the recognition distribution and the prior distribution of the latent variable.
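The two terms of the VAE loss can be illustrated numerically. The following is a minimal sketch, assuming (this is not stated above) a diagonal Gaussian recognition model, a standard normal prior on $z$, and a Bernoulli decoder; all function names are hypothetical.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL[N(mu, diag(exp(log_var))) || N(0, I)]."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """log p(x|z) for a Bernoulli decoder with outputs x_hat in (0, 1)."""
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)  # clip to avoid log(0)
        total += xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total

def elbo(x, x_hat, mu, log_var):
    """Single-sample estimate: reconstruction term minus KL term."""
    return bernoulli_log_likelihood(x, x_hat) - kl_to_standard_normal(mu, log_var)
```

In training, the negative of this quantity would be minimized; the KL term vanishes exactly when the recognition distribution equals the prior.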

1.2 Denoising variational auto-encoder (DVAE)

Although the VAE has good capabilities for feature learning and data reconstruction, it is still affected by noise. In order to improve the robustness and generalization ability of the VAE, Im et al.[15] proposed the denoising variational auto-encoder (DVAE) in 2016. It adds noise to the training data of the VAE, so that the model learns both data features and noise features and outputs data that are as free of noise as possible, thereby enhancing the robustness of the features extracted by the VAE against noise. Fig.2 shows the structure of the DVAE model.

Fig.2 Model structure diagram of DVAE

(4)

Then the mixed Gaussian distribution is

(5)

(6)

From

$$p_\theta(x,z)=p_\theta(x|z)p(z), \tag{7}$$

we can get

(8)

The denoising variational lower bound is

(9)

So the maximum lower bound is

(10)

The training process of DVAE neural network and the calculation of loss function are similar to those of VAE, so this article will not elaborate on it.
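The difference between DVAE and VAE training can be sketched as follows: the encoder receives a corrupted input, while the reconstruction error is measured against the clean input. This is a minimal illustration, assuming additive Gaussian corruption (the type of noise is not specified above) and a squared reconstruction error; `encoder` and `decoder` are hypothetical callables standing in for the trained networks.

```python
import math
import random

def corrupt(x, noise_std=0.1, rng=random):
    """Corruption step: add zero-mean Gaussian noise to each component."""
    return [xi + rng.gauss(0.0, noise_std) for xi in x]

def sample_latent(mu, log_var, rng=random):
    """Reparameterization: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def dvae_forward(x, encoder, decoder, noise_std=0.1, rng=random):
    """One forward pass: encode the noisy input, reconstruct the clean one."""
    x_noisy = corrupt(x, noise_std, rng)
    mu, log_var = encoder(x_noisy)      # the encoder only sees corrupted data
    z = sample_latent(mu, log_var, rng)
    x_hat = decoder(z)
    # The error is taken against the clean x, not against x_noisy.
    recon_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return x_noisy, z, x_hat, recon_error
```

With `noise_std=0` the corruption step is the identity and the pass reduces to that of an ordinary VAE.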

1.3 SVM

SVM is one of the most widely used classification methods[16]. It minimizes the structural risk and the classification risk by establishing the optimal hyperplane in a high-dimensional feature space, and it has the advantages of high efficiency and high precision. With its excellent learning and generalization capabilities, SVM has become a research hotspot in the field of data mining and machine learning. In recent years, it has also been widely used in the field of bioinformatics, for example to predict protein subcellular localization[17] and to recognize protein roles[18].

Given a training dataset $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$, where $x_i\in R^n$, $y_i\in\{+1,-1\}$, $i=1,2,\dots,n$, $x_i$ is the $i$th feature vector and $y_i$ is the class label. When $y_i$ is +1, the sample is a positive example; when $y_i$ is -1, it is a negative example. If these samples are linearly separable, an optimal hyperplane needs to be found. The model can be expressed as

$$f(x)=w\cdot x+b, \tag{11}$$

where $w$ is the weight vector, $x$ is the input vector, and $b$ is the bias term.

The SVM finds the optimal hyperplane by solving the following optimization problem:

$$\min_{w,b}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.}\ y_i(w\cdot x_i+b)\ge 1,\ i=1,2,\dots,n. \tag{12}$$

Introducing the Lagrange function, the above formula can be transformed into

(13)

where $a_i>0$ is the Lagrange coefficient, $i=1,2,\dots,n$.

(14)

where $x_r$ and $x_s$ are any pair of support vectors from the two classes, and the function of the optimal hyperplane is calculated as

(15)

In order to avoid over-fitting of the SVM model, a slack variable $\xi$ needs to be introduced to allow a small number of samples to be misclassified, thereby improving the prediction accuracy of the model.

In practical problems, the classification and modeling of nonlinear samples requires the kernel function $K(x_i,x_j)=\varphi(x_i)\cdot\varphi(x_j)$ to map the data to a high-dimensional feature space. In this case, the decision function is

$$f(x)=\mathrm{sgn}(w^*\cdot\varphi(x)+b^*)=\mathrm{sgn}\left(\sum_{i=1}^{n}a_i^*y_iK(x_i,x)+b^*\right). \tag{16}$$
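The kernel form of the decision function, in which a new sample is compared with the support vectors through $K$, can be sketched as below. The RBF kernel and the value `gamma=0.5` are illustrative assumptions, and the support vectors, coefficients and bias would come from a trained model; the function names are hypothetical.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_decision(x, support_vectors, labels, alphas, bias, gamma=0.5):
    """Return sgn(sum_i a_i * y_i * K(x_i, x) + b) as +1 or -1."""
    score = bias
    for sv, y, a in zip(support_vectors, labels, alphas):
        score += a * y * rbf_kernel(sv, x, gamma)
    return 1 if score >= 0 else -1
```

Only samples with a nonzero coefficient contribute to the sum, which is why the classifier depends on the support vectors alone.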

2 SNP site-drug association prediction algorithm based on DVAE

2.1 Algorithm flow

The flow chart of the proposed SNP site-drug association prediction algorithm based on the DVAE is shown in Fig.3.

2.2 Feature representation

SNP site sequence information and drug molecular structure information are stored in the database in the form of characters, and cannot be used directly in deep learning algorithms. In this study, we first use the k-mer algorithm and MACCS molecular fingerprinting to convert the SNP site sequence and the drug molecular structure into numerical representations. Then, we use the DVAE model to extract features of the SNP site. Finally, we use the SVM classifier to make the association prediction.

2.2.1 Digital characterization of SNP site

The SNP site sequence is composed of four kinds of nucleotides, and its numerical characterization is the process of transforming a long sequence of nucleotides into a feature vector represented by numbers[19]. Starting from the first base of the sequence, a sliding window with a length of $k$ and a step length of 1 divides the sequence into fragments of $k$ bases, that is, $k$-mers, where $k$ is the number of consecutive letters in each fragment of the nucleotide sequence and mer refers to a single base. Then we calculate the frequency of these fragments in the SNP site sequence, and finally construct the feature vector of the sequence from these frequency values. Assuming that an SNP site sequence $p$ contains $m$ nucleotides, it is expressed as

$$p=R_1,R_2,\dots,R_m, \tag{17}$$

where $R_1$ represents the first nucleotide, $R_2$ represents the second nucleotide, and so on. Then we use the $k$-mer method to process the SNP site sequence $p$ and transform it into a feature vector $p'$ with a fixed length as

(18)
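The k-mer procedure described above can be sketched as follows. This is an illustration under stated assumptions: the value of k, the lexicographic ordering of the $4^k$ possible k-mers, and the normalization of counts to relative frequencies are not specified in the text, and the function name is hypothetical.

```python
from itertools import product

def kmer_feature_vector(sequence, k=3, alphabet="ACGT"):
    """Slide a window of length k (step 1) and return k-mer frequencies.

    The output has a fixed length of len(alphabet) ** k, one entry per
    possible k-mer in lexicographic order.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(max(len(sequence) - k + 1, 0)):
        window = sequence[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]
```

Because the vector length depends only on k, sequences of different lengths are mapped to feature vectors of the same dimension.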

2.2.2 Drug molecular fingerprint representation

A molecular fingerprint is a method of describing the structure of a compound[21-22]. It constructs the fingerprint characteristics of the molecule by detecting the existence of specific substructures in the molecular structure and expressing the molecular structure as a binary vector. Molecular fingerprints have the advantages of strong characteristic expression and good output stability. At present, there are many methods, such as MACCS fingerprints (166), PubChem fingerprints (801), and FP4 fingerprints (79).

In this study, MACCS fingerprints are used to construct the feature vector of drug molecules, which is the most commonly used method based on substructure keys. It has two key sets: 166 bits and 960 bits[23]. Considering the dimension of the feature vector, we use the 166-bit key set. A given drug molecule consists of several substructures. First, these substructures are expressed as SMARTS pattern strings, and then each SMARTS pattern is compared with the 166-bit key set. If the SMARTS pattern exists in the key set, the corresponding bit in the fingerprint is set to 1; otherwise, it is set to 0. In this way, the 166-bit fingerprint feature of the drug molecule is constituted. Fig.4 is a fingerprint representation of a drug molecule.

2.3 DVAE-SVM prediction model

This paper presents a prediction model of SNP site-drug association based on DVAE-SVM. The model uses the DVAE to extract features of the SNP site, and uses the SVM to classify the association between SNP site and drug.

The overall framework of the model is shown in Fig.5.

Fig.5 DVAE-SVM model structure diagram

For the association prediction task, we select the LIBSVM toolbox to build the SVM classification model. The fusion feature $(x',y)$ of the SNP site and the drug is input to the SVM classifier for association prediction, and the prediction result is obtained.
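One simple way to form the fusion feature is to concatenate the two vectors. The sketch below assumes concatenation (the exact fusion operation is not specified above) and uses a hypothetical function name; the SNP feature vector stands for the DVAE output and the fingerprint for the 166-bit MACCS vector.

```python
def fuse_features(snp_features, drug_fingerprint):
    """Concatenate SNP features and binary fingerprint bits into one vector."""
    if any(bit not in (0, 1) for bit in drug_fingerprint):
        raise ValueError("fingerprint must contain only 0 and 1")
    return [float(v) for v in snp_features] + [float(b) for b in drug_fingerprint]
```

The fused vector of one SNP site-drug pair then serves as a single input sample of the classifier, with the association value as its label.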

3 Experimental results and discussion

In this study, we use Keras, a deep learning framework based on TensorFlow, to build and train the DVAE. The environment is an AMD Radeon HD 8500M GPU, the batch size is 32, the optimizer is Adam, and the learning rate is 0.005. The activation function of the output layer is set to sigmoid, and the activation function of the other layers is set to Leaky ReLU.

3.1 Dataset

The datasets used come from the PharmGKB and NCBI databases. For each SNP site and drug pair, if the SNP site is associated with the drug, the relationship value is 1; otherwise, the relationship value is 0. All SNP site and drug pairs with a relationship value of 1 are positive samples, and the rest are negative samples. In this way, the model training set is constructed.

Table 1 lists the detailed information of the used datasets.

Table 1 Datasets information

3.2 Evaluation method

In this study, we use five-fold cross-validation for evaluation, mainly examining four evaluation indicators: accuracy ($A$), the proportion of all samples that are predicted correctly; precision ($P$), the proportion of correctly predicted positive samples among all samples predicted to be positive; sensitivity ($S$), the proportion of correctly predicted positive samples among all positive samples; and the Matthews correlation coefficient ($M$), an index of classification performance when the positive and negative samples are unbalanced, which takes values in [-1, 1]. The larger the values of $A$, $P$, $S$ and $M$, the better the performance of the classifier.

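$$A=\frac{N_{TP}+N_{TN}}{N_{TP}+N_{TN}+N_{FP}+N_{FN}}, \tag{19}$$

$$P=\frac{N_{TP}}{N_{TP}+N_{FP}}, \tag{20}$$

$$S=\frac{N_{TP}}{N_{TP}+N_{FN}}, \tag{21}$$

$$M=\frac{N_{TP}N_{TN}-N_{FP}N_{FN}}{\sqrt{(N_{TP}+N_{FP})(N_{TP}+N_{FN})(N_{TN}+N_{FP})(N_{TN}+N_{FN})}}, \tag{22}$$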

where $N_{TP}$ is the number of positive samples predicted to be positive; $N_{TN}$ is the number of negative samples predicted to be negative; $N_{FP}$ is the number of negative samples predicted to be positive; and $N_{FN}$ is the number of positive samples predicted to be negative.
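The four indicators follow directly from these counts. The sketch below uses the standard definitions of accuracy, precision, sensitivity and the Matthews correlation coefficient; the function name and the convention of returning 0 when a denominator is 0 are assumptions made for illustration.

```python
import math

def classification_metrics(n_tp, n_tn, n_fp, n_fn):
    """Return (A, P, S, M) computed from the four confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0  # assumed convention for empty classes

    accuracy = ratio(n_tp + n_tn, n_tp + n_tn + n_fp + n_fn)
    precision = ratio(n_tp, n_tp + n_fp)
    sensitivity = ratio(n_tp, n_tp + n_fn)
    denom = math.sqrt((n_tp + n_fp) * (n_tp + n_fn)
                      * (n_tn + n_fp) * (n_tn + n_fn))
    mcc = ratio(n_tp * n_tn - n_fp * n_fn, denom)
    return accuracy, precision, sensitivity, mcc
```

In five-fold cross-validation, the function would be applied to the counts of each test fold and the five results averaged.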

3.3 Results and discussion

3.3.1 DVAE-SVM model experiment

The fusion features of SNP sites and drugs are input into the SVM classifier for five-fold cross-validation. In the experiment, the grid search method is used to optimize the parameters $c$ and $g$ of the SVM. With $c=0.5$, $g=0.5$ and the radial basis function (RBF) kernel, the experimental results of the proposed method are shown in Table 2.
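The combination of five-fold cross-validation and grid search can be sketched as below. This is a generic illustration, not the LIBSVM procedure itself: `evaluate` is a hypothetical callback that trains an SVM with the given (c, g) on the training folds and returns the mean validation accuracy, and the shuffling seed and candidate grids are assumptions.

```python
import random
from itertools import product

def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each of the n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def grid_search(evaluate, c_values, g_values):
    """Return the (c, g) pair with the highest score from evaluate(c, g)."""
    return max(product(c_values, g_values), key=lambda cg: evaluate(*cg))
```

Each sample appears in exactly one test fold, so every SNP site-drug pair is used for testing once and for training four times.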

Table 2 Prediction results of proposed method

It can be seen from Table 2 that when using our proposed method to predict the association between SNP site and drug, the average values of $A$, $P$, $S$ and $M$ are 89.79%, 91.09%, 88.83% and 77.50%, respectively. The high accuracy indicates that the SVM classifier is reasonable and effective in predicting the association between SNP site and drug. In addition, the low standard deviation of these averages indicates that the proposed method is stable and robust.

3.3.2 Classifier comparison experiment

In order to further evaluate the proposed method, we compare random forest (RF) and logistic regression (LR) with the SVM algorithm. The experimental results are shown in Table 3.

Table 3 Experimental results of three classifiers

It can be seen from Table 3 that the RF-based method obtains good results when predicting the association between SNP site and drug: the average values of $A$, $P$, $S$ and $M$ are 86.98%, 85.00%, 87.56% and 74.64%, respectively. The LR-based method also obtains good results, with average values of $A$, $P$, $S$ and $M$ of 85.70%, 84.59%, 82.21% and 71.60%, respectively. Although both classifiers achieve high accuracy, it is lower than that of the SVM algorithm used in this study.

Fig.6 shows the ROC curves of the three classifiers. The average areas under the ROC curves of SVM, RF, and LR are 97.16%, 95.41% and 94.23%, respectively, with SVM being the highest. Overall, the SVM model performs better than the RF and LR classifiers.

Fig.6 ROC curves of DVAE method in three classifiers

3.3.3 Feature extraction algorithm comparison experiment

In order to further analyze the influence of the DVAE feature extraction method on the prediction model, we compare the prediction results with those of PCA, DAE and VAE. The results are shown in Table 4.

Table 4 Experimental results of four feature extraction algorithms

Table 4 compares the five-fold cross-validation results on the SVM classifier for the features extracted by PCA, DAE and VAE with those of our proposed method. It can be seen from the table that, among the three comparison methods, VAE achieves the highest accuracy, 85.55%, which is still 4.24 percentage points lower than that of our method. The areas under the ROC curves in Fig.7 show that the DVAE is superior to PCA, DAE and VAE.

4 Conclusions

The association between SNP site and drug is largely affected by the sequence of SNP sites and the chemical structure information of drugs. In this study, a new calculation method is proposed to predict the association between potential SNP site and drug by integrating SNP site sequences and drug molecular structures. In order to extract more representative features, we construct a DVAE model to optimize the feature vector of SNP sites, combine the fingerprint features of drug molecules to form a fusion feature vector, and finally put the fusion feature vector into SVM classifier for association prediction. The experimental results show that this method has good performance in the association prediction of SNP site and drug molecule.